Speech emotion recognition with optimized multi-feature stack using deep convolutional neural networks
Muhammad Farhan Fadhil, Amalia Zahra
Abstract
Human emotion plays a significant role in communication and can influence how the content of a message is perceived by others. Speech emotion recognition (SER) is an intriguing field of study because human-computer interaction (HCI) technologies in use today, such as virtual assistants, rarely consider the emotion contained in human speech. One of the most widely used approaches to SER is to extract speech features such as mel-frequency cepstral coefficients (MFCC), mel-spectrogram, spectral contrast, tonnetz, and chromagram from the signal and to use a one-dimensional (1D) convolutional neural network (CNN) as the classifier. This study shows the impact of combining an optimized multi-feature stack with an optimized 1D deep CNN model. The proposed model achieves an accuracy of 90.10% in classifying 8 different emotions on the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS) dataset.
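As a rough illustration of the multi-feature stack described in the abstract, the sketch below uses librosa to extract the five named features and concatenate their time-averaged values into a single 1D vector that could feed a 1D CNN. The function name, feature sizes, and time-averaging step are illustrative assumptions, not the paper's exact configuration.

import numpy as np
import librosa

def extract_feature_stack(path, n_mfcc=40):
    # Load the audio at its native sampling rate.
    y, sr = librosa.load(path, sr=None)
    # Each feature is averaged over the time axis to obtain a
    # fixed-length vector regardless of utterance duration
    # (a common convention in SER pipelines; assumed here).
    mfcc = np.mean(librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc), axis=1)
    mel = np.mean(librosa.feature.melspectrogram(y=y, sr=sr), axis=1)
    contrast = np.mean(librosa.feature.spectral_contrast(y=y, sr=sr), axis=1)
    # Tonnetz is computed on the harmonic component of the signal.
    tonnetz = np.mean(
        librosa.feature.tonnetz(y=librosa.effects.harmonic(y), sr=sr), axis=1
    )
    chroma = np.mean(librosa.feature.chroma_stft(y=y, sr=sr), axis=1)
    # Stack the five features into one 1D vector for a 1D CNN input.
    return np.concatenate([mfcc, mel, contrast, tonnetz, chroma])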
Keywords
Convolutional neural network; Mel frequency cepstral coefficient; Multi-feature stack; Optimized deep convolutional neural network; Speech emotion recognition
DOI:
https://doi.org/10.11591/eei.v13i6.6044
This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.
Bulletin of Electrical Engineering and Informatics (BEEI), ISSN: 2089-3191, e-ISSN: 2302-9285. This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).