Bulletin of Electrical Engineering and Informatics

Received Jun 21, 2022; Revised Jul 26, 2022; Accepted Aug 25, 2022

One of the biggest challenges in implementing SER is to produce a model that performs well and is lightweight. One way is to use a one-dimensional convolutional neural network (1D CNN) and combine several handcrafted features. 1D CNNs are mostly used for time series data, where the order of information plays an important role. In this case, the order of stacked features also plays an important role, so this work analyzes the impact of changing that order. This work proposes to brute force all possible combinations of feature orders from five features: Mel-frequency cepstral coefficient (MFCC), Mel-spectrogram, chromagram, spectral contrast, and tonnetz, then uses a 1D CNN as the model architecture and benchmarks the model's performance on the Ryerson audio-visual database of emotional speech and song (RAVDESS) dataset. The results show that changing the order of features can impact overall classification accuracy, specific-emotion accuracy, and model size. The best model has an accuracy of 79.17% for classifying 8 emotion classes with the following order: spectral contrast, tonnetz, chromagram, Mel-spectrogram, and MFCC. Finding a suitable order can increase accuracy by up to 16.05% and reduce the model size by up to 96%.


INTRODUCTION
Speech emotion recognition (SER) is a field of science that studies how to recognize emotions from speech input. This field is interesting because emotion is subjective. Not everyone can correctly identify, from an utterance alone, the type of emotion the speaker expresses. Research shows that the accuracy of classification performed by humans is 65.8% [1]. By identifying the proper emotion, a system's response and the overall interaction experience can be improved. Computers can already receive and reply to voice commands once the required devices are installed. SER could enable computers to detect emotions and improve the experience of human-computer interaction [2]. Several SER applications have been developed in human-computer interaction, such as robots [3], online learning [4], and psychological consultation [5]. Despite its many applications, SER remains a challenging task because there is no definitive way to extract and categorize emotions from speech. From speech data, several features can be retrieved, such as prosodic features, spectral features, audio quality, and the Teager energy operator (TEO). The choice of features used in SER affects the quality of a model [6]-[9].
Several methods have been tested for SER, including classical classification methods such as the hidden Markov model (HMM) [10], support vector machine (SVM) [11], and Gaussian mixture model (GMM) [12], as well as deep learning methods such as long short-term memory (LSTM) [13] and the convolutional neural network (CNN) [6]. The deep learning approach detects high-level salient features and achieves better accuracy than classical classification methods. Tests carried out previously [8], [14], [15] show that CNN architectures perform better than other methods. However, the use of deep CNNs increases the computational complexity of the whole model.
Kwon [15] proposed modifying the stride in the 2D convolution layers to mimic pooling and then removing the pooling layers from the model. A spectrogram extracted from raw speech is used as input. This results in 79.5% accuracy on the Ryerson audio-visual database of emotional speech and song (RAVDESS) dataset with a model size of 34.5 MB. On the other hand, Issa et al. [6] proposed combining a 1D CNN with low-level handcrafted features to reduce the complexity of the model and remove unnecessary information from raw speech. They used Mel-frequency cepstral coefficients (MFCCs), Mel-spectrogram, chromagram, spectral contrast, and Tonnetz as features, taking the mean value along the time axis and then stacking them on each other as input. The result shows 71.61% accuracy. This accuracy can be improved by finding a suitable feature order. The issue is that the kernel in a 1D CNN slides along one dimension, so the order of information is important, which is why 1D CNNs are mostly used for time series data [16]. When multiple features are used, the order in which they are stacked plays an important role in the model's performance. The challenges are therefore to find the best stacking order and to investigate the impact of different feature orders on a 1D CNN.
Given these issues and challenges, the performance of a 1D CNN for SER can be improved by finding the best feature stacking order. This work proposes a brute-force search over all combinations of the five-feature stack (MFCC, chromagram, Mel-spectrogram, spectral contrast, and Tonnetz) to find the best order and the impact of multi-feature order. The experiments were benchmarked on the standard RAVDESS dataset [17]. Brute force was selected because it tests every possible feature order, which makes the impact of feature order easier to assess.
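The brute-force search space is small: five features yield 5! = 120 stacking orders. A minimal sketch of enumerating them (the feature names here are just labels for the five features listed above):

```python
from itertools import permutations

# The five handcrafted features whose stacking order is searched.
FEATURES = ["mfcc", "chromagram", "mel_spectrogram", "spectral_contrast", "tonnetz"]

# Enumerate every possible stacking order: 5! = 120 candidates.
orders = list(permutations(FEATURES))
print(len(orders))  # 120

# Each order is then used to build one stacked input layout and train one model.
```

Each of the 120 orders produces one trained model, which is why Figure 3 later reports a distribution over 120 models.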

METHOD
This study aims to find the impact of multi-feature order by brute-forcing all possible combinations. The method consists of four stages: dataset preparation, including data augmentation to increase the number of samples; feature extraction, including parameter configuration for each feature; feature stacking; and model development based on the model used in previous work [6]. This work uses Librosa as a toolkit since its feature set performs better than that of other tools such as GeMAPS and pyAudioAnalysis [18]. Google Colab is used as the platform to run the experiments.

Dataset preparation
The dataset used in this study is RAVDESS [17], containing audio and visual recordings of 12 men and 12 women saying English sentences with eight different emotional expressions: sad, happy, angry, calm, fearful, surprised, neutral, and disgusted, for a total of 1440 recordings. In this work, the dataset is split into three partitions: 70% for training, 15% for validation, and 15% for testing. In the RAVDESS dataset, the sample distribution across emotions is fairly even, but for the neutral emotion the total duration is only about 50% of that of the other emotions (see Figure 1). To overcome this imbalance, the dataset can be balanced either by augmenting the neutral-emotion data or by reducing the amount of data for the other emotions. In this study, data augmentation was applied [19] to multiply the data used to create the model, because a CNN model requires a lot of data to be stable. For the neutral emotion, random noise was added as augmentation until the amount of data was balanced. Furthermore, to increase the number of samples, augmentation was performed once more by combining two processes: stretching the data at a 0.8 rate and increasing the pitch by a factor of 0.7.

Feature extraction
The features used in this paper are MFCC, chromagram, Mel-spectrogram, spectral contrast, and Tonnetz. MFCC and Mel-scaled spectrograms are widely used in SER [20], [21]. These features mimic, to a certain degree, the human perception of sound frequency patterns. The MFCC creates a Mel-frequency spectrum, which can be defined as a representation of the short-term sound power spectrum: Fourier transforms and energy spectra are computed and mapped onto the Mel-frequency scale. Although both the Mel-spectrogram and MFCCs are decent at identifying and tracking timbre fluctuations in a sound file, they tend to be poor at distinguishably representing pitch classes and harmony [6]. To handle this, the chromagram and Tonnetz are added; both represent harmony and pitch classes [22], [23]. Spectral contrast represents the energy contrast computed by comparing the peak energy and valley energy in each band converted from spectrogram frames [24]. In this study, the MFCC uses a filter-bank size of 40. The Mel-spectrogram uses a Mel size of 128 and a hop length of 512 with a Hann window. Spectral contrast uses a band size of 6 and a hop length of 512 with a Hann window. The chromagram uses a chroma size of 12 and a hop length of 512 with a Hann window. The Tonnetz feature takes as input the previously computed chromagram feature.
Figure 1. Time distribution of each emotion

Feature stacking
After feature extraction, the next step is to concatenate all features into one array. Since every feature has a different size, all features are compressed into one-dimensional arrays by taking the mean value along the time axis and then stacked on each other. An illustration of the stacking process can be seen in Figures 2(a) to (c). This process is repeated for all feature combinations.
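The compression-and-stacking step is simple array arithmetic; a sketch with NumPy, using random matrices in place of real feature matrices:

```python
import numpy as np

def stack_features(feature_mats):
    """Average each (n_bins, T) matrix over the time axis, then concatenate in order."""
    return np.concatenate([m.mean(axis=1) for m in feature_mats])

# Dummy matrices with the row counts used in this work (T = 100 frames):
T = 100
mfcc     = np.random.rand(40, T)
mel      = np.random.rand(128, T)
chroma   = np.random.rand(12, T)
contrast = np.random.rand(7, T)   # 6 bands + residual row
tonnetz  = np.random.rand(6, T)

# Best order found in this work: contrast, tonnetz, chroma, mel, MFCC.
x = stack_features([contrast, tonnetz, chroma, mel, mfcc])
print(x.shape)  # (193,)
```

Changing the list order changes which averaged values sit next to each other in the 193-element vector, which is exactly what the 1D CNN kernel is sensitive to.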

Model development
The model in this study is built using a 1D CNN architecture based on Issa et al. [6] with some adjustments. The first layer receives a 193-element array as input. The initial convolutional layer is composed of 256 filters with a kernel size of 8 and a stride of 1; its output is activated by a rectified linear unit (ReLU) layer. The next convolutional layer has the same parameters but is followed by batch normalization before ReLU activation, after which dropout with a rate of 0.25 is applied. Next come three convolutional layers with 128 filters of size 8, each followed by a ReLU activation layer. The next layer is a convolutional layer with 128 filters of size 8, followed by batch normalization, ReLU activation, and dropout with a rate of 0.25. A max-pooling layer with a pool size of 8 is then applied. Next come two convolutional layers with 64 filters of size 8, each activated with ReLU, followed by a flattening layer. The output of the flattening layer is fed to a fully connected layer with eight units, representing the eight emotion classes, activated with the softmax activation function. The model uses the RMSProp optimizer with a learning rate of 0.00001 and a decay rate of 1e-6.
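As a sanity check on the architecture, the feature-map length can be traced through the layers. The sketch below assumes unpadded ("valid") convolutions, which is an assumption on our part; the baseline may use "same" padding, in which case the length stays 193 until the pooling layer:

```python
def conv_len(n, kernel=8, stride=1):
    """Output length of an unpadded ('valid') 1D convolution."""
    return (n - kernel) // stride + 1

n = 193             # stacked 193-element feature vector
for _ in range(6):  # six conv layers before pooling (2 x 256 filters, 4 x 128 filters)
    n = conv_len(n)  # each valid conv with kernel 8 removes 7 positions
n //= 8             # max pooling with pool size 8
for _ in range(2):  # two conv layers with 64 filters
    n = conv_len(n)
flat = n * 64       # flatten before the 8-unit softmax layer
print(n, flat)      # 4 256
```

Under this assumption the dense layer sees 256 inputs; with "same" padding it would instead see 24 x 64 = 1536. Either way, the pooling step is the dominant reduction.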
The stacked features are then fed into the model for training. To make sure every model gets the same treatment, the same training, validation, and test data are used for every model. For the training process, a batch size of 64 with 300 epochs is used, and the ReduceLROnPlateau callback [25] monitors the loss and adjusts the learning rate by a factor of 0.8 with a patience of 15 and a minimum learning rate of 0.000001, ensuring the learning rate does not go below 0.000001. Previous research [26] showed that a minimum learning rate of 0.000001 on CNN architectures yields a model with the lowest loss.
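The plateau schedule described above can be expressed as a small pure-Python sketch of its logic (the actual training uses the Keras ReduceLROnPlateau callback; this re-implementation is only illustrative and simplifies details such as the improvement threshold):

```python
class PlateauScheduler:
    """Minimal sketch of ReduceLROnPlateau with the settings used in this work."""
    def __init__(self, lr=1e-5, factor=0.8, patience=15, min_lr=1e-6):
        self.lr, self.factor, self.patience, self.min_lr = lr, factor, patience, min_lr
        self.best = float("inf")
        self.wait = 0

    def step(self, loss):
        """Call once per epoch with the monitored loss; returns the current lr."""
        if loss < self.best:          # improvement: reset the patience counter
            self.best = loss
            self.wait = 0
        else:
            self.wait += 1
            if self.wait >= self.patience:  # plateau: shrink lr, floored at min_lr
                self.lr = max(self.lr * self.factor, self.min_lr)
                self.wait = 0
        return self.lr

# A loss that never improves drives the lr down to the floor of 1e-6.
sched = PlateauScheduler()
lrs = [sched.step(1.0) for _ in range(300)]
```

With 300 epochs of flat loss, the learning rate decays by factors of 0.8 until it is clamped at the 0.000001 floor, mirroring the configuration in the training setup above.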

RESULT
The accuracy distribution of all models is shown in Figure 3. The order of features has a large impact on model performance: from the data collected, the difference between the highest and lowest accuracy is 16.05%. From Table 1, the highest accuracy is obtained with the following feature order: spectral contrast, Tonnetz, chromagram, Mel-spectrogram, and MFCC. All of the top five models place the Mel-spectrogram and MFCC side by side, and spectral contrast and Tonnetz are also often placed side by side. On the other hand, from Table 2, the bottom five models place MFCC and the Mel-spectrogram far apart. The position of spectral contrast has no large impact on classification performance.
Feature order affects not only overall accuracy but also accuracy on specific emotions. Table 3 shows that placing specific features close to each other can determine how well specific emotions are recognized. The angry and calm emotions differ in the order of the chromagram and Mel-spectrogram: putting the chromagram next to Tonnetz instead of the Mel-spectrogram gives better performance in recognizing angry, while putting the Mel-spectrogram after Tonnetz gives better accuracy in detecting calm. The fearful and happy emotions differ in the order of spectral contrast and the chromagram: placing spectral contrast after MFCC yields better performance in recognizing happy, while placing the chromagram next to MFCC yields better performance in detecting fearful. For happy and sad, the difference is in the order of MFCC and spectral contrast, and putting spectral contrast after the Mel-spectrogram gives better accuracy in detecting sad. Beyond the model's performance, the feature order also affects the model size. After training, model sizes range from 10 MB to 293 MB, while the best model, with the order spectral contrast, Tonnetz, chromagram, Mel-spectrogram, and MFCC, has a size of 29.6 MB.

DISCUSSION
In this work, an optimized feature order was found by brute-forcing all possible combinations of feature stacking order. A 1D CNN is more compact than the 2D CNNs usually used for SER. The main issue with a 1D CNN is that the feature order affects model performance. Finding the best stacking order can improve 1D CNN performance to match a 2D CNN while reducing the complexity of the model. Compared to previous work [6], this work focuses on finding the best order, which the previous work did not. This focus on finding the best stacking order yields a large improvement in model performance.
The highest accuracy is 79.17%, obtained from the feature stacking order of spectral contrast, Tonnetz, chromagram, Mel-spectrogram, and MFCC. Table 4 compares the result presented in this paper with that of the previous work [6], whose best accuracy was 71.61% with the following order: MFCC, chromagram, Mel-spectrogram, contrast, and Tonnetz. The difference in best accuracy arises because the previous work did not properly search for the best order, whereas this work experimentally tries all possible combinations of feature order. Besides its impact on accuracy, feature order also has an impact on model size. In this work, a smaller model than that of the previous work [15] was achieved: our best model has an accuracy of 79.17% and a size of 29.6 MB, while the previous work reports 79.5% accuracy on the RAVDESS dataset with a model size of 34.5 MB. The model obtained in this study thus achieves slightly lower accuracy but a smaller size, because it uses a 1D CNN, which is simpler than a 2D CNN.

CONCLUSION
Multi-feature usage in SER is a challenging task because many features can be extracted from the speech signal, and each feature has similarities with and differences from the others in the information it extracts. Changing the order of features can impact classification performance, especially with a 1D CNN architecture. This work set out to find the impact of multi-feature order on SER performance with a 1D CNN. From the results, it can be concluded that the order of features affects not only the overall accuracy of the model but also the performance in recognizing specific emotions and the model size. This work finds a better feature order for the RAVDESS dataset, reaching 79.17% accuracy and producing a smaller model of 29.6 MB. Future work could use additional features, such as Teager energy, to increase the number of features.

Figure 2. Stacking process: (a) extracted features in two-dimensional arrays, (b) compressed into one-dimensional arrays by taking the mean along the time axis, and (c) concatenated into one array

Figure 3. Test accuracy distribution of the 120 models on the RAVDESS dataset

Table 1. The feature orders of the top five accuracies

Table 2. The feature orders of the bottom five accuracies

Table 3. Best feature order for specific emotions

Table 4. Comparison with previous work