DCT based feature extraction and support vector machine classification for musical instruments tone recognition

Received Aug 9, 2020; Revised Mar 2, 2021; Accepted Aug 3, 2021
This research proposes a combination of feature extraction and classification methods for use in a tone recognition system for musical instruments. By implementing this combination, the tone recognition system is expected to require fewer feature extraction coefficients than the systems investigated previously. The proposed combination comprises feature extraction using the discrete cosine transform (DCT) and classification using a support vector machine (SVM). Bellyra, clarinet, and pianica tones were used in the experiment, representing tones with one, several, and many major local peaks in the transform domain, respectively. Based on the test results, the proposed combination is efficient enough to be used in a tone recognition system for musical instruments: the system needs as few as eight feature extraction coefficients to recognize a tone.


INTRODUCTION
A tone recognition system for musical instruments has two main parts. The first part, called the feature extraction part, extracts the characteristics of the tone. The second part, called the classification part, classifies the results of the feature extraction part. A transform domain approach can be used to implement the feature extraction part. The discrete cosine transform (DCT) and the discrete Fourier transform (DFT) are two transformation methods for converting tone signals from the time domain to the transform domain. There are two different approaches to feature extraction in the transform domain. The first is feature extraction that is based on fundamental signals [1]-[6]. The second is feature extraction that is not based on fundamental signals [7]-[12].
One way to implement the classification part of the tone recognition system is to use a statistical approach. The support vector machine (SVM) is one example of a classifier that uses a statistical approach. SVM is a classification method that originates from statistical learning theory [13], [14]. Initially, SVM was only used for the classification of two classes. In subsequent developments, SVM was extended to multiclass classification [15]-[21].
Previous tone recognition research has primarily focused on tones with multiple major (significant) local peaks in the transform domain [9]-[11]. Recognition of tones having one, several, or many major local peaks in the transform domain has received very little attention. Previous research [12] proposed combining DFT-based segment averaging for feature extraction with template matching for classification in a tone recognition system. That system could recognize tones with one, several, or many major local peaks in the transform domain. However, to identify a tone, that system needs at least 16 feature extraction coefficients. Thus, there is still an opportunity to reduce the number of feature extraction coefficients below 16. The advantage of using fewer feature extraction coefficients is that there is less data to process.
This research combines feature extraction and classification methods for musical instrument tone recognition. More specifically, it presents DCT-based feature extraction and SVM classification for musical instrument tone recognition. First, note that the feature extraction method does not use fundamental signals in the transform domain. Second, the tones used in this research were bellyra, clarinet, and pianica tones, representing tones with one, several, and many major local peaks in the DCT domain, respectively.
The DCT-domain representations of the bellyra, clarinet, and pianica tones are shown in Figure 1.

RESEARCH METHOD

2.1. Materials preparation
A tone signal is used as the input of the tone recognition system. The tone signal is an isolated signal stored in the waveform audio file format (WAV). The tone signals were acquired from three musical instruments: bellyra, clarinet, and pianica. For each musical instrument, we recorded eight tones, namely C, D, E, F, G, A, B, and C'. We recorded the tone signals at a sampling rate of 5000 Hz. This sampling rate satisfies the Shannon sampling theorem [22]:
$$f_s \geq 2 f_{\max} \quad (1)$$
with $f_{\max}$ being the highest frequency component of the tone signals and $f_s$ being the sampling rate. The highest frequency components of the tone C' for bellyra, clarinet, and pianica, according to our visual observations using Octave software, were 2097 Hz, 1406 Hz, and 1584 Hz, respectively. The largest of these gives 2 × 2097 Hz = 4194 Hz, which is below 5000 Hz, so (1) holds for all three instruments.
Based on our visual observations, recording a tone for 2 seconds was adequate to acquire the steady-state part of the tone signal. Note that a recorded tone signal can be divided into three parts, namely the silence, transition, and steady-state parts. Accurate tone information is present only in the steady-state part.
Three musical instruments were used in this research to acquire the above-mentioned tone signals. They were an Isuzu ZBL-27 bellyra, a Yamaha YCL-255 clarinet, and a Yamaha P-37D pianica, as shown in Figure 2. The tone signals were captured using an AKG Perception 120 USB microphone.
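The sampling-rate check can be illustrated with a short Python sketch. The frequencies and the sampling rate are the values reported above; the function and variable names are hypothetical and are not taken from the authors' Octave code.

```python
# Highest frequency components (Hz) of tone C' as reported in the text.
F_MAX_HZ = {"bellyra": 2097, "clarinet": 1406, "pianica": 1584}
SAMPLING_RATE_HZ = 5000


def satisfies_sampling_theorem(fs_hz, f_max_hz):
    """Return True when the sampling rate is at least twice f_max."""
    return fs_hz >= 2 * f_max_hz


def nyquist_margins(fs_hz, f_max_by_instrument):
    """Return fs - 2 * f_max per instrument (positive means headroom)."""
    return {name: fs_hz - 2 * f for name, f in f_max_by_instrument.items()}


margins = nyquist_margins(SAMPLING_RATE_HZ, F_MAX_HZ)
all_ok = all(
    satisfies_sampling_theorem(SAMPLING_RATE_HZ, f) for f in F_MAX_HZ.values()
)
```

The bellyra has the smallest margin, so it is the instrument that determines the lowest usable sampling rate.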

System design
The entire tone recognition system in this research is shown in Figure 3 as a block diagram. The system's input is a WAV-formatted tone signal. The system's output is a text that denotes the recognized tone. We used Octave software to develop the system. The blocks in Figure 3 are explained in the following subsections.

Initial cutting
Initial cutting deletes the silence and transition parts of a tone signal. These parts need to be deleted because they contain no accurate tone information. Accurate tone information can only be obtained from the steady-state part of the tone signal. Based on our visual observations, the silence part could be deleted by using an amplitude threshold of 0.5 times the highest absolute value of the tone signal. Beginning from the leftmost sample of the tone signal, samples were deleted as long as their absolute amplitude was less than 0.5 times the highest value of the tone signal. Following the deletion of the silence part, the transition part was deleted. Also based on our observations, the transition part was deleted by removing 300 milliseconds from the left side of the remaining tone signal. After the silence and transition parts were deleted, the steady-state part was obtained.
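The initial cutting described above can be sketched in Python as follows. This is a minimal sketch, assuming that the threshold is applied to the absolute sample value relative to the highest absolute value of the signal; the function name and parameters are hypothetical.

```python
def initial_cut(signal, fs_hz, threshold_ratio=0.5, transition_s=0.3):
    """Drop the leading silence and transition parts of a tone signal.

    Assumption: a sample counts as silence while its absolute value is
    below threshold_ratio times the highest absolute value in the signal.
    """
    if not signal:
        return []
    peak = max(abs(s) for s in signal)
    threshold = threshold_ratio * peak
    # Silence: skip samples from the left until one reaches the threshold.
    start = next(i for i, s in enumerate(signal) if abs(s) >= threshold)
    # Transition: skip a further 300 ms worth of samples.
    start += int(round(transition_s * fs_hz))
    return signal[start:]
```

At the 5000 Hz sampling rate used in this research, the 300 ms transition corresponds to 1500 samples.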

Frame blocking
Frame blocking cuts a signal frame from a lengthy signal [23]. The objective of frame blocking is to decrease the amount of signal data, which reduces the computation time of the tone recognition system. Frame lengths of $2^n$, where $n$ is a positive integer, were evaluated in this research to find the shortest frame length that gave the highest recognition rate.

Normalization
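Normalization sets the highest absolute value of a data signal to 1, so that the peak sample becomes 1 or -1. The objective of normalization is to remove the differences in overall amplitude between data signals. Normalization is implemented using (2).
$$\mathbf{x}_{out} = \frac{\mathbf{x}_{in}}{\max\left(\left|\mathbf{x}_{in}\right|\right)} \quad (2)$$
where $\mathbf{x}_{in}$ and $\mathbf{x}_{out}$ are the input and output data signal vectors, respectively.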
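A minimal Python sketch of this normalization is given below, assuming that the frame is divided by its highest absolute value so that the peak sample becomes 1 or -1. The function name is hypothetical, and the handling of an all-zero frame is an added safeguard, not something the text specifies.

```python
def normalize(frame):
    """Scale a frame so that its largest absolute value becomes 1."""
    peak = max(abs(v) for v in frame)
    if peak == 0:
        # Safeguard (assumption): leave an all-zero frame unchanged.
        return list(frame)
    return [v / peak for v in frame]
```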

Windowing
Windowing smooths the discontinuities at the edges of a data signal [23]. These discontinuities arise because the data signal was cut in the preceding frame blocking. Discontinuities give rise to extra signals, known as harmonic signals, that are visible in the transformed data signal. The visibility of these harmonic signals can be reduced by smoothing the discontinuities. The Hamming window, which is extensively used in signal processing [24], was used in this research. The window length was $2^n$, where $n$ is a positive integer. This window length is the same as the frame blocking length mentioned above.
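As an illustration, the sketch below applies a Hamming window to a frame. It assumes the common symmetric definition w(k) = 0.54 - 0.46 cos(2πk/(N-1)); the text does not state which variant was used, and the function names are hypothetical.

```python
import math


def hamming(n):
    """Return a symmetric Hamming window of length n."""
    if n == 1:
        return [1.0]
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n - 1)) for k in range(n)]


def apply_window(frame):
    """Multiply a frame sample by sample with a Hamming window."""
    window = hamming(len(frame))
    return [x * w for x, w in zip(frame, window)]
```

The window tapers both edges of the frame toward 0.08 times their original value, which is what softens the discontinuities created by frame blocking.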

DCT
The DCT converts data signals from the time domain to the transform domain, known as the DCT domain. A DCT length of $2^n$ was used in this research, where $n$ is a positive integer. The DCT length is the same as the frame blocking length and the Hamming window length mentioned above. In addition, this research takes the absolute values of the DCT results because the next process, logarithmic scaling, cannot be applied to negative values.
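The following sketch computes the absolute DCT values of a frame. It assumes the orthonormal DCT-II, which is the most common DCT variant; the text does not specify the variant. The direct O(N²) summation is used for clarity only, and the function names are hypothetical.

```python
import math


def dct_ii(frame):
    """Orthonormal DCT-II of a frame, computed by direct summation."""
    n = len(frame)
    coefficients = []
    for k in range(n):
        acc = sum(
            x * math.cos(math.pi * (2 * i + 1) * k / (2 * n))
            for i, x in enumerate(frame)
        )
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        coefficients.append(scale * acc)
    return coefficients


def abs_dct(frame):
    """Absolute DCT values, so that a logarithm can be applied afterwards."""
    return [abs(c) for c in dct_ii(frame)]
```

For frame lengths such as 256 points a fast DCT routine would normally replace the direct summation.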

Logarithmic scaling
The gap in peak levels in a data signal can be reduced by logarithmic scaling. The logarithmic scaling results indicate an increase in the number of major local peaks. Previous research [11], [12] shows that feature extraction using segment averaging (which is used in this research) produces superior results for data signals with many major local peaks. The following is a mathematical expression of logarithmic scaling.
where the term added to the input in (3) prevents the logarithmic outcome from approaching negative infinity when the input data signal has a value close to zero.
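A sketch of logarithmic scaling is shown below. Because the expression itself is not reproduced here, the form log(x + 1) with a natural logarithm is purely an assumption chosen for illustration; both the offset value and the logarithm base may differ from what was actually used. The function name is hypothetical.

```python
import math


def log_scale(frame, offset=1.0):
    """Logarithmically scale non-negative values.

    Assumption: an offset of 1 and the natural logarithm; the offset keeps
    the result finite when an input value is close to zero.
    """
    return [math.log(v + offset) for v in frame]
```

With this assumed offset, an input of zero maps to zero instead of negative infinity.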

Frame warping
Frame warping reduces the length of a data frame. The result of frame warping is a denser distribution of data. Basically, frame warping is carried out by dividing the data frame into two parts and then combining the two parts. The algorithm of this frame warping is presented as:
Frame warping algorithm:
1. Consider an input data frame ...
2. ...
3. Merge the results of steps 1 and 2 to form an output data frame.
Note that this frame warping gives an output data frame that is half the length of the input data frame.

Segment averaging
One way to reduce the amount of data in a data frame is to use segment averaging. This research used the segment averaging from the previous research [11], [12]. Basically, segment averaging is carried out by dividing a data frame into a number of data segments and then averaging the values in each data segment. The algorithm of this segment averaging is presented as:
Segment averaging algorithm:
1. Suppose there is an input data frame ...
The length of segment L was evaluated in this research at 1, 2, 4, ...,
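The basic idea of segment averaging (divide the frame into equal segments and average each one) can be sketched as follows. This is a minimal sketch, assuming that the segment length L divides the frame length evenly; the function name is hypothetical and the details of the algorithm in [11], [12] may differ.

```python
def segment_average(frame, segment_length):
    """Average consecutive, non-overlapping segments of a frame."""
    if segment_length <= 0 or len(frame) % segment_length != 0:
        raise ValueError("segment_length must evenly divide the frame length")
    return [
        sum(frame[i:i + segment_length]) / segment_length
        for i in range(0, len(frame), segment_length)
    ]
```

Each segment yields one feature extraction coefficient, so a frame of N values produces N / L coefficients. For instance, 128 values with L = 16 give 8 coefficients.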

SVM classification
Classification determines the pattern class of a data frame. SVM is a method that can be used for this classification. In its basic form, SVM is a linear classifier. Training an SVM consists of searching for the best hyperplane. This hyperplane separates two data sets that come from two different pattern classes. Mathematically, this hyperplane is a linear discriminant function.
In practice, a hyperplane cannot always separate the two data sets from two different pattern classes. In other words, the two data sets may not be linearly separable. In that case, a data transformation needs to be carried out. A function called a kernel function [25] can be used to carry out this transformation. Linear and polynomial kernel functions are two examples of commonly used kernel functions. This research evaluated these two kernel functions.
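The two kernel functions can be sketched in Python as follows. The polynomial form (gamma · u·v + coef0)^degree is the common parameterization, also used by LibSVM; the parameter values shown are illustrative defaults chosen for this sketch and are not settings reported in this research. The function names are hypothetical.

```python
def dot(u, v):
    """Inner product of two equally long feature vectors."""
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    """Linear kernel: the plain inner product."""
    return dot(u, v)


def polynomial_kernel(u, v, degree=2, gamma=1.0, coef0=0.0):
    """Polynomial kernel (gamma * u.v + coef0) ** degree.

    Assumption: gamma, coef0, and degree here are illustrative values.
    """
    return (gamma * dot(u, v) + coef0) ** degree
```

With degree 1, gamma 1, and coef0 0, the polynomial kernel reduces to the linear kernel.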
Initially, SVM was developed for two-class pattern classification. Subsequently, SVM was extended to multiclass pattern classification. This research is a case of multiclass pattern classification. For this multiclass case, the one-vs-all (OVA) Tree Multiclass method [20] is used. OVA was selected because its performance is comparable to that of the other methods, particularly [16] and [21]. In this research, we used LibSVM with its default settings [26].
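To illustrate the general one-vs-all idea, the sketch below classifies a feature vector with one linear discriminant function per class and picks the class with the largest decision value. This is a simplified plain OVA scheme, not the OVA Tree Multiclass method of [20] and not LibSVM's internal procedure; the weights and biases in the usage example are hypothetical, not trained values.

```python
def decision_value(weights, bias, features):
    """Linear discriminant function f(x) = w . x + b."""
    return sum(w * x for w, x in zip(weights, features)) + bias


def predict_one_vs_all(models, features):
    """Return the label whose discriminant gives the largest value.

    models maps each class label to a (weights, bias) pair, one binary
    "this class versus the rest" classifier per label.
    """
    return max(
        models,
        key=lambda label: decision_value(*models[label], features),
    )


# Hypothetical two-dimensional models for three tone labels.
example_models = {
    "C": ([1.0, 0.0], 0.0),
    "D": ([0.0, 1.0], 0.0),
    "E": ([-1.0, -1.0], 0.5),
}
predicted = predict_one_vs_all(example_models, [2.0, 0.5])
```

In a trained system, each (weights, bias) pair would come from fitting a two-class SVM that separates one tone from all the others.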

Feature extraction and SVM training
The SVM shown in Figure 3 needs training data. This data is obtained using the feature extraction proposed in this research, shown in Figure 4. The input is a WAV-formatted tone signal. The output is the feature extraction of the input tone signal. Note that every block in the proposed feature extraction is a standard one. However, the complete series of blocks in Figure 4 is unique, and this series is the novelty of this research. For each musical instrument (bellyra, clarinet, or pianica), 10 samples were recorded for each tone signal (C, D, E, F, G, A, B, or C'), giving a total of 240 tone signals. In addition, the feature extraction of each of the tone signals was processed by the feature extraction shown in

Test tones and recognition rate
The test tones are the tone signals used to test the performance of the tone recognition system. For each tone signal (C, D, E, F, G, A, B, or C') of each musical instrument (bellyra, clarinet, or pianica), 20 samples were recorded as test tones. So, for each musical instrument there was a total of 160 test tone signals. The recognition rate measures the performance of the tone recognition system. The recognition rate is calculated as follows.
$$\text{Recognition rate} = \frac{\text{number of correctly recognized tones}}{\text{number of test tones}} \times 100\%$$
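A short Python sketch of the recognition rate calculation is given below, assuming the usual definition: the percentage of test tones whose recognized label matches the true label. The function name is hypothetical.

```python
def recognition_rate(predicted, expected):
    """Percentage of test tones that were recognized correctly."""
    if not expected or len(predicted) != len(expected):
        raise ValueError("predicted and expected must be non-empty and equal in length")
    correct = sum(p == e for p, e in zip(predicted, expected))
    return 100.0 * correct / len(expected)
```

With 160 test tones per instrument, each misrecognized tone lowers the rate for that instrument by 0.625 percentage points.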

RESULTS AND ANALYSIS

3.1. Test results
Table 1 presents the test results of the developed tone recognition system using a linear kernel function in SVM classification, for different combinations of frame blocking length and number of feature extraction coefficients. As indicated in Table 1, the recognition rate increases as the number of feature extraction coefficients increases. A larger number of feature extraction coefficients increases the dimension of the feature extraction space. The increased dimension makes it easier to differentiate one pattern class from the other pattern classes, which ultimately increases the recognition rate.

The smallest number of feature extraction coefficients
The goal of this research is to find the smallest number of feature extraction coefficients that can be used in a tone recognition system that recognizes tones with one, several, or many major local peaks in the transform domain. As indicated in Table 1, the smallest number of feature extraction coefficients, i.e. eight coefficients, together with the shortest frame blocking length, i.e. 256 points, gives the highest recognition rate of 100%. In other words, with as few as eight feature extraction coefficients, the tone recognition system can recognize all the tested tones. The tested tones in this case are those with one, several, or many major local peaks in the transform domain.
Note that this smallest number of feature extraction coefficients (eight) is linked to the use of a linear kernel function in SVM classification. To evaluate other kernel functions (i.e. polynomial functions) in SVM classification, the tests in Table 1 were repeated for second-order and third-order polynomial functions. The results obtained with both polynomial functions are presented in Table 2. As indicated in Table 2, the linear kernel function gives the best results. With only eight feature extraction coefficients (the smallest number of coefficients), it gives a recognition rate of 100%. This indicates that, with eight feature extraction coefficients, the pattern classes of the feature extraction of the tone signals are linearly separable.

Comparison of some feature extraction and classification combinations
The performance of several feature extraction and classification combinations for musical instrument tone recognition is compared in Table 3. As indicated in Table 3, the feature extraction and classification combination proposed in this research is the most efficient for use in a musical instrument tone recognition system. This is because the tone recognition system needs only eight feature extraction coefficients (the smallest number of coefficients) to recognize tones with one, several, or many major local peaks in the transform domain.

CONCLUSION
This research proposes a feature extraction and classification combination for a tone recognition system for musical instruments. The purpose of this combination is to obtain a tone recognition system that uses the smallest number of feature extraction coefficients in the recognition process. To do this, we combined DCT-based feature extraction and SVM classification in the tone recognition system. The test results show that the proposed combination makes the tone recognition system efficient: the system needed as few as eight feature extraction coefficients to recognize tones with one, several, or many major local peaks in the transform domain. We have also found that SVM classification requires only a linear kernel function. This indicates that the pattern classes from the DCT-based segment averaging feature extraction are linearly separable. For further development of this research, we recommend exploring other feature extraction and classification combinations. These combinations can use different methods for feature extraction (other than DCT-based segment averaging) and different methods for classification (other than SVM).