Feature selection for improving Indian spoken language identification in utterance duration mismatch condition

Received Mar 2, 2021; Revised Jul 2, 2021; Accepted Aug 17, 2021

In spoken language identification (SLID) systems, the test data may be of significantly shorter duration than the training data, a situation known as the duration mismatch condition. Duration normalized features are used to identify a spoken language among nine Indian languages under duration mismatch conditions. Random forest-based importance vectors of 1582 openSMILE features are calculated for each utterance in different duration datasets. The feature importance vectors are normalized within each dataset and then across the different duration datasets. The optimal number of duration normalized features is selected to maximize SLID system accuracy. Three classifiers, artificial neural network (ANN), support vector machine (SVM), and random forest (RF), and their fusion, with weights optimized using logistic regression, are used. The speech material comprised utterances, each of 30 sec, extracted from the All India Radio dataset covering nine Indian languages. Seven new datasets of shorter utterance durations were generated by carefully splitting each utterance. Experimental results showed that the 150 most important duration normalized features were optimal, with a relative accuracy increase of 18-80% for mismatch conditions. The accuracy decreased as the duration mismatch increased.


INTRODUCTION
Spoken language identification can be defined as automatically identifying the language in which a person spoke by analyzing a, typically short, speech utterance from the user [1]. Spoken language identification plays a vital role in human-machine voice interactions [2]. A typical requirement is to quickly identify the speaker's language based on a very short utterance so that the user can be provided a personalized service in his or her own language. With the recent development of computer technology, Indian language identification has gained significance in applications such as vernacular call centers to assist customers, services to assist farmers in their regional language, and so on, because of the need to provide service and communicate with the user in their own language. In a vernacular call center, a sufficient dataset of long-duration utterances may be available to train the system. However, equally long test utterances may not be available; in some cases, the available test utterances may be significantly shorter (less than 3 sec). This problem is known as duration mismatch. Although a number of methods have been suggested in the literature to enhance accuracy on short-duration utterances, in practice the spoken language identification (SLID) system fails to improve performance for mismatched training and testing utterance durations.

All speech utterances were listened to carefully, and any segment containing music, silence, or unwanted voice was filtered out. This speech corpus contains utterances by newsreaders, both male and female (equal in number), on a varying set of topics.
In the baseline system, 1582 features were extracted from all utterances across the different duration datasets using the openSMILE toolkit [24]. A five-fold cross-validation method is used to evaluate these 1582 features across different utterance durations at each iteration. However, a classifier trained on one duration (say, 30 sec), when tested with utterances of a different duration (namely 0.2, 0.5, 1, 3, 5, 10, or 15 sec), degrades in performance irrespective of the type of classifier. In a nutshell, a mismatch between the duration of the utterances used to train and the duration of the utterances used to test significantly deteriorates performance. This is seen across all classifiers whenever the train and test utterance durations differ. The same observation has been noted for mismatched train-test utterance durations when using output score fusion. This issue can be mitigated by selecting relevant features within a machine learning framework that can adapt to the duration-mismatched train-test condition. The feature selection approach helps speed up classifier training and often improves recognition accuracy because discriminative features are selected. For this reason, we introduce the proposed duration normalized feature selection (DNFS) algorithm and evaluate SLID system performance in duration-matched and mismatched conditions. The rest of this paper is organized as follows: the proposed DNFS method for spoken Indian language identification is discussed in section 2; the experimental setup and results using ANN, SVM, RF, and score fusion are discussed in section 3; finally, the conclusion is given in the last section. Figure 1 shows the proposed model for Indian language identification using normalized feature selection in the duration-matched and mismatched conditions in this study.
As discussed earlier, the complete set of 1582 openSMILE features degrades SLID system accuracy in the duration mismatched condition. To improve the SLID system's performance, we focus on the outcome of the proposed DNFS algorithm. As shown in Figure 1, five-fold cross-validation is used: 80% of the spoken utterances from each dataset are used to train the classifier, while the remaining 20%, not used in training, are used for testing. Empty circles and triangles indicate the utterances used for training, while test utterances are illustrated as filled circles. Figure 1 depicts both the duration-matched and duration-mismatched conditions. First, the complete set of acoustic features is extracted; this set contains both relevant and redundant features. The most discriminative features are then selected using the proposed duration normalized feature selection to improve the SLID system's performance in duration mismatched conditions. These discriminative features are used to train the classifiers, which predict the correct class in the duration mismatched condition. The system's performance is analyzed on our own eight different duration datasets.
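The eight duration datasets described above were generated by splitting each 30 sec utterance into fixed-length segments. A minimal sketch of such splitting on raw sample arrays (the sampling rate, array shapes, and function name here are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def split_utterance(samples: np.ndarray, sr: int, seg_dur: float) -> list:
    """Split a 1-D sample array into non-overlapping segments of seg_dur seconds.
    Any trailing remainder shorter than seg_dur is discarded."""
    seg_len = int(round(seg_dur * sr))
    n_segs = len(samples) // seg_len
    return [samples[i * seg_len:(i + 1) * seg_len] for i in range(n_segs)]

sr = 16000
utterance = np.zeros(30 * sr)  # a dummy 30 sec utterance
# the seven shorter durations used in the paper, in seconds
for dur in (0.2, 0.5, 1, 3, 5, 10, 15):
    segs = split_utterance(utterance, sr, dur)
```

For example, a 30 sec utterance yields 150 segments at 0.2 sec and 2 segments at 15 sec, which is how the smaller-duration datasets grow in utterance count as the segment length shrinks.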

The selection of the most discriminative features helps speed up classifier training and improves the system's robustness. Under mismatched utterance durations, increasing the difference between the durations of the training and test utterances decreases the system's identification accuracy. We propose a duration normalized feature selection algorithm to improve identification accuracy under mismatched utterance duration conditions. Figure 2 shows a flow chart of the proposed duration normalized feature selection. Random forest fits a number of decision tree classifiers on various sub-samples of the dataset and uses averaging to improve identification accuracy while controlling over-fitting. Let n_tree represent the number of trees, and let n_f = 1582 and n_c = 9 be the number of input features and the number of output languages, respectively. Let X_t and Y_t, where t denotes the duration of the dataset, be the n_f-dimensional feature vectors and n_c-dimensional label vectors, respectively.
For every (X_t, Y_t), a separate random forest model, each an ensemble of n_tree decision trees, is trained. In each decision tree of the random forest model, a random subset of the n_f features is selected, and the best possible binary split at each node is performed based on the most important feature to achieve the overall n_c-class classification. The importance of node j, used to estimate the best possible binary split (into left and right child nodes), is calculated as:

I_j = w_j G(j) - w_left G(left) - w_right G(right),

where w_j, w_left, and w_right are the weighted numbers of samples at node j and at its left and right splits, respectively, and G(.) is the GINI impurity index calculated as [26]:

G = 1 - sum_{i=1}^{n_c} f_i^2,

where f_i is the frequency of the i-th label among the n_c classes. For each decision tree, the importance of feature k is calculated as the ratio of the number of nodes with k as the most important (splitting) feature to the total number of nodes in the tree. These steps are repeated for all n_tree decision trees in the random forest model to obtain the importance of all features. Based on the random forest model trained on the t-sec duration dataset (X_t, Y_t), the importance values of all features are normalized so that they sum to one. The process is repeated to calculate the normalized feature importance vector for each of the different segment-length datasets. The importance of each feature is then averaged over all the different duration datasets, and ranks are assigned such that the most important feature has rank 1 and the least important feature has rank n_f.
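The steps above can be sketched with scikit-learn, whose RandomForestClassifier exposes GINI-based importances that already sum to one per model; the per-dataset normalization is still applied explicitly to mirror the algorithm. The dataset sizes, durations, and feature count below are synthetic placeholders, not the paper's data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n_f, n_c = 20, 9          # feature / class counts (the paper uses n_f = 1582)
durations = (0.5, 1, 3)   # a subset of the duration datasets, for illustration

importance_vectors = []
for t in durations:
    X_t = rng.normal(size=(200, n_f))       # placeholder feature matrix
    Y_t = rng.integers(0, n_c, size=200)    # placeholder language labels
    rf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_t, Y_t)
    imp = rf.feature_importances_           # GINI-based importances
    importance_vectors.append(imp / imp.sum())  # normalize within this dataset

# average the normalized vectors over duration datasets, then rank:
# rank 1 = most important feature
mean_importance = np.mean(importance_vectors, axis=0)
ranks = np.argsort(-mean_importance)        # feature indices, best first
top_k = ranks[:10]                          # the paper's optimum keeps the top 150
```

The final duration normalized feature set is then simply the columns of each X_t indexed by the top-ranked features.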

SLID system performance using DNFS
The DNFS method is used to improve the SLID system's performance. To verify relevant features, the logarithmic power of the mel frequency bands (logMelFreqBand), line spectral pair frequencies (lspFreq), mel frequency cepstral coefficients (MFCC), PCM loudness, and local shimmer (shimmerLocal) have been used. These low-level descriptors are represented by functionals such as the mean, a linear approximation of the contour (linregc), the outlier-robust signal range max-min (pctlrange), percentiles, quartiles, standard deviation (stddev), and skewness, giving a feature vector of 1582 dimensions for each speech signal. The goal of this phase is to select the most important features using DNFS. It is to be noted that the top 25 and 50 feature sets are related to logMelFreqBand-sma (low-level descriptors smoothed by a moving average filter) and their functionals; the top 75 and 100 features additionally include logMelFreqBand-sma-de (1st order delta coefficients of the smoothed low-level descriptors), lspFreq-sma, lspFreq-sma-de, and mfcc-sma and their functionals. The top 125 features additionally include mfcc-sma-de and their functionals, and the top 150, 175, and 200 features contain mfcc-sma-de, pcm-loudness-sma, pcm-loudness-sma-de, and shimmerLocal-sma and their functionals. The performance of the different feature sets was evaluated using ANN, SVM, and RF classifiers and the output score fusions ANN+SVM and ANN+SVM+RF. Figure 3 shows the performance of the SLID system for the 30 sec training dataset and the 15 and 0.2 sec testing datasets, varying the number of features from 25 to 200 in steps of 25. Figure 3 indicates that the proposed method selects a better feature subset and achieves the highest accuracy on the 15 sec testing dataset. The complete set of accuracies for the duration mismatch condition using ANN is shown in Table 1. The 30 sec dataset is used to train the classifiers, and the remaining datasets are used for testing. All classifiers performed better with all reduced feature sets.
An incremental trend was observed when all classifiers were trained on 30 sec utterance durations. However, for the 175 and 200 feature sets, recognition accuracy started to decrease, so the 150 feature set was taken as the optimum. The performance of the SLID system for the duration mismatched condition was evaluated with all reduced feature sets (25, 50, 75, 100, 125, 150, 175, and 200), and the best results, obtained with the optimum 150 feature set, are presented in this paper.
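The sweep over feature-set sizes can be sketched as follows. An SVM is used here for brevity (the paper sweeps ANN, SVM, and RF), and the data, labels, and ranking are synthetic placeholders standing in for the DNFS-ranked openSMILE features:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 200))   # placeholder: utterances x ranked features
y = rng.integers(0, 9, size=300)  # placeholder: 9 language labels
ranked = np.arange(200)           # feature indices in DNFS rank order

best_k, best_acc = None, -1.0
for k in range(25, 201, 25):      # the feature-set sizes tried in the paper
    Xk = X[:, ranked[:k]]         # keep only the k most important features
    acc = cross_val_score(SVC(), Xk, y, cv=5).mean()  # five-fold CV accuracy
    if acc > best_acc:
        best_k, best_acc = k, acc
```

On the real features, this sweep is what yields the peak at k = 150 reported above; on random placeholders the chosen k is of course meaningless.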

As described in section 2, the DNFS is used to alleviate the short utterance duration and mismatched condition issues in the baseline system. Comparative analysis of varying feature counts by importance value showed that the 150 most important features are optimum for the SLID system under mismatched conditions. Tables 2 to 6 compare the effect of the optimum duration normalized features against the entire feature set using three individual and two fusion classifiers for varying utterance duration datasets. The diagonal values depict the matched conditions, while off-diagonal values illustrate mismatched conditions. The comparative analysis indicates an increase in performance for all mismatched conditions, with a possible slight decrease in performance for some matched conditions.

Table 5. Comparative performance of the original baseline (B) and proposed (P) SLID systems in the mismatched condition using the ANN+SVM classifier (%)

It is noticeable that, despite discarding 90% of the initial features, the performance of the optimum feature set is comparable to using all features. Incremental trends are observed in recognition accuracy across different utterance durations. The results indicate that recognition accuracy greatly improved with the proposed feature selection algorithm, especially for mismatched train-test conditions, while reducing feature dimensionality. It is to be noted that there is a significant saving in required computational time, as shown in Figure 4. For 30 and 0.2 sec utterances, the computational time required to extract the optimum feature set is 2.14 sec and 15.67 msec, respectively.
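The ANN+SVM and ANN+SVM+RF fusions combine the classifiers' per-class output scores with weights optimized by logistic regression. One plausible realization of this, sketched here on synthetic data rather than the authors' exact setup, is to stack each base classifier's class-probability outputs and train a logistic regression on the stacked scores:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# synthetic 9-class stand-in for the 9-language task
X, y = make_classification(n_samples=400, n_features=30, n_informative=15,
                           n_classes=9, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = [MLPClassifier(max_iter=500, random_state=0),   # ANN
        SVC(probability=True, random_state=0),          # SVM
        RandomForestClassifier(random_state=0)]         # RF
for clf in base:
    clf.fit(X_tr, y_tr)

# stack per-class scores side by side; logistic regression learns the weights
stack = lambda Z: np.hstack([clf.predict_proba(Z) for clf in base])
fuser = LogisticRegression(max_iter=1000).fit(stack(X_tr), y_tr)
fused_pred = fuser.predict(stack(X_te))
```

In practice the fusion weights should be fit on held-out scores (e.g. from the cross-validation folds) rather than on the base classifiers' own training outputs, to avoid over-confident probabilities biasing the fuser.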

SLID system performance using a mixture of variable-duration utterances
To explore the generalization capability of the ML algorithms, we developed a SLID system trained with a mixture of variable-duration utterances and tested it using utterances of different durations. We used an equal number of training utterances from all eight duration datasets and different languages for training each model. The results of the experiment are shown in Table 7. It can be observed that the system is biased towards longer-duration utterances and provides less than random-chance accuracy for very short-duration utterances.

SLID system performance comparison with a state-of-the-art system
Das et al. [18] selected features using the state-of-the-art relief algorithm to improve the performance of a SLID system. Table 8 shows the accuracy of a SLID system for Indian languages using openSMILE features and the relief feature selection algorithm. The number of features was optimized by a forward selection method. The best performance, as shown in Table 8, was found for 180 features. Comparative analysis of the proposed and state-of-the-art feature selection algorithms shows that the proposed algorithm selects more relevant features, improving accuracy under all mismatch conditions. Chowdhury et al. [17], using a grey wolf optimization (GWO) feature selection algorithm, reported 96.6% accuracy with an ANN classifier. In comparison, Das et al. [18] reported 92.3% and 100% accuracy using the BBA-LAHC feature selection algorithm on the 5 sec datasets of Indic TTS of IIT Madras and of the Speech and Vision Laboratory (SVL) at IIIT Hyderabad, respectively. The proposed work yields 100% accuracy using the duration normalized feature selection algorithm for the 5 sec dataset in the duration-matched condition. A comprehensive study by Fernando et al. [27] used i-vector + BLSTM to compensate for mismatched duration conditions and reported 66.8% accuracy on a 1 sec dataset. In comparison, the proposed duration normalized feature selection algorithm yielded 68.8% accuracy when trained on the 30 sec dataset and tested with the 1 sec dataset using ANN+SVM+RF output score fusion.

CONCLUSION
Indian language identification is crucial for vernacular call centers to automatically route incoming customer calls to the respective language experts. This paper proposed a novel DNFS algorithm for spoken language identification of Indian languages across different utterance durations. Each utterance was represented by 1582 features extracted using the openSMILE toolkit. Random forest-based models were developed to calculate importance vectors for the features. The optimum 150 duration normalized features were obtained by averaging the normalized importance vectors over the different duration datasets. These features improved SLID system accuracy under training and test duration mismatched conditions, although the system's accuracy decreased with decreasing utterance duration. All experiments were evaluated on the All India Radio dataset developed by us, which was carefully processed to generate eight small-duration datasets. Results showed that the duration normalized features improved accuracy for short-duration utterances and mismatched conditions. With the ANN+SVM+RF fusion classifier trained on the 30 sec dataset, recognition accuracy improved drastically from 61.3% to 99.0% for 15 sec utterances and from 44.6% to 68.8% for the very short 1 sec utterances. At the same time, a minor improvement in recognition accuracy, from 16.4% to 25.9%, was observed for 0.2 sec utterances. In future work, emphasis will be given to improving recognition accuracy for very short-duration utterances in mismatched conditions.