A GMM supervector approach for spoken Indian language identification for mismatch utterance length

Received Sep 30, 2020; Revised Dec 15, 2020; Accepted Jan 19, 2021

Gaussian mixture model-universal background model (GMM-UBM) supervectors are used to identify spoken Indian languages. The supervectors are calculated from short-time MFCC features and their first and second derivatives. The UBM builds a generalized Indian language model, and mean adaptation transforms it into a duration-normalized language-specific GMM. Multi-class support vector machine (SVM) and artificial neural network (ANN) classifiers are used to identify language labels from the supervectors. Experimental evaluations are performed using 30 sec speech utterances from nine Indian languages, comprising five Indo-Aryan and four Dravidian languages, extracted from an All India Radio broadcast news data-set. Eight smaller-duration data-sets were manually derived to study the effect of training and test duration mismatch. Under mismatch, identification accuracy decreases as the test and training utterance durations decrease. Investigations showed that the 32-mixture model with the ANN classifier gives optimal performance.


INTRODUCTION
The spoken language identification (SLID) system recognizes the language, from a desired set, by analyzing a short-duration spoken utterance. An SLID system enables the automatic selection of language and grammar models to convert speech to text in conversational interfaces such as Siri, Alexa, and Google Home. In a vernacular call center, an SLID system can route an incoming call to a human agent who communicates fluently in the customer's native language. Spoken languages vary in dialects and accents, which poses challenges in building an efficient SLID system [1]. In India, a multilingual country, most of the official languages can be grouped into two families, Indo-Aryan and Dravidian. SLID systems for Indian languages are of particular interest because languages belonging to different families (inter-family) are relatively easier to identify than languages belonging to the same family (intra-family).
An SLID system operates in two phases: (i) the training phase and (ii) the testing phase. A language identification model is trained in the training phase by extracting language-specific features from speech utterances. In the testing phase, the trained model's performance is evaluated using utterances unseen during training. Features reported in the SLID literature can be broadly grouped into two categories: (a) low-level speech features and (b) high-level speech features. The low-level features exploit the phonetic nature of Indian languages and include phono-acoustic, phonotactic, and prosodic features. Phono-acoustic features compare the frequency of occurrence of fundamental phonemes to distinguish languages. Phone recognition and its parallel version, followed by language modeling, are most widely used for the phonotactic approach. Although SLID systems based on phonetically transcribed speech utterances are accurate, such data is not readily available [2]. These systems are also prone to errors in manual transcription and phone recognition. Prosodic features discriminate languages based on long-term characteristics such as tone [3], rhythm [4], duration [5], energy, and pitch contour. The use of speech production model-based features such as linear prediction cepstral coefficients (LPCC) [6], perceptual linear prediction (PLP) [7], and Fourier features [8] has been reported in the literature. Their efficiency in related recognition problems inspired the use of perception-based mel-frequency cepstral coefficients (MFCC) with ∆ and ∆² for SLID tasks [9, 10]. The importance of temporal information, captured by MFCC and its derivatives, motivated the use of shifted delta cepstral coefficients (SDC) [11] in SLID systems. It was reported that the performance of MFCC-based systems decreases with decreasing frame size [9, 10].
Classifiers such as the hidden Markov model (HMM) [12], vector quantization (VQ) [6], support vector machine (SVM) [3, 13, 14], artificial neural network (ANN) [15, 16], and Gaussian mixture model (GMM) [15]-[17] have been used to model feature vectors in SLID systems. One of the simplest techniques for SLID is GMM-UBM. In this method, maximum likelihood estimation is used to train the language model, and maximum a posteriori (MAP) estimation is used to adapt the UBM. A speech sample is treated as a series of independent spectral feature vectors, which the GMM models mathematically; the mean vectors obtained by UBM adaptation, concatenated into GMM-UBM supervectors, carry the spectral characteristics [10]-[18]. The features are adapted to the UBM using the MAP estimation algorithm to obtain an utterance-specific GMM [19]. The GMM-UBM supervector performs well on short utterances, deciding the language by calculating a likelihood ratio over spectral features. A comparison of the undercomplete dictionary problem using GMM mean-shifted supervectors against the overcomplete i-vector approach for sparse classification was addressed in [20]. In that approach, the GMM mean-shifted supervector was obtained by concatenating the mean vectors of the adapted GMM-UBM, and it showed superior performance over the i-vector approach. A Bhattacharyya-based GMM system was developed using an adaptive relevance factor to address negative effects on the language characteristics; the authors also addressed duration variability for individual utterances of 30 and 10 sec [21].
Practical SLID systems, such as those in a vernacular call center, may have sufficiently long training utterances available, but equally long test utterances may not be. It has been reported that an SLID system's performance degrades with increasing mismatch between the durations of training and test utterances [22]. This paper presents a GMM-UBM based SLID system for nine Indian languages under matched and mismatched training and test utterance durations. SLID systems trained with long utterances are known to perform well, but their performance worsens when tested on short utterances [22]. To analyze this, we conducted a series of experiments on eight data-sets of different segment lengths, covering various utterance length mismatch cases. It is observed that, with a sufficient amount of training data, the GMM-UBM supervector performs very well even for short utterances. The rest of the paper is structured as follows. Section 2 discusses the proposed SLID system for Indian languages; section 3 describes the experimental setup and results using ANN and OvA SVM classifiers. We conclude in section 4.

PROPOSED SLID SYSTEM
The architecture of the proposed SLID system for Indian languages is shown in Figure 1. The first step is to develop a data-set for nine languages: Assamese (AS), Bengali (BN), Gujarati (GJ), Hindi (HN), Marathi (MR), Kannada (KN), Malayalam (ML), Tamil (TM), and Telugu (TL). The data is split into training and testing sets using a 5-fold cross-validation process. The second step is feature extraction, which converts the speech waveform into a parametric representation. Each spoken utterance is processed using framing and windowing functions. Mel-frequency cepstral coefficient (MFCC) features are computed from each frame and appended with delta and acceleration (∆ and ∆²) coefficients. The resulting 39-dimensional feature vectors from all nine Indian languages are used to develop the GMM-UBM model. A language-specific GMM is developed by adapting the trained UBM using the MAP method. Note that the supervector maps an utterance to a high-dimensional vector: the means of the Gaussian components are adapted for each utterance using the MAP algorithm, and the mean vectors of all Gaussian components are concatenated to form the GMM-UBM supervector. This yields a (39 × M) GMM-UBM supervector matrix per language. ANN and OvA multi-class SVM classifiers are trained to predict the class (language).
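As an illustration, the 39-dimensional feature stacking and the supervector layout described above can be sketched in numpy. This is a minimal sketch: the MFCC matrix and adapted means are random placeholders, and the deltas are approximated with simple numerical gradients rather than the regression-based deltas typically used in speech front-ends.

```python
import numpy as np

rng = np.random.default_rng(0)
mfcc = rng.normal(size=(100, 13))          # placeholder: 100 frames of 13 MFCCs

delta = np.gradient(mfcc, axis=0)          # first derivative (Δ), crude approximation
delta2 = np.gradient(delta, axis=0)        # second derivative (Δ²)
features = np.hstack([mfcc, delta, delta2])
assert features.shape == (100, 39)         # 13 + 13 + 13 = 39-dim frame vectors

# After MAP adaptation, the M adapted 39-dim component means are
# concatenated into a single (39 × M)-dimensional supervector.
M = 32
adapted_means = rng.normal(size=(M, 39))   # placeholder adapted means
supervector = adapted_means.reshape(-1)
assert supervector.size == 39 * M
```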
Extracting the meaningful characteristics of the original signal, thereby representing it with a smaller amount of data without any major loss of information, is referred to as feature extraction. A universal background model (UBM) is typically used to model the overall data distribution and is very popular in speaker recognition. The GMM is used to capture the characteristics of language-independent features. For a sequence of language-specific feature vectors x_k, the Gaussian mixture density is represented as (1) [3].

p(x_k | λ) = ∑_{i=1}^{M} w_i b_i(x_k)   (1)
where x_k, k = 1, 2, …, K are the feature vectors, b_i(x_k), i = 1, 2, …, M are the component densities, and w_i, i = 1, 2, …, M are the mixture weights. The GMM is characterized by the mixture weights (w_i), mean vectors (µ_i), and covariance matrices (Σ_i), collectively represented as (2).

λ = {w_i, µ_i, Σ_i},  i = 1, 2, …, M   (2)
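The mixture density in (1) can be sketched numerically as follows, assuming diagonal covariance matrices (a common simplification; the paper does not state the covariance structure it uses):

```python
import numpy as np

def gmm_density(x, w, mu, var):
    """p(x|λ) = Σ_i w_i N(x; µ_i, diag(var_i)) for one feature vector x."""
    diff = x - mu                                         # (M, D)
    expo = -0.5 * np.sum(diff**2 / var, axis=1)           # Mahalanobis term per component
    norm = np.sqrt((2 * np.pi) ** x.size * np.prod(var, axis=1))
    return np.sum(w * np.exp(expo) / norm)

# sanity check: a single unit-variance component centred at the origin, D = 2;
# at the mean, N(0; 0, I) = (2π)^{-D/2} = 1/(2π)
w = np.array([1.0]); mu = np.zeros((1, 2)); var = np.ones((1, 2))
p = gmm_density(np.zeros(2), w, mu, var)
assert np.isclose(p, 1 / (2 * np.pi))
```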
The maximum likelihood estimation algorithm aims to estimate the language model λ that maximizes the likelihood of the GMM over the set of training data. In this work, X denotes the sequence of acoustic vectors obtained from the MFCC features, and the GMM likelihood is computed as (3).

p(X | λ) = ∏_{k=1}^{K} p(x_k | λ)   (3)
In this paper, maximum likelihood estimation is performed using the iterative expectation-maximization (EM) method. The basic idea of the EM method is to begin with an initial model λ and estimate a new model λ′ such that p(X | λ) ≤ p(X | λ′). The new model becomes the initial model for the next iteration, so the gap between successive models shrinks, and the process repeats until a convergence threshold is reached. For a given spoken utterance s and a hypothesized language L, the language identification system's role is to determine whether s belongs to L. For a given feature space X, the GMM models the feature vectors of the utterance under hypothesis H0, with λ_L the model of the hypothesized language; the alternative hypothesis H1 is modeled in the same feature space by the UBM, and the likelihood ratio is defined as (4).

Λ(X) = log p(X | λ_L) − log p(X | λ_UBM)   (4)
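The EM iteration described above can be sketched for a one-dimensional, two-component GMM on synthetic data (not the paper's setup), verifying that the log-likelihood never decreases between iterations:

```python
import numpy as np

rng = np.random.default_rng(0)
# synthetic 1-D data drawn from two well-separated Gaussians
x = np.concatenate([rng.normal(-2, 1, 200), rng.normal(3, 1, 200)])

# initial model λ: equal weights, guessed means, unit variances
w = np.array([0.5, 0.5]); mu = np.array([-1.0, 1.0]); var = np.array([1.0, 1.0])

def loglik(x, w, mu, var):
    comp = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    return np.log(comp.sum(axis=1)).sum()

prev = -np.inf
for _ in range(20):
    # E-step: posterior responsibility of each component for each sample
    comp = w * np.exp(-(x[:, None] - mu) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)
    gamma = comp / comp.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, variances from responsibilities
    n = gamma.sum(axis=0)
    w = n / len(x)
    mu = (gamma * x[:, None]).sum(axis=0) / n
    var = (gamma * (x[:, None] - mu) ** 2).sum(axis=0) / n
    cur = loglik(x, w, mu, var)
    assert cur >= prev - 1e-9   # EM guarantees a non-decreasing likelihood
    prev = cur
```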
The GMM is trained using the EM method, and the language model parameters λ are then adapted using the MAP approach [3, 13]. To compute the statistics of the GMM-UBM mixture components, the probabilistic alignment of the training vectors with the mixture components is calculated to obtain the weight (w_i), mean (E_i(x)), and variance (E_i(x²)) statistics.
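MAP mean adaptation from these statistics can be sketched using the standard relevance-factor form µ̂_i = α_i E_i(x) + (1 − α_i) µ_i with α_i = n_i / (n_i + r). The relevance factor r = 16 below is a common default in the GMM-UBM literature, not a value reported in this paper:

```python
import numpy as np

def map_adapt_means(ubm_means, post_sums, weighted_sums, r=16.0):
    """MAP-adapt component means toward the utterance data.

    ubm_means:     (M, D) UBM component means µ_i
    post_sums:     (M,)   occupation counts n_i = Σ_t γ_i(t)
    weighted_sums: (M, D) first-order statistics Σ_t γ_i(t) x_t
    r:             relevance factor (assumed value, not from the paper)
    """
    n = post_sums[:, None]
    # E_i(x) = first-order stats / counts; fall back to the UBM mean when n_i = 0
    E = np.where(n > 0, weighted_sums / np.maximum(n, 1e-12), ubm_means)
    alpha = n / (n + r)
    return alpha * E + (1 - alpha) * ubm_means

M, D = 4, 3
ubm = np.zeros((M, D))
n = np.array([0.0, 1e6, 16.0, 4.0])
ws = np.ones((M, D)) * n[:, None]            # chosen so E_i(x) = 1 wherever n_i > 0
adapted = map_adapt_means(ubm, n, ws)
assert np.allclose(adapted[0], ubm[0])        # unseen component keeps the UBM mean
assert np.allclose(adapted[1], 1.0, atol=1e-4)  # heavily observed → data mean
```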
In the case of the SLID system, the maximum likelihood (ML) algorithm determines the parameters of the developed model, and the MAP algorithm derives the language model by UBM adaptation, calculating the means µ_i of the GMM [3]. An ANN, inspired by biological neural networks, consists of three types of layers: input, hidden, and output. Each layer is made up of several neurons. Typically, the length of the feature vector determines the number of neurons in the input layer, and the number of classes to identify determines the number of neurons in the output layer. The numbers of hidden neurons and hidden layers are decided by experimental analysis. The ANN is trained using the backpropagation algorithm, which works on the principle of gradient descent: the error is propagated back through the network to modify its weights so as to reduce the loss in the next iteration. This process is repeated iteratively, and the weight updates ensure a progressively better match between the expected output and the network output [3, 23].
SVM is fundamentally a binary classifier. It performs non-linear classification via the kernel trick, which maps feature vectors into a high-dimensional feature space. A multi-class SVM is built by combining several binary classifiers [13]. These methods are more expensive than a single binary classification problem but show fast convergence on the same amount of data. A d-class one-vs-all (OvA) SVM constructs d binary classifiers; the i-th binary SVM is trained with the i-th class data labeled as positive and the data of the other (d−1) classes labeled as negative. A test sample is assigned to the class whose binary classifier produces a positive score [24]. The most commonly used kernel functions in SVM are the linear, polynomial, and Gaussian kernels, which map low-dimensional feature vectors to a high-dimensional space. The use of SVM for language identification has two advantages: first, it can solve the multi-class problem; second, it can handle sequences of feature vectors.
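The OvA decomposition can be sketched with scikit-learn (an assumption; the paper does not name its toolkit) on synthetic, well-separated stand-in "supervectors". The Gaussian (RBF) kernel and C = 1.3 mirror the settings reported in the experimental section:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(1)
# synthetic 5-dim "supervectors" for 3 classes with well-separated centres
X = np.vstack([rng.normal(c, 0.3, (30, 5)) for c in (0, 3, 6)])
y = np.repeat([0, 1, 2], 30)

# one binary RBF-kernel SVM per class, i-th class positive vs the rest
clf = OneVsRestClassifier(SVC(kernel="rbf", C=1.3))
clf.fit(X, y)
assert len(clf.estimators_) == 3   # d binary classifiers for d classes
```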

RESULTS AND DISCUSSION
All experimental evaluations were carried out using our own speech corpus developed from All India Radio audio files. It comprises 900 audio recordings of news bulletins, each of 30 sec duration at a 16 kHz sampling frequency, read by male and female newsreaders in nine Indian languages [25]. The languages were selected based on their phoneme sound distribution and because they belong to language families spoken by a large population [1]. The languages can be grouped into the Indo-Aryan family, consisting of Assamese (AS), Bengali (BN), Gujarati (GJ), Hindi (HN), and Marathi (MR), and the Dravidian family, consisting of Kannada (KN), Malayalam (ML), Tamil (TM), and Telugu (TL). Each 30 sec utterance was manually split into shorter utterances to derive seven new speech corpora of 0.2 sec, 0.5 sec, 1 sec, 3 sec, 5 sec, 10 sec, and 15 sec. Each utterance was manually inspected, and utterances containing music, unwanted voices, and long silences were removed.
5-fold cross-validation was used to avoid overfitting and to measure the SLID system's accuracy independently of the training-test split. ANN (regularization value: 0.1, activation function: ReLU, number of epochs: 200) and multi-class SVM (Gaussian kernel, OvA decomposition, regularization factor: 1.3) classification models were trained with the feature vectors as input and the corresponding language label (one of nine languages) as output.
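The 5-fold partitioning can be sketched as follows (a minimal numpy sketch; the shuffling and fold-assignment details are assumptions, not taken from the paper):

```python
import numpy as np

def five_fold_indices(n, seed=0):
    """Yield 5 disjoint (train, test) index splits covering n samples."""
    idx = np.random.default_rng(seed).permutation(n)
    folds = np.array_split(idx, 5)
    for k in range(5):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(5) if j != k])
        yield train, test

n = 900  # number of recordings in the corpus
for train, test in five_fold_indices(n):
    assert len(train) + len(test) == n          # every sample used exactly once
    assert np.intersect1d(train, test).size == 0  # no train/test leakage
```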

a. Matched condition
Initially, the data-set was divided into three sets: development, training, and testing, with 50%, 30%, and 20% of the data, respectively. The number of mixtures M in the GMM-UBM was varied over 8, 16, 32, and 64. Table 1 shows the performance evaluation on the eight data-sets when training and testing data have the same duration, i.e., the matched condition. The optimal numbers of neurons and hidden layers were found experimentally. The performance of the SLID system increased with an increasing number of mixtures, with a maximum accuracy of 99.9% at 64 mixtures for the 30 sec data-set using the ANN classifier. The lowest accuracy of 36% occurred at 32 mixtures for the 0.2 sec data-set using the OvA SVM classifier. The experimental evaluation explored the GMM-UBM supervector approach with ANN and OvA SVM models to address the problem of short test utterances. A slight increase in accuracy was observed with increasing utterance length; as expected, a more reliable system can be developed with long utterances. Table 2 compares the accuracy of the GMM-UBM supervector based ANN with earlier approaches in the literature for the 30 sec data-set, including a method reported in [26] (97.1%), an i-vector based DNN [15] (90.8%), MFCC-SDC based GMM-UBM [19] (76.35%), and MFCC-SDC with i-vector [19] (50.45%). Table 3 shows that the GMM-UBM supervector based ANN achieves accuracies of 76.1% and 90.2% on short utterances of 0.2 sec and 0.5 sec, respectively. Figure 2 shows the ROC curves of the 0.2 sec data-set using ANN (green line) and OvA SVM (brown line) for the nine Indian languages; the performance of the ANN is marginally better than that of OvA SVM.

b. Mismatched condition
For the mismatched condition, 4 folds (80% of the spoken utterances) of the 30 sec data-set were used to train the classifiers, and the remaining fold (20%) was used for testing with 0.2, 0.5, 1, 3, 5, 10, and 15 sec segment length utterances.
In the mismatched train-test condition, test utterances of different segment lengths were obtained by splitting the utterances in the testing fold of the 30 sec data-set. Table 4 shows that the relative improvement in recognition accuracy with additional Gaussian mixtures decreases as the segment length decreases. The best performance is achieved using OvA SVM for the 15 sec and 0.5 sec data-sets, while performance degrades drastically for the 0.2 sec data-set. The results show encouraging performance on short test utterances when the system is trained with long utterances. The experimental results reported in Tables 5 and 6 compare the segment length mismatch conditions. Each row of the tables indicates the segment length used to train the classifiers, and the recognition accuracy columns indicate how accurately the model classifies the correct language. As expected, recognition accuracy is high on the diagonal (matched condition) relative to the off-diagonal entries (mismatched condition). System performance degrades when trained with 30, 15, 10, 5, 3, and 1 sec utterances and tested with 0.2 sec utterances, because a 0.2 sec utterance carries little language-discriminative information. For very short utterances (≤ 3 sec), the GMM-UBM supervector with ANN works better than the multi-class SVM. Overall, the results across all tables show that GMM-UBM supervector based ANN and OvA SVM work significantly better on long utterances in both matched and mismatched conditions. In real-time applications, an SLID system that works on short utterances is more desirable; however, system performance degrades for short segment lengths.

CONCLUSION
GMM-UBM supervectors based on short-time MFCC features and their first and second derivatives have been presented for an SLID system for Indian languages. The GMM-UBM supervector with ANN and multi-class SVM classifiers was compared for matched and mismatched training-test durations. In matched conditions, the performance of the GMM-UBM supervector with ANN was similar to that of the multi-class SVM for long utterances, while for short utterances the ANN performed better. In mismatched conditions, the GMM-UBM supervector with ANN performed better than the multi-class SVM, although performance degrades when the test segment length falls below 3 sec. The effect of very short utterances on system performance needs further investigation, and other feature extraction techniques within the GMM-UBM framework remain to be explored. The results show that an SLID system for Indian languages has promising applications in vernacular call centers and speech recognition.