The first FOSD-tacotron-2-based text-to-speech application for Vietnamese

ABSTRACT


INTRODUCTION
Nowadays, advances in artificial intelligence and machine learning have enabled the wide development of automated tools for answering customers' queries, collecting surveys, and addressing complaints without human involvement. These tools are usually chatbots [1][2][3][4][5][6] or, more advanced, voicebots [3][7][8][9]. For voicebots, it is essential to have a text-to-speech (TTS) engine that converts the answer text to speech and plays it back to the customer during a call. Usually, there are two steps in a TTS conversion: (i) converting text to a melspectrogram; and (ii) synthesizing the melspectrogram into a waveform [10].
The recently introduced end-to-end [11], neural network-based models for TTS [12,13] include Tacotron [14], Tacotron-2 [15][16][17], Es-Tacotron-2 [15], WaveNet [18][19][20], and WaveGlow [21]. In [14], a TTS based on the Tacotron model was introduced to generate speech at the frame level, which enabled faster speech synthesis than sample-level approaches. The training was based on a single professional female speaker with approximately 25 hours of recorded speech; thus, the quality of the input audio could be guaranteed and variation was minimal. The input audio had a sampling rate of 24 kHz and training ran for up to 2 million steps. To reduce the number of training steps, it should be possible to reduce the sampling rate. For synthesizing the waveform, the Griffin-Lim algorithm (in existence since 1984 [22] and still attracting attention to date [23]) was used [14]. To further improve the Tacotron-2 model by addressing the over-smoothness problem, which results in unnatural generated speech, Y. Liu and J. Zheng [15] proposed adding an Es-Network to the existing model.

The idea was to make generated speech more natural by employing the Es-Network to calculate the estimated melspectrogram residual and making this an additional task of the Tacotron2 model. N. Li et al. [16] improved the training speed of the Tacotron2 model by replacing its attention mechanism with a multi-head one. This was inspired by the transformer network used in neural machine translation. However, the drawback of this approach is that it used text-to-phoneme conversion to process the data for learning English, which defeats the purpose of the original end-to-end TTS engine proposed for Tacotron [19] and Tacotron2. Although using WaveNet for synthesizing speech may improve speech quality [18][24][25][26][27], such a system needs to train two separate networks, one for converting speech to melspectrogram and the other for synthesizing the speech from the melspectrogram [20]. A WaveNet variant such as WaveGlow [21], on a similar dataset, also required up to 580,000 training steps with audio files sampled at 16 kHz. For synthesizing audio waveforms from melspectrograms on a very large audio dataset (i.e., 960 hours from 2,484 speakers), a multi-head convolutional neural network was proposed [28]. However, its performance with a low number of heads, i.e., 2, was just slightly above the average. Even though [29,30] also attempted to work on a very large audio dataset using the proposed Deep Voice models, the results obtained were not comparable to those of Tacotron2.
As seen from the above analysis, the developed engines mainly support English and Chinese, the most popular languages in the world. Meanwhile, Vietnamese is not supported yet. Although the local TTS tools [31,32] support Vietnamese well, there is little information about their back-end engines. In addition, among the developed models, Tacotron and Tacotron-2 are the most utilized end-to-end TTS engines. Even so, they lack support for Vietnamese. Therefore, this work presents the first open approach for tailoring a Tacotron-2-based TTS engine utilizing the FPT open speech dataset (FOSD) [31,33,34]. To the best of the author's knowledge, this work is the first that attempts to utilize the freely available public dataset, FOSD. The main contributions of this work are:
- The newly developed cleaner for supporting Vietnamese speech generation using the TTS back-end engine provided by Mozilla [35]
- The utilization of the publicly available dataset FOSD [34] for Vietnamese speech generation from text
- The method and analysis of a trained (up to 225,000 steps) TTS model for generating Vietnamese speech [9,36]
The remainder of this paper is organized as follows: section 2 details the method; section 3 discusses the results obtained; section 4 concludes this research.

RESEARCH METHOD
In this section, the overall research method is presented in Figure 1. First, the approach for processing the dataset is presented. Second, the core settings for training and testing the Tacotron-2 engine are outlined, which makes it easier for readers to investigate the proposed approach further. Third, the role of the developed Vietnamese cleaners, as part of the TTS engine, is described to help readers better understand the differences between English and Vietnamese texts. Next, information about the trained model is presented to show how much effort was required to run the training and under which conditions the model was trained in this work. Finally, the approach for creating input data (Vietnamese texts) is shown to cover the various test cases conducted in this work.

Dataset processing
The dataset contains over 25,000 audio files (approximately 30 hours of recording) in Vietnamese, separated into two main subsets [33,34]. All audio files are in a compressed format (i.e., *.mp3) while their transcripts are stored in *.txt files within the same subfolders. The audio file bitrate is 64 kbps. In order to feed these audio files into the Mozilla-based TTS engine, they were all converted into *.wav format with a bitrate of 352 kbps using the SoX toolbox [37]. In addition, all the audio files were placed together in one folder for training the model. The transcript files were also compiled into one file; each line follows the style: audio_file_name|transcript|speech_start_time_1-speech_end_time_1 speech_start_time_2-speech_end_time_2.
Here, audio_file_name is the file name including the extension; transcript is the text of the speech; each speech duration is marked by its two ends (i.e., speech_start_time_1-speech_end_time_1); if there are multiple speech segments in one file, the durations are separated by a space character.
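The line format described above can be read with a few lines of code. The following is a minimal sketch, not the processing script used in this work: the class and function names are hypothetical, and it assumes that the start and end times are plain decimal numbers and that the file name and the time field contain no "|" character.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class TranscriptEntry:
    """One line of the compiled transcript file (hypothetical container)."""
    audio_file_name: str
    transcript: str
    speech_spans: List[Tuple[float, float]]


def parse_transcript_line(line: str) -> TranscriptEntry:
    """Parse 'audio_file_name|transcript|start_1-end_1 start_2-end_2'."""
    # Split off the first and last fields so that a stray '|' inside the
    # transcript text does not break the parse.
    name, rest = line.rstrip("\n").split("|", 1)
    transcript, spans_field = rest.rsplit("|", 1)

    spans = []
    for span in spans_field.split():
        # Assumption: times are non-negative decimals, so '-' is only a separator.
        start, end = span.split("-")
        spans.append((float(start), float(end)))
    return TranscriptEntry(name, transcript, spans)
```

For instance, a made-up line such as "example_0001.wav|xin chào|0.50-1.75 2.10-3.40" would yield an entry with two speech spans.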
The transcript file was then separated into two *.csv files for training and testing the engine. The training file consisted of 23,000 transcript lines while the testing file consisted of 1,900 transcript lines. Detailed step-by-step guidelines for this data processing can be found in [38].
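A split of this kind can be sketched as follows. The text does not state how lines were assigned to each file, so the seeded random shuffle below is an assumption, and the function name is hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_transcripts(
    lines: Sequence[str], n_train: int, n_test: int, seed: int = 0
) -> Tuple[List[str], List[str]]:
    """Return disjoint training and testing subsets of the transcript lines."""
    if n_train + n_test > len(lines):
        raise ValueError("not enough transcript lines for the requested split")
    shuffled = list(lines)
    # Assumption: a seeded shuffle, so that the split is reproducible.
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:n_train + n_test]
```

With the sizes reported above, the call would be split_transcripts(lines, 23000, 1900); each returned list can then be written to its own *.csv file.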

Tacotron-2 architecture settings
In this work, the Tacotron-2 architecture based on [19] was utilized since it provides better output quality than the Tacotron architecture, as recommended in Mozilla's notes to developers [35]. Table 1 presents the typical configuration of the important parameters for training the model. In this table, the number of mel-spectrogram bands was 80 and the number of short-time Fourier transform (STFT) frequency levels (equal to the size of a linear spectrogram frame) was 1,025, the same as the default value. The sampling rate was set to 22,050 Hz for faster training of the Tacotron-2 architecture. Since the model used in this work was Tacotron-2, the softmax function was used for calculating the attention norm, as suggested by Mozilla. The complete configuration of the TTS engine can be found in [9]. In addition, the minimum and maximum sequence lengths were changed from 6 to 10 and from 150 to 100, respectively, after the first 100,000 training steps. This makes the model converge faster and suits the existing dataset better, which has a minimum sequence length of 2, a maximum sequence length of 301, and an average sequence length of 52.43. As a result, 1,145 instances were discarded since they fell outside the aforementioned sequence length range. In this work, the phoneme option was disabled since it was outside the focus of this research. Meanwhile, a new text cleaner named "Vietnamese_cleaners" was developed for processing Vietnamese texts. The dataset path and the meta files for training and validation were provided as well. It should be noted that the model was trained entirely on Google Colaboratory, a free TensorFlow-supported platform.
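Two of the settings above can be illustrated with a short sketch. It is not taken from the Mozilla TTS code base: the function names are hypothetical, and measuring sequence length as the number of characters in a transcript is an assumption. The first helper uses the general fact that a one-sided STFT with FFT size n_fft has n_fft/2 + 1 frequency levels; the second keeps only the transcripts whose length lies within the configured range.

```python
from typing import List, Sequence, Tuple


def fft_size_from_num_freq(num_freq: int) -> int:
    """FFT size whose one-sided spectrum has `num_freq` frequency levels."""
    return (num_freq - 1) * 2


def filter_by_sequence_length(
    transcripts: Sequence[str], min_seq_len: int, max_seq_len: int
) -> Tuple[List[str], List[str]]:
    """Split transcripts into (kept, discarded) by their length.

    Assumption: sequence length is the character count of the transcript.
    """
    kept, discarded = [], []
    for text in transcripts:
        if min_seq_len <= len(text) <= max_seq_len:
            kept.append(text)
        else:
            discarded.append(text)
    return kept, discarded
```

With 1,025 frequency levels the first helper gives an FFT size of 2,048. Calling the second helper with the limits 10 and 100 reproduces the kind of filtering that led to instances being discarded in this work.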

Vietnamese cleaners
The Vietnamese cleaners were developed to support the Vietnamese language instead of English as in the original repository. The cleaners perform the following special conversions:
- symbols to words: e.g., "+" to "cộng" (English: plus)
- special characters to words: e.g., "%" to "phần trăm" (English: percent)
- special words to similar words with the same pronunciation: e.g., "hỷ" to "hỉ" (English: happy)
- numbers to words: e.g., "11" to "mười một" (English: eleven)
Here, it should be noted that all capitalized words were converted to lowercase to form uniform source texts before being fed to the network for training, validation and testing.

Training model
In order to show that the developed Vietnamese cleaners are suitable for the model to generate clear Vietnamese speech from random texts, the model was trained for 225,000 steps. As a result, the training loss was 0.10406 while the validation loss was 0.12349.

Random texts for speech generation
In Table 2, uncorrelated random texts were selected for testing the trained TTS model. The first text was an unusual statement comparing the sizes of "one" duck and a cow. In this text, the word "một" (English: "one") was used to test whether the trained model could generate speech containing a number. The second text was a statement describing a female named "sơn"; here, the letter "s" was not capitalized. The third text was a statement describing two footballers being invited to Spain for a career trial. The fourth text was a statement describing Hanoi streets during spring, near the Vietnamese Lunar New Year. The fifth text was a statement describing how Vietnamese football stars spend money.

RESULTS AND DISCUSSION
In this section, the results obtained from the trained Vietnamese TTS model are discussed. First, the generated speeches are assessed based on their completeness. This indicates whether the model is able to generate complete speeches from the given texts. Second, the speeches are assessed based on their clearness and naturalness using mean opinion scores (MOS), the typical index for assessing the quality of speech generated by a TTS engine.

Completeness of the generated speeches
Out of the five generated speeches, three (the first, the second, and the fifth) were complete. The third speech missed 2 of 17 words while the fourth one missed 10 of 14 words (i.e., the second part of the sentence, after the comma). To further analyze the missing words, Table 3 presents the frequencies of the missing words in the training and validation sets which were used for training and validating the developed FOSD model. From the table, it can be seen that the typical ratio of validation words to training words ranged from approximately 0.05 to 0.14. It can also be seen that too low a frequency in the validation set could cause missing words in the generated speech, e.g., 2 occurrences for the words "sắc" and "xuân". In addition, an excessively high frequency could cause the same issue, e.g., from above 1,000 to over 2,000 or 3,000 occurrences for the words "nhiều", "đã", and "Hà", respectively.
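Frequencies of this kind can be obtained by counting whitespace-separated words in the training and validation transcripts. The sketch below is illustrative only: the function names are hypothetical, and lowercasing the words before counting is an assumption that mirrors the lowercasing done by the cleaners.

```python
import unicodedata
from collections import Counter
from typing import Dict, Iterable, Optional, Tuple


def word_frequencies(transcripts: Iterable[str]) -> Counter:
    """Count whitespace-separated, lowercased words over all transcripts."""
    counts = Counter()
    for text in transcripts:
        normalized = unicodedata.normalize("NFC", text.lower())
        counts.update(normalized.split())
    return counts


def frequency_report(
    words: Iterable[str],
    train_transcripts: Iterable[str],
    val_transcripts: Iterable[str],
) -> Dict[str, Tuple[int, int, Optional[float]]]:
    """Map each word to (training count, validation count, validation/training ratio)."""
    train = word_frequencies(train_transcripts)
    val = word_frequencies(val_transcripts)
    report = {}
    for word in words:
        key = unicodedata.normalize("NFC", word.lower())
        t, v = train[key], val[key]
        # The ratio is undefined when the word never occurs in the training set.
        report[key] = (t, v, v / t if t else None)
    return report
```

Calling frequency_report with the words missing from a generated speech gives, for each word, the two counts and their ratio, which is the layout described for Table 3.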

Clearness and naturalness of the generated speeches
A crowd-sourced survey was conducted with 100 randomly selected participants, all students at FPT University, to assess the clearness and naturalness of the generated speeches. Here, naturalness refers to how natural (human-like) the generated speech sounds, while clearness indicates its clarity (low noise). According to the survey, 50% of the students used headphones while the other 50% used computer speakers for the test. In addition, none of the students had heard the sentences or speeches before. Their MOS are outlined in Table 4. From the table, the MOS for clearness ranged approximately from 2 to 4.5. Four out of five speeches were considered clear while the second speech was the least clear one. The clearest speech was the fifth; its MOS was 3.39 with a standard deviation of 0.98, making it the best speech in the test set. Meanwhile, the MOS for the naturalness of the generated speeches were typically slightly lower than those for clearness. Still, the fifth speech was the most natural speech in the test set. Here, three out of five speeches were above the average (about 2.50).
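For reference, a MOS and its standard deviation are simply the mean and spread of the individual ratings given to one speech. The sketch below is illustrative: the function name is hypothetical, a rating scale of 1 to 5 is assumed (the usual MOS scale, not stated in the survey description), and the sample standard deviation is used because the text does not say whether the sample or population form was reported.

```python
from statistics import mean, stdev
from typing import Sequence, Tuple


def mos_summary(ratings: Sequence[float]) -> Tuple[float, float]:
    """Return (MOS, sample standard deviation) for one generated speech."""
    if len(ratings) < 2:
        raise ValueError("at least two ratings are needed")
    # Assumption: ratings are on the usual 1-5 opinion scale.
    if any(not 1 <= r <= 5 for r in ratings):
        raise ValueError("ratings must lie between 1 and 5")
    return mean(ratings), stdev(ratings)
```

Applied to the 100 clearness ratings of one speech, the function returns the pair of values of the kind reported in Table 4.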

CONCLUSION
This paper has presented the first approach for building a FOSD Tacotron-2-based TTS engine for Vietnamese. The work offers new insights into the generation of speech from text. In particular, word frequencies that are too low or excessively high in the training and validation sets could cause those words to be missing from the generated speech. Overall, all the generated speeches are above the average in terms of clearness and naturalness. Future work will further explore the possibility of generating quality speech from an optimal dataset.