Performance comparison of TF-IDF and Word2Vec models for emotion text classification

Received Apr 14, 2021; Revised Jul 27, 2021; Accepted Aug 30, 2021

Emotion is the human feeling that arises when communicating with other people or reacting to everyday events. Emotion classification is needed to recognize human emotions from text. This study compares the performance of the TF-IDF and Word2Vec models for representing features in emotion text classification. We use the support vector machine (SVM) and Multinomial Naïve Bayes (MNB) methods to classify emotional text in Commuterline and Transjakarta tweet data. The emotion classification in this study has two steps. The first step classifies data as containing emotion or no emotion. The second step classifies the data that contain emotion into five types: happy, angry, sad, scared, and surprised. This study uses three scenarios, namely SVM with TF-IDF, SVM with Word2Vec, and MNB with TF-IDF. SVM with TF-IDF generates the highest accuracy in both the first and second classification steps, followed by MNB with TF-IDF and then SVM with Word2Vec. Evaluation using precision, recall, and F1-measure also shows that SVM with TF-IDF is the best method overall. This study shows that TF-IDF modeling performs better than Word2Vec modeling, and it improves on the classification performance reported in previous studies.


INTRODUCTION
Emotion is the human feeling that arises when communicating with other people or reacting to everyday events [1]. Human emotions can be expressed through facial expressions, voice, and text [2]. Recently, people have come to convey their emotions in text through social media. Text in social media can be classified as emotion or no emotion. Text that contains emotion is then divided, following Ekman, into six types, namely happy, sad, angry, scared, surprised, and disgusted [3]. The classification of emotion types from text is of concern in the fields of human-computer interaction (HCI) and information retrieval (IR), and has been implemented in many domains [4], [5]. Emotion classification in text has received much attention as a way to recognize human emotions [6]-[8].
Text classification includes a feature extraction stage that converts unstructured text into structured data so that it can be processed by machine learning algorithms [9]. Feature extraction plays an important role in classification because choosing an effective and appropriate method can affect classification performance [10]. A widely used feature extraction technique is the vector space model with the term frequency-inverse document frequency (TF-IDF) approach [11]. Another widely used technique is word embedding with the Word2Vec approach [12]. TF-IDF modeling produces high-dimensional data, while Word2Vec modeling produces low-dimensional data [13]. Dimensionality affects both the computation time of the classification process and classification performance. Because several feature extraction techniques are available (such as TF-IDF and Word2Vec), researchers are often unsure which one suits their research. An inappropriate feature extraction technique results in longer computation time and suboptimal classification performance. Based on this background, it is important to compare the performance of the TF-IDF and Word2Vec models in classification. The goal of this research is to identify the best feature extraction model for text classification. The best feature extraction model is needed for faster computation in the classification process and improved classification performance [13].
The main contribution of this paper is a performance comparison of the TF-IDF and Word2Vec models for emotion text classification. This paper also improves on the performance evaluation of previous research. Cahyani [14] compared the emotions of Commuterline and Transjakarta users using the Multinomial Naïve Bayes method with the TF-IDF model; Word2Vec was not used as feature extraction in that study [14]. Tweet emotion detection using two stages of classification with the support vector machine (SVM) and maximum entropy methods and the TF-IDF model was conducted in [15]. Acosta [16] performed sentiment analysis of Twitter messages using Word2Vec, testing four classifiers (Gaussian Naive Bayes, Bernoulli Naive Bayes, SVM, and Logistic Regression) with the two Word2Vec models, Skip-gram and CBOW. Fauzi [17] used the Word2Vec model for sentiment analysis of product reviews with the SVM method. This paper differs from these studies because each of them used only one of the TF-IDF or Word2Vec techniques for classification. Previous studies have not compared the performance of the TF-IDF and Word2Vec models on the same data for detecting emotion in text. Such a comparison is needed to find the model that produces the best text classification performance.
This paper discusses the performance comparison of the TF-IDF and Word2Vec models for the classification of emotions in text. The classification algorithm used in this paper is SVM, which is compared with the Multinomial Naïve Bayes (MNB) algorithm used in a previous study [14]. Emotion classification is applied to Commuterline and Transjakarta tweet data, which are in the Indonesian language.

RESEARCH METHOD

Preprocessing
The research method for emotion text classification in this study is shown in Figure 1. Preprocessing prepares the text data before it is processed in the system, so that the data becomes more structured. Preprocessing has five steps: case folding, filtering, normalization, stop words removal, and stemming. Case folding converts text to lowercase. Filtering removes tweet attributes such as links, mentions, and URLs. Normalization changes non-standard words into standard words. Stop words removal eliminates common words that carry no meaning. Stemming removes the affixes from words [18], [19].
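The five preprocessing steps can be sketched as a small pipeline. This is a minimal illustration only: the normalization dictionary, stop word list, and suffix list below are hypothetical toy values (the paper does not list its resources), and the suffix stripping is a stand-in for a real Indonesian stemmer.

```python
import re

# Hypothetical toy resources, assumed for illustration only.
NORMALIZATION = {"gk": "tidak", "bgt": "banget"}   # non-standard -> standard
STOP_WORDS = {"yang", "di", "dan"}                 # common words to drop
SUFFIXES = ("nya", "kan")                          # toy affix list, not a full stemmer


def stem(token):
    """Strip one known suffix if enough of the word remains."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def preprocess(tweet):
    """Apply case folding, filtering, normalization, stop words removal, stemming."""
    text = tweet.lower()                                  # case folding
    text = re.sub(r"https?://\S+|@\w+", " ", text)        # filtering: URLs, mentions
    tokens = re.findall(r"[a-z]+", text)                  # keep alphabetic tokens
    tokens = [NORMALIZATION.get(t, t) for t in tokens]    # normalization
    tokens = [t for t in tokens if t not in STOP_WORDS]   # stop words removal
    return [stem(t) for t in tokens]                      # stemming
```

For example, `preprocess("Keretanya telat bgt @CommuterLine https://t.co/x")` yields the token list `["kereta", "telat", "banget"]` under these toy resources.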

TF-IDF and Word2Vec model
In this stage, we build the TF-IDF and Word2Vec models. TF-IDF is a term weighting method that gives a different weight to each term in a document based on the frequency of the term in that document and its frequency across all documents [20]. TF-IDF is used in this study because it provides better performance, especially in improving recall and precision [21]. The TF-IDF model has four steps. The first step is the calculation of the frequency of occurrence of each word in each document (TF), shown in (1).
where $tf_t$ is the number of occurrences of term $t$. The second step is the calculation of the number of documents containing a specific word (DF). The third step is the calculation of the inverse DF (IDF), shown in (2).

$$ idf_t = \log \frac{D}{df_t} \tag{2} $$
where $idf_t$ is the inverse document frequency, $D$ is the number of documents, and $df_t$ is the number of documents that contain term $t$. The last step is the calculation of TF-IDF, which is the TF result multiplied by the IDF result for each word, as shown in (3).

$$ w_{t,d} = tf_t \times idf_t \tag{3} $$
where $w_{t,d}$ is the weight of term $t$ in document $d$, $tf_t$ is the number of occurrences of term $t$, and $idf_t$ is the inverse document frequency of term $t$.

The TF-IDF model is compared with the Word2Vec model. Word2Vec is a neural network that represents words in vector form [22]. Word2Vec is used in this study because it provides better performance on semantic tasks that determine the association of a word with other similar words. For example, man is associated with boy, and woman is associated with girl [23]. Word2Vec has two models: the continuous bag-of-words (CBOW) model and the continuous Skip-gram model. This study uses Skip-gram because it represents infrequent words in the data better than the CBOW model [24]. The architecture of the Skip-gram model is shown in Figure 2. In the Skip-gram architecture, the model uses the current word as input to predict the surrounding context; it learns the probability distribution of the context words within a predefined window. The Skip-gram model has an input layer, a hidden layer, and an output layer [25]. The input layer is a one-hot vector, in which the entry for the input word from the vocabulary is 1 and all other entries are 0. Each neuron in the input layer represents one word in the vocabulary. In the hidden layer, the number of neurons equals the number of dimensions of the word vector. The activation function in the hidden layer is linear, so the value of a hidden layer neuron is the input value multiplied by the weight value. For a one-hot input vector $x$, the hidden layer is computed as shown in (4). The value of the hidden layer is then multiplied by a different weight matrix to obtain the output layer values, as shown in (5).

$$ h = W^{T} x \tag{4} $$

$$ u_j = \left( {W'}^{T} h \right)_j \tag{5} $$
where $h$ is the hidden layer, $W^{T}$ is the transpose of the input weight matrix, $u_j$ is the value of output unit $j$ computed from the hidden layer, and ${W'}^{T}$ is the transpose of the weight matrix from the hidden layer to the output layer. The output layer, which represents the target word, has the same number of neurons as the input layer. The output layer uses the softmax activation function shown in (6).

$$ y_j = \frac{\exp(u_j)}{\sum_{j'=1}^{V} \exp(u_{j'})} \tag{6} $$
where $y_j$ is the softmax output of unit $j$, $u_{j'}$ ranges over the values of all output units, and $V$ is the vocabulary size.
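The two feature models described in this section can be illustrated with a short sketch. It assumes raw term counts for TF and a base-10 logarithm for IDF, which the text does not confirm, and it shows only the forward pass of Skip-gram (one-hot input, linear hidden layer, softmax output) with hand-picked toy weights. The function names `tf_idf` and `skipgram_forward` are hypothetical; in practice the weights would be learned by a library such as gensim.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Weight each term per document; documents is a list of token lists.

    Assumes tf = raw count and idf = log10(D / df); both are assumptions.
    """
    n_docs = len(documents)
    # DF: number of documents that contain each term.
    df = Counter(term for doc in documents for term in set(doc))
    idf = {term: math.log10(n_docs / count) for term, count in df.items()}
    weights = []
    for doc in documents:
        tf = Counter(doc)                                    # TF per document
        weights.append({term: tf[term] * idf[term] for term in tf})
    return weights


def skipgram_forward(word_index, W, W_out):
    """Forward pass only. W is V x N (input weights), W_out is N x V (output weights)."""
    V, N = len(W), len(W[0])
    x = [1.0 if i == word_index else 0.0 for i in range(V)]            # one-hot input
    h = [sum(W[i][k] * x[i] for i in range(V)) for k in range(N)]      # h = W^T x
    u = [sum(W_out[k][j] * h[k] for k in range(N)) for j in range(V)]  # u = W'^T h
    shift = max(u)                                   # subtract max for numerical stability
    exp_u = [math.exp(value - shift) for value in u]
    total = sum(exp_u)
    y = [value / total for value in exp_u]           # softmax over the vocabulary
    return h, y
```

Because the input is one-hot, the hidden layer `h` is simply the row of `W` for the input word, which is why that row is used as the word vector after training.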

Emotion classification using SVM
The TF-IDF weights and word vectors produced in the previous stage are used as the classification features. This study uses the SVM classification method. SVM is widely used in text classification because of its strong performance [26], [27]. SVM constructs an optimal separating hyperplane in a higher-dimensional feature space so that data are classified with minimal risk [28]. If the data cannot be separated linearly, SVM is modified using a kernel function, where the data $\vec{x}$ is mapped by the function $\Phi(\vec{x})$ to a vector space with a higher dimension. The learning process in SVM, which finds the support vectors, relies only on the dot product of the transformed data. Because the transformation $\Phi$ is generally difficult to work with explicitly, the dot product calculation can be replaced with a kernel function. The kernel function is shown in (7).

$$ K(\vec{x}_i, \vec{x}_j) = \Phi(\vec{x}_i) \cdot \Phi(\vec{x}_j) \tag{7} $$
The classification result for data $\vec{x}$ is given by (8),

$$ f(\vec{x}) = \sum_{i=1,\ \vec{x}_i \in SV}^{n} \alpha_i y_i K(\vec{x}, \vec{x}_i) + b \tag{8} $$

where $\alpha_i$ is the Lagrange multiplier, which is zero or positive ($\alpha_i \geq 0$), $y_i$ is the class of data $\vec{x}_i$, $b$ is the bias, $n$ is the number of samples in the training set, and $SV$ is the set of support vectors, which is the subset of the training set with Lagrange multipliers greater than 0 ($\alpha_i > 0$). The kernel functions that can be used in SVM include Linear, Polynomial, Sigmoid, and Radial Basis Function (RBF) [29]. This study uses a linear kernel function because it performs well, is fast, and requires fewer parameters than other kernels [30].
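The decision rule with a linear kernel can be sketched as follows. This is an illustration under stated assumptions, not the training procedure: the support vectors, multipliers, and bias below are hypothetical hand-picked values, whereas in practice they are found by an SVM solver (for example, scikit-learn's `SVC` with `kernel="linear"`).

```python
def linear_kernel(x, z):
    """Linear kernel: the plain dot product of two feature vectors."""
    return sum(a * b for a, b in zip(x, z))


def decision_function(x, support_vectors, alphas, labels, b, kernel=linear_kernel):
    """Sum alpha_i * y_i * K(x, x_i) over the support vectors, plus the bias b."""
    return sum(
        alpha * y * kernel(x, sv)
        for sv, alpha, y in zip(support_vectors, alphas, labels)
    ) + b


def predict(x, support_vectors, alphas, labels, b):
    """Assign class +1 or -1 according to the sign of the decision value."""
    return 1 if decision_function(x, support_vectors, alphas, labels, b) >= 0 else -1


# Hypothetical toy model: two support vectors, one per class.
support_vectors = [[1.0, 1.0], [-1.0, -1.0]]
alphas = [0.5, 0.5]
labels = [1, -1]
bias = 0.0
```

With these toy values, `predict([2.0, 0.0], support_vectors, alphas, labels, bias)` falls on the positive side of the hyperplane and `predict([-1.0, -3.0], ...)` on the negative side.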
In this study, the SVM classification applies the 10-fold cross-validation technique. In 10-fold cross-validation, the entire dataset serves as both training and testing data: the classification process is carried out 10 times, each time with a different split of the data into training and testing parts [31].
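The fold construction can be sketched as below. This is a minimal version that assumes contiguous, unshuffled folds; the paper does not state whether the data were shuffled or stratified, and the function name `k_fold_indices` is hypothetical.

```python
def k_fold_indices(n_samples, k=10):
    """Yield (train_indices, test_indices) for each of the k folds.

    Folds are contiguous and unshuffled (an assumption); the first
    n_samples % k folds receive one extra sample.
    """
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = indices[start:start + size]                  # held-out fold
        train = indices[:start] + indices[start + size:]    # remaining folds
        yield train, test
        start += size
```

Each sample appears in exactly one test fold, so averaging the 10 per-fold scores gives an estimate that uses every sample for testing once.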

Evaluation and analysis
At the evaluation stage, accuracy, precision, recall, and F1-measure are calculated as shown in (9). In the analysis stage, we compare the classification results of SVM with TF-IDF and SVM with Word2Vec. The results are also compared with the methods used in previous studies.
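The four measures can be sketched with their standard per-class definitions, which are assumed here to match the ones used in the paper. A score is set to zero when its denominator is zero, which is the situation that arises when a class is never predicted.

```python
def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def precision_recall_f1(y_true, y_pred, label):
    """Precision, recall, and F1-measure for one class, treated as positive."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == label and p == label for t, p in pairs)   # true positives
    fp = sum(t != label and p == label for t, p in pairs)   # false positives
    fn = sum(t == label and p != label for t, p in pairs)   # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1
```

For a class that the classifier never predicts correctly, all three values are zero, regardless of how well the other classes are recognized.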

RESULTS AND DISCUSSION
The data in this study were crawled from tweets of Transjakarta and Commuterline users. The search queries for data collection used the official Transjakarta (@PT_Transjakarta) and Commuterline (@CommuterLine) accounts. Tweets were collected from January 1, 2017 to September 30, 2017. The entire dataset is in the Indonesian language. The experiment used the Python programming language with several libraries, namely scikit-learn, numpy, pandas, and gensim. The experiment is a continuation of previous research [14], so it uses the same data. Table 1 shows the experimental dataset.

The classification is divided into two steps. The first step classifies the tweet data into emotion and no emotion. The tweets classified as containing emotion in the first step are then processed in the second step, which classifies them into five types of emotion: happy, angry, sad, scared, and surprised. The classification results in this study are also compared with the results of the previous study [14].

Figure 3 compares the average accuracy in the first and second classification steps for SVM with TF-IDF, SVM with Word2Vec, and MNB with TF-IDF as conducted in the previous study [14]. We do not combine MNB with Word2Vec because Word2Vec vectors can contain negative values, which the MNB classifier does not allow in document vectors. It is possible to scale all vectors uniformly to avoid negative values, but this results in poor performance [33]. Figure 3 shows that SVM with TF-IDF generates the highest accuracy for both the Commuterline and Transjakarta data, in both the first and second classification steps. It is followed by MNB with TF-IDF and then SVM with Word2Vec. This shows that TF-IDF modeling performs better than Word2Vec modeling.
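The two-step scheme used in these experiments can be sketched as a simple cascade. The two keyword-based functions below are hypothetical stand-ins for the trained first-step and second-step classifiers; they are included only so that the control flow can be run.

```python
def classify_two_step(tokens, detect_emotion, classify_emotion):
    """Step 1 decides emotion vs. no emotion; step 2 runs only on emotion tweets."""
    if not detect_emotion(tokens):
        return "no emotion"
    return classify_emotion(tokens)


# Hypothetical toy classifiers, assumed for illustration only.
def toy_detector(tokens):
    """Stand-in for the first-step classifier."""
    return any(token in {"senang", "marah"} for token in tokens)


def toy_emotion_classifier(tokens):
    """Stand-in for the second-step classifier (two of the five classes)."""
    return "happy" if "senang" in tokens else "angry"
```

A consequence of this design is that an emotional tweet rejected in the first step never reaches the second step, so first-step errors cap the performance of the whole pipeline.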
In general, the accuracy on the Commuterline data is also better than on the Transjakarta data for each method. This is because the Commuterline dataset is larger than the Transjakarta dataset, so the Commuterline features available to the classification process are more diverse, and the resulting accuracy is higher.

The classification is also evaluated with precision, recall, and F1-measure on the Commuterline and Transjakarta data. These values indicate how well the system recognizes an emotion. Figure 4 shows the comparison of precision, recall, and F1-measure in the first classification step. SVM with TF-IDF provides the best overall precision, recall, and F1-measure, which shows that it enables the system to properly recognize emotion and no-emotion data. MNB with TF-IDF ranks second, although its results do not differ much from those of the first-ranked method. SVM with Word2Vec ranks third, with a significant gap to SVM with TF-IDF. This also supports the finding that TF-IDF modeling performs better than Word2Vec modeling. In general, the precision, recall, and F1-measure on the Commuterline data are better than on the Transjakarta data for each method.

Precision, recall, and F1-measure were also computed for the second classification step. The average precision, recall, and F1-measure values on the Commuterline and Transjakarta data are presented in Figures 5-7. In Figure 5, the precision values show that all three methods perform well enough in recognizing happy and angry emotions. However, SVM with Word2Vec does not succeed in recognizing sad, scared, and surprised emotions.
This is because the sad, scared, and surprised classes contain few samples. Word2Vec requires a large amount of data to learn word representations and to place similar words close together, so it cannot recognize emotions with little data. For the surprised emotion, all three methods fail to recognize the emotion correctly on the Commuterline data, while on the Transjakarta data the SVM with TF-IDF method has a low precision value. The precision for the surprised emotion is low for all methods because surprised is a minority class with far fewer samples than the other emotions, so surprised tweets tend to be classified as other emotions.

In Figure 6 and Figure 7, the recall and F1-measure values show that all three methods perform well enough in recognizing happy and angry emotions. For sad and scared emotions, however, there are significant differences: SVM with TF-IDF is better at recognizing them than MNB with TF-IDF. The precision, recall, and F1-measure values of MNB with TF-IDF are low, which means that it is less able to recognize sad and scared emotions. Meanwhile, for SVM with Word2Vec, the recall and F1-measure values are zero, which means that this method fails to recognize sad, scared, and surprised emotions, again because these classes contain few samples. Based on the recall values on both datasets and the F1-measure values on the Commuterline data, none of the three methods can recognize surprised emotions, because the values obtained are zero. On the Transjakarta data, however, the F1-measure of SVM with TF-IDF is nonzero, although low. This means that SVM with TF-IDF can identify surprised emotions, even if not optimally.
The best method in terms of precision, recall, and F1-measure is SVM with TF-IDF. This is because SVM with TF-IDF can recognize all five types of emotion, including surprised emotions, which the other methods fail to recognize. In contrast, SVM with Word2Vec can recognize only happy and angry emotions. This shows that the TF-IDF model performs better than the Word2Vec model at recognizing every type of emotion (happy, angry, sad, scared, surprised) based on precision, recall, and F1-measure. The TF-IDF model outperforms the Word2Vec model because the emotion classes are imbalanced and several classes contain few samples. In particular, surprised is a minority class with far fewer samples than the other emotions. With little data, Word2Vec cannot properly capture the semantic and syntactic information of words, because it needs large training data to learn word representations. TF-IDF modeling, by contrast, can produce good accuracy even with a small amount of data.
The classification performance in this study improves on the results of the previous study [14]. The SVM with TF-IDF method used in this study gives better results than the MNB with TF-IDF method used previously. In the accuracy evaluation of the first and second steps, SVM with TF-IDF is better than MNB with TF-IDF. Likewise, in the evaluation of precision, recall, and F1-measure, SVM with TF-IDF is superior to MNB with TF-IDF. This study therefore improves the accuracy, precision, recall, and F1-measure results for emotion text classification.

CONCLUSION
In this study, we compared the performance of the TF-IDF and Word2Vec models for representing features in emotion text classification. We used the SVM and MNB methods to classify emotional text in Commuterline and Transjakarta tweet data. The classification is divided into two steps: the first step determines whether a tweet contains emotion, and the second step classifies a tweet that contains emotion into one of five types (happy, angry, sad, scared, and surprised). We used three scenarios, namely SVM with TF-IDF, SVM with Word2Vec, and MNB with TF-IDF. SVM with TF-IDF generates the highest accuracy in both the first and second classification steps, followed by MNB with TF-IDF and then SVM with Word2Vec. Evaluation using precision, recall, and F1-measure also shows that SVM with TF-IDF is the best method overall in both classification steps. SVM with TF-IDF succeeds in recognizing emotion and no-emotion data in the first step and in recognizing the five types of emotion in the second step. This shows that TF-IDF modeling performs better than Word2Vec modeling in emotion text classification. The classification performance of SVM with TF-IDF in this study improves on the results of the previous study, which used MNB with TF-IDF.
In future work, researchers are encouraged to use a large amount of data that is balanced across emotion classes. A large and balanced dataset is needed to improve the performance of the feature extraction technique, which in turn affects classification performance. Furthermore, researchers can also combine TF-IDF and Word2Vec as feature extraction for text classification.