A multi domains short message sentiment classification using hybrid neural network architecture

Received Jan 10, 2021; Revised May 20, 2021; Accepted Jun 3, 2021
Sentiment analysis of short texts is challenging because they carry limited contextual information. It becomes more challenging for a low-resource language such as Bahasa Indonesia. However, various deep learning techniques can achieve good accuracy. This paper explores several deep learning methods, namely multilayer perceptron (MLP), convolutional neural network (CNN), and long short-term memory (LSTM), and builds combinations of those three architectures. The combinations are intended to get the best of each architecture. The MLP accommodates the use of the previous model to obtain classification output. The CNN layer extracts the word feature vector from text sequences. Subsequently, the LSTM repetitively selects or discards feature sequences based on their context. Those advantages are useful for different domain datasets. The experiments on sentiment analysis of short text in Bahasa Indonesia show that hybrid models can obtain better performance, and the same architecture can be directly used on another domain-specific dataset.


INTRODUCTION
The public discusses various issues through different media platforms. Social media is one of the channels where users can share their opinions. However, each platform imposes limitations, such as the supported media types (video, audio, text, or emoticons), maximum text length, and file formats. These limitations can become a further problem for social media analytics. Thus, our research focuses on opinions in short text messages shared by users through social media channels (especially Twitter). The reason is that decision-makers are only interested in the sentiment of opinions exchanged during conversations in social media, not the detailed messages in the discussion. Nowadays, public opinions, in the form of trending topics or virality, can drive and change public policy.
Sentiment analysis is the study of people's opinions, sentiments, and emotions about something as expressed in text messages [1]. The object of those emotions might include products, services, persons, or other topics of conversation. Sentiment analysis of short texts is challenging because the context might be present in other short messages or other conversations. Nevertheless, researchers have studied sentiment analysis using statistical models with lexicon methods [2]-[4], machine learning [5]-[9], and semi-supervised methods with deep learning [10]-[12]. To get good sentiment accuracy, researchers often apply additional techniques such as data tuning and model tuning. Data tuning uses domain-specific data, which is trained and tuned to build models for that specific domain. This technique expects overfitting and a good result within that specific domain. On the other hand, model tuning uses various algorithms and combinations to get a new model that gives results with better accuracy. In deep learning methods, rearranging architectures can build a model with better results. For example, combining a convolutional neural network (CNN) for its coarse-grained feature extraction and a recurrent neural network (RNN) for its sequential feature relationships could give better results [13]. This research explores various architectures of deep neural networks for sentiment analysis. We use the architectures to build a satisfactory model for a specific domain using its domain-specific data and evaluate the model performance against another domain with its own domain-specific dataset. For that purpose, we employ two different domains: employment and telecommunication.
A wide range of employment issues is frequently discussed in social media, from compensation and employee tenure [14], wages and working locations [15], to the labor market and gender issues [16]-[17]. On the other hand, social media can be used to inform workers about government policy and labor issues, and to bridge communication between the public, laborers, and policymakers. The sentiment of those interactions is essential for policymakers. Telecommunication operators are another domain that is frequently discussed in social media. The discussions can relate to operator performance on user service, bonuses, bandwidth availability, downtime, and connection quality. Mobile network operators can use the discussion to direct their business policy and services [18]-[20].
In this research, we explore various deep learning methods, architectures, and combinations of methods to build models for sentiment analysis on short texts in Bahasa Indonesia. We use domain-specific datasets (on employment and telecommunication issues) for model evaluation and compare the models' performance on the datasets. This research makes three contributions in evaluating hybrid neural network methodologies for sentiment analysis on short text messages in Bahasa Indonesia and their application to different domain datasets: (1) explaining techniques for building models with hybrid neural network architecture on a domain-specific dataset to induce an overfitted model; (2) evaluating the architecture on a different domain with direct model implementation and model tuning with another domain-specific dataset; and (3) providing experience in model migration from one domain to another.
The structure of this paper is presented by defining the purpose of the research in the first section, which is followed by short reviews of various sentiment analysis methods with additional emphasis on deep neural network techniques and hybridization of neural network architectures in the second section. The third section explains the research methods and experiments from data collection, pre-processing, and model building using various techniques for sentiment analysis to compare their analysis results. The fourth section gives experiment results and analysis. The conclusion is given in the last section.

HYBRID NEURAL NETWORK ARCHITECTURE FOR SENTIMENT ANALYSIS
Even though natural language resources for Bahasa Indonesia are limited, researchers have conducted various studies on sentiment analysis for text in Bahasa Indonesia, such as using lexicon-based techniques [21]-[22], machine learning [23]-[24], and deep learning [25]-[27]. Le et al. [25] use LSTM and CNN to analyze the sentiment of 900 thousand Indonesian tweets. They obtained 73.22% accuracy with a standard deviation of 1.39 using LSTM without normalization. Franky and Manurung [24] evaluate several classification techniques (Naive Bayes, Maximum Entropy, and Support Vector Machines) using unigram and word frequency features on a Bahasa Indonesia translation of movie reviews. They achieved 78.82% accuracy, which they considered satisfactory given the simple translation, compared to 80.09% accuracy when those techniques were applied directly to the original English movie reviews. Problems that often occur with Indonesian text in social media are unstructured text data and non-standard language. Putra et al. [28] showed that a hybrid model for sentiment analysis that combines lexicon-based and maximum entropy methods gives a good evaluation score with 84.31% accuracy.
The hybrid neural network architecture is a combination of several neural network architectures, not limited to multi-layer perceptron (MLP), CNN, or long short-term memory (LSTM). Figure 1 shows a hybrid neural network architecture that consists of MLP, CNN, and LSTM. The hybrid architecture makes use of the best of each component architecture. The MLP is a simple neural network model that accommodates the use of the previous model to obtain classification output. In contrast, the CNN layer uses its local feature advantages to extract the word feature vector from text sequences. LSTM repetitively selects or discards feature sequences based on their context. The feedforward method in the MLP multiplies each input neuron with a weight to produce feature maps as a feature vector and passes it to the next layer through networks of neurons. It carries out the learning process before producing the first output vector. On the other hand, CNN uses word vectors as the input and processes them through convolutional layers, pooling layers, and fully connected layers. At the end of the CNN process, the fully connected layer produces the second output vector. The same word vectors are fed into the LSTM, which extracts the input features using various filters and evaluates the sequence of features within their context in temporal order. Thus, it produces the third output, another feature sequence. The three output vectors, from MLP, CNN, and LSTM respectively, are combined. The simplest combination function concatenates those outputs into the input of the classification process, which uses the sigmoid activation function.

Figure 1. Hybrid neural network architecture
The basic concept of this hybridization is to concatenate the prior process's output as the input to the next one. In this study, we explore three base models: MLP, CNN, and LSTM. We use each architecture as a baseline neural network model and create a hybrid with the combination MLP+CNN+LSTM. During the training phase, we store the result of the last layer of every single architecture as a parameter vector. Those parameter vectors are merged to provide the input to the final process for the model classification output. We use sigmoid as the activation function for determining the classification output after merging the parameter vectors, because the sigmoid function gives a real-valued output between 0 and 1. The sentiment value is determined by splitting the output range into a lower and a higher fraction. An output value of 0 to 0.500 is considered a negative sentiment, while a value of 0.5001 to 1 is considered a positive one.
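The merging and thresholding step can be illustrated with a minimal sketch. It assumes that each branch has already produced its output vector and that a single sigmoid unit follows the concatenation; the function name `classify`, the toy vectors, and the weights are hypothetical and are not trained values from this study.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic function; maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(mlp_out, cnn_out, lstm_out, weights, bias, threshold=0.5):
    """Concatenate the branch outputs, apply one sigmoid unit, threshold the score."""
    merged = list(mlp_out) + list(cnn_out) + list(lstm_out)
    if len(merged) != len(weights):
        raise ValueError("one weight per merged feature is required")
    score = sigmoid(sum(w * x for w, x in zip(weights, merged)) + bias)
    # Scores up to 0.5 are negative, scores above 0.5 are positive.
    label = "positive" if score > threshold else "negative"
    return score, label


# Toy branch outputs and weights (made-up numbers for illustration only).
score, label = classify([0.2, 0.7], [0.9], [0.4, 0.1],
                        weights=[0.5, -0.3, 1.2, 0.8, -0.6], bias=-0.1)
print(round(score, 4), label)
```

In practice the merged vector is much longer and the weights are learned by backpropagation; the sketch only shows how concatenation and the 0.5 split interact.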

RESEARCH METHOD
We evaluate the effectiveness of our hybrid neural network approach for sentiment analysis of short messages in Bahasa Indonesia, especially on employment and mobile telecommunication topics. The analytics phases are described in Figure 2.

Data collection
The data is obtained from a Twitter collection provided and labeled by Ivosights. The dataset is classified into positive and negative sentiment. The labels are assigned by human annotators who are native Bahasa Indonesia speakers, because short text messages rarely use standard Bahasa Indonesia. Most messages are a mixture of Bahasa Indonesia, acronyms, slang, English, Arabic, Javanese, or other local languages, emoticons, emojis, and other symbols. Each annotator gives a sentiment label to each tweet message based on how they would feel about the tweet's sentence. The classification agreed on by most annotators determines the final sentiment classification.
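The majority-agreement rule can be sketched as follows. This is an illustration only: the function name `majority_label` is hypothetical, and returning `None` on a tie is an assumption, since the text does not state how ties between annotators are resolved.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by most annotators, or None when the top labels tie."""
    if not votes:
        raise ValueError("at least one annotator vote is required")
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # assumption: a tie is left undecided
    return ranked[0][0]


print(majority_label(["positive", "negative", "positive"]))
```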

Data extraction
We take two different topics with different characteristics from the Twitter dataset, which relate to employment and telecommunication operators' issues. There are 67,106 tweets in Bahasa Indonesia related to employment issues, which range from welfare to social security to salary. They consist of 30,102 tweets with positive sentiments and 37,004 tweets with negative sentiments from January 2017 to November 2017. On the other hand, there are 81,159 tweets in Bahasa Indonesia from January 2017 to May 2018 about telecommunication operators' issues, of which 40,938 tweets are labeled as positive sentiments and 40,221 tweets as negative sentiments. We select Bahasa Indonesia tweet messages only, delete tweet messages in local, foreign, or slang language, and split the dataset into 80% training data and 20% test data.
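A minimal sketch of the 80/20 split is given here, using the employment counts stated above with placeholder records. Shuffling with a fixed seed before cutting is an assumption; the text does not say whether the split was random or stratified, and `train_test_split` is a hypothetical helper, not a library call.

```python
import random


def train_test_split(records, train_percent=80, seed=0):
    """Shuffle a copy of the records and cut it into training and test parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) * train_percent // 100  # integer arithmetic avoids rounding drift
    return shuffled[:cut], shuffled[cut:]


# Placeholder records: 30,102 positive (1) and 37,004 negative (0) employment tweets.
employment = [(f"tweet {i}", 1 if i < 30102 else 0) for i in range(30102 + 37004)]
train, test = train_test_split(employment)
print(len(employment), len(train), len(test))
```

With these counts, the split yields 53,684 training and 13,422 test records for the employment dataset; the same function applied to 81,159 telecommunication tweets yields 64,927 and 16,232.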

Dataset pre-processing
Data pre-processing is carried out to prepare a clean dataset for further processing. We clean the text messages by filtering unnecessary characters, such as punctuation characters, ASCII codes, and tokens with nonalphabetic characters. Each sentence is then tokenized. Every token that is not a Bahasa Indonesia stop word [29] is treated as vocabulary and added to a dictionary. We group tokens from the word dictionary by their sentiment into vocabulary files representing words from every record with similar sentiment polarity. There are 14,075 tokens with 27,736 dictionaries in the employment issues dataset and 13,994 tokens with 36,264 dictionaries in the telecommunication issues dataset. Using Indonesian stop words helps to sort out words that comply with Indonesian language requirements, including five Twitter features (mention, hashtag, URL, discourse marker, and emoticon). The pre-processing generates a data dictionary for each dataset. Figure 3 shows the top ten vocabularies in employment issues (a) and telecommunication issues (b).
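The cleaning and vocabulary-building steps can be sketched as below. The stop-word set is a tiny illustrative sample rather than the published list [29], the two example tweets and the handle `@akun` are invented, and dropping mentions, hashtags, and URLs outright is an assumption about how those Twitter features are filtered.

```python
import string
from collections import Counter

# Tiny illustrative sample; the study uses a published Bahasa Indonesia stop-word list.
STOP_WORDS = {"yang", "dan", "di", "ke", "dari", "ini", "itu"}


def clean_and_tokenize(text):
    """Lowercase, drop mentions/hashtags/URLs, strip punctuation, keep alphabetic non-stop words."""
    tokens = []
    for raw in text.lower().split():
        if raw.startswith(("@", "#", "http")):
            continue  # assumption: Twitter-specific features are removed
        word = raw.strip(string.punctuation)
        if word.isalpha() and word not in STOP_WORDS:
            tokens.append(word)
    return tokens


def build_vocabulary(labelled_tweets):
    """Group token counts by sentiment label, one Counter per polarity."""
    vocab = {}
    for text, label in labelled_tweets:
        vocab.setdefault(label, Counter()).update(clean_and_tokenize(text))
    return vocab


tweets = [
    ("Gaji naik, terima kasih @akun! https://t.co/x #upah 2017", "positive"),
    ("Sinyal hilang dan internet lambat :(", "negative"),
]
vocab = build_vocabulary(tweets)
print(vocab["positive"].most_common(10))
```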

Hybrid neural network model development for sentiment analysis
We build several models with CNN, MLP, LSTM, and the hybrid of those three architectures. Firstly, we build a text representation where each word is represented as a vector of real numbers using word embedding. A real number representation is required because the neural network model can only accept numerical input values. The word embedding technique is suitable for the neural network model because it keeps the order and interaction of the words within sentences and the probability functions of each word sequence. Thus, it can be more expressive than classical models such as bag-of-words, bigram, or trigram [30]. The next step is model building; we use CNN, MLP, LSTM, and the hybrid of those three models for document classification. The CNN is configured with a number of filters that are processed in parallel, a kernel size, and an activation function. The output of this stage is a two-dimensional vector that represents the extracted features. The LSTM units consist of cells, input gates, output gates, and forget gates. The MLP uses a back-end model for feature interpretation. The output layer uses an activation function with values between 0 for negative sentiment and 1 for positive sentiment. After that, we look for the best training models, which produce the maximum accuracy using gradient descent. A topological change of the architectural model is required to find the optimal configuration for minimizing the error. The last stage is storing the best generated model for evaluation.
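How the LSTM gates select or discard information can be sketched with one time step of a single-unit cell. This uses the standard LSTM equations as an illustration and is not the configuration used in the experiments; the function name `lstm_step` and the parameter layout are hypothetical.

```python
import math


def sigmoid(z):
    """Logistic function used by the three gates."""
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, h_prev, c_prev, params):
    """One time step of a single-unit LSTM; params maps gate name -> (w_x, w_h, bias)."""
    def pre_activation(gate):
        w_x, w_h, bias = params[gate]
        return w_x * x + w_h * h_prev + bias

    forget = sigmoid(pre_activation("forget"))        # how much old cell state to keep
    write = sigmoid(pre_activation("input"))          # how much new candidate to store
    expose = sigmoid(pre_activation("output"))        # how much cell state to emit
    candidate = math.tanh(pre_activation("candidate"))
    c_new = forget * c_prev + write * candidate
    h_new = expose * math.tanh(c_new)
    return h_new, c_new


# With all-zero parameters every gate is 0.5 and the candidate is 0.
zero = {gate: (0.0, 0.0, 0.0) for gate in ("forget", "input", "output", "candidate")}
h, c = lstm_step(x=1.0, h_prev=0.0, c_prev=1.0, params=zero)
print(h, c)
```

A strongly negative forget-gate bias drives the forget gate toward 0, which discards the previous cell state; a strongly positive one keeps it.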

Model evaluation
We evaluate the model generated by the training process on the testing dataset. The evaluation uses the same estimation function for the training and testing datasets, and new data is encoded with the same scheme as the training data encoder.
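The two reported quantities, loss error and accuracy, can be computed as in the sketch below. Binary cross-entropy is an assumption here, chosen because it is the usual loss for a sigmoid output; the text does not name the loss function that was used.

```python
import math


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean negative log-likelihood of binary labels under predicted probabilities."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)


def accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of examples whose thresholded prediction matches the label."""
    hits = sum(1 for y, p in zip(y_true, y_prob) if (1 if p > threshold else 0) == y)
    return hits / len(y_true)


labels = [1, 0, 1, 0]
scores = [0.9, 0.2, 0.4, 0.6]
print(binary_cross_entropy(labels, scores), accuracy(labels, scores))
```

Applying the same two functions to the training and the testing predictions gives directly comparable values.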

Experiment results
Raw Twitter data on the employment and telecommunication issues are collected offline and stored in JSON file format. The data contains mentions, hashtags, URLs, discourse markers, and emoticons, which could lead to unexpected analysis results if left unfiltered. Thus, the documents are extracted and converted into text files for the pre-processing stage. The clean dataset is then modified to provide the features needed for sentiment analysis. The pre-processing cleans the documents of retweets and duplicates. Table 1 shows the number of tweets collected for each sentiment and the number of cleaned tweets prepared after data extraction and pre-processing. The training dataset is prepared by grouping all files into files with positive sentiments, labeled as class 1, and files with negative sentiments, labeled as class 0. The training uses tokenization and word embedding to build vectors of real numbers as the word representation. Tokenization represents each document as a sequence of integers. Each integer maps a single token to its vector representation of real numbers. The vector values are randomly assigned during the training process, and an embedding layer's API can be used to create the class instance for all document datasets. Since the inputs for the training process must have the same vector size, the tokens of each input document are kept in order and padded with 0s if the number of tokens is less than the defined vector size.
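The integer encoding and zero padding can be sketched as follows. Padding at the end of the sequence and truncating longer documents are assumptions, since the text only states that shorter inputs are padded with 0s; the helper names `build_index` and `encode` are hypothetical.

```python
def build_index(documents):
    """Assign consecutive integers to tokens in order of first appearance; 0 is reserved for padding."""
    index = {}
    for tokens in documents:
        for token in tokens:
            if token not in index:
                index[token] = len(index) + 1
    return index


def encode(tokens, index, max_len):
    """Map tokens to integers, truncate to max_len, and pad the tail with 0s."""
    ids = [index[t] for t in tokens if t in index][:max_len]
    return ids + [0] * (max_len - len(ids))


docs = [["gaji", "naik"], ["gaji", "turun", "lagi"]]
index = build_index(docs)
print([encode(d, index, max_len=4) for d in docs])
```

An embedding layer then looks up one real-valued vector per integer, so every encoded document becomes a matrix of the same shape.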
Each word of each document is represented as a vector of real numbers using word embedding. The vectors are passed through various deep neural network architectures for document classification in sentiment analysis. The convolutional neural network architecture consists of filtering, kernel size definition, and a pooling layer that simplifies the classification output. The LSTM architecture produces output after passing through forget, input, cell, and output gates. Furthermore, the MLP classifies after passing through input, several hidden, and output layers. The hybrid architectures are designed by appending one architecture to another, such as CNN and MLP, LSTM and MLP, and CNN and LSTM and MLP. Figure 4 (a) shows the performance of the implemented neural network architectures in reaching the minimum loss error in the training stage. It shows that the hybrid (CNN+LSTM+MLP) outperforms the other models with a lower loss error. This result holds for both input datasets. The LSTM-only architecture is the second-best, which shows that LSTM can find context well enough for sentiment analysis. The MLP shows the worst performance, generating a significant loss error, because it uses a feed-forward algorithm with a one-time learning process for each of its layers. The training accuracy shown in Figure 4 (b) affirms the training loss error by showing again that the hybrid (CNN+LSTM+MLP) architecture has the best performance for both trained datasets. In the testing phase, we evaluate the loss error of the experimental results. Figure 5 (a) shows a significant difference between the loss error of evaluation on the employment issues dataset and on the telecommunication issues dataset. However, the main concern in this study is the performance difference between architectures. The figure shows that the single CNN and the hybrid (CNN+LSTM+MLP) architectures give the best performance. This might be because CNN is good at extracting features from the text in sentences. In the testing phase, the trained model is tested against the 20% of the dataset held out as test data. The result shows that the hybrid architecture has a low loss error compared to the single architectures. Figure 5 (b) shows the accuracy of the various architectures on the test datasets. It shows that the hybrid (CNN+LSTM+MLP) architecture gives the best accuracy on both the employment and telecommunication datasets.
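The filtering and pooling steps of the CNN branch can be illustrated with a minimal sketch over a toy sequence of two-dimensional word vectors. The ReLU activation, the stride of one, and the absence of padding are assumptions for illustration; the toy kernel is hand-written, not learned.

```python
def conv1d(sequence, kernel, bias=0.0):
    """Slide a (k x d) kernel over a sequence of d-dimensional vectors; ReLU on each window sum."""
    k = len(kernel)
    outputs = []
    for start in range(len(sequence) - k + 1):
        total = bias
        for offset in range(k):
            total += sum(w * x for w, x in zip(kernel[offset], sequence[start + offset]))
        outputs.append(max(0.0, total))  # ReLU (assumed activation)
    return outputs


def max_pool(values, pool_size=2):
    """Keep the maximum of each non-overlapping window of pool_size values."""
    return [max(values[i:i + pool_size])
            for i in range(0, len(values) - pool_size + 1, pool_size)]


# Five tokens, each embedded as a 2-dimensional vector (toy values).
embedded = [[1, 0], [0, 1], [1, 1], [0, 0], [1, 0]]
kernel = [[1, -1], [1, 1]]  # one filter covering two consecutive tokens
feature_map = conv1d(embedded, kernel)
print(feature_map, max_pool(feature_map))
```

A real layer applies many such filters in parallel, which yields one feature map per filter.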

Discussion
The previous section describes the results of our experiments with seven neural network architectures (MLP, CNN, LSTM, CNN+MLP, CNN+LSTM, LSTM+MLP, and CNN+LSTM+MLP) for sentiment analysis of short messages in Bahasa Indonesia in the employment and telecommunication domains. The MLP architecture is tested with one and two hidden layers. It takes input from the dataset according to the maximum length defined at the pre-processing stage. The best result of the learning process with MLP is obtained with two hidden layers. It gives training loss errors of 0.6335 and 0.3797 and training accuracies of 71.47% and 92.34% for the employment and telecommunication issues datasets, respectively. However, the accuracy drops significantly on both testing datasets: the employment issues test dataset gives a loss error of 1.1275 and 57.90% accuracy, and the telecommunication issues test dataset gives a loss error of 0.1068 and 85.92% accuracy.
The CNN architecture is more reliable in extracting features from sentences stored in the "negative" and "positive" sentiment directories. Its loss error and accuracy in the training phase are better than those of the MLP, with training loss errors of 0.1870 and 0.2227 and training accuracies of 97.53% and 97.63% for the employment and telecommunication issues datasets, respectively. The values are not consistent in the testing phase: for the employment issues dataset, the testing loss error worsens to 0.4446 and the testing accuracy to 81.71%. In contrast, the values for the telecommunication issues dataset improve to a testing loss error of 0.0201 and a testing accuracy of 99.45%. This might be because the telecommunication issues dataset has more records and richer vocabularies than the employment issues dataset.
The LSTM architecture is more context-oriented in encoding the input sentence. In this experiment, we add a dropout function that partially drops the LSTM output before the output process to improve accuracy and minimize overfitting, and we set its value to a maximum of 0.5. The experiment shows that this architecture achieves better results than MLP or CNN, especially on a larger dataset (such as the telecommunication issues dataset). The training phase gives low loss errors (0.0581 and 0.0387) and good accuracies (97.30% and 98.15%) for the employment and telecommunication issues datasets. However, similar to the CNN, the testing results worsen on the employment issues dataset, to a testing loss error of 0.9353 and a testing accuracy of 65.60%, and improve on the telecommunication issues dataset, to a testing loss error of 0.0133 and a testing accuracy of 99.77%.
The hybrid LSTM and MLP architecture is built by concatenation, where the results of the LSTM and MLP processes are combined to obtain a single output. The final sentiment classification is determined using the sigmoid activation function. The experiment shows that appending MLP does not help to improve the architecture performance: the loss error and accuracy in the training and evaluation phases are somewhat worse than those of the LSTM-only architecture. The training loss error of the hybrid LSTM and MLP architecture is 0.2135 and 0.2608 on the employment and telecommunication issues datasets. Like other experiments, the analysis on the test dataset of employment issues is worse than on the training one, with a loss error of 0.7832 and an accuracy of 69.00%. On the other hand, the analysis on the test dataset of telecommunication issues is better, with a loss error of 0.0335 and an accuracy of 99.45%.
The hybrid (CNN and MLP) architecture is created by attaching the MLP at the end of the feature extraction in the CNN, before executing the flatten function that bridges the CNN process with the output process, and by using a pool size of two in the max pooling process. The results are quite good and better than those of the hybrid LSTM and MLP in the training stage. The testing stage is slightly better but provides a comparable trial average value. The training loss errors are 0.4440 and 0.2620 for the employment and telecommunication issues datasets, while the training accuracies are 94.22% and 96.61%, respectively. The testing loss errors are 0.5855 and 0.0107 for the employment and telecommunication issues datasets, and the testing accuracies are 82.24% and 99.70%. On the employment issues dataset, this testing accuracy is slightly better than that of the single CNN or the other dual hybrid architectures.
The hybrid CNN and LSTM architecture combines feature extraction with context orientation. At the end of the fixed output, it uses the sigmoid activation function to return the sentiment classification. The performance values are quite good: the training loss error is 0.2199 and 0.2600 for the employment and telecommunication issues datasets, while the training accuracy is 95.91% and 96.41%, respectively. The architecture still has difficulties with the employment issues testing dataset, where the loss error is 0.6830 and the accuracy is 74.60%. In contrast, the telecommunication issues testing dataset is evaluated with a loss error of 0.6017 and an accuracy of 98.50%.
The hybrid (CNN, LSTM, and MLP) architecture is built by concatenating the CNN, LSTM, and MLP processes. This integration makes the learning process more complicated, which results in a longer training computation time. However, it gives significant benefits for the architecture's performance in sentiment classification. The training loss errors are 0.0490 and 0.0347 for the employment and telecommunication issues datasets, and the training accuracies are 98.85% and 98.39%, respectively. The architecture also shows its best performance on the testing datasets of the employment and telecommunication issues, with loss errors of 0.4850 and 0.0020 and accuracies of 85.48% and 99.85%.
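The testing figures quoted in the preceding paragraphs can be collected and ranked with a short script. The numbers are copied from the text above; the helper `rank` and the dictionary layout are only an illustrative way to compare them.

```python
# (testing loss error, testing accuracy in %) per dataset, as reported in the text.
TEST_RESULTS = {
    "MLP":          {"employment": (1.1275, 57.90), "telecom": (0.1068, 85.92)},
    "CNN":          {"employment": (0.4446, 81.71), "telecom": (0.0201, 99.45)},
    "LSTM":         {"employment": (0.9353, 65.60), "telecom": (0.0133, 99.77)},
    "LSTM+MLP":     {"employment": (0.7832, 69.00), "telecom": (0.0335, 99.45)},
    "CNN+MLP":      {"employment": (0.5855, 82.24), "telecom": (0.0107, 99.70)},
    "CNN+LSTM":     {"employment": (0.6830, 74.60), "telecom": (0.6017, 98.50)},
    "CNN+LSTM+MLP": {"employment": (0.4850, 85.48), "telecom": (0.0020, 99.85)},
}


def rank(dataset, metric):
    """Order architectures best first: ascending for 'loss', descending for 'accuracy'."""
    column = 0 if metric == "loss" else 1
    return sorted(TEST_RESULTS,
                  key=lambda name: TEST_RESULTS[name][dataset][column],
                  reverse=(metric == "accuracy"))


for dataset in ("employment", "telecom"):
    print(dataset, rank(dataset, "accuracy")[0], rank(dataset, "loss")[:2])
```

The ranking reproduces the pattern discussed in the text: the full hybrid has the highest testing accuracy on both datasets, while the single CNN has the lowest testing loss error on the employment issues dataset.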
The performance of the implemented architectures for sentiment classification is listed in Table 2 for the training datasets of employment and telecommunication issues and in Table 3 for their respective testing datasets. We expect the optimal architecture to give the lowest loss error and the highest accuracy. The hybrid (CNN+LSTM+MLP) shows its superiority over the other architectures by giving the highest accuracy on both training and testing datasets and the lowest loss error in all cases except one: on the employment issues testing dataset, the CNN architecture attains the lowest loss error and the hybrid (CNN+LSTM+MLP) is the second-best.
Figure 6 shows the model plot of the hybrid (CNN+LSTM+MLP) architecture. The input for each model contains vocabulary tokens with a maximum length of 962 words. The CNN converts the input with word embedding by mapping each word into a 100-dimensional vector. The convolutional process extracts features from the input data with 32 parameters with four kernels. To reduce overfitting and the computation cost, we use a dropout of 0.5. The CNN branch stops at the flatten function because its result will be merged with the results from the other models. Similar to the CNN side, the LSTM side starts by applying word embedding to the input data. We use a ten-node neural network layer with the ReLU activation function for context-based processing to generate a ten-dimensional output vector, using a dropout of 0.2. The MLP side uses three hidden layers that consist of 20, 10, and 10 nodes with varying input dimensions. The flattened outputs of CNN, LSTM, and MLP are merged by concatenation. We process the merged output using a feedforward neural network with a single hidden layer of ten nodes and predict the analysis output using a sigmoid function to give a positive or negative sentiment classification.
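The size of the merged feature vector can be estimated with simple shape arithmetic. The sketch below rests on several assumptions that are not confirmed by the text: that "32 parameters with four kernels" means 32 filters with a kernel size of four, that the convolution uses no padding and a stride of one, and that a max pooling of size two (stated only for the CNN and MLP hybrid) is also applied here.

```python
def conv_output_length(input_length, kernel_size, stride=1):
    """Number of positions of an unpadded 1D convolution."""
    return (input_length - kernel_size) // stride + 1


def pooled_length(length, pool_size):
    """Number of non-overlapping pooling windows that fit."""
    return length // pool_size


MAX_LEN = 962         # maximum input length stated in the text
FILTERS = 32          # assumed reading of "32 parameters"
KERNEL_SIZE = 4       # assumed reading of "four kernels"
POOL_SIZE = 2         # assumed; stated only for the CNN and MLP hybrid
LSTM_UNITS = 10       # ten-dimensional LSTM output
MLP_LAST_HIDDEN = 10  # last of the 20, 10, 10 hidden layers

conv_len = conv_output_length(MAX_LEN, KERNEL_SIZE)
pool_len = pooled_length(conv_len, POOL_SIZE)
cnn_features = pool_len * FILTERS  # size after the flatten function
merged_features = cnn_features + LSTM_UNITS + MLP_LAST_HIDDEN
print(conv_len, pool_len, cnn_features, merged_features)
```

Under these assumptions the flattened CNN branch dominates the merged vector, which is one reason the final ten-node hidden layer is useful for compressing it before the sigmoid output.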
The result shows that the combination of CNN, LSTM, and MLP architectures gives the best results and can be used on domain datasets different from its training domain. This is because the hybrid architecture gets the best of its component architectures: the MLP accommodates the use of the previous model to obtain classification output, the CNN layer extracts the word feature vector from text sequences, and the LSTM repetitively selects or discards feature sequences based on their context. Those advantages are useful for different domain datasets, where one architecture might be better than the others, but the combination can make the performance consistently good. The experiments on sentiment analysis of short text in Bahasa Indonesia show that hybrid models can obtain better performance, and that the same architecture can be directly used on another domain-specific dataset.

CONCLUSION
Deep learning is a valuable method for developing sentiment analysis on short text messages, especially for low-resource languages, because of its ability to infer and extract hidden information from a large amount of data. CNN, LSTM, and MLP architectures have shown their ability to solve classification problems in natural language processing with good results. This research confirms that a hybrid architecture (CNN+LSTM+MLP) can give an even better result, and that the architecture can be applied directly to different domain datasets. This is because the hybrid architecture can utilize the advantages of each building component.