Hoax analyzer for Indonesian news using RNNs with fastText and GloVe embeddings

Received Oct 21, 2020; Revised Dec 22, 2020; Accepted Mar 27, 2021

Misinformation has become a seemingly innocuous yet potentially harmful problem ever since the development of the internet. Numerous efforts have been made to prevent the consumption of misinformation, including the use of artificial intelligence (AI), mainly natural language processing (NLP). Unfortunately, most NLP work targets English, since English is a high-resource language. In contrast, Indonesian is considered a low-resource language, so the effort devoted to reducing the consumption of misinformation is small compared with English-based NLP. This experiment compares fastText and GloVe embeddings for four deep neural network (DNN) models: long short-term memory (LSTM), bidirectional long short-term memory (BI-LSTM), gated recurrent unit (GRU), and bidirectional gated recurrent unit (BI-GRU), in terms of metric scores when classifying news into three classes: fake, valid, and satire. The results show that fastText embedding is better than GloVe embedding in supervised text classification, with BI-GRU + fastText yielding the best result.


INTRODUCTION
Ever since the growth and development of the internet, digital information has been consumed daily regardless of its validity. The presence of social media such as Facebook, TikTok, and Instagram has a great impact on how digital information is created and consumed. These platforms give their users the freedom to create, access, and process information, which makes the consumption of misinformation very likely. Misinformation is incorrect information spread accidentally or intentionally, and it has been an issue since the 16th century [1]. The issue has become more problematic and prominent since the 2016 U.S. presidential election: 73 English-language publications were deemed fake in 2016, a number that increased to 2210 by January 2017 [2].
Misinformation or fake news can be completely made up and manipulated in order to gain attention, can be designed to mislead readers, and can be purposely false. The objective of fake news is to gain political or financial benefit, commonly through exaggerated or unusual headlines that attract readers [3]. In Indonesia, political fake news increased by 61% between December 2018 and 17th April 2019, the date of the presidential election [4]. Aside from political fake news, another example of fake news in Indonesia is a false report of earthquake aftershocks, which directly affected traumatized victims of an earthquake and tsunami that had just occurred [5]. One study shows that fake news and valid news can be distinguished based on numerous aspects, most notably the title of the news. Fake news titles use significantly fewer stop words and nouns but more proper nouns and verb phrases. In comparison, valid news persuades readers using backed-up arguments and citations, whereas fake news gains trust through heuristics [6]. Moreover, since English is a high-resource language, it can be hypothesized that most news is conveyed in English. For a natural language processing (NLP) task to perform well, the training corpus must be in the same language as the input text. Therefore, as Indonesian is considered a low-resource language, there are fewer hoax analyzers for Indonesian news than for English news. In this experiment, we attempt to reduce fake news consumption by sorting out the fake news from a pool of unfiltered news in Indonesian. The news is sorted by validity, with fake news having two subclasses, hoax news and satire news, resulting in a total of three classes: valid news, hoax news, and satire news. The sorting is performed as supervised text classification using a deep learning approach, the recurrent neural network (RNN).
We use four RNN models: long short-term memory (LSTM), gated recurrent unit (GRU), bidirectional LSTM, and bidirectional GRU. All four models use an embedding layer as the input layer, with fastText and global vectors (GloVe) as word embeddings. The objective of this experiment is to evaluate different RNN models with different embeddings as the input layer for a low-resource language, namely Indonesian, and to gain new insights from the comparison.

RESEARCH METHOD

Related works
One study analyzed the language used in news media for fake news detection and political fact-checking; the researchers compared the language of real news with that of satire, hoaxes, and propaganda to identify the linguistic characteristics of unreliable news. In the experiment, the English Gigaword corpus was used as the dataset for the reliable news label, and seven sources were used as the dataset for satire, hoax, and propaganda. The characteristics of each news type were differentiated using lexical resources from previous work in communication theory and stylistic analysis in computational linguistics. Next, an LSTM model was run alongside maximum entropy (MaxEnt) and naïve Bayes classifiers to train on and predict the reliability and validity of the Politifact dataset. The results showed that LSTM outperformed the other models when using text alone as input, while MaxEnt and naïve Bayes improved when linguistic inquiry and word count (LIWC) features were added. The same treatment applied to LSTM yielded lower performance [7].
Another work on validity checking introduced the LIAR dataset. LIAR is a dataset created specifically for fake news detection; it contains 12,836 short statements labelled according to truthfulness, subject, context, speaker, state, party, and prior history. The truthfulness label has six values: pants-fire, false, barely-true, half-true, mostly-true, and true. The LIAR dataset was used to evaluate the performance of a hybrid convolutional neural network (CNN) model in automatic fake news detection, along with a support vector machine (SVM), a logistic regression classifier, a bidirectional LSTM, and a vanilla CNN. The proposed hybrid CNN model combines text with meta-data: subjects, speakers, jobs, states, parties, contexts, and histories. The results showed that text combined with speaker as the only meta-data performed better in validation, while text combined with all of the aforementioned meta-data performed best in testing. Among the models that used text only, the vanilla CNN performed best in both testing and validation. The LIAR dataset is also suitable for stance classification, argument mining, topic modelling, rumour detection, and political NLP research [8].
A satire-focused study aimed to build a satirical news classifier. Since satirical news is relatively ambiguous and can be read as both humour and criticism with an unknown objective [9], the study compared satirical news with its truthful equivalent across 12 contemporary news topics from four domains. An SVM-based algorithm was used with five features: absurdity, humour, grammar, negative affect, and punctuation. Testing on 360 news articles yielded 90% precision and 84% recall when classifying between satirical news and its truthful equivalent [10].
A novel method for automatic fake news detection was also proposed. The idea is to incorporate the speaker's profile as an additional feature in an attention-based LSTM model. The speaker's profile includes party affiliation, speaker title, location, and credit history, and is supplied to the LSTM model as additional input data. This method outperformed the state-of-the-art technique's accuracy by 14.5% on a benchmark fake news detection dataset [11].
A framework called hierarchical discourse-level structure for fake news detection (HDSF) was also introduced to better understand fake news. Incorporating the hierarchical discourse-level structure of fake and valid news is an important step toward that understanding. HDSF operates in a data-driven and automated manner, learning and extracting discourse-level structures of fake news and valid news. A strength of HDSF is its ability to operate without an annotated corpus, given that the structural difference between fake news and valid news is recognizable [12].
Another framework, multi-source multi-class fake news detection (MMFD), was proposed to measure a spectrum of "fakeness" severity. Automated feature extraction, multi-source fusion, and automated detection of degrees of "fakeness" were used to create a logical and interpretable model that can effectively classify news into different levels of "fakeness" [13]. Similar papers on text classification exist, but to the best of our knowledge this paper is the first to classify news in Indonesian into three classes using LSTM and GRU with fastText and GloVe. Recent papers [14]-[17] have studied fastText word embedding, deep learning models, and traditional classifiers, but without regard to GloVe or satirical news.

Proposed method

Data preprocessing
The dataset used for this experiment can be obtained from GitHub [18]. It is a large corpus of tagged news articles, from which we retrieved 1000 articles each annotated as reliable, fake, and satire, for a total of 3000 articles. Since the dataset was in English, an online translator powered by Google Translate was used to convert it to Indonesian [19].
The dataset is then cleaned by converting all upper-case letters to lower case and removing Indonesian stop words such as di and ke, which translate to at and to in English. Punctuation, digits, and extra spaces were also removed in the data cleaning process. Data cleaning is an important step to reduce the memory used to store words in the vocabulary dictionary and to reduce potential noise in the dataset. The data pre-processing for this experiment is visualized in Figure 1. The final dataset after pre-processing contains a total of 1,036,041 words with 55,411 unique words, as shown in Figure 2. It can be inferred that fake news tends to use longer texts, while satire news has a considerably lower word count.
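The cleaning steps described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' code: the function name clean_text and the four-word stop-word list are assumptions made for the example, and a real run would need a complete Indonesian stop-word list, which the paper does not specify.

```python
import re

# Hypothetical, deliberately tiny stop-word list used only for illustration.
STOP_WORDS = {"di", "ke", "dan", "yang"}

def clean_text(text: str) -> str:
    """Lowercase, drop digits and punctuation, remove stop words, squeeze spaces."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # remove digits
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation
    tokens = [t for t in text.split() if t not in STOP_WORDS]
    return " ".join(tokens)                # split/join also collapses extra spaces

cleaned = clean_text("Presiden  pergi KE Jakarta, di tahun 2019!")
print(cleaned)  # presiden pergi jakarta tahun
```

Removing stop words after lowercasing keeps the comparison case-insensitive, so KE and ke are treated alike.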

Data training process
The training process for the RNN models uses a Keras dense layer as the classifier and output layer. The output layer uses softmax as its activation function, since softmax accepts any input values, whether negative, zero, or positive, and transforms them into values between 0 and 1 that sum to 1 and can therefore be interpreted as probabilities. As the model optimizer, we use Adam because of its ability to adjust its learning rate automatically. Categorical cross-entropy is used as the loss function since it improves robustness for multi-class classification [20].
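To make the role of the output activation and the loss concrete, the following sketch computes softmax and categorical cross-entropy for a single sample in plain Python. The logits are made-up values for the three classes (fake, valid, satire), and the helper names are illustrative rather than taken from the paper or from Keras.

```python
import math

def softmax(logits):
    """Map arbitrary real scores to probabilities that sum to 1."""
    shift = max(logits)                        # subtract the max for numerical stability
    exps = [math.exp(z - shift) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def categorical_cross_entropy(one_hot, probs, eps=1e-12):
    """Loss for one sample: -sum(y_k * log(p_k))."""
    return -sum(y * math.log(p + eps) for y, p in zip(one_hot, probs))

# Hypothetical logits for (fake, valid, satire); the true class is satire.
logits = [-1.0, 0.0, 2.0]
probs = softmax(logits)
loss = categorical_cross_entropy([0, 0, 1], probs)
print(probs, loss)
```

Negative, zero, and positive logits all map to values strictly between 0 and 1, and the loss shrinks as the probability assigned to the true class approaches 1.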
For the training process, 2800 articles were used with a random starting position. The first step is to create a local corpus for the fastText or GloVe embeddings and to assign a unique integer to each word in the dataset based on the corpus values. The weights of the connected nodes are randomized before the embedding layer is connected to the RNN layer. We use 100 units for each RNN layer, with 10 epochs and a batch size of 32. For testing, we apply 10-fold cross-validation on the same dataset to measure the models' accuracy, precision, recall, and F1-score. All four models are trained with fastText and GloVe as the word embedding layers. A detailed visualization of the training process can be seen in Figure 3. Brief explanations of the RNN models and word embedding layers used are given below:

a. Long short-term memory
LSTM is a type of RNN architecture proposed by Hochreiter and Schmidhuber [21]. LSTM works by storing values over arbitrary time intervals, which enables it to handle long-term dependencies in a sequence of events. The main reason for using LSTM is its ability to extract features from sequential input data. Although LSTM requires input data organized in timesteps, news text is suitable because each word is treated as one timestep. Both unidirectional and bidirectional LSTM (BI-LSTM) were used for this experiment. The two types have the same functionality; the only difference is that BI-LSTM is essentially two regular LSTM models, one using normal time order (from past to future) and one using reverse time order (from future to past), run simultaneously so that predictions can draw on both time orders.

b. Gated recurrent unit
Unlike LSTM, GRU is a more recent RNN model, originally proposed in 2014 [22]. GRU is implemented alongside LSTM to compare their performance in terms of model accuracy and computational efficiency. GRU has been shown to be more computationally efficient, which is made possible by its smaller number of gates. GRU has only two gates (update and reset), while LSTM has three (input, output, forget). For this experiment, we included both unidirectional and bidirectional (BI-GRU) versions of GRU. BI-GRU processes data in the same way as BI-LSTM, with two GRU models running in normal and reverse time order.

c. fastText
fastText was developed by Facebook's AI Research (FAIR) lab and is a machine learning library for efficient learning of word representations and sentence classification [23]. The fastText algorithm is based on two papers released in 2016: enriching word vectors with subword information [24] and bag of tricks for efficient text classification [25]. fastText supports 176 languages and distributes pre-trained word vectors for 157 languages [26]. fastText is an extension of the word2vec model that represents each word as a bag of character n-grams instead of learning word vectors directly.

d. GloVe
GloVe was first introduced by Pennington et al. [27] in 2014. It is an unsupervised learning algorithm used to obtain vector representations for words. In that paper, Pennington et al. showed that GloVe outperforms other models such as continuous bag of words (CBOW) on word analogy, word similarity, and named entity recognition tasks. GloVe learns word embeddings differently from word2vec. It uses a term co-occurrence matrix of size A × A, where A is the vocabulary size, and trains the word vectors to predict co-occurrence ratios. For example, the word father will have a higher cosine similarity with the word male, as both words are semantically close.
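The term co-occurrence matrix that GloVe starts from can be illustrated with a small sketch. It is a simplification under stated assumptions: the two-sentence corpus is invented, the window size of 2 is arbitrary, and every co-occurrence counts as 1, whereas the reference GloVe implementation weights pairs by their distance. The function name cooccurrence_counts is hypothetical.

```python
from collections import defaultdict

def cooccurrence_counts(sentences, window=2):
    """Count how often each ordered word pair appears within `window` positions."""
    counts = defaultdict(float)
    for tokens in sentences:
        for i, word in enumerate(tokens):
            for j in range(max(0, i - window), i):
                # Record the pair in both directions so the matrix is symmetric.
                counts[(word, tokens[j])] += 1.0
                counts[(tokens[j], word)] += 1.0
    return counts

# Invented toy corpus (already tokenized and cleaned).
corpus = [
    ["berita", "palsu", "menyebar", "cepat"],
    ["berita", "palsu", "berbahaya"],
]
counts = cooccurrence_counts(corpus, window=2)
vocab = sorted({w for sentence in corpus for w in sentence})
matrix = [[counts[(a, b)] for b in vocab] for a in vocab]  # size A x A
print(vocab)
for row in matrix:
    print(row)
```

With a vocabulary of A distinct words the matrix has A × A entries, most of them zero for a realistic corpus, which is why practical implementations store only the non-zero pairs.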

RESULTS AND DISCUSSION
Table 1 and Table 2 show that fastText embedding yields higher metric scores than its counterpart GloVe for all deep neural network (DNN) models. This is made possible by fastText's algorithm, which goes one level deeper and trains on character n-grams as well as words instead of words only. Table 1 and Table 2 also report the performance score of each model together with its standard deviation. The standard deviation measures how widely the results are dispersed around their mean.
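As a reference for how such scores can be obtained, the sketch below computes a macro-averaged F1 score for three classes and then the mean and standard deviation of per-fold scores. The labels, predictions, and fold scores are invented for illustration and are not the paper's data; using the sample standard deviation is also an assumption, since the paper does not say which variant it reports.

```python
from statistics import mean, stdev

def macro_f1(y_true, y_pred, labels):
    """Average the per-class F1 scores, giving each class equal weight."""
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return mean(scores)

labels = ["fake", "valid", "satire"]
y_true = ["fake", "fake", "valid", "valid", "satire", "satire"]   # invented
y_pred = ["fake", "valid", "valid", "valid", "satire", "satire"]  # invented
score = macro_f1(y_true, y_pred, labels)

fold_scores = [0.93, 0.95, 0.94]  # hypothetical per-fold F1 scores
fold_mean = mean(fold_scores)
fold_std = stdev(fold_scores)     # sample standard deviation (assumption)
print(round(score, 4), round(fold_mean, 4), round(fold_std, 4))
```

A small standard deviation across folds indicates that the model's score does not depend strongly on which part of the dataset is held out.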
Compared with a previous experiment [16], whose highest macro F1 score for a bidirectional LSTM model combined with fastText was 64%, our bidirectional LSTM + fastText yields a better result of 89.234% with a standard deviation of 1.255%. Another experiment [17] used fastText for text classification and produced an 84% F1 score; our bidirectional GRU + fastText has a higher F1 score of 94.298%. In [14], the best model is 'stochastic gradient descent (SGD) modified Huber', with 80% accuracy, 65% precision, 100% recall, and 80% F1 score. Bidirectional GRU + fastText has higher overall performance, although its recall is lower. Further discussion is separated into two parts, one focusing on word embedding and the other on the recurrent neural network (RNN).

Word embedding layers discussion
fastText outperformed GloVe for all models except LSTM, which is an interesting finding. LSTM with GloVe embedding yielded higher accuracy and recall than LSTM with fastText embedding. It is important to note that the performance of LSTM and bidirectional LSTM shifts only slightly between GloVe and fastText embeddings, with an average difference of only 0.550 for LSTM and 0.925 for BI-LSTM across the metrics. On the other hand, GRU and bidirectional GRU showed more notable differences in favour of fastText, averaging 4.600 for GRU and 4.125 for BI-GRU.
The difference between fastText and GloVe lies in how they approach text. GloVe treats each word in the corpus as an atomic entity and generates a vector for each word, so the word is the smallest unit it trains on. fastText, on the other hand, treats each word as a combination of character n-grams and generates its vector from the sum of the n-gram vectors. This allows it to handle out-of-vocabulary (OOV) words and to generate more accurate vectors for rare words, since the character n-grams in OOV and rare words may still be shared with words inside the corpus. On this basis, fastText embedding yields better results than GloVe when the dataset covers a broad spectrum. That fastText performed better than GloVe in this experiment does not mean it will always do so in other situations. Wang et al. [28] took six word embedding models, skip-gram negative sampling (SGNS), continuous bag of words (CBOW), GloVe, fastText, ngram2vec, and dict2vec, and evaluated them intrinsically on word similarity, word analogy, concept categorization, outlier detection, and QVEC, a tool for measuring the intrinsic quality of word vectors. They conclude that no word embedding model is consistently better than the rest across the tasks they were tested on, which include part-of-speech (POS) tagging, chunking, named-entity recognition, sentiment analysis, and neural machine translation (NMT).
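The character n-gram decomposition that lets fastText relate an unseen word to known ones can be sketched as follows. The n-gram range of 3 to 6 and the < and > boundary marks follow fastText's documented defaults; the function name char_ngrams and the two example words are illustrative assumptions. The sketch only lists shared n-grams and omits the learned vectors, as well as the whole-word token that fastText adds for in-vocabulary words.

```python
def char_ngrams(word, n_min=3, n_max=6):
    """Return the character n-grams of `word`, wrapped in < and > boundary marks."""
    wrapped = f"<{word}>"
    return [
        wrapped[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(wrapped) - n + 1)
    ]

# "berita" (news) stands in for an in-corpus word; "pemberitaan" (news
# coverage) is assumed, for illustration, to be missing from the vocabulary.
known = set(char_ngrams("berita"))
unseen = set(char_ngrams("pemberitaan"))
shared = sorted(known & unseen)
print(shared)
```

Because the unseen word shares n-grams such as ber, rita, and berita with a known word, summing the vectors of its n-grams still yields a meaningful representation, whereas a word-level model such as GloVe has no vector for it at all.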

RNN models discussion
Table 1 and Table 2 show that both BI-LSTM and BI-GRU achieved better results than LSTM and GRU regardless of the word embedding layer used. This means that the ability of bidirectional models to learn from the dataset in both forward and reverse time order helped achieve better results on all performance metrics. It is reasonable to conclude that a bidirectional RNN is better suited to supervised text classification than its unidirectional counterpart, mainly because bidirectional RNNs learn from both directions of the sequence.
The results of this experiment indicate that GRU models are more effective than LSTM models in terms of metric scores. As GRU does not have a cell like LSTM, GRU yielded better metric scores when the dataset has less frequent occurrence. In the long run, LSTM would yield better metric scores than GRU because it has more stable control over the flow of data and a cell that can store arbitrary data for longer texts. Since the dataset used for this experiment has only few similarities and shorter text lengths, GRU performs better than LSTM.
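The structural difference between the two cells also shows up in the number of trainable weights: an LSTM layer has four weight blocks (three gates plus the cell candidate), while a GRU layer has three (two gates plus the candidate). The sketch below counts them for the 100 units used in this experiment. The embedding dimension of 300 is an assumption, since the paper does not state it, and the formula follows the classic formulation with one bias vector per block; some implementations add a second bias vector to GRU, which changes its total slightly.

```python
def recurrent_params(units, input_dim, blocks):
    """Each block has an input kernel, a recurrent kernel, and one bias vector."""
    return blocks * (units * input_dim + units * units + units)

UNITS = 100       # RNN units per layer, as stated in the paper
EMBED_DIM = 300   # assumed embedding size; not given in the paper

lstm_params = recurrent_params(UNITS, EMBED_DIM, blocks=4)  # input, forget, output, candidate
gru_params = recurrent_params(UNITS, EMBED_DIM, blocks=3)   # update, reset, candidate
print(lstm_params, gru_params, gru_params / lstm_params)
```

Under these assumptions a GRU layer carries three quarters of the weights of an LSTM layer of the same width, which is consistent with the lower computational cost attributed to GRU.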

CONCLUSION
In this paper, we created neural network models for classifying fake news in the Indonesian language using fastText and GloVe as the word embedding layer. This experiment leads to the following conclusions: 1) GRU performs better than LSTM when the dataset has less frequent occurrence and is widely spread; 2) both bidirectional models, BI-LSTM and BI-GRU, yield better metric scores than their unidirectional counterparts; 3) fastText performs better than GloVe, as fastText can handle out-of-vocabulary (OOV) words and rare words better than GloVe; 4) fastText combined with bidirectional GRU yielded the highest result in this experiment, mainly because the dataset is widely spread and has shorter text lengths. These conclusions should contribute to the natural language processing (NLP) field regarding RNNs and word embedding layers. We hope that the results of our experiment can be used in future research on supervised text classification and to develop a more sophisticated fake news classifier for the Indonesian language. We also encountered a problem regarding the dataset. Since we worked with a low-resource language, satirical news is harder to find, so we used English news and translated it into Indonesian. We are aware that translating English news into Indonesian can distort the actual result of the experiment, so we suggest that future work use news originally written in Indonesian.