Sentiment analysis of imbalanced Arabic data using sampling techniques and classification algorithms

ABSTRACT


INTRODUCTION
The coronavirus epidemic has led people to use social media to share their thoughts, hold discussions, and express their feelings about the pandemic. Microblogging platforms such as Twitter are popular and widely used social media in which people post short messages about daily life occurrences. One of the most important occurrences that has spread worldwide is the coronavirus disease (COVID-19): some people express their anxiety and tension about COVID-19, while others simply believe it is a rumor. With the large number of tweets regarding the epidemic, a detailed observation of how people felt during the pandemic can be made. Despite the significant increase in the number of infections and deaths, some people still believe that COVID-19 is a rumor. Twitter is a rich source of data for scholars who want to study emotions in depth [1].
Sentiment analysis (SA), also known as opinion mining, is an important field in natural language processing (NLP) that determines the polarity of the emotions expressed in a text as positive, negative, or neutral and provides an indication of the feelings in the text. It has become an indispensable tool for developing recommendation systems, monitoring brands, or assessing survey replies or evaluations, and it aids in spotting significant concerns in real time [2]. Furthermore, SA can be applied at several levels, such as document, paragraph, sentence, or aspect [3]. In addition, SA is used to build models that can predict and classify sentiments from sentences using text analysis and machine learning (ML) approaches [4].
ISSN: 2302-9285, Bulletin of Electr Eng & Inf, Vol. 13, No. 1, February 2024: 607-618
The success of the classification task is influenced by the quality of the data and the ML method. Given the availability of the requisite NLP tools for English texts, several researchers have concentrated on SA for the English language [5]. SA can take one of three approaches: ML, lexicon-based, or a mixed approach [6]. The supervised learning strategy in ML employs labeled datasets to train the model to predict sentiments, while unsupervised learning focuses on unlabeled data to identify a possible structure [7]. Support vector machines (SVM), naive Bayes (NB), and k-nearest neighbors (KNN) are among the most commonly used ML techniques in text classification.
However, class imbalance in datasets is a critical issue that affects the performance of these methods. The imbalanced dataset problem arises when classes have substantially different proportions in the dataset; the majority classes account for a large part of the data, while minority classes have a small amount of data [8]. When gathering data for SA, we may encounter data imbalance; if the nature of this data is ignored, ML algorithms will be biased toward the majority class, so that minority class samples are misclassified more often than majority class samples. To address class imbalance, many solutions have been developed, which operate at either the data level or the algorithm level.
Resampling is used in data-level approaches to balance the dataset; typical examples include over-sampling techniques, under-sampling techniques, and combinations of both. Algorithm-level strategies focus on altering the learning algorithm itself, for example through cost-sensitive learning, so that it generalizes in favor of the minority class [9]. The ensemble learning approach combines cost-sensitive learning with performance-enhancing algorithms, such as bagging, boosting, and stacking. Aside from these strategies, one of the most important concerns to address when dealing with data imbalance is the choice of metrics used to evaluate the model.
In the classification process for a highly imbalanced dataset, the use of some measures such as accuracy can be misleading because a classifier that always predicts the majority class, without performing any analysis, can still obtain a high accuracy score that does not reflect its usefulness. Despite being the world's sixth most spoken language [10], Arabic has not received as much attention as English because of its dialectal variety and complex structure and the limited availability of annotated Arabic datasets. The Arabic language has three categories: classical or Quranic Arabic; modern standard Arabic, which is the formal written and spoken Arabic taught in the Arab world; and dialectal or colloquial Arabic, which is informal and does not follow any grammatical rules [3].
Numerous studies on Arabic SA have been conducted on balanced datasets, while the imbalanced learning of Arabic sentiment has received little attention. To the best of our knowledge, SMOTE is the most common sampling technique used to balance Arabic datasets, and when using the ensemble approach, the random forest (RF) classifier is predominantly used in most research. In addition, most of the existing research was conducted on small imbalanced datasets.
To address this research gap, we performed our analysis on an imbalanced Arabic dataset that has 15,779 samples, using various sampling and ensemble techniques to address the imbalance problem. This study aims to improve the imbalanced learning of Arabic sentiments. Experiments were conducted at several levels, and resampled data was used to train different single and ensemble classifiers to study the effects of these techniques and find an optimized classifier that can distinguish the largest number of negative Arabic tweets while maintaining the performance of the model in terms of positive tweets. The contributions of this study are summarized as follows: i) the methods used for SA with imbalanced Arabic datasets were compared and ii) the impact of using sampling techniques to train single and ensemble classifiers and find the best classifier using different evaluation metrics was analyzed.
The rest of the paper is organized as follows: section 2 discusses and summarizes the research papers from the literature related to imbalanced Arabic learning. Section 3 introduces the proposed method and the approaches adopted for the current classification task. In section 4, the conducted experiments are described and the results are discussed. Finally, the conclusion is provided in section 5.

LITERATURE REVIEW
Most of the existing Arabic SA research was conducted on balanced datasets. Furthermore, classification algorithms can perform better on balanced datasets than on imbalanced datasets, so re-sampling techniques, including under-sampling and over-sampling, have been adopted to balance datasets [11]. Mountassir et al. [12] investigated the impact of four under-sampling techniques, namely, remove similar, remove farthest, remove by clustering, and random remove. They conducted experiments on two Arabic datasets, among others. Al-Azani and El-Alfy [13] conducted three studies to address the imbalanced dataset problem. In 2017, they studied the impact of the SMOTE over-sampling technique on an imbalanced dataset of tweets in dialectal Arabic, and the experiments were evaluated on basic and ensemble classifiers using the accuracy, F1, precision, and recall evaluation metrics. The results show that using SMOTE with ensemble classifiers increased the performance by 15% compared with the baseline experiments. In 2018, Al-Azani and El-Alfy [13] studied the impact of the bootstrap aggregating algorithm with SMOTE on a concatenated version of imbalanced Arabic dialectal Twitter datasets, namely, Syrian tweets [14], Arabic sentiment tweets dataset (ASTD) [15], ArTwitter [16], tweet corpus for subjectivity and sentiment analysis (SSA) [17], and Semeval-2017 [18]. NB, KNN, and decision tree (DT) classifiers were evaluated in terms of F1, Matthews correlation coefficient (MCC), geometric mean (GM), and area under the receiver operator characteristic curve (AUC) values with different imbalance ratios. The experimental results show that balanced bagging classifiers produced the best results [19].
In 2020, El-Alfy and Al-Azani [20] investigated the performance of nine ML classifiers on highly imbalanced Arabic tweet datasets using neural word embedding and over-sampling techniques. The performance is discussed in terms of various measures, such as AUC, GM, and F1. The results reveal that the stochastic gradient descent (SGD) classifier with over-sampling exhibits the best performance, achieving the highest GM value.
Al-Sorori et al. [21] analyzed the impact of using synthetic minority over-sampling technique and edited nearest neighbors (SMOTENN) to balance an Arabic dataset collected from Twitter. They used Word2Vec word embedding with various single and ensemble ML classifiers. Their experiments showed that using SMOTENN improves the F1 score for both single and ensemble classifiers, with the best result, obtained by nuSVM, being an average F1 score of 99.07. Khalifa and Elnagar [22] focused on studying the performance of their Twitter dataset in its imbalanced and balanced versions using term frequency-inverse document frequency (TF-IDF) and word embeddings. They conducted a comparative evaluation of the gradient boosting, logistic regression (LR), nearest centroid, DT, multinomial NB, SVM, XGBoost (XGB), RF, and AdaBoost classifiers and investigated the performance of the multilayer perceptron (MLP) and convolutional neural network deep learning classifiers. The LR classifier using TF-IDF produced the best F1 result of 87.71% on imbalanced training data.
Addi and Ezzahir [23] conducted a study to address the imbalance problem in the hotel Arabic reviews dataset [24] using various under-sampling and over-sampling techniques on SVM, NB, and RF classifiers. They evaluated their results using accuracy and F1 metrics and concluded that under-sampling techniques, namely, edited nearest neighbors (ENN), the repeated ENN rule, Tomek links, and the neighborhood cleaning rule, showed the best results among the sampling techniques. Recently, Al-Hashedi et al. [25] used the COVID-19 Arabic tweets dataset to investigate the effect of the Word2Vec word embedding and SMOTE over-sampling techniques on several single and ensemble ML classifiers. Their experiments showed that ensemble classifiers and SMOTE outperform base classifiers without SMOTE in terms of F1 score.
In the work of Obiedat et al. [26], different versions of an imbalanced dataset about restaurant reviews collected from the Jeeran website were examined using different over-sampling techniques, such as SMOTE, SVM-SMOTE, adaptive synthetic sampling (ADASYN), and borderline-SMOTE (BSMOTE). The authors proposed an approach that combines particle swarm optimization (PSO) and SVM and compared the results of this hybrid approach (PSO-SVM) with different single and ensemble classifiers such as SVM, LR, RF, DT, KNN, and XGBoost. They reported that the proposed PSO-SVM approach is superior to the aforementioned classifiers. The best result was obtained from version 3 of their dataset using BSMOTE with a GM value of 0.81.
In this paper, we propose the use of several single and ensemble classifiers that identify the majority of negative Arabic tweets while maintaining the model's performance in terms of positive tweets to improve the imbalanced learning of Arabic sentiments. Ridge classifiers, LR, SGD, SVM, DT, KNN, and Gaussian NB are used as single classifiers, while RF and AdaBoost are employed as ensemble classifiers. Table 1 presents a comparison between the previously mentioned studies on the problem of imbalanced Arabic datasets.
The comparison highlights the most important points in these studies in terms of the dataset used, the imbalance ratio, and the resampling methods. The imbalance ratio (IR) is the ratio of the number of samples in the majority class to that in the minority class [27]. IR indicates the extent of imbalance in the dataset; that is, the higher the IR, the more severe the imbalance. In Table 2, we present a comparison between these studies in terms of classifiers, year of publication, evaluation metrics, and their best results.

PROPOSED METHOD
The aim of this research is to build a robust model for predicting Arabic sentiment for COVID-19 tweets. Some people believe that COVID-19 is a real and dangerous virus, while others believe that it is just a rumor. The steps of our research methodology are provided in Figure 1, and the details of these steps are presented in the following subsections. As shown in Figure 1, the first step is collecting Arabic COVID-19 related tweets using a Twitter API based on pre-specified search terms. These tweets were then saved in a CSV file. Next, the collected data was cleaned and manually labeled. After that, we used CountVectorizer to extract the features and to represent the input to the classifier. In the next step, the dataset was divided into training and testing sets, with 90% for the training set and 10% for the testing set, as illustrated in Table 3. The classification accuracy metric and other metrics, such as recall, precision, F1-score, GM, and AUC, were used for evaluation. In addition, a confusion matrix was built to provide an overview of the data mislabeled by the classifier and to obtain a better view of how well our model performs.
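The 90/10 division described above can be illustrated with a minimal sketch. It assumes a shuffled, non-stratified split with a fixed seed; the text does not state whether shuffling or stratification was used, and `split_train_test` is a hypothetical helper, not the authors' code.

```python
import random


def split_train_test(items, test_fraction=0.1, seed=42):
    """Shuffle a copy of `items` and split it into (train, test) lists."""
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Stand-in for the 15,779 labeled tweets: integer ids instead of text.
tweets = list(range(15779))
train, test = split_train_test(tweets)
print(len(train), len(test))
```

Any resampling would then be applied to `train` only, so that `test` keeps the original class distribution.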

Data collection
Arabic tweets written in standard or dialectal Arabic were collected using the Tweepy Python library and a Twitter API [28]. Search queries were determined according to the most frequently used words about COVID-19 among people on social media platforms. The collected data was saved in a CSV file for the preprocessing phase. Of the 75,794 tweets collected, we finally obtained 15,779 Arabic tweets related to COVID-19. The number of tweets decreased because duplicated tweets (i.e., retweets) were excluded, along with tweets that do not express feelings, such as news and decisions.

Data preprocessing
The gathered tweets were not clean: they contained noise, such as stop words and special characters, which required preprocessing to prepare the data for the classification process. First, duplicate tweets were removed. Then, all non-Arabic words, letters, URLs, names, hashtags, numbers, diacritics, and special characters were removed using Python regular expressions. Some tweets contained words with repeating letters for emphasis; these words were restored to their correct form by removing the duplicate letters [29]. To improve the accuracy of the predictive model, normalization was used to unify analogous letters [30]. In the preprocessing phase, we did not apply any stemming because most of the collected tweets were written in dialectal Arabic, which complicates the stemming process.
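The cleaning steps above can be sketched with Python's standard `re` module. This is an illustrative reconstruction, not the authors' script: the exact patterns, the rule of collapsing runs of three or more identical characters, and the specific letter normalizations (alef variants, alef maqsura, ta marbuta) are assumptions.

```python
import re

URLS = re.compile(r"https?://\S+|www\.\S+")
MENTIONS_HASHTAGS = re.compile(r"[@#]\S+")
# Arabic diacritics (U+064B to U+0652) and the tatweel elongation mark (U+0640).
DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")
# Anything that is not an Arabic letter (U+0621 to U+064A) or whitespace.
NON_ARABIC = re.compile(r"[^\u0621-\u064A\s]")
# The same character repeated three or more times in a row.
REPEATS = re.compile(r"(.)\1{2,}")


def clean_tweet(text: str) -> str:
    """Return `text` with noise removed and analogous letters unified."""
    text = URLS.sub(" ", text)
    text = MENTIONS_HASHTAGS.sub(" ", text)
    text = DIACRITICS.sub("", text)
    text = NON_ARABIC.sub(" ", text)
    text = REPEATS.sub(r"\1", text)
    # Assumed normalization: alef variants -> bare alef,
    # alef maqsura -> ya, ta marbuta -> ha.
    text = re.sub("[\u0623\u0625\u0622]", "\u0627", text)
    text = text.replace("\u0649", "\u064A").replace("\u0629", "\u0647")
    return " ".join(text.split())


print(clean_tweet("كوروناااا كذبة!!! https://t.co/x @user #covid 123"))
```

The order matters in this sketch: diacritics are deleted before the non-Arabic filter runs, so they do not leave spaces inside words.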

Data annotation
Given the complexity of the morphology and the diversity of the Arabic dialects, we manually annotated the dataset. Three sentiment labels were given to the tweets, namely, positive, negative, or neutral. A negative sentiment is given to tweets that reflect people's views of COVID-19 as a lie or rumor; a positive sentiment is given to tweets that reflect people's beliefs about the existence of COVID-19 and the necessary procedures to protect themselves; and a neutral label is given to tweets that do not carry any kind of emotion, such as news, facts, and decisions. Tweets that do not express feelings were excluded during data collection, so the final dataset contains 12,176 positive tweets out of the 15,779 tweets, and the remaining 3,603 tweets were labeled as negative. Figure 2 shows the distribution of the dataset in terms of positive and negative tweets, providing evidence of an imbalanced dataset.
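As a worked illustration, the imbalance ratio (majority count divided by minority count) can be computed from the class counts reported above. The helper name `imbalance_ratio` is hypothetical.

```python
from collections import Counter


def imbalance_ratio(labels):
    """Number of majority-class samples divided by minority-class samples."""
    counts = Counter(labels)
    return max(counts.values()) / min(counts.values())


# Class counts reported for the annotated dataset.
labels = ["positive"] * 12176 + ["negative"] * 3603
ir = imbalance_ratio(labels)
print(f"IR = {ir:.2f}")  # IR = 3.38
```

In other words, there are roughly 3.4 positive tweets for every negative tweet.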

Data representation
ML algorithms cannot directly deal with texts, so the text must be tokenized and then encoded into a numerical representation that can be processed by ML algorithms. In this work, we used the CountVectorizer technique, which converts text into word count vectors, where each unique word has a unique dimension. The resulting sparse encoded vectors are transformed back into arrays that contain the occurrences of each word [31].
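To make the word-count representation concrete, the following pure-Python sketch mimics the idea behind CountVectorizer on a toy corpus. It is a simplified stand-in, not the library itself: it splits on whitespace only, whereas the real vectorizer applies its own tokenization rules, and the function names are hypothetical.

```python
from collections import Counter


def fit_vocabulary(docs):
    """Map each unique token to a column index (alphabetical order)."""
    tokens = sorted({tok for doc in docs for tok in doc.split()})
    return {tok: i for i, tok in enumerate(tokens)}


def transform(docs, vocab):
    """Turn each document into a dense list of per-token counts."""
    rows = []
    for doc in docs:
        counts = Counter(doc.split())
        # Tokens that are not in the vocabulary are silently ignored.
        rows.append([counts.get(tok, 0) for tok in vocab])
    return rows


docs = ["a b a", "b c"]
vocab = fit_vocabulary(docs)
matrix = transform(docs, vocab)
print(vocab)   # {'a': 0, 'b': 1, 'c': 2}
print(matrix)  # [[2, 1, 0], [0, 1, 1]]
```

With one column per unique word, a corpus of short tweets yields very wide, mostly zero vectors, which is why such encodings are usually stored in sparse form.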

Model implementation
The model implementation process started by reading the dataset from the CSV file into a pandas data frame and extracting features using CountVectorizer to obtain 37,501 input features. To run the experiments, we utilized seven single classifiers, namely, the ridge classifier, LR, SGD, SVM, DT, KNN, and Gaussian NB (G-NB). We also used two ensemble classifiers, namely, RF and AdaBoost. RF, which is a collection of DTs, is one of the most popular bagging techniques, and it provides better predictive performance than a single DT classifier because it reduces the variance and the chance of overfitting [32], while AdaBoost, which belongs to the boosting algorithms, has received great attention in classification problems.
In this work, we used AdaBoost (AdaBST) with a DT classifier as a weak learner for training. AdaBST relies on weighted training samples, iteratively increasing the weights of incorrectly classified samples and reducing the weights of correctly classified samples to reduce the total error and ensure the accurate prediction of incorrectly classified samples [33]. We conducted three experiments. The first experiment studied the performance of the ML classifiers without applying any resampling techniques, using GridSearchCV with 10-fold cross-validation to automatically select the optimal parameters for each classifier. The second experiment used several over-sampling techniques, including random over-sampling (RO), SMOTE, borderline-SMOTE (BSMOTE), and ADASYN. The last experiment is similar to the second one, but it was conducted with under-sampling techniques, including condensed nearest neighbors (CNN), one-sided selection (OSS), and random under-sampling (RU). In the RO technique, random samples are added to the minority class in the training dataset until it matches the size of the majority class. This is done by duplicating randomly chosen minority class samples [34]. The technique is simple and fast but does not use any heuristics. Furthermore, no information is lost, but the possibility of overfitting may increase.
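Random over-sampling as described above can be sketched in a few lines. This is an illustrative pure-Python version under the assumption that every class is grown to the size of the largest one; `random_over_sample` is a hypothetical helper and not the implementation used in the experiments.

```python
import random


def random_over_sample(X, y, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest."""
    rng = random.Random(seed)
    by_class = {}
    for features, label in zip(X, y):
        by_class.setdefault(label, []).append(features)
    n_max = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        # Draw with replacement from the existing rows of this class.
        extra = [rng.choice(rows) for _ in range(n_max - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * n_max)
    return X_out, y_out


X_train = [[1], [2], [3], [4], [5]]
y_train = ["pos", "pos", "pos", "pos", "neg"]
X_bal, y_bal = random_over_sample(X_train, y_train)
print(y_bal.count("pos"), y_bal.count("neg"))  # 4 4
```

Because the added rows are exact copies, the classifier sees no new information, which is the source of the overfitting risk noted above.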
The SMOTE technique increases the number of samples in the minority class of the imbalanced dataset by finding the k nearest neighbors of randomly chosen minority samples and adding synthetic instances on the basis of similarities in the feature space. SMOTE should only be applied to the training data to avoid creating synthetic samples that might appear in the testing data, which could produce misleading results.
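The interpolation idea behind SMOTE can be sketched as follows. This is a simplified pure-Python illustration on dense vectors with squared Euclidean distance, not the library implementation used in the experiments; the function name, the toy data, and the majority count are assumptions. The imbalanced-learn library provides a `SMOTE` class for real use.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create `n_new` synthetic points by interpolating between minority samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by squared distance to `base`.
        others = [p for j, p in enumerate(minority) if j != i]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])  # one of the k nearest neighbours
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append([a + gap * (b - a) for a, b in zip(base, neighbour)])
    return synthetic


# Training-set minority samples only; the test set is never resampled.
train_minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
n_majority = 10  # hypothetical size of the majority class in the training set
new_points = smote(train_minority, n_new=n_majority - len(train_minority), k=2)
print(len(train_minority) + len(new_points))  # 10
```

Each synthetic point lies on a line segment between two existing minority samples, so the minority region is filled in rather than merely duplicated.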
BSMOTE is an extended version of SMOTE. In this technique, borderline minority class points that are near the decision surface are used to add new samples instead of using normal minority points that are far from the borderline [35]. ADASYN uses data points of a minority class that have many neighbors from the majority class. These points are called "hard to learn" data points, which are used to generate new synthetic samples using a probability density function [11]. On the other hand, CNN is one of the condensation methods [36] that condense the original dataset by looking for a minimal consistent subset that does not result in performance degradation [37], while OSS is a modified version of CNN introduced by Kubat and Matwin [38], which combines the CNN rule and Tomek links. This technique creates a new balanced dataset that includes all minority class samples, removing noise, borderline, and redundant samples from the majority class and retaining the normal majority class samples. RU randomly eliminates samples from the majority class to balance the training dataset distribution. This technique is similar to the RO technique in simplicity and speed and also does not use heuristics. However, RU may lead to the loss of information that is valuable for fitting the model.
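To illustrate how borderline or noisy majority samples can be identified, the sketch below detects Tomek links, using the standard definition of a pair of samples from different classes that are each other's nearest neighbor, and drops the majority member of each pair. It covers only this one component of OSS, not the CNN step; the function names and toy data are assumptions.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_neighbour(i, X):
    """Index of the sample closest to X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: squared_distance(X[i], X[j]))


def tomek_links(X, y):
    """Pairs (i, j) of mutual nearest neighbours that carry different labels."""
    nn = [nearest_neighbour(i, X) for i in range(len(X))]
    return [(i, j) for i, j in enumerate(nn)
            if i < j and nn[j] == i and y[i] != y[j]]


def drop_majority_in_links(X, y, majority_label):
    """Remove the majority-class member of every Tomek link."""
    to_drop = {k for pair in tomek_links(X, y) for k in pair
               if y[k] == majority_label}
    kept = [k for k in range(len(X)) if k not in to_drop]
    return [X[k] for k in kept], [y[k] for k in kept]


X_toy = [[0.0], [1.0], [1.2], [5.0], [5.1]]
y_toy = ["neg", "neg", "pos", "pos", "pos"]
print(tomek_links(X_toy, y_toy))  # [(1, 2)]
X_clean, y_clean = drop_majority_in_links(X_toy, y_toy, majority_label="pos")
print(y_clean)  # ['neg', 'neg', 'pos', 'pos']
```

Here the positive sample at 1.2 sits right next to a negative sample, so it is treated as borderline and removed, while the positive samples far from the boundary are kept.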

Evaluation metrics
Evaluating the performance of ML models on imbalanced datasets using accuracy alone is insufficient to judge the quality of the model because of the accuracy paradox. In this work, the quality of the classifier output was evaluated by several metrics, namely, precision (prec.), recall (rec.), F1-score, GM, and AUC, to obtain a reliable, unbiased evaluation. Recall is the proportion of actual positives that the ML model predicts correctly. Precision is the proportion of the samples predicted as positive that are actually positive. GM and F1-score are combined metrics: GM measures the accuracy on the positive and negative class samples [38], while F1 is the harmonic mean of precision and recall. The receiver operator characteristic (ROC) curve is an easy-to-interpret graph that visualizes the performance of a binary classifier; the ROC curve is summarized in one value called AUC, which is the area under the ROC curve.
The AUC measures the classifier's ability to distinguish between positive and negative classes. We can use the AUC to compare classifier models: a good model has an AUC value near 1, which indicates that the model can predict negative and positive labels correctly, while poor models have low AUC values [39]; an AUC of 0.5 corresponds to random guessing. To visualize the performance of the classifier, we also plotted a confusion matrix that shows the correctly classified and misclassified tweets for both positive and negative classes [40].
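The metrics discussed in this subsection can be computed directly from confusion-matrix counts and classifier scores. The sketch below assumes the common definitions: GM as the square root of recall times specificity, and AUC as the probability that a randomly chosen positive sample is scored above a randomly chosen negative one. The function names are hypothetical. The example applies a classifier that labels every tweet positive to the class counts of the full dataset, which illustrates the accuracy paradox.

```python
from math import sqrt


def binary_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall, specificity, F1, and GM from raw counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "gm": sqrt(recall * specificity),
    }


def roc_auc(scores, labels, positive=1):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    # Ties between a positive and a negative score count as half a win.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Always predicting "positive" on 12,176 positive and 3,603 negative tweets.
baseline = binary_metrics(tp=12176, fp=3603, fn=0, tn=0)
print(round(baseline["accuracy"], 3), baseline["gm"])
print(roc_auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
```

The trivial classifier reaches about 77% accuracy yet has a GM of zero because it never identifies a negative tweet, which is why accuracy alone is not reported.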

RESULTS AND DISCUSSION
The main goal of the experiments in this work is to balance the dataset using various sampling approaches and then apply and compare several classifiers to determine the best-performing classifier that can identify as many negative Arabic tweets as possible while maintaining the model's performance with respect to positive tweets. The positive tweets in the dataset are referred to as the majority class and the negative tweets as the minority class. Initially, in the baseline experiment, the entire dataset was passed to the classifier after splitting it into the training and testing sets. Grid search was applied to perform hyperparameter optimization for the learning algorithms. A dictionary of parameters for the grid search was defined to find the best combinations, which were selected by 10-fold cross-validation over the parameter grid. Two functions were applied: the "fit" function was applied to train the classifier with optimal parameters, and the "predict" function was applied to test the classifier.
In Table 4, we highlight the best results in this experiment; the results indicate that ensemble classifiers outperform single classifiers. LR, SVM, SGD, and DT exhibited comparable performance; DT was the best among them, achieving the highest F1 score of 0.84, while the GNB and ridge classifiers obtained the lowest values in all evaluation metrics. The confusion matrices for the best single (DT) and ensemble (RF) classifiers in Figures 3(a) and (b) reflect the class imbalance through the poor F1, recall, and precision scores for negative tweets (minority class) compared with positive tweets (majority class). Figures 3(a) and (b) show that these models were confused when they made predictions. Figure 4 shows the AUC values for all the classifiers. RF was the best classification model in this experiment, achieving the highest AUC value. The results show that resampling techniques are needed to balance the dataset so that the classifiers can be evaluated properly. The next experiment was conducted using over-sampling methods to balance the dataset and GridSearchCV to fine-tune the parameters. In this experiment, we applied RO, SMOTE, BSMOTE, and ADASYN for each classifier.
In Table 5, the performance results show an improvement after over-sampling the training set, and RF exhibits superior performance for all metrics compared with other classification models. Table 5 shows that the ridge classifier using SMOTE achieves a high F1 score of 0.98 compared with other single classifiers, while KNN and AdaBST were the worst classifiers. The results also reveal that RO has a lower performance than SMOTE, BSMOTE, and ADASYN in terms of F1 score. In the last experiment, we used the same strategy as the second one but with under-sampling techniques to study the impact of applying under-sampling on the dataset. The results of applying under-sampling are illustrated in Table 6. The results show that using random under-sampling and condensed nearest neighbors (CNN) produces high precision values and low recall values. Therefore, we have a picky classifier that did not predict many tweets as positive (i.e., people believe in COVID-19) and mispredicted many actual positive tweets. In addition, the results show that the behavior of classifiers using OSS is better than that using RU and CNN in terms of F1, accuracy, GM, and AUC.
We studied AdaBST performance using grid search for parameter optimization and produced manual visualizations to measure the effect of the number of base estimators (n_estimators) and learning rate hyperparameters, as shown in Figure 7. Figure 8 shows a comparison between all the classifiers with all the sampling techniques in terms of F1 score. The results show that over-sampling outperforms the random and CNN under-sampling techniques. Ridge, LR, SVM, SGD, RF, and AdaBST perform well when applied with any of the under-sampling approaches, as shown in Figure 5. The best classifier performance was exhibited by the DT using OSS, with a value of 0.99 for both AUC and GM, while CNN decreased the performance of the DT. Interestingly, AdaBST performed well without using any sampling technique, as shown in Figure 6. The results of our experiments show the benefits of balancing the training dataset before applying the classifier. However, over-sampling techniques outperform under-sampling techniques because the latter depend on eliminating samples, which may lead to the exclusion of important features and thus negatively affect the performance of the classifier models. Considering all the criteria, we found that the best model is RF with SMOTE, BSMOTE, and ADASYN over-sampling, which was able to fully distinguish positive and negative labels and achieve the goal.

CONCLUSION
Although numerous studies on Arabic SA using ML algorithms have been conducted, most of them deal with balanced datasets. In the context of imbalanced classification, most studies used small datasets. This paper provides an overview of the impact of using a data-level sampling approach within a classification task before training single and ensemble classifiers. These methods transform an imbalanced dataset into a balanced dataset. The results indicate that the models performed poorly on the imbalanced dataset, while a balanced dataset tends to increase the classification accuracy.
Our experiments were conducted on single, bagging-based, and boosting-based ensemble classifiers. In addition, we focused on how resampling techniques specifically affect the performance of both single and ensemble classifiers. The experiments revealed that over-sampling and under-sampling provide good results for various classifiers when evaluated using different metrics, such as F1, accuracy, and AUC, compared with the poor performance using an imbalanced dataset.
The over-sampling approaches (SMOTE, BSMOTE, ADASYN) produced superior results, while the OSS under-sampling approach is the best among the under-sampling approaches. However, over-sampling approaches outperform under-sampling approaches because no data is lost and a considerable feature set is provided in over-sampled data compared with the under-sampled data, resulting in the enhanced performance of the classifiers. The RF ensemble classifier using SMOTE, BSMOTE, or ADASYN over-sampled data exhibits good efficiency with an F1 value of 0.99.
Surprisingly, the performance of the AdaBST classifier was ambiguous. AdaBST produced better results on the original dataset with hyperparameter tuning than those using sampling approaches. We argue that the high dimensionality of the data space, the existence of noise, and the base learners influenced AdaBST's performance. In future work, the insufficient information on the performance of AdaBST can be addressed through further investigation.

Figure 2. Dataset distribution
Sentiment analysis of imbalanced Arabic data using sampling techniques and … (Maisa J. Al-Khazaleh)

Figure 3. The performance results of the baseline experiment: (a) decision tree classification report and confusion matrix and (b) random forest classification report and confusion matrix

Figure 4. AUC values in baseline experiment
Figure 5. Geometric mean of classifiers using undersampled data
Figure 6.
Figure 7.
Figure 8.

Table 2. Comparison of classifiers and results of related work

Table 3. Division of the dataset into training and testing sets

Table 4. Results of experiment 1: baseline performance of classifiers on the original imbalanced dataset with test size 0.1

Table 5. Results of experiment 2: using over-sampling techniques

Table 6. Results of experiment 3: using under-sampling techniques