Enhancing Arabic offensive language detection with BERT-BiGRU model

ABSTRACT


INTRODUCTION
Web 2.0 has given rise to numerous platforms and tools that allow internet users to express their viewpoints and ideas on various topics and events. Unfortunately, some individuals misuse these platforms to propagate hate speech and offensive content, leading to adverse impacts on the mental wellbeing of the online community [1], [2]. According to a 2021 Pew Research Center survey, 41% of Americans have experienced online harassment, including offensive name-calling, intentional embarrassment, physical threats, stalking, and sexual harassment. Additionally, the Cyberbullying Research Center reports that over 30% of teenagers in the United States have endured some form of cyberbullying, including hurtful comments, spreading rumors, and threats.
Therefore, the detection of offensive language has become an active research task in natural language processing (NLP). Offensive language can be defined as text that uses abusive slurs or derogatory terms [3]. Different forms of offensive language include hate speech, aggressive content, cyberbullying, and toxic comments. Many workshops and shared tasks have been conducted to encourage research in this field from various perspectives [4]-[7].

RELATED WORK
Compared to the amount of work done in English, only a few studies have been conducted on detecting offensive language in Arabic [8], [9]. One of the earlier studies in this area was conducted by Mubarak et al. [22]. The authors built a list of 288 Arabic obscene words and another list of 127 hashtags. They then used these lists, along with additional patterns, to gather Arabic abusive tweets from the Twitter API in 2014. These tweets were classified into two categories: tweets that did not contain any obscene word from the list of seed words, and those that included at least one of the words in the list.
In Alakrot et al. [11], comments from YouTube were collected and manually labeled by three annotators as either offensive or non-offensive. The authors trained an SVM classifier with different combinations of word-level features, N-gram features, and various pre-processing techniques, achieving an F1-score of 82% when pre-processing was applied with stemming.
Mohaouchane et al. [23] sought to enhance the previous results by using Word2Vec embeddings with different neural network models, including CNN, bidirectional long short-term memory (BiLSTM), and BiLSTM with attention. The CNN model achieved the highest accuracy of 87.84% and an F1-score of 84.05%, outperforming the other models.
In 2020, a shared task targeting offensive language detection was conducted at the SemEval workshop [24]. It provided labeled datasets for many languages, including Arabic. The team of Alami et al. [25] ranked first in this competition for the Arabic language. The authors used AraBERT to encode the Arabic tweets, followed by a sigmoid layer for classification. They also examined the impact of translating the meaning of emojis on the overall performance of the proposed model. They achieved a macro F1-score of 90.17%.
Hassan et al. [26] attained the second rank by combining CNN-BiLSTM, SVM, and multilingual BERT. The SVM classifier employed character n-grams, word n-grams, and word embeddings as features, whereas the CNN-BiLSTM model learned character embeddings and additionally employed pre-trained word embeddings as input. Their system yielded a macro F1-score of 90.16%.
Wang et al. [27] ranked third for Arabic. They proposed a unified approach to detect offensive language in all languages, including Arabic. To this end, they used the XLM-R model, which was pre-trained to learn representations for all the languages together. They then fed the output of the [CLS] token from the top layer of XLM-R into a fully connected layer, using the same parameters for all languages. The proposed model achieved a macro F1-score of 89.89%.
Safaya et al. [28] attained the fourth rank for Arabic. They combined the AraBERT model with a CNN layer to handle this task. The outputs of the last four hidden layers were fed into the filters and convolution layers of the CNN. The output of the CNN was then fed into a dense layer with a sigmoid activation function for classification. They reported a macro F1-score of 89.72%.
Another shared task was conducted in 2022 [29] and was divided into three subtasks: i) identify whether a tweet is offensive or not; ii) determine whether a tweet contains hate speech or not; and iii) determine the fine-grained type of hate speech (disability, social class, race, religion, ideology, and gender).
The team of Mostafa et al. [30] ranked first in subtask A. Seven language models were examined in this paper. Moreover, an ensemble learning approach was used to further enhance the model performance. In addition, different loss functions were evaluated to address the data imbalance problem. The best results (macro F1-score = 85.2%) were achieved using a majority voting technique between three models: i) QARiB trained using Dice loss, ii) MARBERT trained using VS loss, and iii) MARBERTv2 trained using focal loss with label smoothing.
AlKhamissi and Diab [31] achieved second place by proposing a multi-task learning approach to handle all three subtasks simultaneously. They first encoded the input tweets using the fine-tuned MARBERT model, and then passed the output embedding to three task-specific classifiers. Each classifier consisted of a multi-layered feedforward neural network with layer normalization. Their method achieved a macro F1-score of 84.5% in subtask A.

METHOD

Task description
The objective of this study is to classify each text into one of two distinct classes: offensive or non-offensive. Therefore, this objective can be approached as a binary classification problem, in which the goal is to differentiate texts based on their offensive content.

Model overview
The overall architecture of the proposed model is illustrated in Figure 1. First, a BERT layer is used to generate the vector representations of the input text, followed by a BiGRU layer to further extract context and semantic features. A fully connected dense layer with a sigmoid activation function is then used to classify the text into one of the two possible classes.
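To make the pipeline concrete, the following is a minimal Keras sketch of this architecture, assuming the HuggingFace transformers interface and the publicly available MARBERTv2 checkpoint; the model identifier, layer sizes, and input handling are illustrative assumptions, not the authors' released code.

```python
# Minimal sketch of the BERT-BiGRU classifier; all names and sizes are assumptions.
import tensorflow as tf
from transformers import TFAutoModel

MAX_LEN = 128  # maximum sequence length used later in the experiments

def build_bert_bigru(model_name="UBC-NLP/MARBERTv2", gru_units=128):
    input_ids = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(MAX_LEN,), dtype=tf.int32, name="attention_mask")

    # BERT encoder: one 768-dimensional contextual vector per input token
    # (pass from_pt=True if the checkpoint only ships PyTorch weights)
    encoder = TFAutoModel.from_pretrained(model_name)
    token_vectors = encoder(input_ids, attention_mask=attention_mask).last_hidden_state

    # BiGRU reads the token vectors in both directions to build contextual features
    features = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(gru_units))(token_vectors)

    # Fully connected layer with sigmoid: probability that the text is offensive
    prediction = tf.keras.layers.Dense(1, activation="sigmoid")(features)
    return tf.keras.Model(inputs=[input_ids, attention_mask], outputs=prediction)
```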

Bidirectional encoder representations from transformers model
BERT [32] is a pre-trained language model built on the transformer architecture, which relies on attention mechanisms and consists of an encoder that reads the input text and a decoder that generates a prediction for the task. BERT uses only the encoder part to provide a language representation model. In addition, BERT makes training bidirectional by considering context from both the left and right directions across all layers.
Moreover, the pre-training process of BERT involved two unsupervised tasks: masked language modeling (MLM) and next sentence prediction (NSP). For the first task, BERT randomly masks a portion of the input tokens and subsequently attempts to predict those hidden tokens. The second task trains the model to predict whether a sentence is the next sentence in a given sequence of sentences. The BERT model has improved the results of many NLP tasks, including named entity recognition [33], [34], text classification [35], [36], and sentiment analysis [37], [38]. Figure 2 illustrates the architecture of the BERT model.

Figure 2. The architecture of the BERT model [32]

There are two types of BERT-based models for the Arabic language: monolingual models and multilingual models. The first type pre-trains the BERT architecture on Arabic content only. This content can be written in classical Arabic, modern standard Arabic (MSA), or dialectal Arabic (DA). For the second type, the BERT model is pre-trained on multilingual content, including Arabic. Table 1 describes the main Arabic BERT models that are publicly available to the research community.

In this study, the BERT model is fine-tuned to learn specific knowledge relevant to the downstream task. Additionally, we employed the final hidden state vector of the special token [CLS] as the representation of the entire input sequence. The output of the BERT model can be represented as in (1):

X = \mathrm{BERT}(S) = [x_{[CLS]}, x_1, x_2, \ldots, x_n], \quad x_i \in \mathbb{R}^{d}  (1)

where S denotes the input token sequence and the value of the dimension d is equal to 768.
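As an illustration of (1), the snippet below encodes a sentence with a pre-trained Arabic BERT model and reads the final [CLS] hidden state. The calls follow the HuggingFace transformers API; the MARBERTv2 identifier is an assumption, and any model from Table 1 could be substituted.

```python
# Illustrative only: obtaining the [CLS] representation from a pre-trained Arabic BERT.
from transformers import AutoTokenizer, TFAutoModel

model_name = "UBC-NLP/MARBERTv2"  # assumed checkpoint; see Table 1 for alternatives
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = TFAutoModel.from_pretrained(model_name)  # from_pt=True if only PyTorch weights exist

encoded = tokenizer(
    "هذا مثال لجملة عربية",  # "this is an example of an Arabic sentence"
    return_tensors="tf",
    padding="max_length",
    truncation=True,
    max_length=128,
)
outputs = encoder(**encoded)

# Final hidden state of the [CLS] token: one vector of dimension d = 768
cls_vector = outputs.last_hidden_state[:, 0, :]
print(cls_vector.shape)  # (1, 768)
```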

Bidirectional gated recurrent unit layer
GRU is a variant of the recurrent neural network (RNN) that was created to tackle the issue of long-term dependencies and the gradient vanishing problem. Its structure is simpler than that of LSTM, as it combines the input and forget gates into a single update gate and merges the hidden and cell states into a single hidden state, as depicted in Figure 3. The update gate, denoted as z_t, regulates the amount of past information that should be transmitted to the next state. On the other hand, the reset gate, denoted as r_t, controls the amount of previous information that should be disregarded. The calculation formulas are provided in (2)-(5):

z_t = \sigma(W_z x_t + U_z h_{t-1})  (2)
r_t = \sigma(W_r x_t + U_r h_{t-1})  (3)
\tilde{h}_t = \tanh(W_h x_t + U_h (r_t \odot h_{t-1}))  (4)
h_t = z_t \odot h_{t-1} + (1 - z_t) \odot \tilde{h}_t  (5)

where \sigma represents the sigmoid function and \odot denotes the element-wise product. The weight matrices W and U must be learned. Since GRU networks can only handle sequences from front to back, we employed a BiGRU layer to process the data from both directions and generate complete contextual features. The output of the hidden layer h_t at time t is the concatenation of the forward and backward states, as in (6):

h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]  (6)

The output of the BiGRU layer can be represented as in (7):

H = [h_1, h_2, \ldots, h_n]  (7)

We then use a fully connected dense layer with a sigmoid function to generate the final prediction, which classifies the input text as offensive or not offensive, as in (8):

\hat{y} = \sigma(W H + b)  (8)

where \hat{y} represents the predicted probability, W is a weight matrix that is adjusted during training, and b is a bias term.

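The following NumPy sketch traces one GRU update step through equations (2)-(5); the weight shapes and random initialization are purely illustrative placeholders.

```python
# One GRU time step following equations (2)-(5); weights are random placeholders.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, U_z, W_r, U_r, W_h, U_h):
    z_t = sigmoid(W_z @ x_t + U_z @ h_prev)              # update gate, eq. (2)
    r_t = sigmoid(W_r @ x_t + U_r @ h_prev)              # reset gate, eq. (3)
    h_tilde = np.tanh(W_h @ x_t + U_h @ (r_t * h_prev))  # candidate state, eq. (4)
    return z_t * h_prev + (1.0 - z_t) * h_tilde          # new hidden state, eq. (5)

# Toy dimensions: 768-d input (a BERT token vector) and a 128-d hidden state
d_in, d_h = 768, 128
rng = np.random.default_rng(0)
weights = [rng.standard_normal(shape) * 0.01
           for shape in ((d_h, d_in), (d_h, d_h)) * 3]
h_t = gru_step(rng.standard_normal(d_in), np.zeros(d_h), *weights)
print(h_t.shape)  # (128,)
```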
EXPERIMENTS

Dataset
The dataset used in this study was released by SemEval 2020 task 12 [24]. It comprises 10,000 tweets gathered during the period of April to May 2019 using the Twitter API and annotated manually as either offensive or non-offensive. More details about the dataset can be found in Mubarak et al. [48]. Table 2 illustrates the distribution of the data in terms of training and testing sets.
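For illustration, a typical way to load such a tweet dataset is sketched below; the file names, separator, label values, and column names are hypothetical placeholders, since the exact layout depends on the released SemEval-2020 Task 12 distribution.

```python
# Hypothetical loading sketch; paths and column names are placeholders, not the official files.
import pandas as pd

train_df = pd.read_csv("offenseval_arabic_train.tsv", sep="\t")  # placeholder path
test_df = pd.read_csv("offenseval_arabic_test.tsv", sep="\t")    # placeholder path

# Map offensive / non-offensive labels to binary targets for the sigmoid output
label_map = {"OFF": 1, "NOT": 0}                        # placeholder label names
train_df["target"] = train_df["label"].map(label_map)   # placeholder column name
print(train_df["target"].value_counts())
```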

Experimental settings
The proposed model was implemented in Python using the TensorFlow and Keras libraries. For the BERT model, we used the base version, which contains 12 transformer layers with 12 self-attention heads and a hidden size of 768. Additionally, we used a maximum sequence length of 128, a batch size of 32, and 5 training epochs.
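Under these settings, the training configuration might look like the sketch below. It reuses the build_bert_bigru sketch from the model overview, and the tokenized tensors train_inputs and train_labels are assumed to have been prepared beforehand; the Adamax optimizer and the 5e-5 learning rate anticipate the hyper-parameter analysis reported later.

```python
# Training-setup sketch with the settings reported above; data tensors are assumed.
import tensorflow as tf

model = build_bert_bigru()  # from the architecture sketch in the model overview

model.compile(
    optimizer=tf.keras.optimizers.Adamax(learning_rate=5e-5),  # best values per the later analysis
    loss="binary_crossentropy",
    metrics=["accuracy"],
)

model.fit(
    train_inputs,   # dict with "input_ids" and "attention_mask" arrays (assumed prepared)
    train_labels,   # binary offensive / non-offensive targets (assumed prepared)
    batch_size=32,
    epochs=5,
)
```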

Evaluation metrics
To compare our model with the baseline and related work models, we used the accuracy and macro F1-score metrics. The F1-score of each class is computed as in (9):

F1_i = \frac{2 \times P_i \times R_i}{P_i + R_i}, \quad P_i = \frac{TP_i}{TP_i + FP_i}, \quad R_i = \frac{TP_i}{TP_i + FN_i}  (9)

where TP_i, FP_i, and FN_i denote the number of true positives, false positives, and false negatives for class i, respectively, and P_i and R_i are the precision and recall for class i. The macro F1-score is the unweighted average of the per-class F1-scores, as in (10):

\mathrm{Macro\ F1} = \frac{1}{C} \sum_{i=1}^{C} F1_i  (10)

where C denotes the number of classes.
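These metrics can be computed directly, for example with scikit-learn; the labels below are toy values used purely for illustration.

```python
# Accuracy and macro F1 on toy binary predictions (1 = offensive, 0 = not offensive).
from sklearn.metrics import accuracy_score, f1_score

y_true = [1, 0, 0, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 0, 0]

accuracy = accuracy_score(y_true, y_pred)
macro_f1 = f1_score(y_true, y_pred, average="macro")  # unweighted mean of per-class F1
print(f"accuracy = {accuracy:.3f}, macro F1 = {macro_f1:.3f}")
```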

Baseline and related work models
The proposed model is compared with the following baseline models:
− Majority baseline [24]: the baseline model provided by the SemEval task for the Arabic dataset.
− BERT: we removed the BiGRU layer and fine-tuned the BERT model with a linear layer and a sigmoid activation function.
− BERT-BiLSTM: we replaced the BiGRU layer with a BiLSTM in the proposed model to examine its impact on the overall performance.
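A compact way to obtain these ablation baselines is to vary the layer placed on top of the BERT encoder, as in the hedged sketch below; it mirrors the earlier architecture sketch, and the checkpoint identifier and sizes remain assumptions.

```python
# Baseline variants: BERT with only a sigmoid head, or BERT followed by a BiLSTM.
import tensorflow as tf
from transformers import TFAutoModel

def build_variant(recurrent=None, model_name="UBC-NLP/MARBERTv2", units=128, max_len=128):
    input_ids = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="input_ids")
    attention_mask = tf.keras.Input(shape=(max_len,), dtype=tf.int32, name="attention_mask")
    encoder = TFAutoModel.from_pretrained(model_name)
    hidden = encoder(input_ids, attention_mask=attention_mask).last_hidden_state

    if recurrent == "bilstm":    # BERT-BiLSTM baseline
        features = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(units))(hidden)
    elif recurrent == "bigru":   # proposed BERT-BiGRU model
        features = tf.keras.layers.Bidirectional(tf.keras.layers.GRU(units))(hidden)
    else:                        # BERT baseline: [CLS] vector only
        features = hidden[:, 0, :]

    output = tf.keras.layers.Dense(1, activation="sigmoid")(features)
    return tf.keras.Model([input_ids, attention_mask], output)

bert_only = build_variant(recurrent=None)
bert_bilstm = build_variant(recurrent="bilstm")
```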

Selection of the BERT model
There are many BERT-based models that have been implemented to support research in the Arabic language, as illustrated in Table 1. Thus, we first conducted various experiments to select the best BERT model for our proposed system. The experimental results are depicted in Table 3.
It can be noticed that mBERT achieved the worst results, which can be explained by the fact that this model was pre-trained on a much smaller amount of Arabic data compared to the monolingual models. Among MARBERTv2, AraBERTv02, AraBERTv02-twitter, and Qarib, MARBERTv2 yielded the best results, likely due to its extensive pre-training dataset (refer to Table 1). Furthermore, this dataset is a combination of MSA and DA tweets, which aligns with the data evaluated in this study. Therefore, we use MARBERTv2 to implement the evaluated models in this paper.

Effect of hyper-parameters
To determine the optimal hyper-parameters for our proposed model, we conducted a sensitivity analysis by testing different configurations. We started by tuning the learning rate, which is crucial for weight control during back-propagation and affects the training time until convergence. A high initial learning rate can cause unstable learning and divergence, while a low learning rate can result in slow convergence. The results of testing different learning rates on the proposed model are illustrated in Figure 4, where a learning rate of 5e-5 produced the best performance. Higher or lower learning rates reduced performance, so we used this learning rate for all implemented models in this study.

The second hyper-parameter we optimized was the optimizer. We evaluated our model using different optimization methods, including Adam, Adamax, RMSProp, and SGD, as shown in Figure 5. The evaluation results showed that SGD performed poorly, with a macro F1-score below 46%, whereas Adam and RMSProp achieved comparable results. Meanwhile, the Adamax optimizer outperformed the other methods in terms of macro F1-score. Therefore, we used it to implement our proposed model.

Another crucial parameter in network design is the number of hidden units in the GRU layer, which significantly impacts the model's training duration and complexity. Therefore, optimizing this parameter is essential to reduce model complexity and improve its execution performance and predictive capability. The results of the sensitivity analysis for this parameter are illustrated in Figure 6: using 32 or 64 hidden units yielded comparable results, while the best value was achieved with 128 hidden units. However, when this number was increased to 256 and 512, the overall performance decreased. Thus, we set the number of hidden units in the GRU layer to 128 based on the best macro F1-score. A sketch of this sensitivity analysis is given below.
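The sketch below illustrates the hidden-unit sweep; build_bert_bigru and the training/test tensors are assumed from the earlier sketches, and evaluate_macro_f1 is a hypothetical helper standing in for the evaluation step.

```python
# Hidden-unit sensitivity sweep; data tensors and evaluate_macro_f1 are assumed/hypothetical.
import tensorflow as tf

results = {}
for units in (32, 64, 128, 256, 512):
    model = build_bert_bigru(gru_units=units)  # from the architecture sketch
    model.compile(optimizer=tf.keras.optimizers.Adamax(learning_rate=5e-5),
                  loss="binary_crossentropy", metrics=["accuracy"])
    model.fit(train_inputs, train_labels, batch_size=32, epochs=5, verbose=0)
    results[units] = evaluate_macro_f1(model, test_inputs, test_labels)  # hypothetical helper

best_units = max(results, key=results.get)
print(f"best number of hidden units: {best_units} (macro F1 = {results[best_units]:.3f})")
```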

Comparative analysis
The main experimental results are presented in Table 4. They indicate that the proposed model achieved an overall enhancement of more than 25% in terms of macro F1-score compared to the baseline model. Moreover, our model outperforms related work models that used extra features with the BERT model (i.e., AraBERTEmojisOUT) or adopted an ensemble structure (i.e., SVM and ValenceList + C-LSTM + Mult-BERT) to resolve this task using the same dataset as in this study. Additionally, MARBERTv2-BiLSTM and MARBERTv2-BiGRU achieved better results than the MARBERTv2 model, which was fine-tuned using a linear layer only. This indicates the effectiveness of incorporating the fine-tuned MARBERTv2 model with more powerful neural network layers to further enhance the extracted semantic and contextual features.
Furthermore, our model significantly outperforms the BERT-CNN model, which can be justified by the fact that capturing long-range dependencies in the data and considering the order of the words are crucial for understanding the context and improving the classification performance. Besides, our model achieves better results than BERT-GRU, which indicates the effectiveness of using bidirectional layers to encode features from both the left and right sides for handling this task. In addition, our model outperforms the BERT-BiLSTM model. This can be explained by the fact that GRU has a simpler architecture than LSTM, potentially simplifying the training process.

CONCLUSION
In this paper, an enhanced BERT-based model is proposed to address the offensive language detection task on an Arabic reference dataset. The proposed model employs BERT to generate contextualized vector representations, followed by a BiGRU layer to further improve the extracted context and semantic features. The experimental results showed the effectiveness of our model compared to the baseline and related work models, achieving a macro F1-score of 93.16%. Additionally, the obtained results prove the efficiency of combining BERT with bidirectional sequential layers to further improve its semantic understanding.
Future work directions include evaluating our model on other Arabic NLP tasks, such as fine-grained hate speech detection. Additionally, we intend to implement our model using pre-trained language models other than BERT, such as XLNet. Moreover, we plan to adapt our model to handle the offensive language detection task on multilingual corpora. In addition, the dataset used in this study is imbalanced; thus, future work also includes investigating various methods to handle the class imbalance issue and examining their impact on the overall performance of our model.



Figure 1. Overall architecture of the proposed model

Figure 4. Experimental results using different learning rate values

Figure 5. Experimental results using different hidden units' values

Figure 6. Experimental results using different optimizers


Table 1. Main publicly available Arabic BERT models

Model | Text type | Pre-training data | Size
AraBERTv02 | MSA | Various Arabic corpora like El-Khair [40] and OSIAN [41] | 8.6B tokens
AraBERTv02-twitter | MSA+DA | AraBERTv02 corpora plus Arabic tweets | -
MARBERTv2 | MSA+DA | Tweets | 29B tokens
Qarib [44] | MSA+DA | News and movie/TV subtitles (MSA) and tweets (DA) | 14B tokens
CamelBERT [45] | DA | A range of dialectal corpora like NADI [46] and QADI [47] | 5.6B tokens
mBERT [32] | MSA | Wikipedia | -

Table 2. The size distribution of the dataset

Table 3. Results of our proposed model using different Arabic BERT models

Table 4. Main evaluation results. The results with "†" were retrieved from the original papers

Model | Accuracy (%) | Macro F1 (%)
Majority baseline [24] | - | 44.41†
AraBERTEmojisOUT [25] | 93.9† | 90.17†
SVM and ValenceList + C-LSTM + Mult-BERT [26] | 93.85† | 90.16†
XLM-R [15] | - | 89.89†
BERT-CNN KUISAIL [28] | - | 89.72†