Robust Arabic tweet NER via label-aware data augmentation and AraBERTv2

Brahim Ghazoui, Ismail El Bazi, Ibtissam Essadik, Brahim Ait Benali, Hicham Moussa

Abstract


Named entity recognition (NER) is vital for turning unstructured social media text into structured information. However, Arabic tweets pose distinct challenges: informality, brevity, dialectal variation, and inconsistent orthography. This study targets those challenges by coupling targeted data augmentation with a transformer model, bert-base-arabertv2. We design a lightweight augmentation pipeline (synonym replacement, name and location replacement, and deletion of third-person Arabic names) to expand linguistic variety and reduce overfitting under limited annotation. The approach is simple but deliberate: preserve labels when substituting entities with type-consistent alternatives, remove the corresponding tags when deleting names, and keep tweet semantics intact where possible. We then fine-tune bert-base-arabertv2 on the combined original and augmented data and evaluate on a held-out set of tweets. The result is a substantial gain in overall performance: an F1 score of 0.93 with augmentation versus 0.72 without. These findings indicate that controlled, label-aware augmentation can improve robustness and generalization for Arabic tweet NER, where data scarcity and linguistic variability otherwise degrade accuracy. Beyond the empirical gains, our work offers a practical recipe, combining clear augmentation heuristics with a standard transformer backbone, that can be replicated and adapted to similar low-resource, noisy domains. This contributes to more reliable Arabic social media analysis and downstream information extraction.
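To make the label-aware idea concrete, the following is a minimal Python sketch of two of the augmentation steps named in the abstract: type-consistent entity replacement and name deletion. It is an illustration under stated assumptions, not the authors' implementation. The BIO tag scheme, the `GAZETTEER` replacement pools, and the function names `find_spans`, `replace_entities`, and `delete_entities` are all hypothetical, and deleting every PER span is a simplification of the "third-person Arabic names" rule, whose exact criteria the abstract does not specify. Synonym replacement is omitted because it depends on a lexical resource the abstract does not name.

```python
import random

# Hypothetical type-consistent replacement pools (illustrative, not from the paper).
GAZETTEER = {
    "PER": [["محمد"], ["فاطمة", "الزهراء"]],
    "LOC": [["الرباط"], ["الدار", "البيضاء"]],
}


def find_spans(tags):
    """Return (start, end, type) for each BIO entity span; end is exclusive."""
    spans, start, etype = [], None, None
    for i, tag in enumerate(tags + ["O"]):  # sentinel closes a trailing span
        continues = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not continues:
            spans.append((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def replace_entities(tokens, tags, types=("PER", "LOC"), rng=random):
    """Swap each entity of the given types for a same-type alternative.

    Tags are regenerated for the new span, so a one-token entity may become
    a two-token entity (or vice versa) without misaligning the labels.
    """
    out_tokens, out_tags, cursor = [], [], 0
    for start, end, etype in find_spans(tags):
        out_tokens += tokens[cursor:start]
        out_tags += tags[cursor:start]
        if etype in types and etype in GAZETTEER:
            new = rng.choice(GAZETTEER[etype])
            out_tokens += new
            out_tags += ["B-" + etype] + ["I-" + etype] * (len(new) - 1)
        else:
            out_tokens += tokens[start:end]
            out_tags += tags[start:end]
        cursor = end
    out_tokens += tokens[cursor:]
    out_tags += tags[cursor:]
    return out_tokens, out_tags


def delete_entities(tokens, tags, etype="PER"):
    """Drop every token inside spans of the given type, together with its tag."""
    drop = set()
    for start, end, span_type in find_spans(tags):
        if span_type == etype:
            drop.update(range(start, end))
    kept = [(tok, tag) for i, (tok, tag) in enumerate(zip(tokens, tags)) if i not in drop]
    return [tok for tok, _ in kept], [tag for _, tag in kept]


# Toy example sentence (made up for illustration).
tokens = ["زار", "أحمد", "مدينة", "فاس", "أمس"]
tags = ["O", "B-PER", "O", "B-LOC", "O"]

new_tokens, new_tags = replace_entities(tokens, tags, rng=random.Random(0))
short_tokens, short_tags = delete_entities(tokens, tags)
```

Working on entity spans rather than single tokens keeps tokens and tags aligned when a replacement has a different length from the original mention. The augmented pairs would then be added to the original training set before fine-tuning.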

Keywords


AraBERTv2; Arabic tweets; Data augmentation; Named entity recognition; Transformer-based models

DOI: https://doi.org/10.11591/eei.v15i1.10462

This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Bulletin of Electrical Engineering and Informatics (BEEI)
ISSN: 2089-3191, e-ISSN: 2302-9285
This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).