Unsupervised outlier detection in high-dimensional text data: a comparative analysis

Zuleaizal Sidek, Sharifah Sakinah Syed Ahmad, Noor Hasimah Ibrahim Teo

Abstract


Outlier detection in user reviews is a critical task for identifying anomalous and potentially valuable insights within large datasets. This study compares three algorithms for outlier detection in user reviews: isolation forest, local outlier factor (LOF), and latent Dirichlet allocation (LDA). Each algorithm was evaluated with two metrics: accuracy for outlier detection and silhouette score for clustering quality. LDA performed best, with an accuracy of 0.98 and a silhouette score of 0.13. Isolation forest followed, with an accuracy of 0.90 and a silhouette score of 0.11. LOF performed worst, with an accuracy of 0.42 and a silhouette score of -0.05, a result attributed to its sensitivity to neighbors. The study contributes a systematic exploration of how parameter variations affect algorithm performance, providing insights for high-dimensional text data analysis. Limitations include the dependence on preprocessing and on specific parameter settings. Future work will explore hybrid approaches and broader datasets to enhance scalability and adaptability.
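The evaluation described in the abstract can be illustrated with a minimal sketch, assuming scikit-learn and TF-IDF features. The toy reviews, the ground-truth labels, and every parameter value (`contamination`, `n_neighbors`, `n_estimators`) are hypothetical and are not taken from the study. LDA is left out because the abstract does not state how LDA output was turned into outlier labels.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, silhouette_score
from sklearn.neighbors import LocalOutlierFactor

# Hypothetical toy reviews; the last two are meant to look off-topic.
reviews = [
    "great phone with a long battery life",
    "battery life is great and the screen is sharp",
    "the phone screen is sharp and bright",
    "good camera and great battery",
    "sharp screen, good camera, fast phone",
    "fast phone with a good screen",
    "the camera is good and the battery lasts",
    "great screen and fast charging battery",
    "recipe for banana bread with walnuts",
    "train timetable changes next monday",
]
# Hypothetical ground truth, using scikit-learn's convention:
# 1 = inlier, -1 = outlier.
y_true = np.array([1] * 8 + [-1] * 2)

# High-dimensional text representation (assumed here to be TF-IDF).
X = TfidfVectorizer().fit_transform(reviews).toarray()


def evaluate(y_pred):
    """Return (accuracy, silhouette) for one set of inlier/outlier labels."""
    acc = accuracy_score(y_true, y_pred)
    # The silhouette score is only defined when both labels are present.
    sil = silhouette_score(X, y_pred) if len(set(y_pred)) == 2 else float("nan")
    return acc, sil


iso_pred = IsolationForest(
    n_estimators=100, contamination=0.2, random_state=0
).fit_predict(X)
lof_pred = LocalOutlierFactor(n_neighbors=3, contamination=0.2).fit_predict(X)

results = {"isolation_forest": evaluate(iso_pred), "lof": evaluate(lof_pred)}
for name, (acc, sil) in results.items():
    print(f"{name}: accuracy={acc:.2f}, silhouette={sil:.2f}")
```

Varying `n_neighbors` or `contamination` in this sketch is one way to observe the kind of parameter sensitivity the abstract mentions, although numbers from a ten-document toy corpus say nothing about the reported results.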

Keywords


Anomalies; Isolation forest; Latent Dirichlet allocation; Local outlier factor; Outlier detection; User reviews


DOI: https://doi.org/10.11591/eei.v14i4.9573


This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Bulletin of Electrical Engineering and Informatics (BEEI)
ISSN: 2089-3191, e-ISSN: 2302-9285
This journal is published by the Institute of Advanced Engineering and Science (IAES) in collaboration with Intelektual Pustaka Media Utama (IPMU).