Distributed big data analysis using spark parallel data processing

Hoger Khayrolla Omar, Alaa Khalil Jumaa


The big data marketplace is growing rapidly. The main challenge is finding a system that can store and handle very large volumes of data and then process that data to mine its hidden knowledge. This paper proposes a comprehensive system for improving big data analysis performance. It combines a fast big data processing engine, Apache Spark, with a big data storage environment, Apache Hadoop. The system is tested on about 11 gigabytes of text data collected from multiple sources for sentiment analysis. Three different machine learning (ML) algorithms are used, all of which are already supported by the Spark ML package. The system programs were written in the Java and Scala programming languages, and the constructed model combines the classification algorithms with the pre-processing steps in the form of an ML pipeline. The proposed system was implemented with both centralized and distributed data processing. Moreover, several dataset manipulation approaches were applied in the system tests to check which one provides the best accuracy and time performance. The results show that the system handles big data efficiently: it achieves excellent accuracy with fast execution time, especially on the distributed data nodes.
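As an illustration of what a pipeline of this kind might look like, the following is a minimal Scala sketch using the Spark ML Pipeline API. It is not the authors' code. The HDFS path, the CSV layout, the column names "text" and "label", the choice of pre-processing stages (Tokenizer, StopWordsRemover, HashingTF, IDF), and the use of logistic regression as the classifier are all assumptions made for the example; the abstract does not name the three algorithms or the exact pre-processing steps.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.{HashingTF, IDF, StopWordsRemover, Tokenizer}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object SentimentPipelineSketch {
  def main(args: Array[String]): Unit = {
    // The master URL (local or cluster) is expected to come from spark-submit.
    val spark = SparkSession.builder().appName("SentimentPipelineSketch").getOrCreate()

    // Hypothetical input: a CSV file on HDFS with a header row and the
    // columns "text" (string) and "label" (0 or 1). Path and schema are assumptions.
    val data = spark.read
      .option("header", "true")
      .csv("hdfs:///data/sentiment/reviews.csv")
      .select(col("text"), col("label").cast("double").as("label"))
      .na.drop()

    val Array(train, test) = data.randomSplit(Array(0.8, 0.2), seed = 42L)

    // Pre-processing stages (assumed): tokenize, drop stop words, TF-IDF features.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val remover = new StopWordsRemover().setInputCol("words").setOutputCol("filtered")
    val tf = new HashingTF().setInputCol("filtered").setOutputCol("rawFeatures")
    val idf = new IDF().setInputCol("rawFeatures").setOutputCol("features")

    // Classifier stage (assumed): logistic regression stands in for the
    // unnamed algorithms used in the paper.
    val lr = new LogisticRegression()
      .setLabelCol("label")
      .setFeaturesCol("features")
      .setMaxIter(20)

    // Pre-processing and classification are chained into one ML pipeline.
    val pipeline = new Pipeline().setStages(Array(tokenizer, remover, tf, idf, lr))
    val model = pipeline.fit(train)

    // Score the held-out split and report accuracy.
    val predictions = model.transform(test)
    val accuracy = new MulticlassClassificationEvaluator()
      .setLabelCol("label")
      .setPredictionCol("prediction")
      .setMetricName("accuracy")
      .evaluate(predictions)
    println(s"Test accuracy: $accuracy")

    spark.stop()
  }
}
```

Because the master URL is left to spark-submit, the same program could in principle be run on a single machine or on a cluster reading from HDFS, which is roughly the kind of centralized versus distributed comparison the abstract describes.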


Apache Spark; Big data; Sentiment analysis; Hadoop HDFS; Machine learning


DOI: https://doi.org/10.11591/eei.v11i3.3187



This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Bulletin of EEI Stats