Bulletin of Electrical Engineering and Informatics

Received Nov 26, 2022; Revised Apr 5, 2023; Accepted Jun 4, 2023

With the rapid development of emerging technologies in the industrial revolution 4.0 or 5.0, social media has become one of the social environments in which people carry out social activities, both socializing and advertising. However, since it is an open platform by nature, cybercrime on social media is inevitable. Currently, more than a million fake accounts exist on Instagram, Twitter, and Facebook, created to increase followers, spread hoaxes, and send spam. It is difficult to eliminate these accounts manually on social media platforms, and research on automatic fake account detection has been carried out for more than a decade. This study provides a literature review of several methods and machine learning algorithms, together with their measured performance, for identifying fake accounts on three well-known social media platforms: Twitter, Instagram, and Facebook.


INTRODUCTION
In the era of industry 4.0, social media has become one of the most preferred platforms to socialize and connect with other people. Interactive social media platforms such as Instagram, Twitter, and Facebook have been popular for over a decade as places to share and find important information [1], [2]. People can connect with others around the world without limitations of time and place. Currently, social media is used not just for communication and online interaction, but also for conducting business activities such as advertising, promoting, and campaigning. Meanwhile, the government could use these platforms to deliver government services to citizens effectively [3]. All of these activities require many followers who engage in meaningful interaction in order to achieve a profitable business purpose.

To achieve this, some business people and companies intentionally turn to fake accounts on social media. Fake accounts are used in many different ways. Most businesses and institutions today choose social media as their main platform for marketing and advertising campaigns [4]. Meanwhile, influencers receive many tangible profits from endorsing brands and from sponsorship [5], [6]. Both cases need a huge number of followers, and obtaining fake accounts is the fastest route to a bigger profit. Although they may seem to have no significant impact, fake accounts can also run many devious activities over the internet, such as launching massive online attacks [7], spreading hoaxes, review-bombing products with misleading content, spreading spam, and even impersonating someone [8]. Beyond these cases, fake accounts can take advantage of people by faking news or text messages to steal from innocent social media users [9], [10]. These activities affect the reputation of individuals or groups of people on a larger scale [11].

Bulletin of Electr Eng & Inf, ISSN: 2302-9285. Fake account detection in social media using machine learning methods: … (Nalia Graciella Kerrysa)

One example is when a 17-year-old high school student's identity was stolen by a large company that sells Twitter followers to anyone who wants to become popular [12]. The company claimed approximately 3.5 million automated accounts, including stolen identities, from which it profited. Another case is when Twitter faced a problem where fake accounts would get verified, which slowed down the verification of important and official accounts because the process needed to be improved [13].

As mentioned by Elyusufi et al. [11], the existence of fake accounts is considered more dangerous than any other cybercrime. Moreover, fake accounts on social media cannot be tracked and removed easily, so techniques that solve this problem automatically are needed. On the other hand, with the advancement of technology and algorithms, some fake accounts are trained to mimic the activities of a real social media user so that they can avoid deletion from the respective social media platform [14]. It has also become common practice to purchase many fake accounts at affordable prices to meet business purposes [15]. This condition leads to an increasing number of fake accounts over time. The inability to minimize these fake accounts automatically and effectively is the main reason that current research focuses on this area. This literature review aims to summarize research studies that focus on machine learning techniques to detect fake accounts on social media. This study also provides information on high-performance models that can be implemented to detect fake accounts on Instagram, Facebook, and Twitter.

METHOD
The method used for this literature review follows the method of Zuhroh and Rakhmawati [16], which consists of 4 steps: defining research questions, literature keywords and sources, study criteria selection, and the findings of the literature study. We also followed the guide to structuring a literature review by Kitchenham and Charters [17]. This literature review included 4 stages as follows:

a. Setting up the literature review goals and questions
The objective of this stage is to find methods for fake account detection with better performance on 3 social media platforms, namely Instagram, Twitter, and Facebook. We compose the research questions as follows:
- What are the attributes or features that can be used to effectively detect fake accounts in social media?
- What are the machine learning methods commonly used in fake account classification tasks?
- What is the performance of each machine learning method in detecting fake accounts in social media?

b. Research article selection
Keywords such as "machine learning", "online social network", "social media", "fake account detection", and "fake account classification" are used to search for related articles in three databases: Google Scholar, IEEE, and Scopus. Literature review articles are excluded from this study. The search process obtained 30 articles, which are shown in Table 1.

c. Discussion
From the collected articles, three aspects of the research will be discussed. The first is the dataset used in the study. The second is the attributes selected by the study. The third is the machine learning model used in the study.

d. Data synthesis
After the discussion, the data from the respective studies will be elaborated and summarized. The performance of the models will be reported using the performance metrics used in each article. Additionally, the evaluation of the results will be explained.
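As an illustration of how model performance can be summarized, the following minimal sketch computes accuracy, precision, recall, and F1 from predicted and true labels. The choice of these four metrics is an assumption (they are common for binary classification, but the reviewed articles may report others), and the label vectors are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class ConfusionCounts:
    """Counts for a binary task where 'fake' is the positive class."""
    tp: int  # fake accounts predicted as fake
    fp: int  # real accounts predicted as fake
    fn: int  # fake accounts predicted as real
    tn: int  # real accounts predicted as real


def count_outcomes(y_true, y_pred, positive=1):
    """Tally the four confusion-matrix cells from paired labels."""
    tp = fp = fn = tn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return ConfusionCounts(tp, fp, fn, tn)


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a metric is undefined."""
    return numerator / denominator if denominator else 0.0


def summarize(c):
    """Derive the four summary metrics from the confusion counts."""
    precision = safe_div(c.tp, c.tp + c.fp)
    recall = safe_div(c.tp, c.tp + c.fn)
    return {
        "accuracy": safe_div(c.tp + c.tn, c.tp + c.fp + c.fn + c.tn),
        "precision": precision,
        "recall": recall,
        "f1": safe_div(2 * precision * recall, precision + recall),
    }


# Hypothetical labels for eight accounts: 1 = fake, 0 = real.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]
metrics = summarize(count_outcomes(y_true, y_pred))
print(metrics)
```

Reporting precision and recall alongside accuracy matters when the fake and real classes are unequal in size, because accuracy alone can look high for a model that rarely flags any account.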

RESULTS AND DISCUSSION
Several studies performed multiple steps to obtain the best model for fake account classification, including: i) gathering datasets in several ways, e.g., manually, automatically, or by using an existing dataset; ii) applying feature selection to increase the effectiveness and efficiency of the model; iii) selecting a machine learning model to classify fake accounts; and iv) measuring the performance of the model and evaluating the result.
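The four steps can be sketched end to end on synthetic data. This is only an illustrative sketch, not the procedure of any reviewed study: the dataset is randomly generated under an assumed pattern (fake accounts have few followers and follow many accounts), the feature subset is chosen by hand, and a small from-scratch Gaussian naive Bayes classifier stands in for the model. All function names are hypothetical.

```python
import math
import random


def make_toy_dataset(n_per_class=100, seed=0):
    """Step i: synthetic rows of ([followers, following, posts], label).

    Label 1 means fake and 0 means real. The distributions are assumptions
    made for illustration, not measurements from any platform.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_per_class):
        fake = [max(0.0, rng.gauss(50, 20)), max(0.0, rng.gauss(900, 150)),
                max(0.0, rng.gauss(5, 3))]
        real = [max(0.0, rng.gauss(600, 150)), max(0.0, rng.gauss(300, 100)),
                max(0.0, rng.gauss(80, 25))]
        rows.append((fake, 1))
        rows.append((real, 0))
    rng.shuffle(rows)
    return rows


def select_features(rows, keep):
    """Step ii: keep only the feature columns listed in `keep`."""
    return [([x[i] for i in keep], y) for x, y in rows]


def fit_gaussian_nb(rows):
    """Step iii: estimate a prior, per-feature mean, and variance per class."""
    model = {}
    for label in {y for _, y in rows}:
        xs = [x for x, y in rows if y == label]
        n_feat = len(xs[0])
        means = [sum(x[j] for x in xs) / len(xs) for j in range(n_feat)]
        variances = [
            sum((x[j] - means[j]) ** 2 for x in xs) / len(xs) + 1e-9
            for j in range(n_feat)
        ]
        model[label] = (len(xs) / len(rows), means, variances)
    return model


def predict(model, x):
    """Return the class with the highest log posterior for one account."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = math.log(prior)
        for value, mean, var in zip(x, means, variances):
            score += -0.5 * math.log(2 * math.pi * var)
            score -= (value - mean) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)


# Step iv: hold out 20% of the rows and measure accuracy on them.
rows = select_features(make_toy_dataset(), keep=[0, 1])
split = int(0.8 * len(rows))
train, test = rows[:split], rows[split:]
model = fit_gaussian_nb(train)
accuracy = sum(predict(model, x) == y for x, y in test) / len(test)
print(f"held-out accuracy: {accuracy:.2f}")
```

Because the synthetic classes are well separated, the score here says nothing about real platforms; the sketch only shows how the four steps connect.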

Dataset
Two types of datasets can be used for fake account detection: free datasets and self-made datasets. The majority of researchers chose to build their own datasets comprising fake and real accounts. One reason is that no public open datasets were available for detecting fake accounts [24]. Some researchers collected survey data using a questionnaire [21] or even hired a company to produce part of the dataset [23]. For fake account classification, the datasets are collected from Facebook, Instagram, and Twitter (as shown in Tables 2 and 3). Other platforms are excluded from this literature review.

Self-made datasets (Table 2):

| Reference | Dataset size |
| --- | --- |
| | 1,162 accounts |
| Gupta and Kaushal [7] | 4,708 accounts |
| Khalil et al. [19] | Fake accounts: 13,000; Real accounts: 5,386 |
| Twitter | |
| Ersahin et al. [8] | Fake accounts: 501; Real accounts: 499 |
| Cresci et al. [18] | 13,101 accounts |
| Walt and Eloff [20] | 223,796 accounts |
| Akyon and Kalfaoglu [24] | Fake accounts: 700; Real accounts: 700 |
| Bharti and Pandey [33] | Real accounts: 1,103 |
| Narayan [34] | Fake accounts: 1,056; Real accounts: 1,176 |
| Instagram | |
| Meshram et al. [14] | Fake accounts: 3,231; Real accounts: 6,868 |
| Purba et al. [15] | Fake accounts: 32,869; Real accounts: 32,460 |
| Sheikhi [1] | Fake accounts: 3,132; Real accounts: 6,868 |
| Durga and Sudhakar [40] | Fake accounts: 201; Real accounts: 1,002 |

Datasets from repositories (Table 3):

| Reference | Dataset | Dataset size |
| --- | --- | --- |
| [29] | The fake project dataset | 11,737 accounts |
| Khaled et al. [26] | MIB dataset | Fake accounts: 3,351; Real accounts: 1,950 |
| Wang et al. [31] | CLEF2019 dataset | 7,120 accounts |
| Bharti and Pandey [33] | The fake project [18] | 5.870 accounts |
| Chakraborty et al. [36] | MIB dataset | Fake accounts: 3,474; Real accounts: 3,351 |
| Kadam and Sharma [38] | GitHub | 2,820 accounts |
| Instagram | | |
| Kesharwani et al. [32] | Fake, spammer, and genuine Instagram accounts | 696 accounts |
| Das et al. [37] | Kaggle dataset | 576 accounts |

Various methods are used in gathering and compiling new datasets. Some of them take advantage of third-party websites [15], web data crawlers, and social media APIs. After the data has been gathered, the fake accounts and the real accounts are commonly separated manually. There are also other methods that simplify the data-gathering process without classifying the accounts one by one. The method used by Khalil et al. [19] involved a university's Twitter account that has many followers and verifies which accounts are real or not. Meanwhile, fake accounts were obtained by buying them from a website at affordable prices.
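The fake and real counts reported for the self-made datasets differ considerably in balance, which affects how a reported accuracy should be read. The short sketch below, using four of the count pairs listed in this section, computes the share of fake accounts and the accuracy of a trivial baseline that always predicts the larger class. The helper names are hypothetical, and the baseline is an added point of comparison rather than something the reviewed studies necessarily report.

```python
# (fake, real) account counts as listed for the self-made datasets.
datasets = {
    "Khalil et al. [19]": (13_000, 5_386),
    "Ersahin et al. [8]": (501, 499),
    "Meshram et al. [14]": (3_231, 6_868),
    "Purba et al. [15]": (32_869, 32_460),
}


def fake_share(fake, real):
    """Fraction of the dataset that is labelled fake."""
    return fake / (fake + real)


def majority_baseline(fake, real):
    """Accuracy obtained by always predicting the larger class."""
    return max(fake, real) / (fake + real)


for name, (fake, real) in datasets.items():
    print(
        f"{name}: fake share {fake_share(fake, real):.3f}, "
        f"majority baseline {majority_baseline(fake, real):.3f}"
    )
```

A model evaluated on an imbalanced dataset should be judged against this baseline: an accuracy only slightly above it indicates little real discriminative power.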

Feature selection
According to Elyusufi et al. [11], feature selection is a basic concept in machine learning that affects the performance of detection and classification; the chosen features can therefore have a significant influence on the result. This phase can be carried out with techniques such as the Spearman correlation test, dimensionality reduction, the Markov blanket technique, and wrapper feature selection with a support vector machine (SVM) [26]. Researchers can also choose many features, divide them into multiple classes, and insert each class into the model to find the best one [18]. Table 4 presents the results of the feature selection process from several studies, including which features are important for training the model. The number of features varies from 4 to 49 attributes. The most used features are those that researchers can obtain without permission from third-party software. Most of them relate to the number of followers, the number of following accounts, likes, profile pictures, status, posts, and account names.

Table 4. Feature selection for fake account classification

| Reference | Features selected | Total |
| --- | --- | --- |
| Gupta and Kaushal [7] | Received likes, likes, received comments, comments, tags, tag user, tags from other users, page tags, tags in comments, page tags in the comments section, tags by other users in the comments section, shared posts, wall posts, like wall posts, comments in wall posts, used applications. | 17 |
| Elyusufi et al. [11] | Status, followers, friends, favorites. | 10 |
| Walt and Eloff [20] | Account age, duplicate accounts, follower and friend ratio, followers, friends, geographical | |
| Mohammad et al. [27] | Likes, favorites, followings, followers, location, status replies, user replies, amount of registration, hashtags, mentions, URL, profile picture, replies, shares, the status of the account that shared, status. | 16 |
| Cresci et al. [18] | Profile features (information in the profiles of the target account's followers), timeline features (information from tweets in the timelines of the target account's followers), relationship features (features from accounts that have a connection with the target account's followers). | 49 |
| Sheikhi [1] | Profile picture, followed accounts, whether the follower count is greater, and the number of posts. | 4 |
| Purba et al. [15] | Posts, following, followers, biography, link, length of description, the presence of a description, the presence of pictures, likes, comments, location, hashtags, keywords, followers, post similarities, posts per hour. | 17 |
| Meshram et al. [14] | Post count, followers, followings, profile picture, private or public account, biography, username length, numbers in the username. | 8 |
| Akyon and Kalfaoglu [24] | Media number total, followers, following, number of integers in name, private or public account. | 5 |
| Bharti and Pandey [33] | Number of followers, friends, tweets per day, status count, mentions, and hashtags per tweet, added into a user's favorite list, has over 50 tweets, URL, followers to the following ratio, replies. | |
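Of the feature selection techniques named in this section, the Spearman correlation test is the simplest to illustrate. The sketch below ranks a few candidate features by the absolute Spearman correlation between each feature and the fake/real label. It is a minimal sketch under assumptions: the six accounts and their values are hypothetical, the feature names are borrowed loosely from Table 4, and the reviewed studies may apply the test differently (for example with significance thresholds).

```python
import math


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to the one at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 if either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy) if sx and sy else 0.0


def spearman(xs, ys):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(xs), average_ranks(ys))


# Hypothetical accounts; fake = 1, real = 0.
accounts = [
    {"followers": 12, "following": 950, "username_digits": 4, "fake": 1},
    {"followers": 30, "following": 1200, "username_digits": 0, "fake": 1},
    {"followers": 8, "following": 700, "username_digits": 6, "fake": 1},
    {"followers": 540, "following": 310, "username_digits": 0, "fake": 0},
    {"followers": 1200, "following": 400, "username_digits": 2, "fake": 0},
    {"followers": 260, "following": 150, "username_digits": 0, "fake": 0},
]
features = ["followers", "following", "username_digits"]
labels = [a["fake"] for a in accounts]

scores = {f: spearman([a[f] for a in accounts], labels) for f in features}
ranked = sorted(features, key=lambda f: abs(scores[f]), reverse=True)
for f in ranked:
    print(f"{f}: {scores[f]:+.3f}")
```

Features with a correlation close to zero are candidates for removal, while the sign shows the direction of the relationship with the fake label.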

Table 2. The original dataset

Table 3. The dataset from repositories

Table 5 lists the algorithms used in several studies to create a fake account classification model, grouped by the target social media platform. From Table 5, we can conclude that 38 algorithms can be used for the fake account classification task, 2 of which are a combination of 2 classification methods. According to the results, the most used method to detect fake accounts on Facebook is random forest. On the other hand, SVM is commonly used for Twitter. Instagram has several common approaches such as artificial neural networks (ANN), naïve Bayes, random forest, and SVM.

Bulletin of Electr Eng & Inf, Vol. 12, No. 6, December 2023: 3790-3797

Table 5. Classification algorithm used based on the social media platform