Realistic influence maximization based on followers score and engagement grade on instagram

Received May 24, 2020; Revised Nov 13, 2020; Accepted Feb 17, 2021

In recent years, the emergence of social media influencers has motivated the study of realistic influence maximization (IM) techniques. The theoretical performance of IM has matured. However, this is not enough, since IM has to be implemented in a social media environment. Realistic IM algorithms and diffusion models have been proposed, for example by adding user factors or a learning agent. However, most studies still rely on the influence spread benchmark, which makes their usefulness questionable. This research is among the first IM studies using Instagram data. In this study, two diffusion models are proposed, which are based on the original IC and LT models with the addition of the engagement grade (EG) factor. An algorithm called IMFS (IM with followers score) is proposed to accommodate the new models as well as IC and LT. In addition, realistic benchmark methods are proposed, namely the average engagement of the activated users and the overlap between post likers and activated users. The results show that the proposed models are 2-3x more realistic than IC and LT.


INTRODUCTION
Social networks are growing at an unprecedented rate, which has opened various business opportunities, especially in brand marketing. Compared to other social networks, Instagram is the best platform to target millennial audiences [1], and the platform with the highest engagement [2]. Choosing influencers can be a difficult task, since common strategies such as picking the influencers with the highest number of followers and likes do not always produce the best results [3].
Choosing influencers that can spread influence to the largest possible audience with a minimum budget is widely studied in a field called influence maximization (IM) [4]. An IM algorithm is used to generate a seed set (influencers) that produces the best possible influence spread (the number of activated users) under a specific diffusion model [5]. However, the commonly used diffusion models, i.e., linear threshold (LT) and independent cascade (IC), assume that each user has a similar level of influence degree and susceptibility. This makes the LT and IC models less useful in the real world, even though some state-of-the-art IM algorithms such as IMM [6] and SSA [7] can produce a very high influence spread.
The influence spread itself has matured, and recent theoretical improvements are only in terms of runtime [8]. Recent studies focus more on making IM realistic. The incorporation of various factors has been studied, such as influence susceptibility [9], sentiment [10,11], freeloaders [12], targeted ads [13], and engagement [14,15]. There are also bandit-based IM algorithms [16,17], which typically use feedback from actual data. However, the benchmark methods were still based on influence spread, which makes their real-world usefulness questionable. This study aims to develop diffusion models and an IM algorithm that activate more engaging users. Two new diffusion models that incorporate an engagement value into LT and IC are proposed, namely IC-eg and LT-eg (EG=engagement grade). An IM algorithm called IMFS (influence maximization with followers score) is proposed to provide the best solution for the proposed diffusion models. In addition, realistic and practical benchmark methods are proposed, based on the average engagement rate and engagement grade of the activated users, and the overlap between the activated users and actual post likers. To the best of our knowledge, this is the first IM study using Instagram data.
The following questions are studied in this research: (1) does the incorporation of engagement grade produce a more realistic influence maximization? (2) how realistic are the proposed diffusion models compared to the classic IC and LT models? This study is a step towards a practical IM that can be used by business users to choose brand marketers more realistically. The rest of this paper is organized into the following sections: related studies, methodology, experimental results, and conclusion.

RELATED STUDIES
Recent studies have improved both the theoretical and the real-world performance of IM. Influence spread and runtime are commonly used as the theoretical benchmarks. The first notable state-of-the-art IM algorithms were TIM and TIM+ [18], with remarkably high influence spread and low runtime. The influence spread was further improved by IMM [6] through the use of martingales. A further improvement in runtime was made by SSA [7], which was up to 1,200 times faster than IMM. More recently, machine learning-based IM algorithms have emerged, such as DISCO [8], which improved on the runtime of SSA. However, DISCO required a training phase that took up to three days of execution. Thus far, IMM and SSA are the best performers in terms of influence spread, while DISCO is considered to have similar performance.
Real-world improvement on IM can be made through bandit agents or incorporation of factors. In a bandit-based IM, IM is executed multiple times, and the algorithm tunes the cumulative regret parameter based on the outcome of the diffusion process [19]. Unlike usual IM, the target of bandit-based IM is to minimize cumulative regret, i.e., the loss of influenced nodes cumulated from every iteration. However, the vital part of bandit-based IM, i.e. the outcome of diffusion, was mostly synthesized using techniques such as graph sampling [16,20] and diffusion random vector [19].
The influence factor has been studied [9]; however, that study used a prediction technique called correlated label propagation (CLP) to generate influence degree and susceptibility values. These values were predicted based on synthetic and real graphs instead of being actual values. Engagement is one of the most prevalent factors in IM, with engagement forms such as conversation content and reply [14], assortativity and influence on second neighborhoods [15], network topology [21], and silent users [22]. However, these studies either relied on assumptions [15] or only worked in a limited environment [14]. Furthermore, influence spread remains the most popular benchmark method, and it is a purely theoretical one.

METHODOLOGY
This section discusses the data preparation, engagement rate (ER) and engagement grade (EG) metrics, the proposed IM diffusion models and algorithm. The proposed IM algorithm was tuned to work with the existing IC and LT models, as well as the proposed models.

Data preparation
The dataset used in this study was collected from Instagram from April to May 2020, from the followers of 24 private universities in Malaysia. This localization was intended to create many connections among users. For these users, the related data was collected, i.e., posts, hashtags, post likers, and followers. This was done using the Instagram API and various third-party Instagram websites. The users were cleaned using the fake user classification model from an earlier study [23]. The raw data consists of 70,409 nodes/users, 1,007,107 edges/connections, 1,031,348 posts, and 47,689,496 liker entries. The simplified and anonymized user data and the network are available at https://www.kaggle.com/krpurba/im-instagram-70k-eg. There are two network datasets, i.e., a network for the IC and LT models, and a network for IC-eg and LT-eg.

Engagement rate and engagement grade metrics
Engagement rate (ER) is among the most popular metrics for social networks, and is defined as the number of (likes+comments) divided by the number of followers, divided by the number of posts. Comparing users by ER, however, is not fair across users with different numbers of followers, since a high number of followers tends to lead to a lower ER [24]. Based on the average ER across different numbers of followers [24], we established the engagement grade (EG), which ranges from 0.0 to 1.0. The average ER and followers are shown in Table 1. The engagement grade (EG) is formulated in (1).
where: ER baseline=The average ER for the user's number of followers

For any user, an EG value between 0.0 and 0.33 is below the average ER, between 0.33 and 0.67 is above the average ER, and between 0.67 and 1.0 is far above average. The EG value gives a fair reward to users across different numbers of followers. For example, the same EG=2.0 will be assigned to users with (ER=4.6 and followers=800 k) and (ER=21.4 and followers=1 k). The EG value is capped at 1.0.
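As an illustration only, the two metrics can be sketched in Python. The ER function follows the definition given in the text. The EG function is an assumption: formula (1) is not reproduced here, so the sketch uses one linear form that is consistent with the stated thresholds (EG is about 0.33 when ER equals the baseline, and EG is capped at 1.0). All function and parameter names are hypothetical.

```python
def engagement_rate(likes: int, comments: int, followers: int, posts: int) -> float:
    """ER = (likes + comments) / followers / posts, as defined in the text."""
    if followers <= 0 or posts <= 0:
        return 0.0  # no audience or no posts: treat ER as zero
    return (likes + comments) / followers / posts


def engagement_grade(er: float, er_baseline: float) -> float:
    """Assumed form (not the paper's formula): ER / (3 * ER_baseline), capped to [0, 1].

    With this form, a user whose ER equals the baseline gets EG = 1/3.
    """
    if er_baseline <= 0:
        raise ValueError("er_baseline must be positive")
    return max(0.0, min(1.0, er / (3.0 * er_baseline)))
```

Under this assumed form, the baseline would be looked up from the average ER of users with a similar number of followers, such as the values reported in Table 1.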

IM diffusion models
This study proposes two diffusion models, namely IC-eg (independent cascade-EG) and LT-eg (linear threshold-EG). Compared to the respective original models, these models only modify the edge weight by multiplying it by EG, as formulated in (2), where Pu is the edge probability for the IC-eg model (or the edge weight for the LT-eg model) from a user to a follower. The constant of 2 can be set to an arbitrary number to keep a reasonable influence spread value. In our dataset, the average EG of all users is 0.395, so removing this constant causes a very low influence spread. Engagement was added to the edge weight because it is a useful indicator of several things: (1) low engagement means a possibly high number of fake followers [25]. On FakeCheck.co (https://www.fakecheck.co), a website for fake followers analysis, the ER of a user is compared to an "industry standard", which most likely means it uses an EG-like metric instead of ER; (2) using influencers with a high engagement rate leads to more effective marketing [26]; (3) high engagement means high activity (fewer passive users) [27]; (4) engagement naturally reflects how much a user is liked in society.
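A minimal sketch of this edge weight adjustment is given below, assuming the new weight is the original weight multiplied by EG and by the constant. Which user's EG enters the product is not fixed by this sketch, and the clamp to 1.0 is an added assumption that keeps the result a valid probability. The function name and parameters are hypothetical.

```python
def eg_edge_weight(base_weight: float, eg: float, constant: float = 2.0) -> float:
    """Scale an IC/LT edge weight by EG and a constant (2 in the text).

    The clamp to [0, 1] is an assumption, not taken from the source.
    """
    if not 0.0 <= eg <= 1.0:
        raise ValueError("eg must be in [0.0, 1.0]")
    return max(0.0, min(1.0, base_weight * eg * constant))
```

With the reported average EG of 0.395, the average multiplier is 2 x 0.395 = 0.79, so edge weights stay close to their original scale. Without the constant, the average multiplier would be 0.395, which matches the remark that removing it lowers the influence spread considerably.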

Influence maximization with followers score algorithm
The proposed IMFS algorithm mainly focuses on the IC-eg and LT-eg models. However, in the experiments, the algorithm also worked well with the IC and LT models. The basic idea of IMFS is "calculating the followers in multiple depths using a sampled graph." A sampled graph is commonly used in IM algorithms based on the RR-Set (reverse-reachable set) [6,18], and is created by removing each edge with a probability of (1-edge weight).
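The edge sampling step can be sketched as follows. This is a simplified illustration rather than the authors' implementation; the edge tuple layout, the adjacency-list output, and the function name are assumptions.

```python
import random
from typing import Dict, Iterable, List, Optional, Tuple

Edge = Tuple[str, str, float]  # (user, follower, edge weight in [0, 1])


def sample_graph(edges: Iterable[Edge], seed: Optional[int] = None) -> Dict[str, List[str]]:
    """Keep each edge with probability equal to its weight.

    This is equivalent to removing each edge with probability (1 - weight).
    Returns an adjacency list mapping a user to the followers that were kept.
    """
    rng = random.Random(seed)
    sampled: Dict[str, List[str]] = {}
    for user, follower, weight in edges:
        if rng.random() < weight:  # random() is in [0, 1), so weight 1.0 always keeps
            sampled.setdefault(user, []).append(follower)
    return sampled
```

Users whose outgoing edges were all removed do not appear as keys, so later steps only visit users that remain in the sampled graph.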
The RR-Set approach collects influential users by generating the sampled graph several times and keeping the users who frequently appear in the graph [7]. In contrast, this research generates the sampled graph only once. The graph is used to calculate the followers score (flrs), which is the aggregation of the number of followers at multiple depths (up to 10), as formulated in (3).
where: flr(depth,user)=The number of followers at depth, where depth=0 means direct followers

The IMFS algorithm uses a sampled graph to minimize runtime, which means that only users in the sampled graph are calculated. The whole process of IMFS is shown in Figure 1. IMFS starts with the followers score calculation (estimation phase). For snum (number of seeds)=1, the flrs of all users are calculated. For snum>1, however, the flrs calculation stops when there is no improvement in the last noimpr_flrs loops.
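The followers score can be illustrated with a breadth-first traversal of the sampled graph. The sketch assumes that the aggregation is a plain sum of flr(depth,user) over the depths and that each user is counted once, at the first depth where it is reached. Formula (3) may weight the depths differently, so the code is an approximation with hypothetical names.

```python
from typing import Dict, List


def followers_score(graph: Dict[str, List[str]], user: str, max_depth: int = 10) -> int:
    """Sum the number of newly reached followers at each depth (breadth-first).

    depth 0 = direct followers. The unweighted sum is an assumption.
    """
    visited = {user}
    frontier = [user]
    score = 0
    for _ in range(max_depth):
        next_frontier = []
        for node in frontier:
            for follower in graph.get(node, []):
                if follower not in visited:
                    visited.add(follower)
                    next_frontier.append(follower)
        if not next_frontier:
            break  # no deeper followers remain in the sampled graph
        score += len(next_frontier)  # plays the role of flr(depth, user)
        frontier = next_frontier
    return score
```

Sorting users by this score in descending order would give the candidate ordering that the later simulation phase relies on.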
Before designing the next phases of the algorithm, we further examined the usefulness of flrs for directly predicting influence spread. By simulating all users individually (snum=1), it was found that flrs has a Pearson correlation of 0.86 with influence spread in the IC model, and 0.84 in the IC-eg model. Since these numbers are not extremely high, some inaccuracies are expected during the "conversion" of flrs to influence spread. Thus, IMFS requires a simulation phase that simulates several candidates.
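For reference, the Pearson correlation between two equally long lists (for example, per-user flrs and per-user simulated influence spread) can be computed as in the sketch below. The function name is hypothetical and the inputs in the test are illustrative values only, not data from this study.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equally long sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)
```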

Figure 1. IMFS algorithm
To mitigate the potential inaccuracies during the "conversion", we added three parameters, i.e., noimpr_flrs, noimpr_an, and take_bestprev. Removing noimpr_flrs and noimpr_an simply means executing a greedy algorithm, which sacrifices runtime. The take_bestprev parameter, on the other hand, aims to extend the classic greedy approach, which only uses the single best previous combination. The final parameter values used in this research are noimpr_flrs=50, noimpr_an=50, and take_bestprev=5. The parameters affect the influence spread result. The values were acquired by increasing them gradually until the breaking points of influence spread were reached, as seen in Figure 2. Note that these experiments were done by adjusting a single parameter at a time while keeping the others at their default values. In the simulation phase, each new candidate is combined with the take_bestprev best previous combinations and simulated with ε=0.1. This ε value was used by many studies [6,28] to get a sufficiently accurate influence spread. Since users are already sorted by flrs in descending order during the estimation phase, users with higher flrs are expected to have a higher influence spread (with some inaccuracies). If there is no improvement in the last noimpr_an candidates, the simulation phase is stopped, and the best candidate is taken as the current seed. The pseudocode of IMFS is given in Algorithm 1.
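The early-stopping logic of the simulation phase can be sketched as below. This is a simplified reading of the description above, not Algorithm 1 itself. The simulate callback is a hypothetical stand-in for the influence spread simulation (with ε=0.1 in the paper), and best_prev is assumed to hold up to take_bestprev previous seed combinations.

```python
from typing import Callable, Sequence, Tuple

Seeds = Tuple[str, ...]


def pick_next_seed(
    candidates: Sequence[str],         # assumed sorted by flrs, descending
    best_prev: Sequence[Seeds],        # up to take_bestprev previous combinations
    simulate: Callable[[Seeds], float],
    noimpr_an: int = 50,
) -> Tuple[Seeds, float]:
    """Try candidates in order and stop after noimpr_an candidates without improvement."""
    best_combo: Seeds = ()
    best_spread = float("-inf")
    no_improvement = 0
    prev_combos = list(best_prev) or [()]  # first seed: combine with the empty set
    for candidate in candidates:
        improved = False
        for prev in prev_combos:
            if candidate in prev:
                continue  # a user cannot be chosen twice
            combo = prev + (candidate,)
            spread = simulate(combo)
            if spread > best_spread:
                best_combo, best_spread = combo, spread
                improved = True
        no_improvement = 0 if improved else no_improvement + 1
        if no_improvement >= noimpr_an:
            break
    return best_combo, best_spread
```

Setting noimpr_an to the number of candidates makes the loop evaluate every candidate, which corresponds to the greedy behavior mentioned above and its higher runtime.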

EXPERIMENTAL RESULTS
As the baseline algorithms, IMM [6] and SSA [7] were chosen since they are the best performers in terms of influence spread and runtime (SSA). Four diffusion models were tested, i.e., the classic IC and LT models and the IC-eg and LT-eg models. There were four benchmark methods, i.e., influence spread and runtime, with the addition of two user metric benchmarks:

a. Average engagement (EG and ER)
When influencers are simulated according to a diffusion model, a number of users are activated. The simulations were executed 1,000 times, and the average EG and ER of the activated users were calculated. Activating less engaging users means activating passive users, which is not realistic.

b. Likers overlap (LO)
The LO value assumes that a user is influenced if he/she liked an influencer's post, regardless of being a follower or an outsider. Additional forms of influence, such as product purchases, are much harder to obtain. The LO value is the proportion of activated users who are likers, as formulated in (4).
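The two user metric benchmarks can be sketched as follows for the IC case. The sketch assumes a standard independent cascade run, counts the seeds among the activated users, and treats users with unknown EG as 0.0. These choices, and all names, are assumptions for illustration.

```python
import random
from typing import Dict, Iterable, List, Set, Tuple

Graph = Dict[str, List[Tuple[str, float]]]  # user -> [(follower, probability)]


def simulate_ic(graph: Graph, seeds: Iterable[str], rng: random.Random) -> Set[str]:
    """One independent cascade run; returns the activated users (seeds included)."""
    frontier = list(dict.fromkeys(seeds))  # de-duplicate while keeping order
    activated = set(frontier)
    while frontier:
        next_frontier = []
        for user in frontier:
            for follower, prob in graph.get(user, []):
                # each newly activated user gets one chance per follower
                if follower not in activated and rng.random() < prob:
                    activated.add(follower)
                    next_frontier.append(follower)
        frontier = next_frontier
    return activated


def average_eg(activated: Set[str], eg: Dict[str, float]) -> float:
    """Mean EG of the activated users (unknown users count as 0.0)."""
    if not activated:
        return 0.0
    return sum(eg.get(user, 0.0) for user in activated) / len(activated)


def likers_overlap(activated: Set[str], likers: Set[str]) -> float:
    """LO = (activated users who are likers) / (activated users)."""
    if not activated:
        return 0.0
    return len(activated & likers) / len(activated)
```

In an experiment like the one described, these values would be averaged over repeated runs (1,000 in this study) rather than taken from a single cascade.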

Synthetic benchmarks
The performance under the IC and LT models is shown in Figure 3. The proposed IMFS algorithm performed similarly to the other algorithms in terms of influence spread. SSA still has by far the best runtime, consistently running in under 1 s, while IMFS is around 2-3x faster than IMM. The existing IMM and SSA algorithms should work well for the IC-eg and LT-eg models, since only the edge weights in the network were adjusted. All algorithms performed similarly under the IC-eg model, as shown in Figure 4. However, IMFS outperforms the other algorithms in influence spread under the LT-eg model. This means that the followers score is more suitable for LT-eg than the RR-set.

User metric benchmarks
Most algorithms performed similarly in terms of the average EG and ER; the main difference is between the diffusion models. As can be seen in Figures 5 and 6 and Table 2, LT-eg outperforms the other models, with IC-eg as the second-best, while the IC and LT models have much lower EG and ER. This shows that both LT-eg and IC-eg are more realistic. The IMFS algorithm performed slightly better than the other algorithms in EG and ER under the IC-eg and LT-eg models. In terms of likers overlap (LO), the IC-eg and LT-eg models performed 2-3x better than the IC and LT models, as seen in Figure 7 and Table 3. IMM did not perform as well as the other algorithms under LT-eg, similar to the influence spread result. This shows that IMM is not suitable for LT-eg, while SSA can keep up. Based on the average LO provided in Table 3, IMFS performed better than the other algorithms.

Seeds similarity
The outcome of an IM algorithm is the chosen seed set (influencers). In practice, for example, a business user queries the algorithm to produce ten influencers for marketing purposes. Allocating a budget for the influencers is a difficult task, so accurate identification of influencers becomes crucial. Figure 8 shows the seed set similarity for each number of seeds, compared to IC-eg as the most realistic model.

CONCLUSION
Recent studies were more focused on making IM more realistic by adding user factors, such as engagement, sentiment, multiple network analysis, or learning during the diffusion process. This research added users' engagement value as the measure of activeness in the network, and proposed user-based benchmark methods. Experimental results showed that the proposed IC-eg and LT-eg diffusion models were superior to IC and LT in terms of the average EG, ER, and likers overlap (LO) of the activated users. The high values of these user metrics show that the proposed models are more realistic, and can produce more engaging users, than IC and LT. The LO value produced by IC-eg and LT-eg is 2-3x better than that of IC and LT, which indicates that the proposed models are closer to reality. The less realistic models (IC and LT) have a 50% difference in terms of the chosen influencers compared to IC-eg.
The proposed IMFS algorithm, which was explicitly tuned for IC-eg and LT-eg, produced slightly better results than SSA and IMM in terms of the user metrics. Furthermore, IMFS achieved a better influence spread than the other algorithms under the LT-eg model, while performing similarly under the other models. This shows that the followers score, which is the backbone of the IMFS algorithm, is suitable for all four diffusion models. In future work, additional user metrics can be added, such as followers growth. The edge weights of the diffusion models can also be further tuned to achieve higher user metrics. To enhance practical usage, topic consideration has to be added to suit marketing for specific brand categories.