eHMCOKE: an enhanced overlapping clustering algorithm for data analysis

Received Apr 17, 2020 Revised May 20, 2021 Accepted Jun 15, 2021
The improved multi-cluster overlapping k-means extension (IMCOKE) uses the median absolute deviation (MAD) to detect outliers in datasets, which makes the algorithm more effective for overlapping clustering. Nevertheless, the position at which MAD is applied was not analyzed. In this paper, the incorporation of MAD for outlier detection was analyzed to determine the appropriate position for identifying outliers before applying it in the clustering procedure. The study also assumed that the size of the clusters and the closeness of clusters to one another can affect runtime performance when identifying overlapping clusters. Therefore, additional parameters such as the radius of clusters and the distance between clusters were added as measurements in the algorithm procedures. Evaluation was done through experiments using synthetic and real datasets. The performance of the eHMCOKE was evaluated via the F1-measure criterion, speed and percentage of improvement. Evaluation results revealed that the eHMCOKE takes less time to discover overlapping clusters, with an improvement rate of 22%, and achieved the best performance of 91.5% accuracy rate via F1-measure in identifying overlapping clusters over the IMCOKE algorithm. These results show that the eHMCOKE significantly outperforms the IMCOKE algorithm on most of the tests conducted.


INTRODUCTION
Extraction of patterns from data is a method called data mining [1]. In the knowledge discovery in databases (KDD) process, data mining is an important part that is used to find significant information and discover hidden patterns in huge collections of data [2]. Mining is used to dig through data and discover new knowledge from various information, which is then used in many applications; this is sometimes referred to as data science [3]. Preventive medicine is one of the fields that uses knowledge discovery in data to analyze patient information for the diagnosis of diseases. There are two categories of functions involved in data mining: supervised and unsupervised learning [4]. In supervised learning, the model is trained on labeled data sets, while in unsupervised learning the model is used to identify patterns in unlabeled data sets [5].
Clustering can be considered an unsupervised learning technique. It is one of the most significant and challenging data mining techniques in the knowledge discovery process. The goal of clustering is to discover groups of objects in unlabeled data such that similar data objects fall within the same cluster while dissimilar data objects fall in different clusters [6]. However, most real-world data sets have overlapping information [7], where data objects or patterns can belong to one or more clusters. Numerous research works have focused on this problem, known as overlapping clustering. For example, in a social network, a person may belong to two or more communities [8]. In music, an item in an emotion data set can be categorized as relaxing and happy at the same time [9]. A method called scalable spectral clustering was used to detect underlying communities in larger networks [10]. The multi-cluster overlapping k-means extension (MCOKE) was introduced as another method for segmenting data into clusters as well as finding overlapped data [11]. Despite providing better results in detecting data that overlap, MCOKE is sensitive to outliers, which can have negative effects on the accuracy of identifying overlapping objects within clusters. Therefore, improvements to the MCOKE algorithm have been introduced for better performance in identifying overlapping clusters. According to recent studies [12]-[14], the observation of unusual values has a significant role in the field of data mining. The IMCOKE [15] algorithm was presented, which focuses on the incorporation of the median absolute deviation (MAD) as the outlier detection method. However, that study did not consider the position at which the MAD procedure is applied in the algorithm. Furthermore, the IMCOKE algorithm concentrates only on the distance between the data and the centroid in finding the data that overlap within clusters and disregards other vital parameters.
This paper aims to enhance the algorithm by determining the best position of MAD for identifying outliers before applying it in the algorithm procedures. The study examines whether the positioning of outlier detection affects its capability to detect outliers in the datasets. In addition, parameters such as the distance between clusters and the radius of the clusters are also considered in the study to achieve faster and more accurate identification of overlapping clusters.

RESEARCH METHOD

Outlier detection
To detect outliers, the data objects were first collated and sorted in ascending order. To detect the anomalies in the data, first compute the median value (M), where M is the median of the sequence of distances of the data objects. Then, compute the absolute deviations by subtracting the median from each distance of a data object and taking the absolute value. Next, sort the absolute deviations in ascending order and determine their median. Finally, multiply this median by b, where b = 1.4826 is a constant linked to the assumption of normality of the data [16]. Equation (1) shows the MAD formula, where x_i denotes the distance of the i-th data object.
MAD = b * median(|x_i - M|) (1)
Once MAD is calculated, a threshold value is determined, which serves as the basis for outlier detection. A study [17] suggests 3, 2.5, or 2 as the threshold value for an outlier. The decision values were computed using (2). Values greater than the upper decision value or smaller than the lower decision value are considered outliers and are removed from the clusters. In this study, a threshold value of 2.5 was adopted since it provides a reasonable choice for outlier detection [18].
Decision Value = M ± threshold value * MAD (2)
This method terminates once all outliers have been isolated from the data sets.
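The MAD procedure and the decision rule in (2) can be illustrated with a minimal Python sketch. The function name `mad_outliers` and the sample distances are hypothetical and chosen only for illustration; the constant b = 1.4826 and the threshold of 2.5 are the values stated above.

```python
from statistics import median

B = 1.4826  # constant linked to the normality assumption, as stated in the text


def mad_outliers(values, threshold=2.5, b=B):
    """Split values into (inliers, outliers) using the MAD decision rule.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    m = median(values)                                # median M
    mad = b * median([abs(v - m) for v in values])    # Equation (1)
    lower = m - threshold * mad                       # Equation (2), lower bound
    upper = m + threshold * mad                       # Equation (2), upper bound
    inliers = [v for v in values if lower <= v <= upper]
    outliers = [v for v in values if v < lower or v > upper]
    return inliers, outliers


# Made-up distances with one obviously unusual value.
distances = [1.0, 1.2, 0.9, 1.1, 1.0, 1.3, 0.8, 9.5]
inliers, outliers = mad_outliers(distances)
print(outliers)  # [9.5]
```

Because the rule is built on medians rather than means, the single extreme value does not inflate the spread estimate, which is why it is still flagged.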

Radius of cluster and distance between cluster calculation
The radius of a cluster and the distance between clusters [19] are two measurements that were considered to reduce the time the algorithm spends identifying overlapping clusters. To get the radius, R, of a cluster, the mean distance of the data in the cluster is multiplied by the number of clusters, as defined in (3). An illustration of the calculation is depicted in Figure 1.
To obtain the distance between clusters, D, (4) is used. A sample calculation is shown in Figure 2.
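The two measurements can be sketched in Python as follows. The radius follows the verbal description above (mean distance of the cluster's data multiplied by the number of clusters), taking the distance of each object to its centroid, which is an assumption. Since (4) is not reproduced here, the distance between clusters is assumed to be the Euclidean distance between centroids. The function names and sample points are hypothetical.

```python
from math import dist, sqrt


def cluster_radius(points, centroid, n_clusters):
    """Radius R: mean object-to-centroid distance times the number of clusters."""
    mean_distance = sum(dist(p, centroid) for p in points) / len(points)
    return mean_distance * n_clusters


def cluster_distance(centroid_a, centroid_b):
    """Distance D between two clusters.

    Assumption: Euclidean distance between the two centroids.
    """
    return dist(centroid_a, centroid_b)


# Made-up cluster: four corners of a square around centroid (1, 1).
square = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
r = cluster_radius(square, (1.0, 1.0), n_clusters=2)
d = cluster_distance((1.0, 1.0), (4.0, 5.0))
print(r, d)
```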

Enhanced MCOKE algorithm
Two strategies were applied to the algorithm. The first strategy was to remove outliers. The second strategy involved the incorporation of added parameters. The structure of the new enhanced MCOKE algorithm is shown in Figure 3.
The enhanced MCOKE (eHMCOKE) algorithm consists of three phases. Phase 1 uses MAD to discover anomalous values (outliers) in the datasets, and these values are isolated before the data are clustered. Phase 2 groups the data into clusters using the K-means algorithm. Finally, in Phase 3, overlapped clusters are identified with the use of maxdist and the added parameters, which modify the previous procedure for identifying clusters that overlap.
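The overlap-identification phase can be sketched as below, under two loudly stated assumptions that the text does not confirm. First, maxdist is taken as the largest distance between any object and its assigned centroid, and an object also joins any other cluster whose centroid lies within maxdist. Second, the radius and distance parameters are used to skip cluster pairs whose centroids are farther apart than the sum of their radii. The function `find_overlaps` and the sample data are hypothetical.

```python
from math import dist


def find_overlaps(points, labels, centroids, radii=None):
    """Return, for each object, the set of cluster indices it belongs to.

    Assumption 1: maxdist is the largest object-to-assigned-centroid distance.
    Assumption 2 (hypothetical pruning rule): a pair of clusters is checked
    only when the centroid distance does not exceed the sum of their radii.
    """
    maxdist = max(dist(p, centroids[own]) for p, own in zip(points, labels))
    memberships = []
    for p, own in zip(points, labels):
        clusters = {own}
        for j, c in enumerate(centroids):
            if j == own:
                continue
            if radii is not None and dist(centroids[own], c) > radii[own] + radii[j]:
                continue  # clusters too far apart to overlap; skip the check
            if dist(p, c) <= maxdist:
                clusters.add(j)
        memberships.append(clusters)
    return memberships


# Made-up data: two clusters on a line with two objects near the boundary.
pts = [(-2.5, 0.0), (0.0, 0.0), (1.8, 0.0), (4.0, 0.0), (5.0, 0.0), (2.2, 0.0)]
lbl = [0, 0, 0, 1, 1, 1]
cen = [(0.0, 0.0), (4.0, 0.0)]
print(find_overlaps(pts, lbl, cen))
```

The pruning step is where a runtime saving could come from: distant cluster pairs are never compared object by object.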

Evaluation
The performance of the eHMCOKE was evaluated based on its accuracy and speed. This process allows a comparison between the IMCOKE and the eHMCOKE algorithms and determines whether one algorithm outperforms the other.
a. Speed or execution. The speed was measured as the elapsed time, obtained by subtracting the start time from the end time.
b. Percentage of improvement. The percentage improvement was computed to compare the performance of the eHMCOKE and the IMCOKE algorithms (5).
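A minimal Python sketch of the two speed measures is given below. Since (5) is not reproduced here, the percentage of improvement is assumed to be the relative reduction in runtime with respect to the baseline algorithm. The helper names `timed` and `percentage_improvement` and the example timings are hypothetical.

```python
import time


def timed(fn, *args):
    """Run fn(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start  # end time minus start time
    return result, elapsed


def percentage_improvement(baseline_time, new_time):
    """Assumed form of (5): relative runtime reduction in percent."""
    return (baseline_time - new_time) / baseline_time * 100.0


# Made-up timings: a baseline of 2.0 s reduced to 1.5 s.
total, seconds = timed(sum, range(1000))
print(percentage_improvement(2.0, 1.5))  # 25.0
```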

Accuracy
Recall, precision and F-measure were calculated over pairs of points to evaluate the accuracy of the overlapping clustering results. Precision is the fraction of the identified same-cluster pairs that are correct, and recall is the fraction of the actual same-cluster pairs that were identified. The formula for precision is shown in (6) while that for recall is shown in (7) [20].
For outlier detection, precision measures how well the method detects true outliers with as few false positives as possible. A large number of false positives indicates a low precision. Recall measures how well the outlier detection captures most or all outliers with as few false negatives as possible. A low recall indicates a large number of false negatives. Equation (8) shows the formula for precision while (9) shows the formula for recall [21].
Precision = TP / (TP + FP) (8)
Recall = TP / (TP + FN) (9)
where true positives (TP) are correctly predicted true outliers, false positives (FP) are objects predicted as outliers that are not, and false negatives (FN) are true outliers predicted as not being outliers.
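The pair-based precision and recall described above can be sketched in Python as follows. This sketch assumes that two objects form a same-cluster pair when they share at least one cluster, which is a common convention for overlapping clusterings but is not confirmed by the text. The function names and sample memberships are hypothetical.

```python
from itertools import combinations


def co_clustered_pairs(memberships):
    """Pairs (i, j) of objects that share at least one cluster.

    memberships is a list of sets of cluster indices, one set per object.
    """
    n = len(memberships)
    return {(i, j) for i, j in combinations(range(n), 2)
            if memberships[i] & memberships[j]}


def pairwise_precision_recall(predicted, actual):
    """Pair-based precision and recall between two overlapping clusterings."""
    pred_pairs = co_clustered_pairs(predicted)
    true_pairs = co_clustered_pairs(actual)
    correct = pred_pairs & true_pairs
    precision = len(correct) / len(pred_pairs) if pred_pairs else 0.0
    recall = len(correct) / len(true_pairs) if true_pairs else 0.0
    return precision, recall


# Made-up example: the prediction misses the overlap of object 2 with cluster 1.
actual = [{0}, {0}, {0, 1}, {1}]
predicted = [{0}, {0}, {0}, {1}]
print(pairwise_precision_recall(predicted, actual))  # (1.0, 0.75)
```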
To combine precision and recall into a single score, the F-measure, also referred to as the F1 score, was used. The F1 score computes the harmonic mean of recall and precision [22]. A higher F1 score indicates better detection accuracy, where 0 means the worst and 1 means perfect detection [23]. Equation (10) shows the calculation of the F-measure.
F1 = 2 * Precision * Recall / (Precision + Recall) (10)
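The outlier-detection precision, recall and F1 score can be computed from the TP, FP and FN counts as in the short Python sketch below. The function name and the example counts are hypothetical; the zero-division guards are an added convention, not part of the text.

```python
def precision_recall_f1(tp, fp, fn):
    """Precision, recall and F1 from true positive, false positive
    and false negative counts. Returns 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts: 4 outliers found correctly, 1 false alarm, 1 missed outlier.
precision, recall, f1 = precision_recall_f1(tp=4, fp=1, fn=1)
print(round(precision, 3), round(recall, 3), round(f1, 3))
```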

RESULTS AND DISCUSSION
In this section, three experiments were conducted to test the eHMCOKE algorithm. Synthetic and real datasets were used.

Experiment 1
The objective of the first experiment was to contrast the results of the two strategies for outlier detection used in the clustering analysis. The study compared the accuracy rate of the MAD outlier detection procedure when applied before clustering of the data and after clustering of the data. Experiments were made on synthetic and real data sets.
The synthetic data set consists of two attributes (Rating, Absences) with 50 instances. Five outliers were intentionally incorporated in the data set; therefore, 45 instances are normal, and five instances are unusual data, also known as outliers (Student 46 to Student 50). Table 1 shows the synthetic dataset. The synthetic dataset was used for the first experimental run, and the data were plotted in a 2-dimensional space as shown in Figure 5. Then, the MAD outlier detection procedure was tested to find outliers before the clustering of the data. Figure 6 shows the visualization results; red dots are the outliers in the dataset recognized by MAD before performing the clustering method. The outliers found were removed from the dataset. The same synthetic dataset was then processed again for the identification of outliers. This time, MAD was tested after the clustering of the data. First, data objects were segregated into a number of clusters with the use of the K-means algorithm. K was initialized randomly, then cluster centroids were formed based on the initial number K, and data objects were assigned to them. For this experiment, the user selected K=3 cluster centroids, and each data object was assigned to its nearest cluster based on its Euclidean distance. The test was run 5 to 20 times with different numbers of clusters k, and the best result was used in the experiment. Figure 7 shows the output for the 50 data objects with 2 clusters.
A second test was conducted to evaluate MAD outlier detection on real datasets obtained from the UCI Machine Learning Repository. In this test, the Iris plant dataset was considered. The results for the Iris plant dataset are shown in Figure 8 and Figure 9. The results of the tests conducted are summarized in Table 2. Based on the results, MAD achieved a higher accuracy rate of 100% before the clustering of data on the synthetic dataset. For the Iris plant dataset, MAD obtained the best performance of 89% accuracy rate before the clustering of data.
As seen in Table 2, applying MAD before clustering of the data achieved a higher accuracy rate in finding outliers in the datasets. The outcomes of this series of experiments provide substantial evidence that detecting outliers before performing clustering analysis works well with different types of datasets.
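The K-means step used in this test (assign each object to its nearest centroid by Euclidean distance, then recompute the centroids) can be sketched in Python as follows. This is a minimal sketch, not the implementation used in the study: the initial centroids are passed in explicitly for reproducibility rather than chosen at random, and the function name and sample points are hypothetical.

```python
from math import dist


def kmeans(points, centroids, max_iter=100):
    """Basic K-means with caller-supplied initial centroids (list of tuples)."""
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest centroid by Euclidean distance.
        labels = [min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
                  for p in points]
        # Update step: each centroid becomes the mean of its members.
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, l in zip(points, labels) if l == j]
            if members:
                new_centroids.append(
                    tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return labels, centroids


# Made-up data: two well-separated groups of three points each.
data = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
labels, centers = kmeans(data, [(0.0, 0.0), (10.0, 10.0)])
print(labels)  # [0, 0, 0, 1, 1, 1]
```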

Experiment 2
The aim of the second experiment was to test whether the additional parameters added to the algorithm significantly affect the time needed to detect objects that overlap.
To test the runtime execution of each algorithm, synthetic datasets were used: two Gaussian cluster datasets (G2-2-30, G2-2-50), one high-dimensional dataset and one compound dataset [24]. To obtain clear insight into the clustering capability of the different clustering methods, a simulation of the clustering results on each dataset was done. The simulation results for the different scenarios using the IMCOKE and eHMCOKE are shown in Figure 10. A summary of the experimental results for the runtime execution of the two algorithms is shown in Table 3. The results indicate that the measurements of cluster size and distance between clusters affect the execution time in identifying objects that overlap between clusters. The IMCOKE ignores the size of the clusters and the distance between clusters, which makes the identification of overlapping clusters quite time-consuming, especially with a larger number of clusters. As a result, the eHMCOKE performs better in terms of runtime execution even with a substantially larger number of data objects.

Experiment 3
The third experiment tested the accuracy of the two algorithms in identifying overlapping clusters. The synthetic data set used (Synthetic 1) is composed of 37 observations with two attributes; three are considered linked pairs that will overlap, and two are treated as outliers. Figure 11 illustrates the simulation result of the actual data.
The study performed three tests with two approaches, one using the IMCOKE algorithm and another using the eHMCOKE algorithm for comparison. In the IMCOKE algorithm, segmentation of objects into clusters was established first, before the detection of outliers, that is, before the incorporation of MAD. In this experiment, the user input two cluster centroids (K=2), and clusters were formed once each object was assigned to its nearest cluster center.
Based on the simulation result, IMCOKE considered the outliers as members of one cluster; the outliers were therefore not identified, because clusters are formed prior to the identification of outliers. Then maxdist was used to identify objects belonging to multiple clusters. As shown in Figure 12, no overlaps were identified using the IMCOKE algorithm.
In the eHMCOKE algorithm experiment, the study applied MAD before the clustering of the dataset, since this resulted in a higher accuracy rate in detecting outliers in the first experiment. The same synthetic data set (Synthetic 1) was used. Before segmenting the data into their assigned clusters, eHMCOKE isolates the outliers in the dataset with the use of MAD. With the integration of MAD, Figure 13 shows that the outliers were accurately discovered. Researchers have emphasized that isolating unusual data in a dataset produces a more correct and precise outcome in data mining, so isolating these data from the dataset is significant [25], [26]. The outliers found were separated from the normal dataset and were no longer considered part of the procedure for detecting clusters that overlap. Figure 14 shows the visualization result of the cleaned dataset.
The same dataset was then processed, with the algorithm taking two cluster centroids as input to form the clusters. This was followed by the identification of overlapping clusters. At this stage, additional parameters, namely the radius of clusters and the distance between clusters, were added to the algorithm procedure. The study assumed that these parameters could also assist in the overlapping clustering process. Calculating these parameters and then using maxdist gives a high probability of finding patterns that overlap with other clusters. As shown in Figure 16, the simulation results of the eHMCOKE showed that the enhanced algorithm was able to accurately detect the three considered linked pairs that overlap in the dataset. Figure 16 shows the simulation result of the actual data. The test data contains two clusters, and the results are shown in Figure 17 and Figure 18.
The summary of the experimental results for all the cluster combinations performed using the two synthetic datasets is shown in Table 4. Based on the results, the eHMCOKE achieved the best performance of 100% on the Synthetic 1 dataset, which means that the eHMCOKE algorithm outperforms the IMCOKE algorithm. For the Synthetic 2 dataset, the eHMCOKE algorithm obtained a higher accuracy rate of 83%, which outperformed the IMCOKE algorithm. Table 4 shows that the eHMCOKE achieved a higher accuracy rate in finding overlapping data.

CONCLUSIONS AND RECOMMENDATIONS
Based on the findings of this research, the MAD procedure was applied before clustering of the data in eHMCOKE since it results in a consistently higher accuracy rate compared to applying MAD after clustering of the datasets. The eHMCOKE algorithm performed faster than the IMCOKE algorithm, with an improvement rate of 22% in identifying overlapping clusters. The eHMCOKE algorithm achieved an improvement rate of 99% over the IMCOKE algorithm based on its F1-score. The conclusions stated above show that the incorporation of outlier detection prior to clustering improves the ability of the eHMCOKE to detect outliers. This has led to better identification of overlapping clusters. The use of the additional parameters also contributed to the enhancement of the algorithm in terms of runtime execution. Thus, the study has achieved its objective of producing an eHMCOKE algorithm with better performance compared to the existing IMCOKE algorithm.
Furthermore, it is recommended that other test measures such as FBCubed and pair-based evaluation be considered to evaluate the performance of the enhanced algorithm. Since the eHMCOKE still uses the traditional k-means algorithm, it is still sensitive to the random initialization of the cluster centroids. An alternative approach to random initialization is recommended. The eHMCOKE can only be used with numeric data input; the algorithm may be improved so that it accepts textual inputs. New applications of the enhanced algorithm may also be explored.