Handling concept drifts and limited label problems using semi-supervised combine-merge Gaussian mixture model

ABSTRACT


INTRODUCTION
Today, we live in a digital age where practically anything can be done online. Consequently, the volume of data generated is increasing at an exponential rate. According to a recent Statista report [1], the data generated in 2020 is estimated at 64.2 zettabytes (trillion gigabytes), and it is predicted to grow by over 19.2% annually in 2021-2025. Some of this data is generated continuously and arrives in unbounded streams, such as IoT and social media data; this type of data is known as a data stream. Because of these characteristics, the distribution of a data stream may change over time, driven by factors such as changes in human behavior, the environment, and trends. The phenomenon where the distribution of data changes over time is called concept drift [2]. Concept drift causes model performance to degrade because the model is trained and evaluated on data with different distributions. Many adaptive methods have been developed to address this problem [3], [4]. Most of them work by updating or retraining the model and then redeploying it periodically or when concept drift is detected. The critical ingredient for updating or retraining the model is labeled data. It is impossible to label all the data in a stream because of its volume and velocity; moreover, labeling can be time-consuming and expensive because it usually requires human input.
Several methods have been proposed to solve the concept drift problem [2], [5]-[7] or the label limitation problem [8]. However, when these two problems occur simultaneously, retraining under limited labels may not be effective because the model cannot capture the complete concept. In addition, several concept drift detection methods depend heavily on the model's output, which automatically limits their ability to detect drift under label limitation. Therefore, besides adapting to concept drift, the ability to cope with limited labels is an essential feature of a model that addresses both problems simultaneously.
In our previous work [9]-[11], we proposed the combine-merge Gaussian mixture model (CMGMM), an incremental algorithm for handling concept drift based on the Gaussian mixture model (GMM). The algorithm adapts to emerging concept drift by adding new components and by updating or removing existing ones. Its main benefit is the ability to learn continuously from streaming data while using a local restoration technique to preserve previously learned information and avoid catastrophic forgetting. However, this solution only works when the data are fully labeled.
This paper proposes the semi-supervised CMGMM (SCMGMM) to address the concept drift and limited-label problems simultaneously. SCMGMM extends CMGMM so that it can adapt to new data that is only partially labeled. The training and classification processes are the same as in CMGMM, but the adaptation process differs: SCMGMM assigns pseudo labels to fill in the missing labels before adaptation. This study aims to develop an incremental model that avoids performance degradation due to noisy pseudo labels. The rest of this paper is organized as follows: section 2 reviews related work; section 3 explains our proposed method; section 4 describes the experimental setup; section 5 discusses the experimental results; and section 6 provides our conclusion.

RELATED WORK

Concept drift
In this paper, the term concept drift indicates an arbitrary change in the statistical properties of a target domain [7], [12]. Statistically, we define a concept by the prior class probabilities P(y) and the class-conditional probabilities P(X|y) [13], where X denotes the input features and y the corresponding label in dataset D. P(y) and P(X|y) determine the joint distribution P(X, y). Over time, P(y), P(X|y), and hence P(X, y) may change because of changes in human or user behaviour and in the environment. This situation is expressed as:

P_t0(X, y) ≠ P_t(X, y) (1)

Equation (1) expresses a difference between the joint distributions at two time windows t0 and t. As a result, a model built on data from t0 is not suitable for use at t, because the relationship between X and y may have changed, increasing the error and decreasing the accuracy at t [14].

Combine-merge Gaussian mixture model
The combine-merge Gaussian mixture model (CMGMM) [9] is an incremental learning algorithm designed to solve the concept drift problem. CMGMM can add new components from newly arriving data and update or delete existing components in response to changes in the current data. The algorithm consists of four main processes: training the best model, data prediction, concept drift detection, and model adaptation. In the training process, we employ the expectation-maximization (EM) [15] algorithm to train the model and the Bayesian information criterion (BIC) [16] to select the best model. To classify data, we compute the log-likelihood under each class's mixture and choose the class with the highest value. Model adaptation adjusts the current model in response to newly received data containing new concepts or concept modifications; the result is an adapted, weighted set of mixture components that respects the original mixture. Please refer to [9] for the detailed process and implementation of CMGMM.
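As a concrete illustration of this training scheme, the sketch below (not the authors' implementation; it uses scikit-learn and illustrative synthetic 1-D data) fits one mixture per class with EM, selects the component count by BIC, and classifies a point by the highest per-class log-likelihood:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Two well-separated 1-D classes, each drawn from a single Gaussian.
X0 = rng.normal(loc=0.0, scale=1.0, size=(300, 1))
X1 = rng.normal(loc=8.0, scale=1.0, size=(300, 1))

def fit_best_gmm(X, max_components=5):
    """Fit GMMs with EM for several component counts; keep the lowest BIC."""
    best, best_bic = None, np.inf
    for k in range(1, max_components + 1):
        gmm = GaussianMixture(n_components=k, random_state=0).fit(X)
        bic = gmm.bic(X)
        if bic < best_bic:
            best, best_bic = gmm, bic
    return best

# One mixture per class, as in the log-likelihood classification scheme.
models = {0: fit_best_gmm(X0), 1: fit_best_gmm(X1)}

def classify(x):
    """Assign the class whose mixture gives the highest log-likelihood."""
    scores = {c: m.score_samples(np.atleast_2d(x))[0] for c, m in models.items()}
    return max(scores, key=scores.get)

print(classify([0.5]), classify([7.5]))  # expect class 0, then class 1
```

In a full implementation the per-class log-likelihood would also be weighted by the class prior; it is omitted here because the classes are balanced.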

PROPOSED METHOD
In this paper, we propose the semi-supervised CMGMM (SCMGMM) as an extension of the combine-merge Gaussian mixture model (CMGMM) [9] to classify data under concept drift and limited-label situations. SCMGMM and CMGMM therefore have many similarities, except in the incremental phase. The process of SCMGMM is illustrated in Figure 1.
First, we build an optimal model M from a fully labeled dataset using the EM [15] algorithm and select the best model using BIC [16]. In the incremental process, not all data are labeled, so we employ label propagation to generate pseudo labels for the unlabeled data. When the kernel density drift detector (KD3) [7] detects concept drift, the model triggers the adaptation process, which uses both the labeled and the pseudo-labeled data. The processes in which SCMGMM differs from CMGMM are explained in the subsections that follow.

Pseudo label propagation
The pseudo label propagation process uses labeled data to predict labels for unlabeled data under a consistency assumption. The predicted label assigned to an unlabeled instance is called a pseudo label. The model used to predict pseudo labels is trained on the current window of data using the label spreading method [17], which assumes that nearby instances, and instances lying on the same structure, share the same label. Finally, using both the labeled and pseudo-labeled data, we evaluate and adapt to the concept drift.
In training, we build an affinity graph over the window data X using the RBF kernel. The edge weights E are computed as in (2):

E_ij = exp(−‖x_i − x_j‖² / 2σ²) for i ≠ j, and E_ii = 0 (2)

We then construct the normalized matrix S = D^(−1/2) E D^(−1/2), where D is a diagonal matrix whose i-th diagonal element is the sum of the i-th row of E. Finally, we compute F, an n × c non-negative matrix containing the class-probability distribution of each instance in X, and apply a regularization function to smooth the labels. The detailed derivations of the probability and regularization functions can be found in (1) and (4) of [17].
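A minimal sketch of this pseudo labeling step, using scikit-learn's `LabelSpreading` (which implements the RBF-kernel label spreading method of [17]) on synthetic two-blob data with roughly 90% of labels hidden; the data and parameter values are illustrative, not taken from the paper:

```python
import numpy as np
from sklearn.semi_supervised import LabelSpreading

rng = np.random.default_rng(0)

# Two well-separated 2-D blobs; most true labels will be hidden.
X = np.vstack([rng.normal(0, 1, size=(100, 2)),
               rng.normal(10, 1, size=(100, 2))])
y_true = np.array([0] * 100 + [1] * 100)

# Hide ~90% of labels; scikit-learn marks unlabeled points with -1.
y_partial = y_true.copy()
hidden = rng.random(200) < 0.9
y_partial[hidden] = -1

# RBF-kernel label spreading over the affinity graph.
ls = LabelSpreading(kernel="rbf", gamma=0.5, alpha=0.2)
ls.fit(X, y_partial)
pseudo = ls.transduction_  # labels for every point, hidden ones included

acc = (pseudo[hidden] == y_true[hidden]).mean()
print(f"pseudo-label accuracy on hidden points: {acc:.3f}")
```

With well-separated clusters the propagated labels recover the hidden ones almost perfectly; in the streaming setting this step is rerun on each window before adaptation.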

SCMGMM drift adaptation
SCMGMM drift adaptation combines all existing components and then optimizes the model by reducing similar components. The core of this method is a K-component Gaussian mixture, denoted {(ω1, μ1, Σ1), (ω2, μ2, Σ2), …, (ωK, μK, ΣK)}, where ωk, μk, and Σk are the prior probability (weight), mean, and covariance of component k. The weights must satisfy ω1 + ω2 + ⋯ + ωK = 1, and the mixture has the probability density function in (3):

f(x) = Σ_{k=1}^{K} ωk N(x; μk, Σk) (3)
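For intuition, the density in (3) can be evaluated directly; a small 1-D numpy sketch (the component values are illustrative) checks that the weights sum to one and that the mixture integrates to one:

```python
import numpy as np

# A 3-component 1-D Gaussian mixture: (weight, mean, std) triples.
components = [(0.5, 0.0, 1.0), (0.3, 4.0, 0.5), (0.2, -3.0, 2.0)]
assert abs(sum(w for w, _, _ in components) - 1.0) < 1e-12  # weights sum to 1

def mixture_pdf(x, components):
    """Evaluate f(x) = sum_k w_k * N(x; mu_k, sigma_k^2), i.e. equation (3) in 1-D."""
    x = np.asarray(x, dtype=float)
    total = np.zeros_like(x)
    for w, mu, sigma in components:
        total += w * np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
    return total

# Sanity check: a valid density integrates to 1 (trapezoidal rule over a wide grid).
grid = np.linspace(-15.0, 15.0, 20001)
f = mixture_pdf(grid, components)
integral = float(np.sum((f[1:] + f[:-1]) * np.diff(grid) / 2.0))
print(f"integral over [-15, 15]: {integral:.6f}")
```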
In this study, the adaptation process adds new components, merges similar components, and removes obsolete components. It begins by building a local model M_t from the recent incoming data, representing the new concepts or concept updates in the data. We then combine the components of M_t with those of the current model M into a candidate model M'. This step appends any new concepts from M_t that may not exist in M from the initial training. Similar components are then merged; the similarity between two Gaussian components N1 = N(μ1, Σ1) and N2 = N(μ2, Σ2) is measured by the Kullback-Leibler divergence in (7):

KL(N1 ‖ N2) = ½ [ tr(Σ2⁻¹ Σ1) + (μ2 − μ1)ᵀ Σ2⁻¹ (μ2 − μ1) − d + ln( det(Σ2) / det(Σ1) ) ] (7)

where d is the dimensionality of the data. The detailed algorithm for model adaptation is shown in algorithm 1.
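One common way to score the similarity of two Gaussian components for such a merge decision is the closed-form Kullback-Leibler divergence between them; the numpy sketch below computes it (the merge threshold shown is hypothetical, not a value from the paper):

```python
import numpy as np

def gauss_kl(mu1, cov1, mu2, cov2):
    """Closed-form KL(N1 || N2) between two multivariate Gaussians."""
    mu1, mu2 = np.asarray(mu1, float), np.asarray(mu2, float)
    cov1, cov2 = np.asarray(cov1, float), np.asarray(cov2, float)
    d = mu1.shape[0]
    inv2 = np.linalg.inv(cov2)
    diff = mu2 - mu1
    return 0.5 * (np.trace(inv2 @ cov1)
                  + diff @ inv2 @ diff
                  - d
                  + np.log(np.linalg.det(cov2) / np.linalg.det(cov1)))

I = np.eye(2)
near = gauss_kl([0, 0], I, [0.1, 0.0], I)   # almost identical components
far = gauss_kl([0, 0], I, [5.0, 0.0], I)    # clearly distinct components
print(near, far)  # near is tiny, far is large

# Merge rule sketch: fuse components whose divergence falls below a threshold.
THRESHOLD = 0.5  # hypothetical value, not taken from the paper
print("merge near pair:", near < THRESHOLD, "| merge far pair:", far < THRESHOLD)
```

When two components are merged, their weights are summed and the fused mean and covariance are taken as the moment-matched combination, so the adapted mixture still sums to one.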

EXPERIMENT SETUP
This section describes the datasets, comparison methods, evaluation metrics, and hyperparameters used in this study to train and evaluate the proposed method.

Datasets
We used real-world and synthetic datasets to evaluate the proposed algorithm. These datasets contain concept drift and are evaluated at several levels of label availability, namely 90%, 75%, 50%, 25%, 5%, and 1%. The datasets used in the experiments are:
− Rotating hyperplane [18] is an artificial dataset containing hyperplane data in d-dimensional space that continuously changes its position and orientation.
− SEA [19] is an artificial dataset that consists of three attributes and 50,000 data instances. It contains abrupt concept drift simulated by four different concepts, changing the class decision boundary every 12,500 data points.
− CR4 [20] is an artificial dataset containing 144,400 samples of four classes rotating separately in 2-dimensional space.
− FG2C2D [20] is an artificial dataset that contains two bidimensional classes and two types of concept drift, gradual and incremental, every 200 data points. There are 200,000 samples and two classes in this dataset.
− GEARS2C2D [20] is an artificial dataset that contains two rotating gears represented as two classes. There are 200,000 samples and two classes in this dataset.
− MG2C2D [21] is an artificial dataset that contains two bidimensional multimodal Gaussian classes.
− NOAA is a real-world weather dataset collected over 50 years at Bellevue, Nebraska. It has 18,159 samples in two classes, rain and no-rain.
− Electricity is one of the most widely used real-world datasets for evaluating concept drift. Collected by [22], it contains electricity market records from New South Wales, Australia.
− Spam was collected by [23] to separate malicious spam emails from legitimate ones.
− Phishing was collected by [24] and contains data about malicious web pages.

Comparison methods
In our evaluation, we compare our proposed algorithm SCMGMM with three state-of-the-art algorithms from the related literature:
− Self-adjusting memory kNN (SAM-kNN) [25] is an incremental learning algorithm based on k-nearest neighbors (kNN), designed to deal with varied types of concept drift within streaming data.
− Learn++.NSE [12] is an incremental ensemble classifier for nonstationary environments (NSE). It uses a weighting strategy and a base-classifier association mechanism to track concept drift.

− Hoeffding tree [26] is an incremental decision-tree learning algorithm for streaming data. It keeps track of the most recent streaming instances in a graph; to replace an old instance with a new one, a star-mesh transform technique is used.
These algorithms are designed to solve only the concept drift problem, so we apply the pseudo label propagation method to them to handle the limited label problem as well. The source code of the proposed and comparison methods, along with the datasets, is available in our public repository.

Evaluation method and metric
We evaluate the proposed method's performance using the interleaved test-then-train, or prequential, evaluation method. All the classifiers are trained on the same fully labeled training dataset. In the evaluation, newly incoming data is processed in fixed windows. At this stage, labels are removed from the data according to the label availability setting. We then perform pseudo labeling, test on the incoming data stream, and update or retrain the model if concept drift is detected. Under this scheme, the model is always tested on data it has not yet seen. The benefit of this approach is that it requires no holdout set for testing, allowing us to utilize the available data efficiently.
Table 1 presents the experimental results of SCMGMM, SAM-kNN, Learn++.NSE, and Hoeffding tree on 10 datasets. SCMGMM shows better accuracy than the other methods, especially on CR4, GEARS2C2D, MG2C2D, and NOAA. Across all models and datasets, the model's accuracy follows the accuracy of the label propagation: when label propagation accuracy is low, model accuracy is also low, and vice versa. For example, on SEA at 90% label availability, SCMGMM accuracy is 0.8258 and label propagation accuracy is 0.7534.
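The prequential test-then-train protocol can be sketched in a few lines; the stand-in `MajorityClassModel` is purely illustrative (the actual evaluation uses SCMGMM and the comparison classifiers):

```python
from collections import Counter

class MajorityClassModel:
    """Trivial stand-in learner: predicts the most frequent label seen so far."""
    def __init__(self):
        self.counts = Counter()
    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else 0
    def partial_fit(self, x, y):
        self.counts[y] += 1

def prequential(stream, model):
    """Test-then-train: predict each instance first, then learn from it."""
    correct = 0
    for x, y in stream:
        correct += int(model.predict(x) == y)  # test on the unseen instance
        model.partial_fit(x, y)                # then train on it
    return correct / len(stream)

# Ten instances, all labeled 1; only the very first prediction can be wrong.
stream = [(None, 1)] * 10
acc = prequential(stream, MajorityClassModel())
print(acc)  # 0.9
```

Because every instance is tested before it is used for training, the accuracy reflects performance on data the model has never seen, with no holdout set required.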

RESULTS AND DISCUSSION
On the artificial datasets CR4, FG2C2D, GEARS2C2D, and MG2C2D, where the data are generated from Gaussian distributions, the graph-based pseudo labeling and SCMGMM gave better results. However, SCMGMM shows its worst performance on the hyperplane dataset. This dataset contains data points separated by a hyperplane (of the form Σᵢ wᵢxᵢ = w₀) that rotates in a random direction, so the graph-based pseudo labeling method cannot work well.
Other algorithms, such as the Hoeffding tree and SAM-kNN, also show good accuracy. The Hoeffding tree performs best on Hyperplane, SEA, and Electricity, while SAM-kNN performs best only on Phishing. However, the accuracy of Learn++.NSE is very low on several datasets, such as CR4, MG2C2D, FG2C2D, and GEARS2C2D, where it is accurate less than 10% of the time. This algorithm is not suitable for synthetic datasets generated from Gaussian distributions.
Misclassifications in the pseudo labels act as noise at the adaptation stage, and this noise can decrease model accuracy. The lower the label availability, the higher the noise in the pseudo labels. As Figure 2 shows, all classifiers that use the label propagation method maintain their performance down to a label availability of 5%. Decreasing label availability from 90% to 5% reduced label propagation performance by only 5.2% on average. The accuracy of label propagation reflects model performance: the higher the label propagation accuracy, the better the model performs. However, at 1% label availability there are significant performance drops, especially on SEA, FG2C2D, and Phishing.
A statistical analysis of the experimental results was conducted for the four algorithms with 60 paired samples, at significance level alpha = 0.050. For the populations SCMGMM (p=0.001), Hoeffding tree (p=0.000), SAM-kNN (p=0.003), and Learn++.NSE (p=0.001), we rejected the null hypothesis that the population is normally distributed. Thus, we assumed that not all populations are normal.
Furthermore, because we have more than two populations and some of them are not normally distributed, we utilize the non-parametric Friedman test as an omnibus test to check whether there are any significant differences between the median values of the populations. We also use the post-hoc Nemenyi test to infer which differences are significant. In this test, we report the median (MD), the median absolute deviation (MAD), and the mean rank (MR) of each population over the samples. Differences between populations are significant if the mean rank difference is greater than the critical distance CD=0.606 of the Nemenyi test. Figure 3 shows the result of the post-hoc Nemenyi test.
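The reported critical distance CD=0.606 follows from the standard Nemenyi formula CD = q_α · sqrt(k(k+1)/(6N)); with k=4 algorithms, N=60 paired samples, and the usual critical value q_0.05 ≈ 2.569 for four groups, a short computation reproduces it:

```python
import math

# Nemenyi critical distance: CD = q_alpha * sqrt(k * (k + 1) / (6 * N))
q_alpha = 2.569  # studentized-range-based critical value for k = 4, alpha = 0.05
k = 4            # number of compared algorithms
N = 60           # number of paired samples

cd = q_alpha * math.sqrt(k * (k + 1) / (6 * N))
print(f"CD = {cd:.3f}")  # CD = 0.606
```

Any two algorithms whose mean ranks differ by more than this value are considered significantly different at the 0.05 level.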

CONCLUSION
This paper proposes an incremental algorithm that handles concept drift and label scarcity simultaneously. The experimental results show that our proposed algorithm outperforms the comparison methods on 5 of the 10 datasets and has the highest overall accuracy. The accuracy of the label propagation strongly affects the accuracy of the model: the higher the label propagation accuracy, the higher the model accuracy. Our contributions are a method that maintains model performance using pseudo label propagation even when label availability drops drastically from 90% to 5%, and a label propagation step that can also be applied to other incremental methods. As future work, we plan to investigate the effectiveness of the adapted data supplied by concept drift detectors under several types of concept drift, and to evaluate this algorithm on datasets with complex features such as images, sounds, and videos.