A novel imbalanced data classification approach using both under and over sampling

Received Jan 6, 2021 Revised Apr 13, 2021 Accepted Aug 2, 2021
The performance of data classification suffers when the data distribution is imbalanced. Classifiers then tend toward the majority class, which contains most of the instances. One popular remedy is to balance the dataset using over and under sampling methods. This paper presents a novel pre-processing technique that applies both over and under sampling to an imbalanced dataset. The proposed method uses the SMOTE algorithm to enlarge the minority class. In addition, a cluster-based approach reduces the majority class while taking the new size of the minority class into account. Experimental results on 10 imbalanced datasets show that the suggested algorithm performs better than previous approaches.


INTRODUCTION
An essential challenge faced by traditional classification algorithms is a data distribution in which the classes are imbalanced. For example, legitimate transactions far outnumber fraudulent transactions. In this situation, classifiers tend toward the majority class and ignore the minority one. Classical approaches to imbalanced data classification fall into three categories. The first comprises algorithm-level methods, which strengthen the classification algorithm to steer learning towards the minority samples [1], [2]. The second comprises ensemble classifiers, which include two methods [3], [4]: bagging and boosting. Bagging trains various classifiers on subsets of the dataset [5]. In boosting, the complete dataset is used to train the classifiers so that more attention is given to the samples that are misclassified [6], [7]. The third category entails pre-processing the data to balance it before it is used as input data or to improve the classifiers. These data-level techniques are preferred because they have a wide range of applications [8], [9].
The major objective of data-level algorithms is to either decrease or increase the number of samples in a class, so that both classes end up with the same number of samples. The under-sampling approach reduces the instances of the majority class. This technique can discard useful information that could be essential for classifiers. Moreover, the reduced sample can be an inaccurate representation of the population. The over-sampling approach increases the size of the minority class by replicating samples [10], [11]. Unlike under-sampling, this approach leads to no information loss. However, it increases the probability of overfitting because the minority class samples are reproduced [12], [13].
To overcome the challenges of under and over sampling algorithms, some researchers have proposed combining the approach with other techniques. Khan et al. [14] described a cost-sensitive method based on neural networks that learns feature representations for both the minority and the majority class. This method improves the classifier itself, so the original data are left unchanged. Castellanos et al. [15] suggested a strategy based on string conversion, which transfers the SMOTE technique to a string space. The improvement of this method is 97.5 according to the F-measure score.
Many researchers have used clustering to balance the classes. Prachuabsupakij et al. [16] suggested an approach in which a k-means based method decreases the overlap between the classes. This method divides the original dataset into two clusters. Then, a cluster switching method and the SMOTE technique are applied to each cluster. The output is two balanced training sets. This model divides the majority class into two clusters without regard to the size of the minority class. The average F-measure of this model was 0.975. Czarnowski et al. [17] presented a clustering-based approach in which a similarity coefficient is computed for the samples of each data class independently. Then, similar samples are grouped into the same cluster. The maximum accuracy of this method is 98.01. Lin et al. [18] proposed an under-sampling method that uses clustering to divide the majority data into k clusters. This algorithm calculates k with regard to the minority class size. The best average classification accuracy is 0.904.
This paper introduces a combined algorithm to overcome the imbalance problem. It generates minority class items with the SMOTE method and uses a clustering algorithm to reduce the majority class. Unlike previous approaches, it clusters the majority class with regard to the new minority class. The novelty of this work is that the increase of the minority class and the decrease of the corresponding majority class are determined together. The paper is arranged as follows: the next section covers the relevant background techniques and the proposed algorithm, sections 3 and 4 provide the experimental results, and concluding points are given in section 5.

RESEARCH METHOD
Before the proposed algorithm is introduced, the necessary background is summarized. This work uses the SMOTE technique to increase the number of minority class samples. Moreover, the approach uses a clustering method as an under-sampling algorithm to decrease the majority class.

SMOTE technique
This algorithm carries out over-sampling to balance imbalanced data [19]. The main idea of the method is to produce synthetic samples. A new instance is created by interpolating between minority class samples that are neighbors in feature space. Therefore, it operates in feature space rather than data space. In other words, the method considers both the values of the features and the relationships between them [12]. Figure 1 depicts a simple example of SMOTE. First, a minority class sample i is selected as the basis for new synthetic points. Then, several nearest neighbors are selected according to a distance metric. Finally, k of them are chosen at random to obtain the new samples by interpolation (i to k). For each one, the difference between the considered instance and the neighbor is multiplied by a random coefficient between 0 and 1 and added to the considered instance. Consequently, new candidate points (rd1 to rdk) are produced, one of which is chosen at random.
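The interpolation step described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not the reference SMOTE implementation: the function names `smote_sample` and `smote`, the Euclidean distance metric, and the choice of one random neighbor per synthetic point are assumptions made for the example.

```python
import math
import random


def smote_sample(minority, index, k=5, rng=random):
    """Create one synthetic point from minority[index] and one of its k nearest neighbors."""
    base = minority[index]
    # Rank the other minority samples by Euclidean distance to the base sample (assumed metric).
    others = [p for j, p in enumerate(minority) if j != index]
    others.sort(key=lambda p: math.dist(base, p))
    # Pick one of the k nearest neighbors at random.
    neighbor = rng.choice(others[:k])
    # Move a random fraction (between 0 and 1) of the way from the base toward the neighbor.
    gap = rng.random()
    return [b + gap * (n - b) for b, n in zip(base, neighbor)]


def smote(minority, n_new, k=5, seed=0):
    """Return n_new synthetic minority samples (hypothetical helper for illustration)."""
    rng = random.Random(seed)
    return [smote_sample(minority, rng.randrange(len(minority)), k, rng)
            for _ in range(n_new)]


# Hypothetical two-feature minority class.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
synthetic = smote(minority, n_new=4, k=2)
print(len(synthetic))  # 4
```

Because each synthetic point lies on the segment between two existing minority samples, the new points stay inside the region already occupied by the minority class.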

K-means algorithm
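The k-means method is widely used in the machine learning area. It is an iterative technique that divides the dataset into k distinct clusters so that each data item belongs to only one cluster [20]. The algorithm alternates between assigning each item to its nearest cluster centroid and recomputing the centroids, which reduces the within-cluster sum of squared distances (1).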
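For reference, the k-means objective is commonly written as the within-cluster sum of squared distances. The symbols below ($x_i$ for a data item, $C_j$ for the j-th cluster, $\mu_j$ for its centroid) are assumed notation and may differ from the form used in (1).

```latex
J = \sum_{j=1}^{k} \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2,
\qquad
\mu_j = \frac{1}{|C_j|} \sum_{x_i \in C_j} x_i
```

Each iteration assigns every item to its nearest centroid and then recomputes the centroids, so $J$ never increases from one iteration to the next.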

Proposed algorithm
This paper combines under and over sampling approaches. The proposed method applies the SMOTE algorithm to the minority class to increase its samples. Moreover, it uses a clustering method to decrease the majority class without losing data. The phases of the algorithm are as follows:
− Performing the SMOTE technique on the minority class
− Computing the number of clusters as the ratio of the majority class size to the size of the new minority class
− Performing the k-means algorithm on the majority class
− Combining each cluster with the new minority class
− Training a classifier on each combined set
− Classifying with the maximum probability vote
Figure 2 shows the flow chart of the proposed algorithm. The method increases the minority class by considering the imbalance ratio (IR) of the dataset. The number of clusters (denoted K) is determined from the new minority size and the majority size: K equals the size of the majority class divided by the size of the new minority class. In the next step, the k-means algorithm is performed on the majority class to produce K clusters. Then, each cluster is combined with the new minority class. A classifier is trained on each combined set. Finally, the model selects the result with the maximum probability vote.
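The phases after oversampling can be sketched as follows. This is an illustrative sketch under several assumptions, not the authors' implementation: the minority class is taken as already oversampled (`new_minority`), K is rounded to the nearest integer, a nearest-centroid scorer stands in for the classifiers used in the paper, and "maximum probability vote" is interpreted as taking the prediction of the most confident classifier. All function names are hypothetical.

```python
import math
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns the non-empty clusters as lists of points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: math.dist(p, centroids[j]))
            clusters[nearest].append(p)
        # Recompute each centroid; keep the old one if a cluster is empty.
        centroids = [
            [sum(col) / len(c) for col in zip(*c)] if c else centroids[j]
            for j, c in enumerate(clusters)
        ]
    return [c for c in clusters if c]


def centroid(points):
    return [sum(col) / len(points) for col in zip(*points)]


def train(majority_cluster, minority):
    """Stand-in classifier (assumption): stores one centroid per class."""
    return centroid(majority_cluster), centroid(minority)


def minority_probability(model, x):
    """Score in [0, 1]: relative closeness of x to the minority centroid."""
    d_maj = math.dist(x, model[0])
    d_min = math.dist(x, model[1])
    return 0.5 if d_maj + d_min == 0 else d_maj / (d_maj + d_min)


def fit(majority, new_minority):
    # K = |majority| / |new minority|, rounded and at least 1 (rounding is an assumption).
    k = max(1, round(len(majority) / len(new_minority)))
    # One model per majority cluster, each trained together with the full new minority class.
    return [train(cluster, new_minority) for cluster in kmeans(majority, k)]


def predict(models, x):
    # The classifier whose score is farthest from 0.5 (most confident) decides.
    scores = [minority_probability(m, x) for m in models]
    best = max(scores, key=lambda p: abs(p - 0.5))
    return "minority" if best > 0.5 else "majority"
```

Because every majority sample is assigned to some cluster and every cluster is paired with the full minority class, no majority sample is discarded, which is the property the text describes as decreasing the majority class without losing data.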

Dataset and experimental setting
The experimental datasets are all from the KEEL repository [21]. The datasets have various imbalance ratios (IR). The number of data samples ranges from 214 to 5472. Table 1 shows their experimental parameters. This paper applies the SMOTE algorithm to increase the minority instances. The number of oversampled instances is determined according to the IR of the dataset. Then, the number of clusters (k) is calculated from the size of the new minority class and the majority size. In the next step, the k-means method produces the clusters. Then, each cluster is combined with the new minority class. Classification is done for each new dataset. Finally, voting selects the best one. Figure 3 presents the steps of the proposed algorithm in detail. To evaluate classification with the proposed algorithm, four different classifiers were used: decision tree [22], support vector machine (SVM) [23], nearest neighbor classifiers [24], and ensemble classifiers [19].
The evaluation measures are defined in terms of the following outcomes:
− FP (false positive) is a result that indicates positive when the instance is actually negative
− FN (false negative) is a result that indicates negative when the instance is actually positive
− TP (true positive) is a result that indicates positive when the instance is actually positive
− TN (true negative) is a result that indicates negative when the instance is actually negative
If the distribution is unbalanced, accuracy can be misleading. Therefore, it is better to rely on precision and recall. Likewise, a precision-recall curve is suitable for evaluating a classifier on imbalanced classes. Moreover, the area under the curve (AUC) is a performance measure for the classification method. Table 2 presents the accuracy of the approaches on 10 datasets, and Table 3 shows the AUC for all datasets in the three situations. To compare the performance of the suggested approach, the classification results were evaluated for the original dataset, the dataset after performing SMOTE, a density-based under-sampling algorithm (DBU) [25], and the proposed method. As mentioned before, with imbalanced classes it is easy to obtain a high accuracy without actually making a suitable prediction. Therefore, precision and recall were computed, and the area under the precision-recall curve (AUC) was used as a summary of the model performance. A precision-recall curve shows the trade-off between the true positive rate (recall) and the positive predictive value (precision) across different probability thresholds.
The proposed algorithm combines over-sampling and under-sampling techniques. The rate at which the majority class is decreased is set by considering the rate at which the minority class is increased. Therefore, the proposed method uses more instances of the original data. Moreover, the rate of increase of the minority class follows the IR of the dataset. The results show increased accuracy and AUC for the proposed model compared to the SMOTE method on benchmark imbalanced datasets from the KEEL repository.
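The outcomes TP, FP, TN, and FN defined above lead to the usual definitions of accuracy, precision, and recall. The sketch below uses these standard formulas (assumed here, since the exact form of the paper's equations is not shown) together with hypothetical counts to illustrate why accuracy alone can be misleading on imbalanced data.

```python
def metrics(tp, fp, tn, fn):
    """Return (accuracy, precision, recall) from confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical case: 990 majority (negative) and 10 minority (positive) samples,
# and a classifier that predicts the majority class for every sample.
print(metrics(tp=0, fp=0, tn=990, fn=10))  # (0.99, 0.0, 0.0)
```

The classifier in this example reaches 99% accuracy without detecting a single minority sample, which precision and recall expose immediately.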

CONCLUSION
This work proposes a novel technique to balance imbalanced data. To overcome the imbalance problem, both under and over sampling approaches are used. Most imbalanced data classification techniques balance the data by increasing the minority class or decreasing the majority class, which changes the original data. The proposed algorithm tries to reduce the rate of change of the primary dataset. The algorithm first performs the SMOTE method on the minority class to produce new instances. Then, the k-means clustering algorithm decreases the majority class, with k chosen according to the size of the new minority class. Finally, each cluster and the new minority class are used together as the input data of a classifier. The experimental results show increased accuracy and AUC for the proposed model compared to the SMOTE method on benchmark imbalanced datasets from the KEEL repository. The method can be further applied to other real-world engineering datasets.