http://beei.org
Hybrid approach redefinition-multi class with resampling and feature selection for multi-class imbalance with overlapping and noise

Article Info (2021)

ABSTRACT
Class imbalance and overlapping in multi-class data can reduce the performance and accuracy of classification. Noise must also be considered, because it too can reduce classification performance. Using a resampling algorithm and feature selection, this paper proposes a method for improving the performance of the hybrid approach redefinition-multi class (HAR-MI). A resampling algorithm can overcome the problem of noise but cannot handle overlapping well, while feature selection is good at dealing with overlapping but can decrease in quality in the presence of noise. The HAR-MI approach deals with multi-class imbalance issues but has some drawbacks when dealing with overlapping. The contribution of this paper is a new approach for dealing with class imbalance, overlapping, and noise in multi-class problems. This is accomplished by employing minimizing overlapping selection (MOSS) as an ensemble learning algorithm and preprocessing technique in HAR-MI, and by employing multi-class combined cleaning and resampling (MC-CCR) as the resampling algorithm at the processing stage. When evaluated on overlapping and classifier performance, the proposed method produces good results, as evidenced by better augmented R-value, class average accuracy, class balance accuracy, multi-class G-mean, and confusion entropy.


INTRODUCTION
The class imbalance occurs when a class has a significantly smaller number of instances than the other classes, as determined by the imbalance ratio (IR), which is the ratio of a class with a significantly smaller number of instances (minority class) to a class with a significantly larger number of instances (majority class) [1]. Machine learning algorithms basically work optimally when each class has a number of instances that does not differ much [2]. This problem is one of the causes of the low accuracy of classification, and it also means that important information contained in the minority class cannot be obtained, because coverage of the majority class is better [3]. Handling multi-class imbalance is more difficult than the two-class problem, especially with regard to accuracy and the difficulty of training on large datasets with high imbalance ratios [4]. Another issue that escapes attention is the overlapping problem, where instances of several classes occupy the same region of the feature space. The degree of overlapping can be measured using the augmented R-value defined in (1), where C_0, C_1, …, C_(k-1) are the k class labels with |C_0| ≥ |C_1| ≥ ⋯ ≥ |C_(k-1)|, and D[V] denotes dataset D retaining only the predictors in set V. A higher augmented R-value indicates a higher overlap degree.

Confusion matrix
In general, classification results can be grouped into 4 (four) groups, namely true positive (TP), true negative (TN), false positive (FP), and false negative (FN), and can be presented in a confusion matrix, as can be seen in Table 1 [28].
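As a concrete illustration, the four groups can be extracted from a multi-class confusion matrix in a one-vs-rest fashion. The following sketch, using hypothetical toy labels, shows how TP, FP, FN, and TN are obtained for a given class:

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """M[i, j] counts samples of true class i predicted as class j."""
    m = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        m[t, p] += 1
    return m

def per_class_counts(m, c):
    """One-vs-rest TP, FP, FN, TN for class c."""
    tp = m[c, c]
    fp = m[:, c].sum() - tp   # predicted c, true class differs
    fn = m[c, :].sum() - tp   # true c, predicted otherwise
    tn = m.sum() - tp - fp - fn
    return tp, fp, fn, tn

y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
M = confusion_matrix(y_true, y_pred, 3)
print(M)
print("class 0 (TP, FP, FN, TN):", per_class_counts(M, 0))
```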

Classifier performance
The following parameters are used to determine the classifier performance.
− Class average accuracy: with C classes, the class average accuracy can be calculated using (2) [29], where C is the number of classes and TP, TN, FP, and FN are the prediction (classification) results obtained from the confusion matrix.
− Class balance accuracy: for any confusion matrix with C classes, the class balance accuracy is defined as in [30]. The confusion matrix used for class balance accuracy can be seen in Table 2.
− Multi-class G-mean (mGM): the G-mean was originally proposed as the geometric mean of the recall (R) values of two classes [31]. To apply this measure to multi-class problems, Sun et al. [32] defined the G-mean as the geometric mean of the recall values of every class.
− Confusion entropy (CEN): Wei et al. [33] proposed using the confusion entropy to determine classifier performance. The confusion entropy captures misclassification information in both directions: how the samples with true class label c_i were misclassified into the other N classes, and how the samples from the other N classes were misclassified into class c_i.
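To make the accuracy-style measures concrete, the following sketch computes class average accuracy (mean one-vs-rest accuracy), class balance accuracy (mean of each diagonal entry divided by the larger of its row and column totals, following Mosley's definition), and the multi-class G-mean (geometric mean of per-class recalls) from a toy confusion matrix; confusion entropy is omitted for brevity, and the formulas in (2) and the cited works remain authoritative:

```python
import numpy as np

def class_average_accuracy(m):
    """Mean over classes of the one-vs-rest accuracy (TP + TN) / N."""
    n, k = m.sum(), m.shape[0]
    accs = []
    for c in range(k):
        tp = m[c, c]
        fp = m[:, c].sum() - tp
        fn = m[c, :].sum() - tp
        tn = n - tp - fp - fn
        accs.append((tp + tn) / n)
    return float(np.mean(accs))

def class_balance_accuracy(m):
    """Mean over classes of M[i, i] / max(row total, column total)."""
    k = m.shape[0]
    return float(np.mean([m[i, i] / max(m[i, :].sum(), m[:, i].sum())
                          for i in range(k)]))

def multiclass_gmean(m):
    """Geometric mean of the per-class recall values."""
    recalls = np.diag(m) / m.sum(axis=1)
    return float(recalls.prod() ** (1.0 / m.shape[0]))

M = np.array([[4, 1, 0],
              [0, 3, 1],
              [1, 0, 2]])
print("class average accuracy:", class_average_accuracy(M))
print("class balance accuracy:", class_balance_accuracy(M))
print("multi-class G-mean:", multiclass_gmean(M))
```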
Equation (10) shows the loss penalties that will be used to determine the optimum β̂.
The handling of overlapping starts at the preprocessing stage by adding MOSS, as seen in the previous pseudocode. The MOSS process begins with the provision of p predictors and class labels y. The oversampling process on the minority class is carried out using SMOTE, after which the sparse regularization process is carried out using sparse selection, which can be measured using (2), and the loss penalties are then calculated using (3).
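A minimal sketch of this idea follows, assuming a naive SMOTE-style interpolation and an L1-penalized (lasso) multinomial logistic regression as the sparse selection step; scikit-learn's LogisticRegression here stands in for the paper's exact MOSS procedure, and the data and C value are illustrative assumptions:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

def smote_like(X, n_new, rng):
    """Naive SMOTE-style interpolation between random pairs of class samples."""
    i = rng.integers(0, len(X), n_new)
    j = rng.integers(0, len(X), n_new)
    lam = rng.random((n_new, 1))
    return X[i] + lam * (X[j] - X[i])

# toy 3-class data: features 0 and 1 carry the class signal, 2 and 3 are noise
X = rng.normal(size=(90, 4))
y = np.repeat([0, 1, 2], [60, 20, 10])
X[:, 0] += 3 * y
X[:, 1] -= 2 * y

# step 1: oversample every minority class up to the majority size
counts = np.bincount(y)
parts_X, parts_y = [X], [y]
for c in range(3):
    need = counts.max() - counts[c]
    if need > 0:
        parts_X.append(smote_like(X[y == c], need, rng))
        parts_y.append(np.full(need, c))
Xb, yb = np.vstack(parts_X), np.concatenate(parts_y)

# step 2: sparse (lasso-penalized) selection, keep features with nonzero weight
clf = LogisticRegression(penalty="l1", solver="saga", C=0.1, max_iter=5000)
clf.fit(Xb, yb)
selected = np.where(np.abs(clf.coef_).max(axis=0) > 1e-6)[0]
print("selected features:", selected)
```

On this toy data, the lasso penalty retains the two informative features while shrinking the noise features toward zero.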

Multi-class combined cleaning and resampling
The MC-CCR algorithm is given in [11]. Input: X(c) denotes the subcollection of observations belonging to class c in the set of observations X. Parameters: the energy budget for the expansion of each sphere, and the p-norm used to calculate distance. Output: the set of translated and oversampled observations X′. The MC-CCR method is used to eliminate noise at the processing stage. It should be noted that its impact is limited, essentially affecting the combined majority observations.
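The sphere-growing idea can be sketched for a single minority class as follows. This is a simplified illustration of the CCR mechanism underlying MC-CCR (energy-budget sphere expansion, translation of intruding majority observations to the sphere boundary, and radius-weighted oversampling), with an assumed linear cost schedule and Euclidean distance rather than the paper's exact parameterization:

```python
import numpy as np

def ccr_step(X_min, X_maj, energy=1.0, n_new=20, rng=None):
    """Simplified CCR-style cleaning and resampling for one minority class.

    Around each minority sample a sphere is grown on an energy budget
    (every majority intruder makes further growth proportionally costlier),
    majority samples inside a sphere are translated to its boundary, and
    synthetic minority samples are drawn inside the spheres, favouring
    small spheres (which lie in dense majority regions).
    Assumes no majority point coincides exactly with a minority point.
    """
    rng = rng if rng is not None else np.random.default_rng(0)
    X_maj = X_maj.astype(float).copy()
    radii = np.empty(len(X_min))
    for k, x in enumerate(X_min):
        d = np.sort(np.linalg.norm(X_maj - x, axis=1))
        r, budget, cost = 0.0, float(energy), 1.0
        for di in d:
            if (di - r) * cost > budget:   # cannot reach the next intruder
                break
            budget -= (di - r) * cost
            r, cost = di, cost + 1.0       # absorbing an intruder raises the cost
        r += budget / cost                 # spend any leftover energy
        radii[k] = r
        # cleaning: push intruding majority samples out to the sphere boundary
        v = X_maj - x
        dist = np.linalg.norm(v, axis=1)
        inside = dist < r
        X_maj[inside] = x + v[inside] * (r / dist[inside])[:, None]
    # resampling: synthetic counts inversely proportional to sphere radius
    w = 1.0 / radii
    counts = rng.multinomial(n_new, w / w.sum())
    new = []
    for x, r, c in zip(X_min, radii, counts):
        dirs = rng.standard_normal((c, X_min.shape[1]))
        dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)
        new.append(x + dirs * (rng.random((c, 1)) * r))
    return X_maj, np.vstack(new), radii

rng = np.random.default_rng(1)
X_maj = rng.normal(0.0, 1.5, size=(100, 2))   # majority cloud
X_min = rng.normal(0.0, 0.5, size=(10, 2))    # minority samples inside it
X_maj_clean, X_syn, radii = ccr_step(X_min, X_maj, energy=2.0, n_new=30, rng=rng)
print("sphere radii:", np.round(radii, 3))
print("synthetic minority samples:", X_syn.shape)
```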

Hybrid approach redefinition for multi-class imbalance
The algorithm of the hybrid approach redefinition for multi-class imbalance is given in [21]. Based on the preceding algorithm, it is clear that the HAR-MI method is divided into 2 (two) major stages: preprocessing and processing. The random balance ensemble method and dynamic ensemble selection are used in the preprocessing stage. The preprocessing stage generates a preprocessing dataset, which then goes through the processing stage using different contribution sampling. Different contribution sampling contains biased support vector machine stages that produce support vector (SV) sets and non-support vector (NSV) sets for both the majority and minority classes. NSV sets from the majority classes are then processed using multiple random undersampling, while SV sets from the minority classes are processed using SMOTEBoost. Figure 1 shows the stages of this research. According to the figure, the process starts with the preprocessing stage, which employs MOSS. The sparse selection and lasso penalty values are determined first. The output of this stage is a preprocessing dataset, which then goes through the processing stage using MC-CCR. The results from HAR-MI with resampling and feature selection are then compared with the results from neighborhood-based undersampling.

Preprocessing stage
MOSS will be used to modify the HAR preprocessing stage for multi-class problems. The following algorithm depicts the preprocessing stages.

Figure 1. Stages of research methods
Based on the preceding algorithm, it is clear that the HAR-MI preprocessing stage is carried out using one of the feature selection methods, namely MOSS. MOSS is intended to handle overlapping before entering the processing stage. The MOSS stage begins with determining the values of the sparse selection and the loss penalty. MOSS is then used to sample each instance in the minority class. The result is a preprocessing dataset, which is then measured using the augmented R-value.

Processing stage
The following algorithm depicts the processing steps. According to the algorithm, the processing stage begins with the biased support vector machine process to determine the SV sets and NSV sets for the majority and minority classes. Next, each SV set and NSV set in the majority class goes through the process of noise removal and resampling using MC-CCR. The same is done with the SV sets and NSV sets of the minority classes, except that the SMOTEBoost process is applied in particular to the SV sets of the minority class. This whole process produces a result dataset.
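The split into SV and NSV sets can be illustrated with scikit-learn's SVC, where class_weight="balanced" acts as a stand-in for the biased support vector machine; the data and class counts below are illustrative assumptions:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(2)
# toy imbalanced 3-class data (counts 60/25/10 are illustrative)
X = np.vstack([rng.normal(0, 1, (60, 2)),
               rng.normal(2, 1, (25, 2)),
               rng.normal(4, 1, (10, 2))])
y = np.repeat([0, 1, 2], [60, 25, 10])

# stand-in for the biased SVM: heavier misclassification cost on minority classes
clf = SVC(kernel="rbf", class_weight="balanced").fit(X, y)
sv_idx = set(clf.support_)                 # indices of the support vectors

sv_sets, nsv_sets = {}, {}
for c in range(3):
    idx = np.where(y == c)[0]
    sv_sets[c] = [i for i in idx if i in sv_idx]
    nsv_sets[c] = [i for i in idx if i not in sv_idx]
    print(f"class {c}: {len(sv_sets[c])} SV, {len(nsv_sets[c])} NSV")
```

Each class is thus partitioned into its SV set (instances near the decision boundary) and NSV set (instances far from it), which the processing stage then treats differently.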

RESULTS AND ANALYSIS

Dataset description
We conducted our experiments using 6 (six) multi-class imbalanced datasets from the knowledge extraction based on evolutionary learning (KEEL) repository, each with a low, moderate, or high imbalance ratio (IR). The datasets with a low IR are new-thyroid and balance, the datasets with a moderate IR are flare and car, and the datasets with a high IR are red wine quality and yeast. Table 3 contains a description of the datasets [35]. Following the selection of the datasets, the next step is to assess the presence of noise. This experiment uses a subset of training examples and randomly replaces their labels to generate noise, with a noise level of 0.1.
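The label-noise injection described above can be sketched as follows; at a noise level of 0.1, exactly 10% of the labels are replaced with a different, randomly chosen class (the class distribution below is illustrative, not one of the KEEL datasets):

```python
import numpy as np

def add_label_noise(y, noise_level=0.1, rng=None):
    """Replace the labels of a random fraction of examples with a
    different, uniformly chosen class label."""
    rng = rng if rng is not None else np.random.default_rng(0)
    y = np.asarray(y).copy()
    classes = np.unique(y)
    n_noisy = int(round(noise_level * len(y)))
    idx = rng.choice(len(y), size=n_noisy, replace=False)
    for i in idx:
        # pick uniformly among the other classes, so every flip changes the label
        y[i] = rng.choice(classes[classes != y[i]])
    return y

y = np.repeat([0, 1, 2], [50, 30, 20])     # illustrative class distribution
y_noisy = add_label_noise(y, noise_level=0.1)
print("flipped labels:", (y != y_noisy).sum())  # exactly 10 of 100
```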

Testing result
The first test compares the augmented R-value and class average accuracy obtained by using HAR-MI with the resampling algorithm and feature selection. Table 4 shows the test results. According to Table 4, the results obtained by the HAR-MI method with the resampling algorithm and feature selection and by neighborhood-based undersampling are not significantly different in terms of overlapping. This is indicated by augmented R-values that do not differ much. The lower the augmented R-value, the lower the level of overlapping. There is a strong relationship between overlapping and accuracy: the lower the overlapping, the better the class average accuracy obtained. It should also be noted that neighborhood-based undersampling tends to have a slight advantage on datasets with a large number of attributes, such as flare and red wine quality, while HAR-MI with the resampling algorithm and feature selection has the advantage on the 4 other datasets. It should be noted that the imbalance ratio has an impact on the results: the higher the imbalance ratio, the more overlapping there is, and the lower the accuracy obtained.
The second test compares the class balance accuracy, multi-class G-mean, and confusion entropy obtained by using the HAR-MI method with the resampling algorithm and feature selection, as well as neighborhood-based undersampling. Table 5 shows the test results. Based on Table 5, it is obvious that the number of attributes, the number of classes, and the level of IR all have a significant impact on class balance accuracy. For datasets with similar imbalance ratio levels, the numbers of attributes and classes largely determine the class balance accuracy results. This can be seen in the results for the balance dataset, whose class balance accuracy is improved when compared to the new-thyroid dataset. When it comes to class balance accuracy, HAR-MI with the resampling algorithm and feature selection produces better results than neighborhood-based undersampling. The test results for the multi-class G-mean show that, for both methods, the higher the IR, the lower the multi-class G-mean value obtained, because the G-mean expresses the equilibrium between positive and negative samples. The test results for confusion entropy show that the results depend on the number of classes and the imbalance ratio. For imbalance ratios that are not significantly different, such as those of the flare and car datasets, the number of classes determines the results obtained. In general, the results obtained by the two methods for confusion entropy are not significantly different.

Statistical tests
The statistical test is performed using the Wilcoxon signed-rank test, a statistical procedure for assessing performance on the basis of pairwise comparisons [36]. The results of the statistical tests can be seen in Table 6.
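As an illustration, a Wilcoxon signed-rank test over paired per-dataset scores can be run with SciPy; the score values below are hypothetical, not taken from the paper's tables:

```python
from scipy.stats import wilcoxon

# hypothetical paired per-dataset scores for the two methods (illustrative only)
harmi = [0.91, 0.88, 0.84, 0.79, 0.75, 0.71]
nbu   = [0.90, 0.86, 0.85, 0.76, 0.72, 0.70]

stat, p = wilcoxon(harmi, nbu)
print(f"W = {stat}, p = {p:.3f}")
if p < 0.05:
    print("significant difference between the two methods")
else:
    print("no significant difference at alpha = 0.05")
```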

Discussion
According to the results in Tables 4-6, there is no significant distinction in augmented R-value, class average accuracy, and confusion entropy between HAR-MI with the resampling algorithm and feature selection and neighborhood-based undersampling. This indicates that both methods have successfully overcome overlapping with positive outcomes. The confusion entropy obtained is good, which means that the misclassifications are spread evenly across all classes. The test results for class balance accuracy and multi-class G-mean show that HAR-MI with the resampling algorithm and feature selection gives better results than neighborhood-based undersampling.
It should be noted that overlapping and accuracy are interrelated: the higher the overlapping, the lower the accuracy. The imbalance ratio is the main factor that determines how much overlap there is; the higher the imbalance ratio, the higher the overlapping. In terms of overlapping, HAR-MI with the resampling algorithm and feature selection has some limitations on datasets with a large number of attributes, in addition to high imbalance ratios. The results revealed that, in addition to the IR, the number of attributes and the number of classes determine the value of class balance accuracy. The number of classes and the imbalance ratio have a strong influence on the multi-class G-mean and confusion entropy.

CONCLUSION
The test results show that, for handling multi-class imbalance accompanied by overlapping and noise, the results obtained by HAR-MI with the resampling algorithm and feature selection and by neighborhood-based undersampling are good. The results obtained by HAR-MI with the resampling algorithm and feature selection are generally better than those of neighborhood-based undersampling, as indicated by better augmented R-value, class average accuracy, class balance accuracy, multi-class G-mean, and confusion entropy, although the differences in augmented R-value, class average accuracy, and confusion entropy are not statistically significant. It should be noted that both methods have limitations in handling overlapping, with a slight decrease in performance on datasets with large numbers of attributes. The imbalance ratio also has a direct relationship with classifier performance. Future research is expected to handle the decrease in performance on datasets with a large number of attributes and a high IR.