Boosting with crossover for improving imbalanced medical datasets classification

Received Oct 16, 2020; Revised Mar 17, 2021; Accepted Jul 6, 2021

Because electronic health databases are now widely used across healthcare services, healthcare data are available to classification researchers who aim to make disease diagnosis more efficient. However, classifying medical data is challenging because the data are often imbalanced. Most existing algorithms tend to assign samples to the majority class, which leads to poor prediction of the minority class. In this paper, a novel preprocessing method is proposed that uses boosting and crossover to optimize the ratio of the two classes by progressively rebuilding the training dataset. Experiments on seven real-world medical datasets with different imbalance ratios and various distributions show that this approach performs better than other state-of-the-art ensemble methods.


INTRODUCTION
An imbalanced dataset is one in which one class contains many more examples than the other. Training classifiers on imbalanced datasets is a common problem in the machine learning research community. A classifier trained on such data tends to underfit the minority class when categorizing test examples and to overfit the large number of majority class examples. Classification of imbalanced datasets has drawn great concern in the medical field, because instances diagnosed as not having a disease often far outnumber instances diagnosed as having it. Many efforts have been made, and are still being made, to enhance classification performance in this field. Several preprocessing rebalancing methods have been proposed. They artificially extend the minority class (over-sampling), reduce the number of majority class examples (under-sampling), or combine the two. Random over-sampling [1] and under-sampling [2] are the simplest methods. The first increases the size of the minority class by copying its examples, and the second randomly deletes majority class examples to achieve balance. The synthetic minority oversampling technique (SMOTE) and its improvements [3]-[6] are the most widespread re-sampling methods and often perform well. SMOTE analyzes the spatial structure of the minority class examples and uses it to synthesize extra minority examples for the dataset.
Another type of algorithm addresses class imbalance during the training phase using cost-sensitive or ensemble-based learning approaches [7]-[9]. In cost-sensitive learning, a cost matrix assigns different weights to each part of the confusion matrix so that the result has minimum cost. Because misclassified minority class examples carry the largest cost, the classifier is biased toward the minority class. Hybrids of cost-sensitive learning with decision trees (DT) [10] or with feature selection [11] have been proposed for the class imbalance problem. Ensemble learning is based mainly on voting: a strong classifier is built by integrating a collection of weak classifiers produced over several rounds or iterations. Boosting [12], bagging [13], and random forest [14] are the most widely used ensemble-based learning techniques. Most of these techniques have been used for imbalanced medical data classification problems [14]-[16].
In this paper, we propose a novel approach called boosted crossover (BC), derived from the hybridization of an oversampling technique with boosting of the classification performance. BC is a two-phase approach. It first uses the bio-inspired crossover operator to build new examples of the minority class that have the same characteristics as their parents. Next, a weight-modification algorithm (the boosting algorithm) gives each classifier a weight based on its own training error, which provides better performance than other oversampling algorithms, especially on highly imbalanced datasets. Whereas other oversampling algorithms are based on repeating the minority class examples, the main advantage of our algorithm is that it uses crossover to build the new minority examples, so the resulting examples are new but still share the characteristics of the original examples. The rest of this paper is organized as follows. Section 2 introduces the proposed approach. Section 3 describes the datasets and presents the experimental setup and results. Finally, section 4 concludes the paper.

PROPOSED METHOD
The proposed algorithm fixes imbalance problems in datasets using two phases that run independently. The first phase divides the training dataset into two groups, one containing the majority class examples and the other the minority class examples, and reassembles it after adding more examples, generated by the bio-inspired crossover operator, to the minority group. The second phase depends mainly on boosting the performance of the classification process. Figure 1 shows the two phases of the proposed algorithm.

First phase
The training dataset is divided into two groups: the majority class and the minority class. The group containing the minority class examples is fed to the first phase, where the crossover operator is applied to it to generate new instances. Crossover, also called recombination, is a genetic operator used in bio-inspired genetic algorithms and evolutionary computation to generate new offspring by combining the genetic information of two parents. It is a way to randomly generate new solutions from an existing population, and it is analogous to the crossover that happens during biological sexual reproduction. Crossover creates new offspring by selecting genes from the parent chromosomes. There are different types of crossover operators [17], which vary in the number of crossover points and their locations on each chromosome. The simplest type is single-point crossover, which chooses a crossover point at random and exchanges all genes on one side of this point between the two parents. This results in two offspring, each carrying some genetic information from both parents. This bio-inspired operator facilitates the inheritance of "characteristics" or "traits" by an offspring from its parents [18], so we choose it to generate new minority examples that carry the same characteristics as the original ones. Single-point crossover is selected over the other crossover operators in this work for its simplicity and to reduce the running time of the code.
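The single-point crossover described above can be sketched as follows. This is a minimal illustration, not the authors' MATLAB implementation. It assumes that each minority example is treated as a chromosome whose genes are its feature values, that parents are drawn uniformly at random from the original minority examples, and that offspring are added until a target size is reached. The function names `single_point_crossover` and `oversample_minority` are hypothetical.

```python
import random

def single_point_crossover(parent_a, parent_b, rng):
    """Swap the tails of two equal-length feature vectors at a random cut."""
    if len(parent_a) != len(parent_b) or len(parent_a) < 2:
        raise ValueError("parents need the same length and at least 2 features")
    # Cut strictly inside the vector so each child mixes both parents.
    point = rng.randint(1, len(parent_a) - 1)
    child_a = parent_a[:point] + parent_b[point:]
    child_b = parent_b[:point] + parent_a[point:]
    return child_a, child_b

def oversample_minority(minority, target_size, seed=0):
    """Grow the minority group to target_size with crossover offspring."""
    rng = random.Random(seed)
    augmented = [list(row) for row in minority]
    while len(augmented) < target_size:
        # Assumption: parents come from the ORIGINAL minority examples only.
        parent_a, parent_b = rng.sample(minority, 2)
        children = single_point_crossover(list(parent_a), list(parent_b), rng)
        for child in children:
            if len(augmented) < target_size:
                augmented.append(child)
    return augmented
```

Because every gene of an offspring is copied from one of its parents at the same position, each new example only contains feature values that already occur in the minority class, which is one way to read the claim that offspring keep the characteristics of the original examples.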
After the crossover operator is applied to the minority class group, the majority class group is recombined with it, resulting in a new balanced training set that is ready to be fed into the second phase. Before the second phase is applied, the new balanced training data is tested. Five classifiers [14], [19], [20], namely random forest (RF), K-nearest neighbors (KNN), discriminant analysis (DA), naive Bayes (NB), and support vector machine (SVM), are used to test the seven medical datasets. A comparison of the results with the existing SMOTE and safe-level SMOTE approaches [3], [4] indicates a significant performance improvement of the proposed method over them. Then, the second phase is applied to the datasets.

Second phase
After the readiness of the medical data is confirmed, the second phase proceeds by running the adaptive boosting (AdaBoost) algorithm on the new balanced training data and the test data. Boosting is a general approach for improving the classification performance of any given classifier. It converts weak classifiers into strong classifiers by improving the model predictions of the given algorithm. It produces a series of trained classifiers, in which each member modifies its training set based on the performance of the prior classifier in the series. Examples that earlier classifiers in the series predicted incorrectly are chosen more often than examples that were predicted correctly. Thus, boosting tries to produce a series of improved classifiers that are better able to predict the examples on which the current classifier performs poorly. There are many boosting approaches, such as AdaBoost, gradient tree boosting, LightGBM, and XGBoost. In this paper, adaptive boosting is the selected boosting algorithm. AdaBoost was proposed in [21]. The generalized version of the AdaBoost algorithm for binary classification problems is shown in Algorithm 1. As in Algorithm 1, the algorithm takes as input a training set (x_1, y_1), (x_2, y_2), ..., (x_m, y_m), where each x_i belongs to some domain or instance space X and each label y_i is in some label set Y. AdaBoost training runs for T rounds, where T is the number of weak classifiers, and the ensemble weights obtained by learning combine the weak classifiers into the final strong classifier [21], [22]. The weak classifier is the core of an AdaBoost algorithm; in our work, the classification and regression tree (CART) algorithm proposed by Breiman et al. [23] is used as the weak classifier.
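The reweighting loop described above can be sketched as follows. This is a minimal illustration under stated assumptions and may differ in detail from Algorithm 1. Labels are assumed to be -1 and +1, the weak learner is a one-feature threshold stump used here as a simple stand-in for CART, and each classifier weight is computed with the standard AdaBoost formula alpha = 0.5 ln((1 - err) / err). The names `train_stump`, `stump_predict`, `adaboost_fit`, and `adaboost_predict` are hypothetical.

```python
import math

def train_stump(X, y, w):
    """Pick the (feature, threshold, sign) stump with the lowest weighted error."""
    best = None
    for j in range(len(X[0])):
        for thr in sorted({row[j] for row in X}):
            for sign in (1, -1):
                pred = [sign if row[j] <= thr else -sign for row in X]
                err = sum(wi for wi, p, yi in zip(w, pred, y) if p != yi)
                if best is None or err < best[0]:
                    best = (err, j, thr, sign)
    return best

def stump_predict(stump, row):
    _, j, thr, sign = stump
    return sign if row[j] <= thr else -sign

def adaboost_fit(X, y, rounds):
    """Train `rounds` weak classifiers; return a list of (alpha, stump) pairs."""
    m = len(X)
    w = [1.0 / m] * m  # start from uniform example weights
    ensemble = []
    for _ in range(rounds):
        stump = train_stump(X, y, w)
        # Clamp the error away from 0 and 1 to keep the logarithm finite.
        err = min(max(stump[0], 1e-10), 1 - 1e-10)
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, stump))
        # Misclassified examples gain weight, correctly classified ones lose it.
        w = [wi * math.exp(-alpha * yi * stump_predict(stump, row))
             for wi, row, yi in zip(w, X, y)]
        total = sum(w)
        w = [wi / total for wi in w]
    return ensemble

def adaboost_predict(ensemble, row):
    """Weighted vote of all weak classifiers."""
    score = sum(alpha * stump_predict(stump, row) for alpha, stump in ensemble)
    return 1 if score >= 0 else -1
```

On a one-feature toy set whose labels change sign twice along the feature axis, no single stump is correct everywhere, but a weighted vote of a few stumps can be. This is the sense in which boosting turns weak classifiers into a stronger one.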
The Diagnostic Wisconsin Breast Cancer dataset contains diagnostic information on breast cancer cases. The Mammographic Masses dataset discriminates between benign and malignant mammographic masses. The heart disease dataset is the part obtained from the Cleveland Clinic Foundation and is used to detect the presence of disease in the patient's heart. The Diabetes dataset, known as the AIM-94 diabetes dataset, is used to diagnose chronic diabetes; it was obtained from two sources: paper records and an automatic electronic recording device. The Pima dataset contains two classes that indicate whether the patient tests positive or negative for diabetes. The patients' records are for Pima Indian women who live near Phoenix, Arizona, USA. The Haberman dataset contains records of patients who had undergone surgery for breast cancer at the University of Chicago's Billings Hospital between 1958 and 1970, collected for a study of their survival. The Meander Hand Parkinson Disease dataset is used to diagnose Parkinson's disease at an early stage using handwriting images of meanders drawn on forms during handwriting exams. The numerical information for the imbalanced datasets is summarized in Table 1.

Experimental results
To verify the performance, experiments were conducted in MATLAB R2015a on a computer equipped with a 2.20 GHz Core i7 processor and 6 GB of RAM. We performed our experiments in two phases. First, the original dataset is divided into majority and minority class groups, and each group is then divided equally into two subsets. After that, the corresponding subsets are combined to form the training and testing sets, each holding 50% of the data and having the same percentage of minority and majority classes.
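The class-wise 50/50 split described above can be sketched as follows. This is a minimal illustration, not the authors' MATLAB code. It assumes that the examples of each class are shuffled before being halved and that, when a class has an odd number of examples, the extra one goes to the testing set; the text does not state either detail. The function name `stratified_half_split` is hypothetical.

```python
import random

def stratified_half_split(X, y, seed=0):
    """Split (X, y) 50/50 while keeping each class's share in both halves."""
    rng = random.Random(seed)
    train_idx, test_idx = [], []
    for label in sorted(set(y)):
        # Handle each class group separately, then recombine the halves.
        idx = [i for i, yi in enumerate(y) if yi == label]
        rng.shuffle(idx)  # assumption: the halves are chosen at random
        half = len(idx) // 2
        train_idx.extend(idx[:half])
        test_idx.extend(idx[half:])
    train = ([X[i] for i in train_idx], [y[i] for i in train_idx])
    test = ([X[i] for i in test_idx], [y[i] for i in test_idx])
    return train, test
```

Splitting each class separately keeps the imbalance ratio of the test set equal to that of the original data, so the test set stays imbalanced even after the training set is rebalanced.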
The training set is then used in the first phase, where its minority class examples are oversampled with the crossover operator to balance the two classes. To test the validity of the balanced training set, the five classifiers mentioned previously were used, and the results were compared with those of two widely used oversampling methods, SMOTE and safe-level SMOTE (SLSMOTE). Recall, Precision, FScore, and GMean (geometric mean) are the performance measures used in this test alongside accuracy, since accuracy alone typically does not give enough information to validate an algorithm's performance. With TP, TN, FP, and FN denoting the numbers of true positives, true negatives, false positives, and false negatives, where the minority class is the positive class, these measures are defined as [1], [26]:

Accuracy = (TP + TN)/(TP + TN + FP + FN)
Recall = TP/(TP + FN)
Precision = TP/(TP + FP)

With imbalanced datasets, increases in recall often come at the cost of decreases in precision, because increasing the TP for the minority class often also increases the number of FP, which reduces precision. FScore combines recall and precision into a single score that reflects both properties and can give a good indication of classification quality on imbalanced data [1], [14].

FScore = 2 x Precision x Recall/(Precision + Recall)
On the other hand, GMean is the square root of the product of the class-wise accuracies (sensitivity for positive (minority) examples and specificity for negative (majority) examples). This measure tries to maximize the accuracy of both classes in balance, so it is often used to evaluate the per-class accuracy of classifiers. If one class is recognized poorly by the classifier, GMean tends toward zero [11].
GMean = √(sensitivity x specificity) (5)
Table 2 and Table 3 show the classification performance measures (accuracy and FScore) for the five classifiers (RF, KNN, DA, NB, and SVM) applied to the seven medical datasets. In our experiment, each classifier is first applied to the imbalanced data, and the performance measures are calculated for the test dataset (Or.). Then, the first phase is applied to the training data by running the bio-inspired crossover operator, and the performance measures are recalculated for the same test dataset after training the classifiers on the new balanced data (Cross.). All datasets were also rebalanced with SMOTE and SLSMOTE and tested in the same manner, and the results are included to compare the efficiency of our proposal against the other methods. The results in Tables 2 and 3 show that our method achieves the highest performance on 6 out of 7 datasets. On the remaining datasets, the performance of our method is sometimes very close to that of the other two methods. In other cases, it improves performance by more than 10%, as in the FScore of D1, D2, and D4. The row named Win in both tables gives the number of datasets on which each method performed best, and it is evident that our proposal is superior on all datasets for both accuracy and FScore with the RF classifier. It also achieves excellent performance with the KNN and SVM classifiers and good performance with the DA and NB classifiers. This can also be seen in Figures 2-6, which show the improvement in the performance measures (Precision, Recall, and GMean) for the five classifiers RF, KNN, DA, NB, and SVM respectively.
Table 4 shows all performance measures after applying boosting, with CART as the weak classifier, to the imbalanced training set (Or.) and to the new balanced training set produced by the proposed method (BC) in the second phase. Some datasets (D1, D3, D4, and D7) show a decrease in precision, which accompanies the increase in TP for the minority class, as mentioned earlier. The remaining datasets show an increase in precision. All datasets record an increase in accuracy, recall, FScore, and GMean except D5, which shows a decrease in these measures apart from precision and GMean. This may be due to the nature of the data, since it has an extremely low number of features. Together with the conclusions drawn above, the results presented in Tables 2, 3, and 4 reveal the efficiency of our proposed algorithm: it behaves excellently on all major performance metrics, especially those that reflect the trade-off between negative and positive classification performance (FScore and GMean), and it outperforms SMOTE and SLSMOTE with 3 out of the 5 classifiers applied to all of the medical datasets used.
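The performance measures used in this section can be computed from confusion-matrix counts as sketched below. This is a minimal illustration that assumes the minority class is the positive class and returns 0 for any ratio whose denominator is zero; that convention is a choice made for this sketch, not something stated in the text. The function name `imbalance_metrics` is hypothetical.

```python
import math

def imbalance_metrics(tp, fp, tn, fn):
    """Compute accuracy, recall, precision, FScore and GMean from counts."""
    recall = tp / (tp + fn) if tp + fn else 0.0        # sensitivity
    precision = tp / (tp + fp) if tp + fp else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    if precision + recall:
        fscore = 2 * precision * recall / (precision + recall)
    else:
        fscore = 0.0
    gmean = math.sqrt(recall * specificity)
    return {
        "accuracy": accuracy,
        "recall": recall,
        "precision": precision,
        "fscore": fscore,
        "gmean": gmean,
    }
```

For example, a classifier that assigns every example to the majority class on a test set with 90 majority and 10 minority examples (tp=0, fp=0, tn=90, fn=10) reaches an accuracy of 0.9 while its FScore and GMean are both 0, which illustrates why accuracy alone is not sufficient on imbalanced data.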

CONCLUSION
Boosted crossover (BC) is an effective method for solving the problem of imbalanced medical data. To the best of our knowledge, this is the first time the crossover operator has been combined with boosting to balance imbalanced medical datasets. The proposed method rebalances the data by increasing the number of minority class examples, which improves the performance of medical diagnosis systems. First, the bio-inspired crossover operator is used to build new examples; then the new balanced data is tested using five different classifiers and compared with two other oversampling methods. Finally, adaptive boosting is used to boost the performance of the system. Experimental results on seven medical datasets show that boosted crossover is very effective at improving the classification performance measures, especially with the RF, KNN, and SVM classifiers.