An improved feature selection approach for chronic heart disease detection

ABSTRACT


INTRODUCTION
Heart disease diagnosis with a supervised machine learning model involves identifying the class of a given observation based on previous experience, without being explicitly programmed [1]. The input features are used to determine the class. However, not every input feature in a medical dataset helps determine the class of an observation. Irrelevant input features tend to mislead the machine learning algorithm and result in low performance on heart disease classification. Thus, the sequential feature selection (SFS) algorithm is a novel approach to input feature selection that aims to improve performance and reduce computational time complexity for classification problems involving disease diagnosis with medical diagnostic datasets. Moreover, eliminating irrelevant and less important features from a medical dataset tends to improve the precision and classification accuracy of machine learning models in heart disease diagnosis. The goal of the SFS algorithm is to ensure that the best possible subset of features is used to train a model on a medical dataset, ultimately improving the precision and accuracy of the model on the classification task.
Irrelevant features in a real-world medical dataset, such as a heart disease dataset, can show strong correlations with the target class label that arise purely by chance, and strong correlation between features tends to degrade the classification accuracy of a model [2]-[4]. Moreover, a large number of features in a dataset significantly increases computational time complexity without a corresponding improvement in model performance. Consequently, training a classification model on a smaller, optimal feature subset tends to improve classification performance. We therefore propose an efficient sequential feature selection algorithm that selects the relevant and more important feature subset from the larger set of input features in a real-world dataset, in order to improve the performance of a machine learning model on a classification task. Dealing with a large number of features leads naturally to reducing the dimensionality of the dataset [2]. More features in the training set tend to make models more complex to learn and their classification performance more difficult to interpret. In addition to increasing model complexity, more features tend to lead to model overfitting.
In this study, we focus primarily on optimizing a heart disease classification model through input feature selection. The problem of supervised learning is to approximate the functional relationship f() between an input X = {X1, X2, …, XN} and an output Y, called the class label, from a set of data points {(Xi, Yi)} for i = 1, …, N, where Xi is an input vector and Yi is a real number. However, some of the input features in medical diagnosis datasets are irrelevant. For example, a patient ID is not relevant to a machine learning model. Moreover, using all input features requires considerable time, and irrelevant features introduce additional time complexity and result in lower classification accuracy. Thus, this study introduces a sequential feature selection algorithm for optimizing the heart disease classification accuracy of a random forest model. We implemented exhaustive, correlation-based and permutation-based feature selection algorithms for comparison with the proposed feature selection algorithm, using the real-world Pima Indian heart disease dataset to test the performance of the classifier model.

LITERATURE REVIEW
Numerous research works have shown that a large number of features affects the performance of supervised machine learning models for classification. Recent literature reviews [5]-[7] summarized the current optimization approaches employed for improving the performance of supervised machine learning algorithms on medical dataset classification tasks. The authors suggested that model optimization methods such as parameter tuning, correlation-based feature selection and dimensionality reduction with principal component analysis (PCA) are of significant importance for improving the performance of supervised machine learning models on medical dataset classification tasks. Moreover, irrelevant input features induce extra computational cost in processor time and memory space. Irrelevant features also lead to model overfitting, where the learning model performs better on the training set than on the test set. Thus, irrelevant features not only incur additional computational cost but also mislead the model and ultimately result in low performance on disease classification.
Researchers have studied the effect of high-dimensional datasets on the performance of supervised classification models [8], [9]. The authors proposed an information gain based (IFG) feature selection algorithm that reduces a high-dimensional input feature set in order to improve the classification performance of the Naïve Bayes algorithm on text data. The authors carried out extensive experiments on the classification performance of the proposed model, and the experimental results indicate that information gain based feature selection improved the classification performance of the Naïve Bayes model on text datasets. Moreover, the computational time and storage space required for training and testing the proposed text classification model are lower than with the complete high-dimensional input feature set. In heart disease diagnosis, our purpose is to infer the relationship between the symptoms (the input features) and their corresponding class label (heart disease positive or heart disease negative). If a feature is mistakenly included in model training, the learning model may reach a false conclusion because of that feature.
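The information gain criterion mentioned above can be illustrated with a short sketch. This is not the implementation used in [8], [9]; it assumes discrete feature values, and the function names and the toy data are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data (hypothetical, not taken from any dataset in the paper).
labels = [1, 1, 0, 0]
informative = ["a", "a", "b", "b"]    # separates the two classes perfectly
uninformative = ["a", "b", "a", "b"]  # carries no information about the class

gain_informative = information_gain(informative, labels)
gain_uninformative = information_gain(uninformative, labels)
```

A feature whose information gain is close to zero, like `uninformative` here, would be a candidate for removal under this criterion.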
Feature selection significantly reduces computational costs such as computational time complexity and memory space requirements. A review of the effect of input features in [10]-[12] suggested that feature selection algorithms fall into two categories, namely the filter and wrapper approaches. The filter approach assigns a weight to each feature subset and, based on the weight assigned to each subset, selects the optimal feature set for training a classification model. The weight is determined by a distance measure or a statistical measure such as the correlation between a given feature set and the target or class label. The features with higher weights are selected as the optimal feature set, and the classification model is trained on the selected feature set. The motivation for feature selection in medical dataset classification is that, since the goal of a medical diagnosis model is to approximate the underlying relationship between the input features (symptoms) and the class label, ignoring input features with little effect on the class label leads to better performance.
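As an illustration of the filter approach described above, the following sketch weights each individual feature by the absolute Pearson correlation with the class label and keeps the highest-weighted features. It is a simplified example under stated assumptions, not the method of [10]-[12]: the function names, the `top_k` parameter and the toy columns are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length numeric sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant column carries no linear information
    return cov / math.sqrt(var_x * var_y)


def rank_features(columns, target, top_k):
    """Return the names of the top_k features ranked by |correlation| with the target."""
    weights = {name: abs(pearson(values, target)) for name, values in columns.items()}
    ranked = sorted(weights, key=weights.get, reverse=True)
    return ranked[:top_k]


# Toy data (hypothetical values, not taken from the heart disease dataset).
columns = {
    "feature_a": [1, 2, 3, 4, 5, 6],  # increases with the class label
    "feature_b": [3, 1, 2, 3, 1, 2],  # uncorrelated with the class label
    "constant": [1, 1, 1, 1, 1, 1],   # no variation at all
}
target = [0, 0, 0, 1, 1, 1]

selected = rank_features(columns, target, top_k=1)
```

Because the weights are computed without training the classifier, a filter of this kind is cheap, but it ignores interactions between features.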
Current literature reviews [13], [14] on the dimensionality reduction problem in medical dataset classification show that the effect of irrelevant features on the accuracy of classification models is still an open research issue. Developing classification models with higher accuracy on medical datasets is one of the major concerns of automated medical diagnosis systems. In [15]-[18], the researchers further investigated the performance of machine learning models with various feature selection methods. The methods employed include PCA and statistical methods such as the chi-squared test for selecting relevant features in a medical dataset. PCA is employed to reduce the dimensionality of the features before training a model on a medical dataset for classification. The authors carried out extensive experiments, and the experimental results indicate that a more accurate and effective heart disease classification model is achieved with feature selection. Overall, a classification accuracy of 85% was achieved when feature selection was applied to a heart disease dataset. Dimensionality reduction is the most important and popular approach for removing noise (irrelevant features) and redundant features [19]-[25]. Moreover, a feature extraction method such as PCA reduces the dimensionality of the complete input feature set by projecting the original input feature space onto a newly constructed feature space whose dimensions are combinations of the original features. Thus, principal component analysis is useful for visualizing a high-dimensional dataset and investigating the relationships among input feature subsets. However, finding the overall optimal input feature subset incurs additional computational overhead, because an exhaustive search is needed to find it.
The accuracy and efficiency of the feature selection process depend on the type of feature selection algorithm employed to search for the optimal feature subset. Overall, feature selection leads to high classification accuracy, but it is rarely applied to medical diagnosis, and optimal feature selection may incur additional time complexity. In medical diagnosis, however, accuracy is more important than efficiency.
In this study, the existing feature selection algorithms, namely the exhaustive, permutation-based and correlation-based feature selection algorithms, are critically reviewed. Moreover, we propose a sequential feature selection (SFS) algorithm to improve the performance of a random forest model for heart disease classification. We also implemented a random forest model for heart disease detection to evaluate the efficiency and accuracy of the proposed approach. For comparison with existing feature selection algorithms, we implemented the well-known exhaustive, permutation-based and correlation-based feature selection algorithms. An extensive experiment was conducted on the proposed approach and the existing feature selection methods. In the experiment, we used 5-fold cross-validation to test the efficiency and accuracy of the proposed and existing approaches. The results indicate that sequential feature selection is better in terms of efficiency, whereas exhaustive feature selection is better in terms of accuracy but computationally expensive.
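The 5-fold cross-validation mentioned above can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names are hypothetical, `score_fold` stands for any user-supplied routine that trains on the training indices and returns the accuracy on the test indices, and the folds are taken in order without the shuffling or stratification that would normally be applied.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train_indices, test_indices) for k consecutive, near-equal folds."""
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size


def mean_fold_score(score_fold, n_samples, k=5):
    """Average a user-supplied score_fold(train, test) callable over the k folds."""
    scores = [score_fold(train, test) for train, test in k_fold_indices(n_samples, k)]
    return sum(scores) / len(scores)


# With 1025 observations, each of the 5 folds holds out 205 observations.
folds = list(k_fold_indices(1025, k=5))
```

Each observation is used for testing exactly once, and the mean of the five fold scores serves as the performance estimate for a candidate feature subset.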

METHODS AND MATERIALS
In this study, we conducted experiments and comparisons among feature selection algorithms and, based on the results, suggest an efficient sequential feature selection method for selecting the optimal feature subset. To conduct the study, we followed the following procedure. First, we carried out a preliminary review of previous related work (section 2) and then collected the dataset. Finally, we conducted experiments on the collected dataset with the different feature selection algorithms discussed in the previous section. The data were collected from Kaggle and contain 1025 observations. In the Kaggle dataset, each observation belongs either to the heart disease patient class (positive class) or to the non-patient class (negative class). Hence, heart disease detection is a binary classification problem in which a particular observation belongs to the positive or the negative class.

Dataset characteristics
The descriptive statistics (mean, standard deviation, maximum and minimum values) for the numeric features in the dataset are summarized in Table 1.

Sequential feature selection
The sequential feature selection algorithm is used to reduce an initial d-dimensional feature set to a k-dimensional feature subset, where k<d. The motivation behind sequential feature selection algorithms is to automatically select the feature subset that is most relevant to the problem. The goal of feature selection is to improve computational efficiency and reduce the classification error of a predictive model by removing irrelevant features, or noise, from a dataset. The sequential feature selection algorithm adds or removes one feature at a time based on the relevance of that feature to classifier performance. Let (min_f, max_f) be a tuple giving the minimum and maximum number of features to be considered. The feature combination that produces the best performance for a classification model is obtained by iteratively testing the performance of the model on feature subsets of sizes from 1 to max_f (forward) or from max_f to min_f (backward). The size of the returned feature subset lies between min_f and max_f, and the combination that scores the highest classification accuracy during cross-validation is selected as the best combination of features.
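The forward variant described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: the function name, the `toy_score` scorer and the `RELEVANT` set are hypothetical. In practice, `score` would return the mean cross-validated accuracy of the random forest model trained on the candidate feature subset.

```python
def sequential_forward_selection(n_features, score, max_f):
    """Greedy forward selection.

    score(subset) -> float is any callable returning a quality estimate
    for a tuple of feature indices. Returns the best-scoring subset seen
    among sizes 1..max_f together with its score.
    """
    selected = []
    remaining = list(range(n_features))
    best_subset, best_score = (), float("-inf")
    while remaining and len(selected) < max_f:
        # Try adding each remaining feature and keep the one that helps most.
        candidate_scores = [(score(tuple(selected + [f])), f) for f in remaining]
        step_score, step_feature = max(candidate_scores)
        selected.append(step_feature)
        remaining.remove(step_feature)
        # Remember the best subset over all sizes visited so far.
        if step_score > best_score:
            best_subset, best_score = tuple(sorted(selected)), step_score
    return best_subset, best_score


# Hypothetical scorer for illustration only: features 0, 2 and 3 are treated
# as relevant, and every other feature lowers the score.
RELEVANT = {0, 2, 3}


def toy_score(subset):
    hits = len(RELEVANT.intersection(subset))
    noise = len(subset) - hits
    return hits - 0.5 * noise


subset, value = sequential_forward_selection(n_features=6, score=toy_score, max_f=6)
```

The search is greedy: once a feature has been added it is never removed, which is why the procedure is much cheaper than an exhaustive search but is not guaranteed to find the globally best subset.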

RESULTS AND DISCUSSION
The features selected by the sequential feature selection algorithm among the 13 features in the heart disease dataset are as follows. Best combination (highest accuracy achieved: 0.971): (0, 1, 2, 4, 6, 7, 8, 9, 11, 12).
As demonstrated in the output, the classification accuracy achieved by the random forest model with the sequential feature selection algorithm is 97.1% on heart disease detection. The features used for training the model are described in Table 2. The most representative features selected by the proposed method among the 13 features in the original dataset are, by index: index 0, age; index 1, sex; index 2, chest pain; index 4, cholesterol; index 6, exercise induced angina compared to rest; index 7, maximum heart rate achieved; index 8, exercise induced angina; index 9, old peak; index 11, cardio vascular disease; and index 12, thalassemia. The highest classification accuracy achieved with the features selected by the proposed sequential feature selection algorithm is 97.1%, as demonstrated in Figure 1. Figure 1 shows the performance of the random forest model using sequential feature selection. As illustrated in Figure 1, 10 of the 13 features characterizing the heart disease dataset are selected as the optimal feature subset by sequential feature selection, as shown in Table 1. We observe from Figure 1 that the highest accuracy is achieved with 10 features. As demonstrated in Figure 1, the performance of the random forest model improves as the feature space grows. However, adding irrelevant features degrades the performance of the classification model. As shown in Figure 1, adding only the features chosen by sequential feature selection has a positive effect on the performance of the random forest model.

Comparison of the proposed and existing feature selection algorithms
Our first experiment demonstrated the sequential feature selection algorithm. The second experiment demonstrated the permutation-based feature selection algorithm, and the third experiment was conducted on the exhaustive feature selection algorithm. Experimental results on the three feature selection algorithms indicate that the exhaustive feature selection algorithm is time consuming and computationally costly but performs well at selecting the optimal feature set. The sequential feature selection algorithm has lower computational overhead than the exhaustive feature selection algorithm. The optimal features selected by sequential, exhaustive, permutation-based and correlation-based feature selection are summarized in Table 3. As we observe in Table 3, the permutation-based feature selection algorithm removes more features than the sequential and exhaustive feature selection algorithms. However, the resulting model does not perform as well as with the sequential and exhaustive feature selection algorithms. We also observe that different feature selection algorithms identify different feature sets as optimal. To measure the quality of these algorithms, we evaluate the mean fold score obtained with the sequential, exhaustive, correlation-based and permutation-based feature selection algorithms. The performance is higher for the sequential feature selection method than for the exhaustive, permutation-based and correlation-based feature selection methods.
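The difference in computational overhead between exhaustive and sequential search can be made concrete by counting how many candidate feature subsets each strategy must score. The sketch below is a back-of-the-envelope calculation, not a measurement from the experiments; the function names are hypothetical, and it assumes greedy forward selection and an exhaustive search over all subset sizes from 1 to d.

```python
from math import comb


def exhaustive_evaluations(d, min_f=1, max_f=None):
    """Number of subsets an exhaustive search scores for sizes min_f..max_f."""
    max_f = d if max_f is None else max_f
    return sum(comb(d, k) for k in range(min_f, max_f + 1))


def sequential_forward_evaluations(d, max_f=None):
    """Number of candidate subsets greedy forward selection scores up to size max_f."""
    max_f = d if max_f is None else max_f
    # Step j (0-based) tries each of the d - j features not yet selected.
    return sum(d - j for j in range(max_f))


d = 13  # number of features in the heart disease dataset used in this study
exhaustive = exhaustive_evaluations(d)          # 2**13 - 1 = 8191 subsets
sequential = sequential_forward_evaluations(d)  # 13 + 12 + ... + 1 = 91 subsets
```

With 5-fold cross-validation, each scored subset requires training five models, so the gap between 8191 and 91 candidate subsets translates directly into training time.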

CONCLUSION
In this study, we proposed a sequential feature selection algorithm for feature selection. Moreover, we studied different types of feature selection algorithms along with their practical implementations. The goal of the proposed sequential feature selection algorithm is to improve the performance of a machine learning classifier. We implemented a number of feature selection algorithms, namely the permutation-based, exhaustive and correlation-based feature selection methods, and compared them with the proposed algorithm. We employed a random forest model to test the proposed algorithm on heart disease classification, and we used the real-world Pima Indian heart disease dataset to evaluate the performance of the proposed and existing feature selection methods. The exhaustive feature selection algorithm produces better performance results. However, its computational time is higher than that of the sequential feature selection algorithm. The exhaustive algorithm has the advantage of fitting to a specific machine learning algorithm. The sequential feature selection algorithm is preferred for large datasets, and its computational time complexity is lower than that of the exhaustive feature selection algorithm. We conducted experiments on a random forest model to test the performance of the selected feature subsets.
As future work, we recommend that researchers extend this study by using different feature selection algorithms, such as filter-based feature selection algorithms, and compare the performance results with the current work in order to make the model more robust and effective. Moreover, we recommend that researchers conduct empirical studies of the existing feature selection algorithms on different high-dimensional datasets, such as text datasets.