Supervised machine learning based liver disease prediction approach with LASSO feature selection

Received Jul 17, 2021; Revised Sep 18, 2021; Accepted Oct 16, 2021

In this contemporary era, the use of machine learning techniques is increasing rapidly in the field of medical science for detecting various diseases such as liver disease (LD). Around the globe, a large number of people die because of this deadly disease. Diagnosing the disease at an early stage allows early treatment, which can help to cure the patient. In this research paper, a method is proposed to diagnose LD using supervised machine learning classification algorithms, namely logistic regression, decision tree, random forest, AdaBoost, KNN, linear discriminant analysis, gradient boosting and support vector machine (SVM). We also applied a least absolute shrinkage and selection operator (LASSO) feature selection technique to the dataset to identify the attributes most highly correlated with LD. The predictions made by the algorithms with 10-fold cross-validation (CV) are evaluated in terms of accuracy, sensitivity, precision and f1-score. It is observed that the decision tree algorithm has the best performance with the inclusion of LASSO, with accuracy, precision, sensitivity and f1-score values of 94.285%, 92%, 99% and 96% respectively. Furthermore, a comparison with recent studies is shown to demonstrate the significance of the proposed system.


INTRODUCTION
The liver is a vital organ in the human body that performs functions such as bile production, detoxification of chemicals, and synthesis of important proteins needed for blood clotting. In recent years, a significant increase in various liver diseases has been observed around the globe. In India, the mortality rate because of the disease is 2.4% of the population [1]. There are more than 100 types of liver diseases, among which cirrhosis is diagnosed when the liver cells are damaged and replaced by non-living scar tissue [2]. One of the most traditional ways to detect liver disease (LD) is for a specialized radiologist to examine whether the liver tissue is abnormal. However, studies show that simple visual interpretation of liver diseases yields a diagnostic accuracy of only around 72% [3]. Since most medical centers, hospitals, and diagnostic centers are equipped with modern computer-based machines for testing and diagnosis, early detection of LD is possible, which enables a faster cure. By applying machine learning algorithms to the lab data, a model can be built for a much more efficient diagnosis. Depending on the input and the classification algorithm used, the analysis may give different accuracy rates [4].
Gogi and Vijayalakshmi [5] performed a prognosis of LD using machine learning techniques. A liver function test (LFT) dataset with 11 attributes was used to detect LD. Five data mining classification techniques were applied on the MATLAB 2016 platform. The linear discriminant algorithm achieved an accuracy of 95.8% and an ROC of 0.93. Midhila et al. [6] described a computer-based analysis and classification approach to detect 10 types of LD from ultrasound images using techniques such as segmentation, despeckling, feature extraction and the gray level difference weights method. The accuracy of classification and segmentation in detecting cysts was 90% and 80% respectively.
Kumar and Katyal [7] outline a method for analyzing LD using data mining techniques. They created a classification model to diagnose and forecast liver problems using 5 data mining algorithms and 1 boosting algorithm. Without boosting, the method's best accuracy was 72.18%. Spann et al. [8] presented a comprehensive review of machine learning approaches to LD and transplantation. In the review paper, the authors found that when large volumes of patient data are available, supervised machine learning tools can detect nonalcoholic fatty liver disease (NAFLD) at an early stage. Ramkumar et al. [9] describe liver cancer prediction based on conditional probability using Bayes' theorem. WEKA tools and data mining techniques were used to predict liver cancer. Based on Bayes' theorem, drinking alcohol was found to cause LD. Kefelegn and Kamat [10] presented a survey of data mining methods used to predict and analyze liver diseases. Three algorithms, namely Naïve Bayes, SVM and C4.5, were utilized in the study. Performance was evaluated using a confusion matrix, with 10-fold cross-validation for partitioning the data.
In the proposed study, the symptoms of LD are analyzed and machine learning models are used to predict whether a patient is prone to LD. The classification algorithms used in the process are logistic regression, decision tree (DT), random forest (RF), AdaBoost, k-nearest neighbor (KNN), linear discriminant analysis (LDA), gradient boosting and support vector machine (SVM). To address the research gap in previous studies on the subject, the proposed method uses the LASSO feature selection technique to identify the features that play a significant role in the occurrence of LD. The accuracy of the proposed model is compared with that of existing studies to demonstrate its superiority.

PROPOSED METHOD
The proposed model takes a dataset of LD patients from the UCI Machine Learning Repository for the prediction of the disease. Firstly, the raw data is pre-processed to get clean data. From all the attributes in the dataset, the relevant ones are chosen using the LASSO feature selection method, so that better accuracy is obtained from relevant data only. The data is then analyzed using a 10-fold cross-validation approach and the classification algorithms, namely logistic regression, DT, RF, AdaBoost, KNN, LDA, gradient boosting and SVM. The classification results are measured to determine the performance rate. To verify the performance of the system, the algorithms are compared on accuracy, sensitivity, precision and f1-score, which also helps to identify the highest performing algorithm. The flow of the entire system is illustrated in Figure 1.
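The flow described above can be sketched with scikit-learn. This is a minimal illustration under stated assumptions, not the authors' implementation: the data is a synthetic stand-in produced by `make_classification`, the LASSO penalty `alpha=0.01` is an arbitrary illustrative value, and only the decision tree classifier is shown.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the patient data (assumption: 10 numeric attributes
# and a binary label). Replace with the real feature matrix and labels.
X, y = make_classification(n_samples=300, n_features=10, n_informative=4,
                           n_redundant=0, random_state=0)

# Pre-processing, LASSO-based feature selection, then classification.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("select", SelectFromModel(Lasso(alpha=0.01))),  # alpha is illustrative
    ("classify", DecisionTreeClassifier(random_state=0)),
])

# cv=10 gives 10-fold cross-validation (stratified by class for classifiers).
scores = cross_val_score(pipeline, X, y, cv=10, scoring="accuracy")
mean_accuracy = scores.mean()
print(f"mean 10-fold accuracy: {mean_accuracy:.3f}")
```

Wrapping the steps in one `Pipeline` means the scaler and the feature selector are refit on the training part of every fold, so no information from the test fold leaks into the selection step.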

RESEARCH METHODOLOGY

Data collection
For this research, the Indian Liver Patient Dataset (ILPD) is downloaded from the UCI Machine Learning Repository [11]. The ILPD dataset has 583 instances and 10 attributes (age of the patient, gender of the patient, total bilirubin, direct bilirubin, alkaline phosphatase, alanine aminotransferase, aspartate aminotransferase, total proteins, albumin, and albumin-to-globulin ratio), as well as a selector field that indicates whether the subject is a liver patient or not. There are 167 non-LD patients and 416 LD patients, as determined by counting the values of the selector field. Figure 2 shows the data distribution in the dataset. The dataset is multivariate, and its attributes are integer and real valued [12]. The models are trained and tested on these data, and their outputs are used to evaluate the models' performance.

Preprocessing dataset
Data preprocessing is one of the most vital stages in machine learning classification, as the cleaner the data, the better the classification result tends to be [13]. The pre-processing techniques applied in the model are described as follows:
a. Reduce noisy data: There are two kinds of data noise in machine learning: attribute noise and class noise. In the proposed model, attribute noise is reduced using the pandas library to achieve better accuracy.
b. Data transformation: Data transformation refers to the process of reorganizing or restructuring raw data. It is used to transform raw data into a suitable format that allows data mining to extract strategic information more effectively and quickly.
c. Standard scaler: The standard scaler transforms data such that its distribution has a mean value of zero and a standard deviation of one. Aggregate functions perform operations on the column values and return a single value.
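The three steps above might look as follows with pandas and scikit-learn. This is only a sketch: the toy records and column names are made up for illustration, and "noise reduction" is assumed here to mean dropping duplicate rows and filling missing values with the column median, which the paper does not specify.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy records standing in for the raw data (values and names are made up).
raw = pd.DataFrame({
    "age": [65, 62, 62, 58, 72, 46],
    "gender": ["Female", "Male", "Male", "Male", "Male", "Male"],
    "total_bilirubin": [0.7, 10.9, 7.3, 1.0, 3.9, np.nan],
    "albumin": [3.3, 3.2, 3.3, 3.4, 2.4, 4.4],
})

# a. Noise reduction (assumption: drop duplicates, fill gaps with the median).
clean = raw.drop_duplicates().copy()
numeric_cols = ["age", "total_bilirubin", "albumin"]
clean[numeric_cols] = clean[numeric_cols].fillna(clean[numeric_cols].median())

# b. Data transformation: encode the categorical column as 0/1.
clean["gender"] = clean["gender"].map({"Female": 0, "Male": 1})

# c. Standard scaler: zero mean and unit standard deviation per column.
scaled = StandardScaler().fit_transform(clean[numeric_cols])
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```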

LASSO feature selection
In this proposed method, LASSO is used for data fitting and feature selection in order to reduce overfitting, improve accuracy and reduce training time. LASSO removes unnecessary, highly correlated features from the dataset without much loss of information. It does so by penalizing the sum of the absolute values of the coefficients. LASSO combines the benefits of ridge regression with subset selection to enhance model interpretability and prediction accuracy. If a set of parameters is strongly correlated, LASSO selects one of them and shrinks the others to zero. This reduces the variability of the estimate by shrinking certain coefficients to exactly zero, resulting in a model that is simple to understand [13]. Algorithm 1 shows the working process of LASSO as implemented in this system.
Step 6: the frequency of selection of each feature is calculated according to qi, k = 1, 2, …, T
Step 7: return q̄i: the set of features selected most frequently
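Because only the final steps of Algorithm 1 are given, the following is a hedged sketch of one way to obtain a selection frequency per feature: fit LASSO on T bootstrap resamples and count how often each coefficient is nonzero. The function name `lasso_selection_frequency`, the bootstrap resampling, the penalty `alpha=0.1`, the 0.5 threshold and the synthetic data are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import Lasso

def lasso_selection_frequency(X, y, alpha=0.1, n_rounds=50, seed=0):
    """Fraction of resampling rounds in which each feature keeps a nonzero coefficient."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    counts = np.zeros(n_features)
    for _ in range(n_rounds):
        idx = rng.integers(0, n_samples, size=n_samples)  # bootstrap resample
        model = Lasso(alpha=alpha).fit(X[idx], y[idx])
        counts += model.coef_ != 0  # LASSO sets irrelevant coefficients to exactly zero
    return counts / n_rounds

# Synthetic data: only features 0 and 1 drive the binary label.
rng = np.random.default_rng(42)
X = rng.standard_normal((300, 5))
y = (3 * X[:, 0] - 2 * X[:, 1] + 0.5 * rng.standard_normal(300) > 0).astype(float)

frequency = lasso_selection_frequency(X, y)
selected = np.flatnonzero(frequency >= 0.5)  # features kept in most rounds
print(frequency, selected)
```

A larger `alpha` shrinks more coefficients to zero and therefore keeps fewer features, so in practice the penalty would be tuned rather than fixed.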

10 fold cross-validation
The data set must be divided into a training set and a test set in order to train and test a model. The system uses a 10-fold cross-validation method for this purpose. Algorithm 2 describes the 10-fold cross-validation procedure.
Step 2: repeat step 3 to step 5 for x = 1, 2, …, 10
Step 3: for the x-th iteration, take the x-th subset as the test set and the rest as the training set
Step 4: train the model on the training set
Step 5: test the model on the x-th test set
Step 6: take the mean of the ten results as the final output
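The steps of Algorithm 2 can be written out as an explicit loop. The sketch below assumes scikit-learn's `KFold` for the partitioning, a decision tree as the model, and synthetic data in place of the real dataset. Shuffling before splitting is also an assumption.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the dataset (assumption).
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

# Partition the data into 10 subsets.
folds = KFold(n_splits=10, shuffle=True, random_state=0)

fold_accuracy = []
for train_idx, test_idx in folds.split(X):
    # Step 3: the current subset is the test set, the rest is the training set.
    model = DecisionTreeClassifier(random_state=0)
    model.fit(X[train_idx], y[train_idx])                          # Step 4
    fold_accuracy.append(model.score(X[test_idx], y[test_idx]))   # Step 5

# Step 6: the mean of the ten results is the final output.
final_accuracy = float(np.mean(fold_accuracy))
print(f"10-fold accuracy: {final_accuracy:.3f}")
```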

Data classification
a. Decision tree:
The decision tree is the most popular supervised learning algorithm for prediction. As the name suggests, the algorithm is formed in a tree structure with the root node, branches and leaf nodes that indicate attributes, conditions and outcomes respectively [14]. Entropy as denoted in (1) shows the homogeneity as well as the purity of a dataset, and information gain is the change in an input's entropy, which is usually a reduction [15].
$$E(D) = -P(\text{positive}) \log_2 P(\text{positive}) - P(\text{negative}) \log_2 P(\text{negative}) \tag{1}$$
b. Random forest: RF is an ensemble of multiple decision trees. It can be used for both classification and regression. Random forest prevents the model from overfitting to give better predictions [16]. The trees are indexed from the lower limit b = 1 to the upper limit B. The prediction for an unknown sample $x'$ is generated by averaging the predictions $f_b(x')$ from each tree on $x'$, as stated in (2):
$$\hat{f} = \frac{1}{B} \sum_{b=1}^{B} f_b(x') \tag{2}$$
c. K-nearest neighbor: The KNN algorithm works on the assumption that similar things exist in close proximity. Its advantages include being simple and easy to implement [17]. The KNN algorithm uses the Euclidean distance in (3), where p and q are two different data points with n attributes:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \tag{3}$$
d. Support vector machine: Using the SVM supervised classification method with an RBF kernel, the data points are split into two classes by a hyperplane determined by the support vectors [18]. The Euclidean distance is used to calculate the distance between the support vectors and the hyperplane, as shown in (4):
$$\beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_n X_n = 0 \tag{4}$$
where $\beta_0, \beta_1, \beta_2, \dots, \beta_n$ represent hypothetical values and $X_1, \dots, X_n$ represent data points in the n-dimensional sample space. SVM was originally created to handle two-class classification problems; however, it was subsequently adapted for multi-class situations.
e. AdaBoost classifier: AdaBoost is an ensemble model constructed from n decision trees [15]. The data misclassified by the first decision tree is passed to the next tree for classification, and so on until the n-th tree, to get the most accurate prediction, as shown in (5). The number of training instances is represented by n, and the i-th training instance is represented by $x_i$. The decision stump generates an output for each input variable.
g. Linear discriminant analysis: As the name implies, the model reduces the dimension of the dataset yet keeps enough information for classification [19]. It uses the information in the retained dimensions to construct a new axis that minimizes the within-class variance and maximizes the distance between the classes, as illustrated in Figure 3.
h. Gradient boosting: Boosting is a method that converts weak learners into strong learners. Here, each tree is fit on a modified version of the original data set. Gradient boosting differs from conventional machine learning methods in that the optimization is carried out in function space. After M iterations, the optimal function F(x) is found, which is computed using (6):
$$F(x) = \sum_{i=1}^{M} f_i(x) \tag{6}$$
where $f_i(x)$ (i = 1, 2, …, M) denotes the function increments, with $f_i(x) = -\rho_i \, g_i(x)$.
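As a worked illustration of two of the formulas referenced above, the entropy in (1) and the Euclidean distance used by KNN in (3), the following sketch uses only the Python standard library. The function names are hypothetical. The class counts 416 and 167 are the LD and non-LD counts reported for the dataset.

```python
import math

def binary_entropy(n_positive, n_negative):
    """Entropy of a two-class set: -sum(p * log2(p)) over the non-empty classes."""
    total = n_positive + n_negative
    result = 0.0
    for count in (n_positive, n_negative):
        if count:  # an empty class contributes nothing
            p = count / total
            result -= p * math.log2(p)
    return result

def euclidean_distance(p, q):
    """Straight-line distance between two points of equal dimension."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))

# Entropy of the full dataset before any split (416 LD, 167 non-LD records).
dataset_entropy = binary_entropy(416, 167)
print(round(dataset_entropy, 3))
print(euclidean_distance((0, 0), (3, 4)))
```

A perfectly balanced set has entropy 1 and a pure set has entropy 0, so a decision tree prefers splits whose child nodes move the entropy towards 0.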

Performance measurements
In this research paper, we use the confusion matrix because it is a simple and effective way to calculate the performance of a classifier whose output has two or more classes [20]. Using the entries of the matrix, namely true positives (TP), true negatives (TN), false positives (FP) and false negatives (FN), the performance of the models is measured using (7) to (10).
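The measures computed from the confusion matrix are assumed here to be the standard definitions of accuracy, precision, sensitivity (recall) and f1-score. The sketch below shows them as a small function. The function name and the example counts are made up for illustration and are not results from the paper.

```python
def classification_metrics(tp, tn, fp, fn):
    """Standard measures derived from the four confusion-matrix entries."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "f1": f1,
    }

# Illustrative counts only (not taken from the paper's experiments).
metrics = classification_metrics(tp=40, tn=45, fp=10, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```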

RESULTS AND DISCUSSION
In the system, the LASSO feature selection technique was used to determine the important features for classification. Figure 4 illustrates and rates the importance of each feature in the dataset. According to LASSO feature selection, only six features are needed for classification, which aids in achieving a higher accuracy rate. To verify that LASSO feature selection increases the accuracy of the models substantially, the classification accuracy of the algorithms with and without the feature selection technique is compared. Figure 5 depicts the accuracy rates without and with LASSO. When all features are used, the logistic regression model shows the highest accuracy of 77.1428%, whereas with LASSO the decision tree model shows an accuracy of 94.285%.
The performance of the models on all features of the dataset is assessed using the confusion matrix. As noted above, logistic regression gives the best accuracy when all features are used. Precision, recall and f1-scores are computed from the performance measures and shown in Table 1. Logistic regression also has the highest precision, recall and f1-score, with 76.7%, 75.3% and 75.1% respectively.
The performance of the models after feature selection is likewise measured using a confusion matrix. As noted above, the decision tree algorithm shows superior accuracy after feature selection. Precision, recall and f1-scores are calculated and shown in Table 2. The decision tree algorithm also has the highest precision, recall and f1-score, with 92%, 99% and 95.3% respectively. Table 3 compares the proposed model of the paper with various studies done on the topic in recent years.
It can be observed from the comparison that the proposed model shows much higher accuracy than the studies that used other machine learning models to predict LD patients.

| Study | Year | Dataset | Model | Accuracy |
|---|---|---|---|---|
| [21] | 2018 | ILPD | LG | 73.97% |
| Singh et al. [22] | 2019 | ILPD | LG | 72.50% |
| Thaiparnit et al. [23] | 2018 | Liver Disorder | RF | 75.76% |
| Rahman et al. [24] | 2019 | ILPD | LG | 75% |
| Kumar and Thakur [25] | 2020 | BUPA, ILPD | Fuzzy-NWKNN | 78.46% |
| Rabbi et al. [26] | 2020 | ILPD | AdaBoost | 92.19% |
| Poonguzharselvi et al. [27] | 2021 | UCI repository | Random Forest | 84% |

CONCLUSION
The system proposed in the paper contributes to the field of medical science by helping to identify LD in a patient from certain data at an early stage, which allows treatment to start and the disease to be cured before it becomes fatal. To do so, the age of the patient, total bilirubin, direct bilirubin, alkaline phosphatase and albumin are needed. This is determined by the LASSO method. Using classification algorithms, it is observed that the decision tree algorithm gives the most accurate prediction among the classifiers, the other seven being RF, LR, SVM, KNN, LDA, AdaBoost and gradient boosting. It has an accuracy rate of 94.285%, followed by SVM with an accuracy rate of 93.7142%. The decision tree model also has precision, sensitivity and f1-score of 0.92, 1.00 and 0.96 respectively. In the future, the proposed model could also be used to predict other diseases such as cancer, Parkinson's disease and Alzheimer's disease.