Twin support vector machine using kernel function for colorectal cancer detection

Received Jul 22, 2020; Revised Jun 14, 2021; Accepted Oct 12, 2021

Machine learning is increasingly needed in the medical field, and this research applies it to a medical problem. Many cases of colorectal cancer are diagnosed late: by the time the cancer is detected, it is usually well developed. Machine learning, a branch of artificial intelligence, can help detect colorectal cancer early. This study discusses colorectal cancer detection using the twin support vector machine (SVM) method with four kernel functions: linear, polynomial, RBF, and Gaussian. By comparing accuracy and running time, we determine which kernel is better at classifying a colorectal cancer dataset obtained from Al-Islam Hospital, Bandung, Indonesia. The results show that the polynomial kernel has the best accuracy and running time: the twin SVM with the polynomial kernel reaches a maximum accuracy of 86% with a running time of 0.502 seconds.


INTRODUCTION
Cancer is the second leading cause of death globally [1]. Detecting the disease at an early stage is associated with markedly improved survival prospects [2], [3], and early-stage cancer is more likely to be treatable [4]. Colorectal cancer ranks third among cancers by death rate and is responsible for around 600,000 deaths per year worldwide [5]-[8]. Information technology has an important role in medicine, and cancer is a disease that machine learning can help detect. Data is very useful in the medical field, as the rapid development of data mining in medical science shows. Its benefits include high prediction accuracy, reduced treatment costs, better chances of patient recovery, and support for life-saving decisions [9], [10].
Machine learning is an application of artificial intelligence that gives systems the ability to learn and improve automatically from experience without being explicitly programmed [11]. One method that is popular for its very good learning performance is the twin support vector machine (SVM) [12]. A kernel method uses a kernel function so that the algorithm operates in a higher-dimensional feature space; it works with inner products between the images of all pairs of data points in that space. SVM and twin SVM use this method, directly or indirectly, to classify data [13]. The kernel functions commonly used with SVM methods are the linear, polynomial, RBF, and Gaussian kernels. This paper proposes the twin SVM method as a novel approach for the early detection of colorectal cancer, using these four kernels, and compares the performance of the twin SVM under each kernel to find the best kernel for colorectal cancer detection.

RESEARCH METHOD

Twin support vector machine
SVM is a method that finds a single hyperplane to classify samples. The twin SVM proposed in [14] instead uses two hyperplanes and assigns each sample to a class according to its distance from them. The equations of the two hyperplanes are:

$$x^T w_1 + b_1 = 0, \qquad x^T w_2 + b_2 = 0$$

where $w_i$ and $b_i$ are the parameters of the i-th hyperplane. The two hyperplanes are nonparallel; each is closest to the samples of its own class and farthest from the samples of the opposite class. Assume a binary classification task with classes +1 and −1, and let $A \in \mathbb{R}^{m_1 \times n}$ and $B \in \mathbb{R}^{m_2 \times n}$ be the matrices holding the $m_1$ samples of class +1 and the $m_2$ samples of class −1, respectively, each with $n$ features [15]. Each row of a matrix is one sample of the corresponding class. The two hyperplanes of the twin SVM are obtained from (1) and (2):

$$\min_{w_1, b_1, \xi} \ \frac{1}{2} \lVert A w_1 + e_1 b_1 \rVert^2 + c_1 e_2^T \xi \quad \text{s.t.} \quad -(B w_1 + e_2 b_1) + \xi \ge e_2, \ \xi \ge 0 \tag{1}$$

$$\min_{w_2, b_2, \xi} \ \frac{1}{2} \lVert B w_2 + e_2 b_2 \rVert^2 + c_2 e_1^T \xi \quad \text{s.t.} \quad (A w_2 + e_1 b_2) + \xi \ge e_1, \ \xi \ge 0 \tag{2}$$

Here $\xi$ is the vector of slack variables, whose components are all non-negative ($\xi \ge 0$), and $e_1$ and $e_2$ are vectors of ones of the appropriate size. Allowing the decision margin to make a few mistakes is the standard approach when the samples cannot be separated linearly (for example, when some points lie inside the margin or on its wrong side). Each nonzero element of the slack vector sets the cost of a misclassified sample, which is proportional to the distance between the sample and the decision margin. In these equations, $c_1$ and $c_2$ are penalty parameters. Twin SVM is in great demand in various fields, and many versions of the algorithm have been proposed [16]. Recently, several fuzzy formulations of twin SVM have also been proposed [17].
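The rule of assigning a sample to the class whose hyperplane is nearer can be sketched in a few lines of Python. This is a minimal illustration, not the code used in this study: it assumes the two hyperplanes have already been trained and are given as weight vectors `w1`, `w2` with offsets `b1`, `b2`, and the function names are hypothetical.

```python
import math


def distance_to_hyperplane(x, w, b):
    """Perpendicular distance from point x to the hyperplane x.w + b = 0."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return abs(sum(xi * wi for xi, wi in zip(x, w)) + b) / norm


def twin_svm_predict(x, w1, b1, w2, b2):
    """Return +1 if x is nearer to the class +1 hyperplane, otherwise -1.

    (w1, b1) and (w2, b2) are assumed to be already-trained parameters.
    """
    d_pos = distance_to_hyperplane(x, w1, b1)
    d_neg = distance_to_hyperplane(x, w2, b2)
    return 1 if d_pos <= d_neg else -1
```

Breaking ties toward class +1 is an arbitrary choice made here for illustration only.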

Kernel function
A kernel method uses kernel functions to operate algorithms in feature spaces of higher dimension. It works with inner products between the images of all pairs of data points in the feature space [18]. In high-dimensional data sets, it is difficult to classify objects into the right cluster accurately by measuring Euclidean distances, as k-means, c-means, or fuzzy c-medoids do. The kernel method overcomes this difficulty by representing the data distribution in a way that helps validate the true cluster centers [19]. Let $X^n$ be an input space, $F$ a feature space, and $\phi : X^n \to F$. The kernel function is defined in (3) [20], [21]:

$$K(x_1, x_2) = \langle \phi(x_1), \phi(x_2) \rangle \tag{3}$$

where $x_1, x_2 \in X^n$. Kernel functions that are often used are the linear, polynomial, RBF, and Gaussian kernels.
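For concreteness, the four kernels named above can be written as plain Python functions. The exact definitions and parameter names used in this study are not given in the text, so the forms below are common textbook ones and should be read as assumptions. In particular, the RBF kernel is parameterized here by `gamma` and the Gaussian kernel by `sigma`, and the `degree` and `coef0` defaults are illustrative.

```python
import math


def linear_kernel(x, y):
    """K(x, y) = <x, y>."""
    return sum(a * b for a, b in zip(x, y))


def polynomial_kernel(x, y, degree=2, coef0=1.0):
    """K(x, y) = (<x, y> + coef0) ** degree (assumed form)."""
    return (linear_kernel(x, y) + coef0) ** degree


def rbf_kernel(x, y, gamma=1.0):
    """K(x, y) = exp(-gamma * ||x - y||^2) (assumed form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def gaussian_kernel(x, y, sigma=1.0):
    """K(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) (assumed form)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


# Example: polynomial_kernel([1, 2], [3, 4]) evaluates (11 + 1) ** 2 = 144.0
```

Under these assumed forms the RBF and Gaussian kernels belong to the same family, with `gamma = 1 / (2 * sigma ** 2)`, so setting both parameters to 1 gives two different kernel widths.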

k-Fold cross validation
The dataset is divided into training data and testing data so that the resulting model can be evaluated. The machine studies and recognizes colorectal cancer data patterns from the training data, and the testing data are used to evaluate the model obtained after the machine has learned those patterns [24]. The division into training and testing data is made with the k-fold cross validation method [25]. This method divides the dataset into k parts of equal size; the model is then trained and tested k times, with each part taken once as the validation data.
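The splitting step can be sketched with the Python standard library alone. The text does not say which library produced the folds in this study, so the helper below (`k_fold_split`, a hypothetical name) is only an illustration of the idea: shuffle the sample indices once with a fixed seed, cut them into k near-equal parts, and use each part once as the testing data.

```python
import random


def k_fold_split(n_samples, k=10, seed=45):
    """Yield (train_indices, test_indices) pairs for k-fold cross validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)       # reproducible shuffle
    folds = [indices[i::k] for i in range(k)]  # k parts of near-equal size
    for i, test_idx in enumerate(folds):
        # Training data is everything outside the current fold.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx
```

With 210 samples and k = 10, each iteration yields 189 training indices and 21 testing indices.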

Proposed method
This study proposes several stages. In the first stage, the data are divided into training and testing data using k-fold cross validation with k = 10 and a random state of 45, which means that the dataset is divided into 10 parts of equal size. In the second stage, the twin SVM method with the linear, polynomial, RBF, and Gaussian kernels uses the training data to learn data patterns and build classification models. In the next stage, the models obtained classify the testing data and are evaluated on accuracy and running time. To find the best kernel, the evaluation results produced by each kernel are compared.
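The stages above amount to an evaluation loop that can be sketched as follows. This is a hedged outline rather than the study's code: `evaluate` and `fit_majority` are hypothetical names, and `fit_majority` is a placeholder learner standing in for a twin SVM trained with a given kernel. Any function that takes training data and returns a prediction function could be passed in its place.

```python
import random
import time


def evaluate(fit, X, y, k=10, seed=45):
    """Return (mean accuracy, elapsed seconds) for one model under k-fold CV.

    `fit(X_train, y_train)` must return a function mapping a sample to a label.
    """
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]

    start = time.perf_counter()
    accuracies = []
    for test_idx in folds:
        held_out = set(test_idx)
        train_idx = [i for i in indices if i not in held_out]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        correct = sum(predict(X[i]) == y[i] for i in test_idx)
        accuracies.append(correct / len(test_idx))
    elapsed = time.perf_counter() - start
    return sum(accuracies) / k, elapsed


def fit_majority(X_train, y_train):
    """Placeholder learner: always predicts the most common training label."""
    majority = max(set(y_train), key=y_train.count)
    return lambda x: majority
```

Calling `evaluate` once per kernel and comparing the returned pairs mirrors the comparison of accuracy and running time described above.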

RESULTS AND ANALYSIS
This research uses Jupyter Notebook to run the twin SVM programs with the linear, polynomial, RBF, and Gaussian kernels. All stages in this paper were carried out in the Python 3 programming language.

Data
In this study, the data consist of 210 samples and seven features: CEA, hemoglobin, leukocytes, hematocrit, platelets, age, and diagnosis. The diagnosis feature is the target feature in detecting colorectal cancer. The data are colorectal cancer data obtained from Al-Islam Hospital, Bandung, Indonesia, with the diagnosis coded as cancer (1) or no cancer (0). Table 2 shows part of the data.

Confusion matrix
In this paper, a confusion matrix was used to help calculate the evaluation parameters of the classification model. Table 3 shows the confusion matrix used to evaluate the kernel-based twin SVM classification model for the diagnosis of colorectal cancer.
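As an illustration of how the four cells of a binary confusion matrix are counted, the sketch below tallies them from true and predicted labels. It assumes the label coding described for the data, cancer (1) and no cancer (0); the function name `confusion_counts` and the example labels are made up for the illustration and are not results from this study.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary classifier."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            counts["TP" if truth == positive else "FP"] += 1
        else:
            counts["FN" if truth == positive else "TN"] += 1
    return counts


# Illustrative labels only (not data from the study):
example = confusion_counts([1, 1, 0, 0, 1, 0], [1, 0, 0, 1, 1, 0])
# example == {"TP": 2, "TN": 2, "FP": 1, "FN": 1}
```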

Evaluation parameters
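The parameters used to evaluate the performance of the twin SVM classification model are accuracy and running time. The formula for accuracy is shown in (4):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$

where TP, TN, FP, and FN are the numbers of true positives, true negatives, false positives, and false negatives in the confusion matrix. Accuracy compares the number of cases correctly identified as colorectal cancer or not colorectal cancer with the total number of cases.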
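Accuracy is the share of correctly classified cases among all cases, so it follows directly from the four confusion-matrix counts. A minimal sketch, with a hypothetical function name and illustrative counts rather than figures from this study:

```python
def accuracy(tp, tn, fp, fn):
    """Fraction of cases classified correctly: (TP + TN) / all cases."""
    total = tp + tn + fp + fn
    if total == 0:
        raise ValueError("accuracy is undefined for an empty confusion matrix")
    return (tp + tn) / total


# Illustrative counts only: 9 + 8 correct out of 21 cases in one test fold.
fold_accuracy = accuracy(tp=9, tn=8, fp=2, fn=2)
```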

Results
In this section, we discuss the performance evaluation of the twin SVM classification model built with the linear, polynomial, RBF, and Gaussian kernels for detecting colorectal cancer. Table 4 compares the accuracy and running time of the twin SVM under each kernel; all kernel parameters are set to 1. Based on Table 4, the highest accuracy, 86%, was recorded with the polynomial kernel, at a running time of 0.502 seconds. The lowest accuracy, 76%, was recorded with the RBF and Gaussian kernels, with running times of 1.605 seconds and 1.612 seconds, respectively. In terms of running time, the twin SVM model with the polynomial kernel is the fastest of the four, at around 0.502 s, while the model with the Gaussian kernel is the slowest, at around 1.612 s. Based on these results, the polynomial kernel gives the best results in both accuracy and running time, and is therefore the best kernel for the twin SVM on this colorectal cancer dataset.

CONCLUSION
Detecting colorectal cancer quickly is very important, because it allows the cancer to be treated before it spreads to other organs of the body. However, this is difficult because colorectal cancer has no specific symptoms. The twin SVM method can help detect colorectal cancer based on blood tests and age. The most appropriate kernel for the twin SVM method in detecting colorectal cancer is the polynomial kernel, which produces an accuracy of 86% with a running time of 0.502 seconds.