Bulletin of Electrical Engineering and Informatics

Received Sep 26, 2022 Revised Nov 2, 2022 Accepted Nov 23, 2022

Breast cancer is one of the leading causes of death and the most frequently diagnosed cancer among women. Annually, almost half a million women die from breast cancer. Machine learning is a subfield of artificial intelligence (AI) and computer science that uses data and algorithms to mimic how humans learn, gradually improving its accuracy. In this work, simple machine learning methods are used to classify breast cancer microarray data as normal or relapse. The data are from the gene expression omnibus (GEO) website, namely datasets GSE45255 and GSE15852. These two datasets are combined to form a single dataset. The study involves three machine learning algorithms: random forest (RF), extra tree (ET), and support vector machine (SVM). Grid search cross validation (CV) is applied for hyperparameter tuning of the algorithms. The results show that the tuned SVM is the best among the tested algorithms, with an accuracy of 97.78%. For future work, it is recommended to include a feature selection method to obtain the optimal features and better classification accuracy.


INTRODUCTION
Breast cancer is the most prevalent cancer among women worldwide, and many women are affected by this life-threatening disease. It is the second biggest factor in female cancer-related fatalities [1], [2]. Breast cancer is a malignant tumor caused by breast cells growing and dividing out of control, creating a lump of tissue. However, not all lumps are cancerous: benign tumors are non-cancerous growths that are treatable with medication and are not life-threatening [2], whereas malignant tumors are cancerous growths that can be fatal if left untreated. Early diagnosis is important. If such a lump appears in a patient's breast, the patient should consult a medical doctor for early diagnosis and medical treatment [3], [4].
One of the most essential technologies in bioinformatics research is the gene chip, commonly known as the DNA microarray [5]-[7]. A great amount of biological information is available in gene expression microarray data, owing to the rapid development of sequencing technologies [5], [6]. Breast cancer gene expression profiles are among the information available in microarray data and are important in the prognosis of breast cancer patients [7]-[9]. The expression values in a microarray dataset are often organised as an MxN matrix, with each column representing a feature [10] (also known as a gene) and each row representing a sample, as illustrated in Figure 1 [6].
In recent years, researchers have shown a great deal of interest in the detection and classification of cancer from microarray data using machine learning algorithms. Microarray classification assigns cancer samples to their class based on their gene expression profiles. Machine learning is a subset of artificial intelligence (AI) that enables systems to learn from training data and improve over time. For example, Almugren and Alshamlan [8] hybridized a machine learning algorithm known as support vector machine (SVM) with the firefly algorithm to classify several types of cancer through selection of microarray features. There are several challenges in the classification of microarray data. In a gene expression study, thousands of gene features are obtained from a much smaller number of samples [9]. This leads to what is known as the high dimensionality problem in microarrays [10]. In addition, gene expression data contain numerous ineffective and unnecessary attributes, and only a handful of the assessed genes may have a meaningful impact on cancer classification. Therefore, the classification of microarray data remains challenging due to the small number of samples and the high dimensionality problem [11], [12].
In this study, we apply simple machine learning algorithms to classify high dimensional microarray breast cancer data. The three machine learning methods applied are random forest (RF), extra tree (ET) and SVM. The classification models are applied to the data without using any feature selection methods. The hyperparameters of the three machine learning models are selected using the grid search cross validation (CV) method. This study aims to determine the best classifier among the three after performing grid search CV.

THE MACHINE LEARNING ALGORITHMS
Classification is a data mining technique that assigns categories to a set of data to enable more accurate analysis. Supervised classification is a type of learning in which the class labels of the training data are known [11], [13]. There are two steps involved in constructing a classifier: i) the learning phase, during which the model or classifier is constructed from a set of training data paired with class labels and ii) evaluating the accuracy of the model on unseen data. Three common machine learning methods [13]-[15], RF, ET and SVM, are applied with grid search CV in this work. These three models were chosen as classifiers in this study for two reasons: i) they are fast and ii) they are able to deal with high dimensional datasets. Grid search CV was used to tune the hyperparameters and fit the model to the training data using the optimal parameters. This study implements k-fold cross-validation (CV) with the number of folds set to 10.

Random forest
The RF method is a collection of tree-structured classifiers. It classifies a new sample using the most frequently occurring prediction produced by these trees. The trees are grown with random feature selection: at each node, random features are chosen as candidates for splitting. This helps to reduce over-fitting, and RF classification is quick [16].

Extremely randomized tree
The ET method also uses a group of many decision trees for classification. It forms a forest of decision trees like the RF method, but the trees are constructed differently [17]. To divide each node, every decision tree chooses the best feature from a set of K randomly chosen attributes based on some chosen criterion. Using the training dataset, the ET algorithm generates numerous unpruned decision trees. The final prediction combines the predictions of all decision trees, by averaging for regression and by majority voting for classification.

Support vector machine
SVM [17] focuses on locating a hyperplane that best divides the tuples of one class from those of another. The hyperplane is identified using the support vectors and the margin. The support vectors are the data points that lie closest to the hyperplane. The margin is the distance between the hyperplane and the closest points on its two sides. When two-dimensional data is linearly separable, the hyperplane is the line that divides the data into two parts, with each part belonging to a single class. Maximizing the margin, which is the distance between the nearest data points (the support vectors) of each class, identifies the optimal hyperplane. The SVM kernel (linear or radial basis function (RBF)), the C (cost) value, and the gamma value were all tuned to achieve the best SVM model [18].
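
As a minimal illustration of the margin idea, the sketch below fits a linear SVM on four hand-made, linearly separable points with scikit-learn's SVC. The toy data, the very large C value (used to approximate a hard margin) and the variable names are assumptions made for illustration and are not taken from the study.

```python
import numpy as np
from sklearn.svm import SVC

# Toy, linearly separable data (hypothetical; not from the study).
X = np.array([[0.0, 0.0], [0.0, 1.0], [2.0, 0.0], [2.0, 1.0]])
y = np.array([0, 0, 1, 1])

# A very large C approximates a hard-margin SVM.
clf = SVC(kernel="linear", C=1e6)
clf.fit(X, y)

w = clf.coef_[0]                        # normal vector of the separating hyperplane
b = clf.intercept_[0]                   # offset of the hyperplane
margin_width = 2.0 / np.linalg.norm(w)  # distance between the two margin boundaries
support_vectors = clf.support_vectors_  # training points that define the margin
```

For these points the two classes are two units apart along the first axis, so the maximized margin width should come out close to 2.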

Grid search cross-validation for hyperparameter tuning
With the right combination of hyperparameters, a machine learning model that is resilient and accurate can be built [14], [18]. Hyperparameter tuning refers to the process of selecting the optimal set of hyperparameters. To maximize the performance metric, each machine learning method must be trained with different combinations of hyperparameters, and each combination is evaluated using the CV technique. The following common terms should be considered when using grid search CV (GridSearchCV).
-Estimator: this term is used in scikit-learn for an object that implements the estimator interface. This parameter receives the classifier that needs to be trained.
-Parameter grid: a Python key-value dictionary of parameter names and the settings to try. All parameter combinations are checked to find the most accurate results.
-CV: this establishes the CV splitting approach. CV is a resampling technique used to assess machine learning models. Its major objective is to assess how well machine learning models perform on new data. It operates by first randomly shuffling the dataset. Then, k groups are created from the complete dataset. Each group is used once as the test group, while the other groups are used as training data. Each sample is therefore used for training k-1 times and appears in the test group exactly once.
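
The k-fold procedure described in the last item can be sketched in plain Python as follows. This is an illustrative sketch rather than the scikit-learn implementation; the function names, the seed and the sample count are hypothetical.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k nearly equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    return [indices[i::k] for i in range(k)]

def k_fold_splits(n_samples, k, seed=0):
    """Yield (train_indices, test_indices) for each of the k rounds."""
    folds = k_fold_indices(n_samples, k, seed)
    for i, test_fold in enumerate(folds):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, test_fold

# Count how often each sample is used for training and for testing.
n_samples, k = 25, 10
train_counts = [0] * n_samples
test_counts = [0] * n_samples
for train, test in k_fold_splits(n_samples, k):
    for idx in train:
        train_counts[idx] += 1
    for idx in test:
        test_counts[idx] += 1
```

After the loop, every entry of test_counts is 1 and every entry of train_counts is k-1, which is the property stated above.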

PROCEDURE

Microarray breast cancer dataset
Two breast cancer datasets were downloaded from the gene expression omnibus (GEO) [19], [20] for this study. Their accession numbers are GSE45255 and GSE15852, and the chip platform is GPL96. GSE45255 includes 139 breast cancer patients only. GSE15852, on the other hand, has 43 paired normal and breast cancer samples. These two datasets were combined to form an integrated dataset with 182 breast cancer samples and 43 normal samples, each with 22,215 genes [21]. From this point forward, the combined dataset is referred to as GSE_integrate. In the dataset, when several probes on the platform mapped to the same gene, the average of those probes was taken, and the probes whose names start with "AFFX" were deleted because no genes are associated with them [22]. The data are split into training and test sets in an 80:20 ratio in this study.
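
The probe handling described above (dropping "AFFX" probes and averaging probes that map to the same gene) could be sketched as follows. This is a sketch under assumptions: the data layout, the function name, the gene names and the rule for unannotated probes are hypothetical and are not described in the paper.

```python
from collections import defaultdict

def collapse_probes(probe_rows, probe_to_gene):
    """Drop AFFX probes and average probes that map to the same gene.

    probe_rows: dict mapping probe ID -> list of expression values (one per sample).
    probe_to_gene: dict mapping probe ID -> gene symbol.
    Returns a dict mapping gene symbol -> list of averaged expression values.
    """
    grouped = defaultdict(list)
    for probe_id, values in probe_rows.items():
        if probe_id.startswith("AFFX"):
            continue  # probes without an associated gene
        gene = probe_to_gene.get(probe_id)
        if gene is None:
            continue  # assumption: unannotated probes are skipped
        grouped[gene].append(values)
    # Average sample-wise over all probes of the same gene.
    return {
        gene: [sum(col) / len(col) for col in zip(*rows)]
        for gene, rows in grouped.items()
    }

# Hypothetical probe IDs and values, for illustration only.
probes = {
    "AFFX-BioB-5_at": [5.0, 5.1],
    "probe_a": [2.0, 4.0],
    "probe_b": [4.0, 6.0],
    "probe_c": [1.0, 1.5],
}
annotation = {"probe_a": "GENE1", "probe_b": "GENE1", "probe_c": "GENE2"}
genes = collapse_probes(probes, annotation)
```

Here probe_a and probe_b both map to GENE1, so their values are averaged per sample, and the AFFX probe is discarded.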
Before classification is applied, some pre-processing is essential. Two pre-processing steps were implemented for the BC dataset. First, all samples were split into two classes, where relapse was represented as 1 and non-relapse (a good prognosis) was represented as 0. Second, the input features (gene values) were normalized to the interval [0,1]. The min-max normalization method is as follows [11].
$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}} \tag{1}$$
Where $x'$ represents the normalized value of the input feature $x$, and $x_{\min}$ and $x_{\max}$ are the minimum and maximum values of that feature, respectively. The formats of the original microarray breast cancer profiles before and after pre-processing are shown in Tables 1 and 2.
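
A minimal sketch of min-max normalization applied per feature is given below. Computing the minimum and maximum on the training rows only is a common precaution and an assumption here; the paper does not state which rows were used. The function names and the toy values are hypothetical.

```python
def min_max_fit(train_rows):
    """Return per-feature (min, max) pairs computed from the training rows."""
    columns = list(zip(*train_rows))
    return [(min(col), max(col)) for col in columns]

def min_max_transform(rows, ranges):
    """Scale each feature using previously computed (min, max) pairs."""
    scaled = []
    for row in rows:
        scaled_row = []
        for value, (lo, hi) in zip(row, ranges):
            # A constant feature has hi == lo; map it to 0.0 to avoid division by zero.
            scaled_row.append((value - lo) / (hi - lo) if hi > lo else 0.0)
        scaled.append(scaled_row)
    return scaled

# Toy data: rows are samples, columns are features (the second feature is constant).
train = [[2.0, 10.0], [4.0, 10.0], [6.0, 10.0]]
ranges = min_max_fit(train)
train_scaled = min_max_transform(train, ranges)
```

With ranges fitted on the training rows, training values fall in [0, 1], while values in other rows that lie outside the training range would fall outside that interval.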

Method
Three machine learning models with grid search CV are investigated for classifying BC microarray data. The flow-chart is shown in Figure 2. The following steps describe the procedure:
-The dataset is split into training and testing data with a ratio of 80:20 (80% for training data and 20% for testing data).
-During data splitting, stratification is applied to ensure that the class proportions are the same in the training and testing sets. The scikit-learn package in Python was used for splitting and stratifying.
-The datasets were classified using SVM, RF and ET with the k-fold CV method, in which k is 10. Using 10-fold CV, the data is split into 10 subsets; in each fold, 9 subsets are used as the training set and the remaining subset is used as the testing set.
-Hyperparameter tuning was applied to the machine learning models. Hyperparameters govern the training process and cannot be learned during training; unsuitable values can increase the capacity of a model and result in overfitting. A set of candidate hyperparameter values therefore needs to be set before running the experiments. GridSearchCV from the scikit-learn package in Python was applied to determine the best hyperparameters for the models. After this, the optimal hyperparameters obtained from GridSearchCV were used to re-train the model on the training set and to predict on the test set, from which the accuracy is computed. The optimal hyperparameter values differ depending on the training data and the model used. The predictions obtained on the test set are used to measure the performance of each model.
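
The steps above can be sketched with scikit-learn as follows. This is a sketch under stated assumptions, not the authors' code: the data is a synthetic stand-in generated with make_classification, the search grids are hypothetical (the ranges used in the study are the ones reported in its Table 3), and the random seeds are arbitrary.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the expression matrix (assumption: 225 samples, imbalanced classes).
X, y = make_classification(
    n_samples=225, n_features=200, n_informative=20,
    weights=[0.2, 0.8], random_state=0,
)

# 80:20 split; stratify=y keeps the class proportions the same in both parts.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Hypothetical search grids, chosen only to keep the example small.
searches = {
    "SVM": (SVC(), {"kernel": ["linear", "rbf"], "C": [0.1, 1, 10], "gamma": ["scale", 0.01]}),
    "RF": (RandomForestClassifier(random_state=0), {"n_estimators": [50, 100]}),
    "ET": (ExtraTreesClassifier(random_state=0), {"n_estimators": [50, 100]}),
}

results = {}
for name, (estimator, param_grid) in searches.items():
    # 10-fold CV on the training set; refit=True retrains on the whole
    # training set with the best parameter combination.
    search = GridSearchCV(estimator, param_grid, cv=10, scoring="accuracy", refit=True)
    search.fit(X_train, y_train)
    test_accuracy = accuracy_score(y_test, search.predict(X_test))
    results[name] = (search.best_params_, test_accuracy)
```

Each entry of results holds the selected hyperparameters and the test accuracy of the re-trained model. Keeping the test set out of the grid search means the reported accuracy is measured on data not used for tuning.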

Performance metric
The performance of all classifiers is evaluated using several metrics: classification accuracy, f1-score, sensitivity, and specificity [11], [22].

Classification accuracy
Classification accuracy [11] is a commonly used evaluation criterion for a standard classification system and can be calculated using the following.
$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{2}$$
TP represents true positives, the correctly classified positive samples. TN represents true negatives, the correctly classified negative samples. FP represents false positives, the negative samples misclassified as positive, and FN represents false negatives, the positive samples misclassified as negative.

F1-score
The F1-score measures a model's classification ability. It combines a classifier's precision and recall into a single metric by taking their harmonic mean. Its principal function is to compare the performance of two classifiers. Assume classifier A has a higher recall but classifier B has a higher precision. In this case, the F1-scores of both classifiers can be used to assess which one delivers superior results. A perfect model has an f1-score of 1. The formula of the f1-score is given in equation (3).
$$F_1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{3}$$
Where $\text{Precision} = TP / (TP + FP)$ and $\text{Recall} = TP / (TP + FN)$.

Sensitivity and specificity
Sensitivity is also known as the true positive rate (TPR) or recall. It evaluates how well a model can recognize positive samples and is the proportion of correctly classified positive samples to the total number of positive samples, $TP / (TP + FN)$. Specificity, in contrast, is the ability of a test to correctly identify persons who do not have the disease, that is, the proportion of correctly classified negative samples to the total number of negative samples, $TN / (TN + FP)$.
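
The four metrics can be computed directly from the confusion-matrix counts, as in the sketch below. The function name and the example counts are hypothetical and chosen for illustration only; they are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute accuracy, F1-score, sensitivity and specificity from confusion counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall, true positive rate
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # true negative rate
    denom = precision + sensitivity
    # Harmonic mean of precision and recall.
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "f1": f1,
        "sensitivity": sensitivity,
        "specificity": specificity,
    }

# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

For these counts the accuracy is 0.85, the sensitivity 0.8 and the specificity 0.9, which shows how a single accuracy figure can hide a difference between the two class-wise rates.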

SIMULATION RESULT AND DISCUSSION
This section discusses the results obtained from the three classifier models, namely RF, ET and SVM, for the binary class microarray breast cancer dataset. All the classifiers are implemented in the following environment: operating system: Windows 10, CPU: Intel Core i5-10210U (2.11 GHz), and memory: 8GB RAM. Table 3 shows the hyperparameters and the ranges tuned by GridSearchCV. Hyperparameters that are not stated in this table were set to their default values. Table 4 shows the best classification accuracies achieved by all three models with GridSearchCV [22] for the microarray BC dataset, GSE_integrate. The best result is obtained by SVM with 97.78% accuracy, 99% f1-score, 97% sensitivity and 100% specificity. This is followed by RF and ET, both obtaining 93.33% accuracy. However, the accuracy obtained is below 100%. This is because the dataset does not have equal class ratios, which is known as an imbalanced dataset. Although the dataset is binary, with only two possible classes (zero for normal and one for relapse), the imbalance makes it more challenging to train and predict. The lower sensitivity and higher specificity confirm the problem of imbalanced data. All three algorithms achieve 100% specificity, which indicates that all negative (normal) samples are correctly classified. This is attributed to the much smaller number of normal samples in GSE_integrate. Figure 3 shows the area under curve (AUC) results of the three models, RF, ET, and SVM. The AUC values are high, suggesting no overfitting. Tuning the parameters takes around 1-2 minutes overall. In future work, a feature selection method will be used to choose relevant features that contribute to the characteristics of the BC microarray dataset [23]-[25].

Figure 3. The receiver operating characteristic (ROC) and AUC curves and values of classifiers
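
As background for the AUC values, the sketch below computes the AUC as the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one. This is an illustrative plain-Python version, not the routine used to produce Figure 3; the function name, labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of positive/negative pairs ranked correctly."""
    positives = [s for label, s in zip(labels, scores) if label == 1]
    negatives = [s for label, s in zip(labels, scores) if label == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))

# Hypothetical labels and classifier scores, for illustration only.
labels = [1, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.3, 0.4, 0.1]
auc = roc_auc(labels, scores)
```

In this example five of the six positive/negative pairs are ranked correctly, so the AUC is 5/6. Because the measure is based on ranking, it does not depend on a particular decision threshold.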

CONCLUSION
Microarray data has thousands of features. The features are informative in the diagnosis and prognosis of diseases, including breast cancer. Machine learning algorithms are suitable for the analysis of this type of data because they offer an automated and faster system. Thus, this study applies RF, ET and SVM with simple parameter tuning based on GridSearchCV. The performance of the machine learning methods is compared using several performance metrics: accuracy, f1-score, sensitivity, and specificity. The data used, GSE_integrate, has 182 cancer/relapse samples and 43 normal samples. The results show that SVM is the best model compared to RF and ET. In the future, the use of a feature selection method to select relevant features that contribute to the characteristics of the BC microarray dataset is to be investigated. Additionally, data balancing techniques are to be incorporated to tackle the problems observed due to imbalanced data.

ISSN: 2302-9285