Bulletin of Electrical Engineering and Informatics

Received Jun 28, 2022 Revised Jul 15, 2022 Accepted Nov 27, 2022

Type 1 diabetes (T1D) is considered one of the most prevalent chronic diseases in the world; it causes a high level of glucose in the blood. Despite its seriousness, T1D can progress to an advanced stage without the patient noticing, which makes the disease difficult to control. Early prediction of the presence of this disease may significantly reduce its risks. There have been many attempts to overcome this disease, some pursuing biological solutions and others bioinformatic solutions. Several studies have used a single feature selection method with a machine learning (ML) model to predict or classify T1D. In this paper, ML techniques such as Naive Bayes (NB), support vector machine (SVM), and random forest (RF) were used for classification on a multiclass T1D gene expression dataset, in order to classify the genes associated with this disease. The proposed model can identify the genes related to T1D with high efficiency, which helps to predict whether a person carries the disease before symptoms appear. The highest accuracy, 89.1%, was obtained when applying SVM with chi-square as the feature selection method.


INTRODUCTION
High blood glucose is one of the most important features of diabetes; it is caused by a defect in insulin secretion, insulin action, or both. Its complications include an imbalance in the body's functions and the failure of a number of organs, and the disease becomes long-term [1]. Signs of high blood glucose include constant thirst, excessive urination, and excessive hunger [2]. Diabetes includes three types: type 1 diabetes (T1D), type 2 diabetes (T2D), and gestational diabetes [1]. Gene expression offers the possibility of diagnosing different diseases by using deoxyribonucleic acid (DNA) arrays, a powerful technology for acquiring gene expression data. The expression level is interpreted as the mixture of different messenger ribonucleic acid (RNA) molecules in the cell, and it can be used to identify suitable treatments, detect diseases early, and detect mutations [3]. Research has been conducted in many different sectors, including artificial intelligence (AI) and machine learning (ML), to improve the quality of care for patients with diabetes and reduce its effects. Several studies have described ML models used to predict and classify diabetes [4].
According to Deberneh and Kim [5], the gene expression dataset (GSE55098) for T1D was used to classify the immune cell-related genes that have a role in the occurrence and development of T1D and possibly contribute to finding an immunotherapy for it. Hub genes were investigated using least absolute shrinkage and

THE PROPOSED METHOD
The proposed system was implemented to classify the T1D gene expression dataset. Classification using machine learning models may help avoid major problems for patients with T1D. The system was executed in four main stages, which are explained in detail in the following subsections.

Preprocessing stage
In machine learning, the initial stage is to make sure that the dataset suits the requirements of the task. The pre-processing stage includes many steps, such as handling missing values and data cleaning, some of which are discussed below [12].

Missing value
High-dimensional gene expression datasets contain a huge number of features, some of which may have missing values [12]. There are several ways to handle missing values, such as estimating them, ignoring them, or removing the affected data objects [13].

Normalization
Normalization is one of the initial processing steps applied to the dataset before it is used; it increases or decreases the range of the values. It is useful in classification problems, where feature values are converted to a specific, small range such as 0 to 1. There are many ways to normalize, such as z-normalization and min-max normalization. Min-max normalization is a linear transformation technique used in pre-processing when it is important to preserve the relationships among the original data values. In addition, it is a simple technique that fits the dataset within predefined limits [14], [15]. Normalization is done according to (1):

$$x' = \frac{x - \min(x)}{\max(x) - \min(x)} \tag{1}$$

where $x'$ represents the normalized value of $x$.
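As an illustration of min-max normalization to the range 0 to 1, the following is a minimal sketch in plain Python. The function name `min_max_normalize`, the handling of constant features, and the sample values are assumptions made for this example and are not taken from the original implementation.

```python
def min_max_normalize(values, new_min=0.0, new_max=1.0):
    """Linearly rescale a list of numbers to the range [new_min, new_max]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant feature has no spread, so map it to new_min.
        return [new_min for _ in values]
    scale = (new_max - new_min) / (hi - lo)
    return [new_min + (v - lo) * scale for v in values]


# Made-up expression values for one gene across four samples.
expression = [2.0, 4.0, 6.0, 10.0]
print(min_max_normalize(expression))  # [0.0, 0.25, 0.5, 1.0]
```

Because the mapping is linear, the ordering of the values and their relative spacing are preserved.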

Ranking stage
Student's t-test, a parametric statistical test, was used for ranking genes. This test is used to compare the means of two sets of data; the test statistic is given in (2):

$$t = \frac{\tilde{x} - \mu}{s / \sqrt{n}} \tag{2}$$

where $\tilde{x}$ is the sample mean, $\mu$ is the expected population mean, $s$ is the estimated population standard deviation, and $n$ is the sample size [16], [17].
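The t statistic described above (sample mean, expected population mean, and an estimate of the population spread) can be sketched as follows. This is the one-sample form under the assumption that the spread is the sample standard deviation with an n - 1 denominator; the function name `one_sample_t` and the example values are hypothetical.

```python
import math
import statistics


def one_sample_t(sample, mu):
    """t = (sample mean - mu) / (s / sqrt(n)), with s the sample std. deviation."""
    n = len(sample)
    sample_mean = statistics.mean(sample)
    s = statistics.stdev(sample)  # uses the n - 1 denominator
    return (sample_mean - mu) / (s / math.sqrt(n))


# Made-up expression values of one gene, tested against an expected mean of 3.
t_value = one_sample_t([2.0, 4.0, 6.0, 8.0], mu=3.0)
print(round(t_value, 4))
```

A larger absolute value of t indicates a sample mean that lies further from the expected mean relative to the sampling noise.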

Feature selection stage
The feature selection stage is performed before the classification stage to reduce the data effectively. Feature selection methods are used as a pre-processing step to select the relevant features (genes). This stage helps to reduce the execution time and increase the classification accuracy [18].

Chi-square
Chi-square ($\chi^2$) is a non-parametric statistical method of data analysis. The chi-square value is calculated using (3), and the best set of features is selected accordingly:

$$\chi^2 = \sum_{i=1}^{k} \frac{(O_i - E_i)^2}{E_i} \tag{3}$$

where $O_i$ is the observed value, $E_i$ is the expected value, and $k$ is the number of classes [4].
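The chi-square statistic, the sum of squared differences between observed and expected counts divided by the expected counts, can be sketched for a contingency table as below. The table layout (rows as feature bins, columns as classes), the function name `chi_square`, and the counts are assumptions for illustration; expected counts are derived from the row and column totals, and all totals are assumed to be nonzero.

```python
def chi_square(observed):
    """Chi-square statistic for a contingency table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            # Expected count under independence of feature bin and class.
            e = row_totals[i] * col_totals[j] / total
            stat += (o - e) ** 2 / e
    return stat


# Made-up counts: rows = low/high expression bin, columns = two classes.
print(round(chi_square([[10, 20], [20, 10]]), 4))
```

A feature whose bins are distributed identically across the classes yields a statistic of zero, so higher values suggest a stronger association with the class label.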

Analysis of variance (ANOVA)
ANOVA, also known as the F statistic, is used for datasets that contain multiple categories to assess whether the mean values of the categories differ substantially [19]. ANOVA is a technique used to reduce the dimensions of datasets that contain huge numbers of features. As a result, the dataset can be expressed with the fewest variables [20].
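A minimal sketch of the one-way ANOVA F statistic for a single feature is shown below: the between-group mean square divided by the within-group mean square. The function name `anova_f` and the example groups are hypothetical, and the sketch assumes at least two groups with nonzero within-group variation.

```python
from statistics import mean


def anova_f(groups):
    """One-way ANOVA F statistic for one feature split into class groups."""
    k = len(groups)                      # number of classes
    n = sum(len(g) for g in groups)      # total number of samples
    grand_mean = sum(x for g in groups for x in g) / n
    group_means = [mean(g) for g in groups]
    ss_between = sum(len(g) * (m - grand_mean) ** 2
                     for g, m in zip(groups, group_means))
    ss_within = sum((x - m) ** 2
                    for g, m in zip(groups, group_means) for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))


# Made-up expression values of one gene in three classes.
print(anova_f([[1, 2, 3], [2, 3, 4], [5, 6, 7]]))
```

Genes with a large F value have class means that are far apart compared with the spread inside each class, which is why the statistic can serve as a feature score.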

Mutual information
Mutual information (MI) measures the amount of information that one random variable contains about another; equivalently, it is the reduction in uncertainty about one variable given knowledge of the other. To select features with MI, a subset of n features is chosen from the dataset X, which includes all N features, such that the MI between this subset and the class is highest [21].
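For discrete variables, MI can be computed directly from co-occurrence counts, as in the sketch below. It assumes that the feature has already been discretized (gene expression values are continuous, so some binning step would be needed first); the function name `mutual_information` and the toy data are hypothetical. The result is in nats because the natural logarithm is used.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """MI (in nats) between two equally long sequences of discrete values."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * log( p(x, y) / (p(x) * p(y)) ), written with raw counts.
        mi += (c / n) * math.log((c * n) / (count_x[x] * count_y[y]))
    return mi


# A binned feature that determines the class, and one that is unrelated to it.
labels = [0, 0, 1, 1]
print(round(mutual_information([0, 0, 1, 1], labels), 4))
print(round(mutual_information([0, 1, 0, 1], labels), 4))
```

A feature that fully determines a balanced binary class reaches log 2 nats, while a feature independent of the class scores zero.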

Principal component analysis (PCA)
PCA is a linear transformation technique used for reducing high-dimensional datasets. It creates principal components from the input features, converting correlated features into uncorrelated ones [22]. As a result, the reduced data contains fewer, uncorrelated features [20].
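To make the idea concrete, the sketch below finds the first principal component of a small dataset by power iteration on the sample covariance matrix. This is only one possible way to compute it and is an assumption of this example; the function name `first_principal_component` and the data are hypothetical, and the approach is practical only for small numbers of features. It also assumes the data is not constant, so the covariance matrix is nonzero.

```python
import math


def first_principal_component(rows, iterations=200):
    """Unit vector of largest variance, found by power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    # Sample covariance matrix (d x d).
    cov = [[sum(c[a] * c[b] for c in centered) / (n - 1) for b in range(d)]
           for a in range(d)]
    v = [1.0 / math.sqrt(d)] * d  # arbitrary starting direction
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Two perfectly correlated made-up features: the second is twice the first.
data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0], [4.0, 8.0]]
print([round(x, 4) for x in first_principal_component(data)])
```

Projecting each sample onto this direction would replace the two correlated features with a single component that keeps all of their variance.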

K-means clustering
K-means is a clustering algorithm in which each object is assigned to exactly one cluster. The quality of each cluster is measured using the value of an objective function. The k cluster centers are chosen first and the clusters are initially empty; each object is then assigned to the cluster whose center is closest, so that in the end k clusters are obtained [23].
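The assignment and update steps can be sketched as follows. This is a bare-bones version of the standard iterative procedure under a deliberately naive assumption: the first k points are used as the initial centers, which can lead to poor clusters on unfavorable data. The names `kmeans` and `squared_distance` and the points are hypothetical.

```python
def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, max_iter=100):
    """Return (labels, centers) after iterating assignment and update steps."""
    centers = [list(p) for p in points[:k]]  # naive initialization (assumption)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its closest center.
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centers[c]))
                      for p in points]
        if new_labels == labels:
            break  # assignments stopped changing
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers


points = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.8)]
labels, centers = kmeans(points, k=2)
print(labels)  # [0, 1, 0, 1]
```

The objective that this procedure reduces is the total squared distance between each point and the center of its cluster.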

Machine learning stage
Machine learning models can solve problems in many domains and facilitate dealing with data. Problems such as prediction or detection are solved using classification and regression models. ML models learn in a supervised or unsupervised manner, depending on the type of the problem [23].

Random forest
RF is a set of decision trees whose individual predictions are averaged to produce the final prediction [24]. RF has recently been considered a useful method in biological studies because of its simplicity and its flexibility in the presence of large numbers of variables. It also determines the role of each variable in the prediction. Notably, it provides high accuracy and interpretability [25].

Support vector machine
SVM is a supervised ML algorithm used for solving classification and regression problems. It is commonly used in many applications; it separates the points of the dataset with a hyperplane that serves as the classification decision boundary. The hyperplane is chosen so that the margin between it and the classes is maximal, and the data is then classified according to the side of the hyperplane on which it falls [2].
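The role of the hyperplane and the margin can be illustrated with a small sketch. It does not train an SVM; it only evaluates a given linear decision boundary w·x + b = 0. The weights, bias, points, and function names below are all hypothetical values chosen for this example.

```python
import math


def decision_value(w, b, x):
    """Signed score w.x + b; its sign gives the side of the hyperplane."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def predict(w, b, x):
    return 1 if decision_value(w, b, x) >= 0 else -1


def geometric_margin(w, b, X, y):
    """Smallest distance from any correctly labeled point to the hyperplane."""
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(label * decision_value(w, b, x) for x, label in zip(X, y)) / norm


# Made-up two-dimensional points with labels +1 and -1.
X = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
y = [1, 1, -1, -1]
w, b = (1.0, 1.0), -2.0  # hypothetical hyperplane x1 + x2 = 2

print([predict(w, b, x) for x in X])
print(round(geometric_margin(w, b, X, y), 4))
```

Training an SVM amounts to searching for the w and b that make this margin as large as possible while keeping the classes on opposite sides.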

Naïve Bayes
NB is a supervised ML method most commonly used for classification problems because of its accurate classification results. It is a probabilistic classifier: it estimates the probability of each class in the dataset. The classifier learns from the training data and then predicts, for each test record, the class with the highest posterior probability [26], [27].
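A compact sketch of a Gaussian NB classifier follows: each feature is modeled per class by a normal distribution, and the predicted class is the one with the highest log posterior. The function names, the small variance floor, and the two-class toy data are assumptions for illustration and do not reproduce the original experiments.

```python
import math
from collections import defaultdict


def fit_gaussian_nb(X, y, var_floor=1e-9):
    """Estimate log prior, per-feature means and variances for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # var_floor (assumption) avoids division by zero for constant features.
        variances = [sum((v - m) ** 2 for v in col) / n + var_floor
                     for col, m in zip(columns, means)]
        model[label] = (math.log(n / len(X)), means, variances)
    return model


def predict_gaussian_nb(model, row):
    """Return the class with the highest log posterior for one record."""
    def log_posterior(label):
        log_prior, means, variances = model[label]
        log_likelihood = sum(
            -0.5 * math.log(2 * math.pi * var) - (x - m) ** 2 / (2 * var)
            for x, m, var in zip(row, means, variances))
        return log_prior + log_likelihood
    return max(model, key=log_posterior)


# Made-up two-gene expression profiles with hypothetical class names.
X = [[1.0, 2.0], [1.2, 1.8], [5.0, 6.0], [5.2, 5.8]]
y = ["control", "control", "T1D", "T1D"]
model = fit_gaussian_nb(X, y)
print(predict_gaussian_nb(model, [1.1, 1.9]))  # control
print(predict_gaussian_nb(model, [5.1, 5.9]))  # T1D
```

Working with log probabilities keeps the product of many small per-gene likelihoods from underflowing, which matters when thousands of features are involved.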

METHOD
A gene expression dataset of T1D was used in this work. The proposed model, shown in Figure 1, includes five stages. The first stage is pre-processing, which consists of handling missing values and normalization. The second stage ranks the features using Student's t-test. Then, a subset of features is selected using the feature selection methods, and ML models are used to classify the dataset. Finally, the proposed system is evaluated using the accuracy metric.

The gene expression dataset can be accessed from the National Center for Biotechnology Information (NCBI) GEO database [28]. Microarray technology measures the expression of many genes at the same time. Gene expression data for T1D were used in the proposed system [29]. The values were obtained by applying the RMA algorithm using the Limma package in the R language.

Two datasets were used. The first dataset, GSE35725, was downloaded as a raw file; it has 114 samples, with sample IDs from GSM874033 to GSM874146. The second dataset, GSE52724, has 286 samples, with sample IDs from GSM1274585 to GSM1274870. The raw data was collected as described in [30]: only the longitudinal samples from auto-antibody-negative (AA-) high HLA risk siblings (60) and low HLA risk siblings (31) were taken from GSE52724 and concatenated with the unrelated healthy control plasma samples (44) and the recent-onset T1D plasma samples (46) from GSE35725, giving a new dataset of 181 samples and 54,675 genes.

Pre-processing
The data used in this work required two pre-processing steps to facilitate the implementation of the model. The first step was handling missing values: the raw data contained a set of genes with missing (NaN) values. Therefore, the mean calculated by (4) was used as the estimated value for each missing value; it is computed over the corresponding column, which contains the values of the same gene for all remaining samples:

$$\bar{X} = \frac{1}{N} \sum_{i=1}^{N} X_i \tag{4}$$

where $X_i$ represents a data value and $N$ represents the number of data values in the column [13]. The other step was min-max normalization, which was applied according to (1) and is illustrated in Figure 2. After these two steps, the resulting dataset has the same numbers of genes and samples as the original. Only the data values differ: the missing values have been filled in, and the data has been transformed to the required range.
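The mean-based handling of missing values described above can be sketched as follows, assuming a layout with samples as rows and genes as columns and with missing entries stored as None or NaN. The function names and the fallback of 0.0 for a column with no observed values are assumptions of this example.

```python
import math


def is_missing(value):
    return value is None or (isinstance(value, float) and math.isnan(value))


def impute_with_column_mean(rows):
    """Replace each missing entry by the mean of the observed values of its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [r[j] for r in rows if not is_missing(r[j])]
        # Assumption: a column with no observed values falls back to 0.0.
        means.append(sum(observed) / len(observed) if observed else 0.0)
    return [[means[j] if is_missing(r[j]) else r[j] for j in range(n_cols)]
            for r in rows]


# Three made-up samples (rows) measured on two genes (columns).
data = [[1.0, float("nan")], [3.0, 4.0], [None, 8.0]]
print(impute_with_column_mean(data))  # [[1.0, 6.0], [3.0, 4.0], [2.0, 8.0]]
```

The numbers of samples and genes are unchanged by this step; only the missing cells receive values.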

Ranking
Student's t-test was used for ranking genes, with the number of genes to keep across all samples determined in advance. The number of genes was reduced from 54,675 to 10,000, so the dataset resulting from the ranking contained 10,000 genes and 181 samples.
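Ranking genes by a t statistic and keeping a fixed number of them might look like the sketch below. The text does not state which variant of the test was applied or how the classes were grouped, so the two-group Welch statistic used here is an assumption, as are the function names, gene names, and values. The sketch also assumes nonzero variance in each group.

```python
import math
import statistics


def welch_t(a, b):
    """Two-sample t statistic that does not assume equal variances."""
    va, vb = statistics.variance(a), statistics.variance(b)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(
        va / len(a) + vb / len(b))


def top_k_genes(expression_by_gene, group_a, group_b, k):
    """Rank genes by |t| between two sample groups and keep the k best."""
    scores = {}
    for gene, values in expression_by_gene.items():
        a = [values[i] for i in group_a]
        b = [values[i] for i in group_b]
        scores[gene] = abs(welch_t(a, b))
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Hypothetical genes measured on six samples; samples 0-2 and 3-5 form two groups.
expression = {
    "gene_1": [1.0, 2.0, 3.0, 7.0, 8.0, 9.0],
    "gene_2": [1.0, 5.0, 3.0, 2.0, 4.0, 3.0],
    "gene_3": [1.0, 2.0, 3.0, 2.0, 3.0, 4.0],
}
print(top_k_genes(expression, [0, 1, 2], [3, 4, 5], k=2))  # ['gene_1', 'gene_3']
```

In the setting described above, k would be 10,000 and the input would hold all 54,675 genes.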

Feature selection
Gene expression data is high-dimensional because of the large number of genes. Feature selection methods make it possible to select the most important genes, those most closely related to the disease, and to discard the redundant ones. Therefore, these methods were used to select a subset of the dataset that includes the genes most strongly associated with T1D. The methods used in this paper and their results are as follows:
a. MI: genes with an MI value greater than 0.05 were selected; 7,542 of the 10,000 genes were selected as informative and important features.
b. Chi-square: the chi-square test was applied to the whole dataset of 10,000 genes, and 8,415 genes were selected according to a threshold of 0.5.
c. ANOVA: the number of features (genes) was reduced from 10,000 to 8,583 using ANOVA with a p-value of 0.05.
d. Principal component analysis (PCA): the dataset was entered as an array of 10,000 genes by 181 samples. After applying PCA, the dimensions were reduced to 181 features with 181 samples.

Machine learning models
ML models have been developed to process gene expression data and classify many different diseases. RF, SVM, and NB models were used as classification models for T1D. For RF, the dataset was divided into a training set (80%) and a test set (20%), and the classification model used 100 trees. With the data obtained from chi-square, the RF model achieved an accuracy of 86.4%. The second ML model was a linear-kernel SVM; when classifying the subset of data selected by chi-square, it achieved an accuracy of 89.1%. Finally, a Gaussian NB model was used, again with the dataset divided into an 80% training set and a 20% test set. All the results are summarized in Table 2.

The work was compared with related work in which different T1D and T2D datasets were used for classification and prediction with ML methods, based on the most influential and most relevant genes for the disease. Table 3 shows the evaluation of the ML models used in the related works.

Reference | Dataset | ML models | Result
 | GSE55098 for T1D | LASSO-SVM | AUC=0.918
[7] | Single-cell RNA-sequencing for T2D | Bayesian network, SVM, RF, LR and NN | ACC=0.907
[8] | lncRNA expression for T2D | KNN, SVM, LR and ANN | AUC=0.95
[9] | GSE38642 and GSE13760 for T2D | LR and SVM | ACC=90.23%
[10] | GSE164416 for T2D | SVM | Sensitivity=100%
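The 80%/20% division of the dataset and the accuracy metric mentioned in this section can be sketched in plain Python as follows. The shuffling seed, the rounding of the test set size, and the function names are assumptions of this example, so the exact set sizes may differ from those of the original experiments.

```python
import random


def split_train_test(X, y, test_fraction=0.2, seed=0):
    """Shuffle sample indices and hold out a fraction of them for testing."""
    indices = list(range(len(X)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(X) * test_fraction)  # rounding rule is an assumption
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    X_test = [X[i] for i in test_idx]
    y_test = [y[i] for i in test_idx]
    return X_train, X_test, y_train, y_test


def accuracy(y_true, y_pred):
    """Fraction of records whose predicted class equals the true class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Placeholder data with the same number of samples as the dataset (181).
X = [[float(i)] for i in range(181)]
y = [i % 4 for i in range(181)]
X_train, X_test, y_train, y_test = split_train_test(X, y)
print(len(X_train), len(X_test))  # 145 36
print(accuracy([1, 0, 1, 1], [1, 0, 0, 1]))  # 0.75
```

With only a few dozen test samples, a single misclassified record shifts the accuracy by several percentage points, which is worth keeping in mind when comparing models.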

CONCLUSION
To classify the genes affecting T1D, the gene expression datasets GSE52724 and GSE35713 were used. After applying the pre-processing methods, we concluded that this data is not suitable for direct use with ML techniques, as normalization must be applied to all data values, while it does not contain missing values. After implementing the feature selection methods, we concluded that many of the features are not related to T1D and are therefore not useful for classification. Consequently, a very large number of these features were discarded during the feature selection stage, and the classification was based only on the features related to T1D. The highest accuracy, 89.1%, was obtained with the SVM model; the accuracy may be improved by using other feature selection methods and other ML classification models.

Figure 2. Data min-max normalization (a) before normalization and (b) after normalization

Classification of gene expression dataset for type 1 diabetes using machine learning (Noor Ali Al Refaai)

e. K-means clustering: the k-means algorithm requires the number of clusters to be defined before it is applied to the data. Therefore, in this work the elbow method was used to estimate the appropriate number of clusters; it indicated that four clusters are appropriate, and the number of genes was reduced from 10,000 to 8,005. The result is passed either to another feature selection method or to a classification model. Table 1 illustrates the results of applying the feature selection methods to the 10,000 genes.

Table 1. Number of genes selected by each feature selection method

Table 2. Accuracy of the ML models for the output of each feature selection method

Table 3. Evaluation of ML models on different diabetes datasets