A genetic algorithm for prediction of RNA-seq malaria vector gene expression data classification using SVM kernels

Received Oct 13, 2020 Revised Jan 23, 2021 Accepted Feb 14, 2021

Malaria parasites pass through variable and unpredictable life stages as they spread through the tissues of their mosquito vectors. There are transcriptomes of a thousand distinct species. Ribonucleic acid sequencing (RNA-seq) is a widely used gene expression profiling technique that improves the detection of genetic signals. RNA-seq measures gene expression as transcript data and has benefited from methodological advances in machine learning. Researchers have proposed many learning approaches for the study of biological data. In this study, an enhanced, optimized genetic algorithm (GA) feature selection technique is used to obtain relevant information from a high-dimensional Anopheles gambiae dataset, and the selected features are classified using SVM kernel algorithms. The efficacy of the approach is tested, and the experiment obtained accuracies of 93% and 96%, respectively.


INTRODUCTION
Next-generation high-throughput sequencing technology has produced large, wide-ranging data sets. This vast body of data helps biologists analyze challenging gene transcripts, such as disease-associated RNA related to malaria, cancer, and inherited, genomic, and physiological conditions, among others [1]. Blood-sucking Anopheles mosquitoes, the main vectors of Plasmodium falciparum malaria, are found mostly in Africa. The Anopheles mosquito transmits the lethal malaria parasite, which is responsible for thousands of deaths. In the fight against the spread of antimalarial drug resistance, and even as state-of-the-art antimalarial care improves, the search for novel drugs requires a deeper knowledge of these species. Why the parasite carried by the Anopheles mosquito maintains precise gene expression patterns remains a major open question, and answering it requires an improved, thorough predictive model of its transcription [2,3].
In RNA-seq analysis, genetic questions are made tractable by careful, purposeful experimental design and by improved sequencing of samples. Working with RNA-seq data involves overcoming the curse of dimensionality and related problems such as noise, errors, inconsistency, irrelevance, duplication, and ill-fitting data, among others [4]. Advances in this area have supported the development of innovative healthcare frameworks such as effective public health monitoring systems, advanced treatments, and improved medical diagnosis [5]. In the last decade, numerous machine learning methods with significant innovations have been developed to analyze the enormous volume of next-generation RNA sequencing gene expression data by learning biologically relevant patterns [6]. Many researchers have applied machine learning approaches to RNA-seq gene expression data with high performance [7,8]. However, the curse of dimensionality in high-dimensional data limits several traditional machine learning approaches, so an efficient approach is essential.
Gene expression data is important for diagnosis, disease prediction, and classification, but it poses challenges such as an overwhelming number of genes relative to the number of samples and the presence of irrelevant variation. Selecting the optimal genes from the data is required for better accuracy and performance. Previous work has shown the value of algorithms such as the genetic algorithm, whose strength lies in identifying a small subset of relevant genes to improve gene analysis [5]. This study proposes an enhanced, optimized genetic algorithm to address the high dimensionality of gene expression data. SVM kernel classifiers are then used to assess the selected genetic features, and the resulting classifications may be useful for predicting and discovering new genes involved in malaria infection.
The remainder of the study is organized as follows: Section 2 reviews the literature. Section 3 describes the research materials and methods. Section 4 presents the experimental results. Section 5 concludes the study.

LITERATURE REVIEW
Computational approaches are effective on large genomic datasets and can identify the genes responsible for diseases. Numerous methods are used to identify differentially expressed genes (DEG). Machine learning (ML) measures are important for identifying differences between genes obtained from the human genome. Many machine learning approaches compete in analyzing and classifying gene expression patterns from many diseases, and several machine learning studies have demonstrated the importance of interpreting gene expression data. Numerous findings in this area are discussed below, including recent work on the analysis of gene expression [5].
Oh et al. [9] proposed predicting autism spectrum disorder (ASD) using blood-based gene expression signatures and machine learning to identify the transcripts that are effective for classification. RNA data from the Gene Expression Omnibus database were analyzed with machine learning algorithms in R. Hierarchical cluster analysis discriminated ASD from controls fairly well. SVM and KNN classifiers, with validation applied, reached an overall class prediction accuracy of 93.8%. Ren et al. [10] presented a comprehensive review of clustering and classification for single-cell RNA-seq (scRNA-seq), covering the clustering and classification methods prevailing in recent years, linear and non-linear scRNA-seq dimensionality reduction approaches, and the integration of scRNA-seq data.
A supervised learning approach for ranking large sets of genes measured with RNA-seq was proposed in [11]. It used variable importance measures generated by a random forest classifier, together with extreme pseudo-samples defined using variational autoencoders and regression, to extract gene rankings from 12 RNA-seq cancer datasets containing about 1,200 samples. The results demonstrated the potential of supervised learning-based feature selection for RNA-seq studies and underlined the need for gene selection methods in gene expression analysis [11].
A supervised approach to RNA-seq data classification, scPred, was proposed in [12]. It combines unbiased feature selection from a reduced-dimensional space with a generalizable method for highly accurate classification of single cells. The authors applied scPred to data from mononuclear cells, pancreatic tissue, colorectal tumour biopsies, and circulating dendritic cells, showing that scPred is highly effective in classifying different cells. A machine learning RNA-DNA analysis was proposed in [13], focusing on low-expressed genes that may be jointly affected by PAH disease. The authors suggested novel feature selection and advanced methods for identifying a small set of highly informative genes for machine learning algorithms. The study revealed that clusters of genes with low expression distinguish different types of PAH in prediction and discrimination.
A CNN-based characterization of gene expression data for stomach cancer was proposed in [14]. The authors established a deep learning classification method for patients with stomach cancer to demonstrate its applicability to such data. PCA, heatmaps, and a CNN algorithm were used to analyze 60,000 genes from 300 patients. The researchers combined statistical analysis of clinical data with RNA-seq gene analysis and a CNN, and reported accuracies of 95.96% and 50.51%. The use of RNA-seq to uncover hidden transcripts in malaria parasites was proposed in [15]. The authors used RNA-seq to deconvolute transcriptional variation across approximately 500 individual rodent and human malaria parasites, and found distinct transcriptional signatures hidden within.
An ensemble machine learning algorithm based on the C4.5 decision tree was proposed in [16] to classify cancer gene expression data. The improved ensemble decision tree classifiers were evaluated as supervised cancer classification methods on seven publicly available cancer microarray datasets and performed better than single decision trees. An ensemble classification method for 1073 cancer gene expression data was proposed in [17]; combinatorial recursive feature elimination with the AdaBoost algorithm was used to select appropriate classification features, and improvements were reported.
Tarek et al. [18] focused on cancer classification from gene expression data. They proposed an ensemble classification approach that improves both the classification performance and the confidence in the results; ensemble results are less dependent on the peculiarities of a single training set. Duval and Hao [19] summarized recent advances in metaheuristic-based approaches, developed an embedded metaheuristic feature selection method for selecting genes and classifying RNA/DNA data, and highlighted the usefulness of integrating problem-specific knowledge into the search operators of such methods. They showed how the coefficients of a linear classifier such as SVM can be used profitably in an effective local search for gene selection and classification. Shukla et al. [20] focused on a genetic algorithm-based hybrid system, implementing a new hybrid feature selection algorithm that uses a filter-wrapper-based feature selection method to address the shortcomings of existing approaches. Five UCI biological datasets with varying numbers of instances and dimensions were used for the study. The findings demonstrate that the proposed method supports substantial feature reduction and beats the state-of-the-art, with a lowest classification accuracy of 40.04% and a highest precision of 99.32% using k-NN and SVM.
To improve tree-model classification when selecting features for an ensemble classifier, [21] employed an ensemble feature selection approach with random trees and a wrapper method. The proposed technique builds feature subsets through bagging, wrapper, and random tree methods. It removes unnecessary features and uses a probability weighting method to select the best features for classification. The proposed feature selection method was tested using SVM, RF, and NB classifiers, and its output was compared with other techniques. The procedure reaches a classification accuracy of 92%. Ching et al. [22] reported on several feature extraction methods for gene expression analysis, such as PCA, ICA, PLS, and LLE, and discussed the methods and the purpose of the associated software.

MATERIALS AND METHODS
Many methods have been suggested in the literature for the analysis of high-dimensional data. A genetic algorithm and an SVM classification algorithm are used in this study to substantially reduce the dimensionality of RNA-seq data. The dataset contains 2457 instances with seven gene attributes; the data come from western Kenya and comprise significant genes of resistant and susceptible mosquitoes [4]. A descriptive overview of the dataset is shown in Table 1, which lists the attributes of the sample genes and the instances of the feature samples.

Methods
MATLAB was used as the experimental tool to evaluate the data obtained from [19], and an optimized GA was used to select a relevant subset of features from the high-dimensional data. The selected features were classified using SVM classification [20]. Figure 1 shows the experimental workflow of this study: the high-dimensional data is passed to the genetic algorithm to find a relevant subset, and the reduced data is sent to the SVM classifier, whose performance is evaluated in terms of accuracy and other metrics. Genetic algorithms, which operate on binary-encoded parameters, are used to identify relevant features [21]. A chromosome represents the N features of the RNA-seq data as a binary string, with the values 1 and 0 marking a feature as selected or unselected. Based on the fitness of the features, GA searches for the optimal subset of features, with a bounded number of selected features, to improve classification performance. Algorithm 1 defines the general structure of the GA, adopted from [20]: M is the population size, r is a random number between 0 and 1, each gene of a chromosome is marked as selected or unselected by comparing r against the 0.5 threshold, and α is the maximum number of selected features. The key problem for this method is identifying the most relevant features in the given dataset. In this study, the genetic algorithm uses the 0.5 threshold with an optimized number of iterations, and mutation flips genes between 0 and 1.
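The GA described above can be illustrated with a minimal sketch. It is written in Python for illustration only (the study itself used MATLAB) and is not the authors' Algorithm 1. The operator choices (single-point crossover, elitism, truncation selection), the default parameter values, and the `toy_fitness` function are assumptions; in practice the fitness would be something like a classifier's validation accuracy on the selected features.

```python
import random


def enforce_max(chrom, alpha, rng):
    """Switch off random selected genes until at most alpha features remain."""
    selected = [i for i, g in enumerate(chrom) if g == 1]
    for i in rng.sample(selected, max(0, len(selected) - alpha)):
        chrom[i] = 0
    return chrom


def init_population(m, n_features, alpha, rng, threshold=0.5):
    """Create m chromosomes; a gene is 1 (selected) when r exceeds the threshold."""
    population = []
    for _ in range(m):
        chrom = [1 if rng.random() > threshold else 0 for _ in range(n_features)]
        population.append(enforce_max(chrom, alpha, rng))
    return population


def crossover(a, b, rng):
    """Single-point crossover of two parent chromosomes."""
    point = rng.randrange(1, len(a))
    return a[:point] + b[point:]


def mutate(chrom, rate, rng):
    """Flip each gene (0 to 1 or 1 to 0) with probability rate."""
    return [1 - g if rng.random() < rate else g for g in chrom]


def run_ga(fitness, n_features, alpha, m=20, generations=30, rate=0.05, seed=0):
    """Return the best binary feature mask found by the GA."""
    rng = random.Random(seed)
    population = init_population(m, n_features, alpha, rng)
    for _ in range(generations):
        ranked = sorted(population, key=fitness, reverse=True)
        parents = ranked[: m // 2]   # truncation selection (assumed)
        children = [ranked[0]]       # elitism: keep the best chromosome
        while len(children) < m:
            a, b = rng.sample(parents, 2)
            child = mutate(crossover(a, b, rng), rate, rng)
            children.append(enforce_max(child, alpha, rng))
        population = children
    return max(population, key=fitness)


# Hypothetical stand-in fitness: reward three "informative" features and
# penalise mask size. A real run would score a classifier instead.
INFORMATIVE = {0, 3, 7}


def toy_fitness(chrom):
    hits = sum(chrom[i] for i in INFORMATIVE)
    return hits - 0.1 * sum(chrom)


best_mask = run_ga(toy_fitness, n_features=10, alpha=5)
print(best_mask, toy_fitness(best_mask))
```

The returned mask plays the role of the selected feature subset: positions holding 1 are the features that would be passed on to the classifier.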

Support vector machine
Support vector machine (SVM) is a machine learning method that Vapnik introduced in 1992 [23]. SVM seeks the best hyperplane in the input space that separates the classes. SVM is fundamentally a linear classifier; it is extended to non-linear problems by incorporating kernels that map the data into a high-dimensional feature space. In that space, SVM searches for the optimal hyperplane that separates one class from the others [23]. The procedure for finding the best hyperplane using SVM, adopted from Aydadenta and Adiwijaya [23], is as follows:
i. Let y_i ∈ {y_1, y_2, …, y_n} be the samples, each described by p attributes, with a target class label in {+1, −1}.
ii. Assume the classes +1 and −1 are completely separated by a hyperplane, as defined in (1):

$$v \cdot y + c = 0 \tag{1}$$

From (1), we obtain (2) and (3):

$$v \cdot y + c \geq +1, \quad \text{for class } +1 \tag{2}$$

$$v \cdot y + c \leq -1, \quad \text{for class } -1 \tag{3}$$

− SVM-Gaussian kernel
The Gaussian kernel [24] corresponds to a general smoothness assumption on derivatives of all orders. Kernels that match a certain prior on the frequency content of the data can be constructed to reflect prior knowledge of the learning problem. Every input vector x is mapped to an infinite-dimensional vector containing all the polynomial expansions of the components of x [25][26][27].
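As a small illustration of the quantities in this section, the sketch below evaluates the linear decision value v.y + c and the two kernel functions in Python (used here for illustration only; the study itself used MATLAB). The Gaussian kernel is written in its common form exp(−γ‖x − y‖²); the parameter name `gamma`, its default value, and the example vectors are assumptions, not values from the study.

```python
import math


def linear_kernel(x, y):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def gaussian_kernel(x, y, gamma=0.5):
    """Gaussian (RBF) kernel exp(-gamma * ||x - y||^2); gamma is assumed."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def decision_value(v, y, c):
    """Signed score v.y + c; values >= +1 or <= -1 satisfy the margin conditions."""
    return linear_kernel(v, y) + c


# Hypothetical weights and sample, chosen only to exercise the functions.
v, c = [1.0, -1.0], 0.0
sample = [2.0, 0.0]
score = decision_value(v, sample, c)
label = 1 if score >= 0 else -1
print(score, label)
```

With a kernel SVM, the dot product inside the decision function is replaced by kernel evaluations against the support vectors, which is what allows a non-linear boundary in the original input space.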

Performance evaluation
Assessing the performance of a machine learning algorithm requires validation metrics. The confusion matrix is often used to derive four outcomes of a classification model (true positives, true negatives, false positives, and false negatives) from the data samples set aside to test the model [5]. Performance metrics are presented below with their formulas [26].
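The standard definitions of these metrics can be sketched in Python (for illustration only) as functions of the four confusion-matrix counts. The counts used in the example are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute common metrics from confusion-matrix counts."""
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn)   # also called recall
    specificity = tn / (tn + fp)
    precision = tp / (tp + fp)
    f_score = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "precision": precision,
        "f_score": f_score,
    }


# Hypothetical counts, for illustration only.
metrics = classification_metrics(tp=40, tn=50, fp=5, fn=5)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Sensitivity and recall are the same quantity, so the two names in the metric list refer to one value.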

Applications
Analysis of gene expression provides an improved route for interpreting RNA-seq results. Identifying specific genes is beneficial for applications such as personalized treatment, cancer detection, gene and drug development, tumour recognition, and the study of illnesses such as malaria and typhoid. Machine learning is well suited to discovering patterns and inconsistencies in data, and it provides tools that apply to diverse areas.
Because it simplifies program development for engineers, scientists, and academics, among others, matrix laboratory (MATLAB) is used for the experiments. MATLAB is a numerical computing environment and a multi-paradigm proprietary programming language developed by MathWorks. It supports matrix manipulation, plotting of functions and data, implementation of algorithms, creation of user interfaces, and interfacing with programs written in C, C++, C#, Fortran, Java, and Python [16]. The key idea of this analysis is to predict malaria infection using RNA-seq data in MATLAB. The computer configuration used to run the experiments is an iCore2 processor, a 64-bit system, 4 GB of RAM, and MATLAB 2015a.

RESULTS AND ANALYSIS
This research explores RNA-seq data on susceptible and resistant genes, comprising 2457 instances of Anopheles gambiae mosquitoes. An optimized genetic algorithm was applied to the data to reduce the burden of dimensionality. GA feature selection captures the optimal data subset and eliminates uncorrelated attributes, retaining the maximum variance with a smaller subset of features. In this analysis, the optimized GA is applied to the Anopheles mosquito data and yields important gene information that is valuable for further research. The SVM kernel classification algorithms are run in MATLAB. With a 0.5 threshold, 708 significant optimal gene features were selected using the optimized GA as the feature selection method.
SVM classification algorithms with 10-fold cross-validation and a 0.05 holdout parameter were used to evaluate the classification models; 75% of the data was used for training and 25% for testing to verify the classification accuracy. The classifier uses this evaluation procedure, training and testing on separate partitions, to eliminate sampling bias. The procedure is implemented in MATLAB. The reported results are based on computational time and performance metrics (accuracy, specificity, sensitivity, precision, F-score and recall) [26]. The L-SVM and RBF-SVM classifiers achieved accuracies of 93.3% and 95%, respectively; the other performance metrics are tabulated in Table 2. GA gathers the relevant features from the loaded data, the chosen features are passed to the SVM classifier, and the outcomes are shown in Figures 2 and 3. The classification experiments yielded the confusion matrices used to calculate the performance metrics, and Figures 2 and 3 show these matrices. A true positive is a positive instance that is correctly predicted as positive, and a true negative is a negative instance that is correctly predicted as negative. A false positive is a negative instance that is incorrectly predicted as positive, whereas a false negative is a positive instance that is incorrectly predicted as negative. The RNA-seq data are for the Anopheles gambiae mosquito [28].
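The two evaluation schemes mentioned here, a 75%/25% holdout split and 10-fold cross-validation, can be sketched with index bookkeeping alone. The Python code below is an illustrative sketch and not the MATLAB procedure used in the study; the function names, the shuffling seed, and the rounding of the test-set size are assumptions. The sample count of 2457 is the instance count reported for the dataset.

```python
import random


def holdout_split(n_samples, test_fraction=0.25, seed=0):
    """Shuffle sample indices and split them into train and test lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train, test) index lists; each sample is tested exactly once."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test


train_idx, test_idx = holdout_split(2457)
print(len(train_idx), len(test_idx))

fold_sizes = [len(test) for _, test in k_fold_indices(2457, k=10)]
print(fold_sizes)
```

In cross-validation, a model would be trained on each `train` list and scored on the matching `test` list, with the ten scores averaged into one estimate.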
Two thousand four hundred fifty-seven gene features were obtained, GA was used to reduce the dimensionality, and 708 features were chosen as a subset. These features were then classified using SVM to assess their predictive performance. The outcome demonstrates the effectiveness of the machine learning approach. The performance results are shown and compared in Table 2 to confirm the method. The analysis reveals that RBF-SVM outperforms L-SVM in terms of both training time and accuracy. Table 2 shows the comparative analysis of the classification results of the proposed experiment using two types of SVM kernels, the linear and the radial basis function SVMs. The output of the malaria vector data processed by the proposed model is evaluated and validated by clinicians. The results have to be validated against observations and against the automated machine learning-based approach in terms of the percentage of gene characterizations. A major cause of poor classification performance is overfitting or underfitting the data; this study trained and tested the reduced data with an approximate target function in order to improve the performance of the model by removing noise. Proper validation is essential before clinical use and testing, in order to provide more accurate services to clinicians and patients. Machine learning procedures have been shown to provide an accurate percentage of genes when compared with other methods and can be monitored for accurate prediction. This approach improves on traditional ways of determining observations and can provide a better assessment of malaria infection and transmission in humans.
This study will help clinicians in decision making for prediction, detection and the design of efficient drugs, as well as in finding better ways of eradicating malaria infections in Africa. This work is limited to malaria infections and their computational analysis for clinicians; it can be extended in future to other ailments and to other approaches. Table 3 shows the comparison of this study with other techniques in the literature.

Bulletin of Electr Eng & Inf ISSN: 2302-9285

Table 3. Comparative approaches
Methods                        Accuracy (%)
PSO+SVM [29]                   89
Mutual information+KNN [30]    95
GA+MLP [31]                    89
RF [32]                        94
Bayesian [33]                  91

CONCLUSION
This study can be useful in the prognosis and diagnosis of human malaria. The proposed solution uses machine learning methods for dimensionality reduction and classification. The dimensionality reduction approach follows the GA feature selection model, and the selected features are classified using SVM. This study carried out the performance analysis and assessment and presented the findings obtained with the SVM classification algorithms. It evaluated and enhanced the classification of malaria vector data. Multiple studies have proposed evaluations using performance metrics, and the findings show that dimensionality reduction with feature selection methods such as GA can boost the efficiency of classifiers such as SVM. It will be important to explore how recently proposed research can strengthen feature selection models and algorithms. Future work will use hybridized dimensionality reduction approaches. The genetic algorithm can be further optimized for better fitness iteration and integrated with other dimensionality reduction methods such as the ant colony optimizer, and other classifiers such as KNN can be introduced and compared to find better classification efficiency for the genes.