Simulated annealing for SVM parameters optimization in student’s performance prediction

Received Sep 29, 2020; Revised Feb 17, 2021; Accepted Apr 8, 2021

Higher education is an important and critical part of education all over the world. In the last year, the world has turned increasingly to online education due to the outbreak of the COVID-19 pandemic; improving this education system has therefore become an urgent matter. Online learning systems are a primary environment for acquiring educational data, which can come from different sources, especially academic institutions. These data can be analyzed to extract usable information that helps in understanding university students' performance and identifying the factors that affect it. To extract meaningful information from these large volumes of data, academic organizations must mine the data with high accuracy. In this work, three different real datasets were selected, pre-processed, cleaned, and filtered; a support vector machine (SVM) with a multilayer perceptron kernel (MLP kernel) was then applied, and its parameters were optimized using the simulated annealing (SA) algorithm to improve the objective function value. While examining the search space, SA has the advantage of escaping from local minima, since it can accept a worse neighbor as a solution in a controlled manner. The results show that the designed system can determine the best SVM parameters using SA and therefore yields better model evaluation results.


INTRODUCTION
COVID-19 was declared a pandemic by the World Health Organization (WHO) in the past year. This pandemic disrupted education across the globe, as nationwide closures forced institutions to shut down temporarily. It is estimated that the closures affected about 70% of the total student population worldwide. Data mining algorithms are widely used to discover hidden patterns in data and thereby help decision-makers; data mining has become an efficient tool for uncovering information in big data. Like business organizations [1], universities today operate in a highly dynamic and strongly competitive environment [2], and education is no longer limited to classroom teaching: it extends to other forms such as online education systems, web-based education, seminars, project-based learning, and workshops. Data mining is very important in educational systems, as shown in Figure 1, but none of these systems can succeed without accurate evaluation. A successful education system therefore requires a well-defined and accurate evaluation system, and predicting students' performance with high accuracy is very helpful for identifying students with low performance levels from the beginning [1]. Data mining and machine learning use the same methods, but with a difference: machine learning focuses on prediction based on known properties, whereas data mining focuses on the identification of unknown properties. The support vector machine (SVM) is a machine learning technique that builds a linear binary classifier; it defines the decision boundary between two classes [2].
Optimization is the process of finding the best solution to a problem. There are many optimization algorithms, such as the standard SPSA algorithm [3], [4], which is used for optimizing systems with multiple unknown parameters; gradient descent, which is used for finding a local minimum of a differentiable function [5]; and simulated annealing (SA), which is used for approximating the global optimum of a given function in a large search space [6]. SA was therefore chosen for this work.
SA is a popular optimization algorithm inspired by the annealing of molten metals (slow cooling after heating) to crystallize their structures [7]. It was invented in 1983 by Kirkpatrick et al., and they, along with other researchers, proved analytically that SA can escape from local optima and converge to the global optimum. Several researchers have studied student data over the past decade to predict student performance, using data mining approaches and correlation analysis; each of these approaches has achieved a different level of success. V. K. Pal and Vimal Kamlesh Kumar Bhatt [8] applied an artificial neural network (deep learning) to the first dataset after splitting the data into two subsets: a training set containing 70% of the original data and a test set containing the remaining 30%. The resulting test-set accuracy was 97.749%, with a corresponding error rate of 2.251%. Y. K. Salal, S. M. Abdullaev, and Mukesh Kumar [9] also built classification models for the same dataset, implementing NaiveBayes (accuracy 73.1895%), decision tree J48 (76.57%), RandomTree (67.95%), REPTree (76.73%), JRip (74.11%), OneR (76.73%), SimpleLogistic (73.65%), and ZeroR (30.97%). After running these algorithms on the student performance dataset, they compared the results to find the best model for the prediction process.
Another study proposed a classification method based on the meta-heuristic PSO algorithm to predict students' final outcomes according to their activities, and the results improved by 89% [10]. D. Kabakchieva [11] applied four different classifiers: OneR rule learner, neural network, decision tree, and k-nearest neighbour (k-NN). The neural network achieved the highest classification accuracy (73.9%), followed by the decision tree (72.74%) and the k-NN model (70.49%). S. Hussain, Neama Abdulaziz Dahan, Fadl Mutaher Ba-Alwi, and Najoua Ribata [12] used classification algorithms in WEKA and applied feature selection to select 12 of 33 attributes for predicting student performance.
In this article, the SA optimization technique is used to tune the SVM parameters; it helps improve the value of the objective function (classification accuracy) by avoiding local minima. We also present a comparative study of academic student performance prediction, in which the algorithms are compared by accuracy to identify the most appropriate model for this task. Comparing our results in section 7 with previously published works clearly shows that our proposed SVM-SA gives better results for accuracy, precision, sensitivity, and F-measure, which improves student academic performance prediction for decision-makers. This paper is organized as follows: sections 2 and 3 explain the SVM classifier and the SA algorithm; section 4 describes the proposed SVM-SA model; section 5 describes the data used; section 6 presents the evaluation measures; and sections 7 and 8 give the results and conclusions.

SUPPORT VECTOR MACHINE
SVM was originally developed in 1995 using the structural risk minimization principle and Vapnik-Chervonenkis theory. It is a supervised machine learning technique that is used for both classification and regression problems. SVMs are more commonly used in classification problems because they have high performance and generalization capability.
SVMs are based mainly on the idea of finding the best hyperplane, the one that maximizes the margin (the distance to the nearest points) between the nearest positive and negative data points [13]. For linearly separable data this hyperplane is the class boundary, and a larger margin gives new data a greater chance of being classified correctly [14]. Assume the training data is the set $\{(x_i, y_i)\}$, $i = 1, 2, \dots, N$, where $x_i \in \mathbb{R}^n$ represents the i-th candidate vector and $y_i \in \{-1, +1\}$ represents the output label corresponding to the class of item $x_i$. The original formulation of the SVM algorithm seeks a linear decision surface of the form $f(x) = w \cdot x + b$, where $w$ is an n-dimensional coefficient vector and $b$ is the offset [15]. The linear SVM achieves an optimal hyperplane by solving the following optimization problem:

$$\min_{w,\,b} \; \frac{1}{2}\|w\|^2 \quad \text{subject to} \quad y_i (w \cdot x_i + b) \ge 1, \quad i = 1, \dots, N \tag{1}$$

This quadratic optimization problem can be solved by finding the saddle point of the Lagrangian function:

$$L(w, b, \alpha) = \frac{1}{2}\|w\|^2 - \sum_{i=1}^{N} \alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] \tag{2}$$

where the $\alpha_i$ are the Lagrange variables. The KKT conditions for the saddle point of (2) are obtained by setting the gradient of the Lagrangian with respect to the primal variables $w$ and $b$ to zero and by writing the complementary conditions [16]:

$$\frac{\partial L}{\partial w} = 0 \;\Rightarrow\; w = \sum_{i=1}^{N} \alpha_i y_i x_i \tag{3}$$

$$\frac{\partial L}{\partial b} = 0 \;\Rightarrow\; \sum_{i=1}^{N} \alpha_i y_i = 0 \tag{4}$$

$$\alpha_i \ge 0, \quad i = 1, \dots, N \tag{5}$$

$$\alpha_i \left[ y_i (w \cdot x_i + b) - 1 \right] = 0, \quad i = 1, \dots, N \tag{6}$$

By (3), the weight vector $w$ that solves the SVM problem is a linear combination of the training vectors $x_1, \dots, x_N$. According to the complementary conditions (6), $w$ depends only on the vectors $x_i$ for which $\alpha_i \neq 0$. These are called support vectors, and they fully define the maximum-margin hyperplane. Substituting (3) and (4) into (2) gives the dual form of the Lagrangian, $L_D(\alpha)$:

$$L_D(\alpha) = \sum_{i=1}^{N} \alpha_i - \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{N} \alpha_i \alpha_j y_i y_j \, (x_i \cdot x_j)$$

Equations (7), (8), and (9) present the polynomial kernel, the sigmoid kernel, and the radial basis function kernel, respectively. These functions are used to find the optimal hyperplane. In this proposal we used the sigmoid (MLP) kernel. An MLP is a feedforward ANN with three layers of neurons in which every neuron, except those in the input layer, uses a nonlinear activation function, and which is trained with backpropagation [17]. The weight and bias are among the setting parameters of the multilayer perceptron (MLP) [18].

Polynomial kernel:

$$K(x_i, x_j) = (1 + x_i \cdot x_j)^d \tag{7}$$

where $d$ is the polynomial degree.

Sigmoid kernel:

$$K(x_i, x_j) = \tanh(\kappa \, x_i \cdot x_j + c) \tag{8}$$

where $\kappa$ is the slope and $c$ is the intercept constant.

Radial basis function (RBF) kernel:

$$K(x_i, x_j) = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) \tag{9}$$

where $\sigma$ is the kernel parameter.

There are two problems in the SVM classifier's optimization procedure [13]: 1) how to select relevant features and filter out irrelevant features to construct the SVM classifier; 2) how to properly adjust the penalty parameter C and the hyperplane parameters [19]. SVM parameters such as the kernel parameters and the penalty parameter have a great influence on the accuracy and complexity of classification models. Numerous evolutionary optimization algorithms have been proposed for optimizing SVMs; in this paper, SA is proposed as the optimization algorithm. It follows a search strategy that improves the value of the objective function in order to find the parameter settings that most enhance the performance of the SVM classifier.
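As a minimal sketch, the three kernels named in this section can be written in plain Python as follows. The standard textbook forms are assumed; the function names and the parameter names `degree`, `slope`, `intercept`, and `sigma` are illustrative choices, not identifiers from the original work.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def polynomial_kernel(u, v, degree=2):
    """Polynomial kernel (1 + u.v)^degree."""
    return (1.0 + dot(u, v)) ** degree


def sigmoid_kernel(u, v, slope=1.0, intercept=-1.0):
    """Sigmoid (MLP) kernel tanh(slope * u.v + intercept)."""
    return math.tanh(slope * dot(u, v) + intercept)


def rbf_kernel(u, v, sigma=1.0):
    """RBF kernel exp(-||u - v||^2 / (2 * sigma^2))."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


# Two small example vectors (made up for illustration).
x1, x2 = [1.0, 2.0], [0.5, -1.0]
print(polynomial_kernel(x1, x2))
print(sigmoid_kernel(x1, x2))
print(rbf_kernel(x1, x2))
```

The sigmoid kernel is the only one of the three whose two parameters (slope and intercept) are searched by SA in this work; the other two are shown only for comparison.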

SIMULATED ANNEALING ALGORITHM
The SA algorithm is a local search method invented to avoid local minima [7], [20]. SA's major advantage over older optimization methods is its ability to escape local minima. The method is based mainly on selecting a move at random at each stage, instead of selecting the best move (best neighbor) among the available moves. If the new state improves (reduces) the cost, it is accepted as the next state; if it increases the cost, it is accepted only with a probability P. P is called the Metropolis probability and is defined as:

$$P = \exp\left(-\frac{\Delta E}{T}\right) \tag{10}$$

where $\Delta E$ represents the change in energy (the value of the cost function) caused by the change in state, and $T$ is the temperature, or a temperature-like variable, that controls this probability. A "generative function" denotes the way the variables are updated in each attempt; it is the function that determines the speed of convergence. In typical SA, the generative function is a Gaussian or Boltzmann function:

$$g(\Delta X) = (2\pi T)^{-D/2} \exp\left(-\frac{\|\Delta X\|^2}{2T}\right) \tag{11}$$

where $D$ is the dimension of the search space (the number of variables in the cost function) and $\Delta X$ is the change in $X$ (the vector of variables), so that $X = X_0 + \Delta X$, where $X_0$ is the current state and $X$ is the next state of the variables. The temperature in the k-th stage of the algorithm can be found using (12). The steps of the SA algorithm [21] are shown in the accompanying figure.
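The acceptance rule and neighbor generation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: the Metropolis probability is taken as exp(-ΔE/T), neighbors are drawn from a Gaussian whose variance equals the temperature, and the logarithmic schedule T_k = T_0 / ln(k + 1) is an assumed cooling rule. The toy cost function and all names are hypothetical.

```python
import math
import random


def metropolis_probability(delta_e, temperature):
    """Probability of accepting a move that changes the cost by delta_e."""
    if delta_e <= 0:
        return 1.0  # moves that do not increase the cost are always accepted
    return math.exp(-delta_e / temperature)


def simulated_annealing(cost, x0, t0=10.0, steps=2000, seed=0):
    """Minimize cost(x) starting from x0; returns the best state seen."""
    rng = random.Random(seed)
    current, current_cost = list(x0), cost(x0)
    best, best_cost = list(current), current_cost
    for k in range(1, steps + 1):
        # Assumed logarithmic cooling schedule (not taken from the paper).
        temperature = t0 / math.log(k + 1)
        # Gaussian generating function with variance equal to the temperature.
        step_std = math.sqrt(temperature)
        candidate = [xi + rng.gauss(0.0, step_std) for xi in current]
        candidate_cost = cost(candidate)
        delta_e = candidate_cost - current_cost
        if rng.random() < metropolis_probability(delta_e, temperature):
            current, current_cost = candidate, candidate_cost
            if current_cost < best_cost:
                best, best_cost = list(current), current_cost
    return best, best_cost


def toy_cost(x):
    """Hypothetical one-dimensional cost with several local minima."""
    return x[0] ** 2 + 10.0 * math.sin(3.0 * x[0])


best, best_cost = simulated_annealing(toy_cost, [4.0])
print(best, best_cost)
```

Keeping track of the best state separately from the current state means that accepting a worse neighbor never loses a good solution that was already found.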

RESEARCH METHOD
The penalty and kernel parameters of an SVM have a great impact on the accuracy and complexity of the classification model. This paper proposes a novel evolutionary approach for the SVM model: an SVM with an MLP kernel whose parameters are optimized by SA. The parameters are expressed as P1 for the slope (weight) and P2 for the intercept constant (bias), where P1 > 0 and P2 < 0. By default, these values are set to 1 and -1; optimizing them can decrease the classification error. In this section, we describe the proposed SA-SVM model, shown in Figure 3, for finding the optimal values of the SVM parameters. The main steps in the SA algorithm are: 1) generating a neighbor; 2) evaluating the objective function (classification accuracy); 3) assigning an initial temperature; 4) changing the temperature; 5) applying the cooling scheme; and 6) stopping [21]. The initial solution is one of the important components of SA; in this paper it is selected randomly from the feasible solution space. The initial solution in our algorithm is represented by a two-element vector P, as in (13), where P1 is assigned to the weight and P2 to the bias.
$$P = [P_1 \;\; P_2] \tag{13}$$

[P1 P2] is the vector that specifies the parameters of the MLP kernel. The MLP kernel takes the form:

$$K(x_i, x_j) = \tanh(P_1 \, x_i \cdot x_j + P_2) \tag{14}$$

A feasible solution is randomly selected as the initial solution. The objective function is an important factor in SA, which relies on it to evaluate individual solutions. We formulated the objective function to depend mainly on the classification accuracy of the SVM represented by the given solution: the cost of a solution is how accurately the training data is classified when classification is conducted using the parameters that the solution represents. The cost Z(P) for a solution P is calculated over the training dataset (of size N) using (15).

$$Z(P) = \frac{\text{number of correctly classified instances}}{N} \tag{15}$$
The initial temperature is also a highly important parameter, because it strongly affects the chance of accepting a bad solution. If the initial temperature is high, a solution with a bad objective function value has a high chance of being accepted, whereas a low initial temperature increases the probability that the search ends in a local optimum. In this work, the initial temperature is chosen in the range from 0 to 500.
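The search over the two kernel parameters can be sketched in Python as follows. This is a hedged illustration, not the authors' code: `toy_accuracy` is a hypothetical stand-in for training the SVM and computing Z(P), the geometric cooling factor and Gaussian step size are assumptions, and the initial temperature of 100 is simply one value from the stated 0 to 500 range. Because accuracy is maximized, the energy change is taken as the drop in accuracy.

```python
import math
import random


def anneal_kernel_params(accuracy, t0=100.0, cooling=0.95, steps=300, seed=1):
    """Search P = [P1, P2] with P1 > 0 and P2 < 0 to maximize accuracy(P)."""
    rng = random.Random(seed)
    # Random feasible initial solution (bounds are illustrative assumptions).
    current = [rng.uniform(0.01, 5.0), rng.uniform(-5.0, -0.01)]
    current_z = accuracy(current)
    best, best_z = list(current), current_z
    temperature = t0
    for _ in range(steps):
        candidate = [current[0] + rng.gauss(0.0, 0.5),
                     current[1] + rng.gauss(0.0, 0.5)]
        if candidate[0] <= 0 or candidate[1] >= 0:
            continue  # skip neighbors that violate P1 > 0 or P2 < 0
        candidate_z = accuracy(candidate)
        delta_e = current_z - candidate_z  # positive when the neighbor is worse
        if delta_e <= 0 or rng.random() < math.exp(-delta_e / temperature):
            current, current_z = candidate, candidate_z
        if current_z > best_z:
            best, best_z = list(current), current_z
        temperature *= cooling  # assumed geometric cooling scheme
    return best, best_z


def toy_accuracy(p):
    """Hypothetical stand-in for Z(P); peaks at 0.9 near P = [2.0, -1.5]."""
    return 0.9 * math.exp(-((p[0] - 2.0) ** 2 + (p[1] + 1.5) ** 2))


best_p, best_z = anneal_kernel_params(toy_accuracy)
print(best_p, best_z)
```

In a real run, `accuracy` would train the SVM with the MLP kernel on the training split using the candidate [P1, P2] and return the fraction of correctly classified instances.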

DATA PREPROCESSING
The input and output data were pre-processed by cleaning the missing values, converting the nominal data to numerical data, converting the target to a binary class (0 means fail, 1 means success), and splitting each dataset into training and testing parts (ratio of 70%:30%), without any feature selection for any dataset. We did not use cross-validation, so that the comparison is fair: the papers used in the comparison used the same ratio of training and testing data.
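The preprocessing steps above can be sketched in plain Python as follows. The record layout, the attribute names (`school`, `absences`, `G3`), the pass mark of 10, and the integer coding of nominal values are all assumptions made for illustration; the source does not state how missing values were cleaned or which threshold defines success.

```python
import random


def preprocess(rows, target, pass_mark, train_ratio=0.7, seed=42):
    """Clean, encode, binarize, and split a list of dict records."""
    # 1) Drop records that contain a missing value (one possible cleaning rule).
    clean = [r for r in rows if all(v not in (None, "") for v in r.values())]

    # 2) Map each nominal (string) attribute to integer codes.
    encoders = {}
    encoded = []
    for record in clean:
        row = {}
        for key, value in record.items():
            if isinstance(value, str):
                codes = encoders.setdefault(key, {})
                row[key] = codes.setdefault(value, len(codes))
            else:
                row[key] = value
        encoded.append(row)

    # 3) Binarize the target: 1 means success, 0 means fail.
    data = []
    for row in encoded:
        label = 1 if row[target] >= pass_mark else 0
        features = [row[k] for k in sorted(row) if k != target]
        data.append((features, label))

    # 4) Shuffle and split 70% / 30% into training and testing parts.
    random.Random(seed).shuffle(data)
    cut = round(train_ratio * len(data))
    return data[:cut], data[cut:]


# Ten complete made-up records plus one with a missing value.
records = [{"school": "GP" if i % 2 else "MS", "absences": i, "G3": 5 + i}
           for i in range(10)]
records.append({"school": "GP", "absences": None, "G3": 12})

train, test = preprocess(records, target="G3", pass_mark=10)
print(len(train), len(test))
```

A fixed seed keeps the split reproducible, which matters when results are compared against other work that used the same 70%:30% ratio.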

DATA DESCRIPTION
The first dataset consists of 649 instances with 33 attributes. This student performance dataset was collected from two Portuguese secondary schools, Gabriel Pereira (GP) and Mousinho da Silveira (MS). It contains student attributes such as academic grades, social attributes, demographic attributes, and school-related attributes. The data was collected from the students using school reports and questionnaires. Details of the dataset are shown in Table 1 [9], [22]. The second dataset is from three colleges in Assam, India: Duliajan College, Doomdooma College, and Digboi College. Initially, data for twenty-two attributes was collected [12]. The third dataset is from the Common Entrance Examination (CEE) conducted by Dibrugarh University; the collected data, with 12 attributes, covers students who came for counseling cum admission into the medical colleges of Assam in 2013 [23]. The three datasets are imbalanced, owing to the low repetition rate among students at the places where the data was compiled, as shown in Table 2.

MODEL EVALUATION
Classification accuracy, the fraction of all predictions that are correct, is the first thing one looks at when a model is built for a classification problem. However, accuracy alone is not sufficient to evaluate a model, especially when classifying imbalanced data. Therefore, we also considered other measures: sensitivity, precision, and F-measure [24]. The equations used in model evaluation are listed in Table 3, where TP denotes true positives, TN true negatives, FP false positives, and FN false negatives [25].
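As a small illustration of these measures, the sketch below computes them from the four confusion-matrix counts using the standard definitions. Table 3 is not reproduced here, so the formulas are the usual textbook ones, and the label vectors are made up for the example.

```python
def evaluate(y_true, y_pred):
    """Return accuracy, sensitivity, precision, and F-measure for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)

    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0  # also called recall
    precision = tp / (tp + fp) if tp + fp else 0.0
    denominator = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denominator if denominator else 0.0
    return {"accuracy": accuracy, "sensitivity": sensitivity,
            "precision": precision, "f_measure": f_measure}


# Made-up labels: 1 means success, 0 means fail.
y_true = [1, 1, 1, 1, 0, 0, 1, 1]
y_pred = [1, 1, 0, 1, 0, 1, 1, 1]
scores = evaluate(y_true, y_pred)
print(scores)
```

In this example the accuracy is 0.75, while sensitivity, precision, and F-measure are all 5/6, which shows how the measures can diverge once the classes are unevenly represented.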

RESULTS
The platform used to develop the SA-SVM algorithm is a laptop with the following features: Intel(R) Core(TM) i7-4600 CPU @ 2.10 GHz, 8 GB RAM, and Windows 10 Pro as the operating system. For the datasets described earlier (see Table 1), each dataset was split into training and testing parts with a ratio of 70% and 30%, respectively, and the results are shown in Table 4, Table 5, and Table 6. From Tables 4 to 8, it is clear that our proposed method, without feature selection, performs competitively with all the other classification algorithms in terms of prediction accuracy. On the Portuguese course dataset, SVM with the MLP kernel reached 78.35% accuracy; this increased to 90.72% after applying the proposed method, along with sensitivity and F-measure, as shown in Table 5. This is better than the accuracy of the other classifiers presented in Table 7. On the CEE dataset, SVM with the MLP kernel reached 61.8% accuracy; this increased to 69.34% after applying the proposed method, along with sensitivity and F-measure, as shown in Table 4. This is better than the accuracy of NaiveBayes, decision tree (J48), ZeroR, REPTree, OneR, RandomTree, JRip, and SimpleLogistic, as shown in Table 8.
On the Sapfile dataset, SVM with the MLP kernel reached 61.1% accuracy; this increased to 67.77% after applying the proposed method, along with sensitivity and F-measure, as shown in Table 6. This is better than the BayesNet accuracy shown in Table 9. In addition, the large improvement in F-measure on the three datasets, which reached 13%, strongly supports the efficiency of the proposed SVM-SA model in dealing with class imbalance in the data. All of the foregoing confirms the effectiveness of the proposed method.

CONCLUSION
Machine learning techniques applied to educational data can be used to improve the learning process of students in higher education institutes. Researchers have developed different methods to predict students' performance in their enrolled courses, providing valuable information that helps improve student retention in those courses. Instructors can use this information to identify early the students who might need assistance in their studies. In our work, SVM was applied to three different real datasets; a hybridization of SVM with the MLP kernel and SA was then used to enhance the results, which were finally compared with the results of other algorithms. The results showed that the proposed method improved after applying the SA optimization technique and achieves higher performance than the other methods.