Machine learning prediction for academic misconduct prediction: an analysis of binary classification metrics

ABSTRACT


INTRODUCTION
Machine learning techniques have been utilized in the field of education for predicting academic misconduct [1]-[4]. Academic misconduct, usually referred to as academic dishonesty, is a global problem. It is defined as purposeful fraud [5] as well as a specific form of regulation violation in higher education institutions [6]. Plagiarism, exam or test cheating, unauthorized collaboration, and fabrication are a few examples. Recently, incidents of academic misconduct have become more prevalent due to the implementation of emergency remote teaching to curb the spread of COVID-19 [7], which in turn raises the need for automated machine learning in academic misconduct prediction studies to achieve more accurate outcomes. The literature documents various risk factors associated with the occurrence of academic misconduct, such as personality traits [8], individual and situational factors [9], [10], ethical orientation [11], religiosity [12], and fraud theory factors [13]. Predicting academic misconduct is challenging, but if detection can be done at an earlier stage, preventive measures can be taken more effectively.
In the education domain, machine learning techniques play a major role in predicting various academic problems such as student academic performance [14]-[16] and dropout [17]-[20]. Despite the importance of machine learning techniques in predicting academic misconduct more accurately, the literature shows very few studies in this area [1]-[4], as most prior studies employed traditional statistical methods to predict such unethical behavior [8]-[13]. Kamalov et al. [1] is one of the studies that uses a machine learning technique, a recurrent neural network (RNN), together with an outlier detection method to predict exam cheating. In particular, that study uses the RNN to identify unexpectedly high final exam scores for an average student; the anomalous grades are then the input to the outlier detection method, which identifies potential academic cheating. Overall, the findings show that the method significantly outperforms the benchmark method, achieving an average true positive rate (TPR) of 0.95 and a false positive rate (FPR) of 0.05 for the classification results. Further, Wray et al. [2] aim to predict the propensity for academic dishonesty using decision tree (DT) analysis. The findings show that DT analysis complements the traditional probit regression model in terms of predictive accuracy. In addition, the results suggest students' moral character as the most important factor in determining the propensity for academic dishonesty. In line with [1], [2], the study in [3] constructs a machine learning detector for academic cheating via the copying answers using multiple existences online (CAMEO) method. The prediction model is based on three categories of features, namely student features, problem features, and submission features. Using a Bayesian network, the model shows high performance, offering an area under the curve (AUC) close to 1 and a sensitivity and specificity of 0.96 and 0.99 respectively. The findings reveal that student features are more important than problem features and submission features. Tiong and Lee [4] employed four deep learning algorithms, namely deep neural network (DNN), DenseLSTM, long short-term memory (LSTM), and RNN, to develop prediction models of online exam cheating. Using two exam datasets (mid-term and final-term exams) of Pyeongtaek University in South Korea, the results revealed accuracies of 68% for the DNN, 92% for the LSTM, 95% for the DenseLSTM, and 86% for the RNN.
A review of the prior studies indicates that the performance of existing systems is comparatively limited. Hence, this study aims to add to the existing body of knowledge [1]-[4] by investigating the use of a machine learning classification approach for predicting academic misconduct among undergraduate students of higher education institutions in Malaysia. Following prior work [13], this study uses the fraud triangle theory (FTT) factors of pressure, opportunity, and rationalization to predict the incidence of academic misconduct in a unique setting: emergency remote teaching during the COVID-19 pandemic.
This study makes two major contributions. First, it extends previous work on academic misconduct prediction using machine learning techniques [1]-[4] by presenting evidence on a machine learning-based academic misconduct prediction model among Malaysian undergraduate students. To the best of our knowledge, prior machine learning studies on academic misconduct have reported a limited set of evaluation metrics, without highlighting the confusion matrix, precision, and recall. Second, it presents a new design and execution of machine learning prediction of academic misconduct based on the FTT constructs, compared against demography constructs. The rest of the paper is laid out as follows: section 2 discusses the dataset for this investigation as well as the machine learning experimental setting; the empirical findings for each algorithm are shown and discussed in section 3; and the summary and conclusions are presented in section 4.

METHOD

2.1. Data collection and datasets
This study employed a questionnaire instrument to collect the dataset for constructing the machine learning prediction model of academic misconduct. The survey was distributed to undergraduate accounting students of Malaysian higher education institutions during the implementation of emergency remote teaching. The questionnaire consists of two sections that were designed to acquire information on the students' demographics (gender, attitude on learning, health status, peer academic misconduct, and academic misconduct experiences) as well as their perception of the FTT attributes (pressure, opportunity, and rationalization) [21]. This study uses 6, 8, and 5 indicators to measure pressure, opportunity, and rationalization respectively. The mean of the indicator scores represents each FTT attribute. Five indicators gauge students' experiences of engaging in academic misconduct as the dependent variable (DV): asking for external assistance, exchanging responses during online testing, plagiarizing, illicit collaboration, and searching for answers on the internet through discussion or forum groups. The mean of academic misconduct experience is the target variable of the prediction model. If the mean total for academic misconduct of a student is >2.5, the student is
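The aggregation and thresholding described in this subsection can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the function names and the respondent's scores are hypothetical, and the mapping of a mean above 2.5 to class 1 (dishonesty) is an assumption, since the source sentence is cut off before stating the outcome.

```python
from statistics import mean

THRESHOLD = 2.5  # cut-off on the mean misconduct score, as stated in the text


def attribute_score(indicator_scores):
    """Represent one FTT attribute by the mean of its indicator scores."""
    return mean(indicator_scores)


def misconduct_label(experience_scores, threshold=THRESHOLD):
    """Return 1 if the mean misconduct experience exceeds the threshold, else 0.

    ASSUMPTION: a mean above the threshold maps to class 1 (dishonesty);
    the source text is truncated before it states this explicitly.
    """
    return 1 if mean(experience_scores) > threshold else 0


# Hypothetical respondent; these scores are NOT taken from the survey data.
student = {
    "pressure": [3, 4, 2, 3, 4, 2],            # 6 indicators
    "opportunity": [2, 2, 3, 1, 2, 3, 2, 1],   # 8 indicators
    "rationalization": [4, 3, 4, 5, 4],        # 5 indicators
    "misconduct": [3, 2, 4, 3, 2],             # 5 indicators (DV)
}

features = {
    name: attribute_score(student[name])
    for name in ("pressure", "opportunity", "rationalization")
}
label = misconduct_label(student["misconduct"])

print(features)  # one mean score per FTT attribute
print(label)     # binary target used by the classifiers
```

The comparison is strict, so a respondent whose mean is exactly 2.5 falls into class 0 under this assumed rule.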

Correlations of variables
Table 1 lists the independent variables (IVs) from the demography and FTT attributes. The DV is the class of academic misconduct, either dishonesty or honesty, represented as 1 and 0 as given in Table 2. The distribution percentages show the share of samples in each class, and the academic honesty class is much larger than the academic dishonesty class. Therefore, it is interesting to observe how this distribution affects the ability of machine learning to predict cases of academic dishonesty. Based on the Pearson correlation test, most of the attributes have a low correlation coefficient with the DV, and two demography attributes (peer academic misconduct and gender) have very low dependency with the DV (below 0.1). However, in machine learning prediction, each attribute, even one with very low influence, is expected to provide some degree of knowledge to the algorithm. Therefore, all attributes remain included in all machine learning models. What matters most is how much, and how differently, each attribute contributes in the different machine learning algorithms.
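As a reminder of what the coefficients in Table 1 measure, the following sketch computes the Pearson correlation between one attribute and a binary DV (with a 0/1 target this is the point-biserial correlation). The function name and the toy values are hypothetical and are not the study's data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-deviations and the two root sums of squared deviations
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    dev_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    dev_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if dev_x == 0 or dev_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (dev_x * dev_y)


# Toy values for illustration only (NOT the survey data):
# a mean rationalization score per student and the 0/1 misconduct class.
rationalization = [1.8, 2.4, 3.0, 2.2, 4.1, 3.6, 2.0, 2.9]
dv = [0, 0, 0, 0, 1, 1, 0, 0]

r = pearson(rationalization, dv)
print(f"r = {r:.3f}")
```

A coefficient below 0.1 in absolute value, as reported for gender and peer academic misconduct, indicates almost no linear relationship with the DV, although a nonlinear learner may still extract some signal from such an attribute.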

Machine learning
Four machine learning algorithms, namely generalized linear model (GLM) [22], logistic regression (LR) [23], DT [24], and random forest (RF) [25], have been selected for comparison in this study. These four algorithms were selected based on the preliminary findings from the AutoModel module in the RapidMiner software, which uses an optimization search strategy to identify suitable algorithms for the given dataset. Table 3 lists the optimal parameter sets of DT and RF from the preliminary hyper-parameter tuning.
For the DT, the maximal depth used in the preliminary testing ranged from 2 to 25, with a consistent error rate of 12.5% for all the settings. Therefore, the smallest maximal depth, 2, is used for the algorithm. The numbers of trees used in the preliminary hyper-parameter tuning of RF are 20, 60, 100, and 140. For each of the four numbers of trees, three values of maximal depth (2, 4, 7) were tested. The worst error rate was 18.8%, with 20 trees and a maximal depth of 4. The best error rate is 10.9%, with the configuration given in Table 3.
Figure 1. Process for split ratio

Performances metrics
Because the machine learning algorithms were used to predict the probability of two classes of academic misconduct, the models were evaluated with classification metrics calculated from the confusion matrix depicted in Figure 2. In the context of academic misconduct, with academic dishonesty (class 1) as the positive class, the cells are: i) true positive (TP): the number of academic dishonesty cases correctly classified, ii) true negative (TN): the number of academic honesty cases correctly classified, iii) false positive (FP): the number of academic honesty cases incorrectly classified as dishonesty, and iv) false negative (FN): the number of academic dishonesty cases incorrectly classified as honesty. Based on the confusion matrix in Figure 2, the metrics for measuring the machine learning performances are accuracy, classification error, recall, and precision. Accuracy and classification error measure the performance of the machine learning in detecting both classes (1, 0) from the total validation cases. On the other hand, recall and precision present the ability to detect each specific class. The formulas for accuracy and classification error are given in (5) and (6):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \quad (5)$$

$$\text{Classification error} = \frac{FP + FN}{TP + TN + FP + FN} \quad (6)$$

The formula to measure the sensitivity of machine learning in predicting academic dishonesty (class 1), or recall, is denoted in (7). Sensitivity or recall for class 1 is defined as the TPR and presents how much academic dishonesty can be correctly predicted:

$$\text{Recall} = \frac{TP}{TP + FN} \quad (7)$$

Recall for class 0, also called specificity, presents how much academic honesty can be correctly classified. Precision complements recall by measuring how many of the cases predicted as a given class actually belong to it. The formula for precision is denoted in (8):

$$\text{Precision} = \frac{TP}{TP + FP} \quad (8)$$
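The metrics named in this subsection can be computed directly from the four confusion-matrix counts. The sketch below assumes the standard definitions with class 1 (dishonesty) as the positive class, so that FP counts honest students flagged as dishonest and FN counts dishonest students classified as honest. The function names and the counts are hypothetical and are not the results reported in the tables.

```python
def safe_div(numerator, denominator):
    """Divide, returning 0.0 when the denominator is zero (a common convention)."""
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, tn, fp, fn):
    """Compute binary classification metrics for the positive class."""
    total = tp + tn + fp + fn
    return {
        "accuracy": safe_div(tp + tn, total),
        "classification_error": safe_div(fp + fn, total),
        "recall": safe_div(tp, tp + fn),        # sensitivity / TPR
        "precision": safe_div(tp, tp + fp),
        "specificity": safe_div(tn, tn + fp),   # recall of the negative class
    }


# Illustrative counts for a 32-case hold-out set (NOT the paper's results).
metrics = binary_metrics(tp=3, tn=24, fp=2, fn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Accuracy and classification error always sum to 1, which is why the two are reported together, while precision and recall can diverge sharply when one class is rare.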

RESULTS AND DISCUSSION
The study presents three sets of results. The first set, provided in Table 4, is the performance of the machine learning algorithms in correctly (accuracy) and incorrectly (classification error) classifying both cases of academic misconduct from the total validation cases. TTC is the time to complete from the training to the validation stages, in milliseconds. In general, all machine learning algorithms achieved good accuracy (above 80%) with correspondingly low errors (lower than 20%), particularly DT and RF, which use a tree-based paradigm for constructing the classification model. DT and RF achieved equal accuracy, but DT has a lower TTC than RF. Although RF took the longest time, the process can be completed in just 3 seconds. The RF structure is more complex because it builds many trees rather than the single tree of DT, which causes it to take much more time than the other algorithms.
The second set of results is the precision and recall for each class of academic misconduct, measured from the confusion matrices (labeled as in Figure 2) generated by each machine learning algorithm, as listed in Table 5. As expected, the class precision and recall for detecting academic dishonesty are lower in all machine learning algorithms than the results for predicting the academic honesty class. However, even with the very small number of academic dishonesty cases available for training, the precision results from GLM, LR, and RF are reasonably good (50-75%). DT probably did not encounter academic dishonesty data during the training stage, which resulted in 0% precision and recall for class 1.
Lastly, the third set of results, listed in Table 6, explains how each attribute from demography and FTT was used in the different machine learning algorithms by giving the weights of the correlation coefficients that each algorithm used for the academic misconduct prediction. In general, the rationalization attribute from FTT is the most important to GLM, LR, and DT, but in RF the opportunity attribute is the highest. The research findings indicate that the rationalization attribute of the FTT becomes significant when students attempt to rationalize their academic misconduct by providing self-justifications. To illustrate, students may persuade themselves that engaging in cheating or plagiarism is justified due to various factors such as the pressure to attain high grades, an excessive workload, the prevalence of such behavior among peers, and the perception of an arbitrary grading system [13]. Among the demography attributes, the importance varies similarly across attributes, and learning attitude is the second highest in GLM, LR, and DT after rationalization. Although health has the highest correlation coefficient outside the machine learning models (refer to Table 1), it ranks second in importance in RF. Gender and peer academic misconduct remain the least significant attributes in all the machine learning models, consistent with the rank of correlation coefficients in Table 1.
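The combination of high accuracy with 0% precision and recall for the minority class is typical of imbalanced data. The sketch below illustrates the effect with a hypothetical classifier that always predicts honesty; the class sizes are made up for illustration and no claim is made that the DT model in this study behaved exactly this way.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP, FN for the given positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p != positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp, tn, fp, fn


# Hypothetical hold-out set: 5 dishonest (1) and 27 honest (0) students.
y_true = [1] * 5 + [0] * 27
# A degenerate model that predicts "honest" for every student.
y_pred = [0] * len(y_true)

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)
recall = tp / (tp + fn) if (tp + fn) else 0.0
# Precision is undefined when nothing is predicted positive; report 0.0 by convention.
precision = tp / (tp + fp) if (tp + fp) else 0.0

print(f"accuracy={accuracy:.3f} recall={recall:.3f} precision={precision:.3f}")
```

Because accuracy alone hides this failure mode, per-class precision and recall are needed whenever the dishonesty class is rare.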

CONCLUSION
This research has opened up many research opportunities related to machine learning prediction in the education domain, particularly for academic misconduct. Machine learning can continuously learn from the prediction errors measured during the training phase. With each row of training data, it adjusts the attribute coefficients of the model through mathematical derivation until the best configuration is found. Based on the tested dataset, which focused on students from higher education institutions in Malaysia, the findings of this research show that the FTT factors were more useful to the performance of the machine learning prediction models than the demographic factors. These findings raise various research questions that call for extensive further work, either on the machine learning algorithms or on the attributes of the prediction models.

Bulletin of Electr Eng & Inf, ISSN: 2302-9285. Machine learning prediction for academic misconduct prediction: an analysis of binary … (Suraya Masrom)
Figure 1 depicts the process in RapidMiner for splitting the dataset into training and testing sets. As seen in the ratio field, the research used a 0.7:0.3 training-to-testing ratio. Therefore, of the 108 records, 76 were used for the machine learning training and 32 were used as a hold-out sample for the machine learning testing.
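The 0.7:0.3 hold-out split can be reproduced in outline with the Python standard library. This is only a sketch: the function name, the fixed seed, and the use of shuffled rather than stratified sampling are assumptions, since the text does not state which sampling type was configured in RapidMiner.

```python
import random


def split_dataset(rows, train_ratio=0.7, seed=42):
    """Shuffle the rows and split them into training and testing subsets.

    ASSUMPTIONS: shuffled (non-stratified) sampling and a fixed seed;
    the sampling type used in the study is not given in the text.
    """
    rows = list(rows)
    rng = random.Random(seed)  # local generator keeps the split reproducible
    rng.shuffle(rows)
    n_train = round(len(rows) * train_ratio)
    return rows[:n_train], rows[n_train:]


# 108 placeholder record ids standing in for the survey responses.
train_rows, test_rows = split_dataset(range(108))
print(len(train_rows), len(test_rows))  # sizes of the two subsets
```

With 108 records, rounding 108 x 0.7 = 75.6 gives the 76 training and 32 testing records mentioned in the text. With so few dishonesty cases, a stratified split would be one way to make sure both subsets contain examples of class 1.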

Figure 2. Confusion matrix for the academic dishonesty model

Table 1. Pearson correlation of each IV to the DV

Table 2. The DV of the classification model

Table 3. Configuration of parameters

Table 4. The result

Table 5. Confusion matrix of GLM

Table 6. The weights of correlations of each academic misconduct attribute