Naive Bayes modification for intrusion detection system classification with zero probability

Received Jan 25, 2021; Revised Apr 30, 2021; Accepted Jul 25, 2021

One of the methods used for detection in an intrusion detection system is the Naïve Bayes algorithm. However, Naïve Bayes has a problem: when one of the probabilities is 0, the prediction becomes inaccurate, or no prediction is found at all. This paper proposes two modifications of the Naïve Bayes algorithm. The first modification eliminates the variable that has a probability of 0, and the second changes the multiplication operations to addition operations. The modifications are applied only when the Naïve Bayes algorithm does not find any prediction result because of zero probabilities. The results show that precision, recall, and accuracy tend to increase with the modifications and are better than those of the original Naïve Bayes algorithm. The highest precision, recall, and accuracy are obtained by changing the multiplication operations to additions. Precision increases by up to 4 percentage points, recall by up to 2 percentage points, and accuracy by up to 2 percentage points.


INTRODUCTION
Network and data security are among the most important concerns for an agency today. The variety of attacks carried out over the internet against networks and data encourages agencies to implement systems that detect and prevent them [1]. One system that is often used to detect attacks is the intrusion detection system (IDS). An IDS automates the process of detecting suspicious activity in the network and analyzes the possibility of attacks in that activity [2], [3]. Several detection methods are used in IDS, including anomaly detection and misuse detection. Anomaly detection compares the state of an observed activity with the state of normal activity, while misuse detection matches the activity pattern against patterns previously defined in a database [4]. Apart from these two methods, several studies have carried out detection, prediction, or classification using data mining algorithms [5]-[8]. One algorithm that can be used for prediction in IDS is Naïve Bayes [9], [10], which gives good accuracy.
The Naïve Bayes algorithm is a classification algorithm that performs well and is often used in various studies [11]-[13]. It can be used for simple classification with a fixed Y variable and also for text classification [14]-[16]. Laga and Sarno [17] showed that Naïve Bayes gave better accuracy than other classification methods, such as KNN, SVM, and random forest. However, the Naïve Bayes algorithm still has a drawback: if the probability value of one of the variables is 0, the final result for that class also becomes 0, which can lead to inaccurate predictions [15], [17]-[20]. Studies [15], [17] overcome zero probability with RB-Bayes, study [20] uses a hybrid N-gram, and studies [19], [20] use multinomial Naïve Bayes.
Based on the previous research [15], [17]-[20], it can be seen that prediction results for the testing data may not be found because of zero probability. It is therefore necessary to modify the Naïve Bayes algorithm to overcome this problem. This paper proposes modifications of the Naïve Bayes algorithm to overcome zero probability in the dataset. In this research, the Naïve Bayes algorithm and two modifications are implemented in a web-based application, and the analysis examines whether the modifications improve the accuracy of attack prediction in IDS or not. The first modification eliminates the variable that has a probability value of 0, while the second changes the calculation from multiplication to addition. Both modifications are applied only when the Naïve Bayes algorithm does not find any classification result.

RESEARCH METHOD
The research method used in this paper is shown in Figure 1. The stages are carried out sequentially.

Problem identification
Problem identification is the initial stage for identifying a case or problem [21], [22]. It aims to determine the problems that exist in the Naïve Bayes algorithm, especially in predicting attacks in the network. The problems identified at this stage are that a probability value of 0 in Naïve Bayes can make the prediction results inaccurate, and that the IDS has limited ability to predict attacks in the network.

[Figure 1. Stages of the research method: problem identification, data collection, data preprocessing, implementation, testing]

Data collection
The data in this study came from the NSL-KDD 99 dataset. NSL-KDD 99 was developed from the KDD 99 dataset and reduces some of its fundamental problems. The files used are small training set.csv and KDDTest+.csv [23]. The advantages of the NSL-KDD 99 dataset over the original KDD 99 dataset include:
a. The training data does not contain redundant records, so the classification results are not biased.
b. There is no data duplication in the testing data.
c. The number of records in the training and testing data is reasonable, which makes it affordable to run experiments on the complete datasets without having to randomly select a small portion.

Data preprocessing
In this stage, several processes are carried out to prepare the data before classification is performed using the Naïve Bayes algorithm. The processes include:
a. data cleansing
b. feature selection
c. variable discretization

Implementation
At this stage, the application is built according to the design made in the previous stage. It is implemented as a web application using the PHP programming language and a MySQL database.

Testing
The next stage after implementation is testing the system. This stage tests the Naïve Bayes algorithm and the modifications that have been made. The tests are divided into two parts: algorithm testing, and testing of the precision, recall, and accuracy values.

RESULTS AND DISCUSSION
This section discusses the research that has been done, from the preprocessing stage through application implementation to testing.

Preprocessing
In this stage, several processes are carried out to prepare the data before classification is performed using the Naïve Bayes algorithm. This stage is included because preprocessing can improve the accuracy of Naïve Bayes [24]. The processes are as follows.

Data cleansing
This stage removes records from the testing data whose Y variable does not appear in the training data, and changes the class (variable Y) from the name of the attack to the type of attack. This lowers the number of distinct Y values, so the system can run faster. Attack names and attack types can be seen in Table 1.
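The two cleansing steps can be sketched in Python as follows. This is only an illustration, not the paper's PHP implementation: the function name, the record layout, and the order of the two steps are assumptions, and the mapping lists just a few well-known NSL-KDD attack names rather than the full content of Table 1.

```python
# Illustrative subset of an attack-name -> attack-type mapping (not the full Table 1).
ATTACK_TYPE = {
    "normal": "normal",
    "neptune": "DoS",
    "smurf": "DoS",
    "satan": "Probe",
    "ipsweep": "Probe",
    "guess_passwd": "R2L",
    "buffer_overflow": "U2R",
}

def cleanse(train_rows, test_rows):
    """Each row is (features, attack_name). Returns relabelled (train, test)."""
    # Step 1 (assumed order): drop testing records whose attack name
    # never occurs in the training data.
    seen = {name for _, name in train_rows}
    kept = [(x, name) for x, name in test_rows if name in seen]

    # Step 2: replace the attack name with the coarser attack type.
    def relabel(rows):
        return [(x, ATTACK_TYPE[name]) for x, name in rows]

    return relabel(train_rows), relabel(kept)

train = [(("SF", 1), "normal"), (("S0", 0), "neptune"), (("REJ", 0), "satan")]
test = [(("S0", 0), "neptune"), (("SF", 0), "mscan")]  # mscan is absent from train
clean_train, clean_test = cleanse(train, test)
```

After this step the classifier only has to separate a handful of attack types instead of every individual attack name.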

Feature selection
This stage reduces the number of X variables and aims to improve the accuracy of the resulting predictions. The method used is correlation-based feature selection (CFS). CFS chooses the X variables that have the highest correlation with the Y variable but the lowest correlation with each other. The feature selection process in this study was carried out using the WEKA tool, which selected 10 X variables out of the 41 available. The selected variables are: flag, src_bytes, dst_bytes, hot, logged_in, count, srv_serror_rate, diff_srv_rate, dst_host_diff_srv_rate, and dst_host_srv_diff_host_rate.
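The trade-off that CFS makes can be illustrated with the merit heuristic commonly associated with the method, k·r_cf / sqrt(k + k(k-1)·r_ff), where k is the number of variables in the subset, r_cf the mean variable-to-class correlation, and r_ff the mean correlation between variables. The Python sketch below only evaluates this score for given correlations. It does not reproduce WEKA's search procedure or its correlation measure, and the function name and input values are assumptions.

```python
from math import sqrt

def cfs_merit(feature_class_corr, feature_feature_corr):
    """Merit of a feature subset.

    feature_class_corr:   one correlation with the class per feature.
    feature_feature_corr: one correlation per pair of features (may be empty).
    """
    k = len(feature_class_corr)
    r_cf = sum(feature_class_corr) / k
    r_ff = (sum(feature_feature_corr) / len(feature_feature_corr)
            if feature_feature_corr else 0.0)
    return k * r_cf / sqrt(k + k * (k - 1) * r_ff)

single = cfs_merit([0.5], [])               # one feature on its own
redundant = cfs_merit([0.5, 0.5], [1.0])    # second feature duplicates the first
independent = cfs_merit([0.5, 0.5], [0.0])  # second feature adds new information
```

A fully redundant second variable leaves the merit unchanged, while an uncorrelated one raises it, which is why the selected subset favours variables that correlate with Y but not with each other.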

Variable discretization
This stage converts the continuous variables in the dataset into discrete variables. The method used is supervised discretization, because the X variables correlate directly with the Y variable. The results of variable discretization are shown in Table 2.

Implementation
At this stage, the application is built according to the design made in the previous stage. It is implemented as a web application using the PHP programming language and a MySQL database. Three algorithms are applied in the application:

Naive Bayes
Naïve Bayes is a simple probabilistic classifier based on Bayes' theorem, in which the features (variables) are assumed to be independent of each other. Bayes' theorem was put forward by the British scientist Thomas Bayes as a theorem for predicting future probabilities based on experience [24]. Bayes' theorem is given in (1):

$$P(H \mid X) = \frac{P(X \mid H)\, P(H)}{P(X)} \tag{1}$$

where H is the hypothesis (here, the class) and X is the observed data.
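To make the zero-probability problem concrete, here is a minimal Python sketch of a categorical Naïve Bayes classifier without any smoothing. It is not the paper's PHP implementation. The toy records and function names are assumptions for illustration, and only the variable names flag and logged_in come from the selected features.

```python
from collections import Counter, defaultdict

def train(rows, labels):
    """Count class frequencies and, per class and variable, value frequencies."""
    class_counts = Counter(labels)
    value_counts = defaultdict(Counter)  # (class, variable index) -> value counts
    for row, label in zip(rows, labels):
        for i, value in enumerate(row):
            value_counts[(label, i)][value] += 1
    return class_counts, value_counts

def class_scores(model, row):
    """P(class) multiplied by every P(value | class), with no smoothing."""
    class_counts, value_counts = model
    total = sum(class_counts.values())
    scores = {}
    for label, n in class_counts.items():
        score = n / total
        for i, value in enumerate(row):
            score *= value_counts[(label, i)][value] / n
        scores[label] = score
    return scores

def predict(model, row):
    """Return the best class, or None when every class score is 0."""
    scores = class_scores(model, row)
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

# Toy training data: (flag, logged_in) -> class
ROWS = [("SF", 1), ("SF", 1), ("SF", 0), ("S0", 0), ("S0", 0), ("REJ", 0)]
LABELS = ["normal", "normal", "normal", "DoS", "DoS", "DoS"]

model = train(ROWS, LABELS)
print(predict(model, ("SF", 1)))  # prints: normal
print(predict(model, ("S0", 1)))  # prints: None
```

In the second call, flag = S0 never occurs with class normal and logged_in = 1 never occurs with class DoS, so a single zero factor wipes out the score of every class and no prediction is found.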

Modification 1
As the example calculation shows, a problem arises when no prediction result is found because every class has a probability value of 0. Modification 1 therefore removes the variables that have a probability value of 0, so that the probability of each class is calculated without any zero-valued factor.
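One possible reading of modification 1 is sketched below in Python. The paper's text does not spell out whether a variable is removed for all classes or only for the class where it is zero. This sketch assumes the former: a variable is dropped for every class as soon as its likelihood is 0 for at least one class. The function names and numbers are illustrative.

```python
from math import prod

def naive_bayes_scores(priors, likelihoods):
    """Original rule: prior times the product of all per-variable likelihoods."""
    return {c: priors[c] * prod(likelihoods[c]) for c in priors}

def modification_1_scores(priors, likelihoods):
    """Assumed reading: ignore every variable that is 0 for at least one class."""
    n_vars = len(next(iter(likelihoods.values())))
    keep = [i for i in range(n_vars)
            if all(likelihoods[c][i] > 0 for c in priors)]
    return {c: priors[c] * prod(likelihoods[c][i] for i in keep) for c in priors}

def classify(priors, likelihoods):
    """Use the original rule; fall back to modification 1 only if all scores are 0."""
    scores = naive_bayes_scores(priors, likelihoods)
    if max(scores.values()) == 0:
        scores = modification_1_scores(priors, likelihoods)
    return max(scores, key=scores.get)

# Hypothetical likelihoods P(x_i | class) of one testing record, three variables.
PRIORS = {"normal": 0.5, "DoS": 0.5}
LIKELIHOODS = {"normal": [0.0, 0.75, 0.5], "DoS": [0.5, 0.0, 0.25]}

plain = naive_bayes_scores(PRIORS, LIKELIHOODS)      # both classes score 0
fixed = modification_1_scores(PRIORS, LIKELIHOODS)   # only variable 3 is kept
label = classify(PRIORS, LIKELIHOODS)
```

Here the first two variables are discarded and the decision rests on the third one alone, so the record is classified as normal instead of being left without a prediction.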

Modification 2
In modification 2, the zero probability in Naïve Bayes is overcome by changing the multiplication operations into additions, so that the resulting probability of each class is not 0.
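A minimal Python sketch of modification 2 follows. It assumes that the class score becomes the prior plus the sum of the per-variable likelihoods. The paper's text only states that multiplication is replaced by addition, so the exact form, the function names, and the numbers are assumptions.

```python
from math import prod

def naive_bayes_scores(priors, likelihoods):
    """Original rule: prior times the product of all per-variable likelihoods."""
    return {c: priors[c] * prod(likelihoods[c]) for c in priors}

def modification_2_scores(priors, likelihoods):
    """Assumed form: prior plus the sum of all per-variable likelihoods."""
    return {c: priors[c] + sum(likelihoods[c]) for c in priors}

def classify(priors, likelihoods):
    """Use the original rule; fall back to modification 2 only if all scores are 0."""
    scores = naive_bayes_scores(priors, likelihoods)
    if max(scores.values()) == 0:
        scores = modification_2_scores(priors, likelihoods)
    return max(scores, key=scores.get)

# Hypothetical likelihoods P(x_i | class) of one testing record, three variables.
PRIORS = {"normal": 0.5, "DoS": 0.5}
LIKELIHOODS = {"normal": [0.0, 0.75, 0.5], "DoS": [0.5, 0.0, 0.25]}

summed = modification_2_scores(PRIORS, LIKELIHOODS)
label = classify(PRIORS, LIKELIHOODS)
```

The summed scores are no longer probabilities (they can exceed 1) and serve only to rank the classes. Unlike removing variables, every variable still contributes to the decision, including the non-zero likelihoods of variables that are zero for another class.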

Application implementation
The implementation of the design resulted in a web-based application for testing the modifications made in this study. The application has one admin user who acts as the data manager. The admin is required to log in before managing the data in the system. The main page offers several menus for managing data: training data, testing data, and the testing page. The training data and testing data menus each contain a submenu, namely the view data menu shown in Figure 2. On the manage data page, the admin can input data either through the form provided or through CSV import using the import data button. In addition, the admin can delete all data that has been entered by using the delete all button.
On the data view page, the admin can view, edit, and delete data that has been entered. The testing data list has a button that starts the classification process. The testing menu shows the results of the classification process carried out by the system. The testing page is shown in Figure 3.

Testing
Testing is a way to assess the quality of an algorithm [25]. This stage is carried out with two methods: the algorithm test, and the precision, recall, and accuracy test.

Algorithm test
Algorithm testing is done by comparing the results of manual calculations with the calculations performed by the application. If the two results are the same, the application has performed the calculation correctly. The calculations compared are those of the Naïve Bayes algorithm, the Naïve Bayes algorithm with modification 1, and the Naïve Bayes algorithm with modification 2.
This test uses 10 training records and 3 testing records. The manual calculations and the calculations by the application give the same results, which shows that the application implements the Naïve Bayes algorithm and the two modifications correctly.

Precision, recall, and accuracy test
The precision, recall, and accuracy test calculates these three values for the Naïve Bayes algorithm and the two modifications. Precision is the proportion of predicted positive cases that are actually positive, as formulated in (2) [26], [27]:

$$\text{Precision} = \frac{TP}{TP + FP} \tag{2}$$

Recall is the proportion of actual positive cases that are correctly identified, as shown in (3):

$$\text{Recall} = \frac{TP}{TP + FN} \tag{3}$$

Accuracy is the proportion of all predictions that are correct, as shown in (4):

$$\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} \tag{4}$$

where:
TP: True Positive
TN: True Negative
FP: False Positive
FN: False Negative

In this test, the testing data has a fixed size of 300 records, while the training data grows from 200 to 1200 records in steps of 200 records per test. This is done to analyze the precision, recall, and accuracy of the Naïve Bayes algorithm and of the modifications applied. The results of testing precision, recall, and accuracy can be seen in Figures 4, 5, and 6.
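The three measures follow directly from the four confusion-matrix counts. The Python sketch below shows the standard definitions together with the sequence of training-set sizes described above. The function names and the example counts are hypothetical and are not results from the paper.

```python
def precision(tp, fp):
    """Share of predicted positives that are truly positive."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def recall(tp, fn):
    """Share of actual positives that were found."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def accuracy(tp, tn, fp, fn):
    """Share of all predictions that are correct."""
    total = tp + tn + fp + fn
    return (tp + tn) / total if total else 0.0

# Training sizes used in the test: 200, 400, ..., 1200 (testing data fixed at 300).
TRAIN_SIZES = list(range(200, 1201, 200))

# Hypothetical counts for a single run.
tp, tn, fp, fn = 3, 5, 1, 1
p = precision(tp, fp)
r = recall(tp, fn)
a = accuracy(tp, tn, fp, fn)
```

The guards against an empty denominator are a convenience of this sketch. The paper does not state how such cases are handled.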

Results analysis
The application built in this study has one actor, the administrator, who has access rights to manage the training data and testing data and to run the classification tests. The application was tested with two methods: algorithm testing, and precision, recall, and accuracy testing. For the Naïve Bayes algorithm and both modifications, precision, recall, and accuracy increased as the amount of training data increased. This is because a larger training set is more likely to contain records with the same values as the testing data.
The maximum precision of Naïve Bayes is 76.83%, obtained with 1200 training records. For the same training data, the precision of modification 1 is 79.83% and the precision of modification 2 is 80.83%. Both modifications therefore improve precision, with the highest value obtained by modification 2.
The maximum recall of Naïve Bayes is 85.52% with 1200 training records. For the same training data, the recall of modification 1 is 86.52% and the recall of modification 2 is 87.52%. Both modifications therefore improve recall, with the highest value obtained by modification 2.
The maximum accuracy of Naïve Bayes is 87.33% with 1200 training records. For the same training data, the accuracy of modification 1 is 88.33% and the accuracy of modification 2 is 89.33%. Both modifications therefore improve accuracy, with the highest value obtained by modification 2.
Based on these tests, it can be concluded that eliminating the variable that has a probability value of 0, and changing the multiplication operations to additions, can both increase precision, recall, and accuracy. The highest precision, recall, and accuracy are obtained by changing the multiplication operations to additions when the result would otherwise be 0. Compared with the original Naïve Bayes algorithm, precision increases by up to 4 percentage points, recall by up to 2 percentage points, and accuracy by up to 2 percentage points.

CONCLUSION
Based on the tests that have been done, it can be concluded that the precision, recall, and accuracy of the Naïve Bayes algorithm and of the two modifications increase as the amount of training data increases. In addition, eliminating the variable that has a probability value of 0, and changing the multiplication operations to additions, can both increase precision, recall, and accuracy. The highest precision, recall, and accuracy are obtained by changing the multiplication operations to additions when the result would otherwise be 0. To achieve better results, it is recommended that subsequent studies improve on the modifications made here.