Distributed denial of service attacks detection for software defined networks based on evolutionary decision tree model

ABSTRACT


INTRODUCTION
Software-defined networking (SDN) is a modern networking technique that separates the data plane from the control plane and consolidates the control functions in a central unit (the controller). This provides several advantages, such as cost reduction, programmability, and management of the entire network from a single point [1]. Because the control plane and the forwarding plane are separated, the control plane is managed by a programmable central controller in which policies are configured for each device, be it a switch, router, or firewall; that is, the network moves from a distributed control plane to a centralized one. Consequently, the controller, which can handle the entire network from a centralized point, can quickly impose network policies across the whole infrastructure [2], [3]. Despite these advantages, the centralized structure makes SDN vulnerable to attacks of its own, in addition to the well-known attacks against traditional networks [4]. Among the attacks to which software-defined networks are exposed, attacks on the controller are among the most hazardous. An intruder who takes over the controller gains the ability to manage or disable network traffic [5].

The most common attacks against the controller are distributed denial of service (DDoS) attacks, in which users are denied access to network services. In a DDoS attack, the attackers generate a large volume of traffic from multiple machines in order to exhaust the resources of the host computer and prevent it from serving requests [6]. DDoS attacks have recently become one of the most widespread and dangerous attack types, and they can be devastating to a variety of network services [7]. Attackers use botnets, which are made up of zombie devices taken over by internet hackers. DDoS attacks are difficult to identify and block because they involve a large number of devices [8], [9]. As a result, the rapid identification and mitigation of DDoS attacks is one of the most pressing issues for administrators and network service providers.

DDoS attacks can disable different SDN layers by flooding the communication channels between the switch and the controller, or between the application layer and the controller, with excessive flow data. The controller has no built-in security system that can differentiate between an intrusion and normal traffic, so detecting an attack is extremely difficult. Common types of DDoS attacks are volumetric attacks, resource-consuming attacks, and application layer attacks [10]. Application layer attacks include assaults against the hypertext transfer protocol (HTTP) and domain name system (DNS) protocols, and their detection is difficult [11]. In resource-consuming attacks, servers are rendered inaccessible by exploiting weaknesses in protocols implemented at the network layer; for example, a transmission control protocol-synchronize (TCP-SYN) flood depletes the target machine's resources (memory, CPU, and storage) [12]. The goal of volumetric attacks is to consume the network's bandwidth.
Common attacks such as ICMP, user datagram protocol (UDP), and TCP-SYN floods take advantage of flaws in layer 3 and layer 4 protocols [13].

In this work, we focus on improving the security of SDN against intrusion by using a machine learning algorithm. For this purpose, we use a standard public dataset with a total of 23 features for detecting DDoS attacks. We employed a machine learning method to classify SDN traffic as legitimate or illegitimate, and then used an optimization algorithm to tune the hyperparameters of the model and improve the classification accuracy. The results show that the proposed approach is more effective than using a machine learning model alone.

The remainder of the paper is organized as follows. The next section reviews previous work. Section 3 briefly describes the dataset used, together with the proposed method, the machine learning model, and the optimization method. Section 4 contains the results and analysis. Section 5 provides the discussion and future work.

LITERATURE REVIEW
In recent years, various studies have sought to protect SDN using machine learning and other techniques. This section discusses several studies on DDoS defense based on machine learning, deep learning, and other approaches.

The majority of SDN-based moving target defense (MTD) techniques have been designed around a single SDN controller, which introduces a single point of failure and a scalability concern for large-scale networks. To assure both performance and security, the author of [14] proposes an SDN-based MTD architecture with several SDN controllers. A developing field of research is the use of "defensive cyber-deception" to improve the security and reliability of network-based systems. To deploy more effective cyber deception, current honey technologies require more underlying infrastructure. The author of [15] investigates how to deploy deception in enterprise networks using SDN technology.

Security mechanisms such as the intrusion detection system (IDS) and the intrusion prevention system (IPS) are employed to enhance network security. Because of the increasing variety of attacks, these systems must rely on statistical calculations. Machine learning techniques have enabled intrusion detection systems to make useful predictions. The authors of [16] developed an ensemble strategy for detecting DDoS attacks using four distinct machine learning methods; with 98.12% accuracy, the SVM-SOM algorithm outperformed the other machine learning (ML) algorithms. In [17], the DoS dataset from the Canadian Institute for Cybersecurity (CIC) is used to test several classification algorithms (machine learning and deep learning algorithms); of all the algorithms tested, the multi-layer perceptron (MLP) produced the best results, with 95% accuracy.

Two models were used in [18] to identify UDP flooding attacks in an SDN scenario. The authors employed the Scapy software for traffic packet generation, and their system uses the OpenFlow switch to obtain flow statistics. After the feature extraction phase, they tested linear and polynomial SVM models for classification. According to the experimental results, the polynomial SVM reduced the false alarm rate by 34% and increased the accuracy by 3%. Another study proposed a security framework for detecting DDoS attacks in the SDN architecture. The system is based on an adaptive learning paradigm that classifies traffic using historical data, and cross-validation was applied to obtain reliable accuracy estimates. Although the results are encouraging, the adaptive security model needs to be evaluated on a variety of real-world datasets to ensure that it is realistic.

For the SDN environment, [19] presented a novel security paradigm against DDoS attacks. The model is made up of two stages that use machine learning algorithms: the k-means technique is used in the data processing stage to choose the best features, and the k-nearest neighbor (kNN) approach is used in the detection stage to detect attack flows. Their approach has a 98.85% accuracy rate and a 98.47% recall rate. An ensemble technique was employed in [20] to enhance IDS efficiency. The traditional NSL-KDD dataset is used to test multiple classification algorithms [21]. Karan et al. [22] presented a system for DDoS attack detection in SDN. The system employed two levels of security: it first used Snort to detect signature-based attacks and then classified attacks using an SVM model and a DNN classifier. The experimental results demonstrated that the DNN has a higher classification accuracy than the SVM, with a rate of 92.30%. A DDoS security system based on the SDN architecture was proposed in [23] to detect attack flows. This hybrid solution combines the kNN and SOM algorithms; it uses flow statistics gathered from SDN switches and classifies the traffic as regular or malicious.
It is clear from this summary of previous studies that the performance of intrusion detection systems depends heavily on the nature of the datasets. Several datasets have been used in previous studies, such as KDD Cup'99, CAIDA 2016, CICIDS2017, UNB-ISCX, NSL-KDD, and CIC DoS. These datasets are outdated, while attack characteristics are constantly changing. Therefore, there is an increasing need for up-to-date datasets obtained from an SDN environment. Only a few publicly available datasets exist for SDN-based intrusion detection systems [24], [25]. In our research, we used the "DDOS-ATTACK SDN DATASET", a recent dataset created in a software-defined networking environment.

PROPOSED METHODOLOGY
The proposed method has three parts: the dataset and its preprocessing; an optimization algorithm that tunes the hyperparameters and thereby improves the accuracy of the ML model; and an evolutionary machine learning algorithm that classifies traffic as normal or attack traffic. Figure 1 shows an overview of this technique. This section describes the classes and features of the public dataset, and explains in detail the machine learning model used to classify network traffic and the optimization algorithm used to tune its hyperparameters in order to raise the classification efficiency and accuracy.

Dataset
This study employed the new "DDOS-attack SDN dataset", which was generated within an SDN environment and made publicly available for machine learning research [26]. The dataset contains 104,345 traffic flows and 23 features, and it covers the UDP, TCP, and ICMP protocols as attack and normal traffic. Apart from the features that identify the source and target, the dataset contains statistical (numerical) features such as packet per flow, byte count, packet rate, and duration sec.

The data must be preprocessed before the machine learning model is trained. Several preprocessing techniques were applied to the dataset, including missing value handling, null value removal, and categorical value encoding. Categorical features without numerical values, such as the source and destination internet protocol (IP) addresses and the protocol, were encoded using one-hot encoding [27]. We then examined the correlation between the output and the input features using a variety of machine learning methods, heatmap graphs, and correlation techniques. As a result of this procedure, the column containing time data (the "dt" feature) was found to be uninformative and was deleted from the dataset. The preprocessing phase was completed by normalizing the numeric data.
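The preprocessing steps described above can be illustrated with a small sketch that uses only the Python standard library. The three sample records and every column name except `dt` are hypothetical stand-ins for the real dataset, and min-max scaling is assumed as the normalization, since the text does not name the exact scheme.

```python
# Minimal preprocessing sketch on hypothetical flow records (not the real dataset).
rows = [
    {"dt": 11425, "protocol": "UDP", "pktcount": 4500, "bytecount": 480000, "label": 1},
    {"dt": 11605, "protocol": "TCP", "pktcount": 120, "bytecount": 9000, "label": 0},
    {"dt": 11425, "protocol": "ICMP", "pktcount": None, "bytecount": 300, "label": 0},
]


def preprocess(records, categorical=("protocol",), drop=("dt",), target="label"):
    """Remove incomplete rows, drop unused columns, one-hot encode, min-max scale."""
    # 1) Remove records that contain a missing (None) value.
    clean = [r for r in records if all(v is not None for v in r.values())]

    # 2) Drop columns judged uninformative (the "dt" time column in the text).
    clean = [{k: v for k, v in r.items() if k not in drop} for r in clean]

    # 3) One-hot encode each categorical column.
    for col in categorical:
        categories = sorted({r[col] for r in clean})
        for r in clean:
            value = r.pop(col)
            for cat in categories:
                r[f"{col}_{cat}"] = 1.0 if value == cat else 0.0

    # 4) Min-max normalize every input feature to the range [0, 1].
    features = [k for k in clean[0] if k != target]
    for col in features:
        lo = min(r[col] for r in clean)
        hi = max(r[col] for r in clean)
        span = hi - lo
        for r in clean:
            r[col] = (r[col] - lo) / span if span else 0.0
    return clean


processed = preprocess(rows)
for record in processed:
    print(record)
```

The record with the missing packet count is removed, the `dt` column disappears, and the protocol column is replaced by one indicator column per protocol value that remains in the data.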

Hyperparameter optimization using optimization algorithm (genetic algorithm)
For ML models, the choice of hyperparameter configuration has a direct effect on model performance. Choosing it well frequently requires extensive knowledge of ML algorithms as well as appropriate hyperparameter optimization techniques. Although multiple automatic optimization approaches exist, they have different strengths and weaknesses when applied to different types of problems. Building an efficient ML model is a time-consuming and complex process that involves finding the best algorithm and tuning its hyperparameters to obtain the best model architecture.

The genetic algorithm (GA) [28] is a popular metaheuristic based on the evolutionary hypothesis that individuals with better survival and environmental adaptability are more likely to live and pass on their qualities to future generations. The following generation inherits the characteristics of its parents and may include both good and bad individuals. Better individuals have a higher chance of surviving and producing more capable offspring, while the worst eventually fade away. After multiple generations, the individual with the best adaptability is selected as the global optimum [29].

To apply the genetic algorithm to hyperparameter optimization, each individual (chromosome) represents a hyperparameter, and its decimal value is the actual input value of that hyperparameter in each evaluation. Every chromosome contains several genes, which are binary digits, and these genes are subjected to crossover and mutation operations. The population covers the possible combinations within the initialized chromosome (parameter) ranges, whereas the fitness function provides the measure by which the parameters are evaluated [30]. Because randomly initialized parameter values typically do not include the best values, several operations (selection, crossover, and mutation) must be performed on the well-performing chromosomes to identify the optimum [31]. Selection picks chromosomes with high fitness function values. To keep the population size constant, chromosomes with high fitness values are more likely to be carried on to the next generation, where they generate new chromosomes with the best characteristics of their parents. Selection thus ensures that the best traits of each generation are passed down to future generations. Crossover creates new individuals (chromosomes) by exchanging a proportion of genes between chromosomes. Mutation generates new chromosomes by randomly changing one or more genes of a chromosome. Together, mutation and crossover introduce different characteristics in later generations and reduce the probability of missing good characteristics [32].

The main genetic algorithm procedures are as follows [33]: i) initialize the population, the chromosomes (each chromosome represents a set of hyperparameters), and the genes at random, representing the whole search space, the hyperparameters, and the hyperparameter values, respectively, ii) compute the fitness function, which represents the objective function of the ML model, to evaluate the performance of each individual in the current generation, iii) run selection, crossover, and mutation operations on the chromosomes to generate a new generation containing the next hyperparameter configurations to be tested, iv) repeat steps ii and iii until the stop condition is satisfied, and v) end the program and output the optimum hyperparameter configuration.
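The five steps above can be sketched as a small genetic algorithm in plain Python. This is an illustrative stand-in written with the standard library only, not the implementation used in this work. The fitness function `count_ones` is a hypothetical placeholder for the model accuracy, the per-gene flip probability `indpb` is an assumption, and the default parameter values mirror those reported in this paper (population size 10, mutation probability 0.10, crossover probability 0.5, tournament size 3, 15 generations).

```python
import random


def run_ga(fitness, n_genes, pop_size=10, cx_prob=0.5, mut_prob=0.10,
           tournament_size=3, generations=15, indpb=0.1, seed=0):
    """Maximize `fitness` over fixed-length bit strings; return (best, best_fitness)."""
    rng = random.Random(seed)

    # Step i: random initial population of binary chromosomes.
    population = [[rng.randint(0, 1) for _ in range(n_genes)] for _ in range(pop_size)]
    best = max(population, key=fitness)[:]

    for _ in range(generations):
        # Step iii (selection): each slot is filled by the winner of a random tournament.
        offspring = [
            max(rng.sample(population, tournament_size), key=fitness)[:]
            for _ in range(pop_size)
        ]

        # Step iii (crossover): one-point crossover on consecutive pairs.
        for a, b in zip(offspring[::2], offspring[1::2]):
            if rng.random() < cx_prob:
                point = rng.randint(1, n_genes - 1)
                a[point:], b[point:] = b[point:], a[point:]

        # Step iii (mutation): a chosen individual has each gene flipped with prob. indpb.
        for child in offspring:
            if rng.random() < mut_prob:
                for i in range(n_genes):
                    if rng.random() < indpb:
                        child[i] = 1 - child[i]

        # Step ii: evaluate the new generation and remember the best individual seen.
        population = offspring
        champion = max(population, key=fitness)
        if fitness(champion) > fitness(best):
            best = champion[:]

    # Step v: report the best configuration found (step iv is the loop above).
    return best, fitness(best)


def count_ones(chromosome):
    """Placeholder fitness: number of 1-genes (stands in for model accuracy)."""
    return sum(chromosome)


best, score = run_ga(count_ones, n_genes=12)
print(best, score)
```

In a real tuning run, the placeholder fitness would be replaced by a function that trains the model with the decoded hyperparameters and returns its accuracy. Fixing the seed makes the run repeatable, which is convenient when comparing parameter settings.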
In these steps, the initial population of candidate hyperparameter configurations is generated by random initialization within the specified search space. Our objective function, the accuracy (ACC), is to be maximized, as shown in (1). In the execution of the genetic algorithm, the evaluation, selection, and recombination processes together constitute one generation.

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN} \quad (1)$$

where TP, TN, FP, and FN are the elements of the confusion matrix, which are explained later.

Several open-source libraries implement evolutionary algorithms such as the genetic algorithm; in our work we used the distributed evolutionary algorithms in Python (DEAP) library for hyperparameter optimization. DEAP [34] is a novel Python evolutionary computation package that includes several evolutionary algorithms, such as the genetic algorithm and differential evolution. It works with parallelization mechanisms such as multiprocessing and with machine learning packages such as sklearn. DEAP built-in functions were used for evaluation, mutation, crossover (one-point crossover), and tournament selection. The genetic algorithm was run with the following parameters: population size = 10, mutation probability = 0.10, crossover probability = 0.5, tournament size = 3, and number of generations = 15. Executing all the steps yields the best accuracy found and the corresponding combination of hyperparameters of the DT model, which are shown in Table 1. Figure 2 shows the flow chart and the steps of the genetic algorithm.
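To make the encoding and the objective concrete, the sketch below decodes a binary chromosome into four decision tree hyperparameters and computes the accuracy objective from confusion-matrix counts. The gene layout, the value ranges, and the example counts are assumptions made for illustration; the actual configuration space is the one referred to in Table 1. The parameter names follow the scikit-learn `DecisionTreeClassifier` arguments `criterion`, `splitter`, `max_depth`, and `max_features`.

```python
def bits_to_int(bits):
    """Interpret a list of 0/1 genes as an unsigned binary number."""
    value = 0
    for bit in bits:
        value = value * 2 + bit
    return value


def decode(chromosome):
    """Map an 8-gene chromosome to a hyperparameter dictionary (hypothetical layout)."""
    return {
        "criterion": ("gini", "entropy")[chromosome[0]],   # gene 0
        "splitter": ("best", "random")[chromosome[1]],     # gene 1
        "max_depth": 1 + bits_to_int(chromosome[2:6]),     # genes 2-5 -> depth 1..16
        # genes 6-7 -> one of three options (index taken modulo 3)
        "max_features": ("sqrt", "log2", None)[bits_to_int(chromosome[6:8]) % 3],
    }


def accuracy(tp, tn, fp, fn):
    """Objective (1): share of correctly classified flows."""
    return (tp + tn) / (tp + tn + fp + fn)


params = decode([1, 0, 0, 1, 0, 1, 0, 1])
print(params)
print(accuracy(tp=50, tn=45, fp=3, fn=2))
```

With scikit-learn, the decoded dictionary could be passed on as `DecisionTreeClassifier(**params)`, and the accuracy measured on held-out data would serve as the fitness of the chromosome.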

Classification using proposed evolutionary decision tree
The decision tree machine learning algorithm is used for regression and classification in real-world problems. The model is based on a tree structure with the root at the very top. The branches are built using objective rules based on the dataset's features, and the decision tree is grown gradually [35]. A decision tree can be generated with the following steps [36]: i) the entire dataset is split into a training set and a test set, ii) the training set is used as the input to the root of the tree, iii) the root is found using information theory, as shown in (2), iv) the pruning procedure is applied, and v) steps i to iv are repeated until all nodes have turned into leaf nodes.

Here, p stands for the probability distribution of the dataset. To obtain an efficient decision tree, its hyperparameters must also be tuned. After conducting many experiments, we concluded that the hyperparameters that most strongly affect the results of the model, and therefore need to be tuned, are: criterion (the function for measuring the quality of a split), splitter (the method for selecting the split at each node), max-depth (the maximum depth of the tree), and max-features (the number of features to consider when looking for the best split). As shown in Figure 2, after the dataset was preprocessed, the genetic algorithm was used to obtain appropriate values for the hyperparameters of the machine learning model, and these values were then used in the DT model to reach the best possible accuracy.
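The information-theoretic choice of a split can be illustrated numerically. The sketch below assumes that (2) denotes the Shannon entropy, $H = -\sum_i p_i \log_2 p_i$, over the class probabilities $p_i$; this is an assumption, since other criteria such as the Gini index are also common. The label lists are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(parent, left, right):
    """Entropy reduction obtained by splitting `parent` into `left` and `right`."""
    weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
    return entropy(parent) - weighted


parent = ["attack"] * 4 + ["normal"] * 4

# A split that separates the classes perfectly removes all uncertainty.
perfect = information_gain(parent, ["attack"] * 4, ["normal"] * 4)

# A split that leaves both children mixed removes only part of it.
mixed = information_gain(
    parent,
    ["attack", "attack", "attack", "normal"],
    ["attack", "normal", "normal", "normal"],
)

print(entropy(parent), perfect, mixed)
```

The candidate split with the larger information gain is the better choice for the node, which is why the perfectly separating split is preferred over the mixed one in this example.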

RESULTS AND DISCUSSION
This section discusses the experimental results and the findings of the proposed evolutionary machine learning model, and compares them with other studies. Binary classification was performed on the public "DDoS attack SDN dataset" using the sklearn Python machine learning framework.

Performance evaluation using performance metrics
The ability of the model to distinguish the legitimate and malicious network records generated with SDN was evaluated using a confusion matrix. This matrix contains both estimated and actual values. Table 2 shows the confusion matrix. The true-negative (TN) and true-positive (TP) values indicate correctly predicted network traffic, whereas the false-negative (FN) and false-positive (FP) values indicate incorrectly predicted network traffic [37]. Furthermore, the receiver operating characteristic (ROC) curve and the area under the curve (AUC) were used to assess model performance. In the ROC curve, the x (horizontal) axis represents the false positive rate and the y (vertical) axis represents the true positive rate [38]. The accuracy (ACC), precision (Pr), specificity (Sp), sensitivity (Se), and F1-score, all obtained from the confusion matrix, were used to evaluate the suggested model. These metrics are defined as follows:

$$\mathrm{ACC} = \frac{TP + TN}{TP + TN + FP + FN}$$

$$\mathrm{Pr} = \frac{TP}{TP + FP}$$

$$\mathrm{Sp} = \frac{TN}{TN + FP}$$

$$\mathrm{Se} = \frac{TP}{TP + FN}$$

$$\mathrm{F1} = \frac{2 \cdot \mathrm{Pr} \cdot \mathrm{Se}}{\mathrm{Pr} + \mathrm{Se}}$$
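The metrics named above follow directly from the four confusion-matrix counts. The sketch below uses the standard definitions of accuracy, precision, specificity, sensitivity, and F1-score; the counts are hypothetical and are not results from this study.

```python
def classification_metrics(tp, tn, fp, fn):
    """Compute the five evaluation metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)      # share of predicted attacks that are attacks
    sensitivity = tp / (tp + fn)    # share of real attacks that are detected (recall)
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "specificity": tn / (tn + fp),  # share of normal flows recognized as normal
        "sensitivity": sensitivity,
        "f1": 2 * precision * sensitivity / (precision + sensitivity),
    }


# Hypothetical counts for 200 test flows.
metrics = classification_metrics(tp=90, tn=95, fp=5, fn=10)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

Because the F1-score is the harmonic mean of precision and sensitivity, it drops sharply when either of the two is low, which makes it a useful complement to accuracy.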

Performance evaluation based on proposed evolutionary decision tree algorithm
At this stage, after preprocessing, the dataset was partitioned into training and testing sets at a ratio of 0.7 to 0.3, and the GA was used to optimize the hyperparameters of the DT algorithm. The genetic algorithm was configured with the following parameters: population size = 10, mutation probability = 0.10, crossover probability = 0.5, tournament size = 3, and number of generations = 15. Hyperparameter optimization is used to obtain suitable hyperparameters and to improve the accuracy and efficiency of the DT model. After the optimization with the genetic algorithm, a decision tree model was built with the obtained hyperparameter values, and the model accuracy was 99.46%. Table 3 presents the classification results, and Table 4 lists the hyperparameter values obtained.

The accuracy (ACC) measures how well the classifier predicts both the benign and the anomalous class. Precision (Pr) is the fraction of the traffic predicted as belonging to a class that actually belongs to it. Specificity (Sp) measures how well the negative class is predicted. Sensitivity (Se) assesses the model's ability to identify the true positives in each category. The F1-score is a measure of a test's accuracy that is calculated from its precision and recall. Figure 3 shows the ROC curve of the evolutionary decision tree (EDT) algorithm. The ROC curve is an evaluation metric for binary classification problems: it is a probability curve that plots the true positive rate (TPR) against the false positive rate (FPR) at various thresholds. The AUC summarizes the ROC curve and measures the ability of the classifier to distinguish between the classes.

Table 5 compares the results of research on DDoS attack traffic detection using machine learning techniques with the model we propose. Looking at this table, it is clear that several datasets were utilized to detect attack traffic. Many of the authors employed public datasets containing network traffic statistics from classical network architectures, such as NSL-KDD, UNB-ISCX, KDD Cup'99, CICIDS2017, and CAIDA2016 [16], [17], [22]. These datasets are useful for assessing the performance of machine learning techniques used in attack traffic detection. However, because the SDN design differs from the traditional network architecture, it has its own set of attack vectors in addition to the existing ones. Furthermore, the growing volume and variety of attack traffic necessitates the use of up-to-date datasets. As a result, the researchers in [18], [19], [25] employ datasets collected through the SDN architecture in their studies. The SDN dataset used in this research was created by a study group at Bennett University for deep learning and machine learning studies. The most important reason for choosing this dataset is that it was developed using the SDN architecture and includes modern SDN DDoS traffic data. We note from Table 5 that the model we propose achieved high results compared with other works.

| Dataset | Reference | Algorithms | Accuracy (%) |
|---|---|---|---|
| CIC DoS | [17] | REP Tree, MLP, Random Tree, J48, SVM, and Random Forest | 95.00 |
| Their dataset | [18] | Linear SVM, Polynomial SVM | 95.00 |
| Their dataset | [19] | KNN and K-Means | 98.85 |
| CAIDA 2016 | [16] | SVM, Naive Bayes, SOM, and KNN | 98.12 |
| KDD Cup 99 | [22] | DNN and SVM | 92.30 |
| DDOS attack SDN Dataset | [25] | … | … |

According to the results, machine learning models are highly effective in detecting attack traffic. Our work aims to contribute to the research being conducted in this field (attack detection in SDN using machine learning and optimization techniques). Our results show that using a GA for hyperparameter optimization improved the accuracy of machine learning approaches in identifying attack traffic. In the experiments, the hyperparameters were selected automatically by the GA, which chose the values that make the model accuracy as high as possible. It can be said that the classification performance of the model contributes positively to attack classification when the model is used in conjunction with hyperparameter optimization algorithms.
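The ROC curve and the AUC described in this section can be computed directly from classifier scores. The following sketch lowers the decision threshold step by step over a set of hypothetical scores, records one (FPR, TPR) point per distinct score, and integrates the curve with the trapezoidal rule. The labels and scores are made up for illustration and are not outputs of the proposed model.

```python
def roc_points(labels, scores):
    """Return the (FPR, TPR) points obtained by lowering the decision threshold."""
    positives = sum(labels)
    negatives = len(labels) - positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    points = [(0.0, 0.0)]
    tp = fp = 0
    for i, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if i + 1 == len(ranked) or ranked[i + 1][0] != score:
            points.append((fp / negatives, tp / positives))
    return points


def auc(points):
    """Area under a piecewise-linear curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area


labels = [1, 1, 0, 1, 0, 0]               # 1 = attack, 0 = normal (hypothetical)
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]   # predicted attack probabilities

points = roc_points(labels, scores)
print(points)
print(auc(points))
```

An AUC of 1 corresponds to a classifier that ranks every attack flow above every normal flow, whereas a value near 0.5 corresponds to random ranking.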

CONCLUSION
In this study, an evolutionary machine learning algorithm was used to classify attack and normal traffic in a dataset generated from an SDN environment. The dataset contains 104,345 records and 23 features, and it covers the UDP, TCP, and ICMP protocols as attack and normal traffic. The data includes numerical (statistical) features such as packet rate, byte count, packet per flow, and duration sec, in addition to features that identify the source and destination devices. The GA was used to select the most appropriate hyperparameters for the decision tree model and thereby achieve efficient classification. After conducting several experiments, the most

Bulletin of Electr Eng & Inf, ISSN: 2302-9285. Distributed denial of service attacks detection for software defined networks based on … (Hasan Kamel)


Bulletin of Electr Eng & Inf, Vol. 11, No. 4, August 2022: 2322-2330, ISSN: 2302-9285

Figure 1. Overview of the proposed work

Figure 2. Flow chart of genetic algorithm

Table 1. The hyperparameters and configuration space for the DT model

Table 3. Classification results of the evolutionary decision tree algorithm

Table 5. Comparison of research on DDoS attack traffic detection using machine learning techniques with the proposed model