Bulletin of Electrical Engineering and Informatics, ISSN: 2302-9285

Distributed denial of service attack defense system-based auto machine learning …

Mohammad Aljanabi, Russul Hayder, Shatha Talib, Ahmed Hussein Ali, Mostafa Abdulghafoor Mohammed, Tole Sutikno
Department of Computer, College of Education, AL-Iraqia University, Baghdad, Iraq
Department of Computer Science, Al Salam University College, Baghdad, Iraq
Engineer in the Ministry of Education, Iraqi Directorate of Education Baghdad Karkh III, Baghdad, Iraq
Computer Science and Information System, Al-Bayan University College, Baghdad, Iraq
Imam Aadham University College, Baghdad, Iraq
Department of Electrical Engineering, Faculty of Industrial Technology, Universitas Ahmad Dahlan, Yogyakarta, Indonesia


INTRODUCTION
Since the first attack incident was reported by the Computer Incident Advisory Capability in 1999, distributed denial of service (DDoS) attacks have grown into one of the most difficult network security problems [1]-[5]. Although many different defense strategies have been put forward in academia and industry, the threat of DDoS attacks is still extremely real and growing every year [6]-[8], and DDoS attacks continue to be the main threat that service providers contend with. A DDoS attack simultaneously and continuously sends a large volume of traffic to the target system with the goal of preventing genuine users from accessing a certain network service [9]-[12]. Hackers frequently use botnets to launch such attacks. Botnets are networks of host computers that have been "enslaved" by one or more attackers, known as "botmasters," in order to carry out destructive operations [13]-[16]. Because billions of susceptible internet of things (IoT) devices are deployed and connected, and because the majority of IoT devices are easily hacked and compromised, the most potent botnets have recently tended to rely on IoT devices [17]. The purpose of launching an attack differs between hackers, but there are five basic motives: financial gain, retaliation, intellectual challenge, ideological belief, and electronic warfare.

Whatever the motive, attacks must increasingly be identified and stopped before they reach their target. Among the numerous attack types, DoS and DDoS attacks are the most widespread and effective, and they have a variety of origins and formats. These attacks aim to use up the available network resources and bandwidth so that genuine users' access to the target network is limited. DDoS attacks often proceed in two steps. The first is stealth, where attackers set up their attack's launch configuration by building a network of malicious devices, or "botnet" (using DDoS tools on multiple network hosts). The second is to attack the target network by triggering the set of bots [18]. DDoS attacks can cost businesses up to $50,000. DDoS attacks are often divided into two categories. The first is the volumetric attack, commonly referred to as a flood attack. This kind of attack has two goals: the first is to overwhelm the targeted server by flooding it with traffic to exhaust its bandwidth [19]; the second is to clear all currently cached data. Attackers frequently start by using less bandwidth, focusing on particular services or applications that have an impact on the performance of other applications.

Techniques for detecting these attacks can be broadly divided into three categories [20], [21]: signature-based (misuse-based), anomaly-based, and hybrid. With the signature-based technique, previously known attacks are identified by matching attack signatures [22]. With anomaly-based approaches, attacks are identified by detecting patterns that differ from regular traffic or network activity [23]; these are effective because they can identify previously unknown attacks. Hybrid techniques integrate anomaly-based and signature-based strategies. Several strategies have been put forward in recent years to predict different attacks using machine learning techniques [24].

The following are the primary contributions of our proposed approach.
− The suggested method integrates oversampling (SMOTE) and under-sampling (Tomek links) techniques to balance the minority class data.
− A hybrid feature selection method is proposed to extract the best features with the least amount of training time and the highest detection rate.
− The support vector machine (SVM) hyperparameters are tuned using grid search to obtain the optimal hyperparameters for enhanced model performance.
− The proposed method is evaluated in terms of performance metrics and computing time by comparing it with existing techniques in the last section.
We briefly review the newest and most popular techniques for identifying DDoS attacks in this section. Maslan et al. [25] suggested a broad machine learning (ML) approach that reduced the number of features while improving DDoS attack detection performance. To rank the features and choose a subset of the top 20, this method employs embedded and filter-based feature selection approaches, specifically the F-test, the light gradient boosting algorithm, and the random forest (RF) algorithm. The proposed model was then tested on more attacks after being trained using the records for a specific type of attack [26]-[28]. The AE-SVM model is intended to quickly identify attacks. To efficiently distinguish attacks from non-attacks, the dimensionality is reduced using an autoencoder (AE) and the reduced data are used to train an SVM [29], [30]. The developed model produced good accuracy despite the unbalanced data; it also achieved an excellent accuracy level using 25 features and reduced the high rate of false positives [31]-[33].
The paper is organized into four sections. The relevant works are described in section 1, and the proposed method and the performance indicators are discussed in section 2. The experimental results and the discussion are presented in section 3, and the conclusion of the work is given in section 4.

PROPOSED METHOD
Preprocessing, model tuning, and classification are the three phases of the suggested model. During preprocessing, exploratory data analysis is used to examine and understand the data. After that, a mix of over- and under-sampling strategies is applied to balance the data. Data quality is further improved through data cleaning. The next step is to use feature scaling to normalize the range of the features before applying a transformation that encodes the categorical data numerically. Similarly, during model tuning, hybrid feature selection is used to reduce the feature space, and the hyperparameters are then tuned to enhance model performance. During classification, the best features and hyperparameters are supplied to various supervised learning techniques in order to distinguish attacks. Figure 1 depicts the diagram of the suggested work, and the following parts give a thorough analysis.
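The combination of over- and under-sampling mentioned above can be illustrated with a minimal Python sketch (not the authors' code). It assumes the usual definitions: SMOTE creates synthetic minority points by interpolating between a minority sample and one of its nearest minority neighbours, and a Tomek link is a pair of mutual nearest neighbours with different labels. The toy points, the labels, and the parameter `k` are illustrative assumptions.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_new, k=2, rng=None):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (needs >= 2 samples)."""
    rng = rng or random.Random(0)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        neighbours = sorted((p for p in minority if p is not base),
                            key=lambda p: euclidean(base, p))[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


def nearest(i, points):
    """Index of the point closest to points[i], excluding itself."""
    return min((j for j in range(len(points)) if j != i),
               key=lambda j: euclidean(points[i], points[j]))


def tomek_links(points, labels):
    """Index pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different labels."""
    links = []
    for i in range(len(points)):
        j = nearest(i, points)
        if i < j and labels[i] != labels[j] and nearest(j, points) == i:
            links.append((i, j))
    return links


# Toy data (assumption): five majority points, three minority points.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (2.9, 3.0), (3, 3), (4, 4), (3, 4)]
labels = ["majority"] * 5 + ["minority"] * 3

minority = [p for p, y in zip(points, labels) if y == "minority"]
synthetic = smote(minority, n_new=2)

# Oversample first, then look for Tomek links in the enlarged set.
all_points = points + synthetic
all_labels = labels + ["minority"] * len(synthetic)
links = tomek_links(all_points, all_labels)
```

One common choice, assumed here, is to drop the majority-class member of every Tomek link after oversampling, which cleans the class boundary that SMOTE may have blurred.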

Dataset
The suggested model was evaluated on the CICDDoS2019 dataset [1]. Compared to other datasets, CICDDoS2019 covers more forms of high-volume DDoS attacks [2], [3]. The dataset includes two different types of attacks: reflection-based and exploitation-based. Both forms of attack disguise the identity of the attacker and flood the resources of the victim with response packets by sending packets to reflector servers with the IP address of the victim as the source IP. The dataset, which includes 88 features, was captured over two days, for training and testing. There are 12 different DDoS attacks in the training set.

Pre-processing
Pre-processing is a crucial phase in the development of any ML framework; it is mostly used to organize and clean the data so that they are suitable for building and training the model. Applying a good pre-processing procedure reduces the execution time and increases the accuracy. The pre-processing phase consists of the following steps: feature scaling, data cleaning, exploratory data analysis, and transformation.

Data analysis
The data that is visible to the human eye is not necessarily accurate. Exploratory data analysis (EDA) is used to condense, display, and understand accurate data from data sources. Our knowledge of the data set depends on our ability to extract specific statistical measures and information, such as the count, mean, odds, peak, and frequency of categorical data. Once the data analysis has been completed, the features can be used for modeling. Outliers, the relationships between features, class imbalances, and other statistical measurements can be displayed using graphs, box plots, and scatter plots.
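A minimal Python sketch of such a summary is given below (not the authors' code). It computes the count, the mean, the outliers under the common 1.5 × IQR box-plot rule, and the class frequencies. The column values, the class names, and the choice of the IQR rule are assumptions made for illustration.

```python
import statistics
from collections import Counter


def summarize(values):
    """Count, mean, and box-plot style outliers (1.5 * IQR rule) of a column."""
    q1, _median, q3 = statistics.quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return {
        "count": len(values),
        "mean": statistics.fmean(values),
        "outliers": [v for v in values if v < low or v > high],
    }


# Hypothetical numeric feature and class column.
packet_lengths = [10, 12, 11, 13, 12, 11, 500]
labels = ["attack", "attack", "benign", "attack", "attack", "attack", "attack"]

summary = summarize(packet_lengths)
class_counts = Counter(labels)  # reveals the class imbalance
```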

Cleaning the data
After the data set has been balanced, the data need to be processed for proper model training. Before training the model, the data must be prepared as follows:
− Removing unnecessary features that do not affect the model
Features such as Unnamed: 0, source port, destination IP, source IP address, destination port, flow ID, timestamp, and similar HTTP are all eliminated because they are superfluous and socket related. Because these attributes take different values on different networks, packet properties are used to train the model instead. Additionally, the IP addresses of the attacker and a legitimate user can be similar. Furthermore, an ML model can become biased due to the overfitting problem caused by using socket features to train the model. Removing these redundant features leaves 80 features.

− Data cleaning and imputation
The majority of ML algorithms require samples without missing values. Noise or missing values impair the model's accuracy. In the suggested work, redundant features are removed to reduce the computational cost, while rows that contain missing, NaN, or inf values are not deleted. Since the attack records offer some basic information, the final step in processing the noisy data is the imputation of negative values, inf values, and missing values with 0.
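A minimal Python sketch of this imputation step follows (not the authors' code). It assumes that missing values appear as `None` or NaN and that every negative, infinite, or missing entry is replaced by 0 while the row itself is kept; the function name `impute_row` is hypothetical.

```python
import math


def impute_row(row, fill=0.0):
    """Replace missing (None/NaN), infinite, and negative entries with `fill`,
    keeping the row instead of deleting it."""
    cleaned = []
    for value in row:
        if value is None or math.isnan(value) or math.isinf(value) or value < 0:
            cleaned.append(fill)
        else:
            cleaned.append(value)
    return cleaned


noisy = [1.5, float("nan"), float("inf"), -3.0, None, 0.0]
clean = impute_row(noisy)
```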

Feature selection
The feature subset selection (FSS) stage of this study employed the Rao algorithm. The initialization step of the Rao algorithm generates a random initial set, which includes a teacher and a group of students that together make up the solution set. Rao borrows the mutation and crossover operators from the genetic algorithm (GA), in which a chromosome represents the features, and the chromosome is updated using crossover. Every solution in the population is viewed as an individual or chromosome (Figure 2). When a gene of the chromosome has a value of 1, the corresponding feature is regarded as selected; when it has a value of 0, the feature is not selected.

Figure 2. Chromosome
The proposed method comprises the following detailed steps:
Step 1. Randomly initialize the population; the features of each individual must differ from those of the others.
Step 2. Determine the best and worst solutions based on the classification accuracy of each feature set.
Step 3. Update the solutions based on the specified best and worst solutions and random interactions, using New_set = random_set crossover with (best_set crossover with worst_set).
Step 4. Keep the new set of features if it is better than the old best set (in terms of classification accuracy).
Step 5. Report the best set of features if the termination criteria have been met; otherwise, go to step 3.
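The steps above can be sketched in Python as follows (a minimal illustration, not the authors' implementation). Several details are assumptions because the source does not specify them: uniform crossover is used as the crossover operator, an accepted new set replaces the worst member of the population, the termination criterion is a fixed number of iterations, and `toy_fitness` is a hypothetical stand-in for the classification accuracy of a feature set.

```python
import random


def crossover(a, b, rng):
    """Uniform crossover: each gene is taken from a or b with equal probability."""
    return [x if rng.random() < 0.5 else y for x, y in zip(a, b)]


def select_features(fitness, n_features, pop_size=8, iterations=30, seed=0):
    rng = random.Random(seed)

    # Step 1: random population of distinct binary chromosomes
    # (requires 2 ** n_features >= pop_size).
    population = []
    while len(population) < pop_size:
        candidate = [rng.randint(0, 1) for _ in range(n_features)]
        if candidate not in population:
            population.append(candidate)

    # Step 2: best solution so far; the worst is re-evaluated each iteration.
    best = max(population, key=fitness)
    history = [fitness(best)]

    for _ in range(iterations):  # Step 5: fixed iteration budget (assumption)
        worst = min(population, key=fitness)
        # Step 3: New_set = random_set x (best_set x worst_set)
        random_set = rng.choice(population)
        new_set = crossover(random_set, crossover(best, worst, rng), rng)
        # Step 4: keep the new set only if it beats the old best set.
        if fitness(new_set) > fitness(best):
            population[population.index(worst)] = new_set
            best = new_set
        history.append(fitness(best))
    return best, history


def toy_fitness(chromosome):
    """Hypothetical score: rewards features 0 and 2, penalizes large subsets."""
    return 2 * chromosome[0] + 2 * chromosome[2] - 0.1 * sum(chromosome)


best_set, history = select_features(toy_fitness, n_features=6)
```

In a real run the fitness function would train and validate a classifier on the features whose gene is 1, which makes each evaluation far more expensive than in this toy example.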

Feature scaling
The values of each feature in the data set may be measured on different scales. Training the model on features of different scales adds complexity and time and occasionally results in model errors. We employ a scaling method known as Standard Scaler to prevent this. The goal of this technique is to convert the values of the data set's numerical columns to a standardized scale while preserving the differences between the ranges of values. Each training instance is scaled using the following default settings:

$$S = \frac{x - \mu}{SD} \qquad (1)$$

where S = Standard Scaler output, x = original feature value, µ = mean, SD = standard deviation of the training set.
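A minimal Python sketch of this standardization is shown below (not the authors' code). It assumes the usual z-score form (x − µ) / SD, with µ and SD estimated on the training set only and SD taken as the population standard deviation; the helper names and the example column are hypothetical.

```python
import statistics


def fit_standard_scaler(train_column):
    """Estimate the mean and (population) standard deviation on training data."""
    mu = statistics.fmean(train_column)
    sd = statistics.pstdev(train_column)
    return mu, sd


def transform(column, mu, sd):
    """Apply S = (x - mu) / SD; a constant column maps to zeros."""
    if sd == 0:
        return [0.0 for _ in column]
    return [(x - mu) / sd for x in column]


train = [2, 4, 4, 4, 5, 5, 7, 9]        # hypothetical feature column
mu, sd = fit_standard_scaler(train)      # mean 5, standard deviation 2
scaled_train = transform(train, mu, sd)
scaled_test = transform([9, 5], mu, sd)  # test data reuse the training statistics
```

Reusing the training-set statistics for the test data avoids leaking information from the test set into the model.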

Transformation
The current dataset contains features of different data types. Since ML algorithms operate on numeric values, non-numeric values must be transformed into numeric values using the "Label Encoder" technique, which assigns each data category a unique number, starting at 0.
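The encoding can be sketched in a few lines of Python (an illustration, not the authors' code). Assigning the integers in sorted order of the category names is an assumption, and the class names in the example are hypothetical.

```python
def fit_label_encoder(values):
    """Map each distinct category to an integer, starting at 0 (sorted order)."""
    return {category: code for code, category in enumerate(sorted(set(values)))}


def encode(values, mapping):
    return [mapping[v] for v in values]


labels = ["BENIGN", "Syn", "BENIGN", "UDP"]  # hypothetical class names
mapping = fit_label_encoder(labels)
encoded = encode(labels, mapping)
```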

Model tuning
This section discusses the suggested strategy of hybrid feature selection to extract the best features and hyperparameter tuning to select the optimal parameters for improved model performance. Feature selection (FS): once pre-processing is done, the pre-processed data can be used with any ML model. Feature selection is crucial for developing models with optimum performance [1], [2]; it can be divided into three categories: filter, wrapper, and embedded [3], [4]. The filter technique uses a univariate criterion to define the set of independent features.
In a multivariate, criteria-based method, important features are selected by filtering out redundant, overlapping, and highly correlated features. The selected features are then supplied to the ML model. Computing costs are lower for filter methods than for other methods. The wrapper approach operates by assessing chosen sets of features using an ML algorithm and employing a search strategy to locate a potential subset of features. This procedure is repeated until the best feature set is obtained with satisfactory outcomes. This is computationally intensive since it searches over multiple feature sets. In the embedded approach, feature selection is performed using the techniques built into the learning algorithm.
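As an illustration of a filter that removes highly correlated features, the Python sketch below (not the authors' code) keeps a feature only if its absolute Pearson correlation with every feature already kept stays below a threshold. The threshold of 0.95, the greedy keep-first order, and the feature names are assumptions.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either column is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def drop_correlated(columns, threshold=0.95):
    """Greedy filter: keep a feature only if it is not highly correlated
    with any feature kept so far."""
    kept = []
    for name, values in columns.items():
        if all(abs(pearson(values, columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Hypothetical feature columns; the second is an exact multiple of the first.
columns = {
    "pkt_len": [1, 2, 3, 4, 5],
    "pkt_len_x2": [2, 4, 6, 8, 10],
    "duration": [5, 1, 4, 2, 3],
}
selected = drop_correlated(columns)
```

Because no model is trained during the selection, this kind of filter is cheap compared with a wrapper search.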

SVM
As a supervised learning strategy, SVM can be applied to classification and regression problems using support vector classification (SVC) and support vector regression (SVR), respectively. Each data point in SVM is represented by a point in n-dimensional space, with each feature value corresponding to one coordinate. Finding the ideal hyperplane and efficiently classifying the data set are the major goals of SVM. The support vectors (SVs), which are the points nearest to the hyperplane, determine the decision boundary that helps classify the target groups.
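The search for the ideal hyperplane is commonly written as the soft-margin optimization problem below. The source does not state which formulation or kernel was used, so the linear soft-margin form, with the penalty parameter C and the slack variables ξ_i, is an assumption made here for illustration.

```latex
\min_{w,\, b,\, \xi} \ \frac{1}{2}\lVert w \rVert^{2} + C \sum_{i=1}^{n} \xi_i
\quad \text{subject to} \quad
y_i \left( w^{\top} x_i + b \right) \ge 1 - \xi_i, \qquad \xi_i \ge 0, \qquad i = 1, \dots, n
```

Here x_i is the i-th sample and y_i ∈ {−1, +1} is its class; the support vectors are the samples for which the margin constraint is active. C trades margin width against training errors and is a typical candidate for hyperparameter tuning.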

RESULTS AND DISCUSSION
This section describes the evaluation of the proposed model. The experimental setup is described first, followed by the results. The results in Table 1 demonstrate the strength of the proposed method and point to directions for future work.

Experimental setup
The proposed model was implemented in MATLAB; the experiments were conducted on a PC with the following specifications: Intel Core (TM) i5-10500H CPU @ 2.50 GHz, 16 GB RAM, and Windows 11 OS.

Performance metrics
The evaluation measures are used to gauge how well the suggested solution performs. The "CICDDoS2019" dataset [1] is used to train and test the suggested model, which combines a hybrid feature selection method with an SVM classifier. The metrics used to determine the model performance are as follows.
Precision (Prc): a measure of the ratio of successfully detected DDoS attacks to all predicted attacks; it is calculated as in (2):

$$Prc = \frac{TP}{TP + FP} \qquad (2)$$

where TP (true positives) is the number of attacks correctly detected and FP (false positives) is the number of benign samples wrongly predicted as attacks.
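Precision, under its standard definition TP / (TP + FP), can be computed from the true and predicted labels as in the following Python sketch (not the authors' code). Encoding attacks as 1 and benign traffic as 0, and returning 0.0 when no attack is predicted, are assumptions.

```python
def precision(y_true, y_pred, positive=1):
    """Fraction of predicted attacks that are real attacks: TP / (TP + FP)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp) if (tp + fp) else 0.0


y_true = [1, 1, 0, 0, 1]  # 1 = attack, 0 = benign (assumed encoding)
y_pred = [1, 0, 1, 0, 1]
prc = precision(y_true, y_pred)  # two true positives, one false positive
```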

CONCLUSION
DDoS attacks alter the size and shape of network resources to drain the resources of the targeted network. Hence, this study proposed an automatic detection method that precisely categorizes the attacks to reduce their harmful impact. To prevent sampling bias, the dataset used in this work is first balanced; then, hybrid feature selection is applied to choose the right features, followed by hyperparameter tuning to enhance model performance. Finally, the selected features and optimal hyperparameters are supplied to the supervised learning approach to distinguish between regular traffic and DDoS attacks. When the results are compared with the existing techniques, the proposed model is observed to be superior to the current ones. The proposed approach can therefore be applied on any network as a predictive model for effective DDoS attack detection.