Bulletin of Electrical Engineering and Informatics

Received Jul 4, 2022; Revised Aug 5, 2022; Accepted Aug 16, 2022

Distributed denial of service (DDoS) network attacks are used to disrupt server replies and services. They are popular because they are easy to set up and challenging to detect. DDoS attacks can be identified in network traffic in a variety of ways; however, machine learning approaches are the most effective methods for detecting and identifying them. This attack is considered to be among the most dangerous internet threats. Supervised machine learning algorithms require labeled network traffic datasets in order to function. An unsupervised method, on the other hand, uses network traffic analysis to find attacks. In this research, the K-Means clustering algorithm was developed as a semi-supervised approach for DDoS classification. The proposed algorithm is trained and tested with the CICIDS2017 dataset. After applying the proposed hybrid feature selection methods and repeated training, testing, and careful sorting of DDoS traffic through a series of experiments, the optimum two centroids were found to be DDoS and normal. The generated centroids can be used to classify network traffic. The proposed method thus succeeded in clustering the network traffic into safe and threat.


INTRODUCTION
A distributed denial of service (DDoS) attack is a form of denial of service (DoS) attack in which the attacker targets the victim by utilizing the IP address of an authorized user. DDoS attacks include SYN-flood, ACK-flood, UDP-flood, connection DDoS, DNS reflect, and ICMP flood, among others [1]. The primary goal of an attack is to prevent legitimate users from using the targeted services by overloading the underlying resources. One tactic attackers use to accomplish this is to send a barrage of fake requests through the network. A DDoS attack is launched from multiple computers simultaneously. A DoS attack is a malicious technique that interferes with the regular traffic and networking operations of a targeted server by overwhelming the infrastructure that surrounds the internet traffic flow. The rate and volume of network traffic sent to the target closely correlate with the attack's severity [2].
Since the 1990s, sophisticated intrusion detection systems have been built with the help of data mining. Data mining techniques in general, and machine learning techniques in particular, are applied in five steps: selection, preprocessing, transformation, mining, and interpretation [3], [4]. Of these, three important steps are the hardest when data mining is used to find intrusions. Three types of machine learning-based DDoS detection methods are already in use. The first group consists of supervised ML approaches, which build the detection model from network traffic datasets that have been generated and labeled. The supervised approaches have to deal with two big problems. First, producing labeled network traffic datasets takes a lot of time and computing power, and without constant model updates, supervised machine learning techniques cannot predict novel behavior, whether safe or risky. Second, supervised ML classifiers do not work as well when the network traffic contains a lot of abnormal data, which is called noise. In the second group, the unsupervised approaches, no labeled dataset is needed to build the detection model, unlike in the first group. The main problem with the unsupervised methods is that they produce many false positives. The curse of dimensionality [5] makes it hard for unsupervised methods to find attacks accurately [6].

In the third group, semi-supervised ML approaches work on both labeled and unlabeled datasets and so take advantage of both supervised and unsupervised techniques. Using supervised and unsupervised methods together can also improve accuracy and reduce the number of false positives. However, the problems of both approaches also affect semi-supervised approaches, so their components need to be combined carefully to compensate for the weaknesses of the supervised and unsupervised parts. Semi-supervised learning is a group of machine learning tasks and techniques that combine labeled and unlabeled samples for training, frequently a small number of labeled samples with a large number of unlabeled samples. It lies midway between supervised and unsupervised learning. Numerous machine learning researchers have demonstrated that integrating small amounts of labeled data with unlabeled data can dramatically improve learning accuracy compared to unsupervised learning, without the time and expense of supervised learning. In a semi-supervised learning process, a general rule is first learned from the labeled data and then applied to infer the labels of the unlabeled data. This machine learning approach has been enhanced for intrusion detection [7].
The primary goal of this work is to find an appropriate method for classifying DDoS attacks by using semi-supervised learning on a global DDoS dataset, and to find the most effective centroids for use in attack classification. The benefits of our proposed algorithm over earlier detection solutions based on supervised and unsupervised learning are as follows: i) fewer labeled samples are needed to train detection models with our proposed method than with supervised learning detection algorithms, ii) a hybrid feature selection method is proposed that uses both low variance filter and information gain ratio techniques, iii) DDoS and regular centroids are presented to assist in implementing them online for traffic classification.

The remaining sections of this paper are summarized as follows. The related works in DDoS attack detection and their limitations are introduced. Our detection model, built on a semi-supervised clustering algorithm, is presented in section 2. After the results and analyses of the experiments and a discussion of their significance, the paper concludes with recommendations for further research.

A variety of methods have been proposed for the detection of DDoS attacks, such as [8]-[10]. Techniques based on machine learning appear most frequently in published research. Table 1 (in Appendix) provides a brief overview of some recent research and developments in DDoS detection.

THE PROPOSED METHOD
At the beginning of this part, the dataset utilized in this study is described. Then, the proposed method used for intrusion detection and the proposed centroids clustering are presented, as shown in Figure 1. Finally, the results are analyzed and discussed.

Description of the dataset
Sharafaldin et al. [21] proposed CICIDS2017 to address the lack of IDS datasets that satisfy the criteria of real-world network traffic [22]. CICIDS2017 is a valid and widely used dataset [23], and has been described as the largest and most used such dataset [24]. In the current work, 20% of the CICIDS2017 dataset is used to train the machine learning algorithm. The dataset includes 84 features and contains both benign (non-attack) traffic and attack traffic. It holds a large amount of data with a high class imbalance.

K-Means clustering algorithm
K-Means is a vector quantization technique that tries to partition n observations into k clusters, where each observation belongs to the cluster with the nearest mean (also known as the cluster centroid or cluster center), which acts as the cluster's prototype [25]. Both hierarchical clustering and K-Means frequently use the canopy method as a preprocessing step [26]. Its purpose is to increase the speed of clustering operations on large datasets, where it may be impractical to apply another algorithm directly because of the volume of the data.
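The assignment and update steps behind this definition can be sketched in plain Python. This is a minimal illustration of the standard K-Means iteration, not the tooling used in the experiments; the function names, the random initialization from sample points, and the toy two-dimensional points are assumptions made for illustration.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_centroid(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def k_means(points, k, max_iter=100, seed=0):
    """Alternate assignment and mean-update steps until assignments stop changing."""
    rng = random.Random(seed)
    # Assumption: initial centroids are k distinct points drawn at random.
    centroids = [list(p) for p in rng.sample(points, k)]
    assignments = None
    for _ in range(max_iter):
        new_assignments = [nearest_centroid(p, centroids) for p in points]
        if new_assignments == assignments:
            break  # converged: no observation changed cluster
        assignments = new_assignments
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:  # an empty cluster keeps its previous centroid
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignments


if __name__ == "__main__":
    # Toy data only: two well-separated groups of points.
    toy = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
           (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
    print(k_means(toy, k=2))
```

Each observation ends up attached to the centroid with the nearest mean, which is the prototype role described in the paragraph above. On large datasets, a preprocessing step such as canopy would reduce the number of distance computations; that step is not shown here.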

Feature selection methods
One of the most common problems researchers encounter is choosing which features are most important, and thus most relevant, for detecting attacks. Feature selection is critical because it affects how well the system works. Too few features may lead to subpar detection accuracy, while too many may lead to excellent detection

Variance filter feature selection technique
The low variance filter method [27] was used to choose the features in this paper, since all of the attributes are numeric. The method excludes features with low variance, which contribute little or nothing to the model's overall performance. It involves calculating the variance of each feature as in (1):

$$\sigma^2 = \frac{1}{N} \sum_{i=1}^{N} (X_i - \mu)^2 \tag{1}$$

where $\mu$ is the average of all the values associated with the attribute, $X_i$ denotes the attribute values taken from the data, and $N$ is the total number of samples.
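As a rough illustration of the low variance filter, the sketch below computes the variance of each feature column and keeps only the columns that reach a threshold. The helper names, the dictionary-of-columns layout, and the toy values are assumptions for illustration; the variance divides by N, matching the definition of N as the total number of samples.

```python
def variance(values):
    """Population variance: mean squared deviation from the mean."""
    n = len(values)
    mu = sum(values) / n
    return sum((x - mu) ** 2 for x in values) / n


def low_variance_filter(columns, threshold):
    """Return the names of features whose variance is at least `threshold`.

    `columns` maps a feature name to its list of numeric values
    (a hypothetical layout chosen for this sketch).
    """
    return [name for name, values in columns.items()
            if variance(values) >= threshold]


if __name__ == "__main__":
    # Toy values only; they are not taken from CICIDS2017.
    toy_columns = {
        "constant_flag": [0, 0, 0, 0],     # variance 0, filtered out
        "flow_duration": [1, 5, 9, 13],    # variance 20, kept
    }
    print(low_variance_filter(toy_columns, threshold=3.0))
```

A constant column carries no information for separating classes, which is why such a filter removes it before any clustering is attempted.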

Information gain
Because of its usefulness and importance in detecting a class type, the information gain ratio (IGR) [28] is also employed as a weight for attributes in this work, as in (2):

$$\mathrm{IGR}(Y, A_j) = \frac{H(Y) - H(Y \mid A_j)}{H(A_j)} \tag{2}$$

where $Y$ represents the class and $A_j$ is the $j$-th attribute. The entropy function $H(\cdot)$ is defined as follows:

$$H(X) = -\sum_{i} P(x_i) \log_2 P(x_i) \tag{3}$$

where $P(\cdot)$ represents the probability operator and $i$ is an index over the probabilities.
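The weighting can be illustrated with a short sketch for discrete attributes. It assumes the common gain ratio form, in which the reduction in class entropy is divided by the entropy of the attribute itself; the function names and toy values are hypothetical, and numeric features would first need to be discretized, a step omitted here.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (base 2) of the empirical distribution of `values`."""
    n = len(values)
    return -sum((c / n) * log2(c / n) for c in Counter(values).values())


def conditional_entropy(labels, attribute):
    """Entropy of the class labels once the attribute value is known."""
    n = len(labels)
    groups = {}
    for a, y in zip(attribute, labels):
        groups.setdefault(a, []).append(y)
    return sum(len(g) / n * entropy(g) for g in groups.values())


def information_gain_ratio(labels, attribute):
    """Information gain normalized by the attribute's own entropy."""
    split_info = entropy(attribute)
    if split_info == 0:
        return 0.0  # a constant attribute cannot separate the classes
    gain = entropy(labels) - conditional_entropy(labels, attribute)
    return gain / split_info


if __name__ == "__main__":
    # Toy labels and attribute values, for illustration only.
    y = ["ddos", "ddos", "normal", "normal"]
    print(information_gain_ratio(y, ["high", "high", "low", "low"]))
    print(information_gain_ratio(y, ["a", "b", "a", "b"]))
```

An attribute that splits the classes perfectly receives the highest weight, while one that is independent of the class receives a weight of zero, so a minimum-weight threshold keeps only the informative attributes.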

Proposed centroids clustering
The proposed method uses semi-supervised K-Means clustering to generate multiple centroids that can be used to classify traffic as either safe or malicious. Starting with the selected CICIDS2017 dataset, we use the K-Means algorithm to produce semi-supervised centroids for detecting DDoS attacks. The semi-supervised idea is to use a small amount of labeled data to label larger datasets. Figure 1 shows the semi-supervised framework diagram. The main processes in the proposed framework are as follows:
a. Select features using the hybrid feature selection algorithms. In this work, the variance scores and the information gain were used to find the best list of features. Features with a variance score of less than 3 were excluded as uninformative. In addition, features falling below the minimum information gain weight of 0.6 were discarded, which yields the 15 selected features listed in Table 2. Note that the variance values across the data range from 0 (Bwd PSH Flags) to 9.99E+14 (Fwd IAT Total).
b. Use the K-Means algorithm to generate the appropriate centroids. 20% of the CICIDS2017 dataset was used to train the proposed method to generate centroids, and the remaining 80% of the dataset was used to test the generated centroids.
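One way to read the semi-supervised step is sketched below: centroids obtained from clustering are named using a small labeled subset, and unseen flows are then classified by their nearest centroid. The majority-vote naming rule, the function names, and the toy values are assumptions made for illustration; the text does not state exactly how the centroids receive their labels.

```python
from collections import Counter


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest(point, centroids):
    """Index of the centroid closest to the given point."""
    return min(range(len(centroids)),
               key=lambda c: squared_distance(point, centroids[c]))


def label_centroids(centroids, labeled_points, labels):
    """Assumed rule: each centroid takes the majority label of the labeled
    samples that fall nearest to it (None if it attracts no labeled sample)."""
    votes = [Counter() for _ in centroids]
    for point, label in zip(labeled_points, labels):
        votes[nearest(point, centroids)][label] += 1
    return [v.most_common(1)[0][0] if v else None for v in votes]


def classify(point, centroids, centroid_labels):
    """Label a new flow with the label of its nearest centroid."""
    return centroid_labels[nearest(point, centroids)]


def accuracy(predicted, actual):
    """Fraction of predictions that match the true labels."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


if __name__ == "__main__":
    # Toy centroids and flows; real centroids would come from K-Means
    # trained on the 20% split and be tested on the remaining 80%.
    centroids = [[0.0, 0.0], [10.0, 10.0]]
    names = label_centroids(centroids,
                            [(0, 1), (1, 0), (9, 10)],
                            ["normal", "normal", "ddos"])
    test_flows = [(1, 1), (8, 9)]
    predictions = [classify(f, centroids, names) for f in test_flows]
    print(names, predictions, accuracy(predictions, ["normal", "ddos"]))
```

Only the few samples used for naming need labels, which is the sense in which a small labeled set is used to label a much larger one.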

RESULTS AND EVALUATION
The detection performance of the semi-supervised K-Means algorithm was measured in this experiment. Clustering and feature selection by information gain were performed with WEKA. Accuracy measures the algorithm's ability to detect attacks in traffic that contains both benign and attack flows. The accuracy is computed according to (4).
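For reference, accuracy is conventionally defined as the share of correctly classified flows. The form below is the standard definition and is assumed to correspond to (4); the symbols TP, TN, FP, and FN (true positives, true negatives, false positives, and false negatives, with DDoS taken as the positive class) are introduced here for illustration and are not defined in the text.

```latex
\[
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
\]
```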
Accuracy also measures the performance of the detection engine. It indicates the machine's ability to predict traffic according to its actual condition, in other words, its capacity to classify a class correctly. Figure 2 and Table 3 present the values of the centroids generated by the proposed method, which provides two optimum centroids to classify traffic into normal and DDoS attack. Table 4 displays the K-Means accuracy performance. The results in Table 4 show that test 1 was the best choice, achieving its accuracy with two centroids, one labeled normal and the other DDoS. Figure 3 presents a performance comparison between the proposed K-Means and Canopy.

CONCLUSION AND FUTURE WORK
This paper presents an algorithm to classify DDoS attacks using a semi-supervised machine learning method. It starts with unlabeled traffic statistics gathered from three parts of the victim-end defense, which is the web server. The proposed hybrid feature selection techniques reduce the dataset from 84 features to 15, which are used for the final labeling of traffic flows in the proposed framework. The K-Means clustering algorithm groups the unlabeled data. The scheme used a representative part of the benchmark CICIDS2017 dataset with new normal and attack centroids to test how well labels were assigned. In the future, we want to find better ways to label traffic online using voting, add more ML algorithms to the clustering and classification processes, and put the proposed four centroids into the online detection framework.

ACKNOWLEDGEMENTS
The authors would like to thank the University of Information Technology and Communications and Mustansiriyah University (https://uomustansiriyah.edu.iq/), Baghdad, Iraq, for their support in the present work.

APPENDIX
This study proposes FloodDetector, an effective architecture for detecting known and unknown flooding assaults in SDN. It is a controller-agnostic SDN application that employs two machine learning classifiers to detect both known and unknown flooding attacks: K-nearest neighbor (K-NN) and artificial neural network (ANN).

8 [18] Deep neural networks: To detect intelligent systems, this study proposes the use of machine learning frameworks. The study uses deep learning to distinguish between benign data exchange and harmful data traffic attacks.

9 [19] The N-Gram line generation, feature selection algorithm, and SVM algorithm: This paper offers a network traffic flow-based approach for mobile malware detection that treats each HTTP flow as a document and analyzes HTTP flow requests using natural language processing string analysis. An effective malware detection model is created using the N-Gram line generation, feature selection method, and SVM algorithm.

10 [20] DBSCAN, SVM, and Random Forest: In this paper, a hybrid supervised/unsupervised strategy is proposed. First, the clustering algorithm separates the anomalous traffic from the regular data by using numerous flow-based criteria. After determining the statistical characteristics each cluster shares, they can be assigned names using a categorization method. The authors conduct an evaluation of the proposed method by processing vast amounts of data.


ISSN: 2302-9285
Bulletin of Electr Eng & Inf, Vol. 11, No. 6, December 2022: 3570-3576

accuracy at the expense of an overly complex system that consumes more resources. This work employed two attractive feature selection techniques; Figure 1 presents the main diagram of the proposed framework.

Figure 2. Distributed traffic of proposed centroids

Figure 3. Performance of proposed method and canopy algorithm

c. Compare the results with the accuracy scores and select the best result.

Table 2. Feature scores using information gain

K-Means clustering-based semi-supervised for DDoS attacks classification … (Mahdi Nsaif Jasim)

Table 4. Accuracy of K-Means and canopy algorithms