An implementation of real-time detection of cross-site scripting attacks on cloud-based web applications using deep learning

Received Nov 12, 2020; Revised Mar 17, 2021; Accepted Aug 16, 2021

Cross-site scripting has caused considerable harm to the economy and to individual privacy. Deep learning consists of three primary learning approaches and is made up of numerous strata of artificial neural networks; each layer contains activation functions that can produce non-linear outputs. This study proposes a secure framework that can be used to achieve real-time detection and prevention of cross-site scripting attacks in cloud-based web applications, using deep learning, with a high level of accuracy. This project work utilized five phases: extraction of cross-site scripting payloads and benign user inputs, feature engineering, generation of datasets, deep learning modeling, and a classification filter for malicious cross-site scripting queries. A web application was then developed with the deep learning model embedded in the backend and hosted on the cloud. In this work, a model was developed to detect cross-site scripting attacks using the multi-layer perceptron deep learning model, after a comparative analysis of its performance against three other deep learning models: deep belief network, ensemble, and long short-term memory. A performance evaluation of the proposed multi-layer perceptron model obtained an accuracy of 99.47%, which shows a high level of accuracy in detecting cross-site scripting attacks.

another model known as a meta classifier. A precision rate of 84.9% and a recall rate of 85.1% were the best results obtained from their approach.
In Shailendra et al. [17], the authors proposed an XSS vulnerability detection technique centered on machine learning algorithms, applicable to social networking services (SNSs). They described and separated XSS features into three categories: standard URL features, hypertext markup language (HTML) tag features, and SNS features. Their method collected 1,000 SNS web pages, consisting of 400 benign web pages and 600 malicious web pages, to create the datasets and features. Finally, machine learning algorithms were used in the training phase, and the outcomes generated were graded; ten machine learning algorithms were utilized to conduct the experiments. The best results derived from the experiment were a 97.2% accuracy rate and an 87% false-positive rate. However, it was observed that the dataset utilized was too small.
In Wang et al. [18], a machine learning approach to detecting cross-site scripting in social networks was discussed. The authors manually extracted the characteristic features of the web pages and categorized them into four distinct groups, each comprising multiple attributes, three of which are associated with an online social network. Adaptive boosting (AdaBoost) with the ADTree algorithm and 10-fold cross-validation was utilized; using AdaBoost, the highest recorded values of 0.941 precision and 0.939 recall were realized. However, these results came with a relatively high false-positive rate of 4.20% and a shallow detection rate.
In Fang et al. [19], an approach called DeepXSS was used for the detection of cross-site scripting (XSS) based on deep learning. This approach, which entails decoding, generalization, and tokenization techniques, was examined. First, word2vec was used to extract features from the XSS payloads; these captured word-order information, and each payload was linked to a feature vector. The detection model was trained and tested using long short-term memory (LSTM) recurrent neural networks. With this approach, the experimental results showed a precision rate of 99.5% with a recall rate of 97.9% and an F1 score of 98.7%. Nonetheless, the web pages contained JavaScript and HTML code with largely non-standard encodings; hence, the adoption of word vectors made the training process time-consuming and tedious. Their research was also not extended to real-time detection.
Goswami et al. [20] examined a method of XSS attack detection based on unsupervised attribute clustering, using the Monte Carlo cross-entropy algorithm for rank aggregation. This method classified clusters into two classes, namely malicious scripts and benign scripts. It initially applied a degree-of-heterogeneity test for XSS vulnerabilities on the client: if the test exceeded a given threshold value, the request was rejected; otherwise, the request was forwarded to the proxy for additional processing. One downside of this approach is the need to identify client and proxy servers, which limits flexibility in its mode of operation. Furthermore, the method interferes with system usage.

MATERIALS AND METHOD
3.1. Deep learning methods
The research method, the design and development of the deep learning model, and the tools employed in building the cloud-hosted web application are all discussed in this section. The primary aim, as specified earlier, is the development and implementation of a prototype model that can perform real-time detection of XSS attacks for cloud-based web applications using deep learning, to ensure a secured system. This section offers an overview of the proposed method and the adapted framework used to implement the model, a detailed overview of the model architecture, and finally, the system's design and analysis. For this research, the deep belief network (DBN), MLP, and LSTM deep learning models are adopted.
- LSTM: has distinct units in the recurrent hidden layer referred to as memory blocks [21]. These decide when to let input enter a neuron, what to remember from the previous timestamp, and what output to pass on to the next timestamp [12], [21].
- DBN: is made up of a visible layer that corresponds to the inputs and many hidden layers that correspond to latent variables [22]. DBN training is conducted layer by layer; every layer is treated as a restricted Boltzmann machine (RBM), trained on top of the previously trained layer [12], [23]. A DBN may also be extended to classification problems under supervised learning [22]. DBN models perform similar NLP-related tasks, for example text classification; they can learn complex features within hidden layers and compose more compound functions, which are then used for data representation [24]. DBN uses RBMs for pre-training [23].
- Multi-layer perceptron (MLP): a feed-forward neural network with many hidden layers (hence multi-layer). Hidden layers in an MLP are fully connected, i.e., each node in a layer is connected, with a specific weight, to every node of the adjacent layer [25], [26].

Neural network notations
The deep learning representations and the forward and backward propagation are shown below. The following neural network notations are used [27]. Objects:
- X ∈ R^(n_x × m) is the input matrix
- x^(i) ∈ R^(n_x) is the ith example, represented as a column vector
- Y ∈ R^(n_y × m) is the label matrix
- y^(i) ∈ R^(n_y) is the output label for the ith example
- W^[l] ∈ R^((number of units in the next layer) × (number of units in the previous layer)) is the weight matrix; the superscript [l] indicates the layer
- b^[l] ∈ R^(number of units in the next layer) is the bias vector in the lth layer
- ŷ ∈ R^(n_y) is the predicted output vector. It can also be denoted a^[L], where L is the network's number of layers.

Deep learning representations
Forward propagation means propagating the computations of all neurons within all layers, moving from left to right. The process starts with feeding the feature vectors/tensors into the input layer and ends with the final prediction generated by the output layer. Forward-pass computations occur during training, to evaluate the objective/loss function under the current network parameter settings in each iteration, and during inference (prediction after training), when the network is applied to new/unseen data. Forward propagation sequentially calculates and stores intermediate variables within the neural network's computational graph, progressing from the input to the output layer [27]. Backward propagation is a step executed during training to compute the gradient of the objective/loss function with respect to the network's parameters, so that they can be updated during a single iteration of some form of gradient descent. Viewing the neural network as a computational graph, it is so named because it computes the derivatives of the objective/loss function at the output layer and propagates them back towards the input layer, computing gradients and updating all parameters in all layers. Backpropagation has played an essential role in ANNs since 1982 and is a very effective gradient-descent method [28]. It calculates and stores the gradients of intermediate variables and parameters sequentially within the neural network, in reversed order [27]. In the computational graph: a. Nodes represent inputs, activations, or outputs b. Edges represent weights or biases
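The forward and backward passes described above can be sketched for a one-hidden-layer network, following the W^[l], b^[l] notation. This is a minimal illustrative reconstruction with assumed shapes and random data, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_h, n_y, m = 4, 3, 1, 5           # input units, hidden units, outputs, examples
X = rng.standard_normal((n_x, m))        # X in R^(n_x x m)
Y = rng.integers(0, 2, (n_y, m)).astype(float)

W1 = rng.standard_normal((n_h, n_x)) * 0.01
b1 = np.zeros((n_h, 1))
W2 = rng.standard_normal((n_y, n_h)) * 0.01
b2 = np.zeros((n_y, 1))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Forward pass: left to right, caching intermediate variables.
Z1 = W1 @ X + b1
A1 = np.tanh(Z1)
Z2 = W2 @ A1 + b2
A2 = sigmoid(Z2)                         # y-hat = a^[L]

# Backward pass: right to left, gradients of the binary cross-entropy loss.
dZ2 = A2 - Y
dW2 = dZ2 @ A1.T / m
db2 = dZ2.sum(axis=1, keepdims=True) / m
dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)       # tanh'(z) = 1 - tanh(z)^2
dW1 = dZ1 @ X.T / m
db1 = dZ1.sum(axis=1, keepdims=True) / m

# One gradient-descent parameter update.
lr = 0.1
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
```

In a deep network the same two sweeps repeat across all layers, with the cached forward values reused by the backward pass.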

Metrics for model performance
The accuracy, ROC-AUC, precision, recall, and F1-score measures are used as evaluation metrics.
- Accuracy: the overall effectiveness and efficacy of the classification model [29], [30].
- ROC-AUC: the ROC is a probability curve, and the AUC represents the degree or measure of separability; together they indicate how well a model can differentiate among classes [30].
- Precision: the ratio of predicted positives that are actually positive [30].
- Recall: also known as "sensitivity" or "true positive rate," the proportion of actual positive occurrences that are predicted positive [29], [30].
- F-measure: the harmonic mean of precision and recall [29], [30].
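These metrics can be computed directly with scikit-learn; the labels and scores below are illustrative assumptions, not results from this study.

```python
import numpy as np
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

# Illustrative ground truth (1 = XSS attack, 0 = benign) and model scores.
y_true   = np.array([1, 0, 1, 1, 0, 0, 1, 0])
y_scores = np.array([0.9, 0.2, 0.8, 0.4, 0.1, 0.3, 0.7, 0.6])
y_pred   = (y_scores >= 0.5).astype(int)   # threshold the scores at 0.5

acc  = accuracy_score(y_true, y_pred)
prec = precision_score(y_true, y_pred)     # predicted positives that are truly positive
rec  = recall_score(y_true, y_pred)        # actual positives predicted positive
f1   = f1_score(y_true, y_pred)            # harmonic mean of precision and recall
auc  = roc_auc_score(y_true, y_scores)     # separability across all thresholds
```

Note that ROC-AUC is computed from the raw scores rather than the thresholded labels, since it sweeps over all possible thresholds.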

Adapted model architecture
The adapted model architecture is from [17], titled "XSSClassifier: An efficient XSS attack detection approach based on machine learning classifier on social networking services (SNSs)." Their framework relies on machine learning classifiers to classify web pages into two classes, namely XSS or non-XSS. It covers four distinct steps: collection of webpages, feature engineering, generation of datasets, and machine learning classification. The overview is depicted in Figure 1. In their method, XSS attack detection was performed using three feature groups: URLs, websites, and SNSs. A dataset was prepared by collecting about 1,000 SNS web pages and extracting the features from those web pages. Ten different machine learning classifiers were trained on the dataset to categorize web pages into two groups, XSS or non-XSS, and the performance was validated using precision and the F1 score.

Proposed model architecture
The proposed model architecture, adapted from [17], is depicted in Figure 2. Their model architecture comprises four key steps: feature engineering, collection of webpages, generation of datasets, and machine learning classification. The proposed framework utilized three of the phases from the adapted framework by [17]: feature engineering, generation of datasets, and classification. The algorithms above would be used individually to train models for XSS input detection; based on the evaluation on the testing dataset, the best-performing model would be picked for use in the web application. e. Ensemble learning: this involves combining the DBN, LSTM, and MLP models to derive another model. f. Classification filter for malicious XSS queries: for an XSS injection to be successful, a web form containing the relevant XSS input must be successfully submitted to the web server. The filter is a backend program that scans web forms before they are submitted to the server, checking for possible XSS inputs using the generated machine learning model.
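The classification filter in step f could be sketched as a backend function that vectorizes submitted form input and queries a trained classifier. Everything below is a stand-in sketch: the toy corpus, the TF-IDF/logistic-regression pipeline, and the `filter_form_input` helper are assumptions; in the actual system the trained deep learning model would be loaded instead.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy training data standing in for the real payload/benign corpora (assumption).
payloads = ["<script>alert(1)</script>", "<img src=x onerror=alert(1)>",
            "javascript:alert(document.cookie)", "<svg onload=alert(1)>"]
benign   = ["hello world", "please update my profile", "order number 12345",
            "what is the weather today"]
X = payloads + benign
y = [1] * len(payloads) + [0] * len(benign)

# Character n-grams are robust to obfuscated markup in short queries.
clf = make_pipeline(TfidfVectorizer(analyzer="char", ngram_range=(1, 3)),
                    LogisticRegression())
clf.fit(X, y)

def filter_form_input(text: str) -> bool:
    """Return True if the input looks like a malicious XSS query and the
    form submission should be rejected before it reaches the server."""
    return bool(clf.predict([text])[0] == 1)
```

A real deployment would load the serialized deep learning model once at startup and call the filter on every form field before forwarding the request.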

The data used to produce the results
The payload data were extracted from online repositories, through web crawlers built to extract payload data from logged XSS websites, and through manual curation of the dataset. The sources of data include the PortSwigger XSS cheat sheet, the OWASP XSS attack cheat sheets, and a GitHub repository. The training and test datasets were obtained from these three sources. The raw payload data was saved in txt format, while the raw benign data was saved in csv format. The model's training was done on the Google Colab platform, and the code was prepared in Jupyter notebooks.
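Loading and labeling the two raw formats (payloads in txt, one per line; benign inputs in csv) could look like the sketch below. The file names and sample rows are hypothetical stand-ins for the downloaded data.

```python
import csv
import os
import tempfile

# Hypothetical file names mirroring the formats described above.
tmp = tempfile.mkdtemp()
payload_path = os.path.join(tmp, "xss_payloads.txt")
benign_path  = os.path.join(tmp, "benign.csv")

# Write a few illustrative rows standing in for the real corpora (assumption).
with open(payload_path, "w") as f:
    f.write("<script>alert(1)</script>\n<svg onload=alert(1)>\n")
with open(benign_path, "w", newline="") as f:
    csv.writer(f).writerows([["query"], ["hello world"], ["order 12345"]])

# Load both files and label: 1 = XSS payload, 0 = benign.
with open(payload_path) as f:
    payloads = [line.strip() for line in f if line.strip()]
with open(benign_path) as f:
    reader = csv.reader(f)
    next(reader)                      # skip the header row
    benign = [row[0] for row in reader]

texts  = payloads + benign
labels = [1] * len(payloads) + [0] * len(benign)
```

The combined `texts`/`labels` lists are then ready for feature engineering and the train/test split.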

Proposed system architecture
The proposed system's structure is founded on a three-tier layered architecture comprising the presentation layer, logic layer, and data layer, with all layers contributing to the system's overall workability. a. Presentation tier: this is where the web application of the system runs. It provides an interface for users to interact with the system; from this layer, users can register and access other functions of the system. b. Logic tier: this layer is regarded as the most important because all the registered functionalities are carried out here. It comprises the application modules (the filtering program, the XSS model API, and cloud-data create, read, update, and delete (CRUD) operations). c. Data tier: this is where data is stored and retrieved. It includes the database (user data, cloud storage, and API event logs).

RESULTS AND DISCUSSION
4.1. Model evaluation
After the extraction and gathering of XSS payloads, the dataset for training and testing was assembled through downloads, web crawlers, and curation. In total, 10,600 payload samples and 5,000 benign samples were extracted. The algorithms (DBN, LSTM, and MLP) were used individually to train XSS input detection models, and the best-performing model was to be picked for use in the web application based on the testing-dataset evaluation. The performance metrics used were accuracy, ROC-AUC, precision, recall, and F1-score. The modeling was based on two feature types: a. TFIDF: term frequency-inverse document frequency, computed as a bag-of-words model with each tokenized query's words and their weighted frequencies as vectors. b. Word embedding: the word embedding model used was FastText by Facebook Research.
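The TFIDF feature described in (a) can be sketched with scikit-learn's `TfidfVectorizer`: each tokenized query becomes a bag of words whose weighted frequencies form the feature vector. The queries and the whitespace tokenizer below are illustrative assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Illustrative queries (assumptions); the real corpus had 10,600 payload
# samples and 5,000 benign samples.
queries = [
    "<script>alert(1)</script>",
    "hello world",
    "<img src=x onerror=alert(1)>",
    "search for shoes",
]

# Treat each whitespace-separated token as a word; TF-IDF weights the
# token frequencies to produce one feature vector per query.
vectorizer = TfidfVectorizer(token_pattern=r"\S+")
X = vectorizer.fit_transform(queries)    # sparse matrix: queries x vocabulary
```

The resulting sparse matrix (one row per query, one column per vocabulary token) is what the deep learning models consume as TFIDF features.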
The model parameters used are discussed below. a. Multi-layer perceptron: two dense layers with 512 output units each and ReLU as the activation function were used, and the number of epochs was three at a batch size of 500 during training. A third dense layer producing the neural network's final output was added, with sigmoid as its activation function. Dropout at 0.2 was applied after each of the first two layers. The loss function was binary cross-entropy, optimized with the Adam optimizer. b. LSTM: an embedding layer with 128 dimensions and the same number of features as the dataset was used with the sequence model. Dropout of 0.2 was also applied throughout the network, and the final output layer was activated using sigmoid. The loss function was binary cross-entropy, optimized with the Adam optimizer. The number of epochs was three, at a batch size of 500, for each training/backpropagation run.

dense layers and activation functions used in the MLP
were also applied here. The loss function was binary cross-entropy, optimized with the Adam optimizer. The number of epochs was three, at a batch size of 500, for each training run. d. Ensemble: a hard-vote ensemble method was used for ensembling in this work, with the final output derived from the votes of all three models for each prediction.
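The hard-vote ensembling step in (d) amounts to taking the majority label across the three models for each query. This is a small sketch; the per-model predictions are hypothetical.

```python
import numpy as np

def hard_vote(*model_preds):
    """Hard-voting ensemble: each model casts a 0/1 vote per query and
    the majority label wins (with three models, two votes suffice)."""
    votes = np.vstack(model_preds)            # shape: (n_models, n_queries)
    return (votes.mean(axis=0) >= 0.5).astype(int)

# Hypothetical per-query predictions from the DBN, LSTM, and MLP models.
dbn  = np.array([1, 0, 1, 0])
lstm = np.array([1, 1, 1, 1])
mlp  = np.array([1, 0, 0, 0])

ensemble_pred = hard_vote(dbn, lstm, mlp)     # → [1, 0, 1, 0]
```

Hard voting needs only the discrete labels; soft voting, by contrast, would average the models' predicted probabilities before thresholding.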

Model validation
Model validation is discussed as follows:
- Training set: 70%; testing set: 30%. This split was used for both TFIDF and word embedding features.
- Performance metrics used were accuracy, ROC-AUC, precision, recall, and F1-score.
- The model's training was done on the Google Colab platform, and the code was prepared in Jupyter notebooks.
The dataset's visualization report is shown in Figure 3: 10,600 attack samples and 5,000 benign samples. Table 1 presents the metric evaluations using TFIDF as features for the four deep learning models utilized in this study, and Table 2 presents the metric evaluations using word embedding as features for the four deep learning models, respectively. Figures 4 and 5 show the classification reports in the Jupyter notebook for the MLP model using TFIDF and word embeddings.
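The 70/30 split can be reproduced with scikit-learn's `train_test_split`; the stand-in corpus below is an assumption (the real dataset had 10,600 attack and 5,000 benign samples).

```python
from sklearn.model_selection import train_test_split

# Illustrative stand-in corpus: 60 attack (1) and 40 benign (0) queries.
texts  = [f"query {i}" for i in range(100)]
labels = [1] * 60 + [0] * 40

# 70% training / 30% testing, stratified so both classes keep their
# proportions in each split; the same split is reused for both feature types.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.30, random_state=42, stratify=labels)
```

Stratification matters here because the attack/benign classes are imbalanced (10,600 vs. 5,000); without it a random split could skew the test-set class ratio.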

Experiments on the system
Malicious scripts were injected into the web application through its input forms. The deep learning model was integrated at the back end of the web application to filter the inputs against malicious XSS queries. An API serving the trained deep learning model was built, and the API call was made from the frontend using JavaScript. The JavaScript code that filters user input against malicious queries, together with the web application, is depicted in Figure 6.

Discussion
LSTM performed the poorest using both TFIDF and word embedding features. LSTM had a precision of 0 for benign queries and 1 for attack queries, because it simply labeled the entire test set as attacks (the test set being used only for evaluating the model). Combined with its poor recall, this would lead to very many false positives in practical scenarios, as indicated by an F1-score below 0.5; incorrect data labeling for LSTM explains its poor performance. Using the F1-score as the metric, MLP performed best for both TFIDF and word embedding features, with word embedding performing slightly better than TFIDF under the MLP model. MLP also achieved a very high accuracy of 99.4% on the test set, with recall and precision for both attack and benign queries above 0.9, which should reduce false positives in practical scenarios, as indicated by an F1-score above 0.5 and very close to 1, at 0.993. A brief description of what the performance metrics imply follows. The performance metrics used were accuracy, ROC-AUC, precision, recall, and F1-score. a. Accuracy: this measures, as a percentage, how many predictions each model got correct. It is the easiest method of measuring performance; however, if the data is imbalanced or there is bias in classification, the accuracy metric alone is not informative. b. For example, if there are 90 normal queries and 10 XSS attack queries, and the model classifies the whole set as normal, it achieves 90% accuracy, yet it has failed to identify a single attack query, due to possible bias or an inability to handle data imbalance. Hence accuracy is not always the best validation metric. c. Precision: following the example above, the precision for normal-query prediction would be a ratio of 0.9; for attack queries, however, the precision would be 0.
Precision, unlike accuracy, measures how well the model did for each of the classes rather than an aggregate over all classes, which tends to lead to fewer false positives and less identification bias in the model. d. Recall: still following the example, recall measures how many attack queries were predicted as attacks, and likewise for the normal queries. There were 90 normal queries and all 90 were predicted correctly, so the recall for the normal class would be 1. There were ten attack queries and none were predicted correctly, so the recall for the attack class would be 0. This metric is also known as sensitivity. e. F1-score: the F1-score combines precision and recall using the harmonic mean to generate a single evaluation metric. A ratio above 0.5 and closer to 1 means the model has better precision and recall, and thus tends to be more robust to data bias and false positives, and vice versa. f. ROC-AUC: this combines the receiver operating characteristic curve and the area under that curve, which measures how well a model can distinguish between classes using random samples at different thresholds. A ratio above 0.5 and close to 1 indicates better performance; vice versa indicates poor performance. It captures the sensitivity, specificity, and false-positive rates of the model in its final evaluation.
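The worked example above (90 normal queries, 10 attacks, a degenerate model that predicts "normal" for everything) can be checked numerically:

```python
import numpy as np
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# 90 normal queries (0), 10 XSS attacks (1); the model predicts all normal.
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)

acc         = accuracy_score(y_true, y_pred)                   # 0.90
prec_normal = precision_score(y_true, y_pred, pos_label=0)     # 0.90
rec_normal  = recall_score(y_true, y_pred, pos_label=0)        # 1.00
rec_attack  = recall_score(y_true, y_pred, pos_label=1)        # 0.00
f1_attack   = f1_score(y_true, y_pred, pos_label=1,
                       zero_division=0)                        # 0.00
```

The 90% accuracy looks respectable, yet the per-class attack recall and F1 of 0 expose that the model detects no attacks at all, which is exactly why per-class metrics matter for imbalanced security data.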
In addition to the explanation above regarding the metrics generated by each model, MLP outperformed all other models on all metrics. Its precision and recall values for predicting attacks are very promising, and the high ROC-AUC score of 0.99 shows it is not prone to false positives or bias from random samples. LSTM's predominant success on sequence problems such as speech and network flows indicates it is best suited to such tasks, which typically require larger data. DBN, meanwhile, adds an extra layer of feature extraction using restricted Boltzmann machines whose output is fed to another network layer, so it is not a plain stack of layers like a CNN or MLP; its performance is therefore subject to the other network's parameters and the quality of the RBM feature extraction. MLP, on the other hand, performs well due to its simple network layers and lesser need for a large dataset compared with "more-advanced" deep learning models, so it trains and fits this data size much faster and more effectively than algorithms that require a larger set of activation functions and parameters. Also, in the spirit of the no-free-lunch theorem, the parameters used for MLP here, as discussed above, may simply have been more optimal for this data, while the parameters used for the other models were less effective. For example, when a softmax loss function was used during training, the deep learning models' accuracy dropped drastically, below 30%; when it was changed to binary cross-entropy, there was a significant boost. So the chosen parameters of a model also drive poor or high performance.

CONCLUSION
In this work, a model was developed to detect XSS attacks using the MLP deep learning model, following a comparative review of its output against three other deep learning models: ensemble, LSTM, and DBN. The research led to the creation of Williams Cloud, a web-based proof-of-concept framework that incorporates an MLP deep learning model to detect and handle XSS attack scripts injected into a web application. The results of training the four algorithms showed that MLP performed best in detecting XSS attacks based on the evaluation metrics: the MLP model achieved 98.99 percent accuracy using TFIDF as a feature and 99.47 percent using word embedding as a feature. This work contributes to knowledge that can be further developed and adopted to counter and prevent other web-based attacks. Results obtained from users' evaluation will also be made available for further research.