Deep learning-based methods for anomaly detection in video surveillance: a review

ABSTRACT


INTRODUCTION
The large-scale deployment of surveillance camera systems in public areas in recent years has increased the demand for systems that can automatically analyze video surveillance streams in real time. Automatically detecting abnormal events in complicated and crowded scenes is a challenging task in intelligent video surveillance, and this problem has attracted significant research interest in computer vision in recent years. In this work, we aim to present and evaluate anomaly detection approaches, in particular deep learning-based methods, that automatically detect and localize anomalous events, a field in which knowledge is continuously evolving. This section introduces the study by covering the research topic, background information, and research objectives, and finally outlines the paper structure.
Anomaly detection in video is the task of recognizing frames in a video sequence that reflect occurrences differing significantly from the normal, that is, identifying unusual incidents such as fires, car accidents, escapes, stampedes, or fighting, a capability that can be quite useful [1], [2]. The detection and localization of anomalies is one of the most difficult tasks in video processing because the definition of "anomaly" can have some degree of ambiguity depending on context. Visual behaviors are complicated and diverse in an unconstrained world; complicated backgrounds, moving cameras, occlusion, shadows, and lighting are all challenges to overcome. In general, an occurrence is regarded as an "anomaly" if it occurs infrequently or unexpectedly [3], [4].
Anomaly detection is a growing field of research in its own right. Although various methods have been proposed to address this issue, they all have limitations. In particular, the majority of approaches currently in use require a labeled dataset containing a collection of normal events [1]. This requirement restricts their field of use because it prevents the system from being continuously retrained without human intervention.
Various approaches have been proposed. Early literature relies on trajectory-based techniques [5], [6]. These techniques use visual tracking to determine the targets' trajectories, and a model is learned to describe normal actions. An anomaly is then defined as an activity whose trajectory differs significantly from the learned model. However, these techniques are ineffective for complex and crowded scenes due to their high time complexity and the occlusion caused by moving objects [7]. Therefore, more recently, non-object-centered unsupervised approaches have become more common. These approaches tackle anomaly identification by learning representative activity patterns from the behavior-related characteristics of objects and humans in spatial and temporal contexts. The size, gradient, speed, and direction of the targets in the image are typically taken as behavioral attributes and are expressed with low-level representations such as 3D spatio-temporal gradients, the histogram of optical flow (HOF), the histogram of oriented gradients (HOG) [8], and dense spatio-temporal interest points (dense STIPs). These methods have an advantage over trajectory-based methods in that they work at the pixel level, which makes them more robust in complicated scenes [7].
Dictionary learning is another approach proposed for anomalous event detection: it builds a dictionary of typical events and labels as abnormal the events that the dictionary cannot adequately represent. Low-level features such as 3D gradient features and HOF or HOG features may also be used for dictionary learning [1]. However, all of these methods depend on hand-crafted features that are difficult to define a priori because there are so many different types of anomalous behavior. In addition, they are unable to adapt to abnormalities that have never been encountered before [7].
Recently, a variety of computer vision tasks have been successfully tackled using deep learning approaches, which have surpassed the previous state of the art on difficult problems such as object classification [9]-[11], object detection [12], [13], and action recognition [8], [14], [15]. Deep learning is a subfield of machine learning that achieves high performance by learning to represent information as a hierarchy of nested concepts within the layers of a neural network [16]. As the volume of data increases, deep learning outperforms classical machine learning, as illustrated in Figure 1. Deep learning-based methods for anomaly detection use one of the following techniques: the reconstruction error, which measures the divergence of the test data from a set of normal training videos; future frame prediction; classifiers; or scoring methods. Most of these techniques, and "traditional" approaches in particular, presuppose the existence of a labeled dataset that represents a collection of "normal" events.
In this work, we present a variety of contributions that tackle these issues, focusing on deep learning-based methods. These approaches are evolving rapidly and constantly, which makes the area particularly difficult to master. Unlike previous review papers, which are general and address anomaly detection in many fields, our paper is specific to anomaly detection in the video surveillance context using deep learning approaches, and it covers the problem from several sides: techniques, datasets, and metrics.
This paper is organized as follows. The first section serves as an introduction. In the second section, we review the deep learning-based methods for anomaly detection in surveillance video. In the third section, we present the publicly available datasets, and in the fourth section we describe the most commonly used evaluation metrics for evaluating and comparing the methods. In the fifth section, we compare and discuss the results of different approaches on several datasets. Finally, we close the paper with a conclusion.

DEEP LEARNING-BASED METHODS FOR ANOMALY DETECTION
Deep learning algorithms have proven effective in a variety of computer vision tasks, such as object classification [9], [17], object detection [12], [18], and action recognition [19], [20], as well as anomaly detection in video surveillance. As introduced in the previous section, the approaches proposed to tackle this challenge can be grouped into four categories: reconstruction error, future frame prediction, classifiers, and scoring.

Reconstruction error based methods
The reconstruction error is one of the most widely used approaches for solving the anomaly detection problem. The basic assumption is that the reconstruction error will be smaller for normal samples, because they are closer to the training data, and higher for abnormal samples [21]. Deep learning-based methods typically train a deep neural network as an auto-encoder (AE) and use it to reconstruct normal events with a small reconstruction error. However, as claimed in [22], larger reconstruction errors for anomalous events do not necessarily occur. As a result, in practice, many methods based on reconstructing the training data cannot guarantee the detection of abnormal events.
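The reconstruction-error assumption can be illustrated with a minimal sketch. It does not reproduce any of the reviewed methods: a linear autoencoder (PCA computed with an SVD) stands in for the deep AE, and the data, dimensions, and function names (`fit_linear_autoencoder`, `reconstruction_error`) are assumptions chosen only for illustration.

```python
import numpy as np

def fit_linear_autoencoder(normal_data, n_components):
    """Fit a linear 'autoencoder' (PCA) on normal samples only."""
    mean = normal_data.mean(axis=0)
    # Rows of vt are the principal directions of the centered normal data.
    _, _, vt = np.linalg.svd(normal_data - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruction_error(samples, mean, components):
    """Per-sample mean squared error between input and reconstruction."""
    centered = samples - mean
    codes = centered @ components.T        # encode to the low-dimensional code
    reconstructed = codes @ components     # decode back to the input space
    return np.mean((centered - reconstructed) ** 2, axis=1)

rng = np.random.default_rng(0)
# Hypothetical "normal" features lying close to a 2-D subspace of a 10-D space.
basis = rng.normal(size=(2, 10))
normal_train = rng.normal(size=(500, 2)) @ basis + 0.01 * rng.normal(size=(500, 10))
normal_test = rng.normal(size=(50, 2)) @ basis + 0.01 * rng.normal(size=(50, 10))
# Hypothetical "abnormal" features that do not follow the normal subspace.
abnormal_test = rng.normal(size=(50, 10))

mean, components = fit_linear_autoencoder(normal_train, n_components=2)
err_normal = reconstruction_error(normal_test, mean, components)
err_abnormal = reconstruction_error(abnormal_test, mean, components)
print(err_normal.mean(), err_abnormal.mean())
```

In this toy setting the abnormal samples reconstruct much worse than the normal ones. The caveat raised in [22] still applies: a high-capacity network may also reconstruct anomalies well, in which case the gap shrinks.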
A method was proposed in [23] to learn normal patterns with minimal supervision using autoencoders. First, the authors use conventional hand-crafted spatio-temporal local features to train an autoencoder. The value of this type of information for training is that it can be used with little or no supervision. Then, they develop a fully convolutional AE that learns the local features and the classifiers in one framework.
Another method was proposed in [24], where the authors used generative adversarial networks (GANs) [25] that take normal frames and the associated optical-flow images as training data to learn a representation of normal frames. The GANs cannot generate abnormal events because they have only been trained on normal data. Therefore, to detect abnormalities at test time, a local difference between the actual and generated images is used. In future work, it may be possible to use dynamic images [26] to represent motion data.
Similarly, the work of [27] also uses GANs [28] and applies transfer learning to a pre-trained CNN (VGG16). Transfer learning is an important machine learning technique for addressing the fundamental issue of insufficient training data; its goal is to transfer knowledge from one domain to another [16]. The authors also improve the model's effectiveness by processing the video's optical-flow information. The experiments are run on the University of California, San Diego (UCSD) datasets, and the evaluation uses criteria such as the area under the receiver operating characteristic (ROC) curve (AUC) and the equal error rate (EER).
Vu et al. [28] propose a two-fold approach. On one side, they propose a customizable multi-channel framework for generating multi-type frame-level features; on the other side, they investigate how supervised learning can be used to increase detection performance. The multi-channel framework is composed of four conditional GANs (CGANs) [29] that take various types of motion and appearance data as input and produce prediction data as output. The peak signal-to-noise ratio (PSNR) is then used to encode the difference between the generated and ground-truth information. For frame-level anomaly detection, a binary support vector machine (SVM) is used. Finally, they perform object-centric anomaly localization using mask region-based convolutional neural networks (Mask R-CNN) as a detector. They evaluate their solution on four datasets: Avenue, ShanghaiTech, and UCSD (Ped1 and Ped2).
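As a small illustration of the PSNR measure mentioned above, the following sketch computes the standard definition, 10·log10(MAX² / MSE), between a generated frame and its ground truth. The function name `psnr`, the 8-bit peak value of 255, and the toy 4×4 frames are assumptions for illustration; how [28] feeds this value into its pipeline is not reproduced here.

```python
import numpy as np

def psnr(predicted, target, max_value=255.0):
    """Peak signal-to-noise ratio in dB; higher means the frames are closer."""
    diff = predicted.astype(np.float64) - target.astype(np.float64)
    mse = np.mean(diff ** 2)
    if mse == 0:
        return float("inf")   # identical frames
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Toy 4x4 grayscale frames (assumed 8-bit intensity range).
target = np.full((4, 4), 100, dtype=np.uint8)
close = target.copy()
close[0, 0] = 104                              # one slightly wrong pixel, MSE = 1
far = np.full((4, 4), 150, dtype=np.uint8)     # every pixel off by 50

print(psnr(close, target), psnr(far, target))
```

A well-predicted (normal) frame yields a high PSNR, while a poorly predicted frame yields a low one, which is why PSNR can serve as a frame-level feature or score.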
Sabokrou et al. [21] propose an approach for anomaly detection and localization that processes cubic video patches in two stages: one relies on the ability of an autoencoder to reconstruct an input video patch, while the other relies on the sparse representation of an input video patch. These two stages are built on the analysis of the reconstruction error of the AE and the sparsity value (SV). The main idea is that, if the autoencoder has been trained successfully on normal patches, an anomalous patch in the testing phase has a higher reconstruction error than a normal patch.

Scoring based methods
There is another category of methods, based on scores [6], [21], [22], [30]; the main idea of this approach is to generate an anomaly score that can be used to determine whether or not a video segment or frame is abnormal. Sultani et al. [30] propose an approach that learns anomalies using both normal and abnormal videos; they postulate that using only normal data might not be the best way to detect anomalies. To avoid the time-consuming task of annotating the anomalous segments in training videos, they suggest learning anomalies from weakly labeled training videos using a deep multiple instance ranking framework [31].
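The multiple instance ranking idea can be illustrated with a small sketch. Each video is treated as a bag of per-segment anomaly scores, and a hinge-style loss asks the highest-scoring segment of an abnormal video to outrank the highest-scoring segment of a normal video. The function name `mil_ranking_loss`, the margin of 1, and the toy scores are assumptions made for illustration; the sketch omits any regularization terms and should not be read as the exact objective of [30].

```python
import numpy as np

def mil_ranking_loss(abnormal_bag_scores, normal_bag_scores, margin=1.0):
    """Hinge ranking loss between the top-scoring segments of two bags."""
    top_abnormal = float(np.max(abnormal_bag_scores))
    top_normal = float(np.max(normal_bag_scores))
    # Zero loss once the abnormal bag's top segment beats the normal one by `margin`.
    return max(0.0, margin - top_abnormal + top_normal)

# Toy per-segment scores; only video-level (weak) labels are assumed to be known.
abnormal_bag = np.array([0.10, 0.20, 0.90, 0.15])   # one segment looks anomalous
normal_bag = np.array([0.05, 0.10, 0.20, 0.10])

loss_good = mil_ranking_loss(abnormal_bag, normal_bag)   # well-ranked bags
loss_bad = mil_ranking_loss(normal_bag, abnormal_bag)    # deliberately swapped
print(loss_good, loss_bad)
```

Only the video-level label is needed to form the bags, which is what makes the supervision weak: the loss never requires knowing which segment of the abnormal video contains the anomaly.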
In their approach, the authors learn an anomaly ranking model that automatically predicts high anomaly scores for anomalous video segments by treating video segments as instances in multiple instance learning (MIL) and normal and abnormal videos as bags. MIL is a weakly supervised learning technique in which the training data is organized in bags, and each bag contains a collection of instances [32]. Pang et al. [1] try to solve the problem by learning anomaly scores end to end on a collection of video frames without explicitly labeling any data as normal or abnormal. To that end, they propose an end-to-end approach based on self-trained deep ordinal regression to detect anomalies in video. This approach overcomes two limitations of existing methods: the reliance on manually labeled normal training data, and sub-optimal feature learning.
The proposed framework receives a collection of unlabeled videos and first carries out an initial detection to produce a set of pseudo-anomalous and pseudo-normal frames. These sets are then used to train a ResNet-50 model [33] and a fully connected network in an end-to-end fashion. ResNet-50 is a pre-trained model that can extract frame appearance features. The fully connected network is composed of a hidden layer with 100 units and an output layer with one linear unit. Finally, the anomaly scores of all frames are recomputed using the trained model, the abnormal and normal memberships are updated accordingly, and the process is repeated.
Another method was proposed by Xu et al. [7], who introduce an unsupervised learning approach that learns feature representations automatically. They propose a new double fusion architecture that exploits the complementary information contained in appearance and motion patterns, combining the advantages of typical early fusion and late fusion. In the early fusion, stacked denoising auto-encoders (SDAE) are used to learn the motion and appearance features of activities in a video separately. Multiple one-class SVM models then predict the anomaly score of each input using the learned features. Finally, the late fusion combines the obtained scores and detects anomalous events. According to the authors, this work is the first effort to tackle abnormal event identification using deep learning. Despite the good results achieved, the approach is still limited by its high computational cost for real-time processing. Future research could therefore look for ways to reduce the computation cost.
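The late fusion step can be sketched as a weighted combination of per-stream anomaly scores. This is a generic illustration and not the exact scheme of [7]: the min-max normalization, the equal weights, the function names, and the toy scores (standing in for the outputs of the one-class SVMs) are all assumptions.

```python
import numpy as np

def min_max_normalize(scores):
    """Rescale scores to [0, 1] so that streams become comparable."""
    scores = np.asarray(scores, dtype=np.float64)
    span = scores.max() - scores.min()
    if span == 0:
        return np.zeros_like(scores)
    return (scores - scores.min()) / span

def late_fusion(score_streams, weights):
    """Weighted average of normalized per-stream anomaly scores."""
    weights = np.asarray(weights, dtype=np.float64)
    weights = weights / weights.sum()
    normalized = np.stack([min_max_normalize(s) for s in score_streams])
    return weights @ normalized

# Hypothetical anomaly scores for four inputs from two separate detectors.
appearance_scores = [0.1, 0.2, 0.9, 0.3]
motion_scores = [1.0, 1.5, 4.0, 5.0]

fused = late_fusion([appearance_scores, motion_scores], weights=[0.5, 0.5])
print(fused)
```

Normalizing before averaging keeps one stream from dominating only because its raw scores live on a larger scale.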

Future frame-based methods
This approach offers another angle on the anomaly detection challenge, based on future frame prediction. The underlying assumption is that normal events are predictable, whereas abnormal ones do not match expectations. The first work to introduce this approach is [22], in which the authors propose a future frame prediction network. It is based on a generator-discriminator structure similar to that of a GAN: a U-Net model is used as the prediction network to create a future frame, while the discriminator at the end of the network determines whether or not the predicted frame is abnormal. Moreover, to predict a higher-quality future frame for normal events, in addition to the commonly used appearance constraints, they also use a motion constraint that enforces consistency of the optical flow between the ground-truth and the predicted frames.
Another future frame prediction method was proposed by Medel and Savakis [34]. Their approach is based on developing generative models that can detect anomalies in videos with limited supervision. They propose a composite convolutional long short-term memory (Conv-LSTM) network that is end-to-end trainable and can predict the evolution of a video sequence, including future frames, given a few input frames. The network learns to predict "normal" activities comparable to those seen in the training videos. For abnormal activities, the prediction deviates further from the ground truth with each succeeding timestep. As a result, the regularity score produced can be used to identify when abnormalities occur in videos. For evaluation, the authors did not use the most common metrics for assessing results and comparing with other methods, such as AUC and EER.
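The link between prediction error and a regularity score can be sketched as follows. The exact definition used in [34] is not given in this review, so the min-max normalization below is an assumption (one common choice), as are the function name `regularity_score` and the toy per-frame errors.

```python
import numpy as np

def regularity_score(frame_errors):
    """Map per-frame prediction errors to [0, 1]; 1 = most regular frame."""
    errors = np.asarray(frame_errors, dtype=np.float64)
    span = errors.max() - errors.min()
    if span == 0:
        return np.ones_like(errors)    # all frames equally regular
    return 1.0 - (errors - errors.min()) / span

# Hypothetical prediction errors; frames 3 and 4 are poorly predicted.
errors = [2.0, 2.5, 2.0, 8.0, 9.0, 3.0]
scores = regularity_score(errors)
print(scores)
```

Frames whose score drops toward 0 are the ones the predictor could not anticipate, and they are therefore candidates for abnormal events.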

Classifier based methods
The work of Medel and Savakis [4] framed the anomaly detection problem as a classification problem. They proposed an approach for detecting and localizing anomalies in videos by analyzing the output of deep layers; the approach uses fully convolutional neural networks (FCNNs) and temporal information. The proposed FCNN combines a pre-trained CNN based on the AlexNet model [9] with a novel convolutional layer whose kernels are trained on the training video. The network focuses on two key tasks: feature representation and outlier detection. This approach achieved good results in terms of accuracy, but it still has some limitations.

BENCHMARK DATASETS
In this part, we describe the public datasets used for anomaly detection in video. Many papers use at least one benchmark dataset to compare the performance of their proposed methods with previously published work. Due to variable crowd density and behavior patterns, all datasets exhibit dynamic scenarios. The datasets frequently used for anomaly detection are listed in Table 2.

UCSD pedestrian
The UCSD pedestrian dataset [37] contains two subsets, UCSD Ped1 and UCSD Ped2, which differ in frame size and camera angle. The dataset is divided into training and testing data. The training data is free of abnormalities: it shows only normal activities and contains only pedestrians. In contrast, every testing clip contains at least one anomaly. The anomalous events are either non-pedestrian objects moving along the walkways or anomalous pedestrian motion. Common anomalies include small cars, skaters, bikes, and people walking on the grass; in certain frames, anomalies appear in multiple locations.
UCSD pedestrian 1: this dataset has 34 video sequences for training and 16 video sequences for testing, in which one or more anomalies are present in some of the frames. Pixel-level binary masks are provided for a collection of ten clips in the testing set to identify the regions containing anomalous events. Each clip contains about 200 frames. There are 5,500 normal and 3,400 abnormal frames, with a resolution of 158×238 pixels. In this dataset, the camera is positioned at a considerable height.
UCSD pedestrian 2: this dataset contains around 1,652 anomalous and 346 normal frames across 12 testing and 16 training video sequences. The frames have a resolution of 360×240 pixels. The camera here is placed at a lower height. Each testing clip in this dataset has only one anomalous event, which takes up the majority of the video segment.
Works are usually evaluated independently on these two datasets. Due to the different camera viewpoints, Ped1 appears to be more challenging than Ped2. Figure 2 shows sample frames from the UCSD dataset for both normal and abnormal behavior in the scene, together with their ground truth.

Subway dataset
The subway dataset [40] comprises two video sequences recorded at the entrance (144,249 frames, 1 hour 36 minutes long) and at the exit (64,900 frames, 43 minutes long) of a subway station. The abnormal events mainly include individuals traveling in the wrong direction and no-payment events. The number of anomalies in this dataset is low. Figure 4 shows sample frames from the subway dataset for both normal and abnormal events.
Subway entrance: the surveillance video from the subway entrance shows a variety of anomalous events, such as people loitering, walking in the wrong direction, and avoiding payment.
Subway exit: the surveillance video from the subway exit shows anomalies similar to those seen in the subway entrance video.

UMN dataset
The University of Minnesota (UMN) dataset comprises 3 distinct scenes of escape incidents, with a total of 7740 frames (1,450 for scene 1, 4,415 for scene 2, and 2,145 for scene 3) at a resolution of 320×240. The abnormal activities consist of people suddenly running and dispersing at the same moment, while the normal events are pedestrians wandering around the plaza or through the mall. There are 11 abnormal events in the entire video collection. Figure 5 illustrates example frames from the UMN dataset.

ShanghaiTech dataset
The ShanghaiTech dataset includes 330 videos for training and 107 videos for testing, with over 270,000 training frames. There are 130 abnormal events and numerous types of anomalies across 13 scenes that involve difficult lighting conditions and camera positions. Furthermore, the ground truth of the abnormal events is labeled. In the test set, normal samples outnumber abnormal samples. Figure 6 shows sample frames from this dataset for both abnormal and normal behavior.

Figure 2. Normal and abnormal frames; the red box denotes the anomaly in an abnormal frame.

UCF dataset
The University of Central Florida (UCF) dataset is a large dataset, with about 128 hours of video, proposed by [30] to help solve the anomaly detection problem. It contains 1,900 long real-world surveillance videos covering 13 realistic anomalies, including burglary, fights, robbery, and road accidents, as well as normal activities. This dataset can be used for two different purposes: first, general anomaly detection, in which all anomalies are placed in one group and all normal events in another; second, recognition of each of the 13 anomalous activities. The dataset contains 15 times as many videos as other datasets. Figure 7 shows a few examples of anomalies from the UCF dataset.

EVALUATION METRICS
In this section, we will discuss the evaluation and comparison measures used in state-of-the-art methods.

- Frame level: a frame is considered a detection if it contains at least one abnormal pixel. These detections are compared to each frame's ground truth annotation. The process is repeated for different thresholds to create a ROC curve. This assessment does not verify that the detection coincides with the actual location of the anomaly. Therefore, some true positive detections may be the result of "lucky" co-occurrences of false positives and abnormal events [37].
- Pixel level: the accuracy of localization is evaluated by comparing detections to pixel-level ground truth masks on a collection of ten clips. The process is comparable to the one described above. A frame is deemed accurately detected if at least 40% of the truly anomalous pixels are found; otherwise, it is counted as a false positive [37].
- ROC curve: the ROC curve is used to evaluate accuracy under various threshold settings. It plots the true positive rate (TPR) against the false positive rate (FPR), where FPR is the proportion of false positives among all negative samples in the test stage, and TPR is the proportion of positive samples that are correctly classified among all positive samples in the test stage. These measurements are given by (1) and (2):

$$\mathrm{TPR} = \frac{TP}{TP + FN} \tag{1}$$

$$\mathrm{FPR} = \frac{FP}{FP + TN} \tag{2}$$

where true positive (TP) denotes anomalous events that have been correctly identified; true negative (TN) denotes normal events that have been correctly identified; false positive (FP) denotes normal events that have been incorrectly identified as anomalous; and false negative (FN) denotes anomalous events that have been incorrectly identified as normal. We select several thresholds for both frame-level and pixel-level detection and compute the corresponding TPR and FPR to produce the ROC curve [39].
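The frame-level and pixel-level criteria and the two rates above can be written down as a short sketch. The function names, the mask shapes, and the confusion counts are assumptions chosen for illustration; the 40% overlap threshold follows the pixel-level criterion stated above.

```python
import numpy as np

def frame_level_detection(pred_mask):
    """A frame is flagged if at least one pixel is predicted abnormal."""
    return bool(np.any(pred_mask))

def pixel_level_true_positive(pred_mask, gt_mask, min_overlap=0.4):
    """True if at least `min_overlap` of the ground-truth anomalous pixels are found."""
    gt_pixels = np.count_nonzero(gt_mask)
    if gt_pixels == 0:
        return False                      # no anomaly to localize in this frame
    covered = np.count_nonzero(np.logical_and(pred_mask, gt_mask))
    return covered / gt_pixels >= min_overlap

def rates(tp, fp, tn, fn):
    """TPR and FPR from confusion counts, as in (1) and (2)."""
    return tp / (tp + fn), fp / (fp + tn)

# Toy 4x6 masks: the ground truth has 10 anomalous pixels.
gt_mask = np.zeros((4, 6), dtype=bool)
gt_mask[0:2, 0:5] = True
good_pred = np.zeros_like(gt_mask)
good_pred[0, 0:5] = True                  # covers 5 of 10 anomalous pixels
poor_pred = np.zeros_like(gt_mask)
poor_pred[0, 0:3] = True                  # covers 3 of 10 anomalous pixels

print(frame_level_detection(poor_pred))               # flagged at frame level
print(pixel_level_true_positive(good_pred, gt_mask))  # 50% overlap
print(pixel_level_true_positive(poor_pred, gt_mask))  # 30% overlap
print(rates(tp=8, fp=1, tn=9, fn=2))
```

The `poor_pred` case shows why the two criteria differ: the frame counts as detected at frame level, yet it fails the pixel-level test because too little of the anomalous region was localized.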
The AUC is employed as the evaluation metric.The ground truth and frame-level anomaly scores are used to calculate AUC. Figure 8 illustrates the area under the ROC curve.
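As a minimal sketch of how the AUC can be obtained from ground truth labels and frame-level anomaly scores, the code below sweeps a threshold over the scores to build the ROC points and then integrates with the trapezoidal rule. The function names and the four toy frames are assumptions for illustration; evaluation code used by the reviewed papers may differ in details such as tie handling.

```python
import numpy as np

def roc_curve_points(labels, scores):
    """FPR and TPR for every threshold, from the strictest to the loosest."""
    labels = np.asarray(labels, dtype=bool)
    scores = np.asarray(scores, dtype=np.float64)
    # Start above every score (nothing flagged), then lower the threshold.
    thresholds = np.concatenate(([np.inf], np.unique(scores)[::-1]))
    positives = labels.sum()
    negatives = (~labels).sum()
    fpr, tpr = [], []
    for threshold in thresholds:
        flagged = scores >= threshold
        tpr.append(np.sum(flagged & labels) / positives)
        fpr.append(np.sum(flagged & ~labels) / negatives)
    return np.array(fpr), np.array(tpr)

def auc(fpr, tpr):
    """Area under the ROC curve using the trapezoidal rule."""
    return float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2.0))

# Toy frame-level data: 1 = abnormal frame, 0 = normal frame.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

fpr, tpr = roc_curve_points(labels, scores)
print(auc(fpr, tpr))
```

An AUC of 1 corresponds to scores that separate abnormal from normal frames perfectly, while 0.5 corresponds to chance-level ranking.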
The EER is the proportion of incorrectly classified frames at the operating point where the FPR and the miss rate are equal. The lower the EER, the higher the accuracy of the algorithm. The EER is the point on the ROC curve at its intersection with the line going from (0,1) to (1,0). Figure 8 illustrates the EER. Time complexity is another important criterion: an algorithm with a sufficiently short overall execution time is more attractive for use in many applications.
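The EER can be read off a set of ROC points by locating where the FPR equals the miss rate (1 − TPR). The sketch below assumes the ROC points are ordered from the strictest to the loosest threshold and interpolates linearly between the two points that bracket the crossing; the function name and the toy ROC points are assumptions for illustration.

```python
import numpy as np

def equal_error_rate(fpr, tpr):
    """EER from ROC points ordered from strictest to loosest threshold."""
    fpr = np.asarray(fpr, dtype=np.float64)
    miss_rate = 1.0 - np.asarray(tpr, dtype=np.float64)
    diff = fpr - miss_rate                 # non-decreasing along the curve
    idx = int(np.argmax(diff >= 0))        # first point with FPR >= miss rate
    if idx == 0 or diff[idx] == 0:
        return float(fpr[idx])
    # Linear interpolation between the two points that bracket diff = 0.
    d0, d1 = diff[idx - 1], diff[idx]
    weight = -d0 / (d1 - d0)
    return float(fpr[idx - 1] + weight * (fpr[idx] - fpr[idx - 1]))

# Hypothetical ROC points.
fpr = [0.0, 0.1, 0.3, 1.0]
tpr = [0.0, 0.7, 0.9, 1.0]
print(equal_error_rate(fpr, tpr))
```

Geometrically, this is the intersection of the ROC curve with the line from (0,1) to (1,0), on which TPR = 1 − FPR.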

COMPARISON AND DISCUSSION
In this section, we discuss and analyze the performance of anomaly detection methods in video sequences, specifically those based on deep learning approaches, on the publicly available datasets. The approaches are grouped by the type of learning used, together with the evaluation metric results obtained by applying them to different datasets. Accuracy is compared across methods using their frame-level and pixel-level scores.
We have classified deep learning-based papers into four categories of approaches: reconstruction error, future frame prediction, scoring, and classifiers. The accuracy of each is tested on several datasets and evaluated using the AUC and EER metrics at both frame and pixel level. As shown in Table 3 (in Appendix), deep learning-based methods achieve good results on most of the available datasets compared to methods based on hand-crafted features, with some exceptions on specific datasets: the method of [40], which has the lowest EER (10%) of all methods on the UCSD Ped2 dataset; the method of [37] on the subway entrance dataset; and the method of [41], which achieves an accuracy of 99.70% on the UMN dataset.
The analysis of the results of deep learning-based methods shows that reconstruction error is the most widely used approach and gives superior accuracy on the UCSD datasets at both frame and pixel level, as shown in [23], [24], [28]. In some situations, however, larger reconstruction errors for anomalous events may not occur because of the high capacity of the deep neural network. The scoring approach has also achieved good results on some other datasets, as in [42], especially on the subway exit (AUC=95.1%) and UMN (AUC=99.83%) datasets. In addition, the approach presented in [30] gives good accuracy (AUC=75.41%) on its own UCF dataset compared to the results of other approaches, but in some situations it could not locate the anomaly exactly.
For the classifier approach, the method presented in [43] achieved good accuracy (AUC=97.80%) on the UCSD Ped2 dataset, but it generates a high rate of false positives (AUC=68.4% on the UCSD Ped1 dataset) in two situations: when people walk in the wrong direction and in crowded scenes. The future frame prediction approach proves effective for anomaly detection on some datasets (AUC=95.4% on UCSD Ped2). On the Avenue dataset, however, it fails to detect several anomalous jogging events that occur in the background, because it cannot differentiate jogging from walking pedestrians. In general, some datasets are more challenging than others. For example, all approaches give good results on the UMN dataset due to its simplicity, whereas on the UCF dataset the highest result obtained is AUC=75.41%.
Based on the reviewed literature and the results of Table 3 (in Appendix) [1], [4], [7], [21]-[24], [27], [28], [30], [35]-[37], [40]-[60], it is clear that several studies choose to tackle the anomaly detection problem using unsupervised learning methods, because these do not require labeled video data and can be effectively employed for learning good representations. In addition, they cope well with the complexity and variety of anomalous visual behaviors in an unconstrained environment. However, they are still limited and have not achieved good results. Therefore, other researchers try to overcome this limitation by using semi-supervised learning methods, which use only data from the "normal" class; these methods are thus more specific to the anomaly detection problem than unsupervised methods, which use only the structure and configuration of the unlabeled data and no other information.
Despite the large body of research on this topic, some limitations remain. Many anomaly detection algorithms work on very regular scenes, so it is necessary to evaluate how well these methods operate in less structured situations. Moreover, real-time application in unconstrained environments and time complexity remain open issues. Therefore, we propose to use the vision transformer model [61], a recent deep learning technique that achieves good results on many problems and could be a good approach for the anomaly detection problem.

CONCLUSION
This paper reviews deep learning-based methods for video anomaly detection, covering a variety of approaches, techniques, datasets, and evaluation metrics. A thorough overview of anomaly detection should ideally enable readers not only to understand the rationale for using a specific technique, but also to compare different techniques, produce a comparative analysis, and propose an approach. First, we have classified the approaches into four categories: reconstruction error, future frame prediction, scoring, and classifiers. We also presented the strengths and weaknesses of each category on several datasets. Each category can be applied in a supervised or unsupervised manner, but most researchers have focused on tackling the anomaly detection problem with unsupervised learning.
Furthermore, we have presented the publicly available datasets with their details, such as the video resolution and example anomalies found in each dataset, and we found that some datasets are more challenging than others. Finally, we have discussed the results of the different categories applied to different datasets. With the aim of tackling some of these problems and achieving good results in both accuracy and computational complexity, there are research opportunities to develop a new approach based on the vision transformer to improve the detection of anomalous objects in video sequences.
Table 3 (fragment): Weakly sup; Classification [43], 97.80%; Score [30], 75.41%

Figure 1. Performance of deep learning-based algorithms in comparison to traditional algorithms [16]

Figure 2. Samples from the UCSD dataset: the left column illustrates normal pedestrian behavior, the middle column shows anomalous behavior in the scene, and the right column shows the corresponding ground truth

Figure 4. Samples from the subway dataset: the top row displays regular events, whereas the bottom row displays abnormal ones

Figure 5. Samples from the UMN dataset: the top row depicts normal crowd behavior, while the bottom row depicts panicked crowd behavior

Figure 7. Samples of anomalies from the UCF dataset


ISSN: 2302-9285. Bulletin of Electr Eng & Inf, Vol. 12, No. 1, February 2023: 314-327
The classifier-based approach has some limitations: false positives occur in some cases, such as when people walk in different directions and in crowded scenes. A summary of past literature on anomaly detection techniques is shown in Table 1.

Table 1. Deep learning-based approaches for anomaly detection

Table 2. A comparison of anomaly datasets
Deep learning-based methods for anomaly detection in video surveillance: a … (Abdelhafid Berroukham)

Table 3 (in Appendix) lists the approaches discussed in the previous sections and other papers that tackle the anomaly detection problem in accordance with the publicly available datasets.

Table 3. Results of different approaches on several datasets