A Comprehensive Review of Deep Learning Video Anomaly Detection

Melissa Montero
May 16, 2024

Updated: Jun 10, 2024

Video anomalies can be detected with stationary cameras (e.g., surveillance systems) or moving cameras (e.g., vehicle dashboard cameras). Here we focus on deep learning methods for video anomaly detection in static scenes. If you are interested in the moving-camera case, take a look at the survey “Survey on video anomaly detection in dynamic scenes with moving cameras”.

Unlike image anomaly detection, video anomaly detection must consider not just frame appearance but also motion information, hence the need for spatiotemporal features that allow normality to be defined. The detection mechanism may follow paths similar to those of image anomaly detection, but it must also take the temporal information of the input data into account.

Like image anomaly detection, video anomaly detection can be classified according to the learning mechanism as supervised, semi-supervised or unsupervised. Unsupervised methods are the most common and longest established, but weakly supervised learning has become an important approach for real-life applications where anomalies can be defined more clearly. Supervised methods are outside the scope of this blog, since they can be treated as a classification problem.

Unsupervised Video Anomaly Detection

Unsupervised Video Anomaly Detection Categories

Unsupervised video anomaly detection mechanisms can be categorized according to the method used to measure deviation from normality: distance-based, probability-based, reconstruction-based and prediction-based. Alternatively, as proposed in “Generalized Video Anomaly Event Detection: Systematic Taxonomy and Comparison of Deep Models”, the methods can be categorized by the nature of the input data, which highlights their capacity for modeling the spatiotemporal characteristics of the video. This gives three groups: frame-level methods, patch-level methods and object-level methods. Below is a brief description of the methods under both categorizations.

Distance Based Methods

These methods work on the assumption that normal events occur in a dense neighborhood, while anomalous events lie far from their neighbors. Anomalies are detected by measuring the distance of the input pattern from its neighbors; Gaussian models with Mahalanobis distance, clustering distances and similarity measures are commonly used.

Techniques for distance-based methods include sparse autoencoders, deep cascade autoencoders, online growing neural gas (GNG), object-centric autoencoders and Siamese neural networks.
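
As a rough illustration of the Gaussian-with-Mahalanobis idea, the sketch below fits a mean and covariance to feature vectors from normal data and scores new features by their distance from that distribution. The random features stand in for the output of a learned encoder, and the function names and the small ridge term are assumptions made for this example, not part of any particular published method.

```python
import numpy as np

def fit_gaussian(normal_features):
    """Estimate mean and inverse covariance from normal-only feature vectors."""
    mean = normal_features.mean(axis=0)
    cov = np.cov(normal_features, rowvar=False)
    # Small ridge term keeps the covariance invertible (assumption of this sketch).
    cov += 1e-6 * np.eye(cov.shape[0])
    return mean, np.linalg.inv(cov)

def mahalanobis_score(feature, mean, cov_inv):
    """Distance of one feature vector from the fitted normal distribution."""
    diff = feature - mean
    return float(np.sqrt(diff @ cov_inv @ diff))

rng = np.random.default_rng(0)
# Stand-in for features produced by a trained encoder on normal video.
normal_features = rng.normal(0.0, 1.0, size=(500, 4))
mean, cov_inv = fit_gaussian(normal_features)

score_typical = mahalanobis_score(np.zeros(4), mean, cov_inv)
score_outlier = mahalanobis_score(np.full(4, 6.0), mean, cov_inv)
print(score_typical, score_outlier)
```

A feature far from the dense region of normal samples gets a much larger score, which can then be thresholded to flag an anomaly.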

Probability Based Methods

Probability-based methods for video anomaly detection rely on statistical models to represent the normal behavior of a scene or object, and detect anomalies as deviations from this learned model. These methods typically use probability distributions to model the normal patterns observed in the video data.

These include approaches based on statistical activity analysis of background-subtracted video sequences, as well as the detection of local and global anomalies using a sparse set of spatio-temporal interest points. Local anomalies are features with low-likelihood visual patterns, while global anomalies consider an ensemble of features whose interactions have dissimilar semantics or misaligned structures with respect to the probabilistic normal model. 3D convolutional architectures are also used to design 3D autoencoders that obtain a meaningful representation of spatiotemporal changes.

Reconstruction Based Methods

The idea behind these methods is much the same as in their image anomaly detection counterpart: normality is determined by the model's capacity to reconstruct a frame. The main difference is the use of spatiotemporal information. Approaches include stacking input images along the temporal dimension, convolutional long short-term memory networks (Conv-LSTM), and spatio-temporal autoencoders (STAE), which use a 3D convolutional autoencoder to preserve the temporal information and keep track of spatial features along the temporal dimension.
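
The sketch below illustrates reconstruction-based scoring on frames stacked along the temporal dimension. To stay small and runnable it replaces the deep autoencoder with a linear one (a PCA basis computed with SVD), which is a deliberate simplification; the synthetic frames, helper names and number of components are all assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(0)
base = np.linspace(0.0, 1.0, 64).reshape(8, 8)  # synthetic "normal" scene

def normal_frame():
    """Normal scene with random brightness and a little sensor noise."""
    return base * rng.uniform(0.8, 1.2) + rng.normal(0.0, 0.01, size=base.shape)

def stack_frames(frames, clip_len):
    """Group consecutive frames into overlapping clips along a temporal axis."""
    n_clips = len(frames) - clip_len + 1
    return np.stack([frames[i:i + clip_len] for i in range(n_clips)])

def fit_linear_autoencoder(clips, n_components):
    """PCA basis, used here as a linear stand-in for a trained autoencoder."""
    flat = clips.reshape(len(clips), -1)
    mean = flat.mean(axis=0)
    _, _, vt = np.linalg.svd(flat - mean, full_matrices=False)
    return mean, vt[:n_components]

def reconstruction_error(clips, mean, basis):
    """Mean squared error between each clip and its reconstruction."""
    flat = clips.reshape(len(clips), -1) - mean
    recon = (flat @ basis.T) @ basis  # encode, then decode
    return ((flat - recon) ** 2).mean(axis=1)

# Train on normal clips only.
train_frames = np.stack([normal_frame() for _ in range(40)])
train_clips = stack_frames(train_frames, clip_len=4)
mean, basis = fit_linear_autoencoder(train_clips, n_components=4)

# Score one normal clip and one clip of unstructured noise.
normal_clip = np.stack([normal_frame() for _ in range(4)])
odd_clip = rng.uniform(0.0, 1.0, size=(4, 8, 8))
errors = reconstruction_error(np.stack([normal_clip, odd_clip]), mean, basis)
print(errors)
```

Because the model only ever saw normal clips, it reconstructs them well and reconstructs the unfamiliar clip poorly, so the reconstruction error acts as the anomaly score.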

Prediction Based Methods

Prediction-based methods predict future frames from the video segments observed up to a given time. The anomaly is determined by the error between the predicted frames and the frames actually observed at that time. This category is composed of generative networks, such as generative adversarial networks (GAN), variational autoencoders and adversarial autoencoders, which use spatial and temporal information from video segments to predict the future frames.
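
The scoring step can be sketched as follows, assuming a predictor already exists and has produced one predicted frame per time step. The per-frame mean squared error and the min-max normalization within a video are choices made for this illustration, and the function names are hypothetical.

```python
import numpy as np

def prediction_errors(predicted, actual):
    """Per-frame mean squared error between predicted and observed frames."""
    return ((predicted - actual) ** 2).mean(axis=(1, 2))

def normalize_scores(errors):
    """Min-max normalize errors within one video so scores lie in [0, 1]."""
    span = errors.max() - errors.min()
    if span == 0:
        return np.zeros_like(errors)
    return (errors - errors.min()) / span

# Five 4x4 frames; the predictor (not shown) expected a static scene.
predicted = np.zeros((5, 4, 4))
actual = np.zeros((5, 4, 4))
actual[3] += 0.5  # something unexpected happens in frame 3

scores = normalize_scores(prediction_errors(predicted, actual))
print(scores)
```

Frames that the model could not anticipate end up with the highest normalized score.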

Frame Level Methods

Frame Level Single Stream Video Anomaly Detection

(samples from ShanghaiTech dataset)

Frame-level methods use complete frames, frame sequences or optical flow as input to model normal activities. The methods in this category can in turn be divided into two groups according to the model structure: single-stream and multi-stream.

Frame Level Multi Stream Video Anomaly Detection

(samples from ShanghaiTech dataset)

Single-stream methods do not distinguish between spatial and temporal information; they usually learn spatio-temporal patterns with a single generative model by reconstructing the input sequence or predicting the next frame. These methods focus on efficient network structures, using powerful learners such as 3D convolutions and U-Net. Among these approaches, the predictive convolutional long short-term memory (PC-LSTM) uses a convolutional LSTM network to model the evolution of video sequences, while future frame prediction methods with GAN-based video prediction frameworks learn normality using 2D and 3D convolutions and convolutional LSTMs to characterize spatio-temporal patterns. Some methods combine U-Net's ability to represent spatial information with the convolutional LSTM's ability to model the temporal variations of moving objects, or unify prediction and reconstruction networks. To limit the deep model's ability to generalize to anomalous samples, memory networks and attention-based memory addressing mechanisms have also been introduced to record the patterns of normal events.

On the other hand, multi-stream methods treat appearance and motion as different dimensions of information and learn spatial and temporal features separately using different network architectures. Some methods also learn the correspondence and consistency between both dimensions to determine normality. Some use RGB frames and optical flow as input to stacked denoising autoencoders; others use two-stream autoencoders to capture spatial and temporal information, with convolutional LSTMs to better model temporal variations. The reconstruction errors of the two encoders are weighted and used to calculate anomaly scores. Other methods try to keep appearance and motion consistent using two independent autoencoders plus additional elements: decoders that reconstruct input frames and predict optical flow, modules that fuse spatio-temporal features and predict future frames to model spatio-temporal normality, or memory banks that keep consistency in a high-level feature space. Adversarial learning has also been used to explore the connection between spatial and temporal information in regular events.
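
The weighting of the two reconstruction errors can be sketched as a simple convex combination. The weight value and the function name below are assumptions for illustration; actual methods choose or tune the weighting in their own way.

```python
import numpy as np

def fuse_scores(appearance_errors, motion_errors, appearance_weight=0.5):
    """Weighted sum of per-frame errors from the appearance and motion streams."""
    if not 0.0 <= appearance_weight <= 1.0:
        raise ValueError("appearance_weight must be in [0, 1]")
    appearance_errors = np.asarray(appearance_errors, dtype=float)
    motion_errors = np.asarray(motion_errors, dtype=float)
    return (appearance_weight * appearance_errors
            + (1.0 - appearance_weight) * motion_errors)

# Frame 1 looks normal but moves abnormally (e.g. someone running).
appearance = [0.1, 0.1, 0.1]
motion = [0.1, 0.9, 0.1]
fused = fuse_scores(appearance, motion, appearance_weight=0.3)
print(fused)
```

Giving the motion stream a larger weight makes the fused score more sensitive to events that look ordinary in a single frame but move unusually.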

Patch Level Methods

Patch-level methods take spatio-temporal cubes, called video patches, as input and look for anomalies in these specific spatio-temporal regions rather than analyzing the whole sequence. Patch methods can be divided into three categories: scale equipartition, information equipartition and foreground object extraction.

As the name indicates, scale equipartition methods divide the video sequence into uniformly sized patches along the spatial and temporal dimensions. The modeling process is similar to that of frame-level methods, but partitioning into patches makes it possible to focus on regions with motion and select local patches of interest for detecting abnormalities. Lightweight networks can select the patches, which reduces computational cost and reserves processing time on more complex networks for the detection itself.

Information equipartition methods consider that image blocks of the same size do not carry the same amount of information: regions close to the camera contain less information than those far away. These methods therefore use cube sizes that change with the distance to the camera, with larger cubes for regions close to the camera and smaller ones for regions farther away.

Foreground object extraction methods consider only the regions of the video with information variations, to avoid learning the background and being disrupted by it. The sequences are divided into cubes of the same scale, and those containing only background are discarded.
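
A minimal sketch of scale equipartition is shown below: a grayscale video of shape (T, H, W) is cut into non-overlapping cubes of a fixed size using only reshapes. The function name and the requirement that the dimensions divide evenly are assumptions of this example.

```python
import numpy as np

def split_into_cubes(video, t, h, w):
    """Split a (T, H, W) video into non-overlapping (t, h, w) cubes."""
    T, H, W = video.shape
    if T % t or H % h or W % w:
        raise ValueError("video dimensions must be divisible by the cube size")
    # Separate each axis into (block index, offset inside block).
    cubes = video.reshape(T // t, t, H // h, h, W // w, w)
    # Bring the three block indices to the front, then flatten them.
    cubes = cubes.transpose(0, 2, 4, 1, 3, 5)
    return cubes.reshape(-1, t, h, w)

video = np.arange(8 * 4 * 4).reshape(8, 4, 4)
cubes = split_into_cubes(video, t=4, h=2, w=2)
print(cubes.shape)
```

Each cube can then be passed independently to the anomaly model, or first to a cheaper network that decides which cubes deserve further analysis.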
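
One simple way to discard background-only cubes is to measure how much each cube changes over time and keep only those above a threshold. This frame-differencing criterion, the threshold value and the function name are assumptions for illustration; published methods may use other foreground cues.

```python
import numpy as np

def keep_foreground_cubes(cubes, threshold):
    """Keep cubes of shape (t, h, w) whose content changes over time."""
    # Mean absolute difference between consecutive frames inside each cube.
    motion = np.abs(np.diff(cubes, axis=1)).mean(axis=(1, 2, 3))
    mask = motion > threshold
    return cubes[mask], mask

# Three cubes of 4 frames each; only the middle one contains a change.
cubes = np.zeros((3, 4, 2, 2))
cubes[1, 2:] = 1.0  # an object appears halfway through the second cube

kept, mask = keep_foreground_cubes(cubes, threshold=0.05)
print(mask, kept.shape)
```

Static cubes are dropped before training or inference, so the model never spends capacity on unchanging background.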

Object Level Methods

Object-level methods use pre-trained object detection or segmentation networks to extract objects of interest from the video sequence before it reaches the anomaly detector network. This way the anomaly detector does not need to handle redundant background information and can focus on learning the normal behavior and interactions of the foreground objects. These methods therefore perform scene-independent anomaly detection, allowing real-time cross-scene implementations. This can also be a disadvantage, since they may fail to capture context-specific anomalous events, such as a motorcycle on a sidewalk.

Some of the proposed methods are: multi-task learning to obtain anomaly-related semantic information, followed by an anomaly detector that analyzes scene-independent features; a pose classifier alongside an LSTM network to model the appearance and motion information of the detected objects; an object-centric convolutional autoencoder to encode motion and appearance, with anomalies detected by a one-versus-rest classifier; and the use of semantic segmentation, autoencoders and binary classifiers to detect anomalies. To overcome the limitation with scene-specific events, one method explored the semantic interaction between features of foreground targets and the background scene using memory networks, and another experimented with a scene-object binding frame prediction module to capture the relationship between foreground and background through scene segmentation.

Weakly Supervised Video Anomaly Detection

Weakly Supervised Video Anomaly Detection Categories

Weakly supervised anomaly detection uses abnormal samples during training along with a larger number of normal samples. For video anomaly detection specifically, the sequences carry video-level labels: if a sequence is labeled as normal, all of its frames are normal, but if it is labeled as abnormal, at least one frame is anomalous and not necessarily all of them.

Weakly supervised video anomaly detection is built on multiple instance learning (MIL). The methods can be divided according to the input data modalities into unimodal models and multimodal models.

Unimodal Models

Unimodal techniques rely only on image frames as input data and focus on analyzing successive frames to detect anomalies in the video sequence.

Unimodal models typically divide the input video into short segments of fixed size (clips). Each clip is an instance, and all clips from a video form a bag with the same video-level label. A bag is called positive if the input video contains an anomalous event; otherwise it is a negative bag. Models based on MIL usually consist of two modules: a video processing backbone and a prediction head.

The backbone converts a video segment into a feature embedding. Usually a pre-trained feature extractor such as Convolutional 3D (C3D), Temporal Segment Networks (TSN) or Inflated 3D (I3D) is used to extract the spatio-temporal features of the samples. Although the video backbone can be fine-tuned during training, it is generally preferred to freeze it to speed up training and reduce the computational resources needed. The head, in turn, takes the embedding and predicts an anomaly score for each instance under the supervision of the video-level labels.
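
The bag construction can be sketched as below. The dictionary layout, the convention that label 1 means "contains an anomaly", and dropping leftover frames that do not fill a whole clip are all assumptions made for this illustration.

```python
import numpy as np

def video_to_bag(frames, clip_len, video_label):
    """Split a video into fixed-size clips (instances) sharing one video-level label."""
    n_clips = len(frames) // clip_len
    # Leftover frames that do not fill a whole clip are dropped (assumption).
    usable = frames[: n_clips * clip_len]
    clips = usable.reshape(n_clips, clip_len, *frames.shape[1:])
    return {"instances": clips, "label": video_label}

# A 50-frame video labeled as anomalous at the video level.
frames = np.zeros((50, 8, 8))
positive_bag = video_to_bag(frames, clip_len=16, video_label=1)
print(positive_bag["instances"].shape, positive_bag["label"])
```

Note that the label belongs to the bag, not to any individual clip: in a positive bag only some of the instances are actually anomalous, and the model has to work out which.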

The base MIL approach for video anomaly detection uses a 3-layer fully connected network (FCN) as the head to predict high anomaly scores for anomalous clips, and introduces sparsity and smoothness constraints to avoid fluctuations in the anomaly score curve. Many variations of the prediction head have been proposed, for example graph convolutional network (GCN) architectures, 1D convolutional layers, causal convolution layers and attention networks that model the contextual relationship among video segments. Extensions of the feature extraction have also been proposed, such as adding a temporal augmented network to learn motion-aware features and incorporate temporal context into a MIL ranking model, or extracting spatial and temporal features separately and fusing them before feeding them to the head.

Other approaches use pseudo-labeling to improve anomaly detection performance, for example by introducing hand-crafted anomalies, by using a self-reasoning framework in which binary clustering generates the pseudo-labels that supervise the MIL regression models, or by using a noise cleaner framework to refine the predictions and select the high-confidence ones as pseudo-labels.

Other methods include Robust Temporal Feature Magnitude (RTFM), which trains a feature magnitude learning function to identify positive samples efficiently and uses self-attention to capture both long- and short-term correlations, and Deep Temporal Encoding-Decoding, which captures the temporal evolution of videos by treating the instances of the same bag as sequential visual data rather than independent segments.
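
A common way to write such an objective is a hinge ranking loss between the highest-scoring instance of a positive bag and that of a negative bag, plus smoothness and sparsity terms on the positive bag. The sketch below follows that general form; the exact formulation, the margin and the weighting values are assumptions for illustration, not values taken from this article.

```python
import numpy as np

def mil_ranking_loss(pos_scores, neg_scores, lambda_smooth=0.01,
                     lambda_sparse=0.01, margin=1.0):
    """Hinge ranking loss between bags plus smoothness and sparsity terms."""
    pos_scores = np.asarray(pos_scores, dtype=float)
    neg_scores = np.asarray(neg_scores, dtype=float)
    # The top positive instance should outscore the top negative one by `margin`.
    ranking = max(0.0, margin - pos_scores.max() + neg_scores.max())
    # Penalize abrupt score changes between consecutive clips.
    smoothness = np.sum(np.diff(pos_scores) ** 2)
    # Anomalies are rare, so most scores in a positive bag should stay low.
    sparsity = np.sum(pos_scores)
    return float(ranking + lambda_smooth * smoothness + lambda_sparse * sparsity)

loss = mil_ranking_loss([0.1, 0.9, 0.2], [0.1, 0.2, 0.1])
separated = mil_ranking_loss([0.0, 1.0, 0.0], [0.0, 0.0, 0.0])
print(loss, separated)
```

The loss shrinks as the most suspicious clip of the anomalous video pulls away from every clip of the normal video, which is how clip-level scores are learned from video-level labels alone.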

Multimodal Models

Multimodal models use diverse data sources, including video, audio, text and optical flow, to extract the features required to detect anomalies. Most of the existing methods in this category were developed to detect violent behavior in surveillance videos using audio and video information.

Among the approaches is HL-Net, a multimodal anomaly detection method that uses a three-branch neural network: a similarity branch, a proximity branch and a scoring branch. Another is a lightweight dual-stream network that uses self-distillation to transfer unimodal visual knowledge to audio-visual models, narrowing the semantic gap between multimodal features. Other methods use different strategies to fuse multimodal information, such as a bilinear pooling mechanism, a cross-modal perceptual local arousal network, and attention modules that achieve implicit alignment of multimodal data.

In summary, anomaly detection in images and videos is a dynamic field that leverages deep learning techniques to tackle complex problems where anomalies come in various forms and are challenging to obtain in large quantities. By embracing unsupervised or weakly supervised methods, we can better detect deviations from the norm in visual data, making it an indispensable tool for a wide range of applications.
