AVadCLIP: Audio-Visual Collaboration for Robust Video Anomaly Detection

Peng Wu1, Wanshun Su1, Guansong Pang2, Yujia Sun3, Qingsen Yan1, Peng Wang1\(^\ast\), Yanning Zhang1
1School of Computer Science, Northwestern Polytechnical University, China
2School of Computing and Information Systems, Singapore Management University, Singapore
3School of Artificial Intelligence, Xidian University, China


Abstract

As video anomaly detection is increasingly deployed in intelligent surveillance, conventional vision-only approaches often struggle with insufficient information and high false-positive rates in complex environments. To address these limitations, we present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Capitalizing on the exceptional cross-modal representation learning capabilities of Contrastive Language-Image Pretraining (CLIP) across visual, audio, and textual domains, our framework introduces two major innovations: an efficient audio-visual fusion mechanism that enables adaptive cross-modal integration through lightweight parametric adaptation while keeping the CLIP backbone frozen, and a novel audio-visual prompt that dynamically enhances text embeddings with key multimodal information based on the semantic correlation between audio-visual features and textual labels, significantly improving CLIP’s generalization for the video anomaly detection task. Moreover, to enhance robustness against modality deficiency during inference, we further develop an uncertainty-driven feature distillation module that synthesizes audio-visual representations from visual-only inputs. This module employs uncertainty modeling based on the diversity of audio-visual features to dynamically emphasize challenging features during distillation. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy in various scenarios. Notably, with unimodal data enhanced by uncertainty-driven distillation, our approach consistently outperforms current unimodal VAD methods.

1 Introduction

Video anomaly detection (VAD), as a pivotal technology in intelligent surveillance systems, focuses on identifying anomalous events within videos and has attracted substantial research interest in recent years [1]–[3]. Due to the rarity of anomalies and the high cost of manual annotation, fully supervised frameworks are impractical for large-scale deployment. As a solution, weakly supervised video anomaly detection (WSVAD) [4]–[7] has gained traction, requiring only video-level labels instead of detailed frame annotations. These methods seek to uncover latent anomalies under limited supervision, addressing challenges such as anomaly insufficiency and high false-positive rates in complex environments.

Figure 1: Left: Illustration of audio-visual collaboration effects; Right: Illustration of our proposed distillation (UKD) effects.

Current WSVAD methods primarily rely on the multiple instance learning (MIL) framework, using video-level labels for model training [4], [8]. Specifically, these approaches treat videos as bags of segments (instances) and distinguish anomalous patterns through the hard attention mechanism (a.k.a. Top-K) [9]. With the rapid advancement of foundation models, Contrastive Language-Image Pretraining (CLIP) [10] has shown remarkable potential in various downstream tasks, including video understanding [11], [12]. Building on the remarkable success of CLIP, recent methods like VadCLIP [13] and TPWNG [7] have advanced WSVAD by leveraging CLIP’s semantic alignment capabilities. However, these methods, whether CLIP-based or conventional, predominantly rely on unimodal visual information, which often leads to significant detection limitations in complex real-world scenarios. Visual occlusion, extreme lighting variations, and environmental noise can render visual features unreliable or ambiguous [14]–[16]. In these challenging conditions, multimodal information, particularly audio, offers indispensable contextual cues that can complement and enhance visual-based detection. For instance, audio remains robust when visual data is compromised, allowing detection of off-camera events. In acoustically rich environments, certain anomalies, such as explosions, screams, or gunshots, exhibit distinct acoustic signatures, making them more discriminative in the audio domain. Similarly, in low-light conditions where visual features degrade, audio serves as a critical supplementary modality. These observations underscore the importance of integrating audio and video modalities, as their complementary nature can significantly enhance the accuracy and robustness of anomaly detection systems in diverse and challenging environments. We illustrate the impact of audio-visual integration for WSVAD in Figure 1.

Existing attempts [14], [17], [18] to incorporate audio into video anomaly detection typically adopt traditional feature concatenation methods, such as fusing visual features extracted by I3D [19] or C3D [20] with audio features extracted by VGGish [21]. These approaches fail to fully exploit the potential of modern multimodal learning techniques, resulting in suboptimal cross-modal integration. Moreover, they overlook the inherent semantic correlations between visual and auditory modalities, which are essential for enhancing anomaly detection performance.

To address these limitations, we propose AVadCLIP, a WSVAD framework that leverages audio-visual collaborative learning to drive audio-visual anomaly detection via CLIP-powered cross-modal alignment. AVadCLIP fully exploits CLIP’s intrinsic capability to establish semantic consistency across vision, text, and audio, ensuring that video anomaly detection is performed within a unified multimodal semantic space rather than merely fusing raw features. Our framework introduces three significant innovations: an efficient audio-visual feature fusion mechanism that, unlike naive feature concatenation, achieves adaptive cross-modal integration through lightweight parametric adaptation while keeping the CLIP backbone frozen; a novel audio-visual prompt mechanism that dynamically enriches text label embeddings with key multimodal information, enhancing contextual understanding of videos and enabling more precise identification of different categories; and an uncertainty-driven feature distillation (UKD) module that generates audio-visual-like enhanced features in audio-missing scenarios, ensuring robust anomaly detection performance (as illustrated in Figure 1). Overall, our AVadCLIP relies on only a small set of trainable parameters, effectively transferring CLIP’s pretrained knowledge to the weakly supervised audio-visual anomaly detection task. Furthermore, by employing a distillation strategy based on data uncertainty modeling, we further transfer the learned knowledge from our audio-visual anomaly detector to a unimodal detector, enabling robust anomaly detection in scenarios with incomplete modalities.

In summary, our main contributions are as follows:

\(\bullet\) We propose a weakly supervised video anomaly detection framework that harnesses audio-visual collaborative learning, leveraging CLIP’s multimodal alignment capabilities. By incorporating a lightweight adaptive audio-visual fusion mechanism and integrating audio-visual information through prompt-based learning, our approach effectively achieves CLIP-driven robust anomaly detection in multimodal settings.

\(\bullet\) We design an uncertainty-driven feature distillation module, which transforms deterministic estimation into probabilistic uncertainty estimation. This enables the model to capture feature distribution variance, ensuring robust anomaly detection performance even with unimodal data.

\(\bullet\) Extensive experiments on two WSVAD datasets demonstrate that our method achieves superior performance in audio-visual scenarios, while maintaining robust anomaly detection results even in audio-absent conditions.

2 Related Work

2.1 Video Anomaly Detection

Video Anomaly Detection has been extensively studied in recent years, with existing approaches broadly categorized into semi-supervised and weakly supervised methods. Among them, semi-supervised methods primarily rely on normal video clips for training and identify anomalies by detecting deviations from learned normal patterns during inference. These methods commonly adopt self-supervised learning techniques [22]–[24], such as reconstruction [25], [26] or prediction [27], [28]. Reconstruction-based methods assume that the model can effectively reconstruct normal videos, whereas abnormal videos, due to distributional discrepancies, result in significant reconstruction errors. Autoencoders [29], [30] are widely employed to capture normal pattern features, with reconstruction error serving as an anomaly indicator. Prediction-based methods [31] utilize models to forecast future frames, detecting anomalies based on prediction errors. However, a key limitation of semi-supervised methods is their tendency to overfit normal patterns, leading to poor generalization to unseen anomalies.

Weakly supervised methods, in contrast, typically adopt the MIL framework, requiring only video-level anomaly labels and significantly reducing annotation costs. A classic work is DeepMIL [4], which employs a ranking loss to distinguish normal from anomalous instances. To enhance detection, a two-stage pseudo-labeling strategy has been introduced, where high-confidence anomalous regions identified during MIL training serve as pseudo-labels for a secondary refinement phase [32]–[34]. With the rise of Vision-Language Models (VLMs) [35], CLIP has shown remarkable cross-modal capabilities and is increasingly applied to WSVAD. VadCLIP [13], the first CLIP-based WSVAD method, integrates textual priors via text and visual prompts, enhancing anomaly detection. Building on this, TPWNG [7] refines feature learning through a two-stage approach. Recent research trends focus on large model-driven strategies, e.g., training-free frameworks [36], [37], spatiotemporal anomaly detection [38], and open-scene anomaly detection [39].

2.2 Audio-Visual Learning

The integration of audio and visual information has emerged as a critical research direction in multimodal learning, as it not only enhances model performance but also facilitates a deeper understanding of complex scenes. Significant progress has been achieved in various aspects of audio-visual fusion [40], [41]. In audio-visual segmentation, researchers aim to accurately segment sound-producing objects based on audio-visual cues. Chen et al. [42] proposed a novel informative sample mining method for audio-visual supervised contrastive learning. Ma et al. [43] introduced a two-stage training strategy to address the audio-visual semantic segmentation (AVSS) task. Building on these works, Guo et al. [44] introduced a new task, Open-Vocabulary AVSS (OV-AVSS), which extends AVSS to open-world scenarios beyond predefined annotation labels. Audio-visual event localization aims to identify the spatial and temporal locations of both visual and auditory events, with attention mechanisms widely used for modality fusion. For instance, He et al. [45] proposed an audio-visual co-guided attention mechanism, while Xu et al. [46] introduced an audio-guided spatial-channel attention mechanism. Related tasks include audio-visual video parsing [47], [48] and audio-visual action recognition [49]. Audio-visual anomaly detection [17], [50] has also become a growing research hotspot. For example, Yu et al. [51] applied a self-distillation module to transfer single-modal visual knowledge to an audio-visual model, reducing noise and bridging the semantic gap between single-modal and multimodal features. Similarly, Pang et al. [52] proposed a weighted feature generation approach, leveraging mutual guidance between visual and auditory information, followed by bilinear pooling for effective feature integration.

3 Methodology

Figure 2: The pipeline of our proposed AVadCLIP.

Figure 3: The pipeline of our proposed adaptive fusion module and binary classifier.

3.1 Problem Statement

We are given a training set of videos \(\{V_i\}\), where each video \(V\) contains both visual and corresponding audio information, along with a video-level label \(y \in \mathbb{R}^{C}\). Here, \(C\) denotes the number of categories (including the normal class and various anomaly classes). To facilitate model processing, we employ a video encoder and an audio encoder to extract high-level features \(X_v \in \mathbb{R}^{N \times d}\) and \(X_a \in \mathbb{R}^{N \times d}\), respectively, where \(N\) represents the temporal length of the video (i.e., the number of frames or snippets) and \(d\) denotes the feature dimensionality. The objective of the WSVAD task is to train a detector using all available \(X_v\), \(X_a\), and their corresponding labels from the training set, enabling the model to accurately determine whether each frame in a test sample is anomalous and to identify the specific anomaly category.

The overall pipeline of our method, as shown in Figure 2, starts with extracting features from video and audio using dedicated encoders, then adaptively fuses them for multimodal correspondence learning. We combine a classification branch with a CLIP-based alignment approach, using an audio-visual prompt to inject fine-grained multimodal information into text embeddings. Additionally, an uncertainty-driven distillation is employed to improve anomaly detection robustness in scenarios with incomplete modalities.

3.2 Video and Audio Encoders

Video encoder. Leveraging CLIP’s robust cross-modal representation, we use its image encoder (ViT-B/16) as the video encoder, in contrast to traditional models like C3D and I3D, which are less effective in capturing semantic relationships. We extract features from sampled video frames using CLIP, but to address CLIP’s lack of temporal modeling, we incorporate a lightweight temporal model, such as a graph convolutional network (GCN) [8] or a temporal Transformer [13], to capture temporal dependencies. This approach ensures efficient transfer of CLIP’s pre-trained knowledge to the WSVAD task.

Audio encoder. For audio feature extraction, we use Wav2CLIP [53], a CLIP-based model that maps audio signals into the same semantic space as images and text. The audio is first converted into spectrograms, then sampled to match the number of video frames. These audio segments are processed by Wav2CLIP to extract features. To capture contextual relationships, we apply a temporal convolution layer [54], which models local temporal dependencies, preserving key dynamics within the audio modality.
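To make the encoder stage concrete, the following is a minimal sketch of frozen feature extraction and temporal alignment, assuming the wrappers `clip_image_encoder` and `wav2clip_encoder` are hypothetical stand-ins for the actual CLIP ViT-B/16 and Wav2CLIP models; linear interpolation is just one simple way to match the audio features to the video's temporal length.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_aligned_features(frames, audio_segments, clip_image_encoder, wav2clip_encoder):
    """Frozen feature extraction followed by temporal alignment of audio to video."""
    x_v = clip_image_encoder(frames)         # (N, d): one feature per sampled frame
    x_a = wav2clip_encoder(audio_segments)   # (M, d): one feature per audio segment
    # Resample audio features along time so that both streams have length N.
    x_a = F.interpolate(x_a.t().unsqueeze(0), size=x_v.shape[0],
                        mode="linear", align_corners=False).squeeze(0).t()
    return x_v, x_a  # both (N, d); a temporal model (GCN/Transformer for video,
                     # temporal convolution for audio) is applied afterwards
```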

3.3 Audio-Visual Adaptive Fusion

In multimodal feature fusion, while both video and audio contain valuable semantic information, their importance often varies depending on the specific task. Inspired by human perception mechanisms [55], our approach follows a vision-centric, audio-assisted paradigm, where video features serve as the primary modality, and audio features complement and enhance visual information. To preserve the generalization capability of the original CLIP model in downstream tasks while avoiding the introduction of excessive trainable parameters, we design a lightweight adaptive fusion module that integrates audio features without significantly increasing computational overhead. We present the structure of this fusion module in Figure 3.

Specifically, given the video feature \(X_v\) and audio feature \(X_a\), we first concatenate them to obtain a joint representation \(X_{a+v}\in\mathbb{R}^{N \times 2d}\), which is then processed by two projection networks to generate the adaptive weight and residual feature. The first projection network computes adaptive fusion weights \(W\), which determine the contribution of audio at each time step [56]. This is achieved through a linear transformation followed by a sigmoid activation: \[W = \text{Sigmoid}(\text{Linear}(X_{a+v})) \in \mathbb{R}^{N \times d}\]

The second projection network is responsible for residual mapping, which transforms \(X_{a+v}\) into a residual feature \(X_{\text{res}}\) that encodes the fused information from both modalities: \[X_{\text{res}} = \text{Linear}(\text{GELU}(\text{Linear}(X_{a+v}))) \in \mathbb{R}^{N \times d}\]

Finally, the fused representation \(X_{av}\) is obtained by adaptively incorporating the residual feature into the original video feature: \[X_{av} = X_v + W \odot X_{\text{res}}\] where \(\odot\) denotes element-wise multiplication. The adaptive weight \(W\) dynamically adjusts the degree of audio integration, ensuring that video features remain dominant while audio features provide auxiliary information. Additionally, the residual mapping enhances the expressiveness of the fused representation by capturing nonlinear transformations. By introducing an adaptive fusion mechanism and maintaining a lightweight design, our fusion approach effectively balances efficiency and expressiveness, leveraging the complementarity of visual and audio modalities while minimizing computational overhead.
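The fusion described above maps directly onto a small PyTorch module. The sketch below follows the three equations (gating weights, residual mapping, gated residual addition); the hidden width of the residual projection and the feature dimension d=512 are assumptions not specified in the text.

```python
import torch
import torch.nn as nn

class AdaptiveAVFusion(nn.Module):
    """Vision-centric adaptive fusion: X_av = X_v + W ⊙ X_res."""
    def __init__(self, d=512, hidden=512):
        super().__init__()
        # Projection 1: adaptive fusion weights W in (0, 1)
        self.weight_proj = nn.Sequential(nn.Linear(2 * d, d), nn.Sigmoid())
        # Projection 2: residual mapping of the concatenated features
        self.res_proj = nn.Sequential(nn.Linear(2 * d, hidden), nn.GELU(), nn.Linear(hidden, d))

    def forward(self, x_v, x_a):                 # both (B, N, d)
        x_cat = torch.cat([x_v, x_a], dim=-1)    # joint representation (B, N, 2d)
        w = self.weight_proj(x_cat)              # per-step contribution of audio
        x_res = self.res_proj(x_cat)             # fused residual feature
        return x_v + w * x_res                   # video remains the dominant modality

# Example: fuse 512-d features of two 256-frame clips
x_v, x_a = torch.randn(2, 256, 512), torch.randn(2, 256, 512)
x_av = AdaptiveAVFusion()(x_v, x_a)              # (2, 256, 512)
```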

3.4 Dual Branch Framework with Prompts

We adopt a two-branch framework [13] for the WSVAD task, consisting of a classification branch and an alignment branch, which jointly exploit audio-visual information to improve detection accuracy.

Classification branch consists of a lightweight binary classifier (as shown in Figure 3), which takes \(X_{av}\) as input and directly predicts the frame-level anomaly confidence \(A\).

Alignment branch leverages the cross-modal semantic alignment mechanism, which computes the similarity between frame-level features and class label features. To obtain class label representations, we leverage the CLIP text encoder combined with the learnable textual prompt [57] and audio-visual prompt to extract class embeddings, ensuring unified semantic alignment between visual and textual modalities. Given a set of predefined class labels (e.g., “normal”, “fighting”), we first introduce a learnable textual prompt; then we concatenate the textual prompt with class labels and feed them into the CLIP text encoder to obtain the class representation \(X_c\). Compared to a manually defined prompt, the learnable prompt allows the model to dynamically adjust textual representations during training, making them more suitable for the specific requirements of WSVAD. Furthermore, we incorporate an audio-visual prompt into the class label features to enrich the class representations with additional multimodal information.

The proposed audio-visual prompt mechanism aims to dynamically inject instance-level key audio-visual information into text labels to enhance the representation. Specifically, we leverage the anomaly confidence \(A\) from the classification branch and audio-visual features \(X_{av}\) to generate a video-level global representation: \[X_{p} = \text{Norm}\left(A^{\top}X_{av}\right) \in \mathbb{R}^{d} \label{con:attention}\tag{1}\] where Norm represents the normalization operation. Next, we calculate the similarity matrix \(S_p\) between the class representation \(X_c\) and the global representation \(X_{p}\) to measure the alignment between class labels and videos: \[S_p = \text{Softmax}\left(X_c X_{p}^{\top}\right)\] Based on \(S_p\), we generate the enhanced instance-level audio-visual prompt \(X_{mp}\): \[X_{mp} = S_p X_{av} \in \mathbb{R}^{d}\] This operation dynamically adjusts the class representation’s focus on different video instances by calculating the similarity between global audio-visual features and class labels, thereby enhancing cross-modal alignment.

Then, we add \(X_{mp}\) and the class representation \(X_c\), followed by a feed-forward network (FFN) transformation and a skip connection to obtain the final instance-specific class embedding \(X_{cp}\): \[X_{cp} = \text{FFN}\left(\text{ADD}\left(X_{mp}, X_{c}\right)\right)+X_{c} \label{con:visual_prompt}\tag{2}\] where ADD represents element-wise addition.
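A compact sketch of this prompt computation is given below. Since the text leaves some shapes implicit (the prompt is written as a d-dimensional vector but is combined with the C×d class matrix), the broadcast of the global audio-visual vector over classes and the two-layer FFN are assumptions here, and Norm is read as L2 normalization.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualPrompt(nn.Module):
    """Injects video-level audio-visual context into class embeddings (Eqs. 1-2)."""
    def __init__(self, d=512):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, x_av, a, x_c):
        # x_av: (N, d) fused frame features, a: (N, 1) anomaly confidences from the
        # classification branch, x_c: (C, d) class embeddings from the text encoder.
        x_p = F.normalize(a.t() @ x_av, dim=-1).squeeze(0)   # Eq. 1: global prompt, (d,)
        s_p = torch.softmax(x_c @ x_p, dim=0)                # class-video similarity, (C,)
        x_mp = s_p.unsqueeze(-1) * x_p.unsqueeze(0)          # instance-level prompt, (C, d)
        return self.ffn(x_mp + x_c) + x_c                    # Eq. 2: instance-specific classes
```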

This two-branch framework provides anomaly confidence through the classification branch and refines category identification with class information via the alignment branch, improving robustness and enabling fine-grained anomaly detection.

3.5 Optimization of Audio-Visual Model

For the classification branch, we adopt the Top-K mechanism, as proposed in previous work [17], to select the top \(K\) anomaly confidence values from both normal and abnormal videos, which are averaged as the video-level prediction. The classification loss \(\mathcal{L}_{BCE}\) is then computed using binary cross-entropy between the prediction and the ground-truth class.

In the case of the alignment branch, the MIL-Align mechanism [13] is applied. We compute an alignment map \(M\), reflecting the similarity between frame-level features \(X_{av}\) and all category embeddings \(X_{cp}\). For each row in \(M\), the top \(K\) similarities are selected and their average is used to quantify the alignment between the video and the current class. This results in a vector \(S = \{s_1, \ldots, s_m\}\) representing the similarity between the video and all possible classes. The multi-class prediction is then calculated as: \[p_i = \frac{\exp\left(s_i/\tau\right)}{\sum_j{\exp\left(s_j/\tau\right)}} \label{con:nce}\tag{3}\] where \(p_i\) represents the prediction for the \(i^{th}\) class, and \(\tau\) is the temperature scaling parameter. Then, we compute \(\mathcal{L}_{NCE}\) based on cross-entropy. Besides, to address the class imbalance in WSVAD, where normal samples dominate and anomaly instances are sparse, we employ the focal loss [58]. Finally, the overall loss \(\mathcal{L}_{ALIGN}\) for the alignment branch is the average of \(\mathcal{L}_{NCE}\) and \(\mathcal{L}_{FOCAL}\).
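The two training objectives can be sketched as follows; the Top-K size K=16 and temperature τ=0.07 are illustrative values rather than the paper's settings, the frame confidences are assumed to be sigmoid outputs in [0, 1], and the focal loss term is omitted for brevity.

```python
import torch
import torch.nn.functional as F

def topk_mean(scores, k):
    """Average of the top-K values along the last (temporal) dimension."""
    return scores.topk(k, dim=-1).values.mean(dim=-1)

def classification_loss(anomaly_conf, video_labels, k=16):
    # anomaly_conf: (B, N) frame-level anomaly confidences in [0, 1], video_labels: (B,) in {0, 1}
    video_pred = topk_mean(anomaly_conf, k)                  # video-level prediction
    return F.binary_cross_entropy(video_pred, video_labels.float())

def mil_align_loss(alignment_map, class_targets, k=16, tau=0.07):
    # alignment_map M: (B, N, C) frame-vs-class similarities, class_targets: (B,) class ids
    s = topk_mean(alignment_map.transpose(1, 2), k)          # (B, C) video-class similarity
    return F.cross_entropy(s / tau, class_targets)           # softmax of Eq. 3 + cross-entropy
```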

3.6 Uncertainty-Driven Distillation

In the WSVAD task, audio serves as a complementary modality to video, enhancing detection accuracy. However, audio may be unavailable in practical scenarios, leading to performance degradation. To address this, we apply knowledge distillation (KD) by using a pre-trained multi-modal (video+audio) teacher model to guide a unimodal (video-only) student model, ensuring robust anomaly detection even without audio. Traditional knowledge distillation methods typically assume a deterministic transfer of knowledge, employing mean square error (MSE) loss to align the student model with the teacher’s feature representations. However, this approach fails to account for the inherent uncertainty in audio-visual feature fusion. In real-world scenarios, factors such as noisy audio or occluded visual content can introduce distortions in the fused features, leading to inaccurate feature representations and diminished generalization capability.

To overcome this, we propose a probabilistic uncertainty distillation strategy [59], [60], which models data uncertainty during distillation, improving the student model’s robustness across diverse scenarios. Specifically, assume \(X_{av,i} = X_{vs,i} + \epsilon\sigma_i\), where \(\epsilon\sim \mathcal{N}(\mathbf{0},\mathbf{I})\), \(X_{vs}\) represents enhanced visual features derived from the student model, and \(\sigma_i\) refers to the inherent uncertainty between the \(i^{th}\) pair of features. We then model the observation as a Gaussian likelihood function to more accurately quantify data uncertainty in the feature distillation. The relationship between the audio-visual fusion feature \(X_{av,i}\) and the unimodal feature \(X_{vs,i}\) is formulated as: \[p(X_{av,i} | X_{vs,i}, {\theta}) = \frac{1}{\sqrt{2\pi {\sigma}_i^2}} \exp\left(-\frac{\| X_{av,i} - X_{vs,i} \|^2}{2{\sigma}_i^2}\right) \label{eq:pi}\tag{4}\] where \(\theta\) denotes the model parameters. To maximize the likelihood for each pair of features \(X_{av,i}\) and \(X_{vs,i}\), we adopt the log-likelihood form: \[\ln p(X_{av,i} | X_{vs,i}, {\theta}) = -\frac{\| X_{av,i} - X_{vs,i} \|^2}{2{\sigma}_i^2} - \frac{\ln 2\pi {\sigma}_i^2}{2} \label{eq:log_pi}\tag{5}\]

In practice, we design a network branch (a simple three-layer convolutional neural network) to predict the variance \({\sigma}_i^2\) and reformulate the likelihood maximization problem as the minimization of a loss function. Specifically, we employ an uncertainty-weighted MSE loss: \[\mathcal{L}_{UKD} = \frac{1}{L} \sum_{i=1}^{L} \left[ \frac{\| X_{av,i} - X_{vs,i} \|^2 }{\sigma_{i}^2}+ \ln {\sigma}_i^2 \right] \label{eq:loss_pi}\tag{6}\] where \(L\) represents the number of feature pairs, and the constant term is omitted for clarity.

During the distillation process, the student model not only learns the unimodal feature \(X_{vs,i}\) from the teacher model but also considers the feature uncertainty \({\sigma}_i^2\) to optimize its learning strategy. Specifically, the first term of the loss function represents the feature similarity between the student and teacher models, normalized by \({\sigma}_i^2\). This assigns smaller weights to features with higher uncertainty, thereby avoiding overfitting to hard-to-learn information. The second term acts as a regularization term to prevent \({\sigma}_i^2\) from becoming too small, ensuring effective distillation.
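For reference, a minimal sketch of the uncertainty-weighted distillation loss and a variance-prediction branch is shown below. Predicting log σ² rather than σ² is a common numerical-stability choice and is an assumption here, as is the exact layout of the three-layer convolutional head.

```python
import torch
import torch.nn as nn

def ukd_loss(x_av, x_vs, log_var):
    """Eq. 6: high-variance (uncertain) pairs are down-weighted; ln(sigma^2) regularizes."""
    # x_av: (B, N, d) teacher audio-visual features, x_vs: (B, N, d) student visual
    # features, log_var: (B, N, 1) predicted log-variance per feature pair.
    sq_err = (x_av - x_vs).pow(2).sum(dim=-1)                  # ||X_av - X_vs||^2
    log_var = log_var.squeeze(-1)
    return (sq_err * torch.exp(-log_var) + log_var).mean()

class VarianceHead(nn.Module):
    """Small convolutional branch predicting per-step log-variance from student features."""
    def __init__(self, d=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(d, d, 3, padding=1), nn.ReLU(),
            nn.Conv1d(d, d, 3, padding=1), nn.ReLU(),
            nn.Conv1d(d, 1, 3, padding=1))

    def forward(self, x_vs):                                   # (B, N, d) -> (B, N, 1)
        return self.net(x_vs.transpose(1, 2)).transpose(1, 2)
```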

Ultimately, during the inference phase, we input only the video and perform anomaly detection with the unimodal student model in audio-missing scenarios.

4 Experiments

4.1 Datasets and Evaluation

XD-Violence. As the largest publicly available audio-visual WSVAD dataset, XD-Violence [17] significantly surpasses existing datasets in scale and diversity. It comprises 3,954 training videos and 800 test videos, with the test set containing 500 violent and 300 non-violent instances. The dataset covers six distinct categories of violent events, including abuse, car accidents, explosion, fighting, riot, and shooting, which occur at various temporal locations within videos.

CCTV-Fights\(_{sub}\). Derived from CCTV-Fights [61], CCTV-Fights\(_{sub}\) [62] is a carefully curated subset designed for audio-visual anomaly detection. The subset retains 644 high-quality videos depicting real-world fight scenarios, each with meaningful audio content, making it a valuable resource for evaluating audio-visual anomaly detection methods in real-world surveillance contexts.

Evaluation. For performance evaluation, we adopt distinct metrics tailored to different granularities of WSVAD tasks. For coarse-grained WSVAD, we employ frame-level Average Precision (AP), which provides a comprehensive measure of detection accuracy across varying confidence thresholds. For fine-grained anomaly detection, we utilize mean Average Precision (mAP) [14] computed across multiple intersection over union (IoU) thresholds and the average mAP (AVG) across different thresholds.
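As a concrete reference, the coarse-grained metric can be computed by concatenating frame-level ground truth and predictions over the whole test set and calling scikit-learn's average precision; this mirrors the standard XD-Violence protocol, though the exact evaluation scripts of prior work may differ in details.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def frame_level_ap(per_video_labels, per_video_scores):
    """Frame-level AP: labels are binary anomaly ground truth, scores are anomaly confidences."""
    y_true = np.concatenate(per_video_labels)
    y_score = np.concatenate(per_video_scores)
    return average_precision_score(y_true, y_score)
```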

4.2 Implementation Details

We conduct experiments on an NVIDIA RTX 4090 GPU. The visual enhancement module is a single-layer 1D convolutional network, comprising a convolutional layer with a kernel size of 3 and padding of 1, a ReLU activation function, and a skip connection; this operation effectively aggregates local contextual information. For input processing, we employ a frame selection strategy tailored to different datasets, sampling one frame per 16 frames for XD-Violence and one frame per 4 frames for CCTV-Fights\(_{sub}\), using a uniform sampling strategy with a maximum frame count of 256. During optimization, we set the batch size, learning rate, and total number of epochs to 96, \(1\times10^{-5}\), and 10, respectively.
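Given the specification above, the visual enhancement module might look like the following sketch; the feature dimension d=512 is an assumption.

```python
import torch.nn as nn

class VisualEnhancement(nn.Module):
    """Single-layer 1D conv (kernel 3, padding 1) + ReLU + skip connection,
    aggregating local temporal context over frame features."""
    def __init__(self, d=512):
        super().__init__()
        self.conv = nn.Conv1d(d, d, kernel_size=3, padding=1)
        self.relu = nn.ReLU()

    def forward(self, x):                                      # x: (B, N, d)
        h = self.relu(self.conv(x.transpose(1, 2))).transpose(1, 2)
        return x + h                                           # skip connection
```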

4.3 Comparison with State-of-the-Art Methods

Performance comparison on XD-Violence

Our experiments evaluate both coarse-grained and fine-grained anomaly detection performance on XD-Violence, comparing our AVadCLIP against state-of-the-art approaches, as shown in Tables 1 and 2.

For coarse-grained anomaly detection, using only RGB inputs, AVadCLIP\(^*\) (\(*\) denotes RGB-only input) achieves an AP score of 85.53%, surpassing all existing vision-only methods. Notably, it outperforms VadCLIP—the previous best-performing RGB-only approach—by 1.0%, demonstrating superior visual anomaly detection. When incorporating audio, AVadCLIP further improves performance, significantly outperforming all multimodal baselines, achieving a remarkable 4.9% gain over the latest AVCL [62] method.

For fine-grained anomaly detection, AVadCLIP consistently outperforms all competitors across different IoU thresholds, as detailed in Table 2. With RGB-only input, AVadCLIP\(^*\) surpasses VadCLIP at all IoU thresholds, achieving an AVG improvement of 2.7%. Similarly, the full-modality model AVadCLIP leads across all metrics, boosting the AVG by 3.9%. These results highlight the effectiveness of multimodal learning in precisely localizing anomaly boundaries and improving category predictions.

Overall, AVadCLIP achieves state-of-the-art performance in both unimodal and multimodal settings across coarse- and fine-grained anomaly detection tasks. The comprehensive results validate its effectiveness in leveraging audio-visual collaboration and demonstrate the feasibility of the uncertainty-driven distillation strategy.

Table 1: Coarse-grained comparisons on XD-Violence.
Method Reference Modality AP(%)
DeepMIL [4] CVPR 2018 RGB(ViT) 75.18
Wu et al. [17] ECCV 2020 RGB(ViT) 80.00
RTFM [63] ICCV 2021 RGB(ViT) 78.27
AVVD [14] TMM 2022 RGB(ViT) 78.10
Ju et al. [11] ECCV 2022 RGB(ViT) 76.57
DMU [18] AAAI 2023 RGB(ViT) 82.41
CLIP-TSA [64] ICIP 2023 RGB(ViT) 82.17
VadCLIP [13] AAAI 2024 RGB(ViT) 84.51
AVadCLIP\(^*\) this work RGB(ViT) 85.53
FVAI [52] ICASSP 2021 RGB(I3D)+Audio 81.69
MACIL-SD [51] ACM MM 2022 RGB(I3D)+Audio 81.21
CUPL [65] CVPR 2023 RGB(I3D)+Audio 81.43
AVCL [62] TMM 2025 RGB(I3D)+Audio 81.11
AVadCLIP this work RGB(ViT)+Audio 86.04
Table 2: Fine-grained comparisons on XD-Violence.
Method mAP@IoU(%)
  0.1 0.2 0.3 0.4 0.5 AVG
Random 1.82 0.92 0.48 0.23 0.09 0.71
DeepMIL [4] 22.72 15.57 9.98 6.20 3.78 11.65
AVVD [14] 30.51 25.75 20.18 14.83 9.79 20.21
VadCLIP [13] 37.03 30.84 23.38 17.90 14.31 24.70
AVadCLIP\(^*\) 39.63 32.77 26.84 21.58 16.39 27.44
AVadCLIP 41.89 34.61 27.08 22.16 17.30 28.61
Table 3: Coarse-grained comparisons on CCTV-Fights\(_{sub}\).
Method Reference Modality AP(%)
VadCLIP [13] AAAI 2024 RGB(ViT) 72.78
AVadCLIP\(^*\) this work RGB(ViT) 73.36
MACIL-SD [51] ACM MM 2022 RGB(I3D)+Audio 72.92
DMU [18] AAAI 2023 RGB(I3D)+Audio 72.97
AVCL [62] TMM 2025 RGB(I3D)+Audio 73.20
AVadCLIP this work RGB(ViT)+Audio 73.38

Performance comparison on CCTV-Fights\(_{sub}\)

The coarse-grained anomaly detection results on CCTV-Fights\(_{sub}\) are presented in Table 3. For RGB-only methods, AVadCLIP\(^*\) achieves 73.36% AP, surpassing the state-of-the-art VadCLIP and demonstrating the effectiveness of our approach in unimodal scenarios. For audio-visual scenarios, AVadCLIP further improves performance, outperforming all existing methods. These results indicate that incorporating audio information can further enhance anomaly detection performance, validating the effectiveness of cross-modal complementary information mining.

4.4 Ablation Studies

Table 4: Effectiveness of designed modules on XD-Violence.
AV Fusion AV Prompt \(\mathcal{L}_{FOCAL}\) AP(%) AVG(%)
\(\times\) \(\times\) \(\times\) 79.85 27.89
\(\surd\) \(\times\) \(\times\) 82.90 26.63
\(\surd\) \(\surd\) \(\times\) 86.18 26.79
\(\surd\) \(\surd\) \(\surd\) 86.04 28.61
Table 5: Effectiveness of audio-visual fusion on XD-Violence.
Method AP(%) AVG(%)
Cross Attention 75.15 10.51
Element-wise Addition 83.02 27.66
Concat+Linear Projection 83.36 28.88
Adaptive Fusion 86.04 28.61
Table 6: Effectiveness of UKD on XD-Violence.
Method AP(%) AVG(%)
Audio Model w/o UKD 50.89 12.20
Audio Model w/ UKD 52.51 13.50
Visual Model w/o UKD 84.60 22.92
Visual Model w/ UKD 85.53 27.44
Audio-Visual Model 86.04 28.61

The effect of audio-visual adaptive fusion

From Table 4, it can be observed that the introduction of audio-visual fusion improves detection performance. Furthermore, Table 5 presents the impact of different audio-visual fusion strategies on anomaly detection performance. First, the cross attention fusion performs poorly in the WSVAD task, indicating that although it can capture the relationships between modalities, its complex parameterized design may negatively impact the generalization ability of the CLIP model in downstream WSVAD tasks. Next, the simple element-wise addition strategy achieves an AP of 83.02% and an AVG of 27.66%. Then, the concatenation with linear projection approach improves the AP to 83.36% and the AVG to 28.88%, indicating that enhancing feature representation through linear transformation facilitates more effective cross-modal information capture. Finally, our proposed adaptive fusion strategy achieves the best AP of 86.04%, outperforming the other three methods overall. This demonstrates that our adaptive fusion strategy, as a lightweight and effective design, can more effectively exploit complementary information between the audio and visual modalities.

Figure 4: Coarse- and Fine-grained WSVAD visualization results of AVadCLIP and Baseline on XD-Violence.

The effect of audio-visual prompt and \(\mathcal{L}_{FOCAL}\)

As presented in Table 4, the baseline model achieves an AP of only 79.85%. Integrating the audio-visual prompt on top of the adaptive fusion mechanism significantly enhances performance, increasing the AP to 86.18%. This improvement underscores the effectiveness of the audio-visual prompt in capturing critical multimodal patterns, thereby facilitating more precise anomaly recognition. Furthermore, incorporating focal loss into the model contributes to refining anomaly boundary detection, leading to more stable performance in fine-grained anomaly localization. In summary, the audio-visual prompt primarily enhances coarse-grained anomaly detection, and focal loss further refines boundary precision, enabling the model to achieve optimal performance across both AP and AVG metrics.

The effect of uncertainty-driven distillation

As shown in Table 6, the proposed UKD mechanism significantly enhances anomaly detection performance in both visual-only and audio-only models. Specifically, in the visual-only setting, UKD achieves a 0.9% improvement in AP and a 4.5% increase in AVG, attaining performance levels comparable to the teacher model trained with audio-visual inputs. Similarly, the audio-only model also benefits from UKD, exhibiting consistent performance gains. These results highlight the effectiveness of UKD in leveraging data uncertainty to enhance the robustness of unimodal representations during the distillation process, making it particularly well-suited for real-world applications where modality incompleteness is prevalent.

4.5 Qualitative Results

In Figure 4, we present the qualitative visualizations of AVadCLIP and the baseline model for both coarse-grained and fine-grained WSVAD. The blue curves denote the anomaly predictions by AVadCLIP, whereas the yellow curves represent those by the baseline model. As illustrated, compared to the baseline model, AVadCLIP significantly reduces anomaly confidence in normal video segments, thereby distinguishing between abnormal and normal regions more accurately. The fine-grained map below also indicates that AVadCLIP can predict categories with greater precision. Notably, the observed improvement supports our hypothesis that audio information is particularly advantageous under visual occlusion (shooting) or in acoustically dominant scenes (explosion), and can effectively eliminate ambiguity among visually similar patterns in anomaly detection scenes, thereby ensuring more robust detection performance.

5 Conclusion

In this work, we propose a novel weakly supervised framework for robust video anomaly detection using audio-visual collaboration. Leveraging the powerful representation ability and cross-modal alignment capability of CLIP, we design two distinct modules to achieve efficient audio-visual collaboration and multimodal anomaly detection, based on the frozen CLIP model. Specifically, to seamlessly integrate audio-visual information, we introduce a lightweight fusion mechanism that adaptively generates fusion weights based on the importance of audio to assist visual information. Additionally, we propose an audio-visual prompt strategy that dynamically refines text embeddings with key multimodal features, strengthening the semantic alignment between video content and corresponding textual labels. To further bolster robustness in scenarios with missing modalities, we develop an uncertainty-driven distillation module that synthesizes audio-visual representations from visual inputs, focusing on challenging features. Experimental results across two benchmarks demonstrate that our framework effectively enables video-audio anomaly detection and enhances the model’s robustness in scenarios with incomplete modalities. In the future, we will explore the integration of additional modalities (e.g., textual description) based on VLMs to achieve more robust video anomaly detection.

References

[1]
W. Luo, W. Liu, D. Lian, and S. Gao, “Future frame prediction network for video anomaly detection,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 11, pp. 7505–7520, 2021.
[2]
M. I. Georgescu, R. T. Ionescu, F. S. Khan, M. Popescu, and M. Shah, “A background-agnostic framework with adversarial training for abnormal event detection in video,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 9, pp. 4505–4523, 2021.
[3]
P. Wu, C. Pan, Y. Yan, G. Pang, P. Wang, and Y. Zhang, “Deep learning for video anomaly detection: A review,” arXiv preprint arXiv:2409.05383, 2024.
[4]
W. Sultani, C. Chen, and M. Shah, “Real-world anomaly detection in surveillance videos,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6479–6488.
[5]
M. Z. Zaheer, A. Mahmood, M. Astrid, and S.-I. Lee, “CLAWS: Clustering assisted weakly supervised learning with normalcy suppression for anomalous event detection,” in Computer Vision – ECCV 2020, Springer, 2020, pp. 358–376.
[6]
C. Huang et al., “Weakly supervised video anomaly detection via self-guided temporal discriminative transformer,” IEEE Transactions on Cybernetics, 2022.
[7]
Z. Yang, J. Liu, and P. Wu, “Text prompt with normality guidance for weakly supervised video anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 18899–18908.
[8]
J.-X. Zhong, N. Li, W. Kong, S. Liu, T. H. Li, and G. Li, “Graph convolutional label noise cleaner: Train a plug-and-play action classifier for anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 1237–1246.
[9]
S. Paul, S. Roy, and A. K. Roy-Chowdhury, “W-TALC: Weakly-supervised temporal activity localization and classification,” in Proceedings of the European Conference on Computer Vision, 2018, pp. 563–579.
[10]
A. Radford et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning, PMLR, 2021, pp. 8748–8763.
[11]
C. Ju, T. Han, K. Zheng, Y. Zhang, and W. Xie, “Prompting visual-language models for efficient video understanding,” in Computer Vision – ECCV 2022, Springer, 2022, pp. 105–124.
[12]
H. Xu et al., “Videoclip: Contrastive pre-training for zero-shot video-text understanding,” arXiv preprint arXiv:2109.14084, 2021.
[13]
P. Wu et al., “Vadclip: Adapting vision-language models for weakly supervised video anomaly detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2024, vol. 38, pp. 6074–6082.
[14]
P. Wu, X. Liu, and J. Liu, “Weakly supervised audio-visual violence detection,” IEEE Transactions on Multimedia, pp. 1674–1685, 2022.
[15]
Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu, “Audio-visual event localization in unconstrained videos,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 247–263.
[16]
Y. Tian, D. Li, and C. Xu, “Unified multisensory perception: Weakly-supervised audio-visual video parsing,” in Computer Vision – ECCV 2020, Springer, 2020, pp. 436–454.
[17]
P. Wu et al., “Not only look, but also listen: Learning multimodal violence detection under weak supervision,” in Computer Vision – ECCV 2020, Springer, 2020, pp. 322–339.
[18]
H. Zhou, J. Yu, and W. Yang, “Dual memory units with uncertainty regulation for weakly supervised video anomaly detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2023.
[19]
J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the Kinetics dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
[20]
D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri, “Learning spatiotemporal features with 3D convolutional networks,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 4489–4497.
[21]
J. F. Gemmeke et al., “Audio Set: An ontology and human-labeled dataset for audio events,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2017, pp. 776–780.
[22]
C. Huang et al., “Self-supervised attentive generative adversarial networks for video anomaly detection,” IEEE transactions on neural networks and learning systems, vol. 34, no. 11, pp. 9389–9403, 2022.
[23]
C. Shi, C. Sun, Y. Wu, and Y. Jia, “Video anomaly detection via sequentially learning multiple pretext tasks,” in Proceedings of the IEEE International Conference on Computer Vision, 2023, pp. 10330–10340.
[24]
C. Huang, J. Wen, C. Liu, and Y. Liu, “Long short-term dynamic prototype alignment learning for video anomaly detection,” in Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, 2024, pp. 866–874.
[25]
Y. Cong, J. Yuan, and J. Liu, “Sparse reconstruction cost for abnormal event detection,” in CVPR 2011, IEEE, 2011, pp. 3449–3456.
[26]
W. Luo, W. Liu, and S. Gao, “A revisit of sparse coding based anomaly detection in stacked RNN framework,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 341–349.
[27]
W. Liu, W. Luo, D. Lian, and S. Gao, “Future frame prediction for anomaly detection – a new baseline,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6536–6545.
[28]
C. Cao, H. Zhang, Y. Lu, P. Wang, and Y. Zhang, “Scene-dependent prediction in latent space for video anomaly detection and anticipation,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[29]
M. Hasan, J. Choi, J. Neumann, A. K. Roy-Chowdhury, and L. S. Davis, “Learning temporal regularity in video sequences,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 733–742.
[30]
D. Gong et al., “Memorizing normality to detect anomaly: Memory-augmented deep autoencoder for unsupervised anomaly detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2019, pp. 1705–1714.
[31]
Z. Yang, J. Liu, Z. Wu, P. Wu, and X. Liu, “Video event restoration based on keyframes for video anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 14592–14601.
[32]
S. Li, F. Liu, and L. Jiao, “Self-training multi-sequence learning with transformer for weakly supervised video anomaly detection,” in Proceedings of the AAAI Conference on Artificial Intelligence, 2022, vol. 36, pp. 1395–1403.
[33]
J.-C. Feng, F.-T. Hong, and W.-S. Zheng, “MIST: Multiple instance self-training framework for video anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2021, pp. 14009–14018.
[34]
M. Cho, M. Kim, S. Hwang, C. Park, K. Lee, and S. Lee, “Look around for anomalies: Weakly-supervised anomaly detection via context-motion relational learning,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 12137–12146.
[35]
F.-L. Chen et al., “Vlp: A survey on vision-language pre-training,” Machine Intelligence Research, vol. 20, no. 1, pp. 38–56, 2023.
[36]
L. Zanella, W. Menapace, M. Mancini, Y. Wang, and E. Ricci, “Harnessing large language models for training-free video anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 18527–18536.
[37]
Y. Yang, K. Lee, B. Dariush, Y. Cao, and S.-Y. Lo, “Follow the rules: Reasoning for video anomaly detection with large language models,” arXiv preprint arXiv:2407.10299, 2024.
[38]
P. Wu et al., “Weakly supervised video anomaly detection and localization with spatio-temporal prompts,” 2024.
[39]
P. Wu et al., “Open-vocabulary video anomaly detection,” 2024, pp. 18297–18307.
[40]
G. Li, Y. Wei, Y. Tian, C. Xu, J.-R. Wen, and D. Hu, “Learning to answer questions in dynamic audio-visual scenarios,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 19108–19118.
[41]
Y. Wei, D. Hu, Y. Tian, and X. Li, “Learning in audio-visual context: A review, analysis, and new perspective,” arXiv preprint arXiv:2208.09579, 2022.
[42]
Y. Chen et al., “Unraveling instance associations: A closer look for audio-visual segmentation,” 2024, pp. 26497–26507.
[43]
J. Ma, P. Sun, Y. Wang, and D. Hu, “Stepping stones: A progressive training strategy for audio-visual semantic segmentation,” in European Conference on Computer Vision, Springer, 2024, pp. 311–327.
[44]
R. Guo et al., “Open-vocabulary audio-visual semantic segmentation,” 2024, pp. 7533–7541.
[45]
X. He et al., “CACE-net: Co-guidance attention and contrastive enhancement for effective audio-visual event localization,” 2024, pp. 985–993.
[46]
H. Xu, R. Zeng, Q. Wu, M. Tan, and C. Gan, “Cross-modal relation-aware networks for audio-visual event localization,” in Proceedings of the 28th ACM International Conference on Multimedia, 2020, pp. 3893–3901.
[47]
J. Zhou, D. Guo, Y. Zhong, and M. Wang, “Advancing weakly-supervised audio-visual video parsing via segment-wise pseudo labeling,” International Journal of Computer Vision, vol. 132, no. 11, pp. 5308–5329, 2024.
[48]
J. Zhou, D. Guo, Y. Mao, Y. Zhong, X. Chang, and M. Wang, “Label-anticipated event disentanglement for audio-visual video parsing,” in European Conference on Computer Vision, Springer, 2024, pp. 35–51.
[49]
J. Chalk, J. Huh, E. Kazakos, A. Zisserman, and D. Damen, “TIM: A time interval machine for audio-visual action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2024, pp. 18153–18163.
[50]
Y. Liu, Z. Wu, M. Mo, J. Gan, J. Leng, and X. Gao, “Dual space embedding learning for weakly supervised audio-visual violence detection,” in IEEE International Conference on Multimedia and Expo (ICME), IEEE, 2024, pp. 1–6.
[51]
J. Yu, J. Liu, Y. Cheng, R. Feng, and Y. Zhang, “Modality-aware contrastive instance learning with self-distillation for weakly-supervised audio-visual violence detection,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 6278–6287.
[52]
W.-F. Pang, Q.-H. He, Y. Hu, and Y.-X. Li, “Violence detection in videos based on fusing visual and audio information,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2021, pp. 2260–2264.
[53]
H.-H. Wu, P. Seetharaman, K. Kumar, and J. P. Bello, “Wav2CLIP: Learning robust audio representations from CLIP,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2022, pp. 4563–4567.
[54]
C. Lea, R. Vidal, A. Reiter, and G. D. Hager, “Temporal convolutional networks: A unified approach to action segmentation,” in Computer Vision – ECCV 2016 Workshops, Springer, 2016, pp. 47–54.
[55]
X. Chen et al., “Whole-cortex in situ sequencing reveals input-dependent area identity,” Nature, pp. 1–10, 2024.
[56]
X. Chen, N. Mishra, M. Rohaninejad, and P. Abbeel, “PixelSNAIL: An improved autoregressive generative model,” in International Conference on Machine Learning, PMLR, 2018, pp. 864–872.
[57]
K. Zhou, J. Yang, C. C. Loy, and Z. Liu, “Learning to prompt for vision-language models,” International Journal of Computer Vision, vol. 130, no. 9, pp. 2337–2348, 2022.
[58]
T.-Y. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2980–2988.
[59]
J. Chang, Z. Lan, C. Cheng, and Y. Wei, “Data uncertainty learning in face recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2020, pp. 5710–5719.
[60]
Z. Yang, W. Dong, X. Li, J. Wu, L. Li, and G. Shi, “Self-feature distillation with uncertainty modeling for degraded image recognition,” in European Conference on Computer Vision, Springer, 2022, pp. 552–569.
[61]
M. Perez, A. C. Kot, and A. Rocha, “Detection of real-world fights in surveillance videos,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 2662–2666.
[62]
J. Meng, H. Tian, G. Lin, J.-F. Hu, and W.-S. Zheng, “Audio-visual collaborative learning for weakly supervised video anomaly detection,” IEEE Transactions on Multimedia, 2025.
[63]
Y. Tian, G. Pang, Y. Chen, R. Singh, J. W. Verjans, and G. Carneiro, “Weakly-supervised video anomaly detection with robust temporal feature magnitude learning,” in Proceedings of the IEEE International Conference on Computer Vision, 2021, pp. 4975–4986.
[64]
H. K. Joo, K. Vo, K. Yamazaki, and N. Le, “CLIP-TSA: CLIP-assisted temporal self-attention for weakly-supervised video anomaly detection,” in IEEE International Conference on Image Processing (ICIP), IEEE, 2023, pp. 3230–3234.
[65]
C. Zhang et al., “Exploiting completeness and uncertainty of pseudo labels for weakly supervised video anomaly detection,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2023, pp. 16271–16280.

  1. Corresponding Authors