April 06, 2025
With the increasing adoption of video anomaly detection in intelligent surveillance domains, conventional vision-only detection approaches often struggle with information insufficiency and high false-positive rates in complex environments. To address these limitations, we present a novel weakly supervised framework that leverages audio-visual collaboration for robust video anomaly detection. Capitalizing on the exceptional cross-modal representation learning capabilities of Contrastive Language-Image Pretraining (CLIP) across visual, audio, and textual domains, our framework introduces two major innovations: an efficient audio-visual fusion module that enables adaptive cross-modal integration through lightweight parametric adaptation while keeping the CLIP backbone frozen, and a novel audio-visual prompt mechanism that dynamically enhances text embeddings with key multimodal information based on the semantic correlation between audio-visual features and textual labels, significantly improving CLIP’s generalization to the video anomaly detection task. Moreover, to enhance robustness against modality deficiency during inference, we further develop an uncertainty-driven feature distillation module that synthesizes audio-visual representations from visual-only inputs. This module employs uncertainty modeling based on the diversity of audio-visual features to dynamically emphasize challenging features during the distillation process. Our framework demonstrates superior performance across multiple benchmarks, with audio integration significantly boosting anomaly detection accuracy in various scenarios. Notably, even on unimodal data, our approach enhanced by uncertainty-driven distillation consistently outperforms current unimodal VAD methods.
Video anomaly detection (VAD), as a pivotal technology in intelligent surveillance systems, focuses on identifying anomalous events within videos and has attracted substantial research interest in recent years [1]–[3]. Due to the rarity of anomalies and the high cost of manual annotation, fully supervised frameworks are impractical for large-scale deployment. As a solution, weakly supervised video anomaly detection (WSVAD) [4]–[7] has gained traction, requiring only video-level labels instead of detailed frame annotations. These methods seek to uncover latent anomalies under limited supervision, addressing challenges such as the scarcity of anomalous samples and high false-positive rates in complex environments.
Figure 1: Left: Illustration of audio-visual collaboration effects; Right: Illustration of our proposed distillation (UKD) effects.
Current WSVAD methods primarily rely on the multiple instance learning (MIL) framework, using video-level labels for model training [4], [8]. Specifically, these approaches treat videos as bags of segments (instances) and distinguish anomalous patterns through a hard attention mechanism (a.k.a. Top-K) [9]. With the rapid advancement of foundation models, Contrastive Language-Image Pretraining (CLIP) [10] has shown remarkable potential in various downstream tasks, including video understanding [11], [12]. Building on this success, recent methods like VadCLIP [13] and TPWNG [7] have advanced WSVAD by leveraging CLIP’s semantic alignment capabilities. However, these methods, whether CLIP-based or conventional, predominantly rely on unimodal visual information, which often leads to significant detection limitations in complex real-world scenarios. Visual occlusion, extreme lighting variations, and environmental noise can render visual features unreliable or ambiguous [14]–[16]. In these challenging conditions, multimodal information, particularly audio, offers indispensable contextual cues that can complement and enhance visual-based detection. For instance, audio remains robust when visual data is compromised, allowing detection of off-camera events. In acoustically rich environments, certain anomalies such as explosions, screams, or gunshots exhibit distinct acoustic signatures, making them more discriminative in the audio domain. Similarly, in low-light conditions where visual features degrade, audio serves as a critical supplementary modality. These observations underscore the importance of integrating audio and video modalities, as their complementary nature can significantly enhance the accuracy and robustness of anomaly detection systems in diverse and challenging environments. We illustrate the impact of audio-visual integration for WSVAD in Figure 1.
Existing attempts [14], [17], [18] to incorporate audio into video anomaly detection typically adopt traditional feature concatenation methods, such as fusing visual features extracted by I3D [19] or C3D [20] with audio features extracted by VGGish [21]. These approaches fail to fully exploit the potential of modern multimodal learning techniques, resulting in suboptimal cross-modal integration. Moreover, they overlook the inherent semantic correlations between visual and auditory modalities, which are essential for enhancing anomaly detection performance.
To address these limitations, we propose AVadCLIP, a WSVAD framework that leverages audio-visual collaborative learning to drive audio-visual anomaly detection through CLIP-powered cross-modal alignment. AVadCLIP fully exploits CLIP’s intrinsic capability to establish semantic consistency across vision, text, and audio, ensuring that video anomaly detection is performed within a unified multimodal semantic space rather than merely fusing raw features. Our framework introduces three significant innovations: an efficient audio-visual feature fusion mechanism that moves beyond naive feature concatenation and achieves adaptive cross-modal integration through lightweight parametric adaptation while keeping the CLIP backbone frozen; a novel audio-visual prompt mechanism that dynamically enriches text label embeddings with key multimodal information, enhancing contextual understanding of videos and enabling more precise identification of different categories; and an uncertainty-driven feature distillation (UKD) module that generates audio-visual-like enhanced features in audio-missing scenarios, ensuring robust anomaly detection performance (as illustrated in Figure 1). Overall, AVadCLIP relies on only a small set of trainable parameters, effectively transferring CLIP’s pretrained knowledge to the weakly supervised audio-visual anomaly detection task. Furthermore, by employing a distillation strategy based on data uncertainty modeling, we transfer the learned knowledge from our audio-visual anomaly detector to a unimodal detector, enabling robust anomaly detection in scenarios with incomplete modalities.
In summary, our main contributions are as follows:
\(\bullet\) We propose a weakly supervised video anomaly detection framework that harnesses audio-visual collaborative learning, leveraging CLIP’s multimodal alignment capabilities. By incorporating a lightweight adaptive audio-visual fusion mechanism and integrating audio-visual information through prompt-based learning, our approach effectively achieves CLIP-driven robust anomaly detection in multimodal settings.
\(\bullet\) We design an uncertainty-driven feature distillation module, which transforms deterministic estimation into probabilistic uncertainty estimation. This enables the model to capture feature distribution variance, ensuring robust anomaly detection performance even with unimodal data.
\(\bullet\) Extensive experiments on two WSVAD datasets demonstrate that our method achieves superior performance in audio-visual scenarios, while maintaining robust anomaly detection results even in audio-absent conditions.
Video Anomaly Detection has been extensively studied in recent years, with existing approaches broadly categorized into semi-supervised and weakly supervised methods. Among them, semi-supervised methods primarily rely on normal video clips for training and identify anomalies by detecting deviations from learned normal patterns during inference. These methods commonly adopt self-supervised learning techniques [22]–[24], such as reconstruction [25], [26] or prediction [27], [28]. Reconstruction-based methods assume that the model can effectively reconstruct normal videos, whereas abnormal videos—due to distributional discrepancies—result in significant reconstruction errors. Autoencoders [29], [30] are widely employed to capture normal pattern features, with reconstruction error serving as an anomaly indicator. Prediction-based methods [31] utilize models to forecast future frames, detecting anomalies based on prediction errors. However, a key limitation of semi-supervised methods is their tendency to overfit normal patterns, leading to poor generalization to unseen anomalies.
Weakly supervised methods, in contrast, typically adopt the MIL framework, requiring only video-level anomaly labels and significantly reducing annotation costs. A classic work is DeepMIL [4], which employs a ranking loss to distinguish normal from anomalous instances. To enhance detection, a two-stage pseudo-labeling strategy has been introduced, where high-confidence anomalous regions identified during MIL training serve as pseudo-labels for a secondary refinement phase [32]–[34]. With the rise of Vision-Language Models (VLMs) [35], CLIP has shown remarkable cross-modal capabilities and is increasingly applied to WSVAD. VadCLIP [13], the first CLIP-based WSVAD method, integrates textual priors via text and visual prompts, enhancing anomaly detection. Building on this, TPWNG [7] refines feature learning through a two-stage approach. Recent research trends focus on large model-driven strategies, e.g., training-free frameworks [36], [37], spatiotemporal anomaly detection [38], and open-scene anomaly detection [39].
The integration of audio and visual information has emerged as a critical research direction in multimodal learning, as it not only enhances model performance but also facilitates a deeper understanding of complex scenes. Significant progress has been achieved in various aspects of audio-visual fusion [40], [41]. In audio-visual segmentation, researchers aim to accurately segment sound-producing objects based on audio-visual cues. Chen et al. [42] proposed a novel informative sample mining method for audio-visual supervised contrastive learning. Ma et al. [43] introduced a two-stage training strategy to address the audio-visual semantic segmentation (AVSS) task. Building on these works, Guo et al. [44] introduced a new task: Open-Vocabulary AVSS (OV-AVSS), which extends AVSS to open-world scenarios beyond predefined annotation labels. Audio-visual event localization aims to identify the spatial and temporal locations of both visual and auditory events, with attention mechanisms widely used for modality fusion. For instance, He et al. [45] proposed an audio-visual co-guided attention mechanism, while Xu et al. [46] introduced an audio-guided spatial-channel attention mechanism. Related tasks include audio-visual video parsing [47], [48] and audio-visual action recognition [49]. Audio-visual anomaly detection [17], [50] has also become a growing research hotspot. For example, Yu et al. [51] applied a self-distillation module to transfer single-modal visual knowledge to an audio-visual model, reducing noise and bridging the semantic gap between single-modal and multimodal features. Similarly, Pang et al. [52] proposed a weighted feature generation approach, leveraging mutual guidance between visual and auditory information, followed by bilinear pooling for effective feature integration.
Figure 2: The pipeline of our proposed AVadCLIP.
Figure 3: The pipeline of our proposed adaptive fusion module and binary classifier.
We consider a training set of videos \(\{V_i\}\), where each video \(V\) contains both visual and corresponding audio information, along with a video-level label \(y \in \mathbb{R}^{C}\). Here, \(C\) denotes the number of categories (including the normal class and various anomaly classes). To facilitate model processing, we employ a video encoder and an audio encoder to extract high-level features \(X_v \in \mathbb{R}^{N \times d}\) and \(X_a \in \mathbb{R}^{N \times d}\), respectively, where \(N\) represents the temporal length of the video (i.e., the number of frames or snippets) and \(d\) denotes the feature dimensionality. The objective of the WSVAD task is to train a detector using all available \(X_v\), \(X_a\), and their corresponding labels from the training set, enabling the model to accurately determine whether each frame in a test sample is anomalous and to identify the specific anomaly category.
The overall pipeline of our method, as shown in Figure 2, starts with extracting features from video and audio using dedicated encoders, then adaptively fuses them for multimodal correspondence learning. We combine a classification branch with a CLIP-based alignment approach, using an audio-visual prompt to inject fine-grained multimodal information into text embeddings. Additionally, an uncertainty-driven distillation is employed to improve anomaly detection robustness in scenarios with incomplete modalities.
Video encoder. Leveraging CLIP’s robust cross-modal representation, we use its image encoder (ViT-B/16) as the video encoder, in contrast to traditional models like C3D and I3D, which are less effective in capturing semantic relationships. We extract features from sampled video frames using CLIP, but to address CLIP’s lack of temporal modeling, we incorporate a lightweight temporal model, such as a graph convolutional network (GCN) [8] or a temporal Transformer [13], to capture temporal dependencies. This approach ensures efficient transfer of CLIP’s pre-trained knowledge to the WSVAD task.
Audio encoder. For audio feature extraction, we use Wav2CLIP [53], a CLIP-based model that maps audio signals into the same semantic space as images and text. The audio is first converted into spectrograms, then sampled to match the number of video frames. These audio segments are processed by Wav2CLIP to extract features. To capture contextual relationships, we apply a temporal convolution layer [54], which models local temporal dependencies, preserving key dynamics within the audio modality.
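To make the feature extraction step concrete, the sketch below shows one way to obtain frame-level features from the frozen CLIP ViT-B/16 image encoder and pass them through a lightweight temporal module. The single Transformer encoder layer is only an illustrative stand-in for the GCN or temporal Transformer mentioned above, and the audio branch (Wav2CLIP features followed by a temporal convolution) would follow the same pattern; the hyperparameters here are assumptions, not the paper's settings.

```python
import torch
import clip  # OpenAI CLIP package; the paper uses the frozen ViT-B/16 image encoder

device = "cuda" if torch.cuda.is_available() else "cpu"
clip_model, preprocess = clip.load("ViT-B/16", device=device)
clip_model.eval()  # the CLIP backbone stays frozen throughout training

@torch.no_grad()
def encode_frames(frames: torch.Tensor) -> torch.Tensor:
    """frames: (N, 3, 224, 224) preprocessed frames -> (N, 512) frame features."""
    return clip_model.encode_image(frames.to(device)).float()

# Lightweight temporal modeling on top of the frozen frame features; a single
# Transformer encoder layer is used here purely as an illustrative stand-in.
temporal = torch.nn.TransformerEncoder(
    torch.nn.TransformerEncoderLayer(d_model=512, nhead=8, batch_first=True),
    num_layers=1,
).to(device)

def video_features(frames: torch.Tensor) -> torch.Tensor:
    x = encode_frames(frames).unsqueeze(0)  # (1, N, 512)
    return temporal(x).squeeze(0)           # (N, 512) temporally modeled X_v
```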
In multimodal feature fusion, while both video and audio contain valuable semantic information, their importance often varies depending on the specific task. Inspired by human perception mechanisms [55], our approach follows a vision-centric, audio-assisted paradigm, where video features serve as the primary modality, and audio features complement and enhance visual information. To preserve the generalization capability of the original CLIP model in downstream tasks while avoiding the introduction of excessive trainable parameters, we design a lightweight adaptive fusion module that integrates audio features without significantly increasing computational overhead. We present the structure of this fusion module in Figure 3.
Specifically, given the video feature \(X_v\) and audio feature \(X_a\), we first concatenate them to obtain a joint representation \(X_{a+v}\in\mathbb{R}^{N \times 2d}\), which is then processed by two projection networks to generate the adaptive weight and residual feature. The first projection network computes adaptive fusion weights \(W\), which determine the contribution of audio at each time step [56]. This is achieved through a linear transformation followed by a sigmoid activation: \[W = \text{Sigmoid}(\text{Linear}(X_{a+v})) \in \mathbb{R}^{N \times d}\]
The second projection network is responsible for residual mapping, which transforms \(X_{a+v}\) into a residual feature \(X_{\text{res}}\) that encodes the fused information from both modalities: \[X_{\text{res}} = \text{Linear}(\text{GELU}(\text{Linear}(X_{a+v}))) \in \mathbb{R}^{N \times d}\]
Finally, the fused representation \(X_{av}\) is obtained by adaptively incorporating the residual feature into the original video feature: \[X_{av} = X_v + W \odot X_{\text{res}}\] where \(\odot\) denotes element-wise multiplication. The adaptive weight \(W\) dynamically adjusts the degree of audio integration, ensuring that video features remain dominant while audio features provide auxiliary information. Additionally, the residual mapping enhances the expressiveness of the fused representation by capturing nonlinear transformations. By introducing an adaptive fusion mechanism and maintaining a lightweight design, our fusion approach effectively balances efficiency and expressiveness, leveraging the complementarity of visual and audio modalities while minimizing computational overhead.
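A minimal PyTorch sketch of this adaptive fusion follows, assuming feature dimension \(d=512\) and a residual projection whose hidden width equals \(d\) (the exact widths are not specified in the text).

```python
import torch
import torch.nn as nn

class AdaptiveAVFusion(nn.Module):
    """Gated residual fusion: video features stay dominant, audio contributes
    through an adaptively weighted residual (X_av = X_v + W ⊙ X_res)."""

    def __init__(self, d: int = 512, hidden: int = 512):
        super().__init__()
        self.gate = nn.Linear(2 * d, d)  # produces the adaptive weight W
        self.res = nn.Sequential(        # residual mapping over the joint feature
            nn.Linear(2 * d, hidden), nn.GELU(), nn.Linear(hidden, d)
        )

    def forward(self, x_v: torch.Tensor, x_a: torch.Tensor) -> torch.Tensor:
        x_cat = torch.cat([x_v, x_a], dim=-1)  # (N, 2d) joint representation
        w = torch.sigmoid(self.gate(x_cat))    # (N, d) per-dimension weight in (0, 1)
        x_res = self.res(x_cat)                # (N, d) fused residual feature
        return x_v + w * x_res                 # fused representation X_av
```

Because only these two small projections are trainable, the fusion adds little overhead on top of the frozen backbone.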
We adopt a two-branch framework [13] for the WSVAD task, consisting of a classification branch and an alignment branch, which together exploit audio-visual information to improve detection accuracy.
Classification branch consists of a lightweight binary classifier (as shown in Figure 3), which takes \(X_{av}\) as input and directly predicts the frame-level anomaly confidence \(A\).
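The exact architecture of this classifier is not detailed beyond Figure 3; a minimal stand-in could be a small frame-level MLP head on the fused features, as sketched below.

```python
import torch
import torch.nn as nn

class BinaryClassifier(nn.Module):
    """Hypothetical lightweight head: fused features X_av -> anomaly confidence A."""

    def __init__(self, d: int = 512, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(d, hidden), nn.GELU(), nn.Linear(hidden, 1))

    def forward(self, x_av: torch.Tensor) -> torch.Tensor:
        # x_av: (N, d) fused audio-visual features -> (N, 1) frame-level confidence A
        return torch.sigmoid(self.net(x_av))
```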
Alignment branch leverages the cross-modal semantic alignment mechanism, which computes the similarity between frame-level features and class label features. To obtain class label representations, we leverage the CLIP text encoder combined with a learnable textual prompt [57] and an audio-visual prompt to extract class embeddings, ensuring unified semantic alignment between visual and textual modalities. Given a set of predefined class labels (e.g., “normal”, “fighting”), we first introduce a learnable textual prompt; we then concatenate the textual prompt with the class labels and feed them into the CLIP text encoder to obtain the class representation \(X_c\). Compared to a manually defined prompt, the learnable prompt allows the model to dynamically adjust textual representations during training, making them more suitable for the specific requirements of WSVAD. Furthermore, we incorporate an audio-visual prompt into the class label features to enrich the class representations with additional multimodal information.
The proposed audio-visual prompt mechanism aims to dynamically inject instance-level key audio-visual information into text labels to enhance the representation. Specifically, we leverage the anomaly confidence \(A\) from the classification branch and audio-visual features \(X_{av}\) to generate a video-level global representation: \[X_{p} = \text{Norm}\left(A^{\top}X_{av}\right) \in \mathbb{R}^{d} \label{con:attention}\tag{1}\] where Norm represents the normalization operation. Next, we calculate the similarity matrix \(S_p\) between the class representation \(X_c\) and the global representation \(X_{p}\) to measure the alignment between class labels and videos: \[S_p = \text{Softmax}\left(X_c X_{p}^{\top}\right)\] based on \(S_p\), we generate the enhanced instance-level audio-visual prompt \(X_{mp}\): \[X_{mp} = S_p X_{av} \in \mathbb{R}^{d}\] This operation dynamically adjusts the class representation’s focus on different video instances by calculating the similarity between global audio-visual features and class labels, thereby enhancing cross-modal alignment.
Then, we add \(X_{mp}\) and the class representation \(X_c\), followed by a feed-forward network (FFN) transformation and a skip connection to obtain the final instance-specific class embedding \(X_{cp}\): \[X_{cp} = \text{FFN}\left(\text{ADD}\left(X_{mp}, X_{c}\right)\right)+X_{c} \label{con:visual95prompt}\tag{2}\] where ADD represents element-wise addition.
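Since the dimensions in Eqs. (1) and (2) admit more than one reading, the sketch below adopts one consistent interpretation: the anomaly-weighted global feature \(X_p\) is injected into each class embedding in proportion to its class-video similarity, with Norm taken as L2 normalization. It is intended only to illustrate the data flow, not to reproduce the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AudioVisualPrompt(nn.Module):
    """Sketch of the audio-visual prompt under one plausible reading of Eqs. (1)-(2)."""

    def __init__(self, d: int = 512):
        super().__init__()
        self.ffn = nn.Sequential(nn.Linear(d, d), nn.GELU(), nn.Linear(d, d))

    def forward(self, x_av: torch.Tensor, a: torch.Tensor, x_c: torch.Tensor) -> torch.Tensor:
        # x_av: (N, d) fused features, a: (N, 1) anomaly confidence, x_c: (C, d) class embeddings
        x_p = F.normalize((a.transpose(0, 1) @ x_av).squeeze(0), dim=-1)  # Eq. (1): (d,) global rep
        s_p = torch.softmax(x_c @ x_p, dim=0)                             # (C,) class-video similarity
        x_mp = s_p.unsqueeze(-1) * x_p.unsqueeze(0)                       # (C, d) instance-level prompt
        return self.ffn(x_mp + x_c) + x_c                                 # Eq. (2): enriched class embeddings X_cp
```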
This two-branch framework provides anomaly confidence through the classification branch and refines category identification with class information via the alignment branch, improving robustness and enabling fine-grained anomaly detection.
For the classification branch, we adopt the Top-K mechanism, as proposed in previous work [17], to select the top \(K\) anomaly confidence values from both normal and abnormal videos, which are averaged as the video-level prediction. The classification loss \(\mathcal{L}_{BCE}\) is then computed using binary cross-entropy between the prediction and the ground-truth class.
In the case of the alignment branch, the MIL-Align mechanism [13] is applied. We compute an alignment map \(M\), reflecting the similarity between frame-level features \(X_{av}\) and all category embeddings \(X_{cp}\). For each row in \(M\), the top \(K\) similarities are selected and their average is used to quantify the alignment between the video and the current class. This results in a vector \(S = \{s_1, \ldots, s_m\}\) representing the similarity between the video and all possible classes. The multi-class prediction is then calculated as: \[p_i = \frac{\exp\left(s_i/\tau\right)}{\sum_j{\exp\left(s_j/\tau\right)}} \label{con:nce}\tag{3}\] where \(p_i\) represents the prediction for the \(i^{th}\) class, and \(\tau\) is the temperature scaling parameter. Then, we compute \(\mathcal{L}_{NCE}\) based on cross-entropy. Besides, to address the class imbalance in WSVAD, where normal samples dominate and anomaly instances are sparse, we employ the focal loss [58]. Finally, the overall loss \(\mathcal{L}_{ALIGN}\) for the alignment branch is the average of \(\mathcal{L}_{NCE}\) and \(\mathcal{L}_{FOCAL}\).
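A hedged sketch of the two objectives follows: the Top-K video-level score with binary cross-entropy for the classification branch, and the MIL-Align aggregation with the temperature-scaled softmax of Eq. (3) for the alignment branch. The values of \(K\) and \(\tau\) are placeholders rather than the paper's settings, and the focal-loss term is omitted.

```python
import torch
import torch.nn.functional as F

def topk_video_score(frame_scores: torch.Tensor, k: int) -> torch.Tensor:
    """Average of the top-k frame-level anomaly confidences (classification branch)."""
    k = min(k, frame_scores.numel())
    return frame_scores.flatten().topk(k).values.mean()

def mil_align_logits(alignment_map: torch.Tensor, k: int, tau: float = 0.07) -> torch.Tensor:
    """MIL-Align: per class, average the top-k frame/class similarities; the returned
    logits realize Eq. (3) when passed through a softmax or cross-entropy."""
    k = min(k, alignment_map.size(0))
    s = alignment_map.topk(k, dim=0).values.mean(dim=0)  # (C,) video-class similarities
    return s / tau

# Toy example with hypothetical shapes: 64 snippets, 7 classes, video label = class 3
A = torch.rand(64, 1)      # frame-level anomaly confidence from the classifier
M = torch.randn(64, 7)     # similarities between X_av and class embeddings X_cp
y_bin = torch.tensor(1.0)  # 1 = abnormal video (video-level binary label)
L_bce = F.binary_cross_entropy(topk_video_score(A, k=16).clamp(1e-6, 1 - 1e-6), y_bin)
L_nce = F.cross_entropy(mil_align_logits(M, k=16).unsqueeze(0), torch.tensor([3]))
```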
In the WSVAD task, audio serves as a complementary modality to video, enhancing detection accuracy. However, audio may be unavailable in practical scenarios, leading to performance degradation. To address this, we apply knowledge distillation (KD) by using a pre-trained multimodal (video+audio) teacher model to guide a unimodal (video-only) student model, ensuring robust anomaly detection even without audio. Traditional knowledge distillation methods typically assume a deterministic transfer of knowledge, employing a mean squared error (MSE) loss to align the student model with the teacher’s feature representations. However, this approach fails to account for the inherent uncertainty in audio-visual feature fusion. In real-world scenarios, factors such as noisy audio or occluded visual content can introduce distortions in the fused features, leading to inaccurate feature representations and diminished generalization capability.
To overcome this, we propose a probabilistic uncertainty distillation strategy [59], [60], which models data uncertainty during distillation, improving the student model’s robustness across diverse scenarios. Specifically, we assume \(X_{av,i} = X_{vs,i} + \epsilon\sigma_i\), where \(\epsilon\sim \mathcal{N}(\mathbf{0},\mathbf{I})\), \(X_{vs}\) represents enhanced visual features derived from the student model, and \(\sigma_i\) refers to the inherent uncertainty between the \(i^{th}\) pair of features. We then model the observation as a Gaussian likelihood to more accurately quantify data uncertainty in the feature distillation. The relationship between the audio-visual fusion feature \(X_{av,i}\) and the unimodal feature \(X_{vs,i}\) is formulated as: \[p(X_{av,i} | X_{vs,i}, {\theta}) = \frac{1}{\sqrt{2\pi {\sigma}_i^2}} \exp\left(-\frac{\| X_{av,i} - X_{vs,i} \|^2}{2{\sigma}_i^2}\right) \label{eq:pi}\tag{4}\] where \(\theta\) denotes the model parameters. To maximize the likelihood for each pair of features \(X_{av,i}\) and \(X_{vs,i}\), we adopt the log-likelihood form: \[\ln p(X_{av,i} | X_{vs,i}, {\theta}) = -\frac{\| X_{av,i} - X_{vs,i} \|^2}{2{\sigma}_i^2} - \frac{\ln 2\pi {\sigma}_i^2}{2} \label{eq:log95pi}\tag{5}\]
In practice, we design a network branch (a simple three-layer convolutional neural network) to predict the variance \({\sigma}_i^2\) and reformulate the likelihood maximization problem as the minimization of a loss function. Specifically, we employ an uncertainty-weighted MSE loss: \[\mathcal{L}_{UKD} = \frac{1}{L} \sum_{i=1}^{L} \left[ \frac{\| X_{av,i} - X_{vs,i} \|^2 }{\sigma_{i}^2}+ \ln {\sigma}_i^2 \right] \label{eq:loss95pi}\tag{6}\] where \(L\) represents the number of feature pairs, and the constant term is omitted for clarity.
During the distillation process, the student model not only learns the unimodal feature \(X_{vs,i}\) from the teacher model but also considers the feature uncertainty \({\sigma}_i^2\) to optimize its learning strategy. Specifically, the first term of the loss function represents the feature similarity between the student and teacher models, normalized by \({\sigma}_i^2\). This assigns smaller weights to features with higher uncertainty, thereby avoiding overfitting to hard-to-learn information. The second term acts as a regularization term to prevent \({\sigma}_i^2\) from becoming too small, ensuring effective distillation.
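The distillation objective of Eq. (6) can be sketched as below. For numerical stability the uncertainty branch predicts \(\ln\sigma_i^2\) rather than \(\sigma_i^2\) directly, which is a common reparameterization and an assumption on our part; the three-layer convolutional structure follows the description above, but its exact widths are not specified.

```python
import torch
import torch.nn as nn

class UncertaintyHead(nn.Module):
    """Predicts per-pair log-variance from the student features (widths assumed)."""

    def __init__(self, d: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(d, d, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(d, d, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(d, 1, kernel_size=3, padding=1),
        )

    def forward(self, x_vs: torch.Tensor) -> torch.Tensor:
        # x_vs: (N, d) student features -> (N, 1) predicted log sigma_i^2
        return self.net(x_vs.t().unsqueeze(0)).squeeze(0).t()

def ukd_loss(x_av: torch.Tensor, x_vs: torch.Tensor, log_var: torch.Tensor) -> torch.Tensor:
    """Eq. (6): uncertainty-weighted MSE between teacher (audio-visual) and student
    (visual-only) features, plus the log-variance regularizer."""
    sq_err = (x_av.detach() - x_vs).pow(2).sum(dim=-1, keepdim=True)  # ||X_av,i - X_vs,i||^2
    return (sq_err / log_var.exp() + log_var).mean()
```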
Ultimately, during the inference phase in audio-missing scenarios, we input only the video and perform anomaly detection through the unimodal student model.
XD-Violence. As the largest publicly available audio-visual WSVAD dataset, XD-Violence [17] significantly surpasses existing datasets in scale and diversity. It comprises 3,954 training videos and 800 test videos, with the test set containing 500 violent and 300 non-violent instances. The dataset covers six distinct categories of violent events, including abuse, car accidents, explosion, fighting, riot, and shooting, which occur at various temporal locations within videos.
CCTV-Fights\(_{sub}\). Derived from CCTV-Fights [61], CCTV-Fights\(_{sub}\) [62] is a carefully curated subset designed for audio-visual anomaly detection. The subset retains 644 high-quality videos depicting real-world fight scenarios, each with meaningful audio content, making it a valuable resource for evaluating audio-visual anomaly detection methods in real-world surveillance contexts.
Evaluation. For performance evaluation, we adopt distinct metrics tailored to different granularities of WSVAD tasks. For coarse-grained WSVAD, we employ frame-level Average Precision (AP), which provides a comprehensive measure of detection accuracy across varying confidence thresholds. For fine-grained anomaly detection, we utilize mean Average Precision (mAP) [14] computed across multiple intersection over union (IoU) thresholds and the average mAP (AVG) across different thresholds.
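For reference, frame-level AP can be computed by pooling per-frame anomaly scores and binary labels over the entire test set, e.g., with scikit-learn; the sketch below omits snippet-to-frame score upsampling and the fine-grained mAP computation.

```python
import numpy as np
from sklearn.metrics import average_precision_score

def frame_level_ap(scores_per_video, labels_per_video) -> float:
    """Concatenate per-frame anomaly scores and ground-truth labels over all test
    videos, then compute the frame-level Average Precision."""
    scores = np.concatenate(scores_per_video)  # predicted anomaly confidence per frame
    labels = np.concatenate(labels_per_video)  # 1 = anomalous frame, 0 = normal frame
    return average_precision_score(labels, scores)
```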
We conduct experiments on an NVIDIA RTX 4090 GPU. The visual enhancement module is a single-layer 1D convolutional network comprising a convolutional layer with a kernel size of 3 and padding of 1, a ReLU activation, and a skip connection; this operation effectively aggregates local contextual information. For input processing, we employ a frame selection strategy tailored to each dataset, sampling one frame per 16 frames for XD-Violence and one frame per 4 frames for CCTV-Fights\(_{sub}\), with uniform sampling and a maximum frame count of 256. During optimization, we set the batch size, learning rate, and total number of epochs to 96, \(1\times10^{-5}\), and 10, respectively.
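A direct transcription of the visual enhancement module described above (one 1D convolution with kernel size 3 and padding 1, a ReLU activation, and a skip connection); the channel width of 512 matches the CLIP feature dimension and is otherwise an assumption.

```python
import torch
import torch.nn as nn

class VisualEnhancement(nn.Module):
    """Single-layer temporal 1D convolution with ReLU and a skip connection."""

    def __init__(self, d: int = 512):
        super().__init__()
        self.conv = nn.Conv1d(d, d, kernel_size=3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (N, d) frame features; convolve along the temporal axis
        h = self.conv(x.t().unsqueeze(0))        # (1, d, N)
        return x + torch.relu(h).squeeze(0).t()  # skip connection preserves the input
```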
Our experiments evaluate both coarse-grained and fine-grained anomaly detection performance on XD-Violence, comparing our AVadCLIP against state-of-the-art approaches, as shown in Tables 1 and 2.
For coarse-grained anomaly detection, using only RGB inputs, AVadCLIP\(^*\) (\(*\) denotes RGB-only input) achieves an AP score of 85.53%, surpassing all existing vision-only methods. Notably, it outperforms VadCLIP—the previous best-performing RGB-only approach—by 1.0%, demonstrating superior visual anomaly detection. When incorporating audio, AVadCLIP further improves performance, significantly outperforming all multimodal baselines, achieving a remarkable 4.9% gain over the latest AVCL [62] method.
For fine-grained anomaly detection, AVadCLIP consistently outperforms all competitors across different IoU thresholds, as detailed in Table 2. With RGB-only input, AVadCLIP\(^*\) surpasses VadCLIP at all IoU thresholds, achieving an AVG improvement of 2.7%. Similarly, the full-modality model AVadCLIP leads across all metrics, boosting the AVG by 3.9%. These results highlight the effectiveness of multimodal learning in precisely localizing anomaly boundaries and improving category predictions.
Overall, AVadCLIP achieves state-of-the-art performance in both unimodal and multimodal settings across coarse- and fine-grained anomaly detection tasks. The comprehensive results validate its effectiveness in leveraging audio-visual collaboration and demonstrate the feasibility of the uncertainty-driven distillation strategy.
Table 1: Coarse-grained anomaly detection comparison on XD-Violence (frame-level AP).

Method | Reference | Modality | AP(%) |
---|---|---|---|
DeepMIL [4] | CVPR 2018 | RGB(ViT) | 75.18 |
Wu et al. [17] | ECCV 2020 | RGB(ViT) | 80.00 |
RTFM [63] | ICCV 2021 | RGB(ViT) | 78.27 |
AVVD [14] | TMM 2022 | RGB(ViT) | 78.10 |
Ju et al. [11] | ECCV 2022 | RGB(ViT) | 76.57 |
DMU [18] | AAAI 2023 | RGB(ViT) | 82.41 |
CLIP-TSA [64] | ICIP 2023 | RGB(ViT) | 82.17 |
VadCLIP [13] | AAAI 2024 | RGB(ViT) | 84.51 |
AVadCLIP\(^*\) | this work | RGB(ViT) | 85.53 |
FVAI [52] | ICASSP 2021 | RGB(I3D)+Audio | 81.69 |
MACIL-SD [51] | ACM MM 2022 | RGB(I3D)+Audio | 81.21 |
CUPL [65] | CVPR 2023 | RGB(I3D)+Audio | 81.43 |
AVCL [62] | TMM 2025 | RGB(I3D)+Audio | 81.11 |
AVadCLIP | this work | RGB(ViT)+Audio | 86.04 |
Table 2: Fine-grained anomaly detection comparison on XD-Violence (mAP(%) at different IoU thresholds).

Method | mAP@0.1 | mAP@0.2 | mAP@0.3 | mAP@0.4 | mAP@0.5 | AVG |
---|---|---|---|---|---|---|
Random | 1.82 | 0.92 | 0.48 | 0.23 | 0.09 | 0.71 |
DeepMIL [4] | 22.72 | 15.57 | 9.98 | 6.20 | 3.78 | 11.65 |
AVVD [14] | 30.51 | 25.75 | 20.18 | 14.83 | 9.79 | 20.21 |
VadCLIP [13] | 37.03 | 30.84 | 23.38 | 17.90 | 14.31 | 24.70 |
AVadCLIP\(^*\) | 39.63 | 32.77 | 26.84 | 21.58 | 16.39 | 27.44 |
AVadCLIP | 41.89 | 34.61 | 27.08 | 22.16 | 17.30 | 28.61 |
Table 3: Coarse-grained anomaly detection comparison on CCTV-Fights\(_{sub}\) (frame-level AP).

Method | Reference | Modality | AP(%) |
---|---|---|---|
VadCLIP [13] | AAAI 2024 | RGB(ViT) | 72.78 |
AVadCLIP\(^*\) | this work | RGB(ViT) | 73.36 |
MACIL-SD [51] | ACM MM 2022 | RGB(I3D)+Audio | 72.92 |
DMU [18] | AAAI 2023 | RGB(I3D)+Audio | 72.97 |
AVCL [62] | TMM 2025 | RGB(I3D)+Audio | 73.20 |
AVadCLIP | this work | RGB(ViT)+Audio | 73.38 |
The coarse-grained anomaly detection results on CCTV-Fights\(_{sub}\) are presented in Table 3. For RGB-only methods, AVadCLIP\(^*\) achieves 73.36% AP, surpassing the state-of-the-art VadCLIP and demonstrating the effectiveness of our approach in unimodal scenarios. For audio-visual scenarios, AVadCLIP further improves performance, outperforming all existing methods. These results indicate that incorporating audio information can further enhance anomaly detection performance, validating the effectiveness of cross-modal complementary information mining.
Table 4: Ablation study of the audio-visual fusion, audio-visual prompt, and focal loss.

AV Fusion | AV Prompt | \(\mathcal{L}_{FOCAL}\) | AP(%) | AVG(%) |
---|---|---|---|---|
\(\times\) | \(\times\) | \(\times\) | 79.85 | 27.89 |
\(\surd\) | \(\times\) | \(\times\) | 82.90 | 26.63 |
\(\surd\) | \(\surd\) | \(\times\) | 86.18 | 26.79 |
\(\surd\) | \(\surd\) | \(\surd\) | 86.04 | 28.61 |
Table 5: Comparison of different audio-visual fusion strategies.

Method | AP(%) | AVG(%) |
---|---|---|
Cross Attention | 75.15 | 10.51 |
Element-wise Addition | 83.02 | 27.66 |
Concat+Linear Projection | 83.36 | 28.88 |
Adaptive Fusion | 86.04 | 28.61 |
Table 6: Effect of the uncertainty-driven distillation (UKD) on unimodal models.

Method | AP(%) | AVG(%) |
---|---|---|
Audio Model w/o UKD | 50.89 | 12.20 |
Audio Model w/ UKD | 52.51 | 13.50 |
Visual Model w/o UKD | 84.60 | 22.92 |
Visual Model w/ UKD | 85.53 | 27.44 |
Audio-Visual Model | 86.04 | 28.61 |
From Table 4, it can be observed that the introduction of audio-visual fusion improves detection performance. Furthermore, Table 5 presents the impact of different audio-visual fusion strategies on anomaly detection performance. First, the cross-attention fusion performs poorly in the WSVAD task, indicating that although it can capture the relationships between modalities, its complex parameterized design may negatively impact the generalization ability of the CLIP model in downstream WSVAD tasks. Next, the simple element-wise addition strategy achieves an AP of 83.02% and an AVG of 27.66%. Then, the concatenation with linear projection approach improves the AP to 83.36% and the AVG to 28.88%, indicating that enhancing feature representation through linear transformation facilitates more effective cross-modal information capture. Finally, our proposed adaptive fusion strategy achieves the best AP of 86.04%, outperforming the other three methods overall. This demonstrates that our adaptive fusion, as a lightweight and effective strategy, can more effectively exploit the complementary information between the audio and visual modalities.
Figure 4: Coarse- and Fine-grained WSVAD visualization results of AVadCLIP and Baseline on XD-Violence.
As presented in Table 4, the baseline model achieves an AP of only 79.85%. Integrating the audio-visual prompt on top of the adaptive fusion mechanism significantly enhances performance, increasing the AP to 86.18%. This improvement underscores the effectiveness of the audio-visual prompt in capturing critical multimodal patterns, thereby facilitating more precise anomaly recognition. Furthermore, incorporating focal loss into the model contributes to refining anomaly boundary detection, leading to more stable performance in fine-grained anomaly localization. In summary, the audio-visual prompt primarily enhances coarse-grained anomaly detection, and focal loss further refines boundary precision, enabling the model to achieve optimal performance across both AP and AVG metrics.
As shown in Table 6, the proposed UKD mechanism significantly enhances anomaly detection performance in both visual-only and audio-only models. Specifically, in the visual-only setting, UKD achieves a 0.9% improvement in AP and a 4.5% increase in AVG, attaining performance levels comparable to the teacher model trained with audio-visual inputs. Similarly, the audio-only model also benefits from UKD, exhibiting consistent performance gains. These results highlight the effectiveness of UKD in leveraging data uncertainty to enhance the robustness of unimodal representations during the distillation process, making it particularly well-suited for real-world applications where modality incompleteness is prevalent.
In Figure 4, we present qualitative visualizations of AVadCLIP and the baseline model for both coarse-grained and fine-grained WSVAD. The blue curves denote the anomaly predictions of AVadCLIP, whereas the yellow curves represent those of the baseline model. As illustrated, compared to the baseline model, AVadCLIP significantly reduces anomaly confidence in normal video segments, thereby distinguishing abnormal from normal regions more accurately. The fine-grained map below also indicates that AVadCLIP predicts categories with greater precision. Notably, the observed improvement supports our hypothesis that audio information is particularly advantageous under visual occlusion (shooting) or in acoustically dominant scenes (explosion), and can effectively eliminate ambiguity among visually similar patterns, thereby ensuring more robust detection performance.
In this work, we propose a novel weakly supervised framework for robust video anomaly detection using audio-visual collaboration. Leveraging the powerful representation ability and cross-modal alignment capability of CLIP, we design two dedicated modules to achieve efficient audio-visual collaboration and multimodal anomaly detection on top of the frozen CLIP model. Specifically, to seamlessly integrate audio-visual information, we introduce a lightweight fusion mechanism that adaptively generates fusion weights based on the importance of audio in assisting visual information. Additionally, we propose an audio-visual prompt strategy that dynamically refines text embeddings with key multimodal features, strengthening the semantic alignment between video content and the corresponding textual labels. To further bolster robustness in scenarios with missing modalities, we develop an uncertainty-driven distillation module that synthesizes audio-visual representations from visual inputs, focusing on challenging features. Experimental results across two benchmarks demonstrate that our framework effectively enables audio-visual anomaly detection and enhances the model’s robustness in scenarios with incomplete modalities. In the future, we will explore the integration of additional modalities (e.g., textual descriptions) based on VLMs to achieve more robust video anomaly detection.