PreFM: Online Audio-Visual Event Parsing
via Predictive Future Modeling

Xiao Yu\(^{1,2}\) Yan Fang\(^{1,2}\) Xiaojie Jin\(^{1,2}\) Yao Zhao\(^{1,2}\) Yunchao Wei\(^{1,2,}\)
\(^{1}\)Institute of Information Science, Beijing Jiaotong University
\(^{2}\)Visual Intelligence + X International Joint Laboratory
Corresponding Author
xiaoyu@bjtu.edu.cn


Abstract

Audio-visual event parsing plays a crucial role in understanding multimodal video content, but existing methods typically rely on offline processing of entire videos with large models, limiting their real-time applicability. We introduce Online Audio-Visual Event Parsing (On-AVEP), a novel paradigm for parsing audio, visual, and audio-visual events by sequentially analyzing incoming video streams. The On-AVEP task necessitates models with two key capabilities: (1) Accurate online inference, to effectively distinguish events with unclear and limited context in online settings, and (2) Real-time efficiency, to balance high performance with computational constraints. To cultivate these, we propose the Predictive Future Modeling (PreFM) framework featuring (a) predictive multimodal future modeling to infer and integrate beneficial future audio-visual cues, thereby enhancing contextual understanding, and (b) modality-agnostic robust representation along with focal temporal prioritization to improve precision and generalization. Extensive experiments on the UnAV-100 and LLP datasets show that PreFM outperforms state-of-the-art methods by a large margin while using far fewer parameters, offering an insightful approach for real-time multimodal video understanding. Code is available at https://github.com/XiaoYu-1123/PreFM.

1 Introduction↩︎

Multimodal learning [1]–[4] is a significant topic in machine learning research. Among various modalities, audio [5] and vision [6], [7] are the primary ways humans perceive the world, making audio-visual learning (AVL) [8]–[11] essential. Among the various advances [12]–[15] in AVL, audio-visual event parsing (AVEP), i.e., understanding events in videos, has become increasingly important with the explosive growth of video content on streaming platforms.

AVEP involves processing both modality-aligned (audio-visual) and modality-misaligned (audio-only or visual-only) events in video content. Prevailing methods [16]–[18] operate offline, analyzing entire video sequences to exploit global context for accurate video event understanding. Though offering precise predictions, the necessity of whole-video processing, often coupled with large models and consequently high computational costs, makes these approaches unsuitable for real-time applications that require immediate detection and swift responses in dynamic environments such as autonomous driving [19], [20], wearable devices [21], [22], and human-robot interaction [23], [24].

To tackle these limitations, we introduce Online Audio-Visual Event Parsing (On-AVEP), a new paradigm that parses audio, visual, and audio-visual events in streaming videos with an online processing manner. The core characteristic of On-AVEP is to perceive the environmental state and generate timely feedback using only historical and current multimodal information, while balancing model performance and efficiency particularly in resource-aware and dynamic environments.

Specifically, On-AVEP requires the model to possess two key capabilities: (1) Accurate Online Inference: the model must adapt to complex and dynamic scene variations and accurately predict ongoing events by relying exclusively on past and current information, without any future context. As illustrated in Figure 1 (a), the model needs to distinguish similar events with unclear, limited context due to the lack of future information. (2) Real-time Efficiency: to meet the immediate response demands of On-AVEP applications, the model needs to achieve accurate event parsing at low computational cost, balancing performance and complexity to satisfy the needs of online video processing.

To cultivate these essential capabilities, we introduce the Predictive Future Modeling (PreFM) framework as illustrated in Figure 1 (b). PreFM predicts future states through effective temporal-modality feature fusion and leverages knowledge distillation and temporal prioritization for training efficiency. To achieve (1) accurate online inference, PreFM employs predictive multimodal future modeling, fusing the features of currently available data to infer beneficial future audio-visual cues. Cross-temporal and cross-modal feature interactions are utilized to effectively reduce noise within the pseudo-future context and enhance current representations. To balance (2) real-time efficiency with overall parsing performance, PreFM integrates two designs during training: modality-agnostic robust representation distills rich, modality-agnostic knowledge from a large pre-trained teacher model for more generalized representations, and focal temporal prioritization encourages the model to focus on the most temporally critical information for online decisions, thereby boosting inference accuracy while keeping inference efficient.

Figure 1: (a) Illustration of parsing events in online scenarios: if a man opens his mouth and produces a vocal sound at time T, it is unclear based solely on information from 0 to T whether this marks the beginning of a musical phrase (as part of “singing”) or the start of a conversation (as in “speaking”). Precisely parsing events with such limited context is crucial for accurate online inference. (b) Simplified architecture of our PreFM framework, highlighting predictive future modeling and modality-agnostic robust representation (MRR). (c) Comparison of performance and efficiency against SOTA methods on the UnAV-100 [16] dataset.

Extensive experiments on two challenging datasets, UnAV-100 [16] and LLP [25], demonstrate that the PreFM framework significantly outperforms existing state-of-the-art (SOTA) methods in both segment-level and event-level metrics. Moreover, PreFM exhibits substantial advantages in model efficiency, striking a superior balance between performance and model complexity, with a margin of \(\mathbf{+9.3}\) in event-level average F1-score while using merely \(\mathbf{2.7\%}\) of the parameters, as highlighted in Figure 1 (c).

In summary, our main contributions are:

  • (I) We introduce Online Audio-Visual Event Parsing (On-AVEP), a new paradigm for real-time multimodal understanding. To our knowledge, this is the first work to systematically address the challenge of parsing audio, visual, and audio-visual events from streaming video. We further establish that success in this paradigm requires two critical capabilities: (a) accurate online inference from limited context, and (b) real-time efficiency to balance performance with computational cost.

  • (II) We propose the PreFM framework, a novel and efficient architecture for On-AVEP. PreFM’s core innovations include: (a) Predictive Multimodal Future Modeling mechanism to overcome the critical problem of missing future context; and (b) a combination of Modality-agnostic Robust Representation and Focal Temporal Prioritization to enhance model robustness and efficiency during training, providing an insightful approach to multimodal real-time video understanding.

  • (III) We establish new SOTA performance with unprecedented efficiency. Extensive experiments on two public datasets show that PreFM drastically outperforms previous methods (e.g., +9.3 Avg F1-score on UnAV-100), while using a fraction of the computational resources (e.g., only 2.7% of the parameters of the next best model), validating it as a powerful and practical solution.

2 Related Work↩︎

2.0.0.1 Online Video Understanding

encompasses online action detection for identifying ongoing actions [26]–[28], action anticipation for predicting future actions [29], [30], and online temporal action localization for determining action boundaries [31], [32]. Frameworks like JOADAA [33], TPT [34], and MAT [35] jointly model detection and anticipation, bridging the present and future. Recent research focuses on model reliability through uncertainty quantification [36] and adaptability through open-vocabulary detection [37]. Concurrent advancements explore leveraging large language models for complex online understanding tasks [38]–[40]. However, these methods rely solely on the visual modality and neglect the crucial auditory perception, motivating our research into online audio-visual event parsing, which integrates both sensory streams for a more robust and holistic real-time understanding.

2.0.0.2 Audio Visual Video Parsing (AVVP)

aims to temporally classify video segments into audible and/or visible events. Early weakly-supervised methods [25], [41] use attention to infer temporal structure. Subsequent works [42]–[45] further address modality imbalance and interaction. A significant recent trend involves leveraging external knowledge, using language prompts [46] or pre-trained models like CLIP [47]/CLAP [48] to denoise weak supervision or generate finer pseudo-labels [49]–[51]. Building on this, methods such as CoLeaF [52], NREP [53], and MM-CSE [54] focus on sophisticated feature disentanglement and interaction for improved performance.

2.0.0.3 Audio Visual Event Localization (AVEL)

is first introduced to temporally locate events that are both visually and auditorily present within trimmed video clips [55]. Subsequent methods [56]–[61] leverage cross-modal attention, background suppression, contrastive samples, and adapters to improve localization accuracy. AVE-PM [62] is developed to handle portrait-mode short videos, while OV-AVEL [63] extends the task into an open-vocabulary setting. For densely annotated, untrimmed videos featuring multiple overlapping events, UnAV [16] releases the UnAV-100 benchmark and inspires models like UniAV [17], LOCO [64], FASTEN [65], and CCNet [18], which employ multi-temporal fusion, local correspondence correction, and cross-modal consistency for dense event localization. Recent efforts [11], [66]–[68] also pursue omni-modal understanding with powerful large language models. However, these approaches generally rely on full-video inputs and large model sizes, making them unsuitable for real-time parsing. Our work distinguishes itself by unifying AVEL and AVVP into a comprehensive online audio-visual event parsing framework, designed for efficient real-time processing and capable of identifying events regardless of whether they are solely auditory, solely visual, or audio-visual.

3 Methods↩︎

In this section, we first introduce the problem setup in Sec. 3.1 and present a brief overview of our method in Sec. 3.2, with core designs: Predictive Multimodal Future Modeling (Sec. 3.3), Modality-agnostic Robust Representation (Sec. 3.4), and Focal Temporal Prioritization (Sec. 3.5). Finally, we discuss the specifics of our approach during training and online inference in Sec. 3.6.

3.1 Preliminaries↩︎

On-AVEP involves predicting events within streaming videos by sequentially processing multimodal information. This task is primarily divided into two sub-tasks: online audio-visual event localization (On-AVEL) and online audio-visual video parsing (On-AVVP). In the former, given a sequence of audio-visual data pairs \(\{V_t, A_t\}^T_{t=1}\) and the corresponding label \(y_{t=T}\), where \(T\) denotes the current time step, the model is required to predict the multi-label event vector \(\hat{y}_{t=T} \in \{0, 1\}^{C_{av}}\), where \(C_{av}\) represents the total number of audio-visual event categories. The latter task instead requires predicting \(\hat{y}_{t=T} \in \{0, 1\}^{C_a+C_v}\), where \(C_a\) and \(C_v\) denote the numbers of audio-only and visual-only event categories, respectively. In both sub-tasks, models typically take pre-processed audio and visual feature vectors \(\{f_t^a\}_{t=1}^T, \{f_t^v\}_{t=1}^T \in \mathbb{R}^{T\times D}\) (\(D\): feature dimension) provided with existing video datasets [16], [25], [55], [69] as input for subsequent operations.

3.2 Overview↩︎

As illustrated in Figure 2, during online inference, PreFM sequentially processes the incoming audio and visual features \(F_c^a, F_c^v\) available up to the current time \(T\). To address the challenge of missing future information, which is crucial for event disambiguation, the core Predictive Multimodal Future Modeling network (Sec. 3.3) dynamically generates pseudo-future multimodal sequences. This process starts with a Pseudo-Future Mechanism that fuses current-time multimodal features and then models initial pseudo-future predictions \(\tilde{F}_f^a, \tilde{F}_f^v\), followed by Temporal-Modality Cross Fusion, in which pseudo-future cues and current representations are mutually enhanced through comprehensive cross-temporal and cross-modal interactions. The resulting contextually augmented representations \(\hat{F}_c^a, \hat{F}_c^v\) are then utilized for event parsing at time \(T\).

To train an effective and efficient PreFM model, in addition to direct supervision on predictions for both the current window and the pseudo-future sequences, PreFM utilizes Modality-agnostic Robust Representation (MRR, Sec. 3.4). Through MRR, event labels \(y_t\) are transformed into target modality-agnostic features \(f_t\) using a pre-trained teacher model; PreFM’s internal event representations \(\hat{f}^{av}_t\) are then guided to align with these target features via a dedicated distillation loss term. Furthermore, Focal Temporal Prioritization (Sec. 3.5) is implemented by reweighting the contributions of different relative time steps, encouraging the model to make precise predictions at the current time.

3.3 Predictive Multimodal Future Modeling Network↩︎

Figure 2: The pipeline of PreFM. It takes real-time audio-visual streams, using predictive modeling to generate multimodal future context, modality-agnostic robust representation to enhance performance by transferring knowledge, and focal temporal prioritization to emphasize the current time step T.

Inspired by advances in online action detection [33]–[35], our approach to On-AVEP centers on predictively modeling multimodal pseudo-future sequences using only currently available data. To help PreFM better utilize and consolidate all available modal and temporal cues, we propose a Universal Hybrid Attention (UHA) block to bridge different modalities across time. Given the target query sequence \(Q\) and a flexible list of \(k\) context sets \(\{F_i\}_{i=1}^k\), where each \(F_i\) can represent various temporal segments of different modalities, UHA merges these features into \(Q\) as follows: \[\text{UHA}(Q, \{F_i\}_{i=1}^k) = \text{FFN}(\text{LN}(Q+\sum_{i=1}^{k} \text{Attn}(Q, F_i, F_i))) \label{eq:uha}\tag{1}\] where \(\text{Attn}\) is multi-head attention [70], \(\text{LN}\) is Layer Normalization, and \(\text{FFN}\) is a Feed-Forward Network. UHA serves as the foundational attention block for subsequent fusion operations.
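For concreteness, a minimal PyTorch sketch of the UHA block is given below. The head count, FFN width, and GELU activation are illustrative assumptions rather than details taken from the released implementation.

```python
import torch
import torch.nn as nn


class UHA(nn.Module):
    """Universal Hybrid Attention (Eq. 1); a minimal sketch."""

    def __init__(self, dim=256, heads=4, ffn_mult=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, ffn_mult * dim),
            nn.GELU(),
            nn.Linear(ffn_mult * dim, dim),
        )

    def forward(self, q, contexts):
        # Q + sum_i Attn(Q, F_i, F_i) over the flexible list of context sets.
        out = q
        for f in contexts:
            attn_out, _ = self.attn(q, f, f)
            out = out + attn_out
        return self.ffn(self.norm(out))  # FFN(LN(...)) as in Eq. 1
```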

3.3.0.1 Pseudo-Future Mechanism

This mechanism first fuses current audio-visual information and then models an initial prediction of the future sequence. Given input features up to current time \(T\), \(\{ (f_t^v, f_t^a) \}_{t=1}^T\), we define a current working window of length \(\boldsymbol{L_c}\). This yields the initial current audio and visual features \(F_c^a=\{f_{t}^a\}_{t=T-L_c+1}^T\) and \(F_c^v=\{f_{t}^v\}_{t=T-L_c+1}^T\), both in \(\mathbb{R}^{L_c\times D}\).

First, we perform an initial feature fusion between \(F_c^a\) and \(F_c^v\). Each sequence is processed by our UHA block with both as context. The fused current features \(\tilde{F}_c^a, \tilde{F}_c^v \in \mathbb{R}^{L_c \times D}\) are produced by: \[\tilde{F}_c^m=\text{UHA}(F_c^m, \{F_c^a, F_c^v\}), m \in \{a,v\}\]

Next, future modeling generates initial pseudo-future sequences of length \(\boldsymbol{L_f}\). We use learnable tokens \(Q^a, Q^v \in \mathbb{R}^{L_f \times D}\) as queries, which attend to the corresponding fused current features: \[\label{eq:future_modeling_sequences} \tilde{F}_f^m = \text{Attn}(Q^m, \tilde{F}_c^m, \tilde{F}_c^m), m \in \{a,v\}\tag{2}\]

This step yields the initial multimodal pseudo-future sequences \(\tilde{F}_f^a, \tilde{F}_f^v \in \mathbb{R}^{L_f \times D}\).
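Building on the UHA sketch above, the pseudo-future mechanism can be sketched as follows; the initialization scale of the learnable tokens and the module layout are our assumptions.

```python
class PseudoFutureMechanism(nn.Module):
    """Sketch of the pseudo-future mechanism, reusing the UHA sketch above."""

    def __init__(self, dim=256, l_f=5, heads=4):
        super().__init__()
        self.fuse = nn.ModuleDict({m: UHA(dim, heads) for m in ("a", "v")})
        self.queries = nn.ParameterDict(
            {m: nn.Parameter(0.02 * torch.randn(1, l_f, dim)) for m in ("a", "v")}
        )
        self.future_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, f_c_a, f_c_v):  # each (B, L_c, D)
        # Initial cross-modal fusion of the current window.
        fc = {"a": self.fuse["a"](f_c_a, [f_c_a, f_c_v]),
              "v": self.fuse["v"](f_c_v, [f_c_a, f_c_v])}
        b = f_c_a.size(0)
        # Learnable tokens attend to the fused current features (Eq. 2).
        ff = {m: self.future_attn(self.queries[m].expand(b, -1, -1),
                                  fc[m], fc[m])[0] for m in ("a", "v")}
        return fc["a"], fc["v"], ff["a"], ff["v"]
```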

3.3.0.2 Temporal-Modality Cross Fusion

Having obtained the fused current features (\(\tilde{F}_c^a, \tilde{F}_c^v\)) and initial pseudo-future sequences (\(\tilde{F}_f^a, \tilde{F}_f^v\)), this stage performs further interactions to mutually refine them, reducing potential noise within the pseudo-future, while simultaneously enriching the current representations with foresight gleaned from the modeled future.

Figure 3: Temporal-modality cross fusion for the pseudo-future \hat{F}_f^v.

First, future augmentation refines the initial pseudo-future predictions with the UHA block: \[\label{equ:future_aug_seq} \hat{F}_f^m =\text{UHA}(\tilde{F}_f^m, \{\tilde{F}_f^a, \tilde{F}_f^v, \tilde{F}_c^m\}), m \in \{a,v\}\tag{3}\]

This yields augmented pseudo-future sequences \(\hat{F}_f^a, \hat{F}_f^v \in \mathbb{R}^{L_f \times D}\). Notably, the context list within the UHA block enables a rich combination of self-attention, as well as cross-interactions across modalities and time. For instance, the augmented visual pseudo-future \(\hat{F}_f^v\) is obtained by interacting with \(\tilde{F}_f^v\) (for self-attention), \(\tilde{F}_f^a\) (for cross-modal attention), and \(\tilde{F}_c^v\) (for cross-temporal attention) as shown in Figure 3.

Next, current refinement integrates the augmented future back into the current representations: \[\label{equ:curr_aug_seq_unified} \hat{F}_c^m = \text{UHA}(\tilde{F}_c^m, \{\tilde{F}_c^a, \tilde{F}_c^v, \hat{F}_f^m\}), m \in \{a,v\}\tag{4}\]

This results in the final contextually-aware current feature sequences \(\hat{F}_c^a, \hat{F}_c^v \in \mathbb{R}^{L_c \times D}\).
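A corresponding sketch of the temporal-modality cross fusion stage (Eqs. 3 and 4), again reusing the UHA sketch above, is shown below; whether the four UHA blocks share weights is an assumption (here they do not).

```python
class TemporalModalityCrossFusion(nn.Module):
    """Sketch of Eqs. 3-4 built on the UHA block sketched earlier."""

    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.future_aug = nn.ModuleDict({m: UHA(dim, heads) for m in ("a", "v")})
        self.current_ref = nn.ModuleDict({m: UHA(dim, heads) for m in ("a", "v")})

    def forward(self, fc_a, fc_v, ff_a, ff_v):
        # Future augmentation (Eq. 3): self-, cross-modal, and cross-temporal context.
        hf_a = self.future_aug["a"](ff_a, [ff_a, ff_v, fc_a])
        hf_v = self.future_aug["v"](ff_v, [ff_a, ff_v, fc_v])
        # Current refinement (Eq. 4): fold the augmented future back into the present.
        hc_a = self.current_ref["a"](fc_a, [fc_a, fc_v, hf_a])
        hc_v = self.current_ref["v"](fc_v, [fc_a, fc_v, hf_v])
        return hc_a, hc_v, hf_a, hf_v
```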

Finally, these augmented current and future features are projected by a shared classification head \(h(\cdot)\) and a Sigmoid function \(\mathcal{S}(\cdot)\) to obtain event predictions \(\hat{y}_c \in \mathbb{R}^{L_c \times C}\) (for the current window) and \(\hat{y}_f \in \mathbb{R}^{L_f \times C}\) (for the future window): \[\hat{y}_c = \mathcal{S}(h(\texttt{Concat}(\hat{F}_c^a, \hat{F}_c^v))), \hat{y}_f = \mathcal{S}(h(\texttt{Concat}(\hat{F}_f^a, \hat{F}_f^v))) \label{eq:two_predictions}\tag{5}\]

Here, \(\texttt{Concat}\) denotes feature concatenation, and \(C\) is the number of event classes (either \(C_{av}\) or \(C_a+C_v\)). For online inference, the prediction in \(\hat{y}_c\) corresponding to time \(T\) is used. During training, \(\hat{y}_c\) and \(\hat{y}_f\) are supervised across the time steps \([T-L_c +1, T+L_f]\) using the BCE loss with annotations \(y_c \in \mathbb{R}^{L_c \times C}\) and \(y_f \in \mathbb{R}^{L_f \times C}\): \[\label{eq:two_bce_loss} \mathcal{L}_{c} = \text{BCE}(\hat{y}_c, y_c), \mathcal{L}_{f} = \text{BCE}(\hat{y}_f, y_f)\tag{6}\]

3.4 Modality-agnostic Robust Representation↩︎

Learning from the rich, modality-agnostic event representations established by powerful pre-trained teacher models [47], [48], [71]–[73] is an efficient way to obtain robust and generalizable representations. For each time step \(t\), we convert the event labels \(y_t\) into the text prompt “a/an audio/visual/audio-visual event of [cls]”, which is then processed by the text encoder of the frozen teacher model OnePeace [72] to obtain modality-agnostic event features \(f_t\). Simultaneously, the student’s representation \(\hat{f}^{av}_t = \texttt{Concat}(\hat{f}_t^a, \hat{f}_t^v)\) can be easily extracted from our PreFM model and aligned with the target representation \(f_t\) through distillation. Cosine similarity is used as the distillation loss: \[\mathcal{L}_{mrr} = \frac{1}{L_c+L_f} \sum_{t=T-L_c+1}^{T+L_f} (1-\frac{\hat{f}^{av}_t \cdot h'(f_t)}{\|\hat{f}^{av}_t\| \cdot \|h'(f_t)\|}) \label{eq:mrr_loss}\tag{7}\]

where \(h'(\cdot)\) denotes a projector module, implemented as a single linear layer, that aligns the different feature dimensions.
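Eq. 7 can be written compactly as follows; tensor shapes and the function name are illustrative, and the projector \(h'(\cdot)\) is assumed to be a single linear layer as stated above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def mrr_loss(student_feats, teacher_feats, projector: nn.Linear):
    """Eq. 7 as a sketch: cosine-distance distillation between the student's
    concatenated audio-visual features (B, L_c+L_f, D_s) and the frozen
    teacher's text features (B, L_c+L_f, D_t), projected by h'(.)."""
    target = projector(teacher_feats)                         # h'(f_t)
    cos = F.cosine_similarity(student_feats, target, dim=-1)  # (B, L_c+L_f)
    return (1.0 - cos).mean()                                 # average over time (and batch)
```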

3.5 Focal Temporal Prioritization↩︎

To emphasize predictions close to the present moment, we introduce a focal temporal prioritization scheme into the loss calculation, highlighting the significance of the prediction at time step \(T\) instead of using uniform weighting. Specifically, we define temporal priorities using a Gaussian function centered at the current time \(T\): \(g(t, \sigma) = \exp\left( -\frac{t^2}{2\sigma^2} \right)\), where \(t\) is the relative temporal distance from time \(T\) and \(\sigma\) controls the width of the focus. We define temporal weights \(w_c\) over the current window \(t \in [T-L_c+1, T]\) and \(w_f\) over the pseudo-future window \(t \in [T+1, T+L_f]\): \[w_c(t-T) = \frac{L_c \cdot g(t-T, L_c)}{\sum_{k=T-L_c+1}^{T} g(k-T, L_c)} , \quad w_f(t-T) = \frac{L_f \cdot g(t-T, L_f)}{\sum_{k=T+1}^{T+L_f} g(k-T, L_f)} \label{eq:two_weight}\tag{8}\]

Let \(\mathcal{L}_c(t)\), \(\mathcal{L}_f(t)\), and \(\mathcal{L}_{mrr}(t)\) denote the per-timestep losses in Eq. 6 and Eq. 7. We use \(w(t-T)= \texttt{Concat}\{w_c(t-T), w_f(t-T)\}\) to denote the whole weight sequence. The final loss is computed as: \[\mathcal{L} = \sum_{t=T-L_c+1}^{T} w_c(t-T) \cdot \mathcal{L}_c(t) + \sum_{t=T+1}^{T+L_f} w_f(t-T) \cdot \mathcal{L}_f(t) + \mathbf{\lambda} \sum_{t=T-L_c+1}^{T+L_f} w(t-T) \cdot \mathcal{L}_{mrr}(t) \label{eq:final_loss}\tag{9}\] The hyperparameter \(\lambda\) balances the robust representation term and is set to 1.0 by default.
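The following sketch computes the weights of Eq. 8 and assembles the total loss of Eq. 9 from per-timestep losses; variable names are ours.

```python
import torch


def focal_temporal_weights(l_c=10, l_f=5):
    """Eq. 8 as a sketch: Gaussian weights centered at the current step T,
    normalized so the current-window weights sum to L_c and the future ones to L_f."""
    t_c = torch.arange(-l_c + 1, 1, dtype=torch.float32)  # relative steps of [T-L_c+1, T]
    t_f = torch.arange(1, l_f + 1, dtype=torch.float32)   # relative steps of [T+1, T+L_f]
    g_c = torch.exp(-t_c ** 2 / (2 * l_c ** 2))           # g(t-T, sigma=L_c)
    g_f = torch.exp(-t_f ** 2 / (2 * l_f ** 2))           # g(t-T, sigma=L_f)
    return l_c * g_c / g_c.sum(), l_f * g_f / g_f.sum()


def total_loss(loss_c, loss_f, loss_mrr, w_c, w_f, lam=1.0):
    """Eq. 9 as a sketch: loss_c (L_c,), loss_f (L_f,), loss_mrr (L_c+L_f,)
    are per-timestep losses; w(t-T) is the concatenation of w_c and w_f."""
    w_all = torch.cat([w_c, w_f])
    return (w_c * loss_c).sum() + (w_f * loss_f).sum() + lam * (w_all * loss_mrr).sum()
```

With the default \(L_c=10\), the current-window weights produced by `focal_temporal_weights` match the values listed later in Table 8.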

3.6 Training and Online Inference↩︎

3.6.0.1 Random Segment Sampling for Training

To adapt training to the online nature of On-AVEP and enhance data utilization, we design a random segment sampling strategy. During training, for a video of total length \(T_{all}\), the target prediction times \(T_k \in [1, T_{all}]\) are generated by \(T_k = kL_c + \delta\). Here, \(k\) serves as an index for iterating across the video, and \(\delta \in [0, L_c-1]\) is a periodically re-drawn random integer offset. The \(L_c\)-length feature sequences \(\{ (f_t^v, f_t^a) \}_{t=T_k-L_c+1}^{T_k}\) act as model inputs, and zero-padding is applied at the beginning if \(T_k < L_c\). This strategy provides diverse training segments with a fixed history length \(L_c\), suitable for the online setting.
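A minimal sketch of this sampling strategy is given below; tensor names and the per-video loop structure are our assumptions.

```python
import random
import torch


def sample_training_windows(feats_a, feats_v, labels, l_c=10):
    """Random segment sampling as a sketch.
    feats_a / feats_v: (T_all, D) per-video features, labels: (T_all, C).
    Target times T_k = k*L_c + delta; windows reaching before the video
    start are zero-padded on the left."""
    t_all, d = feats_a.shape
    delta = random.randint(0, l_c - 1)          # periodically re-drawn random offset
    windows = []
    for k in range((t_all - delta) // l_c + 1):
        t_k = k * l_c + delta                   # 1-indexed target prediction time
        if not (1 <= t_k <= t_all):
            continue
        start = t_k - l_c                       # may be negative -> left zero-padding
        pad = max(0, -start)
        sl = slice(max(0, start), t_k)
        windows.append((
            torch.cat([feats_a.new_zeros(pad, d), feats_a[sl]]),
            torch.cat([feats_v.new_zeros(pad, d), feats_v[sl]]),
            torch.cat([labels.new_zeros(pad, labels.size(1)), labels[sl]]),
        ))
    return windows
```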

3.6.0.2 Online Inference

During inference, the model works in a truly online manner, processing the input video stream with a sliding window of length \(L_c\) and stride \(1\). At each step \(T_{infer}\), the model takes features from \([T_{infer}-L_c+1, T_{infer}]\), generates the multimodal pseudo-future context, and gets the final event predictions for the current time step \(T_{infer}\).
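The inference loop can be sketched as follows, assuming the model returns current-window probabilities of shape \((1, L_c, C)\); names and the thresholding value are illustrative.

```python
import torch


@torch.no_grad()
def online_inference(model, feats_a, feats_v, l_c=10, threshold=0.5):
    """Sliding-window online inference with stride 1 (a sketch): at step T only
    features up to T are visible. We assume `model(wa, wv)` returns the
    current-window probabilities y_c of shape (1, L_c, C) plus future outputs."""
    t_all, d = feats_a.shape
    preds = []
    for t in range(1, t_all + 1):                       # current time T_infer
        start, pad = max(0, t - l_c), max(0, l_c - t)
        wa = torch.cat([feats_a.new_zeros(pad, d), feats_a[start:t]]).unsqueeze(0)
        wv = torch.cat([feats_v.new_zeros(pad, d), feats_v[start:t]]).unsqueeze(0)
        y_c, _ = model(wa, wv)
        preds.append((y_c[0, -1] > threshold).float())  # keep only the prediction at T_infer
    return torch.stack(preds)                           # (T_all, C) multi-label predictions
```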

4 Experiments↩︎

4.1 Experimental Setups↩︎

4.1.0.1 Dataset

UnAV-100 [16] is a large-scale dataset designed for dense audio-visual event localization in untrimmed videos. It contains 10,790 videos of varying lengths covering 100 event categories, with over 30,000 annotated audio-visual event instances. LLP [25] provides 11,849 trimmed 10-second clips across 25 categories for audio-only and visual-only event parsing. For online scenarios, we concatenate LLP clips into longer video sequences. Specifically, half of these sequences are formed by randomly concatenating clips to simulate the rapid scene variations often encountered in online streaming content; the other half are formed by concatenating clips from the same event category to represent longer, continuous event occurrences. Following recent works [46], [49], [50], [52]–[54], segment-wise pseudo labels from CLIP [47], [74] and CLAP [48] are used for supervision.

4.1.0.2 Metric

For model performance, we follow prior work [25], [35], using the F1-score and mean Average Precision (mAP) as segment-level metrics. For event-level evaluation, consecutive positive segments are treated as a complete event instance. We calculate event-level F1-scores at tIoU thresholds of [0.1:0.1:0.9] [16], along with the average F1-score (Avg F1-score) for overall performance. For the On-AVVP task, we adhere to the established protocol from VALOR [49], evaluating audio-only (A), visual-only (V), and combined audio-visual (AV, denoted with subscript “av”) events. Regarding model efficiency, we report the number of trainable parameters, FLOPs per inference, peak inference memory, and FPS.
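As an illustration of the event-level protocol, the following sketch groups consecutive positive segments of one class into event instances and computes the temporal IoU used for matching against ground truth; function names are ours.

```python
def segments_to_events(binary_seq):
    """Group consecutive positive segments of one class into (start, end) event
    instances (segment indices); a sketch of the evaluation protocol above."""
    events, start = [], None
    for i, v in enumerate(list(binary_seq) + [0]):  # sentinel closes a trailing run
        if v and start is None:
            start = i
        elif not v and start is not None:
            events.append((start, i - 1))
            start = None
    return events


def tiou(e1, e2):
    """Temporal IoU between two (start, end) event instances, inclusive."""
    inter = max(0, min(e1[1], e2[1]) - max(e1[0], e2[0]) + 1)
    union = max(e1[1], e2[1]) - min(e1[0], e2[0]) + 1
    return inter / union
```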

4.1.0.3 Implementation details

For both tasks, we train for 60 epochs, with the first 10 epochs dedicated to warm-up. A batch size of 128 is used, and AdamW serves as the optimizer with a weight decay of \(1e^{-4}\). We set \(L_c=10\) and \(L_f=5\) as the default setting. CLIP [47] and CLAP [48] are used to extract visual and audio features, respectively, with a temporal stride of 1 second. All experiments are conducted on a single RTX 3090. For the learning rate and the hidden dimension within the attention block, we use \(1e^{-3}\) and 256 for On-AVEL, and \(5e^{-4}\) and 128 for On-AVVP.

4.2 Comparison with Existing Work↩︎

Our method is benchmarked against recent SOTA methods UnAV [16], UniAV [17], and CCNet [18] on UnAV-100 [16] for the On-AVEL task, and against VALOR [49], CoLeaF [52], LEAP [75], NREP [53], and MM-CSE [54] on LLP [25] for the On-AVVP task. We provide two versions of our method: the basic version “PreFM” and the improved version “PreFM+” with a larger hidden size.

Table 1: Comparison with SOTA methods on the On-AVEL task. Feature extractors: I.V. denotes I3D [76]+VGGish [77], O. denotes OnePeace [72], and C.C. denotes CLIP [47]+CLAP [48]. “PreFM+” replaces the CLIP/CLAP feature extractors with OnePeace and expands the hidden dimension from \(256\) to \(512\). Methods marked with “*” take the entire video as input, fully utilizing the complete context.
Methods | Extractors | Segment-Level: F1, mAP | Event-Level F1 @ tIoU: 0.1, 0.3, 0.5, 0.7, 0.9, Avg | Params↓ | FLOPs↓ | Inference: Memory↓, FPS↑, Latency↓
UnAV [16] I.V. 47.5 58.3 50.9 37.1 28.7 18.2 9.4 28.6 139.4M 52.4G 764.7MB 10.6 94.3ms
UniAV [17] O. 47.8 66.9 50.3 38.9 29.9 21.1 12.3 30.3 130.8M 22.7G 1020.5MB 15.6 64.1ms
CCNet [18] O. 54.8 62.3 58.3 46.3 37.5 27.3 15.8 37.0 238.8M 72.1G 1179.4MB 7.5 133.3ms
PreFM (Ours) C.C. 59.1 70.1 61.5 53.6 46.9 39.6 29.2 46.3 6.5M 0.4G 56.4MB 51.9 19.3ms
PreFM+ (Ours) O. 62.4 70.6 66.3 58.2 52.2 44.5 35.4 51.5 13.8M 0.5G 144.2MB 42.0 23.8ms
UnAV* [16] I.V. 56.1 67.8 59.3 56.0 52.7 46.7 35.1 50.6 139.4M 52.4G 764.7MB 10.6 94.3ms
UniAV* [17] O. 59.2 70.0 62.8 59.0 55.1 48.7 35.0 52.9 130.8M 22.7G 1020.5MB 15.6 64.1ms
CCNet* [18] O. 65.0 70.6 69.0 65.1 61.0 53.1 40.1 58.3 238.8M 72.1G 1179.4MB 7.5 133.3ms
Table 2: Comparison with SOTA methods on the On-AVVP task. Feature extractors: R.C.C. denotes R3D [78]+CLIP [47]+CLAP [48] and R.R.V. denotes R3D [78]+ResNet152 [79]+VGGish [77]. “PreFM+” increases the hidden dimension from \(128\) to \(256\). Methods marked with “*” take the entire 10-second video clips as input, fully utilizing the complete context.
Methods | Extractors | Segment-Level: F1a, F1v, F1av, mAPa, mAPv, mAPav | Event-Level: 0.5a, 0.5v, 0.5av, Avga, Avgv, Avgav | Params↓ | FLOPs↓ | Inference: Memory↓, FPS↑, Latency↓
VALOR [49] R.C.C. 49.7 52.4 45.4 72.9 68.4 56.7 36.5 46.1 34.6 35.2 42.8 33.0 4.9M 0.45G 20.1MB 62.2 16.1ms
CoLeaF [52] R.R.V. 50.7 44.5 41.0 62.8 45.8 37.3 37.9 36.4 29.6 37.3 35.5 29.7 5.7M 0.25G 114.1MB 60.4 16.6ms
LEAP [75] R.R.V. 50.6 49.3 45.8 73.3 64.3 54.6 40.1 42.5 35.9 38.4 39.6 34.3 52.0M 1.09G 204.7MB 19.3 51.8ms
NREP [53] R.C.C. 53.7 51.4 45.5 66.5 52.7 42.3 38.9 45.6 34.2 38.3 42.3 33.5 9.6M 1.69G 90.2MB 26.4 37.9ms
MM-CSE [54] R.C.C. 53.3 56.5 48.9 74.6 70.0 57.5 39.4 50.8 38.4 37.7 46.9 36.2 6.2M 0.91G 33.0MB 36.1 27.7ms
PreFM (Ours) R.C.C. 60.0 59.3 53.3 80.0 73.7 61.3 47.1 50.9 42.0 46.3 50.6 41.2 3.3M 0.22G 20.7MB 94.4 10.6ms
PreFM+ (Ours) R.C.C. 61.0 60.0 54.6 80.2 73.8 61.4 48.5 51.7 43.1 47.6 51.0 42.2 12.1M 0.48G 55.9MB 53.5 18.7ms
VALOR* [49] R.C.C. 65.6 61.8 56.5 81.4 73.7 61.4 55.1 54.9 46.7 54.0 54.2 46.0 4.9M 0.45G 20.1MB 62.2 16.1ms
CoLeaF* [52] R.R.V. 60.5 58.0 52.4 71.7 60.7 49.3 48.3 53.0 42.1 48.7 51.8 42.5 5.7M 0.25G 114.1MB 60.4 16.6ms
LEAP* [75] R.R.V. 61.6 61.5 56.5 80.6 71.3 60.2 52.3 56.4 47.7 51.2 55.0 46.7 52.0M 1.09G 204.7MB 19.3 51.8ms
NREP* [53] R.C.C. 67.3 63.7 57.9 77.4 66.2 53.9 55.9 57.5 47.8 54.9 56.7 47.1 9.6M 1.69G 90.2MB 26.4 37.9ms
MM-CSE* [54] R.C.C. 67.0 64.0 57.6 82.3 74.8 61.7 56.9 56.8 47.3 54.7 56.0 46.1 6.2M 0.91G 33.0MB 36.1 27.7ms

4.2.0.1 Performance Comparison

As shown in Table 1, PreFM clearly achieves new SOTA results for the On-AVEL task, surpassing the second-best method with significant improvements of +7.8 in mAP and +9.3 in Avg F1-score. Furthermore, our enhanced version, PreFM+, extends these gains to +8.3 in mAP and +14.5 in Avg F1-score with only a moderate increase in parameters, highlighting the excellent scalability of the PreFM architecture for applications requiring higher precision. Similarly, for the On-AVVP task shown in Table 2, PreFM demonstrates consistent advantages, achieving improvements of +3.8 in mAP\(_{av}\) and +5.0 in Avg F1-score\(_{av}\), with PreFM+ further elevating performance to +3.9 in mAP\(_{av}\) and +6.0 in Avg F1-score\(_{av}\) over the second-best methods. Notably, we also present the original offline results of these baseline methods (marked with “*”) to show their performance under full-context conditions. Even compared with these results, our online PreFM achieves comparable performance despite predicting with limited context.

These substantial performance gains across both tasks are largely attributed to our core predictive multimodal future modeling (PMFM) design. By dynamically generating and integrating pseudo-future contextual cues from streaming data, PMFM empowers our method to effectively parse environmental states and accurately capture temporal boundaries.

4.2.0.2 Efficiency Analysis

Regarding the On-AVEL task (Table 1), PreFM’s efficiency is remarkable. PreFM utilizes merely 2.7% of the parameters (6.5M vs 238.8M) of the next best performing method, and it requires only 0.6% of the FLOPs (0.4G vs 72.1G) and 4.8% of the peak memory (56.4MB vs 1179.4MB) for a single inference, while running at an impressive 51.9 FPS with a latency of merely 19.3ms. This compelling efficiency advantage is also evident in the On-AVVP task (Table 2). The ability to deliver SOTA performance with drastically reduced overhead highlights that PreFM is designed with a strong emphasis on practical deployability, rendering it a highly suitable and efficient solution for resource-constrained real-time applications.

4.3 Ablation Studies↩︎

4.3.0.1 Main Component

To systematically evaluate the contribution of each proposed component, we conduct comprehensive ablation studies on the On-AVEL task, with results presented in Table [tab:ablation_three-tables](a). The simple prediction strategy (row 1) uses only data at time \(T\) and performs poorly. Our baseline (row 2), which simply extends the accessible data to a context window of length \(L_c\) without further modifications, achieves an Avg F1-score of 40.8. Introducing the pseudo-future mechanism (\(PF\), row 3) significantly boosts performance to 42.4 (+1.6 vs baseline), underscoring the importance of future context modeling over relying solely on past or current information. Further incorporating modality-agnostic robust representation (\(\mathcal{L}_{mrr}\), row 5) or random segment sampling (\(RS\), row 6) individually builds upon this, yielding Avg F1-scores of 44.2 (+3.4 vs baseline) and 44.0 (+3.2 vs baseline), respectively, demonstrating their distinct benefits. The focal temporal prioritization (\(w(t)\)) consistently improves results when applied (e.g., row 4 vs 3, and row 7 vs 5), confirming its effectiveness in focusing the model on critical information at the current moment. Finally, our full PreFM model (row 8), integrating all components, achieves a final Avg F1-score of 46.3, marking a substantial +5.5 improvement over the baseline and validating the collective effectiveness of our design.

4.3.0.2 Impact of future-oriented losses

We investigate the impact of the direct future supervision \(\mathcal{L}_f\) and the future portion of the robust representation loss \(\mathcal{L}_{mrr,f}\) used in the pseudo-future (\(PF\)) mechanism. Results are shown in Table [tab:ablation_three-tables](b). A comparison of the first two rows shows that merely adding the extra parameters of the future module without applying any future supervision yields negligible performance gains. Conversely, the results in the subsequent three rows indicate that designing losses to explicitly guide the model in anticipating and modeling the future, whether through direct supervision or robust representation distillation, enhances model performance. These findings clearly demonstrate that the performance benefits derived from our pseudo-future mechanism are primarily attributable to the effective learning guided by these targeted future-oriented losses, rather than merely an increase in model capacity.

Table [tab:ablation_three-tables]: Ablation studies on the On-AVEL task: (a) main components, (b) future-oriented losses, and (c) temporal-modality cross fusion.

4.3.0.3 Impact of temporal-modality cross fusion

Generating a reliable audio-visual pseudo-future is challenging due to inherent predictive noise. Table [tab:ablation_three-tables](c) compares our Temporal-Modality Cross Fusion (TMCF) with ablated variants that utilize only self-attention (Self), audio-visual modality fusion (M only), or temporal-only fusion (T only), focusing on their accuracy in predicting the future (the first three relative time steps) and overall event parsing performance. The inferior performance of these simplified variants underscores that uni-dimensional interactions are insufficient for producing robust future sequences, leaving their reliability suboptimal and noise levels high. In contrast, our full TMCF, by collaboratively leveraging both cross-modal and cross-temporal interactions over the available content, generates more accurate and dependable pseudo-future sequences. This results in a higher-quality predictive context that more effectively mitigates noise and aids robust real-time event parsing.

Table [tab:ablation_2-tables]: Ablation studies on (a) context lengths \(L_c\) and \(L_f\) and (b) different teacher models in MRR.

4.3.0.4 Impact of context lengths \(L_c\) and \(L_f\)

We investigate the impact of varying the current window length \(L_c\) and the pseudo-future length \(L_f\), with results presented in Table [tab:ablation_2-tables](a). The optimal performance, an Avg F1-score of 46.3, is obtained with our default configuration of \(L_c=10\) and \(L_f=5\) (row 2). Analysis of \(L_f\) (rows 1, 2, 3, with \(L_c=10\) fixed) indicates that while a very short future window (\(L_f=1\)) provides insufficient predictive insight, an overly long one (\(L_f=10\)) can introduce distracting noise, both degrading performance. Similarly, examining \(L_c\) (rows 2, 4, 5, with \(L_f=5\) fixed) reveals that too little historical context (\(L_c=5\)) offers inadequate support, whereas excessive history (\(L_c=20\)) may include outdated or irrelevant information. These findings confirm the importance of appropriately sized context windows, with \(L_c=10\) and \(L_f=5\) providing the most effective balance for the immediate event parsing task.

4.3.0.5 Different teacher models in MRR

We evaluate the influence of different pre-trained teacher models on our modality-agnostic robust representation (MRR) module. Specifically, we compare OnePeace [72], ImageBind [71], and AudioClip [73] across multiple metrics. As the results in Table [tab:ablation_2-tables](b) demonstrate, no single model consistently outperforms the others on every measure. However, OnePeace delivers better segment-level F1-scores and average event-level performance, which leads us to adopt it as our default teacher.

4.3.0.6 Temporal impact of the pseudo-future

Figure 4 (a) illustrates how our pseudo future (\(PF\)) mechanism affects prediction accuracy across time steps relative to the current moment \(T\). From the orange line, we observe that the model’s peak performance occurs significantly earlier (around relative time \(T-6\)), with accuracy declining as it approaches \(T\), indicating a strong reliance on full context. In contrast, the purple line shows that incorporating the \(PF\) not only achieves generally higher accuracy but also shifts its performance peak much closer to the actual target time \(T\) (around \(T-2\)). These observations underscore a fundamental principle in event parsing: accurate event identification intrinsically depends on a comprehensive contextual window. Thus, the reliance on future context presents a significant hurdle for online systems. Our \(PF\) mechanism effectively anticipates event trends, models and utilizes audio-visual future information to enhance prediction accuracy near the present moment, thereby mitigating immediate contextual limitations.

Figure 4: (a) The performance across different relative time steps. (b) t-SNE visualization of the pre-classifier features. We use nine animal events from UnAV-100 [16] for better illustration.

4.3.0.7 Impact of MRR on latent feature representation

Figure 4 (b) qualitatively evaluates our modality-agnostic robust representation (MRR) via t-SNE visualization of latent features from nine predefined animal events from UnAV-100 [16]. With MRR, event classes form more compact and well-separated clusters, unlike the more chaotic clusters from the model without MRR. This suggests that while MRR may shift the latent space, it guides the model towards more discriminative representations, enhancing event separability and overall performance.

4.4 Qualitative Analysis↩︎

Figure 5 presents a qualitative comparison of our PreFM with the SOTA methods UnAV [16] and CCNet [18] on the On-AVEL task and MM-CSE [54] on the On-AVVP task. Prior methods often exhibit limitations such as missed detections (e.g., “trombone” by UnAV, “piano” by CCNet, “cheering” by MM-CSE) and fragmented predictions (e.g., “trumpet” by UnAV), depicted by the red dotted boxes. In stark contrast, PreFM’s predictions exhibit strong temporal continuity and precise event boundary localization, without the interruptions or errors seen in other methods. These visualizations intuitively showcase PreFM’s enhanced recognition accuracy and the coherent, continuous nature of its event parsing.

Figure 5: (a) The visualization on the On-AVEL task. (b) The visualization on the On-AVVP task. GT: ground truth. The red dotted box indicates the area of mispredictions.

5 Conclusions↩︎

In this work, we introduce online audio-visual event parsing to enable real-time multimodal event understanding in streaming videos. We identify accurate online inference and real-time efficiency as two crucial capabilities in this setting, and propose the PreFM framework, featuring a novel predictive multimodal future modeling mechanism to infer future context, together with modality-agnostic robust representation and focal temporal prioritization to improve generalization. Extensive experiments on the UnAV-100 and LLP datasets validate that PreFM significantly outperforms prior methods, achieving state-of-the-art performance while offering a superior balance between accuracy and computational efficiency, thus presenting a viable solution for practical real-time multimodal applications.

This work is supported by the National Natural Science Foundation of China (No. 92470203, U23A20314), the Beijing Natural Science Foundation (No. L242022), and the Fundamental Research Funds for the Central Universities (No. 2024XKRC082).

6 Technical Appendices and Supplementary Material↩︎

6.1 Limitations and Broader Impact↩︎

6.1.0.1 Limitations

While PreFM demonstrates promising results in online audio-visual event parsing, we identify two key avenues for future exploration and enhancement. First, the current PreFM design is primarily tailored to event detection and localization. Future research could extend its capabilities to more complex, semantically rich tasks such as video question answering or detailed captioning, and enhance its capacity for long-range temporal reasoning, potentially through integration with large language models. Second, while PreFM’s predictive modeling of pseudo-future context is a key component for enhancing online inference, the inherent nature of future prediction means that the generated cues may not always perfectly foresee subsequent events. Although our temporal-modality cross fusion (TMCF, detailed in Sec. 3.3) is designed to refine these predictions and mitigate potential noise by leveraging cross-modal and cross-temporal interactions, with its positive impact analyzed in Sec. 4.3, residual noise may still somewhat degrade performance. While TMCF offers an initial solution, further research could enhance the reliability and effectiveness of the predicted future context.

6.1.0.2 Broader Impact

Our work on online audio-visual event parsing, using methods like PreFM, can greatly improve real-time AI systems. However, we must carefully consider serious ethical issues when using audio and video data. Important concerns include protecting people’s privacy from unwanted surveillance or unauthorized access, and mitigating unfair biases that the model might learn from its training data.

6.2 Additional Results and Analysis↩︎

6.2.0.1 Quantitative Analysis of the Pseudo-Future’s Quality

We evaluate the generated pseudo-future features from two perspectives: their semantic similarity to ground-truth event features, and their effectiveness in predicting future events. Top-k Similarity Accuracy: we measure whether, for a generated feature vector at a relative future time step (T+1 to T+5), the corresponding ground-truth class feature embedding is the Top-1 or among the Top-5 closest matches out of all 100 classes in UnAV-100. Future Prediction F1-Score: we also report the standard segment-level F1-score for predictions made at future time steps. The results are presented in Table 3.

These results provide two key insights. First, the pseudo-future features are remarkably realistic: the Top-5 Similarity Accuracy of over 94% demonstrates that the correct event feature is almost always ranked among the top candidates. Second, these high-quality features enable strong future prediction performance, as evidenced by the solid F1-scores.

Table 3: The quantitative results of the pseudo-future’s quality.
Metric T+1 T+2 T+3 T+4 T+5
Top-1 Similarity 45.4 44.6 44.0 43.3 42.6
Top-5 Similarity 95.5 95.3 95.1 94.8 94.3
F1 57.5 56.5 55.4 54.7 54.1
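For clarity, the Top-k similarity accuracy described above can be computed as in the following sketch; tensor names and shapes are our assumptions.

```python
import torch
import torch.nn.functional as F


def topk_similarity_accuracy(pred_feats, gt_class_ids, class_text_feats, k=5):
    """Sketch of the Top-k similarity accuracy: rank all class text embeddings
    by cosine similarity to each generated pseudo-future feature and check
    whether the ground-truth class is among the top k. Shapes (assumed):
    pred_feats (N, D), class_text_feats (num_classes, D), gt_class_ids (N,)."""
    pred = F.normalize(pred_feats, dim=-1)
    classes = F.normalize(class_text_feats, dim=-1)
    topk = (pred @ classes.t()).topk(k, dim=-1).indices   # (N, k)
    hits = (topk == gt_class_ids.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()
```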

6.2.0.2 Ablation study on hyperparameters

The ablation study on the loss weighting hyperparameter \(\lambda\) is shown in Table 4. The results indicate that performance diminishes at the tested extreme values (\(\lambda=0.1\) and \(\lambda=10\)), while the model exhibits stable and strong performance across a moderate range (from \(\lambda=0.5\) to \(\lambda=2\)). Therefore, we adopt \(\lambda=1\) as the default setting in our experiments for simplicity.

Table 4: Ablation study on the loss weight hyperparameter \(\lambda\).
\(\lambda\) | Seg-Level: F1, mAP | Event-Level: 0.5, Avg
0.1 58.3 70.8 46.9 45.8
0.5 58.9 70.2 47.2 46.2
1 59.1 70.1 46.9 46.3
1.5 58.4 69.9 47.2 46.2
2 59.2 70.0 47.5 46.6
5 55.9 68.9 44.3 43.7
10 13.2 50.8 9.4 9.5

6.2.0.3 Different feature extractors

Table 5 presents the ablation study on different feature extractors, evaluating their impact on the performance and efficiency of the On-AVEP task. The results clearly indicate that employing more powerful foundation models as feature extractors generally leads to significant improvements in parsing performance. Specifically, while the I3D [76]+VGGish [77] combination is relatively lightweight, its performance is comparatively limited. In contrast, AudioClip [73] and CLIP [47]+CLAP [48] offer a favorable balance between performance and computational efficiency. Although OnePeace [72] achieves the best parsing results, its substantial computational requirements may hinder practical deployment in real-world scenarios. Notably, the computational complexity of our proposed PreFM module remains relatively stable and low across all tested feature extractors. This underscores that the feature extraction stage constitutes the primary performance bottleneck and source of computational load, directly impacting the system’s online processing capabilities.

Table 5: Ablation studies for different feature extractors. a: audio extractor part, v: visual extractor part.
Methods | Seg-Level: F1, mAP | Event-Level: 0.5, Avg | Dimensions: a, v | FLOPs↓: a, v, PreFM
I3D [76]+VGGish [77] 30.7 48.7 23.7 24.0 128 2048 0.9G 3.5G 0.3G
AudioClip [73] 48.0 63.6 37.7 37.0 1024 1024 2.7G 5.4G 0.1G
CLIP [47]+CLAP [48] 59.1 70.1 46.9 46.3 768 768 23.1G 77.8G 0.5G
ONE-PEACE [72] 62.4 70.6 52.2 51.5 1536 1536 78.8G 389.8G 0.4G

6.2.0.4 Failure cases

The quantitative findings and analysis about failure cases are shown below:

Confusion Between Similar Events  We analyze the events with the lowest performance and their most common confusions in Table 6. These results reveal that PreFM struggles to distinguish between events that are semantically or acoustically similar. We hypothesize this is because our current framework does not explicitly incorporate a contrastive learning design to better separate the representations of events originating from similar audio-visual sources.

Table 6: The quantitative analysis of confusion between similar events.
Event Precision Recall F1 Most confused with
People slurping 0.55 0.17 0.25 People eating, man speaking
People shouting 0.42 0.19 0.27 Baby laughter, engine knocking

Performance in Dense Scenes  We analyze the impact of event density (the number of event classes within a video) on event-level performance in Table 7. These results show that PreFM’s performance degrades in complex videos containing a large number of distinct event classes. This suggests that while our future modeling is effective, its benefits are less pronounced in scenarios with very rapid scene changes and drastic context shifts.

Table 7: The quantitative analysis of PreFM in dense scenes.
Num events Event-Level Avg
1-3 0.54
4-6 0.23
>6 0.13

6.3 Additional Details↩︎

6.3.0.1 The detailed process of modality-agnostic robust representation (MRR)

For the MRR process, we select OnePeace [72] as the pre-trained teacher model. The generation of target teacher features \(f_t\) at each time step \(t\) involves the following steps. First, ground-truth event labels \(y_t\) are converted into textual prompts using the template “a/an audio/visual/audio-visual event of [cls]”. These prompts are then processed by the OnePeace text encoder to yield the modality-agnostic event features. If multiple events are active at time \(t\), the final \(f_t\) is computed by averaging the features corresponding to all active event classes. Concurrently, the student model’s representation \(\hat{f}^{av}_t\) is prepared. We extract the audio features \(\hat{f}_t^a\) and visual features \(\hat{f}_t^v\) from our model at time \(t\), specifically from the layer before the final classification head \(h(\cdot)\). These extracted features are subsequently concatenated along the feature dimension to form the student’s representation: \(\hat{f}^{av}_t = \texttt{Concat}(\hat{f}_t^a, \hat{f}_t^v)\).

6.3.0.2 Specific values for focal temporal prioritization

As detailed in Eq. 8 and Eq. 9, our focal temporal prioritization is designed to emphasize predictions closer to the current time \(T\) while maintaining the overall loss scale. This scale preservation ensures that the sum of weights over the current window, \(\sum^{T}_{t = T-L_c+1} w_c(t-T)\), equals \(L_c\), and over the future window, \(\sum^{T+L_f}_{t = T+1} w_f(t-T)\), equals \(L_f\). Table 8 presents the specific numerical values of these weights for each relative time step, calculated with our default settings of \(L_c=10\) and \(L_f=5\).

Table 8: Specific values of the focal temporal prioritization for current time steps (\(t \in[T-9, T]\)) and future prediction time steps (\(t \in[T+1, T+5]\)).
Time step | Current: T-9, T-8, T-7, T-6, T-5, T-4, T-3, T-2, T-1, T | Future: T+1, T+2, T+3, T+4, T+5
Weight value 0.76 0.83 0.89 0.95 1.01 1.06 1.09 1.12 1.14 1.14 1.12 1.10 1.03 0.94 0.81

6.3.0.3 Difference between random segment sampling and normal sampling

The details of our random segment sampling strategy are described in Sec. 3.6. In contrast, the normal sampling strategy simply divides a video of total length \(T_{all}\) into a sequence of \(k\) non-overlapping chunks, each of length \(L_c\). In this normal approach, the target prediction time \(T_k\) for each \(k\)-th chunk is deterministically set to its final time step, specifically \(T_k = kL_c - 1\). This means that predictions are consistently targeted only at the very end of these fixed chunks, unlike the more varied and diverse target prediction times generated by our random segment sampling.

6.3.0.4 Dataset modification for online setting

For the UnAV-100 dataset [16], while its original annotations specify continuous time segments for events (e.g., \([cls, T_{start}, T_{end}]\)), we convert these into frame-level discrete labels for our online task. Specifically, for any given time \(T\) and a particular event within a video stream, a label of \(1\) indicates the event is currently occurring, while \(0\) indicates it is not.
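A minimal sketch of this conversion is given below; the annotation container and function names are illustrative.

```python
import torch


def segments_to_frame_labels(annotations, num_seconds, num_classes, class_to_idx):
    """Convert [cls, T_start, T_end] annotations (in seconds) into per-second
    multi-hot labels, as described above."""
    labels = torch.zeros(num_seconds, num_classes)
    for cls, t_start, t_end in annotations:
        c = class_to_idx[cls]
        labels[int(t_start):int(t_end) + 1, c] = 1.0  # 1 while the event is occurring
    return labels
```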

Figure 6: Visualization of the LLP dataset adaptation for the online setting. Various colored rectangles represent video clips of different categories. Video clips are processed by random concatenation and consistent concatenation.

The LLP dataset [25] initially provides 11,849 10-second video clips. To adapt it to our online evaluation setting, we construct 11,849 new, untrimmed video streams as shown in Figure 6. Each new stream uses an original 10-second clip as its base and concatenates additional clips. The resulting streams are constructed to reach one of six target total durations: 10 seconds (the original clip itself), 20 seconds, 30 seconds, 40 seconds, 50 seconds, or 60 seconds, with approximately an equal number of streams for each duration.

This concatenation process employs two distinct strategies. Half of the 11,849 streams are formed by random concatenation, randomly combining clips from different categories, which simulates the rapid and complex scene variations commonly observed in streaming content. The other half are constructed by consistent concatenation: we identify the first event category present in the base clip and then concatenate multiple additional clips that also contain this specific event, thereby simulating longer videos with a consistent, ongoing event context. This approach allows us to assess the model’s adaptability to complex dynamic scenes and its capability for consistent understanding and discrimination within extended event contexts.

6.3.0.5 Regarding other datasets: The case of LFAV

The LFAV dataset [69], comprising 5175 untrimmed videos with diverse audio, visual, and audio-visual events, is designed for long-form audio-visual video parsing and thus appears initially relevant for the On-AVEP task. However, we identify two critical limitations that preclude its effective use with our PreFM framework.

Firstly, complete access to the original video data is restricted. Of the officially stated 3721 training, 486 validation, and 968 test samples, we are able to retrieve only 3512, 461, and 910 samples, respectively (totaling 4883 out of 5175 raw videos). The LFAV benchmark provides pre-extracted features using VGGish [77], ResNet18 [79], and R3D [78]. This reliance on fixed features prevents us from employing different feature extractors or leveraging pre-trained models (such as OnePeace [72] for our modality-agnostic robust representation) directly on the raw video data, which is a key aspect of our method.

Secondly, LFAV is curated under a weak supervision paradigm, offering only video-level annotations for its training set. The absence of readily available segment-level ground truth makes LFAV unsuitable for training critical components of our PreFM model. Specifically, mechanisms like our predictive future modeling and focal temporal prioritization require finer-grained temporal supervision than what LFAV’s training annotations provide, rendering it incompatible with the training requirements for our online streaming prediction approach.

6.3.0.6 More online inference details

All methods are evaluated using their officially provided checkpoints; for those without an available checkpoint, we reproduce the results using their official code. For any prediction at time \(T\) in online testing, only data from \(0\) to \(T\) is available.

For On-AVEL tasks, SOTA methods [16]–[18] pad the entire video beyond \(T+1\) with zeros as input because these methods originally utilize the complete video, and we use this padding to ensure a uniform input length under online settings. Our method, in contrast, does not use all available historical data up to \(T\); instead, it processes the segment \([T-L_c+1, T]\) as input to derive the prediction at time \(T\).

For On-AVVP tasks, SOTA approaches [49], [52]–[54], [75] employ the segment \([T-9, T]\) as input, since these methods are designed for 10-second video clips. Similarly, our method utilizes the segment \([T-L_c+1, T]\) as input for making predictions at time \(T\).

6.3.0.7 Re-evaluation of efficiency metrics for fair comparison

Table 9 (for the On-AVEL task) and Table 10 (for the On-AVVP task) present side-by-side comparisons of efficiency metrics. These include figures from our standardized re-evaluation (denoted as “Our Eval.”) and those originally published in the respective papers (denoted as “Reported”).

For our evaluations, we adhere to a strict and consistent protocol. The number of trainable parameters for all models is calculated as the sum of elements in all parameters requiring gradients (using sum(p.numel() for p in model.parameters() if p.requires_grad)). To measure FLOPs, we consistently employ the thop library for all methods, assessing a single forward pass (via flops, _ = profile(model, inputs=(input,))). All our efficiency tests are conducted under identical environmental conditions to ensure reproducibility and a fair basis for comparison.
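The measurement protocol described above can be reproduced with a short helper like the following; only the calls stated above (parameter counting and thop's profile) are assumed.

```python
import torch
from thop import profile


def count_trainable_params(model):
    # Sum of elements over all parameters requiring gradients, as stated above.
    return sum(p.numel() for p in model.parameters() if p.requires_grad)


def measure_flops(model, example_inputs):
    """FLOPs of a single forward pass via the thop library; `example_inputs`
    must be a tuple matching the model's forward signature (shapes are
    model-specific and therefore assumptions here)."""
    model.eval()
    with torch.no_grad():
        flops, _ = profile(model, inputs=example_inputs)
    return flops
```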

Discrepancies may be observed between our “Our Eval.” figures and the “Reported” values from the original publications. Such differences can arise from variations in measurement methodologies, specific versions of libraries used, or the underlying hardware and software environments. We present both sets of values to offer a transparent perspective, respecting the data from original publications while providing a benchmark that is directly comparable across methods under our unified testing framework.

Table 9: Comparison of re-evaluated (“Our Eval.”) and originally reported (“Reported”) efficiency metrics on the On-AVEL task. “-” indicates the metric was not provided in the original paper.
Methods | Params: Our Eval., Reported | FLOPs: Our Eval., Reported
UnAV [16] (CVPR2023) 139.4M - 52.4G -
UniAV [17] (Arxiv2404) 130.8M 130M 22.7G -
CCNet [18] (AAAI2025) 238.8M - 72.1G -
PreFM 12.3M none 0.4G none
PreFM+ 36.9M none 0.5G none
Table 10: Comparison of re-evaluated (“Our Eval.”) and originally reported (“Reported”) efficiency metrics on the On-AVVP task. “-” indicates the metric was not provided in the original paper.
Methods | Params: Our Eval., Reported | FLOPs: Our Eval., Reported
VALOR [49] (NeurIPS2023) 4.9M 5.1M 0.45G 0.45G
CoLeaF [52] (ECCV2024) 5.7M - 0.25G 48.2G
LEAP [75] (ECCV2024) 52.0M 52.0M 1.09G 0.79G
NREP [53] (TNNLS2024) 9.6M 9.6M 1.69G 0.37G
MM-CSE [54] (AAAI2025) 6.2M 4.5M 0.91G 0.80G
PreFM 3.3M none 0.22G none
PreFM+ 12.1M none 0.48G none

6.4 More Qualitative Analysis↩︎

6.4.0.1 On-AVEL

Figure 7 provides further qualitative validation of our method’s performance on the On-AVEL task through four distinct examples, comparing our results (“Ours”) against the state-of-the-art CCNet [18] model (“SOTA”). These visualizations collectively demonstrate that our approach consistently yields event localizations that are more closely aligned with the ground-truth annotations than CCNet.

Figure 7: More visualization results on the On-AVEL task.

6.4.0.2 On-AVVP

Similarly, we provide further qualitative results of our method on the On-AVVP task through four distinct examples, comparing our results (“Ours”) against the state-of-the-art MM-CSE [54] model (“SOTA”). The visualization results are shown in Figure 8. These visualizations highlight our method’s superior performance in precisely parsing events and reducing errors compared to the SOTA model.

This robust handling of both unimodal and multimodal event characteristics signifies a key advantage of our approach for the online audio-visual event parsing task.

Figure 8: More visualization results on the On-AVVP task.

References↩︎

[1]
H. Cui et al., “Towards multimodal foundation models in molecular cell biology,” Nature, vol. 640, no. 8059, pp. 623–633, 2025.
[2]
X. Zhao, S. Poria, X. Li, Y. Chen, and B. Tang, “Toward robust multimodal learning using multimodal foundational models,” arXiv preprint arXiv:2401.13697, 2024.
[3]
X. Liu, X. Xia, S.-K. Ng, and T.-S. Chua, “Continual multimodal contrastive learning,” arXiv preprint arXiv:2503.14963, 2025.
[4]
Y. Zong, O. Mac Aodha, and T. Hospedales, “Self-supervised multimodal learning: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.
[5]
Y. Wu, Y. Peng, Y. Lu, X. Chang, R. Song, and S. Watanabe, “Robust audiovisual speech recognition models with mixture-of-experts,” in IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 43–48.
[6]
Y. Wang et al., “Internvideo2: Scaling foundation models for multimodal video understanding,” in European Conference on Computer Vision, 2024, pp. 396–416.
[7]
Z. Ren et al., “Videoworld: Exploring knowledge learning from unlabeled videos,” 2025, pp. 29029–29039.
[8]
Y. Guo et al., “Crossmae: Cross-modality masked autoencoders for region-aware audio-visual pre-training,” 2024, pp. 26721–26731.
[9]
S. Mo and P. Morgado, “A unified audio-visual learning framework for localization, separation, and recognition,” in International Conference on Machine Learning, 2023, pp. 25006–25017.
[10]
Y.-B. Lin and G. Bertasius, “Siamese vision transformers are scalable audio-visual learners,” in European Conference on Computer Vision, 2024, pp. 303–321.
[11]
T. Geng, J. Zhang, Q. Wang, T. Wang, J. Duan, and F. Zheng, “Longvale: Vision-audio-language-event benchmark towards time-aware omni-modal perception of long videos,” arXiv preprint arXiv:2411.19772, 2024.
[12]
S. Mo, W. Pian, and Y. Tian, “Class-incremental grouping network for continual audio-visual learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7788–7798.
[13]
W. Pian, S. Mo, Y. Guo, and Y. Tian, “Audio-visual class-incremental learning,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 7799–7811.
[14]
G. Li, H. Du, and D. Hu, “Boosting audio visual question answering via,” in Proceedings of the 32nd ACM International Conference on Multimedia, 2024, pp. 5997–6005.
[15]
Z. Li, D. Guo, J. Zhou, J. Zhang, and M. Wang, “Object-aware adaptive-positivity learning for audio-visual question answering,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, 2024, pp. 3306–3314.
[16]
T. Geng, T. Wang, J. Duan, R. Cong, and F. Zheng, “Dense-localizing audio-visual events in untrimmed videos: A large-scale benchmark and baseline,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22942–22951.
[17]
T. Geng, T. Wang, Y. Zhang, J. Duan, W. Guan, and F. Zheng, “Uniav: Unified audio-visual perception for multi-task video localization,” arXiv e-prints, pp. arXiv–2404, 2024.
[18]
Z. Zhou, J. Zhou, W. Qian, S. Tang, X. Chang, and D. Guo, “Dense audio-visual event localization under cross-modal consistency and multi-temporal granularity collaboration,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, 2025, pp. 10905–10913.
[19]
Y. Xu et al., “Vlm-ad: End-to-end autonomous driving through vision-language model supervision,” arXiv preprint arXiv:2412.14446, 2024.
[20]
J. Zhang, Z. Huang, A. Ray, and E. Ohn-Bar, “Feedback-guided autonomous driving,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 15000–15011.
[21]
C. Huang, Y. Tian, A. Kumar, and C. Xu, “Egocentric audio-visual object localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 22910–22921.
[22]
H. Chen, Y. Hou, C. Qu, I. Testini, X. Hong, and J. Jiao, “360+x: A panoptic multi-modal scene understanding dataset,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 19373–19382.
[23]
A. Shek, B. Y. Su, R. Chen, and C. Liu, “Learning from physical human feedback: An object-centric one-shot adaptation method,” in IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 9910–9916.
[24]
P. Li, L. Cao, X.-M. Wu, X. Yu, and R. Yang, “UGotMe: An embodied system for affective human-robot interaction,” arXiv preprint arXiv:2410.18373, 2024.
[25]
Y. Tian, D. Li, and C. Xu, “Unified multisensory perception: Weakly-supervised audio-visual video parsing,” in European Conference on Computer Vision, 2020, pp. 436–454.
[26]
E. Vahdani and Y. Tian, “Deep learning-based action detection in untrimmed videos: A survey,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 4, pp. 4302–4320, 2022.
[27]
R. De Geest, E. Gavves, A. Ghodrati, Z. Li, C. Snoek, and T. Tuytelaars, “Online action detection,” in European Conference on Computer Vision, 2016, pp. 269–284.
[28]
Z. Pang, F. Sener, and A. Yao, “Context-enhanced memory-refined transformer for online action detection,” arXiv preprint arXiv:2503.18359, 2025.
[29]
D. Damen et al., “Rescaling egocentric vision: Collection, pipeline and challenges for epic-kitchens-100,” International Journal of Computer Vision, pp. 1–23, 2022.
[30]
Y. Zhao and P. Krähenbühl, “Real-time online video detection with temporal smoothing transformers,” in European Conference on Computer Vision, 2022, pp. 485–502.
[31]
S. Reza, Y. Zhang, M. Moghaddam, and O. Camps, “HAT: History-augmented anchor transformer for online temporal action localization,” in European Conference on Computer Vision, 2024, pp. 205–222.
[32]
Y. Song, D. Kim, M. Cho, and S. Kwak, “Online temporal action localization with memory-augmented transformer,” in European Conference on Computer Vision, 2024, pp. 74–91.
[33]
M. Guermal, A. Ali, R. Dai, and F. Bremond, “JOADAA: Joint online action detection and action anticipation,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2024, pp. 6889–6898.
[34]
H. Yuan, Y. Chen, Z. Ji, Z. Zheng, Y. Gu, and J. Zhou, “Throughout procedural transformer for online action detection and anticipation,” IEEE Transactions on Circuits and Systems for Video Technology, 2025.
[35]
J. Wang, G. Chen, Y. Huang, L. Wang, and T. Lu, “Memory-and-anticipation transformer for online action understanding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 13824–13835.
[36]
H. Guo, H. Wang, and Q. Ji, “Bayesian evidential deep learning for online action detection,” in European Conference on Computer Vision, 2024, pp. 283–301.
[37]
Y. Wang et al., “Does video-text pretraining help open-vocabulary online action detection?” Advances in Neural Information Processing Systems, vol. 37, pp. 47908–47930, 2024.
[38]
Y. Li et al., “OVO-bench: How far is your video-LLMs from real-world online video understanding?” arXiv preprint arXiv:2501.05510, 2025.
[39]
J. Chen et al., “Videollm-online: Online video large language model for streaming video,” 2024, pp. 18407–18418.
[40]
H. Zhang et al., “Flash-vstream: Memory-based real-time understanding for long video streams,” arXiv preprint arXiv:2406.08085, 2024.
[41]
J. Yu, Y. Cheng, R.-W. Zhao, R. Feng, and Y. Zhang, “Mm-pyramid: Multimodal pyramid attentional network for audio-visual event localization and video parsing,” in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 6241–6249.
[42]
J. Fu, J. Gao, B.-K. Bao, and C. Xu, “Multimodal imbalance-aware gradient modulation for weakly-supervised audio-visual video parsing,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 6, pp. 4843–4856, 2023.
[43]
J. Gao, M. Chen, and C. Xu, “Collecting cross-modal presence-absence evidence for weakly-supervised audio-visual event perception,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 18827–18836.
[44]
J. Gao, M. Chen, and C. Xu, “Learning probabilistic presence-absence evidence for weakly-supervised audio-visual event perception,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2025.
[45]
Y. Chen et al., “CM-PIE: Cross-modal perception for interactive-enhanced audio-visual video parsing,” 2024, pp. 8421–8425.
[46]
Y. Fan, Y. Wu, B. Du, and Y. Lin, “Revisit weakly-supervised audio-visual video parsing from the language perspective,” Advances in Neural Information Processing Systems, vol. 36, pp. 40610–40622, 2023.
[47]
A. Radford et al., “Learning transferable visual models from natural language supervision,” in International Conference on Machine Learning, 2021, pp. 8748–8763.
[48]
B. Elizalde, S. Deshmukh, M. Al Ismail, and H. Wang, “Clap learning audio concepts from natural language supervision,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2023, pp. 1–5.
[49]
Y.-H. Lai, Y.-C. Chen, and F. Wang, “Modality-independent teachers meet weakly-supervised audio-visual event parser,” Advances in Neural Information Processing Systems, vol. 36, pp. 73633–73651, 2023.
[50]
J. Zhou, D. Guo, Y. Zhong, and M. Wang, “Advancing weakly-supervised audio-visual video parsing via segment-wise pseudo labeling,” International Journal of Computer Vision, vol. 132, no. 11, pp. 5308–5329, 2024.
[51]
Y.-H. Lai, J. Ebbers, Y.-C. F. Wang, F. Germain, M. J. Jones, and M. Chatterjee, “UWAV: Uncertainty-weighted weakly-supervised audio-visual video parsing,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025, pp. 13561–13570.
[52]
F. Sardari, A. Mustafa, P. J. Jackson, and A. Hilton, “CoLeaF: A contrastive-collaborative learning framework for weakly supervised audio-visual video parsing,” in European Conference on Computer Vision, 2024, pp. 1–17.
[53]
X. Jiang, X. Xu, L. Zhu, Z. Sun, A. Cichocki, and H. T. Shen, “Resisting noise in pseudo labels: Audible video event parsing with evidential learning,” IEEE Transactions on Neural Networks and Learning Systems, 2024.
[54]
P. Zhao, J. Zhou, Y. Zhao, D. Guo, and Y. Chen, “Multimodal class-aware semantic enhancement network for audio-visual video parsing,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, 2025, pp. 10448–10456.
[55]
Y. Tian, J. Shi, B. Li, Z. Duan, and C. Xu, “Audio-visual event localization in unconstrained videos,” in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 247–263.
[56]
Y. Xia and Z. Zhao, “Cross-modal background suppression for audio-visual event localization,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 19989–19998.
[57]
T. Mahmud and D. Marculescu, “Ave-clip: Audioclip-based multi-window temporal transformer for audio visual event localization,” in Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 5158–5167.
[58]
Y. Jiang, J. Yin, and Y. Dang, “Leveraging the video-level semantic consistency of event for audio-visual event localization,” IEEE Transactions on Multimedia, vol. 26, pp. 4617–4627, 2023.
[59]
Y.-B. Lin, Y.-L. Sung, J. Lei, M. Bansal, and G. Bertasius, “Vision transformers are parameter-efficient audio-visual learners,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 2299–2309.
[60]
X. He et al., “CACE-net: Co-guidance attention and contrastive enhancement for effective audio-visual event localization,” 2024, pp. 985–993.
[61]
J. Zhou, D. Guo, and M. Wang, “Contrastive positive sample propagation along the audio-visual event line,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 45, no. 6, pp. 7239–7257, 2022.
[62]
W. Liu, Y. Chai, Y. Yan, and Y. Ren, “Audio-visual event localization on portrait mode short videos,” arXiv preprint arXiv:2504.06884, 2025.
[63]
J. Zhou et al., “Towards open-vocabulary audio-visual event localization,” arXiv preprint arXiv:2411.11278, 2024.
[64]
L. Xing, H. Qu, R. Yan, X. Shu, and J. Tang, “Locality-aware cross-modal correspondence learning for dense audio-visual events localization,” arXiv preprint arXiv:2409.07967, 2024.
[65]
Y. Liu, Q. Wu, M. Zeng, Y. Liu, and Y. Pan, “FASTEN: Video event localization based on audio-visual feature alignment and multi-scale temporal enhancement,” IEEE Signal Processing Letters, 2025.
[66]
B. Zhang et al., “VideoLLaMA 3: Frontier multimodal foundation models for image and video understanding,” arXiv preprint arXiv:2501.13106, 2025.
[67]
Y. Tang, D. Shimada, J. Bi, M. Feng, H. Hua, and C. Xu, “Empowering llms with pseudo-untrimmed videos for audio-visual temporal understanding,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, 2025, pp. 7293–7301.
[68]
F. Ma, X. Jin, H. Wang, Y. Xian, J. Feng, and Y. Yang, “Vista-llama: Reliable video narrator via equal distance to visual tokens,” arXiv preprint arXiv:2312.08870, 2023.
[69]
W. Hou, G. Li, Y. Tian, and D. Hu, “Toward long form audio-visual video understanding,” ACM Transactions on Multimedia Computing, Communications and Applications, vol. 20, no. 9, pp. 1–26, 2024.
[70]
A. Vaswani et al., “Attention is all you need,” Advances in Neural Information Processing Systems, vol. 30, 2017.
[71]
R. Girdhar et al., “Imagebind: One embedding space to bind them all,” 2023, pp. 15180–15190.
[72]
P. Wang et al., “One-peace: Exploring one general representation model toward unlimited modalities,” arXiv preprint arXiv:2305.11172, 2023.
[73]
A. Guzhov, F. Raue, J. Hees, and A. Dengel, “Audioclip: Extending clip to image, text and audio,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 976–980.
[74]
X. He et al., “Vlab: Enhancing video language pre-training by feature adapting and blending,” IEEE Transactions on Multimedia, 2024.
[75]
J. Zhou, D. Guo, Y. Mao, Y. Zhong, X. Chang, and M. Wang, “Label-anticipated event disentanglement for audio-visual video parsing,” in European Conference on Computer Vision, 2024, pp. 35–51.
[76]
J. Carreira and A. Zisserman, “Quo vadis, action recognition? A new model and the kinetics dataset,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6299–6308.
[77]
S. Hershey et al., “CNN architectures for large-scale audio classification,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2017, pp. 131–135.
[78]
D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, and M. Paluri, “A closer look at spatiotemporal convolutions for action recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 6450–6459.
[79]
K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.