March 06, 2025
Recent advances in video-based multimodal large language models (Video-LLMs) have significantly improved video understanding by processing videos as sequences of image frames. However, many existing methods treat frames independently in the vision backbone, lacking explicit temporal modeling, which limits their ability to capture dynamic patterns and efficiently handle long videos. To address these limitations, we introduce STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), a novel architecture incorporating a dedicated temporal encoder between the image encoder and the LLM. Our temporal encoder leverages the Mamba State Space Model to integrate temporal information into image tokens, generating enriched representations that preserve inter-frame dynamics across the entire video sequence. This enriched encoding not only enhances video reasoning capabilities but also enables effective token reduction strategies, including test-time sampling and training-based temporal and spatial pooling, substantially reducing computational demands on the LLM without sacrificing key temporal information. By integrating these techniques, our approach simultaneously reduces training and inference latency while improving performance, enabling efficient and robust video understanding over extended temporal contexts. Extensive evaluations show that STORM achieves state-of-the-art results across various long video understanding benchmarks (more than 5% improvement on MLVU and LongVideoBench) while reducing computation costs by up to \(8\times\) and decoding latency by 2.4–2.9\(\times\) for a fixed number of input frames.
Figure 1: Open-Ended Video Understanding. We show STORM’s ability to handle free-form queries about complex long video scenes. By employing the Mamba-based temporal encoder to capture essential spatiotemporal cues while compressing redundant frame information, STORM enables efficient, accurate long-video understanding and outperforms existing methods on a wide range of video understanding tasks.
Recent advancements in video-based multimodal Large Language Models (Video-LLMs) have significantly improved the ability of artificial intelligence systems to understand and generate descriptions of video content [1]–[6]. A common strategy in these models involves treating a video as a sequence of individual image frames, processing each frame independently using an image encoder and a visual-language projector [2]–[4]. The resulting frame-level representations are then fed into a large language model (LLM), which performs temporal reasoning to comprehend the sequence of events depicted in the video. Despite leveraging powerful image encoders and the language understanding capabilities of LLMs, this approach exhibits fundamental limitations in video processing, particularly for long videos. The absence of explicit temporal encoding fails to capture crucial temporal dynamics between frames, forcing the LLM to infer temporal relationships solely from sequences of static images. This sequential processing imposes a substantial computational burden on the language model, significantly impairing its ability to handle extended video sequences and to generalize to longer contexts at inference time. To manage computational overhead, existing methods often rely on naive frame subsampling to reduce the number of tokens for LLM processing. However, this approach results in significant information loss from discarded frames, potentially eliminating critical details necessary for comprehensive video understanding [7], [8]. Additionally, these methods fail to effectively compress consecutive frames that typically contain substantial overlapping information.
In this work, we propose STORM (Spatiotemporal TOken Reduction for Multimodal LLMs), which introduces a novel temporal encoder between the image encoder and the LLM that bridges the gap between visual and language representations. This architecture integrates temporal dynamics earlier in the pipeline, significantly enhancing the temporal reasoning capabilities of Video-LLMs while enabling substantial downstream computational efficiency. By injecting temporal information directly into visual tokens, we reduce the LLM’s temporal reasoning burden, allowing it to focus on higher-level language tasks. We employ the Mamba State Space Model [9] as the core of our temporal layer, enabling efficient processing of long videos while enhancing generalization to extended temporal contexts. The temporal layer processes image and video inputs differently — for images, it functions as a spatial scanner that enhances tokens by incorporating global spatial context; for videos, it performs simultaneous spatial and temporal scanning that captures comprehensive spatiotemporal information.
The key advantage of the Mamba layer is its ability to compress historical information into state representations. As consecutive frames in the video input often contain redundant information, our temporal encoder efficiently processes and propagates temporal information across the entire video sequence. The resulting visual tokens inherently encapsulate temporal history, effectively summarizing the temporal dynamics of the video. This characteristic allows us to extract fewer visual tokens for LLM processing while preserving key information. In our experiments, we explore a straightforward yet effective token reduction approach by applying training-free token subsampling at test time. This method not only reduces computational costs but also enhances performance across various scenarios. Additionally, we explore training-based compression methods, including temporal and spatial token compression. Temporal pooling reduces the number of tokens along the temporal dimension, while spatial average pooling decreases the number of tokens per frame. These compression strategies are optimized during training to preserve essential information while minimizing redundancy. Unlike previous methods that directly subsample raw video frames, which can lead to loss of critical information, our approaches preserve the essential temporal information in a compressed format. Our method not only reduces the computational load on the LLM but also improves model performance through a more comprehensive video representation in a compact token space. Our empirical evaluation shows that the proposed approach successfully extends to long-context video understanding without compromising training efficiency.
Vision language models (VLMs) primarily adopt two paradigms to process visual input. The first paradigm freezes language model weights and integrates visual information via cross-attention mechanisms [10], while the second paradigm utilizes a pre-trained image encoder, such as CLIP [11] or SigLIP [12], to convert images into tokens. These tokens are then concatenated with text tokens and input into the language model [1], [2], [13]. This approach can be naturally extended to video understanding by treating videos as sequences of images processed by the vision encoder [3], [4]. To enhance video processing, some works introduce specialized video encoders. For instance, InternVideo [5], [6] uses VideoMAE [14] as a video encoder, while Kangaroo [15] integrates depth-wise 3D convolution for fusing video tokens. In this work, we retain SigLIP as the vision encoder and focus on enhancing long video understanding by incorporating a linear-complexity temporal module based on the Mamba [9] architecture. Positioned between the SigLIP vision encoder and the language model, this module efficiently improves spatiotemporal modeling.
Understanding long videos with VLMs presents significant challenges in both accuracy and efficiency. Previous approaches have employed long-context language models trained on short-context video data to enable long video comprehension [7]. However, these methods lack sufficient long video training data and incur high computational costs during both training and inference as the number of frames increases. LongVILA [8] addresses these challenges through a multi-modal sequence parallelism system that directly handles long video data during training and inference, but this approach requires customized system implementations tailored for multi-GPU setups. Another line of research focuses on token reduction to shorten input sequences, thereby enabling efficient inference for long videos [16]–[21]. For instance, VoCO-LLaMA [18] and VideoXL [20] use recursive KV cache compression learnt in an end-to-end manner, and LongVU [21] leverages DINO features for frame selection and inter-frame similarity to reduce tokens. Despite these diverse strategies, direct pooling along the temporal or spatial dimensions often performs sufficiently well, with additional gains being marginal. In this paper, we apply temporal and spatial pooling for token reduction, achieving superior performance when combined with our temporal projector.
This section presents the Mamba-based temporal projector architecture and introduces several token compression techniques for efficient long video processing. We propose two token compression approaches: temporal and spatial compression. We first detail our temporal and spatial pooling strategies, which effectively reduce the number of tokens during training. Additionally, we propose a test-time temporal token sampling method that maintains model performance without requiring additional training steps. An overview of our method is shown in Figure 2.
Figure 2: Overview of STORM pipeline. We propose a Mamba-based temporal projector between the image encoder and the LLM. This projector bridges the gap between visual and language representation while injecting temporal information into the tokens. The processed tokens, denoted as Summary Tokens in the figure, naturally capture temporal history, effectively summarizing the temporal dynamics of the video. This capability enables us to reduce the number of visual tokens for LLM processing without sacrificing essential information.
State Space Models (SSMs) A State Space Model (SSM) [9], [22]–[24] establishes a linear transformation between an input sequence \(\boldsymbol{x}_{1:T} \in \mathbb{R}^{T \times D}\) and an output sequence \(\boldsymbol{y}_{1:T} \in \mathbb{R}^{T \times D}\) through the following recurrent process: \[\begin{align} \boldsymbol{h}_t &= \boldsymbol{\overline{A}}_t \boldsymbol{h}_{t-1} + \boldsymbol{\overline{B}}_t \boldsymbol{x}_t\;, \\ \boldsymbol{y}_t &= \boldsymbol{C}_t \boldsymbol{h}_t\;. \end{align}\]
Here, \(T\) is the sequence length; \(\boldsymbol{x}_t, \boldsymbol{y}_t \in \mathbb{R}^{D}\) are input and output vectors at time \(t\); and \(\boldsymbol{h}_t \in \mathbb{R}^{H}\) is the hidden state summarizing the history \(\boldsymbol{x}_{\leq t}\). The matrices \(\boldsymbol{\overline{A}}_t \in \mathbb{R}^{H \times H}\), \(\boldsymbol{\overline{B}}_t \in \mathbb{R}^{H \times D}\), and \(\boldsymbol{C}_t \in \mathbb{R}^{D \times H}\) are parameterized with learnable weights designed to facilitate the modeling of long-term dependencies. When \(\boldsymbol{\overline{A}}_t, \boldsymbol{\overline{B}}_t, \boldsymbol{C}_t\) are time-invariant (constant over \(t\)), the computation of \(\boldsymbol{y}_{1:T}\) can be parallelized, enabling efficient training and inference.
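For concreteness, a minimal sketch of this recurrence is given below, assuming time-invariant matrices and a plain sequential loop rather than the parallel scan used in practice; shapes follow the notation above.

```python
import torch

def ssm_scan(x, A_bar, B_bar, C):
    """x: (T, D) input sequence -> y: (T, D) output sequence."""
    T, D = x.shape
    H = A_bar.shape[0]
    h = torch.zeros(H)                # hidden state h_0
    ys = []
    for t in range(T):
        h = A_bar @ h + B_bar @ x[t]  # h_t = A_bar h_{t-1} + B_bar x_t
        ys.append(C @ h)              # y_t = C h_t
    return torch.stack(ys)

# Example: T = 16 tokens of dimension D = 8, hidden size H = 4.
y = ssm_scan(torch.randn(16, 8), torch.eye(4) * 0.9, torch.randn(4, 8), torch.randn(8, 4))
```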
Mamba Recently, Mamba [9] proposed conditioning these matrices on the input \(\boldsymbol{x}_t\) to enhance the sequence modeling capabilities of SSMs. Specifically, Mamba employs learnable functions \(\boldsymbol{\overline{A}} \colon \mathbb{R}^{D} \to \mathbb{R}^{H \times H}\), \(\boldsymbol{\overline{B}} \colon \mathbb{R}^{D} \to \mathbb{R}^{H \times D}\), and \(\boldsymbol{C} \colon \mathbb{R}^{D} \to \mathbb{R}^{D \times H}\) to generate input-dependent matrices as follows: \[\begin{align} \boldsymbol{\overline{A}}_t &= \boldsymbol{\overline{A}}(\boldsymbol{x}_t)\;, \quad \boldsymbol{\overline{B}}_t = \boldsymbol{\overline{B}}(\boldsymbol{x}_t)\;, \quad \boldsymbol{C}_t = \boldsymbol{C}(\boldsymbol{x}_t)\;. \end{align}\] This approach allows the model to dynamically emphasize or suppress information based on the current input, thereby enabling more flexible and adaptive sequence modeling. Additionally, Mamba leverages a hardware-aware parallel algorithm to ensure that the input-dependent matrices do not hinder the training and inference efficiency inherent to SSMs.
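The sketch below illustrates this selectivity in simplified form: each matrix is produced by a learned projection of the current input. The diagonal state transition and the specific projections (a_proj, b_proj, c_proj) are illustrative stand-ins rather than Mamba's exact parameterization, and the hardware-aware parallel scan is omitted.

```python
import torch
import torch.nn as nn

class SelectiveSSMSketch(nn.Module):
    """Simplified input-dependent SSM: the transition, input, and readout weights
    are all functions of the current input x_t."""
    def __init__(self, d_model: int, d_state: int):
        super().__init__()
        self.a_proj = nn.Linear(d_model, d_state)  # controls per-state decay of h
        self.b_proj = nn.Linear(d_model, d_state)  # gates the input into h
        self.c_proj = nn.Linear(d_model, d_state)  # readout weights from h

    def forward(self, x):                              # x: (T, D)
        T, D = x.shape
        h = x.new_zeros(self.a_proj.out_features, D)   # state: (H, D)
        ys = []
        for t in range(T):
            a = torch.sigmoid(self.a_proj(x[t]))       # (H,), values in (0, 1)
            b = self.b_proj(x[t])                      # (H,)
            c = self.c_proj(x[t])                      # (H,)
            h = a.unsqueeze(-1) * h + b.unsqueeze(-1) * x[t]  # input-dependent update
            ys.append(c @ h)                           # (D,)
        return torch.stack(ys)                         # (T, D)

y = SelectiveSSMSketch(d_model=8, d_state=4)(torch.randn(16, 8))
```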
Traditional Video-LLMs often process video frames independently, requiring the LLM to infer temporal relationships from sequences of static images. This approach is computationally inefficient, particularly when processing long videos. Additionally, this method fails to leverage the inherent temporal redundancy in video data, resulting in redundant processing of similar information across consecutive frames. To address these limitations, we introduce the Mamba-based temporal projector, which efficiently integrates temporal information across video frames while enabling effective temporal token compression.
Let \(\mathbf{X}_t \in \mathbb{R}^{\hat{N} \times D}\) denote the image tokens for frame \(t\) produced by a Vision Transformer (ViT) encoder, where \(\hat{N}\) is the number of tokens per frame and \(D\) is the token dimension. We first apply a linear layer to downsample the tokens of each frame to \(\frac{\hat{N}}{r}\) tokens: \[\tilde{\mathbf{X}}_t = \text{Linear}\left( \mathbf{X}_t \right), \quad \text{for } t = 1, \dots, T,\]
where \(r\) is the downsampling ratio. For simplicity, we define \(N = \frac{\hat{N}}{r}\) and use \(N\) throughout the remainder of this paper. The downsampled tokens from all frames are stacked to form the input tensor for the temporal module: \[\tilde{\mathbf{X}} = \left[ \tilde{\mathbf{X}}_1; \tilde{\mathbf{X}}_2; \dots; \tilde{\mathbf{X}}_T \right] \in \mathbb{R}^{T \times N \times D}.\]
The core of the temporal projector consists of \(L\) Mamba layers that iteratively integrate temporal dynamics into the tokens. In each layer \(l = 1, \dots, L\), we fuse temporal information into the visual tokens by: \[\mathbf{X}^{(l)} = \mathbf{X}^{(l-1)} + \text{MambaMixer}\left( \text{Norm}\left( \mathbf{X}^{(l-1)} \right) \right),\] where \(\mathbf{X}^{(0)} = \tilde{\mathbf{X}}\), and \(\text{Norm}(\cdot)\) denotes layer normalization. Each MambaMixer employs a bidirectional scanning module that captures dependencies in both the spatial and temporal dimensions. Specifically, we apply a sweeping scan order within each frame and across frames, i.e., left-to-right, top-to-bottom, and frame-to-frame (see Figure 2). After \(L\) layers, we obtain tokens enriched with temporal information, denoted as \(\mathbf{X}^{(L)} \in \mathbb{R}^{T \times N \times D}\).
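A hedged sketch of this residual structure is shown below. The placeholder mixers are plain linear layers so the snippet runs; the actual module applies a bidirectional Mamba scan over the flattened spatiotemporal token sequence.

```python
import torch
import torch.nn as nn

class TemporalProjectorSketch(nn.Module):
    """L residual layers of Norm -> mixer over the flattened (T*N, D) token sequence.
    The linear 'mixers' below are stand-ins for the bidirectional MambaMixer blocks."""
    def __init__(self, dim: int, num_layers: int):
        super().__init__()
        self.norms = nn.ModuleList([nn.LayerNorm(dim) for _ in range(num_layers)])
        self.mixers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, x):                    # x: (T, N, D) downsampled frame tokens
        T, N, D = x.shape
        seq = x.reshape(T * N, D)            # sweep order: left-to-right, top-to-bottom, frame-to-frame
        for norm, mixer in zip(self.norms, self.mixers):
            seq = seq + mixer(norm(seq))     # X^(l) = X^(l-1) + MambaMixer(Norm(X^(l-1)))
        return seq.reshape(T, N, D)          # X^(L): tokens enriched with temporal information

enriched = TemporalProjectorSketch(dim=1024, num_layers=4)(torch.randn(8, 64, 1024))
```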
Figure 3: Token Compression Strategies. This figure illustrates our token compression techniques: temporal average pooling (left), spatial average pooling (middle), and training-free temporal token sampling (right). These methods can be applied individually or in combination, depending on task requirements and computational budget constraints.
Processing long videos poses two major challenges for LLMs. First, handling all frames is computationally expensive, often requiring specialized system optimizations such as sequence parallelization and multiple GPUs for training and inference [8]. Second, LLMs are inherently limited by their training context lengths. For example, LLaMA 3 has a context length of 8K tokens, which corresponds to only 32 frames of video input when using 256 tokens per frame. Without token compression, video processing quickly exceeds the effective input capacity of LLMs, leading to significantly reduced model performance. In this work, we address both computational cost and context length limitations by enabling efficient long-video processing through token compression along both the temporal and spatial dimensions. Our approach eliminates the need for custom system optimizations and allows inference on a single GPU. We illustrate the training-time token compression in Figure 3 (left and middle).
Since consecutive frames often contain similar information, analyzing every frame may lead to redundant processing and potential overfitting. Additionally, having too many tokens can make it difficult for the LLM to identify important temporal patterns. Thus, we propose applying temporal average pooling to compress visual information efficiently [25], [26]. This method combines data from consecutive frames by averaging their enriched visual tokens. Specifically, for the tokens \(\mathbf{X}^{(L)} \in \mathbb{R}^{T \times N \times D}\) from the temporal projector, we average every \(k\) consecutive frames:
\[\mathbf{X}_{\text{time-avg}} = \frac{1}{k} \sum_{i=0}^{k-1} \mathbf{X}^{(L)}_{t+i}, \;\text{for } t = 0, k, 2k, \dots, T - k.\]
As a result, we obtain compressed tokens:
\[\mathbf{X}_{\text{time-avg}} \in \mathbb{R}^{\frac{T}{k} \times N \times D}.\]
Despite its simplicity, temporal averaging effectively decreases the number of tokens the LLM processes, with minimal loss of critical information. This motivates us to adopt this simple yet effective technique for efficient training in long video understanding.
In addition to temporal pooling, we also explore average pooling in the spatial domain. Formally, given the input \(\mathbf{X}^{(L)} \in \mathbb{R}^{T \times N \times D}\) from the temporal projector and the spatial compression ratio \(p\), we apply average pooling with a kernel size and stride of \(p\) on each frame, resulting in \(\mathbf{X}_{\text{space-avg}} \in \mathbb{R}^{T \times \frac{N}{p} \times D}\).
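The two pooling operations can be sketched as follows. Note that the text parameterizes spatial pooling by a compression ratio \(p\); in this sketch the pooling kernel side is written explicitly, so a 2×2 kernel gives the 4× per-frame reduction (e.g., 256 → 64 tokens) used in our training setup.

```python
import torch
import torch.nn.functional as F

def temporal_avg_pool(x, k: int):
    """x: (T, N, D) -> (T // k, N, D); average every k consecutive frames."""
    T, N, D = x.shape
    return x[: (T // k) * k].reshape(T // k, k, N, D).mean(dim=1)

def spatial_avg_pool(x, kernel_side: int):
    """x: (T, N, D) -> (T, N // kernel_side**2, D); assumes the N tokens per frame
    form a square grid and applies kernel_side x kernel_side average pooling."""
    T, N, D = x.shape
    side = int(N ** 0.5)
    grid = x.reshape(T, side, side, D).permute(0, 3, 1, 2)   # (T, D, side, side)
    pooled = F.avg_pool2d(grid, kernel_size=kernel_side)     # (T, D, side/ks, side/ks)
    return pooled.flatten(2).permute(0, 2, 1)                # back to (T, N', D)
```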
After processing through the temporal projector, each visual token is enriched with spatiotemporal information, capturing features not only from its corresponding frame but also from other frames across the entire video. This encoding of global information allows us to subsample the visual tokens along the temporal dimension at test time, further reducing the number of tokens fed into the LLM without significant loss of information or performance. This temporal token subsampling strategy can be applied with or without our pooling mechanisms. Note that, compared to methods that subsample raw frames and risk discarding critical temporal cues, our approach leverages the spatiotemporal richness of tokens after temporal encoding. We show an illustration of the temporal token sampling in Figure 3 (right).
Formally, let \(\mathbf{X}^{'} \in \mathbb{R}^{T' \times N' \times D}\) denote the token input to the subsampling layer, which can be the tokens \(\mathbf{X}^{(L)}\) from the temporal projector or the tokens \(\mathbf{X}_{\text{time-avg}}\) or \(\mathbf{X}_{\text{space-avg}}\) output by the compression modules. Here, \(T'\) and \(N'\) are the temporal and spatial dimensions after any temporal or spatial pooling. We apply uniform subsampling with rate \(s\) along the temporal dimension:
\[\mathbf{X}_{\text{time-skip}} = \{ \mathbf{X}^{'}_t \mid t = 0, s, 2s, \dots, T'-s \} \in \mathbb{R}^{\frac{T'}{s} \times {N'} \times D}.\]
Our empirical results demonstrate that this subsampling method not only maintains performance on various video understanding benchmarks but can also improve it by reducing noise from redundant frames.
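Putting these pieces together, an end-to-end example with illustrative shapes (128 frames, 256 tokens per frame, 4× temporal pooling at training time, then 2× training-free temporal sampling at test time) looks like this:

```python
import torch

x = torch.randn(128, 256, 1024)                # (T, N, D) enriched tokens from the temporal projector
x = x.reshape(32, 4, 256, 1024).mean(dim=1)    # 4x temporal average pooling -> (32, 256, 1024)
x = x[::2]                                     # 2x test-time temporal sampling -> (16, 256, 1024)
# The LLM now receives 16 * 256 = 4096 visual tokens instead of 128 * 256 = 32768.
```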
In this section, we extensively evaluate the proposed method on various video understanding benchmarks and provide empirical analysis demonstrating how the temporal projector enables efficient token reduction while delivering strong video reasoning abilities.
We initialize the image encoder and the LLM with the pre-trained SigLIP [12] from PaliGemma [27] and Qwen2-VL [28], respectively, and fine-tune them to adapt to our video datasets. The temporal projector is initialized with random weights. Each image is always resized to a \(448 \times 448\) resolution.
In the first stage, known as the Alignment Stage, we freeze both the image encoder and the LLM, training only the temporal projector using a small image-text dataset [29] containing 95K image-text pairs. Note that the Mamba layers perform not only a temporal scan but also a spatial scan within images, so video inputs are not strictly required to train them; for the alignment stage, we find it sufficient to use only image-text pairs to pretrain the temporal projector. In the second stage, the supervised fine-tuning (SFT) stage, we fine-tune all three components using a large and diverse dataset that includes text-only, image-text, and video-text data, with around 12.5M samples in our SFT data mixture. At this stage, we use 32 frames for each video input. For models with training-time token compression, we use a compression ratio of 4\(\times\) — temporal pooling models compress 32 frames to 8 frames, while spatial pooling models compress 256 tokens per image into 64 tokens. Moreover, for models with training-time token compression, we further employ a long video fine-tuning stage using 128-frame long-video inputs from the LLaVA-Video dataset [30]. We provide further details about the full SFT dataset and long video fine-tuning dataset in the appendix, Section 7.4.
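A minimal sketch of the stage-wise freezing scheme is shown below; the attribute names (vision_encoder, temporal_projector, llm) are illustrative rather than the actual implementation.

```python
def set_alignment_stage(model):
    """Stage 1: train only the temporal projector on image-text pairs."""
    for p in model.vision_encoder.parameters():
        p.requires_grad = False       # freeze the image encoder
    for p in model.llm.parameters():
        p.requires_grad = False       # freeze the LLM
    for p in model.temporal_projector.parameters():
        p.requires_grad = True        # train the randomly initialized temporal projector

def set_sft_stage(model):
    """Stage 2 (SFT): fine-tune all three components jointly."""
    for p in model.parameters():
        p.requires_grad = True
```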
We evaluate our STORM across multiple configurations on recent long-video understanding benchmarks specifically designed to rigorously assess video-language model capabilities. These benchmarks include EgoSchema [31], MVBench [32], MLVU [33], LongVideoBench [34], and VideoMME [35]. We compare our approach against a wide range of representative video-language models, including recently proposed models tailored for long-video understanding [7], [8], [15], [20], [21], [36], [37]. See [tab:main_results_compare_external] for details.
We implement STORM based on the VILA codebase [2], a typical VLM pipeline consisting of a vision encoder, an LLM, and a vision-language projector, and introduce our Mamba module and compression mechanisms into the architecture. We refer to all models trained with the VILA codebase as VILA-based models in our experiments. To thoroughly analyze our design, we evaluate STORM variants using all combinations of the three compression methods: temporal average pooling, spatial average pooling, and temporal token sampling. The full results are presented in [tab:comp_abl]. For comparison with existing Video-LLMs, we highlight the best-performing variants in [tab:main_results_compare_external] and include a detailed analysis of all STORM variants in [tab:main_results_compare_internel]. To ensure fairness, we also include a baseline VILA model trained with the same dataset and training scheme but without the Mamba module. All of our models in [tab:main_results_compare_internel] are trained under the same 8K token budget (corresponding to 32 frames of tokens), which represents the number of visual tokens fed to the LLM—a key factor affecting inference latency and memory consumption, particularly as the number of frames increases (see Section 5). Specifically, we report results for the standard STORM trained on 32-frame inputs and STORM with Temporal Pooling, which processes 128-frame inputs while reducing the token count to match the 32-frame variant. Additionally, we evaluate configurations where temporal sampling is applied at test time (+ T. Sampling), which not only further enhances model efficiency but also improves performance on certain benchmarks.
We start by comparing our STORM + T. Pooling (+ T. Sampling) configurations with existing Video-LLMs. As shown in [tab:main_results_compare_external] and detailed in [tab:main_results_compare_internel], STORM + T. Pooling achieves state-of-the-art performance across all long-video understanding benchmarks. Specifically, it achieves 71.3% accuracy on MVBench, 72.5% on MLVU, 59.5% on LongVideoBench, and 63.4% on VideoMME, outperforming all open-source Video-LLMs, including recent models specifically designed for long-context inputs such as LongVU and LongVILA. Additionally, our method significantly narrows the performance gap with proprietary models, outperforming GPT-4V and GPT-4o on MVBench and MLVU, as well as GPT-4V on VideoMME. Notably, STORM + T. Pooling achieves computational efficiency by compressing visual tokens to 25% of their original number before processing them through the LLM. We can further enhance efficiency by applying test-time temporal sampling (STORM + T. Pooling + T. Sampling), which reduces the token count to just 12.5% while maintaining competitive performance. In fact, this additional compression even improves results on certain benchmarks, yielding the best overall performance on MLVU and LongVideoBench.
Next, we provide controlled comparisons within VILA-based models to reveal the advantages of the Mamba-based temporal module in [tab:main_results_compare_internel]. We first compare the baseline VILA model with our STORM. By incorporating the Mamba module, STORM achieves performance gains on three out of four benchmarks, including a notable 2.3% improvement on VideoMME. Moreover, augmenting STORM with test-time temporal token sampling (STORM + T. Sampling) further enhances efficiency, reducing inference time by 43% while, surprisingly, maintaining or slightly boosting performance (an additional 0.8% gain on VideoMME). This behavior emerges from the Mamba module’s ability to effectively propagate temporal information across video frames, which allows redundant tokens to be discarded without compromising the model’s overall understanding.
The temporal pooling variant (STORM + T. Pooling) extends these benefits to long-context training by applying temporal average pooling after the Mamba layers, which allows the model to process 128-frame inputs while compressing the token count to match that of the 32-frame setting. This approach not only improves performance, achieving 63.4% (+3.3%) on VideoMME, 59.5% (+3.6%) on LongVideoBench, 72.5% (+2.3%) on MLVU, and 71.3% (+1.8%) on MVBench, but also significantly reduces inference latency by 58.4%. By combining this model with test-time temporal token sampling (STORM + T. Pooling + T. Sampling), we further reduce inference time by 65.5% and use only 12.5% of the visual tokens compared to VILA without sacrificing performance. We observe that this test-time temporal token sampling particularly benefits MLVU and LongVideoBench compared to MVBench and VideoMME. This differing impact across benchmarks likely stems from the nature of the underlying tasks. MLVU and LongVideoBench require global understanding across long videos, and we believe that test-time compression with the Mamba module better summarizes the essential contextual information. MVBench and VideoMME, on the other hand, require visual details from specific frames, and our pooling-only method with the Mamba module maintains more detailed frame information throughout the sequence. Section 5 includes a detailed discussion of retaining visual information under our token compression.
[tab:spatial_vs_temporal_compression] provides a comparison between spatial average pooling and temporal average pooling for 32-frame training and the 128-frame extension. All models use STORM as the base model. We see that spatial pooling is effective when 32 frames are used during training, outperforming temporal pooling on LongVideoBench while achieving on-par results on VideoMME. However, with 128-frame inputs, even though both methods use the same token budget, temporal pooling yields significantly better performance. In fact, spatial pooling cannot benefit from longer video inputs, resulting in degraded performance on both LongVideoBench and VideoMME, whereas temporal pooling achieves stronger performance on both benchmarks from the extended video length.
As illustrated in Figure 2, the Mamba-based temporal projector performs a spatiotemporal scan over input frames. This approach enables refinement of visual tokens before compression operations, thereby preserving key temporal and spatial cues—such as the positional information of frames being pooled—that would otherwise be lost in naive compression strategies. Consequently, the Mamba temporal module allows simple token compression methods to work more effectively for long videos compared to the baseline, as evidenced in [tab:mamba_compression] and further detailed in [tab:mamba_compression_full].
Our results demonstrate that STORM consistently improves with increased video input length during training, achieving substantial gains of +2.6% across all benchmarks when extending from 8 frames to 32 frames and an additional +2.5% when extending from 32 frames to 128 frames. In contrast, the baseline VILA model shows smaller improvements when extending from 8 frames to 32 frames and gains only +0.8% when fine-tuned on 128 frames. In fact, as shown in [tab:mamba_compression_full], the baseline model exhibits performance degradation on the MVBench and MLVU benchmarks when extending the video length from 32 frames to 128 frames. These results demonstrate the importance of the Mamba module in enabling both efficient and effective token compression, particularly for long-video inputs.
[tab:comp_abl] presents an ablation study comparing various token compression strategies for models trained on a fixed 32-frame input. Without any compression, our standard STORM delivers improved performance across benchmarks compared to the baseline VILA, with similar inference latency and number of visual tokens. When test-time temporal sampling is applied, the model maintains its performance while reducing the visual token count to 50% and lowering inference latency to 58.7% of the uncompressed STORM. Employing temporal average pooling during both training and testing further compresses the token count to 25% and cuts latency down to 42.7%, which is particularly effective for MVBench and MLVU but slightly degrades LongVideoBench and VideoMME. Similarly, spatial average pooling provides the same efficiency improvements and proves particularly effective on LongVideoBench, but with compromises on other benchmarks.
We further apply temporal token sampling together with either temporal or spatial pooling to improve test-time efficiency. For instance, STORM + T. Pooling + T. Sampling reduces the visual tokens to only 12.5% and the latency to 35.4% of the original STORM, while maintaining accuracy comparable to STORM + T. Pooling. STORM + S. Pooling + T. Sampling even achieves the best LongVideoBench performance under this strong token reduction. Finally, we combine temporal and spatial average pooling with and without test-time temporal token sampling, leading to even more aggressive compression. Interestingly, our STORM + T. Pooling + S. Pooling + T. Sampling variant, which has the strongest compression ratio, using only 3.13% of the visual tokens and 29.5% of the inference latency, already achieves competitive results. Compared to the baseline VILA, it retains 100% of the performance on VideoMME, 98.9% on LongVideoBench, 97.7% on MLVU, and 96.7% on MVBench, making it attractive for efficiency-oriented scenarios.
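As a sanity check, the token budgets quoted above compose multiplicatively (assuming 32 input frames at 256 tokens per frame for the uncompressed baseline):

```python
base = 32 * 256                       # 8192 visual tokens (100%, VILA / uncompressed STORM)
t_pool = base // 4                    # 4x temporal pooling          -> 2048 tokens (25%)
t_pool_samp = t_pool // 2             # + 2x test-time sampling      -> 1024 tokens (12.5%)
full_comp = base // (4 * 4 * 2)       # + 4x spatial pooling as well ->  256 tokens (~3.13%)
print(full_comp / base)               # 0.03125
```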
We note that, although the uncompressed STORM and its test-time sampling variant achieve the highest overall performance with the same 32-frame input, their computational demands limit scalability to training on longer sequences. In contrast, as demonstrated in [tab:main_results_compare_internel], STORM + T. Pooling allows extending the temporal context to 128 frames, enabling improved performance without requiring additional LLM computation.
Figure 4: Model Efficiency and Effectiveness on Long Video Inputs. (left) Profiling results of token compression as the number of frames increases during inference. (middle) Profiling results for 256 input frames with different compression ratios on a single A100. (right) The accuracy of Video-MME (without subtitles) across different numbers of frames during inference. While STORM with test-time temporal sampling showed consistent performance improvements, both VILA and STORM without token compression demonstrated decreased performance beyond 64 frames.
Figure 5: Qualitative Examples of STORM + T. Pooling. Our model effectively processes complex video content across various tasks requiring fine-grained temporal and visual understanding while reducing computational overhead through efficient token compression. The example videos can be found on our website.
We provide an analysis of model efficiency in Figure 4. In Figure 4 (left), we compare the inference latency of STORM with and without token compression across varying numbers of input frames. Due to the quadratic computation of the LLM, the latency gap between the two configurations widens significantly as the sequence length increases. Figure 4 (middle) provides a detailed breakdown of the inference latency of different modules when processing 256 frames. Without token compression, approximately 80% of the total latency is attributed to the LLM module. By applying a token compression ratio of 4, the latency associated with the LLM is reduced to roughly 35% of its original value, demonstrating a substantial improvement in efficiency.
Importantly, the temporal projector enables token compression without compromising performance. In fact, for some benchmarks, it even improves model performance while reducing computation cost, as demonstrated in [tab:main_results_compare_internel] and detailed in [tab:main_results_full]. Figure 4 (right) presents a detailed analysis using the temporal sampling method on the VideoMME benchmark. We compare three configurations: the VILA baseline, STORM without compression, and STORM with test-time temporal token sampling, using up to 128 input frames and a compression ratio of 2. The results indicate that: (1) the VILA baseline’s performance improves up to 32 frames but declines beyond this point; (2) STORM generalizes better to longer sequences, maintaining performance up to 64 frames before a slight drop; and (3) STORM with temporal sampling continues to improve up to 128 frames while providing a 50% token compression ratio, meaning STORM + T. Sampling at 128 frames uses the same number of visual tokens as the other two models at 64 frames.
We speculate that this phenomenon stems from a limitation of pre-trained LLMs, which, despite having large context lengths on paper, operate with a smaller effective context length in practice. This limitation hinders their performance on very long sequences. The combination of the temporal projector and temporal sampling ensures that the number of tokens fed into the LLM remains within its effective context length, while the enriched visual tokens retain comprehensive temporal information beyond the sampled frames.
The Mamba projector introduces minimal training overhead: even when no compression is applied, it adds only \(\sim\)5\(\%\) latency to full training (19.1 hours for VILA, without the Mamba module, versus 20.1 hours for STORM, with the Mamba module). More importantly, since the Mamba module enables effective token compression, it in fact provides significant training and inference speedups and performance gains (e.g., +3.6\(\%\) on MLVU with 4× fewer tokens), ultimately reducing training costs.
The qualitative results in Figure 5 demonstrate that STORM’s token compression preserves critical visual information while significantly reducing computational overhead. Even with a 4\(\times\) compression ratio, our model accurately extracts task-relevant information across diverse video understanding tasks. For example, questions like ‘Whom is the poem in the video written by?’, ‘How many times do news segments appear in this video?’, and ‘What is unique about the last performance?’ require fine-grained visual information and focus on specific content from the prompt. Detailed category-level analysis on VideoMME (Figure 7) also reveals that STORM with token compression consistently outperforms the baseline model across nearly all task categories. Even in OCR-heavy tasks that require detailed visual analysis, our compressed model maintains performance comparable to uncompressed versions while using only a fraction of the computational resources (only 25% of the visual tokens). These findings demonstrate that our Mamba-based temporal encoder enables effective token compression by encoding spatiotemporal relationships directly into visual tokens, allowing the model to maintain high-level understanding while drastically reducing the token count. Additional qualitative results are provided in Figures 10–13.
Our core design, a temporal module combined with token compression, is model-agnostic and can be integrated into various Video-LLM architectures. [tab:architecture] shows experiments on models using different LLMs, vision encoders, and model sizes. Our design shows consistent improvements in both performance and latency across all configurations, demonstrating its generality.
We introduced STORM, a novel Video-LLM architecture that enhances long-video understanding through a Mamba-based temporal encoder and efficient token reduction. By explicitly integrating spatiotemporal dynamics into visual tokens early in the pipeline, STORM enables significant token compression while preserving critical temporal information in the compressed inputs. Experiments demonstrate that STORM achieves new state-of-the-art results on long-video understanding benchmarks while substantially improving computational efficiency. Our work establishes a scalable approach to long video-language modeling that simultaneously achieves strong performance and computational efficiency.
We present comprehensive qualitative evaluations in Figures 9 to 13, which are organized into three subsections:
Effective Long Video Understanding: Demonstrating STORM’s ability to effectively utilize long video inputs by comparing it with existing long-video LLMs.
Importance of Long Video Context: Highlighting the need for long video inputs by showcasing scenarios where 128-frame inputs (with token compression) enable accurate predictions, whereas 32-frame inputs fail.
Showcase of Video Understanding Abilities: Illustrating STORM’s capabilities in various aspects such as OCR, spatial perception, temporal reasoning, and so on.
We compare our proposed STORM + Temporal Sampling with LongVILA and LongVU, both designed for long video understanding. We use a short film depicting a "moonfall disaster" from the VILA webpage 2. The models are prompted to provide a narrative description of the video. The short film was chosen for its engaging and dramatic storyline that spans various interconnected scenarios, all contributing to a cohesive narrative. Understanding this video requires the models to comprehend each individual scene and effectively integrate temporal events to grasp the complete story. Both STORM and LongVILA use 128 input frames, while the LongVU output was obtained from its online demonstration, which uses 1 fps input.
As shown in Figure 9, STORM delivers the most detailed and coherent summary of the video’s narrative, effectively capturing key events and transitions throughout the entire film. Its response showcases a comprehensive understanding of the content, highlighting its ability to connect temporal events across different scenes. In contrast, the baseline models LongVILA and LongVU focus on some of the events but fail to cover all critical moments that contribute to the overall storyline. Their responses also highlight specific scenes without integrating them into the full context. Moreover, we observed that the baseline models often generate redundant content, repeating the same sentences with minimal new information, which reveals their limitations in handling open-ended queries. Notably, our STORM with Temporal Sampling is also computationally more efficient: by applying temporal sampling, we reduce the number of tokens to the equivalent of processing 32 frames. This comparison showcases STORM’s superior ability to leverage long video inputs for in-depth visual understanding.
We further demonstrate the significance of incorporating long video context by providing qualitative examples where a 128-frame input yields more accurate predictions than a 32-frame input, as shown in Figure 10. Using samples from the VideoMME benchmark, we compare two configurations of our STORM: one with a 32-frame input without compression, and another with a 128-frame input employing a temporal sampling ratio of 4. In both settings, the number of tokens fed into the LLM remains the same; however, STORM with temporal sampling encodes additional information into the compressed tokens due to the extended frame sequence.
The inclusion of more frames allows the model to capture richer temporal dynamics and contextual information. For example, the 128-frame input enables the model to develop a stronger understanding of the video’s narrative (Figure 10, top). It also allows the 128-frame model to capture additional events that the 32-frame model misses (Figure 10, center). Finally, the additional information further improves the model’s ability to reason through different temporal events across the entire video to form a coherent understanding (Figure 10, bottom). These examples demonstrate the crucial role of long video context in tasks that require detailed temporal reasoning and comprehensive content understanding.
Finally, we conclude our qualitative evaluation by showcasing the diverse video understanding capabilities of STORM, including OCR, attribute perception, spatial perception, information synopsis, and temporal reasoning. Results are shown in Figures 11 to 13. We use the same setting of STORM + Temporal Sampling with 128-frame input and a sampling ratio of 4. Utilizing videos from the VideoMME benchmark, we designed a more challenging assessment to thoroughly evaluate the model’s proficiency. Instead of providing the model with multiple-choice questions accompanied by predefined answer options, we transformed these tasks into open-ended queries that require the model to generate answers in raw text form without any given choices. This modification significantly increases the task’s difficulty, as it demands a precise understanding of the content and the ability to accurately locate and extract specific information from the video input.
Our qualitative results demonstrate that STORM provides strong performance in these scenarios. Despite the increased complexity, the model effectively interprets intricate visual details, recognizes textual information within videos, and provides coherent summaries of temporal events. This showcases STORM’s robust ability to handle various aspects of video understanding.
[tab:main_results_full] extends [tab:comp_abl] in the main text by providing a comprehensive comparison of different compression method combinations across various token budgets during training. Overall, considering both compression ratio and inference latency, we find that STORM with temporal pooling (STORM + T. Pooling) is the most efficient and effective approach. Additionally, test-time temporal sampling offers a lossless way to further enhance efficiency at inference time.
Table 2 shows the VideoMME results for different video lengths: short videos are less than 2 minutes, medium videos up to 15 minutes, and long videos up to 60 minutes. Overall, our STORM with token compression outperforms the VILA baseline and STORM without token compression for all video lengths. Figure 7 compares the VideoMME results by task category. We find that STORM with temporal pooling especially improves accuracy on the object reasoning task, and STORM with test-time temporal sampling improves attribute perception accuracy. Both token compression methods improve temporal perception accuracy compared to VILA and STORM, indicating that the temporal perception task requires longer video context and that our token compression methods are effective for such tasks.
[tab:dataset_effect] shows how dataset composition affects model performance during 128-frame fine-tuning. We compare using the full LLaVA-Video dataset (\(\sim\)1.35M samples) versus only its longest 25% of videos with at least 128 frames (\(\sim\)360K samples). Interestingly, while STORM improves with the larger dataset across all benchmarks, the baseline model actually performs worse on several benchmarks when trained on the full dataset.
Two key differences between these datasets are size and video length distribution: the full dataset contains more data but mixes short and long videos, whereas the long-video subset consists exclusively of longer videos. Since larger, more diverse datasets typically improve performance (assuming similar data quality), we attribute the baseline model’s unexpected performance drop to its limited ability to generalize from shorter to longer videos. More specifically, when trained predominantly on shorter clips from the full dataset, the baseline overfits and cannot effectively handle long contexts at inference time. Training solely on the long-video subset better aligns with test conditions, partially addressing this limitation.
In contrast, STORM shows consistent performance gains across all benchmarks when trained on the larger and more diverse dataset. This suggests that STORM is more robust in handling longer sequences and is capable of using a wide range of video lengths to enhance its overall performance.
STORM is built on a standard multimodal pipeline but introduces key modifications for improved reasoning ability and token efficiency. Figure 6 illustrates the detailed composition of our models. Instead of an MLP projector, STORM uses a linear layer followed by a Mamba-based temporal module that integrates spatiotemporal information into visual tokens.
STORM incorporates three main components: (1) the Mamba-Based Temporal Projector captures and propagates spatiotemporal information within visual tokens; (2) the Temporal Token Compression Module applies compression along the temporal dimension using training-based average pooling and/or training-free sampling (applied only at test time); and (3) Spatial Token Compression further reduces the token count by performing training-based frame-level spatial average pooling. Both spatial and temporal compression methods—whether training-free or training-based—are independently applicable. Notably, spatial and temporal pooling can be applied in parallel after the Mamba module, while temporal sampling is performed separately at test time. These components enable STORM to process longer sequences more efficiently before passing them to the LLM. A sketch of how these components compose is given below.
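The sketch below shows, under assumed interfaces, how the components could compose at inference time; the callables, argument names, and stand-in components are illustrative rather than the actual implementation.

```python
import torch
import torch.nn.functional as F

def storm_forward(frames, vision_encoder, linear_downsample, temporal_projector, llm,
                  k_temporal=None, p_side=None, s_test_time=None):
    """Hedged sketch of the STORM inference path: per-frame encoding, Mamba-based
    temporal projection, optional compression, then the LLM."""
    tokens = torch.stack([linear_downsample(vision_encoder(f)) for f in frames])   # (T, N, D)
    tokens = temporal_projector(tokens)                  # (1) Mamba-based temporal projector
    T, N, D = tokens.shape
    if k_temporal:                                       # (2) temporal average pooling (training-based)
        tokens = tokens.reshape(T // k_temporal, k_temporal, N, D).mean(dim=1)
    if p_side:                                           # (3) spatial average pooling (training-based)
        side = int(N ** 0.5)
        grid = tokens.reshape(-1, side, side, D).permute(0, 3, 1, 2)
        tokens = F.avg_pool2d(grid, p_side).flatten(2).permute(0, 2, 1)
    if s_test_time:                                      # training-free test-time temporal sampling
        tokens = tokens[::s_test_time]
    return llm(tokens.flatten(0, 1))                     # flattened visual tokens passed to the LLM

# Example with stand-in components (identity encoders, 2-frame toy input).
frames = [torch.randn(16, 32) for _ in range(2)]          # 2 frames, 16 tokens, 32 dims
out = storm_forward(frames, lambda f: f, lambda f: f, lambda t: t, lambda v: v.shape,
                    k_temporal=2, p_side=2, s_test_time=1)
print(out)   # torch.Size([4, 32]): 1 pooled frame x 4 spatially pooled tokens
```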
Figure 6: Breaking Down the STORM Architecture. We begin with a standard multimodal pipeline that uses a pixel-shuffle downsampling layer and an MLP projector. In STORM, we replace the MLP with a linear layer and introduce our Mamba-based temporal module on top. Since the Mamba layers propagate spatiotemporal information into each visual token, the model can then perform temporal and spatial compression of these tokens before passing them to the LLM, allowing STORM to handle longer sequences more efficiently.
For all models, we begin with an alignment stage to align the multi-modal projector using the LLaVA-CC3M-Pretrain-595K dataset [1]. Following this, we proceed to the visual instruction fine-tuning stage, experimenting with two different training data mixtures. These SFT mixtures incorporate both image and video data, encompassing three task types: captioning, open-ended question answering, and multiple-choice question answering. Further details are provided in the following:
SFT Data: For most of our main experiments, we construct an expanded mixture by incorporating additional high-quality image datasets—such as Cambrian-1375K [38], Idefics2-SFT [39], and LLaVA-OneVision-Images-SFT [40]—along with video datasets including M4-Instruct-Video [41] and Youtube [42]. This enlarged dataset is used to scale up training and enhance overall performance. Detailed compositions of these mixtures are provided in Table 1.
Long Video SFT Data: In the long video fine-tuning stage, our goal is to adapt models initially trained on the full SFT data with 32-frame inputs to handle 128-frame inputs. Because processing 128-frame inputs incurs significant computational cost, we reduce training time by using a smaller dataset at this stage. Specifically, we use only the LLaVA-Video dataset [30], which amounts to roughly 11% of the full SFT dataset (approximately 1.35M video-text pairs).
Long Video 25% Data: As an ablation study to investigate the impact of dataset composition in the long video fine-tuning stage, we introduce an additional dataset derived from the LLaVA-Video dataset [30]. This subset consists exclusively of long videos with at least 128 frames, comprising approximately 25% of the full LLaVA-Video dataset (around 360K video-text pairs). Unlike the Long Video SFT Data, which includes both short and long videos, this dataset contains only long videos. Our experiments in Section 7.2 and [tab:dataset_effect] reveal distinct behaviors of our models and baselines with respect to dataset composition.
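Constructing this subset amounts to a simple filter over the source dataset; a hedged sketch (the field name num_frames is hypothetical) is:

```python
def select_long_videos(samples, min_frames: int = 128):
    """Keep only samples whose source video contains at least `min_frames` frames."""
    return [s for s in samples if s["num_frames"] >= min_frames]
```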
| Datasets |
|---|
| iVQA [43], MSRVTT-QA, STEM-QA [44], GQA-train [45], ChatQA [46], Geo170K [47], LRV-Instruction [48], RefCOCO-train [49], DocVQA [50], GeoQA [51], KVQA [52], Cambrian-1375K [38], M4-Instruct-Images [53], M4-Instruct-Video [41], WIT [54], Youtube [42], etc. |
| Models | # frames (train) | Short (<2 min) | Medium (4–15 min) | Long (30–60 min) | Avg. |
|---|---|---|---|---|---|
| Token Budget: 8K | | | | | |
| VILA Baseline | 32 | 73.0 | 58.0 | 49.2 | 60.1 |
| STORM | 32 | 75.6 | 60.9 | 51.1 | 62.5 |
| STORM + T. Sampling* | 32 | 75.2 | 60.8 | 53.2 | 63.1 |
| STORM + T. Pooling | 128 | 72.4 | 64.4 | 53.4 | 63.4 |
| STORM + T. Pooling + T. Sampling* | 128 | 72.9 | 60.9 | 53.4 | 62.4 |

\* 2× additional compression at test time.
Figure 7: VideoMME Results by Task Categories.
| # Frames | Compression Ratio | Overall (ms) | llm (ms) | vision_tower (ms) | mm_projector (ms) |
|---|---|---|---|---|---|
| 4 | 1 | 162.92 | 103.80 | 52.41 | 6.71 |
| 8 | 1 | 270.87 | 174.61 | 85.11 | 11.15 |
| 16 | 1 | 486.37 | 321.73 | 144.47 | 20.17 |
| 32 | 1 | 933.99 | 623.41 | 269.49 | 41.09 |
| 64 | 1 | 1910 | 1310 | 515.17 | 82.41 |
| 128 | 1 | 4270 | 3090 | 1020 | 163.34 |
| 256 | 1 | 10340 | 7960 | 2030 | 348.22 |
| 512 | 1 | 28620 | 23710 | 4090 | 811.31 |
| 32 | 4 | 486.97 | 175.75 | 269.96 | 41.26 |
| 64 | 4 | 920.10 | 322.22 | 515.82 | 82.06 |
| 128 | 4 | 1800 | 622.29 | 1010 | 163.23 |
| 256 | 4 | 3680 | 1310 | 2020 | 348.52 |
| 512 | 4 | 7950 | 3080 | 4060 | 811.84 |
| 64 | 8 | 772.18 | 175.27 | 514.59 | 82.32 |
| 128 | 8 | 1500 | 322.73 | 1020 | 163.09 |
| 256 | 8 | 3000 | 622.56 | 2030 | 348.71 |
| 512 | 8 | 6200 | 1310 | 4070 | 815.08 |
Figure 8: Latency Comparison: VILA vs STORM. The multi-modal projector in VILA is a 2-layer MLP, while it is the Mamba Temporal Module in STORM.
In this section, we compare the latencies of the vanilla VILA architecture and STORM across varying numbers of frames without token compression and provide a breakdown of the percentage contribution of the multi-modal projector. All experiments are conducted on a single NVIDIA DGX A100-80G. The results, shown in Figure 8, demonstrate that STORM incurs negligible overhead compared to the vanilla VILA architecture, with the introduced Mamba Temporal Module accounting for no more than 3% of the total latency.
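A minimal per-module timing sketch of the kind used to produce such breakdowns (using CUDA events on a single GPU) is shown below; it is illustrative rather than our exact profiling harness.

```python
import torch

def time_module(fn, *args, warmup: int = 3, iters: int = 10):
    """Average latency of `fn(*args)` in milliseconds, measured with CUDA events."""
    for _ in range(warmup):
        fn(*args)                     # warm up kernels and caches
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

# e.g., time_module(model.mm_projector, visual_tokens)  # hypothetical module handle
```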
Table 4 summarizes the number of frames used for inference. We evaluate all models with between 8 and 512 frames and select the frame count that yields the best overall accuracy for each task and setup.
| Models | MVBench | MLVU | LongVidBench | VideoMME |
|---|---|---|---|---|
| 8-frame-models | 8 | 64 | 32 | 64 |
| + Temporal Sampling | 16 | 256 | 256 | 128 |
| 32-frame-models | 16 | 64 | 64 | 64 |
| + Temporal Sampling | 32 | 256 | 256 | 128 |
| 32-frame-models + Temporal Pooling | 32 | 64 | 128 | 64 |
| + Temporal Sampling | 64 | 256 | 256 | 128 |
Figure 9: Effective Long Video Understanding. We compare STORM + Temporal Sampling with existing long video LLMs. Results show that STORM delivers a more detailed and coherent summary, effectively capturing key events and transitions throughout the film. The example videos can be found on our website.
Figure 10: Importance of Long Video Context. We compare STORM with a 32-frame input to STORM + Temporal Sampling using a 128-frame input. Both configurations have negligible differences in computational cost; however, the latter encodes additional information into compressed tokens due to the extended frame sequence. The examples illustrate that processing more frames allows the model to capture richer temporal dynamics and contextual information. This leads to a stronger understanding of the video’s narrative, reduces information loss, and enhances the ability to reason through temporal events across the entire video. The example videos can be found on our website.
Figure 11: Showcase of Video Understanding Abilities in Various Task Categories. We provide additional examples to showcase the model’s video understanding capabilities in different aspects. This is done by providing the models with open-ended queries that require the model to generate answers in raw text form without any given choices. Part 1. The example videos can be found on our website.
Figure 12: Showcase of Video Understanding Abilities in Various Task Categories. Part 2. The example videos can be found on our website.
Figure 13: Showcase of Video Understanding Abilities in Various Task Categories. Part 3. The example videos can be found on our website.
https://vila.mit.edu/↩︎