April 01, 2024
Recently, Large Multimodal Models (LMMs) have made significant progress in video question-answering using a frame-wise approach, leveraging large-scale, image-based pretraining in a zero-shot manner. While image-based methods for videos have shown impressive performance, a current limitation is that they often overlook how key timestamps are selected and cannot adjust when incorrect timestamps are identified. Moreover, they are unable to extract details relevant to the question, instead providing general descriptions of each frame. To overcome this, we design a multi-LMM agent framework that travels along the video, iteratively collecting relevant information from keyframes through interactive question-asking until there is sufficient information to answer the question. Specifically, we propose TraveLER, a framework that creates a plan to “Traverse” through the video, asks questions about individual frames to “Locate” and store key information, and then “Evaluates” whether there is enough information to answer the question. Finally, if there is not enough information, our method is able to “Replan” based on its collected knowledge. Through extensive experiments, we find that the proposed TraveLER approach improves performance on several video question-answering benchmarks, such as NExT-QA, STAR, and Perception Test, without the need to fine-tune on specific datasets.
Figure 1: A simplified overview of our TraveLER framework. Our proposed framework aims to answer the question by collecting relevant information from keyframes through interactive question-asking. To accomplish this, several agents (in colored boxes) with different roles interact (left-to-right in each row) over several iterations. TraveLER creates a plan (in blue) to “traverse” (in orange) through the video, asks questions regarding individual frames (in yellow) to “locate” and store key information, “evaluates” whether there is sufficient information to answer the question (in green), and “replans” using past collected knowledge if there is not enough information.
Over the last few years, Large Multimodal Models (LMMs) have demonstrated tremendous progress in the area of video understanding, particularly in the video question-answering (VideoQA) domain [1], [2]. More recently, LMMs have been able to achieve impressive results through video-based models [3]–[6]. However, video models are computationally expensive to fine-tune, and video annotations are difficult and expensive to collect. As a result, many recent approaches [7]–[9] operate on a frame level, leveraging large-scale image-based pretraining in a zero-shot setting.
Despite the effectiveness of image-based LMMs for image tasks, applying them to VideoQA is challenging since using all frames results in high computational demands and redundancy. Thus, many works try to select subsets of frames, either through uniform sampling [9] or keyframe selection [8]. However, uniform sampling may skip important information, while keyframe selection methods might select the wrong frames and mislead the model. To address this, we introduce a novel video traversal approach using a “Planner,” an LLM agent that creates a plan to find and extract key information.
Next, we wish to ensure we can capture correct and detailed information from frames when executing the plan. Yet, the common captioning approach provides general descriptions of the frame, whereas answering questions often requires more specific details. Moreover, not all elements of the frame are relevant to the question, and some may even be misleading. As a result, we propose an interactive question-answering process using a “Locator” to locate and extract the most relevant and fine-grained details from each frame. Specifically, we use two agents: an LLM that asks questions about the frame and an LMM that answers them.
Nevertheless, it can be difficult for the model to collect all necessary details in a single pass, and extracting incorrect information may be misleading. Hence, we introduce an iterative approach using an “Evaluator” that reviews the collected information after each iteration and evaluates whether there is enough information to answer the question. If there is, the answer is selected; otherwise, the new information is used to “Replan” and begin a new iteration.
Consider the example in Figure 1. Suppose we are asked “why the boy turned over in the middle of the video”. In the first iteration, our method uses temporal cues from the question to skip to the middle of the video and asks questions to find the relevant frames. In the next iteration, we gather more information. Asking about what the boy is doing, we learn that he is “standing up at the bottom of the slide” and is not looking at anything specific, which informs us that the boy is no longer “sitting down” (choice B) or “resting on the yellow object” (choice D). To eliminate these choices, we must confirm that the boy does not sit back down again by traveling to a timestamp near the end of the video. Finally, since we have collected enough information and followed the plans, we can select the right choice: the boy turns over to be on his stomach “to get down on slide” (choice E).
Our proposed approach, Traverse, Locate, Evaluate, and Replan (TraveLER), is a modular, multi-LMM agent framework for video question answering. Our framework is composed of four main stages, each with LLM or LMM “agents” that interact with each other across the different stages. First, in the Traversal stage (“traverse”), an agent creates a plan to answer the question. In the Locator stage (“locate”), an agent uses the plan to decide which timestamp of the video to select. The corresponding frames are then sent to another agent, which asks questions and stores the answers in a memory bank for future iterations. Finally, in the Evaluator stage (“evaluate”), an agent reviews all collected information and decides whether to answer or, if necessary, create a modified plan (“replan”) to start the next iteration.
To summarize, our main contributions are as follows: (i) We introduce TraveLER, a modular multi-LMM agent framework for video question-answering. (ii) Our proposed TraveLER method does not require task-specific fine-tuning or video annotations, and is easy to use with several different LLMs or LMMs. (iii) Our method shows improved performance on multiple difficult video question-answering benchmarks such as NExT-QA, Perception Test, and STAR, highlighting the effectiveness of our approach.
Video question-answering (VideoQA) involves answering free-form or multiple-choice questions given an input video. In comparison to image question answering, VideoQA poses unique challenges because it often requires strong temporal understanding and the ability to deal with long input sequences. Many recent works have focused on training end-to-end video-language models [1], [2], [4]–[6], [10], but doing so remains challenging due to computational constraints and difficulties in architecture scaling. As a result, many approaches adapt pretrained image models to the video domain by extracting information independently from each frame [7]–[9]. In this work, we design a framework that builds an adaptive plan to traverse through the video to identify keyframes and extract relevant information using a question-answering approach.
LMMs have been shown to be extremely useful for VideoQA. Some methods use supervised or contrastive training to perform video-LMM pretraining [11]–[13], while others adapt existing LMMs to the video domain using instruction tuning [3], [14], [15]. However, recent improvements in LMM capabilities have enabled many strong approaches for few-shot [16], [17] and zero-shot VideoQA [17], [18]. In particular, zero-shot methods such as LLoVi [9] use pre-trained LMMs to generate captions for each frame in the video. Nevertheless, sampling frames uniformly may cause the model to miss important visual information and focus on unimportant frames [19], [20]. Recent works such as SeViLA [8] address this problem by performing parameter-efficient fine-tuning using captions to identify keyframes [21]–[23], but this requires fine-tuning on specific datasets. In contrast to these works, which select all keyframes in a single pass, we introduce a zero-shot, iterative method that repeatedly gathers data from various timestamps until enough information is collected to correctly answer the question.
The strong reasoning abilities of LLMs [24], [25] have made them effective in LLM-based agent approaches for videos, where an LLM performs much of the reasoning after collecting information from different modules [26]–[29]. For example, Socratic Models [29] proposes a method to reason about videos based on generated audio transcriptions and CLIP frame similarity scores, while other works like VideoChatCaptioner [26] propose a way to caption videos through chat dialogues between an LLM and an LMM. Unlike these works, our method utilizes a novel video traversal approach and an iterative, planning-based information gathering process.
Figure 2: TraveLER framework. Our framework consists of four different modules: the Planner, Retriever, Extractor, and Evaluator. The Planner creates a plan and sends it to the Retriever. The Retriever uses the plan to select the next timestamp and sends this to the Extractor. The Extractor captions and generates questions about the timestamp, answers the questions, and saves the output in the memory bank. Finally, the Evaluator determines if there is enough information and if the plan has been followed. If so, the Evaluator returns the answer; otherwise, the existing information is sent back to the Planner to begin a new iteration.
To design a robust approach that can find the correct keyframes and extract the most relevant information for VideoQA, we propose a modular LMM agent framework that Traverses, Locates, Evaluates, and Replans iteratively (TraveLER). We begin by describing the general LLM and LMM architectures (Section 3.1), then introduce each component of our pipeline (Section 3.2), and implementation details (Section 3.3). Our method is illustrated in Figure 2.
LLMs are text-conditioned generative models. Given a prompt \(P_\text{in}\), an LLM encodes it with a fixed language embedding \(l\) and passes the result through the model \(f(\cdot)\) to produce a text response \(R\): \(R = f(l(P_\text{in}))\).

Similarly, Large Multimodal Models (LMMs) jointly reason over vision and language modalities. To map the different modalities into the shared space of \(f(\cdot)\), the image \(I\) is encoded using a trainable visual encoder \(v\), and the prompt \(P_{\text{in}}\) is encoded using the fixed language embedding \(l\). The LMM then outputs a text response \(R\): \(R = f(v(I), l(P_{\text{in}}))\).
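To make these interfaces concrete, the per-stage sketches in the following subsections assume a minimal, hypothetical wrapper around any LLM or LMM backend. The `generate` method names here are illustrative placeholders, not a specific library API:

```python
from typing import Protocol

class LLM(Protocol):
    """Text-only agent: R = f(l(P_in))."""
    def generate(self, prompt: str) -> str: ...

class LMM(Protocol):
    """Vision-language agent: R = f(v(I), l(P_in))."""
    def generate(self, image: bytes, prompt: str) -> str: ...
```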
VideoQA involves viewing a video and answering questions about it. The model is usually evaluated through top-1 accuracy, in which it chooses the best answer out of a set of possible choices. Specifically, given a question \(Q\), a video input \(V\) consisting of a set of frames \(\{f_1, \cdots, f_n\}\), and a set of choices \(C = \{c_1, \cdots, c_k\}\), the model is asked to choose the best \(c_i \in C\) to answer \(Q\). Next, we introduce each component of our method.
In the Traversal stage, we create a plan for how to traverse through the video: a list of textual instructions that guides our approach to answering the question. To achieve this, we use the task prompt \(P_T\), an instruction to create a plan for answering the question. We combine \(P_T\) with the question \(Q\) and the memory bank \(M\), a dictionary of collected information keyed by frame timestamps, to obtain the final prompt \(P_\mathrm{T}^{(1)}\): \(P_\mathrm{T}^{(1)} = ``[Q][M][P_T]"\).
Our method uses a memory bank \(M\) to store collected information, which allows information to persist and to be updated as we proceed through different iterations. We initialize \(M\) with captions of 5 evenly sampled frames throughout the video. We find that this memory initialization gives the model good context for the general idea of the video, and performs better than starting with an empty memory \(M\). After the first iteration, we add information iteratively using the Extractor module, which will be discussed later.
Next, we input the prompt \(P_\mathrm{T}^{(1)}\) into \(\text{LLM}_{\text{planner}}\), which returns response \(R_T\), a step-by-step plan on how to traverse through the video and what information the model needs to collect.
\[R_T = \text{LLM}_{\text{planner}}(l(P_\mathrm{T}^{(1)}))\]
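As a schematic sketch of this Traversal stage (the prompt wording is abbreviated from our actual \(P_T\) in Section 7, and the `explanation` argument carries the Evaluator feedback \(E\) described later):

```python
def traverse(llm: "LLM", question: str, memory: dict,
             explanation: str = "") -> str:
    """Build P_T^(1) from Q and M, then ask the Planner for a plan R_T."""
    p_t = ("Create the best plan to gather information to answer the question. "
           "Return your plan as a numbered list, no longer than 5 steps.")
    prompt = (f"QUESTION: {question}\nINFORMATION: {memory}\n"
              f"EXPLANATION: {explanation}\n{p_t}")
    return llm.generate(prompt)  # R_T: a step-by-step traversal plan
```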
Our next step is to use the plan \(R_T\) in the Locator stage to locate keyframes and extract the information that we will use to answer the question.
The Locator is a component that consists of two submodules, the Retriever and the Extractor. The Retriever selects the timestamps of the next frames to view, while the Extractor extracts relevant information from these frames, using a question-answering process. Next, we discuss each component in more detail.
(i) Retriever: The Retriever carries out the given plan \(R_T\) by selecting which frames to view next. The Retriever is an LLM-based submodule whose goal is to use the collected information \(M\) to find the next timestamp \(t\) to select in order to fulfill the plan \(R_T\). The task prompt \(P_R\) is an instruction that contains information about the video length and asks which timestamp to view next. Thus, we insert the question \(Q\), plan \(R_T\), and collected information \(M\) into the task prompt \(P_R\) to create the new prompt \(P_\mathrm{R}^{(1)}\): \(P_\mathrm{R}^{(1)} = ``[P_R][Q][R_T][M]"\).
Given prompt \(P_\mathrm{R}^{(1)}\), the LLM in the Retriever, \(\text{LLM}_{\text{retriever}}\), returns \(t\), or the next set of timestamps. The module then retrieves frames \(I_t\) at timestamp \(t\).
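A corresponding sketch of the Retriever (again with an abbreviated \(P_R\); the real prompt also enforces a minimum gap from the currently viewed timestamp, and parsing failures would need retry logic):

```python
def retrieve(llm: "LLM", question: str, plan: str, memory: dict,
             video_len: float) -> float:
    """Build P_R^(1) = "[P_R][Q][R_T][M]" and parse the next timestamp t."""
    p_r = (f"You are given information about a {video_len} second video. "
           "Choose the timestamp, in seconds, of the next frame to view. "
           "Return a single Python float and nothing else.")
    prompt = f"{p_r}\nQUESTION: {question}\nPLAN: {plan}\nINFORMATION: {memory}"
    return float(llm.generate(prompt))
```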
(ii) Extractor: The Extractor is a significant part of our method because it allows us to capture more relevant and question-specific details from the visual input, unlike using only captions. We pass the frames \(I_t\) selected by the Retriever into the Extractor submodule, which consists of two large models: an LLM, \(\text{LLM}_{\text{extractor}}\), to generate context-dependent questions about the frames \(I_t\), and a different vision-language LMM, \(\text{LMM}_{\text{extractor}}\), whose job is to extract the desired information from the same frames.
In this module, we first generate a general caption \(c_t\) for frame \(I_t\) using \(\text{LMM}_{\text{extractor}}\). Then, we concatenate the caption \(c_t\), the plan \(R_T\), the memory \(M\), and the Extractor task prompt \(P_E\), an instruction to use the available information to create 3 questions about the current frame. This results in the new prompt \(P_\mathrm{E}^{(1)}\): \(P_\mathrm{E}^{(1)} = ``[P_E][c_t][R_T][M]"\).
Next, we input this new prompt \(P_\mathrm{E}^{(1)}\) into the LLM to get a set of questions \(\{q_1, q_2, \cdots, q_n\}\), where \(n\) is a parameter controlling how many questions are asked per frame.
\[\{q_1, q_2, \cdots, q_n\} = \text{LLM}_{\text{extractor}}(l(P_\mathrm{E}^{(1)}))\]
where \(l\) is the fixed language embedding.
In this way, the generated questions take into account both the plan \(R_T\) and information \(M\) from past and future frames of the video. We then use the frame \(I_t\) and the corresponding questions \(\{q_1, q_2, \cdots, q_n\}\) as input to \(\text{LMM}_{\text{extractor}}\), which outputs a set of answers \(\{a_1, a_2, \cdots, a_n\}\), where each answer \(a_i\) corresponds to the question \(q_i\).
\[\{a_1, a_2, \cdots, a_n\} = \text{LMM}_{\text{extractor}}(v(I_t), l(\{q_1, q_2, \cdots, q_n\}))\]
where \(v\) is the visual encoder.
Finally, to use this collected information in future iterations, we update our memory bank \(M\). To do this, we use the timestamp \(t\) of \(I_t\) as our key and the question-answer pair list as the value, and append this to our memory \(M\).
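Putting the pieces of the Extractor together, one extraction step (caption, question generation, question answering, memory update) might look like the following sketch. The prompt text is abbreviated, and `ast.literal_eval` assumes the LLM returns a well-formed Python list, as our actual prompt requests:

```python
import ast

def extract(llm: "LLM", lmm: "LMM", frame: bytes, t: float,
            plan: str, memory: dict, n_questions: int = 3) -> None:
    """One Extractor step: caption I_t, generate and answer questions, update M."""
    caption = lmm.generate(frame, "Briefly describe this frame.")       # c_t
    p_e = (f"Form up to {n_questions} questions about this frame to help answer "
           "the question. Return a Python list of strings.")
    prompt = f"{p_e}\nCAPTION: {caption}\nPLAN: {plan}\nINFORMATION: {memory}"
    questions = ast.literal_eval(llm.generate(prompt))                  # q_1..q_n
    qa_pairs = [(q, lmm.generate(frame, q)) for q in questions]         # a_1..a_n
    memory[t] = {"caption": caption, "qa": qa_pairs}                    # append to M
```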
If the memory bank dictionary \(M\) grows too long, we summarize it by passing it to another LLM and instructing it to make the memory bank entries more concise while retaining the same keys and format. This output becomes our new memory bank.
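A sketch of this summarization step follows; the character-length threshold is an illustrative placeholder, and `ast.literal_eval` assumes the LLM preserves the dictionary format as instructed:

```python
import ast

def maybe_summarize(llm: "LLM", memory: dict, max_chars: int = 4000) -> dict:
    """Condense the memory bank M when it grows too long, keeping keys and format."""
    if len(str(memory)) <= max_chars:  # assumed length threshold
        return memory
    prompt = ("Make the entries of this memory bank more concise, while keeping "
              f"the same timestamp keys and dictionary format:\n{memory}")
    return ast.literal_eval(llm.generate(prompt))
```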
The Evaluator decides if there is enough information and determines if the plan has been followed. We concatenate the memory information \(M\), the plan \(R_T\), the question \(Q\), and the choices \(C\) with the task prompt \(P_V\), an instruction to evaluate whether there is enough information to answer the question and whether the given plan \(R_T\) has been fulfilled. Thus, we get the new prompt \(P_\mathrm{V}^{(1)}\): \(P_\mathrm{V}^{(1)} =``[P_\mathrm{V}][Q][C][R_T][M]"\).
We use this prompt \(P_\mathrm{V}^{(1)}\) as input into the LLM in the Evaluator, \(\text{LLM}_{\text{evaluator}}\), which evaluates if there is enough information to answer the question and if the plan has been completely followed. If both are true, \(\text{LLM}_{\text{evaluator}}\) outputs the best choice \(c^*\) to answer the question \(Q\). Otherwise, it provides an explanation \(E\) of why there is not enough information and gives this explanation to the Planner to start a new iteration of the process.
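A sketch of the Evaluator, whose answer parsing mirrors the "Final Answer:" convention of our prompt \(P_V\) in Section 7:

```python
def evaluate(llm: "LLM", question: str, choices: list, plan: str, memory: dict):
    """Build P_V^(1) = "[P_V][Q][C][R_T][M]"; return (choice index or None, explanation E)."""
    p_v = ("Evaluate if there is enough information to answer the question and if "
           "the plan has been completed. Give a brief explanation, then the choice "
           "index (or None) after 'Final Answer:'.")
    prompt = (f"{p_v}\nQUESTION: {question}\nCHOICES: {choices}\n"
              f"PLAN: {plan}\nINFORMATION: {memory}")
    response = llm.generate(prompt)
    explanation, _, verdict = response.rpartition("Final Answer:")
    verdict = verdict.strip()
    answer = int(verdict) if verdict.isdigit() else None  # "None" -> no answer yet
    return answer, explanation.strip()
```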
After each iteration, if the Evaluator decides that there is not enough information to answer the question \(Q\) or that the plan \(R_T\) has not been completed, the existing memory \(M\) is provided to the Planner in the next iteration, together with an explanation \(E\) of why an answer was not chosen. The Planner then outputs a new plan, restarting the process. We also implement a limit on the number of iterations per question to prevent infinite loops. Once this limit is reached, we force the Evaluator to choose the best choice.
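Tying the stages together, the overall iterative loop might look as follows, reusing the per-stage sketches above. Here `init_memory`, `force_answer`, and the `video` object's `frame_at`/`duration` are assumed helpers, not part of a released API:

```python
def traveler(llm: "LLM", lmm: "LMM", video, question: str, choices: list,
             max_iters: int = 10) -> int:
    memory = init_memory(lmm, video)                  # captions of 5 uniform frames
    explanation = ""
    for _ in range(max_iters):                        # iteration limit avoids loops
        plan = traverse(llm, question, memory, explanation)           # Traverse / Replan
        t = retrieve(llm, question, plan, memory, video.duration)     # Locate
        extract(llm, lmm, video.frame_at(t), t, plan, memory)         # Extract
        memory = maybe_summarize(llm, memory)
        answer, explanation = evaluate(llm, question, choices, plan, memory)
        if answer is not None:                        # enough info: stop and answer
            return answer
    return force_answer(llm, question, choices, memory)  # limit reached: pick best
```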
Here, we discuss how we implement various components of our framework. The code will be released upon acceptance. More implementation details, such as prompts and dataset-specific details are in the Supplementary in Section 7.
We represent past collected information as a Python dictionary, with the timestamp of different frames as keys and a list of extracted information from the frame as the values. This extracted information consists of a brief caption of the frame and a list of question-answer pairs. To prevent the memory bank from becoming too large, we also implement a summarizer module that instructs an LLM to summarize the memory bank and return a more concise version in the same dictionary format as before.
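For illustration, a hypothetical memory bank entry after a few iterations might look like:

```python
memory_bank = {
    18.0: {
        "caption": "A boy stands at the bottom of a yellow slide.",
        "qa": [
            ("What is the boy doing?",
             "He is standing up at the bottom of the slide."),
            ("Is the boy looking at anything specific?", "No."),
        ],
    },
    # ... one entry per visited timestamp
}
```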
Our modular approach has the benefit of allowing us to easily swap in different LLMs and LMMs (see Section 4.4). For our main experiments, we use LLaVA-1.6 for \(\text{LMM}_{\text{extractor}}\) and GPT-4 for \(\text{LLM}_{\text{planner}}\), \(\text{LLM}_{\text{retriever}}\), \(\text{LLM}_{\text{extractor}}\), and \(\text{LLM}_{\text{evaluator}}\).
We also allow the Retriever to select multiple frames instead of a single frame. This helps the model better capture events that happen quickly or require more context to recognize. For example, if we want to find the action of “a woman clapping her hands”, single-frame selection may cause us to incorrectly assume the woman is not clapping if we view a frame where her hands are apart. We do this by creating an optional parameter called window size, which refers to the number of frames the Retriever extracts each time. When the window size is non-zero, the Retriever still specifies a single timestamp to go to, but when retrieving the frame at that timestamp, we also take the number of frames specified by the window size before and after the selected frame.
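A sketch of this windowed retrieval; the spacing `step` between neighboring frames is an illustrative choice, and `frame_at` is the assumed helper from above:

```python
def frames_in_window(video, t: float, window: int, step: float = 0.5):
    """Retrieve the frame at timestamp t plus `window` frames before and after it."""
    offsets = range(-window, window + 1)
    times = [min(max(t + k * step, 0.0), video.duration) for k in offsets]
    return [video.frame_at(s) for s in times]
```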
| Model | Cau. (%) | Tem. (%) | Des. (%) | Avg. (%) |
|---|---|---|---|---|
| MC-ViT-B [30] | - | - | - | 60.6 |
| SeViLA [8] | 61.3 | 61.5 | 75.6 | 63.6 |
| MC-ViT-L [30] | - | - | - | 65.0 |
| InternVideo [2] | 43.4 | 48.0 | 65.1 | 49.1 |
| VFC [31] | 45.4 | 51.6 | 64.1 | 51.5 |
| ViperGPT [32] | - | - | - | 60.0 |
| BLIP-2concat [8] | 59.7 | 60.8 | 73.8 | 62.4 |
| BLIP-2voting [8] | 59.1 | 61.3 | 74.9 | 62.7 |
| ProViQ [33] | - | - | - | 63.8 |
| LLoVi [9] | 69.5 | 61.0 | 75.6 | 67.7 |
| TraveLER (ours) | 70.0 (+0.5) | 60.5 | 78.2 (+2.6) | 68.2 (+0.5) |

Table 1: Results on the NExT-QA validation set. Cau., Tem., and Des. denote the Causal, Temporal, and Descriptive question splits.
We evaluated our TraveLER framework on several benchmarks described in Section 4.1, and compared it to multiple baselines in Section 4.2. The results and ablations are in Section 4.3 and Section 4.4. Additional results and ablations are in the Supplementary in Section 6.
We use the following datasets: (1) NExT-QA [34] is a dataset that tests causal action reasoning and temporal understanding, with questions categorized into Temporal, Causal, and Descriptive types. NExT-QA requires going beyond simple recognition of objects and actions to answer the questions correctly. Following prior work, we evaluate our method on the 5,000 questions in the NExT-QA validation set. (2) Perception Test [35] is a dataset that focuses on skills such as memory, abstraction, physics, and semantics, and is intended to be approached in a few-shot or zero-shot manner. The dataset consists of 11.6K real-world videos and 38K multiple-choice questions. (3) STAR [36] is a dataset that tests reasoning in real-world video situations. It consists of 22K video clips with 60K situated reasoning questions. Questions are broadly divided into 4 main categories: interaction, sequence, prediction, and feasibility.
In our experiments, we compare our method to recent state-of-the-art zero-shot (ZS) methods, such as LLoVi [9] and ProViQ [33], and to methods that are not strictly ZS, such as SeViLA [8] and MC-ViT [30]. We note that SeViLA uses components fine-tuned on QV-Highlights [37], while the MC-ViT model evaluated on Perception Test is fine-tuned on NExT-QA.
Our results are shown in Table 1, Table [tab:star], and Table [tab:perception]. First, we see that on NExT-QA, our method outperforms LLoVi using GPT-4, which is the current state-of-the-art that uniformly captions frames across the entire video. Interestingly, our method demonstrates superior performance in comparison to LLoVi despite viewing 50% fewer frames on average, highlighting the effectiveness of our approach. Second, we also outperform SeViLA by +4.6%, although SeViLA uses a keyframe selector that is fine-tuned on a video moment retrieval/grounding task. Lastly, we see comparable performance to BLIP-2voting (60.5% vs 61.3%) on the Temporal split, even though BLIP-2voting views every single frame.
For Perception Test and STAR, we use GPT-3.5 because it is much cheaper than GPT-4, but results would likely improve further with GPT-4. Nevertheless, we achieve higher accuracy than LongViViT on Perception Test by +4.5% and than MC-ViT by +2.1%, although the latter was fine-tuned on NExT-QA. On STAR, we surpass both the best zero-shot approach by +2.7% and the best fine-tuned result by +0.3%.
Finally, please refer to Section 9 for visualizations. For example, in Figure 7, we see that question-answering extracts more relevant details than simple captioning. This may explain why our method significantly outperforms prior work on the descriptive split of NExT-QA. For Perception Test, the questions require a high-level understanding of the video as well as fine-grained details. In Figure 6, we see that our method is able to reason about vague references and correctly identify relevant objects through question-answering.
| LLM | LMM | Accuracy |
|---|---|---|
| GPT-3.5 | LLaVA-1.6 | 60.4 |
| GPT-4 | LLaVA-1.6 | 65.8 |
| GPT-3.5 | BLIP-2 | 52.7 |
| GPT-3.5 | GPT-4V | 59.5 |
| GPT-3.5 | LLaVA-1.6 | 60.4 |

Table [tab:agents]: Ablation of LLM and LMM agent choices.
| # Frames | Accuracy |
|---|---|
| 1 | 59.0 |
| 3 | 57.9 |
| 5 | 60.4 |
| 7 | 59.0 |

Table [tab:retriever]: Ablation of the number of frames retrieved per timestamp (window size).
| # Questions | Accuracy |
|---|---|
| 0 | 58.1 |
| 1 | 58.4 |
| 3 | 60.4 |
| 5 | 59.6 |

Table [tab:questions]: Ablation of the number of questions asked per frame.
We perform comprehensive ablations using 1000 randomly selected questions from the NExT-QA training set to understand the impact of each component. Unless otherwise specified, we use GPT-3.5 as the LLM and LLaVA-1.6 as the LMM for all agents.
In our method, we use LLMs and LMMs as the main agents. To understand how the choice of LLM and LMM affects our framework’s performance, we swap different LLMs and LMMs into our framework (see Table [tab:agents]). First, we examine the performance of our framework using different LLMs while fixing the LMM. For this, we use LLaVA-1.6 as our LMM, and we find that GPT-4 performs best in our framework, by a significant margin of +5.4% over GPT-3.5. We hypothesize that stronger LLMs perform better in our framework because the LLM components are required to perform advanced reasoning tasks. Second, we measure the performance of our framework using different LMMs for \(\text{LMM}_{\text{extractor}}\) while fixing the LLM to GPT-3.5. We find that LLaVA-1.6 performs best, GPT-4V is slightly worse (-0.9%), and BLIP-2 is significantly worse (-7.7%).
The Planner module outputs a plan: a list of instructions that guides the behavior of all other modules. To evaluate the effectiveness of this module, we remove it from our framework. We find that removing the Planner leads to a drop in accuracy of -2.3%. This may be due to several reasons. First, the Planner provides many temporal cues that guide the Retriever module’s search, such as “go to the middle of the video”; without these cues, the Retriever is worse at selecting the next timestamp. Moreover, the Planner also helps the Evaluator decide when to stop, since in our iterative approach the Evaluator uses the plan to determine whether enough information has been collected.
The Retriever module determines the next timestamps to view, which helps focus our traversal and information collection. To evaluate the effectiveness of the Retriever, we uniformly sample frames from the video at 2-second intervals, similar to other methods, such as LLoVi [9]. We find that this leads to a reduction in overall accuracy by -3.5%. Our results suggest that compared to uniform sampling, the Retriever allows us to capture frames that might have otherwise been skipped. This also indicates that our Retriever selects fewer unimportant frames that might mislead the model.
Question asking is a key component of our model as it allows us to capture more fine-grained and question-relevant information in comparison to simple caption generation, which produces a generic description. As such, we ablate the Extractor by only allowing the LMM to caption frames. We find that this decreases performance by -2.2%, suggesting that the ability to ask specific questions about a frame is important. We notice that many generated captions capture the main idea of visual information in the frame, but often miss fine-grained details that are useful for answering the question.
To evaluate the impact of window size, the number of frames the Retriever extracts centered around the selected frame, we experiment with multiple-frame retrieval. This allows the model to better capture actions that occur quickly or require more context to understand. As shown in Table [tab:retriever], we find that choosing 5 frames yields the best results, a +1.4% increase over selecting a single frame, while viewing more than 5 decreases performance. This suggests that retrieving multiple frames helps the model capture relevant information, but retrieving too many frames can introduce too much information, resulting in performance drops.
Question answering allows us to extract more specific details from our visual inputs. However, we noticed that asking too many questions can yield irrelevant questions and false positives. As such, we experiment with modifying the number of questions our Extractor asks for each frame (see Table [tab:questions]). We record results for a 5-question, 3-question, and 1-question maximum. Note that asking 0 questions is equivalent to only allowing captions, which is discussed in the Extractor ablation. From our results, we find that a 3-question limit yields the best results compared to asking 1 or 5 questions (+2.0%/+0.8%). This suggests that asking questions helps extract relevant information, but too many questions can lead to false positives or too much irrelevant information.
Our memory bank \(M\) is a critical part of our framework since it stores information that all modules rely on to evaluate inputs and make decisions. First, we experiment with different initializations as \(M\) must be initialized in the first iteration. We experiment with initialization of 1, 3, and 5 uniformly sampled captions. We find that using 5 evenly spaced frames yields the best results, possibly because it provides the model with a general overview of the video at the start before starting to collect more relevant information. Next, we experiment with changing the format of the memory bank, as there are many different ways to represent information. We try a markdown table instead of our dictionary format. Our results show that this performs worse by -2.9%, suggesting the effectiveness of our original dictionary formatting.
When collecting large amounts of information from videos, we use a Summarizer to condense the information, since long inputs can be challenging for LLMs. This has also been observed in recent work [9]. To understand the impact of the Summarizer, we remove it. The results indicate that removing it degrades performance by -3.2%, demonstrating the advantage of more concise information.
Our TraveLER framework demonstrates the significant potential of an LMM image-based approach for VideoQA. We introduce a multi-LMM agent framework that travels along the video, collecting relevant information from keyframes through interactive question-asking. Our method creates a plan to “traverse” through the video, asks questions about individual frames to “locate” and store key information, and then “evaluates” whether there is enough information to answer the question. Finally, if there is not enough information, our model is able to “replan” according to its collected knowledge. However, there are a few limitations to our work. First, our framework depends on the strength of the underlying LLM and LMM. We notice that false positives and incorrect statements from the LMM can impact performance. We also found that our method has a high runtime with slower LLMs, since each iteration requires the LLM to generate significant amounts of text. We believe that with better and faster LLMs and LMMs in the future, these issues can be overcome. Finally, we hope our research encourages future work on using large models for modular video approaches.
We would like to thank Suzie Petryk, Chancharik Mitra, Alon Mendelson, and David Chan for helpful feedback and discussions. This project was supported in part by DoD, including PTG and/or LwLL programs, as well as BAIR’s industrial alliance programs.
Here we provide additional information about our experimental results, qualitative examples, implementation details, and datasets. Specifically, Section 6 provides more experiment results, Section 7 provides additional implementation details, Section 8 provides additional related work, and Section 9 provides qualitative visualizations to illustrate our approach.
We begin by presenting several additional ablations in Section 6.1 that further demonstrate the benefits of our TraveLER approach. We also present additional results in Section 6.2.
In what follows, we provide additional ablations that further illustrate the benefits of TraveLER. For all ablations, we compare the ablated experiment with the corresponding best-performing TraveLER results on a random sample of 1000 examples from the training set of the NExT-QA dataset. We use GPT-3.5 as the LLM and LLaVA-1.6 as the LMM.
In order for the Planner to create effective plans, it is beneficial to initialize the memory bank properly. Memory initialization gives the Planner a high-level overview of the video, from which it can create a corresponding plan on how to traverse the video. We perform three different initializations with 1, 3, and 5 frames and display our results in Fig. 3. We observe that initializing the memory bank with 5 frames (relative positions 0, 0.25, 0.5, 0.75, 1; i.e., beginning, quarter, middle, three-quarter, end) yields the best result. In contrast, we notice a decrease in accuracy of -0.6% when initializing with 3 frames (0, 0.5, 1) and of -2.5% when initializing with 1 frame (0.5, the middle).
Figure 3: Ablation of memory bank initialization with 1, 3, and 5 uniformly sampled frames.
The LMM in our framework is crucial because it allows us to capture more relevant and question-specific details from visual input. However, if the LMM’s responses are too long, the memory bank becomes too large, whereas if the LMM’s responses are too short, insufficient information is captured. Thus, we conduct an experiment to determine the optimal LMM response length, and display our results in Fig. [lmm_response_length]. We find that limiting the LMM response to 150 tokens yields the best performance, while accuracy decreases by -2.2% and -1.7% if the response is limited to 75 tokens and 300 tokens, respectively. This supports the tradeoff between collecting too little information at short response lengths and collecting too much information as the LMM response size increases.
In each module of our framework, we use a task prompt to provide instructions to our agents (LLMs or LMMs). The construction of these prompts plays a large role in how instructions are executed. Currently, we use the question \(Q\) as input to all prompts \(P_\mathrm{T}^{(1)}, P_\mathrm{R}^{(1)}, P_\mathrm{E}^{(1)}, P_\mathrm{V}^{(1)}\). However, we use the choices \(C\) as input only for the Planner and Evaluator prompts, since the Planner needs the choices to tailor its plan and the Evaluator needs the choices to answer the question. We experiment with adding the choices \(C\) to the Retriever and Extractor prompts, and find that this degrades performance by -1.6%. This may be because the incorrect choices mislead the Retriever into searching for non-existent events or the Extractor into asking irrelevant questions.
We compare our method with other keyframe localization methods to further demonstrate the benefits of our TraveLER approach.
| Method | Tem. | Cau. | Des. | Avg. |
|---|---|---|---|---|
| SeViLA - Localizer [8] | 48.8 | 61.2 | 68.3 | 58.2 |
| Moment-DETR [37] | 45.3 | 55.8 | 70.8 | 54.6 |
| SigLIP [38] | 48.4 | 61.5 | 73.8 | 59.1 |
| TraveLER - Planner & Retriever (ours) | 50.9 | 62.7 | 72.4 | 60.3 |

Table 5: Comparison of keyframe localization methods on a random subset of NExT-QA.
In Table 5, we compare our Planner and Retriever with other keyframe localization methods by replacing our Planner and Retriever with each of the other methods, and using our Extractor and Evaluator to perform the question answering. For all methods, we use GPT-3.5 and LLaVA-1.6, and we evaluate on a random subset of 1000 examples from the training set of NExT-QA. Note that the other methods find keyframes in one inference pass, whereas our inference occurs over multiple iterations. Therefore, to ensure fair comparisons, for the other methods we uniformly sample 32 frames and extract 4 keyframes, while we run 4 iterations of TraveLER using the Retriever with a window size of 2 to similarly find 4 keyframes among fewer than 32 viewed frames.
We find that our Planner and Retriever surpass other keyframe localization methods, despite considering fewer total frames (\(\sim\)25 in total: 5 for memory initialization, plus up to 5 frames per iteration). We would like to highlight that while our method is effective at finding keyframes, we do not need to find all keyframes to answer a question. Instead, we are often able to choose the correct answer with only a subset of the keyframes.
To run our models on larger benchmarks, we use 8 NVIDIA RTX 6000 GPUs and split the dataset across multiple processes. In addition, we use the SGLang [39] package, which provides a variety of performance optimizations for our LMMs and enables us to perform batched inference for models that do not natively support doing so. We serve our LMM on a single GPU and implement a queue that is shared across all runs. This allows individual runs to asynchronously call the LMM using an API request instead of creating a new instance of the LMM for each run. Typically, we have 4-5 processes sharing the same LMM at a time.
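As a rough sketch of how an individual run might call the shared, SGLang-served LMM, assuming the server exposes an OpenAI-compatible chat endpoint (the port, model name, and payload shape below are illustrative assumptions, not a verbatim transcription of our setup):

```python
import requests

def query_shared_lmm(prompt: str, image_b64: str,
                     endpoint: str = "http://localhost:30000/v1/chat/completions") -> str:
    """Send one frame + prompt to the shared LMM server instead of loading a local model."""
    payload = {
        "model": "llava-v1.6",  # placeholder model name
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/jpeg;base64,{image_b64}"}},
            ],
        }],
    }
    resp = requests.post(endpoint, json=payload, timeout=120)
    return resp.json()["choices"][0]["message"]["content"]
```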
Next, for our LLMs we use all default parameters. For our LMMs, we experiment with modifying the maximum output token length as an ablation, but use all default parameters otherwise.
Our prompts are below. The black text is the base prompt template, and we replace the blue text with the corresponding information from the relevant video. The generated outputs are in the orange text.
Planner prompt \(P_T\):
User:
Create the best plan to gather information to answer the question.
QUESTION: QUESTION
CHOICES: CHOICES
You are provided information collected from individual frames of the video, represented as a dictionary keyed by the timestamps of the frames.
INFORMATION: INFO
You are also given an explanation for why you aren’t able to definitively answer the question with the current information.
EXPLANATION: EXPLANATION
Follow these rules:
1. You only have access to individual frames of the video, with no audio. You can go to a certain timestamp, search for actions or settings, and describe or ask questions about individual frames.
2. Make sure that you have viewed the relevant frames.
Make your plan as simple and straightforward as possible, and no longer than 5 steps long. Return your plan as a numbered list, after PLAN. Do not include any other response or explanation. Let’s think step-by-step.
Assistant: Output: PLAN
Retriever prompt \(P_R\):
User:
You are given the following information about a LENGTH second video, with information from individual frames at different timestamps.
INFORMATION: INFO
PLAN: PLAN
Currently, you are viewing second CURR. Choose the timestamp, in seconds, of the next frame to view. When choosing the next frame to view, remember that you are trying to collect information to answer this multiple choice question: QUESTION
Think of what information you need, and consider what information you already have. Use the temporal nature of the video and your past information to choose the next frame. Do not choose a frame you already have information about, and make sure that the frame you choose is at least WINDOW SIZE seconds apart from the second you are currently viewing.
Return your answer as a single Python float representing the second you want to view. Don’t provide any other response or explanation.
Assistant: Output: TIMESTAMP
Extractor prompt \(P_E\):
User:
You are given the following information about a LENGTH second video, with information from individual frames at different timestamps.
INFORMATION: INFO
Currently, you are viewing second CURRENT TIMESTAMP, which has the caption: FRAME CAPTION
Form up to 3 questions about this frame to best help answer the multiple-choice question: QUESTION.
Follow these rules:
1. Use the given information to decide what further visual information you need to answer the question.
2. Since you are asking questions about a single frame, you cannot ask about other frames, reference past or future events, or ask about specific timestamps.
Return your questions as a Python list of strings (in double quotes) and don’t include any numbered lists, backticks, or language hints. Follow Python syntax. Make sure you have followed the steps. Don’t provide any other response or explanation.
Assistant: Output: QUESTIONS
Evaluator prompt \(P_V\):
User:
Evaluate if there is enough information to answer a multiple-choice question about a video and if the plan has been completed.
If there is enough information to choose the correct answer with complete certainty and the plan has been followed, return the index of the choice after a brief explanation. Otherwise, return None after a brief explanation of why you can’t narrow down to a single answer choice. Be strict and don’t guess.
INFORMATION: INFO
PLAN: PLAN
QUESTION: QUESTION
CHOICES: CHOICES
Give a brief explanation. Then, include your final answer after the words "Final Answer:" in your response at the end. Do not include anything other than the answer as an integer or None after "Final Answer:".
Let’s think step by step.
Assistant: Output: ANSWER
NExT-QA is a challenging dataset that tests causal action reasoning and temporal understanding. It contains 5,440 videos with an average length of 44s. Compared to earlier VideoQA benchmarks [40]–[43], NExT-QA requires going beyond simple recognition of objects and actions to answer the questions correctly. Each question requires selecting the best option out of 5 choices, often with very similar degrees of plausibility. Additionally, each question is categorized into either a Temporal, Causal, or Descriptive type. Temporal questions often ask what happens during, before, or after an event or action, while causal questions require advanced reasoning and inference about why an event or action occurs. Following prior work, we evaluate our method on the 5,000 questions in the NExT-QA validation set, which span 500 different videos.
For NExT-QA, we use LLaVA-1.6 (Vicuna 13B) as the LMM and GPT-4 (gpt-4-1106-preview) as the LLM. We used the longer, comprehensive prompts, with no answer choices included in the Extractor prompt. We also initialize the memory bank with 5 frames, and use the multi-frame Retriever with 5 frames.
In comparison with earlier VideoQA datasets [42]–[44] that focus on computational tasks such as classification, detection, or tracking, Perception Test is a dataset that focuses on skills such as memory, abstraction, physics, and semantics. Moreover, it is designed to test the transfer capabilities of different models and intended to be approached in a few-shot or zero-shot manner. The dataset consists of 11.6k real-world videos with an average length of 23 seconds, and 38K multiple choice QA questions.
For Perception Test, we use LLaVA-1.6 (Vicuna 13B) as the LMM and GPT-3.5 (gpt-3.5-turbo-0125) as the LLM. We used shorter, simplified prompts, with no answer choices included in the Extractor prompt. We also initialize the memory bank with 5 frames, and use the multi-frame Retriever with 5 frames.
STAR is a dataset that tests reasoning in real-world video situations. It consists of 22K video clips, with 60K situated reasoning questions, with 4 possible choices each. Questions are broadly divided into 4 main categories: interaction, sequence, prediction, and feasibility.
For STAR, we use LLaVA-1.6 (Vicuna 13B) as the LMM and GPT-3.5 (gpt-3.5-turbo-0125) as the LLM. We used shorter, simplified prompts, with no answer choices included in the Extractor prompt. We also initialize the memory bank with 5 frames, and use the multi-frame Retriever with 3 frames, since the videos are shorter.
There has been a long history of work [45]–[58] that attempts to combine deep neural networks with modularity. Recently, works like VisProg [59], CodeVQA [60], RVP [61], and ViperGPT [32] have leveraged the improved coding capabilities of LLMs to generate code that composes different submodules to answer visual questions. In addition, ProViQ [33] extends ViperGPT to the video domain by adding more modules for VideoQA. Similarly, we leverage the strong capabilities of large models in a modular approach. However, while these approaches have shown promising results, they are limited to single-shot planning when generating code, resulting in a fixed plan that cannot adapt. In contrast to these works, our approach has the advantage of being able to iteratively replan based on newly collected information.
We present further qualitative success and failure cases of our TraveLER framework. For each dataset, we display qualitative visualizations for 2 success and 2 failure cases. For the success cases, we show expanded examples in Figures 4, 5, and 6, and abridged versions in Figure 7 that demonstrate the benefits of our question-answering approach compared to regular captioning. For failures, we also present abridged versions for each dataset in Figures 8, 9, and 10. Finally, we present some additional success and failure cases in Figure 11. For the abridged versions, we display the video on top and the traversal order using the numbered orange circles. In the row beneath, we display the frames in the order they are selected, with the corresponding Extractor question-answer output in yellow and captions in gray.
Figure 4: NExT-QA Success Predictions. We can see that our framework can adapt to new information collected in past iterations. For example, in Iteration 3, our Planner module is able to use information about the wooden post from a previous iteration and ask further questions to identify the correct answer.
Figure 5: STAR Success Predictions. Here, we can see that our method does not require viewing frames sequentially. For example, we view the beginning of the video in Iteration 1, the middle of the video in Iterations 2 and 3, and return to the beginning in Iteration 4. Moreover, our method can collect information and double-check ambiguous information across different timestamps. For example, in Iteration 2, we are told the man is taking a blanket, and then we can view a different frame to confirm that he is indeed holding a towel in Iteration 3.
Figure 6: Perception Test Success Predictions. We display some success cases for the challenging Perception Test dataset. Here, our method is able to infer which objects the question refers to through our question-asking approach, even though the question does not explicitly describe them.
Figure 7: Comparison with Captioning Approaches. For each example, we display the videos on top, and the traversal order using the numbered orange circles. In the rows beneath, we display the frames in the order they are selected, and display corresponding Extractor question-answer output in yellow and captions in gray. We display compressed versions of GPT-4V-generated captions for a visual comparison. By asking specific questions, we can extract more detailed and relevant information than a general description.
Figure 8: NExT-QA Failure Predictions. Here, we display some failure cases for NExT-QA. Like before, we display the video on top and the traversal order using the numbered orange circles. In the row beneath, we display the frames in the order they are selected, with the corresponding Extractor question-answer output in yellow and captions in gray. We can see that conflicting information or false positives can mislead our approach. We also observe that counting can be a challenge for certain LMMs, but this can be mitigated in the future by swapping in stronger LMMs.
Figure 9: STAR Failure Cases. Here, we display some failure cases for the STAR dataset, using the same abridged representation described previously. We see that a limitation of a frame-wise approach is that it may be difficult to capture fast, temporally fine-grained actions. For example, in Example 1, it is difficult to tell whether the woman is opening or closing the window.
Figure 10: Perception Test Failure Cases. We display some qualitative visualizations for Perception Test failure cases using the abridged representation discussed previously. We see that for some cases where objects of interest are occluded or not in the frame, our method might have difficulties extracting useful information.
Figure 11: An additional visualization of predictions. We show more qualitative visualizations of our method on NExT-QA and Perception Test using our abridged representation, with successes on top and failures on the bottom. We compare our generated question-answer pairs for each frame (in yellow) with captions (labeled C in gray) generated from the same frame. We see that our method is able to extract more fine-grained and relevant information compared to simple captioning.
The license, PII, and consent details of each dataset are in the respective papers. In addition, we wish to emphasize that the datasets we use do not contain any harmful or offensive content, and many other papers in the field also use them. Thus, we do not anticipate a specific negative impact, but, as with any machine learning method, we recommend exercising caution.