TraveLER: A Multi-LMM Agent Framework for Video Question-Answering

Chuyi Shang*, Amos You*, Sanjay Subramanian, Trevor Darrell, Roei Herzig
University of California, Berkeley


Abstract

Recently, Large Multimodal Models (LMMs) have made significant progress in video question-answering using a frame-wise approach by leveraging large-scale, image-based pretraining in a zero-shot manner. While image-based methods for videos have shown impressive performance, a current limitation is that they often overlook how key timestamps are selected and cannot adjust when incorrect timestamps are identified. Moreover, they are unable to extract details relevant to the question, instead providing general descriptions of the frame. To overcome this, we design a multi-LMM agent framework that travels along the video, iteratively collecting relevant information from keyframes through interactive question-asking until there is sufficient information to answer the question. Specifically, we propose TraveLER, a model that can create a plan to “Traverse” through the video, ask questions about individual frames to “Locate” and store key information, and then “Evaluate” if there is enough information to answer the question. Finally, if there is not enough information, our method is able to “Replan” based on its collected knowledge. Through extensive experiments, we find that the proposed TraveLER approach improves performance on several video question-answering benchmarks, such as NExT-QA, STAR, and Perception Test, without the need to fine-tune on specific datasets.

Figure 1: A simplified overview of our TraveLER framework. Our proposed framework aims to answer the question by collecting relevant information from keyframes through interactive question-asking. To accomplish this, several agents (in colored boxes) with different roles interact (left-to-right in each row) over several iterations. TraveLER creates a plan (in blue) to “traverse” (in orange) through the video, asks questions regarding individual frames (in yellow) to “locate” and store key information, “evaluates” whether there is sufficient information to answer the question (in green), and “replans” using past collected knowledge if there is not enough information.

1 Introduction

Over the last few years, Large Multimodal Models (LMMs) have demonstrated tremendous progress in the area of video understanding, particularly in the video question-answering (VideoQA) domain [1], [2]. More recently, LMMs have been able to achieve impressive results through video-based models [3]–[6]. However, video models are computationally expensive to fine-tune, and annotations are difficult and expensive to collect. As a result, many recent approaches [7]–[9] operate on a frame level, leveraging large-scale image-based pretraining in a zero-shot setting.

Despite the effectiveness of image-based LMMs for image tasks, applying them to VideoQA is challenging since using all frames results in high computational demands and redundancy. Thus, many works try to select subsets of frames, either through uniform sampling [9] or keyframe selection [8]. However, uniform sampling may skip important information, while keyframe selection methods might select the wrong frames and mislead the model. To address this, we introduce a novel video traversal approach using a “Planner,” an LLM agent that creates a plan to find and extract key information.

Next, we wish to ensure we can capture correct and detailed information from frames when executing the plan. Yet, the common captioning approach provides general descriptions of the frame, whereas answering questions often requires more specific details. Moreover, not all elements of the frame are relevant to the question, and some may even be misleading. As a result, we propose an interactive question-answering process using a “Locator” to locate and extract the most relevant and fine-grained details from each frame. Specifically, we use two agents: an LLM that asks questions about the frame and an LMM that answers them.

Nevertheless, it can be difficult for the model to collect all necessary details in a single pass, and extracting incorrect information may be misleading. Hence, we introduce an iterative approach using an “Evaluator” that reviews the collected information after each iteration and decides whether there is enough to answer the question. If so, the answer is selected; otherwise, the new information is used to “Replan” and begin a new iteration.

Consider the example in Figure 1. Suppose we are asked “why the boy turned over in the middle of the video”. In the first iteration, our method uses temporal cues from the question to skip to the middle of the video and asks questions to find the relevant frames. In the next iteration, we gather more information. Asking about what the boy is doing, we learn that he is “standing up at the bottom of the slide” and is not looking at anything specific, which informs us that the boy is no longer “sitting down” (Choice B) or “resting on the yellow object” (Choice D). To eliminate these choices, we must confirm that the boy does not sit back down again by traveling to a timestamp near the end of the video. Finally, since we have collected enough information and followed the plans, we can select the right choice that the boy turns over to be on his stomach “to get down on slide” (Choice E).

Our proposed approach, Traverse, Locate, Evaluate, and Replan (TraveLER), is a modular, multi-LMM agent framework for video question answering. Our framework is composed of four main stages, each with LLM or LMM “agents” that interact with each other across the different stages. First, in the Traversal stage (“traverse”), an agent creates a plan to answer the question. In the Locator stage (“locate”), an agent uses the plan to decide which timestamp of the video to select. The corresponding frames are then sent to another agent, which asks questions and stores the answers in a memory bank for future iterations. Finally, in the Evaluator stage (“evaluate”), an agent reviews all collected information and decides whether to answer or, if necessary, create a modified plan (“replan”) to start the next iteration.

To summarize, our main contributions are as follows: (i) We introduce TraveLER, a modular multi-LMM agent framework for video question-answering. (ii) Our proposed TraveLER method does not require task-specific fine-tuning or video annotations and is easy to employ with several different LLMs or LMMs. (iii) Our method shows improved performance on multiple difficult video question-answering benchmarks such as NExT-QA, Perception Test, and STAR, highlighting the effectiveness of our approach.

2 Related Work

Video question-answering (VideoQA) involves answering free-form or multiple-choice questions given an input video. In comparison to image question answering, VideoQA poses unique challenges because it often requires strong temporal understanding and the ability to deal with long input sequences. Many recent works have focused on training end-to-end video-language models [1], [2], [4]–[6], [10], but doing so remains challenging due to computational constraints and difficulties in architecture scaling. As a result, many approaches adapt pretrained image models to the video domain by extracting information independently from each frame [7]–[9]. In this work, we design a framework that builds an adaptive plan to traverse through the video to identify keyframes and extract relevant information using a question-answering approach.

LMMs have been shown to be extremely useful for VideoQA. Some methods use supervised or contrastive training to perform video-LMM pretraining [11]–[13], while others take existing LMMs and use instruction tuning to adapt them to the video domain [3], [14], [15]. However, recent improvements in LMM capabilities have enabled many strong approaches for few-shot [16], [17] and zero-shot VideoQA [17], [18]. In particular, zero-shot methods such as LLoVi [9] use pre-trained LMMs to generate captions for each frame in the video. Nevertheless, uniformly sampling frames may result in the model missing important visual information and focusing on unimportant frames [19], [20]. Recent works such as SeViLA [8] address this problem by performing parameter-efficient finetuning using captions to identify keyframes [21]–[23], but this requires fine-tuning on specific datasets. In contrast to these works, which select all keyframes in a single pass, we introduce a zero-shot, iterative method that repeatedly gathers data from various timestamps until enough information is collected to correctly answer the question.

The strong reasoning abilities of LLMs [24], [25] have made them effective in LLM-based agent approaches for videos, where an LLM performs much of the reasoning after collecting information from different modules [26]–[29]. For example, Socratic Models [29] proposes a method to reason about videos based on generated audio transcriptions and CLIP frame similarity scores, while other works like VideoChatCaptioner [26] propose a way to caption videos through chat dialogues between an LLM and an LMM. Unlike these works, our method utilizes a novel video traversal approach and an iterative, planning-based information gathering process.

Figure 2: TraveLER framework. Our framework consists of four different modules: the Planner, Retriever, Extractor, and Evaluator. The Planner creates a plan and sends it to the Retriever. The Retriever uses the plan to select the next timestamp and sends this to the Extractor. The Extractor captions and generates questions about the timestamp, answers the questions, and saves the output in the memory bank. Finally, the Evaluator determines if there is enough information and if the plan has been followed. If so, the Evaluator returns the answer; otherwise, the existing information is sent back to the Planner to begin a new iteration.

3 TraveLER Framework

To design a robust approach that can find the correct keyframes and extract the most relevant information for VideoQA, we propose a modular LMM agent framework that Traverses, Locates, Evaluates, and Replans iteratively (TraveLER). We begin by describing the general LLM and LMM architectures (Section 3.1), then introduce each component of our pipeline (Section 3.2), and implementation details (Section 3.3). Our method is illustrated in Figure 2.

3.1 Preliminaries

LLMs are text-conditioned generative models. Given a prompt \(P_\text{in}\), they encode it into a fixed language embedding \(l\) in an embedding space \(f(\cdot)\) and then use this to produce a text response \(R\): \(R = f(l(P_\text{in}))\).

Similarly, Large Multimodal Models (LMMs) are adapted to jointly reason over vision and language modalities. To map the different modalities into the shared embedding space \(f(\cdot)\), the image \(I\) is encoded using a trainable encoder \(v\) and the prompt \(P_{\text{in}}\) is encoded using a fixed language embedding \(l\). The LMM outputs a text response \(R\): \(R = f(v(I), l(P_{\text{in}}))\).
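To make these two roles concrete for the sketches that follow, the snippet below fixes a minimal Python view of the interfaces; the type aliases are purely illustrative and not tied to any particular library.

```python
from typing import Any, Callable

# A text-only LLM maps a prompt to a text response: R = f(l(P_in)).
LLMFn = Callable[[str], str]

# An LMM additionally conditions on an image (a frame I, e.g. a file path
# or pixel array): R = f(v(I), l(P_in)).
LMMFn = Callable[[Any, str], str]
```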

VideoQA involves viewing a video and answering questions. The model is usually evaluated through top-1 accuracy, in which it chooses the best answer out of a set of possible choices. Specifically, given a question \(Q\), video input \(V\) consisting of a set of frames \(\{f_1, \cdots, f_n\}\), and set of choices \(C = \{c_1, \cdots, c_n\}\), the model is asked to choose the best \(c_i\) to answer \(Q\). Next, we introduce each component of our method.

3.2 TraveLER Components

In the Traversal stage, we create a plan for how to traverse through the video, which is a list of textual instructions that guide our approach to answering the question. To achieve this, we use the task prompt \(P_T\), which is an instruction to create a plan for answering the question. We combine \(P_T\) with the question \(Q\) and the memory bank \(M\), a dictionary of collected information keyed by timestamps and containing information from the corresponding frames, to obtain the final prompt \(P_\mathrm{T}^{(1)}\): \(P_\mathrm{T}^{(1)} = ``[Q][M][P_T]"\).

Our method uses a memory bank \(M\) to store collected information, which allows information to persist and to be updated as we proceed through different iterations. We initialize \(M\) with captions of 5 evenly sampled frames throughout the video. We find that this memory initialization gives the model good context for the general idea of the video, and performs better than starting with an empty memory \(M\). After the first iteration, we add information iteratively using the Extractor module, which will be discussed later.
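A rough sketch of this initialization step is shown below; the helper names `get_frame` and `caption` are placeholders for frame extraction and the captioning LMM, and the rounding is our own choice.

```python
from typing import Any, Callable, Dict, List

def initialize_memory(
    video_length: float,                    # duration of the video in seconds
    get_frame: Callable[[float], Any],      # returns the frame at a timestamp
    caption: Callable[[Any], str],          # captioning call to the LMM
    num_init_frames: int = 5,               # evenly sampled frames
) -> Dict[float, List]:
    """Initialize the memory bank M with captions of evenly spaced frames."""
    memory: Dict[float, List] = {}
    for i in range(num_init_frames):
        # Timestamps at 0, 1/4, 1/2, 3/4, and the end of the video.
        t = round(video_length * i / (num_init_frames - 1), 2)
        memory[t] = [caption(get_frame(t))]
    return memory
```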

Next, we input the prompt \(P_\mathrm{T}^{(1)}\) into \(\text{LLM}_{\text{planner}}\), which returns response \(R_T\), a step-by-step plan on how to traverse through the video and what information the model needs to collect.

\[R_T = \text{LLM}_{\text{planner}}(l(P_\mathrm{T}^{(1)}))\]
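A minimal sketch of this step, assuming a `planner_llm` callable that wraps \(\text{LLM}_{\text{planner}}\); the exact wording of the task prompt \(P_T\) is given in Section 7.1, and only the concatenation order is shown here.

```python
from typing import Callable, Dict, List, Optional

def run_planner(
    question: str,
    memory: Dict[float, List],
    task_prompt: str,                        # P_T: the planning instruction
    planner_llm: Callable[[str], str],       # wraps LLM_planner
    explanation: Optional[str] = None,       # Evaluator feedback when replanning
) -> str:
    """Build P_T^(1) = "[Q][M][P_T]" and return the step-by-step plan R_T."""
    prompt = f"QUESTION: {question}\nINFORMATION: {memory}\n"
    if explanation is not None:              # only present after a failed iteration
        prompt += f"EXPLANATION: {explanation}\n"
    prompt += task_prompt
    return planner_llm(prompt)               # R_T: a numbered list of steps
```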

Our next step is to use the plan \(R_T\) in the Locator stage to locate keyframes and extract the information that we will use to answer the question.

The Locator is a component that consists of two submodules, the Retriever and the Extractor. The Retriever selects the timestamps of the next frames to view, while the Extractor extracts relevant information from these frames, using a question-answering process. Next, we discuss each component in more detail.

(i) Retriever: The Retriever carries out the given plan \(R_T\) by selecting which frames to view next. The Retriever is an LLM-based submodule whose goal is to use the collected information \(M\) to find the next timestamp \(t\) to select in order to fulfill the plan \(R_T\). The task prompt \(P_R\) is an instruction that contains information about the video length and asks which timestamp to view next. Thus, we insert the question \(Q\), plan \(R_T\), and collected information \(M\) into the task prompt \(P_R\) to create the new prompt \(P_\mathrm{R}^{(1)}\): \(P_\mathrm{R}^{(1)} = ``[P_R][Q][R_T][M]"\).

Given the prompt \(P_\mathrm{R}^{(1)}\), the LLM in the Retriever, \(\text{LLM}_{\text{retriever}}\), returns the next timestamp \(t\) (or set of timestamps). The module then retrieves the frames \(I_t\) at timestamp \(t\).
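A sketch of the Retriever step under the same conventions; the float parsing follows the instruction in the Retriever prompt (Section 7.1) to return a single timestamp, and error handling is omitted.

```python
from typing import Any, Callable, Dict, List, Tuple

def run_retriever(
    question: str,
    plan: str,                               # R_T from the Planner
    memory: Dict[float, List],
    task_prompt: str,                        # P_R: includes the video length
    retriever_llm: Callable[[str], str],     # wraps LLM_retriever
    get_frame: Callable[[float], Any],
) -> Tuple[float, Any]:
    """Build P_R^(1) = "[P_R][Q][R_T][M]" and fetch the frame at the chosen time."""
    prompt = f"{task_prompt}\nQUESTION: {question}\nPLAN: {plan}\nINFORMATION: {memory}"
    t = float(retriever_llm(prompt).strip())  # the prompt requests a single float
    return t, get_frame(t)
```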

(ii) Extractor: The Extractor is a significant part of our method because it allows us to capture more relevant and question-specific details from the visual input, unlike using only captions. We pass the frames selected by the Retriever, \(I_t\), into the Extractor submodule, which consists of two large models: an LLM, \(\text{LLM}_{\text{extractor}}\), to generate context-dependent questions about the frames \(I_t\), and a vision-language LMM, \(\text{LMM}_{\text{extractor}}\), whose job is to extract the desired information from the same frames.

In this module, we first generate a general caption \(c_t\) for frame \(I_t\) using \(\text{LMM}_{\text{extractor}}\). Then, we concatenate the caption \(c_t\), plan \(R_T\), memory \(M\), and the Extractor task prompt \(P_E\), which is an instruction that asks the model to use the available information to create 3 questions about the current frame. This results in the new prompt \(P_\mathrm{E}^{(1)}\): \(P_\mathrm{E}^{(1)} = ``[P_E][c_t][R_T][M]"\).

Next, we input this new prompt \(P_\mathrm{E}^{(1)}\) into the LLM to get a set of questions \(\{q_1, q_2, \cdots, q_n\}\) about each frame, where \(n\) is a parameter for how many questions to ask about each frame.

\[\{q_1, q_2, \cdots, q_n\} = \text{LLM}_{\text{extractor}}(l(P_\mathrm{E}^{(1)}))\]

where \(l\) is the fixed language embedding.

In this way, the generated questions take into account both the plan \(R_T\) and information from past and future frames of the video \(M\). We then use the frame \(I_t\) and the corresponding questions \(\{q_1, q_2, \cdots, q_n\}\) as input into \(\text{LMM}_{\text{extractor}}\). The \(\text{LMM}_{\text{extractor}}\) then outputs a set of answers \(\{a_1, a_2, \cdots, a_n\}\), where each answer \(a_i\) corresponds to the question \(q_i\).

\[\{a_1, a_2, \cdots, a_n\} = \text{LMM}_{\text{extractor}}(v(I_t), l(\{q_1, q_2, \cdots, q_n\}))\]

where \(v\) is the visual encoder.
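Putting the two Extractor calls together, a sketch might look as follows; the brief captioning instruction and the use of `ast.literal_eval` to parse the question list are our own simplifications.

```python
import ast
from typing import Any, Callable, Dict, List, Tuple

def run_extractor(
    frame: Any,                                   # I_t selected by the Retriever
    plan: str,                                    # R_T
    memory: Dict[float, List],
    task_prompt: str,                             # P_E: asks for up to n questions
    extractor_llm: Callable[[str], str],          # LLM_extractor: writes questions
    extractor_lmm: Callable[[Any, str], str],     # LMM_extractor: answers on the frame
    num_questions: int = 3,
) -> Tuple[str, List[Tuple[str, str]]]:
    """Caption the frame, generate frame-specific questions, and answer them."""
    caption = extractor_lmm(frame, "Briefly describe this frame.")
    prompt = f"{task_prompt}\nCAPTION: {caption}\nPLAN: {plan}\nINFORMATION: {memory}"
    # The prompt asks for a Python list of question strings (see Section 7.1).
    questions = ast.literal_eval(extractor_llm(prompt))[:num_questions]
    qa_pairs = [(q, extractor_lmm(frame, q)) for q in questions]
    return caption, qa_pairs
```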

Finally, to use this collected information in future iterations, we update our memory bank \(M\). To do this, we use the timestamp \(t\) of \(I_t\) as our key and the question-answer pair list as the value, and append this to our memory \(M\).

If the memory bank dictionary \(M\) is too long, we summarize it by using the memory bank as an input to another LLM and instruct it to make the memory bank entries more concise, while retaining the same keys and format. This output becomes our new memory bank.
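A sketch of the memory update and the summarization fallback; the length threshold and the summarization instruction are illustrative, since the exact values are implementation details.

```python
import ast
from typing import Callable, Dict, List, Tuple

def update_memory(
    memory: Dict[float, List],
    timestamp: float,
    caption: str,
    qa_pairs: List[Tuple[str, str]],
    summarizer_llm: Callable[[str], str],
    max_chars: int = 8000,                    # hypothetical "too long" threshold
) -> Dict[float, List]:
    """Append the new frame information to M; summarize M if it grows too long."""
    memory[timestamp] = [caption] + list(qa_pairs)
    if len(str(memory)) > max_chars:
        summarized = summarizer_llm(
            "Make the entries of this memory bank more concise, keeping the same "
            f"keys and dictionary format:\n{memory}"
        )
        memory = ast.literal_eval(summarized)  # assumes the LLM returns a valid dict
    return memory
```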

The Evaluator decides if there is enough information and determines if the plan has been followed. We concatenate the memory information \(M\), the plan \(R_T\), the question \(Q\), and the choices \(C\) with the task prompt \(P_V\), which is an instruction to evaluate whether there is enough information to answer the question and whether the given plan \(R_T\) has been fulfilled. Thus, we get the new prompt \(P_\mathrm{V}^{(1)}\): \(P_\mathrm{V}^{(1)} =``[P_\mathrm{V}][Q][C][R_T][M]"\).

We use this prompt \(P_\mathrm{V}^{(1)}\) as input into the LLM in the Evaluator, \(\text{LLM}_{\text{evaluator}}\), which evaluates if there is enough information to answer the question and if the plan has been completely followed. If both are true, \(\text{LLM}_{\text{evaluator}}\) outputs the best choice \(c^*\) to answer the question \(Q\). Otherwise, it provides an explanation \(E\) of why there is not enough information and gives this explanation to the Planner to start a new iteration of the process.

After each iteration, if the Evaluator decides that there is not enough information to answer the question \(Q\) or if the plan \(R_T\) has not been completed, the existing memory \(M\) is provided to the Planner in the next iteration, in addition to an explanation \(E\) of why an answer was not chosen. The Planner then outputs a new plan, restarting the process. We also implement a limit on the number of iterations a question can take, to prevent infinite loops. After this limit is reached, we force the Evaluator to choose the best choice.
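The overall control flow can be summarized by the sketch below; the function signatures and the iteration cap are illustrative (the paper only states that a limit is enforced), and each callable stands for one of the stages described above.

```python
from typing import Callable, Dict, List, Optional, Tuple

def traveler(
    question: str,
    choices: List[str],
    memory: Dict[float, List],
    plan_fn: Callable[..., str],          # Traverse / Replan (Planner)
    locate_fn: Callable[..., Dict],       # Locate (Retriever + Extractor); returns updated M
    evaluate_fn: Callable[..., Tuple[Optional[str], str]],  # Evaluate
    max_iterations: int = 8,              # hypothetical cap to avoid infinite loops
) -> str:
    """High-level TraveLER loop: Traverse, Locate, Evaluate, and Replan."""
    explanation: Optional[str] = None
    plan = ""
    for _ in range(max_iterations):
        plan = plan_fn(question, memory, explanation)        # create or revise the plan
        memory = locate_fn(question, plan, memory)           # collect new frame information
        answer, explanation = evaluate_fn(question, choices, plan, memory)
        if answer is not None:                               # enough information gathered
            return answer
    # Iteration limit reached: force the Evaluator to pick the best available choice.
    answer, _ = evaluate_fn(question, choices, plan, memory, force_answer=True)
    return answer
```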

3.3 Implementation Details

Here, we discuss how we implement various components of our framework. The code will be released upon acceptance. More implementation details, such as prompts and dataset-specific details, are provided in the Supplementary in Section 7.

We represent past collected information as a Python dictionary, with the timestamps of frames as keys and a list of extracted information from each frame as the values. This extracted information consists of a brief caption of the frame and a list of question-answer pairs. To prevent the memory bank from becoming too large, we also implement a summarizer module that instructs an LLM to summarize the memory bank and return a more concise version in the same dictionary format as before.
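Concretely, a memory bank entry under this representation might look like the example below; the types and the entry contents are made up for illustration.

```python
from typing import Dict, List, Tuple, Union

# Each value is a list holding a brief caption followed by question-answer pairs.
FrameInfo = List[Union[str, Tuple[str, str]]]
MemoryBank = Dict[float, FrameInfo]

example_memory: MemoryBank = {
    12.0: [
        "A boy sits at the top of a yellow slide.",
        ("What is the boy doing?", "He is turning over onto his stomach."),
    ],
}
```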

Our modular approach has the benefit of allowing us to easily swap in different LLMs and LMMs (see Section 4.4). For our main experiments, we use LLaVA-1.6 for \(\text{LMM}_{\text{extractor}}\) and GPT-4 for \(\text{LLM}_{\text{planner}}\), \(\text{LLM}_{\text{retriever}}\), \(\text{LLM}_{\text{extractor}}\), and \(\text{LLM}_{\text{evaluator}}\).

We also allow the Retriever to select multiple frames instead of a single frame. This helps the model better capture events that happen quickly or require more context to recognize. For example, if we want to find the action of “a woman clapping her hands”, single-frame selection may cause us to incorrectly assume the woman is not clapping if we view a frame where her hands are apart. We do this by creating an optional parameter called window size, which refers to the number of frames the Retriever extracts each time. When the window size is non-zero, the Retriever still specifies a single timestamp to go to, but when retrieving the frame at that timestamp we also take the number of frames specified by the window size before and after the selected frame.
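A sketch of this multi-frame retrieval; the spacing between neighboring frames (`stride`) is an assumption, and the window is clamped to the video boundaries.

```python
from typing import Any, Callable, List

def retrieve_window(
    center_t: float,                        # timestamp chosen by the Retriever
    window_size: int,                       # total number of frames to return (odd)
    stride: float,                          # seconds between neighboring frames (assumed)
    video_length: float,
    get_frame: Callable[[float], Any],
) -> List[Any]:
    """Return `window_size` frames centered on the Retriever's chosen timestamp."""
    half = window_size // 2
    frames = []
    for offset in range(-half, half + 1):
        t = center_t + offset * stride
        if 0.0 <= t <= video_length:        # clamp to the video boundaries
            frames.append(get_frame(t))
    return frames
```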

Table 1: Zero-shot (ZS) results on NExT-QA. For fair comparisons, we gray out methods with fine-tuned components in their model. The best scores are in bold.
Model Cau. (%) Tem. (%) Des. (%) Avg. (%)
MC-ViT-B [30] - - - 60.6
SeViLA [8] 61.3 61.5 75.6 63.6
MC-ViT-L [30] - - - 65.0
InternVideo [2] 43.4 48.0 65.1 49.1
VFC [31] 45.4 51.6 64.1 51.5
ViperGPT [32] - - - 60.0
BLIP-2 (concat) [8] 59.7 60.8 73.8 62.4
BLIP-2 (voting) [8] 59.1 61.3 74.9 62.7
ProViQ [33] - - - 63.8
LLoVi [9] 69.5 61.0 75.6 67.7
TraveLER (ours) 70.0 (+0.5) 60.5 78.2 (+2.6) 68.2 (+0.5)
Table: ZS results on Perception Test.

4 Experiments and Results

We evaluated our TraveLER framework on several benchmarks described in Section 4.1, and compared it to multiple baselines in Section 4.2. The results and ablations are in Section 4.3 and Section 4.4. Additional results and ablations are in the Supplementary in Section 6.

4.1 Datasets

We use the following datasets: (1) NExT-QA [34] is a dataset that tests causal action reasoning and temporal understanding, with questions categorized into a Temporal, Causal, or Descriptive type. NExT-QA requires going beyond simple recognition of objects and actions to answer the questions correctly. Following prior work, we evaluate our method on the 5,000 questions in the NExT-QA validation set. (2) Perception Test [35] is a dataset that focuses on skills such as memory, abstraction, physics, and semantics, and is intended to be approached in a few-shot or zero-shot manner. The dataset consists of 11.6K real-world videos and 38K multiple-choice questions. (3) STAR [36] is a dataset that tests reasoning in real-world video situations. It consists of 22K video clips with 60K situated reasoning questions. Questions are broadly divided into 4 main categories: interaction, sequence, prediction, and feasibility.

4.2 Baselines

In our experiments, we compare our method to recent state-of-the-art zero-shot (ZS) methods, such as LLoVi [9] and ProViQ [33], as well as methods that are not necessarily ZS, such as SeViLA [8] and MC-ViT [30]. We note that SeViLA uses components fine-tuned on QV-Highlights [37] in its model, while the MC-ViT model is fine-tuned on NExT-QA for Perception Test.

4.3 Results

Our results are shown in Table 1 for NExT-QA and in the accompanying STAR and Perception Test tables. First, we see that on NExT-QA, our method outperforms LLoVi using GPT-4, which is the current state of the art and uniformly captions frames across the entire video. Interestingly, our method demonstrates superior performance in comparison to LLoVi despite viewing 50% fewer frames on average, highlighting the effectiveness of our approach. Second, we also outperform SeViLA by +4.6%, although SeViLA uses a keyframe selector that is fine-tuned on a video moment retrieval/grounding task. Lastly, we see comparable performance to BLIP-2 (voting) (60.5% vs. 61.3%) on the Temporal split, even though BLIP-2 (voting) views every single frame.

For Perception Test and STAR, we use GPT-3.5 because it is much cheaper than GPT-4, but results would likely improve further with GPT-4. Nevertheless, on Perception Test we achieve higher accuracy than LongViViT by +4.5% and MC-ViT by +2.1%, although the latter was fine-tuned on NExT-QA. On STAR, we surpass both the best zero-shot approach by +2.7% and the best fine-tuned result by +0.3%.

Finally, please refer to Section 9 for visualizations. For example, in Figure 7, we see that question-answering is able to extract more relevant details in comparison to simple captioning. This may explain why our method significantly outperforms prior methods on the Descriptive split of NExT-QA. For Perception Test, the questions require a high-level understanding of the video and fine-grained details. In Figure 6, we see that our method is able to reason about vague references and correctly identify relevant objects through question-answering.

Table 2: Results of different ablations on our framework. We report (a) replacing different LLMs and LMMs, (b) selecting different numbers of frames to view in the Retriever, and (c) changing the number of questions asked in the Extractor.

(a) LLM LMM Accuracy
GPT-3.5 LLaVA-1.6 60.4
GPT-4 LLaVA-1.6 65.8
GPT-3.5 BLIP-2 52.7
GPT-3.5 GPT-4V 59.5
GPT-3.5 LLaVA-1.6 60.4

(b) # Frames Accuracy
1 59.0
3 57.9
5 60.4
7 59.0

(c) # Questions Accuracy
0 58.1
1 58.4
3 60.4
5 59.6

4.4 Ablations

We perform comprehensive ablations using 1000 randomly selected questions from the NExT-QA training set to understand the impact of each component. Unless otherwise specified, we use GPT-3.5 as the LLM and LLaVA-1.6 as the LMM for all agents.

In our method, we use LLMs and LMMs as the main agents. To understand how the choice of the LLM and LMM affects our framework’s performance, we swap different LLMs and LMMs into our framework (see Table 2a). First, we examine the performance of our framework using different LLMs while fixing the LMM. For this, we use LLaVA-1.6 as our LMM, and we find that GPT-4 performs the best in our framework by a significant margin of +5.4% compared to GPT-3.5. We hypothesize that stronger LLMs perform better in our framework because the LLM components are required to perform advanced reasoning tasks. Second, we measure the performance of our framework using different LMMs for \(\text{LMM}_{\text{extractor}}\) while fixing the LLM to be GPT-3.5. We find that LLaVA-1.6 performs best, GPT-4V is slightly worse (-0.9%), and BLIP-2 is significantly worse (-7.7%).

The Planner module outputs a plan, a list of instructions that guides the behavior of all other modules. To evaluate the effectiveness of this module, we remove it from our framework. We find that removing the Planner leads to a drop in accuracy of -2.3%. This may be due to several reasons. First, the Planner provides many temporal cues that guide the Retriever module’s search, such as “go to the middle of the video”, and without these cues the Retriever is not as good at selecting the next timestamp. Moreover, the Planner also helps the Evaluator decide when to stop, since in our iterative approach the Evaluator checks whether the plan has been fulfilled.

The Retriever module determines the next timestamps to view, which helps focus our traversal and information collection. To evaluate the effectiveness of the Retriever, we uniformly sample frames from the video at 2-second intervals, similar to other methods, such as LLoVi [9]. We find that this leads to a reduction in overall accuracy by -3.5%. Our results suggest that compared to uniform sampling, the Retriever allows us to capture frames that might have otherwise been skipped. This also indicates that our Retriever selects fewer unimportant frames that might mislead the model.

Question asking is a key component of our model as it allows us to capture more fine-grained and question-relevant information in comparison to simple caption generation, which produces a generic description. As such, we ablate the Extractor by only allowing the LMM to caption frames. We find that this decreases performance by -2.2%, suggesting that the ability to ask specific questions about a frame is important. We notice that many generated captions capture the main idea of visual information in the frame, but often miss fine-grained details that are useful for answering the question.

To evaluate the impact of window size, which is the number of frames the Retriever extracts centered around the selected frame, we experiment with multiple-frame retrieval. This allows us to better capture actions that occur quickly or require more context to understand. As shown in Table 2b, we find that choosing 5 frames yields the best results and a +1.4% increase when compared to selecting a single frame, but viewing more than 5 decreases performance. This suggests that the ability to retrieve multiple frames is beneficial in allowing the model to better capture relevant information, but retrieving too many frames can lead to too much information, resulting in performance drops.

Question answering allows us to extract more specific details from our visual inputs. However, we noticed that too many questions can yield irrelevant queries and false positives. As such, we experiment with modifying the number of questions asked for each frame by our Extractor (see Table 2c). We record results for a 5-question, 3-question, and 1-question maximum. Note that asking 0 questions is equivalent to only allowing captions, which is discussed in the Extractor ablation. From our results, we find that a 3-question limit yields the best results compared to asking 1 or 5 questions (+2.0%/+0.8%). This suggests that asking questions helps in extracting relevant information, but too many questions can lead to false positives or too much irrelevant information.

Our memory bank \(M\) is a critical part of our framework since it stores information that all modules rely on to evaluate inputs and make decisions. First, we experiment with different initializations as \(M\) must be initialized in the first iteration. We experiment with initialization of 1, 3, and 5 uniformly sampled captions. We find that using 5 evenly spaced frames yields the best results, possibly because it provides the model with a general overview of the video at the start before starting to collect more relevant information. Next, we experiment with changing the format of the memory bank, as there are many different ways to represent information. We try a markdown table instead of our dictionary format. Our results show that this performs worse by -2.9%, suggesting the effectiveness of our original dictionary formatting.

When collecting large amounts of information from videos, we use a Summarizer to condense the information, since long inputs can be challenging for LLMs. This has also been observed in recent work  [9]. To understand the impact of the Summarizer, we remove it. The results indicate that removing it degrades performance by -3.2%, demonstrating the advantage of more concise information.

5 Discussion

Our TraveLER framework has demonstrated significant potential in utilizing an LMM image-based approach for VideoQA. We introduce a multi-LMM agent framework that travels along the video, collecting relevant information from keyframes through interactive question-asking. Our method creates a plan to “traverse” through the video, asks questions about individual frames to “locate” and store key information, and then “evaluates” if there is enough information to answer the question. Finally, if there is not enough information, our model is able to “replan” according to its collected knowledge. However, there are a few limitations to our work. First, our framework depends on the strength of the LLM and LMM. We notice that false positives and incorrect statements from the LMM can impact performance. We also found that our method has a high runtime with slower LLMs, since each iteration requires our LLM to generate significant amounts of text. We believe that with better and faster LLMs and LMMs in the future, these issues can be overcome. Finally, we hope our research encourages future work on using large models for modular video approaches.

Acknowledgements

We would like to thank Suzie Petryk, Chancharik Mitra, Alon Mendelson, and David Chan for helpful feedback and discussions. This project was supported in part by DoD, including PTG and/or LwLL programs, as well as BAIR’s industrial alliance programs.

Supplementary Material for “TraveLER”

Here we provide additional information about our experimental results, qualitative examples, implementation details, and datasets. Specifically, Section 6 provides more experiment results, Section 7 provides additional implementation details, Section 8 provides additional related work, and Section 9 provides qualitative visualizations to illustrate our approach.

6 Additional Experiment Results

We begin by presenting several additional ablations in Section 6.1 that further demonstrate the benefits of our TraveLER approach. We also present additional results in Section 6.2.

6.1 Additional Ablations

In what follows, we provide additional ablations that further illustrate the benefits of TraveLER. For all ablations, we compare the ablated experiment with the corresponding best-performing TraveLER results on a random sample of 1000 examples from the training set of the NExT-QA dataset. We use GPT-3.5 as the LLM and LLaVA-1.6 as the LMM.

In order for the Planner to create effective plans, it is beneficial to initialize the memory bank properly. Memory initialization allows the Planner to have a high-level overview of the video and create a corresponding plan for how to traverse the video given the initial frames. We perform three different initializations with 1, 3, and 5 frames and display our results in Figure 3. We observe that initializing the memory bank with 5 frames (0, 0.25, 0.5, 0.75, 1 / beginning, quarter, middle, three-quarter, end) yields the best result. In contrast, we notice a decrease in accuracy of -0.6% when initializing with 3 frames (0, 0.5, 1 / beginning, middle, end) and a decrease of -2.5% when initializing with 1 frame (0.5 / middle).

Figure 3: Results of initializing the memory bank with 1, 3, and 5 frames.

The LMM in our framework is crucial because it allows us to capture more relevant and question-specific details from visual input. However, if the LMM’s responses are too long, the memory bank will become too large, whereas if the LMM’s responses are too short, insufficient information will be captured. Thus, we conduct an experiment to determine the optimal LMM response length. We find that limiting the LMM response to 150 tokens yields the best performance, while accuracy decreases by -2.2% and -1.7% if the response is limited to 75 tokens and 300 tokens, respectively. This supports the fact that there is a tradeoff between not collecting enough information with short response lengths and collecting too much information as the LMM response size increases.

In each module of our framework, we use a task prompt to provide instructions to our agents (LLMs or LMMs). The construction of these prompts plays a large role in how instructions are executed. Currently, we use the question \(Q\) as input to all prompts \(P_\mathrm{T}^{(1)}, P_\mathrm{R}^{(1)}, P_\mathrm{E}^{(1)}, P_\mathrm{V}^{(1)}\). However, we use the choices \(C\) as input only for the Planner and Evaluator prompts, since the Planner needs the choices to tailor its plan and the Evaluator needs the choices to answer the question. We experiment with adding the choices \(C\) to the Retriever and Extractor prompts, and find that this degrades performance by -1.6%. This may be because the incorrect choices mislead the Retriever into searching for non-existent events or the Extractor into asking irrelevant questions.

6.2 Additional Results

We compare our method with other keyframe localization methods to further demonstrate the benefits of our TraveLER approach.

Table 3: Comparison of TraveLER with other few-shot and zero-shot keyframe localization methods. For fair comparisons, we gray out methods with fine-tuned components in their model. The best scores are in bold.
Method NExT-QA (Random Subset)
Tem. Cau. Des. Avg.
SeViLA - Localizer [8] 48.8 61.2 68.3 58.2
Moment-DETR [37] 45.3 55.8 70.8 54.6
SigLIP [38] 48.4 61.5 73.8 59.1
TraveLER - Planner & Retriever (ours) 50.9 62.7 72.4 60.3

In Table 3, we compare our Planner and Retriever with other keyframe localization methods by replacing our Planner and Retriever with each of the other methods, and use our Extractor and Evaluator to perform the question answering. For all these methods, we use GPT-3.5 and LLaVA-1.6, and we evaluate these methods on a random subset of 1000 examples from the training set of NExT-QA. Note that other methods find keyframes in one inference iteration, whereas our inference occurs over multiple iterations. Therefore, to ensure fair comparisons, we uniformly sample 32 frames and extract 4 keyframes with the other methods, and we run 4 iterations of TraveLER using the Retriever with a window size of 2 to similarly find 4 keyframes among fewer than 32 viewed frames.

We find that our Planner and Retriever surpass other keyframe localization methods, despite considering fewer total frames (\(\sim\) 25 total frames: 5 for memory initialization, and up to 5 frames each iteration). We would like to highlight that while our method is effective at finding keyframes, we do not need to find all keyframes to answer a question. Instead, we are often able to choose the correct answer with only a subset of the keyframes.

7 Additional Implementation Details

To run our models on larger benchmarks, we use 8 NVIDIA RTX 6000 GPUs and split the dataset across multiple processes. In addition, we use the SGLang [39] package, which provides a variety of performance optimizations for our LMMs and enables us to perform batched inference for models that do not natively support doing so. We serve our LMM on a single GPU and implement a queue that is shared across all runs. This allows individual runs to asynchronously call the LMM using an API request instead of creating a new instance of the LMM for each run. Typically, we have 4-5 processes sharing the same LMM at a time.
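The serving setup can be pictured roughly as below. The endpoint URL and payload schema are hypothetical placeholders rather than the actual SGLang API; the point is only that several runs issue asynchronous HTTP requests to one shared LMM instance.

```python
import base64
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

import requests

LMM_URL = "http://localhost:30000/generate"   # hypothetical address of the served LMM

def query_lmm(image_path: str, prompt: str, max_tokens: int = 150) -> str:
    """Send one frame and prompt to the shared LMM server and return its answer."""
    with open(image_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode()
    payload = {"image": image_b64, "prompt": prompt, "max_new_tokens": max_tokens}
    return requests.post(LMM_URL, json=payload, timeout=120).json()["text"]

def query_lmm_batch(jobs: List[Tuple[str, str]]) -> List[str]:
    """Several runs share the server; a small pool keeps ~4-5 requests in flight."""
    with ThreadPoolExecutor(max_workers=5) as pool:
        return list(pool.map(lambda job: query_lmm(*job), jobs))
```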

Next, for our LLMs we use all default parameters. For our LMMs, we experiment with modifying the maximum output token length as an ablation, but use all default parameters otherwise.

7.1 Prompts

Our prompts are below. The black text is the base prompt template, and we replace the blue text with the corresponding information from the relevant video. The generated outputs are in the orange text.

Planner prompt \(P_T\):

User:
Create the best plan to gather information to answer the question.

QUESTION: QUESTION CHOICES: CHOICES

You are provided information collected from individual frames of the video and is represented as a dictionary keyed by the timestamps of the frames.
INFORMATION: INFO

You are also given an explanation for why you aren’t able to definitively answer the question with the current information.
EXPLANATION: EXPLANATION

Follow these rules:
1. You only have access to individual frames of the video, with no audio. You can go to a certain timestamp, search for actions or settings, and describe or ask questions about individual frames.
2. Make sure that you have viewed the relevant frames.

Make your plan as simple and straightforward as possible, and no longer than 5 steps long. Return your plan as a numbered list, after PLAN. Do not include any other response or explanation. Let’s think step-by-step.


Assistant: Output: PLAN

Retriever prompt \(P_R\):

User:
You are given the following information about a LENGTH second video, with information from individual frames at different timestamps.
INFORMATION: INFO

PLAN: PLAN

Currently, you are viewing second CURR. Choose the timestamp, in seconds, of the next frame to view. When choosing the next frame to view, remember that you are trying to collect information to answer this multiple choice question: QUESTION

Think of what information you need, and consider what information you already have. Use the temporal nature of the video and your past information to choose the next frame. Do not choose a frame you already have information about, and make sure that the frame you choose is at least WINDOW SIZE seconds apart from the second you are currently viewing.

Return your answer as a single Python float representing the second you want to view. Don’t provide any other response or explanation.


Assistant: Output: TIMESTAMP

Extractor prompt \(P_E\):

User:
You are given the following information about a LENGTH second video, with information from individual frames at different timestamps.
INFORMATION: INFO

Currently, you are viewing second CURRENT TIMESTAMP, which has the caption: FRAME CAPTION

Form up to 3 questions about this frame to best help answer the multiple-choice question: QUESTION.

Follow these rules:
1. Use the given information to decide what further visual information you need to answer the question.
2. Since you are asking questions about a single frame, you cannot ask about other frames, reference past or future events, or ask about specific timestamps.

Return your questions as a Python list of strings (in double quotes) and don’t include any numbered lists, backticks, or language hints. Follow Python syntax. Make sure you have followed the steps. Don’t provide any other response or explanation.


Assistant: Output: QUESTIONS

Evaluator prompt \(P_V\):

User:
Evaluate if there is enough information to answer a multiple-choice question about a video and if the plan has been completed.

If there is enough information to choose the correct answer with complete certainty and the plan has been followed, return the index of the choice after a brief explanation. Otherwise, return None after a brief explanation of why you can’t narrow down to a single answer choice. Be strict and don’t guess.

INFORMATION: INFO
PLAN: PLAN
QUESTION: QUESTION
CHOICES: CHOICES

Give a brief explanation. Then, include your final answer after the words "Final Answer:" in your response at the end. Do not include anything other than the answer as an integer or None after "Final Answer:".

Let’s think step by step.


Assistant: Output: ANSWER
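To illustrate how the colored placeholders above are filled in practice, a sketch using Python's `string.Template` is shown below with a truncated fragment of the Retriever prompt; the exact substitution mechanism is an implementation detail.

```python
from string import Template

# Truncated fragment of the Retriever prompt P_R; the full text is listed above.
RETRIEVER_TEMPLATE = Template(
    "You are given the following information about a $LENGTH second video, "
    "with information from individual frames at different timestamps.\n"
    "INFORMATION: $INFO\n\nPLAN: $PLAN\n\n"
    "Currently, you are viewing second $CURR. ..."
)

prompt = RETRIEVER_TEMPLATE.substitute(
    LENGTH="44",
    INFO=str({0.0: ["A boy sits at the top of a slide."]}),
    PLAN="1. Go to the middle of the video.",
    CURR="22.0",
)
```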

7.2 NExT-QA

NExT-QA is a challenging dataset that tests causal action reasoning and temporal understanding. It contains 5,440 videos with an average length of 44s. Compared to earlier VideoQA benchmarks [40]–[43], NExT-QA requires going beyond simple recognition of objects and actions to answer the questions correctly. Each question requires selecting the best option out of 5 choices, often with very similar degrees of plausibility. Additionally, each question is categorized into either a Temporal, Causal, or Descriptive type. Temporal questions often ask what happens during, before, or after an event or action, while causal questions require advanced reasoning and inference about why an event or action occurs. Following prior work, we evaluate our method on the 5,000 questions in the NExT-QA validation set, which are drawn from 500 different videos.

For NExT-QA, we use LLaVA-1.6 (Vicuna 13B) as the LMM and GPT-4 (gpt-4-1106-preview) as the LLM. We used the longer, comprehensive prompts, with no answer choices included in the Extractor prompt. We also initialize the memory bank with 5 frames, and use the multi-frame Retriever with 5 frames.

7.3 Perception Test

In comparison with earlier VideoQA datasets [42]–[44] that focus on computational tasks such as classification, detection, or tracking, Perception Test is a dataset that focuses on skills such as memory, abstraction, physics, and semantics. Moreover, it is designed to test the transfer capabilities of different models and is intended to be approached in a few-shot or zero-shot manner. The dataset consists of 11.6K real-world videos with an average length of 23 seconds, and 38K multiple-choice questions.

For Perception Test, we use LLaVA-1.6 (Vicuna 13B) as the LMM and GPT-3.5 (gpt-3.5-turbo-0125) as the LLM. We used shorter, simplified prompts, with no answer choices included in the Extractor prompt. We also initialize the memory bank with 5 frames, and use the multi-frame Retriever with 5 frames.

7.4 STAR

STAR is a dataset that tests reasoning in real-world video situations. It consists of 22K video clips with 60K situated reasoning questions, each with 4 possible choices. Questions are broadly divided into 4 main categories: interaction, sequence, prediction, and feasibility.

For STAR, we use LLaVA-1.6 (Vicuna 13B) as the LMM and GPT-3.5 (gpt-3.5-turbo-0125) as the LLM. We used shorter, simplified prompts, with no answer choices included in the Extractor prompt. We also initialize the memory bank with 5 frames, and use the multi-frame Retriever with 3 frames, since the videos are shorter.

8 Additional Related Work

There has been a long history of work [45]–[58] that attempts to combine deep neural networks with modularity. Recently, works like VisProg [59], CodeVQA [60], RVP [61], and ViperGPT [32] have leveraged the improved coding capabilities of large models to generate code that composes different submodules to answer visual questions. In addition, ProViQ [33] extends ViperGPT’s work to the video domain by adding more modules for VideoQA. Similarly, we leverage the strong capabilities of LMMs in a modular approach. However, while these approaches have shown promising results, they are limited to single-shot planning when generating code, resulting in a fixed plan that cannot adapt. In contrast to these works, our approach has the advantage of being able to iteratively replan based on newly collected information.

9 Qualitative Visualizations

We present further qualitative success and failure cases of our TraveLER framework. For each dataset, we display qualitative visualizations for 2 success and 2 failure cases. For the success cases, we show expanded examples in Figures 4, 5, and 6, and abridged versions in Figure 7 that demonstrate the benefits of our question-answering approach compared to regular captioning. For failures, we also present abridged versions for each dataset in Figures 8, 9, and 10. Finally, we present some additional success and failure cases in Figure 11. For the abridged versions, we display the video on top, and the traversal order using the numbered orange circles. In the row beneath, we display the frames in the order they are selected, and show the corresponding Extractor question-answer output in yellow and captions in gray.

Figure 4: NExT-QA Success Predictions. We can see that our framework can adapt to new information collected in past iterations. For example, in Iteration 3, our Planner module is able to use information about the wooden post from a previous iteration and ask further questions to identify the correct answer.

Figure 5: STAR Success Predictions. Here, we can see that our method does not require viewing frames sequentially. For example, we view the beginning of the video in Iteration 1, the middle of the video in Iterations 2 and 3, and return to the beginning in Iteration 4. Moreover, our method can collect information and double-check ambiguous information across different timestamps. For example, in Iteration 2, we are told the man is taking a blanket, and then we can view a different frame to confirm that he is indeed holding a towel in Iteration 3.

Figure 6: Perception Test Success Predictions. We display some success cases for the challenging Perception Test dataset. Here, our method is able to infer which objects the question refers to through our question-asking approach, even though the question does not explicitly describe them.

Figure 7: Comparison with Captioning Approaches. For each example, we display the videos on top, and the traversal order using the numbered orange circles. In the rows beneath, we display the frames in the order they are selected, and display corresponding Extractor question-answer output in yellow and captions in gray. We display compressed versions of GPT-4V-generated captions for a visual comparison. By asking specific questions, we can extract more detailed and relevant information than a general description.

Figure 8: NExT-QA Failure Predictions. Here, we display some failure cases for NExT-QA. Like before, we display the video on top, and the traversal order using the numbered orange circles. In the row beneath, we display the frames in the order they are selected, and show the corresponding Extractor question-answer output in yellow and captions in gray. We can see that conflicting information or false positives can mislead our approach. We also observe that counting can be a challenge for certain LMMs, but this can be mitigated in the future by swapping in stronger LMMs.

Figure 9: STAR Failure Cases. Here, we display some failure cases for the STAR dataset, using the same abridged representation described previously. We see that a limitation of a frame-wise approach is that it may be difficult to capture actions that depend heavily on temporal context. For example, in Example 1, it is difficult to tell whether the woman is opening or closing the window.

Figure 10: Perception Test Failure Cases. We display some qualitative visualizations for Perception Test failure cases using the abridged representation discussed previously. We see that for some cases where objects of interest are occluded or not in the frame, our method might have difficulties extracting useful information.

Figure 11: An additional visualization of predictions. We show more qualitative visualizations of our method on NExT-QA and Perception Test using our abridged representation, with successes on top and failures on the bottom. We compare our generated question-answer pairs for each frame (in yellow) with captions (labeled C in gray) generated from the same frame. We see that our method is able to extract more fine-grained and relevant information compared to simple captioning.

10 Licenses and Privacy

The license, PII, and consent details of each dataset are in the respective papers. In addition, we wish to emphasize that the datasets we use do not contain any harmful or offensive content, as many other papers in the field also use them. Thus, we do not anticipate a specific negative impact, but, as with any Machine Learning method, we recommend exercising caution.

References

[1]
Tsu-Jui Fu, Linjie Li, Zhe Gan, Kevin Lin, William Yang Wang, Lijuan Wang, and Zicheng Liu. Violet : End-to-end video-language transformers with masked visual-token modeling, 2022.
[2]
Yi Wang, Kunchang Li, Yizhuo Li, Yinan He, Bingkun Huang, Zhiyu Zhao, Hongjie Zhang, Jilan Xu, Yi Liu, Zun Wang, Sen Xing, Guo Chen, Junting Pan, Jiashuo Yu, Yali Wang, Limin Wang, and Yu Qiao. Internvideo: General video foundation models via generative and discriminative learning, 2022.
[3]
Bin Lin, Yang Ye, Bin Zhu, Jiaxi Cui, Munan Ning, Peng Jin, and Li Yuan. Video-llava: Learning united visual representation by alignment before projection, 2023.
[4]
Yuchong Sun, Hongwei Xue, Ruihua Song, Bei Liu, Huan Yang, and Jianlong Fu. Long-form video-language pre-training with multimodal temporal contrastive learning, 2023.
[5]
Qinghao Ye, Guohai Xu, Ming Yan, Haiyang Xu, Qi Qian, Ji Zhang, and Fei Huang. Hitea: Hierarchical temporal-aware video-language pre-training, 2022.
[6]
Linjie Li, Zhe Gan, Kevin Lin, Chung-Ching Lin, Zicheng Liu, Ce Liu, and Lijuan Wang. Lavender: Unifying video-language understanding as masked language modeling, 2022.
[7]
Hongwei Xue, Yuchong Sun, Bei Liu, Jianlong Fu, Ruihua Song, Houqiang Li, and Jiebo Luo. Clip-vip: Adapting pre-trained image-text model to video-language representation alignment, 2023.
[8]
Shoubin Yu, Jaemin Cho, Prateek Yadav, and Mohit Bansal. Self-chained image-language model for video localization and question answering, 2023.
[9]
Ce Zhang, Taixi Lu, Md Mohaiminul Islam, Ziyang Wang, Shoubin Yu, Mohit Bansal, and Gedas Bertasius. A simple llm framework for long-range video question-answering, 2023.
[10]
Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models, 2022.
[11]
Yue Zhao, Ishan Misra, Philipp Krähenbühl, and Rohit Girdhar. Learning video representations from large language models. In arXiv preprint arXiv:2212.04501, 2022.
[12]
Antoine Yang, Arsha Nagrani, Paul Hongsuck Seo, Antoine Miech, Jordi Pont-Tuset, Ivan Laptev, Josef Sivic, and Cordelia Schmid. Vid2seq: Large-scale pretraining of a visual language model for dense video captioning, 2023.
[13]
Sihan Chen, Handong Li, Qunbo Wang, Zijia Zhao, Mingzhen Sun, Xinxin Zhu, and Jing Liu. Vast: A vision-audio-subtitle-text omni-modality foundation model and dataset, 2023.
[14]
Hang Zhang, Xin Li, and Lidong Bing. Video-llama: An instruction-tuned audio-visual language model for video understanding, 2023.
[15]
Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Shahbaz Khan. Video-chatgpt: Towards detailed video understanding via large vision and language models, 2023.
[16]
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katie Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob Menick, Sebastian Borgeaud, Andrew Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binkowski, Ricardo Barreira, Oriol Vinyals, Andrew Zisserman, and Karen Simonyan. Flamingo: a visual language model for few-shot learning, 2022.
[17]
Zhenhailong Wang, Manling Li, Ruochen Xu, Luowei Zhou, Jie Lei, Xudong Lin, Shuohang Wang, Ziyi Yang, Chenguang Zhu, Derek Hoiem, et al. Language models with image descriptors are strong few-shot video-language learners. arXiv preprint arXiv:2205.10747, 2022.
[18]
Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, and Cordelia Schmid. Zero-shot video question answering via frozen bidirectional language models, 2022.
[19]
Wenhao Wu, Dongliang He, Xiao Tan, Shifeng Chen, and Shilei Wen. Multi-agent reinforcement learning based frame sampling for effective untrimmed video recognition, 2019.
[20]
Jie Lei, Linjie Li, Luowei Zhou, Zhe Gan, Tamara L. Berg, Mohit Bansal, and Jingjing Liu. Less is more: Clipbert for video-and-language learning via sparse sampling, 2021.
[21]
Haoyu Lu, Mingyu Ding, Nanyi Fei, Yuqi Huo, and Zhiwu Lu. Lgdn: Language-guided denoising network for video-language modeling, 2022.
[22]
Shyamal Buch, Cristóbal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, and Juan Carlos Niebles. Revisiting the "video" in video-language understanding, 2022.
[23]
Tianwen Qian, Ran Cui, Jingjing Chen, Pai Peng, Xiaowei Guo, and Yu-Gang Jiang. Locate before answering: Answer guided question localization for video question answering, 2023.
[24]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. Language models are few-shot learners, 2020.
[25]
Hyung Won Chung, Le Hou, Shayne Longpre, Barret Zoph, Yi Tay, William Fedus, Yunxuan Li, Xuezhi Wang, Mostafa Dehghani, Siddhartha Brahma, Albert Webson, Shixiang Shane Gu, Zhuyun Dai, Mirac Suzgun, Xinyun Chen, Aakanksha Chowdhery, Alex Castro-Ros, Marie Pellat, Kevin Robinson, Dasha Valter, Sharan Narang, Gaurav Mishra, Adams Yu, Vincent Zhao, Yanping Huang, Andrew Dai, Hongkun Yu, Slav Petrov, Ed H. Chi, Jeff Dean, Jacob Devlin, Adam Roberts, Denny Zhou, Quoc V. Le, and Jason Wei. Scaling instruction-finetuned language models, 2022.
[26]
Jun Chen, Deyao Zhu, Kilichbek Haydarov, Xiang Li, and Mohamed Elhoseiny. Video chatcaptioner: Towards enriched spatiotemporal descriptions, 2023.
[27]
Kevin Lin, Faisal Ahmed, Linjie Li, Chung-Ching Lin, Ehsan Azarnasab, Zhengyuan Yang, Jianfeng Wang, Lin Liang, Zicheng Liu, Yumao Lu, Ce Liu, and Lijuan Wang. Mm-vid: Advancing video understanding with gpt-4v(ision), 2023.
[28]
Chaoyi Zhang, Kevin Lin, Zhengyuan Yang, Jianfeng Wang, Linjie Li, Chung-Ching Lin, Zicheng Liu, and Lijuan Wang. Mm-narrator: Narrating long-form videos with multimodal in-context learning, 2023.
[29]
Andy Zeng, Maria Attarian, Brian Ichter, Krzysztof Choromanski, Adrian Wong, Stefan Welker, Federico Tombari, Aveek Purohit, Michael Ryoo, Vikas Sindhwani, Johnny Lee, Vincent Vanhoucke, and Pete Florence. Socratic models: Composing zero-shot multimodal reasoning with language, 2022.
[30]
Ivana Balažević, Yuge Shi, Pinelopi Papalampidi, Rahma Chaabouni, Skanda Koppula, and Olivier J. Hénaff. Memory consolidation enables long-context video understanding, 2024.
[31]
Liliane Momeni, Mathilde Caron, Arsha Nagrani, Andrew Zisserman, and Cordelia Schmid. Verbs in action: Improving verb understanding in video-language models, 2023.
[32]
Dídac Surís, Sachit Menon, and Carl Vondrick. Vipergpt: Visual inference via python execution for reasoning. Proceedings of IEEE International Conference on Computer Vision (ICCV), 2023.
[33]
Rohan Choudhury, Koichiro Niinuma, Kris M. Kitani, and László A. Jeni. Zero-shot video question answering with procedural programs, 2023.
[34]
Junbin Xiao, Xindi Shang, Angela Yao, and Tat-Seng Chua. Next-qa:next phase of question-answering to explaining temporal actions, 2021.
[35]
Viorica Pătrăucean, Lucas Smaira, Ankush Gupta, Adrià Recasens Continente, Larisa Markeeva, Dylan Banarse, Skanda Koppula, Joseph Heyward, Mateusz Malinowski, Yi Yang, Carl Doersch, Tatiana Matejovicova, Yury Sulsky, Antoine Miech, Alex Frechette, Hanna Klimczak, Raphael Koster, Junlin Zhang, Stephanie Winkler, Yusuf Aytar, Simon Osindero, Dima Damen, Andrew Zisserman, and João Carreira. Perception test: A diagnostic benchmark for multimodal video models, 2023.
[36]
Bo Wu, Shoubin Yu, Zhenfang Chen, Joshua B Tenenbaum, and Chuang Gan. Star: A benchmark for situated reasoning in real-world videos. In Thirty-fifth Conference on Neural Information Processing Systems (NeurIPS), 2021.
[37]
Jie Lei, Tamara L. Berg, and Mohit Bansal. Qvhighlights: Detecting moments and highlights in videos via natural language queries, 2021.
[38]
Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training, 2023.
[39]
Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Jeff Huang, Chuyue Sun, Cody Hao Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E. Gonzalez, Clark Barrett, and Ying Sheng. Efficiently programming large language models using sglang, 2023.
[40]
Pinelopi Papalampidi, Skanda Koppula, Shreya Pathak, Justin Chiu, Joe Heyward, Viorica Patraucean, Jiajun Shen, Antoine Miech, Andrew Zisserman, and Aida Nematzdeh. A simple recipe for contrastively pre-training video-first encoders beyond 16 frames, 2023.
[41]
Hongyang Xue, Zhou Zhao, and Deng Cai. Unifying the video and question attentions for open-ended video question answering. IEEE Transactions on Image Processing, 26(12):5656–5666, 2017.
[42]
Kuo-Hao Zeng, Tseng-Hung Chen, Ching-Yao Chuang, Yuan-Hong Liao, Juan Carlos Niebles, and Min Sun. Leveraging video descriptions to learn video question answering, 2016.
[43]
Linchao Zhu, Zhongwen Xu, Yi Yang, and Alexander G. Hauptmann. Uncovering temporal context for video question and answering, 2015.
[44]
Tegan Maharaj, Nicolas Ballas, Anna Rohrbach, Aaron Courville, and Christopher Pal. A dataset and exploration of models for understanding video data through fill-in-the-blank question-answering, 2017.
[45]
Jacob Andreas, Marcus Rohrbach, Trevor Darrell, and Dan Klein. Neural module networks, 2017.
[46]
Roei Herzig, Moshiko Raboh, Gal Chechik, Jonathan Berant, and Amir Globerson. Mapping images to scene graphs with permutation-invariant structured prediction. In Advances in Neural Information Processing Systems (NIPS), 2018.
[47]
Ranjay Krishna, Ines Chami, Michael S. Bernstein, and Li Fei-Fei. Referring relationships. European Conference on Computer Vision, 2018.
[48]
Fabien Baradel, Natalia Neverova, Christian Wolf, Julien Mille, and Greg Mori. Object level visual reasoning in videos. In European Conference on Computer Vision, pp. 105–121, 2018.
[49]
Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.
[50]
Roei Herzig, Amir Bar, Huijuan Xu, Gal Chechik, Trevor Darrell, and Amir Globerson. Learning canonical representations for scene graph to image generation. In European Conference on Computer Vision, 2020.
[51]
Roei Herzig, Elad Ben-Avraham, Karttikeya Mangalam, Amir Bar, Gal Chechik, Anna Rohrbach, Trevor Darrell, and Amir Globerson. Object-region video transformers. In Conference on Computer Vision and Pattern Recognition (CVPR), 2022.
[52]
Roei Herzig, Elad Levi, Huijuan Xu, Hang Gao, Eli Brosh, Xiaolong Wang, Amir Globerson, and Trevor Darrell. Spatio-temporal action graph networks. In Proceedings of the IEEE International Conference on Computer Vision Workshops, pp. 0–0, 2019.
[53]
Roei Herzig, Alon Mendelson, Leonid Karlinsky, Assaf Arbelle, Rogerio Feris, Trevor Darrell, and Amir Globerson. Incorporating structured representations into pretrained vision & language models using scene graphs. In The 2023 Conference on Empirical Methods in Natural Language Processing, 2023.
[54]
Amir Bar, Roei Herzig, Xiaolong Wang, Anna Rohrbach, Gal Chechik, Trevor Darrell, and A. Globerson. Compositional video synthesis with action graphs. In ICML, 2021.
[55]
Elad Ben Avraham, Roei Herzig, Karttikeya Mangalam, Amir Bar, Anna Rohrbach, Leonid Karlinsky, Trevor Darrell, and Amir Globerson. Bringing image scene structure to video via frame-clip consistency of object tokens. In Thirty-Sixth Conference on Neural Information Processing Systems, 2022.
[56]
Achiya Jerbi, Roei Herzig, Jonathan Berant, Gal Chechik, and Amir Globerson. Learning object detection from captions via textual scene attributes. ArXiv, abs/2009.14558, 2020.
[57]
Roei Herzig, Ofir Abramovich, Elad Ben Avraham, Assaf Arbelle, Leonid Karlinsky, Ariel Shamir, Trevor Darrell, and Amir Globerson. Promptonomyvit: Multi-task prompt learning improves video transformers using synthetic scene data. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 6803–6815, January 2024.
[58]
Chancharik Mitra, Brandon Huang, Trevor Darrell, and Roei Herzig. Compositional chain-of-thought prompting for large multimodal models. ArXiv, abs/2311.17076, 2023.
[59]
Tanmay Gupta and Aniruddha Kembhavi. Visual programming: Compositional visual reasoning without training, 2022.
[60]
Sanjay Subramanian, Medhini Narasimhan, Kushal Khangaonkar, Kevin Yang, Arsha Nagrani, Cordelia Schmid, Andy Zeng, Trevor Darrell, and Dan Klein. Modular visual question answering via code generation, 2023.
[61]
Jiaxin Ge, Sanjay Subramanian, Baifeng Shi, Roei Herzig, and Trevor Darrell. Recursive visual programming, 2023.