January 10, 2025
Retrieval-Augmented Generation (RAG) is a powerful strategy for addressing the issue of generating factually incorrect outputs in foundation models by retrieving external knowledge relevant to queries and incorporating it into the generation process. However, existing RAG approaches have primarily focused on textual information, with some recent advancements beginning to consider images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing events, processes, and contextual details more effectively than any other modality. While a few recent studies explore the integration of videos in the response generation process, they either predefine query-associated videos without retrieving them according to queries, or convert videos into textual descriptions without harnessing their multimodal richness. To tackle these issues, we introduce VideoRAG, a novel framework that not only dynamically retrieves videos based on their relevance to queries but also utilizes both the visual and textual information of videos in output generation. Further, to operationalize this, our method builds on recent advances in Large Video Language Models (LVLMs), which enable the direct processing of video content to represent it for retrieval and the seamless integration of the retrieved videos jointly with queries. We experimentally validate the effectiveness of VideoRAG, showing that it is superior to relevant baselines.
Figure 1: A conceptual illustration of existing and the proposed RAG scenarios. (A) Textual RAG retrieves documents (relevant to queries) from a text corpus and incorporates them when generating answers. (B) Conventional multimodal RAG extends retrieval to include static images. (C) VideoRAG (ours) further extends the external knowledge source to videos.
Recently, large foundation models, such as Large Language Models (LLMs) and their extension to the vision modality, Large Vision-Language Models (VLMs), have become the standard choice for addressing diverse tasks due to their remarkable capabilities [1]–[3]. In particular, these models, trained on extensive textual and multimodal corpora, encode vast amounts of knowledge within their large-scale parameters. However, despite their success, they are still prone to generating factually incorrect outputs, known as hallucination, since their parametric knowledge can be inaccurate, incomplete, or outdated [4], [5]. This limitation highlights the need to incorporate knowledge from external sources, with Retrieval-Augmented Generation (RAG) emerging as an essential mitigation strategy. Specifically, RAG typically operates by retrieving query-relevant information from an external corpus and then generating answers grounded in the retrieved information [6], [7].
However, while existing RAG approaches have been widely adopted for various real-world applications, they have primarily focused on retrieving and incorporating textual content [5], [8], with only recent attempts beginning to explore images (or text-image pairs) as an additional source of external knowledge [9], [10]. We argue that there remains a rapidly expanding yet underutilized medium, video, which provides unparalleled multimodal richness and may be a compelling resource for augmenting the knowledge landscape of current RAG systems. Specifically, videos combine temporal dynamics, spatial details, and multimodal cues, which collectively enable them to capture complex processes, context-dependent interactions, and non-verbal signals that static modalities (such as text and images) often fail to convey. Moreover, given the increasing popularity of video-sharing platforms (such as YouTube and TikTok), the availability of diverse, high-quality video data has grown significantly, ranging from educational tutorials and scientific demonstrations to personal experiences and real-time events, all of which may be useful when formulating responses to user queries.
It is worth noting that a few very recent studies have started considering video content to handle user queries; however, they have notable limitations. For instance, some of them assume that the video relevant to a query is already known and instead focus on identifying query-relevant frames within that specified video [11], [12]. While this approach may work in scenarios where the relevant video is explicitly provided, it is suboptimal for more general use cases, where users expect the system to dynamically identify and retrieve relevant video content to provide the answer. Other studies handle videos by converting them into textual formats, such as subtitles or captions, and feed these textual representations into off-the-shelf text-based RAG pipelines [13], [14]. However, while this text-only strategy may offer a convenient workaround, it inherently sacrifices the multimodal richness of video data by discarding critical, and often essential, information, such as temporal dynamics and visual context, during the conversion process. For example, consider the query: “How does the expression of the dog change when it is angry?”. While textual transcriptions might describe the dog’s barking or growling, they fail to capture visual cues (baring teeth, raised hackles, or narrowed eyes), which are needed both for accurately interpreting the emotional state of the dog and for formulating the answer to the query.
In this work, to address the aforementioned limitations of existing studies, we introduce a novel framework, called VideoRAG, which offers a new angle on existing RAG frameworks by enabling the holistic retrieval and incorporation of video content (see Figure 1 for its illustration and comparison to existing work). Specifically, in response to queries, the proposed VideoRAG not only retrieves relevant videos from a large video corpus but also integrates both the visual and textual elements of videos into the answer-generation process. Furthermore, we operationalize this by harnessing the advanced capabilities of recent Large Video Language Models (LVLMs), which can directly process video content, consisting of visual and textual information, within a unified framework, thereby more effectively capturing its multimodal richness. However, while we observe that the joint utilization of visual and textual features is needed for the effective representation of videos and, subsequently, their retrieval, the textual descriptions of videos (e.g., subtitles) are oftentimes not available. To address this, we further present a simple yet effective mitigation strategy that utilizes automatic speech recognition to generate textual transcripts from videos, which allows us to leverage both visual and textual modalities even when explicit textual annotations are absent.
To validate the effectiveness of our VideoRAG framework, we conduct experiments by leveraging the overlapping queries from the WikiHowQA dataset [15] (consisting of query-answer pairs) with the queries from the HowTo100M dataset [16] (including query-video pairs without answers). Then, the experimental results show the significant performance improvement of the proposed VideoRAG framework over relevant RAG baselines, demonstrating the efficacy of leveraging videos for RAG.
In this section, we present VideoRAG, which retrieves query-relevant videos from a video corpus and generates responses grounded in them.
We begin with preliminaries, providing the descriptions of Retrieval-Augmented Generation (RAG) and then Large Video Language Models (LVLMs).
RAG aims to enhance the capabilities of foundation models by grounding their outputs in knowledge retrieved from an external source, such as Wikipedia, and consists of two main components: a retrieval module and a generation module. Formally, given a query \({\boldsymbol{q}}\), RAG retrieves a set of documents (or knowledge elements) \(\mathcal{K} = \{ {\boldsymbol{k}}_1, {\boldsymbol{k}}_2, \dots, {\boldsymbol{k}}_k \}\) from an external corpus \(\mathcal{C}\) (\(\mathcal{K} \subseteq \mathcal{C}\)) based on their relevance to \({\boldsymbol{q}}\) using the retrieval module, formalized as follows: \(\mathcal{K} = \texttt{Retriever}({\boldsymbol{q}}, \mathcal{C})\). Here, the query \({\boldsymbol{q}}\) and each knowledge element \({\boldsymbol{k}}\) are represented as sequences of tokens, \({\boldsymbol{q}}= [q_1, q_2, \dots, q_i]\) and \({\boldsymbol{k}}= [k_1, k_2, \dots, k_j]\). During retrieval, the relevance between the query and each knowledge element within the corpus is determined by a scoring function, \(\texttt{Sim}({\boldsymbol{q}}, {\boldsymbol{k}})\), which typically measures their representational similarity over the embedding space. In the subsequent generation step, the retrieved knowledge elements are used as additional input to the generation module, augmenting the query to produce an answer \({\boldsymbol{y}}\): \({\boldsymbol{y}}= \texttt{Model}({\boldsymbol{q}}, \mathcal{K})\), where \(\texttt{Model}\) is typically implemented as a foundation model, such as an LLM. It is worth noting that, unlike existing RAG frameworks that focus mainly on retrieving and incorporating textual content (or, in some recent cases, additional static images), in this work we explore the extension toward videos.
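To make this formulation concrete, the following is a minimal sketch of the retrieve-then-generate loop, assuming hypothetical `embed` and `generate` helpers (none of these names come from the paper):

```python
import numpy as np

def sim(q_emb: np.ndarray, k_emb: np.ndarray) -> float:
    # Sim(q, k): representational similarity over the embedding space
    # (here: dot product; cosine similarity is another common choice).
    return float(np.dot(q_emb, k_emb))

def retriever(q_emb: np.ndarray, corpus_embs: np.ndarray, top_k: int = 5):
    # K = Retriever(q, C): score every knowledge element and keep the top-k indices.
    scores = corpus_embs @ q_emb
    return np.argsort(-scores)[:top_k]

def rag_answer(query: str, corpus: list[str], embed, generate, top_k: int = 5) -> str:
    # y = Model(q, K): augment the query with retrieved knowledge and generate.
    q_emb = embed(query)
    corpus_embs = np.stack([embed(k) for k in corpus])
    retrieved = [corpus[i] for i in retriever(q_emb, corpus_embs, top_k)]
    return generate(query=query, context=retrieved)
```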
On top of the extensive language understanding capabilities of LLMs, LVLMs are designed to handle and incorporate the features from video content, including temporal, spatial, and multimodal information, within the unified token processing framework. Formally, let us denote a video \({\boldsymbol{V}}\) as a sequence of image frames: \({\boldsymbol{V}}= [{\boldsymbol{v}}_1, {\boldsymbol{v}}_2, \dots, {\boldsymbol{v}}_n]\) and its associated textual data (such as subtitles, or any other textual information such as the video-specific query) \({\boldsymbol{t}}\) as a sequence of tokens: \({\boldsymbol{t}}= [t_1, t_2, \dots, t_m]\). Then, the typical LVLM, denoted as \(\texttt{LVLM}\), enables the joint processing of these multimodal inputs by employing two specialized components: a vision encoder and a text encoder. Specifically, the vision encoder processes the sequence of video frames \({\boldsymbol{V}}\) (which can span multiple videos) to extract temporal and spatial features, resulting in a sequence of visual feature embeddings (or visual tokens): \({\boldsymbol{F}}_{\texttt{visual}} = \texttt{VisionEncoder}({\boldsymbol{V}})\). Concurrently, the text encoder processes the given textual information \({\boldsymbol{t}}\) to generate corresponding feature embeddings: \({\boldsymbol{F}}_{\texttt{text}} = \texttt{TextEncoder}({\boldsymbol{t}})\). Then, the overall process to obtain the video representation (with the goal of capturing both visual and textual features) can be denoted as follows: \({\boldsymbol{f}}_{\texttt{video}} = \texttt{LVLM}({\boldsymbol{V}}, {\boldsymbol{t}})\). Traditionally, \({\boldsymbol{f}}_{\texttt{video}}\) is obtained by the simple interpolation of the visual and textual representations: \({\boldsymbol{f}}_{\texttt{video}} = \alpha \cdot {\boldsymbol{F}}_{\texttt{text}} + (1-\alpha) \cdot {\boldsymbol{F}}_{\texttt{visual}}\) [17], and, more recently, it can be done by further jointly processing the visual and textual embeddings through several LVLM layers (that sit on top of existing LLMs) [18], which allows the model to learn a more effective representation and continue generating the next sequence of tokens (for example, an answer to a query).
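As a minimal sketch of the interpolation-based fusion described above (assuming pooled, equally-sized textual and visual embeddings; the encoder internals of any particular LVLM are not reproduced here):

```python
import numpy as np

def fuse_video_representation(f_text: np.ndarray, f_visual: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    # f_video = alpha * F_text + (1 - alpha) * F_visual, where F_text and F_visual
    # are assumed to be pooled (e.g., mean-pooled) embeddings of the same dimension.
    assert f_text.shape == f_visual.shape, "fusion requires matching embedding shapes"
    return alpha * f_text + (1.0 - alpha) * f_visual
```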
We now turn to introduce our VideoRAG, which extends the existing RAG paradigm by leveraging the video corpus as the external knowledge source.
The initial step to operationalize RAG over the video corpus is to implement video retrieval, whose goal is to identify query-relevant videos \(\mathcal{V} = \{ {\boldsymbol{V}}_1, {\boldsymbol{V}}_2, \dots, {\boldsymbol{V}}_k \}\) from the external corpus \(\mathcal{C}\), consisting of a large number of videos, formalized as follows: \(\mathcal{V} = \texttt{Retriever}({\boldsymbol{q}}, \mathcal{C})\). Recall that, as explained in Section 2.1, this retrieval process involves calculating the similarity between the query \({\boldsymbol{q}}\) and each knowledge element (which is video \({\boldsymbol{V}}\) in our case) to determine their relevance. To achieve this, we first forward the video \({\boldsymbol{V}}\) (composed of image frames and, if available, subtitles) as well as the query \({\boldsymbol{q}}\) (without visual information) into \(\texttt{LVLM}\), to obtain their representations \({\boldsymbol{f}}_{\texttt{query}}\) and \({\boldsymbol{f}}_{\texttt{video}}\). After that, the relevance is computed based on their representation-level similarity, for example, using a cosine similarity, and then the top-\(k\) videos with the highest similarity scores are retrieved for the subsequent answer generation step.
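A sketch of this scoring and top-\(k\) selection step, assuming the query and video embeddings have already been produced by the LVLM encoders (variable and function names are ours):

```python
import numpy as np

def retrieve_videos(f_query: np.ndarray, f_videos: np.ndarray, top_k: int = 1) -> list[int]:
    # f_query: (d,) query embedding; f_videos: (N, d) video embeddings
    # (each fused from visual frames and, if available, subtitles).
    q = f_query / np.linalg.norm(f_query)
    v = f_videos / np.linalg.norm(f_videos, axis=1, keepdims=True)
    scores = v @ q                                # cosine similarity per video
    return np.argsort(-scores)[:top_k].tolist()   # indices of the top-k videos
```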
After the retrieval of query-relevant videos is done, the next step is to incorporate the retrieved videos into the answer generation process, in order to formulate the answer grounded in them. To operationalize this, we first concatenate the frames of each retrieved video with its associated textual data (e.g., subtitles), then concatenate these multimodal pairs across all videos retrieved, and lastly append the user query, to construct the input to \(\texttt{LVLM}\), denoted as follows: \([{\boldsymbol{V}}_1, {\boldsymbol{t}}_1, \dots, {\boldsymbol{V}}_k, {\boldsymbol{t}}_k, {\boldsymbol{q}}]\). Then, this input is forwarded into \(\texttt{LVLM}\), which enables the joint processing of the combined visual, textual, and query-specific information, to generate the response while considering their multimodal richness.
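A sketch of how the interleaved input \([{\boldsymbol{V}}_1, {\boldsymbol{t}}_1, \dots, {\boldsymbol{V}}_k, {\boldsymbol{t}}_k, {\boldsymbol{q}}]\) could be assembled before being handed to the LVLM (the `lvlm_generate` call is a hypothetical stand-in, not the actual LLaVA-Video API):

```python
def build_lvlm_input(retrieved_frames: list, retrieved_texts: list[str], query: str) -> list[dict]:
    # Interleave each retrieved video's frames with its textual data,
    # then append the user query: [V_1, t_1, ..., V_k, t_k, q].
    segments = []
    for frames, text in zip(retrieved_frames, retrieved_texts):
        segments.append({"type": "video", "frames": frames})
        segments.append({"type": "text", "content": text})
    segments.append({"type": "text", "content": query})
    return segments

# Example (hypothetical generation call):
# answer = lvlm_generate(build_lvlm_input(frames_per_video, texts_per_video, query))
```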
In both the retrieval and generation steps, the inclusion of video-associated textual data, such as subtitles, can play a crucial role in enhancing video representation, since it provides additional context and semantic cues that complement the visual content. However, not every video in the corpus comes with subtitles, since they require additional annotation. Therefore, for such videos, we propose generating auxiliary textual data by extracting audio from the video and converting it into text using off-the-shelf automatic speech recognition techniques. Formally, given a video \({\boldsymbol{V}}\), this process can be formalized as follows: \({\boldsymbol{t}}_{\texttt{aux}} = \texttt{AudioToText}(\texttt{Audio}({\boldsymbol{V}}))\), where \(\texttt{Audio}({\boldsymbol{V}})\) extracts the audio track from the video, and \(\texttt{AudioToText}\) converts the extracted audio signal into textual content. For those videos without subtitles, the auxiliary text \({\boldsymbol{t}}_{\texttt{aux}}\) is then used in place of \({\boldsymbol{t}}\) in both the retrieval and generation steps.
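As a concrete sketch of this auxiliary-transcript step using the open-source whisper package (the paper uses Whisper for ASR, but this particular invocation and model size are our assumptions):

```python
import whisper  # pip install openai-whisper (requires ffmpeg on the system)

def audio_to_text(video_path: str, model_size: str = "base") -> str:
    # Whisper reads the audio track directly from the video file via ffmpeg,
    # so Audio() extraction and AudioToText() conversion happen in one call here.
    model = whisper.load_model(model_size)
    result = model.transcribe(video_path)
    return result["text"]  # used as the auxiliary textual data t_aux
```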
In this section, we describe the datasets, models, evaluation metrics, and implementation details.
To evaluate our VideoRAG on information-seeking question answering (a task particularly suitable for and widely used in validating RAG approaches), we use available resources that provide an extensive video corpus and question-answer pairs covering diverse topics. Specifically, we adopt the WikiHowQA dataset [15] as the source of queries and answers, as it offers a wide range of instructional questions extracted from the WikiHow website, with human-written, step-by-step procedures that ensure high-quality ground truths. For the external video corpus, we utilize the HowTo100M dataset [16], a comprehensive collection of instructional videos sourced from YouTube, which is also associated with WikiHow queries through their search results and therefore serves as a useful resource for our VideoRAG tasks.
Table 1: Main results of the baselines and the proposed VideoRAG variants.

| Methods | | ROUGE-L | BLEU-4 | BERTScore | G-Eval |
|---|---|---|---|---|---|
| Baselines | Naïve | 0.141 | 0.014 | 0.834 | 1.579 |
| | TextRAG (BM25) | 0.172 | 0.032 | 0.847 | 1.633 |
| | TextRAG (DPR) | 0.166 | 0.022 | 0.846 | 1.591 |
| | TextVideoRAG | 0.228 | 0.044 | 0.870 | 1.979 |
| Ours | VideoRAG-T | 0.224 | 0.043 | 0.868 | 2.001 |
| | VideoRAG-V | 0.254 | 0.054 | 0.881 | 2.161 |
| | VideoRAG-VT | 0.252 | 0.054 | 0.880 | 2.104 |
| Oracle | VideoRAG-V | 0.262 | 0.055 | 0.884 | 2.225 |
We evaluate the performance of VideoRAG against four different baselines, as follows:
Naïve – which directly generates answers using input queries without any additional context.
TextRAG (BM25) – which is a text-based RAG model, performing retrieval over documents (from Wikipedia) based on their relevance to queries with BM25 [19] (which is a sparse retriever with lexical matching) and generating answers grounded in them.
TextRAG (DPR) – which is a similar text-based RAG baseline to TextRAG (BM25) but instead performs retrieval with DPR [20], a widely used dense retriever.
TextVideoRAG – which follows previous video-based RAG methods [13], [14] that first represent videos as textual descriptions (e.g., captions or transcripts) and then utilize only this textual information in retrieval and generation.
We introduce and validate three variants of the proposed VideoRAG framework, each employing the same video retrieval method but differing in how the retrieved videos are incorporated during answer generation. Specifically, the first variant, VideoRAG-T, integrates only the transcripts (or auxiliary textual data) obtained from retrieved videos as context for answer generation. Meanwhile, the second variant, VideoRAG-V, exclusively utilizes video frames as context to provide visual grounding for responses. Finally, VideoRAG-VT jointly utilizes both the video frames and textual transcripts for generation. Additionally, to further estimate the room for performance improvement, we include an oracle version of VideoRAG, which directly uses the ground-truth video pre-associated with the query labeled in HowTo100M, instead of using retrieval outcomes.
To comprehensively evaluate the performance of different models, we use the following diverse metrics: 1) ROUGE-L measures the longest common subsequence between the generated answer and the ground truth [21]; 2) BLEU-4 calculates the overlap of n-grams (up to 4-grams) between the generated answer and the reference answer [22]; 3) BERTScore is designed to consider the semantic alignment between the generated and reference answers [23] by extracting their contextual embeddings from a pre-trained BERT model [24] and then calculating their representational similarity; 4) G-Eval leverages the evaluation capabilities of LLMs to assess the quality of the generated answer [25], and, to do this, we prompt the LLM to rate the generated answer in comparison to the reference on a 5-point Likert scale.
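These metrics can be computed with standard open-source implementations; the sketch below illustrates one way to score a single answer (the choice of libraries is ours, not specified in the paper, and G-Eval is omitted since it requires prompting an LLM with a 5-point rubric):

```python
from rouge_score import rouge_scorer                                     # pip install rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction   # pip install nltk
from bert_score import score as bert_score                               # pip install bert-score

def evaluate_answer(candidate: str, reference: str) -> dict:
    # ROUGE-L: longest-common-subsequence F1 between candidate and reference.
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = rouge.score(reference, candidate)["rougeL"].fmeasure
    # BLEU-4: n-gram overlap up to 4-grams (smoothed for short answers).
    bleu_4 = sentence_bleu(
        [reference.split()], candidate.split(),
        weights=(0.25, 0.25, 0.25, 0.25),
        smoothing_function=SmoothingFunction().method1,
    )
    # BERTScore: semantic similarity based on contextual BERT embeddings.
    _, _, f1 = bert_score([candidate], [reference], lang="en")
    return {"ROUGE-L": rouge_l, "BLEU-4": bleu_4, "BERTScore": float(f1[0])}
```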
It is worth noting that various LVLMs are available, each with different merits depending on the task requirements: for retrieval, precise alignment between textual and video features (obtained from their specialized encoders) is essential to ensure that retrieved videos are contextually relevant to the query, while generation benefits from LVLMs with advanced capabilities for accurately formulating responses and grounding them in the retrieved content. Therefore, for retrieval, we use InternVideo2 [26], since it is explicitly trained to align the semantics of videos and their textual descriptions; specifically, we use its video and text encoders to extract embeddings for videos and text, respectively. For video-augmented answer generation, we use LLaVA-Video-7B [18], which achieves state-of-the-art performance on video understanding and related tasks. For efficiency, we uniformly sample 4 frames per video for retrieval, while we sample 32 frames for generation (or all frames at 1 fps if the video is shorter than 32 seconds). For auxiliary text generation, we use Whisper [27] to extract textual content from videos.
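A small sketch of the frame-sampling scheme described above (any video reader that exposes the frame count and fps would do; the function name is ours):

```python
import numpy as np

def sample_frame_indices(n_frames: int, fps: float, for_retrieval: bool) -> np.ndarray:
    if for_retrieval:
        # Retrieval: 4 uniformly spaced frames across the whole video.
        return np.linspace(0, n_frames - 1, num=4).astype(int)
    # Generation: 32 uniformly spaced frames, or roughly 1 frame per second
    # if the video is shorter than 32 seconds.
    duration_sec = n_frames / fps
    if duration_sec < 32:
        step = max(int(round(fps)), 1)
        return np.arange(0, n_frames, step)
    return np.linspace(0, n_frames - 1, num=32).astype(int)
```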
We now present results and various analyses, showing the effectiveness of the proposed VideoRAG.
We provide main results in Table 1, showcasing the performance of different models with varying types of retrieved knowledge. First, we observe that all RAG models, utilizing external knowledge, clearly outperform the Naïve baseline, reaffirming the critical role of external knowledge in enhancing the factual accuracy of generated responses. Also, among these, our VideoRAG achieves the highest performance, significantly surpassing existing textual RAG baselines. This improvement corroborates our hypothesis that video content may be a useful resource for RAG, as it provides richer and more detailed information than textual resources.
Table 2: Performance of VideoRAG with randomly selected, retrieved, and oracle videos.

| Video Set | ROUGE-L | BLEU-4 | BERTScore |
|---|---|---|---|
| Random | 0.243 | 0.050 | 0.878 |
| Retrieved | 0.254 | 0.054 | 0.881 |
| Oracle | 0.262 | 0.055 | 0.884 |
Figure 2: Impact of varying the interpolation ratio between textual and visual features on retrieval performance.
Figure 3: Breakdown performance of different models across 10 different categories.
One might then ask whether the performance improvement of the proposed VideoRAG framework comes from the consideration of an additional modality (i.e., video frames) rather than the actual content within the retrieved knowledge. To investigate this, we compare TextVideoRAG and VideoRAG-T, which utilize textual representations of videos (without the vision modality), against TextRAG, which uses an encyclopedic textual knowledge base (namely, Wikipedia) as the knowledge source. As shown in Table 1, the improved performance of VideoRAG-T and TextVideoRAG over TextRAG suggests that, even when only textual representations of videos are used, the content embedded in videos proves more fruitful for enhancing response quality than general encyclopedic knowledge. Additionally, despite their superior performance over the baselines, TextVideoRAG and VideoRAG-T are outperformed by the other variants of our VideoRAG, which directly incorporate visual features into the answer generation process; this result further highlights the critical role of visual information. Finally, the small performance gap between VideoRAG-V and VideoRAG-VT suggests that much of the information required for answer generation is effectively encapsulated within the visual features of video frames, which inherently include information conveyed through textual descriptions.
We hypothesize that the quality of the retrieved videos is a critical factor in the success of RAG, as it can directly influence the subsequent answer generation process. To confirm this, we compare the performance of VideoRAG with different videos, including randomly selected videos and retrieved videos (relevant to queries), used for augmenting the answer generation step. As shown in Table 2, we then observe that incorporating query-relevant videos significantly improves the quality of the answers compared to randomly selected videos, demonstrating the importance of retrieval quality. Furthermore, the Oracle setting, which represents an ideal scenario with perfectly relevant video retrieval, achieves the highest performance, highlighting the potential for further improvements through advancements in video retrieval mechanisms within our VideoRAG.
When performing video retrieval, a natural question is how much different modalities, i.e., textual features, visual features, or a combination of both, contribute to the effectiveness of video representations, and we report retrieval results with varying modalities in Table [tab:feat_ensemble]. From this, we observe that textual features consistently outperform visual features, likely due to their stronger semantic alignment with textual user queries. To further examine this, we visualize the embeddings of the textual and visual features of video content, as well as of queries, in the latent space in Figure [fig:pca], which clearly reveals closer proximity between textual query embeddings and textual video representations than between query embeddings and visual video representations. In other words, visual features exhibit a modality gap relative to text-based queries (as reflected in their greater distance), resulting in suboptimal retrieval performance. Nevertheless, combining textual and visual features achieves the highest performance, demonstrating the complementary nature of these two modalities in video representations for retrieval.
Table 3: Case study on the query “Explain how to bake cookies on your car dashboard”, comparing the Naïve baseline and VideoRAG-V.

| Naïve | |
|---|---|
| Question | Explain how to bake cookies on your car dashboard. |
| Generated Answer | I’m sorry, but it’s not possible to bake cookies on your car dashboard. The dashboard is not designed for cooking and it’s not safe to use it as a heat source. Additionally, the fumes from baking could be harmful to you and the car’s interior. Baking requires a controlled environment with a consistent temperature, which is difficult to achieve in a car. Moreover, the dashboard is made of materials that can easily catch fire or release toxic fumes when exposed to high temperatures. If you want to bake cookies, it’s best to do so in a safe and controlled environment, such as a kitchen or an outdoor grill. Just remember to always follow proper safety precautions and guidelines when cooking. |

| VideoRAG-V | |
|---|---|
| Question | Explain how to bake cookies on your car dashboard. |
| Retrieved Video | (retrieved video frames omitted) |
| Generated Answer | To bake cookies on your car dashboard, you’ll need a nonstick cooking spray, a cookie sheet, cookie dough, a spatula, and a thermometer. First, spray the cookie sheet with nonstick cooking spray. Then, place the cookie dough onto the sprayed sheet. Cover it with plastic wrap and let it rest. Next, place the cookie sheet in the car’s dashboard, ensuring it’s at least 95 degrees Fahrenheit. Set the car in a sunny spot to bake the cookies. After baking, use a spatula to remove the cookies from the dashboard. |
| Ground Truth | To bake cookies on your car dashboard, you’ll need to be parked in a sunny spot on a day when it’s at least 95 degrees Fahrenheit out. First, spread cookie dough out on a baking sheet lined with parchment paper. Then, set the baking sheet on your car dashboard, and shut all of the doors. Finally, check on the cookies every 15-30 minutes until the edges of the cookies are firm and you’re able to slide them off the parchment paper. |
Table 4: Ablation on different knowledge sources.

| Methods | Wikipedia | Video | Subtitle | ROUGE-L | G-Eval |
|---|---|---|---|---|---|
| Naïve | \(\times\) | \(\times\) | \(\times\) | 0.141 | 1.579 |
| TextRAG | \(\bigcirc\) | \(\times\) | \(\times\) | 0.172 | 1.633 |
| VideoRAG-VT | \(\times\) | \(\bigcirc\) | \(\bigcirc\) | 0.252 | 2.161 |
| VideoRAG-VT + TextRAG | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | 0.243 | 2.048 |
Table 5: Case study on the query “Explain how to make a clay rose”, comparing TextRAG (BM25) and VideoRAG-V.

| TextRAG (BM25) | |
|---|---|
| Question | Explain how to make a clay rose. |
| Retrieved Document | Vally Wiselthier, an art deco ceramicist who was making pieces for General Ceramics in Keasbey, New Jersey. After seeing her preliminary pieces, Erni gave Rose a membership at Greenwich House, where she learned how to use a potter’s wheel. Shortly after their first son George was born, he was found to have intractable asthma. The Cabat’s decided to move to Arizona around 1942 in order to alleviate his condition. During World War II Rose worked as a riveter at the Davis-Monthan Army Air Field repairing war-damaged aircraft. Rose was able to make primitive ceramics from the extra clay that Erni |
| Generated Answer | The document does not provide a step-by-step guide on how to make a clay rose. However, based on the information provided, we can infer that Rose learned how to use a potter’s wheel at Greenwich House after seeing Vally Wiselthier’s preliminary pieces. This suggests that she may have received some training or instruction in pottery from Vally Wiselthier or other instructors at Greenwich House. To make a clay rose, you would need to follow these general steps: (1) Gather materials: You will need clay, a potter’s wheel, a wire cutter, a sponge, and any additional materials you want to use for decoration (such as glaze, paint, or glitter). (2) Prepare the clay: Cut a piece of clay to the desired size and shape for your rose. Use a wire cutter to trim any excess clay and smooth out any rough edges. (3) Shape the clay: Use your hands or tools to shape the clay into a rose. You can start by creating a small ball of clay for the center of the rose, then add petals around it. Use a sponge to smooth out any rough areas and refine the shape of the rose. (4) Add details: Use a tool or your fingers to add details to the petals, such as veins or shading. You can also use additional materials like glaze, paint, or glitter to decorate the rose. (5) Allow the clay to dry: Let the clay dry completely before firing it in a kiln. This will help prevent any cracking or breaking during the firing process. (6) Fire the clay: Place the dried clay rose in a kiln and fire it according to the manufacturer’s instructions. This will harden the clay and create a durable piece of pottery. (7) Glaze or finish: Once the clay is fired, you can apply a glaze or finish to give the rose a glossy or matte appearance. Allow the glaze to dry and cure before handling the finished piece. It’s important to note that this is a general guide and the specific steps may vary depending on the type of clay you are using and the techniques you are employing. It’s also important to follow safety guidelines when working with clay and kilns. |

| VideoRAG-V | |
|---|---|
| Question | Explain how to make a clay rose. |
| Retrieved Video | (retrieved video frames omitted) |
| Generated Answer | To make a clay rose, start by rolling out a piece of clay into a flat sheet. Use a rolling pin to flatten the clay further. Then, use a knife or sculpting tool to cut the clay into petal shapes. Roll the petals into a spiral shape to form the rose. Use a small tool to create indentations in the petals for texture. Assemble the petals around a central stem, shaping them to resemble a rose. Add leaves and other details as desired. Finally, use a mold to imprint patterns onto the petals, adding depth and realism to the rose. |
| Ground Truth | To make a clay rose, start by rolling a chunk of clay into a ball, dividing the ball into seven similarly-sized pieces, and rolling those pieces into balls. Take the smallest ball and press it into a guitar-pick-shaped “petal” with the pointy part thicker than the rounded part, then fold the sides gently inward. Repeat with a slightly bigger ball of clay, then fold this second petal around the first. Continue attaching slightly larger petals in a spiral, then pinch the bottom of your rose to connect all the petals. |
To better understand the contribution of textual and visual features in video retrieval, we analyze how varying their combination ratio (\(\alpha\)) impacts performance across different metrics. As shown in Figure [fig:alpha], we find that the optimal ratio for balancing textual and visual features is around 0.5 to 0.7 (with marginal variations depending on the specific metric: 0.7 for R@1 and R@5; 0.5 for R@10). These results further highlight the complementary contributions of textual and visual features in video representations for retrieval, while a slight emphasis on textual features might be preferable due to the modality gap observed in Figure [fig:pca]. Lastly, based on this analysis, we select \(\alpha = 0.7\) for our experiments.
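A sketch of this \(\alpha\) sweep, assuming precomputed query, textual, and visual embeddings along with the ground-truth video index per query (all variable names are ours):

```python
import numpy as np

def recall_at_k(scores: np.ndarray, gt: np.ndarray, k: int) -> float:
    # scores: (num_queries, num_videos) similarity matrix; gt: ground-truth video index per query.
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([gt[i] in topk[i] for i in range(len(gt))]))

def sweep_alpha(q_embs, text_embs, vis_embs, gt, alphas=np.arange(0.0, 1.01, 0.1)):
    results = {}
    for alpha in alphas:
        fused = alpha * text_embs + (1 - alpha) * vis_embs   # interpolated text/visual fusion
        scores = q_embs @ fused.T                            # similarity per query-video pair
        results[round(float(alpha), 1)] = {f"R@{k}": recall_at_k(scores, gt, k) for k in (1, 5, 10)}
    return results
```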
To evaluate the robustness of our VideoRAG across diverse query types, we break down model performance over 10 distinct categories (annotated within WikiHow). As shown in Figure 3, the family of proposed VideoRAG variants (such as VideoRAG-T and VideoRAG-V) outperforms all other baselines across all categories, which highlights its ability to handle a wide range of queries. It is also worth noting that VideoRAG-V shows a notable performance improvement in the Food & Entertaining category, which is reasonable given that questions in this category often benefit significantly from visual details; for example, the query “How to make a healthy spinach and garlic dish” requires knowledge of ingredient preparation and cooking techniques that are not effectively conveyed through text alone. Thus, the results in this category reaffirm the importance of leveraging video content as external knowledge for RAG.
To analyze how performance varies with different knowledge sources, we conduct ablation studies and present the results in Table 4. We find that, while incorporating external knowledge (whether from textual encyclopedic sources or the video corpus) consistently improves performance over the Naïve baseline, the approach that jointly uses videos with general textual documents achieves slightly degraded performance. This suggests that the textual content retrieved from the encyclopedic knowledge base may introduce redundant or irrelevant details, which can overlap with or contradict the information provided by the video content, diminishing the effectiveness of the VideoRAG framework.
We now qualitatively analyze the effectiveness of VideoRAG through a case study with the query: “Explain how to bake cookies on your car dashboard”. As shown in Table 3, the Naïve baseline, relying solely on its parametric knowledge, generates a generic response highlighting the impracticality and safety concerns of such a method, failing to provide the step-by-step instructions necessary to address the query. This indicates the limitation of parametric knowledge, which is inadequate especially when specific and uncommon information is required. In contrast, VideoRAG-V retrieves a relevant instructional video that illustrates the process of baking cookies on a car dashboard and, by leveraging it, successfully generates a detailed response similar to the ground truth. This example highlights how the proposed VideoRAG can utilize external video content to produce more precise, contextually rich, and actionable answers than methods that rely solely on the parametric knowledge embedded within models.
Furthermore, we observe that external textual knowledge alone can sometimes fall short in providing relevant and actionable information for specific procedural queries, such as “Explain how to make a clay rose”. Specifically, as shown in Table 5, TextRAG (BM25) retrieves an irrelevant document about a person named Rose, as Wikipedia does not contain specific procedural guidance on this topic, and, consequently, the generated response is misaligned with the query. In contrast, VideoRAG-V retrieves a relevant video demonstrating how to make a clay rose and leverages this visual content to generate a concise and accurate response that closely mirrors the ground truth, from which we clearly confirm the utility of videos for RAG.
RAG is a strategy that combines retrieval and generation processes to produce accurate answers by grounding them in relevant external knowledge [4], [5], [28]. To be specific, during the retrieval step, documents (relevant to queries) are selected from a large corpus by calculating their similarity to the query, which can be done using either sparse methods [19], [29] based on lexical overlap or dense methods [20], [30] that leverage semantic representations. In the subsequent generation step, these retrieved documents serve as input for generating answers that are rooted in the provided information, with the generation process (as well as retrieval) varying depending on the specific work [31]–[34], for example, incorporating iterative retrieval-generation cycles [35] or adapting different RAG strategies based on query complexity [8]. However, despite the fact that much of the real-world knowledge is inherently multimodal in nature [36]–[38], the majority of current RAG studies have focused primarily on the textual modality, with little effort on incorporating images, leaving a significant gap in leveraging the full spectrum of available knowledge for the holistic operation of RAG.
Recently, there has been growing interest in expanding RAG systems to incorporate multimodal information (beyond textual documents), such as images [9], [10], [39], [40], code [41], tables [42], [43], and audio [44]. However, in contrast to them, videos offer a unique and orthogonal advantage for advancing RAG systems, as they encapsulate temporal dynamics, spatial details, and multimodal cues in ways unmatched by other modalities. Inspired by this, very recent studies have begun exploring the usage of video content within RAG pipelines; however, despite these efforts, existing approaches leverage the distinct advantages of video data in a suboptimal way. Specifically, some focus on extracting query-relevant frames from a preselected video and generating answers based on them, which, while useful in controlled scenarios, limits their real-world applicability in open-domain settings [11], [12]. Also, other approaches attempt to sidestep the complexity of handling video data by converting it into textual representations (such as subtitles or captions); however, while directly applicable to existing text-based RAG frameworks, they sacrifice the multimodal richness embedded within videos (such as temporal dynamics and spatial patterns) [12]–[14]. To address these challenges, we propose VideoRAG that is capable of dynamically retrieving and holistically utilizing video content in RAG, powered by LVLMs discussed next.
Building on the remarkable success of LLMs in language understanding and generation as well as their ability to encapsulate vast amounts of knowledge [45]–[47], there has been a growing interest in extending them to encompass diverse modalities, such as images [48]–[50] and code [51], [52]. Furthermore, this expansion has recently extended to another modality called video, leading to the emergence of LVLMs that are capable of directly processing video content. In particular, these models excel in solving traditionally challenging (yet straightforward) tasks, such as object or action detection [53], and their capabilities have been rapidly advanced, enabling them to tackle more challenging tasks, such as analyzing spatio-temporal dynamics to predict the sequence of events, inferring causal relationships across video frames, and generating context-aware descriptions of intricate scenarios [54]–[58], even in zero-shot settings without additional training [59], [60]. However, despite these advancements, their potential has yet to be explored in the context of RAG; thus, in this work, we aim to bridge this gap with the proposal of VideoRAG.
In this work, we presented VideoRAG, a novel framework that expands the current landscape of RAG systems by leveraging a video corpus as the external knowledge source. Specifically, unlike existing works that use textual representations of videos or assume the existence of query-relevant videos without retrieval, the proposed VideoRAG not only retrieves videos based on their relevance to queries but also integrates their multimodal richness (including visual and textual elements) into the RAG pipeline by harnessing the capabilities of LVLMs. Through comprehensive analyses, we demonstrated how the inclusion of visual features, textual features, or a combination of both improves retrieval and generation performance, and, motivated by the critical role of textual features for retrieval quality and their absence in some videos, we presented a simple yet effective mitigation strategy that uses automatic speech recognition to generate textual transcripts. Overall, the experimental results validated the superiority of our VideoRAG over existing RAG methods, and we believe it marks a significant step toward holistic RAG systems that can utilize videos.