January 10, 2025
Retrieval-Augmented Generation (RAG) is a powerful strategy for addressing the issue of generating factually incorrect outputs in foundation models by retrieving external knowledge relevant to queries and incorporating it into the generation process. However, existing RAG approaches have primarily focused on textual information, with some recent advancements beginning to consider images, and they largely overlook videos, a rich source of multimodal knowledge capable of representing events, processes, and contextual details more effectively than any other modality. While a few recent studies explore the integration of videos in the response generation process, they either predefine query-associated videos without retrieving them according to queries, or convert videos into textual descriptions without harnessing their multimodal richness. To tackle these issues, we introduce VideoRAG, a novel framework that not only dynamically retrieves videos based on their relevance to queries but also utilizes both the visual and textual information of videos in output generation. Further, to operationalize this, our method builds on recent advances in Large Video Language Models (LVLMs), which enable the direct processing of video content to represent it for retrieval and the seamless integration of the retrieved videos jointly with queries. We experimentally validate the effectiveness of VideoRAG, showing that it is superior to relevant baselines.
Figure 1: A conceptual illustration of existing and the proposed RAG scenarios. (A) Textual RAG retrieves documents (relevant to queries) from a text corpus and incorporates them when generating answers. (B) Conventional multimodal RAG extends retrieval to include static images. (C) VideoRAG (ours) further extends the external knowledge source to videos.
Recently, large foundation models, such as Large Language Models (LLMs) and their extension to the vision modality, Large Vision-Language Models (VLMs), have become the standard choice for addressing diverse tasks due to their remarkable capabilities [1]–[3]. In particular, these models, trained on extensive textual and multimodal corpora, encode vast amounts of knowledge within their large-scale parameters. However, despite their success, they are still prone to generating factually incorrect outputs, known as hallucination, since their parametric knowledge can be inaccurate, incomplete, or outdated [4], [5]. This limitation highlights the need to incorporate knowledge from external sources, with Retrieval-Augmented Generation (RAG) emerging as an essential mitigation strategy. Specifically, RAG typically operates by retrieving query-relevant information from an external corpus and then generating answers grounded in the retrieved information [6], [7].
However, while existing RAG approaches have been widely adopted for various real-world applications, they have primarily focused on retrieving and incorporating textual content [5], [8], with only recent attempts beginning to explore images (or text-image pairs) as an additional source of external knowledge [9], [10]. We argue that there remains a rapidly expanding yet underutilized medium, video, which provides unparalleled multimodal richness and may be a compelling resource for augmenting the knowledge landscape of current RAG systems. Specifically, videos combine temporal dynamics, spatial details, and multimodal cues, which collectively enable them to capture complex processes, context-dependent interactions, and non-verbal signals that static modalities (such as text and images) often fail to convey. Moreover, given the increasing popularity of video-sharing platforms (such as YouTube and TikTok), the availability of diverse, high-quality video data has grown significantly, ranging from educational tutorials and scientific demonstrations to personal experiences and real-time events, all of which may be useful when formulating responses to user queries.
It is worth noting that a few very recent studies have started considering video content to handle user queries; however, they have notable limitations. For instance, some of them assume that the video relevant to a query is already known and instead focus on identifying query-relevant frames within that specified video [11], [12]. While this approach may work in scenarios where the relevant video is explicitly provided, it is suboptimal for more general use cases, where users expect the system to dynamically identify and retrieve relevant video content to provide the answer. Other studies handle videos by converting them into textual formats, such as subtitles or captions, and feed these textual representations into off-the-shelf text-based RAG pipelines [13], [14]. However, while this text-only strategy may offer a convenient workaround, it inherently sacrifices the multimodal richness of video data by discarding critical, and often essential, information, such as temporal dynamics and visual context, during the conversion process. For example, consider the query: “How does the expression of the dog change when it is angry?”. While textual transcriptions might describe the dog’s barking or growling, they fail to capture visual cues (baring teeth, raised hackles, or narrowed eyes), which are needed both for accurately interpreting the emotional state of the dog and for formulating the answer to the query.
In this work, to address the aforementioned limitations of existing studies, we introduce a novel framework, called VideoRAG, which offers a new angle on existing RAG frameworks by enabling the holistic retrieval and incorporation of video content (see Figure 1 for its illustration and comparison to existing work). Specifically, in response to queries, the proposed VideoRAG not only retrieves relevant videos from a large video corpus but also integrates both the visual and textual elements of videos into the answer-generation process. Furthermore, we operationalize this by harnessing the advanced capabilities of recent Large Video Language Models (LVLMs), which can directly process video content, consisting of visual and textual information, within a unified framework, thereby more effectively capturing its multimodal richness. However, while we observe that the joint utilization of visual and textual features is needed for the effective representation of videos and, subsequently, their retrieval, the textual descriptions of videos (e.g., subtitles) are oftentimes not available. To address this, we further present a simple yet effective mitigation strategy that utilizes automatic speech recognition to generate textual transcripts from videos, which allows us to leverage both visual and textual modalities even when explicit textual annotations are absent.
To validate the effectiveness of our VideoRAG framework, we conduct experiments by leveraging the overlapping queries from the WikiHowQA dataset [15] (consisting of query-answer pairs) with the queries from the HowTo100M dataset [16] (including query-video pairs without answers). Then, the experimental results show the significant performance improvement of the proposed VideoRAG framework over relevant RAG baselines, demonstrating the efficacy of leveraging videos for RAG.
In this section, we present VideoRAG, which retrieves query-relevant videos from a video corpus and generates responses grounded in them.
We begin with preliminaries, providing the descriptions of Retrieval-Augmented Generation (RAG) and then Large Video Language Models (LVLMs).
RAG aims to enhance the capabilities of foundation models by grounding their outputs in knowledge retrieved from an external source, such as Wikipedia, and consists of two main components: a retrieval module and a generation module. Formally, given a query \({\boldsymbol{q}}\), RAG retrieves a set of documents (or knowledge elements) \(\mathcal{K} = \{ {\boldsymbol{k}}_1, {\boldsymbol{k}}_2, \dots, {\boldsymbol{k}}_k \}\) from an external corpus \(\mathcal{C}\) (\(\mathcal{K} \subseteq \mathcal{C}\)) based on their relevance to \({\boldsymbol{q}}\) using the retrieval module, formalized as follows: \(\mathcal{K} = \texttt{Retriever}({\boldsymbol{q}}, \mathcal{C})\). Here, the query \({\boldsymbol{q}}\) and each knowledge element \({\boldsymbol{k}}\) are represented as sequences of tokens, \({\boldsymbol{q}}= [q_1, q_2, \dots, q_i]\) and \({\boldsymbol{k}}= [k_1, k_2, \dots, k_j]\). During retrieval, the relevance between the query and each knowledge element within the corpus is determined by a scoring function, \(\texttt{Sim}({\boldsymbol{q}}, {\boldsymbol{k}})\), which typically measures their representational similarity over the embedding space. In the subsequent generation step, the retrieved knowledge elements are used as additional input to the generation module, augmenting the query to produce an answer \({\boldsymbol{y}}\): \({\boldsymbol{y}}= \texttt{Model}({\boldsymbol{q}}, \mathcal{K})\), where \(\texttt{Model}\) is typically implemented as a foundation model, such as an LLM. It is worth noting that, unlike existing RAG frameworks that focus mainly on retrieving and incorporating textual content (or, in some recent cases, additional static images), in this work we explore the extension toward videos.
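To make this formulation concrete, the following is a minimal sketch of the retrieve-then-generate loop, assuming hypothetical `embed` and `generate` helpers (none of these names come from the paper):

```python
import numpy as np

def sim(q_emb: np.ndarray, k_emb: np.ndarray) -> float:
    # Sim(q, k): representational similarity over the embedding space
    # (here: dot product; cosine similarity is another common choice).
    return float(np.dot(q_emb, k_emb))

def retriever(q_emb: np.ndarray, corpus_embs: np.ndarray, top_k: int = 5):
    # K = Retriever(q, C): score every knowledge element and keep the top-k indices.
    scores = corpus_embs @ q_emb
    return np.argsort(-scores)[:top_k]

def rag_answer(query: str, corpus: list[str], embed, generate, top_k: int = 5) -> str:
    # y = Model(q, K): augment the query with retrieved knowledge and generate.
    q_emb = embed(query)
    corpus_embs = np.stack([embed(k) for k in corpus])
    retrieved = [corpus[i] for i in retriever(q_emb, corpus_embs, top_k)]
    return generate(query=query, context=retrieved)
```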
On top of the extensive language understanding capabilities of LLMs, LVLMs are designed to handle and incorporate the features from video content, including temporal, spatial, and multimodal information, within the unified token processing framework. Formally, let us denote a video \({\boldsymbol{V}}\) as a sequence of image frames: \({\boldsymbol{V}}= [{\boldsymbol{v}}_1, {\boldsymbol{v}}_2, \dots, {\boldsymbol{v}}_n]\) and its associated textual data (such as subtitles, or any other textual information such as the video-specific query) \({\boldsymbol{t}}\) as a sequence of tokens: \({\boldsymbol{t}}= [t_1, t_2, \dots, t_m]\). Then, the typical LVLM, denoted as \(\texttt{LVLM}\), enables the joint processing of these multimodal inputs by employing two specialized components: a vision encoder and a text encoder. Specifically, the vision encoder processes the sequence of video frames \({\boldsymbol{V}}\) (which can span multiple videos) to extract temporal and spatial features, resulting in a sequence of visual feature embeddings (or visual tokens): \({\boldsymbol{F}}_{\texttt{visual}} = \texttt{VisionEncoder}({\boldsymbol{V}})\). Concurrently, the text encoder processes the given textual information \({\boldsymbol{t}}\) to generate corresponding feature embeddings: \({\boldsymbol{F}}_{\texttt{text}} = \texttt{TextEncoder}({\boldsymbol{t}})\). Then, the overall process to obtain the video representation (with the goal of capturing both visual and textual features) can be denoted as follows: \({\boldsymbol{f}}_{\texttt{video}} = \texttt{LVLM}({\boldsymbol{V}}, {\boldsymbol{t}})\). Traditionally, \({\boldsymbol{f}}_{\texttt{video}}\) is obtained by the simple interpolation of the visual and textual representations: \({\boldsymbol{f}}_{\texttt{video}} = \alpha \cdot {\boldsymbol{F}}_{\texttt{text}} + (1-\alpha) \cdot {\boldsymbol{F}}_{\texttt{visual}}\) [17], and, more recently, it can be done by further jointly processing the visual and textual embeddings through several LVLM layers (that sit on top of existing LLMs) [18], which allows the model to learn a more effective representation and continue generating the next sequence of tokens (for example, an answer to a query).
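As a minimal sketch of the interpolation-based fusion described above (assuming pooled, equally-sized textual and visual embeddings; the encoder internals of any particular LVLM are not reproduced here):

```python
import numpy as np

def fuse_video_representation(f_text: np.ndarray, f_visual: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    # f_video = alpha * F_text + (1 - alpha) * F_visual, where F_text and F_visual
    # are assumed to be pooled (e.g., mean-pooled) embeddings of the same dimension.
    assert f_text.shape == f_visual.shape, "fusion requires matching embedding shapes"
    return alpha * f_text + (1.0 - alpha) * f_visual
```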
We now turn to introduce our VideoRAG, which extends the existing RAG paradigm by leveraging the video corpus as the external knowledge source.
The initial step to operationalize RAG over the video corpus is to implement video retrieval, whose goal is to identify query-relevant videos \(\mathcal{V} = \{ {\boldsymbol{V}}_1, {\boldsymbol{V}}_2, \dots, {\boldsymbol{V}}_k \}\) from the external corpus \(\mathcal{C}\), consisting of a large number of videos, formalized as follows: \(\mathcal{V} = \texttt{Retriever}({\boldsymbol{q}}, \mathcal{C})\). Recall that, as explained in Section 2.1, this retrieval process involves calculating the similarity between the query \({\boldsymbol{q}}\) and each knowledge element (which is video \({\boldsymbol{V}}\) in our case) to determine their relevance. To achieve this, we first forward the video \({\boldsymbol{V}}\) (composed of image frames and, if available, subtitles) as well as the query \({\boldsymbol{q}}\) (without visual information) into \(\texttt{LVLM}\), to obtain their representations \({\boldsymbol{f}}_{\texttt{query}}\) and \({\boldsymbol{f}}_{\texttt{video}}\). After that, the relevance is computed based on their representation-level similarity, for example, using a cosine similarity, and then the top-\(k\) videos with the highest similarity scores are retrieved for the subsequent answer generation step.
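A sketch of this scoring and top-\(k\) selection step, assuming the query and video embeddings have already been produced by the LVLM encoders (variable and function names are ours):

```python
import numpy as np

def retrieve_videos(f_query: np.ndarray, f_videos: np.ndarray, top_k: int = 1) -> list[int]:
    # f_query: (d,) query embedding; f_videos: (N, d) video embeddings
    # (each fused from visual frames and, if available, subtitles).
    q = f_query / np.linalg.norm(f_query)
    v = f_videos / np.linalg.norm(f_videos, axis=1, keepdims=True)
    scores = v @ q                                # cosine similarity per video
    return np.argsort(-scores)[:top_k].tolist()   # indices of the top-k videos
```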
After the retrieval of query-relevant videos is done, the next step is to incorporate the retrieved videos into the answer generation process, in order to formulate the answer grounded in them. To operationalize this, we first concatenate the frames of each retrieved video with its associated textual data (e.g., subtitles), then concatenate these multimodal pairs across all videos retrieved, and lastly append the user query, to construct the input to \(\texttt{LVLM}\), denoted as follows: \([{\boldsymbol{V}}_1, {\boldsymbol{t}}_1, \dots, {\boldsymbol{V}}_k, {\boldsymbol{t}}_k, {\boldsymbol{q}}]\). Then, this input is forwarded into \(\texttt{LVLM}\), which enables the joint processing of the combined visual, textual, and query-specific information, to generate the response while considering their multimodal richness.
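A sketch of how the interleaved input \([{\boldsymbol{V}}_1, {\boldsymbol{t}}_1, \dots, {\boldsymbol{V}}_k, {\boldsymbol{t}}_k, {\boldsymbol{q}}]\) could be assembled before being handed to the LVLM (the `lvlm_generate` call is a hypothetical stand-in, not the actual LLaVA-Video API):

```python
def build_lvlm_input(retrieved_frames: list, retrieved_texts: list[str], query: str) -> list[dict]:
    # Interleave each retrieved video's frames with its textual data,
    # then append the user query: [V_1, t_1, ..., V_k, t_k, q].
    segments = []
    for frames, text in zip(retrieved_frames, retrieved_texts):
        segments.append({"type": "video", "frames": frames})
        segments.append({"type": "text", "content": text})
    segments.append({"type": "text", "content": query})
    return segments

# Example (hypothetical generation call):
# answer = lvlm_generate(build_lvlm_input(frames_per_video, texts_per_video, query))
```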
In both the retrieval and generation steps, the inclusion of video-associated textual data, such as subtitles, can play a crucial role in enhancing video representation, since it provides additional context and semantic cues that complement the visual content. However, not every video in the corpus comes with subtitles, since they require additional annotation. Therefore, for such videos, we propose generating auxiliary textual data by extracting audio from the video and converting it into text using off-the-shelf automatic speech recognition techniques. Formally, given a video \({\boldsymbol{V}}\), this process can be formalized as follows: \({\boldsymbol{t}}_{\texttt{aux}} = \texttt{AudioToText}(\texttt{Audio}({\boldsymbol{V}}))\), where \(\texttt{Audio}({\boldsymbol{V}})\) extracts the audio track from the video, and \(\texttt{AudioToText}\) converts the extracted audio signal into textual content. For those videos without subtitles, the auxiliary text \({\boldsymbol{t}}_{\texttt{aux}}\) is then used in place of \({\boldsymbol{t}}\) in both the retrieval and generation steps.
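As a concrete sketch of this auxiliary-transcript step using the open-source whisper package (the paper uses Whisper for ASR, but this particular invocation and model size are our assumptions):

```python
import whisper  # pip install openai-whisper (requires ffmpeg on the system)

def audio_to_text(video_path: str, model_size: str = "base") -> str:
    # Whisper reads the audio track directly from the video file via ffmpeg,
    # so Audio() extraction and AudioToText() conversion happen in one call here.
    model = whisper.load_model(model_size)
    result = model.transcribe(video_path)
    return result["text"]  # used as the auxiliary textual data t_aux
```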
In this section, we describe the datasets, models, evaluation metrics, and implementation details.
To evaluate our VideoRAG on information-seeking question answering (a task particularly suitable for and widely used in validating RAG approaches), we use available resources that provide an extensive video corpus and question-answer pairs covering diverse topics. Specifically, we adopt the WikiHowQA dataset [15] as the source of queries and answers, as it offers a wide range of instructional questions extracted from the WikiHow website, with human-written, step-by-step procedures that ensure high-quality ground truths. For the external video corpus, we utilize the HowTo100M dataset [16], a comprehensive collection of instructional videos sourced from YouTube, which is also associated with WikiHow queries through their search results and therefore serves as a useful resource for our VideoRAG tasks.
Table 1: Main results of the baselines and the proposed VideoRAG variants.

| Methods | | ROUGE-L | BLEU-4 | BERTScore | G-Eval |
|---|---|---|---|---|---|
| Baselines | Naïve | 0.141 | 0.014 | 0.834 | 1.579 |
| | TextRAG (BM25) | 0.172 | 0.032 | 0.847 | 1.633 |
| | TextRAG (DPR) | 0.166 | 0.022 | 0.846 | 1.591 |
| | TextVideoRAG | 0.228 | 0.044 | 0.870 | 1.979 |
| Ours | VideoRAG-T | 0.224 | 0.043 | 0.868 | 2.001 |
| | VideoRAG-V | 0.254 | 0.054 | 0.881 | 2.161 |
| | VideoRAG-VT | 0.252 | 0.054 | 0.880 | 2.104 |
| Oracle | VideoRAG-V | 0.262 | 0.055 | 0.884 | 2.225 |
We evaluate the performance of VideoRAG against four different baselines, as follows:
Naïve – which directly generates answers using input queries without any additional context.
TextRAG (BM25) – which is a text-based RAG model, performing retrieval over documents (from Wikipedia) based on their relevance to queries with BM25 [19] (which is a sparse retriever with lexical matching) and generating answers grounded in them.
TextRAG (DPR) – which is a similar text-based RAG baseline to TextRAG (BM25) but instead performs retrieval with DPR [20], a widely used dense retriever.
TextVideoRAG – which follows previous video-based RAG methods [13], [14] that first represent videos as textual descriptions (e.g., captions or transcripts) and then utilize only this textual information in retrieval and generation.
We introduce and validate three variants of the proposed VideoRAG framework, each employing the same video retrieval method but differing in how the retrieved videos are incorporated during answer generation. Specifically, the first variant, VideoRAG-T, integrates only the transcripts (or auxiliary textual data) obtained from retrieved videos as context for answer generation. Meanwhile, the second variant, VideoRAG-V, exclusively utilizes video frames as context to provide visual grounding for responses. Finally, VideoRAG-VT jointly utilizes both the video frames and textual transcripts for generation. Additionally, to further estimate the room for performance improvement, we include an oracle version of VideoRAG, which directly uses the ground-truth video pre-associated with the query labeled in HowTo100M, instead of using retrieval outcomes.
To comprehensively evaluate the performance of different models, we use the following diverse metrics: 1) ROUGE-L measures the longest common subsequence between the generated answer and the ground truth [21]; 2) BLEU-4 calculates the overlap of n-grams (up to 4-grams) between the generated answer and the reference answer [22]; 3) BERTScore is designed to consider the semantic alignment between the generated and reference answers [23] by extracting their contextual embeddings from a pre-trained BERT model [24] and then calculating their representational similarity; 4) G-Eval leverages the evaluation capabilities of LLMs to assess the quality of the generated answer [25], and, to do this, we prompt the LLM to rate the generated answer in comparison to the reference on a 5-point Likert scale.
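These metrics can be computed with standard open-source implementations; the sketch below illustrates one way to score a single answer (the choice of libraries is ours, not specified in the paper, and G-Eval is omitted since it requires prompting an LLM with a 5-point rubric):

```python
from rouge_score import rouge_scorer                                     # pip install rouge-score
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction   # pip install nltk
from bert_score import score as bert_score                               # pip install bert-score

def evaluate_answer(candidate: str, reference: str) -> dict:
    # ROUGE-L: longest-common-subsequence F1 between candidate and reference.
    rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = rouge.score(reference, candidate)["rougeL"].fmeasure
    # BLEU-4: n-gram overlap up to 4-grams (smoothed for short answers).
    bleu_4 = sentence_bleu(
        [reference.split()], candidate.split(),
        weights=(0.25, 0.25, 0.25, 0.25),
        smoothing_function=SmoothingFunction().method1,
    )
    # BERTScore: semantic similarity based on contextual BERT embeddings.
    _, _, f1 = bert_score([candidate], [reference], lang="en")
    return {"ROUGE-L": rouge_l, "BLEU-4": bleu_4, "BERTScore": float(f1[0])}
```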
It is worth noting that various LVLMs are available, each with different merits depending on the task requirements: for retrieval, precise alignment between textual and video features (obtained from their specialized encoders) is essential to ensure that retrieved videos are contextually relevant to the query, while generation benefits from LVLMs with advanced capabilities for accurately formulating responses and grounding them in the retrieved content. Therefore, for retrieval, we use InternVideo2 [26], since it is explicitly trained to align the semantics of videos and their textual descriptions; specifically, we use its video and text encoders to extract embeddings for videos and text, respectively. For video-augmented answer generation, we use LLaVA-Video-7B [18], which achieves state-of-the-art performance on video understanding and related tasks. For efficiency, we uniformly sample 4 frames per video for retrieval, while we sample 32 frames for generation (or all frames at 1 fps if the video is shorter than 32 seconds). For auxiliary text generation, we use Whisper [27] to extract textual content from videos.
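A small sketch of the frame-sampling scheme described above (any video reader that exposes the frame count and fps would do; the function name is ours):

```python
import numpy as np

def sample_frame_indices(n_frames: int, fps: float, for_retrieval: bool) -> np.ndarray:
    if for_retrieval:
        # Retrieval: 4 uniformly spaced frames across the whole video.
        return np.linspace(0, n_frames - 1, num=4).astype(int)
    # Generation: 32 uniformly spaced frames, or roughly 1 frame per second
    # if the video is shorter than 32 seconds.
    duration_sec = n_frames / fps
    if duration_sec < 32:
        step = max(int(round(fps)), 1)
        return np.arange(0, n_frames, step)
    return np.linspace(0, n_frames - 1, num=32).astype(int)
```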
We now present results and various analyses, showing the effectiveness of the proposed VideoRAG.
We provide main results in Table 1, showcasing the performance of different models with varying types of retrieved knowledge. First, we observe that all RAG models, utilizing external knowledge, clearly outperform the Naïve baseline, reaffirming the critical role of external knowledge in enhancing the factual accuracy of generated responses. Also, among these, our VideoRAG achieves the highest performance, significantly surpassing existing textual RAG baselines. This improvement corroborates our hypothesis that video content may be a useful resource for RAG, as it provides richer and more detailed information than textual resources.
Table 2: Performance of VideoRAG with randomly selected, retrieved, and oracle videos.

| Video Set | ROUGE-L | BLEU-4 | BERTScore |
|---|---|---|---|
| Random | 0.243 | 0.050 | 0.878 |
| Retrieved | 0.254 | 0.054 | 0.881 |
| Oracle | 0.262 | 0.055 | 0.884 |
Figure 2: Impact of varying the interpolation ratio between textual and visual features on retrieval performance.
Figure 3: Breakdown performance of different models across 10 different categories.
One might then ask whether the performance improvement of the proposed VideoRAG framework comes from the consideration of an additional modality (i.e., video frames) rather than the actual content within the retrieved knowledge. To investigate this, we compare TextVideoRAG and VideoRAG-T, which utilize textual representations of videos (without the vision modality), against TextRAG, which uses an encyclopedic textual knowledge base (namely, Wikipedia) as the knowledge source. As shown in Table 1, the improved performance of VideoRAG-T and TextVideoRAG over TextRAG suggests that, even when only textual representations of videos are used, the content embedded in videos proves more fruitful for enhancing response quality than general encyclopedic knowledge. Additionally, despite their superior performance over the baselines, TextVideoRAG and VideoRAG-T are outperformed by the other variants of our VideoRAG, which directly incorporate visual features into the answer generation process; this result further highlights the critical role of visual information. Finally, the small performance gap between VideoRAG-V and VideoRAG-VT suggests that much of the information required for answer generation is effectively encapsulated within the visual features of video frames, which inherently include information conveyed through textual descriptions.
We hypothesize that the quality of the retrieved videos is a critical factor in the success of RAG, as it can directly influence the subsequent answer generation process. To confirm this, we compare the performance of VideoRAG with different videos, including randomly selected videos and retrieved videos (relevant to queries), used for augmenting the answer generation step. As shown in Table 2, we then observe that incorporating query-relevant videos significantly improves the quality of the answers compared to randomly selected videos, demonstrating the importance of retrieval quality. Furthermore, the Oracle setting, which represents an ideal scenario with perfectly relevant video retrieval, achieves the highest performance, highlighting the potential for further improvements through advancements in video retrieval mechanisms within our VideoRAG.
When performing video retrieval, a natural question is how much different modalities, i.e., textual features, visual features, or a combination of both, contribute to the effectiveness of video representations, and we report retrieval results with varying modalities in Table [tab:feat_ensemble]. From this, we observe that textual features consistently outperform visual features, likely due to their stronger semantic alignment with textual user queries. To further examine this, we visualize the embeddings of the textual and visual features of video content, as well as of queries, in the latent space in Figure [fig:pca], which clearly reveals closer proximity between textual query embeddings and textual video representations than between query embeddings and visual video representations. In other words, visual features exhibit a modality gap relative to text-based queries (as reflected in their greater distance), resulting in suboptimal retrieval performance. Nevertheless, combining textual and visual features achieves the highest performance, demonstrating the complementary nature of these two modalities in video representations for retrieval.
Table 3: Case study on the query “Explain how to bake cookies on your car dashboard”, comparing the Naïve baseline and VideoRAG-V.

| Naïve | |
|---|---|
| Question | Explain how to bake cookies on your car dashboard. |
| Generated Answer | I’m sorry, but it’s not possible to bake cookies on your car dashboard. The dashboard is not designed for cooking and it’s not safe to use it as a heat source. Additionally, the fumes from baking could be harmful to you and the car’s interior. Baking requires a controlled environment with a consistent temperature, which is difficult to achieve in a car. Moreover, the dashboard is made of materials that can easily catch fire or release toxic fumes when exposed to high temperatures. If you want to bake cookies, it’s best to do so in a safe and controlled environment, such as a kitchen or an outdoor grill. Just remember to always follow proper safety precautions and guidelines when cooking. |

| VideoRAG-V | |
|---|---|
| Question | Explain how to bake cookies on your car dashboard. |
| Retrieved Video | (retrieved video frames omitted) |
| Generated Answer | To bake cookies on your car dashboard, you’ll need a nonstick cooking spray, a cookie sheet, cookie dough, a spatula, and a thermometer. First, spray the cookie sheet with nonstick cooking spray. Then, place the cookie dough onto the sprayed sheet. Cover it with plastic wrap and let it rest. Next, place the cookie sheet in the car’s dashboard, ensuring it’s at least 95 degrees Fahrenheit. Set the car in a sunny spot to bake the cookies. After baking, use a spatula to remove the cookies from the dashboard. |
| Ground Truth | To bake cookies on your car dashboard, you’ll need to be parked in a sunny spot on a day when it’s at least 95 degrees Fahrenheit out. First, spread cookie dough out on a baking sheet lined with parchment paper. Then, set the baking sheet on your car dashboard, and shut all of the doors. Finally, check on the cookies every 15-30 minutes until the edges of the cookies are firm and you’re able to slide them off the parchment paper. |
Table 4: Ablation on different knowledge sources.

| Methods | Wikipedia | Video | Subtitle | ROUGE-L | G-Eval |
|---|---|---|---|---|---|
| Naïve | \(\times\) | \(\times\) | \(\times\) | 0.141 | 1.579 |
| TextRAG | \(\bigcirc\) | \(\times\) | \(\times\) | 0.172 | 1.633 |
| VideoRAG-VT | \(\times\) | \(\bigcirc\) | \(\bigcirc\) | 0.252 | 2.161 |
| VideoRAG-VT + TextRAG | \(\bigcirc\) | \(\bigcirc\) | \(\bigcirc\) | 0.243 | 2.048 |
Table 5: Case study on the query “Explain how to make a clay rose”, comparing TextRAG (BM25) and VideoRAG-V.

| TextRAG (BM25) | |
|---|---|
| Question | Explain how to make a clay rose. |
| Retrieved Document | Vally Wiselthier, an art deco ceramicist who was making pieces for General Ceramics in Keasbey, New Jersey. After seeing her preliminary pieces, Erni gave Rose a membership at Greenwich House, where she learned how to use a potter’s wheel. Shortly after their first son George was born, he was found to have intractable asthma. The Cabat’s decided to move to Arizona around 1942 in order to alleviate his condition. During World War II Rose worked as a riveter at the Davis-Monthan Army Air Field repairing war-damaged aircraft. Rose was able to make primitive ceramics from the extra clay that Erni |
| Generated Answer | The document does not provide a step-by-step guide on how to make a clay rose. However, based on the information provided, we can infer that Rose learned how to use a potter’s wheel at Greenwich House after seeing Vally Wiselthier’s preliminary pieces. This suggests that she may have received some training or instruction in pottery from Vally Wiselthier or other instructors at Greenwich House. To make a clay rose, you would need to follow these general steps: (1) Gather materials: You will need clay, a potter’s wheel, a wire cutter, a sponge, and any additional materials you want to use for decoration (such as glaze, paint, or glitter). (2) Prepare the clay: Cut a piece of clay to the desired size and shape for your rose. Use a wire cutter to trim any excess clay and smooth out any rough edges. (3) Shape the clay: Use your hands or tools to shape the clay into a rose. You can start by creating a small ball of clay for the center of the rose, then add petals around it. Use a sponge to smooth out any rough areas and refine the shape of the rose. (4) Add details: Use a tool or your fingers to add details to the petals, such as veins or shading. You can also use additional materials like glaze, paint, or glitter to decorate the rose. (5) Allow the clay to dry: Let the clay dry completely before firing it in a kiln. This will help prevent any cracking or breaking during the firing process. (6) Fire the clay: Place the dried clay rose in a kiln and fire it according to the manufacturer’s instructions. This will harden the clay and create a durable piece of pottery. (7) Glaze or finish: Once the clay is fired, you can apply a glaze or finish to give the rose a glossy or matte appearance. Allow the glaze to dry and cure before handling the finished piece. It’s important to note that this is a general guide and the specific steps may vary depending on the type of clay you are using and the techniques you are employing. It’s also important to follow safety guidelines when working with clay and kilns. |

| VideoRAG-V | |
|---|---|
| Question | Explain how to make a clay rose. |
| Retrieved Video | (retrieved video frames omitted) |
| Generated Answer | To make a clay rose, start by rolling out a piece of clay into a flat sheet. Use a rolling pin to flatten the clay further. Then, use a knife or sculpting tool to cut the clay into petal shapes. Roll the petals into a spiral shape to form the rose. Use a small tool to create indentations in the petals for texture. Assemble the petals around a central stem, shaping them to resemble a rose. Add leaves and other details as desired. Finally, use a mold to imprint patterns onto the petals, adding depth and realism to the rose. |
| Ground Truth | To make a clay rose, start by rolling a chunk of clay into a ball, dividing the ball into seven similarly-sized pieces, and rolling those pieces into balls. Take the smallest ball and press it into a guitar-pick-shaped “petal” with the pointy part thicker than the rounded part, then fold the sides gently inward. Repeat with a slightly bigger ball of clay, then fold this second petal around the first. Continue attaching slightly larger petals in a spiral, then pinch the bottom of your rose to connect all the petals. |
To better understand the contribution of textual and visual features in video retrieval, we analyze how varying their combination ratio (\(\alpha\)) impacts performance across different metrics. As shown in Figure [fig:alpha], we find that the optimal ratio for balancing textual and visual features is around 0.5 to 0.7 (with marginal variations depending on the specific metric: 0.7 for R@1 and R@5; 0.5 for R@10). These results further highlight the complementary contributions of textual and visual features in video representations for retrieval, while a slight emphasis on textual features might be preferable due to the modality gap observed in Figure [fig:pca]. Lastly, based on this analysis, we select \(\alpha = 0.7\) for our experiments.
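A sketch of this \(\alpha\) sweep, assuming precomputed query, textual, and visual embeddings along with the ground-truth video index per query (all variable names are ours):

```python
import numpy as np

def recall_at_k(scores: np.ndarray, gt: np.ndarray, k: int) -> float:
    # scores: (num_queries, num_videos) similarity matrix; gt: ground-truth video index per query.
    topk = np.argsort(-scores, axis=1)[:, :k]
    return float(np.mean([gt[i] in topk[i] for i in range(len(gt))]))

def sweep_alpha(q_embs, text_embs, vis_embs, gt, alphas=np.arange(0.0, 1.01, 0.1)):
    results = {}
    for alpha in alphas:
        fused = alpha * text_embs + (1 - alpha) * vis_embs   # interpolated text/visual fusion
        scores = q_embs @ fused.T                            # similarity per query-video pair
        results[round(float(alpha), 1)] = {f"R@{k}": recall_at_k(scores, gt, k) for k in (1, 5, 10)}
    return results
```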
To evaluate the robustness of our VideoRAG across diverse query types, we break down model performance over 10 distinct categories (annotated within WikiHow). As shown in Figure 3, the family of proposed VideoRAG variants (such as VideoRAG-T and VideoRAG-V) outperforms all other baselines across all categories, which highlights its ability to handle a wide range of queries. It is also worth noting that VideoRAG-V shows a notable performance improvement in the Food & Entertaining category, which is reasonable given that questions in this category often benefit significantly from visual details; for example, the query “How to make a healthy spinach and garlic dish” requires knowledge of ingredient preparation and cooking techniques that are not effectively conveyed through text alone. Thus, the results in this category reaffirm the importance of leveraging video content as external knowledge for RAG.
To analyze how performance varies with different knowledge sources, we conduct ablation studies and present the results in Table 4. We find that, while incorporating external knowledge (whether from textual encyclopedic sources or the video corpus) consistently improves performance over the Naïve baseline, the approach that jointly uses videos with general textual documents achieves slightly degraded performance. This suggests that the textual content retrieved from the encyclopedic knowledge base may introduce redundant or irrelevant details, which can overlap with or contradict the information provided by the video content, diminishing the effectiveness of the VideoRAG framework.
We now qualitatively analyze the effectiveness of VideoRAG through a case study with the query: “Explain how to bake cookies on your car dashboard”. As shown in Table 3, the Naïve baseline, relying solely on its parametric knowledge, generates a generic response highlighting the impracticality and safety concerns of such a method, failing to provide the step-by-step instructions necessary to address the query. This indicates the limitation of parametric knowledge, which is inadequate especially when specific and uncommon information is required. In contrast, VideoRAG-V retrieves a relevant instructional video that illustrates the process of baking cookies on a car dashboard and, by leveraging it, successfully generates a detailed response similar to the ground truth. This example highlights how the proposed VideoRAG can utilize external video content to produce more precise, contextually rich, and actionable answers than methods that rely solely on the parametric knowledge embedded within models.
Furthermore, we observe that external textual knowledge alone can sometimes fall short in providing relevant and actionable information for specific procedural queries, such as “Explain how to make a clay rose”. Specifically, as shown in Table 5, TextRAG (BM25) retrieves an irrelevant document about a person named Rose, as Wikipedia does not contain specific procedural guidance on this topic, and, consequently, the generated response is misaligned with the query. In contrast, VideoRAG-V retrieves a relevant video demonstrating how to make a clay rose and leverages this visual content to generate a concise and accurate response that closely mirrors the ground truth, from which we clearly confirm the utility of videos for RAG.
RAG is a strategy that combines retrieval and generation processes to produce accurate answers by grounding them in relevant external knowledge [4], [5], [28]. To be specific, during the retrieval step, documents (relevant to queries) are selected from a large corpus by calculating their similarity to the query, which can be done using either sparse methods [19], [29] based on lexical overlap or dense methods [20], [30] that leverage semantic representations. In the subsequent generation step, these retrieved documents serve as input for generating answers that are rooted in the provided information, with the generation process (as well as retrieval) varying depending on the specific work [31]–[34], for example, incorporating iterative retrieval-generation cycles [35] or adapting different RAG strategies based on query complexity [8]. However, despite the fact that much of the real-world knowledge is inherently multimodal in nature [36]–[38], the majority of current RAG studies have focused primarily on the textual modality, with little effort on incorporating images, leaving a significant gap in leveraging the full spectrum of available knowledge for the holistic operation of RAG.
Recently, there has been growing interest in expanding RAG systems to incorporate multimodal information (beyond textual documents), such as images [9], [10], [39], [40], code [41], tables [42], [43], and audio [44]. However, in contrast to them, videos offer a unique and orthogonal advantage for advancing RAG systems, as they encapsulate temporal dynamics, spatial details, and multimodal cues in ways unmatched by other modalities. Inspired by this, very recent studies have begun exploring the usage of video content within RAG pipelines; however, despite these efforts, existing approaches leverage the distinct advantages of video data in a suboptimal way. Specifically, some focus on extracting query-relevant frames from a preselected video and generating answers based on them, which, while useful in controlled scenarios, limits their real-world applicability in open-domain settings [11], [12]. Also, other approaches attempt to sidestep the complexity of handling video data by converting it into textual representations (such as subtitles or captions); however, while directly applicable to existing text-based RAG frameworks, they sacrifice the multimodal richness embedded within videos (such as temporal dynamics and spatial patterns) [12]–[14]. To address these challenges, we propose VideoRAG that is capable of dynamically retrieving and holistically utilizing video content in RAG, powered by LVLMs discussed next.
Building on the remarkable success of LLMs in language understanding and generation as well as their ability to encapsulate vast amounts of knowledge [45]–[47], there has been a growing interest in extending them to encompass diverse modalities, such as images [48]–[50] and code [51], [52]. Furthermore, this expansion has recently extended to another modality called video, leading to the emergence of LVLMs that are capable of directly processing video content. In particular, these models excel in solving traditionally challenging (yet straightforward) tasks, such as object or action detection [53], and their capabilities have been rapidly advanced, enabling them to tackle more challenging tasks, such as analyzing spatio-temporal dynamics to predict the sequence of events, inferring causal relationships across video frames, and generating context-aware descriptions of intricate scenarios [54]–[58], even in zero-shot settings without additional training [59], [60]. However, despite these advancements, their potential has yet to be explored in the context of RAG; thus, in this work, we aim to bridge this gap with the proposal of VideoRAG.
In this work, we presented VideoRAG, a novel framework that expands the current landscape of RAG systems by leveraging a video corpus as the external knowledge source. Specifically, unlike existing works that use textual representations of videos or assume the existence of query-relevant videos without retrieval, the proposed VideoRAG not only retrieves videos based on their relevance to queries but also integrates their multimodal richness (including visual and textual elements) into the RAG pipeline by harnessing the capabilities of LVLMs. Through comprehensive analyses, we demonstrated how the inclusion of visual features, textual features, or a combination of both improves retrieval and generation performance, and, motivated by the critical role of textual features for retrieval quality and their absence in some videos, we presented a simple yet effective mitigation strategy that uses automatic speech recognition to generate textual transcripts. Overall, the experimental results validated the superiority of our VideoRAG over existing RAG methods, and we believe it marks a significant step toward holistic RAG systems that can utilize videos.