MCiteBench: A Benchmark for Multimodal Citation Text Generation in MLLMs

Caiyu Hu\(\spadesuit\) Yikai Zhang\(\spadesuit\) Tinghui Zhu\(\spadesuit\) Yiwei Ye\(\diamondsuit\) Yanghua Xiao\(\spadesuit\)1
\(\spadesuit\)Shanghai Key Laboratory of Data Science, School of Computer Science, Fudan University
\(\diamondsuit\)School of Computer Engineering and Science, Shanghai University
{cyhu24,ykzhang22,thzhu22}@m.fudan.edu.cn
yiweiye@shu.edu.cn, shawyh@fudan.edu.cn
https://caiyuhu.github.io/MCiteBench


Abstract

Multimodal Large Language Models (MLLMs) have advanced in integrating diverse modalities but frequently suffer from hallucination. A promising solution to mitigate this issue is to generate text with citations, providing a transparent chain for verification. However, existing work primarily focuses on generating citations for text-only content, overlooking the challenges and opportunities of multimodal contexts. To address this gap, we introduce MCiteBench, the first benchmark designed to evaluate and analyze the multimodal citation text generation ability of MLLMs. Our benchmark comprises data derived from academic papers and review-rebuttal interactions, featuring diverse information sources and multimodal content. We comprehensively evaluate models along multiple dimensions, including citation quality, source reliability, and answer accuracy. Through extensive experiments, we observe that MLLMs struggle with multimodal citation text generation. Our in-depth analyses further reveal that the bottleneck lies in attributing responses to the correct sources rather than in understanding the multimodal content.

1 Introduction↩︎

Multimodal Large Language Models (MLLMs) have shown remarkable progress in integrating external information from diverse modalities, allowing them to generate responses beyond the scope of their internal knowledge [1][3]. Despite these advancements, the models frequently suffer from hallucination [4], [5], undermining the faithfulness of their outputs [6]. To mitigate this issue, a promising path is to have the model cite its information sources after generating a statement, offering a transparent and verifiable attribution chain.

Figure 1: Illustration of the task form in MCiteBench. The model takes a multimodal corpus and generates responses with explicit citations.

Existing studies on citation text generation mainly focus on the textual modality [7], [8]. However, real-world information sources span multiple modalities that convey information beyond text alone (e.g., flowcharts and tables). Attributing the origin of multimodal sources can improve the faithfulness and quality of model responses in broader scenarios (as shown in Figure 1), which is overlooked in previous work. Additionally, generating responses with citations to sources from multiple modalities poses several challenges for MLLMs: the model needs to perform cross-modal understanding, assess whether the cited evidence sufficiently supports the answers, and avoid being misled by distracting information. These capabilities remain largely unexplored. Therefore, we aim to build a benchmark to evaluate and analyze the multimodal citation text generation ability of MLLMs.

However, building such a benchmark is challenging. First, constructing high-quality question-answer data with multimodal sources as evidence is non-trivial. On the one hand, collecting multimodal corpora requires accurately extracting the information sources. On the other hand, the evidence in the dataset must be aligned with the answers; in particular, when multiple pieces of evidence contribute to an answer, it is crucial to ensure that they are interrelated and collectively provide sufficient support for answering the question. Second, evaluating MLLMs’ performance in multimodal citation generation adds further complexity. Developing reliable evaluation methods for cross-modal entailment is key to assessing whether cited evidence supports generated answers. Additionally, citations must align with the responses, ensuring the cited evidence is both necessary for answering the question and directly corresponds to the generated output. Therefore, it is essential to evaluate model performance across multiple dimensions.

In this paper, we propose MCiteBench, the first benchmark for multimodal citation text generation. To tackle the above challenges, we first collect academic papers, then accurately extract and rigorously filter information sources across multiple modalities to form an attribution corpus. Based on this corpus, we utilize the review-rebuttal interactions of these papers to construct QA pairs whose answers are grounded in supporting evidence. To comprehensively assess model performance, we evaluate models from three dimensions: citation quality, source reliability, and answer accuracy. Through extensive experiments, we present several key findings:

  • While MLLMs can answer questions correctly, they struggle with citation, particularly when evidence comes from multiple sources.

  • Compared to visual information, MLLMs are better at attributing the textual evidence, indicating a potential modality bias in citation text generation.

In summary, our contributions are as follows:

  • To the best of our knowledge, we propose the first benchmark for comprehensively evaluating the performance of MLLMs in multimodal citation text generation.

  • MCiteBench consists of 3,000 samples that cover a wide range of difficulty levels, including both single-source and multi-source evidence. Additionally, it encompasses cases involving single-modality and mixed-modality evidence. Given the diversity of evidence sources and complexities involved, we design multi-dimensional criteria to assess model performance.

  • We conduct experiments to analyze the performance of models in citation text generation. Results indicate that existing models struggle with citation text generation, with limitations stemming from attribution rather than understanding.

2 Related Work↩︎

2.0.0.1 Citation Text Generation

Citation text generation involves models producing text with explicit references to the sources of the information they were conditioned on. [7], [8] first introduced the concept of generating responses with citations to help users quickly verify the sources of information. Subsequent works have explored two main approaches: generating the response and citations simultaneously [9][11], or adding citations in a post-processing phase [12][14]. Citation generation has further been extended to settings such as long-context citation [15] and fine-grained citation [16]. Our work focuses on attributing outputs derived from multimodal information sources. Unlike previous approaches that focus primarily on textual data, our benchmark integrates figure and tabular evidence, aligning it more closely with real-world multimodal citation scenarios.

Figure 2: The construction pipeline of MCiteBench. Initially, we collect multimodal academic papers along with their corresponding review-rebuttal interactions and then parse the papers to extract candidate evidence. GPT-4o is used to extract Explanation QA pairs from the comments and generate Locating QA pairs. Next, human annotators match the references in the answers to the relevant content in the original papers. Finally, the data filtered and labeled by the model is manually verified by human annotators to ensure consistency and accuracy.

2.0.0.2 Multimodal RAG

Multimodal retrieval-augmented generation (mRAG) [17] combines retrieved multimodal information sources with multimodal large language models, helping models answer questions that cannot be addressed by their internal knowledge alone. [3] acquires unknown visual knowledge through web search to aid in answering queries, while [2] builds a self-adaptive retrieval agent to plan the reasoning path. Additionally, [1] improves multi-page and multi-document understanding through multimodal retrieval. While mRAG uses retrieved passages during inference and training, it does not verify whether the generated output faithfully reflects the sources. In contrast, our work focuses on whether the model can correctly attribute its generated outputs to the appropriate information sources when provided with multimodal inputs.

3 MCiteBench↩︎

In this section, we first present the definition of the multimodal citation text generation task, followed by the construction of MCiteBench. The construction pipeline of MCiteBench is illustrated in Figure 2. It consists of four steps: Attribution Corpus Collection, QA Pairs Construction, Evidence Pairing, and Quality Control. First, we gather academic papers as a rich source of multimodal content. Then, we construct QA pairs from review-rebuttal interactions and OCR-processed components, use human annotation to link answers to their supporting evidence, and apply automated filtering followed by manual verification to ensure dataset quality.

3.1 Task Definition↩︎

Given a query \(q\) and a multimodal evidence set \(M\), where \(M\) contains both the ground truth evidence related to \(q\) and distractors, the model is required to generate an answer \(a\) along with a set of citations \(C\). For each sentence \(s_i\) in the answer, the model generates a set of citations \(C_i = \{ c_{i,1}, c_{i,2}, \dots, c_{i,k_i} \}\), where \(k_i\) denotes the number of cited evidence pieces associated with sentence \(s_i\). Each citation \(c_{i,j}\) refers to a specific piece of evidence from the multimodal evidence set \(M\).
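To make the interface concrete, the sketch below spells out these structures in Python: a query \(q\), the evidence items in \(M\), and an answer whose sentences \(s_i\) each carry a citation set \(C_i\). The class and field names are illustrative only, not the benchmark’s released schema.

```python
# Minimal sketch of the task interface; names are illustrative, not the
# benchmark's actual data format.
from dataclasses import dataclass, field


@dataclass
class Evidence:
    """One item of the multimodal evidence set M (ground truth or distractor)."""
    source_id: str   # e.g. "[1]" for a text paragraph, "Figure 3" / "Table 2" otherwise
    modality: str    # "text" | "figure" | "table"
    content: str     # the passage itself, or a path/caption for figures and tables


@dataclass
class CitedSentence:
    """One sentence s_i of the answer together with its citation set C_i."""
    text: str
    citations: list[str] = field(default_factory=list)  # subset of evidence source_ids


@dataclass
class ModelOutput:
    """Answer a generated for query q over the evidence set M."""
    sentences: list[CitedSentence]


# Example output: the second sentence cites a table from M.
output = ModelOutput(sentences=[
    CitedSentence("The method is evaluated on three datasets.", ["[1]"]),
    CitedSentence("It reaches the highest accuracy on the largest one.", ["Table 2"]),
])
```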

3.2 Attribution Corpus Collection↩︎

To evaluate the multimodal citation text generation of MLLMs, an attribution corpus that includes multimodal information sources and allows for easy verification of cited evidence is needed. In MCiteBench, we use academic papers as the attribution corpus because of these characteristics:

  • Academic papers contain rich content from multiple modalities (e.g., text, figure, and table) that individually or collectively support the arguments.

  • The information sources in academic papers are numbered (e.g., “Figure 1”, “Table 2”, and text in “Line 10”), making it easy to match them with the cited results.

  • Academic papers cover the latest content beyond pre-training data, reducing the risk of data leakage.

In MCiteBench, we collect academic papers from OpenReview and employ the MinerU framework [18], a state-of-the-art document extraction tool, to parse and extract multimodal content from these papers. As a result, we obtain a diverse corpus of multimodal content, including over 400k text paragraphs, 40k images, and 9k tables, which serve as candidate evidence. From this corpus, we select a subset as the evidence and distractor sources for our final 3k data samples.

3.3 QA Pairs Construction↩︎

After collecting the attribution corpus, the next step is constructing question-answer pairs with explicit references to the supporting evidence. Establishing a reliable correlation between questions and evidence is challenging, as the source of information must be accurately linked to the generated answers.

We divide MCiteBench data into two categories: Explanation and Locating. Explanation questions require in-depth analysis of evidence and often involve long-form answers, such as “How is the model’s performance evaluated?”, whereas Locating questions are straightforward and can be answered by directly identifying the correct evidence, such as “Which model performs better on the XYZ benchmark, GPT-4o or GPT-4o-mini?”.

For Locating questions, we use GPT-4o to generate structured QA pairs with supporting details. Specifically, we construct QA pairs \(( Q, A)\), where each question \(q_i \in Q\) is formulated based on specific evidence, and each answer \(a_i \in A\) is directly linked to the corresponding source.

However, generating appropriate questions from multiple sources is challenging for MLLMs: they often fail to integrate all the required information and struggle to ensure that a generated question truly requires every selected piece of evidence rather than being answerable from a single one. Therefore, for Explanation questions, we leverage review-rebuttal interactions to construct QA pairs. The review and rebuttal data comprise reviewers’ questions and authors’ responses, where the responses are consistently supported by multiple pieces of evidence from the papers (i.e., the attribution corpus). From these data, we construct QA pairs \(( Q, A)\) by extracting questions \(q_i\) and the corresponding answers \(a_i\) 2.

3.4 Evidence Pairing↩︎

In review-rebuttal interactions, the authors’ responses contain rich evidence to support their arguments. For example, when addressing a reviewer’s concern about model performance, an author might respond, “Our approach achieves 85.2% accuracy, as shown in Table 3 and discussed in Section 4.2.” Therefore, we extract the supportive evidence \(e_i \in E\) from \(a_i \in A\) to construct \(( Q, A, E )\) triplets. While \(E\) provides explicit references (e.g., Table 3 and Section 4.2), these references must be linked to the actual content in the source papers before they can be used as input for MLLMs. To achieve this, human annotators map each reference to its corresponding text, image, or table in the original papers.

3.4.0.1 Distractor Construction.

We construct distractors to assess the model’s ability to accurately cite relevant evidence while ignoring irrelevant information. These distractors are sampled from the same paper, ensuring a balanced distribution of multimodal content (text, images, tables).

After obtaining the distractors, we construct each data sample, which includes a question, the correct answer, multimodal ground truth evidence, and distractors. In MCiteBench, each sample is represented as \((Q, A, E, D)\), where \(Q\) is the question, \(A\) is the correct answer, \(E\) is the evidence and \(D\) is the distractors.
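As an illustration, one sample could be serialized as follows; the field names and contents are hypothetical and only mirror the \((Q, A, E, D)\) structure, not the released data format.

```python
# Hypothetical JSON-style layout of a single MCiteBench sample (Q, A, E, D).
sample = {
    "question": "How is the model's performance evaluated?",                 # Q
    "answer": "Accuracy is reported on three benchmarks, see Table 3 [1].",  # A
    "evidence": [                                                            # E: ground truth evidence
        {"source_id": "[1]", "modality": "text", "content": "We evaluate on ..."},
        {"source_id": "Table 3", "modality": "table", "content": "<table content>"},
    ],
    "distractors": [                                                         # D: sampled from the same paper
        {"source_id": "[2]", "modality": "text", "content": "Related work discusses ..."},
        {"source_id": "Figure 5", "modality": "figure", "content": "<figure image>"},
    ],
}
```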

Table 1: Statistics of MCiteBench.
Statistic Number
Total questions 3,000
- Explanation 2,000
- Locating 1,000
Evidence sources
- Single-source 2,538
- Multi-source 462
Evidence modality
- Text 1,243
- Figure 941
- Table 533
- Mixed 283
Total papers 1,749
Average questions per paper 1.72

3.5 Quality Control↩︎

After constructing \((Q, A, E, D)\), we apply a quality control pipeline that first uses automated filtering followed by human verification.

Initially, GPT-4o assigns quality labels and filters out low-quality samples based on predefined criteria such as relevance, clarity, and evidence alignment. The filtered candidates are then manually verified by annotators to ensure consistency and accuracy, focusing on removing any unclear or incorrect instances 3.
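A hedged sketch of the automated filtering step is shown below; the criteria prompt and the JSON label format are our assumptions for illustration, not the exact prompt used in MCiteBench.

```python
# Sketch of GPT-4o-based quality filtering; kept samples still go to human review.
import json
from openai import OpenAI

client = OpenAI()

def auto_filter(sample_text: str) -> bool:
    """Return True if GPT-4o judges the sample acceptable on the predefined criteria."""
    prompt = (
        "Rate the following QA sample for relevance, clarity, and evidence alignment. "
        'Answer with JSON: {"keep": true} or {"keep": false}.\n\n' + sample_text
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content).get("keep", False)
```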

3.6 Statistics of MCiteBench↩︎

As shown in Table 1, MCiteBench comprises 3,000 data samples for multimodal citation text generation tasks, extracted from 1,749 academic papers with an average of 1.72 questions per paper. Among these, 2,000 are Explanation questions that require detailed evidence analysis and often lead to long-form answers, while 1,000 are Locating questions that focus on direct evidence identification. The evidence is balanced across modalities, with 1,243 textual, 1,474 visual (including 941 figures and 533 tables), and 283 mixed-modality sources, ensuring diverse multimodal attribution scenarios.

4 Evaluation Metrics↩︎

We evaluate the models across three dimensions: citation quality, source reliability, and answer accuracy. Using Citation F1, we assess whether the cited evidence accurately and sufficiently supports the model’s response. Source reliability ensures that the model’s response cites the ground truth sources needed to answer the query; we measure this by comparing the model-generated citations with the ground truth citations, using both Source F1 and Source Exact Match scores. Answer accuracy metrics assess whether the model’s response correctly addresses the query.

Figure 3: The calculation of Citation F1.

4.0.0.1 Citation F1 (C-F1).

Citation quality is evaluated using Citation F1, which measures the alignment between cited evidence and the generated response, ensuring that the response is supported by the cited evidence without including irrelevant ones.

As illustrated in Figure 3, a judge model is employed to verify whether each sentence is properly supported by its corresponding cited evidence. Citation Recall is calculated using a scoring system inspired by LongCite [15], categorizing citations into three levels: No support, Partially supported, and Fully supported, with corresponding scores of 0, 0.5, and 1. In cases where a sentence cites multiple pieces of evidence, we concatenate the information from all cited evidence and judge the entailment relationship against the combined content.

Citation Precision is determined on a binary scale: each cited piece of evidence is scored as relevant (1) or irrelevant (0) to the sentence that cites it. For sentences citing multiple sources, the final precision score is the average across all cited evidence. Finally, Citation F1 is computed as the harmonic mean of Recall and Precision, providing a balanced measure of the model’s citation quality.
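The computation can be summarized in a short sketch. The two judge calls are abstracted into callbacks (recall_level returning 0, 0.5, or 1 for a sentence against its concatenated citations, and is_relevant returning 0 or 1 for a single cited item); treating uncited sentences as zero-recall is our assumption for the sketch.

```python
# Sketch of Citation F1: macro-averaged recall and precision over answer
# sentences, combined with a harmonic mean.
def citation_f1(sentences, recall_level, is_relevant):
    """sentences: list of (sentence_text, [cited_evidence, ...]) pairs."""
    recalls, precisions = [], []
    for text, cited in sentences:
        if not cited:                       # assumption: uncited sentences count as no support
            recalls.append(0.0)
            continue
        combined = "\n".join(cited)         # concatenate multi-evidence citations
        recalls.append(recall_level(text, combined))                    # 0 / 0.5 / 1
        precisions.append(sum(is_relevant(text, e) for e in cited) / len(cited))
    recall = sum(recalls) / len(recalls) if recalls else 0.0
    precision = sum(precisions) / len(precisions) if precisions else 0.0
    if recall + precision == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)                # harmonic mean
```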

Figure 4: The calculation of Source F1 and Source Exact Match.

Table 2: Main results on MCiteBench. The highest score is highlighted in bold, and the second highest score is underlined. C-F1, S-F1, and S-EM represent Citation F1, Source F1, and Source Exact Match scores, respectively. Acc stands for Accuracy.
Models                  Explanation (Single-Source)     Explanation (Multi-Source)      Locating (Single-Source)
                        C-F1   S-F1   S-EM   Acc        C-F1   S-F1   S-EM   Acc        C-F1   S-F1   S-EM   Acc
Open-Source Models (7-14B)
LLaVA-OV-7B 19.93 10.84 5.34 47.79 31.14 22.48 1.26 49.68 26.31 20.93 11.63 60.10
LLaVA-OV-7B-Chat 28.77 13.90 1.43 47.76 35.74 29.82 3.00 49.78 29.58 23.33 4.05 53.85
MiniCPM-V-2.6 49.12 35.23 22.81 51.30 57.90 41.74 5.88 52.60 47.93 52.73 42.94 83.55
Qwen2-VL-7B 58.46 42.98 35.36 51.59 58.64 36.62 2.36 53.03 53.99 54.71 46.32 87.45
InternVL2.5-8B 58.47 45.13 33.45 51.53 63.97 45.50 9.86 52.92 55.94 64.17 56.33 83.90
Llama-3.2-Vision-11B 19.65 14.06 9.60 48.63 31.16 25.87 1.22 49.35 26.56 16.56 11.80 61.40
Open-Source Models (>70B)
Qwen2-VL-72B 53.60 44.81 32.01 52.60 64.66 50.53 8.96 52.38 58.75 68.86 61.48 90.25
InternVL2.5-78B 54.52 42.44 25.40 52.34 71.03 57.65 16.86 54.87 50.57 57.60 52.20 90.10
Llama-3.2-Vision-90B 35.33 28.05 12.30 50.00 46.08 46.73 10.35 51.41 43.69 49.07 32.83 74.75
Proprietary Models
GPT-4o-mini 43.99 34.42 15.48 52.08 57.81 50.22 8.39 54.22 53.71 58.57 46.56 88.50
GPT-4o 84.24 56.82 24.50 54.32 89.19 67.56 21.27 56.60 91.45 85.74 69.45 90.45

4.0.0.2 Source F1 (S-F1).

As shown in Figure 4, Source F1 measures the alignment between citations in the model’s response and ground truth citations, evaluating whether the model cites evidence that aids in answering the query.

We first split the model-generated responses into sentence-citation pairs \((s_i, C_i)\) using GPT-4o. These sentence-level citations are then aggregated to form response-level citations, which are compared against the ground truth. Precision and recall are calculated as follows:

\[\text{Source Precision} = \frac{|C_{\text{pred}} \cap C_{\text{gt}}|}{|C_{\text{pred}}|},\]

\[\text{Source Recall} = \frac{|C_{\text{pred}} \cap C_{\text{gt}}|}{|C_{\text{gt}}|},\]

We then compute Source F1 as the harmonic mean of Source Recall and Source Precision. \(C_{\text{pred}}\) represents the set of citations generated by the model, and \(C_{\text{gt}}\) denotes the ground truth citations; the intersection \(C_{\text{pred}} \cap C_{\text{gt}}\) counts the correctly cited evidence.

4.0.0.3 Source Exact Match (S-EM).

The Source Exact Match metric provides a stricter evaluation, indicating whether the model’s response-level citation is the same as the ground truth.

\[\text{Source EM} = \begin{cases} 1, & \text{if } C_{\text{pred}} = C_{\text{gt}} \\ 0, & \text{otherwise} \end{cases}\]
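Both source-level metrics reduce to set operations over response-level citation sets, as in the sketch below.

```python
# Source Precision/Recall/F1 and Source Exact Match over citation sets.
def source_scores(pred_citations, gt_citations):
    """pred_citations / gt_citations: iterables of source identifiers, e.g. "[1]", "Figure 3"."""
    pred, gt = set(pred_citations), set(gt_citations)
    overlap = len(pred & gt)
    precision = overlap / len(pred) if pred else 0.0
    recall = overlap / len(gt) if gt else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"S-F1": f1, "S-EM": int(pred == gt)}


# Example: the model cites [1] and Figure 3, while the ground truth is [1] and Table 2.
print(source_scores({"[1]", "Figure 3"}, {"[1]", "Table 2"}))   # {'S-F1': 0.5, 'S-EM': 0}
```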

4.0.0.4 Accuracy (Acc).

Answer accuracy is evaluated using the LLM-as-Judge [19], [20] approach for both Explanation and Locating questions. The judge model compares the model’s response with the reference answer based on criteria specific to the question type, and the scores are then normalized. For Explanation questions, direct comparison with a ground truth answer is not feasible, so we use the authors’ responses as the reference and employ a judge model to evaluate the generated answers for relevance, logical consistency, and fluency. For Locating questions, this evaluation method mitigates errors caused by minor formatting differences 4.

5 Experiments↩︎

5.1 Evaluation Settings↩︎

5.1.0.1 Implementation Details.

In this work, citations to textual sources are indicated with square brackets (e.g., [1][2]), while for figures and tables, we adopt the index provided in their captions (e.g., Figure 3, Table 2). For both single-source and multi-source evidence questions, the multimodal corpus \(M\) comprises 5 items, including the ground truth evidence and distractors. Distractors are randomly selected from other content within the same paper.
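Under this format, response-level citations can be collected with a simple pattern match; the regular expression below is an illustrative assumption, not the benchmark’s evaluation code.

```python
# Extract citation markers such as "[1]", "Figure 2", or "Table 3" from a response.
import re

CITATION_PATTERN = re.compile(r"\[\d+\]|(?:Figure|Table)\s+\d+")

def extract_citations(response: str) -> set[str]:
    return set(CITATION_PATTERN.findall(response))

print(extract_citations("The accuracy improves with scale [1][3], as shown in Figure 2."))
# e.g. {'[1]', '[3]', 'Figure 2'} (set order may vary)
```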

5.1.0.2 Judge Model.

In this study, we use GPT-4o to assess the entailment relationship between model responses and their cited evidence 5.

5.1.0.3 Model Choice.

Among the open-source models, we choose InternVL2.5 (8B and 78B) [21], Qwen2-VL (7B and 72B) [22], Llama 3.2-Vision (11B and 90B) [23], as well as LLaVA-OneVision-7B (and its chat variant) and MiniCPM-V-2.6 [24]. For proprietary models, we test GPT-4o (GPT-4o-2024-11-20) and GPT-4o-mini (GPT-4o-mini-2024-07-18) [25].

5.2 Main Results↩︎

As shown in Table 2, smaller open-source models achieve lower Citation F1 scores and struggle to select evidence that adequately supports their responses. Furthermore, they also perform poorly in selecting evidence that directly answers the query, as shown by their low Source F1 and Source Exact Match scores. As model size increases, we observe an improvement in citation performance, suggesting that scaling model size enhances attribution capability. In comparison, GPT-4o achieves an 84.24% Citation F1 score on single-source Explanation questions, demonstrating strong citation quality. However, it struggles with source reliability, with Source Exact Match scores remaining low at 24.50% for single-source and 21.27% for multi-source settings. This indicates that even state-of-the-art models struggle to consistently cite evidence that is directly relevant to answering the query, underscoring the difficulty of precise citation in multimodal contexts.

5.2.0.1 Does Question Difficulty Influence Model Citation Performance?

Model performance reflects the difficulty of the questions: accuracy is higher on Locating questions than on Explanation questions, indicating that Explanation questions are more challenging. As shown in Table 2, as question difficulty increases, citation performance tends to decrease. For instance, GPT-4o achieves 85.74% Source F1 on single-source Locating questions but drops to 56.82% on single-source Explanation questions. Explanation questions place higher demands on citation generation, as they require in-depth analysis of the inputs.

5.2.0.2 Can MLLMs Cite Accurately in Multi-Source Scenarios?

In multi-source scenarios, models achieve higher Citation F1 and Source F1 scores, as the ground truth citations contain more items. However, the stricter Source Exact Match metric is lower than in single-source scenarios. This highlights the challenge of citing in multi-source scenarios, where models must correctly include relevant sources while avoiding irrelevant ones.

Figure 5: Results of MCiteBench across different modalities. The Source Exact Match (S-EM) score indicates whether the model’s response-level citation is the same as the ground truth.

5.3 Analysis↩︎

In this section, we discuss several research questions, revealing the inherent biases and bottlenecks in the task.

5.3.0.1 RQ1: Do Models Prefer Specific Modalities for Citation?

We analyze model performance on instances where the evidence comes from mixed modalities. The number of evidence pieces is set to 2, and we compare against single-modality data with the same number of evidence pieces. As shown in Figure 5, most models achieve high Source EM scores when the ground truth evidence is textual but perform poorly when it is visual. This suggests that although MLLMs can process multimodal inputs, they are better at aligning with textual evidence than at accurately citing visual information when generating responses.

To further investigate this, we analyze MLLMs’ attention patterns when processing mixed-modality inputs. Using Qwen2-VL-7B as the test model, we calculate the attention distribution across multimodal inputs by averaging attention scores over heads and normalizing by the token length of each input source at different layers. As shown in Figure 6, the model allocates less attention to visual inputs than to text. Textual information maintains consistently high attention throughout, accounting for 83.7% in early layers and 77.5% in later layers. This indicates that while the model processes all modalities, it prioritizes textual content and utilizes it more effectively than visual data.
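A hedged sketch of this measurement is shown below. It assumes per-layer attention tensors as returned by Hugging Face models with output_attentions=True (shape [batch, heads, seq_len, seq_len]) and a precomputed token-index span for each input source; the span bookkeeping and normalization details are our assumptions.

```python
# Average attention over heads and query positions, then compare the
# length-normalized attention mass received by each input source.
def source_attention_share(attentions, source_spans, layer):
    """attentions: tuple of torch tensors [batch, heads, seq, seq];
    source_spans: dict mapping source name -> (start, end) token indices."""
    attn = attentions[layer][0].mean(dim=0)   # average over heads -> [seq, seq]
    attn = attn.mean(dim=0)                   # average over query positions -> [seq]
    per_token = {
        name: attn[start:end].sum().item() / max(end - start, 1)
        for name, (start, end) in source_spans.items()
    }
    total = sum(per_token.values()) or 1.0
    return {name: share / total for name, share in per_token.items()}
```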

Figure 6: Attention distribution across multimodal sources. Visual sources include figures and tables.

5.3.0.2 RQ2: How Do Models Allocate Attention When Generating Source Citations?

Figure 7: Attention heatmap for ground truth and distractors. The heatmap shows the model’s attention distribution when predicting the next token of “Logic-LM has higher accuracy. According to Figure \(\Diamond\)” in its response. Although the model answers correctly, its attention in the distractors remains focused on index positions (e.g., [1], Figure 2).

Correctly generating source-identifying tokens (e.g., “[1]”, “Figure 2”) leads to better performance and higher attribution scores. To better understand how models process ground truth evidence and distractors, we analyze their attention distribution when generating source-identifying tokens.

5.3.0.3 Settings

We analyze the attention distribution of Qwen2-VL-7B when generating such tokens. Specifically, we focus on its behavior when predicting the next token after “Logic-LM has higher accuracy. According to Figure \(\Diamond\)” in its response. Notably, the distractors are sampled from unrelated papers, meaning they provide no useful information for answering the question.

5.3.0.4 Results

As shown in Figure 7, the model’s attention heatmap reveals an intriguing pattern: even when the response is based entirely on a specific piece of evidence, the model’s attention is not solely focused on it. When generating the token after “According to Figure”, the model’s attention remains high on textual index positions (e.g., “[1]”, “[2]”), even though the context suggests the model should focus on figure evidence. This suggests that while the model correctly cites the source, it maintains a broader contextual awareness by attending to multiple potential pieces of evidence.

Table 3: Comparison of model performance on Understanding and Attribution tasks. Both tasks share the same input sources, but differ in the nature of the questions: one requires answering a question, while the other asks which source can help answer the question.
Model Understanding Attribution
Open-Source(7-14B)
Qwen2-VL-7B 0.94 0.59
InternVL2.5-8B 0.91 0.56
Open-Source(>70B)
Qwen2-VL-72B 0.95 0.62
InternVL2.5-78B 0.93 0.62
Proprietary
GPT-4o-mini 0.93 0.61
GPT-4o 0.94 0.63

5.3.0.5 RQ3: Understanding or Attribution: What is the Bottleneck?

Insights from RQ2 show that even when models generate responses based on the correct evidence, they still focus heavily on distractors. This suggests a possible gap between understanding the content and correctly attributing it. This raises a critical question: Does the bottleneck in multimodal citation text generation stem from limitations in understanding or attribution?

On one hand, multimodal understanding requires the model to grasp the meaning of the inputs and use them to generate a coherent response. On the other hand, multimodal attribution demands that the model’s generated response can be traced back to specific, verifiable sources in the input.

5.3.0.6 Settings

We design multiple-choice questions for understanding and attribution. We construct single-source Locating QA pairs and sample distractors from unrelated papers, ensuring that the questions remain straightforward to answer. Understanding questions use a 4-option multiple-choice format, while attribution questions use a 5-option multiple-choice format, with each option corresponding to one of the five information sources in the input.

In the attribution task, models select a source identifier (e.g., “[1]”, “Figure 2”) from a fixed set of options, directly reflecting the models’ internal computation during citation generation, where source-identifier tokens are selected according to the probability of the next token.
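The sketch below illustrates this next-token scoring on a text-only causal language model; the model checkpoint and prompt format are assumptions for the example rather than the exact multimodal setup used with Qwen2-VL-7B.

```python
# Score multiple-choice options by the probability of their identifier as the next token.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen2-7B-Instruct"   # assumed text-only stand-in for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

prompt = "Which source helps answer the question? Options: A, B, C, D, E.\nAnswer: "
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]   # logits for the next token

options = ["A", "B", "C", "D", "E"]
option_ids = [tokenizer.encode(o, add_special_tokens=False)[0] for o in options]
probs = next_token_logits[option_ids].softmax(dim=-1)
print(options[int(probs.argmax())], probs.tolist())
```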

5.3.0.7 Results

Results in Table 3 show that while models achieve over 90% accuracy on understanding questions, they perform worse on attribution questions. This suggests that the bottleneck lies not in multimodal understanding but in multimodal attribution. This finding underscores a fundamental limitation in current multimodal models: they can process and understand multimodal inputs well but struggle to attribute outputs to the correct evidence accurately.

6 Conclusion↩︎

In this paper, we introduce MCiteBench for evaluating multimodal citation text generation in MLLMs. We build MCiteBench using academic papers and their review-rebuttal interactions, ensuring a high-quality benchmark for evaluating multimodal citation text generation. Leveraging this benchmark, we conduct a detailed evaluation of model performance across multiple dimensions, including source reliability, answer accuracy, and citation quality. Through extensive experiments, we find that existing models struggle to accurately attribute their outputs to the correct multimodal sources. Furthermore, we dive deep into the analysis of attention distribution during citation generation and uncover biases and bottlenecks that hinder accurate attribution. We hope that MCiteBench provides valuable insights for citation text generation tasks and contributes to the development of models capable of generating faithful and verifiable responses.

Limitations↩︎

In MCiteBench, we construct multi-level questions and build an evaluation pipeline for multimodal inputs. However, the current design has limitations in citation granularity. First, citations are limited to the sentence level, meaning that we do not distinguish between multiple claims within a single sentence. For example, if a sentence contains multiple claims supported by different evidence, we treat it as a full sentence-level citation. Second, MCiteBench treats subfigures or subtables (e.g., Figure 1a, 1b) as part of the entire figure or table, without distinguishing between them. These limitations highlight areas for future improvement in handling fine-grained attribution tasks.

7 Prompt Design↩︎

7.1 Data Processing Prompts↩︎

We list the prompts used for extracting Explanation QA and generating Locating QA in Tables [prompt:extract_prompt] and [prompt:generate_qa_prompt].

7.2 Evaluation Metric Prompts↩︎

We list the prompts used for evaluating citation recall, citation precision, and the accuracy of Explanation and Locating questions in Tables [prompt:eval_citation_recall], [prompt:eval_citation_precision], [prompt:eval_explanation_questions_acc], and [prompt:eval_locating_questions_acc].

You are an expert in evaluating text quality. You will receive a statement from an AI assistant’s response based on a paper, along with a part from the document (which could be a text paragraph, image, or table). Your task is to carefully assess whether this statement is supported by the provided part. Please use the following scale to generate your rating:
0: No support — The statement is largely unrelated to the provided part (text, image, or table), or most key points in the statement do not align with the content of the part.
1: Partially supported — More than half of the content in the statement is supported by the part, but a small portion is either not mentioned or contradicts the part.
2: Fully supported — Most information in the statement is supported by or extracted from the part. This applies only to cases where the statement and the part are almost identical.
Ensure that you do not use any information or knowledge outside of the provided part when evaluating. Please return only the rating in JSON format, with 0, 1, or 2.
Statement: {sentence}

You are an expert in evaluating text quality. You will receive a statement from an AI assistant’s response based on a paper, along with a part from the document (which could be a text paragraph, image, or table). Your task is to carefully assess whether the provided part contains some key information of the statement. Please use the following scale to generate your rating:
0: Irrelevant — The statement is almost unrelated to the provided part, or all key points of the statement are inconsistent with the provided part.
1: Relevant — Some key points of the statement are supported by or extracted from the provided part.
Ensure that you do not use any information or knowledge outside of the provided part when evaluating. Please return only the rating in JSON format, with 0 or 1.
Statement: {sentence}

8 Human Evaluation↩︎

8.1 Evidence Pairing↩︎

Human annotators map each reference to its corresponding content using the GUI shown in Figure 8.

Figure 8: GUI screenshot for human annotators to map each reference to its corresponding content.

8.2 Quality Control↩︎

Human annotators verify data quality and filter out bad cases using the GUI shown in Figure 9.

Figure 9: GUI screenshot for verifying filtered QA.

8.3 Agreement Between Human Annotations and GPT-4o↩︎

To verify the accuracy of our evaluation pipeline, we conducted a manual annotation study on 75 model-generated responses, comprising 25 objective questions and 50 subjective questions, resulting in over 457 entailment judgments. We then compared these human annotations with the entailment judgments produced by GPT-4o. As shown in Table 4, the results indicate a high degree of agreement between human annotations and GPT-4o’s predictions, demonstrating the reliability and correctness of our pipeline. The annotation GUI is shown in Figure 10.

Table 4: Entailment Judgment Alignment: Model vs. Human Ground Truth
          Subjective                     Objective
Model     F1     Recall   Precision     F1     Recall   Precision
GPT-4o 0.80 0.80 0.79 0.82 0.81 0.83

Figure 10: GUI screenshot for human-annotated entailment verification.

References↩︎

[1]
Jaemin Cho, Debanjan Mahata, Ozan Irsoy, Yujie He, and Mohit Bansal. 2024. M3docrag: Multi-modal retrieval is what you need for multi-page multi-document understanding. arXiv preprint arXiv:2411.04952.
[2]
Yangning Li, Yinghui Li, Xingyu Wang, Yong Jiang, Zhen Zhang, Xinran Zheng, Hui Wang, Hai-Tao Zheng, Philip S Yu, Fei Huang, et al. 2024. Benchmarking multimodal retrieval augmented generation with dynamic vqa dataset and self-adaptive planning agent. arXiv preprint arXiv:2411.02937.
[3]
Zhixin Zhang, Yiyuan Zhang, Xiaohan Ding, and Xiangyu Yue. 2024. Vision search assistant: Empower vision-language models as multimodal search engines. arXiv preprint arXiv:2410.21220.
[4]
Lei Huang, Weijiang Yu, Weitao Ma, Weihong Zhong, Zhangyin Feng, Haotian Wang, Qianglong Chen, Weihua Peng, Xiaocheng Feng, Bing Qin, et al. 2023. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232.
[5]
Zechen Bai, Pichao Wang, Tianjun Xiao, Tong He, Zongbo Han, Zheng Zhang, and Mike Zheng Shou. 2024. Hallucination of multimodal large language models: A survey. arXiv preprint arXiv:2404.18930.
[6]
Tinghui Zhu, Qin Liu, Fei Wang, Zhengzhong Tu, and Muhao Chen. 2024. Unraveling cross-modality knowledge conflicts in large vision-language models. arXiv preprint arXiv:2410.03659.
[7]
Tianyu Gao, Howard Yen, Jiatong Yu, and Danqi Chen. 2023. Enabling large language models to generate text with citations. arXiv preprint arXiv:2305.14627.
[8]
Nelson F Liu, Tianyi Zhang, and Percy Liang. 2023. Evaluating verifiability in generative search engines. arXiv preprint arXiv:2304.09848.
[9]
Haolin Deng, Chang Wang, Xin Li, Dezhang Yuan, Junlang Zhan, Tianhua Zhou, Jin Ma, Jun Gao, and Ruifeng Xu. 2024. Webcites: Attributed query-focused summarization on chinese web search results with citations. arXiv preprint arXiv:2403.01774.
[10]
Rami Aly, Zhiqiang Tang, Samson Tan, and George Karypis. 2024. Learning to generate answers with citations via factual consistency models. arXiv preprint arXiv:2406.13124.
[11]
Chengyu Huang, Zeqiu Wu, Yushi Hu, and Wenya Wang. 2024. Training language models to generate text with citations via fine-grained rewards. arXiv preprint arXiv:2402.04315.
[12]
Aviv Slobodkin, Eran Hirsch, Arie Cattan, Tal Schuster, and Ido Dagan. 2024. Attribute first, then generate: Locally-attributable grounded text generation. arXiv preprint arXiv:2403.17104.
[13]
Weitao Li, Junkai Li, Weizhi Ma, and Yang Liu. 2024. Citation-enhanced generation for llm-based chatbot. arXiv preprint arXiv:2402.16063.
[14]
Sirui Xia, Xintao Wang, Jiaqing Liang, Yifei Zhang, Weikang Zhou, Jiaji Deng, Fei Yu, and Yanghua Xiao. 2024. Ground every sentence: Improving retrieval-augmented llms with interleaved reference-claim generation. arXiv preprint arXiv:2407.01796.
[15]
Jiajie Zhang, Yushi Bai, Xin Lv, Wanjun Gu, Danqing Liu, Minhao Zou, Shulin Cao, Lei Hou, Yuxiao Dong, Ling Feng, et al. 2024. Longcite: Enabling llms to generate fine-grained citations in long-context qa. arXiv preprint arXiv:2409.02897.
[16]
Yilong Xu, Jinhua Gao, Xiaoming Yu, Baolong Bi, Huawei Shen, and Xueqi Cheng. 2024. Aliice: Evaluating positional fine-grained citation generation. arXiv preprint arXiv:2406.13375.
[17]
Ruochen Zhao, Hailin Chen, Weishi Wang, Fangkai Jiao, Xuan Long Do, Chengwei Qin, Bosheng Ding, Xiaobao Guo, Minzhi Li, Xingxuan Li, et al. 2023. Retrieving multimodal information for augmented generation: A survey. arXiv preprint arXiv:2303.10868.
[18]
Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, et al. 2024. Mineru: An open-source solution for precise document content extraction. arXiv preprint arXiv:2409.18839.
[19]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. 2023. Judging llm-as-a-judge with mt-bench and chatbot arena. Advances in Neural Information Processing Systems, 36:46595–46623.
[20]
Xiao Liu, Xuanyu Lei, Shengyuan Wang, Yue Huang, Zhuoer Feng, Bosi Wen, Jiale Cheng, Pei Ke, Yifan Xu, Weng Lam Tam, et al. 2023. Alignbench: Benchmarking chinese alignment of large language models. arXiv preprint arXiv:2311.18743.
[21]
Zhe Chen, Weiyun Wang, Yue Cao, Yangzhou Liu, Zhangwei Gao, Erfei Cui, Jinguo Zhu, Shenglong Ye, Hao Tian, Zhaoyang Liu, et al. 2024. Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling. arXiv preprint arXiv:2412.05271.
[22]
Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, et al. 2024. Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191.
[23]
Meta AI. 2024. Llama 3.2: Revolutionizing edge ai and vision with open, customizable models. Meta AI Blog. Retrieved December 20, 2024.
[24]
Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. 2024. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800.
[25]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276.

  1. Corresponding author.↩︎

  2. Details of prompt design and reference extraction strategies are in Appendix 7.1↩︎

  3. The human filter process is supported by a dedicated GUI interface, as shown in Appendix [app:quality_control].↩︎

  4. Detailed scoring criteria and judgment prompts are provided in the Appendix [app:evaluation_metric_prompt].↩︎

  5. We validate GPT-4o’s reliability in Appendix 8.3.↩︎