October 23, 2025
Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves both error correction and cost-efficiency compared to large proprietary models or training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict.
Recent advances in large vision-language models (VLMs) have delivered impressive performance on tasks such as image captioning and general visual question answering (VQA) [1], [2]. However, these models encounter challenges in information-intensive images that densely interleave diverse textual annotations (legends, labels, captions) with fine-grained graphical elements (charts, diagrams, plots) across multiple scales and formats [3]. Addressing this task requires two interdependent capabilities (Figure 1; [4]): (i) comprehensive and precise localization, which involves not only pinpointing the exact positions of critical cues in densely populated layouts but also ensuring that all query-relevant regions are identified; (ii) multi-hop reasoning, which chains visual analysis—encompassing colors, shapes, and spatial relationships—with textual evidence, thereby integrating dispersed cues into a coherent and complete answer. As each reasoning step builds on the accuracy of the previous one, any intermediate error can propagate through the entire chain, making the overall process highly error-sensitive and difficult to correct retrospectively. Existing work tackles information-intensive visual reasoning with search-based zoom-in pipelines that enlarge local regions for detailed reasoning. Learning-based methods train reinforcement learning policies to guide zoom operations iteratively [5]–[8], but improving their performance demands costly fine-grained supervision. Training-free methods instead perform cropping based on internal attention or confidence scores [9]–[11]; yet in dense layouts, we find these signals correlate weakly with true relevance, misleading the model into visually similar but irrelevant areas. Consequently, these tool-driven designs fail to capture all evidence for multi-hop reasoning, leaving the core challenges of information-intensive visual reasoning unsolved.
To overcome these limitations, we propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines small draft visual experts with a large verdict model [12]. The framework operates in two stages (Figure 2): (1) Draft stage: multiple lightweight VLMs serve as draft experts, each generating a reasoning path that offers diverse localization candidates; (2) Verdict stage: a large VLM acts as a strong verdict, which receives the reasoning paths as contextual evidence, distinguishes the correct information, and outputs the final answer. SV directly tackles core challenges through complementary strengths: draft experts expand evidence coverage across scattered regions, while the verdict prevents error propagation by synthesizing these multiple perspectives. Importantly, unlike using a large proprietary model to reason over every image section, SV invokes the verdict only once to yield a concise final answer, thereby minimizing computational cost while effectively recovering correct answers. To further balance accuracy and efficiency, SV introduces a consensus expert selection mechanism in the draft stage, ensuring that only reasoning paths with strong agreement are forwarded to the verdict.
We evaluate Speculative Verdict on information-intensive VQA benchmarks, including InfographicVQA [13], ChartMuseum [14], and ChartQAPro [15], which demand reasoning over dense textual and visual content. As a training-free framework, SV consistently outperforms strong open-source models, large proprietary models, and perception-focused search methods while remaining cost-efficient. In particular, SV yields average gains of 4% over small VLMs as draft experts and 10% over GPT-4o [16] as verdict. Beyond overall gains, SV successfully corrects 47-53% of cases where majority voting or the verdict model alone fails, thereby reducing vulnerability to error propagation in information-intensive visual reasoning. Furthermore, SV surpasses all baselines on HR-Bench 4K [17], a benchmark for high-resolution visual perception, underscoring its effectiveness in challenging multimodal reasoning scenarios.
Vision-Language Model Reasoning with Tools. Recent research has explored enhancing VLM perception by manipulating input images with zooming operations to locate relevant regions [18]. (1) Prompting-based methods exploit internal signals of VLMs to decide where to zoom. ViCrop [9] leverages models’ attention maps to highlight query-related regions, thereby generating automatic visual crops. Other works perform tree-based search, where models evaluate candidate sub-images with confidence scores to iteratively narrow down to relevant regions [10], [11]. However, such signals align poorly with the required evidence in information-intensive images, since queries often require reasoning across multiple dispersed regions. (2) Reinforcement learning approaches instead optimize policies that interleave visual zooming with textual reasoning [5]–[8]. By calling zooming tools within the agentic framework, these methods adaptively crop regions and concatenate them into the reasoning trajectory, enabling more active evidence gathering. Yet these methods still fall short on information-intensive images, requiring costly task-specific training to scale.
Speculative Decoding. Speculative decoding is a draft-then-verify decoding paradigm to accelerate LLM inference [19]. Specifically, it utilizes a draft model to generate future tokens, and a larger target model verifies them via parallel rejection sampling. Beyond the vanilla setting, recent work extends acceptance from token-level equivalence to step-level semantic similarity to speed up reasoning [20]–[23]. Collaborative decoding via Speculation [24] further applies speculative decoding with multiple draft LLMs by verifying proposals against a combined distribution of the drafts and the target, yielding greater speedups than standard ensembling. However, these adaptations primarily target speed in LLM inference and do not address the challenges of vision-language reasoning.
Large Language Model Ensemble. Majority voting aggregates answers by frequency, but fails when the correct solution is produced by a minority. Universal Self-Consistency [25] mitigates this failure mode by prompting the LLM to select the most consistent candidate across samples. Further, learned aggregators read multiple rationales and synthesize them to recover minority-correct information [26], [27]. However, these approaches focus on text-only ensembling. In vision-language reasoning, supervision of ensembling is not cost-effective since multimodal complexity requires costly, fine-grained annotations.
Speculative decoding is an inference‐time optimization originally developed to mitigate the latency of autoregressive generation [12]. The approach employs a draft-then-verify paradigm: (i) a small, fast draft model proposes one or more future tokens speculatively, and (ii) a large, accurate base model verifies these proposals in parallel, accepts or revises the proposals, and generates output that is consistent with the base model’s distribution [19], [28]. This token-level process speeds up inference by committing several tokens at once, while maintaining quality by discarding continuations that diverge from the base model’s distribution.
The key insight is that draft models expand coverage quickly, while the verifier ensures correctness. Although this idea has been mainly applied to accelerate text generation, its high-level principle is also well-suited for information-intensive multimodal reasoning.
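To make the draft-then-verify rule concrete, the following is a minimal sketch of the token-level accept/reject step, assuming the per-position draft and target distributions have already been computed in a single parallel pass; it illustrates the standard speculative sampling scheme rather than the exact implementation of [19], [28].

```python
import numpy as np

def speculative_step(draft_probs, target_probs, draft_tokens, rng):
    """One draft-then-verify step (illustrative sketch).

    draft_probs[t][v]  : draft model's probability of token v at position t
    target_probs[t][v] : target model's probability of token v at position t
    draft_tokens[t]    : token proposed by the draft at position t
    """
    accepted = []
    for t, tok in enumerate(draft_tokens):
        p, q = target_probs[t][tok], draft_probs[t][tok]
        if rng.random() < min(1.0, p / q):
            # Accept: the target finds the speculated token plausible enough.
            accepted.append(tok)
        else:
            # Reject: resample from the residual distribution max(0, p - q), renormalized,
            # and discard the remaining speculated tokens.
            residual = np.maximum(target_probs[t] - draft_probs[t], 0.0)
            residual /= residual.sum()
            accepted.append(rng.choice(len(residual), p=residual))
            break
    return accepted
```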
Information-intensive visual question answering (VQA) requires models to localize query-relevant regions, perceive diverse fine-grained textual and visual details, and integrate dispersed evidence into a single correct answer. These tasks are highly error-sensitive as elaborated in Section 1: a single misread or mislocalized element often leads to a completely wrong prediction.
To address this challenge, we adapt the draft-then-verify paradigm of speculative decoding to multimodal reasoning. Unlike its original use for inference acceleration, we repurpose the paradigm to improve robustness and error correction in information-intensive visual reasoning. On a high level, our Speculative Verdict (SV) framework operates in two stages (Figure 2):
(i) Draft stage, where multiple lightweight VLMs are selected as draft experts to provide diverse reasoning paths (Section 3.2);
(ii) Verdict stage, where a large VLM acts as verdict to verify, refine, and synthesize these reasoning paths into the final prediction (Section 3.3).
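A minimal end-to-end sketch of this two-stage pipeline is shown below; function names such as `consensus_select`, `generate_reasoning_path`, and `verdict_model.answer` are placeholders for the procedures detailed in Sections 3.2-3.4, not the released implementation.

```python
def speculative_verdict(image, question, candidate_vlms, verdict_model, m=3):
    """Training-free draft-then-verdict pipeline (high-level sketch)."""
    # Draft stage: pick m high-consensus experts; each produces a CoT reasoning path.
    experts = consensus_select(candidate_vlms, image, question, m=m)        # Section 3.4
    paths = [generate_reasoning_path(e, image, question) for e in experts]  # Section 3.2

    # Verdict stage: a single call to the strong model synthesizes the paths.
    context = "\n\n".join(f"Expert {i+1}:\n{p}" for i, p in enumerate(paths))
    return verdict_model.answer(image, question, context)                   # Section 3.3
```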
Chain-of-Thought (CoT) prompting exposes models’ intermediate reasoning steps in an explicit, stepwise form [29]. This is critical for information-intensive VQA, where solving a question requires a sequence of localization, evidence extraction, and analytic operations (Figure 1). However, current VLMs often lack fine-grained perception and localization on densely annotated images, and existing tool-driven zoom-in methods are ineffective as elaborated in Section [work:tool]. We therefore utilize multiple VLMs to produce reasoning paths rather than a single direct answer, so that the subsequent verdict can verify and synthesize structured evidence. Concretely, given an image-question pair \((x,q)\), we select \(m\) lightweight VLMs \(\{M_1,\dots,M_m\}\) as draft experts from a pool of \(k\) candidate VLMs via a consensus-based selection mechanism (detailed in Section 3.4). Each selected expert \(M_i\) is then prompted with a CoT template to output a reasoning path \(r_i\).
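For intuition, a draft-expert prompt of roughly the following shape could be used; this is an illustrative paraphrase only, and the exact templates used in our experiments are those shown in Figure 9.

```python
# Illustrative CoT template for a draft expert (not the exact template in Figure 9).
COT_TEMPLATE = (
    "You are analyzing an information-dense image.\n"
    "Question: {question}\n"
    "Think step by step: (1) locate the regions relevant to the question, "
    "(2) extract the textual and visual evidence from those regions, "
    "(3) reason over the extracted evidence.\n"
    "End with 'Final answer: <answer>'."
)
```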
We observe that each reasoning path \(r_i\) provided by draft experts typically includes: (i) global scan and localization proposals that identify query-related regions, sections, or subplots, often referencing axes, titles, or captions; (ii) evidence extraction, which transforms visual or textual elements into structured cues, including reading legends, mapping colors to series, parsing axis labels, or assembling lists of values or tokens for subsequent operations; (iii) analytic and reasoning operations, which operate over the extracted cues to derive higher-level conclusions, such as filtering or selecting relevant entities, computing differences, sorting across panels, and cross-referencing dispersed cues. As shown in the running case (Figure 3), different experts may match legends to charts differently; some correctly gather the required cues while others misread adjacent values. This diversity yields a complementary but potentially noisy pool of reasoning signals.
The set \(\{r_i\}\) captures diverse cues, offering richer evidence but also introducing contradictions, which motivates the need for a verdict stage to verify and integrate them. Answer-level ensembling (e.g., majority voting) often fails in minority-correct scenarios where many experts converge on the same incorrect decision, such as mislocalizing the query-related region or misreading fine-grained textual details, even after correct localization. This failure mode is frequently observed in information-intensive reasoning (as illustrated in Figure 3). Rather than discarding minority opinions through majority voting, we leverage a stronger model as a verdict to validate grounding, resolve conflicts, and synthesize coherent reasoning from the draft paths.
Specifically, given the image-question pair \((x,q)\) and the drafts’ reasoning paths \(\{r_i\}_{i=1}^m\), we prompt the verdict model \(J\) with: (i) the original image \(x\) as visual input, and (ii) a textual prompt containing the question \(q\) and the concatenated reasoning paths \(\{r_i\}_{i=1}^m\) as context. The verdict processes this multimodal input in a single inference call and outputs the final answer: \[y = J\big(x, q, \{r_i\}_{i=1}^m\big).\]
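As a concrete sketch of this single verdict call, the snippet below assumes an OpenAI-compatible client for a GPT-4o verdict; the prompt wording is illustrative rather than our exact verdict template (see Figures 10-11).

```python
import base64
from openai import OpenAI

client = OpenAI()

def verdict_answer(image_path, question, reasoning_paths, model="gpt-4o-2024-08-06"):
    """Single verdict call: original image + question + concatenated draft reasoning paths."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    context = "\n\n".join(f"[Expert {i+1}]\n{r}" for i, r in enumerate(reasoning_paths))
    prompt = (
        f"Question: {question}\n\n"
        f"Reasoning paths from draft experts:\n{context}\n\n"
        "Verify these paths against the image, resolve any contradictions, "
        "and give the final answer."
    )
    resp = client.chat.completions.create(
        model=model,
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64}"}},
            ],
        }],
    )
    return resp.choices[0].message.content
```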
In this design, the verdict acts not as a voter but as a synthesizer. It evaluates grounding consistency, identifies contradictions across reasoning paths, and integrates consistent cues into a coherent prediction. The case in Figure 3 illustrates this intended role: when only one draft extracts the correct evidence, the verdict is designed to recover it by contrasting against competing but inconsistent paths.
This setup enables us to leverage the reasoning capabilities of large models while keeping the inference cost manageable. The verdict stage reduces the expensive autoregressive decoding phase by concentrating computation in prefill: it processes thousands of tokens from multiple draft reasoning paths as prefill input and produces only several answer tokens sequentially. This design avoids invoking large models iteratively for analyzing each image section separately or generating lengthy rationales, both of which would substantially increase decoding costs.
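The asymmetry between parallel prefill and sequential decoding can be made concrete with a simple latency model; the throughput numbers below are assumed for illustration and are not measurements from our system.

```python
# Illustrative only: assumed throughputs, not measured values.
PREFILL_TOK_PER_S = 5000   # prefill tokens are processed in parallel
DECODE_TOK_PER_S = 50      # decoded tokens are generated sequentially

def latency_s(prefill_tokens, decode_tokens):
    return prefill_tokens / PREFILL_TOK_PER_S + decode_tokens / DECODE_TOK_PER_S

# Verdict-style call: ~3,000 context tokens in, ~30 answer tokens out.
print(latency_s(3000, 30))    # ~1.2 s
# Generating a ~1,000-token rationale with the same large model instead:
print(latency_s(500, 1000))   # ~20.1 s
```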
To keep the verdict input both efficient and accurate, we introduce a training-free expert selection mechanism at the beginning of the draft stage (Section 3.2). Since each question in information-intensive VQA has a unique correct answer, consensus among model answers naturally indicates which reasoning paths are more reliable. Therefore, the key idea here is to measure agreement among candidate answers and retain only those with stronger peer consensus. This mechanism is computed efficiently by prefilling the question and answer tokens, with each draft decoded only once, making it plug-and-play with minimal overhead.
Consensus Score. We define a consensus score that measures how strongly a candidate VLM’s answer is supported by its peers. Formally, let \(x\) be the input image and \(q=(q_1,\dots,q_n)\) the question tokens. From the pool of \(k\) candidate VLMs \(\{M_i\}_{i=1}^k\), each model produces a candidate answer \(y_i = (y_{i,1}, \dots, y_{i,T})\). For a peer model \(M_j\) (\(j \neq i\)) in the pool, we measure how plausible it finds \(y_i\) by computing the negative log-likelihood (NLL) of the concatenated input \((x, q, y_i)\), i.e., the original image together with the question tokens followed by the candidate answer tokens: \[\mathrm{NLL}_j(y_i) = -\tfrac{1}{T}\sum_{t=1}^{T} \log p_{M_j}(y_{i,t}\mid x,q_{\le n},y_{i,<t}).\] To account for calibration differences, we normalize against \(M_j\)’s own answer \(y_j\), so the relative consensus score from \(M_j\)’s perspective is: \[s_j(y_i) = \bigl|\mathrm{NLL}_j(y_i) - \mathrm{NLL}_j(y_j)\bigr|, \quad j \neq i,\] where a smaller \(s_j(y_i)\) indicates stronger agreement, as \(M_j\) finds \(y_i\) nearly as plausible as its own answer \(y_j\).
To capture overall agreement rather than pairwise consistency, we define the global consensus score of candidate \(y_i\) by summing across all peers: \[s(y_i) = \sum_{j \neq i} s_j(y_i),\] which quantifies the overall level of peer consensus for \(M_i\)’s answer, and a lower \(s(y_i)\) indicates stronger agreement and thus higher reliability.
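The score can be computed with a short routine such as the one below, assuming each candidate VLM exposes an `nll(image, question, answer)` helper that returns the length-normalized NLL of the answer tokens via a single prefill pass; this is a sketch of the definitions above, and the released code may differ.

```python
def consensus_scores(models, image, question, answers):
    """Global consensus score s(y_i) for each candidate answer (lower = stronger agreement)."""
    k = len(models)
    # NLL_j(y_i): length-normalized NLL that peer M_j assigns to candidate answer y_i,
    # computed by prefilling (image, question, answer) with no decoding.
    nll = [[models[j].nll(image, question, answers[i]) for i in range(k)] for j in range(k)]
    scores = []
    for i in range(k):
        # s(y_i) = sum over peers j != i of |NLL_j(y_i) - NLL_j(y_j)|
        s = sum(abs(nll[j][i] - nll[j][j]) for j in range(k) if j != i)
        scores.append(s)
    return scores
```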
Consensus Expert Selection Strategy. We adopt a cross-all strategy that selects the \(m\) VLMs with the strongest consensus, measured by the lowest consensus scores, from the pool of \(k\) candidates. As described in Section 3.2, these \(m\) selected VLMs then become the draft experts to generate detailed reasoning paths forwarded to the verdict (Figure 3 illustrates this process). By aggregating agreement across all peers, this strategy provides a holistic measure of reliability. It thus yields a subset of reasoning paths that are well-grounded and compact in size, balancing informativeness and efficiency.
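Given these scores, the cross-all strategy reduces to a top-\(m\) selection by lowest global consensus score, continuing the sketch above.

```python
def select_experts(models, scores, m=3):
    """Cross-all strategy: keep the m candidates whose answers have the lowest s(y_i)."""
    ranked = sorted(range(len(models)), key=lambda i: scores[i])
    return [models[i] for i in ranked[:m]]
```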
Configuration Details. We set the draft pool size to \(k = 5\) considering efficiency and select \(m = 3\) draft experts in our main experiments. Ablation studies over different \(m\) values are reported in Section 4.4. The draft pool consists of the following VLMs for expert selection: Qwen2.5-VL-7B-Instruct [30], MiMo-VL-7B-RL [31], InternVL3-8B [32], GLM-4.1V-9B-Thinking [33], Ovis2.5-9B [34]. These models are chosen as candidate VLMs based on their strong performance on multimodal benchmarks and their diverse architectural designs. For the verdict models, we employ GPT-4o [16] and Qwen2.5-VL-72B-Instruct respectively, given their superior ability in visual reasoning. In particular, for information-intensive image benchmarks, we preprocess images with PP-StructureV3 [35] to produce a layout-preserving structured format (see Appendix 12.2 for details), provided together with the original image as auxiliary input to the verdict model.
Baselines. We compare SV with proprietary models GPT-4o and GPT-4o-mini, and the large open-source model Qwen2.5-VL-72B-Instruct as it is one of our verdicts. We also evaluate SV against draft experts mentioned above. These baselines are evaluated under the same chain-of-thought prompting template in Appendix 15. Additionally, we include DeepEyes [5] as a representative tool-driven baseline with zoom-in operations.
Benchmarks. We evaluate SV on three information-intensive benchmarks and extend the evaluation to a representative high-resolution benchmark, providing a comprehensive assessment of fine-grained visual reasoning: InfographicVQA [13], ChartMuseum [14], ChartQAPro [15], and HR-Bench 4K [17]. InfographicVQA collects infographics with an average resolution above 2K, designed to test reasoning over layout, graphical, and textual content, including operations such as counting, sorting, and basic arithmetic. ChartMuseum and ChartQAPro introduce substantially greater visual reasoning complexity by covering a broad spectrum of real-world chart types and question formats, revealing a large performance gap between current large VLMs and humans. These benchmarks require models to visually ground relevant regions, extract information, and conduct reasoning to answer queries.
We further assess generalization to high-resolution images on HR-Bench 4K. It comprises two sub-tasks: FSP (Fine-grained Single-instance Perception) and FCP (Fine-grained Cross-instance Perception), stressing small-object perception and cross-instance reasoning under high-resolution inputs.
As shown in Table [tab:infographics], SV demonstrates superior performance across all benchmarks, outperforming a wide range of baselines. Based on the results, we have the following key observations:
(i) SV shows consistent gains over all strong draft experts’ baselines, with improvements of 3.6% on InfographicVQA, 1.3% on ChartMuseum, and 6.6% on ChartQAPro with GPT-4o as verdict. SV also achieves comparable gains with Qwen2.5-VL-72B-Instruct as a verdict.
(ii) Importantly, SV enables strong error correction beyond simple answer aggregation. Figure 4 analyzes SV’s performance on cases where the verdict itself fails, categorized by expert correctness (minority-correct, majority-correct, zero-correct). Across benchmarks, SV recovers 47-53% of minority-correct cases, where few draft experts are correct and the verdict alone also fails (case in Figure 3). Moreover, SV even recovers 2.5-4.5% of zero-correct cases, where neither the drafts nor the verdict answers correctly (case in Appendix 14). In these cases, SV succeeds because errors in information-intensive visual reasoning are often decomposable, enabling SV to extract partially correct components from different draft reasoning paths while rejecting misleading cues. Thus, SV achieves effective correction where traditional ensemble methods fail.
(iii) SV strengthens large verdict models significantly, and using GPT-4o as verdict delivers stronger results due to its reasoning advantage on information-intensive benchmarks. Specifically, when GPT-4o is used as verdict, SV surpasses the GPT-4o baseline by 11.9% on InfographicVQA, 6.6% on ChartMuseum, and 11.4% on ChartQAPro. These improvements come with reduced inference cost for the large verdict model, demonstrating that SV can outperform much larger or proprietary LVLMs in a cost-efficient manner.
(iv) SV substantially outperforms representative tool-driven pipeline DeepEyes, with gains of +12.9% on InfographicVQA, +21.3% on ChartMuseum, and +11.3% on ChartQAPro. This gap arises because DeepEyes is strong in local grounding but less effective when reasoning over dense textual and visual content. For example, it often focuses on text spans or legends rather than full regions needed for analytical operations, and its zoom-in calls are sometimes redundant or misdirected (see Appendix 13 for error analysis). As a result, it struggles with global comparison and dispersed evidence synthesis. In contrast, SV’s reasoning-path synthesis enables it to integrate evidence across regions reliably without relying on predefined tool-based visual search.
We further assess generalization to high-resolution images using HR-Bench 4K to evaluate whether SV can enhance fine-grained visual perception. The key observations are as follows (Table [tab:infographics]):
(i) With Qwen2.5-VL-72B-Instruct as verdict, SV achieves its largest margin, surpassing the best-performing draft expert by 2.6% and even outperforming the verdict itself by 2.5%. The superior performance of Qwen2.5-VL-72B as verdict on this task correlates with its stronger visual localization capabilities, indicating verdict selection should align with task-specific requirements.
(ii) SV also exceeds DeepEyes, which is explicitly trained with zoom-in tools for iterative visual search on high-resolution perception. This highlights SV’s ability to generalize to high-resolution tasks, where accurate recognition of small objects is critical. Aligning perceptually strong draft experts with a verdict thus provides a simpler yet effective solution for high-resolution reasoning.
To better understand the effectiveness of SV, we conduct ablation studies on information-intensive benchmarks to analyze the impact of individual components. In these experiments, the reasoning baseline refers to the best-performing draft expert in our pool for each benchmark (Table [tab:infographics]).
Number of Draft Experts. Our setting with \(m=3\) draft experts yields a favorable trade-off between accuracy and efficiency, as it determines the number of reasoning paths forwarded to the verdict. As shown in Figure 5, we observe that the performance improves nearly linearly up to three draft experts and then saturates, while inference cost grows roughly linearly with \(m\).
Consensus Expert Selection Strategy. We confirm the effectiveness of our cross-all selection strategy by comparing it with a best-reference strategy. In the best-reference variant, the top-performing draft expert serves as reference and the two most consistent experts are selected with it. While best-reference is expected to be the strongest criterion, cross-all achieves comparable gains while remaining reference-free (Figure [fig:strategy]).
Selection Criteria. Selecting consensus-based experts consistently improves performance, while divergent selection can even fall below the single-draft reasoning baseline (Figure 6). These results support that, for information-intensive tasks, consensus-based selection more reliably identifies the correct reasoning path than enforced diversity.
Impact of Verdict Stage. The verdict stage yields higher performance than majority voting across information-intensive benchmarks (Figure [fig:mv]). Notably, majority voting with all five draft experts performs comparably to majority voting with three draft experts, consistent with our finding that consensus selection can match the performance of all drafts at a lower cost (Figure 5). SV further surpasses both by leveraging the verdict’s error correction ability, successfully capturing minority-correct cases that majority voting discards (Figure 4 and Figure 3).
Choice of Verdict Textual Input. Providing full reasoning paths to the verdict yields substantially better performance than passing only final answers (Table [tab:expertselect]), with improvements of 15% on InfographicVQA and 4.8% on ChartQAPro. These results highlight that rich contextual evidence is essential for the verdict to recover correct reasoning, whereas final predictions alone are insufficient.
Choice of Verdict Scale. Using a large verdict model yields stronger gains than a small verdict model. For ablations, we select GLM-4.1V-9B-Thinking as the small verdict because it is the strongest reasoning model among the baselines. However, results in Table [tab:verdictinfo] show that it brings only modest improvements, while GPT-4o delivers additional gains of 3.4% on InfographicVQA and 1.3% on ChartMuseum compared to this small verdict. These results indicate that even reasoning-strong small verdicts offer limited benefit in synthesizing correct answers, validating SV’s design principle of invoking a strong verdict only once to achieve robust and efficient error correction.
This paper introduces Speculative Verdict (SV), a training-free framework to address challenges of information-intensive visual reasoning. Inspired by speculative decoding, SV repositions large models as efficient synthesizers rather than computationally expensive step-by-step reasoners. By integrating diverse reasoning paths from lightweight experts, the verdict can distinguish informative cues and recover correctness from structured errors. Experiments show that SV consistently outperforms strong proprietary, open-source, and tool-driven methods, establishing a cost-efficient paradigm for reasoning on information-intensive images.
This work does not involve human subjects, sensitive personal data, biometrics, or medical information. All datasets used are publicly available under permissible licenses and are not privacy-sensitive. We recognize that any automated reasoning system may produce incorrect or misleading outputs. To ensure responsible use, we emphasize that our method is intended for research and analysis rather than deployment in high-stakes settings. Users are encouraged to verify model outputs and apply human oversight when necessary. We take full responsibility for all reported results, analyses, and claims, and we welcome community scrutiny and feedback.
To support reproducibility, we provide comprehensive implementation details throughout our paper. Key experimental configurations, such as draft expert selection, consensus scoring computation, and verdict model specifications, are documented in Section 3.4 and Section 4.1. Detailed prompt templates are presented in Appendix 15. The code is released to further clarify the implementation steps and enable faithful reproduction of our results.
Table [tab:datasetstats] reports the statistics of the four evaluation benchmarks. All benchmarks are based on real-world images rather than synthetic renderings, ensuring the authenticity and diversity of the evaluation setting. In particular, InfographicVQA, ChartMuseum, and ChartQAPro are information-intensive benchmarks: they contain thousands of images and questions with dense textual and numerical content, collected from diverse sources spanning 2594, 157, and 184 distinct web domains respectively [13]–[15]. This diversity reduces source bias and reflects practical challenges in multimodal reasoning.
HR-Bench 4K is used primarily to evaluate the generalization of our method, serving as a high-resolution benchmark with average sizes exceeding 4000×3500 pixels [17]. At the same time, one of our main benchmarks, InfographicVQA, also exhibits high-resolution characteristics. In particular, it frequently contains long-format images where diagrams span large vertical layouts (see the case in Figure 3), which further compounds the difficulty of grounding and multi-hop reasoning across dispersed regions.
Table [tab:cost] reports the average inference cost of invoking GPT-4o as the verdict model per sample across benchmarks. Costs are estimated using the official GPT-4o pricing (version gpt-4o-2024-08-06) as of September 2025. The small variation across benchmarks is mainly attributed to differences in reasoning path length, as more challenging tasks typically induce more complex reasoning. Overall, the inference cost of using GPT-4o as the verdict is under $0.011 per sample across all benchmarks.
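As a rough sanity check on these figures, the back-of-envelope calculation below assumes gpt-4o-2024-08-06 list prices of $2.50 per 1M input tokens and $10.00 per 1M output tokens, together with illustrative token counts for a typical verdict call; the counts are assumptions, not measured averages.

```python
# Illustrative back-of-envelope estimate, not a measured cost breakdown.
INPUT_PRICE = 2.50 / 1e6    # assumed price per input token
OUTPUT_PRICE = 10.00 / 1e6  # assumed price per output token

input_tokens = 3500   # image + question + three reasoning paths (assumed)
output_tokens = 50    # short final answer (assumed)

cost = input_tokens * INPUT_PRICE + output_tokens * OUTPUT_PRICE
print(f"${cost:.4f} per sample")   # ~$0.0093, consistent with the <$0.011 reported above
```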
Table [tab:recovery95full] and Figure 7 show the detailed recovery statistics across information-intensive benchmarks with GPT-4o as verdict. We break down SV’s performance by expert correctness: (i) cases where the majority of draft experts are correct (majority-correct), (ii) cases where only a minority are correct (minority-correct), (iii) cases where none are correct (zero-correct). While the main paper focuses on the GPT-4o’s error cases to isolate SV’s effectiveness, we provide the full results here for completeness.
Notably, in the zero-correct setting, recovery occurs rarely (2.6-24%), but it demonstrates the verdict’s surprising ability to infer the correct answer by synthesizing signal from entirely noisy reasoning paths.
Beyond the fixed model pool used in the main experiments, we further examine SV’s generalizability across different model pool compositions by testing on pools with varying model sizes and capabilities. The results show that SV successfully leverages reasoning paths from lightweight models, delivering strong performance while maintaining cost efficiency.
Evaluation with 7-9B Model Pool (Non-Thinking). SV maintains its effectiveness when replacing thinking models with faster non-thinking alternatives. Specifically, we replace the two thinking models in our original pool (i.e., GLM-4.1V-9B-Thinking [33] and MiMo-VL-7B-RL [31]) with non-thinking models (i.e., LLaVA-OneVision-1.5-8B [36] and Eagle 2.5-8B [37]), while keeping the remaining three models unchanged. While these substitutes sacrifice some reasoning capability, they enable faster inference. As shown in Table [tab:largeinfographics], with GPT-4o as verdict, SV achieves 86.3% on InfographicVQA under this configuration, surpassing all baselines. Notably, SV outperforms the best draft expert by 4.6% and exceeds the large 72B model by 1.9%. These results demonstrate that SV achieves strong performance by integrating reasoning paths from individually weaker but faster models.
Evaluation with 2-4B Model Pool. We also evaluate SV on an even smaller model pool consisting of 2-4B models: Qwen2.5-VL-Instruct-3B [30], LLaVA-OneVision-1.5-4B [36], InternVL3.5-4B [38], Gemma 3-4B [39], and Ovis2.5-2B [34]. As shown in Table [tab:smallinfographics], with GPT-4o as verdict, SV achieves 84.5% on InfographicVQA, surpassing the best draft expert by 9.5% and the 72B baseline by 0.3%. This demonstrates SV’s ability to extract effective collective reasoning even from significantly weaker individual models, confirming the robustness of our paradigm across varying model scales.
We examine whether visual input is necessary for the verdict or if reasoning paths alone suffice. Table [tab:visualinput] presents results where the verdict receives only textual reasoning paths without image input. The results show that SV without visual input achieves modest gains over the reasoning baseline on InfographicVQA, and even underperforms on ChartMuseum and ChartQAPro. In contrast, incorporating visual input for verdict yields substantial improvements across all benchmarks. These results demonstrate that visual grounding is essential for the verdict to cross-check the factual accuracy of extracted information and distinguish correct from incorrect interpretations of the image.
In our experimental setup in Section 4.1, we preprocess each image via PP-StructureV3, a document parsing model that generates Markdown representations capturing layout, textual blocks, and visual metadata [35]. This structured representation is then rendered as an image and provided as an additional image input for the verdict. This allows the verdict to access both the raw visual content and a layout-aware text representation simultaneously. To verify whether this input is critical or merely auxiliary, we conduct an ablation study (Table [tab:ocr]).
The results show that SV achieves substantial gains over the reasoning baseline even without structured input. With the structured input, performance is generally slightly improved, though the gain is negligible or even marginally lower in some cases. This pattern suggests that structured OCR-derived signals are not essential for SV’s core performance, but may assist the verdict to distinguish among competing reasoning paths.
As mentioned in Section [work:tool], tool-driven methods represent a line of work that augments vision-language reasoning with explicit zoom-in operations. The representative pipeline DeepEyes is designed to iteratively ground into image regions, and integrate them into the ongoing reasoning trajectory under an RL framework. This mechanism has proven effective on high-resolution benchmarks, where localized inspection of fine details is crucial.
However, DeepEyes is not specifically trained on our benchmarks, which require reasoning over information-intensive images with densely interleaved textual and visual elements. Its performance on InfographicVQA reveals the current limitations of such tool-based pipelines in this domain. We categorize the observed deficiencies into three core challenges:
(i) Tendency toward literal grounding. DeepEyes is proficient at small-scale grounding but often focuses on literal text spans or legends rather than reasoning-critical regions. For example, when a question requires aligning numerical values with a chart axis, the model frequently grounds directly onto the answer text or nearby labels instead of the relevant data regions. This shortcut strategy works for simple queries but fails on complex reasoning on information-intensive images that require global comparison.
(ii) Inefficient tool usage. Although DeepEyes is trained to iteratively apply zoom-in tools, we observe that it invokes only one zoom step in more than half of the test cases. Among the double-zoom cases, 92.8% duplicate the same bounding box, which serves only for verification rather than exploration. In some instances, the model zooms into empty areas or irrelevant regions.
(iii) Lack of robustness on long and dense images. Information-intensive images often contain multi-panel figures and dense annotations. DeepEyes cannot maintain a trajectory across multiple zoom steps, making it difficult to integrate dispersed evidence. As a result, tasks requiring cross-region synthesis, such as counting, sorting, or comparing across multiple subplots, remain challenging for it.
Overall, this analysis indicates that while tool-driven pipelines are promising for high-resolution inspection tasks, they face notable difficulties when applied to information-intensive images without domain-specific supervision. In contrast, SV achieves strong performance without additional training, offering a simple and effective alternative for reasoning over complex multimodal inputs.
Figure 8 illustrates a case where all three draft experts produced incorrect reasoning paths, yet the verdict successfully corrected the answer. Specifically, the draft experts faced different types of failures: some mis-extracted information from the image, others extracted the key information correctly but failed to sort the values properly, and thus all generated wrong answers. Interestingly, the verdict itself, when asked directly, also tends to answer “Australia” incorrectly. However, when analyzing the noisy and conflicting reasoning paths together, the verdict was able to recover the correct answer (Portugal).
This example complements the main results section: while Figure 3 illustrates recovery from minority-correct experts, here we present a zero-correct case to show that SV can still synthesize the correct solution even when all drafts and the verdict individually fail.
As described in Section 4.1, we employ a Chain-of-Thought prompt for each consensus expert to generate reasoning paths and apply it identically when evaluating baselines. For InfographicVQA and HR-Bench 4K, we use the same CoT prompt. For ChartMuseum [14], we adopt its official reasoning prompt, and adapt that prompt strategy to ChartQAPro, given their similarity in task complexity. Since ChartQAPro requires different prompt templates tailored to question types [15], we first follow its official template per question type, then concatenate it with our reasoning prompt.
The reasoning prompts for these datasets are shown in Figure 9.
The user prompts used in the verdict stage are identical across datasets except for the final instruction sentence, which is customized (see Figure 11). For GPT-4o as verdict, the system prompt is shown in Figure 10. For Qwen2.5-VL-72B-Instruct as verdict, we prepend its system prompt at the beginning of the user prompt.
In this work, we used LLMs solely for auxiliary tasks such as language polishing, prompt refining, and proofreading. Importantly, these interventions did not contribute any main scientific insight, experimental design, or methodological advance. All core ideas, experiments, analyses, and claims in this paper are the work of the authors.
Figure 9: Prompt templates for reasoning.
Figure 10: System prompt template for verdict.
Figure 11: User prompt templates for verdict.