Small Drafts, Big Verdict: Information-Intensive Visual Reasoning via Speculation

Yuhan Liu\(^{1}\), Lianhui Qin\(^{2}\), Shengjie Wang\(^{1}\)
\(^1\)New York University, \(^2\)University of California, San Diego
{yl10379, sw5973}@nyu.edu
{l6qin}@ucsd.edu


Abstract

Large Vision-Language Models (VLMs) have achieved remarkable progress in multimodal understanding, yet they struggle when reasoning over information-intensive images that densely interleave textual annotations with fine-grained graphical elements. The main challenges lie in precisely localizing critical cues in dense layouts and multi-hop reasoning to integrate dispersed evidence. We propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines multiple lightweight draft experts with a large verdict model. In the draft stage, small VLMs act as draft experts to generate reasoning paths that provide diverse localization candidates; in the verdict stage, a strong VLM synthesizes these paths to produce the final answer, minimizing computational cost while recovering correct answers. To further improve efficiency and accuracy, SV introduces a consensus expert selection mechanism that forwards only high-agreement reasoning paths to the verdict. Empirically, SV achieves consistent gains on challenging information-intensive and high-resolution visual question answering benchmarks, including InfographicVQA, ChartMuseum, ChartQAPro, and HR-Bench 4K. By synthesizing correct insights from multiple partially accurate reasoning paths, SV achieves effective error correction while remaining more cost-efficient than relying solely on large proprietary models or costly training pipelines. Code is available at https://github.com/Tinaliu0123/speculative-verdict.

1 Introduction↩︎

Recent advances in large vision-language models (VLMs) have delivered impressive performance on tasks such as image captioning and general visual question answering (VQA) [1], [2]. However, these models encounter challenges in information-intensive images that densely interleave diverse textual annotations (legends, labels, captions) with fine-grained graphical elements (charts, diagrams, plots) across multiple scales and formats [3]. Addressing this task requires two interdependent capabilities (Figure 1; [4]): (i) comprehensive and precise localization, which involves not only pinpointing the exact positions of critical cues in densely populated layouts but also ensuring that all query-relevant regions are identified; and (ii) multi-hop reasoning, which chains visual analysis (colors, shapes, and spatial relationships) with textual evidence, thereby integrating dispersed cues into a coherent and complete answer. Because each reasoning step builds on the accuracy of the previous one, any intermediate error can propagate through the entire chain, making the overall process highly error-sensitive and difficult to correct retrospectively.

Existing work tackles information-intensive visual reasoning with search-based zoom-in pipelines that enlarge local regions for detailed reasoning. Learning-based methods train reinforcement learning policies to guide zoom operations iteratively [5]–[8], but improving their performance demands costly fine-grained supervision. Training-free methods instead perform cropping based on internal attention or confidence scores [9]–[11]; yet in dense layouts, we find these signals correlate weakly with true relevance, misleading the model toward visually similar but irrelevant areas. Consequently, these tool-driven designs fail to capture all the evidence needed for multi-hop reasoning, leaving the core challenges of information-intensive visual reasoning unsolved.

To overcome these limitations, we propose Speculative Verdict (SV), a training-free framework inspired by speculative decoding that combines small draft visual experts with a large verdict model [12]. The framework operates in two stages (Figure 2): (1) Draft stage: multiple lightweight VLMs serve as draft experts, each generating a reasoning path that offers diverse localization candidates; (2) Verdict stage: a large VLM acts as the verdict, receiving the reasoning paths as contextual evidence, distinguishing the correct information, and outputting the final answer. SV directly tackles the core challenges through complementary strengths: draft experts expand evidence coverage across scattered regions, while the verdict prevents error propagation by synthesizing these multiple perspectives. Importantly, unlike using a large proprietary model to reason over every image section, SV invokes the verdict only once to yield a concise final answer, thereby minimizing computational cost while effectively recovering correct answers. To further balance accuracy and efficiency, SV introduces a consensus expert selection mechanism in the draft stage, ensuring that only reasoning paths with strong agreement are forwarded to the verdict.

We evaluate Speculative Verdict on information-intensive VQA benchmarks, including InfographicVQA [13], ChartMuseum [14], and ChartQAPro [15], which demand reasoning over dense textual and visual content. As a training-free framework, SV consistently outperforms strong open-source models, large proprietary models, and perception-focused search methods while remaining cost-efficient. In particular, SV yields average gains of 4% over the small VLMs used as draft experts and 10% over GPT-4o [16] used as the verdict. Beyond overall gains, SV successfully corrects 47–53% of cases where majority voting or the verdict model alone fails, thereby reducing vulnerability to error propagation in information-intensive visual reasoning. Furthermore, SV surpasses all baselines on HR-Bench 4K [17], a benchmark for high-resolution visual perception, underscoring its effectiveness in challenging multimodal reasoning scenarios.

Figure 1: Examples of correct reasoning paths for information-intensive image VQA tasks. They illustrate distinct paths: (a) focuses on the localization of a specific chart, symbol identification, and complementary reasoning from a single percentage value; (b) focuses on keyword-based localization, evidence aggregation from multiple entries across the entire image, and cross-entity sorting to select the minimum.

2 Related Work↩︎

Vision-Language Model Reasoning with Tools. Recent research has explored enhancing VLM perception by manipulating input images with zooming operations to locate relevant regions [18]. (1) Prompting-based methods exploit internal signals of VLMs to decide where to zoom. ViCrop [9] leverages models’ attention maps to highlight query-related regions, thereby generating automatic visual crops. Other works perform tree-based search, where models evaluate candidate sub-images with confidence scores to iteratively narrow down to relevant regions [10], [11]. However, such signals align poorly with the required evidence in information-intensive images, since queries often require reasoning across multiple dispersed regions. (2) Reinforcement learning approaches instead optimize policies that interleave visual zooming with textual reasoning [5]–[8]. By calling zooming tools within an agentic framework, these methods adaptively crop regions and concatenate them into the reasoning trajectory, enabling more active evidence gathering. Yet these methods still fall short on information-intensive images and require costly task-specific training to scale.

Speculative Decoding. Speculative decoding is a draft-then-verify decoding paradigm that accelerates LLM inference [19]. Specifically, it utilizes a draft model to generate future tokens, and a larger target model verifies them via parallel rejection sampling. Beyond the vanilla setting, recent work extends acceptance from token-level equivalence to step-level semantic similarity to speed up reasoning [20]–[23]. Collaborative decoding via speculation [24] further applies speculative decoding with multiple draft LLMs by verifying proposals against a combined distribution of the drafts and the target, yielding greater speedups than standard ensembling. However, these adaptations primarily target speed in LLM inference and do not address the challenges of vision-language reasoning.

Large Language Model Ensemble. Majority voting aggregates answers by frequency but fails when the correct solution is produced by a minority. Universal Self-Consistency [25] mitigates this failure mode by prompting the LLM to select the most consistent candidate across samples. Further, learned aggregators read multiple rationales and synthesize them to recover minority-correct information [26], [27]. However, these approaches focus on text-only ensembling. In vision-language reasoning, supervised ensembling is not cost-effective, since multimodal complexity requires costly, fine-grained annotations.

3 Speculative Verdict↩︎

Speculative decoding is an inference‐time optimization originally developed to mitigate the latency of autoregressive generation [12]. The approach employs a draft-then-verify paradigm: (i) a small, fast draft model proposes one or more future tokens speculatively, and (ii) a large, accurate base model verifies these proposals in parallel, accepts or revises the proposals, and generates output that is consistent with the base model’s distribution  [19], [28]. This token-level process speeds up inference by committing several tokens at once, while maintaining quality by discarding continuations that diverge from the base model’s distribution.

The key insight is that draft models expand coverage quickly, while the verifier ensures correctness. Although this idea has been mainly applied to accelerate text generation, its high-level principle is also well-suited for information-intensive multimodal reasoning.
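For intuition, the token-level accept/reject rule can be sketched as follows. This is a simplified illustration of standard speculative sampling [12], not part of our method: `draft_probs`, `target_probs`, and `draft_tokens` are assumed to be precomputed for one speculation round, and the bonus token sampled after a fully accepted round is omitted for brevity.

```python
import numpy as np

def speculative_step(draft_probs, target_probs, draft_tokens, rng=None):
    """Commit tokens for one speculation round via the standard accept/reject rule.

    draft_probs[t] and target_probs[t] are next-token distributions (1-D arrays) at
    position t, conditioned on the same prefix plus previously proposed draft tokens;
    draft_tokens[t] is the token the draft model sampled at position t.
    """
    rng = rng or np.random.default_rng()
    accepted = []
    for t, tok in enumerate(draft_tokens):
        p, q = target_probs[t][tok], draft_probs[t][tok]
        if rng.random() < min(1.0, p / q):
            accepted.append(int(tok))                 # accept with probability min(1, p/q)
        else:
            # On rejection, resample from the renormalized residual max(0, p - q),
            # which keeps the committed tokens distributed as the target model.
            residual = np.clip(target_probs[t] - draft_probs[t], 0.0, None)
            accepted.append(int(rng.choice(len(residual), p=residual / residual.sum())))
            break
    return accepted
```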

Figure 2: Overview of Speculative Verdict (SV). Inspired by speculative decoding, SV operates in two stages. In the draft stage, given an input question-image pair, k small candidate VLMs first generate candidate answers, from which we compute a global consensus score s(y_i) for each answer based on pairwise NLL differences. We then select m draft experts with the strongest consensus to generate reasoning paths. In the verdict stage, the large verdict model verifies and integrates these paths to yield the final answer.

3.1 Method Overview↩︎

Information-intensive visual question answering (VQA) requires models to localize query-relevant regions, perceive diverse fine-grained textual and visual details, and integrate dispersed evidence into a single correct answer. These tasks are highly error-sensitive as elaborated in Section 1: a single misread or mislocalized element often leads to a completely wrong prediction.

To address this challenge, we adapt the draft-then-verify paradigm of speculative decoding to multimodal reasoning. Unlike its original use for inference acceleration, we repurpose the paradigm to improve robustness and error correction in information-intensive visual reasoning. At a high level, our Speculative Verdict (SV) framework operates in two stages (Figure 2), illustrated with a minimal sketch below:

(i) Draft stage, where multiple lightweight VLMs are selected as draft experts to provide diverse reasoning paths (Section 3.2);

(ii) Verdict stage, where a large VLM acts as verdict to verify, refine, and synthesize these reasoning paths into the final prediction (Section 3.3).
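For concreteness, the overall flow can be summarized by the following minimal sketch. The helper functions (`generate_answer`, `global_consensus_scores`, `select_draft_experts`, `generate_reasoning_path`, `verdict_answer`) are hypothetical wrappers around the underlying VLM calls, expanded in the sketches of Sections 3.2–3.4 below; this is an illustration of the pipeline, not our exact released implementation.

```python
def speculative_verdict(image, question, candidate_vlms, verdict_model, m=3):
    """Two-stage SV sketch: draft with small VLMs, then a single verdict call."""
    # Draft stage: every candidate answers once (short decode) for consensus scoring.
    answers = [generate_answer(vlm, image, question) for vlm in candidate_vlms]
    scores = global_consensus_scores(candidate_vlms, image, question, answers)

    # Keep only the m candidates with the strongest peer consensus as draft experts,
    # and have each of them produce a full chain-of-thought reasoning path.
    expert_ids = select_draft_experts(scores, m)
    reasoning_paths = [generate_reasoning_path(candidate_vlms[i], image, question)
                       for i in expert_ids]

    # Verdict stage: one call to the large model with all reasoning paths as context.
    return verdict_answer(verdict_model, image, question, reasoning_paths)
```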

3.2 Draft Stage↩︎

Chain-of-Thought (CoT) prompting exposes models’ intermediate reasoning steps in an explicit, stepwise form [29]. This is critical for information-intensive VQA, where solving a question requires a sequence of localization, evidence extraction, and analytic operations (Figure 1). However, current VLMs often lack fine-grained perception and localization on densely annotated images, and existing tool-driven zoom-in methods are ineffective as elaborated in Section 2. We therefore utilize multiple VLMs to produce reasoning paths rather than a single direct answer, so that the subsequent verdict can verify and synthesize structured evidence. Concretely, given an image-question pair \((x,q)\), we select \(m\) lightweight VLMs \(\{M_1,\dots,M_m\}\) as draft experts from a pool of \(k\) candidate VLMs via a consensus-based selection mechanism (detailed in Section 3.4). Each selected expert \(M_i\) is then prompted with a CoT template to output a reasoning path \(r_i\).
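As an illustration, a draft expert can be prompted as sketched below. The template wording here is only indicative; the actual prompt templates used in our experiments are given in Appendix 15, and `expert.generate` is a hypothetical interface to a lightweight VLM.

```python
COT_TEMPLATE = (
    "Answer the question about the image step by step.\n"
    "1. Locate the regions of the image relevant to the question.\n"
    "2. Extract the textual and visual evidence from those regions.\n"
    "3. Reason over the evidence and state the final answer.\n\n"
    "Question: {question}\n"
)

def generate_reasoning_path(expert, image, question, max_new_tokens=1024):
    """Ask one selected draft expert for an explicit reasoning path r_i."""
    prompt = COT_TEMPLATE.format(question=question)
    return expert.generate(image=image, prompt=prompt, max_new_tokens=max_new_tokens)
```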

We observe that each reasoning path \(r_i\) provided by draft experts typically includes: (i) global scan and localization proposals that identify query-related regions, sections, or subplots, often referencing axes, titles, or captions; (ii) evidence extraction, which transforms visual or textual elements into structured cues, including reading legends, mapping colors to series, parsing axis labels, or assembling lists of values or tokens for subsequent operations; (iii) analytic and reasoning operations, which operate over the extracted cues to derive higher-level conclusions, such as filtering or selecting relevant entities, computing differences, sorting across panels, and cross-referencing dispersed cues. As shown in the running case (Figure 3), different experts may match legends to charts differently; some correctly gather the required cues while others misread adjacent values. This diversity yields a complementary but potentially noisy pool of reasoning signals.

3.3 Verdict Stage↩︎

The set \(\{r_i\}\) captures diverse cues, offering richer evidence but also introducing contradictions, which motivates the need for a verdict stage to verify and integrate them. Answer-level ensembling (e.g., majority voting) often fails in minority-correct scenarios where many experts converge on the same incorrect decision, such as mislocalizing the query-related region or misreading fine-grained textual details, even after correct localization. This failure mode is frequently observed in information-intensive reasoning (as illustrated in Figure 3). Rather than discarding minority opinions through majority voting, we leverage a stronger model as a verdict to validate grounding, resolve conflicts, and synthesize coherent reasoning from the draft paths.

Specifically, given the image-question pair \((x,q)\) and the drafts’ reasoning paths \(\{r_i\}_{i=1}^m\), we prompt the verdict model \(J\) with: (i) the original image \(x\) as visual input, and (ii) a textual prompt containing the question \(q\) and the concatenated reasoning paths \(\{r_i\}_{i=1}^m\) as context. The verdict processes this multimodal input in a single inference call and outputs the final answer: \[y = J\big(x, q, \{r_i\}_{i=1}^m\big).\]
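The verdict call can be sketched as follows; again, the prompt wording is illustrative (the actual verdict prompts appear in Appendix 15) and `verdict_model.generate` is a hypothetical interface.

```python
def verdict_answer(verdict_model, image, question, reasoning_paths):
    """Single verdict call: the original image plus all draft reasoning paths as context."""
    context = "\n\n".join(f"Expert {i + 1} reasoning:\n{r}"
                          for i, r in enumerate(reasoning_paths))
    prompt = (
        f"Question: {question}\n\n{context}\n\n"
        "Verify these reasoning paths against the image, resolve any contradictions, "
        "and output the final answer."
    )
    return verdict_model.generate(image=image, prompt=prompt)
```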

In this design, the verdict acts not as a voter but as a synthesizer. It evaluates grounding consistency, identifies contradictions across reasoning paths, and integrates consistent cues into a coherent prediction. The case in Figure 3 illustrates this intended role: when only one draft extracts the correct evidence, the verdict is designed to recover it by contrasting against competing but inconsistent paths.

This setup enables us to leverage the reasoning capabilities of large models while keeping the inference cost manageable. The verdict stage reduces the expensive autoregressive decoding phase by concentrating computation in prefill: it processes thousands of tokens from multiple draft reasoning paths as prefill input and produces only several answer tokens sequentially. This design avoids invoking large models iteratively for analyzing each image section separately or generating lengthy rationales, both of which would substantially increase decoding costs.
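A back-of-the-envelope comparison makes this asymmetry concrete. The throughput and latency figures below are illustrative assumptions, not measurements from our experiments; they only show that shifting tokens from sequential decoding to parallel prefill dominates the cost.

```python
PREFILL_TOKENS_PER_S = 5000.0   # assumed parallel prefill throughput of the verdict model
DECODE_S_PER_TOKEN = 0.03       # assumed sequential per-token decoding latency

def latency_s(prefill_tokens, decode_tokens):
    return prefill_tokens / PREFILL_TOKENS_PER_S + decode_tokens * DECODE_S_PER_TOKEN

# SV verdict: ~3,000 prefilled tokens of reasoning paths, ~20 decoded answer tokens.
print(latency_s(3000, 20))   # ≈ 1.2 s
# Having the large model generate its own long rationale: ~500 prefilled, ~800 decoded.
print(latency_s(500, 800))   # ≈ 24 s
```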

Figure 3: An illustration of Speculative Verdict on InfographicVQA. Five candidate VLMs first produce candidate answers, with only two providing the correct result. Consensus scoring ranks answers by agreement, and the three with the lowest scores are selected as draft experts. Although some experts commit extraction errors (confusing player’s share with NFL revenue), the verdict synthesizes their reasoning paths and successfully recovers the correct answer (49%). This illustrates SV’s ability to identify reliable experts and achieve error correction.

3.4 Consensus Expert Selection↩︎

To keep the verdict input both efficient and accurate, we introduce a training-free expert selection mechanism at the beginning of the draft stage (Section 3.2). Since each question in information-intensive VQA has a unique correct answer, consensus among model answers naturally indicates which reasoning paths are more reliable. Therefore, the key idea here is to measure agreement among candidate answers and retain only those with stronger peer consensus. This mechanism is computed efficiently by prefilling the question and answer tokens, with each draft decoded only once, making it plug-and-play with minimal overhead.

Consensus Score. We define a consensus score that measures how strongly a candidate VLM’s answer is agreed upon by its peers. Formally, let \(x\) be the input image and \(q=(q_1,\dots,q_n)\) the question tokens. From the pool of \(k\) candidate VLMs \(\{M_i\}_{i=1}^k\), each model produces a candidate answer \(y_i = (y_{i,1}, \dots, y_{i,T})\). For a peer model \(M_j\) (\(j \neq i\)) in the pool, we measure how plausible it finds \(y_i\) by computing the negative log-likelihood (NLL) of the concatenated input \((x, q, y_i)\), i.e., the original image together with the question tokens followed by the candidate answer tokens: \[\mathrm{NLL}_j(y_i) = -\tfrac{1}{T}\sum_{t=1}^{T} \log p_{M_j}(y_{i,t}\mid x,q_{\le n},y_{i,<t}).\] To account for calibration differences, we normalize against \(M_j\)’s own answer \(y_j\); thus the relative consensus score from \(M_j\)’s perspective is: \[s_j(y_i) = \bigl|\mathrm{NLL}_j(y_i) - \mathrm{NLL}_j(y_j)\bigr|, \quad j \neq i,\] where a smaller \(s_j(y_i)\) indicates stronger agreement, as \(M_j\) finds \(y_i\) nearly as plausible as its own answer \(y_j\).

To capture overall agreement rather than pairwise consistency, we define the global consensus score of candidate \(y_i\) by summing across all peers: \[s(y_i) = \sum_{j \neq i} s_j(y_i),\] which quantifies the overall level of peer consensus for \(M_i\)’s answer, and a lower \(s(y_i)\) indicates stronger agreement and thus higher reliability.
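A minimal sketch of this scoring is given below. `candidates[j].nll(...)` is assumed to be a hypothetical wrapper that returns the length-normalized NLL of the answer tokens under model \(M_j\) conditioned on the image and question (a single teacher-forced prefill pass, no decoding), matching the definition above.

```python
def global_consensus_scores(candidates, image, question, answers):
    """Compute s(y_i) = sum_{j != i} | NLL_j(y_i) - NLL_j(y_j) | for each candidate answer."""
    k = len(candidates)
    # Cache NLL_j(y_i) for every (peer, answer) pair: k x k prefill passes in total.
    nll = [[candidates[j].nll(image, question, answers[i]) for i in range(k)]
           for j in range(k)]
    scores = []
    for i in range(k):
        # Each peer j compares y_i against its own answer y_j on its own likelihood scale.
        scores.append(sum(abs(nll[j][i] - nll[j][j]) for j in range(k) if j != i))
    return scores   # lower score = stronger peer consensus for answer y_i
```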

Consensus Expert Selection Strategy. We adopt a cross-all strategy that selects the \(m\) VLMs with the strongest consensus, measured by the lowest consensus scores, from the pool of \(k\) candidates. As described in Section 3.2, these \(m\) selected VLMs then become the draft experts to generate detailed reasoning paths forwarded to the verdict (Figure 3 illustrates this process). By aggregating agreement across all peers, this strategy provides a holistic measure of reliability. It thus yields a subset of reasoning paths that are well-grounded and compact in size, balancing informativeness and efficiency.
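Given these scores, the cross-all strategy reduces to a simple top-\(m\) selection by ascending consensus score, as sketched below (continuing the hypothetical helpers above).

```python
def select_draft_experts(scores, m=3):
    """Cross-all strategy: indices of the m candidates with the lowest global consensus score."""
    return sorted(range(len(scores)), key=lambda i: scores[i])[:m]

# Example: scores = [4.2, 0.9, 1.3, 3.8, 1.1] -> candidates 1, 4, and 2 become draft experts.
```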

4 Experiments↩︎

4.1 Setups↩︎

Configuration Details. We set the draft pool size to \(k = 5\) considering efficiency and select \(m = 3\) draft experts in our main experiments. Ablation studies over different \(m\) values are reported in Section 4.4. The draft pool consists of the following VLMs for expert selection: Qwen2.5-VL-7B-Instruct [30], MiMo-VL-7B-RL [31], InternVL3-8B [32], GLM-4.1V-9B-Thinking [33], and Ovis2.5-9B [34]. These models are chosen as candidate VLMs based on their strong performance on multimodal benchmarks and their diverse architectural designs. For the verdict model, we employ GPT-4o [16] and Qwen2.5-VL-72B-Instruct, given their superior visual reasoning ability. In particular, for information-intensive image benchmarks, we preprocess images with PP-StructureV3 [35] to produce a layout-preserving structured format (see Appendix 12.2 for details), provided together with the original image as auxiliary input to the verdict model.

Baselines. We compare SV with the proprietary models GPT-4o and GPT-4o-mini, and the large open-source model Qwen2.5-VL-72B-Instruct, as it is one of our verdicts. We also evaluate SV against the draft experts mentioned above. These baselines are evaluated under the same chain-of-thought prompting template given in Appendix 15. Additionally, we include DeepEyes [5] as a representative tool-driven baseline with zoom-in operations.

Benchmarks. We evaluate SV on three information-intensive benchmarks and extend the evaluation to a representative high-resolution benchmark, providing a comprehensive assessment of fine-grained visual reasoning: InfographicVQA [13], ChartMuseum [14], ChartQAPro [15], and HR-Bench 4K [17]. InfographicVQA collects infographics with an average resolution above 2K, designed to test reasoning over layout, graphical, and textual content, including operations such as counting, sorting, and basic arithmetic. ChartMuseum and ChartQAPro introduce substantially greater visual reasoning complexity by covering a broad spectrum of real-world chart types and question formats, revealing a large performance gap between current large VLMs and humans. These benchmarks require models to visually ground relevant regions, extract information, and conduct reasoning to answer queries.

We further assess generalization to high-resolution images on HR-Bench 4K. It comprises two sub-tasks: FSP (Fine-grained Single-instance Perception) and FCP (Fine-grained Cross-instance Perception), stressing small-object perception and cross-instance reasoning under high-resolution inputs.

Figure 4: SV’s correction ability on the verdict’s error cases across information-intensive benchmarks (GPT-4o as verdict). We consider only cases where the verdict itself fails, to isolate SV’s independent correction capacity. For each benchmark, three bars denote expert correctness categories (majority-correct, minority-correct, and zero-correct), defined by how many selected experts provide the correct answer. Within each category, the bars are split into the proportion corrected by SV (dark) versus not corrected (light). More details can be found in Appendix 10.

4.2 Results on Information-Intensive Benchmarks↩︎

As shown in Table [tab:infographics], SV demonstrates superior performance across all benchmarks, outperforming a wide range of baselines. Based on the results, we have the following key observations:

(i) SV shows consistent gains over all strong draft-expert baselines, with improvements of 3.6% on InfographicVQA, 1.3% on ChartMuseum, and 6.6% on ChartQAPro with GPT-4o as verdict. SV also achieves comparable gains with Qwen2.5-VL-72B-Instruct as the verdict.

(ii) Importantly, SV enables strong error correction beyond simple answer aggregation. Figure 4 analyzes SV’s performance on cases where the verdict itself fails, categorized by expert correctness (minority-correct, majority-correct, zero-correct). Across benchmarks, SV recovers 47-53% of minority-correct cases, where few draft experts are correct and the verdict alone also fails (case in Figure 3). Moreover, SV even recovers 2.5-4.5% of zero-correct cases, where neither the drafts nor the verdict answers correctly (case in Appendix 14). In these cases, SV succeeds because errors in information-intensive visual reasoning are often decomposable, enabling SV to extract partially correct components from different draft reasoning paths while rejecting misleading cues. Thus, SV achieves effective correction where traditional ensemble methods fail.

(iii) SV strengthens large verdict models significantly, and using GPT-4o as verdict delivers stronger results due to its reasoning advantage on information-intensive benchmarks. Specifically, when GPT-4o is used as verdict, SV surpasses the GPT-4o baseline by 11.9% on InfographicVQA, 6.6% on ChartMuseum, and 11.4% on ChartQAPro. These improvements come with reduced inference cost for the large verdict model, demonstrating that SV can outperform much larger or proprietary LVLMs in a cost-efficient manner.

(iv) SV substantially outperforms the representative tool-driven pipeline DeepEyes, with gains of +12.9% on InfographicVQA, +21.3% on ChartMuseum, and +11.3% on ChartQAPro. This gap arises because DeepEyes is strong in local grounding but less effective when reasoning over dense textual and visual content. For example, it often focuses on text spans or legends rather than the full regions needed for analytical operations, and its zoom-in calls are sometimes redundant or misdirected (see Appendix 13 for error analysis). As a result, it struggles with global comparison and dispersed evidence synthesis. In contrast, SV’s reasoning-path synthesis enables it to integrate evidence across regions reliably without relying on predefined tool-based visual search.

4.3 Results on High-Resolution Benchmark↩︎

We further assess generalization to high-resolution images using HR-Bench 4K to evaluate whether SV can enhance fine-grained visual perception. The key observations are as follows (Table [tab:infographics]):

(i) With Qwen2.5-VL-72B-Instruct as verdict, SV achieves its largest margin, surpassing the best-performing draft expert by 2.6% and even outperforming the verdict itself by 2.5%. The superior performance of Qwen2.5-VL-72B as verdict on this task correlates with its stronger visual localization capabilities, indicating verdict selection should align with task-specific requirements.

(ii) SV also exceeds DeepEyes, which is explicitly trained with zoom-in tools for iterative visual search on high-resolution perception. This highlights SV’s ability to generalize to high-resolution tasks, where accurate recognition of small objects is critical. Aligning perceptually strong draft experts with a verdict thus provides a simpler yet effective solution for high-resolution reasoning.

4.4 Ablation Study↩︎

To better understand the effectiveness of SV, we conduct ablation studies on information-intensive benchmarks to analyze the impact of individual components. In these experiments, the reasoning baseline refers to the best-performing draft expert in our pool for each benchmark (Table [tab:infographics]).

Number of Draft Experts. Our setting with \(m=3\) draft experts yields a favorable trade-off between accuracy and efficiency, as it determines the number of reasoning paths forwarded to the verdict. As shown in Figure 5, we observe that the performance improves nearly linearly up to three draft experts and then saturates, while inference cost grows roughly linearly with \(m\).

Consensus Expert Selection Strategy. We confirm the effectiveness of our cross-all selection strategy by comparing it with a best-reference strategy. In the best-reference variant, the top-performing draft expert serves as reference and the two most consistent experts are selected with it. While best-reference is expected to be the strongest criterion, cross-all achieves comparable gains while remaining reference-free (Figure [fig:strategy]).

Selection Criteria. Selecting consensus-based experts consistently improves performance, while divergent selection can even fall below the single-draft reasoning baseline (Figure 6). These results support that, for information-intensive tasks, consensus-based selection more reliably identifies the correct reasoning path than enforced diversity.

Figure 5: Ablations on the number of draft experts m.

Impact of Verdict Stage. The verdict stage yields higher performance than majority voting across information-intensive benchmarks (Figure [fig:mv]). Notably, majority voting with all five draft experts performs comparably to majority voting with three draft experts, consistent with our finding that consensus selection can match the performance of all drafts at a lower cost (Figure 5). SV further surpasses both by leveraging the verdict’s error correction ability, successfully capturing minority-correct cases that majority voting discards (Figure 4 and Figure 3).

Choice of Verdict Textual Input. Providing full reasoning paths to the verdict yields substantially better performance than passing only final answers (Table [tab:expertselect]), with improvements of 15% on InfographicVQA and 4.8% on ChartQAPro. These results highlight that rich contextual evidence is essential for the verdict to recover correct reasoning, whereas final predictions alone are insufficient.

Choice of Verdict Scale. Using a large verdict model yields stronger gains than a small verdict model. For ablations, we select GLM-4.1V-9B-Thinking as the small verdict because it is the strongest reasoning model among the baselines. However, results in Table [tab:verdictinfo] show that it brings only modest improvements, while GPT-4o delivers additional gains of 3.4% on InfographicVQA and 1.3% on ChartMuseum compared to this small verdict. These results indicate that even reasoning-strong small verdicts offer limited benefit in synthesizing correct answers, validating SV’s design principle of invoking a strong verdict only once to achieve robust and efficient error correction.

Figure 6: Ablations on selection criteria.
Ablations on verdict scale. A subset of 1000 samples is tested on InfographicVQA.

5 Conclusion↩︎

This paper introduces Speculative Verdict (SV), a training-free framework to address challenges of information-intensive visual reasoning. Inspired by speculative decoding, SV repositions large models as efficient synthesizers rather than computationally expensive step-by-step reasoners. By integrating diverse reasoning paths from lightweight experts, the verdict can distinguish informative cues and recover correctness from structured errors. Experiments show that SV consistently outperforms strong proprietary, open-source, and tool-driven methods, establishing a cost-efficient paradigm for reasoning on information-intensive images.

6 Ethics Statement↩︎

This work does not involve human subjects, sensitive personal data, biometrics, or medical information. All datasets used are publicly available under permissible licenses and are not privacy-sensitive. We recognize that any automated reasoning system may produce incorrect or misleading outputs. To ensure responsible use, we emphasize that our method is intended for research and analysis rather than deployment in high-stakes settings. Users are encouraged to verify model outputs and apply human oversight when necessary. We take full responsibility for all reported results, analyses, and claims, and we welcome community scrutiny and feedback.

7 Reproducibility Statement↩︎

To support reproducibility, we provide comprehensive implementation details throughout our paper. Key experimental configurations, such as draft expert selection, consensus scoring computation, and verdict model specifications, are documented in Section 3.4 and Section 4.1. Detailed prompt templates are presented in Appendix 15. The code is released to further clarify the implementation steps and enable faithful reproduction of our results.

8 Dataset statistics↩︎

Table [tab:datasetstats] reports the statistics of the four evaluation benchmarks. All benchmarks are based on real-world images rather than synthetic renderings, ensuring the authenticity and diversity of the evaluation setting. In particular, InfographicVQA, ChartMuseum, and ChartQAPro are information-intensive benchmarks: they contain thousands of images and questions with dense textual and numerical content, collected from diverse sources spanning 2594, 157, and 184 distinct web domains respectively [13]–[15]. This diversity reduces source bias and reflects practical challenges in multimodal reasoning.

HR-Bench 4K is used primarily to evaluate the generalization of our method, serving as a high-resolution benchmark with average sizes exceeding 4000×3500 pixels [17]. At the same time, one of our main benchmarks, InfographicVQA, also exhibits high-resolution characteristics. In particular, it frequently contains long-format images where diagrams span large vertical layouts (see the case in Figure 3), which further compounds the difficulty of grounding and multi-hop reasoning across dispersed regions.


9 Costs↩︎

Table [tab:cost] reports the average inference cost of invoking GPT-4o as the verdict model per sample across benchmarks. Costs are estimated using the official GPT-4o pricing (version gpt-4o-2024-08-06) as of September 2025. The small variation across benchmarks is mainly attributed to differences in reasoning path length, as more challenging tasks typically induce more complex reasoning. Overall, the inference cost of using GPT-4o as the verdict is under $0.011 per sample across all benchmarks.
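For reference, the per-sample figure can be reproduced from token counts with the usual pricing formula; the per-million-token rates and token counts below are illustrative placeholders rather than the exact values used in Table [tab:cost].

```python
INPUT_USD_PER_MTOKEN = 2.50    # placeholder input rate (USD per 1M prompt tokens)
OUTPUT_USD_PER_MTOKEN = 10.00  # placeholder output rate (USD per 1M completion tokens)

def verdict_cost_usd(prompt_tokens, completion_tokens):
    return (prompt_tokens * INPUT_USD_PER_MTOKEN
            + completion_tokens * OUTPUT_USD_PER_MTOKEN) / 1e6

# e.g. ~3,000 prompt tokens (image + reasoning paths) and ~50 answer tokens:
print(verdict_cost_usd(3000, 50))   # ≈ $0.008 per sample, consistent with the figures above
```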


10 Supplementary recovery analysis on information-intensive benchmarks↩︎

Table [tab:recovery_full] and Figure 7 show the detailed recovery statistics across information-intensive benchmarks with GPT-4o as verdict. We break down SV’s performance by expert correctness: (i) cases where the majority of draft experts are correct (majority-correct), (ii) cases where only a minority are correct (minority-correct), and (iii) cases where none are correct (zero-correct). While the main paper focuses on GPT-4o’s error cases to isolate SV’s effectiveness, we provide the full results here for completeness.

Notably, in the zero-correct setting, recovery is rare (2.6–24%), but it demonstrates the verdict’s surprising ability to infer the correct answer by synthesizing signal from entirely noisy reasoning paths.

Figure 7: SV’s correction ability on the verdict’s correct cases (GPT-4o as verdict), complementing the error cases shown in Figure 4.

11 Ablation study on model pool compositions↩︎

Beyond the fixed model pool used in the main experiments, we further examine SV’s generalizability across different model pool compositions by testing on pools with varying model sizes and capabilities. The results show that SV successfully leverages reasoning paths from lightweight models, delivering strong performance while maintaining cost efficiency.

Evaluation with 7-9B Model Pool (Non-Thinking). SV maintains its effectiveness when replacing thinking models with faster non-thinking alternatives. Specifically, we replace the two thinking models in our original pool (i.e., GLM-4.1V-9B-Thinking [33] and MiMo-VL-7B-RL [31]) with non-thinking models (i.e., LLaVA-OneVision-1.5-8B [36] and Eagle 2.5-8B [37]), while keeping the remaining three models unchanged. While these substitutes sacrifice some reasoning capability, they enable faster inference. As shown in Table [tab:largeinfographics], with GPT-4o as verdict, SV achieves 86.3% on InfographicVQA under this configuration, surpassing all baselines. Notably, SV outperforms the best draft expert by 4.6% and exceeds the large 72B model by 1.9%. These results demonstrate that SV achieves strong performance by integrating reasoning paths from individually weaker but faster models.

Evaluation with 2-4B Model Pool. We also evaluate SV on an even smaller model pool consisting of 2-4B models: Qwen2.5-VL-3B-Instruct [30], LLaVA-OneVision-1.5-4B [36], InternVL3.5-4B [38], Gemma 3-4B [39], and Ovis2.5-2B [34]. As shown in Table [tab:smallinfographics], with GPT-4o as verdict, SV achieves 84.5% on InfographicVQA, surpassing the best draft expert by 9.5% and the 72B baseline by 0.3%. This demonstrates SV’s ability to extract effective collective reasoning even from significantly weaker individual models, confirming the robustness of our paradigm across varying model scales.


12 Ablation studies on verdict input configuration↩︎

12.1 Impact of visual input to verdict↩︎

We examine whether visual input is necessary for the verdict or if reasoning paths alone suffice. Table [tab:visualinput] presents results where the verdict receives only textual reasoning paths without image input. The results show that SV without visual input achieves only modest gains over the reasoning baseline on InfographicVQA, and even underperforms it on ChartMuseum and ChartQAPro. In contrast, incorporating visual input for the verdict yields substantial improvements across all benchmarks. These results demonstrate that visual grounding is essential for the verdict to cross-check the factual accuracy of extracted information and distinguish correct from incorrect interpretations of the image.


12.2 Impact of structured image input to verdict↩︎

In our experimental setup in Section 4.1, we preprocess each image via PP-StructureV3, a document parsing model that generates Markdown representations capturing layout, textual blocks, and visual metadata [35]. This structured representation is then rendered as an image and provided as an additional image input for the verdict. This allows the verdict to access both the raw visual content and a layout-aware text representation simultaneously. To verify whether this input is critical or merely auxiliary, we conduct an ablation study (Table [tab:ocr]).

The results show that SV achieves substantial gains over the reasoning baseline even without structured input. With the structured input, performance is generally slightly improved, though the gain is negligible or even marginally lower in some cases. This pattern suggests that structured OCR-derived signals are not essential for SV’s core performance, but may assist the verdict to distinguish among competing reasoning paths.


13 Error Analysis of Tool-driven Pipeline↩︎

As mentioned in Section 2, tool-driven methods represent a line of work that augments vision-language reasoning with explicit zoom-in operations. The representative pipeline DeepEyes is designed to iteratively ground image regions and integrate them into the ongoing reasoning trajectory under an RL framework. This mechanism has proven effective on high-resolution benchmarks, where localized inspection of fine details is crucial.

However, DeepEyes is not specifically trained on our benchmarks, which require reasoning over information-intensive images with densely interleaved textual and visual elements. Its performance on InfographicVQA reveals the current limitations of such tool-based pipelines in this domain. We categorize the observed deficiencies into three core challenges:

(i) Tendency toward literal grounding. DeepEyes is proficient at small-scale grounding but often focuses on literal text spans or legends rather than reasoning-critical regions. For example, when a question requires aligning numerical values with a chart axis, the model frequently grounds directly onto the answer text or nearby labels instead of the relevant data regions. This shortcut strategy works for simple queries but fails on complex reasoning on information-intensive images that require global comparison.

(ii) Inefficient tool usage. Although DeepEyes is trained to iteratively apply zoom-in tools, we observe that it invokes only one zoom step in more than half of the test cases. Among the double-zoom cases, 92.8% duplicate the same bounding box, which serves only for verification rather than exploration. In some instances, the model zooms into empty areas or irrelevant regions.

(iii) Lack of robustness on long and dense images. Information-intensive images often contain multi-panel figures and dense annotations. DeepEyes cannot maintain a trajectory across multiple zoom steps, making it difficult to integrate dispersed evidence. As a result, tasks requiring cross-region synthesis, such as counting, sorting, or comparing across multiple subplots, remain challenging for it.

Overall, this analysis indicates that while tool-driven pipelines are promising for high-resolution inspection tasks, they face notable difficulties when applied to information-intensive images without domain-specific supervision. In contrast, SV achieves strong performance without additional training, offering a simple and effective alternative for reasoning over complex multimodal inputs.

14 Qualitative example↩︎

Figure 8 illustrates a case where all three draft experts produced incorrect reasoning paths, yet the verdict successfully corrected the answer. The draft experts failed in different ways: some mis-extracted information from the image, while others extracted the key information correctly but failed to sort the values properly; thus all generated wrong answers. Interestingly, the verdict itself, when asked directly, also answers “Australia” incorrectly. However, when analyzing the noisy and conflicting reasoning paths together, the verdict was able to recover the correct answer (Portugal).

This example complements the main results section: while Figure 3 illustrates recovery from minority-correct experts, here we present a zero-correct case to show that SV can still synthesize the correct solution even when all drafts and the verdict individually fail.

15 Prompt templates↩︎

15.1 Chain-of-Thought Prompts↩︎

As described in Section 4.1, we employ a Chain-of-Thought prompt for each consensus expert to generate reasoning paths and apply it identically when evaluating baselines. For InfographicVQA and HR-Bench 4K, we use the same CoT prompt. For ChartMuseum [14], we adopt its official reasoning prompt, and adapt that prompt strategy to ChartQAPro, given their similarity in task complexity. Since ChartQAPro requires different prompt templates tailored to question types [15], we first follow its official template per question type, then concatenate it with our reasoning prompt.

The reasoning prompts for these datasets are shown in Figure 9.

15.2 Prompts for Verdict↩︎

The user prompts used in the verdict stage are identical across datasets except for the final instruction sentence, which is customized (see Figure 11). For GPT-4o as verdict, the system prompt is shown in Figure 10. For Qwen-2.5-VL-72B-Instruct as verdict, we prepend its system prompt at the beginning of the user prompt.

16 The Use of Large Language Models (LLMs)↩︎

In this work, we used LLMs solely for auxiliary tasks such as language polishing, prompt refining, and proofreading. Importantly, these interventions did not contribute any main scientific insight, experimental design, or methodological advance. All core ideas, experiments, analyses, and claims in this paper are the work of the authors.

Figure 8: A qualitative zero-correct case corrected by verdict. All three draft experts fail due to errors in extracting or sorting visual information, yet the verdict synthesizes their noisy reasoning paths to recover the correct answer (i.e., Portugal).


Figure 9: Prompt templates for reasoning.


Figure 10: System prompt template for verdict.


Figure 11: User prompt templates for verdict.

References↩︎

[1]
Zongxia Li, Xiyang Wu, Hongyang Du, Fuxiao Liu, Huy Nghiem, and Guangyao Shi. A survey of state of the art large vision language models: Alignment, benchmark, evaluations and challenges. arXiv preprint arXiv:2501.02189, 2025.
[2]
Chaoyou Fu, Yi-Fan Zhang, Shukang Yin, Bo Li, Xinyu Fang, Sirui Zhao, Haodong Duan, Xing Sun, Ziwei Liu, Liang Wang, et al. Mme-survey: A comprehensive survey on evaluation of multimodal llms. arXiv preprint arXiv:2411.15296, 2024.
[3]
Zhaochen Su, Peng Xia, Hangyu Guo, Zhenhua Liu, Yan Ma, Xiaoye Qu, Jiaqi Liu, Yanshu Li, Kaide Zeng, Zhengyuan Yang, et al. Thinking with images for multimodal reasoning: Foundations, methods, and future frontiers. arXiv preprint arXiv:2506.23918, 2025.
[4]
Fucai Ke, Joy Hsu, Zhixi Cai, Zixian Ma, Xin Zheng, Xindi Wu, Sukai Huang, Weiqing Wang, Pari Delir Haghighi, Gholamreza Haffari, Ranjay Krishna, Jiajun Wu, and Hamid Rezatofighi. Explain before you answer: A survey on compositional visual reasoning, 2025.
[5]
Ziwei Zheng, Michael Yang, Jack Hong, Chenxiao Zhao, Guohai Xu, Le Yang, Chao Shen, and Xing Yu. Deepeyes: Incentivizing "thinking with images" via reinforcement learning. arXiv preprint arXiv:2505.14362, 2025.
[6]
Alex Su, Haozhe Wang, Weiming Ren, Fangzhen Lin, and Wenhu Chen. Pixel reasoner: Incentivizing pixel-space reasoning with curiosity-driven reinforcement learning. arXiv preprint arXiv:2505.15966, 2025.
[7]
Yue Fan, Xuehai He, Diji Yang, Kaizhi Zheng, Ching-Chen Kuo, Yuting Zheng, Sravana Jyothi Narayanaraju, Xinze Guan, and Xin Eric Wang. Grit: Teaching mllms to think with images. arXiv preprint arXiv:2505.15879, 2025.
[8]
Xintong Zhang, Zhi Gao, Bofei Zhang, Pengxiang Li, Xiaowen Zhang, Yang Liu, Tao Yuan, Yuwei Wu, Yunde Jia, Song-Chun Zhu, et al. Chain-of-focus: Adaptive visual search and zooming for multimodal reasoning via rl. arXiv preprint arXiv:2505.15436, 2025.
[9]
Jiarui Zhang, Mahyar Khayatkhoei, Prateek Chhikara, and Filip Ilievski. Mllms know where to look: Training-free perception of small visual details with multimodal llms. arXiv preprint arXiv:2502.17422, 2025.
[10]
Haozhan Shen, Kangjia Zhao, Tiancheng Zhao, Ruochen Xu, Zilun Zhang, Mingwei Zhu, and Jianwei Yin. Zoomeye: Enhancing multimodal llms with human-like zooming capabilities through tree-based image exploration. arXiv preprint arXiv:2411.16044, 2024.
[11]
Wenbin Wang, Yongcheng Jing, Liang Ding, Yingjie Wang, Li Shen, Yong Luo, Bo Du, and Dacheng Tao. Retrieval-augmented perception: High-resolution image perception meets visual rag. arXiv preprint arXiv:2503.01222, 2025.
[12]
Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pp. 19274–19286. PMLR, 2023.
[13]
Minesh Mathew, Viraj Bagal, Rubèn Pérez Tito, Dimosthenis Karatzas, Ernest Valveny, and C. V. Jawahar. Infographicvqa. arXiv preprint arXiv:2104.12756, 2021.
[14]
Liyan Tang, Grace Kim, Xinyu Zhao, Thom Lake, Wenxuan Ding, Fangcong Yin, Prasann Singhal, Manya Wadhwa, Zeyu Leo Liu, Zayne Sprague, et al. Chartmuseum: Testing visual reasoning capabilities of large vision-language models. arXiv preprint arXiv:2505.13444, 2025.
[15]
Ahmed Masry, Mohammed Saidul Islam, Mahir Ahmed, Aayush Bajaj, Firoz Kabir, Aaryaman Kartha, Md Tahmid Rahman Laskar, Mizanur Rahman, Shadikur Rahman, Mehrad Shahmohammadi, et al. Chartqapro: A more diverse and challenging benchmark for chart question answering. arXiv preprint arXiv:2504.05506, 2025.
[16]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024.
[17]
Wenbin Wang, Liang Ding, Minyan Zeng, Xiabin Zhou, Li Shen, Yong Luo, Wei Yu, and Dacheng Tao. Divide, conquer and combine: A training-free framework for high-resolution image perception in multimodal large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp. 7907–7915, 2025.
[18]
Yushi Hu, Weijia Shi, Xingyu Fu, Dan Roth, Mari Ostendorf, Luke Zettlemoyer, Noah A Smith, and Ranjay Krishna. Visual sketchpad: Sketching as a visual chain of thought for multimodal language models. Advances in Neural Information Processing Systems, 37: 139348–139379, 2024.
[19]
Heming Xia, Zhe Yang, Qingxiu Dong, Peiyi Wang, Yongqi Li, Tao Ge, Tianyu Liu, Wenjie Li, and Zhifang Sui. Unlocking efficiency in large language model inference: A comprehensive survey of speculative decoding. arXiv preprint arXiv:2401.07851, 2024.
[20]
Wang Yang, Xiang Yue, Vipin Chaudhary, and Xiaotian Han. Speculative thinking: Enhancing small-model reasoning with large model guidance at inference time. arXiv preprint arXiv:2504.12329, 2025.
[21]
Rui Pan, Yinwei Dai, Zhihao Zhang, Gabriele Oliaro, Zhihao Jia, and Ravi Netravali. Specreason: Fast and accurate inference-time compute via speculative reasoning. arXiv preprint arXiv:2504.07891, 2025.
[22]
Yichao Fu, Rui Ge, Zelei Shao, Zhijie Deng, and Hao Zhang. Scaling speculative decoding with lookahead reasoning. arXiv preprint arXiv:2506.19830, 2025.
[23]
Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, and Caiming Xiong. Reward-guided speculative decoding for efficient llm reasoning. arXiv preprint arXiv:2501.19324, 2025.
[24]
Jiale Fu, Yuchu Jiang, Junkai Chen, Jiaming Fan, Xin Geng, and Xu Yang. Fast large language model collaborative decoding via speculation. arXiv preprint arXiv:2502.01662, 2025.
[25]
Xinyun Chen, Renat Aksitov, Uri Alon, Jie Ren, Kefan Xiao, Pengcheng Yin, Sushant Prakash, Charles Sutton, Xuezhi Wang, and Denny Zhou. Universal self-consistency for large language model generation. arXiv preprint arXiv:2311.17311, 2023.
[26]
Jianing Qi, Xi Ye, Hao Tang, Zhigang Zhu, and Eunsol Choi. Learning to reason across parallel samples for llm reasoning. arXiv preprint arXiv:2506.09014, 2025.
[27]
Wenting Zhao, Pranjal Aggarwal, Swarnadeep Saha, Asli Celikyilmaz, Jason Weston, and Ilia Kulikov. The majority is not always right: Rl training for solution aggregation. arXiv preprint arXiv:2509.06870, 2025.
[28]
Chen Zhang, Zhuorui Liu, and Dawei Song. Beyond the speculative game: A survey of speculative execution in large language models. arXiv preprint arXiv:2404.14897, 2024.
[29]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35: 24824–24837, 2022.
[30]
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. arXiv preprint arXiv:2502.13923, 2025.
[31]
LLM-Core-Team Xiaomi. Mimo-vl technical report, 2025. URL https://arxiv.org/abs/2506.03569.
[32]
Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479, 2025.
[33]
V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, Guo Wang, Guobing Gan, Haomiao Tang, Jiale Cheng, Ji Qi, Junhui Ji, Lihang Pan, Shuaiqi Duan, Weihan Wang, Yan Wang, Yean Cheng, Zehai He, Zhe Su, Zhen Yang, Ziyang Pan, Aohan Zeng, Baoxu Wang, Bin Chen, Boyan Shi, Changyu Pang, Chenhui Zhang, Da Yin, Fan Yang, Guoqing Chen, Jiazheng Xu, Jiale Zhu, Jiali Chen, Jing Chen, Jinhao Chen, Jinghao Lin, Jinjiang Wang, Junjie Chen, Leqi Lei, Letian Gong, Leyi Pan, Mingdao Liu, Mingde Xu, Mingzhi Zhang, Qinkai Zheng, Sheng Yang, Shi Zhong, Shiyu Huang, Shuyuan Zhao, Siyan Xue, Shangqin Tu, Shengbiao Meng, Tianshu Zhang, Tianwei Luo, Tianxiang Hao, Tianyu Tong, Wenkai Li, Wei Jia, Xiao Liu, Xiaohan Zhang, Xin Lyu, Xinyue Fan, Xuancheng Huang, Yanling Wang, Yadong Xue, Yanfeng Wang, Yanzi Wang, Yifan An, Yifan Du, Yiming Shi, Yiheng Huang, Yilin Niu, Yuan Wang, Yuanchang Yue, Yuchen Li, Yutao Zhang, Yuting Wang, Yu Wang, Yuxuan Zhang, Zhao Xue, Zhenyu Hou, Zhengxiao Du, Zihan Wang, Peng Zhang, Debing Liu, Bin Xu, Juanzi Li, Minlie Huang, Yuxiao Dong, and Jie Tang. Glm-4.5v and glm-4.1v-thinking: Towards versatile multimodal reasoning with scalable reinforcement learning, 2025. URL https://arxiv.org/abs/2507.01006.
[34]
Shiyin Lu, Yang Li, Yu Xia, Yuwei Hu, Shanshan Zhao, Yanqing Ma, Zhichao Wei, Yinglun Li, Lunhao Duan, Jianshan Zhao, et al. Ovis2.5 technical report. arXiv preprint arXiv:2508.11737, 2025.
[35]
Cheng Cui, Ting Sun, Manhui Lin, Tingquan Gao, Yubo Zhang, Jiaxuan Liu, Xueqing Wang, Zelun Zhang, Changda Zhou, Hongen Liu, et al. Paddleocr 3.0 technical report. arXiv preprint arXiv:2507.05595, 2025.
[36]
Xiang An, Yin Xie, Kaicheng Yang, Wenkang Zhang, Xiuwei Zhao, Zheng Cheng, Yirui Wang, Songcen Xu, Changrui Chen, Chunsheng Wu, et al. Llava-onevision-1.5: Fully open framework for democratized multimodal training. arXiv preprint arXiv:2509.23661, 2025.
[37]
Guo Chen, Zhiqi Li, Shihao Wang, Jindong Jiang, Yicheng Liu, Lidong Lu, De-An Huang, Wonmin Byeon, Matthieu Le, Tuomas Rintamaki, et al. Eagle 2.5: Boosting long-context post-training for frontier vision-language models. arXiv preprint arXiv:2504.15271, 2025.
[38]
Weiyun Wang, Zhangwei Gao, Lixin Gu, Hengjun Pu, Long Cui, Xingguang Wei, Zhaoyang Liu, Linglin Jing, Shenglong Ye, Jie Shao, et al. Internvl3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265, 2025.
[39]
Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ramé, Morgane Rivière, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025.