Conan: Progressive Learning to Reason Like a Detective over Multi-Scale Visual Evidence

Kun Ouyang\(^{1}\), Yuanxin Liu\(^{1}\), Linli Yao\(^{1}\), Yishuo Cai\(^{1}\),
Hao Zhou\(^{2}\), Jie Zhou\(^{2}\), Fandong Meng\(^{2}\), Xu Sun\(^{1}\)
\(^{1}\)State Key Laboratory for Multimedia Information Processing,
School of Computer Science, Peking University
\(^{2}\)WeChat AI, Tencent Inc., China
kunouyang10@gmail.com,


Abstract

Video reasoning, which requires multi-step deduction across frames, remains a major challenge for multimodal large language models (MLLMs). While reinforcement learning (RL)-based methods enhance reasoning capabilities, they often rely on text-only chains that yield ungrounded or hallucinated conclusions. Conversely, frame-retrieval approaches introduce visual grounding but still struggle with inaccurate evidence localization. To address these challenges, we present Conan, a framework for evidence-grounded multi-step video reasoning. Conan identifies contextual and evidence frames, reasons over cross-frame clues, and adaptively decides when to conclude or explore further. To achieve this, we (1) construct Conan-91k, a large-scale dataset of automatically generated reasoning traces that includes frame identification, evidence reasoning, and action decision, and (2) design a multi-stage progressive cold-start strategy combined with an Identification–Reasoning–Action (AIR) RLVR training framework to jointly enhance multi-step visual reasoning. Extensive experiments on six multi-step reasoning benchmarks demonstrate that Conan surpasses the baseline Qwen2.5-VL-7B-Instruct by an average of over 10% in accuracy, achieving state-of-the-art performance. Furthermore, Conan generalizes effectively to long-video understanding tasks, validating its strong scalability and robustness.

1 Introduction↩︎

There is only one truth!

– Edogawa Conan

Frontier multimodal large language models (MLLMs) [1][4] have demonstrated remarkable progress on standard video understanding tasks such as question answering [5], temporal grounding [6], and captioning [7]. However, video reasoning [8], [9] remains a substantial challenge. Unlike conventional tasks, video reasoning demands active visual information accumulation across temporal spans and multi-step logical inference to reach well-grounded conclusions.

Inspired by the success of reinforcement learning with verifiable rewards (RLVR) [10] in incentivizing the reasoning abilities of LLMs, recent works [11][13] have begun extending this paradigm to video reasoning, achieving promising gains. Nevertheless, these approaches primarily rely on pure-text reasoning without explicit grounding in visual evidence, often leading to superficial or hallucinated reasoning chains that fail to reflect the actual video content. To integrate visual evidence into the reasoning process, concurrent works [14][16] introduce a frame retrieval mechanism to enable video chain-of-thought (Video-CoT) reasoning, boosting performance on long-video understanding [17], [18]. However, these approaches usually suffer from inaccurate or implicit evidence localization, yielding unreliable reasoning paths. Additionally, some methods [15], [16] partially rely on benchmark-specific training data (Video-Holmes [8] and LongVideoReason [19]), making it difficult to disentangle solid reasoning improvements from in-domain overfitting.

Motivated by these challenges, we aim to equip MLLMs with multi-step, evidence-grounded video reasoning skills, analogous to how Conan works as a detective (Figure [fig:teaser]). Specifically, our framework identifies relevant frames at multiple scales (including both contextual and evidence frames), reasons over cross-frame clues to form coherent chains of deduction, and decides whether to draw the final conclusion or continue exploring the video. Achieving this goal raises two core challenges: 1) how to automatically construct a high-quality evidence-based reasoning dataset that explicitly captures evidence localization, multi-step deductive reasoning, and confident action decisions, and 2) how to design an effective training procedure and objectives.

To tackle the first challenge, we introduce Conan-91k, a large-scale dataset for Conan-style evidence reasoning. Built upon the key-frame identification dataset GenS-Video-150K [20], we develop an automated pipeline to generate interleaved video–text reasoning traces using the advanced reasoning LLM Kimi K2 [29], as illustrated in Figure 1. Each reasoning trace contains three key components: 1) Frame Identification, which distinguishes between evidence, contextual, and irrelevant frames; 2) Evidence Reasoning, which conducts textual reasoning over the question and accumulated visual clues; and 3) Action Decision, which decides whether to sample additional frames or reach the final conclusion.

To address the second challenge, we propose a training procedure that combines a multi-stage progressive cold-start strategy for SFT with a joint Identification–Reasoning–Action (AIR) optimization framework for RLVR. Specifically, the progressive cold-start strategy incrementally enhances the model’s multi-step reasoning—starting from textual reasoning, advancing to multimodal alignment, and culminating in vision-centric deduction. Building upon this foundation, the joint AIR RLVR framework further guides the model to perform multi-step reasoning over multi-scale visual evidence. Together, these components empower Conan to “seek, deduce, and act” across visual clues, achieving reliable and verifiable multi-step reasoning.

Extensive experiments on six challenging multi-step reasoning benchmarks (MMR-V [21], Video-Holmes [8], VRBench [22], VCRBench [23], LongVideoReason [19], and Human-P&C [24]) demonstrate that Conan consistently surpasses state-of-the-art MLLMs, achieving over 10% accuracy gains over the baseline Qwen2.5-VL-7B-Instruct. Moreover, Conan generalizes effectively to long-video understanding tasks (LongVideoBench [17], MLVU [18], LVBench [25], and Video-MME [26]), validating strong scalability and robustness.

In summary, our contributions are threefold: 1) We introduce Conan-91k, the first large-scale dataset for multi-scale evidence reasoning with evidence difficulty-aware sampling. 2) We propose a multi-stage progressive cold-start strategy and a joint Identification–Reasoning–Action RLVR framework to foster gradual acquisition of evidence-based reasoning skills. 3) We conduct extensive experiments, which demonstrate that Conan achieves state-of-the-art multi-step reasoning performance among frontier MLLMs and also generalizes well to standard long-video understanding.

2 Related Work↩︎

2.1 Video Reasoning Tasks↩︎

Recent advances in multimodal large language models (MLLMs), such as Qwen2.5-VL [1], Kimi-VL [2], MiMo-VL [3], and GPT-4o [4], have substantially improved video understanding, including captioning [7], question answering [5], and temporal grounding [6]. However, these capabilities mainly reflect perceptual understanding [27], whereas video reasoning [28], which demands multi-hop deduction and causal inference across frames, remains insufficiently explored and evaluated. To address this gap, several benchmarks have been introduced to assess the reasoning capabilities of MLLMs, such as Video-Holmes [8], VideoReasonBench [9], MMR-V [21], VRBench [22], and VCRBench [23]. Unlike conventional video understanding tasks focused on recognizing visual content, these benchmarks require models to actively locate, connect, and reason over multiple relevant clues, demanding a deeper comprehension of temporal dependencies and causal structures in dynamic visual narratives.

2.2 Video Reasoning Models↩︎

Inspired by the reasoning advances of DeepSeek-R1 [10], several studies [11][13] adopt reinforcement learning with verifiable rewards (RLVR) [10] to promote video reasoning in MLLMs. While these approaches [11][13] encourage step-by-step reasoning, most are limited to text-only chains of thought that lack explicit grounding in visual evidence, which often leads to unverified or hallucinated reasoning. To bridge this gap, concurrent works such as Video-MTR [14] and FrameThinker [16] incorporate frame retrieval actions into the reasoning process, enabling dynamic evidence gathering and long-form understanding. Despite substantial improvements in long-video understanding [17], [18], [25], they partially depend on benchmark-specific training sets and still lack a reliable evidence identification mechanism, rendering the retrieval actions less reliable. Motivated by this, we aim to develop a framework named Conan that incentivizes deductive, detective-like reasoning abilities in MLLMs, combining precise evidence identification, logical multi-step reasoning, and confident action decisions toward robust video reasoning.

Figure 1: a) Reasoning Trace Construction. b) Data Example. c) Multi-stage Progressive Cold-start, including textual, multimodal alignment, and vision-centric reasoning stages. d) The Joint Identification-Reasoning-Action RLVR.

3 Dataset Construction↩︎

In this section, we introduce the construction procedure of the Conan-91k dataset: (i) Data Collection & Processing, (ii) Reasoning Trace Construction, (iii) Evidence Difficulty-Aware Sampling, and (iv) Dataset Statistics.

3.1 Data Collection & Processing↩︎

We collect source data from the GenS-Video-150K [20] dataset, which provides dense frame descriptions, multiple-choice and free-form QA pairs, as well as frame-level relevance scores. Leveraging these relevance scores, we categorize video frames into three types: 1) Evidence frames, which are directly relevant to answering the question; 2) Contextual frames, which offer auxiliary hints that may support the reasoning process; and 3) Irrelevant frames, which bear no relation to the question. This multi-scale frame categorization establishes the foundation for subsequent stepwise reasoning trace simulation.
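As a hedged illustration, this categorization amounts to thresholding the frame-level relevance scores; the two threshold values in the sketch below are hypothetical, since the score scale of GenS-Video-150K is not restated here.

```python
def categorize_frame(relevance_score: float,
                     evidence_thr: float = 4.0,   # placeholder threshold
                     context_thr: float = 2.0) -> str:  # placeholder threshold
    """Map a GenS-Video-150K relevance score to one of the three frame types.
    Both thresholds are illustrative assumptions, not values from the paper."""
    if relevance_score >= evidence_thr:
        return "evidence"
    if relevance_score >= context_thr:
        return "contextual"
    return "irrelevant"
```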

3.2 Reasoning Trace Construction↩︎

Starting from the processed data above, we apply an automatic pipeline that constructs Conan-style video–text interleaved reasoning traces with the assistance of a strong reasoning LLM, Kimi K2 [29]. Figure 1 a) illustrates the pipeline, and a minimal code sketch of the loop is given after the list below. The core loop proceeds as follows:

  • We first sample \(16\) frames uniformly from the raw video and retain each frame’s type label (evidence, contextual, or irrelevant).

  • If all sampled frames are irrelevant, we select the action Random Frame Sampling, randomly sampling \(8\) new frames, and continue the loop.

  • If some sampled frames are evidence or contextual but the evidence proportion does not exceed a predefined retrieval threshold, we select the action Specific Frame Retrieval, uniformly retrieving \(8\) frames from clips that contain evidence, and continue the loop.

  • If the proportion of evidence frames exceeds the threshold, we terminate the loop by selecting Confident Question Answering as the final action.

  • At every loop iteration, we prompt K2 with the sampled frames, their type labels, the QA pair, dense frame descriptions, timestamps, and the chosen action. K2 then generates a coherent textual reasoning trace that (a) analyzes the QA pair and the sampled frames, and (b) arrives at the chosen action.
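The following minimal Python sketch mirrors the control flow of this loop. The frame representation, the helper names (e.g., `prompt_k2`), and the retrieval-threshold value are assumptions for illustration, not the released pipeline.

```python
import random

# Hypothetical frame representation: each frame is a dict with a timestamp "t",
# a type label derived from GenS-Video-150K relevance scores, and a description.
EVIDENCE, CONTEXTUAL, IRRELEVANT = "evidence", "contextual", "irrelevant"
RETRIEVAL_THRESHOLD = 0.5  # placeholder; the paper does not state the value

def uniform_sample(frames, k):
    """Uniformly sample k frames by index."""
    if len(frames) <= k:
        return list(frames)
    idx = [round(i * (len(frames) - 1) / (k - 1)) for i in range(k)]
    return [frames[i] for i in idx]

def build_trace(frames, qa_pair, prompt_k2, init_k=16, step_k=8, max_rounds=3):
    """Sketch of the trace-construction loop. `prompt_k2` is an assumed helper
    that asks Kimi K2 to write the textual reasoning step for one round."""
    sampled = uniform_sample(frames, init_k)
    trace = []
    for _ in range(max_rounds):
        labels = [f["label"] for f in sampled]
        evidence_ratio = labels.count(EVIDENCE) / len(labels)

        if all(l == IRRELEVANT for l in labels):      # no clues at all
            action = "random_frame_sampling"
            new = random.sample(frames, min(step_k, len(frames)))
        elif evidence_ratio <= RETRIEVAL_THRESHOLD:   # some clues, not enough
            action = "specific_frame_retrieval"
            evidence_clips = [f for f in frames if f["label"] != IRRELEVANT]
            new = uniform_sample(evidence_clips, step_k)
        else:                                         # enough evidence gathered
            action = "confident_question_answering"
            new = []

        # K2 sees the QA pair, sampled frames with labels/descriptions/timestamps,
        # and the chosen action, and writes a reasoning step ending in that action.
        trace.append({"frames": sampled, "action": action,
                      "reasoning": prompt_k2(qa_pair, sampled, action)})
        if not new:
            break
        sampled = sorted(sampled + new, key=lambda f: f["t"])
    return trace
```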

3.3 Evidence Difficulty-Aware Sampling↩︎

To facilitate a progressive training curriculum from simple to complex reasoning cases, we adopt an Evidence Difficulty-Aware Sampling (EDAS) strategy. In particular, we define an Evidence Difficulty Index (EDI) to quantify each sample’s reasoning complexity based on the proportion and temporal dispersion of evidence frames. Let the evidence ratio be \(P=m/N\), where \(m\) and \(N\) denote the numbers of evidence and total frames, respectively. The temporal variance of evidence frames is computed as \(\mathrm{Var} = \frac{1}{m} \sum_{i=1}^{m} \left(x_i - \bar{x}\right)^2\), where \(x_i\) denotes the temporal position of the \(i\)-th evidence frame and \(\bar{x}\) is the mean position of all evidence frames. The overall difficulty is then defined as \(\mathrm{EDI}= (1-P)\,\mathrm{Var}\), where higher EDI values indicate sparser evidence and more challenging reasoning scenarios. Utilizing this metric, we adopt a curriculum-aligned sampling scheme: for SFT, we select 60k samples with 70% having \(\mathrm{EDI}< 0.5\), while for RLVR, we select 31k samples with 70% having \(\mathrm{EDI}> 0.5\). This progressive sampling design ensures a smooth transition from low-difficulty grounding during SFT to high-difficulty multi-hop reasoning in RLVR, fostering gradual acquisition of robust evidence-based reasoning skills.
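As a worked example, the EDI can be computed directly from the evidence-frame positions as below; the unit of the positions (and hence the absolute scale of the 0.5 threshold) is not fully specified in the text, so the scale here is only illustrative.

```python
def evidence_difficulty_index(evidence_positions, num_frames):
    """EDI = (1 - P) * Var, with P = m / N the evidence ratio and Var the
    variance of evidence-frame temporal positions."""
    m = len(evidence_positions)
    if m == 0:
        return float("inf")  # no evidence at all: treat as maximally difficult
    p = m / num_frames
    mean = sum(evidence_positions) / m
    var = sum((x - mean) ** 2 for x in evidence_positions) / m
    return (1 - p) * var

# EDAS curriculum buckets (threshold 0.5 as in the text): roughly 70% of the
# 60k SFT samples come from the low-EDI bucket, while roughly 70% of the 31k
# RLVR samples come from the high-EDI bucket.
def edi_bucket(edi, threshold=0.5):
    return "low" if edi < threshold else "high"
```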

3.4 Dataset Statistics↩︎

The final Conan-91k dataset comprises two subsets: 1) Conan-CoT-60k. 60k in total, including 25k one-round, 25k two-round, and 10k three-round reasoning samples for cold-start. 2) Conan-RLVR-31k. 31k challenging reasoning samples for RLVR to enhance model robustness and reasoning generalization in complex scenarios.

4 Training Procedure↩︎

This section details the training procedure of Conan, comprising two key phases: Multi-Stage Progressive Cold-Start and joint Identification–Reasoning–Action (AIR) RLVR.

4.1 Multi-Stage Progressive Cold-Start↩︎

To progressively activate the model’s multi-step reasoning abilities, we propose a multi-stage progressive cold-start training strategy, which consists of the following three stages:

Textual Reasoning Stage. In the initial stage, the model is trained on \(10\)k single-round samples from Conan-CoT-60k, where frames are represented solely by dense textual descriptions and timestamps. This stage focuses on learning temporal and causal reasoning across ordered frame descriptions, establishing a structured reasoning foundation for subsequent multimodal learning.

Multimodal Alignment Reasoning Stage. The second stage incorporates \(25\)k single-round and \(10\)k two-round samples combining raw visual frames with corresponding textual descriptions and timestamps. This multimodal configuration bridges the gap between text-only and vision-language reasoning, promoting effective alignment between visual and linguistic cues. Moreover, the inclusion of two-round samples introduces the frame retrieval action, allowing the model to learn how to gather supplementary evidence for complex reasoning tasks.

Vision-Centric Reasoning Stage. In the final stage, the model is trained on the complete Conan-CoT-60k, performing reasoning directly over visual frames with interleaved timestamps. This stage compels the model to execute multi-step reasoning across multi-scale visual clues, thereby achieving a higher level of perceptual understanding and visual reasoning competence.

Overall, this progressive curriculum, from text-based to multimodal and ultimately to vision-centric reasoning, enables the model to incrementally acquire robust multi-scale evidence reasoning skills, laying a solid foundation for the subsequent RLVR phase.
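Viewed as a data recipe, the three stages can be summarized compactly as below; the field names and phrasing are ours, a hedged summary of the description above rather than a released configuration.

```python
# Hedged summary of the cold-start curriculum; stage names follow the text,
# field names and structure are illustrative only.
COLD_START_CURRICULUM = [
    {
        "stage": "textual_reasoning",
        "data": "10k single-round samples from Conan-CoT-60k",
        "frame_input": "dense text descriptions + timestamps only",
        "goal": "temporal/causal reasoning over ordered descriptions",
    },
    {
        "stage": "multimodal_alignment_reasoning",
        "data": "25k single-round + 10k two-round samples",
        "frame_input": "raw frames + descriptions + timestamps",
        "goal": "align visual and textual cues; introduce frame retrieval",
    },
    {
        "stage": "vision_centric_reasoning",
        "data": "full Conan-CoT-60k",
        "frame_input": "raw frames with interleaved timestamps",
        "goal": "multi-step reasoning directly over multi-scale visual clues",
    },
]
```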

4.2 AIR RLVR↩︎

Building upon the model Conan-SFT obtained from the cold-start process, we further enhance its multi-step reasoning capabilities via the AIR RLVR framework. Given that the model has already learned to produce reasoning traces consisting of: 1) frame identification, 2) evidence reasoning, and 3) action decision, AIR RLVR aims to optimize the exploration of effective reasoning trajectories through a set of carefully designed reward functions. We first introduce one format reward and two outcome rewards to ensure both structural consistency and answer accuracy.

Format Reward. To enforce structural consistency in model outputs \(y\), we define a format reward \(R_{fmt}\) that verifies whether the required tags are correctly applied; in addition, the model is restricted to performing only one action (random frame sampling, specific frame retrieval, or confident question answering) at a time. The format reward \(R_{fmt}\) is defined as: \[\label{eq:fmt_reward} R_{fmt}(y) = \begin{cases} 0.5, & \text{if } y \text{ matches the format}, \\ 0, & \text{otherwise}. \end{cases}\tag{1}\]
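For illustration, such a check can be implemented with a tag pattern and an action whitelist; the tag names below (`<think>`, `<action>`) are assumptions for this sketch, not the paper's actual output schema.

```python
import re

# Action names follow the text; the tag schema is an assumed placeholder.
ACTIONS = {"random_frame_sampling", "specific_frame_retrieval",
           "confident_question_answering"}

def format_reward(output: str) -> float:
    """Return 0.5 if the output contains a reasoning span and exactly one
    well-formed action tag from the whitelist, otherwise 0."""
    actions = re.findall(r"<action>(.*?)</action>", output, flags=re.S)
    has_reasoning = re.search(r"<think>.*?</think>", output, flags=re.S) is not None
    if has_reasoning and len(actions) == 1 and actions[0].strip() in ACTIONS:
        return 0.5
    return 0.0
```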

Multi-choice Reward. For multi-choice QA, we apply a binary outcome reward \(R_{mc}\) based on exact match between the predicted answer \(y\) and ground truth \(\hat{y}\): \[\label{eq:multi_choice_reward} R_{mc}(y,\hat{y}) = \begin{cases} 1, & \text{if } y = \hat{y}, \\ 0, & \text{otherwise}. \end{cases}\tag{2}\]

Free-form Reward. For free-form QA, the outcome reward \(R_{free}\) is computed as the average of ROUGE-1, ROUGE-2, and ROUGE-L scores [30] between the predicted answer \(y\) and the reference answer \(\hat{y}\): \[\label{eq:free_reward} R_{free}(y,\hat{y}) = \frac{1}{3}\bigl( \mathrm{R}_{1}(y,\hat{y}) + \mathrm{R}_{2}(y,\hat{y}) + \mathrm{R}_{L}(y,\hat{y}) \bigr)\tag{3}\]

To jointly evaluate the quality of multi-scale frame identification and frame retrieval, we design an identification reward \(R_{ide}\) and a retrieval reward \(R_{ret}\): 1) the identification reward \(R_{ide}\) measures the average accuracy of identified evidence/contextual frames across steps, and 2) the retrieval reward \(R_{ret}\) evaluates the quality of retrieved frames by computing the average ratio of evidence/contextual frames among all retrieved frames. The final joint identification-retrieval-outcome reward \(R_{IRO}\) is formulated as: \[\label{eq:IRO_reward} R_{IRO} = \begin{cases} R_{fmt} + R_{o} + R_{ide} + R_{ret}, & \text{if } R_{o} > 0, \\ R_{fmt} + R_{o}, & \text{otherwise}, \end{cases}\tag{4}\] where \(R_{o}\) is the outcome reward and \(o \in \{mc, free\}\). This reward shaping encourages the model to generate structurally valid, evidence-grounded, and accurate reasoning traces while improving retrieval efficiency. Finally, we prompt the model to generate a group of responses \(\{y_1, y_2, \cdots, y_G\}\), where \(G\) is the number of generated responses, and adopt the GRPO [10] algorithm for reinforcement optimization to stabilize training and refine the reasoning policy.
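A minimal sketch of these rewards is given below, assuming simple per-step data structures for the identification and retrieval terms and using Google's `rouge_score` package for the ROUGE terms; none of the function names or formats come from the paper's released code.

```python
from rouge_score import rouge_scorer

_rouge = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def mc_reward(pred: str, gold: str) -> float:
    """Binary exact-match reward for multiple-choice answers (Eq. 2)."""
    return 1.0 if pred.strip().upper() == gold.strip().upper() else 0.0

def free_form_reward(pred: str, gold: str) -> float:
    """Average of ROUGE-1/2/L F-scores between prediction and reference (Eq. 3)."""
    s = _rouge.score(gold, pred)
    return (s["rouge1"].fmeasure + s["rouge2"].fmeasure + s["rougeL"].fmeasure) / 3.0

def identification_reward(pred_labels, gold_labels):
    """Average per-step accuracy of predicted frame labels; both arguments are
    lists of per-step label lists (an assumed format)."""
    accs = []
    for pred, gold in zip(pred_labels, gold_labels):
        hits = sum(p == g for p, g in zip(pred, gold))
        accs.append(hits / max(len(gold), 1))
    return sum(accs) / max(len(accs), 1)

def retrieval_reward(retrieved_steps):
    """Average fraction of evidence/contextual frames among retrieved frames
    per retrieval step."""
    ratios = [sum(lbl in ("evidence", "contextual") for lbl in step) / max(len(step), 1)
              for step in retrieved_steps]
    return sum(ratios) / max(len(ratios), 1)

def joint_iro_reward(r_fmt, r_out, r_ide, r_ret):
    """Eq. (4): add identification/retrieval rewards only when the outcome reward
    is non-zero."""
    return r_fmt + r_out + (r_ide + r_ret if r_out > 0 else 0.0)
```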

Table 1: Evaluation results of base model Qwen2.5-VL-7B-Instruct, Conan, and other baselines on the six multi-step reasoning benchmarks. \(\dagger\) means the model is partly trained on LongVideoReason-QA [19], the training set of LongVideoReason. \(\ast\) denotes the multi-choice subset.
#Params Overall MMR-V Video-Holmes VRBench VCRBench\(\ast\) LongVideoReason Human-P&C
GPT-4o [4] - - 44.0 42.0 68.7 46.9 - 48.4
Gemini 1.5 Pro - - 41.2 44.0 69.3 52.6
Gemini 2.0 Flash - 42.6 30.6 51.7 56.1
Gemini 2.5 Flash - - 51.2 60.0
Gemini 2.5 Pro - - 45.0
LLaVA-OneVision-7B [31] 7B - 6.5 - - 30.7 - 48.5
InternVL3-8B [32] 8B - 33.6 32.3 - - - 54.2
Kimi-VL-A3B-Instruct [2] 3B/16B 44.4 32.4 32.4 60.5 34.3 64.6 42.0
Qwen2.5-VL-72B-Instruct [1] 72B 55.1 39.1 40.2 72.7 50.8 72.3 55.7
Qwen2.5-VL-7B-Instruct [1] 7B 46.9 30.1 28.5 66.4 46.5 61.8 48.2
Video-R1 [11] 7B 44.4 36.3 36.5 69.5 48.0 70.3 49.8
VideoChat-R1 [13] 7B 49.8 36.1 33.0 61.5 48.2 67.9 51.8
Video-MTR [14] 7B 49.1 36.5 35.7 69.7 48.1 57.3 47.2
Rewatch-R1\(\dagger\) [15] 7B 50.9 45.3 37.8 79.1 49.8 70.5 51.6
Conan SFT 7B 49.1 35.4 34.9 64.4 43.3 66.0 50.4
Conan 7B 57.4 (\(\uparrow 10.5\)) 42.7 (\(\uparrow 12.6\)) 44.6 (\(\uparrow 16.1\)) 81.0 (\(\uparrow 14.6\)) 51.0 (\(\uparrow 4.5\)) 72.8 (\(\uparrow 11.0\)) 52.3 (\(\uparrow 4.1\))

5 Experiment↩︎

5.1 Evaluation Setups↩︎

Implementation Details 1) Training Settings. We adopt Qwen2.5-VL-7B-Instruct [1] as the base model. During the multi-stage cold start, the model is trained for up to one epoch per stage with a global batch size of \(32\). The trained model is then used for the AIR RLVR phase, also trained for one epoch under the same batch configuration. The maximum completion length is set to \(4,000\) tokens, with a generation temperature of \(1.0\) and a group size of \(8\) rollouts per sample. Each input video contains \(16\) initial frames, and the model is allowed to retrieve up to \(8\) additional frames per reasoning step. 2) Evaluation Settings. The generation temperature is fixed at \(1.0\), and each sample is evaluated three times. The maximum number of new tokens is set to \(4,000\) when reasoning traces are included, and \(128\) for direct answering. The number of reasoning rounds is limited to three to maintain retrieval efficiency. Videos are standardized to \(16\) frames for multi-step reasoning benchmarks and \(32\) frames for long-video understanding benchmarks during evaluation, at a resolution of \(448 \times 28 \times 28\).
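For reference, the reported settings can be collected into a single configuration sketch; the key names are ours, and the rollout group size reflects our reading of the text rather than a released config file.

```python
# Hedged summary of the reported training/evaluation settings.
TRAIN_CFG = {
    "base_model": "Qwen2.5-VL-7B-Instruct",
    "global_batch_size": 32,
    "epochs_per_stage": 1,            # each cold-start stage and the AIR RLVR phase
    "max_completion_tokens": 4000,
    "temperature": 1.0,
    "rollouts_per_sample": 8,         # interpreted as the GRPO group size G
    "initial_frames": 16,
    "retrieved_frames_per_step": 8,
}
EVAL_CFG = {
    "temperature": 1.0,
    "runs_per_sample": 3,
    "max_new_tokens": {"with_reasoning": 4000, "direct_answer": 128},
    "max_rounds": 3,
    "frames": {"multi_step_reasoning": 16, "long_video": 32},
    "resolution": "448 * 28 * 28 pixels",
}
```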

Multi-step Reasoning Benchmarks. We conduct comprehensive evaluation on six challenging multi-step reasoning benchmarks: MMR-V [21], Video-Holmes [8], VRBench [22], VCRBench [23] (multi-choice subset), LongVideoReason [19], and Human-P&C [24]. Accuracy is used as the primary evaluation metric.

Compared Methods. We compare Conan with a set of closed-source MLLMs and open-source models, including non-reasoning models, text-CoT models, and video-CoT models.

5.2 Main Results↩︎

The evaluation results on the six multi-step reasoning benchmarks are shown in Table 1, from which we draw the following key observations and analyses.

Overall Analysis. Conan substantially surpasses its base model Qwen2.5-VL-7B-Instruct across all benchmarks, with average accuracy gains exceeding 10%. Remarkably, Conan also outperforms the advanced GPT-4o on most benchmarks, underscoring its superior capabilities of multi-step, evidence-grounded reasoning. Furthermore, two advanced text-CoT models (Video-R1 and VideoChat-R1) perform notably worse than Conan, demonstrating the effectiveness of the identification–reasoning–action mechanism in grounding reasoning on accurate visual evidence. In addition, compared with concurrent Video-CoT approaches (Video-MTR and Rewatch-R1), Conan exhibits stronger multi-hop reasoning abilities, highlighting the superiority of accurate evidence identification.

Table 2: Evaluation results on long-video understanding.
LongVideoBench MLVU LVBench VideoMME
Qwen2.5-VL-7B-Instruct [1] 48.9 52.8 34.4 55.8
Video-R1 [11] 55.6 62.5 38.3 58.6
VideoChat-R1 [13] 54.3 60.5 38.0 56.9
Rewatch-R1 [15] 50.5 55.2 37.2 58.9
Video-MTR [14] 56.4 59.7 38.6 59.3
FrameThinker [16] 52.9 59.1 36.6 -
Conan 56.6 (\(\uparrow 7.7\)) 63.4 (\(\uparrow 10.6\)) 39.2 (\(\uparrow 4.8\)) 60.5 (\(\uparrow 4.7\))
Table 3: Ablation results of Conan, where the best results are in boldface.
Overall MMR-V Video-Holmes VRBench VCRBench\(\ast\) LongVideoReason Human-P&C
w-binary scale 53.8 39.7 39.1 75.4 46.9 70.2 51.3
w/o-data sampling 55.2 40.0 40.8 78.9 48.4 72.3 50.7
w/o-textual reasoning 57.0 42.2 43.9 81.4 49.8 73.1 51.7
w/o-multimodal alignment reasoning 56.4 43.5 44.2 80.4 48.2 72.5 49.3
w/o-vision-centric reasoning 53.0 39.0 36.5 75.2 50.1 68.5 48.7
w-direct RLVR 51.0 39.1 36.8 73.3 46.3 62.3 47.9
w/o-evidence reward 53.8 39.9 40.0 74.8 46.1 71.7 50.1
w/o-retrieval reward 54.0 38.2 42.4 75.7 47.8 69.2 50.5
w-text CoT 55.2 41.1 41.0 76.8 48.8 70.5 52.8
Conan 57.4 42.7 44.6 81.0 51.0 72.8 52.3
Figure 2: Training dynamics in the AIR RLVR process of Conan.

Long video understanding. Beyond multi-step reasoning, Conan exhibits strong generalization to long-video understanding tasks. As presented in Table 2, Conan consistently outperforms Qwen2.5-VL-7B-Instruct across LongVideoBench [17], MLVU [18], LVBench [25], and Video-MME [26], achieving state-of-the-art performance compared with both text-CoT and video-CoT models. These results indicate that the high-quality, multi-scale evidence reasoning data and progressive training strategy not only enhance stepwise reasoning but also effectively boost long-video understanding.

5.3 Ablation Study↩︎

To investigate the contribution of each component in our framework, we conduct a comprehensive ablation study with multiple Conan variants.

For the Conan-91k dataset, we design two variants: 1) w-binary scale, which removes the contextual frame type in multi-scale frame identification by merging it into the irrelevant category, reducing frame labels to only evidence and irrelevant types; and 2) w/o-data sampling, which omits the evidence difficulty-aware sampling process and applies uniform random sampling instead.

For the multi-stage progressive cold-start strategy, we develop four variants: 1) w/o-textual reasoning, which excludes the textual reasoning stage; 2) w/o-multimodal alignment reasoning, which skips the multimodal alignment reasoning stage; 3) w/o-vision-centric reasoning, which removes the vision-centric reasoning stage; and 4) w-direct RLVR, which bypasses the three cold-start stages and directly employs AIR RLVR to train the model on the Conan-RLVR-31k dataset.

For the AIR RLVR framework, we design three variants: 1) w/o-evidence reward, which discards the identification reward \(R_{ide}\); 2) w/o-retrieval reward, which removes the retrieval reward \(R_{ret}\); and 3) w-text CoT, which enforces a single-round text-CoT paradigm, performing pure textual reasoning and answering without additional actions during both training and evaluation.

Figure 3: A qualitative example from VRBench showing the reasoning traces of Video-R1 (Text CoT), Video-MTR (Video CoT), and Conan for comparison.

The ablation results in Table 3 reveal several key findings: a) Dataset design. w-binary scale underperforms Conan, confirming the benefit of multi-scale frame identification in providing richer contextual cues. Moreover, Conan outperforms w/o-data sampling, which validates that evidence difficulty-aware sampling effectively guides the model to progressively acquire multi-step reasoning abilities. b) Progressive cold-start. Conan consistently surpasses w/o-textual reasoning, w/o-multimodal alignment reasoning, and w/o-vision-centric reasoning across most multi-step reasoning benchmarks, and shows substantial gains over w-direct RLVR. These results demonstrate that the multi-stage progressive cold-start strategy is crucial for gradually activating the model’s multi-hop reasoning capabilities. c) AIR RLVR. Conan outperforms w/o-evidence reward and w/o-retrieval reward, proving the effectiveness of the identification and retrieval rewards in enhancing accurate evidence localization and efficient frame retrieval, respectively.

5.4 Training Dynamics↩︎

To gain a deeper understanding of how the model’s behavior evolves during end-to-end reinforcement learning, we perform a fine-grained analysis of its training dynamics. As shown in Figure 2, the training process can be divided into two stages:

Stage I: Accuracy-Oriented Evidence Exploration. Building upon the solid foundation established through multi-stage cold-start training, the model initially enters a phase characterized by frequent yet progressively more accurate frame retrieval. During this stage, it actively queries additional clips to maximize the stepwise identification–retrieval–outcome reward across tasks. The increasing retrieval precision reflects an early learning strategy where the model compensates for incomplete internal reasoning by broadly exploring the visual context to identify useful evidence. This stage marks a transitional period in which the model learns to recognize the importance of accurate evidence localization before optimizing retrieval efficiency.

Stage II: Efficient Evidence Retrieval. As training progresses, the model transitions to a more refined and selective retrieval policy. It significantly reduces retrieval frequency while maintaining high reward accuracy, indicating that Conan has internalized a compact and efficient multi-step reasoning strategy, retrieving evidence only when necessary, much like a detective who strategically gathers key clues rather than exhaustively examining all information.

5.5 Qualitative Evaluation↩︎

Figure 3 illustrates a qualitative comparison on VRBench [22] between Text CoT (Video-R1), Video CoT (Video-MTR), and Conan. The text CoT model Video-R1 performs single-round textual reasoning without visual grounding, leading to a hallucinated answer based solely on linguistic priors. The video CoT model Video-MTR incorporates frame retrieval but fails to localize relevant evidence, resulting in weak reasoning alignment between retrieved frames and the question. In contrast, Conan performs multi-round, evidence-grounded reasoning. In Round 1, it detects the absence of causal evidence and performs random frame sampling to broaden the search scope. In Round 2, guided by contextual cues, it executes specific frame retrieval around key timestamps where player interactions occur. In Round 3, Conan identifies frames depicting color-triggered game events, such as team-based activities and flag captures, and integrates these observations to infer that “different colored shirts trigger different game events or interactions.” This progressive process, from exploration to targeted verification to confident deduction, demonstrates Conan’s superior ability to accurately locate, reason over, and act upon relevant visual evidence compared with both Text CoT and Video CoT models.

6 Conclusion and Future work↩︎

In this work, we present Conan, a unified framework that empowers multimodal large language models to perform Conan-like visual reasoning through multi-scale frame identification, evidence-based reasoning, and confident action decisions. Employing the Conan-91k dataset, constructed via multi-scale evidence categorization, Conan-style reasoning trace construction, and evidence difficulty-aware sampling, we devise a multi-stage progressive cold-start strategy alongside the joint Identification–Reasoning–Action (AIR) RLVR framework to progressively cultivate robust multi-step reasoning abilities. Extensive experiments across six multi-step reasoning and four long-video understanding benchmarks demonstrate that Conan consistently outperforms the base model Qwen2.5-VL-7B-Instruct, achieving state-of-the-art accuracy and strong generalization over both text-CoT and video-CoT models. In future work, we plan to extend Conan toward chain-of-frame reasoning, enabling dynamic frame generation during reasoning to provide visual evidence beyond the video for solving more complex video reasoning tasks.

References↩︎

[1]
S. Bai et al., “Qwen2.5-VL technical report,” arXiv preprint arXiv:2502.13923, 2025.
[2]
K. Team et al., “Kimi-vl technical report,” arXiv preprint arXiv:2504.07491, 2025.
[3]
Xiaomi LLM-Core Team, “MiMo-VL technical report,” arXiv preprint arXiv:2506.03569, 2025.
[4]
A. Hurst et al., “Gpt-4o system card,” arXiv preprint arXiv:2410.21276, 2024.
[5]
S. Antol et al., “VQA: Visual question answering,” in Proceedings of the IEEE International Conference on Computer Vision, 2015, pp. 2425–2433.
[6]
J. Gao, C. Sun, Z. Yang, and R. Nevatia, “TALL: Temporal activity localization via language query,” in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5267–5275.
[7]
S. Venugopalan, H. Xu, J. Donahue, M. Rohrbach, R. Mooney, and K. Saenko, “Translating videos to natural language using deep recurrent neural networks,” in Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2015, pp. 1494–1504.
[8]
J. Cheng, Y. Ge, T. Wang, Y. Ge, J. Liao, and Y. Shan, “Video-Holmes: Can MLLM think like Holmes for complex video reasoning?” arXiv preprint arXiv:2505.21374, 2025.
[9]
Y. Liu et al., “VideoReasonBench: Can MLLMs perform vision-centric complex video reasoning?” arXiv preprint arXiv:2505.23359, 2025.
[10]
D. Guo et al., “Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning,” arXiv preprint arXiv:2501.12948, 2025.
[11]
K. Feng et al., “Video-r1: Reinforcing video reasoning in mllms,” arXiv preprint arXiv:2503.21776, 2025.
[12]
K. Ouyang et al., “SpaceR: Reinforcing MLLMs in video spatial reasoning,” arXiv preprint arXiv:2504.01805, 2025.
[13]
X. Li et al., “VideoChat-R1: Enhancing spatio-temporal perception via reinforcement fine-tuning,” arXiv preprint arXiv:2504.06958, 2025.
[14]
Y. Xie, T. Chen, Z. Ge, and L. Ni, “Video-MTR: Reinforced multi-turn reasoning for long video understanding,” arXiv preprint arXiv:2508.20478, 2025.
[15]
C. Zhang et al., “ReWatch-R1: Boosting complex video reasoning in large vision-language models through agentic data synthesis,” arXiv preprint arXiv:2509.23652, 2025.
[16]
Z. He, X. Qu, Y. Li, S. Huang, D. Liu, and Y. Cheng, “Framethinker: Learning to think with long videos via multi-turn frame spotlighting,” arXiv preprint arXiv:2509.24304, 2025.
[17]
H. Wu, D. Li, B. Chen, and J. Li, “Longvideobench: A benchmark for long-context interleaved video-language understanding,” Advances in Neural Information Processing Systems, vol. 37, pp. 28828–28857, 2024.
[18]
J. Zhou et al., “Mlvu: Benchmarking multi-task long video understanding,” 2025, pp. 13691–13701.
[19]
Y. Chen et al., “Scaling RL to long videos,” 2025.
[20]
L. Yao et al., “Generative frame sampler for long video understanding,” 2025, pp. 17900–17917.
[21]
K. Zhu et al., “MMR-v: What’s left unsaid? A benchmark for multimodal deep reasoning in videos,” arXiv preprint arXiv:2506.04141, 2025.
[22]
J. Yu et al., “VRBench: A benchmark for multi-step reasoning in long narrative videos,” arXiv preprint arXiv:2506.10857, 2025.
[23]
Y. Qi et al., “Vcr-bench: A comprehensive evaluation framework for video chain-of-thought reasoning,” arXiv preprint arXiv:2504.07956, 2025.
[24]
K. Li et al., “HumanPCR: Probing MLLM capabilities in diverse human-centric scenes,” arXiv preprint arXiv:2508.13692, 2025.
[25]
W. Wang et al., “Lvbench: An extreme long video understanding benchmark,” arXiv preprint arXiv:2406.08035, 2024.
[26]
C. Fu et al., “Video-mme: The first-ever comprehensive evaluation benchmark of multi-modal llms in video analysis,” arXiv preprint arXiv:2405.21075, 2024.
[27]
Y. Liu et al., “TempCompass: Do video LLMs really understand videos?” 2024, pp. 8731–8772.
[28]
Y. Li, X. Chen, B. Hu, L. Wang, H. Shi, and M. Zhang, “Videovista: A versatile benchmark for video understanding and reasoning,” arXiv preprint arXiv:2406.11303, 2024.
[29]
K. Team et al., “Kimi k2: Open agentic intelligence,” arXiv preprint arXiv:2507.20534, 2025.
[30]
C.-Y. Lin, “ROUGE: A package for automatic evaluation of summaries,” in Text Summarization Branches Out, 2004, pp. 74–81.
[31]
B. Li et al., “Llava-onevision: Easy visual task transfer,” arXiv preprint arXiv:2408.03326, 2024.
[32]
Z. Chen et al., “Expanding performance boundaries of open-source multimodal models with model, data, and test-time scaling,” arXiv preprint arXiv:2412.05271, 2024.

  1. Corresponding Author↩︎