May 24, 2025
Recent advances in large language models (LLMs) have introduced latent reasoning as a promising alternative to autoregressive reasoning. By performing internal computation with hidden states from previous steps, latent reasoning benefits from more informative features than those obtained by sampling a discrete chain-of-thought (CoT) path. Yet latent reasoning approaches are often incompatible with LLMs, as their continuous paradigm conflicts with the discrete nature of autoregressive generation. Moreover, these methods rely on CoT traces for training and thus fail to exploit the inherent reasoning patterns of LLMs. In this work, we explore latent reasoning by leveraging the intrinsic capabilities of LLMs via reinforcement learning (RL). To this end, we introduce hybrid reasoning policy optimization (HRPO), an RL-based hybrid latent reasoning approach that (1) integrates prior hidden states into sampled tokens with a learnable gating mechanism, and (2) initializes training with predominantly token embeddings while progressively incorporating more hidden features. This design maintains LLMs’ generative capabilities and incentivizes hybrid reasoning using both discrete and continuous representations. In addition, HRPO introduces stochasticity into latent reasoning via token sampling, thereby enabling RL-based optimization without requiring CoT trajectories. Extensive evaluations across diverse benchmarks show that HRPO outperforms prior methods on both knowledge- and reasoning-intensive tasks. Furthermore, HRPO-trained LLMs remain interpretable and exhibit intriguing behaviors such as cross-lingual patterns and shorter completion lengths, highlighting the potential of our RL-based approach and offering insights for future work in latent reasoning.
Latent reasoning has emerged as a compelling alternative to traditional autoregressive reasoning in large language models (LLMs) [1]–[3]. In contrast to conventional chain-of-thought (CoT) [4]–[6], which relies on a discrete decoding and sampling process, latent reasoning enables LLMs to reason internally with continuous hidden representations from previous steps. For instance, Coconut [7] achieves latent reasoning by utilizing the model’s last hidden state as a ‘continuous thought’, feeding it back as the input embedding for the next reasoning step, thereby matching the performance of CoT on reasoning-intensive tasks. To illustrate the difference between autoregressive generation and latent reasoning, we compare both approaches in 1.
Nevertheless, existing latent reasoning methods rely on extensive CoT traces for training; that is, CoT trajectories are required to learn informative latent representations. An example is CODI [2], which adopts self-distillation to train on discrete CoT tokens and transfers the learnt features into continuous thoughts. Although recurrent latent reasoning removes the need for CoT data, it relies on training a multi-block LLM from scratch to reason internally [1]. Moreover, these methods employ tailored training paradigms for latent representation learning, incurring high training costs and overlooking the inherent reasoning capabilities of LLMs [1], [7], [8]. For example, Coconut [7] requires multi-stage training on CoT steps, which not only increases training compute but also delays the model’s acquisition of complete reasoning chains [2]. Furthermore, we find that latent reasoning is often incompatible with LLMs due to the discrepancy between output hidden states and input embeddings (as we show in 4.3). That is, feeding hidden states into the next decoding step degrades generation quality (e.g., repetition, incoherence), making it difficult to adapt LLMs for latent reasoning. Therefore, an ideal latent reasoning method should capitalize on pretrained LLMs’ generalizability by seamlessly integrating continuous representations, preserving LLMs’ interpretability while avoiding extensive CoT-dependent training for broader applicability.
To this end, we introduce hybrid reasoning policy optimization (HRPO), a novel hybrid latent reasoning optimization framework based on reinforcement learning (RL). HRPO unifies policy learning with latent reasoning, thereby utilizing the LLMs’ intrinsic reasoning patterns without relying on CoT trajectories. To preserve generative capabilities while encouraging the model to reason in the continuous space, HRPO introduces a gating mechanism that gradually incorporates hidden state representations from previous steps into sampled token embeddings. The gate is initially configured so that the inputs come predominantly from the sampled tokens. As training progresses, it learns to incorporate richer, more informative features from previous hidden states for improved internal reasoning. Since the sampling operation introduces stochasticity, HRPO rollouts can be performed like standard RL methods, with hybrid outputs (tokens and latent representations) stored in the rollout buffer for policy updates. For optimization, HRPO leverages a simple outcome-based reward and employs the hybrid rollout buffer to calculate log probabilities, enabling policy gradient updates that adaptively integrate both token-level and latent representations. By bridging discrete and continuous reasoning, HRPO provides a scalable and training-efficient solution that unlocks latent reasoning in existing LLMs. As a result, HRPO enhances the adaptability of latent reasoning and leads to superior performance on both knowledge- and reasoning-intensive tasks. We highlight our contributions in the following:
We introduce HRPO, the first reinforcement learning-based approach for hybrid reasoning, empowering LLMs to autonomously develop latent reasoning capabilities.
We design a gating mechanism to preserve LLMs’ generative abilities, which starts by prioritizing sampled token embeddings and, through RL-driven updates, progressively incorporates the continuous representations.
By leveraging the LLMs’ inherent reasoning patterns through HRPO, we mitigate the need for chain-of-thought annotations and expensive multi-stage training, offering an efficient and scalable alternative to existing latent reasoning methods.
To show the efficacy of the proposed hybrid latent reasoning, we evaluate HRPO on multiple knowledge and reasoning benchmarks and show that it outperforms existing models and latent reasoning baselines, with consistent performance gains across diverse scenarios. In addition, we provide insights into RL-based training of latent reasoning models and present intriguing reasoning patterns emerging from HRPO.
Early research in latent reasoning focuses on analyzing the latent space computation within transformer models [9], [10]. For example, [9] study multi-hop reasoning and show that ‘back-patch’
features from later layers can improve performance on challenging queries. Alternatively, latent representations can be used to construct informative features as in-context demonstrations to enhance few-shot performance at test-time [11], [12]. In particular, [11] exploit latent skills to select in-context examples for reasoning-intensive tasks. Different from this line of work, hidden reasoning is also proposed to improve generative capabilities by incorporating latent variables
into language modeling [1], [13]. For instance, [1] propose a depth-recurrence language model that injects latent variables and iteratively processes them to derive the final states used
for decoding. Similarly, special tokens (e.g. <pause>) are inserted to allocate extra test-time compute for internal reasoning, leading to improvements across diverse scenarios [14], [15]. [15] argue that filler tokens act as intermediate reasoning steps in multi-token computations, yielding measurable performance gains on parallelizable problems. Furthermore, implicit reasoning methods transform explicit,
token-level reasoning trajectories into internal reasoning to enhance efficiency or accuracy [16], [17]. For instance, CODI [2] employs a self-distillation framework to align
explicit and implicit reasoning tokens for improved performance. Concurrent to our work, hidden reasoning approaches [7], [8], [18] leverage previous output hidden states as next input embeddings, enabling compact yet informative
internal reasoning. Nonetheless, the majority of existing methods require processed traces and extensive training. In contrast, we focus on hybrid latent reasoning through reinforcement learning to exploit the inherent generation capabilities of LLMs.
Reinforcement learning (RL) is a paradigm in which an agent interacts with an environment, receives feedback, and learns to make decisions that maximize cumulative rewards over time [19]. Recently, RL has been introduced to improve language models by learning from human feedback (RLHF) [20]. Such fine-tuning typically employs policy gradient algorithms and their variants like REINFORCE [21]. To reduce variance, actor-critic methods like A2C [22] compute a learnt baseline and leverage advantage estimates for better training dynamics. Similarly, proximal policy optimization (PPO) [23] introduces a clipped surrogate objective to bound policy updates, thereby achieving training stability and robustness to hyperparameter choices. Parallel to these approaches, direct preference optimization (DPO) [24] directly optimizes language models using pairwise human preference comparisons. Simpler variants of DPO such as SimPO [25] further remove the need for a reference model. Despite DPO’s efficiency, online RL methods remain preferred for their consistently superior performance [26]. Recently, REINFORCE leave-one-out (RLOO) [27] proposes a REINFORCE-style method that generates multiple responses and uses the mean reward of the other responses as a baseline. Similarly, group relative policy optimization (GRPO) [28] and REINFORCE++ [29] compute baselines from group-level or batch-level reward scores across candidate completions, thus reducing memory overhead while maintaining accuracy and stability for complex tasks. In this work, we design a novel online RL-driven approach to incentivize hybrid latent reasoning by progressively incorporating hidden states into LLM inputs, thereby providing richer representations for improved reasoning performance.
We first describe our notation and settings for hybrid latent reasoning. For input query \(x = [x_1, x_2, \ldots, x_t]\) and its corresponding token embeddings \(E = [e_1, e_2, \ldots,
e_t]\), we describe the raw hidden states from the LLM output at step \(t\) with \(\hat{h}_t\), namely: \[\hat{H} = [\hat{h}_1,\hat{h}_2,\ldots,\hat{h}_t] =
\mathtt{Transformer}(E),\] in which Transformer denotes the transformer model (i.e., the decoder layers) and \(\hat{H}\) represents the final-layer hidden states produced by the transformer. With
the LM head (Head), the next output token \(\hat{x}_{t+1}\) can be sampled from the output distribution over the vocabulary via: \[\hat{x}_{t+1} \sim
\texttt{softmax}(\mathtt{Head}(\hat{h}_t)).\] However, hidden states often lie outside the model’s token embedding manifold, which degrades generation quality when fed directly. To avoid this, we project \(\hat{h}_t\) back into the embedding space to ensure the inputs conform to the model’s learned distribution. Specifically, we use the output probabilities \(p_{t+1}\) to compute a weighted interpolation over the vocabulary: \[h_{t+1} = p_{t+1} W_{e}, \qquad p_{t+1} = \texttt{softmax}\left(\mathtt{Head}(\hat{h}_t)/\tau\right),\tag{1}\] in which \(\tau\) is the temperature and \(W_{e}\) denotes the embedding matrix of the LLM. In other words, we compute the next input
embedding as a weighted sum of all token embeddings, with weights given by \(p_{t+1}\). In addition, \(p_{t+1}\) is normalized to preserve the scale and variance of the output vector. This
sampling-free mapping ensures differentiability and aligns the projected embedding with the model’s native input space, thus leading to improved training dynamics (see 4.3).
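As a concrete illustration, the projection in 1 can be sketched in a few lines of PyTorch. This is a minimal sketch under stated assumptions: the function and argument names are illustrative, and the normalization of \(p_{t+1}\) (whose exact form is not spelled out above) is shown as a simple L2 normalization.

```python
import torch
import torch.nn.functional as F

def interpolated_embedding(h_hat, lm_head, W_e, tau=0.5):
    """Project a final-layer hidden state back into the input-embedding space.

    h_hat:   (d,) last-layer hidden state at step t
    lm_head: module mapping d -> |V| logits (the LM head)
    W_e:     (|V|, d) input embedding matrix of the LLM
    Returns: (d,) weighted sum of token embeddings as in Eq. (1)
    """
    logits = lm_head(h_hat)              # (|V|,)
    p = F.softmax(logits / tau, dim=-1)  # temperature-scaled distribution over the vocabulary
    p = p / (p.norm() + 1e-8)            # normalization to preserve scale (assumed L2 here)
    return p @ W_e                       # interpolation over all token embeddings
```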
While interpolated embeddings preserve semantic continuity, directly feeding \(h_{t+1}\) as the next token input removes stochasticity and injects noise from irrelevant tokens, causing degraded generation within RL rollouts. We therefore design a hybrid approach to latent reasoning that gradually incorporates hidden state representations into the sampled token embeddings via a gating mechanism. Drawing on gated recurrence models [30], [31], we formulate the gating mechanism as: \[\begin{align} r_t & = \sigma (W_a \hat{e}_{t+1} + b_a), \\ i_t & = \sigma (W_x \hat{e}_{t+1} + b_x), \\ a_t &= \texttt{exp}(-c \cdot \texttt{softplus}(\Lambda) \odot r_t), \\ e_{t+1} & = \left\{ \begin{array}{lc} a_t \odot \hat{e}_{t+1} + \sqrt{1-a_t^2} \odot (i_t \odot h_{t+1}) & t \in \texttt{think}, \\ \hat{e}_{t+1} & t \not\in \texttt{think}, \\ \end{array} \right. \label{eq:thinking_residual} \end{align}\tag{2}\] where \(e_{t+1}\) is the resulting hybrid input for the next step, \(\hat{e}_{t+1}\) denotes the embedding of the sampled discrete token \(\hat{x}_{t+1}\), and \(h_{t+1}\) is the projected hidden state as in 1 . The gates \(r_t\) and \(i_t\) use the sigmoid function \(\sigma\) to control the blending, \(a_t\) scales \(\hat{e}_{t+1}\), \(c\) is a fixed scaling constant, and \(\Lambda\) is a learnable vector. Note that hybrid reasoning applies only during the reasoning phase (i.e., \(t \in \texttt{think}\)), while the final answer is still generated via standard autoregressive decoding, as we show in 2 (left). By initializing \(a_t \rightarrow 1\) (see 6), the inputs initially draw predominantly from the sampled token embeddings, effectively preserving the LLM’s generative capabilities. As training progresses, \(a_t\) converges to an optimal range and thus incorporates informative features from both hidden representations and sampled tokens.
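To make the gate concrete, below is a minimal PyTorch sketch of 2 ; the module layout, parameter names and the default initialization range are illustrative assumptions rather than the exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HybridGate(nn.Module):
    """Sketch of Eq. (2): blends the sampled token embedding with the projected hidden state."""

    def __init__(self, dim, c=8.0, r_min=0.98, r_max=0.999):
        super().__init__()
        self.W_a = nn.Linear(dim, dim)   # produces r_t
        self.W_x = nn.Linear(dim, dim)   # produces i_t
        self.c = c
        # Initialize Lambda so that exp(-c * softplus(Lambda)) ~ U[r_min, r_max],
        # i.e. a_t starts close to 1 and the gate passes through mostly token embeddings.
        a0 = torch.empty(dim).uniform_(r_min, r_max)
        s0 = -torch.log(a0) / c                                  # target value of softplus(Lambda)
        self.Lambda = nn.Parameter(torch.log(torch.expm1(s0)))   # inverse softplus

    def forward(self, e_hat, h, in_think=True):
        if not in_think:                   # outside the reasoning span:
            return e_hat                   # plain autoregressive input
        r = torch.sigmoid(self.W_a(e_hat))                        # r_t
        i = torch.sigmoid(self.W_x(e_hat))                        # i_t
        a = torch.exp(-self.c * F.softplus(self.Lambda) * r)      # a_t
        return a * e_hat + torch.sqrt(1.0 - a ** 2) * (i * h)     # e_{t+1}
```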
Overall, our hybrid reasoning approach projects hidden states into the embedding space via weighted interpolation. Moreover, the sampling steps preserve stochasticity for effective reinforcement learning. We employ a plug-and-play gating mechanism that initially prioritizes sampled token embeddings while gradually integrating latent signals, providing richer inputs for subsequent reasoning.
Rather than relying on strong supervision, we optimize the policy model via hybrid rollouts using reinforcement learning (RL), fully harnessing LLMs’ native reasoning capabilities. Inspired by recent RL advances such as group relative policy optimization (GRPO) [28], we introduce hybrid reasoning policy optimization (HRPO), an efficient RL-driven framework that enables LLMs to fuse discrete tokens with continuous representations for hybrid reasoning.
As illustrated in 2 (right), the proposed HRPO optimizes the policy (parameterized by \(\theta\)) to maximize the expected reward over inputs \(x\) drawn from dataset \(\mathcal{D}\) and the sampled hybrid outputs \(y\) (discrete tokens) and \(H\) (hidden representations): \[\max_{\theta} \mathbb{E}_{x \sim \mathcal{D},\, (y, H) \sim \pi_{\theta}(\cdot|x)} [r(a, y)],\] where \(r\) is a simple outcome-based reward function and \(a\) denotes the ground truth answer (i.e., \(r\) returns 1 when \(y\) contains the correct prediction and 0 otherwise). The rewards are computed solely on the discrete tokens within the answer span. To obtain an unbiased, low-variance advantage for hybrid latent reasoning, we generate \(g\) hybrid rollouts per input query and compute the advantages by standardizing the rewards within the group (i.e., for the \(i\)-th response, the advantage is calculated as \(\hat{A}_i = \frac{r_i - \texttt{mean}([r_1, r_2, \ldots, r_g])}{\texttt{std}([r_1, r_2, \ldots, r_g])}\)). Consequently, the policy gradients can be estimated with: \[\begin{align} \nabla_{\theta} \mathcal{J}_{\mathrm{{HRPO}}}(\theta) &= \mathbb{E}_{x \sim \mathcal{D}, \{ (y_i, H_i) \}_{i=1}^g \sim \pi_{\theta}(\cdot|x)} \\ & \left[ \frac{1}{g} \sum_{i=1}^{g} \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \nabla_{\theta} \log \pi_{\theta}(y_{i, t} | x, y_{i,<t}, H_{i,<t}) \hat{A}_{i, t} \right] -\beta \nabla_{\theta} \mathbb{D}_{K L}[\pi_\theta \| \pi_{\mathrm{ref}}], \label{eq:hrpo} \end{align}\tag{3}\] where \(\pi_{\mathrm{ref}}\) denotes the reference model and the KL divergence acts as a regularizer controlled by the hyperparameter \(\beta\). This objective follows a simple REINFORCE-style formulation, fusing discrete token inputs with continuous hidden representations across the reasoning span via the introduced gating mechanism. Hybrid trajectories that yield higher returns receive larger advantage estimates, encouraging policy updates that increase the log probabilities of their subsequent reasoning tokens. For the KL divergence term, we compute log probabilities using solely token IDs for \(\pi_{\mathrm{ref}}\), as we find this more effective in preserving training stability. Different from the PPO / GRPO objectives, we omit the likelihood ratio and directly use raw log probabilities in 3 , because ratio clipping is rarely triggered under our conservative learning schedule. Furthermore, since the hidden representations are directly tied to the parameters \(\theta\), each trajectory should only be used for a single gradient update; reusing it, even with importance sampling, violates the on-policy constraint. As such, our HRPO implementation remains lightweight and strictly on-policy, and can be seamlessly combined with further RL optimizations.
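For illustration, the group-relative advantages and the resulting REINFORCE-style loss can be sketched as follows; tensor shapes and function names are assumptions for this sketch, and the KL regularizer against \(\pi_{\mathrm{ref}}\) is omitted.

```python
import torch

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize outcome rewards within each group of g rollouts per query.

    rewards: (num_queries, g) tensor of 0/1 outcome rewards.
    Returns advantages of the same shape: (r_i - mean) / std within each group.
    """
    mean = rewards.mean(dim=-1, keepdim=True)
    std = rewards.std(dim=-1, keepdim=True)
    return (rewards - mean) / (std + eps)

def hrpo_pg_loss(log_probs, mask, advantages):
    """Policy-gradient term of Eq. (3) for a flat batch of hybrid rollouts.

    log_probs, mask: (num_rollouts, seq_len), where log-probabilities are computed
    from the hybrid inputs (sampled tokens gated with hidden states).
    advantages: (num_rollouts,), broadcast across the tokens of each rollout.
    """
    per_token = -log_probs * advantages.unsqueeze(-1) * mask
    return (per_token.sum(dim=-1) / mask.sum(dim=-1).clamp_min(1)).mean()
```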
In summary, the proposed HRPO framework unifies hybrid latent reasoning under a simple RL objective that fully leverages LLMs’ intrinsic reasoning capabilities. During rollouts, the decoding process progressively fuses discrete and continuous representations through a learnable gate, preserving coherence while exploiting hidden states. For policy updates, HRPO derives advantages directly from outcome rewards and performs policy gradient steps with KL regularization. As a result, HRPO incentivizes LLMs to dynamically integrate sampled tokens with latent representations, delivering stable and efficient on-policy hybrid reasoning training without a separate value function.
We evaluate HRPO on both knowledge- and reasoning-intensive tasks: (1) open-domain and multi-hop knowledge-intensive question answering (Knowledge); and (2) science, technology, engineering, and mathematics (STEM) benchmarks. The experimental results are reported as follows.
| Method | NQ | TriviaQA | HotpotQA | 2WikiMQA | Bamboogle | Average |
|---|---|---|---|---|---|---|
| Qwen2.5-7B-Instruct | ||||||
| QA | 0.134 | 0.408 | 0.183 | 0.250 | 0.120 | 0.219 |
| CoT | 0.048 | 0.185 | 0.092 | 0.111 | 0.232 | 0.134 |
| IRCoT | 0.224 | 0.478 | 0.133 | 0.149 | 0.224 | 0.242 |
| Search-o1 | 0.151 | 0.443 | 0.187 | 0.176 | 0.296 | 0.251 |
| RAG | 0.349 | 0.585 | 0.299 | 0.235 | 0.208 | 0.335 |
| Qwen2.5-1.5B-Instruct | ||||||
| SFT | 0.094 | 0.193 | 0.129 | 0.210 | 0.024 | 0.130 |
| RAG | 0.288 | 0.477 | 0.228 | 0.203 | 0.072 | 0.254 |
| PPO | 0.327 | 0.527 | 0.256 | 0.242 | 0.184 | 0.307 |
| GRPO | 0.293 | 0.480 | 0.202 | 0.213 | 0.120 | 0.261 |
| HRPO(Ours) | 0.364 | 0.553 | 0.273 | 0.276 | 0.216 | 0.337 |
| Qwen2.5-3B-Instruct | ||||||
| SFT | 0.249 | 0.292 | 0.186 | 0.248 | 0.112 | 0.217 |
| RAG | 0.348 | 0.544 | 0.255 | 0.226 | 0.080 | 0.291 |
| PPO | 0.356 | 0.563 | 0.304 | 0.293 | 0.240 | 0.351 |
| GRPO | 0.381 | 0.570 | 0.308 | 0.303 | 0.272 | 0.367 |
| HRPO(Ours) | 0.378 | 0.593 | 0.316 | 0.318 | 0.296 | 0.380 |
We first evaluate HRPO on five open-domain and multi-hop question answering (QA) datasets: Natural Questions (NQ), TriviaQA, HotpotQA, 2WikiMultiHopQA (2WikiMQA) and Bamboogle [32]–[36]. For each query, we use the E5 embedding model [37] to retrieve the top-3 Wikipedia documents as context (details presented in 6). Following [38], we merge the NQ and HotpotQA training sets to train the HRPO models, and evaluate them on each dataset’s evaluation split. The exact match results of HRPO and the baselines (supervised fine-tuning (SFT), retrieval-augmented generation (RAG) [39], and the RL-based PPO [23] and GRPO [28]) for the 1.5B and 3B Qwen2.5 Instruct models [40] are presented in 1. We also include comparisons to several QA and RAG baselines using the larger Qwen2.5-7B-Instruct as the backbone: direct inference (QA), chain-of-thought (CoT) [4], interleaving retrieval with CoT (IRCoT) [41], Search-o1 [42] and RAG [39]. For each block in 1, we mark the best performance in bold for clarity.
Across all knowledge benchmarks, HRPO delivers the strongest exact match (EM) scores with the smaller Qwen models and rivals the much larger 7B baselines. In particular, we observe: (1) HRPO reaches 0.380 average EM with Qwen2.5-3B, outperforming the strongest 7B RAG baseline by 4.5%. Similarly, HRPO with the smaller 1.5B backbone scores an average of 0.337, achieving consistent gains and surpassing PPO by 3.0%. (2) HRPO consistently outperforms other RL-based methods: with the 1.5B and 3B backbones it surpasses the strongest RL baseline by 3.0% and 1.3% respectively, and the only dataset where performance is comparable is NQ. (3) Interestingly, GRPO underperforms PPO by 4.6% on the 1.5B backbone but outperforms it by 1.6% on the 3B model, likely a consequence of sparser rewards and the limited number of sampled trajectories with a smaller model. (4) RL-based methods perform on par with the best-performing RAG baseline, with HRPO delivering the largest gains, particularly on terse, incomplete queries (NQ) and multi-hop questions (2WikiMQA), while yielding modest improvements on one-hop datasets like TriviaQA. Overall, these results demonstrate that combining retrieval augmentation with hybrid latent reasoning yields state-of-the-art knowledge performance under computation constraints, establishing HRPO as a competitive alternative to both RL-based baselines and larger retrieval-augmented LLMs.
| Model / Method | GSM8k | MATH | MATH500 | MMLU-ST | ARC-C | Average |
|---|---|---|---|---|---|---|
| Larger LLMs (Size \(\geq\) 7B) | ||||||
| DeepSeekMath-7B | 0.642 | 0.362 | 0.346 | 0.565 | 0.678 | 0.519 |
| Gemma-2-9B | 0.707 | 0.377 | 0.364 | 0.651 | 0.682 | 0.556 |
| Qwen2.5-7B | 0.854 | 0.498 | 0.464 | 0.723 | 0.637 | 0.635 |
| MAmmoTH2-7B | 0.684 | 0.367 | 0.396 | 0.624 | 0.817 | 0.578 |
| MAmmoTH2-8B | 0.704 | 0.358 | 0.732 | 0.642 | 0.822 | 0.652 |
| Qwen2.5-1.5B-Instruct | ||||||
| SFT | 0.560 | 0.300 | 0.302 | 0.403 | 0.602 | 0.433 |
| Distilled CoT | 0.706 | 0.503 | - | - | - | - |
| PPO | 0.694 | 0.507 | 0.518 | 0.566 | 0.715 | 0.600 |
| GRPO | 0.711 | 0.502 | 0.524 | 0.562 | 0.737 | 0.607 |
| HRPO(Ours) | 0.720 | 0.518 | 0.536 | 0.569 | 0.742 | 0.617 |
| Qwen2.5-3B-Instruct | ||||||
| SFT | 0.670 | 0.348 | 0.360 | 0.454 | 0.474 | 0.461 |
| Distilled CoT | 0.799 | 0.575 | - | - | - | - |
| PPO | 0.819 | 0.597 | 0.604 | 0.582 | 0.811 | 0.682 |
| GRPO | 0.834 | 0.602 | 0.604 | 0.601 | 0.814 | 0.691 |
| HRPO(Ours) | 0.845 | 0.613 | 0.630 | 0.590 | 0.820 | 0.700 |
We also evaluate the performance of the proposed HRPO on the reasoning-intensive STEM datasets: GSM8k, MATH, MATH500, MMLU-STEM (MMLU-ST) and ARC-Challenge (ARC-C) [43]–[47]. 2 reports the performance of HRPO alongside fine-tuned baselines (SFT, SFT with distilled CoT from QwQ [48]) and RL baselines (PPO [23] and GRPO [28]) on the Qwen 2.5 1.5B and 3B Instruct models [40]. In addition, we select several larger LLMs (\(\geq\) 7B in size) using few-shot CoT for comparison [28], [40], [49]. For GSM8k, we train on the training split, and for MATH and MATH500, we train on the MATH training split. For MMLU-ST and ARC-C, we train on the merged auxiliary MMLU and ARC-C training sets. Distilled CoT is only available for GSM8k and MATH due to dataset size constraints. We also highlight the best scores in each block in bold.
Across the five STEM benchmarks, HRPO delivers the strongest results with compact Qwen backbones and can match the performance of much larger LLMs. Our key observations are: (1) SFT underperforms distilled CoT and RL-based methods, suggesting the efficacy of RL with verifiable rewards on reasoning-intensive tasks. (2) With the 3B backbone, HRPO achieves an average accuracy of 0.700, matching the best 7B baseline on four of the datasets. Even the 1.5B HRPO averages 0.617, outperforming the 7B leader on MATH by 2.0%. (3) At 1.5B, HRPO improves on the strongest alternative, GRPO, with notable boosts on MATH and MATH500 (1.6% and 1.2%), whereas the average gain narrows at 3B, implying that HRPO is more beneficial for smaller models. (4) HRPO registers the highest accuracies recorded for sub-7B models on MATH (0.613) and MATH500 (0.630), demonstrating the value of RL-based hybrid reasoning on challenging benchmarks. Taken together, these results show that hybrid latent reasoning unlocks capabilities of much larger LLMs in compact backbones, proving the effectiveness of the proposed HRPO.
Different Strategies for Latent Reasoning. We compare different strategies to compute latent representations. Specifically, we use three methods to integrate hidden states into RL and train the 1.5B Qwen model on the MATH dataset. These variants are: (1) hidden states, which use the final layer hidden states as the next input; (2) interpolation, which employs interpolated embeddings
as defined in 1 ; and (3) HRPO, our hybrid latent reasoning in 2 . We visualize the exponential moving average (EMA) of rewards along with the GRPO baseline in 3. Due to the mismatch between hidden states and embeddings, using hidden states directly degrades generation and yields nonsensical rollouts with zero reward. Although interpolation performs similarly to HRPO for the first few hundred steps, the rewards eventually collapse and only slowly recover, likely because interpolation introduces excessive noise. We also provide a direct comparison between HRPO and latent reasoning methods in 7. Overall, our approach achieves superior training dynamics with faster convergence while maintaining stability comparable to GRPO, highlighting the efficacy of the hybrid design choices in HRPO.
Ratio of Latent Representations. We track how the balance between discrete tokens and continuous latent representations shifts as the LLM learns to reason hybridly. Here, we train Qwen 1.5B on the knowledge task and visualize both the mean hidden ratios (i.e., \(\sqrt{1-a_t^2}\)) and completion lengths (along with GRPO) in 4. Across all runs, the hidden ratio increases steadily, even as the learning rate tapers off toward the end of training under a cosine schedule. In addition, completion lengths increase during the initial phase and later decline across all methods, with the drop most significant in HRPO. Furthermore, setting \(r_{\mathrm{min}} = 0.95\) leads to an interesting behavior where completion lengths decrease substantially, an effect not seen in the other variants. This may be because the hidden representations effectively capture historical context, thereby shortening completions while maintaining or even improving performance (see 3). As such, hybrid latent reasoning could be particularly effective when leveraging contextual information for reasoning.
| Init Range | Knowledge | | | | | |
|---|---|---|---|---|---|---|
| | NQ | TriviaQA | HotpotQA | 2WikiMQA | Bamboogle | Average |
| | 0.364 | 0.553 | 0.273 | 0.264 | 0.184 | 0.328 |
| | 0.336 | 0.553 | 0.263 | 0.276 | 0.216 | 0.329 |
| | 0.336 | 0.534 | 0.258 | 0.275 | 0.216 | 0.324 |
| Init Range | STEM | | | | | |
| | GSM8k | MATH | MATH500 | MMLU-ST | ARC-C | Average |
| | 0.705 | 0.516 | 0.536 | 0.569 | 0.735 | 0.612 |
| | 0.703 | 0.509 | 0.532 | 0.563 | 0.732 | 0.608 |
| | 0.720 | 0.518 | 0.526 | 0.567 | 0.742 | 0.614 |
Initialization of \(\Lambda\) for Gating. Beyond the hidden ratio, we examine how the initialization of \(\Lambda\), which controls the balance between latent features and token embeddings, affects HRPO performance. Specifically, we initialize \(\texttt{exp}(-c \cdot \texttt{softplus}(\Lambda))\) from \([r_{\mathrm{min}}, 0.999]\) and report the results on Qwen 1.5B in 3, where lowering \(r_{\mathrm{min}}\) yields a higher initial hidden ratio. For the knowledge domain, performance improves as \(r_{\mathrm{min}}\) decreases: the best average performance occurs at \(r_{\mathrm{min}}=0.98\), and most individual datasets peak at \(r_{\mathrm{min}}=0.95\). In contrast, the STEM benchmarks display a bimodal trend: performance rises when \(r_{\mathrm{min}}\) is either lower or higher, but drops in the intermediate range \([0.98, 0.999]\). This pattern implies that the model profits from emphasizing either explicit token trajectories or latent representations, whereas a mid-level mix is sub-optimal. In summary, our results show that knowledge tasks benefit from a lower \(r_{\mathrm{min}}\), whereas optimal performance on STEM tasks arises from leaning toward either explicit token trajectories or latent representations.
Sensitivity of \(\tau\) on Hybrid Reasoning. We further investigate the impact of the temperature \(\tau\) on HRPO: lower \(\tau\) values reduce noise but overemphasize top tokens, whereas larger \(\tau\) spreads probability mass across more tokens. We explore \(\tau \in \{ 0.3, 0.5, 0.7, 0.9 \}\) and present the rewards and completion lengths of the 1.5B Qwen model on MMLU in 5. The left panel indicates that \(\tau = 0.3\) and \(\tau = 0.5\) converge faster and reach the highest reward plateau, outperforming higher values (\(\tau \geq 0.7\)) and showing the benefits of a smaller \(\tau\). Interestingly, the right panel reveals that both smaller and larger \(\tau\) values shorten completion lengths, while \(\tau = 0.5\) and \(\tau = 0.7\) maintain longer generations. This may be because a lower \(\tau\) sharpens the token distribution, yielding a confident latent vector that lets HRPO finish quickly, whereas a higher \(\tau\) flattens the distribution and enhances informativeness, prompting the policy to extract answers in shorter rollouts. Overall, we find HRPO to be robust across varying \(\tau\) selections, with only the completion length varying noticeably. Further analysis is in 7.
Hybrid Latent Reasoning Patterns. Finally, we highlight several intriguing reasoning patterns that emerge from HRPO. First, the hybrid outputs remain readable when the sampled tokens are decoded, even without any CoT supervision. Second, HRPO exhibits cross-lingual patterns in some completions, fluidly integrating tokens from different languages, suggesting that latent representations can generalize across linguistic boundaries (see 6). Moreover, the hybrid reasoning process often delivers compact yet accurate responses to simple or factual queries, where the model requires fewer decoding steps thanks to the richer context encoded in the hidden representations. These emergent patterns indicate that hybrid latent reasoning can improve both interpretability and efficiency over existing latent reasoning approaches. Further qualitative examples can be found in 8.
In this work, we propose hybrid reasoning policy optimization (HRPO), a novel reinforcement learning (RL) framework that unifies discrete token sampling with continuous latent representations through a learnable gating mechanism. By gradually incorporating hidden features into sampled token embeddings, HRPO incentivizes LLMs to refine their reasoning strategies hybridly. Extensive evaluations on knowledge and STEM benchmarks demonstrate that HRPO outperforms both SFT and RL baselines, achieving consistent gains across diverse scenarios. Moreover, our analysis reveals that HRPO not only ensures stable hybrid latent reasoning but also triggers intriguing reasoning patterns, showing its potential in reasoning-intensive settings and providing insights for RL-based continuous-space learning. While promising, we recognize that HRPO introduces additional computational overhead, that its on-policy design may reduce large-scale training efficiency, and that its continuous representations can be less transparent. Future work will aim to address these limitations by exploring simpler designs, off-policy extensions and advanced latent reasoning techniques to improve both the interpretability and efficiency of HRPO.
For hybrid latent reasoning, our plug-and-play component is by design compatible with any LLM architecture. We initialize its linear layers with a uniform distribution over \([-1/\sqrt{|H|}, 1/\sqrt{|H|}]\), where \(|H|\) denotes the hidden state dimension. The gating parameter \(\Lambda\) is selected such that the quantity \(\texttt{exp}(-c \cdot \texttt{softplus}(\Lambda))\) is drawn uniformly from \([r_\mathrm{min}, 0.999]\), with the scaling constant fixed at \(c=8\) [30]. Tuning \(r_\mathrm{min}\) adjusts the initial fraction of hidden states involved in hybrid reasoning; a larger value increases the proportion of sampled token embeddings and can help enhance generation quality during the initial training phase. Similarly, the temperature hyperparameter \(\tau\) in 1 can be tuned for optimal task performance, although HRPO remains robust across a wide range of \(\tau\) values. To efficiently train the LLMs with HRPO, we patch the models with optimized kernel implementations and employ low-rank adaptation (LoRA) [50]. The default hyperparameter choices for the HRPO experiments are reported in 4.
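For concreteness, the initialization above can be inverted to obtain \(\Lambda\) directly from a target gate value \(a \sim \mathcal{U}[r_\mathrm{min}, 0.999]\): \[\Lambda = \mathrm{softplus}^{-1}\!\left(-\frac{\ln a}{c}\right) = \ln\!\left(e^{-\ln(a)/c} - 1\right).\] For example, with \(r_\mathrm{min} = 0.95\) and \(c = 8\), the lower end of the range gives \(\Lambda = \ln\!\left(e^{0.0064} - 1\right) \approx -5.05\), so the gate initially passes through almost exclusively the sampled token embeddings.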
| Algorithm | HRPO |
| Epochs | 1 |
| Optimizer | AdamW 8bit |
| Optimizer Momentum | \(\beta_1\), \(\beta_2\) = 0.9, 0.99 |
| Weight Decay | 0.1 |
| Learning Rate | 5e-6 |
| Learning Rate (Linear in 2 ) | 1e-4 |
| Learning Rate (\(\Lambda\) in 2 ) | 1e-3 |
| HRPO\(\beta\) | 0.005 |
| Max Gradient Norm | 0.1 |
| Gradient Accumulation Step | 4 |
| Group size \(g\) in HRPO | 4 / 8 |
| Total Train Batch Size | 32 / 64 |
| LR Scheduler | Cosine with Warmup |
| Warmup Ratio | 0.1 |
| Precision (WA) | BF16-mixed |
| LoRA Modules | query, key, value, dense |
| LoRA Rank | 32 |
| LoRA \(\alpha\) | 64 |
The hyperparameters are selected empirically to balance efficiency and performance, and thanks to HRPO’s lightweight design and additional optimizations, our framework can run on a single GPU across diverse tasks. Additionally, we apply a larger weight-decay coefficient to (1) enhance HRPO training stability and (2) encourage the gating towards incorporating more latent representations (since smaller positive \(\Lambda\) values increase the hidden ratio \(\sqrt{1-a_t^2}\)). For simpler knowledge tasks and GSM8k, we fix the HRPO group size at 4, which already delivers strong performance. For more challenging benchmarks, namely MATH, MATH500, MMLU-ST and ARC-C, we instead generate 8 hybrid completions for each query. As for prompt and completion lengths, we select them empirically based on our observations, and the selected values are summarized in 5.
| Prompt / Completion Length for Knowledge Tasks | 2048 / 512 |
| Prompt / Completion Length for GSM8k | 512 / 512 |
| Prompt / Completion Length for MATH & MATH500 | 512 / 1024 |
| Prompt / Completion Length for MMLU-ST & ARC-C | 512 / 512 |
For both training and evaluation, we build each prompt by prepending a system message that directs the LLM to perform step-by-step internal reasoning before generating its final answer. The user query is then appended, and the entire input is formatted
with the model chat template. Different from prior work [6], [38], we
adopt the minimalist delimiter #### to separate the model’s hybrid reasoning span from its final answer. This is because the delimiter tokenizes as a single unit, adding no length overhead while providing a clear signal to switch from hybrid
latent reasoning to autoregressive answer generation. We also penalize repeated occurrences of the delimiter within the completion (by assigning 0 reward regardless of answer correctness) to prevent the model from terminating hybrid reasoning early. We illustrate the full prompts for the different types of tasks, showing the system message and example queries, in 7, 8 and 9, respectively.
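A sketch of this outcome-based reward with the delimiter convention is shown below; the answer normalization and matching are simplified to a stripped string comparison for illustration.

```python
def outcome_reward(completion: str, gold_answer: str) -> float:
    """Outcome reward with the '####' delimiter convention (simplified sketch).

    Exactly one '####' must separate the hybrid reasoning span from the answer;
    a missing or repeated delimiter yields zero reward regardless of correctness.
    """
    parts = completion.split("####")
    if len(parts) != 2:
        return 0.0
    predicted = parts[1].strip()
    return 1.0 if predicted == gold_answer.strip() else 0.0
```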
Figure 7: Example prompt for knowledge tasks; contexts are partially omitted due to space constraints.
Figure 8: Example prompt for GSM8k / MATH / MATH500 in HRPO.
Figure 9: Example prompt for MMLU-ST / ARC-C in HRPO.
For each question in our knowledge-intensive QA setup, we embed the query with the E5 embedding model [37]. The entire English Wikipedia 2020 dump is pre-encoded with the same model, after which we perform approximate nearest neighbor (ANN) search and select the three highest-scoring documents. These top-3 passages are concatenated to form the external context fed to the LLM, as illustrated in 7. In our evaluation, we generate tokens using greedy decoding and compute latent representations according to 1 , thereby ensuring the reproducibility of our results. For the outcome-based reward and evaluation settings on knowledge tasks, we report exact match scores on the val / test splits following [38], [51], [52]. For mathematical (GSM8k, MATH and MATH500) and multiple-choice datasets (MMLU-ST and ARC-C), we follow [49] for post-processing and scoring.
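For reference, the retrieval step can be sketched as follows; the E5 checkpoint name, index file and passage store are illustrative assumptions, and FAISS is shown as one possible ANN backend.

```python
import faiss
from sentence_transformers import SentenceTransformer

# Assumed artifacts: a FAISS index over the pre-encoded Wikipedia 2020 passages
# and the aligned passage texts; the exact E5 checkpoint is an assumption.
encoder = SentenceTransformer("intfloat/e5-base-v2")
index = faiss.read_index("wiki2020_e5.index")
passages = [...]  # list of passage strings aligned with the index rows

def retrieve_context(query: str, k: int = 3) -> str:
    """Return the top-k passages concatenated as the external context."""
    q = encoder.encode([f"query: {query}"], normalize_embeddings=True)
    _, ids = index.search(q, k)   # approximate nearest neighbor search
    return "\n\n".join(passages[i] for i in ids[0])
```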
Comparison to Latent Reasoning Methods. In addition to strong RL methods such as PPO and GRPO in our main experiments, we also benchmark the proposed HRPO against additional latent reasoning baselines. Specifically, we evaluate HRPO, Coconut and CODI on the GSM8k and MATH reasoning datasets, all using the 1.5B Qwen backbone. For Coconut, we train with its augmented CoT data (no MATH split is available), whereas for CODI we adopt the original datasets’ CoT trajectories. The results are reported in 6. We observe: (1) HRPO achieves the best accuracy on both datasets, with relative gains of 9.42% and 23.63% over the best-performing latent reasoning baseline CODI. (2) Even compared to CoT distilled from the significantly larger QwQ model, HRPO still scores consistent improvements on both datasets, showing the effectiveness of our hybrid latent reasoning. (3) Coconut lags behind on GSM8k, indicating the limitations of latent reasoning by compressing CoT tokens, whereas CODI improves substantially with CoT SFT but still trails Distilled CoT and HRPO. Overall, HRPO achieves the best performance against all baselines, demonstrating its consistent advantages over CoT distillation and prior latent reasoning methods.
| | Coconut | | CODI | | Distilled CoT | | HRPO | |
|---|---|---|---|---|---|---|---|---|
| | GSM8k | MATH | GSM8k | MATH | GSM8k | MATH | GSM8k | MATH |
| Accuracy | 0.315 | - | 0.658 | 0.419 | 0.706 | 0.503 | 0.720 | 0.518 |
| Init Range | Knowledge | | | | | |
|---|---|---|---|---|---|---|
| | NQ | TriviaQA | HotpotQA | 2WikiMQA | Bamboogle | Average |
| | 0.367 | 0.593 | 0.316 | 0.311 | 0.296 | 0.377 |
| | 0.378 | 0.588 | 0.311 | 0.298 | 0.296 | 0.374 |
| | 0.375 | 0.584 | 0.309 | 0.318 | 0.288 | 0.375 |
| Init Range | STEM | | | | | |
| | GSM8k | MATH | MATH500 | MMLU-ST | ARC-C | Average |
| | 0.845 | 0.613 | 0.622 | 0.576 | 0.820 | 0.695 |
| | 0.842 | 0.600 | 0.614 | 0.585 | 0.813 | 0.691 |
| | 0.838 | 0.606 | 0.630 | 0.590 | 0.817 | 0.696 |
Sensitivity Analysis for \(\Lambda\) and \(\tau\). In addition to the results reported in 3, we further present the performance of various \(\Lambda\) initializations on the Qwen 3B model, as shown in 7. Our observations echo the same trends on the 1.5B backbone: a smaller initial \(r_{\mathrm{min}}\) consistently benefits both knowledge and STEM tasks. Moreover, performance peaks when \(r_{\mathrm{min}}\) is selected either lower or higher, and drops slightly within the intermediate range of \([0.98, 0.999]\). We also examine the sensitivity of the \(\tau\) hyperparameter used to construct the interpolated embeddings and present the corresponding results for both backbone models in 8. The training rewards and completion lengths for GSM8k, MATH and the knowledge tasks are shown in 10, 11 and 12. We note that choosing \(\tau\) in the range of 0.5 – 0.7 offers a reliable balance of efficiency and accuracy, as the performance often peaks around this interval for both backbone models. Overall, we find that HRPO benefits from a smaller initial \(r_{\mathrm{min}}\), which outperforms larger \(r_{\mathrm{min}}\) settings and highlights the value of latent representations for complex reasoning. Moreover, HRPO is robust to the choice of \(\tau\), where the performance scores remain stable with only minor fluctuations at the extremes.
| Model | Qwen2.5-1.5B | | | | Qwen2.5-3B | | | |
|---|---|---|---|---|---|---|---|---|
| \(\tau\) | 0.3 | 0.5 | 0.7 | 0.9 | 0.3 | 0.5 | 0.7 | 0.9 |
| GSM8k | 0.717 | 0.720 | 0.705 | 0.694 | 0.842 | 0.841 | 0.845 | 0.833 |
| MATH | 0.518 | 0.516 | 0.507 | 0.514 | 0.597 | 0.606 | 0.613 | 0.599 |
| MATH500 | 0.522 | 0.536 | 0.532 | 0.524 | 0.622 | 0.614 | 0.622 | 0.630 |
| MMLUST | 0.561 | 0.569 | 0.559 | 0.567 | 0.577 | 0.590 | 0.574 | 0.580 |
| ARC-C | 0.735 | 0.741 | 0.742 | 0.724 | 0.820 | 0.817 | 0.809 | 0.808 |
| NQ | 0.320 | 0.336 | 0.317 | 0.364 | 0.378 | 0.375 | 0.373 | 0.363 |
| TQ | 0.524 | 0.534 | 0.553 | 0.553 | 0.588 | 0.593 | 0.578 | 0.578 |
| HotpotQA | 0.263 | 0.260 | 0.252 | 0.273 | 0.311 | 0.316 | 0.309 | 0.306 |
| 2Wiki | 0.276 | 0.272 | 0.264 | 0.244 | 0.318 | 0.311 | 0.297 | 0.293 |
| Bamboogle | 0.216 | 0.216 | 0.216 | 0.176 | 0.296 | 0.288 | 0.296 | 0.280 |
Additional Analysis for \(\Lambda\) Initialization. We further provide an expanded analysis of how varying \(r_{\mathrm{min}}\) in the initialization of \(\Lambda\) affects training dynamics with the larger Qwen 3B backbone. Figures 13, 14, 15 and 16 plot the reward and completion length curves for the knowledge tasks, GSM8k, MATH and MMLU-ST / ARC-C respectively. Overall, our findings here echo the observations in 4.3: different \(r_{\mathrm{min}}\) values exhibit similarly high training stability and preserve the LLM’s generative capabilities, but selecting a smaller \(r_{\mathrm{min}}\) (i.e., a larger initial hidden ratio) generally accelerates convergence and shortens generated completions. Nevertheless, these benefits are less pronounced for the 3B backbone than for the 1.5B counterpart, which we attribute to the fewer update steps and trainable parameters in HRPO. In summary, our analysis shows that HRPO preserves stable training dynamics and effectively leverages LLMs’ intrinsic reasoning patterns across \(r_{\mathrm{min}}\) values; moreover, choosing a smaller \(r_{\mathrm{min}}\) further enhances convergence and yields shorter generated sequences, which can be especially beneficial for smaller-scale LLMs.
Statistical Significance Analysis on the Improvements of HRPO. In our main experiments, we follow the standard practice of using greedy decoding for pass@1 evaluation, ensuring our results are easy to evaluate and reproducible. To evaluate the significance of the performance gains of HRPO, we conduct additional sampling-based evaluations on the STEM tasks, which exhibit greater variance compared to greedy decoding. Averaged results are presented in 9, with statistically significant outcomes (paired t-test, \(p < 0.05\)) highlighted in bold. These results show that HRPO consistently outperforms PPO and GRPO across both backbones on all benchmark datasets. For the 1.5B backbone, t-tests confirm these gains are statistically significant in three out of five tasks. The improvements are even more pronounced with the 3B model, which achieves an average gain of +1.4% and shows statistical significance in four out of five comparisons. These findings demonstrate that our hybrid-RL framework, HRPO, not only delivers reliable performance gains over established baselines but also does so with high statistical confidence across the majority of STEM tasks.
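The significance check itself is a standard paired t-test over matched evaluation runs; a minimal sketch (not our exact evaluation script) is shown below.

```python
from scipy import stats

def significant_gain(hrpo_scores, baseline_scores, alpha=0.05):
    """Paired t-test over matched evaluation runs (e.g., sampling seeds per dataset).

    Returns True when HRPO's mean improvement over the baseline is statistically
    significant at level alpha, mirroring the p < 0.05 criterion used above.
    """
    t_stat, p_value = stats.ttest_rel(hrpo_scores, baseline_scores)
    return (t_stat > 0) and (p_value < alpha)
```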
| Qwen2.5-1.5B | |||||
|---|---|---|---|---|---|
| Method | GSM8k | MATH | MATH500 | MMLU-ST | ARC-C |
| PPO | 0.701 | 0.505 | 0.511 | 0.551 | 0.716 |
| GRPO | 0.710 | 0.510 | 0.512 | 0.554 | 0.722 |
| HRPO | 0.712 | 0.515 | 0.517 | 0.565 | 0.731 |
| Qwen2.5-3B | |||||
| Method | GSM8k | MATH | MATH500 | MMLU-ST | ARC-C |
| PPO | 0.825 | 0.597 | 0.600 | 0.574 | 0.802 |
| GRPO | 0.827 | 0.595 | 0.599 | 0.577 | 0.808 |
| HRPO | 0.838 | 0.606 | 0.609 | 0.585 | 0.815 |
To further highlight HRPO’s reasoning patterns, we present additional qualitative examples. Each example provides the reasoning trace obtained by decoding the sampled tokens from the hybrid reasoning process, and we include both successful and erroneous cases across different tasks. The correct examples are provided in 17, 18, 19, 20 and 21, whereas the mistakes are provided in 22, 23, 24, 25 and 26. We show the raw strings and omit the options / contexts in the examples due to space constraints.
From these examples, we identify four reasoning patterns that can lead to correct answers: (1) Purely English reasoning with coherent trajectories (Figs. 17 and 18), a pattern commonly observed in LLM reasoning outputs. (2) Predominantly English reasoning punctuated by rare tokens (e.g., %n rather than \(\backslash\)n), as shown in 19. (3) Cross-lingual reasoning that interweaves multiple languages (English and Chinese in 20). (4) Reasoning with many uncommon tokens and atypical steps that still arrives at the correct answer (21). The latter three patterns are rarely observed in standard reasoning LLMs but are more prevalent in HRPO-trained models, demonstrating that HRPO can enhance reasoning by leveraging LLMs’ intrinsic generative capabilities across different languages and token types, thereby delivering improvements across diverse scenarios.
As for reasoning errors, we also identify several common patterns: (1) Cross-lingual mistakes arising from limited parametric or contextual knowledge, as in 22 and 23. (2) Correct answers that violate the predefined format and thus receive a zero score (24). (3) Repetitive loops that continue until the response hits the maximum completion length (25). (4) Cross-lingual reasoning that is nonetheless truncated by the length limit (26). Overall, these patterns indicate that, while HRPO effectively integrates discrete and latent representations in its internal reasoning process, it may be further enhanced through refined output formatting (e.g., potentially with a format reward), extended optimization schedules with conservative learning, increased model parameters, and longer context / generation capabilities, pointing to promising directions for future research.