October 09, 2025
Reinforcement fine-tuning (RFT) is essential for enhancing the reasoning capabilities of large language models (LLMs), yet the widely adopted Group Relative Policy Optimization (GRPO) suffers from entropy collapse, where entropy monotonically decreases, exploration vanishes, and policies converge prematurely. Existing entropy-regularized methods only partially alleviate this issue while introducing bias and instability, leaving entropy control unresolved and the connection between entropy, exploration, and performance unclear. We propose Arbitrary Entropy Policy Optimization (AEPO), which eliminates entropy collapse by replacing entropy bonuses with a REINFORCE policy gradient on temperature-adjusted distributions and stabilizing entropy through temperature regulation. AEPO integrates three key designs: policy gradient as regularization, distribution as regularization, and REINFORCE as regularization, enabling precise entropy control without distorting optimization. Experiments demonstrate three major contributions: AEPO (1) stabilizes entropy at arbitrary target levels, effectively eliminating collapse in GRPO; (2) reveals a non-monotonic relation in which performance first improves and then declines as entropy increases, clarifying the link between entropy, exploration, and reasoning; and (3) generalizes beyond entropy control, providing a broader RFT paradigm in which superior target distributions can serve as REINFORCE regularizers. Our code is available at https://github.com/597358816/AEPO.
Reinforcement fine-tuning (RFT) has become a cornerstone for enhancing the reasoning capabilities of large language models (LLMs) [1]–[6]. Among existing approaches, Group Relative Policy Optimization (GRPO) has gained wide adoption due to its efficiency and scalability [7]–[9]. However, GRPO suffers from a well-documented drawback, entropy collapse: as training progresses, policy entropy declines monotonically, sampled outputs converge to nearly identical solutions, and the model prematurely adopts a deterministic policy with limited exploration [10]–[12]. This severely restricts the ability of RFT to discover diverse reasoning strategies. Existing remedies typically rely on entropy regularization or clipping refinements, which inevitably introduce bias and instability into the optimization objective, creating a trade-off between exploration and stability [13]–[16]. EFRame is the first method to achieve large-scale entropy control, but its analysis remains qualitative and does not study entropy quantitatively [17].
Although the issue of entropy collapse in GRPO has been repeatedly noted, it remains unresolved: there is still no principled algorithm capable of precisely regulating entropy throughout training. Existing approaches often treat entropy only as a side indicator—for instance, using it to split steps or guide auxiliary heuristics—while the policy entropy itself continues to collapse as training progresses [15], [18]–[20]. As a result, the relationship among entropy, exploration, and performance remains unclear: it is uncertain whether entropy is a sufficient proxy for exploration or whether exploration itself consistently improves training outcomes. If significant performance improvement can be observed by adjusting entropy to a better range, it would indicate that exploration plays a crucial role in this process. If arbitrary entropy control can be achieved during RFT, it would make it possible to realize exploration at any desired degree and, in turn, establish a principled connection among entropy, exploration, and performance.
Motivated by these challenges, we propose Arbitrary Entropy Policy Optimization (AEPO), a novel policy gradient method that directly addresses entropy collapse. AEPO replaces conventional entropy bonuses with a REINFORCE policy gradient [21], [22] applied to samples drawn from temperature-adjusted distributions. Entropy is further stabilized through temperature regulation, ensuring that batch-level entropy oscillates around an arbitrary constant \(\mathcal{H}\) throughout training, as shown in Fig. 1. AEPO achieves entropy control through three key design components:
Policy gradient as regularization: Instead of relying on a conventional entropy bonus, we employ a full policy gradient term applied to samples that naturally exhibit high- or low-entropy properties. This design ensures that entropy never dominates the optimization objective, allowing the model to keep exploring while steadily optimizing for higher accuracy throughout training.
Distribution as regularization: Entropy is regulated via temperature-adjusted distributions. Given a predefined entropy threshold \(\mathcal{H}\), when the previous step’s entropy \(\mathcal{H}(\pi_{\theta_{\text{old}}}) < \mathcal{H}\), we regard the distribution under \(T_{\text{high}}>1\) as a better candidate and mix in a small proportion of its samples. Conversely, when \(\mathcal{H}(\pi_{\theta_{\text{old}}}) \geq \mathcal{H}\), we instead sample from the distribution under \(T_{\text{low}}<1\).
REINFORCE as regularization: In Reinforcement Learning with Verifiable Reward (RLVR), the REINFORCE algorithm can filter out negative samples in an unbiased manner, allowing positive samples to form a unidirectional gradient toward a better distribution. This guides the model to optimize from positive samples that align with the target distribution. If negative samples were included at this stage, the entropy control mechanism would fail.
In summary, our contributions are threefold:
Controllable entropy: We propose Arbitrary Entropy Policy Optimization that can stabilize entropy at arbitrary target levels, effectively eliminating entropy collapse in GRPO.
Entropy–performance relation: We find that merely adjusting entropy can directly influence training performance, providing explicit evidence for the correlation between entropy, exploration, and performance. Moreover, we observe a non-monotonic trend in which performance first increases and then decreases as entropy grows, highlighting the existence of an optimal entropy regime.
Generalizability: Beyond entropy control, AEPO provides a broader paradigm for RFT. When a target distribution \(\pi^{*}\) is identified as superior to the current policy \(\pi_{\theta}\), samples from \(\pi^{*}\) can be used to construct a REINFORCE policy gradient as a regularization term, which allows \(\pi_{\theta}\) to progressively approximate \(\pi^{*}\) during long-horizon training.
RFT has become a central paradigm for post-training large language models (LLMs), aligning them with human feedback and task-specific objectives. Early methods such as RLHF leveraged policy optimization (e.g., PPO) to encode human preferences [23]–[26], while Direct Preference Optimization (DPO) [27] later improved efficiency by optimizing policies directly from preference data. Recent models like DeepSeek-R1 [9] and Kimi-1.5 [28] extend RFT through hybrid reward formulations and scalable optimization. Among them, Group Relative Policy Optimization (GRPO) [7], [8] has become the de facto baseline for reasoning-focused RFT, yet suffers from entropy collapse that limits exploration of diverse reasoning strategies.
Entropy has long been regarded as a proxy for exploration in reinforcement optimization. Classical methods employ entropy regularization to stabilize training and encourage diversity [21], [22]. More recent studies extend this idea to GRPO by introducing entropy bonuses into rewards or advantages [14]–[16]. However, such approaches only yield coarse-grained effects: entropy still collapses as training proceeds, or the added bias destabilizes optimization. EFRame [17] represents the first framework to achieve large-scale entropy control through conditional sampling and replay, yet its treatment remains qualitative and lacks quantitative characterization of the entropy–exploration–performance relationship.
In summary, existing methods lack a principled mechanism to precisely regulate entropy throughout training. Moreover, the role of entropy in driving exploration and its connection to downstream performance has not been quantitatively established. Our work addresses this gap by proposing Arbitrary Entropy Policy Optimization (AEPO), which enables controllable entropy regulation and provides explicit evidence of a non-monotonic relationship between entropy, exploration, and reasoning performance.
Our work focuses on fine-tuning LLM using Reinforcement Learning (RL) for tasks with verifiable solutions, such as mathematical reasoning and code generation.
Suppose the LLM is a softmax policy, that is \[\begin{align} \pi_{\theta}(o_{t}|q_{t})=\frac{\text{exp}(l(q_{t},o_{t}))}{\sum_{o'_{t}}\text{exp}(l(q_{t},o'_{t}))}, \end{align}\] where \(q_{t}\) is the concatenation of query \(q\) followed by \(o_{<t}\), and \(l(q_{t},o_{t})\) is the logit of token \(o_{t}\) given input \(q_{t}\). Furthermore, given a temperature \(T\), we define: \[\begin{align} \pi_{\theta}^{T}(o_{t}|q_{t})=\frac{\text{exp}(l(q_{t},o_{t})/T)}{\sum_{o'_{t}}\text{exp}(l(q_{t},o'_{t})/T)}. \end{align}\]
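To make the temperature-adjusted policy concrete, the following minimal sketch (in PyTorch; the function names are illustrative and not taken from the AEPO codebase) computes \(\pi_{\theta}^{T}\) from a vector of logits and shows how the token-level entropy grows with \(T\):

```python
import torch
import torch.nn.functional as F

def temperature_policy(logits: torch.Tensor, T: float) -> torch.Tensor:
    """pi_theta^T(o_t | q_t): softmax over logits scaled by 1/T."""
    return F.softmax(logits / T, dim=-1)

def token_entropy(probs: torch.Tensor) -> torch.Tensor:
    """H_t = -sum_o pi(o) log pi(o) over the vocabulary dimension."""
    return -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)

# Toy logits for a vocabulary of 6 tokens.
logits = torch.tensor([2.0, 1.0, 0.5, 0.0, -1.0, -2.0])
for T in (0.8, 1.0, 1.2):
    p = temperature_policy(logits, T)
    print(f"T={T}: entropy={token_entropy(p).item():.3f}")
# Higher T flattens the distribution and raises entropy; lower T sharpens it.
```

Raising \(T\) flattens the distribution and lowering \(T\) sharpens it, which is the property AEPO later exploits to regulate entropy.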
Given a query \(q\), let \(o\) denote a response sampled from policy \(\pi_{\theta}\) for query \(q\). Given a reward function: \[R(q,o) =\boldsymbol{1}[o=o^{*}],\] where \(o^{*}\) is the reference response for query \(q\), the policy objective is: \[\begin{align} \mathcal{J}(\theta) = \mathbb{E}_{q \sim P(Q), o \sim \pi_{\theta}(O|q)}[R(q,o)]. \end{align}\] To optimize this objective, it is common practice to use a policy gradient algorithm for gradient estimation: \[\begin{align} &\nabla_{\theta}\mathcal{J}_{REINFORCE}(\theta)= \mathbb{E}_{q \sim P(Q), o \sim \pi_{\theta}(O|q)}\\ &\quad \quad \quad \quad \quad \quad \sum_{t=1}^{|o|}[\nabla_{\theta} \text{log}\pi_{\theta}(o_{t}|q,o_{<t}) \cdot R(q,o)],\\ &\nabla_{\theta}\mathcal{J}_{GRPO}(\theta)=\mathbb{E}_{q \sim P(Q), \{o_i\}_{i=1}^{G} \sim \pi_{\theta}(O|q)}\\ &\quad \quad \quad \quad \quad \quad \frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|}[\nabla_{\theta} \text{log}\pi_{\theta}(o_{i,t}|q,o_{i,<t}) \cdot \hat{A}_{i,t}], \end{align}\]
where \(R(q,o)\) is the reward for a query–response pair \((q,o)\), and \(\hat{A}_{i,t}\) denotes the estimated advantage. To reduce gradient variance, GRPO extends REINFORCE by introducing group-wise normalization: for each query \(q\), it samples \(G\) responses \(\{o_i\}_{i=1}^{G}\) and normalizes their rewards to compute the relative advantage for stable optimization.
\[\begin{align} \hat{A}_{i,t} = \frac{R(q,o_{i})-\text{mean}(\{R(q,o_{j})\}_{j=1}^{G})}{\text{std}(\{R(q,o_{j})\}_{j=1}^{G})}. \end{align}\]
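For concreteness, a minimal sketch of this group normalization follows (PyTorch; the small \(\epsilon\) added to the denominator is a common implementation safeguard and an assumption on our part, not part of the formula above):

```python
import torch

def grpo_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantage: normalize the G rewards of one query's responses.

    rewards: shape (G,), e.g. binary verifiable rewards R(q, o_i).
    Returns the group-relative advantages, broadcast to every token of o_i in practice.
    """
    mean = rewards.mean()
    std = rewards.std(unbiased=False)
    return (rewards - mean) / (std + eps)  # eps guards the all-correct / all-wrong case

rewards = torch.tensor([1.0, 0.0, 0.0, 1.0, 0.0])  # G = 5 sampled responses for one query
print(grpo_advantages(rewards))  # positive for correct responses, negative otherwise
```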
In traditional RL, it is common to add an entropy term to the objective to prevent the policy from becoming overly deterministic. Prior work has also explored various approaches in this direction for LLM training.
Entropy-Reg: Given a query \(q\), let \(o\) denote a response sampled from the policy model \(\pi_{\theta}\) for query \(q\). For each token \(o_{t}\) in response \(o\), we denote the token-level entropy as: \[\begin{align} \mathcal{H}_{t}(\pi_{\theta}):=-\mathbb{E}_{o_{t} \sim \pi_{\theta}(\cdot|q,o_{<t})}[\text{log}\pi_{\theta}(o_{t}|q,o_{<t})], \end{align}\] and the policy entropy as: \[\begin{align} \mathcal{H}(\pi_{\theta}):=\mathbb{E}_{q \sim P(Q), o \sim \pi_{\theta}(O|q)}\frac{1}{|o|}\sum_{t=1}^{|o|}\mathcal{H}_{t}(\pi_{\theta}). \end{align}\] In maximum-entropy RL, we optimize the entropy-regularized objective: \[\begin{align} \mathcal{J}_{MaxEnt}(\theta)&=\mathcal{J}(\theta)+\lambda \cdot \mathcal{H}(\pi_{\theta}) \nonumber \\ &= \mathbb{E}_{q \sim P(Q), o \sim \pi_{\theta}(O|q)}\Big[R(q,o)-\frac{\lambda}{|o|}\sum_{t=1}^{|o|}\text{log}\pi_{\theta}(o_{t}|q,o_{<t})\Big]. \end{align}\]
Entropy-Adv [15] proposed an entropy-guided advantage shaping method. The key idea is to inject an entropy-based term into the advantage function during policy optimization. They define an entropy-based advantage term \(\psi(\mathcal{H}_{t})\) and use it to shape the advantage: \[\begin{align} &\psi(\mathcal{H}_{t})=\text{min}(\beta \cdot \mathcal{H}_{t}^{detach},\frac{|\hat{A}_{i,t}|}{\kappa}),\\ &A_{i,t}^{\text{shaped}}=\hat{A}_{i,t}+\psi(\mathcal{H}_{t}), \end{align}\] where \(\beta>0\) and \(\kappa >1\). The entropy term \(\mathcal{H}_{t}^{detach}\) is detached from the computational graph during backpropagation, acting as a fixed offset to the original advantage. The policy gradient of the algorithm retains the same form as in GRPO, with the advantage \(\hat{A}_{i,t}\) replaced by the shaped one: \[\begin{align} \nabla_{\theta}\mathcal{J}^{\text{shaped}}(\theta)=&\mathbb{E}_{q \sim P(Q), \{o_i\}_{i=1}^{G} \sim \pi_{\theta}(O|q)}\\ &\frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|}\sum_{t=1}^{|o_i|}[\nabla_{\theta} \text{log}\pi_{\theta}(o_{i,t}|q,o_{i,<t}) \cdot A_{i,t}^{\text{shaped}}]. \end{align}\]
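A direct transcription of this shaping rule is given below (PyTorch; tensor shapes are illustrative, and the defaults match the baseline settings \(\beta=0.4\), \(\kappa=2\) used in our experiments):

```python
import torch

def shaped_advantage(adv: torch.Tensor, token_entropy: torch.Tensor,
                     beta: float = 0.4, kappa: float = 2.0) -> torch.Tensor:
    """Entropy-guided advantage shaping: A_shaped = A + min(beta * H_t, |A| / kappa).

    adv:           (B, L) original GRPO advantages
    token_entropy: (B, L) per-token entropies H_t (detached: a fixed offset)
    """
    psi = torch.minimum(beta * token_entropy.detach(), adv.abs() / kappa)
    return adv + psi

adv = torch.tensor([[1.2, -0.8, 0.5]])
H_t = torch.tensor([[0.9, 2.5, 0.1]])
print(shaped_advantage(adv, H_t))  # the |A|/kappa cap keeps the entropy bonus from dominating
```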
| Method | Policy Gradient |
|---|---|
| Entropy-Reg | \(\sum_{t}(\hat{A}_{t} \cdot \nabla_{\theta} \text{log}\pi_{\theta}(o_{t}|q,o_{<t}) +\lambda \cdot \nabla_{\theta}\mathcal{H}_{t})\) |
| Entropy-Adv | \(\sum_{t}A_{i,t}^{\text{shaped}} \cdot \nabla_{\theta} \text{log}\pi_{\theta}(o_{t}|q,o_{<t})\) |
Arbitrary Entropy Policy Optimization (AEPO) is designed to achieve precise and stable control of policy entropy during RFT. Unlike conventional entropy-regularized methods, AEPO does not introduce explicit entropy bonuses; instead, it adjusts the training distribution and policy gradient to regulate entropy implicitly yet controllably. This section first introduces the key premises underlying AEPO’s design, which reveal the relationship between temperature, entropy, and policy updates, and then details the algorithmic formulation of AEPO.
The design of AEPO is built upon two empirical premises that connect temperature-based sampling to entropy dynamics. These premises establish the foundation for controllable entropy modulation without introducing bias into the optimization objective.
Premise 1. Higher temperature distributions globally correspond to higher policy entropy, while lower temperature corresponds to lower entropy.
Previous studies, such as [29], show that increasing the sampling temperature generally broadens the model’s output distribution and raises its entropy. Although the correspondence between higher temperature and higher entropy is not a universally proven principle, it holds—or approximately holds—under most intuitive and commonly used model settings. In the context of RFT, temperature thus serves as a practical external variable for indirectly regulating entropy, enabling controllable entropy adjustment.
Lemma 1. [14] If the actor policy \(\pi_\theta\) is a tabular softmax policy updated via natural policy gradient [30] with step size \(\eta\). Then the change in policy entropy between two steps approximately satisfies: \[\begin{align} \mathcal{H}(\pi_\theta^{k+1}) - \mathcal{H}(\pi_\theta^k) \approx& -\eta \cdot \mathbb{E}_{s\sim d_{\mu}^{k}}\text{Cov}_{a\sim\pi^k_\theta(\cdot\mid s)}\\ &\!\bigl[\log\!\pi^k_\theta(a\mid s),\, A^{\pi^k}(s,a)\bigr]. \end{align}\]
The proof can be seen in [31] and [14]. \(\mathcal{H}\) denotes the policy entropy of the policy model, \(\text{Cov}\) denotes covariance, \(\pi^k_\theta\) is the policy at step \(k\), and \(A^{\pi^k}(s,a)\) is the advantage function of action \(a\) under state \(s\). This result indicates that entropy rises when positively-advantaged actions have low probability under the current policy, since the covariance is then negative; accordingly, if the probability of sampling such low-probability actions can be increased, the model will be optimized toward higher entropy.
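The sign argument can be checked numerically. The toy example below (our own illustration, not from the cited proofs) evaluates the covariance in Lemma 1 for a tabular policy in which the positively-advantaged actions are rare; the covariance is negative, so the lemma predicts an entropy increase after the update:

```python
import torch

# Toy tabular policy over 4 actions at one state, with its advantages.
probs = torch.tensor([0.70, 0.20, 0.07, 0.03])
adv   = torch.tensor([-0.5, -0.2,  0.4,  0.9])   # positive advantage on the rare actions

logp = probs.log()
mean_logp = (probs * logp).sum()                 # E_{a~pi}[log pi(a|s)]
mean_adv  = (probs * adv).sum()                  # E_{a~pi}[A(s,a)]
cov = (probs * (logp - mean_logp) * (adv - mean_adv)).sum()

eta = 0.1                                        # natural policy gradient step size
print(f"Cov = {cov:.3f}, predicted entropy change ~ {-eta * cov:.3f}")
# Cov < 0 here, so Lemma 1 predicts the entropy increases after the update.
```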
Premise 2. Positive samples from temperature-adjusted distributions induce predictable entropy change during training.
When positive samples are drawn from a higher-temperature version of the current policy distribution, training with these samples increases the model’s entropy, promoting exploration. Conversely, sampling from a lower-temperature distribution produces entropy reduction, guiding the model toward more deterministic behavior. Figure 3 illustrates this phenomenon: the entropy increases under high-temperature sampling and decreases under low-temperature sampling. Meanwhile, entropy increase is typically slower than entropy decrease, which aligns with intuition. This dynamic forms the empirical basis for AEPO’s temperature-controlled entropy feedback loop, enabling bidirectional regulation of entropy around a target value.
Building upon the premises introduced above, AEPO is designed to achieve stable and controllable entropy regulation throughout RFT. Its core idea is to replace explicit entropy bonuses with a REINFORCE regularization term that adjusts the sampling distribution according to the current entropy state. In this way, AEPO modulates exploration implicitly—through data distribution—rather than directly through the loss function.
Formally, AEPO augments the standard GRPO objective with an additional policy gradient term applied to samples drawn from temperature-adjusted distributions. The overall policy gradient can be expressed as:
\[\begin{align} & \nabla_{\theta}\mathcal{J}_{AEPO}(\theta) \nonumber \\ =& \underbrace{ \mathbb{E}_{q \sim P(Q), \{o_{i}\}_{i=1}^{G} \sim \pi_{\theta}(O|q)}\frac{1}{G} \sum_{i=1}^G \frac{1}{|o_i|}\sum_{t=1}^{|o_i|}[\nabla_{\theta} \text{log}\pi_{\theta}(o_{i,t}|q,o_{i,<t}) \cdot \hat{A}_{i,t}]}_{\text{GRPO-form policy gradient}} \nonumber \\ +& \underbrace{\alpha\cdot \mathbb{E}_{q \sim P(Q), o \sim \textcolor{red}{\pi_{\theta}^{T}(O|q)}}\sum_{t=1}^{|o|}[\nabla_{\theta} \text{log}\pi_{\theta}(o_{t}|q,o_{<t}) \cdot \textcolor{red}{R(q,o)}]}_{\text{REINFORCE-form policy gradient}}. \end{align}\]
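A minimal surrogate-loss sketch of this gradient follows (PyTorch; all names, shapes, and the flat token-mean normalization are our own simplifications rather than the released AEPO implementation). The first term reproduces the GRPO-form gradient on on-policy samples; the second scores temperature-adjusted samples under the unadjusted policy \(\pi_{\theta}\) and weights them by the binary reward, so negative samples contribute no gradient:

```python
import torch
import torch.nn.functional as F

def token_logprobs(logits, actions):
    """log pi_theta(o_t | q, o_<t) for the sampled token at each position."""
    return F.log_softmax(logits, dim=-1).gather(-1, actions.unsqueeze(-1)).squeeze(-1)

def aepo_loss(grpo_logits, grpo_actions, grpo_adv, grpo_mask,
              reg_logits, reg_actions, reg_rewards, reg_mask, alpha=1.0):
    """Surrogate (minimized) whose gradient matches the AEPO policy gradient above.

    grpo_*: tensors for responses sampled from pi_theta, shapes (B, L, V) / (B, L).
    reg_*:  tensors for responses sampled from the temperature-adjusted pi_theta^T,
            with sequence-level binary rewards reg_rewards of shape (B,).
    Note: a flat token-mean replaces the per-response 1/|o_i| averaging for brevity.
    """
    # GRPO-form term: advantage-weighted log-probabilities of on-policy tokens.
    grpo_logp = token_logprobs(grpo_logits, grpo_actions)
    grpo_term = (grpo_adv.detach() * grpo_logp * grpo_mask).sum() / grpo_mask.sum()

    # REINFORCE-form regularizer: positive samples (R=1) pull pi_theta toward the
    # temperature-adjusted distribution; negative samples (R=0) are filtered out.
    reg_logp = token_logprobs(reg_logits, reg_actions)
    reg_weight = reg_rewards.detach().unsqueeze(-1) * reg_mask
    reg_term = (reg_weight * reg_logp).sum() / reg_mask.sum().clamp_min(1.0)

    return -(grpo_term + alpha * reg_term)

# Toy shapes: batch of 2 responses, length 4, vocabulary of 10 tokens.
B, L, V = 2, 4, 10
loss = aepo_loss(
    torch.randn(B, L, V, requires_grad=True), torch.randint(0, V, (B, L)),
    torch.randn(B, L), torch.ones(B, L),
    torch.randn(B, L, V, requires_grad=True), torch.randint(0, V, (B, L)),
    torch.tensor([1.0, 0.0]), torch.ones(B, L),
)
loss.backward()
```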
The implementation of AEPO’s loss function is shown in Eq. 2, which consists of three key design components:
Policy gradient as regularization. AEPO replaces conventional entropy regularization with a full policy gradient term that simultaneously enables entropy control and prevents entropy from dominating the optimization process. As illustrated in Fig. 4, entropy regularization alone tends to drive the optimization toward two extremes—either entropy collapse or entropy explosion. In the former case, the regularization term is too weak to reverse the monotonic entropy decay; in the latter, entropy becomes an irreversible dominant factor in optimization. By contrast, AEPO constrains entropy fluctuation within a narrow and stable range through the policy gradient mechanism, making it remarkably robust to hyperparameters.
Distribution as regularization. AEPO regulates the optimization direction by adjusting the expected sampling distribution \(\pi^{*}\) derived from the current policy. When the observed entropy \(\mathcal{H}(\pi_{\theta_{\text{old}}})\) is below the target threshold \(\mathcal{H}\), AEPO samples from the higher-temperature distribution \(\pi^{T_{\text{high}}}_{\text{old}}\) to encourage exploration. Conversely, when \(\mathcal{H}(\pi_{\theta_{\text{old}}})\) exceeds \(\mathcal{H}\), AEPO samples from the lower-temperature distribution \(\pi^{T_{\text{low}}}_{\text{old}}\) to promote stability. This bidirectional regulation mechanism achieves fine-grained entropy control, allowing the policy to maintain equilibrium between exploration and convergence.
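This feedback rule can be stated as a one-line scheduler (a sketch; the function name and defaults are ours, while the thresholding logic follows the description above). It selects the sampling temperature for the REINFORCE regularizer from the previous step's batch entropy:

```python
def select_temperature(entropy_old: float, entropy_target: float,
                       T_high: float = 1.2, T_low: float = 0.8) -> float:
    """Bidirectional entropy feedback for the REINFORCE regularizer's sampling.

    Below the target: sample from the hotter distribution to push entropy up.
    At or above it:   sample from the cooler distribution to pull entropy down.
    """
    return T_high if entropy_old < entropy_target else T_low

# Example with target entropy H = 0.5
print(select_temperature(0.31, 0.5))  # 1.2 -> encourage exploration
print(select_temperature(0.64, 0.5))  # 0.8 -> promote convergence
```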
REINFORCE as regularization. In RLVR, the reward space is binary. This property allows REINFORCE to filter out negative samples in an unbiased manner, ensuring that the gradient is formed from positive samples that align with the desired distribution. Consequently, AEPO’s regularization term produces a one-sided optimization signal that guides the policy toward higher-quality behavior distributions.
| Benchmarks | AIME24 | AMC | College Math | GSM8K | MATH | Minerva Math | Olympiad | Average |
|---|---|---|---|---|---|---|---|---|
| Qwen2.5-math-7B | ||||||||
| GRPO | ||||||||
| Entropy-Reg | ||||||||
| Entropy-Adv | ||||||||
| AEPO \(\mathcal{H}=0.25\) | 48.5 | |||||||
| AEPO \(\mathcal{H}=0.50\) | 43.3 | 82.5 | 48.8 | 89.5 | 81.6 | 38.2 | 43.0 | 60.99 |
| AEPO \(\mathcal{H}=0.75\) | 50.0(+36.7) | 80.0(+40.0) | (+6.50) | 89.4(+24.0) | 82.0(+16.5) | (+26.5) | 42.4(+15.7) | 61.36(+23.70) |
| AEPO \(\mathcal{H}=1.00\) | 37.9 |
To validate the effectiveness of our methods, we present the experimental setup and results in the following sections. Our work is based on the EasyR1 and VeRL frameworks [32], [33], and we compare with the RL baselines GRPO [7] and its entropy-regularization variants [13], [15].
Model and Dataset: We conduct experiments to evaluate the effectiveness of AEPO in RFT for mathematical reasoning tasks. The base model is Qwen2.5-Math-7B [34], an LLM specialized for mathematical problem solving. For training, we use the DAPO-17K dataset [10], which contains diverse problem instances curated for RFT.
Benchmark: Evaluation is performed on a broad suite of mathematical reasoning benchmarks, including AIME24 [35], AMC [36], College Math [37], GSM8K [38], MATH [36], Minerva Math [39], and Olympiad [36]. These benchmarks collectively span a wide range of difficulty levels, from grade school arithmetic to advanced competition-level mathematics, and together they cover nearly all mainstream benchmarks for mathematical reasoning, enabling a comprehensive assessment of AEPO’s impact on reasoning performance across diverse tasks.
Implementation: All policy models are initialized from the base model, and experiments are conducted on 8 A800 GPUs (40GB). Unless otherwise specified, we follow the default EasyR1 settings: maximum response length of 2048, global batch size 128, rollout batch size 512, rollout group size \(G=5\), temperature 1.0, learning rate \(10^{-6}\), \(\epsilon = 0.2\), and a binary reward. For AEPO, \(T_{\text{high}} = 1.2\) and \(T_{\text{low}} = 0.8\), and we replace 60 samples in each batch with temperature-adjusted positive samples. For entropy-regularized baselines, we set \(\lambda=0.03\) for Entropy-Reg, and \(\beta=0.4\), \(\kappa=2\) for Entropy-Adv [15]. To exclude potential confounding factors, we do not apply a KL divergence penalty in any of our experiments.
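For reference, the hyperparameters listed above can be collected into a single configuration sketch (a plain Python dictionary; the keys are illustrative and do not correspond to the actual EasyR1/VeRL configuration schema):

```python
# Training hyperparameters as stated in the text; keys are illustrative only.
aepo_config = {
    "base_model": "Qwen2.5-Math-7B",
    "max_response_length": 2048,
    "global_batch_size": 128,
    "rollout_batch_size": 512,
    "rollout_group_size": 5,          # G
    "sampling_temperature": 1.0,
    "learning_rate": 1e-6,
    "clip_epsilon": 0.2,
    "reward": "binary (verifiable)",
    "T_high": 1.2,
    "T_low": 0.8,
    "replaced_positive_samples_per_batch": 60,
    "kl_penalty": None,               # KL divergence disabled in all experiments
}
```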
Table 2 summarizes the performance of AEPO compared with GRPO and entropy-based baselines across seven mathematical reasoning benchmarks. We observe that AEPO consistently outperforms all competing methods, achieving the best results on every benchmark. In particular, AEPO yields an average score of 61.36, an improvement of +3.40 points over GRPO. Notably, the gains are especially pronounced on high-difficulty datasets such as AIME24 and AMC, where exploration plays a critical role in finding correct reasoning paths.
Compared to entropy-regularized variants, AEPO also exhibits clear advantages. Both Entropy-Reg and Entropy-Adv deliver only marginal improvements over GRPO. As shown in Fig. 5, the entropy regularization term in Entropy-Reg introduces significant bias: during mid-stage training, entropy begins to surge uncontrollably, indicating that entropy gradually replaces accuracy as the dominant optimization signal, with accumulated bias distorting the learning process. Entropy-Adv, on the other hand, fails to fundamentally reverse the trend of entropy collapse, leaving the model with insufficient exploration. In contrast, AEPO achieves consistent improvements across all benchmarks, underscoring the benefits of principled entropy control over heuristic entropy bonuses.

Figure 5: Entropy trajectories of AEPO compared with GRPO, Entropy-Reg, and Entropy-Adv. While GRPO and Entropy-Adv suffer from monotonic entropy collapse and fail to maintain exploration, Entropy-Reg leads to unstable fluctuations. In contrast, AEPO stabilizes entropy around the target level, demonstrating controllable and robust entropy regulation throughout training.
| Ablation | AIME24 | AMC | College | GSM8K | MATH | Minerva Math | Olympiad | Avg | Entropy control |
|---|---|---|---|---|---|---|---|---|---|
| Sampling from the original distribution (w/o temperature adjustment) | 33.3 | 75.0 | 47.7 | 88.9 | 80.0 | 33.5 | 40.1 | 56.93 (-4.43) | entropy collapse |
| Advantage without negative-sample filtering (w/o REINFORCE) | 36.7 | 75.0 | 48.3 | 88.1 | 80.8 | 36.0 | 41.8 | 58.14 (-3.22) | entropy collapse |
To further examine the relationship between entropy and exploration, we evaluate AEPO under different entropy thresholds \(\mathcal{H}\). As shown in Table 2, AEPO consistently surpasses GRPO across all benchmarks regardless of the entropy target, with strong improvements on challenging datasets such as AIME24 and MATH. At different thresholds, AEPO maintains clear advantages, indicating that principled entropy regulation yields benefits across a wide range of entropy levels.
Figure 1 further highlights this effect. While GRPO exhibits monotonic entropy collapse, AEPO stably maintains entropy around arbitrary preset levels simply by adjusting the hyperparameter \(\mathcal{H}\); AEPO is not sensitive to the parameter \(\alpha\) as long as it is chosen within a reasonable range. This controllability allows us to probe the direct impact of entropy on exploration and downstream performance. Notably, we observe a non-monotonic trend: performance first improves as entropy rises from low to moderate levels, but begins to decline when entropy becomes excessively high. This provides explicit evidence that moderate entropy fosters effective exploration, while overly high entropy disperses optimization and harms accuracy.
Together, these results confirm that AEPO not only eliminates entropy collapse but also clarifies the link between entropy, exploration, and performance, offering a controllable pathway to balance exploration with convergence.
To assess the contribution of each component in AEPO, we design two ablation studies to verify the necessity of distribution as regularization and REINFORCE as regularization for achieving effective entropy control.
One critical component of AEPO is the use of temperature-adjusted distributions for entropy control. In our design, the REINFORCE regularization term samples from a modified distribution, where the temperature \(T\) is adaptively adjusted based on the previous step’s entropy. This adjustment ensures that positive samples carry either higher or lower entropy as required, thereby stabilizing the overall entropy around the target threshold \(\mathcal{H}\).
To validate the necessity of this design, we replace the temperature-adjusted distribution with the original distribution, as shown in the ablation table above. The results demonstrate that when REINFORCE samples are drawn directly from the original policy distribution (i.e., without temperature adjustment), entropy control collapses: the policy entropy monotonically decreases during training, similar to GRPO. More importantly, the average benchmark score drops to 56.93, which is even worse than standard GRPO and far below AEPO (61.36). This shows that the variance of REINFORCE gradients, when unregularized by distribution adjustment, further degrades optimization stability and downstream reasoning accuracy.
These findings confirm that distribution adjustment is indispensable for AEPO: it directly enables controllable entropy regulation, reduces variance, and consistently improves reasoning performance across benchmarks. Moreover, it provides strong evidence for the role of policy gradient as regularization: the REINFORCE term in AEPO influences the exploration ability of the GRPO component through entropy control, thereby shaping the overall optimization dynamics during training.
Another essential component of AEPO is the use of REINFORCE gradients as a replacement for conventional entropy bonuses. In principle, REINFORCE allows unbiased estimation of gradients from positive samples while discarding negative ones, thereby forming a unidirectional optimization signal toward better distributions. This mechanism is critical for maintaining stable entropy control: when negative samples are included, the entropy-regularizing effect contributed by positive samples is counteracted, and the policy entropy eventually collapses.
To examine the role of REINFORCE, we conduct an ablation where the regularization term is still sampled from temperature-adjusted distributions, but the advantage function \(\hat{A}_{i,t}\) is used directly without filtering negative samples. As shown in the ablation table above, this variant fails to prevent entropy collapse, leading to degraded performance across benchmarks. The average score drops to 58.14. This confirms that filtering negative samples via REINFORCE is indispensable.
In this paper, we propose Arbitrary Entropy Policy Optimization (AEPO), a principled RFT framework that addresses one of the most persistent challenges in RFT—precise and stable entropy control. Unlike traditional entropy regularization methods that trade off exploration against stability, AEPO achieves controllable entropy regulation through a unified design that integrates policy gradient, distribution, and REINFORCE as regularization components. This formulation eliminates the entropy collapse phenomenon in GRPO and maintains policy entropy within an arbitrarily specified range, enabling balanced and consistent exploration throughout training.
Extensive experiments across seven mathematical reasoning benchmarks demonstrate that AEPO consistently outperforms entropy-based baselines, exhibiting greater stability, generalization, and robustness to hyperparameters. More importantly, AEPO reveals a non-monotonic relationship between entropy and reasoning performance, showing that moderate entropy fosters exploration while excessive entropy impairs optimization—offering the first quantitative evidence linking entropy dynamics to reasoning capability in large language models.
Beyond solving entropy collapse, AEPO establishes a generalizable paradigm for learning under target distributions, suggesting its applicability to broader domains such as multi-modal alignment and long-horizon reasoning. In summary, AEPO transforms entropy from a passive statistical indicator into an active, tunable variable that fundamentally governs exploration, stability, and performance in RFT.
In this paper, AI was used only for translation and language polishing.