A Comprehensive Survey of Reward Models: Taxonomy,
Applications, Challenges, and Future

Jialun Zhong\(^{1,4}\thanks{Equal Contribution}\), Wei Shen\(^{2*}\), Yanzeng Li\(^{1}\), Songyang Gao\(^{2}\), Hua Lu\(^{3}\), Yicheng Chen\(^{4}\),
Yang Zhang\(^{4}\), Wei Zhou\(^{4}\), Jinjie Gu\(^{4}\), Lei Zou\(^{1}\thanks{Corresponding Author}\)
\(^{1}\)Peking University, \(^{2}\)Fudan University,
\(^{3}\)Huazhong University of Science and Technology, \(^{4}\)Ant Group
zhongjl@stu.pku.edu.cn, weishen21@fudan.edu.cn, zoulei@pku.edu.cn


Abstract

Reward Models (RMs) have demonstrated impressive potential for enhancing Large Language Models (LLMs), as RMs can serve as proxies for human preferences, providing signals that guide LLMs' behavior across various tasks. In this paper, we provide a comprehensive overview of relevant research, exploring RMs from the perspectives of preference collection, reward modeling, and usage. Next, we introduce the applications of RMs and discuss benchmarks for their evaluation. Furthermore, we conduct an in-depth analysis of the challenges that remain in the field and dive into potential research directions. This paper is dedicated to providing beginners with a comprehensive introduction to RMs and facilitating future studies. The resources are publicly available on GitHub.

1 Introduction↩︎

"The reward of suffering is experience."
— Harry S. Truman

In recent years, the realm of Large Language Models (LLMs) [1][3] has seen rapid development, injecting vitality into the AI community while driving advances in various downstream tasks [4][6]. Behind these advancements, alignment techniques ensure that the behavior of LLMs adheres to human values [7], [8]. To reduce human involvement in the alignment process, Reward Models (RMs) trained on human preference data can serve as proxies to provide reward signals for subsequent training, e.g., Reinforcement Learning from Human Feedback (RLHF) [9]. As a result, RMs have garnered increasing research interest in recent years [10][12].

Figure 1: An example of RM.

Figure 1 illustrates an example of an RM in the dialogue domain. The goal is to train an LLM-based chatbot following the "3H" principle (Honest, Harmless, and Helpful) [13]. Given two sampled responses generated by LLMs, the RM follows the instruction and ranks the responses according to the aforementioned three dimensions, then selects the better response, produced by LLM-2, that aligns with human values (less harmful in this case); this preference can subsequently be used to optimize the policy model. The ranking process of the RM demonstrates interpretability and traceability. The task instruction, human input, response pairs, and the RM preference can be utilized to optimize the policy LLM in the RL stage.

In this paper, we focus primarily on parameterized RMs in the LLM era, which are used to reflect human preferences. Some surveys [7], [12] touch on RMs (see Appendix 8.1 for more details). However, these works either lack a systematic organization of RMs or do not include detailed and constructive discussions of them. To fill this gap, our main contributions can be summarized as follows: (1) We present the first comprehensive survey specifically focused on RMs in the LLM era; (2) We systematically review related works in the field of RMs and introduce an elaborate taxonomy; (3) We discuss the challenges and future directions, which facilitate further research.

The organization of this survey is as follows: We first present the taxonomy of RMs (§2). This section covers preference collection (§2.1), reward modeling (§2.2), and usage (§2.3). Next, we introduce the applications (§3) and evaluation benchmarks (§4). Finally, we discuss the challenges that remain for RMs (§5) and propose potential research directions (§6).

2 Taxonomy↩︎

2.1 Preference Collection↩︎

RMs can serve as proxies for humans, and the preferences they learn from can originate from different sources, namely humans and LLMs. This section introduces both.

2.1.1 Human Preference↩︎

Scaling up model parameters or training data does not guarantee improved alignment with human preferences [14]. In contrast, larger models may still produce hallucinations, harmful outputs, or unhelpful responses [15]. One straightforward approach is to train an RM on human preference data, which subsequently serves as a proxy to provide the training signal during the reinforcement learning phase. Some methods employ human annotators [9], [16] to label pairs of trajectories produced by the interaction between the policy model and the environment. Other works [17] ask annotators to label response pairs produced by LLMs or humans for collected prompts [18]. Beyond this, improving the efficiency and quality of collection requires further investigation.

2.1.1.1 Efficiency.

Some studies have introduced active learning [19] into preference collection. For example, [20] and [21] use an objective of information gain to choose queries. [22] adopts entropy-based sampling methods to select segment pairs. In addition, some approaches [23], [24] leverage data augmentation and sequential pairwise comparison to achieve preference-efficient learning.

2.1.1.2 Quality.

Some works aim to improve the quality from the perspective of annotators, including the introduction of demonstrations [16], active annotator selection [25], user-friendly interfaces [26], [27], and fine-grained goals and rules [17], [28], [29]. Meanwhile, other works focus on the quality of sampled queries, such as selecting diverse batch samples [30], [31] or adopting online collection settings [32] to prevent distribution shift.

2.1.2 AI Preference↩︎

Although collecting preference data from trained human annotators is intuitively suitable for human preference alignment, the high costs [33] may limit its practicality. As the capabilities [34] of LLMs continue to advance, they have demonstrated a high degree of consistency with human judgment [35]. Moreover, when AI systems surpass humans in some tasks [36], [37], it becomes difficult for humans to evaluate the complex behaviors produced by superhuman models [38]. Therefore, AI preferences have garnered increasing research interest and have the potential to become an alternative to human preferences [39].

[10] first introduces RL from AI Feedback (RLAIF) for training a helpful and harmless AI assistant in conversational scenarios, where the RM is trained on a combination of LLM-generated harmlessness preference labels and human-generated helpfulness preference labels. [40] trains an RM on synthetic comparisons, whose quality is controlled by the model size and the number of in-context shots. [35] directly utilizes off-the-shelf LLMs to provide rewards during RL, which sidesteps the out-of-distribution issue between the trajectories sampled from the initial policy and the dataset on which the RM was trained.

Similar to human preference collection, some subsequent studies attempt to collect AI preference pairs at scale and with high quality. [41] and [42] construct instruction templates to elicit preferences; various LLMs in a model pool are used to generate and evaluate the completions for the instructions. [43] introduces human-defined principles to achieve an instructable RM. Other works further integrate AI preferences with human preferences. [44] and [45] enable LLMs to generate synthetic critiques for completion pairs to enhance RMs. In addition, [46] combines LLM-generated responses and human-annotated negative samples to mitigate the problem of noisy positive samples [47].


Figure 2: Taxonomy of Reward Models, including Preference Collections, Reward Modeling, and Usage. See Figure 4 in Appendix for full version.

2.2 Reward Modeling↩︎

Reward modeling plays a central role in the alignment of LLMs, especially as a foundational component in reinforcement learning frameworks. RMs have been widely adopted in reinforcement learning research as substitutes for directly using environment rewards [48]. They are particularly relevant to inverse reinforcement learning, which focuses on inferring an agent’s underlying reward function from observed trajectory data [49].

2.2.1 Reward Model Type Level↩︎

In this part, we mainly discuss several reward modeling mechanisms categorized by the underlying model type (Figure 3). Following the taxonomy introduced in [50], [51], the mechanisms include discriminative reward, generative reward, and implicit reward.

2.2.1.1 Discriminative Reward.

Discriminative RMs consist of a base model and an MLP-based reward head (classifier), which outputs a scalar reward for the given input. Sequence Classifiers (Figure 3 (a)) belong to Discriminative RMs and model the preference for a single response. For example, [52] proposes a conditional RM that incorporates preference data across different domains by leveraging conditional system prompts. [53] introduces absolute rewards for actions to augment the Bradley-Terry (BT) model [54], which is well suited to the binary comparison task. [55] regularizes the hidden states to improve the generalizability of RMs on out-of-distribution (OOD) data.
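As an illustration of the sequence-classifier design, the sketch below adds a linear reward head on top of a pretrained causal LM; the base-model interface (returning hidden states in the HuggingFace style) and all names are assumptions for illustration, not any specific cited implementation.

```python
import torch
import torch.nn as nn

class DiscriminativeRM(nn.Module):
    """Sequence-classifier RM: a base LM plus a linear head producing a scalar reward."""

    def __init__(self, base_model: nn.Module, hidden_size: int):
        super().__init__()
        self.base = base_model                    # assumed to expose last-layer hidden states
        self.reward_head = nn.Linear(hidden_size, 1)

    def forward(self, input_ids: torch.Tensor, attention_mask: torch.Tensor) -> torch.Tensor:
        # Assumption: the base model returns an object with `.hidden_states`
        # when called with output_hidden_states=True (HuggingFace convention).
        outputs = self.base(input_ids=input_ids, attention_mask=attention_mask,
                            output_hidden_states=True)
        hidden = outputs.hidden_states[-1]        # (batch, seq_len, hidden)
        last_idx = attention_mask.sum(dim=1) - 1  # index of the last non-padding token
        pooled = hidden[torch.arange(hidden.size(0)), last_idx]
        return self.reward_head(pooled).squeeze(-1)   # (batch,) scalar rewards
```

The scalar output per prompt-response sequence is what the pairwise Bradley-Terry objective (Appendix 8.2) is trained on.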

Another type of Discriminative RM is the Custom Classifier (Figure 3 (b)), which takes comparison pairs as input or outputs multiple scores. [56] compares each pair of candidates in the pool and defines several scoring functions to select the best candidate. [57] optimizes an ensemble of existing metrics to align with human preferences. [58] and [59] leverage multi-objective rewards for modeling diverse preferences. In addition, [59] further uses a gating layer to adaptively allocate suitable objectives to the task.

Figure 3: Following the taxonomy in [50], [51], reward models can be categorized as Discriminative RMs (a)(b), Generative RMs (c), and Implicit RMs (d). (\(x\): prompt, \(y_1,y_2\): responses)

2.2.1.2 Generative Reward.

Unlike discriminative models, generative reward models (Figure 3 (c)) fully leverage the generative capabilities of LLMs to provide preference scores. Some works use general models [60] or train specialized models [61][65] to serve as judges, which can either pick the better option from a comparison pair or rate a single response in textual form. [66] and [67] extract the next-token probability of answer indicators as scores. [68] utilizes a trained generative reward model to rewrite the original response under a minimum-editing constraint; token-level scores can then be obtained by contrasting the response pairs. In addition, the Self-Instruct [69] technique can be used to optimize generative reward models: some works [70][72] iteratively train the model with constructed contrasting synthetic preference pairs, reasoning traces (optional), and generated judgments. Generative reward models can be integrated with other LLM-related technologies such as Chain-of-Thought (CoT) [73] and Retrieval-Augmented Generation (RAG) [74], thereby endowing them with the potential to be applied across broader tasks.
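To make the answer-indicator scoring idea concrete, the sketch below reads the next-token probabilities of "A" versus "B" after a judging prompt; the prompt template and single-token indicator assumption are illustrative, not the exact setup of the cited works.

```python
import torch
from transformers import PreTrainedModel, PreTrainedTokenizer

def indicator_preference(judge: PreTrainedModel, tok: PreTrainedTokenizer,
                         prompt: str, resp_a: str, resp_b: str) -> float:
    """Score a response pair by the judge LM's next-token probability of 'A' vs. 'B'."""
    template = (
        f"Question: {prompt}\n"
        f"Response A: {resp_a}\nResponse B: {resp_b}\n"
        "Which response is better? Answer with A or B: "
    )
    ids = tok(template, return_tensors="pt").input_ids
    with torch.no_grad():
        next_logits = judge(input_ids=ids).logits[0, -1]   # logits for the next token
    # Assumption: "A"/"B" map to single tokens; some tokenizers require " A"/" B" instead.
    id_a, id_b = tok.convert_tokens_to_ids("A"), tok.convert_tokens_to_ids("B")
    probs = torch.softmax(next_logits[[id_a, id_b]], dim=-1)
    return probs[0].item()   # probability mass assigned to "Response A is better"
```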

2.2.1.3 Implicit Reward.

Different from explicit RMs, recent studies construct reward-related variables through weaker optimization signals (the variable \(z\) in Figure 3 (d)) to reduce resource costs. DPO [75] and SLiC-HF [76] eliminate explicit reward modeling by defining implicit rewards through generation probabilities and directly optimizing over human preference pairs. [77] proves that the value functions induced by these implicit rewards parallel their explicit counterparts, enabling automated credit assignment in LLMs. Subsequent studies aim to improve the robustness of such models. From the perspective of preference data, appropriate data sampling, selection, and filtering strategies [78][83] can be used to address the quality and distribution issues of the preference dataset. Some works [84], [85] attempt to effectively optimize the target policies from multiple responses, while [86] proposes direct reward optimization on single-trajectory data. Other works focus on preference corruption [87], [88] or the preference distribution shift problem [89]. From the perspective of the modeling mechanism, recent techniques such as token-level optimization [90], [91], reference-free methods [92][94], and self-play optimization [95][97] exhibit practical potential. It should be noted, however, that these methods generally underperform explicit reward modeling when evaluated as reward models themselves [51].
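The DPO-style implicit reward is simply a scaled log-ratio of policy and reference probabilities; a minimal sketch (names illustrative) is shown below, and such scores can be used directly to rank candidate responses.

```python
import torch

def implicit_reward(policy_logprob: torch.Tensor, ref_logprob: torch.Tensor,
                    beta: float = 0.1) -> torch.Tensor:
    """DPO-style implicit reward: r(x, y) = beta * log( pi(y|x) / pi_ref(y|x) ).

    Both arguments are (batch,) tensors holding the summed token log-probabilities
    of the same responses under the trained policy and the frozen reference model.
    """
    return beta * (policy_logprob - ref_logprob)

# Example use for reranking candidates of one prompt:
# rewards = implicit_reward(policy_lp, ref_lp); best_index = rewards.argmax()
```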

2.2.2 Reward Granularity Level↩︎

In this subsection, we classify reward mechanisms based on their granularity when functioning as verifiers to solve problems with ground truth. Specifically, an Outcome-level Reward Model (ORM) predicts the probability that a completion results in a correct answer, while a Process-level Reward Model (PRM) assigns a score to each step in the reasoning process.

2.2.2.1 Outcome Level Reward.

For tasks that require more complex reasoning, an ORM can be employed [50], [55], [98]. Typically, the training data for an ORM is constructed differently from standard preference tuning [54]. Specifically, each solution \(s\) is paired with a problem statement or prompt \(p\). The inductive bias applied in this setup labels each completion according to whether its final answer is correct for the given problem. The ORM \((P \times S \rightarrow \mathbb{R})\) is usually trained using a cross-entropy loss [99], [100], where \(y_s\) is the ORM's predicted probability that solution \(s\) is correct and \(\hat{y}_s \in \{0,1\}\) is the correctness label: \[\mathcal{L}_{ORM} = -\left( \hat{y}_s \log{y_s} + (1 - \hat{y}_s) \log{(1-y_s)}\right)\]

2.2.2.2 Process Level Reward.

Despite their proficiency in multi-step reasoning tasks, outcome-supervised methods are still prone to hallucinations, such as reaching the correct answer through an incorrect reasoning path [101]. This indicates the necessity of incorporating process supervision to address these limitations. The PRM \((P \times S \rightarrow \mathbb{R^{+}} )\) can be trained using the standard classification loss below, where \(y_{s_i}\) is the PRM's prediction for the \(i\)-th step, \(\hat{y}_{s_i}\) is its correctness label, and \(N\) is the total number of reasoning steps in \(s\). \[\mathcal{L}_{PRM} = - \sum_{i=1}^{N} \left[ \hat{y}_{s_i} \log{y_{s_i}} + (1 - \hat{y}_{s_i}) \log{(1-y_{s_i})} \right]\]
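A minimal implementation sketch of the two losses above is given below; the variable names are illustrative, predictions are assumed to already be probabilities (not logits), and the step mask handles variable-length solutions.

```python
import torch
import torch.nn.functional as F

def orm_loss(pred: torch.Tensor, label: torch.Tensor) -> torch.Tensor:
    """Outcome-level loss: one correctness probability per solution."""
    # pred:  (batch,) predicted probability that the final answer is correct
    # label: (batch,) 1.0 if the solution is correct, else 0.0
    return F.binary_cross_entropy(pred, label)

def prm_loss(step_pred: torch.Tensor, step_label: torch.Tensor,
             step_mask: torch.Tensor) -> torch.Tensor:
    """Process-level loss: one correctness probability per reasoning step."""
    # step_pred, step_label, step_mask: (batch, max_steps); mask zeroes out padded steps
    per_step = F.binary_cross_entropy(step_pred, step_label, reduction="none")
    return (per_step * step_mask).sum() / step_mask.sum()
```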

Different from the heuristic method [99] that leverages semantic relevance for stepwise verification, [102] introduces a PRM trained on human-annotated stepwise labels. The PRM evaluates each reasoning step individually, which can reduce tracing errors and avoid tampering incentives [103]. Moreover, [104] constructs PRM800K, a large-scale stepwise human feedback dataset, and trains a PRM to predict step correctness in the form of tokens.

To further reduce the cost of human annotation, [105] and [106] obtain process-supervision signals via the Monte Carlo (MC) method. For each step and prefix, the frequency of reaching the correct answer among sampled completions can be used to estimate step quality, i.e., a Q-value function [107]. Building on these, [108] employs an adaptation of Monte Carlo Tree Search (MCTS) to construct state-action trees for collecting PRM training data. In addition, [109] proposes a stepwise discriminator trained through contrastive learning, where the preference pairs are obtained by aligning LLM-generated incorrect solutions with the reference solution.
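A minimal sketch of this MC labeling scheme is shown below; `sample_completion` is a hypothetical helper that lets the policy LLM finish a solution from a given step prefix and return its final answer.

```python
def mc_step_value(problem: str, prefix_steps: list[str], gold_answer: str,
                  sample_completion, n_samples: int = 8) -> float:
    """Estimate a step's quality as the fraction of sampled completions that
    reach the gold answer when continuing from the given step prefix.

    `sample_completion(problem, prefix_steps)` is an assumed wrapper around the
    policy LLM that returns the final answer string of one sampled completion.
    """
    hits = 0
    for _ in range(n_samples):
        final_answer = sample_completion(problem, prefix_steps)
        if final_answer.strip() == gold_answer.strip():
            hits += 1
    return hits / n_samples   # Monte Carlo estimate of the step's Q-value
```

These estimated values (or thresholded versions of them) then serve as the stepwise labels \(\hat{y}_{s_i}\) for PRM training.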

Another series of works argues that process rewards should measure progress and advantage. [110] frames the PRM as a ranking problem to capture inter-dependencies among reasoning steps: a step with a larger Q-value is more likely to lead to the correct answer, and a significant gap should exist between correct steps and the first incorrect step. [111] and [112] introduce advantages as process rewards to measure the change in confidence or likelihood of generating a correct response. Furthermore, [113] and [114] obtain implicit PRMs from trained ORMs through reward parameterization, which can be used to estimate advantages by calculating token-level rewards.

2.2.2.3 Pros and Cons of different types of RMs.

Table 1: Comparison of advantages and disadvantages of RMs at different granularities

Granularity | Advantages | Disadvantages
Outcome | Potential in flexible tasks; ease of implementation | May lead to false-positive solutions; sparse reward
Process | Potential in reasoning tasks; dense reward; controllable | High cost of gathering training data; value estimation yields inferior performance; process reward is hard to define; scalability and generalization problems

Currently, ORMs tend to be better than PRMs in tasks with flexible processes due to their ease of implementation and generalizability, but they may lead to false-positive solutions [101] in reasoning tasks. PRMs have demonstrated their potential in reasoning tasks [108], [115], but several considerations require attention. Manual annotation is expensive and not scalable [116], while automated annotation may not produce satisfactory results: [117] finds that MC estimation hinders the capability of PRMs to identify incorrect steps compared to LLM judges. Besides, process rewards are difficult to define [114]: determining the correctness of intermediate steps and the progress made toward solving the problem is challenging. Moreover, PRM training often suffers from reward hacking [118], while retraining the RM introduces additional complexity and resource requirements. Finally, although PRMs excel at reranking top-N responses or assisting in guided search [119], their computational overhead in large-scale reinforcement learning tasks can outweigh their benefits in practical experiments [120]. Table 1 summarizes these considerations.

2.3 Usage↩︎

In the context of LLMs, RMs serve as critical components that help guide model behavior toward desired outcomes. By defining a structured, quantifiable signal that measures how well a generated response aligns with specific goals or user preferences, RMs enable the tuning and optimization of LLM outputs. This RM utility manifests across multiple stages of the LLM life cycle, including data selection, policy training, and the inference stage. In this subsection, we investigate RM utility from these three perspectives in detail.

2.3.0.1 Data Selection.

Some studies utilize RMs to select data for the fine-tuning of LLMs. [121] proposes an SFT-like iterative training method, where an RM is utilized to rank the quality of LLM-generated responses and the data with the highest reward is used to fine-tune the LLM. [122] further introduces a ranking loss to align the LLM-generated score with the RM-generated score. [123] leverages an RM-filtered dataset to fine-tune the LLM toward an offline RL objective. [124] uses RMs to evaluate the correctness of answers and rationales, thereby selecting preference pairs to optimize LLMs via the DPO [75] objective.
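A minimal sketch of this kind of RM-based data selection (best-of-n rejection sampling) follows; `generate` and `reward_fn` are assumed wrappers around the policy LLM and the trained RM, respectively.

```python
def select_finetune_data(prompts, generate, reward_fn, n: int = 8):
    """Keep the highest-reward response per prompt for a subsequent SFT round.

    `generate(prompt, n)` returns n sampled responses from the policy LLM;
    `reward_fn(prompt, response)` returns the RM's scalar score.
    """
    dataset = []
    for prompt in prompts:
        candidates = generate(prompt, n)
        best = max(candidates, key=lambda resp: reward_fn(prompt, resp))
        dataset.append({"prompt": prompt, "response": best})
    return dataset
```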

2.3.0.2 Policy Training.

RMs provide feedback signals that reinforce or penalize certain behaviors [14], ultimately shaping the model's decision-making policies. To mitigate the issue of low robustness, which arises primarily because the RM often struggles with out-of-distribution generalization [125] and mismatches with human judgment, several strategies have been investigated. These include length-controlled reward settings [126][128], causal reward modeling [129], [130], Bayesian methods [131][133], and reward ensembles [134][136].

2.3.0.3 Inference.

RMs can be used to rank multiple outputs to deliver responses that best align with application-specific criteria. As discussed in §2.2.2, RMs can be classified as ORMs and PRMs. PRMs are often used at the inference stage to evaluate progress and improve reasoning ability [112], and RM-guided tree search frameworks [115], [137][139] have been shown to greatly enhance the reasoning abilities of LLMs. In addition, RMs can also be used to evaluate intermediate decoding steps and dynamically decide whether to invoke a more powerful target model to balance resource utilization and performance [140].
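To illustrate the inference-time use of a PRM, the sketch below performs a step-level beam search in which the PRM scores partial reasoning chains; `propose_steps`, `prm_score`, and `is_final` are assumed helpers, and the cumulative-score heuristic is only one of several possible aggregation choices.

```python
import heapq

def prm_guided_beam_search(problem, propose_steps, prm_score, is_final,
                           beam_width: int = 4, max_depth: int = 16):
    """Step-level beam search where a PRM scores partial reasoning chains.

    `propose_steps(problem, steps)` samples candidate next steps from the policy LLM,
    `prm_score(problem, steps)` returns the PRM score of a partial chain, and
    `is_final(step)` detects a terminating step; all three are assumed wrappers.
    """
    beams = [((), 0.0)]                  # (partial chain of steps, cumulative PRM score)
    for _ in range(max_depth):
        candidates = []
        for steps, score in beams:
            if steps and is_final(steps[-1]):
                candidates.append((steps, score))   # finished chains carry over unchanged
                continue
            for step in propose_steps(problem, list(steps)):
                new_steps = steps + (step,)
                candidates.append((new_steps, score + prm_score(problem, list(new_steps))))
        beams = heapq.nlargest(beam_width, candidates, key=lambda c: c[1])
        if all(steps and is_final(steps[-1]) for steps, _ in beams):
            break
    return list(max(beams, key=lambda c: c[1])[0])   # highest-scoring reasoning chain
```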

3 Applications↩︎

RMs have found extensive applications across multiple domains. Here, we briefly summarize some key areas where RMs are currently utilized.

3.0.0.1 Dialogue.

RMs help mitigate harmful responses by refining them based on ethical guidelines and user intent [10], [15], [141][143]. Meanwhile, some works focus on the professionalism [144], [145] in dialogue, requiring agents to accurately and clearly express complex knowledge. Other works attempt to improve the overall dialogue impression [146], [147], including empathy, enthusiasm, humanlikeness, and so on.

3.0.0.2 Reasoning.

In mathematical reasoning [4], [102], RMs, especially PRM, can provide guidance to LLMs to improve logical consistency by balancing the exploration of various solutions with minimizing errors [104], [105], [108], [148][151]. Additionally, RMs have also shown promise in code generation [152] by integrating API calls, improving learning efficiency, and optimizing performance [64], [115], [153][156].

3.0.0.3 Retrieval & Recommendation.

RMs can be employed to help align the retrieval process with the preferences of strong LLMs [157], which includes assessing relevance [158], [159], adaptive retrieval [160], and improving the quality of intermediate queries [161]. As for recommendation systems, RMs can be used to capture nuanced user preferences [162], evaluate LLM-generated user preferences [163], and produce high-quality explanations [164].

3.0.0.4 Other Applications.

Apart from the aforementioned applications in the text domain, RMs have demonstrated potential in other modalities, such as text-to-audio [165][167], text-to-image [168][170], and text-to-video [171][173] generation. Moreover, RMs have been explored in interactive tasks including robotic manipulation [174], [175] and games [176], [177], which are regarded as building blocks of artificial general intelligence.

4 Benchmarks↩︎

RM evaluation is crucial because errors in an RM can negatively affect the performance of the final policy [178][180]. However, the development of general and standardized benchmarks for RM evaluation remains nascent, making it hard to compare and improve RMs. This is due to several challenges: (1) The most direct way to evaluate an RM is to train a full RL policy and observe its performance, which is very costly [178]. (2) RM evaluation is often tied to the performance of the policy trained with it, making it difficult to assess the RM independently [51]. (3) While creating a dataset for evaluation (e.g., annotating a simple pairwise comparison dataset) is relatively easy, RMs are sensitive to changes in input style, domain, or format [181]. This means RM evaluation requires a more comprehensive approach involving dynamic, multi-faceted test construction, which further compounds the difficulty. Recently, researchers have tried to construct high-quality benchmarks to explore optimizing RMs across different RL policies, LM architectures, training budgets, etc.

4.0.0.1 ORM Benchmarks.

[51] constructs a comprehensive benchmark, RewardBench, which contains human-verified prompt-chosen-rejected trios spanning chat, reasoning, safety, and prior test sets, while providing a toolkit to audit RM behavior. [181] proposes RM-Bench, which includes annotated chat, code, math, and safety data, and conducts a large-scale evaluation of publicly accessible RMs. [182] introduces RMB, which covers over 49 real-world scenarios and discusses the generalization defects of previous benchmarks. Specifically, [178] proposes PPE, which evaluates RMs on proxy tasks related to downstream RLHF outcomes through end-to-end RLHF experiments.
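These ORM benchmarks largely reduce to pairwise accuracy over prompt-chosen-rejected trios; a minimal sketch of that protocol is shown below, with `reward_fn` an assumed wrapper around the RM under test.

```python
def pairwise_accuracy(trios, reward_fn) -> float:
    """Fraction of (prompt, chosen, rejected) trios where the RM scores
    the chosen response strictly above the rejected one."""
    trios = list(trios)
    correct = sum(
        reward_fn(prompt, chosen) > reward_fn(prompt, rejected)
        for prompt, chosen, rejected in trios
    )
    return correct / len(trios)
```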

4.0.0.2 PRM Benchmarks.

With the emergence of reasoning research, LMs are being applied to more complex scenarios such as math and multi-hop decision-making tasks, for which PRMs have appeared and been applied. For evaluating PRMs, [183] proposes ProcessBench, which consists of a large number of competition math problems with annotated step-by-step solutions. [116] introduces PRMBench, which comprises thousands of curated problems with stepwise labels and evaluates PRMs across multiple dimensions.

In addition to the aforementioned studies, some recent works evaluate RMs for specific domains or applications, e.g., vision-language models [184][186], multilingual settings [187], and Retrieval-Augmented Generation [188]. These benchmarks collectively address the need for a more comprehensive and fine-grained evaluation of RMs, paving the way for more reliable and robust RMs for training stronger LMs.

5 Challenges↩︎

5.1 Data↩︎

Collecting high-quality data that reflects human preferences is the cornerstone of RM applications, but several challenges remain in this process. During collection, potential biases may exist between the preferences of researchers and annotators [189]. Variations in expertise among annotators can introduce noisy data [25], [190], which may be significant in knowledge-intensive tasks. Assessment quality issues can also cause inconsistencies [191] between sparse feedback protocols (i.e., ratings and rankings), while dense feedback is expensive to collect. To tackle the above challenges, data filtering [192], data selection [193], and high-fidelity scalable synthetic data [194] have become promising solutions.

5.2 Training↩︎

A critical challenge in RM training is overoptimization, also called reward hacking [195][198], where RMs are excessively optimized toward a narrow evaluation metric (e.g., accuracy on a single static benchmark) [179]. An RL policy trained against such RMs may "hack" the reward signal, leading to performance degradation [199]. Causes of overoptimization include reward tampering [200], [201], misleading reward signals [202], and sycophancy [203]. As mentioned in §2.3, several research directions such as RM ensembles [204], data augmentation [130], and robust training [198], [205], [206] have demonstrated potential in mitigating overoptimization, paving the way for more robust RMs.

5.3 Bias in Evaluation↩︎

Using RMs (judge models) for evaluation also introduces intrinsic biases toward superficial qualities of text [207]. [208] observes that top-ranking RMs and some popular benchmarks exhibit biases toward specific format patterns. [209] discusses biases derived from evaluators, including length, concreteness, empty references, and so on. [210] studies the preference leakage problem caused by the relatedness between synthetic data generators and RMs. The aforementioned studies highlight the need to construct robust evaluation benchmarks to detect and mitigate biases.

6 Future Directions↩︎

6.0.0.1 The combination of scalar rewards with rule-based rewards is becoming a growing trend.

In advanced industrial LLMs [120], [211], a robust model can benefit from integrating rule-based and model-based rewards. Rule-based rewards provide clear guidelines, while model-based rewards enable learning from predictions. Specifically, rule-based rewards are applied to tasks with clear ground truths (e.g., mathematics, coding), while reward models are used for tasks without clear ground truths (e.g., creative tasks), enhancing LLMs' real-world applicability. Incorporating rule-based rewards has become standard practice in the reinforcement fine-tuning of o1-like [212] long-CoT models, and a few academic works [213][215] that rely solely on rule-based rewards have emerged, also achieving strong reasoning capabilities.

6.0.0.2 Reward Design in LLM Long-horizon Agent Tasks.

Recent advances in reasoning ability have enabled sophisticated LLMs to tackle complex expert-level tasks [216], with planning playing a key role. OpenAI and Anthropic are exploring tool use, such as search engines [217], code interpreters [218], and web browsers [219], to complete complex GUI tasks [220]. However, ensuring good agent performance is challenging, especially when designing feedback mechanisms for large systems. Hand-crafting rules remains largely trial-and-error, so developing an end-to-end reinforcement learning framework for long-horizon tasks is essential. The key challenge remains ensuring that the agent consistently receives rewards and improves monotonically.

6.0.0.3 Empowering the multi-modal domain.

RMs are rapidly evolving in the multi-modal domain, which involves the integration of modalities such as image, audio, and video. Compared with the single-modality setting, collecting multi-modal preference data is more costly. Techniques such as few-shot learning [221] and data synthesis [222] remain to be explored to reduce the reliance on human annotators. Meanwhile, designing high-quality reward signals [223] is crucial, which involves alignment across different modalities. Finally, exploring methods to enhance the cross-domain generalization of RMs, and bridging the gap between simulated and real-world scenarios, will contribute to the realization of embodied intelligence.

7 Conclusion and Discussion↩︎

In this paper, we present the first comprehensive survey specifically focused on Reward Models in the LLM era. We systematically review related studies of RMs, introduce an elaborate taxonomy, discuss the practical applications, highlight the challenges, and explore potential research directions. Besides, we discuss some open questions about RMs: (1) Is a rule-based reward enough for RL? (2) Is Mixture-of-Experts better than the BT model? (3) How can reward hacking be overcome as LLMs surpass the best expert level? See Appendix 8.4 for more details. We hope that this survey will be helpful to researchers and facilitate further research.

8 Appendix↩︎

8.1 Relevant Survey↩︎

Some previous surveys focus on human-involved RL [224][226], while [227] discusses LLM-enhanced RL. [7] and [228] conduct comprehensive investigations of LLM alignment. [11] and [12] both focus on RLHF: [11] discusses research in which the RM is the sole source of information for the objective, while [12] overviews the open problems and limitations of RLHF.

Compared with the aforementioned surveys, our work focuses primarily on RMs in the LLM era. We systematically introduce RMs based on their life cycle and explain popular usages and evaluation perspectives. In addition, we discuss the challenges and potential research directions of RMs in detail. We sincerely hope that this paper can deepen researchers' understanding of the field and facilitate future work.

8.2 Reward Modeling↩︎

The Bradley-Terry model [54] is the most common assumption for modeling pairwise preferences. For a prompt \(x\), reward model \(r\), and response pair \(y_w,y_l\), it estimates the probability that \(y_w\) is preferred over \(y_l\):

\[P(y_w \succ y_l|x) = \frac{\exp(r(x,y_w))}{\exp(r(x,y_w))+\exp(r(x,y_l))} = \frac{1}{1+\exp\left(r(x,y_l)-r(x,y_w)\right)}.\]

An RM \(\widehat{r}\) can be derived by optimizing the following maximum likelihood objective, where \(\mathcal{D}\) and \(\sigma\) represent the preference dataset and the sigmoid function, respectively.

\[\widehat{r}\leftarrow\mathop{\arg\max}_{r\in\mathcal{R}}\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma(r(x,y_w)-r(x,y_l))\right].\]
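A minimal PyTorch sketch of this maximum-likelihood objective (tensor names are illustrative) is:

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(reward_chosen: torch.Tensor,
                       reward_rejected: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the Bradley-Terry model:
    -log sigmoid(r(x, y_w) - r(x, y_l)), averaged over the batch."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```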

Under the RLHF setting [14], the target policy model is optimized using the learned RM \(\widehat{r}(x,y)\). Here \(\pi_{\mathsf{ref}}(y\mid x)\) denotes the reference model before the update, and a Kullback-Leibler (KL) penalty term is used to constrain the size of the policy update [229]:

\[\widehat{\pi}\leftarrow\mathop{\arg\max}_{\pi\in\Pi}\mathbb{E}_{x\sim\mathcal{D},\,y\sim\pi(\cdot|x)}\left[\widehat{r}(x,y)-\beta\log\frac{\pi(y\mid x)}{\pi_{\mathsf{ref}}(y\mid x)}\right]\]

DPO [75] is an alternative alignment approach which can optimize the policy without explicit reward modeling: \[\widehat{\pi}\leftarrow \mathop{\arg\max}_{\pi\in\Pi}\mathbb{E}_{(x,y_w,y_l)\sim\mathcal{D}}\left[\log\sigma\left(\beta\log\frac{\pi(y_w\mid x)}{\pi_{\mathrm{ref}}(y_w\mid x)}-\beta\log\frac{\pi(y_l\mid x)}{\pi_{\mathrm{ref}}(y_l\mid x)}\right)\right],\]

where \(\beta\) is a scaling hyperparameter that controls the deviation from the reference policy.
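For completeness, a minimal sketch of the DPO objective above, assuming summed per-response log-probabilities are already available (all names illustrative):

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_logp_w: torch.Tensor, policy_logp_l: torch.Tensor,
             ref_logp_w: torch.Tensor, ref_logp_l: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """DPO objective on summed per-response log-probabilities.

    Each argument is a (batch,) tensor of log pi(y|x) under the trained policy or
    the frozen reference model, for the chosen (w) / rejected (l) response.
    """
    logits = beta * ((policy_logp_w - ref_logp_w) - (policy_logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()
```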

8.3 Reward Shaping & Ensemble↩︎

A major challenge in real-world scenarios is the sparsity and delay of rewards, which can hinder learning. This section focuses on engineering the reward signal [230] during reinforcement learning.

8.3.0.1 Reward on Point-wise Feedback.

Pointwise feedback assigns numerical values to actions or outcomes, enabling precise adjustments to the agent’s policy. It is effective for tasks where each action’s quality can be independently assessed. For example, [231] and [232] propose a self-training strategy to select the best and worst reward samples. [47] addresses ambiguous preference pairs by incorporating a margin in the reward, improving model generalization. [233] employs a data-centric approach to enhance feedback quality and make reward models more effective.

8.3.0.2 Reward on Binary Feedback.

Binary feedback simplifies evaluation by categorizing outcomes as positive or negative, eliminating the need for a ground truth. This makes implementation and interpretation easier. For instance, Nash learning [234] models pairwise preferences by binary feedback but struggles with inconsistent human labeling. Approaches like KTO [235] use the Kahneman-Tversky model [236] to maximize utility, and DRO [237] combines offline reinforcement learning with regularization in binary feedback. Binary feedback also guides agent learning by signaling desirable actions, as explored in [238]. However, it may not capture the full complexity of human preferences.

8.3.0.3 Reward on Ensemble Feedback.

Model ensembling [239] is a classic machine learning method for mitigating reward overoptimization and improving policy optimization. Typically, ensemble feedback [134], [135], [204], [231] combines reward signals to further reduce reward hacking during reinforcement fine-tuning. For computational efficiency, [136] proposes a LoRA-based ensemble method that reduces the computational cost associated with reward ensembles. Additionally, reward ensemble techniques, such as the Bayesian ensemble method [133], can be used to approximate uncertainty in the feedback.
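One common aggregation choice, sketched below under illustrative names, is to penalize the ensemble mean by its standard deviation so that the policy is not rewarded for responses on which the ensemble members disagree.

```python
import torch

def ensemble_reward(rewards: torch.Tensor, penalty: float = 1.0) -> torch.Tensor:
    """Uncertainty-penalized ensemble reward.

    `rewards` has shape (n_models, batch): one score per ensemble member (n_models >= 2).
    Subtracting a std-based penalty from the mean makes the combined signal more
    conservative, which can help mitigate reward hacking.
    """
    return rewards.mean(dim=0) - penalty * rewards.std(dim=0)
```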

8.4 Open Questions↩︎

8.4.0.1 Is Rule-based reward enough for RL?

Rule-based rewards are a good way to mitigate reward hacking, but it’s hard to say whether they are enough on their own. Without sufficient supervision, large language models (LLMs) may encounter very sparse rewards, leading to optimization divergence. Additionally, for tasks that don’t have a clear ground truth, designing an effective rule-based reward can be challenging. In such cases, preference learning can be a better option, as it allows us to derive reward signals from comparative feedback rather than relying solely on predefined rules. Thus, while rule-based rewards can be helpful, they may not always provide the necessary robustness for complex tasks.

8.4.0.2 Is Mixture-of-Experts better than BT Model?

There are several works related to Mixture-of-Experts (MoE) models, such as the DMoERM model [240] and LoRA-ensemble [241], [242]. MoE models have shown great potential in creating Pareto-optimal [243], [244] reward models, where they can combine multiple expert models to focus on different aspects of the problem, offering a more versatile and efficient approach. While the BT model has its strengths, MoE models have the advantage of scalability and the ability to improve performance by selecting the most relevant expert for each situation. This flexibility often leads to better generalization and optimization, especially in complex tasks.

8.4.0.3 How to overcome reward hacking as LLMs surpass the best expert level?

As LLMs surpass the performance of the best expert models, overcoming reward hacking becomes more challenging. One approach is to leverage weak-to-strong generalization [245]. This involves designing reward models that encourage more robust, flexible learning and account for a wider variety of potential behaviors and outcomes. Instead of relying solely on expert-level feedback, incorporating broader, more generalized reward signals helps ensure that the system does not exploit narrow solutions or hacks. This strategy promotes more meaningful generalization and prevents the model from exploiting loopholes in the reward structure.

Figure 4: Full taxonomy of Reward Models.

8.5 Evaluation Aspects↩︎

According to the benchmarks introduced in §4, the evaluation aspects of RMs can be summarized mainly as follows:

8.5.0.1 Consistency.

The aim of RMs is to provide preference signals to LLMs, so consistency is the primary evaluation aspect for RMs. Consistency can be divided into: (1) alignment between RMs and human preferences, where RMs are required to distinguish between chosen and rejected samples [51], [181], [182] or to identify the correctness of samples directly [183]; and (2) alignment between RMs and policy models, such as style-controlled correlation [181] and downstream task correlation [178], [184].

8.5.0.2 Robustness.

On the basis of consistency, RMs should exhibit robustness across experimental settings and tasks. [179] rewrites the prompts in the RM test set to investigate the influence of prompt-level semantic bias. In PRM evaluation, [116] requires models to be sensitive to the details of reasoning, including subtle conditions, deception, and multiple solutions.

8.5.0.3 Safety.

Similar to the consistency evaluation, [51] and [181] evaluate RMs' ability to distinguish between safe and unsafe responses. [182] conducts a trade-off analysis between the goals of helpfulness and harmlessness.

References↩︎

[1]
OpenAI. GPT-4 technical report. ArXiv preprint, abs/2303.08774, 2023. URL https://arxiv.org/abs/2303.08774.
[2]
Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M. Dai, Anja Hauth, Katie Millican, David Silver, Slav Petrov, Melvin Johnson, Ioannis Antonoglou, Julian Schrittwieser, Amelia Glaese, Jilin Chen, Emily Pitler, Timothy P. Lillicrap, Angeliki Lazaridou, Orhan Firat, James Molloy, Michael Isard, Paul Ronald Barham, Tom Hennigan, Benjamin Lee, Fabio Viola, Malcolm Reynolds, Yuanzhong Xu, Ryan Doherty, Eli Collins, Clemens Meyer, Eliza Rutherford, Erica Moreira, Kareem Ayoub, Megha Goel, George Tucker, Enrique Piqueras, Maxim Krikun, Iain Barr, Nikolay Savinov, Ivo Danihelka, Becca Roelofs, Anaı̈s White, Anders Andreassen, Tamara von Glehn, Lakshman Yagati, Mehran Kazemi, Lucas Gonzalez, Misha Khalman, Jakub Sygnowski, and et al. Gemini: A family of highly capable multimodal models. ArXiv preprint, abs/2312.11805, 2023. URL https://arxiv.org/abs/2312.11805.
[3]
OpenAI. Learning to reason with llms. 2024. URL https://openai.com/index/learning-to-reason-with-llms/.
[4]
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. Training verifiers to solve math word problems. ArXiv preprint, abs/2110.14168, 2021. URL https://arxiv.org/abs/2110.14168.
[5]
Naman Jain, King Han, Alex Gu, Wen-Ding Li, Fanjia Yan, Tianjun Zhang, Sida Wang, Armando Solar-Lezama, Koushik Sen, and Ion Stoica. Livecodebench: Holistic and contamination free evaluation of large language models for code. ArXiv preprint, abs/2403.07974, 2024. URL https://arxiv.org/abs/2403.07974.
[6]
OpenAI. Introducing simpleqa. 2024. URL https://openai.com/index/introducing-simpleqa/.
[7]
Jiaming Ji, Tianyi Qiu, Boyuan Chen, Borong Zhang, Hantao Lou, Kaile Wang, Yawen Duan, Zhonghao He, Jiayi Zhou, Zhaowei Zhang, Fanzhi Zeng, Kwan Yee Ng, Juntao Dai, Xuehai Pan, Aidan O'Gara, Yingshan Lei, Hua Xu, Brian Tse, Jie Fu, Stephen McAleer, Yaodong Yang, Yizhou Wang, Song-Chun Zhu, Yike Guo, and Wen Gao. AI alignment: A comprehensive survey. ArXiv preprint, abs/2310.19852, 2023. URL https://arxiv.org/abs/2310.19852.
[8]
Tianhao Shen, Renren Jin, Yufei Huang, Chuang Liu, Weilong Dong, Zishan Guo, Xinwei Wu, Yan Liu, and Deyi Xiong. Large language model alignment: A survey. ArXiv preprint, abs/2309.15025, 2023. URL https://arxiv.org/abs/2309.15025.
[9]
Paul F. Christiano, Jan Leike, Tom B. Brown, Miljan Martic, Shane Legg, and Dario Amodei. Deep reinforcement learning from human preferences. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 4299–4307, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/d5e2c0adad503c91f91df240d0cd4e49-Abstract.html.
[10]
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, Carol Chen, Catherine Olsson, Christopher Olah, Danny Hernandez, Dawn Drain, Deep Ganguli, Dustin Li, Eli Tran-Johnson, Ethan Perez, Jamie Kerr, Jared Mueller, Jeffrey Ladish, Joshua Landau, Kamal Ndousse, Kamile Lukosiute, Liane Lovitt, Michael Sellitto, Nelson Elhage, Nicholas Schiefer, Noemı́ Mercado, Nova DasSarma, Robert Lasenby, Robin Larson, Sam Ringer, Scott Johnston, Shauna Kravec, Sheer El Showk, Stanislav Fort, Tamera Lanham, Timothy Telleen-Lawton, Tom Conerly, Tom Henighan, Tristan Hume, Samuel R. Bowman, Zac Hatfield-Dodds, Ben Mann, Dario Amodei, Nicholas Joseph, Sam McCandlish, Tom Brown, and Jared Kaplan. Constitutional AI: harmlessness from AI feedback. ArXiv preprint, abs/2212.08073, 2022. URL https://arxiv.org/abs/2212.08073.
[11]
Timo Kaufmann, Paul Weng, Viktor Bengs, and Eyke Hüllermeier. A survey of reinforcement learning from human feedback. ArXiv preprint, abs/2312.14925, 2023. URL https://arxiv.org/abs/2312.14925.
[12]
Stephen Casper, Xander Davies, Claudia Shi, Thomas Krendl Gilbert, Jérémy Scheurer, Javier Rando, Rachel Freedman, Tomasz Korbak, David Lindner, Pedro Freire, Tony Tong Wang, Samuel Marks, Charbel-Raphaël Ségerie, Micah Carroll, Andi Peng, Phillip J. K. Christoffersen, Mehul Damani, Stewart Slocum, Usman Anwar, Anand Siththaranjan, Max Nadeau, Eric J. Michaud, Jacob Pfau, Dmitrii Krasheninnikov, Xin Chen, Lauro Langosco, Peter Hase, Erdem Biyik, Anca D. Dragan, David Krueger, Dorsa Sadigh, and Dylan Hadfield-Menell. Open problems and fundamental limitations of reinforcement learning from human feedback. Trans. Mach. Learn. Res., 2023, 2023. URL https://openreview.net/forum?id=bx24KpJ4Eb.
[13]
Anthropic. Introducing claude. 2023. URL https://www.anthropic.com/news/introducing-claude.
[14]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedback. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/b1efde53be364a73914f58805a001731-Abstract-Conference.html.
[15]
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, Nicholas Joseph, Saurav Kadavath, Jackson Kernion, Tom Conerly, Sheer El Showk, Nelson Elhage, Zac Hatfield-Dodds, Danny Hernandez, Tristan Hume, Scott Johnston, Shauna Kravec, Liane Lovitt, Neel Nanda, Catherine Olsson, Dario Amodei, Tom B. Brown, Jack Clark, Sam McCandlish, Chris Olah, Benjamin Mann, and Jared Kaplan. Training a helpful and harmless assistant with reinforcement learning from human feedback. ArXiv preprint, abs/2204.05862, 2022. URL https://arxiv.org/abs/2204.05862.
[16]
Borja Ibarz, Jan Leike, Tobias Pohlen, Geoffrey Irving, Shane Legg, and Dario Amodei. Reward learning from human preferences and demonstrations in atari. In Samy Bengio, Hanna M. Wallach, Hugo Larochelle, Kristen Grauman, Nicolò Cesa-Bianchi, and Roman Garnett (eds.), Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 8022–8034, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/8cbe9ce23f42628c98f80fa0fac8b19a-Abstract.html.
[17]
Zhilin Wang, Yi Dong, Olivier Delalleau, Jiaqi Zeng, Gerald Shen, Daniel Egert, Jimmy Zhang, Makesh Narsimhan Sreedhar, and Oleksii Kuchaiev. Helpsteer 2: Open-source dataset for training top-performing reward models. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024. URL http://papers.nips.cc/paper_files/paper/2024/hash/02fd91a387a6a5a5751e81b58a75af90-Abstract-Datasets_and_Benchmarks_Track.html.
[18]
RyokoAI. Ryokoai/sharegpt52k. 2023. URL https://huggingface.co/datasets/RyokoAI/ShareGPT52K.
[19]
Pengzhen Ren, Yun Xiao, Xiaojun Chang, Po-Yao Huang, Zhihui Li, Brij B. Gupta, Xiaojiang Chen, and Xin Wang. A survey of deep active learning. ACM Comput. Surv., 54 (9): 180:1–180:40, 2022. . URL https://doi.org/10.1145/3472291.
[20]
Erdem Biyik, Nicolas Huynh, Mykel J. Kochenderfer, and Dorsa Sadigh. Active preference-based gaussian process regression for reward learning. In Marc Toussaint, Antonio Bicchi, and Tucker Hermans (eds.), Robotics: Science and Systems XVI, Virtual Event / Corvalis, Oregon, USA, July 12-16, 2020, 2020. . URL https://doi.org/10.15607/RSS.2020.XVI.041.
[21]
David Lindner, Matteo Turchetta, Sebastian Tschiatschek, Kamil Ciosek, and Andreas Krause. Information directed reward learning for reinforcement learning. In Marc’Aurelio Ranzato, Alina Beygelzimer, Yann N. Dauphin, Percy Liang, and Jennifer Wortman Vaughan (eds.), Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, December 6-14, 2021, virtual, pp. 3850–3862, 2021. URL https://proceedings.neurips.cc/paper/2021/hash/1fa6269f58898f0e809575c9a48747ef-Abstract.html.
[22]
Kimin Lee, Laura M. Smith, and Pieter Abbeel. PEBBLE: Feedback-efficient interactive reinforcement learning via relabeling experience and unsupervised pre-training. In Marina Meila and Tong Zhang (eds.), Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event, volume 139 of Proceedings of Machine Learning Research, pp. 6152–6163. PMLR, 2021. URL http://proceedings.mlr.press/v139/lee21i.html.
[23]
Jongjin Park, Younggyo Seo, Jinwoo Shin, Honglak Lee, Pieter Abbeel, and Kimin Lee. SURF: Semi-supervised reward learning with data augmentation for feedback-efficient preference-based reinforcement learning. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022. URL https://openreview.net/forum?id=TfhfZLQ2EJO.
[24]
Minyoung Hwang, Gunmin Lee, Hogun Kee, Chan Woo Kim, Kyungjae Lee, and Songhwai Oh. Sequential preference ranking for efficient reinforcement learning from human feedback. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/99766cda865be123d55a1d9666c7b9fc-Abstract-Conference.html.
[25]
Peter Barnett, Rachel Freedman, Justin Svegliato, and Stuart Russell. Active reward learning from multiple teachers. In Gabriel Pedroza, Xiaowei Huang, Xin Cynthia Chen, Andreas Theodorou, José Hernández-Orallo, Mauricio Castillo-Effen, Richard Mallah, and John A. McDermid (eds.), Proceedings of the Workshop on Artificial Intelligence Safety 2023 (SafeAI 2023) co-located with the Thirty-Seventh AAAI Conference on Artificial Intelligence (AAAI 2023), Washington DC, USA, February 13-14, 2023, volume 3381 of CEUR Workshop Proceedings. CEUR-WS.org, 2023. URL https://ceur-ws.org/Vol-3381/48.pdf.
[26]
Yannick Metz, David Lindner, Raphaël Baur, Daniel A. Keim, and Mennatallah El-Assady. Rlhf-blender: A configurable interactive interface for learning from diverse human feedback. ArXiv preprint, abs/2308.04332, 2023. URL https://arxiv.org/abs/2308.04332.
[27]
Yifu Yuan, Jianye Hao, Yi Ma, Zibin Dong, Hebin Liang, Jinyi Liu, Zhixin Feng, Kai Zhao, and Yan Zheng. Uni-rlhf: Universal platform and benchmark suite for reinforcement learning with diverse human feedback. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=WesY0H9ghM.
[28]
Amelia Glaese, Nat McAleese, Maja Trebacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin J. Chadwick, Phoebe Thacker, Lucy Campbell-Gillingham, Jonathan Uesato, Po-Sen Huang, Ramona Comanescu, Fan Yang, Abigail See, Sumanth Dathathri, Rory Greig, Charlie Chen, Doug Fritz, Jaume Sanchez Elias, Richard Green, Sona Mokrá, Nicholas Fernando, Boxi Wu, Rachel Foley, Susannah Young, Iason Gabriel, William Isaac, John Mellor, Demis Hassabis, Koray Kavukcuoglu, Lisa Anne Hendricks, and Geoffrey Irving. Improving alignment of dialogue agents via targeted human judgements. ArXiv preprint, abs/2209.14375, 2022. URL https://arxiv.org/abs/2209.14375.
[29]
Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, and Hannaneh Hajishirzi. Fine-grained human feedback gives better rewards for language model training. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/b8c90b65739ae8417e61eadb521f63d5-Abstract-Conference.html.
[30]
Erdem Biyik and Dorsa Sadigh. Batch active preference-based learning of reward functions. In 2nd Annual Conference on Robot Learning, CoRL 2018, Zürich, Switzerland, 29-31 October 2018, Proceedings, volume 87 of Proceedings of Machine Learning Research, pp. 519–528. PMLR, 2018. URL http://proceedings.mlr.press/v87/biyik18a.html.
[31]
Erdem Biyik, Nima Anari, and Dorsa Sadigh. Batch active learning of reward functions from human preferences. ACM Trans. Hum. Robot Interact., 13 (2): 24:1–24:27, 2024. . URL https://doi.org/10.1145/3649885.
[32]
Hanze Dong, Wei Xiong, Bo Pang, Haoxiang Wang, Han Zhao, Yingbo Zhou, Nan Jiang, Doyen Sahoo, Caiming Xiong, and Tong Zhang. RLHF workflow: From reward modeling to online RLHF. ArXiv preprint, abs/2405.07863, 2024. URL https://arxiv.org/abs/2405.07863.
[33]
Fabrizio Gilardi, Meysam Alizadeh, and Maël Kubli. Chatgpt outperforms crowd-workers for text-annotation tasks. ArXiv preprint, abs/2303.15056, 2023. URL https://arxiv.org/abs/2303.15056.
[34]
Haitao Li, Qian Dong, Junjie Chen, Huixue Su, Yujia Zhou, Qingyao Ai, Ziyi Ye, and Yiqun Liu. Llms-as-judges: A comprehensive survey on llm-based evaluation methods. ArXiv preprint, abs/2412.05579, 2024. URL https://arxiv.org/abs/2412.05579.
[35]
Harrison Lee, Samrat Phatale, Hassan Mansoor, Thomas Mesnard, Johan Ferret, Kellie Lu, Colton Bishop, Ethan Hall, Victor Carbune, Abhinav Rastogi, and Sushant Prakash. RLAIF vs. RLHF: Scaling reinforcement learning from human feedback with AI feedback. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=uydQ2W41KO.
[36]
David Silver, Thomas Hubert, Julian Schrittwieser, Ioannis Antonoglou, Matthew Lai, Arthur Guez, Marc Lanctot, Laurent Sifre, Dharshan Kumaran, Thore Graepel, Timothy P. Lillicrap, Karen Simonyan, and Demis Hassabis. Mastering chess and shogi by self-play with a general reinforcement learning algorithm. ArXiv preprint, abs/1712.01815, 2017. URL https://arxiv.org/abs/1712.01815.
[37]
Oriol Vinyals, Igor Babuschkin, Wojciech M. Czarnecki, Michaël Mathieu, Andrew Dudzik, Junyoung Chung, David H. Choi, Richard Powell, Timo Ewalds, Petko Georgiev, Junhyuk Oh, Dan Horgan, Manuel Kroiss, Ivo Danihelka, Aja Huang, Laurent Sifre, Trevor Cai, John P. Agapiou, Max Jaderberg, Alexander Sasha Vezhnevets, Rémi Leblond, Tobias Pohlen, Valentin Dalibard, David Budden, Yury Sulsky, James Molloy, Tom Le Paine, Çaglar Gülçehre, Ziyu Wang, Tobias Pfaff, Yuhuai Wu, Roman Ring, Dani Yogatama, Dario Wünsch, Katrina McKinney, Oliver Smith, Tom Schaul, Timothy P. Lillicrap, Koray Kavukcuoglu, Demis Hassabis, Chris Apps, and David Silver. Grandmaster level in starcraft II using multi-agent reinforcement learning. Nat., 575 (7782): 350–354, 2019. . URL https://doi.org/10.1038/s41586-019-1724-z.
[38]
Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeffrey Wu. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=ghNRg2mEgN.
[39]
Yann Dubois, Chen Xuechen Li, Rohan Taori, Tianyi Zhang, Ishaan Gulrajani, Jimmy Ba, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. Alpacafarm: A simulation framework for methods that learn from human feedback. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/5fc47800ee5b30b8777fdd30abcaaf3b-Abstract-Conference.html.
[40]
Sungdong Kim, Sanghwan Bae, Jamin Shin, Soyoung Kang, Donghyun Kwak, Kang Yoo, and Minjoon Seo. Aligning large language models through synthetic feedback. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 13677–13700, Singapore, 2023. Association for Computational Linguistics. . URL https://aclanthology.org/2023.emnlp-main.844.
[41]
Ganqu Cui, Lifan Yuan, Ning Ding, Guanming Yao, Bingxiang He, Wei Zhu, Yuan Ni, Guotong Xie, Ruobing Xie, Yankai Lin, Zhiyuan Liu, and Maosong Sun. UltraFeedback: Boosting language models with scaled AI feedback. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=BOorDpKHiJ.
[42]
Min Li. Interpreting language model preferences through the lens of decision trees, 2025. URL https://rlhflow.github.io/posts/2025-01-22-decision-tree-reward-model/.
[43]
Zhiqing Sun, Yikang Shen, Hongxin Zhang, Qinhong Zhou, Zhenfang Chen, David Daniel Cox, Yiming Yang, and Chuang Gan. SALMON: Self-alignment with instructable reward models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=xJbsmB8UMx.
[44]
Zihuiwen Ye, Fraser Greenlee-Scott, Max Bartolo, Phil Blunsom, Jon Ander Campos, and Matthias Gallé. Improving reward models with synthetic critiques. ArXiv preprint, abs/2405.20850, 2024. URL https://arxiv.org/abs/2405.20850.
[45]
Yue Yu, Zhengxing Chen, Aston Zhang, Liang Tan, Chenguang Zhu, Richard Yuanzhe Pang, Yundi Qian, Xuewei Wang, Suchin Gururangan, Chao Zhang, Melanie Kambadur, Dhruv Mahajan, and Rui Hou. Self-generated critiques boost reward modeling for language models. ArXiv preprint, abs/2411.16646, 2024. URL https://arxiv.org/abs/2411.16646.
[46]
Shitong Duan, Xiaoyuan Yi, Peng Zhang, Yan Liu, Zheng Liu, Tun Lu, Xing Xie, and Ning Gu. Negating negatives: Alignment with human negative samples via distributional dispreference optimization. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, November 12-16, 2024, pp. 1012–1042. Association for Computational Linguistics, 2024. URL https://aclanthology.org/2024.findings-emnlp.56.
[47]
Binghai Wang, Rui Zheng, Lu Chen, Yan Liu, Shihan Dou, Caishuang Huang, Wei Shen, Senjie Jin, Enyu Zhou, Chenyu Shi, Songyang Gao, Nuo Xu, Yuhao Zhou, Xiaoran Fan, Zhiheng Xi, Jun Zhao, Xiao Wang, Tao Ji, Hang Yan, Lixing Shen, Zhan Chen, Tao Gui, Qi Zhang, Xipeng Qiu, Xuanjing Huang, Zuxuan Wu, and Yu-Gang Jiang. Secrets of rlhf in large language models part ii: Reward modeling, 2024. URL https://arxiv.org/abs/2401.06080.
[48]
Richard S Sutton. Reinforcement learning: An introduction. A Bradford Book, 2018.
[49]
Andrew Y. Ng and Stuart J. Russell. Algorithms for inverse reinforcement learning. In Pat Langley (ed.), Proceedings of the Seventeenth International Conference on Machine Learning (ICML 2000), Stanford University, Stanford, CA, USA, June 29 - July 2, 2000, pp. 663–670. Morgan Kaufmann, 2000.
[50]
Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, and Yahui Zhou. Skywork-reward: Bag of tricks for reward modeling in llms. ArXiv preprint, abs/2410.18451, 2024. URL https://arxiv.org/abs/2410.18451.
[51]
Nathan Lambert, Valentina Pyatkin, Jacob Morrison, LJ Miranda, Bill Yuchen Lin, Khyathi Raghavi Chandu, Nouha Dziri, Sachin Kumar, Tom Zick, Yejin Choi, Noah A. Smith, and Hannaneh Hajishirzi. Rewardbench: Evaluating reward models for language modeling. ArXiv preprint, abs/2403.13787, 2024. URL https://arxiv.org/abs/2403.13787.
[52]
Zheng Cai, Maosong Cao, Haojiong Chen, Kai Chen, Keyu Chen, Xin Chen, Xun Chen, Zehui Chen, Zhi Chen, Pei Chu, Xiaoyi Dong, Haodong Duan, Qi Fan, Zhaoye Fei, Yang Gao, Jiaye Ge, Chenya Gu, Yuzhe Gu, Tao Gui, Aijia Guo, Qipeng Guo, Conghui He, Yingfan Hu, Ting Huang, Tao Jiang, Penglong Jiao, Zhenjiang Jin, Zhikai Lei, Jiaxing Li, Jingwen Li, Linyang Li, Shuaibin Li, Wei Li, Yining Li, Hongwei Liu, Jiangning Liu, Jiawei Hong, Kaiwen Liu, Kuikun Liu, Xiaoran Liu, Chengqi Lv, Haijun Lv, Kai Lv, Li Ma, Runyuan Ma, Zerun Ma, Wenchang Ning, Linke Ouyang, Jiantao Qiu, Yuan Qu, Fukai Shang, Yunfan Shao, Demin Song, Zifan Song, Zhihao Sui, Peng Sun, Yu Sun, Huanze Tang, Bin Wang, Guoteng Wang, Jiaqi Wang, Jiayu Wang, Rui Wang, Yudong Wang, Ziyi Wang, Xingjian Wei, Qizhen Weng, Fan Wu, Yingtong Xiong, Xiaomeng Zhao, and et al. Internlm2 technical report. ArXiv preprint, abs/2403.17297, 2024. URL https://arxiv.org/abs/2403.17297.
[53]
Lifan Yuan, Ganqu Cui, Hanbin Wang, Ning Ding, Xingyao Wang, Jia Deng, Boji Shan, Huimin Chen, Ruobing Xie, Yankai Lin, Zhenghao Liu, Bowen Zhou, Hao Peng, Zhiyuan Liu, and Maosong Sun. Advancing LLM reasoning generalists with preference trees. ArXiv preprint, abs/2404.02078, 2024. URL https://arxiv.org/abs/2404.02078.
[54]
Ralph Allan Bradley and Milton E. Terry. Rank analysis of incomplete block designs: I. the method of paired comparisons. Biometrika, 39 (3/4): 324–345, 1952. ISSN 00063444, 14643510. URL http://www.jstor.org/stable/2334029.
[55]
Rui Yang, Ruomeng Ding, Yong Lin, Huan Zhang, and Tong Zhang. Regularizing hidden states enables learning generalizable reward model for llms. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024. URL http://papers.nips.cc/paper_files/paper/2024/hash/71f7154547c748c8041505521ca433ab-Abstract-Conference.html.
[56]
Dongfu Jiang, Xiang Ren, and Bill Yuchen Lin. LLM-Blender: Ensembling large language models with pairwise ranking and generative fusion. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 14165–14178, Toronto, Canada, 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.acl-long.792.
[57]
Genta Indra Winata, David Anugraha, Lucky Susanto, Garry Kuwanto, and Derry Tanti Wijaya. Metametrics: Calibrating metrics for generation tasks using human preferences. ArXiv preprint, abs/2410.02381, 2024. URL https://arxiv.org/abs/2410.02381.
[58]
Bo Adler, Niket Agarwal, Ashwath Aithal, Dong H. Anh, Pallab Bhattacharya, Annika Brundyn, Jared Casper, Bryan Catanzaro, Sharon Clay, Jonathan M. Cohen, Sirshak Das, Ayush Dattagupta, Olivier Delalleau, Leon Derczynski, Yi Dong, Daniel Egert, Ellie Evans, Aleksander Ficek, Denys Fridman, Shaona Ghosh, Boris Ginsburg, Igor Gitman, Tomasz Grzegorzek, Robert Hero, Jining Huang, Vibhu Jawa, Joseph Jennings, Aastha Jhunjhunwala, John Kamalu, Sadaf Khan, Oleksii Kuchaiev, Patrick LeGresley, Hui Li, Jiwei Liu, Zihan Liu, Eileen Long, Ameya Sunil Mahabaleshwarkar, Somshubra Majumdar, James Maki, Miguel Martinez, Maer Rodrigues de Melo, Ivan Moshkov, Deepak Narayanan, Sean Narenthiran, Jesus Navarro, Phong Nguyen, Osvald Nitski, Vahid Noroozi, Guruprasad Nutheti, Christopher Parisien, Jupinder Parmar, Mostofa Patwary, Krzysztof Pawelec, Wei Ping, Shrimai Prabhumoye, Rajarshi Roy, Trisha Saar, Vasanth Rao Naik Sabavat, Sanjeev Satheesh, Jane Polak Scowcroft, Jason Sewall, Pavel Shamis, Gerald Shen, Mohammad Shoeybi, Dave Sizer, Misha Smelyanskiy, Felipe Soares, Makesh Narsimhan Sreedhar, Dan Su, Sandeep Subramanian, Shengyang Sun, Shubham Toshniwal, Hao Wang, Zhilin Wang, Jiaxuan You, Jiaqi Zeng, Jimmy Zhang, Jing Zhang, Vivienne Zhang, Yian Zhang, and Chen Zhu. Nemotron-4 340b technical report. ArXiv preprint, abs/2406.11704, 2024. URL https://arxiv.org/abs/2406.11704.
[59]
Haoxiang Wang, Wei Xiong, Tengyang Xie, Han Zhao, and Tong Zhang. Interpretable preferences via multi-objective reward modeling and mixture-of-experts, 2024. URL https://arxiv.org/abs/2406.12845.
[60]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html.
[61]
Junlong Li, Shichao Sun, Weizhe Yuan, Run-Ze Fan, Hai Zhao, and Pengfei Liu. Generative judge for evaluating alignment. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=gtkFw6sZGS.
[62]
Maosong Cao, Alexander Lam, Haodong Duan, Hongwei Liu, Songyang Zhang, and Kai Chen. Compassjudger-1: All-in-one judge model helps model evaluation and evolution. ArXiv preprint, abs/2410.16256, 2024. URL https://arxiv.org/abs/2410.16256.
[63]
Ziyi Ye, Xiangsheng Li, Qiuchi Li, Qingyao Ai, Yujia Zhou, Wei Shen, Dong Yan, and Yiqun Liu. Beyond scalar reward model: Learning generative judge from preference data. ArXiv preprint, abs/2410.03742, 2024. URL https://arxiv.org/abs/2410.03742.
[64]
Nat McAleese, Rai Michael Pokorny, Juan Felipe Ceron Uribe, Evgenia Nitishinskaya, Maja Trebacz, and Jan Leike. LLM critics help catch LLM bugs. ArXiv preprint, abs/2407.00215, 2024. URL https://arxiv.org/abs/2407.00215.
[65]
Bofei Gao, Zefan Cai, Runxin Xu, Peiyi Wang, Ce Zheng, Runji Lin, Keming Lu, Junyang Lin, Chang Zhou, Wen Xiao, Junjie Hu, Tianyu Liu, and Baobao Chang. LLM critics help catch bugs in mathematics: Towards a better mathematical verifier with natural language feedback. ArXiv preprint, abs/2406.14024, 2024. URL https://arxiv.org/abs/2406.14024.
[66]
Dakota Mahan, Duy Phung, Rafael Rafailov, Chase Blagden, Nathan Lile, Louis Castricato, Jan-Philipp Fränken, Chelsea Finn, and Alon Albalak. Generative reward models. ArXiv preprint, abs/2410.12832, 2024. URL https://arxiv.org/abs/2410.12832.
[67]
Lunjun Zhang, Arian Hosseini, Hritik Bansal, Mehran Kazemi, Aviral Kumar, and Rishabh Agarwal. Generative verifiers: Reward modeling as next-token prediction. ArXiv preprint, abs/2408.15240, 2024. URL https://arxiv.org/abs/2408.15240.
[68]
Zhipeng Chen, Kun Zhou, Xin Zhao, Junchen Wan, Fuzheng Zhang, Di Zhang, and Ji-Rong Wen. Improving large language models via fine-grained reinforcement learning with minimum editing constraint. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pp. 5694–5711. Association for Computational Linguistics, 2024. . URL https://doi.org/10.18653/v1/2024.findings-acl.338.
[69]
Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13484–13508, Toronto, Canada, 2023. Association for Computational Linguistics. . URL https://aclanthology.org/2023.acl-long.754.
[70]
Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=0NphYCmgua.
[71]
Polina Tsvilodub, Fausto Carcassi, and Michael Franke. Towards neuro-symbolic models of language cognition: LLMs as proposers and evaluators. 2024.
[72]
Tianhao Wu, Weizhe Yuan, Olga Golovneva, Jing Xu, Yuandong Tian, Jiantao Jiao, Jason Weston, and Sainbayar Sukhbaatar. Meta-rewarding language models: Self-improving alignment with llm-as-a-meta-judge. ArXiv preprint, abs/2407.19594, 2024. URL https://arxiv.org/abs/2407.19594.
[73]
Takeshi Kojima, Shixiang Shane Gu, Machel Reid, Yutaka Matsuo, and Yusuke Iwasawa. Large language models are zero-shot reasoners. In Sanmi Koyejo, S. Mohamed, A. Agarwal, Danielle Belgrave, K. Cho, and A. Oh (eds.), Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/8bb0d291acd4acf06ef112099c16f326-Abstract-Conference.html.
[74]
Patrick S. H. Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Hugo Larochelle, Marc’Aurelio Ranzato, Raia Hadsell, Maria-Florina Balcan, and Hsuan-Tien Lin (eds.), Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/6b493230205f780e1bc26945df7481e5-Abstract.html.
[75]
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D. Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/a85b405ed65c6477a4fe8302b5e06ce7-Abstract-Conference.html.
[76]
Yao Zhao, Rishabh Joshi, Tianqi Liu, Misha Khalman, Mohammad Saleh, and Peter J. Liu. Slic-hf: Sequence likelihood calibration with human feedback. ArXiv preprint, abs/2305.10425, 2023. URL https://arxiv.org/abs/2305.10425.
[77]
Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. From \(r\) to \(Q^{*}\): Your language model is secretly a Q-function. ArXiv preprint, abs/2404.12358, 2024. URL https://arxiv.org/abs/2404.12358.
[78]
Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. \(\beta\)-DPO: Direct preference optimization with dynamic \(\beta\). In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024. URL http://papers.nips.cc/paper_files/paper/2024/hash/ea888178abdb6fc233226d12321d754f-Abstract-Conference.html.
[79]
Dahyun Kim, Yungi Kim, Wonho Song, Hyeonwoo Kim, Yunsu Kim, Sanghoon Kim, and Chanjun Park. sDPO: Don’t use your data all at once. ArXiv preprint, abs/2403.19270, 2024. URL https://arxiv.org/abs/2403.19270.
[80]
Tianqi Liu, Yao Zhao, Rishabh Joshi, Misha Khalman, Mohammad Saleh, Peter J. Liu, and Jialu Liu. Statistical rejection sampling improves preference optimization. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=xbjSwwrQOe.
[81]
Tetsuro Morimura, Mitsuki Sakamoto, Yuu Jinnai, Kenshi Abe, and Kaito Ariu. Filtered direct preference optimization. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pp. 22729–22770. Association for Computational Linguistics, 2024. URL https://aclanthology.org/2024.emnlp-main.1266.
[82]
Qi Gou and Cam-Tu Nguyen. Mixed preference optimization: Reinforcement learning with data selection and better reference model. ArXiv preprint, abs/2403.19443, 2024. URL https://arxiv.org/abs/2403.19443.
[83]
Aiwei Liu, Haoping Bai, Zhiyun Lu, Yanchao Sun, Xiang Kong, Xiaoming Simon Wang, Jiulong Shan, Albin Madappally Jose, Xiaojiang Liu, Lijie Wen, Philip S. Yu, and Meng Cao. TIS-DPO: Token-level importance sampling for direct preference optimization with estimated weights. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=oF6e2WwxX0.
[84]
Tianqi Liu, Zhen Qin, Junru Wu, Jiaming Shen, Misha Khalman, Rishabh Joshi, Yao Zhao, Mohammad Saleh, Simon Baumgartner, Jialu Liu, Peter J. Liu, and Xuanhui Wang. Lipo: Listwise preference optimization through learning-to-rank. ArXiv preprint, abs/2402.01878, 2024. URL https://arxiv.org/abs/2402.01878.
[85]
Huayu Chen, Guande He, Lifan Yuan, Ganqu Cui, Hang Su, and Jun Zhu. Noise contrastive alignment of language models with explicit rewards. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024. URL http://papers.nips.cc/paper_files/paper/2024/hash/d5a58d198afa370a3dff0e1ca4fe1802-Abstract-Conference.html.
[86]
Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Ávila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, Aliaksei Severyn, Jonathan Mallinson, Lior Shani, Gil Shamir, Rishabh Joshi, Tianqi Liu, Rémi Munos, and Bilal Piot. Offline regularised reinforcement learning for large language models alignment. ArXiv preprint, abs/2405.19107, 2024. URL https://arxiv.org/abs/2405.19107.
[87]
Junkang Wu, Yuexiang Xie, Zhengyi Yang, Jiancan Wu, Jiawei Chen, Jinyang Gao, Bolin Ding, Xiang Wang, and Xiangnan He. Towards robust alignment of language models: Distributionally robustifying direct preference optimization. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=CbfsKHiWEn.
[88]
Sayak Ray Chowdhury, Anush Kini, and Nagarajan Natarajan. Provably robust DPO: aligning language models with noisy feedback. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=yhpDKSw7yA.
[89]
Zaiyan Xu, Sushil Vemuri, Kishan Panaganti, Dileep Kalathil, Rahul Jain, and Deepak Ramachandran. Distributionally robust direct preference optimization. ArXiv preprint, abs/2502.01930, 2025. URL https://arxiv.org/abs/2502.01930.
[90]
Yongcheng Zeng, Guoqing Liu, Weiyu Ma, Ning Yang, Haifeng Zhang, and Jun Wang. Token-level direct preference optimization. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=1RZKuvqYCR.
[91]
Zicheng Lin, Tian Liang, Jiahao Xu, Xing Wang, Ruilin Luo, Chufan Shi, Siheng Li, Yujiu Yang, and Zhaopeng Tu. Critical tokens matter: Token-level contrastive estimation enhances llm’s reasoning capability. ArXiv preprint, abs/2411.19943, 2024. URL https://arxiv.org/abs/2411.19943.
[92]
Jiwoo Hong, Noah Lee, and James Thorne. ORPO: Monolithic preference optimization without reference model. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pp. 11170–11189. Association for Computational Linguistics, 2024. URL https://aclanthology.org/2024.emnlp-main.626.
[93]
Haoran Xu, Amr Sharaf, Yunmo Chen, Weiting Tan, Lingfeng Shen, Benjamin Van Durme, Kenton Murray, and Young Jin Kim. Contrastive preference optimization: Pushing the boundaries of LLM performance in machine translation. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=51iwkioZpn.
[94]
Yu Meng, Mengzhou Xia, and Danqi Chen. Simpo: Simple preference optimization with a reference-free reward. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024. URL http://papers.nips.cc/paper_files/paper/2024/hash/e099c1c9699814af0be873a175361713-Abstract-Conference.html.
[95]
Corby Rosset, Ching-An Cheng, Arindam Mitra, Michael Santacroce, Ahmed Awadallah, and Tengyang Xie. Direct nash optimization: Teaching language models to self-improve with general preferences. ArXiv preprint, abs/2404.03715, 2024. URL https://arxiv.org/abs/2404.03715.
[96]
Gokul Swamy, Christoph Dann, Rahul Kidambi, Steven Wu, and Alekh Agarwal. A minimaximalist approach to reinforcement learning from human feedback. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=5kVgd2MwMY.
[97]
Yue Wu, Zhiqing Sun, Huizhuo Yuan, Kaixuan Ji, Yiming Yang, and Quanquan Gu. Self-play preference optimization for language model alignment. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=a3PmRgAB5T.
[98]
Banghua Zhu, Evan Frick, Tianhao Wu, Hanlin Zhu, Karthik Ganesan, Wei-Lin Chiang, Jian Zhang, and Jiantao Jiao. Starling-7b: Improving helpfulness and harmlessness with RLAIF. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=GqDntYTTbk.
[99]
Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, and Weizhu Chen. Making language models better reasoners with step-aware verifier. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5315–5333, Toronto, Canada, 2023. Association for Computational Linguistics. . URL https://aclanthology.org/2023.acl-long.291.
[100]
Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems. ArXiv preprint, abs/2110.14168, 2021. URL https://arxiv.org/abs/2110.14168.
[101]
Antonia Creswell, Murray Shanahan, and Irina Higgins. Selection-inference: Exploiting large language models for interpretable logical reasoning. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=3Pf3Wg6o-A4.
[102]
Jonathan Uesato, Nate Kushman, Ramana Kumar, H. Francis Song, Noah Y. Siegel, Lisa Wang, Antonia Creswell, Geoffrey Irving, and Irina Higgins. Solving math word problems with process- and outcome-based feedback. ArXiv preprint, abs/2211.14275, 2022. URL https://arxiv.org/abs/2211.14275.
[103]
Tom Everitt, Victoria Krakovna, Laurent Orseau, and Shane Legg. Reinforcement learning with a corrupted reward channel. In Carles Sierra (ed.), Proceedings of the Twenty-Sixth International Joint Conference on Artificial Intelligence, IJCAI 2017, Melbourne, Australia, August 19-25, 2017, pp. 4705–4713. ijcai.org, 2017. . URL https://doi.org/10.24963/ijcai.2017/656.
[104]
Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=v8L0pN6EOi.
[105]
Peiyi Wang, Lei Li, Zhihong Shao, Runxin Xu, Damai Dai, Yifei Li, Deli Chen, Yu Wu, and Zhifang Sui. Math-shepherd: Verify and reinforce llms step-by-step without human annotations. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pp. 9426–9439. Association for Computational Linguistics, 2024. . URL https://doi.org/10.18653/v1/2024.acl-long.510.
[106]
Zihan Wang, Yunxuan Li, Yuexin Wu, Liangchen Luo, Le Hou, Hongkun Yu, and Jingbo Shang. Multi-step problem solving through a verifier: An empirical analysis on model-induced process supervision. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, November 12-16, 2024, pp. 7309–7319. Association for Computational Linguistics, 2024. URL https://aclanthology.org/2024.findings-emnlp.429.
[107]
Volodymyr Mnih, Koray Kavukcuoglu, David Silver, Alex Graves, Ioannis Antonoglou, Daan Wierstra, and Martin A. Riedmiller. Playing atari with deep reinforcement learning. CoRR, abs/1312.5602, 2013. URL http://arxiv.org/abs/1312.5602.
[108]
Liangchen Luo, Yinxiao Liu, Rosanne Liu, Samrat Phatale, Harsh Lara, Yunxuan Li, Lei Shu, Yun Zhu, Lei Meng, Jiao Sun, and Abhinav Rastogi. Improve mathematical reasoning in language models by automated process supervision. ArXiv preprint, abs/2406.06592, 2024. URL https://arxiv.org/abs/2406.06592.
[109]
Muhammad Khalifa, Lajanugen Logeswaran, Moontae Lee, Honglak Lee, and Lu Wang. GRACE: Discriminator-guided chain-of-thought reasoning. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 15299–15328, Singapore, 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.findings-emnlp.1022.
[110]
Wendi Li and Yixuan Li. Process reward model with q-value rankings. ArXiv preprint, abs/2410.11287, 2024. URL https://arxiv.org/abs/2410.11287.
[111]
Jianqiao Lu, Zhiyang Dou, Hongru Wang, Zeyu Cao, Jianbo Dai, Yunlong Feng, and Zhijiang Guo. Autopsv: Automated process-supervised verifier. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024. URL http://papers.nips.cc/paper_files/paper/2024/hash/9246aa822579d9b29a140ecdac36ad60-Abstract-Conference.html.
[112]
Amrith Setlur, Chirag Nagpal, Adam Fisch, Xinyang Geng, Jacob Eisenstein, Rishabh Agarwal, Alekh Agarwal, Jonathan Berant, and Aviral Kumar. Rewarding progress: Scaling automated process verifiers for LLM reasoning. ArXiv preprint, abs/2410.08146, 2024. URL https://arxiv.org/abs/2410.08146.
[113]
Lifan Yuan, Wendi Li, Huayu Chen, Ganqu Cui, Ning Ding, Kai Zhang, Bowen Zhou, Zhiyuan Liu, and Hao Peng. Free process rewards without process labels. ArXiv preprint, abs/2412.01981, 2024. URL https://arxiv.org/abs/2412.01981.
[114]
Ganqu Cui, Lifan Yuan, Zefan Wang, Hanbin Wang, Wendi Li, Bingxiang He, Yuchen Fan, Tianyu Yu, Qixin Xu, Weize Chen, Jiarui Yuan, Huayu Chen, Kaiyan Zhang, Xingtai Lv, Shuo Wang, Yuan Yao, Xu Han, Hao Peng, Yu Cheng, Zhiyuan Liu, Maosong Sun, Bowen Zhou, and Ning Ding. Process reinforcement through implicit rewards. ArXiv preprint, abs/2502.01456, 2025. URL https://arxiv.org/abs/2502.01456.
[115]
Qianli Ma, Haotian Zhou, Tingkai Liu, Jianbo Yuan, Pengfei Liu, Yang You, and Hongxia Yang. Let’s reward step by step: Step-level reward model as the navigators for reasoning. ArXiv preprint, abs/2310.10080, 2023. URL https://arxiv.org/abs/2310.10080.
[116]
Mingyang Song, Zhaochen Su, Xiaoye Qu, Jiawei Zhou, and Yu Cheng. Prmbench: A fine-grained and challenging benchmark for process-level reward models. ArXiv preprint, abs/2501.03124, 2025. URL https://arxiv.org/abs/2501.03124.
[117]
Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. ArXiv preprint, abs/2501.07301, 2025. URL https://arxiv.org/abs/2501.07301.
[118]
Teng Wang, Zhangyi Jiang, Zhenqi He, Wenhan Yang, Yanan Zheng, Zeyu Li, Zifan He, Shenyang Tong, and Hailei Gong. Towards hierarchical multi-step reward models for enhanced reasoning in large language models, 2025. URL https://arxiv.org/abs/2503.13551.
[119]
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling llm test-time compute optimally can be more effective than scaling model parameters. ArXiv preprint, abs/2408.03314, 2024. URL https://arxiv.org/abs/2408.03314.
[120]
DeepSeek-AI, Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, Erhang Li, Fangyun Lin, Fucong Dai, Fuli Luo, Guangbo Hao, Guanting Chen, Guowei Li, H. Zhang, Han Bao, Hanwei Xu, Haocheng Wang, Honghui Ding, Huajian Xin, Huazuo Gao, Hui Qu, Hui Li, Jianzhong Guo, Jiashi Li, Jiawei Wang, Jingchang Chen, Jingyang Yuan, Junjie Qiu, Junlong Li, J. L. Cai, Jiaqi Ni, Jian Liang, Jin Chen, Kai Dong, Kai Hu, Kaige Gao, Kang Guan, Kexin Huang, Kuai Yu, Lean Wang, Lecong Zhang, Liang Zhao, Litong Wang, Liyue Zhang, Lei Xu, Leyi Xia, Mingchuan Zhang, Minghua Zhang, Minghui Tang, Meng Li, Miaojun Wang, Mingming Li, Ning Tian, Panpan Huang, Peng Zhang, Qiancheng Wang, Qinyu Chen, Qiushi Du, Ruiqi Ge, Ruisong Zhang, Ruizhe Pan, Runji Wang, R. J. Chen, R. L. Jin, Ruyi Chen, Shanghao Lu, Shangyan Zhou, Shanhuang Chen, Shengfeng Ye, Shiyu Wang, Shuiping Yu, Shunfeng Zhou, Shuting Pan, S. S. Li, Shuang Zhou, Shaoqing Wu, Shengfeng Ye, Tao Yun, Tian Pei, Tianyu Sun, T. Wang, Wangding Zeng, Wanjia Zhao, Wen Liu, Wenfeng Liang, Wenjun Gao, Wenqin Yu, Wentao Zhang, W. L. Xiao, Wei An, Xiaodong Liu, Xiaohan Wang, Xiaokang Chen, Xiaotao Nie, Xin Cheng, Xin Liu, Xin Xie, Xingchao Liu, Xinyu Yang, Xinyuan Li, Xuecheng Su, Xuheng Lin, X. Q. Li, Xiangyue Jin, Xiaojin Shen, Xiaosha Chen, Xiaowen Sun, Xiaoxiang Wang, Xinnan Song, Xinyi Zhou, Xianzu Wang, Xinxia Shan, Y. K. Li, Y. Q. Wang, Y. X. Wei, Yang Zhang, Yanhong Xu, Yao Li, Yao Zhao, Yaofeng Sun, Yaohui Wang, Yi Yu, Yichao Zhang, Yifan Shi, Yiliang Xiong, Ying He, Yishi Piao, Yisong Wang, Yixuan Tan, Yiyang Ma, Yiyuan Liu, Yongqiang Guo, Yuan Ou, Yuduan Wang, Yue Gong, Yuheng Zou, Yujia He, Yunfan Xiong, Yuxiang Luo, Yuxiang You, Yuxuan Liu, Yuyang Zhou, Y. X. Zhu, Yanhong Xu, Yanping Huang, Yaohui Li, Yi Zheng, Yuchen Zhu, Yunxian Ma, Ying Tang, Yukun Zha, Yuting Yan, Z. Z. Ren, Zehui Ren, Zhangli Sha, Zhe Fu, Zhean Xu, Zhenda Xie, Zhengyan Zhang, Zhewen Hao, Zhicheng Ma, Zhigang Yan, Zhiyu Wu, Zihui Gu, Zijia Zhu, Zijun Liu, Zilin Li, Ziwei Xie, Ziyang Song, Zizheng Pan, Zhen Huang, Zhipeng Xu, Zhongyu Zhang, and Zhen Zhang. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning, 2025. URL https://arxiv.org/abs/2501.12948.
[121]
Hanze Dong, Wei Xiong, Deepanshu Goyal, Yihan Zhang, Winnie Chow, Rui Pan, Shizhe Diao, Jipeng Zhang, Kashun Shum, and Tong Zhang. RAFT: Reward ranked finetuning for generative foundation model alignment. Trans. Mach. Learn. Res., 2023, 2023. URL https://openreview.net/forum?id=m7p5O7zblY.
[122]
Hongyi Yuan, Zheng Yuan, Chuanqi Tan, Wei Wang, Songfang Huang, and Fei Huang. RRHF: Rank responses to align language models with human feedback. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/23e6f78bdec844a9f7b6c957de2aae91-Abstract-Conference.html.
[123]
Çaglar Gülçehre, Tom Le Paine, Srivatsan Srinivasan, Ksenia Konyushkova, Lotte Weerts, Abhishek Sharma, Aditya Siddhant, Alex Ahern, Miaosen Wang, Chenjie Gu, Wolfgang Macherey, Arnaud Doucet, Orhan Firat, and Nando de Freitas. Reinforced self-training (rest) for language modeling. ArXiv preprint, abs/2308.08998, 2023. URL https://arxiv.org/abs/2308.08998.
[124]
Richard Yuanzhe Pang, Weizhe Yuan, He He, Kyunghyun Cho, Sainbayar Sukhbaatar, and Jason Weston. Iterative reasoning preference optimization. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024. URL http://papers.nips.cc/paper_files/paper/2024/hash/d37c9ad425fe5b65304d500c6edcba00-Abstract-Conference.html.
[125]
Benjamin Pikus, Will LeVine, Tony Chen, and Sean Hendryx. A baseline analysis of reward models’ ability to accurately analyze foundation models under distribution shift. ArXiv preprint, abs/2311.14743, 2023. URL https://arxiv.org/abs/2311.14743.
[126]
Lichang Chen, Chen Zhu, Jiuhai Chen, Davit Soselia, Tianyi Zhou, Tom Goldstein, Heng Huang, Mohammad Shoeybi, and Bryan Catanzaro. ODIN: Disentangled reward mitigates hacking in RLHF. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=zcIV8OQFVF.
[127]
Hang Zhou, Chenglong Wang, Yimin Hu, Tong Xiao, Chunliang Zhang, and Jingbo Zhu. Prior constraints-based reward model training for aligning large language models. In Maosong Sun, Jiye Liang, Xianpei Han, Zhiyuan Liu, Yulan He, Gaoqi Rao, Yubo Chen, and Zhiliang Tian (eds.), Chinese Computational Linguistics - 23rd China National Conference, CCL 2024, Taiyuan, China, July 25-28, 2024, Proceedings, volume 14761 of Lecture Notes in Computer Science, pp. 555–570. Springer, 2024. . URL https://doi.org/10.1007/978-981-97-8367-0_33.
[128]
Ryan Park, Rafael Rafailov, Stefano Ermon, and Chelsea Finn. Disentangling length from quality in direct preference optimization. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pp. 4998–5017. Association for Computational Linguistics, 2024. . URL https://doi.org/10.18653/v1/2024.findings-acl.297.
[129]
Chaoqi Wang, Zhuokai Zhao, Yibo Jiang, Zhaorun Chen, Chen Zhu, Yuxin Chen, Jiayi Liu, Lizhu Zhang, Xiangjun Fan, Hao Ma, and Sinong Wang. Beyond reward hacking: Causal rewards for large language model alignment. ArXiv preprint, abs/2501.09620, 2025. URL https://arxiv.org/abs/2501.09620.
[130]
Tianqi Liu, Wei Xiong, Jie Ren, Lichang Chen, Junru Wu, Rishabh Joshi, Yang Gao, Jiaming Shen, Zhen Qin, Tianhe Yu, Daniel Sohn, Anastasia Makarova, Jeremiah Zhe Liu, Yuan Liu, Bilal Piot, Abe Ittycheriah, Aviral Kumar, and Mohammad Saleh. RRM: Robust reward model training mitigates reward hacking. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=88AS5MQnmC.
[131]
Adam X Yang, Maxime Robeyns, Thomas Coste, Zhengyan Shi, Jun Wang, Haitham Bou-Ammar, and Laurence Aitchison. Bayesian reward models for llm alignment. ArXiv preprint, abs/2402.13210, 2024. URL https://arxiv.org/abs/2402.13210.
[132]
Dexun Li, Cong Zhang, Kuicai Dong, Derrick-Goh-Xin Deik, Ruiming Tang, and Yong Liu. Aligning crowd feedback via distributional preference reward modeling. ArXiv preprint, abs/2402.09764, 2024. URL https://arxiv.org/abs/2402.09764.
[133]
Yuzi Yan, Xingzhou Lou, Jialian Li, Yiping Zhang, Jian Xie, Chao Yu, Yu Wang, Dong Yan, and Yuan Shen. Reward-robust RLHF in llms. ArXiv preprint, abs/2409.15360, 2024. URL https://arxiv.org/abs/2409.15360.
[134]
Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, and Hannaneh Hajishirzi. Fine-grained human feedback gives better rewards for language model training. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/b8c90b65739ae8417e61eadb521f63d5-Abstract-Conference.html.
[135]
Alexandre Ramé, Nino Vieillard, Léonard Hussenot, Robert Dadashi, Geoffrey Cideron, Olivier Bachem, and Johan Ferret. WARM: On the benefits of weight averaged reward models. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=s7RDnNUJy6.
[136]
Shun Zhang, Zhenfang Chen, Sunli Chen, Yikang Shen, Zhiqing Sun, and Chuang Gan. Improving reinforcement learning from human feedback with efficient reward model ensemble, 2024. URL https://arxiv.org/abs/2401.16635.
[137]
Jinhao Jiang, Zhipeng Chen, Yingqian Min, Jie Chen, Xiaoxue Cheng, Jiapeng Wang, Yiru Tang, Haoxiang Sun, Jia Deng, Wayne Xin Zhao, Zheng Liu, Dong Yan, Jian Xie, Zhongyuan Wang, and Ji-Rong Wen. Technical report: Enhancing LLM reasoning with reward-guided tree search. ArXiv preprint, abs/2411.11694, 2024. URL https://arxiv.org/abs/2411.11694.
[138]
Mingqian He, Yongliang Shen, Wenqi Zhang, Zeqi Tan, and Weiming Lu. Advancing process verification for large language models via tree-based preference learning. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, EMNLP 2024, Miami, FL, USA, November 12-16, 2024, pp. 2086–2099. Association for Computational Linguistics, 2024. URL https://aclanthology.org/2024.emnlp-main.125.
[139]
Dan Zhang, Sining Zhoubian, Ziniu Hu, Yisong Yue, Yuxiao Dong, and Jie Tang. Rest-mcts*: LLM self-training via process reward guided tree search. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024. URL http://papers.nips.cc/paper_files/paper/2024/hash/76ec4dc30e9faaf0e4b6093eaa377218-Abstract-Conference.html.
[140]
Baohao Liao, Yuhui Xu, Hanze Dong, Junnan Li, Christof Monz, Silvio Savarese, Doyen Sahoo, and Caiming Xiong. Reward-guided speculative decoding for efficient LLM reasoning. ArXiv preprint, abs/2501.19324, 2025. URL https://arxiv.org/abs/2501.19324.
[141]
Amelia Glaese, Nat McAleese, Maja Trębacz, John Aslanides, Vlad Firoiu, Timo Ewalds, Maribeth Rauh, Laura Weidinger, Martin Chadwick, Phoebe Thacker, et al. Improving alignment of dialogue agents via targeted human judgements. ArXiv preprint, abs/2209.14375, 2022. URL https://arxiv.org/abs/2209.14375.
[142]
Josef Dai, Xuehai Pan, Ruiyang Sun, Jiaming Ji, Xinbo Xu, Mickel Liu, Yizhou Wang, and Yaodong Yang. Safe RLHF: safe reinforcement learning from human feedback. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=TyFrPOKYXw.
[143]
Melody Y. Guan, Manas Joglekar, Eric Wallace, Saachi Jain, Boaz Barak, Alec Helyar, Rachel Dias, Andrea Vallone, Hongyu Ren, Jason Wei, Hyung Won Chung, Sam Toyer, Johannes Heidecke, Alex Beutel, and Amelia Glaese. Deliberative alignment: Reasoning enables safer language models. ArXiv preprint, abs/2412.16339, 2024. URL https://arxiv.org/abs/2412.16339.
[144]
Hongbo Zhang, Junying Chen, Feng Jiang, Fei Yu, Zhihong Chen, Guiming Chen, Jianquan Li, Xiangbo Wu, Zhang Zhiyi, Qingying Xiao, Xiang Wan, Benyou Wang, and Haizhou Li. HuatuoGPT, towards taming language model to be a doctor. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp. 10859–10885, Singapore, 2023. Association for Computational Linguistics. URL https://aclanthology.org/2023.findings-emnlp.725.
[145]
Songhua Yang, Hanjie Zhao, Senbin Zhu, Guangyu Zhou, Hongfei Xu, Yuxiang Jia, and Hongying Zan. Zhongjing: Enhancing the chinese medical capabilities of large language model through expert feedback and real-world multi-turn dialogue. In Michael J. Wooldridge, Jennifer G. Dy, and Sriraam Natarajan (eds.), Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2014, February 20-27, 2024, Vancouver, Canada, pp. 19368–19376. AAAI Press, 2024. . URL https://doi.org/10.1609/aaai.v38i17.29907.
[146]
Hui Ma, Bo Zhang, Bo Xu, Jian Wang, Hongfei Lin, and Xiao Sun. Empathy level alignment via reinforcement learning for empathetic response generation. ArXiv preprint, abs/2408.02976, 2024. URL https://arxiv.org/abs/2408.02976.
[147]
Kai Yoshida, Masahiro Mizukami, Seiya Kawano, Canasai Kruengkrai, Hiroaki Sugiyama, and Koichiro Yoshino. Training dialogue systems by AI feedback for improving overall dialogue impression. ArXiv preprint, abs/2501.12698, 2025. URL https://arxiv.org/abs/2501.12698.
[148]
Haipeng Luo, Qingfeng Sun, Can Xu, Pu Zhao, Jianguang Lou, Chongyang Tao, Xiubo Geng, Qingwei Lin, Shifeng Chen, and Dongmei Zhang. Wizardmath: Empowering mathematical reasoning for large language models via reinforced evol-instruct. ArXiv preprint, abs/2308.09583, 2023. URL https://arxiv.org/abs/2308.09583.
[149]
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models. ArXiv preprint, abs/2402.03300, 2024. URL https://arxiv.org/abs/2402.03300.
[150]
Zhenru Zhang, Chujie Zheng, Yangzhen Wu, Beichen Zhang, Runji Lin, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. The lessons of developing process reward models in mathematical reasoning. ArXiv preprint, abs/2501.07301, 2025. URL https://arxiv.org/abs/2501.07301.
[151]
Jiachen Zhu, Congmin Zheng, Jianghao Lin, Kounianhua Du, Ying Wen, Yong Yu, Jun Wang, and Weinan Zhang. Retrieval-augmented process reward model for generalizable mathematical reasoning, 2025. URL https://arxiv.org/abs/2502.14361.
[152]
Junqiao Wang, Zeng Zhang, Yangfan He, Yuyang Song, Tianyu Shi, Yuchen Li, Hengyuan Xu, Kunyu Wu, Guangwu Qian, Qiuwu Chen, and Lewei He. Enhancing code llms with reinforcement learning in code generation: A survey. ArXiv preprint, abs/2412.20367, 2024. URL https://arxiv.org/abs/2412.20367.
[153]
Sujan Dutta, Sayantan Mahinder, Raviteja Anantha, and Bortik Bandyopadhyay. Applying RLAIF for code generation with api-usage in lightweight llms. ArXiv preprint, abs/2406.20060, 2024. URL https://arxiv.org/abs/2406.20060.
[154]
Ning Dai, Zheng Wu, Renjie Zheng, Ziyun Wei, Wenlei Shi, Xing Jin, Guanlin Liu, Chen Dun, Liang Huang, and Lin Yan. Process supervision-guided policy optimization for code generation. ArXiv preprint, abs/2410.17621, 2024. URL https://arxiv.org/abs/2410.17621.
[155]
Daniel Nichols, Pranav Polasam, Harshitha Menon, Aniruddha Marathe, Todd Gamblin, and Abhinav Bhatele. Performance-aligned llms for generating fast code. ArXiv preprint, abs/2404.18864, 2024. URL https://arxiv.org/abs/2404.18864.
[156]
Wei Shen and Chuheng Zhang. Policy filtration in RLHF to fine-tune LLM for code generation. ArXiv preprint, abs/2409.06957, 2024. URL https://arxiv.org/abs/2409.06957.
[157]
Haoyi Xiong, Jiang Bian, Yuchen Li, Xuhong Li, Mengnan Du, Shuaiqiang Wang, Dawei Yin, and Sumi Helal. When search engine services meet large language models: visions and challenges. IEEE Transactions on Services Computing, 2024.
[158]
Yujia Zhou, Zhicheng Dou, and Ji-Rong Wen. Enhancing generative retrieval with reinforcement learning from relevance feedback. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 12481–12490, Singapore, 2023. Association for Computational Linguistics. . URL https://aclanthology.org/2023.emnlp-main.768.
[159]
Minsang Kim and Seungjun Baek. Syntriever: How to train your retriever with synthetic data from llms. ArXiv preprint, abs/2502.03824, 2025. URL https://arxiv.org/abs/2502.03824.
[160]
Xinyan Guan, Jiali Zeng, Fandong Meng, Chunlei Xin, Yaojie Lu, Hongyu Lin, Xianpei Han, Le Sun, and Jie Zhou. Deeprag: Thinking to retrieval step by step for large language models, 2025. URL https://arxiv.org/abs/2502.01142.
[161]
Guangzhi Xiong, Qiao Jin, Xiao Wang, Yin Fang, Haolin Liu, Yifan Yang, Fangyuan Chen, Zhixing Song, Dengyu Wang, Minjia Zhang, Zhiyong Lu, and Aidong Zhang. Rag-gym: Optimizing reasoning and search agents with process supervision, 2025. URL https://arxiv.org/abs/2502.13957.
[162]
Jie Wang, Alexandros Karatzoglou, Ioannis Arapakis, and Joemon M. Jose. Reinforcement learning-based recommender systems with large language models for state reward and action modeling. In Grace Hui Yang, Hongning Wang, Sam Han, Claudia Hauff, Guido Zuccon, and Yi Zhang (eds.), Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, SIGIR 2024, Washington DC, USA, July 14-18, 2024, pp. 375–385. ACM, 2024. . URL https://doi.org/10.1145/3626772.3657767.
[163]
Chao Sun, Yaobo Liang, Yaming Yang, Shilin Xu, Tianmeng Yang, and Yunhai Tong. Rlrf4rec: Reinforcement learning from recsys feedback for enhanced recommendation reranking. ArXiv preprint, abs/2410.05939, 2024. URL https://arxiv.org/abs/2410.05939.
[164]
Mengyuan Yang, Mengying Zhu, Yan Wang, Linxun Chen, Yilei Zhao, Xiuyuan Wang, Bing Han, Xiaolin Zheng, and Jianwei Yin. Fine-tuning large language model based explainable recommendation with explainable quality reward. In Michael J. Wooldridge, Jennifer G. Dy, and Sriraam Natarajan (eds.), Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2014, February 20-27, 2024, Vancouver, Canada, pp. 9250–9259. AAAI Press, 2024. . URL https://doi.org/10.1609/aaai.v38i8.28777.
[165]
Geoffrey Cideron, Sertan Girgin, Mauro Verzetti, Damien Vincent, Matej Kastelic, Zalán Borsos, Brian McWilliams, Victor Ungureanu, Olivier Bachem, Olivier Pietquin, Matthieu Geist, Léonard Hussenot, Neil Zeghidour, and Andrea Agostinelli. Musicrl: Aligning music generation to human preferences. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=EruV94XRDs.
[166]
Huan Liao, Haonan Han, Kai Yang, Tianjiao Du, Rui Yang, Qinmei Xu, Zunnan Xu, Jingquan Liu, Jiasheng Lu, and Xiu Li. BATON: Aligning text-to-audio model using human preference feedback. In Proceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI 2024, Jeju, South Korea, August 3-9, 2024, pp. 4542–4550. ijcai.org, 2024. URL https://www.ijcai.org/proceedings/2024/502.
[167]
Jingyi Chen, Ju-Seung Byun, Micha Elsner, and Andrew Perrault. Reinforcement learning for fine-tuning text-to-speech diffusion models. ArXiv preprint, abs/2405.14632, 2024. URL https://arxiv.org/abs/2405.14632.
[168]
Kimin Lee, Hao Liu, Moonkyung Ryu, Olivia Watkins, Yuqing Du, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, and Shixiang Shane Gu. Aligning text-to-image models using human feedback. ArXiv preprint, abs/2302.12192, 2023. URL https://arxiv.org/abs/2302.12192.
[169]
Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. Imagereward: Learning and evaluating human preferences for text-to-image generation. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/33646ef0ed554145eab65f6250fab0c9-Abstract-Conference.html.
[170]
Ying Fan, Olivia Watkins, Yuqing Du, Hao Liu, Moonkyung Ryu, Craig Boutilier, Pieter Abbeel, Mohammad Ghavamzadeh, Kangwook Lee, and Kimin Lee. DPOK: Reinforcement learning for fine-tuning text-to-image diffusion models. ArXiv preprint, abs/2305.16381, 2023. URL https://arxiv.org/abs/2305.16381.
[171]
Xun Wu, Shaohan Huang, Guolong Wang, Jing Xiong, and Furu Wei. Boosting text-to-video generative model with mllms feedback. In Amir Globersons, Lester Mackey, Danielle Belgrave, Angela Fan, Ulrich Paquet, Jakub M. Tomczak, and Cheng Zhang (eds.), Advances in Neural Information Processing Systems 38: Annual Conference on Neural Information Processing Systems 2024, NeurIPS 2024, Vancouver, BC, Canada, December 10 - 15, 2024, 2024. URL http://papers.nips.cc/paper_files/paper/2024/hash/fbe2b2f74a2ece8070d8fb073717bda6-Abstract-Conference.html.
[172]
Hangjie Yuan, Shiwei Zhang, Xiang Wang, Yujie Wei, Tao Feng, Yining Pan, Yingya Zhang, Ziwei Liu, Samuel Albanie, and Dong Ni. Instructvideo: Instructing video diffusion models with human feedback. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, pp. 6463–6474. IEEE, 2024. . URL https://doi.org/10.1109/CVPR52733.2024.00618.
[173]
Shuting Wang, Haihong Tang, Zhicheng Dou, and Chenyan Xiong. Harness local rewards for global benefits: Effective text-to-video generation alignment with patch-level reward models, 2025. URL https://arxiv.org/abs/2502.06812.
[174]
Kun Chu, Xufeng Zhao, Cornelius Weber, Mengdi Li, and Stefan Wermter. Accelerating reinforcement learning of robotic manipulations via feedback from large language models. ArXiv preprint, abs/2311.02379, 2023. URL https://arxiv.org/abs/2311.02379.
[175]
Juan Rocamonde, Victoriano Montesinos, Elvis Nava, Ethan Perez, and David Lindner. Vision-language models are zero-shot reward models for reinforcement learning. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=N0I2RtD8je.
[176]
Ellen R. Novoseller, Vinicius G. Goecks, David Watkins, Josh Miller, and Nicholas R. Waytowich. DIP-RL: Demonstration-inferred preference learning in Minecraft. ArXiv preprint, abs/2307.12158, 2023. URL https://arxiv.org/abs/2307.12158.
[177]
Sanjiban Choudhury. Process reward models for llm agents: Practical framework and directions, 2025. URL https://arxiv.org/abs/2502.10325.
[178]
Evan Frick, Tianle Li, Connor Chen, Wei-Lin Chiang, Anastasios N. Angelopoulos, Jiantao Jiao, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. How to evaluate reward models for RLHF. ArXiv preprint, abs/2410.14872, 2024. URL https://arxiv.org/abs/2410.14872.
[179]
Xueru Wen, Jie Lou, Yaojie Lu, Hongyu Lin, Xing Yu, Xinyu Lu, Ben He, Xianpei Han, Debing Zhang, and Le Sun. Rethinking reward model evaluation: Are we barking up the wrong tree? ArXiv preprint, abs/2410.05584, 2024. URL https://arxiv.org/abs/2410.05584.
[180]
Jaymari Chua, Yun Li, Shiyi Yang, Chen Wang, and Lina Yao. AI safety in generative AI large language models: A survey. ArXiv preprint, abs/2407.18369, 2024. URL https://arxiv.org/abs/2407.18369.
[181]
Yantao Liu, Zijun Yao, Rui Min, Yixin Cao, Lei Hou, and Juanzi Li. Rm-bench: Benchmarking reward models of language models with subtlety and style. ArXiv preprint, abs/2410.16184, 2024. URL https://arxiv.org/abs/2410.16184.
[182]
Enyu Zhou, Guodong Zheng, Binghai Wang, Zhiheng Xi, Shihan Dou, Rong Bao, Wei Shen, Limao Xiong, Jessica Fan, Yurong Mou, Rui Zheng, Tao Gui, Qi Zhang, and Xuanjing Huang. RMB: Comprehensively benchmarking reward models in LLM alignment. ArXiv preprint, abs/2410.09893, 2024. URL https://arxiv.org/abs/2410.09893.
[183]
Chujie Zheng, Zhenru Zhang, Beichen Zhang, Runji Lin, Keming Lu, Bowen Yu, Dayiheng Liu, Jingren Zhou, and Junyang Lin. Processbench: Identifying process errors in mathematical reasoning. ArXiv preprint, abs/2412.06559, 2024. URL https://arxiv.org/abs/2412.06559.
[184]
Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, Lingpeng Kong, and Qi Liu. Vlrewardbench: A challenging benchmark for vision-language generative reward models. ArXiv preprint, abs/2411.17451, 2024. URL https://arxiv.org/abs/2411.17451.
[185]
Zhaorun Chen, Yichao Du, Zichen Wen, Yiyang Zhou, Chenhang Cui, Zhenzhen Weng, Haoqin Tu, Chaoqi Wang, Zhengwei Tong, Qinglan Huang, Canyu Chen, Qinghao Ye, Zhihong Zhu, Yuqing Zhang, Jiawei Zhou, Zhuokai Zhao, Rafael Rafailov, Chelsea Finn, and Huaxiu Yao. Mj-bench: Is your multimodal reward model really a good judge for text-to-image generation? ArXiv preprint, abs/2407.04842, 2024. URL https://arxiv.org/abs/2407.04842.
[186]
Michihiro Yasunaga, Luke Zettlemoyer, and Marjan Ghazvininejad. Multimodal rewardbench: Holistic evaluation of reward models for vision language models, 2025. URL https://arxiv.org/abs/2502.14191.
[187]
Srishti Gureja, Lester James V. Miranda, Shayekh Bin Islam, Rishabh Maheshwary, Drishti Sharma, Gusti Winata, Nathan Lambert, Sebastian Ruder, Sara Hooker, and Marzieh Fadaee. M-rewardbench: Evaluating reward models in multilingual settings. ArXiv preprint, abs/2410.15522, 2024. URL https://arxiv.org/abs/2410.15522.
[188]
Zhuoran Jin, Hongbang Yuan, Tianyi Men, Pengfei Cao, Yubo Chen, Kang Liu, and Jun Zhao. Rag-rewardbench: Benchmarking reward models in retrieval augmented generation for preference alignment. ArXiv preprint, abs/2412.13746, 2024. URL https://arxiv.org/abs/2412.13746.
[189]
Daniel M. Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B. Brown, Alec Radford, Dario Amodei, Paul F. Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences. ArXiv preprint, abs/1909.08593, 2019. URL https://arxiv.org/abs/1909.08593.
[190]
Oliver Daniels-Koch and Rachel Freedman. The expertise problem: Learning from specialized feedback. ArXiv preprint, abs/2211.06519, 2022. URL https://arxiv.org/abs/2211.06519.
[191]
Hritik Bansal, John Dang, and Aditya Grover. Peering through preferences: Unraveling feedback acquisition for aligning large language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=dKl6lMwbCy.
[192]
Yulan Hu, Qingyang Li, Sheng Ouyang, Ge Chen, Kaihui Chen, Lijun Mei, Xucheng Ye, Fuzheng Zhang, and Yong Liu. Towards comprehensive preference data collection for reward modeling. ArXiv preprint, abs/2406.16486, 2024. URL https://arxiv.org/abs/2406.16486.
[193]
Xun Deng, Han Zhong, Rui Ai, Fuli Feng, Zheng Wang, and Xiangnan He. Less is more: Improving LLM alignment via preference data selection. ArXiv preprint, abs/2502.14560, 2025. URL https://arxiv.org/abs/2502.14560.
[194]
Ruibo Liu, Jerry Wei, Fangyu Liu, Chenglei Si, Yanzhe Zhang, Jinmeng Rao, Steven Zheng, Daiyi Peng, Diyi Yang, Denny Zhou, and Andrew M. Dai. Best practices and lessons learned on synthetic data for language models. ArXiv preprint, abs/2404.07503, 2024. URL https://arxiv.org/abs/2404.07503.
[195]
Joar Skalse, Nikolaus H. R. Howe, Dmitrii Krasheninnikov, and David Krueger. Defining and characterizing reward hacking. ArXiv preprint, abs/2209.13085, 2022. URL https://arxiv.org/abs/2209.13085.
[196]
Leo Gao, John Schulman, and Jacob Hilton. Scaling laws for reward model overoptimization. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp. 10835–10866. PMLR, 2023. URL https://proceedings.mlr.press/v202/gao23h.html.
[197]
Lilian Weng. Reward hacking in reinforcement learning. lilianweng.github.io, 2024. URL https://lilianweng.github.io/posts/2024-11-28-reward-hacking/.
[198]
Cassidy Laidlaw, Shivam Singhal, and Anca Dragan. Correlated proxies: A new definition and improved mitigation for reward hacking. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=msEr27EejF.
[199]
Nisan Stiennon, Long Ouyang, Jeff Wu, Daniel M. Ziegler, Ryan Lowe, Chelsea Voss, Alec Radford, Dario Amodei, and Paul F. Christiano. Learning to summarize from human feedback. ArXiv preprint, abs/2009.01325, 2020. URL https://arxiv.org/abs/2009.01325.
[200]
Leo McKee-Reid, Christoph Sträter, Maria Angelica Martinez, Joe Needham, and Mikita Balesni. Honesty to subterfuge: In-context reinforcement learning can make honest models reward hack. ArXiv preprint, abs/2410.06491, 2024. URL https://arxiv.org/abs/2410.06491.
[201]
Carson Denison, Monte MacDiarmid, Fazl Barez, David Duvenaud, Shauna Kravec, Samuel Marks, Nicholas Schiefer, Ryan Soklaski, Alex Tamkin, Jared Kaplan, Buck Shlegeris, Samuel R. Bowman, Ethan Perez, and Evan Hubinger. Sycophancy to subterfuge: Investigating reward-tampering in large language models. ArXiv preprint, abs/2406.10162, 2024. URL https://arxiv.org/abs/2406.10162.
[202]
Jiaxin Wen, Ruiqi Zhong, Akbir Khan, Ethan Perez, Jacob Steinhardt, Minlie Huang, Samuel R. Bowman, He He, and Shi Feng. Language models learn to mislead humans via RLHF. ArXiv preprint, abs/2409.12822, 2024. URL https://arxiv.org/abs/2409.12822.
[203]
Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=tvhaxkMKAn.
[204]
Thomas Coste, Usman Anwar, Robert Kirk, and David Krueger. Reward model ensembles help mitigate overoptimization. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=dcjtMYkpXx.
[205]
Xiaoying Zhang, Jean-Francois Ton, Wei Shen, Hongning Wang, and Yang Liu. Overcoming reward overoptimization via adversarial policy optimization with lightweight uncertainty estimation. ArXiv preprint, abs/2403.05171, 2024. URL https://arxiv.org/abs/2403.05171.
[206]
Yuchun Miao, Sen Zhang, Liang Ding, Yuqi Zhang, Lefei Zhang, and Dacheng Tao. The energy loss phenomenon in RLHF: A new perspective on mitigating reward hacking. ArXiv preprint, abs/2501.19358, 2025. URL https://arxiv.org/abs/2501.19358.
[207]
Hui Huang, Yingqi Qu, Jing Liu, Muyun Yang, and Tiejun Zhao. An empirical study of LLM-as-a-judge for LLM evaluation: Fine-tuned judge models are task-specific classifiers. ArXiv preprint, abs/2403.02839, 2024. URL https://arxiv.org/abs/2403.02839.
[208]
Xuanchang Zhang, Wei Xiong, Lichang Chen, Tianyi Zhou, Heng Huang, and Tong Zhang. From lists to emojis: How format bias affects model alignment. ArXiv preprint, abs/2409.11704, 2024. URL https://arxiv.org/abs/2409.11704.
[209]
Junsoo Park, Seungyeon Jwa, Meiying Ren, Daeyoung Kim, and Sanghyuk Choi. OffsetBias: Leveraging debiased data for tuning evaluators. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, Miami, Florida, USA, November 12-16, 2024, pp. 1043–1067. Association for Computational Linguistics, 2024. URL https://aclanthology.org/2024.findings-emnlp.57.
[210]
Dawei Li, Renliang Sun, Yue Huang, Ming Zhong, Bohan Jiang, Jiawei Han, Xiangliang Zhang, Wei Wang, and Huan Liu. Preference leakage: A contamination problem in LLM-as-a-judge. ArXiv preprint, abs/2502.01534, 2025. URL https://arxiv.org/abs/2502.01534.
[211]
Aixin Liu, Bei Feng, Bing Xue, Bingxuan Wang, Bochao Wu, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chenyu Zhang, Chong Ruan, et al. DeepSeek-V3 technical report. ArXiv preprint, abs/2412.19437, 2024. URL https://arxiv.org/abs/2412.19437.
[212]
Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. ArXiv preprint, abs/2412.16720, 2024. URL https://arxiv.org/abs/2412.16720.
[213]
Jiayi Pan, Junjie Zhang, Xingyao Wang, Lifan Yuan, Hao Peng, and Alane Suhr. TinyZero. https://github.com/Jiayi-Pan/TinyZero, 2025. Accessed: 2025-01-24.
[214]
Elie Bakouch, Leandro von Werra, and Lewis Tunstall. Open R1. https://github.com/huggingface/open-r1, 2025.
[215]
Open-Thoughts-Team. Open Thoughts. https://github.com/open-thoughts/open-thoughts, 2025.
[216]
Long Phan, Alice Gatti, Ziwen Han, Nathaniel Li, Josephina Hu, Hugh Zhang, Chen Bo Calvin Zhang, Mohamed Shaaban, John Ling, Sean Shi, et al. Humanity’s last exam, 2025. URL https://arxiv.org/abs/2501.14249.
[217]
OpenAI. Introducing deep research, 2025. URL https://openai.com/index/introducing-deep-research/.
[218]
Cursor. Cursor - the ai code editor, 2025. URL https://www.cursor.com/. Accessed: 2025-02-16.
[219]
OpenAI. Introducing operator, 2025. URL https://openai.com/index/introducing-operator/.
[220]
Chaoyun Zhang, Shilin He, Jiaxu Qian, Bowen Li, Liqun Li, Si Qin, Yu Kang, Minghua Ma, Qingwei Lin, Saravan Rajmohan, et al. Large language model-brained GUI agents: A survey. ArXiv preprint, abs/2411.18279, 2024. URL https://arxiv.org/abs/2411.18279.
[221]
Donald Joseph Hejna III and Dorsa Sadigh. Few-shot preference learning for human-in-the-loop RL. In Karen Liu, Dana Kulic, and Jeffrey Ichnowski (eds.), Conference on Robot Learning, CoRL 2022, 14-18 December 2022, Auckland, New Zealand, volume 205 of Proceedings of Machine Learning Research, pp. 2014–2025. PMLR, 2022. URL https://proceedings.mlr.press/v205/iii23a.html.
[222]
Robert Wijaya, Ngoc-Bao Nguyen, and Ngai-Man Cheung. Multimodal preference data synthetic alignment with reward model. ArXiv preprint, abs/2412.17417, 2024. URL https://arxiv.org/abs/2412.17417.
[223]
Ali Emre Narin. Evolutionary reward design and optimization with multimodal large language models. Proceedings of the 3rd Workshop on Advances in Language and Vision Research (ALVR), 2024. URL https://api.semanticscholar.org/CorpusID:270819969.
[224]
Christian Arzate Cruz and Takeo Igarashi. A survey on interactive reinforcement learning: Design principles and open challenges. In Ron Wakkary, Kristina Andersen, Will Odom, Audrey Desjardins, and Marianne Graves Petersen (eds.), DIS ’20: Designing Interactive Systems Conference 2020, Eindhoven, The Netherlands, July 6-10, 2020, pp. 1195–1209. ACM, 2020. URL https://doi.org/10.1145/3357236.3395525.
[225]
Anis Najar and Mohamed Chetouani. Reinforcement learning with human advice: A survey. Frontiers Robotics AI, 8: 584075, 2021. URL https://doi.org/10.3389/frobt.2021.584075.
[226]
Carl Orge Retzlaff, Srijita Das, Christabel Wayllace, Payam Mousavi, Mohammad Afshari, Tianpei Yang, Anna Saranti, Alessa Angerschmid, Matthew E. Taylor, and Andreas Holzinger. Human-in-the-loop reinforcement learning: A survey and position on requirements, challenges, and opportunities. J. Artif. Intell. Res., 79: 359–415, 2024. URL https://doi.org/10.1613/jair.1.15348.
[227]
Yuji Cao, Huan Zhao, Yuheng Cheng, Ting Shu, Guolong Liu, Gaoqi Liang, Junhua Zhao, and Yun Li. Survey on large language model-enhanced reinforcement learning: Concept, taxonomy, and methods. ArXiv preprint, abs/2404.00282, 2024. URL https://arxiv.org/abs/2404.00282.
[228]
Zhichao Wang, Bin Bi, Shiva Kumar Pentyala, Kiran Ramnath, Sougata Chaudhuri, Shubham Mehrotra, Zixu James Zhu, Xiang-Bo Mao, Sitaram Asur, and Na Claire Cheng. A comprehensive survey of LLM alignment techniques: RLHF, RLAIF, PPO, DPO and more. ArXiv preprint, abs/2407.16216, 2024. URL https://arxiv.org/abs/2407.16216.
[229]
John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. ArXiv preprint, abs/1707.06347, 2017. URL https://arxiv.org/abs/1707.06347.
[230]
Minae Kwon, Sang Michael Xie, Kalesha Bullard, and Dorsa Sadigh. Reward design with language models. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net, 2023. URL https://openreview.net/pdf?id=10uNUgI5Kl.
[231]
Alizée Pace, Jonathan Mallinson, Eric Malmi, Sebastian Krause, and Aliaksei Severyn. West-of-N: Synthetic preference generation for improved reward modeling. ArXiv preprint, abs/2401.12086, 2024. URL https://arxiv.org/abs/2401.12086.
[232]
Yuu Jinnai, Tetsuro Morimura, Kaito Ariu, and Kenshi Abe. Regularized best-of-n sampling to mitigate reward hacking for language model alignment. ArXiv preprint, abs/2404.01054, 2024. URL https://arxiv.org/abs/2404.01054.
[233]
Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, and Yahui Zhou. Skywork-Reward: Bag of tricks for reward modeling in LLMs, 2024. URL https://arxiv.org/abs/2410.18451.
[234]
Rémi Munos, Michal Valko, Daniele Calandriello, Mohammad Gheshlaghi Azar, Mark Rowland, Zhaohan Daniel Guo, Yunhao Tang, Matthieu Geist, Thomas Mesnard, Côme Fiegel, Andrea Michi, Marco Selvi, Sertan Girgin, Nikola Momchev, Olivier Bachem, Daniel J. Mankowitz, Doina Precup, and Bilal Piot. Nash learning from human feedback. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=Y5AmNYiyCQ.
[235]
Kawin Ethayarajh, Winnie Xu, Niklas Muennighoff, Dan Jurafsky, and Douwe Kiela. KTO: Model alignment as prospect theoretic optimization, 2024. URL https://arxiv.org/abs/2402.01306.
[236]
Jack S Levy. An introduction to prospect theory. Political psychology, pp. 171–186, 1992.
[237]
Pierre Harvey Richemond, Yunhao Tang, Daniel Guo, Daniele Calandriello, Mohammad Gheshlaghi Azar, Rafael Rafailov, Bernardo Avila Pires, Eugene Tarassov, Lucas Spangher, Will Ellsworth, Aliaksei Severyn, Jonathan Mallinson, Lior Shani, Gil Shamir, Rishabh Joshi, Tianqi Liu, Remi Munos, and Bilal Piot. Offline regularised reinforcement learning for large language models alignment, 2024. URL https://arxiv.org/abs/2405.19107.
[238]
Akifumi Wachi, Wataru Hashimoto, and Kazumune Hashimoto. Long-term safe reinforcement learning with binary feedback. In Michael J. Wooldridge, Jennifer G. Dy, and Sriraam Natarajan (eds.), Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 2024, Fourteenth Symposium on Educational Advances in Artificial Intelligence, EAAI 2024, February 20-27, 2024, Vancouver, Canada, pp. 21656–21663. AAAI Press, 2024. URL https://doi.org/10.1609/aaai.v38i19.30164.
[239]
M. A. Ganaie, Minghui Hu, Ashwani Kumar Malik, Muhammad Tanveer, and Ponnuthurai N. Suganthan. Ensemble deep learning: A review. Eng. Appl. Artif. Intell., 115: 105151, 2022. URL https://doi.org/10.1016/j.engappai.2022.105151.
[240]
Shanghaoran Quan. DMoERM: Recipes of mixture-of-experts for effective reward modeling, 2024. URL https://arxiv.org/abs/2403.01197.
[241]
Michelle Halbheer, Dominik J. Mühlematter, Alexander Becker, Dominik Narnhofer, Helge Aasen, Konrad Schindler, and Mehmet Ozgur Turkoglu. LoRA-Ensemble: Efficient uncertainty modelling for self-attention networks, 2024. URL https://arxiv.org/abs/2405.14438.
[242]
Shihan Dou, Enyu Zhou, Yan Liu, Songyang Gao, Jun Zhao, Wei Shen, Yuhao Zhou, Zhiheng Xi, Xiao Wang, Xiaoran Fan, Shiliang Pu, Jiang Zhu, Rui Zheng, Tao Gui, Qi Zhang, and Xuanjing Huang. LoRAMoE: Alleviate world knowledge forgetting in large language models via MoE-style plugin, 2023. URL https://arxiv.org/abs/2312.09979.
[243]
Seung Hyun Lee, Yinxiao Li, Junjie Ke, Innfarn Yoo, Han Zhang, Jiahui Yu, Qifei Wang, Fei Deng, Glenn Entis, Junfeng He, Gang Li, Sangpil Kim, Irfan Essa, and Feng Yang. Parrot: Pareto-optimal multi-reward reinforcement learning framework for text-to-image generation, 2024. URL https://arxiv.org/abs/2401.05675.
[244]
Alexandre Ramé, Guillaume Couairon, Corentin Dancette, Jean-Baptiste Gaya, Mustafa Shukor, Laure Soulier, and Matthieu Cord. Rewarded soups: towards pareto-optimal alignment by interpolating weights fine-tuned on diverse rewards. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023, 2023. URL http://papers.nips.cc/paper_files/paper/2023/hash/e12a3b98b67e8395f639fde4c2b03168-Abstract-Conference.html.
[245]
Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeffrey Wu. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=ghNRg2mEgN.

  1. https://github.com/JLZhong23/awesome-reward-models↩︎