Disentangling Reasoning Tokens and Boilerplate Tokens For Language Model Fine-tuning

Ziang Ye1,21 Zhenru Zhang2 Yang Zhang3 Jianxin Ma2
Junyang Lin2 Fuli Feng1

1University of Science and Technology of China 2Alibaba Group
3National University of Singapore
yza03@mail.ustc.edu.cn {zhangzhenru.zzr,junyang.ljy}@alibaba-inc.com
{zyang1580,majx13fromthu,fulifeng93}@gmail.com


Abstract

When using agent-task datasets to enhance agent capabilities for Large Language Models (LLMs), current methodologies often treat all tokens within a sample equally. However, we argue that tokens serving different roles—specifically, reasoning tokens versus boilerplate tokens (e.g., those governing output format)—differ significantly in importance and learning complexity, necessitating their disentanglement and distinct treatment. To address this, we propose a novel Shuffle-Aware Discriminator (SHAD) for adaptive token discrimination. SHAD classifies tokens by exploiting predictability differences observed after shuffling input-output combinations across samples: boilerplate tokens, due to their repetitive nature among samples, maintain predictability, whereas reasoning tokens do not. Using SHAD, we propose the Reasoning-highlighted Fine-Tuning (RFT) method, which adaptively emphasizes reasoning tokens during fine-tuning, yielding notable performance gains over common Supervised Fine-Tuning (SFT).

1 Introduction

Recently, there has been a surge of enthusiasm in researching Agents based on Large Language Models (LLMs) [1], [2], with the aim of achieving human-level artificial intelligence or beyond. Despite LLMs showcasing remarkable capabilities in various areas, they have not inherently demonstrated strong agent capabilities, such as multi-step reasoning [3]–[5] and tool use [6]–[9]. This shortfall has directed significant attention toward incorporating datasets tailored for agent tasks to enhance the agent capabilities of LLMs [10]–[13]. These datasets offer structured examples of standard reasoning chains for solving agent tasks [6], [12], enabling LLMs to learn from them and thereby enhance their agent capabilities.

Figure 1: Examples of reasoning tokens (green) and boilerplate tokens (yellow and blue). Boilerplate tokens can be further categorized into format tokens (yellow) and template-connecting tokens (blue).

When leveraging these datasets to bolster LLMs’ agent capabilities, existing research often treats all tokens within a sample equally [6], [10], [11], [13], [14]. However, we argue that these tokens could differ substantially in learning difficulty and importance. Given the standardized structure of the data, tokens within a sample can be divided into two categories as depicted in Figure 1: 1) boilerplate tokens, which include format tokens that constrain the output structure, and template-connecting tokens that serve as standard transitional phrases for reasoning, such as “Based on the user’s request... By doing so... This way...”; and 2) reasoning tokens, which provide sample-specific reasoning information crucial for task solving. Boilerplate tokens are distinctly less critical for task solving compared to reasoning tokens and are easier to learn due to their repetitive nature across many samples.

Figure 2: Loss changes for different token types on sampled test data that the model fails to answer, under regular SFT training.

It is crucial to distinguish between the reasoning and boilerplate components and handle them separately. Failure to do so may result in undesired effects, such as overfitting to the boilerplate components, as depicted in Figure 2, ultimately leading to inadequate agent capabilities. While manually crafting regular expressions to filter out boilerplate tokens appears to be a feasible solution, it can be highly inefficient when dealing with data of diverse formats. Additionally, creating regular expressions for template-connecting tokens of transitional phrases poses challenges due to their potential variability in language. Therefore, an automated and adaptive approach for segregating these components is highly desirable.

This study introduces a novel SHuffle-Aware Discriminator (SHAD) to achieve automated and adaptive token distinction. Considering boilerplate tokens are usually consistent across samples, they can be treated as sample-independent. Consequently, shuffling the correspondence between input and output across data samples does not alter the predictability of boilerplate tokens. However, such shuffling introduces noise that complicates the prediction of reasoning tokens, by causing mismatches between the tokens and the input queries2. SHAD is developed based on this principle. Specifically, it fine-tunes an LLM model using a small portion of shuffled data and then compares the token-level loss between the tuned and original models to classify tokens for the target data. A token is classified as a boilerplate token3 if the loss on the tuned model decreases; otherwise, it is classified as a reasoning token.

Based on SHAD, we have developed a new Reasoning-highlighted Fine-Tuning (RFT) approach, which adaptively assigns greater weights to challenging reasoning tokens to emphasize the learning of reasoning. This approach demonstrates superior performance compared to existing supervised fine-tuning methods across several common agent benchmarks. Further analysis reveals that our method could effectively identify reasoning tokens and strengthen the learning of these tokens, ultimately enhancing the learning of agent capabilities for LLMs.

The main contributions of this work are summarized as follows:

  • We emphasize the differences in learning difficulty and importance between reasoning and boilerplate tokens for agent learning, highlighting the critical importance of effectively distinguishing between them.

  • We introduce SHAD, a novel method that automatically discriminates between reasoning and boilerplate tokens based on their predictability differences observed after shuffling input-output combinations.

  • We have developed a new fine-tuning method RFT rooted in SHAD, improving the effectiveness of learning agent capabilities for LLMs.

2 Related Work

Figure 3: Illustration of the SHAD method, which classifies tokens through three steps. In step 1, a small subset of the data is sampled, and the output of the sampled data is shuffled. In step 2, the LLM is tuned using the shuffled data. In step 3, tokens are classified by comparing the prediction losses between the tuned and original models.

\(\bullet\) Token Differentiation. Typically, when tuning LLMs, the sequence-level loss is optimized, treating all involved tokens equally. However, recent studies across various domains have increasingly recognized that tokens play different roles. For instance, [15] suggest that not all tokens are necessary during pretraining, especially in domain-specific contexts, and propose leveraging a reference model trained on high-quality data to assess token importance. Similarly, [16], [17] recognize token differences in preference learning for LLMs and accordingly introduce token-level rewards to better align models with human preferences. Among existing works, Agent-Flan [12] is the most relevant to ours, sharing a similar motivation to account for token differences in agent tuning. However, it only considers “format tokens” as boilerplate tokens, overlooking template-connecting tokens, which are more challenging to disentangle from reasoning tokens. Additionally, it does not emphasize the importance of distinguishing (or classifying) these tokens, resulting in a fundamental difference in both the problems addressed and the solutions proposed. We focus on automatically disentangling reasoning tokens from boilerplate tokens, whereas Agent-Flan prioritizes converting agent data into a standard conversational format.

\(\bullet\) Enhancing Agent Capability for LLMs. To tackle complex real-world problems, it is essential to enhance LLMs’ agent capabilities, such as the ability to use external tools and perform multi-step reasoning [18]–[22]. Prior works [4], [13], [23]–[25] have focused on developing frameworks that prompt LLMs to better integrate tools and engage in deeper reasoning before taking action. Subsequent works have further constructed diverse and well-structured agent-task benchmark datasets, e.g., ToolLLaMA [6], ToolAlpaca [26], and APIGen [8], and use these datasets to further tune LLMs, more directly and effectively enhancing their agent abilities. Recently, [12] proposed Agent-Flan, a dataset rewriting method that enables LLMs to better learn reasoning and tool use at the step level. Although these methods train LLMs on agent datasets and achieve promising results, they often struggle with overfitting and generalization issues [12]. Our RFT with SHAD can better utilize these datasets to learn reasoning, achieving superior performance on agent tasks while maintaining good generalization ability on out-of-distribution benchmarks.

3 Methodology

In this section, we first introduce the SHuffle-Aware Discriminator (SHAD), which is proposed to adaptively distinguish between reasoning and boilerplate tokens. We then discuss how to develop our Reasoning-highlighted Fine-Tuning (RFT) based on the discrimination results.

3.1 SHAD: Adaptive Token Discriminator

To develop SHAD, our foundational idea is that boilerplate tokens, which serve as output templates, should be interchangeable across many samples, whereas reasoning tokens are specific to individual samples and cannot be swapped. Consequently, shuffling the combination of inputs and outputs across samples does not alter the predictability of boilerplate tokens, unlike reasoning tokens. Leveraging this principle, we can achieve automated and adaptive token discrimination through three steps (as shown in Figure 3):

  • Data Shuffle: Select a small ratio of the data and shuffle the combinations of inputs and outputs among the sampled items.

  • Model Tuning: Fine-tune an LLM model using the shuffled data.

  • Classifying: Classify tokens based on the loss change between the tuned and original models for the target data. Compared to the original model, if a token’s loss decreases, it is likely a boilerplate token; otherwise, a reasoning token.

Next, we elaborate on these three steps:

Figure 4: Example of shuffled data. After shuffling, the assistant’s responses no longer correspond to the original queries. However, some tokens (boilerplate tokens, red) remain semantically similar to the original response and are therefore predictable. In contrast, reasoning tokens (green) no longer align with the query, resulting in noise. Note that ‘Action’ and ‘Action Input’ are directly copied from ‘Thought’ and could be considered as non-reasoning.

 \(\bullet\) Data Shuffle. This is the core step of our method, creating distinct predictability for the reasoning tokens and boilerplate tokens. The shuffle is performed by randomly reassigning the input-output combinations between samples. In practice, we select only a small fraction (1%) of the target dataset and shuffle it for use in the subsequent model-tuning step, avoiding large tuning costs and overfitting on the whole dataset.

Let \((x^i, y^i)\) denote the \(i\)-th sample of the sampled dataset, with \(x^i\) as the input and \(y^i\) as the output. Denote the inputs of all samples as \(X = [x^1, \dots, x^{N}]\), and the corresponding outputs as \(Y = [y^1, \dots, y^{N}]\), where \(N\) denotes the size of the sampled dataset. We shuffle \(Y\), and then re-combine the inputs in \(X\) with the outputs in the shuffled \(Y\) to construct the shuffled dataset \(\mathcal{D}_s\). This means that, for the \(i\)-th original sample \((x^i, y^i)\), its input \(x^i\) may be combined with the \(j\)-th sample’s output \(y^j\) to form a new sample \((x^i, y^j)\), while its output \(y^i\) may be combined with the \(k\)-th sample’s input \(x^k\) to form a new sample \((x^k, y^i)\). With this operation, the mapping between inputs and outputs becomes noise for reasoning tokens, making them unpredictable. As for the boilerplate tokens, since they are shared across samples, their predictability remains intact. Figure 4 provides an example to illustrate this.
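
To make this operation concrete, below is a minimal Python sketch of how such a shuffled subset could be constructed. The 'input'/'output' field names, the 1% sampling ratio, and the helper name are illustrative assumptions rather than the paper's exact data schema.

import random

def build_shuffled_subset(dataset, ratio=0.01, seed=0):
    """Step 1 of SHAD: sample a small fraction of (input, output) pairs and
    shuffle the outputs across samples, breaking the input-output mapping.

    `dataset` is assumed to be a list of dicts with 'input' and 'output' keys;
    this schema is an illustrative assumption, not the paper's exact format.
    """
    rng = random.Random(seed)
    subset = rng.sample(dataset, max(1, int(len(dataset) * ratio)))

    outputs = [ex["output"] for ex in subset]
    rng.shuffle(outputs)  # outputs no longer correspond to their original inputs

    # Re-combine each input x^i with a (generally different) output y^j.
    return [{"input": ex["input"], "output": y} for ex, y in zip(subset, outputs)]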

 \(\bullet\) Model Tuning. After obtaining the shuffled data, we leverage it to fine-tune an LLM. The model tuning is performed according to classic causal language modeling. Formally, \[\theta_{s} = \arg\min_{\theta} \sum_{(x^{\prime},y^{\prime})\in \mathcal{D}_{s}} l(x^{\prime},y^{\prime};\theta),\] where \(\theta\) denotes the learnable model parameters, \(l(x^{\prime},y^{\prime};\theta)\) denotes the loss for a shuffled sample \((x^{\prime},y^{\prime})\in \mathcal{D}_{s}\), and \(\theta_{s}\) denotes the optimized \(\theta\). As the outputs are shuffled with respect to the inputs, the tuned model is expected to learn to predict only the boilerplate tokens effectively.
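
As a rough illustration of this step, the following sketch fine-tunes a causal LM on the shuffled subset from the previous snippet, computing the loss only on the (shuffled) output tokens via the standard -100 label mask. The model name, single-example loop, and hyperparameters are placeholders, not the paper's exact training setup.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"  # assumed backbone, for illustration only
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)

def shuffled_training_step(example):
    # Tokenize prompt and (shuffled) output separately so the prompt can be masked.
    prompt_ids = tokenizer(example["input"], return_tensors="pt").input_ids
    output_ids = tokenizer(example["output"], return_tensors="pt",
                           add_special_tokens=False).input_ids
    input_ids = torch.cat([prompt_ids, output_ids], dim=1)

    # Only output tokens contribute to the causal-LM loss l(x', y'; theta).
    labels = input_ids.clone()
    labels[:, : prompt_ids.shape[1]] = -100

    loss = model(input_ids=input_ids, labels=labels).loss
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return loss.item()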

 \(\bullet\) Classifying. After tuning the model with shuffled data, we evaluate the role of each token in a target sample by comparing the token-level prediction loss between the tuned and original models. Given that the tuned model should primarily learn boilerplate tokens, we classify a token as ‘boilerplate’ if its prediction loss decreases in the tuned model relative to the original; otherwise, we classify it as a ‘reasoning’ token.

Given a sample \((x, y)\) in the target dataset, we focus on classifying the tokens in the output part. Formally, for the \(k\)-th token \(y_k\) in the output, we first compute the prediction loss difference (denoted as \(LD(y_k)\)) between the tuned and original models as follows: \[LD(y_k) = l_s(y_k) - l_o(y_k),\] where \(l_s(y_k)\) and \(l_o(y_k)\) represent the loss calculated on the tuned model and the original model, respectively, given by: \[\begin{align} l_s(y_k) &= -\log(P(y_k|x, y_{<k}; \theta_{s})), \\ l_o(y_k) &= -\log(P(y_k|x, y_{<k}; \theta_{o})). \end{align}\] Here, \(P(y_k|x, y_{<k}; \theta_{s})\) and \(P(y_k|x, y_{<k}; \theta_{o})\) denote the predicted probabilities of the token \(y_k\) from the tuned model (parameterized by \(\theta_{s}\)) and the original model (parameterized by \(\theta_{o}\)), respectively.

Based on the calculated loss difference \(LD(y_k)\), the token is classified as follows: \[Classifier(y_k) = \begin{cases} \text{boilerplate}, & \text{if } LD(y_k) \leq 0 \\ \text{reasoning}, & \text{otherwise} \end{cases}\]

Note that our token classification can be conducted offline with a single forward pass of LLM computation for each sample, without affecting the efficiency of the subsequent agent tuning process.
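
A minimal sketch of this classification step is given below: each output token is scored under both models with a single forward pass each, and the \(LD(y_k)\) rule is applied. Tensor shapes and argument names are illustrative assumptions.

import torch
import torch.nn.functional as F

@torch.no_grad()
def classify_output_tokens(prompt_ids, output_ids, tuned_model, original_model):
    """Step 3 of SHAD: compare per-token losses under the shuffle-tuned model
    (theta_s) and the original model (theta_o) and classify each output token.

    prompt_ids, output_ids: 1-D LongTensors for a single target sample.
    """
    input_ids = torch.cat([prompt_ids, output_ids]).unsqueeze(0)
    start = prompt_ids.shape[0]

    def per_token_loss(model):
        logits = model(input_ids=input_ids).logits[0]      # (seq_len, vocab)
        # The logit at position t predicts token t+1, so the predictions for
        # the output tokens sit at positions start-1 ... seq_len-2.
        pred_logits = logits[start - 1 : -1]
        return F.cross_entropy(pred_logits, output_ids, reduction="none")

    ld = per_token_loss(tuned_model) - per_token_loss(original_model)  # LD(y_k)
    return ["boilerplate" if d <= 0 else "reasoning" for d in ld.tolist()]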

3.2 Reasoning-highlighted Fine-Tuning

Agent-tuning data often follows fixed formats and similar reasoning trajectories, making boilerplate tokens easily learned. To prevent overfitting to these tokens and enhance reasoning capabilities, we propose focusing more on reasoning tokens identified by our SHAD method during fine-tuning.

Instead of manually assigning fixed weights to the two types of tokens, we utilize an adaptive weight assignment that better aligns with the dynamic learning process. Specifically, we compare the total losses of the reasoning and boilerplate parts, applying the softmax function to assign higher weights to the part with the greater loss. Notably, since the reasoning part typically exhibits a higher loss (see Figure 6), our method naturally assigns greater weights to emphasize reasoning learning. Furthermore, when the loss difference between the two parts diminishes, our method can adaptively adjust the weights to promote a more balanced learning process for the two parts. Given its focus on highlighting reasoning, we name our method Reasoning-highlighted Fine-Tuning (RFT).

Formally, let \(\mathcal{L}_{b}\) and \(\mathcal{L}_r\) represent the total loss for the boilerplate and reasoning tokens, respectively. The re-weighted loss of our RFT, denoted as \(\mathcal{L}_{RFT}\), can be formulated as follows: \[\mathcal{L}_{RFT} = \omega_b \mathcal{L}_{b} + \omega_r \mathcal{L}_{r},\] where \[\label{eq:omega} \begin{align} \omega_b &= \frac{\exp(\mathcal{L}_{b}/\tau)}{\exp(\mathcal{L}_{b}/\tau)+\exp(\mathcal{L}_{r}/\tau)}, \\ \omega_r &= \frac{\exp(\mathcal{L}_{r}/\tau)}{\exp(\mathcal{L}_{b}/\tau)+\exp(\mathcal{L}_{r}/\tau)}. \end{align}\tag{1}\] Here, \(\tau\) is the temperature coefficient of the softmax function. A smaller \(\tau\) results in greater weight being assigned to the part with the higher loss.
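
The weighting above can be implemented in a few lines; the sketch below is one possible realization in PyTorch. Whether the group losses are summed or averaged, and whether the softmax weights are detached from the computation graph, are implementation assumptions not specified here.

import torch

def rft_loss(token_losses, is_reasoning, tau=1.0):
    """Reasoning-highlighted Fine-Tuning loss.

    token_losses: 1-D tensor of per-token cross-entropy losses for a batch.
    is_reasoning: boolean tensor of the same shape from SHAD's classification.
    tau: softmax temperature; a smaller tau assigns more weight to the
         higher-loss part.
    """
    loss_r = token_losses[is_reasoning].sum()      # L_r
    loss_b = token_losses[~is_reasoning].sum()     # L_b

    # Softmax over the two group losses yields omega_b and omega_r; detaching
    # treats the weights as constants during backpropagation (an assumption).
    weights = torch.softmax(torch.stack([loss_b, loss_r]).detach() / tau, dim=0)
    omega_b, omega_r = weights[0], weights[1]

    return omega_b * loss_b + omega_r * loss_r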

4 Experiments

We now present experiments to evaluate the effectiveness of our method in enhancing LLMs’ agent capabilities, particularly in multi-step planning and tool usage, for solving complex real-world problems. We begin by detailing the experimental setup, followed by the analyses of the results.

Table 1: Performance comparison between baselines, SHAD+RFT, and its variants. Accuracy is reported for BFCL, Nexus, and T-eval, while the pass rate, assessed by GPT-4, is used for StableToolBench. ‘AVG’ represents the average performance across all evaluation datasets. The best results among baselines and SHAD+RFT are highlighted in bold, and the second-best are underlined.
(StableToolbench and BFCL are Held-In benchmarks; T-eval and Nexus are Held-Out.)
Model          Method               StableToolbench   BFCL    T-eval   Nexus   AVG
LLaMA3-8B      SFT                  43.1              85.9    67.0     14.0    52.5
LLaMA3-8B      Regex                36.2              81.0    54.3     6.45    44.5
LLaMA3-8B      Rho-1                24.5              82.9    68.4     19.0    48.7
LLaMA3-8B      RewardFT             44.4              89.3    66.3     8.0     52.0
LLaMA3-8B      SHAD+RFT             50.1              87.6    71.8     27.8    59.3
LLaMA3-8B      SHAD+\(\alpha\)-FT   47.0              87.2    68.8     28.7    57.9
LLaMA3-8B      Regex+RFT            41.2              83.81   61.1     12.4    49.6
LLaMA3.1-8B    SFT                  48.5              89.3    64.2     19.5    55.4
LLaMA3.1-8B    Regex                42.3              82.1    58.6     14.3    49.3
LLaMA3.1-8B    Rho-1                30.6              84.6    67.0     26.0    52.0
LLaMA3.1-8B    RewardFT             48.2              88.2    66.4     19.1    55.5
LLaMA3.1-8B    SHAD+RFT             50.4              89.4    68.3     32.0    60.0
LLaMA3.1-8B    SHAD+\(\alpha\)-FT   49.2              88.2    63.8     28.9    57.5
LLaMA3.1-8B    Regex+RFT            46.7              80.31   57.6     16.2    50.2

4.1 Experiment Setup

\(\bullet\)Training Data. We use LLaMA3.1-8B and LLaMA3-8B as the backbone models, fine-tuning them to solve agent tasks. The training dataset is constructed from two commonly used multi-step planning and tool-use benchmarks, ToolBench [6] and APIGen [8], supplemented with general data from ShareGPT [27]. The general data is used to preserve general capabilities like instruction-following, as demonstrated in previous work [11]. ToolBench and APIGen provide a variety of examples for solving complex real-world user queries across different environments, all organized in a standard agent-specific format: “Thought-Action-Action Input" or JSON style.

\(\bullet\) Evaluation Setting. To comprehensively evaluate the proposed method, we consider two evaluation settings: held-in task evaluation and held-out task evaluation, following prior work [11]. The held-in evaluation focuses on measuring performance on tasks similar to those used during training, while the held-out evaluation assesses the model’s generalization to novel tasks. For the held-in setting, we use the StableToolBench [28] and BFCL [29] benchmarks. These datasets align with our agent tuning datasets: StableToolBench shares the same source as ToolBench, while BFCL serves as the leave-out evaluation data for APIGen. For the held-out setting, we use two additional benchmarks: 1) T-eval [30], a comprehensive step-level reasoning benchmark, and 2) Nexus [31], a complex single-step nested tool-use benchmark. Both benchmarks provide a diverse set of tools for LLMs to choose from, with tasks in StableToolBench and T-eval often requiring multiple steps to complete. In accordance with the evaluation metrics established in the original benchmarks, accuracy is employed for BFCL4, T-eval, and Nexus, while StableToolBench5 is evaluated using the pass rate assessed by GPT-4.

\(\bullet\) Compared Methods. To evaluate our RFT method built on SHAD (denoted as SHAD+RFT), we compare it against the following baselines: 1) SFT, standard supervised fine-tuning; 2) Regex, which uses regular expressions to distinguish formatting tokens from other tokens and re-weights their losses with constant values; 3) Rho-1 [15], which leverages a reference model trained on high-quality data to identify noise tokens and then masks them during fine-tuning; and 4) Reward-based Fine-Tuning (RewardFT) [16], [17], which assigns token-level reward scores for tuning using a DPO-based reward model. It is important to note that Rho-1 and RewardFT were not originally designed for agent tuning tasks; however, we have extended them for this purpose, with implementation details provided in Appendix 11.

In addition to the above baselines, we also compare our method with two of its variants to assess its core design components: 1) SHAD+\(\alpha\)-FT, which retains the SHAD component but assigns a fixed weight \(\alpha\) to reasoning tokens to emphasize them; and 2) Regex+RFT, which preserves the RFT weighting mechanism, but uses regular expressions for the token distinction. The implementation details of \(\alpha\)-FT are also provided in Appendix 11.

Figure 5: Case study of tokens classified by SHAD. The blue regions represent reasoning tokens, identified by an increase in loss on the model tuned with shuffled data compared to the original model. In contrast, the brown regions indicate boilerplate tokens, characterized by a decrease in loss on the tuned model.

4.2 Main Results

Table 1 summarizes the performance of all compared methods. From the table, we can draw two main conclusions:

SHAD+RFT Performs Strongly. Our method, SHAD+RFT, outperforms all baselines on all held-in and held-out evaluation datasets, except for the held-in evaluation BFCL with LLaMA3-8B. This highlights the advantage of emphasizing reasoning components in solving complex real-world problems and demonstrates the effectiveness of our method in identifying and highlighting these parts. Notably, while Rho-1 and RewardFT also differentiate between tokens during learning, they are not specifically designed for agent tuning to discover and emphasize reasoning tokens, resulting in comparatively lower performance. Specifically, Rho-1 targets identifying noise tokens to mask during tuning, but fails to distinguish between normal boilerplate and reasoning tokens. The RewardFT method leverages token-level rewards from a DPO-based reward model aligned with human preferences to differentiate tokens, but it is also not designed to identify reasoning tokens that are essential for agent-specific capabilities.

Both SHAD and RFT are Crucial. When comparing SHAD+RFT with its variants, Regex+RFT and SHAD+\(\alpha\)-FT, the original SHAD+RFT consistently demonstrates superior performance. We explain the results as follows:

  • Adaptive weighting in RFT is crucial. Comparing the proposed SHAD+RFT with its variant SHAD+\(\alpha\)-FT, SHAD+RFT consistently outperforms, demonstrating the superiority of RFT’s adaptive mechanism over the fixed weighting approach of \(\alpha\)-FT. This advantage stems from adaptive weighting’s ability to better align with the dynamic learning process, adaptively adjusting weights for reasoning and boilerplate token components, thereby preventing over-learning or under-learning of either part.

  • The importance of SHAD for token differentiation. Replacing SHAD with Regex in SHAD+RFT leads to a significant drop in model performance. This highlights that the effectiveness of reasoning-highlighted fine-tuning depends on accurate token differentiation. The results also demonstrate SHAD’s superior ability to disentangle boilerplate tokens from reasoning tokens. In contrast, Regex relies solely on regular expressions to identify formatting tokens, failing to fully distinguish between template-connecting tokens (one part of boilerplate tokens) and reasoning tokens.

This indicates that replacing either SHAD or RFT diminishes the method’s effectiveness, affirming the importance of both components.

5 Analysis on SHAD and RFT

In this section, we first present a case study on the effectiveness of SHAD in distinguishing different tokens, followed by a comprehensive analysis of how RFT functions.


Figure 6: Training loss for SFT and our RFT (based on SHAD). Left: Overall training loss; Right: Training loss for the reasoning token part and the boilerplate token part.

Case study of tokens classified by SHAD. To further validate SHAD’s ability to identify reasoning tokens, we conducted a series of case studies, with one example of the classification results shown in Figure 5 (additional examples are provided in Appendix 13). As shown in the figure, SHAD successfully classifies most query-dependent information (i.e., information related to ‘smart-phones’) as reasoning tokens, while formatting tokens (e.g., the attribute names ‘Thought’ and ‘Action’) and common template-connecting tokens like ‘I should call’ and ‘this API’ are classified as boilerplate tokens. This outcome aligns with human understanding of reasoning tokens, further verifying the effectiveness of our method. Interestingly, SHAD does not classify the entire function name ‘smart_phones_for_amazon_api_v2’ as reasoning but only the ‘smart_phones’ portion. We suspect this is because the ‘amazon_api_v2’ part is common across many function names. Additionally, when this function name appears in the ‘Action’ field, it is classified as boilerplate, as it is copied from the thought rather than being part of the reasoning process.

Besides, for tokens that can be manually annotated, we conducted a quantitative analysis of classification accuracy, presented in Appendix 12. The results show that our method achieves a very low classification error rate (<3%).

RFT Enhances Reasoning Token Learning. Blindly treating reasoning and boilerplate tokens equally, as done in SFT, can lead to overfitting on boilerplate tokens while insufficiently learning reasoning tokens. To further verify the effectiveness of RFT, we compare the training loss between SFT and RFT, with results summarized in Figure 6. The findings indicate that RFT significantly reduces the loss for reasoning tokens while maintaining a comparable loss for boilerplate tokens relative to SFT, confirming that RFT effectively enhances the learning of reasoning tokens. Additionally, we conducted case studies on the model’s output, presented in Appendix 14, to assess whether our method improves model reasoning. The results show that our method enhances the model’s ability to correctly apply functions in reasoning components (e.g., providing accurate parameters) while preventing overfitting to training formats. A detailed discussion is available in Appendix 14.

The Effect of Hyper-parameter \(\tau\). The temperature coefficient \(\tau\) in Equation 1 plays a crucial role in controlling the strength of the re-weighting mechanism in RFT, so we next investigate its impact. Specifically, we vary \(1/\tau\) within the range of [0, 2] and analyze the corresponding performance of SHAD+RFT (averaged over all evaluation datasets). The results are illustrated in Figure 7. From the figure, we observe that the performance of our method first increases and then roughly decreases as \(1/\tau\) increases, i.e., as the re-weighting mechanism is gradually strengthened. This indicates the importance of carefully selecting the optimal \(\tau\). Fortunately, across a wide range of \(\tau\), SHAD+RFT consistently outperforms regular SFT and surpasses most baselines (cf. Table 1).

Figure 7: The performance of our SHAD+RFT method as the temperature coefficient \(\tau\) varies. The performance averaged over all evaluation datasets is reported, with LLaMA3-8B as the backbone. Notably, \(1/\tau = 0\) means assigning equal weights to the reasoning and boilerplate parts, i.e., deactivating our re-weighting mechanism.

6 Conclusion

In this paper, we highlighted the importance of distinguishing between reasoning and boilerplate tokens and introduced a SHuffle-Aware Discriminator (SHAD) to automatically achieve this. Building on SHAD, we further developed a new Reasoning-Highlighted Fine-Tuning (RFT) method to enhance reasoning learning during LLM fine-tuning, thereby improving agent capabilities. Extensive results demonstrated that our method significantly enhances LLMs’ ability to solve complex real-world problems. In the future, we plan to extend our approach to the entire SFT domain and develop more refined mechanisms, such as token-level re-weighting, to better leverage our token differentiation results.

7 Limitations

We identify several limitations of our method in both token differentiation and re-weighting during training. First, the effectiveness of our approach depends on boilerplate tokens remaining consistent across different samples. When this consistency is lacking, such as in cases where the diversity of boilerplate tokens is high, our method may fail. Second, our distinction between reasoning and boilerplate tokens relies on rigid, manually defined thresholds for loss differences, which may need refinement. Third, our weighting strategy is currently applied only at the group level, and future optimization may be required at the token level.

Additionally, even with improved reasoning capabilities, model outputs may still exhibit unpredictable behaviors in real-world deployments, potentially leading to incorrect or unsafe actions. There’s also a risk that our approach could reinforce certain biases present in the training data, particularly if those biases are related to reasoning patterns and tool usage decisions. Future work should investigate these risks more comprehensively.

8 Ethical Considerations

All experiments were conducted using publicly available datasets and models, ensuring no privacy concerns. The Toolbench and ShareGPT datasets are licensed under Apache-2.0, while APIGen is licensed under CC-BY-4.0. The training data was carefully curated and processed to exclude any personally identifiable information. We have maintained transparency in our methodology and results, acknowledging both the strengths and limitations of our approach.

Regarding the use of large language models, we utilized ChatGPT to help polish the writing at the sentence level.

9 Detailed Information of Training Datasets

We provide more details of our training datasets in Table 2. To enable the multi-step reasoning ability of LLMs, we choose ToolBench [6] and APIGen [8] as our basic datasets. Following the practice in AgentTuning [11] and Agent-Flan [12], we also mix ShareGPT [27] with the basic datasets for training. We filter out obviously low-quality data that does not follow the requested format and sample 5k examples from APIGen for data balance. All methods use the same dataset, and we do not apply token differentiation to the general data.

Table 2: Training Dataset detail in our experiment
Dataset Data Size
APIGen 5000
ToolBench 22993
ShareGPT 93481
Total 121474

10 Experimental Details and Resources Required

Table 3 lists the hyper-parameters used in our model training. For evaluation, we set the inference temperature to \(10^{-6}\) to ensure reproducibility. When utilizing GPT-4 for evaluation, we follow the practice in ToolLLM [32] and evaluate each response 3 times.

Table 3: Hyperparameters used for model training. Both LLaMA3-8B and LLaMA3.1-8B were trained on NVIDIA A100 GPUs with a batch size of 32 and a maximum sequence length of 3072. Each training session utilized 8 GPUs and took approximately 8 hours.
Params LLaMA3-8B LLaMA3.1-8B
learning rate 1e-5 1e-5
warmup ratio 0.05 0.05
max length 3072 3072
batch size 32 32
gpus 8 8

11 Implementation Details

11.1 Implementation Details of Rho-1

For the Rho-1 baseline, we train the reference model in the self-reference setting [15]. Specifically, we sample 5% of the data from our training dataset to train the reference model. We follow the original implementation, which focuses training on H\(\rightarrow\)L tokens (i.e., tokens whose loss decreases from high to low during reference-model training) and masks the other tokens.
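
As a concrete (hypothetical) illustration of this selection rule, a token mask could be built by comparing each token's loss under the reference model before and after its training, keeping only tokens whose loss drops from high to low; the thresholds below are assumptions, not values from the original work.

import torch

def h2l_token_mask(loss_before, loss_after, high_thresh=2.0, low_thresh=0.5):
    """Select H->L tokens for the Rho-1 baseline: tokens whose reference-model
    loss starts high and drops low after training. Non-selected tokens are
    masked out of the fine-tuning loss. Thresholds are illustrative only.
    """
    return (loss_before > high_thresh) & (loss_after < low_thresh)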

11.2 Implementation Details of RewardFT

For the RewardFT baseline, due to the lack of agent preference data, we use the general DPO datasets ORCA DPO [33] and UltraFeedback [34] to train the model as a token-level reward model, under the same setting as [17]. We compute the token-level reward given by the preference model and then follow the practice of weighted-MLE [35], taking the softmax over all token rewards as weights to train the model.
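
A sketch of this weighting scheme is shown below: token-level rewards from the preference model are converted into weights via a softmax and used to reweight the maximum-likelihood loss. The exact normalization and any temperature scaling are assumptions.

import torch

def reward_weighted_loss(token_losses, token_rewards):
    """RewardFT baseline (weighted MLE): reweight per-token losses by a
    softmax over token-level rewards from the preference model.
    """
    weights = torch.softmax(token_rewards, dim=-1)  # weights sum to 1 over tokens
    return (weights * token_losses).sum()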

11.3 Implementation Details of \(\alpha\)-FT

A simple and common method for addressing imbalanced training is to manually assign a fixed weight to each type of token [36]. Here we introduce a weighting factor \(\alpha \in [0, 0.5]\) for boilerplate tokens and \(1-\alpha\) for reasoning tokens. Let \(\mathcal{L}_{b}\) and \(\mathcal{L}_r\) represent the total loss for the boilerplate and reasoning tokens, respectively. The re-weighted loss (denoted as \(\mathcal{L}_{\alpha\text{-balance}}\)) can be formulated as follows: \[\mathcal{L}_{\alpha\text{-balance}} = \alpha \mathcal{L}_{b} + (1-\alpha) \mathcal{L}_r.\] This loss is a simple extension of the cross-entropy loss, which we call \(\alpha\)-FT in this paper and consider as an experimental baseline for our proposed RFT method.

12 Analysis of Token Classification

12.1 Challenges in Manual Token Annotation

The task of manually annotating tokens as either reasoning or boilerplate presents significant challenges that make it impractical for large-scale validation. To illustrate these challenges, we present a detailed example:

Consider the following agent response:

Thought: Based on the user's request to 
fetch weather data for NewYork, I should 
call the get_weather function. This API 
requires the city name and will return 
current weather conditions.
Action: get_weather
Action Input: {"city": "NewYork"}

In this example, while some tokens are clearly boilerplate (e.g., "Action:" and "Action Input:"), others are more ambiguous: "Based on" could be considered a template-connecting phrase (boilerplate) or part of the reasoning process; "should call" might be viewed either as reasoning (indicating decision-making) or as a standard template phrase; and the span "city name and will return" combines reasoning content with standard connecting phrases.

This intricate interweaving of reasoning and boilerplate elements makes consistent manual annotation extremely challenging, even for expert annotators.

12.2 Evaluation of SHAD Classification

Given these challenges, we instead focused on evaluating our SHAD method against the subset of tokens that can be clearly classified, namely formatting tokens that can be identified through regular expressions. We conducted this evaluation on our two training datasets:

Table 4: SHAD Classification Performance on Formatting Tokens
Dataset Misclassification Rate
ToolBench 0.82%
APIGen 2.62%

Format tokens include:

  • JSON formatting tokens (e.g., "{", "}", ":", "[", "]")

  • Standard field identifiers (e.g., "Thought:", "Action:", "Action Input:")

  • Common keys in structured tool-call outputs (e.g., "tool_call", "name", "arguments")

The low misclassification rates shown in Table 4 on these unambiguous tokens provide strong evidence for SHAD’s effectiveness in identifying boilerplate elements. While this evaluation only covers a subset of all boilerplate tokens, it represents the most objective measure possible given the inherent ambiguity in token classification.

Beyond direct evaluation of classification accuracy, we validate our approach through improvements in downstream task performance (see Section 4.2). The significant performance gains achieved by using SHAD’s classification results for weighted training suggest that the method effectively identifies meaningful token distinctions, even for cases where ground truth labels are unavailable.

13 More Examples Labeled by SHAD

In Figure 8, we show several examples of tokens classified by our SHAD method, with blue regions representing reasoning tokens and brown regions indicating boilerplate tokens.

Figure 8: More case studies of tokens classified by SHAD. The blue regions represent reasoning tokens, identified by an increase in loss on the model tuned with shuffled data compared to the original model. In contrast, the brown regions indicate boilerplate tokens, characterized by a decrease in loss on the tuned model.

14 Qualitative Analysis

In this section, we present several examples in Figure 9 showing how a model trained with our method yields more accurate answers than a model trained with naïve SFT. In the response generated by the naïve SFT model, we observe overfitting, with format tokens (yellow) and template-connecting tokens (blue) being erroneously generated. Additionally, the naïve SFT model exhibits hallucination, leading to reasoning errors (red). In contrast, our SHAD+RFT method successfully follows the Held-Out instructions and provides accurate reasoning.

15 More Examples of Shuffled Data

In this section, we provide more examples of shuffled data in Figure 10 to support the claim that shuffling the correspondence between inputs and outputs across data samples does not alter the predictability of boilerplate tokens, while reasoning tokens become disrupted after the shuffling.

Figure 9: Comparison example on the Held-Out benchmark Nexus. In the response generated by the naïve SFT model, we observe overfitting, with format tokens and template-connecting tokens being erroneously generated. Additionally, the naïve SFT model exhibits hallucination, leading to reasoning errors. In contrast, our SHAD+RFT method successfully follows the Held-Out instructions and provides accurate reasoning; we explicitly mark the differing reasoning parts in red.

Figure 10: More examples of shuffled data. After shuffling, the assistant’s responses no longer correspond to the original queries. However, some tokens (boilerplate tokens, red) remain semantically similar to the original response and are therefore predictable. In contrast, reasoning tokens (green) no longer align with the query, resulting in noise.

References

[1]
Lilian Weng. 2023. https://lilianweng.github.io/posts/2023-06-23-agent/. lilianweng.github.io.
[2]
Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, Wayne Xin Zhao, Zhewei Wei, and Jirong Wen. 2024. https://doi.org/10.1007/S11704-024-40231-1. Frontiers Comput. Sci., 18(6):186345.
[3]
Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. http://papers.nips.cc/paper_files/paper/2022/hash/9d5609613524ecf4f15af0f7b31abca4-Abstract-Conference.html. In Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022.
[4]
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R. Narasimhan, and Yuan Cao. 2023. https://openreview.net/forum?id=WE_vluYUL-X. In The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023. OpenReview.net.
[5]
Shuofei Qiao, Ningyu Zhang, Runnan Fang, Yujie Luo, Wangchunshu Zhou, Yuchen Jiang, Chengfei Lv, and Huajun Chen. 2024. https://doi.org/10.18653/v1/2024.acl-long.165. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 3003–3021, Bangkok, Thailand. Association for Computational Linguistics.
[6]
Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2024. https://openreview.net/forum?id=dHng2O0Jjr. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net.
[7]
Timo Schick, Jane Dwivedi-Yu, Roberto Dessı̀, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. http://papers.nips.cc/paper_files/paper/2023/hash/d842425e4bf79ba039352da0f658a906-Abstract-Conference.html. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
[8]
Zuxin Liu, Thai Hoang, Jianguo Zhang, Ming Zhu, Tian Lan, Shirley Kokane, Juntao Tan, Weiran Yao, Zhiwei Liu, Yihao Feng, Rithesh Murthy, Liangwei Yang, Silvio Savarese, Juan Carlos Niebles, Huan Wang, Shelby Heinecke, and Caiming Xiong. 2024. http://arxiv.org/abs/2406.18518. arXiv preprint. ArXiv:2406.18518 [cs].
[9]
Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. 2023. Gorilla: Large language model connected with massive apis. arXiv preprint arXiv:2305.15334.
[10]
Baian Chen, Chang Shu, Ehsan Shareghi, Nigel Collier, Karthik Narasimhan, and Shunyu Yao. 2023. https://arxiv.org/abs/2310.05915. Preprint, arXiv:2310.05915.
[11]
Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. 2023. http://arxiv.org/abs/2310.12823. arXiv preprint. ArXiv:2310.12823 [cs].
[12]
Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, and Feng Zhao. 2024. https://doi.org/10.18653/V1/2024.FINDINGS-ACL.557. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 9354–9366. Association for Computational Linguistics.
[13]
Andrew Zhao, Daniel Huang, Quentin Xu, Matthieu Lin, Yong-Jin Liu, and Gao Huang. 2023. http://arxiv.org/abs/2308.10144. arXiv preprint. ArXiv:2308.10144 [cs].
[14]
Zehui Chen, Kuikun Liu, Qiuchen Wang, Wenwei Zhang, Jiangning Liu, Dahua Lin, Kai Chen, and Feng Zhao. 2024. http://arxiv.org/abs/2403.12881. arXiv preprint. ArXiv:2403.12881 [cs].
[15]
Zhenghao Lin, Zhibin Gou, Yeyun Gong, Xiao Liu, Yelong Shen, Ruochen Xu, Chen Lin, Yujiu Yang, Jian Jiao, Nan Duan, and Weizhu Chen. 2024. http://arxiv.org/abs/2404.07965. arXiv preprint. ArXiv:2404.07965 [cs].
[16]
Shentao Yang, Shujian Zhang, Congying Xia, Yihao Feng, Caiming Xiong, and Mingyuan Zhou. Preference-grounded Token-level Guidance for Language Model Fine-tuning.
[17]
Rafael Rafailov, Joey Hejna, Ryan Park, and Chelsea Finn. 2024. http://arxiv.org/abs/2404.12358. arXiv preprint. ArXiv:2404.12358 [cs].
[18]
Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. http://papers.nips.cc/paper_files/paper/2023/hash/77c33e6a367922d003ff102ffb92b658-Abstract-Conference.html. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
[19]
Reiichiro Nakano, Jacob Hilton, Suchir Balaji, Jeff Wu, Long Ouyang, Christina Kim, Christopher Hesse, Shantanu Jain, Vineet Kosaraju, William Saunders, Xu Jiang, Karl Cobbe, Tyna Eloundou, Gretchen Krueger, Kevin Button, Matthew Knight, Benjamin Chess, and John Schulman. 2021. https://arxiv.org/abs/2112.09332. CoRR, abs/2112.09332.
[20]
Shunyu Yao, Howard Chen, John Yang, and Karthik Narasimhan. 2022. Webshop: Towards scalable real-world web interaction with grounded language agents. Advances in Neural Information Processing Systems, 35:20744–20757.
[21]
Yu Du, Fangyun Wei, and Hongyang Zhang. 2024. https://doi.org/10.48550/arXiv.2402.04253. Preprint, arXiv:2402.04253.
[22]
Bhargavi Paranjape, Scott Lundberg, Sameer Singh, Hannaneh Hajishirzi, Luke Zettlemoyer, and Marco Tulio Ribeiro. 2023. https://doi.org/10.48550/arXiv.2303.09014. Preprint, arXiv:2303.09014.
[23]
Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. 2023. http://papers.nips.cc/paper_files/paper/2023/hash/1b44b878bb782e6954cd888628510e90-Abstract-Conference.html. In Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 16, 2023.
[24]
Haojie Pan, Zepeng Zhai, Hao Yuan, Yaojia Lv, Ruiji Fu, Ming Liu, Zhongyuan Wang, and Bing Qin. 2024. http://arxiv.org/abs/2312.04889. arXiv preprint. ArXiv:2312.04889 [cs].
[25]
Weizhou Shen, Chenliang Li, Hongzhan Chen, Ming Yan, Xiaojun Quan, Hehong Chen, Ji Zhang, and Fei Huang. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.929. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 16658–16680, Miami, Florida, USA. Association for Computational Linguistics.
[26]
Qiaoyu Tang, Ziliang Deng, Hongyu Lin, Xianpei Han, Qiao Liang, Boxi Cao, and Le Sun. 2023. http://arxiv.org/abs/2306.05301. arXiv preprint. ArXiv:2306.05301 [cs].
[27]
2024. https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered.
[28]
Zhicheng Guo, Sijie Cheng, Hao Wang, Shihao Liang, Yujia Qin, Peng Li, Zhiyuan Liu, Maosong Sun, and Yang Liu. 2024. https://doi.org/10.18653/V1/2024.FINDINGS-ACL.664. In Findings of the Association for Computational Linguistics, ACL 2024, Bangkok, Thailand and virtual meeting, August 11-16, 2024, pages 11143–11156. Association for Computational Linguistics.
[29]
https://gorilla.cs.berkeley.edu/leaderboard.html.
[30]
Zehui Chen, Weihua Du, Wenwei Zhang, Kuikun Liu, Jiangning Liu, Miao Zheng, Jingming Zhuo, Songyang Zhang, Dahua Lin, Kai Chen, and Feng Zhao. 2024. https://doi.org/10.18653/V1/2024.ACL-LONG.515. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ACL 2024, Bangkok, Thailand, August 11-16, 2024, pages 9510–9529. Association for Computational Linguistics.
[31]
Nexusflow.ai team. 2023. https://nexusflow.ai/blogs/ravenv2.
[32]
Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, Sihan Zhao, Lauren Hong, Runchu Tian, Ruobing Xie, Jie Zhou, Mark Gerstein, Dahai Li, Zhiyuan Liu, and Maosong Sun. 2023. http://arxiv.org/abs/2307.16789. arXiv preprint. ArXiv:2307.16789 [cs].
[33]
2024. https://huggingface.co/datasets/Intel/orca_dpo_pairs.
[34]
2024. https://huggingface.co/datasets/allenai/ultrafeedback_binarized_cleaned.
[35]
Shentao Yang, Shujian Zhang, Congying Xia, Yihao Feng, Caiming Xiong, and Mingyuan Zhou. 2023. http://arxiv.org/abs/2306.00398. arXiv preprint. ArXiv:2306.00398 [cs].
[36]
Tsung-Yi Lin, Priya Goyal, Ross Girshick, Kaiming He, and Piotr Dollár. 2020. https://doi.org/10.1109/TPAMI.2018.2858826. IEEE Transactions on Pattern Analysis and Machine Intelligence, 42(2):318–327.

  1. Work was done during an internship at Alibaba Group.

  2. We will later provide practical examples in Section 3.1 to illustrate how shuffling can cause the reasoning parts of a response to mismatch with the corresponding queries.

  3. These tokens could be further categorized into formatting tokens and template-connecting phrases based on their losses if needed.

  4. Accuracy is reported by abstract syntax tree evaluation for BFCL.

  5. We select only the three most difficult subsets of StableToolBench: I2-Category, I3-Instruction, and I1-Tool.