December 19, 2024
Supervised fine-tuning (SFT) plays a crucial role in adapting large language models (LLMs) to specific domains or tasks. However, as demonstrated by empirical experiments, the data collected in practical applications inevitably contains noise, which poses significant challenges to model performance on downstream tasks. Therefore, there is an urgent need for a noise-robust SFT framework to enhance model capabilities on downstream tasks. To address this challenge, we introduce a robust SFT framework (RobustFT) that performs noise detection and relabeling on downstream task data. For noise identification, our approach employs a multi-expert collaborative system with reasoning-enhanced models to achieve superior noise detection. In the denoising phase, we utilize a context-enhanced strategy, which incorporates the most relevant and confident knowledge followed by careful assessment to generate reliable annotations. Additionally, we introduce an effective data selection mechanism based on response entropy, ensuring only high-quality samples are retained for fine-tuning. Extensive experiments conducted on multiple LLMs across five datasets demonstrate RobustFT’s exceptional performance in noisy scenarios. Our code and data are publicly available.
Supervised fine-tuning (SFT) has emerged as a critical technique for optimizing the capabilities of Large Language Models (LLMs), particularly for adapting them to domain-specific tasks and scenarios [1], [2]. High-quality scenario-specific data plays a vital role in enhancing model performance on downstream tasks [3], [4]. While such data can be acquired through various means, including human annotation [5], scenario-specific collection [6], and model-based self-labeling [7], these data sources inherently contain noise stemming from both human annotation errors and model hallucinations [8].
Data noise in downstream scenarios can have catastrophic effects on model performance. As shown in Figure 1, the MMLU [9] evaluation results clearly demonstrate this degradation: as the proportion of noisy data increases, model accuracy declines sharply. Specifically, with just \(30\%\) noise in the training data, the model’s performance deteriorates by \(8.9\%\) compared to the vanilla LLM baseline, and the degradation becomes increasingly severe as noise levels rise further. These findings underscore the critical importance and practical value of developing noise-robust fine-tuning frameworks for LLMs to maintain reliable downstream performance. This motivates our central research question:
Can LLMs detect the inevitable noise and enhance data quality to improve their performance on target tasks?
The development of a noise-robust LLM fine-tuning framework encounters two major challenges. First, direct noise detection through LLM predictions proves unreliable due to model hallucinations and overconfidence, as validated by our empirical studies in Section 4. Second, while existing noise-robust methods work well for classification tasks with discrete label spaces [10], [11], they are inadequate for LLM fine-tuning scenarios that require contextual and open-ended text generation; in particular, traditional relabeling strategies fail to utilize the valuable information contained in noisy generated responses. These challenges highlight the complexity of developing a framework that effectively leverages both model capabilities and data characteristics for robust noise detection and denoising in LLM fine-tuning.
In this paper, we propose RobustFT (Noise-robust LLM Supervised Fine-Tuning), a framework for effective adaptation to downstream scenarios with noisy data. At its core, RobustFT introduces multi-view noise detection and denoising strategies. For noise detection, RobustFT employs a collaborative multi-expert system, incorporating reasoning-enhanced models to identify potentially noisy data effectively.

For the identified noisy data, RobustFT designs a denoising and data selection process. First, RobustFT utilizes high-confidence data as contextual references for reliable relabeling of noisy samples. Subsequently, for both context-enhanced and reasoning-enhanced inference, RobustFT employs a Review Agent to examine and synthesize responses. Finally, by computing confidence scores based on model response entropy and excluding low-confidence samples, we obtain a denoised fine-tuning dataset that facilitates model adaptation to downstream tasks. Overall, by combining noise detection and denoising processes, RobustFT effectively enhances the quality of the fine-tuning dataset while maximizing data utility.

We validate RobustFT’s effectiveness through extensive experiments across five datasets, spanning both general and domain-specific tasks with varying noise levels. Through comprehensive comparative analyses and ablation studies, we demonstrate the superiority of our approach.
Our contributions can be summarized as follows:
New Perspective. We investigate the critical yet understudied challenge of noise-robust supervised fine-tuning for LLMs, which aligns more closely with real-world scenarios where noise is inevitable.
Principled Methodology. We design a self-contained framework to leverage the intrinsic interactions between models and data for effective noise detection and denoising, eliminating dependencies on external models or resources.
Superior Performance. RobustFT exhibits robust performance across diverse noise conditions, demonstrating significant improvements on three open-source LLMs across both general and domain-specific tasks, which validates its broad applicability and practical value.
In practical applications of Large Language Models (LLMs), our objective extends beyond enhancing their general capabilities to improving their performance on downstream tasks. To achieve this, we utilize Supervised Fine-Tuning (SFT) to optimize an LLM \({\mathcal{M}}\) for a target downstream task \({\mathcal{D}}_{task} = \{q_i,~y_i\}_{i=1}^N\), where \(q_i\) denotes the query and \(y_i\) is the expected response. The model’s performance is enhanced by minimizing the loss between its predictions and the expected outputs.
However, the effectiveness of SFT is heavily dependent on the quality of the downstream task data [3], [12]. Various factors, including annotation errors, data processing inconsistencies, and model hallucinations, can introduce both random and systematic noise into downstream datasets \({\mathcal{D}}_{task}\). Our empirical studies in Section 4 demonstrate that \(30\%\) noise in the training data can lead to an \(8.9\%\) degradation on downstream tasks. Therefore, developing robust mechanisms for noise detection and mitigation during the SFT process, particularly ones that can effectively handle open-ended text generation, is crucial and holds significant practical value for optimizing LLM performance.
As discussed above, during the fine-tuning of LLMs on downstream tasks, the training data contains both correctly and incorrectly labeled data pairs. Our primary objective is to develop an effective mechanism for identifying these mislabeled instances. Furthermore, we aim to leverage both the model’s capabilities and contextual information within the dataset to denoise incorrectly labeled data pairs where possible. Through this process, we seek to construct a refined dataset with reduced noise levels. Ultimately, this curated dataset enables more effective enhancement of LLM performance on downstream tasks.
Adapting and fine-tuning Large Language Models (LLMs) in real-world scenarios presents significant challenges, particularly due to the presence of noise in downstream task datasets that can compromise model performance. Our approach addresses this challenge through a systematic framework comprising noise detection and denoising mechanisms to prevent performance degradation.
For noise detection, we leverage the consensus among multiple expert LLMs and employ a Checker to identify noisy samples. For denoising, we employ a two-pronged approach: first, we utilize context-enhanced reasoning with clean samples to relabel noisy instances through a Review Agent; second, we implement an entropy-based data selection mechanism to exclude samples with low confidence scores. As demonstrated in Figure 2, this dual-process framework effectively mitigates noise-induced performance deterioration.
Effective noise identification is crucial for handling noisy data in downstream tasks. In our approach, we leverage collaborative learning among multiple LLMs to uncover potentially noisy samples, enabling a more robust detection mechanism.
Initially, we utilize the base LLM to generate predictions for all data samples: \[\label{eq:LLM-inference} \hat{y}_i = \mathcal{M}(q_i)\,,\tag{1}\] where \(q_i\) represents the query, \(\mathcal{M}\) denotes the LLM and \(\hat{y}_i\) is the base prediction.
For internal noise detection, we introduce a reasoning-enhanced LLM that iteratively combines reasoning and reflection processes. This LLM first performs step-by-step reasoning, followed by self-reflection on its reasoning path, and iterates between these two stages to achieve superior reasoning capabilities. For each data sample, this iterative process can be formalized as: \[\label{eq:reasoning-enhanced} \hat{y}^{\text{reas}}_i = {\mathcal{M}}_{\text{Reas}}\left(q_i,~{\mathcal{M}}_{\text{Refl}}\left({\mathcal{M}}_{\text{Reas}}\left(q_i,~\cdots\right)\right)\right) \,,\tag{2}\] where \(\hat{y}^{\text{reas}}_i\) represents the final prediction, \({\mathcal{M}}_{\text{Reas}}\) and \({\mathcal{M}}_{\text{Refl}}\) denote the reasoning and reflection LLMs, respectively, with each reflection stage evaluating and refining the previous reasoning output.
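To make this loop concrete, the sketch below shows one plausible realization of Eq. (2); the `llm_call` wrapper, prompt wording, and stopping heuristic are our illustrative assumptions rather than the paper’s exact implementation.

```python
# Hypothetical sketch of the reasoning-reflection loop in Eq. (2).
# `llm_call(prompt: str) -> str` is an assumed wrapper around any chat LLM API.

def reasoning_enhanced_predict(llm_call, query: str, max_rounds: int = 3) -> str:
    """Alternate step-by-step reasoning with self-reflection on the reasoning path."""
    reasoning = llm_call(f"Answer the question step by step.\nQuestion: {query}")
    for _ in range(max_rounds - 1):
        # Reflection stage: evaluate the previous reasoning output.
        reflection = llm_call(
            "Review the reasoning below for errors. Reply 'NO ERRORS' if it is sound, "
            "otherwise describe the mistakes.\n"
            f"Question: {query}\nReasoning: {reasoning}"
        )
        if "no errors" in reflection.lower():
            break  # reasoning path judged sound; stop iterating
        # Reasoning stage: refine the answer using the reflection feedback.
        reasoning = llm_call(
            "Revise the reasoning using the feedback.\n"
            f"Question: {query}\nPrevious reasoning: {reasoning}\nFeedback: {reflection}"
        )
    return reasoning  # final prediction y_i^reas
```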
To ensure prediction reliability, we implement a consistency-based Checker mechanism that analyzes multiple prediction sources: the original label (\(y_i\)), the base LLM prediction (\(\hat{y}_i\)), and the reasoning-enhanced prediction (\(\hat{y}^{\text{reas}}_i\)). This mechanism evaluates the agreement among these predictions through a consistency metric: \[\label{eq:checker} r_i = \texttt{Checker}(y_i,~\hat{y}_i,~\hat{y}^{\text{reas}}_i) \in \{0,1\} \,,\tag{3}\] where \(r_i=1\) indicates high prediction consistency (reliable sample) and \(r_i=0\) indicates prediction inconsistency (potentially noisy sample). Based on this consistency evaluation, we partition the dataset into clean samples \({\mathcal{D}}_{\text{clean}} = \{(q_i, y_i)\,|\, r_i = 1\}\) and potentially noisy samples \({\mathcal{D}}_{\text{noise}} = \{(q_i, y_i)\,|\, r_i = 0\}\) for subsequent denoising treatment.
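A minimal sketch of how this partition could be implemented is shown below; the exact-match comparator is an assumption (the paper’s Checker could equally be an LLM-based judge):

```python
def checker(y: str, y_hat: str, y_reas: str) -> int:
    """Eq. (3): return 1 when the label agrees with both predictions, else 0."""
    agree = lambda a, b: a.strip().lower() == b.strip().lower()  # assumed comparator
    return int(agree(y, y_hat) and agree(y, y_reas))

def partition(samples, base_preds, reas_preds):
    """Split the task data into D_clean (r_i = 1) and D_noise (r_i = 0)."""
    d_clean, d_noise = [], []
    for (q, y), y_hat, y_reas in zip(samples, base_preds, reas_preds):
        (d_clean if checker(y, y_hat, y_reas) else d_noise).append((q, y))
    return d_clean, d_noise
```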
For the potentially noisy dataset \({\mathcal{D}}_{\text{noise}}\), we employ an in-context learning approach for data relabeling, leveraging knowledge from the reliable subset to reduce noise in the data. Specifically, we project queries from both the reliable dataset \({\mathcal{D}}_{\text{clean}}\) and the potentially noisy dataset \({\mathcal{D}}_{\text{noise}}\) into a shared latent space: \[\label{eq:data-relabeling-encoding} h_i = \text{Encoder}(q_i) \in \mathbb{R}^d\,,\tag{4}\] where \(h_i\) represents the \(d\)-dimensional latent representation of query \(q_i\) obtained through the encoder network.
During inference, for each noisy sample, we retrieve the \(k\) most similar samples from the reliable dataset as context for reasoning: \[\label{eq:context-denoising} \hat{y}^{\text{cont}}_i = {\mathcal{M}}\left(q_i \;\middle|\; \left\{(q_j,~y_j)\right\}_{j \in {\mathcal{N}}_k\left(q_i,~{\mathcal{D}}_{\text{clean}}\right)}\right)\,,\tag{5}\] where \({\mathcal{N}}_k\left(q_i,~{\mathcal{D}}_{\text{clean}}\right)\) denotes the indices of the \(k\) most similar samples to \(q_i\) in \({\mathcal{D}}_{\text{clean}}\) based on their latent representations.
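As a sketch, the retrieval step in Eqs. (4)–(5) might look as follows; the sentence-transformers encoder is an assumed choice, not the paper’s specified one:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # assumed encoder; any sentence encoder works

def retrieve_context(noisy_query: str, d_clean: list[tuple[str, str]], k: int = 3):
    """Return the k clean (query, label) pairs nearest to the noisy query."""
    clean_queries = [q for q, _ in d_clean]
    # Normalized embeddings make the inner product equal cosine similarity.
    H = encoder.encode(clean_queries, normalize_embeddings=True)    # Eq. (4)
    h = encoder.encode([noisy_query], normalize_embeddings=True)[0]
    top_k = np.argsort(H @ h)[::-1][:k]                             # N_k in Eq. (5)
    return [d_clean[j] for j in top_k]
```

The retrieved pairs are then formatted as few-shot demonstrations in the prompt when querying \(\mathcal{M}\) for \(\hat{y}^{\text{cont}}_i\).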
By incorporating the retrieved context, we enable the model to generate more reliable responses \(\hat{y}^{\text{cont}}_i\). Combined with the previously obtained reasoning-enhanced predictions \(\hat{y}^{\text{reas}}_i\), we introduce a Review Agent to evaluate and relabel the data: \[\label{eq:review-agent} \tilde{y}_i = \texttt{Review}(q_i,~\hat{y}^{\text{cont}}_i,~\hat{y}^{\text{reas}}_i)\,.\tag{6}\] Through the Review Agent’s assessment and synthesis, we obtain the relabeled predictions \(\tilde{y}_i\), forming the denoised dataset \({\mathcal{D}}_{\text{denoise}}\).
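The Review Agent can be realized as one additional LLM call that adjudicates between the two candidates; the prompt below is an illustrative assumption:

```python
def review_agent(llm_call, query: str, y_cont: str, y_reas: str) -> str:
    """Eq. (6): assess both candidate answers and synthesize a relabeled response."""
    return llm_call(
        "You are a careful reviewer. Two candidate answers to the question are "
        "given. Assess their reliability and output the single best answer.\n"
        f"Question: {query}\n"
        f"Candidate A (context-enhanced): {y_cont}\n"
        f"Candidate B (reasoning-enhanced): {y_reas}"
    )
```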
However, considering the potential for model errors and uncertainties, we must implement a data selection mechanism for the self-annotated denoised dataset to ensure quality and reliability.
While our denoising process generates a refined dataset \({\mathcal{D}}_{\text{denoise}}\) through self-annotation, ensuring the quality of these auto-labeled samples remains crucial. To maintain high data quality and prevent error propagation during subsequent training, we introduce a confidence-based filtering mechanism leveraging entropy metrics. This approach enables us to quantitatively assess the uncertainty in context-enhanced predictions and retain only the most confident samples.
The entropy score for each context-enhanced response is computed as: \[\label{eq:entropy} H(\hat{y}^{\text{cont}}_i) = -\frac{1}{N}\sum_{j=1}^N \log p(y_{ij}|q_i, y_{i<j})\,,\tag{7}\] where \(p(y_{ij}|q_i, y_{i<j})\) represents the model’s prediction probability for the \(j\)-th token conditioned on the input query and previous tokens, and \(N\) denotes the sequence length. Lower entropy scores indicate higher model confidence and more deterministic predictions. Based on these scores, we rank and filter the samples to form our final selected dataset: \[\label{eq:data-selection} D_{\text{select}} = \left\{(q_i, \tilde{y}_i) \;\middle|\; \text{rank}\left(H\left(\hat{y}^{\text{cont}}_i\right)\right) \leq \beta |D_{\text{denoise}}|\right\}\,,\tag{8}\] where \(\beta\) controls the selection ratio, which defaults to \(50\%\) and will be validated in Section 4.3.2.
Through this process, we obtain \({\mathcal{D}}_{\text{select}}\), which demonstrates reduced noise levels and higher confidence scores through the combined application of denoising relabeling and selective filtering.
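A sketch of the selection step with Hugging Face `transformers` is given below; the model name and the decision to score only response tokens are illustrative assumptions consistent with Eqs. (7)–(8):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-3.1-8B"  # assumed base model for illustration
tok = AutoTokenizer.from_pretrained(MODEL)
lm = AutoModelForCausalLM.from_pretrained(MODEL)

@torch.no_grad()
def response_entropy(query: str, response: str) -> float:
    """Eq. (7): average negative log-likelihood of the response tokens."""
    n_prompt = tok(query, return_tensors="pt").input_ids.shape[1]
    ids = tok(query + response, return_tensors="pt").input_ids
    logits = lm(ids).logits[0]
    targets = ids[0, n_prompt:]                          # response tokens only
    log_probs = logits[n_prompt - 1:-1].log_softmax(-1)  # logits shifted by one position
    token_ll = log_probs[torch.arange(len(targets)), targets]
    return -token_ll.mean().item()  # lower = more confident

def select(d_denoise: list[tuple[str, str]], beta: float = 0.5):
    """Eq. (8): keep the beta fraction of samples with the lowest entropy."""
    ranked = sorted(d_denoise, key=lambda qy: response_entropy(*qy))
    return ranked[: int(beta * len(d_denoise))]
```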
Through the integration of the processes described above, we combine the reliable dataset \({\mathcal{D}}_{\text{clean}}\) and the selected denoised dataset \({\mathcal{D}}_{\text{select}}\) to form our final fine-tuning dataset \({\mathcal{D}}_{\text{ft}}={\mathcal{D}}_{\text{clean}} \cup {\mathcal{D}}_{\text{select}}\). Then, we fine-tune the LLM on \({\mathcal{D}}_{\text{ft}}\): \[\mathcal{M}' = \mathop{\mathrm{arg\,min}}_{\mathcal{M}} \mathbb{E}_{(q, y) \sim {\mathcal{D}}_{\text{ft}}} \left[-\log p_{\mathcal{M}}(y|q)\right]\,,\] where \(\mathcal{M}'\) represents the evolved model trained on the noise-reduced downstream task dataset. The complete algorithm is summarized in Algorithm 3.
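For completeness, a minimal, unbatched sketch of this objective is shown below (the actual training uses LoRA via Llama-factory, as described in Section 4; `lm` and `tok` are assumed to be a Hugging Face causal LM and its tokenizer):

```python
import torch

def finetune(lm, tok, d_ft: list[tuple[str, str]], epochs: int = 2, lr: float = 1e-4):
    """Minimize -log p_M(y | q) over D_ft = D_clean ∪ D_select."""
    opt = torch.optim.AdamW(lm.parameters(), lr=lr)
    lm.train()
    for _ in range(epochs):
        for q, y in d_ft:
            ids = tok(q + y, return_tensors="pt").input_ids
            labels = ids.clone()
            # Mask query tokens (-100) so the loss covers only the response y.
            labels[:, : tok(q, return_tensors="pt").input_ids.shape[1]] = -100
            loss = lm(ids, labels=labels).loss  # token-averaged NLL
            loss.backward()
            opt.step()
            opt.zero_grad()
    return lm  # the evolved model M'
```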
Method | MMLU | ARC | PubMedQA | Drop | FPB | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Noise rate | 30% | 50% | 70% | 30% | 50% | 70% | 30% | 50% | 70% | 30% | 50% | 70% | 30% | 50% | 70%
Vanilla | 65.3 | 65.3 | 65.3 | 82.7 | 82.7 | 82.7 | 72.0 | 72.0 | 72.0 | 87.2 | 87.2 | 87.2 | 75.5 | 75.5 | 75.5 |
Hermes-3 | 65.5 | 65.5 | 65.5 | 68.7 | 68.7 | 68.7 | 64.8 | 64.8 | 64.8 | 87.1 | 87.1 | 87.1 | 59.4 | 59.4 | 59.4 |
Tulu-3 | 55.7 | 55.7 | 55.7 | 73.3 | 73.3 | 73.3 | 63.3 | 63.3 | 63.3 | 85.3 | 85.3 | 85.3 | 54.5 | 54.5 | 54.5 |
SelfLabel | 64.7 | 64.7 | 64.7 | 82.1 | 82.1 | 82.1 | 71.8 | 71.8 | 71.8 | 86.8 | 86.8 | 86.8 | 82.8 | 82.8 | 82.8 |
SFT | 59.5 | 47.5 | 37.3 | 70.7 | 61.7 | 47.5 | 66.4 | 36.7 | 32.8 | 85.3 | 78.6 | 66.4 | 79.7 | 58.4 | 34.9 |
NoiseAL | 66.3 | 65.5 | 66.1 | 84.0 | 83.6 | 83.4 | 74.2 | 72.2 | 71.8 | 86.8 | 84.3 | 82.1 | 81.1 | 78.5 | 72.8 |
SelfRAG | 65.3 | 65.4 | 64.1 | 83.1 | 82.7 | 82.0 | 63.2 | 60.2 | 57.0 | 86.5 | 85.5 | 83.1 | 83.8 | 76.2 | 68.2 |
SelfSelect | 59.1 | 53.4 | 44.0 | 76.8 | 72.1 | 62.6 | 57.8 | 46.0 | 22.6 | 86.2 | 78.8 | 64.4 | 79.8 | 58.4 | 32.0 |
Ours | 68.2 | 68.0 | 67.6 | 84.9 | 84.7 | 84.1 | 75.8 | 75.6 | 75.0 | 90.3 | 88.5 | 87.9 | 84.4 | 80.5 | 76.2 |
Δ vs. Vanilla (%) | 4.4 | 4.1 | 3.5 | 2.7 | 2.4 | 1.7 | 5.3 | 5.0 | 4.2 | 3.6 | 1.5 | 0.8 | 11.8 | 6.6 | 0.9
Δ vs. SFT (%) | 14.6 | 43.2 | 81.2 | 20.1 | 37.3 | 77.1 | 14.2 | 106 | 129 | 5.9 | 12.6 | 32.4 | 5.9 | 37.8 | 110
We conducted comprehensive evaluations on five diverse benchmark datasets: MMLU [9], ARC [13], PubMedQA [14], Drop, and FPB [15]. These datasets span multiple domains and task types: MMLU and ARC evaluate general knowledge across various academic disciplines; PubMedQA tests biomedical reasoning capabilities; Drop assesses numerical reasoning and reading comprehension; and FPB examines financial domain expertise. For each dataset, we constructed experiments with different noise rates (i.e., 30%, 50%, and 70%) to evaluate model performance under different scenarios.
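One simple protocol for injecting label noise at a given rate is sketched below; the paper’s exact corruption procedure may differ, so treat this as an assumed illustration:

```python
import random

def inject_label_noise(pairs: list[tuple[str, str]], noise_rate: float, seed: int = 0):
    """Corrupt a `noise_rate` fraction of samples by swapping in labels
    drawn from other samples in the dataset (one plausible protocol)."""
    rng = random.Random(seed)
    labels = [y for _, y in pairs]
    noisy = list(pairs)
    for i in rng.sample(range(len(pairs)), k=int(noise_rate * len(pairs))):
        q, y = noisy[i]
        wrong = [l for l in labels if l != y]
        if wrong:  # guard against degenerate label sets
            noisy[i] = (q, rng.choice(wrong))
    return noisy

# e.g., train_30 = inject_label_noise(train_pairs, noise_rate=0.30)
```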
Base Models. We employed diverse model architectures, including Gemma2-9B [16] and Llama3.1-8B [17], along with models of varying parameter sizes such as Llama3.2-3B [17].
Baselines. To comprehensively validate our method’s effectiveness, we implemented several baseline approaches: (1) Vanilla: direct model inference; (2) SFT-enhanced solutions utilizing supplementary data to improve LLM performance, including Hermes-3 [18] and Tulu-3 [19] 2; (3) Standard SFT [20] using potentially noisy training data; (4) Denoising approaches, including the state-of-the-art NoiseAL [10] and LLM-based denoising methods such as SelfLabel and SelfSelect; (5) Self-enhancement methods like SelfRAG [21], which augments inference context using training data. Detailed baseline implementations are provided in the Appendix.
We partitioned each dataset into training and test sets, introducing varying degrees of noise perturbation into the training data. For model fine-tuning, we employed Low-Rank Adaptation (LoRA) [20], implemented through Llama-factory [22], across all open-source models. The fine-tuning process was conducted for 2 epochs. We set \(n=4\) and \(\theta=50\%\), with further parameter analysis in subsequent experiments. The implementation code is available in our anonymous repository. Comprehensive data and training configurations are detailed in the Appendix.
Model | MMLU | ARC | PubMedQA | Drop | FPB | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Noise rate | 30% | 50% | 70% | 30% | 50% | 70% | 30% | 50% | 70% | 30% | 50% | 70% | 30% | 50% | 70%
Llama3.2 3B | |||||||||||||||
Vanilla | 54.9 | 54.9 | 54.9 | 72.4 | 72.4 | 72.4 | 57.8 | 57.8 | 57.8 | 71.0 | 71.0 | 71.0 | 39.9 | 39.9 | 39.9 |
SFT | 55.0 | 48.4 | 38.3 | 66.1 | 58.5 | 42.9 | 63.2 | 49.2 | 37.5 | 77.3 | 73.7 | 61.3 | 56.2 | 49.4 | 31.3 |
Ours | 58.5 | 58.2 | 57.9 | 74.6 | 74.3 | 72.6 | 68.9 | 67.9 | 67.9 | 78.9 | 77.6 | 75.6 | 66.1 | 59.4 | 46.8 |
Llama3.1 8B | |||||||||||||||
Vanilla | 65.3 | 65.3 | 65.3 | 82.7 | 82.7 | 82.7 | 72.0 | 72.0 | 72.0 | 87.2 | 87.2 | 87.2 | 75.5 | 75.5 | 75.5 |
SFT | 59.5 | 47.5 | 37.3 | 70.7 | 61.7 | 47.5 | 66.4 | 36.7 | 32.8 | 85.3 | 78.6 | 66.4 | 79.7 | 58.4 | 34.9 |
Ours | 68.2 | 68.0 | 67.6 | 84.9 | 84.7 | 84.1 | 75.8 | 75.6 | 75.0 | 90.3 | 88.5 | 87.9 | 84.4 | 80.5 | 73.2 |
Gemma2 9B | |||||||||||||||
Vanilla | 70.3 | 70.3 | 70.3 | 90.2 | 90.2 | 90.2 | 66.4 | 66.4 | 66.4 | 90.7 | 90.7 | 90.7 | 83.1 | 83.1 | 83.1 |
SFT | 63.6 | 52.1 | 40.3 | 77.9 | 64.6 | 55.0 | 61.7 | 39.8 | 30.4 | 88.8 | 80.5 | 67.3 | 88.1 | 60.7 | 35.6 |
Ours | 72.5 | 72.1 | 71.3 | 91.8 | 91.5 | 90.4 | 70.8 | 68.8 | 66.8 | 91.9 | 91.8 | 90.9 | 91.8 | 80.8 | 87.7 |
Variant | MMLU | ARC | ||||
---|---|---|---|---|---|---|
Noise rate | 30% | 50% | 70% | 30% | 50% | 70%
Llama3.1-8B | ||||||
RobustFT | 68.2 | 68.0 | 67.6 | 84.9 | 84.7 | 84.1 |
w/o Selection | 65.7 | 65.1 | 64.6 | 83.2 | 83.0 | 82.8 |
w/o Checker | 65.3 | 65.0 | 64.9 | 82.7 | 82.6 | 82.2 |
w/o Reviewer | 68.0 | 67.7 | 67.1 | 84.5 | 84.3 | 84.0 |
w/o CER | 67.7 | 67.7 | 67.0 | 84.6 | 84.1 | 83.9 |
w/o REL | 67.4 | 67.2 | 66.9 | 84.1 | 83.9 | 83.6 |
Our comparative experiments with Llama3.1-8B show that RobustFT consistently outperforms all baselines across datasets, yielding the following key insights.
Noise management is critical in LLM fine-tuning. The SFT results clearly demonstrate that direct fine-tuning with noisy data substantially degrades model performance, emphasizing the necessity for robust noise detection and removal.
LLMs exhibit limited inherent noise detection capabilities. SelfSelect’s inferior performance compared to SFT indicates that LLMs cannot effectively identify noise, necessitating specialized noise detection and removal mechanisms.
Enhanced SFT approaches lack consistent improvement. Methods like Tulu-3 and Hermes-3 failed to show uniform performance improvements across downstream tasks, suggesting the need for task-specific LLM adaptation strategies.
Inference enhancement methods show modest gains. Notably, these approaches achieved some performance improvements despite potential noise in context data, though the improvements were not comparable to our method’s results.
Denoising approaches demonstrate mixed results. While methods such as NoiseAL and SelfLabel show noise resistance and improvements on some datasets, they exhibit degradation on others.
We conducted extensive experiments across multiple model architectures (Llama3.2-3B, Llama3.1-8B, and Gemma2-9B), as shown in Table 2. Our investigation revealed several noteworthy insights:
Larger models are not inherently more robust. Contrary to common intuition, increased parameter count does not correlate with better noise resistance. In fact, general-purpose large models may be more susceptible to noise during domain-specific fine-tuning due to their lack of domain priors.
Transformation mechanism from general models to domain experts. While Gemma2-9B showed strong general capabilities, it initially performed worse on domain-specific tasks. However, after fine-tuning with RobustFT, it effectively adapted to these domains and outperformed Llama3.1-8B, demonstrating the importance of denoising in LLM adaptation.
Critical importance of denoising for smaller models. Smaller models benefit more significantly from denoising strategies during domain-specific training. Our experiments show that effective denoising mechanisms can substantially mitigate the performance gaps of smaller models in downstream tasks.
We conducted ablation experiments on RobustFT across different noise levels (30%, 50%, 70%) using the MMLU and ARC datasets. The results reveal several key findings: (1) The complete RobustFT framework consistently achieves optimal performance across all settings, validating its effectiveness. (2) The Selection component proves crucial, as its removal leads to substantial performance drops (e.g., accuracy decreases from 68.2 to 65.7 on MMLU with 30% noise). (3) The Checker component significantly contributes to model performance, particularly on the ARC dataset, demonstrating the effectiveness of our multi-model collaborative noise detection. (4) While the Reviewer component shows modest impact, it still contributes to overall data quality. (5) Both the Context-Enhanced Relabeling (CER) and Reasoning-Enhanced LLM (REL) components prove essential, with their removal leading to notable performance degradation, highlighting the importance of our multi-expert collaborative mechanisms in handling noisy data.
We conducted a sensitivity analysis of RobustFT on MMLU under different noise levels. As shown in Figure 4, we examine the impact of the selection ratio \(\beta\) and context length \(k\). The results show that model performance peaks at \(\beta=40\%\)–\(50\%\), degrading significantly beyond this range due to the inclusion of excessive noisy samples. For context length, performance improves with increasing \(k\) but plateaus, particularly in the range \(k=3\)–\(5\), suggesting that a moderate \(k\) provides sufficient reasoning support. These findings validate our default parameter choices (\(\beta=50\%\), \(k=3\)) without requiring an extensive hyperparameter search, as our primary focus was on demonstrating the framework’s overall effectiveness.
We conducted perplexity analysis of the models, as shown in Figure 5, revealing several key findings: (1) Noise significantly increases perplexity, as evidenced in both SFT and vanilla models. In contrast, RobustFT maintains relatively low perplexity levels even with increased noise, demonstrating its robustness. (2) The vanilla model exhibits flatter and more dispersed perplexity distributions, indicating frequent uncertainty in predictions. RobustFT effectively concentrates perplexity in lower ranges, suggesting more confident and reliable predictions. (3) The method shows consistency across datasets, with similar perplexity reduction patterns observed on both MMLU and ARC, validating its generalizability across different domains.
We analyzed performance across MMLU categories, as shown in Figure 6. (1) The impact of noise varies significantly across knowledge domains, with knowledge-intensive categories such as History, Healthcare, and Law experiencing more severe performance degradation under noisy conditions. (2) RobustFT demonstrates balanced improvements across all categories, achieving comprehensive noise resistance rather than isolated gains, as evidenced by its smooth and expanded radar plot.
We evaluated the inference stability of models under different noise conditions, as shown in Figure 7. Specifically, we employed GPT-4o to rephrase the instructions and conducted five independent tests, reporting both mean performance and standard deviation. Results show that RobustFT maintains consistent performance, with only minimal variance increase at higher noise rates.
Noisy label learning has been a fundamental challenge in NLP [10], [23]–[28], primarily focusing on learning from text classification data containing label noise. Existing approaches can be categorized into three main strategies: (1) Sample selection methods [29] that identify clean samples using fixed thresholds, (2) Label correction techniques [30], [31] that rectify original labels based on model predictions, and (3) Consistency regularization approaches [32], [33] that leverage prediction consistency under different perturbations for label refinement.
Challenges in LLM Era. These conventional methods are primarily designed for well-defined scenarios, with finite discrete label spaces, making them less effective for open-ended generation problems. Moreover, LLMs’ tendency towards hallucination poses significant challenges in noise detection and correction. To address these limitations, RobustFT introduces a novel framework specifically designed for noise-robust downstream fine-tuning of LLMs, moving beyond these constraints.
The vulnerability of LLMs to adversarial attacks through toxic and harmful data during post-training stages has garnered significant attention [34]. Current defense mechanisms primarily focus on several key strategies: distance-based regularization [35], [36], alignment data mixing [37], prompt engineering [38], and data filtering [39]. Unlike these methods, RobustFT emphasizes detection and relabeling mechanisms to prevent the performance degradation caused by noisy data, rather than specifically defending against toxic content.
Recent advances in Large Language Models (LLMs) [40] have emphasized the critical role of data quality in Supervised Fine-Tuning (SFT) [41], [42]. Current research primarily explores two approaches: downstream data selection [3], [12], [43] and data synthesis [44], [45] for improved instruction following. To reduce dependence on annotated data, researchers have developed self-evolution methods through self-instruction [46] and self-play [47], enabling models to learn with minimal supervision. Additionally, SemiEvol [48] has demonstrated promising progress by combining a small amount of labeled data with large-scale unlabeled data to enhance LLM performance on downstream tasks. While existing work focuses on instruction selection [49] and self-training mechanisms [7], RobustFT takes a distinct approach by leveraging noisy real-world data for model self-training to enhance downstream performance.
In this work, we address the practical challenge of handling noisy data in downstream LLM applications, a critical issue that has remained largely underexplored in previous research. We propose RobustFT, a novel noise detection and denoising framework specifically designed for LLMs. Our approach leverages a multi-expert collaborative mechanism for noise detection, enhanced by a reasoning-enhanced process. Furthermore, we implement context-enhanced reasoning for data relabeling and utilize response entropy for data selection. The effectiveness of RobustFT is consistently demonstrated across various datasets and noise scenarios.
Hermes-3: https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B; Tulu-3: https://huggingface.co/allenai/Llama-3.1-Tulu-3-8B-SFT