October 22, 2025
Discrete diffusion models have emerged as a promising direction for vision-language tasks, offering bidirectional context modeling and theoretical parallelization. However, their practical application is severely hindered by a training-inference discrepancy, which leads to catastrophic error cascades: initial token errors during parallel decoding pollute the generation context, triggering a chain reaction of compounding mistakes that manifests as syntactic errors and semantic hallucinations. To address this fundamental challenge, we reframe the generation process from passive denoising to active refining. We introduce ReDiff, a refining-enhanced diffusion framework that teaches the model to identify and correct its own errors. Our approach features a two-stage training process: first, we instill a foundational revision capability by training the model to revise synthetic errors; second, we implement a novel online self-correction loop where the model is explicitly trained to revise its own flawed drafts by learning from an expert’s corrections. This mistake-driven learning endows the model with the crucial ability to revisit and refine its already generated output, effectively breaking the error cascade. Extensive experiments demonstrate that ReDiff significantly improves the coherence and factual accuracy of generated content, enabling stable and efficient parallel generation far superior to traditional denoising methods. Our codes and models are available at https://rediff-hku.github.io/.
Discrete diffusion models have recently emerged as a promising alternative to the dominant autoregressive (AR) paradigm for vision-language models (VLMs) [1]–[7]. Unlike AR models, which generate text token-by-token in a fixed unidirectional manner, diffusion models conceptualize generation as an iterative denoising process. This approach allows for bidirectional context modeling, granting them greater flexibility in controlling the generation process and a theoretical potential for massive parallelization, promising significant gains in inference efficiency [8]–[11].
However, a significant gap exists between the theoretical promise and the practical reality of these models. Existing discrete diffusion models [1], [3], [8] are often plagued by incoherent and hallucinated artifacts (e.g., formatting errors like sequential commas or visually misaligned text) during parallel generation, and therefore frequently default to a one-token-per-step decoding process. We argue that these shortcomings are symptoms of a deeper, more fundamental problem: the error cascade driven by a training-inference discrepancy. Models are trained exclusively on clean, ground-truth data but are required at inference to generate from their own noisy, intermediate outputs. In a parallel decoding scenario, this discrepancy becomes catastrophic. As illustrated in Figure 1 (a), an error in a few tokens instantly pollutes the context for all other tokens being generated in parallel, initiating a cycle of compounding errors that ultimately produces a detailed yet entirely fabricated description of the input image.
To break this vicious cycle, we propose a paradigm shift: from passive denoising (recovering masked tokens under a fixed context) to active refining. We introduce a corrective framework for vision-language diffusion models, called ReDiff, which systematically teaches the model to identify and correct its own errors during denoising. Unlike previous models that merely fill masked tokens, ReDiff actively refines the entire context to guide the generation process. Our approach consists of a two-stage training process. First, we instill a foundational revision capability by training the model to correct synthetic errors, such as random token corruptions and injected hallucinations, moving beyond simple denoising to build a general capacity for revision. Second, we introduce an online self-correction loop where the model is forced to confront and learn from its own mistakes. By capturing its flawed “drafts” during training and learning to predict an expert’s revision, the model directly mitigates the training-inference discrepancy.
This mistake-driven learning endows the model with a crucial, previously absent capability: the ability to revisit and refine its own outputs, including previously unmasked tokens. By learning to self-correct, our model develops robustness to its own imperfections, effectively breaking the error cascade and enabling robust parallel generation. As shown in Figure 1 (b), our refinement-based model successfully identifies and revises an initial error, leading to a more factually grounded and accurate generation. Our contributions are threefold:
1) We propose a new perspective that reframes the generation process of diffusion models from passive denoising to active, iterative refining to address the core challenge of error cascades.
2) We design and implement a two-stage training framework, featuring a core online self-correction loop that enables the model to learn to fix its own intrinsic errors.
3) Extensive experiments demonstrate that our method significantly improves the coherence and factual accuracy of generated content, exhibiting stability far superior to traditional denoising methods, especially in challenging few-step parallel generation scenarios, thereby greatly enhancing inference efficiency.
Discrete diffusion models [14]–[19] represent a class of generative models tailored for discrete data like text. In contrast to image diffusion models, which corrupt data by adding Gaussian noise towards a standard Gaussian prior, text diffusion models typically operate by replacing original tokens to degrade semantic content. Early approaches, such as D3PM [14], employed discrete Markov chains where a transition matrix is progressively applied to the input, corrupting it towards a uniform distribution (i.e., any token becomes any other with equal probability) or an absorbing state (e.g., a [MASK] token). More recently, mask-and-predict (mask-pred) diffusion models have demonstrated significant empirical success. For instance, LLaDA [8] achieves performance comparable to autoregressive large language models by generating sentences from a fully masked sequence, progressively unmasking the tokens with the highest confidence. Similarly, Dream [9] has shown strong results by initializing its parameters from a pre-trained autoregressive model.
Theoretically, discrete diffusion models offer advantages over traditional autoregressive models [20]–[24]. Their bidirectional context modeling enables flexible and controllable generation, while their inherent parallelism promises significant acceleration in sampling speed. However, this potential for parallel generation remains largely untapped. Current models often suffer from output degradation—such as repetition and grammatical errors—when attempting to predict multiple tokens per step. Our work directly addresses this by enhancing the stability of parallel decoding. This aligns with a recent line of work exploring the correction of generated content. For example, SEED-Diffusion [10] introduced an “edit-based forward process” for code generation, which adds edit-specific noise in the final 20% of steps to allow for revisions. Likewise, FUDOKI [4], a multimodal model based on discrete flow matching, progressively revises a random sentence, whose tokens are uniformly sampled from the vocabulary, into the correct answer. Our method is distinct in that it treats revision not as another form of noise, but as a high-level refinement process. Specifically, our framework trains the model to learn from and correct its own characteristic errors, moving beyond simple noise reversal.
Large vision language models (LVLMs) [25]–[30] have achieved remarkable success in vision understanding and have been applied to a myriad of real-world scenarios [31]–[33]. The dominant architecture connects a pre-trained vision encoder [34], [35] to an autoregressive language model via a lightweight module such as an MLP or Q-Former. These models first achieve cross-modal alignment through pre-training and then undergo visual instruction tuning to handle a wide range of vision-centric tasks.
Despite their success, a persistent challenge in LVLMs is the phenomenon of hallucination [36], where the model generates text that is factually inconsistent with the visual input. In autoregressive models, this issue is exacerbated by error propagation; an incorrectly generated token can irreversibly misguide the subsequent generation path. Notably, current multimodal discrete diffusion models, such as LLaDA-V [1], LaViDa [3], and MMaDA [2], also adhere to this limitation, fixing tokens in place once they are unmasked. Our ReDiff, however, leverages the bidirectional attention mechanism inherent to the diffusion paradigm. This allows our model to revisit and optimize already-generated content, enabling a progressive refinement process that directly mitigates hallucination.
In this section, we introduce our refining-enhanced diffusion framework, ReDiff, designed to enhance the generation accuracy and stability of vision-language diffusion models. In contrast to traditional approaches that focus on recovering text from a fully masked sequence, our work emphasizes the high-level refinement of generated text. Guided by an expert model, our framework enables the model to learn from its own generation errors. This fosters a self-correction capability during inference, allowing the model to simultaneously unmask new tokens while refining previously generated ones, thereby mitigating the problem of error cascades in parallel generation.
We will first present the preliminaries of discrete diffusion models in Section 3.1. We then introduce the first stage of our approach, foundational revision training, in Section 3.2. Section 3.3 details the core of our framework, online self-correction learning. Section 3.4 details the inference process.
A discrete diffusion model formalizes text generation through a forward and a reverse process. The forward process gradually corrupts a clean text sequence \(x_0\) into a noisy state \(x_t\) over a series of timesteps \(t\in[0,1]\). In mask-pred models, this is achieved by replacing tokens with a [MASK] token based on a noise schedule \(\gamma_t\), culminating in a fully masked sequence as a prior distribution. The forward process is formulated as: \[q \bigl(x_t[i]=c \,\big|\, x_0[i]\bigr) = \begin{cases} 1-\gamma_t, & \text{if } c = x_0[i],\\[0.5em] \gamma_t, & \text{if } c = \mathbf{M}. \end{cases}\]
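For concreteness, this forward corruption can be sketched in a few lines of PyTorch. The following is a minimal illustration that assumes a linear schedule \(\gamma_t = t\); the [MASK] id and function names are placeholders rather than the released implementation:

```python
import torch

MASK_ID = 0  # illustrative [MASK] token id; the actual id depends on the tokenizer


def forward_mask(r0: torch.Tensor, t: float) -> torch.Tensor:
    """Corrupt a clean response r0 by independently replacing each token
    with [MASK] with probability gamma_t (here a linear schedule gamma_t = t)."""
    gamma_t = t
    corrupt = torch.rand(r0.shape) < gamma_t  # Bernoulli(gamma_t) per position
    rt = r0.clone()
    rt[corrupt] = MASK_ID
    return rt
```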
The reverse process aims to reverse this corruption. Starting from a fully masked sequence, the model iteratively predicts the original tokens. At each step, it predicts probabilities for all masked positions, unmasks a few high-confidence tokens, re-masks the rest, and feeds the updated sequence back into the model for the next iteration.
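A single reverse step of this passive denoising scheme might look as follows; the sketch assumes confidence-based selection over masked positions only, with illustrative interfaces (it is not the authors’ released code):

```python
import torch


def denoise_step(model, v, p0, rt, n_unmask, mask_id):
    """One standard reverse (denoising) step: predict every position, commit
    only the n_unmask most confident masked positions, and leave previously
    unmasked tokens untouched."""
    logits = model(v, p0, rt)                        # (L, V) logits over the vocabulary
    conf, pred = logits.softmax(dim=-1).max(dim=-1)  # per-position confidence and argmax token
    is_masked = rt == mask_id
    conf = conf.masked_fill(~is_masked, float("-inf"))  # only masked positions compete
    k = min(n_unmask, int(is_masked.sum()))
    top = conf.topk(k).indices
    rt = rt.clone()
    rt[top] = pred[top]                              # unmask the chosen tokens; the rest stay [MASK]
    return rt
```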
The model, a parametric mask predictor, is trained to predict all masked tokens (positions where \(r_t^i\) equals the [MASK] token \(\mathbf{M}\)) simultaneously. The training objective is a cross-entropy loss computed only on the masked tokens: \[\mathcal{L}_{\text{CE}}(\theta) = -\mathbb{E}_{t, v, p_0, r_0, r_t} \left[ \frac{1}{t} \sum_{i=1}^{L_{r_0}} \mathbf{1}[r_t^i = \mathbf{M}] \log p_\theta(r_0^i | v, p_0, r_t) \right], \label{eq:diffusion}\tag{1}\]
where \(v\) and \(p_0\) denote visual content and prompt, \(r_0\) is the correct response, \(t\) is sampled uniformly, and \(r_t\) is sampled from the forward process.
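A per-sample estimate of this objective can be written as below; batching and the expectation over \(t\) are omitted, and the tensor interface is assumed purely for illustration:

```python
import torch
import torch.nn.functional as F


def masked_diffusion_loss(logits, r0, rt, t, mask_id):
    """Monte-Carlo estimate of Eq. (1) for a single sample: cross-entropy on
    masked positions only, weighted by 1/t.
    logits: (L, V) model outputs; r0, rt: (L,) clean and noisy responses."""
    masked = rt == mask_id
    if not masked.any():
        return logits.new_zeros(())
    ce = F.cross_entropy(logits[masked], r0[masked], reduction="sum")
    return ce / t
```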
A key advantage of discrete diffusion models is their potential for parallel generation, where multiple tokens are unmasked in a single step, significantly reducing the number of required iterations. However, existing models [1], [8], [9] treat already-unmasked tokens as fixed conditions for future predictions. If an incorrect token is generated, it can derail subsequent steps, leading to an error cascade. Yet, unlike the unidirectional attention in AR models, the bidirectional attention mechanism inherent to these models provides the architectural foundation for updating previously generated tokens, a potential we exploit in our framework.
Observations of existing vision-language diffusion models, especially in few-step generation scenarios, reveal two predominant error types: syntactic chaos (e.g., incoherence, repetition, grammatical errors) and semantic hallucinations (content that contradicts the visual input), as shown in Figure 2 (a). In this first training stage, we teach the model to correct these two types of errors, extending its capability from simple denoising to foundational text revision.
We construct training data in two ways. For syntactic errors, we corrupt the text from ground-truth image-text pairs by randomly replacing a fraction of tokens with other tokens from the vocabulary, creating syntactically chaotic inputs. For hallucination errors, we leverage pairs of correct captions and human-corrupted captions containing factual errors (e.g., incorrect objects, attributes, or counts), which directly provide examples of visually inconsistent text.
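The syntactic-corruption side of this construction amounts to uniform random token replacement. A minimal sketch is shown below; the default replacement rate and function name are illustrative assumptions, not the released data pipeline:

```python
import torch


def inject_syntactic_errors(tokens, vocab_size, p_replace=0.1):
    """Replace a random fraction of tokens with tokens drawn uniformly from
    the vocabulary, yielding syntactically chaotic text. Returns the corrupted
    sequence and a boolean mask marking the corrupted positions."""
    corrupt = torch.rand(tokens.shape) < p_replace
    random_tokens = torch.randint(0, vocab_size, tokens.shape)
    corrupted = torch.where(corrupt, random_tokens, tokens)
    return corrupted, corrupt
```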
As shown in Figure 2 (b), we task the model with restoring a “polluted” response \(r_t\) to its original, correct version \(r_0\). We first apply the standard masking process to \(r_0\) according to a sampled noise level \(t\). Then, on the remaining unmasked tokens, we inject the synthetic errors described above. This corrupted sequence serves as the model’s input. The model is trained to predict the entire original text \(r_0\). The loss is computed not only on the [MASK] tokens but also on the syntactically corrupted tokens (\(\mathcal{L}_{\rm syntax}\)) and hallucinated tokens (\(\mathcal{L}_{\rm hallucination}\)). We also include a loss on the uncorrupted tokens (\(\mathcal{L}_{\rm clean}\)) to encourage the model to preserve correct content. The loss of each type is calculated as follows: \[\label{eq:revise} \mathcal{L}_{\text{type}}(\theta) = -\mathbb{E}_{t, v, p_0, r_0, r_t} \left[ \frac{1}{t} \frac{1}{N_{\text{type}}} \sum_{i=1}^{L_{r_0}} \mathbf{1}[r_t^i \in \text{type}] \log p_\theta(r_0^i | v, p_0, r_t) \right],\tag{2}\]
where \(\rm type \in \{mask, syntax, hallucination, clean\}\). Each loss component is normalized by the number of its corresponding tokens \(N_{\text{type}}\) to balance their contributions. The total loss is: \[\mathcal{L}_{\text{revision}} = \mathcal{L}_{\text{mask}} + \mathcal{L}_{\text{syntax}} + \mathcal{L}_{\text{hallucination}} + \mathcal{L}_{\text{clean}}. \label{eq:total}\tag{3}\]
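Given boolean masks marking which positions belong to each type, the combined objective in Eqs. (2)–(3) reduces to a per-type normalized cross-entropy. A minimal sketch, with interfaces assumed for illustration, is:

```python
import torch
import torch.nn.functional as F


def revision_loss(logits, r0, type_masks, t):
    """Eq. (2)-(3): per-type cross-entropy (mask / syntax / hallucination /
    clean), each normalized by its own token count, then summed.
    type_masks: dict mapping type name -> boolean (L,) position mask."""
    total = logits.new_zeros(())
    for name, m in type_masks.items():
        n = int(m.sum())
        if n == 0:
            continue
        ce = F.cross_entropy(logits[m], r0[m], reduction="sum")
        total = total + ce / (t * n)
    return total
```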
After Stage I, we obtain ReDiff-Base, a model equipped with the foundational capability to correct both syntactic errors and factual hallucinations. However, this stage has a limitation: the errors are synthetic and may not reflect the characteristic mistakes the model itself is prone to making.
To teach the model to fix its own idiosyncratic errors, we introduce an online self-correction learning framework. The process, illustrated in Figure 2 (c), proceeds as follows: (1) Generating drafts: We use ReDiff-Base to generate a response for an image, denoted as \(r_{\rm draft}\). We typically use decoding results from different numbers of generation steps to cover more mistakes. (2) Expert revision: The image \(I\), the generated draft \(r_{\rm draft}\), and the ground truth are fed to a powerful external “expert model” (e.g., o4-mini). With a carefully designed prompt, the expert model identifies and corrects both grammatical and hallucinatory errors in \(r_{\rm draft}\), producing a refined version, \(r_{\rm refined}\). We specifically extract the pairs of erroneous and corrected segments. (3) Learning to refine: We form a new training instance \(⟨I, r_{\rm draft}, r_{\rm refined}⟩\) and fine-tune our model on these data. Note that the training loss is computed only on the segments that the expert model identified and corrected. This targeted learning prevents the model from being penalized for other potential errors in the draft that the expert may have missed. The training loss is: \[\label{eq:refine} \mathcal{L}_{\text{refine}}(\theta) = -\mathbb{E}_{t, v, p_0, r_\text{draft}, r_\text{refined}} \left[\frac{1}{N_{\text{mistake}}} \sum_{i=1}^{L_{r_\text{draft}}} \mathbf{1}[r_\text{draft}^i \in \text{mistake}] \log p_\theta(r_\text{refined}^i | v, p_0, r_\text{draft}) \right].\tag{4}\]
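A sketch of the refine loss and of one round of this loop is given below; generate_draft, expert_revise, and diff_mask are hypothetical helpers standing in for steps (1)–(3) and are not part of any released code:

```python
import torch
import torch.nn.functional as F


def refine_loss(logits, r_refined, mistake_mask):
    """Eq. (4): cross-entropy only on the positions the expert revised,
    normalized by the number of mistake tokens."""
    n = int(mistake_mask.sum())
    if n == 0:
        return logits.new_zeros(())
    return F.cross_entropy(logits[mistake_mask], r_refined[mistake_mask],
                           reduction="sum") / n


# Schematic of one round of the online loop (all helpers are hypothetical):
#   r_draft = generate_draft(rediff_base, image, prompt, steps=32)            # (1) draft
#   r_refined = expert_revise(image, r_draft, ground_truth)                   # (2) expert correction
#   mistake_mask = diff_mask(r_draft, r_refined)                              # positions the expert changed
#   loss = refine_loss(model(image, prompt, r_draft), r_refined, mistake_mask)  # (3) learn to refine
```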
To maintain the foundational capabilities learned in the first stage, we mix in a small amount of the Stage I data during this phase. This entire cycle can be iterated: the refined model from one round can be used to generate new drafts for the next round of expert revision and fine-tuning, progressively enhancing its self-correction ability. The key advantage here is that the model learns from its own mistakes, which is a more targeted and efficient way to improve its robustness and the stability of parallel generation.
Our inference process differs from that of traditional discrete diffusion models by integrating refinement into each generation step. Specifically, the process starts with a fully masked sequence. At each step, the model computes the output probability distribution over the entire vocabulary for all token positions. For masked positions, if the inference speed is \(n\) tokens per step, we select the top-\(n\) most confident tokens and unmask them. For previously unmasked positions, we replace the existing tokens with the newly predicted ones. This allows for the simultaneous unmasking of new content and refining of existing content. As more context is generated, previously generated tokens are iteratively updated to be more coherent and factually accurate, effectively reducing the occurrence of syntactic chaos and hallucinations.
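The decoding procedure can be summarized by the following sketch; compared with a standard mask-pred step, the key change is the final update, which also overwrites previously unmasked positions. Names, device handling, and the greedy argmax choice are illustrative assumptions rather than the released implementation:

```python
import torch


@torch.no_grad()
def rediff_generate(model, v, p0, length, steps, mask_id):
    """Refinement-integrated decoding: each step unmasks the most confident
    masked positions AND overwrites previously unmasked positions with the
    current predictions, so earlier outputs keep being revised."""
    rt = torch.full((length,), mask_id, dtype=torch.long)
    n_per_step = max(1, length // steps)             # e.g. 128 tokens / 32 steps = 4 tokens per step
    for _ in range(steps):
        logits = model(v, p0, rt)                    # (L, V)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        is_masked = rt == mask_id
        if is_masked.any():
            c = conf.masked_fill(~is_masked, float("-inf"))
            top = c.topk(min(n_per_step, int(is_masked.sum()))).indices
            rt[top] = pred[top]                      # unmask new tokens
        rt[~is_masked] = pred[~is_masked]            # refine already-generated tokens
    return rt
```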
Training Setup. Our primary focus is on enhancing the generative capabilities of vision-language diffusion models. We select detailed image captioning as the representative task to validate our framework, although the methodology is generalizable to other generative tasks. Our model is built upon the existing LLaDA-V model, leveraging its foundational mask prediction capabilities while endowing it with the ability to refine. The training data comprises caption datasets from LLaVA-1.5 [25], ShareGPT4V [37], and the ViCrit dataset [38]. When constructing hallucination revision data in Stage I, we leverage the existing hallucination dataset ViCrit, which contains pairs of correct and hallucinated descriptions. For Stage I (foundational revision training), we use a total of 260k image-text pairs, with a random token replacement probability of 0.1 for creating syntactic chaos. For Stage II (online self-correction learning), we generate approximately 10k draft-refined caption pairs in each round. The drafts are generated with 128, 32, and 16 inference steps, and o4-mini serves as the expert model for revisions. The prompt for o4-mini is detailed in Appendix 7. Our experiments revealed that a single round of this online refinement training yielded the most significant improvements.
Benchmarks and Evaluation Setup. We evaluate our model on three recent benchmarks for detailed image captioning: CapMAS [12] uses three metrics evaluated by GPT-4o: CLAIR for overall caption quality, Coverage for the comprehensiveness of the description, and Factuality for the accuracy of the content. CapArena [39] employs a pairwise comparison methodology where the outputs of the test model are compared against those of three baseline models with GPT-4o; a final score is calculated from these win ratios. DetailCaps-4870 [13] uses the CAPTURE metric, which scores the generated caption by comparing its scene graph to that of the ground-truth description. We compare ReDiff against several vision-language diffusion models, including LLaDA-V, LaViDa, MMaDA, and FUDOKI. We also include results from several representative AR-based VLMs. At inference, the maximum generation length is 128. An inference process of 128 steps corresponds to a speed of 1 token/step, while 32 steps correspond to 4 tokens/step.
As shown in Table [tab:main], our ReDiff achieves state-of-the-art results among all diffusion-based models across every metric. On CapMAS, our model’s CLAIR score shows a remarkable 11.2-point improvement over LLaDA-V, reaching a level comparable to InternVL-2.5. The Coverage and Factuality scores also increase by 5.85 and 2.23 points, respectively, indicating that our captions are not only richer in content but also more accurate. On CapArena, our model outperforms LLaDA-V by 25.67 points. Furthermore, we achieve a CAPTURE score of 61.88, surpassing the powerful Qwen2.5-VL. These results demonstrate that our refining-enhanced diffusion method effectively improves fluency and mitigates hallucinations, leading to a substantial enhancement in overall caption quality.
In Tables [tad:step32capmas] and [tab:step32capa], we compare models trained with the traditional mask-pred objective versus our refinement framework, using identical datasets and base model. Our model consistently outperforms the mask-trained baseline at every step count. Crucially, as the generation speed increases (i.e., fewer steps), our model’s performance degrades much more gracefully, demonstrating superior stability in parallel generation. For instance, on the CLAIR metric, as the speed increases from 1 token/step to 8 tokens/step, the mask-trained model’s score plummets from 74.53 to 46.38, whereas our model’s score only decreases from 76.74 to 67.44. Notably, our model’s performance at 4 tokens/step is higher than that of both LLaDA-V and the mask-trained baseline at 1 token/step. A similar trend is observed for Coverage. The trend for Factuality is less pronounced, as the baseline’s score does not drop significantly at fewer steps. This is because the metric relies on extracting valid items for verification; as the baseline’s output becomes more chaotic, fewer items can be extracted, artificially stabilizing the correctness ratio. On both CapArena and CAPTURE, our model also demonstrates more robust parallel generation, with the CAPTURE score dropping by only 0.65 points when accelerating from 1 to 4 tokens/step.
Impact of Each Training Stage. In Table [tab:train32stage], we analyze the individual contributions of our two training stages. Both Stage I (foundational revision) and Stage II (self-correction) independently improve the model’s performance and stability over the LLaDA-V baseline. Furthermore, the most significant gains are achieved when both stages are combined. Notably, Stage II alone provides a more substantial boost than Stage I, confirming that teaching the model to learn from its own intrinsic errors is a highly effective refinement strategy. After Stage I training, the model exhibits stable parallel generation performance. For example, as the speed increases from 1 to 4 tokens/step, CLAIR improves from 71.31 to 71.67, and CapArena changes from -69.17 to -73.17. The combination of the two stages yields a synergistic effect, with Stage I providing a foundational revision ability that is further amplified by Stage II, leading to large improvements in metrics like Factuality (+5.25) and CapArena (+17.67).
Analysis of Foundational Revision Training. As shown in Table [tab:stage1], we investigate different settings for the Stage I training. We find that revising syntactic errors primarily boosts overall quality (CLAIR) and Coverage, while also enhancing stability during parallel generation. Conversely, training on hallucination revision yields higher Factuality. Combining both error types allows our model to achieve the best overall performance. In the fourth row, we also compare a dynamic replacement probability for random token corruption, where the rate is tied to the noise level \(t\) (using \(t\) as the replacement probability when \(t < 0.1\)). The results indicate that our fixed replacement rate yields better overall performance.
Impact of Online Self-Correction Training Rounds. In Table [tab:stage-2], we examine the effect of iterating the Stage II training. The results show that while the first round of self-correction provides a substantial performance boost over the ReDiff-Base model, subsequent rounds of training on newly generated data do not yield further significant improvements on most metrics.
We provide qualitative examples to visually demonstrate how the refinement during inference produces more accurate and fluent results, thereby improving the stability of parallel generation.
In Figure 3, we compare the parallel-generated captions from ReDiff and LLaDA-V. The baseline’s output suffers from token repetition (“bus”, “the”), grammatical errors, and hallucinations (e.g., misidentifying a person on a bus advertisement as “a woman”). In contrast, our model’s output is fluent, coherent, and factually grounded. In the second example, our model accurately describes all key elements in the scene, whereas the baseline’s output is chaotic and omits significant details. More comparison cases can be found in Appendix 6.
Figure 4 visualizes the token-level changes during a 32-step generation process. It clearly shows the model simultaneously unmasking new tokens and refining previously generated ones. For instance, in the first example, the model refines the erroneous phrase “rocks painted rocks” into “colorful painted rocks” at step 20. At step 28, it corrects “a tall green plant” to “a small green plant” to better match the visual content.
Figure 5 compares inference with and without refinement, showing that refinement is critical for achieving high-quality outputs. When ReDiff runs inference without refinement, errors tend to accumulate, such as repeated words or symbols and incoherent sentences, ultimately degrading the quality of the caption. This highlights the importance of the model’s refinement capability.
Beyond correcting the model’s own errors during generation, ReDiff also demonstrates a powerful, generalizable ability to revise corrupted user-provided inputs. As shown in Figure 6, we provide the model with an image and a user-provided caption containing either syntactic chaos or a factual hallucination. In both cases, our model successfully corrects the initial erroneous text and proceeds to generate a coherent and accurate completion, highlighting the strong revision ability of ReDiff.
In this work, we addressed the critical challenge of error cascades that hampers the performance of vision-language diffusion models, particularly in efficient parallel generation scenarios. We proposed a paradigm shift from passive denoising to active refining by introducing ReDiff, a novel framework centered on a mistake-driven, online self-correction loop. This approach teaches the model to learn from its own characteristic errors, endowing it with the ability to revisit and refine its generated output. Our extensive experiments validate that this method not only achieves state-of-the-art performance but, more importantly, demonstrates far superior stability and factual accuracy in challenging few-step generation regimes where traditional denoising models catastrophically fail. By effectively breaking the error cascade, our work presents a promising path toward developing more robust, efficient, and controllable generative systems.
Figure 7 and Figure 8 show the generation results of ReDiff and LLaDA-V under different numbers of inference steps. In the 2 tokens/step scenario, LLaDA-V outputs a great deal of hallucinated content, such as “Goku” and “Vegeta” in the first case, and a mouse and keyboard in the second. This occurs because an initial hallucination can affect subsequent generation, leading to error cascades. In contrast, our ReDiff method produces captions that are consistent with the image content. In the 8 tokens/step cases, the outputs of our model are more fluent and contain fewer grammar errors.
In the online self-correction learning loop, we use ReDiff-Base to generate caption drafts. The image, the draft, and the ground-truth caption are then fed to o4-mini to detect and revise errors. The prompt for o4-mini is shown in Table [tab:o432prompt].
All datasets and models used in this study are publicly available and open-source. No proprietary, private, or personally identifiable information was collected or used. The images employed are either natural scenes or normal human activities, without any violent, explicit, or otherwise harmful content. Therefore, the research meets relevant considerations regarding privacy, ethics and copyright.