October 22, 2025
Discrete diffusion models have emerged as a promising direction for vision-language tasks, offering bidirectional context modeling and theoretical parallelization. However, their practical application is severely hindered by a training-inference discrepancy, which leads to catastrophic error cascades: initial token errors during parallel decoding pollute the generation context, triggering a chain reaction of compounding mistakes that manifests as syntactic errors and semantic hallucinations. To address this fundamental challenge, we reframe the generation process from passive denoising to active refining. We introduce ReDiff, a refining-enhanced diffusion framework that teaches the model to identify and correct its own errors. Our approach features a two-stage training process: first, we instill a foundational revision capability by training the model to revise synthetic errors; second, we implement a novel online self-correction loop where the model is explicitly trained to revise its own flawed drafts by learning from an expert’s corrections. This mistake-driven learning endows the model with the crucial ability to revisit and refine its already generated output, effectively breaking the error cascade. Extensive experiments demonstrate that ReDiff significantly improves the coherence and factual accuracy of generated content, enabling stable and efficient parallel generation far superior to traditional denoising methods. Our codes and models are available at https://rediff-hku.github.io/.
Discrete diffusion models have recently emerged as a promising alternative to the dominant autoregressive (AR) paradigm for vision-language models (VLMs) [1]–[7]. Unlike AR models, which generate text token-by-token in a fixed unidirectional manner, diffusion models conceptualize generation as an iterative denoising process. This approach allows for bidirectional context modeling, granting them greater flexibility in controlling the generation process and a theoretical potential for massive parallelization, promising significant gains in inference efficiency [8]–[11].
However, a significant gap exists between the theoretical promise and the practical reality of these models. Existing discrete diffusion models [1], [3], [8] are often plagued by incoherent and hallucinated artifacts (e.g., formatting errors like sequential commas or visually misaligned text) during parallel generation, and therefore frequently default to a one-token-per-step decoding process. We argue that these shortcomings are symptoms of a deeper, more fundamental problem: the error cascade driven by a training-inference discrepancy. Models are trained exclusively on clean, ground-truth data but are required at inference to generate from their own noisy, intermediate outputs. In a parallel decoding scenario, this discrepancy becomes catastrophic. As illustrated in Figure 1 (a), an error in a few tokens instantly pollutes the context for all other tokens being generated in parallel, initiating a cycle of compounding errors that ultimately produces a detailed yet entirely fabricated description of the input image.
To break this vicious cycle, we propose a paradigm shift: from passive denoising (recovering masked tokens under a fixed context) to active refining. We introduce a corrective framework for vision-language diffusion models, called ReDiff, which systematically teaches the model to identify and correct its own errors during denoising. Unlike previous models that merely fill masked tokens, ReDiff actively refines the entire context to guide the generation process. Our approach consists of a two-stage training process. First, we instill a foundational revision capability by training the model to correct synthetic errors, such as random token corruptions and injected hallucinations, moving beyond simple denoising to build a general capacity for revision. Second, we introduce an online self-correction loop where the model is forced to confront and learn from its own mistakes. By capturing its flawed “drafts” during training and learning to predict an expert’s revision, the model directly mitigates the training-inference discrepancy.
This mistake-driven learning endows the model with a crucial, previously absent capability: the ability to revisit and refine its own outputs, including previously unmasked tokens. By learning to self-correct, our model develops robustness to its own imperfections, effectively breaking the error cascade and enabling robust parallel generation. As shown in Figure 1 (b), our refinement-based model successfully identifies and revises an initial error, leading to a more factually grounded and accurate generation. Our contributions are threefold:
1) We propose a new perspective that reframes the generation process of diffusion models from passive denoising to active, iterative refining to address the core challenge of error cascades.
2) We design and implement a two-stage training framework, featuring a core online self-correction loop that enables the model to learn to fix its own intrinsic errors.
3) Extensive experiments demonstrate that our method significantly improves the coherence and factual accuracy of generated content, exhibiting stability far superior to traditional denoising methods, especially in challenging few-step parallel generation scenarios, thereby greatly enhancing inference efficiency.
Discrete diffusion models [14]–[19] represent a class of generative models tailored for discrete data like text. In contrast to image diffusion models, which corrupt data by adding Gaussian noise towards a standard Gaussian prior, text diffusion models typically operate by replacing original tokens to degrade semantic content. Early approaches, such as D3PM [14], employed discrete Markov chains where a transition matrix is progressively applied to the input, corrupting it towards a uniform distribution (i.e., any token becomes any other with equal probability) or an absorbing state (e.g., a [MASK] token). More recently, mask-and-predict (mask-pred) diffusion models have demonstrated significant empirical success. For instance, LLaDA [8] achieves performance comparable to autoregressive large language models by generating sentences from a fully masked sequence, progressively unmasking the tokens with the highest confidence. Similarly, Dream [9] has shown strong results by initializing its parameters from a pre-trained autoregressive model.
Theoretically, discrete diffusion models offer advantages over traditional autoregressive models [20]–[24]. Their bidirectional context modeling enables flexible and controllable generation, while their inherent parallelism promises significant acceleration in sampling speed. However, this potential for parallel generation remains largely untapped. Current models often suffer from output degradation—such as repetition and grammatical errors—when attempting to predict multiple tokens per step. Our work directly addresses this by enhancing the stability of parallel decoding. This aligns with a recent line of work exploring the correction of generated content. For example, SEED-Diffusion [10] introduced an “edit-based forward process” for code generation, which adds edit-specific noise in the final 20% of steps to allow for revisions. Likewise, FUDOKI [4], a multimodal model based on discrete flow matching, progressively revises a random sentence, whose tokens are uniformly sampled from the vocabulary, into the correct answer. Our method is distinct in that it treats revision not as another form of noise, but as a high-level refinement process. Specifically, our framework trains the model to learn from and correct its own characteristic errors, moving beyond simple noise reversal.
Large vision language models (LVLMs) [25]–[30] have achieved remarkable success in vision understanding and have been applied to a myriad of real-world scenarios [31]–[33]. The dominant architecture connects a pre-trained vision encoder [34], [35] to an autoregressive language model via a lightweight module such as an MLP or Q-Former. These models first achieve cross-modal alignment through pre-training and then undergo visual instruction tuning to handle a wide range of vision-centric tasks.
Despite their success, a persistent challenge in LVLMs is the phenomenon of hallucination [36], where the model generates text that is factually inconsistent with the visual input. In autoregressive models, this issue is exacerbated by error propagation; an incorrectly generated token can irreversibly misguide the subsequent generation path. Notably, current multimodal discrete diffusion models, such as LLaDA-V [1], LaViDa [3], and MMaDA [2], also adhere to this limitation, fixing tokens in place once they are unmasked. Our ReDiff, however, leverages the bidirectional attention mechanism inherent to the diffusion paradigm. This allows our model to revisit and optimize already-generated content, enabling a progressive refinement process that directly mitigates hallucination.
In this section, we introduce our refining-enhanced diffusion framework, ReDiff, designed to enhance the generation accuracy and stability of vision-language diffusion models. In contrast to traditional approaches that focus on recovering text from a fully masked sequence, our work emphasizes the high-level refinement of generated text. Guided by an expert model, our framework enables the model to learn from its own generation errors. This fosters a self-correction capability during inference, allowing the model to simultaneously unmask new tokens while refining previously generated ones, thereby mitigating the problem of error cascades in parallel generation.
We will first present the preliminaries of discrete diffusion models in Section 3.1. We then introduce the first stage of our approach, foundational revision training, in Section 3.2. Section 3.3 details the core of our framework, online self-correction learning. Section 3.4 details the inference process.
A discrete diffusion model formalizes text generation through a forward and a reverse process. The forward process gradually corrupts a clean text sequence \(x_0\) into a noisy state \(x_t\) over a series of timesteps \(t\in[0,1]\). In mask-pred models, this is achieved by replacing tokens with a [MASK] token based on a noise schedule \(\gamma_t\), culminating in a fully masked sequence as a prior distribution. The forward process is formulated as: \[q \bigl(x_t[i]=c \,\big|\, x_0[i]\bigr) = \begin{cases} 1-\gamma_t, & \text{if } c = x_0[i],\\[0.5em] \gamma_t, & \text{if } c = \mathbf{M}. \end{cases}\]
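For concreteness, this forward corruption can be sketched in a few lines of PyTorch. The following is a minimal illustration that assumes a linear schedule \(\gamma_t = t\); the [MASK] id and function names are placeholders rather than the released implementation:

```python
import torch

MASK_ID = 0  # illustrative [MASK] token id; the actual id depends on the tokenizer


def forward_mask(r0: torch.Tensor, t: float) -> torch.Tensor:
    """Corrupt a clean response r0 by independently replacing each token
    with [MASK] with probability gamma_t (here a linear schedule gamma_t = t)."""
    gamma_t = t
    corrupt = torch.rand(r0.shape) < gamma_t  # Bernoulli(gamma_t) per position
    rt = r0.clone()
    rt[corrupt] = MASK_ID
    return rt
```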
The reverse process aims to reverse this corruption. Starting from a fully masked sequence, the model iteratively predicts the original tokens. At each step, it predicts probabilities for all masked positions, unmasks a few high-confidence tokens, re-masks the rest, and feeds the updated sequence back into the model for the next iteration.
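A single reverse step of this passive denoising scheme might look as follows; the sketch assumes confidence-based selection over masked positions only, with illustrative interfaces (it is not the authors’ released code):

```python
import torch


def denoise_step(model, v, p0, rt, n_unmask, mask_id):
    """One standard reverse (denoising) step: predict every position, commit
    only the n_unmask most confident masked positions, and leave previously
    unmasked tokens untouched."""
    logits = model(v, p0, rt)                        # (L, V) logits over the vocabulary
    conf, pred = logits.softmax(dim=-1).max(dim=-1)  # per-position confidence and argmax token
    is_masked = rt == mask_id
    conf = conf.masked_fill(~is_masked, float("-inf"))  # only masked positions compete
    k = min(n_unmask, int(is_masked.sum()))
    top = conf.topk(k).indices
    rt = rt.clone()
    rt[top] = pred[top]                              # unmask the chosen tokens; the rest stay [MASK]
    return rt
```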
The model, a parametric mask predictor, is trained to predict all masked tokens (positions where \(r_t^i\) equals the [MASK] token \(\mathbf{M}\)) simultaneously. The training objective is a cross-entropy loss computed only on the masked tokens: \[\mathcal{L}_{\text{CE}}(\theta) = -\mathbb{E}_{t, v, p_0, r_0, r_t} \left[ \frac{1}{t} \sum_{i=1}^{L_{r_0}} \mathbf{1}[r_t^i = \mathbf{M}] \log p_\theta(r_0^i | v, p_0, r_t) \right], \label{eq:diffusion}\tag{1}\]
where \(v\) and \(p_0\) denote visual content and prompt, \(r_0\) is the correct response, \(t\) is sampled uniformly, and \(r_t\) is sampled from the forward process.
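A per-sample estimate of this objective can be written as below; batching and the expectation over \(t\) are omitted, and the tensor interface is assumed purely for illustration:

```python
import torch
import torch.nn.functional as F


def masked_diffusion_loss(logits, r0, rt, t, mask_id):
    """Monte-Carlo estimate of Eq. (1) for a single sample: cross-entropy on
    masked positions only, weighted by 1/t.
    logits: (L, V) model outputs; r0, rt: (L,) clean and noisy responses."""
    masked = rt == mask_id
    if not masked.any():
        return logits.new_zeros(())
    ce = F.cross_entropy(logits[masked], r0[masked], reduction="sum")
    return ce / t
```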
A key advantage of discrete diffusion models is their potential for parallel generation, where multiple tokens are unmasked in a single step, significantly reducing the number of required iterations. However, existing models [1], [8], [9] treat already-unmasked tokens as fixed conditions for future predictions. If an incorrect token is generated, it can derail subsequent steps, leading to an error cascade. Yet, unlike the unidirectional attention in AR models, the bidirectional attention mechanism inherent to these models provides the architectural foundation for updating previously generated tokens, a potential we exploit in our framework.
Observations of existing vision-language diffusion models, especially in few-step generation scenarios, reveal two predominant error types: syntactic chaos (e.g., incoherence, repetition, grammatical errors) and semantic hallucinations (content that contradicts the visual input), as shown in Figure 2 (a). In this first training stage, we teach the model to correct these two types of errors, extending its capability from simple denoising to foundational text revision.
We construct training data in two ways. For syntactic errors, we corrupt the text from ground-truth image-text pairs by randomly replacing a fraction of tokens with other tokens from the vocabulary, creating syntactically chaotic inputs. For hallucination errors, we leverage pairs of correct captions and human-corrupted captions containing factual errors (e.g., incorrect objects, attributes, or counts), which directly provide examples of visually inconsistent text.
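The syntactic-corruption side of this construction amounts to uniform random token replacement. A minimal sketch is shown below; the default replacement rate and function name are illustrative assumptions, not the released data pipeline:

```python
import torch


def inject_syntactic_errors(tokens, vocab_size, p_replace=0.1):
    """Replace a random fraction of tokens with tokens drawn uniformly from
    the vocabulary, yielding syntactically chaotic text. Returns the corrupted
    sequence and a boolean mask marking the corrupted positions."""
    corrupt = torch.rand(tokens.shape) < p_replace
    random_tokens = torch.randint(0, vocab_size, tokens.shape)
    corrupted = torch.where(corrupt, random_tokens, tokens)
    return corrupted, corrupt
```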
As shown in Figure 2 (b), we task the model with restoring a “polluted” response \(r_t\) to its original, correct version \(r_0\). We first apply the standard masking process to \(r_0\) according to a sampled noise level \(t\). Then, on the remaining unmasked tokens, we inject the synthetic errors described above. This corrupted sequence serves as the model’s input. The model is trained to predict the entire original text \(r_0\). The loss is computed not only on the [MASK] tokens but also on the syntactically corrupted tokens (\(\mathcal{L}_{\rm syntax}\)) and hallucinated tokens (\(\mathcal{L}_{\rm hallucination}\)). We also include a loss on the uncorrupted tokens (\(\mathcal{L}_{\rm clean}\)) to encourage the model to preserve correct content. The loss of each type is calculated as follows: \[\label{eq:revise} \mathcal{L}_{\text{type}}(\theta) = -\mathbb{E}_{t, v, p_0, r_0, r_t} \left[ \frac{1}{t} \frac{1}{N_{\text{type}}} \sum_{i=1}^{L_{r_0}} \mathbf{1}[r_t^i \in \text{type}] \log p_\theta(r_0^i | v, p_0, r_t) \right],\tag{2}\]
where \(\rm type \in \{mask, syntax, hallucination, clean\}\). Each loss component is normalized by the number of its corresponding tokens \(N_{\text{type}}\) to balance their contributions. The total loss is: \[\mathcal{L}_{\text{revision}} = \mathcal{L}_{\text{mask}} + \mathcal{L}_{\text{syntax}} + \mathcal{L}_{\text{hallucination}} + \mathcal{L}_{\text{clean}}. \label{eq:total}\tag{3}\]
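Given boolean masks marking which positions belong to each type, the combined objective in Eqs. (2)–(3) reduces to a per-type normalized cross-entropy. A minimal sketch, with interfaces assumed for illustration, is:

```python
import torch
import torch.nn.functional as F


def revision_loss(logits, r0, type_masks, t):
    """Eq. (2)-(3): per-type cross-entropy (mask / syntax / hallucination /
    clean), each normalized by its own token count, then summed.
    type_masks: dict mapping type name -> boolean (L,) position mask."""
    total = logits.new_zeros(())
    for name, m in type_masks.items():
        n = int(m.sum())
        if n == 0:
            continue
        ce = F.cross_entropy(logits[m], r0[m], reduction="sum")
        total = total + ce / (t * n)
    return total
```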
After Stage I, we obtain ReDiff-Base, a model equipped with the foundational capability to correct both syntactic errors and factual hallucinations. However, this stage has a limitation: the errors are synthetic and may not reflect the characteristic mistakes the model itself is prone to making.
To teach the model to fix its own idiosyncratic errors, we introduce an online self-correction learning framework. The process, illustrated in Figure 2 (c), proceeds as follows: (1) Generating drafts: We use ReDiff-Base to generate a response for an image, denoted as \(r_{\rm draft}\). We typically use decoding results from different numbers of generation steps to cover more mistakes. (2) Expert revision: The image \(I\), the generated draft \(r_{\rm draft}\), and the ground truth are fed to a powerful external “expert model” (e.g., o4-mini). With a carefully designed prompt, the expert model identifies and corrects both grammatical and hallucinatory errors in \(r_{\rm draft}\), producing a refined version, \(r_{\rm refined}\). We specifically extract the pairs of erroneous and corrected segments. (3) Learning to refine: We form a new training instance \(⟨I, r_{\rm draft}, r_{\rm refined}⟩\) and fine-tune our model on these data. Note that the training loss is computed only on the segments that the expert model identified and corrected. This targeted learning prevents the model from being penalized for other potential errors in the draft that the expert may have missed. The training loss is: \[\label{eq:refine} \mathcal{L}_{\text{refine}}(\theta) = -\mathbb{E}_{t, v, p_0, r_\text{draft}, r_\text{refined}} \left[\frac{1}{N_{\text{mistake}}} \sum_{i=1}^{L_{r_\text{draft}}} \mathbf{1}[r_\text{draft}^i \in \text{mistake}] \log p_\theta(r_\text{refined}^i | v, p_0, r_\text{draft}) \right].\tag{4}\]
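A sketch of the refine loss and of one round of this loop is given below; generate_draft, expert_revise, and diff_mask are hypothetical helpers standing in for steps (1)–(3) and are not part of any released code:

```python
import torch
import torch.nn.functional as F


def refine_loss(logits, r_refined, mistake_mask):
    """Eq. (4): cross-entropy only on the positions the expert revised,
    normalized by the number of mistake tokens."""
    n = int(mistake_mask.sum())
    if n == 0:
        return logits.new_zeros(())
    return F.cross_entropy(logits[mistake_mask], r_refined[mistake_mask],
                           reduction="sum") / n


# Schematic of one round of the online loop (all helpers are hypothetical):
#   r_draft = generate_draft(rediff_base, image, prompt, steps=32)            # (1) draft
#   r_refined = expert_revise(image, r_draft, ground_truth)                   # (2) expert correction
#   mistake_mask = diff_mask(r_draft, r_refined)                              # positions the expert changed
#   loss = refine_loss(model(image, prompt, r_draft), r_refined, mistake_mask)  # (3) learn to refine
```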
To maintain the foundational capabilities learned in the first stage, we mix in a small amount of the Stage I data during this phase. This entire cycle can be iterated: the refined model from one round can be used to generate new drafts for the next round of expert revision and fine-tuning, progressively enhancing its self-correction ability. The key advantage here is that the model learns from its own mistakes, which is a more targeted and efficient way to improve its robustness and the stability of parallel generation.
Our inference process differs from that of traditional discrete diffusion models by integrating refinement into each generation step. Specifically, the process starts with a fully masked sequence. At each step, the model computes the output probability distribution over the entire vocabulary for all token positions. For masked positions, if the inference speed is \(n\) tokens per step, we select the top-\(n\) most confident tokens and unmask them. For previously unmasked positions, we replace the existing tokens with the newly predicted ones. This allows for the simultaneous unmasking of new content and refining of existing content. As more context is generated, previously generated tokens are iteratively updated to be more coherent and factually accurate, effectively reducing the occurrence of syntactic chaos and hallucinations.
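The decoding procedure can be summarized by the following sketch; compared with a standard mask-pred step, the key change is the final update, which also overwrites previously unmasked positions. Names, device handling, and the greedy argmax choice are illustrative assumptions rather than the released implementation:

```python
import torch


@torch.no_grad()
def rediff_generate(model, v, p0, length, steps, mask_id):
    """Refinement-integrated decoding: each step unmasks the most confident
    masked positions AND overwrites previously unmasked positions with the
    current predictions, so earlier outputs keep being revised."""
    rt = torch.full((length,), mask_id, dtype=torch.long)
    n_per_step = max(1, length // steps)             # e.g. 128 tokens / 32 steps = 4 tokens per step
    for _ in range(steps):
        logits = model(v, p0, rt)                    # (L, V)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        is_masked = rt == mask_id
        if is_masked.any():
            c = conf.masked_fill(~is_masked, float("-inf"))
            top = c.topk(min(n_per_step, int(is_masked.sum()))).indices
            rt[top] = pred[top]                      # unmask new tokens
        rt[~is_masked] = pred[~is_masked]            # refine already-generated tokens
    return rt
```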
Training Setup. Our primary focus is on enhancing the generative capabilities of vision-language diffusion models. We select detailed image captioning as the representative task to validate our framework, although the methodology is generalizable to other generative tasks. Our model is built upon the existing LLaDA-V model, leveraging its foundational mask prediction capabilities while endowing it with the ability to refine. The training data comprises caption datasets from LLaVA-1.5 [25], ShareGPT4V [37], and the ViCrit dataset [38]. When constructing hallucination revision data in Stage I, we leverage the existing hallucination dataset ViCrit, which contains pairs of correct and hallucinated descriptions. For Stage I (foundational revision training), we use a total of 260k image-text pairs, with a random token replacement probability of 0.1 for creating syntactic chaos. For Stage II (online self-correction learning), we generate approximately 10k draft-refined caption pairs in each round. The drafts are generated with 128, 32, and 16 inference steps, and o4-mini serves as the expert model for revisions. The prompt for o4-mini is detailed in Appendix 7. Our experiments revealed that a single round of this online refinement training yielded the most significant improvements.
Benchmarks and Evaluation Setup. We evaluate our model on three recent benchmarks for detailed image captioning: CapMAS [12] uses three metrics evaluated by GPT-4o: CLAIR for overall caption quality, Coverage for the comprehensiveness of the description, and Factuality for the accuracy of the content. CapArena [39] employs a pairwise comparison methodology where the outputs of the test model are compared against those of three baseline models with GPT-4o; a final score is calculated from these win ratios. DetailCaps-4870 [13] uses the CAPTURE metric, which scores the generated caption by comparing its scene graph to that of the ground-truth description. We compare ReDiff against several vision-language diffusion models, including LLaDA-V, LaViDa, MMaDA, and FUDOKI. We also include results from several representative AR-based VLMs. At inference, the maximum generation length is 128. An inference process of 128 steps corresponds to a speed of 1 token/step, while 32 steps correspond to 4 tokens/step.
As shown in Table [tab:main], our ReDiff achieves state-of-the-art results among all diffusion-based models across every metric. On CapMAS, our model’s CLAIR score shows a remarkable 11.2-point improvement over LLaDA-V, reaching a level comparable to InternVL-2.5. The Coverage and Factuality scores also increase by 5.85 and 2.23 points, respectively, indicating that our captions are not only richer in content but also more accurate. On CapArena, our model outperforms LLaDA-V by 25.67 points. Furthermore, we achieve a CAPTURE score of 61.88, surpassing the powerful Qwen2.5-VL. These results demonstrate that our refining-enhanced diffusion method effectively improves fluency and mitigates hallucinations, leading to a substantial enhancement in overall caption quality.
In Tables [tad:step32capmas] and [tab:step32capa], we compare models trained with the traditional mask-pred objective versus our refinement framework, using identical datasets and base model. Our model consistently outperforms the mask-trained baseline at every step count. Crucially, as the generation speed increases (i.e., fewer steps), our model’s performance degrades much more gracefully, demonstrating superior stability in parallel generation. For instance, on the CLAIR metric, as the speed increases from 1 token/step to 8 tokens/step, the mask-trained model’s score plummets from 74.53 to 46.38, whereas our model’s score only decreases from 76.74 to 67.44. Notably, our model’s performance at 4 tokens/step is higher than that of both LLaDA-V and the mask-trained baseline at 1 token/step. A similar trend is observed for Coverage. The trend for Factuality is less pronounced, as the baseline’s score does not drop significantly at fewer steps. This is because the metric relies on extracting valid items for verification; as the baseline’s output becomes more chaotic, fewer items can be extracted, artificially stabilizing the correctness ratio. On both CapArena and CAPTURE, our model also demonstrates more robust parallel generation, with the CAPTURE score dropping by only 0.65 points when accelerating from 1 to 4 tokens/step.
Impact of Each Training Stage. In Table [tab:train32stage], we analyze the individual contributions of our two training stages. Both Stage I (foundational revision) and Stage II (self-correction) independently improve the model’s performance and stability over the LLaDA-V baseline. Furthermore, the most significant gains are achieved when both stages are combined. Notably, Stage II alone provides a more substantial boost than Stage I, confirming that teaching the model to learn from its own intrinsic errors is a highly effective refinement strategy. After Stage I training, the model exhibits stable parallel generation performance. For example, as the speed increases from 1 to 4 tokens/step, CLAIR improves from 71.31 to 71.67, and CapArena changes from -69.17 to -73.17. The combination of the two stages yields a synergistic effect, with Stage I providing a foundational revision ability that is further amplified by Stage II, leading to large improvements in metrics like Factuality (+5.25) and CapArena (+17.67).
Analysis of Foundational Revision Training. As shown in Table [tab:stage1], we investigate different settings for the Stage I training. We find that revising syntactic errors primarily boosts overall quality (CLAIR) and Coverage, while also enhancing stability during parallel generation. Conversely, training on hallucination revision yields higher Factuality. Combining both error types allows our model to achieve the best overall performance. In the fourth row, we also compare a dynamic replacement probability for random token corruption, where the rate is tied to the noise level \(t\) (using \(t\) as the replacement probability when \(t < 0.1\)). The results indicate that our fixed replacement rate yields better overall performance.
Impact of Online Self-Correction Training Rounds. In Table [tab:stage-2], we examine the effect of iterating the Stage II training. The results show that while the first round of self-correction provides a substantial performance boost over the ReDiff-Base model, subsequent rounds of training on newly generated data do not yield further significant improvements on most metrics.
We provide qualitative examples to visually demonstrate how the refinement during inference produces more accurate and fluent results, thereby improving the stability of parallel generation.
In Figure 3, we compare the parallel-generated captions from ReDiff and LLaDA-V. The baseline’s output suffers from token repetition (“bus”, “the”), grammatical errors, and hallucinations (e.g., misidentifying a person on a bus advertisement as “a woman”). In contrast, our model’s output is fluent, coherent, and factually grounded. In the second example, our model accurately describes all key elements in the scene, whereas the baseline’s output is chaotic and omits significant details. More comparison cases can be found in Appendix 6.
Figure 4 visualizes the token-level changes during a 32-step generation process. It clearly shows the model simultaneously unmasking new tokens and refining previously generated ones. For instance, in the first example, the model refines the erroneous phrase “rocks painted rocks” into “colorful painted rocks” at step 20. At step 28, it corrects “a tall green plant” to “a small green plant” to better match the visual content.
Figure 5 compares inference with and without refinement, showing that refinement is critical for achieving high-quality outputs. When ReDiff runs inference without refinement, errors tend to accumulate, such as repeated words or symbols and incoherent sentences, ultimately degrading the quality of the caption. This highlights the importance of the model’s refinement capability.
Beyond correcting the model’s own errors during generation, ReDiff also demonstrates a powerful, generalizable ability to revise corrupted user-provided inputs. As shown in Figure 6, we provide the model with an image and a user-provided caption containing either syntactic chaos or a factual hallucination. In both cases, our model successfully corrects the initial erroneous text and proceeds to generate a coherent and accurate completion, highlighting the strong revision ability of ReDiff.
In this work, we addressed the critical challenge of error cascades that hampers the performance of vision-language diffusion models, particularly in efficient parallel generation scenarios. We proposed a paradigm shift from passive denoising to active refining by introducing ReDiff, a novel framework centered on a mistake-driven, online self-correction loop. This approach teaches the model to learn from its own characteristic errors, endowing it with the ability to revisit and refine its generated output. Our extensive experiments validate that this method not only achieves state-of-the-art performance but, more importantly, demonstrates far superior stability and factual accuracy in challenging few-step generation regimes where traditional denoising models catastrophically fail. By effectively breaking the error cascade, our work presents a promising path toward developing more robust, efficient, and controllable generative systems.
Figure 7 and Figure 8 show the generation results of ReDiff and LLaDA-V under different numbers of inference steps. In the 2 tokens/step scenario, LLaDA-V outputs a great deal of hallucinated content, such as “Goku” and “Vegeta” in the first case, and a mouse and keyboard in the second. This occurs because an initial hallucination can affect subsequent generation, leading to error cascades. In contrast, our ReDiff method produces captions that are consistent with the image content. In the 8 tokens/step cases, the outputs of our model are more fluent and contain fewer grammar errors.
In the online self-correction learning loop, we use ReDiff-Base to generate caption drafts. The image, the draft, and the ground-truth caption are then fed to o4-mini to detect and revise errors. The prompt for o4-mini is shown in Table [tab:o432prompt].
All datasets and models used in this study are publicly available and open-source. No proprietary, private, or personally identifiable information was collected or used. The images employed are either natural scenes or normal human activities, without any violent, explicit, or otherwise harmful content. Therefore, the research meets relevant considerations regarding privacy, ethics and copyright.