AddSR: Accelerating Diffusion-based
Blind Super-Resolution with
Adversarial Diffusion Distillation

Rui Xie Ying Tai1 Kai Zhang Zhenyu Zhang
Jun Zhou Jian Yang\(^{\star}\)


Abstract

Blind super-resolution methods based on Stable Diffusion showcase formidable generative capabilities in reconstructing clear high-resolution images with intricate details from low-resolution inputs. However, their practical applicability is often hampered by poor efficiency, stemming from the requirement of hundreds or thousands of sampling steps. Inspired by the efficient text-to-image approach adversarial diffusion distillation (ADD), we design \(\mathtt{AddSR}\) to address this issue by incorporating the ideas of both distillation and ControlNet. Specifically, we first propose a prediction-based self-refinement strategy to provide high-frequency information for the student model output at marginal additional time cost, and we further refine the training process by employing HR images, rather than LR images, to regulate the teacher model, providing a more robust constraint for distillation. Second, we introduce a timestep-adapting loss to address the perception-distortion imbalance problem introduced by ADD. Extensive experiments demonstrate that our \(\mathtt{AddSR}\) generates better restoration results while achieving faster speed than previous SD-based state-of-the-art models (e.g., \(7\times\) faster than SeeSR).

1 Introduction↩︎

Blind super-resolution (BSR) aims to convert low-resolution (LR) images that have undergone complex and unknown degradation into clear high-resolution (HR) versions. Unlike classical super-resolution [1][3], where the degradation process is singular and known, BSR is designed to super-resolve real-world degraded images, which is of greater practical value. In real-world scenarios, achieving both effectiveness and efficiency poses a key challenge for BSR.

Generative models, such as generative adversarial networks (GANs) and diffusion models, have demonstrated significant superiority in the BSR task due to their ability to generate realistic details. However, both have disadvantages. GAN-based methods [4][9] incorporate adversarial training to learn a network that fits the mapping from the distribution of input LR images to that of HR images. While GAN-based methods require only one-step inference, they often struggle to generate satisfactory results when handling natural images with intricate textures (e.g., Fig. 1).

Figure 1: Comparison of effectiveness and efficiency. Ours-\(4\) indicates that the result of \(\mathtt{AddSR}\) is obtained in \(4\) steps, achieving high-perception-quality restoration within \(0.8\)s. In contrast, existing SD-based BSR models suffer from either low perceptual quality (e.g., ResShift) or low efficiency (e.g., SeeSR).

Recently, diffusion models (DM) have garnered significant attention owing to their potent generative capabilities and their ability to combine information from multiple modalities. DM-based BSR can be roughly divided into two categories: without Stable Diffusion (SD) prior [10][12] and with SD prior [13][16]. The SD prior can significantly assist the model in capturing the distribution of natural images, enabling the generation of HR images with realistic details. Due to the iterative refinement characteristic of DM, diffusion-based methods can generally achieve better results than GAN-based methods, albeit with a noticeable drop in efficiency, as shown on the left of Fig. 1.

As analyzed above, a BSR model with both excellent restoration quality and high efficiency is urgently needed for real-world applications. To fill this gap, we propose a novel \(\mathtt{AddSR}\) based on adversarial diffusion distillation (ADD), which enhances restoration quality and accelerates inference simultaneously. However, incorporating ADD directly into BSR poses two crucial challenges, both of which are addressed in our proposed \(\mathtt{AddSR}\):

(\(1\)) Task inconsistency: There are two differences between ADD and \(\mathtt{AddSR}\). a) Original ADD is proposed for efficient text-to-image generation, while \(\mathtt{AddSR}\) deals with image-to-image restoration. b) Original ADD does not require LR images as input, while \(\mathtt{AddSR}\) incorporates ControlNet to receive LR and HR images in the student and teacher models respectively to support BSR.

Specifically, we incorporate the ideas of distillation and ControlNet. On one hand, we design a simple yet effective method called prediction-based self-refinement (PSR) for the student model. Unlike previous SD-based methods (e.g., SeeSR) that receive the fixed LR image as input for BSR, we estimate HR images from the predicted noise to provide high-frequency information during training. On the other hand, with the incorporation of the teacher model, we can directly send the HR image as its input to additionally supply a high-frequency prior during training, which is not available in previous SD-based frameworks. This strategy helps the student model produce high-perception-quality results that closely resemble those of the teacher model.

(\(2\)) Perception-distortion imbalance: The perceptual quality of images generated by a diffusion model increases with the number of steps. Constraining images generated at different timesteps with the same GAN loss weight, as in ADD, may lead to an insufficient perceptual constraint at small steps (e.g., \(1\)), producing blurry results, and an overly restrictive constraint at larger steps (e.g., \(4\)), producing hallucinated results. To tackle this challenge, we develop a timestep-adapting (TA) loss. The core idea of the TA loss is to employ varying weights for the GAN loss and distillation loss at each timestep, enhancing the constraint of the adversarial loss on the model in the early stages and reducing it in the later stages.

Our main contributions are threefold:

\(\bullet\) To the best of our knowledge, \(\mathtt{AddSR}\) is the first SD-based BSR model that achieves high-perception-quality restoration within \(0.8\)s, running \(7\times\) faster than the previous state-of-the-art model SeeSR while delivering better restoration performance.

\(\bullet\) We propose prediction-based self-refinement to provide a high-quality control signal for regulating the model output, and we refine the teacher model's training process. Moreover, we propose the timestep-adapting loss to achieve a better perception-distortion trade-off.

\(\bullet\) Experiments on popular benchmarks against extensive competitors demonstrate that our proposed \(\mathtt{AddSR}\) achieves superior restoration quality and high efficiency simultaneously.

2 Related Work↩︎

Table 1: Comparing restoration effectiveness and efficiency among different methods.
Type Methods Source TimeSteps Effect Efficiency
GAN-based BSRGAN ICCV 2021 1 Poor High
GAN-based Real-ESRGAN ICCVW 2021 1 Poor High
GAN-based LDL CVPR 2022 1 Poor High
GAN-based FeMaSR MM 2022 1 Poor High
DM-based ResShift NeurIPS 2023 15 Moderate Moderate
SD-based StableSR Arxiv 2023 200 Good Low
SD-based PASD Arxiv 2023 20 Good Low
SD-based DiffBIR Arxiv 2023 50 Good Low
SD-based SeeSR CVPR 2024 50 Good Low
SD-based \(\mathtt{AddSR}\) (Ours) - 1\(\sim\)4 Good High

GAN-based BSR. In recent years, BSR has drawn much attention due to its practicality. Adversarial training [17] is introduced into the BSR task to avoid generating over-smooth results. BSRGAN [4] designs a random shuffle strategy to enlarge the degradation space for training a comprehensive SR model. Real-ESRGAN [5] presents a more practical “high-order” degradation process to synthesize realistic LR images. Both utilize a GAN to learn the projection from the distribution of degraded LR images to the distribution of HR images. KDSRGAN [6] estimates an implicit degradation representation to assist the restoration process. While GAN-based BSR methods require only one step to restore the LR image, their capability to super-resolve complex natural images is limited. In this work, our \(\mathtt{AddSR}\) attains superior restoration performance with only a slight dip in speed, making it a compelling choice.

Diffusion-based BSR. Diffusion models have demonstrated significant advantages in image generation tasks (e.g., text-to-image). Recently, several diffusion-based BSR works have been proposed. One common approach [10], [18], [19] is to train a non-multimodal diffusion model from scratch, which takes the concatenation of an LR image and noise as input at every step. Another approach [13][16] fully leverages the prior knowledge of a pre-trained multimodal diffusion model (i.e., the SD model), which requires training a ControlNet and incorporating new adaptive structures (e.g., cross-attention). SD-based methods excel in performance compared with the former approach, as they effectively incorporate high-level information. However, the large number of model parameters and the need for numerous sampling steps pose substantial challenges to their application in the real world.

Efficient Diffusion Models. Several works [20][23] have been proposed to accelerate the inference process of DM. Although these methods can reduce the sampling steps from thousands to \(20\)-\(50\), the restoration quality deteriorates dramatically. A very recent technique called adversarial diffusion distillation [24] achieves \(1\) \(\sim\) \(4\)-step inference while maintaining satisfactory generation ability. However, ADD was originally designed for the text-to-image task. Considering the multifaceted nature of BSR, such as image quality, degradation, or the trade-off between fidelity and realness, employing ADD to expedite the SD-based model for BSR is non-trivial. In summary, \(\mathtt{AddSR}\) introduces two pivotal designs to adapt ADD to BSR tasks, making it both effective and efficient.

3 Methodology↩︎

Here, we outline \(\mathtt{AddSR}\). First, we provide a brief introduction to Stable Diffusion and adversarial diffusion distillation in Sec. 3.1. Next, in Sec. 3.2, we introduce the image quality-adjusted ADD, an acceleration strategy refined from ADD to suit the BSR task. Then, we describe and discuss our proposed prediction-based self-refinement and timestep-adapting loss in Sec. 3.3 and Sec. 3.4, respectively.

Figure 2: Overview of \(\mathtt{AddSR}\). There are two models, a student model and a teacher model, in our proposed \(\mathtt{AddSR}\). In the teacher model, a frozen ControlNet is employed to receive conditioning information directly from HR images, and a frozen SD model is used to generate the guidance for the student model. In the student model, a ControlNet receives the features from the LR image initially and then from the predicted HR images in the later stages based on our proposed PSR. The representation cross-attention layers in the student’s SD/ControlNet models are updated during training to enable our student model to handle low-quality images.

3.1 Preliminaries↩︎

Stable Diffusion. The goal of \(\mathtt{AddSR}\) is to design an effective and efficient model with a Stable Diffusion prior for the BSR task. Stable Diffusion is a large-scale text-to-image latent diffusion model. Its forward and denoising processes operate in a latent space obtained through an encoder \(\mathcal{E}\) [25] that maps an image \(x\) in pixel space to a latent \(z\). The forward process is a Markov chain that adds Gaussian noise to the latent tensor \(z\) at time \(t\) to generate the noisy state: \[z_t = \sqrt{\overline{\alpha}_t}z+\sqrt{1-\overline{\alpha}_t}\epsilon, ~~\overline{\alpha}_t=\prod \limits_{i=1}^t (1-\beta_i) \label{eq1},\tag{1}\] where \(z=\mathcal{E}(x)\), \(x\) is the sampled data, \(\beta_i \in (0,1)\) is the variance of the Gaussian noise, and \(\epsilon\sim\mathcal{N}(0, \boldsymbol{I})\). Then, a U-Net \(\epsilon_\theta\) is employed to predict the added noise conditioned on the text prompt \(\tau\). The loss function is specified as follows: \[\mathcal{L}_{SD} = \mathbb{E}_{z,t,\tau,\epsilon}[\Vert \epsilon-\epsilon_\theta(z_t, t, \tau) \Vert_2^2],\] where \(t\) is uniformly sampled. Once the model is trained, we can input descriptions of desired scenes and obtain the corresponding images.
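
To make the forward process and training objective concrete, the following PyTorch-style sketch implements Eq. (1) and the noise-prediction loss \(\mathcal{L}_{SD}\); the `unet` callable and the prompt embedding are placeholders rather than the actual Stable Diffusion implementation.

```python
import torch
import torch.nn.functional as F

def make_alpha_bar(betas: torch.Tensor) -> torch.Tensor:
    """Cumulative schedule from Eq. (1): alpha_bar_t = prod_{i<=t} (1 - beta_i)."""
    return torch.cumprod(1.0 - betas, dim=0)

def q_sample(z: torch.Tensor, t: torch.Tensor, alpha_bar: torch.Tensor):
    """Forward process: diffuse the clean latent z to the noisy state z_t."""
    eps = torch.randn_like(z)                       # epsilon ~ N(0, I)
    a = alpha_bar[t].view(-1, 1, 1, 1)              # broadcast over (B, C, H, W)
    return a.sqrt() * z + (1.0 - a).sqrt() * eps, eps

def sd_loss(unet, z, t, prompt_emb, alpha_bar):
    """L_SD: the U-Net predicts the noise added to z_t, conditioned on t and the prompt."""
    z_t, eps = q_sample(z, t, alpha_bar)
    eps_pred = unet(z_t, t, prompt_emb)             # epsilon_theta(z_t, t, tau); placeholder callable
    return F.mse_loss(eps_pred, eps)
```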

Adversarial Diffusion Distillation. The denoising process is crucial to the diffusion model, while adversarial training plays a central role in GANs. Recently, adversarial diffusion distillation was designed to combine these two approaches to accelerate the denoising process. There are two models in ADD: an ADD-student and an ADD-teacher. Specifically, ADD first chooses a student timestep \(s\) from a set \(T_{student}=\lbrace t_1, t_2, t_3, t_4\rbrace\), adds noise to convert \(x\) to \(x_s\), and sends it to the ADD-student. Second, the ADD-student generates \(\hat{x}_\theta(x_s, s)\), diffuses the result to the noisy state \(x_t\), and takes the ADD-teacher’s output \(\hat{x}_\phi(x_t, t)\) as the distillation target. Finally, ADD passes the student result \(\hat{x}_\theta(x_s, s)\) and the ground truth \(x_0\) to a discriminator for the adversarial loss, giving the objective: \[\mathcal{L}_{ADD} = \mathcal{L}_{dis}(\hat{x}_\theta(x_s, s), \hat{x}_\phi(x_t, t)) + \lambda\mathcal{L}_{adv}(\hat{x}_\theta(x_s, s), x_0, \psi), \label{eq3}\tag{2}\] where \(\theta\) is the weight set of the student model, which is initialized from the teacher model, \(\phi\) is the weight set of the teacher model, \(\psi\) is the weight set of the discriminator, and \(\lambda\) is the balance weight. We empirically set \(\lambda\) to \(0.02\).
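
The ADD training objective can likewise be summarized in a short sketch; `student`, `teacher`, and `discriminator` are placeholder callables, and the discriminator's own update against the real image \(x_0\) is omitted for brevity, so this is an illustrative outline under those assumptions rather than the reference implementation.

```python
import torch
import torch.nn.functional as F

def forward_diffuse(x, t, alpha_bar):
    """Re-noise an image to timestep t (same forward process as Eq. (1))."""
    a = alpha_bar[t].view(-1, 1, 1, 1)
    return a.sqrt() * x + (1.0 - a).sqrt() * torch.randn_like(x)

def add_student_loss(student, teacher, discriminator, x0, s, t, alpha_bar, lam=0.02):
    """Sketch of Eq. (2): distillation term + adversarial term with balance weight lambda."""
    x_s = forward_diffuse(x0, s, alpha_bar)
    x_hat_student = student(x_s, s)                     # \hat{x}_theta(x_s, s)

    x_t = forward_diffuse(x_hat_student, t, alpha_bar)  # re-noise the student output
    with torch.no_grad():
        x_hat_teacher = teacher(x_t, t)                 # \hat{x}_phi(x_t, t): distillation target

    loss_dis = F.mse_loss(x_hat_student, x_hat_teacher)
    loss_adv = -discriminator(x_hat_student).mean()     # generator-side adversarial loss
    return loss_dis + lam * loss_adv
```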

3.2 Image Quality-Adjusted ADD↩︎

ADD is proposed to expedite the inference process of DM for the text-to-image task, without considering low-quality image inputs. In contrast, image quality is an unavoidable factor in the BSR task. To design an acceleration strategy for the BSR task, we propose image quality-adjusted ADD, as illustrated in Fig. 2. If we directly employ ADD for the BSR task without considering image quality, i.e., using the same input (LR images) to control both the student and teacher models, we fail to fully utilize the high-frequency information of the HR images available in the training phase. In Fig. 2, we therefore substitute LR images with their HR counterparts (i.e., the input of RAM, the text encoder and the ControlNet) to regulate the output of the teacher model. This compels the student model, controlled with LR images, to produce results similar to those of the teacher model controlled with HR images.
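
A minimal sketch of this asymmetric conditioning is shown below; `student_sd`, `teacher_sd`, and `noise_to` are hypothetical callables standing for the ControlNet-conditioned SD forward passes and the forward noising of Eq. (1), so the snippet only illustrates which model sees which control signal.

```python
import torch

def iq_add_forward(student_sd, teacher_sd, noise_to, x_hr, x_lr, s, t):
    """Image quality-adjusted ADD: the student is controlled by the LR image, while the
    frozen teacher is controlled by the HR image and thus supplies a high-frequency prior."""
    x_hat_student = student_sd(noise_to(x_hr, s), s, cond=x_lr)               # LR-controlled student
    with torch.no_grad():
        x_hat_teacher = teacher_sd(noise_to(x_hat_student, t), t, cond=x_hr)  # HR-controlled teacher
    return x_hat_student, x_hat_teacher  # fed into the distillation / adversarial losses
```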

Figure 3: Illustration of the proposed prediction-based self-refinement (PSR). Previous SD-based methods usually use the LR image to control the model’s output. However, the degradation of LR images notably influences the restoration process. The proposed PSR utilizes the predicted HR image from the previous step to control the model’s output, which provides better supervision at marginal additional time cost.

3.3 Prediction-based Self-Refinement↩︎

As shown in Fig. 3, SD-based methods usually use LR images to control the output of the DM at each inference step. However, LR images may suffer from multiple degradations, which disrupt the restoration process and leave only low-frequency information available for controlling the model output (e.g., the first row of Fig. 4). Previous methods [14], [16] employ a degradation-removal module to preprocess the LR image, aiming to mitigate the impact of degradation. However, these approaches often sacrifice efficiency, which is unacceptable when designing an efficient model.

Figure 4: Visual comparisons with and without PSR. As can be seen, PSR plays a crucial role in restoring accurate details. The results without PSR are not faithful to the GT. As the timestep increases, the results with PSR exhibit better and more accurate details. Please zoom in for a better view.

To address this issue, we propose a novel prediction-based self-refinement approach, which incurs only a tiny efficiency overhead. The core idea of PSR is to fully utilize the predicted noise to enhance image restoration. Specifically, we use the following equation

\[\hat{x}_0 = (x_t-\sqrt{1-\overline{\alpha}_t}\epsilon_{\theta,t}) / \sqrt{\overline{\alpha}_t} \label{eq4}\tag{3}\]

to estimate the HR image \(\hat{x}_0\) from the predicted noise at each step, which then controls the model output in the next step, where \(x_t\) is the noisy state and \(\epsilon_{\theta,t}\) is the predicted noise at time \(t\). The \(\hat{x}_0\) at each step carries more high-frequency information and thus better controls the model output (e.g., the right side of Fig. 3 and also Fig. 4). Moreover, the additional processing time, attributed solely to solving Eq. 3, is nearly negligible. By leveraging our simple yet effective PSR, \(\mathtt{AddSR}\) notably enhances the restoration outcomes without compromising efficiency (see Tab. 5).
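
The sketch below shows how Eq. (3) plugs into a multi-step sampling loop; `unet` and `scheduler_step` are placeholders for the ControlNet-conditioned SD model and a standard DDPM/DDIM update, so this is an illustrative outline rather than the exact \(\mathtt{AddSR}\) inference code.

```python
import torch

def predict_x0(x_t, eps_pred, t, alpha_bar):
    """Eq. (3): estimate \hat{x}_0 from the noisy state x_t and the predicted noise."""
    a = alpha_bar[t].view(-1, 1, 1, 1)
    return (x_t - (1.0 - a).sqrt() * eps_pred) / a.sqrt()

def sample_with_psr(unet, scheduler_step, x_T, x_lr, timesteps, alpha_bar, prompt_emb):
    """Prediction-based self-refinement: the control signal starts as the LR image and is
    replaced by the predicted \hat{x}_0 after every step."""
    x_t, control = x_T, x_lr                                # first step is controlled by the LR image
    for t in timesteps:                                     # ordered from high noise to low
        eps_pred = unet(x_t, t, control, prompt_emb)
        control = predict_x0(x_t, eps_pred, t, alpha_bar)   # PSR: reuse \hat{x}_0 as the control
        x_t = scheduler_step(x_t, eps_pred, t)              # standard denoising update
    return control                                          # final \hat{x}_0 is the restored estimate
```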

3.4 Timestep-Adapting Loss Mechanism↩︎

Motivation. The perception-distortion trade-off [26] is a well-known phenomenon in the SR task. Ideally, a good model generates images with both high perceptual quality and high fidelity. However, as the realness of the restored image increases (e.g., by elevating the inference steps of a diffusion model), the fidelity may deteriorate. We observe that training the BSR task with ADD directly exacerbates this phenomenon, as shown in Fig. 5. Specifically, during the initial stages of inference, there is a significant decrease in fidelity, accompanied by an improvement in perceptual quality. In the later stages of inference, fidelity remains at a low level, while perceptual quality undergoes a dramatic increase. This scenario may give rise to two issues: (1) when the inference step is small, the quality of the restored image is subpar; (2) as the inference step increases, the generated images may exhibit “hallucinations”.

Figure 5: Illustration of the impact of the TA loss on the perception-distortion trade-off. (a) The perception and fidelity variation trends with and without the TA loss. CLIPIQA and PSNR stand for perceptual quality and fidelity, respectively. (b) The output at different timesteps. The final output without the TA loss hallucinates the rock into an animal head, while our \(\mathtt{AddSR}\) with the TA loss retains the original appearance of the rock.

The main reason can be attributed to ADD, which employs the same weights for the GAN loss and distillation loss across different student timesteps. Since the perceptual quality of diffusion-model-generated images typically increases as the student timesteps are elevated, the weight-invariant ADD may impose insufficient adversarial constraints on the student model during the initial stages of inference, making it unable to generate perceptually high-quality images. During the later inference stages, however, the adversarial constraint on the model may be too strong, producing “hallucinations”.

Timestep-adapting Loss. To address this issue, we dynamically adjust the weight of the distillation loss across different student timesteps. In particular, the weight of the distillation loss increases as the timestep decreases. The loss function can be reformulated as follows:

\[\mathcal{L}_{ADD} = \gamma\mathcal{L}_{dis}(\hat{x}_\theta(x_s, s), \hat{x}_\phi(x_t, t)) + \lambda\mathcal{L}_{adv}(\hat{x}_\theta(x_s, s), x_0, \psi) \label{dqsbvcyj}\tag{4}\]

where \(\gamma\) is a weight that is negatively correlated with the timestep. Extensive experiments demonstrate that the proposed loss is simple yet effective (see Sec. 4.5).
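
One possible weighting schedule consistent with this description is sketched below; the linear schedule and the maximum student timestep are illustrative assumptions, not the values used in the paper.

```python
import torch.nn.functional as F

def gamma_schedule(s: float, s_max: float = 999.0) -> float:
    """Illustrative gamma for Eq. (4): negatively correlated with the student timestep s,
    so the distillation term is weighted more heavily at small timesteps (late inference)."""
    return 1.0 - 0.5 * (s / s_max)      # ~0.5 at the largest timestep, ~1.0 near timestep 0

def ta_loss(x_student, x_teacher, disc_logits_fake, s, lam=0.02):
    """Timestep-adapting loss: gamma * L_dis + lambda * L_adv (generator side)."""
    gamma = gamma_schedule(s)
    loss_dis = F.mse_loss(x_student, x_teacher)
    loss_adv = -disc_logits_fake.mean()
    return gamma * loss_dis + lam * loss_adv
```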

4 Experiments↩︎

4.1 Experimental Settings↩︎

Training Datasets. We adopt DIV2K [27], Flickr2K [28], the first \(20\)K images from LSDIR [29], and the first \(10\)K face images from FFHQ [30] for training. We use the same degradation model as Real-ESRGAN [5] to synthesize HR-LR pairs.

Test Datasets. We evaluate \(\mathtt{AddSR}\) on \(4\) datasets: DIV2K-val [27], DRealSR [31], RealSR [32] and RealLR200 [15]. It is worth noting that we apply \(4\) degradation types to DIV2K-val to comprehensively assess \(\mathtt{AddSR}\), and that, except for RealLR200, all datasets are cropped to \(512\) \(\times\) \(512\) and degraded to \(128\) \(\times\) \(128\) LR images.

Implementation Details. We adopt SeeSR [15] as the teacher model. Note that our approach is applicable to most of the existing SD-based BSR methods for improving restoration results and acceleration. The student model is initialized from the teacher model, and fine-tuned with Adam optimizer for \(50\)K iterations. The batch size and learning rate are set to \(6\) and \(2\) \(\times10^{-5}\), respectively. \(\mathtt{AddSR}\) is trained under \(512\) \(\times\) \(512\) resolution images with \(3\) NVIDIA A\(100\) GPUs (\(40\)G).

Evaluation Metrics. We employ non-reference metrics (e.g., MANIQA [33], MUSIQ [34], CLIPIQA [35]) and reference metrics (e.g., LPIPS [36], PSNR, SSIM [37]) to comprehensively evaluate our \(\mathtt{AddSR}\). We consider the non-reference metrics as the primary metrics since they are closer to human perception.

Compared Methods. We demonstrate the superiority of \(\mathtt{AddSR}\) against several state-of-the-art BSR methods, including GAN-based methods: BSRGAN [4], Real-ESRGAN [5], MM-RealSR [38], LDL [7], FeMaSR [8], and diffusion-based methods: StableSR [16], ResShift [10], PASD [13], DiffBIR [14], SeeSR [15].

Table 2: Quantitative comparison with state of the arts on different degradation cases.
Datasets Metrics BSRGAN Real-ESRGAN MM-RealSR LDL FeMaSR StableSR-200 ResShift-15 PASD-20 DiffBIR-50 SeeSR-50 \(\mathtt{AddSR}\)-\(1\) \(\mathtt{AddSR}\)-\(4\)
    ICCV 2021 ICCVW 2021 ECCV 2022 CVPR 2022 MM 2022 Arxiv 2023 NeurIPS 2023 Arxiv 2023 Arxiv 2023 CVPR 2024 - -
MANIQA\(^*\uparrow\) 0.3990 0.3859 0.3959 0.3501 0.4603 0.4088 0.4582 0.4405 0.4680 0.5082 0.4223 0.6051
  MUSIQ\(^*\uparrow\) 66.06 63.32 64.22 61.10 65.31 65.46 65.50 66.80 67.61 68.88 64.74 70.07
  CLIPIQA\(^*\uparrow\) 0.5951 0.5367 0.5967 0.5120 0.6773 0.6483 0.6803 0.6396 0.6934 0.7039 0.5801 0.7350
  NIQE\(^*\downarrow\) 5.01 5.21 5.22 5.39 5.80 5.35 5.74 4.68 4.88 5.06 5.42 4.85
  LPIPS\(^*\downarrow\) 0.2003 0.1962 0.1934 0.1892 0.1770 0.1944 0.1544 0.1891 0.2388 0.3085 0.3421 0.3385
  PSNR\(\uparrow\) 25.52 25.30 24.35 25.09 23.74 24.45 25.53 25.15 23.43 24.61 23.75 22.74
  SSIM\(\uparrow\) 0.7091 0.7158 0.7232 0.7282 0.6788 0.6904 0.7206 0.6896 0.6025 0.6709 0.6215 0.5948
MANIQA\(^*\uparrow\) 0.3823 0.3688 0.3796 0.3337 0.4184 0.3587 0.4195 0.4124 0.4648 0.4974 0.4210 0.6006
  MUSIQ\(^*\uparrow\) 64.73 60.89 62.21 58.64 62.96 60.85 62.02 64.41 67.09 68.27 64.21 69.92
  CLIPIQA\(^*\uparrow\) 0.5752 0.5116 0.5687 0.4910 0.6390 0.5819 0.6375 0.6026 0.6857 0.6892 0.5637 0.7256
  NIQE\(^*\downarrow\) 5.17 5.54 5.66 5.72 5.62 5.96 6.25 5.01 5.18 5.32 5.38 5.74
  LPIPS\(^*\downarrow\) 0.2240 0.2267 0.2295 0.2226 0.1979 0.2384 0.2029 0.2223 0.2522 0.2124 0.2930 0.3575
  PSNR\(\uparrow\) 25.07 24.74 24.20 24.45 24.00 24.01 24.95 24.70 22.97 24.12 21.72 19.80
  SSIM\(\uparrow\) 0.6820 0.6890 0.6927 0.6973 0.6730 0.6596 0.6926 0.6688 0.5802 0.6508 0.5580 0.4780
MANIQA\(^*\uparrow\) 0.2645 0.3120 0.3285 0.3138 0.3123 0.3485 0.3741 0.4270 0.4121 0.5537 0.4904 0.6097
  MUSIQ\(^*\uparrow\) 50.47 53.43 56.53 53.30 56.55 52.24 60.99 64.20 61.85 70.32 66.72 70.10
  CLIPIQA\(^*\uparrow\) 0.4543 0.4761 0.5158 0.6208 0.5178 0.4414 0.5949 0.5503 0.6149 0.7557 0.6641 0.7369
  NIQE\(^*\downarrow\) 7.04 6.00 4.40 5.61 4.27 5.12 6.10 5.02 5.09 4.95 4.73 5.53
  LPIPS\(^*\downarrow\) 0.4611 0.3601 0.3052 0.3138 0.3267 0.4017 0.3129 0.3451 0.3404 0.2999 0.3409 0.4092
  PSNR\(\uparrow\) 17.90 21.97 22.04 22.68 21.84 21.20 22.78 22.12 22.22 21.04 21.11 20.58
  SSIM\(\uparrow\) 0.5210 0.6044 0.5998 0.5838 0.5421 0.5077 0.5979 0.5587 0.5311 0.5388 0.5630 0.5526
MANIQA\(^*\uparrow\) 0.3524 0.3374 0.3287 0.3082 0.3271 0.3452 0.3702 0.4024 0.4538 0.5266 0.4599 0.6046
  MUSIQ\(^*\uparrow\) 59.83 55.54 55.30 52.79 60.87 61.21 56.99 63.25 64.50 69.08 65.20 69.91
  CLIPIQA\(^*\uparrow\) 0.5380 0.5047 0.4978 0.4699 0.6061 0.6010 0.5888 0.5733 0.6626 0.7180 0.6057 0.7286
  NIQE\(^*\downarrow\) 5.31 5.69 5.70 5.77 4.87 6.33 7.03 5.49 4.93 5.06 5.42 4.85
  LPIPS\(^*\downarrow\) 0.3223 0.3346 0.3372 0.3272 0.2922 0.3429 0.3526 0.3482 0.3502 0.3085 0.3421 0.3385
  PSNR\(\uparrow\) 23.04 22.70 22.47 22.36 22.17 22.39 22.36 22.25 21.46 21.86 21.57 21.10
  SSIM\(\uparrow\) 0.5866 0.5935 0.5950 0.5948 0.5633 0.5704 0.5574 0.5594 0.5029 0.5474 0.5489 0.5002

Figure 6: Visual comparison on synthetic LR images.

Figure 7: Visual comparison on real-world LR images.

4.2 Evaluation on Synthetic Data↩︎

To demonstrate the superiority of the proposed \(\mathtt{AddSR}\) in handling various degradation cases, we synthesize \(4\) test datasets from the DIV2K-val dataset with different degradation processes. The quantitative results are summarized in Tab. 2. The conclusions include: (\(1\)) Our \(\mathtt{AddSR}\)-\(4\) achieves the highest scores in MANIQA, MUSIQ and CLIPIQA across the \(4\) degradation cases, except for a 0.3\(\%\) and 2\(\%\) lower score in MUSIQ and CLIPIQA under noise degradation, respectively. Especially for MANIQA, \(\mathtt{AddSR}\) surpasses the second-best method by more than 16\(\%\) on average. (\(2\)) The highest scores in PSNR, SSIM and LPIPS are usually obtained by GAN-based methods. The reason may be that, due to the powerful generative abilities of diffusion-based models, some realistic details are generated that do not exist in the ground truth, thereby leading to lower scores on full-reference metrics. (\(3\)) \(\mathtt{AddSR}\)-\(1\) generates results comparable to other SD-based methods except SeeSR, but significantly reduces the sampling steps (i.e., from \(\geq\) \(15\) steps to only \(1\) step).

For a more intuitive comparison, we provide visual results in Fig. 6. One can see that GAN-based methods like FeMaSR fail to reconstruct clean and detailed HR images for the three displayed LR images. As for the SD-based method DiffBIR, it tends to generate blurry results. This is mainly because DiffBIR uses a degradation-removal module to remove the degradation of LR images; however, the processed LR image is blurry, which may lead to blurry results. Thanks to our proposed PSR, \(\mathtt{AddSR}\) uses the predicted \(\hat{x}_0\), which carries more high-frequency information, to control the model output at nearly no extra time cost. With the TA loss, \(\mathtt{AddSR}\) can generate precise images with rich details. In a nutshell, \(\mathtt{AddSR}\) produces images with better perceptual quality than the state-of-the-art models while requiring fewer inference steps and less time.

Table 3: Quantitative comparison with state of the arts on real-world LR images.
Datasets Metrics BSRGAN Real-ESRGAN MM-RealSR LDL FeMaSR StableSR-200 ResShift-15 PASD-20 DiffBIR-50 SeeSR-50 \(\mathtt{AddSR}\)-\(1\) \(\mathtt{AddSR}\)-\(4\)
    ICCV 2021 ICCVW 2021 ECCV 2022 CVPR 2022 MM 2022 Arxiv 2023 NeurIPS 2023 Arxiv 2023 Arxiv 2023 CVPR 2024 - -
RealSR MANIQA\(^*\uparrow\) 0.3762 0.3727 0.3966 0.3417 0.3609 0.3656 0.3750 0.4041 0.4392 0.5396 0.4878 0.6347
  MUSIQ\(^*\uparrow\) 63.28 60.36 62.94 58.04 59.06 61.11 56.06 62.92 64.04 69.82 67.14 71.99
  CLIPIQA\(^*\uparrow\) 0.5116 0.4492 0.5281 0.4295 0.5408 0.5277 0.5421 0.5187 0.6491 0.6700 0.5518 0.7007
  PSNR\(\uparrow\) 26.49 25.78 23.69 25.09 25.17 25.63 26.34 26.67 25.06 25.24 23.12 20.95
  SSIM\(\uparrow\) 0.7667 0.7621 0.7470 0.7642 0.7359 0.7483 0.7352 0.7577 0.6664 0.7204 0.6548 0.5615
DrealSR MANIQA\(^*\uparrow\) 0.3431 0.3428 0.3625 0.3237 0.3178 0.3222 0.3284 0.3874 0.4646 0.5125 0.4581 0.5919
  MUSIQ\(^*\uparrow\) 57.17 54.27 56.71 52.38 53.70 52.28 50.14 55.33 60.40 65.08 62.13 67.6
  CLIPIQA\(^*\uparrow\) 0.5094 0.4514 0.5171 0.4410 0.5639 0.5101 0.5287 0.5384 0.6397 0.6910 0.5933 0.7027
  PSNR\(\uparrow\) 28.68 28.57 26.84 27.41 26.83 29.14 28.27 29.06 26.56 28.09 26.70 24.39
  SSIM\(\uparrow\) 0.8022 0.8042 0.7959 0.8069 0.7545 0.8040 0.7542 0.7906 0.6436 0.7664 0.7380 0.6594
RealLR200 MANIQA\(^*\uparrow\) 0.3688 0.3656 0.3879 0.3266 0.4099 0.3672 0.4182 0.4193 0.4626 0.4911 0.4180 0.5969
  MUSIQ\(^*\uparrow\) 64.87 62.93 65.24 60.95 64.24 62.89 60.25 66.35 66.84 68.63 66.86 72.63
  CLIPIQA\(^*\uparrow\) 0.5699 0.5423 0.6010 0.5088 0.6547 0.5916 0.6468 0.6203 0.6965 0.6617 0.5853 0.7417

4.3 Evaluation on Real-World Data↩︎

Tab. 3 shows the quantitative results on \(3\) real-world datasets. We can see that our \(\mathtt{AddSR}\) achieves the best scores in MANIQA, MUSIQ and CLIPIQA, the same as in the synthetic degradation cases. This demonstrates that \(\mathtt{AddSR}\) has excellent generalization ability for handling unknown complex degradations, making it practical in real-world scenarios. Additionally, \(\mathtt{AddSR}\)-\(1\) surpasses the previous GAN-based methods, primarily due to the integration of IQ-ADD, which combines a diffusion model with adversarial training. This integration enables \(\mathtt{AddSR}\) to leverage high-level information to enhance the restoration process and generate high-perception-quality images, even with one-step inference.

Fig. 7 shows the visualization results. We consider \(3\) kinds of images, including buildings, cars and faces, to comprehensively compare the various methods. A noticeable observation is that \(\mathtt{AddSR}\) generates clearer and more regular lines, as evidenced by the linear pattern of the building in the first example and the intake grille of the car in the second example. In the third example, the original LR image is heavily degraded: FeMaSR and ResShift fail to generate the human face, showing only a blurry outline of the face. DiffBIR can generate more details, yet the result is still unclear. The image generated by SeeSR exhibits artifacts, which may be caused by the LR image’s degradation. Conversely, our \(\mathtt{AddSR}\) can generate results comparable to FeMaSR and ResShift in one step. As the inference steps increase, \(\mathtt{AddSR}\) generates a clearer and more detailed human face, which significantly surpasses the aforementioned methods.

Figure 8: Visual examples illustrate that engaging with manual prompts leads to more precise restoration outcomes. In each group, the prompts for the second and third images are obtained through RAM and manual input, respectively. The results are generated using \(4\) steps. Please zoom in for a better view.

4.4 Prompt-Guided Restoration↩︎

One of the advantages of the diffusion model is its ability to integrate with text. In Fig. 8, we demonstrate that our \(\mathtt{AddSR}\) can efficiently achieve more precise restoration results in \(4\) steps by incorporating manual prompts, i.e., we can manually input a text description of the LR image to assist the restoration process. Specifically, Fig. 8 (a) shows that the plaid shirt in the restored image with the RAM prompt can be edited to camouflage through the manual prompt. In Fig. 8 (b), the Spider-Man restored with the original RAM prompt has a mouth and beard, whereas with the manual prompt, the man is accurately restored wearing a spiderweb suit. In Fig. 8 (c), the word on the chip is corrected from ‘2ALC515’ to ‘24LC515’ with the manual prompt. In Fig. 8 (d), the mushroom’s background should appear blurry, but the RAM prompt renders the tree branches sharply; conversely, the manual prompt maintains the background’s intended blur, aligning with the ground truth.

Table 4: Ablation study on refined training process.
Exp RAM RealLR200 DrealSR
      MANIQA\(^*\uparrow\) MUSIQ\(^*\uparrow\) CLIPIQA\(^*\uparrow\) MANIQA\(^*\uparrow\) MUSIQ\(^*\uparrow\) CLIPIQA\(^*\uparrow\) PSNR\(\uparrow\)
(1) 0.5603 68.68 0.6744 0.5625 63.19 0.6206 24.10
(2) 0.5745 69.72 0.6978 0.5716 63.14 0.6636 24.09
(3) 0.5871 68.73 0.7058 0.5865 66.17 0.6861 24.54
\(\mathtt{AddSR}\) 0.5969 72.63 0.7417 0.5919 67.60 0.7027 24.39
Table 5: Ablation study on PSR.
Methods Time[s] RealLR200 DrealSR
    MANIQA\(^*\uparrow\) MUSIQ\(^*\uparrow\) CLIPIQA\(^*\uparrow\) MANIQA\(^*\uparrow\) MUSIQ\(^*\uparrow\) CLIPIQA\(^*\uparrow\) PSNR\(\uparrow\)
\(\mathtt{AddSR}\)(w/o PSR) 0.44\(\sim\)​0.77 0.5788 71.19 0.6887 0.5672 66.67 0.6589 25.85
\(\mathtt{AddSR}\) (w PSR) 0.44\(\sim\)​0.80 0.5969 72.63 0.7417 0.5919 67.60 0.7027 24.39

Figure 9: Visual comparison of refined training process and PSR.

4.5 Ablation Study↩︎

Effectiveness of Refined Training Process. To make the teacher model provide more information for the student model to learn, we refine the training process by replacing the LR image input of the ControlNet, RAM and text encoder with the HR image. Since SeeSR is adopted as the baseline, we also replace the LR image input of RAM with the HR image. The quantitative results are shown in Tab. 4, and the visual comparisons are shown in Fig. 9 (a). We can see that, with supervision from the HR input, the restored images of the student model become clearer and contain more details. The quantitative results support the same conclusion, as the perceptual quality improves.

Effectiveness of PSR. As shown in Tab. 5, PSR notably improves the perceptual quality of the restored LR image while being only \(0.03\)s slower than the original LR controlling strategy. The fidelity of restored images deteriorates after using PSR, since PSR uses the former step’s output, which carries more high-frequency information, to control the next step’s result and may generate some extra details, as discussed in Sec. 4.2. The visual comparisons are shown in Fig. 9: we observe that the output of the first step, whether or not PSR is employed, remains nearly the same, as both are generated under the control of the LR image. However, when the step reaches \(4\), the model with PSR generates images with higher perceptual quality.

Table 6: Ablation study on TA loss.
Methods RealLR200 DrealSR
  MANIQA\(^*\uparrow\) MUSIQ\(^*\uparrow\) CLIPIQA\(^*\uparrow\) MANIQA\(^*\uparrow\) MUSIQ\(^*\uparrow\) CLIPIQA\(^*\uparrow\) PSNR\(\uparrow\)
\(\mathtt{AddSR}\) (w/o TA loss) 0.5967 71.55 0.7304 0.5902 67.26 0.6989 24.09
\(\mathtt{AddSR}\) (w TA loss) 0.5969 72.63 0.7417 0.5919 67.60 0.7027 24.39

Figure 10: Visual comparison of TA loss.

Importance of TA Loss. The TA loss aims to increase the fidelity of restored images while maintaining perceptual quality. The quantitative results are shown in Tab. 6. As can be seen, although we enlarge the weight of the distillation loss in the last two steps, the perceptual quality still improves. The reason may be that the first two steps already generate images of sufficiently good perceptual quality, which provide more information when combined with PSR; thus the last two steps can achieve high perceptual quality while maintaining good fidelity.

The visual results are shown in Fig. 10. For the left six images, the original background is rock. However, without utilizing the TA loss, \(\mathtt{AddSR}\) might hallucinate the background as another blurred wolf; consequently, in the final step, the background transforms into the eye of a wolf. For the right six images, the image content is a statue, yet in the first row the model hallucinates its hand as a bird. Such hallucination compromises the fidelity of the restored images, as illustrated in Tab. 6. Conversely, with the help of the TA loss, the restored images contain content more consistent with the GTs. As depicted in the second row of Fig. 10, the TA loss constrains the model from excessively leveraging its generative capabilities, thereby preserving more of the image content and aligning closely with the GTs. Specifically, with the TA loss, the background of the left image retains the rock with its out-of-focus appearance, and the texture of the statue’s hand in the right image remains unchanged.

5 Conclusion↩︎

We propose \(\mathtt{AddSR}\), an effective and efficient model based on the Stable Diffusion prior for blind super-resolution. In contrast to current SD-based BSR approaches that employ LR images to regulate each inference step’s output, \(\mathtt{AddSR}\) substitutes the LR image with the HR image estimated in the preceding step to control the model output and mitigate the degradation impact of LR images. To further augment the model’s restoration capacity, we introduce the timestep-adapting loss, which assigns distinct weights to the GAN loss and distillation loss across different steps, addressing the perception-distortion imbalance issue introduced by the original ADD. Additionally, to tailor ADD to the BSR task, we propose an image quality-adjusted strategy, enabling the teacher model to provide better supervision to the student model. Extensive experiments demonstrate that our proposed \(\mathtt{AddSR}\) generates superior results within \(1\) \(\sim\) \(4\) steps in various degradation scenarios and on real-world low-quality images.

Limitations.

Our \(\mathtt{AddSR}\) designs an acceleration strategy incorporating PSR and the TA loss for the BSR task to generate high-perception-quality results. Although its inference speed remarkably surpasses all existing SD-based methods, there still exists a speed gap between GAN-based methods and \(\mathtt{AddSR}\). The primary factor is that \(\mathtt{AddSR}\) is built upon Stable Diffusion and ControlNet, whose substantial model parameters and intricate network structure noticeably increase the inference time. In the future, we will explore a more streamlined network architecture to enhance overall efficiency.

6 Comparisons among ADD, SeeSR and \(\mathtt{AddSR}\)↩︎

Figure 11: Comparisons of architecture diagrams among ADD, SeeSR and \(\mathtt{AddSR}\). \(x_0\) and \(y\) denote HR and LR images, respectively. \(\hat{x}_s\) and \(\hat{x}_t\) denote the predicted \(x_0\) at timesteps \(s\) and \(t\), respectively. \(\epsilon_s\), \(\epsilon_t\), \(\hat{\epsilon}_s\) and \(\hat{\epsilon}_t\) stand for the added and predicted noise at timesteps \(s\) and \(t\), respectively. \(\gamma\) is the weight of the timestep-adapting loss and \(\hat{x}_0\) is the predicted result from the former step.

In this section, we provide a comparison among ADD, SeeSR, and our proposed \(\mathtt{AddSR}\). Their architecture diagrams are depicted in Fig. 11. Firstly, the distinctions between ADD and \(\mathtt{AddSR}\) primarily lie in two aspects: \(1\)) Introduction of ControlNet: ADD was originally developed for the text-to-image task, which typically takes only text as input. In contrast, \(\mathtt{AddSR}\) is an image-to-image model that requires an additional ControlNet to receive information from the LR image. \(2\)) Perception-distortion Trade-off: ADD aims to generate photo-realistic images from text. However, introducing ADD into blind SR brings the perception-distortion imbalance issue (please refer to Sec. \(3.4\)), which is addressed by our proposed timestep-adapting loss in \(\mathtt{AddSR}\).

Secondly, the key differences between SeeSR and \(\mathtt{AddSR}\) are: \(1\)) Introduction of Distillation: SeeSR is trained based on the vanilla SD model and needs \(50\) inference steps, while \(\mathtt{AddSR}\) utilizes a teacher model to distill an efficient student model that needs just \(1\) \(\sim\) \(4\) steps. \(2\)) High-frequency Information: SeeSR uses the LR image \(y\) as the input of the ControlNet. In contrast, \(\mathtt{AddSR}\) on one hand adopts the HR image \(x_0\) as the input of the teacher model’s ControlNet to supply high-frequency signals, since the teacher model is not required during inference. On the other hand, \(\mathtt{AddSR}\) proposes a novel prediction-based self-refinement (PSR) to further provide high-frequency information by replacing the LR image with the predicted image as the input of the student model’s ControlNet. Therefore, \(\mathtt{AddSR}\) is able to generate results with more realistic details.

Figure 12: Our PSR is able to enhance the high-frequency signals of restored images to generate more photo-realistic details. The high frequency part is obtained using Fourier transform and filtering. Please zoom in for a better view.

7 Effectiveness of Prediction-based Self-Refinement↩︎

Our PSR is proposed to remove the impact of LR degradation and enhance high-frequency signals to regulate the student model output. As shown in Fig. 12, the restored images generated with PSR exhibit more details and sharper edges, while the images generated without PSR tend to be blurry with fewer details.

Figure 13: Visual comparisons between SeeSR-Turbo and \(\mathtt{AddSR}\). All results are generated with \(2\) steps. Please zoom in for a better view.

Table 7: Quantitative comparison between SeeSR-Turbo and \(\mathtt{AddSR}\). All results are generated with \(2\) steps.
Methods RealLR200
  NIQE\(\downarrow\) MANIQA\(\uparrow\) MUSIQ\(\uparrow\) CLIPIQA\(\uparrow\)
SeeSR-Turbo-2 7.87 0.3503 53.88 0.4634
\(\mathtt{AddSR}\)-2 5.22 0.5908 71.27 0.7508

Figure 14: More visual comparison results. Please zoom in for a better view.

8 Comparison with SeeSR-Turbo↩︎

A very recent SD-based method named SeeSR-Turbo2 has been introduced for blind super-resolution with \(2\)-step inference. To demonstrate the superiority of our \(\mathtt{AddSR}\), we conduct a comparison between SeeSR-Turbo and \(\mathtt{AddSR}\). The quantitative and qualitative results are shown in Tab. 7 and Fig. 13, respectively. One can see that our \(\mathtt{AddSR}\) surpasses SeeSR-Turbo by significant margins on NIQE, MANIQA, MUSIQ and CLIPIQA. In addition, \(\mathtt{AddSR}\) can generate realistic results in \(2\) steps, while SeeSR-Turbo tends to generate blurry results.

9 More Examples↩︎

We provide more visual results in Fig. 14. These examples demonstrate that \(\mathtt{AddSR}\) can produce more photo-realistic restored images within \(4\) steps compared with the SOTA competitors.

References↩︎

[1]
.
[2]
.
[3]
.
[4]
.
[5]
.
[6]
.
[7]
.
[8]
.
[9]
.
[10]
.
[11]
.
[12]
.
[13]
.
[14]
.
[15]
.
[16]
.
[17]
.
[18]
.
[19]
.
[20]
.
[21]
.
[22]
.
[23]
.
[24]
.
[25]
.
[26]
.
[27]
.
[28]
.
[29]
.
[30]
.
[31]
.
[32]
.
[33]
.
[34]
.
[35]
.
[36]
.
[37]
.
[38]
.

  1. Corresponding author. Work done when Rui Xie was an intern at Nanjing University.↩︎

  2. SeeSR-Turbo is open-sourced on March 10th, 2024.↩︎