WaveDH: Wavelet Sub-bands Guided ConvNet for Efficient Image Dehazing

Seongmin Hwang, Daeyoung Han, Cheolkon Jung, and Moongu Jeon

April 02, 2024

The surge in interest regarding image dehazing has led to notable advancements in deep learning-based single image dehazing approaches, exhibiting impressive performance in recent studies. Despite these strides, many existing methods fall short in meeting the efficiency demands of practical applications. In this paper, we introduce WaveDH, a novel and compact ConvNet designed to address this efficiency gap in image dehazing. Our WaveDH leverages wavelet sub-bands for guided up-and-downsampling and frequency-aware feature refinement. The key idea lies in utilizing wavelet decomposition to extract low-and-high frequency components from feature levels, allowing for faster processing while upholding high-quality reconstruction. The downsampling block employs a novel squeeze-and-attention scheme to optimize the feature downsampling process in a structurally compact manner through wavelet domain learning, preserving discriminative features while discarding noise components. In our upsampling block, we introduce a dual-upsample and fusion mechanism to enhance high-frequency component awareness, aiding in the reconstruction of high-frequency details. Departing from conventional dehazing methods that treat low-and-high frequency components equally, our feature refinement block strategically processes features with a frequency-aware approach. By employing a coarse-to-fine methodology, it not only refines the details at frequency levels but also significantly optimizes computational costs. The refinement is performed in a maximum 8\(\times\) downsampled feature space, striking a favorable efficiency-vs-accuracy trade-off. Extensive experiments demonstrate that our method, WaveDH, outperforms many state-of-the-art methods on several image dehazing benchmarks with significantly reduced computational costs. Our code is available at https://github.com/AwesomeHwang/WaveDH.


Single image dehazing, deep learning, wavelet sub-bands, frequency awareness

Haze, a natural atmospheric phenomenon, induces visible degradation in visual quality by affecting object appearance and contrast through color and texture distortion. Images captured in hazy conditions pose challenges for subsequent tasks such as object detection [1], vehicle re-identification [2], and scene understanding [3]. Consequently, the removal of haze from images is a critical concern in low-level vision, essential for developing effective computer vision systems. Image dehazing aims to restore the latent haze-free scene from its hazy observation, presenting an inherently ill-posed and challenging problem. For the single image dehazing task, the widely used atmospheric scattering model [4] relates the hazy observation to the clear scene as: \[I(x)=J(x)t(x)+A(1-t(x)), \label{eqn1:asm}\tag{1}\] where \(x\) is the pixel position, \(I\) is the captured hazy image, \(J\) is the corresponding clear scene, \(A\) is the global atmospheric light, and \(t\) is the medium transmission, which is formulated by the scene depth \(d\) with the atmosphere scattering parameter \(\beta\) as: \[t(x)=e^{-\beta d(x)}.\]
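To make the scattering model concrete, the following minimal sketch evaluates it for a single pixel intensity and inverts it when the transmission and atmospheric light are known. The function names and the numeric values are illustrative assumptions, not taken from the paper; in practice neither \(t\) nor \(A\) is known, which is what makes the problem ill-posed.

```python
import math

def transmission(depth, beta):
    """Medium transmission t(x) = exp(-beta * d(x)) for one pixel."""
    return math.exp(-beta * depth)

def add_haze(clear, depth, airlight, beta):
    """Hazy intensity I(x) = J(x) t(x) + A (1 - t(x)) for one pixel in [0, 1]."""
    t = transmission(depth, beta)
    return clear * t + airlight * (1.0 - t)

def remove_haze(hazy, depth, airlight, beta):
    """Invert the model when t(x) and A are known: J = (I - A) / t + A."""
    t = transmission(depth, beta)
    return (hazy - airlight) / t + airlight

# Illustrative values only (hypothetical, not from the paper).
clear_pixel, airlight, beta = 0.20, 0.90, 1.0
for depth in (0.5, 2.0, 5.0):
    hazy_pixel = add_haze(clear_pixel, depth, airlight, beta)
    restored = remove_haze(hazy_pixel, depth, airlight, beta)
    print(f"d={depth}: t={transmission(depth, beta):.3f}, "
          f"I={hazy_pixel:.3f}, recovered J={restored:.3f}")
```

As the depth grows, the transmission decays toward zero and the hazy intensity approaches the atmospheric light, so distant regions carry less information about the clear scene.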

In recent years, deep learning-based methods, leveraging the powerful learning capabilities of convolutional neural networks (CNNs), have shown outstanding performance in various computer vision tasks including single image dehazing. While some methods [6]–[10] still adhere to the atmospheric scattering model, recent studies [11]–[18] prefer an end-to-end approach, achieving superior results by predicting the latent haze-free image or its residuals versus the hazy image. Very recently, vision Transformer (ViT) [19] has also shown promise in image dehazing [20]–[22] based on its strong capability to model long-range dependencies. Despite the remarkable advancements achieved by these networks, they often rely on stacking deeper and more complex models, posing challenges for deployment on resource-limited devices such as surveillance cameras and mobile phones in real-world scenarios. This motivates the need for designing fast and lightweight deep models that offer a better trade-off between performance and computational complexity.

To address the challenges posed by heavy deep dehazing models, several methods have been proposed in recent years. Zhang et al. [23] adopted AOD-Net’s formulation and proposed a fast and accurate multi-scale dehazing network (FAMED-Net), which comprises encoders at three scales and a fusion module. Wu et al. [17] adopted an autoencoder-like (AE) framework that performs dense convolution computation in a low-resolution space and reduces the number of layers to obtain a compact model. Another approach is LD-Net [24], which jointly estimates both the transmission map and the atmospheric light, contributing to model efficiency. Although these compact architectures improve efficiency, their trade-off between efficiency and dehazing performance remains sub-optimal, and we believe there is still considerable room for improvement.

To reduce computation costs, two primary options are typically considered. The first is to use manually designed lightweight structures [25]–[28], a strategy that has been explored over the past few years. Among these structures, depthwise separable convolution (DSConv) [25] stands out as one of the most fundamental. The second is to reduce the feature size [29], [30]. While downsampling operations, such as the max pooling in [23], can effectively decrease computational costs, pooling-based operations are prone to dropping information, particularly the high-frequency components crucial for texture details. Consequently, this can adversely impact the overall reconstruction quality. Recent studies (*e.g.*, [31]) further emphasize that applying pooling operations in CNNs can hurt the shift-equivariance of deep networks.
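The saving offered by the first option can be illustrated with a simple count. The sketch below compares the weights of a standard convolution with those of a depthwise separable one (a depthwise \(k \times k\) layer followed by a pointwise \(1 \times 1\) layer). Biases are ignored, and the channel and kernel sizes are arbitrary examples rather than settings from the paper.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return c_in * c_out * k * k

def dsconv_params(c_in, c_out, k):
    """Weights of a depthwise k x k conv followed by a pointwise 1 x 1 conv."""
    depthwise = c_in * k * k      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 conv that mixes channels
    return depthwise + pointwise

# Example sizes chosen for illustration only.
c_in, c_out, k = 64, 64, 3
std = standard_conv_params(c_in, c_out, k)
dsc = dsconv_params(c_in, c_out, k)
print(std)                   # 36864
print(dsc)                   # 4672
print(round(std / dsc, 2))   # reduction factor
```

For stride-1 layers the multiply-accumulate count per output position equals the weight count, so the same ratio applies to computation at a fixed resolution.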

In this paper, our aim is to design an efficient and accurate dehazing network, and to that end, we propose WaveDH—a novel wavelet sub-bands guided dehazing ConvNet. Our WaveDH improves network effectiveness by optimizing up-and-downsampling processes through non-aggressive down-sampling. Additionally, we enhance network efficiency based on a frequency-aware feature refinement block which is designed for efficient representation learning. As the name suggests, the WaveDH is built upon wavelet transform, which decomposes input into four sub-bands for low-and-high frequency components. Since our upsampling block takes the high-frequency components returned from the same level downsampling block as an additional input, the up-and-downsampling processes provide multi-scale information while preserving high-frequency details.

Furthermore, our frequency-aware feature refinement block plays a pivotal role in elevating representation learning efficiency. Operating in a coarse-to-fine manner, this block refines both low-and-high frequency information, ensuring that WaveDH captures intricate scene details comprehensively. Within our refinement block, the Feature Mixing Block (FMB) focuses on learning structure and context at a coarse-grained level with the low-frequency sub-band. The high-frequency components are refined through a feature distillation mechanism, which lets the refined low-frequency information and the high-frequency information interact efficiently at a fine-grained level, so that WaveDH produces dehazed images with enhanced clarity and fidelity. By leveraging the invertible properties of the wavelet transform, we enhance low-frequency information in a feature space that is downsampled by a factor of two at each level, leading to a superior trade-off between performance and computational efficiency. Fig. 1 shows the comparison of our WaveDH with other state-of-the-art image dehazing methods on the SOTS indoor set.

Our key contributions can be summarized as follows:

We propose a wavelet sub-bands guided ConvNet, abbreviated as WaveDH, for fast and accurate single image dehazing. Taking advantage of discrete wavelet transform, our WaveDH achieves a superior trade-off between efficiency and performance.

We design wavelet-guided up-and-downsampling blocks that utilize inherent lossless and invertible downsampling properties of the wavelet transform for optimized upsampling and downsampling.

We present a frequency-aware feature refinement block to efficiently learn intermediate feature representations. Our refinement processing adaptively handles features in a coarse-to-fine manner based on frequency-awareness for computational efficiency.

Single image dehazing is a challenging task due to the lack of information in hazy conditions. Traditional methods mainly rely on the atmospheric scattering model [4] and handcrafted priors. Notable among these is the Dark Channel Prior (DCP) [32], which estimates the medium transmission map. Subsequently, various prior-based methods have been proposed, including the color attenuation prior (CAP) [33] and the non-local prior [34]. However, these prior-based methods may lead to unrealistic results in complex scenes that do not conform to the priors.

In recent years, learning-based methods using large-scale datasets have dominated single image dehazing. Pioneering works [6], [7] employed convolutional neural networks (CNNs) to estimate the transmission map and global atmospheric light in the physics model to restore a latent haze-free scene. Since these pioneering works, deep learning-based approaches have been explored to achieve more accurate results. Li *et al.* [35] reformulated the atmospheric scattering model (Eq. (1)) and proposed an all-in-one dehazing network (AOD-Net). Zhang *et al.* [8] proposed a densely connected pyramid network (DCPDN) that uses two sub-networks to estimate the transmission map and global atmospheric light, respectively. On the other hand, Liu *et al.* [11] proposed an attention-based multi-scale network (GridDehazeNet), which learns the feature map to restore the haze-free image directly instead of estimating the transmission map. FFA-Net [16] introduced feature attention (FA) blocks that leverage both channel and pixel attention to improve haze removal.

Since Dosovitskiy et al. [19] introduced the Transformer to computer vision, Vision Transformer (ViT) architectures have demonstrated the capability of replacing CNNs. Song et al. [20] proposed DehazeFormer, which can be considered a combination of Swin Transformer and U-Net, and showed superior performance. DeHamer [21] is a hybrid model of CNN and Transformer for image dehazing that can aggregate global attention and local attention.

While significant progress has been made by the aforementioned methods, their reliance on deeper and more complex models for performance improvement hinders real-world deployment. Additionally, most existing methods are spatial-domain-centric and neglect frequency-domain information when estimating the clear scene.

Wavelet decomposition, which is widely used in signal processing [36], [37], separates low-and-high frequency components from signals. Its application in deep learning architectures, such as CNNs and Transformers, provides both spatial and frequency information, improving their performance in various vision tasks. Some works have applied the wavelet transform to network design to enhance visual representation learning [38], [39]. Furthermore, it has been extended to diverse tasks such as style transfer [40], face hallucination [41], and image generation [42], [43].

Given its ability to decompose the input image into multi-scale sub-images and its invertibility, wavelet transform finds extensive applications in low-level vision tasks, particularly in image restoration. Bae et al. [44] proposed that deep residual learning in the CNN feature space over wavelet sub-bands can be beneficial for image restoration. Liu et al. [45] introduced a multi-level wavelet CNN (MWCNN) where wavelet transform is employed to reduce the size of feature maps, thereby improving the trade-off between image restoration performance and efficiency. Guo et al. [46] proposed DWSR, a method that takes low-resolution wavelet sub-bands and outputs residuals of corresponding sub-bands of high-resolution wavelet coefficients to recover missing details for image super-resolution. DeWRNet [47] presented an image super-resolution enhancement technique that trains low-and-high frequency sub-images with different models, utilizing high-frequency sub-images and input images derived from stationary wavelet transform for interpolation. Recent studies on deraining [48], [49] have found that wavelet transform is beneficial in decomposing rain image features into different scale information while preserving all information.

When it comes to image dehazing, the idea of exploiting a wavelet decomposition is not new. In [50] and [51], the feature learning is mainly achieved in the wavelet domain to obtain high-quality haze-free output. Specifically, Khan et al. [50] introduced a hybrid approach to dehaze the corrupted image by decomposing high-frequency sub-bands and approximating the low-frequency sub-band of the given hazy image using the wavelet domain. In [51], a multi-scale wavelet and non-local dehazing method was introduced, where non-local dehazing and wavelet denoising are respectively carried out on the low-and-high frequency sub-images to remove haze and noise. Despite the progress achieved by these approaches, existing methods do not fully exploit the sub-band domain information and the invertibility of the wavelet transform, which limits their performance.

In this paper, our main focus is on exploring the feasibility of a wavelet-guided ConvNet for efficient image dehazing to facilitate practical applications. In contrast to existing methods, our feature refinement block efficiently refines features in a coarse-to-fine fashion by separating low-frequency and high-frequency features. Moreover, by leveraging the invertible properties of the wavelet transform, we refine low-frequency information in a \(2\times\) downsampled feature space at each level without loss of information, leading to a better trade-off between performance and computational efficiency.

In this section, we present our proposed single image dehazing network, WaveDH. The method is designed to be efficient and lightweight, utilizing a hierarchical architecture with wavelet-guided up-and-downsampling blocks and frequency-aware feature refinement blocks. We begin with an overview of the overall architecture, followed by a detailed presentation of the wavelet-guided up-and-downsampling blocks. Finally, we introduce the frequency-aware feature refinement block.

Fig. 2 illustrates the comprehensive pipeline of our approach, WaveDH. Our WaveDH has a highly hierarchical architecture (*i.e.*, U-Net [52] like architecture), complemented by skip connections.

Given a hazy image \(I \in \mathbb{R}^{H \times W \times C}\), where \(H \times W\) represents spatial resolution and \(C\) is the number of channels, WaveDH starts by extracting low-level features \(Z_0 \in \mathbb{R}^{H \times W \times D}\) through a convolutional operation with \(D\) channels. These low-level features traverse a symmetric hierarchical encoder-decoder structure, culminating in high-level deep features \(Z_5 \in \mathbb{R}^{H \times W \times 2D}\). Departing from conventional pooling-based downsamplers and commonly used upsamplers like PixelShuffle [53] and transposed convolution [54], we introduce wavelet-based down-and-upsampling blocks (*i.e.*, WaveDown and WaveUP). At the core of the model, frequency-aware bottleneck blocks (*i.e.*, WaveBlock) are designed to handle both low-and-high frequency information. The WaveBlocks are selectively employed in the encoder stage and at the base of the model. In the last step, a convolutional layer is employed to map the features \(Z_5\) to the residual image \(R \in \mathbb{R}^{H \times W \times C}\), and the final haze-free estimation \(J\) is obtained as \(J = I + R\). In the following subsections, we delve into the details of the newly introduced components.

Commonly adopted downsampling operations, such as max pooling and average pooling, inevitably lead to information loss. In addition to these popular pooling-based methods, a strided convolution, considered as a learnable pooling operation, offers a promising alternative that enhances the expressive capabilities of the network. However, it comes with an increase in the number of trainable parameters and lacks invertibility. To overcome these challenges, we introduce wavelet sub-bands guided upsampling and downsampling blocks, leveraging the inherent properties of the wavelet transform to perform lossless and invertible downsampling. Fig. 2 (b) and Fig. 2 (e) depict the structure of our blocks in detail.

Our downsampling block, WaveDown, first applies the discrete wavelet transform (DWT) to the given 2D feature map \(F \in \mathbb{R}^{H \times W \times D}\), decomposing it into four wavelet sub-bands. Specifically, DWT uses the low-pass filter \(L=\left[\begin{array}{ll}1 / \sqrt{2} & 1 / \sqrt{2}\end{array}\right]\) and the high-pass filter \(H=\left[\begin{array}{ll}-1 / \sqrt{2} & 1 / \sqrt{2}\end{array}\right]\) to construct four kernels with stride 2 (*i.e.*, \(LL^T, LH^T, HL^T\), and \(HH^T\)). Next, these kernels decompose the input into the four wavelet sub-bands: \(F_{ll} \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times D}, F_{lh} \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times D}, F_{hl} \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times D}\), and \(F_{hh} \in\mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times D}\). \(F_{ll}\) is a low-frequency approximation that retains the main structural information of the feature map at a coarse-grained level. \(F_{lh}, F_{hl}\), and \(F_{hh}\) are high-frequency components that provide detailed information at a fine-grained level while retaining a significant amount of the noise in the feature map.
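The decomposition and its invertibility can be sketched in a few lines of NumPy. This is a minimal illustration of a single-level 2D Haar transform built from the two filters above, written for channels-last arrays with even height and width. The helper names are hypothetical, and the assignment of the labels `lh` and `hl` to the two mixed sub-bands follows one common convention that may differ from the authors' implementation.

```python
import numpy as np

def haar_dwt(x):
    """Single-level 2D Haar DWT of an (H, W, D) array with even H and W."""
    a = x[0::2, 0::2]  # top-left sample of every 2x2 block
    b = x[0::2, 1::2]  # top-right
    c = x[1::2, 0::2]  # bottom-left
    d = x[1::2, 1::2]  # bottom-right
    ll = (a + b + c + d) / 2    # low-pass in both directions
    lh = (-a - b + c + d) / 2   # high-pass across rows (vertical detail)
    hl = (-a + b - c + d) / 2   # high-pass across columns (horizontal detail)
    hh = (a - b - c + d) / 2    # high-pass in both directions
    return ll, lh, hl, hh

def haar_idwt(ll, lh, hl, hh):
    """Inverse of haar_dwt: rebuild the (2H, 2W, D) array from four sub-bands."""
    h, w, d = ll.shape
    x = np.empty((2 * h, 2 * w, d), dtype=ll.dtype)
    x[0::2, 0::2] = (ll - lh - hl + hh) / 2
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[1::2, 1::2] = (ll + lh + hl + hh) / 2
    return x

rng = np.random.default_rng(0)
feat = rng.standard_normal((8, 8, 4))   # toy feature map, H = W = 8, D = 4
ll, lh, hl, hh = haar_dwt(feat)
recon = haar_idwt(ll, lh, hl, hh)
print(ll.shape)                  # (4, 4, 4)
print(np.allclose(recon, feat))  # True
```

Because the four sub-bands together hold exactly as many values as the input and the transform is orthonormal, halving the resolution this way discards nothing, unlike pooling.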

Rather than discarding all high-frequency components as noise, as done in [55], [56], we utilize both the low-frequency component and the high-frequency components, since the latter contain a rich amount of detailed information. To reduce noise interference, we introduce a squeeze-and-attention mechanism. Specifically, we concatenate the four wavelet sub-bands along the channel dimension to form \(\hat{F} = [F_{ll}, F_{lh}, F_{hl}, F_{hh}] \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 4D}\). We then squeeze all information into a low-dimensional manifold across channels using a \(1\times1\) convolution layer, mapping the input into a compressed space, expressed as: \[\tilde{F} = \mathbf{W}_{p_1}(\hat{F}),\] where \(\mathbf{W}_{p_1} \in \mathbb{R}^{4D \times 2D}\) is the squeeze operation, *i.e.*, a point-wise convolution that reduces the input channels. \(\tilde{F}\) contains compressed information from \(\hat{F}\) (*e.g.*, high- and low-frequency information and noise).

To generate an attention map, \(F_{ll}\) and \(F_{lh}\), as well as \(F_{ll}\) and \(F_{hl}\), are elementwise added. Each sum then passes through its own 3\(\times\)3 depthwise convolution layer, and the two results are elementwise added again. Finally, the attention map is generated by a point-wise convolution followed by a sigmoid function. The entire process is illustrated in Fig. 2 (c) and can be formulated as follows: \[M = \sigma(\mathbf{W}_{p_2}(\mathbf{W}_{d_1}(F_{ll}+F_{lh}) + \mathbf{W}_{d_2}(F_{ll}+F_{hl}))),\] where \(\sigma\) refers to the sigmoid function, \(\mathbf{W}_{p_2} \in \mathbb{R}^{D \times 2D}\) is a point-wise convolution operation increasing the input channels, and \(\mathbf{W}_{d_1}\) and \(\mathbf{W}_{d_2}\) are depthwise convolution operations. Note that, as the high-frequency component \(F_{hh}\) contains excessive noise information, it is excluded in order to generate a more reliable attention map.

The generated attention map \(M\) is then Hadamard multiplied by \(\tilde{F}\) to suppress noise while boosting useful coarse- and fine-grained information (*e.g.*, low-frequency context and high-frequency details), as illustrated in Fig. 3. This step is formulated as: \[\tilde{F}_{att} = \tilde{F} \odot M,\] where \(\odot\) denotes Hadamard multiplication.

Finally, by elementwise adding \(\tilde{F}_{att}\) and \(\tilde{F}\), we obtain the final downsampled output feature maps: \[Y = \tilde{F}_{att} + \tilde{F},\] where \(Y \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 2D}\) is the final output. It is worth mentioning that our downsampling blocks additionally return the high-frequency components \(\hat{F}_{high} = [F_{lh}, F_{hl}, F_{hh}] \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 3D}\) to provide the upsampling blocks with frequency cues. In the notation of the upsampling block, where \(H \times W \times D\) denotes the size of the low-resolution input to that block, the same tensor has size \(H \times W \times \frac{3}{2}D\).
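The data flow of the downsampling block described above can be traced with a small NumPy sketch. It is a shape-level illustration under stated assumptions, not the authors' implementation: the weights are random placeholders, biases, normalization, and any activations other than the sigmoid are omitted, the helper names are hypothetical, and the Haar sub-band labeling follows one common convention.

```python
import numpy as np

def haar_dwt(x):
    """Single-level 2D Haar DWT of an (H, W, D) array with even H and W."""
    a, b = x[0::2, 0::2], x[0::2, 1::2]
    c, d = x[1::2, 0::2], x[1::2, 1::2]
    return ((a + b + c + d) / 2, (-a - b + c + d) / 2,
            (-a + b - c + d) / 2, (a - b - c + d) / 2)

def pointwise_conv(x, weight):
    """1x1 convolution on a channels-last array; weight is (C_in, C_out)."""
    return x @ weight

def depthwise_conv3x3(x, weight):
    """3x3 depthwise convolution with zero padding; weight is (3, 3, C)."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            out += padded[i:i + h, j:j + w] * weight[i, j]
    return out

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def wave_down(feat, w_p1, w_p2, w_d1, w_d2):
    """Sketch of the squeeze-and-attention downsampling step."""
    ll, lh, hl, hh = haar_dwt(feat)
    f_hat = np.concatenate([ll, lh, hl, hh], axis=-1)   # (H/2, W/2, 4D)
    f_tilde = pointwise_conv(f_hat, w_p1)                # squeeze to 2D channels
    mixed = depthwise_conv3x3(ll + lh, w_d1) + depthwise_conv3x3(ll + hl, w_d2)
    attn = sigmoid(pointwise_conv(mixed, w_p2))          # attention map M
    y = f_tilde * attn + f_tilde                         # attended + residual
    f_high = np.concatenate([lh, hl, hh], axis=-1)       # cues for upsampling
    return y, f_high

rng = np.random.default_rng(0)
D = 4                                                    # toy channel count
feat = rng.standard_normal((8, 8, D))
w_p1 = rng.standard_normal((4 * D, 2 * D)) * 0.1         # placeholder weights
w_p2 = rng.standard_normal((D, 2 * D)) * 0.1
w_d1 = rng.standard_normal((3, 3, D)) * 0.1
w_d2 = rng.standard_normal((3, 3, D)) * 0.1
y, f_high = wave_down(feat, w_p1, w_p2, w_d1, w_d2)
print(y.shape, f_high.shape)  # (4, 4, 8) (4, 4, 12)
```

Because the sigmoid output lies in \((0, 1)\), the sum of the attended and unattended features scales each value of \(\tilde{F}\) by a factor between 1 and 2, so the attention re-weights features rather than zeroing them out.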

To fully exploit the invertible property of the wavelet transform, we introduce a dual-upsample and fusion mechanism, as illustrated in Fig. 2 (e). The wavelet-guided upsampling block, WaveUP, consists of two upsampling modules, Inverse Discrete Wavelet Transform (IDWT) and PixelShuffle, along with a novel fusion module. PixelShuffle enlarges the spatial resolution more efficiently than transposed convolution and alleviates the checkerboard artifacts associated with it. However, relying solely on PixelShuffle may lead to a lack of high-frequency details in lightweight models due to its limited representation capacity. To address this, our upsampling block takes the high-frequency components returned from the corresponding downsampling block at the same level as an additional input.

Starting with the input feature \(F \in \mathbb{R}^{H \times W \times D}\), we split it into \(F_1 \in \mathbb{R}^{H \times W \times \frac{D}{2}}\) and \(F_2 \in \mathbb{R}^{H \times W \times \frac{D}{2}}\) along the channel dimension: \[F_1, F_2 = \text{split}(F).\]

To reduce the number of parameters in the upsampling block, we opt for a \(1 \times 1\) convolutional layer to expand the channels of the feature map \(F_1\), followed by passing it through the PixelShuffle layer. Simultaneously, \(F_2\) is concatenated with the high-frequency components \(\hat{F}_{high}\) from the corresponding downsampling block. This concatenated input is fed into the IDWT layer: \[\begin{align} & \hat{F_1} = \mathrm{PixShuffle}(\mathbf{W}_{p_3}(F_1)), \\ & \hat{F_2} = \mathrm{IDWT}(Concat(F_2, \hat{F}_{high})), \end{align}\] where \(\mathbf{W}_{p_3} \in \mathbb{R}^{\frac{D}{2} \times 2D}\) is a \(1 \times 1\) convolution expanding the channel dimension, and \(\mathrm{PixShuffle}(\cdot)\) indicates the PixelShuffle layer. \(Concat(\cdot)\) denotes the concatenation operation along the channel dimension. The two outputs, \(\hat{F_1} \in \mathbb{R}^{2H \times 2W \times \frac{D}{2}}\) and \(\hat{F_2} \in \mathbb{R}^{2H \times 2W \times \frac{D}{2}}\), are then fused using our proposed fusion module (see Fig. 2 (f)).
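The two upsampling paths can be sketched in NumPy to check that their shapes agree. This is an illustrative sketch only: the weight is a random placeholder, the helper names are hypothetical, the arrays are channels-last, and the Haar sub-band convention is an assumption. Splitting \(\hat{F}_{high}\) into three sub-bands and passing \(F_2\) as the low-frequency band is equivalent to feeding the concatenation to the IDWT.

```python
import numpy as np

def haar_idwt(ll, lh, hl, hh):
    """Inverse single-level 2D Haar DWT; each input is (H, W, C)."""
    h, w, c = ll.shape
    x = np.empty((2 * h, 2 * w, c), dtype=ll.dtype)
    x[0::2, 0::2] = (ll - lh - hl + hh) / 2
    x[0::2, 1::2] = (ll - lh + hl - hh) / 2
    x[1::2, 0::2] = (ll + lh - hl - hh) / 2
    x[1::2, 1::2] = (ll + lh + hl + hh) / 2
    return x

def pixel_shuffle(x, r=2):
    """Rearrange (H, W, C*r*r) into (H*r, W*r, C) for channels-last arrays."""
    h, w, c = x.shape
    c_out = c // (r * r)
    x = x.reshape(h, w, c_out, r, r)   # split channels into (C, r, r)
    x = x.transpose(0, 3, 1, 4, 2)     # interleave the r x r offsets spatially
    return x.reshape(h * r, w * r, c_out)

def dual_upsample(feat, f_high, w_p3):
    """Sketch of the two upsampling paths before fusion."""
    f1, f2 = np.split(feat, 2, axis=-1)          # each (H, W, D/2)
    up_shuffle = pixel_shuffle(f1 @ w_p3)        # 1x1 conv to 2D, then shuffle
    lh, hl, hh = np.split(f_high, 3, axis=-1)    # each (H, W, D/2)
    up_idwt = haar_idwt(f2, lh, hl, hh)          # F_2 acts as the LL band
    return up_shuffle, up_idwt

rng = np.random.default_rng(0)
D = 8                                            # toy channel count
feat = rng.standard_normal((4, 4, D))
f_high = rng.standard_normal((4, 4, 3 * D // 2))
w_p3 = rng.standard_normal((D // 2, 2 * D)) * 0.1   # placeholder weights
up_shuffle, up_idwt = dual_upsample(feat, f_high, w_p3)
print(up_shuffle.shape, up_idwt.shape)  # (8, 8, 4) (8, 8, 4)
```

Both paths double the spatial resolution and return \(\frac{D}{2}\) channels, which is what allows the subsequent fusion module to concatenate them into a \(D\)-channel tensor.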

To selectively extract the useful information from the two upsampled features, our feature fusion process employs co-attention for efficient feature interaction. As depicted in Fig. 2 (f), our fusion module consists of contrast-aware attention [57] (CCA) and a Fused-MBConv [28] (FMBConv) with some modification. We concatenate \(\hat{F_1}\) and \(\hat{F_2}\) along the channel dimension and pass the result through a CCA layer, which plays a pivotal role in feature interaction. The interaction of the dual-upsampled features produces channel-wise co-attention scores, used for recalibrating the concatenated features. This process is expressed as: \[\tilde{F} = \mathrm{CCA}(Concat(\hat{F_1}, \hat{F_2})),\] where \(\tilde{F} \in \mathbb{R}^{2H \times 2W \times D}\) is the resultant feature map, and \(\mathrm{CCA}(\cdot)\) indicates the CCA layer. Subsequently, we split the feature map \(\tilde{F}\) into two along the channel dimension and elementwise add the two halves: \[\begin{align} & \tilde{F_1}, \tilde{F_2} = \text{split}(\tilde{F}), \\ & \hat{R} = \tilde{F_1} + \tilde{F_2}, \end{align}\] where \(\hat{R} \in \mathbb{R}^{2H \times 2W \times \frac{D}{2}}\) is the roughly fused result. This process can be interpreted as a channel-wise weighted summation using co-attention weights. Finally, the output is obtained by applying an FMBConv layer to further refine \(\hat{R}\): \[R = H_{F}(\hat{R}),\] where \(R\) is the final fusion result, and \(H_{F}\) refers to FMBConv.

The original FMBConv significantly increases the number of parameters and FLOPs. Therefore, [58] removed the SE layer and limited the hidden dimension expansion to mitigate this problem. Believing that there is room for a better trade-off between computational cost and performance, we further modify it, as depicted in Fig. 2 (g). Group convolution, known for significantly reducing parameters compared to standard convolution, is employed in our block. We experimentally set the parameter \(r_{conv}\) to \(\frac{3}{2}\). The proposed fusion of the dual-upsampling results helps preserve consistent, detailed texture structures in the high-frequency components.

Our feature refinement block, named WaveBlock, is designed with the goal of frequency-aware discriminative feature learning. As depicted in Fig. 2 (d), this block consists of four key components: Feature Mixing Block (FMB) [58], Efficient Separable Distillation Block (ESDB) [59], and Discrete Wavelet Transform (DWT) and Inverse DWT (IDWT) layers.

Most dehazing methods tend to treat the low-and-high frequency components of the feature map equally, overlooking significant distinctions between different frequencies. This approach not only poses challenges in terms of reducing computational costs and achieving effective dehazing but also lacks flexibility in handling diverse types of information. Typically, low-frequency components convey smoother appearances in global regions, whereas high-frequency components capture local regions with richer details such as edges, textures, and other intricate features.

Our approach starts by employing wavelet decomposition to split the feature map \(F \in \mathbb{R}^{H \times W \times D}\) into a low-frequency sub-band \(F_{ll} \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times D}\) and the concatenation of high-frequency sub-bands \(F_{high} \in \mathbb{R}^{\frac{H}{2} \times \frac{W}{2} \times 3D}\). Subsequently, the low-frequency approximation \(F_{ll}\) is processed by the FMB(s): \[\hat{F_{ll}} = H_{FMBs}(F_{ll}),\] where \(H_{FMBs}\) refers to the FMB(s) and \(\hat{F_{ll}}\) is the enhanced low-frequency component. Note that, compared with the original FMB, we change the channel expansion ratio \(r_{mlp}\) from 2 to \(\frac{5}{4}\) (Fig. 4 (c)) for parameter and computation efficiency. We employ FMB to enhance \(F_{ll}\) because FMB excels in extracting global features through depth-wise convolutions with large kernel sizes while maintaining a parameter- and computation-efficient design. This enhances the low-frequency features, which contain the main structural information at a coarse-grained level. Importantly, \(\hat{F_{ll}}\) has reduced spatial resolution, which leads to substantial savings in computational cost and makes our approach particularly advantageous for resource-limited scenarios.
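As a rough illustration of this saving (our own back-of-the-envelope count, not a figure reported in the experiments), consider a single \(k \times k\) depthwise convolution applied to a feature map with \(D\) channels. Its multiply-accumulate count scales with the number of spatial positions: \[\mathrm{MACs}(H, W) = H \cdot W \cdot D \cdot k^2, \qquad \mathrm{MACs}\left(\tfrac{H}{2}, \tfrac{W}{2}\right) = \tfrac{H}{2} \cdot \tfrac{W}{2} \cdot D \cdot k^2 = \tfrac{1}{4}\,\mathrm{MACs}(H, W).\] Applying the same layer to \(F_{ll}\) instead of \(F\) therefore costs about a quarter as much, and in a feature space downsampled by a factor of \(s\) per side the factor becomes \(1/s^2\) (for example, \(1/64\) at \(s = 8\)). This count ignores the DWT and IDWT themselves, which use fixed filters but still require some computation.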

Next, \(\hat{F_{ll}}\) is concatenated with the high-frequency sub-bands \(F_{high}\), and the Inverse DWT (IDWT) is applied to reconstruct the high-resolution feature map: \[\hat{F} = \mathrm{IDWT}(Concat(\hat{F_{ll}}, F_{high})),\] where \(\hat{F} \in \mathbb{R}^{H \times W \times D}\) is the resultant feature map, now enhanced with low-frequency sub-band information. However, the high-frequency components, which contain rich detailed information (*e.g.*, object texture details), also need to be refined at a fine-grained level.

For the effective enhancement of high-frequency details, we employ a feature distillation mechanism [59]–[61]. In this process, the feature map \(\hat{F}\) is fed into ESDB for feature distillation, where the refined low-frequency information and the high-frequency information interact efficiently at a fine-grained level to boost high-frequency information. This pipeline can be expressed as: \[\tilde{F} = H_{ESDB}(\hat{F}),\] where \(H_{ESDB}\) refers to ESDB, and \(\tilde{F}\) is the final result. The ESDB replaces standard convolution with a well-designed depthwise separable convolution (*i.e.*, blueprint separable convolution) to save computation, enabling efficient feature extraction at a fine-grained level. We remove the Contrast-aware Attention (CCA) and Enhanced Spatial Attention (ESA) [62] modules in ESDB and replace them with the Large Kernel Attention (LKA) module [63] to further enhance computational efficiency. Additionally, the residual connection in ESDB is removed in favor of our residual learning approach (see Fig. 2 (d) and Fig. 4 (a)).

Our feature refinement block first handles the main structural information at a coarse-grained level and then deals with detailed information at a fine-grained level. Therefore, the feature processing proceeds in a coarse-to-fine fashion.

Model | #Blocks | Channel Dims | Conv Type |
---|---|---|---|
WaveDH | [1, 2, 3] | [32, 64, 128, 64, 32] | GroupConv |
WaveDH-Tiny | [1, 2, 2] | [24, 48, 96, 48, 24] | DWConv |

Models | PSNR | SSIM | #Params (M) | #MACs (G) |
---|---|---|---|---|
WaveDH (Full Model) | 39.35 | 0.995 | 1.490 | 7.824 |
w/o FU | 39.05 (-0.30) | 0.994 (-0.001) | 1.451 (-0.039) | 6.677 (-1.147) |
w/o WA | 39.09 (-0.26) | 0.994 (-0.001) | 1.478 (-0.012) | 7.743 (-0.081) |
w/o FU & DU | 39.08 (-0.27) | 0.994 (-0.001) | 1.513 (+0.023) | 7.080 (-0.744) |
w/o FU & WA | 38.91 (-0.44) | 0.994 (-0.001) | 1.439 (-0.051) | 6.596 (-1.228) |
w/o FU & WA & DU | 38.92 (-0.43) | 0.994 (-0.001) | 1.501 (+0.011) | 6.999 (-0.825) |

For our experiments, we select the RESIDE dataset [5], a widely acknowledged benchmark for single image dehazing. RESIDE is comprehensive, comprising five subsets: Indoor Training Set (ITS), Outdoor Training Set (OTS), Synthetic Objective Testing Set (SOTS), Real World task-driven Testing Set (RTTS), and Hybrid Subjective Testing Set (HSTS). Following the FFA-Net [16], we use ITS (13,990 image pairs) and OTS (313,950 image pairs) for training WaveDH, testing performance on the indoor (500 image pairs) and outdoor (500 image pairs) sets of SOTS. Additionally, the I-HAZE [64] dataset, containing 30 image pairs of hazy and haze-free indoor scenes, is incorporated to diversify test scenarios. We evaluate the performance using Peak Signal-to-Noise Ratio (PSNR) and Structural Similarity Index (SSIM) to ensure a comprehensive comparison with state-of-the-art methods. The choice of these datasets and metrics is pivotal in illustrating our WaveDH’s effectiveness across diverse hazing conditions.
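For reference, PSNR follows directly from the mean squared error between the restored and ground-truth images. The sketch below is a minimal pure-Python version operating on flattened pixel lists. The function name, the assumed intensity range of \([0, 1]\), and the toy values are illustrative assumptions, and the exact evaluation protocol used in the experiments (color space, data range, cropping) may differ.

```python
import math

def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio in dB between two equal-length pixel lists."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same number of pixels")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)

# Toy example: every pixel is off by 0.1, so MSE = 0.01 and PSNR is about 20 dB.
restored = [0.5, 0.5, 0.5, 0.5]
ground_truth = [0.6, 0.4, 0.6, 0.4]
print(f"{psnr(restored, ground_truth):.2f} dB")
```

Since the scale is logarithmic, a gain of 0.3 dB corresponds to roughly a 7% reduction in mean squared error.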

Conv type | PSNR | SSIM | #Params (M) | #MACs (G) |
---|---|---|---|---|
Group Conv | 39.35 | 0.995 | 1.490 | 7.824 |
Standard Conv | 39.61 (+0.26) | 0.995 (+0.000) | 2.729 (+1.239) | 10.576 (+2.752) |
Depthwise Conv | 38.72 (-0.63) | 0.994 (-0.001) | 1.252 (-0.238) | 6.600 (-1.224) |

The models are trained and tested separately for indoor and outdoor scenes. During the training stage, we train our models in RGB channels for 700 epochs on ITS and for 60 epochs on OTS. In each training mini-batch, we randomly crop 32 patches of size \(256 \times 256\) from hazy images as the input. The proposed model is optimized by minimizing the L1 loss and the contrastive loss [17] with the AdamW optimizer [65]. The initial learning rate is set to 0.0005 on ITS and 0.0002 on OTS, and the cosine annealing strategy [66] is used to adjust the learning rate. All experiments are conducted with the PyTorch framework.
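The learning-rate schedule can be sketched with the closed form of cosine annealing without restarts. The initial learning rates and epoch counts below come from the settings above, while the minimum learning rate of zero and the per-epoch update granularity are assumptions made for illustration, since they are not stated here.

```python
import math

def cosine_annealing_lr(epoch, total_epochs, lr_init, lr_min=0.0):
    """Cosine-annealed learning rate at a given epoch (no warm restarts)."""
    progress = epoch / total_epochs
    return lr_min + 0.5 * (lr_init - lr_min) * (1.0 + math.cos(math.pi * progress))

# ITS setting: initial learning rate 0.0005, 700 epochs (lr_min = 0 is assumed).
for epoch in (0, 175, 350, 525, 700):
    print(epoch, f"{cosine_annealing_lr(epoch, 700, 5e-4):.6f}")

# OTS setting: initial learning rate 0.0002, 60 epochs.
print(f"{cosine_annealing_lr(30, 60, 2e-4):.6f}")
```

The rate starts at its initial value, reaches half of it at the midpoint of training, and decays smoothly to the minimum at the final epoch.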

We present two models that differ in the number of feature channels and the type of convolution in FMBConv. The detailed configurations are reported in Table 1. The training code and our models will be made publicly available.

In our ablation experiments conducted on the SOTS-indoor dataset, we examine the individual and collective contributions of three pivotal components of the WaveDH model: the Wavelet Attention (WA) module, the dual-upsampling (DU) mechanism, and the fusion (FU) module. Our analysis, summarized in Table 2, quantifies the impact of each module on the dehazing performance. In addition to these core components, we also present an analysis of the convolution types within the Fused-MBConv (FMBConv), detailed in Table 3. Moreover, we examine the influence of the Group Convolution Expansion Ratio (\(r_{conv}\)) and the MLP Expansion Ratio (\(r_{mlp}\)) on the overall efficacy and efficiency of WaveDH, with findings presented in Tables 4 and 5.

Our ablation study begins with a detailed evaluation of the dual-upsample and fusion mechanism, which is strategically formulated to address the lack of high-frequency information. The dual-upsampling (DU) scheme comprises a pair of upsampling modules, Inverse Discrete Wavelet Transform (IDWT) and PixelShuffle. Our fusion module (FU) is designed to effectively fuse the channel-wise features from the dual-upsampling (DU) module, ensuring a detail-rich output. (See Fig. 5)
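The two upsampling paths named above can be illustrated on a toy single-channel example. The sketch below assumes a single-level orthonormal Haar wavelet (the wavelet family is not stated in this section) and a PixelShuffle with upscale factor 2; the function names are hypothetical, and the sub-band naming (LH, HL) follows one of several conventions in use. In the network the IDWT would act on learned feature sub-bands rather than on the direct DWT of an image, so this only shows the mechanics of each operator.

```python
def haar_dwt2(x):
    """Single-level 2D Haar DWT of an H x W grid (H and W even).

    Returns four (H/2) x (W/2) sub-bands: LL, LH, HL, HH.
    """
    h, w = len(x), len(x[0])
    ll, lh, hl, hh = [[[0.0] * (w // 2) for _ in range(h // 2)] for _ in range(4)]
    for i in range(h // 2):
        for j in range(w // 2):
            a, b = x[2 * i][2 * j], x[2 * i][2 * j + 1]
            c, d = x[2 * i + 1][2 * j], x[2 * i + 1][2 * j + 1]
            ll[i][j] = (a + b + c + d) / 2  # local average (low frequency)
            lh[i][j] = (a - b + c - d) / 2  # difference across columns
            hl[i][j] = (a + b - c - d) / 2  # difference across rows
            hh[i][j] = (a - b - c + d) / 2  # diagonal difference
    return ll, lh, hl, hh

def haar_idwt2(ll, lh, hl, hh):
    """Inverse of haar_dwt2: rebuilds a 2H x 2W grid from four H x W sub-bands."""
    h, w = len(ll), len(ll[0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for i in range(h):
        for j in range(w):
            s, p, q, r = ll[i][j], lh[i][j], hl[i][j], hh[i][j]
            out[2 * i][2 * j] = (s + p + q + r) / 2
            out[2 * i][2 * j + 1] = (s - p + q - r) / 2
            out[2 * i + 1][2 * j] = (s + p - q - r) / 2
            out[2 * i + 1][2 * j + 1] = (s - p - q + r) / 2
    return out

def pixel_shuffle2(channels):
    """Rearrange 4 channels of size H x W into one 2H x 2W map (factor 2)."""
    h, w = len(channels[0]), len(channels[0][0])
    out = [[0.0] * (2 * w) for _ in range(2 * h)]
    for k, ch in enumerate(channels):
        di, dj = divmod(k, 2)  # channel k fills offset (di, dj) of each 2x2 cell
        for i in range(h):
            for j in range(w):
                out[2 * i + di][2 * j + dj] = ch[i][j]
    return out

image = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
bands = haar_dwt2(image)
restored = haar_idwt2(*bands)
shuffled = pixel_shuffle2(bands)
print(restored == image)              # IDWT inverts the DWT
print(len(shuffled), len(shuffled[0]))  # both paths double the resolution
```

Both operators map four half-resolution channels to one full-resolution channel, but the IDWT combines them with fixed wavelet filters while PixelShuffle only interleaves them, which is one way to see why the two paths can carry complementary information.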

To investigate the contribution of the fusion module, we simplify the model by removing it and replacing the concatenation operation with an element-wise addition, a naive form of fusion. The experimental results in Table 2 indicate that the full WaveDH model achieves a PSNR of 39.35. When the fusion module is removed (w/o FU), the PSNR drops by 0.30 to 39.05. Moreover, we evaluate the effectiveness of the DU module in the w/o FU variant by excluding the IDWT layer and using only the PixelShuffle layer with a \(1 \times 1\) convolution for upsampling. Although this variant (w/o FU & DU) shows a slight increase in PSNR, it also increases the computational cost and the number of parameters, as indicated in Table 2. These results demonstrate the effectiveness of the dual-upsample and fusion mechanism in enhancing dehazing performance while maintaining computational efficiency.

Ratio | PSNR | SSIM | #Params (M) | #MACs (G) |
---|---|---|---|---|
\(r_{conv}\) = 1.5 | 39.35 | 0.995 | 1.490 | 7.824 |
\(r_{conv}\) = 1.25 | 39.15 (-0.20) | 0.995 (+0.000) | 1.429 (-0.061) | 7.657 (-0.167) |
\(r_{conv}\) = 2 | 38.65 (-0.70) | 0.994 (-0.001) | 1.612 (+0.122) | 8.160 (+0.336) |

Ratio | PSNR | SSIM | #Params (M) | #MACs (G) |
---|---|---|---|---|
\(r_{mlp}\) = 1.25 | 39.35 | 0.995 | 1.490 | 7.824 |
\(r_{mlp}\) = 1.5 | 39.36 (+0.01) | 0.994 (-0.001) | 1.519 (+0.029) | 7.875 (+0.051) |
\(r_{mlp}\) = 2 | 38.55 (-0.80) | 0.994 (-0.001) | 1.577 (+0.087) | 7.975 (+0.151) |

Methods | SOTS-indoor (ITS) PSNR | SOTS-indoor (ITS) SSIM | SOTS-outdoor (OTS) PSNR | SOTS-outdoor (OTS) SSIM | Overhead: #Params (M) | Overhead: #MACs (G) |
---|---|---|---|---|---|---|
(TPAMI’10) DCP [32] | 16.62 | 0.818 | 19.13 | 0.815 | - | - |
(TIP’16) DehazeNet [6] | 21.14 | 0.847 | 22.46 | 0.8514 | 0.009 | 0.514 |
(ICCV’17) AOD-Net [35] | 19.06 | 0.850 | 20.29 | 0.8765 | 0.0018 | 0.114 |
(CVPR’18) GFN [67] | 22.30 | 0.880 | 21.55 | 0.844 | 0.499 | 14.94 |
(WACV’19) GCANet [13] | 30.23 | 0.980 | - | - | 0.702 | 18.41 |
(ICCV’19) GDN [11] | 32.16 | 0.984 | 30.86 | 0.982 | 0.956 | 18.77 |
(CVPR’20) MSBDN [15] | 33.67 | 0.985 | 33.48 | 0.982 | 31.35 | 41.54 |
(ECCV’20) PFDN [68] | 32.68 | 0.976 | - | - | 11.27 | 50.46 |
(AAAI’20) FFA-Net [16] | 36.39 | 0.989 | 33.57 | 0.984 | 4.456 | 287.53 |
(CVPR’21) AECR-Net [17] | 37.17 | 0.990 | - | - | 2.611 | 52.20 |
(TIP’22) SGID-PFF [69] | 38.52 | 0.991 | 30.20 | 0.975 | 13.87 | 152.8 |
(TIP’22) DehazeFormer-M [20] | 38.46 | 0.994 | 34.29 | 0.983 | 4.634 | 48.64 |
(CVPR’22) MAXIM-2S [70] | 38.11 | 0.991 | 34.19 | 0.985 | 14.1 | 216 |
(AAAI’22) UDN [71] | 38.62 | 0.991 | 34.92 | 0.987 | 4.25 | - |
WaveDH-Tiny | 36.93 | 0.992 | 34.52 | 0.983 | 0.543 | 3.507 |
WaveDH | 39.35 | 0.995 | 34.89 | 0.984 | 1.490 | 7.824 |

We also study the impact of the Wavelet Attention (WA) module, which plays a pivotal role in selectively suppressing noise while simultaneously enhancing the low-frequency context and the high-frequency detail information critical for effective dehazing. To quantify the importance of WA, we conduct experiments on a model variant without the WA module (w/o WA), simplifying the block to include only the Discrete Wavelet Transform (DWT) and a \(1 \times 1\) convolution layer. The results in Table 2 reveal that removing WA from WaveDH decreases the PSNR by 0.26, from 39.35 to 39.09, highlighting the efficacy of the WA module in our dehazing process. In terms of computational complexity, the inclusion of WA increases the number of parameters and Multiply-Accumulates (MACs) only slightly, by 0.012 million and 0.081 billion, respectively. This minimal rise in complexity reflects the efficiency of the WA module in enhancing the model capability without heavily burdening computational resources.

Furthermore, we visualize the feature maps before and after applying wavelet attention, as shown in Fig. 6. The feature maps enhanced with wavelet attention reveal distinctly clearer backgrounds and objects, since noisy pixels are effectively suppressed. The WA module yields a more informative representation because it leverages not only the low-frequency component but also the high-frequency components, which carry rich contour features.

Methods | PSNR | SSIM | #MACs (G) |
---|---|---|---|
AOD-Net [35] | 14.74 | 0.669 | 0.114 |
GCANet [13] | 15.64 | 0.709 | 18.41 |
GDN [11] | 15.12 | 0.710 | 18.77 |
FFANet [16] | 15.57 | 0.701 | 287.53 |
DehazeFlow [72] | 15.28 | 0.695 | - |
MAXIM-2S [70] | 16.25 | 0.718 | 216 |
D4 [73] | 15.61 | 0.702 | 2.246 |
WaveDH-Tiny | 15.91 | 0.720 | 3.507 |
WaveDH | 15.91 | 0.708 | 7.824 |

In this part of our ablation study, we evaluate the impact of different convolution types used in the Fused-MBConv (FMBConv) on the performance of WaveDH. We compare three convolution types: Standard Convolution, Group Convolution, and Depthwise Convolution. The results of this study are summarized in Table 3.

*Standard Convolution*: Achieves the highest PSNR of 39.61 and an SSIM of 0.995. However, it also has the highest number of parameters (2.729 million) and the greatest computational cost (10.576 billion MACs).

*Group Convolution*: Demonstrates a slightly lower PSNR of 39.35 and an SSIM of 0.995. It significantly reduces the number of parameters to 1.490 million and the computational cost to 7.824 billion MACs, presenting a more balanced trade-off between performance and efficiency.

*Depthwise Convolution*: Shows the lowest PSNR of 38.72. It further reduces the number of parameters to 1.252 million and the computational cost to 6.600 billion MACs, indicating its efficiency in terms of resource utilization.

These results emphasize the trade-offs between convolution types in terms of dehazing performance, parameter efficiency, and computational cost. After careful consideration of these results, we opt for group convolution in our WaveDH model as a balance between performance and computational efficiency. While standard convolution offers slightly better performance, its high computational demand and parameter count are less suited to scenarios where resources are limited or efficiency is paramount. On the other hand, depthwise convolution, despite its impressive efficiency, falls short of the dehazing performance that our model aims to achieve.
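The cost ordering of the three convolution types follows directly from how the number of groups divides the weight tensor. The sketch below counts weights and MACs for a single bias-free \(3 \times 3\) layer; the channel width (48), feature-map size (\(64 \times 64\)), and group count (4) are hypothetical values chosen for illustration and are not the WaveDH configuration.

```python
def conv2d_cost(c_in, c_out, kernel, groups, height, width):
    """Weights and multiply-accumulates of a bias-free 2D convolution.

    Assumes stride 1 and 'same' padding, so the output is height x width.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channels must be divisible by groups")
    # Each output channel only sees c_in / groups input channels.
    params = (c_in // groups) * c_out * kernel * kernel
    macs = params * height * width
    return params, macs

channels, size = 48, 64  # hypothetical width and feature-map size
for name, groups in (("standard", 1), ("group (g=4)", 4), ("depthwise", channels)):
    params, macs = conv2d_cost(channels, channels, 3, groups, size, size)
    print(f"{name:12s} params={params:6d} MACs={macs / 1e6:7.2f}M")
```

For a single layer the cost scales as 1/groups; the whole-model differences in Table 3 are smaller than this because only the FMBConv layers change while the rest of the network stays fixed.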

A critical hyperparameter of the Fused-MBConv (FMBConv) in our proposed WaveDH is the group convolution expansion ratio, denoted as \(r_{conv}\), which has a direct bearing on the model capacity and efficiency. We evaluate the influence of \(r_{conv}\) on image dehazing performance through quantitative analysis.

Establishing \(r_{conv}\) at 1.5, our WaveDH achieves an optimal balance between dehazing quality and computational efficiency. This baseline configuration yields a PSNR of 39.35 and an SSIM of 0.995, with a manageable number of parameters of 1.490 million and 7.824 billion MACs, ensuring a balanced computational cost and performance relationship.

Reducing \(r_{conv}\) to 1.25 yields a slightly more efficient model, with 1.429 million parameters and 7.657 billion MACs. Nonetheless, this change incurs a discernible performance degradation, with the PSNR dropping to 39.15. On the other hand, increasing \(r_{conv}\) to 2 lowers the PSNR markedly to 38.65 while raising the cost, a counterproductive trade-off in which a larger expansion leads to worse performance. This outcome demonstrates the drawbacks of an aggressive expansion strategy that fails to yield commensurate improvements in dehazing quality. Based on these observations, we conclude that an expansion ratio of 1.5 within FMBConv offers the best trade-off for WaveDH.

In the pursuit of fine-tuning the Feature Mixing Block (FMB), we investigate the influence of varying MLP expansion ratios (\(r_{mlp}\)) on the network dehazing efficacy and computational efficiency. The MLP expansion ratio controls the channel expansion within the pointwise convolutional layers. This ablation analysis studies how different values of \(r_{mlp}\) affect the PSNR, SSIM, number of parameters, and MACs on the SOTS-Indoor dataset.
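To see how the expansion ratio drives cost, the sketch below counts the weights of a generic two-layer pointwise MLP that expands \(C\) channels to \(r_{mlp} \cdot C\) and projects back. The two-layer, bias-free structure, the channel width of 48, and the \(64 \times 64\) feature map are assumptions for illustration; the exact layout of the block in WaveDH may differ.

```python
def pointwise_mlp_cost(channels, ratio, height, width):
    """Hidden width, weights, and MACs of a 1x1-conv MLP: C -> r*C -> C.

    The two-layer, bias-free layout is an assumed generic design.
    """
    hidden = int(round(channels * ratio))
    params = channels * hidden + hidden * channels  # expand + project
    macs = params * height * width
    return hidden, params, macs

for ratio in (1.25, 1.5, 2.0):
    hidden, params, macs = pointwise_mlp_cost(48, ratio, 64, 64)
    print(f"r_mlp={ratio}: hidden={hidden}, params={params}, MACs={macs / 1e6:.2f}M")
```

Under these assumptions the cost grows linearly with the ratio, so moving from 1.25 to 2 adds 60% to the weights of each such MLP.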

For the baseline setting (\(r_{mlp} = 1.25\)), the model achieves a PSNR of 39.35 and an SSIM of 0.995, demonstrating high fidelity in dehazed images while maintaining structural details. This configuration also has the lowest computational footprint of the three settings, with 1.490 million parameters and 7.824 billion MACs.

Increasing \(r_{mlp}\) to 1.5, the model exhibits a marginal PSNR improvement of 0.01, reaching 39.36, together with a minor decrease in SSIM (-0.001). However, this slight PSNR gain comes at the cost of more parameters (+0.029 million) and MACs (+0.051 billion). Despite these increases, the balance between dehazing quality and computational demand remains acceptable.

A more pronounced expansion with \(r_{mlp}\) set to 2 results in a significant decrease in PSNR, from 39.35 to 38.55, indicating potential over-parameterization of the channel projection module, which may not contribute positively to the dehazing capability. This suggests that a high MLP expansion ratio might introduce redundancy and inefficiencies without a proportional gain in dehazing quality. Our analysis indicates that an \(r_{mlp}\) of 1.25 is the most suitable choice for WaveDH, striking a balance between performance and computational efficiency.

In this section, we provide both quantitative and qualitative comparisons of WaveDH against existing state-of-the-art (SOTA) dehazing methods on a diverse set of synthetic and real-world hazy images. For a fair comparison, we use the pretrained models provided by the authors.

Our quantitative evaluation uses the Synthetic Objective Testing Set (SOTS), including both indoor and outdoor hazy image datasets, as the benchmark, with PSNR and SSIM as performance metrics. As outlined in Table 6, WaveDH not only exhibits a remarkable reduction in parameters and computational complexity but also surpasses SOTA methods in performance metrics. Specifically, on the indoor dataset, WaveDH achieves the best results, recording a PSNR of 39.35 dB and an SSIM of 0.995, and outperforming UDN by 0.73 dB in PSNR (39.35 vs. 38.62 dB in Table 6) while requiring only 35% of UDN’s parameters. On the SOTS-outdoor dataset, our WaveDH-Tiny model, with just 543K parameters, ranks third in PSNR, surpassing many existing methods. This demonstrates the strong capability of WaveDH to handle haze on synthetic datasets.
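Two of the figures quoted above, the parameter ratio relative to UDN and the SOTS-outdoor ranking of WaveDH-Tiny, can be re-derived from the table values with a few lines of arithmetic. The sketch below simply copies the relevant entries of Table 6; methods without a reported SOTS-outdoor PSNR are omitted, and the variable names are illustrative.

```python
# SOTS-outdoor PSNR (dB), copied from the comparison table.
outdoor_psnr = {
    "DCP": 19.13, "DehazeNet": 22.46, "AOD-Net": 20.29, "GFN": 21.55,
    "GDN": 30.86, "MSBDN": 33.48, "FFA-Net": 33.57, "SGID-PFF": 30.20,
    "DehazeFormer-M": 34.29, "MAXIM-2S": 34.19, "UDN": 34.92,
    "WaveDH-Tiny": 34.52, "WaveDH": 34.89,
}
# Parameter counts in millions, copied from the same table.
params_m = {"UDN": 4.25, "WaveDH": 1.490}

ratio = params_m["WaveDH"] / params_m["UDN"]
ranking = sorted(outdoor_psnr, key=outdoor_psnr.get, reverse=True)

print(f"WaveDH / UDN parameters: {ratio:.0%}")
print("SOTS-outdoor top three:", ranking[:3])
```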

For a real-world comparison, we use the I-HAZE dataset [64], which, notably, was not used during the training phase, ensuring an unbiased evaluation. Here, both WaveDH and the WaveDH-Tiny models demonstrate competitive performance, achieving the second-highest PSNR values as shown in Table 7. Most impressively, WaveDH-Tiny leads in SSIM metrics, indicating the strengths of our methods when it comes to restoring clear scenes from real-world hazy observations.

We extend our evaluation to a qualitative perspective, selecting representative samples for a visual comparison of the dehazing performance of WaveDH with other approaches. Figs. 7, 8, and 9 show visual comparisons on the SOTS-indoor, SOTS-outdoor, and I-HAZE datasets, respectively.

On the SOTS-indoor dataset, we analyze how well each network dehazes two distinct scenes. As illustrated in Fig. 7, AOD-Net fails to address the haze, leaving much of it unmitigated. Both FFANet and GridDehazeNet, while attempting to dehaze, introduce severe color distortions that result in outputs that are excessively bright or dim. Although DehazeFormer-M and MAXIM-2S somewhat mitigate the color distortion, it still persists in some areas, and both methods introduce artifacts. In contrast, WaveDH exhibits a commendable restoration of color fidelity across all image regions, preserving finer details and exhibiting minimal artifacts. The qualitative comparison on the SOTS-outdoor dataset, shown in Fig. 8, indicates that most compared methods fail to remove the haze effectively. On the other hand, WaveDH stands out by successfully reconstructing haze-free scenes with preserved textural and color information and the least residual haze.

Moving to real-world scenarios with the I-HAZE dataset, the difficulty increases as the sample distribution deviates significantly from the training data. This shift challenges most methods, causing them to struggle with dehazing, as seen in Fig. 9. Methods including GCANet, GridDehazeNet, FFANet, and D4 are prone to severe color distortion and produce artifacts to varying extents. However, MAXIM-2S and our WaveDH distinguish themselves by generating visually convincing results that are virtually artifact-free, supporting the generalizability of WaveDH. When considering both the visual results and the efficiency of the model, WaveDH emerges as a particularly compelling solution for single image dehazing.

In this article, we proposed WaveDH, a novel ConvNet designed for efficient single-image dehazing. Central to our approach is the strategic use of wavelet sub-bands for guided up-and-downsampling, coupled with a novel frequency-aware feature refinement process. The squeeze-and-attention scheme in the downsampling block and the dual-upsample and fusion mechanism in the upsampling block are especially noteworthy, enhancing high-frequency detail reconstruction while optimizing computational costs. Our feature refinement block refines the intermediate features through a coarse-to-fine strategy, thereby enhancing model efficiency and contributing to a well-calibrated balance between accuracy and computational cost. Comprehensive experiments validate the superior performance of WaveDH over existing state-of-the-art methods, achieving high-quality dehazing with reduced computational demands. However, our method has limitations in real-world hazy scenes, particularly in recovering dense haze regions. Therefore, in future research, we plan to address this issue and extend our approach to other low-level vision tasks.

[1]

V. A. Sindagi, P. Oza, R. Yasarla, and V. M. Patel, “Prior-based domain adaptive object detection for hazy and rainy conditions,” in *Computer Vision–ECCV 2020: 16th
European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XIV 16*. Springer, 2020, pp. 763–780.

[2]

W.-T. Chen, I.-H. Chen, C.-Y. Yeh, H.-H. Yang, J.-J. Ding, and S.-Y. Kuo, “Sjdl-vehicle: Semi-supervised joint defogging learning for foggy vehicle re-identification,” in
*Proceedings of the AAAI Conference on Artificial Intelligence*, vol. 36, no. 1, 2022, pp. 347–355.

[3]

C. Sakaridis, D. Dai, and L. Van Gool, “Semantic foggy scene understanding with synthetic data,” *International Journal of Computer Vision*, vol. 126, pp. 973–992,
2018.

[4]

E. J. McCartney, “Optics of the atmosphere: scattering by molecules and particles,” *New York*, 1976.

[5]

B. Li, W. Ren, D. Fu, D. Tao, D. Feng, W. Zeng, and Z. Wang, “Benchmarking single-image dehazing and beyond,” *IEEE Transactions on Image Processing*, vol. 28, no. 1,
pp. 492–505, 2018.

[6]

B. Cai, X. Xu, K. Jia, C. Qing, and D. Tao, “Dehazenet: An end-to-end system for single image haze removal,” *IEEE transactions on image processing*, vol. 25, no. 11,
pp. 5187–5198, 2016.

[7]

W. Ren, S. Liu, H. Zhang, J. Pan, X. Cao, and M.-H. Yang, “Single image dehazing via multi-scale convolutional neural networks,” in *Computer Vision–ECCV 2016: 14th
European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part II 14*. Springer, 2016, pp. 154–169.

[8]

H. Zhang and V. M. Patel, “Densely connected pyramid dehazing network,” in *Proceedings of the IEEE conference on computer vision and pattern recognition*, 2018, pp.
3194–3203.

[9]

X. Yang, H. Li, Y.-L. Fan, and R. Chen, “Single image haze removal via region detection network,” *IEEE Transactions on Multimedia*, vol. 21, no. 10, pp. 2545–2560,
2019.

[10]

S. Zhao, L. Zhang, Y. Shen, and Y. Zhou, “Refinednet: A weakly supervised refinement framework for single image dehazing,” *IEEE Transactions on Image Processing*,
vol. 30, pp. 3391–3404, 2021.

[11]

X. Liu, Y. Ma, Z. Shi, and J. Chen, “Griddehazenet: Attention-based multi-scale network for image dehazing,” in *Proceedings of the IEEE/CVF international conference on
computer vision*, 2019, pp. 7314–7323.

[12]

Y. Qu, Y. Chen, J. Huang, and Y. Xie, “Enhanced pix2pix dehazing network,” in *Proceedings of the IEEE/CVF conference on computer vision and pattern recognition*,
2019, pp. 8160–8168.

[13]

D. Chen, M. He, Q. Fan, J. Liao, L. Zhang, D. Hou, L. Yuan, and G. Hua, “Gated context aggregation network for image dehazing and deraining,” in *2019 IEEE winter
conference on applications of computer vision (WACV)*. IEEE, 2019, pp. 1375–1383.

[14]

Q. Deng, Z. Huang, C.-C. Tsai, and C.-W. Lin, “Hardgan: A haze-aware representation distillation gan for single image dehazing,” in *European conference on computer
vision*. Springer, 2020, pp. 722–738.

[15]

H. Dong, J. Pan, L. Xiang, Z. Hu, X. Zhang, F. Wang, and M.-H. Yang, “Multi-scale boosted dehazing network with dense feature fusion,” in *Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition*, 2020, pp. 2157–2167.

[16]

X. Qin, Z. Wang, Y. Bai, X. Xie, and H. Jia, “Ffa-net: Feature fusion attention network for single image dehazing,” in *Proceedings of the AAAI conference on artificial
intelligence*, vol. 34, no. 07, 2020, pp. 11 908–11 915.

[17]

H. Wu, Y. Qu, S. Lin, J. Zhou, R. Qiao, Z. Zhang, Y. Xie, and L. Ma, “Contrastive learning for compact single image dehazing,” in *Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition*, 2021, pp. 10 551–10 560.

[18]

C. Lin, X. Rong, and X. Yu, “Msaff-net: Multiscale attention feature fusion networks for single image dehazing and beyond,” *IEEE transactions on multimedia*,
2022.

[19]

A. Dosovitskiy, L. Beyer, A. Kolesnikov, D. Weissenborn, X. Zhai, T. Unterthiner, M. Dehghani, M. Minderer, G. Heigold, S. Gelly *et al.*, “An image is worth 16x16 words:
Transformers for image recognition at scale,” *arXiv preprint arXiv:2010.11929*, 2020.

[20]

Y. Song, Z. He, H. Qian, and X. Du, “Vision transformers for single image dehazing,” *IEEE Transactions on Image Processing*, vol. 32, pp. 1927–1941, 2023.

[21]

C.-L. Guo, Q. Yan, S. Anwar, R. Cong, W. Ren, and C. Li, “Image dehazing transformer with transmission-aware 3d position embedding,” in *Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition*, 2022, pp. 5812–5820.

[22]

Y. Qiu, K. Zhang, C. Wang, W. Luo, H. Li, and Z. Jin, “Mb-taylorformer: Multi-branch efficient transformer expanded by taylor formula for image dehazing,” in *Proceedings
of the IEEE/CVF International Conference on Computer Vision*, 2023, pp. 12 802–12 813.

[23]

J. Zhang and D. Tao, “Famed-net: A fast and accurate multi-scale end-to-end dehazing network,” *IEEE Transactions on Image Processing*, vol. 29, pp. 72–84,
2019.

[24]

H. Ullah, K. Muhammad, M. Irfan, S. Anwar, M. Sajjad, A. S. Imran, and V. H. C. de Albuquerque, “Light-dehazenet: a novel lightweight cnn architecture for single image
dehazing,” *IEEE transactions on image processing*, vol. 30, pp. 8968–8982, 2021.

[25]

A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam, “Mobilenets: Efficient convolutional neural networks for mobile vision
applications,” *arXiv preprint arXiv:1704.04861*, 2017.

[26]

M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen, “Mobilenetv2: Inverted residuals and linear bottlenecks,” in *Proceedings of the IEEE conference on computer
vision and pattern recognition*, 2018, pp. 4510–4520.

[27]

N. Ma, X. Zhang, H.-T. Zheng, and J. Sun, “Shufflenet v2: Practical guidelines for efficient cnn architecture design,” in *Proceedings of the European conference on
computer vision (ECCV)*, 2018, pp. 116–131.

[28]

M. Tan and Q. Le, “Efficientnetv2: Smaller models and faster training,” in *International conference on machine learning*. PMLR, 2021, pp.
10 096–10 106.

[29]

A. Howard, M. Sandler, G. Chu, L.-C. Chen, B. Chen, M. Tan, W. Wang, Y. Zhu, R. Pang, V. Vasudevan *et al.*, “Searching for mobilenetv3,” in *Proceedings of the
IEEE/CVF international conference on computer vision*, 2019, pp. 1314–1324.

[30]

K. Han, Y. Wang, Q. Tian, J. Guo, C. Xu, and C. Xu, “Ghostnet: More features from cheap operations,” in *Proceedings of the IEEE/CVF conference on computer vision and
pattern recognition*, 2020, pp. 1580–1589.

[31]

R. Zhang, “Making convolutional networks shift-invariant again,” in *International conference on machine learning*. PMLR, 2019, pp.
7324–7334.

[32]

K. He, J. Sun, and X. Tang, “Single image haze removal using dark channel prior,” *IEEE transactions on pattern analysis and machine intelligence*, vol. 33, no. 12,
pp. 2341–2353, 2010.

[33]

Q. Zhu, J. Mai, and L. Shao, “A fast single image haze removal algorithm using color attenuation prior,” *IEEE transactions on image processing*, vol. 24, no. 11, pp.
3522–3533, 2015.

[34]

D. Berman, S. Avidan *et al.*, “Non-local image dehazing,” in *Proceedings of the IEEE conference on computer vision and pattern recognition*, 2016, pp.
1674–1682.

[35]

B. Li, X. Peng, Z. Wang, J. Xu, and D. Feng, “Aod-net: All-in-one dehazing network,” in *Proceedings of the IEEE international conference on computer vision*, 2017,
pp. 4770–4778.

[36]

S. Mallat and W. L. Hwang, “Singularity detection and processing with wavelets,” *IEEE transactions on information theory*, vol. 38, no. 2, pp. 617–643, 1992.

[37]

A. Aballe, M. Bethencourt, F. Botana, and M. Marcos, “Using wavelets transform in the analysis of electrochemical noise data,” *Electrochimica Acta*, vol. 44, no. 26,
pp. 4805–4816, 1999.

[38]

Q. Li, L. Shen, S. Guo, and Z. Lai, “Wavelet integrated cnns for noise-robust image classification,” in *Proceedings of the IEEE/CVF Conference on Computer Vision and
Pattern Recognition*, 2020, pp. 7245–7254.

[39]

T. Yao, Y. Pan, Y. Li, C.-W. Ngo, and T. Mei, “Wave-vit: Unifying wavelet and transformers for visual representation learning,” in *European Conference on Computer
Vision*. Springer, 2022, pp. 328–345.

[40]

J. Yoo, Y. Uh, S. Chun, B. Kang, and J.-W. Ha, “Photorealistic style transfer via wavelet transforms,” in *Proceedings of the IEEE/CVF International Conference on
Computer Vision*, 2019, pp. 9036–9045.

[41]

H. Huang, R. He, Z. Sun, and T. Tan, “Wavelet domain generative adversarial network for multi-scale face hallucination,” *International Journal of Computer Vision*,
vol. 127, no. 6-7, pp. 763–784, 2019.

[42]

M. Yang, Z. Wang, Z. Chi, and W. Feng, “Wavegan: Frequency-aware gan for high-fidelity few-shot image generation,” in *European Conference on Computer Vision*.
Springer, 2022, pp. 1–17.

[43]

B. Zhang, S. Gu, B. Zhang, J. Bao, D. Chen, F. Wen, Y. Wang, and B. Guo, “Styleswin: Transformer-based gan for high-resolution image generation,” in *Proceedings of the
IEEE/CVF conference on computer vision and pattern recognition*, 2022, pp. 11 304–11 314.

[44]

W. Bae, J. Yoo, and J. Chul Ye, “Beyond deep residual learning for image restoration: Persistent homology-guided manifold simplification,” in *Proceedings of the IEEE
conference on computer vision and pattern recognition workshops*, 2017, pp. 145–153.

[45]

P. Liu, H. Zhang, K. Zhang, L. Lin, and W. Zuo, “Multi-level wavelet-cnn for image restoration,” in *Proceedings of the IEEE conference on computer vision and pattern
recognition workshops*, 2018, pp. 773–782.

[46]

T. Guo, H. Seyed Mousavi, T. Huu Vu, and V. Monga, “Deep wavelet prediction for image super-resolution,” in *Proceedings of the IEEE conference on computer vision and
pattern recognition workshops*, 2017, pp. 104–113.

[47]

W.-Y. Hsu and P.-W. Jian, “Detail-enhanced wavelet residual network for single image super-resolution,” *IEEE Transactions on Instrumentation and Measurement*,
vol. 71, pp. 1–13, 2022.

[48]

L. Shen, Z. Yue, Q. Chen, F. Feng, and J. Ma, “Deep joint rain and haze removal from a single image,” in *2018 24th International Conference on Pattern Recognition
(ICPR)*. IEEE, 2018, pp. 2821–2826.

[49]

H. Huang, A. Yu, Z. Chai, R. He, and T. Tan, “Selective wavelet attention learning for single image deraining,” *International Journal of Computer Vision*, vol. 129,
pp. 1282–1300, 2021.

[50]

H. Khan, M. Sharif, N. Bibi, M. Usman, S. A. Haider, S. Zainab, J. H. Shah, Y. Bashir, and N. Muhammad, “Localization of radiance transformation for image dehazing in wavelet
domain,” *Neurocomputing*, vol. 381, pp. 141–151, 2020.

[51]

W.-Y. Hsu and Y.-S. Chen, “Single image dehazing using wavelet-based haze-lines and denoising,” *IEEE Access*, vol. 9, pp. 104 547–104 559, 2021.

[52]

O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in *Medical Image Computing and Computer-Assisted
Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18*. Springer, 2015, pp. 234–241.

[53]

W. Shi, J. Caballero, F. Huszár, J. Totz, A. P. Aitken, R. Bishop, D. Rueckert, and Z. Wang, “Real-time single image and video super-resolution using an efficient
sub-pixel convolutional neural network,” in *Proceedings of the IEEE conference on computer vision and pattern recognition*, 2016, pp. 1874–1883.

[54]

V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep learning,” *arXiv preprint arXiv:1603.07285*, 2016.

[55]

Q. Li, L. Shen, S. Guo, and Z. Lai, “Wavecnet: Wavelet integrated cnns to suppress aliasing effect for noise-robust image classification,” *IEEE Transactions on Image
Processing*, vol. 30, pp. 7074–7089, 2021.

[56]

T. Williams and R. Li, “Wavelet pooling for convolutional neural networks,” in *International conference on learning representations*, 2018.

[57]

Z. Hui, X. Gao, Y. Yang, and X. Wang, “Lightweight image super-resolution with information multi-distillation network,” in *Proceedings of the 27th acm international
conference on multimedia*, 2019, pp. 2024–2032.

[58]

L. Sun, J. Pan, and J. Tang, “Shufflemixer: An efficient convnet for image super-resolution,” *Advances in Neural Information Processing Systems*, vol. 35, pp.
17 314–17 326, 2022.

[59]

Z. Li, Y. Liu, X. Chen, H. Cai, J. Gu, Y. Qiao, and C. Dong, “Blueprint separable residual network for efficient image super-resolution,” in *Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition*, 2022, pp. 833–843.

[60]

Z. Hui, X. Wang, and X. Gao, “Fast and accurate single image super-resolution via information distillation network,” in *Proceedings of the IEEE conference on computer
vision and pattern recognition*, 2018, pp. 723–731.

[61]

Z. Hui, X. Gao, Y. Yang, and X. Wang, “Lightweight image super-resolution with information multi-distillation network,” in *Proceedings of the 27th acm international
conference on multimedia*, 2019, pp. 2024–2032.

[62]

J. Liu, J. Tang, and G. Wu, “Residual feature distillation network for lightweight image super-resolution,” in *Computer Vision–ECCV 2020 Workshops: Glasgow, UK, August
23–28, 2020, Proceedings, Part III 16*. Springer, 2020, pp. 41–55.

[63]

C. Xie, X. Zhang, L. Li, H. Meng, T. Zhang, T. Li, and X. Zhao, “Large kernel distillation network for efficient single image super-resolution,” in *Proceedings of the
IEEE/CVF Conference on Computer Vision and Pattern Recognition*, 2023, pp. 1283–1292.

[64]

C. Ancuti, C. O. Ancuti, R. Timofte, and C. De Vleeschouwer, “I-haze: A dehazing benchmark with real hazy and haze-free indoor images,” in *Advanced Concepts for
Intelligent Vision Systems: 19th International Conference, ACIVS 2018, Poitiers, France, September 24–27, 2018, Proceedings 19*. Springer, 2018, pp. 620–631.

[65]

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” *arXiv preprint arXiv:1711.05101*, 2017.

[66]

——, “Sgdr: Stochastic gradient descent with warm restarts,” *arXiv preprint arXiv:1608.03983*, 2016.

[67]

W. Ren, L. Ma, J. Zhang, J. Pan, X. Cao, W. Liu, and M.-H. Yang, “Gated fusion network for single image dehazing,” in *Proceedings of the IEEE conference on computer
vision and pattern recognition*, 2018, pp. 3253–3261.

[68]

J. Dong and J. Pan, “Physics-based feature dehazing networks,” in *Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part
XXX 16*. Springer, 2020, pp. 188–204.

[69]

H. Bai, J. Pan, X. Xiang, and J. Tang, “Self-guided image dehazing using progressive feature fusion,” *IEEE Transactions on Image Processing*, vol. 31, pp. 1217–1229,
2022.

[70]

Z. Tu, H. Talebi, H. Zhang, F. Yang, P. Milanfar, A. Bovik, and Y. Li, “Maxim: Multi-axis mlp for image processing,” in *Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition*, 2022, pp. 5769–5780.

[71]

M. Hong, J. Liu, C. Li, and Y. Qu, “Uncertainty-driven dehazing network,” in *Proceedings of the AAAI Conference on Artificial Intelligence*, vol. 36, no. 1, 2022,
pp. 906–913.

[72]

H. Li, J. Li, D. Zhao, and L. Xu, “Dehazeflow: Multi-scale conditional flow network for single image dehazing,” in *Proceedings of the 29th ACM International Conference
on Multimedia*, 2021, pp. 2577–2585.

[73]

Y. Yang, C. Wang, R. Liu, L. Zhang, X. Guo, and D. Tao, “Self-augmented unpaired image dehazing via density and depth decomposition,” in *Proceedings of the IEEE/CVF
conference on computer vision and pattern recognition*, 2022, pp. 2037–2046.

S. Hwang is with the Artificial Intelligence Graduate School, Gwangju Institute of Science and Technology (GIST), Gwangju 61005, Republic of Korea (e-mail: sm.hwang@gm.gist.ac.kr).

C. Jung is with the School of Electronic Engineering, Xidian University, Xi’an 710071, China (e-mail: zhengzk@xidian.edu.cn).

D. Han and M. Jeon are with the School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology (GIST), Gwangju 61005, Republic of Korea (e-mail: xesta120@gist.ac.kr; mgjeon@gist.ac.kr).

(Corresponding author: Moongu Jeon).