October 14, 2025
Recently, a complex variational autoencoder (VAE)-based single-channel speech enhancement system based on the DCCRN architecture, referred to as DCCRN-VAE, has been proposed. In this system, a noise suppression VAE (NSVAE) learns to extract clean speech representations from noisy speech using pretrained clean speech and noise VAEs with skip connections. In this paper, we improve the DCCRN-VAE by introducing three key modifications: 1) removing the skip connections in the pretrained VAEs to encourage more informative speech and noise latent representations; 2) using \(\beta\)-VAE in pretraining to better balance reconstruction and latent space regularization; and 3) extending the NSVAE to generate both speech and noise latent representations. Experiments show that the proposed system achieves comparable performance to the DCCRN and DCCRN-VAE baselines on the matched DNS3 dataset but outperforms the baselines on mismatched datasets (WSJ0-QUT, VoiceBank-DEMAND), demonstrating improved generalization ability. In addition, an ablation study shows that a similar performance can be achieved with classical fine-tuning instead of adversarial training, resulting in a simpler training pipeline.
Variational Autoencoder, Single-channel speech enhancement, Latent representations
Recently, several generative models, e.g., based on variational autoencoders (VAEs) [1]–[8], generative adversarial networks [9]–[13], and diffusion models [14]–[17], have been proposed for speech enhancement. VAEs consist of an encoder-decoder architecture, where the encoder maps the input data into latent representations, constrained by a latent regularization loss, and the decoder aims at reconstructing the data from these representations [18]. Since the VAE framework enables efficient posterior inference and reliable reconstruction, several VAE-based approaches have been proposed for single-channel speech enhancement. For example, the Bayesian permutation training (PVAE) system [3], [4] uses a noise suppression VAE (NSVAE) to learn the latent representations of two pretrained VAEs, one for clean speech (CVAE) and one for noise (NVAE). Several improvements have been proposed for this system. In [8], we showed that removing the latent regularization loss for the pretrained VAEs improves performance and generalization. In addition, in [5] it was proposed to use adversarial training to fine-tune the CVAE and NVAE decoders. Since the PVAE only estimates the clean speech magnitude in the short-time Fourier transform (STFT) domain and combines it with the noisy phase, the PVAE was extended in [7] to the complex domain based on the DCCRN architecture, leading to the DCCRN-VAE. Contrary to the PVAE, the DCCRN-VAE employs skip connections for the pretrained VAEs. Its NSVAE encoder only generates speech latent representations, and only the CVAE decoder is fine-tuned, connected to the NSVAE encoder via skip connections, using adversarial training.
In this paper, we propose an improved complex VAE-based model, the I-DCCRN-VAE (see Fig. 1), by introducing three key modifications to the DCCRN-VAE: 1) we remove the skip connections in the pretrained VAEs, as they can dominate the reconstruction process and make the latent representations less informative; 2) inspired by our previous work [8], we use \(\beta\)-VAE in pretraining to better balance reconstruction and latent space regularization; and 3) similarly to the PVAE system, the NSVAE encoder generates both speech and noise latent representations. In our experiments, we train the proposed I-DCCRN-VAE and the baseline DCCRN-VAE on the DNS3 challenge dataset. For both systems, the CVAE decoder is fine-tuned using either classical fine-tuning or adversarial training. The results show that the proposed I-DCCRN-VAE achieves comparable speech enhancement performance to the baseline DCCRN-VAE (and DCCRN) on the matched dataset, but outperforms the baselines on two mismatched datasets (WSJ0-QUT, VoiceBank-DEMAND). Notably, the I-DCCRN-VAE achieves this improved generalization without adversarial training, which is a significant advantage, as it simplifies training by avoiding the convergence and sensitivity issues common to adversarial methods.
After introducing the signal model, in this section, we describe the proposed I-DCCRN-VAE system, highlighting the differences with the baseline DCCRN-VAE [7] in terms of pretrained VAEs, noise suppression VAE, and decoder fine-tuning.
In the STFT domain, the observed noisy speech vector \(\mathbf{Y}_n \in \mathbb{C}^F\) at time frame \(n \in [1, N]\), where \(N\) and \(F\) denote the number of time frames and frequency bins, is given by \[\mathbf{Y}_n = \mathbf{X}_n + \mathbf{V}_n, \] where \(\mathbf{X}_n \in \mathbb{C}^F\) and \(\mathbf{V}_n \in \mathbb{C}^F\) denote the clean speech and noise vectors. In the following, the time frame index \(n\) is omitted for simplicity, except when it is required explicitly.
We assume \(\mathbf{X}\) and \(\mathbf{V}\) to be generated from random processes involving latent speech and noise representations \(\mathbf{z}_x \in \mathbb{C}^L\) and \(\mathbf{z}_v \in \mathbb{C}^L\), where \(L\) denotes the latent dimension. These random processes are described by the likelihoods \(p_{\scriptstyle\theta_x}(\mathbf{X}|\mathbf{z}_x)\) and \(p_{\scriptstyle \theta_v}(\mathbf{V}|\mathbf{z}_v)\). The representations \(\mathbf{z}_x\) and \(\mathbf{z}_v\) can be sampled from the posterior distributions \(q_{\scriptstyle \phi_x}(\mathbf{z}_x|\mathbf{X})\) and \(q_{\scriptstyle \phi_v}(\mathbf{z}_v|\mathbf{V})\). Assuming that the above-mentioned distributions are estimated by VAEs, \(\phi_x\) and \(\theta_x\) denote the encoder and decoder parameters of the clean speech VAE (CVAE), while \(\phi_v\) and \(\theta_v\) denote the encoder and decoder parameters of the noise VAE (NVAE). The prior distributions for the latent representations \(\mathbf{z}_x\) and \(\mathbf{z}_v\) are denoted as \(p(\mathbf{z}_x)\) and \(p(\mathbf{z}_v)\). Assuming \(\mathbf{z}_x\) and \(\mathbf{z}_v\) to be independent, they can also be sampled from the noisy posterior distribution \(q_{\scriptstyle \phi_y}(\mathbf{z}_x, \mathbf{z}_v|\mathbf{Y})=q_{\scriptstyle \phi_y}(\mathbf{z}_x|\mathbf{Y})q_{\scriptstyle \phi_y}(\mathbf{z}_v|\mathbf{Y})\), where \(\phi_y\) denotes the encoder parameters of a noise suppression VAE (NSVAE). In the following, the encoder and decoder parameters are omitted for simplicity.
Similar to the DCCRN-VAE system [7], the proposed I-DCCRN-VAE system in Fig. 1 consists of two pretrained VAEs, a clean speech VAE (CVAE) and a noise VAE (NVAE), and a noise suppression VAE (NSVAE). The training process consists of three steps: 1) pretraining the CVAE and NVAE using clean speech \(\mathbf{X}\) and noise \(\mathbf{V}\), 2) training the NSVAE encoder to extract speech and noise representations \(\mathbf{z}_{x}\) and \(\mathbf{z}_{v}\) from noisy speech \(\mathbf{Y}\), and 3) fine-tuning the CVAE decoder for better speech enhancement. The I-DCCRN-VAE differs from the DCCRN-VAE in the pretraining and NSVAE training steps.
1) Pretrained VAEs: Unlike the DCCRN-VAE, which uses skip connections for the CVAE and the NVAE, we do not use skip connections for the pretrained VAEs in the I-DCCRN-VAE. This forces all information to pass through the latent bottleneck, encouraging the encoders to learn more informative speech and noise latent representations instead of relying on skip connections for reconstruction. In addition, we use \(\beta\)-VAE [19] to control the balance between reconstruction and latent space regularization. The pretraining loss for the CVAE is given by: \[\begin{align} & -\mathbb{E}_{q(\mathbf{z}_x|\mathbf{X})}[\log p(\mathbf{X}\vert \mathbf{z}_{x})]\;+ \beta \textrm{KL}(q(\mathbf{z}_{x}\vert \mathbf{X})\Vert p(\mathbf{z}_{x})), \label{cvaeloss} \end{align}\tag{1}\] where \(\mathbb{E}\) denotes expectation, \(\textrm{KL}(\cdot\Vert\cdot)\) denotes the Kullback–Leibler (KL) divergence and \(\beta\) denotes the KL weight factor. As in [7], the posterior distribution is assumed to be a complex multivariate Gaussian distribution with a diagonal covariance matrix and relation matrix, i.e. \[\begin{align} & {q({\boldsymbol{z}}_x|{\boldsymbol{X}})} = \mathcal{N} \left({{\boldsymbol{\mu}}_{x},\operatorname{diag}(\boldsymbol{\sigma}_{x}), \operatorname{diag}(\boldsymbol{\delta}_{x})}\right), \label{cvae95post95distri} \end{align}\tag{2}\] where the mean, variance and relation vectors, \({\boldsymbol{\mu}}_{x} \in \mathbb{C}^L\), \({\boldsymbol{\sigma}}_{{x}} \in \mathbb{R}_+^L\) and \(\boldsymbol{\delta}_{{x}} \in \mathbb{C}^L\), are the outputs of the CVAE encoder. The prior distribution \(p(\mathbf{z}_{x})\) is assumed to be a complex multivariate standard Gaussian distribution, \(p(\mathbf{z}_{x})=\mathcal{N} ({{\boldsymbol{0}},{\boldsymbol{I}}, {\boldsymbol{0}}})\), where \(\boldsymbol{I}\) denotes the identity matrix. To allow for backpropagation, the reparameterization trick [20] is used to sample \(\mathbf{z}_x\) from \(q(\mathbf{z}_x|\mathbf{X})\). To improve reconstruction, the first term in (1) is replaced by a combined loss on the complex and magnitude spectrograms, i.e. \[\begin{align} &\frac{1}{N}\sum_{n=1}^N \left(\Vert\mathbf{X}_n-\hat{\mathbf{X}}_n\Vert_2^2+\Vert\vert\mathbf{X}_n\vert-\vert\hat{\mathbf{X}}_n\vert\Vert_2^2\right), \label{cvae95recon} \end{align}\tag{3}\] where \(\hat{\mathbf{X}}_n\) denotes the estimated clean speech STFT vector and \(\vert \cdot \vert\) denotes the element-wise magnitude of a vector. The NVAE uses a similar loss and similar distributional assumptions as the CVAE and is therefore not described in detail here.
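For illustration, the following minimal PyTorch sketch implements the pretraining objective (1) with the combined reconstruction loss (3). It is simplified to a circular complex Gaussian posterior, i.e. the relation vector \(\boldsymbol{\delta}_{x}\) is ignored, and `cvae_encoder`/`cvae_decoder` are placeholder modules, not the actual implementation.

```python
import torch

def cvae_pretrain_loss(X, cvae_encoder, cvae_decoder, beta=0.01):
    """Sketch of the beta-VAE pretraining loss (1) with the combined
    complex + magnitude reconstruction loss (3).

    X: clean speech STFT, complex tensor of shape (batch, F, N).
    Simplification: circular complex Gaussian posterior (relation vector
    delta ignored); cvae_encoder and cvae_decoder are placeholders.
    """
    mu, sigma = cvae_encoder(X)                      # (batch, L, N), sigma > 0

    # Reparameterization trick for a circular complex Gaussian:
    # z = mu + sqrt(sigma / 2) * (eps_r + 1j * eps_i), eps ~ N(0, 1)
    eps = torch.randn_like(mu.real) + 1j * torch.randn_like(mu.real)
    z = mu + torch.sqrt(sigma / 2) * eps

    X_hat = cvae_decoder(z)                          # reconstructed complex STFT

    # Combined loss on complex and magnitude spectrograms, Eq. (3)
    recon = ((X - X_hat).abs() ** 2
             + (X.abs() - X_hat.abs()) ** 2).sum(dim=1).mean()

    # Closed-form KL between CN(mu, sigma) and the standard prior CN(0, 1)
    kl = (mu.abs() ** 2 + sigma - torch.log(sigma) - 1).sum(dim=1).mean()

    return recon + beta * kl
```

The NVAE pretraining loss has the same form, with the noise STFT in place of the clean speech STFT.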
2) Noise suppression VAE (Fig. 1 (a)): In contrast to the NSVAE encoder in the DCCRN-VAE, which generates only speech representations \(\mathbf{z}_{x}\) from noisy speech \(\mathbf{Y}\) and applies a residual loss to align intermediate features between the NSVAE encoder and the pretrained CVAE encoder, the NSVAE encoder in the I-DCCRN-VAE generates both speech and noise representations \(\mathbf{z}_{x}\) and \(\mathbf{z}_{v}\) without using a residual loss in training. This follows the probabilistic generative modeling derived in [3], which provides a more complete generative basis. Aiming at making the posterior distributions \(q(\mathbf{z}_x|\mathbf{Y})\) and \(q(\mathbf{z}_v|\mathbf{Y})\) from the NSVAE encoder similar to the posterior distributions \(q(\mathbf{z}_x|\mathbf{X})\) and \(q(\mathbf{z}_v|\mathbf{V})\) from the pretrained VAEs, the NSVAE is trained by minimizing the loss \[\begin{align} \textrm{KL}\left({q({\boldsymbol{z}}_x|{\boldsymbol{Y}})}||{q({\boldsymbol{z}}_x|{\boldsymbol{X}})}\right) + \alpha\textrm{KL}\left({q({\boldsymbol{z}}_v|{\boldsymbol{Y}})}||{q({\boldsymbol{z}}_v|{\boldsymbol{V}})}\right), \label{nsvaeloss} \end{align}\tag{4}\] where \(\alpha\) denotes the noise latent weight factor. It should be noted that when \(\alpha=0\), the NSVAE is trained to generate only speech representations \(\mathbf{z}_{x}\). Similar to (2 ), the posterior distributions estimated from the NSVAE encoder are assumed to follow a complex multivariate Gaussian distribution, i.e. \[\begin{align} & {q({\boldsymbol{z}}_x|{\boldsymbol{Y}})} = \mathcal{N} \left({{\boldsymbol{\mu}}_{{yx}},\operatorname{diag}({\boldsymbol{\sigma}}_{{yx}}), \operatorname{diag}({\boldsymbol{\delta}}_{{yx}})}\right), \\ & {q({\boldsymbol{z}}_v|{\boldsymbol{Y}})} = \mathcal{N}\left({{\boldsymbol{\mu}}_{{yv}},\operatorname{diag}({\boldsymbol{\sigma}}_{{yv}}), \operatorname{diag}({\boldsymbol{\delta}}_{{yv}})}\right),\label{nsvaeposterior} \end{align}\tag{5}\] where the mean vectors \({\boldsymbol{\mu}}_{{yx}}\) and \({\boldsymbol{\mu}}_{{yv}}\), the variance vectors \({\boldsymbol{\sigma}}_{{yx}}\) and \({\boldsymbol{\sigma}}_{{yv}}\), and the relation vectors \({\boldsymbol{\delta}}_{{yx}}\) and \({\boldsymbol{\delta}}_{{yv}}\) are the outputs of the NSVAE encoder.
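The NSVAE training loss (4) can be sketched in the same simplified setting (circular complex Gaussians, relation vectors ignored). The encoder names below are placeholders; the pretrained encoders are kept frozen.

```python
import torch

def nsvae_loss(Y, X, V, nsvae_encoder, cvae_encoder, nvae_encoder, alpha=1.0):
    """Sketch of the NSVAE loss (4): match the noisy posteriors q(z_x|Y), q(z_v|Y)
    to the pretrained posteriors q(z_x|X), q(z_v|V).
    Simplification: circular complex Gaussians (relation vectors ignored);
    all encoder names are placeholders."""
    mu_yx, sig_yx, mu_yv, sig_yv = nsvae_encoder(Y)
    with torch.no_grad():                    # pretrained CVAE/NVAE encoders are frozen
        mu_x, sig_x = cvae_encoder(X)
        mu_v, sig_v = nvae_encoder(V)

    def kl_complex_gauss(m1, s1, m2, s2):
        # KL( CN(m1, s1) || CN(m2, s2) ), summed over the latent dimension
        return (torch.log(s2 / s1)
                + (s1 + (m1 - m2).abs() ** 2) / s2 - 1).sum(dim=1).mean()

    return (kl_complex_gauss(mu_yx, sig_yx, mu_x, sig_x)
            + alpha * kl_complex_gauss(mu_yv, sig_yv, mu_v, sig_v))
```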
3) CVAE decoder fine-tuning (Fig. 1 (b)): As estimation errors in the posterior distribution \(q({\boldsymbol{z}}_x|{\boldsymbol{Y}})\) degrade the speech enhancement performance, it was proposed in [7] to fine-tune the CVAE decoder while keeping the NSVAE encoder frozen. Skip connections were added in fine-tuning to provide the CVAE decoder with detailed encoder features and to combat the vanishing gradient problem. As in [7], the CVAE decoder in the I-DCCRN-VAE is fine-tuned to generate the complex mask \(\mathbf{M}\), which is used to estimate the clean speech \(\hat{\mathbf{X}}\) as \[\begin{align} & \hat{\mathbf{X}}=\mathbf{Y}\cdot \mathbf{M}, \label{estspeech} \end{align}\tag{6}\] where the multiplication is performed element-wise. Fine-tuning is performed using the Scale-Invariant Signal-to-Distortion Ratio (SI-SDR) loss between the estimated speech \(\hat{\mathbf{x}}\) and the clean speech \(\mathbf{x}\) in the time domain, obtained by inverse STFT and overlap-add, i.e. \[\mathcal{L}_{\text{SI-SDR}} = -10 \log_{10} \left( \frac{\left\| \mathbf{x}_d \right\|_2^2}{\left\| \mathbf{x}_d - \hat{\mathbf{x}} \right\|_2^2} \right), \mathbf{x}_d=\frac{\langle \hat{\mathbf{x}}, \mathbf{x} \rangle}{\|\mathbf{x}\|_2^2}\mathbf{x}. \label{sisdrloss}\tag{7}\] For the fine-tuning step, various training schemes can be applied. Besides classical fine-tuning, which only minimizes the SI-SDR loss in (7), adversarial training has been used in [7], involving a discriminator network. The discriminator learns to distinguish between estimated clean speech and true clean speech, thereby encouraging the model to produce more realistic results.
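A minimal sketch of the SI-SDR fine-tuning loss (7) for batched time-domain signals is given below; the masking in (6) and the inverse STFT are assumed to happen upstream.

```python
import torch

def si_sdr_loss(x_hat, x, eps=1e-8):
    """Negative SI-SDR, Eq. (7), for time-domain signals of shape (batch, samples).
    x_hat: estimated speech (after masking and inverse STFT), x: clean speech."""
    # Projection of the estimate onto the clean speech: x_d = <x_hat, x> / ||x||^2 * x
    x_d = (torch.sum(x_hat * x, dim=-1, keepdim=True)
           / (torch.sum(x ** 2, dim=-1, keepdim=True) + eps)) * x
    ratio = torch.sum(x_d ** 2, dim=-1) / (torch.sum((x_d - x_hat) ** 2, dim=-1) + eps)
    return -10 * torch.log10(ratio + eps).mean()
```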
This section first presents the experimental setup, including the training and evaluation datasets, the network structure, and the training procedure. Then, the experimental results are presented and discussed, evaluating key differences between the proposed I-DCCRN-VAE and the baseline DCCRN-VAE.
To train all considered VAE-based speech enhancement systems, we used anechoic clean speech and noise from the DNS3 dataset [21], sampled at 16 kHz. It should be noted that for clean speech, we only considered the read speech (leaving out emotional speech), while for noise, we did not consider the DEMAND dataset, since it was used for evaluation. We randomly split 50% of the speakers for CVAE pretraining, 40% of the speakers for NSVAE training and fine-tuning, and 10% of the speakers for validation. The noise data was split similarly. For NSVAE training and fine-tuning, noisy speech was generated by the DNS script at signal-to-noise ratios (SNRs) between -10 dB and 15 dB. In total, we generated 30 hours of data for pretraining, 20 hours of data for NSVAE training and CVAE decoder fine-tuning, and 10 hours of data for validation.
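For illustration, mixing clean speech and noise at a target SNR can be sketched as follows; this is a simplified stand-in for the official DNS scripts used here.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Illustrative mixing of clean speech and noise at a target SNR (dB);
    the actual noisy data was generated with the official DNS scripts."""
    noise = noise[: len(speech)]
    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so that 10*log10(speech_power / (scale^2 * noise_power)) = snr_db
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return speech + scale * noise
```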
To evaluate the speech enhancement performance, we used three datasets. As the matched evaluation dataset, we used the official synthetic DNS3 test set at SNRs between 0 dB and 19 dB. To test the generalization ability, we used two mismatched datasets with speakers and noise types different from those in the training dataset, namely WSJ0-QUT [2] and VoiceBank-DEMAND (VB-DMD) [22]. WSJ0-QUT includes cafe, home, street and car noise at SNRs of -5 dB, 0 dB and 5 dB. The official VB-DMD test set includes room, office, bus, cafe and public square noise at SNRs of 2.5 dB, 7.5 dB, 12.5 dB and 17.5 dB.
We used a similar STFT framework and network architectures as for the DCCRN-VAE system [7]. The time-domain signals are transformed to the STFT domain using a Hann window with a frame length of 400 samples, 25% overlap, and an FFT length of 512. For all VAEs, the dimension of the latent representations is \(L=128\). The CVAE and NVAE encoders contain six Conv2d blocks and one complex LSTM layer. The channels for the Conv2d blocks are [32, 64, 128, 128, 256, 256], with a kernel size of (5,2) and a stride of (2,1). The complex LSTM layer outputs the \(L\)-dimensional mean, variance and relation vectors (\({\boldsymbol{\mu}}_{{x}}\), \({\boldsymbol{\sigma}}_{{x}}\) and \({\boldsymbol{\delta}}_{{x}}\) for the CVAE; \({\boldsymbol{\mu}}_{{v}}\), \({\boldsymbol{\sigma}}_{{v}}\) and \({\boldsymbol{\delta}}_{{v}}\) for the NVAE). The NSVAE encoder has a similar structure, where the only difference is that the LSTM layer generates both speech and noise vectors (\({\boldsymbol{\mu}}_{{yx}}\), \({\boldsymbol{\mu}}_{{yv}}\), \({\boldsymbol{\sigma}}_{{yx}}\), \({\boldsymbol{\sigma}}_{{yv}}\), \({\boldsymbol{\delta}}_{{yx}}\), \({\boldsymbol{\delta}}_{{yv}}\)). The CVAE and NVAE decoders mirror their respective encoders in reverse order. When adversarial training is used for CVAE decoder fine-tuning, the discriminator has a similar structure to the CVAE encoder, consisting of six Conv2d blocks and one real LSTM layer with a single output.
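As a rough illustration of the encoder layout, the following real-valued PyTorch sketch builds a Conv2d stack with the channel, kernel, and stride settings above; the actual system uses complex-valued convolutions and a complex LSTM (as in DCCRN), and the input-channel, normalization, and activation choices here are assumptions.

```python
import torch.nn as nn

# Real-valued placeholder for the encoder Conv2d stack described above.
channels = [32, 64, 128, 128, 256, 256]
blocks, in_ch = [], 2   # assumption: real and imaginary parts stacked as 2 input channels
for out_ch in channels:
    blocks += [nn.Conv2d(in_ch, out_ch, kernel_size=(5, 2), stride=(2, 1)),
               nn.BatchNorm2d(out_ch),   # assumption: batch norm + PReLU per block
               nn.PReLU()]
    in_ch = out_ch
conv_stack = nn.Sequential(*blocks)      # followed by the (complex) LSTM layer
```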
All networks were trained for a maximum of 1000 epochs. Training was stopped early if the validation loss did not decrease for 20 consecutive epochs. The Adam optimizer was used with a learning rate of 3e-4 (CVAE, NVAE, NSVAE) and a learning rate of 8e-5 (discriminator for adversarial training). All learning rates were halved if the validation loss did not improve for 3 consecutive epochs. The batch size was set to 15. The code is available at https://github.com/iris1997jiatong/I-DCCRN-VAE.
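The optimizer and learning-rate schedule can be reproduced, for example, with standard PyTorch components; `model` below is a stand-in for the CVAE, NVAE, or NSVAE being trained.

```python
import torch

# Minimal sketch of the optimization settings described above.
model = torch.nn.Linear(4, 4)   # placeholder for the actual network
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4)
# Halve the learning rate when the validation loss does not improve for 3 epochs.
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(
    optimizer, mode="min", factor=0.5, patience=3)
# After each epoch: scheduler.step(val_loss); training stops early if the
# validation loss has not decreased for 20 consecutive epochs.
```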
Aiming at finding the optimal hyperparameter configuration for the proposed I-DCCRN-VAE, in the first set of experiments we evaluate key differences with the DCCRN-VAE. In particular, we investigate the influence of skip connections and latent space regularization in the pretrained VAEs, as well as the influence of the NSVAE training target. It should be noted that in this set of experiments, we only consider classical fine-tuning of the CVAE decoder.
Table [tab:32sc95pre]: Reconstruction SI-SDR (dB) and KL loss (KLL) of the pretrained CVAE and NVAE on the DNS3 dataset, without and with skip connections (SC), for different KL weight factors \(\beta\).

           | \(\beta\) | CVAE SI-SDR | CVAE KLL | NVAE SI-SDR | NVAE KLL
Without SC | 0.001     | 15.7        | 303.9    | 16.6        | 398.1
Without SC | 0.01      | 14.7        | 67.3     | 14.9        | 114.7
Without SC | 0.1       | 13.0        | 24.0     | 12.4        | 41.6
Without SC | 1         | 8.4         | 7.7      | 5.5         | 11.6
With SC    | -         | 39.0        | 0.0      | 38.1        | 0.0
Table [tab:32beta95se]: Speech enhancement performance (SI-SDR in dB, wide-band PESQ) for pretrained VAEs without and with skip connections (SC) and for different KL weight factors \(\beta\), on the matched DNS3 test set and the mismatched WSJ0-QUT and VB-DMD test sets.

           | \(\beta\) | DNS3 SI-SDR | DNS3 PESQ | WSJ0-QUT SI-SDR | WSJ0-QUT PESQ | VB-DMD SI-SDR | VB-DMD PESQ
Without SC | 0.001     | 16.9        | 2.52      | 8.6             | 1.61          | 17.8          | 2.33
Without SC | 0.01      | 17.2        | 2.49      | 8.7             | 1.65          | 18.0          | 2.44
Without SC | 0.1       | 16.8        | 2.35      | 8.4             | 1.62          | 18.0          | 2.43
Without SC | 1         | 16.0        | 2.23      | 7.4             | 1.55          | 17.2          | 2.32
With SC    | -         | 11.7        | 1.71      | 0.0             | 1.19          | 14.1          | 2.16
Table [tab:32nsvae95res]: Speech enhancement performance (SI-SDR in dB, wide-band PESQ) for different NSVAE training targets (noise latent weight factor \(\alpha\)), on the matched DNS3 test set and the mismatched WSJ0-QUT and VB-DMD test sets.

\(\alpha\) | DNS3 SI-SDR | DNS3 PESQ | WSJ0-QUT SI-SDR | WSJ0-QUT PESQ | VB-DMD SI-SDR | VB-DMD PESQ
0          | 16.9        | 2.44      | 7.9             | 1.62          | 17.9          | 2.43
1          | 17.2        | 2.49      | 8.7             | 1.65          | 18.0          | 2.44
[Table [tab:32ablation95adv]: Average SI-SDR, wide-band PESQ, and ESTOI scores on the matched DNS3 test set and the mismatched WSJ0-QUT and VB-DMD test sets for the unprocessed noisy speech and for (1) DCCRN [23], (2) DCCRN-VAE (CF), (3) DCCRN-VAE (ADV) [7], (4) I-DCCRN-VAE (CF) (proposed), and (5) I-DCCRN-VAE (ADV) (proposed), where CF denotes classical fine-tuning and ADV denotes adversarial training. The numerical entries are not recoverable from the extracted text.]
Table [tab:32sc95pre] shows the influence of skip connections and the KL weight factor \(\beta\) in (1) on the reconstruction quality and the latent space of the pretrained CVAE and NVAE for the DNS3 dataset. We use the reconstruction SI-SDR to measure reconstruction quality and the KL loss (KLL) between the estimated posterior distributions (\(q(\mathbf{z}_x|\mathbf{X})\), \(q(\mathbf{z}_v|\mathbf{V})\)) and the prior distributions (\(p(\mathbf{z}_x)\), \(p(\mathbf{z}_v)\)) to assess the regularization of the latent space, where a lower KL loss indicates a more regularized latent space. For both the CVAE and the NVAE, it can be observed that including skip connections yields a much higher reconstruction SI-SDR than without skip connections, but a KL loss close to zero. This indicates posterior collapse in the pretrained VAEs, suggesting that the speech and noise latent representations are not very informative for speech and noise reconstruction when using skip connections. Without skip connections, it can be observed that as \(\beta\) decreases from 1 to 0.001, the reconstruction SI-SDR and the KLL of both the CVAE and the NVAE increase. This indicates a clear trade-off: decreasing \(\beta\) improves reconstruction quality at the cost of a less regularized latent space.
Table [tab:32beta95se] evaluates the influence of skip connections and \(\beta\) in the pretrained VAEs on the overall speech enhancement performance in terms of SI-SDR and wide-band Perceptual Evaluation of Speech Quality (PESQ) [24] for the matched DNS3 dataset and the mismatched datasets. First, it can be observed that including skip connections yields significantly lower SI-SDR and PESQ scores than without skip connections. This can be explained by the less informative latent representations of the pretrained VAEs, which the NSVAE encoder is trained to match; as a result, the NSVAE encoder fails to learn useful information for speech reconstruction from the pretrained VAEs. Without skip connections, a clear trend can be observed, where SI-SDR and PESQ across all datasets first increase and then decrease as \(\beta\) decreases, with \(\beta=0.01\) yielding the best performance (except for PESQ on the matched DNS3 dataset). Combined with the results in Table [tab:32sc95pre], we may conclude that both the pretrained reconstruction quality and the latent space regularization affect the speech enhancement performance. As \(\beta\) decreases from 1 to 0.01, the improved pretrained reconstruction quality leads to better speech enhancement performance for all considered datasets. However, while \(\beta=0.001\) further improves the pretrained reconstruction quality, the speech enhancement performance degrades, especially for both mismatched datasets. This performance drop is likely due to the highly unregularized latent space for \(\beta=0.001\), which degrades the generalization ability.
Table [tab:32nsvae95res] shows the influence of the NSVAE training target in (4) on the speech enhancement performance for the different datasets. For \(\alpha=0\), the NSVAE encoder only generates speech latent representations, while for \(\alpha=1\), it generates both speech and noise latent representations. It should be noted that, based on the results in Table [tab:32beta95se], we consider pretrained VAEs without skip connections and with \(\beta=0.01\) here. The results show that training the NSVAE to generate both speech and noise representations (\(\alpha=1\)) yields consistently higher SI-SDR and PESQ scores across all datasets compared to generating only speech representations (\(\alpha=0\)). This suggests that explicitly modeling the noise component helps to extract speech information from the noisy mixture, thereby improving the speech enhancement performance.
In this section, we compare the performance of the proposed I-DCCRN-VAE, using the optimal configuration (no skip connections in the pretrained VAEs, \(\beta=0.01\), \(\alpha=1\)), with DCCRN [23] and DCCRN-VAE with the residual loss [7]. For the DCCRN-VAE and I-DCCRN-VAE systems, we also compare adversarial training and classical fine-tuning for the CVAE decoder. For a fair comparison, the DCCRN baseline was trained on the same noisy dataset used for NSVAE training and fine-tuning, while the DCCRN-VAE baseline used the same datasets as the I-DCCRN-VAE.
Table [tab:32ablation95adv] shows the average SI-SDR, wide-band PESQ, and Extended Short-Time Objective Intelligibility (ESTOI) [25] scores for the matched DNS3 dataset and the mismatched datasets. First, it can be observed for all datasets that, compared to classical fine-tuning, adversarial training provides a much larger performance benefit for the baseline DCCRN-VAE (systems (2) and (3)) than for the proposed I-DCCRN-VAE (systems (4) and (5)). This can be explained by the difference in the estimated clean speech between the DCCRN-VAE and the I-DCCRN-VAE before CVAE decoder fine-tuning. Due to posterior collapse, the speech quality of the DCCRN-VAE is rather poor before fine-tuning, so the discriminator can easily distinguish between estimated speech and true clean speech, making adversarial training highly effective. In contrast, since the proposed I-DCCRN-VAE learns informative pretrained latent spaces and already produces high-quality speech before fine-tuning, adversarial training does not bring a large benefit. This is a significant advantage, as classical fine-tuning avoids the convergence and sensitivity issues that are common to adversarial training.
Finally, we compare the speech enhancement performance of the proposed I-DCCRN-VAE (systems (4) and (5)) with DCCRN (system (1)) and DCCRN-VAE using adversarial training (system (3)). On the matched DNS3 dataset, it can be observed that the baseline DCCRN and DCCRN-VAE achieve slightly better SI-SDR or PESQ scores than the I-DCCRN-VAE. However, for both considered mismatched datasets, the proposed I-DCCRN-VAE consistently achieves the best performance in terms of all metrics. Specifically, on the WSJ0-QUT dataset, the I-DCCRN-VAE improves upon the baselines by around 1.7 dB in SI-SDR, 0.1 in PESQ, and 0.03 in ESTOI. On the VB-DMD dataset, the respective improvements are 0.6 dB in SI-SDR, around 0.06 in PESQ, and 0.02 in ESTOI. This demonstrates the improved generalization ability of the proposed I-DCCRN-VAE.
This paper proposed the I-DCCRN-VAE, an improved complex VAE-based single-channel speech enhancement system building on the DCCRN-VAE. We demonstrated that three key modifications are crucial for the improvement: 1) removing the skip connections in the pretrained VAEs to avoid posterior collapse; 2) using \(\beta\)-VAE pretraining to better balance reconstruction and latent space regularization; and 3) training the NSVAE to generate both speech and noise representations for better speech extraction. Experiments show that the I-DCCRN-VAE achieves comparable performance to the baselines on the matched dataset but consistently better performance on two mismatched datasets, demonstrating better generalization. Notably, the I-DCCRN-VAE achieves this performance with classical fine-tuning instead of adversarial training, leading to a simplified training pipeline.