October 23, 2025
The development of neural audio codecs (NACs) has greatly advanced the application of language models (LMs) to speech processing and understanding. However, the effectiveness of autoregressive (AR) LM-based models in unifying different sub-tasks of speech enhancement (SE) has not yet been verified. In this work, we propose UniSE, a unified decoder-only LM-based framework that handles different SE tasks, including speech restoration, target speaker extraction and speech separation. It takes input speech features as conditions and generates discrete tokens of the target speech using AR modeling, which reconciles the distinct learning patterns of multiple tasks. Experiments on several benchmarks indicate that the proposed UniSE achieves competitive performance compared to discriminative and generative baselines, demonstrating the capacity of LMs to unify SE tasks. The demo page is available here: https://github.com/hyyan2k/UniSE.
Speech enhancement, decoder-only autoregressive language models, unified framework.
In recent years, the concept of speech enhancement (SE) has broadened from conventional denoising to reconstructing the clean target speech from degraded recordings [1]–[3]. In this context, SE covers several sub-tasks: speech restoration (SR), which restores speech from recordings corrupted by various distortions; target speaker extraction (TSE), which extracts the target speech guided by an assistive clue, e.g., reference speech of the target speaker; and speech separation (SS), which separates all speakers present in a mixture. Compared with conventional signal processing algorithms, deep neural networks (DNNs) achieve better SE performance even in non-stationary conditions and have thus become the mainstream in this field [4], [5].
Language models (LMs) have achieved remarkable success in generating text [6], images [7] and audio [8]–[10], highlighting their power in both unimodal and multi-modal domains. Several works have explored applying LMs to SE, typically by predicting the discrete tokens of clean speech extracted by pre-trained neural audio codecs (NACs). For instance, SELM [11] employs WavLM [12] combined with K-means clustering to extract discrete representations, followed by a non-autoregressive (NAR) LM backbone to predict clean tokens. GenSE [13] is a two-stage approach based on autoregressive (AR) modeling, where the first stage generates clean semantic tokens conditioned on noisy semantic tokens, and the outputs are then used to predict clean acoustic features in the second stage. In [14], a TSE model called LauraTSE adopts a Conformer [15] to extract continuous features of the mixture and reference speech, which serve as prefixes for estimating the discrete tokens of the target speech. Although these works have shown the potential of LMs for SE, they are confined to a single distortion or task.
Some studies consider more distortions or focus on unifying multiple tasks to broaden the universality of SE systems. MaskSR [16] simultaneously handles additive noise, reverberation, clipping and bandwidth limitation by performing masked prediction learning [17] on multi-layer discrete tokens. LLaSE-G1 [18] employs an NAR LM to map continuous WavLM representations of noisy speech to clean discrete tokens, and handles multiple tasks by introducing a dual-channel input and output architecture. These works rely on masked generation or direct mapping, and the effectiveness of AR modeling in a multi-task SE framework remains to be verified. Given the flexible prefix formulations of decoder-only models, they have the potential to provide an elegant solution to task unification.
In this work, we therefore propose UniSE, a decoder-only AR LM-based framework that unifies multiple SE sub-tasks, including SR, TSE and SS. Our contribution is threefold. 1) We design a decoder-only SE framework that extracts conditioning features with a frozen WavLM and a learnable adapter, generates the discrete tokens of the clean target speech, and reconstructs the waveform using an NAC. 2) We propose a task token to distinguish between different operational modes, unifying multiple tasks by switching and combining these modes. 3) The proposed model achieves comparable or superior performance to advanced baselines on several benchmarks, revealing the potential of decoder-only AR LMs for unifying multiple SE sub-tasks.
An overview of the proposed UniSE is illustrated in Fig. 1. It comprises a WavLM with a learnable adapter to extract continuous speech features, a discrete speech codec to produce discrete tokens and reconstruct waveforms, and a decoder-only LM backbone to model the conditional probability.
To extract features from the reference and degraded speech for conditional AR modeling, we adopt the pre-trained WavLM as the feature extractor, a self-supervised learning model pre-trained on large-scale speech data that achieves impressive performance on various downstream tasks. We average the features from all transformer layers in WavLM to capture sufficient acoustic and semantic information simultaneously. A linear layer serves as the adapter, mapping the WavLM output into a representation space amenable to LM AR modeling; the adapter is trainable while WavLM is kept frozen during training. The resulting features extracted from the reference and degraded speech are denoted as \({\rm E}_r\) and \({\rm E}_d\), respectively.
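As a minimal sketch of this conditioning path (assuming PyTorch and a Hugging Face-style WavLM; the class name, checkpoint name and dimensions below are illustrative rather than the released implementation), the layer-averaged WavLM features are projected by a trainable linear adapter while WavLM itself stays frozen:

```python
import torch
import torch.nn as nn
from transformers import WavLMModel

class ConditionEncoder(nn.Module):
    """Frozen WavLM + trainable linear adapter (illustrative sketch)."""
    def __init__(self, wavlm_name: str = "microsoft/wavlm-large", lm_dim: int = 512):
        super().__init__()
        self.wavlm = WavLMModel.from_pretrained(wavlm_name)
        self.wavlm.requires_grad_(False)   # WavLM is kept frozen
        self.adapter = nn.Linear(self.wavlm.config.hidden_size, lm_dim)  # trainable

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) at 16 kHz
        with torch.no_grad():
            out = self.wavlm(wav, output_hidden_states=True)
        # Average over all transformer layers to keep both acoustic and semantic cues.
        feats = torch.stack(out.hidden_states, dim=0).mean(dim=0)   # (B, T, H)
        return self.adapter(feats)                                   # (B, T, lm_dim)
```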
We utilize BiCodec [10] to convert the continuous regression problem of SE into discrete autoregressive modeling; BiCodec was originally proposed for text-to-speech (TTS) and offers high reconstruction quality. During training, the BiCodec encoder encodes the target speech waveform into a global feature \({\rm E}_g\) with a fixed length of 32 tokens and a semantic feature \({\rm E}_s\) with a flexible length of 50 tokens per second, where the former is strongly correlated with speaker characteristics and the latter represents speech content. Both features employ single-layer quantization, enabling easy integration with LMs for AR modeling. During inference, the BiCodec decoder combines \({\rm E}_g\) and \({\rm E}_s\) to restore the original speech, where this explicit disentanglement contributes to high fidelity.
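To make the codec's role concrete, the following sketch shows a hypothetical wrapper around a frozen BiCodec checkpoint; the `encode`/`decode` method names and shapes are assumptions for illustration, not BiCodec's actual API.

```python
import torch

class CodecWrapper:
    """Hypothetical interface illustrating how the codec is used here."""
    def __init__(self, bicodec):
        self.codec = bicodec  # assumed pre-trained, frozen BiCodec model

    @torch.no_grad()
    def tokenize(self, wav_16k: torch.Tensor):
        # Returns single-codebook token ids: E_g (speaker-related, 32 tokens per
        # utterance) and E_s (content-related, 50 tokens per second).
        e_g, e_s = self.codec.encode(wav_16k)   # assumed method name
        return e_g, e_s

    @torch.no_grad()
    def detokenize(self, e_g: torch.Tensor, e_s: torch.Tensor) -> torch.Tensor:
        # Recombines global and semantic tokens to reconstruct the waveform.
        return self.codec.decode(e_g, e_s)      # assumed method name
```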
The proposed framework adopts the LLaMA architecture [19] as the backbone for AR modeling, which estimates the conditional probability distribution of the discrete representations of the target speech given the degraded speech and, optionally, reference speech. To incorporate the SR, TSE and SS tasks within a single framework, we define three operational modes: SR mode, TSE mode and reverse TSE (rTSE) mode, which correspond to three distinct learnable task-specific tokens: \({\rm T_{SR}}\), \({\rm T_{TSE}}\) and \({\rm T_{rTSE}}\). In the SR mode, the target speech is the clean signal underlying the degraded input. The input sequence of the AR LM is formatted as [\({\rm T_{SR}}\), \({\rm D}\), \({\rm E}_d\), \({\rm G}\), \({\rm E}_g\), \({\rm S}\), \({\rm E}_s\)], where \({\rm D}\), \({\rm G}\) and \({\rm S}\) denote the start of the degraded speech features, global features and semantic features, respectively. The output sequence is formulated as \({\boldsymbol{o}}= \left[ {\rm E}_g, {\rm S}, {\rm E}_s, {\rm E} \right]\), with \({\rm E}\) representing the end-of-sequence token. The parameters \(\theta\) of the adapter and the decoder-only LM are optimized by minimizing the negative log-likelihood of the predicted outputs: \[\begin{align} \mathcal{L}_{\rm SR} = - \sum_{t=1}^{L} \log P\left(o_t \mid {\rm T_{SR}}, {\rm D}, {\rm E}_d, o_{<t} ; \theta\right), \end{align}\] where \(L\) denotes the length of the output sequence.
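A minimal training-step sketch for the SR mode is given below, assuming a decoder-only LM that accepts pre-computed input embeddings and a shared embedding table for the codec tokens and the special tokens \({\rm T_{SR}}\), \({\rm D}\), \({\rm G}\), \({\rm S}\), \({\rm E}\); all names are illustrative, and the indexing simply realizes the teacher-forced negative log-likelihood above.

```python
import torch
import torch.nn.functional as F

def sr_step(lm, tok_embed, special_ids, e_d, e_g, e_s):
    """
    lm          : decoder-only LM; lm(inputs_embeds=x) -> logits of shape (B, L, vocab)
    tok_embed   : nn.Embedding over codec tokens + special tokens (shared table)
    special_ids : dict with integer ids for "T_SR", "D", "G", "S", "E"
    e_d         : continuous degraded-speech features from the adapter, (B, T_d, H)
    e_g, e_s    : target global / semantic token ids, (B, 32) and (B, T_s)
    """
    B = e_d.size(0)
    ids = lambda name: torch.full((B, 1), special_ids[name], device=e_d.device)
    # Output sequence o = [E_g, S, E_s, E] from the paper.
    targets = torch.cat([e_g, ids("S"), e_s, ids("E")], dim=1)
    # Input sequence [T_SR, D, E_d, G, E_g, S, E_s], built via teacher forcing.
    prefix = torch.cat([tok_embed(ids("T_SR")), tok_embed(ids("D")), e_d,
                        tok_embed(ids("G"))], dim=1)
    inputs = torch.cat([prefix, tok_embed(targets[:, :-1])], dim=1)
    logits = lm(inputs_embeds=inputs)
    # Positions prefix_len-1 ... end predict the L output tokens.
    pred = logits[:, prefix.size(1) - 1:, :]
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)), targets.reshape(-1))
```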
In the TSE mode, the target speech is the component of the degraded input whose timbre matches the reference audio. The input sequence is formatted as [\({\rm T_{TSE}}\), \({\rm R}\), \({\rm E}_r\), \({\rm D}\), \({\rm E}_d\), \({\rm G}\), \({\rm E}_g\), \({\rm S}\), \({\rm E}_s\)], where \({\rm R}\) denotes the start of the reference speech features. The associated loss function is defined as \[\begin{align} \mathcal{L}_{\rm TSE} = - \sum_{t=1}^{L} \log P\left(o_t \mid {\rm T_{TSE}}, {\rm R}, {\rm E}_r, {\rm D}, {\rm E}_d, o_{<t} ; \theta\right). \end{align}\] In the rTSE mode, the target speech is the component of the degraded input whose timbre does not match the reference audio. The input sequence format and the loss function \(\mathcal{L}_{\rm rTSE}\) are identical to those of the TSE mode, apart from the task token \({\rm T_{rTSE}}\).
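Under the same assumptions as the previous sketch, the only differences between modes lie in the task token and the optional reference block:

```python
import torch

def build_prefix(mode, tok_embed, special_ids, e_d, e_r=None):
    """
    mode : "SR", "TSE" or "rTSE"; e_r (reference features) is required for TSE/rTSE.
    special_ids maps token names ("T_SR", "T_TSE", "T_rTSE", "R", "D", "G") to
    integer ids, as in the previous sketch (all names illustrative).
    """
    B = e_d.size(0)
    emb = lambda name: tok_embed(
        torch.full((B, 1), special_ids[name], device=e_d.device))
    parts = [emb("T_" + mode)]                    # the task token selects the mode
    if mode in ("TSE", "rTSE"):
        parts += [emb("R"), e_r]                  # reference speech features E_r
    parts += [emb("D"), e_d, emb("G")]            # degraded features E_d, start of E_g
    return torch.cat(parts, dim=1)
```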
Table 1: Data augmentation pipeline for simulating degraded speech.

| Distortion | Probability | Hyperparameters |
|---|---|---|
| Noise | 0.8 | SNR \(\in\) [-5, 20] dB |
| Reverberation | 0.3 | - |
| Clipping | 0.3 | Min_quantile \(\in\) [0.0, 0.1], Max_quantile \(\in\) [0.9, 1.0] |
| Bandwidth Limitation | 0.3 | Bandwidth \(\in\) kHz |
| Packet Loss | 0.3 | Rate \(\in\) [0.05, 0.25] |
| Interference Speaker | 0.2 for SR, 1.0 for TSE/rTSE | SIR \(\in\) [2, 20] dB for SR; SIR \(\in\) [-5, 5] dB for TSE/rTSE |
During inference, the input speech is divided into fixed-length segments, consistent with the training setup. The SR mode handles the SR task, restoring clean speech from recordings corrupted by various distortions; when multiple speakers are present in the degraded speech, the model is trained to output the louder speaker. The TSE mode handles the TSE task, extracting the timbre-matched speech from the mixture regardless of relative loudness. For the SS task, we perform multiple inference passes involving all three modes. Specifically, for degraded speech containing two speakers (this work only considers the two-speaker case), we first employ the SR mode to extract the louder speaker. This initial estimate then serves as the reference audio for the TSE mode to separate the first speaker; since the relative loudness can vary across different parts of the input signal, this step helps ensure speaker consistency across segments. Finally, using the first speaker as the reference, the rTSE mode is applied to extract the remaining speaker.
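The three-stage procedure for the two-speaker SS case can be summarized as follows; `run_mode` is a placeholder for autoregressive decoding followed by codec reconstruction, not a released API.

```python
def separate_two_speakers(model, mixture):
    """Illustrative multi-mode inference for two-speaker separation."""
    # 1) SR mode: a first pass that tends to recover the louder speaker.
    rough_first = model.run_mode("SR", degraded=mixture)
    # 2) TSE mode: re-extract speaker 1 with the rough estimate as reference,
    #    keeping the speaker identity consistent across segments.
    speaker1 = model.run_mode("TSE", degraded=mixture, reference=rough_first)
    # 3) rTSE mode: extract the remaining speaker, i.e. the one whose timbre
    #    does NOT match the reference.
    speaker2 = model.run_mode("rTSE", degraded=mixture, reference=speaker1)
    return speaker1, speaker2
```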
Table 2: DNSMOS results on the 2020 DNS Challenge test set (D: discriminative, G: generative).

| Model | Type | With Reverb | | | No Reverb | | |
|---|---|---|---|---|---|---|---|
| | | SIG | BAK | OVRL | SIG | BAK | OVRL |
| Noisy | - | 1.76 | 1.50 | 1.39 | 3.39 | 2.62 | 2.48 |
| Conv-TasNet [4] | D | 2.42 | 2.71 | 2.01 | 3.09 | 3.34 | 3.00 |
| FRCRN [20] | D | 2.93 | 2.92 | 2.28 | 3.58 | 4.13 | 3.34 |
| SELM [11] | G | 3.16 | 3.58 | 2.70 | 3.51 | 4.10 | 3.26 |
| MaskSR [16] | G | 3.53 | 4.07 | 3.25 | 3.59 | 4.12 | 3.34 |
| AnyEnhance [21] | G | 3.50 | 4.04 | 3.20 | 3.64 | 4.18 | 3.42 |
| GenSE [13] | G | 3.49 | 3.73 | 3.19 | 3.65 | 4.18 | 3.43 |
| LLaSE-G1 [18] | G | 3.59 | 4.10 | 3.33 | 3.66 | 4.17 | 3.42 |
| UniSE | G | 3.67 | 4.10 | 3.40 | 3.67 | 4.14 | 3.43 |
| UniSE-SR | G | 3.67 | 4.08 | 3.38 | 3.66 | 4.13 | 3.42 |
Datasets: The clean speech for training is sourced from the VoxBox dataset [10], which integrates multiple publicly available speech corpora after rigorous data cleaning. Our training set contains 760 hours of LibriSpeech [22], 1200 hours from the MLS_English [23] subset, and 1800 hours from the Emilia_ZH [24] subset. The noise corpus comprises approximately 460 hours of data from the DNS Challenge [25], FSD50K [26], WHAM! [27], DESED [28], DEMAND [29], MUSAN [30], DISCO [31], MUSDB18-HQ [32], and TUT Urban Acoustic Scenes [33]. We also include 60,000 room impulse response (RIR) samples from SLR28 to simulate reverberation. A data augmentation pipeline simulates degraded speech, as shown in Table 1: an operational mode is randomly selected during training, and each distortion is applied with its corresponding probability. Note that the interference speaker has different configurations in different modes. All audio is sampled at 16 kHz.
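The sampling logic of Table 1 can be sketched as follows; the distortion callables in `ops` stand in for standard signal-processing operations (noise mixing at a given SNR, convolution with an RIR, and so on) and are not the authors' implementation.

```python
import random

def degrade(clean, ops, noise_pool, rir_pool, mode="SR"):
    """
    Apply each distortion from Table 1 with its probability (sketch).
    `ops` is a user-supplied dict of distortion callables with keys
    "noise", "reverb", "clip", "bandwidth", "packet_loss", "interference".
    """
    x = clean
    if random.random() < 0.8:
        x = ops["noise"](x, random.choice(noise_pool), snr_db=random.uniform(-5, 20))
    if random.random() < 0.3:
        x = ops["reverb"](x, random.choice(rir_pool))
    if random.random() < 0.3:
        x = ops["clip"](x, lo_q=random.uniform(0.0, 0.1), hi_q=random.uniform(0.9, 1.0))
    if random.random() < 0.3:
        x = ops["bandwidth"](x)                       # random cutoff bandwidth
    if random.random() < 0.3:
        x = ops["packet_loss"](x, rate=random.uniform(0.05, 0.25))
    # Interference speaker: probability 0.2 for SR, always for TSE/rTSE.
    p_interf, sir = (0.2, (2, 20)) if mode == "SR" else (1.0, (-5, 5))
    if random.random() < p_interf:
        x = ops["interference"](x, sir_db=random.uniform(*sir))
    return x
```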
Implementation Details: The LLaMA-based decoder-only backbone consists of 12 layers, each with 8 attention heads and a hidden dimension of 512, resulting in 63M parameters. The model is trained for 30 epochs with the AdamW optimizer. The learning rate peaks at 0.001 after 4000 warm-up steps and then decays by a factor of 0.98 after each epoch. During training and inference, the lengths of the reference and degraded speech are set to 5 seconds.
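A sketch of this optimization schedule, assuming a linear warm-up to the peak learning rate (the warm-up shape is our assumption) and a per-epoch multiplicative decay applied through a step-wise LambdaLR:

```python
import torch

def build_optimizer(model, steps_per_epoch, peak_lr=1e-3,
                    warmup_steps=4000, decay=0.98):
    """AdamW with 4000-step warm-up and 0.98 decay per epoch (sketch)."""
    opt = torch.optim.AdamW(model.parameters(), lr=peak_lr)

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)        # linear warm-up (assumed shape)
        epochs_done = (step - warmup_steps) // steps_per_epoch
        return decay ** epochs_done                    # multiplicative decay per epoch

    sched = torch.optim.lr_scheduler.LambdaLR(opt, lr_lambda)
    return opt, sched   # call sched.step() once per training step
```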
Evaluation Configurations: We evaluate our model on several benchmarks: the test sets of the 2020 DNS Challenge [25] and the 2025 URGENT Challenge [34] for the SR task, the Libri2Mix clean test set for the TSE task, and the Libri2Mix noisy and WSJ0-2mix test sets for the SS task. We adopt DNSMOS [35] (where SIG, BAK and OVRL represent signal quality, background noise and overall quality, respectively), NISQA [36] and UTMOS [37] to measure the quality of the generated speech. Following [14], the speaker similarity (SIM) for TSE is calculated using WavLM-base.
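Speaker similarity of this kind is typically computed as the cosine similarity between speaker embeddings of the extracted and reference speech; a minimal sketch, assuming a generic speaker-embedding callable standing in for the WavLM-base verification model:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def speaker_similarity(spk_encoder, extracted_wav, reference_wav):
    """Mean cosine similarity between speaker embeddings (sketch)."""
    e1 = spk_encoder(extracted_wav)   # (B, D) speaker embeddings
    e2 = spk_encoder(reference_wav)
    return F.cosine_similarity(e1, e2, dim=-1).mean().item()
```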
Table 3: SR results on the 2025 URGENT Challenge test set.

| Team/Model | Team Rank | OVRL | NISQA | UTMOS |
|---|---|---|---|---|
| Bobbsun | 1 | 2.88 | 3.22 | 2.09 |
| Xiaobin | 2 | 2.92 | 3.24 | 2.16 |
| subatomicseer | 3 | 2.94 | 3.25 | 2.19 |
| wataru9871 | 13 | 3.10 | 3.74 | 2.53 |
| LLaSE-G1 [18] | - | 2.80 | 2.93 | 2.09 |
| UniSE | - | 3.17 | 3.72 | 2.85 |
The ranking takes into account both non-intrusive and intrusive metrics; the latter tend to penalize generative models.
Table 4: TSE results on the Libri2Mix clean test set.

| Model | Type | SIG | BAK | OVRL | NISQA | SIM |
|---|---|---|---|---|---|---|
| Mixture | - | 3.38 | 3.10 | 2.65 | 2.45 | 0.85 |
| Spex+ [38] | D | 3.38 | 3.77 | 3.00 | 3.03 | 0.96 |
| WeSep [39] | D | 3.56 | 3.93 | 3.23 | 4.04 | 0.99 |
| TSELM-L [40] | G | 3.55 | 4.08 | 3.23 | 4.03 | 0.91 |
| AnyEnhance [21] | G | 3.64 | 4.07 | 3.35 | 4.28 | 0.91 |
| LLaSE-G1 [18] | G | 3.53 | 4.01 | 3.22 | 3.89 | 0.92 |
| LauraTSE [14] | G | 3.61 | 4.08 | 3.34 | 4.33 | 0.97 |
| UniSE | G | 3.62 | 4.06 | 3.33 | 4.00 | 0.95 |
| UniSE-TSE | G | 3.61 | 4.06 | 3.33 | 3.99 | 0.95 |
| BiCodec | - | 3.59 | 4.05 | 3.30 | 4.02 | 0.97 |
Table 5: SS results (DNSMOS) on the Libri2Mix noisy and WSJ0-2mix test sets.

| Model | Type | Libri2Mix | | | WSJ0-2mix | | |
|---|---|---|---|---|---|---|---|
| | | SIG | BAK | OVRL | SIG | BAK | OVRL |
| Mixture | - | 2.33 | 1.66 | 1.64 | 3.42 | 3.20 | 2.76 |
| Sepformer [41] | D | 3.33 | 3.88 | 3.02 | 3.43 | 3.96 | 3.14 |
| Mossformer2 [42] | D | 3.44 | 3.94 | 3.11 | 3.50 | 4.05 | 3.23 |
| LLaSE-G1 [18] | G | 3.48 | 3.83 | 3.11 | 3.52 | 3.92 | 3.19 |
| UniSE | G | 3.60 | 4.08 | 3.32 | 3.62 | 4.08 | 3.36 |
Table 6: Ablation on the LM backbone and codec (DNSMOS on the 2020 DNS Challenge test set).

| LM | Codec | With Reverb | | | No Reverb | | |
|---|---|---|---|---|---|---|---|
| | | SIG | BAK | OVRL | SIG | BAK | OVRL |
| LLaMA | BiCodec | 3.67 | 4.10 | 3.40 | 3.67 | 4.14 | 3.43 |
| Qwen2 | BiCodec | 3.66 | 4.08 | 3.39 | 3.67 | 4.14 | 3.44 |
| GLM | BiCodec | 3.65 | 4.10 | 3.39 | 3.66 | 4.14 | 3.43 |
| LLaMA | X-codec2 | 3.57 | 4.03 | 3.27 | 3.60 | 4.09 | 3.34 |
Table 2 compares UniSE with several advanced baselines on the 2020 DNS Challenge test set. Our model achieves state-of-the-art (SOTA) SR performance, and training exclusively in the SR mode (denoted UniSE-SR) yields performance comparable to UniSE. Notably, LLaSE-G1 employs a backbone with approximately 1 billion parameters, whereas our model has only 63M, showing superior parameter efficiency. Table 3 reports the SR results of our framework and of the systems submitted to the URGENT Challenge, which involves multiple distortions. Our model achieves competitive performance even under unseen distortions (codec artifacts and wind noise), demonstrating strong generalization ability.
TSE results on the Libri2Mix clean test set are summarized in Table 4, showing that our model performs comparably to SOTA baselines. Compared with LauraTSE, another AR-based method, our framework supports more tasks. The variant UniSE-TSE, trained exclusively in the TSE mode, achieves similar performance, mirroring the comparison between UniSE and UniSE-SR for SR. This indicates that incorporating more tasks does not degrade the performance of individual tasks in our framework. The results obtained by directly resynthesizing the target speech with BiCodec (bottom row) reveal the performance ceiling that the codec imposes on SE frameworks, demonstrating the need to further improve low-bitrate NACs.
Table 5 compares the SS performance of our model with baselines on the Libri2Mix and WSJ0-2mix test sets, where the former includes additional noise. UniSE outperforms the other discriminative and generative models with OVRL scores of 3.32 on Libri2Mix and 3.36 on WSJ0-2mix, highlighting the effectiveness of our multi-mode inference strategy.
Finally, we conduct an ablation study in Table 6 to analyze the impact of different LM architectures and NACs on SE performance. Replacing the LM backbone with Qwen2 [43] or GLM [44] yields similar performance, showing the adaptability of our framework. Using X-codec2 [45] leads to a clear performance drop, which might be caused by its large codebook size challenging the modeling capability of the LM backbone.
In this work, we proposed UniSE, an SE framework that unifies the SR, TSE and SS tasks. UniSE adopts continuous features of the degraded speech and the reference speech as conditions to generate discrete tokens of the target speech via AR modeling. Multiple operational modes are defined by task tokens, enabling task unification through switching and combining these modes. Extensive results show that UniSE achieves competitive performance on each benchmark, verifying the effectiveness of the decoder-only AR LM framework in unifying SE tasks. Future work could consider more general audio tasks, as well as efforts to improve the capability of the codec.