UniSE: A Unified Framework for Decoder-Only Autoregressive LM-Based Speech Enhancement


Abstract

The development of neural audio codecs (NACs) has largely promoted applications of language models (LMs) to speech processing and understanding. However, the effectiveness of autoregressive (AR) LM-based models in unifying different sub-tasks of speech enhancement (SE) has yet to be verified. In this work, we propose UniSE, a unified decoder-only LM-based framework that handles different SE tasks, including speech restoration, target speaker extraction and speech separation. It takes input speech features as conditions and generates discrete tokens of the target speech using AR modeling, which facilitates compatibility between the distinct learning patterns of multiple tasks. Experiments on several benchmarks indicate that the proposed UniSE achieves competitive performance compared to discriminative and generative baselines, showing the capacity of LMs in unifying SE tasks. The demo page is available at https://github.com/hyyan2k/UniSE.

Speech enhancement, decoder-only autoregressive language models, unified framework.

1 Introduction↩︎

In recent years, the concept of speech enhancement (SE) has broadened, ranging from conventional denoising to reconstructing clean target speech from degraded recordings [1][3]. In this context, SE includes several sub-tasks: speech restoration (SR), which aims to restore speech from recordings degraded by various distortions; target speaker extraction (TSE), which extracts the target speech guided by an assistive clue, e.g., reference speech of the target speaker; and speech separation (SS), which separates all existing speakers from a mixture. Compared with conventional signal processing algorithms, deep neural networks (DNNs) achieve better SE performance, even in non-stationary conditions, and have thus become the mainstream in this field [4], [5].

Language models (LMs) have achieved remarkable success in generating text [6], images [7] and audio [8][10], highlighting their power in unimodal and multi-modal domains. Some works have explored applications of LMs to SE, typically by predicting the discrete tokens of clean speech extracted with pre-trained neural audio codecs (NACs). For instance, SELM [11] employs WavLM [12] in combination with K-means clustering to extract discrete representations, followed by a non-autoregressive (NAR) LM backbone to predict clean tokens. GenSE [13] is a two-stage approach based on autoregressive (AR) modeling, where the first stage generates clean semantic tokens conditioned on noisy semantic tokens, and the outputs are then used to predict clean acoustic features in the second stage. In [14], a TSE model called LauraTSE adopts a Conformer [15] to extract continuous features of the mixture and reference speech, which serve as prefixes to estimate the discrete tokens of the target speech. Although these works have shown the potential of LMs for SE, they are confined to a single distortion or task.

Some studies consider more distortions or focus on unifying multiple tasks to expand the universality of SE systems. MaskSR [16] simultaneously handles additive noise, reverberation, clipping and bandwidth limitation by performing masked prediction learning [17] on multi-layer discrete tokens. LLaSE-G1 [18] employs an NAR LM to map continuous WavLM representations of noisy speech to clean discrete tokens, and handles multiple tasks by introducing a dual-channel input and output architecture. These works follow the paradigm of masked generation or direct mapping, and the effectiveness of AR modeling in a multi-task SE framework remains to be verified. Given the flexible prefix formulation of decoder-only models, they have the potential to serve as an elegant solution for task unification.

Figure 1: Overall architecture of UniSE, where the BiCodec Encoder is only utilized to generate label tokens during training and excluded during inference. The snowflake icon means that parameters are pre-trained and frozen, and the fire icon indicates that parameters are optimized during training.

In this work, we therefore propose a decoder-only AR LM-based framework called UniSE to unify multiple sub-tasks of SE, including SR, TSE and SS. Our contribution is threefold. 1) We design a decoder-only SE framework that extracts conditioning features with a frozen WavLM and a learnable adapter, generates the discrete tokens of the clean target speech, and reconstructs the waveform using an NAC. 2) We propose a task token to distinguish between different operational modes, unifying multiple tasks by switching and combining these modes. 3) The proposed model achieves comparable or superior performance to advanced baselines on several benchmarks, revealing the potential of decoder-only AR LMs for unifying multiple SE sub-tasks.

2 METHODOLOGY↩︎

An overview of the proposed UniSE is illustrated in Fig. 1. It comprises a WavLM with an adapter to extract continuous speech features, a discrete speech codec to produce discrete tokens and reconstruct waveforms, and a decoder-only LM backbone to model the conditional probability.

2.1 Conditional Feature Extractor↩︎

To extract features from the reference and degraded speech for conditional AR modeling, we adopt the pre-trained WavLM1 as the feature extractor, a self-supervised learning model pre-trained on large-scale speech data that achieves impressive performance on various downstream tasks. We average the features from all transformer layers in WavLM to capture sufficient acoustic and semantic information simultaneously. A linear layer is utilized as the adapter to map the WavLM output into a representation space amenable to LM AR modeling; the parameters of the adapter are trainable while WavLM is frozen during training. The final features extracted from the reference and degraded speech are denoted as \({\rm E}_r\) and \({\rm E}_d\), respectively.
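
A minimal sketch of this extractor is given below, assuming the Hugging Face `transformers` WavLM checkpoint referenced in the footnote; the adapter dimension, class layout, and the choice to average only the transformer-layer outputs are illustrative assumptions rather than the authors' released code.

```python
import torch
import torch.nn as nn
from transformers import WavLMModel

class ConditionalFeatureExtractor(nn.Module):
    def __init__(self, lm_dim: int = 512):
        super().__init__()
        self.wavlm = WavLMModel.from_pretrained("microsoft/wavlm-base-plus")
        self.wavlm.requires_grad_(False)  # WavLM stays frozen during training
        # trainable linear adapter into the LM embedding space
        self.adapter = nn.Linear(self.wavlm.config.hidden_size, lm_dim)

    def forward(self, wav: torch.Tensor) -> torch.Tensor:
        # wav: (batch, samples) at 16 kHz
        with torch.no_grad():
            out = self.wavlm(wav, output_hidden_states=True)
        # average the outputs of all transformer layers to mix acoustic and
        # semantic information (hidden_states[0] is the CNN embedding output)
        feats = torch.stack(out.hidden_states[1:], dim=0).mean(dim=0)  # (B, T, H)
        return self.adapter(feats)                                     # (B, T, lm_dim)
```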

2.2 Discrete Token Codec↩︎

We utilize BiCodec [10], which was initially proposed for text-to-speech (TTS) and attains high reconstruction quality, to convert the continuous regression problem of SE into discrete autoregressive modeling. During training, the BiCodec encoder encodes the target speech waveform into a global feature \({\rm E}_g\) with a fixed length of 32 tokens and a semantic feature \({\rm E}_s\) with a flexible length of 50 tokens per second, where the former is strongly correlated with speaker characteristics and the latter represents speech content. Both features employ single-layer quantization, enabling easy integration with LMs for AR modeling. During inference, the BiCodec decoder combines \({\rm E}_g\) and \({\rm E}_s\) to restore the original speech, benefiting from this explicit disentanglement to achieve high fidelity.

2.3 Unified Multi-Task Framework↩︎

The proposed framework adopts the LLaMA architecture [19] as the backbone for AR modeling, aiming to estimate the conditional probability distribution of the target speech's discrete representations given the degraded speech and optional reference speech. To incorporate the SR, TSE and SS tasks within a single framework, we define three operational modes: SR mode, TSE mode and reverse TSE (rTSE) mode, which correspond to three distinct learnable task-specific tokens: \({\rm T_{SR}}\), \({\rm T_{TSE}}\), and \({\rm T_{rTSE}}\). For the SR mode, the target speech corresponds to the clean signal underlying the degraded input. The input sequence of the AR LM is formatted as [\({\rm T_{SR}}\), \({\rm D}\), \({\rm E}_d\), \({\rm G}\), \({\rm E}_g\), \({\rm S}\), \({\rm E}_s\)], where \({\rm D}\) denotes the start of the degraded speech features, \({\rm G}\) the start of the global features, and \({\rm S}\) the start of the semantic features. The output sequence is formulated as \({\boldsymbol{o}}= \left[ {\rm E}_g, {\rm S}, {\rm E}_s, {\rm E} \right]\), with \({\rm E}\) representing the end-of-sequence token. The parameters \(\theta\) of the adapter and decoder-only LM are optimized by minimizing the negative log-likelihood of the predicted outputs: \[\begin{align} \mathcal{L}_{\rm SR} = - \sum_{t=1}^{L} \log P\left(o_t | {\rm T_{SR}}, {\rm D}, {\rm E}_d, o_{<t} ; \theta\right), \end{align}\] where \(L\) indicates the length of the output sequence.
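
For intuition about the sequence lengths involved, the short calculation below counts the tokens of a 5-second SR-mode example, assuming WavLM's 20 ms frame stride (about 50 feature frames per second) and the BiCodec rates from Sec. 2.2; the exact counts are illustrative only.

```python
# Token accounting for one 5-second SR-mode example (illustrative).
SEGMENT_SEC = 5
wavlm_frames = 50 * SEGMENT_SEC        # E_d: ~50 WavLM feature frames per second
global_tokens = 32                     # E_g: fixed-length global codes
semantic_tokens = 50 * SEGMENT_SEC     # E_s: 50 semantic tokens per second

# conditioning prefix [T_SR, D, E_d, G] and output sequence [E_g, S, E_s, E]
prefix_len = 1 + 1 + wavlm_frames + 1                  # 253
output_len = global_tokens + 1 + semantic_tokens + 1   # 284
print(prefix_len, output_len, prefix_len + output_len)  # 253 284 537
```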

For the TSE mode, the target speech corresponds to the component of the degraded input whose timbre matches the reference audio. The input sequence is formatted as [\({\rm T_{TSE}}\), \({\rm R}\), \({\rm E}_r\), \({\rm D}\), \({\rm E}_d\), \({\rm G}\), \({\rm E}_g\), \({\rm S}\), \({\rm E}_s\)], where \({\rm R}\) denotes the start of the reference speech features. The associated loss function is defined as \[\begin{align} \mathcal{L}_{\rm TSE} = - \sum_{t=1}^{L} \log P\left(o_t | {\rm T_{TSE}}, {\rm R}, {\rm E}_r, {\rm D}, {\rm E}_d, o_{<t} ; \theta\right). \end{align}\] For the rTSE mode, the target speech corresponds to the component of the degraded input whose timbre does not match the reference audio. The input sequence format and loss function \(\mathcal{L}_{\rm rTSE}\) are identical to those of the TSE mode, except that the task token is \({\rm T_{rTSE}}\).
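
To make the mode-dependent sequence layout concrete, the sketch below assembles the conditioning prefix for each mode and computes the token-level negative log-likelihood; the special-token IDs, the embedding table `embed_special` and the tensor shapes are illustrative assumptions rather than the released implementation.

```python
import torch
import torch.nn.functional as F

SPECIAL = {"T_SR": 0, "T_TSE": 1, "T_rTSE": 2, "R": 3, "D": 4, "G": 5, "S": 6, "E": 7}

def build_prefix(mode, E_d, E_r=None, embed_special=None):
    """Assemble the conditioning prefix up to the start-of-global marker G.

    E_d / E_r: adapter outputs for degraded / reference speech, shape (T, lm_dim).
    The target tokens [E_g, S, E_s, E] are appended with teacher forcing during
    training and generated autoregressively at inference.
    """
    tok = lambda name: embed_special(torch.tensor([SPECIAL[name]]))  # (1, lm_dim)
    if mode == "SR":
        parts = [tok("T_SR"), tok("D"), E_d, tok("G")]
    elif mode in ("TSE", "rTSE"):  # same layout, only the task token differs
        parts = [tok(f"T_{mode}"), tok("R"), E_r, tok("D"), E_d, tok("G")]
    else:
        raise ValueError(mode)
    return torch.cat(parts, dim=0)

def ar_loss(logits, targets):
    # logits: (L, vocab) predictions over the output positions o_1..o_L
    # targets: (L,) ground-truth codec tokens plus the S and E markers
    return F.cross_entropy(logits, targets)  # token-averaged negative log-likelihood
```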

Table 1: Distortion categories and simulation configurations, where SNR denotes signal-to-noise ratio and SIR denotes signal-to-interference ratio.

| Distortion | Probability | Hyperparameters |
|---|---|---|
| Noise | 0.8 | SNR \(\in\) [-5, 20] |
| Reverberation | 0.3 | - |
| Clipping | 0.3 | Min_quantile \(\in\) [0.0, 0.1], Max_quantile \(\in\) [0.9, 1.0] |
| Bandwidth Limitation | 0.3 | Bandwidth \(\in\) kHz |
| Packet Loss | 0.3 | Rate \(\in\) [0.05, 0.25] |
| Interference Speaker | 0.2 for SR, 1.0 for TSE/rTSE | SIR \(\in\) [2, 20] for SR, SIR \(\in\) [-5, 5] for TSE/rTSE |

2.4 Inference Strategies↩︎

During inference, we divide the input speech into fixed-length segments for processing, consistent with the training setup. The SR mode is used for the SR task, which restores clean speech from recordings degraded by various distortions. When multiple speakers exist in the degraded speech, the model is expected to output the louder speaker. The TSE mode handles the TSE task, which extracts the timbre-matched speech from the mixture regardless of relative loudness. For the SS task, we perform multiple inference passes that involve all three modes. Specifically, for degraded speech containing two speakers (this work only considers the two-speaker case), we first employ the SR mode to extract the louder speaker. This initial result then serves as the reference audio for the TSE mode to separate the first speaker; since the relative loudness can vary across different periods of the input signal, this step helps ensure speaker consistency across segments. Finally, using the first speaker as the reference, the rTSE mode is applied to extract the remaining speaker.
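
The three-pass procedure can be summarized as follows, where `run_mode` is a hypothetical helper standing in for one AR decoding pass of UniSE followed by BiCodec waveform reconstruction.

```python
def separate_two_speakers(mixture, run_mode):
    # Pass 1: SR mode pulls the louder speaker out of the mixture.
    rough_first = run_mode("SR", degraded=mixture)
    # Pass 2: TSE mode re-extracts that speaker, using the SR output as the
    # reference to keep the same speaker across all fixed-length segments.
    speaker_1 = run_mode("TSE", degraded=mixture, reference=rough_first)
    # Pass 3: rTSE mode returns the component that does NOT match the
    # reference timbre, i.e. the remaining speaker.
    speaker_2 = run_mode("rTSE", degraded=mixture, reference=speaker_1)
    return speaker_1, speaker_2
```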

3 EXPERIMENTS↩︎

3.1 Experimental Setup↩︎

Table 2: DNSMOS scores on the 2020 DNS Challenge test sets, where the "With Reverb" subset contains reverberation while the "No Reverb" subset only involves noise. "D" and "G" denote discriminative and generative methods, respectively.

| Model | Type | With Reverb (SIG / BAK / OVRL) | No Reverb (SIG / BAK / OVRL) |
|---|---|---|---|
| Noisy | - | 1.76 / 1.50 / 1.39 | 3.39 / 2.62 / 2.48 |
| Conv-TasNet [4] | D | 2.42 / 2.71 / 2.01 | 3.09 / 3.34 / 3.00 |
| FRCRN [20] | D | 2.93 / 2.92 / 2.28 | 3.58 / 4.13 / 3.34 |
| SELM [11] | G | 3.16 / 3.58 / 2.70 | 3.51 / 4.10 / 3.26 |
| MaskSR [16] | G | 3.53 / 4.07 / 3.25 | 3.59 / 4.12 / 3.34 |
| AnyEnhance [21] | G | 3.50 / 4.04 / 3.20 | 3.64 / 4.18 / 3.42 |
| GenSE [13] | G | 3.49 / 3.73 / 3.19 | 3.65 / 4.18 / 3.43 |
| LLaSE-G1 [18] | G | 3.59 / 4.10 / 3.33 | 3.66 / 4.17 / 3.42 |
| UniSE | G | 3.67 / 4.10 / 3.40 | 3.67 / 4.14 / 3.43 |
| UniSE-SR | G | 3.67 / 4.08 / 3.38 | 3.66 / 4.13 / 3.42 |

Datasets: The clean speech for training is sourced from the VoxBox dataset [10], which integrates multiple publicly available speech datasets after rigorous data cleaning. Our training set contains 760 hours of LibriSpeech [22] data, 1200 hours from the MLS_English [23] subset, and 1800 hours from the Emilia_ZH [24] subset. The noise corpus comprises approximately 460 hours of data from the DNS Challenge [25], FSD50K [26], WHAM! [27], DESED [28], DEMAND [29], MUSAN [30], DISCO [31], MUSDB18-HQ [32], and TUT Urban Acoustic Scenes [33]. We include 60,000 room impulse response (RIR) samples from SLR28 to simulate reverberation. A data augmentation pipeline is designed to simulate degraded speech, as shown in Table 1. We randomly select an operational mode during training, and each distortion is applied to every mode according to its probability. Note that the interference speaker has different configurations in different modes. All audio is sampled at 16 kHz.
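
A possible sampler for this pipeline is sketched below: each distortion fires independently with the probability listed in Table 1 and draws its hyperparameters from the stated ranges (the bandwidth cutoff is omitted because its range is not specified here), and the returned plan would then be applied to the clean segment by separate, unspecified routines. This is an illustrative sketch, not the authors' data-simulation code.

```python
import random

def sample_distortions(mode, rng=random):
    """Return the list of distortions (with sampled hyperparameters) to apply
    to one clean training segment, following Table 1."""
    plan = []
    if rng.random() < 0.8:
        plan.append(("noise", {"snr_db": rng.uniform(-5, 20)}))
    if rng.random() < 0.3:
        plan.append(("reverberation", {}))          # convolve with a random RIR
    if rng.random() < 0.3:
        plan.append(("clipping", {"min_q": rng.uniform(0.0, 0.1),
                                  "max_q": rng.uniform(0.9, 1.0)}))
    if rng.random() < 0.3:
        plan.append(("bandwidth_limitation", {}))   # cutoff range not given here
    if rng.random() < 0.3:
        plan.append(("packet_loss", {"rate": rng.uniform(0.05, 0.25)}))
    if mode == "SR":
        if rng.random() < 0.2:
            plan.append(("interference", {"sir_db": rng.uniform(2, 20)}))
    else:  # TSE / rTSE segments always contain an interfering speaker
        plan.append(("interference", {"sir_db": rng.uniform(-5, 5)}))
    return plan
```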

Implementation Details: The LLaMA-based decoder-only backbone consists of 12 layers, each with 8 attention heads and a hidden dimension of 512, resulting in 63M parameters. The model is trained with the AdamW optimizer for 30 epochs. The learning rate reaches a peak of 0.001 after 4000 warm-up steps and then decays by a factor of 0.98 per epoch. During training and inference, the lengths of the reference and degraded speech are set to 5 seconds.
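
A sketch of this optimization recipe is shown below, assuming PyTorch's built-in AdamW and learning-rate schedulers; the warm-up start factor and the split into a per-step warm-up scheduler plus a per-epoch decay scheduler are implementation assumptions rather than the exact training code.

```python
import torch

def make_optimizer_and_schedulers(model, warmup_steps=4000, peak_lr=1e-3, decay=0.98):
    opt = torch.optim.AdamW(model.parameters(), lr=peak_lr)
    # linear warm-up to the peak learning rate over the first 4000 steps
    warmup = torch.optim.lr_scheduler.LinearLR(
        opt, start_factor=1e-3, end_factor=1.0, total_iters=warmup_steps)
    # multiply the learning rate by 0.98 once per epoch after warm-up
    epoch_decay = torch.optim.lr_scheduler.ExponentialLR(opt, gamma=decay)
    # step `warmup` after every training batch and `epoch_decay` once per epoch
    return opt, warmup, epoch_decay
```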

Evaluation Configurations: We evaluate our model on several benchmarks: the test sets of the 2020 DNS Challenge [25] and the 2025 URGENT Challenge [34] for the SR task, the Libri2Mix clean test set for the TSE task, and the Libri2Mix noisy and WSJ0-2mix test sets for the SS task. We adopt DNSMOS [35] (SIG, BAK, and OVRL scores reflect signal quality, background noise, and overall quality, respectively), NISQA [36] and UTMOS [37] to measure the quality of the generated speech. Following [14], the speaker similarity (SIM) for TSE is calculated using WavLM-base2.

3.2 Experimental Results↩︎

Table 3: SR results on 2025 URGENT Challenge test set.

| Team/Model | Team Rank | OVRL | NISQA | UTMOS |
|---|---|---|---|---|
| Bobbsun | 1 | 2.88 | 3.22 | 2.09 |
| Xiaobin | 2 | 2.92 | 3.24 | 2.16 |
| subatomicseer | 3 | 2.94 | 3.25 | 2.19 |
| wataru9871 | 13 | 3.10 | 3.74 | 2.53 |
| LLaSE-G1 [18] | - | 2.80 | 2.93 | 2.09 |
| UniSE | - | 3.17 | 3.72 | 2.85 |

The challenge ranking takes into account both non-intrusive and intrusive metrics, and the latter are unfavorable to generative models.

Table 4: TSE results on Libri2Mix clean test set.

| Model | Type | SIG | BAK | OVRL | NISQA | SIM |
|---|---|---|---|---|---|---|
| Mixture | - | 3.38 | 3.10 | 2.65 | 2.45 | 0.85 |
| SpEx+ [38] | D | 3.38 | 3.77 | 3.00 | 3.03 | 0.96 |
| WeSep [39] | D | 3.56 | 3.93 | 3.23 | 4.04 | 0.99 |
| TSELM-L [40] | G | 3.55 | 4.08 | 3.23 | 4.03 | 0.91 |
| AnyEnhance [21] | G | 3.64 | 4.07 | 3.35 | 4.28 | 0.91 |
| LLaSE-G1 [18] | G | 3.53 | 4.01 | 3.22 | 3.89 | 0.92 |
| LauraTSE [14] | G | 3.61 | 4.08 | 3.34 | 4.33 | 0.97 |
| UniSE | G | 3.62 | 4.06 | 3.33 | 4.00 | 0.95 |
| UniSE-TSE | G | 3.61 | 4.06 | 3.33 | 3.99 | 0.95 |
| BiCodec | - | 3.59 | 4.05 | 3.30 | 4.02 | 0.97 |
Table 5: SS results on Libri2Mix and WSJ0-2mix test sets.

| Model | Type | Libri2Mix (SIG / BAK / OVRL) | WSJ0-2mix (SIG / BAK / OVRL) |
|---|---|---|---|
| Mixture | - | 2.33 / 1.66 / 1.64 | 3.42 / 3.20 / 2.76 |
| Sepformer [41] | D | 3.33 / 3.88 / 3.02 | 3.43 / 3.96 / 3.14 |
| Mossformer2 [42] | D | 3.44 / 3.94 / 3.11 | 3.50 / 4.05 / 3.23 |
| LLaSE-G1 [18] | G | 3.48 / 3.83 / 3.11 | 3.52 / 3.92 / 3.19 |
| UniSE | G | 3.60 / 4.08 / 3.32 | 3.62 / 4.08 / 3.36 |
Table 6: Ablation study on 2020 DNS Challenge test sets.

| LM | Codec | With Reverb (SIG / BAK / OVRL) | No Reverb (SIG / BAK / OVRL) |
|---|---|---|---|
| LLaMA | BiCodec | 3.67 / 4.10 / 3.40 | 3.67 / 4.14 / 3.43 |
| Qwen2 | BiCodec | 3.66 / 4.08 / 3.39 | 3.67 / 4.14 / 3.44 |
| GLM | BiCodec | 3.65 / 4.10 / 3.39 | 3.66 / 4.14 / 3.43 |
| LLaMA | X-codec2 | 3.57 / 4.03 / 3.27 | 3.60 / 4.09 / 3.34 |

Table 2 presents performance comparisons between UniSE and several advanced baselines on the 2020 DNS Challenge test sets. Our model clearly achieves state-of-the-art (SOTA) SR performance, and training exclusively in the SR mode (denoted UniSE-SR) yields performance comparable to UniSE. Notably, LLaSE-G1 employs a backbone with approximately 1 billion parameters, while our model has only 63M, showing superior parameter efficiency. Table 3 reports the SR results of our framework alongside the systems submitted to the URGENT Challenge, which involves multiple distortions. Our model achieves competitive performance even on unseen distortions (codec artifacts and wind noise), demonstrating strong generalization ability.

TSE results on the Libri2Mix clean test set are summarized in Table 4, showing that our model achieves performance comparable to SOTA baselines. Compared to LauraTSE, another AR-based method, our framework supports more tasks. The variant UniSE-TSE, trained exclusively in the TSE mode, achieves similar performance, mirroring the comparison between UniSE and UniSE-SR for SR. This indicates that incorporating more tasks does not deteriorate the performance of individual tasks in our framework. The results produced by directly passing the target speech through BiCodec (the bottom row) reveal the performance ceiling that the codec imposes on SE frameworks, demonstrating the necessity of further improving low-bitrate NACs.

Table 5 compares the SS performance of our model with baselines on the Libri2Mix and WSJ0-2mix test sets, where the former additionally contains noise. UniSE outperforms the other discriminative and generative models with OVRL scores of 3.32 on Libri2Mix and 3.36 on WSJ0-2mix, highlighting the effectiveness of our multi-mode inference strategy.

Finally, we carry out an ablation study in Table 6 to analyze the impact of different LM architectures and NACs on SE performance. Replacing the LM backbone with Qwen2 [43] or GLM [44] results in similar performance, showing the adaptability of our framework. Utilizing X-codec2 [45] clearly leads to performance degradation, which might be caused by its large codebook size challenging the modeling capacity of the LM backbone.

4 CONCLUSION↩︎

In this work, we proposed an SE framework called UniSE, which unifies the SR, TSE and SS tasks. UniSE adopts continuous features of the degraded speech and reference speech as conditions to generate discrete tokens of the target speech via AR modeling. Multiple operational modes are defined by the task token, enabling task unification through the switching and combination of these modes. Extensive results show that UniSE achieves competitive performance on each benchmark, verifying the effectiveness of the decoder-only AR LM framework in unifying SE tasks. Future work could consider more general audio tasks and further improving the capability of the codec.

References↩︎

[1]
H. Liu, Q. Kong, Q. Tian, et al., “Voicefixer: Toward general speech restoration with neural vocoder,” arXiv preprint arXiv:2109.13731, 2021.
[2]
J. Serrà, S. Pascual, J. Pons, et al., “Universal speech enhancement with score-based diffusion,” arXiv preprint arXiv:2206.03065, 2022.
[3]
J. Zhang, H. Yan, and X. Li, “A composite predictive-generative approach to monaural universal speech enhancement,” IEEE Trans. Audio, Speech, Lang. Process., vol. 33, pp. 2312–2325, 2025.
[4]
Y. Luo and N. Mesgarani, Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 8, pp. 1256–1266, 2019.
[5]
S. Abdulatif, R.-Z. Cao, and B. Yang, CMGAN: Conformer-based metric-gan for monaural speech enhancement,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 32, pp. 2477–2493, 2024.
[6]
OpenAI, “GPT-4 technical report,” arXiv preprint arXiv:2303.08774, 2024.
[7]
K. Tian, Y. Jiang, Z. Yuan, et al., “Visual autoregressive modeling: Scalable image generation via next-scale prediction,” in Proc. NeurIPS, 2024, vol. 37, pp. 84839–84865.
[8]
F. Kreuk, G. Synnaeve, A. Polyak, et al., AudioGen: Textually guided audio generation,” in Proc. ICLR, 2023.
[9]
A. Vyas, B. Shi, M. Le, et al., “Audiobox: Unified audio generation with natural language prompts,” arXiv preprint arXiv:2312.15821, 2023.
[10]
X. Wang, M. Jiang, Z. Ma, et al., Spark-TTS: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens,” arXiv preprint arXiv:2503.01710, 2025.
[11]
Z. Wang, X. Zhu, Z. Zhang, et al., SELM: Speech enhancement using discrete tokens and language models,” in Proc. ICASSP, 2024, pp. 11561–11565.
[12]
S. Chen, C. Wang, Z. Chen, et al., WavLM: Large-scale self-supervised pre-training for full stack speech processing,” IEEE J. Sel. Top. Signal Process., vol. 16, no. 6, pp. 1505–1518, 2022.
[13]
J. Yao, H. Liu, C. CHEN, et al., GenSE: Generative speech enhancement via language models using hierarchical modeling,” in Proc. ICLR, 2025.
[14]
B. Tang, B. Zeng, and M. Li, LauraTSE: Target speaker extraction using auto-regressive decoder-only language models,” arXiv preprint arXiv:2504.07402, 2025.
[15]
Z. Peng, Z. Guo, W. Huang, et al., “Conformer: Local features coupling global representations for recognition and detection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 45, no. 8, pp. 9454–9468, 2023.
[16]
X. Li, Q. Wang, and X. Liu, MaskSR: Masked language model for full-band speech restoration,” in Proc. Interspeech, 2024, pp. 2275–2279.
[17]
H. Chang, H. Zhang, L. Jiang, et al., MaskGIT: Masked generative image transformer,” in Proc. CVPR, 2022, pp. 11305–11315.
[18]
B. Kang, X. Zhu, Z. Zhang, et al., LLaSE-G1: Incentivizing generalization capability for LLaMA-based speech enhancement,” arXiv preprint arXiv:2503.00493, 2025.
[19]
H. Touvron, T. Lavril, G. Izacard, et al., LLaMA: Open and efficient foundation language models,” arXiv preprint arXiv:2302.13971, 2023.
[20]
S. Zhao, B. Ma, K. N. Watcharasupat, et al., FRCRN: Boosting feature representation using frequency recurrence for monaural speech enhancement.,” in Proc. ICASSP, 2022, pp. 9281–9285.
[21]
J. Zhang, J. Yang, Z. Fang, et al., AnyEnhance: A unified generative model with prompt-guidance and self-critic for voice enhancement,” arXiv preprint arXiv:2501.15417, 2025.
[22]
V. Panayotov, G. Chen, D. Povey, et al., “Librispeech: An ASR corpus based on public domain audio books,” in Proc. ICASSP, 2015, pp. 5206–5210.
[23]
V. Pratap, Q. Xu, A. Sriram, et al., MLS: A large-scale multilingual dataset for speech research.,” in Proc. Interspeech, 2020, pp. 2757–2761.
[24]
H. He, Z. Shang, C. Wang, et al., “Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation.,” in Proc. SLT, 2024, pp. 885–890.
[25]
C. K. A. Reddy, V. Gopal, R. Cutler, et al., “The interspeech 2020 deep noise suppression challenge: Datasets, subjective testing framework, and challenge results,” in Proc. Interspeech, 2020, pp. 2492–2496.
[26]
E. Fonseca, X. Favory, J. Pons, et al., FSD50K: An open dataset of human-labeled sound events.,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 30, pp. 829–852, 2022.
[27]
G. Wichern, J. Antognini, M. Flynn, et al., WHAM!: Extending speech separation to noisy environments,” in Proc. Interspeech, 2019, pp. 1368–1372.
[28]
N. Turpault, R. Serizel, J. Salamon, et al., “Sound event detection in domestic environments with weakly labeled data and soundscape synthesis,” in Proc. DCASE, M. I. Mandel, J. Salamon, and D. P. W. Ellis, Eds., 2019, pp. 253–257.
[29]
J. Thiemann, N. Ito, and E. Vincent, The diverse environments multi-channel acoustic noise database: A database of multichannel environmental noise recordings,” J. Acoust. Soc. Am., vol. 133, pp. 3591–3591, 2013.
[30]
D. Snyder, G. Chen, and D. Povey, MUSAN: A music, speech, and noise corpus,” arXiv preprint arXiv:1510.08484, 2015.
[31]
N. Furnon, R. Serizel, S. Essid, et al., DNN-based mask estimation for distributed speech enhancement in spatially unconstrained microphone arrays,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 29, pp. 2310–2323, 2021.
[32]
Z. Rafii, A. Liutkus, F.-R. Stöter, et al., MUSDB18-HQ - an uncompressed version of MUSDB18,” [Online], Available: https://doi.org/10.5281/zenodo.3338373.
[33]
A. Mesaros, T. Heittola, and T. Virtanen, “A multi-device dataset for urban acoustic scene classification.,” in Proc. DCASE, 2018, pp. 9–13.
[34]
K. Saijo, W. Zhang, S. Cornell, et al., “Interspeech 2025 URGENT speech enhancement challenge,” arXiv preprint arXiv:2505.23212, 2025.
[35]
C. K. A. Reddy, V. Gopal, and R. Cutler, DNSMOS P.835: A non-intrusive perceptual objective speech quality metric to evaluate noise suppressors,” in Proc. ICASSP, 2022, pp. 886–890.
[36]
G. Mittag, B. Naderi, A. Chehadi, et al., NISQA: A deep CNN-self-attention model for multidimensional speech quality prediction with crowdsourced datasets,” in Proc. Interspeech, 2021, pp. 2127–2131.
[37]
T. Saeki, D. Xin, W. Nakata, et al., UTMOS: Utokyo-sarulab system for VoiceMOS challenge 2022,” in Proc. Interspeech, 2022, pp. 4521–4525.
[38]
M. Ge, C. Xu, L. Wang, et al., SpEx+: A complete time domain speaker extraction network,” in Proc. Interspeech, 2020, pp. 1406–1410.
[39]
S. Wang, K. Zhang, S. Lin, et al., WeSep: A scalable and flexible toolkit towards generalizable target speaker extraction,” in Proc. Interspeech, 2024, pp. 4273–4277.
[40]
B. Tang, B. Zeng, and M. Li, TSELM: Target speaker extraction using discrete tokens and language models,” arXiv preprint arXiv:2409.07841, 2024.
[41]
C. Subakan, M. Ravanelli, S. Cornell, et al., “Attention is all you need in speech separation.,” in Proc. ICASSP, 2021, pp. 21–25.
[42]
S. Zhao, Y. Ma, C. Ni, et al., MossFormer2: Combining Transformer and RNN-free recurrent network for enhanced time-domain monaural speech separation.,” in Proc. ICASSP, 2024, pp. 10356–10360.
[43]
A. Yang, B. Yang, B. Hui, et al., “Qwen2 technical report,” arXiv preprint arXiv:2407.10671, 2024.
[44]
Z. Du, Y. Qian, X. Liu, et al., GLM: General language model pretraining with autoregressive blank infilling,” in Proc. ACL, 2022, pp. 320–335.
[45]
Z. Ye, X. Zhu, C.-M. Chan, et al., “Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis,” arXiv preprint arXiv:2502.04128, 2025.

  1. https://huggingface.co/microsoft/wavlm-base-plus↩︎

  2. https://huggingface.co/microsoft/wavlm-base-plus-sv↩︎