October 23, 2025
Speech codecs serve as bridges between continuous speech signals and large language models, yet they face an inherent conflict between acoustic fidelity and semantic preservation. To mitigate this conflict, prevailing methods augment acoustic codecs with complex semantic supervision. We explore the opposite direction: a semantic-first approach that starts from a semantically capable model and adapts it for high-fidelity acoustic reconstruction. Through empirical analysis, we discover that targeted architectural simplification can unlock the acoustic modeling potential of Whisper, a text-aligned Automatic Speech Recognition (ASR) model. Based on this finding, we propose SimWhisper-Codec, a novel codec that balances semantic and acoustic preservation by leveraging a frozen, simplified Whisper encoder without requiring external supervision. Experimental results demonstrate that SimWhisper-Codec achieves superior performance in both semantic preservation and acoustic quality compared to semantically supervised codecs such as Mimi Codec and SpeechTokenizer at similar bitrates, validating the effectiveness of our semantic-first approach. Code is available at https://github.com/ZhangXinWhut/SimWhisper-Codec.
Speech Codec, architectural simplification, Whisper, semantic-acoustic conflict
In recent years, Speech Large Language Models (Speech LLMs) have garnered significant attention from the research community, demonstrating remarkable performance across a range of tasks [1]–[5]. The success of Speech LLMs is critically underpinned by a core component: the speech codec. This component serves as a crucial bridge, converting continuous audio signals into discrete tokens suitable for LLM modeling, thereby connecting raw audio with the model’s semantic understanding.
However, current speech codecs face an inherent conflict between the preservation of semantic content and acoustic fidelity, where optimizing for one typically degrades the other [6], [7]. This trade-off is particularly pronounced at low bitrates, where achieving high performance in both dimensions remains difficult. To mitigate this conflict, prevailing methods augment acoustic-centric codecs with external semantic supervision through various strategies. For instance, SpeechTokenizer [8] guides the first residual vector quantization (RVQ) layer through semantic distillation from HuBERT [9], while Mimi Codec in Moshi [6] employs split RVQ [10] with SSL model [11] supervision on one quantization branch. PAST [12] incorporates auxiliary phonetic tasks such as phoneme classification and ASR, and XY-Tokenizer [13] employs multi-task learning with LLM-based ASR supervision and multi-stage training. While effective, these methods typically rely on complex semantic supervision.
In this work, we explore the opposite direction: instead of enhancing acoustic codecs with semantic supervision, we start from Whisper [14], a text-aligned ASR model, and adapt it for high-fidelity acoustic reconstruction. However, this adaptation encounters a task mismatch—ASR systems are designed to achieve invariance to acoustic variations for content extraction [15], whereas acoustic reconstruction requires preserving fine-grained acoustic details. To investigate this mismatch, we conduct an empirical analysis of how individual architectural components of Whisper affect its acoustic reconstruction capability. Through the analysis presented in Section 2, we discover that targeted architectural simplification—specifically, removing the convolutional front-end nonlinearity (GELU activations) and the absolute positional encodings [16]—substantially enhances the model's ability to preserve fine-grained acoustic information. Based on this finding, we propose SimWhisper-Codec (see Figure 1), a low-bitrate (1.1 kbps at 16 kHz) codec that combines a simplified Whisper encoder, Finite Scalar Quantization (FSQ) [17], and a symmetric decoder, enabling single-stage training without semantic supervision.
Our contributions are as follows:
We propose SimWhisper-Codec, a novel codec that simultaneously models semantic and acoustic information through targeted architectural simplifications of Whisper’s encoder combined with FSQ quantization and symmetric decoding, eliminating the need for external semantic supervision.
Experimental results demonstrate that SimWhisper-Codec outperforms semantically-supervised codecs in semantic preservation at comparable low bitrates, while achieving high speaker similarity (0.83 SIM) and intelligibility (0.91 STOI), validating our method’s effectiveness.
We conduct an empirical analysis to identify which architectural components in Whisper encoders adversely affect acoustic reconstruction capabilities.
Convolutional Front-End Nonlinearity. The Whisper encoder’s front-end consists of two convolutional layers with GELU activation functions. While these nonlinear activations enable complex feature transformations beneficial for ASR tasks [14], [18], we hypothesize that they suppress spectral details essential for acoustic reconstruction. By removing these activations, the convolutional layers become purely linear transformations that better preserve input signal structure and retain acoustic details necessary for reconstruction.
Absolute Positional Encodings. Absolute positional encodings assign fixed "identity markers" to each temporal position in the sequence [19], [20]. We hypothesize that absolute positional encodings are detrimental to acoustic reconstruction because: (1) acoustic features should remain position-invariant—a phoneme /a/ should have identical representation regardless of temporal location; (2) speech contains repetitive structures that absolute encodings differentiate, hindering pattern recognition for reconstruction. These theoretical considerations motivate our experimental validation.
To validate our hypotheses, we conduct controlled analysis experiments using LJSpeech [21]. We extract frame-level hidden states from the final layer of each Whisper encoder variant, then condition identical HiFiGAN vocoders [22] on these features to assess reconstruction quality. This setup allows us to isolate the impact of specific architectural components on acoustic modeling capability while keeping the vocoder constant. Crucially, all encoder variants remain frozen during HiFiGAN training to serve as feature extractors.
We systematically examine the effect of removing each component individually, followed by their combined removal. The results validate our hypotheses: removing convolutional front-end nonlinearity yields substantial improvements with PESQ-NB increasing from 1.24 to 3.60 (+2.36), STOI from 0.82 to 0.96 (+0.14), and SIM from 0.81 to 0.86 (+0.05). Removing absolute positional encodings also confirms our hypothesis with PESQ-NB increasing from 1.24 to 2.95 (+1.71), STOI from 0.82 to 0.94 (+0.12), and SIM from 0.81 to 0.84 (+0.03).
Simultaneously removing both components yields the best reconstruction performance: PESQ-NB reaches 3.67 (+2.43), STOI achieves 0.97 (+0.15), and SIM attains 0.87 (+0.06). This demonstrates complementary hindering effects—the nonlinearity suppresses spectral details while positional encodings interfere with flexible attention patterns. Table 1 summarizes these results, with objective metrics defined in Section 4.2.
To further understand how positional encodings affect attention mechanisms, we visualize self-attention patterns from a middle Transformer layer. Figure 2 provides compelling evidence for our positional encoding hypothesis: removing absolute positional encodings reduces diagonal dominance in self-attention patterns (0.031→0.005), enabling attention to spread across the sequence and reveal content-driven interactions around repeated segments.
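The diagonal-dominance score can be approximated, for instance, as the mean attention weight on the main diagonal of a middle encoder layer. The sketch below assumes the Hugging Face WhisperModel API and a random waveform as a stand-in input; the exact score definition used for Figure 2 may differ in detail.

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

# Sketch: diagonal dominance as the mean attention weight on the main diagonal
# of a middle encoder layer (one plausible definition, not necessarily the exact one).
name = "openai/whisper-small"
feature_extractor = WhisperFeatureExtractor.from_pretrained(name)
encoder = WhisperModel.from_pretrained(name).encoder.eval()

wav = torch.randn(16000 * 4)  # stand-in for a 4 s, 16 kHz utterance
feats = feature_extractor(wav.numpy(), sampling_rate=16000,
                          return_tensors="pt").input_features

with torch.no_grad():
    attn = encoder(feats, output_attentions=True).attentions[6]  # (1, heads, T, T)

diag_dominance = attn.diagonal(dim1=-2, dim2=-1).mean().item()
print(f"diagonal dominance: {diag_dominance:.3f}")
```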
Table 1: Reconstruction quality of frozen Whisper encoder variants on LJSpeech, with identical HiFiGAN vocoders.
| Variant | SIM \(\uparrow\) | STOI \(\uparrow\) | PESQ-NB \(\uparrow\) | PESQ-WB \(\uparrow\) |
|---|---|---|---|---|
| Whisper encoder (baseline) | 0.81 | 0.82 | 1.24 | 1.14 |
| – remove absolute PEs only | 0.84 | 0.94 | 2.95 | 2.49 |
| – remove both stem GELUs only | 0.86 | 0.96 | 3.60 | 3.28 |
| – remove both components | 0.87 | 0.97 | 3.67 | 3.33 |
Having established that removing both components simultaneously yields the best reconstruction performance, we use these findings as the foundation for our codec design. The substantial improvements in acoustic reconstruction quality (+2.43 PESQ-NB, +0.15 STOI) demonstrate that architectural simplification can effectively unlock Whisper's potential for high-fidelity acoustic modeling. Based on these insights, we next present SimWhisper-Codec, which leverages the simplified Whisper encoder as a frozen feature extractor within a complete speech codec framework.
Motivation. Rather than augmenting acoustic codecs with external semantic supervision, we explore the opposite direction: starting from Whisper’s inherent semantic capabilities and adapting it for high-quality acoustic reconstruction. The key insight is that Whisper’s extensive multilingual training and text alignment provide natural semantic grounding, eliminating the need for additional semantic models. However, certain architectural components designed for ASR invariance may hinder fine-grained acoustic preservation. Based on our empirical findings, we propose SimWhisper-Codec, which employs a frozen simplified Whisper encoder paired with FSQ quantization and a symmetric trainable decoder. By leveraging Whisper’s inherent semantic capabilities while enhancing its acoustic modeling through architectural simplification, our approach enables single-stage training without external semantic supervision.
As shown in Figure 1, the SimWhisper-Codec architecture is an end-to-end model comprising a simplified Whisper encoder, a downsampling module, a quantizer, an upsampling module, and a symmetric decoder. The downsampling module and quantizer collectively form an information bottleneck, compressing the encoder’s output by reducing both temporal resolution and feature dimensionality.
Encoder. The encoder adopts the Whisper architecture initialized with pre-trained weights, with two key modifications to enhance acoustic preservation. First, we remove the GELU non-linearities from the initial two convolutional layers while preserving both the layer structure and learned weights from pre-training to maintain compatibility with the pre-trained Whisper model. Second, we completely remove the absolute positional encodings from the Transformer blocks. This simplified encoder remains frozen during codec training to serve as a powerful feature extractor, leveraging the rich representations learned during Whisper’s original ASR pre-training.
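A minimal sketch of these modifications, written against the Hugging Face WhisperModel implementation of the encoder (the released code linked in the abstract is authoritative; dropout and attention-mask handling are omitted here):

```python
import torch
import torch.nn as nn
from transformers import WhisperModel

class SimplifiedWhisperEncoder(nn.Module):
    """Whisper encoder with the stem GELUs and absolute positional encodings
    removed. A sketch against the Hugging Face WhisperEncoder; dropout and
    attention-mask handling are omitted for brevity."""

    def __init__(self, name: str = "openai/whisper-small"):
        super().__init__()
        enc = WhisperModel.from_pretrained(name).encoder
        self.conv1, self.conv2 = enc.conv1, enc.conv2  # pretrained stem weights kept
        self.layers = enc.layers                       # pretrained Transformer blocks
        self.layer_norm = enc.layer_norm
        for p in self.parameters():                    # frozen feature extractor
            p.requires_grad_(False)

    @torch.no_grad()
    def forward(self, mel: torch.Tensor) -> torch.Tensor:  # mel: (B, 80, T)
        x = self.conv2(self.conv1(mel))   # purely linear stem: GELUs removed
        x = x.permute(0, 2, 1)            # (B, T', d); no positional encoding added
        for layer in self.layers:
            x = layer(x, attention_mask=None, layer_head_mask=None)[0]
        return self.layer_norm(x)
```

In the codec, this frozen module stands in for the standard Whisper encoder and feeds the downsampler described next.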
Downsampler. The downsampler reduces the temporal resolution by stacking consecutive frames and aggregating temporal information into the channel dimension. Subsequently, a series of residual blocks with dilated convolutions and Snake activation functions [23] progressively compress the feature dimensionality while capturing multi-scale temporal context.
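A sketch of this module under the configuration reported in Section 4.1 (4x frame stacking, 768 to 32 channels, dilations 1, 3, 5, 9); the block ordering and normalization details of the released implementation may differ:

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Snake activation: x + sin^2(a*x)/a with a learnable per-channel a."""
    def __init__(self, channels: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1))

    def forward(self, x):
        return x + torch.sin(self.alpha * x) ** 2 / (self.alpha + 1e-9)

class Downsampler(nn.Module):
    """Sketch: stack 4 consecutive frames into the channel dimension
    (50 Hz -> 12.5 Hz), then compress 4*768 -> 32 with dilated residual blocks."""
    def __init__(self, dim=768, stack=4, out_dim=32, dilations=(1, 3, 5, 9)):
        super().__init__()
        self.stack = stack
        self.proj_in = nn.Conv1d(dim * stack, out_dim, kernel_size=1)
        self.blocks = nn.ModuleList([
            nn.Sequential(
                Snake(out_dim),
                nn.Conv1d(out_dim, out_dim, kernel_size=3, dilation=d, padding=d),
            )
            for d in dilations
        ])

    def forward(self, x):                                    # x: (B, T, dim), T divisible by stack
        b, t, c = x.shape
        x = x.reshape(b, t // self.stack, self.stack * c)    # frame stacking
        x = self.proj_in(x.transpose(1, 2))                  # (B, out_dim, T // stack)
        for block in self.blocks:
            x = x + block(x)                                 # dilated residual blocks
        return x
```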
Quantizer. We employ a Finite Scalar Quantization (FSQ) module [17], which mitigates codebook collapse and obviates the need for complex training machinery such as exponential moving averages and commitment losses required by traditional VQ methods.
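A minimal FSQ sketch (bounded projection, rounding to a fixed grid, straight-through gradient), shown with the level configuration reported in Section 4.1; the actual module additionally handles code indexing and packing:

```python
import torch

def fsq(z: torch.Tensor, levels=(8, 7, 6, 6)) -> torch.Tensor:
    """Minimal Finite Scalar Quantization: bound each dimension with tanh,
    round to a grid of levels[i] values, and pass gradients straight through.
    Follows the FSQ paper; our module may differ in implementation detail."""
    l = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (l - 1) / 2
    offset = (l % 2 == 0).to(z.dtype) * 0.5          # centers grids with even level counts
    shift = torch.atanh(offset / half)
    bounded = torch.tanh(z + shift) * half - offset  # e.g. 8 levels -> values in (-4, 3)
    quantized = torch.round(bounded)
    return bounded + (quantized - bounded).detach()  # straight-through estimator

# Eight 4-dimensional FSQ codes per 12.5 Hz frame, as in SimWhisper-Codec.
codes = fsq(torch.randn(2, 10, 8, 4))
```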
Upsampler. The upsampler mirrors the downsampling architecture, employing nearest-neighbor upsampling combined with convolutional and residual blocks to restore the original temporal resolution. The module progressively expands the feature dimensionality back to the decoder’s input requirements.
Decoder. The decoder adopts a symmetric architecture to the encoder, with symmetry achieved by replacing the encoder’s convolutional layers with transposed convolutions while maintaining the same architectural depth and feature dimensions. This design enables effective reversal of the encoding process to reconstruct mel-spectrogram representations from the upsampled features. A Vocos model [24] subsequently converts the spectral features to the final audio waveform.
The codec is trained using a single-stage GAN-based approach. The generator minimizes the following composite loss function:
\[\mathcal{L}_{G} = \lambda_{\text{recon}}\mathcal{L}_{\text{recon}} + \lambda_{\text{adv}}\mathcal{L}_{\text{adv}} + \lambda_{\text{feat}}\mathcal{L}_{\text{feat}} \label{eq:total_loss}\tag{1}\]
where \(\lambda_{\text{recon}}\), \(\lambda_{\text{adv}}\), and \(\lambda_{\text{feat}}\) control the weights of the multi-scale reconstruction loss \(\mathcal{L}_{\text{recon}}\), adversarial loss \(\mathcal{L}_{\text{adv}}\), and feature matching loss \(\mathcal{L}_{\text{feat}}\), respectively.
Multi-scale Reconstruction Loss. We compute an L1 loss between the mel-spectrograms of the original and reconstructed audio across seven STFT scales, summing over FFT sizes \(2^k\) with \(k \in \{5,\dots, 11\}\):
\[\mathcal{L}_{\text{recon}} = \sum_{k} \| M_k(x) - M_k(\hat{x}) \|_1 \label{eq:recon_loss}\tag{2}\]
where \(M_k(\cdot)\) denotes the mel-spectrogram computed with FFT size \(2^k\), \(x\) is the original audio, and \(\hat{x}\) is the reconstructed audio.
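A sketch of this loss in PyTorch; the per-scale hop sizes and mel-bin counts below are assumptions (only the FFT sizes are fixed by Eq. (2)), and each scale is mean-normalized:

```python
import torch
import torchaudio

def multiscale_mel_loss(x: torch.Tensor, x_hat: torch.Tensor, sample_rate: int = 16000):
    """L1 between mel-spectrograms at the seven STFT scales of Eq. (2).
    Hop sizes and per-scale mel-bin counts are illustrative assumptions."""
    loss = x.new_zeros(())
    for k, n_mels in zip(range(5, 12), (5, 10, 20, 40, 80, 160, 320)):
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=2 ** k, hop_length=2 ** k // 4,
            n_mels=n_mels, power=1.0,
        ).to(x.device)
        loss = loss + (mel(x) - mel(x_hat)).abs().mean()
    return loss
```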
Adversarial Loss. We employ a Least Squares GAN (LSGAN) objective [25] to enhance perceptual quality. The discriminator loss is defined as:
\[\mathcal{L}_{D} = \frac{1}{N} \sum_{i=1}^{N} \left[ (D_i(x) - 1)^2 + D_i(G(z))^2 \right] \label{eq:disc_loss}\tag{3}\]
where \(D_i\) denotes the output of the \(i\)-th sub-discriminator, drawn from the multi-period (MPD) [22] and multi-scale (MSD) [26] discriminators, \(N\) is the total number of sub-discriminators, and \(G(z)\) is the generated audio. The generator adversarial loss is defined as:
\[\mathcal{L}_{\text{adv}} = \frac{1}{N} \sum_{i=1}^{N} (D_i(G(z)) - 1)^2 \label{eq:gen_adv}\tag{4}\]
Feature Matching Loss. We compute an L1 loss between the feature maps of the discriminators for the real and generated audio, which prevents the generator from overtraining on the current discriminator and improves generation quality. The feature matching loss is formulated as:
\[\mathcal{L}_{\text{feat}} = \frac{1}{N \cdot K} \sum_{i=1}^{N} \sum_{j=1}^{K} \frac{\| D_i^j(x) - D_i^j(G(z)) \|_1}{\| D_i^j(x) \|_1 + \epsilon} \label{eq:feat_loss}\tag{5}\]
where \(D_i^j(\cdot)\) denotes the \(j\)-th layer feature map from the \(i\)-th discriminator, \(K\) is the number of feature layers, and \(\epsilon\) is a small constant for numerical stability.
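A compact sketch of the adversarial and feature-matching terms in Eqs. (3)-(5), assuming each sub-discriminator returns a score map plus a list of intermediate feature maps (the interface of the actual MPD/MSD implementations may differ):

```python
import torch

def lsgan_discriminator_loss(real_scores, fake_scores):
    """Eq. (3): LSGAN loss averaged over the MPD/MSD sub-discriminators.
    real_scores / fake_scores are lists of per-discriminator score maps."""
    losses = [((r - 1) ** 2).mean() + (f ** 2).mean()
              for r, f in zip(real_scores, fake_scores)]
    return sum(losses) / len(losses)

def lsgan_generator_loss(fake_scores):
    """Eq. (4): the generator pushes each discriminator output towards 1."""
    return sum(((f - 1) ** 2).mean() for f in fake_scores) / len(fake_scores)

def feature_matching_loss(real_feats, fake_feats, eps: float = 1e-8):
    """Eq. (5): relative L1 between real and generated feature maps, averaged
    over discriminators and layers; real_feats[i][j] is layer j of discriminator i."""
    total, count = 0.0, 0
    for r_layers, f_layers in zip(real_feats, fake_feats):
        for r, f in zip(r_layers, f_layers):
            total = total + (r - f).abs().sum() / (r.abs().sum() + eps)
            count += 1
    return total / count
```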
Dataset and Training Details. We use the full 960-hour training set of LibriSpeech [27] for training and the test-clean set (2,620 utterances) for testing. All speech is sampled at 16 kHz, and training uses randomly cropped 4-second segments. Training is conducted on a single NVIDIA H100 GPU with a batch size of 32 and gradient accumulation of 3, yielding an effective batch size of 96. Training runs for 500,000 steps in a single stage. Both the generator and the discriminator use AdamW with \(\beta_1 = 0.8\), \(\beta_2 = 0.99\), and weight decay \(0.01\), together with a cosine annealing learning rate schedule decaying from \(1 \times 10^{-4}\) to \(0\) after 5k warmup steps.
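The optimizer and schedule settings above can be expressed as follows (a sketch; the `nn.Linear` is a placeholder for the actual generator, and the discriminator is configured identically):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Placeholder module standing in for the generator.
generator = torch.nn.Linear(1, 1)
optimizer = torch.optim.AdamW(generator.parameters(), lr=1e-4,
                              betas=(0.8, 0.99), weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=5_000, num_training_steps=500_000)
```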
Framework Configuration. Our framework features a symmetric encoder-decoder architecture. Both components are 12-layer Transformers based on the Whisper-small architecture, with 768-dimensional hidden states and 12 attention heads. The model processes 16 kHz audio, extracting 50 Hz feature sequences using a 25 ms window and a 10 ms hop size. The downsampler reduces the temporal resolution to 12.5 Hz via 4\(\times\) frame stacking and compresses the feature dimension from 768 to 32 using residual blocks with Snake activations [23] and multi-scale dilated convolutions (dilations: 1, 3, 5, 9). The upsampler employs a mirrored architecture, restoring the 50 Hz resolution and 768 dimensions through stages of 2\(\times\) nearest-neighbor upsampling and corresponding residual blocks. Following [28], we use a Finite Scalar Quantization (FSQ) module configured with eight codebooks, four dimensions per code, and levels of [8, 7, 6, 6] to achieve a 1.1 kbps bitrate. The decoder then reconstructs mel-spectrograms from the upsampled features, which are synthesized into the final waveform by a 24-layer Vocos model.
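As a check on the stated bitrate: each 4-dimensional code with levels [8, 7, 6, 6] can take \(8 \times 7 \times 6 \times 6 = 2016\) values, i.e. about \(\log_2 2016 \approx 11.0\) bits, and eight such codes are emitted per 12.5 Hz frame:

\[12.5~\text{Hz} \times 8 \times \log_2(2016) \approx 1098~\text{bps} \approx 1.1~\text{kbps}\]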
Baselines. We compare our codec against representative baselines at similar bitrates: EnCodec [7] (1.5 kbps), DAC-RVQ3 [29] (1.5 kbps), SpeechTokenizer [8] (1.0 kbps), Mimi-RVQ8 [6] (1.1 kbps), and XCodec2.0 [30] (0.8 kbps), all using official checkpoints.
Two aspects are evaluated: (i) acoustic reconstruction quality of the synthesized audio, and (ii) semantic alignment between the codec and text. All metrics are reported on LibriSpeech test-clean [27].
PESQ-WB/NB. Wideband and narrowband PESQ are reported following ITU-T P.862.2 and P.862, respectively [31]. Signals are resampled to 16 kHz (WB) and 8 kHz (NB) before scoring.
STOI. Short-Time Objective Intelligibility is used to measure intelligibility correlation between reference and reconstructed signals [32]. Audio is resampled to 16 kHz prior to computation.
SIM (speaker similarity). Cosine similarity is computed between speaker embeddings extracted from reference and reconstructed utterances using a WavLM-based speaker verification model.
WER via external ASR. Reconstructed audio is transcribed with a HuBERT-Large ASR model, and the word error rate (WER) is computed against the LibriSpeech reference transcripts.
Table 2 presents a comprehensive comparison of SimWhisper-Codec against representative baselines. Our codec captures both semantic and acoustic information simultaneously, validating the architectural simplification approach.
Semantic Preservation. On semantic evaluation, SimWhisper-Codec achieves the lowest Word Error Rate (WER) of 3.10, outperforming all other codecs. Notably, this performance is achieved without any semantic supervision, distinguishing it from baselines like XCodec2.0 (3.61 WER) and Mimi-RVQ8 (4.36 WER) that rely on such supervision. The results indicate that our architectural simplification does not compromise Whisper’s semantic alignment capabilities.
Acoustic Quality. In terms of acoustic reconstruction, SimWhisper-Codec delivers competitive results. It achieves a PESQ-NB of 2.98 and a STOI of 0.91, placing it among the top-performing models and significantly outperforming acoustic-only codecs like EnCodec and DAC-RVQ3. Furthermore, with a speaker similarity (SIM) score of 0.83, it demonstrates good preservation of speaker identity. These results show that the simplified Whisper encoder can effectively extract acoustic features even at low bitrates of 1.1 kbps.
Table 2: Comparison with baseline codecs on LibriSpeech test-clean.
| Model | Bitrate (bps) | Frame Rate (Hz) | Semantic Supervision | WER \(\downarrow\) | SIM \(\uparrow\) | STOI \(\uparrow\) | PESQ-NB \(\uparrow\) | PESQ-WB \(\uparrow\) |
|---|---|---|---|---|---|---|---|---|
| Ground Truth | – | – | – | 2.53 | – | – | – | – |
| EnCodec [7] | 1.5k | 75 | No | 45.49 | 0.60 | 0.85 | 1.94 | 1.56 |
| DAC-RVQ3 [29] | 1.5k | 75 | No | 41.67 | 0.45 | 0.76 | 1.82 | 1.43 |
| SpeechTokenizer [8] | 1.0k | 50 | Yes | 5.92 | 0.37 | 0.70 | 1.42 | 1.15 |
| Mimi-RVQ8 [6] | 1.1k | 12.5 | Yes | 4.36 | 0.73 | 0.90 | 2.62 | 2.13 |
| XCodec2.0 [30] | 0.8k | 50 | Yes | 3.61 | 0.82 | 0.91 | 2.95 | 2.32 |
| SimWhisper-Codec (ours) | 1.1k | 12.5 | No | 3.10 | 0.83 | 0.91 | 2.98 | 2.36 |
To validate the impact of removing different architectural components on codec performance, we conduct ablation studies on SimWhisper-Codec. For training simplicity, we maintain symmetric encoder-decoder architectures throughout all variants.
Table 3: Ablation study on the architectural components of SimWhisper-Codec (LibriSpeech test-clean).
| Variant | WER \(\downarrow\) | SIM \(\uparrow\) | STOI \(\uparrow\) | PESQ-NB \(\uparrow\) | PESQ-WB \(\uparrow\) |
|---|---|---|---|---|---|
| Whisper encoder (baseline) | 5.95 | 0.78 | 0.85 | 1.95 | 1.68 |
| – remove absolute PEs only | 5.42 | 0.80 | 0.87 | 2.34 | 1.96 |
| – remove both stem GELUs only | 3.74 | 0.81 | 0.89 | 2.51 | 2.10 |
| Ours: remove both components | 3.10 | 0.83 | 0.91 | 2.98 | 2.36 |
Table 3 demonstrates that both architectural modifications contribute to performance improvements. Removing GELU activations from the convolutional front-end has a more substantial impact on semantic preservation (WER: 5.95→3.74), while removing positional encodings provides moderate gains across all metrics. The combination of both modifications yields the best results, achieving a WER of 3.10, SIM of 0.83, and PESQ-WB of 2.36, validating our architectural design choices.
The ablation study demonstrates that our architectural modifications yield superior codec performance. To further validate that the simplified encoder retains fine-grained acoustic cues essential for high-quality synthesis, we conduct a layer-wise probing experiment analyzing its preservation of pitch information.
We train ridge regression models to predict frame-level fundamental frequency (\(F_0\)) from hidden states extracted from each encoder layer using THCHS-30 [33], a tonal Mandarin dataset with rich prosodic variation. Ground truth \(F_0\) values are extracted using Parselmouth [34], evaluating only voiced frames with Pearson correlation coefficient (PCC).
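A sketch of this probe for a single utterance and a single layer (frame alignment and the train/test split are simplified, and the fit below is in-sample for illustration only):

```python
import numpy as np
import parselmouth
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

def probe_f0(hidden_states: np.ndarray, wav_path: str, frame_rate: float = 50.0):
    """Ridge regression from frame-level hidden states to Parselmouth F0,
    scored with PCC on voiced frames only (simplified layer-wise probe)."""
    pitch = parselmouth.Sound(wav_path).to_pitch(time_step=1.0 / frame_rate)
    f0 = pitch.selected_array["frequency"]        # 0.0 marks unvoiced frames
    n = min(len(f0), hidden_states.shape[0])      # crude length alignment
    feats, f0 = hidden_states[:n], f0[:n]
    voiced = f0 > 0
    preds = Ridge(alpha=1.0).fit(feats[voiced], f0[voiced]).predict(feats[voiced])
    return pearsonr(preds, f0[voiced])[0]         # PCC (in-sample, for illustration)
```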
As shown in Figure 3, the results reveal that simplified Whisper maintains stable \(F_0\) tracking (PCC \(\approx 0.76\)) across all layers, while standard Whisper degrades from layer 6 onward (0.78→0.58). This indicates that our architectural modifications better preserve prosodic information essential for high-quality speech synthesis, providing additional evidence that the simplified encoder successfully retains acoustic details while maintaining semantic capabilities.
We presented SimWhisper-Codec, a low-bitrate speech codec that mitigates the semantic-acoustic conflict through architectural simplification rather than complex supervision. We removed convolutional front-end nonlinearity and absolute positional encodings from the frozen Whisper encoder. This architectural simplification maintains competitive acoustic quality while achieving superior semantic preservation. Results demonstrate that such simplification of Whisper can be more effective than semantic supervision approaches for speech codec design.