October 23, 2025
Speech codecs serve as bridges between continuous speech signals and large language models, yet they face an inherent conflict between acoustic fidelity and semantic preservation. To mitigate this conflict, prevailing methods augment acoustic codecs with complex semantic supervision. We explore the opposite direction: a semantic-first approach that starts from a semantically capable model and adapts it for high-fidelity acoustic reconstruction. Through empirical analysis, we discover that targeted architectural simplification can unlock the acoustic modeling potential of Whisper, a text-aligned Automatic Speech Recognition (ASR) model. Based on this finding, we propose SimWhisper-Codec, a novel codec that balances semantic and acoustic preservation by leveraging a frozen, simplified Whisper encoder without requiring external supervision. Experimental results demonstrate that SimWhisper-Codec achieves superior performance in both semantic preservation and acoustic quality compared to semantically supervised codecs such as Mimi Codec and SpeechTokenizer at similar bitrates, validating the effectiveness of our semantic-first approach. Code is available at https://github.com/ZhangXinWhut/SimWhisper-Codec.
Speech Codec, architectural simplification, Whisper, semantic-acoustic conflict
In recent years, Speech Large Language Models (Speech LLMs) have garnered significant attention from the research community, demonstrating remarkable performance across a range of tasks [1]–[5]. The success of Speech LLMs is critically underpinned by a core component: the speech codec. This component serves as a crucial bridge, converting continuous audio signals into discrete tokens suitable for LLM modeling, thereby connecting raw audio with the model’s semantic understanding.
However, current speech codecs face an inherent conflict between the preservation of semantic content and acoustic fidelity, where optimizing for one typically degrades the other [6], [7]. This trade-off is particularly pronounced at low bitrates, where achieving high performance in both dimensions remains difficult. To mitigate this conflict, prevailing methods augment acoustic-centric codecs with external semantic supervision through various strategies. For instance, SpeechTokenizer [8] guides the first residual vector quantization (RVQ) layer through semantic distillation from HuBERT [9], while Mimi Codec in Moshi [6] employs split RVQ [10] with SSL model [11] supervision on one quantization branch. PAST [12] incorporates auxiliary phonetic tasks such as phoneme classification and ASR, and XY-Tokenizer [13] employs multi-task learning with LLM-based ASR supervision and multi-stage training. While effective, these methods typically rely on complex semantic supervision.
In this work, we explore the opposite direction: instead of enhancing acoustic codecs with semantic supervision, we start from Whisper [14], a text-aligned ASR model, and adapt it for high-fidelity acoustic reconstruction. However, this adaptation encounters a task mismatch—ASR systems are designed to achieve invariance to acoustic variations for content extraction [15], whereas acoustic reconstruction requires preserving fine-grained acoustic details. To investigate this mismatch, we conduct an empirical analysis of how individual architectural components of Whisper affect its acoustic reconstruction capability. Through the analysis presented in Section 2, we discover that targeted architectural simplification—specifically, removing the convolutional front-end nonlinearity (GELU activations) and the absolute positional encodings [16]—substantially enhances the model's ability to preserve fine-grained acoustic information. Based on this finding, we propose SimWhisper-Codec (see Figure 1), a low-bitrate (1.1 kbps at 16 kHz) codec that combines a simplified Whisper encoder, Finite Scalar Quantization (FSQ) [17], and a symmetric decoder, enabling single-stage training without semantic supervision.
Our contributions are as follows:
We propose SimWhisper-Codec, a novel codec that simultaneously models semantic and acoustic information through targeted architectural simplifications of Whisper’s encoder combined with FSQ quantization and symmetric decoding, eliminating the need for external semantic supervision.
Experimental results demonstrate that SimWhisper-Codec outperforms semantically-supervised codecs in semantic preservation at comparable low bitrates, while achieving high speaker similarity (0.83 SIM) and intelligibility (0.91 STOI), validating our method’s effectiveness.
We conduct an empirical analysis to identify which architectural components in Whisper encoders adversely affect acoustic reconstruction capabilities.
Convolutional Front-End Nonlinearity. The Whisper encoder’s front-end consists of two convolutional layers with GELU activation functions. While these nonlinear activations enable complex feature transformations beneficial for ASR tasks [14], [18], we hypothesize that they suppress spectral details essential for acoustic reconstruction. By removing these activations, the convolutional layers become purely linear transformations that better preserve input signal structure and retain acoustic details necessary for reconstruction.
Absolute Positional Encodings. Absolute positional encodings assign fixed "identity markers" to each temporal position in the sequence [19], [20]. We hypothesize that absolute positional encodings are detrimental to acoustic reconstruction because: (1) acoustic features should remain position-invariant—a phoneme /a/ should have identical representation regardless of temporal location; (2) speech contains repetitive structures that absolute encodings differentiate, hindering pattern recognition for reconstruction. These theoretical considerations motivate our experimental validation.
To validate our hypotheses, we conduct controlled analysis experiments using LJSpeech [21]. We extract frame-level hidden states from the final layer of each Whisper encoder variant, then condition identical HiFiGAN vocoders [22] on these features to assess reconstruction quality. This setup allows us to isolate the impact of specific architectural components on acoustic modeling capability while keeping the vocoder constant. Crucially, all encoder variants remain frozen during HiFiGAN training to serve as feature extractors.
We systematically examine the effect of removing each component individually, followed by their combined removal. The results validate our hypotheses: removing convolutional front-end nonlinearity yields substantial improvements with PESQ-NB increasing from 1.24 to 3.60 (+2.36), STOI from 0.82 to 0.96 (+0.14), and SIM from 0.81 to 0.86 (+0.05). Removing absolute positional encodings also confirms our hypothesis with PESQ-NB increasing from 1.24 to 2.95 (+1.71), STOI from 0.82 to 0.94 (+0.12), and SIM from 0.81 to 0.84 (+0.03).
Simultaneously removing both components yields the best reconstruction performance: PESQ-NB reaches 3.67 (+2.43), STOI achieves 0.97 (+0.15), and SIM attains 0.87 (+0.06). This demonstrates complementary hindering effects—the nonlinearity suppresses spectral details while positional encodings interfere with flexible attention patterns. Table 1 summarizes these results, with objective metrics defined in Section 4.2.
To further understand how positional encodings affect attention mechanisms, we visualize self-attention patterns from a middle Transformer layer. Figure 2 provides compelling evidence for our positional encoding hypothesis: removing absolute positional encodings reduces diagonal dominance in self-attention patterns (0.031→0.005), enabling attention to spread across the sequence and reveal content-driven interactions around repeated segments.
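The diagonal-dominance score can be approximated, for instance, as the mean attention weight on the main diagonal of a middle encoder layer. The sketch below assumes the Hugging Face WhisperModel API and a random waveform as a stand-in input; the exact score definition used for Figure 2 may differ in detail.

```python
import torch
from transformers import WhisperFeatureExtractor, WhisperModel

# Sketch: diagonal dominance as the mean attention weight on the main diagonal
# of a middle encoder layer (one plausible definition, not necessarily the exact one).
name = "openai/whisper-small"
feature_extractor = WhisperFeatureExtractor.from_pretrained(name)
encoder = WhisperModel.from_pretrained(name).encoder.eval()

wav = torch.randn(16000 * 4)  # stand-in for a 4 s, 16 kHz utterance
feats = feature_extractor(wav.numpy(), sampling_rate=16000,
                          return_tensors="pt").input_features

with torch.no_grad():
    attn = encoder(feats, output_attentions=True).attentions[6]  # (1, heads, T, T)

diag_dominance = attn.diagonal(dim1=-2, dim2=-1).mean().item()
print(f"diagonal dominance: {diag_dominance:.3f}")
```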
Table 1: Reconstruction quality of frozen Whisper encoder variants on LJSpeech, with identical HiFiGAN vocoders.
| Variant | SIM \(\uparrow\) | STOI \(\uparrow\) | PESQ-NB \(\uparrow\) | PESQ-WB \(\uparrow\) |
|---|---|---|---|---|
| Whisper encoder (baseline) | 0.81 | 0.82 | 1.24 | 1.14 |
| – remove absolute PEs only | 0.84 | 0.94 | 2.95 | 2.49 |
| – remove both stem GELUs only | 0.86 | 0.96 | 3.60 | 3.28 |
| – remove both components | 0.87 | 0.97 | 3.67 | 3.33 |
Having established that removing both components simultaneously yields the best reconstruction performance, we use these findings as the foundation for our codec design. The substantial improvements in acoustic reconstruction quality (+2.43 PESQ-NB, +0.15 STOI) demonstrate that architectural simplification can effectively unlock Whisper's potential for high-fidelity acoustic modeling. Based on these insights, we next present SimWhisper-Codec, which leverages the simplified Whisper encoder as a frozen feature extractor within a complete speech codec framework.
Motivation. Rather than augmenting acoustic codecs with external semantic supervision, we explore the opposite direction: starting from Whisper’s inherent semantic capabilities and adapting it for high-quality acoustic reconstruction. The key insight is that Whisper’s extensive multilingual training and text alignment provide natural semantic grounding, eliminating the need for additional semantic models. However, certain architectural components designed for ASR invariance may hinder fine-grained acoustic preservation. Based on our empirical findings, we propose SimWhisper-Codec, which employs a frozen simplified Whisper encoder paired with FSQ quantization and a symmetric trainable decoder. By leveraging Whisper’s inherent semantic capabilities while enhancing its acoustic modeling through architectural simplification, our approach enables single-stage training without external semantic supervision.
As shown in Figure 1, the SimWhisper-Codec architecture is an end-to-end model comprising a simplified Whisper encoder, a downsampling module, a quantizer, an upsampling module, and a symmetric decoder. The downsampling module and quantizer collectively form an information bottleneck, compressing the encoder’s output by reducing both temporal resolution and feature dimensionality.
Encoder. The encoder adopts the Whisper architecture initialized with pre-trained weights, with two key modifications to enhance acoustic preservation. First, we remove the GELU non-linearities from the initial two convolutional layers while preserving both the layer structure and learned weights from pre-training to maintain compatibility with the pre-trained Whisper model. Second, we completely remove the absolute positional encodings from the Transformer blocks. This simplified encoder remains frozen during codec training to serve as a powerful feature extractor, leveraging the rich representations learned during Whisper’s original ASR pre-training.
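A minimal sketch of these modifications, written against the Hugging Face WhisperModel implementation of the encoder (the released code linked in the abstract is authoritative; dropout and attention-mask handling are omitted here):

```python
import torch
import torch.nn as nn
from transformers import WhisperModel

class SimplifiedWhisperEncoder(nn.Module):
    """Whisper encoder with the stem GELUs and absolute positional encodings
    removed. A sketch against the Hugging Face WhisperEncoder; dropout and
    attention-mask handling are omitted for brevity."""

    def __init__(self, name: str = "openai/whisper-small"):
        super().__init__()
        enc = WhisperModel.from_pretrained(name).encoder
        self.conv1, self.conv2 = enc.conv1, enc.conv2  # pretrained stem weights kept
        self.layers = enc.layers                       # pretrained Transformer blocks
        self.layer_norm = enc.layer_norm
        for p in self.parameters():                    # frozen feature extractor
            p.requires_grad_(False)

    @torch.no_grad()
    def forward(self, mel: torch.Tensor) -> torch.Tensor:  # mel: (B, 80, T)
        x = self.conv2(self.conv1(mel))   # purely linear stem: GELUs removed
        x = x.permute(0, 2, 1)            # (B, T', d); no positional encoding added
        for layer in self.layers:
            x = layer(x, attention_mask=None, layer_head_mask=None)[0]
        return self.layer_norm(x)
```

In the codec, this frozen module stands in for the standard Whisper encoder and feeds the downsampler described next.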
Downsampler. The downsampler reduces the temporal resolution by stacking consecutive frames and aggregating temporal information into the channel dimension. Subsequently, a series of residual blocks with dilated convolutions and Snake activation functions [23] progressively compress the feature dimensionality while capturing multi-scale temporal context.
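A sketch of this module under the configuration reported in Section 4.1 (4x frame stacking, 768 to 32 channels, dilations 1, 3, 5, 9); the block ordering and normalization details of the released implementation may differ:

```python
import torch
import torch.nn as nn

class Snake(nn.Module):
    """Snake activation: x + sin^2(a*x)/a with a learnable per-channel a."""
    def __init__(self, channels: int):
        super().__init__()
        self.alpha = nn.Parameter(torch.ones(1, channels, 1))

    def forward(self, x):
        return x + torch.sin(self.alpha * x) ** 2 / (self.alpha + 1e-9)

class Downsampler(nn.Module):
    """Sketch: stack 4 consecutive frames into the channel dimension
    (50 Hz -> 12.5 Hz), then compress 4*768 -> 32 with dilated residual blocks."""
    def __init__(self, dim=768, stack=4, out_dim=32, dilations=(1, 3, 5, 9)):
        super().__init__()
        self.stack = stack
        self.proj_in = nn.Conv1d(dim * stack, out_dim, kernel_size=1)
        self.blocks = nn.ModuleList([
            nn.Sequential(
                Snake(out_dim),
                nn.Conv1d(out_dim, out_dim, kernel_size=3, dilation=d, padding=d),
            )
            for d in dilations
        ])

    def forward(self, x):                                    # x: (B, T, dim), T divisible by stack
        b, t, c = x.shape
        x = x.reshape(b, t // self.stack, self.stack * c)    # frame stacking
        x = self.proj_in(x.transpose(1, 2))                  # (B, out_dim, T // stack)
        for block in self.blocks:
            x = x + block(x)                                 # dilated residual blocks
        return x
```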
Quantizer. We employ a Finite Scalar Quantization (FSQ) module [17], which mitigates codebook collapse and obviates the need for complex training machinery such as exponential moving averages and commitment losses required by traditional VQ methods.
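A minimal FSQ sketch (bounded projection, rounding to a fixed grid, straight-through gradient), shown with the level configuration reported in Section 4.1; the actual module additionally handles code indexing and packing:

```python
import torch

def fsq(z: torch.Tensor, levels=(8, 7, 6, 6)) -> torch.Tensor:
    """Minimal Finite Scalar Quantization: bound each dimension with tanh,
    round to a grid of levels[i] values, and pass gradients straight through.
    Follows the FSQ paper; our module may differ in implementation detail."""
    l = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (l - 1) / 2
    offset = (l % 2 == 0).to(z.dtype) * 0.5          # centers grids with even level counts
    shift = torch.atanh(offset / half)
    bounded = torch.tanh(z + shift) * half - offset  # e.g. 8 levels -> values in (-4, 3)
    quantized = torch.round(bounded)
    return bounded + (quantized - bounded).detach()  # straight-through estimator

# Eight 4-dimensional FSQ codes per 12.5 Hz frame, as in SimWhisper-Codec.
codes = fsq(torch.randn(2, 10, 8, 4))
```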
Upsampler. The upsampler mirrors the downsampling architecture, employing nearest-neighbor upsampling combined with convolutional and residual blocks to restore the original temporal resolution. The module progressively expands the feature dimensionality back to the decoder’s input requirements.
Decoder. The decoder adopts a symmetric architecture to the encoder, with symmetry achieved by replacing the encoder’s convolutional layers with transposed convolutions while maintaining the same architectural depth and feature dimensions. This design enables effective reversal of the encoding process to reconstruct mel-spectrogram representations from the upsampled features. A Vocos model [24] subsequently converts the spectral features to the final audio waveform.
The codec is trained using a single-stage GAN-based approach. The generator minimizes the following composite loss function:
\[\mathcal{L}_{G} = \lambda_{\text{recon}}\mathcal{L}_{\text{recon}} + \lambda_{\text{adv}}\mathcal{L}_{\text{adv}} + \lambda_{\text{feat}}\mathcal{L}_{\text{feat}} \label{eq:total_loss}\tag{1}\]
where \(\lambda_{\text{recon}}\), \(\lambda_{\text{adv}}\), and \(\lambda_{\text{feat}}\) control the weights of the multi-scale reconstruction loss \(\mathcal{L}_{\text{recon}}\), adversarial loss \(\mathcal{L}_{\text{adv}}\), and feature matching loss \(\mathcal{L}_{\text{feat}}\), respectively.
Multi-scale Reconstruction Loss. We compute an L1 loss between the mel-spectrograms of the original and reconstructed audio across seven STFT scales, summing over FFT sizes \(2^k\) with \(k \in \{5,\dots, 11\}\):
\[\mathcal{L}_{\text{recon}} = \sum_{k} \| M_k(x) - M_k(\hat{x}) \|_1 \label{eq:recon_loss}\tag{2}\]
where \(M_k(\cdot)\) denotes the mel-spectrogram computed with FFT size \(2^k\), \(x\) is the original audio, and \(\hat{x}\) is the reconstructed audio.
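A sketch of this loss in PyTorch; the per-scale hop sizes and mel-bin counts below are assumptions (only the FFT sizes are fixed by Eq. (2)), and each scale is mean-normalized:

```python
import torch
import torchaudio

def multiscale_mel_loss(x: torch.Tensor, x_hat: torch.Tensor, sample_rate: int = 16000):
    """L1 between mel-spectrograms at the seven STFT scales of Eq. (2).
    Hop sizes and per-scale mel-bin counts are illustrative assumptions."""
    loss = x.new_zeros(())
    for k, n_mels in zip(range(5, 12), (5, 10, 20, 40, 80, 160, 320)):
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=2 ** k, hop_length=2 ** k // 4,
            n_mels=n_mels, power=1.0,
        ).to(x.device)
        loss = loss + (mel(x) - mel(x_hat)).abs().mean()
    return loss
```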
Adversarial Loss. We employ a Least Squares GAN (LSGAN) objective [25] to enhance perceptual quality. The discriminator loss is defined as:
\[\mathcal{L}_{D} = \frac{1}{N} \sum_{i=1}^{N} \left[ (D_i(x) - 1)^2 + D_i(G(z))^2 \right] \label{eq:disc_loss}\tag{3}\]
where \(D_i\) denotes the output of the \(i\)-th sub-discriminator, drawn from the multi-period (MPD) [22] and multi-scale (MSD) [26] discriminators, \(N\) is the total number of sub-discriminators, and \(G(z)\) is the generated audio. The generator adversarial loss is defined as:
\[\mathcal{L}_{\text{adv}} = \frac{1}{N} \sum_{i=1}^{N} (D_i(G(z)) - 1)^2 \label{eq:gen_adv}\tag{4}\]
Feature Matching Loss. We compute an L1 loss between the feature maps of the discriminators for the real and generated audio, which prevents the generator from overtraining on the current discriminator and improves generation quality. The feature matching loss is formulated as:
\[\mathcal{L}_{\text{feat}} = \frac{1}{N \cdot K} \sum_{i=1}^{N} \sum_{j=1}^{K} \frac{\| D_i^j(x) - D_i^j(G(z)) \|_1}{\| D_i^j(x) \|_1 + \epsilon} \label{eq:feat_loss}\tag{5}\]
where \(D_i^j(\cdot)\) denotes the \(j\)-th layer feature map from the \(i\)-th discriminator, \(K\) is the number of feature layers, and \(\epsilon\) is a small constant for numerical stability.
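A compact sketch of the adversarial and feature-matching terms in Eqs. (3)-(5), assuming each sub-discriminator returns a score map plus a list of intermediate feature maps (the interface of the actual MPD/MSD implementations may differ):

```python
import torch

def lsgan_discriminator_loss(real_scores, fake_scores):
    """Eq. (3): LSGAN loss averaged over the MPD/MSD sub-discriminators.
    real_scores / fake_scores are lists of per-discriminator score maps."""
    losses = [((r - 1) ** 2).mean() + (f ** 2).mean()
              for r, f in zip(real_scores, fake_scores)]
    return sum(losses) / len(losses)

def lsgan_generator_loss(fake_scores):
    """Eq. (4): the generator pushes each discriminator output towards 1."""
    return sum(((f - 1) ** 2).mean() for f in fake_scores) / len(fake_scores)

def feature_matching_loss(real_feats, fake_feats, eps: float = 1e-8):
    """Eq. (5): relative L1 between real and generated feature maps, averaged
    over discriminators and layers; real_feats[i][j] is layer j of discriminator i."""
    total, count = 0.0, 0
    for r_layers, f_layers in zip(real_feats, fake_feats):
        for r, f in zip(r_layers, f_layers):
            total = total + (r - f).abs().sum() / (r.abs().sum() + eps)
            count += 1
    return total / count
```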
Dataset and Training Details. We use the full 960-hour training set of LibriSpeech [27] for training and the test-clean set (2,620 utterances) for testing. All speech is sampled at 16 kHz, and training uses randomly cropped 4-second segments. Training is conducted on a single NVIDIA H100 GPU with a batch size of 32 and gradient accumulation of 3, yielding an effective batch size of 96. Training runs for 500,000 steps in a single stage. Both the generator and the discriminator use AdamW with \(\beta_1 = 0.8\), \(\beta_2 = 0.99\), and weight decay \(0.01\), together with a cosine annealing learning rate schedule decaying from \(1 \times 10^{-4}\) to \(0\) after 5k warmup steps.
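The optimizer and schedule settings above can be expressed as follows (a sketch; the `nn.Linear` is a placeholder for the actual generator, and the discriminator is configured identically):

```python
import torch
from transformers import get_cosine_schedule_with_warmup

# Placeholder module standing in for the generator.
generator = torch.nn.Linear(1, 1)
optimizer = torch.optim.AdamW(generator.parameters(), lr=1e-4,
                              betas=(0.8, 0.99), weight_decay=0.01)
scheduler = get_cosine_schedule_with_warmup(
    optimizer, num_warmup_steps=5_000, num_training_steps=500_000)
```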
Framework Configuration. Our framework features a symmetric encoder-decoder architecture. Both components are 12-layer Transformers based on the Whisper-small architecture, with 768-dimensional hidden states and 12 attention heads. The model processes 16 kHz audio, extracting 50 Hz feature sequences using a 25 ms window and a 10 ms hop size. The downsampler reduces the temporal resolution to 12.5 Hz via 4\(\times\) frame stacking and compresses the feature dimension from 768 to 32 using residual blocks with Snake activations [23] and multi-scale dilated convolutions (dilations: 1, 3, 5, 9). The upsampler employs a mirrored architecture, restoring the 50 Hz resolution and 768 dimensions through stages of 2\(\times\) nearest-neighbor upsampling and corresponding residual blocks. Following [28], we use a Finite Scalar Quantization (FSQ) module configured with eight codebooks, four dimensions per code, and levels of [8, 7, 6, 6] to achieve a 1.1 kbps bitrate. The decoder then reconstructs mel-spectrograms from the upsampled features, which are synthesized into the final waveform by a 24-layer Vocos model.
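As a check on the stated bitrate: each 4-dimensional code with levels [8, 7, 6, 6] can take \(8 \times 7 \times 6 \times 6 = 2016\) values, i.e. about \(\log_2 2016 \approx 11.0\) bits, and eight such codes are emitted per 12.5 Hz frame:

\[12.5~\text{Hz} \times 8 \times \log_2(2016) \approx 1098~\text{bps} \approx 1.1~\text{kbps}\]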
Baselines. We compare our codec against representative baselines at similar bitrates: EnCodec [7] (1.5 kbps), DAC-RVQ3 [29] (1.5 kbps), SpeechTokenizer [8] (1.0 kbps), Mimi-RVQ8 [6] (1.1 kbps), and XCodec2.0 [30] (0.8 kbps), all using official checkpoints.
Two aspects are evaluated: (i) acoustic reconstruction quality of the synthesized audio, and (ii) semantic alignment between the codec and text. All metrics are reported on LibriSpeech test-clean [27].
PESQ-WB/NB. Wideband and narrowband PESQ are reported following ITU-T P.862.2 and P.862, respectively [31]. Signals are resampled to 16 kHz (WB) and 8 kHz (NB) before scoring.
STOI. Short-Time Objective Intelligibility is used to measure intelligibility correlation between reference and reconstructed signals [32]. Audio is resampled to 16 kHz prior to computation.
SIM (speaker similarity). Cosine similarity is computed between speaker embeddings extracted from reference and reconstructed utterances using a WavLM-based speaker verification model.
WER via external ASR. Reconstructed audio is transcribed with a HuBERT-Large ASR model, and the word error rate (WER) is computed against the LibriSpeech reference transcripts.
Table 2 presents a comprehensive comparison of SimWhisper-Codec against representative baselines. Our codec captures both semantic and acoustic information simultaneously, validating the architectural simplification approach.
Semantic Preservation. On semantic evaluation, SimWhisper-Codec achieves the lowest Word Error Rate (WER) of 3.10, outperforming all other codecs. Notably, this performance is achieved without any semantic supervision, distinguishing it from baselines like XCodec2.0 (3.61 WER) and Mimi-RVQ8 (4.36 WER) that rely on such supervision. The results indicate that our architectural simplification does not compromise Whisper’s semantic alignment capabilities.
Acoustic Quality. In terms of acoustic reconstruction, SimWhisper-Codec delivers competitive results. It achieves a PESQ-NB of 2.98 and a STOI of 0.91, placing it among the top-performing models and significantly outperforming acoustic-only codecs like EnCodec and DAC-RVQ3. Furthermore, with a speaker similarity (SIM) score of 0.83, it demonstrates good preservation of speaker identity. These results show that the simplified Whisper encoder can effectively extract acoustic features even at low bitrates of 1.1 kbps.
Table 2: Comparison with baseline codecs on LibriSpeech test-clean.
| Model | Bitrate (bps) | Frame Rate (Hz) | Semantic Supervision | WER \(\downarrow\) | SIM \(\uparrow\) | STOI \(\uparrow\) | PESQ-NB \(\uparrow\) | PESQ-WB \(\uparrow\) |
|---|---|---|---|---|---|---|---|---|
| Ground Truth | – | – | – | 2.53 | – | – | – | – |
| EnCodec [7] | 1.5k | 75 | No | 45.49 | 0.60 | 0.85 | 1.94 | 1.56 |
| DAC-RVQ3 [29] | 1.5k | 75 | No | 41.67 | 0.45 | 0.76 | 1.82 | 1.43 |
| SpeechTokenizer [8] | 1.0k | 50 | Yes | 5.92 | 0.37 | 0.70 | 1.42 | 1.15 |
| Mimi-RVQ8 [6] | 1.1k | 12.5 | Yes | 4.36 | 0.73 | 0.90 | 2.62 | 2.13 |
| XCodec2.0 [30] | 0.8k | 50 | Yes | 3.61 | 0.82 | 0.91 | 2.95 | 2.32 |
| SimWhisper-Codec (ours) | 1.1k | 12.5 | No | 3.10 | 0.83 | 0.91 | 2.98 | 2.36 |
To validate the impact of removing different architectural components on codec performance, we conduct ablation studies on SimWhisper-Codec. For training simplicity, we maintain symmetric encoder-decoder architectures throughout all variants.
Table 3: Ablation study on the architectural components of SimWhisper-Codec (LibriSpeech test-clean).
| Variant | WER \(\downarrow\) | SIM \(\uparrow\) | STOI \(\uparrow\) | PESQ-NB \(\uparrow\) | PESQ-WB \(\uparrow\) |
|---|---|---|---|---|---|
| Whisper encoder (baseline) | 5.95 | 0.78 | 0.85 | 1.95 | 1.68 |
| – remove absolute PEs only | 5.42 | 0.80 | 0.87 | 2.34 | 1.96 |
| – remove both stem GELUs only | 3.74 | 0.81 | 0.89 | 2.51 | 2.10 |
| Ours: remove both components | 3.10 | 0.83 | 0.91 | 2.98 | 2.36 |
Table 3 demonstrates that both architectural modifications contribute to performance improvements. Removing GELU activations from the convolutional front-end has a more substantial impact on semantic preservation (WER: 5.95→3.74), while removing positional encodings provides moderate gains across all metrics. The combination of both modifications yields the best results, achieving a WER of 3.10, SIM of 0.83, and PESQ-WB of 2.36, validating our architectural design choices.
The ablation study demonstrates that our architectural modifications yield superior codec performance. To further validate that the simplified encoder retains fine-grained acoustic cues essential for high-quality synthesis, we conduct a layer-wise probing experiment analyzing its preservation of pitch information.
We train ridge regression models to predict frame-level fundamental frequency (\(F_0\)) from hidden states extracted from each encoder layer using THCHS-30 [33], a tonal Mandarin dataset with rich prosodic variation. Ground truth \(F_0\) values are extracted using Parselmouth [34], evaluating only voiced frames with Pearson correlation coefficient (PCC).
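A sketch of this probe for a single utterance and a single layer (frame alignment and the train/test split are simplified, and the fit below is in-sample for illustration only):

```python
import numpy as np
import parselmouth
from scipy.stats import pearsonr
from sklearn.linear_model import Ridge

def probe_f0(hidden_states: np.ndarray, wav_path: str, frame_rate: float = 50.0):
    """Ridge regression from frame-level hidden states to Parselmouth F0,
    scored with PCC on voiced frames only (simplified layer-wise probe)."""
    pitch = parselmouth.Sound(wav_path).to_pitch(time_step=1.0 / frame_rate)
    f0 = pitch.selected_array["frequency"]        # 0.0 marks unvoiced frames
    n = min(len(f0), hidden_states.shape[0])      # crude length alignment
    feats, f0 = hidden_states[:n], f0[:n]
    voiced = f0 > 0
    preds = Ridge(alpha=1.0).fit(feats[voiced], f0[voiced]).predict(feats[voiced])
    return pearsonr(preds, f0[voiced])[0]         # PCC (in-sample, for illustration)
```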
As shown in Figure 3, the results reveal that simplified Whisper maintains stable \(F_0\) tracking (PCC \(\approx 0.76\)) across all layers, while standard Whisper degrades from layer 6 onward (0.78→0.58). This indicates that our architectural modifications better preserve prosodic information essential for high-quality speech synthesis, providing additional evidence that the simplified encoder successfully retains acoustic details while maintaining semantic capabilities.
We presented SimWhisper-Codec, a low-bitrate speech codec that mitigates the semantic-acoustic conflict through architectural simplification rather than complex supervision. We removed convolutional front-end nonlinearity and absolute positional encodings from the frozen Whisper encoder. This architectural simplification maintains competitive acoustic quality while achieving superior semantic preservation. Results demonstrate that such simplification of Whisper can be more effective than semantic supervision approaches for speech codec design.