UMA-Split: Unimodal Aggregation for both English and Mandarin Non-Autoregressive Speech Recognition


Abstract

This paper proposes a unimodal aggregation (UMA) based non-autoregressive model for both English and Mandarin speech recognition. The original UMA explicitly segments and aggregates the acoustic frames of the same text token (with unimodal weights that first monotonically increase and then decrease) to learn better representations than regular connectionist temporal classification (CTC). However, it only works well for Mandarin. It struggles with other languages, such as English, for which a single syllable may be tokenized into multiple fine-grained tokens, or a token may span fewer than 3 acoustic frames and thus fail to form unimodal weights. To address this problem, we propose allowing each UMA-aggregated frame to map to multiple tokens, via a simple split module that generates two tokens from each aggregated frame before computing the CTC loss. Experiments verify the proposed model’s effectiveness: it outperforms other advanced non-autoregressive models, and even matches the hybrid CTC/attention autoregressive model on LibriSpeech (English, 2.22%/4.93% WER on test-clean/test-other with the 149M model) and on AISHELL-1 (Mandarin, 4.43% CER).

unimodal aggregation, non-autoregressive speech recognition, CTC

1 Introduction↩︎

During the past decade, end-to-end (E2E) speech recognition models have made significant advancements [1]. A core challenge of E2E ASR models is aligning input acoustic feature frames with output text tokens, and mainstream models address this with distinct solutions. Attention-based encoder-decoder (AED) [2] models leverage cross-attention between inputs and outputs to determine which acoustic frames to focus on when generating the next token. Recurrent Neural Network Transducer (RNN-T) [3] introduces blank tokens to facilitate alignment. It leverages a joint network for frame-by-frame token prediction, where predictions rely on both acoustic feature frames and outputs of the prediction network (conditioned on previously predicted tokens). Both models are autoregressive (AR) and typically use beam search decoding to enhance recognition accuracy, resulting in slow inference speeds.

In contrast, the connectionist temporal classification (CTC) [4] model is non-autoregressive (NAR). It predicts tokens in parallel, relying solely on acoustic feature frames. This frame-level independence enables faster inference, but at the cost of reduced recognition performance. [5] compares the performance of various NAR methods (Mask-CTC [6], Intermediate CTC [7], Self-conditioned CTC [8], CIF [9], etc.) on multiple English datasets up to 2021. Among them, Self-conditioned CTC performed best on most datasets: it embeds intermediate-layer CTC predictions into the forward flow to relax CTC’s independence assumption, while using the intermediate CTC loss. The study further notes that combining different NAR techniques boosts performance.

Over the past four years, several NAR methods [10]–[12] have been designed to explicitly aggregate acoustic information monotonically. They achieve performance comparable to AR baselines in Mandarin ASR. Among them, Paraformer [10] employs a Continuous Integrate-and-Fire (CIF) predictor to estimate the number of output tokens and integrate continuous acoustic feature frames. Paraformer-v2 [13] observes that the CIF predictor fails to accurately estimate the number of English byte-pair encoding (BPE) tokens, so it replaces the CIF module with a CTC posterior module to enable English speech recognition.

This work proposes a unimodal aggregation (UMA) based model for NAR ASR on both English and Mandarin. As proposed in our prior work [11], UMA explicitly segments and aggregates acoustic feature frames (with unimodal weights that first monotonically increase and then decrease) belonging to the same text token, thus learning better feature representations for text tokens compared to the regular CTC. However, it faces challenges when applied to languages other than Mandarin. In Mandarin ASR, tokens are Chinese characters—each corresponding to a complete long syllable, thus the continuous frames belonging to one syllable/token can be well aggregated. In other languages (e.g., English with BPE tokens), by contrast, a single syllable may be tokenized into multiple tokens, leaving CTC alignment struggling to learn which token each UMA-aggregated frame should map to. Additionally, fine-grained tokens may span fewer than 3 acoustic frames, failing to form unimodal weights. To address these difficulties, this work proposes allowing each UMA-aggregated frame to map to multiple tokens, which is realized by designing a simple split module to generate two tokens from each UMA-aggregated frame before computing the CTC loss. Experiments on LibriSpeech (English) demonstrate that the proposed model can effectively conduct unimodal aggregation and then map one UMA-aggregated frame to multiple tokens. Overall, the proposed model achieves superior ASR performance for both English and Mandarin, and even reaches performance comparable to the hybrid CTC/AED AR model.


Figure 1: Model architecture. Frame rates (frames per second, fps) are computed when applying a 10 ms STFT frame shift.

2 Method↩︎

The general architecture of the proposed model is shown in Fig. 1, consisting of six modules: convolutional subsampling, high-rate encoder, UMA module, low-rate encoder, split module, and CTC. The input of the model is log Mel filter-bank features, with a frame rate of 100 fps (frames per second) or 125 fps, depending on whether the STFT frame shift is 10 ms or 8 ms. A convolutional subsampling module is employed to downsample the feature sequence by a factor of 4, resulting in an output frame rate of 25 fps or 31.25 fps.
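For concreteness, the frame rates quoted above (and in Fig. 1) follow directly from the STFT frame shift and the 4× convolutional subsampling; a short sketch of the arithmetic is given below (illustrative only, not part of the released code).

```python
def frame_rate(frame_shift_ms: float, subsample: int = 1) -> float:
    """Frames per second for a given STFT frame shift, after optional subsampling."""
    return 1000.0 / frame_shift_ms / subsample

print(frame_rate(10))     # 100.0  fps, LibriSpeech input (10 ms shift)
print(frame_rate(8))      # 125.0  fps, AISHELL-1 input (8 ms shift)
print(frame_rate(10, 4))  # 25.0   fps after 4x convolutional subsampling
print(frame_rate(8, 4))   # 31.25  fps after 4x convolutional subsampling
```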

2.1 High-Rate Encoder↩︎

The high-rate encoder generates acoustic features while preserving the frame rate after convolutional subsampling. It can be implemented with any sequence modeling network; for the experiments in this paper, we adopted the E-Branchformer encoder block [14], given its superior performance in ASR tasks.

2.2 Unimodal Aggregation Module↩︎

In previous work [11], [15], UMA has been validated as effective for Mandarin speech recognition. In this work, we further improve its structure and extend its application to English.

UMA dynamically segments and aggregates the output feature sequence \(\mathbf{e}_t^h, t=1, \dots, T\) of the high-rate encoder. For each time step, a scalar aggregation weight \(\alpha_t\) is predicted using a feed-forward network (FFN) combined with a sigmoid activation, formulated as: \[\begin{align} \alpha_t=\text{Sigmoid}(\text{FFN}(\mathbf{e}_t^h)). \end{align}\] Consistent with the design in [16], [17], the FFN comprises two linear layers interleaved with a Swish activation. The first linear layer applies a dimensional-expansion factor of 2, while the second projects to a 1-dimensional representation, which is followed by the sigmoid activation to generate the weight.
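For illustration, the weight predictor can be written in a few lines of PyTorch. This is a minimal sketch based on the description above (two linear layers with a Swish/SiLU activation in between, expansion factor 2, and a sigmoid output), not the released implementation; tensor shapes and the class name are assumptions.

```python
import torch
import torch.nn as nn

class UMAWeightPredictor(nn.Module):
    """Sketch of the UMA weight head: FFN (expansion factor 2, Swish) + sigmoid."""
    def __init__(self, d_model: int):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 2 * d_model),  # first layer: expansion factor 2
            nn.SiLU(),                        # Swish activation
            nn.Linear(2 * d_model, 1),        # second layer: project to a scalar per frame
        )

    def forward(self, e_h: torch.Tensor) -> torch.Tensor:
        # e_h: (batch, T, d_model) high-rate encoder output
        # returns alpha: (batch, T), aggregation weights in (0, 1)
        return torch.sigmoid(self.ffn(e_h)).squeeze(-1)
```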

A timestep \(t\) satisfying \(\alpha_t\le \alpha_{t-1} \;\text{and} \;\alpha_t \le \alpha_{t+1}\) is defined as a UMA valley, with \(t=0\) and \(t=T\) also included. The time index of the \(i\)-th UMA valley is denoted as \(\tau_{i}\in[1, T]\), and we aggregate the feature frames between two consecutive UMA valleys using UMA weights as: \[\begin{align} \label{eq:uma} \mathbf{c}_i=\frac{\sum_{t=\tau_{i}}^{\tau_{i+1}}\alpha_t \mathbf{e}_t^h}{\sum_{t=\tau_i}^{\tau_{i+1}}\alpha_t}. \end{align}\tag{1}\]

The UMA valleys partition the feature frames into \(i=1,\dots,I\) segments, so the length reduction rate is \(T/I\). A UMA example is shown in Fig. 2. This capability of dynamic segmentation and aggregation is automatically learned during training, without extra supervision regarding the length reduction rate.
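The valley detection and the weighted aggregation of Eq. (1) can likewise be sketched for a single utterance. The PyTorch code below is illustrative and unbatched; the released implementation may differ in batching and boundary handling.

```python
import torch

def uma_aggregate(e_h: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    """Single-utterance sketch of Eq. (1). e_h: (T, D), alpha: (T,)."""
    T = alpha.shape[0]
    # A valley is a frame whose weight is <= both neighbours; the two boundary
    # frames are always treated as valleys.
    is_valley = torch.zeros(T, dtype=torch.bool)
    is_valley[0] = is_valley[-1] = True
    is_valley[1:-1] = (alpha[1:-1] <= alpha[:-2]) & (alpha[1:-1] <= alpha[2:])
    valleys = torch.nonzero(is_valley).squeeze(-1)

    segments = []
    for i in range(len(valleys) - 1):
        # Eq. (1) sums from tau_i to tau_{i+1} inclusive, so the shared valley
        # frames contribute to both adjacent segments.
        s, e = int(valleys[i]), int(valleys[i + 1]) + 1
        w = alpha[s:e].unsqueeze(-1)                       # (len, 1)
        segments.append((w * e_h[s:e]).sum(0) / w.sum())   # weighted mean
    return torch.stack(segments)                           # (I, D) aggregated frames
```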

2.3 Low-Rate Encoder↩︎

After the UMA, the sequence length is dynamically shortened. Influenced by factors including the text token rate, language type, and original frame rate, the post-UMA frame rate falls within the 4–7 fps range. The aggregated feature sequence is subsequently processed by a low-rate encoder, which preserves the frame rate. Our low-rate encoder consists of 6 Transformer encoder blocks [16]. The outputs are denoted as \(\mathbf{e}_i^l \;(i=1, \dots, I)\).

2.4 Split Module↩︎

As mentioned in Section 1, to address the difficulties of applying UMA to English, we propose allowing each UMA-aggregated frame to generate two text tokens. This is realized by simply splitting each UMA-aggregated frame into two frames, each of which may correspond to one non-blank token. The output of the split module, denoted as \(\mathbf{s}_j \;(j=1, \dots, 2I)\), is defined as: \[\begin{align} \label{eq:split} \mathbf{s}_j = \begin{cases} \text{LayerNorm}(\mathbf{e}_{i}^l) & \text{for } j=2i-1, \\ \text{LayerNorm}\left(\text{FFN}(\mathbf{e}_{i}^l)\right) & \text{for } j=2i. \end{cases} \end{align}\tag{2}\] The FFN employed here consists of two linear layers, with the first layer applying a dimensional-expansion factor of 4 and the second layer mapping back to the original dimension.
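A minimal PyTorch sketch of Eq. (2) is given below. The interleaving of the two outputs and the expansion factor of 4 follow the text; the activation inside the FFN is an assumption, since only the two linear layers are specified.

```python
import torch
import torch.nn as nn

class SplitModule(nn.Module):
    """Sketch of Eq. (2): each low-rate frame is expanded into two frames."""
    def __init__(self, d_model: int):
        super().__init__()
        self.ffn = nn.Sequential(
            nn.Linear(d_model, 4 * d_model),  # expansion factor 4
            nn.SiLU(),                        # activation assumed (not specified in the text)
            nn.Linear(4 * d_model, d_model),
        )
        self.norm_odd = nn.LayerNorm(d_model)
        self.norm_even = nn.LayerNorm(d_model)

    def forward(self, e_l: torch.Tensor) -> torch.Tensor:
        # e_l: (batch, I, D) -> s: (batch, 2I, D), interleaved as [e_i, FFN(e_i)]
        s_odd = self.norm_odd(e_l)               # s_{2i-1}
        s_even = self.norm_even(self.ffn(e_l))   # s_{2i}
        s = torch.stack((s_odd, s_even), dim=2)  # (batch, I, 2, D)
        return s.reshape(e_l.shape[0], -1, e_l.shape[-1])
```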

We do not apply any explicit supervision on whether one UMA-aggregated frame should generate two non-blank tokens; instead, this is learned automatically through the final CTC loss applied to the split sequence, i.e., \(\mathbf{s}_j \;(j=1, \dots, 2I)\). In the example shown in Fig. 2, three cases may occur for the two tokens after splitting: 1) both are blank tokens; 2) both are identical non-blank tokens, or one of them is a blank token; either case results in 1 non-blank token in total (e.g., [_ten,_ten], [_the,#]); 3) the two are distinct non-blank tokens, leading to 2 non-blank tokens in total (e.g., [_do,se]).
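To make the three cases concrete, the toy example below (hypothetical token strings) applies the standard CTC greedy collapse, where repeated tokens are merged and blanks removed, to a split output sequence.

```python
def ctc_collapse(tokens, blank="#"):
    """Standard CTC greedy collapse: merge repeats, then drop blanks."""
    out, prev = [], None
    for t in tokens:
        if t != blank and t != prev:
            out.append(t)
        prev = t
    return out

# Three aggregated frames, each split into two tokens:
# [_the, #] -> 1 token, [_do, se] -> 2 tokens, [_ten, _ten] -> 1 token
print(ctc_collapse(["_the", "#", "_do", "se", "_ten", "_ten"]))
# ['_the', '_do', 'se', '_ten']
```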

2.5 Loss Function↩︎

UMA’s performance depends on accurately estimating the span of acoustic feature frames of each token, and integrating the Self-conditioned CTC (SC-CTC) method [8] can enhance this accuracy. Self-conditioned CTC adds CTC predictions from intermediate layers into the input of subsequent layers, aiming to condition the subsequent and final CTC predictions on these intermediate ones. We apply this method prior to the UMA module in the high-rate encoder, allowing the prediction of UMA weights to be conditioned on the intermediate CTC predictions. Specifically, we integrate conditioning layers into the mid-, three-quarter-, and final layers of the high-rate encoder. For the 2nd and 4th layers of the low-rate encoder, the intermediate CTC loss (no conditioning) is applied; the intermediate hidden units here are fed into the split module before computing the intermediate CTC loss. The average of these five intermediate CTC losses gives \(L_{\mathbf{inter}}\); combined with the CTC loss of the final output, it forms the overall training loss: \(\mathcal{L} = 0.5({L_{\mathbf{CTC}}} + {L_{\mathbf{inter}}})\).
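Putting the pieces together, the overall objective can be sketched as follows; `ctc_loss` and the variable names are placeholders, and the five intermediate hidden states are those listed above (three conditioning layers in the high-rate encoder and two intermediate layers in the low-rate encoder).

```python
def total_loss(final_hidden, intermediate_hiddens, ctc_loss):
    """Sketch of L = 0.5 * (L_CTC + L_inter), with L_inter averaged over 5 layers."""
    l_inter = sum(ctc_loss(h) for h in intermediate_hiddens) / len(intermediate_hiddens)
    l_ctc = ctc_loss(final_hidden)
    return 0.5 * (l_ctc + l_inter)
```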


Figure 2: An English example of UMA weights for the BPE 5000 Base model. Ground truth text: “the dose for an adult is ten minims”. # denotes the blank token.

3 Experiments↩︎

3.1 Dataset↩︎

We evaluated the proposed method on two widely used datasets: 1) LibriSpeech [18], a 1,000-hour English audiobook speech dataset; 2) AISHELL-1 [19], a 178-hour Mandarin Chinese read speech dataset. For tokenization, AISHELL-1 uses 4,233 Chinese characters, while LibriSpeech primarily uses 5,000 BPE tokens. The BPE vocabulary is generated from the text of the training data. We additionally conducted experiments on LibriSpeech to explore the impact of BPE vocabulary size on UMA.

3.2 Experimental Setups↩︎

The ESPnet toolkit [20] is used for all experiments, and our code is publicly released.1 The AR baseline is the hybrid CTC/AED model [21], with all encoders being E-Branchformer [14]. We also note that CTC networks trained via the hybrid CTC/AED method outperform those trained solely with the CTC loss. Therefore, the baseline CTC models used for comparison are trained via the hybrid CTC/AED loss; during inference, the AED head is discarded and only the CTC encoder is used. Given that the proposed model incorporates the SC-CTC mechanism, we also compare it with the original SC-CTC model [8] on LibriSpeech. Additionally, we directly quote results from several recent studies [11]–[14], [22], [23] on the same datasets to enable broader comparisons.

Note that at the early stage of training, unstable UMA aggregation may cause some batch samples to yield output lengths shorter than the target text token sequence, making the CTC loss incomputable. Therefore, we only adopt the batch samples with a computable CTC loss for training, which enables normal gradient descent and the gradual learning of reasonable aggregation ratios. In preliminary experiments, we also tried character tokens on LibriSpeech; however, the token rate was too high (14.33 tps) to form effective UMA weights.
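A sketch of this filtering is given below: samples whose post-split output is shorter than the target are simply excluded from the CTC loss. Function and variable names are illustrative and not taken from the released code.

```python
import torch
import torch.nn.functional as F

def filtered_ctc_loss(log_probs, targets, out_lens, tgt_lens, blank=0):
    """log_probs: (T', B, V) after the split module; out_lens: (B,) output lengths (2I);
    targets: (B, S) padded token ids; tgt_lens: (B,) target lengths."""
    keep = out_lens >= tgt_lens            # drop samples whose output is too short for CTC
    if not keep.any():
        return log_probs.new_zeros(())     # skip the whole batch
    return F.ctc_loss(
        log_probs[:, keep], targets[keep],
        out_lens[keep], tgt_lens[keep],
        blank=blank, reduction="mean", zero_infinity=True,
    )
```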

For each dataset, we ensure identical training configurations to the baseline models. All inferences are without language models. The CTC/AED baseline employs beam search (with 60 beams on LibriSpeech and 10 beams on AISHELL-1), while all CTC-related methods (including ours) use greedy search.

Input: All input data are sampled at 16 kHz. The STFT frame shift is 10 ms for LibriSpeech and 8 ms for AISHELL-1. The model input features are 80-dimensional log Mel filter-banks.

High-rate encoder: We use two sizes of E-Branchformer high-rate encoders on LibriSpeech: the hyperparameters (dimension, feedforward dimension, layers, attention heads) are set to (256, 1024, 13, 4) for Base and (512, 1024, 18, 8) for Large. On AISHELL-1, the hyperparameters are (256, 1024, 17, 4). The number of layers is adjusted so that the model sizes roughly match those of the comparison baselines.

Low-rate encoder: 6 Transformer encoder blocks, with dimension and attention heads matching the respective high-rate encoder. All models use a feedforward dimension of 2048.

Optimizer: We use the AdamW optimizer with a warmup scheduler. The batch size, learning rate, and warmup steps follow ESPnet’s recipes. For weight decay, we use 1e-6 for LibriSpeech and 1e-2 for AISHELL-1. The proposed model and the baselines are trained for the same number of steps. The model with the averaged weights of the 10 best checkpoints is taken as the final model.
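As a rough sketch, the optimizer setup corresponds to something like the following; the peak learning rate and warmup steps are placeholders standing in for the values in the ESPnet recipes.

```python
import torch

model = torch.nn.Linear(80, 256)  # placeholder for the UMA-Split model
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=2e-3,              # placeholder peak learning rate
    weight_decay=1e-6,    # 1e-2 for AISHELL-1
)

warmup_steps = 25000  # placeholder; actual value follows the ESPnet recipe

def warmup_then_inv_sqrt(step: int) -> float:
    step = max(step, 1)
    return min(step / warmup_steps, (warmup_steps / step) ** 0.5)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, warmup_then_inv_sqrt)
```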

Table 1: UMA-Split model-related statistics. Token rate: tokens per second (tps), calculated on the test sets. Non-blank: proportion of frames that output \(\geq 1\) non-blank token after splitting (relative to all frames). 2-non-blank: proportion of frames that output 2 distinct non-blank tokens (relative to non-blank frames). SC denotes SC-CTC.

| Dataset | Token type & vocabulary size | Token rate | Frame rate before UMA | Frame rate after UMA | Split ratio: non-blank | Split ratio: 2-non-blank | Params | Test (%) CER or WER |
|---|---|---|---|---|---|---|---|---|
| AISHELL-1 | Char 4233 | 2.90 tps | 31.25 fps | 5.91 fps | 49.4% | 0% | 46M | 4.43 |
| LibriSpeech | BPE 500 | 5.37 tps | 25 fps | 6.16 fps | 73.2% | 30.1% | 39M | 2.75 / 6.45 |
| LibriSpeech | BPE 5000 (B) | 3.39 tps | 25 fps | 4.58 fps | 70.5% | 8.3% | 41M | 2.50 / 5.77 |
| LibriSpeech | BPE 10000 | 3.11 tps | 25 fps | 4.38 fps | 68.7% | 4.9% | 43M | 2.49 / 5.73 |
| LibriSpeech | BPE 5000 (w/o SC) | 3.39 tps | 25 fps | 4.98 fps | 61.5% | 12.6% | 39M | 2.90 / 6.53 |
| LibriSpeech | BPE 5000 (L) | 3.39 tps | 25 fps | 5.78 fps | 56.1% | 7.6% | 149M | 2.22 / 4.93 |
Table 2: Word error rate (WER, %) on LibriSpeech. All without LM. The decoding beam size for AR models is 60. Test set results are given as test_clean / test_other.

| Model | Type | test_clean / test_other | Params |
|---|---|---|---|
| E-Branchformer (B), hybrid [14] | AR | 2.49 / 5.61 | 41M |
| CTC, infer w/o AED head | NAR | 3.20 / 7.09 | 29M |
| Zipformer-M, CTC [23] | NAR | 2.52 / 6.02 | 64M |
| Paraformer-v2 (S) [13] | NAR | 3.4 / 8.0 | 50M |
| E-Branchformer, SC-CTC | NAR | 2.62 / 6.16 | 43M |
| UMA-Split (B) (prop.) | NAR | 2.50 / 5.77 | 41M |
| E-Branchformer (L), hybrid [14] | AR | 2.14 / 4.55 | 149M |
| CTC, infer w/o AED head | NAR | 2.59 / 5.45 | 119M |
| Zipformer-L, CTC [23] | NAR | 2.50 / 5.72 | 147M |
| Paraformer-v2 (L) [13] | NAR | 3.0 / 6.9 | 120M |
| UMA-Split (L) (prop.) | NAR | 2.22 / 4.93 | 149M |
Table 3: Character error rate (CER, %) on AISHELL-1. All without LM. The decoding beam size for AR models is 10.

| Model | Type | dev | test | Params |
|---|---|---|---|---|
| Branchformer (B), hybrid [22] | AR | 4.19 | 4.43 | 45M |
| E-Branchformer, hybrid | AR | 4.13 | 4.53 | 57M |
| CTC, infer w/o AED head | NAR | 4.39 | 4.91 | 46M |
| Paraformer-v2 (S) [13] | NAR | 4.5 | 4.9 | 50M |
| Zipformer-M, CTC [23] | NAR | 4.47 | 4.80 | 66M |
| EffectiveASR Large [12] | NAR | 4.26 | 4.62 | 76M |
| Original UMA, Conformer [11] | NAR | 4.4 | 4.7 | 45M |
| UMA-Split (prop.) | NAR | 4.15 | 4.43 | 46M |
| – w/o split module | NAR | 4.28 | 4.53 | 45M |

3.3 Main Results↩︎

LibriSpeech: Table 2 shows the results on LibriSpeech. Compared with other advanced NAR models, the proposed model achieves superior performance in both Base (41M parameters) and Large (149M parameters) configurations, which demonstrates that the proposed explicit frame aggregation with UMA weights indeed helps learn better token representations. Compared to the AR hybrid CTC/AED baseline, the proposed model achieves comparable performance across different parameter scales, with a 10× inference speedup [11].

AISHELL-1: Table 3 shows the results on AISHELL-1. Consistent with the results presented in [11], UMA outperforms other NAR models. Compared to the model without the split module, adding it brings a slight performance boost in Mandarin ASR. These superior results confirm that the proposed model retains the original UMA’s strong capability for Mandarin ASR.

3.4 UMA-Split Analysis↩︎

Table 1 presents UMA-Split model statistics across different configurations. On AISHELL-1, each token corresponds to a complete syllable. Even with the split module, each UMA-aggregated frame still corresponds to a single token, so the 2-non-blank split ratio is 0%.

On LibriSpeech, we compare different BPE vocabulary sizes with the Base model. As the vocabulary size increases, both the token rate and the post-UMA frame rate decrease. A larger BPE vocabulary yields coarser-grained tokenization and a larger average number of acoustic frames per token, which is easier for UMA to learn and results in a lower 2-non-blank split ratio. Although the split module allows one UMA-aggregated frame to map to 2 non-blank tokens, such a mixed representation of 2 tokens is inferior to the representation of a single token. Therefore, a lower 2-non-blank split ratio corresponds to a lower WER.

Additionally, comparing with the model without SC-CTC (w/o SC) on LibriSpeech, we find that SC-CTC helps better recognize the span of acoustic feature frames of tokens, thus reducing the 2-non-blank split ratio and WER. We also note that, compared to the Base model, the Large model exhibits a higher post-UMA frame rate and generates more blank tokens (a lower non-blank split ratio); the reason for this phenomenon remains unclear to us.

4 Conclusions↩︎

We propose a UMA-based NAR ASR model for English and Mandarin. The original UMA faces challenges in languages other than Mandarin (e.g., English with BPE tokens), where fine-grained tokens may span fewer than 3 acoustic frames, failing to form unimodal weights. We address this by allowing each UMA-aggregated frame to map to two tokens via a simple split module. Experiments on LibriSpeech and AISHELL-1 show the proposed model achieves performance even comparable to that of the hybrid CTC/AED AR model.

References↩︎

[1]
Rohit Prabhavalkar, Takaaki Hori, Tara N Sainath, Ralf Schlüter, and Shinji Watanabe, “End-to-end speech recognition: A survey,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, pp. 325–351, 2023.
[2]
William Chan, Navdeep Jaitly, Quoc Le, and Oriol Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in ICASSP, 2016, pp. 4960–4964.
[3]
Alex Graves, “Sequence transduction with recurrent neural networks,” Proc. Int. Conf. Mach. Learn., 2012.
[4]
Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in ICML, 2006, pp. 369–376.
[5]
Yosuke Higuchi, Nanxin Chen, Yuya Fujita, Hirofumi Inaguma, Tatsuya Komatsu, Jaesong Lee, Jumon Nozaki, Tianzi Wang, and Shinji Watanabe, “A comparative study on non-autoregressive modelings for speech-to-text generation,” in ASRU, 2021, pp. 47–54.
[6]
Yosuke Higuchi, Shinji Watanabe, Nanxin Chen, Tetsuji Ogawa, and Tetsunori Kobayashi, “Mask CTC: Non-autoregressive end-to-end asr with CTC and mask predict,” in Interspeech, 2020.
[7]
Jaesong Lee and Shinji Watanabe, “Intermediate loss regularization for CTC-based speech recognition,” in ICASSP, 2021, pp. 6224–6228.
[8]
Jumon Nozaki and Tatsuya Komatsu, “Relaxing the conditional independence assumption of CTC-based ASR by conditioning on intermediate predictions,” in Interspeech, 2021.
[9]
Linhao Dong and Bo Xu, “CIF: Continuous integrate-and-fire for end-to-end speech recognition,” in ICASSP, 2020, pp. 6079–6083.
[10]
Zhifu Gao, Shiliang Zhang, Ian Mcloughlin, and Zhijie Yan, “Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition,” in Interspeech, 2022.
[11]
Ying Fang and Xiaofei Li, “Unimodal aggregation for CTC-based speech recognition,” in ICASSP, 2024, pp. 10591–10595.
[12]
Ziyang Zhuang, Chenfeng Miao, Kun Zou, Ming Fang, Tao Wei, Zijian Li, Ning Cheng, Wei Hu, Shaojun Wang, and Jing Xiao, “EffectiveASR: A single-step non-autoregressive Mandarin speech recognition architecture with high accuracy and inference speed,” in ICASSP, 2025, pp. 1–5.
[13]
Keyu An, Zerui Li, Zhifu Gao, and Shiliang Zhang, “Paraformer-v2: An improved non-autoregressive transformer for noise-robust speech recognition,” Nat. Conf. Man-Mach. Speech Commun., pp. 1240–1253, 2024.
[14]
Kwangyoun Kim, Felix Wu, Yifan Peng, Jing Pan, Prashant Sridhar, Kyu J Han, and Shinji Watanabe, “E-Branchformer: Branchformer with enhanced merging for speech recognition,” in SLT, 2023, pp. 84–91.
[15]
Ying Fang and Xiaofei Li, “Mamba for streaming asr combined with unimodal aggregation,” in ICASSP, 2025, pp. 1–5.
[16]
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[17]
Anmol Gulati, James Qin, Chung-Cheng Chiu, Niki Parmar, Yu Zhang, Jiahui Yu, Wei Han, Shibo Wang, Zhengdong Zhang, Yonghui Wu, and Ruoming Pang, “Conformer: Convolution-augmented Transformer for Speech Recognition,” in Interspeech, 2020, pp. 5036–5040.
[18]
Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur, “Librispeech: an ASR corpus based on public domain audio books,” in ICASSP, 2015, pp. 5206–5210.
[19]
Hui Bu, Jiayu Du, Xingyu Na, Bengu Wu, and Hao Zheng, “AISHELL-1: An open-source Mandarin speech corpus and a speech recognition baseline,” in O-COCOSDA, 2017, pp. 1–5.
[20]
Shinji Watanabe, Takaaki Hori, Shigeki Karita, Tomoki Hayashi, Jiro Nishitoba, Yuya Unno, Nelson Enrique Yalta Soplin, Jahn Heymann, Matthew Wiesner, Nanxin Chen, Adithya Renduchintala, and Tsubasa Ochiai, “ESPnet: End-to-end speech processing toolkit,” in Interspeech, 2018, pp. 2207–2211.
[21]
Shinji Watanabe, Takaaki Hori, Suyoun Kim, John R Hershey, and Tomoki Hayashi, “Hybrid CTC/attention architecture for end-to-end speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
[22]
Yifan Peng, Siddharth Dalmia, Ian Lane, and Shinji Watanabe, “Branchformer: Parallel MLP-attention architectures to capture local and global context for speech recognition and understanding,” in ICML, 2022, pp. 17627–17643.
[23]
Zengwei Yao, Liyong Guo, Xiaoyu Yang, Wei Kang, Fangjun Kuang, Yifan Yang, Zengrui Jin, Long Lin, and Daniel Povey, “Zipformer: A faster and better encoder for automatic speech recognition,” in ICLR, 2024.

  1. https://github.com/Audio-WestlakeU/UMA-ASR↩︎