October 21, 2025
Multilingual automatic speech recognition (ASR) remains a challenging task, especially when balancing performance across high- and low-resource languages. Recent advances in sequence modeling suggest that architectures beyond Transformers may offer better scalability and efficiency. In this work, we introduce MLMA (Multilingual Language Modeling with Mamba for ASR), a new approach that leverages the Mamba architecture—an efficient state-space model optimized for long-context sequence processing—for multilingual ASR. Using Mamba, MLMA implicitly incorporates language-aware conditioning and shared representations to support robust recognition across diverse languages. Experiments on standard multilingual benchmarks show that MLMA achieves competitive performance compared to Transformer-based architectures. These results highlight Mamba’s potential as a strong backbone for scalable, efficient, and accurate multilingual speech recognition.
Multilingual ASR, State Space Models, Mamba
Automatic Speech Recognition (ASR) has become a cornerstone of modern computing, supporting applications such as voice assistants, transcription services, and real-time speech translation. Driven by large-scale datasets and deep learning advances, ASR systems have reached near-human performance in high-resource languages like English and Mandarin [1], [2]. However, most existing systems are language-specific, which limits scalability and exacerbates the performance gap for low-resource languages with limited annotated data [3].
Multilingual ASR has emerged as a promising alternative by training a single model across multiple languages [4], [5]. Such models exploit shared phonetic and acoustic representations, enabling cross-lingual transfer from high-resource to under-represented languages. Despite this potential, achieving robust multilingual performance remains challenging. Transformer-based architectures, now dominant in ASR [6], [7], provide strong sequence modeling capabilities but with high computational and memory costs. These inefficiencies are especially problematic in multilingual scenarios, where diverse speech rates, prosodic patterns, and phenomena such as code-switching demand processing of long and variable-length utterances.
Recently, the Mamba architecture [8] has been proposed to handle variable-length input sequences and temporal irregularities, which are common in multilingual speech data. It can therefore generalize across languages with different rhythmic and phonetic structures. Mamba also supports streaming ASR through mechanisms such as lookahead and unimodal aggregation (UMA), which help it adapt to real-time multilingual input [9]. These features are particularly beneficial for languages characterized by rapid speech transitions or tonal variations, where conventional models often struggle to maintain recognition accuracy and latency.
Integrating Mamba into multilingual ASR offers several advantages: its memory-efficient design lowers training and inference costs [10]; its sequential inductive bias can better capture cross-lingual phonetic structures, benefiting code-switching and low-resource languages; and its scalability enables adding languages without proportional computational overhead. In this work, we investigate the application of Mamba-based architectures to multilingual ASR. We conduct experiments across a diverse set of European languages, analyzing the models' ability to learn shared linguistic representations by comparing their performance against Transformer-based baselines, as well as their robustness to multilingual challenges. Our goal is to bridge the gap between recent advances in efficient sequence modeling and the development of inclusive, scalable ASR systems that can serve a truly global user base. Our MLMA model, trained on almost 12K hours covering 6 languages, is the first multilingual ASR system based on Mamba. MLMA code and weights are publicly available under the most permissive license.
With the advent of deep learning, multilingual ASR systems, capable of recognizing multiple languages, have grown in demand for cross-lingual use [11]. Recent work in multilingual ASR has drastically increased language coverage to support hundreds and even thousands of languages. This includes approaches based on labeled training data, such as Whisper [12], USM [13], Seamless [14], MMS [15], ML-SUPERB 2.0 [16], and FAMA [17], as well as zero-shot work [18]. While these Transformer-based approaches are highly effective at modeling long-range dependencies, Transformers have notable drawbacks: their quadratic complexity makes long-sequence processing costly, they require vast amounts of labeled or weakly labeled data that are scarce for low-resource languages, and their large size limits their deployment in resource-constrained settings [19].
Mamba has been applied to various speech tasks, for example separation and enhancement [20]–[22], leveraging its linear-time complexity to model long sequences at low computational cost. Motivated by this, numerous studies have evaluated Mamba's performance on ASR tasks. Table 1 summarizes the most recent research papers leveraging Mamba for ASR, highlighting the main contribution of each work.
Based on the literature review in Table 1, current Mamba-based ASR research exhibits significant limitations. Existing studies mainly investigate architectural replacements within Transformer backbones, but are restricted to monolingual or at most bilingual settings on small datasets like LibriSpeech-100 [30]. These works operate under matched conditions and lack the scale to test Mamba’s multilingual effectiveness. In contrast, our proposed MLMA model explores Mamba in a large-scale multilingual setting with nearly 12,000 hours of training across six European languages, representing the first multilingual ASR system based on Mamba and demonstrating its viability beyond constrained setups.
Mamba is a Structured State Space Model (SSM) defined in discrete time as: \[h_{t} = \bar{A}h_{t-1} + \bar{B}x_{t}, \quad y_{t}=Ch_{t}\] where \(x_{t}\) is the input, \(h_{t}\) the state, \(\bar{A}\) the transition matrix, \(\bar{B}\) the input-state interaction, and \(C\) the output map.
Since \(\bar{A}\) and \(\bar{B}\) derive from continuous-time parameters, they are not learned directly but approximated via Zero-Order Hold (ZOH): \[\bar{A}=\exp(\Delta A), \quad \bar{B}=(\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B\] with \(A, B\) the continuous forms and \(\Delta\) the discretization step. Training is done on \(A, B\), which are converted to \(\bar{A}, \bar{B}\) at each forward pass, enabling efficient discrete-time modeling while preserving long-range dependencies. ZOH ensures that the temporal structure of the continuous-time model is retained after discretization, allowing it to track dependencies across long sequences. To increase adaptability, [31] introduced a selection mechanism: \[B=f_{B}(x), \quad C=f_{C}(x), \quad \Delta= \text{Broadcast}_{D}(f_{\Delta}(x))\] that is, instead of using fixed matrices \(B,C,\Delta\), the model learns functions \(f_B, f_C, f_{\Delta}\), that generate these parameters based on the input \(x\), allowing the model to flexibly adapt its state transition and output mapping according to the current input. Mamba [8] extends this idea by removing the Linear Time Invariance (LTI) constraint, allowing parameters to vary over time. This improves flexibility in non-stationary environments and strengthens modeling of long-range, context-dependent behaviors.
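To make the recurrence concrete, below is a minimal PyTorch sketch (not the optimized Mamba kernel) of the discretization and the sequential scan \(h_t = \bar{A}h_{t-1} + \bar{B}x_t\), \(y_t = Ch_t\) with input-dependent \(B\), \(C\), and \(\Delta\). It adopts the common diagonal-\(A\) simplification \(\bar{B}\approx\Delta B\) in place of the exact ZOH expression and a plain Python loop instead of the parallel selective scan; all shapes and names (`d_inner`, `d_state`, ...) are illustrative assumptions.

```python
# Minimal, readable sketch of a selective SSM step (assumed shapes, not the CUDA kernel).
import torch

def zoh_discretize(A, delta, B):
    # A: (d_inner, d_state) continuous-time diagonal transition (negative entries)
    # delta: (batch, seq, d_inner) input-dependent step; B: (batch, seq, d_state)
    dA = torch.exp(delta.unsqueeze(-1) * A)            # A_bar = exp(Delta * A)
    dB = delta.unsqueeze(-1) * B.unsqueeze(2)          # B_bar ~= Delta * B (simplified ZOH)
    return dA, dB                                      # both: (batch, seq, d_inner, d_state)

def selective_scan(x, A, B, C, delta):
    # x: (batch, seq, d_inner); B, C, delta would come from the selection
    # functions f_B(x), f_C(x), f_Delta(x), e.g. learned linear projections of x.
    dA, dB = zoh_discretize(A, delta, B)
    h = x.new_zeros(x.size(0), x.size(2), A.size(-1))  # state h_0 = 0
    ys = []
    for t in range(x.size(1)):
        h = dA[:, t] * h + dB[:, t] * x[:, t].unsqueeze(-1)   # h_t = A_bar h_{t-1} + B_bar x_t
        ys.append((h * C[:, t].unsqueeze(1)).sum(-1))         # y_t = C h_t
    return torch.stack(ys, dim=1)                      # (batch, seq, d_inner)
```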
For ASR, and speech processing in general, extracting both local and global features is crucial. Models such as Conformer [32] and Zipformer [33] achieve this by combining convolution (local) with self-attention (global). The ConMamba encoder follows the same principle but replaces multi-head self-attention with Mamba layers, while retaining convolution to strengthen local feature extraction. For a generic input \(x\), a ConMamba encoder produces output embeddings \(y\) as: \[\begin{aligned} \tilde{x} &= x + \tfrac{1}{2}\,\mathrm{FFN}(x), & x' &= \tilde{x} + \mathrm{Mamba}(\tilde{x}),\\ x'' &= x' + \mathrm{Conv}(x'), & y &= \mathrm{LayerNorm}\!\left(x'' + \tfrac{1}{2}\,\mathrm{FFN}(x'')\right), \end{aligned}\] where FFN is a feed-forward module and Conv is a convolutional module that extracts local patterns. Note that the outputs of the Mamba and Conv layers are accumulated through residual connections, and the result is summed with half of the output of a second FFN before layer normalization. This hybrid design enables effective integration of local and global features for speech representation.
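As an illustration, the following minimal PyTorch sketch implements one ConMamba encoder layer following the update rules above. The bidirectional Mamba block and the convolution module are assumed to be supplied externally (e.g., a Bi-Mamba layer and a Conformer-style convolution), and all names are illustrative rather than taken from the released implementation.

```python
import torch.nn as nn

def ffn(d_model, ffn_dim, dropout=0.1):
    # Pre-norm feed-forward module (illustrative layout)
    return nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, ffn_dim),
                         nn.GELU(), nn.Dropout(dropout), nn.Linear(ffn_dim, d_model))

class ConMambaLayer(nn.Module):
    def __init__(self, d_model, ffn_dim, mamba_block, conv_module):
        super().__init__()
        self.ffn1 = ffn(d_model, ffn_dim)
        self.ffn2 = ffn(d_model, ffn_dim)
        self.mamba = mamba_block   # replaces multi-head self-attention (global context)
        self.conv = conv_module    # convolution module (local patterns)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                              # x: (batch, time, d_model)
        x = x + 0.5 * self.ffn1(x)                     # x_tilde = x + 1/2 FFN(x)
        x = x + self.mamba(x)                          # x' = x_tilde + Mamba(x_tilde)
        x = x + self.conv(x)                           # x'' = x' + Conv(x')
        return self.norm(x + 0.5 * self.ffn2(x))       # y = LayerNorm(x'' + 1/2 FFN(x''))
```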
The proposed MLMA model, as depicted in Figure 1, follows the architecture introduced in [20], which integrates a convolutional transformer with a bidirectional Mamba module (Bi-Mamba) within a CTC framework. Input audio is converted to 80-dimensional log Mel filter banks, normalized, and processed by a two-block CNN for low-level feature extraction and temporal downsampling. An 18-layer Transformer encoder (hidden size 256, feed-forward 1024, dropout 0.1, GELU) models contextual representations, augmented with a Bi-Mamba module (dstate=16, expand=2, dconv=4) to capture long-range dependencies. A linear projection followed by LogSoftmax maps encoder outputs to the vocabulary (including blank, BOS, and EOS tokens), and training is performed with the CTC loss. Note that in our experiments we use the same hyperparameters reported in¹. More details on the training hyperparameters, along with our implementation, are available in the public repository².
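For clarity, the sketch below outlines the CTC pipeline described above (CNN frontend with downsampling, encoder, linear projection with LogSoftmax, CTC loss). Only the values stated in the text (80 Mel bins, hidden size 256, a blank-augmented vocabulary) are taken from the paper; the exact frontend layout, `vocab_size`, and module names are illustrative assumptions, not the public implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLMACTC(nn.Module):
    def __init__(self, encoder, n_mels=80, d_model=256, vocab_size=5000):
        super().__init__()
        # Two-block CNN frontend: low-level features + 4x downsampling in time (and frequency)
        self.frontend = nn.Sequential(
            nn.Conv2d(1, d_model, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(d_model, d_model, 3, stride=2, padding=1), nn.GELU())
        self.proj_in = nn.Linear(d_model * ((n_mels + 3) // 4), d_model)
        self.encoder = encoder                          # e.g. a stack of ConMambaLayer blocks
        self.head = nn.Linear(d_model, vocab_size)      # vocabulary incl. blank / BOS / EOS

    def forward(self, feats):                           # feats: (batch, time, n_mels) log-Mels
        x = self.frontend(feats.unsqueeze(1))           # (batch, d_model, time/4, n_mels/4)
        x = x.permute(0, 2, 1, 3).flatten(2)            # (batch, time/4, d_model * n_mels/4)
        x = self.encoder(self.proj_in(x))               # contextual representations
        return F.log_softmax(self.head(x), dim=-1)      # CTC-ready log-probabilities

# Training step (illustrative): CTC loss expects (time, batch, vocab) log-probs.
# log_probs = model(feats).transpose(0, 1)
# loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
```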
Our experiments leverage five large-scale speech corpora: LibriSpeech (clean subsets) [30], CommonVoice v20.0 [34], VoxPopuli-ASR [35], Multilingual LibriSpeech [36], and FLEURS [37]. We consider six languages, spanning over 11,000 hours of labeled speech data: English (en), Italian (it), French (fr), Spanish (es), German (de), and Dutch (nl). This collection combines read and semi-spontaneous speech, ensuring broad linguistic and acoustic diversity across the languages. The number of training hours for each language and dataset is summarized in Table 2.
| Dataset (#hours) | en | it | fr | es | de | nl |
|---|---|---|---|---|---|---|
| LS | 464 | – | – | – | – | – |
| CV v20.0 | 1774 | 249 | 829 | 499 | 947 | 46 |
| VP-ASR | 522 | 78 | 206 | 152 | 264 | 46 |
| MLS | – | 247 | 1077 | 918 | 1967 | 1554 |
| FL | 7.5 | 9.0 | 10.3 | 8.8 | 9.0 | 7.7 |
| Total (excl. FL) | 2760 | 574 | 2112 | 1569 | 3178 | 1646 |
To assess the effectiveness of ConMamba, we compare its performance against a Conformer model [32] with 18 encoder layers and hidden size 256 (more details on the Conformer training hyperparameters are reported in³). As a reference, we also include several very large-scale multilingual models (OWSM v3.1 [38], OWSM-CTC [39], FAMA [17], and Whisper-Large-v3 [12]), although the comparison is not fair due to differences in model and training size and in decoding mechanisms. We evaluate performance in monolingual (en), bilingual (en, it; including ablation studies), and multilingual settings. For the latter, we consider both in-domain and out-of-domain data.
Table 3 compares the performance of ConMamba and Conformer when both are trained from scratch on Libri-1000. Note that the two models have a similar number of parameters. ConMamba consistently outperforms the Conformer baseline, achieving lower WER on both the test-clean and test-other sets. This indicates that the ConMamba architecture offers improved robustness and generalization over the standard Conformer design.
| Model | #Param (M) | test-clean WER (%) \(\downarrow\) | test-other WER (%) \(\downarrow\) |
|---|---|---|---|
| Conformer | 28.8 | 4.27 | 11.29 |
| ConMamba | 31.6 | 4.05 | 10.50 |
In Table 4 we report the performance in a bilingual setting covering English and Italian. This experiment allows us to compare not only ConMamba and Conformer but also other large-scale multilingual models, relying on their published results. We observe that ConMamba maintains strong performance in both English and Italian, providing consistent improvements over Conformer and generalizing effectively to multilingual and less curated speech datasets. The table also includes four very large-scale multilingual ASR models. Although MLMA is, unsurprisingly, less accurate given its smaller size, smaller training set, and simpler training recipe, it is not far behind those models.
| Model (WER %, \(\downarrow\)) | EN LS | EN CV | EN VP | IT CV | IT VP | IT MLS |
|---|---|---|---|---|---|---|
| ConMamba-CTC \(^a\) | 3.6 | 18.8 | 10.7 | 11.4 | 24.8 | 13.4 |
| Conformer-CTC \(^b\) | 4.4 | 22.3 | 11.5 | 14.3 | 23.7 | 14.3 |
| FAMA \(^c\) [17] | - | 13.8 | 8.9 | 7.3 | 15.7 | 12.6 |
| OWSM v3.1 \(^d\) [38] | - | 11.9 | 8.4 | 12.5 | 24.0 | 19.3 |
| OWSM-CTC \(^d\) [39] | 2.4 | 12.1 | 8.6 | - | - | 22.1 |
| Whisper-Large-v3 \(^e\) [12] | - | 11.2 | 7.1 | 6.5 | 18.8 | 8.8 |
\(^a\) ConMamba-CTC (31.6M params, 3334 h). \(^b\) Conformer-CTC (28.8M params, 3334 h). \(^c\) FAMA (475M params, 150K h). \(^d\) OWSM models (1020M params, 180K h). \(^e\) Whisper-Large-v3 (1550M params, 5M h).
Finally, in Table 5 we evaluate the performance of the full multilingual MLMA model, which covers six languages and is trained on over 11,840 hours of speech data. Overall, across the in-domain datasets, MLMA delivers consistent multilingual performance, effectively handling the linguistic and acoustic variability of the training corpora. While performance naturally varies by language, the results indicate stable recognition capabilities across all languages. Importantly, evaluation on the unseen FLEURS benchmark further demonstrates that MLMA retains competitive performance under out-of-domain conditions, highlighting its robustness and supporting its potential as a strong foundation for multilingual ASR. Additionally, the results on the MLS dataset show that MLMA outperforms the OWSM-CTC foundation model.
| Dataset (WER %, \(\downarrow\)) | EN | IT | FR | ES | DE | NL |
|---|---|---|---|---|---|---|
| LS | 7.2 | – | – | – | – | – |
| CV | 23.2 | 13.0 | 15.0 | 11.2 | 12.9 | 16.8 |
| VP | 11.5 | 24.5 | 14.8 | 12.9 | 16.1 | 21.5 |
| MLS | – | 13.3 | 9.1 | 6.5 | 9.5 | 14.8 |
| FL* | 19.2 | 12.5 | 19.6 | 10.6 | 15.4 | 27.9 |
| Avg. | 15.2 | 15.8 | 14.6 | 10.3 | 13.5 | 20.3 |
| OWSM-CTC | | | | | | |
| MLS | – | 22.1 | 12.9 | 10.3 | 11.9 | 20.4 |
We conclude the paper with an analysis of the impact of model size and training data amount on MLMA models in bilingual ASR. Model size: Table 6 shows the performance on CV English and Italian when scaling ConMamba from 31.6M to 42M parameters, highlighting that the model benefits from increased capacity without compromising efficiency. In particular, the larger model shows a significant WER reduction on English, a language with rich phonetic diversity and complex prosody. This suggests that ConMamba can use additional parameters to refine its modeling of nuanced acoustic and linguistic patterns and to generalize to less curated datasets.
| Model | #Param (M) | EN WER (%) \(\downarrow\) | IT WER (%) \(\downarrow\) |
|---|---|---|---|
| ConMamba | 31.6 | 23.00 | 10.92 |
| ConMamba | 42.0 | 21.04 | 10.42 |
Table 7 reports the WER for English and Italian while increasing the amount of training material from 710 hours to 3334 hours. The results show that increasing the size of the training data improves the bilingual ConMamba model for both languages. We observe consistent in-domain improvements, particularly on the CV and VP subsets, along with notable out-of-domain gains on the English portion of the FL benchmark. This trend is not observed for Italian, likely due to the increased imbalance between the two languages.
This highlights that more data boosts both in-domain performance and out-of-domain robustness, confirming the scalability of the ConMamba architecture.
| Hrs. | EN LS | EN CV | EN VP | EN FL | IT CV | IT VP | IT MLS | IT FL |
|---|---|---|---|---|---|---|---|---|
| 710 | 5.3 | 56.5 | 32.8 | 35.4 | 11.7 | 34.8 | 30.8 | 10.2 |
| 1210 | 3.9 | 47.1 | 24.5 | 26.3 | 11.9 | 34.9 | 31.8 | 10.6 |
| 3334 | 3.6 | 18.8 | 10.7 | 14.9 | 11.4 | 24.8 | 13.4 | 13.1 |
In this work, we introduced MLMA, a multilingual ASR framework built upon the Mamba state-space architecture, enhanced with language-aware conditioning and shared representations. Through evaluations on standard multilingual benchmarks, MLMA demonstrated competitive recognition performance relative to Conformer-based models, while offering significantly faster inference. These findings underscore the potential of state-space models as efficient and scalable alternatives for multilingual ASR, particularly in scenarios involving both high- and low-resource languages. MLMA represents a promising step toward practical ASR systems capable of real-time processing and broad linguistic coverage.