October 21, 2025
Multilingual automatic speech recognition (ASR) remains a challenging task, especially when balancing performance across high- and low-resource languages. Recent advances in sequence modeling suggest that architectures beyond Transformers may offer better scalability and efficiency. In this work, we introduce MLMA (Multilingual Language Modeling with Mamba for ASR), a new approach that leverages the Mamba architecture—an efficient state-space model optimized for long-context sequence processing—for multilingual ASR. Using Mamba, MLMA implicitly incorporates language-aware conditioning and shared representations to support robust recognition across diverse languages. Experiments on standard multilingual benchmarks show that MLMA achieves competitive performance compared to Transformer-based architectures. These results highlight Mamba’s potential as a strong backbone for scalable, efficient, and accurate multilingual speech recognition.
Multilingual ASR, State Space Models, Mamba
Automatic Speech Recognition (ASR) has become a cornerstone of modern computing, supporting applications such as voice assistants, transcription services, and real-time speech translation. Driven by large-scale datasets and deep learning advances, ASR systems have reached near-human performance in high-resource languages like English and Mandarin [1], [2]. However, most existing systems are language-specific, which limits scalability and exacerbates the performance gap for low-resource languages with limited annotated data [3].
Multilingual ASR has emerged as a promising alternative by training a single model across multiple languages [4], [5]. Such models exploit shared phonetic and acoustic representations, enabling cross-lingual transfer from high-resource to under-represented languages. Despite this potential, achieving robust multilingual performance remains challenging. Transformer-based architectures, now dominant in ASR [6], [7], provide strong sequence modeling capabilities but with high computational and memory costs. These inefficiencies are especially problematic in multilingual scenarios, where diverse speech rates, prosodic patterns, and phenomena such as code-switching demand processing of long and variable-length utterances.
Recently, the Mamba architecture [8] has been proposed to handle variable-length input sequences and temporal irregularities, which are common in multilingual speech data. It can therefore generalize across languages with different rhythmic and phonetic structures. Mamba also supports streaming ASR through mechanisms such as lookahead and unimodal aggregation (UMA), which help it adapt to real-time multilingual input [9]. These features are particularly beneficial for languages characterized by rapid speech transitions or tonal variations, where conventional models often struggle to maintain recognition accuracy and latency.
Integrating Mamba into multilingual ASR offers several advantages: its memory-efficient design lowers training and inference costs [10]; its sequential inductive bias can better capture cross-lingual phonetic structures, benefiting code-switching and low-resource languages; and its scalability enables adding languages without proportional computational overhead. In this work, we investigate the application of Mamba-based architectures to multilingual ASR. We conduct experiments across a diverse set of European languages, analyzing the models' ability to learn shared linguistic representations by comparing their performance against Transformer-based baselines, as well as their robustness to multilingual challenges. Our goal is to bridge the gap between recent advances in efficient sequence modeling and the development of inclusive, scalable ASR systems that can serve a truly global user base. Our MLMA model, trained on almost 12K hours covering 6 languages, is the first multilingual ASR system based on Mamba. MLMA code and weights are publicly available under the most permissive license.
With the advent of deep learning, multilingual ASR systems, capable of recognizing multiple languages, have grown in demand for cross-lingual use [11]. Recent work in multilingual ASR has drastically increased language coverage to support hundreds and even thousands of languages. This includes approaches based on labeled training data, such as Whisper [12], USM [13], Seamless [14], MMS [15], ML-SUPERB 2.0 [16], and FAMA [17], as well as zero-shot work [18]. While these Transformer-based approaches are highly effective at modeling long-range dependencies, Transformers have notable drawbacks: their quadratic complexity makes long-sequence processing costly, they require vast amounts of labeled or weakly labeled data that are scarce for low-resource languages, and their large size limits their deployment in resource-constrained settings [19].
Mamba has been applied to various speech tasks, for example separation and enhancement [20]–[22], leveraging its linear-time complexity to model long sequences at low computational cost. Motivated by this, numerous studies have evaluated Mamba's performance on ASR tasks. Table 1 summarizes the most recent research papers leveraging Mamba for ASR, highlighting the main contribution of each work.
Based on the literature review in Table 1, current Mamba-based ASR research exhibits significant limitations. Existing studies mainly investigate architectural replacements within Transformer backbones, but are restricted to monolingual or at most bilingual settings on small datasets like LibriSpeech-100 [30]. These works operate under matched conditions and lack the scale to test Mamba’s multilingual effectiveness. In contrast, our proposed MLMA model explores Mamba in a large-scale multilingual setting with nearly 12,000 hours of training across six European languages, representing the first multilingual ASR system based on Mamba and demonstrating its viability beyond constrained setups.
Mamba is a Structured State Space Model (SSM) defined in discrete time as: \[h_{t} = \bar{A}h_{t-1} + \bar{B}x_{t}, \quad y_{t}=Ch_{t}\] where \(x_{t}\) is the input, \(h_{t}\) the state, \(\bar{A}\) the transition matrix, \(\bar{B}\) the input-state interaction, and \(C\) the output map.
Since \(\bar{A}\) and \(\bar{B}\) derive from continuous-time parameters, they are not learned directly but approximated via Zero-Order Hold (ZOH): \[\bar{A}=\exp(\Delta A), \quad \bar{B}=(\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B\] with \(A, B\) the continuous forms and \(\Delta\) the discretization step. Training is done on \(A, B\), which are converted to \(\bar{A}, \bar{B}\) at each forward pass, enabling efficient discrete-time modeling while preserving long-range dependencies. ZOH ensures that the temporal structure of the continuous-time model is retained after discretization, allowing it to track dependencies across long sequences. To increase adaptability, [31] introduced a selection mechanism: \[B=f_{B}(x), \quad C=f_{C}(x), \quad \Delta= \text{Broadcast}_{D}(f_{\Delta}(x))\] that is, instead of using fixed matrices \(B,C,\Delta\), the model learns functions \(f_B, f_C, f_{\Delta}\), that generate these parameters based on the input \(x\), allowing the model to flexibly adapt its state transition and output mapping according to the current input. Mamba [8] extends this idea by removing the Linear Time Invariance (LTI) constraint, allowing parameters to vary over time. This improves flexibility in non-stationary environments and strengthens modeling of long-range, context-dependent behaviors.
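To make the recurrence concrete, below is a minimal PyTorch sketch (not the optimized Mamba kernel) of the discretization and the sequential scan \(h_t = \bar{A}h_{t-1} + \bar{B}x_t\), \(y_t = Ch_t\) with input-dependent \(B\), \(C\), and \(\Delta\). It adopts the common diagonal-\(A\) simplification \(\bar{B}\approx\Delta B\) in place of the exact ZOH expression and a plain Python loop instead of the parallel selective scan; all shapes and names (`d_inner`, `d_state`, ...) are illustrative assumptions.

```python
# Minimal, readable sketch of a selective SSM step (assumed shapes, not the CUDA kernel).
import torch

def zoh_discretize(A, delta, B):
    # A: (d_inner, d_state) continuous-time diagonal transition (negative entries)
    # delta: (batch, seq, d_inner) input-dependent step; B: (batch, seq, d_state)
    dA = torch.exp(delta.unsqueeze(-1) * A)            # A_bar = exp(Delta * A)
    dB = delta.unsqueeze(-1) * B.unsqueeze(2)          # B_bar ~= Delta * B (simplified ZOH)
    return dA, dB                                      # both: (batch, seq, d_inner, d_state)

def selective_scan(x, A, B, C, delta):
    # x: (batch, seq, d_inner); B, C, delta would come from the selection
    # functions f_B(x), f_C(x), f_Delta(x), e.g. learned linear projections of x.
    dA, dB = zoh_discretize(A, delta, B)
    h = x.new_zeros(x.size(0), x.size(2), A.size(-1))  # state h_0 = 0
    ys = []
    for t in range(x.size(1)):
        h = dA[:, t] * h + dB[:, t] * x[:, t].unsqueeze(-1)   # h_t = A_bar h_{t-1} + B_bar x_t
        ys.append((h * C[:, t].unsqueeze(1)).sum(-1))         # y_t = C h_t
    return torch.stack(ys, dim=1)                      # (batch, seq, d_inner)
```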
For ASR, and speech processing in general, extracting both local and global features is crucial. Models such as Conformer [32] and Zipformer [33] achieve this by combining convolution (local) with self-attention (global). The ConMamba encoder follows the same principle but replaces multi-head self-attention with Mamba layers, while retaining convolution to strengthen local feature extraction. For a generic input \(x\), a ConMamba encoder produces output embeddings \(y\) as: \[\begin{aligned} \tilde{x} &= x + \tfrac{1}{2}\,\mathrm{FFN}(x), & x' &= \tilde{x} + \mathrm{Mamba}(\tilde{x}),\\ x'' &= x' + \mathrm{Conv}(x'), & y &= \mathrm{LayerNorm}\!\left(x'' + \tfrac{1}{2}\,\mathrm{FFN}(x'')\right), \end{aligned}\] where FFN is a feed-forward module and Conv is a convolutional module that extracts local patterns. Note that the outputs of the Mamba and Conv layers are accumulated through residual connections, and the result is summed with half of the output of a second FFN before layer normalization. This hybrid design enables effective integration of local and global features for speech representation.
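As an illustration, the following minimal PyTorch sketch implements one ConMamba encoder layer following the update rules above. The bidirectional Mamba block and the convolution module are assumed to be supplied externally (e.g., a Bi-Mamba layer and a Conformer-style convolution), and all names are illustrative rather than taken from the released implementation.

```python
import torch.nn as nn

def ffn(d_model, ffn_dim, dropout=0.1):
    # Pre-norm feed-forward module (illustrative layout)
    return nn.Sequential(nn.LayerNorm(d_model), nn.Linear(d_model, ffn_dim),
                         nn.GELU(), nn.Dropout(dropout), nn.Linear(ffn_dim, d_model))

class ConMambaLayer(nn.Module):
    def __init__(self, d_model, ffn_dim, mamba_block, conv_module):
        super().__init__()
        self.ffn1 = ffn(d_model, ffn_dim)
        self.ffn2 = ffn(d_model, ffn_dim)
        self.mamba = mamba_block   # replaces multi-head self-attention (global context)
        self.conv = conv_module    # convolution module (local patterns)
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):                              # x: (batch, time, d_model)
        x = x + 0.5 * self.ffn1(x)                     # x_tilde = x + 1/2 FFN(x)
        x = x + self.mamba(x)                          # x' = x_tilde + Mamba(x_tilde)
        x = x + self.conv(x)                           # x'' = x' + Conv(x')
        return self.norm(x + 0.5 * self.ffn2(x))       # y = LayerNorm(x'' + 1/2 FFN(x''))
```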
The proposed MLMA model, as depicted in Figure 1, follows the architecture introduced in [20], which integrates a convolutional transformer with a bidirectional Mamba module (Bi-Mamba) within a CTC framework. Input audio is converted to 80-dimensional log Mel filter banks, normalized, and processed by a two-block CNN for low-level feature extraction and temporal downsampling. An 18-layer Transformer encoder (hidden size 256, feed-forward 1024, dropout 0.1, GELU) models contextual representations, augmented with a Bi-Mamba module (dstate=16, expand=2, dconv=4) to capture long-range dependencies. A linear projection followed by LogSoftmax maps encoder outputs to the vocabulary (including blank, BOS, and EOS tokens), and training is performed with the CTC loss. Note that in our experiments we use the same hyperparameters reported in¹. More details on the training hyperparameters, along with our implementation, are available in the public repository².
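For clarity, the sketch below outlines the CTC pipeline described above (CNN frontend with downsampling, encoder, linear projection with LogSoftmax, CTC loss). Only the values stated in the text (80 Mel bins, hidden size 256, a blank-augmented vocabulary) are taken from the paper; the exact frontend layout, `vocab_size`, and module names are illustrative assumptions, not the public implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLMACTC(nn.Module):
    def __init__(self, encoder, n_mels=80, d_model=256, vocab_size=5000):
        super().__init__()
        # Two-block CNN frontend: low-level features + 4x downsampling in time (and frequency)
        self.frontend = nn.Sequential(
            nn.Conv2d(1, d_model, 3, stride=2, padding=1), nn.GELU(),
            nn.Conv2d(d_model, d_model, 3, stride=2, padding=1), nn.GELU())
        self.proj_in = nn.Linear(d_model * ((n_mels + 3) // 4), d_model)
        self.encoder = encoder                          # e.g. a stack of ConMambaLayer blocks
        self.head = nn.Linear(d_model, vocab_size)      # vocabulary incl. blank / BOS / EOS

    def forward(self, feats):                           # feats: (batch, time, n_mels) log-Mels
        x = self.frontend(feats.unsqueeze(1))           # (batch, d_model, time/4, n_mels/4)
        x = x.permute(0, 2, 1, 3).flatten(2)            # (batch, time/4, d_model * n_mels/4)
        x = self.encoder(self.proj_in(x))               # contextual representations
        return F.log_softmax(self.head(x), dim=-1)      # CTC-ready log-probabilities

# Training step (illustrative): CTC loss expects (time, batch, vocab) log-probs.
# log_probs = model(feats).transpose(0, 1)
# loss = F.ctc_loss(log_probs, targets, input_lengths, target_lengths, blank=0)
```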
Our experiments leverage five large-scale speech corpora: LibriSpeech (clean subsets) [30], CommonVoice v20.0 [34], VoxPopuli-ASR [35], Multilingual LibriSpeech [36], and FLEURS [37]. We consider six languages, spanning over 11,000 hours of labeled speech data: English (en), Italian (it), French (fr), Spanish (es), German (de), and Dutch (nl). This collection combines read and semi-spontaneous speech, ensuring broad linguistic and acoustic diversity across the languages. The number of training hours for each language and dataset is summarized in Table 2.
| Dataset (#hours) | en | it | fr | es | de | nl |
|---|---|---|---|---|---|---|
| LS | 464 | – | – | – | – | – |
| CV v20.0 | 1774 | 249 | 829 | 499 | 947 | 46 |
| VP-ASR | 522 | 78 | 206 | 152 | 264 | 46 |
| MLS | – | 247 | 1077 | 918 | 1967 | 1554 |
| FL | 7.5 | 9.0 | 10.3 | 8.8 | 9.0 | 7.7 |
| Total (excl. FL) | 2760 | 574 | 2112 | 1569 | 3178 | 1646 |
To assess the effectiveness of ConMamba, we compare its performance against a Conformer model [32] with 18 encoder layers and hidden size 256 (more details on the Conformer training hyperparameters are reported in³). As a reference, we also include several very large-scale multilingual models (OWSM v3.1 [38], OWSM-CTC [39], FAMA [17], and Whisper-Large-v3 [12]), although the comparison is not fair due to differences in model and training size and in decoding mechanisms. We evaluate performance in monolingual (en), bilingual (en, it; including ablation studies), and multilingual settings. For the latter, we consider both in-domain and out-of-domain data.
Table 3 compares the performance of ConMamba and Conformer when both are trained from scratch on Libri-1000. Note that the two models have a similar number of parameters. ConMamba consistently outperforms the Conformer baseline, achieving lower WER on both the test-clean and test-other sets. This indicates that the ConMamba architecture offers improved robustness and generalization over the standard Conformer design.
| Model | #Param (M) | test-clean WER (%) \(\downarrow\) | test-other WER (%) \(\downarrow\) |
|---|---|---|---|
| Conformer | 28.8 | 4.27 | 11.29 |
| ConMamba | 31.6 | 4.05 | 10.50 |
In Table 4 we report the performance in a bilingual setting covering English and Italian. This experiment allows us to compare not only ConMamba and Conformer but also other large-scale multilingual models, relying on their published results. We observe that ConMamba maintains strong performance in both English and Italian, providing consistent improvements over Conformer and generalizing effectively to multilingual and less curated speech datasets. The table also includes four very large-scale multilingual ASR models. Although MLMA is, unsurprisingly, less accurate given its smaller size, smaller training set, and simpler training recipe, it is not far behind those models.
| Model (WER %, \(\downarrow\)) | EN LS | EN CV | EN VP | IT CV | IT VP | IT MLS |
|---|---|---|---|---|---|---|
| ConMamba-CTC \(^a\) | 3.6 | 18.8 | 10.7 | 11.4 | 24.8 | 13.4 |
| Conformer-CTC \(^b\) | 4.4 | 22.3 | 11.5 | 14.3 | 23.7 | 14.3 |
| FAMA \(^c\) [17] | - | 13.8 | 8.9 | 7.3 | 15.7 | 12.6 |
| OWSM v3.1 \(^d\) [38] | - | 11.9 | 8.4 | 12.5 | 24.0 | 19.3 |
| OWSM-CTC \(^d\) [39] | 2.4 | 12.1 | 8.6 | - | - | 22.1 |
| Whisper-Large-v3 \(^e\) [12] | - | 11.2 | 7.1 | 6.5 | 18.8 | 8.8 |
\(^a\) ConMamba-CTC (31.6M params, 3334 h). \(^b\) Conformer-CTC (28.8M params, 3334 h). \(^c\) FAMA (475M params, 150K h). \(^d\) OWSM models (1020M params, 180K h). \(^e\) Whisper-Large-v3 (1550M params, 5M h).
Finally, in Table 5 we evaluate the performance of the full multilingual MLMA model, which covers six languages and is trained on over 11,840 hours of speech data. Overall, across the in-domain datasets, MLMA delivers consistent multilingual performance, effectively handling the linguistic and acoustic variability of the training corpora. While performance naturally varies by language, the results indicate stable recognition capabilities across all languages. Importantly, evaluation on the unseen FLEURS benchmark further demonstrates that MLMA retains competitive performance under out-of-domain conditions, highlighting its robustness and supporting its potential as a strong foundation for multilingual ASR. Additionally, the results on the MLS dataset show that MLMA outperforms the OWSM-CTC foundation model.
| Dataset (WER %, \(\downarrow\)) | EN | IT | FR | ES | DE | NL |
|---|---|---|---|---|---|---|
| LS | 7.2 | – | – | – | – | – |
| CV | 23.2 | 13.0 | 15.0 | 11.2 | 12.9 | 16.8 |
| VP | 11.5 | 24.5 | 14.8 | 12.9 | 16.1 | 21.5 |
| MLS | – | 13.3 | 9.1 | 6.5 | 9.5 | 14.8 |
| FL* | 19.2 | 12.5 | 19.6 | 10.6 | 15.4 | 27.9 |
| Avg. | 15.2 | 15.8 | 14.6 | 10.3 | 13.5 | 20.3 |
| OWSM-CTC | | | | | | |
| MLS | – | 22.1 | 12.9 | 10.3 | 11.9 | 20.4 |
We conclude the paper with an analysis of the impact of model size and training data amount on MLMA models in bilingual ASR. Model size: Table 6 shows the performance on CV English and Italian when scaling ConMamba from 31.6M to 42M parameters, highlighting that the model benefits from increased capacity without compromising efficiency. In particular, the larger model shows a significant WER reduction on English, a language with rich phonetic diversity and complex prosody. This suggests that ConMamba can use additional parameters to refine its modeling of nuanced acoustic and linguistic patterns and to generalize to less curated datasets.
| Model | #Param (M) | EN WER (%) \(\downarrow\) | IT WER (%) \(\downarrow\) |
|---|---|---|---|
| ConMamba | 31.6 | 23.00 | 10.92 |
| ConMamba | 42.0 | 21.04 | 10.42 |
Table 7 reports the WER for English and Italian while increasing the amount of training material from 710 hours to 3334 hours. The results show that increasing the size of the training data improves the bilingual ConMamba model for both languages. We observe consistent in-domain improvements, particularly on the CV and VP subsets, along with notable out-of-domain gains on the English portion of the FL benchmark. This trend is not observed for Italian, likely due to the increased imbalance between the two languages.
This highlights that more data boosts both in-domain performance and out-of-domain robustness, confirming the scalability of the ConMamba architecture.
| Hrs. | EN LS | EN CV | EN VP | EN FL | IT CV | IT VP | IT MLS | IT FL |
|---|---|---|---|---|---|---|---|---|
| 710 | 5.3 | 56.5 | 32.8 | 35.4 | 11.7 | 34.8 | 30.8 | 10.2 |
| 1210 | 3.9 | 47.1 | 24.5 | 26.3 | 11.9 | 34.9 | 31.8 | 10.6 |
| 3334 | 3.6 | 18.8 | 10.7 | 14.9 | 11.4 | 24.8 | 13.4 | 13.1 |
In this work, we introduced MLMA, a multilingual ASR framework built upon the Mamba state-space architecture, enhanced with language-aware conditioning and shared representations. Through evaluations on standard multilingual benchmarks, MLMA demonstrated competitive recognition performance relative to Conformer-based models, while offering significantly faster inference. These findings underscore the potential of state-space models as efficient and scalable alternatives for multilingual ASR, particularly in scenarios involving both high- and low-resource languages. MLMA represents a promising step toward practical ASR systems capable of real-time processing and broad linguistic coverage.