December 19, 2024
Building a universal multilingual automatic speech recognition (ASR) model that performs equitably across languages has long been a challenge due to its inherent difficulties. To address this task we introduce a Language-Agnostic Multilingual ASR pipeline through orthography Unification and language-specific Transliteration (LAMA-UT). LAMA-UT operates without any language-specific modules and matches the performance of state-of-the-art models while being trained on only a minimal amount of data. Our pipeline consists of two key steps. First, we utilize a universal transcription generator to unify orthographic features into Romanized form and capture common phonetic characteristics across diverse languages. Second, we utilize a universal converter to transform these universal transcriptions into language-specific ones. In experiments, we demonstrate the effectiveness of our proposed method, which leverages universal transcriptions for massively multilingual ASR. Our pipeline achieves a relative error reduction rate of 45% when compared to Whisper and performs comparably to MMS, despite being trained on only 0.1% of Whisper’s training data. Furthermore, our pipeline does not rely on any language-specific modules, yet it performs on par with zero-shot ASR approaches that utilize additional language-specific lexicons and language models. We expect this framework to serve as a cornerstone for flexible multilingual ASR systems that are generalizable even to unseen languages.
Developing a model for multilingual automatic speech recognition (ASR) is appealing due to its applicability across languages, including low-resource or unseen ones. However, this task presents significant challenges, as it requires extensive datasets and involves the complexity of capturing shared characteristics across diverse languages in both the phonetic and orthographic domains.
Since the revolution in ASR technologies driven by self-supervised learning (SSL) models [1]–[4], monolingual ASR has achieved superhuman transcription performance, shifting the main focus of recent research towards developing a universal model that spans multiple languages.
There are two primary methods for building a multilingual ASR model. The first approach involves scaling the size of both the labeled dataset and the model itself, using a single universal model to enhance its capacity and cover a vast number (100+) of languages, thereby achieving multilingual ASR [5]. Another approach involves incorporating language-specific modules into the universal model to address the performance inconsistencies of previous methods. For example, MMS [6] demonstrated the feasibility of scaling multilingual technology to over 1,000 languages by leveraging common features across languages and adding language-specific modules to improve the performance of each language. There have also been efforts to integrate both methods [7], combining their strengths to build a more robust and versatile model.
Although these works have demonstrated strong performance across various languages, the trade-off between performance and complexity remains a substantial challenge. The first method, using a single universal pipeline, struggles to achieve consistent performance across languages, and its effectiveness in low-resource languages remains uncertain. On the other hand, despite achieving state-of-the-art performance and parameter efficiency, the second method cannot be considered a single universal model due to the inclusion of language-specific modules. Moreover, the use of language-specific modules like adapters, heads, and language models (LMs) sometimes complicates the training and inference pipeline, suggesting potential areas for future improvement.
Simultaneously, large language models (LLMs) have garnered considerable attention for their remarkable capabilities in the natural language processing (NLP) domain. Following this trend, ASR pipelines have integrated audio SSL models as encoders and LLMs as decoders to enhance transcription quality [8], [9]. These approaches involve using projectors [10] or fine-tuning strategies [11], [12] to align modalities and improve transcription capabilities across multilingual datasets. Subsequently, shallow fusion [13], [14] based scoring methods [15], [16] were attempted to replace conventional LMs with LLMs during the decoding stage. Despite these efforts resulting in performance gain across various languages, a comprehensive method to fully leverage the diverse emergent abilities of LLMs remains to be developed.
In this paper, we introduce a novel language-agnostic multilingual ASR pipeline that spans over 100 languages, including completely unseen languages. As shown in Fig 1, the proposed pipeline consists of two phases: universal transcription generation and language-specific transliteration. In the universal transcription generation phase, we focus on reducing orthographic complexity by unifying diverse orthographic systems into a consistent format that approximates phonetic features across multiple languages. In the language-specific transliteration phase, we regard the transformation from universal transcriptions to language-specific ones as a transliteration task handled by a universal converter. Our experiments demonstrate the notable transcription performance of LAMA-UT across over 100 languages while using significantly less training data (only 680 hours) than other state-of-the-art multilingual ASR models. Furthermore, our proposed pipeline outperforms previous methods, especially in low-resource languages, and demonstrates proficiency in completely unseen languages, achieving performance comparable to existing language-agnostic ASR methods without relying on any language-specific modules. Our contributions are summarized as follows:
- We propose a novel language-agnostic multilingual ASR pipeline consisting of two phases: universal transcription generation and language-specific transliteration.
- We enable our proposed pipeline to perform multilingual ASR with minimal data by unifying diverse orthographic systems through Romanization.
- Our pipeline demonstrates consistent performance across over 100 seen languages and excels with completely unseen languages, all without relying on any language-specific modules or additional fine-tuning.
Initially, multilingual ASR models handled a limited number of languages (under 60) [17], [18]. However, recent advancements have led to the development of models capable of managing a broader range of languages. Whisper [5] uses a sequence-to-sequence [19] approach with 680,000 hours of weakly supervised data, and its neural decoder serves as an LM, enhancing transcription performance. With this method, Whisper attained impressive performance across most supported languages.
Google USM [7] employs a Conformer [20] encoder with various types of heads [21], [22] and is trained on an extensive dataset. It also employs a three-stage training incorporating speech-only, speech-text paired, and text-only data. Furthermore, to enhance transcription performance for low-resource languages, USM integrates language-specific adapters and employs Noisy Student Training (NST) techniques [23], [24].
MMS [6], a state-of-the-art multilingual ASR model, employed a Connectionist Temporal Classification (CTC) based approach [25] on a dataset covering over 1,000 languages. It utilizes a two-stage fine-tuning pipeline. The first stage involves Romanization-based fine-tuning to learn a global representation across diverse languages. In the second stage, language-specific adapters and heads are added to capture detailed features for each language and fine-tuned.
ASR-2K [26] is a zero-shot ASR model which utilizes three universal models to cover a range of languages: an acoustic model [27], a pronunciation model [28], and a LM [29]. This suggests the potential for a universal multilingual ASR model capable of functioning in a zero-shot environment without relying on any language-specific components. Consequently, Zero-Shot MMS [30] utilized language-specific lexicon and n-gram LMs in the decoding phase to enhance zero-shot transcription performance.
[15] trained a multilingual LLM covering 84 languages and employed a shallow fusion-based per-frame scoring to enhance transcription quality in multilingual ASR. Subsequently, [16] introduced non-autoregressive per-segment scoring, which improves transcription performance and reduces the computational burden. These methods primarily leveraged the strengths of a multilingual acoustic model (USM) and achieved further accuracy by incorporating LLMs into the decoding step.
The overall structure of the proposed multilingual ASR pipeline, LAMA-UT, comprises a universal transcription generation phase and a language-specific transliteration phase, as shown in Fig 1. We produce universal transcriptions by fine-tuning an audio encoder with an additional classification layer. Next, we manually select a prompt type from a predefined dictionary and combine it with language information to generate the input prompt for the universal converter. Finally, by feeding this input prompt into the universal converter, we convert the universal transcription into its language-specific counterpart.
Previous studies in linguistics [31], [32] have shown that the phonological characteristics of human speech are constrained by a limited range of sounds due to the anatomical structure of the vocal tract. Similarly, in the ASR domain, prior research [33] has empirically demonstrated that the primary obstacle in multilingual ASR is the orthographic complexity across languages. Through the integration of these two insights, we aim to unify orthographic systems across diverse languages by standardization of notations into a Latin character system. This approach establishes alignment between phonetic and orthographic features through a unified transcription system. As a result, we develop a universal transcription generator capable of producing consistent transcriptions across multilingual speech corpora, including unseen languages.
The first method for orthography unification is to use the International Phonetic Alphabet (IPA). IPA is a phonetic notation system that includes four elements: consonants, vowels, diacritics, and suprasegmentals. IPA can precisely transcribe pronunciations in a consistent format with a combination of the four elements. However, the IPA poses challenges, especially in vocabulary mapping. One possible solution is to treat a combination of elements as a single token (e.g., ts, dz, etc.), but the vast diversity of possible combinations makes this approach difficult to implement. Conversely, treating each IPA character as a distinct token introduces another issue: characters without phonetic value must be mapped to specific frames, as shown in Fig 2. Since diacritics and suprasegmentals provide detailed information about pronunciation (e.g., length, tone, and stress) but do not carry specific phonetic values, mapping them to distinct frames can introduce confusion during the training process.
| Prompt Type | Prompt |
|---|---|
| Zero-Shot | Transcribe the following Romanized sentence into a {lang} sentence: {roman}. |
| Few-Shot | Here are some examples of transcribing a Romanized sentence into a {lang} sentence: {shots}. Considering the examples above, transcribe the following Romanized sentence into a {lang} sentence: {roman}. |
| Zero-Shot CoT | Transcribe the following Romanized sentence into a {lang} sentence. Think step by step: {roman}. |
| Few-Shot + Zero-Shot CoT | Here are some examples of transcribing a Romanized sentence into a {lang} sentence: {shots}. Considering the examples above, transcribe the following Romanized sentence into a {lang} sentence. Think step by step: {roman}. |
| Prompt Chaining (2-1) | Transcribe the following Romanized sentence into a {lang} sentence, based on its pronunciation: {roman}. |
| Prompt Chaining (2-2) | Correct the typographical and spacing errors in the following {lang} sentence: {pred}. |
Romanization is an alternative method for orthography unification which involves converting text from various writing systems into Latin script. While Romanization does not preserve phonetic features as precisely as the IPA, it generally retains phonetic information. Additionally, Romanization offers several advantages over the IPA. Romanization standardizes diverse writing systems using the Latin alphabet, which is already employed by the majority of languages. In contrast, IPA requires a specific set of rules for converting the orthography of each language into its IPA representation. Thus, Romanization is more efficient as it only requires conversion for languages that do not use the Latin alphabet. Furthermore, Romanization is advantageous for LLMs, as a large portion of their training data consists of Latin characters. Given these benefits, we adopt Romanization as a method for orthography unification.
Since our goal is to generate a universal transcription with unified orthography, our first approach was to leverage wav2vec2.0-phoneme [34]. However, we found that directly passing phoneme tokens to the universal converter is suboptimal for transliteration, as it generates phoneme-level tokens without accounting for spacing. To address this, we shifted our focus to developing a universal transcription generator that produces character-level tokens while incorporating spacing information. In this context, Romanization provides a universal character-level orthographic representation, effectively reducing the vocabulary size to around 30 tokens compared to IPA. Since Romanization aligns with the common phonetic features preserved across languages, we are also confident in the proposed method’s strong generalization ability for languages not explicitly included in the training data. We selected wav2vec2.0-XLS-R [35] with 1 billion parameters, an SSL model pre-trained on 128 languages, as the audio encoder to leverage the advantages of pre-training on a diverse set of languages. We then attach a classification layer on top and fine-tune both the audio encoder and the classification layer with speech and Romanized transcription pairs to generate universal transcriptions.
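The generator can be assembled from off-the-shelf components. The sketch below is a minimal illustration using the Hugging Face `transformers` implementation of XLS-R; the checkpoint name is the public 1B XLS-R release, while the vocabulary file and special-token choices are illustrative assumptions rather than the authors' exact configuration.

```python
# Sketch: Romanized universal transcription generator (XLS-R 1B + CTC head).
# "vocab.json" is assumed to hold a character-level vocabulary of ~30 Romanized
# symbols plus special tokens; the special-token choices below are illustrative.
from transformers import (Wav2Vec2CTCTokenizer, Wav2Vec2FeatureExtractor,
                          Wav2Vec2ForCTC, Wav2Vec2Processor)

tokenizer = Wav2Vec2CTCTokenizer("vocab.json", unk_token="[UNK]",
                                 pad_token="[PAD]", word_delimiter_token="|")
feature_extractor = Wav2Vec2FeatureExtractor(feature_size=1, sampling_rate=16000,
                                             padding_value=0.0,
                                             return_attention_mask=True)
processor = Wav2Vec2Processor(feature_extractor=feature_extractor, tokenizer=tokenizer)

model = Wav2Vec2ForCTC.from_pretrained(
    "facebook/wav2vec2-xls-r-1b",             # SSL encoder pre-trained on 128 languages
    ctc_loss_reduction="mean",
    pad_token_id=processor.tokenizer.pad_token_id,
    vocab_size=len(processor.tokenizer),      # classification layer over Romanized characters
)
model.freeze_feature_encoder()                 # the CNN feature extractor stays frozen
```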
The next step is to revert the universal transcription, which retains phonetic features, back to its original language-specific form. Since this process involves a text-to-text transformation, we approach it as a transliteration task. Consequently, we focused on the versatility of LLMs which excel in multilingual and multitask benchmarks due to extensive training on diverse text data. Therefore, we aim to utilize LLMs as universal converters to transform Romanized transcriptions into language-specific ones.
While LLMs have brought a tectonic shift to the NLP domain, additional techniques are still needed to fully harness their emergent abilities. In this context, prompt engineering has emerged as a field focused on crafting and refining prompts to effectively utilize LLMs across diverse applications and research areas. To maximize the performance of the inversion process, in the ablation study, we empirically investigated various prompt types: zero-shot, few-shot, zero-shot chain-of-thought (CoT), and prompt chaining, to determine which is the most appropriate for this task.
We transliterate the unified Romanized transcription by leveraging the multilingual and multitask language understanding ability of LLMs without fine-tuning. Since our approach does not require any special fine-tuning, the universal converter can be replaced with any stronger LLM, potentially improving the performance of our proposed pipeline in line with the rapidly advancing capabilities of LLMs. For the implementation in this paper, we utilize LLaMA3-8B, 4-bit quantized LLaMA3-70B [36], and GPT-4o-mini [37] as the universal converter.
FLEURS [38] is a multilingual speech corpus encompassing 102 languages. It provides a relatively small amount of data per language (approximately 12 hours) while ensuring an unbiased distribution of data across the languages. Given our focus on demonstrating effective multilingual ASR with minimal data, we utilize the FLEURS and its official splits for experiments.
CommonVoice [39] is a multilingual speech dataset crowdsourced from speakers of various languages. For unseen languages, we leverage the official test split of 25 languages from CommonVoice 17.0, which offers sufficient samples for evaluation.
We initially applied NFKC normalization and lowercase transformation to the text transcriptions. Subsequently, we excluded samples containing parentheses or numbers from the dataset for the following reasons: parentheses and digits in transcriptions introduced ambiguity, as some enclosed phrases were pronounced while others were not, and digits had one-to-many pronunciation mappings across languages (e.g. ‘1’ can be pronounced as ‘one’, ‘eins’, ‘uno’, ‘yi’, etc.). Finally, we utilized the Python library Uroman [40] to obtain Romanized transcription and Phonemizer [41] for IPA transcription. For Japanese, we employed Pykakasi [42] due to the limitation of Uroman, which treats Japanese kanji as Chinese characters. Following these preprocessing steps, we obtained approximately 6 to 8 hours of speech-transcription paired data per language on average.
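A condensed sketch of this preprocessing is shown below. The filtering and normalization steps follow the description above; the `romanize` hook is a placeholder where Uroman (or Pykakasi for Japanese, or Phonemizer for IPA targets) would be plugged in, and the helper names are our own.

```python
import re
import unicodedata

PAREN_OR_DIGIT = re.compile(r"[()\d]")

def normalize(text: str) -> str:
    # NFKC normalization followed by lowercasing, as described above.
    return unicodedata.normalize("NFKC", text).lower()

def keep_sample(text: str) -> bool:
    # Drop transcripts containing parentheses or digits (ambiguous pronunciation,
    # one-to-many digit readings across languages).
    return PAREN_OR_DIGIT.search(text) is None

def romanize(text: str, lang: str) -> str:
    # Placeholder hook: plug in Uroman here (Pykakasi for Japanese,
    # Phonemizer if IPA targets are desired instead).
    raise NotImplementedError

def preprocess(pairs, lang):
    # pairs: iterable of (audio_path, transcript) tuples.
    for audio, text in pairs:
        text = normalize(text)
        if keep_sample(text):
            yield audio, romanize(text, lang)
```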
| Model | de | nl | fr | es | it | pt | Seen avg | ia | eo | eu | Unseen avg |
|---|---|---|---|---|---|---|---|---|---|---|---|
| wav2vec2.0-phoneme (PER \(\downarrow\)) | 23.8 | 38.0 | 31.0 | 28.7 | 33.5 | 45.0 | 33.0 | 10.7 | - | 20.8 | 31.4 |
| wav2vec2.0-phoneme + n-gram LM (PER \(\downarrow\)) | 14.8 | 26.0 | 26.4 | 12.3 | 21.7 | 36.5 | 22.9 | 6.1 | - | 13.7 | 22.2 |
| IPA generator (LAMA-UT) (CER \(\downarrow\)) | 10.2 | 10.5 | 9.6 | 4.2 | 4.6 | 10.9 | 14.4 | 29.0 | 32.0 | 36.6 | 35.1 |
| Roman generator (LAMA-UT) (CER \(\downarrow\)) | 7.3 | 9.6 | 12.9 | 4.4 | 3.6 | 7.2 | 11.3 | 14.0 | 20.8 | 30.3 | 32.3 |
| Model | Data (h) | Universal | Zero-Shot |
|---|---|---|---|
| Whisper | 680k | O | X |
| MMS-1162 | 122k | X | X |
| LAMA-UT | 0.6k | O | O |
We performed fine-tuning on all layers except the feature extractor for 3,000 steps with a CTC loss and a batch size of 128. We bypassed the two-stage fine-tuning pipeline from prior studies [6], [34] because, with our smaller dataset, the divided fine-tuning approach resulted in premature convergence and instability. For hyperparameters, we employed the default AdamW optimizer [43], [44] with a tri-stage learning rate scheduler. The warm-up, hold, and decay phases were configured to 10%, 60%, and 30% of the total training steps, respectively. We then performed a series of experiments to determine the optimal learning rate within the range of 5e-6 to 5e-4. Finally, the entire training pipeline was run on two RTX-3090 GPUs with 24GB of VRAM each, and we leveraged gradient accumulation to address memory constraints.
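Reusing the `model` from the earlier generator sketch, the tri-stage schedule can be expressed as a `LambdaLR` multiplier with the 10%/60%/30% warm-up/hold/decay split described above; the peak learning rate and the decay floor below are illustrative assumptions, not the exact values used in the paper.

```python
import math
import torch

def tri_stage_lambda(total_steps, warmup=0.10, hold=0.60, final_scale=0.05):
    """Return an LR multiplier: linear warm-up, hold, then exponential decay."""
    warm_end = int(total_steps * warmup)
    hold_end = warm_end + int(total_steps * hold)

    def fn(step):
        if step < warm_end:                            # 10%: ramp up to the peak LR
            return (step + 1) / max(1, warm_end)
        if step < hold_end:                            # 60%: hold at the peak LR
            return 1.0
        frac = (step - hold_end) / max(1, total_steps - hold_end)
        return math.exp(frac * math.log(final_scale))  # 30%: decay to final_scale * peak

    return fn

# Peak LR (within the 5e-6 to 5e-4 search range) and decay floor are illustrative.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer, lr_lambda=tri_stage_lambda(total_steps=3000))
```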
We leveraged a beam search decoder from flashlight [45] with a beam size of 100. No additional lexicons or LMs were utilized in the decoding pipeline to maintain a universal pipeline without relying on language-specific elements.
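One way to realize this lexicon-free decoding is through torchaudio's flashlight-backed CTC beam-search decoder, as sketched below; the token list and model reuse the objects from the earlier sketch, and the exact decoder build used in the paper may differ.

```python
import torch
from torchaudio.models.decoder import ctc_decoder

# Lexicon-free beam search: no lexicon and no LM, so no language-specific resources.
tokens = processor.tokenizer.convert_ids_to_tokens(
    list(range(len(processor.tokenizer))))           # token strings in id order
decoder = ctc_decoder(
    lexicon=None,
    tokens=tokens,
    lm=None,
    beam_size=100,
    blank_token="[PAD]",                             # CTC blank as defined in the tokenizer
    sil_token="|",                                   # word delimiter
)

# waveforms: a list of 16 kHz mono arrays (not shown).
inputs = processor(waveforms, sampling_rate=16000, return_tensors="pt", padding=True)
with torch.inference_mode():
    logits = model(inputs.input_values).logits       # (batch, frames, vocab)
emissions = torch.log_softmax(logits, dim=-1).cpu()

best = decoder(emissions)[0][0]                      # best hypothesis of the first utterance
chars = processor.tokenizer.convert_ids_to_tokens(best.tokens.tolist())
roman_pred = "".join(chars).replace("[PAD]", "").replace("|", " ").strip()
```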
For the prompting strategy, we utilized language information and a subset of the training data to construct our hypothesis prompt. The specific format of the prompt employed is detailed in Table 1. In zero-shot prompting, the universal converter automatically transforms Romanized transcriptions into language-specific ones using only the Romanized transcriptions and language information. We employed zero-shot prompting to evaluate the performance of the LLM with minimal input. Few-shot prompting [46] involves providing examples to help the model generate responses to subsequent instances. We hypothesized that this approach would be particularly effective for low-resource or unseen languages by inducing in-context learning. Specifically, we randomly sampled five Romanized transcription and target transcription pairs for each few-shot example. Zero-shot CoT prompting [47] is a technique that supports complex reasoning by inducing the decomposition of intricate tasks into detailed steps. Specifically, we appended the phrase “Let’s think step by step” to the input prompt to encourage the reasoning of the model. Prompt chaining employs a sequence of prompts, with each prompt building upon the output of the previous one, to manage complex multi-step tasks. In this aspect, we concentrated on the decomposable process of converting predicted Romanized transcriptions into language-specific transcriptions through (i) reverse-Romanization and (ii) error correction. We considered that errors in Romanized transcriptions could propagate during transliteration to language-specific ones, potentially reducing system performance.
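A sketch of how such a prompt dictionary can be assembled from the templates in Table 1 is shown below; the dictionary keys and the exemplar formatting inside `{shots}` are illustrative assumptions, since the paper does not specify them.

```python
import random

# Prompt templates mirroring Table 1 (keys are illustrative identifiers).
PROMPTS = {
    "zero_shot": ("Transcribe the following Romanized sentence into a {lang} "
                  "sentence: {roman}."),
    "few_shot": ("Here are some examples of transcribing a Romanized sentence into a "
                 "{lang} sentence: {shots}. Considering the examples above, transcribe "
                 "the following Romanized sentence into a {lang} sentence: {roman}."),
    "zero_shot_cot": ("Transcribe the following Romanized sentence into a {lang} "
                      "sentence. Think step by step: {roman}."),
}

def build_prompt(kind, lang, roman, train_pairs=None, n_shots=5):
    """Assemble an input prompt for the universal converter."""
    shots = ""
    if kind == "few_shot":
        # Randomly sample (Romanized, target) exemplars from the training split.
        # The exemplar formatting below is an illustrative choice.
        sampled = random.sample(train_pairs, n_shots)
        shots = " ".join(f"Romanized: {r} -> {lang}: {t}" for r, t in sampled)
    return PROMPTS[kind].format(lang=lang, roman=roman, shots=shots)
```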
| | Universal Converter | CER \(\downarrow\) | WER \(\downarrow\) |
|---|---|---|---|
| Seen | LLaMA-8B | 26.6 | 46.7 |
| | LLaMA-70B | 15.5 | 35.3 |
| | GPT-4o-mini | 7.5 | 18.1 |
| Unseen | LLaMA-8B | 33.0 | 50.2 |
| | LLaMA-70B | 27.2 | 58.9 |
| | GPT-4o-mini | 15.8 | 38.3 |
Finally, we required the output of the universal converter to conform to a specific format. We instructed the model to enclose the output within triple backticks, which allows us to isolate only the language-specific transcription from the output of the model. We set the temperature to 0.0 for all LLMs to obtain deterministic results.
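The converter call and output parsing can then be sketched as follows, assuming the OpenAI Python client for the GPT-4o-mini variant; the system message wording and the fallback behavior are our own assumptions.

```python
import re
from openai import OpenAI

client = OpenAI()
FENCE = re.compile(r"```(?:\w*\n)?(.*?)```", re.DOTALL)

def convert(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        temperature=0.0,                    # deterministic decoding
        messages=[
            {"role": "system",
             "content": "Return only the converted sentence, enclosed in triple backticks."},
            {"role": "user", "content": prompt},
        ],
    )
    text = response.choices[0].message.content
    match = FENCE.search(text)
    # Keep only the fenced transcription; fall back to the raw output if no fence is found.
    return match.group(1).strip() if match else text.strip()
```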
| Resource | Lang. | Whisper-large-v3 Data (h) | CER \(\downarrow\) | WER \(\downarrow\) | MMS-1162 Data (h) | CER \(\downarrow\) | WER \(\downarrow\) | LAMA-UT Data (h) | CER \(\downarrow\) | WER \(\downarrow\) |
|---|---|---|---|---|---|---|---|---|---|---|
| High | es | 11000 | 1.2 | 3.1 | 2969 | 1.6 | 5.8 | 6.1 | 2.8 | 7.3 |
| | it | 2585 | 0.5 | 1.6 | 1566 | 1.2 | 5.2 | 6.8 | 2.0 | 5.2 |
| | id | 1014 | 1.4 | 5.7 | 71 | 2.9 | 14.2 | 6.8 | 4.2 | 11.5 |
| Middle | ta | 136 | 18.3 | 26.7 | 265 | 11.0 | 41.5 | 6.3 | 19.5 | 31.9 |
| | ur | 104 | 30.9 | 65.0 | 57 | 9.0 | 29.0 | 4.9 | 14.9 | 31.9 |
| | sk | 90 | 2.9 | 8.7 | 301 | 2.2 | 8.8 | 4.5 | 3.8 | 10.2 |
| Low | mk | 16 | 10.3 | 26.3 | 45 | 1.5 | 8.1 | 5.1 | 5.5 | 17.2 |
| | hi | 12 | 35.9 | 43.3 | 57 | 5.8 | 19.6 | 5.0 | 8.2 | 15.0 |
| | kk | 12 | 8.5 | 35.1 | 46 | 2.8 | 15.2 | 8.1 | 6.7 | 22.9 |
| Average | | - | 23.9 | 42.9 | - | 7.8 | 28.8 | - | 14.8 | 33.2 |
Table 3 shows that LAMA-UT effectively achieves multilingual ASR with a universal model. This approach even operates in a zero-shot environment without requiring language-specific modules while utilizing only a minimal amount of training data. In the subsequent results, we aim to validate the performance of each component within the pipeline.
We conducted a performance comparison between our universal transcription generator and the existing baseline, wav2vec2.0-phoneme [34]. Our universal transcription generator produces character-level tokens and is measured using Character Error Rate (CER), while the baseline wav2vec2.0-phoneme is measured using Phoneme Error Rate (PER). However, since both metrics fundamentally estimate errors over phonetic symbols, the comparison remains meaningful. As shown in Table 2, the proposed method demonstrates significantly better performance across a broader range of languages than existing approaches, even without utilizing language-specific modules (e.g., an n-gram LM). Furthermore, our pipeline demonstrates relatively strong transcription capabilities for unseen languages that were not explicitly included in the training data. In conclusion, transcribing diverse languages based on their pronunciation produces a universal transcription that is highly effective.
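For reference, the character- and word-level metrics used throughout can be computed with the jiwer library; the strings below are toy examples.

```python
import jiwer

reference = "bonjour tout le monde"
hypothesis = "bonjur tout le mond"

cer = jiwer.cer(reference, hypothesis)   # character error rate
wer = jiwer.wer(reference, hypothesis)   # word error rate
print(f"CER={cer:.3f}  WER={wer:.3f}")
```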
Among the two methods for standardizing orthographic features, Romanization proved to be more effective than IPA. Its ability to represent pronunciation across languages while reducing complexity makes it the superior choice. Romanization balances phonetic accuracy with simplicity, providing better alignment with LLMs and ensuring efficient processing across multilingual ASR tasks. However, since these results are confined to the first phase of the pipeline, we also conducted a full end-to-end performance comparison between IPA-based and Romanization-based LAMA-UT; the results are shown in the appendix.
Despite the effectiveness of orthography unification, the success of the entire pipeline hinges on the proper functioning of the universal converter. Therefore, the most critical aspect to validate before experimentation was whether a frozen LLM could effectively serve as a universal converter. To validate this objective, we passed ground truth Romanized transcriptions into the frozen universal converter and assessed its performance. This approach not only tests the converter’s capability to accurately produce language-specific transcriptions but also serves to evaluate the upper bound performance of the universal converter within the proposed ASR pipeline. In Table 4, results demonstrated that universal transcription based on pronunciation characteristics can yield significant performance improvements compared to previous works when the universal transcription generator operates ideally. However, the upper bound performance for unseen languages showed a slight decrease compared to seen languages. This decrease is likely because the unseen languages we tested are typically extremely low-resource languages within the training data of the LLM.
| Model | Repetition Rate (%) \(\downarrow\) | Zero-Shot (CER / WER \(\downarrow\)) | Few-Shot (5) (CER / WER \(\downarrow\)) | Zero-Shot CoT (CER / WER \(\downarrow\)) | Few-Shot (5) + Zero-Shot CoT (CER / WER \(\downarrow\)) | Prompt Chaining (CER / WER \(\downarrow\)) |
|---|---|---|---|---|---|---|
| LLaMA-8B | | 35.1 / 70.6 | 22.7 / 49.6 | 37.2 / 77.4 | 22.1 / 49.8 | 35.9 / 70.8 |
| LLaMA-70B | 1 | 24.3 / 53.8 | 17.4 / 43.8 | 25.4 / 54.6 | 16.8 / 43.7 | 26.7 / 55.2 |
| GPT-4o-mini | 0.2 | 16.6 / 39.3 | 15.3 / 37.2 | 18.2 / 41.0 | 15.7 / 37.9 | 16.9 / 38.7 |
| Model | Data (h) | # Lang. | Universal | CER \(\downarrow\) |
|---|---|---|---|---|
| ASR-2K | 2k | 8 | O | 65.5 |
| LAMA-UT (Roman) | 0.6k | 102 | O | 34.7 |
| MMS-ZS | 40k | 1078 | X | 29.2 |
| + n-gram LM | 40k | 1078 | X | 25.2 |
We leveraged two baseline models for comparison: Whisper and MMS. In Table 5, results demonstrate that the proposed method achieved a relative reduction of 60% in CER and 30% in WER compared to Whisper. Moreover, LAMA-UT matches the performance of MMS despite the absence of language-specific adapters, heads, and n-gram LMs. Notably, the performance improvements were most pronounced for low-resource languages. While Whisper exhibited increased error rates for these languages due to limited training data, our method showed substantial performance enhancements with minimal data resources. Despite the slight performance degradation in high-resource languages, the improvement observed in low-resource languages is remarkably meaningful. The full comparison results are presented in Fig 3. It is noteworthy that these results were achieved with considerably smaller training data compared to Whisper and MMS.
Our main focus was developing a generalized pipeline that demonstrates strong performance with unseen languages. To validate this objective, we utilized two zero-shot ASR models as baselines: ASR-2K and Zero-Shot MMS (MMS-ZS). In Table 7, our method demonstrated a reduction in CER by half while using significantly less training data compared to ASR-2K. Furthermore, it is noteworthy that our proposed pipeline performs remarkably well even without language-specific modules, demonstrating comparable performance to MMS-ZS which leverages language-specific lexicon and n-gram LM.
In Table 6, few-shot prompting showed the highest performance across all models and prompting strategies. Interestingly, even with zero-shot prompting, the proposed pipeline consistently outperforms Whisper on average in CER and WER, where Whisper records 23.9% and 42.9% respectively, as shown in Table 5. On the other hand, the use of sequential reasoning failed to achieve the anticipated improvements. Specifically, we observed considerable error propagation when utilizing zero-shot CoT prompts and prompt chaining techniques. Minor inaccuracies in the Romanization phase were amplified as they were processed by the LLM, leading to transcriptions that deviated in meaning from the intended output.
From the perspective of model size, using a relatively smaller LLM like LLaMA-8B frequently resulted in issues such as word repetition, which complicated the transcription sorting process. Additionally, this model faced challenges with language misprediction, often generating transcriptions in languages other than the intended target language. This issue was particularly noticeable with low-resource languages such as Arabic. With the LLaMA-70B model, while word repetition was less pronounced compared to the LLaMA-8B model, the issue of language misprediction persisted, albeit at a reduced frequency. Among the LLMs tested, GPT-4o-mini demonstrated the best performance overall. It outperformed the other models across all prompting strategies, achieving an impressive average CER of 15% across 102 languages.
In this paper, we introduced a generalized multilingual ASR pipeline, LAMA-UT, that operates effectively without relying on language-specific modules. By utilizing Romanized transcription as a unified representation across languages, we structured the multilingual ASR pipeline into two phases. First, Romanization aligns phonetic and orthographic features, allowing the universal transcription to generalize effectively across diverse languages and to be trained efficiently with a smaller dataset. Subsequently, we used a frozen LLM to convert universal transcriptions into language-specific ones. This inversion process showed remarkable performance across languages, including those not previously encountered. Our experiments demonstrated that the proposed method not only maintains performance for high-resource languages but also significantly outperforms existing methods for low-resource languages, all while effectively handling unseen languages. Furthermore, our approach matched the performance of models that employ language-specific modules, despite not using any such components. We anticipate that this research will provide a viable alternative for utilizing LLMs to support universal multilingual ASR systems across a variety of applications.
The following appendix sections present additional experimental results and specifications for LAMA-UT, which can be instrumental in deepening understanding or validating the strengths of the proposed pipeline. The first section of the appendix presents the specifications of the languages we leveraged in the experiments, along with the corresponding ISO-639 language codes. The second section reports the transcription performance of LAMA-UT for each language used in the experiments. The third section presents the end-to-end transcription performance comparison between IPA-based LAMA-UT and Romanization-based LAMA-UT, to further validate the effectiveness of Romanization in orthography unification. Finally, in the last section, we discuss possible advancements of LAMA-UT and our further research directions.
| Seen | Afrikaans (af), Amharic (am), Arabic (ar), Assamese (as), Asturian (ast), Azerbaijani (az), Belarusian (be), Bulgarian (bg), Bengali (bn), Bosnian (bs), Catalan (ca), Cebuano (ceb), Sorani-Kurdish (ku), Mandarin Chinese (cmn), Czech (cs), Welsh (cy), Danish (da), German (de), Greek (el), English (en), Spanish (es), Estonian (et), Persian (fa), Fula (ff), Finnish (fi), Filipino (fil), French (fr), Irish (ga), Galician (gl), Gujarati (gu), Hausa (ha), Hebrew (he), Hindi (hi), Croatian (hr), Hungarian (hu), Armenian (hy), Indonesian (id), Igbo (ig), Icelandic (is), Italian (it), Japanese (ja), Javanese (jv), Georgian (ka), Kamba (kam), Kabuverdianu (kea), Kazakh (kk), Khmer (km), Kannada (kn), Korean (ko), Kyrgyz (ky), Luxembourgish (lb), Ganda (lg), Lingala (ln), Lao (lo), Lithuanian (lt), Luo (luw), Latvian (lv), Maori (mi), Macedonian (mk), Malayalam (ml), Mongolian (mn), Marathi (mr), Malay (ms), Maltese (mt), Burmese (my), Norwegian (no), Nepali (ne), Dutch (nl), Northern-Sotho (nso), Nyanja (ny), Occitan (oc), Oromo (om), Oriya (or), Punjabi (pa), Polish (pl), Pashto (ps), Portuguese (pt), Romanian (ro), Russian (ru), Sindhi (sd), Slovak (sk), Slovenian (sl), Shona (sn), Somali (so), Serbian (sr), Swedish (sv), Swahili (sw), Tamil (ta), Telugu (te), Tajik (tg), Thai (th), Turkish (tr), Ukrainian (uk), Umbundu (umb), Urdu (ur), Uzbek (uz), Vietnamese (vi), Wolof (wo), Xhosa (xh), Yoruba (yo), Cantonese Chinese (yue), Zulu (zu) |
|---|---|
| Unseen | Abkhazian (ab), Albanian (sq), Basaa (bas), Bashkir (ba), Basque (eu), Breton (br), Chuvash (cv), Eastern Mari (mhr), Erzya (myv), Esperanto (eo), Guarani (gn), Hakha Chin (cnh), Interlingua (ia), Kinyarwanda (rw), Latgalian (ltg), Norwegian Nynorsk (nn), Romansh (rm), Tatar (tt), Toki Pona (tok), Turkmen (tk), Uighur (ug), Upper Sorbian (hsb), Western Frisian (fy), Western Mari (mrj), Yakut (sah) |
Table 8 lists the languages leveraged in the experiments of our manuscript. The seen languages comprise the 102 languages in the FLEURS dataset and are used in both training and evaluation of LAMA-UT. Evaluation samples for unseen languages were chosen from the official test split of the CommonVoice 17.0 dataset, ensuring that each possessed an adequate volume of data.
| Language | CER \(\downarrow\) | Language | CER \(\downarrow\) | Language | CER \(\downarrow\) | Language | CER \(\downarrow\) | |
|---|---|---|---|---|---|---|---|---|
| Seen | Afrikaans | 19.3 | Ganda | 10.6 | Lithuanian | 8.7 | Shona | 9.0 |
| Amharic | 8.7 | Georgian | 9.4 | Luo | 7.4 | Sindhi | 29.7 | |
| Arabic | 8.8 | German | 8.4 | Luxembourgish | 13.6 | Slovak | 5.4 | |
| Armenian | 5.6 | Greek | 11.5 | Macedonian | 5.3 | Slovenian | 9.1 | |
| Assamese | 12.9 | Gujarati | 8.9 | Malay | 8.3 | Somali | 16.8 | |
| Asturian | 9.3 | Hausa | 9.7 | Malayalam | 7.7 | Sorani-Kurdish | 11.8 | |
| Azerbaijani | 10.7 | Hebrew | 20.5 | Maltese | 8.4 | Spanish | 4.7 | |
| Belarusian | 8.0 | Hindi | 11.6 | Mandarin Chinese | 6.5 | Swahili | 6.4 | |
| Bengali | 9.8 | Hungarian | 14.5 | Maori | 8.7 | Swedish | 11.8 | |
| Bosnian | 6.9 | Icelandic | 18.7 | Marathi | 11.9 | Tajik | 5.3 | |
| Bulgarian | 7.6 | Igbo | 14.0 | Mongolian | 14.7 | Tamil | 11.9 | |
| Burmese | 19.4 | Indonesian | 6.0 | Nepali | 11.2 | Telugu | 10.8 | |
| Cantonese Chinese | 16.6 | Irish | 28.4 | Northern-Sotho | 13.8 | Thai | 15.9 | |
| Catalan | 7.5 | Italian | 3.7 | Norwegian | 9.2 | Turkish | 6.4 | |
| Cebuano | 7.5 | Japanese | 33.4 | Nyanja | 12.4 | Ukrainian | 11.2 | |
| Croatian | 6.3 | Javanese | 7.0 | Occitan | 17.7 | Umbundu | 9.5 | |
| Czech | 11.0 | Kabuverdianu | 6.8 | Oriya | 14.3 | Urdu | 44.0 | |
| Danish | 15.6 | Kamba | 10.9 | Oromo | 20.2 | Uzbek | 16.3 | |
| Dutch | 11.3 | Kannada | 7.4 | Pashto | 20.0 | Vietnamese | 13.0 | |
| English | 13.1 | Kazakh | 5.1 | Persian | 6.9 | Welsh | 12.8 | |
| Estonian | 4.4 | Khmer | 23.6 | Polish | 10.9 | Wolof | 14.9 | |
| Filipino | 4.8 | Korean | 9.6 | Portuguese | 9.0 | Xhosa | 10.4 | |
| Finnish | 4.9 | Kyrgyz | 5.2 | Punjabi | 20.3 | Yoruba | 10.9 | |
| French | 12.3 | Lao | 17.8 | Romanian | 11.8 | Zulu | 9.5 | |
| Fula | 14.7 | Latvian | 7.0 | Russian | 9.0 | |||
| Galician | 6.7 | Lingala | 7.7 | Serbian | 8.9 | |||
| Unseen | Abkhazian | 45.8 | Eastern Mari | 30.9 | Latgalian | 28.8 | Upper Sorbian | 35.3 |
| Albanian | 34.8 | Erzya | 29.9 | Norwegian Nynorsk | 27.1 | Western Frisian | 33.8 | |
| Basaa | 33.4 | Esperanto | 22.1 | Romansh | 31.9 | Western Mari | 36.7 | |
| Bashkir | 33.5 | Guarani | 33.5 | Tatar | 31.8 | Yakut | 36.3 | |
| Basque | 33.0 | Hakha Chin | 42.9 | Toki Pona | 28.7 | |||
| Breton | 49.1 | Interlingua | 17.3 | Turkmen | 41.3 | |||
| Chuvash | 37.4 | Kinyarwanda | 37.2 | Uighur | 27.1 |
| Language | CER \(\downarrow\) | Language | CER \(\downarrow\) | Language | CER \(\downarrow\) | Language | CER \(\downarrow\) | |
|---|---|---|---|---|---|---|---|---|
| Seen | Afrikaans | 18.3 | Ganda | 10.4 | Luo | 7.3 | Sindhi | 29.1 |
| Amharic | 47.4 | Georgian | 23.6 | Luxembourgish | 16.5 | Slovak | 3.8 | |
| Arabic | 13.9 | German | 5.5 | Macedonian | 5.5 | Slovenian | 9.4 | |
| Armenian | 22.4 | Greek | 15.6 | Malay | 6.9 | Somali | 15.4 | |
| Assamese | 18.9 | Gujarati | 11.9 | Malayalam | 24.1 | Sorani-Kurdish | 26.4 | |
| Asturian | 9.8 | Hausa | 10.2 | Maltese | 9.8 | Spanish | 2.8 | |
| Azerbaijani | 11.9 | Hebrew | 35.2 | Mandarin Chinese | 36.0 | Swahili | 5.3 | |
| Belarusian | 11.3 | Hindi | 8.3 | Maori | 12.0 | Swedish | 9.5 | |
| Bengali | 13.8 | Hungarian | 16.7 | Marathi | 14.0 | Tajik | 8.9 | |
| Bosnian | 5.9 | Icelandic | 18.1 | Mongolian | 21.4 | Tamil | 19.6 | |
| Bulgarian | 8.2 | Igbo | 17.9 | Nepali | 13.2 | Telugu | 15.5 | |
| Burmese | 52.6 | Indonesian | 4.3 | Northern-Sotho | 15.1 | Thai | 45.3 | |
| Cantonese Chinese | 61.6 | Irish | 34.8 | Norwegian | 7.0 | Turkish | 5.1 | |
| Catalan | 6.3 | Italian | 2.0 | Nyanja | 12.1 | Ukrainian | 9.7 | |
| Cebuano | 7.4 | Javanese | 6.6 | Occitan | 19.9 | Umbundu | 10.5 | |
| Croatian | 5.2 | Kabuverdianu | 8.3 | Oriya | 18.3 | Urdu | 14.9 | |
| Czech | 9.9 | Kamba | 14.6 | Oromo | 19.8 | Uzbek | 14.5 | |
| Danish | 12.9 | Kannada | 13.9 | Pashto | 32.7 | Vietnamese | 19.0 | |
| Dutch | 8.7 | Kazakh | 6.7 | Persian | 11.9 | Welsh | 12.3 | |
| English | 8.2 | Khmer | 66.3 | Polish | 8.6 | Wolof | 18.0 | |
| Estonian | 5.7 | Korean | 18.8 | Portuguese | 7.9 | Xhosa | 10.5 | |
| Filipino | 4.4 | Kyrgyz | 8.8 | Punjabi | 19.5 | Yoruba | 25.5 | |
| Finnish | 4.7 | Lao | 55.0 | Romanian | 11.0 | Zulu | 9.6 | |
| French | 8.2 | Latvian | 7.9 | Russian | 6.0 | |||
| Fula | 16.6 | Lingala | 7.8 | Serbian | 5.7 | |||
| Galician | 6.0 | Lithuanian | 8.6 | Shona | 8.6 | |||
| Unseen | Abkhazian | 55.9 | Eastern Mari | 35.3 | Latgalian | 34.7 | Upper Sorbian | 35.2 |
| Albanian | 35.4 | Erzya | 35.7 | Norwegian Nynorsk | 23.9 | Western Frisian | 29.6 | |
| Basaa | 42.1 | Esperanto | 19.7 | Romansh | 30.0 | Western Mari | 41.4 | |
| Bashkir | 29.7 | Guarani | 40.0 | Tatar | 24.2 | Yakut | 38.2 | |
| Basque | 32.8 | Hakha Chin | 42.8 | Toki Pona | 29.0 | |||
| Breton | 49.9 | Interlingua | 15.2 | Turkmen | 43.1 | |||
| Chuvash | 44.3 | Kinyarwanda | 36.7 | Uighur | 24.5 |
In this section, we present a detailed performance analysis for each language, comparing transcription performance from the universal transcription generation phase to that after passing through the universal converter. Additionally, we analyze these results further to explore the correlation between orthographic characteristics and transcription performance. Table 9 reports the per-language transcription performance of the Romanization-based universal transcription generator, and Table 10 reports the per-language end-to-end (E2E) transcription performance of LAMA-UT. For the universal transcription generator, CERs were consistently similar across most languages with minimal variation. However, in the end-to-end pipeline with language-specific transliteration, notably higher error rates were observed for certain languages. The languages highlighted in bold represent the top 10 languages with the most significant increases in error rate in the end-to-end pipeline compared to the CER values from the universal transcription phase. Interestingly, these 10 languages, which exhibited the most significant performance degradation after passing through the universal converter, are all non-Latin script languages, and the majority do not employ spacing in their orthography. This suggests that when a pre-trained LLM is utilized as a universal converter, languages with non-Latin orthography, which inherently exhibit different structural characteristics compared to Latin-based languages, are more prone to error propagation.
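The top-10 degradation ranking described above can be reproduced with a simple per-language difference between the two tables; the dictionary names below are illustrative.

```python
# cer_generator / cer_e2e: dicts mapping language name -> CER from Tables 9 and 10.
deltas = {lang: cer_e2e[lang] - cer_generator[lang]
          for lang in cer_generator if lang in cer_e2e}
top10 = sorted(deltas.items(), key=lambda kv: kv[1], reverse=True)[:10]
for lang, delta in top10:
    print(f"{lang}: +{delta:.1f} CER after language-specific transliteration")
```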
| | Universal Converter | IPA Upper Bound (CER / WER \(\downarrow\)) | IPA Few-Shot (CER / WER \(\downarrow\)) | Roman Upper Bound (CER / WER \(\downarrow\)) | Roman Few-Shot (CER / WER \(\downarrow\)) |
|---|---|---|---|---|---|
| Seen | LLaMA-8B | 51.4 / 82.4 | 56.5 / 80.8 | 33.0 / 44.4 | 38.1 / 64.7 |
| | LLaMA-70B | 40.0 / 64.6 | 33.8 / 67.8 | 15.6 / 34.6 | 23.1 / 56.2 |
| | GPT-4o-mini | 27.3 / 49.2 | 33.0 / 70.6 | 10.0 / 28.6 | 19.9 / 49.4 |
| Unseen | LLaMA-8B | 52.0 / 93.0 | 43.4 / 93.3 | 32.7 / 60.2 | 41.6 / 94.4 |
| | LLaMA-70B | 42.5 / 89.0 | 38.1 / 94.6 | 17.0 / 57.5 | 34.7 / 93.9 |
| | GPT-4o-mini | 40.4 / 78.2 | 35.2 / 84.1 | 12.3 / 42.1 | 29.7 / 83.0 |
In this section, we further clarify the effectiveness of utilizing Romanization as an intermediate representation. Since the experimental results in Table 2 were limited to the universal transcription generation phase, we supplement them with a full performance comparison between the IPA-based and Romanization-based end-to-end architectures of LAMA-UT, covering both the upper-bound assessment and the transcription performance. Table 11 presents the end-to-end performance comparison between IPA-based LAMA-UT and Romanization-based LAMA-UT. The only difference between the two models is the orthography unification method used in the universal transcription generator (i.e., the IPA-based model utilizes IPA as an intermediate representation, while the Romanization-based one leverages Romanized transcription). In the results, Romanization-based LAMA-UT consistently outperforms IPA-based LAMA-UT by a substantial margin in both CER and WER. These results strongly demonstrate the superiority of Romanization over IPA when leveraging LLMs as a universal converter, since LLMs are primarily trained on Latin scripts with tokenization strategies optimized for them. Furthermore, there are some inconsistencies in the results of the IPA-based LAMA-UT, where passing predicted IPA transcriptions along with a few examples to the universal converter yielded better performance than passing ground truth IPA transcriptions without examples. This is presumably because the LLMs did not frequently encounter IPA-derived tokens during their training.
LAMA-UT showed comparable or better performance compared to previous works even without language-specific modules (e.g., adapters, lexicons, n-gram LMs), while achieving efficient training with a significantly reduced dataset size. However, there is still room for further improvement in both the universal transcription generator and the universal converter. First, the transcription performance of the universal transcription generator can be improved. For instance, the universal transcription generator of LAMA-UT can leverage additional linguistic information (e.g., embedding from a pre-trained language classifier) to further enhance the transcription quality of the first phase. Secondly, our pipeline shows relatively lower performance for languages with distinct linguistic structures, like Korean, and those with additional features (e.g., tones), such as Chinese, in the language-specific transliteration phase. Since our universal converter is replaceable, this issue will naturally be resolved in line with the development of LLMs. Finally, the utilization of prompt learning techniques [50]–[52] might improve transliteration performance. We plan to address these aspects in future research.
Code: https://github.com/sanghyang00/lama-ut