April 02, 2024
Large language models (LLMs) are trained on text-only data covering far more languages than those with paired speech and text. At the same time, Dual Encoder (DE) based retrieval systems project queries and documents into the same embedding space and have demonstrated their success in retrieval and bi-text mining. To match speech and text in many languages, we propose using LLMs to initialize multi-modal DE retrieval systems. Unlike traditional methods, our system requires no speech data during LLM pre-training and can exploit the LLM's multilingual text understanding to match speech and text in languages unseen during retrieval training. Our multi-modal LLM-based retrieval system matches speech and text in 102 languages despite being trained on only 21 languages, and it outperforms previous systems trained explicitly on all 102 languages, achieving a 10% absolute improvement in Recall@1 averaged across these languages. Additionally, our model demonstrates cross-lingual speech and text matching, which is further enhanced by readily available machine translation data.
LLMs have demonstrated their effectiveness in modelling textual sequences to tackle various downstream tasks [1]–[3]. This effectiveness has led to the development of powerful LLMs capable of modelling text in a wide range of languages. The abundance of textual data in different languages across the internet has fueled the progress of multi-lingual models [4]–[6]. On the other hand, speech technologies are prevalent in smartphones and personal assistants, but their language availability is relatively limited compared to the languages that LLMs support [7], [8].
Various efforts have explored solutions to the speech-text data scarcity problem [9]–[11]. Works such as SpeechMatrix [12] use separate speech and text encoders to mine semantically similar utterances that are neighbors in an embedding space. However, these approaches are limited because they require speech and text encoders whose representation spaces are already aligned.
We posit that we can retrieve speech and text utterances by aligning both modalities within the embedding space built from a single pre-trained LLM. We take inspiration from previous works that use pre-trained LLMs to perform automatic speech recognition (ASR) and automatic speech translation (AST) [13]–[15]. Our intuition is that we can perform the speech and text alignment leveraging the capabilities of text-only LLMs without requiring two separate models.
In this paper, we propose converting LLMs into speech and text DE retrieval systems without requiring speech pre-training, and we outperform previous methods with significantly less data. By discretizing speech into acoustic units [16], we extend our LLM's embedding layer and treat the acoustic units as ordinary text tokens. We then transform our LLM into a retrieval system via a contrastive loss, allowing us to match speech and text utterances in various languages. Our contributions are the following:
We build a speech-to-text symmetric DE from a pre-trained LLM. We show that our retrieval system is effective at matching speech and text in the 102 languages of FLEURS [17] despite only training on 21 languages.
We show that our model exhibits cross-lingual speech and text matching without training on this type of data. At the same time, we find that cross-lingual speech and text matching is further improved by training on readily available machine translation data.
We train a transformer-based DE model that encodes speech and text given a dataset \(\emph{D} = \{(x_i, y_i)\}\), where \(x_i\) is a speech utterance and \(y_i\) is its transcription. We denote the speech and text embeddings as \(\boldsymbol{x_i} = E(x_i)\) and \(\boldsymbol{y_i} = E(y_i)\), respectively, where \(E\) is a transformer-based DE that encodes speech and text.
We convert raw speech into discrete tokens using the process in [18], [19]. The process converts a speech query \(x_i\) into frame-level embeddings using a pre-trained speech encoder. The output embeddings are then discretized into a sequence of tokens using k-means clustering. We refer to the resulting tokens as audio tokens. We use the 2B variant of the Universal Speech Model (USM) encoder [20] as the speech encoder and take the middle layer as the embedding for \(x_i\). Additionally, we generate audio tokens at 25Hz using k-means clustering with 1024 clusters, resulting in 1024 possible audio tokens. We will refer to this set as our audio token vocabulary.
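As a rough illustration of this discretization step, the sketch below assigns frame-level encoder embeddings to their nearest k-means centroid to produce audio tokens. The encoder call, array shapes, and function names are hypothetical placeholders; only the 1024-cluster vocabulary and the nearest-centroid assignment follow the description above.

```python
import numpy as np

def discretize_speech(frame_embeddings: np.ndarray,
                      centroids: np.ndarray) -> np.ndarray:
    """Map frame-level speech embeddings to discrete audio tokens.

    frame_embeddings: [num_frames, dim] activations taken from the middle
        layer of a pre-trained speech encoder (USM 2B in our setup).
    centroids: [1024, dim] k-means centroids fit offline, giving the
        1024-entry audio token vocabulary described above.
    Returns an int array of audio token ids in [0, 1024).
    """
    # Squared Euclidean distance from every frame to every centroid.
    d2 = ((frame_embeddings[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
    return d2.argmin(axis=1)  # nearest-centroid assignment = audio token id

# Hypothetical usage: frame embeddings would come from the speech encoder at 25Hz.
# audio_tokens = discretize_speech(usm_middle_layer(waveform), kmeans_centroids)
```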
To support both text and audio tokens in our LLM, we follow the formulation of [13]. We extend the embedding layer of a transformer decoder by \(a\) tokens, where \(a\) is the size of our audio token vocabulary. This modification leads to an embedding layer of size \((t + a) \times m\), where \(t\) is the number of tokens in the text vocabulary and \(m\) is the dimension of the embedding vectors. In our implementation, the first \(t\) tokens represent text and the remaining \(a\) tokens are reserved for audio. We initialize the embedding layer from scratch when training our model.
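A minimal sketch of the extended embedding table follows, assuming \(t = 32{,}000\) text tokens (consistent with the example input ids in Table 3), \(a = 1024\) audio tokens, and a placeholder embedding dimension \(m\); the Gaussian initialization is also an assumption, since the text above only states that the layer is trained from scratch.

```python
import numpy as np

t, a, m = 32_000, 1024, 1536   # assumed text vocab size, audio vocab size, embedding dim

# One embedding table of shape (t + a) x m: rows [0, t) hold text tokens,
# rows [t, t + a) hold audio tokens. Initialized from scratch.
embedding_table = np.random.normal(scale=0.02, size=(t + a, m))

def embed(token_ids: np.ndarray, is_audio: bool) -> np.ndarray:
    """Look up embeddings; audio token ids are offset by the text vocabulary size."""
    rows = token_ids + t if is_audio else token_ids
    return embedding_table[rows]
```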
Appendix [sec:dataset_stats] details our training and evaluation datasets along with the number of languages in each dataset, the split we used, and the size of each dataset. We focus on the following retrieval tasks:
Speech-to-text retrieval (S2T) involves retrieving the corresponding transcription from a database given a speech sample. In S2T, we train on CoVoST-2 [21] speech utterances and their transcriptions. CoVoST-2 is a large multilingual speech corpus derived from Common Voice, spanning 21 languages and providing translations to and from English. We use FLEURS [17] to evaluate S2T performance on 102 languages. FLEURS is an \(n\)-way parallel dataset containing speech utterances from FLoRES-101 [22] human translations. To evaluate S2T, we report recall at 1 (\(R@1\)) rates for retrieving the correct transcription for every speech sample, as well as word error rate (WER).
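Concretely, \(R@1\) here is the fraction of speech queries whose highest-scoring text embedding is the paired transcription. A minimal sketch, assuming pre-computed embedding matrices whose rows are aligned pairs:

```python
import numpy as np

def recall_at_1(speech_emb: np.ndarray, text_emb: np.ndarray) -> float:
    """speech_emb, text_emb: [N, d] arrays; row i of each is a matching pair."""
    scores = speech_emb @ text_emb.T              # [N, N] dot-product similarities
    predictions = scores.argmax(axis=1)           # nearest transcription per query
    return float((predictions == np.arange(len(scores))).mean())
```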
Speech-to-text translation retrieval (S2TT) attempts to retrieve the corresponding text translation of a speech sample. We use S2TT to measure the cross-lingual capabilities of our multi-modal DE retrieval system. We evaluate this capability zero-shot on the X \(\to\) En S2TT data of FLEURS and explore whether we can further improve it by training on readily available machine translation data from WikiMatrix [23]. We pick the French, German, Dutch, and Polish to English pairs, which are common across WikiMatrix and FLEURS, and discuss the amount of machine translation data used in Appendix [sec:dataset_stats]. For S2TT, we report 4-gram corpus BLEU [24].
Figure 1 shows an illustration of our model. We initialize our dual encoder from PaLM 2 XXS [25] and append a linear projection layer after pooling the outputs along the sequence length dimension. The embedding and linear projection layers are initialized randomly. After initializing our model from PaLM 2, we train it with a contrastive loss [26]. Appendix 9.1 includes more details on our training setup. We refer to our proposed model as PaLM 2 DE.
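The sketch below illustrates the retrieval head we place on top of the backbone, assuming mean pooling over the sequence dimension (the pooling operator is not specified above, so this is an assumption) followed by the randomly initialized linear projection.

```python
import numpy as np

def encode(hidden_states: np.ndarray, projection: np.ndarray) -> np.ndarray:
    """hidden_states: [seq_len, m] backbone outputs for one speech or text input.
    projection: [m, d] randomly initialized linear layer.
    Returns a single [d] embedding used for retrieval."""
    pooled = hidden_states.mean(axis=0)   # pool along the sequence length dimension
    return pooled @ projection            # project into the shared retrieval space
```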
| | R@1 \(\uparrow\) | WER \(\downarrow\) |
|---|---|---|
| mSLAM DE [17] | 76.9 | 14.6 |
| PaLM 2 DE (Proposed Model) | 86.15 | 13.85 |
We train our DE model to perform S2T, where the task is to retrieve the corresponding transcription given a speech sample. We train on the 21 languages from CoVoST-2 and evaluate our model using the S2T portion of FLEURS in 102 languages.
Table 1 shows the average \(R@1\) and WER for S2T for 102 languages from FLEURS. We compare against the mSLAM DE model from [17], a model trained on 426k hours of S2T data in 51 languages and fine-tuned on FLEURS training data. Our model significantly outperforms the mSLAM DE baseline in \(R@1\) and \(WER\) metrics despite being trained with only 1/10 of the data and having been initialized from a text-only LLM. More importantly, our model was only trained on the 21 languages in CoVoST-2 and never fine-tuned on the FLEURS training data.
In Figure 2 we break down the \(R@1\) scores into languages seen and unseen during training. We find that our model performs best on the 20 languages that appear in both the training and evaluation data, but it still performs remarkably well on the remaining 82 unseen languages. We hypothesize this is due to the vast multilingual text data our backbone LLM has seen during pre-training.
| Language Group (#) | mSLAM DE [17] R@1 \(\uparrow\) | PaLM 2 DE (Proposed Model) R@1 \(\uparrow\) |
|---|---|---|
| Afro-Asiatic (7) | 73.67 | 90.82 |
| Atlantic-Congo (15) | 86.77 | 79.47 |
| Austro-Asiatic (3) | 47.90 | 41.71 |
| Austronesian (6) | 75.50 | 85.74 |
| Dravidian (4) | 65.70 | 90.46 |
| Indo-European (51) | 84.62 | 92.38 |
| Japonic (1) | 5.80 | 63.23 |
| Kartvelian (1) | 70.50 | 74.57 |
| Koreanic (1) | 5.20 | 45.81 |
| Kra-Dai (1) | 3.20 | 37.12 |
| Mongolic (1) | 70.70 | 98.10 |
| Nilo-Saharan (1) | 91.00 | 94.53 |
| Sino-Tibetan (3) | 3.40 | 61.91 |
| Turkic (4) | 81.28 | 91.89 |
| Uralic (3) | 91.40 | 93.38 |
Table 2 shows the \(R@1\) language group breakdown for S2T on FLEURS. We find that although we only trained on 21 languages, our model significantly outperforms mSLAM DE in 13 of the 15 language groups. These results are consistent with the experiments in [15] which explore the effect of initializing speech language models from pre-trained LLMs.
We evaluate on S2TT to gauge the cross-modal and cross-lingual capabilities of our model. We show that we can further improve S2TT by simply training on a mixture of S2T and translation data without using any S2TT training data.
Given the multilingual capabilities of our backbone language model, we explore whether these capabilities transfer after training our model contrastively on the S2T task. We hypothesize that our model should showcase cross-lingual and cross-modal capabilities due to the cross-modal training task and the cross-lingual capabilities of the backbone LLM. We evaluate S2TT in a zero-shot setting to assess our model's performance in retrieving English translations given a speech sample in another language. Using the FLEURS S2TT portion, we evaluate S2TT X \(\to\) En in 4 languages: German, Polish, French, and Dutch.
Figure 3 shows S2TT BLEU performance when training only on S2T data from CoVoST-2 in 21 languages; we call this setup Transcripts in Figure 3. Our results demonstrate that even when training our model only on speech and transcriptions, we can achieve some zero-shot S2TT performance. We also find that S2TT BLEU scores are considerably higher for languages present in the S2T training data. For example, Polish was not in the S2T training data, and therefore its BLEU scores are the lowest.
To further improve our model's cross-lingual performance, we add readily available translation data from [23]. For each batch, we combine 25% translation and 75% S2T data. Figure 3 compares training only on S2T (Transcripts) with combining S2T and translation data (Transcriptions + Translations). We find that combining S2T and translation data significantly improves the S2TT BLEU scores in all 4 languages without training on any S2TT data. This finding demonstrates that we can improve our model's cross-lingual performance with highly accessible translation data, without needing scarce and often expensive speech-to-text translation training data.
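The following sketch shows one way such a per-batch mixture could be drawn; the 25%/75% split and the batch size follow the text, while the sampling scheme and function names are illustrative assumptions.

```python
import random

def mixed_batch(s2t_examples, translation_examples,
                batch_size=1024, translation_fraction=0.25):
    """Draw one batch with ~25% translation pairs and ~75% speech-transcription pairs."""
    n_translation = int(batch_size * translation_fraction)
    batch = random.sample(translation_examples, n_translation)
    batch += random.sample(s2t_examples, batch_size - n_translation)
    random.shuffle(batch)
    return batch
```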
The success of pre-trained LLMs has motivated the application of these models to different modalities. [18] transformed speech into pseudo-text units to introduce the task of generative spoken language modeling. [19] introduced a framework to generate audio with long-term consistency. Subsequently, [15] showed that SpeechLMs benefit from being initialized from pre-trained LLMs, while [13] demonstrated that pre-trained LLMs can be adapted to various tasks that require text and speech understanding.
On the other hand, several works aim to build joint speech and text representations. [27] introduced w2v-BERT, which combines masked language modeling and contrastive learning to create speech representations. [28] jointly pre-trains on unsupervised speech and text data. Recently, [29] employed separate speech and text encoders to generate embeddings in over 200 languages. Nevertheless, it remains unclear whether joint speech and text representations can be built from a single encoder. We fill this gap by training a pre-trained LLM jointly on speech samples and their transcriptions, showing that our approach can match speech and text in 102 languages.
We present an effective approach to developing a speech-to-text DE from a text-only LLM. Our findings suggest that by using a text-only LLM as a backbone model, we can drastically outperform previous approaches with considerably less speech-to-text training data. Additionally, we find that we can improve zero-shot speech translation by simply combining readily available translation and S2T data. We showcase our findings in 102 languages for S2T and 4 languages for S2TT, opening up the possibility of using speech-to-text DEs in different cross-modal and cross-lingual settings.
We would like to thank Shankar Kumar and Ankur Bapna for their valuable feedback on the draft of this paper; Chris Tar, Mario Guajardo-Céspedes, and Jason Riesa for early experiment discussions and feedback; and Christian Frank, Duc Dung Nguyen, Alex Tudor, and Dalia El Badawy for helping answer questions about AudioPaLM.
| Input Type | Before Tokenization | Input Ids |
|---|---|---|
| Speech | [English Speech] 50, 210, 245, \(\ldots\) | 240, 503, 32050, 32210, 32245, \(\ldots\) |
| Transcription | [English Text] Hello World . | 59, 294, 691, \(\ldots\) |
[30] showed that applying a contrastive loss to sentence encoders leads to improved retrieval performance in downstream tasks. After initializing our model from PaLM 2, we train it with the following contrastive loss [26]:
\[L = -\frac{1}{N} \sum_{i=1}^{N} \log \frac{e^{\text{sim}(\boldsymbol{x}_{i}, \boldsymbol{y}_{i})}}{\sum_{j=1}^{N} e^{\text{sim}(\boldsymbol{x}_{i}, \boldsymbol{y}_{j})}}\label{eq:loss}\tag{1}\]
Using Equation 1, our multi-modal DE learns from paired speech and text embeddings \((\boldsymbol{x}_i, \boldsymbol{y}_i)\), where \(\boldsymbol{y}_{i}\) is treated as the positive example for \(\boldsymbol{x}_i\) while all other examples \(\boldsymbol{y}_{j}\) with \(j \neq i\) are negatives. The model learns to bring the positive transcription closer to the corresponding speech sample while pushing away all other, negative transcriptions. In our training, the positive and negative distinction is made within the training batch; hence, we apply an in-batch softmax as part of our loss computation. Lastly, sim() is a similarity function formulated as the dot product between the speech and transcription embeddings.
To train our model, we use the sum of a contrastive loss and a spreadout loss [31] applied to both the speech and text embeddings. We compute the contrastive loss [32] bidirectionally, adding the losses in the speech-to-text and text-to-speech directions.
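A minimal sketch of the bidirectional in-batch contrastive term (Equation 1 applied in both the speech-to-text and text-to-speech directions); the spreadout regularizer is omitted here, and the unit weighting of the two directions is an assumption.

```python
import numpy as np

def log_softmax(x: np.ndarray, axis: int) -> np.ndarray:
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def bidirectional_contrastive_loss(speech_emb: np.ndarray,
                                   text_emb: np.ndarray) -> float:
    """speech_emb, text_emb: [N, d]; row i of each forms the positive pair,
    and every other row in the batch serves as an in-batch negative."""
    scores = speech_emb @ text_emb.T                        # sim() = dot product, [N, N]
    diag = np.arange(scores.shape[0])
    s2t = -log_softmax(scores, axis=1)[diag, diag].mean()   # speech-to-text direction
    t2s = -log_softmax(scores, axis=0)[diag, diag].mean()   # text-to-speech direction
    return float(s2t + t2s)
```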
We use the Adam [33] optimizer with a learning rate of \(1.0 \times 10^{-3}\) and a linear-ramp cosine-decay schedule with 2.5k warm-up steps. We use a dropout probability of \(0.1\) and train for 100k steps with a batch size of 1024.
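For concreteness, the schedule below implements one reading of this setup: a linear ramp to \(1.0 \times 10^{-3}\) over 2.5k steps followed by cosine decay over the remaining training steps. The decay horizon (100k steps) and the zero floor are assumptions.

```python
import math

def learning_rate(step: int, peak_lr: float = 1e-3,
                  warmup_steps: int = 2_500, total_steps: int = 100_000) -> float:
    """Linear warm-up followed by cosine decay to zero."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
```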
For training and inference, we found that using a prefix improves speech-to-text retrieval performance. Therefore, we prepend a prefix containing the language and modality, as shown in Table 3. In the case of a speech utterance, the prefix is tokenized with the LLM's tokenizer and the remainder of the input is converted to audio tokens.
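A sketch of how an input sequence might be assembled following Table 3: the language/modality prefix is tokenized with the LLM's text tokenizer and the audio tokens are shifted into the audio region of the vocabulary. The tokenizer argument is a hypothetical placeholder, and the offset of 32,000 is inferred from the example ids in Table 3.

```python
def build_speech_input(audio_tokens, tokenize, language="English",
                       text_vocab_size=32_000):
    """Prepend a '[<language> Speech]' prefix (tokenized as ordinary text) to
    audio tokens offset into the audio rows of the extended embedding table."""
    prefix_ids = tokenize(f"[{language} Speech]")        # hypothetical text tokenizer
    return list(prefix_ids) + [text_vocab_size + tok for tok in audio_tokens]
```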
| Dataset | Type | Task | Langs. | Split | Size |
|---|---|---|---|---|---|
| CoVoST-2 | Speech | S2T | 21 | Train | 900 h. |
| FLEURS | Speech | S2T | 102 | Test | 283 h. |
| FLEURS | Speech | S2TT | 102 | Test | 283 h. |
| WikiMatrix | Text | MT | 4 | Train | 9M sents. |
| | # Sents. X \(\to\) En |
|---|---|
| German (de) | 6.2M |
| Polish (pl) | 2.1M |
| French (fr) | 705k |
| Dutch (nl) | 570k |
Table 4 shows the training and evaluation datasets we used throughout our experiments. We used the 21 languages of CoVoST-2 to train our model on speech-to-text retrieval, which amounts to approximately 900 hours of speech. To evaluate our model's speech-to-text retrieval capabilities, we evaluate on the FLEURS speech-to-text test split in 102 languages. We use the FLEURS speech-to-text translation test split to evaluate our model's abilities on tasks that require cross-lingual and cross-modal knowledge. We evaluate on 4 different languages: German, Polish, French, and Dutch.
We find that combining speech-to-text retrieval data and readily available translation data improves our model's cross-lingual and cross-modal abilities. Table 5 shows the number of parallel sentences we used during training from X \(\to\) En.