April 02, 2024
We present the first self-supervised multilingual speech model trained exclusively on African speech. The model learned from nearly 60 000 hours of unlabeled speech segments in 21 languages and dialects spoken in sub-Saharan Africa (SSA). On the SSA subset of the FLEURS-102 dataset, our approach, based on a HuBERT\(_{base}\) (0.09B) architecture, shows competitive results on the automatic speech recognition (ASR) downstream task compared to the w2v-bert-51 (0.6B) pre-trained model proposed in the FLEURS benchmark, while being more efficient: it uses 7x less data and 6x fewer parameters. Furthermore, on a language identification (LID) downstream task, our approach outperforms the FLEURS baselines by over 22% in absolute accuracy.
Popular self-supervised learning (SSL) approaches have shown their potential to handle multilingual automatic speech recognition (ASR) and are capable of achieving top performance ([1]; [2]; [3]). They enable a model to be pre-trained on a vast amount of unlabeled data, producing richer audio representations for training downstream models than standard features such as MFCCs or filterbanks. A pre-trained model can be used either as a speech encoder that is fine-tuned, or as a feature extractor whose weights are frozen during downstream task training. In either case, the performance of the downstream task models is affected by the characteristics of the speech data used for pre-training [4].
[5] already demonstrated, five years ago, that transfer learning from resource-rich to resource-poor languages is more effective when the languages share similar typological features. Later, [6] revealed that 48% of the typological features indexed in the World Atlas of Language Structures (WALS) classification project do not appear in datasets. Nevertheless, most of the multilingual pre-trained speech models publicly released today are still learned mainly from very few languages, which are over-represented at the expense of others ([7]; [8]; [9]; [10]). African languages, which have unique characteristics and are under-resourced, are severely affected by this situation ([11]; [12]).
Fortunately, African languages are gaining interest in the NLP community. Several studies have demonstrated the effectiveness of Africa-centric pre-trained models, showing superior performance compared to large multilingual pre-trained models that are primarily trained on English ([13]; [14]; [15]; [16]). In speech processing, several challenges and publications of new resources have recently appeared ([17]; [18]; [19]; [20]). On the ASR downstream task, [21] obtained better performance for several African languages when applying self-supervised techniques and multilingual modeling than with traditional approaches.
In line with these works, we tackle in this paper the under-representation of African languages by proposing a multilingual pre-trained speech model specifically designed for downstream tasks in sub-Saharan African (SSA) languages, using only speech data from this region.
The pre-training dataset we created is composed of broadcast news recordings from diverse sources publicly available on the Web, collected across several countries during May 2023. In some cases, the same recording was available in several of the languages spoken in the country. The collected data contained both studio recordings (controlled environment, prepared talks) and street interviews (noisy environment, spontaneous speech). Occasionally, jingles or songs appeared in the audio content. We therefore applied a voice activity detection (VAD) tool [22] to obtain segments containing only speech. The resulting dataset comprises nearly 60 000 hours of speech segments and covers 21 languages and variants. For details, see appendix 5.
[8] publicly released a parallel speech dataset in 102 languages and proposed it as a benchmark. The data are divided into seven macro-families, including a sub-Saharan Africa group. We therefore evaluate our approach on this SSA subset (FLEURS\(_{SSA}\)), which is composed of 20 languages, 5 of which are present in our pre-training dataset.
Experiments were carried out using the well-known HuBERT approach [23] with the base configuration (90M parameters). Pre-training was performed on the unlabeled data with the fairseq toolkit [24], in two successive iterations on 4 A100 40 GB GPUs. The first iteration was trained for 275k steps, using as target labels a K-means clustering computed on the MFCCs extracted from the training set. The second iteration was trained for 500k steps, with target labels derived from embeddings of the 6th transformer layer, computed on 600 hours of the training set. The ratio between languages was preserved. The resulting pre-trained model is publicly available.
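The role of K-means in producing frame-level target labels can be illustrated with a minimal, framework-free sketch. It assumes plain Lloyd's algorithm with Euclidean distance; the toy 2-D "frames", the value of `k`, and the helper names `nearest` and `kmeans` are hypothetical and are not taken from the fairseq recipe used for the actual pre-training.

```python
import random


def nearest(point, centroids):
    """Index of the centroid closest to `point` (squared Euclidean distance)."""
    dists = [sum((p - c) ** 2 for p, c in zip(point, cen)) for cen in centroids]
    return dists.index(min(dists))


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's algorithm; returns the k centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct data points
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[nearest(p, centroids)].append(p)
        # Recompute each centroid as the mean of its cluster (keep it if empty).
        centroids = [
            [sum(dim) / len(c) for dim in zip(*c)] if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids


# Toy stand-in for acoustic feature frames (assumption: 2-D instead of MFCCs).
frames = [[0, 0], [0, 1], [1, 0], [1, 1],
          [10, 10], [10, 11], [11, 10], [11, 11]]
centroids = kmeans(frames, k=2)

# One discrete pseudo-label per frame: these play the role of the targets
# that the masked-prediction objective is trained to predict.
targets = [nearest(f, centroids) for f in frames]
```

In the second iteration, the same assignment step would be applied to transformer-layer embeddings rather than to MFCC frames.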
For downstream task training, we used the SpeechBrain toolkit [25]. The final pre-trained model serves as a speech encoder and is fully fine-tuned, with two linear layers of size 1024 and a softmax output layer on top. A first pool of speech recognition systems (60k\(_{(0.09B)}\)) is obtained by directly fine-tuning the whole model on each language of the FLEURS dataset. A second pool (60k\(_{FT-ALL (0.09B)}\)) is obtained by first fine-tuning jointly on all languages and then fine-tuning again on each language.
Following the methodology of the FLEURS paper [8], and to be consistent with its results, we did not rescore the hypotheses with a language model. Average character error rates (CERs) obtained on the 20 languages of the FLEURS\(_{SSA}\) test set are given in Table 1. The detailed scores per language are provided in appendix 6.
CER | WER | ||||
60k\(_{(0.09B)}\) | 60k\(_{FT-ALL (0.09B)}\) | FLEURS\(_{w2v-bert (0.6B)}\) | 60k | 60k\(_{FT-ALL}\) | |
average | 15.8 | 13.8 | 13.6 | 56.6 | 51.7 |
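For readers unfamiliar with the two metrics, the following sketch shows one common way to compute them from the Levenshtein edit distance. It is an illustration under stated assumptions, not the scoring code used for the experiments: the error rate is computed at corpus level, spaces count as characters for the CER, and the reference and hypothesis strings are invented examples.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, unit="char"):
    """Corpus-level error rate in percent: total edits / total reference units."""
    split = list if unit == "char" else str.split
    edits = sum(edit_distance(split(r), split(h)) for r, h in zip(refs, hyps))
    total = sum(len(split(r)) for r in refs)
    return 100.0 * edits / total


# Hypothetical reference and hypothesis, differing by a single character.
refs = ["habari ya asubuhi"]
hyps = ["habari za asubuhi"]
cer = error_rate(refs, hyps, unit="char")  # 1 edit over 17 characters
wer = error_rate(refs, hyps, unit="word")  # 1 edit over 3 words
```

A single wrong character costs one word in three here but only one character in seventeen, which is why WER values are much higher than CER values for the same systems.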
Results show that a model that is six times smaller and trained with seven times less data can achieve a performance level that is very close to the best baseline of FLEURS. This model is a step in the direction of more specific but cost-effective pre-trained approaches.
To further assess the quality of the speech representation, we fine-tuned our pre-trained model using SpeechBrain for a language identification (LID) downstream task. We employed adaptive average pooling to produce an output with shape [Batch,1,20], to which we applied a softmax. We call this model 60K\(_{LID}\). The model is trained for 15 epochs on the 20 languages of the FLEURS\(_{SSA}\) subset.
We also propose a second scenario where we employed adaptive average pooling to produce output with shape [Batch,1,768], with the addition of two linear layers to smoothly decrease the dimension from 768 to 256 then from 256 to 20. We call this model 60K\(_{LID-smooth}\). It is trained under the same conditions as 60K\(_{LID}\). Accuracy for both scenarios is presented in Table 2.
Test set | FLEURS\(_{w2v-bert}\) | FLEURS\(_{mSLAM}\) | 60K\(_{LID}\) | 60K\(_{LID-smooth}\) |
---|---|---|---|---|
FLEURS\(_{SSA}\) | 59.1 | 62.2 | 84.9 | 90.4 |
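The two LID output heads described above can be sketched without any deep learning framework to make the tensor shapes concrete. This is only one possible reading of the description: it assumes the encoder emits 768-dimensional frames, that the [Batch,1,20] variant pools over time and over feature bins at once, and that no activation sits between the two linear layers. The weights are random placeholders, and every function name is hypothetical rather than part of SpeechBrain.

```python
import math
import random


def adaptive_avg_pool(values, out_size):
    """1-D adaptive average pooling with floor/ceil bin boundaries."""
    n = len(values)
    pooled = []
    for i in range(out_size):
        start = (i * n) // out_size
        end = -((-(i + 1) * n) // out_size)  # ceil((i + 1) * n / out_size)
        pooled.append(sum(values[start:end]) / (end - start))
    return pooled


def mean_over_time(frames):
    """Average a [T][D] list of frame vectors into one D-dimensional vector."""
    return [sum(col) / len(frames) for col in zip(*frames)]


def linear(x, weights, bias):
    """Dense layer; `weights` has shape [out][in]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def random_layer(rng, n_in, n_out):
    """Untrained placeholder weights, used for shape checking only."""
    weights = [[rng.uniform(-0.05, 0.05) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def lid_direct(frames, n_langs=20):
    """First scenario: pool [T, 768] straight down to n_langs scores."""
    return softmax(adaptive_avg_pool(mean_over_time(frames), n_langs))


def lid_smooth(frames, layer1, layer2):
    """Second scenario: pool over time to 768, then 768 -> 256 -> 20."""
    hidden = linear(mean_over_time(frames), *layer1)
    return softmax(linear(hidden, *layer2))


rng = random.Random(0)
# Fake encoder output for one utterance: 50 frames of 768 features.
frames = [[rng.gauss(0.0, 1.0) for _ in range(768)] for _ in range(50)]
probs_direct = lid_direct(frames)
probs_smooth = lid_smooth(frames, random_layer(rng, 768, 256),
                          random_layer(rng, 256, 20))
```

Under this reading, the first head has no trainable parameters of its own, whereas the second adds two trainable projections on top of the pooled representation.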
Experiments have shown that our pre-trained model yields significantly improved results. This improvement can be attributed to the model’s specialization in SSA languages. Specifically, we utilized only SSA speech data for pretraining and, during fine-tuning, the model was trained solely on the 20 SSA languages from the FLEURS dataset, rather than the full dataset of 102 languages.
The results obtained on both downstream tasks suggest that our models produce relevant multilingual speech representations within the specific context of SSA languages.
To the best of our knowledge, we present the first open-source SSL model exclusively pre-trained on sub-Saharan African languages. By focusing only on African speech, which contains specific features unobserved in other languages of the world, we improved robustness on the ASR downstream task for SSA languages. We obtain results on the overall SSA subset similar to those of the best model presented in the FLEURS paper (w2v-BERT-51), while our approach is more efficient, using much less data and fewer parameters for pre-training. On a LID downstream task, our specialized model trained on the SSA context performs better than the two FLEURS baselines, gaining more than 22% in absolute accuracy.
Table 3 presents the distribution of languages in the pre-training set.
We applied automatic segmentation to the raw recordings.
For the French language set, only African-accented French was used.
The "Unknown" row at the end of the table corresponds to speech recordings with language mixing.
No automatic LID was applied to the segments.
Language | ISO-3 | Hours |
---|---|---|
Bambara | bam | 2 552 |
Dyula | dyu | 14 |
French | fra | 5 670 |
Fula | ful | 702 |
Fulfulde | ffm | 727 |
Fulfulde | fuh | 446 |
Gulmancema | gux | 13 |
Hausa | hau | 9 211 |
Kinyarwanda | kin | 8 046 |
Kituba | ktu | 647 |
Lingala | lin | 1 269 |
Luba-Lulua | lua | 675 |
Mossi | mos | 13 |
Maninkakan | mwk | 791 |
Sango | sag | 1 268 |
Songhai | son | 780 |
Swahili | swc | 706 |
Swahili | swh | 13 926 |
Tamasheq | taq | 1 212 |
Wolof | wol | 64 |
Zarma | dje | 567 |
Unknown | — | 10 272 |
Total | — | 59 572 |
The results listed below were obtained by applying monolingual fine-tuning to each sub-Saharan African language in the test set of the FLEURS benchmark.
Scores in bold show the best result depending on the approach. We show character error rate (CER) scores along with word error rates (WERs).
Language | CER, 60k\(_{(0.09B)}\) | CER, 60k\(_{FT-ALL (0.09B)}\) | WER\(^*\), 60k | WER\(^*\), 60k\(_{FT-ALL}\) |
---|---|---|---|---|
Seen languages | | | | |
Fula | 21.2 | 17.8 | 61.9 | 56.4 |
Hausa | 10.5 | 9.0 | 32.5 | 29.4 |
Lingala | 8.7 | 6.9 | 24.7 | 20.9 |
Swahili | 7.1 | 5.5 | 23.8 | 20.3 |
Wolof | 19.4 | 17.0 | 55.0 | 50.7 |
average | 13.4 | 11.2 | 39.6 | 35.5 |
Unseen languages | | | | |
Afrikaans | 23.3 | 20.3 | 68.4 | 62.6 |
Amharic | 15.9 | 14.9 | 52.7 | 49.0 |
Ganda | 11.5 | 10.7 | 52.8 | 50.3 |
Igbo | 19.7 | 17.2 | 57.5 | 52.9 |
Kamba | 16.1 | 15.6 | 53.9 | 53.7 |
Luo | 9.9 | 8.2 | 38.9 | 34.9 |
Northern Sotho | 13.5 | 11.7 | 43.2 | 38.9 |
Nyanja | 13.3 | 10.9 | 54.2 | 48.3 |
Oromo | 22.8 | 20.1 | 78.1 | 74.8 |
Shona | 11.6 | 8.3 | 50.2 | 39.3 |
Somali | 21.6 | 19.7 | 64.9 | 60.3 |
Umbundu | 21.7 | 18.8 | 61.7 | 54.2 |
Xhosa | 11.9 | 9.9 | 51.6 | 45.9 |
Yoruba | 24.3 | 23.5 | 67.5 | 65.7 |
Zulu | 12.2 | 9.6 | 53.4 | 44.9 |
average | 16.6 | 14.6 | 56.6 | 51.7 |
overall average | 15.8 | 13.8 | 52.3 | 47.7 |
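The three average rows can be reproduced from the per-language scores as unweighted (macro) averages. The sketch below does so for the CER column of the directly fine-tuned 60k model, with values copied from the table above; the helper name `macro_average` is ours.

```python
# CER of the directly fine-tuned 60k model, copied from the table above.
seen_cer = {
    "Fula": 21.2, "Hausa": 10.5, "Lingala": 8.7, "Swahili": 7.1, "Wolof": 19.4,
}
unseen_cer = {
    "Afrikaans": 23.3, "Amharic": 15.9, "Ganda": 11.5, "Igbo": 19.7,
    "Kamba": 16.1, "Luo": 9.9, "Northern Sotho": 13.5, "Nyanja": 13.3,
    "Oromo": 22.8, "Shona": 11.6, "Somali": 21.6, "Umbundu": 21.7,
    "Xhosa": 11.9, "Yoruba": 24.3, "Zulu": 12.2,
}


def macro_average(scores):
    """Unweighted mean over languages: each language counts equally."""
    return sum(scores.values()) / len(scores)


seen_avg = macro_average(seen_cer)
unseen_avg = macro_average(unseen_cer)
# The overall figure averages all 20 languages; it is not the mean of the
# two group averages, because the groups have different sizes (5 and 15).
overall_avg = macro_average({**seen_cer, **unseen_cer})
```

Rounded to one decimal, the three values match the 13.4, 16.6 and 15.8 reported in the average rows.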