Poro 34B and the Blessing of Multilinguality
April 02, 2024
The pretraining of state-of-the-art large language models now requires trillions of words of text, which is orders of magnitude more than available for the vast majority of languages. While including text in more than one language is an obvious way to acquire more pretraining data, multilinguality is often seen as a curse, and most model training efforts continue to focus near-exclusively on individual large languages. We believe that multilinguality can be a blessing and that it should be possible to substantially improve over the capabilities of monolingual models for small languages through multilingual training. In this study, we introduce Poro 34B, a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages, and demonstrate that a multilingual training approach can produce a model that not only substantially advances over the capabilities of existing models for Finnish, but also excels in translation and is competitive in its class in generating English and programming languages. We release the model parameters, scripts, and data under open licenses at https://huggingface.co/LumiOpen/Poro-34B.
Neural language models based on the transformer architecture [1] have led to substantial advances in natural language processing. Encoder-only transformer models such as BERT [2] have advanced the state of the art in a broad range of classification tasks, while decoder-only models such as GPT [3] have redefined what can be achieved by generative models, opening new areas of study in e.g. prompting and in-context learning. The success of these models is related in substantial part to their scaling properties: training larger models on more data leads to better results and even entirely new capabilities [4]. Studies refining our understanding of the optimal balance of model size and training steps have increased the demands on data [5], and many recent models optimize further for inference-time efficiency by training smaller models on more data [6].
These developments have introduced increasing demands for textual data, with many recent models pretrained on a trillion tokens or more [7]–[12]. While such resources can still be assembled from internet crawls for a few of the languages best represented online, for the vast majority of human languages we have already run out of data for training the largest of language models [13], [14]. While it is standard to repeat training data, repetition can lead to reduced sample efficiency and degradation of performance [15]: [16] estimate that the value of repetition starts to diminish rapidly after four epochs and that repetition ceases to add information around 40 epochs. The availability of data is thus currently a limit for monolingual training for all but a few of the highest-resourced languages.
Multilingual training offers one obvious solution for increasing the amount of training data available, and a large number of multilingual transformer models have been introduced [17]–[20]. However, despite the intuitive appeal of augmenting training data with texts in other natural languages, multilinguality is frequently seen as a negative – commonly referred to as the curse of multilinguality [17]. While there have been studies of the tradeoffs between monolingual and multilingual training [21], [22] as well as efforts to enhance models specifically for multilinguality [23] and to introduce additional language capabilities to existing models [24]–[27], state-of-the-art generative models are still frequently trained near-exclusively on large languages such as English, with only limited efforts specifically focusing on optimizing performance for smaller languages. In this study, we explore how to lift data limitations to create state-of-the-art large generative models from scratch for smaller languages, drawing on the emerging understanding of how to make the most of limited data and turn multilinguality from a curse into a blessing. Some key lessons include 1) limited multilinguality instead of a large number of languages [17], [22], 2) matching scripts (e.g. Latin) [21], 3) matching language families [28], 4) incorporating a cross-lingual signal using translation pairs [20], [29], 5) oversampling target language data up to four epochs [16], and 6) augmenting natural language with programming language data [30].
We chose to specifically target the Finnish language, which is an interesting case for study as it is a Uralic language with no large close neighbours in its language family, necessitating more distant transfer than e.g. between English and another Germanic language. While the language is natively spoken by under six million people, its resources are still sufficient to consider a monolingual training approach for larger generative models. In a recent study, [31] combined several web crawls and curated sources of Finnish to create a dataset of approximately 40B tokens and introduced the monolingual FinGPT models, trained from scratch for 300B tokens (8 epochs). The largest of these models show signs of data limitations, with the 8B parameter model outperforming the 13B in benchmarks. We believe it should be possible to overcome these limitations by applying the lessons listed above. While we cannot match language families, we train for four epochs over the Finnish data and augment it with both English and programming language data as well as an explicit cross-lingual signal from translation pairs. We pursue this approach to create Poro 34B, training a 34B parameter model for a total of 1T tokens – 25 times more than the available Finnish data – and evaluate the model in detail on Finnish, English, and programming language tasks. We find that the model not only achieves the goal of substantially advancing over the performance of existing Finnish models, but is also competitive in its class of open models on English and code as well as remarkably strong in translation tasks.
For pretraining Poro 34B, we rely on datasets that have been previously preprocessed to remove low-quality texts and boilerplate, filter toxic content, and deduplicate repeated texts. We illustrate the pretraining data distribution in Figure 1 and describe the data briefly in the following. Data sources are detailed in Table 5 in the Reproducibility section.
For Finnish pretraining data, we draw on the resources recently introduced by [31] for creating the FinGPT model family. We exclude the ePub and Lehdet resources provided by the National Library of Finland for that work as they could not be shared due to copyright limitations, but use the remaining sources of data, totalling a 32B-token monolingual corpus. The majority of the Finnish data originates from web crawls (approx.%) complemented with news sources (approx.%), Project Lönnrot, the Finnish equivalent of the Project Gutenberg copyright-free book corpus (approx.%), Wikipedia (approx.%) and Finnish online discussion forum contents from Reddit and Suomi24 (approx.%). Following the rule of thumb proposed by [16], we upsample the 32B tokens of Finnish so that four epochs over the data are made during training. Consequently approx.% of the total tokens seen in pretraining are Finnish.
For English pretraining data, we use SlimPajama [32], a cleaned and deduplicated subset of the RedPajama corpus [33], from which we excluded data from the books category due to their copyright status. We supplemented this dataset with the Project Gutenberg public domain books data from the Dolma corpus [34]. We train for one epoch over the 542B tokens of the English data, which thus represents slightly over half of the 1T total training tokens.
To introduce data representing various programming languages (referred to hereinafter as "code" for short) into our pretraining, we make use of the Starcoder corpus [10], a processed subset of The Stack corpus [35]. The original corpus consists of 208B tokens, which we oversample 1.5x so that approximately a third of the tokens seen during pretraining represent code.
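The data proportions stated above can be checked with a few lines of arithmetic. The following is a minimal sketch that uses only the corpus sizes and repetition factors given in the text (32B Finnish tokens for four epochs, 542B English tokens for one epoch, 208B code tokens oversampled 1.5x) against the 1T-token training budget. The variable and function names are illustrative, and the resulting percentages are derived here rather than quoted from the paper.

```python
# Corpus sizes (billions of tokens) and repetition factors as stated in the text.
# Names are illustrative; the percentages are derived, not reported figures.
SOURCES = {
    "finnish": (32.0, 4.0),   # 32B tokens, four epochs
    "english": (542.0, 1.0),  # 542B tokens, one epoch
    "code": (208.0, 1.5),     # 208B tokens, oversampled 1.5x
}
TOTAL_TOKENS_B = 1000.0  # 1T-token training budget


def effective_tokens(sources):
    """Tokens seen during training (billions) per source: size * repetition."""
    return {name: size * reps for name, (size, reps) in sources.items()}


def shares(sources, total):
    """Percentage of the total training budget taken by each source."""
    seen = effective_tokens(sources)
    return {name: 100.0 * tokens / total for name, tokens in seen.items()}


for name, pct in shares(SOURCES, TOTAL_TOKENS_B).items():
    print(f"{name}: {pct:.1f}% of training tokens")
```

Under these figures English accounts for slightly over half of the budget and code for roughly a third, consistent with the description above.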
We introduce a cross-lingual signal into pretraining by including translation examples from OPUS [36]. Specifically, we use the English-Finnish examples from the Tatoeba challenge dataset [37] to generate instruction-formatted translation examples. The Tatoeba challenge training data was reformatted into a minimalistic instruction-following format by recasting each English-Finnish translation pair into a document with the following format:
<|user|>Translate into Finnish: {{en}}
<|assistant|>{{fi}}
Where {{en}} and {{fi}} are the English and Finnish texts (resp.) of the translation pair. We additionally reverse the translation order (i.e. Finnish to English instead of English to Finnish) for a total of two documents for each sentence pair. No weighting is applied to the approx. 8B tokens of cross-lingual data, which thus represents slightly under 1% of the pretraining tokens.
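The recasting of a sentence pair into two documents can be sketched as follows. This is an illustrative reconstruction, not the released preprocessing script: the function name is hypothetical, and the wording of the reversed instruction ("Translate into English:") is an assumption, since only the English-to-Finnish template is shown above.

```python
def make_translation_documents(en, fi):
    """Build the two instruction-formatted documents for one sentence pair."""
    # English-to-Finnish direction, following the template shown in the text.
    en_to_fi = f"<|user|>Translate into Finnish: {en}\n<|assistant|>{fi}"
    # Assumption: the reversed direction mirrors the template with "English".
    fi_to_en = f"<|user|>Translate into English: {fi}\n<|assistant|>{en}"
    return [en_to_fi, fi_to_en]


for doc in make_translation_documents("Good morning.", "Hyvää huomenta."):
    print(doc)
    print("---")
```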
In this section, we describe the method used to create the Poro 34B tokenizer and the pretraining setup, and provide an estimate of the compute cost of pretraining the model.
The choice of tokenizer has a broad range of impacts, not only on the efficiency of training and inference but also the capabilities of trained models [38]–[40]. As we were not aware of any existing tokenizer that would be a good fit for our combination of languages and code, we created a new tokenizer for our model. Specifically, we trained a custom byte-level BPE tokenizer using the same pre-normalization as the FinGPT tokenizer. We selected a vocabulary size of 128K tokens, aiming to achieve low fertility on the targeted languages while keeping the vocabulary reasonably small. The tokenizer was trained on a uniform distribution of samples of the Finnish, English and code datasets.
We assess the fertility of the tokenizer on the English and Finnish sentences from the devtest portion of the widely used Flores-101 benchmark for machine translation [41], which allows for a degree of cross-lingual comparability. For code, we use an approx. 1M character sample of lines from the Starcoder held-out test data. Figure 2 compares the fertility of the tokenizer to that of selected reference tokenizers (see Section 4). We find that on this data the new Poro 34B tokenizer has at least broadly comparable fertility to the lowest-scoring tokenizer on each of Finnish, English, and code, as well as the lowest average fertility of the compared tokenizers.
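To make the measurement concrete, the sketch below computes fertility as the average number of tokens per whitespace-delimited word. This is a common definition, but treating it as the one used here is an assumption, as the exact word segmentation is not spelled out in the text. The toy tokenizer is a hypothetical stand-in; in practice the `tokenize` argument would wrap the tokenizer under evaluation.

```python
from typing import Callable, Iterable, List


def fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per whitespace-delimited word over all texts."""
    n_tokens = 0
    n_words = 0
    for text in texts:
        n_tokens += len(tokenize(text))
        n_words += len(text.split())
    if n_words == 0:
        raise ValueError("input contains no words")
    return n_tokens / n_words


def toy_pair_tokenizer(text: str) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts every word into 2-character pieces."""
    pieces: List[str] = []
    for word in text.split():
        pieces.extend(word[i:i + 2] for i in range(0, len(word), 2))
    return pieces


print(fertility(["talo on iso", "the house is big"], toy_pair_tokenizer))
```

Lower values mean fewer tokens are needed per word, which translates into more text per fixed-length context and cheaper training and inference.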
We next briefly present the key model and training parameters (detailed in Table 6 in the Reproducibility section) and the pretraining software and configuration. The hardware used to train the model is described in detail in Appendix 6.1.
Architecture. Poro 34B is a decoder-only model with a parameter count of 34 billion, sharing its architecture with FinGPT [31] and BLOOM [19]. It incorporates an extra layer normalization immediately following the input embedding layer for better training stability and uses ALiBi [42] as its positional encoding method. The model consists of 54 layers with a hidden dimension of 7168 and a total of 56 attention heads.
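As a rough sanity check of the stated model size, the parameter count can be approximated from the layer count, hidden dimension, and vocabulary size (131072, per Table 6). The sketch below uses the common approximation of about 12·d² weights per transformer layer. It assumes a 4x feed-forward expansion as in BLOOM-style models and ignores biases and layer normalization parameters, so it is an estimate rather than an exact count.

```python
def estimate_parameters(n_layers, d_model, vocab_size, ffn_multiplier=4):
    """Approximate parameter count of a decoder-only transformer.

    Ignores biases and layer-norm weights; assumes tied input/output
    embeddings (counted once) and a feed-forward width of
    ffn_multiplier * d_model (the 4x default is an assumption).
    """
    attention = 4 * d_model * d_model  # Q, K, V and output projections
    feed_forward = 2 * ffn_multiplier * d_model * d_model  # up and down projections
    embeddings = vocab_size * d_model
    return n_layers * (attention + feed_forward) + embeddings


total = estimate_parameters(n_layers=54, d_model=7168, vocab_size=131072)
print(f"approx. {total / 1e9:.1f}B parameters")
print(f"head dimension: {7168 // 56}")
```

The estimate lands close to the nominal 34B, and the hidden dimension divides evenly into 56 heads of dimension 128.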
Training. We train to 1T tokens, intentionally exceeding the Chinchilla compute-optimality estimate [43] of approx. 700B tokens for a model of this size, thus gaining inference-time efficiency at the cost of additional compute investment in pretraining [6]. We train with a sequence length of 2048 tokens using a cosine learning rate scheduler with a maximum learning rate of 1.5e-4, decaying to a minimum of 1e-5 over 990B tokens, and a linear warmup of 10B tokens. Our global batch size is 2048 samples, totaling 4194304 tokens per optimization step.
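The learning rate schedule described above can be illustrated with a token-based function. This is a minimal sketch using the stated values, not the Megatron-DeepSpeed implementation: it assumes the warmup rises linearly from zero and that the cosine decay spans the 990B tokens following the warmup.

```python
import math

MAX_LR = 1.5e-4
MIN_LR = 1e-5
WARMUP_TOKENS = 10e9
DECAY_TOKENS = 990e9
TOKENS_PER_STEP = 2048 * 2048  # global batch size * sequence length


def learning_rate(tokens_seen):
    """Linear warmup followed by cosine decay, indexed by tokens seen."""
    if tokens_seen < WARMUP_TOKENS:
        # Assumption: warmup starts from a learning rate of zero.
        return MAX_LR * tokens_seen / WARMUP_TOKENS
    progress = min(1.0, (tokens_seen - WARMUP_TOKENS) / DECAY_TOKENS)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return MIN_LR + (MAX_LR - MIN_LR) * cosine


for tokens in (0, 5e9, 10e9, 505e9, 1e12):
    print(f"{tokens / 1e9:7.0f}B tokens -> lr {learning_rate(tokens):.2e}")
```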
Software. Poro 34B was trained on the GPU partition of the LUMI supercomputer, which is powered by AMD MI250X GPUs. The majority of open source frameworks for large language model pretraining are made to be primarily NVIDIA-compatible, and we required scalable AMD-compatible training software. Thus, we adopted the Megatron-DeepSpeed fork introduced by [31], which has optimized kernels converted from CUDA to be compatible with AMD ROCm, and has been demonstrated to be a viable solution for large model pretraining on LUMI.
Configuration. Considering the hardware available and the selected hyperparameters such as batch size, a configuration of 128 nodes was chosen for the training of the model, resulting in a world size of 1024. The training was done using activation checkpointing, a micro batch size of 1, gradient accumulation of 16, and a 3D parallelism strategy with tensor parallelism of degree 2 and pipeline parallelism of degree 4, resulting in a data parallel degree of 128. This configuration achieved a total training throughput of 49618 TFLOPs and 174378 tokens/second.
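The relationship between these configuration values can be verified with simple arithmetic. The sketch below derives the world size, data parallel degree, and global batch size from the stated settings, using the eight logical devices per node described in the hardware appendix. The training-time figure is a back-of-the-envelope lower bound that assumes the reported throughput is sustained without interruption; it is not a figure reported in the paper.

```python
GCDS_PER_NODE = 8  # four MI250X GPUs per node, two logical devices each


def parallel_layout(nodes, tensor_parallel, pipeline_parallel,
                    micro_batch, grad_accum, seq_len):
    """Derive data parallel degree and batch sizes from a 3D parallel setup."""
    world_size = nodes * GCDS_PER_NODE
    model_parallel = tensor_parallel * pipeline_parallel
    if world_size % model_parallel != 0:
        raise ValueError("world size must be divisible by TP * PP")
    data_parallel = world_size // model_parallel
    global_batch = data_parallel * micro_batch * grad_accum
    return {
        "world_size": world_size,
        "data_parallel": data_parallel,
        "global_batch": global_batch,
        "tokens_per_step": global_batch * seq_len,
    }


def days_to_train(total_tokens, tokens_per_second):
    """Idealized wall-clock days at a constant throughput."""
    return total_tokens / tokens_per_second / 86400


layout = parallel_layout(nodes=128, tensor_parallel=2, pipeline_parallel=4,
                         micro_batch=1, grad_accum=16, seq_len=2048)
print(layout)
print(f"approx. {days_to_train(1e12, 174378):.1f} days at constant throughput")
```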
Following [12], we estimate the carbon footprint of our pretraining by multiplying the theoretical upper bound of the total power used by the GPUs when they are utilized at 100% with the carbon intensity factor of LUMI. Taking into account the system’s power usage effectiveness (PUE) value of 1.04, we approximate the total power consumption to be 448 MWh. As LUMI is powered by fully renewable electricity, we assume the carbon intensity factor to be 0. This brings our emissions to a total of 0 \(\mathrm{tCO_{2}eq}\). It is important to note that we only take into account power consumption of the GPUs used, as the consumption of the entire node was not logged during training.
We thoroughly analyze the capabilities of the model for Finnish, English and code, first briefly reporting perplexity results and then focusing on community-standard benchmarks for evaluating generative models. We then assess the quality of Finnish text generated by the model and finally evaluate the model’s translation capability from English to Finnish (and vice versa). We primarily compare the performance of the model to a selection of similarly-sized general-purpose open source base language models: Llama 33B [7], MPT 30B [9], and Falcon 40B [46]. For Finnish-language evaluations we additionally compare to FinGPT 8B and FinGPT 13B [31], and for code we also compare to StarCoder base [10].
We assess the perplexity of the model on the same data used to evaluate tokenizer fertility (Section 3.1), namely Flores-101 devtest English and Finnish and a sample of the StarCoder test data. As token-level perplexity is dependent on tokenization, it cannot be used to directly compare models with different tokenizers. We therefore report character-level perplexity \(PPL_c\) following [47], normalizing by character rather than token count when calculating perplexity.
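The normalization can be illustrated with a short sketch. Given the per-token log-probabilities a model assigns to a text, character-level perplexity divides the total negative log-likelihood by the number of characters instead of the number of tokens. The function names are illustrative, natural logarithms are assumed, and counting characters with `len(text)` is an assumption about the exact convention of [47].

```python
import math


def char_perplexity(token_logprobs, text):
    """Perplexity normalized by character count rather than token count."""
    n_chars = len(text)
    if n_chars == 0:
        raise ValueError("text must not be empty")
    return math.exp(-sum(token_logprobs) / n_chars)


def token_perplexity(token_logprobs):
    """Conventional token-level perplexity, shown for comparison."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


# Toy example: a 4-character text split into 2 tokens, each with probability 0.25.
logprobs = [math.log(0.25), math.log(0.25)]
print(token_perplexity(logprobs))
print(char_perplexity(logprobs, "talo"))
```

Because the total log-likelihood of the text is the same however it is tokenized, dividing by the character count yields a value that can be compared across models with different tokenizers.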
We benchmark the capabilities of the model in Finnish using the FIN-bench dataset [31], which covers a variety of tasks to assess various aspects of model capabilities in Finnish, combining tasks translated and manually corrected from English BIG-bench [48] with additional Finnish tasks. We evaluate all FIN-bench results in a 3-shot setting using the standard metrics defined for the benchmark. For English evaluations, we use LM Eval Harness [49] to evaluate with the following datasets: ARC Challenge [50], GSM8K [51], HellaSwag [52], MMLU [53], TruthfulQA [54], and Winogrande [55]. We selected these evaluations based on their use as English language benchmarks by [56] and use an identical testing configuration here. Programming language proficiency is assessed via the Bigcode Evaluation Harness [57] with the HumanEval [58] and MBPP [59] benchmarks, employing the pass@10 metric for evaluation.
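For readers unfamiliar with the pass@k metric, the sketch below implements the unbiased estimator commonly used with HumanEval [58]: given n generated samples for a problem, of which c pass the unit tests, it estimates the probability that at least one of k randomly drawn samples passes. The number of samples generated per problem in our evaluation is not restated here, so the values of n in the example are purely illustrative.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples, c of which are correct."""
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    if n - c < k:
        # Every size-k subset must contain at least one correct sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative values only: 20 samples per problem, 1 of them correct.
print(pass_at_k(n=20, c=1, k=10))
```

A benchmark score is then the mean of this per-problem estimate over all problems.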
To evaluate the quality of Finnish text generation, we prompt the model with short Finnish phrases and use GPT-4 [60] to judge the quality of the text in terms of coherence, grammatical correctness, and use of specialised vocabulary. Finally, to evaluate translation performance, we use both the Flores-101 devtest [41] data as well as the Tatoeba Challenge test sets [37] in an 8-shot setting, following [61].
Table 1: Mean character-level perplexity \(PPL_c\) (lower is better).

| | Poro 34B | Llama 33B | MPT 30B | Falcon 40B | FinGPT 8B | FinGPT 13B | StarCoder |
|---|---|---|---|---|---|---|---|
| Finnish | 1.89 | 2.98 | 2.89 | 3.57 | 1.94 | 1.92 | 3.83 |
| English | 1.87 | 1.81 | 1.89 | 1.85 | 2.55 | 2.46 | 2.38 |
| Code | 3.21 | 4.27 | 3.58 | 3.65 | 25.1 | 27.3 | 3.15 |
| Average | 2.32 | 3.02 | 2.79 | 3.02 | 9.86 | 10.6 | 3.12 |

Table 2: Summary of benchmark evaluation results for Finnish, English, and code.

| | Poro 34B | Llama 33B | MPT 30B | Falcon 40B | FinGPT 8B | FinGPT 13B | StarCoder |
|---|---|---|---|---|---|---|---|
| Finnish | 66.28 | 53.36 | 53.22 | 42.58 | 49.69 | 48.92 | 45.55 |
| English | 50.57 | 59.96 | 52.62 | 49.87 | 31.47 | 32.85 | 35.44 |
| Code | 41.80 | 37.67 | 39.18 | 38.57 | - | - | 49.06 |
Table 1 summarizes the results of the perplexity evaluation as mean character-level perplexity \(PPL_c\) for various models over the sentences/code lines. We find that Poro 34B has comparatively low (good) \(PPL_c\) on all three datasets, including the best result for Finnish. Poro 34B is to the best of our knowledge the only open model specifically trained for this combination of languages, and it is thus not surprising that it has the best overall average in this evaluation. While perplexity is not necessarily predictive of downstream performance and these datasets only represent a part of the relevant distribution, the result suggests that the model has learned all of its target languages well.
The overall results of the benchmark evaluations are summarized in Table 2 and detailed in Appendix 6.2. We find that Poro 34B is the best-performing model for Finnish in this comparison, substantially outperforming the best previously introduced monolingual Finnish model. We further analyzed the progression of the Finnish capabilities by evaluating Poro 34B checkpoints at 10% intervals on FIN-bench. These results are summarized in Figure 3. Interestingly, the model outperforms the best FinGPT model already after 100B tokens of training (10%) despite the relatively small proportion of Finnish in the Poro 34B data and the fact that the FinGPT models were trained on 300B tokens in total. These results indicate that our limited multilingual approach is effective for creating stronger models for Finnish than possible through monolingual training and demonstrate that the model is benefiting substantially from its training data in other languages when tested on Finnish tasks.
For English, we find that the model achieves broadly comparable results to the MPT 30B and Falcon 40B models, both of which were trained for 1T tokens of predominantly English data. This indicates that the limited multilingual training approach has not notably detracted from the English capabilities of the model. The best-performing open model in this comparison is Llama 33B, which was trained for longer (1.4T tokens), also predominantly on English data. We find that Poro 34B is nevertheless a capable model in its class also for English, despite not optimizing specifically for English performance. The programming language benchmarks indicate that Poro 34B is more capable on code than the other natural language-focused models, while the code-focused StarCoder model clearly outperforms all of the other models. We attribute the relatively high performance of Poro 34B on code to the comparatively large proportion of the training data dedicated to code. As with English, we consider the performance of the model on code a positive addition to its capabilities, even though code generation was not a primary goal in creating the model.
Finally, we note a surprising finding arising from the Finnish evaluation: two of the larger English-focused models (Llama 33B and MPT 30B) score higher than the previously introduced smaller monolingual Finnish models on the FIN-bench benchmark. While FIN-bench tasks are in Finnish, the benchmark consists of multiple-choice rather than generation tasks, has been produced in substantial part through translation from English, and includes tasks with little emphasis on natural language (e.g. arithmetic). We hypothesize that the comparatively high performance of the English-focused models on this benchmark might not indicate that they can generate fluent Finnish, which also calls the Finnish proficiency of Poro 34B into question. We study this question specifically in the following section.
To compare the ability of Poro 34B to generate coherent and grammatically correct texts in Finnish with that of other models, we prompted the models with short Finnish texts and evaluated the output using an LLM-as-a-judge approach [62]. To ensure the prompt texts were not part of the training data of any of the models, we collected 50 recent news headlines from the Finnish broadcaster Yle and 50 thesis titles published in 2024 from an online archive. We prompted the models with the phrase Seuraava teksti on kirjoitettu suomeksi: (Eng. The following text is written in Finnish:) followed by the title. We repeated the generation for each title five times, giving us 250 news and 250 academic output texts to evaluate. For every text from Poro 34B and the corresponding text from a baseline model, we ask GPT-4, as our judge model, to pick a preference based on grammatical correctness, coherence, and usage of vocabulary using a prompt adapted from that used in MT-bench [62] (the prompt is provided in Appendix 6.3). We query the judge twice for each pair of texts, switching the positions of the texts to eliminate any potential impact of ordering preference. We consider the pair a win for a model if the judge picks the text from that model twice, and a tie if the judge changes its preference when the order is switched.
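The win/tie rule with position switching can be sketched as follows. The code assumes that each judge verdict has already been mapped back from a position ("first" or "second" text) to the name of the preferred model; the function names and labels are illustrative and do not come from the released evaluation scripts.

```python
from collections import Counter


def pair_outcome(verdict_original_order, verdict_swapped_order):
    """Combine two judge verdicts (preferred model names) for one text pair.

    A model wins only if it is preferred in both presentation orders;
    a preference that flips when the order is switched counts as a tie.
    """
    if verdict_original_order == verdict_swapped_order:
        return verdict_original_order
    return "tie"


def summarize(outcomes):
    """Percentage of pairs per outcome label."""
    if not outcomes:
        raise ValueError("no outcomes to summarize")
    counts = Counter(outcomes)
    return {label: 100.0 * n / len(outcomes) for label, n in counts.items()}


verdict_pairs = [("poro", "poro"), ("poro", "baseline"),
                 ("baseline", "baseline"), ("poro", "poro")]
print(summarize([pair_outcome(a, b) for a, b in verdict_pairs]))
```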
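Table 3: Results of the GPT-4-judged comparison of Finnish text generation between Poro 34B and each baseline model.

| | FinGPT 8B | FinGPT 13B | Llama 33B | MPT 30B | Falcon 40B |
|---|---|---|---|---|---|
| News | 52.0 / 12.4 | 53.2 / 13.2 | 2.0 / 8.4 | 10.8 / 11.6 | 0.8 / 6.4 |
| Academic | 46.0 / 10.0 | 50.0 / 10.4 | 0.8 / 10.4 | 21.2 / 7.2 | 0.4 / 10.0 |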
The results of the experiments are summarized in Table 3. We find that the multilingual Poro 34B is roughly on par with the monolingual FinGPT models, performing competitively on academic text generation while falling narrowly behind on news text generation. By contrast, the English-focused models are clearly worse in generating Finnish text, with the best-performing model (MPT 30B on academic text) being preferred in only 21% of cases. The most striking contrast between the benchmark and open generation results is for Llama 33B, which scored higher than FinGPT on FIN-bench, but is preferred only in 2% of cases to Poro 34B in generation, where Poro 34B and FinGPT show roughly equal capabilities. These results support our hypothesis that multiple-choice benchmarks created (in substantial part) through translation of English benchmarks may be a poor measure of the specific language capabilities of large models. Unlike the English language models, Poro 34B continues to show strong performance here, showing that it is indeed both capable in general, and fluent in Finnish.
Table 4: Translation results (spBLEU) on Flores-101 and Tatoeba.

Model | Flores-101 En-Fi | Flores-101 Fi-En | Tatoeba En-Fi | Tatoeba Fi-En |
---|---|---|---|---|
ChatGPT | 33.4 | 35.9 | - | - |
GPT4 | 35.3 | 40.2 | - | - |
Google Translate | 37.3 | 39.0 | - | - |
M2M-12B | 33.4 | 33.8 | 36.7 | 41.3 |
NLLB-1.3B | 30.0 | 35.4 | 40.2 | 55.7 |
OPUS-MT | 37.2 | 35.6 | 46.7 | 58.4 |
Poro 34B | 37.6 | 39.8 | 47.3 | 60.5 |
General-purpose language models have shown promising results on translation benchmarks on multiple languages [63]–[65]. Following [61], we evaluated Poro 34B for English to Finnish translation and vice versa on the first 100 sentences of the Flores-101 test data by prompting the model with eight translation examples sampled randomly from the development set, formatting the examples simply as <src>=<trg>. We further evaluated Poro 34B and three strong open-source translation models on the Tatoeba Challenge test set with more than 11,000 sentences: OPUS-MT [66], NLLB-1.3B [67], and M2M-100-12B [68]. We used the standard SentencePiece BLEU (spBLEU) as our metric. The results of both evaluations are shown in Table 4. These results demonstrate that Poro 34B is a remarkably strong translator, outperforming not only dedicated open-source translation models but even Google Translate, and scoring roughly on par with GPT-4 in this evaluation. We attribute this result to the combination of strong Finnish and English capabilities and the inclusion of a comparatively large number of translation examples in the pretraining data.
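The few-shot prompt format used in the translation evaluation can be sketched as follows. Eight source-target pairs are sampled from a development set and written as `<src>=<trg>` lines, followed by the sentence to translate. Separating the examples with newlines, ending the prompt with an open `=`, and the fixed random seed are assumptions made for this illustration; the function name is hypothetical.

```python
import random


def build_few_shot_prompt(dev_pairs, source_sentence, n_shots=8, seed=0):
    """Format n_shots sampled (src, trg) pairs plus the query as '<src>=<trg>' lines."""
    rng = random.Random(seed)  # fixed seed only for reproducibility of the sketch
    shots = rng.sample(dev_pairs, n_shots)
    lines = [f"{src}={trg}" for src, trg in shots]
    # Assumption: the model is expected to continue after the final '='.
    lines.append(f"{source_sentence}=")
    return "\n".join(lines)


# Placeholder pairs standing in for development-set sentences.
dev_pairs = [(f"English sentence {i}.", f"Suomenkielinen lause {i}.") for i in range(20)]
print(build_few_shot_prompt(dev_pairs, "The weather is nice today."))
```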
It should be noted, however, that the Tatoeba and Flores sentences are relatively short and simple, and this evaluation thus does not capture the full picture of the translation capabilities of the evaluated models. We aim to assess the translation capability of Poro 34B more comprehensively on longer texts, especially texts that might include different modalities such as tables and code, in future work.
In this study, we have considered the challenges that the availability of data poses for pretraining large generative models for smaller languages and explored a limited multilingual approach to create Poro 34B, a 34B-parameter model trained on 1T tokens of Finnish, English, and code, including 8B tokens of Finnish-English translation pairs. We thoroughly evaluated the model and found it to substantially advance over the performance of existing models for Finnish while also performing competitively in its class of open models for English and code generation, as well as achieving remarkably good results in translation tasks. We additionally observed some limitations of benchmarks with tasks translated from English in measuring the specific capability of models in Finnish, which diverged substantially from model-as-judge evaluation of generation fluency.
Our model architecture and the Finnish datasets included follow those of the FinGPT family of monolingual Finnish models, which were limited by the available Finnish training data. The superior performance of our model in Finnish evaluations demonstrates that multilingual training can lift such limitations, allowing further scaling of models focused on smaller languages. In future work, we hope to explore this effect more systematically to answer some of the many questions that remain open regarding the training of large generative models for smaller languages, including the impacts of covering multiple smaller languages and the effect of the size of data available in the target languages.
A number of the choices made in training Poro 34B were made with incomplete information regarding their specific impacts on the final model. For example, we opted to include a comparatively large amount of programming language data as well as instruction-formatted translation examples in the pretraining data, the latter on the assumption that this would provide a cross-lingual signal that would strengthen the ability of the model to benefit from data in a more distantly related language (English). While this approach is intuitively appealing and the performance of our model suggests that it has at a minimum not notably detracted from the capabilities of the model, we did not as part of this work have the resources to conduct ablation studies nor to explore alternative ways to incorporate cross-lingual information in pretraining. We aim to study these questions further in future work.
We hope that our approach can serve as a template for the creation of larger models for other smaller languages and that the model introduced in this work can serve both as a focus of research in its own right and as a starting point for further pretraining, finetuning and alignment to create useful models, tools and methods not only for Finnish but also other languages. We release the model weights as well as all relevant documentation and software fully openly at https://huggingface.co/LumiOpen/Poro-34B.
It has been our aim throughout this work to release Poro 34B fully openly, including model weights, pretraining configuration, the pretraining and evaluation data, and all associated scripts and tools. We provide here additional details of these to facilitate accurate reproduction of our work.
The identifiers of the pretraining data sources are provided below in Table 5.
Dataset | Language | Reference |
---|---|---|
SlimPajama | English | https://huggingface.co/datasets/cerebras/SlimPajama-627B |
Starcoder | Code | https://huggingface.co/datasets/bigcode/starcoderdata |
Tatoeba challenge | Eng-Fin | https://huggingface.co/datasets/tatoeba |
Project Gutenberg | English | https://huggingface.co/datasets/allenai/dolma |
Parsebank | Finnish | https://turkunlp.org/finnish_nlp.html |
mC4 | Finnish | https://huggingface.co/datasets/mc4 |
CC-Fi | Finnish | https://github.com/TurkuNLP/CC-Fi |
Fiwiki | Finnish | https://fi.wikipedia.org/wiki |
Lönnrot | Finnish | http://www.lonnrot.net |
Suomi24 | Finnish | http://urn.fi/urn:nbn:fi:lb-2021101527 |
Reddit-Fi | Finnish | https://www.reddit.com/r/Suomi |
STT | Finnish | http://urn.fi/urn:nbn:fi:lb-2019041501 |
Yle | Finnish | http://urn.fi/urn:nbn:fi:lb-2017070501 |
Yle | Finnish | http://urn.fi/urn:nbn:fi:lb-2021050401 |
Yle | Finnish | http://urn.fi/urn:nbn:fi:lb-2019050901 |
Yle | Finnish | http://urn.fi/urn:nbn:fi:lb-2021050701 |
The model and pretraining hyperparameters are detailed in Table 6.
Architecture hyperparameter | Value | Pretraining hyperparameter | Value |
---|---|---|---|
Parameters | 34B | Global Batch Size | 2048 |
Precision | bfloat16 | Learning rate | 1.5e-4 |
Layers | 54 | Total tokens | 1000B |
Hidden dim | 7168 | Warmup tokens | 10B |
Attention heads | 56 | Decay tokens | 1000B |
Vocab size | 131072 | Decay style | cosine |
Sequence length | 2048 | Min. learning rate | 1e-5 |
Activation | GELU | Adam (\(\beta_1\), \(\beta_2\)) | (0.9, 0.95) |
Position embedding | ALiBi | Weight decay | 1e-1 |
Tied embeddings | True | Gradient clipping | 1.0 |
The model weights and a model card providing further details regarding the model and its pretraining are available from https://huggingface.co/LumiOpen/Poro-34B.
The model pretraining was performed using the Megatron-DeepSpeed fork https://github.com/TurkuNLP/Megatron-DeepSpeed.
All scripts and tools introduced specifically for this work are made available under open licenses at https://huggingface.co/LumiOpen/Poro-34B.
We are committed to open science, transparency and accessibility in our work. While we acknowledge the concerns and the potential for negative impacts associated with making powerful generative models and the technology to create them more widely available, we believe that in the case of Poro 34B the positives clearly outweigh the negatives. We discuss some specific concerns and their mitigations in the following.
Poro 34B is a base model trained in substantial part on texts sourced from web crawls, which are known to include biases, toxicity and factual errors. While we have selected curated text sources that have been extensively filtered to remove problematic material, no such filtering is perfect. Like all language models, Poro 34B is a product of its inputs, and its output may reflect issues in its training material. Furthermore, as Poro 34B is a base model that has not been finetuned for any specific purpose, extra care should be taken when interpreting its output, and the model should not be used as is in any application with potential for significant impact on people’s rights or well-being. We emphasize these limitations in the model card published with the model.
Pretraining large language models is computationally intensive, and the creation of large models can have substantial environmental impacts. Poro 34B was trained on the LUMI supercomputer, which is powered entirely by renewable energy resources. According to the official specifications, the carbon intensity factor of LUMI’s operation is considered to be zero. This approach effectively minimizes the carbon footprint associated with the computational aspects of training our model.
Though concerns about the capabilities of frontier models to cause catastrophic harm have been discussed in the literature, a model of Poro 34B’s size and training duration does not represent new frontier capability and releasing the model does not introduce any new classes of risk.
The authors wish to acknowledge CSC – IT Center for Science, Finland, for generous computational resources on the LUMI supercomputer. This project has received funding from the European Union’s Horizon Europe research and innovation programme under Grant agreement No 101070350. The contents of this publication are the sole responsibility of its authors and do not necessarily reflect the opinion of the European Union.
Poro 34B was trained on the LUMI-G GPU partition of the LUMI supercomputer, located in Finland. At the time of writing, LUMI is the fastest supercomputer in Europe and the 5th fastest in the world (https://www.top500.org/). LUMI is also ranked 7th greenest on the Green500 list (https://www.top500.org/lists/green500/).
The LUMI-G partition has 2978 nodes, each with four AMD MI250x GPUs with 128 GB of memory each and a single 64-core CPU. The MI250x is a multi-chip module (MCM) with a dual-GCD (graphics compute die) design, which in practice means that a node has eight logical devices, each with access to 64 GB of high-bandwidth memory.
Each node has four 200 Gbps Slingshot-11 network interconnects, and the nodes are connected in a dragonfly topology. During benchmarking and scale testing, we did not find the network topology to be a limiting factor for the required collective operation sizes. The total per-node bandwidth of 800 Gbps proved to be more than sufficient, and the communication overhead was minimal during training.
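The per-node figures above can be turned into partition-wide totals with simple arithmetic. The following is a minimal sketch that uses only the numbers quoted in the text; the variable names are illustrative, and the conversion to gigabytes per second assumes 8 bits per byte and ignores any protocol overhead.

```python
# Figures quoted in the text for the LUMI-G partition.
NODES = 2978
GPUS_PER_NODE = 4            # AMD MI250x modules per node
GCDS_PER_GPU = 2             # each MI250x exposes two logical devices
MEMORY_PER_GPU_GB = 128
LINKS_PER_NODE = 4           # Slingshot-11 interconnects per node
LINK_BANDWIDTH_GBPS = 200    # gigabits per second per interconnect

# Derived per-node quantities.
logical_devices_per_node = GPUS_PER_NODE * GCDS_PER_GPU
memory_per_device_gb = MEMORY_PER_GPU_GB // GCDS_PER_GPU
node_bandwidth_gbps = LINKS_PER_NODE * LINK_BANDWIDTH_GBPS
# Assumption: raw line rate, 8 bits per byte, no protocol overhead.
node_bandwidth_gigabytes_per_s = node_bandwidth_gbps / 8

# Derived partition-wide quantities.
total_logical_devices = NODES * logical_devices_per_node
total_memory_gb = NODES * GPUS_PER_NODE * MEMORY_PER_GPU_GB

print(f"logical devices per node: {logical_devices_per_node}")
print(f"memory per logical device: {memory_per_device_gb} GB")
print(f"per-node bandwidth: {node_bandwidth_gbps} Gbps")
print(f"logical devices in partition: {total_logical_devices}")
print(f"GPU memory in partition: {total_memory_gb} GB")
```

These totals describe the full partition, not the subset of nodes used for any particular training run.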
Tables 7, 8 and 9 show the detailed benchmark results for Finnish, English, and code, respectively.
Benchmark | Poro 34B | Llama 33B | MPT-30b | Falcon-40b | FinGPT 8B | FinGPT 13B | Starcoder |
---|---|---|---|---|---|---|---|
Analogies | 77.69 | 61.54 | 57.69 | 43.85 | 40.00 | 36.15 | 46.15 |
Arithmetic | 54.28 | 47.74 | 57.25 | 51.06 | 41.96 | 45.23 | 48.41 |
Cause and Effect | 67.97 | 60.78 | 58.82 | 46.41 | 66.01 | 69.28 | 54.90 |
Emotions | 55.00 | 45.00 | 39.37 | 16.88 | 45.62 | 38.75 | 23.13 |
Empirical Judg. | 62.63 | 43.43 | 43.43 | 34.34 | 32.32 | 36.36 | 44.44 |
General Knowl. | 75.71 | 48.57 | 37.14 | 22.86 | 51.43 | 40.00 | 22.86 |
Intent Recogn. | 83.24 | 77.75 | 77.31 | 46.24 | 51.43 | 58.24 | 65.03 |
Misconceptions | 53.73 | 51.49 | 50.00 | 50.00 | 51.45 | 45.52 | 47.01 |
Paraphrase | 58.50 | 53.00 | 52.50 | 54.50 | 49.50 | 45.50 | 47.50 |
Sentence Ambig. | 66.67 | 45.00 | 56.67 | 48.33 | 48.33 | 53.33 | 51.67 |
Similarities Abst. | 73.68 | 52.63 | 55.26 | 53.95 | 68.42 | 69.74 | 50.00 |
Average | 66.28 | 53.36 | 53.22 | 42.58 | 49.69 | 48.92 | 45.55 |
Benchmark | Poro 34B | Llama 33B | MPT-30b | Falcon-40b | FinGPT 8B | FinGPT 13B | Starcoder |
---|---|---|---|---|---|---|---|
ARC-Challenge | 53.16 | 61.61 | 55.80 | 50.51 | 25.34 | 24.31 | 30.29 |
Hellaswag | 77.77 | 84.64 | 82.23 | 77.01 | 42.91 | 46.77 | 47.22 |
MMLU | 46.29 | 58.13 | 47.27 | 46.13 | 23.34 | 23.64 | 32.11 |
TruthfulQA | 41.66 | 42.84 | 38.44 | 41.64 | 43.80 | 44.58 | 40.06 |
Winogrande | 72.77 | 80.27 | 74.82 | 81.53 | 53.19 | 57.53 | 54.85 |
GSM8K | 11.75 | 32.27 | 17.13 | 2.43 | 0.22 | 0.22 | 8.11 |
Average | 50.57 | 59.96 | 52.62 | 49.87 | 31.47 | 32.85 | 35.44 |
Benchmark | Category | Poro 34B | Llama 33B | MPT-30b | Falcon-40b | Starcoder |
---|---|---|---|---|---|---|
HumanEval | Python | 37.20 | 34.15 | 35.37 | 34.15 | 45.12 |
MBPP | Python | 47.40 | 41.20 | 43.00 | 43.00 | 53.00 |
Average | | 41.80 | 37.67 | 39.18 | 38.57 | 49.06 |
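The text does not state how the Average rows are computed. A minimal sketch, assuming they are unweighted means over the listed benchmarks rounded to two decimals, is given below; it checks this assumption only for the Poro 34B column of the English table, and the function name `macro_average` is illustrative.

```python
# Poro 34B scores copied from the English benchmark table above.
poro_english = {
    "ARC-Challenge": 53.16,
    "Hellaswag": 77.77,
    "MMLU": 46.29,
    "TruthfulQA": 41.66,
    "Winogrande": 72.77,
    "GSM8K": 11.75,
}


def macro_average(scores: dict) -> float:
    """Unweighted mean of the benchmark scores, rounded to two decimals."""
    return round(sum(scores.values()) / len(scores), 2)


# Assumption being checked: the reported average for this column is 50.57.
print(macro_average(poro_english))
```

An unweighted mean gives each benchmark equal influence regardless of how many test items it contains.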
Please act as an impartial judge and evaluate the quality of two Finnish texts. Your evaluation should consider these three factors: 1. Vocabulary Usage: Wide range of vocabulary used effectively, including more specialized terms. 2. Grammatical Correctness: Strong grammatical skills; errors are rare and minor. 3. Coherence: The text is well-structured and coherent throughout.
Begin your evaluation by comparing the two texts and provide a short explanation. Avoid any position biases and ensure that the order in which the texts were presented does not influence your decision. Be as objective as possible. You are not allowed to declare a tie. After providing your explanation, output your final verdict by strictly following this format: "[[A]]" if Text A is better, "[[B]]" if Text B is better.
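The prompt above fixes the verdict format, so a judge response can be parsed mechanically. The sketch below shows one way to do this. The function names are hypothetical, and the second function, which accepts a verdict only when it is consistent across both presentation orders, is an assumption about how position bias could be controlled, not a description of the procedure used in this study.

```python
import re
from typing import Optional

# Matches the verdict markers required by the prompt: [[A]] or [[B]].
VERDICT_PATTERN = re.compile(r"\[\[(A|B)\]\]")


def parse_verdict(judge_output: str) -> Optional[str]:
    """Return 'A' or 'B' from the last verdict marker, or None if absent."""
    matches = VERDICT_PATTERN.findall(judge_output)
    return matches[-1] if matches else None


def consistent_winner(original_order: str, swapped_order: str) -> Optional[str]:
    """Combine two judgments of the same pair of texts.

    `original_order` is the judge output with the texts as (A, B);
    `swapped_order` is the output with the same texts presented as (B, A).
    Returns the winner using the original labels, or None when the two
    judgments disagree or a verdict is missing.
    """
    first = parse_verdict(original_order)
    second = parse_verdict(swapped_order)
    if first is None or second is None:
        return None
    # In the swapped pass, label A refers to the text originally labeled B.
    second_in_original_labels = "A" if second == "B" else "B"
    return first if first == second_in_original_labels else None


if __name__ == "__main__":
    print(parse_verdict("Text A uses a wider vocabulary. [[A]]"))
    print(consistent_winner("Explanation. [[A]]", "Explanation. [[B]]"))
```

Taking the last marker in the response guards against an explanation that quotes the format before giving the final verdict.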
https://huggingface.co/datasets/togethercomputer/RedPajama-Data-1T↩︎
We only sample lines with at least 10 alphabetic characters to avoid very short lines.↩︎
We acknowledge that this can be considered limiting by today’s standards, but this limitation can be relieved by methods for extending the context length, for example via linear extrapolation [42] or interpolation [44].↩︎
https://www.lumi-supercomputer.eu/sustainable-future/↩︎
We acknowledge that this assumption can be contested. As [12] note: "LUMI is powered entirely by hydroelectric power and some sources [45] measure the carbon intensity factor of hydroelectric power to be 0.024."↩︎
Also known as Llama 30B due to a typo; https://github.com/meta-llama/llama/issues/49↩︎
https://yle.fi/. Headlines were collected on 11-12 January 2024.↩︎
We did not evaluate closed models such as ChatGPT on Tatoeba because of the cost of API access.↩︎
We attempted to reproduce some of the Flores-101 results reported by [61] and obtained a slightly higher result for GPT-4 in Eng-Fin translation (37.5 instead of 35.33) and slightly lower results for M2M-12B and NLLB-1.3B (31.4 and 26.6, respectively). For the sake of consistency, we present the results from that study without modification.↩︎