Poro 34B and the Blessing of Multilinguality

Luukkonen, Risto; Burdge, Jonathan; Zosa, Elaine; Talman, Aarne; Komulainen, Ville; Hatanpää, Väinö; Sarlin, Peter; Pyysalo, Sampo

arXiv:2404.01856 (cs)

[Submitted on 2 Apr 2024 (v1), last revised 10 Jun 2025 (this version, v3)]

Abstract: The pretraining of state-of-the-art large language models now requires trillions of words of text, which is orders of magnitude more than available for the vast majority of languages. While including text in more than one language is an obvious way to acquire more pretraining data, multilinguality is often seen as a curse, and most model training efforts continue to focus near-exclusively on individual large languages. We believe that multilinguality can be a blessing: when the lack of training data is a constraint for effectively training larger models for a target language, augmenting the dataset with other languages can offer a way to improve over the capabilities of monolingual models for that language. In this study, we introduce Poro 34B, a 34 billion parameter model trained for 1 trillion tokens of Finnish, English, and programming languages, and demonstrate that a multilingual training approach can produce a model that substantially advances over the capabilities of existing models for Finnish and excels in translation, while also achieving competitive performance in its class for English and programming languages. We release the model parameters, scripts, and data under open licenses at this https URL.

Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2404.01856 [cs.CL]
	(or arXiv:2404.01856v3 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2404.01856

Submission history

From: Risto Luukkonen
[v1] Tue, 2 Apr 2024 11:34:12 UTC (331 KB)
[v2] Wed, 24 Apr 2024 12:37:23 UTC (331 KB)
[v3] Tue, 10 Jun 2025 15:06:59 UTC (128 KB)
