Research projects
Use of computational resources in the EuroHPC SuperComputer to scale up the experiments and build very large models for European languages with few resources
(2024 - 2025)
Large language models (LLMs) are at the core of the current AI revolution, and have laid the
groundwork for tremendous advancements in Natural Language Processing. Building LLMs requires
huge amounts of data, which are not available for low-resource languages. As a result, LLMs shine in
high-resource languages like English, but lag behind in many others, especially in those where
training resources are scarce, including many regional languages in Europe.
The data scarcity problem is usually alleviated by augmenting the training corpora in the target
language with text from a language with many resources (e.g. English). In this project we propose a
systematic study of different strategies to perform this combination in an optimal way, framing the
existing approaches within a more general curriculum learning paradigm. We will use the
computational resources of EuroHPC to perform a systematic study and scale up experiments to
build LLMs for four European languages with few resources. The results of the project will help
foster NLP applications in these languages and close the existing gap between minority
languages and English.
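As an illustration of the kind of strategy the curriculum learning paradigm covers, the sketch below mixes a high-resource corpus with a target-language corpus under a sampling ratio that is annealed over training. The function names, the linear schedule, and all parameters are illustrative assumptions for this sketch, not the project's actual method.

```python
import random

def mixing_ratio(step, total_steps, start=0.9, end=0.1):
    """Illustrative linear schedule: anneal the share of high-resource
    (e.g. English) text from `start` to `end` over training.
    (Hypothetical example, not the project's actual schedule.)"""
    t = min(step / total_steps, 1.0)
    return start + (end - start) * t

def sample_batch(step, total_steps, hi_res_corpus, target_corpus,
                 batch_size=4, rng=None):
    """Draw a training batch, mixing the two corpora according to the
    current value of the schedule."""
    rng = rng or random.Random(0)
    ratio = mixing_ratio(step, total_steps)
    return [
        rng.choice(hi_res_corpus) if rng.random() < ratio
        else rng.choice(target_corpus)
        for _ in range(batch_size)
    ]

# Early in training, batches are dominated by high-resource text;
# late in training, by target-language text.
early = sample_batch(0, 1000, ["en_doc"], ["eu_doc"])
late = sample_batch(1000, 1000, ["en_doc"], ["eu_doc"])
```

Other curricula studied under the same framing could simply swap in a different schedule (e.g. stepwise or exponential) without changing the sampling loop.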
Organization: EuroHPC Joint Undertaking
Main researcher: Aitor Soroa
Participants:
Rodrigo Agerri, Eneko Agirre, Itziar Aldabe, Mikel Artetxe, Gorka Azkune, Ekhi Azurmendi, Jeremy Barnes, Ander Barrena, Iker De la Iglesia, Julen Etxaniz, Iker García, Imanol Miranda, Paula Ontalvilla, Naiara Perez, German Rigau, Oscar Sainz, Ander Salaberria, Aitor Soroa, Aimar Zabala, Irune Zubiaga