Research projects
Use of computational resources in the EuroHPC SuperComputer to scale up the experiments and build very large models for European languages with few resources
(2023 - 2024)
Large language models (LLMs) are at the core of the current AI revolution and have laid the groundwork for tremendous advancements in Natural Language Processing. Building LLMs requires huge resources, both in compute and data, and only a handful of private companies can afford the extreme computational power needed to train them. As a result, LLMs shine in high-resource languages like English, but lag behind in many others, especially those where training resources are scarce, including many regional languages in Europe. There have been several proposals in the literature to adapt pre-trained LLMs to new languages, but all past efforts focus on models of relatively small size. In this project, we propose to use the computational resources of the EuroHPC SuperComputer to scale up the experiments and build very large models for European languages with few resources. By varying the compute and data scale, we will analyze whether the models exhibit emergent capabilities that allow them to be easily adapted to many tasks. The results of the project will help foster NLP applications in these languages and close the existing gap between minority languages and English.
Organization: EuroHPC Joint Undertaking
Main researcher: Aitor Soroa
Participants:
Rodrigo Agerri, Eneko Agirre, Itziar Aldabe, Mikel Artetxe, Gorka Azkune, Iker De la Iglesia, Julen Etxaniz, Aitor Ormazabal, Naiara Perez, German Rigau, Oscar Sainz, Aitor Soroa, Irune Zubiaga