Presenting Latxa - the largest language models built for Basque
We are delighted to introduce the Latxa model family, the largest and best-performing LLMs available for Basque. Latxa is a breed of domestic sheep native to the Basque Country, famous for its cheese.
Our Latxa is a family of Large Language Models (LLMs) ranging from 7 to 70 billion parameters, based on Meta's LLaMA 2 models. Current LLMs exhibit incredible performance for high-resource languages such as English, ChatGPT being the most popular example. For Basque and other low-resource languages, however, their performance is close to that of a random guesser, widening the technological gap between high- and low-resource languages when it comes to digital tools. We present Latxa to overcome these limitations and to promote LLM-based research, innovation and products for the Basque language. This work has been partially supported by the Basque Government (IKER-GAITU project).
The Latxa family comprises pre-trained base LLMs, without further fine-tuning on user-oriented instructions or preferences, so they are not intended for direct use by the general public. They are, however, key building blocks for successful NLP tools for Basque. We release them as open models for practitioners who know how to integrate base LLMs into end-user applications, or how to adapt them via fine-tuning. We are already working on instruction-following models, but whether models of GPT-like quality that are usable by the general public can be built for Basque remains an open research question. The models were developed using in-house GPUs, with the final models trained on the Leonardo supercomputer at CINECA under the EuroHPC Joint Undertaking (project EHPC-EXT-2023E01-013).
For the pre-training corpus, we leveraged EusCrawl, a high-quality corpus for Basque comprising 1.72 million documents and 288 million words, totalling 2.1 GiB of uncompressed text. EusCrawl was built using ad-hoc scrapers that extract text from 33 Basque websites with high-quality content, resulting in cleaner text than general-purpose approaches.
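As an illustration of the corpus-level statistics above (a minimal sketch, not the actual EusCrawl pipeline), document, word and byte counts over a scraped document collection could be computed as:

```python
# Illustrative sketch only: how document count, whitespace-token word count,
# and uncompressed size might be tallied for a scraped corpus. The example
# documents are made up for demonstration.

def corpus_stats(documents):
    """Return (num_docs, num_words, num_bytes) for a list of text documents."""
    num_docs = len(documents)
    num_words = sum(len(doc.split()) for doc in documents)
    num_bytes = sum(len(doc.encode("utf-8")) for doc in documents)
    return num_docs, num_words, num_bytes

docs = [
    "Kaixo mundua! Hau adibide bat da.",
    "Latxa euskarazko hizkuntza-eredu handiena da.",
]
print(corpus_stats(docs))  # (2, 11, 78)
```

Real corpus pipelines also deduplicate and filter documents before counting; the sketch above covers only the final tallying step.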
To assess the quality of the models, we thoroughly evaluated them on a suite of diverse and challenging tasks covering a variety of linguistic competences: reading comprehension, common sense reasoning, sentiment analysis, stance detection, topic classification, coreference, inference and word senses (see the model cards on Hugging Face for more details on the evaluation datasets and procedure). The figure below shows the performance of the different models, with the average on the right. We tested the English LLaMA models as well as some of the best language models for Basque to date, allowing for a head-to-head comparison with our models (the three purple bars). The figure clearly shows the superiority of our three models, as well as the improvement in results as model size increases.
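Many such benchmarks are multiple-choice, and base LLMs are commonly scored by comparing the log-likelihood the model assigns to each candidate answer. The following sketch uses mock scores in place of real model log-likelihoods; it illustrates the general scoring scheme, not our exact evaluation harness:

```python
# Illustrative sketch: multiple-choice scoring by log-likelihood, a common
# way to evaluate base LLMs. The scores below are mock values; a real
# harness would query the model for each candidate's log-probability.

def pick_answer(scores):
    """Given {candidate: log_likelihood}, return the highest-scoring candidate."""
    return max(scores, key=scores.get)

def accuracy(examples):
    """examples: list of (scores_dict, gold_answer) pairs."""
    correct = sum(pick_answer(scores) == gold for scores, gold in examples)
    return correct / len(examples)

examples = [
    ({"A": -3.2, "B": -1.1, "C": -4.0}, "B"),  # model prefers the gold answer
    ({"A": -0.9, "B": -2.5, "C": -1.3}, "C"),  # model prefers a wrong answer
]
print(accuracy(examples))  # 0.5
```

Because this scheme needs only likelihoods rather than instruction following, it works for base models like Latxa that have not been fine-tuned on user prompts.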
Latxa models inherit the LLaMA-2 License, which allows both commercial and research use. Although based on an English LLM, these models are intended to be used with Basque text; performance on other languages is not guaranteed.
The models are publicly available on Hugging Face; please refer to the model cards for more technical information and to get started with the models.
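A checkpoint could be loaded with the transformers library roughly as follows. The repository id `HiTZ/latxa-7b-v1` is our assumption of the hub name, so check the model card for the exact identifier; note also that these are base models, so they continue text rather than follow instructions:

```python
# Sketch of loading a Latxa checkpoint with Hugging Face transformers.
# The hub id below is an assumption; consult the model card for the
# exact repository name. Running this downloads several GB of weights.

MODEL_ID = "HiTZ/latxa-7b-v1"  # assumed hub id; see the model card

def generate(prompt, max_new_tokens=50):
    # Imported lazily so the sketch can be read without transformers installed.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(MODEL_ID)
    inputs = tokenizer(prompt, return_tensors="pt")
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

if __name__ == "__main__":
    # Base models continue the prompt; phrase it as text to be completed.
    print(generate("Euskal Herriko hiriburuak hauek dira:"))
```

For end-user applications, such a base model would typically be fine-tuned on instructions first, as discussed above.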
Text authored by: Eneko Agirre, Julen Etxaniz, Oscar Sainz