Text analysis

Natural language analysis tools are software modules that perform linguistic analysis of text at different levels. These tools are essential components of any natural language processing (NLP) software that analyzes text, and text mining software is generally built by combining basic linguistic modules into complex pipelines.

The HiTZ center has a long tradition of building analysis tools...
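The idea of combining basic linguistic modules into a pipeline can be illustrated with a minimal sketch. Everything in it is hypothetical and chosen for illustration only: the `Document` class, the `tokenize` and `tag_capitalized` modules, and the `run_pipeline` helper are not part of any HiTZ tool, and the "entity" step is a toy heuristic rather than a real recognizer. The sketch only shows how each module reads the annotation layers produced by earlier modules and adds its own.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Raw text plus the annotation layers added by each module (hypothetical)."""
    text: str
    layers: Dict[str, list] = field(default_factory=dict)


# A module takes a document, adds one annotation layer, and returns it.
Module = Callable[[Document], Document]


def tokenize(doc: Document) -> Document:
    # Toy tokenizer: split on whitespace.
    doc.layers["tokens"] = doc.text.split()
    return doc


def tag_capitalized(doc: Document) -> Document:
    # Toy "entity" spotter: relies on the tokens layer produced earlier
    # and keeps capitalized tokens that are not sentence-initial.
    tokens = doc.layers["tokens"]
    doc.layers["entities"] = [t for t in tokens[1:] if t[:1].isupper()]
    return doc


def run_pipeline(text: str, modules: List[Module]) -> Document:
    # Apply the modules in order; each one sees the layers added so far.
    doc = Document(text)
    for module in modules:
        doc = module(doc)
    return doc


doc = run_pipeline(
    "The HiTZ center builds tools for Basque",
    [tokenize, tag_capitalized],
)
print(doc.layers)
```

Because each module only depends on the layers it reads, modules can be swapped or reordered as long as their inputs are available, which is the property that makes this kind of composition practical.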


Demos

Demo of the English NLP pipeline

Just copy in any English text and see what entities and events and other annotations are added automatically. The result is represented in the NAF format.

Demo of the Spanish NLP pipeline

Just copy in any Spanish text and see what entities and other annotations are added automatically. The result is represented in the NAF format.

Eustagger

Basque lemmatizer and morphosyntactic analyzer

Xuxen

Online Basque spelling corrector

Contracts

Projects

Patents

MALTIXA

Resources

Publications

Olia Toporkov, Rodrigo Agerri

On the Role of Morphological Information for Contextual Lemmatization (2024)

Computational Linguistics (MIT Press).

Oscar Sainz, Iker García-Ferrero, Rodrigo Agerri, Oier Lopez de Lacalle, German Rigau, Eneko Agirre

GoLLIE: Annotation Guidelines improve Zero-Shot Information-Extraction (2024)

The Twelfth International Conference on Learning Representations

Margot Madina, Itziar Gonzalez-Dios, Melanie Siegel

A Preliminary Study of ChatGPT for Spanish E2R Text Adaptation (2024)

Madina, M., Gonzalez-Dios, I., & Siegel, M. (2024, May). A Preliminary Study of ChatGPT for Spanish E2R Text Adaptation. In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024) (pp. 1422-1434).

Margot Madina, Itziar Gonzalez-Dios, and Melanie Siegel.

LanguageTool as a CAT tool for Easy-to-Read in Spanish (2024)

Madina, M., Gonzalez-Dios, I., & Siegel, M. (2024, May). LanguageTool as a CAT tool for Easy-to-Read in Spanish. In Proceedings of the 3rd Workshop on Tools and Resources for People with REAding DIfficulties (READI)@ LREC-COLING 2024 (pp. 93-101).

Margot Madina, Itziar Gonzalez-Dios, Melanie Siegel

Towards Reliable E2R Texts: A Proposal for Standardized Evaluation Practices (2024)

Madina, M., Gonzalez-Dios, I., & Siegel, M. (2024, July). Towards reliable E2R texts: a proposal for standardized evaluation practices. In International Conference on Computers Helping People with Special Needs (pp. 224-231). Cham: Springer Nature Switzerland.

Iakes Goenaga, Aitziber Atutxa, Koldo Gojenola, Maite Oronoz, Rodrigo Agerri

Explanatory argument extraction of correct answers in resident medical exams (2024)

Artificial Intelligence in Medicine, Volume 157, 2024, 102985

Oscar Sainz, Oier Lopez de Lacalle, Eneko Agirre, German Rigau

What do Language Models know about word senses? Zero-Shot WSD with Language Models and Domain Inventories (2023)

In Proceedings of the 12th Global Wordnet Conference, pages 331–342, University of the Basque Country, Donostia - San Sebastian, Basque Country. Global Wordnet Association.

Ainara Estarrona, Izaskun Etxeberria, Manuel Padilla-Moyano, Ander Soraluze

Measuring language distance for historical texts in Basque (2023)

Procesamiento del Lenguaje Natural, Revista no 70, marzo del 2023, pp. 53-61

Nayla Escribano, German Rigau, Rodrigo Agerri

A modular approach for multilingual timex detection and normalization using deep learning and grammar-based methods (2023)

Nayla Escribano, German Rigau, Rodrigo Agerri, A modular approach for multilingual timex detection and normalization using deep learning and grammar-based methods, Knowledge-Based Systems, Volume 273, 2023, 110612, ISSN 0950-7051, https://doi.org/10.1016/j.knosys.2023.110612. (https://www.sciencedirect.com/science/article/pii/S0950705123003623)

Abstract: Detecting and normalizing temporal expressions is an essential step for many NLP tasks. While a variety of methods have been proposed for detection, best normalization approaches rely on hand-crafted rules. Furthermore, most of them have been designed only for English. In this paper we present a modular multilingual temporal processing system combining a fine-tuned Masked Language Model for detection, and a grammar-based normalizer. We experiment in Spanish and English and compare with HeidelTime, the state-of-the-art in multilingual temporal processing. We obtain best results in gold timex normalization, timex detection and type recognition, and competitive performance in the combined TempEval-3 relaxed value metric. A detailed error analysis shows that detecting only those timexes for which it is feasible to provide a normalization is highly beneficial in this last metric. This raises the question of which is the best strategy for timex processing, namely, leaving undetected those timexes for which it is not easy to provide normalization rules or aiming for high coverage.

Keywords: Temporal processing; Multilingualism; Sequence labeling; Grammar-based approaches; Deep learning; Natural language processing

Itziar Aduriz, Manex Agirrezabal, Eneko Agirre, Iñaki Alegria, Xabier Arregi, Jose Mari Arriola, Xabier Artola, Arantza Díaz de Ilarraza, Ainara Estarrona, Izaskun Etxeberria, Nerea Ezeiza, Kepa Sarasola

Morfologia Konputazionala Euskaraz, 35 urte (2023)

Lindemann, D. (arg.). Miren Azkarateri esker onez, 15-30. UPV/EHU Argitalpen zerbitzua. Bilbo.

Rodrigo Agerri, Eneko Agirre

Lessons learned from the evaluation of Spanish Language Models (2023)

Procesamiento del Lenguaje Natural (70), pp 157-170

Gorka Urbizu, Iñaki San Vicente, Xabier Saralegi, Rodrigo Agerri, Aitor Soroa

Scaling Laws for BERT in Low-Resource Settings (2023)

Findings of the Association for Computational Linguistics: ACL 2023

Masson, M., Roose, P., Sallaberry, C., Agerri, R., Bessagnet, MN., Lacayrelle, A.L.P

APs: A Proxemic Framework for Social Media Interactions Modeling and Analysis (2023)

In: Crémilleux, B., Hess, S., Nijssen, S. (eds) Advances in Intelligent Data Analysis XXI. IDA 2023. Lecture Notes in Computer Science, vol 13876. Springer, Cham.

Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, Ilana Heintz, Dan Roth

Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey (2023)

ACM Computing Surveys. 27 June 2023

Iker De la Iglesia, María Vivó, Paula Chocrón, Gabriel de Maeztu, Koldo Gojenola, Aitziber Atutxa

Overview of ClinAIS at IberLEF 2023: Automatic Identification of Sections in Clinical Documents in Spanish (2023)

Procesamiento del Lenguaje Natural, Revista nº 71, septiembre de 2023

Margot Madina, Itziar Gonzalez-Dios, Melanie Siegel

Easy-to-Read Language Resources and Tools for three European Languages (2023)

Madina, M., Gonzalez-Dios, I., & Siegel, M. (2023, July). Easy-to-Read Language Resources and Tools for three European Languages. In Proceedings of the 16th International Conference on PErvasive Technologies Related to Assistive Environments (pp. 693-699).

Margot Madina, Itziar Gonzalez-Dios, Melanie Siegel

Easy-to-Read Language: baliabide linguistikoen eta testuen egokitzapena eta tresna automatikoen garapena (2023)

Margot Madina, Itziar Gonzalez-Dios, Melanie Siegel (2023) Easy-to-Read Language: baliabide linguistikoen eta testuen egokitzapena eta tresna automatikoen garapena. V. IKERGAZTE NAZIOARTEKO IKERKETA EUSKARAZ Kongresuko artikulu-bilduma: Giza Zientziak eta Artea, 35-42.

Margot Madina, Itziar Gonzalez-Dios and Melanie Siegel

Easy-to-Read in Germany: a Survey on its Current State and Available Resources (2023)

Margot Madina, Itziar Gonzalez-Dios and Melanie Siegel (2023) Easy-to-Read in Germany: a Survey on its Current State and Available Resources. To appear in proceedings of 10th Language & Technology Conference

Jeremy Barnes

Sentiment and Emotion Classification in Low-resource Settings (2023)

Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

Irene Baucells de la Peña, Blanca Calvo Figueras, Marta Villegas, Oier Lopez de Lacalle

Entailment-based Task Transfer for Catalan Text Classification in Small Data Regimes (2023)

Procesamiento del Lenguaje Natural. v. 71, p. 165-177, sep. 2023

Iker García, Rodrigo Agerri, German Rigau

T-Projection: High Quality Annotation Projection for Sequence Labeling Tasks (2023)

Findings of the Association for Computational Linguistics: EMNLP 2023

Aitor Ormazabal, Mikel Artetxe, Aitor Soroa

CombLM: Adapting Black-Box Language Models through Small Fine-Tuned Models (2023)

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Javier Álvez, Itziar Gonzalez-Dios, German Rigau

Towards Effective Correction Methods Using WordNet Meronymy Relations (2023)

Álvez, J., Gonzalez-Dios, I., & Rigau, G. (2023, January). Towards Effective Correction Methods Using WordNet Meronymy Relations. In Proceedings of the 12th Global Wordnet Conference (pp. 31-40).

Roberto Centeno, Rodrigo Agerri

Overview of NLP-MisInfo 2023: Workshop on NLP applied to Misinformation (2023)

Roberto Centeno and Rodrigo Agerri (2023). Overview of NLP-MisInfo 2023: Workshop on NLP applied to Misinformation. In Proceedings of the Workshop on NLP applied to Misinformation, co-located with the 39th International Conference of the Spanish Society for Natural Language Processing (SEPLN 2023).

Rodrigo Agerri, Iñigo Alonso, Aitziber Atutxa, Ander Berrondo, Ainara Estarrona, Iker Garcia-Ferrero, Iakes Goenaga, Koldo Gojenola, Maite Oronoz, Igor Perez-Tejedor, German Rigau and Anar Yeginbergenova

HiTZ@Antidote: Argumentation-driven Explainable Artificial Intelligence for Digital Medicine (2023)

Rodrigo Agerri, Iñigo Alonso, Aitziber Atutxa, Ander Berrondo, Ainara Estarrona, Iker Garcia-Ferrero, Iakes Goenaga, Koldo Gojenola, Maite Oronoz, Igor Perez-Tejedor, German Rigau and Anar Yeginbergenova (2023). HiTZ@Antidote: Argumentation-driven Explainable Artificial Intelligence for Digital Medicine. In SEPLN 2023: 39th International Conference of the Spanish Society for Natural Language Processing.

Joseba Fernandez de Landa, Rodrigo Agerri

HiTZ-IXA at PoliticES 2023: Document and Sentence Level Text Representations for Demographic Characteristics and Political Ideology Detection. (2023)

Joseba Fernandez de Landa, Rodrigo Agerri (2023). HiTZ-IXA at PoliticES 2023: Document and Sentence Level Text Representations for Demographic Characteristics and Political Ideology Detection. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2023) co-located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2023), Jaén, Spain, September 2023.

Blanca Calvo Figueras, Irene Baucells, Tommaso Caselli

Dynamic Stance: Modeling Discussions by Labeling the Interactions (2023)

Findings of the Association for Computational Linguistics: EMNLP 2023

David Samuel, Jeremy Barnes, Robin Kurtz, Stephan Oepen, Lilja Øvrelid, and Erik Velldal

Direct Parsing to Sentiment Graphs (2022)

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages: 470–478

Nora Hollenstein, Itziar Gonzalez-Dios, Lisa Beinborn, and Lena Jäger

Patterns of text readability in human and predicted eye movements (2022)

Nora Hollenstein, Itziar Gonzalez-Dios, Lisa Beinborn, and Lena Jäger. 2022. Patterns of Text Readability in Human and Predicted Eye Movements. In Proceedings of the Workshop on Cognitive Aspects of the Lexicon, pages 1–15, Taipei, Taiwan. Association for Computational Linguistics.

Itziar Glez Dios, Aitor Soroa, Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, Jörg Frohberg, Mario Šaško, Quentin Lhoest, Angelina McMillan-Major, Gérard Dupont, Stella Biderman, Anna Rogers, Loubna Ben Allal, Francesco de Toni, Giada Pistilli, Olivier Nguyen, Somaieh Nikpoor, Maraim Masoud, Pierre Colombo, Javier de la Rosa, Paulo Villegas, Tristan Thrush, et al.

The BigScience ROOTS Corpus: A 1.6 TB Composite Multilingual Dataset (2022)

2022. Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track

Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Soroa, A., Gonzalez-Dios, I., ... & Manica, M.

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (2022)

Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., ... & Manica, M. (2022). BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv preprint arXiv:2211.05100.

Oscar Cumbicus-Pineda, Iker Gutiérrez-Fandiño, Itziar Gonzalez-Dios, Aitor Soroa

Noisy Channel for Automatic Text Simplification (2022)

Cumbicus-Pineda, O. M., Gutiérrez-Fandiño, I., Gonzalez-Dios, I., & Soroa, A. (2022). Noisy Channel for Automatic Text Simplification. arXiv preprint arXiv:2211.03152.

Mikel Artetxe, Itziar Aldabe, Rodrigo Agerri, Olatz Perez-de-Viñaspre, Aitor Soroa

Does Corpus Quality Really Matter for Low-Resource Languages? (2022)

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 7383–7390.

Iker Garcia-Ferrero, Rodrigo Agerri, German Rigau

Model and Data Transfer for Cross-Lingual Sequence Labelling in Zero-Resource Settings (2022)

Findings of the Association for Computational Linguistics: EMNLP 2022

Maxime Masson, Christian Sallaberry, Rodrigo Agerri, Marie-Noelle Bessagnet, Philippe Roose, Annig Le Parc Lacayrelle

A Domain-Independent Method for Thematic Dataset Building from Social Media: The Case of Tourism on Twitter (2022)

In: Chbeir, R., Huang, H., Silvestri, F., Manolopoulos, Y., Zhang, Y. (eds) Web Information Systems Engineering – WISE 2022. WISE 2022. Lecture Notes in Computer Science, vol 13724. Springer, Cham.

Gorka Urbizu, Iñaki San Vicente, Xabier Saralegi, Rodrigo Agerri and Aitor Soroa

BasqueGLUE: A Natural Language Understanding Benchmark for Basque (2022)

LREC 2022

Jeremy Barnes, Laura Oberlaender, Enrica Troiano, Andrey Kutuzov, Jan Buchmann, Rodrigo Agerri, Lilja Øvrelid, Erik Velldal

SemEval 2022 Task 10: Structured Sentiment Analysis (2022)

In SemEval 2022

Blanca Calvo Figueras, Montse Cuadros, Rodrigo Agerri

A Semantics-Aware Approach to Automated Claim Verification (2022)

In Proceedings of the Fifth Fact Extraction and VERification Workshop (FEVER), pages 37–48, Dublin, Ireland. Association for Computational Linguistics

Nayla Escribano, Jon Ander González, Julen Orbegozo-Terradillos, Ainara Larrondo-Ureta, Simón Peña-Fernández, Olatz Pérez-de-Viñaspre, Rodrigo Agerri

Euskararen erabilera Eusko Legebiltzarreko debateetan (2012-2020) (2022)

Nayla Escribano, Jon Ander González, Julen Orbegozo-Terradillos, Ainara Larrondo-Ureta, Simón Peña-Fernández, Olatz Pérez-de-Viñaspre, Rodrigo Agerri (2022). Euskararen erabilera Eusko Legebiltzarreko debateetan (2012-2020). In Mediatika, 19, 163-178.

Amaia Aguirregoitia Martinez, Kepa Bengoetxea Kortazar, Itziar Gonzalez-Dios

Journal of Immersion and Content-Based Language Education, Volume 9, Issue 1, May 2021, p. 4 - 30

Ionut-Teodor Sorodoc, Madhumita Sushil, Ece Takmaz, Eneko Agirre (Editors)

Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop (2021)

In conjunction with EACL. Association for Computational Linguistics

Elena Zotova, Rodrigo Agerri, German Rigau

Semi-automatic generation of multilingual datasets for stance detection in Twitter (2021)

Expert Systems with Applications, 170 (2021).

Joseba Fernandez de Landa, Iker García, Ander Salaberria, Jon Ander Campos

Twitterreko Euskal Komunitatearen Eduki Azterketa Pandemia Garaian (2021)

IV. Ikergazte. Nazioarteko ikerketa euskaraz. Kongresuko artikulu bilduma. Ingeniaritza eta Arkitektura

Oscar Sainz, German Rigau

Ask2Transformers: Zero-Shot Domain labelling with Pretrained Language Models (2021)

Proceedings of the 11th Global WordNet Conference pages 44–52, University of South Africa (UNISA). Global Wordnet Association.

Iakes Goenaga, Xabier Lahuerta, Aitziber Atutxa, Koldo Gojenola

A Section Identification Tool: towards HL7 CDA/CCR Standardization in Spanish Discharge Summaries (2021)

Journal of Biomedical Informatics

Prys Delyth, Sarasola Kepa, Alegria Iñaki, Perez-de-Viñaspre Olatz, Palmer Geraint, Corcoran Padraig, Arman Laura, Knight Dawn, Spasic Irena, Bryn Jones Dewi, Cooper Sarah, Prys Myfyr, Muralidaran Vigneshwaran, O’Hare Keeziah, Prys Gruffudd, Watkins Gareth, Roberts Jonathan C, Butcher Peter W. S., Lew Robert, Rees Geraint, Sharma Nirwan, Frankenberg-Garcia Ana, Farhat Leena Sarah, Teahan William John.

Language and Technology in Wales: Volume I (2021)

Language and Technology in Wales: Volume I. University of Bangor. ISBN: 978-1-84220-189-3

Prys Delyth, Sarasola Kepa, Alegria Iñaki, Perez-de-Viñaspre Olatz, Palmer Geraint, Corcoran Padraig, Arman Laura, Knight Dawn, Spasic Irena, Bryn Jones Dewi, Cooper Sarah, Prys Myfyr, Muralidaran Vigneshwaran, O’Hare Keeziah, Prys Gruffudd, Watkins Gareth, Roberts Jonathan C, Butcher Peter W. S., Lew Robert, Rees Geraint, Sharma Nirwan, Frankenberg-Garcia Ana, Farhat Leena Sarah, Teahan William John.

Iaith a Thechnoleg yng Nghymru: Cyfrol 1 (2021)

Iaith a Thechnoleg yng Nghymru: Cyfrol 1. University of Bangor. ISBN: 978-1-84220-189-6

Oscar Cumbicus, Itziar Gonzalez-Dios, Aitor Soroa

A Syntax-Aware Edit-based System for Text Simplification (2021)

Cumbicus, Oscar, Gonzalez-Dios, Itziar and Soroa, Aitor (2021). A Syntax-Aware Edit-based System for Text Simplification In: Proceedings of Recent Advances in Natural Language Processing, pages 329–339. https://aclanthology.org/2021.ranlp-1.38/

Kepa Bengoetxea and Itziar Gonzalez-Dios

MultiAzterTest@Exist-IberLEF 2021: Linguistically Motivated Sexism Identification (2021)

Kepa Bengoetxea and Itziar Gonzalez-Dios (2021). MultiAzterTest@Exist-IberLEF 2021: Linguistically Motivated Sexism Identification. Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) co-located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2021), pp. 449-457. http://ceur-ws.org/Vol-2943/

Itziar Gonzalez-Dios, Kepa Bengoetxea

MultiAzterTest@VaxxStance-IberLEF 2021: Identifying Stances with Language Models and Linguistic Features (2021)

Itziar Gonzalez-Dios and Kepa Bengoetxea (2021) Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) co-located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2021). pp. 192-201. http://ceur-ws.org/Vol-2943/

Oscar M. Cumbicus-Pineda, Itziar Gonzalez-Dios, Aitor Soroa

Linguistic Capabilities for a Checklist-based evaluation in Automatic Text Simplification (2021)

Oscar M. Cumbicus-Pineda, Itziar Gonzalez-Dios, Aitor Soroa. (2021). Linguistic Capabilities for a Checklist-based evaluation in Automatic Text Simplification. Proceedings of the First Workshop on Current Trends in Text Simplification (CTTS 2021) co-located with the 37th Conference of the Spanish Society for Natural Language Processing (SEPLN2021) Online (initially located in Málaga, Spain), September 21st, 2021. Edited by: Horacio Saggion, Sanja Štajner, Daniel Ferrés, Kim Cheng Sheang, pages 70-83. ISSN 1613-0073

Rodrigo Agerri, Roberto Centeno, María Espinosa, Joseba Fernández de Landa, Álvaro Rodrigo

VaxxStance@IberLEF 2021: Overview of the Task on Going Beyond Text in Cross-Lingual Stance Detection (2021)

Procesamiento del Lenguaje Natural, 67, pp 173-181

Amir Zeldes, Yang Janet Liu, Mikel Iruskieta, Philippe Muller, Chloé Braud, Sonia Badene

Proceedings of the 2nd Shared Task on Discourse Relation Parsing and Treebanking (DISRPT 2021) (2021)

Proceedings of the 2nd Shared Task on Discourse Relation Parsing and Treebanking (DISRPT 2021). URL: https://aclanthology.org/volumes/2021.disrpt-1/

Yi-Ling Chung, Marco Guerini, Rodrigo Agerri

Multilingual Counter Narrative Type Classification (2021)

Proceedings of Argument Mining 2021

Beatriz Pereda-Goikoetxea, María Isabel Elorza-Puyadena, Mikel Lersundi-Ayestaran, Joseba Xabier Huitzi-Egilegor, María José Uranga-Iturrioz, Blanca Marín-Fernández

Emakumeen emozio-zurrunbiloa erditzean (2021)

Ekaia, 2021, 41, 31-48

Bernardo Magnini, Begoña Altuna, Alberto Lavelli, Manuela Speranza, Roberto Zanoli

The E3C Project: European Clinical Case Corpus (2021)

Proceedings of the Annual Conference of the Spanish Association for Natural Language Processing: Projects and Demonstrations (SEPLN-PD 2021). Pages 17-20. ISSN: 1613-0073. URL: http://ceur-ws.org/Vol-2968/paper5.pdf

Eneko Agirre

Cross-Lingual Word Embeddings (Book Review) (2020)

Computational Linguistics 46 (1), 245-248. (https://doi.org/10.1162/COLI_r_00372)

Jose R. Pichel, Pablo Gamallo, Iñaki Alegria, Marco Neves

A Methodology to Measure the Diachronic Language Distance between Three Languages Based on Perplexity (2020)

Journal of Quantitative Linguistics. DOI 10.1080/09296174.2020.1732177

Uxoa Iñurrieta

Identification and translation of verb+noun multiword expressions: a Spanish-Basque study (2020)

Procesamiento del Lenguaje Natural, 64, pp. 123-126.

Pablo Gamallo, José Ramom Pichel and Iñaki Alegria

Measuring Language Distance of Isolated European Languages (2020)

MDPI Information 2020, 11(4), 181 doi:10.3390/info11040181

Kepa Bengoetxea, Itziar Gonzalez-Dios, Amaia Aguirregoitia

AzterTest: Open source linguistic and stylistic analysis tool (2020)

Procesamiento del Lenguaje Natural, 64, 61-68. http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6196

Rodrigo Agerri, Iñaki San Vicente, Jon Ander Campos, Ander Barrena, Xabier Saralegi, Aitor Soroa, Eneko Agirre

Give your Text Representation Models some Love: the Case for Basque (2020)

Proceedings of LREC. Also available at arxiv https://arxiv.org/pdf/2004.00033.pdf

Begoña Altuna, María Jesús Aranzabe, Arantza Díaz de Ilarraza

EusTimeML: A mark-up language for temporal information in Basque (2020)

Research in Corpus Linguistics 8: 86-104. ISSN 2243-4712. Asociación Española de Lingüística de Corpus (AELINCO) DOI 10.32714/ricl.08.01.06

Begoña Altuna

Análisis de estructuras temporales en euskera y creación de un corpus (2020)

Procesamiento del Lenguaje Natural, Revista no 64, marzo de 2020, pp. 131-134 URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6206 ISSN: 1989-7553

Itziar Aduriz, Jose Mari Arriola, Xabier Artola, Zuhaitz Beloki, Nerea Ezeiza, Koldo Gojenola

Morfeus+: Word Parsing in Basque beyond Morphological Segmentation (2020)

WORD STRUCTURE 13.3, 283-315

Elena Zotova, Rodrigo Agerri, Manuel Nuñez and German Rigau

Multilingual Stance Detection in Tweets: The Catalonia Independence Corpus (2020)

Language Resources and Evaluation Conference (LREC 2020)

José Ramom Pichel, Pablo Gamallo, Marco Neves & Iñaki Alegria

Distância diacrónica automática entre variantes diatópicas do português e do espanhol (2020)

Linguamática, Vol. 12 N. 1, 117–126 ISSN: 1647–0818

Ivana Kvapilíková, Mikel Artetxe, Gorka Labaka, Eneko Agirre, Ondřej Bojar

Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining (2020)

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. Pages 255-262

Uxoa Inurrieta, Itziar Aduriz, Arantza Díaz de Ilarraza, Gorka Labaka, Kepa Sarasola

Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification. (2020)

Inurrieta U, Aduriz I, Díaz de Ilarraza A, Labaka G, Sarasola K (2020) Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification. PLoS ONE 15(8): e0237767. https://doi.org/10.1371/journal.pone.0237767

Itziar Aduriz, Jose Mari Arriola

Testu-corpusen informazio morfosintaktikoaren etiketatze automatikoa hizkuntz ezagutzan oinarrituz: zenbait arazo, hainbat erronka (2020)

Fontes Linguae Vasconum 50 urte. Ekarpen berriak euskararen ikerketari / Nuevas aportaciones al estudio de la lengua vasca.

Jose Ramom Pichel Camos

Medidas de distância entre línguas baseadas em corpus (2020)

Nazioarteko tesia. Artikulu bilduma.

Mikel Artetxe, Gorka Labaka, Eneko Agirre

Translation Artifacts in Cross-lingual Transfer Learning (2020)

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). (Pages 7674–7684).

Gorka Urbizu, Ander Soraluze, Olatz Arregi

Sequence to Sequence Coreference Resolution (2020)

Proceedings of the 3rd Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC 2020), pages 39–46, Barcelona, Spain (online), December 12, 2020.

Rodrigo Agerri, German Rigau

Projecting Heterogeneous Annotations for Named Entity Recognition (2020)

In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020). Winner of the CAPITEL@IberLEF task on Spanish NER.

María Espinosa, Rodrigo Agerri, Roberto Centeno, Alvaro Rodrigo

DeepReading@SardiStance:Combining Textual, Social and Emotional Features. (2020)

Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). Winners of the SardiStance@Evalita 2020 shared task.

Jon Alkorta, Koldo Gojenola, Mikel Iruskieta

Sentimenduen tratamendu konputazionalerantz: gramatika maila ezberdinetako sentimendu balentzia aldatzaileen bila (2019)

Olatz Arbelaitz, Urtzi Etxeberria, Ainhoa Latatu, Miren Josu Ormaetxebarria (arg.), III. Ikergazte. Nazioarteko Ikerketa Euskaraz, Giza Zientziak eta Artea (1. liburukia), 39-46. Udako Euskal Unibertsitatea (UEU). Bilbo.

Joseba Fernandez de Landa, Rodrigo Agerri, Iñaki Alegria

Large Scale Linguistic Processing of Tweets to Understand Social Interactions among Speakers of Less Resourced Languages: The Basque Case (2019)

MDPI: Information: Vol. 10, 6. 212. doi: 10.3390/info10060212 https://www.mdpi.com/2078-2489/10/6/212

Y Yaghoobzadeh, K Kann, TJ Hazen, E Agirre, H Schütze

Probing for Semantic Classes: Diagnosing the Meaning Content of Word Embeddings (2019)

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Aitziber Atutxa, Kepa Bengoetxea, Arantza Diaz de Ilarraza, Mikel Iruskieta

Towards a top-down approach for an automatic discourse analysis for Basque: Segmentation and Central Unit detection tool (2019)

PLoS ONE 14(9): e0221639

José Ramom Pichel, Pablo Gamallo, Iñaki Alegria

Cross-lingual Diachronic Distance: Application to Portuguese and Spanish (2019)

Procesamiento del Lenguaje Natural, Revista no 63, septiembre de 2019, pp. 77-84

José Ramom Pichel, Pablo Gamallo, Iñaki Alegria

Measuring diachronic language distance using perplexity. Application to English, Portuguese and Spanish. (2019)

Natural Language Engineering

Ainara Estarrona, Izaskun Etxeberria, Ander Soraluze, Manuel Padilla-Moyano

Spelling Normalisation of Basque Historical Texts (2019)

Procesamiento del Lenguaje Natural, vol. 63, pp. 59-66

Jose Mari Arriola, Izaskun Aldezabal, Ainara Estarrona

A modular grammar-helping tool for Basque: work in progress (2019)

NoDaLiDa2019, Turku, Finland

Mikel Artetxe, Gorka Labaka, Iñigo Lopez-Gazpio, Eneko Agirre

Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation (2018)

Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018), pages 282–291. Brussels, Belgium, October 31 - November 1, 2018. Best paper award

Mikel Iruskieta, Arantxa Otegi, Larraitz Uria, Arantza Diaz de Ilarraza, Amaia Artolazabal

Zer i(ra)kas dezakegu geure corpusekin "jolastuz"? (2018)

Traineru bete lagun: Iñaki Gaminde omenduz. UPV/EHU. 35-66 or.

Zuhaitz Beloki and Xabier Artola and Aitor Soroa

A scalable architecture for data-intensive natural language processing (2017)

Natural Language Engineering, 1-23. doi:10.1017/S1351324917000092.

Maite Taboada, Iria da Cunha, Erick G. Maziero, Paula Cardoso, Juliano D. Antonio, Mikel Iruskieta

Proceedings of the 6th Workshop on Recent Advances in RST and Related Formalisms (2017)

https://aclanthology.org/volumes/W17-36/

Itziar Aduriz, Iñaki Alegria, Olatz Arregi, Arantza Diaz de Ilarraza, Kepa Sarasola

Hizkuntza-teknologia “Datu Handien” garaian: programa bilatzaileak, itzultzaileak… (2017)

Senez, 48, pp. 191-200. ISSN: 1132-2152. 2017 https://eizie.eus/eu/argitalpenak/senez/20171102/aurkezpena/datuhandiak

Arantxa Otegi, Nerea Ezeiza, Iakes Goenaga, Gorka Labaka

A Modular Chain of NLP Tools for Basque (2016)

Proceedings of the 19th International Conference on Text, Speech and Dialogue, TSD 2016, Brno, Czech Republic, Lecture Notes in Computer Science, vol. 9924, pp. 93-100, Springer. ISBN 978-3-319-45509-9. DOI 10.1007/978-3-319-45510-5_11

Uxoa Iñurrieta, Arantza Díaz de Ilarraza, Gorka Labaka, Kepa Sarasola, Itziar Aduriz, John Carroll.

Using Linguistic Data for English and Spanish Verb-Noun Combination Identification (2016)

Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers pages 857–867, Osaka, Japan, December 11-17 2016. ISBN: 978-4-87974-702-0.

Asier Larrinaga Larrazabal

Euskal telebistaren sorrera (2016)

Rodrigo Agerri, Xabier Artola, Zuhaitz Beloki, German Rigau, Aitor Soroa

Big data for Natural Language Processing: A streaming approach (2015)

Knowledge-Based Systems. http://dx.doi.org/10.1016/j.knosys.2014.11.007. Vol.79, pages 36-42.

Xabier Artola, Zuhaitz Beloki, Aitor Soroa

A stream computing approach towards scalable NLP (2014)

Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Reykjavik, Iceland. ISBN: 978-2-9517408-8-4

Rodrigo Agerri, Josu Bermudez, German Rigau

IXA pipeline: Efficient and Ready to Use Multilingual NLP tools. (2014)

LREC 2014: 3823-3828. ISBN 978-2-9517408-8-4

Iakes Goenaga, Koldo Gojenola, Nerea Ezeiza

Exploiting the Contribution of Morphological Information to Parsing: the BASQUE TEAM system in the SPRML’2013 Shared Task (2013)

Workshop on Statistical Parsing of Morphologically Rich Languages. Pages 71-77. SPRML’2013 Shared Task, Seattle, EMNLP Workshop. https://aclanthology.org/W13-4908.pdf. ISBN 978-1-937284-97-8

All HiTZ publications


Itziar Aduriz, Manex Agirrezabal, Eneko Agirre, Iñaki Alegria, Xabier Arregi, Jose Mari Arriola Xabier Artola, Arantza Díaz de Ilarraza, Ainara Estarrona, Izaskun Etxeberria, Nerea Ezeiza, Kepa Sarazola

Mofologia Konputazionala Euskaraz, 35 urte (2023)

Lindemann, D. (arg.). Miren Azkarateri esker onez, 15-30. UPV/EHU Argitalpen zerbitzua. Bilbo.

Rodrigo Agerri, Eneko Agirre

Lessons learned from the evaluation of Spanish Language Models (2023)

Procesamiento del Lenguaje Natural (70), pp 157-170

Gorka Urbizu, Iñaki San Vicente, Xabier Saralegi, Rodrigo Agerri, Aitor Soroa

Scaling Laws for BERT in Low-Resource Settings (2023)

Findings of the Association for Computational Linguistics: ACL 2023

Masson, M., Roose, P., Sallaberry, C., Agerri, R., Bessagnet, MN., Lacayrelle, A.L.P

APs: A Proxemic Framework for Social Media Interactions Modeling and Analysis (2023)

In: Crémilleux, B., Hess, S., Nijssen, S. (eds) Advances in Intelligent Data Analysis XXI. IDA 2023. Lecture Notes in Computer Science, vol 13876. Springer, Cham.

Bonan Min, Hayley Ross, Elior Sulem, Amir Pouran Ben Veyseh, Thien Huu Nguyen, Oscar Sainz, Eneko Agirre, Ilana Heintz, Dan Roth

Recent Advances in Natural Language Processing via Large Pre-Trained Language Models: A Survey (2023)

ACM Computing Surveys. 27 June 2023

Iker De la Iglesia, María Vivó, Paula Chocrón, Gabriel de Maeztu, Koldo Gojenola, Aitziber Atutxa

Overview of ClinAIS at IberLEF 2023: Automatic Identification of Sections in Clinical Documents in Spanish (2023)

Procesamiento del Lenguaje Natural, Revista nº 71, septiembre de 2023

Margot Madina, Itziar Gonzalez-Dios, Melanie Siegel

Easy-to-Read Language Resources and Tools for three European Languages (2023)

Madina, M., Gonzalez-Dios, I., & Siegel, M. (2023, July). Easy-to-Read Language Resources and Tools for three European Languages. In Proceedings of the 16th International Conference on PErvasive Technologies Related to Assistive Environments (pp. 693-699).

Margot Madina, Itziar Gonzalez-Dios, Melanie Siegel

Easy-to-Read Language: baliabide linguistikoen eta testuen egokitzapena eta tresna automatikoen garapena (2023)

Margot Madina, Itziar Gonzalez-Dios, Melanie Siegel (2023) Easy-to-Read Language: baliabide linguistikoen eta testuen egokitzapena eta tresna automatikoen garapena. V. IKERGAZTE NAZIOARTEKO IKERKETA EUSKARAZ Kongresuko artikulu-bilduma: Giza Zientziak eta Artea, 35-42.

Margot Madina, Itziar Gonzalez-Dios and Melanie Siegel

Easy-to-Read in Germany: a Survey on its Current State and Available Resources (2023)

Margot Madina, Itziar Gonzalez-Dios and Melanie Siegel (2023) Easy-to-Read in Germany: a Survey on its Current State and Available Resources. To appear in proceedings of 10th Language & Technology Conference

Jeremy Barnes

Sentiment and Emotion Classification in Low-resource Settings (2023)

Proceedings of the 13th Workshop on Computational Approaches to Subjectivity, Sentiment, & Social Media Analysis

Irene Baucells de la Peña, Blanca Calvo Figueras, Marta Villegas, Oier Lopez de Lacalle

Entailment-based Task Transfer for Catalan Text Classification in Small Data Regimes (2023)

Procesamiento del Lenguaje Natural. v. 71, p. 165-177, sep. 2023

Iker García, Rodrigo Agerri, German Rigau

T-Projection: High Quality Annotation Projection for Sequence Labeling Tasks (2023)

Findings of the Association for Computational Linguistics: EMNLP 2023

Aitor Ormazabal, Mikel Artetxe, Aitor Soroa

CombLM: Adapting Black-Box Language Models through Small Fine-Tuned Models (2023)

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing

Javier Álvez, Itziar Gonzalez-Dios, German Rigau

Towards Effective Correction Methods Using WordNet Meronymy Relations (2023)

Álvez, J., Gonzalez-Dios, I., & Rigau, G. (2023, January). Towards Effective Correction Methods Using WordNet Meronymy Relations. In Proceedings of the 12th Global Wordnet Conference (pp. 31-40).

Roberto Centeno, Rodrigo Agerri

Overview of NLP-MisInfo 2023: Workshop on NLP applied to Misinformation (2023)

Roberto Centeno and Rodrigo Agerri (2023). Overview of NLP-MisInfo 2023: Workshop on NLP applied to Misinformation. In Proceedings of the Workshop on NLP applied to Misinformation, co-located with the 39th International Conference of the Spanish Society for Natural Language Processing (SEPLN 2023).

Rodrigo Agerri, Iñigo Alonso, Aitziber Atutxa, Ander Berrondo, Ainara Estarrona, Iker Garcia-Ferrero, Iakes Goenaga, Koldo Gojenola, Maite Oronoz, Igor Perez-Tejedor, German Rigau and Anar Yeginbergenova

HiTZ@Antidote: Argumentation-driven Explainable Artificial Intelligence for Digital Medicine (2023)

Rodrigo Agerri, Iñigo Alonso, Aitziber Atutxa, Ander Berrondo, Ainara Estarrona, Iker Garcia-Ferrero, Iakes Goenaga, Koldo Gojenola, Maite Oronoz, Igor Perez-Tejedor, German Rigau and Anar Yeginbergenova (2023). HiTZ@Antidote: Argumentation-driven Explainable Artificial Intelligence for Digital Medicine. In SEPLN 2023: 39th International Conference of the Spanish Society for Natural Language Processing.

Joseba Fernandez de Landa, Rodrigo Agerri

HiTZ-IXA at PoliticES 2023: Document and Sentence Level Text Representations for Demographic Characteristics and Political Ideology Detection. (2023)

Joseba Fernandez de Landa, Rodrigo Agerri (2023). HiTZ-IXA at PoliticES 2023: Document and Sentence Level Text Representations for Demographic Characteristics and Political Ideology Detection. In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2023) co-located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2023), Jaén, Spain, September 2023.

Blanca Calvo Figueras, Irene Bausells, Tommaso Caselli

Dynamic Stance: Modeling Discussions by Labeling the Interactions (2023)

Findings of the Association for Computational Linguistics: EMNLP 2023

David Samuel, Jeremy Barnes, Robin Kurtz, Stephan Oepen, Lilja Øvrelid, and Erik Velldal

Direct Parsing to Sentiment Graphs (2022)

Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages: 470–478

Nora Hollenstein, Itziar Gonzalez-Dios, Lisa Beinborn, and Lena Jäger

Patterns of text readability in human and predicted eye movements (2022)

Nora Hollenstein, Itziar Gonzalez-Dios, Lisa Beinborn, and Lena Jäger. 2022. Patterns of Text Readability in Human and Predicted Eye Movements. In Proceedings of the Workshop on Cognitive Aspects of the Lexicon, pages 1–15, Taipei, Taiwan. Association for Computational Linguistics.

Itziar Glez Dios, Aitor Soroa, Hugo Laurençon, Lucile Saulnier, Thomas Wang, Christopher Akiki, Albert Villanova del Moral, Teven Le Scao, Leandro Von Werra, Chenghao Mou, Eduardo González Ponferrada, Huu Nguyen, Jörg Frohberg, Mario Šaško, Quentin Lhoest, Angelina McMillan-Major, Gérard Dupont, Stella Biderman, Anna Rogers, Loubna Ben Allal, Francesco de Toni, Giada Pistilli, Olivier Nguyen, Somaieh Nikpoor, Maraim Masoud, Pierre Colombo, Javier de la Rosa, Paulo Villegas, Tristan Thrush, etal.

The BigScience ROOTS Corpus: A 1.6 TB Composite Multilingual Dataset (2022)

2022. Thirty-sixth Conference on Neural Information Processing Systems Datasets and Benchmarks Track

Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., Soroa, A., Gonzalez-Dios, I,... & Manica, M.

BLOOM: A 176B-Parameter Open-Access Multilingual Language Model (2022)

Scao, T. L., Fan, A., Akiki, C., Pavlick, E., Ilić, S., Hesslow, D., ... & Manica, M. (2022). BLOOM: A 176B-Parameter Open-Access Multilingual Language Model. arXiv preprint arXiv:2211.05100.

Oscar Cumbicus-Pineda, Iker Gutiérrez-Fandiño, Itziar Gonzalez-Dios, Aitor Soroa

Noisy Channel for Automatic Text Simplification (2022)

Cumbicus-Pineda, O. M., Gutiérrez-Fandiño, I., Gonzalez-Dios, I., & Soroa, A. (2022). Noisy Channel for Automatic Text Simplification. arXiv preprint arXiv:2211.03152.

Mikel Artetxe, Itziar Aldabe, Rodrigo Agerri, Olatz Perez-de-Viñaspre, Aitor Soroa

Does Corpus Quality Really Matter for Low-Resource Languages? (2022)

Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 7383–7390.

Iker Garcia-Ferrero, Rodrigo Agerri, German Rigau

Model and Data Transfer for Cross-Lingual Sequence Labelling in Zero-Resource Settings (2022)

Findings of the Association for Computational Linguistics: EMNLP 2022

Maxime Masson, Christian Sallaberry, Rodrigo Agerri, Marie-Noelle Bessagnet, Philippe Roose, Annig Le Parc Lacayrelle

A Domain-Independent Method for Thematic Dataset Building from Social Media: The Case of Tourism on Twitter (2022)

In: Chbeir, R., Huang, H., Silvestri, F., Manolopoulos, Y., Zhang, Y. (eds) Web Information Systems Engineering – WISE 2022. WISE 2022. Lecture Notes in Computer Science, vol 13724. Springer, Cham.

Gorka Urbizu, Iñaki San Vicente, Xabier Saralegi, Rodrigo Agerri and Aitor Soroa

BasqueGLUE: A Natural Language Understanding Benchmark for Basque (2022)

LREC 2022

Jeremy Barnes, Laura Oberlaender, Enrica Troiano, Andrey Kutuzov, Jan Buchmann, Rodrigo Agerri, Lilja Øvrelid, Erik Velldal

SemEval 2022 Task 10: Structured Sentiment Analysis (2022)

In SemEval 2022

Blanca Calvo Figueras, Montse Cuadros, Rodrigo Agerri

A Semantics-Aware Approach to Automated Claim Verification (2022)

In Proceedings of the Fifth Fact Extraction and VERification Workshop (FEVER), pages 37–48, Dublin, Ireland. Association for Computational Linguistics

Nayla Escribano, Jon Ander González, Julen Orbegozo-Terradillos, Ainara Larrondo-Ureta, Simón Peña-Fernández, Olatz Pérez-de-Viñaspre, Rodrigo Agerri

Euskararen erabilera Eusko Legebiltzarreko debateetan (2012-2020) (2022)

Nayla Escribano, Jon Ander González, Julen Orbegozo-Terradillos, Ainara Larrondo-Ureta, Simón Peña-Fernández, Olatz Pérez-de-Viñaspre, Rodrigo Agerri (2022). Euskararen erabilera Eusko Legebiltzarreko debateetan (2012-2020). In Mediatika, 19, 163-178.

Amaia Aguirregoitia Martinez, Kepa Bengoetxea Kortazar, Itziar Gonzalez-Dios

Journal of Immersion and Content-Based Language Education, Volume 9, Issue 1, May 2021, p. 4 - 30

Ionut-Teodor Sorodoc, Madhumita Sushil, Ece Takmaz, Eneko Agirre (Editors)

Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Student Research Workshop (2021)

In conjunction with EACL. Association for Computational Linguistics

Elena Zotova, Rodrigo Agerri, German Rigau

Semi-automatic generation of multilingual datasets for stance detection in Twitter (2021)

Expert Systems with Applications, 170 (2021).

Joseba Fernandez de Landa, Iker García, Ander Salaberria, Jon Ander Campos

Twitterreko Euskal Komunitatearen Eduki Azterketa Pandemia Garaian (2021)

IV. Ikergazte. Nazioarteko ikerketa euskaraz. Kongresuko artikulu bilduma. Ingeniaritza eta Arkitektura

Oscar Sainz, German Rigau

Ask2Transformers: Zero-Shot Domain labelling with Pretrained Language Models (2021)

Proceedings of the 11th Global WordNet Conference pages 44–52, University of South Africa (UNISA). Global Wordnet Association.

Iakes Goenaga, Xabier Lahuerta, Aitziber Atutxa, Koldo Gojenola

A Section Identification Tool: towards HL7 CDA/CCR Standardization in Spanish Discharge Summaries (2021)

Journal of Biomedical Informatics

Prys Delyth, Sarasola Kepa, Alegria Iñaki, Perez-de-Viñaspre Olatz, Palmer Geraint, Corcoran Padraig, Arman Laura, Knight Dawn ,Spasic Irena, Bryn Jones Dewi, Cooper Sarah, Prys Myfyr, Muralidaran Vigneshwaran, O’Hare Keeziah, Prys Gruffudd, Watkins Gareth, Roberts Jonathan C, Butcher Peter W. S., Lew Robert, Rees Geraint, Sharma Nirwan, Frankenberg-Garcia Ana, Farhat Leena Sarah, Teahan William John.

Language and Technology in Wales: Volume I (2021)

Language and Technology in Wales: Volume I. University of Bangor. ISBN: 978-1-84220-189-3

Prys Delyth, Sarasola Kepa, Alegria Iñaki, Perez-de-Viñaspre Olatz, Palmer Geraint, Corcoran Padraig, Arman Laura, Knight Dawn ,Spasic Irena, Bryn Jones Dewi, Cooper Sarah, Prys Myfyr, Muralidaran Vigneshwaran, O’Hare Keeziah, Prys Gruffudd, Watkins Gareth, Roberts Jonathan C, Butcher Peter W. S., Lew Robert, Rees Geraint, Sharma Nirwan, Frankenberg-Garcia Ana, Farhat Leena Sarah, Teahan William John.

Iaith a Thechnoleg yng Nghymru: Cyfrol 1 (2021)

Iaith a Thechnoleg yng Nghymru: Cyfrol 1. University of Bangor. ISBN: 978-1-84220-189-6

Oscar Cumbicus, Itziar Gonzalez-Dios, Aitor Soroa

A Syntax-Aware Edit-based System for Text Simplification (2021)

Cumbicus, Oscar, Gonzalez-Dios, Itziar and Soroa, Aitor (2021). A Syntax-Aware Edit-based System for Text Simplification In: Proceedings of Recent Advances in Natural Language Processing, pages 329–339. https://aclanthology.org/2021.ranlp-1.38/

Kepa Bengoetxea and Itziar Gonzalez-Dios

MultiAzterTest@Exist-IberLEF 2021: Linguistically Motivated Sexism Identification (2021)

Kepa Bengoetxea and Itziar Gonzalez-Dios (2021)

MultiAzterTest@Exist-IberLEF
2021: Linguistically Motivated Sexism Identification. Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) co-located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2021) pp. 449-457 http://ceur-ws.org/Vol-2943/

Itziar Gonzalez-Dios, Kepa Bengoetxea

MultiAzterTest@VaxxStance-IberLEF 2021: Identifying Stances with Language Models and Linguistic Features (2021)

Itziar Gonzalez-Dios and Kepa Bengoetxea (2021) Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2021) co-located with the Conference of the Spanish Society for Natural Language Processing (SEPLN 2021). pp. 192-201. http://ceur-ws.org/Vol-2943/

Oscar M. Cumbicus-Pineda, Itziar Gonzalez-Dios, Aitor Soroa

Linguistic Capabilities for a Checklist-based evaluation in Automatic Text Simplification (2021)

Oscar M. Cumbicus-Pineda, Itziar Gonzalez-Dios, Aitor Soroa. (2021). Linguistic Capabilities for a Checklist-based evaluation in Automatic Text Simplification. Proceedings of the First Workshop on Current Trends in Text Simplification (CTTS 2021) co-located with the 37th Conference of the Spanish Society for Natural Language Processing (SEPLN2021) Online (initially located in Málaga, Spain), September 21st, 2021. Edited by: Horacio Saggion, Sanja Štajner, Daniel Ferrés, Kim Cheng Sheang, pages 70-83. ISSN 1613-0073

Rodrigo Agerri, Roberto Centeno, María Espinosa, Joseba Fernández de Landa, Álvaro Rodrigo

VaxxStance@IberLEF 2021: Overview of the Task on Going Beyond Text in Cross-Lingual Stance Detection (2021)

Procesamiento del Lenguaje Natural, 67, pp 173-181

Amir Zeldes, Yang Janet Liu, Mikel Iruskieta, Philippe Muller, Chloé Braud, Sonia Badene

Proceedings of the 2nd Shared Task on Discourse Relation Parsing and Treebanking (DISRPT 2021) (2021)

Proceedings of the 2nd Shared Task on Discourse Relation Parsing and Treebanking (DISRPT 2021). URL: https://aclanthology.org/volumes/2021.disrpt-1/

Yi-Ling Chung, Marco Guerini, Rodrigo Agerri

Multilingual Counter Narrative Type Classification (2021)

Proceedings of Argument Mining 2021

Beatriz Pereda-Goikoetxea, María Isabel Elorza-Puyadena Mikel Lersundi-Ayestaran Joseba Xabier Huitzi-Egilegor María José Uranga-Iturrioz Blanca Marín-Fernández

Emakumeen emozio-zurrunbiloa erditzean (2021)

Ekaia, 2021, 41, 31-48

Bernardo Magnini, Begoña Altuna, Alberto Lavelli, Manuela Speranza, Roberto Zanoli

The E3C Project: European Clinical Case Corpus (2021)

Proceedings of the Annual Conference of the Spanish Association for Natural Language Processing: Projects and Demonstrations (SEPLN-PD 2021). Pages 17-20. ISSN: 1613-0073. URL: http://ceur-ws.org/Vol-2968/paper5.pdf

Eneko Agirre

Cross-Lingual Word Embeddings (Book Review) (2020)

Computational Linguistics 46 (1), 245-248. (https://doi.org/10.1162/COLI_r_00372)

Jose R. Pichel, Pablo Gamallo, Iñaki Alegria, Marco Neves

A Methodology to Measure the Diachronic Language Distance between Three Languages Based on Perplexity (2020)

Journal of Quantitative Linguistics. DOI 10.1080/09296174.2020.1732177

Uxoa Iñurrieta

Identification and translation of verb+noun multiword expressions: a Spanish-Basque study (2020)

Procesamiento del Lenguaje Natural, 64, pp. 123-126.

Pablo Gamallo José Ramom Pichel and Iñaki Alegria

Measuring Language Distance of Isolated European Languages (2020)

MDPI Information 2020, 11(4), 181 doi:10.3390/info11040181

Kepa Bengoetxea, Itziar Gonzalez-Dios, Amaia Aguirregoitia

AzterTest: Open source linguistic and stylistic analysis tool (2020)

Procesamiento del Lenguaje Natural, 64, 61-68. http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6196

Rodrigo Agerri, Iñaki San Vicente, Jon Ander Campos, Ander Barrena, Xabier Saralegi, Aitor Soroa, Eneko Agirre

Give your Text Representation Models some Love: the Case for Basque (2020)

Proceedings of LREC. Also available at arxiv https://arxiv.org/pdf/2004.00033.pdf

Begoña Altuna, María Jesús Aranzabe, Arantza Díaz de Ilarraza

EusTimeML: A mark-up language for temporal information in Basque (2020)

Research in Corpus Linguistics 8: 86-104. ISSN 2243-4712. Asociación Española de Lingüística de Corpus (AELINCO) DOI 10.32714/ricl.08.01.06

Begoña Altuna

Análisis de estructuras temporales en euskera y creación de un corpus (2020)

Procesamiento del Lenguaje Natural, Revista no 64, marzo de 2020, pp. 131-134 URL: http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6206 ISSN: 1989-7553

Itziar Aduriz, Jose Mari Arriola, Xabier Artola, Zuhaitz Beloki, Nerea Ezeiza, Koldo Gojenola

Morfeus+: Word Parsing in Basque beyond Morphological Segmentation (2020)

WORD STRUCTURE 13.3, 283-315

Elena Zotova, Rodrigo Agerri, Manuel Nuñez and German Rigau

Multilingual Stance Detection in Tweets: The Catalonia Independence Corpus (2020)

Language Resources and Evaluation Conference (LREC 2020)

José Ramom Pichel, Pablo Gamallo, Marco Neves & Iñaki Alegria

Distância diacrónica automática entre variantes diatópicas do português e do espanhol (2020)

Linguamática, Vol. 12 N. 1, 117–126 ISSN: 1647–0818

Ivana Kvapilíková, Mikel Artetxe, Gorka Labaka, Eneko Agirre, Ondřej Bojar

Unsupervised Multilingual Sentence Embeddings for Parallel Corpus Mining (2020)

Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop. Pages 255-262

Uxoa Inurrieta, Itziar Aduriz, Arantza Díaz de Ilarraza, Gorka Labaka, Kepa Sarasola

Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification. (2020)

Inurrieta U, Aduriz I, Díaz de Ilarraza A, Labaka G, Sarasola K (2020) Learning about phraseology from corpora: A linguistically motivated approach for Multiword Expression identification. PLoS ONE 15(8): e0237767. https://doi.org/10.1371/journal.pone.0237767

Itziar Aduriz, Jose Mari Arriola

Testu-corpusen informazio morfosintaktikoaren etiketatze automatikoa hizkuntz ezagutzan oinarrituz: zenbait arazo, hainbat erronka (2020)

Fontes Linguae Vasconum 50 urte. Ekarpen berriak euskararen ikerketari / Nuevas aportaciones al estudio de la lengua vasca.

Jose Ramom Pichel Camos

Medidas de distância entre línguas baseadas em corpus (2020)

Nazioarteko tesia. Artikulu bilduma.

Mikel Artetxe, Gorka Labaka, Eneko Agirre

Translation Artifacts in Cross-lingual Transfer Learning (2020)

Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). (Pages 7674–7684).

Gorka Urbizu, Ander Soraluze, Olatz Arregi

Sequence to Sequence Coreference Resolution (2020)

Proceedings of the 3rd Workshop on Computational Models of Reference, Anaphora and Coreference (CRAC 2020), pages 39–46,Barcelona, Spain (online), December 12, 2020.

Rodrigo Agerri, German Rigau

Projecting Heterogeneous Annotations for Named Entity Recognition (2020)

In Proceedings of the Iberian Languages Evaluation Forum (IberLEF 2020). Winner of the

CAPITEL@IberLEF
task on Spanish NER.

María Espinosa, Rodrigo Agerri, Roberto Centeno, Alvaro Rodrigo

DeepReading@SardiStance:Combining Textual, Social and Emotional Features. (2020)

Proceedings of the Seventh Evaluation Campaign of Natural Language Processing and Speech Tools for Italian (EVALITA 2020). Winners of the

SardiStance@Evalita
2020 shared task

Jon Alkorta, Koldo Gojenola, Mikel Iruskieta

Sentimenduen tratamendu konputazionalerantz: gramatika maila ezberdinetako sentimendu balentzia aldatzaileen bila (2019)

Olatz Arbelaitz, Urtzi Etxeberria, Ainhoa Latatu, Miren Josu Ormaetxebarria (arg.), III. Ikergazte. Nazioarteko Ikerketa Euskaraz, Giza Zientziak eta Artea (1. liburukia), 39-46. Udako Euskal Unibertsitatea (UEU). Bilbo.

Joseba Fernandez de Landa, Rodrigo Agerri, Iñaki Alegria

Large Scale Linguistic Processing of Tweets to Understand Social Interactions among Speakers of Less Resourced Languages: The Basque Case (2019)

MDPI: Information: Vol. 10, 6. 212. doi: 10.3390/info10060212 https://www.mdpi.com/2078-2489/10/6/212

Y Yaghoobzadeh, K Kann, TJ Hazen, E Agirre, H Schütze

Probing for Semantic Classes: Diagnosing the Meaning Content of Word Embeddings (2019)

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Aitziber Atutxa, Kepa Bengoetxea, Arantza Diaz de Ilarraza, Mikel Iruskieta

Towards a top-down approach for an automatic discourse analysis for Basque: Segmentation and Central Unit detection tool (2019)

PLoS ONE 14(9): e0221639

José Ramom Pichel, Pablo Gamallo, Iñaki Alegria

Cross-lingual Diachronic Distance: Application to Portuguese and Spanish (2019)

Procesamiento del Lenguaje Natural, Revista no 63, septiembre de 2019, pp. 77-84

José Ramom Pichel, Pablo Gamallo, Iñaki Alegria

Measuring diachronic language distance using perplexity. Application to English, Portuguese and Spanish. (2019)

Natural Language Engeenering

Ainara Estarrona, Izaskun Etxeberria, Ander Soraluze, Manuel Padilla-Moyano

Spelling Normalisation of Basque Historical Texts (2019)

Procesamiento del Lenguaje Natural, vol. 63, pp. 59-66

Jose Mari Arriola, Izaskun Aldezabal, Ainara Estarrona

A modular grammar-helping tool for Basque: work in progress (2019)

NoDaLiDa2019, Turku, Finland

Mikel Artetxe, Gorka Labaka, Iñigo Lopez-Gazpio, Eneko Agirre

Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation (2018)

Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018), pages 282–291. Brussels, Belgium, October 31 - November 1, 2018. Best paper award

Mikel Iruskieta, Arantxa Otegi, Larraitz Uria, Arantza Diaz de Ilarraza, Amaia Artolazabal

Zer i(ra)kas dezakegu geure corpusekin "jolastuz"? (2018)

Traineru bete lagun: Iñaki Gaminde omenduz. UPV/EHU. 35-66 or.

Zuhaitz Beloki and Xabier Artola and Aitor Soroa

A scalable architecture for data-intensive natural language processing (2017)

Natural Language Engineering, 1-23. doi:10.1017/S1351324917000092.

Maite Taboada, Iria da Cunha, Erick G. Maziero, Paula Cardoso, Julinao D. Antonio, Mikel Iruskieta

Proceedings of the 6th Workshop on Recent Advances in RST and Related Formalisms (2017)

https://aclanthology.org/volumes/W17-36/

Itziar Aduriz, Iñaki Alegria, Olatz Arregi, Arantza Diaz de Ilarraza, Kepa Sarasola

Hizkuntza-teknologia “Datu Handien” garaian: programa bilatzaileak, itzultzaileak… (2017)

Senez, 48, pp. 191-200. ISSN: 1132-2152. 2017 https://eizie.eus/eu/argitalpenak/senez/20171102/aurkezpena/datuhandiak

Arantxa Otegi, Nerea Ezeiza, Iakes Goenaga, Gorka Labaka

A Modular Chain of NLP Tools for Basque (2016)

Proceedings of the 19th International Conference on Text, Speech and Dialogue, TSD 2016, Brno, Czech Republic, Lecture Notes in Computer Science, vol. 9924, pp. 93-100, Springer. ISBN 978-3-319-45509-9. DOI 10.1007/978-3-319-45510-5_11

Uxoa Iñurrieta, Arantza Díaz de Ilarraza, Gorka Labaka, Kepa Sarasola, Itziar Aduriz, John Carroll.

Using Linguistic Data for English and Spanish Verb-Noun Combination Identification (2016)

Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: Technical Papers pages 857–867, Osaka, Japan, December 11-17 2016. ISBN: 978-4-87974-702-0.

Asier Larrinaga Larrazabal

Euskal telebistaren sorrera (2016)

-

Rodrigo Agerri, Xabier Artola, Zuhaitz Beloki, German Rigau, Aitor Soroa

Big data for Natural Language Processing: A streaming approach (2015)

Knowledge-Based Systems. http://dx.doi.org/10.1016/j.knosys.2014.11.007. Vol.79, pages 36-42.

Xabier Artola, Zuhaitz Beloki, Aitor Soroa

A stream computing approach towards scalable NLP (2014)

Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Reykjavik, Iceland. ISBN: 978-2-9517408-8-4

Rodrigo Agerri, Josu Bermudez, German Rigau

IXA pipeline: Efficient and Ready to Use Multilingual NLP tools. (2014)

LREC 2014: 3823-3828. ISBN 978-2-9517408-8-4

Iakes Goenaga, Koldo Gojenola, Nerea Ezeiza

Exploiting the Contribution of Morphological Information to Parsing: the BASQUE TEAM system in the SPRML’2013 Shared Task (2013)

Workshop on Statistical Parsing of Morphologically Rich Languages.Pages 71-77. SPRML’2013 Shared Task, Seattle, EMNLP Workshop. https://aclanthology.org/W13-4908.pdf . ISBN 978-1-937284-97-8

All HiTZ publications