Text Analysis

Natural language analysis tools are software modules that perform linguistic analysis on text at different levels. They are essential components of any Natural Language Processing (NLP) software that analyzes text, and text mining software is typically built by combining basic linguistic modules into complex pipelines.
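The idea of combining basic modules into a pipeline can be illustrated with a minimal sketch. It is not the HiTZ implementation: the `Document` class, the `tokenize` and `tag_capitalized` modules, and `run_pipeline` are hypothetical names introduced here, and the "tagger" is a toy stand-in for a real linguistic processor.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Document:
    """Hypothetical container: raw text plus one annotation layer per module."""
    text: str
    layers: Dict[str, List[str]] = field(default_factory=dict)


# A module takes a document, adds its own layer, and returns the document.
Module = Callable[[Document], Document]


def tokenize(doc: Document) -> Document:
    # Toy tokenizer (assumption): split on whitespace only.
    doc.layers["tokens"] = doc.text.split()
    return doc


def tag_capitalized(doc: Document) -> Document:
    # Toy stand-in for a later module: it reads the token layer
    # produced upstream and flags tokens that start with a capital.
    doc.layers["capitalized"] = [
        tok for tok in doc.layers["tokens"] if tok[:1].isupper()
    ]
    return doc


def run_pipeline(text: str, modules: List[Module]) -> Document:
    # Apply the modules in order; each one builds on earlier layers.
    doc = Document(text)
    for module in modules:
        doc = module(doc)
    return doc


doc = run_pipeline("HiTZ builds tools for Basque", [tokenize, tag_capitalized])
print(doc.layers)
```

Each module depends only on the layers written before it, so in this sketch modules can be reordered, swapped, or added per language without changing `run_pipeline`.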

The HiTZ center has a long tradition of building analysis tools for many languages, ranging from basic linguistic processors such as tokenizers, Part-...



Demos

Demo of the English NLP pipeline

Paste in any English text to see which entities, events, and other annotations are added automatically. The result is represented in the NAF format.

Demo of the Spanish NLP pipeline

Paste in any Spanish text to see which entities and other annotations are added automatically. The result is represented in the NAF format.

Eustagger

Basque lemmatizer and morphosyntactic analyzer

Xuxen

Online Basque spelling corrector

Contracts

Projects

Patents

MALTIXA

Resources

Publications

Eneko Agirre

Cross-Lingual Word Embeddings (Book Review) (2020)

Computational Linguistics (https://doi.org/10.1162/COLI_r_00372)

Jose R. Pichel, Pablo Gamallo, Iñaki Alegria, Marco Neves

A Methodology to Measure the Diachronic Language Distance between Three Languages Based on Perplexity (2020)

Journal of Quantitative Linguistics. DOI 10.1080/09296174.2020.1732177

Uxoa Iñurrieta

Identification and translation of verb+noun multiword expressions: a Spanish-Basque study (2020)

Procesamiento del Lenguaje Natural, 64, pp. 123-126.

Pablo Gamallo, José Ramom Pichel, Iñaki Alegria

Measuring Language Distance of Isolated European Languages (2020)

MDPI Information 2020, 11(4), 181 doi:10.3390/info11040181

Kepa Bengoetxea, Itziar Gonzalez-Dios, Amaia Aguirregoitia

AzterTest: Open source linguistic and stylistic analysis tool (2020)

Procesamiento del Lenguaje Natural, 64, 61-68. http://journal.sepln.org/sepln/ojs/ojs/index.php/pln/article/view/6196

Rodrigo Agerri, Iñaki San Vicente, Jon Ander Campos, Ander Barrena, Xabier Saralegi, Aitor Soroa, Eneko Agirre

Give your Text Representation Models some Love: the Case for Basque (2020)

Proceedings of LREC. Also available at arxiv https://arxiv.org/pdf/2004.00033.pdf

Begoña Altuna, María Jesús Aranzabe, Arantza Díaz de Ilarraza

EusTimeML: A mark-up language for temporal information in Basque (2020)

Research in Corpus Linguistics 8: 86-104. ISSN 2243-4712. Asociación Española de Lingüística de Corpus (AELINCO) DOI 10.32714/ricl.08.01.06

Mikel Iruskieta, Arantxa Otegi, Larraitz Uria, Arantza Diaz de Ilarraza, Amaia Artolazabal

Zer i(ra)kas dezakegu geure corpusekin "jolastuz"? (2019)

Traineru bete lagun: Iñaki Gaminde omenduz. UPV/EHU, pp. 35-66.

Y Yaghoobzadeh, K Kann, TJ Hazen, E Agirre, H Schütze

Probing for Semantic Classes: Diagnosing the Meaning Content of Word Embeddings (2019)

Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics.

Aitziber Atutxa, Kepa Bengoetxea, Arantza Diaz de Ilarraza, Mikel Iruskieta

Towards a top-down approach for an automatic discourse analysis for Basque: Segmentation and Central Unit detection tool (2019)

PLoS ONE 14(9): e0221639

José Ramom Pichel, Pablo Gamallo, Iñaki Alegria

Cross-lingual Diachronic Distance: Application to Portuguese and Spanish (2019)

SEPLN, 2019

José Ramom Pichel, Pablo Gamallo, Iñaki Alegria

Measuring diachronic language distance using perplexity. Application to English, Portuguese and Spanish. (2019)

Natural Language Engineering

Ainara Estarrona, Izaskun Etxeberria, Ander Soraluze, Manuel Padilla-Moyano

Spelling Normalisation of Basque Historical Texts (2019)

Procesamiento del Lenguaje Natural, vol. 63, pp. 59-66

Mikel Artetxe, Gorka Labaka, Iñigo Lopez-Gazpio, Eneko Agirre

Uncovering divergent linguistic information in word embeddings with lessons for intrinsic and extrinsic evaluation (2018)

Proceedings of the 22nd Conference on Computational Natural Language Learning (CoNLL 2018), pages 282–291. Brussels, Belgium, October 31 - November 1, 2018. Best paper award

Zuhaitz Beloki, Xabier Artola, Aitor Soroa

A scalable architecture for data-intensive natural language processing (2017)

Natural Language Engineering, 1-23. doi:10.1017/S1351324917000092.

Itziar Aduriz, Iñaki Alegria, Olatz Arregi, Arantza Diaz de Ilarraza, Kepa Sarasola

Hizkuntza-teknologia “Datu Handien” garaian: programa bilatzaileak, itzultzaileak… (2017)

Senez, 48, pp. 191-200. ISSN: 1132-2152. 2017 https://eizie.eus/eu/argitalpenak/senez/20171102/aurkezpena/datuhandiak

Arantxa Otegi, Nerea Ezeiza, Iakes Goenaga, Gorka Labaka

A Modular Chain of NLP Tools for Basque (2016)

Proceedings of the 19th International Conference on Text, Speech and Dialogue, TSD 2016, Brno, Czech Republic, Lecture Notes in Computer Science, vol. 9924, pp. 93-100, Springer. ISBN 978-3-319-45509-9. DOI 10.1007/978-3-319-45510-5_11

Rodrigo Agerri, Xabier Artola, Zuhaitz Beloki, German Rigau, Aitor Soroa

Big data for Natural Language Processing: A streaming approach (2015)

Knowledge-Based Systems. http://dx.doi.org/10.1016/j.knosys.2014.11.007. Vol.79, pages 36-42.

Xabier Artola, Zuhaitz Beloki, Aitor Soroa

A stream computing approach towards scalable NLP (2014)

Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC'14). Reykjavik, Iceland. ISBN: 978-2-9517408-8-4

Rodrigo Agerri, Josu Bermudez, German Rigau

IXA pipeline: Efficient and Ready to Use Multilingual NLP tools. (2014)

LREC 2014: 3823-3828. ISBN 978-2-9517408-8-4

