PoS-tagging the Web in Portuguese. National varieties, text typologies and spelling systems

  1. Marcos García
  2. Pablo Gamallo
  3. Iria Gayo
  4. Miguel A. Pousada Cruz
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2014

Issue: 53

Pages: 95-101

Type: Article

Abstract

The large volume of text produced every day on the Web has made it one of the main sources of linguistic corpora, which are then analyzed with Natural Language Processing techniques. On a global scale, languages such as Portuguese (official in 9 countries) appear on the Web in several varieties, with lexical, morphological and syntactic differences, among others. In addition, a unified spelling system for Portuguese has recently been approved, and its implementation has already started in some countries. However, the transition will take several years, so different varieties and spelling systems coexist. Since PoS-taggers for Portuguese are built for one particular variety, this work analyzes different combinations of training corpora and lexica, with the aim of building a model with high-precision annotation across several varieties and spelling systems of the language. Moreover, this paper presents different dictionaries of the new orthography (Spelling Agreement) as well as a new freely available test corpus containing different varieties and text typologies.

Bibliographic References

  • Aires, Raquel V. Xavier. 2000. Implementação, adaptação, combinação e avaliação de etiquetadores para o Português do Brasil. Master's thesis, Instituto de Ciências Matemáticas, Universidade de São Paulo, São Paulo.
  • Almeida, Gladis Maria de Barcellos, José Pedro Ferreira, Margarita Correia, and Gilvan Müller de Oliveira. 2013. Vocabulário Ortográfico Comum (VOC): constituição de uma base lexical para a língua portuguesa. Estudos Linguísticos, 42(1):204-215.
  • Aluísio, Sandra M., Gisele M. Pinheiro, Marcelo Finger, M. Graças Volpe Nunes, and Stella E. Tagnin. 2003. The Lacio-Web Project: overview and issues in Brazilian Portuguese corpora creation. In Proceedings of Corpus Linguistics, volume 2003, pages 14-21.
  • Bick, Eckhard. 2000. The Parsing System PALAVRAS: Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Ph.D. thesis, University of Aarhus, Denmark.
  • Branco, António and João Silva. 2004. Evaluating Solutions for the Rapid Development of State-of-the-Art POS Taggers for Portuguese. In Maria Teresa Lino, Maria Francisca Xavier, Fátima Ferreira, Rute Costa, and Raquel Silva, editors, Proceedings of the 4th edition of the Language Resources and Evaluation Conference (LREC 2004), pages 507-510, Paris. European Language Resources Association.
  • Brants, Thorsten. 2000. TnT - A Statistical Part-of-Speech Tagger. In Proceedings of the 6th Conference on Applied Natural Language Processing (ANLP 2000). Association for Computational Linguistics.
  • Brill, Eric. 1995. Transformation-based error-driven learning and natural language processing: A case study in part-of-speech tagging. Computational Linguistics, 21(4):543-565.
  • Eleutério, Samuel, Elisabete Ranchhod, Cristina Mota, and Paula Carvalho. 2003. Dicionários Electrónicos do Português. Características e Aplicações. In Actas del VIII Simposio Internacional de Comunicación Social, pages 636-642, Santiago de Cuba.
  • Garcia, Marcos and Pablo Gamallo. 2010. Análise morfossintáctica para português europeu e galego: Problemas, soluções e avaliação. Linguamática, 2(2):59-67.
  • Garcia, Marcos and Pablo Gamallo. 2014. Multilingual corpora with coreferential annotation of person entities. In Proceedings of the 9th edition of the Language Resources and Evaluation Conference (LREC 2014), pages 3229-3233, Reykjavik. European Language Resources Association.
  • Marques, Nuno and Gabriel Lopes. 2001. Tagging with Small Training Corpora. In Proceedings of the International Conference on Intelligent Data Analysis, volume 2189 of Lecture Notes in Artificial Intelligence (LNAI), pages 63-72. Springer-Verlag.
  • Muniz, Marcelo Caetano Martins. 2004. A construção de recursos lingüístico-computacionais para o português do Brasil: o projeto de Unitex-PB. Master's thesis, Instituto de Ciências Matemáticas de São Carlos, Universidade de São Paulo, São Paulo.
  • Padró, Lluís and Evgeny Stanilovsky. 2012. FreeLing 3.0: Towards Wider Multilinguality. In Proceedings of the 8th edition of the Language Resources and Evaluation Conference (LREC 2012), Istanbul, Turkey. European Language Resources Association.
  • Ratnaparkhi, Adwait. 1996. A maximum entropy model for part-of-speech tagging. In Proceedings of the Empirical Methods in Natural Language Processing (EMNLP 1996), volume 1, pages 133-142. Association for Computational Linguistics.
  • Ribeiro, Ricardo, Luís C. Oliveira, and Isabel Trancoso. 2003. Using Morphossyntactic Information in TTS Systems: Comparing Strategies for European Portuguese. In Proceedings of the 6th Workshop on Computational Processing on the Portuguese Language (PROPOR 2003), pages 143-150, Faro. Springer-Verlag.
  • Tufis, Dan and Oliver Mason. 1998. Tagging Romanian texts: a case study for QTAG, a language independent probabilistic tagger. In Proceedings of the 1st edition of the Language Resources and Evaluation Conference (LREC 1998), volume 1, pages 589-596. European Language Resources Association.