Cross-lingual diachronic distance: application to Portuguese and Spanish

  1. Gamallo Otero, Pablo
  2. Alegría Loinaz, Iñaki
  3. Pichel Campos, José Ramom
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2019

Issue: 63

Pages: 77-84

Type: Article


Abstract

The aim of this work is to establish a corpus-based methodology for automatically measuring the interlinguistic distance between historical periods of two languages using perplexity. The corpus for the two languages was built ad hoc, with spelling as close as possible to the original, and it represents fiction and non-fiction chronologically and in a balanced way. The methodology was applied to two related languages, Portuguese and Spanish, and their diachronic distances were measured both in the original spelling and in an automatically transcribed spelling.
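As a rough illustration of how perplexity can serve as a distance between two text samples, the sketch below trains a character n-gram model on one sample and scores the other. It is not the authors' implementation: the function names (`train_char_model`, `perplexity`, `perplexity_distance`), the trigram order, the add-one smoothing, and the averaging of the two directions are all assumptions made for this example, and the two strings are invented toy inputs rather than corpus data.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return one n-gram per character of `text`, left-padded with spaces."""
    padded = " " * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]


def train_char_model(text, n=3):
    """Count character n-grams and their (n-1)-character contexts."""
    ngram_counts = Counter(char_ngrams(text, n))
    context_counts = Counter()
    for gram, count in ngram_counts.items():
        context_counts[gram[:-1]] += count
    # Alphabet size plus one slot shared by all unseen characters.
    vocab_size = len(set(text)) + 1
    return ngram_counts, context_counts, vocab_size, n


def perplexity(model, text):
    """Perplexity of `text` under `model`, with add-one smoothing (assumed)."""
    ngram_counts, context_counts, vocab_size, n = model
    if not text:
        raise ValueError("text must be non-empty")
    log_prob = 0.0
    for gram in char_ngrams(text, n):
        p = (ngram_counts[gram] + 1) / (context_counts[gram[:-1]] + vocab_size)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(text))


def perplexity_distance(text_a, text_b, n=3):
    """Average of the two cross-perplexities (symmetrisation is an assumption)."""
    pp_ab = perplexity(train_char_model(text_a, n), text_b)
    pp_ba = perplexity(train_char_model(text_b, n), text_a)
    return (pp_ab + pp_ba) / 2


# Invented toy samples, standing in for two period-specific corpora.
portuguese_sample = "o rato roeu a roupa do rei de roma " * 20
spanish_sample = "el perro de san roque no tiene rabo " * 20

print(perplexity_distance(portuguese_sample, portuguese_sample))
print(perplexity_distance(portuguese_sample, spanish_sample))
```

A sample scored against a model trained on itself yields a lower perplexity than a sample from the other language, which is the property that lets perplexity act as a distance. In a diachronic setting, the two inputs would be texts from a given period of each language.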

Funding information

The authors thank the referees for their thoughtful comments and helpful suggestions. We are very grateful to Fernando Venâncio from the University of Amsterdam, and to José António Souto Cabo and Carlos Quiroga from the University of Santiago de Compostela, for their expertise in the history of the Portuguese and Spanish languages. This work has received financial support from the DOMINO project (PGC2018-102041-B-I00, MCIU/AEI/FEDER, UE), the Consellería de Cultura, Educación e Ordenación Universitaria (accreditation 2016-2019, ED431G/08), and the European Regional Development Fund (ERDF).

References

  • Asgari, E. and M. R. K. Mofrad. 2016. Comparing fifty natural languages and twelve genetic languages using word embedding language divergence (WELD) as a quantitative measure of language distance. In Proceedings of the Workshop on Multilingual and Cross-lingual Methods in NLP, pages 65–74, San Diego, California.
  • Bakker, D., A. Muller, V. Velupillai, S. Wichmann, C. H. Brown, P. Brown, D. Egorov, R. Mailhammer, A. Grant, and E. W. Holman. 2009. Adding typology to lexicostatistics: A combined approach to language classification. Linguistic Typology, 13(1):169–181.
  • Barbançon, F., S. Evans, L. Nakhleh, D. Ringe, and T. Warnow. 2013. An experimental study comparing linguistic phylogenetic reconstruction methods. Diachronica, 30:143–170.
  • Biber, D. 1993. Representativeness in corpus design. Literary and linguistic computing, 8(4):243–257.
  • Brown, C. H., E. W. Holman, S. Wichmann, and V. Velupillai. 2008. Automated classification of the world’s languages: a description of the method and preliminary results. Language Typology and Universals, 61(4).
  • Chen, S. F. and J. Goodman. 1996. An empirical study of smoothing techniques for language modeling. In Proceedings of the 34th Annual Meeting on Association for Computational Linguistics, ACL ’96, pages 310–318, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Chiswick, B. and P. Miller. 2004. Linguistic Distance: A Quantitative Measure of the Distance Between English and Other Languages. Discussion papers. IZA.
  • Corredoira, F. V. 1998. A construção da língua portuguesa frente ao castelhano: o galego como exemplo a contrario.
  • Curell, C. 2006. La influencia del francés en el español contemporáneo. In La cultura del otro: español en Francia, francés en España, pages 785–792. Universidad de Sevilla.
  • Degaetano-Ortlieb, S., H. Kermes, A. Khamis, and E. Teich. 2016. An information-theoretic approach to modeling diachronic change in scientific English. Selected Papers from Varieng-From Data to Evidence (d2e).
  • Ellison, T. M. and S. Kirby. 2006. Measuring language divergence by intra-lexical comparison. In Proceedings of the 21st international conference on computational linguistics and 44th annual meeting of the association for computational linguistics, pages 273–280.
  • Galves, C. and P. Faria. 2010. Tycho Brahe parsed corpus of historical Portuguese. URL: http://www.tycho.iel.unicamp.br/~tycho/corpus/en/index.html
  • Gamallo, P., I. Alegria, J. R. Pichel, and M. Agirrezabal. 2016. Comparing two basic methods for discriminating between similar languages and varieties. In Proceedings of the Third Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial3), pages 170–177.
  • Gamallo, P., J. R. Pichel, and I. Alegria. 2017. From language identification to language distance. Physica A: Statistical Mechanics and its Applications, 484:152–162.
  • Gao, Y., W. Liang, Y. Shi, and Q. Huang. 2014. Comparison of directed and weighted co-occurrence networks of six languages. Physica A: Statistical Mechanics and its Applications, 393(C):579–589.
  • González, M. 2015. An analysis of Twitter corpora and the differences between formal and colloquial tweets. In Proceedings of the Tweet Translation Workshop 2015, pages 1–7.
  • Holman, E., S. Wichmann, C. Brown, V. Velupillai, A. Muller, and D. Bakker. 2008. Explorations in automated lexicostatistics. Folia Linguistica, 42(2):331–354.
  • Liu, H. and J. Cong. 2013. Language clustering with word co-occurrence networks based on parallel texts. Chinese Science Bulletin, 58(10):1139–1144.
  • Malmasi, S., M. Zampieri, N. Ljubešić, P. Nakov, A. Ali, and J. Tiedemann. 2016. Discriminating between similar languages and Arabic dialect identification: A report on the third DSL Shared Task. In Proceedings of the 3rd Workshop on Language Technology for Closely Related Languages, Varieties and Dialects (VarDial), pages 1–14, Osaka, Japan.
  • Millar, R. M. and L. Trask. 2015. Trask’s historical linguistics. Routledge.
  • Nakhleh, L., D. A. Ringe, and T. Warnow. 2005. Perfect phylogenetic networks: A new methodology for reconstructing the evolutionary history of natural languages. Language, 81(2):382–420.
  • Nerbonne, J. and W. Heeringa. 1997. Measuring dialect distance phonetically. In Proceedings of the Third Meeting of the ACL Special Interest Group in Computational Phonology, pages 11–18.
  • Petroni, F. and M. Serva. 2010. Measures of lexical distance between languages. Physica A: Statistical Mechanics and its Applications, 389(11):2280–2283.
  • Pichel, J. R., P. Gamallo, and I. Alegria. 2018. Measuring language distance among historical varieties using perplexity. Application to European Portuguese. In Proceedings of the Fifth Workshop on NLP for Similar Languages, Varieties and Dialects (VarDial 2018), pages 145–155.
  • Rama, T., L. Borin, G. Mikros, and J. Macutek. 2015. Comparative evaluation of string similarity measures for automatic language classification.
  • Rama, T. and A. K. Singh. 2009. From bag of languages to family trees from noisy corpus. In Proceedings of the International Conference RANLP-2009, pages 355–359.
  • Rissanen, M. et al. 1993. The Helsinki Corpus of English Texts. Kytö et al., pages 73–81.
  • Satterthwaite-Phillips, D. 2011. Phylogenetic Inference of the Tibeto-Burman Languages Or on the Usefulness of Lexicostatistics (and "megalo"-comparison) for the Subgrouping of Tibeto-Burman. Stanford University.
  • Sennrich, R. 2012. Perplexity minimization for translation model domain adaptation in statistical machine translation. In Proceedings of the 13th Conference of the European Chapter of the Association for Computational Linguistics, EACL ’12, pages 539–549, Stroudsburg, PA, USA. Association for Computational Linguistics.
  • Singh, A. K. and H. Surana. 2007. Can corpus based measures be used for comparative study of languages? In Proceedings of ninth meeting of the ACL special interest group in computational morphology and phonology, pages 40–47. Association for Computational Linguistics.
  • Swadesh, M. 1952. Lexicostatistic dating of prehistoric ethnic contacts. In Proceedings of the American Philosophical Society 96, pages 452–463.
  • Venâncio, F. 2014. O castelhano como vernáculo português. https://pgl.gal/o-castelhano-como-vernaculo-portugues/
  • Xavier, M. F., M. T. Brocardo, and M. Vincente. 1994. Cipm–um corpus informatizado do português medieval. Actas do X Encontro da Associação Portuguesa de Linguística, 2:599–612.
  • Yujian, L. and L. Bo. 2007. A normalized Levenshtein distance metric. IEEE transactions on pattern analysis and machine intelligence, 29(6):1091–1095.
  • Zampieri, M. 2017. Compiling and processing historical and contemporary Portuguese corpora. arXiv preprint arXiv:1710.00803.