Distancia diacrónica interlingüística: aplicación al portugués y el castellano [Cross-lingual diachronic distance: application to Portuguese and Spanish]

  1. Gamallo Otero, Pablo
  2. Alegría Loinaz, Iñaki
  3. Pichel Campos, José Ramom
Journal:
Procesamiento del lenguaje natural

ISSN: 1135-5948

Year of publication: 2019

Issue: 63

Pages: 77-84

Type: Article


Abstract

The aim of this paper is to establish a corpus-based methodology for automatically measuring, by means of perplexity, the cross-lingual distance between historical periods of two languages. A corpus for each language was built ad hoc, keeping the spelling as close as possible to the original and balancing fiction and non-fiction across the historical periods. The methodology was applied to two related languages, Portuguese and Spanish, whose diachronic distances were measured both in the original orthography and in an automatically transcribed spelling.
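As a rough illustration of how perplexity can act as a distance between two text samples, the following minimal sketch trains a character n-gram model on one sample and scores the other with it. The abstract does not state the model details, so the n-gram order, the add-one smoothing, the symmetric averaging of both directions, and all function names here are assumptions made for illustration only, not the authors' implementation.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return the character n-grams of text, left-padded with spaces."""
    padded = " " * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]


def train(text, n):
    """Count n-grams and their (n-1)-character contexts in a non-empty text."""
    ngrams = Counter(char_ngrams(text, n))
    contexts = Counter()
    for gram, count in ngrams.items():
        contexts[gram[:-1]] += count
    return ngrams, contexts, set(text)


def perplexity(model, text, n):
    """Perplexity of a non-empty text under an add-one-smoothed model (assumed smoothing)."""
    ngrams, contexts, vocab = model
    v = len(vocab) + 1  # one extra slot for characters unseen in training
    grams = char_ngrams(text, n)
    log_sum = 0.0
    for gram in grams:
        p = (ngrams[gram] + 1) / (contexts[gram[:-1]] + v)
        log_sum += math.log(p)
    return math.exp(-log_sum / len(grams))


def distance(text_a, text_b, n):
    """Assumed symmetric distance: mean of the perplexities in both directions."""
    a_on_b = perplexity(train(text_a, n), text_b, n)
    b_on_a = perplexity(train(text_b, n), text_a, n)
    return (a_on_b + b_on_a) / 2


# Toy sentences used only as placeholders; they are not taken from the paper's corpus.
sample_pt = "o rei mandou fazer a carta"
sample_es = "el rey mandó hacer la carta"
print(distance(sample_pt, sample_es, 3))
```

In a diachronic setting of the kind the abstract describes, `text_a` and `text_b` would be the corpora of one historical period in each language, and the same function would be applied period by period.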

Funding information

The authors thank the referees for their thoughtful comments and helpful suggestions. We are very grateful to Fernando Venâncio from the University of Amsterdam, and to José António Souto Cabo and Carlos Quiroga from the University of Santiago de Compostela, for their expertise in the history of the Portuguese and Spanish languages. This work has received financial support from the DOMINO project (PGC2018-102041-B-I00, MCIU/AEI/FEDER, UE), the Consellería de Cultura, Educación e Ordenación Universitaria (accreditation 2016-2019, ED431G/08), and the European Regional Development Fund (ERDF).
