Distância diacrónica automática entre variantes diatópicas do português e do espanhol [Automatic diachronic distance between diatopic variants of Portuguese and Spanish]

  1. Pichel, José Ramom (1)
  2. Gamallo, Pablo (2)
  3. Neves, Marco (3)
  4. Alegria, Iñaki (4)

Affiliations:

  1. imaxin software
  2. Universidade de Santiago de Compostela, Santiago de Compostela, Spain (ROR https://ror.org/030eybx10)
  3. Universidade Nova de Lisboa, Lisbon, Portugal (ROR https://ror.org/02xankh89)
  4. Universidade do País Basco (EHU/UPV)

Journal:
Linguamática

ISSN: 1647-0818

Year of publication: 2020

Volume: 12

Issue: 1

Pages: 117-126

Type: Article

DOI: 10.21814/LM.12.1.319 (open access)

Abstract

The objective of this work is to apply a perplexity-based methodology to automatically calculate the cross-lingual distance between different historical periods of diatopic language variants. The methodology is applied to a purpose-built corpus in original spelling, balanced between fiction and non-fiction, to measure the historical distance between European and Brazilian Portuguese on the one hand, and European and Argentinian Spanish on the other. The results show very small distances, in both original and automatically transcribed spelling, between the diatopic varieties of Portuguese and of Spanish, with slight convergences and divergences from the middle of the 20th century to the present. The method is unsupervised and can be applied to other diatopic varieties of languages.
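
As an illustration of what a perplexity-based distance can look like, the following is a minimal sketch rather than the authors' implementation. The abstract does not state the model details, so the character n-gram model, the n-gram order, the add-one smoothing, and the symmetric averaging of the two cross-perplexities are all assumptions made here for the example. The function names are hypothetical, and both input texts are assumed to be non-empty.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return the character n-grams of `text`, left-padded with spaces."""
    padded = " " * (n - 1) + text
    return [padded[i:i + n] for i in range(len(text))]


def train(text, n):
    """Count n-grams and their (n-1)-character contexts in `text`."""
    grams = char_ngrams(text, n)
    ngram_counts = Counter(grams)
    context_counts = Counter(g[:-1] for g in grams)
    return ngram_counts, context_counts, set(text)


def perplexity(model, text, n):
    """Perplexity of `text` under `model` (assumption: add-one smoothing)."""
    ngram_counts, context_counts, vocab = model
    # The smoothing vocabulary also covers characters unseen in training.
    v = len(vocab | set(text))
    grams = char_ngrams(text, n)
    log_prob = 0.0
    for g in grams:
        p = (ngram_counts[g] + 1) / (context_counts[g[:-1]] + v)
        log_prob += math.log(p)
    return math.exp(-log_prob / len(grams))


def perplexity_distance(text_a, text_b, n=3):
    """Symmetric distance: mean of the two cross-perplexities (assumption)."""
    pp_ab = perplexity(train(text_a, n), text_b, n)
    pp_ba = perplexity(train(text_b, n), text_a, n)
    return (pp_ab + pp_ba) / 2


# Toy strings, invented for illustration; they are not from the paper's corpus.
sample_pt = "o gato dorme na cama e o gato come"
sample_es = "el perro duerme en la cama"
print(perplexity_distance(sample_pt, sample_es))
```

In a diachronic setting, one would train such a model on the texts of one variety and period and evaluate it on another, so that a lower value suggests closer varieties. Real experiments would need far larger samples and a stronger smoothing scheme than this toy example uses.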
