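Automatic diachronic distance between diatopic variants of Portuguese and Spanish (original title: Distância diacrónica automática entre variantes diatópicas do português e do espanhol)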

  1. Pichel, José Ramom (1)
  2. Gamallo, Pablo (2)
  3. Neves, Marco (3)
  4. Alegria, Iñaki (4)

  1. imaxin software
  2. Universidade de Santiago de Compostela, Santiago de Compostela, Spain (ROR https://ror.org/030eybx10)
  3. Universidade Nova de Lisboa, Lisboa, Portugal (ROR https://ror.org/02xankh89)
  4. Universidade do País Basco (EHU/UPV)

Journal: Linguamática

ISSN: 1647-0818

Year of publication: 2020

Volume: 12

Issue: 1

Pages: 117-126

Type: Article

DOI: 10.21814/LM.12.1.319

Access: Open access (publisher)


Abstract

The aim of this work is to apply a perplexity-based methodology to automatically calculate the interlinguistic distance between different historical periods of diatopic language variants. The methodology is applied to a corpus built ad hoc in original spelling, balanced between fiction and non-fiction, and is used to measure the historical distance between European and Brazilian Portuguese on the one hand, and between European and Argentinian Spanish on the other. The results show very similar distances, in both original and automatically transcribed spelling, between the diatopic varieties of Portuguese and of Spanish, with slight convergences/divergences from the mid-20th century to the present. Notably, the method is unsupervised and can be applied to other diatopic varieties of languages.
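
As an illustration of what a perplexity-based distance can look like, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the language model, so the character trigram model, the add-one smoothing, the symmetric averaging, and every name here (`CharNgramModel`, `perplexity_distance`) are assumptions made for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Yield (context, next_char) pairs; the start is padded with spaces."""
    padded = " " * (n - 1) + text
    for i in range(len(text)):
        yield padded[i:i + n - 1], padded[i + n - 1]


class CharNgramModel:
    """Character n-gram model with add-one smoothing (an assumed choice)."""

    def __init__(self, text, n=3):
        if not text:
            raise ValueError("training text must not be empty")
        self.n = n
        self.ngram_counts = Counter(char_ngrams(text, n))
        self.context_counts = Counter()
        for (context, _), count in self.ngram_counts.items():
            self.context_counts[context] += count
        self.vocab = set(text)

    def perplexity(self, text):
        """Perplexity of `text` under this model: exp of the mean negative log-probability."""
        if not text:
            raise ValueError("test text must not be empty")
        # Smooth over the union of both alphabets so unseen characters
        # still receive non-zero probability.
        vocab_size = len(self.vocab | set(text))
        log_sum = 0.0
        for context, char in char_ngrams(text, self.n):
            numerator = self.ngram_counts[(context, char)] + 1
            denominator = self.context_counts[context] + vocab_size
            log_sum += math.log(numerator / denominator)
        return math.exp(-log_sum / len(text))


def perplexity_distance(text_a, text_b, n=3):
    """Symmetric distance: mean of the two cross-perplexities (assumed definition)."""
    pp_ab = CharNgramModel(text_a, n).perplexity(text_b)
    pp_ba = CharNgramModel(text_b, n).perplexity(text_a)
    return (pp_ab + pp_ba) / 2


if __name__ == "__main__":
    # Toy sentences only; a real experiment would use a period-by-period corpus.
    pt_eu = "o autocarro chegou à paragem e o comboio partiu da estação"
    pt_br = "o ônibus chegou ao ponto e o trem partiu da estação"
    print(perplexity_distance(pt_eu, pt_br))
```

In this sketch, a lower value means that each text is easier to predict from a model trained on the other, which is read as a smaller distance. No labelled data is needed, which matches the unsupervised character of the method described in the abstract.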
