Automatic diachronic distance between diatopic variants of Portuguese and Spanish
- Pichel, José Ramom 1
- Gamallo, Pablo 2
- Neves, Marco 3
- Alegria, Iñaki 4
- 1 imaxin software
- 2 Universidade de Santiago de Compostela
- 3 Universidade Nova de Lisboa
- 4 Universidade do País Basco (EHU/UPV)
ISSN: 1647-0818
Year of publication: 2020
Volume: 12
Issue: 1
Pages: 117-126
Type: Article
Other publications in: Linguamática
Abstract
The aim of this work is to apply a perplexity-based methodology to automatically calculate the cross-lingual distance between different historical periods of diatopic variants of languages. The methodology is applied to an ad hoc corpus in original spelling, balanced between fiction and non-fiction, to measure the historical distance between European and Brazilian Portuguese, on the one hand, and European and Argentinian Spanish, on the other. The results show very close distances, in both original and automatically transcribed spelling, between the diatopic varieties of Portuguese and of Spanish, with slight convergences/divergences from the mid-20th century to the present. Notably, the method is unsupervised and can be applied to other diatopic varieties of languages.
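As a rough illustration of how a perplexity-based distance can be computed, the following minimal sketch trains a character n-gram model on one text and measures its perplexity on another. It is not the authors' implementation: the n-gram order, the add-one smoothing, the symmetric averaging of the two directions, and all function names are assumptions made for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Return the character n-grams of text, left-padded with spaces."""
    padded = " " * (n - 1) + text
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def train(text, n):
    """Count n-grams and their (n-1)-character contexts (hypothetical helper)."""
    grams = char_ngrams(text, n)
    ngram_counts = Counter(grams)
    context_counts = Counter(g[:-1] for g in grams)
    vocab = set(text) | {" "}
    return ngram_counts, context_counts, vocab


def perplexity(model, text, n):
    """Perplexity of a non-empty text under an add-one smoothed n-gram model."""
    ngram_counts, context_counts, vocab = model
    v = len(vocab) + 1  # one extra slot for characters unseen in training
    grams = char_ngrams(text, n)
    log_sum = 0.0
    for g in grams:
        p = (ngram_counts[g] + 1) / (context_counts[g[:-1]] + v)
        log_sum += math.log(p)
    return math.exp(-log_sum / len(grams))


def distance(text_a, text_b, n=3):
    """Assumed symmetric distance: mean of the perplexities in both directions."""
    pp_ab = perplexity(train(text_a, n), text_b, n)
    pp_ba = perplexity(train(text_b, n), text_a, n)
    return (pp_ab + pp_ba) / 2
```

In this sketch, calling `distance` on samples from two periods or varieties yields a single score, with larger values indicating that each text is harder to predict from the other. Real experiments would need much larger corpora and a more careful smoothing scheme.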