La construcción de un corpus paralelo bilingüe multifuncional

  1. Irene Doval Reixa 1
  1. 1 Universidade de Santiago de Compostela, Santiago de Compostela, Spain

    ROR https://ror.org/030eybx10

Journal:
Moenia: Revista lucense de lingüística & literatura

ISSN: 1137-2346

Year of publication: 2017

Issue title: Morfosintaxis y semántica del verbo en español: historia y descripción

Number: 23

Pages: 717-734

Type: Article

Abstract

This article describes the steps involved in building a bilingual parallel corpus intended for multiple purposes, and the issues to consider along the way, with a special focus on cross-linguistic research, translation, and foreign language teaching. The process is illustrated with the creation of PaGeS, a German/Spanish parallel corpus that can be searched online through a web interface. Although originally created for cross-linguistic research, the corpus aims to cover a wide range of uses. The paper describes the phases of corpus construction: compilation, preprocessing, corpus markup, linguistic annotation, and alignment of the data. Finally, it presents the web interface and the search options available to the different user groups.

Keywords: corpus linguistics, parallel corpus, cross-linguistics, translation.

References

  • BAKER, M. (1996): “Corpus-based translation studies: The challenges that lie ahead”. In H. Somers (ed.): Terminology, LSP and Translation. Amsterdam: John Benjamins, 175-86.
  • BERNARDINI, S. (2004): “Corpora in the Classroom: An Overview and Some Reflections on Future Developments”. In J. Sinclair (ed.): How to Use Corpora in Language Teaching. Amsterdam: John Benjamins, 15-36.
  • BRAUNE, F. & A. FRASER (2010): “Improved unsupervised sentence alignment for symmetrical and asymmetrical parallel corpora”. COLING ’10 Proceedings of the 23rd International Conference on Computational Linguistics: Posters. Beijing, China, 81-9.
  • BROWN, P. F., J. C. LAI & R. L. MERCER (1991): “Aligning Sentences in Parallel Corpora”. Proceedings of the 29th Annual Meeting on Association for Computational Linguistics, ACL ’91. Stroudsburg, PA: ACL, 169-76.
  • DOVAL, I. (2016): “Bilingual Parallel Corpora for Linguistic Research”. EPiC Series in Language and Linguistics 1, 88-96.
  • GALE, W. A. & K. W. CHURCH (1993): “A Program for Aligning Sentences in Bilingual Corpora”. Computational Linguistics 19/1, 75-102.
  • HAJLAOUI, N. et al. (2014): “DCEP-Digital Corpus of the European Parliament”. Proceedings LREC 2014 (Language Resources and Evaluation Conference). Reykjavik, Iceland. May 26-31, 2014, 3164-71. Online: <http://www.lrec-conf.org/proceedings/lrec2014/pdf/943_Paper.pdf>.
  • HEID, U. (2008): “Corpus linguistics and lexicography”. In Lüdeling & Kytö (2008: 131-53).
  • LÜDELING, A. & M. KYTÖ (2008): Corpus Linguistics. An International Handbook. Vol. 1. Handbücher zur Sprach- und Kommunikationswissenschaft. Berlin: Walter de Gruyter.
  • HILL, T. (2011): El verano de los juguetes muertos. Barcelona: Penguin Random House. [Der Sommer der toten Puppen. Berlin: Suhrkamp, 2013.]
  • JOHANSSON, S. (2007a): Seeing through Multilingual Corpora: On the use of corpora in contrastive studies. Amsterdam: John Benjamins.
  • JOHANSSON, S. (2007b): “Using Corpora: From Learning to Research”. In E. Hidalgo, L. Quereda & J. Santana (eds.): Corpora in the Foreign Language Classroom. Amsterdam: Rodopi, 17-30.
  • KOEHN, P. (2005): “EuroParl: A parallel corpus for statistical machine translation”. Proceedings of the machine translation summit. Phuket: AAMT, 79-86. Online: <http://www.statmt.org/europarl/>.
  • KAY, M. & M. RÖSCHEISEN (1993): “Text-translation Alignment”. Computational Linguistics 19/1, 121-42.
  • MCENERY, T. & A. HARDIE (2012): Corpus Linguistics. Cambridge: Cambridge University Press.
  • PADRÓ, L. (2011): “Analizadores Multilingües en FreeLing”. Linguamatica 3/2, 13-20.
  • RÖMER, U. (2008): “Corpora and language teaching”. In Lüdeling & Kytö (2008: 112-31).
  • SCHMID, H. (1995): “Improvements in Part-of-Speech Tagging with an Application to German”. Proceedings of the ACL SIGDAT-Workshop. Dublin, 47-50. Online: <http://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/data/tree-tagger2.pdf>.
  • STEINBERGER, R. et al. (2014): “An overview of the European Union’s highly multilingual parallel corpora”. Language Resources and Evaluation Journal 48/4, 679-707.
  • TIEDEMANN, J. (2011): Bitext Alignment. Toronto: Morgan & Claypool.
  • TIEDEMANN, J. (2012): “Parallel Data, Tools and Interfaces in OPUS”. Proceedings of the Eighth International Conference on Language Resources and Evaluation (LREC-2012). Paris: ELRA, 2214-8. Online: <www.lrec-conf.org/proceedings/lrec2012/pdf/463_Paper.pdf>.
  • VARGA, D. et al. (2007): “Parallel corpora for medium density languages”. In N. Nicolov et al. (eds.): Recent Advances in Natural Language Processing IV. Amsterdam: John Benjamins, 590-6.
  • VOLK, M., J. GRAËN & E. CALLEGARO (2014): “Innovations in Parallel Corpus Search Tools”. In N. Calzolari et al. (eds.): Proceedings LREC 2014, 3172-8. Online: <http://www.lrec-conf.org/proceedings/lrec2014/pdf/504_Paper.pdf>.
  • WETZEL, D. & F. BOND (2012): “Enriching parallel corpora for statistical machine translation with semantic negation rephrasing”. In M. Carpuat, L. Specia & D. Wu (eds.): Proceedings of the Sixth Workshop on Syntax, Semantics and Structure in Statistical Translation. Stroudsburg: ACL, 20-9. Online: <http://aclweb.org/anthology/W12-4203>.
  • ZHEKOVA, D. et al. (2016): “Alignment and Application of Russian-German Multi-Target Parallel Corpora for Linguistic Analysis and Literary Studies”. MATLIT 4/1, 45-61. Online: <http://dx.doi.org/10.14195/2182-8830_4-1_3>.