Linguistically annotated multilingual comparable corpora of parliamentary debates in English ParlaMint-en.ana 3.0

  1. Kuzman, Taja
  2. Ljubešić, Nikola
  3. Erjavec, Tomaž
  4. Kopp, Matyáš
  5. Ogrodniczuk, Maciej
  6. Osenova, Petya
  7. Fišer, Darja
  8. Pirker, Hannes
  9. Wissik, Tanja
  10. Schopper, Daniel
  11. Kirnbauer, Martin
  12. Mochtak, Michal
  13. Rupnik, Peter
  14. Pol, Henk van der
  15. Depoorter, Griet
  16. de Does, Jesse
  17. Simov, Kiril
  18. Grigorova, Vladislava
  19. Grigorov, Ilko
  20. Jongejan, Bart
  21. Haltrup Hansen, Dorte
  22. Navarretta, Costanza
  23. Mölder, Martin
  24. Kahusk, Neeme
  25. Vider, Kadri
  26. Bel, Nuria
  27. Antiba-Cartazo, Iván
  28. Pisani, Marilina
  29. Zevallos, Rodolfo
  30. Regueira, Xosé Luís
  31. Vladu, Adina Ioana
  32. Magariños, Carmen
  33. Bardanca, Daniel
  34. Barcala, Mario
  35. Garcia, Marcos
  36. Pérez Lago, María
  37. García Louzao, Pedro
  38. Vivel Couso, Ainhoa
  39. Vázquez Abuín, Marta
  40. García Díaz, Noelia
  41. Vidal Miguéns, Adrián
  42. Fernández Rei, Elisa
  43. Diwersy, Sascha
  44. Luxardo, Giancarlo
  45. Coole, Matthew
  46. Rayson, Paul
  47. Nwadukwe, Amanda
  48. Gkoumas, Dimitris
  49. Prokopidis, Prokopis
  50. Gavriilidou, Maria
  51. Piperidis, Stelios
  52. Ligeti-Nagy, Noémi
  53. Jelencsik-Mátyus, Kinga
  54. Varga, Zsófia
  55. Dodé, Réka
  56. Barkarson, Starkaður
  57. Agnoloni, Tommaso
  58. Bartolini, Roberto
  59. Frontini, Francesca
  60. Montemagni, Simonetta
  61. Quochi, Valeria
  62. Venturi, Giulia
  63. Ruisi, Manuela
  64. Marchetti, Carlo
  65. Battistoni, Roberto
  66. Darģis, Roberts
  67. van Heusden, Ruben
  68. Marx, Maarten
  69. Depuydt, Katrien
  70. Tungland, Lars Magne
  71. Rudolf, Michał
  72. Nitoń, Bartłomiej
  73. Aires, José
  74. Mendes, Amália
  75. Cardoso, Aida
  76. Pereira, Rui
  77. Yrjänäinen, Väinö
  78. Norén, Fredrik Mohammadi
  79. Magnusson, Måns
  80. Jarlbrink, Johan
  81. Meden, Katja
  82. Pančur, Andrej
  83. Ojsteršek, Mihael
  84. Çöltekin, Çağrı
  85. Kryvenko, Anna

Publisher: CLARIN ERIC

Year of publication: 2023

Type: Book

Abstract

ParlaMint-en 3.0 comprises the linguistically annotated multilingual comparable corpora of parliamentary debates ParlaMint.ana 3.0 (http://hdl.handle.net/11356/1488), machine translated to English, with the translation linguistically annotated.

Except for the translation to English, small changes in the metadata and the absence of the British parliament corpus, the corpora included in this entry are in all respects identical to the source language corpora, i.e. the entry comprises the same 26 European parliamentary corpora, totalling over 1.1 billion words.

The translation to English was done with EasyNMT (https://github.com/UKPLab/EasyNMT) using OPUS-MT models (https://github.com/Helsinki-NLP/Opus-MT). Machine translation was done at the sentence level and covers both speeches and transcriber notes, including headings. The linguistic annotation of the speeches, i.e. tokenisation, tagging with UD PoS and morphological features, lemmatisation, and NER annotation, was done with Stanza (https://stanfordnlp.github.io/stanza/), using the English language model. For NER, the conll03 model with 4 NE classes was used.

Note that the automatically produced translation to English contains errors typical of neural machine translation, including factual errors even where a high level of fluency is achieved, and any manual or automatic use of this corpus should take the limitations of machine translation into account. Note also that some metadata errors were noticed after the source 3.0 corpora were released and were corrected for the machine-translated corpus, so there are slight differences in the metadata between the two.

The files associated with this entry include the linguistically annotated corpora in several formats: the corpora in the canonical ParlaMint TEI XML encoding; the corpora in the derived vertical format (for use with CQP-based concordancers, such as CWB, noSketch Engine or KonText); and the corpora in the CoNLL-U format with TSV speech metadata. In contrast to the source language corpora, the CoNLL-U files are not derived from the TEI-encoded corpus but are the ones output by the machine translation and linguistic annotation pipeline, as these also contain word-alignment information, which is not present in the TEI version. Also included is the ParlaMint-en-3.0 release of the scripts and samples available at the GitHub repository of the ParlaMint project.
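As a rough illustration of how the CoNLL-U files mentioned above might be read, the following Python sketch parses the standard ten-column CoNLL-U layout using only the standard library. The sample sentence, the `sent_id` value, and the `NER=...` keys in the MISC column are invented for the example and are not taken from the corpus; the actual keys used in this release, including those carrying the word-alignment information, should be checked against the files themselves.

```python
from typing import Dict, Iterable, Iterator, List

# The ten columns defined by the CoNLL-U format.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]


def parse_misc(misc: str) -> Dict[str, str]:
    """Split a MISC cell such as 'NER=O|SpaceAfter=No' into a dict."""
    if misc == "_":
        return {}
    pairs: Dict[str, str] = {}
    for item in misc.split("|"):
        key, sep, value = item.partition("=")
        pairs[key] = value if sep else ""
    return pairs


def read_conllu(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one dict per sentence with its comment lines and tokens."""
    comments: Dict[str, str] = {}
    tokens: List[dict] = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if tokens:
                yield {"comments": comments, "tokens": tokens}
            comments, tokens = {}, []
        elif line.startswith("#"):
            # Comment lines look like '# key = value'.
            key, sep, value = line[1:].partition("=")
            comments[key.strip()] = value.strip() if sep else ""
        else:
            cols = line.split("\t")
            if len(cols) != len(CONLLU_FIELDS):
                continue  # skip malformed rows rather than guess
            token = dict(zip(CONLLU_FIELDS, cols))
            token["misc"] = parse_misc(token["misc"])
            tokens.append(token)
    if tokens:  # file did not end with a blank line
        yield {"comments": comments, "tokens": tokens}


# Invented sample; NOT an excerpt from ParlaMint-en.ana.
SAMPLE = (
    "# sent_id = demo.s1\n"
    "# text = Thank you.\n"
    "1\tThank\tthank\tVERB\tVBP\t_\t0\troot\t_\tNER=O\n"
    "2\tyou\tyou\tPRON\tPRP\t_\t1\tobj\t_\tNER=O|SpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t.\t_\t1\tpunct\t_\tNER=O\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
lemmas = [tok["lemma"] for tok in sentences[0]["tokens"]]
```

For a real file, an open text-mode file handle can be passed to `read_conllu` in place of `SAMPLE.splitlines()`, so that sentences are streamed rather than loaded all at once.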