Nos_CorpusNOS-GL: Galician Macrocorpus for LLM training

  1. de-Dios-Flores, Iria 1
  2. Paniagua Suárez, Silvia
  3. Bardanca, Daniel
  4. Gamallo, Pablo
  5. García, Marcos
  6. Ramom Pichel Campos, José
  7. Carbajal Pérez, Cristina
  8. Moscoso Sánchez, Antonio
  9. Francisco Marini, Jose Javier
  10. Canosa Pérez, Cristian
  1. 1 Universidade de Santiago de Compostela, Santiago de Compostela, Spain (ROR: https://ror.org/030eybx10)

Publisher: Zenodo

Year of publication: 2024

Type: Dataset

License: CC BY 4.0

Abstract

CorpusNÓS is a massive Galician corpus made up of 2.1B words, primarily devised for training large language models. The corpus sources are varied and represent a relatively wide range of genres.

We are happy to announce a new version of CorpusNÓS. After improving the text cleaning and processing methods in our cleaning pipeline, we have decided to release this new version of the corpus, which reflects those enhancements.

This new version contains the same files as the previous one and keeps the same distribution of the data. However, we have changed the format from plain text (*.txt) to JSONL (*.jsonl) so that future cleaning processes can be performed easily and relevant metadata can be included. As of now, entries in CorpusNós have the following structure:

{"id": 0, "text": "Abades: Parroquia do concello de Baltar baixo a advocación de san Paio.", "num_words": 12}

{"id": 581, "text": "Feliz 2008 a tódolos nosos lectores\nAgora que remata 2007, un ano cheo de novidades tecnolóxicas que difundimos a través deste espazo dixital, queremos desexar a tódolos que non seguen con fidelidade unha boa despedida do ano e un feliz aninovo.\nNós volveremos o mércores, 2 de xaneiro, á nosa actividade ordinaria, cumprindo coa nosa labor informativa para que as novas tecnolóxicas de Galicia e en galego cheguen ós nosos lectores puntualmente.", "num_words": 72, "pyplexity_score": 717.7585757844212, "lang": "gl"}

In the plain text version, the delimiter between documents was two newlines (\n\n). In the JSONL version, each document is a JSON object with its corresponding id; it also includes the number of words of each document and, in some cases, the pyplexity score and the language tag.

This new version of CorpusNós has undergone a heavier deduplication process than the previous one.
This means that more exact-match duplicates, as well as partial duplicates, have been removed from the corpus; as a result, the number of documents and tokens in this version has decreased. The current statistics are:

Subcorpus: Data obtained via transfer agreement

| Genre | Nº tokens | Nº documents |
| --- | --- | --- |
| Books | 7,217,626 | 103 |
| Research articles | 2,638,537 | 635 |
| Press | 92,661,350 | 161,760 |
| Governmental | 221,565,059 | 527,699 |
| Web contents | 15,471,132 | 41,276 |
| Encyclopedic | 4,799,214 | 47,396 |
| Subtotal | 332,721,231 | 777,583 |

Subcorpus: Public data

| Genre | Nº tokens | Nº documents |
| --- | --- | --- |
| Press and blogs | 142,238,181 | 598,375 |
| Encyclopedic | 48,260,708 | 148,560 |
| Web crawls | 1,205,699,835 | 2,850,604 |
| Translation corpora | 106,555,883 | 3,544,026 |
| Subtotal | 1,502,754,607 | 7,141,565 |

Total: 1,835,475,838 tokens, 7,919,148 documents.

The TXT version is still available in the corpusnos_v1_txt zip file and maintains the same structure as before (documents are separated by two newlines, '\n\n'), but it has not gone through the improved cleaning process mentioned above.

Please note that if you want to download or use the newest version, you have to download corpusnos_v2_jsonl.

Note: Some of the files referenced may be missing from this version of the corpus due to pending transfer agreements; they will be included in a future version of the corpus as soon as they are available for publishing.

Note: The following subcorpora have different licenses, corresponding to their original licenses as specified in the paper: TED2020 (CC BY-NC-ND 4.0), mC4 (Apache License 2.0), OSCAR (CC0).

Please refer to our paper for more details: CorpusNÓS: A massive Galician corpus for training large language models.

If you use this data in your work, please cite:

de-Dios-Flores, Iria, Silvia Paniagua Suárez, Cristina Carbajal Pérez, Daniel Bardanca Outeiriño, Marcos Garcia and Pablo Gamallo. 2024.
CorpusNÓS: A massive Galician corpus for training large language models. Proceedings of the 16th International Conference on Computational Processing of Portuguese (Volume 1), pages 593-599.
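The two release formats described in the abstract (JSONL with one document per line, and the v1 plain-text dump with documents separated by blank lines) can be consumed with a few lines of standard-library Python. This is a minimal sketch, not part of the release: the function names and the sample data written to a temporary file are illustrative, and the sample reuses the two example entries shown above.

```python
import json
import tempfile
from pathlib import Path


def read_corpusnos_jsonl(path):
    """Yield one document dict per line of a CorpusNós-style JSONL file.

    Each line is a JSON object with at least "id", "text" and "num_words";
    "pyplexity_score" and "lang" are only present for some documents.
    """
    with open(path, encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if line:
                yield json.loads(line)


def split_corpusnos_txt(raw):
    """Split the v1 plain-text release into documents on blank lines."""
    return [doc.strip() for doc in raw.split("\n\n") if doc.strip()]


# Demo: write the two example entries from the dataset description
# to a temporary file and read them back.
sample = (
    '{"id": 0, "text": "Abades: Parroquia do concello de Baltar '
    'baixo a advocación de san Paio.", "num_words": 12}\n'
    '{"id": 581, "text": "Feliz 2008 a tódolos nosos lectores", '
    '"num_words": 72, "pyplexity_score": 717.7585757844212, "lang": "gl"}\n'
)
with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "corpusnos_sample.jsonl"
    path.write_text(sample, encoding="utf-8")
    docs = list(read_corpusnos_jsonl(path))

print(len(docs))             # 2
print(docs[0]["num_words"])  # 12
print(docs[1].get("lang"))   # gl
```

Streaming line by line rather than loading the whole file keeps memory use flat, which matters for the web-crawl subcorpus at over a billion tokens.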