Nos_CorpusNOS-GL: Galician Macrocorpus for LLM training
- de-Dios-Flores, Iria 1
- Paniagua Suárez, Silvia
- Bardanca, Daniel
- Gamallo, Pablo
- García, Marcos
- Ramom Pichel Campos, José
- Carbajal Pérez, Cristina
- Moscoso Sánchez, Antonio
- Francisco Marini, Jose Javier
- Canosa Pérez, Cristian
1
Universidade de Santiago de Compostela
Publisher: Zenodo
Year of publication: 2024
Type: Dataset
Abstract
CorpusNÓS is a massive Galician corpus made up of 2.1B words primarily devised for training large language models. The corpus sources are varied and represent a relatively wide range of genres.

------------------

We are pleased to announce a new version of CorpusNÓS. After improving the text cleaning and processing methods in our cleaning pipeline, we decided to release a new version of the corpus that reflects those enhancements. It contains the same files as the previous version and keeps the same data distribution. However, the format has changed from plain text (*.txt) to JSONL (*.jsonl) so that future cleaning processes are easier to perform and relevant metadata can be included.

At present, entries in CorpusNÓS have the following structure:

{"id": 0, "text": "Abades: Parroquia do concello de Baltar baixo a advocación de san Paio.", "num_words": 12}
{"id": 581, "text": "Feliz 2008 a tódolos nosos lectores\nAgora que remata 2007, un ano cheo de novidades tecnolóxicas que difundimos a través deste espazo dixital, queremos desexar a tódolos que non seguen con fidelidade unha boa despedida do ano e un feliz aninovo.\nNós volveremos o mércores, 2 de xaneiro, á nosa actividade ordinaria, cumprindo coa nosa labor informativa para que as novas tecnolóxicas de Galicia e en galego cheguen ós nosos lectores puntualmente.", "num_words": 72, "pyplexity_score": 717.7585757844212, "lang": "gl"}

In the plain text version, documents were separated by two newlines (\n\n). In the JSONL version, each document is a JSON object with its own id, the number of words in the document and, in some cases, the pyplexity score and the language tag.

This new version of CorpusNÓS has undergone heavier deduplication than the previous one.
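The JSONL record structure described above can be read one line at a time. The following is a minimal sketch, not part of the official release: the helper name iter_documents is hypothetical, the first sample record is copied from the description, and the second is a made-up record that only illustrates the optional pyplexity_score and lang fields.

```python
import json
from typing import Any, Dict, Iterable, Iterator


def iter_documents(lines: Iterable[str]) -> Iterator[Dict[str, Any]]:
    """Yield one dict per non-empty JSONL line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


# First record: copied from the corpus description.
# Second record: invented for illustration only (not taken from the corpus).
SAMPLE_LINES = [
    '{"id": 0, "text": "Abades: Parroquia do concello de Baltar baixo a '
    'advocación de san Paio.", "num_words": 12}',
    '{"id": 1, "text": "Texto de exemplo.", "num_words": 3, '
    '"pyplexity_score": 100.0, "lang": "gl"}',
]

docs = list(iter_documents(SAMPLE_LINES))
for doc in docs:
    # "pyplexity_score" and "lang" appear only in some records,
    # so read them with .get() and expect None when they are absent.
    score = doc.get("pyplexity_score")
    lang = doc.get("lang")
    print(doc["id"], doc["num_words"], score, lang)
```

Because the function accepts any iterable of strings, an open file handle (opened with encoding="utf-8") for one of the .jsonl files can be passed in place of SAMPLE_LINES, which avoids loading a whole file into memory.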
As a result of this heavier deduplication, more exact-match and partial duplicates have been removed from the corpus, so the number of documents and tokens in this version has decreased. The current statistics are:

Subcorpus: Data obtained via transfer agreement

Genre                 No. of tokens  No. of documents
Books                     7.217.626               103
Research articles         2.638.537               635
Press                    92.661.350           161.760
Governmental            221.565.059           527.699
Web contents             15.471.132            41.276
Encyclopedic              4.799.214            47.396
Subtotal                332.721.231           777.583

Subcorpus: Public data

Genre                 No. of tokens  No. of documents
Press and blogs         142.238.181           598.375
Encyclopedic             48.260.708           148.560
Web crawls            1.205.699.835         2.850.604
Translation corpora     106.555.883         3.544.026
Subtotal              1.502.754.607         7.141.565

Total                 1.835.475.838         7.919.148

The TXT version is still available in the corpusnos_v1_txt zip file and maintains the same structure as before (documents are separated by two newlines, '\n\n'), but it has not gone through the improved cleaning process mentioned above. To download or use the newest version, download corpusnos_v2_jsonl.

Note: Some of the files referenced may be missing from this version of the corpus due to pending transfer agreements; they will be included in a future version as soon as they are available for publication.

Note: The following subcorpora keep their original licenses, as specified in the paper: TED2020 (CC BY-NC-ND 4.0), mC4 (Apache License 2.0), OSCAR (CC0).

For more details, please refer to our paper, CorpusNÓS: A massive Galician corpus for training large language models.

If you use this data in your work, please cite:

de-Dios-Flores, Iria, Silvia Paniagua Suárez, Cristina Carbajal Pérez, Daniel Bardanca Outeiriño, Marcos Garcia and Pablo Gamallo. 2024. CorpusNÓS: A massive Galician corpus for training large language models. Proceedings of the 16th International Conference on Computational Processing of Portuguese - ACL Anthology (Volume 1), 593-599.
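For users of the older plain-text release, where documents are separated by two newlines, the split into documents might be sketched as follows. This is an illustration under stated assumptions: the helper name split_documents and the sample text are invented, and dropping empty chunks is a choice made here rather than something the corpus description specifies.

```python
from typing import List


def split_documents(raw_text: str) -> List[str]:
    """Split a plain-text dump into documents on double-newline boundaries."""
    chunks = raw_text.split("\n\n")
    # Drop empty chunks (e.g. from a trailing separator) and trim whitespace.
    return [chunk.strip() for chunk in chunks if chunk.strip()]


# Made-up sample: two documents, the first one spanning two lines.
SAMPLE_TEXT = "Primeiro documento.\nSegunda liña.\n\nSegundo documento.\n\n"

documents = split_documents(SAMPLE_TEXT)
print(len(documents))
```

Single newlines inside a document are preserved, matching the way the JSONL records keep "\n" inside their text field.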