iRead4Skills Dataset 2: annotated corpora by level of complexity for FR, PT and SP
- Pintard, Alice 1,2
- François, Thomas 1,2
- Nagant de Deuxchaisnes, Justine 1,2
- Barbosa, Sílvia 3,4
- Reis, Maria Leonor 3,4
- Moutinho, Michell 3,4
- Monteiro, Ricardo 3,4
- Amaro, Raquel 3,4
- Correia, Susana 3,4
- Rodríguez Rey, Sandra 5,6
- Mu, Keran 7
- Garcia González, Marcos 5,6
- Bernárdez Braña, André 5,6
- Blanco Escoda, Xavier 7
- 1 CENTAL
- 2 UCLouvain
- 3 CLUNL
- 4 NOVA FCSH
- 5 CITIUS
- 6 Universidade de Santiago de Compostela
- 7 Universitat Autònoma de Barcelona
Publisher: Zenodo
Year of publication: 2024
Type: Dataset
Abstract
The iRead4Skills Dataset 2: annotated corpora by level of complexity for FR, PT and SP is a collection of texts categorized by complexity level and annotated for complexity features, presented in xlsx format. These corpora were compiled, classified and annotated within the scope of the project iRead4Skills – Intelligent Reading Improvement System for Fundamental and Transversal Skills Development, funded by the European Commission (grant number: 1010094837). The project aims to enhance reading skills within the adult population by creating an intelligent system that assesses text complexity and recommends suitable reading materials to adults with low literacy skills, contributing to reducing skills gaps and facilitating access to information and culture (https://iread4skills.com/).

This dataset is the result of specifically devised classification and annotation tasks, in which selected texts were organized and distributed to trainers in Adult Learning (AL) and Vocational Education and Training (VET) centres, as well as to adult students in AL and VET centres. The task was conducted via the Qualtrics platform.

Dataset 2: annotated corpora by level of complexity for FR, PT and SP is derived from the iRead4Skills Dataset 1: corpora by level of complexity for FR, PT and SP (https://doi.org/10.5281/zenodo.10055909), which comprises written texts of various genres and complexity levels. From this collection, a subset of texts was selected for classification and annotation. This classification and annotation task aimed to provide additional data and test sets for the complexity analysis systems for the three languages of the project: French, Portuguese, and Spanish. The texts in each language corpus were selected taking into account the diversity of topics/domains, genres, and the reading preferences of the target audience of the iRead4Skills project. The selection amounted to a total of 462 texts per language, divided by level of complexity as follows:

- 140 Very Easy texts
- 140 Easy texts
- 140 Plain texts
- 42 More Complex texts

Trainers were asked to classify the texts according to the complexity levels of the project, here informally defined as:

- Very Easy: everyone can understand the text or most of the text.
- Easy: a person with less than the 9th year of schooling can understand the text or most of the text.
- Plain: a person with the 9th year of schooling can understand the text the first time he/she reads it.
- More Complex: a person with the 9th year of schooling cannot understand the text the first time he/she reads it.

They were also asked to annotate the parts of the texts considered complex according to various types of features, at word level and at sentence level (e.g., word order, sentence composition, etc.), using the following categories:

Lexical/word-related features
- unknown word
- word too technical/specialized or archaic
- complex derived word
- points to a previous reference that is not obvious
- word (other)

Syntactic/sentence-level features
- unusual word order
- too much embedded secondary information
- too many connectors in the same sentence
- sentence (other)
- other (please specify)

The sets were divided into three parts in Qualtrics and, in each part, the texts were shown to the annotator in random order. Students were asked to confirm that they could read without difficulty texts adequate to their literacy level. Each set contained texts from a given level, plus one text from the level immediately above.

Students were also asked to annotate words or sequences of words in the text that they did not understand, according to the following categories:
- difficult word
- difficult part of the text

The complete results and datasets are provided in TSV/Excel format, in pairs of files: one file with the results of the classification (trainers) / validation (students) task and one file with the results of the annotation task. The complete datasets will be made available under a Creative Commons CC BY-NC-ND 4.0 license.
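For illustration only, the sketch below shows one way such a pair of result files might be loaded and summarized with pandas. The file names and column labels used here (e.g. FR_classification_results.xlsx, text_id, assigned_level, category) are assumptions for the example, not the actual schema of the released files; adjust them to the headers found in the dataset.

```python
# Minimal sketch for exploring one language's pair of result files.
# File names and column labels are hypothetical placeholders.
import pandas as pd

# Classification/validation results (trainers/students), e.g. for French.
classification = pd.read_excel("FR_classification_results.xlsx")

# Annotation results (word- and sentence-level complexity features).
annotations = pd.read_csv("FR_annotation_results.tsv", sep="\t")

# Distribution of texts over the assigned complexity levels.
print(classification["assigned_level"].value_counts())

# Most frequently used annotation categories (e.g. "unknown word").
print(annotations["category"].value_counts().head(10))

# Join the two tables on the text identifier to inspect annotations
# together with the complexity level assigned to each text.
merged = annotations.merge(
    classification[["text_id", "assigned_level"]],
    on="text_id",
    how="left",
)
print(merged.head())
```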