Disambiguating Spanish se constructions with machine learning techniques

Aldama García, Nuria

Supervised by:

Antonio Moreno Sandoval (Supervisor)

University of defense: Universidad Autónoma de Madrid

Date of defense: 10 December 2021

Examination committee:

Cristina Sánchez López (Chair)
Jordi Porta Zamorano (Secretary)
Pablo Gamallo Otero (Member)

Type: Thesis

Teseo: 700014

Abstract

Spanish se constructions constitute a linguistic phenomenon that challenges Natural Language Processing (NLP) tasks such as part-of-speech tagging and dependency relation tagging. Se is difficult for NLP for three main reasons: first, the high frequency of se in Spanish; second, the nine different syntactic constructions in which se appears, contributing information of a different nature depending on the context; third, the absence of gender and number features on se, which leaves no morphological cues for disambiguating its type. The main goal of this thesis is to improve the state-of-the-art results in automatic morphosyntactic analysis of se on the basis of two hypotheses: the grouping hypothesis (GH) and the subcategorization frame hypothesis (SFH). The thesis proposes a new annotation scheme for se that connects the different constructions through a transitivity gradient (Moreno Cabrera, 2004). The new annotation scheme is applied to the SE-corpus, a European Spanish corpus of 3,100 sentences containing the word se. The SE-corpus is drawn from the news, leisure and daily life domain of CORPES XXI (Real Academia Española, 2018) and has been manually annotated as part of this research. The SE-corpus is used to train different models with UDPipe 1.2 in order to test whether the new annotation scheme can be learnt by the neural networks that underlie the dependency parser. The resulting models are evaluated on an additional gold standard test corpus of 100 sentences containing the form se, also obtained from CORPES XXI. The best model yields a LAS F-score of 86.97 points and a UAS F-score of 89.65 points. For the analysis of se specifically, the best model yields a LAS F-score of 82.55 points and a UAS F-score of 98.16 points. The main contributions of this thesis are: a new annotation scheme for se adapted to the Universal Dependencies guidelines, manual annotation guidelines for Spanish se disambiguation, the raw and annotated versions of the SE-corpus, and the best resulting model.
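For readers unfamiliar with the reported metrics, the following is a minimal sketch of how unlabeled and labeled attachment scores (UAS and LAS) can be computed, both over all tokens and restricted to the form se. It is not the evaluation code used in the thesis. It assumes that the gold and predicted analyses share the same tokenization, in which case precision, recall and F-score coincide with plain accuracy. The function name `attachment_scores`, the example sentence and the dependency labels are hypothetical and chosen only for illustration; they are not taken from the thesis's annotation scheme.

```python
from typing import List, Optional, Tuple

# One token: (word form, index of its head, dependency relation label).
Token = Tuple[str, int, str]


def attachment_scores(
    gold: List[Token],
    pred: List[Token],
    only_form: Optional[str] = None,
) -> Tuple[float, float]:
    """Return (UAS, LAS) as percentages.

    Assumes gold and pred are aligned token by token (same tokenization).
    If only_form is given, only tokens whose gold form matches it
    (case-insensitively) are scored, e.g. only_form="se".
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same number of tokens")

    pairs = [
        (g, p)
        for g, p in zip(gold, pred)
        if only_form is None or g[0].lower() == only_form.lower()
    ]
    if not pairs:
        return 0.0, 0.0

    # UAS: the predicted head matches the gold head.
    uas_hits = sum(1 for g, p in pairs if g[1] == p[1])
    # LAS: both the head and the relation label match.
    las_hits = sum(1 for g, p in pairs if g[1] == p[1] and g[2] == p[2])

    n = len(pairs)
    return 100.0 * uas_hits / n, 100.0 * las_hits / n


# Illustrative sentence "Se venden pisos" with made-up gold and predicted labels.
gold = [("Se", 2, "expl:pass"), ("venden", 0, "root"), ("pisos", 2, "nsubj")]
pred = [("Se", 2, "expl:impers"), ("venden", 0, "root"), ("pisos", 2, "obj")]

overall_uas, overall_las = attachment_scores(gold, pred)
se_uas, se_las = attachment_scores(gold, pred, only_form="se")
```

In this toy case every head is predicted correctly but two labels are wrong, so UAS is higher than LAS, which mirrors the pattern in the se-specific scores reported above: attaching se to the right head is easier than assigning it the right construction label.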