On Data Engineering and Knowledge Graphs - A holistic, smarter approach to data enrichment

Ayala Hernández, Daniel

Supervised by:

David Ruiz Cortés Director
Inmaculada Concepción Hernández Salmerón Director

Defence university: Universidad de Sevilla

Defence date: 22 October 2020

Committee:

José Miguel Toro Bonilla Chair
José Antonio Troyano Jiménez Secretary
Carlos Rafael Rivero Osuna Committee member
Manuel Lama Penín Committee member
Ernest Teniente López Committee member

Type: Thesis

Teseo: 631734

Abstract

Recent years have seen increased interest in the development of large, structured data sources that support algorithms for tasks such as question answering or product recommendation. This has popularized the use of, and research on, knowledge graphs, which store information as a graph in which nodes represent entities with attributes and edges represent relations between them. Creating a large knowledge graph is not trivial, since it may require data engineering techniques such as the integration of data from several heterogeneous sources or the completion of missing knowledge. These techniques take an initial knowledge graph and enrich it with additional facts. Integrating heterogeneous sources involves mapping external data onto a local schema, which can be done by labelling external data with known classes (semantic labelling) or by finding equivalences between the external schema and the local one (matching). This is usually done by means of similarity metrics or measurements of the format or values of the data. Existing proposals use a limited set of features that may in some cases be insufficient to identify two concepts as equivalent or different, which motivates the design of new, more sophisticated features. Completing a knowledge graph involves predicting data that is missing from it, such as entity classes or relations between entities. Predicting missing edges can be seen as a classification problem in which candidate edges are classified as true or false. This is an error-prone process in which a badly trained technique could introduce a large amount of incorrect knowledge into the graph. Therefore, the creation of resources for supervised training and evaluation of such techniques is crucial.
To contribute to the state of the art in these fields (data integration and completion), we have developed methods and tools for three specific tasks: semantic labelling, property matching, and the evaluation of knowledge graph edge completion techniques. Our contributions focus on supervised data engineering, which is of particular relevance given recent developments in the field of machine learning. Our evaluation shows that our methods achieve results that are significantly better than those of the studied baselines, thanks to the use of novel groups of features that could be integrated into existing techniques. These results are presented in detail in the publications that resulted from our research.