Descripciones lingüísticas de fenómenos complejos: aplicaciones con big data [Linguistic descriptions of complex phenomena: applications with big data]

CONDE CLEMENTE, PATRICIA

Supervised by:

José M. Alonso Moral Director
Gracián Triviño Barros Co-director

University of defense: Universidad de Oviedo

Date of defense: 15 May 2017

Examination committee:

Alberto José Bugarín Diz Chair
Luciano Sánchez Ramos Secretary
Daniel Sánchez Fernández Member

Type: Thesis

Teseo: 475258 DIALNET

Abstract

We live in an increasingly interconnected world in which the volume and variety of data grow every day. New technologies allow us to acquire and store these vast data sets in order to extract valuable knowledge that makes our lives easier, whether at work, at home or in social settings. Accordingly, engineers face the challenge of developing systems for a huge number of potential scenarios and users. Man-machine interaction is a cornerstone in this context. Natural language is regarded as one of the most effective means of human communication, so systems that support man-machine interaction in natural language can be highly valued. This has drawn the attention of data scientists to automatic text generation, which arose with the aim of producing more human-friendly reports. It relies on computational systems that automatically process data in order to generate understandable information in natural language. Such linguistic reports can be seen as a complement to other forms of knowledge representation, since they reduce the effort of interpreting tables and graphs. The two research lines for automatic text generation from numerical and symbolic data are Natural Language Generation, in its so-called data-to-text applications, and Linguistic Description of Data. The latter is supported by the Computational Theory of Perceptions, introduced by Zadeh in 1999. This theory provides a framework for developing computational systems able to compute with the meaning of natural language expressions, i.e., to work with imprecise descriptions of the world in a way similar to how humans do.
The main goal of this dissertation is to make significant advances in the research line on the automatic generation of Linguistic Descriptions of Data. Namely, we focus on the linguistic modeling of complex phenomena, that is, on developing computational systems that use natural language to describe the data collected from the phenomena under study. This is the third thesis developed within this research line, which was born in the Computing with Perceptions Research Unit of the European Centre for Soft Computing. It builds on the results of the previous dissertations published by the same research group, and it contributes the theoretical definition and practical implementation of novel concepts that constitute a significant advance for the underlying research line. The main contributions are: characterizing and measuring the reliability of data; defining new types of computational perception; customizing linguistic reports according to the needs of each specific user; and enabling real-time interaction with the user through linguistic commands. The dissertation also addresses technical issues such as handling Big Data in real-world problems and the development of a novel open-source library that eases the practical implementation of this type of computational system.
The dissertation presents several illustrative experiments on natural language generation, namely linguistic reports about (1) the speed of buses in an urban area; (2) deforestation in the Amazon region; (3) the perception of comfort in a room; (4) energy consumption at home; (5) the USA census; and (6) man-machine interaction in a mobile application that helps blind users frame themselves in a profile photo.