Soft computing based learning and data analysis: missing values and data complexity

Luengo Martín, Julián

Supervised by:

Francisco Herrera Triguero Director

Defending university: Universidad de Granada

Defense date: 20 January 2011

Examination committee:

Antonio González Muñoz Chair
Jorge Casillas Barranquero Secretary
Alberto José Bugarín Diz Member
Jaume Bacardit Peñarroya Member
Luciano Sánchez Ramos Member

Type: Thesis

Teseo: 301755

Abstract

We have addressed different problems and challenges in the areas of Missing Values (MVs) and data complexity, also considering the imbalanced data framework. In particular, we have focused on characterizing the performance of classification algorithms and on analyzing the effect of imbalanced data on data complexity. The problems tackled are independent: MV imputation, performance characterization by means of data complexity, and the evaluation of preprocessing for imbalanced data sets by means of data complexity; they have been presented as such in their respective sections. We have not posed a common study in this thesis, so it is left as future work. Because the works are not connected, we have not described a relation between the different parts. The present section briefly summarizes the results obtained and points out the conclusions of this thesis.
We have studied the impact of MVs on different classification algorithms and the suitability of imputation methods for overcoming the associated problems. We aim to indicate the best approach or approaches in each case for a wide family of classification algorithms, so that their behavior can be improved through a well-founded selection of the imputation method based on the type of classifier. We have used a wide range of well-known classification algorithms from the literature, as many as possible, together with a large selection of imputation methods and real-world data sets with natural MVs. The special case of Fuzzy Rule Based Classification Systems (FRBCSs) is also studied: we analyze how the type of FRBCS influences the selection of the best imputation scheme, and how the two are related. The performance of individual FRBCSs is also studied based on the characteristics of the data.
We have related regions of the data complexity space to the good or bad performance of the FRBCS, creating rule sets both in an ad hoc and in an automatic way. These rule sets can be summarized into two rules, which describe the regions of good and bad behavior of the FRBCS. We have observed particularities in the rules obtained for each FRBCS, but also many common regions. These common regions have been exploited using a family of classification algorithms known for being closely related: artificial neural networks and SVMs. We have analyzed the common regions of good and bad behavior for these classifiers jointly, showing that they have related performance in the complexity space. Using the same data complexity measures, we have observed the effect of over-sampling and under-sampling approaches in the imbalanced data framework with respect to the C4.5 and PART algorithms. We have found that the use of preprocessing techniques transforms the domains of competence of the classifier, widening the complexity region of good behavior while maintaining that of bad behavior. These results illustrate to what extent the preprocessing is beneficial, as well as the subtle differences between the over-sampling and under-sampling approaches, the latter being more capable of producing better results.