On the theory and practice of variable selection for functional data

Author:
  1. Torrecilla Noguerales, José Luis
Supervised by:
  1. José Ramón Berrendero Díaz, co-supervisor
  2. Antonio Cuevas González, co-supervisor

University of defence: Universidad Autónoma de Madrid

Date of defence: 17 December 2015

Examination committee:
  1. Manuel Febrero Bande, chair
  2. Amparo Baíllo Moreno, secretary
  3. Gérard Biau, member

Type: Doctoral thesis

Abstract

Functional Data Analysis (FDA) can be seen as one aspect of the modern mainstream paradigm generally known as Big Data analysis. The study of functional data requires new methodologies that take into account their special features (e.g. infinite dimension and a high level of redundancy). Hence the use of variable selection methods appears as a particularly appealing choice in this context. Throughout this work, variable selection is considered in the setting of supervised binary classification with functional data $\{X(t),\ t\in[0,1]\}$. By variable selection we mean any dimension-reduction method that replaces the whole trajectory $\{X(t),\ t\in[0,1]\}$ with a low-dimensional vector $(X(t_1),\ldots,X(t_d))$ while still keeping a similar classification error. In this thesis we address ``functional variable selection'' in classification problems from both theoretical and empirical perspectives. We first restrict ourselves to the standard situation in which the functional data are generated from Gaussian processes, with distributions $P_0$ and $P_1$ in the two populations under study. The classical Hájek-Feldman dichotomy establishes that $P_0$ and $P_1$ are either mutually absolutely continuous (so that there is a Radon-Nikodym (RN) density of each measure with respect to the other) or mutually singular. Unlike the case of finite-dimensional Gaussian measures, there are non-trivial examples of mutually singular distributions when dealing with Gaussian stochastic processes. This work provides explicit expressions for the optimal (Bayes) rule in several relevant problems of supervised binary (functional) classification in the absolutely continuous case. Our approach relies on some classical results in the theory of stochastic processes, in which the so-called Reproducing Kernel Hilbert Spaces (RKHS) play a special role.
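The variable-selection pipeline described above can be made concrete with a short sketch: each trajectory is replaced by its values at a few points $(t_1,\ldots,t_d)$, and a classical linear Fisher rule is trained on the resulting low-dimensional vectors. The toy model, the chosen points and the sample sizes below are illustrative assumptions of ours, not the thesis's own examples:

```python
import numpy as np

def fisher_rule(X0, X1):
    """Fit Fisher's linear rule on two training samples of selected variables.

    X0, X1: arrays of shape (n_i, d) holding (X(t_1), ..., X(t_d)) per class.
    Returns a classifier mapping new rows Z to labels in {0, 1}.
    """
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    # Pooled covariance of the two classes.
    S = np.atleast_2d((np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False)) / 2)
    w = np.linalg.solve(S, mu1 - mu0)   # discriminant direction
    c = w @ (mu0 + mu1) / 2             # midpoint threshold
    return lambda Z: (Z @ w > c).astype(int)

# Toy functional model: class 0 has zero mean, class 1 drifts upward.
rng = np.random.default_rng(0)
t = np.linspace(0, 1, 50)
X0 = 0.5 * rng.standard_normal((200, t.size))
X1 = 1.5 * t + 0.5 * rng.standard_normal((200, t.size))

# Variable selection: keep only X(0.5) and X(1.0) instead of the whole curve.
sel = [25, 49]
classify = fisher_rule(X0[:, sel], X1[:, sel])
```

In this toy model the two selected values already separate the classes well, so the finite-dimensional Fisher rule classifies most fresh curves correctly.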
This RKHS framework also allows us to give an interpretation, in terms of mutual singularity, of the ``near perfect classification'' phenomenon described by Delaigle and Hall (2012). We show that the asymptotically optimal rule proposed by these authors can be identified with the sequence of optimal rules for an approximating sequence of classification problems in the absolutely continuous case. The methodological contributions of this thesis are centred on three variable selection methods. The obvious general criterion for variable selection is to choose the ``most representative'' or ``most relevant'' variables. However, it is also clear that a purely relevance-oriented criterion could lead to selecting many redundant variables. First, we provide a new model-based method for variable selection in binary classification problems, which arises in a very natural way from the explicit knowledge of the RN derivatives and the underlying RKHS structure. As a consequence, the optimal classifier in a wide class of functional classification problems can be expressed in terms of a classical, linear finite-dimensional Fisher rule. Our second proposal for variable selection is based on the idea of selecting the local maxima $(t_1,\ldots,t_d)$ of the function ${\mathcal V}_X^2(t)={\mathcal V}^2(X(t),Y)$, where ${\mathcal V}^2$ denotes the \textit{distance covariance} association measure for random variables due to Székely, Rizzo and Bakirov (2007). This method provides a simple, natural way to deal with the relevance vs. redundancy trade-off that typically appears in variable selection. The proposal is backed by a result of consistent estimation for the maxima of ${\mathcal V}_X^2$. We also exhibit different models for the underlying process $X(t)$ under which the relevant information is concentrated on the maxima of ${\mathcal V}_X^2$. Our third proposal for variable selection is a new version of the \textit{minimum Redundancy Maximum Relevance} (mRMR) procedure proposed by Ding and Peng (2005) and Peng et al. (2005).
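The second proposal can be sketched in a few lines: compute the squared sample distance covariance ${\mathcal V}^2(X(t_j),Y)$ on a grid of points and keep the interior local maxima. The estimator below is the standard doubly-centred sample statistic; the simulated model and all tuning choices are illustrative assumptions, not taken from the thesis:

```python
import numpy as np

def dcov2(x, y):
    """Squared sample distance covariance of Székely, Rizzo and Bakirov (2007)."""
    def centered(z):
        z = np.asarray(z, dtype=float).reshape(len(z), -1)
        d = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1))
        # Double centering: subtract row/column means, add back the grand mean.
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()
    return (centered(x) * centered(y)).mean()

def select_local_maxima(X, y):
    """Indices j where j -> V^2(X(t_j), Y) has a strict interior local maximum."""
    v = np.array([dcov2(X[:, j], y) for j in range(X.shape[1])])
    idx = [j for j in range(1, len(v) - 1) if v[j] > v[j - 1] and v[j] > v[j + 1]]
    return idx, v

# Toy model: the discriminating information sits in a bump around t = 0.5.
rng = np.random.default_rng(1)
t = np.linspace(0, 1, 21)
y = rng.integers(0, 2, size=60).astype(float)
X = y[:, None] * np.exp(-((t - 0.5) ** 2) / 0.01) \
    + 0.1 * rng.standard_normal((60, t.size))
idx, v = select_local_maxima(X, y)
```

In this toy model the curve $j \mapsto {\mathcal V}^2(X(t_j),Y)$ peaks near $t = 0.5$, so the selected points concentrate on the bump, where the relevant information lies.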
It is an algorithm to perform variable selection systematically, achieving a reasonable trade-off between relevance and redundancy. In its original form, the procedure is based on the so-called \textit{mutual information} criterion to assess relevance and redundancy. Keeping the focus on functional data problems, we propose a modified version of the mRMR method, obtained by replacing mutual information with the new \textit{distance correlation} measure in the general implementation of the method. The performance of the new proposals is assessed through an extensive empirical study, including 400 simulated settings (100 functional models $\times$ 4 sample sizes) and real data examples, aimed at comparing our variable selection methods with other standard procedures for dimension reduction. The comparison involves different classifiers. A real problem with biomedical data is also analysed in collaboration with researchers of Hospital Vall d'Hebron (Barcelona). The overall conclusions of the empirical experiments are clearly favourable to the proposed methodologies.
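A minimal sketch of a distance-correlation variant of mRMR might look as follows. The greedy difference score (relevance to $Y$ minus mean redundancy with the already-selected variables) follows the usual mRMR scheme; the data and every parameter choice below are illustrative assumptions, not the thesis implementation:

```python
import numpy as np

def _dcov2(x, y):
    """Squared sample distance covariance (Székely, Rizzo and Bakirov, 2007)."""
    def centered(z):
        z = np.asarray(z, dtype=float).reshape(len(z), -1)
        d = np.sqrt(((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1))
        return d - d.mean(axis=0) - d.mean(axis=1)[:, None] + d.mean()
    return (centered(x) * centered(y)).mean()

def dcor(x, y):
    """Sample distance correlation, in [0, 1]."""
    vx, vy = _dcov2(x, x), _dcov2(y, y)
    if vx <= 0 or vy <= 0:
        return 0.0
    return np.sqrt(max(_dcov2(x, y), 0.0) / np.sqrt(vx * vy))

def mrmr_dcor(X, y, k):
    """Greedy mRMR: at each step pick the variable maximising relevance to y
    minus mean redundancy with the selected set, both via distance correlation."""
    p = X.shape[1]
    rel = np.array([dcor(X[:, j], y) for j in range(p)])
    selected = [int(np.argmax(rel))]
    while len(selected) < k:
        rest = [j for j in range(p) if j not in selected]
        scores = [rel[j] - np.mean([dcor(X[:, j], X[:, s]) for s in selected])
                  for j in rest]
        selected.append(rest[int(np.argmax(scores))])
    return selected
```

With a duplicated informative variable in the candidate set, the redundancy penalty makes the algorithm skip the copy in favour of a complementary variable, which is exactly the relevance vs. redundancy trade-off the procedure is designed to handle.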