# Contributions to survival analysis with applications to biomedicine

- Rodríguez Girondo, Mar

- Jacobo de Uña Álvarez Co-director
- Carmen María Cadarso Suárez Co-director

Defending university: Universidade de Santiago de Compostela

Defense date: 28 June 2013

- Guadalupe Gómez Melis Chair
- César Andrés Sánchez Sellero Secretary
- Luís Meira Machado Committee member
- Javier Roca Pardiñas Committee member
- Francisco Gude Sampedro Committee member

Type: Thesis

## Abstract

In this dissertation, we present several methodological contributions to the statistical field known as survival analysis and discuss their application to real biomedical datasets. Survival analysis is the study of time-to-event data, i.e., the analysis of the time elapsed from a starting point until the occurrence of a given event of interest. Although it is also used in other fields such as economics, biology and industry, this dissertation focuses on survival data from biomedical studies. In this context, the event of interest may be death or the onset of a given disease. More generally, the event can be regarded as a transition between two states, such as different severity phases of a disease or the recurrence of a tumor. From this perspective, death is the transition from the state "alive" to the state "dead".

The distinguishing feature of survival analysis is that the response variable is a time, called the lifetime, survival time or waiting time and denoted by T. Given the dynamic nature of survival data, censoring is a very important phenomenon. Right censoring occurs when, for some of the units under study, the event of interest has not been observed by the time the data are evaluated, so the exact waiting time is unknown; we know only that it exceeds a certain time. Right censoring can occur because the event has not yet happened, but also because of loss to follow-up or protocol restrictions related to the aim of the study (e.g. mortality before a predefined age of interest). Other types of censoring are left censoring (all that is known is that the lifetime is lower than a given value) and interval censoring (all that is known is that the lifetime lies within a given interval). Hence, under censoring some units carry incomplete (but still important) information that needs to be incorporated into the analyses.
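As a minimal sketch of how right-censored observations are usually encoded and used, the following Python snippet builds the observed pairs (min(T, C), event indicator) and computes the standard Kaplan-Meier estimate of the survival function. The function names and the toy numbers are illustrative assumptions and are not taken from the dissertation.

```python
from typing import List, Sequence, Tuple


def right_censor(lifetimes: Sequence[float],
                 censor_times: Sequence[float]) -> List[Tuple[float, int]]:
    """Return (observed time, event indicator) pairs.

    The observed time is min(T, C); the indicator is 1 when the event
    was seen (T <= C) and 0 when the unit was right censored.
    """
    return [(min(t, c), 1 if t <= c else 0)
            for t, c in zip(lifetimes, censor_times)]


def kaplan_meier(data: Sequence[Tuple[float, int]]) -> List[Tuple[float, float]]:
    """Kaplan-Meier estimate of S(t) at each distinct observed event time."""
    event_times = sorted({y for y, d in data if d == 1})
    surv = 1.0
    curve = []
    for t in event_times:
        # Censored units still count as "at risk" up to their censoring time,
        # which is how their partial information enters the estimate.
        at_risk = sum(1 for y, _ in data if y >= t)
        events = sum(1 for y, d in data if y == t and d == 1)
        surv *= 1.0 - events / at_risk
        curve.append((t, surv))
    return curve


# Hypothetical toy data: true lifetimes and potential censoring times.
observed = right_censor([2, 5, 3, 8], [4, 4, 6, 6])
curve = kaplan_meier(observed)
```

Discarding the two censored units here would leave only the two shortest lifetimes and understate survival, which is why the censored units are kept in the risk sets.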
Throughout this dissertation, we consider random right censoring. This is the weakest assumption, in the sense that the censoring of an observation should not provide any information about the survival of that particular unit beyond the censoring time. In other words, all we know about an observation censored at duration t is that the lifetime exceeds t.

### Multi-state models

Multi-state models are typically used for modeling disease progression when several intermediate events of interest are observed during follow-up. These models extend the traditional survival analysis reviewed above and make it possible to account for complex individual histories that may influence the prognosis. In biomedical applications, the states might be based on clinical symptoms (e.g. bleeding episodes), biological markers (e.g. CD4 T-lymphocyte cell counts), some scale of the disease (e.g. stages of cancer or HIV infection), or a non-fatal complication in the course of the illness (e.g. cancer recurrence). Mathematically, a multi-state model is a stochastic process in continuous time that allows individuals to move among a finite number of states. Furthermore, in line with the censoring assumption above, we assume that the trajectories of individuals can be right censored by a potential censoring time that is independent of the process. Traditional survival analysis corresponds to the simplest multi-state model, the mortality model, in which, as noted above, only two states are considered: an initial state ("alive") and a final absorbing state ("dead"). By splitting the "alive" state into two transient states that are visited successively (with no chance of coming back), we obtain the three-state progressive model. This model is convenient when there is an intermediate event (e.g. a recurrence) that may influence the survival prognosis.
A typical situation in which a k-state progressive model is useful is the analysis of recurrent event data, which arise when each individual may experience a well-defined event several times over the course of his or her history. The inter-event times are then referred to as gap times, and they are determined by the times at which the recurrences take place (i.e. the recurrence times). In the three-state progressive model, interest focuses on a given pair of successive gap times. In this dissertation, we are especially concerned with the illness-death model, a generalization of the three-state progressive model that allows a direct transition from the initial state 1 to the final state 3, so that it involves three states and three possible transitions among them: 1 to 2, 2 to 3 and 1 to 3. Many applications of the illness-death model can be found in the biomedical literature.

### The Markov condition in multi-state models

Denoting by Z the sojourn time in state 1 and by T the total survival time, the illness-death model (and the three-state progressive model, as a particular case) is characterized by the joint distribution of (Z, T). Usually, it is assumed that the stochastic process under investigation satisfies the Markov condition, and hence a Markovian multi-state model is fitted to the data at hand. The Markov assumption states that, given the present state, the future evolution of the illness is independent of the states previously visited and of the transition times among them. One of the main arguments for checking the Markov condition is that the usual estimators (e.g. the Aalen-Johansen estimator of the transition probabilities) may be systematically biased in non-Markovian situations. On the other hand, although non-Markovian estimators for transition probabilities and related curves are available, it has been noted that including the Markov information in the construction of the estimators allows for variance reduction.
So, in practice, one will be interested in testing Markovianity. New methods for testing the Markov condition in the three-state progressive model and the illness-death model are explored in Chapters 2 and 3. The general idea of the new methods is to determine, without assuming any predefined structure, the degree of dependence between the past (Z) and the future (T) given the present. For this purpose, Kendall's tau is used as a local measure of association.

### Survival analysis including covariates

So far, we have considered independent and identically distributed data. However, the main goal in many applications is to establish the influence of a set of covariates on survival, for example to build prognostic models that complement physician judgment in clinical decision making. In general, regression models for the hazard function have been the standard tools for this problem. A classical tool for studying the effect of covariates on continuous (potentially right-censored) survival times is the Cox proportional hazards model. In the traditional Cox framework, the baseline hazard remains unspecified and inferences are based on a partial likelihood that does not depend on the baseline hazard function. Using the Cox model to evaluate the impact of a set of covariates on the hazard function implies two important assumptions: the effects of the covariates do not vary over time (proportional hazards assumption), and the covariates act linearly on the logarithm of the hazard ratio (HR). However, continuous covariates may require nonlinear modeling, so that a purely linear predictor is not sufficient to capture complex association structures, and covariates may affect survival in a time-varying manner. Hence, in many applications the Cox model specification is not flexible enough to model correctly the variables affecting survival.
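For reference, the Cox model and its two assumptions can be written compactly as follows. The symbols h_0 (baseline hazard), x (covariate vector) and β (coefficient vector) are the standard textbook notation and are assumed here; they do not appear in this abstract.

```latex
\[
h(t \mid x) = h_0(t)\,\exp\!\left(x^{\top}\beta\right),
\qquad
\log\frac{h(t \mid x)}{h_0(t)} = x^{\top}\beta,
\qquad
\mathrm{HR}_j = \exp(\beta_j).
\]
```

The log hazard ratio on the left of the second identity does not depend on t, which is the proportional hazards assumption, and x enters only through the linear combination x^T β, which is the linearity assumption. A time-varying effect would replace β_j by a function β_j(t), and a nonlinear effect would replace x_j β_j by a smooth function f_j(x_j).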
The modeling of the hazard rate of the survival time in terms of a set of covariates can be extended by allowing the effects of covariates to vary over time and by assuming more complex, nonlinear relationships between covariates and survival. A variety of flexible methods for hazard regression have been developed in recent years. In this dissertation, we consider a general smooth specification of the hazard function with an additive predictor that includes traditional linear effects of categorical covariates, smooth effects of continuous covariates, and smooth time-varying effects of both categorical and continuous covariates. Furthermore, the baseline hazard rate is reparametrized and included in the linear predictor. With this representation, the baseline hazard is modeled as a smooth effect, like that of any other continuous covariate. This is of practical interest, and it allows the full likelihood to be used for model estimation.

### Model building in hazard regression

Flexible survival regression models (accommodating both smooth and time-varying effects) are immediately appealing in terms of flexibility, but they typically introduce additional difficulties when a subset of covariates and the corresponding modeling alternatives have to be chosen, i.e., when building the most suitable model for a given dataset. This is particularly true when potentially time-varying associations are present. In this dissertation, we address the problem of model building in the survival regression context. This is an important methodological issue that is relevant in many biomedical applications. The increasing availability of potentially relevant information on patients, and the complex relations among the available variables, make it difficult for practitioners to determine which combinations of predictors are best for achieving a reasonable risk stratification of patients.
When developing a prognostic model, we are interested in answering questions such as:

- Should a covariate enter the specified hazard model, or is it unimportant and can be ignored?
- Should a continuous covariate enter the model as a purely linear effect, or is a smooth effect indeed required?
- Is the effect of a covariate constant over the follow-up time, or does it vary over time?

In Chapter 4, we explore model building strategies for flexible survival regression based on a Poisson likelihood estimation scheme derived from a piecewise constant representation of the original follow-up time. This makes it possible to adapt model building methods proposed in the framework of generalized additive regression models (GAMs) to hazard regression. Specifically, we compare three approaches to model building that originate from different conceptual fields:

- a stepwise multi-stage approach based on Akaike's information criterion (AIC);
- a shrinkage method for generalized additive regression models based on penalized maximum likelihood estimation;
- a Poisson-likelihood boosting algorithm.

### Biomedical applications

The methodology proposed in this research is illustrated by applying it to real biomedical datasets. To illustrate our methods for testing the Markov condition in the three-state progressive and illness-death models, we have used public and widely used medical databases. A three-state progressive model for the first and second recurrence times of bladder cancer is analyzed using the data from the Veterans Administration Cooperative Urological Research Group. Moreover, we study the Markov condition in the illness-death framework using two previously published datasets. Firstly, we propose an illness-death model for analyzing the influence of the time to chronic graft-versus-host disease on the evolution of patients with leukemia after a bone marrow transplant.
Secondly, we consider the relationship between the time to the first wound excision and the time to Staphylococcus aureus infection in burn patients.

A second major topic of this dissertation is the study of model building techniques in the field of survival analysis. This is illustrated using a local dataset from cardiology. Specifically, we address the construction of a prognostic model for patients who have suffered a myocardial infarction. Myocardial infarction (MI) is a disease for which more precise estimates of risk and prognosis are desirable, as it can result in serious and fatal outcomes. Moreover, efficacious therapies for this disease have been developed during the last two decades. This further amplifies the need for prognostic prediction on which to form an understanding of future expectations and to base therapeutic and other management decisions, so as to reduce the associated short- and long-term morbidity and mortality. With this application, we aim to investigate which are the most relevant predictors of survival after myocardial infarction, and their optimal degree of complexity, when a large number of predictors are considered simultaneously. To this end, we use a dataset of n = 3,027 patients from the Cardiology department of the Complexo Hospitalario Universitario de Santiago de Compostela (Galicia, Spain) with a final diagnosis of myocardial infarction.

### Layout of contents

This dissertation can be regarded as divided into two main parts. The first deals with new methods for testing the Markov condition in multi-state models with three states, specifically the three-state progressive model and the illness-death model. The proposed goodness-of-fit tests are based on Kendall's tau as a measure of past-future association. The second part proposes methods for automatic model building in flexible survival models using a piecewise exponential representation of the data.
In both cases, the methods are illustrated using the biomedical datasets presented above.

Specifically, Chapters 2 and 3 are devoted to the study of the Markov condition in the three-state progressive and illness-death models. The three-state progressive model is regarded as a particular case of the illness-death model in which no direct transitions between states 1 and 3 are observed, and both models are fully characterized by the joint distribution of the sojourn time in state 1 (Z) and the total survival time (T). The new methods are based on measuring the association between the future (T) and the past (Z) over time. In Chapter 2, we propose two different but closely related tests for zero future-past association at each time point of a predefined grid of t values. Accordingly, a significance trace is proposed. Both methods are based on a local version of Kendall's tau. Firstly, we consider an estimator adapted to the possibility of censoring. The properties of this new estimator are investigated theoretically (consistency and asymptotic normality) and through simulations. Secondly, we consider a different test statistic, defined as the (estimated) Kendall's tau between the censored versions of Z and T. In practice, this approach amounts to ignoring the censoring indicators of the sample. Since it is based on observable variables, ordinary estimators can be used. A local bootstrap resampling plan is proposed to approximate the null distribution of the proposed test statistics. The finite sample performance of these two alternative tests is investigated through simulations and illustrated through the analysis of three real datasets: recurrence of bladder cancer, the evolution of leukemia patients after a bone marrow transplant, and infection in burn patients. In Chapter 3, a supremum-type test based on the local Kendall's tau between the censored versions of Z and T is proposed to test Markovianity in a global way.
The weak convergence of the underlying test statistic is established, and a global bootstrap approximation is proposed. The finite sample performance of the test is studied through simulations. The new method is illustrated by reanalyzing the medical datasets used in Chapter 2, namely the relationship between the times to first and second recurrence in bladder cancer patients, the impact of chronic graft-versus-host disease on the evolution of patients with leukemia after a bone marrow transplant, and the relationship between the time to the first wound excision and the time to Staphylococcus aureus infection in burn patients.

Model building strategies for flexible survival models, with special attention to the detection of possible time-dependent associations, are studied in Chapter 4. We propose a piecewise exponential representation of the original survival data that links hazard regression with estimation schemes based on the Poisson likelihood, so that recent advances in model building for exponential family regression also become accessible in the non-proportional hazards regression context. A two-stage stepwise selection approach, a method based on doubly penalized likelihood, and a componentwise functional gradient descent approach are adapted to the survival context via data augmentation. The three approaches are compared by means of an intensive simulation study. An application to prognosis after discharge for patients who suffered a myocardial infarction complements the simulations and demonstrates the pros and cons of the approaches in real data analyses.

Chapter 5 contains a detailed description of the software developed to implement the methods proposed in this dissertation. Specifically, we present several R functions that enable the application of the methods proposed in Chapters 2, 3 and 4. We conclude with some final remarks and possible directions for future research in Chapter 6.
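The data augmentation behind the piecewise exponential representation mentioned for Chapter 4 can be sketched as follows. This is a minimal illustration written in Python for brevity (the dissertation's own software is in R); the function names, cut points and toy data are hypothetical, and the sketch covers only the covariate-free case, where the Poisson maximum likelihood estimate of each interval hazard reduces to events divided by exposure.

```python
from typing import List, Sequence, Tuple


def split_follow_up(time: float, event: int,
                    cuts: Sequence[float]) -> List[Tuple[int, float, int]]:
    """Split one subject's follow-up into (interval, exposure, event) rows.

    `cuts` are increasing interval end points starting after 0; the sketch
    assumes the last cut is at least as large as every observed time.
    """
    rows = []
    start = 0.0
    for j, end in enumerate(cuts):
        if time <= start:
            break  # follow-up ended before this interval began
        exposure = min(time, end) - start
        # The event is attributed to the interval containing the observed time.
        died_here = 1 if (event == 1 and time <= end) else 0
        rows.append((j, exposure, died_here))
        start = end
    return rows


def piecewise_hazards(data: Sequence[Tuple[float, int]],
                      cuts: Sequence[float]) -> List[float]:
    """Interval-specific hazard estimates: total events / total exposure.

    This equals the MLE of a Poisson model for the event counts with one
    intercept per interval and log(exposure) as an offset.
    """
    events = [0] * len(cuts)
    exposure = [0.0] * len(cuts)
    for time, event in data:
        for j, expo, d in split_follow_up(time, event, cuts):
            events[j] += d
            exposure[j] += expo
    return [e / x if x > 0 else float("nan") for e, x in zip(events, exposure)]


# Hypothetical toy data: (observed time, event indicator) pairs.
toy = [(2.5, 1), (1.0, 0), (3.0, 1)]
cuts = [1.0, 2.0, 3.0]
rows = split_follow_up(2.5, 1, cuts)
hazards = piecewise_hazards(toy, cuts)
```

Once the data are in this long format, covariates and interval-specific effects can be added to the Poisson predictor, which is what allows model building tools developed for generalized additive models to be reused for hazard regression.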