# Survival analysis

Theme Co-ordinators: Bernard Rachet, Bianca De Stavola, Ulla Sovio, Ula Nur, David Cox (University of Oxford)

## Background

Survival analysis is at the core of any study of time to a particular event, such as death, infection, or diagnosis of a particular cancer. It is therefore fundamental to most epidemiological cohort studies, as well as many randomised controlled trials (RCTs).

An important issue in survival analysis is the choice of **time scale**: this could be, for example, time since entry into the study (or since first treatment in an RCT), time since a particular event (e.g. the Japanese tsunami), or time since birth (i.e. age). The last of these is particularly relevant for epidemiological studies of chronic diseases, where age often exerts a substantial confounding effect (see [1], Chapter 6, for a discussion of alternative time scales).

Usually not all participants are followed up until they experience the event of interest, leading to their times being ‘**censored**’. In this case, the available information consists only of a lower bound for their actual event time. It is typically assumed that the process giving rise to censoring is independent of the process determining time to the event of interest.

In contrast to most regression approaches (which typically involve modelling means of distributions given explanatory variables), many survival analysis models are defined in terms of the **hazard** (or rate) of the event of interest. Within this framework, the hazard is expressed as a function of explanatory variables and an underlying ‘baseline’ hazard. Fully parametric models assume a particular form for the baseline hazard, the simplest being that it is constant over time (**Poisson** regression). **Cox’s proportional hazards model**, perhaps the most popular model for survival data, makes no parametric assumptions about the baseline hazard. Both the Poisson and Cox regression models assume the hazards to be **proportional** for individuals with different values of the explanatory variables. This assumption can be relaxed, for example through use of **Aalen’s** additive hazard model.
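As a minimal sketch of the simplest case mentioned above (a constant hazard with censored follow-up), the maximum likelihood estimate of the rate is the number of observed events divided by the total person-time, and the ratio of two such rates is the hazard ratio under proportional hazards. The data and the helper name `constant_hazard` are hypothetical and introduced only for illustration.

```python
def constant_hazard(times, events):
    """Estimate a constant hazard as events / person-time.

    times  -- follow-up time per participant (event time or censoring time)
    events -- 1 if the event was observed, 0 if the time is censored
    """
    person_time = sum(times)   # censored participants still contribute time at risk
    n_events = sum(events)     # but only observed events count in the numerator
    return n_events / person_time


# Hypothetical follow-up data (in years) for two exposure groups.
unexposed_times = [5.0, 3.0, 4.0, 2.0, 6.0]    # 20 person-years
unexposed_events = [0, 1, 0, 0, 1]             # 2 events
exposed_times = [2.0, 1.0, 3.0, 1.5, 2.5]      # 10 person-years
exposed_events = [1, 1, 0, 1, 1]               # 4 events

rate_unexposed = constant_hazard(unexposed_times, unexposed_events)
rate_exposed = constant_hazard(exposed_times, exposed_events)

# Under proportional (here constant) hazards, the rate ratio is the hazard ratio.
hazard_ratio = rate_exposed / rate_unexposed
print(rate_unexposed, rate_exposed, hazard_ratio)
```

Censored participants contribute their follow-up time to the denominator but no event to the numerator, which is how the lower-bound information is used.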

Generalizations to deal with repeated episodes of an event of interest, such as infection, are possible through the introduction of random effects that capture the correlation among events that occur to the same individual. Within the survival analysis literature these are referred to as **frailty models**. → Design and analysis for dependent data

An alternative approach to modelling survival data, more in keeping with most regression techniques, involves modelling the (logarithmically transformed) survival times directly. These are expressed as a linear function of explanatory variables and an error term, with the choice of distribution for the error term leading to the family of **accelerated failure time models**. When the survival times are assumed to follow an exponential distribution, the accelerated failure time model is equivalent to a Poisson (constant hazard) regression model.
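In symbols, the accelerated failure time formulation can be sketched as follows. The notation ($T_i$, $x_i$, $\beta$, $\sigma$, $\varepsilon_i$) is introduced here for illustration only.

$$
\log T_i = \beta_0 + \beta^\top x_i + \sigma\,\varepsilon_i
$$

The exponential special case arises when $\sigma = 1$ and $\varepsilon_i$ follows a standard (minimum) extreme-value distribution. Then $T_i$ is exponentially distributed with constant hazard

$$
\lambda_i = \exp\{-(\beta_0 + \beta^\top x_i)\},
$$

so each accelerated failure time coefficient is minus the corresponding log hazard ratio from the constant-hazard model.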

Most of our applications of survival analysis models involve various flavours of the models mentioned above. However, specific issues arise in certain contexts and are of interest to our group. These are discussed below.

## Areas of current interest

### Competing events

Censoring may occur for several reasons. A particular setting where censoring is not independent of the process governing the event of interest arises when there are competing events. Competing events are events that remove the individual from being at risk of the event of interest; in other words, they preclude its occurrence. This happens, for example, if we study lung cancer mortality while individuals may die of other causes. Termination of follow-up because of death from another cause is not the same as loss to follow-up, because loss to follow-up does not prevent the event of interest from occurring after the censoring time.

The issues and methods arising in the analysis of competing events have been discussed in the biostatistical literature since the 1980s (for a review see [2]) but have not really filtered into epidemiological practice, with the notable exception of applications to **AIDS research** [3]. They are only marginally discussed in the RCT literature, where the problem is usually dealt with by creating **composite events**. → Analysis of clinical trials including prognostic models

There are two main possible approaches to the analysis of data affected by competing events:

a) Carrying out a so-called ‘**cause-specific**’ analysis, that is, adopting traditional survival analysis methods in which competing events are treated as censoring events. Note, however, that ‘cause-specific’ in this context is a misnomer, since the estimated effect depends on the rates generating all the other events (see [1], page 66). The main issue with this approach is one of interpretation, as all estimated effects are conditional on not having experienced a competing event.

b) Adopting a different focus, that is, modelling the **cumulative incidence** of the event of interest as opposed to its hazard (or rate). This approach was first proposed by Fine and Gray [4] but belongs to the broader family of **inverse probability weighting (IPW) estimators** (e.g. [5]) that has also been proposed in other contexts, notably to deal with informative missingness and selection bias [6-7]. → Causal inference, Missing data
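The difference between the two approaches can be illustrated with a small sketch. It computes the standard nonparametric (Aalen-Johansen type) cumulative incidence of the event of interest, and compares it with one minus the Kaplan-Meier estimate obtained when competing events are treated as censoring. This is a descriptive estimator, not the Fine and Gray regression model, and the data, cause coding and function name are hypothetical.

```python
def cumulative_incidence(times, causes, cause=1):
    """Cumulative incidence of `cause` at the end of follow-up.

    times  -- follow-up times
    causes -- 0 = censored, 1 = event of interest, 2 = competing event
    Returns (cif, naive), where `naive` is 1 - Kaplan-Meier with
    competing events treated as censoring.
    """
    data = sorted(zip(times, causes))
    n = len(data)
    overall_surv = 1.0   # all-cause Kaplan-Meier just before the current time
    km_cause = 1.0       # Kaplan-Meier treating competing events as censored
    cif = 0.0
    i = 0
    while i < n:
        t = data[i][0]
        at_risk = n - i
        tied = [c for (u, c) in data[i:] if u == t]   # all records at time t
        d_cause = sum(1 for c in tied if c == cause)
        d_any = sum(1 for c in tied if c != 0)
        # Probability of being event-free so far, then failing from `cause` now.
        cif += overall_surv * d_cause / at_risk
        km_cause *= 1.0 - d_cause / at_risk
        overall_surv *= 1.0 - d_any / at_risk
        i += len(tied)
    return cif, 1.0 - km_cause


# Hypothetical data: six individuals, one censored at the end of follow-up.
times = [1, 2, 3, 4, 5, 6]
causes = [1, 2, 1, 2, 1, 0]
cif, naive = cumulative_incidence(times, causes)
print(cif, naive)   # 0.5 versus 0.6875
```

In this example three of the six individuals experience the event of interest, so the cumulative incidence is 0.5, whereas the naive estimate is larger because it implicitly assumes that those removed by the competing event could still have gone on to experience the event of interest.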

### Net survival

Information on cancer survival is essential for cancer control and has important implications for cancer policy. The primary indicator of interest is **net survival**, a hypothetical measure defined as the survival that would be observed if patients were subject only to mortality from the disease of interest, with the mortality rate from that disease remaining as it is in the presence of **competing events**, which is the only situation that can actually be observed.

Two approaches attempt to estimate net survival: **cause-specific survival** and **relative survival**. Relative survival [8] is the standard approach to estimating population-based cancer survival when the actual cause of death is not accurately known. Although widely used in the cancer field, it can be applied to any disease at population level. Relative survival was originally defined as the ratio of the observed survival probability of the cancer patients to the survival probability that would have been expected if the patients had experienced the same mortality as the general population (background mortality) with similar demographic characteristics, e.g. age, sex and calendar year. Background mortality is derived from life tables stratified at least by age, sex and calendar time.
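The ratio definition can be illustrated with a minimal sketch for a single demographic stratum. All numbers are hypothetical: the annual death probabilities stand in for life-table values, and the function names are introduced only for illustration. With patients from several strata, the expected survival would have to be combined across individuals, which is not shown here.

```python
from math import prod


def expected_survival(annual_death_probs):
    """Expected survival from successive life-table death probabilities."""
    return prod(1.0 - q for q in annual_death_probs)


def relative_survival(observed, expected):
    """Ratio of observed survival to expected (background) survival."""
    if not 0.0 < expected <= 1.0:
        raise ValueError("expected survival must lie in (0, 1]")
    return observed / expected


# Hypothetical life-table death probabilities for five successive years
# in one age, sex and calendar-year stratum of the general population.
life_table_q = [0.02, 0.02, 0.03, 0.03, 0.04]

observed_5yr = 0.60   # hypothetical observed 5-year survival of the patients
expected_5yr = expected_survival(life_table_q)
relative_5yr = relative_survival(observed_5yr, expected_5yr)
print(round(expected_5yr, 4), round(relative_5yr, 4))
```

Because expected survival is below one, relative survival is always at least as high as observed survival, and the gap widens in older patients, for whom background mortality is higher.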

*Unbiased estimator of net survival*

Both approaches (cause-specific and relative survival) provide biased estimates of net survival because censoring by competing causes of death is informative, in particular with respect to age. An unbiased descriptive estimator of net survival based on the principle of inverse probability weighting has recently been proposed alongside the modelling approach (Pohar-Perme M, Stare J, Estève J. *Biometrics* 2011, in review).

*Multivariable excess hazard models*

Relative survival is the survival analogue of excess mortality. Additive regression models for relative survival express the hazard at time t since diagnosis of cancer as the sum of the expected (background) hazard of the general population at time t and the **excess hazard** due to cancer [9-11]. More flexible models, which use splines to model the baseline excess hazard of death as well as non-proportional covariable effects, have recently been developed [12-14]; modelling the log-cumulative excess hazard has also been proposed [15-16]. Alternative approaches have recently been developed [17].
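In symbols, the additive decomposition described above can be sketched as follows. The notation ($h_O$, $h_P$, $h_E$, $S_N$, with $z$ denoting the life-table variables among the covariables $x$) is introduced here for illustration and is not taken from the cited papers; the proportional form on the last line is only one possible specification of the excess hazard.

$$
\begin{aligned}
h_O(t \mid x) &= h_P(t \mid z) + h_E(t \mid x), \\
S_N(t \mid x) &= \exp\!\left(-\int_0^t h_E(u \mid x)\,du\right), \\
h_E(t \mid x) &= h_{E,0}(t)\,\exp(\beta^\top x).
\end{aligned}
$$

Here $h_O$ is the overall hazard of death among patients, $h_P$ is the expected (background) hazard taken from the life tables, $h_E$ is the excess hazard due to cancer, and $S_N$ is the corresponding net survival.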

Unbiased estimation of net survival requires that the main variables driving censoring, which are usually the variables by which the life tables are stratified, be included in the excess hazard models [18].

#### Current work

*Life tables*

Estimation of net survival relies on accurate life tables. A methodology based on a multivariable flexible Poisson model has been developed in order to build complete, smoothed life tables for subpopulations defined by region, deprivation, ethnicity, etc. [19].

*Survival on sparse data*

In contrast with incidence and mortality, very little has been done on the estimation of survival based on sparse data or small areas [20]. The main challenge in survival is the additional dimension of time since diagnosis. Multilevel modelling and Bayesian approaches are the two main possible routes. Ultimately, the presentation of such survival results can easily mislead healthcare policy makers, and methodological work on mapping and funnel plots is needed [21].

*Public health relevance*

Several indicators (avoidable deaths, population ‘cure’ parameters, crude probability of death, partitioned excess mortality) have been explored to present cancer survival results in ways more relevant for public health and health policy.

*Missing data and misclassification*

The analysis of routine, population-based data always faces the problem of incomplete data, for which it may be difficult or impossible to obtain the required complementary information. A tutorial paper explored the estimation of relative survival when the data are incomplete [22]. Even when complete, tumour stage in particular may be misclassified, compromising comparisons of cancer survival between subpopulations.

*Disparities in cancer survival*

Inequalities in cancer survival are still not well understood and structural equation modelling appears to be a possible approach to investigate potential causal pathways.

## References

1. Clayton D, Hills M. *Statistical Models in Epidemiology*. Oxford: Oxford University Press, 1993.

2. Putter H, Fiocco M, Geskus RB. Tutorial in biostatistics: Competing risks and multi-state models. *Statistics in Medicine* 2007; 26: 2389-2430.

3. CASCADE Collaboration. Effective therapy has altered the spectrum of cause specific mortality following HIV seroconversion. *AIDS* 2006; 20: 741-749.

4. Fine JP, Gray RJ. A proportional hazards model for the subdistribution of a competing risk. *Journal of the American Statistical Association* 1999; 94: 496-509.

5. Klein JP, Andersen PK. Regression modeling of competing risks data based on pseudovalues of the cumulative incidence function. *Biometrics* 2005; 61: 223-229.

6. Robins JM, et al. Semiparametric regression for repeated outcomes with non-ignorable non-response. *Journal of the American Statistical Association* 1998; 93: 1321-1339.

7. Hernán MA, Hernandez-Diaz S, Robins JM. A structural approach to selection bias. *Epidemiology* 2004; 15: 615-625.

8. Ederer F, Axtell LM, Cutler SJ. The relative survival: a statistical methodology. *Natl Cancer Inst Monogr* 1961; 6: 101-21.

9. Hakulinen T, Tenkanen L. Regression analysis of relative survival rates. *J Roy Stat Soc Ser C* 1987; 36: 309-17.

10. Estève J, Benhamou E, Croasdale M, Raymond L. Relative survival and the estimation of net survival: elements for further discussion. *Stat Med* 1990; 9: 529-38.

11. Dickman PW, Sloggett A, Hills M, Hakulinen T. Regression models for relative survival. *Stat Med* 2004; 23: 51-64.

12. Bolard P, Quantin C, Abrahamowicz M, Estève J, Giorgi R, Chadha-Boreham H, Binquet C, Faivre J. Assessing time-by-covariate interactions in relative survival models using restrictive cubic spline functions. *J Cancer Epidemiol Prev* 2002; 7: 113-22.

13. Giorgi R, Abrahamowicz M, Quantin C, Bolard P, Estève J, Gouvernet J, Faivre J. A relative survival regression model using B-spline functions to model non-proportional hazards. *Stat Med* 2003; 22: 2767-84.

14. Remontet L, Bossard N, Belot A, Estève J, FRANCIM. An overall strategy based on regression models to estimate relative survival and model the effects of prognostic factors in cancer survival studies. *Stat Med* 2007; 26: 2214-28.

15. Nelson CP, Lambert PC, Squire IB, Jones DR. Flexible parametric models for relative survival, with application in coronary heart disease. *Stat Med* 2007; 26: 5486-98.

16. Lambert PC, Royston P. Further development of flexible parametric models for survival analysis. *Stata J* 2010; 9: 265-90.

17. Perme MP, Henderson R, Stare J. An approach to estimation in relative survival regression. *Biostatistics* 2009; 10: 136-46.

18. Estève J, Benhamou E, Raymond L. *Statistical methods in cancer research, volume IV. Descriptive epidemiology. (IARC Scientific Publications No. 128)*. Lyon: International Agency for Research on Cancer, 1994.

19. Cancer Research UK Cancer Survival Group. Life tables for England and Wales by sex, calendar period, region and deprivation. http://www.lshtm.ac.uk/ncdeu/cancersurvival/tools/, 2004.

20. Quaresma M, Walters S, Gordon E, Carrigan C, Coleman MP, Rachet B. A cancer survival index for Primary Care Trusts. Office for National Statistics, 7 Sep 2010. http://www.statistics.gov.uk/statbase/Product.asp?vlnk=15388

21. Spiegelhalter DJ. Funnel plots for comparing institutional performance. *Statistics in Medicine* 2005; 24: 1185-202.

22. Nur U, Shack LG, Rachet B, Carpenter JR, Coleman MP. Modelling relative survival in the presence of incomplete data: a tutorial. *IJE* 2010; 39: 118-28.