# Time series regression analysis

Theme Co-ordinators: Ben Armstrong, Antonio Gasparrini

This page is split into the following sections:

- Time series analysis for biomedical data
- Methodological issues
- Contributions of LSHTM researchers
- LSHTM people involved in developing or using time series regression methodology
- Publications by LSHTM researchers
- Other useful references

## Time series analysis for biomedical data

A time series may be defined as a sequence of measurements taken at (usually equally-spaced) ordered points in time.

Statistical methods applied to time series data were originally developed mainly in econometrics, and then used in many other fields, such as ecology, physics and engineering. In the original application the focus was on *prediction*, and the aim was to produce an accurate forecast of future measurements given an observed series. The standard statistical approaches adopted for this purpose usually rely on *auto-regressive integrated moving average* (ARIMA) and related models.

Recently, the time series design has also been exploited in biomedical research, due to the availability of routinely-collected series of administrative or medical data, such as mortality or morbidity counts, and air pollution or temperature measurements. Within this research area, time series methods have been subject to intense methodological development in the last 15 years. In contrast with the original interest in prediction, the main aim of time series analysis in biomedical applications is commonly to assess the association between an outcome and one or more predictor series: here the focus is instead on *estimation*, and the models reduce to the more traditional regression framework, although possibly in non-standard versions.
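As an illustrative sketch of such a regression model (the notation is assumed here and is not taken from the references), a series of daily counts $Y_t$ is often modelled as

$$
Y_t \sim \mathrm{Poisson}(\mu_t), \qquad \log \mu_t = \alpha + \beta x_t + s(t) + \sum_{j} \gamma_j z_{jt}
$$

where $x_t$ is the exposure of interest on day $t$, $s(t)$ is a smooth function of time controlling for seasonality and long-term trends, and $z_{jt}$ are other time-varying confounders. The coefficient $\beta$ is the quantity to be estimated; in practice the Poisson assumption is usually relaxed to allow for overdispersion.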

Two main features characterize time series data from a statistical viewpoint: the *correlation* displayed by observations and their *temporal sequence*. Statistical models need to cope with the former, in order to provide accurate inferences, and may exploit the latter, with the intention to strengthen the evidence on the causal nature or clarify details of the association under study.
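The correlation between nearby observations can be illustrated with a small simulation. The following is a minimal sketch, assuming a simulated first-order autoregressive series; the function name `autocorrelation` and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def autocorrelation(series, lag):
    """Sample autocorrelation of a 1-D series at a positive lag."""
    centred = np.asarray(series, dtype=float) - np.mean(series)
    return float(np.sum(centred[lag:] * centred[:-lag]) / np.sum(centred ** 2))

rng = np.random.default_rng(42)
n, phi = 5000, 0.7            # series length and assumed AR(1) coefficient
noise = rng.normal(size=n)    # independent observations, for comparison

# Build an AR(1) series: each value depends on the previous one
ar1 = np.empty(n)
ar1[0] = noise[0]
for i in range(1, n):
    ar1[i] = phi * ar1[i - 1] + noise[i]

r1_ar = autocorrelation(ar1, 1)        # close to phi for a long series
r1_noise = autocorrelation(noise, 1)   # close to zero
print(f"lag-1 autocorrelation: AR(1) = {r1_ar:.2f}, white noise = {r1_noise:.2f}")
```

In a regression setting the same function would typically be applied to model residuals rather than to the raw series, since much of the raw correlation is explained by seasonality and trend.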

## Methodological issues

The regression analysis of time series biomedical data poses several methodological problems, which have prompted intense research in the last few years. The main research directions are summarized below. References are provided in the related sections.

- Model selection: time series models are usually built with a pre-defined set of potential candidates. However, some criteria are needed to select other model parameters, such as the degree of control for seasonal and long-term trends, or the adequacy of assumptions on the shape of the exposure-response relationship of predictors showing potential non-linear effects. Some investigators have tested the comparative performance of selection criteria based on information criteria (Akaike, Bayesian or related), minimization of partial autocorrelation of residuals, (generalized) cross-validation and others. Further research is needed to produce robust and general selection criteria.
- Smoothing methods: the specification of non-linear exposure-response relationships for predictors in the regression model is essential both to determine the association with the exposure of interest and to control for potential confounders. Smoothing techniques based on both parametric and non-parametric methods have been proposed in time series analysis. The former usually rely on regression splines within generalized linear models (GLM), while the latter are specified through smoothing or penalized splines within generalized additive models (GAM).
- Distributed lag (non-linear) models: commonly the effect of an exposure is not limited to the day it occurs, but persists for further days or weeks. This introduces the additional problem of modelling the lag structure of the exposure-response relationship. This issue was initially addressed by *distributed lag models*, which allow the linear effect of a single exposure event to be distributed over a specific period of time. More recently, this methodology has been generalized to non-linear exposure-response relationships through *distributed lag non-linear models*, a modelling framework which can flexibly describe simultaneously non-linear and delayed associations.
- Harvesting effect (mortality displacement): this phenomenon arises when applying an ecological time series analysis to grouped data, for example mortality counts. The conceptual framework is based on the assumption that the exposure affects mainly a pool of frail individuals, whose events are only brought forward by a brief period of time by the effect of exposure. For non-recurrent outcomes, the depletion of the pool following a high exposure event results in some reduction of cases a few days later, thereby reducing the overall long-term impact. Specific models are needed to account for this reduction in the overall effect and thereby produce accurate estimates.
- Two-stage analysis: the usual approach to time series studies on environmental factors involves the analysis of series from multiple cities or regions. The complexity of the regression models prevents the specification of a very highly parameterized hierarchical structure in a single multilevel model. The analysis is instead carried out in two stages, with a common city-specific model followed by a meta-analysis to pool the results. The specification of complex exposure-response relationships in the first stage requires the development of non-standard meta-analytic techniques, such as *meta-smoothing* and *multivariate meta-analysis*.
- Interrupted time series: time series analysis is also applied to evaluate the longitudinal effects of interventions, such as public health policies. The main approach relies on a segmented regression analysis involving a pre-post design, where the effect is controlled for seasonality and long-term trends. Although often described as a “quasi-experimental design”, this methodology faces important limitations due to the presence of potential unmeasured time-varying confounders. Important methodological developments, among others, focus on the definition of the effect of interest, the occurrence of non-linear trends and lagged effects, and the inclusion of control areas.
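The distributed lag idea in the list above, and its link with harvesting, can be sketched with a simple unconstrained distributed lag model. This is a minimal illustration only: it assumes a simulated Gaussian outcome fitted by ordinary least squares rather than the Poisson-type regression usual for counts, and the helper `lag_matrix` and all numerical values are hypothetical.

```python
import numpy as np

def lag_matrix(x, max_lag):
    """Matrix with one row per day t >= max_lag; column l holds the exposure at day t - l."""
    n = len(x)
    return np.column_stack([x[max_lag - l : n - l] for l in range(max_lag + 1)])

rng = np.random.default_rng(0)
n, max_lag = 2000, 3
x = rng.normal(size=n)                        # simulated daily exposure

# Assumed lag-specific effects (lag 0 to 3); the negative value at lag 3
# mimics a deficit of cases a few days after exposure (harvesting)
true_beta = np.array([0.5, 0.3, 0.1, -0.2])

Q = lag_matrix(x, max_lag)
y = 2.0 + Q @ true_beta + rng.normal(scale=0.1, size=n - max_lag)

# Regress the outcome on the intercept and all lagged exposures at once
X = np.column_stack([np.ones(len(y)), Q])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

lag_effects = coef[1:]                # estimated effect at each lag
overall_effect = lag_effects.sum()    # net effect cumulated over all lags
print("lag-specific effects:", np.round(lag_effects, 2))
print("overall cumulative effect:", round(float(overall_effect), 2))
```

The overall cumulative effect is smaller than the sum of the positive lag-specific effects, which is how a model of this kind can account for short-term displacement. Distributed lag non-linear models extend this by replacing the raw lagged columns with basis functions in both the exposure and the lag dimensions.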

## Contributions of LSHTM researchers

### Methodological research

Statisticians at the LSHTM have made contributions to time series regression methodology to address problems that have arisen in the use of these methods in substantive epidemiological studies carried out at the School (see below). A paper summarizes several issues as potential candidates for methodological work, focusing in particular on temperature-health associations (Gasparrini and Armstrong 2010).

Published methodological articles have proposed a new, more flexible way to model lagged relationships, through the framework of distributed lag non-linear models (Armstrong 2006; Gasparrini et al, 2010), recently implemented in an R package *dlnm* (Gasparrini 2011) – see figure.

Other methodological efforts have explored ways to pool estimates of non-linear exposure-response relationships in two-stage analyses. The methods are based on multivariate meta-analytical techniques applied to estimates of multi-parameter associations from first-stage models, and implemented in the R package *mvmeta* (Gasparrini and Armstrong 2011b, Gasparrini et al. 2012) – see figure.
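The pooling step can be sketched as follows. This is a deliberately simplified illustration, assuming a fixed-effects model with multivariate inverse-variance weighting; it does not use the *mvmeta* package, omits the between-city heterogeneity that random-effects models estimate, and the function name `pool_fixed_effects` and the toy inputs are hypothetical.

```python
import numpy as np

def pool_fixed_effects(estimates, covariances):
    """Pool multi-parameter estimates by multivariate inverse-variance weighting.

    estimates:   list of 1-D arrays, one vector of coefficients per city
    covariances: list of matching (co)variance matrices
    """
    weights = [np.linalg.inv(S) for S in covariances]   # precision matrices
    pooled_cov = np.linalg.inv(sum(weights))
    pooled = pooled_cov @ sum(W @ theta for W, theta in zip(weights, estimates))
    return pooled, pooled_cov

# Toy first-stage results from two cities, each with two spline coefficients
thetas = [np.array([1.0, 2.0]), np.array([3.0, 4.0])]
covs = [np.eye(2), np.eye(2)]

pooled, pooled_cov = pool_fixed_effects(thetas, covs)
print("pooled coefficients:", pooled)
print("pooled covariance:\n", pooled_cov)
```

With equal covariance matrices the pooled vector is simply the average of the city-specific vectors; with unequal matrices, more precisely estimated cities receive more weight, parameter by parameter and accounting for their correlations.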

Two other papers have explored models to allow estimation of the extent to which the excess deaths associated with heat waves can be explained by a continuous association between temperature and mortality, or whether rather an additional “wave effect” due to sustained heat is necessary (Hajat et al, 2006; Gasparrini and Armstrong 2011a). Models to identify short term “harvesting” (see above) have also been explored (Hajat et al, 2005) – see figure.

Another research activity explores methodological issues on interrupted time series. A paper has evaluated the influence of alternative modelling assumptions on the estimate of the association between the introduction of state-wide smoking bans and the incidence of acute myocardial infarction (Gasparrini et al, 2009).
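A basic segmented regression of the kind used in interrupted time series can be sketched as below. This is a minimal example under strong assumptions: monthly simulated data, a Gaussian outcome fitted by ordinary least squares, a single pair of harmonic terms for seasonality, and no adjustment for residual autocorrelation or time-varying confounders. All variable names and values are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n_months, start = 120, 60                 # intervention begins at month 60
t = np.arange(n_months)

post = (t >= start).astype(float)                 # 1 after the intervention, else 0
time_since = np.where(t >= start, t - start, 0.0) # months elapsed since intervention

# One pair of harmonic terms as a simple control for annual seasonality
season_sin = np.sin(2 * np.pi * t / 12)
season_cos = np.cos(2 * np.pi * t / 12)

# Simulated outcome: underlying trend, seasonality, a level drop of 5
# and a slope change of -0.05 per month after the intervention
y = (50 + 0.1 * t - 5.0 * post - 0.05 * time_since
     + 3.0 * season_sin + rng.normal(scale=0.5, size=n_months))

X = np.column_stack([np.ones(n_months), t, post, time_since, season_sin, season_cos])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

level_change, slope_change = coef[2], coef[3]
print(f"estimated level change: {level_change:.2f}")
print(f"estimated slope change: {slope_change:.3f}")
```

Changing the modelling assumptions, for example dropping the slope-change term or the seasonal terms, alters the estimated level change, which is the type of sensitivity the paper cited above investigates.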

Finally, we have proposed a version of the “case-only” approach, designed originally for studying gene-environment interactions, for use in a time series context, to study how effects of time-varying risk factors (e.g. weather) might be modified by time-fixed factors, such as age or socio-economic status (Armstrong 2003).

Ongoing methodological work is focused in particular on ways of characterising variation in distributed lag non-linear models (such as that in the 3d-plot below) across cities or sub-populations (Gasparrini and Armstrong 2010), and on methodological issues in setting up heat-health warning systems (Hajat et al, 2010).

### Applied research

The substantive research using time series regression methods carried out at the LSHTM has concerned mainly the associations between daily occurrences of health outcomes (such as deaths) and time-varying environmental factors. The earliest examples (Gouveia and Fletcher 2000) concerned associations of daily air pollution with mortality, and this interest continues (Pattenden et al, 2010). Since then most focus has been on associations of weather and season with health, of particular interest in the context of impending global warming. The most common health outcome has been mortality (Hajat et al, 2002; McMichael et al, 2008; Armstrong et al, 2011; Gasparrini et al, 2012), but other outcomes have also been studied: viral disease (Lopman et al, 2009), food-borne disease (Kovats et al, 2004; Tam et al, 2006), diarrhoea (Hashizume et al, 2008; Hashizume et al, 2010), preterm birth (Lee et al, 2008), GP visits (Hajat and Haines 2002), and myocardial infarctions (Bhaskaran et al, 2011, 2009a, 2009b). The same methods have also been used to study the impact of circulating RSV and influenza on hospital admission (Mangtani et al, 2006), including estimating how much vaccination reduces that association (Armstrong et al, 2004). Several studies have focused in particular on which groups are vulnerable to the acute effects of weather (Wilkinson et al, 2004; Hajat et al, 2007; Hajat and Kosatky 2010). Other studies have instead applied interrupted time series methods to explore the association between the introduction of state-wide smoking bans and cardiovascular morbidity (Barone-Adesi et al, 2011).

For other and in particular more recent relevant papers check out the personal web pages of the staff members, accessible from the list below.

## LSHTM people involved in developing or using time series regression methodology

Ben Armstrong; Antonio Gasparrini; Shakoor Hajat; Mike Kenward; Paul Wilkinson; Sari Kovats; Sam Pattenden; Zaid Chalabi; Krishnan Bhaskaran; Clarence Tam; Punam Mangtani

## Publications by LSHTM researchers

### Methodological research:

Gasparrini, A., B. Armstrong, et al. (2012). Multivariate meta-analysis for non-linear and other multi-parameter associations. *Statistics in Medicine*. Epub ahead of print (doi: 10.1002/sim.5471).

Gasparrini, A. and B. Armstrong (2011a). The impact of heat waves on mortality. *Epidemiology* 22(1): 68-73.

Gasparrini, A. and B. Armstrong (2011b). Multivariate meta-analysis: a method to summarize non-linear associations. *Statistics in Medicine* 30(20): 2504-2506.

Gasparrini, A. (2011). Distributed Lag Linear and Non-Linear Models in R: The Package dlnm. *J Stat Softw* 43(8): 1-20.

Hajat, S., S. C. Sheridan, et al. (2010). Heat-health warning systems: a comparison of the predictive capacity of different approaches to identifying dangerously hot days. *Am J Public Health* 100(6): 1137-1144.

Gasparrini, A., B. Armstrong and M. G. Kenward (2010). Distributed lag non-linear models. *Statistics in Medicine* 29(21): 2224-34.

Gasparrini, A. and B. Armstrong (2010). Time series analysis on the health effects of temperature: Advancements and limitations. *Environmental Research* 110(6): 633-8.

Gasparrini, A., G. Gorini and A. Barchielli (2009). On the relationship between smoking bans and incidence of acute myocardial infarction. *European Journal of Epidemiology* 24(10): 597-602.

Hajat, S., B. Armstrong, M. Baccini, et al. (2006). Impact of high temperatures on mortality: is there an added heat wave effect? *Epidemiology* 17(6): 632-8.

Hajat, S., B. G. Armstrong, N. Gouveia, et al. (2005). Mortality displacement of heat-related deaths: a comparison of Delhi, Sao Paulo, and London. *Epidemiology* 16(5): 613-20.

Armstrong, B. G. (2003). Fixed factors that modify the effects of time-varying factors: applying the case-only approach. *Epidemiology* 14(4): 467-72.

### Applied research:

Gasparrini, A., B. Armstrong, et al. (2012). The effect of high temperatures on cause-specific mortality in England and Wales. *Occup Environ Med* 69(1): 56-61.

Bhaskaran, K., S. Hajat, et al. (2011). The effects of hourly differences in air pollution on the risk of myocardial infarction: case crossover analysis of the MINAP database. *BMJ* 343: d5531.

Armstrong, B. G., Z. Chalabi, et al. (2011). Association of mortality with high temperatures in a temperate climate: England and Wales. *J Epidemiol Community Health* 65(4): 340-345.

Barone-Adesi, F., A. Gasparrini, et al. (2011). Effects of Italian smoking regulation on rates of hospital admission for acute coronary events: a country-wide study. *PLoS One* 6(3): e17419.

Bhaskaran, K., S. Hajat, et al. (2010). Short term effects of temperature on risk of myocardial infarction in England and Wales: time series regression analysis of the Myocardial Ischaemia National Audit Project (MINAP) registry. *BMJ* 341: c3823.

Hajat, S. and T. Kosatky (2010). Heat-related mortality: a review and exploration of heterogeneity. *J Epidemiol Community Health* 64(9): 753-760.

Pattenden, S., B. Armstrong, A. Milojevic, et al. (2010). Ozone, heat and mortality: acute effects in 15 British conurbations. *Occupational and Environmental Medicine* 67(10): 699.

Hashizume, M., A. S. G. Faruque, Y. Wagatsuma, et al. (2010). Cholera in Bangladesh: Climatic Components of Seasonal Variation. *Epidemiology* 21(5): 706-10.

Bhaskaran, K., S. Hajat, A. Haines, et al. (2009a). The effects of air pollution on the incidence of myocardial infarction – A systematic review. *Heart* 95(21): 1746-59.

Bhaskaran, K., S. Hajat, A. Haines, et al. (2009b). The effects of ambient temperature on the incidence of myocardial infarction – A systematic review. *Heart* 95(21): 1760-69.

Lopman, B., B. Armstrong, C. Atchison, et al. (2009). Host, weather and virological factors drive norovirus epidemiology: time-series analysis of laboratory surveillance data in England and Wales. *PLoS One* 4(8): e6671.

McMichael, A. J., P. Wilkinson, R. S. Kovats, et al. (2008). International study of temperature, heat and urban mortality: the ‘ISOTHURM’ project. *International Journal of Epidemiology* 37(5): 1121.

Hashizume, M., B. Armstrong, S. Hajat, et al. (2008). The effect of rainfall on the incidence of cholera in Bangladesh. *Epidemiology* 19(1): 103-10.

Lee, S. J., S. Hajat, P. J. Steer, et al. (2008). A time-series analysis of any short-term effects of meteorological and air pollution factors on preterm births in London, UK. *Environ Res* 106(2): 185-94.

Hajat, S., R. S. Kovats and K. Lachowycz (2007). Heat-related and cold-related deaths in England and Wales: who is at risk? *Occupational and Environmental Medicine* 64(2): 93-100.

Armstrong, B. G., P. Mangtani, A. Fletcher, et al. (2004). Effect of influenza vaccination on excess deaths occurring during periods of high circulation of influenza: cohort study in elderly people. *BMJ* 329(7467): 660.

Mangtani, P., S. Hajat, S. Kovats, et al. (2006). The association of respiratory syncytial virus infection and influenza with emergency admissions for respiratory disease in London: an analysis of routine surveillance data. *Clin Infect Dis* 42(5): 640-6.

Tam, C. C., L. C. Rodrigues, S. J. O’Brien, et al. (2006). Temperature dependence of reported Campylobacter infection in England, 1989-1999. *Epidemiol Infect* 134(1): 119-25.

Wilkinson, P., S. Pattenden, B. Armstrong, et al. (2004). Vulnerability to winter mortality in elderly people in Britain: population based study. *British Medical Journal* 329(7467): 647.

Kovats, R. S., S. J. Edwards, S. Hajat, et al. (2004). The effect of temperature on food poisoning: a time-series analysis of salmonellosis in ten European countries. *Epidemiol Infect* 132(3): 443-53.

Hajat, S. and A. Haines (2002). Associations of cold temperatures with GP consultations for respiratory and cardiovascular disease amongst the elderly in London. *International Journal of Epidemiology* 31(4): 825-30.

Hajat, S., R. S. Kovats, R. W. Atkinson, et al. (2002). Impact of hot temperatures on death in London: a time series approach. *Journal of Epidemiology and Community Health* 56(5): 367-72.

Gouveia, N. and T. Fletcher (2000). Time series analysis of air pollution and mortality: effects by cause, age and socioeconomic status. *Journal of Epidemiology and Community Health* 54(10): 750.

## Other useful references

### General:

Zeger, S. L., R. Irizarry and R. D. Peng (2006). On time series analysis of public health and biomedical data. *Annual Review of Public Health* 27: 57-79.

Peng, R. D. and F. Dominici (2008). *Statistical Methods for Environmental Epidemiology with R – A Case Study in Air Pollution and Health*. New York, Springer.

Armstrong, B. (2006). Models for the relationship between ambient temperature and daily mortality. *Epidemiology* 17(6): 624-31.

Dominici, F. (2004). Time-series analysis of air pollution and mortality: a statistical review. *Research report – Health Effects Institute* 123: 3-27; discussion 9-33.

Dominici, F., A. McDermott and T. J. Hastie (2004). Improved semiparametric time series models of air pollution and mortality. *Journal of the American Statistical Association* 99(468): 938-49.

Touloumi, G., R. Atkinson, A. Le Tertre, et al. (2004). Analysis of health outcome time series data in epidemiological studies. *EnvironMetrics* 15(2): 101-17.

### On model selection:

Dominici, F., C. Wang, C. Crainiceanu, et al. (2008). Model selection and health effect estimation in environmental epidemiology. *Epidemiology* 19(4): 558-60.

Crainiceanu, C. M., F. Dominici and G. Parmigiani (2008). Adjustment uncertainty in effect estimation. *Biometrika* 95(3): 635.

Baccini, M., A. Biggeri, C. Lagazio, et al. (2007). Parametric and semi-parametric approaches in the analysis of short-term effects of air pollution on health. *Computational Statistics and Data Analysis* 51(9): 4324-36.

He, S., S. Mazumdar and V. C. Arena (2006). A comparative study of the use of GAM and GLM in air pollution research. *EnvironMetrics* 17(1): 81-93.

Peng, R. D., F. Dominici and T. A. Louis (2006). Model choice in time series studies of air pollution and mortality. *Journal of the Royal Statistical Society: Series A* 169(2): 179-203.

### On smoothing methods:

Marra, G. and R. Radice (2010). Penalised regression splines: theory and application to medical research. *Statistical Methods in Medical Research* 19(2): 107-25.

Schimek, M. G. (2009). Semiparametric penalized generalized additive models for environmental research and epidemiology. *EnvironMetrics* 20(6): 699-717.

Wood, S. N. (2006). *Generalized Additive Models: an Introduction with R*, Chapman & Hall/CRC.

Dominici, F., M. J. Daniels, S. L. Zeger, et al. (2002a). Air pollution and mortality: estimating regional and national dose-response relationships. *Journal of the American Statistical Association* 97: 100-11.

Dominici, F., A. McDermott, S. L. Zeger, et al. (2002b). On the use of generalized additive models in time-series studies of air pollution and health. *American Journal of Epidemiology* 156(3): 193-203.

### On harvesting effect:

Rabl, A. (2005). Air pollution mortality: harvesting and loss of life expectancy. *Journal of Toxicology and Environmental Health: Part A* 68(13-14): 1175-80.

Schwartz, J. (2001). Is there harvesting in the association of airborne particles with daily deaths and hospital admissions? *Epidemiology* 12(1): 55-61.

Schwartz, J. (2000b). Harvesting and long term exposure effects in the relation between air pollution and mortality. *American Journal of Epidemiology* 151(5): 440-8.

### On distributed lag (non-linear) models:

Gasparrini, A. (2011). Distributed Lag Linear and Non-Linear Models in R: The Package dlnm. *J Stat Softw* 43(8): 1-20.

Gasparrini, A., B. Armstrong and M. G. Kenward (2010). Distributed lag non-linear models. *Statistics in Medicine* 29(21): 2224-34.

Muggeo, V. M. (2008). Modeling temperature effects on mortality: multiple segmented relationships with common break points. *Biostatistics* 9(4): 613-20.

Schwartz, J. (2000a). The distributed lag between air pollution and daily deaths. *Epidemiology* 11(3): 320-6.

### On meta-analytic techniques:

Gasparrini, A., B. Armstrong, et al. (2012). Multivariate meta-analysis for non-linear and other multi-parameter associations. *Statistics in Medicine*. Epub ahead of print (doi: 10.1002/sim.5471).

Dominici, F., J. M. Samet and S. L. Zeger (2000). Combining evidence on air pollution and daily mortality from the 20 largest US cities: a hierarchical modelling strategy. *Journal of the Royal Statistical Society: Series A* 163(3): 263-302.

Schwartz, J. and A. Zanobetti (2000). Using meta-smoothing to estimate dose-response trends across multiple studies, with application to air pollution and daily death. *Epidemiology* 11(6): 666-72.

### On interrupted time series:

Wagner, A. K., S. B. Soumerai, F. Zhang, et al. (2002). Segmented regression analysis of interrupted time series studies in medication use research. *Journal of Clinical Pharmacy and Therapeutics* 27(4): 299-309.