The problem of missing data is almost ubiquitous in medical research, in both observational studies and randomized trials. Until the advent of sufficiently powerful computers, much of the research in this area focused on how to handle, in a practicable way, the lack of balance caused by incompleteness. An example of such a development was the key idea of the EM algorithm (Dempster et al. 1977). As routine computation became less of a problem, attention moved to the much more subtle issue of the consequences of missing data for the validity of subsequent analyses. The seminal work was Rubin (1976), from which all subsequent work in this area has developed to a greater or lesser degree.
Although the underlying missing data concepts are the same for observational and randomized studies, the emphases differ somewhat in practice in the two areas. However, both are the subject of development within the Centre. From 2002, supported by several grants from the Economic and Social Research Council, an entire programme has been developed around the handling of missing data in observational studies. This includes the development of multiple imputation in a multilevel setting (e.g. Goldstein et al 2009, Carpenter et al 2010), a series of short courses, and the establishment of a leading website devoted to the topic:
which contains background material, answers to frequently asked questions, course notes, software, details of upcoming courses and events, a bibliography, and a discussion forum.
A central problem in the clinical trial setting is the appropriate handling of dropout and withdrawal in longitudinal studies. This has been the subject of great debate among academics, trialists and regulators for the last 10 to 15 years. Members of the Centre have had long involvement in this (e.g. Diggle and Kenward 1994, Carpenter et al. 2002). A textbook was published by Wiley on the broad subject of missing data in clinical studies (Molenberghs and Kenward 2007). More recently the UK NHS National Co-ordinating Centre for Research on Methodology commissioned a monograph on the subject, which was published in 2008 (Carpenter and Kenward 2008). Members of the Centre are also actively involved in current regulatory developments. Two important documents have recently appeared. In the US, an FDA-commissioned National Research Council Panel on Handling Missing Data in Clinical Trials, chaired by Professor Rod Little, produced a report in 2010, ‘The Prevention and Treatment of Missing Data in Clinical Trials.’ James Carpenter was one of several experts invited to give a presentation to this panel. Implementation of the guidelines in this report is to be discussed at the 5th Annual FDA/DIA Statistics Forum in April 2011, where Mike Kenward is giving the one-day pre-meeting tutorial on missing data methodology. In Europe, again in 2010, the CHMP released their ‘Guideline on Missing Data in Confirmatory Clinical Trials’. James Carpenter, Mike Kenward and James Roger were members of the PSI working party that provided a response to the draft of this document (Burzykowski et al. 2009).
At the School there continues a broad research programme in both the observational study and randomized trial settings, and there is an active continuing programme of workshops. Missing data is an issue for many of the studies run and analysed within the School and there is much cross-fertilization across different research areas. There are also strong methodological links with other themes, especially causal inference; indeed, one recent piece of work explicitly connects the two areas (Daniel et al. 2011).
Those most directly involved in missing data research are Jonathan Bartlett, James Carpenter, Mike Kenward, James Roger (honorary), and two research students: Mel Smuk and George Vamvakis.
Many others have an interest in, and have contributed to, the area, including Rhian Daniel, Bianca de Stavola, George Ploubidis, and Stijn Vansteelandt (honorary).
Burzykowski T et al. (2009). Missing data: Discussion points from the PSI missing data expert group. Pharmaceutical Statistics. DOI: 10.1002/pst.391
Carpenter JR, Goldstein H and Kenward MG (2010). REALCOM-IMPUTE software for multilevel multiple imputation with mixed response types. Journal of Statistical Software, to appear.
Carpenter JR and Kenward MG (2008). Missing data in clinical trials – a practical guide. National Health Service Coordinating Centre for Research Methodology: Birmingham. Downloadable from http://www.haps.bham.ac.uk/publichealth/methodology/docs/invitations/Final_Report_RM04_JH17_mk.pdf.
Carpenter J, Pocock S and Lamm C (2002). Coping with missing values in clinical trials: a model based approach applied to asthma trials. Statistics in Medicine, 21, 1043-1066.
Daniel RM, Kenward MG, Cousens S and de Stavola B (2009). Using directed acyclic graphs to guide analysis in missing data problems. Statistical Methods in Medical Research, to appear.
Dempster AP, Laird NM and Rubin DB (1977). Maximum likelihood from incomplete data via the EM algorithm (with discussion). Journal of the Royal Statistical Society, Series B, 39, 1-38.
Diggle PJ and Kenward MG (1994). Informative dropout in longitudinal data analysis (with discussion). Applied Statistics, 43, 49-94.
Goldstein H, Carpenter JR, Kenward MG and Levin K (2009). Multilevel models with multivariate mixed response types. Statistical Modelling, 9, 173-197.
Molenberghs G and Kenward MG (2007). Missing Data in Clinical Studies. Chichester: Wiley.
Rubin DB (1976). Inference and missing data. Biometrika, 63, 581-592.
The measurement of variables of interest is central to epidemiological study. Often, the measurements we obtain are noisy, error-prone versions of the underlying quantity of primary interest. Such errors can arise from technical error induced by imperfect measurement instruments and from short-term fluctuations over time. An example is a single measurement of blood pressure, considered as a measure of an individual’s underlying average blood pressure. Variables obtained by asking individuals to answer questions about their behaviour or characteristics are also often subject to error, either due to the individual’s inability to accurately recall the behaviour in question or a tendency, for whatever reason, to over-estimate or under-estimate the quantity being requested.
The consequences of measurement error in a variable depend on the variable’s role in the substantive model of interest (Carroll et al. 2006). For example, independent error in the continuous outcome variable in a linear regression does not cause bias. In contrast, measurement error in the explanatory variables of regression models does, in general, cause bias. Measurement error in an exposure of interest may distort estimates of the exposure’s effect on the outcome of interest, while error in confounders will lead to imperfect adjustment for confounding, and hence to biased estimates of the effect of an exposure.
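The contrast between error in the outcome and error in an explanatory variable can be illustrated with a small simulation. The sketch below is illustrative only: it assumes a single explanatory variable, classical (independent, additive, normally distributed) error, and arbitrary parameter values; the function names `ols_slope` and `simulate_slopes` are hypothetical and not taken from any of the cited work.

```python
import random


def ols_slope(x, y):
    """Least-squares slope from a simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def simulate_slopes(n=20000, beta=1.0, sd_x=1.0, sd_u=1.0, sd_e=1.0, seed=1):
    """Compare slopes with no error, outcome error, and covariate error.

    All parameter values are assumptions chosen for illustration.
    """
    rng = random.Random(seed)
    x = [rng.gauss(0.0, sd_x) for _ in range(n)]         # true covariate
    y = [beta * xi + rng.gauss(0.0, 0.5) for xi in x]    # true outcome
    y_err = [yi + rng.gauss(0.0, sd_e) for yi in y]      # outcome measured with error
    w = [xi + rng.gauss(0.0, sd_u) for xi in x]          # covariate measured with error
    return {
        "no_error": ols_slope(x, y),
        "outcome_error": ols_slope(x, y_err),
        "covariate_error": ols_slope(w, y),
    }


if __name__ == "__main__":
    for name, value in simulate_slopes().items():
        print(f"{name}: {value:.3f}")
```

Under these assumptions the slope estimated with outcome error should stay close to the true value of `beta`, with only its precision reduced, whereas the slope estimated from the error-prone covariate should be attenuated towards zero by roughly the factor sd_x² / (sd_x² + sd_u²), which is 0.5 for the default values.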
When explanatory variables in regression models are categorical, the analogue of measurement error is misclassification. Unlike measurement errors, which can often plausibly be assumed to be independent of underlying true levels, a misclassification error is never independent of the underlying value of the predictor variable, and so different theory covers the effects of misclassification and measurement errors (White et al. 2001).
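To see why a misclassification error cannot be independent of the true value, consider a binary variable X observed as W. When X = 0 the error W - X can only be 0 or 1, and when X = 1 it can only be -1 or 0. The following sketch computes the conditional mean errors and their covariance with X exactly; the function name and the prevalence, sensitivity and specificity values are hypothetical choices for illustration.

```python
def misclassification_error_summary(prevalence, sensitivity, specificity):
    """Exact moments of the error W - X for a misclassified binary X.

    prevalence  = P(X = 1)
    sensitivity = P(W = 1 | X = 1)
    specificity = P(W = 0 | X = 0)
    """
    # When X = 0, the error is 1 with probability (1 - specificity), else 0.
    mean_err_x0 = 1.0 - specificity
    # When X = 1, the error is -1 with probability (1 - sensitivity), else 0.
    mean_err_x1 = sensitivity - 1.0
    # For binary X: Cov(X, W - X) = p (1 - p) (E[err | X=1] - E[err | X=0]).
    cov = prevalence * (1.0 - prevalence) * (mean_err_x1 - mean_err_x0)
    return mean_err_x0, mean_err_x1, cov


if __name__ == "__main__":
    m0, m1, cov = misclassification_error_summary(0.3, 0.8, 0.9)
    print(f"mean error given X=0: {m0:+.3f}")
    print(f"mean error given X=1: {m1:+.3f}")
    print(f"Cov(X, error): {cov:+.4f}")
```

The covariance is zero only when sensitivity and specificity are both equal to 1, that is, when there is no misclassification at all; otherwise the error is negatively correlated with the true value.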
Over the past thirty years a vast array of methods has been developed to accommodate measurement errors and misclassification in statistical analysis models. While simple methods, including method of moments correction and regression calibration, have sometimes been applied in epidemiological research, more sophisticated approaches, such as maximum likelihood (Bartlett et al. 2009) and semi-parametric methods (Carroll et al. 2006), have received less attention. This is likely due in part to a relative scarcity of implementations in statistical software packages.
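As a minimal sketch of the simplest of these approaches, the code below applies a method of moments correction to a single error-prone predictor using two replicate measurements per individual. It assumes classical error with constant variance, independent across replicates, and it is not the implementation of any cited paper; all function names, parameter values and the simulated data are hypothetical.

```python
import random


def sample_variance(values):
    """Unbiased sample variance."""
    n = len(values)
    m = sum(values) / n
    return sum((v - m) ** 2 for v in values) / (n - 1)


def ols_slope(x, y):
    """Least-squares slope from a simple linear regression of y on x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


def mom_corrected_slope(w1, w2, y):
    """Return (naive slope, estimated reliability, corrected slope).

    w1 and w2 are two replicate measurements of the same covariate.
    """
    naive = ols_slope(w1, y)
    # Var(w1 - w2) = 2 * error variance under independent classical error.
    var_u = sample_variance([a - b for a, b in zip(w1, w2)]) / 2.0
    var_w = sample_variance(w1)
    reliability = (var_w - var_u) / var_w   # estimate of Var(X) / Var(W)
    return naive, reliability, naive / reliability


def simulate(n=20000, beta=2.0, sd_u=0.7, seed=2):
    """Simulated data with assumed, illustrative parameter values."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y = [beta * xi + rng.gauss(0.0, 0.5) for xi in x]
    w1 = [xi + rng.gauss(0.0, sd_u) for xi in x]
    w2 = [xi + rng.gauss(0.0, sd_u) for xi in x]
    return w1, w2, y


if __name__ == "__main__":
    naive, reliability, corrected = mom_corrected_slope(*simulate())
    print(f"naive: {naive:.3f}, reliability: {reliability:.3f}, "
          f"corrected: {corrected:.3f}")
```

With these assumed values the reliability is 1 / (1 + 0.49), about 0.67, so the naive slope should be attenuated to roughly two thirds of `beta` and the corrected slope should be close to `beta`. The sketch ignores the extra uncertainty introduced by estimating the reliability, which a real analysis would need to account for.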
Areas for future research efforts
Greater recognition of the effects of measurement error and misclassification in the analysis of epidemiological and clinical studies.
Increasing the accessibility of methods to deal with measurement error, through dissemination of methods and the implementation of methods into statistical software.
Development of methods that allow for the effects of measurement errors in causal models that describe how risk factors, and therefore risks of disease, change over time.
Bartlett J. W., De Stavola B. L., Frost C. (2009). Linear mixed models for replication data to efficiently allow for covariate measurement error. Statistics in Medicine; 28: 3158-3178.
Carroll R. J., Ruppert D., Stefanski L. A., Crainiceanu C. M. (2006). Measurement error in nonlinear models. Chapman & Hall/CRC, Boca Raton, FL, US.
Frost C., Thompson S. G. (2000). Correcting for regression dilution bias: comparison of methods for a single predictor variable. Journal of the Royal Statistical Society A; 163: 173-189.
Frost C., White I. R. (2005). The effect of measurement error in risk factors that change over time in cohort studies: do simple methods overcorrect for ‘regression dilution’? International Journal of Epidemiology; 34: 1359-1368.
Gustafson, P. (2003). Measurement Error and Misclassification in Statistics and Epidemiology: Impacts and Bayesian Adjustments. Chapman and Hall/CRC Press.
White I., Frost C., Tokunaga S. (2001). Correcting for measurement error in binary and continuous variables using replicates. Statistics in Medicine; 20: 3441-3457.
Knuiman M. W., Divitini M. L., Buzas J. S., Fitzgerald P. E. B. (1998). Adjustment for regression dilution in epidemiological regression analyses. Annals of Epidemiology; 8: 56-63.