In recent years, big data has become a major focus of biomedical research, driven by rapid technological development. Benchmark examples of big data are (i) online tracking of flu epidemics, (ii) genetic and genomic analyses of many human diseases, and (iii) the analysis of millions of health and hospital records. However, the explosion of big data applications has raised important methodological questions, such as how best to store, manage, analyse and integrate such ever-increasing volumes of data.
This theme aims to provide a space for sharing methodological developments on big data problems and for their dissemination across the LSHTM research community.
Some of the key methodological issues that members of our theme are working on are:
- Methods for assessing and improving data quality
- Missing and poorly measured data
- Data Linkage
- Data mining and multivariate statistics
- Causal inference for big data
- Stochastic models for high throughput technologies
- Machine learning
Some areas of application using big data within the school are:
- Environmental epidemiology
- Health service evaluation
- Health economics studies
- Nutritional epidemiology
- Genomic epidemiology
- ‘Omics integration and systems biology
- Sero-epidemiology of infectious disease
- Analysis of Microbiome
Methods for assessing and improving data quality
The 3 V’s of big data (volume, variety, and velocity) were soon joined by a fourth, veracity, as it became clear that even the most sophisticated big data analytics cannot overcome the limitations of poorly captured data. Increasing electronic capture and storage of information does not, unfortunately, guarantee good data quality.
There is a relative paucity of methodological work to assess and improve the quality of data in big data settings. However, better detection of errors, leading to enhanced chances of correcting erroneous data, is essential for the validity of subsequent analysis.
Various approaches for detecting likely errors in data have been proposed. In the context of longitudinal data within routinely collected primary care data, one promising method developed in collaboration with members of our theme uses an iterative approach of fitting mixed models, identifying likely outliers, and re-fitting the model after removal of outliers (Welch et al, 2012).
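The iterate-flag-refit idea can be illustrated with a deliberately simplified sketch. The published method (Welch et al, 2012) fits mixed models to longitudinal data in two stages; here, purely for illustration, the "model" is reduced to a population mean and standard deviation, and the function name and data are hypothetical.

```python
import statistics

def iterative_outlier_removal(values, k=3.0, max_iter=10):
    """Simplified illustration of iterative outlier detection: flag points
    lying more than k standard deviations from the mean of the retained
    data, then refit (here: recompute mean/SD) and repeat until stable.
    The method of Welch et al (2012) uses mixed models in place of the
    crude mean/SD 'model' used here."""
    kept = list(values)
    removed = []
    for _ in range(max_iter):
        mean = statistics.fmean(kept)
        sd = statistics.stdev(kept)
        outliers = [v for v in kept if abs(v - mean) > k * sd]
        if not outliers:
            break  # model refit produced no further outliers
        removed.extend(outliers)
        kept = [v for v in kept if abs(v - mean) <= k * sd]
    return kept, removed

# hypothetical example: adult heights (cm) with one data-entry error
kept, removed = iterative_outlier_removal(
    [168, 172, 175, 160, 181, 170, 1700, 165, 178, 169], k=2.5)
```

Note that refitting after removal matters: a single extreme value inflates the standard deviation so much that it can mask itself (and other errors) on the first pass.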
Some relevant references:
Welch C, Petersen I, Walters K, Morris RW, Nazareth I, Kalaitzaki E, White IR, Marston L, Carpenter J. Two-stage method to remove population- and individual-level outliers from longitudinal data in a primary care database. Pharmacoepidemiology and Drug Safety, 2012; 21: 725-732.
Missing and poorly measured data
The challenge of missing data is not restricted to large datasets of routinely or semi-automatically collected data. However, missing data in such settings raise complex and often novel challenges; we highlight two below.
The first is referred to as ‘data-dependent sampling’: the process under study partly determines which data can be collected. Two examples:
- using wearable devices to measure activity can over-estimate usual activity due to participants choosing to leave their device at home on low activity days; this is a form of measurement error
- in routinely collected primary care data, clinical and therapeutic information is collected only when the patient chooses to visit their general practitioner – and then only for reasons specifically relevant to the consultation.
The second challenge arises from the sheer volume of the data. While imputation and related methods are flexible and powerful and offer much potential, they must be adapted to meet these challenges, which can violate their underpinning assumptions.
Some recent work within the group that has addressed some of these challenges:
- two-fold imputation, an adaption of multiple imputation which attempts to simplify the problem by conditioning only on measurements which are local in time (Welch et al, 2014); and
- a paper correcting misconceptions about the use of multiple imputation to handle missing data in propensity score analyses (Leyrat et al, 2017).
Some relevant references:
Welch C, Petersen I, Bartlett J, White IR, Marston L, Morris RW, Nazareth I, Walters K, Carpenter J. Evaluation of two-fold fully conditional specification multiple imputation for longitudinal electronic health record data. Statist. Med. 2014, 33:3725–3737.
Leyrat C, Seaman SR, White IR, Douglas I, Smeeth L, Kim J, Resche-Rigon M, Carpenter JR, Williamson EK. Propensity score analysis with partially observed covariates: How should multiple imputation be used? Stat Methods Med Research, 2017, doi: 10.1177/0962280217713032. [Epub ahead of print]
Causal inference for big data
Assessing causal relationships from non-randomised data poses many methodological challenges, particularly relating to confounding and selection bias. These are exacerbated in studies conducted using routinely collected data: data not collected for the primary purpose of research tend to be less regular and less complete than traditional data sources used to address such questions.
In comparative effectiveness studies of medications, there is often a wealth of information available regarding previous diagnoses, medications, referrals and therapies. However, how best to incorporate this information into analyses remains unclear. The high-dimensional propensity score (Schneeweiss et al, 2009) is an empirical algorithm to select potential confounders, prioritise candidates, and incorporate selected variables into a propensity-score based statistical model. This algorithm was developed in the context of US claims data; the validity of its application to different settings, such as routinely collected primary care data in the UK, remains unclear.
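The prioritisation step of the hdPS algorithm ranks candidate covariates by the bias they could plausibly induce, using the Bross bias multiplier. The sketch below shows only this step, with a hypothetical data layout; the full algorithm of Schneeweiss et al (2009) also generates candidate covariates from code recurrence (ever / sporadic / frequent) across data dimensions before ranking them.

```python
from math import log

def hdps_prioritise(records, exposure, outcome, covariates, top_k=2):
    """Rank binary candidate covariates by the Bross bias multiplier,
    as in the prioritisation step of the hdPS algorithm (simplified).
    records: list of dicts with 0/1 values for exposure, outcome and
    each covariate."""
    scores = {}
    exp1 = [r for r in records if r[exposure] == 1]
    exp0 = [r for r in records if r[exposure] == 0]
    for c in covariates:
        p1 = sum(r[c] for r in exp1) / len(exp1)  # prevalence in exposed
        p0 = sum(r[c] for r in exp0) / len(exp0)  # prevalence in unexposed
        # crude covariate-outcome relative risk
        with_c = [r for r in records if r[c] == 1]
        without_c = [r for r in records if r[c] == 0]
        risk1 = sum(r[outcome] for r in with_c) / max(len(with_c), 1)
        risk0 = sum(r[outcome] for r in without_c) / max(len(without_c), 1)
        rr = risk1 / risk0 if risk0 > 0 else 1.0
        # Bross multiplier: bias in the exposure-outcome RR if c is ignored
        bias = (p1 * (rr - 1) + 1) / (p0 * (rr - 1) + 1)
        scores[c] = abs(log(bias)) if bias > 0 else 0.0
    return sorted(covariates, key=lambda c: scores[c], reverse=True)[:top_k]
```

A covariate equally prevalent in exposed and unexposed groups gets a multiplier of 1 (no bias) and is ranked last, however strongly it predicts the outcome.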
An alternative approach to the incorporation of a large number of potential confounders into a causal model is offered by Targeted Maximum Likelihood Estimation (TMLE). This approach has been applied to UK primary care data to investigate the association between statins and all-cause mortality (Pang et al, 2016), with the authors concluding that a deeper understanding of the comparative advantages and disadvantages of this approach was needed within this big-data setting.
To begin to address this knowledge gap, members of our theme have developed a free open source online tutorial introducing TMLE for causal inference (see links below).
They have also created and made available a free open source Stata program to implement double-robust methods for causal inference, including Machine Learning algorithms for prediction (see links below).
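TMLE itself adds a targeted fluctuation step to an initial model fit; a simpler way to illustrate the double-robust idea it shares with the augmented inverse-probability-weighted (AIPW) estimator is the minimal sketch below, with a single binary confounder and all nuisance quantities estimated nonparametrically by stratified means. The function name and data layout are hypothetical, and this is not the theme's Stata implementation.

```python
def aipw_ate(data):
    """Augmented inverse-probability-weighted (double-robust) estimate of
    the average treatment effect E[Y(1)] - E[Y(0)].
    data: list of (W, A, Y) tuples with one binary confounder W and binary
    treatment A. Outcome models and propensity scores are stratified means."""
    def mean_y(a, w):
        ys = [y for (wi, ai, y) in data if ai == a and wi == w]
        return sum(ys) / len(ys)
    def pscore(w):  # P(A=1 | W=w)
        rows = [ai for (wi, ai, _) in data if wi == w]
        return sum(rows) / len(rows)
    total = 0.0
    for w, a, y in data:
        pi = pscore(w)
        m1, m0 = mean_y(1, w), mean_y(0, w)
        # outcome-model prediction plus inverse-probability-weighted residual
        total += (m1 - m0
                  + (a / pi) * (y - m1)
                  - ((1 - a) / (1 - pi)) * (y - m0))
    return total / len(data)
```

The "double robustness" lies in the residual terms: the estimate remains consistent if either the outcome model or the propensity model is correctly specified.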
A promising approach to poorly measured, or unmeasured, confounding is offered by self-controlled designs, such as the self-controlled risk interval, the case-crossover design and the self-controlled case series. These use individuals as their own controls, thus removing time-invariant confounding.
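As a minimal illustration of the self-controlled case series logic: under the simplifying assumption that every individual has the same risk-window and control-window lengths, conditioning on each case's total event count makes the number of events in the risk window binomial, and the conditional MLE of the incidence rate ratio has a closed form. The function and figures below are illustrative only.

```python
def sccs_rate_ratio(events_risk, events_control, t_risk, t_control):
    """Self-controlled case series incidence rate ratio (IRR), assuming
    all individuals share the same risk-window length t_risk and
    control-window length t_control. Conditional on a case's total events n,
    the count in the risk window is Binomial(n, p) with
    p = IRR*t_risk / (IRR*t_risk + t_control), so inverting the MLE
    p_hat = events_risk / (events_risk + events_control) gives the IRR."""
    p_hat = events_risk / (events_risk + events_control)
    return (p_hat / (1 - p_hat)) * (t_control / t_risk)

# hypothetical example: 30 adverse events in a 7-day post-vaccination risk
# window versus 40 events in the remaining 358 days of observation
irr = sccs_rate_ratio(30, 40, 7, 358)
```

Because the comparison is within individuals, anything constant over a person's observation period (genetics, chronic disease, deprivation) cannot confound the estimate.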
Some relevant references:
Franklin JM, Schneeweiss S, Solomon DH. Assessment of Confounders in Comparative Effectiveness Studies From Secondary Databases. Am J Epidemiol. 2017; 185(6): 474-478. doi: 10.1093/aje/kww136.
Franklin JM, Eddings W, Austin PC, Stuart EA, Schneeweiss S. Comparing the performance of propensity score methods in healthcare database studies with rare outcomes. Stat Med. 2017; 36(12): 1946-1963. doi: 10.1002/sim.7250.
Schneeweiss S, Rassen JA, Glynn RJ, Avorn J, Mogun H, Brookhart MA. High-dimensional propensity score adjustment in studies of treatment effects using health care claims data. Epidemiology. 2009; 20(4): 512-522.
Pang M, Schuster T, Filion KB, Eberg M, Platt RW. Targeted maximum likelihood estimation for pharmacoepidemiologic research. Epidemiology. 2016; 27(4): 570-577.
Kang JD, Schafer JL. Demystifying double robustness: A comparison of alternative strategies for estimating a population mean from incomplete data. Statistical Science. 2007: 523-39.
Schuler MS, Rose S. Targeted Maximum Likelihood Estimation for Causal Inference in Observational Studies. American Journal of Epidemiology. 2016. doi: 10.1093/aje/kww165.
Gruber S, van der Laan MJ. tmle: An R Package for Targeted Maximum Likelihood Estimation. Journal of Statistical Software. 2012; 51(13).
Gruber. Targeted Learning in Healthcare Research. Big Data. 2016; 3(4): 211-218. doi: 10.1089/big.2015.0025.
Software (open source):
Author: Dr. Miguel Angel Luque-Fernandez, LSHTM.
Data linkage
Linkage has been described as “a merging that brings together information from two or more sources of data with the object of consolidating facts concerning an individual or an event that are not available in any separate record” (Organisation for Economic Co-operation and Development (OECD) Glossary of Statistical Terms).
Linking health-related datasets offers the opportunity to improve data quality, by improving ascertainment of key risk factors and outcomes and allowing inconsistencies to be identified and resolved. It is a cost-effective means of assembling a dataset, exploiting existing resources. However, data linkage also brings challenges, including the frequent absence of unique identifiers, which leads to possible linkage errors, and data security considerations.
Even small amounts of linkage error can substantially bias results. False matches introduce variability and weaken the association between variables, often resulting in bias towards the null, while missed matches reduce the sample size, resulting in a loss of statistical power and potential selection bias. Evaluating the potential impact of linkage error on results is vital (Harron et al, 2014).
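The attenuating effect of false matches is easy to demonstrate by simulation. In the hypothetical sketch below, each record's true outcome depends on its exposure, but a fraction of records are falsely matched to the outcome of a random, unrelated record; the fitted slope shrinks towards the null by roughly the false-match rate. All names and parameter values are illustrative.

```python
import random

def attenuation_from_false_matches(n=10_000, false_match_rate=0.2, seed=1):
    """Simulate attenuation of an exposure-outcome slope by linkage error.
    True model: Y = X + noise. A false match replaces a record's Y with
    the outcome of a randomly chosen, unrelated record.
    Returns (slope with perfect linkage, slope with false matches)."""
    rng = random.Random(seed)
    x = [rng.gauss(0, 1) for _ in range(n)]
    y_true = [xi + rng.gauss(0, 1) for xi in x]
    y_linked = [yt if rng.random() > false_match_rate else rng.choice(y_true)
                for yt in y_true]

    def slope(xs, ys):  # ordinary least-squares slope
        mx, my = sum(xs) / n, sum(ys) / n
        sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
        sxx = sum((a - mx) ** 2 for a in xs)
        return sxy / sxx

    return slope(x, y_true), slope(x, y_linked)
```

With a 20% false-match rate the recovered slope sits around 0.8 rather than the true 1, illustrating why linkage error must be quantified rather than assumed away.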
Some relevant references:
Harron K, Wade A, Gilbert R, Muller-Pebody B, Goldstein H. Evaluating bias due to data linkage error in electronic healthcare records. BMC Medical Research Methodology, 2014, 14: 36. DOI: 10.1186/1471-2288-14-36.
Environmental epidemiology
In recent decades, the research community has made important steps forward in understanding the relationship between exposure to environmental factors and human health. Big data technologies offer the opportunity to extend this research further, for instance by making available high-resolution exposure data from remote sensing tools and real-time measurement from smartphone mobile applications, and by linking electronic health records including large collections of variables on health data and personal characteristics. However, this new setting requires the development of novel analytical methods for handling complex data structures and for modelling individual risk profiles with longitudinal measures on time-varying exposures, health outcomes and susceptibility factors. This new ‘big data’ framework can improve the analytical capability of environmental health studies and extend our knowledge on the complex pathways linking exposures to environmental stressors and human health.
Some relevant references:
Gasparrini A, et al. Mortality risk attributable to high and low ambient temperature: a multicountry observational study. Lancet. 2015; 386(9991): 369-375.
Di Q, Wang Y, et al. Air pollution and mortality in the Medicare population. N Engl J Med. 2017; 376(26): 2513-2522.
Routinely collected health data
Large-scale routinely collected health data provide unprecedented potential for population-based health research. The advantages of using these data include the low cost and timeliness of the research, greatly increased population coverage, and increased statistical power. The US Food and Drug Administration and European Medicines Agency now mandate the use of “real world evidence” of medication effects in drug licensing; in practice such real world evidence often comes from studies incorporating routinely collected health data.
Use of routinely collected data to establish causal relationships, however, raises a number of challenges. Information bias, due to low quality or missing information, remains an issue despite financial incentive schemes aimed at improving data quality such as the Quality Outcomes Framework in the UK. Linkage between data sources improves capture of key outcomes and exposures, but brings additional potential sources of bias. Confounding by poorly measured or unmeasured factors complicates the comparison between groups prescribed different medications. Related to this, the very reasons drugs are prescribed are often highly correlated with the outcomes we wish to study, and this “confounding by indication” remains a key challenge in pharmacoepidemiology.
Despite these challenges, there have been notable successes. For example, members of our theme used linked primary and secondary care data to replicate well known results from randomised trials regarding the effect of statins on vascular outcomes. The same study was used to demonstrate no association between statins and cancer; the absence of this link was later confirmed by randomised trials. Self-controlled designs have great potential to remove time-invariant confounding, and have been successfully implemented by our group to investigate a wide range of associations between vaccines/drugs and adverse outcomes.
Some relevant references:
Smeeth L, Douglas I, Hall AJ, Hubbard R, Evans S. Effect of statins on a wide range of health outcomes: a cohort study validated by comparison with randomised trials. Br J Clin Pharmacol. 2009; 67(1): 99-109.
Nutritional epidemiology
The effects of dietary intake (comprising many different foods and nutrients) on health are complex. Understanding specific effects requires accounting for the interactions among dietary exposures which, as is now recognised, should be analysed jointly to identify dietary patterns and thereby better summarise the effect of food intake on health. To this end, multivariate statistical methods such as principal component, cluster and factor analysis are needed.
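A dominant dietary pattern can be extracted as the first principal component of a foods-by-participants intake matrix. The sketch below finds it by power iteration on the covariance matrix, purely for illustration (the data and function are hypothetical; real analyses would use an established PCA implementation).

```python
def first_principal_component(rows, iters=200):
    """First principal component of the columns of `rows` (a list of
    equal-length numeric lists, e.g. one row per participant and one
    column per food), found by power iteration on the covariance matrix.
    In dietary-pattern analysis, the loadings show which foods vary together."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centred = [[r[j] - means[j] for j in range(p)] for r in rows]
    # sample covariance matrix of the columns
    cov = [[sum(centred[i][a] * centred[i][b] for i in range(n)) / (n - 1)
            for b in range(p)] for a in range(p)]
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]  # renormalise each iteration
    return v
```

Foods that rise and fall together across participants receive large loadings of the same sign, which is exactly how a "pattern" is read off the component.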
Because food intake is difficult to measure as an epidemiological exposure, measurement of dietary behaviour via (validated) metabolic biomarkers is increasingly pursued. This has linked nutritional epidemiology to the omics field and to the methodological issues characteristic of chemometric data obtained via Nuclear Magnetic Resonance (NMR) and high-throughput Mass Spectrometry (MS).
Additional big data complexities arise in nutritional surveys conducted through dietary diaries, which record eating occasions at different times for each individual. The resulting very large number of observations can, on the one hand, be used for data mining and hypothesis generation (e.g. on the context and timing of eating) through multivariate methods; on the other hand, it calls for methodological developments to accommodate the data's complex hierarchical structure.
Some relevant references:
Gleason PM, Boushey CJ, Harris JE, et al. Publishing nutrition research: a review of multivariate techniques. Part 3: data reduction methods. J Acad Nutr Diet. 2015; 115: 1072-1082.
Assi N, Moskat A, Slimani N, et al. A treelet transform analysis to relate nutrient patterns to the risk of hormonal receptor-defined breast cancer in the European Prospective Investigation into Cancer and Nutrition (EPIC). Public Health Nutr. 2016; 19(2): 242-5.
Chapman A, Beh E, Palla L. Application of correspondence analysis to graphically investigate associations between foods and eating locations. Studies in Health Technology and Informatics. 2017; 235: 166-170.
We recently held an introductory workshop, to discuss shared methodological interests across the school. Please see here for slides and audio recordings.
On 7 July we will be holding our Big Data Symposium. Slides and audio recordings will be available on this site after the event.
Through the year, we will organise a series of workshops and seminars aimed at bringing together researchers encountering methodological challenges in analysing big data, and methodologists with interests in relevant areas.