Data reduction methods

Theme Co-ordinators: George Ploubidis, Anna Goodman

It’s a jungle out there!

Epidemiologists and population health scientists often have to deal with large, complex datasets that cannot be analysed in a straightforward manner – just imagine a regression model with fifty predictors!

Data reduction methods offer researchers a set of analytical tools for deriving meaningful summaries from large datasets. These summaries can be expressed as continuous or categorical variables (i.e. typologies), can be derived from cross-sectional or longitudinal data, and can then be used in further analyses.

Traditionally, data reduction has used methods such as Principal Components Analysis and Cluster Analysis (e.g. as in [1]). However, recent developments within the Generalised Latent Variable Modelling framework allow researchers to obtain summaries corrected for measurement error, to derive information about the measurement properties of their data, and to test specific theory-driven hypotheses [2]. Furthermore, latent variable measurement models are not only useful for data reduction, but can also be incorporated into the outcome model of interest within a structural equation modelling framework.
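As a rough illustration of the traditional approach, the sketch below reduces six correlated indicators to a single continuous summary by taking scores on the first principal component. It is a minimal example under stated assumptions, not the procedure used in [1]: the data are simulated, the function name first_component_scores is hypothetical, and only NumPy is assumed to be available.

```python
import numpy as np


def first_component_scores(X):
    """Scores, weights and variance share for the first principal component.

    Hypothetical helper for illustration: columns of X are standardised,
    so the decomposition is of the correlation matrix.
    """
    X = np.asarray(X, dtype=float)
    # Standardise each indicator to mean 0 and unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    # Correlation matrix of the indicators.
    R = np.corrcoef(Z, rowvar=False)
    # eigh returns eigenvalues in ascending order, so the last is the largest.
    eigvals, eigvecs = np.linalg.eigh(R)
    weights = eigvecs[:, -1]
    # The sign of an eigenvector is arbitrary; orient it so weights sum to > 0.
    if weights.sum() < 0:
        weights = -weights
    scores = Z @ weights
    explained = eigvals[-1] / eigvals.sum()
    return scores, weights, explained


# Simulated data (an assumption for this sketch): six noisy indicators
# of one underlying trait, observed for 500 individuals.
rng = np.random.default_rng(0)
n = 500
latent = rng.normal(size=n)
X = np.column_stack(
    [latent + rng.normal(scale=0.5, size=n) for _ in range(6)]
)

scores, weights, explained = first_component_scores(X)
print("share of variance explained:", round(float(explained), 2))
```

The resulting scores could then be entered as a single predictor in place of the six original variables. Note that a summary of this kind is not corrected for measurement error in the indicators, which is one motivation for the latent variable approach described above.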

Selected references

[1] Filmer D, Pritchett LH. Estimating wealth effects without expenditure data – or tears: an application to educational enrollments in states of India. Demography 2001;38:115-32.

[2] Rabe-Hesketh S, Skrondal A. Classical latent variable models for medical research. Statistical Methods in Medical Research 2008;17:5-32.