Imputation to replace missing data

Modified on Tue, 8 Dec, 2020 at 10:06 AM

What is the consequence of missing data?

If your dataset contains missing data, the multivariate analysis will be performed only on the complete cases of your dataset: patients with missing data will be removed from the analyses. This means that you will analyze only a subset of your patients, and lose both statistical power and representativeness.
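As a rough illustration of how quickly complete-case analysis can shrink a dataset, here is a minimal sketch in plain Python. The toy patients, the variable names and the helper complete_cases are hypothetical and are not part of EMS; None stands for a missing value.

```python
# Hypothetical toy dataset: None marks a missing value.
patients = [
    {"age": 64, "bmi": 27.1, "smoker": "yes"},
    {"age": 58, "bmi": None, "smoker": "no"},
    {"age": None, "bmi": 31.4, "smoker": "no"},
    {"age": 71, "bmi": 24.9, "smoker": None},
    {"age": 49, "bmi": 22.3, "smoker": "yes"},
]


def complete_cases(rows):
    """Keep only the rows in which no variable is missing."""
    return [row for row in rows if all(v is not None for v in row.values())]


kept = complete_cases(patients)

# Share of individual cells (patient x variable) that are missing.
n_cells = len(patients) * len(patients[0])
share_missing_cells = sum(v is None for row in patients for v in row.values()) / n_cells

print(f"{len(kept)} of {len(patients)} patients kept")
print(f"{share_missing_cells:.0%} of cells missing")
```

In this toy example only 3 of the 15 cells are missing, yet 3 of the 5 patients are dropped, because each missing value sits in a different patient.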

To avoid this loss, statistical techniques have been developed to replace missing data so that the whole dataset can still be used for the analyses. This is called imputation.

Can you tell me more about imputation?

A large set of techniques has been developed to impute missing data. From the simplest to the most complex:

replacement of missing data by the mean (or median) for numeric variables.
creation of a "missing data" modality for categorical variables.
k-nearest neighbor imputation: each missing value is replaced by a value derived from the k most similar patients (with k = 1, the value observed for the single most similar patient).
the hot deck method: each missing value is replaced by the value observed for a donor chosen at random among a set of similar patients.
multiple imputation: several imputed datasets are created, each one is analyzed separately, and the results are combined into a single set of estimates that accounts for the uncertainty of the imputed values.
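The simpler techniques in this list can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function names (impute_mean, add_missing_category, donor_impute), the toy data, the Euclidean distance and the use of the k nearest patients as the hot deck donor pool are all choices made for the example, not the method of any particular software. The predictors are assumed to be complete and on comparable scales.

```python
import random
from statistics import mean


def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]


def add_missing_category(values, label="missing"):
    """Turn missing values of a categorical variable into their own modality."""
    return [label if v is None else v for v in values]


def donor_impute(rows, target, predictors, k=1, rng=None):
    """Fill `target` with the value of a donor drawn from the k most similar patients.

    k=1 copies the single nearest patient (nearest-neighbor imputation);
    k>1 draws one donor at random among the k nearest (a simple hot deck).
    """
    rng = rng or random.Random(0)  # fixed seed so the sketch is reproducible
    donors = [r for r in rows if r[target] is not None]
    filled = []
    for row in rows:
        if row[target] is not None:
            filled.append(dict(row))
            continue
        # Rank donors by squared Euclidean distance on the predictors.
        ranked = sorted(
            donors, key=lambda d: sum((d[p] - row[p]) ** 2 for p in predictors)
        )
        donor = rng.choice(ranked[:k])
        filled.append({**row, target: donor[target]})
    return filled


# Hypothetical toy data: systolic blood pressure (sbp) is missing for one patient.
rows = [
    {"age": 50, "weight": 70, "sbp": 120},
    {"age": 52, "weight": 72, "sbp": None},
    {"age": 70, "weight": 90, "sbp": 150},
    {"age": 68, "weight": 88, "sbp": 145},
]

print(impute_mean([1.0, None, 3.0]))
print(add_missing_category(["yes", None, "no"]))
print(donor_impute(rows, "sbp", ["age", "weight"], k=1)[1])
print(donor_impute(rows, "sbp", ["age", "weight"], k=2)[1])
```

Many k-nearest neighbor implementations average the values of the k neighbors instead of copying a single donor. Multiple imputation is not sketched here, since it needs an imputation model and a pooling step.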

What is the most suitable option for my dataset?

The best way to deal with missing data is to avoid them by collecting your data as thoroughly as possible.

In reality, it is rarely possible to have a dataset made up of complete cases only. In multivariate analysis, you will therefore need to make some choices to be able to perform your analysis.

In EMS, if a variable has more than 50% of missing data, it will be automatically removed from the analysis.
If a variable has less than 10% of missing data, the patients with a missing value for that variable will be automatically removed from the analysis.
In other cases, EMS lets you choose to:

remove the variables with missing data
impute the missing data with the mean (numeric data) or with the most common modality (categorical data)
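The rules above can be summarized with a short sketch. It only mirrors the thresholds stated in this article and is not the actual EMS implementation: the function names (missing_rate, ems_style_action, impute_simple) and the returned labels are hypothetical, and a variable with exactly 10% or 50% of missing data is assumed to fall under "other cases".

```python
from collections import Counter
from statistics import mean


def missing_rate(values):
    """Proportion of missing values (None) in one variable."""
    return sum(v is None for v in values) / len(values)


def ems_style_action(values):
    """Return the handling described in the text for one variable."""
    rate = missing_rate(values)
    if rate == 0:
        return "nothing to do"
    if rate > 0.50:
        return "variable removed"
    if rate < 0.10:
        return "patients with missing values removed"
    return "your choice: remove the variable or impute"


def impute_simple(values):
    """Mean for numeric data, most common modality for categorical data."""
    observed = [v for v in values if v is not None]
    if all(isinstance(v, (int, float)) for v in observed):
        fill = mean(observed)
    else:
        fill = Counter(observed).most_common(1)[0][0]
    return [fill if v is None else v for v in values]


# Hypothetical variables with 25% of missing data each.
age = [60, None, 70, 50]
sex = ["F", "M", None, "F"]

print(ems_style_action(age))
print(impute_simple(age))
print(impute_simple(sex))
```

Here both variables fall between the two thresholds, so the choice between removing the variable and imputing it is left to you.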

It is generally accepted that if you have less than 10% of missing data in your dataset, imputing the incomplete cases will not improve the quality of the results compared with deleting them. There is not yet a clear guideline for dealing with missing data: each choice has to be made according to the specifics of the data and the research question.

For more details, see the following article:

Sharath S.E., Zamani N., Kougias P., Kim S. Missing data in surgical data sets: a review of pertinent issues and solutions. Journal of Surgical Research. 2018;232:240-246.