“The idea of imputation is both seductive and dangerous” (R.J.A. Little & D.B. Rubin).
Indeed, with single imputation a predicted value is treated as if it had been observed, and the uncertainty of the prediction is ignored. Standard errors are then underestimated, which leads to poor inference in the presence of missing values. That is why multiple imputation is recommended.
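To make the point concrete, here is a minimal base R sketch, not taken from the original post. It simulates a variable with missing values, imputes it several times, and pools the estimates of the mean with Rubin's rules. The imputation model (normal draws around bootstrap estimates) and all variable names are illustrative assumptions and are much cruder than what missMDA does.

```r
set.seed(1)

# Simulated data (an assumption for illustration): 100 values, 30 of them missing
n <- 100
y <- rnorm(n, mean = 10, sd = 2)
y[sample(n, 30)] <- NA
obs  <- y[!is.na(y)]
miss <- is.na(y)

# Single mean imputation: the filled-in values are treated as observed
y_single <- y
y_single[miss] <- mean(obs)
naive_var <- var(y_single) / n   # variance of the mean, ignoring imputation uncertainty

# Multiple imputation: M completed datasets, each from a bootstrap of the observed values
M <- 50
est    <- numeric(M)   # estimate of the mean in each completed dataset
within <- numeric(M)   # its estimated variance in each completed dataset
for (m in seq_len(M)) {
  boot <- sample(obs, replace = TRUE)               # reflects parameter uncertainty
  y_imp <- y
  y_imp[miss] <- rnorm(sum(miss), mean(boot), sd(boot))
  est[m]    <- mean(y_imp)
  within[m] <- var(y_imp) / n
}

# Rubin's rules: pooled estimate and total variance
q_bar     <- mean(est)
W         <- mean(within)          # within-imputation variance
B         <- var(est)              # between-imputation variance
total_var <- W + (1 + 1 / M) * B

c(naive = naive_var, pooled = total_var)
```

The between-imputation term `B` is exactly what single imputation leaves out, so the pooled variance is larger than the naive one.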
The missMDA package quickly generates several imputed datasets for quantitative and/or categorical variables. It is based on dimensionality reduction methods such as principal component analysis (PCA) for continuous variables and multiple correspondence analysis (MCA) for categorical variables. Compared to the packages Amelia and mice, it better handles cases where the number of variables is larger than the number of units, and cases where regularization is needed (i.e. when the imputation model is prone to overfitting). For categorical variables, it is particularly useful when there are many variables and many levels, and also when some levels are rare.
With 3 lines of code, we generate 1000 imputed datasets for the quantitative orange data available in missMDA:
library(missMDA)
data(orange)
nbdim
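The snippet above is cut off in this copy of the post. As a hedged reconstruction (an assumption, not the authors' exact code), the three steps could look like the sketch below: `estim_ncpPCA` estimates the number of components, and `MIPCA` generates the imputed datasets. The object name `res.comp` is hypothetical.

```r
library(missMDA)
data(orange)

# Estimate the number of PCA dimensions used for imputation
# (default settings; the original post may have used other arguments)
nbdim <- estim_ncpPCA(orange)

# Generate 1000 imputed datasets with the estimated number of dimensions
res.comp <- MIPCA(orange, ncp = nbdim$ncp, nboot = 1000)

# The imputed datasets are stored as a list in res.MI
length(res.comp$res.MI)
```

Each element of `res.comp$res.MI` is a completed version of `orange`, so any analysis can be run on each of them and the results pooled afterwards.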
In the same way, MIMCA can be used for categorical data:
library(missMDA) data(vnf) nb
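This snippet is also cut off in this copy. A minimal sketch of what the categorical case might look like is given below. It is an assumption rather than the authors' code: `ncp = 4` and `nboot = 10` are illustrative values, and `res.mimca` is a hypothetical name. The number of dimensions could instead be estimated with `estim_ncpMCA`, which works by cross-validation and can be slow on this dataset.

```r
library(missMDA)
data(vnf)

# Number of MCA dimensions: fixed here for illustration (assumed value);
# estim_ncpMCA(vnf) could be used to estimate it, at a higher computing cost
ncp_assumed <- 4

# Generate 10 imputed categorical datasets (illustrative number of imputations)
res.mimca <- MIMCA(vnf, ncp = ncp_assumed, nboot = 10)

# As with MIPCA, the imputed datasets are stored as a list in res.MI
length(res.mimca$res.MI)
```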
You can also watch this playlist on YouTube to practice with R.