Aurora Tsai Carnegie Mellon University
December 22, 2017 aurorat-at-andrew.cmu.edu



This tutorial introduces several techniques for conducting multiple imputation analysis (MIA).

Parts are adapted from the following sites:




Contents:


1. Introduction
2. Visualizing Missing Data
3. MIA with MICE (Multivariate Imputation by Chained Equations)
4. MIA with the missForest package
5. MIA with the Hmisc package
6. References & other resources


Introduction

 

We might have to work with incomplete data sets for any number of reasons. Participants may have skipped questions in a survey or assessment. Perhaps we designed our computer-mediated assessment to randomly select 50 questions from a pool of 60 for each test taker, so every test taker is missing responses to 10 questions by design. Whatever the reason, the analysis has to account for the missing values.

Multiple Imputation Analysis (MIA) (Little & Rubin, 2002) is a method for filling in missing observations. It accounts for the uncertainty about the unknown true values by imputing M plausible values for each unobserved response. This yields M different versions of the data set, in which the observed entries are identical but the imputed entries differ. Discarding all partially observed data units (e.g., through listwise deletion) is generally not recommended because it can lead to substantial bias and poor predictions (Ambler, Omar, & Royston, 2007).

It’s important to know how much missing data we have and how it is spread across our data. Data is considered Missing Completely at Random (MCAR) if “the propensity to observe a missing value in an item is unrelated to a) the value of the item itself and to other items; b) to the latent trait values; and c) to any other measured variables in the analysis” (Sulis & Porcu, 2017, p. 331). In other words, data is MCAR if no variables influence which observations are missing. Data would not be MCAR if, for example, participants skip a question because of its nature or because it’s too difficult, or if participants in class B weren’t given the question. Data is Missing at Random (MAR) if the probability that a value is missing depends only on observed data.
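The difference between MCAR and MAR can be made concrete with a small simulation. The sketch below uses base R only; the variable names (x, y, y_mcar, y_mar) and the missingness probabilities are assumptions chosen for illustration, not part of any package.

```r
set.seed(1)
n <- 1000
x <- rnorm(n)            # fully observed covariate
y <- 2 * x + rnorm(n)    # variable that will receive missing values

# MCAR: every value of y has the same 20% chance of being missing
y_mcar <- ifelse(runif(n) < 0.2, NA, y)

# MAR: the chance that y is missing depends only on the observed x
p_missing <- plogis(-1.5 + 1.5 * x)
y_mar <- ifelse(runif(n) < p_missing, NA, y)

# Mean of x for rows where y is observed (FALSE) vs. missing (TRUE)
mean_x_mcar <- tapply(x, is.na(y_mcar), mean)
mean_x_mar  <- tapply(x, is.na(y_mar), mean)
mean_x_mcar
mean_x_mar
```

Under MCAR the two group means of x should be close to each other, while under MAR the rows with missing y should have a clearly higher mean of x, because larger x values were made more likely to lose their y.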



Preparing Data

For this tutorial, make sure you install and load the following packages:

library(missForest)
library(Hmisc)
library(mice)
library(VIM)
library(rms)
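The library calls above only load packages that are already installed. As a minimal sketch (the names required and to_install are illustrative), the following base R snippet checks which of the five packages are still missing; the install step is left commented out so nothing is downloaded unintentionally.

```r
# Packages used in this tutorial
required <- c("missForest", "Hmisc", "mice", "VIM", "rms")

# Which of them are not yet in the local library?
to_install <- setdiff(required, rownames(installed.packages()))

if (length(to_install) > 0) {
  message("Not yet installed: ", paste(to_install, collapse = ", "))
  # install.packages(to_install)  # uncomment to install from CRAN
}
```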


Sample Data iris

Let’s work with the iris sample data set in R.

data <- iris
head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa


Now let’s randomly add missing values using the prodNA function from missForest.

#Produce NAs in 10% of the data
iris.mis <- prodNA(iris, noNA = 0.1)
head(iris.mis)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3          NA  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0          NA          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa
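To see what removing 10% of the cells completely at random involves, here is a base R sketch of the same idea. The helper add_na_mcar is hypothetical and written for this illustration; it is not the missForest implementation of prodNA.

```r
set.seed(123)

# Hypothetical helper: set a given proportion of all cells to NA,
# choosing the cells uniformly at random (i.e., MCAR by construction)
add_na_mcar <- function(df, prop = 0.1) {
  n_cells <- nrow(df) * ncol(df)
  idx  <- sample(n_cells, size = round(prop * n_cells))
  rows <- (idx - 1) %% nrow(df) + 1    # row index of each chosen cell
  cols <- (idx - 1) %/% nrow(df) + 1   # column index of each chosen cell
  for (j in unique(cols)) {
    df[rows[cols == j], j] <- NA
  }
  df
}

iris_na <- add_na_mcar(iris, prop = 0.1)
sum(is.na(iris_na))   # number of cells set to NA
```

Because iris has 150 rows and 5 columns, 10% of the 750 cells is 75, which matches the total of 75 missing values in iris.mis.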


Visualizing Missing Data

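We can tabulate the missing-data patterns with the md.pattern function from the mice package. Each row is one pattern (1 = observed, 0 = missing); the left-most number is how many rows of the data show that pattern, the right-most number is how many variables are missing in it, and the bottom row gives the number of missing values per variable.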

md.pattern(iris.mis)
   Sepal.Length Species Sepal.Width Petal.Length Petal.Width   
90            1       1           1            1           1  0
 5            0       1           1            1           1  1
11            1       1           0            1           1  1
 7            1       1           1            0           1  1
16            1       1           1            1           0  1
 6            1       0           1            1           1  1
 1            0       1           0            1           1  2
 2            0       1           1            0           1  2
 3            1       1           0            0           1  2
 1            0       1           1            1           0  2
 2            1       1           1            0           0  2
 2            0       0           1            1           1  2
 1            1       0           0            1           1  2
 2            1       0           1            0           1  2
 1            1       0           1            1           0  2
             11      12          16           16          20 75
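In the table above, only 90 of the 150 rows are fully observed, so listwise deletion would discard 60 rows even though only 10% of the cells are missing. The following base R sketch shows how such pattern counts can be computed by hand; the data frame toy and its values are made up for illustration.

```r
# Small made-up data frame with a few missing cells
toy <- data.frame(
  a = c(1, NA, 3, 4, NA),
  b = c("x", "y", NA, "z", "w"),
  c = c(TRUE, FALSE, TRUE, NA, TRUE)
)

# Logical matrix: TRUE where a value is observed
observed <- !is.na(toy)

# Encode each row as a string of 1s (observed) and 0s (missing)
pattern <- apply(observed, 1, function(r) paste(as.integer(r), collapse = ""))

# How often does each pattern occur?
pattern_counts <- table(pattern)
pattern_counts

# Rows that would survive listwise deletion
n_complete <- sum(complete.cases(toy))
n_complete
```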


Alternatively, we can also find the number of NAs for each variable using sapply:

sapply(iris.mis, function(x) sum(is.na(x)))
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
          11           16           16           20           12 


Use the aggr function from the VIM package to visualize missing data.

miss_plot <- aggr(iris.mis, col=c('navyblue','yellow'),
                    numbers=TRUE, sortVars=TRUE,
                    labels=names(iris.mis), cex.axis=.7,
                    gap=3, ylab=c("Missing data","Pattern"))
not enough horizontal space to display frequencies

 Variables sorted by number of missings: 
     Variable      Count
  Petal.Width 0.13333333
  Sepal.Width 0.10666667
 Petal.Length 0.10666667
      Species 0.08000000
 Sepal.Length 0.07333333
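Although the column in the output above is labelled Count, the values appear to be proportions of missing rows per variable (for example, 20/150 = 0.1333 for Petal.Width). As a quick check, this sketch recomputes them from the NA counts reported earlier by sapply; the names na_counts, n_rows and na_props are illustrative.

```r
# NA counts per variable, copied from the sapply output earlier in this section
na_counts <- c(Sepal.Length = 11, Sepal.Width = 16, Petal.Length = 16,
               Petal.Width = 20, Species = 12)
n_rows <- 150  # number of rows in iris

# Proportion of missing values per variable, largest first
na_props <- sort(na_counts / n_rows, decreasing = TRUE)
round(na_props, 4)

# On the data frame itself, the same proportions would come from:
# colMeans(is.na(iris.mis))
```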


Use marginplot (also from VIM) to visualize missing data for the Sepal.Length and Sepal.Width variables:

marginplot(iris.mis[c(1,2)])



MICE

Multivariate Imputation by Chained Equations (MICE)

To impute missing values with MICE, we use the mice function from the mice package. Depending on how big your data set is, this can take a while (from 30 seconds to a few hours), so be prepared to wait.

imputed_Data <- mice(iris.mis, m=5, maxit = 50, method = 'pmm', seed = 500)

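m: the number of imputed data sets to generate (5 is the default; each data set combines the original observed values with one set of imputed values)
maxit: the number of iterations of the chained-equations algorithm run for each imputation (the default is 5)
method: the imputation method; 'pmm' stands for predictive mean matching
seed: an integer used to seed the random number generator so that the imputations are reproducible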
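To give a feel for predictive mean matching (the 'pmm' method), here is a deliberately simplified base R sketch for a single variable with one fully observed predictor. The function pmm_sketch is hypothetical and written for this illustration: it omits the random draw of regression parameters that mice performs, and the choice of 5 donors is an assumption.

```r
set.seed(500)

# Hypothetical, simplified predictive mean matching for one variable y
# using one fully observed predictor x
pmm_sketch <- function(y, x, donors = 5) {
  obs <- which(!is.na(y))
  mis <- which(is.na(y))
  fit  <- lm(y ~ x)                                  # rows with NA in y are dropped
  pred <- predict(fit, newdata = data.frame(x = x))  # predicted mean for every row
  for (i in mis) {
    # observed rows whose predicted mean is closest to that of row i
    nearest <- obs[order(abs(pred[obs] - pred[i]))[seq_len(donors)]]
    # borrow the observed value of one randomly chosen donor
    y[i] <- y[nearest[sample.int(donors, 1)]]
  }
  y
}

# Remove 20 Petal.Width values at random, then impute them from Petal.Length
y <- iris$Petal.Width
y[sample(length(y), 20)] <- NA
y_imp <- pmm_sketch(y, iris$Petal.Length)
summary(y_imp)
```

Because every imputed value is copied from an observed donor, the imputations always stay within the set of values that actually occur in the data.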

We can get a summary of the data here:

summary(imputed_Data)
Multiply imputed data set
Call:
mice(data = iris.mis, m = 5, method = "pmm", maxit = 50, seed = 500)
Number of multiple imputations:  5
Missing cells per column:
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
          11           16           16           20           12 
Imputation methods:
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
       "pmm"        "pmm"        "pmm"        "pmm"        "pmm" 
VisitSequence:
Sepal.Length  Sepal.Width Petal.Length  Petal.Width      Species 
           1            2            3            4            5 
PredictorMatrix:
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length            0           1            1           1
Sepal.Width             1           0            1           1
Petal.Length            1           1            0           1
Petal.Width             1           1            1           0
Species                 1           1            1           1
             Species
Sepal.Length       1
Sepal.Width        1
Petal.Length       1
Petal.Width        1
Species            0
Random generator seed value:  500 


To check all 5 sets of imputed values for a given variable (such as Sepal.Width), run the following:

imputed_Data$imp$Sepal.Width
      1   2   3   4   5
5   3.4 3.6 3.1 3.8 3.5
11  3.5 3.7 4.1 3.7 3.7
28  3.4 3.7 3.0 4.1 3.8
43  3.2 3.1 3.1 3.1 3.1
52  2.9 2.9 3.4 3.4 3.0
56  2.9 2.7 3.0 2.5 2.6
57  3.0 2.9 2.8 2.7 2.7
66  3.0 3.4 3.0 2.8 3.1
86  3.4 2.8 2.7 3.2 2.9
100 3.0 2.6 2.6 3.0 2.9
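Each column of this table belongs to one of the imputed data sets. As a minimal sketch, assuming the mice package is installed, the complete function from mice fills the imputed values back into the data. The example builds its own small incomplete data set so that it runs on its own; the names iris_na, imp, completed_1 and completed_long are illustrative.

```r
library(mice)

# Self-contained incomplete data: remove 16 Sepal.Width values at random
set.seed(1)
iris_na <- iris
iris_na$Sepal.Width[sample(nrow(iris_na), 16)] <- NA

# Impute with the same method as above (fewer iterations to keep it quick)
imp <- mice(iris_na, m = 5, maxit = 5, method = "pmm",
            seed = 500, printFlag = FALSE)

# First completed data set: observed values plus imputation number 1
completed_1 <- complete(imp, 1)

# All five completed data sets stacked, with .imp identifying the imputation
completed_long <- complete(imp, "long")
table(completed_long$.imp)
```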


Ways to visualize imputed & observed data:

Plot Sepal.Width against Sepal.Length and Petal.Width

xyplot(imputed_Data,Sepal.Width ~ Sepal.Length + Petal.Width,pch=18,cex=1)