Missing Value Imputation

Simple linear regression can be used to impute missing values (NA) in a data set comprising correlated variables.

Understanding the dataset


# vector y has missing values we need to impute

x = c(1,2,3,4,5,6,7,8,9,10)
y = c(11,12,18,14,17,NA,NA,19,NA,27)
z = c(19,11,2,14,20,4,9,10,18,1)
w = c(1,4,7,10,3,5,7,6,6,9)


# create a data frame for the data set

data = data.frame(x,y,z,w)
data
#     x  y  z  w
# 1   1 11 19  1
# 2   2 12 11  4
# 3   3 18  2  7
# 4   4 14 14 10
# 5   5 17 20  3
# 6   6 NA  4  5
# 7   7 NA  9  7
# 8   8 19 10  6
# 9   9 NA 18  6
# 10 10 27  1  9

Find Correlation

To find the best predictor variable for y, examine the correlations between the variables.

# correlation between variables

cor(data)
#            x  y          z          w
# x  1.0000000 NA -0.2736766  0.5029477
# y         NA  1         NA         NA
# z -0.2736766 NA  1.0000000 -0.5276512
# w  0.5029477 NA -0.5276512  1.0000000

# since there are NA values in y, we cannot find its correlation with other variables
# we need to ignore the missing values for correlation

cor(data, use="complete.obs")
#            x          y          z          w
# x  1.0000000  0.9088508 -0.4794970  0.5427928
# y  0.9088508  1.0000000 -0.6931033  0.5575189
# z -0.4794970 -0.6931033  1.0000000 -0.6438960
# w  0.5427928  0.5575189 -0.6438960  1.0000000

# highest correlation of y is with x (0.909)

# we can also use symbols for correlation

symnum(cor(data,use="complete.obs"))
#     x y z w
#  x 1
#  y * 1
#  z . , 1
#  w . . , 1
#  attr(,"legend")
#  [1] 0 ‘ ’ 0.3 ‘.’ 0.6 ‘,’ 0.8 ‘+’ 0.9 ‘*’ 0.95 ‘B’ 1

# '*' indicates a correlation above 0.9 - the highest, between y and x

Fit the model

Since y is most highly correlated with x, we use the formula y~x to fit the linear model.


# fitting linear regression model of Y on X

lrm = lm(y~x, data = data)

Find the coefficients of the linear model

# print the linear model to find the coefficients

print(lrm)

# Call:
# lm(formula = y ~ x, data = data)
#
# Coefficients:
# (Intercept)            x
#       9.743        1.509

# Using the coefficients, the linear equation is
# y = 9.743 + 1.509*x
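
As a quick sanity check, we can evaluate the fitted equation by hand at one of the missing positions, x = 6:

```r
# evaluate the fitted line y = 9.743 + 1.509*x at x = 6
y_at_6 <- 9.743 + 1.509 * 6
y_at_6
# 18.797 - this matches the value predict() returns for row 6 below
```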

Predict y values using linear model

The predicted values can then be used to fill in the missing y values.


# predict y using linear model

y_pred = predict(lrm, newdata = data)

# compare the predicted values and the original values

data_compare = data.frame(y_pred,y)
data_compare

#      y_pred  y
# 1  11.25225 11
# 2  12.76126 12
# 3  14.27027 18
# 4  15.77928 14
# 5  17.28829 17
# 6  18.79730 NA
# 7  20.30631 NA
# 8  21.81532 19
# 9  23.32432 NA
# 10 24.83333 27
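
To complete the imputation, the NA entries of y are overwritten with the predictions at the same positions. A minimal end-to-end sketch, rebuilding the data and model from the steps above (only x and y are needed for the fit):

```r
# rebuild the data set and fit the model as in the steps above
x <- c(1,2,3,4,5,6,7,8,9,10)
y <- c(11,12,18,14,17,NA,NA,19,NA,27)
data <- data.frame(x, y)

lrm <- lm(y ~ x, data = data)
y_pred <- predict(lrm, newdata = data)

# impute: replace only the NA entries of y with their predicted values
missing <- is.na(data$y)
data$y[missing] <- y_pred[missing]
data$y
# 11.00000 12.00000 18.00000 14.00000 17.00000 18.79730 20.30631 19.00000 23.32432 27.00000
```

Indexing with `is.na(data$y)` ensures the observed values are left untouched; only rows 6, 7, and 9 receive fitted values.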