Practical Extension

Missing Data: Replacement

Replacing missing data is a complex topic, with many academics, statisticians and data scientists debating how it should be handled. As previously mentioned, you can:

  • Do nothing, thereby leaving in missing data
  • Remove missing data
  • Replace missing data

In the main section we examined how to identify and remove missing data. In this section we will explore how to replace missing data, and discuss what to replace it with.

Why Replace Missing Values?

Missing value replacement, or imputation, is important because, for many advanced statistical, machine learning and analytical techniques, missing values can cause issues which reduce the techniques' effectiveness. Furthermore, if an alternative method of handling missing values is selected (for example removal), this can significantly reduce the amount of available data, potentially limiting what can be concluded from the analysis.
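To make the cost of removal concrete, the sketch below uses a small made-up data frame (the values and the name toy are purely illustrative and are not taken from the WBD_1999 data) and compares the number of rows before and after listwise deletion with na.omit().

```r
# Toy data frame (hypothetical values) with NAs scattered over three columns
toy <- data.frame(
  gdp     = c(1.2, NA, 3.4, 2.8, NA, 4.1),
  lifeexp = c(70, 65, NA, 72, 68, 75),
  co2     = c(0.5, 0.7, 0.9, NA, 1.1, 1.3)
)

# Count the missing values in each column
missing_per_column <- colSums(is.na(toy))
print(missing_per_column)

# Listwise deletion keeps only the rows with no missing values at all
toy_complete <- na.omit(toy)

# Compare the number of rows before and after removal
rows_before <- nrow(toy)
rows_after  <- nrow(toy_complete)
print(c(before = rows_before, after = rows_after))
```

In this toy case only 4 of the 18 cells are missing, yet 4 of the 6 rows are lost, because each missing value sits in a different row.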

To mitigate this impact, an appropriate imputation strategy must be selected. If a poor method is selected, this can further reduce the effectiveness of the techniques. There is a wide range of imputation techniques, from the simple, classical replacement of missing values with a measure of central tendency for the variable, to more complex methods which predict individual missing values using techniques such as neural networks or regression.

Furthermore, replacement of missing values is also important given the potential reasons why a value is missing. These are typically grouped into three types:

  • Missing Completely at Random (MCAR): The probability of an observation being missing is the same for all observations; missingness is not the product of any factor within the data.
  • Missing at Random (MAR): The probability of an observation being missing is related to a factor defined by the observed data.
  • Missing Not at Random (MNAR): The probability of an observation being missing is related to a factor unknown to us (the analyst), such as the unobserved value itself.

Understanding which of these three types your missing data falls into can be extremely important for selecting the correct method of imputation, or for handling missing data more generally. Due to the complexity of this topic, I would highly suggest exploring it further independently; the textbook Flexible Imputation of Missing Data by van Buuren provides a useful starting place for this exploration.
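The three types of missingness can be illustrated with a small simulation. This is only a sketch under stated assumptions: the variables age and income, the 20% missingness rate and the logistic missingness probabilities are all invented for illustration and do not come from the data used in this series.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible
n <- 1000

# Simulated (hypothetical) data: age is always observed, income will receive NAs
age    <- rnorm(n, mean = 40, sd = 10)
income <- 20 + 0.5 * age + rnorm(n, sd = 5)

# MCAR: every observation has the same 20% chance of being missing
income_mcar <- ifelse(runif(n) < 0.2, NA, income)

# MAR: the chance of being missing depends on an observed variable (age)
p_mar      <- plogis((age - 40) / 5)
income_mar <- ifelse(runif(n) < p_mar, NA, income)

# MNAR: the chance of being missing depends on the unobserved income value itself
p_mnar      <- plogis((income - 40) / 5)
income_mnar <- ifelse(runif(n) < p_mnar, NA, income)

# Compare the mean of the observed values against the mean of the full data
observed_means <- c(
  full = mean(income),
  mcar = mean(income_mcar, na.rm = TRUE),
  mar  = mean(income_mar,  na.rm = TRUE),
  mnar = mean(income_mnar, na.rm = TRUE)
)
print(round(observed_means, 2))
```

Under MCAR the observed mean should stay close to the full-data mean, whereas under the MAR and MNAR set-ups used here the higher incomes are more likely to be missing, so the observed mean is pulled downwards.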

Imputation Method 1: Measures of Central Tendency

As a method of imputation, this is arguably one of the simplest. You directly replace all missing values for a specific variable with the selected measure of central tendency (for example the mean, median or mode) for that variable.

For example, you could choose to calculate the mean for a given variable, and apply that mean to all missing observations for that variable.

There are, however, several issues with the use of measures of central tendency for replacement. Primarily, this method assumes that the data is not missing for a specific reason (that is, that it is MCAR). If the data was in fact missing for a specific reason, for example membership of a specific group, this replacement can reduce the quality of the information.

Additionally, this has the potential to introduce bias into the resulting model. For example, if this was applied throughout, the model produced would be more biased towards typical values of the data, rather than towards more unusual or novel data. To address this, multiple other techniques have been developed (see later sections).
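One way to see this pull towards typical values is to compare the spread of a variable before and after mean imputation. The sketch below uses simulated data (the vector x and the choice of 60 missing values out of 200 are assumptions made purely for illustration).

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Simulated (hypothetical) variable, with 60 of its 200 values set to missing
x         <- rnorm(200, mean = 10, sd = 3)
x_missing <- x
x_missing[sample(200, 60)] <- NA

# Mean imputation: every NA receives the mean of the observed values
x_imputed <- x_missing
x_imputed[is.na(x_imputed)] <- mean(x_missing, na.rm = TRUE)

# The mean is unchanged, but the standard deviation shrinks,
# because every imputed value sits exactly at the centre of the data
sd_observed <- sd(x_missing, na.rm = TRUE)
sd_imputed  <- sd(x_imputed)
print(c(sd_observed = sd_observed, sd_imputed = sd_imputed))
```

The imputed variable therefore appears less variable than the observed data suggests, which is one route by which bias can enter later models.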

Code Example 1: Replacement

We can apply measures of central tendency directly by replacing missing values with the desired value.

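## Firstly, calculate the value that will replace the missing values
  ## For this example, the mean of ed.years (ignoring missing values)

ed.years.mean <- mean(WBD_1999$ed.years, na.rm = TRUE)

## Secondly, replace the missing values with that value
  ## is.na() is TRUE for each missing observation, so only those are overwritten
WBD_1999$ed.years[is.na(WBD_1999$ed.years)] <- ed.years.mean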

This technique can be applied with any measure of central tendency, or with a specific replacement value (for example zero). This direct replacement is useful not only for single columns but also for entire data frames, through looping functions. These will be further explored in later sessions of this series.
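As a brief preview of applying direct replacement across a whole data frame, the sketch below loops over the numeric columns with lapply() and fills each with its own median. The data frame toy and the helper impute_with are hypothetical names introduced only for this illustration.

```r
# Toy data frame (hypothetical values); the column names are illustrative only
toy <- data.frame(
  country  = c("A", "B", "C", "D", "E"),
  ed.years = c(8, NA, 12, 10, NA),
  lifeexp  = c(70, 65, NA, 72, 68),
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace the NAs in a numeric vector with a summary
# value calculated from its observed values (the median by default)
impute_with <- function(x, fun = median) {
  x[is.na(x)] <- fun(x, na.rm = TRUE)
  x
}

# Identify the numeric columns, then apply the helper to each of them
numeric_cols <- sapply(toy, is.numeric)
toy[numeric_cols] <- lapply(toy[numeric_cols], impute_with, fun = median)

print(toy)
```

Passing fun = mean instead would give mean imputation; non-numeric columns such as country are left untouched.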

Imputation Method 2: Predicting Regression Values

To populate missing values with those predicted through statistical modelling (including regression), a model must first be built, before the predicted values are used to replace the missing values. Building this model can be challenging, as a high R-squared (and/or adjusted R-squared) should be achieved to ensure that the dependent variable (the one you are predicting) is sufficiently well represented by the independent variable(s) provided.

This can cause some additional complexity: if the final output variable is later predicted using a variable whose values you have generated through this process, an argument for multicollinearity could be made. However, if steps are taken to ensure that the variables used to predict the missing values are not used in the final model, this assumption can still be upheld.
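Before using a model for imputation, its R-squared and adjusted R-squared can be read from the model summary. The sketch below uses simulated data (the data frame toy, its coefficients and the 15 missing values are assumptions for illustration only), and what counts as a sufficiently high value remains a judgement for the analyst.

```r
set.seed(7)  # arbitrary seed so the sketch is reproducible
n <- 100

# Simulated (hypothetical) data in which co2 depends on gdp
gdp     <- runif(n, min = 1, max = 50)
lifeexp <- 50 + 0.5 * gdp + rnorm(n, sd = 3)
co2     <- 0.2 * gdp + rnorm(n, sd = 1)
co2[sample(n, 15)] <- NA
toy <- data.frame(gdp, lifeexp, co2)

# Fit the imputation model; lm() drops the rows where co2 is missing
fit <- lm(co2 ~ gdp + lifeexp, data = toy)

# Extract the fit statistics from the model summary
fit_summary <- summary(fit)
r2     <- fit_summary$r.squared
adj_r2 <- fit_summary$adj.r.squared
n_used <- nobs(fit)  # number of rows actually used to fit the model

print(c(r.squared = r2, adj.r.squared = adj_r2, rows.used = n_used))
```

Checking nobs() alongside R-squared is a useful habit here, as it shows how many complete rows the imputation model was actually built on.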

Code Example 2: Regression

Imputing missing values with those derived from a regression equation can be achieved similarly to direct replacement, except that an individual predicted value must be applied to each observation where missingness is present. Because the variables in this data are not strongly related, I would not suggest using this method on this data in reality; however, for the purpose of an example this is sufficient!

## Firstly, generate a regression model
  ## Note that lm() drops any rows with missing values when fitting
reg_1 <- lm(data = WBD_1999,
            formula = co2 ~ gdp + Pop + Continent + lifeexp)

## Next apply the predicted values to those missing in the data (in this case co2 emission)
  ## Using the predict function to predict the outcome co2 values
  ## newdata = WBD_1999 returns one prediction per row of the data,
  ## so that prediction i lines up with row i
      reg_1.pred <- predict(reg_1, newdata = WBD_1999)

  ## Replace the missing values with those predicted using a for-loop with if conditions
  ## (column 18 is assumed to hold co2)
      for(i in 1:nrow(WBD_1999)){
        if(is.na(WBD_1999[i,18])){
          WBD_1999[i,18] <- reg_1.pred[i]
        }
      }
  ## This will replace each NA value in the column with its corresponding predicted value.
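The same replacement can also be written without a loop, by using a logical index in the same way as in Code Example 1. The sketch below runs on simulated data (the data frame toy and its values are assumptions for illustration), so it can be tried without the WBD_1999 data.

```r
set.seed(3)  # arbitrary seed so the sketch is reproducible
n <- 50

# Simulated (hypothetical) data: co2 depends on gdp, with 10 values set to missing
gdp <- runif(n, min = 1, max = 50)
co2 <- 0.2 * gdp + rnorm(n, sd = 1)
co2[sample(n, 10)] <- NA
toy <- data.frame(gdp, co2)

# Fit the model on the complete rows (lm() drops the rows where co2 is NA)
fit <- lm(co2 ~ gdp, data = toy)

# Predict for every row; supplying newdata gives one prediction per row of toy,
# whereas predict(fit) alone would only return values for the rows used in fitting
pred <- predict(fit, newdata = toy)

# Logical index of the rows to fill, then replace them all in one step
to_fill <- is.na(toy$co2)
toy$co2[to_fill] <- pred[to_fill]

print(sum(to_fill))     # number of values that were imputed
print(anyNA(toy$co2))   # FALSE once every gap has been filled
```

If a predictor is itself missing for a row, its prediction will also be NA, so the result is worth checking with anyNA() after the replacement.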

Imputation Method 3: Use of Alternative Statistical Functions

Given the complexity of this area, multiple statistical methods have been developed to handle missing data directly. Due to the complexity of these techniques, R packages have been built to handle their application. A list of these methods and their associated packages is provided below. As each package works differently, I would advise carrying out your own research to see how these can be applied and which is most suitable to your situation.