How to Prevent Data Leakage when Engineering Historical Sum and Rate Features for Predictive Modeling

By James Marquez, May 14, 2017

Data leakage occurs when the data used to train a machine learning algorithm includes information about the outcome variable, the "ground truth," that would not be available at prediction time. This leads to overly optimistic results when testing the model and poor performance in real-world applications. The authors of the "Handbook of Statistical Analysis and Data Mining Applications" list data leakage among the top ten machine learning mistakes.

A popular Kaggle article lists common causes of data leakage:

  • Leaking the correct prediction or ground truth into the test data.
  • Leaking of information from the future into the past.
  • Leaking test data into the training data.
  • Retaining proxies for removed variables a model is restricted from knowing.
  • Reversing of intentional obfuscation, randomization or anonymization.
  • Inclusion of data not present in the model's operational environment.
  • Distorting information from samples outside of scope of the model's intended use.
  • Any of the above present in third party data joined to the training set.

The problem we'll use in our example is common to every medical clinic: predicting which patients will not show up for their appointments. We'll create two features: each patient's historical no-show rate over a one-year range and the sum of their appointments over the same range. Two issues can cause data leakage when creating these features:

  1. The no-show rate feature is built from the "no-show" response variable, so it must not include the current observation. Doing so would leak the ground truth into the training set.
  2. Current and future observations must not be included in the sums or rates. For example, assigning the sum of all of a patient's appointments to every observation of that patient would leak future information into the training set that would not be available at prediction time. Each observation's sum and rate should include only past dates, constrained to a specified date range, and exclude the current observation.

A third issue is the need to use the same date range for all patients. Some patients have less medical history than others: one patient's no-show rate might be calculated from two years of appointment history while another's comes from only one year. We must constrain our historical sums and rates to the same range for every observation so the algorithm can generalize to new, unseen patients. We'll use a rolling window for every observation that looks back 365 days.
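To make the contrast concrete, here's a minimal sketch of the leaky shortcut versus the leak-free target on a hypothetical toy patient (LeakySum and CleanSum are illustrative names, not part of the dataset we build below):

library(data.table)

# A toy patient with three visits (hypothetical data)
toy <- data.table(PatientID = "Pt_A",
                  Date = as.IDate(c("2016-01-10", "2016-03-05", "2016-06-20")),
                  Noshow = c(1, 0, 1))

# Leaky: every row receives the patient's grand total, so each row
# "knows" its own outcome and all future visits
toy[, LeakySum := sum(Noshow), by=PatientID]   # 2, 2, 2

# Leak-free: count only strictly earlier visits within the past 365 days
toy[, CleanSum := sapply(Date, function(d)
        sum(Noshow[Date < d & Date >= d - 365])), by=PatientID]   # 0, 1, 1

The sapply version is quadratic in the number of visits per patient; it's fine for illustration, but the rollSum function we build below computes the same kind of look-back sum more efficiently with zoo's rolling-window machinery.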

Creating Our Dataset

We're going to use three packages in our example: data.table, xts, and zoo. We'll generate random dates and 0/1 values for our Noshow label. We'll use the data.table package to calculate the sums for each patient, and the make.index.unique function from the xts package to make our dates unique if there are duplicates. It increments duplicate timestamps by a tiny epsilon (one microsecond by default, controlled by its eps argument). This is important because the zoo constructor requires a unique index.
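Here's a quick sketch of what make.index.unique does with a duplicated timestamp:

library(xts)

# Two appointments logged at the identical timestamp
d <- as.POSIXct(c("2016-05-01 09:00:00", "2016-05-01 09:00:00"))
u <- make.index.unique(d)

# The duplicate is nudged forward by a tiny epsilon, so the two
# timestamps now differ (by one microsecond with the default eps)
diff(as.numeric(u))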

In [281]:
library(data.table) 
library(xts) # Required for the make.index.unique function

# Create our features
PatientID <- c(rep("Pt_1", 6), rep("Pt_2", 10))
Date <- make.index.unique(as.POSIXct(sample(seq(as.Date('2015-01-01'), as.Date('2017-01-01'), by="day"), 16)))
Apt <- rep(1, 16)

# Create our label
Noshow <- sample(c(0, 1), 16, replace=TRUE)

# Create data table
train <- data.table(PatientID, Date, Noshow, Apt)

# Ensure your dates are ordered if they are not
train <- train[order(PatientID, Date), ]

Creating Our rollSum Function

This function sums a variable within a given date range. The original function was created by G. Grothendieck in his answer to the Stack Overflow question Compute rolling sum by id variables, with missing timepoints. I modified it to accept arguments for any dataset and changed the window from years to days.

In [282]:
library(zoo) # Required to make a zoo object

rollSum <- function(i, data, count, dates) {
    # Build a zoo series of the count values indexed by date
    z <- with(data[i, ], zoo(count, dates))
    # An empty zoo series containing every day in the date range
    g <- zoo(, seq(start(z), end(z), by="day"))
    # Merge so that days without appointments appear as NA
    m <- merge(z, g)
    # Rolling 365-day sum over the daily grid, evaluated only at the
    # original observation dates
    window(rollapplyr(m, 365, sum, na.rm=TRUE, partial=TRUE), time(z))
}
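To see what each step produces, here's a minimal walk-through on a toy series (hypothetical dates; the expected values are worked out by hand, so treat this as a sketch):

# With zoo loaded as above
toy_dates <- as.POSIXct(c("2016-01-01", "2016-06-01", "2017-01-15"))
toy_count <- c(1, 1, 1)

z <- zoo(toy_count, toy_dates)                  # one point per appointment
g <- zoo(, seq(start(z), end(z), by="day"))     # empty series over every day
m <- merge(z, g)                                # counts with NA on empty days
# A 365-point window over the daily grid approximates a 365-day look-back:
window(rollapplyr(m, 365, sum, na.rm=TRUE, partial=TRUE), time(z))
# Expect 1, 2, 2: by 2017-01-15 the 2016-01-01 visit has aged out of
# the window, while 2016-06-01 is still inside it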

Creating NoshowSum and AptSum Features

We're going to use our rollSum function to sum no-shows and appointments for each patient within 365 days of the respective observation. We wrap rollSum in as.numeric because it returns a zoo object, which causes an error when we later divide the two sums to create our NoshowRate. Finally, we subtract the original feature from the sum to remove the current observation.

In [283]:
train[, NoshowSum := as.numeric(rollSum(data=train, count=Noshow, dates=Date) - Noshow), by=PatientID]
train[, AptSum := as.numeric(rollSum(data=train, count=Apt, dates=Date) - Apt), by=PatientID]
train
PatientID  Date                 Noshow  Apt  NoshowSum  AptSum
Pt_1       2015-05-06 20:00:00  0       1    0          0
Pt_1       2015-07-24 20:00:00  1       1    0          1
Pt_1       2015-09-11 20:00:00  1       1    1          2
Pt_1       2015-12-22 19:00:00  1       1    2          3
Pt_1       2015-12-31 19:00:00  1       1    3          4
Pt_1       2016-11-18 19:00:00  1       1    2          2
Pt_2       2015-02-06 19:00:00  1       1    0          0
Pt_2       2015-07-09 20:00:00  0       1    1          1
Pt_2       2015-08-01 20:00:00  0       1    1          2
Pt_2       2015-08-30 20:00:00  0       1    1          3
Pt_2       2016-01-29 19:00:00  1       1    1          4
Pt_2       2016-03-03 19:00:00  0       1    1          4
Pt_2       2016-05-13 20:00:00  0       1    1          5
Pt_2       2016-09-03 20:00:00  1       1    1          3
Pt_2       2016-12-17 19:00:00  1       1    2          4
Pt_2       2016-12-29 19:00:00  1       1    3          5
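As a quick sanity check, every patient's first visit should carry no history, so both sums should be zero on the first row of each patient (train is already ordered by PatientID and Date):

train[, .SD[1], by=PatientID][, .(PatientID, Date, NoshowSum, AptSum)]
# Both sums are 0 on each patient's first visit: no current or future
# information has leaked into the features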

Lastly, we divide NoshowSum by AptSum to get our NoshowRate, then assign 0 to the NaN values that result from dividing 0 / 0.

In [284]:
train$NoshowRate <- train$NoshowSum / train$AptSum
train$NoshowRate[is.nan(train$NoshowRate)] <- 0
train
PatientID  Date                 Noshow  Apt  NoshowSum  AptSum  NoshowRate
Pt_1       2015-05-06 20:00:00  0       1    0          0       0.0000000
Pt_1       2015-07-24 20:00:00  1       1    0          1       0.0000000
Pt_1       2015-09-11 20:00:00  1       1    1          2       0.5000000
Pt_1       2015-12-22 19:00:00  1       1    2          3       0.6666667
Pt_1       2015-12-31 19:00:00  1       1    3          4       0.7500000
Pt_1       2016-11-18 19:00:00  1       1    2          2       1.0000000
Pt_2       2015-02-06 19:00:00  1       1    0          0       0.0000000
Pt_2       2015-07-09 20:00:00  0       1    1          1       1.0000000
Pt_2       2015-08-01 20:00:00  0       1    1          2       0.5000000
Pt_2       2015-08-30 20:00:00  0       1    1          3       0.3333333
Pt_2       2016-01-29 19:00:00  1       1    1          4       0.2500000
Pt_2       2016-03-03 19:00:00  0       1    1          4       0.2500000
Pt_2       2016-05-13 20:00:00  0       1    1          5       0.2000000
Pt_2       2016-09-03 20:00:00  1       1    1          3       0.3333333
Pt_2       2016-12-17 19:00:00  1       1    2          4       0.5000000
Pt_2       2016-12-29 19:00:00  1       1    3          5       0.6000000

That's it. Please leave a comment if you have any questions or suggestions. You can grab the notebook from my GitHub: prevent_data_leakage_when_engineering_historical_sum_features.ipynb. Thanks for reading!