Walkthrough Of Patient No-show Supervised Machine Learning Classification Project With XGBoost In R

By James Marquez, March 14, 2017

This walk-through is a project I've been working on for some time to help improve the missed opportunity rate (no-show rate) for medical centers. It demonstrates the entire machine learning process starting with extracting datasets from an SQL server and loading them directly into your R environment, performing exploratory analyses and creating visualizations to gain insight, engineering new features, model tuning and training, and finally measuring the model's performance. Finally, a web application is demonstrated at the end of this article which I developed to present the patient prediction results to the agency's leaders and frontline staff. I would like to share my results and methodology as a guide to help others starting their project or to assist others with improving upon my results.

If you enjoy this article, please leave a comment at the end if you have any questions about my process or if you have any suggestions. Thank you for reading!

The Problem

Patients that do not show up for their appointment without advanced warning is an issue that negatively affects the Bronx VAMC's 191,000+ veteran patient population and the facility because each no-show lowers access (increases wait times) and is a missed billing opportunity. This is a substantial problem because the facility's average 17.5% no-show rate, which consists of roughly 450,000 scheduled appointments each year, has never dropped below the national 10% goal. This project is an attempt to lower the number of patients that no-show by using the XGBoost implementation of the gradient boosting machine learning algorithm. This project is on-going as new features are continuously being tested to increase the model's accuracy.

  • ALL
  • 2 FY
  • 1 FY

Results Summary

Setting the probability threshold to 0.22 results in a true positive rate of 0.7283 (72.83%) and a true negative rate of 0.8617 (86.17%).

Features or Explanatory Variables Used

  1. Patient No-show Rate - Patient's historical no-show rate.
  2. Patient Cancellation Rate - Patient's historical cancelation rate.
  3. Clinic No-show Rate - Patient's historical no-show rate by clinic.
  4. Clinic Cancellation rate - Patient's historical cancellation rate by clinic.
  5. New Patient to Clinic - Has patient visited clinic in past 24 months.
  6. Department No-show Rate - Patient's historical no-show rate by department.
  7. Department Cancellation Rate - Patient's historical cancellation rate by department.
  8. New Patient to Department - Has patient visited department in past 24 months.
  9. Appointment Lead Time - Number of days patient is waiting for appointment.
  10. Days Since Last Appointment - Number of days since patient's previous appointment.
  11. Consecutive No-shows By Patient - Number of most recent consecutive no-shows patient has accrued.
  12. Consecutive No-shows By Department - Number of most recent consecutive no-shows patient has accrued by department.
  13. Consecutive No-shows By Clinic - Number of most recent consecutive no-shows patient has accrued by clinic.
  14. Appointment Sum Per Day - Number of appointments patient has on the same day of predicted appointment.
  15. Appointment Sum - Patient's historical appointment sum in past 12 months.
  16. Consult Sum - Patient's historical consult sum in past 12 months.
  17. Department - Appointment's department.
  18. Length of Appointment - Appointment's duration in minutes.
  19. Appointment Month - Appointment's month (Jan, Feb, etc.)
  20. Appointment Weekday - Appointment's Weekday (Mon, Tues, etc.)
  21. Appointment Hour - Appointment's hour (8am, 9am, etc.)
  22. Consult Lead Time - Number of days elapsed since consult requested.
  23. Gender - Patient's gender.
  24. Age - Patient's age.
  25. Marital Status - Married or single.
  26. Race - Patient's race.
  27. Address - Patient's address (longitude & latitude)
  28. Homeless - Patient diagnosed as homeless.
  29. Substance Abuse - Patient diagnosed as a substance abuser.
  30. Low Income - Patient diagnosed as having low income.

Loading Required Packages

The first step is to install and load the following R packages which are required to perform each step in this project. Package descriptions were sourced from their respective reference manual which can be found at https://cran.r-project.org/web/packages/.

In [2]:
library(RODBC)   # Used to establish a connection with Microsoft Server
library(plyr)    # Tools for splitting, applying and combining data
library(dplyr)   # Functions to simplify data manipulation
library(data.table) # Functions to simplify data manipulation
library(xts)
library(zoo)     # Required to make a zoo object
library(caret)   # Functions to streamline the model training process for complex classification problems
library(e1071)   # Misc. statistics functions required for caret package
library(Matrix)  # Creates sparse matrices
library(xgboost) # An efficient and scalable implementation of gradient boosting framework
library(Metrics) # Evaluation metrics for machine learning
library(ROCR)    # Used to visualize the performance of scoring classifiers
library(pROC)    # Used to display and analyze ROC curves
library(sp)      # Classes and methods for spatial data
library(ggmap)   # Create spatial visualizations with ggplot2
library(ggplot2) # Create elegant data visualizations using the grammar of graphics
library(doParallel) # Provides parallel processing
n.cores <- detectCores() # Identify the machine's number of cores
registerDoParallel(n.cores) # Used to train model across multiple cores in parallel
n.cores <- getDoParWorkers()
paste(n.cores, 'workers utilized')
'8 workers utilized'

Building The Datasets With SQL

All database table and column names have been given aliases for security reasons. In this next step, we will gather a period of two years of historical appointment information as well as patient demographic information from VHA's Corporate Data Warehouse. We will connect R directly to Microsoft SQL Server via an ODBC connection using the RODBC package. We will use Structured Query Language (SQL) to pull the information from 11 tables. We will set three variables; start.date, end.date, and station, which allows us to reuse the same dates and facility number across each query.

We will store each table in its own data frame and combine them later because R provides us more control joining datasets with the dplyr package instead of SQL. Also, it makes writing SQL statements much easier when writing many small queries instead of one large complex and slow SQL query.

In [3]:
# Set variables to use across all SQL queries
start.date <- '2015-01-01'
end.date   <- '2017-01-01'
station    <- '526'

channel <- odbcConnect("R/SQL Server Connection") # Establish connection with Microsoft SQL Server
# Gather apppointment and patient demographic information
appointments <- sqlQuery(channel, paste("
SELECT c.Location
      ,c.Division
      ,b.Deceased
      ,b.TestPatient
      ,c.NoncountClinic
      ,a.AppointmentLength
      ,g.AppointmentStatus
      ,a.NoshowCancel
      ,a.AppointmentDateTime
      ,a.CancelDateTime
      ,a.PatientID
      ,b.Gender
      ,b.MaritalStatus
      ,b.Age
      ,b.Race
      ,b.GISAddressLatitude AS Lat
      ,b.GISAddressLongitude AS Lon
      ,DATEDIFF(DAY, a.AppointmentMadeDate, a.AppointmentDateTime) AS ApptLeadTime
      ,f.Department
      ,f.StopCodeName
      ,c.LocationName
      ,a.Consult
      ,DATEDIFF(DAY, d.FileEntryDateTime, a.AppointmentDateTime) AS ConsultLeadTime
      ,LEFT(b.PatientName, 1) + RIGHT(b.PatientSSN, 4) AS PatientInfo
      ,b.PatientName
      ,b.PhoneCellular
      ,b.PhoneResidence
FROM Appointment a
  INNER JOIN Patient b
    ON a.PatientID = b.PatientID
  INNER JOIN Location c
    ON a.LocationID = c.LocationID
  LEFT JOIN Consult d
    ON a.ConsultID = d.ConsultID
  LEFT JOIN ConsultReason e
    ON d.ConsultID = e.ConsultID
  INNER JOIN dim.StopCode f
    ON c.PrimaryStopCodeSID = f.StopCodeID
  LEFT JOIN OutpatVisit g
    ON a.VisitID = g.VisitID
WHERE a.Station = ", station, "
  AND a.AppointmentDateTime >= '", start.date, "'
  AND a.AppointmentDateTime  < '", end.date, "'
ORDER BY a.PatientID, a.AppointmentDateTime", sep=''), stringsAsFactors=FALSE)

# Gather patient information that identifies homelessness using ICD 10 codes
homeless <- sqlQuery(channel, paste("
SELECT PatientID
      ,ICD10ID
      ,VisitDateTime AS HomelessVisitDateTime
FROM OutpatDiagnosis
WHERE Station = ", station, "
  AND VisitDateTime >= '", start.date, "'
  AND VisitDateTime  < '", end.date, "'
  AND ICD10ID IN (324543515, 324543516, 324543517, 324543518, 324543519,
                  324543520, 324543521, 324543522, 324543523, 324543524)
ORDER BY PatientID, VisitDateTime", sep=''), stringsAsFactors=FALSE)

# Gather patient information that identifies substance abuse using ICD 10 codes
subabuse <- sqlQuery(channel, paste("
SELECT PatientID
      ,ICD10ID
      ,VisitDateTime AS SubAbuVisitDateTime
FROM OutpatDiagnosis
WHERE Station = ", station, "
  AND VisitDateTime >= '", start.date, "'
  AND VisitDateTime  < '", end.date, "'
  AND ICD10ID IN (345244244, 345244245, 345244246, 345244247, 345244248,
                  345244249, 345244250, 345244251, 345244252, 345244253,
                  345244254, 345244259, 1400858951, 1400858952)
ORDER BY PatientID, VisitDateTime", sep=''), stringsAsFactors=FALSE)

# Gather patient information that identifies low income using ICD 10 codes
lowincome <- sqlQuery(channel, paste("
SELECT PatientID
      ,ICD10ID
      ,VisitDateTime AS LowIncVisitDateTime
FROM OutpatDiagnosis
WHERE Station = ", station, "
  AND VisitDateTime >= '", start.date, "'
  AND VisitDateTime  < '", end.date, "'
  AND ICD10ID = 324543521
ORDER BY PatientID, VisitDateTime", sep=''), stringsAsFactors=FALSE)

# Gather inpatient patient information
admits <- sqlQuery(channel, paste("
SELECT PatientID
      ,AdmissionDateTime
FROM InpatInpatient
WHERE Station = ", station, "
  AND AdmissionDateTime >= '", start.date, "'
  AND AdmissionDateTime  < '", end.date, "'
ORDER BY PatientID, AdmissionDateTime", sep=''), stringsAsFactors=FALSE)
close(channel) # Close connection with database

# View the number of rows and columns for each data frame
dim(appointments)
dim(homeless)
dim(subabuse)
dim(lowincome)
dim(admits)
  1. 930242
  2. 27
  1. 43479
  2. 3
  1. 1019
  2. 3
  1. 403
  2. 3
  1. 8872
  2. 2

Preprocessing And Cleaning Data

Now that the raw datasets are stored in separate data frames, the next step is to clean and combine them. This step also requires a lot of domain specific knowledge. For example, we will remove all deceased patients because they will no longer be scheduling appointments and will negatively affect the model if left in. Also, we will remove patients flagged as test patients in the database. We will also remove clinics where no-show performance is not monitored. Next, we'll remove inpatient appointments and appointments with data entry errors. Lastly, even though appointments canceled by the facility after the appointment date/time count toward the no-show rate, we'll remove those because they are out of the patient's control.

In [4]:
# Create working copy of appointments data frame
d <- appointments

d <- subset(d, (Division == 648))        # Get only James J. Peters patients
d <- subset(d, (Deceased == 'N'))           # Remove deceased patients
d <- subset(d, is.na(TestPatient))      # Remove test patients
d <- subset(d, (NoncountClinic == 'N')) # Remove non-count clinics
d <- subset(d, (AppointmentStatus == 21105) | (AppointmentStatus == -1)) # Get only checked out and noshows/cancellations, exclude inpatient and incomplete encounters
d <- subset(d, !grepl('^CA?', d$NoshowCancel) | is.na(d$NoshowCancel)) # Remove appointments canceled by the clinic that are out of the patient's control

# Display number of rows that were dropped
dim(appointments) - dim(d)
  1. 177823
  2. 0

We dropped 177,823 observations that would have otherwise negatively affected our model.

Next, we'll combine the homeless, subabuse, and lowincome data frames to our working data frame d. We'll remove duplicate patient ID records by calling subset(x, !duplicated(PatientID)) on each data frame because each patient may have multiple entries, but we only need one instance for each unique patient. Then we assign them to their matching patient IDs in our working data frame.

In [5]:
# Remove duplicate patients from each data frame
h <- droplevels(subset(homeless, !duplicated(PatientID)))
s <- droplevels(subset(subabuse, !duplicated(PatientID)))
l <- droplevels(subset(lowincome, !duplicated(PatientID)))

# Join Homeless, SubstanceAbuse, and LowIncome data frames to working data frame
d$Homeless       <- h$ICD10ID[match(d$PatientID, h$PatientID)] # Assign ICD 10 code to Homeless feature by patient ID
d$SubstanceAbuse <- s$ICD10ID[match(d$PatientID, s$PatientID)]
d$LowIncome      <- l$ICD10ID[match(d$PatientID, l$PatientID)]

Visualizing the Label Column

Let's create our Label feature. This is the feature that we are trying to predict. It may also be called the outcome variable or the dependent variable. For this project, since we are trying to predict patients that will no-show, we'll assign a 1 to every instance that the patient either no-showed or canceled after their appointment date/time. We'll assign 0 to every case where the patient did show up. We want to train our model to understand which features contribute to every case where our Label feature is equal to 1.

In [6]:
# Label feature 1 = patient no-showed or canceled after appointment date/time
d$Label <- ifelse(grepl('N', d$NoshowCancel) | ((d$CancelDateTime > d$AppointmentDateTime) & grepl('PC', d$NoshowCancel)), 1, 0) 
d$Label[is.na(d$Label)] <- 0 # Assign 0 to patient cancelations where time/date stamp was not saved in the database

# Proportions of each Label value
prop <- data.frame(round(prop.table(table(d$Label)), 4))
prop$Var1 <- as.integer(prop$Var1)
prop$y <- as.integer(table(d$Label))
prop[1, 1] <- 0
prop[2, 1] <- 1

# Plot both Label values
ggplot(d, aes(Label)) +
    geom_bar(fill=c('#00BFC4','#FF9999')) +
    scale_x_continuous(breaks=seq(0, 1)) +
    geom_text(data=prop, aes(x=Var1, y=y, label=Freq), vjust=-0.35, size=6.5, color="#000000")

This plot shows us that most patients, 86.02%, show up for their appointment while 13.98% no-show. For this project, we are most interested in predicting which patients are most likely to no-show rather than which patients will likely show.

Feature Engineering

Feature engineering is the process of creating new features that help the learning algorithm understand the problem. Features in machine learning are analogous to independent variables in statistical experiments.

Department Features

VHA groups similar services into departments to determine if a patient is new to the department or if the patient is returning for a follow-up visit. We will use VHA's national guidelines to group the services together, which will also reduce the number of factors or categorical variables in this feature from 120 down to 51. Our goal is to find out if patients are not showing up for some departments more than others. Each service is assigned a 3-digit identifier that we'll use when assigning them to groups.

In [8]:
# Group departments
# The following statments will overwrite the existing 3-digit identifier with the group's name
d$Department[d$Department %in% c('203', '204')] <- "Audiology"
d$Department[d$Department %in% c('159', '436')] <- "Alternative"
d$Department[d$Department %in% c('211', '418')] <- "Amputation"
d$Department[d$Department %in% c('419', '420', '427', '434', '441')] <- "Anesthesia"
d$Department[d$Department %in% c('209', '217', '218', '220', '221', '229', '437', '438', '439')] <- "Blindrehab"
d$Department[d$Department %in% c('402', '413')] <- "Cardiothoracic"
d$Department[d$Department %in% c('231', '303', '311', '333', '334', '369')] <- "Cardiology"
d$Department[d$Department %in% c('163', '164', '165', '166', '167', '168', '169')] <- "Chaplain"
d$Department[d$Department %in% c('118', '119', '121', '122', '170', '171', '172', '173', '174',
                                 '175', '176', '177', '178', '182', '191', '347', '354', '680',
                                 '683', '685', '686')] <- "Homebased"
d$Department[d$Department %in% c('208', '222')] <- "Cwt"
d$Department[d$Department %in% c('180', '181')] <- "Dental"
d$Department[d$Department %in% c('101', '130')] <- "Emergency"
d$Department[d$Department %in% c('305', '306')] <- "Endo"
d$Department[d$Department %in% c('407', '408', '428', '718')] <- "Eye"
d$Department[d$Department %in% c('307', '321', '337')] <- "Gastroenterology"
d$Department[d$Department %in% c('102', '103', '131', '301', '309', '317', '324', '327', '329',
                                 '331', '336', '339', '340', '349', '394')] <- "GeneralMedicine"
d$Department[d$Department %in% c('401', '412', '416', '424', '429', '431', '432', '435')] <- "GeneralSurgery"
d$Department[d$Department %in% c('190', '318', '319', '320', '326', '351', '352', '353')] <- "Geriatric"
d$Department[d$Department %in% c('404', '426')] <- "Gyn"
d$Department[d$Department %in% c('308', '316', '330')] <- "Hema"
d$Department[d$Department %in% c('555', '556')] <- "HomelessEmployment"
d$Department[d$Department %in% c('504', '507', '508', '511', '522', '529', '590', '591', '592')] <- "Homeless"
d$Department[d$Department %in% c('108', '111')] <- "Laboratory"
d$Department[d$Department %in% c('520', '521', '156', '157', '502', '503', '505', '506', '509',
                                 '510', '512', '513', '514', '516', '519', '523', '524', '525',
                                 '533', '534', '535', '538', '539', '540', '547', '548', '550',
                                 '552', '553', '554', '557', '558', '560', '561', '562', '563',
                                 '564', '565', '566', '567', '568', '571', '572', '573', '574',
                                 '575', '576', '577', '578', '580', '581', '582', '583', '588',
                                 '589', '593', '594', '595', '596', '598', '599')] <- "MentalHealth"
d$Department[d$Department %in% c('313', '602', '603', '604', '606', '607', '608', '610', '611')] <- "Nephrology"
d$Department[d$Department %in% c('106', '126', '127', '128', '148', '315', '325', '335', '345', '346')] <- "Neurology"
d$Department[d$Department %in% c('109', '144', '145', '146', '148', '149', '158')] <- "Nuclear"
d$Department[d$Department %in% c('142', '147', '328', '332', '433')] <- "Nursing"
d$Department[d$Department %in% c('123', '124', '147', '708')] <- "Nutrition"
d$Department[d$Department %in% c('405', '409', '422')] <- "Orthopedics"
d$Department[d$Department %in% c('147', '160', '161')] <- "Pharmacy"
d$Department[d$Department %in% c('195', '196', '197', '198')] <- "Polytrauma"
d$Department[d$Department %in% c('322', '323', '338', '341', '342', '348', '350', '702', '704',
                                 '705', '709', '711')] <- "Primarycare"
d$Department[d$Department %in% c('417', '423', '425')] <- "Prosthetics"
d$Department[d$Department %in% c('104', '116', '148', '312')] <- "Pulmonary"
d$Department[d$Department %in% c('105', '110', '115', '148', '150', '151', '152', '153', '154',
                                 '155', '703')] <- "Radiology"
d$Department[d$Department %in% c('201', '216')] <- "Rehabphysician"
d$Department[d$Department %in% c('586', '587', '725', '726', '727', '728', '729', '730', '731')] <- "ResidentialRehab"
d$Department[d$Department %in% c('125', '147')] <- "SocialWork"
d$Department[d$Department %in% c('210', '215', '224', '225')] <- "SpinalCordInjury"
d$Department[d$Department %in% c('202', '205', '206', '207', '213', '214', '230', '240', '250')] <- "Therapy"
d$Department[d$Department %in% c('414', '430')] <- "Urology"
d$Department[d$Department %in% c('372', '373')] <- "Weightmgmt"

Data Exploration (No-show Rates by Department)

Okay, now that we have our departments grouped, let's see if some departments have more no-shows than others. It's important to view department no-shows by rate rather than by total no-shows because some departments are much larger than others. Looking at it simply by total no-shows, the Mental Health department would seem like they have the most no-shows. However, they have more no-shows because they have a much higher volume of appointments than any other department. That would not indicate whether there is some factor causing patients to no-show in Mental Health more than other departments. Let's find each department's no-show rate and plot them from highest to lowest.

In [9]:
dep <- data.frame(table(d$Department, d$Label)) # Create a data frame that aggregates and split the total shows and no-shows for each department
dep.no.shows <- subset(dep, Var2 == 1) # Create data frame with service's no-show aggregates
dep.shows    <- subset(dep, Var2 == 0) # Create data frame with service's show aggregates 

# Because there may be some departments that do not have no-shows, we cannot simply join the Freq column from dep.shows to dep.no.shows because the length of the columns may differ
dep.no.shows$Shows <- dep.shows$Freq[match(dep.no.shows$Var1, dep.shows$Var1)] # We use the match function to assign the show values from only those that are in the dep.no.shows data frame
dep.no.shows$Rate  <- dep.no.shows$Freq / (dep.no.shows$Freq + dep.no.shows$Shows)
dep.no.shows$Lab   <- paste(round(dep.no.shows$Rate * 100, 1), '%', sep='')

ggplot(data=dep.no.shows, aes(x=(reorder(Var1, Rate)), y=Rate)) +
    ggtitle("Department No-show Rates") +
    xlab("Departments") +
    geom_bar(stat="identity", fill="#FF9999") +
    geom_text(data=dep.no.shows, aes(x=Var1, y=Rate, label=Lab), vjust=0.35, hjust=1.05, size=2.5, color="#000000") +
    coord_flip() # flip the position of the x and y axes

rm(dep, dep.no.shows, dep.shows) # Remove temp data frames

We can see that the emergency department has an extremely low no-show rate, which makes sense. However, it's the 9th largest producer of appointments out of 51 departments. That's significant because it means patients scheduled in the Emergency department will most likely always show up. We'll see later that the model realizes this fact as well. We can also use this information to find best practices from departments that have low no-show rates with high volume such as Prosthetics.

Our training dataset will consist of 12 months of data, but our features will require us to use data during the period 12 - 24 months. We can drop the 12 - 24-month period after we create them.

New Patient to Department Feature

The first feature will determine if the patient has not visited the same department within a 24-month period, and thus be considered a new patient. We'll do this by first creating a new column that concatenates each patient's unique ID and their scheduled appointment's department only if they had a checked-out visit. Then, we'll loop over the column and assign a Yes if no duplicates are found and No if it does find one or more duplicates.

In [10]:
# New to department
d$NewPatientID <- ifelse(is.na(d$NoshowCancel), paste(d$PatientID, d$Department, sep=''), NA)
d$NewPatientDepartment <- ifelse(!duplicated(d$NewPatientID), 'Yes', 'No') # Yes = new patient

New Patient to Clinic Feature

We'll use the same concept to determine if the patient has not visited the same clinic in a 24-month period. Clinics are subcomponents of each department.

In [11]:
# Create a unique patient and clinic identifier if patient showed up to appointment
d$NewClinicSID <- ifelse(is.na(d$NoshowCancel), paste(d$PatientID, d$LocationName, sep=''), NA)

# Patient is identified as 'Yes' if there are no duplicte unique patient clinic IDs
d$NewPatientClinic <- ifelse(!duplicated(d$NewClinicSID), 'Yes', 'No') # Yes = new patient

Days Since Last Appointment Feature

The next feature we'll create is the number of days that are between each patient's past most recent appointment and the current appointment. We want to find out if a long time has elapsed since the patient has last visited the medical center.

In [12]:
# Concert appointment date/time to a number
d$ApptDate <- as.numeric(as.Date(d$AppointmentDateTime))

# Function to find the difference between appointment dates grouped by patient
d$DaysSinceLastAppt <- ave(d$ApptDate, d$PatientID, FUN=function(x) c(0, diff(x)))

Race Feature

Next, let's group similar race ID's into five race groups and one group for unknown race.

In [13]:
d$Race[d$Race %in% c(1698)]             <- 0 # Native
d$Race[d$Race %in% c(1694, 1695, 1691)] <- 1 # Asian
d$Race[d$Race %in% c(1693, 1699, 1701)] <- 2 # Black
d$Race[d$Race %in% c(1696)]             <- 3 # Hispanic
d$Race[d$Race %in% c(1689, 1697)]       <- 4 # White
d$Race[d$Race %in% c(-1, 0, 1692, 1700, 1690, 1702)] <- 5 # Unknown

Marital Status Feature

Next, let's reduce the amount of married and non-married factors down to just married or single. The data has several types of non-married such as divorced, single, and widow. We are most interested in only married and single because we want to find out which patients have a loved one at home that can help them.

In [14]:
# Patient's marital status
d$MaritalStatus <- ifelse(grepl('MARRIED', d$MaritalStatus), 'M', 'S') # Code values with married as 'M' and all the others as 'S'

Patient Age Feature Using K-means Clustering

For the next feature, we'll assign each patient into one of n age categories. We'll calculate n by using k-means clustering. Then we'll assign the centers of each cluster to each patient to create our factors.

In [15]:
# Assign average age to missing values
d$Age[is.na(d$Age)] <- mean(d$Age, na.rm=TRUE)

# Group ages into five categories
age.cluster <- kmeans(d$Age, centers=5)

# Extract cluster centers vector and assign to our working data frame
d$AgeClusterCenters <- as.factor(round(fitted(age.cluster, method = c("centers", "classes"))))

# Visualize clusters
ggplot(d, aes(x=Age)) +
          geom_density(aes(group=AgeClusterCenters, color=AgeClusterCenters, fill=AgeClusterCenters), alpha=0.3) +
          labs(title = "Patient Age Clusters") +
          labs(x = "Age (ommitted values because text too small)") +
          theme(axis.text.x = element_text(size=0))

Function To Sum Variable by Group for a Given Past Date Range

In this next section, we'll create a function that will sum a variable within 365 days from the current observation. This step is crucial to prevent data leakage when using historical sums and rates as features. You can find more details in my other article How to Prevent Data Leakage when Engineering Historical Sum and Rate Features for Predictive Modeling. We'll use this function when creating our patient, department, and clinic no-show rates and appointment sum features.

In [16]:
# Function will sum values 365 days prior to observation's date
rollingSum <- function(i, data, count, dates) {
  z <- with(data[i, ], zoo(count, dates))
  g <- zoo(, seq(start(z), end(z), by="day"))
  m <- merge(z, g)
  window(rollapplyr(m, 365, sum, na.rm=TRUE, partial=TRUE), time(z))
}

# Add one microsecond to duplicate dates to create a unique index
d$AppointmentDateTime <- make.index.unique(d$AppointmentDateTime)

Appointment Sum Feature

The next feature, the sum of each patient's appointments within 365 days, will be a feature on its own and be used to calculate other features, such as each patient's cancelation rates and no-show rates. First, we must sum each patient's appointments grouped by their unique ID. We'll use data.table's functions to create a new aggregate feature in our data frame with the sums for each unique patient ID. We'll use this same technique for several other features as well.

In [ ]:
# Convert data frame to data.table
d <- as.data.table(d)

# Assign 1 to new column to be used for summation
d$Appt <- 1

# Sum patient appointments by patient, then subtract observation's appointment from it's own result
d[, ApptSum := as.numeric(rollingSum(data=d, count=Appt, dates=AppointmentDateTime) - Appt), by=PatientID]

Consult Sum Feature

Next, we'll sum the total number of consults each patient had in 365 days. A consult is a digital request when a patient sees a specialist for the first time. Follow-up visits with the same specialist do not require consults. The dataset contains three values; NA, -1, and a value greater than 0. All values greater than 0 are the consult request ID. We'll transform all values less than 0 or NA to 0, and every other value to 1.

In [ ]:
# Determine if appointment is for a consultation or return follow-up
d$Consult <- ifelse(is.na(d$Consult) | d$Consult < 0, 0, 1) # 1 = Consult

# Sum patient consults by patient, then subtract observation's appointment from it's own result
d[, ConsultSum := as.numeric(rollingSum(data=d, count=Consult, dates=AppointmentDateTime) - Consult), by=PatientID]

Patient No-show Rate Feature

Next, we'll find each patient's no-show rate.

In [ ]:
# Sum patient no-shows by patient, then subtract observation's no-show from it's own result
d[, PatientNoshowSum := as.numeric(rollingSum(data=d, count=Label, dates=AppointmentDateTime) - Label), by=PatientID]

# Calculate patient no-show rate
d$PatientNoshowRate <- d$PatientNoshowSum / d$ApptSum

# Assign 0 to all NaN values resulting from dividing 0 / 0
d$PatientNoshowRate[is.nan(d$PatientNoshowRate)] <- 0

Department No-show Rate

Next, we'll calculate each patient's no-show rates by department. First, we'll concatenate each patient's ID and the department to make a new DepartmentID. Then, we'll sum the no-shows and total appointments based on the new DepartmentID we created. Next, we'll use that ID to assign our aggregates back to our working data frame. Once we have the total no-shows and total appointments by our new ID, we can then divide them to get the no-show rate.

In [ ]:
# Create unique identifier for which we can aggregate on to find no-show by patient and department
d$DepartmentSID <- paste(d$PatientID, d$Department, sep='')

# Sum patient appointments by patient and department, then subtract observation's appointment from it's own result
d[, DepartmentApptSum := as.numeric(rollingSum(data=d, count=Appt, dates=AppointmentDateTime) - Appt), by=DepartmentSID]

# Sum patient no-shows by patient and department, then subtract observation's no-show from it's own result
d[, DepartmentNoshowSum := as.numeric(rollingSum(data=d, count=Label, dates=AppointmentDateTime) - Label), by=DepartmentSID]

# Calculate department no-show rate
d$DepartmentNoshowRate <- d$DepartmentNoshowSum / d$DepartmentApptSum

# Assign 0 to all NaN values resulting from dividing 0 / 0
d$DepartmentNoshowRate[is.nan(d$DepartmentNoshowRate)] <- 0

Clinic No-show Rate

In [ ]:
# Create unique identifier for which we can aggregate on to find no-show by patient and clinic
d$ClinicSID <- paste(d$PatientID, d$LocationName, sep='')

# Sum patient appointments by patient and clinic, then subtract observation's appointment from it's own result
d[, ClinicApptSum := as.numeric(rollingSum(data=d, count=Appt, dates=AppointmentDateTime) - Appt), by=ClinicSID]

# Sum patient no-shows by patient and clinic, then subtract observation's no-show from it's own result
d[, ClinicNoshowSum := as.numeric(rollingSum(data=d, count=Label, dates=AppointmentDateTime) - Label), by=ClinicSID]

# Calculate clinic no-show rate
d$ClinicNoshowRate <- d$ClinicNoshowSum / d$ClinicApptSum

# Assign 0 to all NaN values resulting from dividing 0 / 0
d$ClinicNoshowRate[is.nan(d$ClinicNoshowRate)] <- 0

Patient Cancelation Rate

Next, we'll use the same DepartmentID and ClinicID to find each patient's cancelation rate by department and clinic. This is the total number of times the patient acted responsibly by canceling their appointment before their scheduled date/time relative to the total amount of appointments they've had.

In [ ]:
# Assign 1 to new column where appointments were canceled prior to the scheduled date/time
d$Canx <- ifelse(grepl('PC', d$NoshowCancel) & d$CancelDateTime < d$AppointmentDateTime, 1, 0)
d$Canx[is.na(d$Canx)] <- 0

# Sum patient no-shows by patient, then subtract observation's no-show from it's own result
d[, PatientCancelSum := as.numeric(rollingSum(data=d, count=Canx, dates=AppointmentDateTime) - Canx), by=PatientID]

# Calculate patient no-show rate
d$PatientCancelRate <- d$PatientCancelSum / d$ApptSum

# Assign 0 to all NaN values resulting from dividing 0 / 0
d$PatientCancelRate[is.nan(d$PatientCancelRate)] <- 0

Department Cancelation Rate

In [ ]:
# Sum patient cancels by patient and department, then subtract observation's cancel from it's own result
d[, DepartmentCancelSum := as.numeric(rollingSum(data=d, count=Canx, dates=AppointmentDateTime) - Canx), by=DepartmentSID]

# Calculate department cancelation rate
d$DepartmentCancelRate <- d$DepartmentCancelSum / d$DepartmentApptSum

# Assign 0 to all NaN values resulting from dividing 0 / 0
d$DepartmentCancelRate[is.nan(d$DepartmentCancelRate)] <- 0

Clinic Cancelation Rate

In [ ]:
# Sum patient cancels by patient and clinic, then subtract observation's cancel from it's own result
d[, ClinicCancelSum := as.numeric(rollingSum(data=d, count=Canx, dates=AppointmentDateTime) - Canx), by=ClinicSID]

# Calculate department cancelation rate
d$ClinicCancelRate <- d$ClinicCancelSum / d$ClinicApptSum

# Assign 0 to all NaN values resulting from dividing 0 / 0
d$ClinicCancelRate[is.nan(d$ClinicCancelRate)] <- 0

Patient Consecutive No-shows Feature

Next, we'll calculate the number of consecutive no-shows each patient has accrued up to their appointment date. We'll calculate them based on each patient's ID and by the ClinicID.

In [25]:
# Most Recent Consecutive No-shows by Patient ID
d <- setDT(d)[, ConsecNoshows := seq(.N) * Label, by=.(PatientID, rleid(Label))]

# Shifts all results down one row
d <- d[, ConsecNoshows := shift(ConsecNoshows, fill=0), by=PatientID]

Department Consecutive No-shows Feature

In [26]:
# Most Recent Consecutive No-shows by Department ID
d <- setDT(d)[, ConsecNoshowsDepartment := seq(.N) * Label, by=.(DepartmentSID, rleid(Label))]

# Shifts all results down one row
d <- d[, ConsecNoshowsDepartment := shift(ConsecNoshowsDepartment, fill=0), by=DepartmentSID]

Clinic Consecutive No-shows Feature

In [27]:
# Most Recent Consecutive No-shows by Clinic ID
d <- setDT(d)[, ConsecNoshowsClinic := seq(.N) * Label, by=.(ClinicSID, rleid(Label))]

# Shifts all results down one row
d <- d[, ConsecNoshowsClinic := shift(ConsecNoshowsClinic, fill=0), by=ClinicSID]

Appointment Date Features (Month, Weekday, & Hour)

Next, we'll create three separate features by extracting date parts from the appointment's date and time. We'll extract the appointment's month, the weekday, and the hour.

In [28]:
# Extract date/time parts from appointment date/time
d$ApptMonth   <- format(d$AppointmentDateTime, format='%B') # Appointment Month
d$ApptWeekday <- format(d$AppointmentDateTime, format='%A') # Appointment Weekday (Mon, Tues, Weds)
d$ApptHour    <- format(d$AppointmentDateTime, format='%H') # Appointment hour 1 - 24

Appointment Sum Per Day Feature

Our next feature will be the total number of appointments each patient has on the same date as the predicted appointment. We'll use the same technique by creating a new ID based on the patient's ID and the appointment date. Then we'll sum by the new PatientApptDateID.

In [ ]:
# Count sum of appointments in each day
d$PatientApptDateID <- paste(d$PatientID, format(d$AppointmentDateTime, format='%m%d%Y'), sep='') # Create Unique ID with patient's ID and appointment date

# Sum appointments on the same day using our created PatientApptDateID
d[, ApptSumPerDay := sum(Appt), by=PatientApptDateID]

Homeless, Substance Abuse, and Low Income Features

Next, we'll fill in missing values in the AppointmentLength feature with 'None'. We'll fill in missing values in the Homeless, SubstanceAbuse, and LowIncome columns with 0's and transform all others into 1's.

In [30]:
# Fill length of appointments that are missing values with 'none'
d$AppointmentLength <- ifelse(is.na(d$AppointmentLength), 0, 1)

# Patients identified as homeless
d$Homeless <- ifelse(is.na(d$Homeless), 0, 1)

# Patients identified as substance abusers
d$SubstanceAbuse <- ifelse(is.na(d$SubstanceAbuse), 0, 1)

# Patients identified as receiving low income
d$LowIncome <- ifelse(is.na(d$LowIncome), 0, 1)

Mapping No-shows by Patient Address

Next, let's look to see if patients are more likely to no-show based on their home address' geolocation. If so, we can create a new categorical feature that groups patients into different geo areas. We'll use the ggmap package to plot patient's home addresses' longitude and latitude coordinates on top of a map of New York. We'll color the shows in blue and the no-shows in red to differentiate between the two groups.

First, we're going to separate the no-shows from shows and save them in two different data frames. Then, we'll call the ggmap() function and pass in our longitude and latitude vectors from the no-shows' data frame.

In [31]:
# Create new vectors
labels     <- subset(d, Label == 1) # Get only missed opportunities
non.labels <- subset(d, Label == 0) # Get completed appointments

# Assign longitude and latitude as the center for the map we're going to download with the get_map() function
center <- c(-73.895283, 40.845000)

# Coordinates of Bronx VAMC to plot its location on the map
lat <-  40.868521
lon <- -73.905272

# Plot to find high concentrations of no-shows
ggmap(get_map(center, zoom=13, source='stamen', maptype='toner-lines'), extent="device") +
    ggtitle("Shows and No-shows by Patient's Addresses") +
    # Plot no-shows
    stat_density2d(data=labels,
                   aes(x=Lon, y=Lat, fill="No-shows", alpha=..level..),
                   size=2, bins=4, geom="polygon") +
    # Plot shows
    stat_density2d(data=non.labels,
                   aes(x=Lon, y=Lat, fill="Shows", alpha=..level..),
                   size=2, bins=4, geom="polygon") +
    # Add point and label for Bronx VAMC
    geom_point(aes(x=-73.905272, y=40.867021), col="black", alpha=1, size=3, pch=15) +
    geom_text(mapping=aes(x=-73.905272, y=40.865500, label="VAMC"), size=4)

rm(labels, non.labels) # Remove temp data frame from environment
Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=40.845,-73.895283&zoom=13&size=640x640&scale=2&maptype=terrain&sensor=false
Map from URL : http://tile.stamen.com/toner-lines/13/2413/3074.png
Map from URL : http://tile.stamen.com/toner-lines/13/2414/3074.png
Map from URL : http://tile.stamen.com/toner-lines/13/2415/3074.png
Map from URL : http://tile.stamen.com/toner-lines/13/2413/3075.png
Map from URL : http://tile.stamen.com/toner-lines/13/2414/3075.png
Map from URL : http://tile.stamen.com/toner-lines/13/2415/3075.png
Map from URL : http://tile.stamen.com/toner-lines/13/2413/3076.png
Map from URL : http://tile.stamen.com/toner-lines/13/2414/3076.png
Map from URL : http://tile.stamen.com/toner-lines/13/2415/3076.png
Map from URL : http://tile.stamen.com/toner-lines/13/2413/3077.png
Map from URL : http://tile.stamen.com/toner-lines/13/2414/3077.png
Map from URL : http://tile.stamen.com/toner-lines/13/2415/3077.png
Warning message:
"`panel.margin` is deprecated. Please use `panel.spacing` property instead"Warning message:
"Removed 49391 rows containing non-finite values (stat_density2d)."Warning message:
"Removed 323706 rows containing non-finite values (stat_density2d)."