# Get Up And Running With XGBoost In R

By James Marquez, April 30, 2017

The goal of this article is to get you running XGBoost on a classification problem and measuring its performance as quickly as possible. It won't cover feature engineering, model tuning, or the theory and math behind the algorithm; there's already a plethora of free resources for those topics. In my opinion, I learn best by first running my data through an algorithm and then using various resources to learn how to improve my prediction performance.

## Loading The Dataset

We'll be using the popular Titanic: Machine Learning from Disaster dataset from Kaggle, where we'll predict which passengers survived the sinking of the Titanic. You can download the data and learn all about it from the provided link. We're going to load train.csv and split it into a train set and a test set. We'll train the model on the train set with its known outcomes, then evaluate the model's performance on the held-out test set.

```
# Load the dataset into our environment
mydata <- read.csv('/file_path/train.csv')
# Select and rearrange the order of the features we'll be using
mydata <- mydata[, c('Survived', 'Pclass', 'Sex', 'Age', 'SibSp', 'Parch', 'Fare', 'Embarked', 'Cabin')]
# Convert the Pclass feature to an ordered factor 1 < 2 < 3
mydata$Pclass <- factor(mydata$Pclass, ordered=TRUE)
str(mydata) # View the structure of our dataset
head(mydata) # View the first six rows of the dataset
```

## Finding Missing Data

Before we split our dataset into a train and test set, we must deal with missing data. We'll find every feature with missing data and, for simplicity's sake, simply remove those rows.

```
sapply(mydata, function(x) sum(is.na(x)))
```

The Age feature is missing 177 observations, so we'll remove those rows altogether.

```
# Get all rows that are not NA in the Age feature
mydata <- subset(mydata, !is.na(Age))
```

```
# Load the caret package
library(caret)
# Set the seed to create reproducible train and test sets
set.seed(300)
# Create a stratified random sample to create train and test sets
# Reference the outcome variable
trainIndex <- createDataPartition(mydata$Survived, p=0.75, list=FALSE, times=1)
train <- mydata[ trainIndex, ]
test <- mydata[-trainIndex, ]
# Create separate vectors of our outcome variable for both our train and test sets
# We'll use these to train and test our model later
train.label <- train$Survived
test.label <- test$Survived
```
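As an optional sanity check (not part of the original workflow), you can confirm that `createDataPartition` produced a stratified split by comparing the proportion of survivors in each set; the two proportions should be nearly identical. This sketch assumes the `train` and `test` data frames from the previous step:

```
# Compare the proportion of survivors in the train and test sets
# (assumes the train/test split created above)
prop.table(table(train$Survived))
prop.table(table(test$Survived))
```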

## One-hot Encoding

XGBoost requires a numeric matrix as input, so we'll use a sparse matrix, which is a memory-efficient way to represent a large dataset that holds many zeros. The Matrix package lets us convert our data frame to a sparse matrix and expand all our factor (categorical) features into dummy variables in one step.

```
# Load the Matrix package
library(Matrix)
# Create sparse matrices and perform one-hot encoding to create dummy variables
dtrain <- sparse.model.matrix(Survived ~ .-1, data=train)
dtest <- sparse.model.matrix(Survived ~ .-1, data=test)
# View the number of rows and features of each set
dim(dtrain)
dim(dtest)
```
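To see what the one-hot encoding actually produced, you can list the generated column names; each level of a factor becomes its own dummy column (for example, the `Sex` factor splits into separate male and female columns). A quick check, assuming the `dtrain` matrix created above:

```
# View the dummy-variable column names created from the factors
# (assumes the dtrain sparse matrix created above)
colnames(dtrain)
```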

## Training The Model

For simplicity's sake, we'll use the hyperparameters below, but you can improve the performance of your model by tuning them with the caret package. We're using objective = "binary:logistic" because we're performing binary classification with logistic regression. We're also using eval_metric = "error", which is appropriate for classification problems. You can learn about all available options in the XGBoost documentation.

```
# Load the XGBoost package
library(xgboost)
# Set our hyperparameters
param <- list(objective = "binary:logistic",
              eval_metric = "error",
              max_depth = 7,
              eta = 0.1,
              gamma = 1,
              colsample_bytree = 0.5,
              min_child_weight = 1)
set.seed(1234)
# Pass in our hyperparameters and train the model
system.time(xgb <- xgboost(params = param,
                           data = dtrain,
                           label = train.label,
                           nrounds = 500,
                           print_every_n = 100,
                           verbose = 1))
```
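Because we set eval_metric = "error", the fitted object keeps a per-round log of the training error, which you can plot to see whether 500 rounds was enough for the error to level off. A quick sketch, assuming the `xgb` object trained above:

```
# Plot training error by boosting round
# (assumes the xgb object trained above; xgboost stores the log
#  in xgb$evaluation_log when an eval_metric is set)
plot(xgb$evaluation_log$iter,
     xgb$evaluation_log$train_error,
     type = "l",
     xlab = "Boosting round",
     ylab = "Training error")
```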

## Evaluate The Model's Performance With A Confusion Matrix

The confusion matrix provides us with a large amount of information regarding the performance of our model. But first, we must choose a threshold. The threshold is the cutoff at which we separate positive predictions from negative ones. Below, we have chosen 0.86 as our cutoff. The model produces a probability for each prediction, and we convert it to a 1 (Survived) if the probability is greater than or equal to 0.86 and a 0 (Did Not Survive) if the probability is below 0.86.

```
# Create our prediction probabilities
pred <- predict(xgb, dtest)
# Set our cutoff threshold
pred.resp <- ifelse(pred >= 0.86, 1, 0)
# Create the confusion matrix
confusionMatrix(factor(pred.resp), factor(test.label), positive="1")
```

Without creating new features or tuning our hyperparameters, we were able to get a 76% true positive rate and a 94% true negative rate with a Kappa score of 0.7173.

## Determine Which Features Are Most Important

Now, we can view which features had the greatest impact on predictive performance.

```
# Dump the trained model as text (optional; useful for inspecting the trees)
model <- xgb.dump(xgb, with_stats=TRUE)
# Get the real feature names
names <- dimnames(dtrain)[[2]]
# Compute the feature importance matrix and keep the top 20 features
importance_matrix <- xgb.importance(feature_names=names, model=xgb)[1:20]
# Plot
xgb.plot.importance(importance_matrix)
```

## Plotting The ROC To View Various Thresholds

An ROC curve allows us to visualize our model's performance at different thresholds. Each dot on the curved line marks a threshold and shows its average true positive rate and average false positive rate. As the threshold decreases, the average true positive rate rises, but so does the average false positive rate. It's important to select a threshold that provides an acceptable true positive rate while also limiting the false positive rate. You can read more at https://en.wikipedia.org/wiki/Receiver_operating_characteristic.

```
library(ROCR)
# Use ROCR package to plot ROC Curve
xgb.pred <- prediction(pred, test.label)
xgb.perf <- performance(xgb.pred, "tpr", "fpr")
plot(xgb.perf,
avg="threshold",
colorize=TRUE,
lwd=1,
main="ROC Curve w/ Thresholds",
print.cutoffs.at=seq(0, 1, by=0.05),
text.adj=c(-0.5, 0.5),
text.cex=0.5)
grid(col="lightgray")
axis(1, at=seq(0, 1, by=0.1))
axis(2, at=seq(0, 1, by=0.1))
abline(v=c(0.1, 0.3, 0.5, 0.7, 0.9), col="lightgray", lty="dotted")
abline(h=c(0.1, 0.3, 0.5, 0.7, 0.9), col="lightgray", lty="dotted")
lines(x=c(0, 1), y=c(0, 1), col="black", lty="dotted")
```
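Rather than eyeballing the dots on the curve, you can pull a candidate cutoff straight out of the ROCR objects. This sketch (assuming the `xgb.pred` and `xgb.perf` objects created above) picks the threshold that maximizes the difference between the true positive rate and the false positive rate, known as Youden's J statistic; it's one common rule of thumb, not the only way to choose:

```
# Extract TPR, FPR, and the cutoffs ROCR evaluated
# (assumes xgb.perf from the ROC code above)
tpr <- xgb.perf@y.values[[1]]
fpr <- xgb.perf@x.values[[1]]
cutoffs <- xgb.perf@alpha.values[[1]]
# Youden's J statistic: TPR - FPR; pick the cutoff that maximizes it
best <- which.max(tpr - fpr)
cutoffs[best]
```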

These are decent results, but you can get much better predictions by creating new features. However, that topic is for another discussion.

Please leave a comment if you have any questions, spot any errors, or know of any other packages or techniques for improving classification performance. You can grab the notebook from my GitHub here get_up_and_running_with_xgboost_in_r.ipynb. Thanks for reading!