Plot ROC curve and lift chart in R

This tutorial uses real R code to show how to build a predictive model with cforest (an implementation of Breiman’s random forests built on conditional inference trees) from the package party, evaluate the model on a separate set of data, and then plot its performance as an ROC curve and a lift chart. These charts are useful for evaluating model performance in data mining and machine learning.

# You only need to install packages once per machine
# (plus maybe after upgrading R), but otherwise they persist across R sessions.
install.packages('party')
install.packages('ROCR')

# Load the kyphosis data set, which ships with the rpart package.
require(rpart)

# Split randomly: shuffle the rows, then use the first 75% for training
# and the remaining 25% for evaluation.
x <- kyphosis[sample(1:nrow(kyphosis), nrow(kyphosis), replace = FALSE), ]
x.train <- x[1:floor(nrow(x)*.75), ]
x.evaluate <- x[(floor(nrow(x)*.75)+1):nrow(x), ]

# Create a model using "random forest and bagging ensemble algorithms
# utilizing conditional inference trees."
require(party)
x.model <- cforest(Kyphosis ~ Age + Number + Start, data=x.train,
control = cforest_unbiased(mtry = 3))

# Alternatively, use "recursive partitioning [...] in a conditional
# inference framework."
# x.model <- ctree(Kyphosis ~ Age + Number + Start, data=x.train)

# ctree plots nicely (but cforest doesn't plot)
# plot(x.model)

# Use the model to predict the class of each row in the evaluation set.
x.evaluate$prediction <- predict(x.model, newdata=x.evaluate)

# Calculate the overall accuracy (as a percentage).
x.evaluate$correct <- x.evaluate$prediction == x.evaluate$Kyphosis
print(paste("% of predicted classifications correct", 100 * mean(x.evaluate$correct)))

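# Extract the class probabilities. treeresponse returns, for each row,
# a pair of probabilities (absent, present). Take the first of each pair
# and subtract it from 1 to get the probability of "present".
x.evaluate$probabilities <- 1 - unlist(treeresponse(x.model,
    newdata=x.evaluate), use.names=FALSE)[seq(1, nrow(x.evaluate)*2, 2)]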

# Plot the performance of the model applied to the evaluation set as
# an ROC curve.
require(ROCR)
pred <- prediction(x.evaluate$probabilities, x.evaluate$Kyphosis)
perf <- performance(pred,"tpr","fpr")
plot(perf, main="ROC curve", colorize=T)

# And then a lift chart
perf <- performance(pred,"lift","rpp")
plot(perf, main="lift curve", colorize=T)

This tutorial was tested on Linux and Windows with R 2.9.
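
The lift chart above plots lift against the rate of positive predictions (RPP). As a rough illustration of what those two axes mean, here is a minimal base-R sketch that computes them by hand. The scores and labels are made up for this example and are not taken from the kyphosis model. It assumes ROCR's documented definitions: RPP is the fraction of cases flagged as positive, and lift is the true positive rate divided by RPP. Plotting the true positive rate against RPP instead gives the "% of responses vs. % of sample" chart that is often called a cumulative gains chart.

```r
# Hypothetical scores and true labels (1 = positive), invented for illustration.
scores <- c(0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05)
labels <- c(1, 1, 0, 1, 0, 0, 1, 0, 0, 0)

# Rank the cases from the highest score to the lowest.
ord <- order(scores, decreasing = TRUE)
hits <- labels[ord]

# Suppose the top k cases are flagged as positive, for k = 1, 2, ..., n.
k <- seq_along(hits)
rpp <- k / length(hits)           # rate of positive predictions
tpr <- cumsum(hits) / sum(hits)   # share of all positives captured so far
lift <- tpr / rpp                 # how much better than picking at random

lift.table <- data.frame(rpp = rpp, tpr = tpr, lift = lift)
print(lift.table)

# Left: cumulative gains (TPR vs. RPP). Right: lift (lift vs. RPP).
par(mfrow = c(1, 2))
plot(rpp, tpr, type = "b", main = "cumulative gains",
     xlab = "rate of positive predictions", ylab = "true positive rate")
plot(rpp, lift, type = "b", main = "lift",
     xlab = "rate of positive predictions", ylab = "lift")
```

Both charts carry the same information. The lift value at a given RPP is simply the height of the gains curve divided by its x position, so a lift of 1 means the model does no better than random selection.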

Here are some exercises for the reader:

  1. Why use mtry= 3? Compare different values, or take out the control = ....
  2. Output the results to PDF for printing.
  3. Try ctree instead of cforest. Which is better?
  4. Replace cforest with other classifiers: rpart, randomForest, or svm (e1071).
  5. Use 10-fold cross-validation instead of the simple splitting (though the party packages have cross-validation ‘built in’).
  6. Combine two performance curves (for two different classifiers or settings) in one plot.
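
As a starting point for exercise 5, here is one possible sketch of 10-fold cross-validation. To keep it short it uses rpart (one of the classifiers suggested in exercise 4) rather than cforest. The seed, the variable names (fold, cv.prob, cv.class, cv.accuracy), and the 0.5 cutoff are all arbitrary choices made for this example, not part of the tutorial above.

```r
library(rpart)  # provides both the kyphosis data and the rpart classifier

set.seed(42)    # arbitrary seed so the fold assignment is reproducible
k <- 10
n <- nrow(kyphosis)

# Assign every row to one of k folds of nearly equal size, in random order.
fold <- sample(rep(1:k, length.out = n))

# Out-of-fold probability of Kyphosis == "present" for every row.
cv.prob <- numeric(n)
for (i in 1:k) {
  held.out <- fold == i
  # Train on the other k - 1 folds...
  fit <- rpart(Kyphosis ~ Age + Number + Start,
               data = kyphosis[!held.out, ], method = "class")
  # ...and predict only the rows that the model has not seen.
  cv.prob[held.out] <- predict(fit, newdata = kyphosis[held.out, ],
                               type = "prob")[, "present"]
}

# Classify at a 0.5 cutoff and report the cross-validated accuracy.
cv.class <- ifelse(cv.prob > 0.5, "present", "absent")
cv.accuracy <- mean(cv.class == kyphosis$Kyphosis)
print(cv.accuracy)
```

Because every row gets exactly one out-of-fold probability, cv.prob and kyphosis$Kyphosis can be passed to ROCR's prediction() in the same way as the probabilities in the tutorial, which yields ROC and lift curves based on all the rows rather than on a single 25% evaluation set.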

For a similar but more detailed tutorial, read “Guide to Credit Scoring in R” by Dhruv Sharma.

If this programming is too much for you, try rattle (a GUI interface to R for data mining) or Weka (a machine learning suite). Otherwise, go on to the next tutorial: Compare performance of machine learning classifiers in R.


3 thoughts on “Plot ROC curve and lift chart in R”

  1. Pingback: Compare performance of machine learning classifiers in R « Heuristic Andrew

  2. Pingback: Identifing Potential Customers with Classification Techniques in R Language | Data Apple

  3. This is a great article, thanks for the post.

    Please help me understand the difference. I thought lift chart is plotted % of responses vs % of sample size.
    Why is lift curve plotted as Lift Value vs RPP ?
