This tutorial, with real R code, demonstrates how to create a predictive model using cforest (an implementation of random forests based on conditional inference trees) from the package party, evaluate the model on a held-out set of data, and then plot its performance using ROC curves and a lift chart. These charts are useful for evaluating model performance in data mining and machine learning.

```r
# You only need to install packages once per machine (plus maybe after
# upgrading R); otherwise they persist across R sessions.
install.packages('party')
install.packages('ROCR')

# Load the kyphosis data set.
require(rpart)

# Shuffle the rows, then split 75/25 into training and evaluation sets.
x <- kyphosis[sample(1:nrow(kyphosis), nrow(kyphosis), replace = FALSE), ]
x.train <- x[1:floor(nrow(x) * 0.75), ]
x.evaluate <- x[(floor(nrow(x) * 0.75) + 1):nrow(x), ]

# Create a model using "random forest and bagging ensemble algorithms
# utilizing conditional inference trees."
require(party)
x.model <- cforest(Kyphosis ~ Age + Number + Start, data = x.train,
                   control = cforest_unbiased(mtry = 3))

# Alternatively, use "recursive partitioning [...] in a conditional
# inference framework."
# x.model <- ctree(Kyphosis ~ Age + Number + Start, data = x.train)

# ctree plots nicely (but cforest doesn't plot)
# plot(x.model)

# Use the model to predict the evaluation set.
x.evaluate$prediction <- predict(x.model, newdata = x.evaluate)

# Calculate the overall accuracy.
x.evaluate$correct <- x.evaluate$prediction == x.evaluate$Kyphosis
print(paste("% of predicted classifications correct",
            mean(x.evaluate$correct)))

# Extract the class probabilities.
x.evaluate$probabilities <-
  1 - unlist(treeresponse(x.model, newdata = x.evaluate),
             use.names = FALSE)[seq(1, nrow(x.evaluate) * 2, 2)]

# Plot the performance of the model applied to the evaluation set as
# an ROC curve.
require(ROCR)
pred <- prediction(x.evaluate$probabilities, x.evaluate$Kyphosis)
perf <- performance(pred, "tpr", "fpr")
plot(perf, main = "ROC curve", colorize = TRUE)

# And then a lift chart.
perf <- performance(pred, "lift", "rpp")
plot(perf, main = "lift curve", colorize = TRUE)
```

This tutorial was tested on Linux and Windows with R 2.9.

Here are some exercises for the reader:

- Why use `mtry = 3`? Compare different values, or take out the `control = ...` argument.
- Output the results to PDF for printing.
- Try ctree instead of cforest. Which is better?
- Replace cforest with other classifiers: rpart, randomForest, or svm (e1071).
- Use 10-fold cross-validation instead of the simple splitting (though the party package has cross-validation 'built in').
- Combine two performance curves (for two different classifiers or settings) in one plot.
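As a starting point for the last exercise, here is a minimal sketch (not part of the original tutorial) that fits both cforest and ctree on the same split and overlays their ROC curves in one plot; the seed and colors are arbitrary choices:

```r
require(rpart)   # for the kyphosis data
require(party)
require(ROCR)

# Reproduce the tutorial's 75/25 split (seed chosen arbitrarily).
set.seed(42)
x <- kyphosis[sample(nrow(kyphosis)), ]
x.train <- x[1:floor(nrow(x) * 0.75), ]
x.evaluate <- x[(floor(nrow(x) * 0.75) + 1):nrow(x), ]

m1 <- cforest(Kyphosis ~ Age + Number + Start, data = x.train,
              control = cforest_unbiased(mtry = 3))
m2 <- ctree(Kyphosis ~ Age + Number + Start, data = x.train)

# Helper: probability of the second class ("present"), as in the tutorial.
probs <- function(model) {
  1 - unlist(treeresponse(model, newdata = x.evaluate),
             use.names = FALSE)[seq(1, nrow(x.evaluate) * 2, 2)]
}

perf1 <- performance(prediction(probs(m1), x.evaluate$Kyphosis), "tpr", "fpr")
perf2 <- performance(prediction(probs(m2), x.evaluate$Kyphosis), "tpr", "fpr")

# add = TRUE draws the second curve into the existing plot.
plot(perf1, col = "blue", main = "ROC: cforest vs ctree")
plot(perf2, col = "red", add = TRUE)
legend("bottomright", c("cforest", "ctree"),
       col = c("blue", "red"), lty = 1)
```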

For a similar but more detailed tutorial, read “Guide to Credit Scoring in R” by Dhruv Sharma.

If this programming is too much for you, try rattle (a graphical interface to R for data mining) or Weka (a machine learning suite). Otherwise, go on to the next tutorial: Compare performance of machine learning classifiers in R.


This is a great article, thanks for the post.

Please help me understand the difference. I thought a lift chart plotted the % of responses versus the % of sample size.

Why is the lift curve here plotted as lift value versus RPP?
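For what it's worth, the two views are consistent: in ROCR, lift is defined as the fraction of responses captured (tpr, the y-axis of a "% of responses" gains chart) divided by the rate of positive predictions (rpp, the fraction of the sample scored positive). A quick numeric check on made-up scores and labels (purely illustrative data, not from the post):

```r
require(ROCR)

# Hypothetical scores and 0/1 labels, for illustration only.
scores <- c(0.9, 0.8, 0.7, 0.4, 0.3, 0.2)
labels <- c(1, 1, 0, 1, 0, 0)
pred <- prediction(scores, labels)

lift <- performance(pred, "lift", "rpp")
gain <- performance(pred, "tpr", "rpp")   # % of responses vs % of sample

# Skip the first point (rpp = 0, where lift is 0/0):
l <- lift@y.values[[1]][-1]
g <- gain@y.values[[1]][-1] / gain@x.values[[1]][-1]
isTRUE(all.equal(l, g))   # lift = tpr / rpp at every cutoff
```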

Thank you for the article. It is very helpful!

Hi Andrew,

Thank you very much for such a great article.

I plotted a lift chart for one of my projects using R, but I don't know how to get the table that is used for the plot.

Please let me know how to get a decile-wise lift table using R?

Thanks in advance.
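One way to build such a table, sketched here with base R only (the function name `decile_lift` and the simulated data are my own, not from the post): rank the cases by predicted probability, cut them into ten equal groups, and divide each group's response rate by the overall rate.

```r
# Sketch of a decile-wise lift table from predicted probabilities
# and 0/1 actual outcomes.
decile_lift <- function(prob, actual, groups = 10) {
  ord <- order(prob, decreasing = TRUE)            # best scores first
  decile <- ceiling(seq_along(ord) / (length(ord) / groups))
  responders <- tapply(actual[ord], decile, sum)   # responses per decile
  n <- tapply(actual[ord], decile, length)         # cases per decile
  rate <- responders / n
  data.frame(decile = as.integer(names(n)),
             n = as.vector(n),
             responders = as.vector(responders),
             lift = as.vector(rate) / mean(actual))
}

# Example on simulated data (hypothetical, for illustration only):
set.seed(1)
p <- runif(200)
y <- rbinom(200, 1, p)   # responses correlated with the score
tab <- decile_lift(p, y)
print(tab)
```

The top decile should show a lift well above 1 and the bottom decile well below it, since the simulated responses track the scores.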

You could set a seed and use stratified sampling with caTools::sample.split, instead of the simple split above, for a better train/test split ratio:

```r
require(caTools)
set.seed(1000)
x.split <- sample.split(kyphosis$Kyphosis, SplitRatio = 0.75)
x.train <- kyphosis[x.split == TRUE, ]
x.evaluate <- kyphosis[x.split == FALSE, ]
```