This tutorial uses working R code to show how to build a predictive model with cforest from the party package (a random forest implementation in the style of Breiman's random forests, built on conditional inference trees), evaluate the model on a held-out set of data, and plot its performance as an ROC curve and a lift chart. Both charts are commonly used to evaluate classifiers in data mining and machine learning.
```r
# You only need to install packages once per machine
# (plus maybe after upgrading R), but otherwise they persist across R sessions.
install.packages('party')
install.packages('ROCR')

# Load the kyphosis data set.
require(rpart)

# Split randomly: shuffle the rows, then take 75% for training
# and the remaining 25% for evaluation. Both subsets must come
# from the shuffled copy x, not from the original kyphosis.
x <- kyphosis[sample(1:nrow(kyphosis), nrow(kyphosis), replace = FALSE), ]
x.train <- x[1:floor(nrow(x) * .75), ]
x.evaluate <- x[(floor(nrow(x) * .75) + 1):nrow(x), ]

# Create a model using "random forest and bagging ensemble algorithms
# utilizing conditional inference trees."
require(party)
x.model <- cforest(Kyphosis ~ Age + Number + Start, data = x.train,
                   control = cforest_unbiased(mtry = 3))

# Alternatively, use "recursive partitioning [...] in a conditional
# inference framework."
# x.model <- ctree(Kyphosis ~ Age + Number + Start, data = x.train)

# ctree plots nicely (but cforest doesn't plot)
# plot(x.model)

# Use the model to predict the evaluation.
x.evaluate$prediction <- predict(x.model, newdata = x.evaluate)

# Calculate the overall accuracy (a proportion between 0 and 1).
x.evaluate$correct <- x.evaluate$prediction == x.evaluate$Kyphosis
print(paste("Proportion of predicted classifications correct:",
            mean(x.evaluate$correct)))

# Extract the class probabilities: take the first probability in each
# pair returned by treeresponse() and subtract it from 1 to get the
# probability of the second class.
x.evaluate$probabilities <- 1 - unlist(treeresponse(x.model, newdata = x.evaluate),
                                       use.names = FALSE)[seq(1, nrow(x.evaluate) * 2, 2)]

# Plot the performance of the model applied to the evaluation set as
# an ROC curve.
require(ROCR)
pred <- prediction(x.evaluate$probabilities, x.evaluate$Kyphosis)
perf <- performance(pred, "tpr", "fpr")
plot(perf, main = "ROC curve", colorize = TRUE)

# And then a lift chart
perf <- performance(pred, "lift", "rpp")
plot(perf, main = "lift curve", colorize = TRUE)
```
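To make the two charts more concrete, here is a minimal base R sketch of the quantities on their axes: true positive rate against false positive rate for the ROC curve, and lift against rate of positive predictions for the lift chart. The scores and labels are toy values invented for illustration, not kyphosis results, and the calculation is a by-hand approximation of what ROCR reports (ROCR may also include an extra point where nothing is predicted positive).

```r
# Toy data, invented for illustration: model scores and true labels.
scores <- c(0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1)
labels <- c(1,   1,   0,   1,   0,   0,   1,   0)   # 1 = positive class

# Sort cases from highest to lowest score. Each position acts as a cutoff:
# that case and every case above it are predicted positive.
ord  <- order(scores, decreasing = TRUE)
hits <- labels[ord]

n.predicted <- seq_along(hits)   # number of cases predicted positive
tp <- cumsum(hits)               # true positives among them
fp <- n.predicted - tp           # false positives among them

tpr  <- tp / sum(labels)                    # y-axis of the ROC curve
fpr  <- fp / sum(labels == 0)               # x-axis of the ROC curve
rpp  <- n.predicted / length(labels)        # x-axis of the lift chart
lift <- (tp / n.predicted) / mean(labels)   # y-axis of the lift chart

print(data.frame(cutoff = scores[ord], tpr, fpr, rpp, lift))
```

A lift of 2 at the top cutoff means the highest-scored cases contain positives at twice the overall rate; by the time every case is predicted positive, lift falls back to 1.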
This tutorial was tested on Linux and Windows with R 2.9.
Here are some exercises for the reader:
- Why use mtry = 3? Compare different values, or remove the control = ... argument.
- Output the results to PDF for printing.
- Try ctree instead of cforest. Which is better?
- Replace cforest with other classifiers: rpart, randomForest, or svm (e1071).
- Use 10-fold cross-validation instead of the simple split (though the party package has cross-validation ‘built in’).
- Combine two performance curves (for two different classifiers or settings) in one plot.
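As a possible starting point for the cross-validation exercise, here is a minimal sketch of manual 10-fold cross-validation. It swaps in rpart as the classifier, since rpart ships with R and needs no extra install. The fold assignment, the seed, and the names folds, fold.accuracy, and cv.accuracy are illustrative choices, not part of the original script.

```r
library(rpart)  # provides both the kyphosis data and the rpart() classifier

set.seed(42)    # arbitrary seed, only so the fold assignment is repeatable
k <- 10

# Assign each row to one of k folds at random, keeping fold sizes nearly equal.
folds <- sample(rep(1:k, length.out = nrow(kyphosis)))

# Hold out each fold in turn, fit on the other folds, and score the held-out rows.
fold.accuracy <- sapply(1:k, function(i) {
  train <- kyphosis[folds != i, ]
  test  <- kyphosis[folds == i, ]
  model <- rpart(Kyphosis ~ Age + Number + Start, data = train)
  predicted <- predict(model, newdata = test, type = "class")
  mean(predicted == test$Kyphosis)
})

# Average the per-fold accuracies into a single cross-validated estimate.
cv.accuracy <- mean(fold.accuracy)
print(cv.accuracy)
```

Replacing the rpart() call and the matching predict() call with another classifier should let the same loop compare models on identical folds.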
For a similar but more detailed tutorial, read “Guide to Credit Scoring in R” by Dhruv Sharma.
If this programming is too much for you, try rattle (a GUI for data mining in R) or Weka (a machine learning suite). Otherwise, go on to the next tutorial: Compare performance of machine learning classifiers in R.