Compare performance of machine learning classifiers in R

This tutorial shows the R novice how to build five machine learning classification models and compare their performance graphically, with all five ROC curves on a single chart.

For a simpler introduction, start with Plot ROC curve and lift chart in R.

```
# load the mlbench package, which has the BreastCancer data set
require(mlbench)
# if you don't have any required package, use the install.packages() command
data(BreastCancer)
# some algorithms don't like missing values, so remove rows with missing values
BreastCancer <- na.omit(BreastCancer)
# remove the unique identifier, which is useless and would confuse the machine learning algorithms
BreastCancer$Id <- NULL
# partition the data set for 80% training and 20% evaluation (adapted from ?randomForest)
set.seed(2)
ind <- sample(2, nrow(BreastCancer), replace = TRUE, prob=c(0.8, 0.2))
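# Optionally inspect the split: ind is 1 for a training row and 2 for an
# evaluation row, so the counts below will be roughly (not exactly) 80/20,
# because the assignment is random.
table(ind)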

# create model using recursive partitioning on the training data set
require(rpart)
x.rp <- rpart(Class ~ ., data=BreastCancer[ind == 1,])
# predict classes for the evaluation data set
x.rp.pred <- predict(x.rp, type="class", newdata=BreastCancer[ind == 2,])
# score the evaluation data set (extract the probabilities)
x.rp.prob <- predict(x.rp, type="prob", newdata=BreastCancer[ind == 2,])

# To view the decision tree, uncomment this line.
# plot(x.rp, main="Decision tree created using rpart")

# create model using conditional inference trees
require(party)
x.ct <- ctree(Class ~ ., data=BreastCancer[ind == 1,])
x.ct.pred <- predict(x.ct, newdata=BreastCancer[ind == 2,])
# treeresponse() returns a list of per-case probability pairs (benign, malignant);
# keep p(malignant) = 1 - p(benign) by taking every other element
x.ct.prob <- 1 - unlist(treeresponse(x.ct, BreastCancer[ind == 2,]), use.names=F)[seq(1, nrow(BreastCancer[ind == 2,]) * 2, 2)]

# To view the decision tree, uncomment this line.
# plot(x.ct, main="Decision tree created using conditional inference trees")

# create model using a random forest ensemble of conditional inference trees
x.cf <- cforest(Class ~ ., data=BreastCancer[ind == 1,], control = cforest_unbiased(mtry = ncol(BreastCancer)-2))
x.cf.pred <- predict(x.cf, newdata=BreastCancer[ind == 2,])
# same probability extraction as for ctree above
x.cf.prob <- 1 - unlist(treeresponse(x.cf, BreastCancer[ind == 2,]), use.names=F)[seq(1, nrow(BreastCancer[ind == 2,]) * 2, 2)]

# create model using bagging (bootstrap aggregating)
require(ipred)
x.ip <- bagging(Class ~ ., data=BreastCancer[ind == 1,])
x.ip.prob <- predict(x.ip, type="prob", newdata=BreastCancer[ind == 2,])

# create model using svm (support vector machine)
require(e1071)
# svm requires tuning
x.svm.tune <- tune(svm, Class~., data = BreastCancer[ind == 1,],
                   ranges = list(gamma = 2^(-8:1), cost = 2^(0:4)),
                   tunecontrol = tune.control(sampling = "fix"))
# display the tuning results (in text format)
x.svm.tune
# If the tuning results are on the margin of the parameters (e.g., gamma = 2^-8),
# then widen the parameters.
# I manually copied the cost and gamma from console messages above to parameters below.
x.svm <- svm(Class~., data = BreastCancer[ind == 1,], cost=4, gamma=0.0625, probability = TRUE)
x.svm.prob <- predict(x.svm, type="prob", newdata=BreastCancer[ind == 2,], probability = TRUE)

##
## plot ROC curves to compare the performance of the individual classifiers
##

# Output the plot to a PNG file for display on web.  To draw to the screen,
# comment this line out.
png(filename="roc_curve_5_models.png", width=700, height=700)

# load the ROCR package which draws the ROC curves
require(ROCR)

# create an ROCR prediction object from rpart() probabilities
x.rp.prob.rocr <- prediction(x.rp.prob[,2], BreastCancer[ind == 2,'Class'])
# prepare an ROCR performance object for ROC curve (tpr=true positive rate, fpr=false positive rate)
x.rp.perf <- performance(x.rp.prob.rocr, "tpr","fpr")
# plot it
plot(x.rp.perf, col=2, main="ROC curves comparing classification performance of five machine learning models")

# Draw a legend.
legend(0.6, 0.6, c('rpart', 'ctree', 'cforest','bagging','svm'), 2:6)

# ctree
x.ct.prob.rocr <- prediction(x.ct.prob, BreastCancer[ind == 2,'Class'])
x.ct.perf <- performance(x.ct.prob.rocr, "tpr","fpr")
# add=TRUE draws on the existing chart
plot(x.ct.perf, col=3, add=TRUE)

# cforest
x.cf.prob.rocr <- prediction(x.cf.prob, BreastCancer[ind == 2,'Class'])
x.cf.perf <- performance(x.cf.prob.rocr, "tpr","fpr")
plot(x.cf.perf, col=4, add=TRUE)

# bagging
x.ip.prob.rocr <- prediction(x.ip.prob[,2], BreastCancer[ind == 2,'Class'])
x.ip.perf <- performance(x.ip.prob.rocr, "tpr","fpr")
plot(x.ip.perf, col=5, add=TRUE)

# svm
x.svm.prob.rocr <- prediction(attr(x.svm.prob, "probabilities")[,2], BreastCancer[ind == 2,'Class'])
x.svm.perf <- performance(x.svm.prob.rocr, "tpr","fpr")
plot(x.svm.perf, col=6, add=TRUE)
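# (optional addition, not part of the original tutorial) AUC gives a
# single-number summary of each curve, computed from the same ROCR
# prediction objects:
sapply(list(rpart = x.rp.prob.rocr, ctree = x.ct.prob.rocr,
            cforest = x.cf.prob.rocr, bagging = x.ip.prob.rocr,
            svm = x.svm.prob.rocr),
       function(p) performance(p, "auc")@y.values[[1]])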

# Close and save the PNG file.
dev.off()
```

Here are some exercises left for the reader:

• Is the performance good for a medical diagnostic? How about for a direct mailing campaign?
• Can rpart() performance be improved? Is rpart() overfitting? Tweak its parameters.
• For printing on paper, output to PDF or SVG instead of PNG.
• Add another classifier algorithm or tweak the settings of an existing classifier (but plot it as a separate ROC curve). Hint: the randomForest() function may get confused because the covariates are factors.
• Create a generic R function to abstract the process of adding another classifier.
• Switch from the BreastCancer to the kyphosis data set.
• Adapt the generic function above so it accepts arbitrary data sets.
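For the generic-function exercise above, here is one possible starting sketch. The helper name `add_roc` and its arguments are invented for illustration; it assumes you pass the predicted probabilities of the positive class.

```
require(ROCR)

# hypothetical helper: draw one ROC curve, or add it to an existing chart
add_roc <- function(prob, labels, color, add = FALSE, ...) {
  pred <- prediction(prob, labels)
  perf <- performance(pred, "tpr", "fpr")
  plot(perf, col = color, add = add, ...)
  invisible(perf)
}

# usage with objects from the tutorial, e.g.:
# add_roc(x.rp.prob[, 2], BreastCancer[ind == 2, 'Class'], color = 2)
# add_roc(x.ct.prob, BreastCancer[ind == 2, 'Class'], color = 3, add = TRUE)
```

Returning the performance object invisibly lets a caller inspect the curve later without cluttering the console.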

8 thoughts on “Compare performance of machine learning classifiers in R”

1. MM_UAM says:

When computing SVM performance it is written: x.svm.prob.rocr <- prediction(attr(x.svm.prob, "probabilities")[,2], BreastCancer[ind == 2,'Class']).
Maybe my question is stupid, but I wonder why you choose the second column of probabilities a priori? The second column gives the probability of “malignant”. What if the actual label is “benign”? Shouldn’t we take the probabilities of the actual labels into account? I ask because I am comparing classifier performance on my own dataset and I don’t understand this particular line.

• heuristicandrew says:

You may have noticed that the first column is benign and the second is malignant, matching the class labels:

```
> head(attr(x.svm.prob, "probabilities"))
        benign   malignant
5  0.983368820 0.016631180
6  0.009886121 0.990113879
8  0.990562164 0.009437836
16 0.027112649 0.972887351
17 0.993830078 0.006169922
23 0.995430977 0.004569023
```

Also, each row sums to one:

```> head(rowSums(attr(x.svm.prob, "probabilities")))
5  6  8 16 17 23
1  1  1  1  1  1
```

You are right it would be better to use

`attr(x.svm.prob, "probabilities")[,'malignant']`

If the columns were switched for some reason, you would quickly notice during model selection because the metrics would be very poor. I tried this recently with an ROC curve, and it was convex instead of concave.
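That by-name indexing is robust to column order, as a small base-R illustration shows (the matrix below is made-up data shaped like `attr(x.svm.prob, "probabilities")`):

```
# toy probability matrix with named columns, like the SVM output above
probs <- matrix(c(0.98, 0.01, 0.02, 0.99), nrow = 2,
                dimnames = list(NULL, c("benign", "malignant")))
# reorder the columns; by-name indexing still returns the right values
reordered <- probs[, c("malignant", "benign")]
stopifnot(identical(probs[, "malignant"], reordered[, "malignant"]))
```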

2. dana, KIM says:

Thank you for your code. It is really useful to me.
I have a question about repetition: if I want to repeat the SVM 100 times and then plot the ROC curve and lift chart, how can I do that?
Can I use the mean of x.values and y.values?

• heuristicandrew says:

Go to the ROCR home page, open the slide deck for the tutorial talk, and find the slide “Examples (3/8): Averaging across multiple runs”, which shows this code:

```
pred <- prediction(scores, labels)
perf <- performance(pred, "tpr", "fpr")
```

I haven’t tried this, but it looks like a good place to start.
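That slide relies on ROCR’s list interface, which can be sketched as follows. The simulated scores below are made up to stand in for the 100 SVM runs; `avg = "vertical"` is the ROCR plotting option that averages the curves.

```
require(ROCR)
set.seed(1)
# one list element per repetition; here 10 simulated runs of 50 cases each
labels <- replicate(10, sample(0:1, 50, replace = TRUE), simplify = FALSE)
scores <- lapply(labels, function(y) y + rnorm(50))  # stand-in for SVM scores
pred <- prediction(scores, labels)
perf <- performance(pred, "tpr", "fpr")
# average the 10 curves vertically and show the spread around the mean
plot(perf, avg = "vertical", spread.estimate = "stderror")
```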

3. dana, KIM says:

Hi~ I have one more question.
When I use cforest, I want to use weights for misclassification with unbalanced data.
However, I don’t understand the use of the weights option. I want to give different weights to the two groups: one group is 1, the other is 9.
So I use this

```
wts[1] <- 1
wts[2] <- 9
```

but I get the error ‘weights’ are not a double matrix of 1258 rows
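(A likely fix, sketched here for completeness: party’s `cforest()` expects one case weight per training row, not one weight per class, so the weights vector must have length `nrow(data)`. The class names and the weight of 9 below follow the question; adapt them to your own data.)

```
require(party)
require(mlbench)
data(BreastCancer)
BreastCancer <- na.omit(BreastCancer)
BreastCancer$Id <- NULL
# one weight per observation: 9 for the rarer malignant class, 1 for benign
wts <- ifelse(BreastCancer$Class == "malignant", 9, 1)
x.cf.w <- cforest(Class ~ ., data = BreastCancer, weights = wts,
                  control = cforest_unbiased(mtry = 3))
```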