Model decision tree in R, score in Base SAS

This code creates a decision tree model in R using party::ctree() and prepares the model for export from R to Base SAS, so SAS can score new records. SAS Enterprise Miner and PMML are not required, and Base SAS can be on a separate machine from R because SAS does not invoke R.

I use this method for several reasons. First, it’s convenient for my workflow to build all data sets in SAS, model in R, and score in SAS. Second, for some strange reason the R predict() function consumes an insane amount of memory on large data sets (much more memory for scoring than for building the model), so in many cases this is the only way to get the scores for the evaluation set and score set. I can model with “big data” (bigger than what fits on my laptop with 6GB of RAM): I spin up a 64GB Linux instance on Amazon EC2, and installing R is almost as easy as sudo apt-get install r-base. Finally, this output is in some ways easier to read than print()ing or plot()ing the model in R.

The code supports numeric and factor variables in R, but it does not yet support missing values. For numeric variables, the export may work as-is in other programming languages such as C, C++, and Java, and minimal effort may be enough to support character variables and Python. If you have an improvement, post a comment with WordPress source code highlighting.

# Copyright (C) 2011 Andrew Ziem
# Licensed under the GNU General Public License version 2 or later <https://www.gnu.org/licenses/gpl-2.0.html>

# get node ID for left child
btree_left <- function(mytree, parent_id)
{
	nodes(mytree, parent_id)[[1]]$left$nodeID
}

# get right child
btree_right <- function(mytree, parent_id)
{
	nodes(mytree, parent_id)[[1]]$right$nodeID
}

# get prediction for this node
btree_prediction <- function(mytree, node_id)
{
	p <- nodes(mytree, node_id)[[1]]$prediction
	# a length-2 prediction (for example, two class probabilities): keep the second element
	if (2 == length(p)) {
		return(p[2])
	}
	return(p)
}

# criteria for this node as a string
btree_criteria <- function(mytree, node_id, left)
{
	if (nodes(mytree, node_id)[[1]]$terminal)
	{
		return("(error: terminal node)");
	} 
	if (nodes(mytree, node_id)[[1]]$psplit$ordered)
	{
		sp <- nodes(mytree, node_id)[[1]]$psplit$splitpoint
		vn <- nodes(mytree, node_id)[[1]]$psplit$variableName
		if (left) {
			op <- '<='	
		} else {
			op <- '>'
		}
		return(paste(vn, op, sp))
	} else {
		psplit <- nodes(mytree, node_id)[[1]]$psplit
		if (left){
			l <- as.logical(psplit$splitpoint)
		} else {
			l <- as.logical(!psplit$splitpoint)
		}

		r <- paste(attr(psplit$splitpoint, 'levels')[l], sep='', collapse="','")
		return(paste(psplit$variableName, " in ('", r,"')", sep=''))
	}
}

# walk the tree recursively and flatten each root-to-leaf path into one SAS "else if" rule
walk_node <- function(mytree, node_id = 1, parent_criteria = character(0))
{
	# terminal node: emit one rule from the criteria accumulated so far
	if (nodes(mytree, node_id)[[1]]$terminal) {
		prediction <- btree_prediction(mytree, node_id)
		sprediction <- paste('else if', parent_criteria, 'then prediction =', prediction, ';')
		return(sprediction)
	}

	left_node_id <- btree_left(mytree, node_id)
	right_node_id <- btree_right(mytree, node_id)

	# an inner node should have both children
	if (is.null(left_node_id) != is.null(right_node_id)) {
		warning('left node ID != right node id')
	}
	sprediction <- character(0)
	if (!is.null(left_node_id)) {
		# the root (node 1) has no parent criteria to prepend
		new_criteria <- paste(parent_criteria, btree_criteria(mytree, node_id, TRUE), sep=' and ')
		if (1 == node_id)
			new_criteria <- btree_criteria(mytree, node_id, TRUE)
		sprediction <- walk_node(mytree, left_node_id, new_criteria)
	}
	if (!is.null(right_node_id)) {
		new_criteria <- paste(parent_criteria, btree_criteria(mytree, node_id, FALSE), sep=' and ')
		if (1 == node_id)
			new_criteria <- btree_criteria(mytree, node_id, FALSE)
		sprediction <- paste(sprediction, walk_node(mytree, right_node_id, new_criteria), sep='\n')
	}
	return(sprediction)
}

# demonstration
library(party)
airq <- airquality[complete.cases(airquality),]
airct <- ctree(Ozone ~ ., data = airq)
sprediction <- walk_node(airct)

Here is how ctree prints the tree to the R console. Notice the terminal nodes do not show the values for ozone!

> print(airct)

         Conditional inference tree with 5 terminal nodes

Response:  Ozone 
Inputs:  Solar.R, Wind, Temp, Month, Day 
Number of observations:  111 

1) Temp <= 82; criterion = 1, statistic = 53.676
  2) Wind <= 6.9; criterion = 0.999, statistic = 14.175
    3)*  weights = 9 
  2) Wind > 6.9
    4) Temp <= 77; criterion = 0.997, statistic = 11.921
      5)*  weights = 47 
    4) Temp > 77
      6)*  weights = 21 
1) Temp > 82
  7) Wind <= 10.3; criterion = 0.998, statistic = 12.625
    8)*  weights = 27 
  7) Wind > 10.3
    9)*  weights = 7 

Here is how to print the flattened tree to the R console for use in SAS:

> print(sprediction)
[1] "else if Temp <= 82 and Wind <= 6.9 then prediction = 61 ;\nelse if Temp <= 82 and Wind > 6.9 and Temp <= 77 then prediction = 18.2765957446809 ;\nelse if Temp <= 82 and Wind > 6.9 and Temp > 77 then prediction = 31.1428571428571 ;\nelse if Temp > 82 and Wind <= 10.3 then prediction = 84.074074074074 ;\nelse if Temp > 82 and Wind > 10.3 then prediction = 48.7142857142857 ;"

I copy the output from the last line into Notepad++, and then I use a regular expression to replace each \n with a real line break: this is an ugly hack because I didn’t easily see how to do that in R. Then I remove the first “else” and paste into SAS. Just to be safe, I append this SAS code:

/* sanity check after scoring a record */
if missing(prediction) then abort;
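
The Notepad++ step described above can probably be done in R itself. The following is a minimal sketch, not part of the original workflow: it assumes sprediction holds the string returned by walk_node() (abbreviated here to the first two rules of the output above), and the file name score_tree.sas is made up for illustration. sub() strips only the leading "else", while cat() and writeLines() emit each \n as a real line break.

```r
# Output of walk_node(), abbreviated to the first two rules shown above.
sprediction <- paste(
  "else if Temp <= 82 and Wind <= 6.9 then prediction = 61 ;",
  "else if Temp <= 82 and Wind > 6.9 and Temp <= 77 then prediction = 18.2765957446809 ;",
  sep = "\n"
)

# Drop only the leading "else " so the first rule starts with "if".
sas_code <- sub("^else ", "", sprediction)

# Unlike print(), cat() writes "\n" as a real line break.
cat(sas_code, "\n", sep = "")

# Hypothetical file name; writeLines() saves the same text to disk.
out_file <- file.path(tempdir(), "score_tree.sas")
writeLines(sas_code, out_file)
```

The resulting file could then be pasted or %included into a SAS data step, followed by the sanity check above.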

Be careful of missing values among features (not yet supported) and of cases where R is not aware of all values for factors. For example, if in R a factor takes the values red and green but in SAS it also takes on the value blue, then you will have a problem.
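
One way to catch the factor problem before scoring is to compare levels in R. This is a rough sketch under assumptions: the data frames train and score, the variable color, and the helper unseen_levels() are all hypothetical, and it presumes the distinct values from the SAS scoring data can be brought into R (for example, via a small export).

```r
# Hypothetical training and scoring data; `color` stands in for any factor.
train <- data.frame(color = factor(c("red", "green", "red")))
score <- data.frame(color = c("red", "blue", "green"), stringsAsFactors = FALSE)

# For each factor in `train`, list the values in `score` that R never saw.
unseen_levels <- function(train, score) {
  factor_vars <- names(train)[vapply(train, is.factor, logical(1))]
  out <- lapply(factor_vars, function(v) {
    setdiff(unique(as.character(score[[v]])), levels(train[[v]]))
  })
  names(out) <- factor_vars
  Filter(length, out)  # keep only variables that have unseen values
}

unseen <- unseen_levels(train, score)
print(unseen)
```

An empty result suggests every factor value in the scoring data is covered by some rule. A non-empty one names the values that would fall through all the "else if" branches and trip the missing(prediction) check.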

SAS now is ready to score new observations.

Why stop with a decision tree? Make an ensemble by doing the same with a neural network: Train neural network in R, predict in SAS.

20 thoughts on “Model decision tree in R, score in Base SAS”

  1. This is great.
    For the last bit where you use notepad++, you can replace it like so:
    temp <- print(sprediction)
    gsub(";\n",";",temp)

    • That removes the line breaks, but it leaves everything on one giant line. In Notepad++ I make these \n real line breaks to avoid the one-giant-line problem. If you don’t mind one giant line (the computer is often the one reading it anyway), you can remove the explicit addition of \n from the source code above.

  2. Couldn’t you simply use cat(prediction, file = "theoutput.txt") to avoid Notepad++? You could even use append = TRUE.

    • Good idea. That led me to

      writeClipboard(prediction)

      which avoids the temporary file. With normal Notepad this creates one giant line, but with the SAS editor and Notepad++, this creates the proper line breaks. I’m guessing normal Notepad would want \r\n, the line ending Windows normally uses.

    • Thank you.

      rpart and ctree store their trees differently, but you may be able to convert using the partykit package. I tried it only for a few minutes, but I couldn’t get it working, possibly because it was converting to the partykit variant of ctree. I didn’t spend more time on it because in my experience the trees produced by ctree perform better than those from rpart. Alternatively you would need to modify the code on this page.

    • Do you use these on customer data? I haven’t tried boosting yet on a donor database because of this issue:

      In 2008 Phillip Long (at Google) and Rocco A. Servedio (Columbia University) published a paper at the 25th International Conference for Machine Learning suggesting that these algorithms are provably flawed in that “convex potential boosters cannot withstand random classification noise,” thus making the applicability of such algorithms for real world, noisy data sets questionable.

      (from Wikipedia)

      • Sorry, just saw this reply. I would say that GBM (and boosting in general) is the most accurate algorithm to build predictive models in the CRM / database marketing realm I have ever found. It is amazingly robust, accurate and deals with missing values seamlessly (contrast with randomForest).

    • You could try Model Compression as developed by Cristian Bucilă, Rich Caruana, and Alexandru Niculescu-Mizil at Cornell University. I haven’t looked into it in depth, but basically you build your normal GBM model (or whatever model or ensemble) and score a large data set in R using this model. The large data set can be unlabeled! Then, you use this prediction as the dependent variable of the neural network. Finally, you export the neural network from R to SAS using my code from another article.

      • Yeah this is a pretty well established method. I am not sure how well it works and I have not used it before. Where I have seen the idea is when building a model that can not be taken by rule or formula but requires the data itself – e.g. GAM or KNN.

        My hope is eventually someone writes a direct routine from GBM to if then else rules.

  3. With just a few modifications I was able to convert it to SQL UPDATE syntax. It was useful to me, so I thought I should post it.

    # criteria for this node as a SQL condition
    btree_criteria <- function(mytree, node_id, left)
    {
        if (nodes(mytree, node_id)[[1]]$terminal)
        {
            return("(error: terminal node)");
        }
        if (nodes(mytree, node_id)[[1]]$psplit$ordered)
        {
            sp <- nodes(mytree, node_id)[[1]]$psplit$splitpoint
            vn <- nodes(mytree, node_id)[[1]]$psplit$variableName
            if (left) {
                op <- '<='
            } else {
                op <- '>'
            }
            return(paste(vn, op, sp))
        } else {
            psplit <- nodes(mytree, node_id)[[1]]$psplit
            if (left){
                l <- as.logical(psplit$splitpoint)
            } else {
                l <- as.logical(!psplit$splitpoint)
            }

            r <- paste(attr(psplit$splitpoint, 'levels')[l], sep='', collapse=paste("' OR ",psplit$variableName," = '"))
            return(paste("(",psplit$variableName, " = '", r,"'",")", sep=''))
        }
    }

    # flatten the tree into one UPDATE statement per terminal node
    walk_node <- function(mytree, node_id = 1, parent_criteria = character(0))
    {
        if (nodes(mytree, node_id)[[1]]$terminal) {
            prediction <- btree_prediction(mytree, node_id)
            sprediction <- paste('UPDATE foo SET bar = ',prediction, 'WHERE ', parent_criteria,';')
            return (sprediction)
        }

        left_node_id <- btree_left(mytree, node_id)
        right_node_id <- btree_right(mytree, node_id)

        if (is.null(left_node_id) != is.null(right_node_id)) {
            print('left node ID != right node id')
        }
        sprediction <- character(0)
        if (!is.null(left_node_id)) {
            new_criteria <- paste(parent_criteria, btree_criteria(mytree, node_id, T), sep=' AND ')
            if (1 == node_id)
                new_criteria <- btree_criteria(mytree, node_id, T)
            sprediction <- walk_node(mytree, left_node_id, new_criteria)
        }
        if (!is.null(right_node_id)) {
            new_criteria <- paste(parent_criteria, btree_criteria(mytree, node_id, F), sep=' AND ')
            if (1 == node_id)
                new_criteria <- btree_criteria(mytree, node_id, F)
            sprediction <- paste(sprediction, walk_node(mytree, right_node_id, new_criteria), sep='\n')
        }
        return(sprediction)
    }
    
  4. I have started to use a GBM model and have a requirement to use the scoring engine in XML, C++ or SAS. I do not have much exposure to R and could not understand the code fully. Can you please add slightly more elaborate comments, such as what mytree means and what parent ID is? I appreciate your explanation.

    • mytree is a trained decision tree made in party::ctree(), and parent_id is the node identifier of the parent node being processed. So the function btree_left retrieves the left child ID for the given tree and given parent node.

      The code on this page is not compatible with GBM. I’ve done the same R-to-SAS metaprogramming for GBM version 1.6 and 2.0, but it is not ready for publication.

  5. I have had a bit more success using tree (rather than ctree), honestly my R skills have become a bit rusty from using SAS in corporate America. Is there a way to apply this to a model made from (tree)? I’ve been given permission to program the model in R as long as I can import (a statement exactly like you have created) into SAS for the rest of the team to use.

    Any help would be greatly appreciated.

    • You would have to either rewrite the “decision tree from R to SAS” function on this page based on tree’s internal representation of a decision tree, or you would have to find a way to convert the trained decision tree to party’s internal representation.

  6. Pingback: Mario Segal – Professional Site | Create SAS Code from R ‘tree’ Objects
