Dummy coding in SAS

Here is a macro to generate binary features (also called dummy coding) from a nominal variable (also called a categorical variable, such as eye color). This is also called one-hot encoding and one-of-K encoding.

Automating this saves time and avoids mistakes when a categorical variable has many levels or when its levels can change over time.
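For comparison, here is a minimal sketch of the hand-written coding that a macro like this replaces. The data set names (manual_demo, manual_coded), the rows, and the eye_ variable names are made up for illustration.

```sas
/* Hypothetical sample data, for illustration only */
data manual_demo;
	infile datalines dsd;
	input eye $;
	datalines;
blue
brown
green
brown
;
run;

/* Hand-written dummy coding: one assignment per level. */
/* A comparison in SAS evaluates to 1 (true) or 0 (false). */
data manual_coded;
	set manual_demo;
	length eye_blue eye_brown eye_green 3;
	eye_blue  = (eye = 'blue');
	eye_brown = (eye = 'brown');
	eye_green = (eye = 'green');
run;
```

Every new level needs another hand-written line, and a misspelled level silently produces a column of zeros.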

/* A macro for dummy coding */
%macro nominal_to_binary(
		sm_dataset=/* data set */, 
		sm_var= /* categorical variable */, 
		sm_prefix= /* prefix for dummy variables */);

/* Find the unique levels of the categorical variable */
proc sort data=&sm_dataset(keep=&sm_var) out=&sm_dataset._unique nodupkey;
	by &sm_var;

data _null_;
	set &sm_dataset._unique end=end;
	/* Use CALL EXECUTE to dynamically create a macro that executes */
	/* after this DATA step finishes. The metaprogrammed macro */
	/* modifies the original data set. */
	if _N_ eq 1 then do;
		call execute("data &sm_dataset;");
		call execute("set &sm_dataset;");
	end;
	/* Emit one LENGTH statement and one assignment per level */
	call execute(cat("length &sm_prefix", &sm_var," 3;")); /* use minimum storage */
	call execute(cats("&sm_prefix", &sm_var," = &sm_var = '", &sm_var,"';"));
	if end then call execute('run;');
run;

proc sql;
	/* Clean up */
	drop table &sm_dataset._unique;
quit;

%mend nominal_to_binary;

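/* Generate a sample data set */
/* (the rows below are illustrative; the original sample values are not shown) */
data nominal;
	infile datalines dsd;
	input hair$ eye$;
	datalines;
black,brown
blond,blue
brown,green
black,blue
;
run;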

/* Example invocation */
%nominal_to_binary(sm_dataset=nominal, sm_var=hair, sm_prefix=hair_);
%nominal_to_binary(sm_dataset=nominal, sm_var=eye, sm_prefix=eye_);

Here is the final data set:

[Image: binary features from categorical variable]

This generates k binary variables for a categorical variable with k levels, so it is not suitable for linear regression unless you remove one of the binary features (otherwise the k columns sum to the intercept column and are perfectly collinear). Really, in SAS you would pass the categorical variable directly to a regression procedure that supports the CLASS statement (such as GLM, LOGISTIC, or GLMSELECT; PROC REG does not have one).
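As a rough sketch of the CLASS approach, the example below fits a one-way model with PROC GLM. The data set class_demo, the response y, and all values are hypothetical and exist only to make the example self-contained.

```sas
/* Hypothetical data: a numeric response y and a categorical predictor eye */
data class_demo;
	infile datalines dsd;
	input eye $ y;
	datalines;
blue,1.2
blue,1.5
brown,2.1
brown,2.4
green,3.0
green,3.3
;
run;

/* CLASS tells PROC GLM to build the design columns for eye itself, */
/* so no dummy variables have to be created beforehand. */
proc glm data=class_demo;
	class eye;
	model y = eye / solution;
run;
quit;
```

The SOLUTION option prints the parameter estimates, where one level of eye serves as the reference level.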

I use this macro for machine learning, and I keep all k levels. Decision trees seem like they shouldn’t benefit from one-hot encoding, but in my experience, decision trees built with party::ctree (in R) seem to overfit when categorical variables are coded as factors.

The excellent UCLA stats web site has an alternate dummy coding method using arrays.
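For readers who have not seen array-based dummy coding, here is one possible sketch of the general idea. It is not the UCLA code; the data set names, variable names, and values are assumptions, and the levels have to be known in advance.

```sas
/* Hypothetical sample data, for illustration only */
data array_demo;
	infile datalines dsd;
	input eye $;
	datalines;
blue
brown
green
brown
;
run;

data array_coded;
	set array_demo;
	/* One dummy variable per level, in the same order as the level list */
	array eye_dummy{3} eye_blue eye_brown eye_green;
	array eye_level{3} $ 8 _temporary_ ('blue' 'brown' 'green');
	do i = 1 to 3;
		/* The comparison evaluates to 1 when the level matches, else 0 */
		eye_dummy{i} = (eye = eye_level{i});
	end;
	drop i;
run;
```

Unlike the macro, this version does not discover the levels from the data, so the two arrays must be edited by hand whenever the levels change.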

This macro was tested on SAS 9.3 on Windows 7 64-bit.

