Strip Non-Printable ASCII Characters (SAS)

Say your SAS data set, which comes from a text file, XML, or a database, has non-printable characters that look like garbage, perhaps an odd square. These are probably control characters that mean something to some system, but they may be useless to you. The good news is that they are easy to filter out or replace.


Documents come in various encodings. The most common is ASCII, which is made up of the basic Latin alphabet (as used in English), numbers, punctuation, and control characters (such as backspace and carriage return). Strictly speaking, ASCII is a 7-bit encoding that defines 128 characters, hex 00 through 7F. Unlike UTF-16, it fits in a single byte, and a byte can hold 256 values (hex 00 through FF), so the values from 80 through FF are often called high or extended ASCII; what they mean depends on the code page. A limitation of ASCII is that it cannot represent many non-English alphabets, which is why encodings such as UTF-8 were designed as supersets of ASCII. For a longer explanation and a table of the characters and their hex and decimal codes, see ASCII on Wikipedia.
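
If you want to look at these codes for your own data, here is a minimal sketch. It is not part of the tested code in this post, the sample bytes are made up for illustration, and it assumes a single-byte ASCII session, where rank() returns a character's decimal code:

data _null_;
	/* Made-up sample: tab, A, space, tilde */
	text = '0941207E'x;
	do i = 1 to length(text);
		ch = substr(text, i, 1);
		/* decimal code of the character (ASCII session assumed) */
		code = rank(ch);
		/* write the position, the decimal code, and the same byte in hex to the log */
		put i= code= ch= $hex2.;
	end;
run;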

Filtering low and high ASCII

Say you don’t want the characters before hex 20 (the space) or after hex 7E (the tilde). Here is how to remove both ranges in SAS with Perl regular expressions:

data ascii;
	/* This string is encoded in hex.  It has low ASCII, ABC, a space, and 012 */
	in =  '01020341424320303132'x;

	/* strip out lower ASCII (everything before the space, hex 00 through 1F) */
	out = prxchange('s/[\x00-\x1F]//', -1, in);
	/* strip out higher ASCII (DEL and everything above it) */
	out = prxchange('s/[\x7F-\xFF]//', -1, out);
run;
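
To check that the bytes are really gone, it can help to print the values in hex. Here is a hedged sketch, not part of the tested code in this post: it folds the two ranges into one negated character class (an alternative to the two calls above) and writes both values to the log with the $hex format.

data _null_;
	in  = '01020341424320303132'x;
	/* one pass: delete every byte outside hex 20 through 7E */
	out = prxchange('s/[^\x20-\x7E]//', -1, in);
	/* show the first 10 bytes of each value as hex pairs */
	put in= $hex20. out= $hex20.;
run;

In the log, the 01, 02, and 03 bytes should no longer appear in out. Any trailing 20 values are just the blank padding SAS adds to character variables.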

Keeping line breaks

DOS, UNIX, and classic Mac OS each encode a line break differently (CR+LF, LF, and CR respectively), but in all cases the line break is some combination of hex 0A (new line or line feed) and 0D (carriage return). Here is how to filter out the low and high ASCII while keeping the line breaks, using a single regular expression.

data crlf;
	/* This string is encoded in hex.  It has low ASCII, ABC, line break, and 012 */
	in =  '0102034142430D0A303132'x;

	/* strip out non-middle ASCII but keep line break */
	out = prxchange('s/[^\x20-\x7E\x0A\x0D]//', -1, in);
run;

Notice that SAS displays both the line feed and carriage return with the same display character: a square. These are non-printable characters in this context, but in some systems, they will print as a line break.

[Figure: How SAS 9.2 prints CRLF on Windows XP]

Replacing characters

Instead of removing the characters, you can replace them with a marker, say an X.

data replace;
	in =  '0102034142430D0A303132'x;
	out = prxchange('s/[^\x20-\x7E\x0A\x0D]/X/', -1, in);
run;
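
If you only want to know whether a value contains such characters, without changing it, the same character class can be used with prxmatch(). This is a minimal sketch rather than tested code from this post; the data set name flagged and the variables first_bad and has_bad are made up for illustration.

data flagged;
	in = '0102034142430D0A303132'x;
	/* prxmatch returns the position of the first match, or 0 if there is none */
	first_bad = prxmatch('/[^\x20-\x7E\x0A\x0D]/', in);
	/* 1 if any unwanted character was found, otherwise 0 */
	has_bad = (first_bad > 0);
	put first_bad= has_bad=;
run;

For this sample the log should show first_bad=1 has_bad=1, because the very first byte (hex 01) is outside the allowed set.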

This code was tested with SAS 9.2 on Windows XP, but it should work in most SAS environments.


3 thoughts on “Strip Non-Printable ASCII Characters (SAS)”

    • Actually I started writing this post after someone at my organization asked for some help with code she was doing with SAS compress(). I chose Perl regular expressions because of the power and flexibility, because I’m more familiar with regexps, and because in our case, CPU speed is not a big concern.

  1. This was exactly what I needed for a check that I was designing! Instead of removing the characters, I wanted to identify them and compare them back to my original dataset. Since you were using prxchange, it was very simple to modify the code to replace the characters with something else (XX in my case).

    %macro nonascii(nds);

    *Gather character variable names;
    proc contents data=&nds. noprint out=vars (where=(type=2) keep=name type);
    run;

    *Compress macro for removing non-ASCII characters (I did not want line feeds or breaks);
    %macro rmcntl(var);
    &var.=compress(prxchange('s/[^\x20-\x7E\x0A\x0D]/XX/', -1, &var.),,'c');
    %mend rmcntl;

    *Loop macro: apply rmcntl to every character variable;
    %macro dataset;
    %do j=1 %to %eval(&cnt.);
    %let cds=%scan(&vars.,&j);
    %rmcntl(&cds.)
    %end;
    %mend dataset;

    *Set macro variables for character variables and counts;
    proc sql noprint;
    select name into: vars separated by ' ' from vars;
    select count("&vars.",' ')+1 as cnt into: cnt from vars;
    quit;

    *Run macro through dataset;
    data non_&nds.;
    set &nds.;
    %dataset
    run;

    *Compare results back to original;
    proc compare base=&nds. compare=non_&nds. outbase outcomp outnoequal noprint out=non_diff;
    run;

    %mend nonascii;
