Say your SAS data set, which was read from a text file, XML, or a database, contains characters that look like garbage, perhaps an odd square. These are probably control characters or other non-printable bytes that mean something to some system but are useless to you. The good news is that they are easy to filter out or replace.
Documents come in various encodings. One of the most common is ASCII, which is made up of the basic Latin alphabet (as used in English), digits, punctuation, and control characters (such as backspace and carriage return). Unlike UTF-16, ASCII fits in a single byte. Strictly speaking, ASCII defines only 128 characters (hex 00 through 7F). A byte can hold 256 values (hex 00 through FF), and the meaning of the "high" range 80 through FF depends on the extended single-byte encoding in use, such as Latin-1 or Windows-1252. A limitation of ASCII is that it cannot represent most non-English alphabets. Encodings such as UTF-8 address this while remaining a superset of ASCII. For a longer explanation and a table of the characters with their hex and decimal codes, see ASCII on Wikipedia.
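To make the relationship between characters and their codes concrete, here is a minimal sketch using the SAS functions RANK and BYTE along with the $HEXw. format. It assumes a SAS session running on an ASCII-based system (on EBCDIC systems the codes differ), and the variable names are only for illustration.

```sas
data _null_;
    /* RANK: character -> decimal code (ASCII system assumed) */
    code_space = rank(' ');   /* the space is hex 20 */
    code_tilde = rank('~');   /* the tilde is hex 7E */
    put code_space= code_tilde=;

    /* BYTE: decimal code -> character; decimal 65 is hex 41 */
    letter = byte(65);
    /* print the character as-is and again as two hex digits */
    put letter= letter= $hex2.;
run;
```

The values are written to the SAS log, which is a quick way to check what a suspicious character really is before deciding whether to filter it.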
Filtering low and high ASCII
Say you don’t want characters before hex 20 (the space) or after hex 7E (the tilde). Here is how to remove both ranges in SAS with Perl regular expressions:
data ascii;
    /* This string is encoded in hex: low ASCII (01 02 03), ABC, a space, and 012 */
    in = '01020341424320303132'x;
    /* strip out low ASCII (hex 00 through 1F, everything before the space) */
    out = prxchange('s/[\x00-\x1F]//', -1, in);
    /* strip out high ASCII (hex 7F, the delete character, through FF) */
    out = prxchange('s/[\x7F-\xFF]//', -1, out);
run;
Keeping line breaks
DOS, UNIX, and classic Mac OS each encode a line break differently, but in all cases the line break is made of hex 0A (line feed, or new line), hex 0D (carriage return), or both: DOS uses 0D 0A, UNIX uses 0A, and classic Mac OS uses 0D. Here is how to filter out the low and high ASCII while keeping line breaks, using a single regular expression.
data crlf;
    /* This string is encoded in hex: low ASCII (01 02 03), ABC, a line break (0D 0A), and 012 */
    in = '0102034142430D0A303132'x;
    /* strip out everything outside hex 20 through 7E, but keep line feed and carriage return */
    out = prxchange('s/[^\x20-\x7E\x0A\x0D]//', -1, in);
run;
Notice that SAS displays both the line feed and carriage return with the same display character: a square. These are non-printable characters in this context, but in some systems, they will print as a line break.
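Because both characters show up as the same square, it can help to look at the bytes themselves. The following is a small sketch that repeats the filter and writes the result to the log with the $HEXw. format. The width of 16 is an assumption based on the 8 bytes expected to survive the filter (ABC, the two line-break bytes, and 012).

```sas
data _null_;
    in = '0102034142430D0A303132'x;
    out = prxchange('s/[^\x20-\x7E\x0A\x0D]//', -1, in);
    /* two hex digits per byte: 16 digits cover the first 8 bytes of out */
    put out= $hex16.;
run;
```

In the log, the carriage return should appear as 0D and the line feed as 0A, so the two can be told apart even though they look identical in the data set viewer.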
Instead of removing the characters, you can replace each one with a marker, say an X.
data replace;
    in = '0102034142430D0A303132'x;
    /* replace each unwanted character with an X instead of deleting it */
    out = prxchange('s/[^\x20-\x7E\x0A\x0D]/X/', -1, in);
run;
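The examples so far work on a single hard-coded string. To apply the same filter to a real data set, one possible approach is to loop over every character variable with an array. This is only a sketch: the data set names raw and clean and the loop variable _i are hypothetical, and it assumes you want the same rule applied to all character columns.

```sas
data clean;
    set raw;   /* raw is a hypothetical input data set */
    /* _character_ picks up every character variable read from raw */
    array chars {*} _character_;
    do _i = 1 to dim(chars);
        /* remove everything except hex 20-7E, line feed, and carriage return */
        chars{_i} = prxchange('s/[^\x20-\x7E\x0A\x0D]//', -1, chars{_i});
    end;
    drop _i;
run;
```

Since each variable is overwritten in place, writing to a new data set (rather than replacing raw) keeps the original values available for comparison.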
This code was tested with SAS 9.2 on Windows XP, but it should work in most SAS environments.