This guide introduces MD5 and hash functions in general, lists common uses for hash functions, and gives advice on how to best use MD5 in SAS.
MD5 in general
Basic things to know about MD5
- MD5 is a cryptographic hash, or message digest, algorithm.
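- There are other hash functions, such as the SHA family, but MD5 is still one of the most widely used. MD5 is no longer considered safe against deliberately constructed collisions, so prefer a SHA-2 hash where security matters.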
- MD5 always takes exactly 16 bytes (128 bits) to represent a message of any size.
- MD5 is typically displayed in hexadecimal, which takes 32 characters (two per byte).
- MD5 digests any character value (or any numeric converted to character values), so any data can be digested.
- It’s possible, but very unlikely by accident, for different strings to generate the same hash value (which is called a collision).
- It is generally difficult, but not impossible, to reverse a hashed value.
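The size claims above can be checked with a short Data Step. This is a minimal sketch; the variable names are illustrative only.

```sas
data _null_;
    length digest $16 hex $32;
    digest = md5('trustno1');        /* raw binary digest, 16 bytes */
    hex    = put(digest, $hex32.);   /* same digest as 32 hexadecimal characters */
    raw_bytes = lengthc(digest);     /* storage length of the raw digest */
    hex_chars = lengthc(hex);        /* storage length of the hex form */
    put raw_bytes= hex_chars= hex=;
run;
```

The log should show 16 raw bytes and 32 hexadecimal characters. Note that $hex32. writes the digits in uppercase, so wrap the result in lowcase() if you need to compare it with a lowercase digest from another tool.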
Common uses of hash functions include:
- A space-saving way to identify a long value.
- A private way to uniquely identify a sensitive value, such as a credit card number, customer ID, or password.
- A way to check a value for accidental corruption or malicious tampering.
For example, if two organizations want to know how many customers they have in common without disclosing new email addresses to each other, they can hash their email addresses and exchange the hashed values. While this would still let each organization identify which of its own customers did business with the other organization, it wouldn’t let either organization get contact information for new customers. If a third party instead compares salted and hashed email addresses, no one would be able to identify the individual customers.
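The email comparison might look something like the following sketch. The data set names and email addresses are invented for illustration, and a real exchange would first agree on a canonical form for the addresses.

```sas
/* Hypothetical customer lists; all addresses are invented */
data org_a;
    length email $60;
    input email $;
    datalines;
alice@example.com
bob@example.com
carol@example.com
;
run;

data org_b;
    length email $60;
    input email $;
    datalines;
bob@example.com
dave@example.com
carol@example.com
;
run;

proc sql;
    /* Each organization would share only its table of digests */
    create table hashed_a as
    select put(md5(strip(lowcase(email))), $hex32.) as email_hash length=32
    from org_a;

    create table hashed_b as
    select put(md5(strip(lowcase(email))), $hex32.) as email_hash length=32
    from org_b;

    /* Count the digests that appear in both lists */
    select count(*) as shared_customers
    from hashed_a as a
    inner join hashed_b as b
        on a.email_hash = b.email_hash;
quit;
```

With these invented lists, the final query should report two shared customers.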
When used to hash a password, the hash should be salted to maintain secrecy. Without a salt, many passwords like trustno1 can be easily reversed. Try searching for its hash 5fcfd41e547a12215b173ff47fdd3739.
To salt a password, concatenate another value to it before hashing. Ideally each password would have its own salt.
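As a rough sketch of salting in a Data Step, the salt can be concatenated with the password using cats() before hashing. The salt value here is a made-up constant; in practice it would be a random value generated per password and stored alongside the hash.

```sas
data _null_;
    length password $32 salt $16 salted_hash $32;
    password = 'trustno1';
    salt     = 'x9Qz';   /* hypothetical salt, for illustration only */
    /* cats() removes leading and trailing blanks from each argument, then concatenates */
    salted_hash = put(md5(cats(salt, password)), $hex32.);
    put salted_hash=;
run;
```

Because the salt changes the input, the resulting digest no longer matches the well-known unsalted digest of the same password.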
When hashing “pseudo-keys” like email addresses or postal addresses to use as unique identifiers, be sure to first convert them to a canonical form. (See also Email address normalization in SAS.)
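A minimal sketch of hashing a canonical form follows, assuming that stripping surrounding blanks and lowercasing is enough. Real email normalization may need more rules than this, and the addresses are invented.

```sas
data hashed_emails;
    length raw_email $60 clean_email $60 email_key $16;
    format email_key $hex32.;
    input raw_email $char60.;                  /* $char keeps leading blanks as typed */
    /* Assumed canonical form: no surrounding blanks, all lowercase */
    clean_email = lowcase(strip(raw_email));
    email_key   = md5(strip(clean_email));     /* strip again to drop the padding blanks */
    datalines;
  Alice@Example.com
alice@example.com
ALICE@EXAMPLE.COM
;
run;

proc print data=hashed_emails;
run;
```

All three rows should end up with the same email_key, because they reduce to the same canonical address.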
MD5 in SAS
When merging in a Data Step, joining in PROC SQL, or sorting in PROC SORT, a hashed value of a long key may improve performance. SAS has additional uses of hash functions beyond the scope of this document.
Remember that SAS pads character values with trailing blanks up to the variable’s length, and MD5 digests those blanks too, so it is generally best to strip them off, as in md5(strip(value)).
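The effect of padding can be seen by hashing the same text stored in two variables of different lengths. This is a small sketch with illustrative variable names.

```sas
data _null_;
    length name8 $8 name20 $20;
    name8  = 'Alfred';
    name20 = 'Alfred';
    /* Without strip(), the padding blanks are part of the hashed value */
    padded_match   = (md5(name8) = md5(name20));
    /* With strip(), only the six characters of the name are hashed */
    stripped_match = (md5(strip(name8)) = md5(strip(name20)));
    put padded_match= stripped_match=;
run;
```

The log should show padded_match=0 and stripped_match=1: the unstripped values differ only in their trailing blanks, yet produce different digests.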
SAS 9.3 doesn’t include a built-in SHA function (SAS 9.4 adds one, SHA256), but you can still calculate an SHA hash in SAS 9.3 using Java and some patience.
MD5 works in Data Steps. Format the value using $hex32. for the typical display format used for MD5.
data md5_data;
    set sashelp.class;
    length key $16;        /* an MD5 digest is always 16 bytes */
    format key $hex32.;    /* display as 32 hexadecimal characters */
    key = md5(strip(name));
    keep name key;
run;
In PROC SQL, by default SAS will allocate 200 bytes for the variable, which wastes 184 bytes (92%), so especially for tables with many rows, it’s worth explicitly shortening it.
proc sql;
    create table md5_sql as
    select name,
           md5(strip(name)) as key format=$hex32. length=16
    from sashelp.class;
quit;
Tested with SAS 9.3 on Windows 7.