Watch out for this common pitfall when importing .CSV files into SAS with the PROC IMPORT procedure. If your .CSV files contain only character strings with consistent widths (such as ISBN book codes, Ethernet MAC addresses, UPC barcodes, or Social Security Numbers) or only numeric values, then you may stop reading now. However, if your .CSV has character strings with varying lengths (such as names, street addresses, and email addresses), then read carefully. I will explain what happens, why, and three ways to fix it. I’ll also show how to programmatically modify the SAS registry using SAS code. (Of course, you’ll also see how to simply import a .csv file into SAS.)
This Base SAS code demonstrates the issue:
/* create a temporary file */
filename f1 temp;

/* write test data to the csv file */
data _null_;
  file f1;
  put "id,address";
  put "1,email@example.com";
  put "12,firstname.lastname@example.org";
  put "123,email@example.com";
  put "1234,firstname.lastname@example.org";
  put "12345,email@example.com";
  put "123456,firstname.lastname@example.org";
  put "1234567,email@example.com";
run;

/* read the csv file but only use 5 rows for guessing the width */
proc import datafile=f1 dbms=csv out=test;
  guessingrows=5; /* if omitted, the default is 20 */
run;
Examine the imported data set and notice all the numeric variables are fine but some of the character variables are truncated:
It bugs me that SAS set the default (20) so low, and that there is no warning when truncation happens (even though SAS is good about printing notes, warnings, and errors for all sorts of other cases).
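Since no warning appears, you can check for likely truncation yourself. This is a minimal sketch, not a guarantee: it counts character values that exactly fill the variable's declared length, which is a strong hint (though not proof) that something was cut off. It assumes the data set TEST and variable ADDRESS from the demonstration above.

/* Flag values that exactly fill the declared width of ADDRESS. */
/* LENGTH ignores trailing blanks; VLENGTH is the declared length. */
data _null_;
  set test end=eof;
  if length(address) = vlength(address) then hits + 1;
  if eof then put "Values filling the declared width: " hits;
run;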
Why it happens
PROC IMPORT is a convenience procedure that dynamically generates SAS code to read the .CSV file: you can see this code in the SAS log. You could write the code yourself, but that’s a hassle when working with many different .CSV layouts. To avoid wasting disk space and processing time, PROC IMPORT creates character variables with the minimum length. (Read “Make SAS much faster with data set compression” for information about compressing data sets with wide character variables of varying lengths.)
However, by default PROC IMPORT scans only the first 20 rows to estimate the maximum width of each variable, and when it later reads the file, the widths cannot be adjusted. If the 21st row (or any later row) contains a character string wider than any seen in the first 20, it will be truncated (shortened to fit).
Compare this behavior to the read.csv2 function in the R language: R’s equivalent of SAS data sets (which R calls data frames) doesn’t define character widths, so truncation never happens. (It’s not simple to say one language is better than the other. R doesn’t have this issue and is fast because it stores all data in RAM, but on the other hand it struggles with large data sets, which SAS always stores on the hard drive.)
Solution #1: Modify PROC IMPORT statement
Each time you call PROC IMPORT, you can specify the number of rows like this:
proc import datafile=f1 dbms=csv out=test;
  guessingrows=32767; /* this is the maximum for Base SAS 9.2 */
run;
How many rows are enough? I suggest always using the maximum, which is 32,767. There is a performance penalty for scanning the file twice (and disk I/O is slow), but that’s much better than accidentally truncating data. If your file is short, specifying a larger value doesn’t hurt (and the performance penalty is small). Even if you are sure (for example) that 1,000 rows are enough for today’s .csv file, are you sure that won’t change tomorrow with a new .csv file? Don’t gamble with corrupt data.
If your file is longer than 32,767 rows, consider the third solution below.
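If you are unsure whether your file fits under that limit, a quick sketch like this counts its lines (it assumes the fileref f1 from the demonstration above):

/* Count the lines in the .csv, including the header row. */
data _null_;
  infile f1 end=eof;
  input;
  n + 1;
  if eof then put "Lines in file (including header row): " n;
run;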
Solution #2: Modify the default value in the SAS registry
You may run this code once, and the setting will be remembered across sessions. You may also add it to your SAS autoexec file so it runs each time: this is useful for sites with multiple users, so everyone has the same environment (even after re-installing or upgrading SAS). To centralize changes, the autoexec for each SAS installation at our site is just a one-line %INCLUDE of a network autoexec file.
/* Change the default guessing rows for PROC IMPORT CSV to avoid truncating data. */
/* This is potentially slower for large .CSV files, but that's better than corrupting data. */
/* The value can be overridden in each PROC IMPORT statement (where the column widths are consistent). */

/* This creates a text file in a temporary location (such as, on Windows XP,
   C:\Documents and Settings\username\Local Settings\Temp\SAS Temporary Files). */
filename f1 temp;

data _null_;
  file f1;
  /* This path is the same as in the GUI browser found in the menu at
     Solutions - Accessories - Registry Editor */
  put "[products\base\efi]";
  put ' "GuessingRows"= int:32767';
run;

/* This merges the temporary file into the SAS registry */
proc registry import=f1;
run;
Solution #3: Modify the generated import code
This solution makes the most sense when you are repeatedly reading a large .CSV file with the same format.
When you run the PROC IMPORT procedure, read the log. Notice it’s simply generating (and running) code for a DATA step. The demonstration above yields this code:
33   data WORK.TEST    ;
34   %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
35   infile F1 delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
36      informat id best32. ;
37      informat address $17. ;
38      format id best12. ;
39      format address $17. ;
40   input
41      id
42      address $
43   ;
44   if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection
44 ! macro variable */
45   run;
Simply run PROC IMPORT once, copy the generated DATA step from the log into your program, remove the line numbers, and adjust the widths manually. In the example above, I would change address from 17 to 60.
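After those edits, the copied DATA step might look like this (60 is my own choice of a safe width for this example, not something PROC IMPORT produced):

/* Generated by PROC IMPORT, then edited: line numbers removed, */
/* width of ADDRESS widened from $17. to $60. */
data work.test;
  %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
  infile f1 delimiter=',' missover dsd lrecl=32767 firstobs=2;
  informat id best32.;
  informat address $60.; /* widened from $17. */
  format id best12.;
  format address $60.;   /* widened from $17. */
  input
    id
    address $
  ;
  if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection macro variable */
run;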
This article was tested on Base SAS 9.1.3 and 9.2 on Windows XP SP3.