Avoid truncating characters in PROC IMPORT csv

Watch out for this common pitfall when importing .CSV files into SAS with the PROC IMPORT procedure. If your .CSV files contain only character strings with consistent widths (such as ISBN book codes, Ethernet MAC addresses, UPC barcodes, or Social Security numbers) or only numeric values, then you may stop reading now. However, if your .CSV has character strings with varying lengths (such as names, street addresses, and email addresses), then read carefully. I will explain what happens, why it happens, and three ways to fix it. I’ll also show how to programmatically modify the SAS registry using SAS code. (Of course you’ll also see how to simply import a .csv file into SAS.)

What happens

This Base SAS code demonstrates the issue:

/* create a temporary file */
filename f1 temp;

/* write test data to the csv file */
data _null_;
	file f1;
	put "id,address";
	put "1,1@example.com";
	put "12,12@example.com";
	put "123,123@example.com";
	put "1234,1234@example.com";
	put "12345,12345@example.com";
	put "123456,123456@example.com";
	put "1234567,1234567@example.com";

/* read the csv file but only use 5 rows for guessing the width */
proc import 
	guessingrows=5; /* if omitted, the default is 20 */

Examine the imported data set and notice that all the numeric values are fine, but some of the character values are truncated.
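If you want to check this programmatically rather than by browsing the data set, a quick PROC PRINT and PROC CONTENTS will do (assuming the imported data set is WORK.TEST, as in the code above):

/* print the imported rows: the longest addresses come out cut short */
proc print data=work.test;
run;

/* PROC CONTENTS shows the length PROC IMPORT assigned to the address variable */
proc contents data=work.test;
run;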

It bugs me that SAS set the default (20) so low, and that there is no warning when truncation happens (even though SAS is good about printing notes, warnings, and errors for all sorts of other cases).

Why it happens

PROC IMPORT is a convenience procedure that dynamically generates SAS code to read the .CSV file: you can see this code by checking the SAS log. You could write the code yourself, but that’s a hassle when working with many different .CSV layouts. To avoid wasting disk space and processing time, PROC IMPORT creates character variables with the minimum length it observes. (Read “Make SAS much faster with data set compression” for information about compressing data sets that have wide character variables with varying lengths.)

However, by default PROC IMPORT scans only the first 20 rows to get a sense of the maximum width of each variable, and later, when reading the file, the width cannot be adjusted. If the 21st row (or any later row) contains a character string wider than anything seen in the first 20, it is truncated (shortened to fit).
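The underlying behavior is easy to reproduce in a plain DATA step. Once a character variable’s length is fixed, longer values are silently cut to fit, with no note or warning in the log (a minimal sketch of the mechanism, unrelated to any particular .csv file):

/* the length is fixed up front, like the informat PROC IMPORT generates */
data _null_;
	length s $5;
	s = "truncate me"; /* silently stored as "trunc" -- the log shows no warning */
	put s=;
run;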

Compare this behavior to the read.csv function in the R language: R’s equivalent of a SAS data set (which R calls a data frame) doesn’t define character widths, so truncation never happens. (It’s not simple to say one language is better than the other. R doesn’t have this issue and is fast because it stores all data in RAM, but on the other hand it struggles with large data sets, which SAS always stores on disk.)

Solution #1: Modify PROC IMPORT statement

Each time you call PROC IMPORT, you can specify the number of rows to scan like this:

proc import datafile=f1 out=work.test dbms=csv replace;
	guessingrows=32767; /* this is the maximum for Base SAS 9.2 */
run;

How many rows is enough? I suggest always using the maximum, which is 32,767. There is a performance penalty for scanning the file twice (and disk I/O is slow), but that’s much better than accidentally truncating data. If your file is short, specifying a larger value doesn’t hurt (and the performance penalty is small). And if you are sure that (for example) 1,000 rows is enough for your .csv file, are you sure that won’t change tomorrow with a new .csv file? Don’t gamble with corrupting your data.

If your file is longer than 32,767 rows, consider the third solution below.

Solution #2: Modify the default value in the SAS registry

You may run this code once, and the setting will be remembered across sessions. You may also add it to your SAS autoexec file so it runs each time SAS starts: this is useful for sites with multiple users, so everyone has the same environment (even after re-installing or upgrading SAS). To centralize changes, the autoexec for each SAS installation at our site is only a one-line %INCLUDE of a network autoexec file.
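That one-line autoexec looks something like this (the network path below is only a placeholder; substitute your own shared location):

/* autoexec.sas on each workstation: pull the real settings from the network */
%include "\\fileserver\sas\autoexec_site.sas"; /* placeholder path */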

/* Change the default guessing rows for PROC IMPORT CSV to avoid truncating data. */
/* This is potentially slower for large .CSV files, but that's better than corrupting data. */
/* The value can still be overridden in each PROC IMPORT step (for example, when the column widths are consistent). */

/* This creates a text file in a temporary location (such as, on Windows XP, */
/* C:\Documents and Settings\username\Local Settings\Temp\SAS Temporary Files). */
filename f1 temp;

data _null_;
   file f1;
   /* This path is the same as in the GUI registry browser, found in the menu at Solutions - Accessories - Registry Editor. */
   put "[products\base\efi]"; 
   put ' "GuessingRows"= int:32767';

/* This merges the temporary file into the SAS registry */
proc registry import=f1;
run;
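To confirm the change took effect, you can list that registry key back to the SAS log, a quick check using the LIST and STARTAT= options of PROC REGISTRY:

/* write the key's contents to the SAS log; GuessingRows should now be 32767 */
proc registry list startat="products\base\efi";
run;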

Solution #3: Modify the generated import code

This solution makes the most sense when you are repeatedly reading a large .CSV file with the same format.

When you run the PROC IMPORT procedure, read the log. Notice it’s simply generating (and running) code for a DATA step. The demonstration above yields this code:

33       data WORK.TEST                                    ;
34       %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
35       infile F1 delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
36          informat id best32. ;
37          informat address $17. ;
38          format id best12. ;
39          format address $17. ;
40       input
41                   id
42                   address $
43       ;
44       if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection
44 ! macro variable */
45       run;

Simply run PROC IMPORT once, copy the generated DATA step from the log to your program, remove the line numbers, and adjust the width manually. In the example above I would change address from 17 to 60.
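After stripping the log line numbers (and dropping the _EFIERR_ lines, which PROC IMPORT only uses for its own error reporting), the widened version would look roughly like this; $60. is just an example width, so pick whatever is safely wider than your data:

data work.test;
	infile f1 delimiter=',' missover dsd lrecl=32767 firstobs=2;
	informat id best32.;
	informat address $60.; /* widened from the generated $17.; the informat width sets the variable's length */
	format id best12.;
	format address $60.;   /* widened from the generated $17. */
	input
		id
		address $
	;
run;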

This article was tested on Base SAS 9.1.3 and 9.2 on Windows XP SP3.

28 thoughts on “Avoid truncating characters in PROC IMPORT csv”

  1. I have been struggling with this for a long time; you have saved me. I used to append a dummy column to my data to avoid my data being truncated.

    Thank you

    • One option is to use PROC IMPORT to read the log, a DATA step to select the rows you want, and PROC EXPORT to save a new log with the same name. If the log is currently being written (by, say, the Apache httpd server on Linux), then you need to signal it (using killall, for example) to re-open the log file.

    • Make sure the new column is wide enough. Before the MERGE statement, try using a LENGTH statement with the largest of the possible widths. Or is that a FORMAT statement?

  2. I thought I’d add one more thing that ended up being the second “key” for making my numeric variables work. I was merging two large data sets, and some numeric values kept getting significantly truncated. This did not occur on all computers that ran the same program. Two things fixed it:

    1) guessingrows from above, and
    2) DATA sheetname;
    SET sheetname;
    FORMAT yourvariable BEST32.;
    RUN;

    That “BEST32.” made my long proportions (i.e., 0.00246787) stop becoming integers (i.e., 0). It was not enough just to specify digits [i.e., newname=INPUT(oldname,10.12.);]. Hopefully this helps someone out there!

  3. Does this keep SAS from making incorrect assumptions about the format of the data? I keep getting a note that there is invalid data based on the assumption that the variable is numeric. How can I read in all the data as character, so that it doesn’t make such mistakes?

    • Sometimes GUESSINGROWS can help, but when it does not, run PROC IMPORT (which will read the CSV file wrong), copy from the SAS log the SAS code (a DATA step) that PROC IMPORT generates, paste that code into your program, and then edit it as needed. In this process, PROC IMPORT gives you a starting point, so you don’t have to write the whole DATA step from scratch.

      Just this morning I had to do this because SAS was treating a numeric as a character, and I wanted it as a numeric. Also SAS was messing up the variable (column) names.

        • I am now faced with a critical problem. I have a variable in the csv file that is width 49 but has only numbers in it. SAS is assuming that it is supposed to be numeric data, but then stores it in scientific notation. Normally to change numbers to characters I would use substr(), but if I try this here, it will give me something like 1.0203010101011110E48. I cannot use this. I need a character variable with the value of “000020301010101111010020005000100005000” instead of a numeric or character value of “1.0203010101011110E48.”

        Is there any way I can specify to read in specific or all variables as character?

    • Though it did not read them in as text, this seemed to work well enough to correctly identify the character variables without assuming they were numeric. Thanks so much!

  4. Hi, I have a doubt. I am importing 20 .csv’s one by one, but while importing, a particular column is getting truncated in all of them (the string length is the same throughout the column). What do I do?

    • tallibilli, after running PROC IMPORT, check the SAS log for the format it is using for that column. If the column is too short, copy the DATA step from the SAS log into your program as a template to replace PROC IMPORT, and then modify the DATA step with a wider format.

  5. This was extremely helpful for me. Not only did you solve my problem, but you gave me a clear understanding of why the problem was there. You’re awesome, thanks!
