Avoid truncating characters in PROC IMPORT csv

Watch out for this common pitfall when importing .CSV files into SAS using the PROC IMPORT procedure. If your .CSV files always have character strings only with consistent widths (such as ISBN book codes, ethernet MAC addresses, UPC barcodes, Social Security Numbers) or numeric values only, then you may stop reading now. However, if your .CSV has character strings with varying lengths (such as names, street addresses, and email addresses), then read carefully. I will explain what happens, why, and two ways to fix it. I’ll also show how to programmatically modify the SAS registry using SAS code. (Of course you’ll also see how to simply import a .csv file into SAS.)

What happens

This Base SAS code demonstrates the issue:

/* create a temporary file */
filename f1 temp;

/* write test data to the csv file */
data _null_;
	file f1;
	put "id,address";
	put "1,1@example.com";
	put "12,12@example.com";
	put "123,123@example.com";
	put "1234,1234@example.com";
	put "12345,12345@example.com";
	put "123456,123456@example.com";
	put "1234567,1234567@example.com";
run;


/* read the csv file but only use 5 rows for guessing the width */
proc import 
	datafile=f1
	dbms=csv
	out=test;
	guessingrows=5; /* if omitted, the default is 20 */
run;

Examine the imported data set and notice all the numeric variables are fine but some of the character variables are truncated:

It bugs me SAS set the default (20) so low, and also there is no warning when truncation happens (even though SAS is good about printing notes, warnings, and errors for all sorts of other cases).

Why it happens

The PROC IMPORT is a convenience procedure to dynamically creates SAS code to read the .CSV file: see this code by checking the SAS log. You could write the code yourself, but it’s a hassle when working with many different .CSV layouts. To avoid wasting disk space and processing time, PROC IMPORT creates character variables with the minimum length. (Read “Make SAS much faster with data set compression” for information about compressing data sets with wide character variables with varying lengths.)

However, by default PROC IMPORT scans only the first 20 rows to get a sense of the maximum width of the variables, and later when reading the file, the width cannot be adjusted. If the 21st row (or any later) contains a character string wider than what is seen in the first 20, it will be truncated (shortened to fit).

Compare this behavior to the function read.csv2 in the R language: R’s equivalent of SAS data sets (which R calls data frames) don’t define character widths, so truncation never happens. (It’s not simple to say one language is better than other. R doesn’t have this issue and is fast because it stores all data in RAM, but on the other hand it struggles with large data sets, which SAS always stores on the hard drive.)

Solution #1: Modify PROC IMPORT statement

Each time you call PROC IMPORT, you can specify the number of rows like this:

proc import 
	datafile=f1
	dbms=csv
	out=test;
	guessingrows=32767; /* this is the maximum for Base SAS 9.2 */
run;

How may rows is enough? I suggest always using the maximum, which 32,767. There is a performance penalty for scanning the file twice (and disk I/O is slow for computers), but that’s much better than accidentally truncating data. If your file is short, specifying a larger value doesn’t hurt (and the performance penalty is small). If you are sure (for example) 1000 rows is enough for your .csv file, are you sure it won’t change tomorrow with a new .csv file? Don’t gamble with corrupt data.

If your file is larger than 32,767, consider the third solution below.

Solution #2: Modify the default value in the SAS registry

You may run this code once, and it will be remembered across sessions. You may also add it to your SAS autoexec file, so it runs each time: this is useful for sites with multiple users, so everyone has the same environment (even after re-installing or upgrading SAS). To centralize changes, the autoexec for each SAS installation at our site is only a one-line %INCLUDE to a network autoexec file.

/* Change the default guessing rows for PROC IMPORT CSV to avoid truncating data. */
/* This is potentially slower for large .CSV files, but that's better than corrupting data. */
/* The value can be overwritten in each PROC IMPORT statement (where the column widths are consistent). */

/* This creates a text file in a temporary location (such as on Windows XP 
/* C:\Documents and Settings\username\Local Settings\Temp\SAS Temporary Files). */
filename f1 temp;

data _null_;
   file f1;
   /* This path is the same as in the GUI browser found in the menu at Solutions - Accesses - Registry Editor*/
   put "[products\base\efi]"; 
   put ' "GuessingRows"= int:32767';
run;

/* This merges the temporary file into the SAS registry */
proc registry import= f1; 
run;

Solution #3: Modify the generated import code

This solution makes the most sense when you are repeatedly reading a large .CSV file with the same format.

When you run the PROC IMPORT procedure, read the log. Notice it’s simply generating (and running) code for a DATA step. The demonstration above yields this code:

33       data WORK.TEST                                    ;
34       %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
35       infile F1 delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
36          informat id best32. ;
37          informat address $17. ;
38          format id best12. ;
39          format address $17. ;
40       input
41                   id
42                   address $
43       ;
44       if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection
44 ! macro variable */
45       run;

Simply run PROC IMPORT once, copy the generated DATA step from the log to your program, remove the line numbers, and adjust the width manually. In the example above I would change address from 17 to 60.

This article was tested on Base SAS 9.1.3 and 9.2 on Windows XP SP3.

Advertisements

39 thoughts on “Avoid truncating characters in PROC IMPORT csv

  1. I have been strugling with this for a long time, you have saved me. I used to append a dummy column with my data to avoid my data being truncated.

    Thank you

    • One option is to use PROC IMPORT to read the log, a DATA step to seelct the rows you want, and PROC EXPORT to save a new log with the same name. If the log is currently being written (by say, Apache httpd server on Linux), then you need to signal it (using kilall, e.g.) to re-open the log file.

    • Make sure the new column is wide enough. Before the MERGE statement, try using a LENGTH statement with the largest of the possible widths. Or is that a FORMAT statement?

  2. I thought I’d add one more thing that ended up being the second “key” for making my numeric variables work. I was merging two large data sets, and some numeric values kept getting significantly truncated. This did not occur on all computers that ran the same program. Two things fixed it:

    1) guessingrows from above, and
    2) DATA sheetname;
    SET sheetname;
    newname=INPUT(oldname,BEST32.);
    RUN;

    That “BEST32.” made my long proportions (i.e., 0.00246787) stop becoming integers (i.e., 0). It was not enough just to specify digits [i.e., newname=INPUT(oldname,10.12.);]. Hopefully this helps someone out there!

  3. Pingback: Avoid data truncation in SAS when importing CSV files | Super Nerdy Cool - My (technical) adventures through the professional world…

  4. Does this keep it from making incorrect assumptions about the format of the data? I keep getting the comment that there is invalid data based on the assumption that that variable is numeric. How can I read in all data as character, so that it doesn’t make such mistakes?

    • Sometimes guessingrows can help, but in cases where it does not, use PROC IMPORT (which will read the CSV file wrong), from the SAS log copy the SAS code (which is a DATA step) that PROC IMPORT generates, paste the SAS code into your program, and then edit the SAS code as needed. In this process, PROC IMPORT gives you a starting point, so you don’t have to write the whole DATA step from scratch.

      Just this morning I had to do this because SAS was treating a numeric as a character, and I wanted it as a numeric. Also SAS was messing up the variable (column) names.

      • I am now faced with a critical problem. I have a variable in the csv file that is witch 49 but has only numbers in it. SAS is assuming that it is supposed to be numeric data, but then stores it in scientific notation. Normally to change numbers to characters I would use substr() , but if I try this here, it will give me something like 1.0203010101011110E48. I cannot use this. I need a character variable with the value of “000020301010101111010020005000100005000 instead of a numeric of character value of “1.0203010101011110E48.”

        Is there any way I can specify to read in specific or all variables as character?

    • Though it did not read them in as text, this seemed to work well enough to correctly identify character variables, without making the assumption that it is numeric. Thanks so much!

  5. Hi, I have a doubt. I am importing 20 .csv’s one by one. But while importing, a particular column is getting truncated in all of them(string length same throughout the column). What do I do?

    • tallibilli, after running PROC IMPORT, check the SAS log for the format it is using for that column. If the column is too short, copy the DATA step from the SAS log into your program as a template to replace PROC IMPORT, and then modify the DATA step with a wider format.

  6. Pingback: Avoid truncating characters in PROC IMPORT csv - 数据帮

  7. This was extremely helpful for me. Not only did you solve my problem, but you gave me a clear understanding of why the problem was there. You’re awesome, thanks!

  8. Please help! I am having trouble importing a file with mostly character columns. I am using csv, so that I can using the guessingrows row statement, since we can’t using guessingrows for importing excel files. Though there are long comments, that are being split across subsequent columns. It looks like it may be due to commas in the text. I am at a loss. If I import the file in an xlsx format, then the text gets truncated.

    Any insight you can share would be much appreciated.

    • There are a few ways to deal with it. One option: open the file in Excel and remove all the commas using search and replace, or if you do not need the column with the long comment, delete the whole column.

      • Thanks for your super quick reply. I tried removing the commas, and now upon import, some of my columns are shifting. For example, my id column and a few other columns are completely gone for at least four rows in the first screen.

  9. Also, I am trying to import all the open ended questions into SAS, so unfortunately we are not able to delete the columns. I would like to bring the file into SAS, because I need to delete duplicate records, and the code to delete the duplicate records is in SAS. I’d rather not do it manually in excel, to eliminate human error and save time. Though, at this point, it may have been faster, to remove the rows in excel manually. Any other suggestions?

    Thank you so much for your help!

    • Rebecca, two things that PROC IMPORT struggles with when importing CSV are values that have commas or carriage returns/line breaks, and both of these are common in open-ended text fields like in surveys. Try also deleting any the carriage returns: do a Google search for something like “excel remove carriage returns” (without the quotation marks). Also double check for any commas you may have missed.

      If CSV still does not work, try the XLSX again. You said it was truncating, so which method did you use exactly? Here is how I import XLSX

      libname myxls
      odbc
      required= ”
      Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};
      Dbq=c:\folder\file.xlsx;

      ;
      proc sql;
      create table get_data as
      select *
      from myxls.’Sheet1$’n;
      quit;
      libname myxls;

      • Thanks! The carriage returns seemed to be the culprits. Though, now I am having trouble getting a few comments to show their full content in questions q25a and q35. The individual files show the full comment, though not when I set the files on top of each other. I am using this code –

        data text;
        length
        token $ 1110
        q2_comment_ $ 1110
        q6_comment_ $ 1110
        q10_comment_ $ 1110
        q17other $ 1110
        q23a $ 1110
        q24a $ 1110
        q25a $ 30000
        q26_comment_ $ 1110
        a28a_other_ $ 1110
        q29a_other_ $ 1110
        q31_other_ $ 1110
        q32_other_ $ 1110
        q33_1_ $ 1110
        q33_2_ $ 1110
        q35 $ 30000;
        set son_text som_text sph_text sod_text cop_text pa_text;
        run;

        Thoughts?

      • I found a work around. It seems as though SAS uses the length of the variables of the first file that is listed in the ‘set’ statement. So I switched the order of the files. Fortunately, the questions that were getting hung up were in the same file, so that solution worked. Thanks again for your help today!

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Google+ photo

You are commenting using your Google+ account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s