Avoid truncating characters in PROC IMPORT csv

heuristicandrew / November 18, 2010

Watch out for this common pitfall when importing .CSV files into SAS using the PROC IMPORT procedure. If your .CSV files always have character strings only with consistent widths (such as ISBN book codes, ethernet MAC addresses, UPC barcodes, Social Security Numbers) or numeric values only, then you may stop reading now. However, if your .CSV has character strings with varying lengths (such as names, street addresses, and email addresses), then read carefully. I will explain what happens, why, and two ways to fix it. I’ll also show how to programmatically modify the SAS registry using SAS code. (Of course you’ll also see how to simply import a .csv file into SAS.)

What happens

This Base SAS code demonstrates the issue:

/* create a temporary file */
filename f1 temp;

/* write test data to the csv file */
data _null_;
	file f1;
	put "id,address";
	put "1,1@example.com";
	put "12,12@example.com";
	put "123,123@example.com";
	put "1234,1234@example.com";
	put "12345,12345@example.com";
	put "123456,123456@example.com";
	put "1234567,1234567@example.com";
run;


/* read the csv file but only use 5 rows for guessing the width */
proc import 
	datafile=f1
	dbms=csv
	out=test;
	guessingrows=5; /* if omitted, the default is 20 */
run;

Examine the imported data set and notice all the numeric variables are fine but some of the character variables are truncated:

It bugs me SAS set the default (20) so low, and also there is no warning when truncation happens (even though SAS is good about printing notes, warnings, and errors for all sorts of other cases).

Why it happens

The PROC IMPORT is a convenience procedure to dynamically creates SAS code to read the .CSV file: see this code by checking the SAS log. You could write the code yourself, but it’s a hassle when working with many different .CSV layouts. To avoid wasting disk space and processing time, PROC IMPORT creates character variables with the minimum length. (Read “Make SAS much faster with data set compression” for information about compressing data sets with wide character variables with varying lengths.)

However, by default PROC IMPORT scans only the first 20 rows to get a sense of the maximum width of the variables, and later when reading the file, the width cannot be adjusted. If the 21st row (or any later) contains a character string wider than what is seen in the first 20, it will be truncated (shortened to fit).

Compare this behavior to the function read.csv2 in the R language: R’s equivalent of SAS data sets (which R calls data frames) don’t define character widths, so truncation never happens. (It’s not simple to say one language is better than other. R doesn’t have this issue and is fast because it stores all data in RAM, but on the other hand it struggles with large data sets, which SAS always stores on the hard drive.)

Solution #1: Modify PROC IMPORT statement

Each time you call PROC IMPORT, you can specify the number of rows like this:

proc import 
	datafile=f1
	dbms=csv
	out=test;
	guessingrows=32767; /* this is the maximum for Base SAS 9.2 */
run;

How may rows is enough? I suggest always using the maximum, which 32,767. There is a performance penalty for scanning the file twice (and disk I/O is slow for computers), but that’s much better than accidentally truncating data. If your file is short, specifying a larger value doesn’t hurt (and the performance penalty is small). If you are sure (for example) 1000 rows is enough for your .csv file, are you sure it won’t change tomorrow with a new .csv file? Don’t gamble with corrupt data.

If your file is larger than 32,767, consider the third solution below.

Solution #2: Modify the default value in the SAS registry

You may run this code once, and it will be remembered across sessions. You may also add it to your SAS autoexec file, so it runs each time: this is useful for sites with multiple users, so everyone has the same environment (even after re-installing or upgrading SAS). To centralize changes, the autoexec for each SAS installation at our site is only a one-line %INCLUDE to a network autoexec file.

/* Change the default guessing rows for PROC IMPORT CSV to avoid truncating data. */
/* This is potentially slower for large .CSV files, but that's better than corrupting data. */
/* The value can be overwritten in each PROC IMPORT statement (where the column widths are consistent). */

/* This creates a text file in a temporary location (such as on Windows XP 
/* C:\Documents and Settings\username\Local Settings\Temp\SAS Temporary Files). */
filename f1 temp;

data _null_;
   file f1;
   /* This path is the same as in the GUI browser found in the menu at Solutions - Accesses - Registry Editor*/
   put "[products\base\efi]"; 
   put ' "GuessingRows"= int:32767';
run;

/* This merges the temporary file into the SAS registry */
proc registry import= f1; 
run;

Solution #3: Modify the generated import code

This solution makes the most sense when you are repeatedly reading a large .CSV file with the same format.

When you run the PROC IMPORT procedure, read the log. Notice it’s simply generating (and running) code for a DATA step. The demonstration above yields this code:

33       data WORK.TEST                                    ;
34       %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
35       infile F1 delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
36          informat id best32. ;
37          informat address $17. ;
38          format id best12. ;
39          format address $17. ;
40       input
41                   id
42                   address $
43       ;
44       if _ERROR_ then call symputx('_EFIERR_',1);  /* set ERROR detection
44 ! macro variable */
45       run;

Simply run PROC IMPORT once, copy the generated DATA step from the log to your program, remove the line numbers, and adjust the width manually. In the example above I would change address from 17 to 60.

This article was tested on Base SAS 9.1.3 and 9.2 on Windows XP SP3.

39 thoughts on “Avoid truncating characters in PROC IMPORT csv”

R Leung says:

December 9, 2010 at 12:32 pm

A life saver. thanks for the tip.

Reply
- calculush says:
  
  March 8, 2017 at 1:20 pm
  
  I’m trying to import an excel spreadsheet and this solution gives me errors . Is there another solution for importing from excel? I’m using SAS 9 btw
  
  Reply
  - heuristicandrew says:
    
    March 8, 2017 at 1:23 pm
    
    It depends on the error. What exactly is the error?
Riva says:

May 11, 2011 at 12:15 am

I have been strugling with this for a long time, you have saved me. I used to append a dummy column with my data to avoid my data being truncated.

Thank you

Reply
cristian says:

September 21, 2011 at 1:29 am

but how do I delete the number of rows code from log in an automated manner if I have 490 rows?

Reply
- heuristicandrew says:
  
  September 21, 2011 at 7:33 am
  
  One option is to use PROC IMPORT to read the log, a DATA step to seelct the rows you want, and PROC EXPORT to save a new log with the same name. If the log is currently being written (by say, Apache httpd server on Linux), then you need to signal it (using kilall, e.g.) to re-open the log file.
  
  Reply
Zhuoye Xu says:

February 29, 2012 at 7:12 pm

really good stuff

Reply
Betsy says:

March 20, 2012 at 12:19 pm

How to avoid truncating characters when merging?

Reply
- heuristicandrew says:
  
  March 20, 2012 at 12:22 pm
  
  Make sure the new column is wide enough. Before the MERGE statement, try using a LENGTH statement with the largest of the possible widths. Or is that a FORMAT statement?
  
  Reply
F Monroe says:

April 9, 2012 at 10:01 am

I thought I’d add one more thing that ended up being the second “key” for making my numeric variables work. I was merging two large data sets, and some numeric values kept getting significantly truncated. This did not occur on all computers that ran the same program. Two things fixed it:

1) guessingrows from above, and
2) DATA sheetname;
SET sheetname;
newname=INPUT(oldname,BEST32.);
RUN;

That “BEST32.” made my long proportions (i.e., 0.00246787) stop becoming integers (i.e., 0). It was not enough just to specify digits [i.e., newname=INPUT(oldname,10.12.);]. Hopefully this helps someone out there!

Reply
Lindsay says:

May 15, 2012 at 3:26 pm

Thank you so much!! I’d been trying to figure this out all day.

Reply
vishnu says:

May 17, 2012 at 8:50 am

thank you sir 🙂

Reply
Christopher says:

January 3, 2013 at 10:29 am

This was really helpful for me. Many thanks.

Reply
Pingback: Avoid data truncation in SAS when importing CSV files | Super Nerdy Cool - My (technical) adventures through the professional world…
Tim Wright says:

May 31, 2013 at 12:10 pm

Does this keep it from making incorrect assumptions about the format of the data? I keep getting the comment that there is invalid data based on the assumption that that variable is numeric. How can I read in all data as character, so that it doesn’t make such mistakes?

Reply
- heuristicandrew says:
  
  May 31, 2013 at 12:13 pm
  
  Sometimes guessingrows can help, but in cases where it does not, use PROC IMPORT (which will read the CSV file wrong), from the SAS log copy the SAS code (which is a DATA step) that PROC IMPORT generates, paste the SAS code into your program, and then edit the SAS code as needed. In this process, PROC IMPORT gives you a starting point, so you don’t have to write the whole DATA step from scratch.
  
  Just this morning I had to do this because SAS was treating a numeric as a character, and I wanted it as a numeric. Also SAS was messing up the variable (column) names.
  
  Reply
  - Tim Wright says:
    
    June 2, 2013 at 12:39 pm
    
    I am now faced with a critical problem. I have a variable in the csv file that is witch 49 but has only numbers in it. SAS is assuming that it is supposed to be numeric data, but then stores it in scientific notation. Normally to change numbers to characters I would use substr() , but if I try this here, it will give me something like 1.0203010101011110E48. I cannot use this. I need a character variable with the value of “000020301010101111010020005000100005000 instead of a numeric of character value of “1.0203010101011110E48.”
    
    Is there any way I can specify to read in specific or all variables as character?
- Tim Wright says:
  
  May 31, 2013 at 12:38 pm
  
  Though it did not read them in as text, this seemed to work well enough to correctly identify character variables, without making the assumption that it is numeric. Thanks so much!
  
  Reply
Jona says:

September 5, 2013 at 5:56 am

Thanks a lot!

Reply
Shawn K says:

September 10, 2013 at 8:48 am

You rock! Not only did you provide a solution, but it is written in easy to understand English. Thank you 🙂

Reply
tallibilli says:

November 11, 2013 at 11:31 pm

Hi, I have a doubt. I am importing 20 .csv’s one by one. But while importing, a particular column is getting truncated in all of them(string length same throughout the column). What do I do?

Reply
- heuristicandrew says:
  
  November 12, 2013 at 9:28 am
  
  tallibilli, after running PROC IMPORT, check the SAS log for the format it is using for that column. If the column is too short, copy the DATA step from the SAS log into your program as a template to replace PROC IMPORT, and then modify the DATA step with a wider format.
  
  Reply
Srinivas says:

November 17, 2014 at 12:12 pm

Thank you so much. guessingrows= did the trick for my dataset.

Reply
Pingback: Avoid truncating characters in PROC IMPORT csv - 数据帮
lordlnyc says:

November 13, 2015 at 9:02 am

This is 5 year old and still helping people. Thank you “guessingrows=” fixed my problem as well.

Reply
Yevhen says:

December 9, 2015 at 8:29 am

Helped me. Somehow not the obvious thing to solve. Thank you so much!

Reply
Dylan says:

May 4, 2016 at 8:15 am

Helped me to discover that now in SAS 9.4 (maybe only 64 bit systems?) the max is higher.

Or you can just set it to GUESSINGROWS=MAX;
http://support.sas.com/documentation/cdl/en/proc/68954/HTML/default/viewer.htm#p13kvtl8ezj13in17i6m99jypcwi.htm

Thanks for the help.

Reply
Shaddab says:

May 29, 2016 at 1:40 am

Worked for my trouble of truncation. Thanks a lot !

Reply
Srini says:

June 1, 2016 at 4:28 pm

Thanks for the tip. It help me.. I was about write a data step instead.. thanks to guessingrows= !!

Reply
Nicky says:

July 22, 2016 at 8:27 am

This was extremely helpful for me. Not only did you solve my problem, but you gave me a clear understanding of why the problem was there. You’re awesome, thanks!

Reply
Xinyi says:

December 27, 2016 at 12:21 pm

Worked for me. Thanks so much!

Reply
sfljnv says:

January 29, 2017 at 4:51 pm

Thank you very much for the tip! Really saved my life.

Reply
Rebecca says:

August 29, 2018 at 6:52 am

Please help! I am having trouble importing a file with mostly character columns. I am using csv, so that I can using the guessingrows row statement, since we can’t using guessingrows for importing excel files. Though there are long comments, that are being split across subsequent columns. It looks like it may be due to commas in the text. I am at a loss. If I import the file in an xlsx format, then the text gets truncated.

Any insight you can share would be much appreciated.

Reply
- heuristicandrew says:
  
  August 29, 2018 at 7:26 am
  
  There are a few ways to deal with it. One option: open the file in Excel and remove all the commas using search and replace, or if you do not need the column with the long comment, delete the whole column.
  
  Reply
  - Rebecca says:
    
    August 29, 2018 at 9:09 am
    
    Thanks for your super quick reply. I tried removing the commas, and now upon import, some of my columns are shifting. For example, my id column and a few other columns are completely gone for at least four rows in the first screen.
Rebecca says:

August 29, 2018 at 9:12 am

Also, I am trying to import all the open ended questions into SAS, so unfortunately we are not able to delete the columns. I would like to bring the file into SAS, because I need to delete duplicate records, and the code to delete the duplicate records is in SAS. I’d rather not do it manually in excel, to eliminate human error and save time. Though, at this point, it may have been faster, to remove the rows in excel manually. Any other suggestions?

Thank you so much for your help!

Reply
- heuristicandrew says:
  
  August 29, 2018 at 9:29 am
  
  Rebecca, two things that PROC IMPORT struggles with when importing CSV are values that have commas or carriage returns/line breaks, and both of these are common in open-ended text fields like in surveys. Try also deleting any the carriage returns: do a Google search for something like “excel remove carriage returns” (without the quotation marks). Also double check for any commas you may have missed.
  
  If CSV still does not work, try the XLSX again. You said it was truncating, so which method did you use exactly? Here is how I import XLSX
  
  libname myxls
  odbc
  required= ”
  Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};
  Dbq=c:\folder\file.xlsx;
  ”
  ;
  proc sql;
  create table get_data as
  select *
  from myxls.’Sheet1$’n;
  quit;
  libname myxls;
  
  Reply
  - Rebecca says:
    
    August 29, 2018 at 5:14 pm
    
    Thanks! The carriage returns seemed to be the culprits. Though, now I am having trouble getting a few comments to show their full content in questions q25a and q35. The individual files show the full comment, though not when I set the files on top of each other. I am using this code –
    
    data text;
    length
    token $ 1110
    q2_comment_ $ 1110
    q6_comment_ $ 1110
    q10_comment_ $ 1110
    q17other $ 1110
    q23a $ 1110
    q24a $ 1110
    q25a $ 30000
    q26_comment_ $ 1110
    a28a_other_ $ 1110
    q29a_other_ $ 1110
    q31_other_ $ 1110
    q32_other_ $ 1110
    q33_1_ $ 1110
    q33_2_ $ 1110
    q35 $ 30000;
    set son_text som_text sph_text sod_text cop_text pa_text;
    run;
    
    Thoughts?
  - Rebecca says:
    
    August 29, 2018 at 7:05 pm
    
    I found a work around. It seems as though SAS uses the length of the variables of the first file that is listed in the ‘set’ statement. So I switched the order of the files. Fortunately, the questions that were getting hung up were in the same file, so that solution worked. Thanks again for your help today!