Watch out for this common pitfall when importing .CSV files into SAS with the PROC IMPORT procedure. If your .CSV files contain only character strings with consistent widths (such as ISBN book codes, Ethernet MAC addresses, UPC barcodes, or Social Security numbers) or only numeric values, then you may stop reading now. However, if your .CSV has character strings with varying lengths (such as names, street addresses, and email addresses), then read carefully. I will explain what happens, why, and three ways to fix it. I’ll also show how to programmatically modify the SAS registry using SAS code. (Of course you’ll also see how to simply import a .csv file into SAS.)
What happens
This Base SAS code demonstrates the issue:
/* create a temporary file */
filename f1 temp;

/* write test data to the csv file */
data _null_;
    file f1;
    put "id,address";
    put "1,1@example.com";
    put "12,12@example.com";
    put "123,123@example.com";
    put "1234,1234@example.com";
    put "12345,12345@example.com";
    put "123456,123456@example.com";
    put "1234567,1234567@example.com";
run;

/* read the csv file but only use 5 rows for guessing the width */
proc import datafile=f1 dbms=csv out=test;
    guessingrows=5; /* if omitted, the default is 20 */
run;
Examine the imported data set and notice all the numeric variables are fine but some of the character variables are truncated:
It bugs me that SAS set the default (20) so low, and that there is no warning when truncation happens (even though SAS is good about printing notes, warnings, and errors for all sorts of other cases).
Why it happens
PROC IMPORT is a convenience procedure that dynamically generates SAS code to read the .CSV file; you can see this generated code by checking the SAS log. You could write the code yourself, but that’s a hassle when working with many different .CSV layouts. To avoid wasting disk space and processing time, PROC IMPORT creates character variables with the minimum length. (Read “Make SAS much faster with data set compression” for information about compressing data sets with wide character variables of varying lengths.)
However, by default PROC IMPORT scans only the first 20 rows to estimate the maximum width of each variable, and later, when reading the file, the width cannot be adjusted. If the 21st row (or any later row) contains a character string wider than anything seen in the first 20, it will be truncated (shortened to fit).
Compare this behavior to the read.csv2 function in the R language: R’s equivalent of SAS data sets (which R calls data frames) doesn’t define character widths, so truncation never happens. (It’s not simple to say one language is better than the other. R avoids this issue and is fast because it stores all data in RAM, but on the other hand it struggles with large data sets, which SAS handles by always storing data sets on disk.)
Solution #1: Modify PROC IMPORT statement
Each time you call PROC IMPORT, you can specify the number of rows like this:
proc import datafile=f1 dbms=csv out=test;
    guessingrows=32767; /* this is the maximum for Base SAS 9.2 */
run;
How many rows is enough? I suggest always using the maximum, which is 32,767. There is a performance penalty for scanning the file twice (and disk I/O is slow), but that’s much better than accidentally truncating data. If your file is short, specifying a larger value doesn’t hurt (and the performance penalty is small). Even if you are sure that (for example) 1,000 rows is enough for today’s .csv file, are you sure that won’t change tomorrow with a new .csv file? Don’t gamble with corrupting data.
If your file has more than 32,767 rows, consider the third solution below.
Solution #2: Modify the default value in the SAS registry
You may run this code once, and it will be remembered across sessions. You may also add it to your SAS autoexec file, so it runs each time: this is useful for sites with multiple users, so everyone has the same environment (even after re-installing or upgrading SAS). To centralize changes, the autoexec for each SAS installation at our site is only a one-line %INCLUDE to a network autoexec file.
/* Change the default guessing rows for PROC IMPORT CSV to avoid truncating data. */
/* This is potentially slower for large .CSV files, but that's better than corrupting data. */
/* The value can be overridden in each PROC IMPORT statement (where the column widths are consistent). */

/* This creates a text file in a temporary location (such as, on Windows XP,
   C:\Documents and Settings\username\Local Settings\Temp\SAS Temporary Files). */
filename f1 temp;

data _null_;
    file f1;
    /* This path is the same as in the GUI browser found in the menu at
       Solutions - Accessories - Registry Editor */
    put "[products\base\efi]";
    put ' "GuessingRows"= int:32767';
run;

/* This merges the temporary file into the SAS registry */
proc registry import=f1;
run;
Solution #3: Modify the generated import code
This solution makes the most sense when you are repeatedly reading a large .CSV file with the same format.
When you run the PROC IMPORT procedure, read the log. Notice it’s simply generating (and running) code for a DATA step. The demonstration above yields this code:
33   data WORK.TEST ;
34   %let _EFIERR_ = 0; /* set the ERROR detection macro variable */
35   infile F1 delimiter = ',' MISSOVER DSD lrecl=32767 firstobs=2 ;
36   informat id best32. ;
37   informat address $17. ;
38   format id best12. ;
39   format address $17. ;
40   input
41   id
42   address $
43   ;
44   if _ERROR_ then call symputx('_EFIERR_',1); /* set ERROR detection
44 ! macro variable */
45   run;
Simply run PROC IMPORT once, copy the generated DATA step from the log into your program, remove the log’s line numbers, and adjust the widths manually. In the example above I would change address from $17. to $60.
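For this example, the hand-edited DATA step (log line numbers stripped, error-detection macro lines dropped for brevity) might look like this sketch; the only substantive change is widening address from $17. to $60.:

```sas
/* DATA step copied from the PROC IMPORT log, then edited by hand */
data work.test;
    infile f1 delimiter=',' missover dsd lrecl=32767 firstobs=2;
    informat id best32.;
    informat address $60.;   /* widened from $17. to avoid truncation */
    format id best12.;
    format address $60.;
    input
        id
        address $
    ;
run;
```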
This article was tested on Base SAS 9.1.3 and 9.2 on Windows XP SP3.
A life saver. thanks for the tip.
I’m trying to import an Excel spreadsheet and this solution gives me errors. Is there another solution for importing from Excel? I’m using SAS 9, btw.
It depends on the error. What exactly is the error?
I have been struggling with this for a long time; you have saved me. I used to append a dummy column to my data to avoid my data being truncated.
Thank you
But how do I delete the rows from the log in an automated manner if I have 490 rows?
One option is to use PROC IMPORT to read the log, a DATA step to select the rows you want, and PROC EXPORT to save a new log with the same name. If the log is currently being written (by, say, an Apache httpd server on Linux), then you need to signal the writer (using killall, for example) to re-open the log file.
really good stuff
How to avoid truncating characters when merging?
Make sure the new column is wide enough. Before the MERGE statement, try using a LENGTH statement with the largest of the possible widths. (It is the LENGTH statement, not the FORMAT statement, that controls how many characters a variable can store.)
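As a minimal sketch (the data set and variable names here are hypothetical), the LENGTH statement goes before MERGE so the merged variable is declared with the widest length up front:

```sas
/* Hypothetical data sets "a" and "b" share the key "id" and a
   character variable "address" with different lengths. */
data combined;
    length address $ 60;   /* the largest of the possible widths */
    merge a b;
    by id;
run;
```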
I thought I’d add one more thing that ended up being the second “key” for making my numeric variables work. I was merging two large data sets, and some numeric values kept getting significantly truncated. This did not occur on all computers that ran the same program. Two things fixed it:
1) guessingrows from above, and
2) DATA sheetname;
SET sheetname;
newname=INPUT(oldname,BEST32.);
RUN;
That “BEST32.” made my long proportions (e.g., 0.00246787) stop becoming integers (e.g., 0). It was not enough just to specify digits [e.g., newname=INPUT(oldname,10.12.);]. Hopefully this helps someone out there!
Thank you so much!! I’d been trying to figure this out all day.
thank you sir 🙂
This was really helpful for me. Many thanks.
Does this keep it from making incorrect assumptions about the format of the data? I keep getting the comment that there is invalid data based on the assumption that that variable is numeric. How can I read in all data as character, so that it doesn’t make such mistakes?
Sometimes guessingrows can help, but in cases where it does not, use PROC IMPORT (which will read the CSV file wrong), from the SAS log copy the SAS code (which is a DATA step) that PROC IMPORT generates, paste the SAS code into your program, and then edit the SAS code as needed. In this process, PROC IMPORT gives you a starting point, so you don’t have to write the whole DATA step from scratch.
Just this morning I had to do this because SAS was treating a numeric as a character, and I wanted it as a numeric. Also SAS was messing up the variable (column) names.
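A sketch of such an edit, forcing a column to be read as character (the variable name acct is hypothetical; $49. matches the width-49 all-digit column discussed in the comments below):

```sas
/* Edited DATA step: read the all-digit column as character, not numeric,
   so leading zeros survive and no scientific notation appears. */
data work.test;
    infile f1 delimiter=',' missover dsd lrecl=32767 firstobs=2;
    informat acct $49.;   /* was a numeric informat such as best32. */
    format acct $49.;
    input acct $;
run;
```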
I am now faced with a critical problem. I have a variable in the csv file that is width 49 but has only numbers in it. SAS is assuming it is supposed to be numeric data, but then stores it in scientific notation. Normally to change numbers to characters I would use substr(), but if I try this here, it gives me something like 1.0203010101011110E48. I cannot use this. I need a character variable with the value of “000020301010101111010020005000100005000” instead of a numeric or character value of “1.0203010101011110E48”.
Is there any way I can specify to read in specific or all variables as character?
Though it did not read them in as text, this seemed to work well enough to correctly identify character variables, without making the assumption that it is numeric. Thanks so much!
Thanks a lot!
You rock! Not only did you provide a solution, but it is written in easy to understand English. Thank you 🙂
Hi, I have a question. I am importing 20 .csv’s one by one. But while importing, a particular column is getting truncated in all of them (the string length is the same throughout the column). What do I do?
tallibilli, after running PROC IMPORT, check the SAS log for the format it is using for that column. If the column is too short, copy the DATA step from the SAS log into your program as a template to replace PROC IMPORT, and then modify the DATA step with a wider format.
Thank you so much. guessingrows= did the trick for my dataset.
This is 5 years old and still helping people. Thank you. “guessingrows=” fixed my problem as well.
Helped me. Somehow not the obvious thing to solve. Thank you so much!
Helped me to discover that now in SAS 9.4 (maybe only on 64-bit systems?) the max is higher.
Or you can just set it to GUESSINGROWS=MAX;
http://support.sas.com/documentation/cdl/en/proc/68954/HTML/default/viewer.htm#p13kvtl8ezj13in17i6m99jypcwi.htm
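As the comments above note, newer SAS releases accept GUESSINGROWS=MAX, which scans every row of the file instead of a fixed count. A sketch using SAS 9.4 syntax:

```sas
/* SAS 9.4: scan all rows when guessing widths; slower, but no guessing */
proc import datafile=f1 dbms=csv out=test replace;
    guessingrows=max;
run;
```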
Thanks for the help.
Worked for my trouble of truncation. Thanks a lot !
Thanks for the tip. It helped me. I was about to write a DATA step instead. Thanks to guessingrows=!
This was extremely helpful for me. Not only did you solve my problem, but you gave me a clear understanding of why the problem was there. You’re awesome, thanks!
Worked for me. Thanks so much!
Thank you very much for the tip! Really saved my life.
Please help! I am having trouble importing a file with mostly character columns. I am using csv so that I can use the guessingrows statement, since we can’t use guessingrows when importing Excel files. However, there are long comments that are being split across subsequent columns. It looks like it may be due to commas in the text. I am at a loss. If I import the file in xlsx format, then the text gets truncated.
Any insight you can share would be much appreciated.
There are a few ways to deal with it. One option: open the file in Excel and remove all the commas using search and replace, or if you do not need the column with the long comment, delete the whole column.
Thanks for your super quick reply. I tried removing the commas, and now upon import, some of my columns are shifting. For example, my id column and a few other columns are completely gone for at least four rows in the first screen.
Also, I am trying to import all the open ended questions into SAS, so unfortunately we are not able to delete the columns. I would like to bring the file into SAS, because I need to delete duplicate records, and the code to delete the duplicate records is in SAS. I’d rather not do it manually in excel, to eliminate human error and save time. Though, at this point, it may have been faster, to remove the rows in excel manually. Any other suggestions?
Thank you so much for your help!
Rebecca, two things that PROC IMPORT struggles with when importing CSV are values that contain commas or carriage returns/line breaks, and both of these are common in open-ended text fields like in surveys. Try also deleting any carriage returns: do a Google search for something like “excel remove carriage returns” (without the quotation marks). Also double-check for any commas you may have missed.
If CSV still does not work, try the XLSX again. You said it was truncating, so which method did you use exactly? Here is how I import XLSX:

libname myxls
    odbc
    required="
        Driver={Microsoft Excel Driver (*.xls, *.xlsx, *.xlsm, *.xlsb)};
        Dbq=c:\folder\file.xlsx;
    "
;

proc sql;
    create table get_data as
    select *
    from myxls.'Sheet1$'n;
quit;

libname myxls;
Thanks! The carriage returns seemed to be the culprits. Though, now I am having trouble getting a few comments to show their full content in questions q25a and q35. The individual files show the full comment, though not when I set the files on top of each other. I am using this code:
data text;
length
token $ 1110
q2_comment_ $ 1110
q6_comment_ $ 1110
q10_comment_ $ 1110
q17other $ 1110
q23a $ 1110
q24a $ 1110
q25a $ 30000
q26_comment_ $ 1110
a28a_other_ $ 1110
q29a_other_ $ 1110
q31_other_ $ 1110
q32_other_ $ 1110
q33_1_ $ 1110
q33_2_ $ 1110
q35 $ 30000;
set son_text som_text sph_text sod_text cop_text pa_text;
run;
Thoughts?
I found a workaround. It seems as though SAS uses the lengths of the variables from the first file listed in the SET statement. So I switched the order of the files. Fortunately, the questions that were getting hung up were in the same file, so that solution worked. Thanks again for your help today!