How much storage space and I/O resources can be saved by shrinking the length of date and datetimes in SAS while maintaining sufficient precision? In my experience, saving I/O generally makes SAS run faster (such as in a DATA step) proportional to the I/O savings, so, for example, a 50% reduction in file size makes SAS run about 50% faster.
SAS stores dates and datetimes as numerics, exactly like integers and decimals: the only difference is the format applied. (In other words, SAS doesn’t have a unique data type for dates and datetimes.) Date are stored as days since the epoch January 1, 1960, and datetimes are stored as seconds since the same epoch.
Using the LENGTH statement, SAS can store numerics using 3 to 8 bytes, where 8 is the default. When a value exceeds the maximum integer size for the numeric storage, SAS will continue to store the value, but it loses precision.
Date
First, let’s see what storing a date with lengths 3 and 4 will allow, assuming no loss of precision is tolerable with dates:
data _null_;
length date3 3;
length date4 4;
format date3 date4 yymmdd10.;
date3 = 8192; /* the maximum integer size using 3 bytes of storage */
date4 = 2097152; /* the maximum integer size using 4 bytes of storage */
put date3= date4=;
run;
The SAS log indicates 3 bytes is enough for date 1982-06-06, and 4 bytes is enough for the date 7701-10-23. In other words, you can cut half the storage and processing time, while practically storing any date without loss of precision.
Datetime
Now to answer a similar question for datetimes, but here we assume some loss of precision is tolerable.
/* Returns an integer in the range min...max */
%macro rand_between(min, max);
&min+ceil(rand('uniform')*(&max - &min ))
%mend;
/* Create random datetimes within the years 1980 and 2050 */
data datetime_accuracy;
length dt3 3 dt4 4 dt5 5 dt8 8;
format d yymmdd10.;
do i = 1 to 100000;
day_of_year = %rand_between(365);
year = %rand_between(1980, 2050);
d = mdy(1, 1, year) + day_of_year;
h = %rand_between(0, 23);
m = %rand_between(0, 59);
s = %rand_between(0, 59);
dt8 = dhms(d, h, m, s);
dt5 = dt8;
dt4 = dt8;
dt3 = dt8;
output;
end;
run;
/* Measure errors */
data datetime_accuracy;
set datetime_accuracy;
dt3_err = dt8 - dt3;
dt4_err = dt8 - dt4;
dt5_err = dt8 - dt5;
run;
/* Summarize errors */
proc means data=datetime_accuracy min max mean;
var dt3_err dt4_err dt5_err;
run;
This reveals storing a datetime with a length of 3 bytes has a maximum error of 524,282 seconds, which is 146 days: this is much less efficient than a date.
Storing a datetime with length 4 has a maximum error of 2047 seconds, or 0.57 hours. This is tolerable in many applications, and reduces storage size by half.
Storing a datetime with length 5 has a maximum error of 7 seconds.
Notes
In some cases shrinking one variable (column) does not affect the size of the data set, but shrinking several columns always helps. It seems SAS allocates row size in chunks.
In some cases wise use of the LENGTH statement makes SAS data set compression anti-effective: it may actually increase the size of a data set. When you see a warning like “Compressing data set WORK.FOO increased size by 33.33 percent” in the log, turn off compression.
This code was tested with Base SAS 9.3 on Windows 7.