I have an ETL job that updates 10,000 rows in a large, remote Microsoft SQL table with 700,000 rows (1.4%). (The specific brand of remote SQL database doesn’t matter: this should work equally well for any ODBC RDMS.) This tutorial shows two ways of updating the remote table with SAS’s PROC SQL, and the second method is much faster.
First, setup some small example data sets. Pretend data set A is the large table on a remote SQL server, and data set B contains the changes you want to make. Column (variable) X is a key (unique identifier), and column Y is the value (what we want to change).
data a; input x 1. y $1.; cards; 1a 2b 3c ; run; data b; input x 1. y $1.; cards; 1c 3a ; run;
So you see we want to change 1 from A to C, and also we want to change 3 from C to A.
In SAS, PROC SQL UPDATE doesn’t support JOINs, so the update is done like this:
/* WARNING: DO NOT EXECUTE THIS SECTION WITH REAL DATA */ proc sql; update a set y = (select y from b where a.x = b.x) ; quit;
Execute the PROC SQL step above, and notice the SAS log:
NOTE: 3 rows were updated in WORK.A.
Three rows? I thought we were updating two rows. Open table A and notice in the second row the column Y is set to a blank (NULL) value: this happen because table B didn’t contain an new value for it. It’s easy enough to fix by adding a WHERE clause:
proc sql; update a set y = (select y from b where a.x = b.x) where x in (select x from b) ; quit;
Functionally this works fine, but on my real data set (10,000 rows) it takes 40 minutes! There is a faster way using a DATA step to dynamically generate a PROC SQL using CALL EXECUTE.
data _null_; set b end=final; if _n_ = 1 then call execute("proc sql ;"); call execute("update a set y='"||trim(y)||"' where x="||trim(x)||";"); if final then call execute('quit;'); run;
The new CALL EXECUTE method finishes in 4 minutes (90% faster!).
If using the CALL EXECUTE method, keep in mind a few things. First, SAS doesn’t execute the CALL EXECUTE code immediately inside the DATA step: it executes once the DATA step finishes (“run”). Second, when testing your code, start with
proc sql noexec;
to validate syntax while preventing actual execution (no changes will be made in the database). Third, make your first real updates small by limiting the input observations like
set b(obs=10) end=final;
and then validate the changes. Fourth, make sure the key used in the WHERE clause is indexed on the remote SQL database: otherwise each UPDATE will cause a painfully slow table scan. It’s fine to use two conditions in the WHERE clause: in my real table, the unique identifier is not indexed, so I pass a second condition in the WHERE clause with a date that is indexed. Finally, it’s good practice to test the automatic macro variables SQLOBS and SQLRC to verify the PROC SQL executed correctly.
Tested with SAS 9.1.3 on Windows XP SP3 with Microsoft SQL Server 2005 (9.00.3257.00).