SAS: “The query requires remerging summary statistics back with the original data”

Coming from a background writing SQL code directly for “real” RDBMS (Microsoft SQL Server, MySQL, and SQLite), I was initially confused when SAS would give me the following ‘note’ for a simple summary PROC SQL query:

429  proc sql;
430      create table undel_monthly as
431      select
432          year(date) as year,
433          month(date) as month,
434          count(*) as n_undel
435      from
436          comments2
437      group by
438          year(date),
439          month(date)
440      order by
441          year(date),
442          month(date)
443      ;
NOTE: The query as specified involves ordering by an item that doesn't appear in its SELECT clause.
NOTE: The query requires remerging summary statistics back with the original  data.
NOTE: Table WORK.UNDEL_MONTHLY created, with 31004 rows and 3 columns.

The same code executes fine in “real” RDBMS. As is often the case, SAS has its own way of doing things. SAS expects its special keyword calculated in each summary/grouping statistic like below. The calculated keyword simply refers to the value calculated in the SELECT section.

proc sql;
	create table undel_monthly as
	select
		year(date) as year,
		month(date) as month,
		count(*) as n_undel
	from
		comments2
	group by
		calculated year,
		calculated month
	order by
		calculated year,
		calculated month
	;
quit;

So how does SAS optimize such a simple “group by” query when sending it to a RDBMS? It doesn’t! Below is a simple query to pull monthly phone call volume from CIC 2.4. Notice in the log I turned on SQL tracing, and SAS shows it pulls individual records instead of letting Microsoft SQL Server produce the summary. SAS complains about an error, but it doesn’t describe it. In SAS’s defense, it is probably confused because the date-type is a date-time type instead of a date-only data type, but Microsoft SQL Server doesn’t support date-only data types. A pass-through query would be faster for large data sets, but then you lose some of the usefulness of PROC SQL.

532  options sastrace=',,,d' sastraceloc=saslog nostsuffix;
533
534  proc sql;
535      create table calls_volume as
536      select
537          year(DatePart(InitiatedDate)) as year,
538          month(DatePart(InitiatedDate)) as month,
539          count(*) as number_of_calls
540      from
541          i3eic.CallDetail
542      where
543          InitiatedDate > '01jan09'd
544      group by
545          calculated year,
546          calculated month
547      ;
ODBC: AUTOCOMMIT is NO for connection 10
ODBC: AUTOCOMMIT turned ON for connection id 10

ODBC_4: Prepared:
SELECT * FROM "dbo"."CallDetail"

ODBC: AUTOCOMMIT is NO for connection 11
SAS_SQL:  Unable to convert the query to a DBMS specific SQL statement due to
an error.
ACCESS ENGINE:  SQL statement was not passed to the DBMS, SAS will do the
processing.
ODBC: COMMIT performed on connection 11.

ODBC_5: Prepared:
SELECT  "InitiatedDate"  FROM "dbo"."CallDetail"  WHERE  ( "InitiatedDate" > {d
'2009-01-01' } )


ODBC_6: Executed:
Prepared statement ODBC_5

NOTE: Table WORK.CALLS_VOLUME created, with 9 rows and 3 columns.

548  quit;
NOTE: PROCEDURE SQL used (Total process time):
      real time           2.86 seconds
      user cpu time       1.81 seconds
      system cpu time     0.31 seconds
      Memory                            34697k
About these ads

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s