Delete non-exact duplicates in SAS

When deleting non-complete duplicates in SAS, in each duplicate set you may want to keep a particular record identified by a rule: it may be the oldest, newest, first, or last observation in each set. You need a identifier to be unique, but you can’t randomly choose which observation to keep. To be precise, you could say the observations (as a whole) are not duplicates, but there are duplicates of an identifier. The problem may arise when doing a JOIN in PROC SQL or MERGE in a DATA step: in some cases, the join/merge cannot produce the needed final data set without additional processing.

First, create a data set for demonstration:

data foo;
	input customer_id version_id keep;
0 0 0
0 1 1
1 7 1
1 7 1
1 3 0
2 5 0
2 9 1

The first column represents customers which are duplicated. The second is the version: we want to keep the newest (highest) version. You could easily substitute a date variable for our version_id because SAS stores dates as integers. SAS will ignore the keep variable, but we use it just for the tutorial to quickly check the results: keep=1 means keep and keep=0 means discard. If we do it right, the output will always show keep=1.

A “key” step is sorting the data set so the record we want to keep is first in its set (rather than in some random order):

proc sort data=foo;
	by customer_id descending version_id;

Notice the version_id column is descending: that means the newest (highest) version is first for each customer_id set. Now we have a few choices on how to remove the duplicates. The first method uses a DATA set and extracts the first record. You may want to use a DATA step to improve speed if you can combine separate DATA steps.

data first; /* you can use "data foo;" to replace the existing data set */
	set foo;
	by customer_id;
	if first.customer_id;

Alternatively, you can use PROC SQL. Add the statement option out=first to get identical results as above.

proc sort data=foo nodupkey;
	by customer_id;

If you want to remove exact duplicates, look up PROC SQL with the DISTINCT option or PROC SORT with the NODUPRECS option.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s