When deleting non-complete duplicates in SAS, in each duplicate set you may want to keep a particular record identified by a rule: it may be the oldest, newest, first, or last observation in each set. You need a identifier to be unique, but you can’t randomly choose which observation to keep. To be precise, you could say the observations (as a whole) are not duplicates, but there are duplicates of an identifier. The problem may arise when doing a JOIN in PROC SQL or MERGE in a DATA step: in some cases, the join/merge cannot produce the needed final data set without additional processing.
First, create a data set for demonstration:
data foo; input customer_id version_id keep; datalines; 0 0 0 0 1 1 1 7 1 1 7 1 1 3 0 2 5 0 2 9 1 ;
The first column represents customers which are duplicated. The second is the version: we want to keep the newest (highest) version. You could easily substitute a date variable for our version_id because SAS stores dates as integers. SAS will ignore the keep variable, but we use it just for the tutorial to quickly check the results: keep=1 means keep and keep=0 means discard. If we do it right, the output will always show keep=1.
A “key” step is sorting the data set so the record we want to keep is first in its set (rather than in some random order):
proc sort data=foo; by customer_id descending version_id; run;
Notice the version_id column is descending: that means the newest (highest) version is first for each customer_id set. Now we have a few choices on how to remove the duplicates. The first method uses a DATA set and extracts the first record. You may want to use a DATA step to improve speed if you can combine separate DATA steps.
data first; /* you can use "data foo;" to replace the existing data set */ set foo; by customer_id; if first.customer_id; run;
Alternatively, you can use PROC SORT. Add the statement option out=first to get identical results as above.
proc sort data=foo nodupkey; by customer_id; run;
If you want to remove exact duplicates, look up PROC SQL with the DISTINCT option or PROC SORT with the NODUPRECS option.