Email address normalization in SAS

This SAS macro normalizes email addresses, converting an address such as First.Last+tag@googlemail.com to the canonical form firstlast@gmail.com. It also demonstrates basic unit testing in SAS, which helps ensure quality and eases code maintenance.

Email address normalization is often used to transform email addresses into unique keys for identifying or preventing duplicate accounts.

/* Create a data set with the input email address and */
/* the expected output for (automated) unit testing. */
data email_test;
	infile datalines delimiter=',' ;
	informat original $60. expected $60.;
	input original $ expected $;
datalines;
First@example.com,first@example.com
first.last@example.com,first.last@example.com
first.last+tag@example.com,first.last+tag@example.com
First(comment)@example.com,first@example.com
name@googlemail.com,name@gmail.com
First.Last+Plus@gmail.com,firstlast@gmail.com
First.Last+Plus@googlemail.com,firstlast@gmail.com
first.last@outlook.com,first.last@outlook.com
first.last+plus@hotmail.com,first.last@hotmail.com
first.last+tag@fastmail.fm,first.last@fastmail.fm
foldername@username.fastmail.fm,username@fastmail.fm
;

/* Address normalization macro */
%macro normalize_email(original /* input */, normalized /* output */);
	retain re_local_part_plus re_domain_part;
	if _n_ = 1  then do;
		re_local_part_plus = prxparse('/([^+@]+)/');
		re_domain_part = prxparse('/@(.*)/');
		end;
	length local_part domain_part $200; /* prevent truncation */

	/* Technically the local part is case sensitive, but in practice case is usually ignored. */
	&normalized = strip(lowcase(&original));

	/* Remove comment like First(comment)@example.com */
	&normalized = prxchange('s/\([^()]+\)//', -1, &normalized);

	/* Get the domain part */
	if prxmatch(re_domain_part, &normalized) then domain_part = prxposn(re_domain_part, 1, &normalized);
		else abort;

	/* gmail.com and googlemail.com are the same */
	if domain_part eq 'googlemail.com' then domain_part = 'gmail.com';

	/* Identify domains that support plus addressing (also called address tags or sub-addressing) */
	plus_addressing = domain_part in (
		/* Reference: http://lifehacker.com/outlook-com-users-can-now-create-instant-disposable-ema-1341682638 */
		'gmail.com', 'outlook.com',
		/* Reference: http://blogs.msdn.com/b/ekraus/archive/2011/11/09/hotmail-plus-addresses.aspx */
		'hotmail.com',
		/* These are all Fastmail domains */
		/* Reference: https://www.fastmail.fm/help/features_plus_addressing_and_subdomain_addressing.html */
		/* Reference: https://www.fastmail.fm/help/signup_domains.html (December 3, 2013) */
		'123mail.org','150mail.com','150ml.com','16mail.com','2-mail.com','4email.net','50mail.com','airpost.net','allmail.net','bestmail.us','cluemail.com',
		'elitemail.org','emailcorner.net','emailengine.net','emailengine.org','emailgroups.net','emailplus.org','emailuser.net','eml.cc','fastem.com',
		'fast-email.com','fastemail.us','fastemailer.com','fastest.cc','fastimap.com','fastmail.cn','fastmail.com.au','fastmail.fm','fastmail.jp',
		'fastmail.us','fastmail.co.uk','fastmail.to','fmail.co.uk','fast-mail.org','fastmailbox.net','fastmessaging.com','fea.st','f-m.fm','fmailbox.com',
		'fmgirl.com','fmguy.com','ftml.net','hailmail.net','h-mail.us','imap.cc','imap-mail.com','imapmail.org','inoutbox.com','internet-e-mail.com',
		'internetemails.net','internet-mail.org','internetmailing.net','jetemail.net','justemail.net','letterboxes.org','mailandftp.com','mailas.com',
		'mailbolt.com','mailc.net','mailcan.com','mail-central.com','mailforce.net','mailftp.com','mailhaven.com','mailingaddress.org','mailite.com',
		'mailmight.com','mailnew.com','mail-page.com','mailsent.net','mailservice.ms','mailup.net','mailworks.org','ml1.net','mm.st','myfastmail.com',
		'mymacmail.com','nospammail.net','ownmail.net','petml.com','postinbox.com','postpro.net','proinbox.com','promessage.com','realemail.net','reallyfast.biz',
		'reallyfast.info','rushpost.com','sent.as','sent.at','sent.com','speedpost.net','speedymail.org','ssl-mail.com','swift-mail.com','the-fastest.net',
		'theinternetemail.com','the-quickest.com','veryfast.biz','veryspeedy.net','warpmail.net','xsmail.com','yepmail.net','your-mail.com');

	if plus_addressing then do;
		/* Get the local part (account name). */
		if prxmatch(re_local_part_plus, &normalized) then local_part = prxposn(re_local_part_plus, 1, &normalized);
		else abort;
		/* Additionally, gmail.com ignores dots in the local part. */
		/* Reference: https://en.wikipedia.org/wiki/Email_address#Local-part_normalization */
		if domain_part eq 'gmail.com' then 
			local_part = compress(local_part, '.');
		/* Reassemble the address from the local part and the domain part. */
		&normalized = cats(local_part, '@', domain_part);
		end;
	drop re_local_part_plus re_domain_part domain_part local_part plus_addressing;
%mend;

/* Demonstrate use of the macro */
data email_test;
	set email_test;
	%normalize_email(original, normalized);
	/* Test expected behavior */
	correct = normalized eq expected;
	if not correct then put 'ERROR: ' original= expected= normalized=;
run;
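
The per-row ERROR messages can be supplemented with a single summary line in the log. The following is a minimal sketch, assuming the email_test data set and its correct flag from the step above; the macro variable name n_failed is made up for this illustration.

```sas
/* Count the rows where the macro output differs from the expected value */
proc sql noprint;
	select count(*) into :n_failed
	from email_test
	where correct = 0;
quit;

/* %let strips the leading blanks that INTO leaves in the macro variable */
%let n_failed = &n_failed;
%put NOTE: &n_failed normalization test(s) failed.;
```

A count of zero means every test address was normalized as expected, which makes it easy to re-check the macro after changing the domain list or the regular expressions.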

Tested with 64-bit SAS 9.3 on Windows 7.
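
To illustrate the duplicate-account use case mentioned at the top, the sketch below applies the macro to a small table and lists the accounts that share a normalized address. The data set names (accounts, accounts_keyed, duplicate_accounts), the variables (account_id, email, email_key), and the sample rows are all hypothetical and exist only for this example.

```sas
/* Hypothetical input: one row per account with its sign-up address */
data accounts;
	infile datalines delimiter=',';
	informat account_id 8. email $60.;
	input account_id email $;
datalines;
1,First.Last@gmail.com
2,firstlast+news@googlemail.com
3,someone@example.com
;

/* Derive the normalized key with the macro defined above */
data accounts_keyed;
	set accounts;
	%normalize_email(email, email_key);
run;

/* Keep only the accounts whose normalized key occurs more than once */
proc sql;
	create table duplicate_accounts as
	select *
	from accounts_keyed
	group by email_key
	having count(*) > 1
	order by email_key, account_id;
quit;
```

PROC SQL writes a note about remerging summary statistics for this query, which is expected here. Whether matching keys really indicate the same person still depends on the mail provider's rules, so the result is best treated as a list of candidates for review.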
