In natural language processing, it can be helpful to convert numbers that are written out as words within a larger body of text into Arabic numerals. For example, we will change “I am forty-six years old” to “I am 46 years old,” so the age can be more easily extracted later with another regular expression.
SAS makes this task particularly easy because the WORDSw. format does the opposite (it writes a numeric value as words), and we can leverage the format to generate a large set of regular expressions that perform the conversion. In effect, this is metaprogramming: using SAS code to generate regular expression code. If you don’t use SAS, the generated regular expressions are listed at the end.
The SAS code below demonstrates the technique for numbers 10 to 110. It converts numbers written with and without a hyphen (like “forty-five” and “forty five”), corrects the common misspelling “fourty” first, and, because the purpose is age extraction, assumes there are no written numbers greater than 110 (like “one hundred forty five”).
data numbers;
    infile datalines truncover;
    length num_str $30;
    input text $200.;
    text = prxchange('s/fourty/forty/i', -1, text); /* typo according to https://en.wikipedia.org/wiki/40_%28number%29 */
    /* Process in reverse order because if we try to change "forty" before "forty five" then we would get "40 five" */
    i = 110;
    do while (i >= 10);
        num_str = put(i, words30.);
        re = 's/\b(' || strip(num_str) || '|' || strip(translate(num_str, ' ', '-')) || ')\b/' || strip(i) || '/i';
        text = prxchange(re, -1, text);
        i = i - 1;
    end;
    drop i num_str re;
datalines;
I am fourty-six years old now, my wife is fifty two, and my son ten.
I am fifteen years old
I am fifty four years old, I will be sixty-nine years old on September 28th.
I will soon be forty.
Can you believe I am one hundred ten years old?
;
run;
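The ordering comment in the code above can be checked outside SAS as well. The following is a small Python sketch, not part of the original program: it takes two of the generated substitutions and applies them in both orders. The names `FORTY`, `FORTY_FIVE`, and `apply_rules` are hypothetical and exist only for this illustration.

```python
import re

# Two of the generated substitutions, as (compiled pattern, replacement) pairs.
FORTY = (re.compile(r"\b(forty)\b", re.IGNORECASE), "40")
FORTY_FIVE = (re.compile(r"\b(forty-five|forty five)\b", re.IGNORECASE), "45")


def apply_rules(text, rules):
    """Apply each substitution to the text in the order given."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


sentence = "I am forty five years old"

# Smaller number first: "forty" is consumed before "forty five" can match.
ascending = apply_rules(sentence, [FORTY, FORTY_FIVE])

# Larger number first: the longer phrase is matched while it is still intact.
descending = apply_rules(sentence, [FORTY_FIVE, FORTY])

print(ascending)   # I am 40 five years old
print(descending)  # I am 45 years old
```

Because `\b` also treats a hyphen as a word boundary, the same problem appears for “forty-five”, which would become “40-five” if the rule for 40 ran first.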
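For large data sets we can slightly improve run-time performance by replacing the DATA step DO WHILE loop with a macro loop that generates one PRXPARSE call per number. Each regular expression is then compiled once, on the first observation, so SAS doesn’t have to recompile it for every observation: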
%macro written2arabic(sm_text_var /* the name of the variable containing the text */);
    %local i;
    /* Count down with %DO %WHILE; an iterative %DO with %BY -1 would also work */
    %let i = 110;
    %do %while(&i >= 10);
        %put i = &i;
        /* Compile regular expression only on the first observation */
        if _N_ eq 1 then do;
            retain re_w2a_&i;
            re_w2a_&i = prxparse('s/\b(' || strip(put(&i, words30.)) || '|' || strip(translate(put(&i, words30.), ' ', '-')) || ')\b/' || strip(&i) || '/io');
        end;
        &sm_text_var = prxchange(re_w2a_&i, -1, &sm_text_var);
        drop re_w2a_&i;
        %let i = %eval(&i - 1);
    %end;
%mend;

data numbers;
    infile datalines truncover;
    input text $200.;
    text = prxchange('s/fourty/forty/i', -1, text); /* typo according to https://en.wikipedia.org/wiki/40_%28number%29 */
    /* Process in reverse order because if we try to change "forty" before "forty five" then we would get "40 five" */
    %written2arabic(text);
datalines;
I am fourty-six years old now, my wife is fifty two, and my son ten.
I am fifteen years old
I am fifty four years old, I will be sixty-nine years old on September 28th.
I will soon be fourty.
Can you believe I am one hundred ten years old?
;
run;
In case you are not using SAS, here are the generated regular expressions:
s/\b(one hundred ten|one hundred ten)\b/110/i
s/\b(one hundred nine|one hundred nine)\b/109/i
s/\b(one hundred eight|one hundred eight)\b/108/i
s/\b(one hundred seven|one hundred seven)\b/107/i
s/\b(one hundred six|one hundred six)\b/106/i
s/\b(one hundred five|one hundred five)\b/105/i
s/\b(one hundred four|one hundred four)\b/104/i
s/\b(one hundred three|one hundred three)\b/103/i
s/\b(one hundred two|one hundred two)\b/102/i
s/\b(one hundred one|one hundred one)\b/101/i
s/\b(one hundred|one hundred)\b/100/i
s/\b(ninety-nine|ninety nine)\b/99/i
s/\b(ninety-eight|ninety eight)\b/98/i
s/\b(ninety-seven|ninety seven)\b/97/i
s/\b(ninety-six|ninety six)\b/96/i
s/\b(ninety-five|ninety five)\b/95/i
s/\b(ninety-four|ninety four)\b/94/i
s/\b(ninety-three|ninety three)\b/93/i
s/\b(ninety-two|ninety two)\b/92/i
s/\b(ninety-one|ninety one)\b/91/i
s/\b(ninety|ninety)\b/90/i
s/\b(eighty-nine|eighty nine)\b/89/i
s/\b(eighty-eight|eighty eight)\b/88/i
s/\b(eighty-seven|eighty seven)\b/87/i
s/\b(eighty-six|eighty six)\b/86/i
s/\b(eighty-five|eighty five)\b/85/i
s/\b(eighty-four|eighty four)\b/84/i
s/\b(eighty-three|eighty three)\b/83/i
s/\b(eighty-two|eighty two)\b/82/i
s/\b(eighty-one|eighty one)\b/81/i
s/\b(eighty|eighty)\b/80/i
s/\b(seventy-nine|seventy nine)\b/79/i
s/\b(seventy-eight|seventy eight)\b/78/i
s/\b(seventy-seven|seventy seven)\b/77/i
s/\b(seventy-six|seventy six)\b/76/i
s/\b(seventy-five|seventy five)\b/75/i
s/\b(seventy-four|seventy four)\b/74/i
s/\b(seventy-three|seventy three)\b/73/i
s/\b(seventy-two|seventy two)\b/72/i
s/\b(seventy-one|seventy one)\b/71/i
s/\b(seventy|seventy)\b/70/i
s/\b(sixty-nine|sixty nine)\b/69/i
s/\b(sixty-eight|sixty eight)\b/68/i
s/\b(sixty-seven|sixty seven)\b/67/i
s/\b(sixty-six|sixty six)\b/66/i
s/\b(sixty-five|sixty five)\b/65/i
s/\b(sixty-four|sixty four)\b/64/i
s/\b(sixty-three|sixty three)\b/63/i
s/\b(sixty-two|sixty two)\b/62/i
s/\b(sixty-one|sixty one)\b/61/i
s/\b(sixty|sixty)\b/60/i
s/\b(fifty-nine|fifty nine)\b/59/i
s/\b(fifty-eight|fifty eight)\b/58/i
s/\b(fifty-seven|fifty seven)\b/57/i
s/\b(fifty-six|fifty six)\b/56/i
s/\b(fifty-five|fifty five)\b/55/i
s/\b(fifty-four|fifty four)\b/54/i
s/\b(fifty-three|fifty three)\b/53/i
s/\b(fifty-two|fifty two)\b/52/i
s/\b(fifty-one|fifty one)\b/51/i
s/\b(fifty|fifty)\b/50/i
s/\b(forty-nine|forty nine)\b/49/i
s/\b(forty-eight|forty eight)\b/48/i
s/\b(forty-seven|forty seven)\b/47/i
s/\b(forty-six|forty six)\b/46/i
s/\b(forty-five|forty five)\b/45/i
s/\b(forty-four|forty four)\b/44/i
s/\b(forty-three|forty three)\b/43/i
s/\b(forty-two|forty two)\b/42/i
s/\b(forty-one|forty one)\b/41/i
s/\b(forty|forty)\b/40/i
s/\b(thirty-nine|thirty nine)\b/39/i
s/\b(thirty-eight|thirty eight)\b/38/i
s/\b(thirty-seven|thirty seven)\b/37/i
s/\b(thirty-six|thirty six)\b/36/i
s/\b(thirty-five|thirty five)\b/35/i
s/\b(thirty-four|thirty four)\b/34/i
s/\b(thirty-three|thirty three)\b/33/i
s/\b(thirty-two|thirty two)\b/32/i
s/\b(thirty-one|thirty one)\b/31/i
s/\b(thirty|thirty)\b/30/i
s/\b(twenty-nine|twenty nine)\b/29/i
s/\b(twenty-eight|twenty eight)\b/28/i
s/\b(twenty-seven|twenty seven)\b/27/i
s/\b(twenty-six|twenty six)\b/26/i
s/\b(twenty-five|twenty five)\b/25/i
s/\b(twenty-four|twenty four)\b/24/i
s/\b(twenty-three|twenty three)\b/23/i
s/\b(twenty-two|twenty two)\b/22/i
s/\b(twenty-one|twenty one)\b/21/i
s/\b(twenty|twenty)\b/20/i
s/\b(nineteen|nineteen)\b/19/i
s/\b(eighteen|eighteen)\b/18/i
s/\b(seventeen|seventeen)\b/17/i
s/\b(sixteen|sixteen)\b/16/i
s/\b(fifteen|fifteen)\b/15/i
s/\b(fourteen|fourteen)\b/14/i
s/\b(thirteen|thirteen)\b/13/i
s/\b(twelve|twelve)\b/12/i
s/\b(eleven|eleven)\b/11/i
s/\b(ten|ten)\b/10/i
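For readers without SAS, the list above can also be generated rather than copied. The following is a minimal Python sketch under stated assumptions, not a port of the original program: `number_to_words`, `build_rules`, and `written_to_arabic` are hypothetical names, the small spelling helper stands in for the WORDSw. format and only covers 1 to 110, and duplicate alternatives such as `(ten|ten)` are collapsed into one. The `AGE` pattern at the end is only an illustration of the later extraction step.

```python
import re

ONES = ["", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
TEENS = ["ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
         "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = {2: "twenty", 3: "thirty", 4: "forty", 5: "fifty",
        6: "sixty", 7: "seventy", 8: "eighty", 9: "ninety"}


def number_to_words(n):
    """Spell out 1..110 in the same style as the list above (hypothetical helper)."""
    if n >= 100:
        rest = n - 100
        return "one hundred" + (" " + number_to_words(rest) if rest else "")
    if n >= 20:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n >= 10:
        return TEENS[n - 10]
    return ONES[n]


def build_rules(low=10, high=110):
    """Compile one (pattern, digits) pair per number, largest number first."""
    rules = []
    for n in range(high, low - 1, -1):
        words = number_to_words(n)
        # Accept the hyphenated spelling and the same spelling with a space.
        alternatives = sorted({words, words.replace("-", " ")})
        pattern = re.compile(r"\b(" + "|".join(alternatives) + r")\b", re.IGNORECASE)
        rules.append((pattern, str(n)))
    return rules


# Compiled once at import time, then reused for every text.
RULES = build_rules()
MISSPELLED_FORTY = re.compile(r"fourty", re.IGNORECASE)


def written_to_arabic(text):
    """Replace written numbers from 10 to 110 with Arabic numerals."""
    text = MISSPELLED_FORTY.sub("forty", text)
    for pattern, digits in RULES:
        text = pattern.sub(digits, text)
    return text


converted = written_to_arabic(
    "I am fourty-six years old now, my wife is fifty two, and my son ten."
)
print(converted)  # I am 46 years old now, my wife is 52, and my son 10.

# Illustrative follow-up step: pull out ages stated as "<number> years old".
AGE = re.compile(r"\b(\d{1,3}) years old\b")
ages = [int(a) for a in AGE.findall(converted)]
print(ages)  # [46]
```

As in the SAS versions, the rules are applied from 110 down to 10 so that longer phrases are replaced before the shorter words they contain.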