reformatting non-uniform strings

lordsmiter · July 20, 2010, 4:15pm

I have a set of free-form phone numbers that are not uniform, and I want to reformat them into a standard uniform string. They sit at the end of each line of a colon-separated file built by a large nawk + tr pipeline, like this:

XXXXX:YYYYY:ZZZZZ:(333)333-3333x33333
XXXXX:YYYYY:ZZZZZ:x44444
XXXXX:YYYYY:ZZZZZ:(555)-555-5555
XXXXX:YYYYY:ZZZZZ:66666
XXXXX:YYYYY:ZZZZZ:777 - 777 - 7777
XXXXX:YYYYY:ZZZZZ:(888)888-8888
XXXXX:YYYYY:ZZZZZ:999.999.9999

Note: Rows 2 and 4 are extensions that would come out to 444-444-4444 and 666-666-6666, respectively, when dialed from an outside line. So the output would need to be:

XXXXX:YYYYY:ZZZZZ:333-333-3333
XXXXX:YYYYY:ZZZZZ:444-444-4444
XXXXX:YYYYY:ZZZZZ:555-555-5555
XXXXX:YYYYY:ZZZZZ:666-666-6666
XXXXX:YYYYY:ZZZZZ:777-777-7777
XXXXX:YYYYY:ZZZZZ:888-888-8888
XXXXX:YYYYY:ZZZZZ:999-999-9999

Is there a way to build a template with arrays or something similar? I have tried several things with awk and IFS, but can't seem to adequately break these up into arrays byte by byte. Solving this is step 1.

Step 2 is how to incorporate this solution into a three-stage pipeline so that only one file is created by the script. No temporary files allowed. So it would need to work like this:

nawk 'large string of operations to join 2 files' File1 File2 | tr -d ' ' | "this phone number solution" > output.txt

Is this the best way to approach this? I don't want to run nawk + tr to remove whitespace and create output.txt, then come back through and apply the phone number solution to output.txt to create new_output.txt. It all needs to be done in one pass without the temp file, unless you can rewrite the new phone numbers into output.txt after it's generated.

joeyg · July 20, 2010, 4:39pm

Does this help to pull out just the digits?

>echo "XXXXX:YYYYY:ZZZZZ:(333)333-3333x33333" | tr -cd '[:digit:]'
333333333333333

If so, then the next step is to format the data.
Digits beyond the first ten are dropped, right?
And with fewer than ten digits, how should it be formatted?

---------- Post updated at 04:39 PM ---------- Previous update was at 04:29 PM ----------

(Heading home shortly, but wanted to provide more for you to ponder.)

>echo "XXXXX:YYYYY:ZZZZZ:222 333 4444 x5555" | tr -cd '[:digit:]' | gawk '{print substr($1,1,3)"-"substr($1,4,3)"-"substr($1,7,4)}'
222-333-4444
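
One thing to watch when running this over the whole file rather than a single echo: `tr -cd '[:digit:]'` deletes the newlines too, so every number runs together into one long string. A minimal sketch that keeps one number per line, assuming the phone number is always the fourth colon-separated field (the function name `digits_only` is made up for illustration):

```shell
# Hypothetical helper: print only the digits of field 4, one line per input line.
digits_only() {
    # cut isolates the phone field; adding \n to the tr set preserves line breaks
    cut -d: -f4 | tr -cd '[:digit:]\n'
}

printf '%s\n' 'XXXXX:YYYYY:ZZZZZ:(333)333-3333x33333' 'XXXXX:YYYYY:ZZZZZ:x44444' | digits_only
```

The length of each output line (15 and 5 digits for these two rows) then tells you which case a row falls into.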

lordsmiter · July 21, 2010, 10:43am

I think that would work for any of the crap encountered when a person has the full number and area code formatted in 20 different ways. The remaining issue is when they just have their extension there. It needs to be expanded, based on the first digit of the extension, to include the full number plus area code. For example, if the extension is 66666, the leading 6 selects the prefix, giving XXX-XX6-6666. Likewise, 33333 would become XXX-XX3-3333.

Would this be possible with a larger awk and conditional statements? Oh, and this is AIX, so no gawk, only nawk and awk. Your gawk example only uses substr, though, which should be fine with nawk.

aigles · July 21, 2010, 10:58am

I don't understand the full logic of the reformatting process.
Can you show us the required output for this input file:

XXXXX:YYYYY:ZZZZZ:(123)456-7890x84848
XXXXX:YYYYY:ZZZZZ:x12345
XXXXX:YYYYY:ZZZZZ:(987)-654-3210
XXXXX:YYYYY:ZZZZZ:73849
XXXXX:YYYYY:ZZZZZ:543 - 987 - 2106
XXXXX:YYYYY:ZZZZZ:(123)987-0456
XXXXX:YYYYY:ZZZZZ:098.765.4321

Jean-Pierre.

joeyg · July 21, 2010, 10:58am

I am thinking about
sprintf = formatted printing
gsub = global substitution
as useful functions within awk to help you.

(Sorry, kinda busy right now to think thru, but wanted to provide some thoughts)
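
For reference, a rough sketch of how those two could fit together, using only functions that plain awk/nawk should have. It only handles the case where at least ten digits are present and assumes the number is field 4; `fmt_phone` is just an illustrative name.

```shell
# Hypothetical helper: normalise field 4 to NNN-NNN-NNNN when it has 10+ digits.
fmt_phone() {
    awk 'BEGIN { FS = OFS = ":" }
    {
        d = $4
        gsub(/[^0-9]/, "", d)              # gsub: drop everything that is not a digit
        if (length(d) >= 10)               # sprintf: rebuild from the first ten digits
            $4 = sprintf("%s-%s-%s", substr(d, 1, 3), substr(d, 4, 3), substr(d, 7, 4))
        print                              # shorter values pass through unchanged
    }'
}

echo 'XXXXX:YYYYY:ZZZZZ:777 - 777 - 7777' | fmt_phone
```

Lines with fewer than ten digits, such as bare extensions, are printed as they came in, so they still need separate handling.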

lordsmiter · July 21, 2010, 11:15am

Your input:

XXXXX:YYYYY:ZZZZZ:(123)456-7890x84848
XXXXX:YYYYY:ZZZZZ:x12345
XXXXX:YYYYY:ZZZZZ:(987)-654-3210
XXXXX:YYYYY:ZZZZZ:73849
XXXXX:YYYYY:ZZZZZ:543 - 987 - 2106
XXXXX:YYYYY:ZZZZZ:(123)987-0456
XXXXX:YYYYY:ZZZZZ:098.765.4321

Required output:

XXXXX:YYYYY:ZZZZZ:123-456-7890
XXXXX:YYYYY:ZZZZZ:736-251-2345
XXXXX:YYYYY:ZZZZZ:987-654-3210
XXXXX:YYYYY:ZZZZZ:655-627-3849
XXXXX:YYYYY:ZZZZZ:543-987-2106
XXXXX:YYYYY:ZZZZZ:123-987-0456
XXXXX:YYYYY:ZZZZZ:098-765-4321

As you can see, the numbers where the area code was provided are easier to figure out. The lines that have only an extension need an additional set of digits based on the first digit of the extension. So the 12345 extension becomes 736-251-2345, because the leading 1 selects a fixed 5-digit prefix to go in front of the extension. There is no calculation, just a constant value based on the first digit of the extension. Something like this:

if extension starts with "1", prepend 736-25 to the extension and add a dash after the 1.
if extension starts with "2", prepend 854-32 to the extension and add a dash after the 2.
...
if extension starts with "7", prepend 655-62 to the extension and add a dash after the 7.
...
etc..
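
In awk terms that rule is just a lookup table keyed on the first digit. A minimal sketch, filling in only the three prefixes given above (1, 2 and 7); the others are left out on purpose and come back as `?????`. `expand_ext` is a made-up name, and it assumes the input has already been reduced to five digits per line.

```shell
# Hypothetical helper: turn a 5-digit extension into NNN-NNN-NNNN via a prefix table.
expand_ext() {
    awk 'BEGIN {
        # Only the prefixes stated in the thread; add the remaining digits as needed.
        prefix["1"] = "73625"
        prefix["2"] = "85432"
        prefix["7"] = "65562"
    }
    {
        first = substr($0, 1, 1)
        p = (first in prefix) ? prefix[first] : "?????"   # unknown digit: placeholder
        full = p $0                                       # 5-digit prefix + 5-digit extension
        print substr(full, 1, 3) "-" substr(full, 4, 3) "-" substr(full, 7, 4)
    }'
}

printf '%s\n' 12345 73849 | expand_ext
```

For the two extensions in the sample input this should give 736-251-2345 and 655-627-3849.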

aigles · July 21, 2010, 12:20pm

Try and adapt the following script:

awk '
BEGIN {
   FS = OFS = ":"
   #               1xxxx 2xxxx 3xxxx 4xxxx 5xxxx 6xxxx 7xxxx
   n=split ("?????,73625,85432,33333,44444,55555,66666,65562", ac, ",");
}
{
   gsub(/[^0-9]/, "", $4);
   tel = $4;
   len = length(tel);
   if (len == 5) {
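      ext = substr(tel, 1, 1);
      # ac[ext+1] holds the prefix for extensions starting with digit ext,
      # so test that same index (testing "ext in ac" lets 8xxxx pick up an empty prefix)
      tel = ((ext + 1) in ac ? ac[ext + 1] : "?????") tel;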
   }
   tel = tel "??????????";
   $4 = substr(tel, 1, 3) "-" substr(tel, 4, 3) "-" substr(tel, 7, 4);
   print;
}
' lordmiter.txt

Input file (lordmiter.txt):

XXXXX:YYYYY:ZZZZZ:(123)456-7890x84848
XXXXX:YYYYY:ZZZZZ:x12345
XXXXX:YYYYY:ZZZZZ:(987)-654-3210
XXXXX:YYYYY:ZZZZZ:73849
XXXXX:YYYYY:ZZZZZ:543 - 987 - 2106
XXXXX:YYYYY:ZZZZZ:(123)987-0456
XXXXX:YYYYY:ZZZZZ:098.765.4321
XXXXX:YYYYY:ZZZZZ:777.654
XXXXX:YYYYY:ZZZZZ:98765

Output:

XXXXX:YYYYY:ZZZZZ:123-456-7890
XXXXX:YYYYY:ZZZZZ:736-251-2345
XXXXX:YYYYY:ZZZZZ:987-654-3210
XXXXX:YYYYY:ZZZZZ:655-627-3849
XXXXX:YYYYY:ZZZZZ:543-987-2106
XXXXX:YYYYY:ZZZZZ:123-987-0456
XXXXX:YYYYY:ZZZZZ:098-765-4321
XXXXX:YYYYY:ZZZZZ:777-654-????
XXXXX:YYYYY:ZZZZZ:???-??9-8765

Jean-Pierre.

dr.house · July 21, 2010, 12:28pm

#!/bin/bash

cat data | cut -d':' -f4 | sed 's/[ -]//g;s/[()x.]//g' \
  | awk '{
      if (length($0)>10) {print $0" ???"}
      else if (length($0)==10) {print $0}
      else if (length($0)==5 && substr($0,1,1)=="1") {print "73625"$0}
      else if (length($0)==5 && substr($0,1,1)=="2") {print "85432"$0}
      else if (length($0)==5 && substr($0,1,1)=="7") {print "65562"$0}
    }' \
  | sed 's/\(...\)\(...\)\(....\)/\1-\2-\3/g'

exit 0
#finis

[house@leonov] cat data
XXXXX:YYYYY:ZZZZZ:(123)456-7890x84848
XXXXX:YYYYY:ZZZZZ:x12345
XXXXX:YYYYY:ZZZZZ:(987)-654-3210
XXXXX:YYYYY:ZZZZZ:73849
XXXXX:YYYYY:ZZZZZ:543 - 987 - 2106
XXXXX:YYYYY:ZZZZZ:(123)987-0456
XXXXX:YYYYY:ZZZZZ:098.765.4321
[house@leonov] bash code
123-456-789084848 ???
736-251-2345
987-654-3210
655-627-3849
543-987-2106
123-987-0456
098-765-4321

lordsmiter · July 21, 2010, 1:31pm

aigles,

I will try yours out in a bit. I need some time to figure it out.

dr. house,

Anything after the first 10 digits (or 12 characters with dashes) can be thrown away, such as the tail of line 1 of your output. Would you add another conditional (using length(), maybe) to your awk, or can the sed be modified to trim this?

Thanks for the input.

dr.house · July 21, 2010, 1:43pm

# if (length($0)>10) {print $0" ???"}
if (length($0)>10) {print substr($0,1,10)}
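
Putting that change together with the first three fields, which the `cut` above throws away: a sketch of the same if/else logic as a single awk stage that reads stdin, so it can sit at the end of the existing nawk | tr pipeline and write output.txt directly, with no temporary file. Only the prefixes for 1, 2 and 7 are known from the thread; anything else is passed through untouched here, which is an assumption about what you'd want. `fix_phones` is an illustrative name.

```shell
# Hypothetical final pipeline stage: rewrite field 4, keep fields 1-3.
fix_phones() {
    awk 'BEGIN { FS = OFS = ":" }
    {
        d = $4
        gsub(/[^0-9]/, "", d)                            # digits only
        c = substr(d, 1, 1)                              # first digit picks the prefix
        if (length(d) >= 10) d = substr(d, 1, 10)        # drop anything past ten digits
        else if (length(d) == 5 && c == "1") d = "73625" d
        else if (length(d) == 5 && c == "2") d = "85432" d
        else if (length(d) == 5 && c == "7") d = "65562" d
        else { print; next }                             # unknown shape: leave the line alone
        $4 = substr(d, 1, 3) "-" substr(d, 4, 3) "-" substr(d, 7, 4)
        print
    }'
}

# printf stands in for: nawk '...' File1 File2 | tr -d ' '
printf '%s\n' 'XXXXX:YYYYY:ZZZZZ:(123)456-7890x84848' 'XXXXX:YYYYY:ZZZZZ:x12345' | fix_phones
```

In the real script the last stage would then be `... | tr -d ' ' | fix_phones > output.txt`, with `awk` swapped for `nawk` on AIX if needed.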