Parsing to_addr field in bash

chebarbudo · July 25, 2017, 6:23am

Hi there,

I'm trying to parse the to_addr field of emails and split it into individual email addresses. The idea is to split on the comma character (,).
These first two approaches work:

$ field="'Paul FOO' <paul@foo.com>, Andrew   FOO <andrew@foo.com>"
$ IFS=, read -r -a array <<< "$field"; for a in "${array[@]}"; do echo "$a"; done
'Paul FOO' <paul@foo.com>
 Andrew   FOO <andrew@foo.com>
$ (IFS=,; for a in $field; do echo "$a"; done)
'Paul FOO' <paul@foo.com>
 Andrew   FOO <andrew@foo.com>

But both fail if a name contains a comma:

$ field='"Paul FOO" <paul@foo.com>, "Andrew, M. FOO" <andrew@foo.com>'
$ IFS=, read -r -a array <<< "$field"; for a in "${array[@]}"; do echo "$a"; done
"Paul FOO" <paul@foo.com>
 "Andrew
 M. FOO" <andrew@foo.com>
$ (IFS=,; for a in $field; do echo "$a"; done)
"Paul FOO" <paul@foo.com>
 "Andrew
 M. FOO" <andrew@foo.com>

How can I handle this situation?

Regards
Santiago

joeyg · July 25, 2017, 8:19am

I seem to recall a similar type of problem years ago, so I changed the trap rules.
If I am following your example, each record should end with .com> or .com>,
So what if, before splitting, you substitute .com>, with .com>; and then use ; as your IFS? I'm not sure all addresses are .com, so you may need to do the same substitution for .org as well.
If you do not want to use the ; character, you could also try a character that is unlikely to appear, like ~ or ^ or |
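
A minimal sketch of that substitution in bash. It assumes every address ends with ">" (so the TLD does not matter) and that ";" never appears in the field. Neither assumption is guaranteed by real mail headers, and a bare address such as foo@bar.com would not be separated this way.

```shell
#!/usr/bin/env bash
field='"Paul FOO" <paul@foo.com>, "Andrew, M. FOO" <andrew@foo.com>'

# Replace every ">," with ">;" so that only commas that directly follow
# an address become separators (assumption: each address ends with ">").
marked=${field//>,/>;}

# Split on ";" instead of ",".
IFS=';' read -r -a array <<< "$marked"

for a in "${array[@]}"; do
    # Strip the single blank left over from ", ".
    echo "${a# }"
done
```

The comma inside "Andrew, M. FOO" is left alone because it is not preceded by ">".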

chebarbudo · July 25, 2017, 9:28am

Thanks joeyg for your thoughts...
It's unfortunately more complicated.
On the small pool I'm working on (more than 4000 messages, fewer than 200 addresses), I already have a few patterns:

 
FOO
Raw
root
Support
undisclosed-recipients:
Undisclosed, recipients:
Santiago DIEZ <foo@bar.com>
"Santiago DIEZ" <foo@bar.com>
"Santiago, F. DIEZ" <foo@bar.com>
"'Santiago DIEZ' via Gmail" <foo@bar.com>
<foo@bar.com>
foo@bar.com
foo@bar.com (Santiago DIEZ)

Line 1 is blank. Lines 2-7 will be ignored, so it's OK if we break line 7 (because of the comma). I already have a regex pattern for the rest, but I'm afraid of running into new patterns, so I'd like a way to split on the comma (except when it is between quotes).
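
For illustration, splitting on commas outside double quotes can be sketched in plain bash by walking the string one character at a time. The function name split_addresses and the array name addresses are made up for this example. The sketch does not handle backslash-escaped quotes or commas inside parenthesised comments such as foo@bar.com (DIEZ, Santiago).

```shell
#!/usr/bin/env bash

# split_addresses is a hypothetical helper, not an existing tool.
# A comma counts as a separator only when it is outside double quotes.
# Results are stored in the global array "addresses".
split_addresses() {
    local input=$1 current='' char in_quotes=0 i
    addresses=()
    for (( i = 0; i < ${#input}; i++ )); do
        char=${input:i:1}
        if [[ $char == '"' ]]; then
            in_quotes=$(( 1 - in_quotes ))   # toggle quoted state
            current+=$char
        elif [[ $char == ',' && $in_quotes -eq 0 ]]; then
            addresses+=("$current")          # unquoted comma ends an entry
            current=''
        else
            current+=$char
        fi
    done
    addresses+=("$current")                  # last entry has no trailing comma
}

field='"Paul FOO" <paul@foo.com>, "Andrew, M. FOO" <andrew@foo.com>'
split_addresses "$field"
for a in "${addresses[@]}"; do
    # Trim the blank that follows each separating comma.
    echo "${a# }"
done
```

An unquoted entry like "Undisclosed, recipients:" is still split in two, which matches the behaviour described as acceptable for the ignored lines.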

jim_mcnamara · July 25, 2017, 10:18am

You will have to write a parser in awk to handle unvalidated user input like that.

If you are on Linux, gawk should be there. Try regular expressions as field-delimiting patterns:

Example:

gawk -vFPAT='[^,]*|"[^"]*"'   '{ your code to print goes here, split with FPAT }' somefile

You can use alternation:

 -vFPAT='(pattern set 1|pattern set 2|pattern set 3)'

You can also declare fields with

awk -F 'regex goes here' '{code here}' somefile

I cannot give you a fixed set of rules to use; it looks like you do not have a complete set yet. You should do some serious validation on the input to that dataset so you do not get difficult formatting problems. Otherwise you may have to resort to using some bizarre character as a field delimiter, maybe high ASCII (> 127).
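
As a rough sketch of the FPAT idea applied to the two-address example from the first post: the pattern below is an assumption chosen for that sample, not a complete rule set. It describes a field as a run of either non-comma, non-quote characters or whole double-quoted strings, so a quoted name followed by <address> stays in one field. FPAT needs gawk 4.0 or later.

```shell
#!/usr/bin/env bash
field='"Paul FOO" <paul@foo.com>, "Andrew, M. FOO" <andrew@foo.com>'

# FPAT describes what a field looks like rather than what separates fields.
# Here a field is one or more of: a character that is neither a comma nor
# a double quote, or a complete double-quoted string.
result=$(gawk -v FPAT='([^,"]|"[^"]*")+' '{
    for (i = 1; i <= NF; i++) {
        addr = $i
        sub(/^[ \t]+/, "", addr)   # trim the blank after each comma
        print addr
    }
}' <<< "$field")

echo "$result"
```

Like any quote-based rule, this still splits unquoted text containing a comma, and it does not understand backslash-escaped quotes.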

chebarbudo · July 26, 2017, 4:09am

Great lead. Thanks jim mcnamara.