text manipulation and pattern matching

caprica13 · July 24, 2008, 7:01pm

Hi guys,

I need help:
I started receiving automatic emails containing download information. The problem is that these emails arrive in a rich format (I have no control over this), so the important information is buried under a bunch of mumbo-jumbo. To complicate things even further, I need to automate the download process too, so I need to somehow identify and extract the exact path to the file and forward it for further processing.

the relevant part of the email looks something like this:

more_blah_before
style=3D"font-size: 11px; margin-top: 0px; margin-right: 0px; =
margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: =
0px; padding-bottom: 0px; padding-left: 0px; ">Software</td><td =
style=3D"font-size: 11px; margin-top: 0px; margin-right: 0px; =
margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: =
0px; padding-bottom: 0px; padding-left: 0px; "><a =
href=3D"afp://server.company.com/del/e/QQ888-9999/Q=
Q888-9999-3/QQ888-9999-3.dmg">del/QQ888-9999/QQ888-9999-3</a></td=
></tr><tr style=3D"vertical-align: top; margin-top: 0px; margin-right: =
0px; margin-bottom: 0px; margin-left: 0px; padding-top: 0px; =
padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "><td =
style=3D"font-size: 11px; margin-top: 0px; margin-right: 0px; =
margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: =
more_blah_after

so the part that I need to extract from here is
afp://server.company.com/del/e/QQ888-9999/QQ888-9999-3/QQ888-9999-3.dmg

the problem is that the path to the file is split with "=" so that would have to be removed somehow (if present)

also I am not sure how to remove anything present before afp:// (like href=3D" in this case) or anything present after .dmg (">del/QQ888-9999/QQ888-9999-3</a></td= in this case)

any help would be appreciated

thank you

danmero · July 24, 2008, 8:42pm

This should do the job:

tr -d '\n' < file | sed 's/^.*"afp/"afp/;s/>.*$//'

caprica13 · July 24, 2008, 10:28pm

wow. you're amazing. thank you!

to expand on this, most of the time I would get an email with not one, but two files to download (and two to avoid).
would you mind suggesting a loop that would extract both afp links?

for example:
afps to get:

afp://MYserver.company.com/del/e/QQ888-9999/QQ888-9999-/QQ888-9999-3.dmg
and
afp://MYserver.company.com/del/e/QQ666-7777/QQ666-7777-/QQ666-7777-3.dmg

both buried in the rich formatting nonsense.

to make things a bit more complicated, the email would also contain a couple of afp links to a different server, which would need to be skipped

for example

afps to be skipped:
afp://NOTMYserver.company.com/del/e/QQ888-9999/QQ888-9999-/QQ888-9999-3.dmg
and
afp://NOTMYserver.company.com/del/e/QQ666-7777/QQ666-7777-/QQ666-7777-3.dmg

the sample email would look something like this:

more_blah_before
0px; padding-bottom: 0px; padding-left: 0px; ">Software</td><td =
style=3D"font-size: 11px; margin-top: 0px; margin-right: 0px; =
margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: =
0px; padding-bottom: 0px; padding-left: 0px; "><a =
href=3D"afp://NOTMYserver.company.com/del/e/QQ888-9999/Q=
Q888-9999-3/QQ888-9999-3.dmg">del/QQ888-9999/QQ888-9999-3</a></td=
></tr><tr style=3D"vertical-align: top; margin-top: 0px; margin-right: =
padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "><td =
href=3D"afp://NOTmyserver.company.com/del/e/QQ666-7777/Q=
Q666-7777-3/QQ666-7777-3.dmg">del/QQ666-7777/QQ666-7777-3</a></td=
style=3D"font-size: 11px; margin-top: 0px; margin-right: 0px; =
0px; padding-bottom: 0px; padding-left: 0px; ">Software</td><td =
style=3D"font-size: 11px; margin-top: 0px; margin-right: 0px; =
margin-bottom: 0px; margin-left: 0px; padding-top: 0px; padding-right: =
0px; padding-bottom: 0px; padding-left: 0px; "><a =
href=3D"afp://MYserver.company.com/del/e/QQ888-9999/Q=
Q888-9999-3/QQ888-9999-3.dmg">del/QQ888-9999/QQ888-9999-3</a></td=
></tr><tr style=3D"vertical-align: top; margin-top: 0px; margin-right: =
padding-right: 0px; padding-bottom: 0px; padding-left: 0px; "><td =
href=3D"afp://MYserver.company.com/del/e/QQ666-7777/Q=
Q666-7777-3/QQ666-7777-3.dmg">del/QQ666-7777/QQ666-7777-3</a></td=
style=3D"font-size: 11px; margin-top: 0px; margin-right: 0px; =
more_blah_after

thanks again, much appreciated

joeyg · July 24, 2008, 11:02pm

The grep command should be able to select the lines you want to include. Using grep -v instead excludes the lines that match a pattern.

So, you might want to append

grep "afp://MYserver.company.com"
or
grep -v "afp://NOTMYserver.company.com"

to the end of the previous command string.
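
A minimal sketch of how the filter could be wired in, assuming LF line endings and that each link sits inside double quotes as in the sample. Once the newlines are removed, the earlier sed command reduces the whole message to a single link, so this sketch puts each quoted string on its own line before grep runs. The sample text and the variable names (sample, links) are placeholders, not part of the real email.

```shell
# Hypothetical stand-in for the mail body; server names are placeholders.
sample='<a =
href=3D"afp://NOTMYserver.company.com/del/e/QQ888-9999/Q=
Q888-9999-3/QQ888-9999-3.dmg">skip</a>
<a =
href=3D"afp://MYserver.company.com/del/e/QQ666-7777/Q=
Q666-7777-3/QQ666-7777-3.dmg">keep</a>'

links=$(printf '%s\n' "$sample" |
  sed 's/=$//' |   # drop the trailing "=" that marks a soft line break
  tr -d '\n' |     # join the message into one long line
  tr '"' '\n' |    # put every double-quoted string on its own line
  grep '^afp://MYserver\.')   # keep only links to the wanted server

printf '%s\n' "$links"
```

Swapping the last stage for grep -v '^afp://NOTMYserver\.' would not be enough on its own here, because the non-link fragments between the quotes would pass through as well.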

danmero · July 24, 2008, 11:15pm

.. or replace sed by awk:

tr -d '\n' < file | awk -F'"' -v v="MYserver" '{for(i=1;i<=NF;i++){if(match($i,"/"v)) print $i}}'

caprica13 · July 25, 2008, 12:27pm

again, unbelievable. thank you guys, what would take me days (if not weeks) to figure out is sometimes just a couple of posts away. anyway this will be a great starting point for me to learn something new and useful.

thanks again.

caprica13 · July 25, 2008, 5:21pm

almost there:

I am now able to get the desired paths, and do further string replacements which is great.

my output ends up being something like this:

output="file1path file2path"

I'd like to further process this and have the two paths in two separate variables:

file1="file1path"
file2="file2path"

what's the best approach here ? I don't think I need an array, just two simple variables.

thanks again for any hints

matrixmadhan · July 26, 2008, 2:14am

this is just a hint

please don't use this as is

firstfile=`echo $output | sed 's/\(.*\) \(.*\)/\1/'`
secondfile=`echo $output | sed 's/\(.*\) \(.*\)/\2/'`

danmero · July 26, 2008, 8:24am

..just a followup, all in one step.

output="file1path file2path"
eval $(sed 's/\(.*\) \(.*\)/file1=\1 file2=\2/' <<< $output)
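
If eval on text derived from an email feels risky, plain parameter expansion can do the same split. This is only a sketch, assuming exactly two paths separated by a single space and no spaces inside either path; output holds the placeholder value used above.

```shell
output="file1path file2path"   # placeholder value from the thread

file1=${output%% *}   # strip the longest suffix starting at the first space
file2=${output#* }    # strip the shortest prefix ending at the first space

printf 'file1=%s\nfile2=%s\n' "$file1" "$file2"
```

Nothing from the input is executed here, and the expansions are POSIX, so this should not depend on bash-specific features.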

caprica13 · July 28, 2008, 12:15pm

thanks everyone (especially danmero )

case closed

era · July 29, 2008, 1:46am

The solutions posted so far fail to cope with the = followed by newlines part of the encoding. Also there are other characters which might or might not be encoded using quoted-printable. I would recommend that you split the processing into two steps: decoding the QP, and extracting the information you want.

Here's an attempt at defusing the QP encoding:

perl -0777 -pe 's/=\n//g; s/=([0-9A-F]{2})/chr(oct("0x$1"))/ge;' inputfile

You could pipe the output from that to what you already have (use it instead of the tr command you had before) or extend the Perl script to also extract the information you require:

perl -0777 -ne 's/=\n//g; s/=([0-9A-F]{2})/chr(oct("0x$1"))/ge;
    while (m%<a href="(afp://myserver\.[^"]*)"%gi) { print "file", ++$i, "=\"$1\"\n"}' inputfile