How do I feed numbers from awk(1) to tail(1)?

ropers · April 2, 2008, 1:41pm

Hello,

I am too daft to remember how to properly feed numbers that I've extracted with awk(1) to tail(1).

The actual question is probably a lot simpler than the context, but let me give you the context anyway:

I've just received some email that was sent with MS Outlook and arrived in my mailbox garbled. It was supposed to contain .jpeg images, but it just contained garbled text. Looking at the raw source of the email, I realised that for some reason the .jpg files had been uuencoded but not uudecoded. Here is a truncated part of the email:

Delivered-To: ropers
(...)
Received: from mail.gmx.net (mail.gmx.net)
        by mx.google.com (...)
Received-SPF: pass (google.com: permitted sender)
Authentication-Results: mx.google.com; spf=pass (google.com: permitted (...)
Received: (qmail invoked by alias); 02 Apr 2008 11:47:01 -0000
Received: from p57B96BD3.dip.t-dialin.net (EHLO babe) by mail.gmx.net (mp006) with SMTP; 02 Apr 2008 13:47:01 +0200
(...)
From: "GRopers"
To: "'Ropers'" 
Subject: Pictures
Date: Wed, 2 Apr 2008 13:46:58 +0200
X-Mailer: Microsoft Office Outlook, Build 11.0.5510
(...)
X-MimeOLE: Produced By Microsoft MimeOLE V6.00.2900.3198
X-Y-GMX-Trusted: 0

begin 666 IMG_1232.jpg
M_]C_X `02D9)1@`!`0$`M "T``#_X1'317AI9@``24DJ``@````/``\!`@`&
M````P@```! !`@`6````R ```!H!!0`!````W@```!L!!0`!````Y@```"@!
M`P`!`````@```#(!`@`4````[@```!,"`P`!`````0````$0`P`!``````L`
(...)
MLVBN)"$(3.<X_2@1!8-Y=OM/8FIO,))/-4[,$QMUQO-8NL>+-/TNX>VG:1YD
MZJB]/QI6`I?$J3_B70*3C,G'Y&O,;E]D;$<UVOCB^%[IEA.B.B2$L XP:X.]
M?Y"O<UFU>11!;Y+$C[Q[GM4^#GDDFH[9=J$XY-2$TGN CU&QISFF4(!">6IK
6?=H8YW4UONTP!N@Y[44UN@^E%,#_V0``
`
end
(...)

Now the email contains just a bunch of jpg pics. I figured out that I could uudecode(1) the pics by saving the email's raw source text as /tmp/tmp.mail and issuing:

uudecode /tmp/tmp.mail

That worked, but only sort of. It extracted the first JPG, but only the first one, and there are several jpegs in that file. I then found that I can grep for the "begin" string that all the jpeg files start with (see above email excerpt), and what's more, I can tell grep(1) to print me the line numbers for each of the "begin" lines it spits out:

grep -n begin < /tmp/tmp.mail

The result is that grep prints this:

35:begin 666 IMG_1232.jpg
587:begin 666 IMG_1229.jpg
1154:begin 666 IMG_1221.jpg
2012:begin 666 IMG_1217.jpg
2938:begin 666 IMG_1215.jpg
4034:begin 666 IMG_1192.jpg
4538:begin 666 IMG_1190.jpg
5227:begin 666 IMG_1189.jpg
5644:begin 666 IMG_1188.jpg
6280:begin 666 IMG_1185.jpg
6891:begin 666 IMG_1184.jpg
7733:begin 666 IMG_1183.jpg
8237:begin 666 IMG_1247.jpg
9134:begin 666 IMG_1244.jpg
9826:begin 666 IMG_1242.jpg
10613:begin 666 IMG_1238.jpg
11297:begin 666 IMG_1237.jpg
11893:begin 666 IMG_1235.jpg
12325:begin 666 IMG_1234.jpg
13217:begin 666 IMG_1233.jpg

So far so good. Now I want to use awk(1) to extract only the line numbers. I currently use:

grep -n begin < /tmp/tmp.mail | awk -F : '{print $1}'

In case you're wondering, -F sets the field separator to the colon, so awk prints only the part before the first ":". Now I've got a list of the line numbers at which the respective uuencoded jpeg files start.
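A minimal sketch of that extraction step, run against a small made-up stand-in for /tmp/tmp.mail (the file contents and the variable names are assumptions for illustration). It also shows a single awk call that should give the same numbers, since NR holds the current line number; anchoring the pattern with ^ avoids matching "begin" in the middle of a line.

```shell
# Build a small stand-in for /tmp/tmp.mail (hypothetical contents).
mail=$(mktemp)
cat > "$mail" <<'EOF'
Subject: Pictures

begin 666 IMG_1232.jpg
M_fake_data
end
begin 666 IMG_1229.jpg
M_fake_data
end
EOF

# Two-step form from the post: grep numbers the lines, awk keeps field 1.
nums=$(grep -n '^begin ' "$mail" | awk -F : '{print $1}')
echo "$nums"

# Single awk call: NR is the number of the line being processed.
nums_awk=$(awk '/^begin /{print NR}' "$mail")
echo "$nums_awk"

rm -f "$mail"
```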

Now I can use tail(1) to feed the jpegs starting at these lines to uudecode(1) for decoding. Because uudecode ignores all but the first jpeg it encounters, I don't need to locate the end of each individual jpeg; I should be able to simply use tail(1) to make uudecode see everything from line 35, then from line 587, then 1154, and so on.

I can successfully do this manually by issuing e.g.:

tail -n +1154 /tmp/tmp.mail | uudecode

Here, tail will list the contents of /tmp/tmp.mail from line 1154 to the end of the file (EOF), and uudecode will decode the first (and only the first) jpeg file it sees, which is the one starting at line 1154.
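As a quick illustration of the +N form (the four-line input is made up): the number is 1-based and inclusive, so +3 starts output at the third line rather than printing the last three lines.

```shell
# tail -n +N starts output at line N (1-based, inclusive).
out=$(printf 'one\ntwo\nthree\nfour\n' | tail -n +3)
echo "$out"
```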

Of course, I now could simply be stupid, and issue the same command again and again and again, manually iterating through the line numbers I have, but there has to be a better way and I want to learn (and thus be able to be lazy in the future ;)).

I tried defining a variable $LINE and feeding that to tail, but I just could not figure out how to properly glue the awk and tail commands together. My last attempts resulted in me having a $LINE variable that contained all the line numbers on one line, and of course tail interpreted these as extraneous file names and bitterly complained. This probably boils down to something really simple, but I just could not figure it out.
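One way to glue the pieces together, sketched here under assumptions, is a while-read loop: it hands tail exactly one line number per iteration, so tail never sees the whole list at once. The sample file is made up, and `head -n 1` stands in for uudecode so the sketch runs without real uuencoded data; in the real case the loop body would be `tail -n +"$n" /tmp/tmp.mail | uudecode`.

```shell
# Small stand-in for /tmp/tmp.mail (hypothetical contents).
mail=$(mktemp)
cat > "$mail" <<'EOF'
Subject: Pictures

begin 666 IMG_1232.jpg
M_fake_data
end
begin 666 IMG_1229.jpg
M_fake_data
end
EOF

# Read one line number per iteration, so tail only ever gets a single number.
seen=$(
  grep -n '^begin ' "$mail" | awk -F : '{print $1}' |
  while read -r n; do
    # Stand-in for "| uudecode": show the first line each run would start at.
    tail -n +"$n" "$mail" | head -n 1
  done
)
echo "$seen"

rm -f "$mail"
```

If the numbers are already in a variable such as $LINE, a `for n in $LINE; do ...; done` loop (with $LINE deliberately left unquoted so the shell splits it into words) should work the same way.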

Any help would be very much appreciated.

PS: The "666" in the email excerpt, in case you're wondering, is just the permissions of the extracted file (rw-rw-rw-). No need to get all Christian about it :D.

unilover · April 2, 2008, 2:23pm

try this:

egrep -n 'zcat|touch' preou*|\
awk -F: '{print $1}'|\
paste -sd ",\n" -|\
sed 's=\(.*\)=sed -n "\1p" mymail | uudecode='|\
sh

You can run each consecutive piped command on its own to see what it produces and how the complete pipe does the job.

unilover · April 2, 2008, 2:25pm

Sorry! "zcat|touch" preou* was in my test!!

You should have "begin|end" mymail.txt
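To see the middle stages in isolation, here is a small sketch that feeds four line numbers taken from the thread through paste and sed without executing anything. The delimiter list ",\n" makes paste alternate between a comma and a newline, which pairs each begin line number with the following end line number; sed then wraps each pair in a command. The variable names are assumptions for illustration.

```shell
# Line numbers as they would come out of grep | awk (values from the thread).
pairs=$(printf '35\n585\n587\n1152\n' | paste -s -d ',\n' -)
echo "$pairs"

# Each pair becomes a sed address range inside a generated command line.
cmds=$(echo "$pairs" | sed 's=\(.*\)=sed -n "\1p" mymail | uudecode=')
echo "$cmds"
```

Only when this text is piped into sh do the generated commands actually run.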

Franklin52 · April 2, 2008, 2:35pm

Another approach:

awk '/begin 666/{f=1;close(name);name=$3;next}f{print > name}' /tmp/tmp.mail

Regards

ropers · April 2, 2008, 6:56pm

Thanks for your replies, unilover and Franklin52. I intend to work through both of your suggestions, to learn from them. I am of course aware that there are probably dozens or hundreds of possible solutions to this problem; but I'm also trying to take things one step at a time and figure out what was missing in my attempts.

Right now I'm looking at unilover's solution, and I'm a little bit stumped:

I've tried the first part of his solution, and this is what I initially got:

grep -n -E "begin|end" /tmp/tmp.mail
11:Received-SPF: pass (google.com: domain of gropers@xxx.xxx designates 192.168.64.20 as permitted sender) client-ip=192.168.64.20;
12:Authentication-Results: mx.google.com; spf=pass (google.com: domain of gropers@xxx.xx designates 192.168.64.20 as permitted sender) smtp.mail=gropers@xxx.xxx
35:begin 666 IMG_1232.jpg
585:end
587:begin 666 IMG_1229.jpg
1152:end
1154:begin 666 IMG_1221.jpg
2010:end
2012:begin 666 IMG_1217.jpg
2936:end
2938:begin 666 IMG_1215.jpg
4032:end
4034:begin 666 IMG_1192.jpg
4536:end
4538:begin 666 IMG_1190.jpg
5225:end
5227:begin 666 IMG_1189.jpg
5642:end
5644:begin 666 IMG_1188.jpg
6278:end
6280:begin 666 IMG_1185.jpg
6889:end
6891:begin 666 IMG_1184.jpg
7731:end
7733:begin 666 IMG_1183.jpg
8235:end
8237:begin 666 IMG_1247.jpg
9132:end
9134:begin 666 IMG_1244.jpg
9824:end
9826:begin 666 IMG_1242.jpg
10611:end
10613:begin 666 IMG_1238.jpg
11295:end
11297:begin 666 IMG_1237.jpg
11891:end
11893:begin 666 IMG_1235.jpg
12323:end
12325:begin 666 IMG_1234.jpg
13215:end
13217:begin 666 IMG_1233.jpg
13765:end

Note the output lines 11 and 12, which don't contain "begin" or "end". (Yes, I changed the email addresses and IP addresses, but I did that in /tmp/tmp.mail, in which lines 11 and 12 really don't contain either string.)

It also didn't matter whether I used grep -E or egrep, or single or double quotes.

At long last, I finally figured out what tripped up grep: It turns out that lines 11 and 12 both contained carriage returns (\r). vi confirmed this by showing the familiar ^M characters. I then did :%s/\r//g in vi and saved the file, after which grep worked as expected, and the lines 11 and 12 were no longer included in its output. So I know what made the error occur. What I don't understand is why this error occurred. Why would extraneous carriage returns cause grep to include these lines?

Many thanks for your help.

ropers · April 2, 2008, 8:26pm

Arrgh!!! Never mind the aforesaid, I've just figured things out -- turns out I was wrong when I wrote that the lines 11 and 12 don't contain "begin" or "end". Both lines contain the word "sender".

The carriage returns were a total red herring. I was wrong when I thought that removing them had fixed things. It turns out that when I tested grep after removing them I had only grepped "begin" and not "begin|end". :rolleyes:

So it's probably not a good idea to grep for "end" in this case without throwing away the email headers first. In my initial --only partially successful-- approach this also was unnecessary, because tail lends itself really well to clipping off the upper parts of the file without even looking at them (and uudecode ignores anything beyond the first uuencoded file, so the rest can be left as is).
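An alternative to stripping the headers is to anchor the patterns so that words like "sender" no longer match. The sketch below compares the two on a made-up sample; the stricter regular expression is an assumption about what the begin and end lines look like (a begin line starting with "begin", an octal mode and a space, and an end line consisting of just "end").

```shell
# Small stand-in for /tmp/tmp.mail (hypothetical contents).
mail=$(mktemp)
cat > "$mail" <<'EOF'
Received-SPF: pass (permitted sender)
Subject: Pictures

begin 666 IMG_1232.jpg
M_fake_data
end
EOF

# Unanchored: "sender" contains "end", so the header line matches too.
loose=$(grep -n -E 'begin|end' "$mail" | awk -F : '{print $1}')

# Anchored: only lines starting with "begin <mode> " or consisting of "end".
strict=$(grep -n -E '^begin [0-7]+ |^end$' "$mail" | awk -F : '{print $1}')

echo "loose:  $(echo $loose)"
echo "strict: $(echo $strict)"

rm -f "$mail"
```

Note that if the file has carriage returns at the ends of lines, `^end$` would not match until they are removed.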

On a more positive note, I've found that

 grep -n -E "begin|end" /tmp/tmp.mail | awk -F : '{print $1}' | paste -s -d ",\n" - | sed 's=\(.*\)=sed -n "\1p" /tmp/tmp.mail | uudecode='| sh -

does indeed work -- though it complains:

uudecode: stdin: No `begin' line

which is entirely understandable, because the command line

sed -n "11,12p" /tmp/tmp.mail | uudecode

that's generated and passed to sh is bogus, and of course line 11 has no "begin" in it.

I still don't fully understand the nested sed stuff though. I'll try some more and/or come back with more questions. Also, if someone has a hint to get my initial approach with awk and tail to work, that would be really cool.

But again, many thanks so far.

ropers · April 4, 2008, 6:50am

I think I've sort of cracked it. I've understood the gist of unilover's solution, and I've managed to incorporate part of his approach into my initial attempted solution. Now I've got a working solution that's based on what I tried initially and on unilover's solution. And look mom, no error messages!

Here it is:

grep -n begin /tmp/tmp.mail | awk -F : '{print $1}' | sed 's:\(.*\):tail -n +\1 /tmp/tmp.mail | uudecode:' | sh -

So first grep lists all the lines in /tmp/tmp.mail that contain "begin", and it prefixes the lines it prints with their respective line numbers in /tmp/tmp.mail (that's what -n does).

Then awk throws away everything except the line numbers.

Then sed replaces each line number with "tail -n +\1 /tmp/tmp.mail | uudecode", where "\1" is substituted with the respective line number. Because this string contains slashes ("/"), we're not using the / as the delimiter in the substitute command; we're using an arbitrary other character instead (":" in our case). We could also escape them like so

sed 's/\(.*\)/tail -n +\1 \/tmp\/tmp.mail | uudecode/'

, but we're too lazy for that.

Then each of the "tail -n +\1 /tmp/tmp.mail | uudecode" strings are passed to sh, so they can be executed instead of being just standard output.
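A handy way to check this kind of generated-command pipeline is a dry run: leave off the final `| sh -` and read what would have been executed. The sketch below does that with three of the line numbers from the thread; nothing is decoded and /tmp/tmp.mail does not need to exist, because sed only produces text.

```shell
# Generate the commands but do not execute them (no trailing "| sh -").
cmds=$(printf '35\n587\n1154\n' |
  sed 's:\(.*\):tail -n +\1 /tmp/tmp.mail | uudecode:')
echo "$cmds"
```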

The tail and uudecode commands work as I described in my earlier post.

I guess that leaves me to try and crack Franklin52's solution next.

ropers · April 4, 2008, 4:54pm

Ok, I've now looked at Franklin52's proposed solution. While I don't know awk that well and haven't understood it completely, I know that it doesn't really apply to my issue at all, and does something completely different.

Here's what Franklin52 wrote:

awk '/begin 666/{f=1;close(name);name=$3;next}f{print > name}' /tmp/tmp.mail

That means /tmp/tmp.mail is used as input for the

/begin 666/{f=1;close(name);name=$3;next}f{print > name}

awk script.

The script looks for the "begin 666" pattern and then extracts the file name written after that (i.e. $3 = the third item on that line, after "begin" and "666"). This file name is stored in a variable called name. The script then writes the contents of /tmp/tmp.mail from after the current "begin 666" line up to the next line containing "begin 666" into a file that is called whatever the name variable says. This process is repeated for every "begin 666" line.

The result is that the current directory contains a bunch of files that are correctly named (IMG_something.jpg in my case), but they are not jpeg files, because they have not been uudecoded. They just each contain the uuencoded data for a single JPEG file without the begin line (but with the end line still included).

It's not even possible to then manually uudecode these files, because uudecode expects proper begin and end lines (because otherwise it wouldn't know what to name its output file and what permissions to give it, and where the uuencoded data ends):

$ uudecode IMG_1183.jpg 
uudecode: IMG_1183.jpg: No `begin' line

Even if the begin line were still included, it wouldn't work this way, because the (input) file with the uuencoded data would have the same filename as the (output) file uudecode would try to create. Out of curiosity, I tried this by using vi to edit the "IMG_1232.jpg" file that Franklin52's awk command creates in my case; I inserted a "begin 666 IMG_1232.jpg" line in front of the uuencoded data. But that just caused uudecode to fail:

$ uudecode IMG_1232.jpg
uudecode: IMG_1232.jpg: Short file

Apparently uudecode reads part of its input file ("IMG_1232.jpg" in this case), then writes out the decoded data to the specified filename ("IMG_1232.jpg" again), then tries to read the next chunk of uuencoded data, but can't because it has just clobbered its (longer) source file with its (short partial) output, so it screams "IMG_1232.jpg: Short file" and dies.

So while the awk script is interesting, it is not a solution to my issue. Sorry Franklin52. Happily, I've already got a fully working solution, thanks to unilover's input.
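For completeness, here is one possible way the awk approach could be adapted, sketched under assumptions rather than offered as what Franklin52 intended: keep each begin line in the output and write every section to a separately named part file, so that uudecode's output names cannot collide with its inputs. The partNN.uu names and the tiny two-attachment sample (whose payloads encode "abc" and "def") are made up for illustration, and the decode step only runs if uudecode is installed.

```shell
work=$(mktemp -d)
cd "$work" || exit 1

# Hypothetical two-attachment mail; the payloads encode "abc" and "def".
cat > mail.txt <<'EOF'
Subject: Pictures

begin 644 a.txt
#86)C
`
end
begin 644 b.txt
#9&5F
`
end
EOF

# Start a new part file at each begin line and keep that line in the output.
awk '/^begin [0-7]+ /{ if (out) close(out); out = sprintf("part%02d.uu", ++n) }
     out { print > out }' mail.txt

ls part*.uu

# Decode each part if uudecode is available; the outputs take their names
# from the begin lines (a.txt, b.txt), so they do not clash with the inputs.
if command -v uudecode >/dev/null 2>&1; then
  for f in part*.uu; do uudecode "$f"; done
fi
```

Anything before the first begin line is dropped, and any text between an end line and the next begin line lands at the tail of the previous part, where uudecode should ignore it.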

Franklin52 · April 5, 2008, 7:23am

Well, with the given approach, it should be a challenge for you to make it work.

Regards