Wget, grep, sort, sed in 1 command/script

Hi, I need to join these statements for efficiency, without having to make a new directory for each batch. I'm annotating the commands below.

wget -q -r -l1 URL 
^^ can't use -O - here and pipe | to grep because of -r
grep -hrio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\+\b" * > first.txt
^^ Need to grep the output of wget only; at present it's grepping other files in the directory.
sort -u first.txt > second.txt && sed '0~5 a\\' second.txt > third.txt
^^ piping | these doesn't work; && does.
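
For context, here's roughly the per-batch version I'm running now (the directory name and URL are placeholders); the mkdir and cd for every batch is the part I'd like to eliminate:

#!/bin/bash
# Current per-batch workflow (sketch): a fresh directory so grep only sees this batch's files.
mkdir batch1 && cd batch1 || exit 1
wget -q -r -l1 URL
grep -hrio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\+\b" * > first.txt
sort -u first.txt > second.txt && sed '0~5 a\\' second.txt > third.txt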

Thanks in advance for direction.

What operating system are you using?

What shell are you using?

If the commands:

grep -hrio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\+\b" * > first.txt
sort -u first.txt > second.txt && sed '0~5 a\\' second.txt > third.txt

produce the output you want in third.txt after running the command:

wget -q -r -l1 URL

what is the difference between third.txt and fourth.txt after you also run:

grep -hrio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\+\b" * |
    sort -u |
    sed '0~5 a\\' > fourth.txt

? Are any diagnostic messages produced by either of these sets of commands? If so, exactly what are those diagnostic messages?
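
(A quick way to compare the two output files, assuming both runs were done in the same directory, would be:)

# diff prints nothing and exits 0 when the files are identical.
diff third.txt fourth.txt && echo 'identical'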

Hi, thanks for responding:
Linux Mint 18.1
GNU bash, version 4.3.46(1)-release (x86_64-pc-linux-gnu)

The results are identical in the example you suggested, no errors.

wget's -O - option doesn't work in recursive mode, so how can I feed wget's output to the grep statement programmatically, without grep reading every previously wget'd folder in the local directory? (That's what is happening now.)

According to the wget man page, old versions (before version 1.11) and new versions of wget (version 1.11.2 and later, although it may issue a warning in this case) should work just fine with:

wget -q -r -l1 -O - URL |
    grep -hrio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\+\b" |
    sort -u |
    sed '0~5 a\\' > fifth.txt
Trying that gives:

wget -q -r -l1 -O - URL/ |
>     grep -hrio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\+\b" |
>     sort -u |
>     sed '0~5 a\\' > 6.txt
-k or -r can be used together with -O only if outputting to a regular file.

...and this yields an empty file in its own directory (or grabs all previously wget'd folders in a shared directory; see my post above):

wget -q -r -l1 -O wget.html URL/ |
    grep -hrio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\+\b" |
    sort -u |
    sed '0~5 a\\' > 6.txt

So the problem persists, as mentioned in my post above: grep is not getting the wget output I'm trying to attach to it. When the URL is downloaded, it creates a folder with many files, so I can see why sending everything to one .html file (or any single file) is not going to work. Even if it did work, it's not getting piped to the grep statement: run in a directory alongside other previously wget'd URLs, the script extracts info from all of those other folders too. So I need the script to read only the folder wget'd by the current command. I can do this by creating a new directory (I know, I'm repeating myself...), but I'd like to do it in the same directory.
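
The only workaround I've thought of so far is a sketch that keys off the host-named folder wget -r creates, and points grep at just that folder instead of at * (it assumes the same host hasn't already been downloaded into this directory on an earlier run):

url='http://example.com/somepage'       # placeholder
host=${url#*://}; host=${host%%/*}      # wget -r saves its files under a directory named after the host
wget -q -r -l1 "$url"
grep -hrio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\+\b" "$host" |
    sort -u |
    sed '0~5 a\\' > third.txt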

Would using a fifo (as a regular file for -O) be an option?


Just Googled fifo... this isn't doing anything different from what I've already posted (no error either):

mkfifo /tmp/jobqueue |
wget -q -r -l1 -O /tmp/jobqueue URL/ |
    grep -hrio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\+\b" |
    sort -u |
    sed '0~5 a\\' > 7.txt

That's not how fifos work.

mkfifo /tmp/jobqueue

# Need to put this in the background, since it will wait until both
# ends of the fifo are open
wget -q -r -l1 -O /tmp/jobqueue URL/ &

< /tmp/jobqueue grep -hrio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\+\b" |
    sort -u |
    sed '0~5 a\\' > 7.txt

wait

rm -f /tmp/jobqueue

That's interesting, thanks. Probably getting closer?

>     sed '0~5 a\\' > 7.txt
-k or -r can be used together with -O only if outputting to a regular file.

That's the clue we needed: the reason it doesn't work when you put the same command in a pipe chain.

It doesn't work, simply because wget refuses to output to anything but a regular file; pipes, named or otherwise, are no good. For -k, that's because it has to edit the file after it downloads it. For -p, it's because it opens the output file more than once. Either of those would indeed do awful things to a pipe.

So, not a lot you can do. Sorry.


Thanks for the explanation; you'll see I did post that error message a few posts ago.

So I'll have to make a new directory every time I run this sequence... pretty inefficient. Anyhow, if you or anyone else gets another idea on this, please update. Thanks.

I don't see why using a folder makes it inefficient. What exactly are you trying to avoid?

Making and cd'ing into all those nested directories is inefficient, when I could have all the URL folders in one directory.
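
About the best I can come up with is letting the script make and remove a scratch directory on its own, something like this sketch (the URL and the output filename are placeholders):

#!/bin/bash
# Sketch: per-run scratch directory handled by the script; the result lands back in the starting directory.
outdir=$PWD
workdir=$(mktemp -d) || exit 1
cd "$workdir" || exit 1
wget -q -r -l1 URL
grep -hrio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\+\b" . |
    sort -u |
    sed '0~5 a\\' > "$outdir/emails.txt"
cd "$outdir" && rm -rf "$workdir"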

If you don't want files in folders, why are you downloading recursively?

If you want multiple files, why are you downloading recursively, when you'll have no control over the order in which they're retrieved?

What exactly are you trying to do, that wget is not doing? What is the grep actually for? What sort of data are you retrieving?

Perhaps explain, from the beginning, what you are actually trying to do? I think we may have taken a wrong turn somewhere.


Of course the files will be in folders. I just wanted all the downloaded folders in one directory, and that's not gonna happen, given that wget's -O output can't be piped to grep. wget and grep are working as expected; there's no wrong turn. And thanks for showing me about fifos.