Hi, I need to join these statements for efficiency, and without having to make a new directory for each batch. I'm annotating commands below.
wget -q -r -l1 URL
^^ can't use -O - here and pipe | to grep because of -r
grep -hrio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\b" * > first.txt
^^ Need to grep the output of wget only; at present it's grepping other files in the directory.
sort -u first.txt > second.txt && sed '0~5 a\\' second.txt > third.txt
^^ piping | these doesn't work; && does.
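An aside on the sed address used above: `0~5` is GNU sed's `first~step` form, which selects line `first` and then every `step`-th line after it, so `0~5` matches lines 5, 10, 15, and so on; the `a` command then appends a line (here an empty one) after each match. A quick standalone way to see which lines a step address selects (GNU sed only, not part of the pipeline):

```shell
# GNU sed first~step addressing: 0~5 selects lines 5, 10, 15, ...
# -n suppresses default output; p prints only the selected lines.
seq 12 | sed -n '0~5p'   # → prints 5 and 10
```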
Hi, thanks for responding:
Linux Mint 18.1
GNU bash, version 4.3.46(1)-release (x86_64-pc-linux-gnu)
The results are identical in the example you suggested, no errors.
wget's -O - option doesn't work in recursive mode, so how can I feed wget's output to the grep statement programmatically, without grep also reading every previously wget'd folder in the local directory? (Which is what is happening now.)
According to the wget man page, both old versions of wget (before 1.11) and new versions (1.11.2 and later, though the latter may issue a warning) should work just fine with:
wget -q -r -l1 -O - URL/ |
grep -hio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\b" |
sort -u |
sed '0~5 a\\' > 6.txt
-k or -r can be used together with -O only if outputting to a regular file.
...and this yields an empty file in its own directory (or grabs all previous wget'd folders in a shared directory--see my post above).
So the problem persists as described in my post above: grep is not reading the output of the wget I'm trying to attach to it. The download creates a folder with many files, so I can see why collecting everything into one .html file (or any single file) is not going to work. And even if it did, it isn't being piped to the grep statement: when the script runs in a directory holding other previously wget'd URLs, it extracts info from all those other folders too. So I need the script to read only the folder fetched by the current command. I can do this by creating a new directory each time (I know, I'm repeating myself...), but I'd like to do it in the same directory.
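One way to get exactly that, keeping every batch in the same directory while grepping only the files from the current run, is a timestamp marker: stamp a file before the download, then let `find -newer` select only files created after it. This is a sketch of mine, not something from the thread; the `mkdir`/`echo` line simulates the fetch so the sketch runs offline, and in real use the commented `wget` line (with a real URL in place of `URL`) replaces it.

```shell
workdir=$(mktemp -d)                  # scratch dir so the demo is self-contained
cd "$workdir"
echo 'old@example.com' > old.html     # pretend: leftover from an earlier batch
touch .marker                         # timestamp: everything newer is "this batch"
sleep 1                               # ensure the new files get a later mtime
# wget -q -r -l1 URL                  # real download step goes here (placeholder URL)
mkdir -p site && printf 'contact: new@example.net\n' > site/index.html  # simulated fetch
find . -type f -newer .marker -print0 |
  xargs -0 grep -hio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\b" |
  sort -u > batch.txt
cat batch.txt                         # only new@example.net; old.html is skipped
```

Old batches stay where they are; each run only ever greps what the marker says is new.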
mkfifo /tmp/jobqueue
# Need to put this in the background, since it will wait until both
# ends of the fifo are open
wget -q -r -l1 -O /tmp/jobqueue URL/ &
< /tmp/jobqueue grep -hio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\b" |
sort -u |
sed '0~5 a\\' > 7.txt
wait
rm -f /tmp/jobqueue
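For anyone unfamiliar with named pipes, the reason the wget line above is backgrounded is that opening a FIFO for writing blocks until some process opens it for reading (and vice versa). A minimal standalone demonstration, independent of wget:

```shell
fifo=$(mktemp -u)             # generate a free path; mkfifo creates the node
mkfifo "$fifo"
printf 'hello\n' > "$fifo" &  # writer blocks until a reader opens the FIFO
read line < "$fifo"           # reader end; unblocks the writer
wait                          # reap the background writer
rm -f "$fifo"
echo "$line"                  # → hello
```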
That's the clue we needed: it explains why it doesn't work when you put the same command in a pipe chain.
It doesn't work simply because wget refuses to output to anything but a regular file. Pipes, named or otherwise, are no good. For -k, this is because it has to edit the file after it downloads it. For -p, this is because it opens the output file more than once. Both of which would indeed do awful things to a pipe.
Thanks for the explanation; you'll see I did post that error message a few posts ago.
So I'll have to make a new directory every time I run this sequence... pretty inefficient. Anyhow, if you or anyone else gets another idea on this, please update. Thanks.
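If the per-batch directory really is unavoidable, it can at least be automated so it costs nothing to create or clean up. A hedged sketch (mine, not from the thread; `URL` is a placeholder, and the `mkdir`/`echo` line simulates the fetch so the sketch runs offline): `mktemp -d` invents and creates the directory, wget's `-P` drops the recursive download into it, and only that tree is grepped before being deleted.

```shell
batch=$(mktemp -d)                # fresh, uniquely named directory per run
# wget -q -r -l1 -P "$batch" URL  # real download step (placeholder URL)
mkdir -p "$batch/site"            # simulated fetch, so the sketch runs offline
echo 'mail me at y@z.org please' > "$batch/site/index.html"
grep -hrio "\b[a-z0-9.-]\+@[a-z0-9.-]\+\.[a-z]\{2,4\}\b" "$batch" |
  sort -u > results.txt
rm -rf "$batch"                   # nothing accumulates between runs
```

Since grep is pointed at `"$batch"` explicitly, it can never wander into earlier downloads, and `rm -rf` means the shared directory never fills up with per-batch folders.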
Of course the files will be in folders. I just wanted all the downloaded folders in 1 directory, and that's not gonna happen, given that wget -O can't be piped to grep. wget and grep are working as expected; there's no wrong turn. And thanks for showing me about fifo.