Combine several commands in a bash script

georgi58 · October 12, 2013, 2:53am

Hi all,

I have large files of URLs, each ending in "|<number>", where the number is the PageRank (PR) of the website, as in the example below:
http://www.machinokairo.com/2012/05/post-39.html|2

I am using grep to filter the URLs step by step: first, remove all lines ending in "|0" and write the output to a file, then remove all lines ending in "|1" from that result and write the output to a new file, and so on up to "|5". Each step removes one PR value and leaves the rest in a separate file. For now I use the following commands to do that:

grep --invert-match "|0" sitelist > sitelist_PR1.txt
grep --invert-match "|1" sitelist_PR1.txt > sitelist_PR2.txt
.
.
grep --invert-match "|5" sitelist_PR5.txt > sitelist_PR6.txt

I would appreciate it if someone could help me with a bash script that performs all of the above.

MadeInGermany · October 12, 2013, 3:26am

A character set catches the whole range: "|[0-5]"
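
To illustrate the character set, here is a minimal sketch on hypothetical sample data (the example.com URLs and the temporary directory are assumptions, not part of the original files). It removes every PR from 0 to 5 in one pass, so it produces only the final file, not the intermediate ones. It also shows why anchoring with "$" matters: an unanchored "|1" also matches "|10".

```shell
#!/bin/bash
# Hypothetical sample data in the thread's "url|PR" format.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' \
  'http://example.com/a|0' \
  'http://example.com/b|2' \
  'http://example.com/c|5' \
  'http://example.com/d|6' \
  'http://example.com/e|10' > sitelist

# In a basic regular expression "|" is a literal character.
# "$" anchors the match to the end of the line, so "|10" is kept.
grep -v '|[0-5]$' sitelist > sitelist_PR6.txt
cat sitelist_PR6.txt
# http://example.com/d|6
# http://example.com/e|10

# Without the anchor, "|1" inside "|10" matches and that line is dropped too.
grep -v '|[0-5]' sitelist > unanchored.txt
cat unanchored.txt
# http://example.com/d|6
```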

georgi58 · October 12, 2013, 3:54am

Yes, right. But how do I implement all of this in a bash script using grep or sed and still get all the separate output files?

Scrutinizer · October 12, 2013, 4:46am

awk OK? Try something like:

awk -F\| '/http:/{print $1 > ("sitelist_PR" $2 ".txt")}' file

MadeInGermany · October 12, 2013, 5:20am

grep -v "|0" sitelist | tee sitelist_PR1.txt |
grep -v "|1" | tee sitelist_PR2.txt |
.
.
grep -v "|5" > sitelist_PR6.txt

georgi58 · October 12, 2013, 7:33am

awk -F\| '/http:/{print $1 > ("sitelist_PR" $2 ".txt")}' file

Thank you for your input.
The command works a bit differently from what I need: it creates output files from sitelist_PR .txt to sitelist_PR9 .txt. I actually only needed up to PR6, but that is fine.

There are two things, however, that differ from the desired output:

Each file contains only URLs whose PR matches the file name, i.e. sitelist_PR1 .txt contains URLs with PR1 only. My goal was to remove those URLs and keep everything higher than PR1 in this file.
When I look at the file name, I see a blank space before .txt.

---------- Post updated at 02:33 PM ---------- Previous update was at 02:29 PM ----------

Thank you for your efforts, but I really need a script or a one-line command. I have already tried the following:

#!/bin/bash

for PR in {0..5} ; do
 grep --invert-match "|${PR}$" sitelist.txt > sitelist_PR${PR}.txt
done

but unfortunately it creates only empty files.
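
For comparison, a minimal sketch of a cumulative loop that mirrors the manual commands from the first post. Two things are worth checking in the attempt above, both stated here as assumptions about the cause: the loop reads sitelist.txt while the earlier commands read sitelist (if the input file does not exist, grep prints an error and the redirection still creates an empty output file), and every pass filters the original file instead of the previous result. The sample data, the temporary directory, and the variable names input and output are hypothetical.

```shell
#!/bin/bash
# Hypothetical sample data in the thread's "url|PR" format.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' \
  'http://example.com/a|0' \
  'http://example.com/b|1' \
  'http://example.com/c|2' \
  'http://example.com/d|6' \
  'http://example.com/e|10' > sitelist

input=sitelist
for pr in 0 1 2 3 4 5; do
  output="sitelist_PR$((pr + 1)).txt"
  # Remove lines ending in "|<pr>"; "$" keeps "|1" from matching "|10".
  grep -v "|${pr}\$" "$input" > "$output"
  # The next pass filters this result, not the original file.
  input=$output
done

cat sitelist_PR3.txt
# http://example.com/d|6
# http://example.com/e|10
```

If the real files come from Windows, lines may end in a carriage return, in which case the "$" anchor would not match as written; that is a guess, not something the thread confirms.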

MadeInGermany · October 12, 2013, 8:19am

Save the commands to a file, and you can run it with:

bash file

Scrutinizer · October 12, 2013, 8:55am

georgi58:

awk -F\| '/http:/{print $1 > ("sitelist_PR" $2 ".txt")}' file
Thank you for your input.
The command works a bit differently from what I need: it creates output files from sitelist_PR .txt to sitelist_PR9 .txt. I actually only needed up to PR6, but that is fine.

There are two things, however, that differ from the desired output:

Each file contains only URLs whose PR matches the file name, i.e. sitelist_PR1 .txt contains URLs with PR1 only. My goal was to remove those URLs and keep everything higher than PR1 in this file.

When I look at the file name, I see a blank space before .txt.
[..]

So more like this?

awk -F\| '/http:/{for(i=1; i<=6; i++) if($2>=i)print $1 > ("sitelist_PR" i ".txt")}' file
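
To see what this command puts in each file, here is a small worked run on hypothetical sample data (the example.com URLs and the temporary directory are assumptions). A URL with PR n is written to every file from sitelist_PR1.txt up to sitelist_PRn.txt, capped at 6, so each file holds all URLs at or above its threshold.

```shell
#!/bin/bash
# Hypothetical sample data in the thread's "url|PR" format.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' \
  'http://example.com/a|0' \
  'http://example.com/b|1' \
  'http://example.com/c|2' \
  'http://example.com/d|6' > file

# Same command as in the post above: $1 is the URL, $2 is the PR.
awk -F\| '/http:/{for(i=1; i<=6; i++) if($2>=i)print $1 > ("sitelist_PR" i ".txt")}' file

# Line count per output file.
wc -l sitelist_PR*.txt
```

With this input, sitelist_PR1.txt gets three URLs (b, c, d), sitelist_PR2.txt gets two (c, d), and sitelist_PR3.txt through sitelist_PR6.txt get only d. The PR0 line is written nowhere.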

georgi58 · October 12, 2013, 9:14am

Works!! Thank you very much.

---------- Post updated at 04:14 PM ---------- Previous update was at 04:10 PM ----------

This works fine, with just one small detail: is there any way to preserve the "|<number>" at the end of each line?

Thank you very much.

Scrutinizer · October 12, 2013, 9:16am

Yes, use print instead of print $1. You are welcome.
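
Put together, the final command might look like the sketch below. A bare print writes the whole input line, so the trailing "|<number>" is kept. The sample data, the temporary directory, and the max variable (passed with -v so the upper PR limit is not hard-coded) are hypothetical additions, not part of the thread.

```shell
#!/bin/bash
# Hypothetical sample data in the thread's "url|PR" format.
workdir=$(mktemp -d)
cd "$workdir" || exit 1
printf '%s\n' \
  'http://example.com/a|0' \
  'http://example.com/b|1' \
  'http://example.com/c|2' \
  'http://example.com/d|6' > file

# "max" is a hypothetical variable: the highest PR threshold to write a file for.
awk -F'|' -v max=6 '
  /http:/ {
    for (i = 1; i <= max; i++)
      if ($2 >= i)
        print > ("sitelist_PR" i ".txt")   # bare print keeps the "|PR" suffix
  }
' file

cat sitelist_PR2.txt
# http://example.com/c|2
# http://example.com/d|6
```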