awk command with a loop

aberg · August 11, 2017, 12:12pm

Dear all,

I would be grateful for your help with the following.

I have the following file (file.txt), which is about 10,000 lines long:

ID1  ID2  0  1  0.5  0.6
ID3  ID4  0  0  0.4  0.8
ID1  ID5  0  1  0.5  0.3
ID6  ID2  1  0  0.4  0.8

The IDs in the first two columns can each occur between 1 and 10 times in the file (in either column 1 or column 2).

What I want to achieve:

I want to scan this file line by line, and print IDs to an ever-growing exclusion list if they meet the following criteria:

If $3 > $4, print $2 (ID2) > exclusionlist.txt
If $3 < $4, print $1 (ID1) > exclusionlist.txt
If $3==$4 && $5 < $6, print $2 (ID2) > exclusionlist.txt
If $3==$4 && $5 > $6, print $1 (ID1) > exclusionlist.txt

So applying this to row 1, either ID1 or ID2 should have been added to my exclusion list.

I then want to delete all lines in the file where that ID from the exclusion list appears. This can be up to 10 rows.

Output for file.txt once row 1 has been scanned:

ID3 ID4 0 0 0.4 0.8
ID6 ID2 1 0 0.4 0.8

And exclusionlist.txt:
ID1

I then want to start again at the new row 1, and execute the same process, but keep adding my exclusion from the new row 1 to the same exclusion list.

The commands that I have at my disposal are:

awk 'NR==1{print;}' file.txt
awk '{if ($3>$4 || $3==$4 && $5<$6) print $2;}' file.txt > exclusionlist.txt
awk '{if ($3<$4 || $3==$4 && $5>$6) print $1;}' file.txt > exclusionlist.txt
grep -v -f exclusionlist.txt file.txt

But there are problems inherent in this:

The exclusionlist.txt does not 'keep growing'.
Also, how do I loop it back so that it starts again at line 1?

I would be grateful for any solutions.

Thank you,

A.B.

vbe · August 11, 2017, 12:17pm

In the second code part, you never append to exclusionlist.txt...
Quite sure you have only one awk result there, the last one...

aberg · August 11, 2017, 12:21pm

Yes, I want to append to the same list (rather than over-writing), and I'm not sure how to do that.
Also, I want this to loop so that it starts again at the new line 1 once the original line 1 (plus any other lines containing the exclusion) has been removed.

vbe · August 11, 2017, 12:29pm

e.g.

If $3 > $4, print $2 (ID2) > exclusionlist.txt    # This one creates the file or, if it exists, overwrites it
If $3 < $4, print $1 (ID1) >> exclusionlist.txt   # Then here you append...
If $3==$4 && $5 < $6, print $2 (ID2) >> exclusionlist.txt
If $3==$4 && $5 > $6, print $1 (ID1) >> exclusionlist.txt
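To illustrate the difference between the two redirections, here is a minimal shell sketch. The scratch file created with mktemp is only a stand-in for exclusionlist.txt and is not part of the original posts.

```shell
#!/bin/sh
# Hypothetical scratch file standing in for exclusionlist.txt.
list=$(mktemp)

echo ID1 >  "$list"   # '>' creates the file, or truncates it if it already exists
echo ID2 >  "$list"   # truncates again, so ID1 is lost
echo ID3 >> "$list"   # '>>' appends, so the file now holds ID2 and ID3

result=$(cat "$list")
printf '%s\n' "$result"
rm -f "$list"
```

Note that this applies to the shell. Inside a single awk program, print > "file" truncates the file only when it is first opened; later prints in the same run go to the same open file.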

aberg · August 11, 2017, 12:39pm

Thank you vbe.

So could I incorporate that into a bash script? Say I renamed my file.txt to 1.txt:

#!/bin/bash
for i in {1..10000}
do
awk 'NR==1{print;}' $i.txt
awk '{if ($3>$4 || $3==$4 && $5<$6) print $2;}' file.txt > exclusionlist.txt
awk '{if ($3<$4 || $3==$4 && $5>$6) print $1;}' file.txt >> exclusionlist.txt
grep -v -f exclusionlist.txt $i.txt > $((i+1)).txt
rm $i.txt
done

Would that help me to execute this function recursively?

RudiC · August 11, 2017, 1:34pm

Let me paraphrase your request: You select either of the IDs in field1 or 2 depending on conditions in the rest of the line, and then remove all occurrences of the selected ID in the rest of the file. Do you HAVE to populate the exclusion file, i.e. do you need it afterwards? Or would a single pass operation be sufficient, removing ALL the applicable IDs?

aberg · August 11, 2017, 1:40pm

Thank you RudiC.

Yes, your interpretation is correct, and it is vitally important that I populate an exclusion file. The original file itself should eventually grind itself down to 0 lines, and it is the exclusion file that I am interested in.

My latest attempt is as follows (it involves having to rename file.txt to 1.txt):

#!/bin/bash
for i in {1..5000}
do
awk 'NR==1{print;}' $i.txt
awk '{if ($3>$4 || $3==$4 && $5<$6) print $2;}' $i.txt > exclusionlist_$i.txt
awk '{if ($3<$4 || $3==$4 && $5>$6) print $1;}' $i.txt >> exclusionlist_$i.txt
grep -v -f exclusionlist_$i.txt $i.txt > $((i+1)).txt
rm $i.txt
done

Due to my poor scripting skills, I (1) have to rename my file after each loop in order for it to be continuously executed, and (2) end up with a new exclusion list per loop rather than a single 'master' exclusion list. I can easily concatenate them all at the end, so this is not a major problem, but it's messy.

The problem I have when I execute this script is that it seems to scan through the whole file on the first pass (rather than just line 1), creating a long exclusion list just from the first run.
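This behaviour is expected: an awk program whose rule has no line-number pattern runs on every record of the file, so the two exclusion commands look at all rows, not just row 1. A minimal sketch of restricting the decision to the first line is shown here, using the sample rows from post #1 in a hypothetical scratch file.

```shell
#!/bin/sh
# Sample data from post #1, written to a hypothetical scratch file.
file=$(mktemp)
cat > "$file" <<'EOF'
ID1  ID2  0  1  0.5  0.6
ID3  ID4  0  0  0.4  0.8
ID1  ID5  0  1  0.5  0.3
ID6  ID2  1  0  0.4  0.8
EOF

# NR==1 limits the rules to the first record; exit stops awk from reading further.
first=$(awk 'NR==1 {
    if      ($3 > $4 || ($3 == $4 && $5 < $6)) print $2
    else if ($3 < $4 || ($3 == $4 && $5 > $6)) print $1
    exit
}' "$file")

printf '%s\n' "$first"
rm -f "$file"
```

For the sample data this selects ID1, matching the exclusionlist.txt shown in post #1 after row 1 has been scanned.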

Any help/suggestions would be greatly appreciated.

Thank you.

AB

RudiC · August 11, 2017, 2:05pm

I'm not yet sure I entirely grasp it, but that "grinding itself down to 0 lines" confirms my gut feeling about the processing. Unfortunately the few sample lines don't allow for thorough testing. How about

awk '
$1 in X  || $2 in X     {next
                        }

$3 >  $4 ||
$3 == $4 && $5 < $6     {TMP = $2
                        }

$3 <  $4 ||
$3 == $4 && $5 > $6     {TMP = $1
                        }

                        {X[TMP]
                         print TMP
                        }
' file
ID1
ID4
ID2
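For cross-checking on a larger file, the row-by-row procedure from post #1 can also be written out literally. The sketch below is much slower than the one-pass awk script and works on scratch copies with hypothetical names. Dropping a row where $3==$4 and $5==$6 is an assumption made here so that the loop terminates; the thread does not define that case.

```shell
#!/bin/sh
# Literal row-by-row version of the procedure from post #1, for cross-checking.
# Assumption: scratch copies (hypothetical names) are used, not the real files.
work=$(mktemp)
excl=$(mktemp)
cat > "$work" <<'EOF'
ID1  ID2  0  1  0.5  0.6
ID3  ID4  0  0  0.4  0.8
ID1  ID5  0  1  0.5  0.3
ID6  ID2  1  0  0.4  0.8
EOF

while [ -s "$work" ]; do
    # Decide which ID of the current first row to exclude.
    id=$(awk 'NR==1 {
        if      ($3 > $4 || ($3 == $4 && $5 < $6)) print $2
        else if ($3 < $4 || ($3 == $4 && $5 > $6)) print $1
        exit
    }' "$work")

    if [ -n "$id" ]; then
        printf '%s\n' "$id" >> "$excl"
        # Keep only rows that do not name that ID in field 1 or 2 (exact field match).
        awk -v id="$id" '$1 != id && $2 != id' "$work" > "$work.new"
    else
        # Assumed handling of a tie: no rule applies, so just drop the row.
        sed 1d "$work" > "$work.new"
    fi
    mv "$work.new" "$work"
done

result=$(cat "$excl")
printf '%s\n' "$result"
rm -f "$work" "$excl"
```

Comparing fields inside awk, rather than using grep -v -f, avoids removing rows where an excluded ID is merely a substring of another ID (ID1 versus ID10, for example).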

aberg · August 11, 2017, 2:18pm

I think that's worked.. thank you!!

RudiC · August 11, 2017, 2:20pm

If that logic works, you might want to try

awk '
!($1 in X  || $2 in X)  {TMP = 1 + ($3 >  $4 || $3 == $4 && $5 < $6)
                         X[$TMP]
                         print $TMP
                        }
' file
ID1
ID4
ID2

You may want to post a bit larger sample so the test can be built on a larger basis.

What if $3 == $4 && $5 == $6 ?

aberg · August 11, 2017, 2:47pm

That's even more elegant and gives the same output. I'm struggling to understand how you managed to condense my four criteria into:

($3 >  $4 || $3 == $4 && $5 < $6)

Re: if $3==$4 && $5==$6, there is one occurrence of this in the whole file, but one of the IDs has already been 'eliminated' by the time we get to that row (hence that row no longer exists), so it is not a problem.

RudiC · August 11, 2017, 3:05pm

In fact, what I applied is your logic from post #1, exploiting the fact that the two expressions are complementary (exactly one of them is true) except in the $3 == $4 && $5 == $6 case, where neither is. The result of a boolean expression is either 1 (TRUE) or 0 (FALSE), and adding 1 to it yields the target field number (1 or 2). A strict language such as Pascal does not allow arithmetic on logical values, but C and awk do.
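A small sketch may make the arithmetic visible. It prints, for three of the sample rows, the value of the condition, the resulting field number, and the field that number selects. The variable name cond is introduced here for illustration only.

```shell
#!/bin/sh
# Show how a comparison (1 or 0) plus 1 becomes a field number in awk.
out=$(printf '%s\n' \
        'ID1 ID2 0 1 0.5 0.6' \
        'ID3 ID4 0 0 0.4 0.8' \
        'ID6 ID2 1 0 0.4 0.8' |
    awk '{
        cond = ($3 > $4 || $3 == $4 && $5 < $6)   # 1 if true, 0 if false
        TMP  = 1 + cond                           # field number: 1 or 2
        print cond, TMP, $TMP                     # $TMP is the field with that number
    }')

printf '%s\n' "$out"
```

For these rows the sketch prints "0 1 ID1", "1 2 ID4" and "1 2 ID2".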
If you can make very sure the $5 == $6 case doesn't exist (or disappears in the process), you can leave it out, but good programming style would require it to be covered, even if only by raising an error.
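One possible way to cover that case is sketched below: the one-pass logic with an extra rule that reports a tie on standard error and skips the row. The last input row is a made-up tie added for illustration, and writing to "/dev/stderr" assumes an awk that supports that name (gawk and mawk do).

```shell
#!/bin/sh
# Sketch: one-pass logic with the $3==$4 && $5==$6 case reported on stderr.
# The last input row (ID7/ID8) is a made-up tie, not from the thread.
input=$(mktemp)
err=$(mktemp)
cat > "$input" <<'EOF'
ID1  ID2  0  1  0.5  0.6
ID3  ID4  0  0  0.4  0.8
ID1  ID5  0  1  0.5  0.3
ID6  ID2  1  0  0.4  0.8
ID7  ID8  1  1  0.5  0.5
EOF

out=$(awk '
$1 in X || $2 in X      { next }    # row already eliminated
$3 == $4 && $5 == $6    { printf "tie on line %d: %s\n", NR, $0 > "/dev/stderr"
                          next }    # no rule applies, report and skip
                        { TMP = 1 + ($3 > $4 || $3 == $4 && $5 < $6)
                          X[$TMP]
                          print $TMP }
' "$input" 2> "$err")

errmsg=$(cat "$err")
printf '%s\n' "$out"
printf '%s\n' "$errmsg" >&2
rm -f "$input" "$err"
```

Whether a tie should abort the run, skip the row, or pick a default ID depends on the data, so the skip used here is only one option.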

aberg · August 11, 2017, 7:11pm

Thank you for the explanation - v. helpful.