I have two large files (~250GB) from which I am trying to remove the lines where GT is 0/0, 1/1, or 2/2, in both files. I was going to use a bash script with the awk below, which I think will find each line, but how do I remove a line if that condition is found? Thank you :).
Input
20 60055 . A . 35 PASS DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL 0/0:25:25:99:PASS
20 60056 . G A. 35 PASS DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL 0/1:12,13:25:99:PASS,PASS
20 60057 . T . 35 PASS DP=26;PF=20;MF=6;MQ=60;SB=0.769 GT:AD:DP:GQ:FL 0/0:26:26:99:PASS
20 60058 . C T 35 PASS DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL 1/1:25:25:99:PASS
awk '$9~"^[012]"{$0=$0($9~"^(0/0|1/1|2/2)"?" hom":" het")}1' input
Desired output
20 60056 . G A. 35 PASS DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL 0/1:12,13:25:99:PASS,PASS
Your spec is (not for the first time) rather misleading. There's NO field that contains GT: 0/0 or 1/1 or 2/2. It is left to the reader's interpretation that field 9 is a sort of description for the next field, and that field 10 holds the respective values. Your code snippet doesn't help either: it doesn't remove any lines, and field 9 will never start with 0, 1, or 2.
And no logical connection between the TWO files is perceivable. You seem to be requesting a generic solution that can be applied to either of your two files.
Please be aware that a correct, detailed, and carefully tailored specification will save everybody's time, including yours!
For your problem, try
awk '$NF !~ /^(0\/0|1\/1|2\/2)/' file
20 60056 . G A. 35 PASS DP=25;PF=20;MF=5;MQ=60;SB=0.800 GT:AD:DP:GQ:FL 0/1:12,13:25:99:PASS,PASS
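The one-liner above relies on the genotype being at the very start of the last field. If that cannot be guaranteed, a slightly more defensive variant might look up the position of GT in field 9 and test the matching sub-field of field 10. This is only a sketch under assumptions not stated in the question: fields are whitespace-separated, field 9 is the colon-separated key list, field 10 holds the values in the same order, and any header lines start with "#". The function name filter_hom is made up for illustration.

```shell
# Sketch: print only lines whose GT value is not 0/0, 1/1, or 2/2.
# Assumes field 9 lists the keys (e.g. GT:AD:DP:GQ:FL) and field 10 the values.
filter_hom() {
    awk '
        /^#/ { print; next }               # assumed header lines pass through unchanged
        {
            n = split($9, key, ":")        # keys from field 9
            split($10, val, ":")           # values from field 10, same order
            gt = ""
            for (i = 1; i <= n; i++)
                if (key[i] == "GT") gt = val[i]
            if (gt !~ /^(0\/0|1\/1|2\/2)$/) print
        }
    ' "$1"
}
```

Since the two files are independent, each can be filtered on its own, e.g. filter_hom file1 > file1.filtered and then the same for file2 (the file names here are placeholders). awk reads line by line, so the file size should not be a memory problem, but the output needs its own disk space.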