Hi, to determine whether a value is absent from a column, you have to read the entire file first. There are two options: read the file once, keep all the relevant information in memory and print the results at the end, or read the same file twice.
With the latter approach, something like this should work:
awk 'NR==FNR{A[$1]; next} !($3 in A)' file.txt file.txt
id name parentID
1 A 5
--
Note: NR==FNR is a condition that is true only while the file is being read for the first time. During that pass the ids in $1 are stored in array A, and the next statement skips the rest of the code, so the !($3 in A) test only runs when the file is read for the second time.
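For comparison, the first option (one pass, everything in memory) could look roughly like the sketch below. The temporary sample file and the array names ids, line and parent are made up for illustration; the data mirrors the whitespace-separated layout used in this thread.

```shell
# Hypothetical sample data, whitespace-separated (id name parentID).
tmp=$(mktemp)
printf '%s\n' 'id name parentID' '4 D 2' '2 B 1' '3 C 1' '1 A 5' > "$tmp"

# Single pass: remember every id and every row, then report at the end.
single_pass=$(awk '
    { ids[$1]; line[NR] = $0; parent[NR] = $3 }   # keep ids and rows in memory
    END {
        for (i = 1; i <= NR; i++)                  # replay rows in input order
            if (!(parent[i] in ids)) print line[i] # parentID never seen as an id
    }
' "$tmp")

printf '%s\n' "$single_pass"
rm -f "$tmp"
```

This keeps every line in memory until the END block runs, so for very large files the two-pass version is probably the safer choice.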
Thanks, that worked beautifully! I had a bit of trouble getting it to work in my real-life application (a 2 GB file with dozens of columns and over 2 million lines), but I got it working by specifying the field separator:
awk -F '\t' 'NR==FNR{A[$1]; next} !($3 in A)' file.txt file.txt
There were blank spaces in some of the fields, so splitting on the default whitespace separator put values in the wrong columns.
Thanks a lot for your help.
--- Post updated at 02:57 AM ---
Actually, one more thing. The current output also includes lines that have no value in column 3, e.g., with this file:
id name parentID
4 D 2
2 B 1
3 C 1
1 A 5
6 E
I get this result:
id name parentID
1 A 5
6 E
Since the purpose of this exercise is to find parentIDs that are missing from the id column, I am not interested in lines where $3 is empty. How can I get it to omit those?
One way is to add an empty string to array "A" before the file is read, so that when an empty string is encountered in $3 it is already in the array and the line does not get printed, because !($3 in A) is false.
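A minimal sketch of that idea, assuming the empty key is seeded in a BEGIN block (the exact form is an assumption, not a quote from this thread) and using a made-up tab-separated sample file:

```shell
# Hypothetical tab-separated sample; the last row has an empty parentID.
tmp=$(mktemp)
printf '%s\t%s\t%s\n' id name parentID 4 D 2 2 B 1 3 C 1 1 A 5 > "$tmp"
printf '6\tE\t\n' >> "$tmp"

# BEGIN{A[""]} creates an element with an empty key before any input is read,
# so an empty $3 counts as "present" and !($3 in A) is false for that row.
missing=$(awk -F '\t' 'BEGIN{A[""]} NR==FNR{A[$1]; next} !($3 in A)' "$tmp" "$tmp")

printf '%s\n' "$missing"
rm -f "$tmp"
```

Adding a $3 != "" test in front of !($3 in A) should give the same result without touching the array.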