Awk: compare values in two columns of the same file

I'm trying to learn awk, but I've hit a roadblock with this problem. I have a hierarchy stored in a file with 3 columns:

id	name	parentID
4	D	2
2	B	1
3	C	1
1	A	5

I need to check if there are any values in column 3 that are not represented anywhere in column 1. I've tried this:

awk '{arr[$1];} !($3 in arr) {print $0}' file.txt

The desired output would be:

1	A	5

But it prints the entire file instead. What am I doing wrong?

Hi, to determine if a value is not present in a column, you have to read the entire file first. There are two choices, read the file and put all relevant information in memory and then print the results, or read the same file twice.

With the latter approach, something like this should work:

awk 'NR==FNR{A[$1]; next} !($3 in A)' file.txt file.txt
id	name	parentID
1	A	5

--
Note: NR==FNR is a condition that only applies when the file is being read for the first time. The next statement ensures the rest of the code is used when reading the file for the second time.

2 Likes

Thanks, that worked beautifully! I had a bit of trouble getting it to work in my real life application (a 2 GB file with dozens of columns and over 2 million lines), but I managed to get it to work by specifying the field separator:

awk -F '\t' 'NR==FNR{A[$1]; next} !($3 in A)' file.txt file.txt 

There were blank spaces in some of the fields.

Thanks a lot for your help.

--- Post updated at 02:57 AM ---

Actually, one more thing. The current output includes lines if there's no value in column 3, e. g., with this file:

id	name	parentID
4	D	2
2	B	1
3	C	1
1	A	5
6	E	

I get this result:

id	name	parentID
1	A	5
6	E

Since the purpose of this exercise is to find parentIDs that are missing from the id column, I am not interested in lines where $3 is empty. How can I get it to omit those?

Hi, try modifying the condition like so:

$3!="" && !($3 in A)
1 Like

Works great, thanks!

To avoid one logical test for every line in a file, this (based on Scrutinizer's proposal) might be interesting, too:

awk 'BEGIN {A[""]} NR==FNR{A[$1]; next} !($3 in A)' file file
1 Like

Thanks, for my 2GB file, this saves a little bit of processing time: 2m38.580s vs. 2m43.779s. Could you please explain what adding A[""] does?

It adds an empty string to array "A" so that when it encounters an empty string in $3 it is already in the array and so the line does not get printed ( !($3 in A) ).

1 Like