Awk: compare values in two columns of the same file

kaktus · July 24, 2019, 7:11pm

I'm trying to learn awk, but I've hit a roadblock with this problem. I have a hierarchy stored in a file with 3 columns:

id	name	parentID
4	D	2
2	B	1
3	C	1
1	A	5

I need to check if there are any values in column 3 that are not represented anywhere in column 1. I've tried this:

awk '{arr[$1];} !($3 in arr) {print $0}' file.txt

The desired output would be:

1	A	5

But it prints the entire file instead. What am I doing wrong?

Scrutinizer · July 24, 2019, 7:30pm

Hi, to determine if a value is not present in a column, you have to read the entire file first. There are two choices, read the file and put all relevant information in memory and then print the results, or read the same file twice.

With the latter approach, something like this should work:

awk 'NR==FNR{A[$1]; next} !($3 in A)' file.txt file.txt

id	name	parentID
1	A	5

--
Note: NR==FNR is a condition that only applies when the file is being read for the first time. The next statement ensures the rest of the code is used when reading the file for the second time.

kaktus · July 24, 2019, 10:02pm

Thanks, that worked beautifully! I had a bit of trouble getting it to work in my real life application (a 2 GB file with dozens of columns and over 2 million lines), but I managed to get it to work by specifying the field separator:

awk -F '\t' 'NR==FNR{A[$1]; next} !($3 in A)' file.txt file.txt

There were blank spaces in some of the fields.

Thanks a lot for your help.

--- Post updated at 02:57 AM ---

Actually, one more thing. The current output includes lines if there's no value in column 3, e. g., with this file:

id	name	parentID
4	D	2
2	B	1
3	C	1
1	A	5
6	E

I get this result:

id	name	parentID
1	A	5
6	E

Since the purpose of this exercise is to find parentIDs that are missing from the id column, I am not interested in lines where $3 is empty. How can I get it to omit those?

Scrutinizer · July 25, 2019, 1:45am

Hi, try modifying the condition like so:

$3!="" && !($3 in A)

kaktus · July 25, 2019, 11:06am

Works great, thanks!

RudiC · July 25, 2019, 1:06pm

To avoid one logical test for every line in a file, this (based on Scrutinizer's proposal) might be interesting, too:

awk 'BEGIN {A[""]} NR==FNR{A[$1]; next} !($3 in A)' file file

kaktus · July 25, 2019, 4:10pm

Thanks, for my 2GB file, this saves a little bit of processing time: 2m38.580s vs. 2m43.779s. Could you please explain what adding A[""] does?

Scrutinizer · July 25, 2019, 4:17pm

It adds an empty string to array "A" so that when it encounters an empty string in $3 it is already in the array and so the line does not get printed ( !($3 in A) ).