Simple awk command to compare two files and print first difference

LMHmedchem · April 26, 2017, 5:34pm

Hello,

I have two text files, each with a single column,
file 1:

file 2:

I am trying to identify the value in red above, which is the first value in file 1 that doesn't match the second file. I need to print that value and exit.

At first I tried diff,

diff file1 file2 | head -n 2

This gives what I want, but there are multiple lines of output, so it takes more steps to get the value into a bash variable, which is what I need.

I then tried awk,

awk ' NR==FNR { a[NR]=$0; next } !($0 in a){ print $1; exit } ' file2 file1

Note that the order of input files is reversed because I want the first line of file1 that does not match file2. This just prints the first line of file1. Even if it did work, I think that this just tells me that the value is, or is not, in the file, not if the lines match.

awk ' NR==FNR { a[NR]=$0; next } $0 != a[FNR] { print a[FNR]; exit } file1 file2

I am sure I could do a loop with read, but that would be slow.

This seems like a very simple task. Are there any suggestions?

LMHmedchem

RudiC · April 26, 2017, 6:03pm

How about

diff -y -b --suppress-common-lines file1 file2 | cut -f1 | head -1
123476854

Or, slightly adapting your own awk proposal:

awk ' NR==FNR { a[$0]; next } !($0 in a){ print $1; exit } ' file2 file1
123476854

LMHmedchem · April 26, 2017, 6:11pm

It seems something like this would be correct,

awk ' NR==FNR { a[$0]; next } $0 != a[FNR] { print a[FNR]; exit } file1 file2'

but that doesn't do anything at all. Am I right that evaluating !($0 in a) looks for $0 anywhere in a? I am checking that the files match, so it matters that the value appears on the same line in both files, not that it appears anywhere.
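
A minimal sketch of the distinction in question, using two made-up three-line files (the file names, values, and variable names below are illustrative, not the data from this thread). In awk, `$0 in a` tests whether `$0` is an index of `a`, so after `a[$0]` it is a membership test that ignores position. Storing lines by number with `a[NR]=$0` and comparing against `a[FNR]` checks the same line number in both files instead.

```shell
# Hypothetical sample data: same values, but lines 2 and 3 are swapped.
tmpdir=$(mktemp -d)
printf '%s\n' 111 222 333 > "$tmpdir/file1"
printf '%s\n' 111 333 222 > "$tmpdir/file2"

# Membership test: every line of file1 occurs somewhere in file2,
# so nothing is printed.
by_membership=$(awk 'NR==FNR { a[$0]; next } !($0 in a) { print; exit }' \
    "$tmpdir/file2" "$tmpdir/file1")

# Positional test: file2 is stored by line number, then each line of file1
# is compared with the line at the same position.
by_position=$(awk 'NR==FNR { a[NR]=$0; next } $0 != a[FNR] { print; exit }' \
    "$tmpdir/file2" "$tmpdir/file1")

echo "membership: '$by_membership'"
echo "position:   '$by_position'"
rm -rf "$tmpdir"
```

With this sample data the membership test finds no mismatch, while the positional test reports line 2 of file1.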

LMHmedchem

rdrtx1 · April 26, 2017, 6:18pm

paste -d" " file1 file2 | awk '$1 != $2 {print $1; exit;}'

Scrutinizer · April 26, 2017, 6:30pm

@OP, your second suggestion seems to work all right, but you forgot the second quote:

awk ' NR==FNR { a[NR]=$0; next } $0 != a[FNR] { print a[FNR]; exit }' file1 file2

However, it would read the whole of file1 first and put it in memory.

Another approach you could try:

awk '{getline s<f} $0!=s{print; exit}' f=file2 file1
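
A commented sketch of how the getline version works, run against made-up sample files (the file names, values, and the `first_diff` variable are illustrative). `getline s < f` reads the next line of the file named in `f` into `s`, so the two files are read in step and nothing is held in memory. One added assumption: when file2 runs out before file1, getline no longer returns 1 and leaves `s` unchanged, so the sketch clears `s` in that case to make the extra lines of file1 count as mismatches.

```shell
# Hypothetical sample data: the files differ on line 3.
tmpdir=$(mktemp -d)
printf '%s\n' 100 200 300 400 > "$tmpdir/file1"
printf '%s\n' 100 200 999 400 > "$tmpdir/file2"

first_diff=$(awk '
    {
        # Read the next line of the second file into s. At end of file or
        # on error getline does not return 1, so treat the partner line
        # as empty (an assumption of this sketch, not the original one-liner).
        if ((getline s < f) <= 0) s = ""
    }
    # Print the first line of file1 that differs from the same line of file2.
    $0 != s { print; exit }
' f="$tmpdir/file2" "$tmpdir/file1")

echo "$first_diff"
rm -rf "$tmpdir"
```

If lines that look like numbers must match character for character (for example `010` versus `10`), comparing `$0 "" != s ""` should force a string comparison rather than a numeric one.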

LMHmedchem · April 26, 2017, 9:10pm

In the end, I did this based on the code posted by Scrutinizer,

error_record=$(awk '{getline s<f} $0!=s{print; exit}' f=file2 file1)

It seems like it will work well enough and was the fastest of the methods that worked.

This suggestion of RudiC also worked but was marginally slower.

error_record=$(diff -y -b --suppress-common-lines file1 file2 | cut -f1 | head -1)

By slower I mean 0m0.391s as opposed to 0m0.156s with the first method. Not enough of a difference to bother with, but I guess you need some reason to pick a method.

The method suggested by rdrtx1 also worked but again was a bit slower,

error_record=$(paste -d" " file1 file2 | awk '$1 != $2 {print $1; exit;}')

My guess is that the two slower methods both made calls to more than one program and this is the origin of the difference.

I was not able to get any output from this, even though it looks correct,

awk ' NR==FNR { a[NR]=$0; next } $0 != a[FNR] { print a[FNR]; exit }' file1 file2

Don't know what the issue is there.

LMHmedchem