Compare 2 huge files wrt to a key using awk

Ranjani · July 14, 2008, 6:25am

Hi Folks,

I need to compare two very huge file ( i.e the files would contain a minimum of 70k records each) using awk or sed. The comparison needs to be done with respect to a 'key'. For example :

File1
**********
1234|TONY|Y75634|20/07/2008
1235|TINA|XCVB56|30/07/2009
43456|PATS|U74454|12/04/2009
23456|DAPS|R4576|15/03/2008

File2
******
1235|TINA|XCVB56|30/07/2009
1234|TONY|Y75634|20/07/2008
23456|DAPS|R4576|15/03/2008

In this case, if I consider '|' as the delimiter , the value at column 3 as 'key' for the files, I need to look out for this key in the second file and once that is got, I need to compare the values at column 2 of the corresponding records in both the files.

Also, I need to report a message in case the key is not present in file2.

PS: I have a perl script running for this.. but it takes way too long to perform this comparison, your help in suggesting some awk script which would perform this action much faster would be really appreciated.

Thanks in advance
Ranjani

radoulov · July 14, 2008, 7:09am

What should be the desired output?

Ranjani · July 14, 2008, 7:11am

Sorry, i had forgotten to mention the desired output.

The output needs to be logged in file3 which says either : " the correspoding records matched" or " the corresponding records did not match" or "the key in file1 does not exist in file2"

Thanks
Ranjani

Franklin52 · July 14, 2008, 7:47am

Try something like this:

awk -F"|" '
NR==FNR{a[$3]=$2;next}
a[$3]==$2{print $0 " <= Corresponding records match";next}
a[$3]{print $0 " <= "Corresponding records did not match"}
{print $0 " <= Key File1 not exist in File2"}
' File1 File2

Regards

radoulov · July 14, 2008, 7:59am

Another one:
(use nawk or /usr/xpg4/bin/awk on Solaris)

awk>file3 -F\| 'NR==FNR{f1[$3]=$2;next}
{print "key",$3 in f1?$3" records "(f1[$3]==$2?"":"do not ")\
 "match":$3" is missing"}' file2 file1

Ranjani · July 14, 2008, 9:19am

Hi Franklin and Radoulov.. thanks a lot for your responseS, but I am getting the following o/p when i run this script :
******************************************************
1234 RANJ 45678 y786 <= Corresponding records did not match
1234 RANJ 45678 y786 <= Key File1 not exist in File2
567 SREE 3457 xg456 <= Corresponding records did not match
567 SREE 3457 xg456 <= Key File1 not exist in File2
34567 TANGO 4567 H7694 <= Corresponding records did not match
34567 TANGO 4567 H7694 <= Key File1 not exist in File2
4567 qrea 3456 but731 <= Corresponding records did not match
4567 qrea 3456 but731 <= Key File1 not exist in File2
34567 TANGO 4567 H7694 <= Corresponding records did not match
34567 TANGO 4567 H7694 <= Key File1 not exist in File2
567 SREE 3457 xg456 <= Corresponding records did not match
567 SREE 3457 xg456 <= Key File1 not exist in File2
1234 RANJ 45678 y786 <= Corresponding records did not match
1234 RANJ 45678 y786 <= Key File1 not exist in File2

*******************************************************

This is not the desired o/p. Please could you help :

The i/p files are :
************************
1234 RANJ 45678 y786
567 SREE 3457 xg456
34567 TANGO 4567 H7694
4567 qrea 3456 but731

*****************************
File2:
******************
34567 XAXRO 4567 H7694
567 SREE 3457 xg456
1234 RANJ 45678 y786

The desired output file is :

For the key:45678 , the fields are matching
For the key:3457, the fields are matching
For the key : 4567 , the fileds are not matching
For the key : 45678 , record is not present in file2.

*******************************************

Please provide your useful inputs. Your help would be much appreciated.

Thanks a lot

Ranjani

radoulov · July 14, 2008, 9:30am

Perhaps I'm missing something,
this is what I get from the code I posted:
(you changed the field separator so the -F switch is removed)

$ head file[12]
==> file1 <==
1234 RANJ 45678 y786
567 SREE 3457 xg456
34567 TANGO 4567 H7694
4567 qrea 3456 but731

==> file2 <==
34567 XAXRO 4567 H7694
567 SREE 3457 xg456
1234 RANJ 45678 y786
$ nawk  'NR==FNR{f1[$3]=$2;next}
{print "key",$3 in f1?$3" records "(f1[$3]==$2?"":"do not ")\
 "match":$3" is missing"}' file2 file1
key 45678 records match
key 3457 records match
key 4567 records do not match
key 3456 is missing

You say you want to mark the key 45678 as not present, but it is ...

Ranjani · July 14, 2008, 9:46am

Thanks a ton Radoulov , the code is working perfectly fine. I am sorry for the typo, it was the key : 3456 . Thanks a lot for your time.

Best Regards
Ranjani

Ranjani · July 14, 2008, 10:06am

Hi , just another quick doubt.

What do i need to do incase i want to compare the values of 2 fields as in ( the value at col 1 and value at col2 ) for the corresponding records .

Your help will be much appreciated. Thanks a lot again

Ranjani

radoulov · July 14, 2008, 10:11am

Like this?

awk 'NR == FNR { 
  f1[$3] = $1 SUBSEP $2 
  next
  }
{ 
  print "key", 
  $3 in f1 ? $3 " records " \
  (f1[$3] == $1 SUBSEP $2 ? "" : "do not ") \
  "match" : $3 " is missing" 
  }' file2 file1

Ranjani · July 14, 2008, 10:25am

Awesome!! .. works perfect .. thanks a lot for all your help. Its much appreciated

Ranjani · July 14, 2008, 11:57am

Could I ask another doubt in addition : I hope you guys dont mind :

I would want to add a condition to check for blank records in between , as in, right now if there is any blank records , the script output : key is missing . It identifies the key as a simple space. I want it to display an error "blank record in the file". I tried doing it, but could not succeed... Need your help in this as well. Thanks a lot.

Actually I am a newbee to awk and hence finding it difficult to do the modifications.. it would be great if you could post an expalination to how this code would work so that i could understand and take it further myself.

Thanking you in advance
Ranjani

radoulov · July 14, 2008, 3:53pm

Something like this?

awk 'NR == FNR { 
  f1[$3] = $1 SUBSEP $2 
  next
  }
NF { 
  print "key", 
  $3 in f1 ? $3 " records " \
  (f1[$3] == $1 SUBSEP $2 ? "" : "do not ") \
  "match" : $3 " is missing" 
  }' file2 file1

The above code works like this:

NR == FNR { 
  f1[$3] = $1 SUBSEP $2 
  next
  }

While reading the first file - NR == FNR (the current record number of the entire input equals the current record number of the current file, this is a common AWK idiom) build the f1 associative array: the third field is the key, the first and the second fields are the value (check the awk documentation for SUBSEP, you can use FS here also). The next statement:

[from effective awk programming]

{ 
  print "key", 
  $3 in f1 ? $3 " records " \
  (f1[$3] == $1 SUBSEP $2 ? "" : "do not ") \
  "match" : $3 " is missing" 
  }

While reading the second file print the string "key", followed by the result of the following expression:

$3 in f1 ? $3 " records " (f1[$3] == $1 SUBSEP $2 ? "" : "do not ") "match" : $3 " is missing"

The if?then:else is the ternary operator, it means: if a key in the f1 array matches the third field of the current (the second) file - $3 in f1 , then print the third field followed by the string " records " followed by (another ternary operator): if the value of the previous key equals the first and the second field pair in the current file, then nothing (""), else print the string "do not " (end of the embedded second ternary operator), followed by the string "match", else (the first ternary operator) print the third field and the string " is missing".

That's all.

Ranjani · July 15, 2008, 4:58am

Perfect! .. thank a lot Radoulov... Thanks a ton for all the help!!