awk - compare 1st 15 fields of record with 20 fields

sljnk · October 19, 2013, 5:20pm

I'm trying to compare 2 files for differences in a selct number of fields. When differnces are found it will write the whole record of the second file including appending '|C' out to a delta file. Each record will have 20 fields, but only want to do comparison of 1st 15 fields. The 1st field of each record is the primary key. When comparing the fields differences should only be captured if there is a change in fields 2-15. If there is a difference in fields 16-20 they should be ignored and therefore a record will not be written to the delta file.

awk -F\| 'NR==FNR{for(i=1; i<=15;i++) a[$i]++;next} ($1 in a) ' FileA.txt FileB.txt | sed 's/^[ \t]*//;s/[ \t]*$/|C/' > delta.txt

File1 (differences in red):
000001323567|BELLTOWER|BOBBY|BAT|BBBAT@hello.com|JOB2|CHARLES|CH@goodbye.com|A||COMPANY1|00283|123 RED WAY|Y|N||MINNY|MN|34217|1
000001678932|STRAIGHT|HENRY|CAT|SHCAT@hello.com|PARTY2|SUZY|SU@goodbye.com|R||COMPANY7|00993|456 GREEN WAY|N|N||SUNNY|FL|45691|9
 
File2 (differences in red)
000001323567|BELLTOWER|ROBERT|BAT|BBBAT@hello.com|JOB2|CHARLES|CH@goodbye.com|A||COMPANY1|00283|123 RED WAY|Y|N||MINNY|MN|34217|1
000001678932|STRAIGHT|HENRY|CAT|SHCAT@hello.com|PARTY2|SUZY|SU@goodbye.com|R||COMPANY7|00993|456 GREEN WAY|N|N||SUNNYSIDE|FL|45691|9
 
Output (only record 1 since the difference in the second record occurred in the 17th field):
000001323567|BELLTOWER|ROBERT|BAT|BBBAT@hello.com|JOB2|CHARLES|CH@goodbye.com|A||COMPANY1|00283|123 RED WAY|Y|N||MINNY|MN|34217|1|C

mjf · October 19, 2013, 8:58pm

Try this:

$ awk -F"|" 'NR==FNR{a[$1]=$2 $3 $4 $5 $6 $7 $8 $9 $10 $11 $12 $13 $14 $15;next}a[$1]==$2 $3 $4 $5 $6 $7 $8 $9 $10 $11 $12 $13 $14 $15{;next}a[$1]{print $0 "|C"}' file1.txt file2.txt

sljnk · October 20, 2013, 1:57am

mjf, thank you for your help. It is working great for all my test files/scenarios

MadeInGermany · October 20, 2013, 4:13am

Or have a long artificial key

awk -F\| 'NR==FNR{a[$1 FS $2 FS $3 FS $4 FS $5 FS $6 FS $7 FS $8 FS $9 FS $10 FS $11 FS $12 FS $13 FS $14 FS $15];next} (($1 FS $2 FS $3 FS $4 FS $5 FS $6 FS $7 FS $8 FS $9 FS $10 FS $11 FS $12 FS $13 FS $14 FS $15) in a) '

Anyway there should be delimiters between the fields!

RudiC · October 20, 2013, 5:02am

Nor sure if this is more elegant / faster / more efficient:

awk     '               {X=$0; NF=15; $1=""}
         NR==FNR        {Arr[$0];next}
         !($0 in Arr)   {print X, "C"}
        ' FS=\| OFS=\| file1 file2
000001323567|BELLTOWER|ROBERT|BAT|BBBAT@hello.com|JOB2|CHARLES|CH@goodbye.com|A||COMPANY1|00283|123 RED WAY|Y|N||MINNY|MN|34217|1|C

MadeInGermany's proposal is lacking an ! negation in front of the ((... in a)) .

mjf · October 20, 2013, 8:01am

MadeInGermany,
For my education, can you please explain why there should be delimiters in my solution (Don Cragun also pointed this out in another of my awk posts)? Is there a scenario that would not work without delimeters but works with delimiters?

Don_Cragun · October 20, 2013, 9:14am

Suppose that you have two records in a file with '|' as the field separator:

ab|c
a|bc

and you want to count the number of occurrences of different settings in the 1st two fields.

awk -F '|' '
{ cnt[$1 $2]++ }
END { for(i in cnt) print i, cut }
' file

would produce:

abc 2

even though field 1 is different in both lines and field 2 is different in both lines. Using cnt[$1 FS $2]++ or cnt[$1,$2]++ instead of cnt[$1 $2]++ avoids the problem.

mjf · October 20, 2013, 9:28am

Don, understood. Thank you!