awk - compare 1st 15 fields of record with 20 fields

I'm trying to compare 2 files for differences in a selct number of fields. When differnces are found it will write the whole record of the second file including appending '|C' out to a delta file. Each record will have 20 fields, but only want to do comparison of 1st 15 fields. The 1st field of each record is the primary key. When comparing the fields differences should only be captured if there is a change in fields 2-15. If there is a difference in fields 16-20 they should be ignored and therefore a record will not be written to the delta file.

awk -F\| 'NR==FNR{for(i=1; i<=15;i++) a[$i]++;next} ($1 in a) ' FileA.txt FileB.txt | sed 's/^[ \t]*//;s/[ \t]*$/|C/' > delta.txt

File1 (differences in red):
000001323567|BELLTOWER|BOBBY|BAT|BBBAT@hello.com|JOB2|CHARLES|CH@goodbye.com|A||COMPANY1|00283|123 RED WAY|Y|N||MINNY|MN|34217|1
000001678932|STRAIGHT|HENRY|CAT|SHCAT@hello.com|PARTY2|SUZY|SU@goodbye.com|R||COMPANY7|00993|456 GREEN WAY|N|N||SUNNY|FL|45691|9
 
File2 (differences in red)
000001323567|BELLTOWER|ROBERT|BAT|BBBAT@hello.com|JOB2|CHARLES|CH@goodbye.com|A||COMPANY1|00283|123 RED WAY|Y|N||MINNY|MN|34217|1
000001678932|STRAIGHT|HENRY|CAT|SHCAT@hello.com|PARTY2|SUZY|SU@goodbye.com|R||COMPANY7|00993|456 GREEN WAY|N|N||SUNNYSIDE|FL|45691|9
 
Output (only record 1 since the difference in the second record occurred in the 17th field):
000001323567|BELLTOWER|ROBERT|BAT|BBBAT@hello.com|JOB2|CHARLES|CH@goodbye.com|A||COMPANY1|00283|123 RED WAY|Y|N||MINNY|MN|34217|1|C
 

Try this:

$ awk -F"|" 'NR==FNR{a[$1]=$2 $3 $4 $5 $6 $7 $8 $9 $10 $11 $12 $13 $14 $15;next}a[$1]==$2 $3 $4 $5 $6 $7 $8 $9 $10 $11 $12 $13 $14 $15{;next}a[$1]{print $0 "|C"}' file1.txt file2.txt
1 Like

mjf, thank you for your help. It is working great for all my test files/scenarios

Or have a long artificial key

awk -F\| 'NR==FNR{a[$1 FS $2 FS $3 FS $4 FS $5 FS $6 FS $7 FS $8 FS $9 FS $10 FS $11 FS $12 FS $13 FS $14 FS $15];next} (($1 FS $2 FS $3 FS $4 FS $5 FS $6 FS $7 FS $8 FS $9 FS $10 FS $11 FS $12 FS $13 FS $14 FS $15) in a) '

Anyway there should be delimiters between the fields!

Nor sure if this is more elegant / faster / more efficient:

awk     '               {X=$0; NF=15; $1=""}
         NR==FNR        {Arr[$0];next}
         !($0 in Arr)   {print X, "C"}
        ' FS=\| OFS=\| file1 file2
000001323567|BELLTOWER|ROBERT|BAT|BBBAT@hello.com|JOB2|CHARLES|CH@goodbye.com|A||COMPANY1|00283|123 RED WAY|Y|N||MINNY|MN|34217|1|C

MadeInGermany's proposal is lacking an ! negation in front of the ((... in a)) .

MadeInGermany,
For my education, can you please explain why there should be delimiters in my solution (Don Cragun also pointed this out in another of my awk posts)? Is there a scenario that would not work without delimeters but works with delimiters?

Suppose that you have two records in a file with '|' as the field separator:

ab|c
a|bc

and you want to count the number of occurrences of different settings in the 1st two fields.

awk -F '|' '
{ cnt[$1 $2]++ }
END { for(i in cnt) print i, cut }
' file

would produce:

abc 2

even though field 1 is different in both lines and field 2 is different in both lines. Using cnt[$1 FS $2]++ or cnt[$1,$2]++ instead of cnt[$1 $2]++ avoids the problem.

1 Like

Don, understood. Thank you!