Compare two files using awk

paul.o · May 26, 2009, 2:37pm

Hi. I'm new to awk and have searched for a solution to my problem, but haven't found the right answer yet. I have two files that look like this:

file1
Delete,3105551234
Delete,3105551236
Delete,5625559876
Delete,5625556789
Delete,5625553456
Delete,5625551234
Delete,5625556956
Delete,5625556643
Delete,6265552486
Delete,6265559365
Add,7755559833
Add,9515550087

file2
93,170334,0,-1,-1,,AAA,,5625556643,6465550987,,,-1,,581,93,-1
94,170335,0,-1,-1,,AAA,,7145550167,6465550987,,,-1,,581,93,-1
107,170239,0,-1,-1,,AAA,,6265559999,6465550987,,,-1,,581,93,-1
109,170240,0,-1,-1,,AAA,,5205558723,6465550987,,,-1,,581,93,-1
110,170241,0,-1,-1,,AAA,,3105551236,6465550987,,,-1,,581,93,-1
111,170348,0,-1,-1,,AAA,,6195550178,6465550987,,,-1,,581,93,-1
114,170256,0,-1,-1,,AAA,,5625559876,6465550987,,,-1,,581,93,-1
118,170336,0,-1,-1,,AAA,,3105551234,6465550987,,,-1,,581,93,-1
119,170337,0,-1,-1,,AAA,,5125559812,6465550987,,,-1,,581,93,-1
120,170338,0,-1,-1,,AAA,,5125559083,6465550987,,,-1,,581,93,-1
121,101,1,-1,-1,,AAA,,,2135559126,,,-1,,0,85,-1
122,170339,0,-1,-1,,AAA,,5625559067,6465550987,,,-1,,581,93,-1
125,999996,1,-1,-1,,AAA,,,6265559365,,,-1,,0,2561,-1
127,170340,0,-1,-1,,AAA,,5625551234,6465550987,,,-1,,581,93,-1
128,170341,0,-1,-1,,AAA,,5625559148,6465550987,,,-1,,581,93,-1
129,170342,0,-1,-1,,AAA,,5625556789,6465550987,,,-1,,581,93,-1
130,170343,0,-1,-1,,AAA,,5625559210,6465550987,,,-1,,581,93,-1
133,100,1,-1,-1,,AAA,,,6265552486,,,-1,,0,85,-1
134,170344,0,-1,-1,,AAA,,5625553456,6465550987,,,-1,,581,93,-1
135,170345,0,-1,-1,,AAA,,7605559809,6465550987,,,-1,,581,93,-1
137,170257,0,-1,-1,,AAA,,5625556956,6465550987,,,-1,,581,93,-1

For every entry in file1 that has "Delete" in $1, I would like to look up $2 (from file1) in file2. Then I want to create a third file, file3, containing "D," followed by $1 of each matching line in file2. With the examples above, the output would look like this:

file3
D,93
D,110
D,114
D,118
D,125
D,127
D,129
D,133
D,134
D,137

I hope I'm making sense. Any help would be appreciated. Thanks.

jim_mcnamara · May 26, 2009, 2:52pm

Not to quibble - but you are not clear. Your example does not match what you said.
take
114,170256,0,-1,-1,,AAA,,5625559876,6465550987,,,-1,,581,93,-1
and
Delete,5625559876

This means 'do not print' the 114,...... line.

Your output
D,114

has the 114 line in it. Several other lines are like this. Did you mean the reverse of what you said?

Franklin52 · May 26, 2009, 2:54pm

He means something like:

awk -F, 'NR==FNR && /^D/ {a[$2]++;next}
$9 in a || $10 in a {print "D," $1}' file1 file2

paul.o · May 26, 2009, 3:04pm

Sorry about that. file1 is a list of numbers that need to be deleted or added. file2 is a list of current numbers and corresponding information. I want file3 to be just the "D," along with the first column of file2 associated with the number marked for deletion in file1.

I tried the script Franklin posted, but I got "syntax error near line 2". I forgot to mention I'm using Solaris 8 if that makes a difference. Thanks.

Franklin52 · May 26, 2009, 3:05pm

Use nawk or /usr/xpg4/bin/awk on Solaris.
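
If the same script has to run on more than one machine, something like the following might pick a suitable awk. This is only a sketch: the variable name AWK and the /usr/bin/nawk path are my assumptions, and it simply falls back to plain awk when neither Solaris binary is found.

```shell
# Prefer the Solaris XPG4 awk, then nawk, and fall back to whatever
# "awk" is on the PATH. AWK is a hypothetical variable name.
AWK=awk
for candidate in /usr/xpg4/bin/awk /usr/bin/nawk; do
    if [ -x "$candidate" ]; then
        AWK=$candidate
        break
    fi
done
echo "using: $AWK"
```

The rest of a script would then call "$AWK" wherever it would otherwise call awk.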

Regards

paul.o · May 26, 2009, 3:56pm

Yes, nawk worked, thank you very much. I was wondering, if you don't mind, whether you could break down the script so I can understand exactly how it works? I'd like to learn as much of this as I can. Thanks.

Franklin52 · May 26, 2009, 4:22pm

awk -F, 'NR==FNR && /^D/ {a[$2]++;next}
$9 in a || $10 in a {print "D," $1}' file1 file2

Here we go:

awk -F,

Set the field separator to a comma

NR==FNR && /^D/

If we are reading the 1st file and the line starts with a "D"

{a[$2]++;next}

Use the 2nd field as an index into array a (which creates that element), then skip to the next line without running the rest of the script

$9 in a || $10 in a {print "D," $1}'

If the 9th or the 10th field of a line in the 2nd file exists as an index of array a, print "D," followed by the 1st field.
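
To make the pieces easier to follow, here is the same idea spread over several lines with comments, run on a small subset of the sample data. Treat it as a sketch, not a drop-in replacement: the temporary directory, the array name seen and the exact test $1 == "Delete" (instead of /^D/) are choices made for the illustration, and it skips every line of file1 with next, whereas the one-liner only skips the lines that start with "D".

```shell
# Sketch only: a small subset of the thread's sample data in temporary files.
tmpdir=$(mktemp -d)

cat > "$tmpdir/file1" <<'EOF'
Delete,3105551236
Delete,5625556643
Delete,6265559365
Add,7755559833
EOF

cat > "$tmpdir/file2" <<'EOF'
93,170334,0,-1,-1,,AAA,,5625556643,6465550987,,,-1,,581,93,-1
94,170335,0,-1,-1,,AAA,,7145550167,6465550987,,,-1,,581,93,-1
110,170241,0,-1,-1,,AAA,,3105551236,6465550987,,,-1,,581,93,-1
125,999996,1,-1,-1,,AAA,,,6265559365,,,-1,,0,2561,-1
EOF

# On Solaris, substitute nawk or /usr/xpg4/bin/awk for awk.
awk -F, '
    # First file only (NR equals FNR): remember the numbers marked "Delete".
    NR == FNR {
        if ($1 == "Delete") seen[$2] = 1
        next    # the rule below is never applied to lines of file1
    }
    # Second file: field 9 or field 10 holds a number marked for deletion.
    ($9 in seen) || ($10 in seen) { print "D," $1 }
' "$tmpdir/file1" "$tmpdir/file2" > "$tmpdir/file3"

result=$(cat "$tmpdir/file3")
printf '%s\n' "$result"
rm -rf "$tmpdir"
```

With this subset, file3 ends up with D,93, D,110 and D,125, in the order the matching lines appear in file2. The "Add" line and the 94 line produce nothing.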

Regards

paul.o · May 26, 2009, 4:27pm

Thanks for the info and thank you very much for your help. I really appreciate it. How does the script know when to use the first file and when to use the second file? Sorry if that's a dumb question.

Franklin52 · May 27, 2009, 2:16am

NR is the number of the current input record, counted across all input files, and FNR is the record number within the current file.
FNR is reset each time a new input file is started (so the first record of every file has FNR 1), which means NR==FNR is true only while the 1st file is being processed.
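
A quick way to see this is to print both counters for two tiny files. The file names first and second and the temporary directory are made up for this sketch.

```shell
# Two throwaway files: 2 lines and 3 lines.
tmpdir=$(mktemp -d)
printf 'a\nb\n' > "$tmpdir/first"
printf 'c\nd\ne\n' > "$tmpdir/second"

# Run from inside the directory so FILENAME prints as a bare name.
counters=$(cd "$tmpdir" && awk '{ print FILENAME ": NR=" NR " FNR=" FNR }' first second)
printf '%s\n' "$counters"
rm -rf "$tmpdir"
```

NR keeps counting from 1 to 5, while FNR goes 1, 2 for first and then starts again at 1 for second, so the two are equal only on the lines of first. One thing to watch for: if the 1st file is empty, NR==FNR is also true for the 2nd file, so the test is only reliable when the 1st file has at least one line.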

Regards