awk: Compare two files with a loop

theflamingmoe · August 14, 2012, 9:32pm

Hi,
I have a small query when comparing two files with awk. I have a small piece of code running in a shell. See below:

 
gawk -F"," 'NR == FNR { A[$1","$2","$3","$4]=1; next } \!A[$1","$2","$3","$4]' OFS="," 2011.csv 2012.csv > diff_2012.csv

The code works fine (note that I had to escape the ! as \! to run it in my shell). What I want to do is add a loop to this code. For the array I want to keep columns $1, $2 and $3 on every pass and step the fourth key field from $4 to $5, then $6, and so on up to $33. On each pass of the loop I want to write the differences to a new CSV file. An example of what I want is:

 
gawk -F"," 'NR == FNR { A[$1","$2","$3","$5]=1; next } \!A[$1","$2","$3","$5]' OFS="," 2011.csv 2012.csv > diff_2012_5.csv

Then

 
gawk -F"," 'NR == FNR { A[$1","$2","$3","$6]=1; next } \!A[$1","$2","$3","$6]' OFS="," 2011.csv 2012.csv > diff_2012_6.csv

And so on, except that I want the above in a loop.

Thanks in advance for your help

Corona688 · August 14, 2012, 9:54pm

Show the input you have and the output you want. Code which doesn't do what you want really doesn't tell us what you do want.

theflamingmoe · August 14, 2012, 10:23pm

Ok,
I can't seem to get the tags to work so I've appended two files.

Desired output based on array $1,$2,$3,$5 would be:

 
RD158,SR25509,501,1.3

Desired output based on array $1,$2,$3,$6 would be:

 
RD164,SR24504,441,33

Desired output based on array $1,$2,$3,$7 would be:

 
RD164,SR24505,442,90.1

Rather than run three separate lines of code, I want to change the array key using a loop and write whatever is different in 2012.csv to separate files.

Sorry for the confusion :o

Corona688 · August 15, 2012, 11:05am

awk 'NR==FNR {
        A[$1,$2,$3,$5]=1
        B[$1,$2,$3,$6]=1
        C[$1,$2,$3,$7]=1
        next }

        !A[$1,$2,$3,$4] { print > "file1" }
        !B[$1,$2,$3,$4] { print > "file2" }
        !C[$1,$2,$3,$4] { print > "file3" }' input1 input2

theflamingmoe · August 15, 2012, 7:36pm

Thanks for your time, Corona688. I guess my explanation and example were about as clear as mud, partly because I'm not too sure how the arrays work. Is it possible to index multiple values, in this case field values, into a single array? I want to grab fields from the two master files (2011.csv and 2012.csv), compare them, then find the differences. I have a workaround (written for the tcsh shell), as follows. I tested it and it seems to do what I want.

 
foreach n (`seq -s" " -f "%0g" 5 1 7`)
gawk -F"," -v i=$n '{print $1","$2","$3","$i}' OFS="," 2011.csv >  2011_temp.csv
gawk -F"," -v i=$n '{print $1","$2","$3","$i}' OFS="," 2012.csv >  2012_temp.csv
gawk -F"," 'NR == FNR {A[$0]=$0; next } \!A[$0]' OFS="," 2011_temp.csv 2012_temp.csv >> diff_2012_${n}.csv
end

Is there a simpler, more elegant way than the above code? Preferably awk, grep or perhaps perl. Thanks in advance,
theflamingmoe
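
On the array question: awk joins comma-separated subscripts into a single key (using the built-in SUBSEP separator), so A[$1,$2,$3,$n] indexes four field values at once, and $n with a variable n selects the field by number. A minimal sketch of the workaround above without temporary files, assuming a POSIX sh rather than tcsh; the function name diff_by_field and its arguments are hypothetical, not something from the thread:

```shell
# Hypothetical helper: one awk call per field number, no temporary files.
# Usage: diff_by_field OLD NEW FIRST LAST PREFIX
diff_by_field() {
    old=$1 new=$2 first=$3 last=$4 prefix=$5
    n=$first
    while [ "$n" -le "$last" ]; do
        awk -F, -v OFS=, -v n="$n" '
            # First file: remember each ($1,$2,$3,$n) combination.
            NR == FNR { seen[$1, $2, $3, $n] = 1; next }
            # Second file: print the combinations not seen in the first file.
            !(($1, $2, $3, $n) in seen) { print $1, $2, $3, $n }
        ' "$old" "$new" > "${prefix}_${n}.csv"
        n=$((n + 1))
    done
}

# Example call using the file names from the post:
# diff_by_field 2011.csv 2012.csv 5 7 diff_2012
```

Unlike the tcsh version, this overwrites each output file with > instead of appending with >>, and it leaves an empty file for any field that has no differences. In tcsh the ! would still need escaping as \!.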

Corona688 · August 16, 2012, 11:49am

Show the output you want for the given input. What you want to do will then be clear.

theflamingmoe · August 16, 2012, 8:23pm

One last try,

Let's say I have two comma-separated lists of fruit:

fruit1.txt

apples,red,2,32,8
pears,green,4,8,20
grapes,black,150,200,160
bannas, yellow,20,15,12
mangos,yellow,30,40,60

fruit2.txt

apples,red,2,32,10
pears,green,4,8,20
grapes,black,150,300,160
bannas, yellow,20,15,12
mangos,yellow,50,40,60

If I use the code:

 
awk -F"," 'NR == FNR {A[$0]=$0; next } !A[$0]' OFS="," fruit1.txt fruit2.txt >> diff_fruit2.txt

The resulting file diff_fruit2.txt (the difference between the two files) should look like this:

diff_fruit2.txt

apples,red,2,32,10
grapes,black,150,300,160
mangos,yellow,50,40,60

Here row 1, field 5 has changed from 8 to 10; row 3, field 4 has changed from 200 to 300; and row 5, field 3 has changed from 30 to 50.

What I want to know is: rather than indexing the whole row, $0, into an array, can I use selected fields as the index? Can I index fields $1, $2 and $5 into an array to get the output:

diff_fruit.txt

apples,red,10

Or index fields $1, $2 and $4 into an array to get the output:

diff_fruit.txt

grapes,black,300

Or index fields $1, $2 and $3 into an array to get the output:

diff_fruit.txt

mangos,yellow,50

The last part of my question: if the above is possible, can I put it in a loop and write to different files, for example naming the files after the field number?

diff_fruit_5.txt

apples,red,10

diff_fruit_4.txt

grapes,black,300

diff_fruit_3.txt

mangos,yellow,50

The files I'm using are very big, with lots of fields. I need a practical way to spot differences between the two files. Printing out the whole row wherever there is a difference is just not feasible in my case. I would be there for a month of Sundays trying to decipher the output.

Hope my example is clearer. Thanks in advance,
theflamingmoe
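
The fruit example can be handled in a single awk pass by putting the loop inside awk and making the field number part of the array key. The following is only a sketch under a few assumptions: a POSIX sh, the two sample files recreated exactly as posted, and the names seen, first, last and out chosen here purely for illustration.

```shell
# Recreate the sample data from the post.
cat > fruit1.txt <<'EOF'
apples,red,2,32,8
pears,green,4,8,20
grapes,black,150,200,160
bannas, yellow,20,15,12
mangos,yellow,30,40,60
EOF
cat > fruit2.txt <<'EOF'
apples,red,2,32,10
pears,green,4,8,20
grapes,black,150,300,160
bannas, yellow,20,15,12
mangos,yellow,50,40,60
EOF

awk -F, -v OFS=, -v first=3 -v last=5 '
NR == FNR {
    # First file: store one key per field number n, built from $1, $2 and $n.
    for (n = first; n <= last; n++) seen[n, $1, $2, $n] = 1
    next
}
{
    # Second file: for each n, write the key to its own file if it is new.
    for (n = first; n <= last; n++)
        if (!((n, $1, $2, $n) in seen)) {
            out = "diff_fruit_" n ".txt"
            print $1, $2, $n > out
        }
}' fruit1.txt fruit2.txt
```

With the sample data this writes mangos,yellow,50 to diff_fruit_3.txt, grapes,black,300 to diff_fruit_4.txt and apples,red,10 to diff_fruit_5.txt. A file is only created for a field that actually differs, so results left over from an earlier run are not cleared. Every output file stays open until awk exits, which is fine for a few dozen fields in gawk but may hit a limit in other awk implementations.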