Difference between two huge .csv files

Hi all,

I need help on getting difference between 2 .csv files.
I have 2 large . csv files which has equal number of columns. I nned to compare them and get output in new file which will have difference olny.

E.g.
File1.csv

Name, Date, age,number 
Sakshi, 16-12-2011, 22, 56
Akash, 14-12-2011, 23, 76

File2.csv

Name,Date,age,number
Sakshi, 14-12-2011,22,56
Akash,18-12-2011,23,76

then output should be like

16-12-2011                      14-12-2011
14-12-2011                       18-12-2011

It's just an example. What I am trying to say is I should get only the values of columns where we have the difference. Not the whole line.
Assuming File will be in sorted order.
There can be m number of columns but for sure in both the files, we will get same columns. If values are different then those values should be given in output.
It can also work if we can get difference in comma separated file like
wherver values matches between 2 files we get blank
,16-12-2011,,
Hope I am able to explain the issue.

awk -F, 'FNR==NR{a[$1]=$2;next}{if(a[$1]!=$2){print a[$1],$2}}' file1 file2

I think I am not able to explain issue properly.

In the example given, there is a difference at 2nd column only. But there can be difference in some other columns value as well. This command is giving result for difference at 2nd place only.
Can you give me the command so that I can get result in comma separated format only. By this I will get to know wherever values are not matching in our files.
It's not neccessary to get values from both the file. Let say there is difference at 3rd column and 7th column so my result should be like
,,17-12-2011,,,,10,,,,,,,,,

Please help

try with this..

paste file1 file2 | awk -F "[,\t]" '{for(i=1;i<=(NF/2);i++){if($i != $(NF/2+i)){printf $i}else{printf ","}}}{print ""}' 

Thanks for the help.:slight_smile:

But still have one issue.

If i have difference in 2 consecutive columns, it's not showing any separation between them.
E.g
File1
Rahul, 1203,113,11

File2
Malik, 121,113,11

Output coming as Rahul1203,,

Expected Output: Rahul,1203,,

What about FS for all the values. So you can easily distinguish between them...

paste file1 file2 | awk -F "[,\t]" '{for(i=1;i<=(NF/2);i++){if($i != $(NF/2+i)){
if(s){s=s";"$i}else{s=$i}}else{if(s){s=s";,"}else{s=","}}}}{ print s;s=""}'

Thanks for your help :slight_smile:

It's working exactly what I want.

If possible Can you please explain the code.

Thanks

paste file1 file2 # Here we join files line by line

awk -F "[,\t]" # here we have comma separated file and after paste it has tab which separates two files.

for(i=1;i<=(NF/2);i++){if($i != $(NF/2+i)) # we have to compare column 1 of first file to column 1 of second file and so on. So i have divided number fields to compare it. 1 to (NF/2)+1 and so on..

if(s){s=s";"$i}else{s=$i}}else{if(s){s=s";,"}else{s=","}}}}{ print s;s=""} # Here just we assign values to the variable s as per your requirement and at the end we print that s.

Hope this helps you:)

my file has around 1 lakh rows and 300 columns.

Though the command mentioned above is working fine but is taking more than 2 hrs to completely execute and is time consuming.

Is there any command that can specify only the column values and column number where we have the difference like

If there is difference at column number 3 and column number 7, resultant file will have value as
3,7 (Column Number)
Rahul,23 (Values present in file 1)

Rather than displaying "," for all the columns that are matching between two files. It only display columns where we have the mismatch but with a column number to keep the track.

Thanks

I am not sure how we can minimize time for this...

as per above requested input..

is this what you want..?

$paste file1 file2 | awk -F "[,\t]" '{for(i=1;i<=(NF/2);i++){
if($i != $(NF/2+i)){if(s){s=s";"$i","$(NF/2+i)"("i")"}else{s=$i","$(NF/2+i)"("i")"}}}}{ print s;s=""}'

 16-12-2011, 14-12-2011(2)
 14-12-2011,18-12-2011(2)
Maik,Malik(1);113,11(3)

Thanks for the Help :slight_smile:

Its woking exactly the way I want to. Only thing is it's taking much time to execute completely. Don't know how to overcome that