Format & Compare two huge CSV files

Sheel · December 16, 2011, 5:27am

I have two CSV files with 90K records each, and each row has around 50 columns. Let's say the file names are FILE1 and FILE2. I have to compare the two files and generate a new file containing the rows from FILE2 that differ from FILE1.

FILE1
-----
2001,"John",25,19901130,21211.41,Unix Forum
2002,"Mike",26,19850101,0.0,"Linux Experts, Co."

FILE2
-----
ID,NAME,AGE,JOINDATE,SALARY,ORGANIZATION
2001,John,25,19901130,000000000021211.41,Unix Forum
2002,Mike,26,19850101,000000000000000.00,"Linux Experts, Co."

As you can see, the text values in one of the files are quoted, and the salary field differs in format but not in value. The two files currently hold the same data; the only difference is that FILE1 has no header. So the output file must contain only the header.

Let's change the data in FILE2:

FILE2
-----
ID,NAME,AGE,JOINDATE,SALARY,ORGANIZATION
2001,John,25,19901130,000000000021211.41,Unix Forum
2002,Mike,26,19850101,000000000000000.00,"Linux Experts, Co."

Now the output file should have the header and row 2 from FILE2.

Please suggest an awk command to do this.

balajesuri · December 16, 2011, 5:47am

I understood the first example, where you said the output file must have the header only. I didn't understand the second example. How would the output file have the header and row 2 from FILE2?

I can't see any change to row 2 of FILE2 after you said you were changing its data.

Sheel · December 16, 2011, 6:33am

My bad. Here is the modified file:

FILE2
-----
ID,NAME,AGE,JOINDATE,SALARY,ORGANIZATION
2001,John,25,19901130,000000000021211.41,Unix Forum
2002,Mike,26,19850101,000000000000011.00,"Linux Experts, Co."

michaelrozar17 · December 16, 2011, 7:08am

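If you really want to compare the two files and print the result, try the following, which takes the unquoted NAME column from FILE2 and substitutes it into the corresponding row of FILE1: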

awk 'BEGIN{FS=OFS=",";print "ID,NAME,AGE,JOINDATE,SALARY,ORGANIZATION"} FNR==NR{a[FNR]=$2;next}{$2=a[FNR+1];print}' FILE2 FILE1

Or, if you just want to remove the double quotes in column 2 of FILE1 (which is what it looks like), try:

awk 'BEGIN{FS=OFS=",";print "ID,NAME,AGE,JOINDATE,SALARY,ORGANIZATION"} {gsub("\"","",$2);print}' FILE1
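Neither command above reports which FILE2 rows differ, so here is a rough sketch of one way the comparison itself might be done in plain awk. It is not a tested solution for the real 90K-row files and rests on several assumptions: rows are compared as a whole rather than by ID, any field that looks like a decimal number is normalized to two decimal places, quoted fields never contain line breaks, FILE1 is not empty, and OUTFILE is just a placeholder name for the result. The sample data is the data posted in this thread.

```shell
# Sample data from the thread (FILE2 is the modified version).
cat > FILE1 <<'EOF'
2001,"John",25,19901130,21211.41,Unix Forum
2002,"Mike",26,19850101,0.0,"Linux Experts, Co."
EOF

cat > FILE2 <<'EOF'
ID,NAME,AGE,JOINDATE,SALARY,ORGANIZATION
2001,John,25,19901130,000000000021211.41,Unix Forum
2002,Mike,26,19850101,000000000000011.00,"Linux Experts, Co."
EOF

awk '
# Split one CSV line into f[1..n], honouring double-quoted fields
# (a doubled quote inside a quoted field stands for one literal quote).
function parse(line, f,    n, c, i, inq, cur) {
    n = 0; cur = ""; inq = 0
    for (i = 1; i <= length(line); i++) {
        c = substr(line, i, 1)
        if (c == "\"") {
            if (inq && substr(line, i + 1, 1) == "\"") { cur = cur "\""; i++ }
            else inq = !inq
        }
        else if (c == "," && !inq) { f[++n] = cur; cur = "" }
        else cur = cur c
    }
    f[++n] = cur
    return n
}

# Build a canonical form of a row: quotes removed, decimals as %.2f.
# Assumption: two decimal places are enough for the SALARY column.
function norm(line,    f, n, i, out) {
    n = parse(line, f)
    out = ""
    for (i = 1; i <= n; i++) {
        if (f[i] ~ /^[0-9]+\.[0-9]+$/) f[i] = sprintf("%.2f", f[i])
        out = out (i > 1 ? SUBSEP : "") f[i]
    }
    return out
}

NR == FNR { seen[norm($0)] = 1; next }   # first file (FILE1): remember rows
FNR == 1  { print; next }                # second file (FILE2): keep header
!(norm($0) in seen)                      # print FILE2 rows not found in FILE1
' FILE1 FILE2 > OUTFILE
```

With the sample data, OUTFILE ends up with the header plus the 2002 row from FILE2, printed exactly as it appears in FILE2. If rows should instead be matched by ID, the first field could be used as the array key and the rest of the normalized row as the value.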