Compare and merge two big CSV files

Hi all,

i need help.
I have two csv files with a huge amount of data.
I need the first column of the first file, to be compared with the data of the second, to have at the end a file with the data not present in the second file.
Example

File1: (only one column)

profile_id
57036226
57036230
57109826
57241366
57241374

File2: (has 7 columns; the data in File1 are present in the second, third, fourth and fifth columns)

Id                    FederationIdentifier    Profile_ID__c    Contact.Profile_ID__c    Contact.Account.Profile_ID__c    IsActive    LastModifiedDate
005b000000150mxAAA    57036226                57036226        57036226                57036226                        true        2018-05-24
0050X000007K4p7QAC    56564502                56564502        56564502                56564502                        true        2018-04-12
0050X000007JxYOQA0    56994082                56994082        56994082                56994082                        true        2018-04-10
0050X000007K3UrQAK    57241366                57241366        57241366                57241366                        true        2018-04-11
0050X000007K3UcQAK    57109826                57109826        57109826                57109826                        true        2018-04-11
0050X000007K4d1QAC    58699731                58699731        58699731                58699731                        true        2018-04-12

Output file:

profile_id
57036230
57241374

Output file is a file with the data of File1 not present in File2.

I've tried with join but not works...he put in the output file all the data of the second file without any kind of compare:

join -t, -v 2 <(sort file1) <(sort file2) > file 3

Welcome to the forum.

How far would

awk 'NR == FNR && NR > 1 {T[$1]; next} {for (i=2; i<=5; i++) if ($i in T) delete T[$i]} END {for (t in T) print t}' file[12]
57241374
57036230

Hi RudiC,

many thanks.
the code not works.

I launched the command. I took one of the data from the output file and verified that it was not in file 2 but:

grep 63995855 Salesforce_Active_OK.csv 
 005b0000004D91aAAC;63995855;63995855;63995855;63995855;true;2018-06-20T10:42:13.000Z

Seems that it works for the format of the output file but the data of file 1 were not compared with file 2

The code works. It was tested with your samples posted. Did you try with those?

If you chose to change the datas' field separators, you need to adapt awk 's FS variable as well. Add the -F\; option.

Ok, I've tried with:

awk -F\; 'NR == FNR && NR > 1 {T[$1]; next} {for (i=2; i<=5; i++) if ($i in T) delete T[$i]} END {for (t in T) print t}' Magento_2.csv Salesforce_Active_OK.csv > Output.csv

And I have an output file with data that are present in the file2 (Salesforce_Active_OK.csv).

This is the format of the file2 (yes, it has ";" ):

Id;FederationIdentifier;Profile_ID__c;Contact.Profile_ID__c;Contact.Account.Profile_ID__c;IsActive;LastModifiedDate
005b000000150mxAAA;2630;2630;2630;2630;true;2018-05-24T09:58:01.000Z
0050X000007K4p7QAC;56564502;56564502;56564502;56564502;true;2018-04-12T15:23:04.000Z

Have I made some mistake?

What be the file1 data for your above file2 test run? Are fields 2 - 5 always identical?

I attach an example of the two files (only some rows...files are too big:()

in the second file, fields from 2 to 5 are with the same data.

PS: the real file1 is more big of file 2...is that a problem?

The code fails as your data have DOS line terminators (^M = \r = 0x0D = <CR>). Use sub (/\r$/, _); just before the T array assignment.

Oh! I've launched the command:

awk -F\; 'NR == FNR && NR > 1 {sub (/\r$/, _); T[$1]; next} {for (i=2; i<=5; i++) if ($i in T) delete T[$i]} END {for (t in T) print t}' file1.csv file2.csv > Output.csv

In the Output file, I've taken some random data and....nothing in the second file! :smiley: :smiley:

Many thanks RudiC! Now I must study and better understand the use of awk :wink:

Only a finally question. If I use a file that is the same of file 2 but without the field with data...for example have only one of the field (from 2 to 5) with data and the other empty...is the same command?

I don't understand your question.

for exmaple if in the second file there are rows like this:

005b0000004ODyTAAW;872693;;;;false;2017-10-27T18:23:59.000Z
005b0000004ZqRQAA0;425809;;;;false;2016-09-01T14:14:58.000Z

so with empty record

That will work; it will if either of the fields does compare.