Hi all,
i need help.
I have two csv files with a huge amount of data.
I need the first column of the first file, to be compared with the data of the second, to have at the end a file with the data not present in the second file.
Example
File1: (only one column)
profile_id
57036226
57036230
57109826
57241366
57241374
File2: (has 7 columns; the data in File1 are present in the second, third, fourth and fifth columns)
Id FederationIdentifier Profile_ID__c Contact.Profile_ID__c Contact.Account.Profile_ID__c IsActive LastModifiedDate
005b000000150mxAAA 57036226 57036226 57036226 57036226 true 2018-05-24
0050X000007K4p7QAC 56564502 56564502 56564502 56564502 true 2018-04-12
0050X000007JxYOQA0 56994082 56994082 56994082 56994082 true 2018-04-10
0050X000007K3UrQAK 57241366 57241366 57241366 57241366 true 2018-04-11
0050X000007K3UcQAK 57109826 57109826 57109826 57109826 true 2018-04-11
0050X000007K4d1QAC 58699731 58699731 58699731 58699731 true 2018-04-12
Output file:
profile_id
57036230
57241374
Output file is a file with the data of File1 not present in File2.
I've tried with join but not works...he put in the output file all the data of the second file without any kind of compare:
join -t, -v 2 <(sort file1) <(sort file2) > file 3
RudiC
August 19, 2018, 4:37am
2
Welcome to the forum.
How far would
awk 'NR == FNR && NR > 1 {T[$1]; next} {for (i=2; i<=5; i++) if ($i in T) delete T[$i]} END {for (t in T) print t}' file[12]
57241374
57036230
Hi RudiC,
many thanks.
the code not works.
I launched the command. I took one of the data from the output file and verified that it was not in file 2 but:
grep 63995855 Salesforce_Active_OK.csv
005b0000004D91aAAC;63995855;63995855;63995855;63995855;true;2018-06-20T10:42:13.000Z
Seems that it works for the format of the output file but the data of file 1 were not compared with file 2
RudiC
August 19, 2018, 5:06am
4
The code works. It was tested with your samples posted. Did you try with those?
If you chose to change the datas' field separators, you need to adapt awk
's FS
variable as well. Add the -F\;
option.
Ok, I've tried with:
awk -F\; 'NR == FNR && NR > 1 {T[$1]; next} {for (i=2; i<=5; i++) if ($i in T) delete T[$i]} END {for (t in T) print t}' Magento_2.csv Salesforce_Active_OK.csv > Output.csv
And I have an output file with data that are present in the file2 (Salesforce_Active_OK.csv).
This is the format of the file2 (yes, it has ";" ):
Id;FederationIdentifier;Profile_ID__c;Contact.Profile_ID__c;Contact.Account.Profile_ID__c;IsActive;LastModifiedDate
005b000000150mxAAA;2630;2630;2630;2630;true;2018-05-24T09:58:01.000Z
0050X000007K4p7QAC;56564502;56564502;56564502;56564502;true;2018-04-12T15:23:04.000Z
Have I made some mistake?
RudiC
August 19, 2018, 5:30am
6
What be the file1 data for your above file2 test run? Are fields 2 - 5 always identical?
I attach an example of the two files (only some rows...files are too big:()
in the second file, fields from 2 to 5 are with the same data.
PS: the real file1 is more big of file 2...is that a problem?
RudiC
August 19, 2018, 7:13am
8
The code fails as your data have DOS line terminators (^M = \r = 0x0D = <CR>). Use sub (/\r$/, _);
just before the T
array assignment.
Oh! I've launched the command:
awk -F\; 'NR == FNR && NR > 1 {sub (/\r$/, _); T[$1]; next} {for (i=2; i<=5; i++) if ($i in T) delete T[$i]} END {for (t in T) print t}' file1.csv file2.csv > Output.csv
In the Output file, I've taken some random data and....nothing in the second file!
Many thanks RudiC! Now I must study and better understand the use of awk
Only a finally question. If I use a file that is the same of file 2 but without the field with data...for example have only one of the field (from 2 to 5) with data and the other empty...is the same command?
RudiC
August 19, 2018, 7:54am
10
I don't understand your question.
for exmaple if in the second file there are rows like this:
005b0000004ODyTAAW;872693;;;;false;2017-10-27T18:23:59.000Z
005b0000004ZqRQAA0;425809;;;;false;2016-09-01T14:14:58.000Z
so with empty record
RudiC
August 19, 2018, 8:07am
12
That will work; it will if either of the fields does compare.