Compare two csv files with same column structure

Good afternoon team,
I have export1.csv (2m records) and export2.csv (3m records) with the same column structure, the data is disordered, and neither have row data that exactly matches. And, (much to my chagrin) yes, the data comes pre-packaged with commas.

file 1
CN=username3,OU=TestOU,DC=company,DC=com, CN=username3,OU=TestOU,DC=company,DC=com, meta8907435339
CN=username4,OU=TestOU,DC=company,DC=com, CN=username4,OU=TestOU,DC=company,DC=com, meta9084538488
CN=username1,OU=TestOU,DC=company,DC=com, CN=username1,OU=TestOU,DC=company,DC=com, meta0010193834
CN=username2,OU=TestOU,DC=company,DC=com, CN=username2,OU=TestOU,DC=company,DC=com, meta8974583475


file 2
CN=username2,OU=TestOU,DC=company-TEST,DC=com, CN=username2,OU=TestOU,DC=company-TEST,DC=com, meta0934530054
CN=username1,OU=TestOU,DC=company-TEST,DC=com, CN=username1,OU=TestOU,DC=company-TEST,DC=com, meta6546547888
CN=username5,OU=TestOU,DC=company-TEST,DC=com, CN=username5,OU=TestOU,DC=company-TEST,DC=com, meta6542134546
CN=username4,OU=TestOU,DC=company-TEST,DC=com, CN=username4,OU=TestOU,DC=company-TEST,DC=com, meta4654688798
CN=username3,OU=TestOU,DC=company-TEST,DC=com, CN=username3,OU=TestOU,DC=company-TEST,DC=com, meta5454654987

For my purposes, there are four matching rows in the examples above if we exclude the LAST 22 characters of either col1 or col2. I only need the non-matching (i.e., file2,row 3) lines to be output.

I've considered something along these lines:

awk 'NR==FNR { arr[$0]="1";next  } arr[$1]!="1" { print $0 }' file2 file1 > file3

But I lack the ability to modify the filter in the precise way that I need. I have no idea where to start when my need is to exclude the last N chars of row 2 and match on the remainder with output directed to a file with disordered data, to boot. I'm stuck.

Any help would be kindly appreciated!

@ttyp00 , welcome. If i understand your request, something along that below should fit

awk -F, 'NR==FNR { keys[$1]=1; next } !($1 in keys) { print $0 }' file1 file2
CN=username5,OU=TestOU,DC=company-TEST,DC=com, CN=username5,OU=TestOU,DC=company-TEST,DC=com, meta6542134546

if not , show your EXPECTED results from the two file samples and we'll see if we can understand your requirement better.

tks

1 Like

@ttyp00
sounds like either column1 or column5 (as they have the same same value in your data sample) can be used as unique keys for each row in either of the files.

If that's the case, then it's easy and @munkeHoller provided one approach.
If the "unique key" is a combination of several column values for each row, we'll need:

  1. it to be stated clearly
  2. to be provided with the more representative sample(s) representing the item 1 above.
  3. a desired output based on the 2 items above.

Also it might be a good idea to "trim" any leading/trailing spaces for each field value - e.g. column1 and column5 look the same except for the leading space(s) in field 5.

1 Like

I very (VERY) much appreciate the assist. Forgive me for the add-on question, but there are some things I don't understand about the awk string.

awk -F, 'NR==FNR { keys[$1]=1; next } !($1 in keys) { print $0 }' file1 file2

Can you please break down for me what each of these components below is doing? I got my desired output, but I can't validate without understanding what the command is doing. How is awk doing the comparison??

{ keys[$1]=1; next }
!($1 in keys)

This is done for the first input file (where NR==FNR is true).
$1 is the column1 value. The keys array is indexed by it. Accessing a keys[value] is enough to create this array element. The =1 assigns a value 1 to it.
The next jumps to the next input cycle, skipping the following code.
The following code runs for the next input file (where NR==FNR is false).

$1 in keys looks up the column1 value in the keys indexes. It yields a true condition if the index exists, without returning the value as a keys[$1] does.
The ! negates the condition.

something along the following:

pass 1 - (the NR==FNR... part) read first file, take the first field ($1) and store into an array called keys, (the 'keys[$1]=1' part) , move on to the next record in file 1 ( the next ) directive does that

pass 2 ( !($1 in keys) part- reads the 2nd file to see if the first field ($1) matches any of the values in the keys array, if not, print out the whole record from file2 just read .