CSV joining and checking multiple files

SDohmen · August 22, 2016, 8:47am

Hello,

For our work we use several scripts to gather/combine data for use in our webshop. Untill now we did not had any problems but since a couple days we noticed some mismatches between imports.

It happened that several barcodes where matched even though it was a complete other product. Of course the scripts arent checking on this yet so we need to upgrade the scripts to check for this and give us a list to check or update the listing.

The supplier sends us a CSV file with data as shown below:

supplier_clean.csv

ean;pps_reference;stock;price;sku;mpn;manufacturer
4260010852693;1043154;84;743.42;P00000172;70100118555;Fujifilm
4960999575285;273189;9400;141.80;P00009067;2768B016;Canon
0013803092899,4960999575292;27433196;44;44.94;P00022338;2768B017;Canon
8715946388540;2944686;1030;47.76;P00000878;C13S042167;Epson
0088698115763,3141725001174;3654125;20;54.80;P00004251;C1825A;Hewlett Packard

This file is being joined to another file with the following code, more on this here:

joining.sh

#!/bin/sh

awk  '
BEGIN           {FS = OFS = ";"
                 print "ean;sku;pps_reference;mpn;stock;price;manufacturer;supplier_code"
                }
                {gsub (/ /, "", $1)
                }
NR == FNR       {for (n = split($1, T, ","); n > 0; n--) S[T[n]]=$2
                 next
                }
                {for (n = split($1, T, ","); n > 0; n--) if (T[n] in S) {$2 = S[T[n]] OFS $2
                 print
                 next
                 }
                }
' $1 $2 > $3

This script get called as follows.

join_prijslijst.sh website_clean.csv supplier_clean.csv results.csv

The website_clean has the following data (short example)

website_clean.csv

Barcode;Sku;Manufacturer
0696720480781,4000567150589;P00002801;Braun Photo Technik
4000461043031;P00002800;D�rr
4000461034213,4000461037818;P00002799;D�rr
0891257001526,8912570015266;P00002634;Gary Fong
0891257001106;P00002633;Gary Fong
0887111646026;P00002632;HP
0887111515629;P00002631;HP

The problem is that the checking if the manufacturer has to happen during the joining together and to make matters worse some suppliers have different names for some suppliers (For example HP, Hewlet packard etc etc).

My idea is to have another file where first the website_clean checks for all the possible names of that manufacturer (see below for example) and this then compares against the supplier_clean csv. If the correct name is in there it continues as normal and if not it writes this line to a seperate file which we then can manual check for the names. In this seperate file i need both lines though so we can check which would be the correct name/product.

manufacturer_check.csv

manufacturer
HP,Hewlett Packard, HP INC.
Canon
Fujifilm
Epson
WD,Western Digital

I hope this is clear in what needs to happen to make it work. If not let me know and i will try to explain it better.

RudiC · August 22, 2016, 9:04am

If I remember correctly, EAN are unique except for a certain range set aside for any shop's internal coding, so what do you mean by "several barcodes where matched"? Please give us examples of data sets where the identification went wrong. And, your path forward is not too clear to me. Please explain in more detail. Why should the supplier's name help if barcodes are falsely read?

SDohmen · August 22, 2016, 9:16am

They should be indeed but too bad that different manufacturers can use the same code somehow. For example EAN 7636490074196 which is a Seagate 2TB SSD but also a Lacie external HDD.

The problem is not in the joining itself but due too the problem that it is used multiple times.

I am not sure if the explaination will be sufficient but i will try.

The script we use for joining on barcodes needs to be adapted/changed so that it will check the manufacturer of both files against a third file where all the different names are written. This third file is pure to have manufacturers like HP, WD etc caught without them getting ignored each time.

To describe it in steps:

Each line from the supplier file gets matched against the website file. With this is the complete line
Now the manufacturerfrom the website file gets checked against the manufacturer file so it can check how the manufacturer can be written by different suppliers.
The line that matches from those 2 has to be checked against the manufacturer from the supplier.
If this are the same it continues with the normal loop. If there is a difference it needs to be written to a new file which will be picked up for manual checking.
This manual file needs to have the full line from website and supplier in them so we can check if its a spelling error or a ean error.

I hope this clarifies it a bit.

RudiC · August 22, 2016, 2:06pm

No, it doesn't. Please try again, using input data for demonstration.

SDohmen · August 23, 2016, 5:25am

Sorry for that but it is a very confusing bit indeed.

website_clean.csv

Barcode;Sku;Manufacturer
4260010852693,4000567150589;P00002801;Fujifilm
4960999575285;P00002800;Canon
4000461034213,4000461037818;P00002799;D�rr
0891257001526,8912570015266;P00002634;Gary Fong
0891257001106;P00002633;Gary Fong
0887111646026;P00002632;HP
0088698115763;P00002631;HP

supplier_clean.csv

ean;pps_reference;stock;price;mpn;manufacturer
4260010852693;1043154;84;743.42;70100118555;Fujifilm
4960999575285;273189;9400;141.80;2768B016;Canon
4000461034213,4960999575292;27433196;44;44.94;2768B017;Canon
8715946388540;2944686;1030;47.76;C13S042167;Epson
0088698115763,3141725001174;3654125;20;54.80;C1825A;Hewlett Packard

manufacturer_compare.csv

manufacturer
HP,Hewlett Packard, HP INC.
Canon
Fujifilm
Epson
WD,Western Digital

Above are 3 files in which the first 2 are the important ones for the joining part.

When i run the script with ./join_prijslijst.sh website_clean.csv supplier_clean.csv output.csv, i get the following output

ean;sku;pps_reference;mpn;stock;price;manufacturer;supplier_code
4260010852693;P00002801;1043154;84;743.42;70100118555;Fujifilm
4960999575285;P00002800;273189;9400;141.80;2768B016;Canon
4000461034213,4960999575292;P00002799;27433196;44;44.94;2768B017;Canon
0088698115763,3141725001174;P00002631;3654125;20;54.80;C1825A;Hewlett Packard

You would say that this is correct and also fine but the problem is the following line:

4000461034213,4000461037818;P00002799;D�rr

In the website file the manufacturer is Dorr but in the supplier file it is Canon thus making the wrong join and most likely also add a wrong price etc to this product.

What we want is to have this split up in 2 files in which 1 is as follows:
New_output.csv

ean;sku;pps_reference;mpn;stock;price;manufacturer;supplier_code
4260010852693;P00002801;1043154;84;743.42;70100118555;Fujifilm
4960999575285;P00002800;273189;9400;141.80;2768B016;Canon
0088698115763,3141725001174;P00002631;3654125;20;54.80;C1825A;Hewlett Packard

and another file with the following contents:
wrong_match.csv

4000461034213,4000461037818;P00002799;D�rr;4000461034213,4960999575292;27433196;44;44.94;2768B017;Canon

Then we can check that it is the same manufacturer or not. If for example the manufacturer would be HP inc instead of HP in our system we can just add that to the manufacturer_compare.csv file so it will get recognized the next time.

I hope this clears it up a bit.

RudiC · August 23, 2016, 6:01am

Let me paraphrase this: when creating the output file, compare the supplier field from website_clean.csv with the supplier from supplier_clean.csv . If identical, fine, print the record. If can be reconciled via manufacturer_compare.csv , fine, print it, BUT: which supplier?
If it can't be reconciled, print to wrong_match.csv for later evaluation.

SDohmen · August 23, 2016, 6:03am

Instead of the supplier field use the Manufacturer field and you are right on the spot.

The supplier field is something we add after the joining etc has been done. It can be ignored.

RudiC · August 23, 2016, 6:21am

How about

awk  ' 
BEGIN           {FS = OFS = ";"
                 print "ean;sku;pps_reference;mpn;stock;price;manufacturer"
                }
FNR  == 1       {FCNT++
                }
FCNT == 1       {for (n = split($1, T, ","); n > 0; n--)         MFC[T[n]] = $0
                }

                {gsub (/ /, "", $1)
                }
FCNT == 2       {for (n = split($1, T, ","); n > 0; n--)        {SKU[T[n]] = $2
                                                                 SUP[T[n]] = $3
                                                                }
                 next
                }
                {for (n = split($1, T, ","); n > 0; n--) if (T[n] in SKU)       {$2 = SKU[T[n]] OFS $2
                                                                                 if (MFC[$NF] ~ SUP[T[n]]) print
                                                                                 else                      print $0, SUP[T[n]] > ERRFILE
                                                                                 next
                                                                                }
                }
' ERRFILE="wrong_match.csv" manufacturer_compare.csv website_clean.csv supplier_clean.csv
ean;sku;pps_reference;mpn;stock;price;manufacturer
4260010852693;P00002801;1043154;84;743.42;70100118555;Fujifilm
4960999575285;P00002800;273189;9400;141.80;2768B016;Canon
0088698115763,3141725001174;P00002631;3654125;20;54.80;C1825A;Hewlett Packard

cat wrong_match.csv 
4000461034213,4960999575292;P00002799;27433196;44;44.94;2768B017;Canon;D�rr

SDohmen · August 23, 2016, 10:12am

Thank you for the code. I will test it asap and over a couple days to see what happens.

1 thing i did notice though that i am unsure off. From what i can see in the code it will check every manufacturer against the compare file right? This is not really needed since if the name is the same it does not need to check anymore but i assume comparing them all is easier with the coding?

---------- Post updated at 04:12 PM ---------- Previous update was at 04:00 PM ----------

I just tested it and it seems there is something strange going on.

The "correct" file has the following line:

ean;pps_reference;stock;price;sku;mpn;manufacturer
0885370995022;132477;20;2220.45;P00030979;KF7-00030;Microsoft
0885370942705;132484;70;2277.56;P00030395;KF6-00037;Microsoft
0889842018363;146442;4;2724.97;P00031049;KF7-00081;Microsoft
0889842050868;191907;80;2520.21;P00069389;XBOX BUNDLE;Microsoft
0885370808339,0885370813340,0885370903331;813533;10;2205.69;P00006988;5C5-00013;Microsoft
0885370604085,0885370862201;127962;100;322.79;P00030393;S4V-00010;Microsoft
0885370862492;127976;0;328.37;P00032536;9J7-00004;Microsoft
0885370672213;128004;0;221.26;P00007564;6VC-00001;Microsoft

All correct lines only contain Microsoft as manufacturer. The faulty file contains the following items (the rest)

4260010852372;P00016984;849562;70100114776;40;516.34;Fujifilm ;DIF;Fujifilm
4548736014138;P00016992;100214;DSCWX500B.CE3;0;3102.70;Sony ;DIF;Sony
4260041684713;P00016986;101432;10051;20;417.20;Easypix ;DIF;Easypix
5025232832996;P00017004;116790;DMC-FT5EG9-Z;3;2100.08;Panasonic ;DIF;Panasonic
4548736017092;P00008802;118309;DSCRX100M4.CE3;10;9108.11;Sony ;DIF;Sony
4960371004433;P00073960;624461;272ES;7;3020.22;Tamron ;DIF;Tamron
4013904001338,4961607634356;P00017337;625380;T5111611;30;4100.72;Tokina ;DIF;Tokina

Do you have any idea what could be wrong?

Just to add some extra info. The column with DIF in there is just a supplier code which gets added just before the joining process.

RudiC · August 23, 2016, 11:46am

My crystal bowl is somewhat hazy today - you could help by sending some polish. Or - surprise! - by posting all three input files.

And yes - I forwent the test on supplier equality for simplicity of code at the expense of some CPU cycles. Should your manufacturer file become tremdously large, you may add the test into the script.

SDohmen · August 23, 2016, 12:37pm

Sorry, in all haste i forgot to post the files itself. I will see if i can get some full files although i have to edit some columns. Will do this tomorrow when i am back at work.

SDohmen · August 24, 2016, 6:09am

Hello,

I attached the 3 main files: website_clean.csv, supplier_clean.csv and manufacturer_compare.csv.

With the original script only the first 2 are needed and with the new one also the compare one of course.

Since these are full versions i assume this is sufficient for the script testing. The supplier one is also the one that should provide some mismatches with manufacturer when matching for the testing itself.

RudiC · August 24, 2016, 7:18am

How do you expect a script to execute reasonably on a data set if the structure of that data is NOT what you posted and on which the script was built?

The script evaluates $NF for the supplier field in supplier.csv, but in ALL lines finds "JAC" (not "DIF" as you claimed), for which no manufaturer exists, so it writes the record to the wrong_match file. Except for those "Microsoft" (and a few "Nokia") entries, that don't have a manufacturer in $3 in website.csv, so empty matches empty and thus prints to the result file.

PLEASE get your act together, correct the input data, and run the script again. Use $6 in lieu of $NF if the "JAC" field can't be eliminated, hoping $6 will always be the supplier field. Also, look for orthography in the data, like additional spaces or upper/lower case.

SDohmen · August 24, 2016, 8:49am

rudic:

How do you expect a script to execute reasonably on a data set if the structure of that data is NOT what you posted and on which the script was built?

The script evaluates $NF for the supplier field in supplier.csv, but finds "JAC" (not "DIF" as you claimed) in ALL lines, for which no manufaturer exists, so it writes the record to the wrong_match file. Except for those "Microsoft" (and a few "Nokia") entries, that don't have a manufacturer in $3 in website.csv, so empty matches empty and thus prints to the result file.

PLEASE get your act together, correct the input data, and run the script again. Use $6 in lieu of $NF if the "JAC" can't be eliminated, hoping $6 will always be the supplier field. Also, look for orthography in the data, like additional spaces or upper/lower case.

I am sorry for overlooking that error. The JAC/DIF code is just a supplier code which get added. I overlooked this when copying over the code.

I tested the change and it works fine but i am wondering about 1 thing though. You mentioned the part about the upper/lower cases. Does the script not skip checking on this or is that on purpose that it fails them in the list?

SDohmen · September 13, 2016, 9:22am

*bump*.

After last post i decided to add some extra cleanup lines in the form of all lower cases etc. However i seemed to find some lines that dont like to work with the comparing.

In the output file i have the following lines:

4,25E+12   P00043585   242457   8,4E+10   90   19.40   ekl   JAC   alpenfohn
4,25E+12   P00052963   1577936   8,4E+10   18   12.50   ekl   JAC   alpenfohn
4,25E+12   P00052964   2001285   8,4E+10   18   12.00   ekl   JAC   alpenfohn

Dont look at the first and third column as i copied these over from excel so they look messed up. In the last field it says alpenfohn as manufacturer and according to the supplier it is ekl.

In my comparing file i have this line:

ekl,alpenfohn,alpenf�hn

This should be an accepted compare and it should not have been in the compare file.

I have the idea that the script you provided only checks for the first name in the comma seperated list. Can you confirm this and possible have an idea how to get around that since this would mean i need to add double lines.

[EDIT]

For some reason i start to doubt that it also keeps tracks of the spaces in the names since i noticed that the lines below also dont get skipped (as being correct).

          
5,04E+12   P00055056   2085825   KD1250K-QS   0   169.00   black en decker   JAC   black en   decker       
5,04E+12   P00055017   2090770   KR805K   0   83.60   black en decker   JAC   black en decker

Don_Cragun · September 13, 2016, 3:43pm

So we are initially told that we have three CSV input files with semicolons as the field delimiters, but now you are telling us that code RudiC suggested meeting those requirements does not work when one of the input files starts using spaces instead of semicolons as the field delimiters (even though some of the fields in that file contain spaces as data) and one of the input files starts using commas instead of semicolons as the field delimiter. And, instead of seven input fields we now have 10 input fields and we aren't supposed to look at two of those fields because they are messed up.

If your input data formats keep changing, it is not our job to act as your unpaid programming staff and rewrite code to meet your constantly changing input file formats.

If you would like to try modifying RudiC's suggestions to meet your new input file formats, can show us clear specifications for your new input file formats, and can explain where you are stuck trying to get your modifications to work correctly; we will be happy to help you.

If you would like us to guess at what your new input file format is for your third input file and guess at how we might modify RudiC's code to work with your new input file formats, I doubt that many of the volunteers reading your posts here will be interested in tackling that assignment.

SDohmen · September 13, 2016, 3:56pm

don cragun:

So we are initially told that we have three CSV input files with semicolons as the field delimiters, but now you are telling us that code RudiC suggested meeting those requirements does not work when one of the input files starts using spaces instead of semicolons as the field delimiters (even though some of the fields in that file contain spaces as data) and one of the input files starts using commas instead of semicolons as the field delimiter. And, instead of seven input fields we now have 10 input fields and we aren't supposed to look at two of those fields because they are messed up.

If your input data formats keep changing, it is not our job to act as your unpaid programming staff and rewrite code to meet your constantly changing input file formats.

If you would like to try modifying RudiC's suggestions to meet your new input file formats, can show us clear specifications for your new input file formats, and can explain where you are stuck trying to get your modifications to work correctly; we will be happy to help you.

If you would like us to guess at what your new input file format is for your third input file and guess at how we might modify RudiC's code to work with your new input file formats, I doubt that many of the volunteers reading your posts here will be interested in tackling that assignment.

I think we have a slight miscommunication here :). All files use ; as delimiter. The linked parts above are just several lines which went through the script he wrote and that gives that output. The problem(s) i seem to have is that it ignores manufacturers which are not the first name in the list (all files are added to a previous post btw) and for some reason it does not seem to recognize several manufacters which have spaces like the ones above. The first part (the name in front) is pure to know if i need to rearrange the names but the second can be tricky since there are enough names that do have spaces. I could technicly remove all spaces but i am not sure if that would be a good solution.

The input files havent changed ever except for my miss on the column number for which i am sorry.

[Edit]
Again the code RudiC wrote is superb and the files havent changed. The problem is with the output it creates and most possible be a simple solution.

RudiC · September 13, 2016, 4:03pm

If you look at the script, you see that FS = OFS = ";" , and that the respective fields are split($1, T, ",") by commas. ALL comma separated subfields are treated identical and equal, and spaces are NOT treated specially thus can be part of names.
So - I'd recommend to take a look at the input data and validate those. black en decker and black en decker are definitely NOT the same!

SDohmen · September 13, 2016, 4:08pm

Thank you for that answer. Could you do comment on my question if the searched manufacturer name has to be first or does that not matter?

RudiC · September 13, 2016, 4:53pm