I've got two columns, PRODUCT and BRAND. The BRAND column currently holds the first word of each product name; I achieved this by using sed to copy the first word of the PRODUCT column. However, this runs into trouble when the brand has more than one word, e.g. 'Weight Watchers'.
Is there a way I can search for all the products that have 'Weight Watchers' in the title and then copy the string 'Weight Watchers' to the BRAND column?
Weight Watchers Apricot Jam Reduced Sugar Weight
All Natural Peanut Butter & Co Crunch Time All
Streamline Reduced Sugar Apricot Jam Streamline
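For a single known brand, one possible approach is sketched below with awk. It assumes the file is tab-separated with PRODUCT in the first field and BRAND in the second; the function name `set_brand` is made up for illustration.

```shell
# Hypothetical helper: overwrite the BRAND column (field 2) with a given
# multi-word brand whenever the PRODUCT column (field 1) starts with it.
# Assumes tab-separated input: PRODUCT<TAB>BRAND
set_brand() {
    awk -F'\t' -v OFS='\t' -v brand="$1" '
        # Match when the product is exactly the brand, or begins with
        # the brand followed by a space
        $1 == brand || index($1, brand " ") == 1 { $2 = brand }
        { print }
    '
}

# Sample lines taken from the question above
printf 'Weight Watchers Apricot Jam Reduced Sugar\tWeight\nStreamline Reduced Sugar Apricot Jam\tStreamline\n' |
    set_brand 'Weight Watchers'
```

Lines whose product name does not start with the brand pass through unchanged.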
Hi Michael, thanks for your kind help. That's really close, but it's only making the change on every other line. I'll try to explain:
Here's my input file. I've left some of the other columns in, as it didn't seem to make a difference. Sorry if it looks a mess; I can't seem to get the tabs to show up on this forum.
You should consider adding an unambiguous separator between the brand and the product, for example a colon ":", which is a common field separator in Unix text files.
Indeed, that would have two advantages:
1) It would make the step of adding a last column unnecessary.
2) It would fix the ambiguous brand selection caused by the unknown number of words a brand could be made of.
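To illustrate why a separator helps, here is a small sketch. It assumes a hypothetical "Brand:Product" line layout, which is not the current file format; with it, the brand can be split off without knowing how many words it has.

```shell
# Assumed line format (hypothetical): Brand:Product
line='Weight Watchers:Apricot Jam Reduced Sugar'

# Field 1 is the brand, however many words it contains
brand=$(printf '%s\n' "$line" | cut -d: -f1)
# Everything after the first colon is the product name
product=$(printf '%s\n' "$line" | cut -d: -f2-)

printf 'brand=%s\nproduct=%s\n' "$brand" "$product"
```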
How was your input file generated? Maybe there is a way to generate it directly with a separator between the brand and the product name, instead of tediously adding it afterwards.
Do you have a file somewhere containing the list of brands?
You could also upload your entire file so we can help to format it as expected.
The raw data comes from a website scrape and contains 12000 product names but no brand names, e.g.
Filippo Berio Mild & Light Olive Oil (500ml)
What I've done initially is to get the scrape program to take the first word of each product to use as the brand name. This is good for about 90% of brands, but of course it only works for brands that are a single word.
I then use a simple sed script to pick out the incorrect brands e.g.
sed -i "s/^7/7 Up/g" brands.txt
sed -i "s/^Ainsley/Ainsley Harriott/g" brands.txt
sed -i "s/^Air/Air Wick/g" brands.txt
sed -i "s/^Alfa/Alfa One/g" brands.txt
sed -i "s/^All/All About Shine/g" brands.txt
sed -i "s/^Alta/Alta Italia/g" brands.txt
sed -i "s/^Ambi/Ambi Pur/g" brands.txt
sed -i "s/^Angel/Angel Delight/g" brands.txt
However, this again only fixes 70% of brands, as some brands share the same first word, e.g.
John West
John Frieda
I can pull out a list of all the brands if needed. I've attached just the product list file for now but will produce and attach the brand file later today. What makes this a little bit more complicated is that this list will be updated weekly so new brands are constantly being added.
I can get my brand list to a point where it will only take me 10 to 15 minutes of manual editing, so it's not the end of the world.
Many thanks for your kind help on this little problem...
The best way of doing this would be to find a way to extract only the brand name from the web site.
Otherwise, you have to build a list that contains all the "multi-word" brand names, so that we can then set up a script to parse the file, with pseudocode like:
for each brand in the multi-word brand name list
    set the : separator at the right place
for all others
    set the : separator just after the first word
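The pseudocode above could be sketched in awk roughly as follows. This is only one way to do it: the function name `add_separator` and the one-brand-per-line list file are assumptions, and the longest matching brand is preferred so that brands sharing a first word (John West, John Frieda) are told apart by the full name.

```shell
# Hypothetical helper: insert ":" between the brand and the rest of the name.
#   $1    = file listing one multi-word brand per line (assumed to exist)
#   stdin = product names, one per line
add_separator() {
    awk '
        # Records from the brand list file: remember each brand
        FILENAME == ARGV[1] { if ($0 != "") brands[$0] = 1; next }
        {
            sub(/^[ \t]+/, "")            # drop any leading whitespace
            best = ""
            # Pick the longest listed brand that the product name starts with
            for (b in brands)
                if (($0 == b || index($0, b " ") == 1) && length(b) > length(best))
                    best = b
            # No listed brand matched: fall back to the first word
            if (best == "") best = $1
            rest = substr($0, length(best) + 1)
            sub(/^ +/, "", rest)
            print best ":" rest
        }
    ' "$1" -
}

# Usage with brands and product names quoted earlier in this thread
brands=$(mktemp)
printf '%s\n' 'Weight Watchers' 'Filippo Berio' 'John West' 'John Frieda' > "$brands"
printf '%s\n' 'Weight Watchers Apricot Jam Reduced Sugar' \
              'Filippo Berio Mild & Light Olive Oil (500ml)' \
              'Streamline Reduced Sugar Apricot Jam' | add_separator "$brands"
rm -f "$brands"
```

Since the product list is updated weekly, only the brand list file would need maintaining; any brand not in it is treated as a single word.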
If you can give more details about how you generate the initial file, maybe we can separate the brand and product fields directly at the generation step.
The file is initially generated using a piece of web scraping software. I've just had a look at the web pages I scrape from, and it looks like I can produce a scrape of all the brand names.
So if you can generate a list of the brand names only, plus the other file containing everything, you can then put the separator in the right place.
Can you upload the file containing only the brand names?
I'm afraid automatically generating the brand list is harder than I initially thought. I will still continue to look at this. The support forum for the website for the scrape software is currently down and I need to ask a question there. Many thanks for your help