String copy

mrpugster · March 25, 2011, 6:47am

Hi guys

I've got two columns, PRODUCT and BRAND, the Brand column currently has the first word of each product, I've acheived this by using SED to copy the first word of the PRODUCT column, however you run into trouble when the brand has more than one word, i.e. 'Weight Watchers'.

Is there a way I can do a search for all the products that have 'Weight Watchers' in the title and then copy the string 'Weight Watchers' to the brand column?

Weight Watchers Apricot Jam Reduced Sugar Weight
All Natural Peanut Butter & Co Crunch Time All
Streamline Reduced Sugar Apricot Jam Streamline

Many thanks

rdcwayx · March 25, 2011, 7:03am

give the sample, and your expect output.

mrpugster · March 25, 2011, 7:22am

Sorry if the above isn't clear, hopefully this is easier to understand.

Sample

Product Brand
Weight Watchers Apricot Jam Reduced Sugar Weight

Expected Output

Product Brand
Weight Watchers Apricot Jam Reduced Sugar Weight Watchers

Thanks

michaelrozar17 · March 25, 2011, 7:48am

Try..

awk '/Weight Watchers/{print $0 FS $2;getline}1' inputfile > outfile

mrpugster · March 25, 2011, 8:08am

Hi Michael, thanks for you kind help, that's really close but it's only doing the change on every other line, I'll try to explain:

Here's my input file, I've left some of the other columns in, didn't seem to make a difference, sorry if it seems a mess, can't see to get the tabs to show up on this forum.

Castillo d'Canena Extra Virgin Arbequina Olive Oil (250ml)    8            10000        Castillo
Castillo d'Canena Extra Virgin Picual Olive Oil (250ml)    8            10000        Castillo
Farchioni Il Casolare Olive Oil Extra Virgin (1L)    8            10000        Farchioni
Filippo Berio Extra Virgin Olive Oil (250ml)    8            10000        Filippo 
Filippo Berio Extra Virgin Olive Oil (500ml)    8            10000        Filippo 
Filippo Berio Extra Virgin Olive Oil (750ml)    8            10000        Filippo 
Filippo Berio Extra Virgin Olive Oil (1L)    8            10000        Filippo 
Filippo Berio Extra Virgin Olive Oil Special Selection (250ml)    8            10000        Filippo 
Filippo Berio Extra Virgin Olive Oil Tin (1L)    8            10000        Filippo 
Filippo Berio Gusto Fruttato Extra Virgin Olive Oil (500ml)    8            10000        Filippo 
Filippo Berio Mild & Light Olive Oil (250ml)    8            10000        Filippo 
Filippo Berio Mild & Light Olive Oil (500ml)    8            10000        Filippo 
Filippo Berio Mild & Light Olive Oil (1L)    8            10000        Filippo

On this I ran:

awk '/Filippo Berio/{print $0 FS $2;getline}1' test1.txt > test2.txt

And the output:

Castillo d'Canena Extra Virgin Arbequina Olive Oil (250ml)    8            10000        Castillo
Castillo d'Canena Extra Virgin Picual Olive Oil (250ml)    8            10000        Castillo
Farchioni Il Casolare Olive Oil Extra Virgin (1L)    8            10000        Farchioni
Filippo Berio Extra Virgin Olive Oil (250ml)    8            10000        Filippo  Berio
Filippo Berio Extra Virgin Olive Oil (500ml)    8            10000        Filippo 
Filippo Berio Extra Virgin Olive Oil (750ml)    8            10000        Filippo  Berio
Filippo Berio Extra Virgin Olive Oil (1L)    8            10000        Filippo 
Filippo Berio Extra Virgin Olive Oil Special Selection (250ml)    8            10000        Filippo  Berio
Filippo Berio Extra Virgin Olive Oil Tin (1L)    8            10000        Filippo 
Filippo Berio Gusto Fruttato Extra Virgin Olive Oil (500ml)    8            10000        Filippo  Berio
Filippo Berio Mild & Light Olive Oil (250ml)    8            10000        Filippo 
Filippo Berio Mild & Light Olive Oil (500ml)    8            10000        Filippo  Berio
Filippo Berio Mild & Light Olive Oil (1L)    8            10000        Filippo

michaelrozar17 · March 25, 2011, 8:24am

Could you change as below and try..

awk '/Filippo Berio/{print $0 FS $2;next}1' inputfile > outfile

mrpugster · March 25, 2011, 8:38am

Ah ha ----- perfect You just saved me lots of manual find and replace.

---------- Post updated at 04:38 AM ---------- Previous update was at 04:28 AM ----------

Hi again

I've got 40+ of these to do, what's the best way to go about putting all 40 in a script, e.g.

awk '/Filippo Berio/{print $0 FS $2;next}1' test1.txt > test2.txt
awk '/Sun Grown/{print $0 FS $2;next}1' test1.txt > test2.txt

The above won't work obviously, however with AWK can you work on the input file alone so you don't have to output to a second file?

I also tried putting two command on the same line but this causes errors in other products.

awk '/Filippo Berio/{print $0 FS $2;next}1','/Sun Grown/{print $0 FS $2;next}1' test1.txt > test2.txt

thanks

ctsgnb · March 25, 2011, 9:06am

You should consider adding a non ambiguous separator between the Brand and the product (for example like colon ":" to follow POSIX standard)
Indeed, that would gather the advantages of

1) making the last column adding operation useless
2) Fixing ambiguous brand selection due to the unknown number of words if could be made of.

# sed 's/Berio/&:/;s/Castillo/&:/;s/Farchioni/&:/;s/: /:/' infile
Castillo:d'Canena Extra Virgin Arbequina Olive Oil (250ml) 8 10000
Castillo:d'Canena Extra Virgin Picual Olive Oil (250ml) 8 10000
Farchioni:Il Casolare Olive Oil Extra Virgin (1L) 8 10000
Filippo Berio:Extra Virgin Olive Oil (250ml) 8 10000
Filippo Berio:Extra Virgin Olive Oil (500ml) 8 10000
Filippo Berio:Extra Virgin Olive Oil (750ml) 8 10000
Filippo Berio:Extra Virgin Olive Oil (1L) 8 10000
Filippo Berio:Extra Virgin Olive Oil Special Selection (250ml) 8 10000
Filippo Berio:Extra Virgin Olive Oil Tin (1L) 8 10000
Filippo Berio:Gusto Fruttato Extra Virgin Olive Oil (500ml) 8 10000
Filippo Berio:Mild & Light Olive Oil (250ml) 8 10000
Filippo Berio:Mild & Light Olive Oil (500ml) 8 10000
Filippo Berio:Mild & Light Olive Oil (1L) 8 10000

---------- Post updated at 02:06 PM ---------- Previous update was at 01:54 PM ----------

You could then extract the brand easy :

# cut -d: -f1 infile | sort -u
Castillo
Farchioni
Filippo Berio

mrpugster · March 25, 2011, 9:20am

Thank you CTSGNB, interesting solution of yours.

As I've only got 40 or so of these I think I'll carry on the the AWK solution, something like this:

awk '/Filippo Berio/{print $0 FS $2;next}1' test1.txt > test2.txt
awk '/Sun Grown/{print $0 FS $2;next}1' test2.txt > test3.txt

I can't figure out how to work on the same input file like you can with sed -i so I'll just re-direct the output for each line.

ctsgnb · March 25, 2011, 9:24am

Once you have colon delimited your file, adding the last column is just piece of cake

# cat outfile
Castillo:d'Canena Extra Virgin Arbequina Olive Oil (250ml) 8 10000
Castillo:d'Canena Extra Virgin Picual Olive Oil (250ml) 8 10000
Farchioni:Il Casolare Olive Oil Extra Virgin (1L) 8 10000
Filippo Berio:Extra Virgin Olive Oil (250ml) 8 10000
Filippo Berio:Extra Virgin Olive Oil (500ml) 8 10000
Filippo Berio:Extra Virgin Olive Oil (750ml) 8 10000
Filippo Berio:Extra Virgin Olive Oil (1L) 8 10000
Filippo Berio:Extra Virgin Olive Oil Special Selection (250ml) 8 10000
Filippo Berio:Extra Virgin Olive Oil Tin (1L) 8 10000
Filippo Berio:Gusto Fruttato Extra Virgin Olive Oil (500ml) 8 10000
Filippo Berio:Mild & Light Olive Oil (250ml) 8 10000
Filippo Berio:Mild & Light Olive Oil (500ml) 8 10000
Filippo Berio:Mild & Light Olive Oil (1L) 8 10000

# awk -F: '{print $0":"$1}' outfile

or

# awk -F: '{print $0,$1}' OFS=: outfile
Castillo:d'Canena Extra Virgin Arbequina Olive Oil (250ml) 8 10000:Castillo
Castillo:d'Canena Extra Virgin Picual Olive Oil (250ml) 8 10000:Castillo
Farchioni:Il Casolare Olive Oil Extra Virgin (1L) 8 10000:Farchioni
Filippo Berio:Extra Virgin Olive Oil (250ml) 8 10000:Filippo Berio
Filippo Berio:Extra Virgin Olive Oil (500ml) 8 10000:Filippo Berio
Filippo Berio:Extra Virgin Olive Oil (750ml) 8 10000:Filippo Berio
Filippo Berio:Extra Virgin Olive Oil (1L) 8 10000:Filippo Berio
Filippo Berio:Extra Virgin Olive Oil Special Selection (250ml) 8 10000:Filippo Berio
Filippo Berio:Extra Virgin Olive Oil Tin (1L) 8 10000:Filippo Berio
Filippo Berio:Gusto Fruttato Extra Virgin Olive Oil (500ml) 8 10000:Filippo Berio
Filippo Berio:Mild & Light Olive Oil (250ml) 8 10000:Filippo Berio
Filippo Berio:Mild & Light Olive Oil (500ml) 8 10000:Filippo Berio
Filippo Berio:Mild & Light Olive Oil (1L) 8 10000:Filippo Berio

mrpugster · March 25, 2011, 1:16pm

Hi Michael or anyone

I'm having trouble with AWK and with products which have an apostrophe in the name, .i.e.

Fisherman's Friend

I've tried escaping the character with \ and putting it in quotes but no luck. any ideas?

Many thanks

ctsgnb · March 25, 2011, 3:01pm

as suggested previously, you should focus on setting up a separator at the right place

sed "s/Fisherman's Friend/&:/;s/: /:/" outfile

then the awk run using the : as separator would just handle it as a piece of cake

mrpugster · March 25, 2011, 5:43pm

Unfortunately I've got around 1500 of these brands, I guess there's no getting away from cleaning up the data first with a separator.

ctsgnb · March 26, 2011, 4:42am

How has you input file been generated ? maybe there is a way to directly generate it with a separator between the brand and the product name instead of tediously doing it afterward

Do you have , somewhere, a file containing the list of the brand ?

You could also upload your entire file so we can help to format it as expected.

mrpugster · March 26, 2011, 5:30am

Hi

The raw data comes from a website scrape and contains 12000 products name but no brands names, e.g.

Filippo Berio Mild & Light Olive Oil (500ml)

What I've done initially is to the get the scrape program to take the first word of each product to use as the brand name. This is good for about 90% of brands but of course only works for those which are only a single word.

I then use a simple sed script to pick out the incorrect brands e.g.

sed -i "s/^7/7 Up/g" brands.txt
sed -i "s/^Ainsley/Ainsley Harriott/g" brands.txt
sed -i "s/^Air/Air Wick/g" brands.txt
sed -i "s/^Alfa/Alfa One/g" brands.txt
sed -i "s/^All/All About Shine/g" brands.txt
sed -i "s/^Alta/Alta Italia/g" brands.txt
sed -i "s/^Ambi/Ambi Pur/g" brands.txt
sed -i "s/^Angel/Angel Delight/g" brands.txt

However this again only fixes %70 of brands as some brands have the same first word, e.g.

John West
John Frieda

I can pull out a list of all the brands if needed. I've attached just the product list file for now but will produce and attach the brand file later today. What makes this a little bit more complicated is that this list will be updated weekly so new brands are constantly being added.

I can get my brand list to a point where it will only take me 10 - 15 minutes of manual editing so it's not the end of the world

Many thanks for you kind help on this little problem...

ctsgnb · March 26, 2011, 6:08am

The best way of doing this would be to find a way to extract only the brand name from the web site.
Otherwise, you have to build a list that will contains all the "multi word" brand name so that we can then setup a script to parse it with a pseudo code like :

for all brand in multi word band name list
set the : separator at the right place
for all other
set the : separator just after the first word

If you give more clue about the way you initially generate the initial file, maybe it can help to directly separate the brand & product fields at generation step.

mrpugster · March 26, 2011, 7:26am

The file is initially generated using a piece of web scraping software, I've just had a look at the web pages I do the scrape from, it looks like I can produce a scrape of all the brand names.

ctsgnb · March 26, 2011, 7:41am

so if you can generate the list of the brand name only, and then the other file containing everything, you can then setup the separator at the right place.
Can you upload the file containing only the brand name ?

mrpugster · March 26, 2011, 8:32am

I'm afraid automatically generating the brand list is harder than I initially thought. I will still continue to look at this. The support forum for the website for the scrape software is currently down and I need to ask a question there. Many thanks for your help

ctsgnb · March 26, 2011, 10:24am

some line appear more than once in your file :

Here you can see at which line number the duplicate occure :

$ sort tst | uniq -d | while read a
> do
> cat -n tst | grep "$a"
> done
  6776  Bloo Acticlean Cistern Blocks Citrus (2)
  6777  Bloo Acticlean Cistern Blocks Citrus (2)
  7704  Tesco Premium Supermeat Variety Pack (6x400g)
  7981  Tesco Premium Supermeat Variety Pack (6x400g)
  7711  Winalot Classics in Jelly Variety Pack (6x400g)
  7712  Winalot Classics in Jelly Variety Pack (6x400g)

---------- Post updated at 03:24 PM ---------- Previous update was at 03:21 PM ----------

$ sort tst | uniq -d | while read a
> do
> grep -n "$a" tst
> done
6776:Bloo Acticlean Cistern Blocks Citrus (2)
6777:Bloo Acticlean Cistern Blocks Citrus (2)
7704:Tesco Premium Supermeat Variety Pack (6x400g)
7981:Tesco Premium Supermeat Variety Pack (6x400g)
7711:Winalot Classics in Jelly Variety Pack (6x400g)
7712:Winalot Classics in Jelly Variety Pack (6x400g)