Tricky sed required

Hi All

I need to put some sed together for a task and it's a bit advanced for me, so I thought I'd ask if anyone here could help.

I have a csv file with content like this -

"['1235','3234']","abcde","[1234]","['1235','3234']"
"'","abcde","[1235]","['1236','2234']"
"[1236]","['1237','1234']","","1234"
"'e'","[1237]","['1238','0234']",""

I need to remove any single quotes that fall between [ ], and also the [ ] themselves.

so -

"['1238','0234']"

Would become -

"1238,0234"

While -

"[1234]"

Becomes -

"1234"

The other fields containing single quotes are unchanged and no double quotes are removed. Quoted nulls are unaffected as well.

I'd like to do this in sed as I need to squeeze some performance out of it.

Any help greatly appreciated

Brad

Can something like "'e','e'" show up in your input data?

Hi

Thanks for the response :slight_smile:

I'll go with yes on that.

To be honest I'm trying to anticipate some content that a bunch of developers are currently working on. I'm still unclear on exactly what they are going to do but they suggest I need to allow for at least a double quoted string that contains a single quote, so I guess two single quotes in a double quoted string is a possibility.

Cheers

Brad

I guess this is too easy?

sed "s/[][']//g" myFile

I don't think I completely understand your scenario... I tried and came up with the solution below... hope this helps. I am very new to the Unix world, so my apologies for any silly things :slight_smile:

sed -e 's/\[//g' -e 's/\]//g' -e "s/\'//g" testFile

Output:

"1235,3234","abcde","1234","1235,3234"
"","abcde","1235","1236,2234"
"1236","1237,1234","","1234"
"e","1237","1238,0234",""

regards,
juzz4fun

Afraid you're right, it stripped the single quotes from around the "'e'"

Thanks anyway :slight_smile:

Try this:

perl -nle '@out=();while (/"([^"]*)"/g) {$x=$1;$x=~s/[\047\[\]]//g if $x=~/^\[/;push @out,"\"$x\""}; print join ",", @out' input

Cheers, That certainly does the job. :slight_smile:

Would love to know how to do it in sed if anyone else fancies a crack at it...

Thanks again, that will get me out of the hole.

The following sed-script will first remove any single quote inside "[".."]"-ranges, then remove "[..]" pairs. This should ensure that "'e','e'" would be untouched while "['e','e']" would reduce to "e,e". I hope this is what you wanted:

sed ':start
      /\[[^]]*\'\]/ {
           s/\(\[[^]]*\)\'/\1/
           b start
      }
      s/\[\([^]]*\)\]/\1/g' /path/to/infile

I hope this helps.

bakunin

Hi Bakunin,

That's certainly the kind of solution I am looking for. Unfortunately it doesn't quite do what I want, probably my fault for not taking a real file home with me.

I am back in work this morning and have just tried it against a real sample file -

"channel.facebook.com","['unknown', 'productivityloss']","[1001005, 1000041]","['standard', 'web2']"
"channel.facebook.com","['unknown', 'productivityloss']","[1001005, 1000041]","['standard', 'web2']"
"channel.facebook.com","['unknown', 'productivityloss']","[1001005, 1000041]","['standard', 'web2']"
"channel.facebook.com","['unknown', 'productivityloss']","[1001005, 1000041]","['standard', 'web2']"
"channel.facebook.com","['unknown', 'productivityloss']","[1001005, 1000041]","['standard', 'web2']"

Output of sed command -

"channel.facebook.com","'unknown', 'productivityloss'","1001005, 1000041","'standard', 'web2'"
"channel.facebook.com","'unknown', 'productivityloss'","1001005, 1000041","'standard', 'web2'"
"channel.facebook.com","'unknown', 'productivityloss'","1001005, 1000041","'standard', 'web2'"
"channel.facebook.com","'unknown', 'productivityloss'","1001005, 1000041","'standard', 'web2'"
"channel.facebook.com","'unknown', 'productivityloss'","1001005, 1000041","'standard', 'web2'"

So where I have single quoted strings inside brackets, the brackets are removed but I'm not losing the single quotes.

This might be me, I had to change the outer quotes on your sed to " in order to get it to run -

sed ":start
      /\[[^]]*\'\]/ {
           s/\(\[[^]]*\)\'/\1/
           b start
      }
      s/\[\([^]]*\)\]/\1/g" $File

Would you be kind enough to explain how the sed works? I am trying to get to grips with sed and would appreciate the insight.

Thanks for the help :slight_smile:

The perl will get me out of trouble for now, but I'm not sure it is installed on all of our servers so would prefer a sed solution if possible for portability.

Brad

sed ':start                         # create a label
      /\[[^]]*\'\]/ {               # execute the following only if the line contains
                                    # this regexp: a "[", followed by optional
                                    # non-"]", followed by "'", followed by "]"
           s/\(\[[^]]*\)\'/\1/      # replace "[ - non-] - '" by itself except for the last '
           b start                  # go back to the label defined above
      }
      s/\[\([^]]*\)\]/\1/g'         # replace "[ - non-] -]" by anything inside the brackets

The first lines set up a loop: one "'" is removed inside "[...]", then "b start" starts over. Once there is no single quote left, the regexp in the second line no longer matches and the whole block is skipped. Finally, the last line removes the brackets by replacing "[-content-]" with "content".

Probably my script failed because the first part has escaped single quotes ("\'") and the escaping didn't work as expected.

I hope this helps.

bakunin
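
A note on why the double-quoted version above left the single quotes in place: inside double quotes the shell passes \' to sed unchanged, and GNU sed treats \' in a regex as an anchor for the end of the pattern space rather than as a literal quote, so the test on the second line can never match. The test also requires the quote to sit directly before the ], so it may stop before every quote is gone. Below is a minimal sketch of the same loop with both points adjusted, assuming brackets are balanced and a list never contains a ] itself. The function name strip_lists is made up for illustration; '\'' is the usual way to place a literal single quote inside a single-quoted shell string.

```shell
#!/bin/sh
# Sketch only: strip single quotes inside [...] and then the brackets.
# strip_lists is a hypothetical helper name, not something from the thread.
strip_lists() {
    sed ':start
    /\[[^]]*'\''[^]]*\]/ {
        s/\(\[[^]]*\)'\''\([^]]*\]\)/\1\2/
        b start
    }
    s/\[\([^]]*\)\]/\1/g' "$@"
}

# Reads stdin when no file is given; a line shaped like the real sample:
strip_lists <<'EOF'
"'e'","['unknown', 'productivityloss']","[1001005, 1000041]"
EOF
# prints: "'e'","unknown, productivityloss","1001005, 1000041"
```

Each pass removes the last remaining quote inside the first bracketed list that still has one, so the number of items and any spaces after the commas do not matter. Quotes outside brackets are left alone because every match has to start at a [ and cannot run past a ].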

Thanks :smiley:

That's really helpful.

Now I can play around with it and get a good understanding...

Cheers

Brad

while(<DATA>){
	s/(?:[\[\]]|'(?=[^\[]*\]))//xg;
	print;
}
__DATA__
"['1235','3234']","abcde","[1234]","['1235','3234']"
"'","abcde","[1235]","['1236','2234']"
"[1236]","['1237','1234']","","1234"
"'e'","[1237]","['1238','0234']",""

Thanks Summer, I really should learn perl :slight_smile:

A bit more precise is

sed -e "s/'\([^']*\)'/\1/g" -e 's/"\[\([^]]*\)\]"/"\1"/g'

Maybe sufficient here.
But I admit that only a perl RE can achieve 100% correctness.

Nice use of the non capturing group plus lookahead.
It also works fine in Python:

>>> text = """
... "['1235','3234']","abcde","[1234]","['1235','3234']"
... "'","abcde","[1235]","['1236','2234']"
... "[1236]","['1237','1234']","","1234"
... "'e'","[1237]","['1238','0234']",""
... """
>>> import re
>>> pat = r"""(?:[\[\]]|'(?=[^\[]*\]))"""
>>> print(re.sub(pat, '', text))

"1235,3234","abcde","1234","1235,3234"
"'","abcde","1235","1236,2234"
"1236","1237,1234","","1234"
"'e'","1237","1238,0234",""

After some fiddling around I settled for this variation on an earlier post -

#! /bin/bash

sed ':start
/\['\''/{
    s/\['\''//
    s/'\'']//
    s/'\'','\''/,/
    b start
}
:finally
/\[/{
    s/\[//
    s/\]//
    b finally
}' file

Input -

"['1235','3234']","abcde","[1234]","['1235','3234']"
"'","abcde","[1235]","['1236','2234']"
"[1236]","['1237','1234']","","1234"
"'e'","[1237]","['1238','0234']",""
"123","wer","123","''"

Output -

"1235,3234","abcde","1234","1235,3234"
"'","abcde","1235","1236,2234"
"1236","1237,1234","","1234"
"'e'","1237","1238,0234",""
"123","wer","123","''"

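One caveat that may be worth checking before relying on this variation: each pass replaces a single ',' and only runs while a [' is still present on the line, so a list with more than two quoted items, or one with a space after the comma as in the real sample earlier in the thread, can keep some of its quotes. The check below is a sketch with made-up input; final_variation is a hypothetical wrapper name around the script exactly as posted.

```shell
#!/bin/sh
# Hypothetical wrapper around the variation posted above, for testing only.
final_variation() {
    sed ':start
    /\['\''/{
        s/\['\''//
        s/'\'']//
        s/'\'','\''/,/
        b start
    }
    :finally
    /\[/{
        s/\[//
        s/\]//
        b finally
    }' "$@"
}

# Made-up lines: a three-item list, and a list with ", " as separator.
final_variation <<'EOF'
"['a','b','c']"
"['x', 'y']"
EOF
# prints:
# "a,b','c"
# "x', 'y"
```

A loop that removes one quote at a time inside the brackets, in the style of the earlier looping post, does not depend on the number of items or on the separator.
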
I use tr -d to remove characters. With sed, I never use -e; I just put single quotes around the entire script, and if I need a single quote or an expanded variable, I drop out of single quotes into double quotes and then right back. For example, a ' inside a single-quoted script is written '"'"' or '\''. Single quotes let less happen, since the shell interprets nothing inside them.
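
To make the quoting trick concrete, here is a small sketch with made-up input. Both sed commands hand sed the identical script s/'//g; only the way the shell is told about the literal quote differs.

```shell
#!/bin/sh
# 1. Close the single quotes, add a backslash-escaped quote, reopen: '\''
printf '%s\n' "it's" | sed 's/'\''//g'    # prints: its

# 2. Close the single quotes, add a double-quoted quote, reopen: '"'"'
printf '%s\n' "it's" | sed 's/'"'"'//g'   # prints: its

# tr -d deletes the listed characters wherever they occur, with no context
printf '%s\n' "['a','b']" | tr -d "'[]"   # prints: a,b
```

Note that tr -d has no way to limit the deletion to bracketed fields, so on the data in this thread it would also strip the quotes from a field like "'e'".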