Problem with occurrence of square brackets

zaxxon · June 11, 2014, 9:01am

Hello all,

I have the following problem:

$ cat infile
this is spam [i need this] and i need this too
this is spam [i need this][this is spam] and i need this too
$ perl -nwe '$_ =~ /[^\]]+ \[([^\]]+)\]\[?[^\]]*\]? ([^\]\[]+)$/; print "$1 - $2\n";' infile
i need this - too
i need this - and i need this too

I am not sure how many of these square-bracket groups will show up. At the moment I assume there is at least 1, maybe 2, but I always need the content of the 1st group and the complete text after the last closing square bracket. As you can see from the output for the 1st line, this doesn't work.

Any hints are welcome, thanks a lot.

CarloM · June 11, 2014, 9:36am

I'm not familiar with Perl regex, but you could do it in sed by using 2 substitutions:

$ cat ~/tmp/file.txt
this is spam [i need this] and i need this too
this is spam [i need this][but not this] and i need this too
this is spam [i need this][but not this] [or this ] and i need this too
$ sed 's/^[^[]*\[\([^]]*\)\]/\1/g; s/\[[^]]*\]//g' ~/tmp/file.txt
i need this and i need this too
i need this and i need this too
i need this  and i need this too

neutronscott · June 11, 2014, 9:40am

I couldn't do it in regex...

$ awk -F'[][]' '{print $2,$NF}' infile
i need this  and i need this too
i need this  and i need this too

zaxxon · June 11, 2014, 9:58am

Thanks a lot guys.

@neutronscott:
Because it needs to run on different platforms, including Windows, I will have to use Perl, so I can't use awk. Sorry, I failed to point this out.

@CarloM:
I am curious whether I can solve it in one expression. Maybe other gods of RegExp can shed some light; I am stuck at the moment. If there is no other way, I will use the 2 separate statements.

neutronscott · June 11, 2014, 10:09am

$ sed 's/[^[]*[[]\([^]]*\)]\([[][^]]*]\)*\([^]]*\)$/\1 -- \3/g' infile
i need this --  and i need this too
i need this --  and i need this too

?

clx · June 11, 2014, 10:39am

Maybe split can help?

perl -nwe '@a=split(/\[|\]/,$_); print "$a[1] - $a[$#a]\n";' infile
i need this -  and i need this too

i need this -  and i need this too

i need this -  and i need this too

infile

this is spam [i need this] and i need this too
this is spam [i need this][this is spam] and i need this too
this is spam [i need this][this is spam][this is spam too] and i need this too

zaxxon · June 11, 2014, 12:44pm

Thanks for the answers, guys.

zaxxon · June 12, 2014, 3:55am

I decided to use neutronscott's solution, which I understand except the effect of these two expressions:

sed 's/[^[]*[[]\([^]]*\)]\([[][^]]*]\)*\([^]]*\)$/\1 -- \3/g' infile
            ^^^            ^^^

A group which consists of a single square bracket? I would have written the single square bracket without the enclosure but this does not work obviously.

Don_Cragun · June 12, 2014, 6:02am

zaxxon:

I decided to use neutronscott's solution, which I understand except the effect of these two expressions:
sed 's/[^[]*[[]\([^]]*\)]\([[][^]]*]\)*\([^]]*\)$/\1 -- \3/g' infile
            ^^^            ^^^
A group which consists of a single square bracket? I would have written the single square bracket without the enclosure but this does not work obviously.

If you understood the meaning of a repetition of a non-matching bracket expression (such as [^[]* which matches zero or more occurrences of any character except [ ), I'm surprised that you didn't understand the meaning of the matching bracket expression [[] which matches one occurrence of the [ character. Similarly, [^]] matches any character other than ] and []] matches a ] .

You have to use the bracket expression [[] or escape the opening bracket \[ to distinguish it as a character to be matched (rather than the start of a bracket expression). In some contexts, you do not need to use a bracket expression or an escape to specify a closing bracket, but the meaning is the same if you use []] and it is symmetric with the [[] if you have an editor that pairs up opening and closing parentheses, braces, and brackets.
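
A minimal sketch of the equivalence described above, assuming a POSIX sed. The sample line is the first line of infile; the < and > replacement characters and the variable names are only illustrative, chosen to make the matched brackets visible.

```shell
# Two ways to match a literal "[" in a basic regular expression.
line='this is spam [i need this] and i need this too'

# Bracket expressions: [[] is a set containing only "[",
# and []] is a set containing only "]".
with_bracket=$(printf '%s\n' "$line" | sed 's/[[]/</; s/[]]/>/')

# Backslash escape for "[". A bare "]" outside a bracket
# expression is already an ordinary character.
with_escape=$(printf '%s\n' "$line" | sed 's/\[/</; s/]/>/')

printf '%s\n' "$with_bracket" "$with_escape"
# prints "this is spam <i need this> and i need this too" twice
```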

zaxxon · June 12, 2014, 7:36am

OK, I would have written it escaped. So far I have never used a bracket expression to avoid escaping; I was not aware this is an "allowed" usage of the square brackets.
It's now clear to me, thanks.

Scrutinizer · June 12, 2014, 9:10am

Both suggestions could be reduced a bit by taking into account sed's greedy matching property: notably, \[.*\] matches everything from the first opening square bracket through the last closing one, starting from the point where sed is looking at that moment. Thus:

CarloM's approach, with the two dashes inserted:

sed 's/[^[]*\[\([^]]*\)\]/\1 --/; s/\[.*\]//' file

And neutronscott's approach:

sed 's/[^[]*[[]\([^]]*\)]\(\[.*\]\)*/\1 --/' file

(the original will fail with more than two square bracket episodes)
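
To make the greedy behaviour concrete, here is a small sketch comparing .* with a bracket expression that cannot cross a closing bracket. The sample line and variable names are made up for illustration, and the closing bracket is left unescaped because it is an ordinary character outside a bracket expression.

```shell
line='a [one][two] b [three] c'

# Greedy: .* runs from the first "[" through the LAST "]".
greedy=$(printf '%s\n' "$line" | sed 's/\[.*]//')

# [^]]* cannot cross a "]", so only the first group is removed.
one_group=$(printf '%s\n' "$line" | sed 's/\[[^]]*]//')

printf '%s\n' "$greedy" "$one_group"
# prints:
# a  c
# a [two] b [three] c
```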

neutronscott · June 12, 2014, 10:08am

Yes, sorry. It's just clever escaping, like one would write ps | grep [s]omething. I began to use this more because in awk, depending on quoting/context, you often need to escape your escapes since they're really processed twice, and it gets ugly, so I tend to avoid \ when possible now.

The * after the grouping allows it to repeat. But I see mine doesn't perform correctly in the last two cases.

mute@thedoctor:~$ cat input
this is spam [i need this][this is spam][another one][last] and i need this too
this is spam [i need this] probably everything [here] too
this is spam [i need this] and probably i need ] everything here too?
mute@thedoctor:~$ sed 's/[^[]*[[]\([^]]*\)]\([[][^]]*]\)*\([^]]*\)$/\1 -- \3/g' input
i need this --  and i need this too
this is spam [here --  too
this is spam [i need this] and probably i need ] everything here too?
mute@thedoctor:~$ sed 's/[^[]*[[]\([^]]*\)]\(\[.*\]\)*/\1 --/' input
i need this -- and i need this too
i need this -- probably everything [here] too
i need this -- and probably i need ] everything here too?

I also didn't think of not needing to match & sub the last part. That's definitely shorter.

CarloM · June 12, 2014, 10:30am

Scrutinizer:

Both suggestions could be reduced a bit by taking into account sed's greedy matching property: notably, \[.*\] matches everything from the first opening square bracket through the last closing one, starting from the point where sed is looking at that moment. Thus:

CarloM's approach, with the two dashes inserted:
sed 's/[^[]*\[\([^]]*\)\]/\1 --/; s/\[.*\]//' file

Note that this isn't just slightly shorter; it also corrects the output to match zaxxon's requirement. My original produced different output because it kept any non-leading text that was not inside brackets.
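
A sketch of that difference, using the two sed commands as posted in this thread. The sample line is invented for illustration: it puts some text ("more spam") between two later bracket groups.

```shell
line='this is spam [i need this][spam] more spam [spam] and i need this too'

# Original two-step version: removes each bracket group separately,
# so the text between later groups survives.
original=$(printf '%s\n' "$line" |
  sed 's/^[^[]*\[\([^]]*\)\]/\1/g; s/\[[^]]*\]//g')

# Reduced version: the greedy \[.*\] removes everything from the
# next "[" through the last "]", including text between groups.
reduced=$(printf '%s\n' "$line" |
  sed 's/[^[]*\[\([^]]*\)\]/\1 --/; s/\[.*\]//')

printf '%s\n' "$original" "$reduced"
# prints:
# i need this more spam  and i need this too
# i need this -- and i need this too
```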

Scrutinizer · June 12, 2014, 1:44pm

Yet another approach:

sed 's/]\(.*]\)*/ --/; s/.*\[//' file
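
For readers following along, the two substitutions can be run one at a time to see what each contributes. This is only a sketch: the sample line (with three bracket groups) and the variable names step1 and step2 are made up for illustration.

```shell
line='this is spam [i need this][this is spam][last] and i need this too'

# Step 1: replace the first "]" plus everything up to the last "]"
# (if there is one) with " --".
step1=$(printf '%s\n' "$line" | sed 's/]\(.*]\)*/ --/')

# Step 2: delete everything up to and including the remaining "[".
step2=$(printf '%s\n' "$step1" | sed 's/.*\[//')

printf '%s\n' "$step1" "$step2"
# prints:
# this is spam [i need this -- and i need this too
# i need this -- and i need this too
```

Note that, like the other greedy variants, this treats everything between the first and the last closing bracket as spam, so a line such as "this is spam [i need this] probably everything [here] too" becomes "i need this -- too".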