Replacing a string with its substring

Hi All,

Below is some sample content of my input file:

There are many types and traditions of anarchism, some of which are [[mutually exclusive]]. Strains of anarchism have been divided into the categories of [[social anarchism|social]] and [[individualist anarchism]] or similar dual classifications. Anarchism is often considered to be a radical [[left-wing]] ideology, and much of [[anarchist economics]] and [[anarchist law|anarchist legal philosophy]] reflect [[anti-statism|anti-statist]] interpretations of [[anarcho-communism|communism]], [[collectivist anarchism|collectivism]], [[anarcho-syndicalism|syndicalism]] or [[participatory economics]].

For the above content, if a double-bracketed pattern such as [[mutually exclusive]] does not contain the delimiter '|', the substring inside the brackets (mutually exclusive) should replace the whole pattern [[mutually exclusive]].

If the brackets contain strings separated by the '|' delimiter, as in [[social anarchism|social]], the substring after the final delimiter (social) should replace the whole pattern [[social anarchism|social]].

I believe this should be possible with sed or awk. I tried, but as I am not very familiar with Unix, I could not get it to work.

Any help is appreciated. Also, please recommend some useful sites or books for learning sed, awk and other text-processing commands.

Thanks
Satheesh

Try:

perl -pe 's/\[\[[^\]]+\|([^\]]+)\]\]/$1/g;s/\[\[([^\]]+)\]\]/$1/g' input
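
Since the question asks about sed, here is a rough sed sketch of the same idea. It assumes a sed that accepts -E (extended regular expressions), as GNU and BSD sed do. The function name strip_wikilinks is only an illustrative name, not something from this thread. The expression skips every "text|" segment inside the double brackets and keeps whatever follows the last "|":

```shell
# Hypothetical helper: reads text on stdin and writes it back with each
# [[...]] pattern reduced to the part after the last "|" (or to the whole
# bracket content when there is no "|").
strip_wikilinks() {
    # ([^]|]*\|)*  zero or more "text|" segments
    # ([^]|]*)     the final segment, kept as \2
    sed -E 's/\[\[([^]|]*\|)*([^]|]*)\]\]/\2/g'
}

printf '%s\n' 'are [[mutually exclusive]] and [[social anarchism|social]]' | strip_wikilinks
```

To process a file instead, redirect it into the function, for example: strip_wikilinks < input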

That's great, Bartus. It is working. Do you recommend Perl over shell scripts for text processing?

If so, please recommend some sites for learning Perl.

Regards
Satheesh

To learn Perl you need this book: Learning Perl, Third Edition - O'Reilly Media

Thank you, Bartus :)

Alternatively, a single regex could be used with the substitution operator, like so:

$
$
$ cat f8
There are many types and traditions of anarchism, some of which are [[mutually exclusive]].
Strains of anarchism have been divided into the categories of [[social anarchism|social]]
and [[individualist anarchism]] or similar dual classifications. Anarchism is often
considered to be a radical [[left-wing]] ideology, and much of [[anarchist economics]]
and [[anarchist law|anarchist legal philosophy]] reflect [[anti-statism|anti-statist]]
interpretations of [[anarcho-communism|communism]], [[collectivist anarchism|collectivism]],
[[anarcho-syndicalism|syndicalism]] or [[participatory economics]].
$
$
$
$ perl -plne 's/\[\[[^|]*?\|*([^|]*?)\]\]/$1/g' f8
There are many types and traditions of anarchism, some of which are mutually exclusive.
Strains of anarchism have been divided into the categories of social
and individualist anarchism or similar dual classifications. Anarchism is often
considered to be a radical left-wing ideology, and much of anarchist economics
and anarchist legal philosophy reflect anti-statist
interpretations of communism, collectivism,
syndicalism or participatory economics.
$
$

tyler_durden

Thank you very much, Tyler :)

Sorry, I just noticed that the one-liner posted above will work only if, within the double-brackets:
(a) there's a single string with no embedded "|"s
(b) there are exactly two strings with exactly one embedded "|"

So, cases like the following:

[[abc|def|ghi]]
[[abc|def|ghi|jkl]]

will not be matched, and hence will not be altered by the script.
An example follows (I've modified your data a bit):

$
$
$ cat f8
There are many types and traditions of anarchism, some of which are [[mutually exclusive]].
Strains of anarchism have been divided into the categories of [[social anarchism|social]]
and [[individualist anarchism]] or similar dual classifications. Anarchism is often
considered to be a radical [[left-wing]] ideology, and much of [[anarchist economics]]
and [[anarchist law|anarchist legal philosophy]] reflect [[anti-statism|anti-statist|non-statist]]
interpretations of [[anarcho-communism|communism]], [[collectivist anarchism|collectivism]],
[[anarcho-syndicalism|syndicalism|blah|BLAH]] or [[participatory economics]].
$
$
$ # Old script that does NOT work for more than 2 delimited tokens within double-brackets
$ perl -plne 's/\[\[[^|]*?\|*([^|]*?)\]\]/$1/g' f8
There are many types and traditions of anarchism, some of which are mutually exclusive.
Strains of anarchism have been divided into the categories of social
and individualist anarchism or similar dual classifications. Anarchism is often
considered to be a radical left-wing ideology, and much of anarchist economics
and anarchist legal philosophy reflect [[anti-statism|anti-statist|non-statist]]
interpretations of communism, collectivism,
[[anarcho-syndicalism|syndicalism|blah|BLAH]] or participatory economics.
$
$

The fix for this is to modify the regex so that it:
(a) matches any characters, including "|"s, with a non-greedy ".*?" that backtracking extends only as far as needed
(b) matches zero or more "|" characters
(c) matches the remainder up to "]]", which must not include "|", and captures it as group 1 ($1)

Something like this:

$
$ cat f8
There are many types and traditions of anarchism, some of which are [[mutually exclusive]].
Strains of anarchism have been divided into the categories of [[social anarchism|social]]
and [[individualist anarchism]] or similar dual classifications. Anarchism is often
considered to be a radical [[left-wing]] ideology, and much of [[anarchist economics]]
and [[anarchist law|anarchist legal philosophy]] reflect [[anti-statism|anti-statist|non-statist]]
interpretations of [[anarcho-communism|communism]], [[collectivist anarchism|collectivism]],
[[anarcho-syndicalism|syndicalism|blah|BLAH]] or [[participatory economics]].
$
$ # New script that should work
$ perl -plne 's/\[\[.*?\|*([^|]*?)\]\]/$1/g' f8
There are many types and traditions of anarchism, some of which are mutually exclusive.
Strains of anarchism have been divided into the categories of social
and individualist anarchism or similar dual classifications. Anarchism is often
considered to be a radical left-wing ideology, and much of anarchist economics
and anarchist legal philosophy reflect non-statist
interpretations of communism, collectivism,
BLAH or participatory economics.
$
$

tyler_durden

That's good, Tyler. I had also tested only with a single | delimiter.

Thank you for correcting that. I am looking for a good Perl XML parser package to develop my own Wikipedia XML parser. Could you help me with this if you have some experience with Perl XML parsers?

As I am new to Perl, I do not know where to search for packages. Any help is appreciated.

Regards
Satheesh