delete repeated strings (tags) in a line and concatenate corresponding words

mjomba · November 8, 2010, 3:48am

Hello friends!

Each line of my input file has this format:
word<TAB>tag1<blankspace>lemma<TAB>tag2<blankspace>lemma ... <TAB>tag3<blankspace>lemma

From this file I need to eliminate all the repeated tags (of the same word) in a line, as in the sample below, while keeping all the lemmata related to that tag by concatenating them with a "|" separator.

My INPUT (sample):
abecedaria ADJ abecedarius ADJ:abl abecedarius N:abl abecedaria N:abl abecedarium N:acc abecedaria N:acc abecedarium N:nom abecedaria N:nom abecedarium N:voc abecedaria N:voc abecedarium
abecedariabus N:abl abecedaria N:dat abecedaria
abhorruerimus V:IND abhorreo V:IND abhorresco V:SUB abhorreo V:SUB abhorresco
abhorrueritis V:IND abhorreo V:IND abhorresco V:SUB abhorreo V:SUB abhorresco
abhorruero V:IND abhorreo V:IND abhorresco

Very grateful to anyone who can help me!
mjomba from Tanzania

Scrutinizer · November 8, 2010, 4:01am

Hi mjomba, try this:

sed 's/\( [A-Z]:[[:alnum:]]* \)\([[:alnum:]]*\)\1/\1\2|/g' infile
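
To see what this does on one of the sample lines, here is a minimal sketch that wraps the same expression in a shell function (the name merge_adjacent_tags is only for illustration, it is not part of the original answer). It assumes the fields are separated by single spaces, as in the sample as pasted, and that the repeated tag directly follows the first lemma. The pattern only matches tags with a single capital letter before the colon (N:abl, V:IND), and a tag that occurs three or more times in a row is merged only for the first pair.

```shell
# Hypothetical wrapper around the sed expression above; reads stdin or files.
merge_adjacent_tags() {
    # \1 = " TAG " (space, one capital letter, colon, suffix, space)
    # \2 = the lemma between the two identical tags
    # The second copy of the tag is replaced by "|".
    sed 's/\( [A-Z]:[[:alnum:]]* \)\([[:alnum:]]*\)\1/\1\2|/g' "$@"
}

printf 'abhorruero V:IND abhorreo V:IND abhorresco\n' | merge_adjacent_tags
# prints: abhorruero V:IND abhorreo|abhorresco
```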

birei · November 8, 2010, 4:15am

Hi,

Scrutinizer was faster, but here is another one using 'sed':

sed 's/\( \+[A-Za-z]\+:[A-Za-z]\+ \+\)\(.*\)\(\1\)/\1\2|/g' infile
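
If a tag can occur more than twice on a line, or if its repeats are not next to each other, a field-based approach may be easier to reason about than a backreference. The following is only a sketch under some assumptions: every line is a word followed by tag/lemma pairs, fields are separated by whitespace (tabs or spaces), and it is acceptable for the output to use single spaces as separators. The function name merge_all_tags is made up for illustration.

```shell
# Hypothetical helper: group lemmata by tag, keeping first-seen tag order.
merge_all_tags() {
    awk '{
        n = 0
        split("", lemmas)   # portable way to empty the arrays for each line
        split("", order)
        # Field 1 is the word; fields 2,3 / 4,5 / ... are tag,lemma pairs.
        for (i = 2; i < NF; i += 2) {
            tag = $i
            lemma = $(i + 1)
            if (tag in lemmas) {
                lemmas[tag] = lemmas[tag] "|" lemma   # repeated tag: append
            } else {
                order[++n] = tag                      # new tag: remember order
                lemmas[tag] = lemma
            }
        }
        line = $1
        for (j = 1; j <= n; j++)
            line = line " " order[j] " " lemmas[order[j]]
        print line
    }' "$@"
}

printf 'abecedariabus N:abl abecedaria N:dat abecedaria\n' | merge_all_tags
# prints: abecedariabus N:abl abecedaria N:dat abecedaria
printf 'abhorruero V:IND abhorreo V:IND abhorresco\n' | merge_all_tags
# prints: abhorruero V:IND abhorreo|abhorresco
```

Called as `merge_all_tags infile`, it reads the file instead of standard input. Identical lemmata under the same tag are not de-duplicated; they are simply joined in the order they appear.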

Regards,
Birei