filtering out duplicate substrings, regex string from a string

My input contains single-word lines.
From each line I want to drop the repeated substring, together with the 'dp' marker when one precedes it.

data.txt

 
prjtestBlaBlatestBlaBla
prjthisBlaBlathisBlaBla
prjthatBlaBladpthatBlaBla
prjgoodBlaBladpgoodBlaBla
prjgood1BlaBla123dpgood1BlaBla123

Desired output -->
data_out.txt

 
prjtestBlaBla
prjthisBlaBla
prjthatBlaBla
prjgoodBlaBla
prjgood1BlaBla123

I am able to get part a) of my requirement working using the following,

 
> sed 's/dp\(.*\)\..*/\1/' data.txt
prjtestBlaBlatestBlaBla
prjthisBlaBlathisBlaBla
prjthatBlaBladpthatBlaBla
prjgoodBlaBladpgoodBlaBla
prjgood1BlaBla123dpgood1BlaBla123

but not part b).

perl -pe 's/dp.*// || s/(\w+)\1/\1/' data.txt

bart, it's working. Thanks for the solution.

Can you explain what || and (\w+) do?

Can we get it working using sed? Can someone help?

> /usr/xpg4/bin/sed -e 's/dp.*//' -e 's/(\w+)\1/\1/' data.txt
sed: command garbled: s/(\w+)\1/\1/

I don't know how to make it work in sed. In perl, "||" is a short-circuit logical "or" (not an "exclusive or"): s/// returns true when it made a replacement, so if the first substitution succeeds, the second one is not run at all. (\w+)\1 matches the first occurrence of a consecutive duplicate string (it is an extension of "(\w)\1", which would match consecutive duplicate characters, like "aa", "bb" and so on).

Thanks bart.

Does anyone know how to do this using sed? Does any other shell in Unix recognize (\w+) as consecutive duplicate strings?

It is not just (\w+), but (\w+)\1. That "\1" is important, as it matches the string previously matched by (\w+). In other words, "\1" matches the duplicated part.

Does anyone know how to achieve this using sed?
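A quick way to watch the backreference at work is to mark what it matches. The sketch below uses plain sed basic regular expressions, where the group is written \( \) and "one or more" is spelled ..* ; the angle brackets and variable names are only there for illustration.

```shell
# \(.\)\1 : one character followed by the same character again.
doubled_char=$(printf 'good\n' | sed 's/\(.\)\1/<\1\1>/')

# \(..*\)\1 : a non-empty string followed by a copy of itself.
doubled_string=$(printf 'thisBlaBlathisBlaBla\n' | sed 's/\(..*\)\1/<\1>/')

printf '%s\n' "$doubled_char"     # g<oo>d
printf '%s\n' "$doubled_string"   # <thisBlaBla>
```

In the second case the group captures "thisBlaBla" and \1 consumes the copy that follows it, so only one copy is written back.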

perl -pe 's/dp.*// || s/(\w+)\1/\1/' data.txt
sed 's/dp.*//;s/\([^ ]*\)\1/\1/g' data.txt

Thanks, it's working.

sed 's/dp.*//;s/\([^ ]*\)\1/\1/g' data.txt

s/dp.*//; -->
remove everything from the first 'dp' to the end of the line.

s/\([^ ]*\)\1/\1/g -->
find a run of non-space characters that is immediately followed by a copy of itself and keep only one copy; the g flag repeats this for every such pair on the line.
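One caveat, as far as I can tell: unlike the perl version, this sed command runs the second substitution on every line, and with g it also collapses shorter repeats. On the sample data as posted, I would expect "prjgoodBlaBladpgoodBlaBla" to come out as "prjgodBla". A possible sketch that mimics perl's "||" uses sed's t command, which jumps to the end of the script when a substitution has already been made on the current line. The variable names are only for illustration.

```shell
data='prjtestBlaBlatestBlaBla
prjthisBlaBlathisBlaBla
prjthatBlaBladpthatBlaBla
prjgoodBlaBladpgoodBlaBla
prjgood1BlaBla123dpgood1BlaBla123'

# Original one-liner applied to a single 'dp' line, for comparison.
with_g=$(printf 'prjgoodBlaBladpgoodBlaBla\n' | sed 's/dp.*//;s/\([^ ]*\)\1/\1/g')

# 't' skips the duplicate-removal step when 's/dp.*//' already matched.
# '..*' requires a non-empty group, so the g flag is no longer needed.
result=$(printf '%s\n' "$data" | sed -e 's/dp.*//' -e 't' -e 's/\(..*\)\1/\1/')

printf '%s\n' "$with_g"
printf '%s\n' "$result"
```

The last command should print the five lines of data_out.txt. It uses only basic regular expressions, so it should not depend on \w or other extensions.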