My input contains single-word lines.
From each line
a) I want to remove all text that starts with 'dp' including 'dp'.
Ex: prjgoodBlaBladpgoodBlaBla ---> prjgoodBlaBla
b) Also I want to remove duplicate substrings.
Ex: prjtestBlaBlatestBlaBla ---> prjtestBlaBla
Logic I have in mind but am having a hard time implementing: take 4 through 10 characters [testBla]; if that chunk is found again later in the string, remove all text starting from its second occurrence.
data.txt
prjtestBlaBlatestBlaBla
prjthisBlaBlathisBlaBla
prjthatBlaBladpthatBlaBla
prjgoodBlaBladpgoodBlaBla
prjgood1BlaBla123dpgood1BlaBla123
Desired output -->
data_out.txt
prjtestBlaBla
prjthisBlaBla
prjthatBlaBla
prjgoodBlaBla
prjgood1BlaBla123
I am able to get part a) of my requirement working using the following:
> sed 's/dp\(.*\)\..*/\1/' data.txt
prjtestBlaBlatestBlaBla
prjthisBlaBlathisBlaBla
prjthatBlaBladpthatBlaBla
prjgoodBlaBladpgoodBlaBla
prjgood1BlaBla123dpgood1BlaBla123
but not part b).
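The logic described in the question (take a 4 to 10 character chunk and cut the line at its second occurrence) could be sketched roughly as below. This is only an illustration under assumptions: the function name cut_at_repeat is made up, awk is used for the string handling, and the longest chunk is tried first at each position.

```shell
# Hypothetical helper: part a) drops everything from "dp" onward,
# part b) cuts the line where a 4-10 character chunk occurs a second time.
cut_at_repeat() {
  awk '{
    line = $0
    sub(/dp.*/, "", line)                 # part a)
    n = length(line)
    found = 0
    for (i = 1; i <= n - 3 && !found; i++) {
      for (len = 10; len >= 4 && !found; len--) {
        if (i + len - 1 > n) continue     # chunk would run past the end
        chunk = substr(line, i, len)
        pos = index(substr(line, i + len), chunk)
        if (pos > 0) {
          # keep everything before the second occurrence
          line = substr(line, 1, i + len + pos - 2)
          found = 1
        }
      }
    }
    print line
  }' "$@"
}
```

With the sample data.txt above, cut_at_repeat data.txt should print the five lines listed under data_out.txt.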
perl -pe 's/dp.*// || s/(\w+)\1/\1/' data.txt
bart, it's working. Thanks for the solution.
Can you explain what || and (\w+) do?
Can we get it working using sed? Can someone help?
> /usr/xpg4/bin/sed -e 's/dp.*//' -e 's/(\w+)\1/\1/' data.txt
sed: command garbled: s/(\w+)\1/\1/
I don't know how to make it work in sed. "||" is the short-circuit logical "or" in perl: it checks whether the first command was successful, and if it was, the second one is not processed. (\w+)\1 matches the first occurrence of consecutive duplicate strings (it is an extension of "(\w)\1", which would match consecutive duplicate characters like "aa", "bb" and so on).
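A small sketch of that short-circuit behaviour, assuming perl is on the PATH. The function names with_or and with_both are made up for the comparison, and the replacement is written as $1, which perl treats the same as \1 here. With "||" the duplicate removal is skipped on lines where s/dp.*// already changed something; with ";" it runs on every line.

```shell
# Hypothetical helpers: same two substitutions, joined two different ways.
with_or()   { perl -pe 's/dp.*// || s/(\w+)\1/$1/'; }
with_both() { perl -pe 's/dp.*//;   s/(\w+)\1/$1/'; }

# "dp..." is removed, so the second substitution is skipped.
printf 'prjgoodBlaBladpgoodBlaBla\n' | with_or     # prints prjgoodBlaBla

# Both substitutions run, so the doubled "oo" is collapsed as well.
printf 'prjgoodBlaBladpgoodBlaBla\n' | with_both   # prints prjgodBlaBla
```

That is why the "||" matters for the lines that contain 'dp'.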
Thanks bart.
Does anyone know how to do this using sed? Does any other tool in Unix recognize (\w+) as consecutive duplicate strings?
It is not just (\w+), but (\w+)\1. That "\1" is important, as it matches the string matched before by (\w+). In other words, "\1" matches the duplicated part.
Does anyone know how to achieve this using sed?
perl -pe 's/dp.*// || s/(\w+)\1/\1/' data.txt
sed 's/dp.*//;s/\([^ ]*\)\1/\1/g' data.txt
Thanks, it's working.
sed 's/dp.*//;s/\([^ ]*\)\1/\1/g' data.txt
s/dp.*//; -->
Removes everything from 'dp' to the end of the line.
s/\([^ ]*\)\1/\1/g -->
Where a run of non-space characters is immediately followed by a copy of itself, keep only the first occurrence?
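One thing to watch, as far as I can tell: in the sed one-liner both expressions run on every line and the second one carries the g flag, so a line that was already trimmed at 'dp' still goes through duplicate removal, and "prjgoodBlaBla" can come out as "prjgodBla". Below is a minimal sketch that mimics perl's "||" with sed's t command (branch to the end of the script if a substitution was made). The function name dedupe is made up, and the sketch assumes a sed with back-references in basic regular expressions, which POSIX sed provides.

```shell
# Hypothetical helper: run the duplicate removal only when "dp" was not found.
dedupe() {
  sed -e 's/dp.*//' \
      -e 't' \
      -e 's/\(..*\)\1/\1/' "$@"
}
# 1st -e: drop everything from "dp" to the end of the line
# 2nd -e: if that substitution happened, skip the rest of the script
# 3rd -e: otherwise collapse the first run that is immediately repeated
```

Run as dedupe data.txt, this should print the five lines shown under data_out.txt. The pattern ..* is used instead of [^ ]* so that the group cannot match an empty string, which makes the g flag unnecessary.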