Unix: Remove repeated letters

Hi,

I am trying to write a script that removes runs of 2 or more repeated letters (e.g. ZZ) from a field. Only runs at the beginning and end of the field should be removed.

For example:

Input File

a) ZZZIBM Corporation 
b) ZZZIBM Corporation ZZZZZ
c) IBM ZZZ Corporation

The output should be as follows:

a) IBM Corporation 
b) IBM Corporation
c) IBM ZZZ Corporation

Please advise.

Thanks....

Will it always be Z's?

sed 's/\(^.. \)Z*/\1/;s/Z*$//' urfile
sed "s/^\(.\)\1\{1,\}//;s/\(.\)\1\{1,\}$//" file

It will always be Z's, but there can be 2 or more of them. Will the sed command work for 2 or more Z's? Also, will it remove Z's from both the beginning and the end of a field? Z's between words should not be removed. Please advise.

Thanks...

Yes, if you try it.

I tried it, and it returns the input unchanged, without removing the Z's.

sed 's/\(^.. \)Z*/\1/;s/Z*$//' zzz_test.dat
ZZZIBM Corporation
ZZZIBM Corporation ZZZZZ
IBM ZZZ Corporation

I also tried

sed "s/^\(.\)\1\{1,\}//;s/\(.\)\1\{1,\}$//" file

and it also does not work.

Also, I need this done on a field, not a file. I am already extracting the field from a file while looping through each line. Please advise. I am doing this on Linux.

Thanks...

It is working for me.

$ cat file
ZZZIBM Corporation
ZZZIBM Corporation ZZZZZ
IBM ZZZ Corporation
$ sed "s/^\(.\)\1\{1,\}//;s/\(.\)\1\{1,\}$//" file
IBM Corporation
IBM Corporation
IBM ZZZ Corporation

What is the output of your sed command?

$ cat zzz_test.dat
ZZZIBM Corporation
ZZZIBM Corporation ZZZZZ
IBM ZZZ Corporation
$ sed "s/^\(.\)\1\{1,\}//;s/\(.\)\1\{1,\}$//" zzz_test.dat
IBM Corporation
IBM Corporation ZZZZZ
IBM ZZZ Corporation

It does not remove the ZZZZZ at the end. Please advise.

Thanks..

The solutions of anbu23 and rdcwayx work for me. Here's one with awk:

awk '{sub(/^Z+/,"");sub(/ Z+$/,"")}1' file

Maybe Linux is causing the issue.

$ awk '{sub(/^Z+/,"");sub(/ Z+$/,"")}1' zzz_test.dat
IBM Corporation
IBM Corporation ZZZZZ
IBM ZZZZ Corporation

Please advise.

Thanks....

Maybe it is because of a space at the end of the line. Try this:

sed "s/^\(.\)\1\{1,\}//;s/\(.\)\1\{1,\} *$//" zzz_test.dat

There is an end-of-line character at the end, no space. Everything else is working fine except the Z's at the end.

$ sed "s/^\(.\)\1\{1,\}//;s/\(.\)\1\{1,\} *$//" zzz_test.dat
IBM Corporation
IBM Corporation ZZZZZ
IBM ZZZZ Corporation

Please advise. Thanks a bunch for the help with this. We are writing a data-cleansing routine at work; all the other code is done except this last part, which is where I am stuck.

Thanks again.....

Can you show the output of this command?

cat -vet zzz_test.dat

Ah... It has ^M at the end of each record.

cat -vet zzz_test.dat
ZZZZIBM Corporation          ^M$
ZZZZIBM Corporation ZZZZZ^M$
IBM ZZZZ Corporation         ^M$

I will replace ^M and try the sed again. Thanks a lot for the help.

tr -d \\015 < zzz_test.dat | sed "s/^\(.\)\1\{1,\}//;s/\(.\)\1\{1,\} *$//"

Yep, this worked. Thanks a lot anbu23 and all others for your help.
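
For anyone landing here with the same symptom, here is a small sketch of why the trailing pattern never matched. It assumes a made-up sample record (the second line from the thread with a DOS line ending) rather than the real zzz_test.dat, and the variable names are only for illustration.

```shell
# Hypothetical sample record: the thread's second line with a DOS line ending.
sample=$(printf 'ZZZIBM Corporation ZZZZZ\r')

# With the carriage return in place, "$" sits after \r, so the trailing
# Z's are not adjacent to the end of the line and the pattern cannot match.
with_cr=$(printf '%s\n' "$sample" | sed 's/Z\{2,\}$//' | cat -v)

# Deleting \r first lets the very same expression match.
without_cr=$(printf '%s\n' "$sample" | tr -d '\r' | sed 's/Z\{2,\}$//' | cat -v)

printf 'with CR:    [%s]\n' "$with_cr"
printf 'without CR: [%s]\n' "$without_cr"
```

The first line should still show the Z's followed by ^M, while the second has them removed. The brackets are only there to make trailing blanks visible.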

The code below removes any character that is repeated 2 or more times at the beginning of a field, not just Z.

 
tr -d \\015 < zzz_test.dat | sed "s/^\(.\)\1\{1,\}//;s/\(.\)\1\{1,\} *$//"
 
Before : HHS/CDC
After : S/CDC
 
Before : 55304 
After : 304 

I just need runs of 2 or more Z's removed at the beginning and end of a field. Please advise.

Thanks....

tr -d \\015 < zzz_test.dat | sed "s/^Z\{2,\}//;s/Z\{2,\} *$//"

This worked. Thanks.....
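
Since the original question was about cleaning a field inside a loop rather than a whole file, here is a minimal sketch of how the Z-only pattern might be wrapped for that. The function name clean_field and the here-document are assumptions for illustration; the real script would pass in whatever field it extracts. One deliberate difference from the command above: the expression also drops blanks in front of the trailing Z's, so no stray space is left behind.

```shell
# clean_field is a hypothetical helper name, not something from the thread.
# It deletes any carriage return, then removes runs of two or more Z's at the
# start and at the end of the value. Z's in the middle are left alone.
clean_field() {
    printf '%s\n' "$1" | tr -d '\r' | sed 's/^Z\{2,\}//;s/ *Z\{2,\} *$//'
}

# Sample values from the thread; the loop stands in for the real field extraction.
while IFS= read -r field; do
    printf '[%s]\n' "$(clean_field "$field")"
done <<'EOF'
ZZZIBM Corporation ZZZZZ
IBM ZZZ Corporation
HHS/CDC
55304
EOF
```

Because the pattern names Z explicitly instead of using a back-reference, values such as HHS/CDC and 55304 pass through untouched. A value that legitimately ends in ZZ would still be trimmed, so that is worth checking against the real data.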

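If your awk/nawk supports regular-expression field separators, this collapses every run of Z's and spaces into a single space. Note that it also removes Z's in the middle of a field and leaves a leading space where the line started with Z's: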

awk -F'[Z ]+' '{$1=$1}1'
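
If an awk version is preferred, a sketch that only touches the two ends of the line could look like the following. The name trim_z is made up for this example, and ZZ+ is used instead of an interval expression on the assumption that some older awks do not handle Z{2,}.

```shell
# trim_z is a hypothetical name. The awk program drops a trailing carriage
# return, then removes two or more Z's at the start of the line and at the
# end of the line (together with any blanks around the trailing run).
trim_z() {
    awk '{ sub(/\r$/, ""); sub(/^ZZ+/, ""); sub(/ *ZZ+ *$/, "") } 1'
}

# Sample lines from the thread.
printf '%s\n' 'ZZZIBM Corporation' 'ZZZIBM Corporation ZZZZZ' 'IBM ZZZ Corporation' | trim_z
```

Unlike the field-separator approach, this leaves the ZZZ in the middle of the third line in place.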