Unix Remove repetitive alphabets

msalam65 · February 2, 2010, 6:09pm

Hi,

I am trying to write a script that removes runs of 2 or more repeated letters (ZZ) from a field. This should only happen at the beginning and end of the field, never in the middle.

For example:

Input File

a) ZZZIBM Corporation
b) ZZZIBM Corporation ZZZZZ
c) IBM ZZZ Corporation

The output should be as follows:

a) IBM Corporation
b) IBM Corporation
c) IBM ZZZ Corporation

Please advise.

Thanks....

rdcwayx · February 2, 2010, 7:11pm

Will it always be Z's?

sed 's/\(^.. \)Z*/\1/;s/Z*$//' urfile

anbu23 · February 2, 2010, 8:04pm

sed "s/^\(.\)\1\{1,\}//;s/\(.\)\1\{1,\}$//" file

msalam65 · February 2, 2010, 9:25pm

It will always be Z's, but there can be 2 or more of them in a row. So will the sed command work for 2 or more Z's? Also, will it remove Z's from both the beginning and the end of a field? Z's between words should not be removed. Please advise.

Thanks...

rdcwayx · February 3, 2010, 1:10am

Yes. Try it and see.

msalam65 · February 3, 2010, 12:02pm

I tried it, and it returns the input unchanged, without removing the Z's.

sed 's/\(^.. \)Z*/\1/;s/Z*$//' zzz_test.dat
ZZZIBM Corporation
ZZZIBM Corporation ZZZZZ
IBM ZZZ Corporation

I also tried

sed "s/^\(.\)\1\{1,\}//;s/\(.\)\1\{1,\}$//" file

and it also does not work.

Also, I need this to be done on a field, not a file. I am already extracting the field from a file while looping through each line. I am doing this on Linux. Please advise.

Thanks...

anbu23 · February 3, 2010, 12:15pm

It is working for me.

$ cat file
ZZZIBM Corporation
ZZZIBM Corporation ZZZZZ
IBM ZZZ Corporation
$ sed "s/^\(.\)\1\{1,\}//;s/\(.\)\1\{1,\}$//" file
IBM Corporation
IBM Corporation
IBM ZZZ Corporation

What is your output of sed command?

msalam65 · February 3, 2010, 12:24pm

$ cat zzz_test.dat
ZZZIBM Corporation
ZZZIBM Corporation ZZZZZ
IBM ZZZ Corporation
$ sed "s/^\(.\)\1\{1,\}//;s/\(.\)\1\{1,\}$//" zzz_test.dat
IBM Corporation
IBM Corporation ZZZZZ
IBM ZZZ Corporation

It does not remove the trailing ZZZZZ. Please advise.

Thanks..

Franklin52 · February 3, 2010, 12:30pm

The solutions of anbu23 and rdcwayx work for me. Here's one with awk:

awk '{sub(/^Z+/,"");sub(/ Z+$/,"")}1' file

msalam65 · February 3, 2010, 12:32pm

Maybe it is Linux that is causing the issue.

$ awk '{sub(/^Z+/,"");sub(/ Z+$/,"")}1' zzz_test.dat
IBM Corporation
IBM Corporation ZZZZZ
IBM ZZZZ Corporation

Please advise.

Thanks....

anbu23 · February 3, 2010, 12:34pm

Maybe it is because of a space at the end of the line. Try this:

sed "s/^\(.\)\1\{1,\}//;s/\(.\)\1\{1,\} *$//" zzz_test.dat

msalam65 · February 3, 2010, 12:49pm

There is only an end-of-line character at the end, no space. Everything else works fine except the trailing ZZZ.

$ sed "s/^\(.\)\1\{1,\}//;s/\(.\)\1\{1,\} *$//" zzz_test.dat
IBM Corporation
IBM Corporation ZZZZZ
IBM ZZZZ Corporation

Please advise. Thanks a bunch for the help in this. The reason for this is we are writing a cleansing routine at work to clean data. All other code is done except the last part which is where I am stuck.

Thanks again.....

anbu23 · February 3, 2010, 1:05pm

Can you show the output of this command?

cat -vet zzz_test.dat

msalam65 · February 3, 2010, 2:22pm

Ah... there is a ^M (carriage return) at the end of each record.

cat -vet zzz_test.dat
ZZZZIBM Corporation          ^M$
ZZZZIBM Corporation ZZZZZ^M$
IBM ZZZZ Corporation         ^M$

I will remove the ^M and try the sed again. Thanks a lot for the help.

anbu23 · February 3, 2010, 2:36pm

tr -d \\015 < zzz_test.dat | sed "s/^\(.\)\1\{1,\}//;s/\(.\)\1\{1,\} *$//"
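
For anyone hitting the same symptom, here is a minimal sketch of why the trailing Z's survived before the carriage return was deleted. The sample line is made up to resemble the data above, and only the end-of-line substitution is shown.

```shell
# Sample record with a DOS-style carriage return at the end (made-up test data).
line=$(printf 'ZZZIBM Corporation ZZZZZ\r')

# With \r present, "$" sits after the \r, so "Z's followed by end of line" never matches.
with_cr=$(printf '%s\n' "$line" | sed 's/Z\{2,\} *$//')

# After deleting \r (octal 015), the same expression matches the trailing Z's.
without_cr=$(printf '%s\n' "$line" | tr -d '\015' | sed 's/Z\{2,\} *$//')

# cat -v makes the carriage return visible as ^M.
printf '%s\n' "$with_cr" | cat -v
printf '[%s]\n' "$without_cr"
```

Note that the space before the removed Z's is still there in the second result; trimming it would need a further substitution.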

msalam65 · February 3, 2010, 2:38pm

Yep, this worked. Thanks a lot, anbu23 and all the others, for your help.

msalam65 · February 8, 2010, 12:55pm

The code below removes any character that is repeated 2 or more times at the beginning of a field, not just Z.

 
tr -d \\015 < zzz_test.dat | sed "s/^\(.\)\1\{1,\}//;s/\(.\)\1\{1,\} *$//"
 
Before : HHS/CDC
After : S/CDC
 
Before : 55304 
After : 304

I only need runs of 2 or more Z's to be removed from the beginning and end of a field. Please advise.

Thanks....
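
To illustrate what is happening here: in that expression, \(.\) captures whatever the first character is and \1\{1,\} requires that same character to repeat at least once, so HH and 55 qualify just as ZZ does. A quick sketch using the two values above plus a ZZZIBM sample:

```shell
# \(.\) matches ANY first character; \1\{1,\} then demands one or more repeats of it.
# The leading run is therefore removed for H and 5 as well as for Z.
out=$(printf '%s\n' 'HHS/CDC' '55304' 'ZZZIBM' | sed 's/^\(.\)\1\{1,\}//')
printf '%s\n' "$out"
```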

anbu23 · February 8, 2010, 1:02pm

tr -d \\015 < zzz_test.dat | sed "s/^Z\{2,\}//;s/Z\{2,\} *$//"
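
Since the value is already extracted into a field inside a loop, the same expression could be wrapped in a small function and called per field. This is only a sketch: `strip_z` is a made-up name, and the extra `s/ *$//` that trims leftover trailing spaces is an assumption about the desired result.

```shell
# strip_z: hypothetical helper, not part of the original solution.
# Removes a run of 2 or more Z's at the start and at the end of ONE field value.
strip_z() {
    printf '%s\n' "$1" |
        tr -d '\015' |
        sed 's/^Z\{2,\}//; s/Z\{2,\} *$//; s/ *$//'
}

strip_z 'ZZZIBM Corporation ZZZZZ'   # IBM Corporation
strip_z 'IBM ZZZ Corporation'        # IBM ZZZ Corporation (middle Z's kept)
strip_z 'HHS/CDC'                    # HHS/CDC (unchanged)
strip_z '55304'                      # 55304 (unchanged)
```

Inside the loop it could be called as `clean=$(strip_z "$field")`, where `field` stands for whatever variable holds the extracted value. One caveat: a real word ending in ZZ, such as JAZZ, would also lose its last two letters.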

msalam65 · February 8, 2010, 3:01pm

This worked. Thanks.....

reborg · February 8, 2010, 3:16pm

If your awk/nawk supports multi-char field separators

awk -F'[Z ]+' '{$1=$1}1'
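
For comparison, an awk variant that anchors the match to the start and end of the line and leaves Z's between words alone might look like the sketch below. It is not from the thread; the input lines are made-up samples with carriage returns, and the final `sub` that trims trailing spaces is an assumption about the desired output.

```shell
# Strip the carriage return, then 2+ Z's at the start, then 2+ Z's at the end,
# then any spaces left behind at the end of the line.
out=$(printf 'ZZZIBM Corporation ZZZZZ\r\nIBM ZZZ Corporation\r\nHHS/CDC\r\n' |
    awk '{ sub(/\r$/, ""); sub(/^ZZ+/, ""); sub(/ZZ+ *$/, ""); sub(/ +$/, "") } 1')
printf '%s\n' "$out"
```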