Need help in SED

ahjiefreak · December 2, 2007, 9:05pm

Hi,

I am quite new to the Bash shell and Unix.

I would like to use SED to replace characters in my files.
The input I have is the result of an earlier SED pass that removes all whitespace, which is why you see "charc" where the source had "char c". Assume I have the following input in my file:-

addstr(c,outset,j,maxset){
charc;
charoutset;
intj;
}

I want to tokenise "*", "," and ";" each as a separate token, say the character "s" each time one is seen. The brackets ( ) and { } should each become "s" as well. The type keywords char, int and str should each be replaced, as a group, by a single "d". Every other character should become "t".

The output I am looking for is:-

ttttttststtttttststtttttss
dts
dstttttts
dsts

And then I plan to count the number of t, s and d tokens on each line.

I looked for SED tutorials on Google, but none of them go into detail on this specific case. Could anyone help?

I would appreciate it a lot.

Thanks.

Rgrds,
Jason

Smiling_Dragon · December 2, 2007, 10:10pm

Start by taking the full words to tokenise and exchange them for something 'safe' that subsequent replacements won't hit.
Then replace the remaining 'word' characters with your t token.
The symbols come next; lastly, replace the 'safe' symbol with the d token.
Basic syntax you want:

sed 's/thing/token/g'

Untested, but try the following:

sed 's/(char|int|str)/\@/g' < file_to_convert | sed 's/[\w]/t/g' | sed 's/[\*\,\;\(\)\{\}]/s/g' | sed 's/\@/d/g'
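
For readers who want something that runs, here is a minimal sketch of the same four steps as a single sed call. It assumes a sed that accepts -E for extended regular expressions (GNU and BSD sed do), writes the letter class as [[:alnum:]_] instead of \w, and assumes the input never contains a literal @. The function name tokenise is made up for illustration.

```shell
# Hypothetical helper: reads source text on stdin, writes d/s/t tokens.
tokenise() {
  sed -E \
    -e 's/(char|int|str)/@/g' \
    -e 's/[[:alnum:]_]/t/g' \
    -e 's/[*,;(){}]/s/g' \
    -e 's/@/d/g'
}

# Steps in order: keywords -> @, word characters -> t, symbols -> s, @ -> d.
printf 'charc;\nintj;\n}\n' | tokenise
# prints:
# dts
# dts
# s
```

Note that this sketch still treats the "str" inside a longer name such as addstr as a keyword, because nothing in the whitespace-free input marks where a word ends.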

Smiling_Dragon · December 2, 2007, 10:12pm

An alternative might be to create a little script to just count the things you want without bothering to monkey with it first - pretty much just a big case with a few smarts to understand how to break the bits up.
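
One way such a counting script could look, as a rough sketch rather than a finished tool: awk's gsub returns the number of replacements it made, so deleting each class of token from a copy of the line gives the counts directly. The function name count_tokens and the output order (d, s, t) are assumptions for illustration.

```shell
# Hypothetical helper: prints the d, s and t counts for each input line.
count_tokens() {
  awk '{
    line = $0
    d = gsub(/char|int|str/, "", line)     # keywords, removed once counted
    s = gsub(/[*,;(){}]/, "", line)        # punctuation and brackets
    t = gsub(/[A-Za-z0-9_]/, "", line)     # every remaining word character
    print d, s, t
  }'
}

printf 'charc;\nintj;\n' | count_tokens
# prints:
# 1 1 1
# 1 1 1
```

The keywords are counted first so that their letters are gone before the remaining word characters are counted as t.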

ahjiefreak · December 2, 2007, 10:50pm

Hi again,

One more thing,

The command you suggested contains file_to_convert. May I know what it is for?

Thanks.

-Jason

Hi Smiling Dragon.

Thanks for the reply. I am actually using

cat bla.txt| sed 's/(char|int|str)/\@/g' | sed 's/[\w]/t/g' | sed 's/[\*\,\;\(\)\{\}]/s/g' | sed 's/\@/d/g'>blabla.txt

And i get this:-

addstrscsoutsetsjsmaxsets
charcs
charsoutsets
intsjs

However, the function name "addstr" and the other letters were not converted.

One thing I do not understand is:-
sed 's/(char|int|str)/\@/g' | sed 's/[\w]/t/g'
The above is meant to replace char, int and str with "@", right? And what does "\w" mean in this case?

In your previous reply, you suggested taking the full words to tokenise and exchanging them for something 'safe' that subsequent replacements won't hit. How can int, char and str be recognized? For example, how could "charc" be recognized as "char" and "c" separately?

One more thing: in the previous example, could I just delete the function name right before the "(" so that the tokenization becomes simpler? Can that be done with SED? If so, how?

Hope to hear from you soon! Thanks.

-Jason

Smiling_Dragon · December 3, 2007, 4:12pm

The file_to_convert reference is just a way to get your input stream in there - I didn't know if you were bringing it in on STDIN or in a file passed to your script. Substitute your filename for file_to_convert, or omit it entirely if you intend to pass the file in on STDIN.

Add 'addstr' and any other things to match to the regex looking for char etc. Put longer names first so that 'addstr' gets matched before 'str' for example.

It looks like sed doesn't like my (thing|otherthing) regex syntax. This should be something you'll be able to debug though. If not, just use multiple sed calls (i.e. sed 's/char/\@/g' | sed 's/int/\@/g' etc).
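
A short sketch of why the alternation fails and two ways around it. By default sed uses basic regular expressions, in which ( | ) are ordinary characters; the -E option (assumed available, as in GNU and BSD sed) switches to extended regular expressions where they group and alternate.

```shell
# Basic regex (the default): ( | ) are literal characters, so nothing matches.
printf 'charc;\n' | sed 's/(char|int|str)/@/g'       # prints charc;

# Extended regex via -E: ( | ) group and alternate.
printf 'charc;\n' | sed -E 's/(char|int|str)/@/g'    # prints @c;

# Fallback without alternation: one expression per keyword in a single call.
printf 'charc;\n' | sed -e 's/char/@/g' -e 's/int/@/g' -e 's/str/@/g'    # prints @c;
```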

\w is meant to match all word characters (letters, digits and underscore). If your version of sed doesn't support it, you can use the tr command with [a-zA-Z] instead.
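
A sketch of two alternatives for the "everything else becomes t" step. In a POSIX bracket expression a backslash is an ordinary character, which would explain why [\w] left the letters untouched; the named class [[:alnum:]_] avoids the problem. The sample input c_1; is made up for illustration.

```shell
# Named POSIX class plus underscore: letters, digits and _ all become t.
printf 'c_1;\n' | sed 's/[[:alnum:]_]/t/g'    # prints ttt;

# tr alternative for letters only; GNU and BSD tr repeat the final
# character of the second set to cover the whole first set.
printf 'c_1;\n' | tr 'a-zA-Z' 't'             # prints t_1;
```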

The idea behind exchanging those words like char etc for another symbol is to prevent exactly what you are describing. We don't want the substitution of all characters to the 't' token to convert the letters making up char. And we can't change it to 't' yet as that's a letter too. It would be simpler to use tokens that are not normally present in the file, that way you could switch each one in turn without needing 'placeholder' tokens to protect the integrity of the information.

I would think it's fine to remove the function name, but it depends on the main script calling this section. I seldom use ksh functions in small scripts like this.

ahjiefreak · December 3, 2007, 6:00pm

Hi,

I tried to use

cat b.txt |tr '[a-zA-Z*]' 't'|sed 's/char/\@/g' |sed 's/int/\@/g' |sed 's/str/\@/g'|tr '[a-zA-Z]' 't'|sed 's/==/`/g'|sed 's/>=/`/g'|sed 's/<=/`/g'| sed 's/[\*\,\;\(\)\{\}\+\-\&\=\/\<\>\!\&\||\#]/`/g'>c.txt

The input file (b.txt):-

addstr(c,outset,j,maxset){
charc;
charoutset;
intj;
int maxset;
}

The output file (c.txt) :-

tttttt`t` tttttt` t` tttttt`
tttt t`
tttt ttttttt`
ttt tt`
ttt tttttt`
`

I still cannot figure out how to write a regex that differentiates between charstr and char. I put |tr '[a-zA-Z*]' 't'| first in the hope that charstr would be recognized as part of the token "t" rather than as @ for char.

"charc" also causes confusion as to whether it is a char keyword (@) or another word (t). Earlier I ran sed on b.txt to remove all the whitespace, so "char c" became "charc". Should I not have removed it in the first place?

Does anyone have an idea how to get around this problem?

I would appreciate it a lot! Thanks.
-Jason

Smiling_Dragon · December 3, 2007, 10:01pm

As I said before, you need to add the additional substitutions (such as charc) before the smaller ones (such as char) to prevent it matching the wrong thing. The tr needs to be after the step to replace the keywords with @ symbols.
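
A sketch of the ordering described above, under the assumption that the identifiers to protect (here only addstr) are known in advance and that % and @ do not occur in the input. The function name tokenise_ordered and the % placeholder are made up for illustration.

```shell
# Hypothetical helper: longer names first, keywords second, tr third.
tokenise_ordered() {
  sed -e 's/addstr/%/g' \
      -e 's/char/@/g' -e 's/int/@/g' -e 's/str/@/g' |
    tr 'a-zA-Z' 't' |
    sed -e 's/[*,;(){}]/s/g' -e 's/@/d/g' -e 's/%/t/g'
}

printf 'addstr(c){\n' | tokenise_ordered    # prints tstss
printf 'charc;\n' | tokenise_ordered        # prints dts
```

If the str substitution ran before the addstr one, the first line would become add@(c){ and the function name would end up containing a d.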

ahjiefreak · December 3, 2007, 11:21pm

Hi there,

I understand your explanation: we need to check the bigger chunks first (charc, addstr, or any other name that partly consists of a keyword), followed by the exact data types (char, str, int), and finally the other characters.

I tried to use:-

cat b.txt |sed 's/[\addstr]/-/g'|sed 's/char/\@/g' |sed 's/int/\@/g' |sed 's/str/\@/g'|sed 's/==/`/g'|sed 's/>=/`/g'|sed 's/<=/`/g'| sed 's/[\*\,\;\(\)\{\}\+\-\&\=\/\<\>\!\&\||\#]/`/g'|tr '[a-zA-Z]' '`'|tr '[-]' '`' >c.txt

Below is the output, which still does not show the right substitutions:-

`-----`````--`-``````-`-``
```-``
```-```--`-`
``-```
``- ```-`-`
`

The input is:-
addstr(c,outset,j,maxset){
charc;
charoutset;
intj;
int maxset;
}

I was expecting "addstr" and "charc" each to be replaced by a single "-" character, before being replaced by "`" at the end along with the other characters. Others like str and char are expected to become @.

However, I am now more confused after looking at the output. What mistakes have I made?

Please help. Thanks.

-Jason

ahjiefreak · December 3, 2007, 11:57pm

Hi there SmilingDragon,

I managed to simplify it to:-

cat b.txt |sed 's/addstr/t/g'|sed 's/^[ ]//'|sed '/^[0]/d'|sed 's/char/\@/g' |sed 's/int/\@/g' |sed 's/str/\@/g'|sed 's/bool/\@/g'|sed 's/if/\@/g'|sed 's/else/\@/g'|sed 's/else if/\@/g'|sed 's/==/~/g'|sed 's/>=/~/g'|sed 's/<=/~/g'|sed 's/!=/~/g'|sed 's/||/~/g'|sed 's/#/~/g'|sed 's/[\\,\;\(\)\{\}\+\-\&\=\/\<\>\!]/`/g'>c.txt

Here I first replace addstr with another token, which I call "t". Then, as you suggested, I replace char, str, int and so on with @. My final thought is that we don't need tr to translate the other tokens, since it only makes things more confusing.

As for the first step of replacing addstr or charc: is there an intelligent way for SED to tell when to use the single token "t" for "addstr", and the tokens "`" and "@" for "add str"?

I tried using a regexp but it doesn't work. Please let me know if you have any pointers on this. Thanks.

-Jason
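
One possible approach to the question above, sketched under two assumptions: the original whitespace is kept in the input instead of being stripped first, and the sed in use is GNU sed, whose \< and \> anchors match the start and end of a word. With those, a keyword is only replaced when it stands alone, so no list of identifiers is needed. The function name tokenise_words is hypothetical.

```shell
# Hypothetical helper; assumes GNU sed for the \< and \> word-boundary anchors
# and assumes the input still contains its original whitespace.
tokenise_words() {
  sed -E \
    -e 's/\<(char|int|str|bool|if|else)\>/@/g' \
    -e 's/[[:alnum:]_]/t/g' \
    -e 's/[*,;(){}]/s/g' \
    -e 's/@/d/g' \
    -e 's/[[:space:]]//g'
}

printf 'int intx;\n' | tokenise_words    # prints dtttts
printf 'add str\n' | tokenise_words      # prints tttd
printf 'addstr\n' | tokenise_words       # prints tttttt
```

Whitespace is removed only in the last step, after it has done its job of separating keywords from names. Here each letter of a name becomes its own t, as in the first post.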

ahjiefreak · December 4, 2007, 1:21am

Hi again,

Another issue I would like to bring up is when I have characters like:-

add[i-1]

The sed command above does not replace "-" with "`" when it appears inside [ ]. The same goes for other operators inside the brackets.

How could we solve this limitation?

I would appreciate it a lot. Hope to hear from you soon.

-Jason
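
One likely cause, offered as a sketch rather than a confirmed diagnosis: inside a bracket expression a backslash does not escape anything, so in a class written as \+\-\& the sequence \-\ reads as a range from backslash to backslash and the hyphen itself is never in the set. The square brackets are also missing from the class. Placing ] first and - last makes both literal. The example below uses % as the s command delimiter so that / can sit in the class unescaped, and uses "s" as the replacement token purely for illustration.

```shell
# ']' first and '-' last are taken literally; no backslashes needed in [...].
printf 'add[i-1]\n' | sed 's%[][*,;(){}+&=/<>!-]%s%g'    # prints addsis1s
```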