Stemming words that contain affixes using a shell script

I am just learning shell scripting and need your expertise. I would like to stem words by first matching the root words in root_words.txt against the words in affix_words.txt. Every character of each affixed word should then be replaced by "I", except that the first character after the root word becomes "B" and the last character before the root word becomes "E".

root_words.txt:

read
like
.....

affix_words.txt:

reading
unlikely
.....

The expected output is:

r e a d i n g<TAB>I I I I B I I
u n l i k e l y<TAB>I E I I I I B I
.....

Not quite clear. Do you want to print every input word in the affix file, followed by a <TAB> and the same word with every character replaced by an uppercase I, except that the last character BEFORE a match with the root file is replaced by E and the first character AFTER a match is replaced by B, with a space after every single character?

Yes, that's right, RudiC.


Try

awk '
NR==FNR         {SP=SP DL $1
                 DL = "|"
                 next
                }
match ($0, SP)  {T = $0
                 gsub (/./, "I", T)
                 T = substr (T, 1, RSTART-2) (RSTART>1?"E":"") substr (T, RSTART, RLENGTH) (RSTART+RLENGTH<=length($0)?"B":"") substr (T, RSTART+RLENGTH+1)
                 T = $0 "\t" T
                 $0 = ""
                 for (i=1; i<=length(T); i++) $0 = $0 substr (T, i, 1) " "
                }
1
' root_words.txt affix_words.txt
r e a d i n g    I I I I B I I
u n l i k e l y          I E I I I I B I

Thank you so much, RudiC, and sorry for disturbing you. Could you explain the code above a little?

No need to apologize. Welcome.

awk '
NR==FNR         {SP=SP DL $1                    # collect root words into a search pattern built from "alternate expressions".
                 DL = "|"                       # make the delimiter the infix alternate operator
                 next
                }
match ($0, SP)  {T = $0                         # if any root word found in affix, create a working variable T from input line
                 gsub (/./, "I", T)             # make temp consist of all "I"s

                 T = substr (T, 1, RSTART-2) (RSTART>1?"E":"") substr (T, RSTART, RLENGTH) (RSTART+RLENGTH<=length($0)?"B":"") substr (T, RSTART+RLENGTH+1)
                                                # replace the "I" before SP with "E", and after SP with "B"

                 T = $0 "\t" T                  # combine input line with result pattern
                 $0 = ""                        # clear the record before rebuilding it
                 for (i=1; i<=length(T); i++) $0 = $0 substr (T, i, 1) " "
                                                # intersperse spaces
                }
1                                               # default action: print $0
' file1 file2
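
To see the two building blocks in isolation, here is a minimal sketch that prints the pattern the first block accumulates and the values match leaves in RSTART and RLENGTH. The sample words are the ones from the question; the function names build_pattern and locate_root are made up for illustration and are not part of the solution above.

```shell
# Join the root words with "|", the same way the NR==FNR block does.
build_pattern() {
    awk '{ SP = SP DL $1; DL = "|" } END { print SP }'
}

# Report where the first root word matches in each input word.
# $1 = alternation pattern, words are read from stdin.
locate_root() {
    awk -v SP="$1" 'match($0, SP) { print $0, RSTART, RLENGTH }'
}

printf 'read\nlike\n' | build_pattern
# read|like

printf 'reading\nunlikely\n' | locate_root 'read|like'
# reading 1 4
# unlikely 3 4
```

For "unlikely", RSTART is 3, so substr (T, 1, RSTART-2) keeps one "I" and the "E" lands in position 2, directly before the root. Note that the root words are used as a regular expression here, so a root containing characters such as "." or "+" would not be matched literally.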

How can I make the output look like this:

r<TAB>I
e     I
a     I
d     I
i     B
n     I
g     I
$     $   # put a "$" sign at the end of each word

.     .
.     .
.     .

I tried changing your code a bit:

for (i=1; i<=length(T); i++) $0 = $0 substr (T, i, 1) "\n"

but the output became this:

r
e
a
d
i
n
g

I
I
I
I
B
I
I

.
.
.

Try

awk '
NR==FNR         {SP=SP DL $1
                 DL = "|"
                 next
                }
match ($0, SP)  {T = $0
                 gsub (/./, "I", T)
                 T = substr (T, 1, RSTART-2) (RSTART>1?"E":"") substr (T, RSTART, RLENGTH) (RSTART+RLENGTH<=length($0)?"B":"") substr (T, RSTART+RLENGTH+1)
                 for (i=1; i<=length(T); i++) print substr ($0, i, 1) "\t" substr (T, i, 1)
                 print "$\t$"
                }
' file1 file2
r	I
e	I
a	I
d	I
i	B
n	I
g	I
$	$
u	I
n	E
l	I
i	I
k	I
e	I
l	B
y	I
$	$

Thanks again, RudiC. You have helped me a lot.

Is that a homework question? Those should be posted in Homework & Coursework Questions.


Suppose I put the root words and the affix words in one file, in separate fields. For example:

read     reading
like     unlikely
.....

Which part of the code do I need to change to get the same output as before:

r e a d i n g<TAB>I I I I B I I
u n l i k e l y<TAB>I E I I I I B I

Sorry for asking so often, and thanks.

Please confirm that the entire thread is NOT homework by giving some background and/or motivation.

Correct, it is not homework. I'm working on text-to-speech synthesis in a research department, and the time I have been given is limited. This is my first time writing a shell script; until now I have only been a Java programmer. I try hard to work things out myself and only ask when I really can't. Thanks.

Try

awk '
match ($2, $1)  {T = $2
                 gsub (/./, "I", T)
                 T = substr (T, 1, RSTART-2) (RSTART>1?"E":"") substr (T, RSTART, RLENGTH) (RSTART+RLENGTH<=length($2)?"B":"") substr (T, RSTART+RLENGTH+1)
                 T = $2 "\t" T
                 $0 = ""
                 for (i=1; i<=length(T); i++) $0 = $0 substr (T, i, 1) " "
                }
1
' file
r e a d i n g      I I I I B I I 
u n l i k e l y      I E I I I I B I 

Sorry for disturbing you again. I would like to put the data from file 1 in the first field and the data from file 2 in the second field, separated by a <TAB>.

File 1:

read
like

File 2:

reading
unlikely

Expected output:

read<TAB>reading
like<TAB>unlikely
.....

and

r<TAB>r
e     e
a     a
d     d
$     i
      n
      g
      $ # put a "$" sign at the end of each word

The code I tried is:

awk 'FNR==NR{a[FNR]=$1"\t"; next}{print a[FNR],$1}' root_test.txt affix_test.txt

Finally, how do I put a space between the characters?

As much as I like to help: wouldn't it be time to get your act together? Moving targets ALWAYS are difficult if not impossible to hit. What in your own code presented in post#15 did not satisfy you?
@ 1)

paste file[12]
read    reading
like    unlikely

@ 2) Can't you adapt the proposal in post#8?
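
As a minimal sketch of how the per-character loop from earlier in the thread might be adapted for 2), assuming that loop is the proposal meant here: the function below reads the tab-separated pairs that paste produces and prints both words as parallel columns, each ending in "$". The name side_by_side is hypothetical.

```shell
# Print two tab-separated words as parallel character columns, each ending in "$".
side_by_side() {
    awk -F'\t' '{
        a = $1 "$"; b = $2 "$"                               # append the end marker
        n = (length(a) > length(b)) ? length(a) : length(b)  # the longer column sets the row count
        for (i = 1; i <= n; i++)
            print substr(a, i, 1) "\t" substr(b, i, 1)       # substr past the end yields ""
    }'
}

printf 'read\treading\n' | side_by_side
```

With the two input files from the question, the call would be along the lines of paste file1 file2 | side_by_side.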


Sorry, I forgot about post #8. The problem is solved. Thank you very much.

Sorry to disturb you again. I am really stuck on this one because I need to build a lot of fields. All I can do is build the first field; the other fields are still unclear to me. What I am trying to do is first pair each label with its character: I=r, I=e, I=a, I=d, B=i, I=n and I=g. After that, I want to get the characters before and after each of r, e, a, d, i, n, g and put them in the right positions. If there is no character, put a "#" sign. Field 7 is in the complete word "reading" in this case.

Input file:

r e a d i n g<TAB>I I I I B I I
u n l i k e l y<TAB>I E I I I I B I

The expected output is:

I # # # # # # r e a d i n g   //'r' is the main character, before 'r' is null and after 'r' are e,a,d,i,n,g
I # # # # # r e a d i n g #   //'e' is the main character, before 'e' is r and after 'e' are a,d,i,n,g
I # # # # r e a d i n g # #   //'a' is the main character, before 'a' are r,e and after 'a' are d,i,n,g
I # # # r e a d i n g # # #   //'d' is the main character, before 'd' are r,e,a and after 'd' are i,n,g
B # # r e a d i n g # # # #   //'i' is the main character, before 'i' are r,e,a,d and after 'i' are n,g
I # r e a d i n g # # # # #   //'n' is the main character, before 'n' are r,e,a,d,i and after 'n' is g
I r e a d i n g # # # # # #   //'g' is the main character, before 'g' are r,e,a,d,i,n and after 'g' is null

This is the code I tried. How do I loop over the next fields?

awk '
        {
            for (i=1; i<=length($1); i++) print "#"
                for (j=1; j<=1; j++) print substr ($1, j, 1)
                print ""
        }
' $1

The output I got:

#
#
#
#
#
#
#
r

#
#
#
#
#
#
#
#
u

Please guide me. Thanks.

I don't have the slightest clue what you're after. To print on one single line instead of a new line for every char, try printf "#" or printf "%s", var
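
Building on the printf hint, here is a minimal sketch of one possible reading of the expected output: for a word of n characters, row i prints the label of character i, then n-i "#" signs, the whole word, and i-1 "#" signs, so the main character always sits in the same column. This interpretation is an assumption drawn from the sample rows, and the function name context_rows is hypothetical.

```shell
# One output row per character: label, left padding, the word, right padding.
context_rows() {
    awk -F'\t' '{
        n = split($1, ch, " ")      # characters of the word
        split($2, lab, " ")         # one label per character
        for (i = 1; i <= n; i++) {
            printf "%s", lab[i]                            # label of the main character
            for (j = 1; j <= n - i; j++) printf " #"       # pad so character i lands in column n
            for (j = 1; j <= n; j++)     printf " %s", ch[j]
            for (j = 1; j < i; j++)      printf " #"       # pad the right-hand side
            printf "\n"
        }
    }'
}

printf 'r e a d i n g\tI I I I B I I\n' | context_rows
```

Because printf adds no newline unless the format contains one, each row stays on a single line until the explicit printf "\n".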