Stemming words that contain affixes using a shell script

paranrat · May 2, 2016, 5:36am

I am just learning shell scripting and need your expertise. I would like to stem words by first matching the root words between two files, then replacing every character of each word in affix_words.txt with "I", except that the character right after the root word becomes "B" and the character right before the root word becomes "E".

root_words.txt:

read
like
.....

affix_words.txt:

reading
unlikely
.....

The expected output is:

r e a d i n g<TAB>I I I I B I I
u n l i k e l y<TAB>I E I I I I B I
.....

RudiC · May 2, 2016, 5:49am

Not quite clear. You want to print every input word in the affix file plus, separated by a <TAB>, this word with every char replaced by upper case I except for the last char BEFORE a match with the root file replaced by E , and the first char AFTER a match replaced by B , and a space after every single character?

paranrat · May 2, 2016, 6:14am

Yes, that's right, RudiC.

RudiC · May 2, 2016, 6:17am

Try

awk '
NR==FNR         {SP=SP DL $1
                 DL = "|"
                 next
                }
match ($0, SP)  {T = $0
                 gsub (/./, "I", T)
                 T = substr (T, 1, RSTART-2) (RSTART>1?"E":"") substr (T, RSTART, RLENGTH) (RSTART+RLENGTH<=length?"B":"") substr (T, RSTART+RLENGTH+1)
                 T = $0 "\t" T
                 $0 = ""
                 for (i=1; i<=length(T); i++) $0 = $0 substr (T, i, 1) " "
                }
1
' root_words.txt affix_words.txt
r e a d i n g    I I I I B I I
u n l i k e l y          I E I I I I B I

paranrat · May 2, 2016, 6:20am

Thank you so much, RudiC, and sorry for disturbing you. Could you explain the code above a little?

RudiC · May 2, 2016, 6:43am

No need to apologize. Welcome.

awk '
NR==FNR         {SP=SP DL $1                    # collect root words into a search pattern built from "alternate expressions".
                 DL = "|"                       # make the delimiter the infix alternate operator
                 next
                }
match ($0, SP)  {T = $0                         # if any root word found in affix, create a working variable T from input line
                 gsub (/./, "I", T)             # make T consist of all "I"s

                 T = substr (T, 1, RSTART-2) (RSTART>1?"E":"") substr (T, RSTART, RLENGTH) (RSTART+RLENGTH<=length?"B":"") substr (T, RSTART+RLENGTH+1)
                                                # replace the "I" before SP with "E", and after SP with "B"

                 T = $0 "\t" T                  # combine input line with result pattern
                 $0 = ""                        
                 for (i=1; i<=length(T); i++) $0 = $0 substr (T, i, 1) " "
                                                # intersperse spaces
                }
1                                               # default action: print $0
' file1 file2
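
The explanation above relies on match() setting RSTART (where the root begins) and RLENGTH (how long it is). A minimal sketch that prints those two values for the sample words may make the substr() arithmetic easier to follow; show_match is a hypothetical helper name, not something from the thread, and the pattern is the same "read|like" alternation that the first awk block builds:

```shell
# Sketch: report what match() finds for a word and an alternation pattern.
show_match() {
    # $1 = word, $2 = pattern (hypothetical helper for illustration only)
    awk -v word="$1" -v pat="$2" 'BEGIN {
        if (match(word, pat))
            printf "%s RSTART=%d RLENGTH=%d\n", word, RSTART, RLENGTH
        else
            printf "%s no match\n", word
    }'
}

show_match reading  'read|like'    # root starts at char 1, 4 chars long
show_match unlikely 'read|like'    # root starts at char 3, 4 chars long
```

With RSTART=3 and RLENGTH=4 for "unlikely", the "E" goes to position RSTART-1 = 2 and the "B" to position RSTART+RLENGTH = 7, which matches the I E I I I I B I output shown earlier.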

paranrat · May 3, 2016, 4:09am

How can I make the output look like this:

r<TAB>I
e     I
a     I
d     I
i     B
n     I
g     I
$     $   # put a "$" sign at the end of each word

.     .
.     .
.     .

I tried changing your code a bit:

for (i=1; i<=length(T); i++) $0 = $0 substr (T, i, 1) "\n"

but the output becomes:

r
e
a
d
i
n
g

I
I
I
I
B
I
I

.
.
.

RudiC · May 3, 2016, 4:53am

Try

awk '
NR==FNR         {SP=SP DL $1
                 DL = "|"
                 next
                }
match ($0, SP)  {T = $0
                 gsub (/./, "I", T)
                 T = substr (T, 1, RSTART-2) (RSTART>1?"E":"") substr (T, RSTART, RLENGTH) (RSTART+RLENGTH<=length?"B":"") substr (T, RSTART+RLENGTH+1)
                 for (i=1; i<=length(T); i++) print substr ($0, i, 1) "\t" substr (T, i, 1)
                 print "$\t$"
                }
' file1 file2
r	I
e	I
a	I
d	I
i	B
n	I
g	I
$	$
u	I
n	E
l	I
i	I
k	I
e	I
l	B
y	I
$	$

paranrat · May 3, 2016, 5:31am

Thanks again, RudiC. You have helped me a lot.

RudiC · May 3, 2016, 5:52am

Is that a homework question? Those should be posted in Homework & Coursework Questions.

paranrat · May 3, 2016, 6:10am

Suppose I put the root words and the affixed words in one file, in separate fields. For example:

read     reading
like     unlikely
.....

Which part of the code do I need to remove to get the same output as before:

r e a d i n g<TAB>I I I I B I I
u n l i k e l y<TAB>I E I I I I B I

Sorry for asking so often, and thanks.

RudiC · May 3, 2016, 6:22am

Please confirm that the entire thread is NOT homework by giving some background and/or motivation.

paranrat · May 3, 2016, 6:28am

Correct, it is not homework. I'm working on text-to-speech synthesis in a research department, and the time I have been given is limited. This is my first time writing a script; until now I have only been a Java programmer. I try hard to work things out myself and only ask when I really can't. Thanks.

RudiC · May 3, 2016, 7:04am

Try

awk '
match ($2, $1)  {T = $2
                 gsub (/./, "I", T)
                 T = substr (T, 1, RSTART-2) (RSTART>1?"E":"") substr (T, RSTART, RLENGTH) (RSTART+RLENGTH<=length($2)?"B":"") substr (T, RSTART+RLENGTH+1)
                 T = $2 "\t" T
                 $0 = ""
                 for (i=1; i<=length(T); i++) $0 = $0 substr (T, i, 1) " "
                }
1
' file
r e a d i n g      I I I I B I I 
u n l i k e l y      I E I I I I B I

paranrat · May 9, 2016, 11:04am

Sorry for disturbing you again. I would like to put the data from file 1 in the first field and the data from file 2 in the second field, separated by a <TAB>.

File 1:

read
like

File 2:

reading
unlikely

Expected output:

read<TAB>reading
like<TAB>unlikely
.....

and

r<TAB>r
e     e
a     a
d     d
$     i
      n
      g
      $ # put a "$" sign at the end of each word

The code I tried is:

awk 'FNR==NR{a[FNR]=$1"\t"; next}{print a[FNR],$1}' root_test.txt affix_test.txt

Finally, how do I put a space between the characters?

RudiC · May 9, 2016, 12:43pm

As much as I like to help: wouldn't it be time to get your act together? Moving targets ALWAYS are difficult if not impossible to hit. What in your own code presented in post#15 did not satisfy you?
@ 1)

paste file[12]
read    reading
like    unlikely

@ 2) Can't you adapt the proposal in post#8?
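
One way the two hints could be combined is sketched below: paste joins the files with a <TAB>, and a per-character loop like the one in the earlier per-character answer prints the columns. This is only a sketch under assumptions: side_by_side is a hypothetical helper name, both files are assumed to hold one word per line in the same order, and the sample files are created in a temporary directory just for the demonstration.

```shell
# Sketch: print root and affixed word character by character, side by side.
side_by_side() {
    # $1 = root word file, $2 = affixed word file (hypothetical helper)
    paste "$1" "$2" | awk -F'\t' '{
        a = $1 "$"; b = $2 "$"          # terminate each word with "$"
        n = (length(a) > length(b)) ? length(a) : length(b)
        for (i = 1; i <= n; i++)        # substr() past the end yields ""
            print substr(a, i, 1) "\t" substr(b, i, 1)
    }'
}

# Demonstration with the sample words from the thread
tmp=$(mktemp -d)
printf 'read\nlike\n'        > "$tmp/file1"
printf 'reading\nunlikely\n' > "$tmp/file2"
side_by_side "$tmp/file1" "$tmp/file2"
rm -r "$tmp"
```

The shorter column simply runs out after its "$", so the remaining rows start with an empty first field followed by the <TAB>.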

paranrat · May 9, 2016, 9:42pm

Sorry, I forgot post #8. The problem is solved. Thank you very much.

paranrat · May 13, 2016, 1:55pm

Sorry to disturb you again. I am really stuck on this one because I need to build a lot of fields. All I can do is build the first field; the other fields are unclear to me. What I am trying to do is match them up first: I=r, I=e, I=a, I=d, B=i, I=n and I=g. After that, I take the characters before and after each of r, e, a, d, i, n, g and put them in the right positions. Where there is no character, I put a "#" sign. The 7th field is in the complete word, "reading" in this case.

Input file:

r e a d i n g<TAB>I I I I B I I
u n l i k e l y<TAB>I E I I I I B I

The expected output is:

I # # # # # # r e a d i n g   //'r' is the main character, before 'r' is null and after 'r' are e,a,d,i,n,g
I # # # # # r e a d i n g #   //'e' is the main character, before 'e' is r and after 'e' are a,d,i,n,g
I # # # # r e a d i n g # #   //'a' is the main character, before 'a' are r,e and after 'a' are d,i,n,g
I # # # r e a d i n g # # #   //'d' is the main character, before 'd' are r,e,a and after 'd' are i,n,g
B # # r e a d i n g # # # #   //'i' is the main character, before 'i' are r,e,a,d and after 'i' are n,g
I # r e a d i n g # # # # #   //'n' is the main character, before 'n' are r,e,a,d,i and after 'n' is g
I r e a d i n g # # # # # #   //'g' is the main character, before 'g' are r,e,a,d,i,n and after 'g' is null

This is the code I tried. How do I loop for the next fields?

awk '
        {
            for (i=1; i<=length($1); i++) print "#"
                for (j=1; j<=1; j++) print substr ($1, j, 1)
                print ""
        }
' $1

The output I got:

#
#
#
#
#
#
#
r

#
#
#
#
#
#
#
#
u

Please guide me. Thanks.

RudiC · May 16, 2016, 5:17am

I don't have the slightest clue of what you're after. To print to one single line instead of new lines for every char, try printf "#" or printf "%s" var
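
For illustration, here is a sketch of how the printf hint could be applied, assuming the goal is exactly the expected output listed in the previous post: one row per character, starting with that character's label, followed by the whole word padded with "#" so that it shifts one position left on each row. context_rows is a hypothetical helper name, and the input is assumed to be the <TAB>-separated, space-interspersed format shown earlier in the thread.

```shell
# Sketch: one padded row per character, printed on a single line with printf.
context_rows() {
    # stdin: lines like "r e a d i n g<TAB>I I I I B I I" (hypothetical helper)
    awk -F'\t' '{
        n = split($1, ch, " ")          # characters of the word
        split($2, lab, " ")             # one label per character
        for (i = 1; i <= n; i++) {
            printf "%s", lab[i]                            # label first
            for (j = 1; j <= n - i; j++) printf " #"       # left padding
            for (j = 1; j <= n; j++)     printf " %s", ch[j]
            for (j = 1; j <  i; j++)     printf " #"       # right padding
            printf "\n"
        }
    }'
}

printf 'r e a d i n g\tI I I I B I I\n' | context_rows
```

Because printf adds no newline unless asked, each row stays on one line; the only newline is the explicit one at the end of the outer loop body.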