How to combine awk commands?

I can achieve two tasks with 2 different awk commands:
1)

awk -F";;WORD" '{print $2}' file | sed '/^$/d' #to find surface_word

2)

awk -F"bw:|gloss:" '// {print $2}'  file | sed '/\//!d; s:/[^+]*+*: + :g; s:^+::; s: *+ *$::;'  #to find segmentation of surface_word

Command 1) finds surface_word number x. I then expect 2) to find the multiple segmentations that come after surface_word x and before surface_word x+1.

Example

;;; SENTENCE A*AbthA
;;WORD A*AbthA
;;MADA: A*AbthA asp:p cas:na enc0:3fs_dobj gen:f mod:i num:s per:3 pos:verb prc0:0 prc1:0 prc2:0 prc3:0 stt:na vox:a
*0.887822 diac:>a*AbatohA lex:>a*Ab_1 bw:+>a*Ab/PV+at/PVSUFF_SUBJ:3FS+hA/PVSUFF_DO:3FS gloss:dissolve;melt;exhaust;consume 
_0.712209 diac:<i*AbatahA lex:<i*Abap_1 bw:+<i*Ab/NOUN+at/NSUFF_FEM_SG+a/CASE_DEF_ACC+hA/POSS_PRON_3FS gloss:dissolution 
_0.691945 diac:<i*AbatihA lex:<i*Abap_1 bw:+<i*Ab/NOUN+at/NSUFF_FEM_SG+i/CASE_DEF_GEN+hA/POSS_PRON_3FS gloss:dissolution 
_0.691778 diac:<i*AbatuhA lex:<i*Abap_1 bw:+<i*Ab/NOUN+at/NSUFF_FEM_SG+u/CASE_DEF_NOM+hA/POSS_PRON_3FS gloss:dissolution 
--------------
SENTENCE BREAK
--------------
;;; SENTENCE A$Abty
;;WORD A$Abty
;;MADA: A$Abty asp:na cas:u enc0:0 gen:f mod:na num:s per:na pos:noun prc0:0 prc1:0 prc2:0 prc3:0 stt:c vox:na
*0.862011 diac:>u$Abatayo lex:>u$Abap_1 bw:+>u$Ab/NOUN+atayo/NSUFF_FEM_DU_GEN_POSS gloss:alloy 
_0.862001 diac:>u$Abatayo lex:>u$Abap_1 bw:+>u$Ab/NOUN+atayo/NSUFF_FEM_DU_ACC_POSS gloss:alloy 
_0.855251 diac:>u$Abatiy lex:>u$Abap_1 bw:+>u$Ab/NOUN+at/NSUFF_FEM_SG+iy/POSS_PRON_1S gloss:alloy 
_0.776236 diac:>u$Abatay~a lex:>u$Abap_1 bw:+>u$Ab/NOUN+atayo/NSUFF_FEM_DU_GEN_POSS+ya/POSS_PRON_1S gloss:alloy 
_0.776235 diac:>u$Abatay~a lex:>u$Abap_1 bw:+>u$Ab/NOUN+atayo/NSUFF_FEM_DU_ACC_POSS+ya/POSS_PRON_1S gloss:alloy 
--------------

Sample desired output:

A*AbthA
>a*Ab + at + hA
<i*Ab + at + a + hA
<i*Ab + at + i + hA
<i*Ab + at + u + hA
A$Abty
>u$Ab + atayo
>u$Ab + atayo
>u$Ab + at + iy
>u$Ab + atayo + ya
>u$Ab + atayo + ya

It would also be helpful to modify my code so that each word and its segmentations are printed on one line (when relevant), with "_+" instead of " + ".
Better output:

A*AbthA >a*Ab_+at_+hA <i*Ab_+at_+a_+hA <i*Ab_+at_+i_+hA <i*Ab_+at_+u_+hA
A$Abty >u$Ab_+atayo >u$Ab_+atayo >u$Ab_+at_+iy >u$Ab_+atayo_+ya >u$Ab_+atayo_+ya

For output 1:

awk '/;;WORD/ { print $2 }
/ lex:/ {
    sub(/.*:[+]/,"")
    gsub("/[^+]*[+]", " + ")
    sub("/[^+]*$","")
    print }' infile

For output 2:

awk '/;;WORD/ {printf "%s%s", t++?"\n":"", $2 }
/ lex:/ {
    sub(/.*:[+]/,"")
    gsub("/[^+]*[+]", "_+")
    sub("/[^+]*$","")
    printf "%s", OFS $0 }
END { printf "\n" }' infile

What if my input file is much larger than this two-word sample?
It has about 2 million ";;WORD" entries.

Thanks!

I can't see that being a problem: output is generated as the input is read, so no limits should be exceeded.

It will take longer to run and the output file will be bigger.

If the pattern always sits at the start of the line, anchor it with '^' to reduce scanning. You could also grep up front and pipe the result into awk so the work is divided between two processes.
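A minimal sketch of both suggestions, assuming ";;WORD" always starts in column 1 and every analysis line carries " bw:" as in the sample from post #1. The function name segment is made up for illustration, and the extra sub() that cuts at the first space is an addition, not part of the earlier answer.

```shell
# segment: hypothetical helper; prints one line per ;;WORD with its segmentations
segment() {
    # grep passes on only the ;;WORD lines and the lines that contain " bw:",
    # so awk never has to look at the rest of the file
    grep -E '^;;WORD | bw:' "$1" |
    awk '/^;;WORD / { printf "%s%s", (t++ ? "\n" : ""), $2; next }
         {
             sub(/^.*bw:[+]*/, "")      # drop everything up to bw: and any leading +
             sub(/ .*$/, "")            # drop gloss: and whatever follows the first space
             gsub("/[^+]*[+]", "_+")    # each /TAG+ becomes _+
             sub("/[^+]*$", "")         # drop the final /TAG
             printf " %s", $0
         }
         END { if (t) printf "\n" }'
}
```

Call it as: segment infile > outfile. Whether the grep stage actually saves time depends on the data, so it is worth timing both variants on a slice of the real file.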

The only issue is that when I ran it on a file with more than 10 ";;WORD" entries, I got the following output:

A*AbthA >a*Ab_+at_+hA <i*Ab_+at_+a_+hA <i*Ab_+at_+i_+hA <i*Ab_+at_+u_+hA
A$Abty >u$Ab_+atayo >u$Ab_+atayo >u$Ab_+at_+iy >u$Ab_+atayo_+ya >u$Ab_+atayo_+ya
A*AbwA >a*Ab_+uwA
$A$AbwyAs
A$Abyty
AAd
$A$Ad
A$Ad >a$Ad_+a _0.872887 diac:>u$Adu lex:>a$Ad_1 bw:>u_+$Ad_+u _0.851391 diac:>u$Ad~u lex:$Ad~_1 bw:>u_+$Ad~_+u _0.836867 diac:>u$Ada lex:>a$Ad_1 bw:>u_+$Ad_+a _0.815236 diac:>u$Ad~a lex:$Ad~_1 bw:>u_+$Ad~_+a _0.815182 diac:>u$Ad~a lex:$Ad~_1 bw:>u_+$Ad~_+a
A*Ad
A$AdA >a$Ad_+A
AADAfAt

As you can see, on the 4th line from the end I still get the whole text around "bw:", while I only want the token after bw:.
The first 3 lines of the output are precisely what I am looking for.
The lines that contain only one word are those where the input has no "bw:|gloss:" for that word.

If I understood your rather complex requirement correctly, this translation of your two awk commands from post #1 might do the job in one go, as requested:

awk     '/;;WORD/       {if (LINE) print LINE           # if LINE already filled (i.e. NOT the first occurrence)
                         LINE = $2}                     # on WORD occurrence start a new LINE
          /bw:/          {gsub (/.*bw:| .*$/, "")       # eliminate everything before "bw:" and everything after the first space (greedy regex)
                         gsub (/\/[^+]*(\+|$)/, "_+")   # process "/" and "+" terminated strings
                         gsub (/^\+|_\+ *$/, "")        # eliminate leading and trailing "+"s
                         LINE = LINE" "$0               # add to output LINE
                        }
         END            {print LINE}                    # print last line
        ' file
A*AbthA >a*Ab_+at_+hA <i*Ab_+at_+a_+hA <i*Ab_+at_+i_+hA <i*Ab_+at_+u_+hA
A$Abty >u$Ab_+atayo >u$Ab_+atayo >u$Ab_+at_+iy >u$Ab_+atayo_+ya >u$Ab_+atayo_+ya

I corrected the match on the bw: string (your test data didn't include lines that lack a + after bw:). This should handle the revised format:

awk '/;;WORD/ {printf "%s%s", t++?"\n":"", $2 }
/ lex:/ {
    sub(/^.*bw:[+]*/,"")
    gsub("/[^+]*[+]", "_+")
    sub("/[^+]*$","")
    printf "%s", OFS $0 }
END { printf "\n" }' infile
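To see why the first version leaked the whole line, here is a small sketch that runs both regexes on one analysis line. The line is hypothetical: it is reconstructed from the bad output above, and the tags after each "/" are placeholders, since the real input for that word was not posted. The helper name strip_bw is also made up.

```shell
# Hypothetical analysis line with no + right after bw: (tags are placeholders)
line='_0.872887 diac:>u$Adu lex:>a$Ad_1 bw:>u/TAG1+$Ad/TAG2+u/TAG3 gloss:placeholder '

# strip_bw: hypothetical helper; $1 is the regex used to cut off the text
# in front of the segmentation, the other two steps are as in the answer
strip_bw() {
    printf '%s\n' "$line" | awk -v re="$1" '{
        sub(re, "")
        gsub("/[^+]*[+]", "_+")
        sub("/[^+]*$", "")
        print }'
}

strip_bw '.*:[+]'        # first version: needs ":+", finds none, so the prefix stays
strip_bw '^.*bw:[+]*'    # corrected version: the + after bw: is optional
```

The first call keeps the score, diac:, lex: and bw: fields in front of the segmentation, and the second prints the segmentation alone.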