How to combine awk commands?

Viernes · February 20, 2013, 5:57pm

I can achieve two tasks with 2 different awk commands:
1)

awk -F";;WORD" '{print $2}' file | sed '/^$/d' #to find surface_word

2)

awk -F"bw:|gloss:" '// {print $2}'  file | sed '/\//!d; s:/[^+]*+*: + :g; s:^+::; s: *+ *$::;'  #to find segmentation of surface_word

Command 1) finds surface_word number x. I then expect 2) to find the multiple segmentations that appear after surface_word x and before surface_word x+1.

Example

;;; SENTENCE A*AbthA
;;WORD A*AbthA
;;MADA: A*AbthA asp:p cas:na enc0:3fs_dobj gen:f mod:i num:s per:3 pos:verb prc0:0 prc1:0 prc2:0 prc3:0 stt:na vox:a
*0.887822 diac:>a*AbatohA lex:>a*Ab_1 bw:+>a*Ab/PV+at/PVSUFF_SUBJ:3FS+hA/PVSUFF_DO:3FS gloss:dissolve;melt;exhaust;consume 
_0.712209 diac:<i*AbatahA lex:<i*Abap_1 bw:+<i*Ab/NOUN+at/NSUFF_FEM_SG+a/CASE_DEF_ACC+hA/POSS_PRON_3FS gloss:dissolution 
_0.691945 diac:<i*AbatihA lex:<i*Abap_1 bw:+<i*Ab/NOUN+at/NSUFF_FEM_SG+i/CASE_DEF_GEN+hA/POSS_PRON_3FS gloss:dissolution 
_0.691778 diac:<i*AbatuhA lex:<i*Abap_1 bw:+<i*Ab/NOUN+at/NSUFF_FEM_SG+u/CASE_DEF_NOM+hA/POSS_PRON_3FS gloss:dissolution 
--------------
SENTENCE BREAK
--------------
;;; SENTENCE A$Abty
;;WORD A$Abty
;;MADA: A$Abty asp:na cas:u enc0:0 gen:f mod:na num:s per:na pos:noun prc0:0 prc1:0 prc2:0 prc3:0 stt:c vox:na
*0.862011 diac:>u$Abatayo lex:>u$Abap_1 bw:+>u$Ab/NOUN+atayo/NSUFF_FEM_DU_GEN_POSS gloss:alloy 
_0.862001 diac:>u$Abatayo lex:>u$Abap_1 bw:+>u$Ab/NOUN+atayo/NSUFF_FEM_DU_ACC_POSS gloss:alloy 
_0.855251 diac:>u$Abatiy lex:>u$Abap_1 bw:+>u$Ab/NOUN+at/NSUFF_FEM_SG+iy/POSS_PRON_1S gloss:alloy 
_0.776236 diac:>u$Abatay~a lex:>u$Abap_1 bw:+>u$Ab/NOUN+atayo/NSUFF_FEM_DU_GEN_POSS+ya/POSS_PRON_1S gloss:alloy 
_0.776235 diac:>u$Abatay~a lex:>u$Abap_1 bw:+>u$Ab/NOUN+atayo/NSUFF_FEM_DU_ACC_POSS+ya/POSS_PRON_1S gloss:alloy 
--------------

Sample desired output:

A*AbthA
>a*Ab + at + hA
<i*Ab + at + a + hA
<i*Ab + at + i + hA
<i*Ab + at + u + hA
A$Abty
>u$Ab + atayo
>u$Ab + atayo
>u$Ab + at + iy
>u$Ab + atayo + ya
>u$Ab + atayo + ya

It would also be helpful to modify my code so that each word and its segmentations are printed on one line (where relevant), with "_+" instead of " + " as the separator.
Better output:

A*AbthA >a*Ab_+at_+hA <i*Ab_+at_+a_+hA <i*Ab_+at_+i_+hA <i*Ab_+at_+u_+hA
A$Abty >u$Ab_+atayo >u$Ab_+atayo >u$Ab_+at_+iy >u$Ab_+atayo_+ya >u$Ab_+atayo_+ya

Chubler_XL · February 20, 2013, 7:09pm

For output1:

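awk '/;;WORD/ { print $2 }        # surface word on its own line
/ lex:/ {
    sub(/.*:[+]/,"")              # drop everything through the last ":+" (the "bw:+" prefix)
    gsub("/[^+]*[+]", " + ")      # replace each "/TAG+" with " + "
    sub("/[^+]*$","")             # drop the final "/TAG" and the trailing gloss
    print }' infile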

For output 2:

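awk '/;;WORD/ {printf "%s%s", t++?"\n":"", $2 }   # newline before every word except the first
/ lex:/ {
    sub(/.*:[+]/,"")              # drop everything through the last ":+" (the "bw:+" prefix)
    gsub("/[^+]*[+]", "_+")       # replace each "/TAG+" with "_+"
    sub("/[^+]*$","")             # drop the final "/TAG" and the trailing gloss
    printf "%s", OFS $0 }         # append the segmentation to the current line
END { printf "\n" }' infile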

Viernes · February 21, 2013, 6:52am

What if my input file is much larger than this two-word sample?
It contains about 2 million ";;WORD" lines.

Thanks!

Chubler_XL · February 21, 2013, 3:42pm

I can't see that being a problem: output is generated as the input is read, so no limits should be exceeded.

It will take longer to run and the output file will be bigger.

DGPickett · February 21, 2013, 4:07pm

If the pattern always sits at the start of the line, anchor it with '^' to reduce scanning. You could also grep up front and pipe into awk so the work is divided between two processes.
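
A minimal sketch of both suggestions, not tested on the real 2-million-word file. It assumes that ";;WORD" always starts in column 1 and that every analysis line starts with "*" or "_" followed by a digit, as in the sample. The function name segment_words is made up for illustration, and the "[+]*" after "bw:" is an extra assumption that the "+" may be missing. Whether the grep stage actually saves time is worth measuring on your own data.

```shell
# Hypothetical helper: prints one line per ;;WORD followed by its segmentations.
# Reads the files given as arguments, or stdin when there are none.
segment_words() {
    # Pre-filter: keep only ;;WORD lines and scored analysis lines.
    grep -E '^(;;WORD |[*_][0-9])' "$@" |
    awk '/^;;WORD / { printf "%s%s", (n++ ? "\n" : ""), $2; next }
         / bw:/ {
             sub(/^.* bw:[+]*/, "")    # drop everything through "bw:" and any leading "+"
             gsub("/[^+]*[+]", "_+")   # replace each "/TAG+" with "_+"
             sub("/[^+]*$", "")        # drop the final "/TAG" and the trailing gloss
             printf " %s", $0
         }
         END { if (n) printf "\n" }'
}
```

It would be called as: segment_words infile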

Viernes · February 23, 2013, 9:56am

The only issue is that when I ran it on a file with more than 10 ";;WORD" entries, I got the following output:

A*AbthA >a*Ab_+at_+hA <i*Ab_+at_+a_+hA <i*Ab_+at_+i_+hA <i*Ab_+at_+u_+hA
A$Abty >u$Ab_+atayo >u$Ab_+atayo >u$Ab_+at_+iy >u$Ab_+atayo_+ya >u$Ab_+atayo_+ya
A*AbwA >a*Ab_+uwA
$A$AbwyAs
A$Abyty
AAd
$A$Ad
A$Ad >a$Ad_+a _0.872887 diac:>u$Adu lex:>a$Ad_1 bw:>u_+$Ad_+u _0.851391 diac:>u$Ad~u lex:$Ad~_1 bw:>u_+$Ad~_+u _0.836867 diac:>u$Ada lex:>a$Ad_1 bw:>u_+$Ad_+a _0.815236 diac:>u$Ad~a lex:$Ad~_1 bw:>u_+$Ad~_+a _0.815182 diac:>u$Ad~a lex:$Ad~_1 bw:>u_+$Ad~_+a
A*Ad
A$AdA >a$Ad_+A
AADAfAt

As you can see, the fourth line from the end still contains the full analysis text (score, diac:, lex:, bw:), whereas I only want the token after bw:.
The first 3 lines of the output are precisely what I am looking for.
The lines that contain only one word are those where the input has no "bw:|gloss:".

RudiC · February 24, 2013, 5:05am

If I got your quite complex requirement correctly, translating your two awk commands in post #1, this might do the job in one go as requested:

awk     '/;;WORD/       {if (LINE) print LINE           # if LINE already filled (i.e. NOT the first occurrence)
                         LINE = $2}                     # on WORD occurrence start a new LINE
          /bw:/          {gsub (/.*bw:| .*$/, "")       # eliminate everything before "bw:" and everything after the first space (greedy regex)
                         gsub (/\/[^+]*(\+|$)/, "_+")   # process "/" and "+" terminated strings
                         gsub (/^\+|_\+ *$/, "")        # eliminate leading and trailing "+"s
                         LINE = LINE" "$0               # add to output LINE
                        }
         END            {print LINE}                    # print last line
        ' file
A*AbthA >a*Ab_+at_+hA <i*Ab_+at_+a_+hA <i*Ab_+at_+i_+hA <i*Ab_+at_+u_+hA
A$Abty >u$Ab_+atayo >u$Ab_+atayo >u$Ab_+at_+iy >u$Ab_+atayo_+ya >u$Ab_+atayo_+ya

Chubler_XL · February 24, 2013, 5:45pm

Corrected the match on the bw: string (your test data didn't include lines that lack a + after bw:). This should handle the revised format:

awk '/;;WORD/ {printf "%s%s", t++?"\n":"", $2 }
/ lex:/ {
    sub(/^.*bw:[+]*/,"")          # anchor on "bw:"; the "+" after it is now optional
    gsub("/[^+]*[+]", "_+")
    sub("/[^+]*$","")
    printf "%s", OFS $0 }
END { printf "\n" }' infile
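
To see why the first substitution had to change, here is a small side-by-side sketch. The input line is hypothetical: the stems are taken from the output posted above, but the tags (TAG1, TAG2, TAG3) and the gloss are placeholders, not real MADA data.

```shell
# Hypothetical analysis line with no "+" directly after "bw:".
line='_0.872887 diac:>u$Adu lex:>a$Ad_1 bw:>u/TAG1+$Ad/TAG2+u/TAG3 gloss:placeholder'

# Original first step: needs a literal ":+", which this line does not contain,
# so nothing is stripped from the front of the line.
old=$(printf '%s\n' "$line" | awk '{
    sub(/.*:[+]/, "")
    gsub("/[^+]*[+]", "_+")
    sub("/[^+]*$", "")
    print }')

# Revised first step: anchors on "bw:" and treats the "+" as optional.
new=$(printf '%s\n' "$line" | awk '{
    sub(/^.*bw:[+]*/, "")
    gsub("/[^+]*[+]", "_+")
    sub("/[^+]*$", "")
    print }')

printf 'old: %s\nnew: %s\n' "$old" "$new"
```

With the original pattern the score, diac: and lex: fields stay in the result, which is the shape of the bad line reported earlier; with the revised pattern only the segmentation remains.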