Adding tags in between sentences with awk

Hi,
I need an awk to modify the following file. It is 2-column tab-separated.

Hi PP
my VBD
name DT
is NN
. SENT

Her PP
name VBD
is DT
the NN
same WRT
. SENT

The desired output is:

<s>
Hi PP -
my VBD -
name DT -
is NN -
. SENT .
</s>
<s>
Her PP -
name VBD -
is DT -
the NN -
same WRT -
. SENT -
</s>

I tried the following awk command:

awk '{print $1 "\t" $2 "\t" "-"}'

but I cannot figure out how to include the <s> and </s> in between each sentence.

Any suggestions?

Your code snippet doesn't fit what we see in the desired output:

  • Where are the <TAB>s?
  • Why are there no hyphens in the second paragraph?
  • If there's a dot after the first SENT, why is there none after the second?

So we have no chance of inferring the task from the information given.

Please give us a precise specification of what you want to get done.

Hello owwow14,

Could you please try the following and let me know if it helps?

awk -vs="<s>" -vs1="</s>" 'function add_tags(A){if(A==1){$0=s ORS $0};if(A==2){$0=$0 ORS s1}}($0 ~ /^$/){next} (NR==1 || j==1){add_tags(1);j=0} ($0==". SENT"){add_tags(2);j=1} 1'  Input_file

Output will be as follows.

<s>
Hi PP
my VBD
name DT
is NN
. SENT
</s>
<s>
Her PP
name VBD
is DT
the NN
same WRT
. SENT
</s>

A non-one-liner form of the solution is as follows.

awk -vs="<s>" -vs1="</s>" '
function add_tags(A) {
    if (A == 1) { $0 = s ORS $0 }
    if (A == 2) { $0 = $0 ORS s1 }
}
($0 ~ /^$/)      { next }
(NR==1 || j==1)  { add_tags(1); j = 0 }
($0 == ". SENT") { add_tags(2); j = 1 }
1
' Input_file

Thanks,
R. Singh

Like RudiC says, there are inconsistencies in your specification.

To produce output like in the first half of your sample input/output, try:

awk '{$1=$1; print "<s>\n" $0 "\t.\n</s>"}' RS=  FS='\n' OFS='\t-\n' file

If it is like the lower half, try:

awk '{$1=$1; print "<s>\n" $0 "\n</s>"}' RS=  FS='\n' OFS='\n' file
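
The two commands above rely on awk's paragraph mode: an empty RS makes each block of lines delimited by empty lines a single record, FS='\n' turns each line of that block into a field, and $1=$1 forces awk to rebuild the record with OFS between the lines. A small sketch of the second command on made-up sample data (the temporary file and the `result` variable are only there for illustration):

```shell
# Hypothetical sample: two sentences separated by a truly empty line.
tmp=$(mktemp)
printf 'Hi PP\nmy VBD\n. SENT\n\nHer PP\nname VBD\n. SENT\n' > "$tmp"

# RS=      : paragraph mode, one record per blank-line-separated block
# FS='\n'  : every line of the block is one field
# $1=$1    : rebuild $0 so the fields are joined with OFS
result=$(awk '{$1=$1; print "<s>\n" $0 "\n</s>"}' RS= FS='\n' OFS='\n' "$tmp")
rm -f "$tmp"

printf '%s\n' "$result"
```

Note that this only splits the sentences if the separator lines are really empty; a line that holds a space or a carriage return does not end the record.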

Hi,
I updated the code snippet; I hope the desired output is clearer now.
@RavinderSingh13 your code gives me the following error:

awk: invalid -v option

@RudiC and @Scrutinizer I hope that the updated desired output answers some of your questions.
@Scrutinizer I tried your code too, but the output it gives me is not correct. Here is the example. As you can see, there are "-" in the blank lines, and the <s> and </s> envelop the entire text rather than each individual sentence.

<s>
Hi PP	-
my VBD	-
name DT	-
is NN	-
. SENT	-
 	-
Her PP	-
name VBD	-
is DT	-
the NN	-
same WRT	-
. SENT	.
</s>

Again, here would be the example of the desired output:

<s>
Hi PP	-
my VBD	-
name DT	-
is NN	-
. SENT	-
</s>
<s>
Her PP	-
name VBD	-
is DT	-
the NN	-
same WRT	-
. SENT	.
</s>
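
Regarding the "awk: invalid -v option" error reported above: one possible cause (an assumption, since the awk implementation in use was not stated) is that this awk does not accept an assignment glued directly onto -v. POSIX describes the option as "-v assignment", so writing it with a space is the portable form. A sketch of the same script written that way, run on made-up sample data:

```shell
# Hypothetical sample input with an empty line between the sentences.
tmp=$(mktemp)
printf 'Hi PP\nmy VBD\n. SENT\n\nHer PP\n. SENT\n' > "$tmp"

# Same logic as the earlier script, but with a space after each -v.
result=$(awk -v s="<s>" -v s1="</s>" '
  function add_tags(A) {
    if (A == 1) $0 = s ORS $0      # put the opening tag before the line
    if (A == 2) $0 = $0 ORS s1     # put the closing tag after the line
  }
  /^$/              { next }                   # skip empty lines
  NR == 1 || j == 1 { add_tags(1); j = 0 }     # first line of a sentence
  $0 == ". SENT"    { add_tags(2); j = 1 }     # last line of a sentence
  1                                            # print the (modified) line
' "$tmp")
rm -f "$tmp"

printf '%s\n' "$result"
```

This variant still only adds the tags; it does not append the third column.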

Hi, that shows that the empty lines in the input file contain some characters. Try this instead:

awk '!NF{$0=x}1' file |  awk '{$1=$1; print "<s>\n" $0 "\t.\n</s>"}' RS=  FS='\n' OFS='\t-\n'

There is still some ambiguity. In the first half there is a trailing dot, in the second half there is a trailing dash.
Also, your samples appear not to be TAB-delimited, contrary to what you say in the description.

Looks like the input file sample has DOS <CR> char line terminators...
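
If the file does have DOS line endings, each "empty" line would still hold a lone carriage return, which would explain why paragraph mode treated the whole text as one record. A minimal single-pass sketch under these assumptions: every token line gets a trailing TAB and hyphen (the samples disagree on where a dot belongs, so none is added), and a sentence ends at a blank line or at end of input. The sample data and the variable name `inside` are made up for illustration:

```shell
# Hypothetical sample with CRLF line endings.
tmp=$(mktemp)
printf 'Hi PP\r\nmy VBD\r\n. SENT\r\n\r\nHer PP\r\n. SENT\r\n' > "$tmp"

result=$(awk '
  { sub(/\r$/, "") }               # drop a trailing carriage return, if any
  !NF {                            # blank line: close the open sentence
    if (inside) { print "</s>"; inside = 0 }
    next
  }
  !inside { print "<s>"; inside = 1 }   # first token line opens a sentence
  { print $0 "\t-" }                    # append the third column
  END { if (inside) print "</s>" }      # close the last sentence
' "$tmp")
rm -f "$tmp"

printf '%s\n' "$result"
```

Because the carriage return is removed before the blank-line test, the same command should also work on files with plain Unix line endings.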