Contextual search and replace in a tagged file

gimley · November 3, 2014, 11:39pm

Dear all,
I have a large tagged training file in Hindi for Parts of Speech. When I tagged the file, I inadvertently classified Pronouns and Adjectives as one single category. This has resulted in ambiguity.
An example from English will make this clear.

This is his.
This is his book.

The tagged file gives the following output

This_DT is_VBZ his_PRP
This_DT is_VBZ his_PRP book_NN

In both cases, "his" is tagged as _PRP.

Since the data is voluminous (800,000 tags), I would like to make a conditional contextual search. Luckily, in Hindi a _PRP immediately followed by an _NN tag is always an adjective; if it is not followed by _NN, it is a pronoun.
What I am looking for is an awk or perl script which can do the job. This would mean the following steps.

1. Finding the tag _PRP
2. Looking for the next Tag
3. If the next Tag is _NN, replace _PRP by _ADJP, otherwise do nothing.

The structure of the tagged data is as under:

WORD followed by _ followed by TAG, followed by <SPACE>, followed by WORD followed by _ followed by TAG

I am giving below a sample set in Hindi for testing

1a. _PRP _NN 
1b. _DMD _PRP _VM 
2a._NN _PRP _NN _VM 
2b._NN _PRP _VM

Since in 1a. and 2a. the _PRP is followed by an _NN, it should be replaced by an _ADJP tag.
Since in 1b. and 2b. the condition does not hold, the _PRP tag should be maintained as is.
I have never attempted a contextual search of this type in AWK or PERL and all my attempts have resulted in disaster. Please help out, since the data is voluminous and cannot be retagged manually.
I work in a Windows environment.

RudiC · November 4, 2014, 5:05am

Would this

awk '{for (i=1; i<NF; i++) if ($i ~ /_PRP$/ && $(i+1) ~ /_NN$/) sub (/PRP$/,"ADJP", $i)} 1' file
1a. _ADJP _NN
1b. _DMD _PRP _VM 
2a._NN _ADJP _NN _VM
2b._NN _PRP _VM

do what you need?
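
Since you mention working on Windows: the single quotes around the one-liner above are Unix shell quoting, and the Windows command prompt may not handle them the same way. One way around this, as a rough sketch, is to put the program into a file and run it with awk -f. The file names retag.awk, sample.txt and retagged.txt below are made up for illustration, and the sample input is the English example from the question. The heredoc is only a Unix shell convenience for creating the file; on Windows you would create retag.awk in a text editor.

```shell
# retag.awk is a hypothetical file name; the rule is the same as in the one-liner.
cat > retag.awk <<'EOF'
# For every field except the last: if it ends in _PRP and the next
# field ends in _NN, rewrite its tag to _ADJP.
{
    for (i = 1; i < NF; i++)
        if ($i ~ /_PRP$/ && $(i + 1) ~ /_NN$/)
            sub(/_PRP$/, "_ADJP", $i)
}
1   # print every line, changed or not
EOF

# sample.txt is made-up test input taken from the English example.
printf '%s\n' 'This_DT is_VBZ his_PRP' 'This_DT is_VBZ his_PRP book_NN' > sample.txt

# Write the result to a new file so the original stays untouched.
awk -f retag.awk sample.txt > retagged.txt
cat retagged.txt
```

Only the second line changes (his_PRP becomes his_ADJP before book_NN). Note that awk rebuilds a modified line with single spaces between fields, so any unusual spacing on changed lines is normalised.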

gimley · November 4, 2014, 9:07am

Many thanks. It really meets my needs. I have studied the script and can see how I can do a contextual Search and Replace using AWK.
