Dear all,
I have a large tagged training file in Hindi for Parts of Speech. When I tagged the file, I inadvertently classified Pronouns and Adjectives as one single category. This has resulted in ambiguity.
An example from English will make this clear.
This is his.
This is his book.
The tagged file gives the following output:
This_DT is_VBZ his_PRP
This_DT is_VBZ his_PRP book_NN
"his" is tagged in both cases as _PRP.
Since the data is voluminous (800,000 tags), I would like to make a conditional contextual search. Luckily, in Hindi a _PRP followed by a tag _NN will always be an adjective; if it is not followed by _NN, it will be a pronoun.
What I am looking for is an awk or Perl script that can do the job. This would involve the following steps:
1. Finding the tag _PRP
2. Looking for the next Tag
3. If the next Tag is _NN, replace _PRP by _ADJP, otherwise do nothing.
The structure of the tagged data is as under:
WORD followed by _ followed by TAG, followed by <SPACE>, followed by WORD followed by _ followed by TAG
I am giving below a sample set in Hindi for testing
1a. _PRP _NN
1b. _DMD _PRP _VM
2a. _NN _PRP _NN _VM
2b. _NN _PRP _VM
Since in 1a. and 2a. _PRP is followed by a _NN, it should be replaced by an _ADJP tag.
Since in 1b. and 2b. the condition does not hold, the _PRP tag should be maintained as such.
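To make the intended rule concrete, here is a minimal sketch of the transformation. It is written in Python rather than awk or Perl, purely as an illustration, and is not a finished solution. The function name retag_line is hypothetical. The sketch assumes that tokens are separated by spaces or tabs, that the tag is whatever follows the last underscore of a token, and that the rule never crosses a line break. It uses a regular-expression lookahead, so the following token is inspected but not consumed; Perl's regular expressions support the same lookahead syntax.

```python
import re

# Assumption: each token is WORD_TAG and tokens are separated by spaces or tabs.
# Match a "_PRP" tag only when the very next token on the line ends in "_NN".
# (?=...) is a lookahead: the next token is checked but left untouched.
# (?!\S) makes sure the next tag is exactly _NN, not a longer tag such as _NNP.
PRP_BEFORE_NN = re.compile(r"_PRP(?=[ \t]+\S*_NN(?!\S))")


def retag_line(line: str) -> str:
    """Replace _PRP by _ADJP wherever the next tag on the line is _NN."""
    return PRP_BEFORE_NN.sub("_ADJP", line)


# The sample lines from this post, plus the English example.
samples = [
    "_PRP _NN",
    "_DMD _PRP _VM",
    "_NN _PRP _NN _VM",
    "_NN _PRP _VM",
    "This_DT is_VBZ his_PRP",
    "This_DT is_VBZ his_PRP book_NN",
]

for sample in samples:
    print(retag_line(sample))
```

With this rule, 1a. and 2a. come out as _ADJP _NN and _NN _ADJP _NN _VM, while 1b. and 2b. are left unchanged.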
I have never attempted a contextual search of this type in AWK or PERL and all my attempts have resulted in disaster. Please help out, since the data is voluminous and cannot be retagged manually.
I work in a Windows environment.