Reducing multiple entries in a tri-lingual dictionary to single entries

Dear all,
I am editing a tri-lingual dictionary, to be released as open source, which has the following tab-separated data structure:

English headwords<Tab>Devanagari headwords<Tab>Perso-Arabic headwords

as in the example below

to mark, to number		()

The English headword field sometimes contains more than one headword, separated by commas or semicolons, as in the example above and the one below:

number, numeral, digit; limb; component; tear in cloth		()

To edit the dictionary, I need to reduce the multiple English headwords to single entries, each carrying the same Devanagari and Perso-Arabic fields. The data in the example above would thus be reduced to the following entries:

number		()
numeral		()
digit		()
limb		()
component		()
tear in cloth		()

I could handle two mappings, but I do not know how to handle such a complicated data structure in Perl or Awk. Any help would be gratefully acknowledged.
I should add that I work under Windows Vista and Windows 7, so unfortunately Linux-only solutions do not help.

I am providing a sample for testing below:

suddenly,unexpectedly		()
surprise,wonder		()
to wonder	 �  	( )
to get surprised	 �  	( )
to wonder	 �  	( )
to be surprised	 �   	(  )
unfailing,unerring;sure		()
inanimate		()
unconsciousness,senselessness		()
unconscious,senseless		()
a flood of water;a body of clouds;a large desolated plain		()
to be flooded due to heavy rain	 �  	( )
the hair to turn grey,to become old		()
whiteness,clearness		()
untouchable		()
untouched;unpolluted		()
whitish		()
white;clean		()
neat and clean	 �  	( )
to age,to become old,to turn into grey hair 	 �   	(  )
to disgrace,to make ashamed	 �   	(  )
to be disgraced,to do a shameful act	 �   	(  )
to respect the elderliness,to have regard for an old person	 �   	(  )
to be spoilt in the old age	 �   	(  )
to be exposed,truth to be known	 �    	(   )
to turn into grey hair ,to become old	 �  	( )
to gain without much effort	 �    	(   )
to do shameful act in the old age	 �  	( )
to enter a false amount in the account	 �    	(   )
python,dragon		()
stranger,unknown person		()
wonderful,surprising		()
wonder,astonishment		()
to be surprised	 �  	( )
to wonder	 �  	( )
not liable to decay or old age		()
to live forever,to be immortal	 �    	(  )
death,the appointed hour of death		()
disgrace,infamy,dishonor		()
museum		( )
unnecessary,useless		()
a kind of fancy coloured sheet or shawl worn over shoulder		()
strange,wonderful,surprising		()
very strange,awkward	 	(  )
wonder		()
unsuitability		()
improper,unsuitable		()
unknown,unacquainted,ignorant		()
today		()
to complete a work in time	 �       	(      )

Hi, try:

awk '{n=split($1,F,/[,;]/); for(i=1; i<=n; i++) print F[i],$2,$3}' FS='\t' OFS='\t' file

--edit--
This will work on Linux / Unix. I just noticed that it needs to work under Windows.

I can't help you much there. I know there can be quoting issues on the command line, and maybe CR/LF-related issues too.

Perhaps you could put the script in a file and execute that:

keyword_split.awk:

BEGIN {
  FS=OFS="\t"
}
{
  n=split($1,F,/[,;]/)
  for(i=1; i<=n; i++) print F[i],$2,$3
}

And execute with

awk -f keyword_split.awk file

Or use Cygwin or some other simulation...
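If you do go the Cygwin route, here is a rough sketch of a slightly more defensive variant of the same idea. It assumes two things that the script above does not handle: that the input may have Windows CR/LF line endings, and that stray spaces around the separators (as in "number, numeral") should be dropped. The file names keyword_split_trim.awk, sample.txt and result.txt, and the placeholder values DEV, URDU, D2 and U2, are made up for illustration only.

```shell
# Write the awk program to a file so that no command-line quoting is involved.
cat > keyword_split_trim.awk <<'EOF'
BEGIN { FS = OFS = "\t" }
{
  sub(/\r$/, "")                       # drop a trailing CR left by CR/LF files
  n = split($1, F, /[ ]*[,;][ ]*/)     # split on , or ; and eat spaces around them
  for (i = 1; i <= n; i++) {
    gsub(/^[ ]+|[ ]+$/, "", F[i])      # trim leftover leading/trailing spaces
    if (F[i] != "") print F[i], $2, $3 # skip empty pieces, e.g. after a trailing comma
  }
}
EOF

# Hypothetical two-line sample with CR/LF endings; DEV/URDU stand in for the real scripts.
printf 'number, numeral; limb\tDEV\tURDU\r\nwhite;clean \tD2\tU2\r\n' > sample.txt

awk -f keyword_split_trim.awk sample.txt > result.txt
cat result.txt
```

With that sample, each English headword should end up on its own line with the other two fields copied unchanged. The program file itself contains nothing shell-specific, so the same awk -f call should also work with a native Windows awk, though I have not tried that.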

Many thanks. It worked perfectly, and the tri-lingual dictionary was generated very well. Thanks also for noting that the delimiter could be a semicolon as well; I had missed that.