SCRIPT TO TRAP ILLEGAL COMBOS

gimley · March 8, 2011, 5:15am

Hello,
I am trying to identify names which are "illegal" in the sense that they do not comply with the spelling norms of a culture. I have written NGrams for the initial and final combos which are illegal. These lists are stored in 2 files named Initial and Final. Here are a few examples:
Initial:
bb
bc
bd
bbb
bbc

Final:
bx
bbx

I want to run these against a file containing a large amount of data, and identify and store the words which are "illegal".
Examples of illegal names:
Initial
bbarry
bbclaude

Final
robx
hirambbx

Of course an add-on would be that if the correct name was found in the input file, the "illegal" output would be shown as:
Initial
b+barry
bb+claude

Final
rob+x
hiram+bbx

This assumes that claude, barry, rob and hiram are part of the input file.

The input file of names would be very large. So a large array would be needed.

Could anyone help me with a Perl or an Awk script to do the job? The ones I wrote are so bad they are just not worth displaying.

Many thanks in advance for any help,

Gimley

jim_mcnamara · March 8, 2011, 8:28am

first off:

Initial:
bb
bc
bd
bbb
bbc

Final:
bx
bbx

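If you find bx at the end of a word, you have by default also found bbx, since every word ending in bbx also ends in bx. Likewise, any word starting with bbb or bbc also starts with bb.
Revised list: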

Initial:
bb
bc
bd

Final:
bx


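awk '{ for (i = 1; i <= NF; i++)               # check every word on the line
         if ($i ~ /^(bb|bc|bd)/) print $i      # ^ anchors the combo to the start of the word
     }' names.txt

Here names.txt stands for your file of names. Use the same logic for the Final list, anchoring the pattern at the end of the word instead, e.g. /bx$/.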

gimley · March 8, 2011, 10:33am

Hi,
Many thanks for the answer. That would work if there were only a few NGrams. How do I load the NGrams from a file?
Sorry for the hassle and many thanks in advance.
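
One possible way to load the NGrams from the two files instead of hard-coding them is sketched below. It is only a sketch under several assumptions: the lists are in files named Initial and Final with one NGram per line, the names are in a file called names.txt (a made-up name) with one lowercase name per line, and the results go to illegal.txt (also a made-up name). The sample files are built from the examples in this thread so that the script can be run as is. All names are held in memory, so a very large input needs enough RAM for one array entry per name.

```shell
#!/bin/sh
# Sample data copied from the thread; replace these with the real files.
printf '%s\n' bb bc bd bbb bbc > Initial
printf '%s\n' bx bbx > Final
printf '%s\n' barry claude rob hiram bbarry bbclaude robx hirambbx > names.txt

awk -v initial=Initial -v final=Final '
BEGIN {
    # Load each NGram list into an array and remember the longest NGram.
    while ((getline line < initial) > 0)
        if (line != "") { ini[line] = 1; if (length(line) > imax) imax = length(line) }
    while ((getline line < final) > 0)
        if (line != "") { fin[line] = 1; if (length(line) > fmax) fmax = length(line) }
}
# Main input: keep every name in order, plus a lookup table of known names.
NF { name[++n] = $1; known[$1] = 1 }
END {
    print "Initial"
    for (i = 1; i <= n; i++) {
        w = name[i]; len = length(w); hit = 0
        # Illegal if any leading substring is in the Initial list.
        for (k = 1; k <= imax && k <= len; k++)
            if (substr(w, 1, k) in ini) { hit = 1; break }
        if (!hit) continue
        out = w
        # Add-on: split off the longest known name found at the end.
        for (k = 1; k < len; k++)
            if (substr(w, k + 1) in known) { out = substr(w, 1, k) "+" substr(w, k + 1); break }
        print out
    }
    print ""
    print "Final"
    for (i = 1; i <= n; i++) {
        w = name[i]; len = length(w); hit = 0
        # Illegal if any trailing substring is in the Final list.
        for (k = 1; k <= fmax && k <= len; k++)
            if (substr(w, len - k + 1) in fin) { hit = 1; break }
        if (!hit) continue
        out = w
        # Add-on: split off the longest known name found at the start.
        for (k = len - 1; k >= 1; k--)
            if (substr(w, 1, k) in known) { out = substr(w, 1, k) "+" substr(w, k + 1); break }
        print out
    }
}' names.txt > illegal.txt

cat illegal.txt
```

Because each check is an array lookup on a substring rather than a long regular expression, the number of NGrams can grow without the script changing. A word with no known name inside it is printed unsplit.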