Eliminating words from a file through ngrams stored in another file

gimley · January 22, 2013, 11:39pm

Hello,
I have a large data file that contains a huge amount of garbage, i.e. words which do not exist in the language. An example will make this clear:

kpaware
nlupset
rrrbring

In other words, these strings are not valid English words and constitute garbage in the data.
I have identified such letter combinations (at least in the initial position) and have prepared a file of them which, for lack of a better term, I call bigrams and trigrams.
An example of such combos is given below:

nl
kp
rrr

Is there a script which could load the ngram file, check which words in the data file begin with one of these combos, and create two files: a clean file and an invalid file?
I am fully aware that this approach carries some risk, since two-letter combinations are involved, and a bigram such as

nl

could eliminate a word such as

nlong

Hence the request to store the rejected words in a separate invalid file for manual examination.
Many thanks in advance.

PikK45 · January 23, 2013, 12:15am

Maybe a one-liner would help!

egrep -v "^nl|^kp|^rrr" file > valid_file
egrep "^nl|^kp|^rrr" file > invalid_file

gimley · January 23, 2013, 12:58am

Many thanks, but the list is large, so the patterns would have to be read from a file rather than typed on the command line. I work under Windows, and egrep does not always give the expected results there.
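
One way to avoid egrep altogether might be a short Python script, which should behave the same under Windows. What follows is only a minimal sketch under some assumptions not stated in the thread: both files are plain UTF-8 text with one entry per line, matching is case-sensitive, and a word is rejected only when it begins with one of the listed combos. The function names and the file names in the usage note are made up for illustration.

```python
def load_prefixes(path):
    """Read one combo (bigram, trigram, ...) per line, skipping blank lines."""
    with open(path, encoding="utf-8") as fh:
        return tuple(line.strip() for line in fh if line.strip())


def split_words(words, prefixes):
    """Split words into (valid, invalid) by whether they start with a listed combo."""
    valid, invalid = [], []
    for raw in words:
        word = raw.strip()
        if not word:
            continue  # ignore empty lines in the data
        # str.startswith accepts a tuple and tests every prefix in it
        if word.startswith(prefixes):
            invalid.append(word)
        else:
            valid.append(word)
    return valid, invalid


def write_lines(path, lines):
    """Write one entry per line."""
    with open(path, "w", encoding="utf-8") as fh:
        for line in lines:
            fh.write(line + "\n")


def split_file(ngram_path, data_path, valid_path, invalid_path):
    """Load the combos, split the data file, and write the two output files."""
    prefixes = load_prefixes(ngram_path)
    with open(data_path, encoding="utf-8") as fh:
        valid, invalid = split_words(fh, prefixes)
    write_lines(valid_path, valid)
    write_lines(invalid_path, invalid)
```

Calling split_file("ngrams.txt", "data.txt", "valid.txt", "invalid.txt") would then produce the clean file and the invalid file. A word such as nlong still lands in the invalid file, so that file remains the place to check by hand for false positives.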