Hello,
I have a large data file that contains a great deal of garbage, i.e. words which do not exist in the language. A few examples will make this clear:
kpaware
nlupset
rrrbring
In other words, these words are invalid in English and constitute garbage in the data.
I have identified such letter combinations (at least in word-initial position) and have prepared a file of them which, for lack of a better term, I call bigrams and trigrams.
Examples of such combinations are given below:
nl
kp
rrr
Is there a script which could load the ngram file, check the data file for words that begin with one of these combinations, and create two files: a clean file and an invalid file?
I am fully aware that this approach carries a certain amount of risk, since combinations as short as two letters are involved, and it could be that a bigram such as
nl
could also eliminate a legitimate word such as
nlong
Hence the request to store the rejected words in a separate invalid file for manual examination.
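A rough sketch of the behaviour described above, in Python, may make the request more concrete. It assumes one word per line in the data file and one combination per line in the ngram file, and it matches only at the start of a word, ignoring case. The function names (load_prefixes, split_words, filter_file) are placeholders invented for illustration, not part of any existing tool.

```python
def load_prefixes(path):
    """Read one letter combination per line, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return tuple(line.strip().lower() for line in f if line.strip())


def split_words(words, prefixes):
    """Return (clean, invalid); a word is invalid if it starts with any prefix."""
    prefixes = tuple(prefixes)  # str.startswith accepts a tuple of candidates
    clean, invalid = [], []
    for word in words:
        w = word.strip()
        if not w:
            continue  # ignore empty lines in the data
        if w.lower().startswith(prefixes):
            invalid.append(w)  # kept for manual examination, not discarded
        else:
            clean.append(w)
    return clean, invalid


def filter_file(words_path, ngrams_path, clean_path, invalid_path):
    """Split the data file into a clean file and an invalid file."""
    prefixes = load_prefixes(ngrams_path)
    with open(words_path, encoding="utf-8") as f:
        clean, invalid = split_words(f, prefixes)
    with open(clean_path, "w", encoding="utf-8") as f:
        f.writelines(w + "\n" for w in clean)
    with open(invalid_path, "w", encoding="utf-8") as f:
        f.writelines(w + "\n" for w in invalid)
    return len(clean), len(invalid)
```

With hypothetical file names, a call such as filter_file("words.txt", "ngrams.txt", "clean.txt", "invalid.txt") would write the two output files and return the two counts. A word like nlong would land in the invalid file rather than being deleted, so it can still be rescued by hand.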
Many thanks in advance.