I am trying to count the occurrences of ALL words in a file. However, I want to exclude certain words: short words (i.e. fewer than 3 characters) and words contained in a blacklist file. I would also like to count capitalized words (e.g. proper names). I am not 100% sure where the line on capitalization is; i.e. do we count the first word of a sentence differently? What about a word that would be capitalized in the middle of a sentence, e.g. a name? The other parts are more important, but any input on this would be appreciated.
I have put together a command to do the word counting in the file (I borrowed code that I found here in other postings). It lives in a script and takes the filename as a command-line argument:
tr -cs "[:alpha:]'" "\n" < "$1" | sort | uniq -c | sort -rn > w_counts.txt
In the tr command, I have put an apostrophe in the match set so that it doesn't break up contractions (e.g. "doesn't"). The output of tr is a newline-separated list of words, which is then sorted so that uniq will count correctly. That result is reverse-sorted numerically (we want to know about the highest-occurring words) and written to the text file. (This will eventually be imported back into a database.)
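For the short-word and blacklist exclusions, one approach I have been sketching is to filter between the tr and sort stages: awk drops words shorter than 3 characters, and grep -v with -x/-F/-f drops anything listed in the blacklist. Note that blacklist.txt is just a stand-in name (one word per line), and the sample input here is only to exercise the pipeline; the real script would read "$1" instead:

```shell
# Sample data purely for illustration; the real script reads "$1",
# and blacklist.txt is a hypothetical one-word-per-line exclusion file.
printf "the cat sat on the mat\nDon't stop the cat\n" > sample.txt
printf "the\nmat\n" > blacklist.txt

tr -cs "[:alpha:]'" "\n" < sample.txt \
  | awk 'length($0) >= 3' \
  | grep -v -i -x -F -f blacklist.txt \
  | sort | uniq -c | sort -rn > w_counts.txt
cat w_counts.txt
```

The -x flag makes grep match whole lines only, so "the" in the blacklist doesn't also knock out "theory", and -F treats the blacklist entries as fixed strings rather than regexes. Those flags exist in both BSD grep (OS X) and GNU grep (Cygwin), so it should stay portable.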
This runs in about 0.5 seconds on a 4000+ word file. I am pretty happy with that.
Any comments or suggestions about excluding short words or words from a blacklist file, or even counting capitalized words, would be appreciated.
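One crude idea I have toyed with for the capitalization side: run the same pipeline and then keep only counted words that start with an uppercase letter. It doesn't resolve the sentence-start question (a sentence-initial "The" still slips through, as the sample below shows), but it might be a starting point. Again, the sample text is only for illustration:

```shell
# Keep only counted words beginning with an uppercase letter.
# Caveat: sentence-initial words like "The" are still included.
printf "Alice met bob. Bob met Alice. The dog ran.\n" > sample2.txt
tr -cs "[:alpha:]'" "\n" < sample2.txt \
  | awk 'length($0) >= 3' \
  | sort | uniq -c | sort -rn \
  | awk '$2 ~ /^[A-Z]/' > cap_counts.txt
cat cap_counts.txt
```

A smarter version might compare case-folded counts (e.g. keep only words that are capitalized every time they appear), which would filter out most sentence-initial words, but I haven't worked that out yet.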
I am working on Mac OS X 10.6.8, but would hope to get a solution that will work under a Windows Unix-like shell (e.g. Cygwin).
Thanks,
J