Script to count word occurrences, but exclude some?

I am trying to count the occurrences of ALL words in a file. However, I want to exclude certain words: short words (i.e. <3 chars), and words contained in a blacklist file. There is also a desire to count words that are capitalized (e.g. proper names). I am not 100% sure where the line on capitalization is; i.e. do we count the first word of a sentence differently? What if it is a word that would be capitalized in the middle of a sentence, e.g. a name? So working on the other parts is more important, but any other input would be appreciated.

I have put together a command to do the word counting in the file (I borrowed code that I found here in other postings). It lives in a script and takes the filename as a command line argument:

tr -cs "[:alpha:]'" "\n" < $1 | sort | uniq -c | sort -rn >w_counts.txt

In the tr command, I have included an apostrophe in the match set so that it doesn't break up contractions (e.g. "doesn't"). The output of tr is a newline-separated list of words that is fed into the rest of the pipeline, where it gets sorted so that 'uniq' will count correctly. The result is then reverse sorted (we want to know about the highest occurring words) and written to the text file. (This will eventually be imported back into a database.)
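To make the pipeline concrete, here is roughly what it produces on a tiny made-up file (sample.txt is just an example name; the exact spacing of the counts and the order of the one-count words will vary by system):

$ cat sample.txt
the cat sat on the mat
the cat didn't move
$ tr -cs "[:alpha:]'" "\n" < sample.txt | sort | uniq -c | sort -rn
   3 the
   2 cat
   1 sat
   1 on
   1 move
   1 mat
   1 didn't

Note that "didn't" survives as one word because of the apostrophe in the tr match set.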

This works in about .5 seconds on a 4000+ word file. I am pretty happy with that. :-)

Any comments or suggestions about excluding short words or words from a blacklist file, or even the counting capitalized words, would be appreciated.

I am working on Mac OS X 10.6.8, but would hope to get a solution that will work under a Windows Unix-like shell (e.g. Cygwin).

Thanks,
J

Given your requirements for blacklisting and counting words with capitalised leading letters, I'd probably have approached it this way:

awk '
    NR == FNR { blist[$1]; next; }          # read black list into an array

    {
        for( i=1; i <= NF; i++ )
        {
            ignore = ignore_nxt;            # skip the capital check if the previous word ended a sentence
            ignore_nxt = ( match(  $i, "[?.!]" ) && RSTART == length( $i ) );   # does this word end a sentence?

            gsub( "[:,%?<>&@!=+.()]", "", $(i) );       # trash punctuation not considered part of a word
            if( length( $(i) ) > 3 )
            {
                count[$(i)]++;
                fc = substr( $(i), 1, 1 );
                if( !ignore && fc >= "A" && fc <= "Z" )
                    cap++;
            }
        }
    }
    END {
        printf( "words starting with a capital: %d\n", cap ) >"/dev/fd/2";  # out to stderr so it doesnt sort
        for( x in count )
        {
            if( !( x in blist ) )
                print x, count[x];
        }
    }
' blacklist.file text-file | sort -k 2nr,2

The capitalisation is tricky. You can count all words with capitalised letters, or ignore those that immediately follow a full stop (.), question mark (?) or exclamation mark (!). The code above does the latter -- effectively counting proper names that appear in the middle of a sentence. You can comment out the statements that check for and set the ignore variables, and it will count all words that start with a capitalised letter and are longer than 3 characters.
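If you decide you don't care about sentence position at all, the main loop collapses to something like this (the same code as above, just with the ignore bookkeeping dropped):

    {
        for( i=1; i <= NF; i++ )
        {
            gsub( "[:,%?<>&@!=+.()]", "", $(i) );       # trash punctuation not considered part of a word
            if( length( $(i) ) > 3 )
            {
                count[$(i)]++;
                fc = substr( $(i), 1, 1 );
                if( fc >= "A" && fc <= "Z" )            # count every capitalised word, sentence start or not
                    cap++;
            }
        }
    }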

Might not be exactly what you want, but it should give you an idea of one method.

Uff da! Quite a bit of code there; thanks for taking the time to put that together.

If the task of figuring out capitalization is ignored, is there something else you might suggest to handle excluding short words, or incorporating a blacklist? I like the one-liner shell commands. :-)

Thanks,
J

Hi

$ cat file
hi hello world
welcome to india
welcome to unix.com
$ cat black
world
$ sed -r 's/ +/\n/g' file | grep -v -f black | awk 'length>3{a[$0]++}END{for(i in a)print i, a[i];}'
hello 1
india 1
welcome 2
unix.com 1
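
One caveat if this needs to run on OS X too: the BSD sed there takes -E rather than -r and does not expand \n in the replacement text, so on the Mac it may be simpler to reuse tr for the splitting -- an untested sketch along the same lines:

$ tr -cs "[:alpha:]'" "\n" < file | grep -v -f black | awk 'length>3{a[$0]++}END{for(i in a)print i, a[i];}'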

Guru.

I would take a slightly different approach. There is no need for the leading sed, and I would apply the blacklist to the output of the awk, since that list of unique words should be much shorter than the word-per-line stream an initial sed would produce. I'd also strip punctuation/special characters so that something like "(word" is counted as "word" without the paren, and I'd check the length after removing the specials/punctuation so that "(and" is dropped if you want only words that have a length greater than 3.

This can be smashed onto one line, but it's easier to read, and to comment, when written with some structure:

awk '
    BEGIN { RS = "[" FS "\n]" }         # break into records based on whitespace and newline (this may require gnu awk and not work in older versions)
    { 
        gsub( "[:,%?<>&@!=+.()]", "", $(i) );   # ditch unwanted punctuation before looking at len
        if( length( $0 ) > 3 )                  # keep only words long enough
            count[$0]++; 
    } 

    END {
        for( x in count )
            print x, count[x];
    }'  data-file | grep -v -f black-list | sort -k 2rn,2
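
And, for the record, the same thing smashed onto one line (same caveat about gawk and the regex RS; data-file and black-list are placeholder names as above):

awk 'BEGIN{RS="[ \n]"} {gsub("[:,%?<>&@!=+.()]","",$0); if(length($0)>3) count[$0]++} END{for(x in count) print x, count[x]}' data-file | grep -v -f black-list | sort -k 2rn,2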

I was trying to implement part of your suggestions but ended up with a blank results file. Here is what I am using:

time tr -cs "[:alpha:]'" "\n" < $1 | grep -v -f blacklist.txt | sort | uniq -c | sort -rn >counts.txt 

The only added part is the 'grep -v -f ...' that you suggested. I created the blacklist text file, one word per line. The blacklist file is in the same directory as the shell script. (It seems like it would complain if it couldn't find it.)

Thanks,
J

---------- Post updated at 03:16 PM ---------- Previous update was at 02:31 PM ----------

Ah, I think I found an answer. Not exactly sure what the difference is, but it appears to work. :-) (Remember, this is in a script and the $1 is the script parameter.)

time tr -cs "[:alpha:]'" "\n" < $1 | grep -viFf  blacklist.txt | sort | uniq -c | sort -rn >counts.txt

This also works (is apparently the same as the above?):

time tr -cs "[:alpha:]'" "\n" < $1 | fgrep -vif  blacklist.txt | sort | uniq -c | sort -rn >counts.txt

Now, why is it that:

"-F, --fixed-strings
    Interpret PATTERN as a list of fixed strings, separated by newlines, any of which is to be matched."

appears to work the way that the "-f" option SOUNDS like it would work?
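
From reading the man page a bit more, it looks like the two options are independent: -f just says "read the patterns from this file", while -F says "treat each pattern as a fixed string instead of a regular expression" -- which matters as soon as a blacklist entry contains a regex metacharacter such as a dot. A tiny demo (made-up file names):

$ printf 'bar.baz\n' > bl.txt
$ printf 'barXbaz\nhello\n' | grep -v -f bl.txt
hello
$ printf 'barXbaz\nhello\n' | grep -v -Ff bl.txt
barXbaz
hello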

---------- Post updated at 04:22 PM ---------- Previous update was at 03:16 PM ----------

OK, I have figured out a bit more: matching only words of 3 or more characters:

time tr -cs "[:alpha:]'" "\n" < $1 | fgrep -vif  blacklist.txt | egrep '\w{3,}' | sort | uniq -c | sort -rn >counts.txt

I think that the only missing part is the whole capitalization issue. But that isn't a pressing issue, I don't think. And it still appears to be running in less than .03 seconds! Gotta love the shell sometimes.
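
For reference, here is roughly what the whole script looks like at this point (wordcount.sh is just an example name; blacklist.txt sits next to it):

#!/bin/sh
# usage: ./wordcount.sh textfile
# split into words (keeping apostrophes), drop blacklisted words,
# keep only words of 3+ characters, then count and rank by frequency
time tr -cs "[:alpha:]'" "\n" < "$1" | fgrep -vif blacklist.txt | egrep '\w{3,}' | sort | uniq -c | sort -rn >counts.txt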

Thanks all for your suggestions.

-- J