Word Occurrences script using awk

I'm putting together a script that will count the occurrences of words in text documents. It works fine so far, but I'd like to make a couple of tweaks/additions:

1) I'm having a hard time displaying the array index number, tried freq[$i] which just spit 0's back at me
2) Is there any way to eliminate the whitespace (spaces) from the word count?

I'm relatively new to Unix, so any help would be greatly appreciated. Thank you!

{
        $0 = tolower($0)
        for ( i = 1; i <= NF; i++ )
        freq[$i]++
}
BEGIN { printf "%-20s %-6s\n", "Word", "Count"}
END {
sort = "sort -k 2nr"
for (word in freq)
        printf "%-20s %-6s\n", word, freq[word] | sort
close(sort)
}

maybe try this?

freq[i]++

awk automatically uses whitespace as the field delimiter by default.
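
In case it helps to see the difference between the two subscripts, here is a minimal sketch. The function name subscript_demo, the array names by_word and by_pos, and the sample input are all made up for illustration. $i is the text of field number i, so an array subscripted with $i is keyed by the word, while one subscripted with i is keyed by the position of the field on its line. The leading and repeated spaces in the sample are there to show that default field splitting does not turn whitespace into fields.

```shell
# Hypothetical demo: count by field content ($i) versus by field position (i).
subscript_demo() {
    printf '  the   cat sat\nthe dog\n' |
    awk '{
        for (i = 1; i <= NF; i++) {
            by_word[$i]++    # key is the word itself
            by_pos[i]++      # key is the field number
        }
    }
    END {
        print "by_word[the] =", by_word["the"]    # 2: seen on both lines
        print "by_word[cat] =", by_word["cat"]    # 1
        print "by_pos[1] =", by_pos[1]            # 2: both lines have a field 1
        print "by_pos[3] =", by_pos[3]            # 1: only the first line has a field 3
    }'
}
subscript_demo
```

So for a word count, the $i form from the original script looks like the one to keep.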

Thank you, ghostdog. I'll try your suggestion about freq[i]++ instead of freq[$i]++.

The reason I mentioned the spaces - when viewing the output it lists blank space as having a count of 243. I can't figure out exactly what it's picking up.

Perhaps you have some non-printing characters in the file.
Maybe it's from MS-DOS and has CR (carriage return) characters at the line ends; you could try dos2unix filename first

or try

{
    $0 = tolower($0)
    gsub(/\r/, x, $0)
    for ( i = 1; i <= NF; i++ )
    freq[$i]++
}
BEGIN { printf "%-20s %-6s\n", "Word", "Count"}
END {
    sort = "sort -k 2nr"
    for (word in freq)
        printf "%-20s %-6s\n", word, freq[word] | sort
    close(sort)
}
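
To see where a "blank" word can come from, here is a small sketch with made-up input in which every line ends in CR LF, including an empty one. The function name cr_demo and its strip argument are hypothetical. With default field splitting a carriage return is not treated as whitespace, so an otherwise empty line yields one field holding just the CR, and the last word on each line carries a trailing CR. The sketch prints the CR as <CR> so it is visible.

```shell
# Hypothetical demo: cr_demo 0 counts as-is, cr_demo 1 strips CRs first.
cr_demo() {
    printf 'one two\r\n\r\none two\r\n' |
    awk -v strip="$1" '{
        if (strip) gsub(/\r/, "")       # same idea as the gsub in the script above
        for (i = 1; i <= NF; i++)
            freq[$i]++
    }
    END {
        for (word in freq) {
            shown = word
            gsub(/\r/, "<CR>", shown)   # make any carriage return visible
            print shown, freq[word]
        }
    }' | LC_ALL=C sort
}
cr_demo 0    # <CR> 1 / one 2 / two<CR> 2
cr_demo 1    # one 2 / two 2
```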

Thank you, Chubler! Any idea how I can print off the index value as well? Should I be using asorti instead of sort? I'd like my output to appear like the following example:

Index Word Count
1 the 247
2 a 215
3 to 201

How about :

{
    $0 = tolower($0)
    gsub(/\r/, x, $0)
    for ( i = 1; i <= NF; i++ )
    freq[$i]++
}
BEGIN { printf "Index\t%-20s %-6s\n", "Word", "Count"}
END {
    sort = "sort -k 2nr | cat -n"
    for (word in freq)
        printf "%-20s %-6s\n", word, freq[word] | sort
    close(sort)
}

Huge improvement, thank you Chubler! The only issues remaining are with the alignment.
-The index heading is left aligned, but the index numbers are right aligned (I'd like to get both left aligned)
-The word heading and results are left aligned (need right aligned)
-The word count heading and results are left aligned (need right aligned).
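
One possible way to get that alignment without asorti is to let a second awk do the numbering and formatting after sort, instead of cat -n. This is only a sketch: the function name word_table is made up, and the -k1,1 tie-break on the word is an addition that the earlier scripts did not have. The format %-5s left-aligns the index, while %20s and %6s right-align the word and the count.

```shell
# Hypothetical wrapper: count words, sort by count, then number and align.
word_table() {
    awk '{
        $0 = tolower($0)
        gsub(/\r/, "")                  # drop DOS carriage returns
        for (i = 1; i <= NF; i++)
            freq[$i]++
    }
    END {
        for (word in freq)
            print word, freq[word]      # plain "word count" pairs for sort
    }' "$@" |
    LC_ALL=C sort -k2,2nr -k1,1 |
    awk 'BEGIN { printf "%-5s %20s %6s\n", "Index", "Word", "Count" }
         { printf "%-5s %20s %6s\n", NR, $1, $2 }'
}
printf 'b a\na c\n' | word_table
```

Printing the heading from the last stage also keeps it ahead of the sorted rows when the output is redirected to a file.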

Also, is there any way to do the sort using the asorti function? It was recommended I use that.

Again, thank you so much for your help!

---------- Post updated at 09:58 PM ---------- Previous update was at 02:43 PM ----------

I've completely redone the script because I wasn't using the actual index values (which this needs to be sorted by). I've come up with the following, which seems close to working, but isn't quite there. I've spent the past 4 hours on this, and am completely at my wits' end. Any help would be appreciated. Thanks.

{
j = 1
for (i in freq)
ind[j] = i
j++
}
{
$0 = tolower($0)
for (i = 1; i <= NF; i++ )
freq [$i]++
}
BEGIN { printf "%-5s %20s %6s\n", "Index", "Word", "Count"}
END {
        asorti(freq)
        for (word in freq)
        printf "%-5s %20s %6s\n", ind[j], word, freq[word]
}

Unfortunately, the GNU awk array sort functions (asort and asorti) each destroy either the array indices or the data.

So we have to copy the count and word into the index of another array and then sort that. I then split the count and word back out into v[1] and v[2].

Note the leading zeros in the sprintf format; these ensure that alpha sorting will get the counts in the correct order and not 1,10,100,11,...,18,19,2,20 etc.
(compare printf "%d\n" {1..100} | sort with printf "%03d\n" {1..100} | sort )
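
The effect of the padding can be seen in a small sketch. It avoids the {1..100} brace expansion so that it also runs in a plain sh, and it only goes up to 12 to keep the output short. The helper name sorted_as_text is made up for this example.

```shell
# Hypothetical helper: print 1..12 with the given printf format, then sort as text.
sorted_as_text() {
    awk -v fmt="$1" 'BEGIN { for (i = 1; i <= 12; i++) printf fmt "\n", i }' |
    LC_ALL=C sort | tr '\n' ' '
    echo
}
sorted_as_text '%d'      # 1 10 11 12 2 3 4 5 6 7 8 9
sorted_as_text '%02d'    # 01 02 03 04 05 06 07 08 09 10 11 12
```

With equal-width, zero-padded numbers the text order and the numeric order agree, which is what the %08i in the script relies on.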

{
$0 = tolower($0)
for(i = 1; i <= NF; i++ )
freq[$i]++
}
BEGIN { printf "%-5s %20s %6s\n", "Index", "Word", "Count" }
END {
        for(word in freq)
           res[sprintf("%08i:%s",freq[word],word)];

        n = asorti(res)
        for (i=n; i; i--) {
            split(res[i],v,":")
            printf "%-5s %20s %6s\n", n-i+1, v[2], v[1]+0
        }
}

Thank you so much, Chubler. Last question - now say I wanted to sort the array by the word itself (but still display the index & count), would I just switch the asorti function? Sorry to be a pain, but I want to understand how to manipulate these arrays so I can design them in the future for multiple purposes.

Output in this example would look something like:
Index Word Count
1043 the 247
1044 their 84
1045 them 15

Yes, as you proposed, the res[] index would need to change to match the sort order you want, so the word comes first and the zero-padded count second (word:count). This would mean the results from the split would be v[1] = word and v[2]+0 = count

I must be doing something wrong. Instead of outputting the actual word, it's throwing out 00001787 for instance. The index and counts appear to be correct. Would you mind taking a look at this? Thanks!

{
        $0 = tolower($0)
        for ( i = 1; i <= NF; i++ )
        freq[$i]++
}
BEGIN { printf "%-5s %20s %6s\n", "Index", "Word", "Count"}
END {
        for (word in freq)
          res[sprintf("%08i:%s",word,freq[word])];

        n = asorti(res)
        for ( i = n; i; i--) {
        split(res[i], v, ":")
        printf "%-5s %20s %6s\n", n-i+1, v[1], v[2]+0
        }
}

Few things:

1: Swap the sprintf formats around; you want %s for the string followed by %08i for the integer:
res[sprintf("%s:%08i",word,freq[word])];

2: Change the loop order; you want ascending order:
for ( i = 1; i<=n; i++) {

3: Change the index value; as we are now counting up from 1, just display i:
printf "%-5s %20s %6s\n", i, v[1], v[2]+0
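
Putting the three changes together, the whole script might look like the sketch below. It assumes GNU awk, since asorti is a gawk extension, and it keeps the \r stripping from earlier in the thread. The shell function name words_by_name is just a name picked for this example. Words that themselves contain a colon would confuse the split, so treat this as a starting point.

```shell
# Hypothetical wrapper; requires GNU awk (gawk) because of asorti().
words_by_name() {
    gawk '
    {
        $0 = tolower($0)
        gsub(/\r/, "")                  # drop DOS carriage returns
        for (i = 1; i <= NF; i++)
            freq[$i]++
    }
    BEGIN { printf "%-5s %20s %6s\n", "Index", "Word", "Count" }
    END {
        # Build keys of the form word:count; only the keys matter here.
        for (word in freq)
            res[sprintf("%s:%08i", word, freq[word])]
        n = asorti(res)                 # res[1..n] now holds the sorted keys
        for (i = 1; i <= n; i++) {
            split(res[i], v, ":")       # v[1] = word, v[2] = padded count
            printf "%-5s %20s %6s\n", i, v[1], v[2] + 0
        }
    }' "$@"
}
```

Usage would be something like words_by_name filename, or piping text into words_by_name with no arguments.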

Chubler, you've been a tremendous help! I can't thank you enough. And thanks for making a UNIX newbie like myself feel welcome.