word frequency counter - awk solution?

Dear all,

I need your help with this. I have a text file, and for every word whose total frequency in the file is greater than 40, I need to count its occurrences in each line, then write the results to another file with columns like this:

word1,word2,word3, ...wordn
0,0,1
1,2,0
3,2,0 etc. -- each row represents the word counts for one line of the original text file

The numbers are the frequencies of word1 ... wordn in each line of the original file.

This awk script of course does the first part (collects the list of words to count):

{
    for (i = 1; i <= NF; i++)
        words[$i]++
}

END {
    for (i in words)
        if (words[i] > 40)
            print i
}

This does the searching and counting:

{
    res = gsub(i, " ", all)
    print res
}

How do I put them together in awk? Sorry, I am a complete newbie.

Your description is very vague. Should it do this:

# input data
aa bb aa bb cc dd ee ee ee
# resulting count line
2,2,2,2,1,1,3

...because I think that's what your gsub would end up doing.
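Incidentally, the reason the gsub approach can count anything at all is that gsub() returns the number of substitutions it performed. A minimal sketch of that behaviour:

```shell
# gsub() returns the number of substitutions it made, so replacing a
# word and printing the return value doubles as an occurrence counter
echo 'aa bb aa bb cc' | awk '{all = $0; print gsub(/aa/, " ", all)}'
# prints: 2
```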

Thank you for your reply!!!
It needs to be:

# input data
aa bb aa bb cc dd ee ee ee
aa aa bb cc ee ee dd ee cc
# resulting count line
aa,bb,cc,dd,ee
2,2,1,1,3
2,1,2,1,3

I made a shell script for that, but I would really prefer to have it all done inside awk.
Thank you again.

How about this?

awk 'NR==FNR{for(i=1;i<NF;i++) {a[$i]++}}
NR>FNR&&FNR==1{for(i in a) {if(a[i]>=1) b[j++]=i; printf i " "}print ""}
NR>FNR{for(m=0;m<j;m++) printf gsub(b[m],b[m])" "; print ""}' file file

Thank you, yinyuemi. I was trying to make this work, with no success so far; the output is an empty file. Thank you nevertheless.

it worked on my computer:

>cat file
aa bb aa bb cc dd ee ee ee
aa aa bb cc ee ee dd ee cc
>awk 'NR==FNR{for(i=1;i<NF;i++) {a[$i]++}}
NR>FNR&&FNR==1{for(i in a) {if(a[i]>=1) b[j++]=i; printf i " "}print ""}
NR>FNR{for(m=0;m<j;m++) printf gsub(b[m],b[m])" ";print""}' file file
bb cc dd ee aa
2 1 1 3 2
1 2 1 3 2

Best,
Y

Worked on my PC too; perhaps the OP should use nawk instead of awk.

A couple of things to note: yinyuemi's code does search-and-replace, so if some words are substrings of other words (e.g. "the" and "thesis") the counts start going wrong.
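The substring problem is easy to demonstrate in isolation; a minimal sketch:

```shell
# gsub() matches plain substrings, so "the" is also found inside each
# "thesis", inflating the count from the expected 2 to 4
echo 'thesis the thesis the' | awk '{print gsub("the", "the")}'
# prints: 4
```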

This update fixes the issue for me (change >=1 to >=40 when you're ready to limit output to words with 40 or more total occurrences):

$ cat file
thesis the thesis the cc dd ee ee ee
thesis thesis the cc ee ee dd ee cc
$ awk 'NR==FNR{for(i=1;i<NF;i++) {a[$i]++};next}
FNR==1{for(i in a)if(a[i]>=1)b[i]=0;for(i in b)printf (k++?",":"")i;print ""}
{for(i in b) k=b[i]=0;for(w=1;w<=NF;w++)if($w in b)b[$w]++;for(i in b) printf (k++?",":"")b[i];print ""}' file file
thesis,cc,the,dd,ee
2,1,2,1,3
2,2,1,1,3

Thanks Chubler_XL; based on your note, I've changed my code a little to make it more robust:

awk 'NR==FNR{for(i=1;i<=NF;i++) {a[$i]++}}
NR>FNR&&FNR==1{for(i in a) {if(a[i]>=1) b[j++]=i; printf i " "}print ""}
NR>FNR{split($0,c,FS);for(m=0;m<j;m++) {for(n=1;n<=NF;n++){if(b[m]== c[n]) {d++}};printf d" ";d=0};print""}' file1 file1
bb cc dd ee aa
2 1 1 3 2
1 2 1 3 2
 

Hi Chubler_XL, thanks for improving my code!


Sorry to be pedantic, yinyuemi, but it now misses the last word on each line: change n<NF to n<=NF.


And, just for fun, here is a version that does it in one pass (change <2 to <40 for the limit of 40 total occurrences):

awk '{t++;for(w=1;w<=NF;w++){l[t,$w]++;g[$w]++}}
END {for(w in g) if(g[w]<2) delete g[w];
for(w in g) printf w " "; print "";
for(i=1;i<=t;i++) { for(w in g) printf l[i,w]" "; print ""}}' file

I did not run the code, only skimmed it, but it seems to me that if a word that meets the frequency threshold does not occur in a line, the printf statement will not print a 0. I'm assuming a 0 would be desirable as opposed to an empty string. Perhaps a "+0" or a format string with a numeric conversion specifier would be in order?

For example:

printf l[i,w]+0 " "
printf "%d ", l[i,w]

Regards,
Alister


Well spotted - my test data didn't have a line with a zero count; fixed below.
This version also matches words regardless of case and removes common punctuation (e.g. comma, full stop, semicolon, colon, brackets, etc.):

awk '{$0=tolower($0);gsub("[:;.,()!]"," ");t++;
  for(w=1;w<=NF;w++){l[t,$w]++;g[$w]++}}
END {for(w in g) if(g[w]<2) delete g[w]; else printf w " "; print "";
  for(i=1;i<=t;i++){ for(w in g) printf +l[i,w]" "; print ""}}' infile 
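For reference, here is how that one-pass script can be run against the sample data from earlier in the thread. Note that the column order depends on the awk implementation's for(w in g) traversal order, so it may differ between gawk, nawk, etc.:

```shell
# build the sample input used earlier in the thread
printf 'aa bb aa bb cc dd ee ee ee\naa aa bb cc ee ee dd ee cc\n' > infile

# run the one-pass counter (threshold <2 here, i.e. keep words seen at
# least twice in total; raise to <40 for the original requirement)
awk '{$0=tolower($0);gsub("[:;.,()!]"," ");t++;
  for(w=1;w<=NF;w++){l[t,$w]++;g[$w]++}}
END {for(w in g) if(g[w]<2) delete g[w]; else printf w " "; print "";
  for(i=1;i<=t;i++){ for(w in g) printf +l[i,w]" "; print ""}}' infile
```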

Chubler_XL's version works perfectly... in gawk, nawk, and awk. Trying to see how...

It probably does need some explanation.

Consider the following input

The quick brown fox jumped over the lazy
brown fox.

It produces 2 arrays from this. g is the global word count:

g[the]=2
g[fox]=2
g[quick]=1
g[brown]=2
g[jumped]=1
...

l is a word count for each line

l[1,the]=2
l[1,quick]=1
l[1,brown]=1
...
l[2,brown]=1
l[2,fox]=1

The whole file is processed like this; note that t is also counting the number of lines in the file. At the end we go through the g array and delete any entries below our limit of 40, which turns g into the list of popular words. Now for each line (i = 1 through t) we print the count in l[i,w], where w is each word remaining in g.

If no entry exists for the line (i.e. the popular word is not on line i), l[i,w] will be null, but the + in front of +l[i,w] causes awk to treat it as numeric and print a zero for us instead of a blank.
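That coercion is easy to see in isolation; a minimal sketch:

```shell
# an unset array element is the empty string; a leading unary +
# forces numeric context, so awk prints 0 instead of nothing
awk 'BEGIN{ print "[" l[1,"foo"] "]"; print +l[1,"foo"] }'
# prints:
# []
# 0
```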


This is as clear as a child's teardrop... thank you.