I need your help with this. I have a text file, and for each word with frequency >40 I need to count its frequency in each line of the file, then output the counts to another file with columns like this:
word1,word2,word3, ...wordn
0,0,1
1,2,0
3,2,0 etc -- each row represents the word counts for one line of the original text file;
the numbers are the wordn frequencies in that line.
This awk, of course, does the first part (collects a list of words to count):
{
for (i=1; i<=NF; i++)
words[$i]++
}
END {
for (i in words)
if (words[i] > 40)
print i
}
This does the searching and counting:
{
res=gsub(i, " ", all)
print res
}
How do I put them together? In awk? Sorry, I am a complete newbie.
>cat file
aa bb aa bb cc dd ee ee ee
aa aa bb cc ee ee dd ee cc
>awk 'NR==FNR{for(i=1;i<=NF;i++) {a[$i]++}}
NR>FNR&&FNR==1{for(i in a) {if(a[i]>=1) {b[j++]=i;printf "%s ",i}}print ""}
NR>FNR{for(m=0;m<j;m++) printf "%d ",gsub(b[m],b[m]);print""}' file file
bb cc dd ee aa
2 1 1 3 2
1 2 1 3 2
Worked on my PC too, perhaps OP should use nawk instead of awk.
A couple of things to note: yinyuemi's code does a search and replace, so if words are substrings of other words (e.g. "the" and "thesis") it starts going all wrong.
This update fixes the issue for me (change >=1 to >=40 when you're ready to limit to only 40 or greater total occurrences):
$ cat file
thesis the thesis the cc dd ee ee ee
thesis thesis the cc ee ee dd ee cc
$ awk 'NR==FNR{for(i=1;i<=NF;i++) {a[$i]++};next}
FNR==1{for(i in a)if(a[i]>=1)b[i]=0;for(i in b)printf "%s%s",(k++?",":""),i;print ""}
{for(i in b) k=b[i]=0;for(w=1;w<=NF;w++)if($w in b)b[$w]++;for(i in b) printf "%s%s",(k++?",":""),b[i];print ""}' file file
thesis,cc,the,dd,ee
2,1,2,1,3
2,2,1,1,3
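To illustrate the substring problem mentioned above, here is a small sketch that counts "the" on one made-up line in two ways: with gsub (any match, even inside a longer word) and by comparing whole fields. The function name count_demo and the sample text are just for illustration.

```shell
# Compare two ways of counting "the" on a single sample line.
count_demo() {
  printf 'thesis the thesis the\n' | awk '
  {
    line = $0
    by_gsub = gsub(/the/, "&", line)   # counts every match, including inside "thesis"
    by_field = 0
    for (w = 1; w <= NF; w++)          # counts only fields that are exactly "the"
      if ($w == "the") by_field++
    print "gsub=" by_gsub, "fields=" by_field
  }'
}
count_demo   # prints: gsub=4 fields=2
```

The field comparison is what the `$w in b` test in the update relies on, so "the" and "thesis" are kept apart.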
I did not run the code, only skimmed it, but it seems to me that if a word that meets the frequency threshold does not occur in a line, the printf statement will not print a 0. I'm assuming a 0 would be desirable as opposed to an empty string. Perhaps a "+0" or a format string with a numeric conversion specifier would be in order?
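For what it's worth, a minimal sketch of the two suggestions, using a hypothetical array c with no elements set: an unset element prints as an empty string with %s, while adding 0 or using %d yields 0.

```shell
# Show how an unset awk array element prints under three formats.
zero_demo() {
  awk 'BEGIN {
    printf "[%s]", c["missing"]        # unset element as a string: empty
    printf "[%s]", c["missing"] + 0    # forced numeric with +0: 0
    printf "[%d]\n", c["missing"]      # numeric conversion specifier: 0
  }'
}
zero_demo   # prints: [][0][0]
```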
Well spotted - my test data didn't have a line with zero count, fixed below.
It also matches words regardless of their case and removes common punctuation (e.g. comma, full stop, semicolon, colon, brackets, etc.):
awk '{$0=tolower($0);gsub("[:;.,()!]"," ");t++;   # lower-case, strip punctuation, count lines in t
for(w=1;w<=NF;w++){l[t,$w]++;g[$w]++}}            # l = per-line counts, g = whole-file counts
END {for(w in g) if(g[w]<2) delete g[w]; else printf "%s ", w; print "";
for(i=1;i<=t;i++){ for(w in g) printf +l[i,w]" "; print ""}}' infile
The whole file is processed first, filling l (per-line counts) and g (whole-file counts); note that t also counts the number of lines in the file. At the end we go through the g array and delete any entries below the limit (2 in the test command above; raise it to 40 for the real data), which turns g into the popular-word list. Then for each line (i = 1 through t) we print the count in l[i,w], where w is each word remaining in g.
If no entry exists for the line (ie this popular word is not on line i) l[i,w] will be null, but the + in front of +l[i,w] causes awk to treat it as numeric and print a zero for us instead of a blank.
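Putting the pieces together, here is a rough sketch of the same one-pass approach with two changes that are my own assumptions rather than anything from the posts above: the limit is passed in as a variable (min, kept strictly greater-than as in the original question), and words are remembered in first-seen order (the order and keep arrays) so the comma-separated columns come out the same on every run. The function name word_matrix and the sample text are made up.

```shell
# word_matrix MIN FILE: print a CSV header of words whose total count is > MIN,
# then one CSV row of per-line counts for each line of FILE.
word_matrix() {
  awk -v min="$1" '
  {
    $0 = tolower($0)
    gsub(/[:;.,()!]/, " ")               # same punctuation set as the one-liner
    t++                                  # line counter
    for (w = 1; w <= NF; w++) {
      if (!($w in g)) order[++n] = $w    # remember first-seen order
      l[t, $w]++                         # per-line count
      g[$w]++                            # whole-file count
    }
  }
  END {
    for (k = 1; k <= n; k++)             # keep only words above the limit
      if (g[order[k]] > min) keep[++m] = order[k]
    for (k = 1; k <= m; k++)
      printf "%s%s", keep[k], (k < m ? "," : "\n")
    for (i = 1; i <= t; i++)
      for (k = 1; k <= m; k++)           # %d turns a missing count into 0
        printf "%d%s", l[i, keep[k]], (k < m ? "," : "\n")
  }' "$2"
}

tmp=$(mktemp)
printf 'The cat, the dog.\nA dog!\nthe end\n' > "$tmp"
word_matrix 1 "$tmp"
rm -f "$tmp"
```

With the three sample lines and a limit of 1, this prints the header the,dog followed by the rows 2,1 and 0,1 and 1,0. For the real file it would be called as word_matrix 40 infile.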