Frequency Count of chunked data

Dear all,
I have an AWK script which produces a frequency count of words. However, I am interested in getting the frequency of chunked data. That is, I have already produced valid chunks of running text, with each chunk on a line, and what I need is a script to count the frequency of each such string. A made-up sample is provided below

this interesting event
has been going on
since years
in this country
the two actors
met
one another
in this country
Mary
met
her husband
in this country

The output would be

Mary	1
has been going on	1
her husband	1
in this country	3
met	2
one another	1
since years	1
the two actors	1
this interesting event	1

I have been able to sort the data so that all identical strings are grouped together

Mary	
has been going on
her husband
in this country
in this country
in this country
met
met
one another
since years
the two actors
this interesting event

My question is: how do I write the script so that a whole line is treated as a single entity, so that lines which match (I have got that far with the sort) are counted as one unit and a frequency counter can be set up?
My awk script uses space as the delimiter, but I do not know how to make it recognise the start of line and the end-of-line CRLF as delimiters.
I am sure such a tool will be useful to people who work with chunked big data.
Many thanks

$ cat file
this interesting event
has been going on
since years
in this country
the two actors
met
one another
in this country
Mary
met
her husband
in this country
$ sort file | uniq -c
   1 Mary
   1 has been going on
   1 her husband
   3 in this country
   2 met
   1 one another
   1 since years
   1 the two actors
   1 this interesting event
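
If you want the exact "chunk<TAB>count" layout shown in the first post rather than uniq's "count chunk" layout, the count can be moved to the end of each line. A minimal sketch (assuming none of the chunks contain a tab of their own):

sort file | uniq -c | awk '{c = $1; sub(/^[[:space:]]*[0-9]+[[:space:]]/, ""); print $0 "\t" c}'

Here c saves the count produced by uniq -c, the sub() strips the leading padding and the count, and what remains of the line is printed followed by a tab and the count.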

Hello gimley,

The following may help. I am still not completely sure about your requirement; if this doesn't fulfil it,
please let us know your expected output, what you have tried, and the OS you are using.

awk '{X[$0]++;Y[$0]=$0;} END{for(i in X){print Y[i] OFS X[i]}}' Input_file | sort

Output will be as follows.

Mary 1
has been going on 1
her husband 1
in this country 3
met 2
one another 1
since years 1
the two actors 1
this interesting event 1

Thanks,
R. Singh


anbu23's suggestion using uniq -c is an excellent choice for this task, but if you want to know how to do it with awk, read on...

When you were dealing with words (instead of lines), your awk script probably had a loop going from 1 to NF on each input line, treating each field as a "word" and counting occurrences with something like:

{for(i = 1; i <= NF; i++) freq[$i]++}
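
A complete word-frequency script built around that loop would then just print the array in an END block; a minimal sketch (using the same Input_file name as elsewhere in this thread, with sort giving a predictable output order):

awk '{for(i = 1; i <= NF; i++) freq[$i]++} END{for(w in freq) print w "\t" freq[w]}' Input_file | sort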

To count occurrences of lines, it is simpler:

{freq[$0]++}
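
Wrapped in a full program that prints each distinct line and its count, tab-separated as in your requested output, that might look like the following sketch (the trailing sort gives the alphabetical ordering you showed):

awk '{freq[$0]++} END{for(line in freq) print line "\t" freq[line]}' Input_file | sort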

Note that awk assumes input files have LF line terminators, not CR/LF. If every line is CR/LF terminated, that won't matter when you're working on whole lines (every key just carries the same trailing CR). It would, however, skew individual word counts, because the last word on each line would be stored in a different bucket (one for "word<CR>") from the other occurrences of "word" that are counted in the "word" bucket.

Note that your awk script can't have CR/LF line terminators; the CR will be treated as part of whatever awk command is on that line, frequently generating syntax errors.
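
If the data files really do arrive with CR/LF line endings and the per-word counts matter, one common workaround (just a sketch, not the only option) is to strip the trailing CR before doing any counting:

awk '{sub(/\r$/, "")} {for(i = 1; i <= NF; i++) freq[$i]++} END{for(w in freq) print w "\t" freq[w]}' Input_file | sort

The sub() removes a carriage return at the end of each record, so the last word on a line is counted in the same bucket as the same word anywhere else.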

RavinderSingh13 provided an example of how to use awk to do this. It can be simplified a little bit to just:

awk '{X[$0]++} END{for(i in X){print i, X[i]}}' Input_file | sort

Many thanks to all for their help, especially to Don for his kind and helpful explanation. One is never too old to learn (just turned 65), and this forum is a wonderful place to learn with helpful people.
