Extracting high-frequency data lines

Hi,

I have a very large log file in the following format:

198.28.0.0 - - [08/Jul/1998:19:00:01 +0000]  200 348
244.48.0.0 - - [08/Jul/1998:19:00:01 +0000]  200 211
198.28.0.0 - - [08/Jul/1998:19:00:01 +0000]  200 191
4.48.0.0 - - [08/Jul/1998:19:00:01 +0000]  200 1131
244.48.0.0 - - [08/Jul/1998:19:00:01 +0000]  200 1131
244.48.0.0 - - [08/Jul/1998:19:00:01 +0000]  200 1131
4.48.0.0 - - [08/Jul/1998:19:00:01 +0000]  200 1131
244.48.0.0 - - [08/Jul/1998:19:00:01 +0000]  200 211
4.48.0.0 - - [08/Jul/1998:19:00:01 +0000]  200 1131

The first column is the source IP address. The entire file contains entries from a finite set of source IP addresses, each occurring with some frequency. In this example: 198.28.0.0 (2), 244.48.0.0 (4), 4.48.0.0 (3).

I require a sed/awk script that takes a frequency as user input (or it can be hard-coded as well) and extracts all lines whose source IP occurs at least that many times. For example, if the user gives 3 as the input, then entries from source IPs appearing 3 or more times should go to the output file. Hence the output file should contain the entries for source IPs 244.48.0.0 and 4.48.0.0, i.e.

244.48.0.0 - - [08/Jul/1998:19:00:01 +0000]  200 211
244.48.0.0 - - [08/Jul/1998:19:00:01 +0000]  200 1131
244.48.0.0 - - [08/Jul/1998:19:00:01 +0000]  200 1131
244.48.0.0 - - [08/Jul/1998:19:00:01 +0000]  200 211
4.48.0.0 - - [08/Jul/1998:19:00:01 +0000]  200 1131
4.48.0.0 - - [08/Jul/1998:19:00:01 +0000]  200 1131
4.48.0.0 - - [08/Jul/1998:19:00:01 +0000]  200 1131

Thanks and Regards

One way:

awk 'NR==FNR{c[$1]++;next}c[$1]>=3' file file | sort
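
This works because awk reads the same file twice: while reading the first copy, NR==FNR is true, so it only counts how often each first field (the source IP) appears and skips to the next line. On the second pass NR==FNR is false, and a line is printed whenever its IP's count meets the threshold. The same one-liner, spread out with comments (threshold still hard-coded to 3):

awk '
  NR == FNR { c[$1]++; next }   # first pass: count occurrences of each source IP
  c[$1] >= 3                    # second pass: print lines whose IP occurred >= 3 times
' file file | sort              # sort groups the surviving lines by IP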

Try this:

awk 'NR==FNR{A[$1]++;next}A[$1]>=n' n=3 infile infile
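
If you want the frequency to come from the user rather than being hard-coded, a small wrapper script is enough. A minimal sketch, assuming the threshold is the first argument and the log file the second (the script name and file names here are just placeholders):

#!/bin/sh
# Usage: ./extract.sh 3 access.log > out.log
freq=$1
file=$2
awk -v n="$freq" 'NR==FNR{c[$1]++;next} c[$1]>=n' "$file" "$file"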