Counting entries in a file

Hi,

I have a very large two-column log file in the following format:

# Epoch Time IP Address

899726401      112.254.1.0
899726401      112.254.1.0
899726402      154.162.38.0
899726402      160.114.12.0
899726402      165.161.7.0
899726403      101.226.38.0
899726403      101.226.38.0
899726403      101.226.38.0
899726403      73.214.29.0
899726403      144.12.40.0
899726404      144.12.40.0
899726404      1.14.4.0

Each row represents a packet, with a time stamp (epoch time) and the source IP address. Since the granularity is one second, there are multiple entries for the same time stamp. So in the 1st second there are two packets (from 1 IP), in the 2nd second three (from 3 IPs), in the 3rd second five (from 3 IPs), and so on.

I want a script using sed/awk (as the log files are quite big) which takes a time interval in seconds as the user input and outputs the number of packets (instances of epoch time) and the number of unique IP addresses within each interval.

So, for example, if the user gives 1 second as the input, the output file (3 columns) should look like:

#Time No of Packets No of Unique IPs

1 2 1
2 3 3
3 5 3
4 2 2

Similarly, for a user input of 2 seconds the output file should look like:

#Time No of Packets No of Unique IPs

1 5 4
2 7 4

PS: here 1 and 2 in the first column mean the first two seconds and the next two seconds respectively.

Looking forward to the reply.

Thanks,

Have a go with this:

#!/usr/bin/env ksh

awk -v bin_size=${1:-5} '
    function dump( )
    {
        if( NR == 1 )
            return;
        printf( "%3d %3d %3d\n", bin+1, total, length( unique ) );
        bin++;
    }

    {
        if( $1+0 >= next_bin )
        {
            dump( );
            next_bin = $1 + bin_size;

            delete unique;
            total = 0;
        }

        unique[$2]++
        total++;
    }
    END {
        if( total )
            dump( );
    }
'

exit

The size of the bin in seconds is passed on the command line as the only parameter to the script.


It should also take an "input file" as one of the user arguments along with the "bin size". It doesn't seem to work.

Can you check it again?

Thanks,

It works perfectly and it's a very good script (I tried to write one but I couldn't). Get your INPUTFILE from stdin like this:

./this_script 1 <INPUTFILE

Cheers :b:

Sorry, I forgot to point out that the script assumes input from stdin -- I'm an old school programmer and generally write code to take input from stdin as this allows preprocessing of the input (through sed or grep) if needed without having to make a change to the script, or code the script to do extra work.
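For instance, preprocessing can strip a header out of the stream before the data ever reaches the binning script. A trivial self-contained demonstration (the heredoc stands in for a real log file):

```shell
# Filter comment/header lines out of the stream before binning;
# the heredoc below is a made-up stand-in for a real log file.
out=$(grep -v '^#' <<'EOF'
# Epoch Time IP Address
899726401 112.254.1.0
899726402 154.162.38.0
EOF
)
printf '%s\n' "$out"
```

Piping the result into the script (e.g. `grep -v '^#' logfile | foo.ksh 5`) keeps the awk itself free of header handling.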

The script could also be modified slightly (the very last line) to allow the name of the input file to be supplied on the command line, falling back to stdin when no file is given:

awk  -v bin_size=${1:-5} '
# body of awk from previous example 
' $2

Assuming the script is saved in foo.ksh, this allows it to be invoked both ways:

foo.ksh 1 <input-file
foo.ksh 1 input-file

Hi,

In place of "number of unique IPs" for that timeinterval, if I would like to calculate "number of NEW IPs (comparing the IPs in the current time interval to the IPs in the previous time interval), could someone suggest the modifications to the script?

Cheers,

Small tweaks to the original awk to show the number of new IPs in the current bin compared to the previous one.

#!/usr/bin/env ksh

awk -v bin_size=${1:-5} '
    function dump( )
    {
        if( NR == 1 )
            return;

        new_count = 0;
        for( u in unique )              # compute total in this bin that were not in last bin
            if( last_bin[u] == 0 )
                new_count++;

        printf( "%3d %3d %3d\n", bin+1, total, new_count );
        bin++;
    }

    {
        if( $1+0 >= next_bin )
        {
            dump( );
            next_bin = $1 + bin_size;

            delete last_bin;
            for( u in unique )              # copy hits from this bin
                last_bin[u] = 1;
            delete unique;
            total = 0;
        }

        unique[$2]++
        total++;
    }
    END {
        if( total )
            dump( );
    }
'

exit

Have fun!


It works :b:

Can you give a brief explanation ? I am new to awk.

Glad it worked for you. I've added some comments. I'll watch the thread if you have specific questions.

awk -v bin_size=${1:-5} '
    # use a function so we can call as we process input and at the end without duplicating the code
    function dump( )                # dump out the information that we collected about the last bin
    {
        if( NR == 1 )               # we will call this for the first record; 
            return;                 # if this is the first record (NR equals 1) then we skip the print

        new_count = 0;
        for( u in unique )              # look at each unique IP we saved
            if( last_bin[u] == 0 )   # if it wasn't noticed last time, count it
                new_count++;

        bin++;
        printf( "%3d %3d %3d\n", bin, total, new_count );       # print all of the counts
    }

    {                                   # process for each record in the file (implied true condition)
        if( $1+0 >= next_bin )          # if timestamp (col 1) is in the next bin
        {
            dump( );                    # print data from the previous bin
            next_bin = $1 + bin_size;   # mark the start of the next bin

            delete last_bin;            # must delete contents of last bin
            for( u in unique )          # copy hits from this bin 
                last_bin[u] = 1;     # for comparison when we see the start of the next bin
            delete unique;              # must delete the list of unique IPs from the current bin before we start
            total = 0;                  # zero number of hits in the bin
        }

        unique[$2]++                    # count the number of times this IP address was seen in the bin
        total++;                        # total number of entries in the bin
    }

    END {               # at the end of the file, one last print if we saw something in the previous bin
        if( total )
            dump( );
    }
'

Can you modify it to compute the number of new IPs in the current bin compared to the ENTIRE HISTORY up to that interval, instead of just the previous interval?

Cheers,
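One way to do this (a sketch in the same style as agama's script, with a small made-up sample embedded as a heredoc) is to keep a `seen` array that is never cleared, so each dump compares the current bin against every IP observed in any earlier bin:

```shell
# New IPs vs the ENTIRE history: "seen" accumulates across bins and
# is folded in only after the current bin has been counted.
out=$(awk -v bin_size=1 '
    function dump() {
        if( NR == 1 )               # nothing collected yet on the first record
            return;
        new_count = 0;
        for( u in unique )
            if( !(u in seen) )      # never observed in any earlier bin
                new_count++;
        for( u in unique )
            seen[u] = 1;            # fold this bin into the history AFTER counting
        printf( "%d %d %d\n", ++bin, total, new_count );
    }
    {
        if( $1+0 >= next_bin ) {    # timestamp crossed into the next bin
            dump();
            next_bin = $1 + bin_size;
            delete unique;
            total = 0;
        }
        unique[$2]++;
        total++;
    }
    END { if( total ) dump() }
' <<'EOF'
899726401 112.254.1.0
899726401 112.254.1.0
899726402 154.162.38.0
899726402 112.254.1.0
899726402 165.161.7.0
899726403 101.226.38.0
899726403 112.254.1.0
899726404 144.12.40.0
899726404 165.161.7.0
EOF
)
printf '%s\n' "$out"
```

The only change from the previous-bin version is that `seen` is never deleted at a bin boundary, so it holds the union of all earlier bins instead of just the last one.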

I have added another column to the input file and would like to sum up and print the entries of that column for the user-specified time interval. For example, if the user specifies 5 seconds as the input, the script should add all the entries of the third column for each 5-second interval and print it alongside the other information currently printed by the above script, i.e. time-stamp, number of packets, number of unique IPs in that interval, and number of new IPs in that interval compared to the previous interval.

Looking for a solution.

Cheers,

This assumes only raw data, with no header line, in the source file. Otherwise you need to adjust the NR==1 part.

$ interval=2
$ awk -v s=$interval 'NR==1{min=$1}
                    {NoP[$1]++;UnIP[$1 FS $2]++;IP[$2];min=min>$1?$1:min;max=max>$1?max:$1}
                   END{for (i=min;i<=max;i=i+s)  
                         { b=i
                           {while (b<i+s) 
                                 {t+=NoP[b]
                                  for (j in IP) if (UnIP[b FS j]) u++
                                  b++
                                 }
                            }
                            print ++e,t,u
                           t=0;u=0}
                       }' infile

1 5 4
2 7 5

The third column it prints is incorrect.

I am looking for a solution which integrates the above script given by agama and your previous solution.

I see. Here is the fix:

interval=2

awk -v s=$interval 'NR==1{min=$1}
                    {NoP[$1]++;UnIP[$1 FS $2]++;IP[$2];min=min>$1?$1:min;max=max>$1?max:$1}
                   END{for (i=min;i<=max;i=i+s)  
                         { b=i
                           {while (b<i+s) 
                                 {t+=NoP[b]
                                  for (j in IP) if (UnIP[b FS j]) x[j]
                                  b++
                                 }
                            }
                            print ++e,t,length(x)
                           t=0;u=0;delete x}
                       }' infile

It's still not solving my purpose, but thanks anyway!

Hi Guy,

You need to point out what problem you faced. Please look at it in a positive way; people here are helping you for free.

From your previous description, it does the job.

1 5 4
2 7 4

Where is the problem, you need give the detail.

Okay. I have a file with three columns: time-stamp, IP address and bytes. The time-stamp is in epoch and there are multiple entries for the same time-stamp, each representing a packet.

I need an awk/sed script which takes time (in secs) as the user input and gives an output file with five columns (tab separated): time-stamp, number of packets in that interval (essentially a count of the number of entries within that time interval), number of unique IPs in that interval, number of NEW IPs in that interval (compared to the previous interval), and the sum of bytes in that interval.

Hope I am clear this time.

Cheers,

I went through the whole post and don't see any source file with three columns. If you need the result as 5 columns, the best way is to show a sample source file and your expected output directly.

When you reply that others' code doesn't resolve your problem without writing down your purpose, how can we help you?

Okay, the input file is in the following format:

# Epoch Time   IP Address     Bytes

899726401 112.254.1.0 20
899726401 112.254.1.0 10
899726402 154.162.38.0 30
899726402 112.254.1.0 40
899726402 165.161.7.0 60
899726403 101.226.38.0 20
899726403 101.226.38.0 10
899726403 101.226.38.0 30
899726403 112.254.1.0 40
899726403 144.12.40.0 20
899726404 144.12.40.0 30
899726404 1.14.4.0 10

So for a user input of 1 sec the output should look like:

#Time-stamp  No. of Packets  No. of Unique IPs  No. of New IPs  Bytes

1 2 1 1 30
2 3 3 2 130
3 5 3 2 120
4 2 2 1 40

Hope I am clear.

Cheers,
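For what it's worth, here is a sketch (my own combination, not from any post above) that merges agama's binning and new-IP logic with a bytes sum to produce the five requested columns; the sample data above is embedded as a heredoc, and the output matches the expected table:

```shell
# Five tab-separated columns per bin: bin number, packets, unique IPs,
# new IPs vs the previous bin, and byte sum.  bin_size is 1 for the demo.
out=$(awk -v bin_size=1 '
    function dump() {
        if( NR == 1 )                   # nothing collected yet on the first record
            return;
        new_count = 0;
        for( u in unique )
            if( !(u in last_bin) )      # IP not present in the previous bin
                new_count++;
        printf( "%d\t%d\t%d\t%d\t%d\n", ++bin, total, ucount, new_count, bytes );
    }
    {
        if( $1+0 >= next_bin ) {        # timestamp crossed into the next bin
            dump();
            next_bin = $1 + bin_size;
            delete last_bin;
            for( u in unique )
                last_bin[u] = 1;        # remember this bin for the next comparison
            delete unique;
            total = 0; ucount = 0; bytes = 0;
        }
        if( !($2 in unique) )           # count uniques by hand, avoiding
            ucount++;                   # length(array), which not every awk has
        unique[$2]++;
        total++;
        bytes += $3;                    # third column: bytes
    }
    END { if( total ) dump() }
' <<'EOF'
899726401 112.254.1.0 20
899726401 112.254.1.0 10
899726402 154.162.38.0 30
899726402 112.254.1.0 40
899726402 165.161.7.0 60
899726403 101.226.38.0 20
899726403 101.226.38.0 10
899726403 101.226.38.0 30
899726403 112.254.1.0 40
899726403 144.12.40.0 20
899726404 144.12.40.0 30
899726404 1.14.4.0 10
EOF
)
printf '%s\n' "$out"
```

To use it on a real file, drop the heredoc and pass the file name (or pipe the data in on stdin), exactly as with the earlier scripts in this thread.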