I have an awk script that extracts data from log files and generates a report. The script increments elements in arrays, sums fields depending on the contents of other fields, and then performs some operations on those arrays in an END block. It seemed to be taking longer than it should on my system, so I started trying to figure out where the time was going. I added a simple test to the script:
# progress probe: every 10,000 records, report the record count, the current
# size of time_count, and the seconds elapsed since the previous report
NR % 10000 == 0 {
    print NR "\t" length(time_count) "\t" systime() - prevtime
    prevtime = systime()
}
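For context, the counting itself is roughly this shape (a simplified sketch; the patterns, field positions, and key layout here are placeholders rather than my actual script):

/relevant entry/ {
    # count one hit for this server in this second
    time_count[$1 SUBSEP $2]++
    # sum a field depending on the contents of another field
    if ($3 == "GET") bytes_total[$1] += $5
}
END {
    for (key in time_count)
        print key, time_count[key]
}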
That time_count array becomes rather large: it holds a per-second count across an hour for each of several servers, so it stacks up considerably and seems like the best candidate for causing the slowdown. Here's a snippet of the probe's output around the point where the slowdown starts:
350000 8559 1
360000 8804 2
370000 9012 1
380000 16773 3
390000 16811 4
400000 16857 4
You can see a big jump in the array length around the 380,000-line mark, along with a jump in the time required to process that batch of 10,000 lines. The time grows, slightly but measurably, as the line count increases, and this will become more of a problem as the script is used to process larger files.
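If it's useful, here's a stripped-down test I can run on its own (no log data; the keys are made up and only need to make the array grow) to check how much of the per-batch time increase comes from array growth alone:

BEGIN {
    prevtime = systime()
    # insert unique keys so the array grows steadily, and report the
    # iteration count, array size, and elapsed seconds every 10,000 inserts
    for (i = 1; i <= 1000000; i++) {
        time_count["server" (i % 8) SUBSEP int(i / 8)]++
        if (i % 10000 == 0) {
            print i "\t" length(time_count) "\t" systime() - prevtime
            prevtime = systime()
        }
    }
}

It relies on the same gawk extensions (length() on an array and systime()) as the probe above.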
So, my questions:
1) Are there any general suggestions for increasing performance? This may require me to post the whole script. I don't mind, but I don't want to clutter things up here, so let me know if that would help.
2) I've noticed that the VIRT/RES/SHR figures in top for my script max out at 108m/6204/864, so if I'm reading that right it's only actually using about 6 MB of RAM. I'd like it to feel free to gobble up as much memory as it can get; this is a system with 96 GB of RAM, so that's not a problem. How can I encourage the process to do that?
I'd love it if I could tweak things so that disk I/O became the limiting factor. Many thanks in advance for suggestions.