I have an awk script that extracts data from log files and generates a report. The script increments elements in arrays, sums fields depending on the contents of other fields, and then performs some operations on those arrays in an END block. It seemed to be taking longer than it should on my system, so I started trying to figure out where the time was going. I added a simple test to the script:
# progress probe: every 10,000 records, report the record count, the current
# size of time_count, and the seconds elapsed since the previous report
NR % 10000 == 0 {
    print NR "\t" length(time_count) "\t" systime() - prevtime
    prevtime = systime()
}
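For context, the counting itself is roughly this shape (a simplified sketch; the patterns, field positions, and key layout here are placeholders rather than my actual script):

/relevant entry/ {
    # count one hit for this server in this second
    time_count[$1 SUBSEP $2]++
    # sum a field depending on the contents of another field
    if ($3 == "GET") bytes_total[$1] += $5
}
END {
    for (key in time_count)
        print key, time_count[key]
}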
That time_count array becomes rather large: it holds a per-second count across an hour for each of several servers, so it stacks up considerably and seems like the best candidate for causing the slowdown. Here's a snippet of the probe's output around the point where the slowdown starts:
350000 8559 1
360000 8804 2
370000 9012 1
380000 16773 3
390000 16811 4
400000 16857 4
You can see a big jump in the array length around the 380,000-line mark, along with a jump in the time required to process that batch of 10,000 lines. The time grows, slightly but measurably, as the line count increases, and this will become more of a problem as the script is used to process larger files.
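If it's useful, here's a stripped-down test I can run on its own (no log data; the keys are made up and only need to make the array grow) to check how much of the per-batch time increase comes from array growth alone:

BEGIN {
    prevtime = systime()
    # insert unique keys so the array grows steadily, and report the
    # iteration count, array size, and elapsed seconds every 10,000 inserts
    for (i = 1; i <= 1000000; i++) {
        time_count["server" (i % 8) SUBSEP int(i / 8)]++
        if (i % 10000 == 0) {
            print i "\t" length(time_count) "\t" systime() - prevtime
            prevtime = systime()
        }
    }
}

It relies on the same gawk extensions (length() on an array and systime()) as the probe above.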
So, my questions:
1) Are there any general suggestions for increasing performance? This may require me to post the whole script. I don't mind, but I don't want to clutter things up here, so let me know if that would help.
2) I've noticed that the VIRT/RES/SHR figures in top for my script max out at 108m/6204/864, so if I'm reading that right it's only actually using about 6 MB of RAM. I'd like it to feel free to gobble up as much memory as it can get; this is a system with 96 GB of RAM, so that's not a problem. How can I encourage the process to do that?
I'd love it if I could tweak things so that disk I/O became the limiting factor. Many thanks in advance for suggestions.