Making a faster alternative to a slow awk command

Hi,

I have a large number of input files with two columns of numbers.

For example:

[X]    [Y]
83     1453
99     3255
99     8482
99     7372
83     175

I only wish to retain lines where the numbers fulfil two requirements, e.g.:
[X]=83
1000<=[Y]<=2000

To do this I use the following command:

awk '($1==83) && $2>=1000 && $2<=2000' [inputfile]

PROBLEM: My input files contain >50 million lines, so the awk command is too slow (it takes >2 minutes, and I have thousands of input files). Is there a way to make it faster? I have been told that it would be faster if I used Perl.
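Since the same filter has to run over thousands of files, a simple baseline is a loop applying the original awk command to each one. The file names and the `.filtered` output naming here are assumptions for illustration, not the poster's actual layout:

```shell
# Hypothetical sketch: apply the original awk filter to every input
# file, writing each result alongside its source (names are made up).
for f in input*.txt; do
    awk '$1 == 83 && $2 >= 1000 && $2 <= 2000' "$f" > "$f.filtered"
done
```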

If the first value is fixed, try:

awk '/^83 *[12][0-9][0-9][0-9]/{if($2>=1000 && $2<=2000){print}}' infile

Performance not tested... To avoid the arithmetic comparisons, try egrep:

$ egrep "83 (1...$|2000)" infile
83 1453
$

The problem, really, is that you have a huge amount of data, not a slow program. How big are your records, really?

I haven't heard Perl suggested as a way to increase speed before. I don't think using an awk regex instead of a simple == comparison is going to make it faster, either...

The egrep solution might be worth a shot.


That's a reasonable assumption, but it turns out to be incorrect (at least with the implementation I tested).

In my testing, the following code is over three times faster than the original solution:

awk '/^83  *(1[0-9][0-9][0-9]|2000)$/' data
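A quick sanity check of the pattern's boundaries (the sample values below are made up): `1[0-9][0-9][0-9]` covers 1000 through 1999, and the `2000` alternative covers the upper endpoint exactly.

```shell
# Boundary check: 999 and 2001 fall outside the pattern,
# 1000, 1999 and 2000 fall inside it.
printf '83 999\n83 1000\n83 1999\n83 2000\n83 2001\n' |
    awk '/^83  *(1[0-9][0-9][0-9]|2000)$/'
# prints: 83 1000, 83 1999 and 83 2000
```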

I'm curious to know if this solution is also faster on other implementations (nawk and gawk, specifically), but I won't be able to test on them today.

I used an obsolete Linux system for all of my testing.

Hardware: Pentium 2 @ 350 MHz (can you feel the power?)
Software: awk is mawk 1.3.3, perl 5.8.8, GNU (e)grep 2.5.1, GNU sed 4.1.5, GNU coreutils 5.97 (cat, wc)
Data: 14 megabytes. 6 line repeating pattern. 1,783,782 lines. 297,297 matches.
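For anyone who wants to reproduce a similar test, here is a hypothetical generator for data of the same shape. Only the counts (6-line pattern, 1,783,782 lines, 297,297 matches) follow the description above; the specific non-matching values are made up.

```shell
# Hypothetical generator: a 6-line repeating pattern with exactly one
# in-range line per repetition. 297297 * 6 = 1783782 lines total.
awk 'BEGIN {
    for (i = 0; i < 297297; i++) {
        print "83 1453"            # the single in-range line
        print "99 3255"
        print "99 8482"
        print "99 7372"
        print "83 175"             # right key, value below the range
        print "83 2500"            # right key, value above the range
    }
}' > data
```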

Slowest to fastest:

$ time egrep '^83 (1...|2000)$' data > /dev/null

real    0m15.170s
user    0m15.089s
sys     0m0.080s

$ time awk '$1==83 && $2>=1000 && $2<=2000' data > /dev/null

real    0m11.325s
user    0m11.213s
sys     0m0.112s

$ time perl -ne 'print if /^83 (1[0-9][0-9][0-9]|2000)$/' data > /dev/null

real    0m9.728s
user    0m9.629s
sys     0m0.100s

$ time sed d data

real    0m8.357s
user    0m8.277s
sys     0m0.080s

$ time awk '/^83  *[12][0-9][0-9][0-9]$/ {if ($2>=1000 && $2<=2000) print}' data > /dev/null

real    0m6.809s
user    0m6.692s
sys     0m0.116s

$ time awk '/^83  *(1[0-9][0-9][0-9]|2000)$/' data > /dev/null

real    0m3.555s
user    0m3.404s
sys     0m0.152s

$ time awk 0 data

real    0m1.898s
user    0m1.832s
sys     0m0.068s

$ time wc -l data > /dev/null

real    0m0.721s
user    0m0.316s
sys     0m0.128s

$ time cat data > /dev/null

real    0m0.084s
user    0m0.012s
sys     0m0.072s

Most surprising to me is how long it takes GNU sed to do nothing.

For everyone's amusement (GNU bash 3.1.17):

$ cat match.sh
while read -r line; do
    case $line in
        83\ 1???|83\ 2000) echo "$line";;
    esac
done

$ time sh match.sh < data > /dev/null

real    6m53.128s
user    6m28.776s
sys     0m24.150s

Regards,
Alister


That regular expression says that the space is optional. That's probably not a good idea. The way it's written, 832999 2000 would match.

That may require an anchor at the beginning (^) if numbers with more than 3 digits are possible in the first column. Also, the $ anchor should probably be moved so that it's just after the parenthesized group (for a similar reason).

Regards,
Alister


Absolutely. GNU tools in particular tend to be slower than their counterparts.

Whoops. My test data was delimited by a single space, so the output of the commands was correct, but the times were slightly underestimated due to the simpler regular expression.

Using ed, I replaced the single space in each line with a <space><tab><space> sequence. I re-ran the tests, replacing the <space> in the regular expression with [<space><tab>]+ , and the time for each test increased by 1 to 3 seconds with the rankings unchanged.
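For reference, the same edit can be sketched with sed instead of ed (this is my own equivalent, not the command actually used; the literal tab is embedded via printf, since not every sed understands \t in the replacement):

```shell
# Hedged sketch of the edit described above: replace the first space on
# each line of the test data with a <space><tab><space> sequence.
TAB=$(printf '\t')
sed "s/ / $TAB /" data > data.tabbed
```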

Interesting observation: character classes really slowed down GNU grep.

egrep '^83[[:blank:]]+...' takes twice as long as egrep '^83[ <tab>]+...': 30s versus 15s. With perl, the difference was approximately 0.6s.

As for the bash trinket, I won't bother fixing that. I'm not _that_ bored.

Thanks for living up to your nick.

Regards,
Alister

Thanks everybody for the interesting comments. They helped me try different options and work around the problem.

The values that I use for filtering are not fixed, so I had to be cautious with the regular expressions. Also, I have a number of other informative columns that I did not include in the original post, for simplicity.

Instead, as Corona688 suggested, I have realized that my main problem is the huge amount of data.

Therefore, I have made two changes to speed things up (not perfect, but acceptable):
(1) I sorted the files on the relevant columns, then used a modified awk line that does not need to scan the whole file but exits as soon as it passes the relevant range:

awk -v lim="$UPPERLIMIT" '$4 > lim { exit } { print }'

(2) I split the original (sorted) file (>50,000,000 lines) into smaller fragments covering the ranges of values that are a priori known to be relevant for the current filtering requirements.
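Put together, the two steps above might look something like this. The column numbers, limits, and file names are illustrative assumptions, not the poster's actual values:

```shell
# Hedged sketch of both steps, assuming the key is $1, the range column
# is $2, and the limits are 1000/2000.

# (1) Sort on the range column, then stop scanning at the upper limit.
sort -k2,2n data > data.sorted
awk -v lo=1000 -v hi=2000 '$2 > hi { exit } $1 == 83 && $2 >= lo' data.sorted

# (2) Split a key-sorted copy into per-key fragments so later queries
# only have to read the relevant piece.
sort -k1,1n data > data.bykey
awk '$1 != prev { close(out); out = "fragment_" $1 ".txt"; prev = $1 }
     { print > out }' data.bykey
```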

Use a relational database - they are specifically designed to do the type of queries you are talking about, and people spend their whole careers optimising them.

Other thoughts: if you have the files on a nice quick SAN or something, you might benefit from doing a multi-threaded lookup. Get all of your 16 cores working on the problem. You may need to split the file(s) up so you can run multiple processes, each against its own file.
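A hedged sketch of that idea using split and xargs -P; the chunk size, the core count, and the file names are all assumptions to adjust for the real machine:

```shell
# Hypothetical sketch: split the big file into chunks, filter them in
# parallel with one awk per core, then concatenate the results in order.
split -l 10000000 data part_
printf '%s\n' part_* |
    xargs -P 16 -I {} sh -c \
        'awk "\$1 == 83 && \$2 >= 1000 && \$2 <= 2000" {} > {}.out'
cat part_*.out > filtered
```

Note that xargs -P is a GNU/BSD extension rather than POSIX, and I/O contention may limit the gain if the files sit on a single slow disk.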

Hi there,

Newbie here. Can you help to explain the code above, particularly this part:

awk '/^83 *[12][0-9][0-9][0-9]/

Thanks in advance!

Hi, it should be

awk '/^83  *[12][0-9][0-9][0-9]/

(two spaces)

It means: match any line that starts (^) with 83, followed by one or more spaces, then a number whose first digit is 1 or 2 ([12]) followed by three more digits (each [0-9] matches a single digit).
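A small demonstration (the sample lines are made up). Note that the pattern alone also lets through out-of-range values like 2500, which is why the original answer kept the arithmetic check inside the braces:

```shell
# The pattern matches 83, one or more spaces, then a 1xxx or 2xxx
# number. 832 fails because a space must follow the 83.
printf '83 1453\n99 3255\n832 175\n83 2500\n' |
    awk '/^83  *[12][0-9][0-9][0-9]/'
# prints: 83 1453 and 83 2500 - the $2>=1000 && $2<=2000 guard is
# still needed to reject 2500
```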