Making a faster alternative to a slow awk command

Hi,

I have a large number of input files with two columns of numbers.

For example:

[X]    [Y]
83     1453
99     3255
99     8482
99     7372
83     175

I only wish to retain lines where the numbers fulfil two requirements, e.g.:
[X]=83
1000<=[Y]<=2000

To do this I use the following command:

awk '($1==83) && $2>=1000 && $2<=2000' [inputfile]

PROBLEM: My input files contain >50 million lines, so the awk command is too slow (it takes >2 minutes, and I have thousands of input files). Is there a way to make it faster? I have been told that it would be faster if I used Perl.
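Since the same filter has to run over thousands of files, a simple baseline is a loop applying the original awk command to each one. The file names and the `.filtered` output naming here are assumptions for illustration, not the poster's actual layout:

```shell
# Hypothetical sketch: apply the original awk filter to every input
# file, writing each result alongside its source (names are made up).
for f in input*.txt; do
    awk '$1 == 83 && $2 >= 1000 && $2 <= 2000' "$f" > "$f.filtered"
done
```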

If the first value is fixed, try:

awk '/^83 *[12][0-9][0-9][0-9]/{if($2>=1000 && $2<=2000){print}}' infile

Performance not tested... To avoid the arithmetic comparisons, try egrep:

$ egrep "83 (1...$|2000)" infile
83 1453
$

The problem, really, is that you have a huge amount of data, not a slow program. How big are your records, really?

I haven't heard Perl suggested as a way to increase speed before. I don't think using an awk regex instead of a simple == comparison is going to make it faster, either...

The egrep solution might be worth a shot.


That's a reasonable assumption, but it turns out to be incorrect (at least with the implementation I tested).

In my testing, the following code is over three times faster than the original solution:

awk '/^83  *(1[0-9][0-9][0-9]|2000)$/' data
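A quick sanity check of the pattern's boundaries (the sample values below are made up): `1[0-9][0-9][0-9]` covers 1000 through 1999, and the `2000` alternative covers the upper endpoint exactly.

```shell
# Boundary check: 999 and 2001 fall outside the pattern,
# 1000, 1999 and 2000 fall inside it.
printf '83 999\n83 1000\n83 1999\n83 2000\n83 2001\n' |
    awk '/^83  *(1[0-9][0-9][0-9]|2000)$/'
# prints: 83 1000, 83 1999 and 83 2000
```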

I'm curious to know if this solution is also faster on other implementations (nawk and gawk, specifically), but I won't be able to test on them today.

I used an obsolete Linux system for all of my testing.

Hardware: Pentium 2 @ 350 MHz (can you feel the power?)
Software: awk is mawk 1.3.3, perl 5.8.8, GNU (e)grep 2.5.1, GNU sed 4.1.5, GNU coreutils 5.97 (cat, wc)
Data: 14 megabytes. 6 line repeating pattern. 1,783,782 lines. 297,297 matches.
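For anyone who wants to reproduce a similar test, here is a hypothetical generator for data of the same shape. Only the counts (6-line pattern, 1,783,782 lines, 297,297 matches) follow the description above; the specific non-matching values are made up.

```shell
# Hypothetical generator: a 6-line repeating pattern with exactly one
# in-range line per repetition. 297297 * 6 = 1783782 lines total.
awk 'BEGIN {
    for (i = 0; i < 297297; i++) {
        print "83 1453"            # the single in-range line
        print "99 3255"
        print "99 8482"
        print "99 7372"
        print "83 175"             # right key, value below the range
        print "83 2500"            # right key, value above the range
    }
}' > data
```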

Slowest to fastest:

$ time egrep '^83 (1...|2000)$' data > /dev/null

real    0m15.170s
user    0m15.089s
sys     0m0.080s

$ time awk '$1==83 && $2>=1000 && $2<=2000' data > /dev/null

real    0m11.325s
user    0m11.213s
sys     0m0.112s

$ time perl -ne 'print if /^83 (1[0-9][0-9][0-9]|2000)$/' data > /dev/null

real    0m9.728s
user    0m9.629s
sys     0m0.100s

$ time sed d data

real    0m8.357s
user    0m8.277s
sys     0m0.080s

$ time awk '/^83  *[12][0-9][0-9][0-9]$/ {if ($2>=1000 && $2<=2000) print}' data > /dev/null

real    0m6.809s
user    0m6.692s
sys     0m0.116s

$ time awk '/^83  *(1[0-9][0-9][0-9]|2000)$/' data > /dev/null

real    0m3.555s
user    0m3.404s
sys     0m0.152s

$ time awk 0 data

real    0m1.898s
user    0m1.832s
sys     0m0.068s

$ time wc -l data > /dev/null

real    0m0.721s
user    0m0.316s
sys     0m0.128s

$ time cat data > /dev/null

real    0m0.084s
user    0m0.012s
sys     0m0.072s

Most surprising to me is how long it takes GNU sed to do nothing.

For everyone's amusement (GNU bash 3.1.17):

$ cat match.sh
while read -r line; do
    case $line in
        83\ 1???|83\ 2000) echo "$line";;
    esac
done

$ time sh match.sh < data > /dev/null

real    6m53.128s
user    6m28.776s
sys     0m24.150s

Regards,
Alister


That regular expression says that the space is optional. That's probably not a good idea. The way it's written, 832999 2000 would match.

That may require an anchor at the beginning (^) if numbers with more than 3 digits are possible in the first column. Also, the $ anchor should probably be moved so that it's just after the parenthesized group (for a similar reason).

Regards,
Alister


Absolutely. GNU tools in particular tend to be slower than their counterparts.

Whoops. My test data was delimited by a single space, so the output of the commands was correct, but the times were slightly underestimated due to the simpler regular expression.

Using ed, I replaced the single space in each line with a <space><tab><space> sequence. I re-ran the tests, replacing the <space> in the regular expression with [<space><tab>]+ , and the time for each test increased by 1 to 3 seconds with the rankings unchanged.
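For reference, the same edit can be sketched with sed instead of ed (this is my own equivalent, not the command actually used; the literal tab is embedded via printf, since not every sed understands \t in the replacement):

```shell
# Hedged sketch of the edit described above: replace the first space on
# each line of the test data with a <space><tab><space> sequence.
TAB=$(printf '\t')
sed "s/ / $TAB /" data > data.tabbed
```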

Interesting observation: character classes really slowed down GNU grep.

egrep '^83[[:blank:]]+...' takes twice as long as egrep '^83[ <tab>]+...': 30s versus 15s. With perl, the difference was approximately 0.6s.

As for the bash trinket, I won't bother fixing that. I'm not _that_ bored.

Thanks for living up to your nick.

Regards,
Alister

Thanks everybody for the interesting comments. They helped me try different options and work around the problem.

The values that I use for filtering are not fixed, so I had to be cautious with the regular expressions. Also, I have a number of other informative columns that I did not include in the original post, for simplicity.

Instead, as Corona688 suggested, I have realized that my main problem is the huge amount of data.

Therefore, I have made two changes to speed things up (not perfect, but acceptable):
(1) I sorted the files on the relevant columns, then used a modified awk line that does not need to scan the whole file but exits as soon as it passes the relevant range:

awk -v lim="$UPPERLIMIT" '$4 > lim { exit } { print }'

(2) I split the original (sorted) file (>50,000,000 lines) into smaller fragments covering the ranges of values that are a priori known to be relevant for the current filtering requirements.
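Put together, the two steps above might look something like this. The column numbers, limits, and file names are illustrative assumptions, not the poster's actual values:

```shell
# Hedged sketch of both steps, assuming the key is $1, the range column
# is $2, and the limits are 1000/2000.

# (1) Sort on the range column, then stop scanning at the upper limit.
sort -k2,2n data > data.sorted
awk -v lo=1000 -v hi=2000 '$2 > hi { exit } $1 == 83 && $2 >= lo' data.sorted

# (2) Split a key-sorted copy into per-key fragments so later queries
# only have to read the relevant piece.
sort -k1,1n data > data.bykey
awk '$1 != prev { close(out); out = "fragment_" $1 ".txt"; prev = $1 }
     { print > out }' data.bykey
```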

Use a relational database - they are specifically designed to do the type of queries you are talking about, and people spend their whole careers optimising them.

Other thoughts: if you have the files on a nice quick SAN or something, you might benefit from doing a multi-threaded lookup. Get all of your 16 cores working on the problem. You may need to split the file(s) up so you can run multiple processes, each against its own file.
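A hedged sketch of that idea using split and xargs -P; the chunk size, the core count, and the file names are all assumptions to adjust for the real machine:

```shell
# Hypothetical sketch: split the big file into chunks, filter them in
# parallel with one awk per core, then concatenate the results in order.
split -l 10000000 data part_
printf '%s\n' part_* |
    xargs -P 16 -I {} sh -c \
        'awk "\$1 == 83 && \$2 >= 1000 && \$2 <= 2000" {} > {}.out'
cat part_*.out > filtered
```

Note that xargs -P is a GNU/BSD extension rather than POSIX, and I/O contention may limit the gain if the files sit on a single slow disk.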

Hi there,

Newbie here. Can you help to explain the code above, particularly this part:

awk '/^83 *[12][0-9][0-9][0-9]/

Thanks in advance!

Hi, it should be

awk '/^83  *[12][0-9][0-9][0-9]/

(two spaces)

It means: match any line that starts (^) with 83, followed by one or more spaces, then a number whose first digit is 1 or 2 ([12]) followed by three more digits (each [0-9] matches a single digit).
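A small demonstration (the sample lines are made up). Note that the pattern alone also lets through out-of-range values like 2500, which is why the original answer kept the arithmetic check inside the braces:

```shell
# The pattern matches 83, one or more spaces, then a 1xxx or 2xxx
# number. 832 fails because a space must follow the 83.
printf '83 1453\n99 3255\n832 175\n83 2500\n' |
    awk '/^83  *[12][0-9][0-9][0-9]/'
# prints: 83 1453 and 83 2500 - the $2>=1000 && $2<=2000 guard is
# still needed to reject 2500
```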