That's a reasonable assumption, but it turns out to be incorrect (at least with the implementation I tested).
In my testing, the following code is over three times faster than the original solution:
awk '/^83 *(1[0-9][0-9][0-9]|2000)$/' data
I'm curious to know if this solution is also faster on other implementations (nawk and gawk, specifically), but I won't be able to test on them today.
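As a quick sanity check that the regular expression and the numeric comparison select the same lines, here is a minimal sketch. The sample lines are made up for illustration and are not taken from my data file. The two filters are not equivalent in general: a line such as 083 1500 would likely pass the numeric test but not the regular expression.

```shell
# Hypothetical sample lines (not the benchmark data).
sample=$(printf '%s\n' '83 999' '83 1000' '83 1500' '83 2000' '83 2001' '84 1500' '183 1500')

# Filter 1: the regular expression from this post.
regex_out=$(printf '%s\n' "$sample" | awk '/^83 *(1[0-9][0-9][0-9]|2000)$/')

# Filter 2: the numeric comparison from the original solution.
numeric_out=$(printf '%s\n' "$sample" | awk '$1==83 && $2>=1000 && $2<=2000')

# Show what the regex selected, then report whether the two results are identical.
printf '%s\n' "$regex_out"
if [ "$regex_out" = "$numeric_out" ]; then
    echo "both filters agree on this sample"
else
    echo "filters differ on this sample"
fi
```

On this sample both filters should select only the lines 83 1000, 83 1500 and 83 2000.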
I used an obsolete Linux system for all of my testing.
Hardware: Pentium 2 @ 350 MHz (can you feel the power?)
Software: awk is mawk 1.3.3, perl 5.8.8, GNU (e)grep 2.5.1, GNU sed 4.1.5, GNU coreutils 5.97 (cat, wc)
Data: 14 megabytes. 6 line repeating pattern. 1,783,782 lines. 297,297 matches.
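For anyone who wants to repeat the test on another awk, a file of the same shape can be generated with something like the sketch below. The function name make_data and the six lines in the block are hypothetical (they are not the contents of my data file); the only property carried over is that exactly one line in every six matches.

```shell
# make_data REPEATS: print a 6-line block REPEATS times.
# The six lines are invented for illustration; only "83 1500" matches the filters.
make_data() {
    awk -v n="$1" 'BEGIN {
        for (i = 0; i < n; i++)
            print "12 500\n83 999\n83 1500\n83 2001\n84 1500\n99 3000"
    }'
}

# Small demonstration: 2 repetitions give 12 lines, 2 of which match.
make_data 2 | wc -l
make_data 2 | awk '/^83 *(1[0-9][0-9][0-9]|2000)$/' | wc -l
```

Running make_data 297297 and redirecting it to a file would give 1,783,782 lines with 297,297 matches, the same counts as the data described above.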
Slowest to fastest:
$ time egrep '^83 (1...|2000)$' data > /dev/null
real 0m15.170s
user 0m15.089s
sys 0m0.080s
$ time awk '$1==83 && $2>=1000 && $2<=2000' data > /dev/null
real 0m11.325s
user 0m11.213s
sys 0m0.112s
$ time perl -ne 'print if /^83 (1[0-9][0-9][0-9]|2000)$/' data > /dev/null
real 0m9.728s
user 0m9.629s
sys 0m0.100s
$ time sed d data
real 0m8.357s
user 0m8.277s
sys 0m0.080s
$ time awk '/^83 *[12][0-9][0-9][0-9]$/ {if ($2>=1000 && $2<=2000) print}' data > /dev/null
real 0m6.809s
user 0m6.692s
sys 0m0.116s
$ time awk '/^83 *(1[0-9][0-9][0-9]|2000)$/' data > /dev/null
real 0m3.555s
user 0m3.404s
sys 0m0.152s
$ time awk 0 data
real 0m1.898s
user 0m1.832s
sys 0m0.068s
$ time wc -l data > /dev/null
real 0m0.721s
user 0m0.316s
sys 0m0.128s
$ time cat data > /dev/null
real 0m0.084s
user 0m0.012s
sys 0m0.072s
Most surprising to me is how long it takes GNU sed to do nothing.
For everyone's amusement (GNU bash 3.1.17):
$ cat match.sh
while read -r line; do
case $line in
83\ 1???|83\ 2000) echo $line;;
esac
done
$ time sh match.sh < data > /dev/null
real 6m53.128s
user 6m28.776s
sys 0m24.150s
Regards,
Alister
---------- Post updated at 01:23 PM ---------- Previous update was at 01:17 PM ----------
That regular expression says that the space is optional. That's probably not a good idea. The way it's written, 832999 2000 would match.
That may require an anchor at the beginning, ^, if numbers with more than 3 digits are possible in the first column. Also, the $ anchor should probably be moved so that it's just after the parenthesized group (for a similar reason).
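A small sketch of both problems follows. The "loose" pattern is a hypothetical stand-in written for illustration, not a copy of the expression being discussed, and the "strict" pattern is just one possible tightening.

```shell
# Two hypothetical input lines that should NOT be selected.
line1='832999 2000'   # first column is 832999, not 83
line2='183 1500'      # first column is 183, not 83

# Loose: no leading anchor, space optional (illustrative pattern only).
loose1=$(printf '%s\n' "$line1" | awk '/83 *[12][0-9][0-9][0-9]/')
loose2=$(printf '%s\n' "$line2" | awk '/83 *[12][0-9][0-9][0-9]/')

# Strict: leading ^, mandatory single space, $ right after the group.
strict1=$(printf '%s\n' "$line1" | awk '/^83 (1[0-9][0-9][0-9]|2000)$/')
strict2=$(printf '%s\n' "$line2" | awk '/^83 (1[0-9][0-9][0-9]|2000)$/')

# Square brackets make an empty (non-matching) result visible.
printf 'loose:  [%s] [%s]\n' "$loose1" "$loose2"
printf 'strict: [%s] [%s]\n' "$strict1" "$strict2"
```

The loose pattern should let both lines through, because "83" followed directly by "2999" satisfies it, as does the "83 1500" inside "183 1500". The strict pattern should reject both.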
Regards,
Alister