Mawk printf %d maxes out at 2147483647

So, I do some file processing that generates very large numbers, such as the total amount of data served via GET requests from a busy web cluster in a month. Mawk is awesome: fast and easy. It's awk! But there's a fatal flaw that I'd like to overcome. Apparently, %d maxes out at 2147483647. Here's sample output, with the first line sprintf'd using %d and the second just print'd:

total_size, total_count, average:       2147483647 50586 2493242
total_size, total_count, average:       1.26123e+11 50586 2.49324e+06

Is there any way I'm not thinking of to achieve the same result as %d? I'm using the very latest, bright and shiny mawk:

# mawk -W version
mawk 1.3.4 20131226

There are bug reports out there about this (google "mawk 2147483647") but no solutions so far. Many thanks in advance, and please pardon me if I've overlooked some solution. I was a bit fatigued when I went looking.

Try using "gawk".

That is a solution, of course, but I'm using mawk because of the size of the datasets we're working with. Depending on the task, we have anywhere from 3 to 7 times better performance. So, it's been an ongoing project to convert those tasks that we can to mawk. I'd like to do so for this one as well, as it is a very time-consuming job.

Try this then:

mawk 'BEGIN { printf "%.0f\n", 2147483648 }'

Ah, of course. That works. :) Don't know why I assumed that would also be broken. Many thanks.
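
To make the %.0f suggestion concrete, here is a minimal sketch of a totals report. The input numbers are made up for illustration, and awk stands for whichever implementation is on the PATH (mawk included). One thing to watch: %.0f rounds where %d truncates, so the average is passed through int() first.

```shell
# Hypothetical input: one byte count per line (values invented for this sketch).
out=$(printf '%s\n' 3000000000 4000000000 500 1 |
  awk '{ total += $1; count++ }
       END {
         # %.0f formats the double directly, so a 32-bit %d clamp never applies.
         # int() drops the fraction first, which is what %d would have done.
         printf "%.0f %d %.0f\n", total, count, int(total / count)
       }')
echo "$out"
```

With these inputs the total is 7000000501, well past 2147483647, and it should print in full.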

I've dug into mawk's code a bit, and switching it to a 64-bit integer isn't quite as easy as it seems. It's a sticky problem because of the mutability of numbers in awk. They are quite careful to pair a 32-bit int with a 64-bit double, since every 32-bit integer can be faithfully represented by a 64-bit float. But what happens when your int is 64-bit? Not all 64-bit integers can be represented exactly in the 53 bits of precision of a 64-bit float.
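
The 53-bit limit can be seen from awk itself. This is a small sketch, assuming the usual IEEE 754 double arithmetic; it formats with %.0f so that the %d clamp does not get in the way.

```shell
# 2^53 is the point up to which every integer fits exactly in a double.
# Adding 1 to it is lost to rounding, so both lines print the same value.
out=$(awk 'BEGIN {
  printf "%.0f\n", 2^53
  printf "%.0f\n", 2^53 + 1
}')
echo "$out"
```

On a typical 64-bit system both lines come out as 9007199254740992.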

It also passes its printf options through to the system printf almost completely faithfully, except for a weird case they added in 1995 for a system that only had 16-bit ints. I suspect another such special case would be needed for 64 bits.

In fact, I'd go so far as to say the mawk developers might do better, in readability and performance, to write their own printf. Cooperating with every awkward printf of the last 20 years has made it very strange internally.

The old SysV Unix /bin/oawk and /bin/nawk have the same limit:

echo 2147483649 | /bin/oawk '{printf "%d\n", $1}'
2147483647
echo 2147483649 | /bin/nawk '{printf "%d\n", $1}'
2147483647

While print has a higher limit:

echo 2147483649 | /bin/oawk '{print $1}'
2147483649
echo 2147483649 | /bin/nawk '{print $1}'
2147483649

@MadeinGermany, the latter is because the field is treated as a string, not a number:

$ echo 214744445564646464646454646646464646466666646456456546456456466464646483649 | awk '{print $1}'
214744445564646464646454646646464646466666646456456546456456466464646483649

Oh, I overlooked that. I have to force a number:

echo 2147483649 | /bin/oawk '{print $1+0}'
2147483649
echo 2147483649 | /bin/nawk '{print $1+0}'
2147483649

Interesting: GNU awk immediately escapes to floating point here.

Here is what I think the standard says (whether or not it matters is another matter).

POSIX mandates doubles for all AWK numerics [1]:

So, at least at the outset, the valid integral range is approximately -2^53 to 2^53.

What should happen when you try to print the integer using printf "%d" bigint ? POSIX AWK says [1]:

The File Format Notation [2] section then says:

Note the use of "integer", not "int" as is the case in the C library implementation. The latter is a specific data type, the former is not. In this context, what exactly is an "integer"?

In Concepts Derived from the ISO C Standard [3], POSIX says:

So, it seems to me that POSIX requires compliant AWK implementations to store all values as doubles and to use at least signed long when converting an integral double to an integral type.

If this is correct, a compliant AWK implementation on an LP64 platform (most 64-bit UNIX) should never lose precision when converting a double to an integer.

I know that nawk does indeed store all numerics as doubles. However, it casts to signed int during printf "%d" ... .

References:
[1] POSIX: AWK
[2] POSIX: File Format Notation
[3] POSIX: Concepts Derived from the ISO C Standard

Regards,
Alister

Interesting. So neither mawk (in typical versions) nor nawk is POSIX compliant. Case in point on Solaris:

$ echo 2147483649 | /bin/nawk '{printf "%d\n", $1}'
2147483647
$ echo 2147483649 | /usr/xpg4/bin/awk '{printf "%d\n", $1}'
2147483649
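
Since the behavior differs between implementations, a script that has to run under several awks could probe for it once. The following is only a sketch: the variable name fmt is illustrative, and the fallback assumes that %.0f is acceptable wherever %d clamps (it rounds rather than truncates, so wrap non-integers in int() first).

```shell
# Probe once: does this awk's %d survive a value just above 2^31 - 1?
# If not, fall back to %.0f, the workaround suggested earlier in the thread.
fmt=$(awk 'BEGIN {
  if (sprintf("%d", 2147483649) == "2147483649") print "%d"
  else print "%.0f"
}')

# Reuse the chosen conversion for the real output.
out=$(echo 2147483649 | awk -v fmt="$fmt" '{ printf fmt "\n", $1 }')
echo "$out"
```

Either branch should print 2147483649 for this input.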