Strange value of the double type variable: -nan(0x8000000000000)

915086731 · May 30, 2014, 3:25am

I am confused by the value of "currdisk->currangle" after an addition. Initially the value of "currdisk->currangle" is 0.77500000000000013, but after the addition it changes to "-nan(0x8000000000000)". Can anyone explain? Thanks! The following is the gdb session.

3338          currdisk->currangle += (simtime - seg->time) / currdisk->rotatetime;
(gdb) p currdisk->currangle
$28 = 0.77500000000000013
(gdb) p (simtime - seg->time) / currdisk->rotatetime
$29 = 0.00833333333333325
(gdb) p (simtime - seg->time) 
$30 = 0.092592592592591672
(gdb) p currdisk->rotatetime
$31 = 11.111111111111111
(gdb) n

(gdb) p currdisk->currangle 
$32 = -nan(0x8000000000000)
(gdb) p/x (char[8])currdisk->currangle 
$52 = {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xf8, 0xff}
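For readers following along, the byte dump above can be decoded by hand. The following is a minimal sketch, assuming a little-endian target (as on the x86 machine in this thread) and IEEE 754 binary64 doubles; the helper names are made up for illustration and are not part of the original program.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Reassemble a 64-bit pattern from 8 bytes given in little-endian order,
   which is the order gdb printed them in with p/x (char[8]). */
static uint64_t bits_from_le_bytes(const unsigned char b[8])
{
    uint64_t bits = 0;
    for (int i = 7; i >= 0; i--)
        bits = (bits << 8) | b[i];
    return bits;
}

/* Reinterpret a 64-bit pattern as a double without violating aliasing rules. */
static double double_from_bits(uint64_t bits)
{
    double d;
    assert(sizeof d == sizeof bits);
    memcpy(&d, &bits, sizeof d);
    return d;
}

/* Print the sign, exponent and fraction fields of a binary64 pattern. */
static void describe_bits(uint64_t bits)
{
    unsigned sign = (unsigned)(bits >> 63);
    unsigned exponent = (unsigned)((bits >> 52) & 0x7ff);
    uint64_t fraction = bits & 0xfffffffffffffULL;
    printf("sign=%u exponent=0x%x fraction=0x%llx isnan=%d\n",
           sign, exponent, (unsigned long long)fraction,
           isnan(double_from_bits(bits)) != 0);
}
```

Feeding it the bytes {0x0, 0x0, 0x0, 0x0, 0x0, 0x0, 0xf8, 0xff} gives the pattern 0xfff8000000000000: sign bit set, exponent all ones, fraction 0x8000000000000. That is what gdb abbreviates as -nan(0x8000000000000), and it is also the default quiet NaN that x86 floating-point hardware produces for an invalid operation such as 0.0/0.0.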
(gdb)

Then I changed
currdisk->currangle += (simtime - seg->time) / currdisk->rotatetime ;
to

double tmp1 = (simtime - seg->time) / currdisk->rotatetime; 
currdisk->currangle += tmp1;

With this change the value of currdisk->currangle is normal. Can anyone explain this confusing behavior?

---------- Post updated 05-30-14 at 02:25 AM ---------- Previous update was 05-29-14 at 10:56 PM ----------

All of the variables involved are of type double.

Perderabo · May 30, 2014, 8:07am

First, take a long hard look at the currdisk structure. If there is any chance that (double)currdisk->currangle and (double)currdisk->rotatetime overlap, that is your problem. Having these overlap is the only reasonable explanation that I see. When you introduced tmp1, you hid the fact that both are in use at the same time.

915086731 · May 30, 2014, 8:29am

Both (double)currdisk->currangle and (double)currdisk->rotatetime are defined in the same struct, so obviously the two members cannot overlap. After I introduced tmp1, sometimes I get NaN and sometimes I get a normal value.
Maybe it is a bug in Linux. Is that possible?

Perderabo · May 30, 2014, 8:46am

It is possible that it is a compiler bug. It wouldn't be the first. But it's more likely that it's a bug in your code somehow. Everybody who ever has a mysterious bug that they can't find will almost always point a finger at the compiler. And they are right about 1% of the time.

915086731 · May 30, 2014, 8:54am

Here is the assembly code:

       tmp1 = (simtime - seg->time) / currdisk->rotatetime;
0x0808fdf4  <disk_buffer_sector_done+559>:  fldl   0x80b2208
0x0808fdfa  <disk_buffer_sector_done+565>:  mov    -0x38(%ebp),%eax
0x0808fdfd  <disk_buffer_sector_done+568>:  fldl   (%eax)
0x0808fdff  <disk_buffer_sector_done+570>:  fsubrp %st,%st(1)
0x0808fe01  <disk_buffer_sector_done+572>:  mov    0x8(%ebp),%eax
0x0808fe04  <disk_buffer_sector_done+575>:  fldl   0xc4(%eax)
0x0808fe0a  <disk_buffer_sector_done+581>:  fdivrp %st,%st(1)
0x0808fe0c  <disk_buffer_sector_done+583>:  fstpl  -0x28(%ebp)
      currdisk->currangle += tmp1;
0x0808fe0f  <disk_buffer_sector_done+586>:  mov    0x8(%ebp),%eax
0x0808fe12  <disk_buffer_sector_done+589>:  fldl   0x284(%eax)
0x0808fe18  <disk_buffer_sector_done+595>:  faddl  -0x28(%ebp)
0x0808fe1b  <disk_buffer_sector_done+598>:  mov    0x8(%ebp),%eax
0x0808fe1e  <disk_buffer_sector_done+601>:  fstpl  0x284(%eax)

Perderabo · May 30, 2014, 10:26am

The only thing I have to suggest will be a lot of work. Write a tiny workable fragment of code that will set up your initial values. Then execute:

currdisk->currangle += (simtime - seg->time) / currdisk->rotatetime ;

And then have a printf to display the result. This will probably work reliably.

Now, bit by bit, add a few lines from your original program to the fragment. After each addition, retest. Eventually you will add a few lines that break the program. That will be enough of a clue to find your bug. (The only other possible outcome is that you will have a complete second copy of original program that somehow works correctly. Use diff to find out why.)
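A starting fragment of the kind described above might look like the sketch below. The struct layouts are hypothetical stand-ins that contain only the fields used on the failing line, and simtime is assumed to be a global double (the disassembly loads it from an absolute address). The thread only gives the difference simtime - seg->time, so the individual values used to call it are assumptions as well.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical, cut-down versions of the real structures. */
struct disk {
    double currangle;
    double rotatetime;
};

struct segment {
    double time;
};

/* Assumed to be a global in the original program. */
static double simtime;

/* The statement under test, copied from the original post. */
static void advance_angle(struct disk *currdisk, const struct segment *seg)
{
    assert(currdisk->rotatetime != 0.0);  /* guard against an accidental 0/0 */
    currdisk->currangle += (simtime - seg->time) / currdisk->rotatetime;
}

/* Print the result with enough digits to compare against the gdb output. */
static void print_angle(const struct disk *currdisk)
{
    printf("currangle = %.17g\n", currdisk->currangle);
}
```

With currangle = 0.77500000000000013, rotatetime = 11.111111111111111, seg->time = 0.0 and simtime = 0.092592592592591672, the result should come out close to 0.7833 rather than NaN. Lines from the original program can then be added around the call until the NaN appears.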

jim_mcnamara · May 30, 2014, 10:33am

FWIW:

I have had code run correctly when compiled for debug, and fail when not compiled that way. Rare.

Similar issues with levels of optimization. Also rare. Same with a 32bit compile versus a 64bit compile. Only once - it was a code issue.

To eliminate compiler complications:

gcc -O0 yourprog.c -o  yourprog -lm

I would guess you have a makefile; either way, turn off all optimizations. Run the code.
If it runs and you are positive there are no coding errors, then it may be an optimization doing something untoward.

915086731 · May 31, 2014, 12:32am

tmp1 = (simtime - seg->time) / currdisk->rotatetime;
currdisk->currangle += tmp1;

The above code can also produce NaN on a later call.
So I stepped into the assembly.
tmp1 = (simtime - seg->time) / currdisk->rotatetime;

0x0808fdf4  <disk_buffer_sector_done+559>:  fldl   0x80b2208
0x0808fdfa  <disk_buffer_sector_done+565>:  mov    -0x38(%ebp),%eax
0x0808fdfd  <disk_buffer_sector_done+568>:  fldl   (%eax)

...Here, %eax contains 0x80bdfeb, which is the address of seg->time.
After "fldl (%eax)" executes, register st0 contains 0x8000000000000000, which represents NaN.
The NaN then propagates through the following instructions. This is the key issue.

0x0808fdff  <disk_buffer_sector_done+570>:  fsubrp %st,%st(1)
0x0808fe01  <disk_buffer_sector_done+572>:  mov    0x8(%ebp),%eax
0x0808fe04  <disk_buffer_sector_done+575>:  fldl   0xc4(%eax)
0x0808fe0a  <disk_buffer_sector_done+581>:  fdivrp %st,%st(1)
0x0808fe0c  <disk_buffer_sector_done+583>:  fstpl  -0x28(%ebp)
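The propagation described in this post can be reproduced in isolation. The sketch below uses plain doubles as stand-ins for simtime, seg->time and currdisk->rotatetime; the function and enum names are hypothetical. It shows that one NaN operand is enough to turn the final sum into NaN, and it gives a way to report which operand was already bad before the arithmetic started.

```c
#include <assert.h>
#include <math.h>

/* Same arithmetic as the failing line, on plain doubles. */
static double updated_angle(double currangle, double simtime,
                            double seg_time, double rotatetime)
{
    assert(rotatetime != 0.0);
    double delta = (simtime - seg_time) / rotatetime;
    return currangle + delta;
}

enum nan_source {
    NAN_NONE = 0,
    NAN_SIMTIME,
    NAN_SEG_TIME,
    NAN_ROTATETIME
};

/* Report the first operand that is already NaN before any arithmetic runs. */
static enum nan_source first_nan_operand(double simtime, double seg_time,
                                         double rotatetime)
{
    if (isnan(simtime))
        return NAN_SIMTIME;
    if (isnan(seg_time))
        return NAN_SEG_TIME;
    if (isnan(rotatetime))
        return NAN_ROTATETIME;
    return NAN_NONE;
}
```

Calling first_nan_operand just before the update in the real program would stop the search at the load of seg->time, which is where the NaN enters according to the trace above, instead of at the later store to currdisk->currangle.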

Perderabo · June 2, 2014, 1:34pm

I don't really know this assembly language, but I looked up fldl and my source said it is a floating-point double load. The only reasons that I can think of for that to result in a NaN are:

There really was a NaN there to load.
The address in question is not correctly aligned in core to be treated as a double.
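Both possibilities can be checked from inside the program. The sketch below is only an illustration with made-up helper names: it tests an address for 4-byte and 8-byte alignment and reads the value through memcpy, so the check itself does not depend on the pointer being aligned.

```c
#include <assert.h>
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Returns 1 if addr is a multiple of align; align must be a power of two. */
static int is_aligned(uintptr_t addr, uintptr_t align)
{
    assert(align != 0 && (align & (align - 1)) == 0);
    return (addr & (align - 1)) == 0;
}

/* Report the alignment of p and whether the 8 bytes stored there form a NaN. */
static void report_double(const char *name, const void *p)
{
    double v;
    memcpy(&v, p, sizeof v);  /* safe even if p is misaligned */
    printf("%s: address %p, 4-byte aligned: %d, 8-byte aligned: %d, NaN: %d\n",
           name, (void *)p,
           is_aligned((uintptr_t)p, 4), is_aligned((uintptr_t)p, 8),
           isnan(v) != 0);
}
```

For the address reported earlier in the thread, is_aligned(0x80bdfeb, 4) and is_aligned(0x80bdfeb, 8) both return 0, because the address is odd. As far as I know x86 normally tolerates misaligned double loads, so this alone may not explain the NaN, but an odd address for a double member suggests that seg may not be pointing at a valid segment structure at that moment.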