the problem of gettimeofday

dophine · July 31, 2008, 2:43am

Hi all,

I just wrote a small problem to check gettimeofday in a multi-cores environment.


int timediff(double fTimeQvs, double fTimeTcp)
{
        int iTimeDiff;
        iTimeDiff = (((((int)fTimeQvs)/10000) - (((int)fTimeTcp)/10000)) * 3600) * 1000000;
        iTimeDiff += ((((((int)fTimeQvs)/100)%100) - ((((int)fTimeTcp)/100)%100)) * 60) * 1000000;
        iTimeDiff += ((((int)fTimeQvs)%100) - (((int)fTimeTcp)%100)) * 1000000;
        iTimeDiff += (((long long)(fTimeQvs*1000000))%1000000) - (((long long)(fTimeTcp*1000000))%1000000);

    return iTimeDiff;
}

int main()
{
        struct timeval cur;
        gettimeofday(&cur, 0);

        for(;;)
        {
                struct timeval new_time;
                gettimeofday(&new_time, 0);

                double diff = difftime(new_time.tv_sec,cur.tv_sec);
                if(diff > 1.0)
                        printf("Big:%ld.%ld,%ld.%ld,%lf\n",new_time.tv_sec,new_time.tv_sec, cur.tv_sec,cur.tv_usec,diff);
                else if(diff < 0.0)
                        printf("Small::%ld.%ld,%ld.%ld,%lf\n",new_time.tv_sec,new_time.tv_usec, cur.tv_sec,cur.tv_usec,diff);
                cur.tv_sec = new_time.tv_sec;
        }
        return 0;
}

Here is the result:
Small::1217485404.98014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.99014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.100014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.101013,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.102014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.103014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.104014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.105014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.106014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.107013,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.108013,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.109013,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.110014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.111014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.112014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.113014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.114014,1217489802.813903,-4398.000000

so is there any way to overcome it?

jim_mcnamara · July 31, 2008, 3:09pm

First off, try man timercmp. If you are on older systems it may not exist.
Otherwise -- Look for something like this in your sys/time.h file

#  define timerisset(tvp)               ((tvp)->tv_sec || (tvp)->tv_usec)
#  define timercmp(tvp, uvp, cmp) \
          ((tvp)->tv_sec cmp (uvp)->tv_sec || \
           (tvp)->tv_sec == (uvp)->tv_sec && (tvp)->tv_usec cmp (uvp)->tv_usec)
#  define timerclear(tvp)               ((tvp)->tv_sec = (tvp)->tv_usec = 0L)
#  define is_valid_timeval(tvp) \
          ((tvp)->tv_usec >= 0L && (tvp)->tv_usec < 1000000L)

Note - these are from an HPUX 11.23 implementation - use the ones written for your system....

These macros are defined to work with timeval structs and perform arithmetic operations on them.

dophine · August 1, 2008, 3:05am

Thank you first

# define timerisset(tvp)        ((tvp)->tv_sec || (tvp)->tv_usec)
# define timerclear(tvp)        ((tvp)->tv_sec = (tvp)->tv_usec = 0)
# define timercmp(a, b, CMP)                                                  \
  (((a)->tv_sec == (b)->tv_sec) ?                                             \
   ((a)->tv_usec CMP (b)->tv_usec) :                                          \
   ((a)->tv_sec CMP (b)->tv_sec))

these codes are in the kernel. Does it mean I need to add "is_valid_timeval"?
or I need modify it to ignore those values moving forward and backward?

jim_mcnamara · August 2, 2008, 12:25pm

You probably have others like timersub or timeradd that let you do actual arithmetic operations.

I think you may have some problems using (int) values on the time_t tv_sec - which is an unsigned int on most systems. Current epoch seconds can overflow overflow a signed integer, for example on subtraction. In other words, your timediff function is using wrong datatypes, use the standard difftime() it is declared in time.h and works correctly.

dophine · August 3, 2008, 12:08pm

let us take this "Big:1217489802.1217489802,1217485404.813903,4398.000000" as an example.
1217489802-1217485404=4398

so, there are some problems for the difftime function?

jim_mcnamara · August 3, 2008, 7:27pm

No, your timediff function. Not the std C library function difftime.

dophine · August 3, 2008, 9:16pm

Sorry, I think it is misleading. the function I used in Main is stardard function difftime, not my own function timediff.

shamrock · August 4, 2008, 12:21pm

What do you mean by "problems with the difftime function"
The difference between the programmatically and manually computed times is correct and your printf() doesn't have the microsecond counter so I added it in.

printf("Big:%ld.%ld,%ld.%ld,%lf\n",new_time.tv_sec,new_time.tv_usec, cur.tv_sec,cur.tv_usec,diff);

What do you get if you just subtract cur.tv_usec from new_time.tv_usec without calling difftime().

diff = new_time.tv_usec - cur.tv_usec;

dophine · August 4, 2008, 9:52pm

just forget this.

shamrock:

printf("Big:%ld.%ld,%ld.%ld,%lf\n",new_time.tv_sec,new_time.tv_usec, cur.tv_sec,cur.tv_usec,diff);
What do you get if you just subtract cur.tv_usec from new_time.tv_usec without calling difftime().
diff = new_time.tv_usec - cur.tv_usec;

As I expect the time does not go back or forward by more 1 second, the micro second part is not my concern. the same result occurs suing subtraction directly. So I suspect there is a problem for gettimeofday.

shamrock · August 4, 2008, 11:55pm

What platform are you on...cast the null pointer correctly in your call to gettimeofday() and see if that affects the result.

gettimeofday(&cur, (void *)0);

dophine · August 5, 2008, 1:21am

red hat in a multi-cores platform.

it said invalid conversion from `void*' to `timezone*'.
if I change to gettimeofday(&cur, (timezone *)0); it said "expected primary-expression before ')' token"

Perderabo · August 5, 2008, 7:25am

dophine:

Here is the result:
Small::1217485404.98014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.99014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.100014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.101013,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.102014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.103014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.104014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.105014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.106014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.107013,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.108013,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.109013,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.110014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.111014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.112014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.113014,1217489802.813903,-4398.000000
Big:1217489802.1217489802,1217485404.813903,4398.000000
Small::1217485404.114014,1217489802.813903,-4398.000000

You are alternating between two values:
1217485404
1217489802

It looks like the first two calls to gettimeofday() worked and then it started failing. I can't say why, but I certainly don't understand why you are blaming a multi-core environment. System calls return -1 and set errno when they fail. Checking that is the documented way to discover the reason for the failure.

shamrock · August 6, 2008, 3:48pm

I agree with Perderabo and great catch by the way. I didn't pay attention to the data but it is not incrementing the gettimeofday() output as it should so looks like that call is failing. You need to check the return value from gettimeofday() to make sure it is working correctly.

dophine · August 6, 2008, 10:03pm

I have checked it. It always returns 0.
the reason why I said it may be the problem of multi-core is I found this website
linux-kernel - gettimeofday() jumping into the future

Perderabo · August 7, 2008, 10:37pm

This is all a bit new to me, but I'm coming up to speed. You are almost certainly correct that you have a gettimeofday() problem and the multi-core environment is to blame.

I was congratulating myself for noticing that you are alternating between 2 different values. Too bad I didn't also notice that they were over 1 hour apart. Two cpu's are fighting over which second it is. Neither tried to increment the second because all of the calls you show occurred during the same second.

What OS are you using? If linux, please tell us the distro and the kernel rev level (use uname -a to find out). Also what cpu do you have? Workstation or laptop?

Some background: Intel introduced a feature called TSM (time stamp counter). It gets incremented with each CPU cycle. AMD copied the idea. Linux (and maybe other unix versions too?) is using this to improve the accuracy of the clock. It worked great for a while but stuff like hibernation and speed-step screwed it up for laptops. Speed-step is now used for workstations (reduce power consumption). And multi-cores really hurt it.

dophine · August 8, 2008, 3:05am

OS: red hat 4
kernel: 2.6.18-92.el5PAE
CPU: Intel(R) Xeon(R) CPU X7350 @ 2.93GHz
16 cores.

The magic number 4389 is my concern. It occurs every time when this problem arises.
Therefore, I try to do some search in the internet and found that website.

Since it may be the kernel issue, I have no idea how to fix. As the website mentioned the cause may be the clock source if it is TSC. However, the clock source I use is HPET.

Perderabo · August 9, 2008, 9:56am

How do you know that you're using HPET? Regardless, you have a kernel bug that has been discussed quite a bit recently and yet you are using an older release of RedHat. Upgrading would be the thing I would try. Once you're on the current (or at least recent) release, if you still have a problem, try opening a call to RedHat support.

dophine · August 11, 2008, 11:45pm

my colleague told me about it. I found redhat provides a patch. I hope it can solve it.