select() system call takes longer than the timeout specified

Below is my code. Every once in a while the select() call takes as long as 150 seconds (discovered by printing the time before and after this statement), while the timeout specified to it is only 1 second. Any clue why? I can't believe that select(), which has been around for decades, can have a bug, but I don't see what I'm doing wrong.

 
    fd_set fdread;
    fd_set fdwrite;
    fd_set fdexcep;
    int maxfd;
    struct timeval timeout;
    FD_ZERO(&fdread);
    FD_ZERO(&fdwrite);
    FD_ZERO(&fdexcep);
    /* set a suitable timeout to play around with */
    timeout.tv_sec = 1;
    timeout.tv_usec = 0;
    /* get file descriptors from the transfers */
    curl_multi_fdset(cmh, &fdread, &fdwrite, &fdexcep, &maxfd);
    int ret = select(maxfd+1, &fdread, &fdwrite, &fdexcep, &timeout);

Is your timeout value a struct timeval or a struct timespec? select() uses struct timeval, pselect() uses struct timespec.

Neither. It is struct timeval.

Hmm. That's the correct structure... but select() has an odd side effect: it modifies the timeval you give it, to reflect how much time was left. You have to re-initialize it every time to get the same timeout!

Yes, I'm updating the timeout value every time.

How do you know select() is actually taking that long? Maybe your process is getting swapped out? As you said, select() isn't exactly untested code.

Also, if you're running on Solaris, and your file descriptor set is large and not constant, read the man page for "poll.7d". It scales much better: it only returns active fds, so you don't have to iterate through every fd each time to see whether it's active, and you don't have to rebuild the entire list of fds every time - you just add and delete fds in a set that the kernel tracks.

There is clearly something else in your code/system that is not shown. select() was, and still is, actively used as a very reliable way to implement sleep(), usleep(), and other "sleep" calls on old systems that don't support them. Almost certainly select() is not directly causing your problem - if what you posted is correct.

If your process has low priority, or is running as a normal process on a realtime system, you may get extended waits - but not 120 seconds.

Even if you are doing writes on the fds select() is waiting on, with huge (multi-GB) data elements, I would still find 120 seconds hard to believe. It looks like you are calling curl - so it is probably a Linux box, right?

We need: OS version, kernel version, and what type of system you have - realtime, desktop, etc.

Have you traced your code and watched it wait in select() 120 seconds, for example?

---------- Post updated 03-31-10 at 06:17 ---------- Previous update was 03-30-10 at 20:27 ----------

Q: Where is int maxfd initialized? Is it set to some fixed value?

Linux 2.6.9-67.EL. It is VM on a desktop. There are no other processes running, it is not a shared system.

I just do gettimeofday(&current_time, NULL); before and after.

maxfd need not be initialized; its value is set by the call to curl_multi_fdset() as an out-parameter. I printed the value thus set and it was 8.

You are not using blocking sockets, are you? That will break select().

for example see:
Linux-Kernel Archive: Re: UDP recvmsg blocks after select(), 2.6 bug?

Sorry I was busy with other tasks for the past few days.

No, I'm not operating on sockets directly. I'm just using the libcurl library, which is also a widely used library, and I don't think they would do something silly like that.

In what way?

    fcntl(sockfd, F_SETFL, O_NONBLOCK);

where sockfd is one of the fds select() is waiting on. You HAVE to do this or you break select(). Period.

I dunno what the libcurl folks did, but I think this is your problem - blocking.

Just from looking at that single post and its title, I believe you misunderstood the discussion/conclusion in that thread.

When the kernel people speak of "breaking" select(), they mean altering the behavior of their select() implementation. The topic of that debate seems to be whether a select() hit on a blocking socket GUARANTEES that a subsequent recvmsg() call will not block. It doesn't guarantee that, but most programmers expect it to, hence the question of whether it is right for select() to "lie" about state.

However, select() is better thought of as a temporary snapshot of I/O state. If select() indicates readability on a file descriptor, that does not mean that a subsequent read operation is guaranteed not to block.

Likewise, if stat() indicates that a file is 1363 bytes big, that does not mean that a subsequent open() + read() will be able to retrieve 1363 bytes from it, because the file may have been deleted or resized in the meantime.


As for the original problem:

As others have stated, I also believe that you should make sure you are really measuring the right thing. It is highly unlikely that select() doesn't behave "as advertised".

Make sure that your gettimeofday() calls only measure select(). Perhaps you didn't just measure select() but also the time spent reading and processing data?

When you used gettimeofday() to record time, did you also make sure to use those same data structures when printing the time difference? How did you print it - did you maybe mix up seconds and microseconds? Are you sure your print statements aren't adding any extra delay because you didn't flush or unbuffer your output file?

Please post your real code. What you posted originally does not contain the gettimeofday() calls you said you used to measure it.

jim mcnamara's suggestion to trace the process is also a good one - if possible, run your program through "strace ./prog" and see whether it really spends a long time in select().

What does this mean for a socket, however, which cannot be deleted or resized? According to the Linux man page, it happens when received data is discarded due to checksum errors, etc.

I think some versions of their implementation also drop packets during system memory shortage even after having reported availability with select(). For TCP sockets, I believe there may be a race condition between select() and accept() on some implementations if a half-opened connection is dropped before it can be accepted. But that one may just as well always yield ECONNABORTED, at least on current systems (as indicated by accept )

It is unlikely that you will encounter these problems, so select() + blocking sockets mostly works. :)

Sorry, I was tied up for some time and couldn't provide more info.
Here is the complete code http://www.unix.com/attachment.php?attachmentid=1451&stc=1&d=1271961027

It makes a request to a service that takes 2 seconds to return a response.
While executing in a loop, most of the time the select() call takes 0 or 1 second, but occasionally it takes as much as 98 or 105 seconds.