problem with accept() on Fedora 8

Akimaki · January 11, 2008, 8:52am

hi,

accept() still seems to block after the socket has been closed on our Fedora 8 build. I'm not sure whether this is a common problem, because I have never seen it on any other platform, although I have seen someone else report it on Red Hat 7 and 9. Is there a socket option that Fedora sets differently, or does anyone know an appropriate way to get that thread out of accept()?

thanks in advance

ramen_noodle · January 11, 2008, 12:26pm

Post the code section that is problematic or an strace of the process state during the blocking interval after the socket is closed please.

Akimaki · January 15, 2008, 12:06pm

Well, the code is quite large, so I'll skip pasting even pieces of it here. Instead I put together a little test where I just spawned a thread which created a socket and called accept(). Then in main I waited a few seconds, closed the socket and joined the thread. I ran it in a debugger and, of course, it got stuck in accept().

uname({sys="Linux", node="hal9000", ...}) = 0
futex(0x3012adc, FUTEX_WAKE_PRIVATE, 2147483647) = 0
brk(0)                                  = 0x9065000
brk(0x9086000)                          = 0x9086000
clock_gettime(CLOCK_REALTIME, {1200419840, 211248690}) = 0
clock_gettime(CLOCK_REALTIME, {1200419840, 211319847}) = 0
open("/dev/urandom", O_RDONLY)          = 3
fstat64(3, {st_mode=S_IFCHR|0666, st_rdev=makedev(1, 9), ...}) = 0
ioctl(3, SNDCTL_TMR_TIMEBASE or TCGETS, 0xbffe8bbc) = -1 EINVAL (Invalid argument)
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f3a000
read(3, "\310\foW^R\270\212\315\227\323\250\215\275\235z\267\336\263}\261\0\227\261\275\344\302\201\351\226,H"..., 4096) = 4096
close(3)                                = 0
munmap(0xb7f3a000, 4096)                = 0
fstat64(1, {st_mode=S_IFCHR|0620, st_rdev=makedev(136, 0), ...}) = 0
mmap2(NULL, 4096, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7f3a000
clock_gettime(CLOCK_REALTIME, {1200419840, 213952202}) = 0
clock_gettime(CLOCK_REALTIME, {1200419840, 214048675}) = 0
clock_gettime(CLOCK_REALTIME, {1200419840, 214158273}) = 0
clock_gettime(CLOCK_REALTIME, {1200419840, 214242900}) = 0
socket(PF_INET, SOCK_STREAM, IPPROTO_TCP) = 3
bind(3, {sa_family=AF_INET, sin_port=htons(6111), sin_addr=inet_addr("0.0.0.0")}, 16) = 0
setsockopt(3, SOL_SOCKET, SO_LINGER, {onoff=1, linger=0}, 8) = 0
ioctl(3, FIONBIO, [0])                  = 0
mmap2(NULL, 10489856, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0xb7524000
mprotect(0xb7524000, 4096, PROT_NONE)   = 0
clone(child_stack=0xb7f244a4, flags=CLONE_VM|CLONE_FS|CLONE_FILES|CLONE_SIGHAND|CLONE_THREAD|CLONE_SYSVSEM|CLONE_SETTLS|CLONE_PARENT_SETTID|CLONE_CHILD_CLEARTID, parent_tidptr=0xb7f24bd8, {entry_number:6, base_addr:0xb7f24b90, limit:1048575, seg_32bit:1, contents:0, read_exec_only:0, limit_in_pages:1, seg_not_present:0, useable:1}, child_tidptr=0xb7f24bd8) = 19928
clock_gettime(CLOCK_REALTIME, {1200419840, 215085087}) = 0
write(1, "starting..(0 s) [test] started o"..., 44) = 44
nanosleep({3, 0}, {0, 1132736})         = 0
close(3)                                = 0
futex(0xb7f24bd8, FUTEX_WAIT, 19928, NULL) = ? ERESTARTSYS (To be restarted)

I'm pretty sure it's something with Fedora, because this software has run successfully on a whole bunch of other platforms.

ramen_noodle · January 16, 2008, 1:06am

Looking at the strace, the block seems to be related to a mutex lock, dude.

frank_rizzo · January 16, 2008, 1:14am

Did you compile it with all warnings turned on? If not, can you? If so, did you get any warnings?

Akimaki · January 16, 2008, 5:29am

If you think that because of the last line: that futex wait is the main thread joining the other thread, and the join hangs because the thread is still in accept() even after close() has been called on that socket. I have now tried this test on FreeBSD and Red Hat 8 too, and it runs fine there.

Akimaki · January 16, 2008, 5:38am

I'm using -Wall -W, and apart from unused parameters there are no warnings.

ramen_noodle · January 16, 2008, 4:26pm

Try without setting SO_LINGER on the socket and see if it behaves better.

SO_LINGER
    Waits to complete the close function if data is present. When this option is enabled and there is unsent data present when the close function is called, the calling application is blocked during the close function until the data is transmitted or the connection has timed out. When this option is disabled, the close function returns without blocking the caller. This option has meaning only for stream sockets.

ramen_noodle · January 16, 2008, 5:27pm

Never mind.
linger->l_linger is 0. The correct behavior is that close does not block in
this case and the connected client will get a RST. Data is discarded.

Can you try an experiment with this thread detached and without calling pthread_join()?
What does a backtrace of the blocked thread look like in gdb? What does your signal
handling look like?

<Edit> If this looks familiar you may be getting bit by a kernel bug: Re: [PATCH -v2] fix for futex_wait signal stack corruption

Akimaki · January 17, 2008, 5:27am

No, the problem is not with the threads. Again, it gets stuck in accept().

ramen_noodle · January 17, 2008, 11:39am

If you are sure that accept() is this badly broken on FC8 then you should probably open a trouble ticket with the Fedora folks. I don't think that's the core issue though...

Any multithreaded code can cause hard to troubleshoot issues when moved to
another platform or kernel version. Sounds like this is the case here.
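
On the original question of how to get a thread out of accept(): one common approach that does not depend on what close() does to a blocked accept() is to have the thread wait in poll() on both the listening socket and a wake-up pipe, and to have main write to the pipe when it wants the thread to stop. The following is only a sketch under assumptions, not the original test. All names (struct acceptor, accept_loop, run_acceptor_demo) are hypothetical, and it binds to loopback on a kernel-chosen port just to stay self-contained, where the original test used 0.0.0.0:6111.

```c
#include <arpa/inet.h>
#include <assert.h>
#include <netinet/in.h>
#include <poll.h>
#include <pthread.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

struct acceptor {
    int listen_fd;  /* listening TCP socket */
    int wake_rd;    /* read end of the wake-up pipe */
    int stopped;    /* set by the thread when it leaves via the pipe */
};

static void *accept_loop(void *arg)
{
    struct acceptor *a = arg;
    struct pollfd fds[2];

    assert(a != NULL);
    fds[0].fd = a->listen_fd; fds[0].events = POLLIN;
    fds[1].fd = a->wake_rd;   fds[1].events = POLLIN;

    for (;;) {
        fds[0].revents = fds[1].revents = 0;
        if (poll(fds, 2, -1) < 0)
            break;                      /* real code would retry on EINTR */
        if (fds[1].revents & (POLLIN | POLLHUP)) {
            a->stopped = 1;             /* main asked us to stop */
            break;
        }
        if (fds[0].revents & POLLIN) {
            int c = accept(a->listen_fd, NULL, NULL);
            if (c >= 0)
                close(c);               /* a real server would serve the client */
        }
    }
    return NULL;
}

/* Returns 1 if the thread left its loop because of the wake-up pipe,
 * 0 if it left for another reason, -1 if setup failed. */
int run_acceptor_demo(void)
{
    struct acceptor a;
    struct sockaddr_in addr;
    pthread_t tid;
    int p[2];
    char byte = 'x';

    if (pipe(p) < 0)
        return -1;
    a.listen_fd = socket(AF_INET, SOCK_STREAM, 0);
    a.wake_rd = p[0];
    a.stopped = 0;

    memset(&addr, 0, sizeof addr);
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
    addr.sin_port = 0;                  /* let the kernel pick a free port */

    if (a.listen_fd < 0 ||
        bind(a.listen_fd, (struct sockaddr *)&addr, sizeof addr) < 0 ||
        listen(a.listen_fd, 5) < 0 ||
        pthread_create(&tid, NULL, accept_loop, &a) != 0) {
        if (a.listen_fd >= 0)
            close(a.listen_fd);
        close(p[0]);
        close(p[1]);
        return -1;
    }

    sleep(1);                           /* let the thread block in poll() */
    if (write(p[1], &byte, 1) != 1) {
        /* closing the write end below still wakes the thread (POLLHUP) */
    }
    close(p[1]);
    pthread_join(tid, NULL);            /* returns once the thread has left the loop */

    close(a.listen_fd);                 /* safe now: nobody is using it */
    close(p[0]);
    return a.stopped;
}
```

The point of the design is that the listening socket is closed only after the thread has been joined, so the result no longer depends on whether a particular kernel wakes up accept() when another thread closes the descriptor.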