Cannot detect TCP disconnects well

Hello everyone. Thanks for reading. I am running into this problem on Ubuntu 7.04:

I have written my own programs that communicate with each other, and I am having a hard time detecting a TCP socket disconnect when the remote side's computer has a power failure (for example).

On the computer that stays up, my program continually polls the socket and tries to send a status message. These sends never fail.

The poll just returns 0, implying a timeout; the poll before a send returns with POLLOUT, and the send returns the number of bytes I tried to send, implying that it was sent properly.

This goes on forever. I am trying to figure out how to detect that the socket is down so I can clean up my end and listen for the computer to connect again.

Thanks!
PJW

It would be easier if you post the code you're using.

Generally, servers simply send a packet every X minutes and wait Y minutes for a reply; if there is no reply, they assume the connection has timed out. You can use poll(), alarm(), setitimer()...
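
The send-and-wait idea could be sketched roughly as follows, assuming Python 3. The PING/PONG messages and the name peer_alive are made up for illustration; a real protocol would need its own message framing, and the remote program has to answer the heartbeat.

```python
import select
import socket

PING = b"PING\n"   # hypothetical heartbeat request
PONG = b"PONG\n"   # hypothetical heartbeat reply

def peer_alive(conn, reply_timeout=5.0):
    """Send one heartbeat and wait up to reply_timeout seconds for a reply."""
    try:
        conn.sendall(PING)
    except OSError:
        return False                  # the local stack already reports a failure
    # Wait for the socket to become readable, but never longer than the timeout.
    readable, _, _ = select.select([conn], [], [], reply_timeout)
    if not readable:
        return False                  # nothing came back in time
    try:
        data = conn.recv(len(PONG))
    except OSError:
        return False                  # e.g. connection reset by the peer
    return bool(data)                 # b"" means the peer closed the connection
```

Calling this every X minutes from the main loop gives the "assume timed out after Y" behavior without relying on send() ever failing.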

See the setsockopt() man page and pay special attention to the SO_KEEPALIVE option and the TCP_KEEPALIVE or TCP_KEEPIDLE options. BTW, these options are system-specific.
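
For what it's worth, here is a minimal sketch of that keepalive setup, assuming Python 3 on Linux. The function name enable_keepalive and the timing values are illustrative only. TCP_KEEPIDLE, TCP_KEEPINTVL and TCP_KEEPCNT are the Linux option names and may be missing elsewhere, hence the hasattr() guards.

```python
import socket

def enable_keepalive(sock, idle=60, interval=10, count=5):
    """Turn on TCP keepalive and, where supported, shorten its timers."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket timer options are system specific; set only what exists.
    if hasattr(socket, "TCP_KEEPIDLE"):    # idle seconds before the first probe
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
    if hasattr(socket, "TCP_KEEPINTVL"):   # seconds between probes
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
    if hasattr(socket, "TCP_KEEPCNT"):     # unanswered probes before giving up
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
```

With these example values, a dead peer on an otherwise idle connection should be reported after roughly idle + interval * count seconds, at which point the next recv() or send() fails.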

mika is correct. You really have to understand the TCP protocol to understand "exceptional" behavior, such as what you're dealing with. The TCP protocol is designed for fast AND slow networks. It's also built around reliable transmission when things in the middle temporarily break or get clogged. Thus, TCP is very tolerant of errors and transmission interruptions, and what you're trying to do is make it intolerant of errors.

Let's say you take mika's suggestion and set the socket options so that the timeout is shortened. Now what if the client application is operating over a modem or over a VPN (virtual private network)? In both of these cases, your connection might temporarily become very slow -- there's noise on the telephone line and the modems need several seconds to renegotiate, resulting in lots of lost packets, or in the case of the VPN, there is a re-exchange of encryption keys which suspends communications for several seconds. You might find you have made your TCP connection incapable of surviving these "normal" exceptions.

Another possibility is to leave these TCP parameters alone and implement an out-of-band "heartbeat". You could do this with ICMP or UDP. As soon as you lose your heartbeat (several times in a row to be sure), your client application shuts down the socket. The upside is portability (sort of). But there are several downsides as well, like learning a new set of programming protocols, multithreading (or at least interprocess communication between the heartbeat and your client application), and getting around firewalls, which might block your heartbeats.
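
A rough sketch of the UDP variant, assuming Python 3 and assuming the remote machine runs some small responder that answers every datagram it receives (that responder, the b"hb" payload and the name udp_heartbeat_ok are all hypothetical):

```python
import socket

def udp_heartbeat_ok(addr, misses_allowed=3, timeout=1.0):
    """Return False only after misses_allowed probes in a row go unanswered."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as probe:
        probe.settimeout(timeout)
        for _ in range(misses_allowed):
            try:
                probe.sendto(b"hb", addr)   # hypothetical heartbeat payload
                probe.recvfrom(16)          # any answer counts as "alive"
                return True
            except OSError:                 # includes socket timeouts
                continue                    # count this probe as a miss
    return False
```

This version runs inline and blocks for up to misses_allowed * timeout seconds, which sidesteps the multithreading issue at the cost of stalling the caller.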

Is the can of worms making you sick yet?

poll() and select() aren't the operative indicators.

send() to a closed port should return an error. You need to check the return value of this function.
If your client really depends on the server's response, then make sure that the client knows
that the message was not received. It seems to me that in any TCP-based scenario the
client would have ample notification based on L4 (transport-layer) feedback.
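
Checking the result of a send might look like this in Python 3, where socket errors arrive as exceptions rather than return codes. The helper name try_send is made up for illustration. Note that it can only report failures the local TCP stack already knows about.

```python
import socket

def try_send(conn, payload):
    """Return True if the local stack accepted the data, False on a socket error."""
    try:
        conn.sendall(payload)
        return True
    except OSError as exc:    # BrokenPipeError, ConnectionResetError, timeouts, ...
        print("send failed:", exc)
        return False
```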

"""
tcp_disconnect.py
Echo network data test program in python. This program easily translates to C & Java.

By TCP rules, the only way for a server program to know if a client has disconnected,
is to try to read from the socket. Specifically, if select() says there is data, but
recv() returns 0 bytes of data, then this implies the client has disconnected.

But a server program might want to confirm that a tcp client is still connected without
reading data. For example, before it performs some task or sends data to the client.
This program will demonstrate how to detect a TCP client disconnect without reading data.

The method to do this:
1) select on socket as poll (no wait)
2) if no recv data waiting, then client still connected
3) if recv data waiting, then read one char using PEEK flag
4) if PEEK data len=0, then client has disconnected, otherwise it's connected.
Note, the peek flag will read data without removing it from tcp queue.

To see it in action: 0) run this program on one computer, 1) from another computer,
connect via telnet to port 12345, 2) type a line of data, 3) wait to see it echo,
4) type another line, 5) disconnect quickly, 6) watch the program detect the
disconnect and exit.

I hope this is helpful to someone. John Masinter, 17-Dec-2008.
"""

import socket
import time
import select

HOST = ''       # all local interfaces
PORT = 12345    # port to listen

# listen for new TCP connections
s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
s.bind((HOST, PORT))
s.listen(1)
# accept new connection
conn, addr = s.accept()
print('Connected by', addr)
# loop reading/echoing, until client disconnects
try:
    conn.sendall(b"Send me data, and I will echo it back after a short delay.\n")
    while True:
        data = conn.recv(1024)                          # recv up to 1024 queued bytes
        if not data: break                              # client disconnected
        time.sleep(3)                                   # simulate time consuming work
        # below will detect if client disconnects during sleep
        r, w, e = select.select([conn], [], [], 0)      # more data waiting?
        print("select: r=%s w=%s e=%s" % (r, w, e))     # debug output to command line
        if r:                                           # yes, data avail to read.
            t = conn.recv(1024, socket.MSG_PEEK)        # read without remove from queue
            print("peek: len=%d, data=%r" % (len(t), t))  # debug output
            if len(t) == 0:                             # length of data peeked 0?
                print("Client disconnected.")           # client disconnected
                break                                   # quit program
        conn.sendall(b"-->" + data)                     # echo only if still connected
finally:
    conn.close()

Hello pjwhite,

If the remote computer simply crashes or the network is cut, how do you catch that situation? As I understand it, that's the question?

Let me say a few words about TCP/IP:

  1. It is a reliable, stream-based protocol ...

Each sending side has a buffer, for example a 32 KB buffer, and if there is room in the buffer, all "send" or "write" operations succeed immediately without any error, regardless of whether the data reaches the remote end. Next, the TCP/IP implementation starts to send the current window of the stream as a sequence of IP packets, retransmitting them until it gets a reply from the remote side. Without any reply, this side will not know what happened.

I would like to propose that you change your protocol (if it is yours) and introduce the ability to ping the other side with a special kind of message. Most existing protocols either have such an ability or don't need it.

I've seen from your post that you have already created that.
I mean you can create a special message that requires a special reply:

  1. After some period of inactivity, you can simply send the special message, just one, and if the reply doesn't come, start re-establishing the connection.

  2. Like 1, but using OOB (out-of-band data) for the "ping" message.

Actually, if there is no reply after sending the special message, you can try to reconnect. However, reconnecting will also "hang" if the network is down, so a timeout and an asynchronous connect are also recommended.
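
One way to keep a reconnect attempt from hanging, sketched in Python 3. This uses a blocking connect bounded by a timeout rather than a truly asynchronous one, and the helper name reconnect is made up for illustration.

```python
import socket

def reconnect(host, port, timeout=5.0):
    """Try once to reconnect; return a connected socket, or None on failure."""
    try:
        # create_connection() gives up after `timeout` seconds instead of hanging.
        return socket.create_connection((host, port), timeout=timeout)
    except OSError:       # refused, unreachable, or timed out
        return None
```

The returned socket keeps the same timeout for later operations; call settimeout(None) on it if fully blocking behavior is wanted afterwards.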

Best Regards
O.

Or try mika's recommendation to use keep-alive, but please check that it is working. You may also need to configure it.

"A Transmission Control Protocol (TCP) keep-alive packet is an acknowledgment (ACK) with the sequence number set to one less than the current sequence number for the connection. The Transmission Control Protocol/Internet Protocol (TCP/IP) stack can automatically generate these keep-alive messages to verify that the computer at the remote end of a connection is still available."

Best Regards
O.