Truss output interpretation

hi, anyone can help on this piece of truss output?

8094:    0.7028 write(4, 0x0043BE90, 236)                       = 236
8094:      T S H  \0\0\0EC020101\0\0\0\0\0\0\0\0\0 "02\0\0 303\0\0 I D
8094:     \f %\0\0\0\0 2\0F67F\0\0\0\0 @06FFC99A ; 
8094:      L              D6\0 303 
8094:                                                             01\0\0\0
8094:      j\0\0\0\0FF\0FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF\0\0\0\0\0\0\n\0\0\0
8094:     02\0\0\09E1F\0\001\0\0\001\0\0\0 M Q C C 0 7 0 5 0 0 0 3\0\0\0\0
8094:     \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
8094:     \0\0\0\0\0\0\0\0\0\0\0\0
8094:    0.7029 read(4, 0x0043BE90, 32766)                      Err#11 EAGAIN
8094:   pollsys(0xFFFFFD7FFFDEE080, 1, 0xFFFFFD7FFFDEDF10, 0x00000000) (sleeping...)
8094:   20.7132 pollsys(0xFFFFFD7FFFDEE080, 1, 0xFFFFFD7FFFDEDF10, 0x00000000) = 0
8094:   pollsys(0xFFFFFD7FFFDEE080, 1, 0xFFFFFD7FFFDEDF10, 0x00000000) (sleeping...)
8094:   25.0182 pollsys(0xFFFFFD7FFFDEE080, 1, 0xFFFFFD7FFFDEDF10, 0x00000000) = 1
8094:   25.0188 read(4, 0x0043BE90, 32766)                      = 236
8094:      T S H  \0\0\0EC020102\0\0\0\0\0\0\0\0\0 "02\0\0 303\0\0 I D
8094:     \f %\090\0\0 2\0F67F\0\0\0\0 @\0FFC99A ; 
8094:      L              D6\0 303
8094:                                                             \n\0\0\0
8094:     \0\0\0\0\0FF\0FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF\0\0\0\0\0\0\n\0\0\0
8094:     02\0\0\0 k *\0\08902\0\0D301\0\0 M Q M M 0 7 0 5 0 0 0 3 X P F X
8094:      X X X X _ 2 0 1 4 - 0 9- 3 1 _ 2 0 . 0 9 . 4 5
8094:

First, there is an EAGAIN error, goes into sleeping (poll) and then after about 20 odd seconds, it begins to read. What are the possible causes ?
thanks

From the read man page:

     When attempting to read from a regular file  with  mandatory
     file/record locking set (see chmod(2)), and there is a write
     lock owned by another process on the segment of the file  to
     be read:

         o    If O_NDELAY or O_NONBLOCK is set, read() returns -1
              and sets errno to EAGAIN.

pollsys then waits until the file can be read. The only possible cause is described above. This is POSIX compliant semantics for the read call. pollsys is a Solaris-ism (a system call entry point) for poll() and select(), I believe. Don Cragun would know definitely.

hi, thanks for the help.
This pertains to a networking process rather than file i/o process. One of my networking application is trying to connect to a server, and this happens. Below is truss output of what happens prior to the wait..

8094:    0.6895 open("/etc/netconfig", O_RDONLY|O_LARGEFILE)    = 4
8094:    0.6896 time()                                          = 1413538243
8094:    0.6901 rctlsys(0, "process.max-file-descriptor", 0x00000000, 0x00428020, 0, 0) = 0
8094:    0.6902 read(4, 0x00421524, 1024)                       = 1024
8094:      # p r a g m a   i d e n t\t " @ ( # ) n e t c o n f i g\t 1 . 1
8094:      6\t 9 9 / 1 0 / 2 5   S M I "\n #\n #   T h e   " N e t w o r k
.....

8094:      / t i c o t s           s t r a d d r . s o\n
8094:    0.6904 read(4, 0x00421524, 1024)                       = 0
8094:    0.6905 lseek(4, 0, SEEK_SET)                           = 0
8094:    0.6905 read(4, 0x00421524, 1024)                       = 1024
8094:      # p r a g m a   i d e n t\t " @ ( # ) n e t c o n f i g\t 1 . 1
8094:      6\t 9 9 / 1 0 / 2 5   S M I "\n #\n #   T h e   " N e t w o r k
8094:        C o n f i g u r a t i o n "   F i l e .\n #\n #   E a c h   e
8094:      n t r y   i s   o f   t h e   f o r m :\n #\n #               <
8094:      n e t w o r k _ i d >   < s e m a n t i c s >   < f l a g s ......

8094:      / t i c o t s           s t r a d d r . s o\n
8094:    0.6907 read(4, 0x00421524, 1024)                       = 0
8094:    0.6908 close(4)                                        = 0
8094:    0.6909 open("/dev/udp", O_RDONLY)                      = 4
8094:    0.6910 ioctl(4, SIOCGLIFNUM, 0xFFFFFD7FFFDEDC90)       = 0
8094:    0.6911 close(4)                                        = 0
8094:    0.6911 brk(0x0043AAB0)                                 = 0
8094:    0.6912 brk(0x0043EAB0)                                 = 0
8094:    0.6915 so_socket(PF_INET, SOCK_STREAM, IPPROTO_IP, "", SOV_DEFAULT) = 4
8094:    0.6916 ioctl(4, FIONBIO, 0xFFFFFD7FFFDEDCB0)           = 0
8094:    0.6924 connect(4, 0xFFFFFD7FFFDEE310, 16, SOV_DEFAULT) = 0
8094:    0.6925 getsockname(4, 0xFFFFFD7FFFDEE340, 0xFFFFFD7FFFDEE114, SOV_DEFAULT) = 0
8094:    0.6927 ioctl(4, FIONBIO, 0xFFFFFD7FFFDEE10C)           = 0
8094:    0.6928 setsockopt(4, tcp, TCP_NODELAY, 0xFFFFFD7FFFDEE10C, 4, SOV_DEFAULT) = 0
8094:    0.7024 brk(0x0043EAB0)                                 = 0
8094:    0.7025 brk(0x00446AB0)                                 = 0
8094:    0.7027 time()                                          = 1413538243
8094:    0.7028 write(4, 0x0043BE90, 236)                       = 236
8094:      T S H  \0\0\0EC020101\0\0\0\0\0\0\0\0\0 "02\0\0 303\0\0 I D
8094:     \f %\0\0\0\0 2\0F67F\0\0\0\0 @06FFC99A ; 
8094:      L              D6\0 303 X X X X X X
8094:                                                             01\0\0\0
8094:      j\0\0\0\0FF\0FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF\0\0\0\0\0\0\n\0\0\0
8094:     02\0\0\09E1F\0\001\0\0\001\0\0\0 M Q C C 0 7 0 5 0 0 0 3\0\0\0\0
8094:     \0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0\0
8094:     \0\0\0\0\0\0\0\0\0\0\0\0
8094:    0.7029 read(4, 0x0043BE90, 32766)                      Err#11 EAGAIN

Is there any network tcp parameters that I can fine tune with regards to EAGAIN errors? My guess is network is slow (may be due to intermediate firewalls/routers etc), hence not able to read in time? Not really sure how to go on from here. Wondering also if performing a traceroute would do any good. any tips on troubleshooting this kind of errors greatly appreciated.
thanks

OK. So you're sending a request to a server on a socket and immediately after sending the request, you try to read a response from the server with NDELAY set in the socket options. There is no server and no network that can respond that fast to a request.

There are a couple of obvious things you could try:

  1. Drop the TCP_NDELAY socket option so the read() will wait for data instead of returning immediately if no data is present.
  2. Drop the 1st read() and start with the poll() or pollsys() to wait for data to be present before attempting the read() .

I haven't tried to evaluate the arguments to pollsys() to see if your program is waiting for data on a group of file descriptors or just waiting for data on fd #4. If it is just waiting for data on fd #4, I would start by trying #1; but if your program can continue processing if data is available on another file descriptor as well, choose option 2.

With what you have shown us there is no way for us to guess why it is taking more than 24 seconds for the server to respond to your request.

hi, thanks for reply
I have setup my own client and server on two test Solaris VMs, and my client app is able to connect to server and process data just fine. The difference is that, the test network has only one routing path and there are no intermediate routers/firewalls etc in between my client and server.

However, at my workplace, the environment is different. my production client and server VMs have multiple interfaces and so have many different routes. I am beginning to suspect it could be routing mis configuration that result in non optimal routes, or intermediate firewalls/switches causing the delay. So far, all i could think of is comparing the tcp and ip settings ( using ndd ) to that of production and see what can be fine tuned.

The client and server apps are proprietary, so I have no way to change any code.
thanks

This may sound obvious, but consider traceroute. A 20+ second delay almost surely indicates a grossly overloaded system, somewhere in the path from A to B. You cannot do much about it until you locate it. Ignore my suggestion if you are routing over lots of DSL lines, in which case you are stuck.

We had remote printers on the other side of DSL lines from their print servers. Printing was painfully slow. Created local remote print servers, the problem went away.

hi jim, thanks. yes i have requested traceroute restriction to be lifted so we can troubleshoot this problem. I have no say over network security. the productions VMs are running in a cloud platform hence I don't think there are DSL lines involved.