Wait process holding CPU

Hi all,

I have this performance issue:

[srvbd1]root]/]>ps vg | head -1 ; ps vg | grep -w wait
    PID    TTY STAT  TIME PGIN  SIZE   RSS   LIM  TSIZ   TRS %CPU %MEM COMMAND
   8196      - A    4448:23    0   384   384    xx     0     0 12.8  0.0 wait
  53274      - A    4179:28    0   384   384    xx     0     0 12.1  0.0 wait
  57372      - A    4436:05    0   384   384    xx     0     0 12.8  0.0 wait
  61470      - A    4173:05    0   384   384    xx     0     0 12.0  0.0 wait
[srvbd1]root]/]>ps -ef | grep 8196| grep -v grep
[srvbd1]root]/]>

There are 4 "wait" processes and together they occupy about 50% of the CPU, as shown by ps aux:

[srvbd1]root]/]>ps aux | head -1; ps aux | sort -rn +2 | head -5
USER        PID %CPU %MEM   SZ  RSS    TTY STAT    STIME  TIME COMMAND
root      57372 12.8  0.0  384  384      - A      Feb 20 4437:22 wait
root       8196 12.8  0.0  384  384      - A      Feb 20 4449:41 wait
root      53274 12.1  0.0  384  384      - A      Feb 20 4180:41 wait
root      61470 12.0  0.0  384  384      - A      Feb 20 4174:17 wait
fin102   299090  0.2  0.0 1992 1976      - A    09:19:01  0:42 /u02/F10204/UBS/
[srvbd1]root]/]>

Please help me kill these wait processes, as they are not real processes. Help would be greatly appreciated. Server performance is very poor; even logging in takes a very long time.

I see no "performance issue", just a "ps"-output. To assess the performance situation of your system it would be necessary to see the output of:

vmstat -v
vmstat -tw 1
svmon -G
iostat 5
no -a

and, depending on the configuration of your system ("lscfg"), probably some others.
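If it helps, here is a minimal sketch of how such a snapshot could be collected into a single file for posting (the path /tmp/perfsnap.txt and the interval/count values are just examples):

# collect a basic performance snapshot into one file (run as root)
{
  echo "=== vmstat -v ===" ; vmstat -v
  echo "=== vmstat -tw 1 5 ===" ; vmstat -tw 1 5
  echo "=== svmon -G ===" ; svmon -G
  echo "=== iostat 5 2 ===" ; iostat 5 2
  echo "=== no -a ===" ; no -a
  echo "=== lscfg ===" ; lscfg
} > /tmp/perfsnap.txt 2>&1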

Anyway, killing the processes is easy. You can see the column labeled PID in your output:

kill -15 <pid>

then wait a few seconds and issue another "ps". If <pid> isn't gone:

kill -9 <pid>
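For example, assuming the PID you want to get rid of is 8196 (taken from your ps output), the whole sequence could be sketched in ksh like this:

pid=8196                          # PID taken from the ps output
kill -15 $pid                     # ask the process to terminate
sleep 5                           # give it a few seconds
if ps -p $pid > /dev/null 2>&1 ; then
    kill -9 $pid                  # still there - force it
fi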

I still have serious doubts that this will help your situation at all, and I fear it might even make it worse, but there you go. My recommendation is not to do it, but you are free to do as you please.

I hope this helps.

bakunin

Hi Bakunin,

Thanks for your reply. Let me explain the issue I am facing right now. The server is almost completely idle, but any application I start, such as WAS or another enterprise application, is very slow and takes hours. Even a PuTTY login takes a few minutes. So we analyzed the system, and only these wait processes looked like a bottleneck. But since they are kernel processes, I am not able to kill them.

Here I post the requested details. Please review them and let me know if you can find any reason for the server's behaviour.

[srvbd1]root]/]>proctree 8196
[srvbd1]root]/]>        kill -15 8196
kill: 8196: 0403-003 The specified process does not exist.
[srvbd1]root]/]>ps -fk | grep wait
    root   8196      0   0   Feb 20      - 4479:28 wait
    root  53274      0   0   Feb 20      - 4208:33 wait
    root  57372      0   0   Feb 20      - 4466:54 wait
    root  61470      0   0   Feb 20      - 4201:55 wait
[srvbd1]root]/]>vmstat -v
              2035712 memory pages
              1957145 lruable pages
              1052819 free pages
                    1 memory pools
               384893 pinned pages
                 80.0 maxpin percentage
                 20.0 minperm percentage
                 80.0 maxperm percentage
                 13.3 numperm percentage
               260427 file pages
                  0.0 compressed percentage
                    0 compressed pages
                 13.2 numclient percentage
                 80.0 maxclient percentage
               260187 client pages
                    0 remote pageouts scheduled
                    0 pending disk I/Os blocked with no pbuf
                    0 paging space I/Os blocked with no psbuf
                 2228 filesystem I/Os blocked with no fsbuf
                 1019 client filesystem I/Os blocked with no fsbuf
                    0 external pager filesystem I/Os blocked with no fsbuf
                    0 Virtualized Partition Memory Page Faults
                 0.00 Time resolving virtualized partition memory page faults
[srvbd1]root]/]>vmstat -tw 1

System configuration: lcpu=4 mem=7952MB

 kthr          memory                         page                       faults           cpu       time
------- --------------------- ------------------------------------ ------------------ ----------- --------
  r   b        avm        fre    re    pi    po    fr     sr    cy    in     sy    cs us sy id wa hr mi se
  0   0     702600    1052811     0     0     0     0      0     0     2   6268  7339  0  1 99  0 11:52:31
  0   0     702602    1052809     0     0     0     0      0     0     4   5902  7045  0  1 99  0 11:52:32
  0   0     702602    1052809     0     0     0     0      0     0     5   5991  6883  0  1 99  0 11:52:33
  0   0     702602    1052809     0     0     0     0      0     0     4   5913  6100  0  1 99  0 11:52:34
[srvbd1]root]/]>
[srvbd1]root]/]>
[srvbd1]root]/]>svmon -G
               size      inuse       free        pin    virtual
memory      2035712     982932    1052780     384894     702631
pg space    2097152       2404

               work       pers       clnt      other
pin          314839          0          0      70055
in use       702631        240     280061

PageSize   PoolSize      inuse       pgsp        pin    virtual
s   4 KB          -     935236       2404     361214     654935
m  64 KB          -       2981          0       1480       2981
[srvbd1]root]/]>iostat 5

System configuration: lcpu=4 drives=3 paths=2 vdisks=0

tty:      tin         tout    avg-cpu: % user % sys % idle % iowait
          0.0         11.6                0.3   0.7   98.9      0.2

Disks:        % tm_act     Kbps      tps    Kb_read   Kb_wrtn
hdisk0           2.0       3.2       0.4          0        16
hdisk1           2.0       6.4       0.8          0        32
cd0              0.0       0.0       0.0          0         0

tty:      tin         tout    avg-cpu: % user % sys % idle % iowait
          0.0         77.6                0.3   1.5   97.9      0.3

Disks:        % tm_act     Kbps      tps    Kb_read   Kb_wrtn
hdisk0           0.2      11.0       2.4          0        56
hdisk1           0.2       7.9       1.2          0        40
cd0              0.0       0.0       0.0          0         0
[srvbd1]root]/]>
[srvbd1]root]/]>no -a
                 arpqsize = 12
               arpt_killc = 20
              arptab_bsiz = 7
                arptab_nb = 149
                bcastping = 0
      clean_partial_conns = 1
                 delayack = 0
            delayackports = {}
         dgd_packets_lost = 3
            dgd_ping_time = 5
           dgd_retry_time = 5
       directed_broadcast = 0
         extendednetstats = 0
                 fasttimo = 200
        icmp6_errmsg_rate = 10
          icmpaddressmask = 0
ie5_old_multicast_mapping = 0
                   ifsize = 256
          inet_stack_size = 16
               ip6_defttl = 64
                ip6_prune = 1
            ip6forwarding = 0
       ip6srcrouteforward = 1
       ip_ifdelete_notify = 0
                 ip_nfrag = 200
             ipforwarding = 0
                ipfragttl = 2
        ipignoreredirects = 0
                ipqmaxlen = 100
          ipsendredirects = 1
        ipsrcrouteforward = 1
           ipsrcrouterecv = 0
           ipsrcroutesend = 1
          llsleep_timeout = 3
                  lo_perf = 1
                lowthresh = 90
                 main_if6 = 0
               main_site6 = 0
                 maxnip6q = 20
                   maxttl = 255
                medthresh = 95
               mpr_policy = 1
              multi_homed = 1
                nbc_limit = 1017856
            nbc_max_cache = 131072
            nbc_min_cache = 1
         nbc_ofile_hashsz = 12841
                 nbc_pseg = 0
           nbc_pseg_limit = 2035712
           ndd_event_name = {all}
        ndd_event_tracing = 0
            ndp_mmaxtries = 3
            ndp_umaxtries = 3
                 ndpqsize = 50
                ndpt_down = 3
                ndpt_keep = 120
               ndpt_probe = 5
           ndpt_reachable = 30
             ndpt_retrans = 1
             net_buf_size = {all}
             net_buf_type = {all}
        net_malloc_police = 0
           nonlocsrcroute = 0
                 nstrpush = 8
              passive_dgd = 0
         pmtu_default_age = 10
              pmtu_expire = 10
 pmtu_rediscover_interval = 30
              psebufcalls = 20
                 psecache = 1
             pseintrstack = 24576
                psetimers = 20
           rfc1122addrchk = 0
                  rfc1323 = 1
                  rfc2414 = 1
             route_expire = 1
          routerevalidate = 0
                 rto_high = 64
               rto_length = 13
                rto_limit = 7
                  rto_low = 1
                     sack = 0
                   sb_max = 1048576
       send_file_duration = 300
              site6_index = 0
               sockthresh = 85
                  sodebug = 0
              sodebug_env = 0
                somaxconn = 1024
                 strctlsz = 1024
                 strmsgsz = 0
                strthresh = 85
               strturncnt = 15
          subnetsarelocal = 1
       tcp_bad_port_limit = 0
                  tcp_ecn = 0
       tcp_ephemeral_high = 65535
        tcp_ephemeral_low = 32768
             tcp_finwait2 = 1200
           tcp_icmpsecure = 0
          tcp_init_window = 0
    tcp_inpcb_hashtab_siz = 24499
              tcp_keepcnt = 8
             tcp_keepidle = 14400
             tcp_keepinit = 150
            tcp_keepintvl = 150
     tcp_limited_transmit = 1
              tcp_low_rto = 0
             tcp_maxburst = 0
              tcp_mssdflt = 1460
          tcp_nagle_limit = 65535
        tcp_nagleoverride = 0
               tcp_ndebug = 100
              tcp_newreno = 1
           tcp_nodelayack = 0
        tcp_pmtu_discover = 1
            tcp_recvspace = 16384
            tcp_sendspace = 262144
            tcp_tcpsecure = 0
             tcp_timewait = 1
                  tcp_ttl = 60
           tcprexmtthresh = 3
                  thewall = 4071424
         timer_wheel_tick = 0
       udp_bad_port_limit = 0
       udp_ephemeral_high = 65535
        udp_ephemeral_low = 32768
    udp_inpcb_hashtab_siz = 24499
        udp_pmtu_discover = 1
            udp_recvspace = 42080
            udp_sendspace = 9216
                  udp_ttl = 30
                 udpcksum = 1
                 use_isno = 1
           use_sndbufpool = 1
[srvbd1]root]/]>lscfg
INSTALLED RESOURCE LIST

The following resources are installed on the machine.
+/- = Added or deleted from Resource List.
*   = Diagnostic support not available.

  Model Architecture: chrp
  Model Implementation: Multiple Processor, PCI bus

+ sys0                                             System Object
+ sysplanar0                                       System Planar
* vio0                                             Virtual I/O Bus
* vsa0             U789F.001.AAA8080-P1-T3         LPAR Virtual Serial Adapter
* vty0             U789F.001.AAA8080-P1-T3-L0      Asynchronous Terminal
* pci2             U789F.001.AAA8080-P1            PCI Bus
* pci1             U789F.001.AAA8080-P1            PCI Bus
+ fcs0             U789F.001.AAA8080-P1-C13-C1-T1  FC Adapter
* fscsi0           U789F.001.AAA8080-P1-C13-C1-T1  FC SCSI I/O Controller Protocol Device
* fcnet0           U789F.001.AAA8080-P1-C13-C1-T1  Fibre Channel Network Protocol Device
+ fcs1             U789F.001.AAA8080-P1-C13-C1-T2  FC Adapter
* fscsi1           U789F.001.AAA8080-P1-C13-C1-T2  FC SCSI I/O Controller Protocol Device
* fcnet1           U789F.001.AAA8080-P1-C13-C1-T2  Fibre Channel Network Protocol Device
* pci0             U789F.001.AAA8080-P1            PCI Bus
* pci3             U789F.001.AAA8080-P1            PCI Bus
+ ent0             U789F.001.AAA8080-P1-T1         2-Port 10/100/1000 Base-TX PCI-X Adapter (14108902)
+ ent1             U789F.001.AAA8080-P1-T2         2-Port 10/100/1000 Base-TX PCI-X Adapter (14108902)
* pci4             U789F.001.AAA8080-P1            PCI Bus
+ usbhc0           U789F.001.AAA8080-P1            USB Host Controller (33103500)
+ usbhc1           U789F.001.AAA8080-P1            USB Host Controller (33103500)
* pci5             U789F.001.AAA8080-P1            PCI Bus
* ide0             U789F.001.AAA8080-P1-T10        ATA/IDE Controller Device
+ cd0              U789F.001.AAA8080-P1-D3         IDE DVD-RAM Drive
* pci6             U789F.001.AAA8080-P1            PCI Bus
+ sisscsia0        U789F.001.AAA8080-P1            PCI-X Dual Channel Ultra320 SCSI Adapter
+ scsi0            U789F.001.AAA8080-P1-T5         PCI-X Dual Channel Ultra320 SCSI Adapter bus
+ scsi1            U789F.001.AAA8080-P1-T9         PCI-X Dual Channel Ultra320 SCSI Adapter bus
+ hdisk0           U789F.001.AAA8080-P1-T9-L5-L0   16 Bit LVD SCSI Disk Drive (73400 MB)
+ hdisk1           U789F.001.AAA8080-P1-T9-L8-L0   16 Bit LVD SCSI Disk Drive (73400 MB)
+ ses0             U789F.001.AAA8080-P1-T9-L15-L0  SCSI Enclosure Services Device
+ L2cache0                                         L2 Cache
+ mem0                                             Memory
+ proc0                                            Processor
+ proc2                                            Processor
[srvbd1]root]/]>kill -9 8196
kill: 8196: 0403-003 The specified process does not exist.
[srvbd1]root]/]>

These are kernel wait processes. They are absolutely normal and come with the OS, one per logical CPU. As one can see you have 2 processors, and I assume you have SMT activated with 2 logical CPUs per virtual or physical CPU.

As Bakunin said, you should really not kill them. They are definitely not your problem. They are just waiting for work and help calculating your idle percentage. Leave them alone!
IBM CPU Utilization for the wait KPROC - United States
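If you want to double-check the logical CPU count and the SMT state, a quick sketch (assuming AIX 5.3 or later, where smtctl is available):

smtctl                    # shows whether SMT is enabled and the threads per processor
bindprocessor -q          # lists the available logical processors
lsdev -Cc processor       # lists the processors known to the ODM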

Either this box is very weak resource-wise, the application is badly programmed, or there is some other kind of performance problem, for example with name resolution.

Start up the application and run something like vmstat -w 2 20 while it performs slowly, to get a first impression of your system.
Also check the logs of your application, if it writes any.
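A rough sketch of how you could capture that while reproducing the slowness (the file name is just an example):

# start the capture, then reproduce the slow login / application start
vmstat -w 2 20 > /tmp/vmstat_slow.out 2>&1 &
# ... reproduce the problem now ...
# afterwards look mainly at the us/sy/wa columns
cat /tmp/vmstat_slow.out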


OK, this might as well be a problem with the server as it might be a problem with some third-party system. A possible cause could be the name server (have a look at /etc/resolv.conf); maybe the server runs into a timeout every time it tries to query an IP address. Try the following: select a server in your network. Make sure its IP address is not in the local /etc/hosts. Do a "ping <IP-address>" and note the time it takes to respond. Now try a "ping <hostname>" for the same server. If there is a noticeable difference in how long it takes "ping" to start, the name server is the culprit.
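A quick sketch of that test (10.1.1.20 and appserver1 are placeholders for a host in your network):

time ping -c 3 10.1.1.20      # by IP address - should answer immediately
time ping -c 3 appserver1     # by name - if this takes noticeably longer to start, name resolution is the problem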

Actually you are killing them, but they are immediately restarted.

OK, I had a quick look at your output and IMHO the system was doing absolutely nothing when you took the snapshots; it probably rebooted just before. If you look at "vmstat"'s output and notice the large number of "free" memory pages, there are only two plausible reasons: either the system does absolutely nothing, so that the kernel doesn't even know what to put into the file cache - this is unlikely given your modest memory size of ~8GB. The other option is that the system just restarted and there has not been enough I/O since then to fill the file cache with anything that makes sense. (The last possible explanation - a rather hilarious "maxperm"/"minperm" setting - is ruled out by the output of "vmstat -v".)

You might want to tune your maxperm and minperm settings to more sensible values. What these values should be depends on the application, but 95% and 3% are good starting points. Right now you have:

[srvbd1]root]/]>vmstat -v
[...]
                 20.0 minperm percentage
                 80.0 maxperm percentage
                 80.0 maxclient percentage
[...]
[srvbd1]root]/]>svmon -G
               size      inuse       free        pin    virtual
memory      2035712     982932    1052780     384894     702631
pg space    2097152       2404

This display is in memory pages (4k each); 2 million pages ≈ 8GB. Of these 2 million pages about 700k are in use, the rest is simply doing nothing. If this is everything your system ever does you could reduce its memory to ~4GB and everything would be fine.
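For the maxperm/minperm change suggested above, a sketch of the commands (assuming AIX 5.3 or later, where vmo replaces vmtune; 95/3 are only the starting points mentioned, adjust for your application):

vmo -o maxperm%=95 -o minperm%=3       # change the running system
vmo -p -o maxperm%=95 -o minperm%=3    # -p makes the change survive a reboot as well
vmo -a | egrep "maxperm%|minperm%"     # verify the new values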

[srvbd1]root]/]>iostat 5

System configuration: lcpu=4 drives=3 paths=2 vdisks=0

tty:      tin         tout    avg-cpu: % user % sys % idle % iowait
          0.0         11.6                0.3   0.7   98.9      0.2

Disks:        % tm_act     Kbps      tps    Kb_read   Kb_wrtn
hdisk0           2.0       3.2       0.4          0        16
hdisk1           2.0       6.4       0.8          0        32
cd0              0.0       0.0       0.0          0         0

tty:      tin         tout    avg-cpu: % user % sys % idle % iowait
          0.0         77.6                0.3   1.5   97.9      0.3

Disks:        % tm_act     Kbps      tps    Kb_read   Kb_wrtn
hdisk0           0.2      11.0       2.4          0        56
hdisk1           0.2       7.9       1.2          0        40
cd0              0.0       0.0       0.0          0         0

These disks are doing absolutely nothing. The little residual activity is the system itself idling away; it is the computer equivalent of twiddling one's thumbs.

[srvbd1]root]/]>no -a

Looks like everything is at defaults here. Once the system actually does something there might be a reason to optimize a bit, but for now just leave it alone.

I wonder what you want with the many adapters - you have no disks (save for the two system disks) right now.

Summary:

It seems that the system has only just been built and some of the hardware isn't even connected yet (like disks). The system itself is definitely not the problem when a PuTTY session needs "several minutes" to connect. I'd look at the network (routers, firewalls, VLANs, etc.) and network-related services (DNS, NIS, maybe Kerberos or LDAP, etc.) to see if the culprit is there. My first guess would be the name server, then the other components I named.

I hope this helps.

bakunin


Thanks for your detailed analysis, Bakunin & zaxxon.

As you said, yes, the system was doing nothing at that point in time; it was completely idle. I was either trying to log in to sqlplus from another session, which was taking 2 minutes, or doing something very general like bringing up a small service.

I am going to try all the suggestions given. Will let you know, guys.

Thanks a ton for your help.

Another way to test if delays are caused by name server lookups is to edit /etc/netsvc.conf.
Add or edit a line so that it says:
hosts=local4

FYI, my normal setting is hosts=local4,bind4 since I am not using any IPv6.
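A sketch of the test change and a quick check (srvbd1 is just this machine's own hostname as an example; keep a copy of the file first):

# keep a copy if the file already exists (it may not on a default install)
[ -f /etc/netsvc.conf ] && cp /etc/netsvc.conf /etc/netsvc.conf.bak
# for the test: resolve from /etc/hosts only, IPv4 (change to local4,bind4 afterwards)
echo "hosts=local4" >> /etc/netsvc.conf
# this should come back instantly if the name is in /etc/hosts
host srvbd1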


Correct. Sometimes it takes me a while to sift through what I have written, but here is the name resolution process in detail.

I hope this helps.

bakunin


There are two ways to modify the default behavior of AIX name resolution, which is: bind4|6, nis, local4|6.
Because of the lookups to bind6 and local6 - which I am not using - I specify
hosts=local4,bind4
in /etc/netsvc.conf

However, there is another way that requires no write access to /etc/netsvc.conf:
$ export NSORDER="local4,bind4"
will do the same as what I normally put in /etc/netsvc.conf. So, e.g., to test whether a timeout is the issue you could also specify
$ export NSORDER=local
and neither bind nor NIS will be queried, even if they are configured.
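A sketch of how that can be used for a quick timing test in one session, without touching any files (dbserver1 is a placeholder name):

export NSORDER=local      # this shell only: /etc/hosts, nothing else
time host dbserver1       # fast and correct -> local resolution is fine
unset NSORDER             # back to /etc/netsvc.conf (or the default order)
time host dbserver1       # noticeably slower -> the name server (or NIS) is the bottleneck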

Other key environment variables (I have these defined system-wide in /etc/environment):

RES_RETRY=2
RES_TIMEOUT=3

RES_RETRY sets resolve retry count (default=3); RES_TIMEOUT sets resolve timeout (default=30).


One more handy thing to know: the minimum contents of /etc/hosts should resolve localhost and the hostname (so do not delete localhost, as I sometimes find at sites complaining of a slooooow boot!).
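A minimal /etc/hosts for such a box might look like this (the address and name for srvbd1 are placeholders; use the machine's real IP and hostname):

127.0.0.1    loopback localhost    # do not delete this line
10.1.1.15    srvbd1                # the machine's own hostname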


Thanks a lot, guys. The server is fine now. I have tried everything suggested here; it was really informative and I learnt a lot.

/etc/netsvc.conf was the culprit. Thanks everyone for your help.
