High load average troubleshooting

Hi all, hope you can help me. I'm seeing a high load average and can't find the reason for it. Please share your thoughts.

 load average: 7.78, 7.50, 7.31


Tasks: 330 total,   1 running, 329 sleeping,   0 stopped,   0 zombie
Cpu0  :  7.0%us,  1.0%sy,  0.0%ni, 23.9%id,  0.0%wa, 38.9%hi, 29.2%si,  0.0%st
Cpu1  :  2.0%us,  1.0%sy,  0.0%ni, 97.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu2  : 54.3%us,  4.7%sy,  0.0%ni, 40.0%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
Cpu3  :  0.7%us,  0.0%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu4  : 55.3%us,  4.0%sy,  0.0%ni, 39.7%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
Cpu5  :  0.3%us,  0.3%sy,  0.0%ni, 99.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu6  : 48.0%us,  5.3%sy,  0.0%ni, 45.4%id,  0.0%wa,  0.0%hi,  1.3%si,  0.0%st
Cpu7  :  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu8  : 50.0%us,  4.3%sy,  0.0%ni, 44.7%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
Cpu9  :  1.0%us,  2.3%sy,  0.0%ni, 96.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu10 : 48.0%us,  3.0%sy,  0.0%ni, 48.0%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
Cpu11 :  0.7%us,  1.3%sy,  0.0%ni, 98.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu12 : 61.5%us,  7.6%sy,  0.0%ni, 29.9%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
Cpu13 :  1.0%us,  1.3%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu14 : 57.8%us,  6.0%sy,  0.0%ni, 35.2%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
Cpu15 :  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu16 : 54.3%us,  5.6%sy,  0.0%ni, 39.4%id,  0.0%wa,  0.0%hi,  0.7%si,  0.0%st
Cpu17 :  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu18 : 56.3%us,  4.3%sy,  0.0%ni, 38.7%id,  0.0%wa,  0.0%hi,  0.7%si,  0.0%st
Cpu19 :  0.3%us,  0.0%sy,  0.0%ni, 99.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu20 : 51.7%us,  5.0%sy,  0.0%ni, 42.4%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
Cpu21 :  0.7%us,  1.0%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu22 :  1.7%us,  0.7%sy,  0.0%ni, 97.7%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Cpu23 :  0.7%us,  1.0%sy,  0.0%ni, 98.3%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  74167472k total, 35214936k used, 38952536k free,   788124k buffers
Swap: 33551744k total,        0k used, 33551744k free, 11540200k cached

 free -g
             total       used       free     shared    buffers     cached
Mem:            70         33         37          0          0         11
-/+ buffers/cache:         21         48
Swap:           31          0         31

 df -h
Filesystem            Size  Used Avail Use% Mounted on
/dev/mapper/vg_root-lv_root
                      9.7G  1.4G  7.9G  16% /
/dev/mapper/vg_root-lv_tmp
                      2.0G  778M  1.1G  42% /tmp
/dev/mapper/vg_root-lv_var
                      992M  387M  555M  42% /var
/dev/mapper/vg_root-lv_log
                      2.0G  697M  1.2G  38% /var/log
/dev/mapper/vg_root-lv_crash
                       34G  177M   32G   1% /var/crash
/dev/mapper/vg_root-lv_vtmp
                      992M   34M  908M   4% /var/tmp
/dev/mapper/vg_root-lv_home
                      4.9G  263M  4.4G   6% /home
/dev/mapper/vg_root-lv_audit
                      2.0G   86M  1.8G   5% /var/log/audit
/dev/mapper/vg_root-lv_usr
                      4.9G  1.3G  3.4G  28% /usr
/dev/sda1             996M   53M  891M   6% /boot
tmpfs                  36G     0   36G   0% /dev/shm
/dev/mapper/vg_root-lv_opt
                      144G   11G  126G   8% /opt

 iostat
Device:            tps   Blk_read/s   Blk_wrtn/s   Blk_read   Blk_wrtn
sda               7.72         1.35       152.93     786543   89194896
sda1              0.00         0.00         0.02       1634      14008
sda2              0.00         0.00         0.00       1377          0
sda3              7.72         1.34       152.91     783244   89180888
sdb               0.00         0.00         0.00       1504          0
dm-0              1.86         0.17        14.84     101418    8656072
dm-1              0.80         0.00         6.39       2794    3724232
dm-2              0.65         0.39         4.92     228586    2869568
dm-3              0.93         0.03         7.46      14666    4352368
dm-4              0.00         0.01         0.00       3490         88
dm-5              0.00         0.00         0.00       1154        392
dm-6              0.32         0.00         2.57       2810    1499640
dm-7              0.05         0.00         0.37       1714     218456
dm-8              1.24         0.46         9.62     269874    5612272
dm-9             13.36         0.27       106.73     156170   62247800

sudo netstat -pan | grep -c 'ESTABLISHED'
38494

sudo netstat -pan | grep -c 'TIME_WAIT'
10

sudo netstat -pan | grep -c 'LISTEN'
84

sudo netstat -pan | grep -c 'FIN_WAIT'
362
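For reference, the per-state greps above can be collapsed into a single pass that counts every TCP state at once (a sketch; it assumes the state is the sixth whitespace-separated field, as in GNU netstat's `-tan` output):

```shell
# Count every TCP connection state in one pass instead of one grep per state.
# NR > 2 skips netstat's two header lines; $6 is the State column.
sudo netstat -tan \
  | awk 'NR > 2 { count[$6]++ } END { for (s in count) print count[s], s }' \
  | sort -rn
```

On newer systems, `ss -s` gives a similar per-state summary directly.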

What else should I look for? Appreciate the help

You're not finding any smoking guns because there are none. A load average of 7.x on a machine with 8 or fewer cores would be high, and on that class of machine top would probably paint a different picture in terms of CPU utilisation or I/O wait. On a 24-core machine, however, I don't believe your load average is a concern.

For a 24-core machine, I wouldn't start worrying until the load average hits 70 to 75% of the core count -- 16 to 18 in your case. So here, 7.x isn't a concern.
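A quick way to sanity-check this on any box is to divide the 1-minute load average by the core count (a sketch; `/proc/loadavg` and `nproc` are Linux-specific):

```shell
#!/bin/sh
# Print the 1-minute load average normalized by core count.
# On the box above: 7.78 / 24 = 0.32, i.e. roughly a third of capacity.
load1=$(awk '{print $1}' /proc/loadavg)
cores=$(nproc)
awk -v l="$load1" -v c="$cores" 'BEGIN { printf "load per core: %.2f\n", l / c }'
```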

NOTE: this is my own interpretation of how load average should be read, and I stand to be corrected.


Appreciate your input, agama. I don't have access to the box right now, but as I recall it's the only box showing this alert (via Nagios); the rest are all green, and all of them are 24-core machines. I should also mention that the load is balanced across 8 boxes, so it's a bit weird that this is the only one alerting.

Regards

I'm guessing the alarm threshold coded into Nagios alarms on a fixed value without regard to the number of cores. I'd have a look at the check scripts and adjust them so the core count is taken into account.
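Something along these lines could replace a fixed threshold (a sketch only -- the 70%/100% levels are illustrative, not official defaults; I believe newer versions of the stock check_load plugin can also divide by core count for you via its per-CPU option):

```shell
#!/bin/sh
# Hypothetical Nagios-style load check whose thresholds scale with core count:
# WARNING at 70% of cores, CRITICAL at 100%. Both levels are illustrative.
cores=$(nproc)
load1=$(awk '{print $1}' /proc/loadavg)
warn=$(awk -v c="$cores" 'BEGIN { printf "%.2f", c * 0.70 }')
crit=$cores
state=$(awk -v l="$load1" -v w="$warn" -v c="$crit" \
    'BEGIN { print (l >= c ? "CRITICAL" : l >= w ? "WARNING" : "OK") }')
echo "LOAD $state - load=$load1, warn=$warn, crit=$crit ($cores cores)"
# Standard Nagios plugin exit codes: 0=OK, 1=WARNING, 2=CRITICAL
case "$state" in OK) exit 0 ;; WARNING) exit 1 ;; *) exit 2 ;; esac
```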

I just peeked at one of our larger machines (255 cores), which is showing this load average:

  6:55pm  up 8 day(s),  2:12,  106 users,  load average: 148.22, 154.36, 153.55

Depending on the sophistication of the scheduler, it's quite possible for one machine to end up more heavily loaded than the rest. It's also possible that the load is more evenly balanced than the Nagios alarms suggest, and the other machines are simply running just under the threshold.


Again, thanks a lot! :)