Need help determining if %si (software interrupts) is too high

cdlaforc · October 14, 2014, 9:33am

Hello,
The organization I work for uses SCOM (Microsoft System Center Operations Manager) for data center management and alerting. Since the client was installed on our Linux servers, we have been getting messages from SCOM stating "DPC Time Percentage is too high". This is happening on all of our MySQL cluster servers. From my research, it appears that this message relates to software interrupts.

Running top or mpstat shows that %si for processor 7 is frequently over 20%.

Cpu0  : 41.0%us, 15.3%sy,  0.0%ni, 35.7%id,  0.0%wa,  0.0%hi,  8.0%si,  0.0%st
Cpu1  : 25.7%us, 17.3%sy,  0.0%ni, 52.0%id,  0.0%wa,  0.0%hi,  5.0%si,  0.0%st
Cpu2  : 21.9%us,  1.3%sy,  0.0%ni, 75.7%id,  0.0%wa,  0.0%hi,  1.0%si,  0.0%st
Cpu3  : 14.0%us,  9.3%sy,  0.0%ni, 73.1%id,  0.0%wa,  0.0%hi,  3.7%si,  0.0%st
Cpu4  : 55.3%us,  4.3%sy,  0.0%ni, 38.3%id,  0.0%wa,  0.0%hi,  2.0%si,  0.0%st
Cpu5  : 53.3%us,  4.6%sy,  0.0%ni, 40.1%id,  0.0%wa,  0.0%hi,  2.0%si,  0.0%st
Cpu6  :  5.0%us,  9.0%sy,  0.0%ni, 83.7%id,  1.0%wa,  0.0%hi,  1.3%si,  0.0%st
Cpu7  : 50.7%us,  4.3%sy,  0.0%ni,  1.3%id,  0.0%wa, 11.6%hi, 32.1%si,  0.0%st

 mpstat -P ALL 60
Linux 2.6.18-238.9.1.el5 ()     10/13/2014

02:19:27 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
02:20:27 PM  all   29.34    0.00    4.64    0.06    0.91    4.35    0.00   60.70  17469.47
02:20:27 PM    0   12.07    0.00    3.62    0.07    0.00    0.42    0.00   83.83   1000.03
02:20:27 PM    1   38.68    0.00    4.33    0.05    0.00    2.23    0.00   54.70      0.00
02:20:27 PM    2    8.79    0.00    1.97    0.00    0.00    0.55    0.00   88.70      0.00
02:20:27 PM    3   28.72    0.00    5.50    0.12    0.00    2.05    0.00   63.61      0.53
02:20:27 PM    4   53.98    0.00    3.74    0.00    0.00    1.70    0.00   40.59      0.00
02:20:27 PM    5   44.97    0.00    5.08    0.00    0.00    1.93    0.00   48.01      0.58
02:20:27 PM    6   35.75    0.00    4.02    0.02    0.00    1.37    0.00   58.85      0.00
02:20:27 PM    7   11.74    0.00    8.85    0.20    7.28   24.59    0.00   47.34  16468.35
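
To check this across all four servers without eyeballing every sample, something like the following Python sketch could flag CPUs whose %soft exceeds a threshold. The function name `soft_irq_hotspots` and the 20% default are my own choices, not part of sysstat, and it assumes the column layout shown in the mpstat output above (a header row containing `CPU` and `%soft`).

```python
def soft_irq_hotspots(mpstat_text, threshold=20.0):
    """Return {cpu_number: %soft} for per-CPU rows above the threshold.

    Assumes an mpstat header row that contains 'CPU' and '%soft', and data
    rows with the same number of whitespace-separated fields as the header.
    """
    rows = [ln.split() for ln in mpstat_text.splitlines() if ln.strip()]
    header = next(r for r in rows if "%soft" in r)
    cpu_col = header.index("CPU")
    soft_col = header.index("%soft")

    hot = {}
    for fields in rows:
        if len(fields) != len(header):
            continue  # banner line or anything that is not a data row
        cpu = fields[cpu_col]
        if not cpu.isdigit():
            continue  # skips the header itself and the 'all' summary row
        soft = float(fields[soft_col])
        if soft > threshold:
            hot[int(cpu)] = soft
    return hot


if __name__ == "__main__":
    sample = """\
02:19:27 PM  CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal   %idle    intr/s
02:20:27 PM  all   29.34    0.00    4.64    0.06    0.91    4.35    0.00   60.70  17469.47
02:20:27 PM    1   38.68    0.00    4.33    0.05    0.00    2.23    0.00   54.70      0.00
02:20:27 PM    7   11.74    0.00    8.85    0.20    7.28   24.59    0.00   47.34  16468.35
"""
    print(soft_irq_hotspots(sample))
```

Fed the 60-second sample above, this should single out CPU 7 only; the threshold is arbitrary and would need tuning to whatever SCOM is actually alerting on.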

According to /proc/interrupts, IRQ 185 seems to be the largest source of interrupts on processor 7. This is the same on all 4 servers in question: each has "IO-APIC-level megasas, eth1, eth0" sharing IRQ 185.

cat /proc/interrupts 
           CPU0       CPU1       CPU2       CPU3       CPU4       CPU5       CPU6       CPU7       
  0: 1385547152          1          0          0          0         80          5      57382    IO-APIC-edge  timer
  1:          0          0          0          0          0          0          0          2    IO-APIC-edge  i8042
  8:          0          0          0          0          0          0          0          1    IO-APIC-edge  rtc
  9:          0          0          0          0          0          1          0         34   IO-APIC-level  acpi
 11:          0          0        323          0          0          0          0        127   IO-APIC-level  ehci_hcd:usb1, ohci_hcd:usb2, ohci_hcd:usb3
 12:          0          0          0          0          0          0          0          5    IO-APIC-edge  i8042
138:         24          0      85097    1927115   17543366    4371772   26364546    4915645         PCI-MSI  eth3
154:         22          0      55073    1919698    9263542  111344311   28653821  119374902         PCI-MSI  eth2
185:          2          1          0          2          1  336790701   12301729 3763055601   IO-APIC-level  megasas, eth1, eth0
NMI:    7588535    7138711    7412871    7375055    7517698    8340865    8123444    8485641 
LOC: 1384277563 1384278693 1384279520 1384278027 1384279083 1384265499 1384279672 1384273293 
ERR:          0
MIS:          0
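
To put a number on how lopsided IRQ 185 is, here is a rough Python sketch that turns one row of /proc/interrupts into per-CPU percentages. The helper name `irq_cpu_share` is hypothetical, and it assumes the layout shown above (a `CPU0 CPU1 ...` header row followed by `IRQ: count count ...` rows). The counters are cumulative since boot, so this describes the long-run distribution, not the current rate.

```python
def irq_cpu_share(interrupts_text, irq):
    """Return the percentage of an IRQ's interrupts handled by each CPU.

    interrupts_text is the content of /proc/interrupts; irq is the row
    label (for example 185 or "NMI").
    """
    lines = [ln for ln in interrupts_text.splitlines() if ln.strip()]
    ncpu = len(lines[0].split())  # header row: CPU0 CPU1 ...
    for ln in lines[1:]:
        label, _, rest = ln.partition(":")
        if label.strip() == str(irq):
            # The first ncpu fields are counters; the rest is the
            # controller type and device names.
            counts = [int(x) for x in rest.split()[:ncpu]]
            total = sum(counts)
            return [100.0 * c / total if total else 0.0 for c in counts]
    raise KeyError(irq)


if __name__ == "__main__":
    with open("/proc/interrupts") as f:
        shares = irq_cpu_share(f.read(), 185)
    for cpu, pct in enumerate(shares):
        print("CPU%d: %.2f%%" % (cpu, pct))
```

For the IRQ 185 row above, nearly all of the interrupts land on CPU7, with most of the remainder on CPU5.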

This is the content of /proc/irq/185/smp_affinity, which appears to pin IRQ 185 to CPU7 (the only bit set in the mask is 0x80, i.e. bit 7).

cat /proc/irq/185/smp_affinity
00000000,00000000,00000000,00000000,00000000,00000000,00000000,00000080
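
For anyone double-checking my reading of the mask, this is a small Python sketch that decodes an smp_affinity string into CPU numbers. The function name `cpus_in_affinity_mask` is my own; it relies on the mask being a hexadecimal bitmask written as comma-separated 32-bit words, most significant word first, where bit N stands for CPU N.

```python
def cpus_in_affinity_mask(mask):
    """Decode an smp_affinity string into a sorted list of CPU numbers."""
    # Dropping the commas yields one big hexadecimal number.
    value = int(mask.strip().replace(",", ""), 16)
    return [bit for bit in range(value.bit_length()) if (value >> bit) & 1]


if __name__ == "__main__":
    with open("/proc/irq/185/smp_affinity") as f:
        print(cpus_in_affinity_mask(f.read()))
```

With the mask shown above, the result is a single CPU, number 7, which matches where the IRQ 185 counts are piling up.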

Can anyone suggest the steps needed to determine whether this is actually a problem on these servers? The load average is typically about 3.5, so the servers seem to be running fine. These are Red Hat 5.6 servers.

Thanks,

Chris.