Nearly Random, Uncorrelated Server Load Average Spikes

Neo · February 12, 2020, 9:42pm

I have been wrangling with a small problem on a Ubuntu server which runs a LAMP application.

Linux ubuntu 4.15.0-33-generic #36-Ubuntu SMP Wed Aug 15 16:00:05 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux

This server runs fine, basically:

ubuntu:/var/www# uptime
 20:17:13 up 105 days, 19:54,  1 user,  load average: 1.38, 1.12, 1.

But around five to six times a day, the (1 minute) load average spikes to around 200 and quickly goes back down to normal, all in the time interval of less than one minute:

I have added sensors to the crontab processes, both for the LAMP application and the server crontabs and log all cron start and stop events in the database. This instrumentation of the server yields no fruit.

In addition, in the occasional spurious events that seem to happen at "near regular" intervals, I have had a terminal window open running top , mytop , ifstats and other command line tools trying to trap the process which is causing the load spurs. There has been nothing unusual and mysql is always the "leading contender" when the pikes occur.

There is no other process with a high CPU during the spikes, so that seems to indicate something related to mysql. , obviously.

With only mysql showing up as the "leading contender", I added more instrumentation to various mysql processes in the application, and can find no event on the server or in the application which correlates to this spurious behavior.

I have correlated the spikes to (1) network interface i/o stats, (2) apache processes, (3) all crontab processes (in app and on server) and regardless of my skills at adding sensors to various processes, I cannot trap the spurious process.

I have running time series graphs, and logging in the database. When there is an incident on the graph, I go to the database log and look at all the sensor entries and can find no correlation.

There is no correlation to cron processes, network i/o, users on the server, bots, and backup processes.

I keep adding more and more instrumentation to every process on the server and in the app, but all that instrumentation looking for correlation to a server process, including the LAMP app, bears no fruit.

At first, I thought this problem was caused by bots hitting the web server; but there is no correlation to increased bot traffic or LAN interface I/O.

Then I thought the problem was caused by various cron entries in the LAMP application; but reconfiguring them, turning them on and off, adding instrumentation for start / stop times in a log, also bears no fruit.

I've been working this problem on and off for over a week and cannot find a single causal reason, no a single cause for any effect, which may be causing this spurious load average behavior.

It's only around five to six times a day for less than a minute at each event; but I want to find the cause and fix it. I'm not 100% convinced the issue is caused by a single process / issue.

I wonder if there is some underlying disk I/0 activity on the server causing these spikes? Could this be related to a potential disk error on the SSD drives? Could it be related to underlying disk raid activity?

ubuntu:/etc/mdadm# cat /proc/mdstat
Personalities : [raid1] [linear] [multipath] [raid0] [raid6] [raid5] [raid4] [raid10] 
md2 : active raid1 sdb4[1] sda4[0]
      479503360 blocks super 1.2 [2/2] [UU]
      bitmap: 3/4 pages [12KB], 65536KB chunk

md1 : active raid1 sdb3[1] sda3[0]
      7995392 blocks super 1.2 [2/2] [UU]
      
md0 : active raid1 sdb2[1] sda2[0]
      498688 blocks super 1.2 [2/2] [UU]

If so, any ideas how to trap this?

My current "best guess" is that there is some underlying disk I/O activity causing this spurious issue, and that is why nothing at the application level correlates.

Neo · February 12, 2020, 9:46pm

For completeness, here are some entries from the tail of dmesg :

[    0.783084] intel_pstate: Intel P-state driver initializing
[    0.784360] intel_pstate: HWP enabled
[    0.785146] ledtrig-cpu: registered to indicate activity on CPUs
[    0.785426] EFI Variables Facility v0.08 2004-May-17
[    0.790146] intel_pmc_core:  initialized
[    0.790564] NET: Registered protocol family 10
[    0.794450] Segment Routing with IPv6
[    0.794735] NET: Registered protocol family 17
[    0.795098] Key type dns_resolver registered
[    0.796173] RAS: Correctable Errors collector initialized.
[    0.796471] microcode: sig=0x506e3, pf=0x2, revision=0xc6
[    0.797051] microcode: Microcode Update Driver: v2.2.
[    0.797068] sched_clock: Marking stable (797040485, 0)->(777089743, 19950742)
[    0.797950] registered taskstats version 1
[    0.798222] Loading compiled-in X.509 certificates
[    0.799763] Loaded X.509 cert 'Build time autogenerated kernel key: b2b2adbda3eef7b219e699c1b87c022096150517'
[    0.800481] Couldn't get size: 0x800000000000000e
[    0.800767] MODSIGN: Couldn't get UEFI db list
[    0.802200] Loaded UEFI:MokListRT cert 'Canonical Ltd. Master Certificate Authority: ad91990bc22ab1f517048c23b6655a268e345a63' linked to secondary sys keyring
[    0.802855] Couldn't get size: 0x800000000000000e
[    0.803150] MODSIGN: Couldn't get UEFI dbx list
[    0.803458] zswap: loaded using pool lzo/zbud
[    0.806301] Key type big_key registered
[    0.806595] Key type trusted registered
[    0.807944] Key type encrypted registered
[    0.808249] AppArmor: AppArmor sha1 policy hashing enabled
[    0.808530] ima: No TPM chip found, activating TPM-bypass! (rc=-19)
[    0.808826] evm: HMAC attrs: 0x1
[    0.809603]   Magic number: 7:736:368
[    0.810095] rtc_cmos 00:05: setting system clock to 2019-10-30 06:23:54 UTC (1572416634)
[    0.810491] BIOS EDD facility v0.16 2004-Jun-25, 0 devices found
[    0.810787] EDD information not available.
[    1.155604] Freeing unused kernel memory: 2408K
[    1.188162] Write protecting the kernel read-only data: 20480k
[    1.189458] Freeing unused kernel memory: 2008K
[    1.191801] Freeing unused kernel memory: 1900K
[    1.195636] x86/mm: Checked W+X mappings: passed, no W+X pages found.
[    1.195905] x86/mm: Checking user space page tables
[    1.199568] x86/mm: Checked W+X mappings: passed, no W+X pages found.
[    1.253419] ahci 0000:00:17.0: version 3.0
[    1.253706] ahci 0000:00:17.0: AHCI 0001.0301 32 slots 6 ports 6 Gbps 0x3f impl SATA mode
[    1.254096] ahci 0000:00:17.0: flags: 64bit ncq sntf led clo only pio slum part ems deso sadm sds apst 
[    1.254129] r8169 Gigabit Ethernet driver 2.3LK-NAPI loaded
[    1.254133] r8169 0000:01:00.0: can't disable ASPM; OS doesn't have ASPM control
[    1.265296] r8169 0000:01:00.0 eth0: RTL8168h/8111h at 0x        (ptrval), 4c:cc:6a:04:4c:37, XID 14100880 IRQ 124
[    1.265918] r8169 0000:01:00.0 eth0: jumbo features [frames: 9200 bytes, tx checksumming: ko]
[    1.296757] r8169 0000:01:00.0 enp1s0: renamed from eth0
[    1.320853] scsi host0: ahci
[    1.321512] scsi host1: ahci
[    1.322057] scsi host2: ahci
[    1.322601] scsi host3: ahci
[    1.323061] scsi host4: ahci
[    1.323651] scsi host5: ahci
[    1.323975] ata1: SATA max UDMA/133 abar m2048@0xdf118000 port 0xdf118100 irq 123
[    1.324302] ata2: SATA max UDMA/133 abar m2048@0xdf118000 port 0xdf118180 irq 123
[    1.324558] ata3: SATA max UDMA/133 abar m2048@0xdf118000 port 0xdf118200 irq 123
[    1.324810] ata4: SATA max UDMA/133 abar m2048@0xdf118000 port 0xdf118280 irq 123
[    1.325060] ata5: SATA max UDMA/133 abar m2048@0xdf118000 port 0xdf118300 irq 123
[    1.325309] ata6: SATA max UDMA/133 abar m2048@0xdf118000 port 0xdf118380 irq 123
[    1.328090] [drm] Memory usable by graphics device = 4096M
[    1.328379] checking generic (c0000000 1e0000) vs hw (c0000000 10000000)
[    1.328380] fb: switching to inteldrmfb from EFI VGA
[    1.328641] Console: switching to colour dummy device 80x25
[    1.328696] [drm] Replacing VGA console driver
[    1.334329] [drm] Supports vblank timestamp caching Rev 2 (21.10.2013).
[    1.334331] [drm] Driver supports precise vblank timestamp query.
[    1.338355] i915 0000:00:02.0: vgaarb: changed VGA decodes: olddecodes=io+mem,decodes=io+mem:owns=io+mem
[    1.338674] [drm] Finished loading DMC firmware i915/skl_dmc_ver1_26.bin (v1.26)
[    1.342834] [drm] Initialized i915 1.6.0 20171023 for 0000:00:02.0 on minor 0
[    1.344165] ACPI: Video Device [GFX0] (multi-head: yes  rom: no  post: no)
[    1.344419] input: Video Bus as /devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A08:00/LNXVIDEO:00/input/input6
[    1.628416] clocksource: Switched to clocksource tsc
[    1.641018] ata3: SATA link down (SStatus 4 SControl 300)
[    1.641032] ata5: SATA link down (SStatus 4 SControl 300)
[    1.641048] ata2: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    1.641060] ata6: SATA link down (SStatus 4 SControl 300)
[    1.641074] ata4: SATA link down (SStatus 4 SControl 300)
[    1.641089] ata1: SATA link up 6.0 Gbps (SStatus 133 SControl 300)
[    1.642670] ata1.00: supports DRM functions and may not be fully accessible
[    1.642759] ata2.00: supports DRM functions and may not be fully accessible
[    1.643571] ata1.00: disabling queued TRIM support
[    1.643572] ata1.00: ATA-9: Samsung SSD 850 EVO 500GB, EMT02B6Q, max UDMA/133
[    1.643574] ata1.00: 976773168 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
[    1.643598] ata2.00: disabling queued TRIM support
[    1.643599] ata2.00: ATA-9: Samsung SSD 850 EVO 500GB, EMT02B6Q, max UDMA/133
[    1.643615] ata2.00: 976773168 sectors, multi 1: LBA48 NCQ (depth 31/32), AA
[    1.645667] ata1.00: supports DRM functions and may not be fully accessible
[    1.645691] ata2.00: supports DRM functions and may not be fully accessible
[    1.646488] ata1.00: disabling queued TRIM support
[    1.646513] ata2.00: disabling queued TRIM support
[    1.648205] ata1.00: configured for UDMA/133
[    1.648232] ata2.00: configured for UDMA/133
[    1.648412] scsi 0:0:0:0: Direct-Access     ATA      Samsung SSD 850  2B6Q PQ: 0 ANSI: 5
[    1.648765] sd 0:0:0:0: Attached scsi generic sg0 type 0
[    1.649033] sd 0:0:0:0: [sda] 976773168 512-byte logical blocks: (500 GB/466 GiB)
[    1.649080] scsi 1:0:0:0: Direct-Access     ATA      Samsung SSD 850  2B6Q PQ: 0 ANSI: 5
[    1.649104] sd 0:0:0:0: [sda] Write Protect is off
[    1.649107] sd 0:0:0:0: [sda] Mode Sense: 00 3a 00 00
[    1.649216] sd 0:0:0:0: [sda] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.649313] sd 1:0:0:0: Attached scsi generic sg1 type 0
[    1.649393] sd 1:0:0:0: [sdb] 976773168 512-byte logical blocks: (500 GB/466 GiB)
[    1.649408] sd 1:0:0:0: [sdb] Write Protect is off
[    1.649411] sd 1:0:0:0: [sdb] Mode Sense: 00 3a 00 00
[    1.649425] sd 1:0:0:0: [sdb] Write cache: enabled, read cache: enabled, doesn't support DPO or FUA
[    1.650494]  sda: sda1 sda2 sda3 sda4
[    1.650580]  sdb: sdb1 sdb2 sdb3 sdb4
[    1.651233] sd 0:0:0:0: [sda] supports TCG Opal
[    1.651235] sd 0:0:0:0: [sda] Attached SCSI disk
[    1.651310] sd 1:0:0:0: [sdb] supports TCG Opal
[    1.651311] sd 1:0:0:0: [sdb] Attached SCSI disk
[    1.816138] [drm] Cannot find any crtc or sizes
[    1.832959] random: fast init done
[    1.833102] random: systemd-udevd: uninitialized urandom read (16 bytes read)
[    1.833120] random: systemd-udevd: uninitialized urandom read (16 bytes read)
[    1.833128] random: systemd-udevd: uninitialized urandom read (16 bytes read)
[    1.835976] md/raid1:md1: active with 2 out of 2 mirrors
[    1.835990] md1: detected capacity change from 0 to 8187281408
[    1.836596] md/raid1:md0: active with 2 out of 2 mirrors
[    1.836609] md0: detected capacity change from 0 to 510656512
[    1.836831] md/raid1:md2: active with 2 out of 2 mirrors
[    1.841013] md2: detected capacity change from 0 to 491011440640
[    1.920035] raid6: sse2x1   gen() 14417 MB/s
[    1.968041] raid6: sse2x1   xor() 10359 MB/s
[    2.016042] raid6: sse2x2   gen() 18297 MB/s
[    2.052151] [drm] Cannot find any crtc or sizes
[    2.064043] raid6: sse2x2   xor() 11977 MB/s
[    2.112038] raid6: sse2x4   gen() 21191 MB/s
[    2.160038] raid6: sse2x4   xor() 12870 MB/s
[    2.208039] raid6: avx2x1   gen() 29740 MB/s
[    2.256038] raid6: avx2x1   xor() 20399 MB/s
[    2.304039] raid6: avx2x2   gen() 36755 MB/s
[    2.352038] raid6: avx2x2   xor() 22326 MB/s
[    2.400039] raid6: avx2x4   gen() 40979 MB/s
[    2.448033] raid6: avx2x4   xor() 23242 MB/s
[    2.448035] raid6: using algorithm avx2x4 gen() 40979 MB/s
[    2.448036] raid6: .... xor() 23242 MB/s, rmw enabled
[    2.448037] raid6: using avx2x2 recovery algorithm
[    2.448681] xor: automatically using best checksumming function   avx       
[    2.449192] async_tx: api initialized (async)
[    2.780803] [drm] RC6 on
[    3.040192] random: crng init done
[    3.040195] random: 7 urandom warning(s) missed due to ratelimiting
[    3.302697] EXT4-fs (md2): mounted filesystem with ordered data mode. Opts: (null)
[    3.380050] ip_tables: (C) 2000-2006 Netfilter Core Team
[    3.391342] systemd[1]: systemd 237 running in system mode. (+PAM +AUDIT +SELINUX +IMA +APPARMOR +SMACK +SYSVINIT +UTMP +LIBCRYPTSETUP +GCRYPT +GNUTLS +ACL +XZ +LZ4 +SECCOMP +BLKID +ELFUTILS +KMOD -IDN2 +IDN -PCRE2 default-hierarchy=hybrid)
[    3.408121] systemd[1]: Detected architecture x86-64.
[    3.411068] systemd[1]: Set hostname to <condor2769>.
[    3.470691] systemd[1]: Set up automount Arbitrary Executable File Formats File System Automount Point.
[    3.470764] systemd[1]: Reached target System Time Synchronized.
[    3.470880] systemd[1]: Created slice User and Session Slice.
[    3.470904] systemd[1]: Reached target User and Group Name Lookups.
[    3.470943] systemd[1]: Started Forward Password Requests to Wall Directory Watch.
[    3.471019] systemd[1]: Created slice System Slice.
[    3.471075] systemd[1]: Listening on Journal Socket (/dev/log).
[    3.480628] EXT4-fs (md2): re-mounted. Opts: errors=remount-ro,usrquota
[    3.505280] systemd-journald[364]: Received request to flush runtime journal from PID 1
[    3.513289] systemd-journald[364]: File /var/log/journal/c0688dc6550e40d9b44a66237890f6da/system.journal corrupted or uncleanly shut down, renaming and replacing.
[    3.530296] IPv6: ADDRCONF(NETDEV_UP): enp1s0: link is not ready
[    3.555546] (NULL device *): hwmon_device_register() is deprecated. Please convert the driver to use hwmon_device_register_with_info().
[    3.555803] shpchp: Standard Hot Plug PCI Controller Driver version: 0.4
[    3.556869] intel-lpss 0000:00:15.0: enabling device (0000 -> 0002)
[    3.560814] idma64 idma64.0: Found Intel integrated DMA 64-bit
[    3.565947] intel-lpss 0000:00:15.1: enabling device (0000 -> 0002)
[    3.566174] idma64 idma64.1: Found Intel integrated DMA 64-bit
[    3.570988] mei_me 0000:00:16.0: enabling device (0000 -> 0002)
[    3.578580] r8169 0000:01:00.0 enp1s0: link down
[    3.578582] r8169 0000:01:00.0 enp1s0: link down
[    3.578660] IPv6: ADDRCONF(NETDEV_UP): enp1s0: link is not ready
[    3.585416] intel-lpss 0000:00:1e.0: enabling device (0000 -> 0002)
[    3.585799] idma64 idma64.2: Found Intel integrated DMA 64-bit
[    3.592856] RAPL PMU: API unit is 2^-32 Joules, 5 fixed counters, 655360 ms ovfl timer
[    3.592857] RAPL PMU: hw unit of domain pp0-core 2^-14 Joules
[    3.592858] RAPL PMU: hw unit of domain package 2^-14 Joules
[    3.592859] RAPL PMU: hw unit of domain dram 2^-14 Joules
[    3.592859] RAPL PMU: hw unit of domain pp1-gpu 2^-14 Joules
[    3.592859] RAPL PMU: hw unit of domain psys 2^-14 Joules
[    3.602188] AVX2 version of gcm_enc/dec engaged.
[    3.602189] AES CTR mode by8 optimization enabled
[    3.631620] Adding 7995388k swap on /dev/md1.  Priority:-2 extents:1 across:7995388k SSFS
[    3.649228] intel_rapl: Found RAPL domain package
[    3.649229] intel_rapl: Found RAPL domain core
[    3.649230] intel_rapl: Found RAPL domain uncore
[    3.649231] intel_rapl: Found RAPL domain dram
[    3.716185] dw-apb-uart.2: ttyS4 at MMIO 0xdf117000 (irq = 20, base_baud = 115200) is a 16550A
[    3.768473] EXT4-fs (md0): mounting ext2 file system using the ext4 subsystem
[    3.769920] EXT4-fs (md0): mounted filesystem without journal. Opts: (null)
[    7.404362] r8169 0000:01:00.0 enp1s0: link up
[    7.404368] IPv6: ADDRCONF(NETDEV_CHANGE): enp1s0: link becomes ready
[ 4075.546801] perf: interrupt took too long (2508 > 2500), lowering kernel.perf_event_max_sample_rate to 79500
[ 5689.837424] perf: interrupt took too long (3142 > 3135), lowering kernel.perf_event_max_sample_rate to 63500
[10115.250368] perf: interrupt took too long (3945 > 3927), lowering kernel.perf_event_max_sample_rate to 50500
[22838.481641] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies.  Check SNMP counters.
[28028.743897] perf: interrupt took too long (4950 > 4931), lowering kernel.perf_event_max_sample_rate to 40250
[280629.260406] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies.  Check SNMP counters.
[343987.398818] md: data-check of RAID array md0
[343987.411753] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
[343987.467934] md: delaying data-check of md2 until md1 has finished (they share one or more physical units)
[343989.916428] md: md0: data-check done.
[343989.920038] md: data-check of RAID array md1
[344029.804168] md: md1: data-check done.
[344029.808416] md: data-check of RAID array md2
[346437.124572] md: md2: data-check done.
[385142.122374] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies.  Check SNMP counters.
[526986.800286] perf: interrupt took too long (6196 > 6187), lowering kernel.perf_event_max_sample_rate to 32250
[1150592.768812] TCP: request_sock_TCP: Possible SYN flooding on port 11211. Sending cookies.  Check SNMP counters.
[1150631.783170] cgroup: fork rejected by pids controller in /system.slice/apache2.service
[1212935.515650] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies.  Check SNMP counters.
[2186110.449750] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies.  Check SNMP counters.
[2410857.588817] cgroup: fork rejected by pids controller in /system.slice/apache2.service
[2766788.336203] md: data-check of RAID array md0
[2766788.352563] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
[2766788.392480] md: delaying data-check of md2 until md1 has finished (they share one or more physical units)
[2766790.823840] md: md0: data-check done.
[2766790.826994] md: data-check of RAID array md1
[2766830.713540] md: md1: data-check done.
[2766830.716909] md: data-check of RAID array md2
[2769236.777339] md: md2: data-check done.
[4077245.109452] usb 1-4: new low-speed USB device number 2 using xhci_hcd
[4077245.266436] usb 1-4: New USB device found, idVendor=0a81, idProduct=0205
[4077245.266441] usb 1-4: New USB device strings: Mfr=1, Product=2, SerialNumber=0
[4077245.266446] usb 1-4: Product: PS2 to USB Converter
[4077245.266450] usb 1-4: Manufacturer: CHESEN
[4077245.304834] hidraw: raw HID events driver (C) Jiri Kosina
[4077245.321865] usbcore: registered new interface driver usbhid
[4077245.321867] usbhid: USB HID core driver
[4077245.330021] input: CHESEN PS2 to USB Converter as /devices/pci0000:00/0000:00:14.0/usb1/1-4/1-4:1.0/0003:0A81:0205.0001/input/input7
[4077245.389637] hid-generic 0003:0A81:0205.0001: input,hidraw0: USB HID v1.10 Keyboard [CHESEN PS2 to USB Converter] on usb-0000:00:14.0-4/input0
[4077245.390055] input: CHESEN PS2 to USB Converter as /devices/pci0000:00/0000:00:14.0/usb1/1-4/1-4:1.1/0003:0A81:0205.0002/input/input8
[4077245.449900] hid-generic 0003:0A81:0205.0002: input,hidraw1: USB HID v1.10 Mouse [CHESEN PS2 to USB Converter] on usb-0000:00:14.0-4/input1
[4077272.721817] fbcon: inteldrmfb (fb0) is primary device
[4077272.748609] Console: switching to colour frame buffer device 128x48
[4077272.767415] i915 0000:00:02.0: fb0: inteldrmfb frame buffer device
[4077295.853502] usb 1-6: new high-speed USB device number 3 using xhci_hcd
[4077296.002270] usb 1-6: New USB device found, idVendor=14dd, idProduct=0002
[4077296.002275] usb 1-6: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[4077296.002281] usb 1-6: Product: Multidevice
[4077296.002285] usb 1-6: Manufacturer: Peppercon AG
[4077296.002289] usb 1-6: SerialNumber: 03CDF9B8AC3A66ED5EE5F7C3072A91E8
[4077296.004069] input: Peppercon AG Multidevice as /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.0/0003:14DD:0002.0003/input/input9
[4077296.061725] hid-generic 0003:14DD:0002.0003: input,hidraw2: USB HID v1.01 Keyboard [Peppercon AG Multidevice] on usb-0000:00:14.0-6/input0
[4077296.062405] input: Peppercon AG Multidevice as /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.1/0003:14DD:0002.0004/input/input10
[4077296.062625] hid-generic 0003:14DD:0002.0004: input,hidraw3: USB HID v1.01 Mouse [Peppercon AG Multidevice] on usb-0000:00:14.0-6/input1
[4088040.967662] usb 1-6: USB disconnect, device number 3
[4088104.869754] usb 1-6: new high-speed USB device number 4 using xhci_hcd
[4088105.018599] usb 1-6: New USB device found, idVendor=14dd, idProduct=0002
[4088105.018605] usb 1-6: New USB device strings: Mfr=1, Product=2, SerialNumber=3
[4088105.018609] usb 1-6: Product: Multidevice
[4088105.018613] usb 1-6: Manufacturer: Peppercon AG
[4088105.018616] usb 1-6: SerialNumber: 03CDF9B8AC3A66ED5EE5F7C3072A91E8
[4088105.020475] input: Peppercon AG Multidevice as /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.0/0003:14DD:0002.0005/input/input11
[4088105.078213] hid-generic 0003:14DD:0002.0005: input,hidraw2: USB HID v1.01 Keyboard [Peppercon AG Multidevice] on usb-0000:00:14.0-6/input0
[4088105.079216] input: Peppercon AG Multidevice as /devices/pci0000:00/0000:00:14.0/usb1/1-6/1-6:1.1/0003:14DD:0002.0006/input/input12
[4088105.079532] hid-generic 0003:14DD:0002.0006: input,hidraw3: USB HID v1.01 Mouse [Peppercon AG Multidevice] on usb-0000:00:14.0-6/input1
[4425126.063597] usb 1-6: USB disconnect, device number 4
[4425127.120530] usb 1-4: USB disconnect, device number 2
[4608173.543565] audit: type=1400 audit(1577024807.084:2): apparmor="DENIED" operation="change_onexec" info="label not found" error=-2 profile="unconfined" name="system_tor" pid=4141 comm="(tor)"
[4674102.600246] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies.  Check SNMP counters.
[5389167.147735] TCP: enp1s0: Driver has suspect GRO implementation, TCP performance may be compromised.
[5790886.556050] md: data-check of RAID array md0
[5790886.570198] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
[5790886.580154] md: delaying data-check of md2 until md1 has finished (they share one or more physical units)
[5790889.088669] md: md0: data-check done.
[5790889.091915] md: data-check of RAID array md1
[5790928.804089] md: md1: data-check done.
[5790928.808618] md: data-check of RAID array md2
[5793342.403501] md: md2: data-check done.
[5885167.576271] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies.  Check SNMP counters.
[5942225.927974] r8169 0000:01:00.0 enp1s0: link down
[5942286.125907] r8169 0000:01:00.0 enp1s0: link up
[6100421.130628] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies.  Check SNMP counters.
[6848807.673874] DCCP: Activated CCID 2 (TCP-like)
[6848807.681997] sctp: Hash tables configured (bind 1024/1024)
[8210127.728955] md: data-check of RAID array md0
[8210127.742698] md: delaying data-check of md1 until md0 has finished (they share one or more physical units)
[8210127.780876] md: delaying data-check of md2 until md1 has finished (they share one or more physical units)
[8210130.257361] md: md0: data-check done.
[8210130.260788] md: data-check of RAID array md1
[8210170.116940] md: md1: data-check done.
[8210170.121703] md: data-check of RAID array md2
[8212579.951548] md: md2: data-check done.
[8325516.433600] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies.  Check SNMP counters.
[9051600.650503] r8169 0000:01:00.0 enp1s0: link down
[9051660.833780] r8169 0000:01:00.0 enp1s0: link up

Neo · February 12, 2020, 11:53pm

Update:

Have checked the health of the disks and all seems fine, for example (one of many checks):

ubuntu#  smartctl -H /dev/sda
smartctl 6.6 2016-05-31 r4324 [x86_64-linux-4.15.0-33-generic] (local build)
Copyright (C) 2002-16, Bruce Allen, Christian Franke, www.smartmontools.org

=== START OF READ SMART DATA SECTION ===
SMART overall-health self-assessment test result: PASSED

vbe · February 13, 2020, 2:58am

I have seen cases like that when you have enormous disc cache synchronising... all is fine when pure transactional mode ( the periodic sync has hmmm "little" to update ) but when it comes to batches running, it fills the cache before periodic sync and then the system has to wait the end of forced sync (file cache OS side...)
Other contention possible big file cache with big SGA: parsing both and not finding reduces perfs...

vbe · February 13, 2020, 3:09am

I would check if at that time your server is doing something "heavy batch mod style" at the same time like backup or import/export of tables...

Neo · February 13, 2020, 3:18am

Yes, I know...

That is why I previously mentioned that I have checked all cron and cron-like processes which are all batch-like processes. The server only runs a LAMP application, so there are no other application processes of any consequence.

I think I mentioned all the things you mentioned already in an earlier post; but thanks for brainstorming... and coming up with ideas.

Honestly, I am already logging all batch-like processing at the application level, and there is no correlation, as I mentioned.

vbe · February 13, 2020, 3:25am

you tried

sar -b 5 5

or more adequate value?

vbe · February 13, 2020, 3:31am

[5885167.576271] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies.  Check SNMP counters.
[5942225.927974] r8169 0000:01:00.0 enp1s0: link down
[5942286.125907] r8169 0000:01:00.0 enp1s0: link up
[6100421.130628] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies.  Check SNMP counters. Did you find anything here?
[6848807.673874] DCCP: Activated CCID 2 (TCP-like)
[6848807.681997] sctp: Hash tables configured (bind 1024/1024)
[8210127.728955] md: data-check of RAID array md0
[8210127.742698] md: delaying data-check of md1 until md0 has finished (they share one or more physical units) Due to your check?
[8210127.780876] md: delaying data-check of md2 until md1 has finished (they share one or more physical units)
[8210130.257361] md: md0: data-check done.
[8210130.260788] md: data-check of RAID array md1
[8210170.116940] md: md1: data-check done.
[8210170.121703] md: data-check of RAID array md2
[8212579.951548] md: md2: data-check done.

vbe · February 13, 2020, 3:35am

How are the disks attached ? a NAS, a SAN? what type?
But usually this sort of issues comes more from the OS side... Or SAN is flushing and syncing its cache but badly configured, not optimised to your usage (dont laugh I have seen cases with the best equipment...)

Neo · February 13, 2020, 3:38am

That is a good idea.

I'll give some combination of sar a try, maybe I'll write a little script when the load goes over 50 to kick of something like:

sar --human -d 5 5

or some variation of the above.

Good idea.

Neo · February 13, 2020, 3:40am

The SSD drives are in the box.

ubuntu# dmesg | grep disk
[    1.651235] sd 0:0:0:0: [sda] Attached SCSI disk
[    1.651311] sd 1:0:0:0: [sdb] Attached SCSI disk

Neo · February 13, 2020, 3:46am

vbe:

[5885167.576271] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies.  Check SNMP counters.
[5942225.927974] r8169 0000:01:00.0 enp1s0: link down
[5942286.125907] r8169 0000:01:00.0 enp1s0: link up
[6100421.130628] TCP: request_sock_TCP: Possible SYN flooding on port 443. Sending cookies.  Check SNMP counters. Did you find anything here?
[6848807.673874] DCCP: Activated CCID 2 (TCP-like)
[6848807.681997] sctp: Hash tables configured (bind 1024/1024)
[8210127.728955] md: data-check of RAID array md0
[8210127.742698] md: delaying data-check of md1 until md0 has finished (they share one or more physical units) Due to your check?
[8210127.780876] md: delaying data-check of md2 until md1 has finished (they share one or more physical units)
[8210130.257361] md: md0: data-check done.
[8210130.260788] md: data-check of RAID array md1
[8210170.116940] md: md1: data-check done.
[8210170.121703] md: data-check of RAID array md2
[8212579.951548] md: md2: data-check done.

Based on prior experience, a SYN Flood attack msg in dmesg is a fraction of the traffic the site has (it's noise), so I don't think that is an issue (it's noise, I think... not "signal" .... ).

There is no I/O spike (network I/O) as mentioned earlier (in case you missed it).

There is zero correlation between network I/0 and the load spike.

I do not think it is network I/O related.

The site gets tons of bot traffic from wayward bots globally, and there would be a correlation, but there is also no correlation between bots, network i/o, etc. None.

Neo · February 13, 2020, 3:51am

FYI, there has been no spike in the past 8-9 hours (my time):

vbe · February 13, 2020, 4:04am

OK so it is more now I have read again your first and second post, related to the MySQL engine: Do you have a MyISAM engine running too?
What most admins I work with don't understand is the recommendations often given by vendors on tuning their soft are based on standard configurations and don't really apply when out of that.
What you see to me is the backside effect of big SGA that could be more efficient when size is reduced since the SAN bays have huge cache too... Either it is loosing its time parsing the storage now nearly full either the storage is too fragmented

Neo · February 13, 2020, 4:07am

Yes, most of the DB tables (99 percent) are MYISAM tables, especially the larger ones.

I don't have SAN.... The SCSI disks are directly attached in the box.

Neo · February 13, 2020, 5:12am

Just spiked again.... nearly exactly Thursday, February 13, 2020 10:02 AM UTC to Thursday, February 13, 2020 10:03 AM UTC, (5PM my time) just a one minute spike hit. Instrumentation shows no cron or batch-like processing. Only MySQL, Apache, etc (LAMP) and nothing "traceable" in the application:

hicksd8 · February 13, 2020, 5:25am

sar I agree is a good idea. Does the CPU spike coincide with an I/O spike??

In my experience this is likely to be kernel element feature. For example, are you watching the cron process itself because, as you will know, this wakes up periodically (and the regularity varies from kernel to kernel, somewhere between 4 and 24 hours) to integrity check its cron table cache in memory against the crontabs on disk. If there's a lot of cron jobs this process takes a short while.

So question is are you checking the cron process itself and not just cron jobs scheduled to run?

The above is a sheer guess.

Neo · February 13, 2020, 5:36am

Hi Dennis,

No. I have mentioned this a number of times already, including the first post . There are no network I/O spikes.

Regarding disk I/O, I have not yet set up any instrumentation to attempt to correlate disk I/O to the spikes.

hicksd8 · February 13, 2020, 5:49am

So sar as suggested is a good idea because it might tell you that.

Neo · February 13, 2020, 5:55am

Agreed...

I think I will try iostat or iotop during anticipated spike periods (if I can predict one, LOL)

Or I will write some code to instrument this when the spike happens and try to trap the causal process that way.