System does not reboot after injecting uncorrectable PCIE errors via aer-inject

CPU info :
root@node:~# cat /sys/devices/cpu/caps/pmu_name 
broadwell
Architecture:                    x86_64
CPU op-mode(s):                  32-bit, 64-bit
Byte Order:                      Little Endian
Address sizes:                   46 bits physical, 48 bits virtual
CPU(s):                          20
On-line CPU(s) list:             0-11
Off-line CPU(s) list:            12-19
Thread(s) per core:              1
Core(s) per socket:              10
Socket(s):                       1
NUMA node(s):                    1
Vendor ID:                       GenuineIntel
CPU family:                      6
Model:                           79
Model name:                      Intel(R) Xeon(R) CPU E5-2618L v4 @ 2.20GHz
$uname -a
Linux smirnoff-node 5.2.60-rt15-LTS19 #1 SMP Mon Nov 21 19:33:51 PST 2022 x86_64 x86_64 x86_64 GNU/Linux

I am trying to simulate PCIe errors and validate aer-inject functionality on a Linux machine, using the error file below:

 cat /root/pcie.err
### AER Inject Error file
## DEVICE: Ethernet controller: Broadcom Inc. and subsidiaries Device b045
##-----------------------------------
AER
BUS 0x4a DEV 00 FN 0
UNCOR_STATUS TRAIN
HEADER_LOG 7 1 2 5
$aer-inject pcie.err
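
As a cross-check of the error file, here is a minimal Python sketch (not part of aer-inject) that parses the directives used above and computes the uncorrectable status mask. The `UNCOR_BITS` table is a partial list assumed to mirror the kernel's `PCI_ERR_UNC_*` constants, and treating the numbers as hex is also an assumption; `parse_aer_file` is a hypothetical helper name.

```python
# Subset of the names accepted after UNCOR_STATUS, mapped to bit masks in the
# AER Uncorrectable Error Status register. Assumed to mirror the kernel's
# PCI_ERR_UNC_* constants; verify against your aer-inject version.
UNCOR_BITS = {
    "TRAIN": 0x00000001,       # bit 0; the kernel log in this report prints it as "Undefined"
    "DLP": 0x00000010,
    "POISON_TLP": 0x00001000,
    "COMP_TIME": 0x00004000,
    "COMP_ABORT": 0x00008000,
    "MALF_TLP": 0x00040000,
    "UNSUP": 0x00100000,
}


def parse_aer_file(text):
    """Parse the few aer-inject directives used in this report."""
    spec = {"bdf": None, "uncor_status": 0, "header_log": []}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line or line == "AER":
            continue
        tokens = line.split()
        if tokens[0] == "BUS":
            # "BUS 0x4a DEV 00 FN 0"; numbers are read as hex (assumption).
            fields = dict(zip(tokens[0::2], tokens[1::2]))
            bus, dev, fn = (int(fields[k], 16) for k in ("BUS", "DEV", "FN"))
            spec["bdf"] = f"{bus:02x}:{dev:02x}.{fn:x}"
        elif tokens[0] == "UNCOR_STATUS":
            for name in tokens[1:]:
                spec["uncor_status"] |= UNCOR_BITS[name]
        elif tokens[0] == "HEADER_LOG":
            spec["header_log"] = [int(t, 16) for t in tokens[1:]]
    return spec


PCIE_ERR = """\
### AER Inject Error file
AER
BUS 0x4a DEV 00 FN 0
UNCOR_STATUS TRAIN
HEADER_LOG 7 1 2 5
"""

spec = parse_aer_file(PCIE_ERR)
print(spec["bdf"], f"{spec['uncor_status']:08x}")  # 4a:00.0 00000001
```

The mask 00000001 agrees with the "Injecting errors 00000000/00000001" and "error status/mask=00000001/00000000" lines in the kernel log, so the injection itself appears to have been applied as written.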
Kernel panic logs after aer-inject:
=======
- The system gets stuck and does not reboot.

pcieport 0000:25:02.0: BAR 13: failed to assign [io  size 0x1000]
perf: interrupt took too long (2502 > 2500), lowering kernel.perf_event_max_sample_rate to 79000
perf: interrupt took too long (3132 > 3127), lowering kernel.perf_event_max_sample_rate to 63000
perf: interrupt took too long (3916 > 3915), lowering kernel.perf_event_max_sample_rate to 51000
pcieport 0000:00:03.1: aer_inject: Injecting errors 00000000/00000001 into device 0000:4a:00.0
pcieport 0000:00:03.1: AER: Uncorrected (Non-Fatal) error received: 0000:4a:00.0
linux-kernel-bde 0000:4a:00.0: AER: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Receiver ID)
linux-kernel-bde 0000:4a:00.0: AER:   device [14e4:b045] error status/mask=00000001/00000000
linux-kernel-bde 0000:4a:00.0: AER:    [ 0] Undefined              (First)
pcieport 0000:00:03.1: AER: Device recovery failed
kvm: exiting hardware virtualization
sd 5:0:0:0: [sdb] Synchronizing SCSI cache
sd 0:0:0:0: [sda] Synchronizing SCSI cache
reboot: Restarting system
printk: enabled sync mode
watchdog: BUG: soft lockup - CPU#10 stuck for 134s! [lcmd:10451]
sd 0:0:0:0: timing out command, waited 180s
printk: console [ttyS0]: printing thread stopped
reboot: machine restart
Modules linked in: vhost_net vhost macvtap tap xt_CHECKSUM xt_MASQUERADE xt_conntrack ipt_REJECT nf_reject_ipv4 iptable_mangle iptable_nat ebtable_filter ebtables linux_user_bde(PO) linux_kernel_bde(PO) xt_tcpudp bridge stp llc ip6table_filter ip6_tables iptable_filter ip_tables x_tables kvm_intel kvm vfio_pci vfio_virqfd vfio_iommu_type1 vfio pci_stub uio_pci_hostif i40e(O) qfx_pci_static_map(O) macvlan socktun(O) i2c_dev uio_fpga(O) uio iTCO_wdt iTCO_vendor_support watchdog intel_rapl_msr intel_rapl_common x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul crct10dif_common aesni_intel aes_x86_64 glue_helper crypto_simd cryptd i2c_i801 lpc_ich igb(O) configfs pcc_cpufreq sch_fq_codel nfsd openvswitch nsh nf_conncount nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 irqbypass fuse [last unloaded: kvm]
CPU: 10 PID: 10451 Comm: lcmd Kdump: loaded Tainted: P           O      5.2.60-rt15-LTS19 #1
Hardware name: Juniper Networks Inc. 0CA3/0CA3, BIOS CBEP_P_VAL1_00.15.01 10/30/2018
**Shutting down cpus with NMI**

  • The reboot gets stuck here, and the setup needs a hard power cycle to recover. The issue reproduces in 6 out of 10 iterations.
  • Analysis of the reboot path so far shows that the issue reproduces when the system reboots through the NMI_VECTOR path.
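
To tally how often the NMI path is taken across iterations, captured console logs can be scanned for the messages seen in this report. This is only a sketch: `MARKERS`, `stages_seen` and `used_nmi_fallback` are hypothetical names, and the marker strings are copied from the log above, so they may differ on other kernel versions.

```python
# Marker strings copied from the console log in this report.
MARKERS = [
    ("aer_recovery_failed", "AER: Device recovery failed"),
    ("reboot_requested", "reboot: Restarting system"),
    ("soft_lockup", "watchdog: BUG: soft lockup"),
    ("machine_restart", "reboot: machine restart"),
    ("nmi_fallback", "Shutting down cpus with NMI"),
]


def stages_seen(log_text):
    """Return the names of all markers that appear in one console log."""
    return [name for name, needle in MARKERS if needle in log_text]


def used_nmi_fallback(log_text):
    """True if this iteration reached the NMI shutdown message."""
    return "nmi_fallback" in stages_seen(log_text)


def count_nmi_runs(logs):
    """Count how many captured logs show the NMI fallback."""
    return sum(1 for text in logs if used_nmi_fallback(text))
```

For example, `count_nmi_runs(open(p).read() for p in paths)` over one saved console log per iteration gives the number of runs that reached the NMI stage, which can then be compared against the number of runs that hung.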

- In the reboot path, the kernel tries to stop all other active CPUs before rebooting and sends the REBOOT_VECTOR IRQ so that they shut down. In the non-working case REBOOT_VECTOR fails, 2 CPUs remain active, and the kernel then tries to force them down using the NMI_VECTOR IRQ.

  • Since the active CPUs are not stopped (due to some locking or another cause), NMI_VECTOR is invoked, but in this case NMI_VECTOR also fails to turn off the CPUs, causing a deadlock or hang during reboot.
Working case, when all CPUs are active and all of them stop:
apic->send_IPI_allbutself(REBOOT_VECTOR);
/linux/v5.2.21/source/arch/x86/kernel/smp.c#L218
Non-working case, when two CPUs are unable to stop:
apic->send_IPI_allbutself(NMI_VECTOR);
/linux/v5.2.21/source/arch/x86/kernel/smp.c#L244
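
The two-stage flow described above can be modelled with a small Python toy. It is not kernel code: it leaves out the busy-wait timeouts between the stages (which vary by kernel version) and only shows which vector ends up being the last one tried. The `Cpu` class, its flags, and the stuck CPU ids 3 and 7 are hypothetical, since the report does not say which two CPUs fail to stop.

```python
from dataclasses import dataclass


@dataclass
class Cpu:
    cpu_id: int
    handles_reboot_ipi: bool = True  # False models a CPU that never services the IRQ
    handles_nmi: bool = True         # False models a CPU that never goes offline on NMI
    online: bool = True


def stop_other_cpus(others):
    """Return (last_vector_tried, ids_still_online) for the toy model."""
    # Stage 1: REBOOT_VECTOR IPI to every other CPU.
    for cpu in others:
        if cpu.online and cpu.handles_reboot_ipi:
            cpu.online = False
    remaining = [c.cpu_id for c in others if c.online]
    if not remaining:
        return "REBOOT_VECTOR", []

    # Stage 2: fall back to NMI_VECTOR for the CPUs that are still online.
    for cpu in others:
        if cpu.online and cpu.handles_nmi:
            cpu.online = False
    remaining = [c.cpu_id for c in others if c.online]
    return "NMI_VECTOR", remaining


# CPUs 0-11 are online and CPU 10 runs the reboot, so 11 others must stop.
other_ids = [i for i in range(12) if i != 10]

working = [Cpu(i) for i in other_ids]
print(stop_other_cpus(working))  # ('REBOOT_VECTOR', [])

stuck = {3, 7}  # hypothetical ids of the two CPUs that do not stop
failing = [Cpu(i, handles_reboot_ipi=i not in stuck, handles_nmi=i not in stuck)
           for i in other_ids]
print(stop_other_cpus(failing))  # ('NMI_VECTOR', [3, 7])
```

In the second case the model ends with CPUs still online after the NMI stage, which corresponds to the state where the reboot is observed to hang.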
  • In the failure case, the reboot is triggered via "NMI_VECTOR" and gets stuck.
  • I need help understanding this behavior in order to fix the issue.
  • Looking forward to your responses on this issue; please let me know if any further info is required.

Welcome to the community!
I suggest discussing this with kernel.org.

Perhaps you need to turn off kdump, or configure it to use something other than a disk on the failed PCIe bus.

It appears that you are using the 'aer-inject' tool on a Linux machine with a Broadcom Ethernet controller (Device b045) to simulate PCIe errors and validate AER handling. The tool injects errors into the device at BUS 0x4a DEV 00 FN 0, but this seems to cause a kernel panic, after which the system does not reboot.

The kernel panic is likely caused by a failure to properly handle the errors that the tool injects into the system. The specific messages reported are "AER: Uncorrected (Non-Fatal) error received: 0000:4a:00.0" and "pcieport 0000:00:03.1: AER: Device recovery failed".

It's difficult to say for certain what is causing the kernel panic without more information about the specific system configuration and the version of the aer-inject tool that you are using. Some possible causes could include bugs in the aer-inject tool, conflicts with other kernel modules, or incompatibilities with the specific version of Linux that you are using. It may be helpful to contact the developers of the aer-inject tool or seek help from a Linux kernel expert for further assistance.