Kernel Panic : how to stop systemd managed process from being respawned

sundarpv · January 5, 2023, 5:48pm

I am using systemd to manage an application process, which is configured to always "respawn".

During a kernel panic, I need to tweak the behavior so that the process isn't respawned.

Is there a way to identify that the kernel is in the shutdown path (reboot or panic) and take appropriate action in the service file so that "start" or "restart" is not honoured?

Please share any deterministic way to manage the application process better during the kernel shutdown path.

drysdalk · January 5, 2023, 11:27pm

Hello,

Welcome to the forum ! We hope you enjoy your time here, and find this to be a friendly and helpful place.

Once the kernel panics, the system along with all its services is effectively down, and the only way anything starts back up is by rebooting the system. There is no way of performing a controlled shutdown of services when the kernel panics: they simply stop running instantly, because there is no longer any functional operating system for them to run under.

Generally speaking, you normally don't need to worry about going out of your way to make a particular service detect kernel panics. When these happen every service will fail, along with everything else on the system - without a functioning kernel, there can be no functioning services. However every service which is set to start on boot will then start normally again when the system reboots to recover from the panic, thus restoring normal operation.

Now, if what you're asking is if there's a way to make a service only start on normal boots but not on reboots after a kernel panic, then I think you're into trickier territory. However, one thing worth considering is whether you normally want the service to automatically start on a regular boot, or would you only ever want it to be started manually ? Because if you actually don't want it auto-starting ever, then the easiest thing for you to do here is to disable the service, which will prevent it from starting on any kind of boot. You will still be able to manually stop and start the disabled service yourself, however. That way you would always directly be in full control of when (and if) the service should start, as you would have to log in to stop or start it.

If you really are for some reason wanting a service to specifically not auto-start on boot if and only if the system is recovering from a kernel panic, then you'd probably have to handle the logic for that yourself, and it wouldn't be straightforward. By definition, since the crash occurred due to a failure of the kernel, you don't normally see in the regular syslogs any information about the nature of the crash. You just see the system running apparently normally, and then suddenly booting again without any previous reboot or shutdown. The best you can hope for is that a kernel dump exists recording the crash if you want more info on what happened after the fact, but that will only exist if your system is configured to support such behaviour.

Systems which take kernel dumps in the event of a kernel panic normally do so by transferring control to a secondary capture kernel which exists in a reserved area of memory, and which then has the sole purpose of recording as much information regarding the now-dead main kernel as it can before initiating a reboot. So you could maybe incorporate some kind of logic into the startup script for your service to search for such dumps and react differently if any exist in a timeframe immediately prior to the current boot, but that may not be as reliable (or as easy) as you might want.
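The dump-search idea could be sketched roughly as below. This is only an illustration under assumptions: it assumes dumps land under /var/crash (common with kdump, but configurable), that anything modified there within the last few minutes points to a panic on the previous boot, and the function name `recent_crash_dump` is made up for this example.

```shell
#!/bin/bash
# Hypothetical helper: succeed (exit 0) if anything under the dump directory
# was modified within the last N minutes.
# Assumptions: the dump location and the time window are site-specific.
recent_crash_dump() {
    local dir="${1:-/var/crash}"
    local minutes="${2:-15}"
    [ -d "$dir" ] || return 1
    # -mindepth 1 skips the directory itself; -print -quit stops at the first hit.
    [ -n "$(find "$dir" -mindepth 1 -mmin "-$minutes" -print -quit 2>/dev/null)" ]
}
```

A start-up wrapper for the service could call `recent_crash_dump /var/crash 15` and refuse to launch the application when it succeeds. As noted, this is only as reliable as the dump mechanism itself: if no dump was written, the check sees nothing.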

Anyway, hope this helps ! If I've mis-understood anything here or if you have any further questions then please do get back to us again and we can take things from there.

sundarpv · January 6, 2023, 2:15am

I do have a specific case in the system, where a particular daemon does PCI device discovery during application launch. During a "system reboot" or "kernel panic", the application is still being re-launched. This triggers a PCIe rescan, in a perpetual loop.

When the application is manually disabled, a "reboot" or "kernel panic" doesn't seem to trigger the PCIe rescan.

In the systemd service, I am looking for a way to identify that the system is in the shutdown path and disable "start" or "restart" of the service accordingly.

If I have a way to identify that the system is in the shutdown path, either through some sysctl or by any other means, I can try it out.
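For an orderly shutdown (not a panic), one possible check is to look at systemd's job queue for a pending start job on a shutdown-related target. The sketch below is an assumption-laden illustration: the function name `shutdown_job_queued` is hypothetical, and it assumes the usual `systemctl list-jobs` column layout of job id, unit, type, state.

```shell
#!/bin/bash
# Hypothetical helper: reads a job list on stdin and succeeds if a start job
# for a shutdown-related target is queued.
# Assumes lines shaped like: " 91 reboot.target start waiting"
shutdown_job_queued() {
    grep -Eq '^[[:space:]]*[0-9]+[[:space:]]+(shutdown|reboot|poweroff|halt|kexec)\.target[[:space:]]+start'
}

# Intended use (not executed here):
#   systemctl list-jobs --no-legend | shutdown_job_queued && echo "shutting down"
```

Keeping the parsing separate from the `systemctl` call makes it easy to try the pattern against captured output first. None of this fires on a kernel panic, since no shutdown job is ever queued in that case.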

sundarpv · January 6, 2023, 2:26am

The above behavior was validated by creating a "marker" file in the kernel shutdown script, which was then checked in the application launch service. This works well during a regular shutdown (reboot), but it may not be reliable in the case of a kernel panic.
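For anyone wanting to try the same thing, the marker-file approach might look something like this. The marker path and both function names are hypothetical, chosen only for illustration; the actual scripts used above are not shown in this thread.

```shell
#!/bin/bash
# Hypothetical marker path. /run is a tmpfs on systemd systems, so a marker
# placed there disappears on the next boot.
MARKER="${MARKER:-/run/myapp.shutdown}"

# Called from the shutdown path to record that shutdown has begun.
mark_shutdown() { : > "$MARKER"; }

# Called before launching the application: exit 0 means "ok to start".
ok_to_start() { [ ! -e "$MARKER" ]; }
```

The shutdown script would call `mark_shutdown`, and the launch wrapper would only start the application if `ok_to_start` succeeds. A kernel panic never runs the shutdown script, so no marker is written in that case.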

chatgpt · January 6, 2023, 5:35am

Yes, you can use the ExecStop and ExecStopPost directives in your systemd service file to specify commands that should be run when the service is stopped. This can be used to take action when the system is shutting down in an orderly way. These commands do not run on a kernel panic, because systemd gets no chance to stop services in that case.

For example, you could add the following to your service file:

[Unit]
...

[Service]
...
ExecStop=/path/to/script.sh

[Install]
...

Then, in /path/to/script.sh, you can add any commands that you want to run when the service is stopped. You can use the systemctl command to check the current state of the system and take appropriate action based on that. For example:

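#!/bin/bash

# is-system-running exits non-zero for every state except "running"
# (e.g. "degraded" or "starting"), so compare the printed state instead.
state=$(systemctl is-system-running 2>/dev/null || true)

if [ "$state" != "stopping" ]; then
    # system is not shutting down, so do something
    echo "service stopped outside of shutdown (state: $state)"
else
    # system is shutting down, so don't do anything
    :
fi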

You can also use the ExecStopPost directive to specify a command that should be run after the service has been stopped. This can be used to perform any additional cleanup or to take other actions.

I hope this helps! Let me know if you have any questions.

See Also:

https://chat.openai.com/chat/fe18b709-2d94-4f04-a985-5eb8ccd52d77

drysdalk · January 6, 2023, 7:20am

Given that no actual system shutdown takes place during a kernel panic, I don't think there's a way you can easily or automatically handle this. When a kernel panic occurs, the system simply stops dead in an instant. There is no controlled shutdown, no stopping of services, no killing of processes, no unmounting of filesystems, or anything else. The system has crashed, and so all activity ceases. From a running application's point of view, it's no different than if someone had taken the power cord out the back of the server. One moment everything is fine, and the next, there's no system anymore.

It may be that the simplest solution for you is simply to stop the service from auto-starting, and to only manually start it when you log in. That way, you will always be able yourself to verify that the system is in a safe state before starting your service. Alternatively the behaviour of the service itself will need to be changed to avoid the loop that you mention.

As an aside, however, it must be said that kernel panics are (or certainly should be) an exceptionally rare event. If for some reason the server you are using is regularly experiencing kernel panics, then the real question you need to be asking is why that is happening. Now if you're using this system to develop hardware or device drivers or kernel modules or things like that then this might well explain it, and you may be very well aware of the cause of the panics. But if this is a "normal" server, as it were, then it shouldn't really be experiencing panics at all. If it is, something is seriously wrong somewhere.

sundarpv · January 6, 2023, 9:16am

Thanks for the elaborate response with the details. I do realise that an application-level solution will not be deterministic.

In FreeBSD, the kernel triggers shutdown events at various stages of shutdown, for which callbacks can be registered. This model almost always works.

Is there a similar event generation model available in Linux? If so, that would make the solution deterministic.
Or could you suggest a function in the shutdown sequence where I can hook in my code to disable the PCIe tree?

Once again, I thank the forum for the patient responses.

MadeInGermany · January 6, 2023, 9:54am

Do you really have a kernel panic?
Is /var/crash filled with crash dumps?
Do you have kdump installed?
systemctl status kdump
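One more check that can be scripted, as a rough sketch: the kernel exposes whether a crash (capture) kernel is currently loaded in /sys/kernel/kexec_crash_loaded. The function name `crash_kernel_loaded` is hypothetical, and the path argument exists only so the check can be tried against a test file.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a crash (capture) kernel is loaded,
# i.e. the sysfs file reads "1". The path can be overridden for testing.
crash_kernel_loaded() {
    local f="${1:-/sys/kernel/kexec_crash_loaded}"
    [ -r "$f" ] && [ "$(cat "$f")" = "1" ]
}
```

If `crash_kernel_loaded` fails, a panic would not be expected to produce a dump, which would also explain an empty /var/crash.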

drysdalk · January 6, 2023, 11:17am

I think the key thing here is to differentiate between a normal shutdown and a kernel panic. You specifically said you were looking for solutions for a kernel panic, and for the reasons previously stated, there is no scriptable or automated solution to this, since a kernel panic is by definition not a shutdown, and is more akin to a sudden power loss in terms of its impact to running applications.

Now if you want your service to do certain things on a normal clean shutdown, that's a different thing, and entirely do-able. The ChatGPT-based reply you received earlier contains the basic gist of what you'd need to do there: you can define in your service's systemd unit file scripts that should be run when it is stopped, as well as when it starts, so that certain actions can take place when the service is stopped (as part of a normal system shutdown, say).

But again to be clear, this would only help in the event of a normal system shutdown. In a kernel panic the system does not shut down - rather it instantaneously ceases to exist.

sundarpv · January 8, 2023, 2:33am

I am simulating the kernel panic behavior with "mce-inject". We expect the panic to report the failure and reboot the system.

Because the PCIe rescan is in a perpetual loop, the kernel doesn't report the MCE error and doesn't shut down through the normal kernel panic sequence either.

Neo · January 8, 2023, 3:37am

Yes, mce-inject (very old utility) can cause the kernel to panic, but I've never used it. Seems it may need to be updated or modified to better meet your requirements @sundarpv ?