How to troubleshoot why the system went down?

Ihattaren · January 9, 2024, 4:19am

This is not a specific question per se, so please bear with me. I've consulted ChatGPT and Google as well; I'll put down my points here.

A Linux server running k3s and Docker crashed.

Restarting it fixed the issue.

Now, how do I troubleshoot why it happened?

Is there anything I can do post-incident?

If not, is there something I can do to debug the next incident?

What I think I can do:

I can install sar and then monitor for anomalies.
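As a rough illustration of that monitoring idea (this is not sar itself, just a minimal Python sketch), the snippet below appends a timestamped load and memory sample to a file and syncs it to disk, so the last samples before an incident are more likely to be available afterwards. It assumes a Linux host with /proc/meminfo and /proc/loadavg; the log file name and the function names are hypothetical.

```python
import os
import time
from pathlib import Path


def parse_meminfo(text):
    """Turn /proc/meminfo-style text into a {field: integer value} dict."""
    values = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        fields = rest.split()
        if sep and fields and fields[0].isdigit():
            values[name.strip()] = int(fields[0])  # most fields are in kB
    return values


def parse_loadavg(text):
    """Return the 1, 5 and 15 minute load averages from /proc/loadavg text."""
    one, five, fifteen = text.split()[:3]
    return float(one), float(five), float(fifteen)


def format_sample(timestamp, meminfo, loadavg):
    """Build one log line; -1 marks a field that is missing from meminfo."""
    return "{} load1={:.2f} load5={:.2f} mem_available_kb={}".format(
        timestamp, loadavg[0], loadavg[1], meminfo.get("MemAvailable", -1)
    )


def record_sample(log_path="resource-samples.log"):
    """Append one sample and fsync it so it is likely to outlive a crash."""
    meminfo = parse_meminfo(Path("/proc/meminfo").read_text())
    loadavg = parse_loadavg(Path("/proc/loadavg").read_text())
    line = format_sample(time.strftime("%Y-%m-%dT%H:%M:%S"), meminfo, loadavg)
    with open(log_path, "a") as handle:
        handle.write(line + "\n")
        handle.flush()
        os.fsync(handle.fileno())  # push the line to disk, not just the page cache
    return line
```

Calling record_sample() once a minute (for example from cron or a systemd timer) would give a simple timeline to compare against the time of the next incident.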

Is there anything I am missing?

The application logs stop at the moment the server crashes, so I don't think they will tell me anything. My only hope is the logs in Rancher itself. But let's see.

Please provide me guidance.

Neo · January 9, 2024, 4:37am

What crashed exactly? Docker? The Linux Kernel on the host (not Docker)?

Start by defining exactly what your problem actually is.

Restarted what exactly? What is "it"? Docker? Or did you reboot?

Start by defining exactly what your problem actually is, and don't use vague pronouns like "it" when describing a technical problem; be precise.

Correctly and precisely defining your problem normally gets you 80% or more of the way toward a solution.

Thanks

Ihattaren · January 9, 2024, 5:22am

I restarted the pods via the Rancher GUI.

Neo · January 9, 2024, 11:54am

So, the system did not actually go down, contrary to what you mentioned.

You have an issue with Kubernetes Pods.
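For a pod-level issue, a first pass usually means looking at pod status, the pod description, the previous container's logs, and recent events. The Python sketch below only assembles those kubectl command lines; it does not run them. The pod and namespace names are placeholders, and it assumes kubectl is on the PATH (on k3s it may need to be invoked as `k3s kubectl`). Note that `--previous` only returns logs when a container restarted inside the same pod; if the pod was deleted and recreated, those logs are gone.

```python
import shlex


def pod_triage_commands(pod, namespace="default"):
    """Return kubectl invocations for a first look at a misbehaving pod."""
    return [
        # Which pods exist, on which node, and how often they restarted
        ["kubectl", "get", "pods", "--all-namespaces", "-o", "wide"],
        # Last state, exit code and reason (for example OOMKilled)
        ["kubectl", "describe", "pod", pod, "-n", namespace],
        # Logs from the container instance that ran before the restart
        ["kubectl", "logs", pod, "-n", namespace, "--previous"],
        # Recent events in the namespace, oldest first
        ["kubectl", "get", "events", "-n", namespace, "--sort-by=.lastTimestamp"],
    ]


def render(commands):
    """Join each argument list into a shell-safe command string."""
    return [shlex.join(command) for command in commands]


for line in render(pod_triage_commands("my-app-pod", "my-namespace")):
    print(line)
```

Each list can be passed to subprocess.run as-is if you would rather execute the commands than print them.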

Here is one Kubernetes guide to troubleshooting: