How to troubleshoot why the system went down?

This is not a specific question per se, so please bear with me. I've consulted both ChatGPT and Google, and I'll put down my points here.

A server

  • Linux

  • running k3s and Docker

crashed.

Restarting it fixed the issue. :slight_smile:

Now, how do I troubleshoot why it happened?

Is there anything I can do post-incident?

If not, is there something I can do to debug the next incident?

What I think I can do:

I can install sar and then monitor for anomalies.

Is there anything I am missing?
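Alongside sar, a small periodic snapshot can make the next incident easier to reconstruct. The following is only a rough sketch, assuming memory exhaustion is one of the things worth ruling out: it parses text in the format of /proc/meminfo and flags low available memory. The function names and the 10% threshold are hypothetical choices for illustration, and the MemAvailable field is assumed to be present (it is on reasonably recent kernels).

```python
def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of integer values (kB for most keys)."""
    values = {}
    for line in text.splitlines():
        key, sep, rest = line.partition(":")
        if not sep:
            continue  # skip lines that are not "Key: value" pairs
        fields = rest.split()
        if fields and fields[0].isdigit():
            values[key.strip()] = int(fields[0])
    return values


def memory_pressure(text, threshold_pct=10.0):
    """Return (available_pct, is_low) computed from MemTotal and MemAvailable.

    threshold_pct is an arbitrary illustrative default, not a recommendation.
    """
    info = parse_meminfo(text)
    available_pct = 100.0 * info["MemAvailable"] / info["MemTotal"]
    return available_pct, available_pct < threshold_pct
```

To use it, one could read /proc/meminfo from a cron job every minute, pass the contents to memory_pressure, and append the result with a timestamp to a file on disk, so that the last entries before a failure survive a restart.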

The application logs stop at the moment the server crashes, so I don't think they will tell me much. My only hope is the logs in Rancher itself, but let's see.

Please provide me guidance.

What crashed exactly? Docker? The Linux Kernel on the host (not Docker)?

Start by defining exactly what your problem actually is.

Restarted what exactly? What is "it"? Docker? Or did you reboot?

Start by defining exactly what your problem actually is, and don't use vague pronouns like "it" when describing a technical problem; be precise.

Correctly and precisely defining your problem normally gets you 80%+ of the way toward a solution.

Thanks

I restarted the pods via the Rancher GUI.

So, the system did not "go down" as you mentioned.

You have an issue with Kubernetes Pods.

Here is one Kubernetes guide to troubleshooting:

See also:

So, you should focus on log files and status messages related to Kubernetes, and in particular, your Pods.
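As one way to act on this after the fact, here is a minimal sketch that looks through Pod status for containers Kubernetes recorded as having terminated. It assumes you have saved the output of `kubectl get pods --all-namespaces -o json` and pass it in as text; the function names are hypothetical. The fields it reads (`containerStatuses`, `restartCount`, `lastState.terminated` with `reason`, `exitCode`, `finishedAt`) come from the standard Pod status.

```python
import json


def find_terminated_containers(pod_list):
    """Yield one dict per container that has a recorded previous termination."""
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        # containerStatuses may be absent for pods that never started
        statuses = pod.get("status", {}).get("containerStatuses") or []
        for cs in statuses:
            terminated = (cs.get("lastState") or {}).get("terminated")
            if not terminated:
                continue  # no previous termination recorded for this container
            yield {
                "namespace": meta.get("namespace"),
                "pod": meta.get("name"),
                "container": cs.get("name"),
                "restarts": cs.get("restartCount", 0),
                "reason": terminated.get("reason"),
                "exit_code": terminated.get("exitCode"),
                "finished_at": terminated.get("finishedAt"),
            }


def report_from_json(json_text):
    """Parse kubectl JSON output and return the list of terminated containers."""
    return list(find_terminated_containers(json.loads(json_text)))
```

A reason such as OOMKilled or a non-zero exit code narrows the search considerably, and `kubectl logs <pod> --previous` may still show the output of the container instance that died. Note that restarting or recreating the pods can discard this state, so it is worth capturing before the restart next time.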

Additionally, you should consider increasing your Kubernetes logging level; for example, see:
