Something is causing network timeouts on servers at 2 AM, how to investigate?

No, it's not the cron job. I've checked its logs.
It's something else. It's a Unix server running on top of Kubernetes and Docker.

Well, it is difficult to answer without knowing what we are dealing with...
What is the environment? By that I mean the infrastructure. When you say something is causing a network timeout, at what level are you talking about: in the containers, or on the server hosting the lot? And what about the rest of your infrastructure?

Looking at the time it happens, I suppose we all have the same thoughts:
Batches
Backups...

With big infrastructures, you cannot rely on cron; you need a true professional job scheduler...

This is the architecture

A sends request to B.
B sends "system timeout" after 1 second.
That's just making it interesting.
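
One way to see exactly when B stops answering inside that 1 second window is to probe it on a timer and log every result. Below is a minimal sketch in Python, assuming B exposes a TCP port that A can reach. The host name `b.internal.example` and port `8080` in the usage comment are placeholders, not details from this setup, and a successful TCP connect only shows the port is reachable, not that the application itself replies in time.

```python
import socket
import time
from datetime import datetime, timezone


def probe(host, port, timeout=1.0):
    """Try one TCP connection and report how it ended and how long it took."""
    started = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            status = "ok"
    except socket.timeout:
        # No answer within the timeout (same 1 second budget as A -> B).
        status = "timeout"
    except OSError as exc:
        # Refused, unreachable, DNS failure, etc.
        status = "error:" + type(exc).__name__
    elapsed = time.monotonic() - started
    return {
        "time": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "status": status,
        "seconds": round(elapsed, 3),
    }


def run(host, port, interval=5.0, count=None, timeout=1.0):
    """Probe repeatedly and print one line per attempt; count=None runs until interrupted."""
    done = 0
    while count is None or done < count:
        result = probe(host, port, timeout)
        print(result["time"], result["status"], result["seconds"], flush=True)
        done += 1
        time.sleep(interval)


# Hypothetical usage, started from A shortly before 02:00:
# run("b.internal.example", 8080, interval=5.0)
```

Running the same probe from a second client at the same time helps separate a problem on B from a problem on A or on the network path between them.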

Hello,

In order to stand a decent chance of offering meaningful assistance here, we're going to need quite a bit more actual detail. Some key questions that spring to mind are:

  • What application is involved in this timeout (e.g. Apache, NGINX, MariaDB, sendmail, a service of your own creation, etc.) ?
  • What is the nature of the request or function that times out (e.g. HTTP GET/POST, SMTP session, SCP/SFTP, etc.) ?
  • Does this happen every day at 02:00, or just some days ? If it only happens occasionally, do you get any other alerts or performance metrics indicating an issue at around that time (e.g. system load, memory utilisation, I/O load, etc.) ?
  • Does it ever happen at other times, or does this exact issue only ever occur at or near 02:00 ?
  • Is anything else of significance happening on the system at or around the time that the failure occurs (e.g. backups, other cron jobs, other applications performing other tasks, etc.) ?
  • What is the exact and complete text of any error message or failure notification that you receive from your application, or in the system logs ? Does this error/failure message ever change, or is it always the same ?
  • Is there anything else of interest in the system and application logs on both the server and the client from around that time that might seem to explain things (e.g. indications of hardware issues, signs the system has run out-of-memory and has killed a process to free it up, etc.) ?
  • Are the two servers on which this timeout occurs on the same local network, or on different networks ? If they are on different networks, is there a firewall or router of any kind in the way ?
  • Are the servers running any local iptables/firewalld firewall rules ?
  • If you try this network operation from a different client, do you still get the same timeouts occurring ? If so, this could be a sign that the issue is at the server side; if not, then it would tend to indicate the issue may be with the original client.

These are some of the key questions that spring to mind, though no doubt there are others that will be relevant based on the answers to the above.
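
To answer the questions above about whether this happens every day at 02:00 or also at other times, it can help to count failures per hour of day from whatever log you already have. This is a rough sketch in Python, assuming log lines of the form `<ISO timestamp> <status>` where a healthy result has the status `ok`. That line format is an assumption for illustration, so adjust the parsing to match your real logs.

```python
from collections import Counter
from datetime import datetime


def failures_by_hour(lines):
    """Count non-ok results per hour of day from '<ISO timestamp> <status>' lines."""
    counts = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        stamp, status = parts[0], parts[1]
        if status == "ok":
            continue
        counts[datetime.fromisoformat(stamp).hour] += 1
    return counts


def report(lines):
    """Print the hours that saw failures, busiest hour first."""
    for hour, count in failures_by_hour(lines).most_common():
        print(f"{hour:02d}:00  {count} failure(s)")
```

If nearly all failures land in the 02:00 bucket across several days, that points at something scheduled; a spread across many hours suggests a capacity or network issue that is merely most visible at night.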
