Network related issues

Of late we have been finding a few servers experiencing severe slowness. What commands should I try in order to post-mortem the situation?

What OS? Your other post mentions AIX. Getting data from the past is really difficult unless you had already set up monitoring or auditing.

If you have detailed logs from applications, sometimes you can infer that application A has been taking longer and longer times to complete.

Many kinds of problems are sporadic or are hard to reproduce. These can only be found by creating monitors before the fact.

Please give us more system details: specific OS, main application(s) for the system.
Example: AIX 7.3, sybase server on SAN.

Some wild guesses:

  • Loss of access to DNS server (slow reverse IP lookup for auditing, so slow login or application)
  • Database locks - hugely dependent on your application
  • Missing database index causing full table scans
  • Poor data queries, e.g. get all records from the database then check each in turn on criteria rather than building the condition into the query
  • Database log files filling and flushing too slowly
  • Exhausting real memory causing paging (potentially DB consuming too much real memory)
  • Network speed conflict, e.g. if the NIC is at 10M half-duplex and the switch is at 100M full-duplex, it will work, but any file transfer will cripple it with lots of dropped packets.
  • IO issues, especially with NFS or an HA cluster if you fail over
  • Scheduled work, e.g. current stock summary
  • Ad-hoc jobs, e.g. current stock summary
  • Resource stealing by another LPAR if the definitions allow it
  • Large write volume to direct disk (e.g. local) rather than cached disk (RAID or SAN etc.)
  • High NFS contention especially with other seemingly unrelated servers
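
A few of these guesses can be triaged straight from the kernel's counters. A minimal Linux-only sketch (the /proc and /sys paths are Linux-specific assumptions; on AIX you would reach for vmstat, entstat and nfsstat instead):

```shell
#!/bin/sh
# Quick triage for three of the guesses above (Linux only).

# 1. Paging activity: growing pswpin/pswpout counters mean real memory
#    is exhausted and the box is swapping.
grep -E '^pswp(in|out) ' /proc/vmstat

# 2. NIC speed/duplex: compare these values against the switch port.
for nic in /sys/class/net/*; do
    name=$(basename "$nic")
    speed=$(cat "$nic/speed" 2>/dev/null || echo "n/a")
    duplex=$(cat "$nic/duplex" 2>/dev/null || echo "n/a")
    echo "$name: ${speed}Mb/s, $duplex duplex"
done

# 3. NFS client activity: RPC call/retransmission counts, if NFS is in use.
if [ -f /proc/net/rpc/nfs ]; then
    grep '^rpc' /proc/net/rpc/nfs
fi
```

Run it twice a few minutes apart: the deltas matter more than the absolute counter values.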

You can see it is a very, very, VERY wide spread of options so far - and the list is a long way from being exhaustive. You need to be a fair bit more explicit about what you have (including OS), what goes slow, what's happening at the time, and what dependencies you have with other servers.

Robin

Most *NIX systems (AIX, Linux, Solaris, BSD) have some kind of system and accounting records. You can run

sar

to see if it is properly deployed on your system. If you run it and get loads of output, you may be in luck. To use it, refer to the man pages. Typically you want to check options for memory and swap usage, CPU usage, and I/O activity.
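
With Linux's sysstat implementation, for example, those checks might look like the following (the flag meanings here are the Linux ones; on AIX the same letters can mean different things, so check your local man page):

```shell
# CPU usage: 3 samples, 5 seconds apart
sar -u 5 3

# Memory and swap utilisation (Linux sysstat flags)
sar -r 5 3
sar -S 5 3

# I/O and transfer rates
sar -b 5 3

# Replay already-collected data for a given day (here, the 15th)
# instead of sampling live
sar -u -f /var/log/sa/sa15
```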

If it's not installed, consider deploying this first before installing some complex monitoring software; it's a very standard Unix utility that has been around for ages, but the implementation and features vary from platform to platform. For Linux, install the sysstat package.
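
For example (the package manager commands below are assumptions about your distribution, and how collection gets enabled also varies with the sysstat version):

```shell
# RedHat/CentOS/Fedora
dnf install sysstat        # 'yum install sysstat' on older releases

# Debian/Ubuntu
apt-get install sysstat

# On systemd-based distributions, start the periodic collectors
systemctl enable --now sysstat
```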

On most systems, sar's data is collected by a companion program run as a cron job. On a typical RedHat/CentOS Linux system, you will find that /etc/cron.d/sysstat contains:

* * * * * root /usr/lib64/sa/sa1 -S XALL 1 1

which I immediately change to

*/5 * * * * root /usr/lib64/sa/sa1 -L -S XALL 10 30

The original form collects data once per minute, which is often simply not enough granularity to get a feel for rapid changes to the system, the kind that cause instability and crashes. Also, if memory becomes extremely scarce, cron might not be able to spawn the job every minute.

My form, however, spawns a new job every 5 minutes. It writes 30 records, one every 10 seconds. The corresponding reports contain enough detail to know very precisely when the problem started. You will need an additional 1.5 GB of disk space on /var/log if you do this.
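
The arithmetic behind those two crontab lines, as a quick sanity check (the disk-space figure itself depends on how many devices XALL ends up tracking, so treat 1.5 GB as a ballpark):

```python
SECONDS_PER_DAY = 24 * 60 * 60

# Stock entry: sa1 runs once per minute and writes a single record.
default_records = SECONDS_PER_DAY // 60

# Modified entry: a new sa1 job every 5 minutes, each writing 30 records
# at 10-second intervals -- i.e. back-to-back 10-second sampling.
modified_records = SECONDS_PER_DAY // (5 * 60) * 30

print(default_records)                      # records per day, stock entry
print(modified_records)                     # records per day, modified entry
print(modified_records // default_records)  # granularity gain
```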

If you want graphs and pretty output, you may be able to export the data into graphing engines or spreadsheets. Linux's sar has such a program (sadf), and other related projects can slurp up the data and present graphs.
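
As a sketch of that export path: `sadf -d` emits semicolon-separated records that spreadsheets and plotting tools ingest directly. The sample below is illustrative, not captured output, and the exact column set varies between sysstat versions:

```python
import csv
import io

# Illustrative sample of `sadf -d -- -u` output (semicolon-separated);
# real output depends on your sysstat version and hostname.
sample = """\
# hostname;interval;timestamp;CPU;%user;%nice;%system;%iowait;%steal;%idle
db01;600;2024-01-15 10:00:01 UTC;-1;12.1;0.0;3.4;25.7;0.0;58.8
db01;600;2024-01-15 10:10:01 UTC;-1;11.8;0.0;3.1;31.2;0.0;53.9
"""

reader = csv.reader(io.StringIO(sample), delimiter=";")
header = next(reader)
rows = [dict(zip(header, row)) for row in reader]

# Pull out one column, e.g. to spot the worst I/O-wait interval.
iowait = [float(r["%iowait"]) for r in rows]
print(max(iowait))  # highest %iowait in the window
```

From there it is a short step to feeding the columns into a spreadsheet or a plotting library.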