Profiling processes during shutdown

I was wondering how I can find the culprit behind a slow shutdown on my Debian box. I am looking for a diagnostic tool that can dump each process name and the amount of time the process took to exit after the signal was sent.

For now I am using journalctl to look for information, but I would like to narrow down the suspects.
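
One way to narrow the suspects from the journal is to look for the longest silences near the end of the previous boot. The sketch below is Python and assumes the box runs systemd with persistent journal storage, so that `journalctl -b -1 -o json` can read the previous boot. The helper names and the 15-minute window are arbitrary choices for illustration, not part of any existing tool.

```python
import json
import subprocess


def largest_gaps(entries, top=5, tail_seconds=900):
    """Return the `top` longest pauses between consecutive journal entries.

    `entries` is a list of dicts as produced by `journalctl -o json`.
    Only the last `tail_seconds` of the log are considered, so that idle
    periods during normal uptime do not drown out the shutdown phase.
    Each result is (gap_seconds, entry_before_gap, entry_after_gap).
    """
    ordered = sorted(entries, key=lambda e: int(e["__REALTIME_TIMESTAMP"]))
    if not ordered:
        return []
    # __REALTIME_TIMESTAMP is in microseconds since the epoch.
    cutoff = int(ordered[-1]["__REALTIME_TIMESTAMP"]) - int(tail_seconds * 1_000_000)
    tail = [e for e in ordered if int(e["__REALTIME_TIMESTAMP"]) >= cutoff]
    gaps = []
    for before, after in zip(tail, tail[1:]):
        delta_us = int(after["__REALTIME_TIMESTAMP"]) - int(before["__REALTIME_TIMESTAMP"])
        gaps.append((delta_us / 1_000_000, before, after))
    gaps.sort(key=lambda g: g[0], reverse=True)
    return gaps[:top]


def read_previous_boot():
    """Load the previous boot's journal; fails if no persistent journal exists."""
    out = subprocess.run(
        ["journalctl", "-b", "-1", "-o", "json", "--no-pager"],
        check=True, capture_output=True, text=True,
    ).stdout
    return [json.loads(line) for line in out.splitlines() if line.strip()]


def describe(entry):
    """Short label for an entry: unit (or syslog identifier) plus message."""
    who = entry.get("_SYSTEMD_UNIT") or entry.get("SYSLOG_IDENTIFIER") or "?"
    return f"{who}: {entry.get('MESSAGE')}"


if __name__ == "__main__":
    try:
        entries = read_previous_boot()
    except (OSError, subprocess.CalledProcessError) as exc:
        print("could not read the previous boot's journal:", exc)
    else:
        for seconds, before, after in largest_gaps(entries):
            print(f"{seconds:8.1f}s pause")
            print("   last before:", describe(before))
            print("   first after:", describe(after))
```

A long pause whose "last before" line mentions a unit being stopped is a reasonable first suspect, although the entry logged just before a stall is not necessarily the thing being waited on.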

We had a problem like this in Solaris 10. There was an issue with using NFS across zones on the same system. It hung in certain circumstances.
The point I am trying to make: it may not be a process but a relationship between processes and their current status.

You are assuming a single process is the problem, which is okay, but you may want to think "larger": multiple processes, or a device and some process group.

What you gave us is a start, but we need more:

1. Is the box standalone: not clustered, no NFS mounts, no Samba mounts, etc.?
2. Does the box actually come down?
3. How much longer than usual does it take to come down?
    Ex: yesterday it came down in 30 seconds, today it came down in 10 minutes.
4. Did you install new software in the recent past, and did you get errors on install?

Thanks Jim :),

You are right, it could very much be a co/in-dependent set of processes creating the problem.

I have not installed any noteworthy packages. I am sure gdb wouldn't cause this issue. However, I do have my own code on the box (multiple daemons). The problem appeared recently, when the reboot/shutdown command started taking more than 10 minutes, as opposed to the previous reboot time of 45 seconds. The delay is now almost consistent.

Whether mine or external, I simply need to narrow down the problem.

I think you have to work "backwards".
Change startup to be more minimal: do not start any of your own daemons.

If the problem goes away, keep adding them back into the mix one by one. If it is resource contention, like waiting on some kind of lock, it may be hard to track down. Any daemons that work cooperatively with others may deserve first attention.

If the problem still exists, you may have to start looking at your configuration: change to a single-user boot, then change the startup/shutdown scripts so that you boot and shut down at each runlevel in turn, to eliminate process groups and processes as the cause.

I vote for your hand-rolled daemons as a great place to start. Sorry I cannot be more specific.

With regard to gdb: it will not have the problem, but if the process it controls does have issues, what then? Why are you shutting down with processes running under gdb? Sounds like a bad plan to me.

Shutdown works by sending signals to processes so that they go through an orderly shutdown. If a process cannot do that, or is deadlocked because another process locked a mutex and then got killed off, SIGTERM will not shut the process down. There are so-called robust mutexes that can help: when the owner of a robust mutex dies, the next thread to lock it gets EOWNERDEAD instead of blocking forever.

pthread_mutexattr_getrobust
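
To check how one of your own daemons reacts to SIGTERM outside of a real shutdown, a rough sketch like the one below may help. It is Python, and everything in it (the `time_to_exit` helper, the `settle` delay, the two stand-in child commands) is made up for illustration and is not an existing tool. It starts a command, sends it the signal, and reports how long the process took to exit and whether it had to be finished off with SIGKILL.

```python
import signal
import subprocess
import sys
import time


def time_to_exit(cmd, sig=signal.SIGTERM, timeout=5.0, settle=0.5):
    """Start `cmd`, send it `sig`, return (seconds_until_exit, was_killed).

    `settle` is a crude delay that gives the child time to install its
    signal handlers before the signal is sent. If the process is still
    alive `timeout` seconds after the signal, it is sent SIGKILL.
    """
    proc = subprocess.Popen(cmd)
    time.sleep(settle)
    start = time.monotonic()
    proc.send_signal(sig)
    try:
        proc.wait(timeout=timeout)
        killed = False
    except subprocess.TimeoutExpired:
        proc.kill()
        proc.wait()
        killed = True
    return time.monotonic() - start, killed


if __name__ == "__main__":
    # Hypothetical stand-ins for real daemons:
    # one exits on SIGTERM, the other ignores it.
    polite = [sys.executable, "-c", "import time; time.sleep(60)"]
    stubborn = [
        sys.executable, "-c",
        "import signal, time; "
        "signal.signal(signal.SIGTERM, signal.SIG_IGN); time.sleep(60)",
    ]
    for name, cmd in (("polite", polite), ("stubborn", stubborn)):
        elapsed, killed = time_to_exit(cmd, timeout=2.0, settle=1.0)
        print(f"{name}: {elapsed:.2f}s after SIGTERM, needed SIGKILL: {killed}")
```

Pointing it at the start command of each daemon in turn should show which ones exit promptly. Note that it starts a fresh copy each time, so it will not reproduce a hang that depends on state built up over a long run or on other processes, such as a lock held by a daemon that has already been killed.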

There isn't a lot of detail in the thread, but things to consider might be:

  • If you have a database, is it possible that there is a major transaction rollback being done?
  • Do any of your shutdown scripts have waits in them?
  • NFS (as already mentioned)
  • Is there some sort of notification you are trying to do and the target server is down? Perhaps a closing down report to ensure all transactions are centrally held etc.
  • Has someone introduced a backup into the wrong place, so it runs at shutdown?
  • Has someone created a shutdown script that actually does a startup by mistake? (i.e. it never checks $1 for start or stop, it just starts)
  • Is there an AV scan being triggered in the shutdown?
  • Do you run an fsck during shutdown?
  • Do you try to sync the clock during shutdown?
  • Is this a High Availability node, or worse an HA node where the other node(s) are all off?

There are lots of other possibilities too, I'm sure. What more can you tell us about it?

Kind regards,
Robin