Logging events of /tmp full

Hi everybody,
A few days ago we had a big issue with one of our Solaris 10 servers.
While my colleague was working on it for some troubleshooting, he suddenly realized that performance had started to degrade.
Eventually it reached the point where it was not even possible to log in using the local console from the ILOM:
as soon as it was supposed to ask for the password, it just sat there forever doing nothing.
It was also impossible to connect via ssh. In other words, even though we could ping the server and the cluster resources seemed online from the other node, we could not execute any command. Everything we sent to that server just hung with no response.
In the end the only solution was to power it off from the ILOM.

I have a feeling the /tmp folder was full, but of course after the restart it is now empty and the server is working properly.

I would like to ask whether, in your experience, a full /tmp folder can cause this kind of behaviour, and whether there is some log somewhere that could confirm that a full /tmp was indeed the reason.

Is the space utilization of /tmp logged anywhere?

thanks in advance

Biagio

# dmesg
...
Oct 27 22:12:51 myserver tmpfs: [ID 582450 kern.warning] WARNING: /tmp: File system full, swap space limit exceeded
...

Will this command show me the warning even if it occurred before the server was restarted?
Should this message also be present in /var/adm/messages? (It is not there.) Or does it only appear with the dmesg command?

---------- Post updated at 08:39 PM ---------- Previous update was at 08:23 PM ----------

And if it was not due to /tmp being full: has anybody had a similar experience where it was impossible to do anything, not even log in from the console, while the system had not crashed but just hung forever?

If the event really happened and has been logged, yes.

The dmesg command retrieves its data from /var/adm/messages.
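
If syslogd recorded it before the reboot it will be in there, possibly in one of the rotated copies (messages.0, messages.1, ...). Something like this should turn it up (default Solaris paths; adjust if your syslog.conf sends kern warnings elsewhere):

# grep "File system full" /var/adm/messages*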

It is a very common situation. What often happens is not /tmp filling up but virtual memory being exhausted; /tmp being full (or almost full) is a side effect. It is also perfectly possible for a system to exhibit the symptoms you describe without virtual memory exhaustion: if there is not enough RAM to hold the active working set, performance will degrade, and if the deficit is very high the system might become essentially unresponsive.
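
A quick way to see whether a box is in that state is something like:

# vmstat 5

and watch the sr (page scan rate) and free columns: a sustained non-zero scan rate together with a steadily shrinking free list is the classic sign of memory pressure, long before anything complains about /tmp.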

Thank you very much for the explanation. If that was the case, do you think it is possible to find a log somewhere that recorded the event, so that we can discover the root cause?
Our boss is pushing hard to know what the root cause was, and I am not sure we will be able to find it.

If there is no system monitoring in place (esp. vmstat output), memory thrashing leaves no logs.
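
If you want to leave something behind for next time, even a crude cron job is enough. A minimal sketch, with /var/adm/memstats.log as a made-up example path (Solaris cron does not accept */5, hence the explicit minute list):

0,10,20,30,40,50 * * * * (date; vmstat 1 2; swap -s; df -k /tmp) >> /var/adm/memstats.log 2>&1

Then, if the machine hangs again, the last entries written before it stopped responding will tell you whether memory and swap were exhausted.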

By default, Solaris mounts /tmp in memory (tmpfs), so swap space can be allocated for it if necessary. This makes everything in /tmp very quick to access until you run out of real memory; after that it depends on the devices used for swap, and is usually no worse than a regular filesystem. This is usually very efficient, but it can lead to problems if you have a process that runs away writing a large log file, or does a huge sort with /tmp as the working area. You can use sort's -T flag or the $TMPDIR variable to adjust this behaviour if that is the problem.
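
For example, if a big sort turns out to be the culprit, you can point its work files at a disk-backed filesystem instead of tmpfs, either per command or via the environment (/var/tmp is on disk on a default install; bigfile is just a placeholder):

# sort -T /var/tmp bigfile > bigfile.sorted
# TMPDIR=/var/tmp; export TMPDIR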

It is also possible to define more swap space, if you have disk available, and bring it online while you do certain operations, if you can narrow down what triggered the problem.
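
A sketch of adding swap on the fly, with /export/extra_swap as an example path and an arbitrary size:

# mkfile 4096m /export/extra_swap
# swap -a /export/extra_swap
# swap -l

If it should survive a reboot you would also add a line for it to /etc/vfstab.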

This may not be much help in tracking down the cause but it may give you options in future. It might be worth watching the output from vmstat and looking for paging activity.

Can you share the output from df -k /tmp and swap -l ?

Kind regards,
Robin

Thanks for the input. I will be back at work next Wednesday, so I will check this suggestion and share the output.
Unfortunately it is quite a complex system and this is the first time we have had this kind of problem. The only difference between this server and the other one in the cluster (this is the 2nd node of the cluster) is that there was a hardware configuration issue with a tape library connected to this server, and at the moment the performance degraded my colleague was troubleshooting that issue. I think it will be hard to identify what caused the server to hang.

Sent from my iPhone using Tapatalk