Maybe we can start a thread to keep a record of administration changes that you or other people made and that later blew up into a huge incident affecting many users.
I'll start first. Recently, due to security requirements, we decided to disallow ftp usage for all users on all our servers by updating /etc/ftpusers. We also wanted to avoid extra work when people leave: we have to delete their accounts but sometimes forget to update /etc/ftpusers, so we decided to have a script do this job for us.
This is the script I came up with that we put on all our servers. Basically it grabs every user in /etc/passwd, writes the names into /etc/ftpusers, and runs once a month from crontab.
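# Keep a dated backup of /etc/passwd before touching anything.
/usr/bin/cp -p /etc/passwd /etc/passwd.`date +%d%m%y`
if [[ -s /etc/ftpd/ftpusers ]]; then
    # Existing non-empty deny list: back it up first.
    /usr/bin/cp -p /etc/ftpd/ftpusers /etc/ftpd/ftpusers.`date +%d%m%y`
else
    # Missing or empty deny list: make sure the file exists.
    /usr/bin/touch /etc/ftpd/ftpusers
fi
# Deny ftp to every account: write every login name from /etc/passwd into the list.
/usr/bin/cut -d: -f1 /etc/passwd > /etc/ftpd/ftpusers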
We made this very minor change on a Friday and by Monday had totally forgotten about it. On Monday morning the application folks complained about a strange problem: one of their more critical apps had problems restarting. On this Solaris server, max_nprocs was set to 400 in /etc/system and ps -ef showed 399 processes. The OS couldn't fork any more processes and the server became very sluggish. We wrestled with the problem for hours, shut down the apps and the database, and finally the decision was made to do an emergency reboot in the afternoon.
By evening the system had slowly built back up to around 400 processes and the problem resurfaced. We went through all the processes and realized that one other application showed consistent errors in its logs. We also saw that this application, which does some "migration" activity, had a large backlog of processes.
One of the good things about the Solaris operating system is that it has a command called truss. We manually ran the application's command under truss and from the debug output managed to see that it was trying to log on to the backend storage server via ftp, but it complained about a 'login mismatch'. In the meantime the number of file transfer requests kept growing, and this had a knock-on effect on the other applications. Once we excluded that user from /etc/ftpusers on the backend server we saw a substantial drop in the number of processes and things started to normalize.
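For what it's worth, here is a minimal sketch of how the monthly script could have kept such a service account out of the deny list. It is not what we ran at the time. The function name build_ftpusers and the idea of a separate exceptions file (one login name per line) are assumptions for illustration, and it assumes a POSIX awk (on Solaris that would be nawk or /usr/xpg4/bin/awk rather than the old /usr/bin/awk).

```shell
#!/bin/sh
# build_ftpusers PASSWD_FILE ALLOW_FILE
# Prints every login name from PASSWD_FILE except those listed in ALLOW_FILE
# (hypothetical exceptions file, one login name per line).
# The result is what would be written to the ftp deny list.
build_ftpusers() {
    passwd_file=$1
    allow_file=$2
    # The first file fills the exceptions table; the second is filtered
    # against it. Comparing FILENAME with ARGV[1] (instead of NR == FNR)
    # keeps this correct even when the exceptions file is empty.
    awk -F: '
        FILENAME == ARGV[1] { allow[$1] = 1; next }
        !($1 in allow)      { print $1 }
    ' "$allow_file" "$passwd_file"
}

# Possible use inside the monthly script (paths are assumptions):
#   build_ftpusers /etc/passwd /etc/ftpd/ftpusers.allow > /etc/ftpd/ftpusers
```

With an empty exceptions file this behaves like the original script and denies everyone, so the default stays on the safe side.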
Mistakes: didn't run 'last | grep ftp' as a pre-check before implementing the script, to see which accounts were actually using ftp.
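A rough sketch of that pre-check as a small filter, so the output is a list of accounts rather than a screenful of sessions. The function name ftp_users_from_last is made up for illustration, and it assumes that ftp sessions appear in the output of last with a second column starting with "ftp"; check what your own last output looks like before relying on it.

```shell
#!/bin/sh
# ftp_users_from_last: reads `last`-style output on stdin and prints the
# unique login names that have at least one ftp session.
# Assumption: for ftp sessions the second column (the line/tty field)
# starts with "ftp".
ftp_users_from_last() {
    awk '$2 ~ /^ftp/ { print $1 }' | sort -u
}

# Possible use before rolling out a deny list:
#   last | ftp_users_from_last
```

Any account this prints is a candidate for an exception, or at least for a conversation with its owner, before ftp is switched off. Keep in mind that last only covers the period the login accounting file goes back to, so a job that runs rarely may not show up.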