Tell us about your most recent system incident

sparcguy · May 17, 2011, 10:53pm

Maybe we can start a thread to keep a record of administration changes, made by yourself or other people, that later blew up into a huge incident affecting many users.

I'll start. Recently, due to security requirements, we decided to disallow ftp for all users on all our servers by updating /etc/ftpusers. We also wanted to avoid duplicated work: when people leave we have to delete their accounts, and we sometimes forget to update /etc/ftpusers, so we decided to have a script do the job for us.

This is the script I came up with, which we put on all our servers. Basically it grabs every user in /etc/passwd, writes them into /etc/ftpusers, and runs once a month from crontab.

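# Keep a dated (ddmmyy) backup of /etc/passwd
/usr/bin/cp -p /etc/passwd /etc/passwd.`date +%d%m%y`
if [[ -s /etc/ftpd/ftpusers ]]; then
        # Deny list exists and is non-empty: back it up, then overwrite it
        /usr/bin/cp -p /etc/ftpd/ftpusers /etc/ftpd/ftpusers.`date +%d%m%y`
        /usr/bin/cat /etc/passwd | cut -d: -f1 > /etc/ftpd/ftpusers
else
        # Deny list missing or empty: create it, then fill it
        /usr/bin/touch /etc/ftpd/ftpusers
        /usr/bin/cat /etc/passwd | cut -d: -f1 > /etc/ftpd/ftpusers
fi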

We made this very minor change on a Friday and by Monday had totally forgotten about it. On Monday morning the application folks complained of a strange problem: one of their more critical apps had trouble restarting. On this Solaris server max_nprocs was set to 400 in /etc/system, and ps -ef showed 399 processes. The OS couldn't fork any more processes and the server became very sluggish. We wrestled with the problem for hours, shut down the apps and the database, and finally the decision was made to do an emergency reboot in the afternoon.
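In hindsight, a simple check of the process count against the limit might have flagged the build-up before fork started failing. A rough sketch, assuming a ksh or POSIX shell; the function name check_proc_headroom and the 80 percent default threshold are made up for illustration and not something we actually ran:

```shell
# check_proc_headroom COUNT LIMIT [PCT]
# Warn (and return 1) when COUNT has reached PCT percent of LIMIT.
# PCT defaults to 80. All names here are hypothetical.
check_proc_headroom() {
    count=$1
    limit=$2
    pct=${3:-80}
    if [ $((count * 100)) -ge $((limit * pct)) ]; then
        echo "WARNING: $count of $limit processes in use"
        return 1
    fi
    echo "OK: $count of $limit processes in use"
}
```

It could be fed from cron with something like check_proc_headroom "$(ps -e | sed 1d | wc -l)" 400, where 400 is the max_nprocs value from /etc/system.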

By evening the system had slowly built back up to around 400 processes and the problem resurfaced. We went through all the processes and realized that one other application showed consistent errors in its logs. We also saw that this application, which does some "migration" activity, had a large backlog of processes.

One of the good things about the Solaris operating system is that it has a command called truss. We manually ran the application's command under truss, and from the trace output managed to see that it was trying to log on to the backend storage server via ftp, but the login was failing with a 'login mismatch'. In the meantime the number of file transfer requests kept growing, and this began to have a knock-on effect on the other applications. Once we excluded that user from /etc/ftpusers on the backend server we saw a substantial drop in the number of processes and things started to normalize.

Mistakes: didn't run 'last | grep ftp' as a pre-check before implementing the script.
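For what it's worth, that pre-check could be folded into the script itself, so accounts that still log in over ftp are left out of the deny list. This is only a sketch: it assumes `last` prints the user name in column 1 and a line name starting with "ftp" in column 2 for ftp sessions, which should be verified on each system. The function names are invented here, and on Solaris you would probably need nawk rather than the old /usr/bin/awk.

```shell
# Read `last`-style output on stdin and print each user that has
# at least one ftp session (column 2 starts with "ftp"), once.
ftp_users_seen() {
    awk '$2 ~ /^ftp/ { print $1 }' | sort -u
}

# build_deny_list PASSWD_FILE KEEP_FILE
# Print every account name from PASSWD_FILE except those listed
# (one per line) in KEEP_FILE. An empty KEEP_FILE excludes nobody.
build_deny_list() {
    awk -F: '
        FILENAME == ARGV[1] { keep[$1] = 1; next }  # first file: users to keep
        !($1 in keep)       { print $1 }            # second file: passwd entries
    ' "$2" "$1"
}
```

Typical use would be something like last | ftp_users_seen > /tmp/ftp_keep, then build_deny_list /etc/passwd /tmp/ftp_keep, reviewing the result by hand before it replaces the real ftpusers file.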

figaro · May 18, 2011, 4:54pm

Today we received an email from one of our clients requesting a password reset. He is non-technical, to say the least. We give him the password and tell him it has been reset, but it is always the same one. He has multiple accounts: one for the domain, one for the hosting company, one for ftp, etc., each with its own password. We really shouldn't keep these, but it is just easier to help him by looking up the password from the list, and it gains some goodwill. Big security breach.

jim_mcnamara · May 18, 2011, 5:39pm

We turned three M4000s into the power of one v445.

We were required to use a SAN that is not fully supported by Solaris 10: an EqualLogic PS6000.
Our I/O is beyond pathetic. We have latencies of 300 ms on the production SAN. We are playing with data links and other things to get around the fact that the iSCSI initiator cannot create more than four sessions. Our SAN can handle 16 sessions without even breathing hard. There are loads of problems.

I now know more about scsi_vhci than ever before. I wrote some C system code to look at what the driver thinks it has for parameters.

One of our M4000 boxes, when correctly configured, can do 5-10 times what a v445 can do. Our three boxes together cannot even match a single v445 because we are so terribly I/O bound.

Oracle Sun contacted us. They plan on driver enhancements (scsi_vhci).