Users get booted

Hello... weird problem, but I'm hoping someone can help!

Server: PowerEdge 2850, 8 cores, 12 GB RAM, 32-bit processor
OS: Red Hat ES 6
Setting: University. The server hosts Computer Science student accounts (350 users).

Every so often the system just boots you, no matter what you are doing. It has happened to me as well, both while editing a file and while just sitting at the command prompt.

Most users connect with PuTTY, though I have used SSH Secure Shell as well. I get booted with both.

Things I have checked or done:

Changed out the Ethernet cables and the Ethernet card. Fiddled with the number of user processes, number of system logins, etc.

Have looked at the timeout-related settings in sshd_config.
Have connected the machine to a different physical switch.
Have increased user limits.

Anyone have any ideas what might cause the system to just boot users randomly like this?

What is your network topology?

It is a switched class B network, much like at other universities. My server is connected to a NetScreen firewall, which in turn is connected to a switch in the building. I have about 6 other servers that are all configured the same way, and none of them see this problem. This leads me to believe that it is possibly a motherboard problem, but I don't know of any way to prove or disprove that.

Does this NetScreen firewall act as a router? I.e., do you end up on a different subnet than the school network?

If so, I suspect it is the problem. Hardware routers often have tiny connection tables and enforce draconian connection timeouts to keep them from filling up.

No, it doesn't do any routing at all. I actually bought a new firewall hoping that would fix the problem, but the problem was still there with the new firewall.

New firewall as in different kind of firewall, or new firewall as in identical replacement?

It is still a NetScreen, but a newer version. The old one was a NetScreen 5GT and the new one is a NetScreen SSG5.

What does

dmesg

have to say right after someone gets booted? What about

/var/log/messages

(or whatever log file you have configured)?

If you get disconnected, does

netstat -a

show connections in TIME_WAIT on high-numbered ports like 49152?
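
If the netstat output is long, a rough way to count the TIME_WAIT entries is to filter it with awk. This is only a sketch: count_time_wait is a made-up helper name, and it assumes the usual Linux net-tools column layout where the state is the sixth column.

```shell
# Count TCP sockets in TIME_WAIT from netstat output read on stdin.
# Assumes Linux net-tools layout: Proto Recv-Q Send-Q Local Foreign State
count_time_wait() {
    awk '$1 ~ /^tcp/ && $6 == "TIME_WAIT" { n++ } END { print n + 0 }'
}

# Usage (right after a disconnect):
#   netstat -ant | count_time_wait
```

Running it once right after a disconnect and once during a quiet period gives two numbers to compare.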

I can't know for sure, but I think there is a network disconnect going on.
Try connecting to the console on the server. If you can stay connected for several days (or whatever you deem reasonable), that suggests the problem is network-related.

Does your firewall have logs? Does it timestamp when a given action is taken (deny, block, etc.)? Is there any correlation there with the "boot-off" problem?

Do you have the TMOUT variable defined?
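
In bash, TMOUT makes an idle interactive shell log itself out after that many seconds, so it is worth checking both the current shell and the startup files. A quick sketch follows; find_tmout is a made-up helper name, and the file list in the usage comment is just the usual set of places to look, not necessarily all of them on your system.

```shell
# Print every non-comment line that mentions TMOUT in the given files,
# prefixed with file name and line number. Missing files are ignored.
find_tmout() {
    grep -HnE '^[^#]*TMOUT' "$@" 2>/dev/null
}

# Usage:
#   echo "TMOUT in this shell: ${TMOUT:-not set}"
#   find_tmout /etc/profile /etc/bashrc /etc/profile.d/*.sh ~/.bash_profile ~/.bashrc
```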

Well, I too have pointed the finger at the network. Unfortunately, our network folks aren't very cooperative, as they take it somewhat personally. The only messages I ever see are in /var/log/messages:

messages:Dec  9 19:53:02 student kernel: e1000: eth1 NIC Link is Down
messages:Dec  9 19:53:04 student kernel: e1000: eth1 NIC Link is Up 100 Mbps Full Duplex, Flow Control: None

which tells me the network was restarted for some reason. However, those messages aren't on any of the other servers, which sit right beside the one in question. So the million-dollar question is: what is restarting the network? Is it the physical network or the OS?
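
To see how often the link drops, and to line the times up with when users report getting booted, the log can be filtered for the "Link is Down" lines quoted above. A minimal sketch: link_down_times and count_link_down are made-up helper names, and they assume the driver messages look like the two lines shown here.

```shell
# Print "Mon DD HH:MM:SS iface" for every link-down event in a
# syslog-style file such as /var/log/messages.
link_down_times() {
    awk '/NIC Link is Down/ {
        for (i = 2; i <= NF; i++)
            if ($i == "NIC") print $1, $2, $3, $(i - 1)
    }' "$1"
}

# Print how many link-down events the file contains.
count_link_down() {
    grep -c 'NIC Link is Down' "$1"
}

# Usage:
#   link_down_times /var/log/messages
#   count_link_down /var/log/messages
```

If the printed times match the times users get disconnected, that points at the link itself rather than at sshd or shell timeouts.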

Ah, if it thought the cable was unplugged, that would certainly kick all your network connections! :eek:

That doesn't look like an OS problem. I don't know of any way for a software problem to cause that message.

You've already ruled out most of the rest of the network as well as the firewall, leaving either the cable between server and firewall or the NIC itself. My guess would be a bad network card in the server.

Well, that is what I thought, so I replaced both the cables and the network card. But the problem is still there with the new network card.

What does the firewall box show in its logs? Does it see the network go down?
Or did I miss that the firewall is just a software add-on?

Our Cisco firewall runs in a monster blade server, which is where I'm coming from.

I don't see anything pertinent in the firewall logs. Maybe I should go back through the log configuration and see how they are set up. Thanks! That might be a good next move anyway.

RHEL 6 has a NetworkManager component that manages networking. It works great with dynamic IP assignment, but it is a pain if it is on when you use static IP assignment. Ensure that it is turned off and see if the problem persists.

chkconfig --list NetworkManager
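
The output of that command lists each runlevel as on or off. As a rough sketch, the helper below (nm_on_levels is a made-up name) pulls out just the runlevels where the service is still on; it assumes the usual "N:on" / "N:off" fields that chkconfig prints on RHEL 6.

```shell
# Read "chkconfig --list" output on stdin and print the runlevels in
# which NetworkManager is enabled, or "none" if it is off everywhere.
nm_on_levels() {
    awk '$1 == "NetworkManager" {
        out = ""
        for (i = 2; i <= NF; i++) {
            split($i, kv, ":")
            if (kv[2] == "on") out = out (out == "" ? "" : " ") kv[1]
        }
        print (out == "" ? "none" : out)
    }'
}

# Usage (as root):
#   chkconfig --list NetworkManager | nm_on_levels
# To disable it at boot and stop the running service:
#   chkconfig NetworkManager off
#   service NetworkManager stop
```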

lol... funny you say that. I ran into the same trouble and actually removed the entire package from all my servers. I use static IP addresses on all of them.