Right On Time, Somewhere

Corona688 · January 12, 2012, 5:58pm

Things like this are teaching me a greater appreciation for network time...

The huge gap between a satellite connection's speed and its tiny bandwidth allowance means either being completely draconian or constantly chiding people not to abuse it. We're somewhere in between: if a customer downloads too much, they'll be slowed down temporarily but not shut off; if the situation continues, we may phone them and investigate.

We had a situation where a small community had massively overused their satellite connection pretty much collectively, going over not only our limits but our provider's limits, causing the entire satellite connection to be throttled. We needed to shut the connection down for a few hours before the satellite modem would let go.

We planned it, set a time, and warned our customers. How it worked was very simple -- two entries in root's crontab. At noon that day, the first one would run 'ifconfig eth1 down', taking the community offline but leaving me in communication with the server. At 5pm that day, it would run '/sbin/reboot'.
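For illustration, the two entries might have looked roughly like the sketch below. The day and month fields (15 and 1) are made-up placeholders, since the actual date isn't given here; the commands are the ones quoted above.

```crontab
# Hypothetical date: day-of-month 15, month 1 (placeholders, not the real date).
# m   h   dom  mon  dow  command
0     12  15   1    *    ifconfig eth1 down
0     17  15   1    *    /sbin/reboot
```

Both entries fire according to the server's own clock, so any drift shifts them by the same amount.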

The server clock had drifted far more than I'd anticipated in the months since its last boot and clock-set, and the shutdown happened one hour early.
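A check along these lines before scheduling would have caught it. This is only a sketch: the function names, the 60-second tolerance, and the epoch values are illustrative assumptions, not what was actually run on the server.

```shell
#!/bin/sh
# Sketch only: function names and the 60-second limit are illustrative assumptions.

# Print how far the local clock is ahead of the reference, in seconds.
# A positive result means cron jobs will fire early; negative means late.
drift_seconds() {
    local_epoch=$1
    reference_epoch=$2
    echo $(( local_epoch - reference_epoch ))
}

# Succeed (exit status 0) only when the absolute drift is within the limit.
within_tolerance() {
    drift=$1
    limit=$2
    if [ "$drift" -lt 0 ]; then
        drift=$(( 0 - drift ))
    fi
    [ "$drift" -le "$limit" ]
}

# Example with made-up epoch values: a local clock running one hour fast.
drift=$(drift_seconds 1326394800 1326391200)
if within_tolerance "$drift" 60; then
    echo "clock ok: drift ${drift}s"
else
    echo "clock off by ${drift}s: fix the time before trusting cron"
fi
```

In real use the first argument would come from something like `date +%s` on the server, and the second from a clock you trust (an NTP query, or a machine known to be in sync). How the reference time is obtained is left open here.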

jgt · January 13, 2012, 8:46am

Couldn't even blame it on Daylight Saving Time. Did you install ntp?

Corona688 · January 13, 2012, 10:27am

Testing it in dev first. If I mess up a production system, it's hours of driving and sometimes a bit of sledding to fix it in person, especially at this time of year.

admin_xor · January 16, 2012, 2:12am

Thanks for sharing your story. It's very true that most of the time we do not bother to check the clock before scheduling things.

We maintain IT infrastructure for a big pharma company. For any SLA (service level agreement) breach, my employer has to pay the client a very large amount of money. That said, a colleague of mine once had to schedule maintenance on an AIX server. We have a procedure for that, which requires a lot of approvals from service delivery managers at both the client and our company. After getting those, he went ahead and scheduled the reboot of the machine into maintenance mode in cron, a day in advance. The next day, I got a call from the IT incident management people saying a server was down before its scheduled maintenance window. It had gone down around 20 minutes before the scheduled time. We had to raise a severity for this. When we looked into the root cause later, we found that the server had somehow been failing to sync with the NTP server and its clock was running 20 minutes ahead of the actual time.

And yes, because of all this, we breached the SLA! :wall: