How to roll out upgrades

https://channeldailynews.com/news/rogers-ceo-admits-outage-caused-by-maintenance-operation/77643?utm_source=CDN&utm_medium=enews&utm_campaign=CDN&scid=181adcce-53df-0860-95bf-4e602627c346

Just in case the link disappears: Rogers have 10,000,000 customers in Canada, and all of them lost service for 24 hours on Friday, July 8, because a software update went south.


Reminds me in some ways of the Great BlackBerry Outage of 2011, which was probably the highest-profile Canadian Internet company outage until now. If I recall rightly, though, that was more of a single-point-of-failure issue than an "oops-this-upgrade-has-gone-wrong" kind of problem.


From an IT-security risk management perspective, the greatest risks to most organizations come from knowledgeable insiders, whether accidental or purposeful, not from outside attackers.

When we further compare accidental insider incidents with purposeful ones, many organizations find that their greatest IT risk is someone accidentally "fat fingering" a config file, code, libs, etc.

I recall a blast-from-my-past where, during the Internet's transition from an academic to a public network, I worked as a Unix network systems engineer for a major US telecom provider. One of our team members received an email from Cisco Systems about an important router software upgrade, and he took it upon his good self to upgrade every router in our commercial network at the same time via a script.

The problem was that the network was nationwide, and the new router software, which the well-intentioned network engineer had not tested, had a bug: none of the routers would reboot. This meant that every location, hundreds of them across the US at that time, had to be manually booted from the local terminal before we could manage it again.

This is one good reason to always test an upgrade on a single "test device" before sending it to every device on the network.
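For anyone who scripts their upgrades, here is a minimal Python sketch of that idea, under stated assumptions: upgrade one test device first, check it, and only then continue in small batches, stopping at the first failure. The names `staged_rollout`, `RolloutResult`, `upgrade`, and `is_healthy` are hypothetical placeholders for illustration, not any vendor's API; you would supply your own upgrade and health-check functions.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable, List


@dataclass
class RolloutResult:
    """Outcome of a staged rollout (hypothetical structure for this sketch)."""
    upgraded: List[str] = field(default_factory=list)   # upgraded and healthy
    failed: List[str] = field(default_factory=list)     # upgraded but unhealthy
    untouched: List[str] = field(default_factory=list)  # never upgraded


def staged_rollout(
    devices: Iterable[str],
    upgrade: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    batch_size: int = 5,
) -> RolloutResult:
    """Upgrade one test device, then the rest in batches of batch_size.

    The rollout halts after the first stage in which any device fails its
    health check, so the remaining devices keep their old software.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    result = RolloutResult()
    pending = list(devices)
    stage_size = 1  # the first stage is a single "test device"

    while pending:
        batch, pending = pending[:stage_size], pending[stage_size:]
        for device in batch:
            upgrade(device)
        for device in batch:
            if is_healthy(device):
                result.upgraded.append(device)
            else:
                result.failed.append(device)
        if result.failed:
            # Stop here: everything still pending is left alone.
            result.untouched = pending
            break
        stage_size = batch_size  # later stages use the configured batch size

    return result
```

With a bad image, only the first device in the list is ever touched, and `result.untouched` lists everything that still runs the old software. In practice the health check should confirm that the device actually came back up and is reachable after the upgrade, since that is exactly what went wrong in the story above.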

One point to keep in mind is that most of these errors, which are generally accidental and internal, are kept out of the public eye whenever possible, so we only see them when they affect the public at large. Another point is that, as mentioned, the biggest IT-security-related losses to companies are caused by internal employees, not outside hackers or attackers.

