[outages] News item: Blackberry services down worldwide, Egypt affected (not N.A.)

Guys, the outage has moved to the U.S. and Canada. I think we need to consider that this may be sabotage.



What kills me is what they have told the public. They lost a "core switch". I don't know if they actually mean a network switch or not, but I'm pretty sure any of us who work in an enterprise environment know how to factor in N+1 just for these types of days. And then the backup solution failed? I'm not buying it either.


Never put down to malice that which can be more easily explained by stupidity...
or in this case, failure.

RIM explained the problem earlier:

"The messaging and browsing delays being experienced by BlackBerry users in
Europe, the Middle East, Africa, India, Brazil, Chile and Argentina were
caused by a core switch failure within RIM's infrastructure. Although the
system is designed to failover to a back-up switch, the failover did not
function as previously tested. As a result, a large backlog of data was
generated and we are now working to clear that backlog and restore normal
service as quickly as possible. We apologise for any inconvenience and we
will continue to keep you informed."

This appears to have been the result of a change on Monday:

"The problems began at about 11am on Monday. The Guardian understands that
RIM was attempting a software upgrade on its database but suffered
corruption problems, and that attempts to switch back to an older version
led to a collapse"




Maybe they use the same security solutions as the PlayStation Network does... that would suddenly explain a lot.


North American outages of the BlackBerry platform (particularly related
to upgrades gone wrong) were not uncommon.

Think, for example, of Sept 10, Dec 18 and Dec 22, 2009.

It ain't sabotage till you rule out "misconfigured router".

Consider the actual real-world threat models and their likelihoods:

1) Insufficiently caffeinated network engineer - this *NEVER* happens in real
life, it's a total Bruce Schneier caliber movie-plot scenario.

2) Somebody sabotaging a RIM router. This is more likely, because there's just
*bazillions* of people out there that stand to benefit from a RIM outage (and
in fact profit more from an outage than from being able to watch traffic as it
goes by). It's just a question of which one of those bazillions did it *this*
time.

Andrew, you *really* need to learn what the actual failure modes and
root causes in real-life production networks are, and draw conclusions from
reality, not whatever MI-7 inspired dream world the claim of "sabotage"
came from.

Yeah, and that extra comma in the one config file that didn't make a difference
when you tested the failover in the lab *never* makes a difference when it hits
in the production network, right? Or they changed the config of the primary and
it didn't get propagated just right to the backup, or they had mismatched firmware
levels on the blades in the primary and backup switches, so traffic that
didn't tickle a bug on the primary blades caused the blade to crash on the backup...
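
For what it's worth, the config and firmware drift cases above are the kind of thing a dumb pre-failover check can catch. A minimal sketch in Python, assuming you can already pull the running config as text and a slot-to-firmware map from each switch; the function names, hostnames, slot labels and version strings here are all made up for illustration, not anything RIM or any vendor actually uses:

```python
from __future__ import annotations

import difflib


def config_drift(primary_cfg: str, backup_cfg: str) -> list[str]:
    """Return unified-diff lines between two config texts (empty if identical)."""
    return list(
        difflib.unified_diff(
            primary_cfg.splitlines(),
            backup_cfg.splitlines(),
            fromfile="primary",
            tofile="backup",
            lineterm="",
        )
    )


def firmware_mismatches(
    primary: dict[str, str], backup: dict[str, str]
) -> dict[str, tuple[str | None, str | None]]:
    """Map blade slot -> (primary version, backup version) where they differ.

    A slot present on only one switch is reported with None on the other side.
    """
    slots = sorted(set(primary) | set(backup))
    return {
        slot: (primary.get(slot), backup.get(slot))
        for slot in slots
        if primary.get(slot) != backup.get(slot)
    }


# Hypothetical sample data: the backup has one extra comma and one older blade.
primary_cfg = "hostname core\nvlan 10,20\n"
backup_cfg = "hostname core\nvlan 10,20,\n"
primary_fw = {"slot1": "12.2.1", "slot2": "12.2.1"}
backup_fw = {"slot1": "12.2.1", "slot2": "12.1.9"}

drift = config_drift(primary_cfg, backup_cfg)
mismatches = firmware_mismatches(primary_fw, backup_fw)

for line in drift:
    print(line)
for slot, (p, b) in mismatches.items():
    print(f"{slot}: primary={p} backup={b}")
```

This only tells you the two boxes differ, not whether the difference matters, and it does nothing for the case where a bad config replicates faithfully to both.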

Anybody on this list who's been around long enough probably has enough "We
should have had N+2 because the N+1'th device failed too" stories to drain
*several* pitchers of beer at a good pub... I've even had one case where my
butt got *saved* from an ohnosecond-class whoops because the N+1'th device *was*
crashed (stomped a config file, it replicated, was able to salvage a copy from
a device that didn't replicate because it was down at the time).



I think it raises serious questions about RIM's DR strategy if a DB corruption or switch failure or whatever can cause this much of an outage. 'Surely' RIM have a second site that is independent of the primary (within reason) that they could have flipped to when they realised the DB was borked. If not, then any business that relies on them needs to be shouting from the rooftops to get RIM to fix it.


I have been witness to N+1 HUMAN failures but never an N+1 hardware failure or system/design failure that warranted questioning the need for N+2. Usually your N+1 failure is (as already referenced) pasting in a bad config that gets replicated, or something like that. Not saying the hardware is perfect. It's just that I haven't personally seen a full-blown failure like that without human help.

Closest example would be an update that wasn't properly vetted in dev/test before migrating to prod. I've seen a few of those that I guess you could blame on the system. Even though the humans could have tested better....


You have not seen VIP2-40s and CEF in action ;-)

I have and totally get the point ...

They are out there scrambling, trying to figure out where the truck that hit them came from. The PIO has been told to make up a story.

In fairness, Valdis, Andrew did not say "this was obviously sabotage".

He suggested that that possibility be added to the list of things which
the RIM employees tasked with finding a root cause consider.

I think the old filtering rule applies here:

Once is happenstance.
Twice is coincidence.
Three times is enemy action.

If this turns out to look like it came from 3 or more non-cascading failures, then
sabotage will look a little more likely.

-- jra

Again. I know those stories are out there. I'm blessed with a lower profile or higher karma. One of the two.

<digging thru cube to find wood to knock on....>


