http://aws.amazon.com/message/65648/
So, in a nutshell, Amazon had a single point of failure which touched off this entire incident.
I am still waiting for proof that single points of failure can realistically be completely eliminated from any moderately complicated network environment / application. So far, I think Murphy is still winning on this one.
Good job by the AWS team, however; I am sure your new procedures and processes will get a shakeout again, and it will be interesting to see how that goes. I bet there will be more to learn along this road for us all.
Mike-
Well, in fairness to Amazon, let's ask this: did the failure occur *behind
a component interface they advertise as Reliable*? Either way, was it possible
for a single customer to avoid that possible failure, and at what cost in
expansion of scope and money?
Cheers,
-- jra
Sure they can, but as a thought exercise, full 2N redundancy is
difficult on a small scale for anything web-facing. I've seen a very
simple implementation for a website requiring five nines that consumed over
$50k in equipment, and this wasn't even geographically diverse. I have
to believe that scaling up the concept of "doing it right" results in
exponential cost increases. To illustrate the problem, consider
the first step in the thought exercise: find two datacenters with
diverse carriers that aren't on the same regional power grid (as we
learned in the (IIRC) 2003 power outage, New York and DC won't work, nor
will Ohio, so you need redundant teams to cover a very remote site).
What it really boils down to is this: if application developers are
doing their jobs, a given service can be easy and inexpensive to
distribute to unrelated systems/networks without a huge infrastructure
expense. If the developers are not, you end up spending a lot of
money on infrastructure to make up for code, databases, and APIs which
were not designed with this in mind.
These same developers who do not design and implement services with
diversity and redundancy in mind will fare little better with AWS than on
any other platform. Look at Reddit, for example. This is an
application/service which is utterly trivial to implement in a cheap,
distributed manner, yet they have failed to do so for years, and
suffer repeated, long-duration outages as a result. They probably buy
a lot more AWS services than would otherwise be needed, and truly have
a more complex infrastructure than such a simple service should require.
IT managers would do well to understand that a few smart programmers,
who understand how all their tools (web servers, databases,
filesystems, load-balancers, etc.) actually work, can often do more to
keep infrastructure cost under control, and improve the reliability of
services, than any other investment in IT resources.
Mike wrote:
> I am still waiting for proof that single points of failure can
> realistically be completely eliminated from any moderately complicated
> network environment / application. So far, I think Murphy is still
> winning on this one.
From my reading of what happened, it looks like they didn't have a
single point of failure but ended up routing around their own
redundancy.
They apparently had a redundant primary network and, on top of that, a
secondary network. The secondary network, however, did not have the
capacity of the primary network.
Rather than failing over from the active portion of the primary network
to the standby portion of the primary network, they inadvertently failed
the entire primary network to the secondary. This resulted in the
secondary network reaching saturation and becoming unusable.
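To make the distinction concrete, here is a minimal sketch in Python of the kind of guard a traffic-shift procedure could apply: refuse to move load onto a path that isn't provisioned to carry it. The names (Path, safe_failover, the capacity figures) are invented for illustration; this is not Amazon's tooling.

from dataclasses import dataclass

@dataclass
class Path:
    name: str
    capacity_gbps: float  # provisioned capacity of this network path

def shift_traffic_to(path: Path) -> None:
    # Stand-in for the real traffic move.
    print(f"traffic moved to {path.name}")

def safe_failover(target: Path, offered_load_gbps: float, headroom: float = 0.8) -> None:
    # Refuse to shift traffic onto a path that cannot absorb the current
    # offered load with some headroom to spare.
    if offered_load_gbps > target.capacity_gbps * headroom:
        raise RuntimeError(
            f"refusing failover: {offered_load_gbps} Gb/s exceeds "
            f"{headroom:.0%} of {target.name}'s {target.capacity_gbps} Gb/s capacity"
        )
    shift_traffic_to(target)

try:
    # A secondary that is too small is rejected instead of saturated.
    safe_failover(Path("secondary", capacity_gbps=10.0), offered_load_gbps=40.0)
except RuntimeError as err:
    print(err)

A check like this turns "shifted everything onto a network that couldn't carry it" from a silent mistake into a loud, immediate error.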
There isn't anything that can be done to mitigate against human error.
You can TRY, but as history shows us, it all boils down to the human who
implements the procedure. All the redundancy in the world will not do
you an iota of good if someone explicitly does the wrong thing. In this
case it is my opinion that Amazon should not have considered their
secondary network to be a true secondary if it was not capable of
handling the traffic. A completely broken network might have been an
easier failure mode to handle than a saturated network (high packet loss
but the network is "there").
This looks like it was a procedural error and not an architectural
problem. They seem to have had standby capability on the primary
network and, from the way I read their statement, did not use it.
If you want a perfect example of this, consider Netflix. Their infrastructure runs on AWS and we didn't see any downtime with them throughout the entire affair.
One of the interesting things they've done to try to enforce reliability of services is an in-house service called Chaos Monkey, whose sole purpose is to randomly kill instances and services inside the infrastructure. Thanks to Chaos Monkey and the defensive programming it enforces, no component depends strictly on any other, so you will always get at least some form of the service. For example, if the recommendation engine dies, the application is smart enough to catch that and instead return a list of the most popular movies, and so on. There is an interesting blog post from their Director of Engineering about what they learned in their migration to AWS, including using less chatty APIs to reduce the impact of typical AWS latency:
http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
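For anyone who hasn't seen the pattern, a rough Python sketch of that graceful-degradation idea follows. The names (fetch_personalized, MOST_POPULAR) are made up for illustration and are not Netflix's code; the point is simply that a failed dependency becomes a cheaper fallback rather than an outage.

import logging

MOST_POPULAR = ["Movie A", "Movie B", "Movie C"]  # precomputed, cached fallback list

def fetch_personalized(user_id: str) -> list[str]:
    # Stand-in for a call to the recommendation service; here it simulates an outage.
    raise TimeoutError("recommendation service unreachable")

def recommendations(user_id: str) -> list[str]:
    try:
        return fetch_personalized(user_id)
    except Exception:
        # Degrade gracefully instead of failing the whole page.
        logging.warning("recommendation service down; serving most-popular list")
        return MOST_POPULAR

print(recommendations("user-42"))  # prints the fallback list during the simulated outage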
Paul
The procedural error was putting all the traffic on the secondary
network. They promptly recognized that error, and fixed it. It's
certainly true that you can't eliminate human error.
The architectural problem is that they had insufficient error recovery
capability. Initially, the system was trying to use a network that was
too small; that situation lasted for some number of minutes; it's no
surprise that the system couldn't operate under those conditions and
that isn't an indictment of the architecture. However, after they put
it back on a network that wasn't too small, the service stayed
down/degraded for many, many hours. That's an architectural problem.
(And a very common one. Error recovery is hard and tedious and more
often than not, not done well.)
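As a generic illustration (not AWS's actual recovery code): one reason recovery takes so long is that a fleet of clients all retrying at once against a just-restored resource can keep it pinned down. Exponential backoff with jitter is the usual way to spread that retry load out; a minimal Python sketch:

import random
import time

def call_with_backoff(operation, max_attempts: int = 6, base_delay: float = 0.5):
    # Retry a failing call, waiting a randomised, exponentially growing
    # time between attempts so a crowd of recovering clients doesn't
    # re-saturate the resource the moment it comes back.
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts - 1:
                raise
            time.sleep(random.uniform(0, base_delay * 2 ** attempt))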
Procedural error isn't the only way to get into that boat. If the
wrong pair of redundant equipment in their primary network had failed
simultaneously, they'd likely have found themselves in the same boat: a
short outage caused by a risk they accepted (loss of a redundant pair of
hardware), followed by a long outage (after they restored the
network) caused by insufficient recovery capability.
Their writeup suggests they fully understand these issues and are doing
the right thing by seeking to have better recovery capability. They
spent one sentence saying they'll look at their procedures to reduce
the risk of a similar procedural error in the future, and then spent
paragraphs on what they are going to do to have better recovery should
something like this occur in the future.
(One additional comment, for whoever posted that NetFlix had a better
architecture and wasn't impacted by this outage. It might well be that
NetFlix does have a better architecture and that might be why they
weren't impacted ... but there's also the possibility that they just
run in a different region. Lots of entities with poor architecture
running on AWS survived this outage just fine, simply by not being in
the region that had the problem.)
-- Brett
On Sun, 01 May 2011, Mike <mike-nanog@tiedyenetworks.com> wrote in "Re: Amazon diagnosis":
> http://aws.amazon.com/message/65648/
> So, in a nutshell, Amazon had a single point of failure which touched
> off this entire incident. I am still waiting for proof that single
> points of failure can realistically be completely eliminated from any
> moderately complicated network environment / application. So far, I
> think Murphy is still winning on this one.
This was a classic case of _O'Brien's Law_ in action -- which states,
rather pithily:
"Murphy...
was an OPTIMIST!!"
For starters, you almost always screw up and have one NOC full of
chuckle-headed banana eaters. And if you have two NOCs, that implies
one entity deciding which one takes lead on a problem.
On Sun, 1 May 2011, George Bonser <gbonser@seven.com> wrote in "RE: Amazon diagnosis":
> There isn't anything that can be done to mitigate against human error.
> You can TRY, but as history shows us, it all boils down to the human who
> implements the procedure. All the redundancy in the world will not do
> you an iota of good if someone explicitly does the wrong thing. ...
> This looks like it was a procedural error and not an architectural
> problem.
A sage sayeth sooth:
"For any 'fool-proof' system, there exists
a *sufficiently determined* fool capable of
breaking it."
It would seem that the validity of that has just been re-confirmed. <wry grin>
It is worthy of note that it is considerably harder to protect against
accidental stupidity than it is to protect against intentional malice.
('malice' is _much_ more predictable, in general. <wry grin>)
http://storagemojo.com/2011/04/29/amazons-ebs-outage/
Stefan Mititelu
http://twitter.com/netfortius
http://www.linkedin.com/in/netfortius
Jeff Wheeler wrote:
> IT managers would do well to understand that a few smart programmers,
> who understand how all their tools (web servers, databases,
> filesystems, load-balancers, etc.) actually work, can often do more to
> keep infrastructure cost under control, and improve the reliability of
> services, than any other investment in IT resources.
I fully agree.
But much to my dismay and surprise I have learned that developers often know very little above and beyond their field of interest, say Java programming. And I bet the reverse is also true.
It surprised me because I, perhaps naively, assumed IT workers in general have rather broad knowledge: in general they're interested in many aspects of IT, try to find out as much as possible and, if they do not know something, make an effort to learn it. That matters all the more because many (practical) things just aren't taught at university, which is to be expected, since the idea there is to develop an academic way of thinking.
Maybe this "hacker" mentality is less prevalent than I, naively, assumed.
So I believe it's just really hard to find someone who is smart and who understands all or most of the aspects of IT, i.e. servers, databases, file systems, load balancers, networks etc. And it's easier and cheaper in the short term to just open a can of <insert random IT job> and hope for the best.
Regards,
Jeroen
No, the average IT worker is always a mere 3 keystrokes away from getting their
latest creation listed on www.thedailywtf.com. They're lucky they can manage to
get stuff done in their own area of competency, much less develop broad
knowledge.
Sorry to break it to you.
I work with a bunch of developers; we're a primarily Java-based company, but I've got more than enough on my plate trying to keep up with everything practical as a sysadmin, from networks to hardware to audit needs, to even start thinking about adding Java skills to my repertoire! Especially given I'm the only sysadmin here and our infrastructure needs are quite diverse. I've learned to interpret the Java stack traces that get sent to me 24x7 on our critical mailing list so that I can identify whether the problem is code or infrastructure, but that's as far as I go with Java. I don't particularly see that I need to go further, either. I strive to work *with* developers: no 'them vs us' attitudes, no arrogant "my way or the highway". I can't conceive why anyone would even consider maintaining those kinds of attitudes, but unfortunately I have seen them frequently, and they so often seem to be the norm rather than the exception.
Programming is not something I'd consider myself to be any good at. I'll happily and reasonably competently script stuff in perl, python or bash for sysadmin purposes, but I'd never make any pretence at it being 'good' and well done scripting. It's just not the way my mind works. I have my specialisms and they have theirs, more productive use of time is to work with those who excel at that kind of thing. Here they don't make assumptions about my end of things, and I don't make assumptions about theirs. We ask each other questions, and work together to figure out how best to proceed. Thankfully we're a relatively small enough operation that management isn't too much of a burden.
Smart IT managers, in my book, work to take advantage of all the skills that their workers have and provide an efficient framework for them to work together. What it seems we see more often than not are IT managers that persist in seeing Sysadmin and Development as 'ops' and 'dev' separately rather than combined, perpetuating the 'them' vs 'us' attitudes rather than throwing them out for the inefficient, financially wasteful things they are.
Paul
That's ok, the past tense in my story testifies to the fact I was already aware of it. But thanks.
Greetings,
Jeroen
There was a significant decline in knowledge as the .com era peaked in
the 90s: less CS background was required as an entry barrier, the
employment pool grew fast enough that community knowledge
organizations (Usenix, etc.) didn't effectively diffuse into the new
community, and so on.
The number of people who "get" computer architecture, ops, clusters,
networking, systems architecture and engineering, etc... Not good.
Sigh.
Unfortunately we see this when we interview candidates. Even those who have certifications generally only know how to do specific things within a narrow field. They don't have a base understanding of how things work, such as TCP/IP, so when they need to do something a little outside the norm, they flounder.
Jason
Unlike the US of A, here in Australia the industry has gone *very* heavily
down the path of requiring/expecting certification. They have bought into
the faith that unless your resume includes CC?? you're worthless.
There are "colleges" (er, I mean training businesses) who will *guarantee*
you will pass your exam at the end of the course. Amazingly enough for some
of them you never actually touch a router console (not even a virtual one)
through the entire course.
Unfortunately the end result has been an entire generation of potential
employees who are perfectly capable of passing an exam and thereby becoming
'certified', but cannot be trusted to touch a production network. They have
no understanding of what they're doing, why things work (or not) the way
they do, no real troubleshooting skills and certainly not an ounce of real
world production-network common sense.
Not *all* of them, but by far the vast overwhelming majority of candidates
have the "I'm certified so gimme a job and pay me big bucks" attitude
despite having *no real skills* worth mentioning.
PhilP
Umm... see the CAP theorem. There are certain things, such as ACID
transactions, which are *impossible* to geographically distribute with
redundancy in a performant and scalable way because of speed of light
constraints.
Of course web-startups like Reddit have no excuse in this area: they
don't even *need* ACID transactions for anything they do, as what they
are storing is utterly unimportant in the financial sense and can be
handled with eventually-consistent semantics. But asynchronous
replication doesn't cut it for something like stock trades, or even
B2C order taking.
I like to bag on my developers for not knowing anything about the
infrastructure, but sometimes you just can't do it right because of
physics. Or you can't do it right without writing your own OS,
networking stacks, file systems, etc., which means it is essentially
"impossible" in the real world.