Doing it the right way makes the cloud far less cost-effective and far
less "agile". Once you get it all set up just so, change becomes very
difficult. All the monitoring and fail-over/fail-back operations are
generally application-specific and provider-specific, so there's a lot
of lock-in. Tools like RightScale are a step in the right direction,
but don't really touch the application layer. You also have to worry
about the availability of yet another provider!
I am pretty sure Netflix and others were "trying to do it right", as they
all had graceful fail-over to a secondary AWS zone defined.
It looks to me like Amazon uses DNS round-robin to load balance the zones,
because their postmortem mentions returning a "list" of addresses for DNS
queries. That would also explain why the services failed to shunt over to
other zones. From the postmortem:
Elastic Load Balancers (ELBs) allow web traffic directed at a single IP
address to be spread across many EC2 instances. They are a tool for high
availability as traffic to a single end-point can be handled by many
redundant servers. ELBs live in individual Availability Zones and front EC2
instances in those same zones or in other Availability Zones.
ELBs can also be deployed in multiple Availability Zones. In this
configuration, each Availability Zone’s end-point will have a separate IP
address. A single Domain Name will point to all of the end-points’ IP
addresses. When a client, such as a web browser, queries DNS with a Domain
Name, it receives the IP address (“A”) records of all of the ELBs in random
order. While some clients only process a single IP address, many (such as
newer versions of web-browsers) will retry the subsequent IP addresses if
they fail to connect to the first. A large number of non-browser clients
only operate with a single IP address.
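To make the difference between those two kinds of clients concrete, here is a minimal Python sketch of a client that walks the whole list of A records instead of giving up after the first one. The function names (resolve_all, tcp_connect, connect_with_failover) are made up for illustration and the timeout is an arbitrary assumption; this is not how any particular browser or AWS SDK does it.

```python
import socket
from typing import Callable, List, Tuple

Address = Tuple[str, int]


def resolve_all(hostname: str, port: int) -> List[Address]:
    """Return every IPv4 address DNS hands back, in the order received."""
    infos = socket.getaddrinfo(hostname, port, socket.AF_INET, socket.SOCK_STREAM)
    addresses: List[Address] = []
    for *_, sockaddr in infos:
        addr = (sockaddr[0], sockaddr[1])
        if addr not in addresses:  # drop duplicate records
            addresses.append(addr)
    return addresses


def tcp_connect(addr: Address, timeout: float = 3.0) -> socket.socket:
    """Open a TCP connection; raises OSError if the endpoint is unreachable."""
    return socket.create_connection(addr, timeout=timeout)


def connect_with_failover(addresses: List[Address],
                          connect: Callable[[Address], object] = tcp_connect):
    """Try each address in turn and return (address, connection) for the
    first one that answers. A single-address client is the special case
    where the list has length one."""
    last_error = None
    for addr in addresses:
        try:
            return addr, connect(addr)
        except OSError as exc:
            last_error = exc  # remember the failure, move to the next record
    raise ConnectionError(f"all {len(addresses)} endpoints failed") from last_error
```

Something like connect_with_failover(resolve_all("example.com", 80)) survives the loss of one zone's end-point, while a client that only ever looks at the first record fails whenever that record is the dead one.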
During the disruption this past Friday night, the control plane (which
encompasses calls to add a new ELB, scale an ELB, add EC2 instances to an
ELB, and remove traffic from ELBs) began performing traffic shifts to
account for the loss of load balancers in the affected Availability Zone.
As the power and systems returned, a large number of ELBs came up in a
state which triggered a bug we hadn’t seen before. The bug caused the ELB
control plane to attempt to scale these ELBs to larger ELB instance sizes.
This resulted in a sudden flood of requests which began to backlog the
control plane. At the same time, customers began launching new EC2
instances to replace capacity lost in the impacted Availability Zone,
requesting the instances be added to existing load balancers in the other
zones. These requests further increased the ELB control plane backlog.
Because the ELB control plane currently manages requests for the US East-1
Region through a shared queue, it fell increasingly behind in processing
these requests; and pretty soon, these requests started taking a very long
time to complete.
Summary of the AWS Service Event in the US East Region
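The shared-queue part is easy to see with a toy model. This is only a sketch under made-up numbers: the arrival counts and the per-tick capacity are my assumptions, not figures from the postmortem, and simulate_shared_queue is a hypothetical helper.

```python
from collections import deque
from typing import List


def simulate_shared_queue(arrivals_per_tick: List[int],
                          capacity_per_tick: int) -> List[int]:
    """Return the backlog left in one shared FIFO queue after each tick."""
    queue = deque()
    backlog = []
    for tick, arrivals in enumerate(arrivals_per_tick):
        # Every request, whatever its source, lands in the same queue.
        queue.extend([tick] * arrivals)
        # The control plane can only work off a fixed number per tick.
        for _ in range(min(capacity_per_tick, len(queue))):
            queue.popleft()
        backlog.append(len(queue))
    return backlog


# Normal load of 5 per tick, then a burst of 20 per tick (bug-triggered
# scaling requests plus customers adding instances), capacity 10 per tick.
print(simulate_shared_queue([5, 5, 20, 20, 20, 5], 10))
# prints [0, 0, 10, 20, 30, 25]
```

Once arrivals exceed capacity the backlog grows every tick, and it takes a while to drain even after the burst is over, which matches the "very long time to complete" they describe.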
*In reality, though, Amazon data centers have outages all the time. In
fact, Amazon tells its customers to plan for this to happen, and to be
ready to roll over to a new data center whenever there’s an outage.*
*That’s what was supposed to happen at Netflix Friday night. But it
didn’t work out that way. According to Twitter messages from Netflix
Director of Cloud Architecture Adrian Cockcroft and Instagram Engineer Rick
Branson, it looks like an Amazon Elastic Load Balancing service, designed
to spread Netflix’s processing loads across data centers, failed during the
outage. Without that ELB service working properly, the Netflix and Pinterest
services hosted by Amazon crashed.*
(Real) Storm Crushes Amazon Cloud, Knocks out Netflix, Pinterest, Instagram | WIRED
I am a big believer in using hardware to load balance data centers, and not
leaving it up to software in the data center, which might fail.
Speaking of services like RightScale, Google announced Compute Engine at
Google I/O this year. BuildFax was an early adopter, and they gave it great
reviews...
It looks like Google has entered into the VPS market. 'bout time... ;-]
http://cloud.google.com/products/compute-engine.html
--steve pirk