Cloudflare is down

https://twitter.com/CloudFlare/status/308157556113698816
https://twitter.com/CloudFlare/status/308165820285083648

Apparently due to a routing issue...

-AW

Unless you're in UTC-12 or your clock is wrong, you posted that 7 hours
ago and it just hit the list. (Whoever wrote that Sent header, BTW,
presumably your MUA, is 5322 non-compliant: no TZ.)
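
For reference, RFC 5322's date-time requires a zone offset. A minimal sketch of
producing a compliant value with Python's standard email.utils (nothing to do
with whatever MUA produced the original, just an illustration):

    from email.utils import formatdate

    # RFC 5322 requires a zone, e.g. "+0200", or "-0000" when the local
    # offset is unknown; formatdate() always emits one.
    print(formatdate(localtime=True))
    # e.g. "Mon, 04 Mar 2013 09:31:13 +0200"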

Anyroad, their Twitter account now seems to say they're back up.

Cheers,
-- jra

back up again: http://blog.cloudflare.com/todays-outage-post-mortem-82515

tl;dr: outage caused by flowspec filter tickling vendor bug.

Nick

I am not sure it's a bug... it could be normal behavior for how the JunOS CLI handles "extended" packet sizes. Will wait for Juniper's comment on the incident.

Vinod

Apparently due to a routing issue...

back up again: http://blog.cloudflare.com/todays-outage-post-mortem-82515

tl;dr: outage caused by flowspec filter tickling vendor bug.

Definitely smart to be delegating your DNS to the web-accelerator
company and a single point of failure, especially if you are not just
running a web-site, but have some other independent infrastructure,
too.

CloudFlare's 23 data centers span 14 countries so the response took some time but within about 30 minutes we began to restore CloudFlare's network and services. By 10:49 UTC, all of CloudFlare's services were restored. We continue to investigate some edge cases where people are seeing outages. In nearly all of these cases, the problem is that a bad DNS response has been cached. Typically clearing the DNS cache will resolve the issue.
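
To make "a bad DNS response has been cached" concrete: one way to check whether
a recursive resolver is still serving a stale answer is to compare it against an
authoritative server directly. A rough sketch with dnspython (assuming dnspython
>= 2.0; the zone name and server address below are placeholders, not anything
CloudFlare-specific):

    import dns.resolver  # dnspython >= 2.0

    NAME = "example.com"          # placeholder zone
    AUTH_NS = "198.51.100.53"     # placeholder authoritative server IP

    # Answer from whatever recursive resolver this host normally uses
    # (this is where a stale record would be cached).
    cached = dns.resolver.resolve(NAME, "A")

    # Answer straight from an authoritative server, bypassing caches.
    auth = dns.resolver.Resolver(configure=False)
    auth.nameservers = [AUTH_NS]
    fresh = auth.resolve(NAME, "A")

    print("recursive    :", sorted(r.address for r in cached))
    print("authoritative:", sorted(r.address for r in fresh))
    # If these differ, the resolver is serving a stale cached answer; an
    # end user can only wait out the TTL or ask the resolver's operator
    # to flush it.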

Yet, apparently, CloudFlare doesn't even support using any of their
services with your own DNS solutions.

And how exactly do they expect end-users to go about "clearing the DNS
cache"? Do I call AT&T and ask them to clear their cache?

C.

* Constantine A. Murenin:

And how exactly do they expect end-users to go about "clearing the DNS
cache"? Do I call AT&T and ask them to clear their cache?

Sure, and also tell them to clear their BGP cache (aka "route flap
dampening"). 8-)

Is there any blog or some sort of site that has an up-to-date list of the latest network outages?
Like, not just Cloudflare, but every major outage that has happened lately.

It's really nice to see a post-mortem analysis like in this case.
Bugs/hidden "features" are not "documented" in most of the books I've read, so the only way to run into them is to either step on them yourself or learn from other networks going splat before yours does.

I've tried searching a bit more, and I've either found "electrical network outage" results or reports from individual businesses as to why their service went down, but I wasn't able to find the "big list".

I'd appreciate it a lot if anyone could point me in the right direction.

Thanks,

Alex.

I'd start here:
http://www.outages.org/
and the listserv is here:
http://wiki.outages.org/index.php/Main_Page#Outages_Mailing_Lists

Set aside the idea of "every" outage, and "major" is in the eye of the beholder. =)

Frank

To be fair, most of us probably have a harmonized peering edge, running one
vendor, with one or two software releases, and as such are susceptible to a
BGP update taking down the whole edge.

I'm not personally comfortable pointing at CloudFlare and saying this was
easily avoidable and should not have happened (not implying you are, either).

If fuzzing BGP were easy, vendors would provide us with working software and
we wouldn't lose a good portion of the Internet every few years due to a
mangled UPDATE.
I know a lot of vendors are fuzzing with 'codenomicon', and they appear not
to have a flowspec fuzzer.
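
For illustration only, the crudest form of this is byte-level mutation of an
otherwise valid UPDATE replayed at a lab peer; real fuzzers (Codenomicon
included) are grammar-aware and cover far more state. A toy sketch using only
RFC 4271 framing:

    import random
    import struct

    MARKER = b"\xff" * 16      # RFC 4271: 16-byte marker of all ones
    BGP_UPDATE = 2             # message type code for UPDATE

    def empty_update() -> bytes:
        # Minimal valid UPDATE: no withdrawn routes, no path attributes,
        # no NLRI. Body = 2-byte withdrawn length + 2-byte attr length.
        body = struct.pack("!HH", 0, 0)
        length = 19 + len(body)            # 19-byte common header + body
        return MARKER + struct.pack("!HB", length, BGP_UPDATE) + body

    def mutate(msg: bytes) -> bytes:
        # Flip one random byte past the marker; this frequently mangles
        # a length field, which is exactly the sort of malformed UPDATE
        # that has taken down single-vendor edges before.
        buf = bytearray(msg)
        i = random.randrange(16, len(buf))
        buf[i] ^= random.randrange(1, 256)
        return bytes(buf)

    base = empty_update()
    for _ in range(5):
        print(mutate(base).hex())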

A lot of things had to go wrong for this to cause an outage.

1. their traffic analyzer had to have a bug which could claim the packet
size is 90k (an impossible value on the wire; see the sketch after this list)
2. their NOC people had to accept it as legit data
(2.5 their internal software where the filter is updated had to accept this
data; unsure if it was an internal system or JunOS directly)
3. the JunOS CLI had to accept this data
4. flowspec had to accept it and generate an NLRI carrying it
5. the NLRI -> ACL abstraction engine had to accept it and try to program it
to hardware
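
To put that 90k in context: IPv4's total-length field is 16 bits, so 65,535
bytes is the ceiling for any packet on the wire (IPv6 jumbograms excepted).
A hypothetical sketch of the sanity check any of steps 1-3 could have applied
(the rule format is invented for illustration, not CloudFlare's or Juniper's
actual tooling):

    MAX_IP_PACKET = 65_535   # IPv4 total length is a 16-bit field

    def validate_packet_length_rule(rule: dict) -> None:
        """Reject a proposed (hypothetical) flowspec-style match rule
        whose packet-length values are physically impossible."""
        for value in rule.get("packet-length", []):
            if not 0 < value <= MAX_IP_PACKET:
                raise ValueError(
                    f"packet-length {value} exceeds {MAX_IP_PACKET}; "
                    "refusing to push this rule to the edge"
                )

    # The 90k value from step 1 would be rejected here:
    try:
        validate_packet_length_rule({"packet-length": [90_000]})
    except ValueError as err:
        print(err)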

Even if CloudFlare had been running out-sourced anycast DNS with a
multi-vendor edge, the records would still have been pointing at a network
which you couldn't reach.

Probably the only thing you could have done to plan against this would have
been to have a solid dual-vendor strategy: to presume that sooner or later a
software defect will take one vendor completely out. And maybe they did
plan for it, but decided dual-vendor costs more than the rare outages.

In a message written on Mon, Mar 04, 2013 at 09:31:13AM +0200, Saku Ytti wrote:

Probably the only thing you could have done to plan against this would have
been to have a solid dual-vendor strategy: to presume that sooner or later a
software defect will take one vendor completely out. And maybe they did
plan for it, but decided dual-vendor costs more than the rare outages.

From what I have heard so far, there is something else they could
have done: hire higher-quality people.

Any competent network admin would have stopped and questioned a
90,000+ byte packet and done more investigation. Competent programmers
writing their internal tools would have flagged that data as out
of range.

I can't tell you how many times I've sat in a post-mortem meeting
about some issue and the answer from senior management is "why don't
you just provide a script to our NOC guys, so the next time they
can run it and make it all better." Of course it's easy to say
that: the smart people have already diagnosed the problem!

You can buy these "scripts" for almost any profession. There are
manuals on how to fix everything on a car, and treatment plans for
almost every disease. Yet most people intuitively understand you
take your car to a mechanic and your body to a doctor for the proper
diagnosis. The primary thing you're paying for is expertise in
what to fix, not how to fix it. That takes experience and training.

But somehow it doesn't sink in with networking. I would not at all
be surprised to hear that someone over at Cloudflare right now is
saying "let's make a script to check the packet size" as if that
will fix the problem. It won't. Next time the issue will be
different, and the same undertrained person who missed the packet
size this time will miss the next issue as well. They should all be
sitting around saying, "how can we hire competent network admins for
our NOC", but that would cost real money.

i suspect they fuzz where the money is ...

number of users of bgp?
number of users of flowspec?

Your solution to mistakes seems to be not to make them. I can understand the
train of thought, but I question the practicality of such advice.

The last couple of words are the best thing I've read on NANOG in a very long time. :-)

+1.

I think that is hard because virtually all training / education in our
industry is based on procedures, not on concepts.

Pick up any book about networking and you'll find examples of how to
configure a lab of Cisco 2900s so you can pass an exam. Very few go
into conceptual detail or troubleshooting of any kind. Educational
programs suffer from the same flaw.

There are exceptions to this rule, but they are very few. I'm sure
many NANOG readers are familiar with "Interdomain Multicast Routing,"
for example. It is an excellent book because it covers concepts and
compares two popular vendor platforms on a variety of multicast
topics.

We have lots of stupid people in our industry because so few
understand "The Way Things Work."

We have lots of stupid people in our industry because so few
understand "The Way Things Work."

We have a tendency to view the mistakes we make as unavoidable human errors
and the mistakes other people make as avoidable stupidity.

We should actively plan for mistakes/errors; if you actively plan for no
'stupid mistakes', you're gonna have a bad time.

From my point of view, outages are caused by:

1) operator
2) software defect
3) hardware defect

Most people design only against 3), often with a design which actually
increases the likelihood of 2) and 1), reducing the overall MTBF of a design
which, strictly theoretically, increases it.
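
A back-of-the-envelope illustration of that last point (the numbers are
invented purely for the arithmetic): redundant hardware helps against 3), but
any added complexity that raises the odds of 1) or 2) acts on the whole design
at once and can wipe out the gain.

    a_single = 0.999                        # availability of one box, hardware only

    # Two boxes in parallel: down only if both are down.
    a_parallel = 1 - (1 - a_single) ** 2    # 0.999999

    # Extra complexity adds operator/software failure modes that take out
    # both boxes at once (common mode), with probability c.
    c = 0.002
    a_real = (1 - c) * a_parallel           # ~0.998

    print(f"single box               : {a_single:.6f}")
    print(f"parallel, hardware only  : {a_parallel:.6f}")
    print(f"parallel with common mode: {a_real:.6f}")
    # Once c exceeds the single box's failure probability (0.001 here),
    # the "more redundant" design is less available than the simple one.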

I have to admit I've always suspected that MTBWTF would be a more useful
metric of real-world performance.

Definitely smart to be delegating your DNS to the web-accelerator
company and a single point of failure, especially if you are not just
running a web-site, but have some other independent infrastructure,
too.

To be fair, most of us probably have a harmonized peering edge, running one
vendor, with one or two software releases, and as such are susceptible to a
BGP update taking down the whole edge.

I'm not personally comfortable pointing at CloudFlare and saying this was
easily avoidable and should not have happened (not implying you are, either).

The issue I have is not with their network.

The issue is that they require ALL of their customers to hand over DNS
control, and completely disregard any kind of situation like the one that
has just happened.

* They don't provide any IP-addresses which you can set your A or AAAA
records to.

* They don't provide any hostnames which you can set a CNAME to.
(Supposedly, they do offer CNAME support to paid customers, but if you
look at their help page for CNAME support, it's clear that it's highly
discouraged and effectively an unsupported option.)

* They don't let you AXFR and mirror the zones, either.
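
For anyone unfamiliar with that last point, mirroring a zone over AXFR is
about a dozen lines; a sketch with dnspython, using placeholder names (this
is what an ordinary secondary-DNS setup does, and what's being ruled out
here):

    import dns.query
    import dns.zone   # dnspython

    ZONE = "example.com"          # placeholder zone name
    PRIMARY = "198.51.100.53"     # placeholder primary nameserver IP

    # Pull the whole zone over AXFR (the primary must permit transfers
    # from this host) and write it out as a standard zone file that a
    # secondary nameserver could serve.
    zone = dns.zone.from_xfr(dns.query.xfr(PRIMARY, ZONE))
    with open(f"{ZONE}.zone", "w") as out:
        zone.to_file(out)
    print(f"mirrored {len(zone.nodes)} names from {ZONE}")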

So, the issue here is that a second point of failure is suddenly
introduced into your own harmonised network, and introduced in a way that
suggests it's not a big deal and will make everything better anyway.

In actuality, this doesn't even stop their users from going the
unsupported route: I've seen a relatively major and popular
hosting provider turn over their web-site to CloudFlare when it was
under attack, but they did it with an A record, possibly to avoid the
complete embarrassment of having `whois` show that they don't
even use the nameservers that they provide to their own users.

[...]

Even if CloudFlare had been running out-sourced anycast DNS with a
multi-vendor edge, the records would still have been pointing at a network
which you couldn't reach.

This is where you have it wrong. DNS is not only useful for http,
yet CloudFlare only provides http-acceleration. And they do require
that you delegate your domains to the nameservers on their own
single-vendor network, with no option to opt out.

I don't think they should necessarily be running an out-sourced DNS,
but I do think that they should not make it a major problem for users
to use http-acceleration services without DNS tie-ins. Last I
checked, CloudFlare didn't even let you set up just a subdomain for
their service, i.e. they require complete DNS control from the
registrar-zone level, all the time, every single time.

C.

I'm not going to justify this behaviour. It would not occur to me to give
zone control out to a 3rd party unless all the zones point to records hosted
by that same 3rd party.

...And a lot of people who know the hierarchy solve 3 and then solve 2
in a way that increases 1 (multiple parallel environments with
different vendors' equipment), only to find that 1 increased due to the
additional complexity.

On the other hand, I've seen people who had horrible explosions of 2
or 3 due to ignoring all but 1.

If you ACTUALLY need that many 9s, you need all of it: redundancy,
diversity of vendors, and suitably trained, exercised,
process-supported net admins. That's a few factors of 2 more
expense than nearly anyone typically wants to pay for.