Facebook post-mortems...

Assuming there is any truth to that, guess we can't cancel the hard lines yet :-).

#EverythingoIP

Mark.

My speculative guess would be that OOB access to a few outward-facing
routers per DC does not help much if a configuration error withdraws the
infrastructure prefixes down to the rack level, while dedicated OOB to
each RSW would be prohibitive.

If your OOB has any dependence on the inband side, it’s not OOB.

It’s not complicated to have a completely independent OOB infra, even at scale.


That's not quite true. It still gives a much better clue as to what is
going on; if a host resolves to an IP but isn't pingable/traceroutable,
that is something that many more techy people will understand than if
the domain is simply unresolvable. Not everyone has the skill set and
knowledge of DNS to understand how to track down what nameservers
Facebook is supposed to have, and how to debug names not resolving.
There are lots of helpdesk people who are not expert in every topic.

Having DNS doesn't magically get you service back, of course, but it
leaves a better story behind than simply vanishing from the network.
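
A minimal sketch of the kind of first-line check I mean (hostname purely illustrative, any resolver will do):

dig +short www.facebook.com          # if this returns an address, DNS is doing its job
ping -c 3 <address returned above>   # if this fails while DNS works, it's reachability, not DNS
traceroute <address returned above>  # shows roughly where the path dies

That's a determination most front-line staff can make and pass up the chain.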

... JG

Thanks for sharing that article. But OOB access involves exactly that - Out Of Band - meaning one doesn't depend on any infrastructure prefixes or DFZ announced prefixes. OOB access is usually via a local ADSL or wireless modem connected to the BFR. The article does not discuss OOB at all.

Regards,
Hank

Facebook stopped announcing the vast majority of their IP space to the DFZ during this.

This is where I would like to learn more about the outage. Direct-peering FB connections saw a drop in a number of networks (about a dozen); one of the dropped networks covered their C and D nameservers, but the block for the A and B nameservers remained advertised, just not responsive.
I imagine the dropped blocks could have prevented internal responses, but I am surprised that all of these issues would stem from that, at least from the perspective I have.

That's great for you and me who believe in and like troubleshooting.

Jane and Thando, who just want their Instagram timeline feed, couldn't care less about DNS working while network access is down. To them, it's broken, despite your state-of-the-art global DNS architecture.

I'm also yet to find a DNS operator whose #1 priority for deploying 3rd-party resiliency is giving other random network operators in the wild the joy of troubleshooting :-).

On the real though, I'm all for as much useful redundancy as we can get away with. But given just how much we rely on the web for basic life these days, we need to do better at making actual services as resilient as we can (and have) made the DNS.

Mark.

If your NS are in 2 separate entities, you could still resolve your MX/A/AAAA/NS.

Look how Amazon is doing it.

dig +short amazon.com NS
ns4.p31.dynect.net.
ns3.p31.dynect.net.
ns1.p31.dynect.net.
ns2.p31.dynect.net.
pdns6.ultradns.co.uk.
pdns1.ultradns.net.

They use Dyn DNS from Oracle and UltraDNS: 2 very strong networks of anycast DNS servers.

Amazon would not have been impacted like Facebook yesterday. Unless UltraDNS and Oracle have their DNS servers hosted on Amazon infra? I doubt that Oracle has DNS hosted in Amazon, but it's possible.
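
For contrast, facebook.com is (as of this writing) served only by Facebook's own nameservers, all announced out of their own AS32934 as far as I can tell, which the same check shows:

dig +short facebook.com NS
a.ns.facebook.com.
b.ns.facebook.com.
c.ns.facebook.com.
d.ns.facebook.com.

No third party to keep the zone answering when the network carrying those four names disappears.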

Perhaps the management overhead of using 2 different entities for DNS is not financially viable?

Jean

Carsten Bormann wrote:

While Ruby indeed has a chain-saw (read: powerful, dangerous, still
the tool of choice in certain cases) in its toolkit that is generally
called “monkey-patching”, I think Michael was actually thinking about
the “chaos monkey”: https://en.wikipedia.org/wiki/Chaos_engineering

That was a Netflix invention, but see also https://en.wikipedia.org/wiki/Chaos_engineering#Facebook_Storm

It seems to me that so-called chaos engineering assumes a cosmic
(orderly) Internet environment, though, in the good old days, we
were aware that the Internet itself is the source of chaos.

            Masataka Ohta

You don't think at least 10,000 helpdesk requests about Facebook being
down were sent yesterday?

There's something to be said for building these things to be resilient
in a manner that isn't just convenient internally, but also externally
to those people that network operators sometimes forget also support
their network issues indirectly.

... JG

So I'm not worried about DNS stability when split across multiple physical entities.

I'm talking about the actual services being hosted on a single network that goes bye-bye like what we saw yesterday.

All that DNS resolution means diddly, even if it tells us that DNS is not the issue.

Mark.

You don't think at least 10,000 helpdesk requests about Facebook being
down were sent yesterday?

That and Jane + Thando likely re-installing all their apps and iOS/Android on their phones, and rebooting them 300 times in the hopes that Facebook and WhatsApp would work.

Yes, total nightmare yesterday, but I'm sure that 9,999 of the helpdesk tickets had nothing to do with DNS. They likely all were - "Your Internet is down, just fix it; we don't wanna know".

There's something to be said for building these things to be resilient
in a manner that isn't just convenient internally, but also externally
to those people that network operators sometimes forget also support
their network issues indirectly.

I don't disagree with you one bit. It's for that exact reason that we built:

 https://as37100.net/

... not for us, but specifically for other random network operators around the world whom we may never get to drink a crate of wine with.

I have to say that it has likely cut e-mails to our NOC as well as overall pain in half, if not more.

Mark.

I agree that resolving a non-routable address doesn’t bring you a working service.

I thought a few networks were still reachable, like their MX or some DRP networks.

Thanks for the update

Jean

What I forgot to add, however, is that unlike Facebook, we aren't a major content provider. So we don't have a need to parallel our DNS resiliency with our service resiliency, in terms of 3rd-party infrastructure. If our network were to melt, we'd already be getting it from our eyeballs.

If we had content of note that was useful to, say, a handful-billion people around the world, we'd give some thought - however complex - to having critical services running on 3rd party infrastructure.

Mark.

As of now, their MX is hosted on 69.171.251.251

Was this network still announced yesterday in the DFZ during the outage?

69.171.224.0/19

69.171.240.0/20
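
One way to check both claims after the fact, assuming RIPEstat's public routing-history data call (with its usual resource/starttime/endtime parameters, if I remember the API correctly), would be something like:

dig +short facebook.com MX           # re-confirm which host the MX points at today, then resolve that host for the address
curl -s 'https://stat.ripe.net/data/routing-history/data.json?resource=69.171.240.0/20&starttime=2021-10-04T12:00&endtime=2021-10-05T00:00'

The second one should show whether route collectors still saw the /20 through the afternoon and evening of the 4th.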

Jean

* telescope40@gmail.com (Lou D) [Tue 05 Oct 2021, 15:12 CEST]:

Facebook stopped announcing the vast majority of their IP space to the DFZ during this.

People keep repeating this but I don't think it's true.

It's probably based on this tweet: https://twitter.com/ryan505/status/1445118376339140618

but that's an aggregate adding up prefix counts from many sessions. The total number of hosts covered by those announcements didn't vary by nearly as much, since to a significant extent it was more-specifics (/24s) of larger prefixes (e.g. /17s) that disappeared, while those /17s stayed.
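
As a back-of-the-envelope illustration of why prefix count and coverage diverge: a /17 can be split into 2^(24-17) = 128 /24 more-specifics, so withdrawing those /24s sheds a lot of table entries while the covering /17 still reaches every one of those addresses.

echo $((2 ** (24 - 17)))   # 128: the number of /24s inside a single /17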

(There were no covering prefixes for WhatsApp's NS addresses so those were completely unreachable from the DFZ.)

  -- Niels.

People keep repeating this but I don’t think it’s true.

My comment is based solely on my direct observations of my network, maybe 30-45 minutes in.

Everything except a few /24s disappeared from DFZ providers, but I still heard those prefixes from direct peerings. There was no disaggregation that I saw, just the big stuff gone. This was consistent over 5 continents from my viewpoints.

Others may have seen different things at different times. I do not run an eyeball so I had no need to continually monitor.

Does anyone have info on whether this network, 69.171.240.0/20, was reachable during the outage?

Jean

Unrealistic user expectations are not the point. Users can demand
whatever unrealistic claptrap they wish to.

The point is that there are a lot of helpdesk staff at a lot of
organizations who are responsible for responding to these issues.
When Facebook or Microsoft or Amazon take a dump, you get a storm
of requests. This is a storm of requests not just to one helpdesk,
but to MANY helpdesks, across a wide number of organizations, and
this means that you have thousands of people trying to investigate
what has happened.

It is very common for large companies to forget (or not care) that
their technical failures impact not just their users, but also
external support organizations.

I totally get your disdain and indifference towards end users in these
instances; for the average end user, yes, it indeed makes no difference
if DNS works or not.

However, some of those end users do have a point of contact up the
chain. This could be their ISP support, or a company helpdesk, and
most of these are tasked with taking an issue like this to some sort
of resolution. What I'm talking about here is that it is easier to
debug and make a determination that there is an IP connectivity issue
when DNS works. If DNS isn't working, then you get into a bunch of
stuff where you need to do things like determine if maybe it is some
sort of DNSSEC issue, or other arcane and obscure issues, which tends
to be beyond what front line helpdesk is capable of.

These issues often cost companies real time and money to figure out.
It is unlikely that Facebook is going to compensate them for this, so
this brings me back around to the point that it's preferable to have
DNS working when you have a BGP problem, because this is ultimately
easier for people to test and reach a reasonable determination that
the problem is on Facebook's side quickly and easily.

... JG

Maybe the withdrawal of those routes to their NS could have been mitigated by having NS in separate entities.

Well, doesn’t really matter if you can resolve the A/AAAA/MX records,
but you can’t connect to the network that is hosting the services.

Disagree for two reasons:

  1. If you have some DNS working, you can point it at a static “we are down and we know it” page much sooner.

  2. If you have convinced the entire world to install tracking pixels on their web pages that all need your IP address, it is rude to the rest of the world’s DNS to not be able to always provide a prompt (and cacheable) response.
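
On point 2, the cacheability part is easy to eyeball; the TTL column in a plain answer is what every resolver out there gets to hold on to while you are having a bad day (domain used purely as an example of the pixel/SDK case):

dig +noall +answer connect.facebook.net   # the second column is the TTL resolvers may cache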

Unrealistic user expectations are not the point. Users can demand
whatever unrealistic claptrap they wish to.

Users' expectations, today, are always going to be unrealistic, especially when they are able to enjoy a half-decent service free of charge.

The bar has moved. Nothing we can do about it but adapt.

The point is that there are a lot of helpdesk staff at a lot of
organizations who are responsible for responding to these issues.
When Facebook or Microsoft or Amazon take a dump, you get a storm
of requests. This is a storm of requests not just to one helpdesk,
but to MANY helpdesks, across a wide number of organizations, and
this means that you have thousands of people trying to investigate
what has happened.

We are in agreement.

And it's no coincidence that the Facebooks of the world rely almost 100% on non-human contact to give their users support. So that leaves us, infrastructure, in the firing line to pick up the slack for a lack of warm-body access to BigContent.

It is very common for large companies to forget (or not care) that
their technical failures impact not just their users, but also
external support organizations.

Not just large companies, but I believe all companies... and worse, not at ground level, where folk on lists like these tend to keep in touch, but higher up, where the money decisions are made and where caring about your footprint on other Internet settlers whom you may never meet seldom registers.

You and I can bash our heads till the cows come home, but if the folk who need to say "Yes" to the $$$ needed to help external parties troubleshoot better don't get it, then perhaps starting a NOG or some such is our best bet.

I totally get your disdain and indifference towards end users in these
instances; for the average end user, yes, it indeed makes no difference
if DNS works or not.

On the contrary, I looooooove customers. I wasn't into them, say, 12 years ago, but since I began to understand that users will respond to empathy and value, I fell in love with them. They drive my entire thought-process and decision-making.

This is why I keep saying, "Users don't care about how we build the Internet", and they shouldn't. And I support that.

BigContent get it, and for better or worse, they are the ones who've set the bar higher than what most network operators are happy with.

Infrastructure still doesn't get it, and we are seeing the effects of that play out around the world, with the recent SK Broadband/Netflix debacle being the latest barbershop gossip.

However, some of those end users do have a point of contact up the
chain. This could be their ISP support, or a company helpdesk, and
most of these are tasked with taking an issue like this to some sort
of resolution. What I'm talking about here is that it is easier to
debug and make a determination that there is an IP connectivity issue
when DNS works. If DNS isn't working, then you get into a bunch of
stuff where you need to do things like determine if maybe it is some
sort of DNSSEC issue, or other arcane and obscure issues, which tends
to be beyond what front line helpdesk is capable of.

We are in agreement.

These issues often cost companies real time and money to figure out.
It is unlikely that Facebook is going to compensate them for this, so
this brings me back around to the point that it's preferable to have
DNS working when you have a BGP problem, because this is ultimately
easier for people to test and reach a reasonable determination that
the problem is on Facebook's side quickly and easily.

We are in agreement.

So let's see if Facebook can fix the scope of their DNS architecture, and whether others can learn from it. I know I have... even though we provide friendly secondary for a bunch of folk we are friends with, we haven't done the same for our own networks... all our stuff sits on just our network - granted in many different countries, but still, one AS.

It's been nagging at the back of my mind for yonks, but yesterday was the nudge I needed to get this organized; so off I go.

Mark.