massive facebook outage presently

On a related note, what do you think the scene is like in FB HQ right now? (shaking head)


* (Mel Beckman) [Mon 04 Oct 2021, 18:23 CEST]:

Here’s a screenshot:


Please don't do this on the NANOG list.

  -- Niels.

Suspiciously, this comes the morning after Facebook whistleblower Frances Haugen disclosed on 60 Minutes that Facebook’s own research shows that it chose to profit from misinformation and political unrest through deliberate escalation of conflicts. Occam’s razor says “When multiple causes are plausible, and CBS 60 Minutes is one of them, go with 60 Minutes.” :slight_smile:


Yes, embedded ISP CDNs show a huge drop.


Hi Anne,

On a related note, what do you think the scene is like in FB HQ right now?
(shaking head)

Very quiet, as their offices are still closed for all but essentials :slight_smile:

But, from experience I can tell you how that works. I assume Facebook works in a
similar manner as some of my previous employers. This assumption comes from the
fact that quite a number of my previous colleagues now work at Facebook in similar

First there is the question of detecting the outage. Obviously, Facebook will have
a monitoring/SRE team that continuously monitors 1000s of metrics. They observe
a number of metrics go down, and start to investigate. Most likely they will have
some sort of overall technical lead (let's call this the Technical Duty Officer),
who is responsible for the whole thing. Once the SRE team has figured out where the
problem lies, they will alert the TDO. TDO will then hit that big red button and
send out alerts to the appropriate teams to jump on a bridge (let's call that the
Technical Crisis Bridge), to fix the issue.

If done right, whoever is on call for that team will take the lead and interface
with adjoining teams, and other team members who are available to help out. Given
how long this outage is lasting, either something is very broken, or they're
having trouble rolling back a change that was not expected to have any impact.

Once the issue is fixed, the TDO will write a report and submit it to the Problem
Management group. This group will then contact the team deemed responsible for the
outage. That team will now have an opportunity to explain itself during a
post-mortem. Depending on the scale of the outage, the post-mortem can be a
10-minute call on a bridge with a Problem Management manager, or 60 minutes in
the hot seat in a meeting with a bunch of execs.

I've been in that hot seat a few times. Not the most pleasurable experience. Perhaps
it's time for a new career :slight_smile:




It’s got to be more than just DNS.

I got a mail that Facebook was leaving NLIX. Maybe someone botched the script so they took down all BGP sessions instead of just NLIX and now they can’t access the equipment to put it back… :slight_smile:

You laugh, but that kind of sounds like what happened, as in: oops, we isolated prod and are scrambling on DR. There was someone supposedly live-tweeting from their incident response for a bit before their account was panic-deleted.


That's an interesting theory. Once upon a time I saw a billion dollar company suffer
a significant outage after enabling EVPN on a remote site. Took down the entire
backbone, including access to the site.



From what I believe was an FB employee on Reddit (the account now seems to be deleted):

As many of you know, DNS for FB services has been affected, and this is likely a symptom of the actual issue, which is that BGP peering with Facebook’s peering routers has gone down, very likely due to a configuration change that went into effect shortly before the outages happened (they started at roughly 15:40 UTC).

There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access are separate from the people who know how to actually authenticate to the systems and the people who know what to actually do, so there is now a logistical challenge in getting all that knowledge unified.

Part of this is also due to lower staffing in data centers due to pandemic measures.

I believe the original change was ‘automatic’ (as in configuration done via a web interface). However, now that the connection to the outside world is down, remote access to those tools doesn’t exist anymore, so the emergency procedure is to gain physical access to the peering routers and do all the configuration locally.

“About five minutes before Facebook’s DNS stopped working we saw a large number of BGP changes (mostly route withdrawals) for Facebook’s ASN.”
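
For anyone who wants to look for that kind of withdrawal burst in their own collector data, here is a rough sketch. Everything in it is an assumption for illustration: the `(unix_seconds, kind)` event format, the function names, and the sample events are made up, and real input would have to come from whatever BGP feed you have access to.

```python
from collections import Counter

def withdrawals_per_bucket(events, bucket_seconds=300):
    """Count route withdrawals per fixed-size time bucket.

    events: iterable of (unix_seconds, kind) tuples, where kind is
    "announce" or "withdraw" (a hypothetical format, not a real feed).
    Returns {bucket_start_seconds: withdrawal_count}.
    """
    counts = Counter()
    for ts, kind in events:
        if kind == "withdraw":
            # Align the timestamp to the start of its bucket.
            counts[ts - ts % bucket_seconds] += 1
    return dict(counts)

def spike_buckets(counts, threshold):
    """Return the sorted bucket start times with at least `threshold` withdrawals."""
    return sorted(start for start, n in counts.items() if n >= threshold)

# Made-up sample: one quiet bucket, then a burst of withdrawals.
sample_events = [
    (0, "announce"),
    (10, "withdraw"),
    (310, "withdraw"),
    (320, "withdraw"),
    (330, "withdraw"),
    (340, "announce"),
]
sample_counts = withdrawals_per_bucket(sample_events)
sample_spikes = spike_buckets(sample_counts, threshold=3)
```

The threshold and the five-minute bucket are arbitrary; in practice you would compare against the normal churn for the ASN you are watching.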

Facebook will have to send out a Reason For Outage covering all the services it’s affecting, like login.

Some of the DNS server addresses are no longer covered by prefixes announced from AS32934.
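
A quick way to check this kind of thing yourself is to compare the nameserver addresses against the prefixes you currently see from the origin AS. The sketch below uses only Python's standard `ipaddress` module. The function name is hypothetical, and the addresses and prefixes are documentation ranges standing in for real data, not Facebook's actual nameservers or announcements.

```python
import ipaddress

def uncovered_addresses(addresses, announced_prefixes):
    """Return the addresses that fall inside none of the announced prefixes."""
    networks = [ipaddress.ip_network(p) for p in announced_prefixes]
    missing = []
    for addr in addresses:
        ip = ipaddress.ip_address(addr)
        # Only compare IPv4 with IPv4 and IPv6 with IPv6.
        covered = any(ip.version == net.version and ip in net for net in networks)
        if not covered:
            missing.append(addr)
    return missing

# Placeholder data (RFC 5737 / RFC 3849 documentation ranges).
nameservers = ["192.0.2.53", "198.51.100.53", "2001:db8::53"]
announced = ["192.0.2.0/24", "2001:db8::/32"]
unreachable = uncovered_addresses(nameservers, announced)
```

With the placeholder data above, only `198.51.100.53` ends up in `unreachable`, since no announced prefix covers it.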