Because it's helpful to know what a link is about before clicking on it, Adrian's link goes to "The Onion Movie - Internet Down"
On a related note, what do you think the scene is like in FB HQ right now? (shaking head)
Anne
In other news, worker productivity is up 100% today.
* mel@beckman.org (Mel Beckman) [Mon 04 Oct 2021, 18:23 CEST]:
Here’s a screenshot:
[screenshot attachment]
Please don't do this on the NANOG list.
-- Niels.
Suspiciously, this comes the morning after Facebook whistleblower Frances Haugen disclosed on 60 Minutes that Facebook’s own research shows that it chose to profit from misinformation and political unrest through deliberate escalation of conflicts. Occam’s razor says “When multiple causes are plausible, and CBS 60 Minutes is one of them, go with 60 Minutes.”
-mel
Yes, embedded ISP CDNs show a huge drop.
-Aaron
Hi Anne,
On a related note, what do you think the scene is like in FB HQ right now?
(shaking head)
Very quiet, as their offices are still closed for all but essential staff.
But, from experience I can tell you how that works. I assume Facebook works in a
similar manner as some of my previous employers. This assumption comes from the
fact that quite a number of my previous colleagues now work at Facebook in similar
roles.
First there is the question of detecting the outage. Obviously, Facebook will have
a monitoring/SRE team that continuously monitors 1000s of metrics. They observe
a number of metrics go down, and start to investigate. Most likely they will have
some sort of overall technical lead (let's call this the Technical Duty Officer, or TDO)
who is responsible for the whole thing. Once the SRE team has figured out where the
problem lies, they will alert the TDO. The TDO will then hit that big red button and
send out alerts to the appropriate teams to jump on a bridge (let's call that the
Technical Crisis Bridge) to fix the issue.
If done right, whoever is on call for that team will take the lead and interface
with adjoining teams and with other team members who are available to help out. Looking
at how long this outage is lasting, either something is very broken, or they're
having trouble rolling back a change that was expected to have no impact.
Once the issue is fixed, the TDO will write a report and submit it to the Problem
Management group. This group will then contact the team deemed responsible for the
outage. That team will now have an opportunity to explain themselves during a
post-mortem. Depending on the scale of the outage, the post-mortem can be a 10 minute
call on a bridge with a Problem Management manager, or a turn in the hot seat during a
60 minute meeting with a bunch of execs.
I've been in that hot seat a few times. Not the most pleasurable experience. Perhaps
it's time for a new career.
Thanks,
Sabri
Hi,
Oops, this was not supposed to go to the list, apologies for the clutter.
Thanks,
Sabri
With Facebook down, how are people doing their vaccine research?
It’s got to be more than just DNS.
Jonathan Kalbfeld
office: +1 310 317 7933
fax: +1 310 317 7901
home: +1 310 317 7909
mobile: +1 310 227 1662
ThoughtWave Technologies, Inc.
Studio City, CA 91604
View our network at
https://bgp.he.net/AS54380
+1 213 984 1000
It could just be that after the 60 Minutes interview they've shut things down in order to divert all power to the shredders.
Wishful thinking .. but hey, one is allowed to dream.
/M
I know what this is… They forgot to update the credit card on their GoDaddy account and the domain lapsed. I guess it will be facebook.info when they get it back online. The post-mortem should be an interesting read.
I got a mail that Facebook was leaving NLIX. Maybe someone botched the script so they took down all BGP sessions instead of just NLIX and now they can’t access the equipment to put it back…
On Mon, 4 Oct 2021 at 20:31, Billy Croan <BCroan@unrealservers.net> wrote:
You laugh, but that kind of sounds like what happened, as in "oops, we isolated prod and are scrambling on DR." Someone was supposedly live-tweeting from their incident response for a bit before the account was panic-deleted.
Hi,
I got a mail that Facebook was leaving NLIX. Maybe someone botched the script so
they took down all BGP sessions instead of just NLIX and now they can't access
the equipment to put it back...
That's an interesting theory. Once upon a time I saw a billion dollar company suffer
a significant outage after enabling EVPN on a remote site. Took down the entire
backbone, including access to the site.
Thanks,
Sabri
From what I believe was an FB employee on Reddit; the account now seems to be deleted.
As many of you know, DNS for FB services has been affected, and this is likely a symptom of the actual issue, which is that BGP peering with Facebook's peering routers has gone down, very likely due to a configuration change that went into effect shortly before the outages happened (they started at roughly 1540 UTC).
There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access are separate from the people who know how to actually authenticate to the systems and from the people who know what to actually do, so there is now a logistical challenge in getting all that knowledge unified.
Part of this is also due to lower staffing in data centers due to pandemic measures.
I believe the original change was ‘automatic’ (as in configuration done via a web interface). However, now that the connection to the outside world is down, remote access to those tools doesn't exist anymore, so the emergency procedure is to gain physical access to the peering routers and do all the configuration locally.
https://twitter.com/jgrahamc/status/1445068309288951820 “About five minutes before Facebook’s DNS stopped working we saw a large number of BGP changes (mostly route withdrawals) for Facebook’s ASN.”
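For anyone who wants to look for that kind of burst in their own BGP feed, here is a minimal Python sketch, not anyone's actual tooling. It assumes a hypothetical record shape of (timestamp, kind, prefix, as_path) taken from a single peer's view, in time order. Because a BGP withdrawal carries only the prefix and no AS path, the sketch remembers the origin ASN from the most recent announcement of each prefix and attributes later withdrawals to that origin. The function name and record layout are made up for illustration.

```python
from collections import Counter
from datetime import datetime


def withdrawals_per_minute(updates, origin_asn):
    """Count withdrawals of prefixes last originated by origin_asn, per minute.

    `updates` is an iterable of (timestamp, kind, prefix, as_path) tuples in
    time order. kind is "A" (announce) or "W" (withdraw); as_path is a list
    of ASNs and is empty for withdrawals. This record shape is hypothetical.
    """
    last_origin = {}   # prefix -> origin ASN seen in its latest announcement
    counts = Counter()  # minute bucket -> number of matching withdrawals
    for ts, kind, prefix, as_path in updates:
        if kind == "A" and as_path:
            # The origin ASN is the last (rightmost) entry of the AS path.
            last_origin[prefix] = as_path[-1]
        elif kind == "W":
            # Forget the prefix either way; count it only if it was ours.
            if last_origin.pop(prefix, None) == origin_asn:
                counts[ts.replace(second=0, microsecond=0)] += 1
    return counts
```

Feeding it real data would mean first parsing MRT dumps or a live stream into those tuples, which is left out here. A sharp spike in one or two minute buckets for a single origin ASN is the pattern the tweet describes.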
I mean, you’re an idiot if you post that publicly on the internet about your own place of work. What do you think would happen? Nothing? He should never have said anything, but now the Facebook hitman got him.
Facebook will have to send out a Reason For Outage listing all the services it’s affecting, like login.
Didn't write that part of the automation script and that coder left...
Hi
Some of the DNS server addresses are no longer covered by prefixes announced from AS32934.
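A quick way to check this kind of thing, sketched in Python under stated assumptions: you already have the list of nameserver addresses and the list of prefixes currently seen from the origin AS (for example from a looking glass or your own RIB). The function name is hypothetical, and the addresses in the usage line are documentation ranges, not Facebook's.

```python
import ipaddress


def uncovered_addresses(addresses, announced_prefixes):
    """Return the addresses that no announced prefix covers.

    Both arguments are lists of strings; IPv4 and IPv6 may be mixed.
    """
    networks = [ipaddress.ip_network(p) for p in announced_prefixes]
    missing = []
    for addr in addresses:
        ip = ipaddress.ip_address(addr)
        # Only compare against networks of the same IP version.
        covered = any(ip.version == net.version and ip in net for net in networks)
        if not covered:
            missing.append(addr)
    return missing


# Usage with made-up documentation addresses and prefixes:
print(uncovered_addresses(["192.0.2.53", "198.51.100.53"], ["192.0.2.0/24"]))
```

An address that shows up in the result is one that the rest of the Internet currently has no route to from that AS, which would match resolvers timing out rather than getting an answer.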