I believe that the above link refers to a previous outage. The duration of the outage doesn't match today's, the technical explanation doesn't align very well, and many of the comments reference earlier dates.
My bad - might be best to ignore the above post, as it is an unconfirmed/undated post-mortem that may reference a different event.
If I'm reading the source correctly, the timestamp inside is for 09 SEP
2021 14:22:49 GMT (Unix time 1631197369). Then again, I may not be
reading it correctly.
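For anyone who wants to double-check that conversion, a quick Ruby snippet (any language with a time library works the same way) confirms it:

    require 'time'

    ts = 1_631_197_369
    puts Time.at(ts).utc          # => 2021-09-09 14:22:49 UTC
    puts Time.at(ts).utc.iso8601  # => 2021-09-09T14:22:49Z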
Yes, actually, they do. They use Chef extensively to configure operating systems. Chef is written in Ruby, and Ruby has something called monkey patching: at an arbitrary location in the code you can re-open a class or module defined elsewhere and change its methods.
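To make the term concrete, here is a minimal toy illustration (nothing to do with Chef itself) of what re-opening a class looks like in Ruby:

    # A class defined somewhere in a library you don't control.
    class Greeter
      def greet
        "hello"
      end
    end

    # ...later, possibly in a completely different file, the class is
    # re-opened and the method is silently replaced.
    class Greeter
      def greet
        "hi there"
      end
    end

    Greeter.new.greet  # => "hi there"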
Chef doesn't always do the right thing. You tell Chef to remove an RPM and it does, even if it has to remove half the operating system to satisfy the dependencies. If you want it to do something reasonable instead, say throw an error because you didn't actually tell it to remove half the operating system, you have a choice: maintain a fork of Chef with a couple of patches to the Chef-RPM interaction, or just monkey-patch it in one of your Chef recipes.
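For anyone who hasn't seen the second option, here is a hedged sketch of what such a patch can look like when dropped into a cookbook's libraries/ directory. The provider class and method names (Chef::Provider::Package::Yum, remove_package) are assumptions that vary across Chef releases, and this only outlines the mechanics, it is not anyone's production code:

    # Wrap the provider's removal path so it can warn (or raise) instead of
    # silently cascading through dependencies.
    module GuardedRemove
      def remove_package(name, version)
        # A real guard would inspect what the package manager plans to take
        # with it; this only shows the monkey-patch mechanics.
        Chef::Log.warn("Patched remove_package called for #{name}")
        super
      end
    end

    # Prepending puts our module ahead of the provider in the ancestor chain,
    # so `super` still reaches the original implementation.
    Chef::Provider::Package::Yum.prepend(GuardedRemove)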
129.134.30.0/23, 129.134.30.0/24, 129.134.31.0/24. The specific routes covering all 4 nameservers (a-d) were withdrawn from all FB peering at approximately 15:40 UTC.
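The containment of those prefixes is easy to sanity-check with Ruby's standard ipaddr library (the prefixes are just the ones listed above):

    require 'ipaddr'

    covering = IPAddr.new('129.134.30.0/23')
    ['129.134.30.0/24', '129.134.31.0/24'].each do |prefix|
      puts "#{prefix} inside 129.134.30.0/23? #{covering.include?(IPAddr.new(prefix))}"
    end
    # Both more-specifics fall inside the /23, so withdrawing all three
    # removes every route FB was announcing for that space (assuming no
    # larger covering prefix remained).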
While Ruby indeed has a chain-saw (read: powerful, dangerous, still the tool of choice in certain cases) in its toolkit that is generally called “monkey-patching”, I think Michael was actually thinking about the “chaos monkey”.
Amazon and Netflix seem not to keep all their eggs in one basket. At first glance, they appear more resilient than facebook.com, google.com and apple.com.
Rumour is that when the FB route prefixes were withdrawn, their door authentication system stopped working and they could not get back into the building or server room.
My speculative guess would be that OOB access to a few outbound-facing routers per DC does not help much if a configuration error withdraws the infrastructure prefixes down to the rack level, while dedicated OOB to each RSW would be prohibitive.
Well, it doesn't really matter that you can resolve the A/AAAA/MX records if you can't connect to the network that is hosting the services.
At any rate, having 3rd party DNS hosting for your domain is always a good thing. But in reality it only helps if the service is also available on a 3rd party network; otherwise you keep DNS up but get no service.
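As a small illustration of that point, resolving a name through a third-party resolver can succeed while the connection to the returned address still fails, because DNS availability and network reachability are separate problems (the resolver IP 1.1.1.1 and the hostname below are arbitrary placeholders):

    require 'resolv'
    require 'socket'

    host = 'www.example.com'
    addr = Resolv::DNS.open(nameserver: ['1.1.1.1']) { |dns| dns.getaddress(host).to_s }
    puts "resolved #{host} -> #{addr}"

    begin
      # If the hosting network has withdrawn its routes, the resolution above
      # still works, but this connect is what times out or gets rejected.
      Socket.tcp(addr, 443, connect_timeout: 3) { puts 'connected' }
    rescue SystemCallError => e
      puts "resolved fine, but cannot connect: #{e.class}"
    end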
Maybe the withdrawal of the routes to their NS could have been mitigated by hosting NS with separate entities.
Assuming they had such a thing in place, it would not have helped.
Facebook stopped announcing the vast majority of their IP space to the DFZ during this. So even if they did have an offnet DNS server that could have provided answers to clients, those same clients probably wouldn’t have been able to connect to the IPs returned anyways.
If you are running your own auths like they are, you likely view your public network reachability as almost bulletproof, something that will never disappear. Which is probably true most of the time. Until yesterday happens and the 9’s in your reliability percentage change to 7’s.