Cloudflare is down

Adam_Vitkovsky · March 5, 2013, 8:42am

From my point of view, outages are caused by:
1) operator
2) software defect
3) hardware defect

In my experience, the likelihood nowadays of an outage caused by 3) is an order of
magnitude less than for 2), and the same goes for the 2) to 1) ratio.
In other words, the vast majority of outages are caused by human error.
One way to partially rule out 1) is to have a fully customized, stupid-proof
provisioning system - customized by those who know how stuff works.
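
As a rough illustration of what such a guard might look like, here is a minimal Python sketch. It assumes a hypothetical rule format that matches packets by total length; the names `PacketLengthRule`, `validate_rule` and `provision` are made up for this example and are not any vendor's API. The only hard fact it relies on is that the IPv4 Total Length field is 16 bits, so no valid IPv4 packet is longer than 65535 bytes.

```python
from dataclasses import dataclass
from typing import List

# IPv4 Total Length is a 16-bit field, so no valid packet exceeds this.
MAX_IPV4_PACKET_LEN = 65535


@dataclass(frozen=True)
class PacketLengthRule:
    """Hypothetical filter rule matching packets by total length in bytes."""
    name: str
    min_len: int
    max_len: int


def validate_rule(rule: PacketLengthRule) -> List[str]:
    """Return a list of problems; an empty list means the rule looks sane."""
    problems = []
    if rule.min_len < 0:
        problems.append("min_len is negative")
    if rule.min_len > rule.max_len:
        problems.append("min_len is greater than max_len")
    if rule.max_len > MAX_IPV4_PACKET_LEN:
        problems.append("max_len exceeds the largest possible IPv4 packet")
    return problems


def provision(rule: PacketLengthRule) -> bool:
    """Refuse to push a rule that fails validation (the push itself is omitted)."""
    problems = validate_rule(rule)
    if problems:
        raise ValueError(f"rule {rule.name!r} rejected: " + "; ".join(problems))
    return True
```

The point is only that the check runs before anything reaches a router, so an operator (or an automated profiler) cannot disseminate a rule that could never match real traffic. A real system would of course validate far more than one field.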

adam

Danny_McPherson4 · March 6, 2013, 4:42pm

While fuzzing of BGP[*] on the wire _may_ have identified some of this, there were many components involved (e.g., the DDoS attack on a customer's DNS servers that tickled their "attack profiler"; their attack profiler was presumably confused about the suspect packet sizes, as indicated in the presented "output signature"; their operator didn't identify the issue before disseminating the recommended "signatures"; JUNOS didn't barf when compiling the configuration (that'd be a big packet); a memory leak / thrashing triggered by the ingested flow_spec UPDATE crashed receiving routers; routers apparently recovered non-deterministically; etc.).

Leo's comments remind me of the findings of the President's Commission to Investigate the Accident at Three Mile Island (TMI), where pretty much everyone was blamed, but the operators were identified as ultimately culpable (in this case, presumably, _they_ also wrote the "attack profiler", although "they" may not have been precisely who deployed the policy).

For an interesting perspective on "normal accidents" arising from interactive complexity, see [NormalAccidents]; it's quite applicable to today's network systems, methinks.

-danny

[NormalAccidents] Perrow, Charles, "Normal Accidents: Living with High-Risk Technologies", 1999.