Global Akamai Outage

Mark.

Matt Harris

Infrastructure Lead

Yes, seems to be restoring...

 https://twitter.com/akamai/status/1418251400660889603?s=28

Mark.

The outage appears to have, ironically, taken out the outages and outages-discussion lists too.

Kinda like having a fire at the 911 dispatch center...

<hat=puck.nether.net/MX=outages.org>

It should not have impacted my hosting of the list. Obviously, if the domain names used in the lookups for sending e-mail were impacted as well, there would be problems.

</hat>

- Jared

I received multiple messages from the Outages (proper) mailing list, including messages about the Akamai issue.

I'd be surprised if the Outages Discussion mailing list was on different infrastructure.

I am now seeing some messages sent to Outages (proper) suggesting that others aren't seeing messages about the Akamai issue. This is probably going to be a more nuanced issue that affected some but not all entities. Probably weird interactions / dependencies.

It does seem that way, and I opened a ticket with my provider to see what they can find as well.

-Andy

[18:30 UTC on July 22, 2021] Update:

Akamai experienced a disruption with our DNS service on July 22, 2021. The disruption began at 15:45 UTC and lasted for approximately one hour. Affected customer sites were significantly impacted for connections that were not established before the incident began.

Our teams identified that a change made in a mapping component was causing the issue, and in order to mitigate it we rolled the change back at approximately 16:44 UTC. We can confirm this was not a cyberattack against Akamai's platform. Immediately following the rollback, the platform stabilized and DNS services resumed normal operations. At this time the incident is resolved, and we are monitoring to ensure that traffic remains stable.

From Akamai. This is how companies and vendors should report outages:

[07:35 UTC on July 24, 2021] Update:

Root Cause:

The configuration directive that triggered the incident was sent as part of preparation for independent load balancing control of a forthcoming product. Updates to the configuration directive for this load balancing component have routinely been made on approximately a weekly basis. (Further changes to this configuration channel have been blocked until additional safety measures have been implemented, as noted in Corrective and Preventive Actions.)

The load balancing configuration directive included a formatting error. As a safety measure, the load balancing component disregarded the improper configuration and fell back to a minimal configuration. In this minimal state, based on a VIP-only configuration, it did not support load balancing for Enhanced TLS slots greater than 6145.

The missing load balancing data meant that the Akamai authoritative DNS system for the akamaiedge.net zone would not receive any directive for how to respond to DNS queries for many Enhanced TLS slots. The authoritative DNS system will respond with a SERVFAIL when there is no directive, as during localized failures resolvers will retry an alternate authority.
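For illustration only, a minimal sketch (not Akamai's implementation; the component names and data shapes are invented) of how a missing load-balancing directive turns into SERVFAIL answers:

```python
# Minimal sketch of the behaviour described above: an authoritative server
# answers SERVFAIL when the mapping system supplied no directive for a slot.
# Names and data shapes are illustrative, not Akamai's actual implementation.

RCODE_NOERROR = 0
RCODE_SERVFAIL = 2

# Directives produced by the (hypothetical) load-balancing component.
# After the bad push, slots above 6145 are simply missing from this map.
directives = {
    1001: ["192.0.2.10", "192.0.2.11"],
    6145: ["198.51.100.20"],
    # 6146 and above are absent: the minimal VIP-only fallback carried no data.
}

def answer_query(slot: int):
    """Return (rcode, answers) for an Enhanced-TLS-style slot lookup."""
    targets = directives.get(slot)
    if targets is None:
        # No directive at all: answer SERVFAIL so that, in a localized
        # failure, the resolver retries an alternate authority.
        return RCODE_SERVFAIL, []
    return RCODE_NOERROR, targets

if __name__ == "__main__":
    for slot in (1001, 6145, 7000):
        print(slot, answer_query(slot))
```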

The zoning process used for deploying configuration changes to the network includes an alert check for potential issues caused by the configuration changes. The zoning process did result in alerts during the deployment. However, due to how the particular safety check was configured, the alerts for this load balancing component did not prevent the configuration from continuing to propagate, and did not result in escalation to engineering SMEs. The input safety check on the load balancing component also did not automatically roll back the change upon detecting the error.
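Again purely as an illustration of the gap described, a hypothetical sketch (not Akamai's tooling) of a deployment gate in which any alert during the zoning phase blocks propagation, rolls the change back, and escalates to SMEs:

```python
# Sketch of a deployment gate of the kind the write-up implies was missing:
# if the zoning phase raises alerts, stop propagation, roll back, escalate.
# All function and data names here are hypothetical.

from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class DeploymentResult:
    proceeded: bool
    alerts: List[str] = field(default_factory=list)

def deploy_with_gate(apply_to_zone: Callable[[], List[str]],
                     rollback: Callable[[], None],
                     escalate: Callable[[List[str]], None]) -> DeploymentResult:
    """Apply a change to the first zone; any alert blocks further propagation."""
    alerts = apply_to_zone()
    if alerts:
        rollback()        # undo automatically instead of letting it propagate
        escalate(alerts)  # page the SMEs for the affected component
        return DeploymentResult(False, alerts)
    return DeploymentResult(True)

if __name__ == "__main__":
    # Simulated zoning run that produces one alert.
    result = deploy_with_gate(
        apply_to_zone=lambda: ["load balancer fell back to minimal VIP-only config"],
        rollback=lambda: print("rolled back configuration change"),
        escalate=lambda a: print("escalated to SMEs:", a),
    )
    print("propagation continued:", result.proceeded)
```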

Contributing Factors:

- The internal alerting which was specific to the load balancing component did not result in blocking the configuration from propagating to the network, and did not result in an escalation to the SMEs for the component.
- The alert and associated procedure indicating widespread SERVFAILs potentially due to issues with mapping systems did not lead to an appropriately urgent and timely response.
- The internal alerting which fired and was escalated to SMEs was for a separate component which uses the load balancing data. This internal alerting initially fired for the Edge DNS system rather than the mapping system, which delayed troubleshooting potential issues with the mapping system and the load balancing component which had the configuration change. Subsequent internal alerts more clearly indicated an issue with the mapping system.
- The impact to the Enhanced TLS service affected Akamai staff access to internal tools and websites, which delayed escalation of alerts, troubleshooting, and especially initiation of the incident process.

Short Term

Completed:

- Akamai completed rolling back the configuration change at 16:44 UTC on July 22, 2021.
- Blocked any further changes to the involved configuration channel.
- Other related channels are being reviewed and may be subject to a similar block as reviews take place. Channels will be unblocked after additional safety measures are assessed and implemented where needed.

In Progress:

- Validate and strengthen the safety checks for the configuration deployment zoning process.
- Increase the sensitivity and priority of alerting for high rates of SERVFAILs.

Long Term

In Progress:

- Reviewing and improving input safety checks for mapping components.
- Auditing critical systems to identify gaps in monitoring and alerting, then closing unacceptable gaps.
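To make the "increase the sensitivity and priority of alerting for high rates of SERVFAILs" action above concrete, here is a minimal, hypothetical sketch of such a check; the log format and thresholds are invented, not Akamai's:

```python
# Hypothetical sketch of alerting on a high SERVFAIL rate.  The "log" data
# and the thresholds below are invented for illustration.

from collections import Counter

# Pretend these are rcodes parsed from recent authoritative DNS query logs.
recent_rcodes = ["NOERROR"] * 700 + ["SERVFAIL"] * 250 + ["NXDOMAIN"] * 50

WARN_RATIO = 0.02   # a 2% SERVFAIL ratio is already unusual
CRIT_RATIO = 0.10   # 10% suggests a systemic problem such as missing directives

def servfail_alert(rcodes):
    counts = Counter(rcodes)
    total = sum(counts.values())
    ratio = counts["SERVFAIL"] / total if total else 0.0
    if ratio >= CRIT_RATIO:
        return "CRITICAL", ratio
    if ratio >= WARN_RATIO:
        return "WARNING", ratio
    return "OK", ratio

if __name__ == "__main__":
    level, ratio = servfail_alert(recent_rcodes)
    print(f"{level}: SERVFAIL ratio {ratio:.1%} over {len(recent_rcodes)} queries")
```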

Hey,

This is not a critique of Akamai specifically; it applies just the same
to me. Everything seems so complex and fragile.

Very often the corrective and preventive actions appear to be
different versions and wordings of 'don't make mistakes', in this case:

- Reviewing and improving input safety checks for mapping components
- Validate and strengthen the safety checks for the configuration
deployment zoning process

It doesn't seem like a tenable solution when the solution is 'do
better', since I'm sure whoever did those checks did their best in the
first place. So we must assume there are fundamental limits to what
'do better' can achieve, and that we have a similar level of outage
potential in all the work we've produced and continue to produce, over
which we exert very little control.

I think the mean-time-to-repair actions described are more actionable
than the 'do better' ones. However, Akamai already solved this very
fast, and it may not be reasonable to expect big improvements on a
one-hour fault-to-resolution time for a big organisation with a
complex product.

One thing that comes to mind is: what if Akamai assumes they cannot
reasonably make it fail less often and they can't fix it faster? Is
this particular product/solution such that having entirely
independent A and B sides, to which clients fail over, is not
possible? If it was a DNS problem, it seems like it might have been
possible to have side A fail entirely, with clients automatically
falling back to B, perhaps adding some latency but also allowing the
system to automatically detect that A and B are performing at an
unacceptable delta.
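A rough, purely illustrative sketch of that A/B idea: clients try an independent side A, fall back to side B, and the observed failover and latency statistics make an unacceptable delta between A and B detectable automatically. The endpoints, thresholds, and failure simulation below are invented.

```python
# Minimal sketch of the A/B failover idea from the paragraph above: clients
# try an independent side A, fall back to side B, and record which side
# answered and how fast, so a monitoring job can flag when A and B diverge.
# Endpoints, latencies and the failure simulation are invented.

import random

def query_side(name: str, fail_rate: float) -> float:
    """Pretend to resolve via one side; return latency in ms or raise."""
    if random.random() < fail_rate:
        raise TimeoutError(f"side {name} did not answer")
    return random.uniform(10, 30) if name == "A" else random.uniform(30, 60)

def resolve_with_failover():
    """Try A first, then fall back to the independent B side."""
    try:
        return "A", query_side("A", fail_rate=0.5)
    except TimeoutError:
        return "B", query_side("B", fail_rate=0.0)

if __name__ == "__main__":
    used = {"A": [], "B": []}
    for _ in range(200):
        side, latency = resolve_with_failover()
        used[side].append(latency)
    for side, samples in used.items():
        if samples:
            print(f"side {side}: {len(samples)} answers, "
                  f"avg {sum(samples)/len(samples):.1f} ms")
    # A large shift of traffic to B, or a big latency delta, would be the
    # signal that side A is performing at an unacceptable delta.
```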

Did some of their affected customers recover faster than Akamai did,
due to their own actions, automated or manual?

Complex systems are apt to break, and only a very limited set of tier-3 engineers will understand what needs to be done to fix them.

KISS

-Hank

Indeed. Worth rereading for that reason alone (or in particular).

Miles Fidelman

Hank Nussbacher wrote:

Can we learn something from how the airline industry has incrementally improved safety through decades of incidents?

"Doing better" is the lowest hanging fruit any network operator can strive for. Unlike airlines, the Internet community - despite being built on standards - is quite diverse in how we choose to operate our own islands. So "doing better", while a universal goal, means different things to different operators. This is why we would likely see different RFO's and remedial recommendations from different operators for the "same kind of" outage.

In most cases, continuing to "do better" may be the most appealing prospect, because anything better than that will require significantly more funding, in an industry where most operators are generally threading the P&L needle.

Mark.

Work hat is not on, but context is included from prior workplaces etc.

It doesn't seem like a tenable solution when the solution is 'do
better', since I'm sure whoever did those checks did their best in the
first place. So we must assume there are fundamental limits to what
'do better' can achieve, and that we have a similar level of outage
potential in all the work we've produced and continue to produce, over
which we exert very little control.

I have seen a very strong culture around risk and risk avoidance whenever possible at Akamai. Even minor changes are taken very seriously.

I appreciate that on a daily basis, and when mistakes are made (I am human after all), reviews of the mistakes and corrective steps are planned and followed up on. I'm sure this time will not be different.

I also get how easy it is to be cynical about these issues. There's always someone with power who can break things, but those people can often fix them just as fast.

Focus on how you can do a transactional routing change and roll it back, and on how you can test it, etc.
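As a hypothetical sketch of that transactional-change idea (the config format and validation rules are invented): validate the entire candidate configuration before anything is applied, apply it as one unit, and keep the previous version for rollback.

```python
# Sketch of a "transactional" configuration change: validate the entire
# candidate before anything is applied, apply it as one unit, and keep the
# previous version so the change can be reverted.  The config format and
# validation rules are hypothetical.

class ConfigError(Exception):
    pass

class TransactionalConfig:
    def __init__(self, initial: dict):
        self.running = dict(initial)
        self.previous = dict(initial)

    def validate(self, candidate: dict) -> None:
        """Reject the whole candidate on any error (no line-by-line applying)."""
        for key, value in candidate.items():
            if not isinstance(value, str) or not value:
                raise ConfigError(f"invalid value for {key!r}: {value!r}")

    def commit(self, candidate: dict) -> None:
        self.validate(candidate)            # all-or-nothing: nothing applied on error
        self.previous = dict(self.running)  # keep the old version for rollback
        self.running = dict(candidate)

    def rollback(self) -> None:
        self.running = dict(self.previous)

if __name__ == "__main__":
    cfg = TransactionalConfig({"policy": "export-default"})
    try:
        cfg.commit({"policy": "export-default", "neighbor": 123})  # bad value
    except ConfigError as exc:
        print("rejected:", exc)
    print("running config unchanged:", cfg.running)
```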

This is why, for years, I told one vendor that had a line-by-line parser that their system was too unsafe for operation.

There's also other questions like:

How can we improve response times when things are routed poorly? Time to mitigate hijacks is improved by the majority of providers doing RPKI OV, but inter-provider response time scales are much longer. I also think about the two big CTL long-haul and routing issues last year. How can you mitigate these externalities?

- Jared

Steering dangerously off-topic from this thread, we have so far had
more operational and availability issues from RPKI than from hijacks.
And it is a bit more embarrassing to say 'we cocked up' than to say
'someone leaked to internet, it be like it do'.

Very often the corrective and preventive actions appear to be
different versions and wordings of 'don't make mistakes', in this case:

- Reviewing and improving input safety checks for mapping components
- Validate and strengthen the safety checks for the configuration
deployment zoning process

It doesn't seem like a tenable solution when the solution is 'do
better', since I'm sure whoever did those checks did their best in the
first place. So we must assume there are fundamental limits to what
'do better' can achieve, and that we have a similar level of outage
potential in all the work we've produced and continue to produce, over
which we exert very little control.

I think the mean-time-to-repair actions described are more actionable
than the 'do better' ones. However, Akamai already solved this very
fast, and it may not be reasonable to expect big improvements on a
one-hour fault-to-resolution time for a big organisation with a
complex product.

One thing that comes to mind is: what if Akamai assumes they cannot
reasonably make it fail less often and they can't fix it faster? Is
this particular product/solution such that having entirely
independent A and B sides, to which clients fail over, is not
possible? If it was a DNS problem, it seems like it might have been
possible to have side A fail entirely, with clients automatically
falling back to B, perhaps adding some latency but also allowing the
system to automatically detect that A and B are performing at an
unacceptable delta.

formal verification

Are you speaking globally, or for NTT?

Mark.

Doesn't matter. And I'm not trying to say RPKI is a bad thing. I like
that we have a good AS:origin mapping that is verifiable and machine
readable; that part of the solution will be needed for many
applications which intend to improve the Internet by some metric.
And of course adding any complexity will have some teething problems,
particularly if the problem it attempts to address occurs
infrequently, so it would be naive not to expect an increased rate of
outages while maturing it.

Yes, while RPKI fixes problems that genuinely occur infrequently, it's intended to work very well when those problems do occur, especially intentional hijacks, because when they do occur, they disrupt quite a large part of the Internet, even if only for a few minutes or a couple of hours. So from that standpoint, RPKI does add value.

Where I do agree with you is that we should restrain ourselves from applying RPKI to use-cases that are non-core to its reasons for existence, e.g., AS0.

I can count, on my hands, the number of RPKI-related outages that we have experienced, and all of them have turned out to be a misunderstanding of how ROAs work, either by customers or some other network on the Internet. The good news is that all of those cases were resolved within a few hours of notifying the affected party.

Mark.

Hello,

I can count, on my hands, the number of RPKI-related outages that we
have experienced, and all of them have turned out to be a
misunderstanding of how ROAs work, either by customers or some other
network on the Internet. The good news is that all of those cases were
resolved within a few hours of notifying the affected party.

That's good, but the understanding of operational issues in the RPKI
systems in the wild is underwhelming; we are bound to make the same
mistakes as with DNS all over again.

Yes, a complete failure of an RTR server theoretically does not have
big negative effects on networks. But a failure of RPKI validation
behind a separate RTR server can lead to outdated VRPs on the routers,
just as RTR server bugs will, which is why monitoring not only for
availability but also for whether the data is actually up to date is
*very* necessary.
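As a purely illustrative sketch of such monitoring (Nagios-style exit codes; the export path and JSON layout are assumptions, not any particular validator's actual format):

```python
#!/usr/bin/env python3
# Nagios-style freshness check: warn or go critical when the exported VRP
# set has not been rebuilt recently.  The file path and JSON layout below
# are assumptions for illustration; adapt them to whatever your validator
# actually writes.

import json
import sys
import time

OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3

VRP_EXPORT = "/var/cache/rpki/output.json"   # hypothetical export path
WARN_AGE = 2 * 3600                          # warn after 2 hours without a rebuild
CRIT_AGE = 6 * 3600                          # critical after 6 hours

def main() -> int:
    try:
        with open(VRP_EXPORT) as fh:
            data = json.load(fh)
        built = int(data["metadata"]["generated"])   # assumed epoch-timestamp field
    except (OSError, KeyError, ValueError) as exc:
        print(f"UNKNOWN: cannot read VRP export: {exc}")
        return UNKNOWN

    age = int(time.time()) - built
    if age >= CRIT_AGE:
        print(f"CRITICAL: VRP export is {age}s old")
        return CRITICAL
    if age >= WARN_AGE:
        print(f"WARNING: VRP export is {age}s old")
        return WARNING
    print(f"OK: VRP export is {age}s old")
    return OK

if __name__ == "__main__":
    sys.exit(main())
```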

Here are some examples (both from an operator's POV and actual failure scenarios):

https://mailman.nanog.org/pipermail/nanog/2020-August/208982.html

we are at fault for not deploying the validation service in a redundant
setup and for failing at monitoring the service. But we did so because
we thought it not to be too important, because a failed validation
service should simply lead to no validation, not a crashed router.

In this case an RTR client bug crashed the router. But the point is
that it is not widely understood that setting up RPKI validators and
RTR servers is a serious endeavor and that monitoring them is not
optional.

we noticed that one of the ROAs was wrong. When I pulled output.json
from octorpki (/output.json), it had the correct value. However, when
I ran rtrdump, it had a different ASN value for the prefix. Restarting
the gortr process did fix it. Sending SIGHUP did not.

yesterday we saw an unexpected ROA propagation delay.

After updating a ROA in the RIPE lirportal, NTT, Telia and Cogent
saw the update within an hour, but a specific rpki validator
3.1-2020.08.06.14.39 in a third party network did not converge
for more than 4 hours.

I wrote a naive nagios script to check for stalled serials on an RTR server:

and talked about it in this blog post (shameless plug):

This is on the validation/network side. On the CA side, similar issues apply.

I believe it will still take a few high-profile outages caused by
insufficient reliability in the RPKI stacks before people start taking
this seriously.

Some specific failure scenarios are currently being addressed, but
this doesn't make monitoring optional:

rpki-client 7.1 emits a new per-VRP attribute, expires, which makes it
possible for RTR servers to stop considering outdated VRPs:

stayrtr (a gortr fork) will consider this attribute in the future:
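Not from the material referenced above, but as a rough sketch of what consuming such a per-VRP expires attribute could look like (the JSON shape is an assumption loosely modelled on validator VRP exports, not a spec):

```python
# Sketch of dropping outdated VRPs using a per-VRP "expires" attribute.
# The JSON shape is an assumption loosely modelled on validator VRP exports;
# check your validator's actual output format before relying on this.

import json
import time

sample = json.loads("""
{
  "roas": [
    {"prefix": "192.0.2.0/24",    "maxLength": 24, "asn": 64500, "expires": 9999999999},
    {"prefix": "198.51.100.0/24", "maxLength": 24, "asn": 64501, "expires": 1000000000}
  ]
}
""")

def fresh_vrps(vrps, now=None):
    """Keep only VRPs whose expires timestamp lies in the future."""
    now = int(time.time()) if now is None else now
    return [v for v in vrps if v.get("expires", 0) > now]

if __name__ == "__main__":
    kept = fresh_vrps(sample["roas"])
    print(f"kept {len(kept)} of {len(sample['roas'])} VRPs")
    for v in kept:
        print(v["prefix"], "AS", v["asn"])
```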

cheers,
lukas