CenturyLink RCA?

Apologies for the URL, I do not know of an official source and I do not
share the URL's sentiment.
https://fuckingcenturylink.com/

Can someone translate this for an IP engineer? What actually happened?

From my own history, I rarely recognise the problem I fixed from
reading the public RCA. I hope CenturyLink will do better.

The best guess that I've heard so far is:

a) CenturyLink runs a global L2 DCN/OOB
b) there was a HW fault which caused an L2 loop (perhaps the HW dropped
BPDUs; I've had this failure mode)
c) the DCN had direct access to the control-plane, and the L2 storm
congested control-plane resources, causing it to deprovision waves
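
To put rough numbers on why (b) plus (c) escalates so fast, here is a
minimal sketch in Python. Every figure in it is an illustrative assumption,
not anything from CenturyLink: once a device stops processing BPDUs the loop
unblocks, and every broadcast on the shared domain gets re-flooded forever.

# Back-of-envelope model of a broadcast storm on a looped L2 DCN.
# All numbers are illustrative assumptions, not measurements.

LINK_PPS_CAP = 1_488_095   # ~line rate for 64-byte frames on 1GE
CPU_PUNT_BUDGET = 5_000    # frames/sec a management CPU tolerates (assumed)
FANOUT = 2                 # redundant paths per switch in the loop
frames = 1                 # one broadcast (e.g. an unfiltered BPDU) starts it

for cycle in range(1, 21):
    # With no STP blocking a port, each broadcast is re-flooded out the
    # redundant path(s) every forwarding cycle, so the population multiplies.
    frames = min(frames * FANOUT, LINK_PPS_CAP)
    status = "CPU overwhelmed" if frames > CPU_PUNT_BUDGET else "ok"
    print(f"cycle {cycle:2d}: ~{frames:>9,} frames/interval ({status})")
    if frames == LINK_PPS_CAP:
        print("links saturated; every box on the shared broadcast domain "
              "now punts line-rate junk to its management CPU")
        break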

Now of course this is entirely speculation, but it is intended to show what
type of explanation is acceptable and can be used to fix things.
Hopefully CenturyLink does come out with an IP-engineering-readable
explanation, so that we may use it as leverage to support work in our
own domains to remove such risks.

a) do not run an L2 DCN/OOB
b) do not connect MGMT ETH (it is unprotected access to the control-plane;
it cannot be protected by CoPP/lo0 filters/LPTS etc.)
c) do add an RFP scoring item for a proper OOB port (like Cisco CMP)
d) do fail the optical network up
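
On (b), the difference is easy to show: punts from revenue ports sit behind
CoPP/lo0 filters/LPTS-style policers, while frames arriving on a dedicated
MGMT ETH on many platforms reach the control-plane with nothing in front of
them. A minimal token-bucket sketch of what that policer buys you; all rates
here are made up for illustration:

# Minimal token-bucket policer, the same idea CoPP / lo0 filters / LPTS
# implement in front of the control-plane. All rates are made up.

class TokenBucket:
    def __init__(self, rate_pps, burst):
        self.rate, self.burst, self.tokens = rate_pps, burst, burst

    def tick(self, seconds):
        self.tokens = min(self.burst, self.tokens + self.rate * seconds)

    def allow(self):
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

def delivered_pps(policer, storm_pps, seconds=2):
    """Frames per second that actually reach the CPU during a storm."""
    delivered = 0
    for _ in range(storm_pps * seconds):
        if policer is None or policer.allow():
            delivered += 1
        if policer is not None:
            policer.tick(1.0 / storm_pps)
    return delivered // seconds

STORM_PPS = 200_000   # broadcast-storm arrival rate on the wire (assumed)
print("policed punt path   :", delivered_pps(TokenBucket(1_000, 2_000), STORM_PPS), "pps reach the CPU")
print("unprotected MGMT ETH:", delivered_pps(None, STORM_PPS), "pps reach the CPU")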

It’s technical enough so that laypeople immediately lose interest, yet completely useless to anyone who works with this stuff.

Technical obscurity… managed perception.

One thing that is troubling when reading that URL is that it appears several steps of restoration required teams to go onsite for local login, etc. Granted, to troubleshoot hardware you need to be physically present to pop a line card in and out, but CTL/LVL3 should have full out-of-band console and power control to all core devices; we shouldn’t be waiting for someone to drive to a location to get console or do power cycling. And I would imagine the first step in a lot of the troubleshooting was power cycling and pulling local console logs.
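
One way to avoid finding this out mid-outage is to continuously verify that
every core device's console server and switched PDU answer over the OOB path.
A trivial sketch; the hostnames are placeholders, not anyone's real inventory:

# Periodically verify out-of-band console servers and switched PDUs are
# reachable over the OOB path, so you learn about a dead OOB before the
# outage that makes you drive to site. Hostnames are placeholders.

import socket

OOB_TARGETS = [
    ("console-den-core1.example.net", 22),   # console server SSH
    ("pdu-den-core1.example.net", 22),       # switched PDU management
    ("console-chi-core1.example.net", 22),
]

def reachable(host, port, timeout=5.0):
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

for host, port in OOB_TARGETS:
    state = "OK" if reachable(host, port) else "UNREACHABLE - dispatch needed"
    print(f"{host}:{port} {state}")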

-John

Hey John,

Your criticism is warranted, but it would also be addressed by the
explanation that the DCN/OOB was the source of the problem.

At any rate, I am looking forward to stopping the speculation and
reading a post-mortem written by someone who knows how networks work.

There’s a Reddit user claiming he works at CL who says the cause was some faulty Infinera DTN-X instances.

https://www.reddit.com/r/centurylink/comments/aa2qa4/comment/ecovgab

(dunno why the user posted that to Reddit and not here, though)

30 Dec. 2018, 20:19 Saku Ytti <saku@ytti.fi>:

Not buying this explanation, for a number of reasons:

1. Are you telling me that several line cards failed in multiple cities in the same way at the same time? Don't think so unless the same software fault was propagated to all of them. If the problem was that they needed to be reset, couldn't that be accomplished by simply reseating them?

2. Do we believe that an OOB management card was able to generate so much traffic as to bring down the optical switching? Very doubtful which means that the systems were actually broken due to trying to PROCESS the "invalid frames". Seems like very poor control plane management if the system is attempting to process invalid data and bringing down the forwarding plane.

3. In the cited document it was stated that the offending packet did not have source or destination information. If so, how did it get propagated throughout the network?

My guess at the time and my current opinion (which has no real factual basis, just years of experience) is that a bad software package was propagated through their network.

Steven Naslund
Chicago IL

Hey Steve,

I will continue to speculate, as that's all we have.

1. Are you telling me that several line cards failed in multiple cities in the same way at the same time? Don't think so unless the same software fault was propagated to all of them. If the problem was that they needed to be reset, couldn't that be accomplished by simply reseating them?

L2 DCN/OOB, the whole network shares a single broadcast domain

2. Do we believe that an OOB management card was able to generate so much traffic as to bring down the optical switching? Very doubtful which means that the systems were actually broken due to trying to PROCESS the "invalid frames". Seems like very poor control plane management if the system is attempting to process invalid data and bringing down the forwarding plane.

L2 loop. You will kill your JNPR/CSCO with enough trash on MGMT ETH.
However, it can be argued that an optical network should fail up in the
absence of a control-plane, while an IP network has to fail down.

3. In the cited document it was stated that the offending packet did not have source or destination information. If so, how did it get propagated throughout the network?

BPDU
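
Which would square with the "no source or destination information" wording:
an 802.1D BPDU is a pure L2 frame sent to the bridge-group multicast MAC over
LLC, with no IP header at all. A rough byte-layout sketch in Python; the
field values are illustrative defaults, not the actual offending frame:

# Byte layout of a classic 802.1D configuration BPDU. There is no IP header
# anywhere: the only "addressing" is the L2 multicast destination.
# Field values are illustrative defaults, not the actual offending frame.

import struct

DST_MAC = bytes.fromhex("0180c2000000")   # STP bridge-group multicast address
SRC_MAC = bytes.fromhex("0200deadbeef")   # placeholder bridge port MAC
LLC     = bytes([0x42, 0x42, 0x03])       # DSAP/SSAP/Control used by STP

bridge_id = struct.pack("!H6s", 0x8000, SRC_MAC)   # priority + bridge MAC
bpdu = struct.pack(
    "!HBBB8sI8sHHHHH",
    0x0000,       # protocol identifier
    0x00,         # protocol version (802.1D)
    0x00,         # BPDU type: configuration
    0x00,         # flags
    bridge_id,    # root bridge ID
    0,            # root path cost
    bridge_id,    # sender bridge ID
    0x8001,       # port ID
    0,            # message age   (timers are in 1/256 s units)
    20 * 256,     # max age = 20 s
    2 * 256,      # hello time = 2 s
    15 * 256,     # forward delay = 15 s
)

frame = DST_MAC + SRC_MAC + struct.pack("!H", len(LLC) + len(bpdu)) + LLC + bpdu
print(f"{len(bpdu)}-byte BPDU in a {len(frame)}-byte frame, no L3 header anywhere")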

My guess at the time and my current opinion (which has no real factual basis, just years of experience) is that a bad software package was propagated through their network.

Lots of possible reasons. I choose to believe that what they've communicated
is what the writer of the communication thought happened, but as they
are likely not an SME, it's broken-radio communication. A BCAST storm
on the L2 DCN would plausibly fit the very ambiguous reason offered and is
something people actually do.

(Forgive my top posting, not on my desktop as I’m out of town)

Wild guess, based on my own experience as a NOC admin/head of operations at a large ISP - they have an automated deployment system for new firmware for a (mission critical) piece of backbone hardware.

They may have tested said firmware on a chassis with cards that did not exactly match the hardware they had in actual deployment (i.e. the card was an older hw revision in the deployed hardware), and while it worked fine there, it proceeded to shit the bed in production.

Or they missed a mandatory low-level hardware firmware upgrade that has to be applied separately before the main upgrade.

Kinda picturing in my mind that they staged all the updates, set a timer for a staggered reboot, and after the first one hit the fan, they couldn’t stop the rest; it fell apart as each upgraded unit fell on its own sword on reboot.

I’ve been bit by the ‘this card revision is not supported under this platform/release’ bug more often than I’d like to admit.
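
If the staggered-upgrade theory is right, the missing piece would be an abort
gate between waves. A hedged sketch of that kind of gate; the device names and
the health check are hypothetical placeholders:

# Staged rollout gate: upgrade one wave, verify it came back healthy, and
# refuse to touch the next wave otherwise. Names and checks are placeholders.

import time

WAVES = [
    ["den-dwdm-01", "den-dwdm-02"],                   # canary site first
    ["chi-dwdm-01", "chi-dwdm-02", "dal-dwdm-01"],
    ["lax-dwdm-01", "sea-dwdm-01", "nyc-dwdm-01"],
]

def upgrade(device):
    print(f"pushing firmware to {device} and rebooting")

def healthy(device):
    # Placeholder: would poll management reachability, alarms, traffic levels.
    return True

def rollout(waves, soak_seconds):
    for n, wave in enumerate(waves, 1):
        for device in wave:
            upgrade(device)
        time.sleep(soak_seconds)          # let the wave soak before judging it
        failed = [d for d in wave if not healthy(d)]
        if failed:
            raise SystemExit(f"wave {n} unhealthy ({failed}); halting rollout, "
                             "remaining waves untouched")
        print(f"wave {n} healthy, proceeding")

rollout(WAVES, soak_seconds=1)   # short soak for the example; real life wants much longer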

And, yes, my eyes did start to get glossy and hazy the more I read their explanation as well. It’s exactly the kind of useless post I’d write when I want to get (stupid) people off my back about a problem.

This seems entirely plausible, given that DWDM amplifiers and lasers are a complex analog system and need the OOB to align.

See my comments in line.

Steve

Hey Steve,

I will continue to speculate, as that's all we have.

1. Are you telling me that several line cards failed in multiple cities in the same way at the same time? Don't think so unless the same software fault was propagated to all of them. If the problem was that they needed to be reset, couldn't that be accomplished by simply reseating them?

L2 DCN/OOB, the whole network shares a single broadcast domain.

Bad design if that’s the case, that would be a huge subnet. However, even if that was the case, you would not need to replace hardware in multiple places. You might have to reset it but not replace it. Also, being an ILEC, it seems hard to believe how long their dispatches to their own central offices took. It might have taken a while to locate the original problem, but they should have been able to send a corrective procedure to CO personnel who are a lot closer to the equipment. In my region (Northern Illinois) we can typically get access to a CO in under 30 minutes, 24/7. They are essentially smart-hands technicians who can reseat or replace line cards.

2. Do we believe that an OOB management card was able to generate so much traffic as to bring down the optical switching? Very doubtful which means that the systems were actually broken due to trying to PROCESS the "invalid frames". Seems like very poor control plane management if the system is attempting to process invalid data and bringing down the forwarding plane.

L2 loop. You will kill your JNPR/CSCO with enough trash on MGMT ETH.
However, it can be argued that an optical network should fail up in the absence of a control-plane, while an IP network has to fail down.

Most of the optical muxes I have worked with will run without any management card or control plane at all. Usually the line cards keep forwarding according to the existing configuration even in the absence of all management functions. It would help if we knew what gear this was. True optical muxes do not require much care and feeding once they have a configuration loaded. If they are truly dependent on that control plane, then it needs to be redundant enough, with watchdogs to reset them if they become non-responsive, and they need policers and rate limiters on their interfaces. Seems they would be vulnerable to a DoS if a bad BPDU can wipe them out.
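
The watchdog idea is easy to state precisely: keep forwarding on the
last-known-good configuration no matter what, and have an independent
supervisor reset only the management complex after repeated missed
heartbeats. A toy sketch; the names and thresholds are invented:

# Toy supervision loop: the forwarding plane is never touched, while a
# watchdog resets only the management complex after repeated missed
# heartbeats. All names and thresholds are invented for illustration.

import time

MISSED_LIMIT = 3        # consecutive missed heartbeats before a reset
HEARTBEAT_SECS = 10

def mgmt_heartbeat_ok():
    # Placeholder: would ping the controller card / poll its health register.
    return True

def reset_mgmt_complex():
    print("resetting management complex; forwarding plane untouched")

def watchdog():
    missed = 0
    while True:
        if mgmt_heartbeat_ok():
            missed = 0
        else:
            missed += 1
            print(f"missed heartbeat ({missed}/{MISSED_LIMIT})")
            if missed >= MISSED_LIMIT:
                reset_mgmt_complex()
                missed = 0
        time.sleep(HEARTBEAT_SECS)

# watchdog()  # would run forever alongside, never instead of, forwarding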

3. In the cited document it was stated that the offending packet did not have source or destination information. If so, how did it get propagated throughout the network?

BPDU

Maybe, but it would be strange for it to be invalid yet valid enough to keep being forwarded. In any case, loss of the management network should not interrupt forwarding. I also would not be happy with an optical network that relies on spanning tree to remain operational.

My guess at the time and my current opinion (which has no real factual basis, just years of experience) is that a bad software package was propagated through their network.

Lots of possible reasons. I choose to believe that what they've communicated is what the writer of the communication thought happened, but as they are likely not an SME, it's broken-radio communication. A BCAST storm on the L2 DCN would plausibly fit the very ambiguous reason offered and is something people actually do.

My biggest problem with their explanation is the replacement of line cards in multiple cities. The only way that happens is when bad code gets pushed to them. If it took them that long to fix an L2 broadcast storm, something is seriously wrong with their engineering. Resetting the management interfaces should be sufficient once the offending line card is removed. That is why I think this was a software update failure or a configuration push. Either way, they should be jumping up and down on their vendor as to why this caused such large scale effects.

I agree 100%. Now they need to figure out why bricking the management network stopped forwarding on the optical side.

> (Forgive my top posting, not on my desktop as I’m out of town)

Steven Naslund
Chicago IL

They shouldn’t need OOB to operate existing lambdas, just to configure new ones. One possibility is that the management interface also handles master timing, which would be a really bad idea but possible (it should be redundant and it should be able to free-run for a reasonable amount of time). The main issue exposed is that the management interface is obviously critical and is not redundant enough. That is, if we believe the OOB explanation in the first place (which, by the way, is obviously not OOB since it wiped out the in-band network when it failed).

Steven Naslund

Chicago IL


A theory, and only a theory: in order to troubleshoot a much smaller problem (OOB/etc.), they deployed an optical configuration change that, faced with multiple unreachable nodes, ended up causing a significant inconsistency in their optical network, wreaking havoc on all sorts of other systems. With the OOB network already in chaos, card reseats were required to stabilize things on that network, and only then could they rebuild the optical network from a fully reachable state.

Again, only a theory.

-Dave

Yeah, it could have been one of those gone-from-bad-to-worse things like Dave mentioned… the initial problem and the course of action taken perhaps led to a worse problem.

I’ve had DWDM issues that have taken down multiple locations far apart from each other, due to how the transport guys hauled stuff.

A few years back I had about 15 routers all reboot suddenly… they were all far apart from each other; it turned out that one of the dual BGP sessions to the RR cluster flapped and all 15 routers crash-rebooted.

But ~50 hours of downtime!?

Aaron

It could have been worse:
  https://www.cio.com.au/article/65115/all_systems_down/

A note for the guys hanging on to those POTS lines… it won’t really help. One of our sites in Dubuque, Iowa had ten CenturyLink PRIs (they are the LEC there) homed off of a 5ESS switch. These all were unable to process calls during the CenturyLink problem. The ISDN messaging returned indicated that the CL phone switch had no routes. This tells me that either their inter-switch trunking or their SS7 network or both are being transported over the same optical network as the Internet services. So, even if your local line is POTS or traditional TDM, it won’t matter if all of their transport is dependent on the IP world.

Looking at the Reddit comments on the Infinera devices being a problem, that makes more sense, because that device blurs the line between optical mux and IP-enabled device with its Ethernet mapping functions. One advantage of the pure optical mux is that it does not need, care about, or understand L2 and L3 network protocols and is largely unaffected by those layers. Convergence in devices spanning more network layers exposes them to more potential bugs. Convergence can easily lead to more single points of failure, and the traffic capacity of these devices kind of encourages carriers to put more stuff in one basket than they traditionally did. I understand the motivation to build a single high-speed IP-centric backbone, but it makes everything dependent on that backbone.

Steven Naslund
Chicago IL

It could have been worse:
  https://www.cio.com.au/article/65115/all_systems_down/

"Make network changes only between 2am and 5am on weekends."

Wow. Just wow. I suppose the IT types are considerably different from Process Operations. Our rule is to only make changes scheduled at 09:00 (or no later than will permit a complete backout and restore by 15:00) Local Time on a "Full Staff" day that is not immediately preceded or followed by a reduced-staff day, holiday, or weekend-day.

It depends on your system architecture. If you've built your
redundancy well so that you have a continuously maintainable system
then you do the work during normal staffing and only when followed by
days when folks will be around to notice and fix any mistakes.

If you require a disruptive maintenance window then you schedule it
for minimum usage times instead.

Other conclusions from the article are dubious as well:

* Retire legacy network gear faster and create overall life cycle
management for networking gear.

Retire equipment when it ceases to be cost-effective, not merely
because it was manufactured too many years ago. Just don't forget to
factor risk into the cost.

* Document all changes, including keeping up-to-date physical and
logical network diagrams.

"Good intentions never work, you need good mechanisms to make anything
happen." - Jeff Bezos

Regards,
Bill Herrin

Bad design if that’s the case, that would be a huge subnet.

According to the notes at the URL Saku shared, they suffered a cascade
failure from which they needed the equipment vendor's help to recover.
That indicates at least two grave design errors:

1. Vendor monoculture is a single point of failure. Same equipment
running the same software triggers the same bug. It all kabooms at
once. Different vendors running different implementations have
compatibility issues but when one has a bug it's much less likely to
take down all the rest.

2. Failure to implement system boundaries. When you automate systems
it's important to restrict the reach of that automation. Whether it's
a regional boundary or independent backbones, a critical system like
this one should be structurally segmented so that malfunctioning
automation can bring down only one piece of it.
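
Point 2 can also be enforced mechanically rather than by policy document:
refuse any automated change whose target list crosses more than one failure
domain. A small sketch; the domain map and change IDs are hypothetical:

# Blast-radius guard: reject any automated change that touches more than one
# failure domain (region / backbone plane). All names are hypothetical.

from collections import defaultdict

FAILURE_DOMAIN = {
    "den-dwdm-01": "west", "lax-dwdm-01": "west",
    "chi-dwdm-01": "central", "dal-dwdm-01": "central",
    "nyc-dwdm-01": "east",
}

class BlastRadiusExceeded(Exception):
    pass

def authorize(change_id, targets, max_domains=1):
    by_domain = defaultdict(list)
    for target in targets:
        by_domain[FAILURE_DOMAIN.get(target, "unknown")].append(target)
    if len(by_domain) > max_domains:
        raise BlastRadiusExceeded(
            f"{change_id} spans {len(by_domain)} failure domains "
            f"{sorted(by_domain)}; split it into per-domain changes")
    print(f"{change_id} authorized for domain(s) {sorted(by_domain)}")

authorize("chg-0001", ["den-dwdm-01", "lax-dwdm-01"])   # one domain: allowed
try:
    authorize("chg-0002", ["den-dwdm-01", "chi-dwdm-01"])   # two domains
except BlastRadiusExceeded as err:
    print("rejected:", err)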

Regards,
Bill Herrin
