L3 East coast maint / fiber 05FEB2012 maintenance

I know a lot of you are out of the office right now, but does anybody have
any info on what happened with L3 this morning? They went into a 5 hour
maintenance window with expected downtime of about 30 minutes while they
upgraded something like *40* of their "core routers" (their words), but
also did this during some fiber work and completely cut off several of
their east coast peers for the entirety of the 5 hour window.

If anybody has any more info on this, or a NOC contact for them on the East
Coast for future issues, you can hit me up off-list if you don't feel
comfortable replying with that info here.

Thanks, and I hope you guys are enjoying Orlando.

We saw the same thing out of their Tampa location; there was
a brief drop around 2am EST and a more severe one around
4:05 AM which lasted about 10 minutes for us. Unfortunately,
whatever they did, they did it in a way that our BGP sessions
stayed up, so we couldn't react until bgpmon alerted me about
some route withdrawals, but by that time things were back to
normal and remained stable.
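
For anyone who wants an earlier heads-up than a bgpmon alert next time, here is a rough sketch of the kind of check that catches this: poll the edge router, pull the prefix count accepted from the transit session, and complain if it falls below a floor even while the session stays Established. The hostname, credentials, peer address, and threshold below are all placeholders, and netmiko is just one convenient way to run the CLI command.

# Rough sketch, not production code: alert when the prefix count accepted
# from a transit peer drops below a floor, even though the BGP session
# itself stays established. Hostname, credentials, peer address, and the
# threshold are illustrative placeholders.
import re
import sys

from netmiko import ConnectHandler  # third-party: pip install netmiko

DEVICE = {
    "device_type": "juniper_junos",
    "host": "edge1.example.net",     # placeholder router
    "username": "monitor",
    "password": "changeme",
}
PEER = "192.0.2.1"                   # placeholder transit session address
MIN_ACCEPTED = 350_000               # warn if accepted prefixes fall below this

def accepted_prefixes(summary: str, peer: str) -> int:
    """Pull the accepted-prefix count for `peer` out of 'show bgp summary'.

    Junos prints Active/Received/Accepted/Damped per peer; we take the
    third field of that slash-separated group.
    """
    for line in summary.splitlines():
        if line.startswith(peer):
            match = re.search(r"(\d+)/(\d+)/(\d+)/\d+", line)
            if match:
                return int(match.group(3))
    raise ValueError(f"peer {peer} not found in 'show bgp summary' output")

def main() -> int:
    conn = ConnectHandler(**DEVICE)
    try:
        output = conn.send_command("show bgp summary")
    finally:
        conn.disconnect()
    accepted = accepted_prefixes(output, PEER)
    if accepted < MIN_ACCEPTED:
        print(f"WARNING: only {accepted} prefixes accepted from {PEER}")
        return 1
    print(f"OK: {accepted} prefixes accepted from {PEER}")
    return 0

if __name__ == "__main__":
    sys.exit(main())

Stick that in cron every few minutes and a shrinking table shows up even while the session itself still looks healthy.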

We also saw an outage caused by the L3 maintenance. We were not even
notified about the maintenance itself.

We also noticed black-holing in their network.

-Thanks,
Viral

We're a Level3 customer in Orlando. Our BGP sessions stayed up, but the number of routes received from Level3 fell to only a few tens of thousands at about 4:10am, and gradually returned to normal numbers by about 4:35am.

I got notification of their maintenance window, albeit with < 24 hours notice. Notice came in at 11:00GMT-5 yesterday, maintenance was scheduled for 00:00GMT-5 this morning.

That said, the notice said that the maintenance was in Phoenix but I got a notice about my IPT circuit at 60 Hudson which I found confusing.

Based on my logs, our BGP session with them went down at 03:06GMT-5 and back up at 03:15GMT-5. Down again at 03:37GMT-5 until 04:20GMT-5. A third time at 06:41GMT-5 and back at 06:45GMT-5.

Traffic graphs tell a bit of a different story. Just before 05:00GMT-5, our outbound traffic to Level 3 dropped substantially. About that time, I started getting reports of issues reaching Level 3 destinations. Traces seemed to indicate a black-hole condition within Level 3's network in NYC, seemingly at, or just past, csw3.NewYork1.Level3.net. Stuff seemed to correct itself by about 06:45GMT-5, but with Level 3 sending only about 180k routes. About 20 minutes later, the table was back to ~431k and all's been fine since.
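
If anyone wants to keep an eye on where paths are dying during an event like this, scripted traceroutes are enough to spot the hop where things black-hole. A rough sketch along those lines, which just shells out to the stock traceroute binary (the destinations below are documentation addresses, so substitute prefixes that actually sit behind Level 3):

# Rough sketch: run the system traceroute against a few destinations and
# report the last hop that answered, to spot where a path appears to
# black-hole. The destinations below are documentation addresses only.
import subprocess

DESTINATIONS = ["192.0.2.10", "198.51.100.20", "203.0.113.30"]  # placeholders

def last_responding_hop(dest: str, max_hops: int = 20) -> str:
    """Return the last traceroute hop line that actually replied."""
    proc = subprocess.run(
        ["traceroute", "-n", "-w", "2", "-q", "1", "-m", str(max_hops), dest],
        capture_output=True,
        text=True,
    )
    last = "no hops answered"
    for line in proc.stdout.splitlines():
        hop = line.strip()
        if not hop or not hop[0].isdigit():
            continue                # skip any header line
        if "*" not in hop:          # this hop sent a reply
            last = hop
    return last

for dest in DESTINATIONS:
    print(f"{dest}: last reply from -> {last_responding_hop(dest)}")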

My hunch is that this is fallout and repairs from Juniper PR839412.
Only fix is an upgrade. Not sure why they're not able to do a hitless
upgrade though; that's unfortunate.

Specially-crafted TCP packets that can get past RE/loopback filters
can crash the box.

--j

Workaround is proper filtering and other techniques on the RE/Loopback to
prevent the issue from happening.

Should an upgrade be performed? Yes, but it certainly doesn't have to happen
right away or without notice to customers.

Agreed. However, if it only takes one packet, what if an attacker
sources the traffic from your management address space?

Guarding against this requires either a separate VRF/table for
management or transit traffic, RPF checking, or TTL security.
If these weren't set up ahead of time, maybe it would be easier to
upgrade than to lab, test, and deploy a new configuration.
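
For anyone not familiar with the TTL security piece: the mechanism (GTSM, RFC 5082) is that the peer sends with TTL 255 and the receiver drops anything arriving with a lower TTL than a directly connected neighbor could produce, so a packet spoofed from several hops away never reaches the control plane. A toy socket-level illustration of the idea, assuming a Linux host (the port number is arbitrary and IP_MINTTL may not be exposed as a constant by older Pythons); on a real router this is a knob or a TTL match in the loopback filter rather than application code:

# Toy illustration of GTSM / TTL security (RFC 5082), not router config:
# the listener only accepts TCP packets that still carry TTL 255, i.e.
# from a directly connected sender; anything that crossed a routed hop
# arrives with a lower TTL and is dropped in the kernel before it can
# touch the application. Assumes Linux; port number is arbitrary.
import socket

IP_MINTTL = getattr(socket, "IP_MINTTL", 21)  # 21 = IP_MINTTL on Linux

def gtsm_listener(port: int = 17900) -> socket.socket:
    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    # Require incoming packets to still have TTL 255 -> zero routed hops.
    srv.setsockopt(socket.IPPROTO_IP, IP_MINTTL, 255)
    srv.bind(("0.0.0.0", port))
    srv.listen(5)
    return srv

def gtsm_client(host: str, port: int = 17900) -> socket.socket:
    cli = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Send with the maximum TTL so the far end's minimum-TTL check passes.
    cli.setsockopt(socket.IPPROTO_IP, socket.IP_TTL, 255)
    cli.connect((host, port))
    return cli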

This is all speculation about Level3 on my part; I don't know their
network from an internal perspective.

--j

Agree as well.

Bad assumption on my part that Level3 would already be doing the items
listed in the workaround.

Routers that show up on exchange fabrics are a particular problem...

For this issue...

For what it's worth, we have several dzone circuits with them, from 100 Mb/s office links to 10 Gb/s paths, and we have notifications for maintenance last night and tonight touching locations in Europe and on the US east and west coasts. I'm presuming that there is further internal work that is not directly impactful.

I have evidence of various other providers, as well as ourselves, undertaking fixes for this issue.