BGP failure analysis and recommendations

JRC_NOC · October 24, 2013, 2:40am

Hello Nanog -

On Saturday, October 19th at about 13:00 UTC we experienced an IP failure at one of our sites in the New York area.
It was apparently a widespread outage on the East coast, but I haven't seen it discussed here.

We are multihomed, using EBGP to three (diverse) upstream providers. One provider experienced a hardware failure in a core component at one POP.
Regrettably, during the outage our BGP session remained active and we continued receiving full routes from the affected AS, and our prefixes continued to be advertised at their border. However, essentially none of the traffic between those routes and our prefixes over that provider was delivered. The bogus routes stayed up for hours. We shut down the BGP peering session once the nature of the problem became clear, and this was effective. I believe that all customer BGP routes were similarly affected, including those belonging to some large regional networks and corporations. I have raised the questions below with the provider but haven't received any information or advice.

My question is: why did our BGP configuration fail? I'm guessing the basic answer is that the IGP and route reflectors within that provider were still connected, but the forwarding paths were unavailable. My BGP session basically acted like a bunch of static routes, with no awareness of the failure(s) and no dynamic reconfiguration of the RIB.

Is this just an unavoidable issue with scaling large networks?
Is it perhaps a known side effect of MPLS?
Have we/they lost something important in the changeover to converged multiprotocol networks?
Is there a better way for us edge networks to achieve IP resiliency in the current environment?

This is an operational issue. Thanks in advance for any hints about what happened or better practices to reduce the impact of a routine hardware fault in an upstream network.

- Eric Jensen

Christopher_Morrow · October 24, 2013, 3:06am

Is this just an unavoidable issue with scaling large networks?

nope... sounds like (to me at least) the forwarding plane and control
plane are non-congruent in your provider's network so as you said,
if the forwarding-plane is dorked up between you and 'the rest of
their network', but the edge device you are connected to thinks
next-hops for routes are still valid... oops

Is it perhaps a known side effect of MPLS?

nope.

Have we/they lost something important in the changeover to converged
multiprotocol networks?
Is there a better way for us edge networks to achieve IP resiliency in the
current environment?

sadly I bet not, aside from active probing and disabling paths that
are non-functional.

Brandon_Ross3 · October 24, 2013, 7:07am

Um, how about, don't buy services from network providers that fail in this way?

Since we're not naming names, I won't, but in the past there's been at least one provider that used multi-hop eBGP at their edges because they didn't want to invest in edge gear that could handle a full BGP table. My concern with their network (beyond many other concerns) was that when that router in the middle had a soft failure, how would BGP know to route around it? Answer: it wouldn't, you'd black hole.

On the opposite side of the spectrum, there was at least one provider that used custom software to actively probe their upstream providers and route around poor performance. At one time, there was also software, hardware and services that you could install/run on your own network to try to detect these things as well, however I'm not sure how many of them are still on the market.

The bottom line, however, is don't buy services from companies that do a poor job of running their network unless you can accept these kinds of failures.

Sam_Roche · October 24, 2013, 12:10pm

We had a similar issue happen and modified our BGP peering to use one BGP session per provider, as we had multiple neighbours for one of our peers.

It seems to have resolved this particular issue for us.

I would love to hear how others are actively probing their peers' networks using an NMS to verify connectivity.

Sam Roche - Supervisor of Network Operations - Lakeland Networks

Christopher_Morrow · October 24, 2013, 1:21pm

I suppose the question is: "how would you know that any particular
network had this failure mode?"

until, of course, you run into it... as jrc did...

Brandon_Ross3 · October 24, 2013, 4:00pm

Um, how about, don't buy services from network providers that fail in this
way?

I suppose the question is: "how would you know that any particular
network had this failure mode?"

Ask detailed questions about how their network is architected. Do they use eBGP multihop anywhere? Do they use BFD on internal Ethernet links? Do they put their peering links in their IGP, or directly into iBGP?

until, of course, you run into it... as jrc did...

That too.

Pete_Lumbis · October 25, 2013, 6:01pm

As a member of the support team for a vendor, I'll say this problem isn't
entirely unheard of. The CPU is in charge of local traffic and the BGP
session and some sort of hardware chip or ASIC is in charge of moving
packets through the device. If the hardware is misprogrammed it won't
properly forward traffic while BGP thinks it's doing its job. This is not
to be confused with a hardware failure. This is purely a software problem.
The software is responsible for telling the hardware what to do, and
sometimes there are bugs there, like there are bugs in all code.

The easiest way to test this kind of issue is to have some other control
plane that is tied to the data plane. That is, the only way to make sure
that the peer is forwarding traffic is to make it forward traffic and react
when it fails. You could do something like set up IP SLA (e.g., ping) to
something in that SP network. If the ping fails then it sounds like your
peer may have a forwarding issue and you can apply a policy to remove or at
least not prefer that peer (in case it's a false positive).
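
As a rough illustration of the probe-and-react idea above, here is a minimal Python sketch. Everything in it is an assumption for illustration: the names `PathMonitor`, `probe_round` and `ping_probe`, the thresholds, and the use of Linux `ping -c 1 -W 1`. It does not force probes out through a particular upstream (that needs source-address or policy routing on your side), and it does not touch BGP policy; it only decides when a path should be treated as unhealthy. Requiring several consecutive failed rounds, and several good rounds before restoring, is one way to limit the false positives mentioned above.

```python
import subprocess
from dataclasses import dataclass
from typing import Callable, Sequence


def ping_probe(target: str) -> bool:
    """Send one ICMP echo; True if a reply arrived.

    Assumes a Linux-style ping where -c is the count and -W the timeout
    in seconds. Flags differ on other platforms.
    """
    result = subprocess.run(
        ["ping", "-c", "1", "-W", "1", target],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def probe_round(targets: Sequence[str],
                probe: Callable[[str], bool],
                min_fraction: float = 0.5) -> bool:
    """One round is good if at least min_fraction of targets answered."""
    if not targets:
        raise ValueError("need at least one probe target")
    answered = sum(1 for t in targets if probe(t))
    return answered / len(targets) >= min_fraction


@dataclass
class PathMonitor:
    """Tracks data-plane health of one upstream (hypothetical helper)."""
    name: str
    fail_threshold: int = 3      # consecutive bad rounds before de-preferring
    recover_threshold: int = 5   # consecutive good rounds before restoring
    healthy: bool = True
    _streak: int = 0             # rounds in a row that contradict `healthy`

    def record(self, ok: bool) -> bool:
        """Feed one round result; return True if the state flipped."""
        if ok == self.healthy:
            self._streak = 0
            return False
        self._streak += 1
        limit = self.recover_threshold if ok else self.fail_threshold
        if self._streak >= limit:
            self.healthy = ok
            self._streak = 0
            return True
        return False
```

In use, a scheduler would call something like `monitor.record(probe_round(targets, ping_probe))` every few seconds per upstream, and hook a state flip to whatever mechanism lowers local preference for, or shuts down, that session. The targets should be addresses inside or beyond the provider's network, not just the directly connected peer, since the peer itself kept answering in the failure described in this thread.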

-Pete

Rajiv_Asati_rajiva · November 4, 2013, 5:31pm

The problem described in this thread was the motivation for this BGP improvement:

http://tools.ietf.org/html/draft-ietf-idr-bgp-bestpath-selection-criteria