BGP Failover Question

I am looking for some help with an issue we had recently with one of our BGP peers. I currently have two DIA providers, each terminated into its own edge router, and I am running iBGP to exchange routes between the two edge routers. Last week Provider A made a policy change "somewhere" in their network in the middle of the day, causing traffic to stop routing. Of course this connection happens to be the preferred route for the majority of our inbound and outbound traffic. I never saw our physical link go down and never saw our peer drop, so BGP did not stop advertising routes, and this caused most of our customers' traffic to go nowhere. To fix the issue I had to manually shut down the peer until Provider A confirmed that the change they made had been reverted. This isn't the first time we have seen this issue with our various providers. How can I prevent issues like this from happening in the future?

---Chris

Chris,

The best way to resolve this issue is to not use a service provider who takes down your connectivity outside of maintenance windows, but I digress.

This is the nature of BGP. You send your providers routes for your network prefixes, and they send you routes to, say, the DFZ. When you forward packets to them because they sent you routes saying they can reach the destinations of your packets, what happens next is outside your control. It is now up to the peer to forward the packets as they said they would when they sent you those prefixes.

This is a trust relationship: you trust that they will forward your packets, because that is what you are paying them for.

- Brian J.

I would simply monitor PPS on those links and set a threshold which will
kick off an alert at least. If you're scripting savvy, other tools such as IP
SLA and EEM on Cisco could be used to automate the failover. Juniper also
has a similar scripting tool that can probably do the same. I've had this
happen before and it is a real pain.
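
A minimal sketch of the PPS threshold idea, in Python rather than any router scripting language. It assumes you already poll a cumulative packet counter for the provider-facing interface (for example over SNMP); the names `CounterSample`, `packets_per_second`, and `should_alert`, the choice of three consecutive low readings, and the decision to watch the inbound counter are all illustrative assumptions, not something from a vendor tool. Counter wrap is not handled.

```python
from dataclasses import dataclass


@dataclass
class CounterSample:
    """One reading of a cumulative interface packet counter (hypothetical poller output)."""
    timestamp: float  # seconds
    packets: int      # cumulative packets seen on the interface


def packets_per_second(prev: CounterSample, curr: CounterSample) -> float:
    """Average packet rate between two cumulative counter readings."""
    elapsed = curr.timestamp - prev.timestamp
    if elapsed <= 0:
        raise ValueError("samples must be in increasing time order")
    return (curr.packets - prev.packets) / elapsed


def should_alert(samples: list[CounterSample], min_pps: float, consecutive: int = 3) -> bool:
    """Alert only when the last `consecutive` polling intervals are all below min_pps."""
    rates = [packets_per_second(a, b) for a, b in zip(samples, samples[1:])]
    if len(rates) < consecutive:
        return False  # not enough history yet to judge
    return all(rate < min_pps for rate in rates[-consecutive:])
```

Requiring several low intervals in a row is one way to avoid paging on a single quiet minute; the right threshold depends entirely on what normal traffic looks like on that link.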

Regards,
M

I had a provider like that a long time ago; it was an ATG T1 (which was
fine) but when they were bought by Eschelon the exact problem you're
describing would happen every other month like clockwork. The first time
was forgivable. The second time I was annoyed. After the third I was
angry, unplugged it, and told them to stuff it because apparently they
didn't know how to deal with BGP.

You can't prevent it from happening. You can only come up with band-aids
to notify you. Save yourself the headache and find a new provider that
knows how to handle BGP. What happens if the other circuit is not
available (outage, planned maintenance, etc.) at the same time the
problem one decides to black hole you? If you're facing the same
repeating problem they are obviously not the best fit for you.

~Seth

> Save yourself the headache and find a new provider that knows how to handle
> BGP

I've had this happen with providers that do know how to handle BGP. Just
because you peer with 3356, 701, etc, doesn't mean operators can't make a
mistake. I've even seen this happen due to some weird BGP behavior caused by
some cool new "features".

IMHO, better to plan for it and deploy it as a policy (by whatever means).

M

On a predictable schedule? That's where I drew the line: they were
"fixing" something that was not "normal" to them every two months that
resulted in the problem the OP described. Yes, mistakes happen, but
identical repeating mistakes don't count in my book. I would expect my
providers to document changes and whoever is making changes to consult
it when they see a deviation from common config.

~Seth

Quick question, are you running with a default route from your
provider? If so, you're better off either finding another provider,
or upgrading the router (if necessary) to carry a full table. If
they do something to partition their network, you will see the
decrease in routes learned from them, provided you see those routes
and not the default route as asked above.
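
As a rough illustration of watching for a decrease in learned routes, the Python sketch below compares the current prefix count from a provider against a baseline. How the counts are collected is left out, and the function names and the 20% default are assumptions made for the example. A check like this only helps when the fault actually withdraws routes; it says nothing if the provider keeps advertising prefixes it can no longer reach.

```python
def prefix_drop_ratio(baseline: int, current: int) -> float:
    """Fraction of the baseline prefix count that has disappeared (0.0 if none)."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return max(0.0, (baseline - current) / baseline)


def table_looks_partitioned(baseline: int, current: int, max_drop: float = 0.2) -> bool:
    """True when more than `max_drop` of the usual prefixes are missing."""
    return prefix_drop_ratio(baseline, current) > max_drop
```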

charles

We are receiving full routes from both providers.

---Chris

As Max stated, you can set triggers based on thresholds that are monitored
via multiple methods in Cisco IOS. That way you could force the route down
dynamically. There's always a risk when letting the machines do the thinking
but this would help in situations like this. Can't speak for other vendors
but I'm sure the features are similar.
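
One way to limit the risk of letting the machines do the thinking is to require several consecutive probe results before changing anything. The Python sketch below is only the decision logic, not IOS or EEM configuration. It assumes some external probe reports whether a far-end target is reachable through Provider A, and that the probe keeps traversing that provider even while the BGP session is shut (for example via a static host route). The class name and default counts are hypothetical.

```python
class FailoverController:
    """Decide whether a BGP peer should stay enabled, with hysteresis."""

    def __init__(self, fail_after: int = 3, recover_after: int = 10):
        self.fail_after = fail_after        # bad probes needed to disable the peer
        self.recover_after = recover_after  # good probes needed to re-enable it
        self.peer_enabled = True
        self._streak = 0  # consecutive probes that disagree with the current state

    def observe(self, probe_ok: bool) -> bool:
        """Feed one probe result; return whether the peer should be enabled."""
        disagrees = probe_ok != self.peer_enabled
        self._streak = self._streak + 1 if disagrees else 0
        needed = self.fail_after if self.peer_enabled else self.recover_after
        if self._streak >= needed:
            self.peer_enabled = not self.peer_enabled
            self._streak = 0
        return self.peer_enabled
```

Making recovery slower than failure is a deliberate choice in this sketch: it keeps a provider that is flapping from pulling traffic back too early.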

-Hammer-

"I was a normal American nerd."
-Jack Herer

Well, as someone else stated, if an upstream provider can't provide BGP reliably then it's time to give them the boot. Once in a year, okay, but beyond that, it's time to read the riot act to that provider.
Bret

I'm not arguing that at all. But it wasn't relevant to the question at
hand. And depending on the scale of your business, dumping providers is not
something done on a whim. It's not like you're fed up with DSL and want to
convert to Cable.

-Hammer-

Assuming that he has provider independent space (why run full BGP feeds if you
are not multihomed?), then, actually it's about on par and less disruptive in
general. Add new provider, wait a day or two, then disconnect old provider.

If he's using provider assigned space, then, the big hurdle is switching to provider
independent (requires a renumber), but, that's a good idea for a variety of reasons.

I would hardly call the type and frequency of outages described a "whim" when
using that as a reason to change providers. Sounds like he is suffering
severe impact to his business.

Owen

I agree. But swapping providers is not the default answer in some
environments. I work in an enterprise with multiple GE circuits from
multiple providers to the Internet. The lead time on calling up a different
carrier and saying "I need a gigabit connection to the Internet" would
probably be 90-120 days. And then you get to go thru the
contracts/negotiations and MSAs. You don't just flip. In smaller operations
I understand. But I was simply saying that it's not always that easy. If I
went to my boss and said one of our carriers sucks and we should dump them
he would just laugh and throw me out.

1. What are the SLAs with the carrier in question? Do you have them clearly
defined? Are they out of SLA? If so, what compensation is entitled based on
violation of said SLA?

2. What trending are you doing to document the failures in SLA of the
carrier in question? Do we have a documented pattern of poor performance by
using that trending?

3. What are our contractual or legal options based on items 1 and 2?

4. Don't forget about the Layer8 (political) factor. If your telco manager
is buddies with the carrier then you have to double your documentation
against them. Some companies spend tens of millions a month on circuits. You
better be ready to justify yourself.
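
For items 1 and 2, the arithmetic behind "are they out of SLA?" can be sketched in a few lines of Python. The outage durations would come from your own trending data; the function names are made up for the example, and any target percentage you pass in is hypothetical, since the real figure and measurement rules come from the contract.

```python
def availability_percent(outage_minutes: list[float], period_minutes: float) -> float:
    """Percentage of the period during which the circuit was usable."""
    downtime = sum(outage_minutes)
    if period_minutes <= 0 or not 0 <= downtime <= period_minutes:
        raise ValueError("downtime must fit within a positive period")
    return 100.0 * (period_minutes - downtime) / period_minutes


def out_of_sla(outage_minutes: list[float], period_minutes: float, target_percent: float) -> bool:
    """True when measured availability falls below the contracted target."""
    return availability_percent(outage_minutes, period_minutes) < target_percent
```

For example, two outages of 45 and 90 minutes in a 30-day month (43,200 minutes) work out to 99.6875% availability.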

-Hammer-

Funny, I was just at your IPv6 site this morning while researching
multihoming scenarios. "That name sounds familiar....."

-Hammer-

> I agree. But swapping providers is not the default answer in some environments. I work in an enterprise with multiple GE circuits from multiple providers to the Internet. The lead time on calling up a different carrier and saying "I need a gigabit connection to the Internet" would probably be 90-120 days. And then you get to go thru the contracts/negotiations and MSAs. You don't just flip. In smaller operations I understand. But I was simply saying that it's not always that easy. If I went to my boss and said one of our carriers sucks and we should dump them he would just laugh and throw me out.

That depends on where you are. If you have a router in one or more of the many "carrier hotels" around the world, you can usually order a new Gig-E cross-connect with service in less than a week. If you need to have a circuit engineered, then, 30-90 days is probably about right. If you need to have facilities installed to provide said circuit, it can be as much as 180 days.

However, I don't think the point was "disconnect them tomorrow". I think the point was "If the impact is that severe, the sooner you start the new provider process, the sooner you get relief."

> 1. What are the SLAs with the carrier in question? Do you have them clearly defined? Are they out of SLA? If so, what compensation is entitled based on violation of said SLA?

99.99% of all SLAs are a pittance of money refunded IF you jump through extreme hoops to collect. They are rarely sufficient to resolve
or even compensate for outages.

> 2. What trending are you doing to document the failures in SLA of the carrier in question? Do we have a documented pattern of poor performance by using that trending?
>
> 3. What are our contractual or legal options based on items 1 and 2?
>
> 4. Don't forget about the Layer8 (political) factor. If your telco manager is buddies with the carrier then you have to double your documentation against them. Some companies spend tens of millions a month on circuits. You better be ready to justify yourself.

Yeah, this is usually the biggest problem.

Owen

Uncle!

-Hammer-