question regarding multi-homing

Hi all,

Happy new year...

I have a question regarding multi-homing, mostly from a stub network's
operational point of view. My big question is: what kind of failures
do you usually see from your providers? Link down? Link up, but
some routes withdrawn? Link up, no route change, but part or all of
the traffic blackholed? Anything else?

Let's say that I have two local routers (Ra and Rb) connecting to two
providers, A and B. If router Ra sees provider A fail in either of the
first two ways (link down, or link up but routes withdrawn), then Rb
can easily take over. My question is: if I am using provider A as the
default, and provider A has the third problem (link up, no route
change, but traffic blackholed), how can I detect it and switch
providers automatically?

To state the problem in detail: I use a static default route on Ra to
forward traffic to provider A, or receive 0/0 from provider A via BGP.
For some reason, provider A can no longer reach a particular /24. My
network is not notified (unless I receive a full Internet routing
table). All I know is that my traffic to that /24 is blackholed
through provider A. In this case, is there an automatic way for my
stub network to switch over to provider B? Or do I have to do the
detection and switchover manually? I don't think VRRP can help here,
right?

Thanks.
-Simon

Simon Chen wrote:

> I have a question regarding multi-homing, mostly from stub network's
> operational point of view. My big question is: what kind of failures
> do you usually see from your providers? Link down? Link up, but
> withdraw some routes? Link up, no route change, but blackholing
> partial or all traffic? Anything else?

I run a multihomed network with no downstream customers. Speaking only
for myself, over the last 5 years the majority of my problems have been
loss-of-link conditions such as:

* DLCI deleted (LEC "accidentally canceled" a FRT1 once)
* Loss of signal (almost always LEC problems)
* Loss of frame (almost always long haul problems)

It's worth noting that all my circuits are T1, T3, or OC-x, and so are
less likely to end up in an "up/up but not passing traffic" state than
an Ethernet handoff would be.

And only once:

* Sprint vs. Cogent peering spat (I'm a Sprint customer)

The last one would have been a huge problem for default-route or
single-homed users (which is why I always recommend full tables), but
for me it didn't matter, since the affected paths disappeared via Sprint
but were still there via my other upstream.
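
A minimal Python sketch of why full tables made the difference here. It assumes a heavily simplified route table (a list of prefix/provider pairs) and documentation prefixes chosen purely for illustration; real BGP best-path selection has many more steps than longest match plus a preferred provider.

```python
import ipaddress

def next_hop(routes, destination, preferred="A"):
    """Pick a provider for destination: longest prefix match, then preference."""
    addr = ipaddress.ip_address(destination)
    candidates = [
        (ipaddress.ip_network(prefix), provider)
        for prefix, provider in routes
        if addr in ipaddress.ip_network(prefix)
    ]
    if not candidates:
        return None  # no covering route at all
    longest = max(net.prefixlen for net, _ in candidates)
    providers = [prov for net, prov in candidates if net.prefixlen == longest]
    return preferred if preferred in providers else providers[0]

# Default-only: provider A's internal problem is invisible to us.
default_only = [("0.0.0.0/0", "A")]

# Full tables from both providers, after A has withdrawn 198.51.100.0/24.
full_tables = [
    ("198.51.100.0/24", "B"),
    ("203.0.113.0/24", "A"),
    ("203.0.113.0/24", "B"),
]
```

With `default_only`, traffic to 198.51.100.10 still goes to provider A and is lost; with `full_tables`, the same lookup returns provider B because A's route is simply gone.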

> To state this problem in detail: I use a static default route on Ra to
> forward traffic to provider A, or receive 0/0 from provider A via BGP.
> For some reason, provider A can no longer reach a /24. My network
> cannot be notified (unless, I receive a full internet routing table).
> In this case, all I know is that my traffic to /24 is blackholed
> through provider A. In this case, is there an automatic way for my
> stub network to switch over to provider B? Do I have to do the
> detection and switch over manually? I don't think VRRP can help here,
> right?

You're asking for what BGP does. You could ping every prefix you care
about and do it by hand, I guess. If this is a major concern for you I'd
say full tables are in your best interest so you can let BGP do what it
does best. (Disclaimer: there may be some trick I'm not aware of because
I always prefer to let BGP do its job.)
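
The "ping every prefix you care about" approach could be scripted roughly as below. This is only a sketch: `probe` is a hypothetical callable you would have to supply (for example, a ping sourced from an address that is policy-routed out one provider), and the provider labels and target addresses are placeholders.

```python
def blackholed_via_primary(targets, probe, primary="A", backup="B"):
    """Return the targets that fail via the primary but answer via the backup."""
    suspects = []
    for target in targets:
        # Only blame the primary if the backup path still reaches the target;
        # if both paths fail, the destination itself is probably down.
        if not probe(target, primary) and probe(target, backup):
            suspects.append(target)
    return suspects

# Stand-in probe for illustration: pretend provider A has lost 198.51.100.1.
def fake_probe(target, provider):
    return not (provider == "A" and target == "198.51.100.1")

print(blackholed_via_primary(["198.51.100.1", "203.0.113.1"], fake_probe))
```

What to do with the resulting list (alert someone, or shift the default to provider B) is a policy decision the sketch leaves open.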

~Seth

Simon-
   We do exactly what you are trying to accomplish. We have two routers and two providers. Provider A is our primary, and we receive partial routes from them (no static route). Router B is connected to Provider B with no default route (basically, it looks like we are not advertising to them), and our AS is prepended several times on Router B. Routers A and B are connected to each other via iBGP. Then, using interface tracking (we are a Cisco shop), we can fail over to Provider B. About the only failure we cannot automatically recover from is when the Router A interface or layer 1 to Provider A starts to fail in such a way that enough traffic gets through to keep BGP up, but errors make IP traffic fail.

This failover has worked several times while in production. Mostly we see our BGP session drop from provider A, but we have also seen link down from provider A. In testing we failed links and routers, and everything always recovered just fine. But we all know the lab can be completely different from the real world.

If you want to see how this works for us, go to bgplay.com and enter the following:

Network: 67.135.55.0/24

Start: 26/12/2009 20:00:00
End: 27/12/2009 07:00:00

Pull out 19629 (ME)
209 (Qwest, provider A)
7263 (GoFast. Dba Sungard, provider B)

At about 20:11 you see the routes start failing over to AS7263, and at about 06:23 the next day they start failing back.

This example happened when Qwest lost an edge router in Minnesota. Link status was up, but BGP tables were lost, so we had no route out to Qwest.
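
As a rough illustration of why the prepending keeps inbound traffic on provider A until A's path disappears, here is a small Python sketch. It models only the AS-path-length step of BGP best-path selection (real routers compare local preference and other attributes first), and the number of prepends is an assumption; the AS numbers are the ones listed above.

```python
def best_path(paths):
    """Pick the shortest AS path (first one wins a tie); None if there are none."""
    return min(paths, key=len) if paths else None

ORIGIN = 19629                      # our AS
via_a = [209, ORIGIN]               # path through Qwest (provider A)
via_b = [7263] + [ORIGIN] * 4       # path through provider B, 3 prepends assumed

# Both providers are propagating our route: the shorter path via A wins.
normal = best_path([via_a, via_b])

# Provider A stops propagating our route: only the prepended path remains.
failed_over = best_path([via_b])
```

In the second case the longer path is still perfectly usable; prepending only makes it less attractive, not invalid.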

Dylan Ebner, Network Engineer
Consulting Radiologists, Ltd.
1221 Nicollet Mall, Minneapolis, MN 55403
ph. 612.573.2236 fax. 612.573.2250
dylan.ebner@crlmed.com
www.consultingradiologists.com

If you are using Cisco...

http://www.cisco.com/en/US/prod/collateral/iosswrel/ps6537/ps6554/ps6599/ps8787/product_data_sheet0900aecd806c4ee4.html

Two more failure modes:

Link up, receiving all routes but provider stops propagating your
announcement outward.

Link up but unusably high packet loss to some or all destinations.

Regards,
Bill Herrin

> Link up, receiving all routes but provider stops propagating your announcement outward.

Longer AS path prepending on your secondary connection should take care of this, eh? Might end up with asymmetric routing but better than no traffic being returned.

> Link up but unusably high packet loss to some or all destinations.

Assuming Cisco hardware, what is the best way to handle this? Set up some IP SLA probes and bind them to tracking objects? Use EEM and Tcl scripting?
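
Whatever the platform, the underlying idea of a loss-based tracking object can be sketched in a few lines of Python. This is not Cisco's implementation: the class name, window size, and thresholds are all made up for illustration, and the probe results would have to come from your own measurements.

```python
from collections import deque

class LossTracker:
    """Track probe results over a sliding window, with separate thresholds
    for going down and coming back up so the state does not flap."""

    def __init__(self, window=10, down_loss=0.5, up_loss=0.1):
        self.results = deque(maxlen=window)
        self.down_loss = down_loss
        self.up_loss = up_loss
        self.up = True

    def record(self, success):
        """Record one probe result and return the current up/down state."""
        self.results.append(bool(success))
        loss = 1 - sum(self.results) / len(self.results)
        window_full = len(self.results) == self.results.maxlen
        if self.up and window_full and loss >= self.down_loss:
            self.up = False   # sustained loss: mark the path down
        elif not self.up and loss <= self.up_loss:
            self.up = True    # loss has cleared: mark the path up again
        return self.up
```

The state returned by `record` would then drive whatever actually moves traffic, such as withdrawing a default route or changing local preference.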

Thanks!
Jason

Thank you all for the replies!
It seems to me that Cisco's performance-based routing and other
commercial solutions can probably handle these potential problems. How
about operators that deal with this on their own? Is there a standard
detection and recovery procedure? How long does it usually take, with
or without scripting?

Thanks!
-Simon