redundancy [was: something about arrogance]

Brad writes:
> I'm probably demonstrating my ignorance here (and my stupidity in
> stepping into a long-standing highly charged argument), but I'm
> completely missing something. For reasons of redundancy &
> reliability, even if you were to buy bandwidth in only one location,
> wouldn't you want to buy it from at least two different providers?
>
> If you buy bandwidth from two different providers at two
> different locations, this would seem to me to be a good way to
> provide backup in case one provider or one location goes
> Tango-Uniform, and you could always backhaul the bandwidth for the
> site/provider that is down.

Several other posters have mentioned reasons why redundancy between 2 different connections to separate providers is not, in most situations, the preferable approach, but I would like to add another point/question...

When considering redundancy/reliability/etc. it is important to think about what kinds of failure you want to protect against versus the cost of doing so.

It is my impression, from reading this list and tidbits of gossip, that the most common causes of failure are:
- link failure
- equipment failure (routers mostly), both software and hardware
- configuration errors

All of those are much more frequent than the failure of an entire ISP (a transit provider). It is expected, I believe, of a competent ISP to provide redundancy both within a POP and intra-POP links/equipment and its connections to upstreams/peers.

As such, probably the first level of redundancy that an origin AS (non-transit) would look at would be intended to protect against failures of its external connectivity link and termination equipment (routers on both ends).

To do so, one can look at:
- 2 external links to distinct providers
- 2 external links to the same provider

While I can't speak to the economics part of the equation (although I would expect it to be cheaper to buy an additional link than to connect to a different provider), from a restoration point of view, protecting a path with an alternate path from the same provider is certainly an approach that gives you much better convergence times.

This comes from the fact that, in terms of network topology, the distance between 2 links to the same upstream is much shorter than between 2 links to different upstreams. If you protect a path with an alternate path to the same ISP, you can expect convergence to occur within the IGP convergence times of your provider; with 2 different providers you need global BGP convergence to occur.

This takes longer the more topologically distant your 2 upstreams are... For instance, attempting to protect a path to an ISP with very wide connectivity with a protection path from one with very limited connectivity would be a particularly bad case: you would have to wait for the path announced by the larger ISP to be withdrawn n times from all its peering points and for the protection path to make its way through in replacement.

What I perceive to be the standard practice of origin-only ASes attempting to multi-home to 2 distinct providers is counter-intuitive to me. From several points of view (convergence times, load on the global routing system, complexity of management, etc.), dual connectivity to different routers of the same provider (using distinct physical paths) would seem to me to make more sense.

Unless the main concern is that the upstream ISP fails entirely... which, given that it tends to get front-page honors in the NYTimes these days, does not appear to be an all too common occurrence (I mean operationally, not financially; clarification added to dispel potential humorous remarks).

So, my question to the list is: why is multi-homing to 2 different providers such a desirable thing? What is the motivation, and why is it preferred over multiple connections to the same upstream?

Is the main motivation not so much reliability but having a shorter AS path to more destinations? This would not appear to me to be a clear advantage, since a shorter AS path doesn't necessarily reflect a better quality of interconnection.

My apologies in advance if these seem to be stupid questions...

thanks,
  Pedro.

> All of those are much more frequent than the failure of an entire ISP (a
> transit provider). It is expected, I believe, of a competent ISP to
> provide redundancy both within a POP and intra-POP links/equipment and
> its connections to upstreams/peers.

  Yes, but when the ISP that all your redundant links go to and that you got
all your IPs from goes out of business, what's the mean time to repair? 30
days?

> So, my question to the list is, why is multi-homing to 2 different
> providers such a desirable thing? What is the motivation and why is it
> preferred over multiple connections to the same upstream?

  You cannot as easily be held hostage. I have consulted for a few ISPs and
have my share of war stories.

  Here's a (true!) example. One day, a certain head of a fairly large ISP
decided that he wouldn't route traffic to or from IPs he had assigned that
didn't reverse resolve, because he felt it was imperative that people be able
to find network contacts in this way (I think he got sick of being the one to
get the abuse emails). He told my client three days before implementing a
sweep and filter. My client had the equivalent of about 38 /24s from this ISP
distributed over about 180 customers, and this ISP was his sole uplink.
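
For a sense of scale, 38 /24s is 38 x 256 = 9,728 addresses, each of which would need a PTR record before the filter went in. A minimal sketch of how one might enumerate the PTR names for a prefix and spot-check whether an address reverse resolves, using only Python's standard library. The helper names are made up for illustration, and 192.0.2.0/24 is a documentation prefix standing in for a real assignment:

```python
import ipaddress
import socket

def ptr_names(prefix):
    """Return the reverse-DNS (PTR) name for every address in the prefix."""
    return [addr.reverse_pointer for addr in ipaddress.ip_network(prefix)]

def reverse_resolves(address):
    """True if a PTR lookup succeeds. Needs working DNS, so not called below."""
    try:
        socket.gethostbyaddr(address)
        return True
    except (socket.herror, socket.gaierror):
        return False

# Placeholder documentation prefix, not a real customer assignment.
names = ptr_names("192.0.2.0/24")
print(len(names), names[1])
```

Running reverse_resolves() over each address in an assignment would give a list of the addresses a filter like the one described would cut off.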

  Here's another good one. A client needed a /22 immediately for a major
customer about to come online: set it up fast or lose the account. We made
sure to meet all the IP assignment guidelines and our justification was
impeccable; we had >90% utilization of a /18. The only problem was, the
client's provider had a screw-up in their allocations and justifications and
their applications were being refused by ARIN until they fixed their
problems. Now what?

  One more just for kicks. Client had a 100Mbps circuit from their sole
provider (100Mbps to colocated router, DS3 from this router to their
premises). The circuit had been in place for several years and the contract
had long since expired. One day, they got a call -- they had 5 days to agree
to a new (and MUCH higher) pricing scheme with a much higher minimum paid
bandwidth amount or their circuit would be turned off. The kicker -- they had
to agree to a two year term!

  The other issue is provider misconfigurations/meltdowns. They're not common,
but if you're multihomed, you can just shut down the circuit to the
misconfigured provider. There have been a few cases of these that I've seen
where the repair time was several hours.

  If you add cases where just one POP was out, the number goes way up. If
you're only in one location yourself and only use one provider, all of your
redundant links will likely go to the same POP.

  DS

> You cannot as easily be held hostage. I have consulted for a few ISPs and
> have my share of war stories.
>
> Here's a (true!) example. One day, a certain head of a fairly large ISP
> decided that he wouldn't route traffic to or from IPs he had assigned that
> didn't reverse resolve because he felt it was imperative that people be able
> to find network contacts in this way (I think he got sick of being the one to
> get the abuse emails). He told my client three days before implementing a
> sweep and filter. He had the equivalent of about 38 /24s from this ISP
> distributed over about 180 customers, they were his sole uplink.

[SNIP]

Often overlooked is the redundancy in business processes. We tend
to view events with an external-forces engineering perspective while
frequently the culprits are uninformed decisions, knee-jerk reactions and
opportunism by humans at our vendors. (Not to downplay other risks.)

-John

I have in the past single-homed to Level(3) and Verio, each in their own
facility in NC.
In that time, both carriers had about 1 solid hour of downtime a month
(some months were worse, some were better). Some of the outages
were on the order of 8 solid hours (Verio) or 4 hours (Level(3)).
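
For a rough sense of what an hour a month means, the sketch below converts it to an availability figure. It then estimates how often two such providers would be down at the same time, under the strong (and often false) assumption that they fail independently. The function names and the 730-hour average month are assumptions for illustration, not measurements from either carrier:

```python
HOURS_PER_MONTH = 365 * 24 / 12  # average month, 730 hours

def availability(downtime_hours_per_month):
    """Fraction of the month a single uplink is up."""
    return 1.0 - downtime_hours_per_month / HOURS_PER_MONTH

def combined_availability(*availabilities):
    """Availability of N uplinks in parallel, ASSUMING independent failures."""
    all_down = 1.0
    for a in availabilities:
        all_down *= (1.0 - a)
    return 1.0 - all_down

single = availability(1.0)                    # one hour down per month
dual = combined_availability(single, single)  # two such providers

both_down_seconds = (1.0 - dual) * HOURS_PER_MONTH * 3600
print(f"single-homed availability: {single:.4%}")
print(f"dual-homed availability:   {dual:.6%}")
print(f"expected overlap: {both_down_seconds:.1f} seconds/month")
```

Shared fiber, shared facilities or a common upstream break the independence assumption, so the dual-homed figure is a best case rather than a prediction.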

We did not run HSRP with Level(3), so it may be difficult to guarantee the
uptime of one GigE handoff... But we ran HSRP with Verio, and of all the
outages (about 20 of them), maybe two were avoided because of
HSRP.

Other than that, it was all downtime.

At this point, I couldn't conceive single-homing to any uplink anymore.

--Phil

I couldn't possibly agree more. In fact, my approach has been to create
a mesh between different Colo centers, and keep it at about 3 Transit
carriers. Because of the different methods of interconnection, I haven't
ever had a long-term outage. Also, I've been able to filter any issues
that are beyond my carrier's immediate reach (i.e. congested peering
points.) At the same time, I've been able to maintain aggregation of all
of my routes, and maintain true stability in my network. There is
absolutely no excuse to fill up the routing tables with nonsense.

Derek

Yes, there is a reason to have some /24s advertised to the world. If this
were a class on BGP they would tell you that was a no-no, but since this is
the real world it happens and is sometimes required. It is required when you
need to give a customer T-1 access at a location separate from yours that
has a separate connection to the net, and you are using your AS on the
access router. A /24 is a solution that works nicely and still works
with your aggregated /20 address.
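
The reason a /24 announced alongside its covering /20 "works" is longest-prefix matching: routers prefer the most specific route that covers a destination. A minimal sketch using Python's standard ipaddress module. The RFC 1918 prefixes and the best_route helper are illustrative assumptions, not anyone's real announcements:

```python
import ipaddress

def best_route(destination, routes):
    """Return the most specific (longest-prefix) route covering destination."""
    addr = ipaddress.ip_address(destination)
    networks = [ipaddress.ip_network(r) for r in routes]
    matches = [n for n in networks if addr in n]
    return max(matches, key=lambda n: n.prefixlen, default=None)

# Placeholder prefixes: an aggregate plus one more-specific /24 inside it.
routes = ["10.20.0.0/20", "10.20.5.0/24"]

print(best_route("10.20.5.10", routes))  # inside the /24
print(best_route("10.20.9.1", routes))   # only the /20 covers this
print(best_route("10.21.0.1", routes))   # covered by neither
```

Traffic for the /24 follows the more specific announcement, while the rest of the /20 follows the aggregate. The cost is one extra entry in everyone else's routing table.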

That is even worse than what we have been talking about. You should be
running a P2P T1 back to yourself, and distributing the access from a
POP, or have the carrier you're reselling the T1 for allocate a /24.
There is no reason to run BGP for a single /24 whatsoever, it should be
announced in Carrier address space. Using your AS for another company
totally violates the whole idea of an "Autonomous System".

Derek

> It is my impression, from reading this list and tidbits of gossip,
> that the most common causes of failure are:
> - link failure
> - equipment failure (routers mostly), both software and hardware
> - configuration errors

  Most likely true.

> To do so, one can look at:
> - 2 external links to distinct providers
> - 2 external links to the same provider

  The latter doesn't protect you from a misconfiguration at the same provider, upstream of their redundant links to you. Nor does it protect you if they have a SPOF above your redundant links. Even if logically they have two (or more) separate outward links, if those links run over the same fiber, or the fibers in question are physically close to each other, then a single backhoe could take you out.

  A second provider doesn't necessarily protect you against the backhoe problem, but it would reduce the chances of a problem caused by an upstream misconfiguration.
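
One way to make the SPOF argument concrete is to list the components each path depends on and intersect them: anything common to every path takes all of them down when it fails. A minimal sketch, where every component name is a hypothetical placeholder and not a description of any real network:

```python
def shared_components(*paths):
    """Return the components common to every path (shared points of failure)."""
    return set.intersection(*(set(path) for path in paths))

# Two links to the same provider, leaving the building in the same conduit.
same_provider = [
    ["my-router-1", "conduit-A", "provider1-pop-router-1", "provider1-core"],
    ["my-router-2", "conduit-A", "provider1-pop-router-2", "provider1-core"],
]

# Same physical exit, but the second link goes to a different provider.
two_providers = [
    ["my-router-1", "conduit-A", "provider1-pop-router-1", "provider1-core"],
    ["my-router-2", "conduit-A", "provider2-pop-router-1", "provider2-core"],
]

print(sorted(shared_components(*same_provider)))
print(sorted(shared_components(*two_providers)))
```

In this toy case the second provider removes the shared upstream core from the list, but the shared conduit (the backhoe problem) remains until the physical paths are diversified as well.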

> While I can't speak to the economics part of the equation (although
> I would expect it to be cheaper to buy an additional link than connect
> to a different provider) from a point of view of restoration,
> protecting a path with an alternate path from the same provider
> is certainly an approach that gives you much better convergence times.

  Perhaps, perhaps not. I would be willing to bet that there are at least a few large providers that effectively run each city as a separate business, and they'll gouge you just as much or more for two connections as you would pay to get one connection each from two companies.

> Unless the main concern is that the upstream ISP fails entirely...
> which given the fact that it tends to have frontpage honors on the
> NYTimes these days does not appear to be an all too common occurrence
> (I mean operationally, not financially - clarification added to
> dispel potential humorous remarks).

  Again, I think that this is at least partly dependent on who the upstreams are. If they're small enough, then a single backhoe could take out all the fiber (or cause the remaining fiber to be loaded well past capacity and practically useless) or cause a power loss across the entire facility.

  Even if you buy connectivity from a pretty big upstream, what with WorldCom and Qwest both being in serious trouble (and KPNQwest having completely shut down operations), I would indeed be very concerned about complete failure of my upstream.

Seeing as I don't understand much about this process, I would love to hear a detailed explanation of how you have managed to do all this.

Er, what does due diligence mean to you?

(We're waaaay into no-shit-sherlock territory here)

(For the record, I'd consider any operation without an AS number a
startup, and my first project, if network availability were a key
issue, within any organisation would be to a) obtain one and b) make
use of it. YMMV, but some V are more equal than others. Particularly
in the current economic climate.)

Patrick Evans <pre@PRE.ORG> writes:

> My first project, if network availability were a key issue, within any
> organisation would be to a) obtain [an AS number] and b) make use of
> it.

Heh. How many bits in an AS number, again?

Jim Shankland

*grin*

That's a problem with the underlying protocol. I get paid to run
operational networks, not bleat endlessly about "how much work would
it *really* take to implement 24-bit AS numbers?" :-)

Crying about protocol deficiencies is a distant second to keeping a
business up and running these days.