dual router vs. single "reliable" router

David_Barak · April 10, 2003, 3:07pm

Okay, I'll bite...

the mentioned file,
http://www.nspllc.com/New%20Pages/Reliable%20IP%20Nodes.pdf
seems to be fluff to me.

There are many assumptions and statements about
reliability, but the methodology of how the numbers
were reached is not present. If one assumes that one
has a router which fails very rarely, this would
dramatically affect network design. However, this is
an assumption, not a conclusion. The assumption of
the paper is that the Alcatel box has ultra-low
failure rates, while the Juniper and Cisco boxen have
relatively high failure rates. Personally, before I
let something like this influence my buying/design
decisions, I'd want to see some serious raw data...

Richard_A_Steenbegen · April 10, 2003, 3:54pm

2x the hardware means 2x the number of hardware failures. It also means 2x
the number of software upgrades, and probably some multiplier greater than
2x for the increased complexity and opportunity for software to go wrong.
Dual routers just increase the number of overall failures in exchange for
hoping that only one goes down at any given time.

Throw in some assumptions (which may or may not be true, I'll agree that
some of their numbers are a little "off") that every one of those failures
involves some service impact, and you could easily make a case that one box
which doesn't go down is better than two boxes which routinely go down.

On one side of the coin, Cisco has done a masterful job at convincing the
networking industry that the correct answer to their routine failures is
to purchase double of everything. On the other side... Show me the box
that never goes down.

David_Barak · April 10, 2003, 4:07pm

2x the hardware means 2x the number of hardware failures. It also means 2x
the number of software upgrades, and probably some multiplier greater than
2x for the increased complexity and opportunity for software to go wrong.
Dual routers just increase the number of overall failures in exchange for
hoping that only one goes down at any given time.

The fallacy here is the assumption that the greater
number of failures which a dual-router scenario will
encounter are of the same qualitative type as the
failures your single router will encounter.

This is clearly not true: one of a pair failing means
that there will be a period of convergence, and then
the remaining router will carry the load. If a single
router fails, the load will not be carried until the
router can be restored.
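
To put rough numbers on the difference between "more failures" and "more lost service", here is a minimal sketch. It is not from the paper or from anyone in this thread. It assumes the two routers fail independently at the same rate and that either one can carry the full load alone; the function names and the input figures are hypothetical.

```python
HOURS_PER_YEAR = 8766.0  # 365.25 days * 24 hours


def single_router_downtime(failures_per_year, repair_hours):
    """Hours of lost service per year when one router carries everything.

    Every failure is an outage that lasts until the box is restored.
    """
    return failures_per_year * repair_hours


def router_pair_downtime(failures_per_year, repair_hours, convergence_hours):
    """Approximate hours of lost service per year for a redundant pair.

    Assumes the two routers fail independently at the same rate and that
    either one can carry the full load alone.
    """
    # Twice the hardware, so twice the failure events, but each one only
    # costs a convergence period while traffic moves to the survivor.
    convergence_loss = 2 * failures_per_year * convergence_hours

    # Fraction of the year a given router spends broken.
    unavailability = failures_per_year * repair_hours / HOURS_PER_YEAR

    # Time when both are broken at once (independent failures assumed).
    overlap_loss = unavailability ** 2 * HOURS_PER_YEAR

    return convergence_loss + overlap_loss


if __name__ == "__main__":
    # Hypothetical inputs: 4 failures per router per year, 4 hours to
    # restore a failed router, 1 minute of reconvergence per failure.
    single = single_router_downtime(4, 4.0)
    pair = router_pair_downtime(4, 4.0, 1.0 / 60.0)
    print(f"single router: {single:.2f} hours/year of lost service")
    print(f"router pair:   {pair:.2f} hours/year of lost service")
```

The pair sees twice as many failure events, yet under these assumptions nearly all of its lost service comes from convergence time. If failures are correlated (same bad software image, same fat-fingered command), the overlap term is too optimistic.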

On one side of the coin, Cisco has done a masterful job at convincing the
networking industry that the correct answer to their routine failures is
to purchase double of everything. On the other side... Show me the box
that never goes down.

My point exactly: from a design perspective it's much
simpler to have a single box, but I have not seen
single boxen which don't fail.

I'm actually a big fan of the "cold-spare" approach:
you preserve your simplicity, and any outage only
lasts as long as it takes to unplug and re-plug...
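
The cold-spare argument is really an argument about repair time. A minimal sketch using the standard steady-state formula, availability = MTBF / (MTBF + MTTR), shows the effect. The MTBF and the two repair times below are made-up illustrative values, not measurements from any vendor.

```python
def availability(mtbf_hours, mttr_hours):
    """Steady-state availability: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    mtbf = 2000.0  # hypothetical mean time between failures, in hours

    # Hypothetical repair times: 4 hours to diagnose and fix in place,
    # versus 15 minutes to unplug the dead box and re-plug the spare.
    repair_in_place = availability(mtbf, 4.0)
    swap_cold_spare = availability(mtbf, 0.25)

    print(f"repair in place: {repair_in_place:.6f}")
    print(f"cold spare swap: {swap_cold_spare:.6f}")
```

The single box fails just as often in both cases; only the time to restore changes. This also assumes the spare carries a current config and software image, and that someone is on site to do the swap.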

Stephen_Sprunk3 · April 10, 2003, 4:59pm

Thus spake "David Barak" <thegameiam@yahoo.com>

There are many assumptions and statements about
reliability, but the methodology of how the numbers
were reached is not present. If one assumes that one
has a router which fails very rarely, this would
dramatically affect network design. However, this is
an assumption, not a conclusion. The assumption of
the paper is that the Alcatel box has ultra-low
failure rates, while the Juniper and Cisco boxen have
relatively high failure rates. Personally, before I
let something like this influence my buying/design
decisions, I'd want to see some serious raw data...

Nearly all the Cisco device failures I've seen were either software or human
problems; actual hardware failure is _way_ down the list. Also, I've
observed significantly worse reliability among devices specifically designed
to be highly reliable compared to devices simply designed to work.

There are several networks out there using Cisco devices to achieve over six
9's availability, and the way they do that is by extensive procedure review
and rigorous software testing. Writing more reliable software is certainly
doable, but more-reliable humans aren't likely and more-reliable hardware is
unnecessary. IMHO.
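
For a sense of scale on "six 9's", here is a small sketch converting a number of nines into allowed downtime per year. The function name is hypothetical, and a year is taken as 365.25 days.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # 31,557,600 seconds


def downtime_seconds_per_year(nines):
    """Allowed downtime per year for an availability of N nines.

    Two nines is 99%, four nines is 99.99%, six nines is 99.9999%.
    """
    unavailability = 10.0 ** (-nines)
    return unavailability * SECONDS_PER_YEAR


if __name__ == "__main__":
    for n in (2, 3, 4, 5, 6):
        seconds = downtime_seconds_per_year(n)
        print(f"{n} nines: {seconds:,.1f} seconds/year")
```

Six nines leaves roughly half a minute of downtime per year, which is why procedure review matters so much: one mistyped command can spend the whole budget.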

S

Stephen Sprunk, CCIE #3723, K5SSS
"God does not play dice." --Albert Einstein
"God is an inveterate gambler, and He throws the dice at every possible
opportunity." --Stephen Hawking

Stephen_Sprunk3 · April 10, 2003, 5:04pm

Thus spake "Richard A Steenbergen" <ras@e-gerbil.net>

Throw in some assumptions (which may or may not be true, I'll agree
that some of their numbers are a little "off") that every one of those
failures involves some service impact, and you could easily make a case
that one box which doesn't go down is better than two boxes which
routinely go down.

If a tree falls down in a forest, but service isn't affected, do we care if
the tree falling made a noise? If you have two devices which are each up 99%
of the time, then at least one of the two is up 99.99% of the time (assuming
they fail independently). While designing with two of everything is indeed
more complex, it's often simpler than designing a single product that's more
reliable.
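
The 99% to 99.99% arithmetic can be checked with a short sketch. It assumes the devices fail independently and that failover costs nothing, neither of which is guaranteed in practice; the function name is hypothetical.

```python
def at_least_one_up(availability, count):
    """Probability that at least one of `count` identical devices is up.

    Assumes independent failures: all devices are down at once with
    probability (1 - availability) ** count.
    """
    return 1.0 - (1.0 - availability) ** count


if __name__ == "__main__":
    for n in (1, 2, 3):
        print(f"{n} device(s) at 99%: {at_least_one_up(0.99, n):.6f}")
```

Two devices at 99% give 1 - 0.01 * 0.01 = 0.9999. A shared failure cause, such as the same software bug on both boxes, breaks the independence assumption and pulls the real figure down.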

Having two of everything also simplifies maintenance, since you don't care
(much) about an individual box being down. Public ATM networks are hell to
maintain because every node must be up 24x7 and simple things (to
routerheads) like a software upgrade are a 3+ month project because it must
be done online without dropping a single cell.

S

Joe_Provo4 · April 10, 2003, 10:02pm

[snip]

There are several networks out there using Cisco devices to achieve
over six 9's availability, and the way they do that is by extensive
procedure review and rigorous software testing.

[snip]

Tear this page out of the notebook and highlight it. Even a small shop
that "can't afford" a testbed or lab can dramatically reduce problems
with formal procedures borrowed from software engineering.

Joe

Andre_Gironda · April 10, 2003, 10:07pm

This is also my experience. The chance of a forklift or ceiling tile taking
out your infrastructure is not even close to the number of times you have to
tell the junior guys "nononononono, `debug all' is a _bad_ idea". Until the
software reduces/eliminates pilot error to a severe degree, and is proven to
prevent forwarding issues (read: FIB bugs), there is just no big motivation
to run a single box. A single box has its application in IP networks, but
more so in the access layer (customer edge) or the Internet edge (peering).

I'm just flinching thinking about using a single box in the core (per POP),
when a single command of any type can just take out the whole box (`no ip
routing' immediately comes to mind). There's just more software on IP boxes
compared to telco technology.

The only work I've seen in the IETF on this topic is this draft:
http://www.ietf.org/internet-drafts/draft-kilsdonk-router-upgrade-01.txt
But it has left a lot to be desired, IMO.

dre

John_L_Lee · April 11, 2003, 4:41pm

The report seemed to like the boxes from the companies that Alcatel bought. On this side of the pond they
did not get a lot of traction.

From my experience with a “large” Cisco network, besides the normal too many fingers on the CLI to the router, there were about one to
two hardware failures a week on the big “C” devices. Most of these were interface cards, but some were control cards on switches.

As long as you did not overrun the 12xxx series with too many BGP updates per unit of time, they were stable as boxes.

To calculate “99.99” percent uptime, you:

Define scheduled downtime as any maintenance that can be done after giving the worldwide network at least five minutes' notice of major
network outages …

Define the device to be up as long as you can secure telnet into the box and get a command prompt …

Ignore BGP, routing or forwarding tables, or any other routing issues, because “Up-Time” means the box is up, not that it is doing any useful work …

And finally, go back to a “real person” deterministic protocol like “ATM” with PNNI and UNI so you
“know” when the network is down and do not have to guess … My layer two network ran for two years with no hardware or routing issues …

The other minor technical issue: say you can calculate uptime on a “router”; how do you calculate uptime on the network? … Just
because one router is down does not mean the “network” is down, so individual uptime for a router in an IP network may only affect SLAs
for customers directly attached to it.

John (It Suites Dennis’s Needs) Lee

David Barak wrote: