Network Reliability Engineering

I'm looking for good reference material on doing
"reliability engineering" calculations and projections.

The goal is to justify increased redundancy, and I want to
include quantifiable numbers based on MTBF data and other
reliability factors: a scientific justification, rather
than the typical emotional appeal built on analyst/vendor
FUD.

I'd appreciate references on how to do this in a network
environment (what data to collect, how to collect it, how to
analyze it, etc.). I'd also welcome any data (or rules of
thumb) on typical MTBFs for network events that don't appear
on vendor product slicks (e.g. the MTBF of IOS, or of
human-caused service outages of various types).

If someone has put together something remotely like this
that they'd care to share, that'd be incredibly helpful.


Good luck. For a proper scientific analysis you'd need MTBF data for every
point of failure: e.g. the physical link, CSU/DSU, power supply, ...
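Once you have per-component numbers, the usual way to combine single points of failure is to treat them as a series system: the path is up only when every element is up, so, assuming independent failures, the component availabilities multiply. A rough Python sketch; the availability() and series_availability() helpers are illustrative, and the MTBF/MTTR figures in PATH are placeholders, not vendor or measured data.

```python
from math import prod

def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of one element: MTBF / (MTBF + MTTR)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)

def series_availability(components):
    """Availability of a chain of single points of failure.

    The path is up only if every element is up, so (assuming independent
    failures) the individual availabilities multiply.
    """
    return prod(availability(mtbf, mttr) for mtbf, mttr in components.values())

# Placeholder (MTBF hours, MTTR hours) figures for illustration only;
# substitute your own measured or vendor-supplied numbers.
PATH = {
    "physical link": (8760.0, 4.0),
    "CSU/DSU": (100000.0, 8.0),
    "power supply": (150000.0, 8.0),
}

a = series_availability(PATH)
print(f"path availability {a:.4%}, ~{(1 - a) * 8760:.1f} h downtime/year")
```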
As a rather non-scientific observation, a couple of outages per year, each
lasting 1-4 hours, seems to be quite common for a single-homed T1 or faster
connection, be it from WorldCom, AT&T, Sprint...
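For comparison, that rule of thumb can be turned into an availability figure. Reading "a couple" as two outages a year (an assumption), annual downtime is 2-8 hours, i.e. roughly 99.91% to 99.98% availability. The helper name below is illustrative:

```python
HOURS_PER_YEAR = 8760  # 365-day year

def availability_from_outages(outages_per_year, hours_per_outage):
    """Fraction of the year a link is up, given outage count and duration."""
    downtime_hours = outages_per_year * hours_per_outage
    return 1 - downtime_hours / HOURS_PER_YEAR

# "A couple" read as 2 outages per year (assumption), each lasting 1 or 4 hours.
for hours in (1, 4):
    print(f"2 x {hours} h outages: {availability_from_outages(2, hours):.3%} available")
```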

I think the arguments in favor of dual-homing are pretty cut and
dried. Tri-homing vs dual-homing would be a much tougher benefit to
justify.
Ralph Doncaster
div. of Doncaster Consulting Inc.