This is my first posting to the NANOG list. I don't think this is
off-topic, but if so, please send replies (or flames) directly to me
(rather than to the list) and I will issue a detailed summary.
We are in the process of implementing an OC-12 SONET ring connection
between two sites in New Jersey that are roughly 15 miles apart.
The SONET ring will be provided by Bell Atlantic and is composed of
fully redundant hardware (there are no single points of failure within the
telecom equipment) and redundant rings. We are splitting the OC-12
into pairs of OC-3's on two routers in each location (running ATM on
the WAN). This interface is extremely mission critical to the point
that a 99.9% uptime will not be acceptable. I have the following
questions:
1) Bell Atlantic assures us that, because of the redundancy, we can
expect 100% uptime from the OC-12. I would like feedback as to
whether this is a realistic portrayal of the SONET environment.
2) We have the option of using either single-mode or multi-mode fiber
OC-3 connections - what factors should be considered in selecting
the fiber media type.
3) Which routers should be used. The options are 3com NB-II's DPE+
(dual CPU), Cisco 7200 series, or Cisco 7500 (each router will
have at least two 100Base-T LAN ports). OK, I know the 3com
suggestion is a loaded question for this list, but has anybody
used the 3com's in this capacity? We are a 3com shop that is
considering switching to Cisco - this is a significant decision
because switching will require us to continue to maintain the
existing 3com environment (~500 routers) and the new Cisco
routers.
Thanks,
Peter Polasek
1) Bell Atlantic assures us that, because of the redundancy, we can
expect 100% uptime from the OC-12. I would like feedback as to
whether this is a realistic portrayal of the SONET environment.
Assuming the primary and backup paths are not in the same cable
anywhere (don't laugh, it happens), the chances of the OC-12 itself
failing inside Bell land are tiny. (Of course, if your local loop gets
assaulted by a backhoe, you're in trouble.)
2) We have the option of using either single-mode or multi-mode fiber
OC-3 connections - what factors should be considered in selecting
the fiber media type.
Single mode goes farther. Multi-mode is cheaper. If you've got
relatively short cables, save some money and use multi-mode. If
you're dragging fiber all the way across the building, or between
buildings on a campus, use single-mode.
3) Which routers should be used. The options are 3com NB-II's DPE+
(dual CPU), Cisco 7200 series, or Cisco 7500 (each router will
have at least two 100Base-T LAN ports). OK, I know the 3com
suggestion is a loaded question for this list, but has anybody
used the 3com's in this capacity? We are a 3com shop that is
considering switching to Cisco - this is a significant decision
because switching will require us to continue to maintain the
existing 3com environment (~500 routers) and the new Cisco
routers.
If you go Cisco, I'd recommend the 7500 series. The 7200 series won't
handle two OC-3's and two fast ethers. You'll drop packets.
pete@cobra.brass.com (Peter Polasek) writes:
This interface is extremely mission critical to the point
that a 99.9% uptime will not be acceptable. I have the following
questions:
1) Bell Atlantic assures us that, because of the redundancy, we can
expect 100% uptime from the OC-12. I would like feedback as to
whether this is a realistic portrayal of the SONET environment.
How many significant digits do you consider acceptable? Even in an ideal
APS environment, link failure detection and protection switching does take
finite time. You might get 99.999% uptime, but probably not 99.9999999%.
Methinks that you've been subjected to Marketing. 
Tony
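To put rough numbers on Tony's point, here is a quick back-of-the-envelope
sketch (Python). The 50 ms figure is the protection-switch completion target
commonly cited for SONET APS; the rest of the assumptions are only
illustrative.

    # Downtime budget implied by a given availability, per year.
    SECONDS_PER_YEAR = 365 * 24 * 3600        # ~31,536,000 seconds

    def downtime_budget_seconds(availability):
        """Seconds of allowed downtime per year at the given availability."""
        return (1 - availability) * SECONDS_PER_YEAR

    print(downtime_budget_seconds(0.99999))      # five nines -> ~315 s/year
    print(downtime_budget_seconds(0.999999999))  # nine nines -> ~0.03 s/year

    # A single ~50 ms APS protection switch (0.05 s) already exceeds the
    # nine-nines budget, while being negligible against the five-nines
    # budget -- which is roughly the point above.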
For total system uptime
90.0% (one nine or less) Desktop systems.
99.0% (two nines) Intermediate business systems
99.9% (three nines) Most business data systems and workgroup servers
99.99% (four nines) High-end business systems and your friendly
neighborhood telco
99.999% (five nines) Bank Data Centers and Telco Data Centers, some ISPs
99.9999% (six nines) Only God and Norad live here.
99.99999% (seven nines) Even God doesn't have pockets this deep.
There is a matching exponential cost increment with each step.
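To translate those percentages into wall-clock time, a small sketch (Python,
assuming a 365-day year):

    # Downtime per year implied by each availability level.
    MINUTES_PER_YEAR = 365 * 24 * 60          # 525,600 minutes

    levels = [("two nines", 0.99), ("three nines", 0.999),
              ("four nines", 0.9999), ("five nines", 0.99999),
              ("six nines", 0.999999)]

    for name, availability in levels:
        downtime_min = (1 - availability) * MINUTES_PER_YEAR
        print(f"{name:12s} {downtime_min:10.1f} minutes/year")

    # two nines   -> ~5256 minutes (about 3.7 days)
    # three nines ->  ~526 minutes (about 8.8 hours)
    # four nines  ->   ~53 minutes
    # five nines  ->  ~5.3 minutes
    # six nines   ->  ~32 seconds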
How many significant digits do you consider acceptable? Even in an ideal
APS environment, link failure detection and protection switching does take
finite time. You might get 99.999% uptime, but probably not 99.9999999%.
The thing that always got me was that there never seems to be a mention of
the sampling period for the stat.
Methinks that you've been subjected to Marketing. 
Well ... I'll give you 99.9999999% on any system you like - with a sampling
period of, say, every billion years. I think that allows me to stay down for
the first 100 years, long enough to extend beyond the life of any stressed
sysadmin.
More seriously - SLAs that specify a sampling period should then also give an
indication of what is considered too long an outage. If you get just under the
.1% downtime allowed per year all in one go, you may well be pretty pissed
at being told the 8-hour outage was within the SLA.
Manar
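A quick worked version of that 8-hour figure (the one-month window is only my
own comparison, not anything from the SLA discussion above):

    # 0.1% downtime against different SLA sampling periods.
    HOURS_PER_YEAR = 365 * 24                 # 8,760 hours
    HOURS_PER_MONTH = HOURS_PER_YEAR / 12     # 730 hours

    print(0.001 * HOURS_PER_YEAR)    # ~8.76 hours: one outage that still meets
                                     # a 99.9% SLA measured over a full year
    print(0.001 * HOURS_PER_MONTH)   # ~0.73 hours (~44 min): the worst single
                                     # outage a monthly-measured 99.9% SLA allows

    # Same percentage, very different worst case -- hence the sampling period
    # matters at least as much as the number of nines.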
For total system uptime
90.0% (one nine or less) Desktop systems.
99.0% (two nines) Intermediate business systems
99.9% (three nines) Most business data systems and workgroup servers
99.99% (four nines) High-end business systems and your friendly
neighborhood telco
99.999% (five nines) Bank Data Centers and Telco Data Centers, some ISPs
99.9999% (six nines) Only God and Norad live here.
99.99999% (seven nines) Even God doesn't have pockets this deep.
What's your source for this data?
-a
You mean, besides 22 years in the system design and development trade?
Well, you can start with the various companies I have worked for. Then the
manufacturing specs on the various systems. My last analysis involved a
two-headed server setup on an HP 9000 Series T520 with MC/ServiceGuard and
shared RAID5. HP guarantees that at three nines. With the right add-ons I
got it to four nines (complete second site in AZ, 1,500 miles away). Very
expensive. Five nines would have broken the budget; that was Wells Fargo.
Northrop Grumman MD-18 flight-line support.
The PacBell broadband system consisted of quad-redundant data centers in Fairfield
and San Diego. I was hired in as the Technical Architect for that system.
Again, HP equipment. That system would have hit five nines, or better, in
production. I think we were pushing past $16M on that system, thirty-six
specially configured T520's plus RAID packs.
Various systems I worked on in Patrice Carrol's org in MCI COS (Garden of
the Gods facility), including the Fraud Management System.
This stuff is more art than science; there are too many non-deterministic variables.
Experience is the only thing that counts. It tells you which formulae to
use and when they have a chance of working.
I should have my web site up again this weekend; we're converting to
FastTrack with LiveWire, in addition to Apache-SSL/mod_perl.
There's an interesting white paper about a Bell Atlantic SONET deployment
for military organizations at:
http://www.bell-atl.atd.net/s-wpaper
How many significant digits do you consider acceptable? Even in an ideal
APS environment, link failure detection and protection switching does take
finite time. You might get 99.999% uptime, but probably not 99.9999999%.
The thing that always got me was that there never seems to be a mention of
the sampling period for the stat.
Methinks that you've been subjected to Marketing. 
Well ... I'll give you 99.9999999% on any system you like - with a sampling
period of, say, every billion years. I think that allows me to stay down for
the first 100 years, long enough to extend beyond the life of any stressed
sysadmin.
More seriously - SLAs that specify a sampling period should then also give an
indication of what is considered too long an outage. If you get just under the
.1% downtime allowed per year all in one go, you may well be pretty pissed
at being told the 8-hour outage was within the SLA.
The quasi-engineering guideline for many CLECs when calculating average
downtime over a year's span is 52 minutes (meaning .0001% downtime over
the year). Anything above and beyond this estimate would be suspect.
Obviously, these Engineering baselines vary from carrier to carrier.
Also, this 52-minute guideline relates to the SONET ring and the muxes,
not to the tributaries (OC-3 or OC-12) or the optical/electrical hand-offs
that might fail due to bad terminations, bad wiring, or misconfigured nodes.
A common failure mode for OC-3c or OC-12c is the 2-fiber optical handoff to the
customer, which has nothing to do with the SONET ring itself or the associated
SONET gear.
Dave Cooper
Electric Lightwave, Inc.
Disclaimer: Comments above reflect my experience with numerous CLECs and
not specifically ELI.
The quasi-engineering guideline for many CLECs when calculating average
downtime over a year's span is 52 minutes (meaning .0001% downtime over
the year). Anything above and beyond this estimate would be suspect.
Sorry, drop the % on the .0001 -> should be .01%. Coffee wasn't strong
enough this morning. Thanks Barry.
-dave cooper
eli
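For what it's worth, the corrected figure squares with the 52-minute
guideline; a one-line check (Python, 365-day year assumed):

    # 0.01% downtime over a 365-day year.
    MINUTES_PER_YEAR = 365 * 24 * 60          # 525,600 minutes
    print(0.0001 * MINUTES_PER_YEAR)          # ~52.6 minutes/year, i.e. four nines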