RE: Quick Question on Industry Standard

and match. Is this a hard and fast rule or is this a value that we all try
and emulate as best as we can? Do I have the value incorrect? Is it higher
or lower? I had always thought that it was 99.97% but have not found
anywhere to reference that figure,

Cisco references that figure in their MTBF/MTTR (mean time between failures / mean time to repair) calculations; however, they also reference 99.9%. Our organization does not include scheduled maintenance in our HA (high availability) calculations, and I expect most organizations don’t either.

Quote, “Availability is calculated using statistical models for all the system components, the simplest model for a component being binary. The component is either in or out of service. Availability can be calculated from failure rates, measured in mean time between failures (MTBF), and repair times, measured in mean time to repair (MTTR).”

Also, “The average downtime contribution by any component is calculated by amortizing the MTTR time over the MTBF period. For example, if a component critical to the operation of the platform has an MTBF of 250,000 hours and a MTTR of 1 hour, it contributes 2.1 minutes (1 hr ÷ 250,000 hr × 8,760 hr/yr × 60 min/hr) of unavailability to the system per year.”
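As a rough check on the quoted arithmetic, here is a minimal Python sketch. The function names are my own (not Cisco's), and it assumes 8,760 hours per year and the simple binary in-service/out-of-service model described in the quote.

```python
HOURS_PER_YEAR = 8760  # 365 days * 24 hours; leap years ignored


def availability(mtbf_hours, mttr_hours):
    """Steady-state availability of one component: uptime / (uptime + repair time)."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_minutes(mtbf_hours, mttr_hours):
    """Amortize the MTTR over the MTBF period, scale to one year, convert to minutes."""
    return mttr_hours / mtbf_hours * HOURS_PER_YEAR * 60


if __name__ == "__main__":
    # The component from the quote: MTBF 250,000 hours, MTTR 1 hour.
    print(f"availability: {availability(250_000, 1):.6%}")
    print(f"downtime:     {annual_downtime_minutes(250_000, 1):.1f} min/yr")  # 2.1 min/yr
```

The 2.1 minutes per year only comes out when the 1-hour MTTR is divided by the MTBF and then multiplied by the hours in a year.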

So to calculate the physical HA stats, you need to reduce your network to component levels and do the calculations. Cables are not usually included in these calculations.


– Tim


Any thoughts?

OK, I’ll bite.

CAUTION: As always, my email is long, wordy, technical and sometimes skirts off topic; however, I’ve got to put up with free marketing references to Cisco/Juniper at every turn on NANOG. It’s nice to get Foundry’s name here every once in a while. We’re all good companies. For the most part (95%), I’ll stay on HA operational topics for the NANOG reader.

Some recommended books on this subject are listed below. I will refer to these books during this email:

Top Down Network Design by Priscilla Oppenheimer (Cisco Press)
Designing Enterprise Routing and Switching Architectures by Howard Berkowitz (Macmillan)
High Availability Network Fundamentals by Chris Oggerino (Cisco Press)

There are lots of other references, including industry standards by Telcordia (how’s your calculus?). See High Availability Network Fundamentals for a good root reference list.

I guess the main thing to do is look on page 48 of Priscilla’s book. She categorizes customer requirements and recommends a method to prioritize those requirements for network design tradeoffs. I’ve added “profitability” to the list for service providers. You can tailor it to your business goals. I go back to this a lot and it helps me know where availability sits as a design requirement.

Now we have to think about what you mean by “Industry Standard”. This definitely depends on the industry; however, it also varies per company, based on design requirements mapped to the business goals of YOUR company in that industry. Obviously, having better availability is just one part of a multi-faceted competitive business plan. In some industries, high availability is assumed as a basic. Others ignore it.

Some components you must look at are:

Human Error/Process Error
Physical Infrastructure Security and Robustness
Equipment Quality
Special Events, Risks, and Threats (Sean Donelan digging up your fiber, hacker attack, governmental or organizational shutdown/de-peering, war or political unrest, resource shortages, economy, insert your imagination here)

On the technology side, basically: the lower in the stack you push your redundant failover technology, the better your failover. SONET APS can fail over in 50 ms over and over again. L2 and L3 protocols continue to operate as normal with minimal In-Flight Packet (IFP) loss. This is exactly why the 10GbE Forum is promoting APS in the 10GbE WAN PHY!

Foundry Networks (my company) has two new technologies that can give you sub-second failover and avoid the failover of L3 and slower L2 redundancy protocols (RSTP and STP). The technologies are Metro Ring Protocol (MRP) and Virtual Switch Redundancy Protocol (VSRP). Both of these are currently in beta (soon to be released), but I’ve been playing with them for the past week. VSRP is VRRP on L2 steroids (sub-second failover). It is easy to understand (one switch is actively forwarding while the other is on standby). All of these L2 protocols are interoperable on the same devices in the same networks (RSTP, STP, VSRP, MRP). A customer can run STP with a provider VSRP edge and MRP core. VLAN stacking and STP tunneling are supported for those of you looking at Metro business plans. Below is an example of HA technology with MRP. Take a look at this topology:

PE1 1 | ___P2A___P2C
_____P1B/ 2 | _____P4A
___P2B___P3A/ 3 |

I’ve got link P2B to P3A running MPLS (LER to LER, don’t ask why, it’s just a lab) OC-48 (wire speed 2.5G) with Draft Martini L2 VPN. Link P2A to P2C is 802.3ae draft 4.2 compliant 10GbE. All other links are GbE. I’ve got 50 VLANs. 25 of them travel clockwise around the rings and 25 of them travel counterclockwise. Each set of 25 is placed in a topology group and runs an instance of MRP on the lead (master) VLAN of that topology group. Rings are 1, 2, and 3. I really hope my diagram shows up OK for the readers of this email.

The MRP ring masters are PE1 for ring 1, P1B for ring 2, and PE2 for ring 3. MRP masters send out Ring Health Packets (RHPs) around the ring every 100 ms (configurable). They originate these out of their primary ports and receive them on their secondary ports. As long as the RHPs keep arriving, the ring is intact and the master blocks forwarding on its secondary port. When the RHPs stop arriving (ring broken), the master transitions the secondary port to forwarding.
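To make the master's behavior concrete, here is a toy Python model of the secondary-port logic described above. It is a sketch only, not Foundry's implementation: the class and method names are hypothetical, and the `missed_limit` of 3 intervals before declaring the ring broken is my assumption (the text only gives the 100 ms RHP interval).

```python
class RingMaster:
    """Toy model of an MRP ring master's secondary port (illustrative only)."""

    def __init__(self, hello_ms=100, missed_limit=3):
        self.hello_ms = hello_ms            # RHP interval; 100 ms per the text
        self.missed_limit = missed_limit    # ASSUMPTION: missed intervals before "ring broken"
        self.ms_since_rhp = 0
        self.secondary_state = "blocking"   # healthy ring: block to prevent a loop

    def rhp_received(self):
        """An RHP made it all the way around and arrived on the secondary port."""
        self.ms_since_rhp = 0
        self.secondary_state = "blocking"

    def tick(self, elapsed_ms):
        """Advance time; unblock the secondary port if RHPs have stopped arriving."""
        self.ms_since_rhp += elapsed_ms
        if self.ms_since_rhp >= self.hello_ms * self.missed_limit:
            self.secondary_state = "forwarding"
        return self.secondary_state


if __name__ == "__main__":
    master = RingMaster()
    print(master.tick(100))        # blocking: only one interval without an RHP
    print(master.tick(200))        # forwarding: 300 ms of silence, ring assumed broken
    master.rhp_received()          # link restored, RHPs circulate again
    print(master.secondary_state)  # blocking
```

With these assumed numbers the worst-case detection time is a few hundred milliseconds.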

Now let’s assume that all traffic is taking the bottom path via MRP primary paths on the masters. OK, let’s start pinging ( is PE2 loopback address):

PE1#ping count 100000000 time 800
Sending 100000000, 16-byte ICMP Echo to, timeout 800 msec, TTL 64
Type Control-c to abort
511000Request timed out. < Here I unplug PE1 to P1B link (primary path). 1 In-Flight Packet (IFP) lost.
854000Request timed out. < Here I unplug P1B to P2B link (primary path). 1 IFP lost.
1160000Request timed out.< Here I unplug P3A to PE2 link (primary path) 2 IFP’s lost. All traffic on secondary path now.
Request timed out.
1513000Request timed out. < Here I plug P3A to PE2 link back. 2 IFP’s lost.
Request timed out.
1638000Request timed out. < Plug P1B to P2B link back in. 1 IFP lost.
1823000Request timed out. < Plug PE1 to P1B link back. 2 IFP’s lost.
Request timed out.

Not too bad considering that MRP is a software technology, eh? Also, the CPUs of all the devices are at 1%!

802.17 Resilient Packet Ring (RPR) is supposed to do EXACTLY what MRP does, but faster 'cause it is in HW. Personally, I don’t think the industry needs another L2 technology. Ethernet will be just fine with APS in the WAN PHY (Coming this year I’m sure)! RPR is not Ethernet and will be more expensive. I was a Token Ring fan. I’ve learned to respect Ethernet and I regard Ethernet as the clear winner. My XBOX™ at home has Ethernet (NOTE: My XBOX has only rebooted suddenly on me 3 times as opposed to ZERO for my PS2. Thanks MS!)! ATM Segmentation and Reassembly on OC-192 will be a lot more costly than simple 10GbE as well. I’m not even sure if SAR has the capability to do it at wire speed today. I’ve seen nothing on this from the ATM front. Ethernet will be at 40GbE (OC-768) before ATM SAR is at OC-192. My money is on Ethernet. LINX is just one of many folks running 10GbE today! Took them 3 minutes to make the conversion from what I read in the press release. Wonder how long it would take to do an RPR upgrade from GbE (Haven’t seen a working RPR network yet. I have seen MRP on 10GbE, GbE, and POS). ATM?

We can see here that the technology can get us to the point of 100% availability (I don’t consider one or two packets per user session on a link failover as downtime. Do you?); however, as you can see from my design, I’ve got Single Points of Failure (SPOFs). I can easily design more rings (at more cost) and remove these SPOFs. Now the only question is: What are your business goals and your acceptable amount of downtime? I want to point you to Howard Berkowitz’s book for some advice on downtime tolerance. I don’t want to explain it here; however, the unit of measurement is called the “Fulford”. Howard talks about a network design requirement of no more than two Fulfords a year. Hilarious scenario, but often true. Howard is also quick to point out that simply having redundant links does not equal high availability! Good read (although Howard can get a bit repetitious, it helps drive the main points home).

I think that there is no acceptable industry standard that you can simply overlay onto an individual company’s requirements. It is all customized. Some folks are happy with slow convergence, as long as the phone doesn’t ring. Some users accept a provider being down for 30 minutes every Sunday night. All relevant. Some providers have governmental reporting requirements if they have downtime.

One other thing. If an organization doesn’t have downtime reporting processes and run charts, then I feel they really don’t know what they’ve got. You can calculate device serial and parallel availability by using MTBF/MTTR calculations. These are all probabilities. There is a much bigger picture. How many times does a network engineer mess something up and then hide it? I think everyone on this list has made a serious mistake and caused network downtime WITHOUT reporting it. I caused about 10 seconds of downtime on a large service provider network by accidentally removing the default route (fat fingered!) not two months ago. The guy in charge said: “Only 10 seconds? Don’t worry 'bout it. Nobody will ever know.” Now you know. Plan, Do, Check, Act, Repeat. You must have processes to track and improve your availability. Else you are doing nothing but talking 'bout it and you are still clueless.
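For the serial and parallel calculations mentioned above, here is a minimal Python sketch. The function names are mine, and it assumes component failures are independent, which real networks often violate (shared power, shared conduit, shared humans). It covers only the equipment probabilities, not the process and human-error side.

```python
from math import prod


def serial(*avail):
    """Every component must be up (a chain): multiply the availabilities."""
    return prod(avail)


def parallel(*avail):
    """Any one component is enough (redundant paths): 1 minus P(all are down)."""
    return 1 - prod(1 - a for a in avail)


if __name__ == "__main__":
    box = 0.999  # one device at "three nines" (illustrative figure)
    print(f"two in series:   {serial(box, box):.6f}")    # 0.998001
    print(f"two in parallel: {parallel(box, box):.6f}")  # 0.999999
```

Chaining devices always lowers availability, while redundancy raises it only to the extent that the parallel paths really fail independently.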

Bottom line, there is no hard and fast rule. Everyone wants 5 nines (99.999%), yet how can you get that using routing protocol (L3) or spanning tree (L2) redundancy technologies? I’m a big L3 fan in my designs, but I understand the convergence factors. VSRP on an L2 core could give you sub-second failover. L3 BGP may be your only choice. OSPF with link aggregation and auto-cost decrementation (is this a word?) on link failure is another option (hey, when you aggregate 4x10GbE, losing one link can have a significant impact on a network core). For example, you can get 50 ms to 5 seconds of failover using Rapid Spanning Tree, but it takes the normal STP failover time when returning to the original path. This is almost a minute of downtime. Not too good on the statistics.
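To put the “almost a minute” in perspective, here is a small Python sketch of the yearly downtime budget at a given availability target. The function names are mine. The 50-second STP reconvergence is an assumed figure standing in for “almost a minute” (it matches the classic default timers: 20 s max age plus two 15 s forward-delay periods), and a 365-day year is assumed.

```python
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000; leap years ignored


def downtime_budget_seconds(availability_pct):
    """Seconds of downtime allowed per year at the given availability percentage."""
    return (1 - availability_pct / 100) * SECONDS_PER_YEAR


def events_allowed(availability_pct, outage_seconds):
    """How many outages of a given length fit inside the yearly budget."""
    return int(downtime_budget_seconds(availability_pct) // outage_seconds)


if __name__ == "__main__":
    for target in (99.9, 99.97, 99.99, 99.999):
        minutes = downtime_budget_seconds(target) / 60
        print(f"{target}% -> {minutes:.2f} min/yr")
    # ASSUMPTION: one classic STP reconvergence costs about 50 seconds.
    print(events_allowed(99.999, 50), "STP reconvergences fit in a five-nines year")
```

At five nines the whole year's budget is about 5.26 minutes, so a handful of STP-speed reconvergences uses it up before any real equipment failure is counted.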

I worked on US Military networks before. Their redundancy is not only terrestrial, but uses aircraft and satellites. Wanna buy a used AWACS? See how it all relates? How much money you got? What are your goals? How smart are your humans (training is the most upstream process)? Save money and use monkeys? Now back to your original question:

Is this a hard and fast rule or is this a value that we all try and emulate as best we can? Do I have the value incorrect? Is it higher or lower?

Set your own standard. I doubt you’ll find the right answer on NANOG. If you want my generic answer, I’d say you want 99.999% availability from all network endpoints to network endpoints during times of network utilization. I doubt you’ll hear many complaints from users/customers at this level. Please be careful when jumping this high. You could pull a muscle (take away from another key requirement such as Cost, Manageability, Security, Reliability, et al.).

Gary Blankenship
Systems Engineer - Japan