Data Center testing

Does anyone know of any data centers that regularly do failure testing of their
networking equipment? I mean testing to verify that everything still fails over
properly after changes have been made over time. Are there any best-practice
guides for doing this?

Thanks,
Dan

"-----Original Message-----

I know Peer1 in Vancouver regularly sends out notifications of
"non-impacting" generator load testing, roughly monthly. InterXion
in Dublin, Ireland has also occasionally sent me notifications that there
was a power outage of less than a minute but their backup
successfully took the load.

I only remember one complete outage at Peer1 a few years ago... I've never
seen any outage at InterXion Dublin.

I also don't ever remember any power failure at AiNET (Deepak will
probably elaborate).

We have done power tests before and had no problems. I guess I am looking
for someone who does testing of the network equipment beyond just power
tests. We had an outage due to a configuration mistake that only became
apparent when a switch failed; it hadn't caused a problem when we did a
power test of the whole data center.

-Dan


The plus side of failure testing is that it can be controlled. The downside is that you can induce a real failure. Maintenance windows are cool, but some people really dislike failures of any type, which limits how often you can test. I personally try for once a year; however, a lot can go wrong in a year.

Jack

Thanks for the kind words Ken.

Power failure testing and network testing are very different disciplines.

We operate from the point of view that if a failure occurs because we have scheduled testing, it is far better since we have the resources on-site to address it (as opposed to an unplanned event during a hurricane). Not everyone has this philosophy.

This is one of the reasons we do monthly or bimonthly full live-load transfer tests on power at every facility we own and control during the morning hours (~10:00 am local time on a weekday, run on gensets for up to two hours). Of course there is sufficient staff and contingency planning on-site to handle almost anything that comes up. The goal is to have a measurable "good" outcome at our highest reasonable load levels [temperature, data load, etc.].

We don't hesitate to show our customers and auditors our testing and maintenance logs, go over our procedures, etc. They can even watch events if they want (we provide the ear protection). I don't think any facility of any significant size can operate differently and do it well.

This is NOT advisable for folks who do not do proper preventative maintenance on their transfer busways, PDUs, switches, batteries, transformers and, of course, generators. The goal is to identify questionable relays, switches, breakers and other items that may fail in an actual emergency.

On the network side, during scheduled maintenance we do live failovers -- sometimes as dramatic as pulling the cable without preemptively removing traffic. Part of *our* procedure is to make sure everything reroutes and heals the way it is supposed to before the work actually starts. Often network and topology changes accumulate over time and no one has had a chance to actually test that all the "glue" works right. Regular planned maintenance (if you have a fast-reroute capability in your network) is a very good way to handle it.
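
As a rough sketch of the kind of pre-check that implies (host, ifIndex, community string and the 1 Mb/s threshold are placeholders, and net-snmp's snmpget is assumed to be on hand), you can sample the link's octet counter twice and refuse to proceed if it is still loaded:

    #!/usr/bin/env python3
    # Sketch: before pulling a cable in a maintenance window, confirm the link
    # has actually drained by sampling its traffic counter twice over SNMP.
    import subprocess, time

    HOST      = "core1.example.net"    # placeholder device
    COMMUNITY = "public"               # placeholder SNMP community
    IFINDEX   = 17                     # ifIndex of the link you are about to fail
    OID       = "1.3.6.1.2.1.31.1.1.1.6.%d" % IFINDEX   # IF-MIB::ifHCInOctets
    THRESHOLD_BPS = 1_000_000          # call it "drained" below 1 Mb/s

    def hc_in_octets():
        out = subprocess.check_output(
            ["snmpget", "-Oqv", "-v2c", "-c", COMMUNITY, HOST, OID], text=True)
        return int(out.strip())

    a = hc_in_octets()
    time.sleep(30)
    b = hc_in_octets()
    bps = (b - a) * 8 / 30.0
    if bps > THRESHOLD_BPS:
        print("link still carrying %.1f Mb/s -- do not pull it yet" % (bps / 1e6))
    else:
        print("link looks drained (%.1f kb/s), proceed per the plan" % (bps / 1e3))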

For sensitive trunk links and non-invasive maintenance, it is nice to softly remove traffic via local pref or whatever in advance of the maintenance to minimize jitter during a major event.

As part of your plan, be prepared for things like connectors (or cables) breaking, and have a plan for what you do if that occurs. Have a plan, or a rain date, in case a connector takes a long time to get out or the blade it sits in gets damaged. This stuff looks pretty while it's running, and you don't want something that has been friction-frozen to ruin your window.

All of this works swimmingly until you find a vendor (X) bug. :) Not for the faint-of-heart.

Anyone who has more specific questions, I'll be glad to answer off-line.

Deepak Jain
AiNET


At least once a year I like to go out and kick the service entrance
breaker to give the whole enchilada an honest-to-$deity plugs-out test.
As you said, not recommended if you don't maintain stuff, but that's how
confident I am that my system works.

~Seth


Nature has a way of testing it, even if you don't. :)

For those who haven't seen this occur, make sure you have a plan in case your breaker doesn't flip back to the normal position, or your transfer switch stops switching (in either direction -- for example, it fuses itself into the "generator/emergency" position).

For small supplies (say <1MW) it's not as big a deal, but when the breakers in a bigger facility can weigh hundreds of pounds each and can take months to replace, these are real issues and will test your sparing, consistency and other disciplines.

Deepak Jain
AiNET

Dan,

With all due respect, if there are config changes being made to your
devices that aren't authorized or in accordance with your standards (you
*do* have config standards, right?) then you don't have a testing problem,
you have a data integrity problem. Periodically inducing failures to catch
them is sorta like using your smoke detector as an oven timer.

There are several tools that can help in this area; a good free one is
rancid [1], which logs in to your routers and collects copies of configs
and other info, all of which gets stored in a central repository. By
default, you will be notified via email of any changes. An even better
approach than scanning the hourly config diff emails is to develop scripts
that compare the *actual* state of the network with the *desired* state and
alert you if the two are not in sync. Obviously this is more work because
you have to have some way of describing the desired state of the network in
machine-parsable format, but the benefit is that you know in pseudo-realtime
when something is wrong, as opposed to finding out the next time a device
fails. Rancid diffs + tacacs logs will tell you who made the changes, and
with that info you can get at the root of the problem.
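
A minimal sketch of that desired-vs-actual comparison (this is not part of rancid; the device names, keys and values are invented, and the "collection" step is a hard-coded stand-in):

    #!/usr/bin/env python3
    # Sketch: keep the *desired* state of each device in machine-readable form,
    # diff it against the *actual* state you collect, and report any drift.

    DESIRED = {
        "edge1": {"bgp_neighbors": ["192.0.2.1", "192.0.2.5"], "ge-0/0/0_ospf_cost": 10},
        "edge2": {"bgp_neighbors": ["192.0.2.9"], "ge-0/0/0_ospf_cost": 10},
    }

    def collect_actual(device):
        # Stand-in for real collection (parse rancid's config repo, poll the
        # box, etc.); hard-coded here so the sketch runs end to end.
        return {"bgp_neighbors": ["192.0.2.1"], "ge-0/0/0_ospf_cost": 10}

    def drift(want, have):
        out = []
        for key, val in want.items():
            if have.get(key) != val:
                out.append("%s: desired %r, actual %r" % (key, val, have.get(key)))
        for key in set(have) - set(want):
            out.append("%s: on the device but not in the desired state" % key)
        return out

    for device, want in DESIRED.items():
        for line in drift(want, collect_actual(device)):
            print("DRIFT %s %s" % (device, line))   # feed this to mail/syslog/a pager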

Having said that, every planned maintenance activity is an opportunity to
run through at least some failure cases. If one of your providers is going
to take down a longhaul circuit, you can observe how traffic re-routes and
verify that your metrics and/or TE are doing what you expect. Any time you
need to load new code on a device you can test that things fail over
appropriately. Of course, you have to be willing to just shut the device
down without draining it first, but that's between you and your customers.
Link and/or device failures will generate routing events that could be used
to test convergence times across your network, etc.
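
One way to measure the convergence window during such an event, sketched with a placeholder target and assuming Linux ping: keep a steady stream of probes running across the affected path and report the longest gap in responses.

    #!/usr/bin/env python3
    # Sketch: start this across the path you are about to fail, trigger the
    # failover, and the worst gap between successful probes approximates your
    # convergence time (to within the probe interval plus timeout).
    import subprocess, time

    TARGET   = "198.51.100.10"   # placeholder: something beyond the failed link
    INTERVAL = 0.2               # seconds between probes
    DURATION = 120               # total test length in seconds

    ok_times, start = [], time.monotonic()
    while time.monotonic() - start < DURATION:
        t = time.monotonic()
        alive = subprocess.call(
            ["ping", "-c", "1", "-W", "1", TARGET],
            stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL) == 0
        if alive:
            ok_times.append(t)
        time.sleep(INTERVAL)

    gaps = [b - a for a, b in zip(ok_times, ok_times[1:])]
    print("worst gap in reachability: %.1f s" % (max(gaps) if gaps else DURATION))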

The key is to be prepared. The more instrumentation you have in place
prior to the test, the better you will be able to analyze the impact of the
failure. An experienced operator can often tell right away when looking at
a bunch of MRTG graphs that "something doesn't look right", but that doesn't
tell you *what* is wrong. There are tools (free and commercial) that can
help here, too. Have a central syslog server and some kind of log reduction
tool in place. Have beacons/probes deployed, in both the control and data
planes. If you want to record, analyze, and even replay routing system
events, you might want to take a look at the Route Explorer product from
Packet Design [2].
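
As a toy example of the log-reduction side (the log path and message formats here are placeholders), collapsing numbers and interface names so that repeated messages group together goes a long way during a test window:

    #!/usr/bin/env python3
    # Toy log reduction for a central syslog file: normalize each message so
    # repeats group together, then print the most frequent templates.
    import re
    from collections import Counter

    LOGFILE = "/var/log/network.log"   # placeholder: wherever central syslog lands

    def template(line):
        msg = line.split(": ", 1)[-1].strip()        # drop timestamp/host prefix
        msg = re.sub(r"\d+", "N", msg)               # collapse numbers
        msg = re.sub(r"\S*Ethernet\S*", "IF", msg)   # collapse interface names
        return msg

    counts = Counter()
    with open(LOGFILE, errors="replace") as f:
        for line in f:
            counts[template(line)] += 1

    for msg, n in counts.most_common(20):
        print("%6d  %s" % (n, msg))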

You said "switch failure" above, so I'm guessing that this doesn't apply
to you, but there are also good network simulation packages out there.
Cariden [3] and WANDL [4] can build models of your network based on actual
router configs and let you simulate the impact of various scenarios,
including device/link failures. However, these tools are more appropriate
for design and planning than for catching configuration mistakes, so
they may not be what you're looking for in this case.

--Jeff

[1] http://www.shrubbery.net/rancid/
[2] http://www.packetdesign.com/products/rex.htm
[3] http://www.cariden.com/
[4] http://www.wandl.com/html/index.php

There's more to data integrity in a data center (well, anything powered,
that is) than network configurations. There's the loading of individual
power outlets, UPS loading, UPS battery replacement cycles, loading of
circuits, backup lighting, etc. And the only way to know if something is
really working like it's designed is to test it. That's why we have
financial auditors, military exercises, fire drills, etc.

So while your analogy emphasizes the importance of having good processes in
place to catch the problems up front, it doesn't eliminate throwing the
switch.

Frank

Config checking can't say much about silent hardware failures.
Unanticipated problems are likely to arise in failover systems,
especially complicated ones. A failover system that has not been
periodically verified may not work as designed.

Simulations, config review, and change controls are not substitutes
for testing, they address overlapping but different problems.
Testing detects unanticipated error; config review is a preventive
measure that helps avoid and correct apparent configuration issues.

Config checking (of both software and hardware choices) also helps to
keep out unnecessary complexity.

A human still has to write the script and review its output -- sooner or
later an operator error will occur that is accidentally omitted from both
the current state and the "desired" state, so there is a chance that an
erroneous entry escapes detection.

There can be other types of errors:
Possibly there is a damaged patch cable, dying port, failing power
supply, or other hardware on the warm spare that has silently degraded
and its poor condition won't be detected (until it actually tries
to take a heavy workload, blows a fuse, eats a transceiver, and
everything just falls apart).

Perhaps you upgraded a hardware module or software image X months ago,
to fix bug Y on the secondary unit, and the upgrade caused completely
unanticipated side effect Z.

Config checking can't say much about silent hardware problems.

Most provider-type data centers I've worked with get a lot of flak from
customers when they announce they're doing network failover testing, because
there's always going to be at least some chance of disruption. I think it's
the exception to find a provider that does it (or maybe just one that admits
it when they're doing it). Power tests are a different thing.

As for testing your own equipment, there are a couple of ways to do that:
regular failover tests (quarterly, or more likely at 6-month intervals),
and/or routing traffic so that you have some of your traffic on all paths
(i.e., internal traffic on one path, external traffic on another). The latter
doesn't necessarily tell you that your failover will work perfectly, only
that all your gear on the second path is functioning. I prefer doing both.

When doing the failover tests, no matter how good your setup is, there's
always a chance of taking a hit, so I always do this kind of work during a
maintenance window, not too close to quarter end, etc.
If you have your equipment set up correctly, of course, it goes like butter
and is a total non-event.

For test procedure, I usually pull cables. I'll go all the way to line cards
or power cables if I really want to test, though that can be hard on
equipment.

E

James Hess wrote:

Config checking can't say much about silent hardware failures.
Unanticipated problems are likely to arise in failover systems,
especially complicated ones. A failover system that has not been
periodically verified may not work as designed.

I've seen 3-4 failover failures in the last year alone on the SONET transport gear. In almost every case, the backup cards were dead when the primary either died or induced errors that caused the telco to switch to the backup card. I have no doubt that they haven't been testing. While it didn't affect most of my network, I have a few customers that aren't multihomed, and it wiped them out in the middle of the day for up to 3 hours.

There can be other types of errors:
Possibly there is a damaged patch cable, dying port, failing power
supply, or other hardware on the warm spare that has silently degraded
and its poor condition won't be detected (until it actually tries
to take a heavy workload, blows a fuse, eats a transceiver, and
everything just falls apart).

Lots of weird things to test for. I remember once rebooting a c5500 that had been cruising along for 3 years, and the bootup diag detected half a line card as bad, even though it had been running decently up until the reload. Over the years, I think I've seen or detected everything you mentioned, either during routine testing or in production "oh crap" events.

Jack

I would hope that the data center engineers built and ran a suite of tests to find failure points before the network infrastructure was put into production. That said, changes are made constantly to the infrastructure, and it can very quickly become difficult to know whether the failovers are still going to work.

This is one place where the power and the network in a data center diverge. The power systems may take on additional load over the course of the life of the facility, but the transfer switches and generators do not get many changes made to them. Also, network infrastructure tests are not going to be zero-impact if there is a config problem. Generator tests are much easier: you can start up the generator and do a load test, you can load test the UPS systems as well, and then you can initiate your failover.

Network tests are not going to be zero-impact even if there isn't a problem. Say you wanted to power-fail an edge router participating in BGP; it can take 30 seconds for that router's routes to get withdrawn from the BGP tables of the world. The other problem is that network failures always seem to come from "unexpected" issues. I always love it when I get an outage report from my ISPs or data center and they say an "unexpected issue" or "unforeseen issue" caused the problem.

Dylan

The idea of regular testing is essentially to detect failures on your time schedule rather than entropy's (or Murphy's). There can be flaws in your testing methodology too. This is why generic load-bank tests and network load simulators rarely tell the whole story.

Customers are rightfully displeased with any testing that affects their normal peace of mind, and doubly so when it affects actual operational effectiveness. However, since no system can operate indefinitely without maintenance, failovers and other such events, the question of taking a window is not negotiable. The only thing that is negotiable (somewhat) is when, and only in one direction (ahead of the item failing on its own).

So, taking this concept to networks: it's not negotiable whether a link or a device will fail; the question is only how long you are going to forward bits along the dead path before rerouting, and how long that rerouting will take. SONET says about 50 ms, standard BGP about 30-300 seconds. BFD and other things may improve these dramatically in your setup. You build your network around your business case and vice versa.
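
To put rough numbers on that, assuming 10 Gb/s of offered load on the failed path (the load figure and the BFD timing are illustrative assumptions, not anyone's measurements):

    #!/usr/bin/env python3
    # Back-of-the-envelope: traffic forwarded down a dead path before rerouting,
    # for an assumed 10 Gb/s of offered load.
    LINK_GBPS = 10.0
    for name, seconds in [("SONET APS (~50 ms)",      0.050),
                          ("BFD-assisted (~300 ms)",  0.300),
                          ("BGP hold timer (~180 s)", 180.0)]:
        lost_bytes = LINK_GBPS * 1e9 / 8 * seconds
        print("%-24s %12.1f MB blackholed" % (name, lost_bytes / 1e6))

At 10 Gb/s, 50 ms is roughly 60 MB sent down the dead path; 180 seconds is well over 200 GB.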

Clearly, most of the known universe has decided that BGP time is "good enough" for the Internet as a whole right now. Most are aware of the costs in terms of overall jitter, CPU and stability if we reduce those times too far.

It's intellectually dishonest to talk about never losing a packet or never forwarding along a dead path for even a nanosecond when the state of the art says something very different indeed.

Deepak Jain
AiNET

There's more to data integrity in a data center (well, anything powered,
that is) than network configurations.

Understood and agreed. My point was that induced failure testing isn't
the right way to catch incorrect or unauthorized config changes, which is
what I understood the original poster to have said was his problem. My
apologies if I misunderstood what he was asking.

So while your analogy emphasizes the importance of having good processes in
place to catch the problems up front, it doesn't eliminate throwing the
switch.

Yup, and it's precisely why I suggested using planned maintenance events
as one way of doing at least limited failure testing.

--Jeff

Any suggested tools for describing the desired state of the network?

NDL, the only option I'm familiar with, is just a brute-force approach
to describing routers in XML. This is hardly better than a
router-config, and the visualizations break down on any graph with
more than a few nodes or edges. I'd need thousands to describe
customer routers.

Or do we just give up on describing all of those customer-facing
interfaces, and only manage descriptions for the service-provider part
of the network? This seems to be what people actually do with network
descriptions (oversimplify), and that doesn't seem like much of a
description to me.

Is there a practical middle-ground between dismissing a multitude of
relevant customer configuration and the data overload created by
merely replicating the entire network config in a new language?
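
To make the question concrete, the sort of middle ground I have in mind would describe customer-facing ports by role plus a few per-customer overrides and expand that programmatically, something like this sketch (roles, devices and attribute names are all invented, and this is not NDL):

    #!/usr/bin/env python3
    # Sketch: role-based description of customer ports, expanded into the full
    # per-interface desired state that a checker could consume.

    ROLES = {
        "cust-std":   {"mtu": 1500, "policer_mbps": 100,  "urpf": "strict"},
        "cust-jumbo": {"mtu": 9100, "policer_mbps": 1000, "urpf": "strict"},
    }

    PORTS = {  # device -> interface -> (role, overrides)
        "edge1": {
            "ge-0/0/1": ("cust-std",   {}),
            "ge-0/0/2": ("cust-std",   {"policer_mbps": 50}),
            "xe-1/0/0": ("cust-jumbo", {}),
        },
    }

    def expand(ports):
        desired = {}
        for device, ifs in ports.items():
            for ifname, (role, overrides) in ifs.items():
                state = dict(ROLES[role])
                state.update(overrides)
                desired[(device, ifname)] = state
        return desired

    for (device, ifname), state in expand(PORTS).items():
        print(device, ifname, state)   # feed this into an actual-vs-desired diff

The expansion step is what keeps the description smaller than the configs it checks.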

Ross

Well, at least it's better than "yeah, we knew about it, but didn't think it
was worth worrying about".

- Matt

We have done power tests before and had no problem. I guess I am looking
for someone who does testing of the network equipment outside of just power
tests. We had an outage due to a configuration mistake that became apparent
when a switch failed.

So, one of the better ways to make sure that your failover system is working when you need it is just to do away with the concept of a failover system and make your "failover" system part of your "primary" system.
This means that your failover system is always passing traffic and you know that it is alive and well -- it also helps mitigate the pain when a device fails (you are sharing the load over both systems, so only half as much traffic gets disrupted). Scheduled maintenance is also simpler and less stressful, as you already know that your other path is alive and well.

Your design and use case dictate exactly how you implement this, but in general it involves things like tuning your IGP so you are using all your links, staggering VLANs if you rely on them, multiple VRRP groups per subnet, etc.

This does require a tiny bit more planning during the design phase, and also requires that you check every now and then to make sure that you are actually using both devices (and didn't, for example, shift traffic to one device and then forget to shift it back :-)).
It also requires that you keep capacity issues in mind -- in a primary-and-failover scenario you might be able to run devices fairly close to capacity, but if you are sharing the load you need to keep each device under 50% (so when you *do* have a failure, the remaining device can handle the full load); it's important to make this clear to the finance folks before going down this path :)
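
A tiny sketch of that 50% check (capacity and utilization figures are placeholders for whatever your polling system reports):

    #!/usr/bin/env python3
    # Sketch: verify an active/active pair still has failover headroom, i.e.
    # that either member could absorb the other's load.

    CAPACITY_GBPS = 10.0   # per-device capacity
    PAIRS = {
        "dist-a/dist-b": (4.2, 3.8),   # current Gb/s on each member
        "edge-1/edge-2": (6.5, 5.1),   # this pair would NOT survive a failure
    }

    for pair, (a, b) in PAIRS.items():
        combined = a + b
        if combined > CAPACITY_GBPS:
            print("WARNING %s: combined load %.1f Gb/s exceeds single-device "
                  "capacity %.1f Gb/s -- a failure here will congest" %
                  (pair, combined, CAPACITY_GBPS))
        else:
            print("ok      %s: %.1f of %.1f Gb/s, headroom intact" %
                  (pair, combined, CAPACITY_GBPS))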

W