Followup: British Telecom outage reason

BT is telling ISPs the reason for the multi-hour outage was
a software bug in the interface cards used in BT's core network.
BT installed a new version of the software. When that didn't fix
the problem, they fell back to a previous version of the software.

BT didn't identify the vendor, but BT is identified as a "Cisco Powered
Network(tm)." Non-BT folks believe the problem was with GSR interface
cards. I can't independently confirm it.

I'd be surprised if it was the GSR, and in any case that doesn't
absolve anyone. If it was a software issue, why wasn't the software
properly tested? Why was such a critical upgrade rolled out across
the entire network at the same time? It doesn't add up.

Neil.

It appears to be yet another CEF bug. If you want to use a GSR
you are stuck using some version of IOS with a CEF bug. The
question is which bug you want. Each version of IOS has
a slightly different set. Several US network providers have
been bitten by CEF bugs too.

While trying to fix one set of bugs, BT upgraded their network.
I'm not sure if they were upgrading at 9am, or had upgraded
earlier and the bug finally came out under load at 9am.
When the BT network melted down, Cisco suggested installing a
different version of IOS, which had previously been tested. At
noon, BT found the new version had an even worse bug, sending packets
out the wrong interface. It wasn't until 2200 (13 hours later) that
BT and Cisco found a version of IOS which stabilized the network.
"Stabilized," not fixed. The running version of IOS still has a bug,
but it isn't as severe.

They probably did. The vendor probably did also. Of course, they can't
always simulate real network conditions. Nor can your own labs. Heck,
even a small deployment on 2 or 3 routers (out of, say, 200) can't
catch everything. It is a simple fact that some bugs don't show up
until it's too late.

And cascade failures occur more often than you might think (and not
necessarily from software). Remember the AT&T frame outage? Procedural
error. How about the Netcom outage of a few years ago? Someone
misplaced a '.*' if I remember correctly. Human error of the simplest
kind. I've had a data center go offline because someone slipped and
turned off one side of a large breaker box.

These things happen.

The challenge is to eliminate the ones you CAN control. And, IMO, the
industry is generally doing a good job of that.

I chalk this whole thing up to bad karma for BT.

-Wayne

> I'd be surprised if it was the GSR, and in any case that doesn't
> absolve anyone. If it was a software issue, why wasn't the software
> properly tested? Why was such a critical upgrade rolled out across
> the entire network at the same time? It doesn't add up.

  Even after a full lab test as well as a limited "real-life"
deployment, you can still never see all the possible cases that may
come back to haunt you later. Sometimes you do an across-the-board
upgrade for security reasons, or for specific feature/bugset reasons, to
move the set of bugs into the "we know what they are and how to deal
with them" category.

  No vendor claims to have perfect software. Nor will you find
anyone but an irresponsible vendor suggesting that any specific
image is "perfect".

It appears to be yet another CEF bug. If you want to use a GSR
you are stuck using some version of IOS with a CEF bug. The
question is which bug you want. Each version of IOS has
a slightly different set. Several US network providers have
been bitten by CEF bugs too.

  True, but most of those are in the past. I'm not familiar
with the specifics of the bugs that BT encountered, but something
worth noting is the ability of a Cisco router to keep functioning
when it is in a "broken" state and you need to get a 'fixed' image onto it.

  It would be nice if there were easier ways to do that in some cases,
but you can't have a perfect environment, especially when you do software
upgrades: you don't always have on-site hands standing by to help you swap
flash cards or deal with whatever logistical issues you may encounter.

While trying to fix one set of bugs, BT upgraded their network.
I'm not sure if they were upgrading at 9am, or had upgraded
earlier and the bug finally came out under load at 9am.
When the BT network melted down, Cisco suggested installing a
different version of IOS, which had previously been tested. At
noon, BT found the new version had an even worse bug, sending packets
out the wrong interface. It wasn't until 2200 (13 hours later) that
BT and Cisco found a version of IOS which stabilized the network.
"Stabilized," not fixed. The running version of IOS still has a bug,
but it isn't as severe.

  I'm sure that, after such a public outage, BT and Cisco have had
some conversations about what can be done to improve the testing that
Cisco does so that it better simulates BT's network.

You are right. But in an era of focusing on the box, vendors are
forgetting that solid software and knowledgeable support are just as
important.

Possibly slow down a bit on rolling all those new features and widgets
into the software.... Make the software do what it should, reliably.. then
put the new stuff in there.

i.e., bug scrub a train, per chassis. Make it solid.. then put the toyz
in.

These days you don't see boxes hitting the one-year mark that often. It is
usually interrupted somewhere in the 20-week range with something beautiful
like:

SBN uptime is 2 weeks, 4 days, 6 hours, 12 minutes
System returned to ROM by processor memory parity error at PC 0x607356A0,
address 0x0 at 21:02:01 UTC Tue Nov 6 2001

or

BMG uptime is 34 weeks, 3 hours, 44 minutes
System returned to ROM by error - a Software forced crash, PC 0x6047F3E8
at 18:28:17 est Sat Mar 31 2001

or

LVX uptime is 24 weeks, 1 day, 20 hours, 21 minutes
System returned to ROM by abort at PC 0x60527DD4 at 00:38:36 EST Fri Jun 8
2001

At least it's not 0xDEADBEEF... yet.

<snip>

  No vendor claims to have perfect software. Nor will you find
anyone but an irresponsible vendor suggesting that any specific
image is "perfect".

<snip>

You are right. But in an era of focusing on the box, vendors are
forgetting that solid software and knowledgeable support are just as
important.

Possibly slow down a bit on rolling all those new features and widgets
into the software.... Make the software do what it should, reliably.. then
put the new stuff in there.

i.e., bug scrub a train, per chassis. Make it solid.. then put the toyz
in.

Easier said than done when everybody wants every fancy new feature 110%
solid and yesterday.

Possibly slow down a bit on rolling all those new features and widgets
into the software.... Make the software do what it should, reliably.. then
put the new stuff in there.

Yeah but... these activities support existing customers with existing
products, and they enable no revenue (the support fees are already in
the bag). To dedicate that much engineering energy -- probably more
than 50% of the corporate total if "doing it right" is the goal -- would
put new customer revenue and new product revenue at risk.

This icky tradeoff is why new (as in pre-IPO in some cases) vendors can
still get a fair test in existing networks. Eng&Ops type people have
told me more than once that they thought $NEW_ROUTER_VENDOR could be a
good investment simply because nearly 100% of their engineering resources
would be dedicated to making their small number of customers happy, and
being a larger customer amongst a small set increased this advantage even
more.

The big challenge at an established router company is in management of the
competing priorities more than in management of, or doing of, engineering.

Easier said than done when everybody wants every fancy new feature 110%
solid and yesterday.

Not everybody. What I want more than the fancy new feature is: honest
schedules and honest self-appraisals. A vendor who promises me what they
think I want to hear or maybe even what I really do wish I could hear is
not as valuable to me as a vendor who tells me the bold bald truth no
matter how much it hurts my proposed rollout schedule or how much it might
help one of their competitors who can deliver $FANCY_NEW_FEATURE earlier.

This icky tradeoff is why new (as in pre-IPO in some cases) vendors can
still get a fair test in existing networks. Eng&Ops type people have
told me more than once that they thought $NEW_ROUTER_VENDOR could be a
good investment simply because nearly 100% of their engineering resources
would be dedicated to making their small number of customers happy, and
being a larger customer amongst a small set increased this advantage even
more.

.. which is certainly true until small $NEW_ROUTER_VENDOR IPOs or
otherwise grows into not-so-small $NEW_ROUTER_VENDOR, as we have all
witnessed on numerous occasions. At which point, they're all the same
again. You only gain an advantage for a limited amount of time. There are
costs associated with this as well, which need to be recognized.

The big challenge at an established router company is in management of the
competing priorities more than in management of, or doing of, engineering.

And there also is a business reality. If you get almost everything right,
most people are happy with that. Few people demand, need, and can afford to
pay for perfection.

In fact, one could argue that it is poor design to rely on anything to
operate perfectly 100% of the time.

Now, if lack of infrastructure reliability can harm human life you may feel
differently, but that isn't the case for most of us at the present time.

We can sit here all day long and argue back and forth about strategies, from
multi-vendor networks to why single-sourcing offers advantages to why
bla bla is cool. The bottom line is that there is no free lunch here. If
you want perfection, you will pay for perfection: in house, or to your
vendor, or in lost revenue, or all of the above. And sometimes business cases
cannot support perfection. The trade-off that has to be made here is how
much "slack" you can get away with while still making your customers happy
and at the same time supporting your business case. Anything else has no
long-term viability. From an engineering perspective this view certainly
stinks, but when you take into account business realities, engineering's
perspective may be an illusion. It's the old wisdom of 'pick any two:
cheap, fast, reliable'.

Faults will happen. And nothing matters as much as how you prepare for
when they do.

Cheers,
Chris

"The parting on the left.... is now the parting on the right...
and the beards have all grown longer overnight"
  -- Pete Townsend, "Wont get fooled again"

And yes, router design *is* politics, not engineering.

Now, if lack of infrastructure reliability can harm human life you may feel
differently, but that isn't the case for most of us at the present time.

I've designed software and networks used for public safety and
emergencies. And yes, people have died on my watch. It is a somewhat
different mindset, but not that different. A lot of "good engineering
practice" applies to any engineering activity, including software
engineering.

It's not even a matter of cost. A typical hospital spends less on
its emergency power system than an Internet/telco hotel. The major
difference is the hospital staff knows (more or less) what to do when
the generators don't work.

The big secret is most "life safety" systems fail regularly. Most of
the time it doesn't matter because the "big one" doesn't coincide with
the failure.

Faults will happen. And nothing matters as much as how you prepare for
when they do.

Mean Time To Repair is a bigger contributor to Availability calculations
than the Mean Time To Failure. It would be great if things never failed.
But some people are making their systems so complicated chasing the Holy
Grail of 100% uptime, they can't figure out what happened when it does
fail.
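
A back-of-the-envelope sketch of that point (Python, with made-up numbers
purely to illustrate the arithmetic, using the usual steady-state model
A = MTTF / (MTTF + MTTR)):

  # Steady-state availability from mean time to failure and mean time to repair.
  def availability(mttf_hours, mttr_hours):
      return mttf_hours / (mttf_hours + mttr_hours)

  baseline     = availability(2000, 4)  # ~99.80%: fails every ~12 weeks, 4h to fix
  double_mttf  = availability(4000, 4)  # ~99.90%: a box twice as reliable
  quarter_mttr = availability(2000, 1)  # ~99.95%: same box, one-hour repair

  # Quartering the repair time buys more nines than doubling the MTTF,
  # which is why detection and repair procedures matter so much.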

Murphy's revenge: The more reliable you make a system, the longer it will
take you to figure out what's wrong when it breaks.

My first thought in response to this is the vendor's support costs -
wouldn't shipping more reliable images bring down those costs
significantly? Or is it just that the extra revenue opportunities gained
by adding $WHIZBANG_FEATURE_DU_JOUR outweigh those potential support
savings?

-C

Wandering off the subject of BT's misfortune ...

Sean Donelan wrote:

[...]

> Faults will happen. And nothing matters as much as how you prepare for
> when they do.

Mean Time To Repair is a bigger contributor to Availability calculations
than the Mean Time To Failure. It would be great if things never failed.

And Mean Time To Fault Detected (Accurately) is usually the biggest
sub-contributor within Repair, but that's kinda your point.

But some people are making their systems so complicated chasing the Holy
Grail of 100% uptime, they can't figure out what happened when it does
fail.

Similar people pursue the creation of a perpetuum mobile. A strange and
somewhat congruent example I stumbled into recently is:
http://www.sce.carleton.ca/netmanage/perpetum.shtml.

Overall simplicity of the system, including failure detection mechanisms,
and real redundancy are the most reliable tools for availability. Of course,
popping just a few layers out, profit and politics are elements of most systems.
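
To put a rough number on the "real redundancy" point (a deliberately
simplistic Python sketch: it assumes the two boxes fail independently and
that failover is instant and perfect, which is exactly the part that tends
not to stay simple):

  # Availability of a redundant pair vs. one hardened single box,
  # assuming independent failures and perfect failover.
  def pair(a):
      return 1 - (1 - a) ** 2

  single_hardened = 0.999       # one very good box: ~8.8 hours/year down
  so_so_pair      = pair(0.99)  # two so-so boxes: 0.9999, ~53 minutes/year down

  # Two merely decent boxes in parallel beat one excellent box -- but only
  # for as long as the failure detection stays simple enough to trust.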

Murphy's revenge: The more reliable you make a system, the longer it will
take you to figure out what's wrong when it breaks.

Hmm.

What's the upside to $ROUTER_VENDOR in reducing support cost? They already
make money on the support but can't make too much, so a reduction in cost
would probably imply a reduction in revenue. Also, given that network
engineers rarely make support cost a key issue in vendor selection and
negotiation, reducing support costs looks like it has little payback to
$ROUTER_VENDOR in terms of equipment sold. With that,
$WHIZBANG_FEATURE_DU_JOUR sure looks like a good profit decision.

To change this, stop buying gear from vendors that charge too much for support.

just my jaded opinion,
jerry

I'm referring to the _vendor's_ support costs - as in, you don't need as
many people in the TAC if people don't keep running into IOS bugs; you
don't need as large an RMA pool if the hardware is more reliable, etc.

As the vendor would most likely decline to pass these savings along to
the customer, I would see this as a profit opportunity for the vendor.

-C

This used to be the cc train, then later the S train. However, the S train
has never been as stable as cc, and it has become increasingly less stable
over time, with too many new features rolling in.

I'd be curious as to exactly which CEF bugs bit them. The introduction of
greater MPLS functionality seems to have given CEF a nasty bit of
destabilization.

- Daniel Golding

[I may digress a bit.. apologies]

I think there is a second-order effect on revenue, though. The more
reliable/stable the gear is, the less expensive the support contract you may
be inclined to get for it.

Would you really get 4-hour replacement if you can't think of the last time a
particular series of box failed? If you really need 4-hour replacement [this
gets mentioned very often] it's cheaper to keep spares of your own, so then
wouldn't next-business-day suffice?
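
Back-of-the-envelope, with completely made-up numbers (not real list or
contract pricing) just to show the shape of that trade-off:

  # Hypothetical per-router, per-year costs -- not real vendor pricing.
  list_price         = 100000
  four_hour_contract = 0.15 * list_price  # premium 4-hour replacement, per year
  nbd_contract       = 0.08 * list_price  # next-business-day cover, per year
  cold_spare         = list_price / 5.0   # one spare amortized over 5 years

  # Keep your own spare (shared across ~10 identical routers) plus NBD cover:
  diy = nbd_contract + cold_spare / 10

  print(four_hour_contract, diy)          # 15000.0 vs. 10000.0

With numbers anywhere in that neighbourhood, keeping your own spares wins
once more than a couple of identical boxes can share the spare.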

Although we keep spares for our low end switch blades and switches, they
really aren't worth keeping on Smartnet because 1) the OS virtually never
changes, and 2) they don't fail outside of warranty. Or at least, not in my
experience.

I think recognizing this, Cisco for example has given smaller switches
lifetime warranties with lifetime OS upgrades.

If routers worked [/could work] this way, no one with a good technical team
in house would buy support contracts.

[Think of the pharmaceutical industry] It's much more profitable to "treat"
an illness than to cure it. If, as an entity, your expensive gear is
expensive to maintain, there is more money to be made.
Mainframes were the same way until PCs and workstations continuously ate
away at their market share and at the cost of problem solving.

Today even though mainframes have gotten oodles cheaper to buy and run, if
you really want a mainframe, you have to have a great reason.

Deepak Jain
AiNET

My first thought in response to this is the vendor's support costs -
wouldn't shipping more reliable images bring down those costs
significantly? Or is it just that the extra revenue opportunities gained
by adding $WHIZBANG_FEATURE_DU_JOUR outweigh those potential support
savings?

When presented with an either-or decision like "Doing <X> will make $M but
doing <NOT-X> will save $S", in most public companies $S would have to be
more than $M x 2 before <X> stops happening. In this case $S is not even
a notable fraction of $M, so it's not even worth discussing.

Somebody here said router design was a political process. I disagree. But
networks are complex systems owing their existence (and their nature) to an
ever-shifting matrix of politics, economics, and physics. And Heraclitus'
maxim is very much apropos here. The target for a router designer is moving.

Easier said than done when everybody wants every fancy new feature 110%
solid and yesterday.

Not everybody. What I want more than the fancy new feature is: honest
schedules and honest self-appraisals. A vendor who promises me what they
think I want to hear or maybe even what I really do wish I could hear is
not as valuable to me as a vendor who tells me the bold bald truth no
matter how much it hurts my proposed rollout schedule or how much it might
help one of their competitors who can deliver $FANCY_NEW_FEATURE earlier.

It took a long, long time for one router vendor in particular
to pay any attention to a number of high-spending customers
who said 'stop implementing the {3,4}-letter acronym-du-jour
protocols, or at least stop trying to integrate them into
s/w we want to run; just fix the bugs in the s/w train we all run'.
The lesson seemed to be learnt for a little while, but then they
spent the next year trying (unsuccessfully) to abandon that release
train.

I can only assume what drives them is either (a) the desire
to support slideware protocols early, and in code people
actually try and use, (b) the knowledge that such slideware
protocols, in aggregate across a large network, eat more router
horsepower, and thus sales-$$$, for those gullible enough
to implement them, or (c) some combination of the two.

I guess at some point someone will realize routers are both
hardware and software, and -- shock horror -- both, if done
well, can actually add value. [Hint & example: compare the
scheduler on, say, Linux/FreeBSD, Windows 95 (sic),
and your favourite router OS (*); pay particular attention
to suitability for running realtime or near-realtime tasks,
where such tasks may occasionally crash or overrun their
expected timeslice; note how the best OS amongst the
bunch for this ain't exactly great.]

(*) results may vary according to personal choice here.

That's probably debatable; however, what is clear is that the
standard deviation in Internet/telco hotel performance is far
greater than that of hospitals, and it's hard to judge your
(potential) vendor's ability to respond to an unplanned
power situation in advance of one happening. Bizarrely, those
who seem to get it wrong don't seem to learn from their
mistakes.

Alex Bligh
Personal Capacity