outages, quality monitoring, trouble tickets, etc.

From: kwe@6SigmaNets.COM (Kent W. England)
I'm skeptical of any end-to-end availability figures over 97%. I don't
think they reflect the reality of leased line circuits today, or else they
don't include the leaf node circuits and only report backbone availability.
For a highly redundant backbone, almost any definition of availability
should result in a number like 99.mumble%. Remember 99.9% availability
means less than 9 hours outage per year. Routing hiccups take that much.
One or two leased lines outages is all you get for 9 hours. The real world
is a lot less available than that.
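Kent's figures are easy to re-derive. A quick sketch, using only the numbers already quoted above:

```python
# Annual downtime budget implied by an availability percentage.
HOURS_PER_YEAR = 365 * 24  # 8760

def downtime_hours(availability_pct):
    """Hours of outage per year permitted at this availability."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)

print(downtime_hours(99.9))  # ~8.76 hours/year -- "less than 9 hours"
print(downtime_hours(97.0))  # ~262.8 hours/year -- nearly 11 days
```

One or two multi-hour leased-line outages really do consume the entire 99.9% budget.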

Thank you! I thought I was living in a twilight zone with people
reporting 99.9% network availability. This is the rathole: end-to-end
network usability. The customer is interested in end-to-end usability,
while the network operator can only easily measure intra-network modules.

I can't tell you the answer, but there is definitely something happening
with customer perceptions of Internet useability. Looking at the
numbers I would agree a single leased circuit should be less reliable
(single point of failure) than a highly redundant backbone. But by
our customer perceptions, that isn't the case. Either we have better
than "normal" leased circuits, or the highly redundant backbones aren't,
or our customers' needs are based on something we aren't directly measuring.

Highly redundant backbones remain extremely vulnerable to the "glitch."
Human glitches, software glitches, "impossible" data glitches. Redundant
backbones do protect against the backhoe "glitch."

But since half the web servers I try to talk to refuse me half the time,
I'm not sure that network availability per se (HWB's complaints duly
acknowledged) is the tallest pole in the tent.

Part of this problem is the growing number of interdependencies (complexity,
chaos?). Even if each individual module is working 99.9% of the time,
the probabilities start looking pretty bad when all need to be working
at the same time. To make a web connection, you have a string of name
servers, a string of networks to the name servers, a string of routers
on those networks, another string of networks to the web server, another
string of routers, more strings of networks and routers and servers on
the return path.
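The compounding works against you quickly. A sketch, assuming (hypothetically) that each module in the chain fails independently and hits 99.9% on its own:

```python
# End-to-end availability of a serial chain: every module must be up
# at the same moment, so the per-module probabilities multiply.
def chain_availability(per_module, n_modules):
    return per_module ** n_modules

# A web connection touching ~50 modules (name servers, networks,
# routers, and the return path) at 99.9% each:
print(chain_availability(0.999, 50))  # ~0.951 -- roughly 1 in 20 attempts hits a down module
```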

I'm amazed it even works 50% of the time. Unfortunately our customers
aren't always as understanding.

Since error reporting sucks in most network applications, it becomes
the fault of whatever help desk happens to take the customer's phone call.

For the past ten years, I've wanted *all* my programs to
automatically detect larger-than-usual delays (more than 5ms?) and
start giving *exact* status reports, such as "your host is doing DNS
query", getting more detailed as the delay gets worse, getting the
status and error information dynamically from the intermediaries "host
x.y.z is doing a query of nameservers of the BAZ.COM domain in your
host flappy.c.e.baz.com, and out of seven route servers 4 have failed,
currently trying ... a.root-servers.net ... ICMP unreachable by router
X-Y-Z based on data obtained from ..." and on and on until either
utter failure or success; the greater the delays from what's
reasonable, the more status information gets queried and automatically
offered; if any point freezes, you'd know who is responsible.
Programs should use a common library of routines which would pop up a
window of the responsible organization for the error at hand, and the
proper email address to send complaints to; an automated, standard
computer-interpretable complaint should be registered automatically
similar to syslog but internetted, and a button the user can press to
add comments, opinions, etc. or even send further emails (and
attaching CC's, making local copies, etc.).
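No such common library exists; but the core mechanism -- a watchdog that starts narrating once an operation exceeds a reasonable delay -- is simple to sketch. Everything below (names, the threshold) is hypothetical:

```python
import threading

def with_status(label, fn, threshold=0.005, report=print):
    """Run fn(); if it exceeds `threshold` seconds, emit status
    reports of increasing detail level until it completes or fails.
    `label` is the user-visible description, e.g.
    'your host is doing a DNS query'."""
    done = threading.Event()

    def watchdog():
        level = 1
        # First report after `threshold`, then at growing intervals;
        # a real version would query the intermediaries for more
        # detail at each level instead of printing a canned message.
        while not done.wait(threshold * level):
            report(f"still waiting: {label} (detail level {level})")
            level += 1

    t = threading.Thread(target=watchdog, daemon=True)
    t.start()
    try:
        return fn()
    finally:
        done.set()   # stop narrating on success *or* failure
        t.join()
```

A real implementation would replace the canned message with live queries to resolvers and routers, as described above; the wrapper only shows where that hook belongs.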

I'm amazed it even works 50% of the time, too. I understand why it
doesn't work more often; a zillion pieces. We could use the tools (I
mean computers) at hand better to inform us of these problems.

Yes, I want it to be like my Volvo -- a warning light comes on whenever
a bulb stops working, but better yet the computer can tell me which
light is out, and can automatically order the spare part from the
right factory. When I'm in Mosaic or Netscape, there's no reason it
shouldn't tell me that a diode in a CSU/DSU just blew out in Wyoming,
owned by Joe Bizzlededof, and that his staff has an average response
time of six hours to fix such a problem, and that his functionaries
have been notified, and whether or not I should expect workarounds and
what actions I would have to take in order to use them (mostly, this
is none -- wait n minutes for routers to find another route and the
route will work again?) And, the same thing with a route missing from
a router table.

Is this crazy?

I don't think so! Everything is getting so complex, we need:

1) to know when users can't use the network; thus the automatic feedback
   mechanisms from end-to-end
2) to know what things to fix to minimize #1 (we're not all
   trillionaires and get to buy redundant everythings; we have to know
   which items manufactured by who and which programs maintained by
   who are bound to be more reliable than which others, so we can
   choose good items or know which ones need redundancy or other
   protective measures).
3) users to know exactly who to blame, so that help desks get
   *appropriate* calls rather than *inappropriate* ones. (The
   get-calls metaphor is starting to get old; get-email will be more
   and more appropriate), and so that users know which organizations
   not to spend money on.

In many cases, users are help desks fixing other people's problems.
It's all hierarchical, everyone's the top and everyone's the bottom.

I believe programmers should experiment with these things, and
standards should be drafted, specifically for
feedback-to-user-of-actual-problem and end-to-end automatic error
reporting in *both* directions (so that each side of the connection
and each end of the stack of layers knows what to fix), and
responsible party lookup automation (who to *really* bother when
there's just no other way (right now, 95% of the time)).

Bradley Allen