Monitoring highly redundant operations

Not to pick on Dave, since I suspect he is going to have to face
the Microsoft PR department for re-indoctrination for speaking out
of turn, but I'm glad to see someone from Microsoft made an appearance.

But he does raise an interesting problem. How do you know if your
highly redundant, diverse, etc. system has a problem? With an ordinary
system it's easy. It stops working. In a highly redundant system you
can start losing critical components but not be able to tell whether
your operation is in fact seriously compromised, because it continues
to "work."

As many of us have found out as we moved from simple networks to more
complex networks, managing the network is often much harder than
designing the architecture of the network itself. Instead of relying
on being notified when stuff "breaks," you have to actively monitor
the state of your systems. Fairly frequently I see cases where the
backup system has failed, but no one knows about it until the primary
system also fails.
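
To make the point concrete, here is a minimal sketch (in Python, purely
illustrative) of the kind of active check I mean: probe every redundant
component individually, including the normally idle backups, rather than
waiting for the service as a whole to stop answering. The host names and
ports are hypothetical placeholders.

#!/usr/bin/env python3
"""Probe each redundant component individually, not just the service as a whole."""
import socket

# Every component that must be healthy for the system to remain redundant,
# including the ones that are normally idle (the "backup" paths).
COMPONENTS = [
    ("primary-mx.example.com", 25),
    ("backup-mx.example.com", 25),    # often the one nobody notices failing
    ("ns1.example.com", 53),
    ("ns2.example.com", 53),
]

def probe(host: str, port: int, timeout: float = 5.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

if __name__ == "__main__":
    for host, port in COMPONENTS:
        status = "ok" if probe(host, port) else "DOWN"
        # Alert on any single failure, even though the service still "works".
        print(f"{host}:{port} {status}")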

From the typical monitoring stations Dave sees, everything appears
"normal." Yet out in the real world there is a problem. As with most
things, it's rarely a single thing that breaks, but a chain of problems
resulting in the final failure. So what should you be monitoring, in
addition to the typical graphs and logs, to detect the problem seen by
Microsoft yesterday and today?

[ On Wednesday, January 24, 2001 at 14:31:20 ( -0800), Sean Donelan wrote: ]

Subject: Monitoring highly redundant operations

> But he does raise an interesting problem. How do you know if your
> highly redundant, diverse, etc. system has a problem? With an ordinary
> system it's easy. It stops working. In a highly redundant system you
> can start losing critical components but not be able to tell whether
> your operation is in fact seriously compromised, because it continues
> to "work."

The real problem is that the most critical part of the puzzle has _not_
been made "highly redundant" in this case.

If, from the point of view of any given user on the Internet, not even
one of your registered authoritative DNS servers is responding, then
your hosts (MX records, etc.) don't exist for that person: their e-mail
to you may well bounce and they will not view your web pages.

The only way to ensure that your DNS is highly redundant and working is
to get the maximum possible dispersion of _registered_ authoritative
servers throughout the network geography, just as the root and TLD
servers are widely distributed.

Note this is just as important (if not more so!) for any delegated
sub-domains in your zone too, and equally important for any related
zones (e.g. passport.com in this case).
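
As a rough sketch of how you might verify that, the following Python
fragment (assuming the dnspython 2.x library is available; the zone
names are made-up placeholders) walks the registered NS records for each
zone and checks that every listed server actually gives an authoritative
answer:

#!/usr/bin/env python3
"""Check that every registered NS for a zone really answers authoritatively."""
import dns.flags
import dns.message
import dns.query
import dns.rdatatype
import dns.resolver

# Placeholders: include your delegated sub-domains and related zones too.
ZONES = ["example.com", "sub.example.com"]

def check_zone(zone: str) -> None:
    # The NS RRset as the rest of the world sees it via ordinary resolution.
    ns_names = [rr.target.to_text() for rr in dns.resolver.resolve(zone, "NS")]
    for ns in ns_names:
        for a in dns.resolver.resolve(ns, "A"):
            addr = a.to_text()
            query = dns.message.make_query(zone, dns.rdatatype.SOA)
            try:
                reply = dns.query.udp(query, addr, timeout=5)
                ok = bool(reply.flags & dns.flags.AA)   # authoritative answer?
            except Exception:
                ok = False
            print(f"{zone}: {ns} ({addr}) {'authoritative' if ok else 'FAILED'}")

if __name__ == "__main__":
    for zone in ZONES:
        check_zone(zone)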

The only really effective way to measure how well your nameservers are
dispersed is to make it terribly easy for anyone, anywhere, to report
any problems they perceive to you via as many alternate channels as
possible -- you can't be everywhere at once, but if you make it easy
for people to send you information out-of-band then you'll get lots of
early warning when various chunks of the Internet can't see your
nameservers and/or your other hosts.

Now, if the majority of DNS cache server operators don't get too
paranoid, you could try to set up a mesh of equally widely dispersed
monitoring systems that cross-check the availability of test records
from your zone by querying any number of regional and remote cache
servers. You'd make the TTL of these test records the minimum
recommended by major nameserver software vendors (300 seconds?) and
then query the whole group every TTL+N seconds. Obviously you're
probably going to have to report your results out-of-band, and/or have
independent people at each monitoring site who are responsible for
investigating problems immediately and doing what they can locally to
resolve them.
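
As a rough illustration (not anything anyone is known to be running),
here's a minimal sketch in Python of one node of such a mesh, assuming
the dnspython 2.x library; the resolver addresses and the test record
name are made-up placeholders:

#!/usr/bin/env python3
"""Query a set of regional/remote caching resolvers for a low-TTL test record."""
import time
import dns.resolver

TEST_RECORD = "dns-canary.example.com"   # low-TTL test record in your zone
CACHE_SERVERS = ["192.0.2.53", "198.51.100.53", "203.0.113.53"]
TTL = 300          # minimum TTL recommended for the test record
SLACK = 30         # the "N" in "query every TTL+N seconds"

def check_once() -> None:
    for server in CACHE_SERVERS:
        res = dns.resolver.Resolver(configure=False)
        res.nameservers = [server]
        res.lifetime = 5
        try:
            answer = res.resolve(TEST_RECORD, "A")
            print(server, "ok", [a.to_text() for a in answer])
        except Exception as exc:
            # Report out-of-band in real life; printing is enough for a sketch.
            print(server, "FAILED", exc)

if __name__ == "__main__":
    while True:
        check_once()
        time.sleep(TTL + SLACK)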

> Not to pick on Dave, since I suspect he is going to have to face
> the Microsoft PR department for re-indoctrination for speaking out
> of turn, but I'm glad to see someone from Microsoft made an appearance.

out of curiosity, how do you know he's really from microsoft,
whether unofficial or not?
(he might be one of those LIENUX ZeAlOtS.)

perhaps he might face re-indoctrination over the mail client he
is (apparently) using, as well as the mail server software he
is (apparently) using:

: [ snip ]
:
: Received: by segue.merit.edu (Postfix)
: id 91FFE5E0E3; Wed, 24 Jan 2001 18:12:28 -0500 (EST)
: Delivered-To: nanog-outgoing@merit.edu
: Received: by segue.merit.edu (Postfix, from userid 56)
: id B9B1E5E067; Wed, 24 Jan 2001 17:39:46 -0500 (EST)
: Received: from sneakerz.org (sneakerz.org [207.154.226.254])
: by segue.merit.edu (Postfix) with ESMTP id 58CCE5EBCC
: for <nanog@merit.edu>; Wed, 24 Jan 2001 17:32:37 -0500 (EST)
:---> Received: by sneakerz.org (Postfix, from userid 1003)
---> ^^^^^^^^^^^
: id E7CC15D006; Wed, 24 Jan 2001 16:32:36 -0600 (CST)
: Date: Wed, 24 Jan 2001 16:32:36 -0600
: From: Dave McKay <dave@sneakerz.org>
:
: [ snipped/edited ]
:
: Message-ID: <20010124163236.A37343@sneakerz.org>

Sean Donelan <sean@donelan.com> observed,

> But he does raise an interesting problem. How do you know if your
> highly redundant, diverse, etc. system has a problem? With an ordinary
> system it's easy. It stops working. In a highly redundant system you
> can start losing critical components but not be able to tell whether
> your operation is in fact seriously compromised, because it continues
> to "work."

I suspect answers here aren't going to be found in traditional engineering, but more in a discipline that deals with extremely complex systems where a full failure may be irretrievable. I'm thinking of clinical medicine.

The initial problem there indeed may be subtle. I have a substantial amount of medical experience, but it was easily 2-3 hours before I recognized, in myself, early symptoms of a cardiac problem. It seemed so much like indigestion, and then a pulled muscle. I remember relaxing, and then recognizing a chain of minor events...sweating...mild but persistent left arm pain radiating into the chest...shortness of breath...and then a big OH SH*T.

My first point is having what physicians call a "high index of suspicion" when seeing a combination of minor symptoms. I suspect that we need to be looking for patterns of network symptoms that are sensitive (i.e., high chance of being positive when there is a problem) but not necessarily selective (i.e., low probability of false positives).

Once the index of suspicion is triggered, the next thing to look for is not necessarily a direct indication of a problem, but a more selective surrogate marker: objective criteria that, especially when analyzed as trends, point in the direction of an impending failure. In emergency medicine, the EKG often isn't as informative as TV drama would suggest. A constantly improving area, however, has been measurement, especially successive measurements, of blood chemicals that indicate cardiac tissue is being damaged or destroyed.

Early in the use of cardiac-related enzymes, it was a matter of considering several nonspecific factors in combination. SGOT, CPK and LDH are all enzymes that will elevate with tissue damage. The problem is that any one can be elevated by problems in different areas: liver and heart, heart and skeletal muscle, etc. You need to look for elevations in a couple of areas that are associated with the heart, AND look for normal values for other tests that rule out liver disease, etc. The biochemical techniques have constantly improved, but you still need to look at several factors.

The second-phase analogy for networking could be more frequent polling and trending, or relatively benign tests such as traceroutes, etc.

Only after there is a clear clinical problem, or several pieces of laboratory evidence, does a physician jump to more invasive tests, or begin aggressive treatment on suspicion. In like manner, you wouldn't do a processor-intensive trace on a router, or do a possibly disruptive switch to backup links, unless you had reasonable confidence that there was a problem.
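
A crude illustration of that triage, sketched in Python with entirely hypothetical check functions and thresholds: each cheap test is allowed to be noisy on its own, and the expensive or disruptive diagnostics run only when several of them fire together.

#!/usr/bin/env python3
"""Combine several cheap, sensitive checks into an "index of suspicion"."""

def elevated_rtt() -> bool:
    """Placeholder: compare recent ping RTTs against a moving baseline."""
    return False

def dns_timeouts() -> bool:
    """Placeholder: count recent resolver timeouts for your own zones."""
    return False

def smtp_queue_growing() -> bool:
    """Placeholder: check whether the outbound mail queue is trending up."""
    return False

CHEAP_CHECKS = [elevated_rtt, dns_timeouts, smtp_queue_growing]
SUSPICION_THRESHOLD = 2   # how many weak signals must co-occur

def index_of_suspicion() -> int:
    return sum(1 for check in CHEAP_CHECKS if check())

if __name__ == "__main__":
    score = index_of_suspicion()
    if score >= SUSPICION_THRESHOLD:
        # Only now run the "invasive tests": detailed traces, failover drills, etc.
        print(f"suspicion score {score}: escalate to deeper diagnostics")
    else:
        print(f"suspicion score {score}: keep watching trends")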

>
> > Not to pick on Dave, since I suspect he is going to have to face
> > the Microsoft PR department for re-indoctrination for speaking out
> > of turn, but I'm glad to see someone from Microsoft made an appearance.
>
> out of curiosity, how do you know he's really from microsoft,
> whether unofficial or not?
> (he might be one of those LIENUX ZeAlOtS.)
>
> perhaps he might face re-indoctrination over the mail client he
> is (apparently) using, as well as the mail server software he
> is (apparently) using:

This is getting a little stupid now, don't you think (and if you don't
think so maybe you're one of those M1CR0S0FT Z34L0TS h0h0h0h0h0)?

i hope you'll note that i just tacked onto sean donelan's thread, where
he states '... suspect he is going to face ...', and i added an
observation prefaced with '... he might face ...'. pick on sean, not me.
also:

#1) He works for microsoft. I can personally attest to that.

that's fine. i've been lurking on this list for a long time and don't
remember seeing much traffic from this person. that someone else
will vouch for him is the correct response.

#2) He seems to want to sincerely state that:
  a) microsoft acknowledges the problem
  b) microsoft is working on the problem
  c) microsoft isn't globally affected by this
    ... even on a possibly unofficial basis. Information is good.

i only partially agree. it is a fact that many things from big
companies (microsoft being only yet another big company) are guesses,
projections, wet dreams, and/or FUD. especially since his response
contained almost no information other than a)/b)/c), above, all not
particularly useful to {na}nog-ers, IMHO. and topped off with
"i wish i could disclose more but i can't". perhaps better not
to say anything instead of seeding suspicions or raising expectations.

#3) Just because a person may work for one Evil Empire or another
    doesn't mean that they are forced to use their operating system
    or MUA. I work for a company that has a mail product that a
    considerable number of people use, and I don't use it. This is
    an acceptable practice at a lot of companies.

it's probably not a secret that i'm not microsoft's biggest ally.
nevertheless, i find it curious that you appear to be implying that
you think i said he works for an evil empire.

as far as mail software
is concerned, your statement is factually correct, but how often does
it apply to companies who make/manufacture/distribute an e-mail product
(especially *their own*), as it is in this case?

#4) He is as far from a linux zealot as you can possibly be.

obviously, i was insensitive by not adding a smiley to the following
characters:
     LIENUX ZeAlOtS

i humbly apologize.

you would be advised to note that i am, as far as i can remember, one
of the only 2 (or 3) people who've defended microsoft in the role of
aggrieved network operator (on nanog), at least until we all find out
what really happened. if we get a complete, accurate, and verifiable
(inasmuch as that's possible) report from microsoft backed up with
additional personal insight from dave@sneakerz.org, then we'll all
be richer for the experience. let's see if that happens...

[ On Wednesday, January 24, 2001 at 23:23:11 ( -0500), Howard C. Berkowitz wrote: ]

Subject: Re: Monitoring highly redundant operations

> My first point is having what physicians call a "high index of
> suspicion" when seeing a combination of minor symptoms. I suspect
> that we need to be looking for patterns of network symptoms that are
> sensitive (i.e., high chance of being positive when there is a
> problem) but not necessarily selective (i.e., low probability of
> false positives).

Your analogy is very interesting because, just as in this case with
M$'s DNS, the root cause may very well not have been in failing to
notice the symptoms or diagnose them correctly, but rather in allowing a
situation to build such that these symptoms could occur in the first
place.

I don't wish to read more into your analogy and your personal life (in a
public forum, no less!) than I have a right to, so let's say
"theoretically" that if it were past events in your life that were under
your direct personal control, and which were known at the time to be
almost guaranteed to bring on your condition, then presumably you could
have avoided that condition by actively avoiding or counteracting those
past events.

In the same way M$'s DNS would not likely have suffered any significant
visible problems, even if their entire campus had been torn to ruin by a
massive earthquake or whatever, if only they had deployed registered DNS
servers in other locations around the world (and of course if they'd
been careful enough to use them fully for all relevant zones).

The DNS was designed to be, and at least in theory can be, one of the
most reliable subsystems on the Internet. However, it isn't that way by
default -- every zone must be specifically engineered to be that way,
and then of course the result needs to be managed properly too. Luckily
the engineering and management are extremely simple and in most cases
only require the periodic co-operation of autonomous entities to make it
all fit together. No doubt M$'s zones get a larger than average number
of queries, but still it's just basic engineering to build an enormously
reliable DNS system to distribute those zones and answer those queries.
If this were not true, the root and TLD zones would have crumbled long
ago (and stayed that way! :-).

> Only after there is a clear clinical problem, or several pieces of
> laboratory evidence, does a physician jump to more invasive tests, or
> begin aggressive treatment on suspicion. In like manner, you
> wouldn't do a processor-intensive trace on a router, or do a possibly
> disruptive switch to backup links, unless you had reasonable
> confidence that there was a problem.

No, perhaps not, but surely in an organisation the size of M$ there
should have been enough operational procedures in place to have
identified the events shortly preceding the beginning of the incident
(e.g. the configuration change). Similarly, of course, there should have
been procedures in place to roll back all such changes to see if the
problem goes away.

Obviously such operational recovery procedures are not always perfect,
as history has shown, but in the case of something as simple as a set of
authoritative nameservers is supposed to be, they should have been
highly effective.

Furthermore, in this particular case there's no need for expensive or
disruptive tests -- a company the size of M$ should have had (and
perhaps does have, but doesn't know how to use effectively) proper test
gear that can passively analyse the traffic at various points on their
networks (including their connection(s) to the Internet) without having
to actually use their routers or servers for diagnostic purposes.
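
For illustration only, a minimal sketch of that kind of passive check in
Python, assuming the scapy library and capture privileges on a machine
that can already see the traffic (the thresholds and interpretation are
left to the operator; nothing here injects queries or touches a router):

#!/usr/bin/env python3
"""Passively watch DNS traffic at a tap/span port and compare queries to answers."""
from scapy.all import sniff
from scapy.layers.dns import DNS

counts = {"queries": 0, "responses": 0}

def tally(pkt):
    if pkt.haslayer(DNS):
        if pkt[DNS].qr == 0:
            counts["queries"] += 1
        else:
            counts["responses"] += 1

if __name__ == "__main__":
    # Watch one minute of port-53 traffic; a large imbalance toward unanswered
    # queries is a strong hint the nameservers are unreachable or broken.
    sniff(filter="udp port 53", prn=tally, store=False, timeout=60)
    print(counts)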

Finally, in this particular case the outage was so long that there was
ample time for them to have deployed new, network-diverse servers, added
their IP#s to the TLD delegations for their zones, and had them show up
world-wide well before they'd fixed the actual problem!