Monitoring highly redundant operations

But he does raise an interesting problem. How do you know if your
highly redundant, diverse, etc. system has a problem? With an ordinary
system it's easy: it stops working. In a highly redundant system you
can start losing critical components but not be able to tell whether
your operation is in fact seriously compromised, because it continues
to "work."

Indeed. We currently monitor each part of our operation from a monitoring
station on our network. Under certain conditions, this can give us both
false positives and false negatives:

- We've lost off-site routing. Our monitoring station can see all our
nodes okay, so it thinks everything is fine, but no-one else can see them.

- We've lost routing to just the part of our network with the monitoring
station on. It reports that everything is down, when in fact stuff is
working fine for serving the rest of the internet.

One way we plan to overcome these issues is to locate monitoring stations
on other ISPs' networks at random places on the internet. If you correlate
the results from these multiple monitoring stations, then you get a better
view of what the rest of the internet is seeing.
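
As a rough illustration of the correlation step, here is a minimal Python
sketch. The station names, target names, and the shape of the `reports`
dictionary are all made up for the example; it only assumes each station
can report "reachable or not" for each target it checks.

```python
def correlate(reports):
    """Combine per-station reachability reports into one view per target.

    reports maps a (hypothetical) station name to a dict of
    {target: True if reachable from that station, else False}.
    """
    targets = sorted({t for seen in reports.values() for t in seen})
    summary = {}
    for target in targets:
        # Only count stations that actually checked this target.
        votes = {s: seen[target] for s, seen in reports.items() if target in seen}
        up_from = sorted(s for s, ok in votes.items() if ok)
        down_from = sorted(s for s, ok in votes.items() if not ok)
        if not down_from:
            status = "up"
        elif not up_from:
            status = "down"
        else:
            # Stations disagree: likely a routing problem, not a dead node.
            status = "partial"
        summary[target] = {
            "status": status,
            "up_from": up_from,
            "down_from": down_from,
        }
    return summary


if __name__ == "__main__":
    # The on-site station sees everything; two off-site stations cannot
    # reach "www" (the lost off-site routing case).
    example = {
        "internal": {"www": True, "mail": True},
        "isp-a": {"www": False, "mail": True},
        "isp-b": {"www": False, "mail": True},
    }
    for name, view in correlate(example).items():
        print(name, view["status"], "down from:", view["down_from"])
```

A "partial" result is the interesting one: it separates "our monitoring
station is cut off" from "the rest of the internet is cut off", which a
single station cannot do.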

Simon

>But he does raise an interesting problem. How do you know if your
>highly redundant, diverse, etc. system has a problem? With an ordinary
>system it's easy: it stops working. In a highly redundant system you
>can start losing critical components but not be able to tell whether
>your operation is in fact seriously compromised, because it continues
>to "work."

>Indeed. We currently monitor each part of our operation from a monitoring
>station on our network. Under certain conditions, this can give us both
>false positives and false negatives:

>- We've lost off-site routing. Our monitoring station can see all our
>nodes okay, so it thinks everything is fine, but no-one else can see them.

With our monitoring software we also check a few off-site links (our
interfaces on our uplink routers and the router after that); it tends to
work well.

>- We've lost routing to just the part of our network with the monitoring
>station on. It reports that everything is down, when in fact stuff is
>working fine for serving the rest of the internet.

For that situation, the software we use allows us to set dependencies, i.e.,
servers A, B & C depend on router Z; if router Z is down, assume servers A, B
& C are unreachable/down (but don't start spewing out alerts about it).
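
To make the dependency idea concrete, here is a small Python sketch of
that kind of alert suppression. It is not how the product mentioned here
works internally; the node names and the `status` / `depends_on` dictionaries
are assumptions for the example.

```python
def alerts_to_send(status, depends_on):
    """Return the down nodes worth alerting on.

    status:     {node: True if the check succeeded, else False}
    depends_on: {node: the node it is reached through}, e.g. server -> router
    A down node is suppressed when anything it depends on is also down.
    """
    alerts = []
    for node in sorted(status):
        if status[node]:
            continue
        suppressed = False
        seen = set()
        parent = depends_on.get(node)
        # Walk up the dependency chain; `seen` guards against config loops.
        while parent is not None and parent not in seen:
            seen.add(parent)
            # A parent with no recorded status is treated as up.
            if not status.get(parent, True):
                suppressed = True
                break
            parent = depends_on.get(parent)
        if not suppressed:
            alerts.append(node)
    return alerts


if __name__ == "__main__":
    status = {
        "router-z": False,
        "server-a": False,
        "server-b": False,
        "server-c": False,
        "server-d": False,  # not behind router-z
    }
    depends_on = {
        "server-a": "router-z",
        "server-b": "router-z",
        "server-c": "router-z",
    }
    print(alerts_to_send(status, depends_on))
```

With router Z down, only the router itself and the unrelated server D
produce alerts; A, B and C are assumed unreachable and stay quiet.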

Unfortunately the software is MS-based (Enterprise Monitor, now named IP
monitor iirc). I first came across it while working at Xerox. It resides on
the only MS box on our network (beyond customer machines, and yes, it's
kind of an oxymoron, a Windows monitoring box).

>One way we plan to overcome these issues is to locate monitoring stations
>on other ISPs' networks at random places on the internet. If you correlate
>the results from these multiple monitoring stations, then you get a better
>view of what the rest of the internet is seeing.

A kind of distributed monitoring system would be nice, or just having
people who agree to give you access to add your systems to their
monitoring systems (easily done with some software, not so easily with
others). I also do this to a small extent.

        Matthew S. Hallacy
        XtraTyme Technologies

Umm... Keynote?
(http://www.keynote.com)

I find it truly amazing that people don't already diversely
monitor. Hell, have cronned pings running off your friend's cable modem
if that's all you can afford, but for christ's sake, a single box colo'd
in someone else's cage, or a shell at shells.com or nether.net really
isn't that expensive.
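
For the cheapest version of this, something like the following Python
sketch would do. It assumes a Unix-like box whose `ping` accepts `-c`, and
the function names are made up for the example; the script is silent
unless something is unreachable, so cron only has output to mail on failure.

```python
import subprocess
import sys


def is_reachable(host, count=3, timeout_s=20, run=subprocess.run):
    """Return True if the system `ping` exits 0 for `host`.

    `run` is injectable so the check can be exercised without a network.
    """
    try:
        result = run(
            ["ping", "-c", str(count), host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (subprocess.TimeoutExpired, OSError):
        # Ping hung or is not installed: treat the host as unreachable.
        return False
    return result.returncode == 0


def unreachable(hosts, check=is_reachable):
    """Return the hosts that failed the check, in the order given."""
    return [h for h in hosts if not check(h)]


if __name__ == "__main__" and len(sys.argv) > 1:
    down = unreachable(sys.argv[1:])
    if down:
        # Only print on failure so a cron job stays quiet when all is well.
        print("UNREACHABLE: " + " ".join(down))
        sys.exit(1)
```

Run it from cron every few minutes with your hostnames as arguments, from
a box that is not on your own network.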

Fighting the war against bad networks,

Matthew Devney
Teamsphere Interactive

P.S.: It is not wise to get me started about other bad network practices
(stub network) or various other stupidities that piss me off.