But he does raise an interesting problem. How do you know if your
highly redundant, diverse, etc. system has a problem? With an ordinary
system it's easy: it stops working. In a highly redundant system you
can start losing critical components, but not be able to tell if
your operation is in fact seriously compromised, because it continues
to "work."
Indeed. We currently monitor each part of our operation from a monitoring
station on our network. Under certain conditions, this can give us both
false positives and false negatives:
- We've lost off-site routing. Our monitoring station can see all our
nodes okay, so it thinks everything is fine, but no-one else can see them.
- We've lost routing to just the part of our network that hosts the
monitoring station. It reports that everything is down, when in fact
everything is still serving the rest of the internet fine.
One way we plan to overcome these issues is to locate monitoring stations
on other ISPs' networks at random places on the internet. If you correlate
the results from these multiple monitoring stations, then you get a better
view of what the rest of the internet is seeing.
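
A rough sketch of how that correlation might look, in Python. Everything
here is an assumption for illustration: the station names, the correlate()
function, and the simple majority rule among external stations are
hypothetical, not a description of any deployed setup.

```python
from typing import Dict


def correlate(internal_up: bool, external: Dict[str, bool]) -> str:
    """Classify one node from one internal probe and several external probes.

    internal_up: result from the monitoring station on our own network.
    external:    station name -> whether that off-site station could reach
                 the node (names are made up for this example).
    """
    if not external:
        return "unknown: no external stations reporting"

    seen_by = sum(external.values())
    # Assumption: a strict majority of external stations decides what
    # "the rest of the internet" sees. A tie counts as not reachable.
    externally_up = seen_by * 2 > len(external)

    if internal_up and externally_up:
        return "up"
    if internal_up and not externally_up:
        # The lost off-site routing case: inside looks fine, outside does not.
        return "unreachable from outside (internal check was a false negative)"
    if externally_up:
        # The isolated monitoring station case: the alarm is wrong.
        return "up (internal alarm was a false positive)"
    return "down"


if __name__ == "__main__":
    print(correlate(True, {"isp-a": False, "isp-b": False, "isp-c": False}))
    print(correlate(False, {"isp-a": True, "isp-b": True, "isp-c": False}))
```

A majority vote is only one possible rule. A single external station can
itself be cut off, which is why more than two of them are worth having.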
Simon