Westnet and Utah outage

[ .. discussion of smoke filled rooms and UFOs deleted. :wink: ]

I routinely see high packet losses. The environment, from what I see,
has severely degraded from what it was, say, a year ago. I submit it
also has grown significantly and we have many more cooks in the
kitchen, but, ill-defined as our old standards are, we cannot even keep
them up. I still believe y'all got the Internet on a silver plate, and
just need to fix things up.

I suspect at least 80% of the problem is not technical, but
administrative and a matter of mindset. If those 80% could be resolved
to a somewhat high degree, I suspect we would find solutions for the
technical problems. Not overnight, I am sure, but something we could
plan for.

One requirement that we have been pushing hard for is:

  Zero packet loss to stable destinations if the path is uncongested
  in the presence of arbitrarily high levels of route flap.

We are trying to enforce this by requiring router vendors to report
the highest traffic rates that they can sustain where this holds true.
We completely ignore the Bradner test results and emphatically insist
that those tests are completely useless and have done the Internet a
great disservice.

This zero loss condition would seem to the naive Internet user to be a
given. It absolutely is not (unless your router is an NSS :). If you
have a cache of prefixes and can forward really fast with that cache
and have a much slower secondary means of forwarding, you had better
not invalidate any cache entries by flushing subsets of cache entries
(or worse yet the whole cache) and you had better not ensure cache
consistency by timing out cache entries.
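
To make the point concrete, here is a rough sketch of why flushing hurts
stable destinations. It is a toy model, not any vendor's design: the
Router class, its method names, the prefixes and the next-hop labels are
all made up for illustration. It counts how often traffic to destinations
whose routes never change gets pushed onto the slow path while an
unrelated prefix flaps.

```python
import ipaddress


class Router:
    """Toy forwarder (hypothetical): fast per-destination cache, slow full table."""

    def __init__(self, routes):
        # routes: prefix string -> next hop; this is the slow, authoritative table
        self.table = {ipaddress.ip_network(p): nh for p, nh in routes.items()}
        self.cache = {}        # destination address -> next hop (fast path)
        self.slow_lookups = 0  # number of times the slow path was used

    def _slow_lookup(self, addr):
        self.slow_lookups += 1
        matches = [n for n in self.table if addr in n]
        if not matches:
            return None
        # longest-prefix match
        return self.table[max(matches, key=lambda n: n.prefixlen)]

    def forward(self, dst):
        addr = ipaddress.ip_address(dst)
        if addr not in self.cache:
            self.cache[addr] = self._slow_lookup(addr)
        return self.cache[addr]

    def flap_with_flush(self, prefix, next_hop):
        """Handle a route change by throwing away the whole cache."""
        self.table[ipaddress.ip_network(prefix)] = next_hop
        self.cache.clear()

    def flap_selective(self, prefix, next_hop):
        """Handle a route change by re-resolving only the covered cache entries."""
        net = ipaddress.ip_network(prefix)
        self.table[net] = next_hop
        for addr in [a for a in self.cache if a in net]:
            self.cache[addr] = self._slow_lookup(addr)


def slow_lookups_for_stable_traffic(flap_method_name, flaps=100):
    """Flap 192.0.2.0/24 repeatedly; count slow lookups for stable 10/8 traffic."""
    r = Router({"10.0.0.0/8": "A", "192.0.2.0/24": "B"})
    stable = ["10.1.1.1", "10.2.2.2", "10.3.3.3"]
    for d in stable:
        r.forward(d)           # warm the cache
    r.slow_lookups = 0
    for i in range(flaps):
        getattr(r, flap_method_name)("192.0.2.0/24", "B" if i % 2 else "C")
        for d in stable:
            r.forward(d)
    return r.slow_lookups
```

In this model every flush sends each stable destination back through the
slow path once per flap, while the selective version never touches them.
On a real box the slow path is where packets get dropped under load.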

The cache problems are particularly difficult when trying to maintain
overlaps while components of the overlap or the aggregate flap. Route
deletion is a hard problem with a partially populated cache. Router
vendors seem to have overlooked this difficulty and/or not been
acutely aware of the requirement.
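
A small sketch of the overlap case, again only as an illustration under
assumptions: the function names (best_match, lookup, announce, withdraw)
and the idea of tagging each cache entry with the prefix that resolved it
are mine, not a description of how any particular router does it. The
point is that deleting a more specific route must re-point its cache
entries at the covering aggregate, and deleting the aggregate must leave
entries under a surviving more specific route alone.

```python
import ipaddress


def best_match(table, addr):
    """Longest-prefix match over the full table: (prefix, next_hop) or None."""
    matches = [n for n in table if addr in n]
    if not matches:
        return None
    best = max(matches, key=lambda n: n.prefixlen)
    return best, table[best]


def lookup(table, cache, dst):
    """Forwarding lookup; cache entries remember which prefix resolved them."""
    addr = ipaddress.ip_address(dst)
    if addr not in cache:
        hit = best_match(table, addr)
        if hit is None:
            return None
        cache[addr] = hit
    return cache[addr][1]


def announce(table, cache, prefix, next_hop):
    """Add or change a route; update only entries the new prefix now wins for."""
    net = ipaddress.ip_network(prefix)
    table[net] = next_hop
    for addr, (used, _) in list(cache.items()):
        if addr in net and used.prefixlen <= net.prefixlen:
            cache[addr] = (net, next_hop)


def withdraw(table, cache, prefix):
    """Delete a route; repair only the entries that were resolved through it."""
    net = ipaddress.ip_network(prefix)
    del table[net]
    for addr, (used, _) in list(cache.items()):
        if used == net:
            hit = best_match(table, addr)
            if hit is None:
                del cache[addr]    # no covering route left
            else:
                cache[addr] = hit  # fall back to the covering aggregate
```

For example, with 10.0.0.0/8 and 10.1.0.0/16 both in the table,
withdrawing the /16 moves a cached 10.1.2.3 onto the /8 next hop without
disturbing a cached 10.9.9.9. The linear scans here are only for clarity;
doing this at forwarding speed is the hard part.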

Unless this is fixed, you will get loss between two stable points even
if there is no congestion, since high levels of route flap are sort of
a given now that the Internet has become this big.

Of course, things get much worse if links in the path are congested,
or if routers can't handle the PPS load, or if routers in the path
fall over and die. I'm just pointing out that for some routers, when
things are "working" they are not working by your definition or mine.

And - Yes, it is being fixed!

Unfortunately, the problem is too big to be resolved solely by the
engineers. Takes more than just intelligence.

Wow! Sounds real complicated. Must require bureaucrats. :wink: