Keepalives, WAS: NAP/ISP Saturation

     I.e. no amount of
     routing updates should cause false link flapping due to delayed
     keepalive messages. The same is true for keepalive messages vs
     routing updates on a link.

If you have strict priority of keepalives over routing (and we should
distinguish between hellos and updates for those protocols where you can
tell the difference), you can (theoretically) starve the routing updates.
Thus, equal priority makes sense.

Well, keepalives cannot starve routing protocols, simply because they're
pretty much rate-limited. A scenario where the routing protocol starves
keepalives is clearly disastrous -- you get spurious line flaps, which
produce even more routing updates, and the network collapses. On the
other hand, insufficient capacity to handle routing updates is ok, as
it does not have that kind of positive feedback.
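To make the point above concrete, here is a toy simulation of strict
priority on one link. It is only a sketch under made-up assumptions: time
is counted in abstract ticks, the link sends per_tick packets per tick,
and the names (simulate, keepalive_every, update_burst) are hypothetical.
A keepalive is queued every keepalive_every ticks and a burst of routing
updates is already waiting at tick 0.

```python
from collections import deque


def simulate(ticks, keepalive_every, update_burst, per_tick=1):
    """Strict priority: keepalives are always dequeued before updates.

    Returns (worst keepalive queueing delay in ticks, updates sent).
    """
    keepalive_q = deque()
    update_q = deque(range(update_burst))  # backlog present at tick 0
    worst_keepalive_wait = 0
    updates_sent = 0

    for t in range(ticks):
        # Keepalives are rate-limited: one is generated per interval.
        if t % keepalive_every == 0:
            keepalive_q.append(t)

        for _ in range(per_tick):
            if keepalive_q:
                queued_at = keepalive_q.popleft()
                worst_keepalive_wait = max(worst_keepalive_wait, t - queued_at)
            elif update_q:
                update_q.popleft()
                updates_sent += 1

    return worst_keepalive_wait, updates_sent


if __name__ == "__main__":
    print(simulate(ticks=100, keepalive_every=5, update_burst=50))
```

Because the keepalive rate is bounded, the updates only lose one slot per
keepalive interval and the backlog still drains; the keepalives themselves
never wait behind it.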

The routing-level hellos are useful only to find out if the routing
process is still running and/or resynchronize protocols. They are certainly
too slow to handle regular link failures. (So there's a need to have
a gateway-to-gateway keepalive protocol over shared media like Ethernet
or another trendy madness).

  d) sub-second keepalive intervals -- this is probably the only
     method to discover _fast_ that the remote end is dead. The way it
     is now it takes 30 sec or so for a local router to find out that
     the remote one is wedged, and take appropriate action.

Do you really want that? Given your comments in b, I would think that you
would want to ride out a 1 second outage. I also wouldn't be thrilled about
the overhead involved in that much more precision. Note that what you said
originally was about ping, which is slightly different.

Well, one packet per line every 200 msec or so is not much. OTOH, you want
to avoid connectivity losses longer than 0.5-1 sec. To provide reasonable
redundancy you may want to wait for 2 missing keepalives in a row or so
before deciding that the link is down.

I.e. the hold-down interval determines the maximal rate of link-status
updates as seen by the routing protocols; the keepalive interval determines
the maximal delay between a link failure and the reaction of the routing
system.

If sub-1sec recovery is the goal, and IGP convergence is on the order
of 500 msec, that leaves only 500 msec for failure detection. With 2
missing keepalives needed to declare the link down, that means a
250 msec keepalive interval.
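The arithmetic above can be written down as a small helper. This is just
a sketch: the function name and parameters are hypothetical, and it assumes
the detection time is simply the keepalive interval times the number of
consecutive missed keepalives required, ignoring propagation and
processing delays.

```python
def keepalive_interval_ms(recovery_budget_ms, convergence_ms, missed_threshold):
    """Largest keepalive interval that still fits the recovery budget.

    Assumes detection time = interval * missed_threshold (a simplification).
    """
    detection_budget_ms = recovery_budget_ms - convergence_ms
    if detection_budget_ms <= 0:
        raise ValueError("convergence alone exceeds the recovery budget")
    if missed_threshold < 1:
        raise ValueError("missed_threshold must be at least 1")
    return detection_budget_ms / missed_threshold


if __name__ == "__main__":
    # 1 sec budget, 500 msec IGP convergence, 2 missed keepalives
    print(keepalive_interval_ms(1000, 500, 2))
```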

It would also be a good idea to wait for, say, 20 consecutive good
keepalives before deciding that the link is up. Cisco brings the line
up on the very first good keepalive (at least it appears so).
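The asymmetric rule suggested here (down after 2 missed keepalives in a
row, up only after 20 good ones in a row) can be sketched as a tiny state
machine. The class and method names are hypothetical, the thresholds are
just the numbers from this message, and this is not a description of how
any vendor implements it.

```python
class LinkMonitor:
    """Link state with hysteresis: quick to go down, slow to come up."""

    def __init__(self, down_after=2, up_after=20):
        self.down_after = down_after  # consecutive misses to declare down
        self.up_after = up_after      # consecutive good keepalives to declare up
        self.up = False
        self.missed = 0
        self.good = 0

    def observe(self, received):
        """Record one keepalive interval; return the current link state."""
        if received:
            self.missed = 0
            self.good += 1
            if not self.up and self.good >= self.up_after:
                self.up = True
        else:
            self.good = 0
            self.missed += 1
            if self.up and self.missed >= self.down_after:
                self.up = False
        return self.up


if __name__ == "__main__":
    mon = LinkMonitor()
    states = [mon.observe(True) for _ in range(20)]
    print(states[18], states[19])
    print(mon.observe(False), mon.observe(False))
```

A single lost keepalive does not take the link down, and a flaky line that
passes the occasional keepalive never gets declared up, so the routing
protocols see far fewer link-status changes.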