endpoint liveness (RE: Do ATM-based Exchange Points make sense anymore?)

BGP keepalive/hold timers can be configured down to the granularity
of link- or PVC-level keepalives, but for session stability reasons
it appears that most ISPs at GigE exchanges choose not to
tweak them down from the defaults. IIRC, Juniper is 30/90 and Cisco is
60/180. My gut feel was that even something like 10/30 would be
reasonable, but nobody seems convinced that this is much of an
issue.

Cheers,
-Lane
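
To put numbers on the timer pairs Lane mentions, here is a minimal sketch
of how long a dead peer can go unnoticed under each one. It assumes the hold
timer restarts on every received keepalive and that nothing else arrives from
the peer; detection_window is a made-up helper name, and the defaults are the
"IIRC" figures quoted above, not verified values.

```python
def detection_window(keepalive_s, hold_s):
    """Return (best, worst) seconds between a peer dying and hold-timer expiry.

    Assumption: the hold timer restarts on each received keepalive and no
    other BGP messages arrive from the peer in the meantime.
    """
    if hold_s <= keepalive_s:
        raise ValueError("hold time must exceed the keepalive interval")
    # Best case: the peer dies just before its next keepalive was due, so
    # one keepalive interval of the hold time has already elapsed.
    # Worst case: the peer dies right after sending a keepalive.
    return hold_s - keepalive_s, hold_s


for name, ka, hold in [("60/180", 60, 180), ("30/90", 30, 90), ("10/30", 10, 30)]:
    best, worst = detection_window(ka, hold)
    print(f"{name}: failure noticed after {best}-{worst} s")
```

Under these assumptions 60/180 leaves a dead peer unnoticed for two to three
minutes, while 10/30 brings that down to 20-30 seconds.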

It makes little sense to detect transient glitches. Any possible reaction
to those glitches (e.g. withdrawal of exterior routes with subsequent
reinstatement) is more damaging than the glitches themselves.

--vadim

Endpoint liveness may also start to become more of an issue as more
networks choose to private peer, or reach ethernet exchanges, over L2
pseudowires.

When the router at the far end goes away for whatever reason - the router
has really gone away, the MPLS provider in the middle is banjaxed, etc -
this isn't immediately visible to the other end, which will still see
"link up" from the PE.

I think someone (can't remember who, maybe Riverstone) is implementing a
method of dropping link on the ethernet ports at both ends of a pseudowire
if something goes bang in the middle, and end-to-end connectivity fails.

But, how does that work when you may be delivering multiple q-tags on a
single GigE port (for example)? If only one tag is affected, you don't
want to drop link, right?

So, we're back to detection at layer 3, can I ping it, do I have
adjacency, etc.

Some sort of lower-level heartbeat (maybe like OAM), not dependent on IP
reachability, would be a bonus - and it's probably low in the tax stakes,
if it can be made simple enough.

Mike
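
The per-tag heartbeat Mike asks for could look roughly like the sketch below:
liveness is tracked per q-tag, so one broken pseudowire marks only its own
VLAN dead and the port stays up. This is an illustration only; VlanHeartbeats,
the VLAN IDs and the 3-second timeout are all hypothetical, and no particular
OAM mechanism is implied.

```python
class VlanHeartbeats:
    """Track the last heartbeat seen per q-tag (hypothetical helper)."""

    def __init__(self, vlans, timeout_s):
        self.timeout_s = timeout_s
        # None means no heartbeat has been seen on that tag yet.
        self.last_seen = {vlan: None for vlan in vlans}

    def heartbeat(self, vlan, now):
        """Record a heartbeat received on one tag at time `now` (seconds)."""
        self.last_seen[vlan] = now

    def dead_vlans(self, now):
        """Tags with no heartbeat within the timeout; the port itself stays up."""
        return sorted(
            vlan
            for vlan, seen in self.last_seen.items()
            if seen is None or now - seen > self.timeout_s
        )


hb = VlanHeartbeats(vlans=[100, 200, 300], timeout_s=3.0)
for t in (0.0, 1.0, 2.0):
    for vlan in (100, 200, 300):
        hb.heartbeat(vlan, now=t)
# The pseudowire behind VLAN 200 breaks; the other two keep sending.
for t in (3.0, 4.0, 5.0, 6.0):
    hb.heartbeat(100, now=t)
    hb.heartbeat(300, now=t)
print(hb.dead_vlans(now=6.0))  # prints [200]
```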

Mike Hughes wrote:

> But, how does that work when you may be delivering multiple q-tags on a
> single GigE port (for example)? If only one tag is affected, you don't
> want to drop link, right?
>
> So, we're back to detection at layer 3, can I ping it, do I have
> adjacency, etc.
>
> Some sort of lower-level heartbeat (maybe like OAM), not dependent on IP
> reachability, would be a bonus - and it's probably low in the tax stakes,
> if it can be made simple enough.

I think pseudowire liveness (in the case of ethernet pseudowires, which are
by nature multipoint and multi-VLAN) does not really make sense, but as you
conclude, L3 liveness does. Obviously one can repeat the exercise for
everything that needs liveness, but it would make more sense to have a
generic way to determine L3 reachability in a robust manner.

Pete

Your Cisco router (say a GSR) will go foobar if you use 10/30 second
timers. An IGP topology change that causes a new next-hop interface for
100k routes will cause processes (probably CEF related) to run for so
long that you will lose your BGP keepalives, and thus lose sessions, and
everything will go *BOOM* - so please be nice and don't do that without
real testing.

/Jesper

Jesper Skriver wrote:

> Your Cisco router (say a GSR) will go foobar if you use 10/30 second
> timers. An IGP topology change that causes a new next-hop interface for
> 100k routes will cause processes (probably CEF related) to run for so
> long that you will lose your BGP keepalives, and thus lose sessions, and
> everything will go *BOOM* - so please be nice and don't do that without
> real testing.

This is the exact reason why you want your liveness to be detected
out of band of the actual routing protocol keepalives, which might
also be stuck behind a queue of incoming updates that you need to
read off the socket before you can see the HELLO coming in.

Of course you're toast either way if your interface queues are
large enough and you don't do preferential queueing for BGP.

Pete
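
Pete's queueing point can be shown with a toy FIFO: an in-band keepalive
sitting behind a burst of updates is not seen until every update ahead of it
has been read. The sketch below is a model, not a measurement; the function
names, the 100,000 pending updates and the 500 microseconds per update are
assumptions picked only to make the arithmetic visible.

```python
from collections import deque


def time_until_keepalive_read_us(pending_updates, us_per_update):
    """Microseconds before a keepalive queued behind the updates is read."""
    queue = deque(["UPDATE"] * pending_updates + ["KEEPALIVE"])
    elapsed_us = 0
    while queue:
        msg = queue.popleft()
        if msg == "KEEPALIVE":
            return elapsed_us
        # Each update is fully processed before the next message is read.
        elapsed_us += us_per_update
    return elapsed_us


def session_survives(pending_updates, us_per_update, hold_remaining_s):
    """True if the keepalive is read before the hold timer would expire."""
    delay_us = time_until_keepalive_read_us(pending_updates, us_per_update)
    return delay_us < hold_remaining_s * 1_000_000


delay_s = time_until_keepalive_read_us(100_000, 500) / 1_000_000
print(f"keepalive read after {delay_s:.0f} s")
print("survives 30 s hold:", session_survives(100_000, 500, hold_remaining_s=30))
print("survives 90 s hold:", session_survives(100_000, 500, hold_remaining_s=90))
```

With these made-up figures the keepalive waits 50 seconds, which outlasts a
30-second hold time but not a 90-second one. A liveness check that does not
share the queue would not pay that delay.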

Thus spake "Vadim Antonov" <avg@exigengroup.com>

> It makes little sense to detect transient glitches. Any possible reaction
> to those glitches (e.g. withdrawal of exterior routes with subsequent
> reinstatement) is more damaging than the glitches themselves.

(Ignoring BGP for the moment, which has no clue of the reliability of its links)

That's due to the "slow down, fast up" nature of IETF protocols. Do you really
want a link or routing protocol claiming your link is "up" if it passes only 33%
of your keepalives?

IMHO, the key to fast-response protocols is reversing this behavior: require
(say) 10 keepalives in a row for a link to be "up", and missing one forces it
"down".

S
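
The "slow up, fast down" rule proposed above fits in a few lines. This is a
minimal sketch, assuming one observation per keepalive interval;
FastDownSlowUp is a hypothetical name, and the threshold of 10 is the "(say)"
figure from the message, not a recommendation.

```python
class FastDownSlowUp:
    """Link state that needs a run of keepalives to come up, one miss to go down."""

    def __init__(self, required_up=10):
        self.required_up = required_up
        self.streak = 0      # consecutive keepalives received so far
        self.up = False

    def observe(self, keepalive_received):
        """Feed one keepalive interval's result; return the resulting state."""
        if keepalive_received:
            self.streak += 1
            if self.streak >= self.required_up:
                self.up = True
        else:
            # A single miss resets the run and forces the link down.
            self.streak = 0
            self.up = False
        return self.up


link = FastDownSlowUp(required_up=10)
states = [link.observe(True) for _ in range(10)]
print(states[8], states[9])   # prints False True
print(link.observe(False))    # prints False

# A link passing only one keepalive in three never comes up under this rule.
flaky = FastDownSlowUp(required_up=10)
print(any(flaky.observe(ok) for ok in [True, False, False] * 20))  # prints False
```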