2008.02.20 NANOG 42 Graceful Restart, NSF and NSR

Last set of three notes...was going to send them from the
ballroom, but they started tearing down as soon as the closing
finished, even though it was 30 minutes early. ^_^;;


2008.02.20 Graceful restart, non-stop routing,
stateful switchover, and non-stop forwarding.

Ken Weissner, kweissne@cisco.com

systems and technology architect
introduction to HA technologies

Lots of moving parts, so we start with some definitions:
HA -- high availability
   general term
SSO -- Stateful Switchover
   two processors/processes, transfers information from
   one processor so second can pick up from first one.
NSF -- NonStop Forwarding
   forwarding is key portion; router continues moving packets
   as control plane recovers/restarts
GR -- Graceful Restart
   IETF specified mechanism, allows peers to give time
   for peer to come back before routes are flushed; both
   sides have to agree on it first.
NSR -- NonStop Routing
   maintain session completely without losing state as it
   shifts from one processor to the other, without having
   to alert peer via graceful restart
Those four all work together to allow unplanned switchover
to occur with minimal interruption of service (not quite
ISSU -- In Service Software Upgrade
   use above items to allow upgrading of software while
   packets continue to move.

Why do people care about SSO/NSF?
Concerns about single point of failure; customer aggregation
or customer connect point which would otherwise impact many,
many customers when issues arise.

They can also do it on non-distributed platforms; very little
packet loss on them.

Diagram showing what impacts happen with adjacency
failures prior to SSO/NSF.

Another slide showing nonstop forwarding due to
graceful restart mechanisms.

supported for LDP, OSPF, BGP, ISIS

Graceful restart for EIGRP, and two different draft
mechanisms for OSPF.

Two modes; "aware" and "capable". If you're a device
that is "capable", your peers need to be "aware" that
you are capable of doing it.
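
For BGP, one concrete way the capable side signals this
is the Graceful Restart capability in the OPEN message
(RFC 4724), which carries a restart-state bit and a 12-bit
restart time in seconds. A minimal Python sketch of that
field, not any router's actual implementation; the function
names are made up for illustration:

```python
import struct

GR_CAPABILITY_CODE = 64  # Graceful Restart capability code (RFC 4724)


def encode_gr_capability(restart_time, restarting=False, families=()):
    """Build the capability TLV a "capable" speaker would send.

    families is an iterable of (afi, safi, forwarding_preserved) tuples.
    """
    if not 0 <= restart_time <= 0x0FFF:
        raise ValueError("restart time must fit in 12 bits (0-4095 seconds)")
    # Top bit is the restart-state flag; low 12 bits are the restart time.
    value = struct.pack("!H", (0x8000 if restarting else 0) | restart_time)
    for afi, safi, forwarding_preserved in families:
        flags = 0x80 if forwarding_preserved else 0
        value += struct.pack("!HBB", afi, safi, flags)
    return struct.pack("!BB", GR_CAPABILITY_CODE, len(value)) + value


def decode_gr_capability(tlv):
    """Parse the TLV the way an "aware" (helper) peer would."""
    if len(tlv) < 4:
        raise ValueError("capability too short")
    code, length = struct.unpack("!BB", tlv[:2])
    if code != GR_CAPABILITY_CODE or length < 2 or (length - 2) % 4:
        raise ValueError("not a well-formed Graceful Restart capability")
    if len(tlv) < 2 + length:
        raise ValueError("capability truncated")
    (flags_and_time,) = struct.unpack("!H", tlv[2:4])
    families = []
    for off in range(4, 2 + length, 4):
        afi, safi, flags = struct.unpack("!HBB", tlv[off:off + 4])
        families.append((afi, safi, bool(flags & 0x80)))
    return {
        "restarting": bool(flags_and_time & 0x8000),
        "restart_time": flags_and_time & 0x0FFF,
        "families": families,
    }
```

If either side leaves the capability out of its OPEN, the
session simply comes up without graceful restart.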

the capable device tells its 'aware' peers that it
will come back within a timeout interval.
"aware" and "helper" are generally synonymous.

configuration for capability needs to be turned on;
not on by default.
awareness is turned on by default in Cisco code.
TCP based protocols have to have it configured on
both sides for graceful restart to be able to work.

graceful restart concerns voiced at NANOG 40:
If I'm an aware peer, how do I tell if my neighbor
really went away, and I should reroute quickly, or
if it's going to be back shortly, and I should
continue to move packets towards it?

Need to know if NSF is active or not; but NSF
isn't something you configure.

so, graceful restart concerns addressed:
For BGP, there's a restart timer which limits amount
of time before peer comes back, which limits the amount
of blackhole time; default is 120 seconds, but can be
set shorter to limit the duration of blackhole events.
Other conditions also apply; if link is POS point to
point, linkdown will abort GR, and will tear down the
session and flush routes.
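
The decision described above, from the aware peer's point
of view, can be sketched roughly as follows. This is a toy
model in Python, not vendor logic; the class and function
names are hypothetical, and only the 120 second default
and the link-down abort come from the talk:

```python
from dataclasses import dataclass

DEFAULT_RESTART_TIME = 120  # seconds; the default quoted in the talk


@dataclass
class SessionLoss:
    """What the aware peer knows when a BGP session drops (toy model)."""
    peer_negotiated_gr: bool
    link_up: bool
    seconds_since_loss: float
    restart_time: float = DEFAULT_RESTART_TIME


def helper_action(loss):
    """Return "keep-forwarding" or "flush" for the lost neighbor's routes."""
    if not loss.peer_negotiated_gr:
        return "flush"  # no GR agreed, so treat it as a real failure
    if not loss.link_up:
        return "flush"  # e.g. POS point-to-point link down aborts GR
    if loss.seconds_since_loss >= loss.restart_time:
        return "flush"  # restart timer expired, stop blackholing
    return "keep-forwarding"
```

Lowering restart_time shortens the worst-case blackhole
window, at the cost of giving the restarting router less
time to come back.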

Once open message with restart bit set comes across,
routes are put into stale bucket; still used, but
stale counter begins to count down until they get
Also, added "end of RIB" update message; at that point,
it can build new table based on updates, and can act
upon changes in the stale table; clear truly stale
entries, stop stale timer, and process is done.
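
A rough Python sketch of the stale-route handling on the
aware peer, assuming a simple per-peer table keyed by
prefix. The class and method names are invented for
illustration, and timers are left out; sweep_stale stands
in for both End-of-RIB arrival and stale timer expiry:

```python
class HelperRib:
    """Routes learned from one restarting peer (illustrative only)."""

    def __init__(self):
        self.routes = {}  # prefix -> {"next_hop": str, "stale": bool}

    def update(self, prefix, next_hop):
        # A fresh update replaces the entry and clears its stale mark.
        self.routes[prefix] = {"next_hop": next_hop, "stale": False}

    def peer_restarted(self):
        # Keep forwarding on every route, but mark each one stale.
        for entry in self.routes.values():
            entry["stale"] = True

    def sweep_stale(self):
        # Called on End-of-RIB or when the stale timer expires:
        # drop whatever the peer did not re-advertise.
        removed = sorted(p for p, e in self.routes.items() if e["stale"])
        for prefix in removed:
            del self.routes[prefix]
        return removed
```

Routes that the restarted peer re-advertises before the
sweep survive; anything still marked stale is removed.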

For OSPF, two ways to handle it:
RFC3623, vs draft-nguyen-ospf-restart-06
slight differences; does a new route update
cause an abort out of graceful restart or not?

Cisco supports Nonstop routing for ISIS and BGP
in IOS
in-box solution, no other communication needed.

BGP configured on neighbor basis; ISIS is box-wide.

Hybrid BGP NSR
run GR on route reflectors

Per peer GR/NSR config for BGP

currently BGP GR is globally enabled for all peers
in the routing process

Where BGP NSR is available, it is configured on a
per-peer basis.

GR gets used over NSR if peer supports GR

Need mechanism to protect interfaces/L2
state and forwarding; protect both data plane
and control plane.

Features work together to provide protection and
redundancy; should really use them all together
as designed. eg, enabling SSO also enables NSF
(FIB checkpointing).
Each routing protocol on the box needs to be
aware and configured to get full benefit of

routing protocol timers often cranked down for
faster detection of failures.

When switchover happens, takes a little while
for new processor to catch up, if dead timer is
set too low, graceful restart may not kick in
before the sessions are torn down.

First packet can be pushed in 10 seconds for BGP;
but setting timers down to 1/5 or anything below
10 seconds can result in oscillation or other
problematic interactions.
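
The timer interaction can be checked with simple
arithmetic: the peer's dead/hold timer has to outlast the
time the new processor needs to catch up. A small Python
sketch, assuming the 10 second BGP figure from the talk as
the recovery time; the function names and the example
timer values are hypothetical:

```python
BGP_FIRST_PACKET_SECONDS = 10  # figure quoted in the talk for BGP


def session_survives_switchover(dead_timer,
                                recovery_seconds=BGP_FIRST_PACKET_SECONDS):
    """True if the dead/hold timer is at least the catch-up time."""
    return dead_timer >= recovery_seconds


def risky_timers(timers, recovery_seconds=BGP_FIRST_PACKET_SECONDS):
    """Return names whose dead timer would expire before recovery."""
    return sorted(
        name for name, dead in timers.items()
        if not session_survives_switchover(dead, recovery_seconds)
    )
```

For example, risky_timers({"bgp": 180, "ospf": 4}) flags
only "ospf", since a 4 second dead timer expires before a
10 second recovery completes.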