AS690 Gated/BGP4 Deployment

The good news is that since Friday morning, we now have a stable
gated/BGP4 router interoperating with 90+ rcp_routed peers within
AS690, and we are preparing for further gated/BGP4 deployment on
AS690 routers.

The bad news is that we suffered a major system-wide rcp_routed
failure along the way. On Thursday evening around 22:00EST, we
exercised a dormant bug within rcp_routed: the gated router
(ENSS205) sent a valid IBGP message to its 90+ rcp_routed peers,
which did not process it correctly, and their routing daemons died.
This resulted in about a 45 minute outage, although the gated peer
(ENSS205) stayed up. The problem was immediately traced to an
rcp_routed bug. The bug (which was not found on our testnet) was
fixed, and the fix was deployed during Friday morning's configuration
maintenance window.

The ENSS205 router is a semi-production system that has been used
to stage new gated versions onto AS690. The last update to gated
on this system was done outside the normal configuration
maintenance window in an effort to expedite the deployment on other
full-production nodes for CIDR transition. While this outage could
have occurred at any time (it was induced by a random external peer
interaction), we will improve our procedure for all gated
installations by establishing special scheduled maintenance windows,
both to minimize the impact of any problems and to work towards a
timely AS690 CIDR deployment.

There are two non-critical known problems with the AS690 gated
that have been (or will be) addressed prior to the next installation.
I have summarized these problems below for those interested.

Another new development is that the rcp_routed support for a default
route that was previously deployed on AS690 was successfully tested
during the last scheduled configuration maintenance window. A non-
disruptive configuration change can be administered at any time to
enable AS690 to default to the AS1133 (gated BGP4/CIDR) peer
which is connected both to MAE-East and to CNSS57 (AS690) over
a private 10Mb/s Ethernet segment. This may be required to assist
specific MAE-East peers with a timely CIDR deployment.

We have established two specific upcoming maintenance windows for
gated deployment on AS690 routers. The next window will be
Sunday morning, Feb 14 (00:30EST - 08:00EST). The AS690 routers
that we have picked as candidates to convert to run gated (pending
consent of the peers) include:

ENSS194 (ANS Ann Arbor Backup Router)
ENSS158 (Maui HPCC)
CNSS120 (Honolulu CNSS)
ENSS205 (new Gated re-deployment)
ENSS160 (ANS Elmsford)
ENSS139 (Rice U)
ENSS131 (Ann Arbor)

There is an overlapping scheduled power maintenance window for the
MCI New York POP (Sunday morning 00:30EST-02:30EST) that we
will work around; it should not interfere with the gated deployment.

The ANS NOC has contacted each of the peers that will be affected
by service disruptions on these nodes to acknowledge their consent,
and NSR messages have been sent for each of these maintenance
windows describing the gated deployment, and the potential for
AS690 routing instabilities that may occur during these windows.

Following this maintenance we would like to validate with the peers
that all configurations with external peers are behaving as expected.
If this maintenance is successful we would like to continue the AS690
deployment during the Tuesday morning Feb 15th configuration
maintenance window (05:00EST - 08:00EST) for the following
candidate nodes:

ENSS136 (College Park)
ENSS145 (FIX-East)
ENSS144 (FIX-West)

We would try to identify any problems within the maintenance
window. If we could not complete this maintenance within the
window, these routers would be rolled back to run rcp_routed.

Once we have stabilized the above set of nodes, the rest of the
AS690 deployment could proceed as rapidly as can be scheduled
(within maintenance windows).

For general interest, there are two known problems with AS690 gated
that we expect to fix before the AS690 deployment concludes. These
may be summarized as:

1. Jurassic LSPs. At the time rcp_routed was originally developed,
     there was an IS-IS LSP packet format that was designed to
     carry external routes, before IBGP was implemented to carry
     these. Modern rcp_routed no longer cares about this format,
     but specific production rcp_routed nodes still occasionally
     generate these "Jurassic LSPs", which have a header but carry
     no external routes, at rcp_routed startup time (we cannot
     reproduce this on our testnet). Gated does not know what
     to do with these packets so it drops them and logs an error
     message. Rcp_routed acks them and goes on about its
     business. Rather than modifying rcp_routed to eliminate these
     LSPs, we instead worked around this by adding a few lines of
     code to the gated SLSP to ack these packets and ignore them.
     This does not cause problems running gated on a few
     production network nodes and will not bother ENSS205, but we
     should and will fix this before we scale the deployment up.

2. BGP memory leak. We already know from our testnet
     experience that there is a slow memory leak in the gated BGP
     code that is exercised when a connection to an external peer
     fails and gated tries to re-establish the session. The only way
     we could observe this problem on the testnet was to force a
     gated machine to try to establish sessions (2 times per
     second) with 80 external peers that refused to connect, for
     about 4 days.

     We would not expect to see problems with this on nodes that
     are configured properly, but in any case we expect to have a
     fix for this shortly and this will be retrofitted on existing gated
     nodes during a future installation window.
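
As a rough illustration of how slow the BGP memory leak is, the
testnet figures in item 2 can be turned into a count of failed
connection attempts. The text does not say whether the rate of 2 per
second is a total or a per-peer figure, so this sketch computes both
readings. The function name and both interpretations are assumptions
made for illustration, not measurements from the testnet.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def failed_attempts(attempts_per_second, days, peers=1):
    """Count failed connection attempts over the test period.

    Hypothetical helper: multiplies an attempt rate by the test
    duration, and optionally by a peer count when the rate is
    assumed to apply to each peer separately.
    """
    return int(attempts_per_second * SECONDS_PER_DAY * days * peers)


# Reading 1 (assumption): 2 attempts per second in total,
# spread across the 80 refusing peers.
total_reading = failed_attempts(2, 4)

# Reading 2 (assumption): 2 attempts per second to each of the 80 peers.
per_peer_reading = failed_attempts(2, 4, peers=80)

print(total_reading)     # 691200
print(per_peer_reading)  # 55296000
```

Under either reading, the leak needed several hundred thousand failed
attempts or more before it became observable, which is consistent
with not expecting it to matter on properly configured nodes.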