RE: [j-nsp] Krt queue issues

Jensen_Tyler · October 3, 2012, 2:45pm

Look into Static route retain. Should keep the route in the forwarding table.

From the Juniper site:

<<<
Route Retention

By default, static routes are not retained in the forwarding table when the routing process shuts down. When the routing process starts up again, any routes configured as static routes must be added to the forwarding table again. To avoid this latency, routes can be flagged as retain, so that they are kept in the forwarding table even after the routing process shuts down. Retention ensures that the routes are always in the forwarding table, even immediately after a system reboot.

Thanks,

Jensen Tyler
Sr Engineering Manager
Fiberutilities Group, LLC

Naslund_Steve · October 3, 2012, 9:50pm

I think route retention might help in the event the table was cleared or
routing process restarted, but I don't think it will help with a boot
because the table structures are being built as part of the system
initialization. In reality, I would expect the static routes to get
installed very early as soon as the routing process comes up. Since you
will need a route to your BGP neighbor (even though it may be directly
connected, it is still a route), routing has to be up BEFORE BGP
establishes and by definition your static routes will have to be up
before your BGP routes are ready. How well your router responds to
traffic during an initial boot and during a 300,000 route update is
another story. My experience with very large routers and tables is that
you will have a hard time guaranteeing user traffic will pass with very
much performance during an event like a full table rebuild. Luckily
with the bandwidth we have these days and the CPU power on the routers,
it does not take that long to pull in a full internet table and begin
handling traffic.

Steven Naslund

Tim_Vollebregt · January 8, 2013, 2:45pm

Hi,

As a workaround, we now configure a default route towards a core router over 8 x 10G before doing maintenance on an MX box. This default route is installed before the BGP sessions come up. It would cause some packet loss during an outage at peak hours, but it is fine during maintenance hours.
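
To illustrate why a static default helps here: forwarding picks the longest matching prefix, so a default route covers every destination until the more specific BGP routes reach the FIB. Below is a minimal Python sketch of that lookup. The next-hop names and the documentation prefixes are made up for illustration, and this is only a model of the idea, not how a PFE actually performs lookups.

```python
import ipaddress

def lookup(fib, destination):
    """Return the next hop of the longest matching prefix, or None."""
    addr = ipaddress.ip_address(destination)
    best = None
    for prefix, next_hop in fib.items():
        net = ipaddress.ip_network(prefix)
        # Keep the matching entry with the longest prefix length.
        if addr in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, next_hop)
    return best[1] if best else None

# Hypothetical FIB right after boot: nothing installed yet.
fib = {}
print(lookup(fib, "198.51.100.7"))   # None: no route, traffic is dropped

# The static default is installed before any BGP session comes up.
fib["0.0.0.0/0"] = "core-router"
print(lookup(fib, "198.51.100.7"))   # core-router

# Later a more specific BGP route is installed and takes over.
fib["198.51.100.0/24"] = "bgp-peer"
print(lookup(fib, "198.51.100.7"))   # bgp-peer
print(lookup(fib, "203.0.113.9"))    # core-router (still not in the FIB)
```

The trade-off is the one described above: everything without a specific route is sent to the core router until the full table is installed.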

I've seen cases where it took up to 30 minutes before the full table was installed correctly in the PFEs.

Currently this issue/bug is holding back our Juniper deployments. As far as I know, Juniper has created a project group for this bug, and so far they have been able to reproduce the issue. It looks like the issue is now being taken seriously.

Tim

Richard_A_Steenbegen · January 8, 2013, 9:20pm

PR 836197

I actually have very good luck reproducing it:

http://cluepon.net/ras/rpdstall.png

The issue appears to be that when rpd is busy processing incoming BGP
updates (such as when you turn up a large number of peers
simultaneously), it starves the rest of the process of any CPU time for
handling/installing the routes. The graph above
shows a plot of the total BGP paths, the number of routes in the
"pending" state, and the number of routes actually installed into the
forwarding hardware. This is a very simplified example (nothing but IBGP
sessions with very simple policies here, not even any EBGP neighbors),
using the latest top of the line routing engine, so in real life the
issue is much worse.

As you can see, while rpd is still busy receiving and processing the
incoming updates, the number of pending routes rises and doesn't fall,
and the number of routes installed in the PFE stays almost nonexistent.
A few routes actually manage to squeak in before all of the BGP sessions
come up, which is why it has any at all for the period between 0 and 330
seconds. After the router finishes receiving the BGP paths, the pending
routes clear very quickly, and then the FIB installation process begins.
8 minutes after turning up the BGP sessions, this router finally has a
full table installed in hardware. The pending routes actually clear much
quicker than this once the BGP routes stop coming in; I need to update
this graph with a higher resolution to show it.

Juniper actually DOES have a fix for this issue, tweaking the scheduler
in rpd so that the router still processes BGP routes even when it's
spending a lot of time receiving new routes. Unfortunately they haven't
yet decided to prioritize implementing this fix, so it's still stuck in
development. If this issue drives you as insane as it does me, I highly
encourage you to talk to your account team about PR 836197 and why 8-20+
minutes to install routes to the FIB is not acceptable to you.
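
For anyone who wants to see the effect in isolation, here is a toy Python model of the starvation described above. It is only a sketch under loud assumptions: receiving a route and installing a route each cost one unit of CPU, every tick has a fixed budget, and `install_share` is a made-up knob for how much of that budget is reserved for installation while updates are still arriving. It is not rpd's scheduler and the numbers are arbitrary.

```python
def simulate(total_routes, budget, install_share):
    """Return the number of installed routes after each tick (toy model)."""
    if budget <= 0 or not 0 <= install_share < 1:
        raise ValueError("need budget > 0 and 0 <= install_share < 1")
    to_receive = total_routes   # routes the peers have not sent yet
    pending = 0                 # received, waiting to be installed
    installed = 0
    history = []
    while installed < total_routes:
        if to_receive > 0:
            # Updates still arriving: installation only gets its share.
            install_budget = int(budget * install_share)
            recv_budget = budget - install_budget
        else:
            install_budget, recv_budget = budget, 0
        received = min(recv_budget, to_receive)
        to_receive -= received
        pending += received
        # CPU the receive task did not use goes to installation.
        install_budget += recv_budget - received
        done = min(install_budget, pending)
        pending -= done
        installed += done
        history.append(installed)
    return history

def ticks_until(history, count):
    """First tick (1-based) at which at least `count` routes are installed."""
    for tick, installed in enumerate(history, start=1):
        if installed >= count:
            return tick
    return None

starved = simulate(1000, 100, 0.0)   # receiving always wins
shared = simulate(1000, 100, 0.5)    # half of each tick reserved for installs
print(ticks_until(starved, 500), ticks_until(shared, 500))
```

In this toy model both runs finish at the same tick, since the total work is identical, but with `install_share=0.0` nothing is installed until every update has been received, while the shared run fills the forwarding table from the first tick.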

Iddo · January 8, 2013, 10:10pm

Hi,

> PR 836197

That looks like a spanking new PR number to me.
The highest PR number I found in 12.2 release notes was 82xxxx.
Rather strange that they didn't have an earlier PR number, while the
issue has existed for such a long time.

> If this issue drives you as insane as it does me, I highly
> encourage you to talk to your account team about PR 836197

Done.

I can't read PR836197 online as it is not public.
Can you post it without liability?
If you would be liable do not post it.. Also do _not_ email me off
list with the PR description.......

Thanks.

Richard_A_Steenbegen · January 9, 2013, 12:49am

> Hi,
>
> > PR 836197
>
> That looks like a spanking new PR number to me.
> The highest PR number I found in 12.2 release notes was 82xxxx.
> Rather strange that they didn't have an earlier PR number, while the
> issue has existed for such a long time.

Oh I have a pile of PRs about a mile long, including some that I opened
on this issue 5+ years ago. But I'm not going to harp on the complete
absurdity of how long it has taken to finally figure this thing out, or
the number of people who have seen this issue while they've claimed all
along that nobody else sees it. I'm just going to focus on fixing it.
This is the PR that they've chosen for implementing the actual fix, so
that's what I'm going with for the sake of simplicity.

> I can't read PR836197 online as it is not public.
> Can you post it without liability?
> If you would be liable do not post it.. Also do _not_ email me off
> list with the PR description.......

Neither can I, but the basic description of the issue is what I said
before.