where was my white knight....

how would a sidr-enabled routing infrastructure have fared in yesterday's routing circus?

/bill

The effects of large amounts of route-churn on the auth chain - perhaps DANE? - might've been interesting . . .

We saw an increase in IPv6 traffic which correlated in time with the onset of this IPv4 incident.

Happy eyeballs in action, automatically shifting what it could.

Mike.
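As a rough illustration of the Happy Eyeballs behaviour Mike describes: RFC 6555
has the host attempt connections over both address families, preferring IPv6 but
giving IPv4 a chance after a short head start, so traffic shifts to whichever
family actually works. A simplified, sequential Python sketch (a real
implementation races the attempts in parallel, and the timeouts here are
arbitrary):

    import socket

    def connect_prefer_v6(host, port, v6_timeout=0.3, v4_timeout=5.0):
        # Crude sequential approximation of Happy Eyeballs: give IPv6 a
        # short head start, then fall back to IPv4 if nothing connected.
        for family, timeout in ((socket.AF_INET6, v6_timeout),
                                (socket.AF_INET, v4_timeout)):
            try:
                infos = socket.getaddrinfo(host, port, family, socket.SOCK_STREAM)
            except socket.gaierror:
                continue                      # no addresses in this family
            for af, socktype, proto, _, sockaddr in infos:
                s = socket.socket(af, socktype, proto)
                s.settimeout(timeout)
                try:
                    s.connect(sockaddr)
                    return s                  # first family that works wins
                except OSError:
                    s.close()
        raise OSError("could not connect over either address family")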

that was/is kind of orthogonal to the question... would the sidr plan
for routing security have been a help in this event? nice to know
unsecured IPv6 took some of the load when the unsecured IPv4 path
failed.

the answer seems to be NO, it would not have helped and would have actually
contributed to network instability with large numbers of validation requests
sent to the sidr/ca nodes...

/bill

SIDR is intended to provide route-origination validation - it isn't intended to be, nor can it possibly be, a remedy for vendor-specific implementation problems.

Validation storm-control is something which must be accounted for in SIDR/DANE architecture, implementation, and deployment. But at the end of the day, vendors are still responsible for their own code.

To be clear, I was alluding to some discussion centering around DANE or a DANE-like mechanism to handle SIDR-type route validation. Recursive dependencies make this a non-starter, IMHO.

i'm curious about sidr cold bootup, specifically when you are attempting to
validate prefixes from an rpki CA or cache to which you do not necessarily
have network connectivity because your igp is not yet fully up. The
phrases "layering violation" and "chicken and egg" come to mind.

Nick

well... you're still stuck w/ knowing where your CA is...

/bill

yeah...there is that.

/bill

the answer seems to be NO, it would not have helped and would have
actually contributed to network instability with large numbers of
validation requests sent to the sidr/ca nodes...

utter bullshit. maybe you would benefit by actually reading the doccos
and understanding the protocols.

i'm curious about sidr cold bootup, specifically when you are
attempting to validate prefixes from an rpki CA or cache to which you
do not necessarily have network connectivity because your igp is not
yet fully up. The phrases "layering violation" and "chicken and egg"
come to mind.

what comes to my mind is that NotFound is the default and it is
recommended to route on it.

i know boys are not allowed to read the manual, but this is starting
to get boring.

randy
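For concreteness, the origin-validation outcome Randy is referring to works
roughly as sketched below (RFC 6811 semantics: Valid if a covering VRP matches
the origin AS within its maxLength, Invalid if covered but mismatched, NotFound
if no VRP covers the route at all - and NotFound is the default result, the one
it is recommended to route on). The VRP data and names here are made up for
illustration:

    import ipaddress

    # A VRP is (prefix, max_length, origin_asn) as delivered by the local cache.
    VRPS = [
        (ipaddress.ip_network("192.0.2.0/24"), 24, 64500),
    ]

    def validation_state(prefix, origin_asn):
        prefix = ipaddress.ip_network(prefix)
        covered = False
        for vrp_prefix, max_len, vrp_asn in VRPS:
            if prefix.version == vrp_prefix.version and prefix.subnet_of(vrp_prefix):
                covered = True
                if prefix.prefixlen <= max_len and origin_asn == vrp_asn:
                    return "Valid"
        return "Invalid" if covered else "NotFound"

    assert validation_state("192.0.2.0/24", 64500) == "Valid"
    assert validation_state("192.0.2.0/25", 64500) == "Invalid"     # beyond maxLength
    assert validation_state("198.51.100.0/24", 64501) == "NotFound" # no covering VRP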

I understand what the manual says (actually, i read it). I'm just curious
as to how this is going to work in real life. Let's say you have a router
cold-booting with a bunch of ibgp peers, a transit or two, and an rpki cache
which is located on a non-connected network - e.g. a small transit pop / AS
boundary scenario. The cache is not necessarily going to be reachable until
the router sees an update for the cache's connected network. Until this
happens, there will be no connectivity from the router to the cache, and
consequently prefixes received from the transit may be subject to an
incorrect and potentially inconsistent routing policy with respect to the
rest of the network. Ok, they'll be revalidated once the cache comes online,
but what do you do with them in the interim? Route traffic to them, knowing
that they might or might not be correct? Drop them, from the router's point
of view, until the cache comes online? Forward potentially incorrect UPDATEs
to your other ibgp peers, and forward validated updates when the cache comes
online again? If so, then what if your incorrect new policy takes precedence
over an existing path in your ibgp mesh? And what if your RP is low on
memory from storing an unvalidated adj-rib-in?

You could argue to have a local cache in every pop, but that may not be
feasible either - a cache will require storage with a high write life-cycle
(i.e. forget about using lots of types of flash), and you cannot be
guaranteed that this is going to be available on a router.

Look, i understand that you're designing rpki <-> router interactivity such that
things will at least work in some fashion when your routers lose sight of
their rpki caches. The problem is that this approach weakens rpki's
strengths - e.g. the ability to help stop youtube-like incidents from
recurring by ignoring invalid prefix injection.

Nick

Indeed, we can expect new and exciting ways to blow up networks with SIDR.

that was/is kind of orthogonal to the question... would the sidr plan
for routing security have been a help in this event? nice to know
unsecured IPv6 took some of the load when the unsecured IPv4 path
failed.

if all routing goes boom, would secure routing have saved you?
no... all routing went boom.

the answer seems to be NO, it would not have helped and would have actually
contributed to network instability with large numbers of validation requests
sent to the sidr/ca nodes.

I think actually it wouldn't have caused more validation requests; the
routers have (in some form of the plan) a set of validated data pulled
from their local cache, and they use this for origin validation...
there's no requirement to refresh up the entire chain. (I think).

-chris
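A toy sketch of the mechanism Chris is describing, assuming something like
the rpki-rtr model: the router keeps an in-memory copy of the validated
prefix/origin data (VRPs) fed from its local cache and pulls only deltas, so
BGP churn never turns into validation requests up the CA chain. All names
below are illustrative:

    class RouterVrpTable:
        # Router-side copy of validated ROA payloads, fed by a local cache.

        def __init__(self):
            self.serial = 0
            self.vrps = set()           # {(prefix, max_length, origin_asn)}

        def refresh(self, fetch_deltas):
            # fetch_deltas stands in for an rpki-rtr style serial query to the
            # local cache; it returns (new_serial, announced, withdrawn).
            new_serial, announced, withdrawn = fetch_deltas(self.serial)
            self.vrps |= announced
            self.vrps -= withdrawn
            self.serial = new_serial

        def covering_vrps(self, origin_asn):
            # Route churn only consults this in-memory set; nothing here talks
            # to a CA, so a flap storm doesn't hammer the repositories.
            return {v for v in self.vrps if v[2] == origin_asn}

    # Example: apply one delta announcing a single VRP.
    table = RouterVrpTable()
    table.refresh(lambda serial: (serial + 1,
                                  {("192.0.2.0/24", 24, 64500)},  # announced
                                  set()))                         # withdrawn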

the answer seems to be NO, it would not have helped and would have actually
contributed to network instability with large numbers of validation requests
sent to the sidr/ca nodes...

i'm curious about sidr cold bootup, specifically when you are attempting to
validate prefixes from an rpki CA or cache to which you do not necessarily
have network connectivity because your igp is not yet fully up. The
phrases "layering violation" and "chicken and egg" come to mind.

'lazy validation' - prefer to get at least somewhat converged, then validate.

or, the same old ways... only with crypto!
really, there was some care taken in the process to create this and
NOT stomp all over how networks currently work.

comments welcome though.
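One way to picture the 'lazy validation' ordering, under the same assumptions
as above: routes are accepted and routed on as NotFound while the cache is
unreachable, and only after things have converged does the router sweep its
RIB against the (local) cache and re-run policy. The class and method names
here are illustrative, not from any draft:

    from dataclasses import dataclass

    @dataclass
    class Route:
        prefix: str
        origin_asn: int
        state: str = "NotFound"

    rib = []                            # stand-in for the router's RIB

    def install(route, cache):
        # While the cache is unreachable, don't block convergence: the route
        # goes in as NotFound and gets routed on, per the default behaviour.
        route.state = cache.validate(route) if cache and cache.is_up() else "NotFound"
        rib.append(route)

    def on_cache_up(cache):
        # The lazy sweep: only after we're reasonably converged do we walk the
        # RIB, validate against the now-reachable cache, and re-run policy on
        # anything whose state changed.
        for route in rib:
            new_state = cache.validate(route)
            if new_state != route.state:
                route.state = new_state    # policy re-evaluation hooks in here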

I understand what the manual says (actually, i read it). I'm just curious
as to how this is going to work in real life. Let's say you have a router
cold-booting with a bunch of ibgp peers, a transit or two, and an rpki cache
which is located on a non-connected network

Anybody who puts their rpki cache someplace that isn't accessible until they
get the rpki initialized gets what they deserve. Once you realize this, the
"what do we do for routing until it comes up" concern trolling in the rest of
that paragraph becomes pretty easy to sort out...

You could argue to have a local cache in every pop, but that may not be
feasible either - a cache will require storage with a high write life-cycle
(i.e. forget about using lots of types of flash), and you cannot be
guaranteed that this is going to be available on a router.

Caching just enough to validate the routes you need to get to a more capable
rpki server shouldn't have a high write life-cycle. Heck, you could just manually
configure a host route pointing to the rpki server...

And it would hardly be the first time that people have been unable to deploy
feature XYZ because it wouldn't fit in the flash on older boxes still in
production.

In a message written on Tue, Nov 08, 2011 at 04:22:48PM -0500, Christopher Morrow wrote:

I think actually it wouldn't have caused more validation requests; the
routers have (in some form of the plan) a set of validated data pulled
from their local cache, and they use this for origin validation...
there's no requirement to refresh up the entire chain. (I think).

I kinda think everyone is wrong here, but Chris is closer to accurate. :-P

When a router goes boom, the rest of the routers recalculate around
it. Generally speaking all of the routers will have already had a
route with the same origin, and thus have hopefully cached a lookup
of the origin. However, that lookup might have been done
days/weeks/months ago, in a stable network.

While I'm not familiar with the nitty-gritty details here, caches
expire for various reasons. The mere act of the route changing
paths, if it moved to a device with a stale cache, would trigger a
new lookup, right?

Basically I would expect any routing change to generate a set of
new lookups proportional to the cache expiration rules.

What am I missing?

Which may very well fail because all the routing is hosed. I'm not all that familiar with the potential implementation issues, but I would think that network-local caches would be in order.

Even with local caches, I would expect a high rate of change to trigger something sensible to keep this kind of craziness from happening. I am sure enough people have had incorrectly scaled RADIUS farms blow up when a load of DSLAMs vanish and come back again not to repeat such storms.

Anybody who puts their rpki cache someplace that isn't accessible until they
get the rpki initialized gets what they deserve.

One solution is to have directly-connected rpki caches available to all your bgp edge routers throughout your entire network. This may turn out to be expensive capex-wise, and will turn out to be yet another critical infrastructure item to maintain, increasing opex.

Alternatively, you host rpki caches on all your AS-edge routers => upgrades - and lots of currently-sold kit will simply not handle this sort of thing properly.

Once you realize this, the "what do we do for routing until it comes up"
concern trolling in the rest of that paragraph becomes pretty easy to sort
out...

I humbly apologise for expressing concern about the wisdom of imposing a hierarchical, higher-layer validation structure for forwarding-info management on a pre-existing, lower-layer, fully distributed system which is already pretty damned complex...

What's that principle called again? Was it "Keep It Complex, Stupid"? I can't seem to remember :-)

Caching just enough to validate the routes you need to get to a more capable
rpki server shouldn't have a high write life-cycle.

Lots of older flash isn't going to like this => higher implementation cost due to upgrades.

Heck, you could just manually
configure a host route pointing to the rpki server...

Yep, hard-coding things - good idea, that.

And it would hardly be the first time that people have been unable to deploy
feature XYZ because it wouldn't fit in the flash on older boxes still in
production.

This is one of several points I'm making: there is a cost factor here, and it's not clear how large it is.

Nick