Hurricane Electric has reached 0 RPKI INVALIDs in our routing table

I'm pleased to announce Hurricane Electric has completed our RPKI
INVALID filtering project and we now have 0 RPKI INVALIDs in our routing
table.

Hurricane Electric has 29021 BGP sessions with 22109 prefix filters with
7191 networks directly and 8239 networks including Internet exchanges.

We filter all BGP sessions using prefix filters based on IRR and RPKI.

These prefix filters are updated automatically both through a system of
daily updates and real time updates to prevent RPKI INVALID routes from
being carried in our routing table.

Absolutely awesome, Mike!

Can you put it on the roadmap to enable IRR-based filters for customers with BGP communities?

Congratulations, HE team!

Hey,

> These prefix filters are updated automatically both through a system of
> daily updates and real time updates to prevent RPKI INVALID routes from
> being carried in our routing table.

What does real time mean in this context? Does it mean exactly 0s leak
of INVALID, or 99% less than 30s? Or how do you define it?

I'm trying to think of an ideal way to do this in Junos which does a
few second ephemeral config commits. I could have an always-on SSH
session to each device to amortise login time, but even then if I can
do this cycle in 5s, I'd have to wait for BGP propagation delay in
DFZ, which is measured in minutes not seconds. So my definition of
real time here would be 99% <5min.
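
A definition such as "99% <5min" can be checked against measured leak
durations. The following is only a minimal sketch: the function names
fraction_within() and meets_target() are made up for this example, and the
input is assumed to be a list of seconds each INVALID route stayed in the
table.

```python
from typing import Sequence


def fraction_within(durations_s: Sequence[float], threshold_s: float) -> float:
    """Fraction of measured leak durations strictly shorter than threshold_s."""
    if not durations_s:
        raise ValueError("need at least one measurement")
    return sum(d < threshold_s for d in durations_s) / len(durations_s)


def meets_target(durations_s: Sequence[float],
                 threshold_s: float = 300.0,
                 target: float = 0.99) -> bool:
    """True if at least `target` of the leaks ended within threshold_s seconds."""
    return fraction_within(durations_s, threshold_s) >= target
```

With such a check, "exactly 0s" would correspond to threshold_s=0 never
being met, and "99% <30s" to threshold_s=30.0, target=0.99.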

Dear Mike, Ytti, others,

First of all and most importantly: congratulations Mike! I thank you and
your team for having constructed a great mechanism that helps honor the
routing intentions everyone publishes in the RPKI.

> These prefix filters are updated automatically both through a system
> of daily updates and real time updates to prevent RPKI INVALID
> routes from being carried in our routing table.

> What does real time mean in this context? Does it mean exactly 0s leak
> of INVALID, or 99% less than 30s? Or how do you define it?

My measurement (samplesize = 1) appears to indicate it took less than a
minute between AS 6939 receiving (and accepting) an RPKI invalid route
announcement, and that same route announcement being removed from the AS
6939 routing tables. Subsequently BGP withdraw messages were sent (for
that RPKI invalid route via 6939) to all their peers, which took a few
more minutes to be processed and converge in the global routing system.

I think it is important for the community to understand that the
mechanism 6939 currently uses is a different approach from what other
network operators are doing.

Most RPKI ROV deployments are set up in such a way that all EBGP routers
are primed a priori with a full set of VRPs. Feeding the routers the
VRPs through the RPKI-To-Router (RTR) protocol allows those BGP speakers
to reject an RPKI invalid route - before - installing it in the Loc-RIB.
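
For readers less familiar with the decision a router makes per route, here
is a rough Python sketch of the RFC 6811 origin validation outcome (valid,
invalid, not-found). It is an illustration only, not any vendor's or
6939's implementation; the VRP tuple and validate() are names invented for
this example, and the prefixes and AS numbers are documentation values.

```python
import ipaddress
from typing import Iterable, NamedTuple


class VRP(NamedTuple):
    """Validated ROA Payload: prefix, maximum length, authorised origin AS."""
    prefix: str
    max_length: int
    asn: int


def validate(route_prefix: str, origin_asn: int, vrps: Iterable[VRP]) -> str:
    """Return 'valid', 'invalid' or 'not-found' for one route."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for vrp in vrps:
        vrp_net = ipaddress.ip_network(vrp.prefix)
        # A VRP covers the route if the route is equal to, or more
        # specific than, the VRP prefix (same address family only).
        if vrp_net.version != route.version or not route.subnet_of(vrp_net):
            continue
        covered = True
        # A covering VRP matches if the origin AS is the same (AS 0 never
        # matches) and the route is not longer than the maximum length.
        if (vrp.asn != 0 and vrp.asn == origin_asn
                and route.prefixlen <= vrp.max_length):
            return "valid"
    return "invalid" if covered else "not-found"


vrps = [VRP("198.51.100.0/24", 24, 64496)]
print(validate("198.51.100.0/24", 64496, vrps))  # matching origin and length
print(validate("198.51.100.0/24", 64497, vrps))  # covered, wrong origin
print(validate("203.0.113.0/24", 64496, vrps))   # no covering VRP
```

A router doing this at import time drops the second route before it is
installed; a reactive setup reaches the same verdict only after the route
has been accepted.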

At the same time, we should recognize and praise anyone who managed to
deploy a reactive mechanism due to the lack of RTR support on a device.

The "route collector -> script -> add prefix list to denylist" approach
cannot be avoided if you have gear in the network that does not support
RPKI OV as specced out in RFC 6811.

The reactive mechanism must be viewed in context of other protection
mechanisms that are deployed such as Peerlock, Maximum Prefix Limits,
and IRR+RPKI+WHOIS based explicit allowlists, all of which 6939 has
done. I actually had to jump through some hoops in the IRR system to
trick 6939 into accepting my RPKI invalid route announcement. :-)

Since it is with words that we construct the magic of our reality, let's
assign a name specific to this engineering effort:

Reactive RPKI ROV

Reactive RPKI ROV, it is, then :-).

A great effort by HE for a network that may not yet completely support
RFC 6811.

We're quickly running out of reasons.

Mark.

Let's say someone makes an announcement that creates an RPKI invalid, and it is determined to be a mistake. They then go back and add ROA objects to fix the problem. With this reactive RPKI approach, do we then continue to block the route because filters were already generated and pushed out to routers? In other words, if the system can insert the filter in less than 60 seconds, how long does it take to get rid of the filter again when someone publishes a valid ROA?

Regards,

Baldur

Dear Baldur,

> Let's say someone makes an announcement that creates an RPKI invalid,
> and it is determined to be a mistake. They then go back and add ROA
> objects to fix the problem. With this reactive RPKI approach, do we
> then continue to block the route because filters were already generated
> and pushed out to routers? In other words, if the system can insert
> the filter in less than 60 seconds, how long does it take to get rid
> of the filter again when someone publishes a valid ROA?

What you describe here is what I'd call a "Garbage Collection" process.
Garbage collection has to happen periodically.

Probably not slower than once an hour. See the following link for an
attempt to document that aspect of RPKI ROV deployments:
https://tools.ietf.org/html/draft-ietf-sidrops-rpki-rov-timing-00.html
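
To make the idea concrete, one pass of a reactive denylist with garbage
collection could look roughly like the sketch below. This is an assumption
about how such a script might be structured, not a description of what HE
runs; reconcile() and the is_invalid callback are hypothetical names, and
the callback stands in for whatever validator output is available.

```python
from typing import Callable, Iterable, Set, Tuple

Route = Tuple[str, int]  # (prefix, origin AS)


def reconcile(denylist: Set[Route],
              observed: Iterable[Route],
              is_invalid: Callable[[str, int], bool]
              ) -> Tuple[Set[Route], Set[Route]]:
    """Return (to_add, to_remove) for one pass over the denylist.

    to_add:    observed routes that are RPKI invalid and not yet denied.
    to_remove: denied entries that are no longer invalid, for example
               because a corrected ROA has since been published.
    """
    to_add = {r for r in observed if is_invalid(*r)} - denylist
    to_remove = {r for r in denylist if not is_invalid(*r)}
    return to_add, to_remove
```

How quickly a fixed ROA unblocks a route then depends on how often this
pass runs and how fresh the validator data behind is_invalid is.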

Maybe HE can comment on their current timers?

Kind regards,

Job

The flip side of this though is that every time an IP space owner publishes an ROA for an aggregate IP block and overlooks the fact that they have customers BGP originating a subnet of the aggregate with an ASN not permitted by an ROA, HE has "less than a full table". :-(

i.e. I'm questioning whether the system is mature enough and properly used widely enough for dropping RPKI invalids to be a good idea?

It's hard to see RPKI performing its minimum viable function as a "flip side".

If this argument is against RPKI fundamentally, I can understand it,
but that ship has sailed.

> The flip side of this though is that every time an IP space owner
> publishes an ROA for an aggregate IP block and overlooks the fact that
> they have customers BGP originating a subnet of the aggregate with an
> ASN not permitted by an ROA, HE has "less than a full table". :-(

This is a known business use-case and it's incumbent upon the address
and AS holders to co-ordinate this.

We dropped some prefixes due to this in October of last year. Once we
raised the issue with the remote network, it was fixed in 30 minutes.

> i.e. I'm questioning whether the system is mature enough and properly
> used widely enough for dropping RPKI invalids to be a good idea?

Well, if we don't deploy, nothing matures.

The problems we hit in the field will help to make the entire system
better.

Mark.

How did you know? Is there some monitoring system available to let you know or do you have your own?
-Tim

Dear Jon, group,

> I'm pleased to announce Hurricane Electric has completed our RPKI
> INVALID filtering project and we now have 0 RPKI INVALIDs in our routing
> table.
>
> Hurricane Electric has 29021 BGP sessions with 22109 prefix filters with
> 7191 networks directly and 8239 networks including Internet exchanges.

> The flip side of this though is that every time an IP space owner publishes
> an ROA for an aggregate IP block and overlooks the fact that they have
> customers BGP originating a subnet of the aggregate with an ASN not
> permitted by an ROA, HE has "less than a full table". :-(

Do you remember the old BSD paradigm? ... "less is more"

I think it applies here. We are now in a time where a *smaller* routing
table entry list count is preferable to a 'full' table, because the
fullest table is likely to also include problematic BGP routing
information.

It is important to recognise that RPKI ROA creation is an *OPTIONAL*
protection mechanism. If you create ROAs, you indeed can harm your
network, but at the same time, if you create the ROAs correctly, you
will gain massive benefits.

RPKI ROA creation is a big hammer. Everyone needs to think carefully
about each ROA they create and if it will positively or negatively
impact their network. NTT spent *months* creating ROAs for all the
prefixes, researching for each BGP announcement if the ROA would be good
or bad. We now have virtually all our space covered by ROAs; it's nice.
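
Part of that per-announcement research can be automated. The sketch below
is one possible pre-publication check, under the assumption that you have
a list of currently announced (prefix, origin AS) pairs and the existing
VRPs; the names VRP, covers(), matches() and invalid_after_publishing()
are invented here, and the addresses and AS numbers are documentation
values.

```python
import ipaddress
from typing import Iterable, List, NamedTuple, Tuple


class VRP(NamedTuple):
    prefix: str
    max_length: int
    asn: int


def covers(vrp: VRP, net) -> bool:
    """True if net is equal to or more specific than the VRP prefix."""
    vrp_net = ipaddress.ip_network(vrp.prefix)
    return vrp_net.version == net.version and net.subnet_of(vrp_net)


def matches(vrp: VRP, net, origin: int) -> bool:
    """True if the VRP authorises this origin AS at this prefix length."""
    return (covers(vrp, net) and vrp.asn != 0 and vrp.asn == origin
            and net.prefixlen <= vrp.max_length)


def invalid_after_publishing(candidate: VRP,
                             existing: Iterable[VRP],
                             routes: Iterable[Tuple[str, int]]
                             ) -> List[Tuple[str, int]]:
    """Routes covered by the candidate ROA that no VRP would match."""
    all_vrps = list(existing) + [candidate]
    hits = []
    for prefix, origin in routes:
        net = ipaddress.ip_network(prefix)
        if covers(candidate, net) and not any(
                matches(v, net, origin) for v in all_vrps):
            hits.append((prefix, origin))
    return hits


aggregate = VRP("2001:db8::/32", 32, 64496)
announced = [("2001:db8::/32", 64496), ("2001:db8:1::/48", 64497)]
# The customer's more-specific from AS 64497 would become invalid:
print(invalid_after_publishing(aggregate, [], announced))
```

Any route this returns needs its own ROA (or a change to the candidate)
before the aggregate ROA is published.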

> i.e. I'm questioning whether the system is mature enough and properly used
> widely enough for dropping RPKI invalids to be a good idea?

Yes. "We made an impossible bird, and it was able to fly". :-)

The global deployment of RPKI ROV in the BGP Default-Free Zone already
is a fact, we made it work! All carriers that keep the Internet
connected together, and care about preventing routing incidents - are
committed to this effort. Thousands of people are now involved at this
point.

What remains now is polishing away some of the sharp edges
[1][2][3][4], and bikeshedding about some of the colors :-)

The links below are like an 'à la carte menu': anyone can engage in
discussions about RPKI at any level they feel comfortable with. Many
people are looking for feedback and input through different forums on
what and how to build it. Pick a platform you enjoy engaging on and
participate (and stick around on this mailing list, all good)! :-)

Kind regards,

Job

[1]: https://www.youtube.com/watch?v=oBwAQep7Q7o
[2]: https://mailarchive.ietf.org/arch/msg/sidrops/ayCQbKvJZmE5TGq9IxL9qUM-zQ4/
[3]: https://github.com/RIPE-NCC/rpki-validator-3/issues/158
[4]: https://twitter.com/routinator3000/status/1255439035553779713

> Do you remember the old BSD paradigm? ... "less is more"

s/bsd/mies/ credit where due.

> We are now in a time where a *smaller* routing table entry list count
> is preferable to a 'full' table, because the fullest table is likely
> to also include problematic BGP routing information.

do you have measurement of that? i would be *really* interested.

randy

> > Do you remember the old BSD paradigm? ... "less is more"
>
> s/bsd/mies/ credit where due.

recant. it was well before mies. i was just raised by an architect,
and had uni roomies who were in the architecture school mies founded.
so my own narrow vision. sorry.

randy

Just like I said, if you create an ROA for an aggregate, forgetting that you have customers using subnets of that aggregate (or didn't create ROAs for customer subnets with the right origin ASNs), you're literally telling those using RPKI to verify routes "don't accept our customers' routes." That might not be bad for "your network", but it's probably bad for someone's.

The usual way - a customer complained :-).

Mark.

The customer monitoring system is very reliable and often superior to
in-house solutions.

Nick

What really made the experience great for us is that directly contacting
the remote network (somewhere in Eastern Europe) and getting them to fix
the issue was far more effective than the usual, "Get your customer to
log a case with our customer, who can then log a case with us, since we
have no commercial contract with you".

We had a completely separate second case caused by us rejecting an
Invalid route. It got fixed in 30 minutes as well.

Invalid routes being dropped creates downtime. People respond to
downtime a lot more eagerly.

Mark.

humanity is a crisis-driven species.

Nick