Global Akamai Outage

I was just about to cite these two as improving this particular issue in upcoming releases.

I am running RPKI-Client + StayRTR, alongside Fort, and yes, while monitoring should be standard, improvements in the validator and RTR server implementations will also go a long way in mitigating these issues.

What's quickly happening in this space is that not all validators and RTR servers are going to be made equal. There are a number of options currently available (both deprecated and current), but I expect that we may settle on just a handful as experience increases. And I anticipate that the ones that remain will be bolstered to address these very problems.

Mark.

Mon, Jul 26, 2021 at 02:20:39PM +0200, Lukas Tribus:

rpki-client 7.1 emits a new per VRP attribute: expires, which makes it
possible for RTR servers to stop considering outdated VRP's:
Add an 'expires' column to CSV & JSON output · rpki-client/rpki-client-openbsd@9e48b3b · GitHub

Since rpki-client removes "outdated" (expired) VRPs, how does an RTR
server "stop considering" something that does not exist from its PoV?

Did you mean that it can warn about impending expiration?

StayRTR reads the VRP data generated by rpki-client.

Mark.

Hello!

Mon, Jul 26, 2021 at 02:20:39PM +0200, Lukas Tribus:
> rpki-client 7.1 emits a new per VRP attribute: expires, which makes it
> possible for RTR servers to stop considering outdated VRP's:
> Add an 'expires' column to CSV & JSON output · rpki-client/rpki-client-openbsd@9e48b3b · GitHub

Since rpki-client removes "outdated" (expired) VRPs, how does an RTR
server "stop considering" something that does not exist from its PoV?

rpki-client can only remove outdated VRP's, if it a) actually runs and
b) if it successfully completes a validation cycle. It also needs to
do this BEFORE the RTR server distributes data.

If rpki-client for whatever reason doesn't complete a validation cycle
[doesn't start, crashes, cannot write to the file] it will not be able
to update the file, which stayrtr reads and distributes.

If your VM went down with both rpki-client and stayrtr, and it stays
down for 2 days (maybe a nasty storage or virtualization problem or
maybe it's just a PSU failure in a SPOF server), when the VM comes
back up, stayrtr will read and distribute 2-day-old data - after all -
rpki-client is a periodic cronjob while stayrtr will start
immediately, so there will be plenty of time to distribute obsolete
VRP's. Just because you have another validator and RTR server in
another region that was always available, doesn't mean that the
erroneous and obsolete data served by this server will be ignored.

There are more reasons and failure scenarios why this 2 piece setup
(periodic RPKI validation, separate RTR daemon) can become a "split
brain". As you implement more complicated setups (a single global RPKI
validation result is distributed to regional RTR servers - the
cloudflare approach), things get even more complicated. Generally I
prefer the all in one approach for these reasons (FORT validator).

At least if it crashes, it takes down the RTR server with it:

https://github.com/NICMx/FORT-validator/issues/40#issuecomment-695054163

But I have to emphasize that all those are just examples. Unknown bugs
or corner cases can lead to similar behavior in "all in one" daemons
like Fort and Routinator. That's why specific improvements absolutely
do not mean we don't have to monitor the RTR servers.

lukas

rpki-client can only remove outdated VRP's, if it a) actually runs and
b) if it successfully completes a validation cycle. It also needs to
do this BEFORE the RTR server distributes data.

If rpki-client for whatever reason doesn't complete a validation cycle
[doesn't start, crashes, cannot write to the file] it will not be able
to update the file, which stayrtr reads and distributes.

Have you had any odd experiences with rpki-client running? The fact that it's not a daemon suggests that it is less likely to bomb out (even though that could happen as a runtime binary, one can reliably test for that whenever changes are made).

Of course, rpki-client depends on Cron being available and stable, and over the years, I have not run into any major issues guaranteeing that.

So if you've seen some specific outage scenarios with it, I'd be keen to hear about them.

If your VM went down with both rpki-client and stayrtr, and it stays
down for 2 days (maybe a nasty storage or virtualization problem or
maybe it's just a PSU failure in a SPOF server), when the VM comes
back up, stayrtr will read and distribute 2-day-old data - after all -
rpki-client is a periodic cronjob while stayrtr will start
immediately, so there will be plenty of time to distribute obsolete
VRP's. Just because you have another validator and RTR server in
another region that was always available, doesn't mean that the
erroneous and obsolete data served by this server will be ignored.

This is a good point.

So I know that one of the developers of StayRTR is working on having it use the "expires" values that rpki-client inherently possesses to ensure that StayRTR never delivers stale data to clients. If this works, while it does not eliminate the need for some degree of monitoring, it certainly makes it less of a hassle, going forward.
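
To make the idea concrete, here is a rough sketch, not StayRTR's actual implementation: assuming rpki-client's JSON export carries a per-VRP Unix-timestamp "expires" field in a top-level "roas" list (the field names and file path below are assumptions for illustration), an RTR server could simply refuse to load anything already past its expiry.

#!/usr/bin/env python3
# Illustration only (not StayRTR code): drop VRPs whose "expires"
# timestamp has already passed, so a stale validator output file
# cannot keep dead VRPs alive on the RTR side.
# Assumed layout: {"roas": [{"prefix": ..., "asn": ..., "maxLength": ...,
#                            "expires": <unix timestamp>}, ...]}
import json
import time

def load_fresh_vrps(path="/var/db/rpki-client/json"):  # example path
    with open(path) as f:
        roas = json.load(f).get("roas", [])
    now = time.time()
    fresh = [r for r in roas if r.get("expires", 0) > now]
    dropped = len(roas) - len(fresh)
    if dropped:
        print(f"warning: ignoring {dropped} expired VRPs")
    return fresh

if __name__ == "__main__":
    print(f"would serve {len(load_fresh_vrps())} unexpired VRPs")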

There are more reasons and failure scenarios why this 2 piece setup
(periodic RPKI validation, separate RTR daemon) can become a "split
brain". As you implement more complicated setups (a single global RPKI
validation result is distributed to regional RTR servers - the
cloudflare approach), things get even more complicated. Generally I
prefer the all in one approach for these reasons (FORT validator).

At least if it crashes, it takes down the RTR server with it:

https://github.com/NICMx/FORT-validator/issues/40#issuecomment-695054163

But I have to emphasize that all those are just examples. Unknown bugs
or corner cases can lead to similar behavior in "all in one" daemons
like Fort and Routinator. That's why specific improvements absolutely
do not mean we don't have to monitor the RTR servers.

Agreed.

I've had my fair share of Fort issues in the past month, all of which have been fixed and a new release is imminent, so I'm happy.

I'm currently running both Fort and rpki-client + StayRTR. At a basic level, they both send the exact same number of VRP's toward clients, likely because they share a philosophy in validation schemes, and crypto libraries.

We're getting there.

Mark.

Mon, Jul 26, 2021 at 07:04:41PM +0200, Lukas Tribus:

Hello!

>
> Mon, Jul 26, 2021 at 02:20:39PM +0200, Lukas Tribus:
> > rpki-client 7.1 emits a new per VRP attribute: expires, which makes it
> > possible for RTR servers to stop considering outdated VRP's:
> > Add an 'expires' column to CSV & JSON output · rpki-client/rpki-client-openbsd@9e48b3b · GitHub
>
> Since rpki-client removes "outdated" (expired) VRPs, how does an RTR
> server "stop considering" something that does not exist from its PoV?

rpki-client can only remove outdated VRP's, if it a) actually runs and
b) if it successfully completes a validation cycle. It also needs to
do this BEFORE the RTR server distributes data.

If rpki-client for whatever reason doesn't complete a validation cycle
[doesn't start, crashes, cannot write to the file] it will not be able
to update the file, which stayrtr reads and distributes.

If your VM went down with both rpki-client and stayrtr, and it stays
down for 2 days (maybe a nasty storage or virtualization problem or
maybe it's just a PSU failure in a SPOF server), when the VM comes
back up, stayrtr will read and distribute 2-day-old data - after all -
rpki-client is a periodic cronjob while stayrtr will start
immediately, so there will be plenty of time to distribute obsolete
VRP's. Just because you have another validator and RTR server in
another region that was always available, doesn't mean that the
erroneous and obsolete data served by this server will be ignored.

There are more reasons and failure scenarios why this 2 piece setup
(periodic RPKI validation, separate RTR daemon) can become a "split
brain". As you implement more complicated setups (a single global RPKI
validation result is distributed to regional RTR servers - the
cloudflare approach), things get even more complicated. Generally I
prefer the all in one approach for these reasons (FORT validator).

At least if it crashes, it takes down the RTR server with it:

https://github.com/NICMx/FORT-validator/issues/40#issuecomment-695054163

But I have to emphasize that all those are just examples. Unknown bugs
or corner cases can lead to similar behavior in "all in one" daemons
like Fort and Routinator. That's why specific improvements absolutely
do not mean we don't have to monitor the RTR servers.

I am not convinced that I want the RTR server to be any smarter than
necessary, and I think expiration handling is too smart. I want it to
load the VRPs provided and serve them, no more.

Leave expiration to the validator and monitoring of both to the NMS and
other means. The delegations should not be changing quickly[1] enough
for me to prefer expiration over the grace period to correct a validator
problem. That does not prevent an operator from using other means to
share fate; e.g., if the validator does fail completely for 2 hours, stop
the RTR server.

I perceive this to be choosing stability in the RTR sessions over
timeliness of updates. And, if a 15 - 30 minute polling interval is
reasonable, why isn't 8 - 24 hours?

I too prefer an approach where the validator and RTR are separate but
co-located, but this naturally increases the possibility that the two
might serve different data due to reachability, validator run-time, ....
To what extent differences occur, I have not measured.

[1] The NIST ROA graph confirms the rate of change is low, as I would
expect. But, I have no statistic for ROA stability, considering only
the prefix and origin.

> rpki-client can only remove outdated VRP's, if it a) actually runs and
> b) if it successfully completes a validation cycle. It also needs to
> do this BEFORE the RTR server distributes data.
>
> If rpki-client for whatever reason doesn't complete a validation cycle
> [doesn't start, crashes, cannot write to the file] it will not be able
> to update the file, which stayrtr reads and distributes.

Have you had any odd experiences with rpki-client running? The fact that
it's not a daemon suggests that it is less likely to bomb out (even
though that could happen as a runtime binary, one can reliably test
for that whenever changes are made).

No, I did not have a specific negative experience running rpki-client.

I did have my fair share of:

- fat fingering cronjobs
- fat fingering permissions
- read-only filesystems due to storage/virtualization problems
- longer VM downtimes

I was also directly impacted by a hung rpki-validator, which I have
referenced in one of the links earlier. This was actually after I
started to have concerns about the lack of monitoring and the danger
of serving stale data, not before.

I was also constantly impacted by generic (non-RPKI-related) gray
failures in other people's networks for the better part of a decade; I
guess that makes me particularly sensitive to topics like this.

Of course, rpki-client depends on Cron being available and stable, and
over the years, I have not run into any major issues guaranteeing that.

It's not the quality of the cron code that I'm worried about. It's the
number of variables that can cause rpki-client to not complete and
fully write the validation results to disk COMBINED with the lack of
monitoring.

You get an alert in your NMS when a link is down, even if that single
link down doesn't mean your customers are impacted. But you need to
know so that you can actually intervene to restore full redundancy.
Lack of awareness of a problem is the larger issue here.
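
As a sketch of the kind of minimal probe that closes this gap (the path and threshold below are placeholders, not a recommendation), one can simply alert whenever the validator has not rewritten its output file recently, regardless of whether the RTR daemon is still happily serving whatever is in it:

#!/usr/bin/env python3
# Example freshness probe: warn if the file the RTR daemon serves from
# has not been rewritten recently, i.e. the validator cron job is
# silently not completing. Path and threshold are placeholders.
import os
import sys
import time

OUTPUT_FILE = "/var/db/rpki-client/json"  # wherever the validator writes
MAX_AGE = 2 * 3600                        # alert after 2 hours with no update

def main():
    try:
        age = time.time() - os.stat(OUTPUT_FILE).st_mtime
    except FileNotFoundError:
        print(f"CRITICAL: {OUTPUT_FILE} missing; validation never completed?")
        return 2
    if age > MAX_AGE:
        print(f"WARNING: {OUTPUT_FILE} is {age/3600:.1f}h old; stale VRPs may be served")
        return 1
    print(f"OK: validation output is {age/60:.0f} minutes old")
    return 0

if __name__ == "__main__":
    sys.exit(main())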

> If your VM went down with both rpki-client and stayrtr, and it stays
> down for 2 days (maybe a nasty storage or virtualization problem or
> maybe it's just a PSU failure in a SPOF server), when the VM comes
> back up, stayrtr will read and distribute 2-day-old data - after all -
> rpki-client is a periodic cronjob while stayrtr will start
> immediately, so there will be plenty of time to distribute obsolete
> VRP's. Just because you have another validator and RTR server in
> another region that was always available, doesn't mean that the
> erroneous and obsolete data served by this server will be ignored.

This is a good point.

So I know that one of the developers of StayRTR is working on having it
use the "expires" values that rpki-client inherently possesses to ensure
that StayRTR never delivers stale data to clients. If this works, while
it does not eliminate the need for some degree of monitoring, it
certainly makes it less of a hassle, going forward.

Notice that expires is based on the cryptographic validity of the ROA
objects. It can be multiple DAYS until expiration strikes; for example,
the expiration value of 8.8.8.0/24 is 2 DAYS in the future.
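
A quick way to see what that margin looks like in practice is to measure the time-to-expiry across a validator's own JSON export. The snippet below is only a sketch and assumes the same per-VRP Unix-timestamp "expires" field and "roas" layout as above:

#!/usr/bin/env python3
# Rough illustration: summarize how far in the future the "expires"
# values sit, to show they are a multi-hour/multi-day backstop rather
# than a freshness signal. Field names and default path are assumptions.
import json
import sys
import time

def main(path):
    with open(path) as f:
        roas = json.load(f).get("roas", [])
    now = time.time()
    hours = sorted((r["expires"] - now) / 3600 for r in roas if "expires" in r)
    if not hours:
        print("no expires data found")
        return
    median = hours[len(hours) // 2]
    print(f"{len(hours)} VRPs: min {hours[0]:.1f}h, "
          f"median {median:.1f}h, max {hours[-1]:.1f}h until expiry")

if __name__ == "__main__":
    main(sys.argv[1] if len(sys.argv) > 1 else "/var/db/rpki-client/json")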

> There are more reasons and failure scenarios why this 2 piece setup
> (periodic RPKI validation, separate RTR daemon) can become a "split
> brain". As you implement more complicated setups (a single global RPKI
> validation result is distributed to regional RTR servers - the
> cloudflare approach), things get even more complicated. Generally I
> prefer the all in one approach for these reasons (FORT validator).
>
> At least if it crashes, it takes down the RTR server with it:
>
> https://github.com/NICMx/FORT-validator/issues/40#issuecomment-695054163
>
>
> But I have to emphasize that all those are just examples. Unknown bugs
> or corner cases can lead to similar behavior in "all in one" daemons
> like Fort and Routinator. That's why specific improvements absolutely
> do not mean we don't have to monitor the RTR servers.

Agreed.

I've had my fair share of Fort issues in the past month, all of which
have been fixed and a new release is imminent, so I'm happy.

I'm currently running both Fort and rpki-client + StayRTR. At a basic
level, they both send the exact same number of VRP's toward clients,
likely because they share a philosophy in validation schemes, and crypto
libraries.

We're getting there.

For IOS-XR I have a netconf script that performs all kinds of health
checks at the XR RTR client level:

- comparing the total number of IPv4 and v6 VRP's of each enabled RTR
server with absolute values, warning if there are fewer than EXPECTED
values
- comparing the v4 and v6 numbers between the RTR endpoints on this XR
box, warning if the disparity crosses a threshold
- warning if a configured RTR server is not in connected state
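
This is not the actual netconf script, just a hedged sketch of the comparison logic in the list above; how the per-endpoint counts are collected is left out entirely, and the floors, thresholds, endpoint names and sample numbers are invented for illustration:

#!/usr/bin/env python3
# Sketch of the threshold checks above: an absolute floor per RTR
# endpoint plus a relative-disparity check between endpoints. The
# sample counts and thresholds are made up for illustration.
MIN_V4 = 100000          # example floor, replace with what you expect to see
MIN_V6 = 20000
MAX_DISPARITY = 0.05     # warn if endpoints differ by more than 5%

def check(counts):
    """counts: dict mapping RTR endpoint -> (ipv4_vrp_count, ipv6_vrp_count)"""
    warnings = []
    for ep, (v4, v6) in counts.items():
        if v4 < MIN_V4 or v6 < MIN_V6:
            warnings.append(f"{ep}: VRP count below expected floor ({v4} v4 / {v6} v6)")
    for family, idx in (("v4", 0), ("v6", 1)):
        vals = [c[idx] for c in counts.values()]
        if vals and min(vals) < max(vals) * (1 - MAX_DISPARITY):
            warnings.append(f"{family} VRP count disparity across endpoints: {vals}")
    return warnings

if __name__ == "__main__":
    sample = {"rtr1.example.net": (210000, 41000),   # made-up numbers
              "rtr2.example.net": (165000, 40500)}
    for w in check(sample):
        print("WARNING:", w)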

This is also useful for RTR client bugs in XR, which I have seen (the
state machine is broken and you need to "clear bgp rpki ... server XY" to
restore it). I have not seen this issue in newer code, fortunately.

lukas

Hello,

> But I have to emphasize that all those are just examples. Unknown bugs
> or corner cases can lead to similar behavior in "all in one" daemons
> like Fort and Routinator. That's why specific improvements absolutely
> do not mean we don't have to monitor the RTR servers.

I am not convinced that I want the RTR server to be any smarter than
necessary, and I think expiration handling is too smart. I want it to
load the VRPs provided and serve them, no more.

Leave expiration to the validator and monitoring of both to the NMS and
other means.

While I'm all for KISS, the expiration feature makes sure that the
cryptographic validity in the ROA's is respected not only on the
validator, but also on the RTR server. This is necessary, because
there is nothing in the RTR protocol that indicates the expiration and
this change brings it at least into the JSON exchange between
validator and RTR server.

It's like TTL in DNS, and it's about respecting the wishes of the
authority (CA and ROA resource holder).

The delegations should not be changing quickly[1] enough

How do you come to this conclusion? If I decide I'd like to originate
a /24 out of my aggregate, for DDoS mitigation purposes, why shouldn't
I be able to update my ROA and expect quasi-complete convergence in 1
or 2 hours?

for me to prefer expiration over the grace period to correct a validator
problem. That does not prevent an operator from using other means to
share fate; e.g., if the validator does fail completely for 2 hours, stop
the RTR server.

I perceive this to be choosing stability in the RTR sessions over
timeliness of updates. And, if a 15 - 30 minute polling interval is
reasonable, why isn't 8 - 24 hours?

Well for one, I'd like my ROAs to propagate in 1 or 2 hours. If I need
to wait for 24 hours, then this could cause operational issues for me
(the DDoS mitigation case above for example, or just any other normal
routing change).

The entire RPKI system is designed to fail open, so if you have multiple
failures and *all* your RTR servers go down, the worst case is that
the routes on the BGP routers turn NotFound, so you'd lose the benefit
of RPKI validation. It's *way* *way* more harmful to have obsolete
VRP's on your routers. If it's just a few hours, then the impact will
probably not be catastrophic. But what if it's 36 hours, 72 hours?
What if the rpki-validation started failing 2 weeks ago, when Jerry
from IT ("the linux guy") started his vacation?

On the other hand, if only one (of multiple) validator/rtr instances
has a problem and the number of VRP's slowly goes down, nothing will
happen at all on your routers, as they just use the union of the RTR
endpoints, and the VRP's from the broken RTR server will slowly be
withdrawn. Your router will keep using healthy RTR servers, as opposed
to considering erroneous data from a poisoned RTR server.

I define stability not as "RTR session uptime and VRP count", but as
whether my BGP routers are making correct or wrong decisions.

I too prefer an approach where the validator and RTR are separate but
co-located, but this naturally increases the possibility that the two
might serve different data due to reachability, validator run-time, ....
To what extent differences occur, I have not measured.

[1] The NIST ROA graph confirms the rate of change is low, as I would
expect. But, I have no statistic for ROA stability, considering only
the prefix and origin.

I don't see how the rate of global ROA changes is in any way related
to this issue. The operational issue a hung RTR endpoint creates for
other people's networks can't be measured with this.

lukas