plea for comcast/sprint handoff debug help

tl;dr:

comcast: does your 50.242.151.5 westin router receive the announcement
of 147.28.0.0/20 from sprint's westin router 144.232.9.61?

details:

3130 in the westin announces
  147.28.0.0/19 and
  147.28.0.0/20
to sprint, ntt, and the six
and we want to remove the /19

when we stop announcing the /19, a traceroute to comcast through sprint
dies at the handoff from sprint to comcast.

r0.sea#traceroute 73.47.196.134 source 147.28.7.1
Type escape sequence to abort.
Tracing the route to c-73-47-196-134.hsd1.ma.comcast.net (73.47.196.134)
VRF info: (vrf in name/id, vrf out name/id)
  1 r1.sea.rg.net (147.28.0.5) 0 msec 1 msec 0 msec
  2 sl-mpe50-sea-ge-0-0-3-0.sprintlink.net (144.232.9.61) [AS 1239] 1 msec 1 msec 0 msec
  3 * * *
  4 * * *
  5 * * *
  6 * * *

this would 'normally' (i.e. when the /19 is announced) be

r0.sea#traceroute 73.47.196.134 source 147.28.7.1
Type escape sequence to abort.
Tracing the route to c-73-47-196-134.hsd1.ma.comcast.net (73.47.196.134)
VRF info: (vrf in name/id, vrf out name/id)
  1 r1.sea.rg.net (147.28.0.5) 0 msec 1 msec 0 msec
  2 sl-mpe50-sea-ge-0-0-3-0.sprintlink.net (144.232.9.61) [AS 1239] 1 msec 0 msec 1 msec
  3 be-207-pe02.seattle.wa.ibone.comcast.net (50.242.151.5) [AS 7922] 1 msec 0 msec 0 msec
  4 be-10847-cr01.seattle.wa.ibone.comcast.net (68.86.86.225) [AS 7922] 1 msec 1 msec 2 msec
  etc
  
specifically, when 147.28.0.0/19 is announced, traceroute from
147.28.7.2 through sprint works to comcast. withdraw 147.28.0.0/19,
leaving only 147.28.0.0/20, and the traceroute enters sprint but fails
at the handoff to comcast. Bad next-hop? not propagated? covid?
magic?

which is why we wonder what comcast (50.242.151.5) hears from sprint at
that handoff

note that, at the minute, both the /19 and the /20 are being announced,
as we want things to work. so you will not be able to reproduce.

so, comcast, are you receiving the announcement of the /20 from sprint?
with a good next-hop?

randy

> tl;dr:
>
> comcast: does your 50.242.151.5 westin router receive the announcement
> of 147.28.0.0/20 from sprint's westin router 144.232.9.61?

tl;dr: diagnosed by comcast. see our short paper to be presented at imc
       tomorrow https://archive.psg.com/200927.imc-rp.pdf

lesson: route origin relying party software may cause as much damage as
  it ameliorates

randy

Hello,

> tl;dr: diagnosed by comcast. see our short paper to be presented at imc
>        tomorrow https://archive.psg.com/200927.imc-rp.pdf
>
> lesson: route origin relying party software may cause as much damage as
>         it ameliorates

There is a myth that ROV is inherently fail-safe (it isn't if your
production routers have stale VRPs), which leads to the assumption
that proper monitoring can be neglected.

I'm working on a shell script using rtrdump to detect stale RTR
servers (based on serial changes and the actual data). Of course this
would never detect partial failures that affect only some child-CAs,
but it does detect a hung RTR server (or a standalone RTR server where
the validator validates no more).
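
For illustration, the core of such a staleness check could look roughly
like this in Python (not Lukas's actual script; the cron-written dump
file names and their format are assumptions made for the example):

    #!/usr/bin/env python3
    # Rough sketch: compare two successive rtrdump outputs (e.g. written
    # periodically by cron as vrp-<timestamp>.json) and warn if nothing
    # changed across an interval in which the RPKI normally changes.
    import hashlib
    import sys

    def digest(path):
        with open(path, "rb") as f:
            return hashlib.sha256(f.read()).hexdigest()

    def main(prev, curr):
        if digest(prev) == digest(curr):
            # Identical dumps suggest a hung RTR server or a validator
            # that has stopped validating; partial failures that affect
            # only some child CAs will not be caught, as noted above.
            print("WARNING: %s identical to %s; RTR data may be stale"
                  % (prev, curr))
            return 1
        print("RTR data changed; feed looks alive")
        return 0

    if __name__ == "__main__":
        sys.exit(main(sys.argv[1], sys.argv[2]))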

lukas

To clarify this for the readers here: there is an ongoing research experiment where connectivity to the RRDP and rsync endpoints of several RPKI publication servers is being purposely enabled and disabled for prolonged periods of time. This is perfectly fine of course.

While the resulting paper presented at IMC is certainly interesting, having relying party software fall back to rsync when RRDP is unavailable is not a requirement specified in any RFC, as the paper seems to suggest. In fact, we argue that it's actually a bad idea to do so:

https://blog.nlnetlabs.nl/why-routinator-doesnt-fall-back-to-rsync/

We're interested to hear views on this from both an operational and security perspective.

-Alex

> tl;dr:
>
> comcast: does your 50.242.151.5 westin router receive the announcement
> of 147.28.0.0/20 from sprint's westin router 144.232.9.61?
>
> tl;dr: diagnosed by comcast. see our short paper to be presented at imc
>        tomorrow https://archive.psg.com/200927.imc-rp.pdf
>
> lesson: route origin relying party software may cause as much damage as
>         it ameliorates
>
> randy

> To clarify this for the readers here: there is an ongoing research
> experiment where connectivity to the RRDP and rsync endpoints of
> several RPKI publication servers is being purposely enabled and
> disabled for prolonged periods of time. This is perfectly fine of
> course.
>
> While the resulting paper presented at IMC is certainly interesting,
> having relying party software fall back to rsync when RRDP is
> unavailable is not a requirement specified in any RFC, as the paper
> seems to suggest. In fact, we argue that it's actually a bad idea to
> do so:
>
> Why Routinator Doesn’t Fall Back to Rsync
> (https://blog.nlnetlabs.nl/why-routinator-doesnt-fall-back-to-rsync/)
>
> We're interested to hear views on this from both an operational and
> security perspective.

in fact, <senior op at an isp> has found your bug. if you find an http
server, but it is not serving the new and not-required rrdp protocol, it
does not then use the mandatory to implement rsync.

randy

i'll see your blog post and raise you a peer reviewed academic paper and
two rfcs :slight_smile:

in dnssec, we want to move from the old mandatory to implement (mti) rsa
signatures to the more modern ecdsa.

how would the world work out if i fielded a validating dns cache server
which *implemented* rsa, because it is mti, but chose not to actually
*use* it for validation on odd numbered wednesdays because of my
religious belief that ecdsa is superior?

perhaps go over to your unbound siblings and discuss this analog.

but thanks for your help in getting jtk's imc paper accepted. :slight_smile:

randy

I don't see a compelling reason to not use rsync when RRDP is
unavailable.

Quoting from the blog post:

    "While this isn’t threatening the integrity of the RPKI – all data
    is cryptographically signed making it really difficult to forge data
    – it is possible to withhold information or replay old data."

RRDP does not solve the issue of withholding data or replaying old data.
The RRDP protocol /also/ is unauthenticated, just like rsync. The RRDP
protocol basically is rsync wrapped in XML over HTTPS.

Withholding of information is detected through verification of RPKI
manifests (something Routinator didn't verify up until last week!),
and replaying of old data is addressed by checking validity dates and
CRLs (something Routinator also didn't do until last week!).
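
For illustration, the CRL side of those replay checks could look roughly
like this (a sketch using the third-party Python 'cryptography' package,
not how any particular RP implements it; manifest parsing is a CMS
exercise and is omitted here):

    # Reject a CRL that is past its nextUpdate, and treat objects whose
    # EE certificate is listed on the current CRL as unusable.
    from datetime import datetime
    from cryptography import x509

    def crl_is_current(crl_der):
        # A replayed (old) repository snapshot eventually betrays itself
        # here: its CRL's nextUpdate time lies in the past.
        crl = x509.load_der_x509_crl(crl_der)
        return crl.next_update is None or datetime.utcnow() <= crl.next_update

    def ee_cert_revoked(crl_der, ee_serial):
        # A ROA whose EE certificate serial appears on the CRL must be
        # rejected even though its signature still verifies.
        crl = x509.load_der_x509_crl(crl_der)
        return crl.get_revoked_certificate_by_serial_number(ee_serial) is not None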

Of course I see advantages to this industry mainly using RRDP, but those
are not security advantages. The big migration towards RRDP can happen
somewhere in the next few years.

The arguments brought forward in the blog post don't make sense to me.
The '150,000' number in the blog post seems like a number pulled from
thin air.

Regards,

Job

> i'll see your blog post and raise you a peer reviewed academic paper and
> two rfcs :slight_smile:

For the readers wondering what is going on here: there is a reason there is only a vague mention of two RFCs instead of the specific paragraph where it says that Relying Party software must fall back to rsync immediately if RRDP is temporarily unavailable. That is because this section doesn’t exist. The point is that there is no bug and, in fact, Routinator has a carefully thought out strategy to deal with transient outages. Moreover, we argue that our strategy is the better choice, both operationally and from a security standpoint.

The paper shows that Routinator is the most used RPKI relying party software, and we know many of you here rely on it for route origin validation in a production environment. We take this responsibility and therefore this matter very seriously, and would not want you to think we have been careless in our software design. Quite the opposite.

We have made several attempts within the IETF to have a discussion on technical merit, where aspects such as overwhelming an rsync server with traffic, or using aggressive fallback to rsync as an entry point to a downgrade attack, have been brought forward. Our hope was that our arguments would be considered on technical merit, but that has not happened yet. Be that as it may, operators can rest assured that if consensus goes against our logic, we will change our design.

> perhaps go over to your unbound siblings and discuss this analog.

The mention of Unbound DNS resolver in this context is interesting, because we have in fact discussed our strategy with the developers on this team as there is a lot to be learned from other standards and operational experiences.

We feel very strongly about this matter because the claim that using our software negatively affects Internet routing robustness strikes at the core of NLnet Labs’ existence: our reputation and our mission to work for the good of the Internet. They are the core values that make it possible for a not-for-profit foundation like ours to make free, liberally licensed open source software.

We’re proud of what we’ve been able to achieve and look forward to a continued open discussion with the community.

Respectfully,

Alex

Hi Job, all,

>> In fact, we argue that it's actually a bad idea to do so:
>>
>> Why Routinator Doesn’t Fall Back to Rsync
>>
>> We're interested to hear views on this from both an operational and
>> security perspective.

> I don't see a compelling reason to not use rsync when RRDP is
> unavailable.
>
> Quoting from the blog post:
>
>    "While this isn’t threatening the integrity of the RPKI – all data
>    is cryptographically signed making it really difficult to forge data
>    – it is possible to withhold information or replay old data."
>
> RRDP does not solve the issue of withholding data or replaying old data.
> The RRDP protocol /also/ is unauthenticated, just like rsync. The RRDP
> protocol basically is rsync wrapped in XML over HTTPS.
>
> Withholding of information is detected through verification of RPKI
> manifests (something Routinator didn't verify up until last week!),
> and replaying of old data is addressed by checking validity dates and
> CRLs (something Routinator also didn't do until last week!).
>
> Of course I see advantages to this industry mainly using RRDP, but those
> are not security advantages. The big migration towards RRDP can happen
> somewhere in the next few years.

Routinator does TLS verification when it encounters an RRDP repository. If the repository cannot be reached, or its HTTPS certificate is somehow invalid, it will use rsync instead. It's only after it has found a *valid* HTTPS connection that it refuses to fall back.

There is a security angle here.

Malicious-in-the-middle attacks can lead an RP to a bogus HTTPS server and force the software to downgrade to rsync, which has no channel security. The software can then be given old data (new ROAs can be withheld), or the attacker can simply withhold a single object. With the stricter publication point completeness validation introduced by RFC 6486-bis, this will lead to the rejection of all ROAs published there.

The result is the exact same problem that Randy et al.'s research pointed at. If there is a covering less specific ROA issued by a parent, this will then result in RPKI invalid routes.

The fall-back may help in cases where there is an accidental outage of the RRDP server (for as long as the rsync servers can deal with the load), but it increases the attack surface for repositories that keep their RRDP server available.

Regards,
Tim

Alex:

When I follow the RFC rabbit hole:

RFC6481 : A Profile for Resource Certificate Repository Structure

   The publication repository MUST be available using rsync
   [RFC5781] [RSYNC].  Support of additional retrieval mechanisms
   is the choice of the repository operator.  The supported
   retrieval mechanisms MUST be consistent with the accessMethod
   element value(s) specified in the SIA of the associated CA or
   EE certificate.

Then:

RFC8182 : The RPKI Repository Delta Protocol (RRDP)

   This document allows the use of RRDP as an additional repository
   distribution mechanism for RPKI.  In time, RRDP may replace rsync
   [RSYNC] as the only mandatory-to-implement repository distribution
   mechanism.  However, this transition is outside of the scope of this
   document.

Is it not the case, then, that rsync is currently still mandatory, even if RRDP is in place? Or is there a more recent RFC that defines the transition, which I did not locate?

> i'll see your blog post and raise you a peer reviewed academic paper
> and two rfcs :slight_smile:

> For the readers wondering what is going on here: there is a reason
> there is only a vague mention of two RFCs instead of the specific
> paragraph where it says that Relying Party software must fall back to
> rsync immediately if RRDP is temporarily unavailable. That is because
> this section doesn’t exist.

*skeptical face* Alex, you got it backwards: the section that does not
exist is the one saying *not* to fall back to rsync. But on the other
hand, there are ample RFC sections which outline that rsync is the
mandatory-to-implement protocol. It starts at RFC 6481, Section 3: "The
publication repository MUST be available using rsync".

Even the RRDP RFC itself (RFC 8182) describes that RSYNC and RRDP
*co-exist*. I think this co-existence was factored into both the design
of RPKI-over-RSYNC and subsequently RPKI-over-RRDP. An rsync publication
point does not become invalid because of the demise of a
once-upon-a-time valid RRDP publication point.

Only a few weeks ago a large NIR (IDNIC) disabled their RRDP service
because somehow the RSYNC and RRDP repositories were out-of-sync with
each other. The RRDP service remained disabled for a number of days
until they repaired their RPKI Certificate Authority service.

I suppose that during this time, Routinator was unable to receive any
updates related to the IDNIC CA (pinned to RRDP because of a successful
fetch prior to the partial IDNIC RPKI outage). This in turn deprived the
IDNIC subordinate Resource Holders of the ability to update their Route
Origin Authorization attestations (from Routinator's perspective).

Given that RRDP is an *optional* protocol in the RPKI stack, it doesn't
make sense to me to strictly pin fetching operations to RRDP: Over time
(months, years), a CA could enable / disable / enable / disable RRDP
service, while listing the RRDP URI as a valid SIA, amongst other valid
SIAs.

An analogy to DNS: A website operator may add AAAA records to indicate
IPv6 reachability, but over time may also remove the AAAA record if
there (temporarily) is some kind of issue with the IPv6 service. The
Internet operations community of course encourages everyone to add AAAA
records, and IPv6 Happy Eyeballs was a concept that for a long time even
*favored* IPv6 over IPv4 to help improve IPv6 adoption, but a dual-stack
browser will always try to benefit from the redundancy that exists
across the two address families.

RSYNC and RRDP should be viewed in a similar context as v4 vs v6, but
unlike with IPv4 and IPv6, I am convinced that RSYNC can be deprecated
in the span of 3 or 4 years; the draft-sidrops-bruijnzeels-deprecate-rsync
document is helping towards that goal!

> Be that as it may, operators can rest assured that if consensus goes
> against our logic, we will change our design.

Please change the implementation a little bit (0.8.1). I think it is too
soon for the Internet-wide 'rsync to RRDP' migration project to be
declared complete and successful, and this actually hampers the
transition to RRDP.

Pinning to RRDP *forever* violates the principle of least astonishment
in a world where draft-sidrops-bruijnzeels-deprecate-rsync-00 was
published only as recently as November 2019. That draft now is a working
group document, and it will probably take another 1 or 2 years before it
is published as an RFC.

Section 5 of 'draft-deprecate-rsync' says RRDP *SHOULD* be used when it
is available. Thus it logically follows that when it is not available,
the lowest common denominator is to be used: rsync. After all, the
Issuing CA put an RSYNC URI in the 'Subject Information Access' (SIA).
Who knows better than the CA?

The ability to publish routing intentions, and for others to honor the
intentions of the CA, is what RPKI is all about. When the CA says
delegated RPKI data is available at both an RSYNC URI and an RRDP URI,
both are valid network entry points to the publication point. The
resource holder's X.509 signature is even on those 'reference to there'
directions (the URIs)! :slight_smile:

If I can make a small suggestion: make 0.8.1 fall back to rsync after
waiting an hour or so (meanwhile polling to see if the RRDP service
comes back). This way the network operator takes advantage of both
transport protocols, whichever is available, with a clear preference to
try RRDP first, then eventually rsync.
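
For concreteness, that suggestion could be sketched roughly like this
(illustrative pseudocode only, not Routinator code; the one-hour grace
period and the fetch/cache hooks are placeholders):

    import time

    RRDP_GRACE_SECONDS = 3600             # "an hour or so"
    rrdp_down_since = {}                  # publication point -> first failure

    def refresh(pp, fetch_rrdp, fetch_rsync, cached):
        try:
            data = fetch_rrdp(pp)         # keep trying RRDP first
            rrdp_down_since.pop(pp, None) # RRDP is healthy again
            return data
        except Exception:
            first = rrdp_down_since.setdefault(pp, time.monotonic())
            if time.monotonic() - first < RRDP_GRACE_SECONDS:
                return cached             # short hiccup: reuse cached data
            return fetch_rsync(pp)        # grace period over: fall back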

RPKI was designed in such a way that it can be transported even over
printed paper, usb stick, bluetooth, vinyl, rsync, and also https (as
rrdp). Because RPKI data is signed using the X.509 framework, the
transportation method really is irrelevant. IP holders can publish RPKI
data via horse + cart, and still make productive use of it!

Routinator's behavior is not RFC compliant, and has tangible effects in
the default-free zone.

Regards,

Job

> If there is a covering less specific ROA issued by a parent, this will
> then result in RPKI invalid routes.

i.e. the upstream kills the customer. not a wise business model.

> The fall-back may help in cases where there is an accidental outage of
> the RRDP server (for as long as the rsync servers can deal with the
> load)

folk try different software, try different configurations, realize that
having their CA gooey exposed because they wanted to serve rrdp and
block, ...

randy, finding the fort rp to be pretty solid!

As I’ve pointed out to Randy and others, and will share here:
We had planned to upgrade our Routinator RP (Relying Party) software to the latest v0.8, which I knew had some improvements, but hadn’t yet done so.
I assumed the problems we were seeing would be fixed by the upgrade.
Indeed, when I pulled down the new SW to a test machine, loaded and ran it, I could get both Randy’s ROAs.
I figured I was good to go.

Then we upgraded the prod machine to the new version and the problem persisted.
An hour or two of analysis made me realize that the “stickiness” of a particular PP (Publication Point) is encoded in the cache filesystem.
Routinator seems to build entries in its cache directory under either rsync, rrdp, or http, and the rg.net PPs weren’t showing under rsync. Moving the cache directory aside and forcing it to rebuild fixed the issue.

A couple of points seem to follow:

  • Randy says: “finding the fort rp to be pretty solid!” I’ll say that if you loaded a fresh Fort and fresh Routinator install, they would both have your ROAs.

  • The sense of “stickiness” is local only; hence to my mind the protection against “downgrade” attack is somewhat illusory. A fresh install knows nothing of history.

Tony

   - Randy says: "finding the fort rp to be pretty solid!" I'll say that
   if you loaded a fresh Fort and fresh Routinator install, they would both
   have your ROAs.
   - The sense of "stickiness" is local only; hence to my mind the
   protection against "downgrade" attack is somewhat illusory. A fresh install
   knows nothing of history.

fort running
enabled rrdp on server
router reports

r0.sea#sh ip bgp rpki table | i 3130
147.28.0.0/20 20 3130 0 147.28.0.84/323
147.28.0.0/19 19 3130 0 147.28.0.84/323
147.28.64.0/19 19 3130 0 147.28.0.84/323
147.28.96.0/19 19 3130 0 147.28.0.84/323
147.28.128.0/19 19 3130 0 147.28.0.84/323
147.28.160.0/19 19 3130 0 147.28.0.84/323
147.28.192.0/19 19 3130 0 147.28.0.84/323
192.83.230.0/24 24 3130 0 147.28.0.84/323
198.180.151.0/24 24 3130 0 147.28.0.84/323
198.180.153.0/24 24 3130 0 147.28.0.84/323

disabled rrdp on server
added new roa 198.180.151.0/25
waited a while
router reports

r0.sea#sh ip bgp rpki table | i 3130
147.28.0.0/20 20 3130 0 147.28.0.84/323
147.28.0.0/19 19 3130 0 147.28.0.84/323
147.28.64.0/19 19 3130 0 147.28.0.84/323
147.28.96.0/19 19 3130 0 147.28.0.84/323
147.28.128.0/19 19 3130 0 147.28.0.84/323
147.28.160.0/19 19 3130 0 147.28.0.84/323
147.28.192.0/19 19 3130 0 147.28.0.84/323
192.83.230.0/24 24 3130 0 147.28.0.84/323
198.180.151.0/25 25 3130 0 147.28.0.84/323 <<<===
198.180.151.0/24 24 3130 0 147.28.0.84/323
198.180.153.0/24 24 3130 0 147.28.0.84/323

as i said, fort seems solid

randy

r0.sea#sh ip bgp rpki table | i 3130
147.28.0.0/20 20 3130 0 147.28.0.84/323
147.28.0.0/19 19 3130 0 147.28.0.84/323
147.28.64.0/19 19 3130 0 147.28.0.84/323
147.28.96.0/19 19 3130 0 147.28.0.84/323
147.28.128.0/19 19 3130 0 147.28.0.84/323
147.28.160.0/19 19 3130 0 147.28.0.84/323
147.28.192.0/19 19 3130 0 147.28.0.84/323
192.83.230.0/24 24 3130 0 147.28.0.84/323
198.180.151.0/25 25 3130 0 147.28.0.84/323 <<<===
198.180.151.0/24 24 3130 0 147.28.0.84/323
198.180.153.0/24 24 3130 0 147.28.0.84/323

note rov ops: if you do not see that /25 in your router(s), the RP
software you are running can be damaging to your customers and to
others.
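
For operators who want to automate that check, a rough sketch that scans
the 'sh ip bgp rpki table' output above for the /25 (the field layout
assumed here follows the IOS output shown in this thread):

    # Pipe the router output into this script and confirm the freshly
    # added /25 VRP actually reached the router via your RP's RTR feed.
    import sys

    EXPECTED_PREFIX, EXPECTED_ORIGIN = "198.180.151.0/25", "3130"

    def vrp_present(table_text):
        for line in table_text.splitlines():
            fields = line.split()   # prefix, maxlen, origin-AS, ...
            if (len(fields) >= 3 and fields[0] == EXPECTED_PREFIX
                    and fields[2] == EXPECTED_ORIGIN):
                return True
        return False

    if __name__ == "__main__":
        if vrp_present(sys.stdin.read()):
            print("VRP present; the RP feeding this router looks current")
        else:
            print("VRP MISSING; the RP feeding this router may be stale")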

randy

Hi Tony,

I realise there are quite a few moving parts, so I'll try to summarise our design choices and reasoning as clearly as possible.

Rsync was the original transport for RPKI and is still mandatory to implement. RRDP (which uses HTTPS) was introduced to overcome some of the shortcomings of rsync. Right now, all five RIRs make their Trust Anchors available over HTTPS, all but two RPKI repositories support RRDP, and all but one relying party software package supports RRDP. There is currently an IETF draft to deprecate the use of rsync.

As a result, the bulk of RPKI traffic is currently transported over RRDP and only a small amount relies on rsync. For example, our RPKI repository is configured accordingly: rrdp.rpki.nlnetlabs.nl is served by a CDN and rsync.rpki.nlnetlabs.nl runs rsyncd on a simple, small VM to deal with the remaining traffic. When operators deploying our Krill Delegated RPKI software ask us what to expect and how to provision their services, this is how we explain the current state of affairs.

With this in mind, Routinator currently has the following fetching strategy (sketched in code after the list):

1. It starts by connecting to the Trust Anchors of the RIRs over HTTPS, if possible, and otherwise uses rsync.
2. It follows the certificate tree, following several pointers to publication servers along the way. These pointers can be rsync only, or there can be two pointers, one to rsync and one to RRDP.
3. If an RRDP pointer is found, Routinator will try to connect to the service and verify that there is a valid TLS certificate and that data can be successfully fetched. If so, the server is marked as usable and Routinator will prefer it. If the initial check fails, Routinator will use rsync, but verify whether RRDP works on the next validation run.
4. If RRDP worked before but is unavailable for any reason, Routinator will use cached data and try again on the next run instead of immediately falling back to rsync.
5. If the RPKI publication server operator takes away the pointer to RRDP to indicate they no longer offer this communication protocol, Routinator will use rsync.
6. If Routinator's cache is cleared, the process will start fresh.
This strategy was implemented with repository server provisioning in mind. We are assuming that if you actively indicate that you offer RRDP, you actually provide a monitored service there. As such, an outage would be assumed to be transient in nature. Routinator could fall back immediately, of course. But our thinking was that if the RRDP service had a small hiccup, 1,000+ Routinator instances would currently be hammering a possibly underprovisioned rsync server, perhaps causing even more problems for the operator.

"Transient" is currently the focus. In Randy's experiment, he is actively advertising he offers RRDP, but doesn't offer a service there for weeks at a time. As I write this, ca.rg.net. cb.rg.net and cc.rg.net have been returning a 404 on their RRDP endpoint several weeks and counting. cc.rg.net was unavailable over rsync for several days this week as well.

I would assume this is not how operators would run their RPKI publication server normally. Not having an RRDP service for weeks when you advertise you do is fine for an experiment but constitutes pretty bad operational practice for a production network. If a service becomes unavailable, the operator would swiftly be contacted and the issue would be resolved, like Randy and I have done in happier times:

https://twitter.com/alexander_band/status/1209365918624755712
https://twitter.com/enoclue/status/1209933106720829440

On a personal note, I realise the situation has a dumpster-fire feel to it. I contacted Randy about his outages months ago, not knowing they were part of a research project. I never got a reply. Instead of discussing his research and the observed effects, it feels like a 'gotcha' to present the findings in this way. It could even be considered irresponsible, if the fallout is as bad as he claims. The notion that using our software is, quote, "a disaster waiting to happen", is disingenuous at best:

https://www.ripe.net/ripe/mail/archives/members-discuss/2020-September/004239.html

Routinator's design tries to deal with outages in a responsible manner for all actors involved. Again, of course we can change our strategy as a result of this discussion, which I'm happy we're now actually having. In that case I would advise operators who offer an RPKI publication server to ensure that they provision their rsyncd service so that it is capable of handling all of the traffic that their RRDP service normally handles, in case RRDP has a glitch. And even if people do scale their rsync service accordingly, they will only ever find out whether it actually copes in a time of crisis.

Kind regards,

-Alex

> cc.rg.net was unavailable over rsync for several days this week as
> well.

sorry. it was cb and cc. it seems some broken RPs did not have the
ROA needed to get to our westin pop. cf this whole thread.

luckily such things never happen in real operations. :slight_smile:

randy

Hi Randy, all,

>> If there is a covering less specific ROA issued by a parent, this will
>> then result in RPKI invalid routes.

> i.e. the upstream kills the customer. not a wise business model.

I did not say it was. But this is the problematic case.

For the vast majority of ROAs the sustained loss of the repository would lead to invalid ROA *objects*, which will no longer be used in Route Origin Validation, leading to the state 'Not Found' for the associated announcements.

This is not the case if there are other ROAs for the same prefixes published by others (most likely the parent). A quick back-of-the-envelope analysis: this affects about 0.05% of ROA prefixes.

>> The fall-back may help in cases where there is an accidental outage of
>> the RRDP server (for as long as the rsync servers can deal with the
>> load)

> folk try different software, try different configurations, realize that
> having their CA gooey exposed because they wanted to serve rrdp and
> block, ...

We are talking here about the HTTPS server being unavailable, while rsync *is*.

So this means your HTTPS server is down, unreachable, or has an issue with its HTTPS certificate. A repository could use a CDN if the operator doesn't want to do all this themselves. They could monitor, and fix things... there is time.

The thing is, even if HTTPS becomes unavailable, this still leaves hours (8 by default for the Krill CA, configurable) to fix things. Routinator (and the RIPE NCC Validator, and others) will use cached data if they cannot retrieve new data. It's only when manifests and CRLs start to expire that the objects would become invalid.

So the fallback helps in the case of HTTPS incidents that were not fixed within 8 hours, and then only for 0.05% of prefixes.

On the other hand, the fallback exposes a Malicious-in-the-Middle replay attack surface for 100% of the prefixes published using RRDP, 100% of the time. This allows attackers to prevent changes in ROAs from being seen.

This is a tradeoff. I think that protecting against replay should be considered more important here, given the numbers and the time available to fix an HTTPS issue.

> randy, finding the fort rp to be pretty solid!

Unrelated, but sure I like Fort too.

Tim

> On the other hand, the fallback exposes a Malicious-in-the-Middle
> replay attack surface for 100% of the prefixes published using RRDP,
> 100% of the time. This allows attackers to prevent changes in ROAs
> from being seen.

This is a mischaracterization of what is going on. The implication of
what you say here is that RPKI cannot work reliably over RSYNC, which is
factually incorrect and an injustice to all existing RSYNC-based
deployments. Your view on the security model seems to ignore the
existence of RPKI manifests and the use of CRLs, which exist exactly to
mitigate replays.

Up until 2 weeks ago Routinator indeed was not correctly validating RPKI
data; fortunately this has now been fixed:
https://mailman.nanog.org/pipermail/nanog/2020-October/210318.html

Also, via the RRDP protocol old data can be replayed because, just like
RSYNC, the RRDP protocol does not have authentication. When RPKI data is
transported from the Publication Point to the Relying Party (RP), the RP
cannot assume there was an unbroken 'chain of custody' and therefore has
to validate all the RPKI signatures.

For example, if a CDN is used to distribute RRDP data, the CDN is the
MITM (that is literally what CDNs are: reverse proxies, in the middle).
The CDN could accidentally serve up old (cached) content or misserve
current content (swap 2 filenames with each other).

> This is a tradeoff. I think that protecting against replay should be
> considered more important here, given the numbers and the time
> available to fix an HTTPS issue.

The 'replay' issue you perceive is also present in RRDP. The RPKI is a
*deployed* system on the Internet and it is important for Routinator to
remain interoperable with other non-NLnet Labs implementations.

Routinator not falling back to rsync does *not* offer a security
advantage, but does negatively impact our industry's ability to migrate
to RRDP. We are in 'phase 0' as described in Section 3 of
draft-sidrops-bruijnzeels-deprecate-rsync.

Regards,

Job

I hate to jump in late. but... :slight_smile:

After reading this a few times it seems like what's going on is:
  o a set of assumptions were built into the software stack
    this seems fine, hard to build without some assumptions :slight_smile:

  o the assumptions seem to include: "if rrdp fails <how?> feel free
    to jump back/to rsync"
    I think SOME of the problem is the 'how' there.
    Admittedly someone (randy) injected a pretty pathological failure
    mode into the system and didn't react when his 'monitoring' said:
    "things are broke yo!"

  o absent a 'failure' the software kept on getting along as it had before.
    After all, maybe the operator here intentionally put their
    repository into this whacky state? How is an RP software stack
    supposed to know what the PP's management is meaning to do?

  o lots of debate about how we got to where we are; I don't know that
    much of it is really helpful.

I think a way forward here is to offer a suggestion for the software
folk to cogitate on and improve?
   "What if (for either rrdp or rsync) there is no successful
   update[0] in X of Y attempts, attempt the other protocol to sync
   down to bring the remote PP back to life in your local view."

This both allows the RP software to pick their primary path (and stick
to that path as long as things work) AND helps the PP folk recover a
bit quicker if their deployment runs into troubles.

0: I think 'failure' here is clear (to me):
    1) the protocol is broken (rsync no connect, no http connect)
    2) the connection succeeds but there is no sync-file (rrdp) nor
       valid MFT/CRL

The 6486-bis rework effort seems to be getting to: "No MFT? no CRL?
you r busted!" So I think if you don't get MFT/CRL in X of Y attempts
it's safe to say the PP over that protocol is busted, and attempting
the other proto is acceptable.
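
For concreteness, the X-of-Y idea might be sketched roughly like this
(illustrative only; the thresholds and the fetch/validate hooks are
placeholders, and "failure" follows the footnote above):

    from collections import defaultdict, deque

    X_FAILURES, Y_ATTEMPTS = 3, 5
    history = defaultdict(lambda: deque(maxlen=Y_ATTEMPTS))  # pp -> results

    def attempt(fetch, validate, pp):
        try:
            return validate(fetch(pp))   # True only with usable MFT/CRL
        except OSError:
            return False                 # protocol broken / no connect

    def refresh(pp, preferred_fetch, other_fetch, validate):
        ok = attempt(preferred_fetch, validate, pp)
        history[pp].append(ok)
        if not ok and list(history[pp]).count(False) >= X_FAILURES:
            # The preferred transport failed in X of the last Y runs; try
            # to bring the remote PP back to life via the other protocol.
            ok = attempt(other_fetch, validate, pp)
        return ok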

thanks!
-chris