CloudFlare issues?

Francois_Lecavalier · July 4, 2019, 3:22pm

Hi Mark,

Following that Verizon debacle I got on board with ROV, and after a bit of research I settled on ....drum roll.... Cloudflare GoRTR (https://github.com/cloudflare/gortr). If you trust them enough, they provide an updated JSON of the global RIR aggregate every 15 minutes. I’ll see down the road whether we fetch the data ourselves, but at least it got us up and running in less than an hour. It was also easy for us to deploy because the routers and the servers are in the same PoP and directly connected, so we don’t need the whole encryption recipe they provide for mass distribution.
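For anyone who wants to look inside such a JSON export before trusting it, here is a minimal sketch of loading it into (prefix, maxLength, origin ASN) tuples. The layout (a top-level "roas" list whose entries carry "prefix", "maxLength" and "asn") is an assumption about the export format, and the sample data uses documentation prefixes and private-use ASNs, not real ROAs.

```python
import ipaddress
import json
from typing import List, NamedTuple, Union


class VRP(NamedTuple):
    """One validated ROA payload: prefix, maximum length, origin ASN."""
    prefix: Union[ipaddress.IPv4Network, ipaddress.IPv6Network]
    max_length: int
    asn: int


# Hypothetical sample in the ASSUMED export layout (not real data).
SAMPLE = """
{
  "roas": [
    {"prefix": "192.0.2.0/24", "maxLength": 24, "asn": "AS64500", "ta": "example"},
    {"prefix": "2001:db8::/32", "maxLength": 48, "asn": 64501, "ta": "example"}
  ]
}
"""


def load_vrps(text: str) -> List[VRP]:
    """Parse the JSON text and return a list of VRP tuples."""
    doc = json.loads(text)
    vrps = []
    for entry in doc["roas"]:
        asn = entry["asn"]
        # Accept both "AS64500" strings and bare integers.
        if isinstance(asn, str):
            asn = asn.strip().upper()
            if asn.startswith("AS"):
                asn = asn[2:]
        vrps.append(
            VRP(
                prefix=ipaddress.ip_network(entry["prefix"]),
                max_length=int(entry["maxLength"]),
                asn=int(asn),
            )
        )
    return vrps


if __name__ == "__main__":
    for vrp in load_vrps(SAMPLE):
        print(vrp.prefix, vrp.max_length, vrp.asn)
```

In practice the text would come from wherever the export is published or cached locally; the parsing step stays the same.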

But I also have a question for all the ROA folks out there. So far we are not taking any action other than lowering the local-pref - we want to make sure this is stable before we start denying prefixes. So the question: is it safe as of this date to 1. accept valid, 2. accept unknown, 3. reject invalid? Has any large network that implemented this had to deal with unreachable destinations? I’m wondering because I haven’t found any blog mentioning anything in this regard, and the Cloudflare docs only show examples for valid and invalid, but nothing for unknown.

My assumption is that 1. accept valid, 2. accept unknown, 3. reject invalid shouldn’t break anything.

Thanks,

-Francois

_Job_Snijders · July 4, 2019, 3:33pm

Dear Francois,

> Following that Verizon debacle I got on board with ROV, and after a bit
> of research I settled on ....drum roll.... Cloudflare GoRTR
> (https://github.com/cloudflare/gortr). If you trust them enough, they
> provide an updated JSON of the global RIR aggregate every 15 minutes.

At this point in time I think the ideal deployment model is to perform
the validation within your administrative domain and run your own
validators. You can combine Routinator with GoRTR, or use Cloudflare's
OctoRPKI software: https://github.com/cloudflare/cfrpki

> I'll see down the road whether we fetch the data ourselves, but at
> least it got us up and running in less than an hour. It was also easy
> for us to deploy because the routers and the servers are in the same
> PoP and directly connected, so we don't need the whole encryption
> recipe they provide for mass distribution.

yeah, that is true!

> But I also have a question for all the ROA folks out there. So far we
> are not taking any action other than lowering the local-pref - we want
> to make sure this is stable before we start denying prefixes. So the
> question: is it safe as of this date to 1. accept valid, 2. accept
> unknown, 3. reject invalid? Has any large network that implemented
> this had to deal with unreachable destinations? I'm wondering because
> I haven't found any blog mentioning anything in this regard, and the
> Cloudflare docs only show examples for valid and invalid, but nothing
> for unknown.

I believe at this point in time it is safe to accept valid and unknown
(combined with an IRR filter), and reject RPKI invalid BGP announcements
at your EBGP borders. Examples of large organisations that are already
rejecting invalid announcements are AT&T, NORDUnet, DE-CIX, YYCIX,
XS4ALL, MSK-IX, INEX, France-IX, SEACOM, Workonline, KPN International,
and hundreds of others.
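
To make the three states concrete, here is a minimal sketch of origin validation in the spirit of RFC 6811, followed by the "accept valid and unknown, reject invalid" policy. The VRP list uses private addresses and private-use ASNs purely for illustration, and the function names are hypothetical; a router's own implementation will differ.

```python
import ipaddress

# Hypothetical VRPs: (prefix, maxLength, origin ASN).
VRPS = [
    ("10.1.0.0/16", 16, 64500),
    ("2001:db8::/32", 48, 64501),
]


def rov_state(route_prefix, origin_asn, vrps):
    """Classify a route as 'valid', 'invalid' or 'notfound' (unknown)."""
    route = ipaddress.ip_network(route_prefix)
    covered = False
    for vrp_prefix, max_length, vrp_asn in vrps:
        vrp_net = ipaddress.ip_network(vrp_prefix)
        # subnet_of() raises TypeError across address families, so check first.
        if vrp_net.version != route.version:
            continue
        if not route.subnet_of(vrp_net):
            continue
        covered = True  # at least one VRP covers this route
        if route.prefixlen <= max_length and vrp_asn == origin_asn:
            return "valid"
    # Covered but no VRP matched: invalid. Not covered at all: notfound.
    return "invalid" if covered else "notfound"


def import_action(state):
    """Reject invalid; treat valid and notfound identically."""
    return "reject" if state == "invalid" else "accept"


if __name__ == "__main__":
    for prefix, asn in [
        ("10.1.0.0/16", 64500),   # matches the VRP
        ("10.1.2.0/24", 64500),   # longer than maxLength allows
        ("10.1.0.0/16", 64999),   # wrong origin
        ("10.9.0.0/16", 64500),   # no covering VRP
    ]:
        state = rov_state(prefix, asn, VRPS)
        print(prefix, asn, state, import_action(state))
```

Note that the policy function deliberately has only two outcomes: nothing in it prefers a valid route over an unknown one.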

You can run an analysis yourself to see how traffic would be impacted in
your network using pmacct or Kentik, see this post for more info:
https://mailman.nanog.org/pipermail/nanog/2019-February/099522.html
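
As a rough idea of what such an analysis boils down to, here is a small sketch that buckets traffic by what would happen once invalids are dropped. The record layout (destination prefix, ROV state, whether a valid or unknown covering route exists, byte count) is entirely hypothetical and is not the output format of pmacct or Kentik; it only illustrates the arithmetic.

```python
from collections import defaultdict

# Hypothetical per-destination traffic summaries (made-up numbers).
FLOWS = [
    ("10.1.0.0/16", "valid", False, 7000),
    ("10.9.0.0/16", "notfound", False, 2500),
    ("10.1.2.0/24", "invalid", True, 400),
    ("10.5.5.0/24", "invalid", False, 100),
]


def impact_of_dropping_invalids(flows):
    """Return the share of bytes per outcome if invalid routes are rejected."""
    totals = defaultdict(int)
    for _prefix, state, has_covering_route, nbytes in flows:
        if state != "invalid":
            totals["unaffected"] += nbytes
        elif has_covering_route:
            # Traffic follows the less-specific route instead.
            totals["rerouted_via_covering_route"] += nbytes
        else:
            # No alternative route: destination becomes unreachable.
            totals["unreachable"] += nbytes
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {bucket: nbytes / grand_total for bucket, nbytes in totals.items()}


if __name__ == "__main__":
    for bucket, share in impact_of_dropping_invalids(FLOWS).items():
        print(f"{bucket}: {share:.1%}")
```

The bucket that matters before flipping the switch is "unreachable": traffic towards invalid routes that have no valid or unknown covering route to fall back on.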

> My assumption is that 1. accept valid, 2. accept unknown, 3. reject
> invalid shouldn't break anything.

Correct! Let us know how it went.

Kind regards,

Job

Ben_Maddison · July 4, 2019, 3:50pm

Hi Francois,

> At this point in time I think the ideal deployment model is to perform
> the validation within your administrative domain and run your own
> validators.

+1

> But I also have a question for all the ROA folks out there. So far we
> are not taking any action other than lowering the local-pref - we want
> to make sure this is stable before we start denying prefixes. So the
> question: is it safe as of this date to 1. accept valid, 2. accept
> unknown, 3. reject invalid? Has any large network that implemented
> this had to deal with unreachable destinations? I'm wondering because
> I haven't found any blog mentioning anything in this regard, and the
> Cloudflare docs only show examples for valid and invalid, but nothing
> for unknown.

We have been dropping Invalids since April, and have had only a
(single-digit) handful of support requests related to those becoming
unreachable.

The larger challenge has been related to vendor implementation choices
and bugs, particularly on IOS-XE. Happy to go into more detail if
anyone is interested.

I would recommend *not* taking any policy action that distinguishes
Valid from Unknown. If you find that you have routes for the same
prefix/len with both statuses, then that is a bug and/or
misconfiguration which you could turn into a loop by taking policy
action on that difference.

Cheers,

Ben

Mark_Tinka1 · July 4, 2019, 5:10pm

Funny you should mention this… I was speaking with Tom today, during an RPKI talk he gave at MyNOG, about whether we’d be willing to trust their RTR streams.

But I’m glad you found a quick solution to get you up and running. Welcome to the club.

Well, the Valid and NotFound states implicitly mean that the routes can be used for routing/forwarding. In that case, the only policy we create and apply is against Invalid routes, which is to DROP them.

Mark.

Nick_Hilliard3 · July 4, 2019, 5:13pm

Accepting valid ROAs is a better idea after checking that the source AS is legitimate from the peer.

Nick

Mark_Tinka1 · July 4, 2019, 5:14pm

In essence, this is also my thought process.

I think Cloudflare are very well-intentioned in making it as painless as
possible for other operators to get RPKI deployed (and more power
to them for going to such lengths to do so), but you have to determine
whether you are willing to let a service such as this run outside of your
domain.

Every year, someone asks me whether I'd be willing to outsource my route
reflector VNFs to AWS/Azure/etc. My answer to that falls in the same
realm as handling RPKI for your network :-).

Mark.

Mark_Tinka1 · July 4, 2019, 5:17pm

We've had 2 cases where customers could not reach a prefix. Both were
mistakes (as we've found most Invalid routes to be), which were promptly
fixed.

One of them was where a cloud provider decided to originate a longer
prefix on behalf of their content-producing customer, using their own AS
as opposed to the one the customer had used to create the ROA for the
covering block.

Mark.

Francois_Lecavalier · July 4, 2019, 6:46pm

> At this point in time I think the ideal deployment model is to perform
> the validation within your administrative domain and run your own
> validators.
>
> +1

We'll definitely look into this shortly. I don't want to leave a security measure in the hands of a third party, but with my team being so busy it was a quick temporary fix.

> The larger challenge has been related to vendor implementation choices and bugs, particularly on IOS-XE. Happy to go into more detail if anyone is interested.

We are on Juniper MX204s at the edge and they have been solid for the last 60 weeks - we ran into a long list of bugs on other platforms but not on these.

So I had about 4200 routes marked as invalid. After looking at a sample of them, it looks like most are covered by a ROA but are announced with a mask length the ROA does not allow, so there is ultimately still a route to these prefixes and at worst this results in "suboptimal" routing, or should I say: the remote network can't control its route propagation anymore. In most cases they are stub networks with a single /24 reassigned from the upstream provider. I have no traffic going directly to these networks and I don't expect any to go there anytime soon.
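
The kind of sampling described above can be sketched as follows: for each invalid route, work out whether it fails on mask length or on origin AS, and whether a less-specific route would still carry the traffic once the invalid is dropped. The function names, prefixes and ASNs are hypothetical and for illustration only.

```python
import ipaddress


def invalid_reason(route_prefix, origin_asn, vrps):
    """Return 'length' or 'origin' for an invalid route, else None.

    vrps is a list of (prefix, maxLength, origin ASN) tuples.
    """
    route = ipaddress.ip_network(route_prefix)
    covering = []
    for vrp_prefix, max_length, vrp_asn in vrps:
        vrp_net = ipaddress.ip_network(vrp_prefix)
        if vrp_net.version == route.version and route.subnet_of(vrp_net):
            covering.append((max_length, vrp_asn))
    if not covering:
        return None  # notfound: no VRP covers the route
    if any(asn == origin_asn and route.prefixlen <= ml for ml, asn in covering):
        return None  # valid
    if any(asn == origin_asn for _ml, asn in covering):
        return "length"  # right origin, announcement too specific
    return "origin"      # no covering VRP names this origin AS


def less_specific_routes(route_prefix, accepted_prefixes):
    """List accepted prefixes that strictly cover route_prefix."""
    route = ipaddress.ip_network(route_prefix)
    result = []
    for candidate in accepted_prefixes:
        net = ipaddress.ip_network(candidate)
        if net.version == route.version and net != route and route.subnet_of(net):
            result.append(candidate)
    return result


if __name__ == "__main__":
    vrps = [("10.2.0.0/16", 16, 64510)]          # hypothetical ROA data
    accepted = ["10.2.0.0/16", "10.3.0.0/16"]    # routes still in the table
    print(invalid_reason("10.2.3.0/24", 64510, vrps))
    print(invalid_reason("10.2.3.0/24", 64511, vrps))
    print(less_specific_routes("10.2.3.0/24", accepted))
```

A "length" invalid with a surviving less-specific route is the benign case: the destination stays reachable via the covering announcement.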

It's been close to 3 hours now since I dropped them - radio silence.

Whoever fears implementing RPKI/ROA/ROV, simply don't. It's very easy to implement, validate and troubleshoot.

Ben_Maddison · July 4, 2019, 6:54pm

Welcome to the club!

_Job_Snijders · July 4, 2019, 6:57pm

> It's been close to 3 hours now since I dropped them - radio silence.

I am going to assume that "radio silence" for you means that your
network is fully functional and none of your customers have raised
issues!

> Whoever fears implementing RPKI/ROA/ROV, simply don't. It's very easy to implement, validate and troubleshoot.

Thank you for sharing your report. I believe it is good to share rpki
stories with each other, not just to celebrate the deployment of an
exciting technology, but also to help provide debugging information
ahead of time should there be issues between provider A and B due to a
ROA misconfiguration. Announcing to the public that one has deployed
RPKI - in this stage of the lifecycle of the tech - probably is a
productive action to consider.

Anyway, you can now enjoy https://rpki.net/s/rpki-test even more!

Kind regards,

Job

Mark_Tinka1 · July 4, 2019, 6:59pm

Well done! Congrats!

Mark.

_Job_Snijders · July 4, 2019, 7:20pm

> Anyway, you can now enjoy https://rpki.net/s/rpki-test even more!

my apologies, I fumbled the ball on typing in that URL, I intended to
point here: https://www.ripe.net/s/rpki-test