Analysing traffic in context of rejecting RPKI invalids using pmacct

Job_Snijders · February 12, 2019, 6:15pm

Dear all,

Whether to deploy RPKI Origin Validation with an "invalid == reject"
policy really is a business decision. One has to weigh the pros and
cons: What are the direct and indirect costs of accepting
misconfigurations or hijacks for my company? What is the cost of
deploying RPKI? What is the cost of honoring misconfigured RPKI ROAs?
There are a few thousand misconfigured ROAs; what does this mean for me?

To answer these questions, Paolo Lucente and I worked to extend the
pmacct traffic analysis engine (http://pmacct.net/) in such a way that
it can perform the RFC 6811 Origin Validation procedure and present
the outcome as a property in the flow aggregation process.

Pmacct has the ability to ingest BGP feeds and correlate the BGP data with
the sflow/netflow/ipfix data. This allows for fantastic business
intelligence: you can see exactly how much traffic is flowing from what
customers to what endpoints for what reason!

Pmacct implemented Origin Validation in a cute way: it separates out
RPKI invalid BGP announcements into two categories:

a) "invalid with no overlapping or alternative route"
(aka will be blackholed if 'invalid == reject')

b) "invalid but an overlapping unknown/valid announcement also exists"
(end-to-end connectivity can still work).
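
To make the a/b split concrete, here is a minimal Python sketch. It is not
pmacct's code: the function name, the category labels and the input format
(a list of prefix/state pairs that have already been validated) are all
assumptions made for illustration, and "overlapping" is read here as "an
equal or less-specific announcement that is not itself invalid".

```python
import ipaddress

def classify_invalids(routes):
    """Split invalid routes into the two categories described above.

    routes: list of (prefix, state) pairs, state being "valid",
    "unknown" or "invalid". The labels returned are hypothetical,
    they are not pmacct's own output values.
    """
    parsed = [(ipaddress.ip_network(p), s) for p, s in routes]
    result = {}
    for net, state in parsed:
        if state != "invalid":
            continue
        # Covered if any not-invalid announcement is equal to or
        # less specific than this invalid one (same address family).
        covered = any(
            other_state != "invalid"
            and other.version == net.version
            and other.supernet_of(net)
            for other, other_state in parsed
        )
        result[str(net)] = "invalid-covered" if covered else "invalid-unreachable"
    return result

sample_routes = [
    ("198.51.100.0/24", "invalid"),   # nothing else covers it: category a
    ("203.0.113.0/25", "invalid"),    # covered by the /24 below: category b
    ("203.0.113.0/24", "unknown"),
]
categories = classify_invalids(sample_routes)
```

The linear scan keeps the sketch short; a real table with hundreds of
thousands of routes would want a radix tree or similar instead.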

Because pmacct separates out the various kinds of (invalid) BGP
announcements, operators don't have to deploy *anything* in their
network to get a good grasp on what their connectivity to the rest of the
Internet would look like after deploying an "invalid == reject" policy.
No changes to your network configurations are required to make use of
this feature, you don't need to tag routes with communities or do other
tricks. All the analysis happens inside pmacct.

Of course we tested this first in the NTT global backbone AS 2914! At
the moment of writing, we're seeing less than a handful of gigabits per
second being sent towards BGP announcements that are RPKI Invalid and
for which no alternative route exists. In context of NTT's backbone
that amount of traffic is just statistical noise. This is a very
encouraging sign: it may help us move towards the goal of deploying RPKI
Origin Validation in AS 2914.

Nusenu wrote a great blog post on where these RPKI ROA misconfigurations
are located. I recommend reading their posts to develop a better
understanding of the problem space:
https://medium.com/@nusenu/where-are-rpki-unreachable-networks-located-65c7a0bae0f8

Even if you don't intend to deploy RPKI Origin Validation (or are
single-homed), pmacct's RPKI capabilities can be useful in forensic
investigations. It'll be easier to analyse how much and what kind of
traffic for what period of time was sent to a possible hijack. This
will help you when writing RFOs!

If you want to test-drive this feature, fetch pmacct version 1.7.3-rc1
from https://github.com/pmacct/pmacct/releases/tag/1.7.3-rc1

Documentation on how to configure the feature:
https://github.com/pmacct/pmacct/blob/master/QUICKSTART#L1783-#L1833
https://github.com/pmacct/pmacct/blob/master/CONFIG-KEYS#L2626-#L2647

Let us know what you think! Or if you'd like to chat telemetry with
Paolo or me about analysing the effects of BGP hijacks and RPKI, we'll
both be at the San Francisco NANOG meeting next week!

Kind regards,

Job

ps. Dear Kentik & Deepfield, please copy+paste this feature! We'll
happily share development notes with you, you can even look at pmacct's
source code for inspiration.

Steve_Meuse3 · March 11, 2019, 9:37pm

Thanks Job, I just wanted to reach back out to you and the NANOG community to say that we’ve implemented this feature. Currently Kentik can match flow data with the following validation states:

VALID = Prefix fits in ROA, and ROA ASN and Prefix Origin Match
UNKNOWN = we haven’t found any matching ROA
INVALID - ASN mismatch = BGP prefix fits in the ROA prefix’s length BUT the ROA ASN differs from the Prefix Origin ASN
INVALID - Prefix length out of bounds = the BGP prefix doesn’t have an ROA with large enough Max-Length to refer to
INVALID - ASN 0 specified = there is a matching ROA w/ the right max-length but the ASN associated w/ it is 0 (explicit invalid)
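
As an illustration of how those five states could be derived, here is a
minimal Python sketch in the spirit of RFC 6811. It is not Kentik's
implementation: the ROA tuple, the function name and the order in which the
INVALID reasons are checked are assumptions made for this example.

```python
import ipaddress
from typing import NamedTuple

class ROA(NamedTuple):
    # Hypothetical flattened ROA record: one prefix, one max length, one ASN.
    prefix: str
    max_length: int
    asn: int

def validation_state(route_prefix, origin_asn, roas):
    route = ipaddress.ip_network(route_prefix)
    # ROAs whose prefix covers (is equal to or less specific than) the route.
    covering = []
    for roa in roas:
        roa_net = ipaddress.ip_network(roa.prefix)
        if roa_net.version == route.version and roa_net.supernet_of(route):
            covering.append(roa)
    if not covering:
        return "UNKNOWN"
    # Covering ROAs whose max length allows a prefix this specific.
    length_ok = [r for r in covering if route.prefixlen <= r.max_length]
    if any(r.asn == origin_asn and r.asn != 0 for r in length_ok):
        return "VALID"
    # Assumed precedence for the reason: length first, then AS 0, then mismatch.
    if not length_ok:
        return "INVALID - Prefix length out of bounds"
    if all(r.asn == 0 for r in length_ok):
        return "INVALID - ASN 0 specified"
    return "INVALID - ASN mismatch"

sample_roas = [
    ROA("192.0.2.0/24", 24, 64500),
    ROA("198.51.100.0/24", 24, 0),
]
```

With these sample ROAs, 192.0.2.0/24 from AS 64500 is VALID, the same
prefix from AS 64501 is an ASN mismatch, and 192.0.2.0/25 from AS 64500
fails on prefix length.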

If anyone would like more information please hit me up offline.

-Steve

Jay_Borkenhagen · March 12, 2019, 1:26pm

> > ps. Dear Kentik & Deepfield, please copy+paste this feature! We'll
> > happily share development notes with you, you can even look at pmacct's
> > source code for inspiration.
>
> Thanks Job, I just wanted to reach back out to you and the NANOG community
> that we've implemented this feature. Currently Kentik can match flow data
> with the following validation state:
>
> - VALID = Prefix fits in ROA, and ROA ASN and Prefix Origin Match
> - UNKNOWN = we haven't found any matching ROA
> - INVALID - ASN mismatch = BGP prefix fits in the ROA prefix's length BUT
> the ROA ASN differs from the Prefix Origin ASN
> - INVALID - Prefix length out of bounds = the BGP prefix doesn't have an
> ROA with large enough Max-Length to refer to
> - INVALID - ASN 0 specified = there is a matching ROA w/ the right
> max-length but the ASN associated w/ it is 0 (explicit invalid)
>

Hi Steve,

Thanks for the update, but based on that description I'm not certain
that you implemented the same thing that pmacct built, which IMO is
what is needed by those considering deploying a drop-invalids policy.
(Perhaps you omitted mentioning that ability in your description but
included it in your implementation.)

Citing from Job's description:

> Pmacct implemented Origin Validation in a cute way: it separates out
> RPKI invalid BGP announcements into two categories:
>
> a) "invalid with no overlapping or alternative route"
> (aka will be blackholed if 'invalid == reject')
>
> b) "invalid but an overlapping unknown/valid announcement also
> exists"
> (end-to-end connectivity can still work).
>

Networks contemplating Origin Validation need to be able to predict
how their traffic with the rest of the Internet would change after
deploying a drop-invalid-routes policy.

When we (as7018) were preparing to begin dropping invalid routes
received from peers earlier this year, that is exactly the kind of
analysis we did. In our case we rolled our own with a two-pass
process: we first found all the traffic to/from invalid routes by a
bgp community we gave them, then outside of our flow analysis tool we
further filtered the traffic for invalid routes which were covered by
less-specific not-invalid routes. What remained was the traffic we
would lose once invalid routes were dropped. Had the pmacct
capability existed at that time, we would have used it.
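
The two-pass process described above could be sketched roughly as follows.
This is not the tooling as7018 actually used: the function name and the
input shapes (flows already mapped to the BGP route they matched, and a
table recording which routes carry the "invalid" community) are assumptions
for illustration only.

```python
import ipaddress

def traffic_at_risk(flows, routes):
    """Bytes that would be lost under a drop-invalids policy.

    flows:  list of (route_prefix, byte_count), route_prefix being the
            BGP route the flow matched.
    routes: dict of prefix -> True if the route was tagged invalid.
    """
    # Pass 1: keep only the traffic that matched an invalid route.
    invalid_flows = [
        (ipaddress.ip_network(p), nbytes)
        for p, nbytes in flows
        if routes.get(p, False)
    ]
    # Pass 2: discard traffic whose invalid route is covered by a
    # less-specific route that is not invalid.
    not_invalid = [ipaddress.ip_network(p) for p, inv in routes.items() if not inv]
    lost = 0
    for net, nbytes in invalid_flows:
        covered = any(
            c.version == net.version
            and c.prefixlen < net.prefixlen
            and c.supernet_of(net)
            for c in not_invalid
        )
        if not covered:
            lost += nbytes
    return lost

sample_table = {
    "198.51.100.0/24": True,    # invalid, no covering route
    "203.0.113.0/25": True,     # invalid, covered by the /24 below
    "203.0.113.0/24": False,
    "192.0.2.0/24": False,
}
sample_flows = [
    ("198.51.100.0/24", 1000),
    ("203.0.113.0/25", 500),
    ("192.0.2.0/24", 9000),
    ("198.51.100.0/24", 250),
]
lost_bytes = traffic_at_risk(sample_flows, sample_table)
```

Only the two flows towards 198.51.100.0/24 count as lost here; the traffic
towards 203.0.113.0/25 would follow the covering /24 instead.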

Regarding the ability to further partition invalid traffic into the
three sub-categories you mentioned: that would not have been of
interest to us at the time we did our analysis, and it's not clear to
me how it would be useful to a network as it contemplates adopting a
drop-invalids policy. In this context, the reason a route is invalid
is not important; what is important is whether it is covered by a
non-invalid route or not.

Thanks.

Jay B.

Steve_Meuse3 · March 13, 2019, 3:17pm

Thanks Jay, you are correct. As we were talking through the logic we realized we missed that bit. Internally, we’re working through the logic to understand: if there is a covering route, is that route valid, and if not, will we recurse and look for another covering route that is valid?

Either way, we’ll be updating our software with that functionality shortly.
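
One possible shape for that recursion, as a minimal Python sketch: walk
from the invalid prefix towards ever shorter covering prefixes and stop at
the first one that is announced and not invalid. This is an assumption
about how it might be done, not a description of any product. The function
name and the dict-based table are hypothetical, and accepting "unknown" as
well as "valid" covering routes is a choice made here to match the
"not-invalid" wording used earlier in the thread.

```python
import ipaddress

def find_usable_covering_route(prefix, rib):
    """Return (network, state) of the nearest not-invalid covering route.

    rib: dict mapping ipaddress network objects to "valid", "unknown"
    or "invalid". Returns None when every covering route is invalid
    or no covering route is announced at all.
    """
    net = ipaddress.ip_network(prefix)
    while net.prefixlen > 0:
        net = net.supernet()          # one bit less specific
        state = rib.get(net)
        if state is not None and state != "invalid":
            return net, state
        # Either not announced or itself invalid: keep walking up.
    return None

sample_rib = {
    ipaddress.ip_network("203.0.113.0/25"): "invalid",
    ipaddress.ip_network("203.0.113.0/24"): "invalid",
    ipaddress.ip_network("203.0.112.0/23"): "valid",
}
hit = find_usable_covering_route("203.0.113.0/25", sample_rib)
```

Here the /24 is skipped because it is invalid too, and the walk ends at the
valid /23. The loop is bounded by the prefix length, so it needs at most 32
lookups for IPv4 and 128 for IPv6.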

-Steve

Bandy_Rush1 · March 14, 2019, 12:43am

> > Thanks for the update, but based on that description I'm not certain
> > that you implemented the same thing that pmacct built, which IMO is
> > what is needed by those considering deploying a drop-invalids policy.
> > (Perhaps you omitted mentioning that ability in your description but
> > included it in your implementation.)
>
> Thanks Jay, you are correct. As we were talking through the logic we
> realized we missed that bit. Internally, we're working through the logic to
> understand if there is a covering route, is that route valid, and if not,
> will we recurse and look for another covering route that is valid?

daniele's pam paper and ripe preso laid it out pretty well

Daniele Iamartino, Cristel Pelsser, Randy Bush. "Measuring BGP Route
Origin Registration and Validation," PAM 2015.