Securing the BGP or controlling it?

Interestingly, the article misses interception and the other non-outage
possibilities that (sub-)prefix hijacking opens up.

you seem to be entering the world of attacks. the AP article's point
was fat fingers.

randy

Interesting. I took it as a setup for why we NEED a Central Authority.

*shrug* -- We'll see I guess, but probably not in time.

There are a lot of problems associated with using IRRDB filters for inbound
prefix filtering.

We used them near-ubiquitously over 15 years ago and stopped,
mostly because:

1) there was nothing akin to route refresh, so you had to bounce
best routes or reset sessions to trigger readvertisement after
policy updates. This made unscheduled changes a pain and required
some turning of the steam valves.

2) traditional ACLs were used for routing policy specification and
weren't incrementally updatable, which was a huge pita.

3) IRRs were insecure to update, no one ever deletes anything from
IRRs, and some folks even proxy register IRR objects based on BGP
routing table entries.

4) customers complained they had to maintain them (ohh, wait, we
told them if they wanted to be routed they had no choice)

Regarding 1, we have route refresh and inherent soft-reconfig
today. Regarding 2, pretty much all implementations support this
today (although it will be a pita to maintain near-exact prefix
list and ACL entries for a customer down the road when they're
used for both routing policy and ingress anti-spoofing - see the
sketch below). Regarding 3, RPKI should help here quite a bit,
either used directly or by enabling IRR object population - the
RIRs that run IRRs and other folks are helping with secure update
mechanisms as well. Regarding 4, they didn't scream as loudly
about the policy as they did when things broke because of its
absence - that I know firsthand.
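
A minimal sketch of what that looks like when automated - the customer
name and prefixes below are hypothetical placeholders and the IOS-style
syntax is just one choice - generating both the routing-policy
prefix-list and the ingress anti-spoofing ACL from a single source list
so the two can't drift apart:

# Minimal sketch: derive the routing-policy prefix-list and the ingress
# anti-spoofing ACL for a customer from the same source-of-truth prefix
# set, so the two never drift apart. Names/prefixes are placeholders.
import ipaddress

CUSTOMER = "CUST-EXAMPLE"                          # hypothetical name
PREFIXES = ["192.0.2.0/24", "198.51.100.0/23"]     # documentation space

def prefix_list(name, prefixes):
    # IOS-style prefix-list permitting the registered prefixes and
    # more-specifics down to /24 (one common, debatable, policy choice).
    lines = []
    for i, p in enumerate(prefixes):
        net = ipaddress.ip_network(p)
        lines.append(f"ip prefix-list {name} seq {5 * (i + 1)} permit {net} le 24")
    return lines

def antispoof_acl(name, prefixes):
    # Extended ACL permitting only traffic sourced from the same prefixes.
    lines = [f"ip access-list extended {name}"]
    for p in prefixes:
        net = ipaddress.ip_network(p)
        lines.append(f" permit ip {net.network_address} {net.hostmask} any")
    lines.append(" deny ip any any")
    return lines

if __name__ == "__main__":
    print("\n".join(prefix_list(f"{CUSTOMER}-IN", PREFIXES)))
    print("\n".join(antispoof_acl(f"{CUSTOMER}-SRC", PREFIXES)))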

- some clients announce lots of prefixes. This can make inbound prefix
filtering difficult in some situations.

pixiedust:/home/nick> grep '>' pakistani-telecom.bgpdump.txt | wc -l
    967

Yes, this needs to be automated, clearly.
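
On the automation front, a rough sketch (not an endorsement of any
particular tool - irrtoolset and friends exist for this) that pulls the
prefixes registered for an origin AS straight from an IRR mirror,
assuming the IRRd-style "!g" origin query offered by whois.radb.net:

# Rough sketch: ask an IRRd server which IPv4 prefixes are registered
# with a given origin AS, using the IRRd "!g" query (whois.radb.net is
# assumed here). Error handling is deliberately thin.
import socket

def irr_origin_prefixes(asn, server="whois.radb.net", port=43):
    with socket.create_connection((server, port), timeout=15) as s:
        s.sendall(f"!gAS{asn}\n".encode())
        data = b""
        while True:                 # server closes after one query
            chunk = s.recv(4096)
            if not chunk:
                break
            data += chunk
    prefixes = []
    for line in data.decode().splitlines():
        # Skip protocol status lines (A<len>, C, D, F ...); the payload
        # is whitespace-separated prefixes.
        if line and not line[0].isalpha():
            prefixes.extend(line.split())
    return prefixes

if __name__ == "__main__":
    # AS17557 (Pakistan Telecom, per the bgpdump above) is just an example.
    print(len(irr_origin_prefixes(17557)), "registered prefixes")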

- there are some endemic data reliability problems with the IRRDBs,
exacerbated by the fact that on most of the widely-used IRRDBs, there is no
link between the RIR and the IRRDB, which means that anyone can register
any address space. whois.ripe.net doesn't allow this, but lots of other
IRRDBs do.

See 3 above.

- the RIPE whois server software does not support server-side as-set
expansion. This is a really serious problem if you're expanding large as-sets.

Do it yourself for now and file a feature request.
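
In the do-it-yourself spirit, a rough sketch of client-side recursive
as-set expansion against the RIPE whois port - the as-set name at the
bottom is purely illustrative, multi-line "members:" continuations
aren't handled, and the loop guard is the part that actually matters:

# Rough sketch: client-side recursive expansion of an as-set via the
# RIPE whois server, since server-side expansion isn't offered.
# RPSL corner cases are ignored; the "seen" loop guard is the point.
import re
import socket

def whois_query(query, server="whois.ripe.net", port=43):
    with socket.create_connection((server, port), timeout=15) as s:
        s.sendall((query + "\r\n").encode())
        data = b""
        while True:
            chunk = s.recv(4096)
            if not chunk:
                break
            data += chunk
    return data.decode(errors="replace")

def expand_as_set(name, seen=None):
    """Return the set of AS numbers reachable from an as-set."""
    seen = set() if seen is None else seen
    if name in seen:                    # guard against membership loops
        return set()
    seen.add(name)
    text = whois_query(f"-r -T as-set {name}")
    members = []
    for line in text.splitlines():
        if line.lower().startswith("members:"):
            members += [m.strip() for m in line.split(":", 1)[1].split(",")]
    asns = set()
    for m in filter(None, members):
        if re.fullmatch(r"AS\d+", m, re.IGNORECASE):
            asns.add(m.upper())
        else:                           # nested as-set: recurse
            asns |= expand_as_set(m, seen)
    return asns

if __name__ == "__main__":
    print(sorted(expand_as_set("AS-EXAMPLE")))   # illustrative name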

- there is very little client software. At least irrtoolset compiles these
days, but its front-end is very primitive. rpsltool provides some really
nice templating functionality, but doesn't implement large sections of the
rpsl standards.

Agreed, we need to do work here.

-danny

This is a matter of risk analysis. No secure routing means we'll continue
to see the occasional high-profile outage, which is dealt with very quickly.

If 3 weeks (e.g., the recent 'i root'/China incident) is "very
quickly", then we're operating on different timescales.

My gut instinct tells me that secure routing and the rpki venture well into
the realm of negative returns.

I believe 'sucks less' falls into the realm of positive, so here
we disagree.

-danny

I don't suspect we'd need a central authority for that. I'm sure it's enough for your traffic to pass within anyone's national boundary to be 'at risk' of such things.

-jim

Building a database (i.e., the RPKI) aligned with the Internet number
resource allocation hierarchy that attests to who's authorized to
originate which route announcements, and telling you how to configure
your routers, are two fundamentally different things.

If that database doesn't exist, it's tough to discriminate between
legitimate and malicious or erroneous announcements - irrespective of
how you discriminate. If it does exist, and you use it, anyone who
can rub two packets together is surely going to employ preferences
that first consider organizational and local objectives, then
potentially national ones, and then some global inputs.

This basically helps people to make more informed decisions, methinks.
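
For instance, a minimal sketch of RFC 6811-style origin validation:
given a set of ROA-like (prefix, max-length, origin AS) attestations,
classify an announcement as valid, invalid, or not-found - what you then
do with that state is exactly where the local, national, and global
preferences above come in. The ROAs and announcements here are
fabricated for illustration:

# Minimal sketch of RFC 6811-style route origin validation against a
# set of ROA-like attestations. All data below is fabricated.
import ipaddress

ROAS = [
    # (covering prefix, max length, authorized origin ASN)
    (ipaddress.ip_network("192.0.2.0/24"), 24, 64500),
    (ipaddress.ip_network("198.51.100.0/22"), 24, 64501),
]

def validate(prefix, origin_asn):
    prefix = ipaddress.ip_network(prefix)
    covering = [
        (p, maxlen, asn) for p, maxlen, asn in ROAS
        if p.version == prefix.version and prefix.subnet_of(p)
    ]
    if not covering:
        return "not-found"       # no ROA covers this announcement
    for p, maxlen, asn in covering:
        if origin_asn == asn and prefix.prefixlen <= maxlen:
            return "valid"       # authorized origin, within max length
    return "invalid"             # covered, but no attestation matches

if __name__ == "__main__":
    print(validate("192.0.2.0/24", 64500))    # valid
    print(validate("192.0.2.0/25", 64500))    # invalid: beyond max length
    print(validate("192.0.2.0/24", 64666))    # invalid: wrong origin
    print(validate("203.0.113.0/24", 64500))  # not-found: no covering ROA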

-danny

Ziad,

I agree, it's unfortunate that so many people no longer require route registration. Not that it would solve all the issues. Tom School, Todd Underwood and I presented some work we did looking at this at NANOG in LA a while back. Unfortunately we could never find time to take it to the next steps.

-jim

I think it captures it in such a way that my grandmother might be
more likely to grok it. Regardless, those are just more symptoms
of the same underlying problem, no? And the "i root" incident was
plausibly a hybrid of error and intercept, so there's a nice hefty
gray area there as well.

I suspect no one missed that..

-danny

Yes, I have observed that people who wear funny clothes with blood
constriction devices wrapped tightly around their necks seem to be
concerned primarily with ass covering theatre.

Risk analysis is ass covering without the theatre. You collect data, make
a judgement based on that data, and if it turns out that the judgement says
that signed bgp updates constitute more of a stability risk to network
operations than the occasional shock problem, then you point these people
with odd dress sense towards the conclusions of this risk analysis report,
having made sure that the conclusions are printed in a 48pt font, with no
more than 2 syllables per word, preferably with a filled circle preceding
each sentence.

It may well be that they will ignore the risk analysis and be more
concerned with the theatre than with data; this happens all the time, an
excellent example being airport security, where security theatre seems to
be considered much more important than actual security. Or it could go the
other way, where risk analysis dictates that sensible precautions be taken,
but they are thrown out for other reasons. A good example here is road
safety, where it would be sensible to speed-limit all cars to 50 km/h and
ban motorbikes and bull-bars; but instead we largely choose to ignore the
risk and accept an attrition rate of 80,000 people every year between
Europe and the US.

Nick

So apply the risk management analogy here. We all know that
pretty much anyone can assert reachability for anyone else's
address space inter-domain on the Internet, in particular the
closer you get to 'the core' the easier this gets. We also
know that route "leaks" commonly occur that result in outages
and the potential for intercept or other nefarious activity.
Additionally, we know that deaggregation and similar events
result in wide-scale systemic effects. We also know that
topologically localized events occur that can impact our reachability,
whether we're party to the actual fault or not. We have a slew of
empirical data to support all of these things, some more high profile
than others, with route leaks likely occurring at the highest
frequency (every single day).

I would suspect that the probability of fire affecting your
network availability is very low, as you can fail over to a
new facility. OTOH, if you have a route hijack (intentional
or not), failover to a new facility with that address space
isn't going to help, and hijacks can be topologically localized
- the same applies for DDoS. Yet I suspect your organization
has invested reasonably in fire suppression systems, while the
asset that matters most - the one that enables the substrate of
the applications and services you care about, namely the
availability of your address space within the global routing
system - has no safeguards whatsoever, and can be impacted from
anywhere in the world.

I'd also venture a guess that we've had more routing issues that
have resulted in network downtime of critical sites than we have had
fires (if someone disproves that, a nice dinner is on me!).

We've got empirical data, we understand the vulnerability and the
risk (probability of a threat being used). Put that in your risk
management equation and consider which of your organization's
assets are most vulnerable - I'd venture it's something to do with
the network, and if routing ain't working, network ain't working...
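
Back-of-the-envelope, with every number below a made-up placeholder
rather than measured data - the only point is that event frequency and
per-event impact both enter the product:

# Toy annualized-loss comparison. Every figure is a fabricated
# placeholder to illustrate the arithmetic, not observed data.
SCENARIOS = {
    # name: (events per year, hours down per event, cost per hour in $)
    "facility fire":       (0.01, 720, 50_000),   # rare, long recovery
    "route leak/hijack":   (4.0,    2, 50_000),   # frequent, short-lived
    "fat-finger announce": (12.0,   1, 50_000),
}

for name, (per_year, hours, cost_per_hour) in SCENARIOS.items():
    ale = per_year * hours * cost_per_hour        # annualized loss expectancy
    print(f"{name:21} ~${ale:,.0f}/year")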

-danny

Dear Danny;

I'd also venture a guess that we've had more routing issues that
have resulted in network downtime of critical sites than we have had
fires (if someone disproves that, a nice dinner is on me!).

But there is also recovery time, which you don't mention in your bet.

If the building I am sitting in right now were to burn to the ground, the
client I am at would be affected for months and months. Yes, they have
backups and redundancy, but this is their HQ.

If they (say) fat-finger their BGP, well, it would be bad, but if they fix
it this afternoon, everything will go back to normal shortly thereafter.

So, sure, network outages may be more frequent than catastrophic fires, but
that doesn't mean that the aggregate duration of disruption from network
outages is greater than the aggregate disruption from fires.

Regards
Marshall

You are right, I forgot that 7007 took more than a day. I distinctly remember being able to use the 'Net later that same day, so I did more than "forget", I actually invented something in my memory.

Moreover, Vinny physically unplugged (data _and_ power) all cables attached to the Bay Networks router that was the source of the problem in very little time - maybe 30 minutes? It was Sprint's custom IOS image, which ignored withdrawals, that made the problem last a very long time. I would say those are two separate problems, but I guess you could argue they are related, and that we should be vigilant against hijacking in case Sean re-enters the field and cons $ROUTER_VENDOR into writing custom code because he's too cheap to upgrade his hardware.

Whichever interpretation of the last two sentences you prefer, having that information is germane to the discussion. Having all the facts allows us to make good decisions based on more than sound-bites and NYT articles.

Of course, then we couldn't post cryptic one-liners trying to scare the newbies with our vast knowledge of historical events, however we spin them. And then where would we be?