BGP Experiment

FRR is undergoing a fairly rapid pace of development, thanks to the cloud-scale operators and hosting providers which are using it in production.

https://cumulusnetworks.com/blog/welcoming-frrouting-to-the-linux-foundation/

We plan to resume the experiments January 16th (next Wednesday), and
have updated the experiment schedule [A] accordingly. As always, we
welcome your feedback.

i did not realize that frr updates propagated so quickly. very cool.

FRR is undergoing a fairly rapid pace of development

that is impressive but irrelevant. the question is how soon the frr
users out on the internet will upgrade. there are a lot of studies on
this. it sure isn't on the order of a week.

randy

Given the severity of the bug, there is a strong incentive for people to upgrade ASAP.

Kind regards,

Job

* Job Snijders

Given the severity of the bug, there is a strong incentive for people to upgrade ASAP.

The buggy code path can also be disabled without upgrading, by building
FRR with the --disable-bgp-vnc configure option, as I understand it.

I've been told that this is the default in Cumulus Linux.
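(For anyone building from source, a minimal sketch of what that looks like follows; the clone URL and the bootstrap step are from memory of FRR's autotools build, so check the project's own build documentation for the full dependency list and recommended options.)

    $ git clone https://github.com/FRRouting/frr.git && cd frr
    $ ./bootstrap.sh
    $ ./configure --disable-bgp-vnc
    $ make && sudo make install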

Tore

+1

9 Jan. 2019, 9:56, Randy Bush <randy@psg.com>:

the question is how soon the frr
users out on the internet will upgrade.
there are a lot of studies on
this. it sure isn’t on the order of a week

Which is, as usual, a pity, because these days syncing a piece of software with upstream security updates less often than once or twice a week belongs in Jurassic Park, and doing it barely more often than once every six months, as ISPs usually do, clearly belongs in a bughouse.

(wonder if this FRR update has got a CVE number though)

Not disputing the bughouse or bog house as the ideal location for said policy; I just
want to explain my perspective on why it is so. SPs are making a
reasonable effort to produce the product that customers want to buy.
Hitless upgrades are not really a thing yet, even though they've been
marketed for 20 years now. Customers have expectations about how often
their links flap, which is mutually exclusive with rapid upgrade
cycles.

And mostly all of this is for show; the code is very broken, all of it.
And the configurations are very broken, all of them. We regularly
break the Internet without trying; BGP parsing crashes are practically a
bi-annual thing. I'm holding, without any motivation or attempt to do
so, a transit packet-of-death for JNPR applicable to ~all JNPR
backbones, and JNPR isn't an outlier here. People happily deploy new
devices which cannot be protected against even trivial (<10 Mbps)
control-plane attacks. The only reason things work as well as they do
is that the bad guys are not trying to DoS the infrastructure with BGP
or packets-of-death; it would be very cheap if someone were so
motivated.

If this is something we think should be fixed, then we should have
good guys intentionally fuzzing _public internet_ BGP and
transit packets-of-death, with good reporting. But likely it doesn't
actually matter at all that the configurations and implementations are
fragile; if they are abused, the Internet will fix them in no more than
days, and trying to guarantee it cannot happen is probably a fool's
errand.

If anything, I suspect that if it's cheaper to enter the market with
inferior security and quality, then that is likely a good business case;
the Internet works so well that consumers are not willing to pay more
for better, but would gladly sacrifice uptime for a cheaper price.

Not disputing the bughouse or bog house as the ideal location for said policy; I just
want to explain my perspective on why it is so.

So, network device vendors releasing security advisories twice a year
isn't a big part of the explanation?

Hitless upgrades are not really a thing yet, even though they've been
marketed for 20 years now.

This is correct; on the flip side, hitless vulnerabilities haven't
even been marketed, much less invented.

The only reason things work as well as they do is that the bad
guys are not trying to DoS the infrastructure with BGP or
packets-of-death

Err... don't they? My experience is quite the opposite.

If this is something we think should be fixed, then we should have
good guys intentionally fuzzing _public internet_ BGP and
transit packets-of-death, with good reporting.

If we could be sure that after such fuzzing there would still be a
working transport infrastructure to report on top of, then yes.

if they are abused, the Internet will fix them in no more than
days

— just like we did with IoT in 2016 —

and trying to guarantee it cannot happen is probably a fool's
errand

If anything, I suspect that if it's cheaper to enter the market with
inferior security and quality, then that is likely a good business case

This is also correct so far. I wonder if it's here to stay.

So if I understand you correctly, your statement is that everyone should be (potentially) rebooting every core, backbone, edge, and other router at least once or twice a week…

To quote Randy Bush… I encourage my competitors to try this.

Owen

So, network device vendors releasing security advisories twice a year
isn't a big part of the explanation?

Those are scheduled; they have to meet some criteria to be pushed into
the scheduled slot. There are also out-of-cycle SIRTs. And yes, vendors
are delaying them, because customers don't want to upgrade often,
because customers' customers don't want to see connections go down often.

Err... don't they? My experience is quite the opposite.

Well, that is an odd experience, considering anyone with a rudimentary
understanding of control-plane policing can bring the Internet down from
a single VPS. The majority of deployed devices _cannot_ be protected
against a DoS-motivated attacker, and I'm not talking link congestion,
I'm talking control-plane congestion with a few Mbps.
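
(For readers who haven't touched this: control-plane policing is usually expressed roughly as below. This is an illustrative IOS-style sketch with placeholder addresses and rates, not a recommendation; the argument above is that on many platforms even a policy like this cannot keep the punt path alive under a few Mbps of crafted traffic.)

! permit BGP only to/from the configured peer, then rate-limit everything
ip access-list extended ACL-CoPP-BGP
 permit tcp host 192.0.2.1 any eq bgp
 permit tcp host 192.0.2.1 eq bgp any
!
class-map match-all CM-CoPP-BGP
 match access-group name ACL-CoPP-BGP
!
policy-map PM-CoPP
 class CM-CoPP-BGP
  police 512000 conform-action transmit exceed-action drop
 class class-default
  police 128000 conform-action transmit exceed-action drop
!
control-plane
 service-policy input PM-CoPP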

If we could be sure that after such fuzzing there would still be a
working transport infrastructure to report on top of, then yes.

If it's important to get right, we should have good guys actively and
persistently trying to prove it wrong; at least then reporting and
statistics can be produced. But I'm not sure it is important to get
right; the market seems to indicate security does not matter.

— just like we did with IoT in 2016 —

Internet still running, I'm still getting paid.

If anything, I suspect that if it's cheaper to enter the market with
inferior security and quality, then that is likely a good business case

This is also correct so far. I wonder if it's here to stay.

We'd need the current security posture to be sufficiently
unmarketable. But the motivation to simply DoS the Internet doesn't
really exist. DoS is aimed at service end points; the infrastructure is
a trivial target, but for some reason not really targeted. I'm sure
state actors have a library of DoS transit packets and BGP UPDATE
packets to be deployed when strategy requires a given network or region
to be disrupted. Because if we, the Internet plumbers, keep finding
those without trying, just by trying to keep the network working, what
can someone find who is funded and motivated to find them?

Nope, this is a misunderstanding. One has to *check* for advisories at
least once or twice a week and only update (and reboot if necessary)
if there *is* a vulnerability.

Checking is quite different from actually updating. What you may
want to encourage your competition to do is to deploy a piece of
software which actually *gets* a severe CVE twice a week; that will
certainly bring you a bunch of new customers.

Those are scheduled; they have to meet some criteria to be pushed into
the scheduled slot. There are also out-of-cycle SIRTs. And yes, vendors
are delaying them, because customers don't want to upgrade often,
because customers' customers don't want to see connections go down often.

Yep. The same happened, e.g., to MSFT products and Adobe Flash
for a decade, before the former started to update in days no
matter what, and before the latter was effectively pushed out of most
market niches.

— just like we did with IoT in 2016 —

Internet still running, I'm still getting paid.

Well, I know a couple of guys who aren't.

But the motivation to simply DoS the Internet doesn't really
exist.

Except for hacktivism, fun, gathering a rep within a cracker society,
gathering a rep within one's middle school community, et cetera. But
anyway,

DoS is aimed at service end points; the infrastructure is a trivial
target, but for some reason not really targeted.

It really is. ISPs don't get hit that frequently for now, but
end-user network services sometimes do.

I'm sure state actors have a library of DoS transit packets and
BGP UPDATE packets to be deployed when strategy requires a
given network or region to be
disrupted.

There's hardly a reason to rely on your next-door neighbor's kid not
chatting on the same darknet forums where those "state actors" get
their data from. The "state actor" thing is highly overrated today. They
are certainly powerful, but hardly more powerful than a skilled team of
anonymous blackhat researchers going in for ransom money.

I think this contains some assumptions:

1. discovering security issues in network devices is expensive (and
thus only those you glean from vendor notices realistically exist)
2. the downside of being affected by a network device security issue is expensive

I'm very skeptical that either is true. I think it's very cheap to find
security issues in network devices, particularly DoS issues. And I
don't think the downside is expensive; maybe it's a bad 4 hours and a lot
of angry customers, but ultimately not that expensive.

I think a lot of this is self-organising, with delay, around rules and
justifications no one understands, and we're not upgrading often
because it's not (currently) a sensible approach.

Well, it's significantly harder to look for vulns in closed-source
firmware which only runs on certain expensive devices. My point is
that FRR, for example, is open-source software which is designed to run
on the same Intel-based systems as the one which probably powers your
laptop.

I've received a note from the FRR devs stating that they're going to get
a CVE number soon. It's a good sign, though it should have happened a
bit before roughly a thousand of this mailing list's subscribers were
informed about the issue, but anyway.

Hey,

firmware which only runs on certain expensive devices. My point is
that FRR, for example, is open-source software which is designed to run
on the same Intel-based systems as the one which probably powers your
laptop.

Most vendors have a virtual image for your laptop; all of the modern
routers run Linux plus some vendor binary blob, with the exception of
Nokia, which runs its own OS (forked off of VxWorks ages ago). Finding
control-plane bugs, like a BGP UPDATE crash, is cheap for a hobbyist:
you can download the images off the Internet and run them on your laptop.
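
(To illustrate what "cheap for a hobbyist" means in practice, here is a minimal Python sketch that opens a BGP session to a *lab* virtual router and feeds it one deliberately malformed UPDATE to see whether the daemon survives. The peer address, ASN, and identifier are placeholders, and it assumes a lab peer configured to accept the session; do not point it at anything you do not operate.)

#!/usr/bin/env python3
# Lab-only sketch: send a BGP OPEN followed by a malformed UPDATE and
# observe whether the peer answers with a NOTIFICATION or simply dies.
import socket
import struct

PEER = "192.0.2.10"            # lab router, placeholder
MY_AS = 64512                  # private ASN, placeholder
BGP_ID = b"\xc0\x00\x02\x01"   # 192.0.2.1 as the BGP identifier, placeholder
MARKER = b"\xff" * 16

def bgp_msg(msg_type: int, body: bytes) -> bytes:
    # 19-byte common header: 16-byte marker, 2-byte total length, 1-byte type
    return MARKER + struct.pack("!HB", 19 + len(body), msg_type) + body

def bgp_open() -> bytes:
    # version 4, my AS, hold time 90 s, BGP identifier, no optional parameters
    return bgp_msg(1, struct.pack("!BHH4sB", 4, MY_AS, 90, BGP_ID, 0))

def malformed_update() -> bytes:
    # Withdrawn-routes length claims far more bytes than the message carries,
    # the kind of length inconsistency a parser must reject gracefully.
    return bgp_msg(2, struct.pack("!H", 0xFFFF))

with socket.create_connection((PEER, 179), timeout=5) as s:
    s.sendall(bgp_open())
    s.sendall(malformed_update())
    try:
        reply = s.recv(4096)
        print(reply or b"<peer closed the session>")
    except socket.timeout:
        print("no reply within timeout -- check whether bgpd is still alive")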

Finding forwarding issues is indeed harder due to the limited access
to devices, so a bit of security through obscurity, I guess.

Or, rather, security by complexity. Today's network infrastructure is
complex enough to discourage people from diving into it to look for all
the underlying issues. Right, it still saves the day for us, though
today's Web JS front end is also quite complex and that is of no help there.

Fair enough, but the frequency of vulnerability announcements, even in some of the best implementations, is still higher than the rate at which I think my customers will tolerate reboots.

At the end of the day, this is really about risk analysis and it helps to put things into 1 of 4 risk quadrants based on two axes… Axis 1 is the likelihood of the vulnerability being exploited, while axis 2 is the severity of the cost/consequences of exploitation.

Obviously something that scores high on both axes will have me rolling out the upgrades as rapidly as possible, likely within 24 hours to at least the majority of the network.

Something that scores low on both axes, conversely, is likely not worth the customer disruption and support call volume (not to mention SLA credits, etc.) that come from doing that level of maintenance on short notice (or without notice).

The other two quadrants are a grey area that becomes more of a judgment call where other factors specific to each operator and their customer profile will come into play.

Some operators may have a high tolerance for high-probability, low-cost problems, while others may find them very urgent, for example.
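
(Laid out as a grid, the four quadrants described above look roughly like this; this is a paraphrase of the preceding paragraphs, not anyone's formal policy:)

                      low likelihood                 high likelihood
  high impact    gray area / judgment call      roll out ASAP (~24 hours)
  low impact     defer to planned maintenance   gray area / judgment call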

Owen

Nope, this is a misunderstanding. One has to *check* for advisories at
least once or twice a week and only update (and reboot if necessary)
if there *is* a vulnerability.

I think this contains some assumptions:

1. discovering security issues in network devices is expensive (and
thus only those you glean from vendor notices realistically exist)

Not really… I think the assumption here is that you can’t resolve an issue until the vendor publishes the fix. Outside of the open-source routing solutions (and even for most deployments, including those), I would say this is a valid assertion. (It’s more of an assertion than an assumption, IMHO).

2. the downside of being affected by a network device security issue is expensive

This depends on the issue, right?

Owen

At the end of the day, this is really about risk analysis
and it helps to put things into 1 of 4 risk quadrants
based on two axes… Axis 1 is the likelihood of the
vulnerability being exploited, while axis 2 is the
severity of the cost/consequences of exploitation.

Obviously something that scores high on both axes
will have me rolling out the upgrades as rapidly as
possible, likely within 24 hours to at least the
majority of the network.

Good for you (not kidding). Not quite the same on average, as far as I can see.

The other two quadrants are a grey area that
becomes more of a judgment call where other
factors specific to each operator and their
customer profile will come into play.
Some operators may have a high tolerance
for high-probability, low-cost problems, while
others may find them very urgent, for example.

I agree with you; however, it's the other quadrant (high cost,
seemingly low probability) which is the real gray area IMO, the one
that allows for collateral damage at a Hollywood-blockbuster scale.

Well, and when I think about it for a second time, I can't help
pointing out that there are long-lived efforts from OS developers,
especially embedded and RTOS developers, to come up with live patching.
As the recent you-know-which downtime has shown us, there are
Internet-based services, like 911 telephony, which are really starting
to treat the Internet as a whole as a real-time system. The question
here is whether this encourages, e.g., the aforementioned FRR developers
(along with the device vendors who actually get paid for uninterruptible
BGP availability) to accept this challenge.