Yahoo outage summary

I put up a diary at the Storm Center
(http://isc.sans.org/diary.html?storyid=3112) that summarizes what we know
about the Yahoo outage on Friday. If anybody has any additional info they
want to share or comments about the write-up please let me know.

Marc

In other words, it was yet another BGP screw-up that secured routing
could have prevented.

Any clue about the root cause, i.e., malice or accident?

    --Steve Bellovin, http://www.cs.columbia.edu/~smb

Yep, soBGP or S-BGP could have prevented this. But that seems to be a
bridge too far right now.

I don't know about the cause - malicious or accidental - perhaps somebody
from Level3 or Hanaro Telecom can explain the rest of the story.

Marc

> I put up a diary at the Storm Center
> (http://isc.sans.org/diary.html?storyid=3112) that summarizes what we
> know about the Yahoo outage on Friday. If anybody has any additional
> info they want to share or comments about the write-up please let me
> know.

> In other words, it was yet another BGP screw-up that secured routing
> could have prevented.

Or using route registries and filters, or any of the other dozen ideas
suggested over the last decade.

> Any clue about the root cause, i.e., malice or accident?

Does it matter? You are screwed either way.

> Yep, soBGP or S-BGP could have prevented this. But that seems to be a
> bridge too far right now.

This seems to have been a route leak from AS9318, not a false origin
announcement, so I'm not sure whether soBGP or S-BGP would help here.

--Ricardo

It tells us what we need to do to prevent such things from happening in
the future. For example, most misconfigurations could be blocked if
all routers matched prefixes against originating ASNs, and it doesn't
matter much if the assertion is digitally signed or not -- all that
matters is that the check is done against some authoritative database
run, say, by the RIRs. (No, that's not quite the right solution, but
it serves to illustrate my point.) That's completely inadequate
against an attacker.

    --Steve Bellovin, http://www.cs.columbia.edu/~smb
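
To make the "match prefixes against originating ASNs" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the REGISTRY table stands in for some authoritative database, the prefixes are documentation ranges, the ASNs are private-use numbers, and check_origin is a hypothetical helper, not anything a router actually runs.

```python
import ipaddress

# Hypothetical stand-in for an authoritative prefix -> origin-AS database.
# Documentation prefixes and private-use ASNs, not real allocations.
REGISTRY = {
    ipaddress.ip_network("192.0.2.0/24"): {64500},
    ipaddress.ip_network("198.51.100.0/24"): {64501},
}

def check_origin(prefix, as_path, registry=REGISTRY):
    """Classify an announcement as 'valid', 'invalid', or 'unknown'.

    as_path is ordered from the neighbor to the originator, so the
    last element is taken as the originating AS.
    """
    net = ipaddress.ip_network(prefix)
    origin = as_path[-1]
    # Collect the allowed-origin sets of every registry entry that
    # covers the announced prefix.
    covering = [
        allowed
        for reg_net, allowed in registry.items()
        if net.version == reg_net.version and net.subnet_of(reg_net)
    ]
    if not covering:
        return "unknown"
    return "valid" if any(origin in allowed for allowed in covering) else "invalid"
```

A wrong origin comes back "invalid", but a path such as [64511, 64510, 64500] for 192.0.2.0/24 still comes back "valid", because only the last hop is checked. That is the limitation Ricardo points at: a leak that keeps the correct origin passes this kind of check.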

If we had routing registries that were accurate and authoritative, then
soBGP/S-BGP would have something to verify a route change against. It
should not matter if last Friday's event was a leak or a false announcement
- with some sort of verification system we could mitigate errors,
intentional or accidental.

Marc

either way, in this case (and a number of other public incidents/outages
in the last 3 years) simple prefix-list application would have resolved
the issue.

Cogent leak via (i think) turk-telecom
NY-Edison leak
9918 leak
this-leak

all would have been prevented with the most simple of steps: "prefix
filter your customers".

While S*BGP seem like they may offer additional protections and additional
knobs to be used for protecting 'us' from 'them', the very basics are
obviously not being done so added complexity is not going to really help
:( Or, perhaps it's not that it's not going to help, it's just not going to
get done because even prefix-lists are 'too hard', apparently.

-Chris
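
As a rough illustration of "prefix filter your customers", the sketch below accepts an announcement from a customer only if it falls inside that customer's agreed prefix list. It is a Python model of the idea, not router configuration; make_customer_filter, the max_len default of /24, and the documentation prefixes are all assumptions, and it only considers IPv4.

```python
import ipaddress

def make_customer_filter(allowed_prefixes, max_len=24):
    """Build an accept(prefix) function for one customer session.

    allowed_prefixes: the prefixes this customer is expected to announce.
    max_len: longest prefix length accepted (an illustrative default).
    """
    allowed = [ipaddress.ip_network(p) for p in allowed_prefixes]

    def accept(prefix):
        net = ipaddress.ip_network(prefix)
        if net.prefixlen > max_len:
            return False  # too specific, drop regardless of who owns it
        # Accept only prefixes inside one of the customer's own blocks.
        return any(
            net.version == a.version and net.subnet_of(a) for a in allowed
        )

    return accept

# Hypothetical customer that is only expected to announce 198.51.100.0/24.
accept = make_customer_filter(["198.51.100.0/24"])
```

With this filter, accept("198.51.100.0/24") is true, while a leaked route for someone else's space, such as accept("203.0.113.0/24"), is false no matter what AS path it arrives with.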

The bad guys will (almost) always say "Oops, it was an accident" while
being very clever at deliberately bypassing every safety feature you
can design into a system.

The foolish guys will (almost) always say "Oops, I didn't know" while
also being very clever at accidentally bypassing every safety feature
you can design into a system.

Unfortunately engineering can't rely on human intentions. Both the evil
and the foolish have the same result. The hope is the foolish will give
up before bypassing the last step, so you keep adding more steps to stop
the fool. The hope is the evilish will go after something easier before
bypassing the last step, so you keep adding more steps to stop the evil
(sic).

As always evil or foolish gals do the same thing as evil or foolish guys.

It's not just IP addresses that exhibit misrouting. But it only
occasionally affects the important or famous enough to attract attention.

http://blog.oregonlive.com/siliconforest/2007/06/rivalry_between_qwest_comcast.html

I agree here. So the most needed thing is some verification system.
I wonder what people think about this: http://www.nanog.org/mtg-0706/osterweil.html
It is something one can start using *now*.
There was virtually no comment after the presentation at the last NANOG.

Lixia

Yep, if the simple steps were implemented and didn't work, then adding
more complex steps may be appropriate. But in the absence of people using
even the simple steps, why do people think adding more complexity will
work better?

The Internet is an on-going example of just-in-time engineering; and fix only when it breaks.

Yes, I know someone will claim Yahoo lost a gazillion dollars due to the
fubared routing. On the other hand, it was fixed in a short amount of
time. While lots of folks have their patent-pending solutions waiting,
are those solutions more cost effective than fixing the occasional
fubared nature of the Internet when it happens?

So far, the people who pay the bills don't think so. And the Department of Homeland Security isn't paying those bills.

> I agree here. So the most needed thing is some verification system.

I keep on coming back to "any proposal that requires universal adoption
and deployment is fatally flawed".

As far as "needing a verification system", is there something deeply
problematic about filtering your customers? It's a fine example of
thinking globally and acting locally.

> Wonder what people think about this: http://www.nanog.org/mtg-0706/osterweil.html
> something one can start using *now*.
> There was virtually no comment after the presentation at last NANOG.

Perhaps that's because there wasn't much to say about it?

I think DNS is already badly overloaded, and adding still more random
crap into it just isn't on my list.

Beyond that, the phrase "leverages concepts from PGP's Web of Trust"
makes my hair stand on end. Perhaps the author(s) are comfortable with
using a pile of disassociated islands with low reliability for confirming
prefix ownership -- I'd rather apply my efforts to something that doesn't
boil down to the equivalent of an educated judgement call.

cheers!

Irrespective of the status of s*BGP deployment, following existing BCPs
with currently-deployed techniques/functionality/features would have
prevented the issue described in the post. s*BGP deployment is a separate
issue, and conflating the two doesn't help.

That's what I'm curious about...this boils down to L3 not properly
filtering Hanaro.

Having recently turned up some L3 connectivity, I was happy to discover
that they can use any of the routing registries as the source of their
prefix filters. They told me the prefix filters are automatically
constructed based on the RR of your choice...update the RR, and their
filters will update that night, no need to bug them. Yay for that, wish
everybody worked that way.

But...why wasn't Hanaro being filtered? If the filters are being
automatically generated, I would think they would just filter all of their
peers regardless of number of prefixes etc.

Andy
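
For readers who have not seen registry-driven filtering, here is a minimal Python sketch of how a per-customer prefix filter might be generated from route objects. The input is a simplified, RPSL-like text format with only "route:" and "origin:" attributes; parse_route_objects, build_filter, the sample data, and the ASNs are assumptions for illustration and do not describe how Level3 or any registry actually does it.

```python
import ipaddress

def parse_route_objects(text):
    """Return (network, origin) pairs from simplified route objects.

    Objects are separated by blank lines; only the 'route' and 'origin'
    attributes are read, everything else is ignored.
    """
    routes = []
    current = {}
    for line in text.splitlines() + [""]:  # trailing "" flushes the last object
        line = line.strip()
        if not line:
            if "route" in current and "origin" in current:
                net = ipaddress.ip_network(current["route"])
                routes.append((net, current["origin"].upper()))
            current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip().lower()] = value.strip()
    return routes

def build_filter(text, customer_as):
    """List the prefixes registered with customer_as as origin."""
    return sorted(
        net for net, origin in parse_route_objects(text) if origin == customer_as
    )

# Hypothetical registry dump using documentation prefixes.
SAMPLE = """\
route:  192.0.2.0/24
origin: AS64500

route:  198.51.100.0/24
origin: AS64501

route:  203.0.113.0/24
origin: AS64500
"""
```

Calling build_filter(SAMPLE, "AS64500") yields the two prefixes registered to that AS. Note that the sketch trusts whatever is in the registry: it makes no attempt to check that the party who registered a route object had the authority to do so.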

> following existing BCPs with currently-deployed
> techniques/functionality/features would have prevented the issue
> described in the post.

knowing that level(3) is one of the most serious deployments of
irr-based route filters and other prudent practices, perhaps we should
wait for a post mortem from level(3) before jumping to conclusions?

randy

I said, 'the issue described in the post'. I've no idea whether the post was accurate or complete in its analysis.

"Wow, prefix-lists are *hard*" -- BGP Barbie..

You'd think that by now, we as an industry could do better than that.

(Yes, I know the jury is still out on what really happened at L3-Hanaro.
Doesn't change the fact that we collectively shoot ourselves in the foot
because providers will believe the most implausible things from their
neighbors, like announcements for 128/1 ;)

> While S*BGP seem like they may offer additional protections and additional
> knobs to be used for protecting 'us' from 'them', the very basics are
> obviously not being done so added complexity is not going to really help
> :( Or, perhaps it's not that it's not going to help, it's just not going to
> get done because even prefix-lists are 'too hard', apparently.

> "Wow, prefix-lists are *hard*" -- BGP Barbie..

shopping anyone?

> You'd think that by now, we as an industry could do better than that.

I think that overall, over a goodly period of time, we are... we
occasionally step on the wrong end of the rake still :(

> (Yes, I know the jury is still out on what really happened at L3-Hanaro.

from some other conversations about this, this seems to be a similar
problem to what happened to NY-Edison about 1.5/2 years ago now
(panix.com route hijackage)... 'auto filter from IRR data' without some
form of checking for proper authority.

Of course, now that I stirred the 'l3 shoulda filtered' pot I should
probably also stir the 'large ISP customers should outbound prefix-filter'
pot. It's very likely that they DO filter outbound, at least to pref
routes from place to place, perhaps twin failures caught them?

:( I think Marcus, Randy, Steve, Lixia all are getting at an underlying
issue: "The interwebs are not as trivial to the world as they once were"
So more strict control and operational due diligence should be on
everyone's plate... At least for basics like making sure the routing system
functions properly going forward.

Anyway, should be interesting to get some more details on what happened if
they are ever to become available.

-Chris

I agree that we need something better but nobody has shown me a better system than prefix lists and IRR that actually *works*.

The simple truth is that prefix lists ARE hard to manage. There are a lot of folks that have complex relationships or don't see why they should register their routes. Some people lack tools and automation to make it work or to manage their networks. It would be nice to see everyone filter routes, including those from even transit and large peers. I don't think we will be able to ignore this forever. I also do not see the status quo changing soon either.

* Valdis Kletnieks:

> (Yes, I know the jury is still out on what really happened at L3-Hanaro.
> Doesn't change the fact that we collectively shoot ourselves in the foot
> because providers will believe the most implausible things from their
> neighbors, like announcements for 128/1 ;)

Well, if L3 creates its filters based on RADB entries (which is still
considered a RR, isn't it?), they will accept a 213/8
announcement. 8-( 128/1 isn't too far away, I fear.
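
The 128/1 and 213/8 examples can be illustrated with a simple prefix-length sanity check. This is a hedged Python sketch: is_plausible and its min_len/max_len bounds are illustrative assumptions, not an agreed operational standard, and it only considers IPv4.

```python
import ipaddress

def is_plausible(prefix, min_len=8, max_len=24):
    """Reject announcements that are implausibly short or overly specific.

    The bounds are illustrative assumptions; real policies differ.
    """
    net = ipaddress.ip_network(prefix)
    return min_len <= net.prefixlen <= max_len
```

With these bounds, is_plausible("128.0.0.0/1") is false but is_plausible("213.0.0.0/8") is true, so a length check alone would not stop the 213/8 case. Catching that one needs a check on whether the registered entry is itself authoritative.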