Yahoo outage summary

Jared Mauch wrote:

> The simple truth is that prefix lists ARE hard to manage.

Medium-hard IMHO. Adding prefixes is relatively easy to implement.
Tracking and removing outdated information is significantly more challenging.
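
To make the "removal is the hard part" point concrete, here's a rough sketch of the kind of check involved: diff the registered prefix set against what is actually seen announced. The data is invented and a real tool would pull both sets from live sources; this just shows the shape of the comparison.

```python
# Hypothetical sketch: flag registered prefixes that no longer appear in
# observed announcements (candidates for removal from a filter), and
# announced prefixes with no registration (candidates to add).
# Both inputs are plain sets of prefix strings; the data is made up.

registered = {"192.0.2.0/24", "198.51.100.0/24", "203.0.113.0/24"}
observed = {"192.0.2.0/24", "203.0.113.0/24"}

stale = sorted(registered - observed)    # registered but never seen
missing = sorted(observed - registered)  # announced but not registered

print("stale:", stale)      # candidates to prune (after human review)
print("missing:", missing)  # candidates to investigate
```

The easy direction (adding) is the second set; the hard direction is deciding when "stale" really means "safe to remove", which is exactly where the human comes back in.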

> Some people lack tools and automation to make it work or to manage their
> networks.

Best I can tell, even the largest transit providers handle prefix list
updates manually.

  Some have automated systems, but they're dependent on IRR data
being correct. There are even tools to automate population of IRR data.

At this stage of history, a human interface is probably necessary in making
a reasonable assessment about the legitimacy of an update request.

  I think here is one of the cruxes of the problem. If it
requires a human, there's a few things that will happen:

  1) prefix-list volume will be too much to be dealt with.
     I see some per-asn prefix lists that would be 255k routes and
     include all sorts of unreasonable junk like /32's

  2) even taking a reasonable network, (in this case, i picked AS286)
     I see 4425 routes. Either you check these all manually (at least
     once), or come up with some way to model it. I currently see 250
     routes in the table with as-path _286_ from my view. Either
     there's a lot of cruft there, or there's a lot of multihomed folks
     where i see a better path. Which is it? Do I have the time to
     crunch this myself?

  3) What about those unique customer relationships? (this is made up)
     Like where ATT buys transit from Cogent for those few prefixes
     in New Zealand they care about? There's always some compelling
     business case to do something wonky. Does this mean that ATT needs
     to register their prefixes in the cogent IRR? How do you keep it
     'quiet' that this is happening, instead of an object saying
     'att priority customer route'? How do you validate these? Even
     the 'big guys' will make policy mistakes once in a while.
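As a rough illustration of point 1, screening a per-asn prefix list for "unreasonable junk" by length alone is easy to sketch; the /24 cutoff below is an assumption (a common IPv4 filtering convention), not a universal rule, and the sample prefixes are invented.

```python
import ipaddress

# Hypothetical sketch: flag prefix-list entries that are more specific
# than a chosen threshold (e.g. /32's in an IPv4 list). The threshold
# is an assumption; real policies vary.

MAX_IPV4_LEN = 24

def too_specific(prefix: str) -> bool:
    """True if an IPv4 prefix is longer than the accepted maximum."""
    net = ipaddress.ip_network(prefix)
    return net.version == 4 and net.prefixlen > MAX_IPV4_LEN

candidates = ["192.0.2.0/24", "198.51.100.17/32", "203.0.113.128/25"]
junk = [p for p in candidates if too_specific(p)]
print(junk)  # the /32 and /25 entries
```

The mechanical check is trivial; the hard part, as the list above says, is the volume and deciding which "junk" entries have a legitimate business reason behind them.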

  There needs to be some 'better-way' IMHO, but my ideas on this
topic have not gotten far enough along for me to put code behind them.
Perhaps I'll need to reprioritize those efforts. It seems to me like
someone could do a cool system that churns through the route-views data, or
if necessary just duplicate part of it by getting lots of bgp feeds and
trying to parse the data.
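The core of the "churn through the table" idea can be sketched in a few lines: given (prefix, AS-path) pairs from a feed, count where an ASN appears anywhere in the path (the _286_ regex case) versus where it is the origin. The sample table is invented; a real tool would parse route-views dumps.

```python
# Hypothetical sketch: distinguish "ASN somewhere in path" from
# "ASN as origin" over a tiny made-up table, the same split you'd
# want when deciding whether _286_ matches are cruft or multihoming.

table = [
    ("192.0.2.0/24",    [3356, 286, 64500]),   # 286 as transit
    ("198.51.100.0/24", [3356, 286]),          # 286 as origin
    ("203.0.113.0/24",  [1299, 64502]),        # 286 not present
]

ASN = 286
in_path = [p for p, path in table if ASN in path]
as_origin = [p for p, path in table if path and path[-1] == ASN]

print(len(in_path), len(as_origin))  # 2 1
```

Paths where the ASN appears only as transit are the ones worth cross-checking against multihoming; the gap between the two counts is the "cruft or better path?" question above.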

  Too bad there's not a good way to do something like dampening on routes
where depending on the age of the announcement and some 'trust' factor you can
assign a series of local-preferences. I'd really like to see something like
this exist. ie: "dampen" the "new" path (even if the prefix is a longer
one) until some timer has ticked (unless some policy criteria are satisfied,
such as same as-path, etc..).
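A minimal policy sketch of that dampening idea, with all thresholds and preference values invented: a new, unfamiliar path gets a low local-preference until it has survived a timer, unless a policy criterion (here, an AS-path already seen for the prefix) exempts it.

```python
import time

# Hypothetical sketch of "dampen the new path": local-preference starts
# low for a new announcement and rises once it has aged past a timer,
# unless the AS-path matches one we already trust. Numbers are made up.

AGE_THRESHOLD = 3600       # seconds a path must survive before trust
LOW_PREF, NORMAL_PREF = 50, 100

def local_pref(first_seen, as_path, known_paths, now=None):
    now = time.time() if now is None else now
    if tuple(as_path) in known_paths:       # same as-path as before: exempt
        return NORMAL_PREF
    if now - first_seen >= AGE_THRESHOLD:   # survived the dampening timer
        return NORMAL_PREF
    return LOW_PREF                         # new and unfamiliar: dampen

known = {(3356, 64500)}
print(local_pref(first_seen=0, as_path=[1299, 64501], known_paths=known, now=60))  # 50
print(local_pref(first_seen=0, as_path=[3356, 64500], known_paths=known, now=60))  # 100
```

Note this deliberately dampens even a longer (more specific) new prefix, which is exactly what makes it interesting against hijacks and exactly what makes it scary against legitimate renumbering.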

  There's also the issue of how to implement this in the existing
router(s), some of them with slower cpus. There's a lot of folks using
older hardware to do bgp that just might melt if they had to evaluate some
huge routing policy.

  - Jared

Building customer filters from the IRR seems like it should fall in the "easy" bucket, given how many people have been doing it, and for how long. It's the lack of a way to trust the data that's published in the IRR that always seems to be the stumbling block.

Various ops-aware people have been attacking the correctness issue in the SIDR working group. The work seems fairly well-cooked to me, and I seem to recall that Geoff Huston has wrapped some proof-of-concept tools around the crypto.

SIDR is only of any widespread use if it is coupled with policy/procedures at the RIRs to provide certificates for resources that are assigned/allocated. However, this seems like less of a hurdle than you'd think when you look at how many RIR staff are involved in working on it.

So, if you consider some future world where suitably machine-readable repositories of number resources (e.g. IRRs) are combined with machine-verifiable certificates affirming a customer's right to use them, how far out of the woods are we? Or are we going to find out that the real problem is some fundamental unwillingness to automate this stuff, or something else?
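The validation step in that future world reduces to something like ROA-style origin checking, roughly in the spirit of the SIDR work: a signed record binds a prefix, a maximum length, and an authorized origin AS. This sketch ignores the crypto entirely and only shows the lookup logic; the record contents are invented.

```python
import ipaddress

# Hypothetical sketch of origin validation against ROA-like records.
# A record authorizes an origin AS for a prefix up to a maximum length.
# Crypto/signature checking is assumed to have happened already.

roas = [
    {"prefix": "192.0.2.0/24", "max_len": 24, "origin_as": 64500},
]

def validate(prefix, origin_as):
    net = ipaddress.ip_network(prefix)
    covered = False
    for roa in roas:
        roa_net = ipaddress.ip_network(roa["prefix"])
        if net.version == roa_net.version and net.subnet_of(roa_net):
            covered = True  # some record covers this prefix
            if origin_as == roa["origin_as"] and net.prefixlen <= roa["max_len"]:
                return "valid"
    return "invalid" if covered else "not-found"

print(validate("192.0.2.0/24", 64500))     # valid
print(validate("192.0.2.0/25", 64500))     # invalid (exceeds max_len)
print(validate("198.51.100.0/24", 64500))  # not-found
```

The three-way outcome matters: "not-found" is the state all the unregistered legacy space would land in, which is where the unwillingness-to-automate question bites.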

Joe

> Some have automated systems, but they're dependent on IRR data
> being correct. There are even tools to automate population of IRR data.

> Building customer filters from the IRR seems like it should fall in the
> "easy" bucket, given how many people have been doing it, and for how long.
> It's the lack of a way to trust the data that's published in the IRR that
> always seems to be the stumbling block.

-- snip --

> So, if you consider some future world where suitably machine-readable
> repositories of number resources (e.g. IRRs) are combined with
> machine-verifiable certificates affirming a customer's right to use
> them, how far out of the woods are we? Or are we going to find out that the
> real problem is some fundamental unwillingness to automate this stuff, or
> something else?

  It's that some folks feel entitled to announce routes without
registering them. Take ANS vs Sprintlink as the classic example. Not
much has changed since then. Nor have the tools evolved significantly.

  Some vendors still don't get router configuration from tools yet.
Try to automate something and it's either not easy or outright impossible.
Even the best solutions on the market have problems when you feed them an
8+ MB config. It takes a lot of cpu time to process that much.

  There really need to be some (ick, ignore that I suggested this)
Web 2.0 IRR tools. Something that can smartly populate an IRR or
IRR-like dataset. Something that can be taught to 'learn' what is
reasonable. I've seen some cool things that show promise (eg: pretty
good bgp), but there's always some interesting drawback.

  Plus, as Patrick said earlier (and i generally agree), these
types of "attacks" are rare and usually short lived. Even those
like the panix situation didn't last very long. Perhaps it's not as
important to think about now.

  - Jared

Going to a model with reasonable and well-defined policies and
procedures is a good thing. However, it renders all the existing IRR
information suspect. Even the RRs run by RIRs are worthless as they
stand. For instance ARIN runs an RR but does no validation of what goes
in there today.

A reasonable approach might be to pick up with tools based on the new
SIDR work and leave the existing IRR info behind.

Tony