Katrina Network Damage Report

As promised, Renesys has released a brief paper on the effects of
Hurricane Katrina as seen from the Internet. We cover the period of
landfall in some detail and also review the recovery efforts.

http://www.renesys.com/resource_library/Renesys-Katrina-Report-9sep2005.pdf

People who are interested should obviously read the report (and I'm
pretty sure it's on-topic, for once! This might be the second
on-topic thread today. Danger!). But highlights include:

--the Internet was fine
--the Gulf Coast wasn't
--Louisiana was hit particularly hard
--many outaged prefixes still haven't been restored, 10 days later

We're happy to take questions on the report, the data, the
methodology, etc.

t.

this report repeatedly uses the term "outage." how is that
determined/measured?

randy

randy,

> this report repeatedly uses the term "outage." how is that
> determined/measured?

i think this is covered in the report several times, but i'm sorry if
it wasn't clear. this is based on work that we've done for a while
(some of which was presented at nanog30:
http://nanog.org/mtg-0402/ogielski.html).

the general idea is: take a large peerset sending you full
routes, keep every update forever, and take a reasonably long (at
least a month or two) time horizon. calculate a consensus view for
each prefix as to whether that prefix is reachable by some set of
those peers. an outaged prefix is one that used to be reachable and
no longer is. in other words, one that has been withdrawn from
the full table by some sufficiently large number of peers.

we exclude single-peer outages and outages that only affect a few
peers through some reasonable thresholding.
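
A minimal sketch of the consensus idea described above, in Python. It is not the Renesys implementation: the function name, the data layout (a dict mapping each prefix to the set of peers that carry a route for it) and both threshold values are assumptions chosen for illustration. The message only says that "some reasonable thresholding" is applied.

```python
from typing import Dict, Set


def outaged_prefixes(
    baseline: Dict[str, Set[str]],
    current: Dict[str, Set[str]],
    min_baseline_peers: int = 3,      # assumed threshold, not from the report
    withdrawn_fraction: float = 0.9,  # assumed threshold, not from the report
) -> Set[str]:
    """Return prefixes that were widely carried in the baseline and are now gone.

    baseline: prefix -> peers that carried it over the long time horizon
    current:  prefix -> peers that carry it now
    """
    outaged = set()
    for prefix, before in baseline.items():
        # Too few peers ever saw this prefix to form a consensus; skip it.
        if len(before) < min_baseline_peers:
            continue
        # Only count peers that used to carry the prefix.
        still_have = current.get(prefix, set()) & before
        lost = len(before) - len(still_have)
        # Outaged only if a large enough share of those peers withdrew it.
        if lost / len(before) >= withdrawn_fraction:
            outaged.add(prefix)
    return outaged


if __name__ == "__main__":
    baseline = {
        "192.0.2.0/24": {"p1", "p2", "p3", "p4"},
        "198.51.100.0/24": {"p1", "p2", "p3", "p4"},
    }
    current = {
        "198.51.100.0/24": {"p1", "p2", "p3"},  # one peer lost it: not an outage
    }
    print(sorted(outaged_prefixes(baseline, current)))
```

In this toy input only 192.0.2.0/24 is reported, because every peer that used to carry it has withdrawn it, while the single-peer loss of 198.51.100.0/24 stays under the threshold.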

make sense? that's the general idea. the implementation is obviously
a *lot* more complicated.

t.

(i'm sure the question of covering prefixes will come up shortly and
i'll address it when/if it does). :-)

but what about existence of covering or more specific prefixes?
while aggregate inferences are likely reasonable, in general,
inferring unreachability of end interfaces by looking only at
routing data, especially multi-hop bgp data, worries me.

randy

randy brings up two separate questions...

> but what about existence of covering or more specific prefixes?
> while aggregate inferences are likely reasonable, in general,

see? i told y'all that this would come up! yes, covering prefixes
count. there are many fewer covering prefixes than most net
geeks would like to believe. there are also many prefixes that appear
(in routing data only) to cover that do not, in fact, provide
forwarding for the more specific prefixes.

a simple analysis that only includes a covering prefix if it has
exactly the same origination pattern (last two ASes maybe) might be
sufficient. still no way to tell, for certain, whether the cover
works.
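
A rough sketch of that "same origination pattern" check, in Python. The function names and the table layout (prefix string mapped to an AS path as a list of AS numbers) are hypothetical, and "last two ASes" is taken from the suggestion above. As the message notes, a match only makes a cover plausible; it cannot prove that the cover actually forwards traffic.

```python
import ipaddress
from typing import Dict, List, Tuple


def origination(as_path: List[int], n: int = 2) -> Tuple[int, ...]:
    """Origination pattern: the last n ASes of the path (origin is last)."""
    return tuple(as_path[-n:])


def effective_covers(
    prefix: str,
    last_known_path: List[int],
    table: Dict[str, List[int]],
) -> List[str]:
    """Less-specific prefixes in table with the same origination pattern.

    prefix:          the withdrawn (outaged) prefix
    last_known_path: the AS path it had before it was withdrawn
    table:           currently routed prefixes -> AS path
    """
    target = ipaddress.ip_network(prefix)
    want = origination(last_known_path)
    covers = []
    for candidate, path in table.items():
        net = ipaddress.ip_network(candidate)
        # subnet_of() raises TypeError on mixed IPv4/IPv6, so filter first.
        if net.version != target.version:
            continue
        is_less_specific = net.prefixlen < target.prefixlen
        if is_less_specific and target.subnet_of(net) and origination(path) == want:
            covers.append(candidate)
    return covers


if __name__ == "__main__":
    # Illustrative table using private-use AS numbers, not real routing data.
    table = {
        "10.0.0.0/8": [64510, 64500, 64501],
        "10.1.0.0/16": [64511, 64502, 64503],
    }
    print(effective_covers("10.1.2.0/24", [64512, 64500, 64501], table))
```

Here 10.1.0.0/16 covers the withdrawn /24 in address space but has a different origination pattern, so only 10.0.0.0/8 is returned.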

our analysis didn't look at covering prefixes, but a spot check of the
outaged prefixes doesn't reveal many. perhaps someone else would like
our list of outaged prefixes to check those for cover?

> inferring unreachability of end interfaces by looking only at
> routing data, especially multi-hop bgp data, worries me.

me, too. that's why we didn't do that.

two issues in this second question:

1) the multi-hop issue is bogus, i believe. i'll ignore it unless randy chooses
to say what he means here.

2) yes, indeed. we chose only to comment on changes in the routing
table as changes in the routing table. inferences about
unreachability of end interfaces is left entirely to the reader
(randy, in this case).

t.

This describes a partitioning, not necessarily an outage.

sean,

Re: From: Todd Underwood <todd@renesys.com>

to quote bobby dylan "you don't need a weatherman to know which
way the wind blows." i.e., unless you were the president, the
department of fatherland security, or fema, you probably knew
there was a major disaster ongoing in nola and surrounds. if
you could read the newspapers, you could even have known of it
in advance.

but, the geolocation stuff is cool. could it have told us, in
an operationally useful/timely manner, that at&t had moved from
new jersey to spain the other day?

> 1) the multi-hop issue is bogus, i believe. i'll ignore it
> unless randy chooses to say what he means here.

maybe use <http://nanog.org/mtg-0210/wang.html>. some citeseer
entries seem a bit mangled, but [0] seems ok.

> 2) yes, indeed. we chose only to comment on changes in the
> routing table as changes in the routing table. inferences
> about unreachability of end interfaces is left entirely to
> the reader

but reachability is what it's all about. the folk here are
paid to deliver packets. the control plane (routing) is one of
the tools we use to achieve that end.

Re: From: George William Herbert <gherbert@retro.com>

> Looking at the routing tables you see failures. If a prefix
> goes away completely and utterly, and is truly unreachable,
> then anyone trying to see it is going to see an outage.

not if a covering or more specific tells us how to get packets
to the destination. but perhaps that's what you mean by a
prefix being unreachable and i am being too picky.
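
The effect of a covering prefix on forwarding can be illustrated with a longest-prefix-match lookup. This is a small Python sketch using the standard `ipaddress` module; `best_route` and the toy table are hypothetical names and data, and the linear scan is for clarity only (real routers use tries or similar structures).

```python
import ipaddress
from typing import Iterable, Optional


def best_route(dest: str, table: Iterable[str]) -> Optional[str]:
    """Return the most specific prefix in table that contains dest, or None."""
    addr = ipaddress.ip_address(dest)
    matches = []
    for prefix in table:
        net = ipaddress.ip_network(prefix)
        if net.version == addr.version and addr in net:
            matches.append(net)
    if not matches:
        return None  # no route at all: unreachable from this table
    return str(max(matches, key=lambda n: n.prefixlen))


if __name__ == "__main__":
    table = {"10.0.0.0/8", "10.1.2.0/24"}
    print(best_route("10.1.2.3", table))  # the /24 wins while it is present
    table.discard("10.1.2.0/24")          # the more specific is withdrawn
    print(best_route("10.1.2.3", table))  # the covering /8 still matches
```

After the /24 is withdrawn, the lookup falls back to the /8, so the table still says where to send the packet. Whether the /8's originator actually delivers it is a separate question that routing data alone does not answer.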

randy

> but reachability is what it's all about. the folk here are
> paid to deliver packets. the control plane (routing) is one of
> the tools we use to achieve that end.

Re: From: George William Herbert <gherbert@retro.com>
> Looking at the routing tables you see failures. If a prefix
> goes away completely and utterly, and is truly unreachable,
> then anyone trying to see it is going to see an outage.

> not if a covering or more specific tells us how to get packets
> to the destination. but perhaps that's what you mean by a
> prefix being unreachable and i am being too picky.

  would that be that -all- your neighbors have no
  information on how to forward that packet, then
  the destination is unreachable.

  what if a neighbor lies about reachability and you
  dump your packets into their "blackhole"?

  that darned policy-constrained routing ick can be
  tough to deal w/...

> randy

--bill (who will return to lurking)

randy, all,

Re: From: Todd Underwood <todd@renesys.com>

> but, the geolocation stuff is cool. could it have told us, in
> an operationally useful/timely manner, that at&t had moved from
> new jersey to spain the other day?

yes, within about 30s. but randy, you should know better than to
think that requires any geolocation. 12/8 didn't move to spain, it
moved to bolivia (AS26210). and since '12956 26210' was a novel
origination pattern for 12/8 (and the other /8s involved), no
geolocation required. simple analysis of bgp updates tells the
story. anyone who can process updates from a large peerset and
compare those to recent routing history or routing policy can report
that as an anomaly.
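
A sketch of how such a check could look, in Python. This is an illustration under assumptions, not the tooling described in the message: the function names, the `(prefix, as_path)` update format and the history paths are made up, and only the prefix 12/8 and the pattern '12956 26210' come from the text above.

```python
from typing import Dict, Iterable, List, Set, Tuple

Origination = Tuple[int, ...]


def origination(as_path: List[int], n: int = 2) -> Origination:
    """Origination pattern: the last n ASes of the path (origin is last)."""
    return tuple(as_path[-n:])


def build_history(
    updates: Iterable[Tuple[str, List[int]]],
) -> Dict[str, Set[Origination]]:
    """Collect every origination pattern seen for each prefix."""
    history: Dict[str, Set[Origination]] = {}
    for prefix, as_path in updates:
        history.setdefault(prefix, set()).add(origination(as_path))
    return history


def is_novel(
    prefix: str,
    as_path: List[int],
    history: Dict[str, Set[Origination]],
) -> bool:
    """True if this origination pattern was never seen for the prefix.

    A prefix with no history at all is also reported as novel.
    """
    return origination(as_path) not in history.get(prefix, set())


if __name__ == "__main__":
    # Hypothetical history with private-use AS numbers, not real data.
    history = build_history([
        ("12.0.0.0/8", [64496, 64497, 64498]),
        ("12.0.0.0/8", [64499, 64497, 64498]),
    ])
    # New announcement ending in the pattern named in the message.
    print(is_novel("12.0.0.0/8", [64510, 12956, 26210], history))
```

A different upstream path that ends in the same two ASes is not flagged, which keeps ordinary path changes from raising alarms while a new origin still stands out.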

> 1) the multi-hop issue is bogus, i believe. i'll ignore it
> unless randy chooses to say what he means here.

> maybe use <http://nanog.org/mtg-0210/wang.html>. some citeseer
> entries seem a bit mangled, but [0] seems ok.

i'm familiar with the presentation, but thanks for citing it. as you
know jim cowie, andy ogielski and bj premore did some related work for
renesys including

http://www.renesys.com/resource_library/renesys-spie2002.pdf
and
http://www.renesys.com/resource_library/Renesys-NANOG23.pdf (linked

this is all interesting work regarding whether bgp session resets
during large-scale worms are the cause of monitoring artifacts or
whether those worms cause instability themselves. there are
differences of opinion about the results, but it's all interesting and
worth reading.

good stuff. but off topic here, i believe.

randy: why do you think resets of multi-hop sessions have anything to
do with these results reporting individual prefix outages in the
Katrina-affected regions? sorry for being slow, but i'm just not
seeing any connection. maybe someone smarter than me can spell it out
in small words for me.

> 2) yes, indeed. we chose only to comment on changes in the
> routing table as changes in the routing table. inferences
> about unreachability of end interfaces is left entirely to
> the reader

> but reachability is what it's all about. the folk here are
> paid to deliver packets. the control plane (routing) is one of
> the tools we use to achieve that end.

yes, of course. prefixes with no entry in a routing table are not
reachable from that device. what i am saying is that we are not
implying that the end interface went down or that the point to point
link between the end user and their provider went down (although both
of these seem likely). we are saying that there was not a routed path
from a consensus of our peers to that prefix. so that is definitely
unreachable from that consensus of those peers.

Re: From: George William Herbert <gherbert@retro.com>
> Looking at the routing tables you see failures. If a prefix
> goes away completely and utterly, and is truly unreachable,
> then anyone trying to see it is going to see an outage.

> not if a covering or more specific tells us how to get packets
> to the destination. but perhaps that's what you mean by a
> prefix being unreachable and i am being too picky.

i think you may be being picky, but i've already admitted that i'm
having trouble following your points. :-)

we can look in more detail at coverings and more specifics. but the
depressing fact of the matter is that there are very few covering
prefixes for many of these that are effective (which i define to mean:
have the same origination pattern). my claim, and anyone with
routeviews/ripe data and a few hundred MB of space can verify this, is
that these prefixes are really and truly outaged. not reachable from
pretty much anywhere on the Internet. i think that at the higher
level, this is probably not as controversial as it seems to be in
nanog so far. :-)

t.