So -- what did happen to Panix?

It's now been 2.5 business days since Panix was taken out. Do we know
what the root cause was? It's hard to engineer a solution until we
know what the problem was.

    --Steven M. Bellovin, http://www.cs.columbia.edu/~smb

Is it really that hard to engineer this solution? We already have
several proposals (S-BGP, soBGP, etc.), and a new WG is likely to be
formed soon within the IETF to finally work it out.

It'd be darn difficult to engineer a solution that would end up being
deployed in any reasonable time if we don't know the requirements
first. Yes, there's a draft -- draft-ietf-rpsec-bgpsecrec-03.txt --
but it has been woefully lacking in operator & deployment
requirements. More people should participate in the effort.

Fortunately, when we know the requirements and engineer a solution, deployment
is straightforward. RFC2827, for example, has a stellar deployment record.

In other words - what is the business case for deploying this proposed
solution? I may be able to get things deployed at $WORK by arguing that
it's The Right Thing To Do, but at most shops an ROI calculation needs
to be attached to get movement....

Exactly. If $OTHER_FOLKS don't deploy it, cases like Panix may not really be avoided.

I think that's what folks proposing perfect -- but practically undeployable -- security solutions are missing.

That is, of course, why I asked the question -- I'm trying to
understand the actual failure modes and feasible fixes. I agree that
many of the solutions proposed thus far are hard to deploy; some
colleagues and I are working on variants that we think are deployable.
But we need data first.

    --Steven M. Bellovin, http://www.cs.columbia.edu/~smb

In terms of the larger question....

ConEd Communications was recently acquired by RCN. I'm not sure if the
transaction has formally closed. I suspect there are serious transition
issues occurring. "Financial Stability", "Employee Churn", and "Ownership"
are, unfortunately, tough things to factor into BGP algorithms.

http://investor.rcn.com/ReleaseDetail.cfm?ReleaseID=181194

Internet access has always been a sideline for CEC - they are more of a
provider of transport, and their customers have included some very well
known entities in the NY metro area.

Perhaps someone from RCN would care to comment?

- Dan

I have no idea if this is really related, but the issue occurred the
same weekend that ConEd had major network maintenance going on. My
ConEd service (NYC area) was down for the entire weekend (about 60
hours) during their planned maintenance window to convert their
network to MPLS. I saw their maintenance notice and noticed that the
window lasted multiple days. I expected the link to go down, but I
never imagined they meant it would stay down for the entire
maintenance window.

So, I'm speculating that even if there weren't organizational issues,
their engineers were probably very busy and distracted by the major
technical changes going on.

Steven, all,

It's now been 2.5 business days since Panix was taken out. Do we know
what the root cause was? It's hard to engineer a solution until we
know what the problem was.

I keep hearing that Con Ed Comm was previously an upstream of Panix
( http://www.renesys.com/blog/2006/01/coned_steals_the_net.shtml#comments )
and that this might have explained why Con Ed had Panix routes in
their RADB as-27506-transit object. But I checked our records
of routing data going back to Jan 1, 2002, and see no evidence of
27506 and 2033 being adjacent to each other in any announcement from
any of our peers at any time since then. So I can't really verify
that Panix was ever a Con Ed Comm customer. Can anyone else clear
this up? So far, it's not making sense.

The supposition was that all of the other affected ASes that are not
currently customers of Con Ed Comm were also previously customers.
Some appear to have been (Walrus Internet (AS7169), Advanced Digital
Internet (AS23011), and NYFIX (AS20282) for sure) but I haven't been
able to verify that all of them were.

I know that this isn't really a "root cause" that Steven was asking
for, though. The root cause is that filtering is imperfect and out of
date frequently. This case is particularly interesting and painful
because Verio is known for building good filters automatically. In
this case, they did so based on out-of-date information,
unfortunately. This is particularly depressing because normally in
cases of leaks like this, the propagation is via some provider or peer
who doesn't filter at all. In this case, one of the vectors was one
of the most responsible filterers on the net. sigh.

So in terms of engineering good solutions, the space is pretty
crowded. One camp is of the "total solution" variety that involves
new hardware, new protocols, and a Public Key approach where
originations (or any announcements) are signed and verified. This is
obviously a very good and complete approach to the problem but it's
also obviously seeing precious little adoption. And in the meantime
we have nothing.
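
To make the shape of that first camp concrete, here is a toy sketch of
a signed-origination check, assuming the third-party Python
'cryptography' package. The real S-BGP/soBGP designs define their own
certificates, attestations, and wire formats; this only illustrates
the general sign-and-verify idea, with 192.0.2.0/24 as a placeholder
prefix.

    # Toy sketch of the "signed origination" idea; NOT the S-BGP or
    # soBGP design.  Assumes the third-party 'cryptography' package.
    from cryptography.exceptions import InvalidSignature
    from cryptography.hazmat.primitives.asymmetric.ed25519 import (
        Ed25519PrivateKey,
    )

    # Hypothetical: the holder of 192.0.2.0/24 has a key pair whose
    # public half is published through some trusted registry/PKI.
    holder_key = Ed25519PrivateKey.generate()
    holder_pub = holder_key.public_key()

    def claim(prefix, origin_as):
        return ("prefix=%s origin-as=%d" % (prefix, origin_as)).encode()

    # The holder signs "AS 2033 may originate 192.0.2.0/24".
    attestation = holder_key.sign(claim("192.0.2.0/24", 2033))

    def origination_is_authorized(prefix, origin_as, sig, pubkey):
        """Check a heard announcement against the signed attestation."""
        try:
            pubkey.verify(sig, claim(prefix, origin_as))
            return True
        except InvalidSignature:
            return False

    print(origination_is_authorized("192.0.2.0/24", 2033, attestation, holder_pub))   # True
    print(origination_is_authorized("192.0.2.0/24", 27506, attestation, holder_pub))  # False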

Another set of approaches has been to look at alternate methods of
building filters, taking into account more information about history
of routing announcements and dampening or refusing to accept novel,
questionable announcements for some fixed, short amount of time. Josh
Karlin's paper suggests that as does some of the stuff that Tom
Scholl, Jim Deleskie and I presented at the last nanog. All of this
has the disadvantage of being a partial solution, the advantage of
being implementable easily and in stages without a network forklift or
a protocol upgrade, but the further disadvantage of being nowhere near
fully baked.
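
To make that flavor concrete, here is a toy sketch of such a check.
This is not Josh Karlin's algorithm nor what was presented at NANOG;
it just illustrates treating a never-before-seen (prefix, origin) pair
as suspect for a fixed period while keeping it usable.

    # Toy history-based check: a (prefix, origin) pair younger than
    # SUSPECT_PERIOD is "suspect"; suspect means deprefer/delay, never drop.
    import time
    from collections import defaultdict

    SUSPECT_PERIOD = 24 * 3600        # distrust a novel origin for a day
    HISTORY_EXPIRE = 30 * 24 * 3600   # forget origins unseen for a month

    history = defaultdict(dict)       # prefix -> {origin_as: (first_seen, last_seen)}

    def classify(prefix, origin_as, now=None):
        """Record an announcement and label it 'ok' or 'suspect'."""
        now = time.time() if now is None else now
        origins = history[prefix]
        # Drop origins we have not heard from in a long time.
        for asn, (first, last) in list(origins.items()):
            if now - last > HISTORY_EXPIRE:
                del origins[asn]
        first, _ = origins.get(origin_as, (now, now))
        origins[origin_as] = (first, now)
        return "suspect" if now - first < SUSPECT_PERIOD else "ok"

    print(classify("192.0.2.0/24", 2033, now=0))        # suspect (first sighting)
    print(classify("192.0.2.0/24", 2033, now=200000))   # ok (established origin)
    print(classify("192.0.2.0/24", 27506, now=200000))  # suspect (novel origin)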

Clearly more, smarter people need to keep searching for good solutions
to this set of problems. Extra credit for solutions that can be
implemented by individual autonomous systems without hardware upgrades
or major protocol changes, but that may not be possible.

t.

p.s.: wrt comments made previously that imply that moving parts of
routing control off of the routers is "Bell-like" or "bell-headed":
although the comments are silly and made somewhat in jest, they're
obviously not true. anyone who builds prefix filters or access lists
off of routers is already generating policy somewhere other than the
router. using additional history or smarts to do that and uploading
prefix filters more often doesn't change that existing architecture or
make the network somehow "bell-like". it might not work well enough
to solve the problem, but that's another, interesting objection.

Disclaimer: I work for AS2914

  This is something that (as i mentioned to you in private) some others
have thought of as well. We at 2914 build the filters and such off the router
and load them to the router with sometimes quite large configurations.
(they have been ~8MB in the past)

  I'd love to see some prefix stability data (eg: 129.250/16
has been announced by origin-as 2914 for X years/seconds/whatnot)
which can help score the data better. Do we need an origin-as match
in our router policies? Does it exist already? What about a way to
dampen/delay announcements that don't match the origin-as data
that exists?
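
  Something along these lines, as a rough sketch -- the table below is
hypothetical apart from 129.250/16 and AS 2914 mentioned above, and a
real implementation would maintain first-seen times from a live BGP
feed or archived routing tables:

    # Rough sketch of a "prefix stability" score: how long has a given
    # (prefix, origin-as) pairing been observed continuously?
    import time

    first_seen = {
        ("129.250.0.0/16", 2914): 1000000000,    # long-lived pairing
        ("129.250.0.0/16", 64500): 1137000000,   # hypothetical newcomer
    }

    def stability_seconds(prefix, origin_as, now=None):
        now = time.time() if now is None else now
        start = first_seen.get((prefix, origin_as))
        return 0 if start is None else now - start

    def preferred_origin(prefix, candidates, now=None):
        """Prefer the origin with the longest continuous history."""
        return max(candidates, key=lambda a: stability_seconds(prefix, a, now))

    print(preferred_origin("129.250.0.0/16", [2914, 64500], now=1137100000))  # 2914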

  I think a solution like this would help out a number of networks
that have these types of problems/challenges. Obviously noticing an
origin change and alerting or similar on that would be nice and useful,
but would the noise be too much for a NOC display?

  - jared

ps. i'm glad our NOC/operations people were able to solve the PANIX
issue quickly for them.

The noise of origin changes is fairly heavy, somewhere in the low
hundreds of alerts per day given a 3 day history window. Supposing a
falsely originated route was delayed, what is the chance of identifying
and fixing it before the end of the delay period? Do operators
commonly catch misconfigurations on their own or do they usually find
out about it from other operators due to service disruption?
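
For anyone who wants to reproduce that kind of number from their own
data, the measurement is roughly the following (a sketch, not the code
actually used): replay time-sorted (time, prefix, origin) observations
and count, per day, the announcements whose origin has not been seen
for that prefix within the previous 3 days.

    # Sketch of counting origin-change "alerts" from a replay of
    # (unix_time, prefix, origin_as) observations.
    from collections import defaultdict

    WINDOW = 3 * 24 * 3600   # 3-day history window

    def count_alerts_per_day(observations):
        """observations: iterable of (unix_time, prefix, origin_as), time-sorted."""
        last_seen = defaultdict(dict)       # prefix -> {origin_as: last seen time}
        alerts_per_day = defaultdict(int)   # day number -> alert count
        for t, prefix, origin in observations:
            seen = last_seen[prefix].get(origin)
            if seen is None or t - seen > WINDOW:
                alerts_per_day[t // 86400] += 1   # origin novel within the window
            last_seen[prefix][origin] = t
        return dict(alerts_per_day)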

Are the origin changes for a small set of the prefixes
that tend to repeat (eg: connexion as planes move), or is it a different
set of prefixes day-to-day or week-to-week?

  I suspect there are the obvious prefixes that don't change
(eg: 12/8, 18/8, 35/8, 38/8), though subparts of those may change; but
for most people with allocations in the range of 12-17 bits, I suspect
they won't change frequently.

  - jared

I unfortunately don't have answers to those questions, but you've
piqued my interest so I will try to look into it within the next
couple of days.

Josh

jared,

i may have missed the answer to my question. but, as verio was
the upstream, and verio is known to use the irr to filter, could
you tell us why that approach seemed not to suffice in this case?

randy

Sure, what I saw by going through the diffs, etc. that I have
available to me is that the prefix was registered to be announced
by our customer and hence made it into our automatic IRR filters. It was
no longer in there by the time that I personally looked things up in
our registry, but I saw diffs go through later in the day (night)
removing that prefix from the ACL.
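
For those who have not seen IRR-based filter generation up close, the
mechanism is roughly the following toy sketch (real networks use tools
like IRRToolSet or irrpt against live IRR mirrors, and the registry
contents below are hypothetical): expand the customer's as-set,
collect the route objects whose origin is a member, and that becomes
the prefix filter -- which is exactly how a stale route object turns
into a permitted announcement.

    # Toy sketch of IRR-driven prefix-filter generation.  Registry
    # contents are hypothetical; real tooling (IRRToolSet, irrpt, etc.)
    # queries live IRR mirrors.
    def expand_as_set(as_set, as_sets):
        """Flatten an as-set (which may contain other as-sets) into ASNs."""
        members, todo, seen = set(), [as_set], set()
        while todo:
            name = todo.pop()
            if name in seen:
                continue
            seen.add(name)
            for m in as_sets.get(name, []):
                if m.upper().startswith("AS-"):
                    todo.append(m)
                else:
                    members.add(m)
        return members

    def build_prefix_filter(customer_as_set, as_sets, route_objects):
        """route_objects: list of (prefix, origin) tuples from the IRR."""
        asns = expand_as_set(customer_as_set, as_sets)
        return sorted(p for p, origin in route_objects if origin in asns)

    # Hypothetical registry: the stale route object is what lets the
    # customer announce someone else's prefix through the automatic filter.
    as_sets = {"AS-CUSTOMER-TRANSIT": ["AS64500", "AS64501"]}
    route_objects = [
        ("192.0.2.0/24", "AS64500"),
        ("198.51.100.0/24", "AS64501"),   # stale: transit relationship ended
    ]
    print(build_prefix_filter("AS-CUSTOMER-TRANSIT", as_sets, route_objects))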

  Someone that has a snapshot of the various IRR data from
those days can likely put this together better than I can explain.

  - jared

All these explanations can only go so far as to show that ConEd
and its upstreams may have had these prefixes as something that was
allowed to be announced (due to previous transit relationships).
However, presumably all of these were transit arrangements with ConEd,
and the IP blocks would have been originated from a different ASN,
whereas during the accident ConEd actually announced the prefixes as
originating from its own ASN.

One thing I can think of is that ConEd started doing synchronization,
so all eBGP routes were redistributed into OSPF or some other IGP.
This could have led to a situation where some previously configured
router that redistributes summarized routes from the IGP into BGP
thought the route needed to be advertised as coming from ConEd and
announced it to Verio. But I think the result of all this should have
been a flapping route (i.e. they start announcing it, then it gets
removed from what they learn from their upstream, so it is no longer
redistributed into the IGP and no longer announced; back to the
beginning), and the routes weren't flapping.

what I saw by going through the diffs, etc. that I have
available to me is that the prefix was registered to be announced
by our customer and hence made it into our automatic IRR filters.

i.e., the 'error' was intended, and followed all process.

so, what i don't see is how any hacks on routing, such as delay,
history, ... will prevent this while not, at the same time, having
very undesired effects on those legitimately changing isps.

seems to me that certified validation of prefix ownership and as
path are the only real way out of these problems that does not
teach us the 42 reasons we use a *dynamic* protocol.

what am i missing here?

randy

seems to me that certified validation of prefix ownership and as
path are the only real way out of these problems that does not
teach us the 42 reasons we use a *dynamic* protocol.

  perhaps you mean certified validation of prefix origin
  and path. Ownership of any given prefix is a dicey concept
  at best.

  as a start, i'd want two things for authentication and integrity
  checks: AS P asserts it is the origin of prefix R and prefix R
  asserts the true origin AS is P (or Q or some list). Being able
  to check these assertions and being assured of the authenticity
  and integrity of the answers goes a long way, at least for me.
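
  a minimal illustration of cross-checking those two assertions
  (matching logic only; authenticating the answers, which is the hard
  part, is not addressed here, and the registry data is hypothetical):

    # Cross-check: AS P asserts it originates prefix R, and prefix R
    # asserts its true origin(s).  Hypothetical registry data below.
    as_claims = {                      # AS P asserts it originates R
        2033:  {"192.0.2.0/24"},
        27506: {"198.51.100.0/24"},
    }
    prefix_claims = {                  # prefix R asserts its origin(s)
        "192.0.2.0/24":    {2033},
        "198.51.100.0/24": {27506},
    }

    def origin_checks_out(prefix, origin_as):
        forward = prefix in as_claims.get(origin_as, set())
        backward = origin_as in prefix_claims.get(prefix, set())
        return forward and backward

    print(origin_checks_out("192.0.2.0/24", 2033))    # True: both assertions agree
    print(origin_checks_out("192.0.2.0/24", 27506))   # False: neither side claims it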

  path validation is something else and a worthwhile goal.
--bill

seems to me that certified validation of prefix ownership and as
path are the only real way out of these problems that does not
teach us the 42 reasons we use a *dynamic* protocol.

Wouldn't a well-operated network of IRRs used by 95% of
network operators be able to meet all three of your
requirements?

-certified prefix ownership
-certified AS path ownership
-dynamic changes to the above two items

It seems to me that most of the pieces needed to do
this already exist: RPSL, IRR software, regional
addressing authorities (RIRs). If there are to be
certified AS paths in a central database this also
opens the door to special arrangements for AS path
routing that go beyond peering, i.e. agreements with
the peers of your peers.

Seems to me that operational problem solving works
better when the problem is not thrown into the laps
of the protocol designers.

--Michael Dillon

Wouldn't a well-operated network of IRRs used by 95% of
network operators be able to meet all three of your
requirements?

-certified prefix ownership
-certified AS path ownership
-dynamic changes to the above two items

It seems to me that most of the pieces needed to do
this already exist: RPSL, IRR software, regional
addressing authorities (RIRs). If there are to be
certified AS paths in a central database this also
opens the door to special arrangements for AS path
routing that go beyond peering, i.e. agreements with
the peers of your peers.

Hasn't that been said for years? Wouldn't perfect IRRs be great? I
couldn't agree more. But in the meantime, why not protect your own
ISP by delaying possible misconfigurations? Our proposed delay does
*not* affect reachability: if the only route left is suspicious, it
will be chosen regardless. If you are changing providers, which takes
a while anyway, just advertise both for a day and you have no problems.
Or, if you are concerned about speed, simply withdraw one and the new
one will have to be used. If you are anycasting the prefix and a new
origin pops up that your view has not seen before, then you might have
a temporary load-balance issue, but there is absolutely no guarantee
of what routers many hops away from you will see anyway.
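
In sketch form, the selection rule being described (a rendering of the
idea, not the proposal's actual code): suspicious routes simply lose
the comparison to non-suspicious ones; they are never dropped, so if
the only route left is suspicious it still gets used.

    # Deprefer-but-never-blackhole: suspicious routes lose the tie-break,
    # but the only remaining route is always usable.
    def best_route(candidates):
        """candidates: dicts like {"next_hop": ..., "suspicious": bool, "pref": int}"""
        if not candidates:
            return None
        return max(candidates, key=lambda r: (not r["suspicious"], r["pref"]))

    routes = [
        {"next_hop": "peer-A", "suspicious": True,  "pref": 200},
        {"next_hop": "peer-B", "suspicious": False, "pref": 100},
    ]
    print(best_route(routes)["next_hop"])      # peer-B: suspect route depreferenced
    print(best_route(routes[:1])["next_hop"])  # peer-A: only route left, still used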

Josh