inter-domain link recovery

Hi, folks

I find that link recovery is sometimes very slow when a failure occurs between different ASes. The outage may last hours. In such cases, it seems that the automatic recovery of BGP-like protocols fails and the repair is taken over manually.

We should still remember the Taiwan earthquake in Dec. 2006, which damaged almost all of the submarine cables in the region. Network conditions were quite terrible in the following few days; one might need minutes to load a US web page from Asia. However, two main cables luckily escaped damage. Furthermore, we actually have more routing paths, e.g., from Asia to Europe over the trans-Russia networks of Rostelecom and TransTeleCom. With these redundant paths, the conditions should not have been that horrible.

And here is what I'd like to discuss with you, especially the network operators:
1. Why do BGP-like protocols sometimes fail to recover the path? Is it mainly because of the policies set by ISPs and network operators?

2. What actions will a network operator take when such failures occur? Is it roughly: 1) find alternative path(s); 2) negotiate with other ISPs if needed; 3) modify the policy and reroute the traffic? Which of these actions may be time consuming?

3. There may be more than one alternative path; what criteria does the network operator use to finally select one or some of them?

4. What information is required for a network operator to find the new route?

Thank you.
  
C. Hu

1. Why do BGP-like protocols sometimes fail to recover the path? Is it mainly because of the policies set by ISPs and network operators?

There is an infinitude of possible answers to these questions which have nothing to do with BGP, per se; those answers are very subjective in nature. Can you provide some specific examples (citing, say, publicly available historical BGP tables from route-views, RIPE, et al.) of an instance in which you believe that the BGP protocol itself is the culprit, along with the supporting data which indicate that the prefixes in question should've remained globally (for some value of 'globally') reachable?

Or are these questions more to do with the general provisioning of interconnection relationships, and not specific to the routing protocol(s) in question?

Physical connectivity to a specific point in a geographical region does not equate to logical connectivity to all the various networks in that larger region; SP networks (and customer networks, for that matter) are interconnected and exchange routing information (and, by implication, traffic) based upon various economic/contractual, technical/operational, and policy considerations which vary greatly from one instance to the next. So, the assertion that there were multiple unaffected physical data links to/from Taiwan in the cited instance - leaving aside for the moment whether this was actually the case, or whether sufficient capacity existed in those links to service traffic to/from the prefixes in question - in and of itself has no bearing on whether or not the appropriate physical and logical connectivity was in place in the form of peering or transit relationships to allow continued global reachability of the prefixes in question.

2. What actions will a network operator take when such failures occur? Is it roughly: 1) find alternative path(s); 2) negotiate with other ISPs if needed; 3) modify the policy and reroute the traffic? Which of these actions may be time consuming?

All of the above, and all of the above. Again, it's very situationally dependent.

3. There may be more than one alternative path; what criteria does the network operator use to finally select one or some of them?

Proximate physical connectivity; capacity; economic/contractual, technical/operational, and policy considerations.

4. What information is required for a network operator to find the new route?

By 'find the new route', do you mean a new physical and logical interconnection to another SP?

The following references should help shed some light on the general principles involved:

<http://en.wikipedia.org/wiki/Peering>

<http://www.nanog.org/subjects.html#peering>

<http://www.aw-bc.com/catalog/academic/product/0,1144,0321127005,00.html>

Why do you think BGP was supposed to find the remaining path? Is it possible that the remaining fibers were not owned or leased by the networks in question? Or are you suggesting that any capacity should be available to anyone who "needs" it, whether they pay or not?

BGP cannot find a path that the business rules forbid.
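To make the policy point concrete, here is a minimal illustrative sketch (Python, with made-up AS numbers and relationships - not anyone's actual configuration) of the usual "valley-free" export rule: routes learned from peers or providers are announced only to customers, so a physically present backup path through a peer is simply never offered to the rest of the Internet.

# Hypothetical sketch of a typical no-transit-for-free export policy.
RELATIONSHIPS = {       # how we classify each neighbor AS (example values)
    64500: "customer",
    64510: "peer",
    64520: "provider",
}

def should_export(learned_from: int, export_to: int) -> bool:
    """Return True if a route learned from `learned_from` may be announced
    to `export_to` under a typical valley-free policy."""
    if RELATIONSHIPS[learned_from] == "customer":
        return True                                  # customer routes go to everyone
    # routes learned from peers or providers are only given to our customers
    return RELATIONSHIPS[export_to] == "customer"

# A backup path learned from peer AS 64510 is never announced to peer or
# provider neighbors, so the rest of the Internet cannot use it even though
# the physical path exists.
print(should_export(64510, 64520))   # -> False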

without specific data, everything is guesswork, folklore, and guesswork.
did i say guesswork.

the two biggest causes of long bgp convergence are policy and damping.
policy is hidden and you can only guess and infer. if you are lucky and
have records from folk near the problem (not the incident, but the folk
who can not see it) damping can be seen in places like route views and ris.
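As a rough illustration of the damping side of this (a sketch only, using the commonly cited RFC 2439-style default parameters, which real deployments tune differently): a prefix that flapped a few times during the event can stay suppressed for tens of minutes after it has already come back and stabilized.

# Minimal sketch of why route flap damping can keep a prefix unusable long
# after it has stabilised. Parameter values are commonly cited defaults.
import math

HALF_LIFE = 15 * 60      # seconds; accumulated penalty halves every 15 minutes
REUSE     = 750          # route is readvertised once penalty decays below this
SUPPRESS  = 2000         # route is suppressed once penalty exceeds this
PER_FLAP  = 1000         # penalty added per withdraw/re-announce cycle

def time_until_reuse(flaps: int) -> float:
    """Seconds until a suppressed route becomes usable again, assuming the
    penalty accumulated quickly and then decays exponentially."""
    penalty = flaps * PER_FLAP
    if penalty <= SUPPRESS:
        return 0.0                       # never suppressed in the first place
    return HALF_LIFE * math.log2(penalty / REUSE)

# Three flaps during post-quake instability -> suppressed for roughly another
# 30 minutes after the path has come back and stayed stable.
print(round(time_until_reuse(3) / 60), "minutes")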

can you give specific data? e.g asn42 could not see prefix 666.42/16
being announced by asn96 2007.03.24 at 19:00 gmt.

randy

Thank you for your detailed explanation.

Just suppose there are no business factors (like multiple ASes belonging to the same ISP): is it always possible for BGP to automatically find an alternative path when a failure occurs, if one exists? If not, what may be the causes?

C. Hu

Thank you for the comments. I know there are economic/contractual relationships between networks, and BGP cannot find a path that the business rules forbid. But in these cases, how is connectivity recovered? Do the network operators just wait for the link to be physically repaired, or do they manually configure an alternative path by paying another network for transit service or by finding a peering network?

C. Hu

Barring implementation bugs or network misconfigurations, I've never experienced an operational problem with BGP4 (or OSPF or EIGRP or IS-IS or RIPv2, for that matter) converging correctly due to a flaw in the routing protocol, if that's the gist of the first question. There are many other factors external to the workings of the protocol itself which may affect routing convergence, of course. It really isn't practical to provide a meaningful answer to the second question in a reasonable amount of time; please see the previous reply.

The questions that you're asking essentially boil down to 'How does the Internet work?', or, even more fundamentally, 'How does routing work?'. I would strongly suggest familiarizing oneself with the reference materials cited in the previous reply, as they provide a good introduction to the fundamentals of this topic.

Or they already have sufficient diversity in terms of peering/transit relationships and physical interconnectivity to handle the situation in question - depending upon the situation, of course.

Chengchen Hu wrote:

Thank you for your detailed explanation.

Just suppose there are no business factors (like multiple ASes belonging to the same ISP): is it always possible for BGP to automatically find an alternative path when a failure occurs, if one exists? If not, what may be the causes?

If you have multiple paths to a given prefix in your rib, you're going
to use the shortest one. If it's withdrawn you'll use the next shortest
one. If you have no paths remaining to that prefix, you can't forward
the packet anymore.
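A toy sketch of that behaviour (illustrative prefix and AS numbers only; real BGP best-path selection has many more tie-breakers than AS-path length):

# Keep every learned path to a prefix, prefer the shortest AS path, and fall
# back to the next one when the best is withdrawn.
rib = {
    "192.0.2.0/24": [            # example prefix, example AS paths
        [64501, 64700],          # 2 AS hops - current best
        [64502, 64510, 64700],   # 3 AS hops - backup
    ],
}

def best_path(prefix):
    paths = rib.get(prefix, [])
    return min(paths, key=len) if paths else None    # None -> unreachable

print(best_path("192.0.2.0/24"))            # [64501, 64700]
rib["192.0.2.0/24"].remove([64501, 64700])  # best path withdrawn
print(best_path("192.0.2.0/24"))            # falls back to the longer path
rib["192.0.2.0/24"].clear()                 # all paths withdrawn
print(best_path("192.0.2.0/24"))            # None: packets can no longer be forwarded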

I think, to look back at your original question: you're asking a
specific question about the Dec '06 earthquake outage... The best
people to ask why it took so long to restore are the operators who
were most dramatically affected.

The fact of the matter is most ISPs are not in the business of buying
more diversity than they think they need in order to ensure business
continuity, support SLAs and stay in business. The earthquake and
undersea landslide affected a number of fiber paths over a short period
of time.

I think it's fair to assume that a number of operators have updated
their risk models to account for that sort of threat in the future. It's
hard to totally anticipate the threat of losing ~80% of your fiber
capacity in a rather dense and well-connected corridor.

There were two talks on the subject of that particular event at the
first '07 NANOG; you can peruse them here:

http://www.nanog.org/mtg-0702/topics.html

In particular the second talk discusses the signature of that outage in
the routing table in some detail.

Thank you for the comments. I know there are economic/contractual
relationships between networks, and BGP cannot find a path
that the business rules forbid. But in these cases, how is
connectivity recovered? Do the network operators just wait for
the link to be physically repaired, or do they manually configure
an alternative path by paying another network for transit
service or by finding a peering network?

It sounds like you are asking this question in the context of an
Internet exchange point where you connect to the exchange point, and
then negotiate separate peering agreements with each participant, or a
telecom hotel/data centre. In the exchange point, you could
theoretically have special "INSURANCE" peering agreements where you
don't exchange traffic until there is an emergency, and then you can
quickly turn it on, perhaps using an automated tool. In the data centre,
you could theoretically have a similar sort of agreement that only
requires cross-connect cables to be installed. In fact, you could
already have the cross-connect cables in place, waiting to be plugged in
on your end, or fully plugged in waiting for you to enable the port.
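A hedged sketch of what such an automated tool might look like: the "insurance" sessions are pre-negotiated and pre-configured but left administratively down, and an operator-triggered script turns them up during an emergency. Everything here (hostnames, addresses, AS numbers, the apply_to_router hook) is hypothetical; the actual push would go through whatever provisioning mechanism a given network really uses.

# Hypothetical emergency-activation tool for pre-provisioned peering sessions.
INSURANCE_PEERS = [
    # (router, neighbor IP, neighbor AS) - illustrative values only
    ("edge1.example.net", "198.51.100.7", 64496),
    ("edge2.example.net", "198.51.100.9", 64497),
]

def apply_to_router(router: str, action: str, neighbor: str, asn: int) -> None:
    """Placeholder: push the change via your own provisioning system."""
    print(f"{router}: {action} BGP session to {neighbor} (AS{asn})")

def set_insurance_peering(enabled: bool) -> None:
    action = "enable" if enabled else "disable"
    for router, neighbor, asn in INSURANCE_PEERS:
        apply_to_router(router, action, neighbor, asn)

set_insurance_peering(True)   # emergency declared: turn the spare sessions up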

I wonder if anyone on the list has such INSURANCE peering or transit
arrangements in place?

Given the fact that most providers will go to extra efforts to install
new circuits when there is an emergency like the Taiwan quake, perhaps
there isn't as much value to such insurance arrangements as you might
think.

If we ever get to the point where most circuit connections in the core
are via switched wavelengths, then perhaps BGP will be used to find new
paths when others have failed.

--Michael Dillon

I think everyone here has already covered a lot of the bases to do with your original question (i.e. 'Because kit might not be configured to re-converge in an optimal way'), but if I can summarise the question you are trying to answer as "how can we improve convergence times", you might like to look at the notes that Nate Kushman presented in Toronto this year:

http://www.nanog.org/mtg-0702/kushman.html

Summary
Many studies show that when Internet links go up or down, the dynamics of BGP may cause several minutes of packet loss. The loss occurs even when multiple paths between the sender and receiver domains exist, and is unwarranted given the high connectivity of the Internet. Instead, we would like to ensure that Internet domains stay connected as long as the underlying network is connected.

Andy

I find that link recovery is sometimes very slow when a failure occurs between different ASes. The outage may last hours. In such cases, it seems that the automatic recovery of BGP-like protocols fails and the repair is taken over manually.

We should still remember the Taiwan earthquake in Dec. 2006, which damaged almost all of the submarine cables in the region. Network conditions were quite terrible in the following few days; one might need minutes to load a US web page from Asia. However, two main cables luckily escaped damage. Furthermore, we actually have more routing paths, e.g., from Asia to Europe over the trans-Russia networks of Rostelecom and TransTeleCom. With these redundant paths, the conditions should not have been that horrible.

Please see the presentation I made at AMSIX in May (original version by Todd at Renesys): http://www.thedogsbollocks.co.uk/tech/0705quakes/AMSIXMay07-Quakes.ppt

BGP failover worked fine; much of the instability occurred after the cable cuts as operators found their networks congested and tried to manually change to new, uncongested routes.

(Check slide 4) - the simple fact was that with something like 7 of 9 cables down the redundancy is useless .. even if operators maintained N+1 redundancy (which is unlikely for many operators) that would imply 50% of capacity was actually used with 50% spare .. however we see around 78% of capacity was lost. There was simply too much traffic and not enough capacity .. IP backbones fail pretty badly when faced with extreme congestion.
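Back-of-the-envelope arithmetic for that point, using illustrative numbers: even an operator running at 50% utilisation (full N+1 headroom) ends up more than 2x oversubscribed once ~78% of the physical capacity is gone.

# Illustrative capacity arithmetic only; the 50% and 78% figures come from
# the discussion above, the rest is made up for the example.
normal_capacity = 100.0          # arbitrary units
utilisation     = 0.50           # N+1: half the capacity carrying traffic
capacity_lost   = 0.78           # roughly what the quake took out

offered_load  = normal_capacity * utilisation          # 50 units of traffic
remaining_cap = normal_capacity * (1 - capacity_lost)  # 22 units of capacity

print(f"offered load      : {offered_load:.0f}")
print(f"remaining capacity: {remaining_cap:.0f}")
print(f"oversubscription  : {offered_load / remaining_cap:.1f}x")   # ~2.3x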

And here is what I'd like to discuss with you, especially the network operators:
1. Why do BGP-like protocols sometimes fail to recover the path? Is it mainly because of the policies set by ISPs and network operators?

No, BGP was fine.. this was a congestion issue - ultimately caused by lack of resiliency in cable routes in and out of the region.

2. What actions will a network operator take when such failures occur? Is it roughly: 1) find alternative path(s); 2) negotiate with other ISPs if needed; 3) modify the policy and reroute the traffic? Which of these actions may be time consuming?

Yes, and as the data shows this only made a bad situation worse.. any routes that may have had capacity were soon overwhelmed.

3. There may be more than one alternative path; what criteria does the network operator use to finally select one or some of them?

Pick one that works? But in this case no such option was available.

4. What information is required for a network operator to find the new route?

In the case of a BGP change, presumably the operator checks that the new path appears to function without excessive latency or loss (a traceroute would be a basic way to check).
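A very rough sketch of that kind of sanity check: after a routing change, probe a handful of destinations and flag any whose round-trip time looks far worse than expected. Operators would really use ping/traceroute/looking glasses; the TCP-connect timing below is just a portable stand-in, and the probe targets and threshold are made up.

# Crude post-change path check: time a TCP connect to a few destinations.
import socket, time

PROBES = [("www.example.com", 80), ("www.example.org", 80)]   # illustrative
RTT_ALARM_MS = 400                                            # arbitrary threshold

def connect_rtt_ms(host: str, port: int, timeout: float = 3.0) -> float:
    start = time.monotonic()
    with socket.create_connection((host, port), timeout=timeout):
        pass
    return (time.monotonic() - start) * 1000.0

for host, port in PROBES:
    try:
        rtt = connect_rtt_ms(host, port)
        status = "OK" if rtt < RTT_ALARM_MS else "SLOW - investigate new path"
        print(f"{host}: {rtt:.0f} ms ({status})")
    except OSError as exc:
        print(f"{host}: unreachable ({exc})")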

In terms of a real fix, it can't be done with BGP; you would need to find unused Layer 1 capacity and plug in a new cable. Slides 28-31 show that this occurred, with Asian networks picking up westward paths to Europe, but it took some manual intervention, time, and money.

I think the real question given the facts around this is whether South East Asia will look to protect against a future failure by providing new routes that circumvent single points of failure such as the Luzon Strait at Taiwan. But that costs a lot of money .. so the future's not hopeful!

Steve

I think the real question given the facts around this is
whether South East Asia will look to protect against a future
failure by providing new routes that circumvent single points
of failure such as the Luzon Strait at Taiwan. But that
costs a lot of money .. so the future's not hopeful!

In addition to the existing (fairly new) Rostelecom fiber via Heihe,
there is a new 10G fiber build by China Unicom and the Russian company
TTC. On the Russian side, TTC is a fully owned subsidiary of the Russian
Railways, which means that they have full access to Russia's extensive
rail network rights-of-way. Russia is a huge country, and except for a
small area in the west (known as continental Europe) the rail network is
the main means of transport. It's a bit like the excellent European
railways, except with huge railcars like in North America. I think that
TTC will become the main land route from the Far East into Europe
because of this.

Compare this map of the Trans-Baikal region railroad with the Google
satellite images of the area.
http://branch.rzd.ru/wps/PA_1_0_M1/FileDownload?vp=2&col_id=121&id=9173

The Unicom/TTC project is coming across the Chinese border on the second
spur from the lower right corner. It's actually a cross-border line; the
map just doesn't show the Chinese railways. If you go to the 7th level
of zoom-in on Google Maps, the first Russian town that shows on the
Chinese border (Blagoveshchensk) is where the fibre line will cross.

--Michael Dillon

And then there's the fun of doing actual live failover testing to make
sure it works as intended. (I wonder how many people who multi-home for
outage survival actually *test* their multihoming on a reasonably
regular basis?)

And as Michael noted, this only works in a telecom hotel, where you don't
have to pay for dark fiber from your site to the connection point...

Remember the end-to-end principle. IP backbones don't fail with extreme congestion, IP applications fail with extreme congestion.

Should IP applications respond to extreme congestion conditions better?
Or should IP backbones have methods to predictably control which IP applications receive the remaining IP bandwidth? Similar to the telephone
network special information tone -- All Circuits are Busy. Maybe we've
found a new use for ICMP Source Quench.

Even if the IP protocols recover "as designed," does human impatience mean there is a maximum recovery timeout period before humans start making the problem worse?

Source Quench wouldn't be my favored solution here. What I might suggest is taking TCP SYN and SCTP INIT (or new sessions if they are encrypted or UDP) and putting them into a lower priority/rate queue. Delaying the start of new work would have a pretty strong effect on the congestive collapse of the existing work, I should think.
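A conceptual sketch of the classification step being suggested (schematic only - in practice this would be a QoS policy in router hardware, not Python): packets that start new work (TCP SYN without ACK, SCTP INIT) go to a low-priority/low-rate queue, while established traffic is left alone.

# Classify "new work" packets into a low-priority queue; protocol numbers,
# TCP flag bits and the SCTP INIT chunk type are the standard values.
TCP, SCTP = 6, 132           # IP protocol numbers
SYN, ACK  = 0x02, 0x10       # TCP flag bits
SCTP_INIT = 1                # SCTP chunk type

def queue_for(packet: dict) -> str:
    """Pick a queue for a (pre-parsed) packet represented as a small dict."""
    if packet["proto"] == TCP:
        flags = packet["tcp_flags"]
        if flags & SYN and not flags & ACK:          # new connection attempt
            return "low-priority"
    elif packet["proto"] == SCTP and packet.get("chunk_type") == SCTP_INIT:
        return "low-priority"                        # new association attempt
    return "normal"                                  # established traffic untouched

print(queue_for({"proto": TCP, "tcp_flags": SYN}))          # low-priority
print(queue_for({"proto": TCP, "tcp_flags": ACK}))          # normal
print(queue_for({"proto": SCTP, "chunk_type": SCTP_INIT}))  # low-priority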

I was joking about Source Quench (missing :-); it's got a lot of problems.

But I think the fundamental issue is who is responsible for controlling the back-off process? The edge or the middle?

Using different queues implies the middle (i.e. routers). At best it might be the "near-edge," and creating some type of shared knowledge
between past, current and new sessions in the host stacks (and maybe
middle-boxes like NAT gateways).

How fast do you need to signal large-scale back-off, and over what time
period? Since major events in the real world also result in a lot of
"new" traffic, how do you signal new sessions before they reach the
affected region of the network? Can you use BGP to signal the far
reaches of the Internet that I'm having problems, and that other ASNs
should start slowing things down before they reach my region (security
can-o-worms being opened)?

Hey Sean,

>(Check slide 4) - the simple fact was that with something like 7 of 9
>cables down the redundancy is useless .. even if operators maintained
>N+1 redundancy (which is unlikely for many operators) that would imply
>50% of capacity was actually used with 50% spare .. however we see
>around 78% of capacity was lost. There was simply too much traffic and
>not enough capacity .. IP backbones fail pretty badly when faced with
>extreme congestion.

Remember the end-to-end principle. IP backbones don't fail with extreme
congestion, IP applications fail with extreme congestion.

Hmm I'm not sure about that... a 100% full link dropping packets causes many problems:
L7: Applications stop working, humans get angry
L4: TCP/UDP drops cause retransmits, connection drops, retries etc
L3: BGP sessions drop, OSPF hellos are lost.. routing fails
L2: STP packets dropped.. switching fails

I believe any or all of the above could occur on a backbone which has just failed massively and now has 20% capacity available such as occurred in SE Asia

Should IP applications respond to extreme congestion conditions better?

alert('Connection dropped')
"Ping timed out"

kinda icky, but it's not the application's job to manage the network

Or should IP backbones have methods to predictably control which IP
applications receive the remaining IP bandwidth? Similar to the telephone
network special information tone -- All Circuits are Busy. Maybe we've
found a new use for ICMP Source Quench.

yes and no.. for a private network perhaps, but for the Internet backbone where all traffic is important (right?), differentiation is difficult unless applied at the edge, and when you have a major failure and congestion I don't see what you can do that will have any reasonable effect. Perhaps you are a government contractor and you reserve some capacity for them and drop everything else, but what is really out there as a solution?

FYI I have seen telephone networks fail badly under extreme congestion. COs have small CPUs that don't do a whole lot - set up calls, send busy signals .. once a call is in place it doesn't occupy CPU time as the path is locked in place elsewhere. However, if something occurs to cause a serious number of busy circuits then CPU usage goes through the roof and you can cause cascade failures of whole COs.

Telcos look to solutions such as call gapping to intervene when they anticipate major congestion, rather than relying on the network to handle it.

Even if the IP protocols recover "as designed," does human impatience mean
there is a maximum recovery timeout period before humans start making the
problem worse?

I'm not sure they were designed to do this.. the ARPANET wasn't intended to be massively congested.. the redundant links were in place to cope with loss of a node and usage was manageable.

Steve

I'm more worried about state getting "stuck", kind of like the total inability
of the DHS worry-o-meter to move lower than yellow.

let me answer at least twice.

As you say, remember the end-2-end principle. The end-2-end principle, in my precis, says "in deciding where functionality should be placed, do so in the simplest, cheapest, and most reliable manner when considered in the context of the entire network. That is usually close to the edge." Note the presence of advice and absence of mandate.

Parekh and Gallager in their 1993 papers on the topic proved, using control theory, that if we can specify the amount of data that each session keeps in the network (for some definition of "session") and for each link the session crosses define exactly what the link will do with it, we can mathematically predict the delay the session will experience. TCP congestion control as presently defined tries to manage delay by adjusting the window; some algorithms literally measure delay, while most measure loss, which is the extreme case of delay. The math tells me that the place to control the rate of a session is in the end system. Funny thing, that is found "close to the edge".
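For reference, one commonly quoted form of their end-to-end delay bound, assuming a session constrained by a (sigma, rho) leaky bucket with guaranteed rate g >= rho at each of K WFQ/PGPS hops, maximum session packet size L, and maximum packet size L_max^(m) and link rate r_m at hop m, is (in LaTeX notation):

D^{*} \;\le\; \frac{\sigma}{g} \;+\; \frac{(K-1)\,L}{g} \;+\; \sum_{m=1}^{K} \frac{L_{\max}^{(m)}}{r_m}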

What ISPs routinely try to do is adjust routing in order to maximize their ability to carry customer sessions without increasing their outlay for bandwidth. It's called "load sharing", and we have a list of ways we do that, notably in recent years using BGP advertisements. Where Parekh and Gallager calculated what the delay was, the ISP has the option of minimizing it through appropriate use of routing.

I.e., edge and middle both have valid options, and the totality works best when they work together. That may be heresy, but it's true. When I hear my company's marketing line on intelligence in the network (which makes me cringe), I try to remind my marketing folks that the best use of intelligence in the network is to offer intelligent services to the intelligent edge that enable the intelligent edge to do something intelligent. But there is a place for intelligence in the network, and routing is its poster child.

In your summary of the problem, the assumption is that both of these are operative and have done what they can - several links are down, the remaining links (including any rerouting that may have occurred) are full to the gills, TCP is backing off as far as it can back off, and even so due to high loss little if anything productive is in fact happening. You're looking for a third "thing that can be done" to avoid congestive collapse, which is the case in which the network or some part of it is fully utilized and yet accomplishing no useful work.

So I would suggest that a third thing that can be done, after the other two avenues have been exhausted, is to decide to not start new sessions unless there is some reasonable chance that they will be able to accomplish their work. This is a burden I would not want to put on the host, because the probability is vanishingly small - any competent network operator is going to solve the problem with money if it is other than transient. But from where I sit, it looks like the "simplest, cheapest, and most reliable" place to detect overwhelming congestion is at the congested link, and given that sessions tend to be of finite duration and present semi-predictable loads, if you want to allow established sessions to complete, you want to run the established sessions in preference to new ones. The thing to do is delay the initiation of new sessions.

If I had an ICMP that went to the application, and if I trusted the application to obey me, I might very well say "dear browser or p2p application, I know you want to open 4-7 TCP sessions at a time, but for the coming 60 seconds could I convince you to open only one at a time?". I suspect that would go a long way. But there is a trust issue - would enterprise firewalls let it get to the host, would the host be able to get it to the application, would the application honor it, and would the ISP trust the enterprise/host/application to do so? is ddos possible? <mumble>

So plan B would be to in some way rate limit the passage of TCP SYN/SYN-ACK and SCTP INIT in such a way that the hosed links remain fully utilized but sessions that have become established get acceptable service (maybe not great service, but they eventually complete without failing).