Peering/Transit eBGP sessions - pets or cattle?

Hi,

Would like to take a poll on whether you folks tend to treat your transit/peering connections (BGP sessions in particular) as pets or rather as cattle.

And I appreciate the answer could differ for transit vs peering connections.

However, I’d like to ask this question through a lens of redundant vs non-redundant Internet edge devices.

To explain,

  1. The “pet” case:

Would you rather try to reduce the failure rate of your transit/peering connections by using a resilient control plane (redundant REs/RSPs/RPs), or even by designing these as link bundles spread over separate cards and optical modules?

Is this on the basis that, no matter how hard you try on your end (i.e. distributing your traffic across a multitude of transit and peering connections, or using BFD or even BGP PIC Edge to shuffle things around fast), any disruption to the eBGP session itself will still hurt you in some way (i.e. at least a partial outage for some proportion of the traffic for a not-insignificant period of time) until things converge in the direction from the Internet back to you?

  2. The “cattle” case:

Or would you instead rely on small-ish non-redundant HW at your internet edge rather than trying to enhance MTBF with big chassis full of redundant HW?

Is this because, eventually, the MTBF figure for a particular transit/peering eBGP session boils down to the MTBF of the single card or even the single optical module hosting the link (and as for creating bundles over separate cards - well, you can never be quite sure what the setup looks like on the other end of that connection)?

Or is it because the effect of a smaller/non-resilient border edge device failure is not that bad in your particular (maybe horizontally scaled) setup? (A rough back-of-the-envelope availability comparison is sketched below.)
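
A minimal sketch of that comparison (Python, with purely made-up MTBF/MTTR figures - assumptions for illustration only, not vendor data) pitting one "pet" edge with redundant hardware against two smaller "cattle" boxes covering for each other:

    # Back-of-the-envelope availability comparison; all figures invented.
    # Availability = MTBF / (MTBF + MTTR); failures assumed independent.

    def availability(mtbf_hours: float, mttr_hours: float) -> float:
        return mtbf_hours / (mtbf_hours + mttr_hours)

    # Assumed (made-up) figures:
    big_redundant_chassis = availability(mtbf_hours=200_000, mttr_hours=4)
    small_fixed_box = availability(mtbf_hours=100_000, mttr_hours=4)

    # "Pet": the transit/peering session lives on the one big chassis.
    pet = big_redundant_chassis

    # "Cattle": traffic is spread over two small boxes that can cover for
    # each other; the service survives as long as at least one box is up.
    cattle = 1 - (1 - small_fixed_box) ** 2

    print(f"pet   : {pet:.8f}")
    print(f"cattle: {cattle:.8f}")

Of course this ignores the convergence hit while traffic moves between boxes, which is exactly the part the "pet" camp worries about.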

Would appreciate any pointers.

Thank you

adam

No matter how much money you put into your peering router, the session will be no more stable than whatever the peer did to their end. Plus at some point you will need to reboot due to software upgrade or other reasons. If you care at all, you should be doing redundancy by having multiple locations, multiple routers. You can then save the money spent on each router, because a router failure will not cause any change in what the Internet sees through BGP.

Also, transits are way more important than peers. Losing a transit will cause massive route changes around the globe, and it will take a few minutes to stabilize. Losing a peer usually just means the peer switches to the transit route that they already had available.

Peers are not equal. You may want to ensure redundancy to your biggest peers, while the small fish will be fine without.

To be explicit: Router R1 has connections to transits T1 and T2. Router R2 also has connections to the same transits T1 and T2. When router R1 goes down, only small internal changes at T1 and T2 happen. Nobody notices, and the recovery is sub-second.
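
As a toy illustration of that (a minimal Python sketch; the prefix, next-hops and tie-breaking are made up and merely stand in for the transit's real best-path selection):

    # The transit holds the same customer prefix from both R1 and R2.
    prefix = "192.0.2.0/24"
    paths = {
        "via-R1": {"next_hop": "198.51.100.1", "local_pref": 100},
        "via-R2": {"next_hop": "198.51.100.2", "local_pref": 100},
    }

    def best_path(paths):
        # Stand-in for BGP best-path selection: highest local-pref wins,
        # ties broken arbitrarily here.
        return max(paths, key=lambda p: paths[p]["local_pref"]) if paths else None

    print("before R1 failure:", best_path(paths))

    # R1 goes down: the transit loses one path internally...
    del paths["via-R1"]

    # ...and falls back to the path it already had from R2. Nothing
    # changes in what it announces to its own peers and upstreams.
    print("after R1 failure: ", best_path(paths))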

Peers are less important: R1 has a connection to internet exchange IE1 and R2 to a different internet exchange IE2. When R1 goes down, the small peers at IE1 are lost but will quickly reroute through transit. Large peers may be present at both internet exchanges and so will instantly switch their traffic to IE2.

Regards,

Baldur

Baldur Norddahl
Sent: Monday, February 10, 2020 3:06 PM

No matter how much money you put into your peering router, the session
will be no more stable than whatever the peer did to their end.

Agreed, that's a fair point,

Plus at some
point you will need to reboot due to software upgrade or other reasons.

There are ways of draining traffic for planned maintenance.

If
you care at all, you should be doing redundancy by having multiple
locations, multiple routers. You can then save the money spent on each
router, because a router failure will not cause any change in what the
Internet sees through BGP.

I think a router failure will cause a change in what the Internet sees, as you rightly outlined below:

Also, transits are way more important than peers. Losing a transit
will cause massive route changes around the globe, and it will take a
few minutes to stabilize. Losing a peer usually just means the peer
switches to the transit route that they already had available.

Agreed, and I suppose the question is whether folks tend to try minimizing these impacts by all means possible or just take them as a necessary evil that will eventually happen.

Peers are not equal. You may want to ensure redundancy to your biggest
peers, while the small fish will be fine without.

To be explicit: Router R1 has connections to transits T1 and T2.
Router R2 also has connections to the same transits T1 and T2. When
router R1 goes down, only small internal changes at T1 and T2 happen.
Nobody notices, and the recovery is sub-second.

Good point again.
Though if I had only T1 on R1 and only T2 on R2, then convergence won't happen inside each transit but instead between T1 and T2, which will add to the convergence time.
So, thinking about it, it seems the optimal design pattern in a distributed (horizontally scaled-out) edge would be to try and pair up - i.e. at least two edge nodes per transit (or peer, for that matter) - in order to allow for potentially faster intra-transit convergence rather than arguably slower inter-transit convergence.

adam

Hello Adam,

Would like to take a poll on whether you folks tend to treat your transit/peering connections (BGP sessions in particular) as pets or rather as cattle.

Cattle every day of the week.

I don't trust control-plane resiliency and things like ISSU any
farther than I can throw the big boxes they run on.

The entire network is engineered so that my customers *do not* feel
the loss of one node (*). That is the design principle here, and as
traffic grows and we keep adding more capacity, this is something we
always consider.

How difficult it is to achieve that depends on the particular
situation, and it may be quite difficult in some situations, but not
here.

That is why I can upgrade releases on those nodes (no customers on
them, just transit and peers) quite frequently. I can achieve that with
mostly zero packet loss because of the design and all-around traffic
draining using graceful shutdown and friends. We had quite some issues
draining traffic from nodes in the past (brownouts due to FIB mismatches
between routers, caused by the IP lookup on both the ingress and egress
node with per-VRF label allocation), but since we switched to "per-CE" -
meaning per next-hop - label allocation, things work great.
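
(To illustrate that last point with a toy sketch - Python dictionaries standing in for label tables, not any vendor's actual FIB structures: with a per-VRF label the egress node still has to do its own IP lookup, which can briefly disagree with the ingress node's FIB during a drain, while a per-CE/per-next-hop label resolves directly to a neighbour.)

    # Per-VRF: one aggregate label per VRF; the egress node must do a
    # second IP lookup in that VRF's table, which can mismatch the
    # ingress node's view while routes are being drained.
    per_vrf_labels = {
        3001: {"action": "ip-lookup", "vrf": "INTERNET"},
    }

    # Per-CE (per next-hop): the label alone identifies the egress
    # neighbour, so there is no second lookup and nothing to mismatch.
    per_ce_labels = {
        3101: {"action": "forward", "next_hop": "203.0.113.1"},
        3102: {"action": "forward", "next_hop": "203.0.113.5"},
    }

    def egress_action(label, table):
        entry = table[label]
        if entry["action"] == "ip-lookup":
            return f"second IP lookup in VRF {entry['vrf']}"
        return f"forward straight to {entry['next_hop']}"

    print(egress_action(3001, per_vrf_labels))
    print(egress_action(3101, per_ce_labels))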

On the other hand, a transit with support for graceful shutdown is of
course great, but even if there is no support for it, for maintenance
on your box or your transit's box you still know about the maintenance
beforehand, so you can manually drain your egress traffic (your peer
doesn't have to support RFC 8326 for you to drop YOUR loc-pref to
zero), and many transit providers have some kind of "set loc-pref below
peer" community, which allows you to do basically the same thing
manually without actual RFC 8326 support on the other side. That said,
for ingress traffic, unless you are announcing *A LOT* of routes,
convergence is usually *very* fast anyway.
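
As a rough sketch of that drain logic (Python pseudo-policy; 65535:0 is the well-known GRACEFUL_SHUTDOWN community from RFC 8326, while the provider community value and the route fields are made up for illustration):

    # Manually draining one transit session before maintenance.
    GRACEFUL_SHUTDOWN = "65535:0"   # RFC 8326 well-known community
    PROVIDER_DEPREF = "64500:80"    # made-up "set loc-pref below peer" community

    def drain_inbound(route):
        # Routes received on the session being drained: drop OUR loc-pref
        # to zero so egress traffic shifts to the other exits first.
        route["local_pref"] = 0
        return route

    def drain_outbound(route, peer_supports_rfc8326):
        # Routes announced on the session being drained: tag them so the
        # provider de-prefers them and our ingress traffic moves away too.
        tag = GRACEFUL_SHUTDOWN if peer_supports_rfc8326 else PROVIDER_DEPREF
        route.setdefault("communities", []).append(tag)
        return route

    print(drain_inbound({"prefix": "0.0.0.0/0", "local_pref": 100}))
    print(drain_outbound({"prefix": "192.0.2.0/24"}, peer_supports_rfc8326=False))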

I can see the benefit of having internal HW redundancy for nodes where
customers are connected (shorter maintenance windows, fewer outages in
some single-HW-failure scenarios, overall theoretically better service
uptime), but it never covers everything, and it may just introduce
unnecessary complexity that ends up being the root cause of outages
itself.

Maybe I'm just a lucky fellow, but the hardware has been so reliable
here that I'm pretty sure the complexity of Dual-RSP, ISSU and friends
would have caused more issues over time than what I'm seeing with some
good old and honest HW failures.

Regarding HW redundancy itself: Dual RSP doesn't have any benefit when
the guy in the MMR pulls the wrong fiber, bringing down my transit. It
will still be BGP that has to converge. We don't have PIC today, maybe
this is something to look into in the future, but it isn't something
that internal HW redundancy fixes.

A straightforward and KISS design, where the engineers actually know
"what happens when" and how to do things properly (like draining
traffic), and also, quite frankly, accepting some brownouts for
uncommon events, is the strategy that has worked best for us.

(*) Sure, if the node with 700k best-paths towards a transit dies
non-gracefully (HW or power failure), there will be a brownout of the
affected prefixes for some minutes. But after convergence my network
will be fine and my customers will stop feeling it. They will ask what
happened, and I will be able to explain.

cheers,
lukas

I am assuming R1 and R2 are connected and announcing the same routes. Each transit is therefore receiving the same routes from two independent routers (R1 and R2). When R1 goes down, something internally at the transit will change to reflect that. But peers, other customers at that transit and higher tier transits will see no difference at all. Assuming R1 and R2 both announce a default route internally in your network, your internal convergence will be as fast as your detection of the dead router.

This scheme also protects against link failure or failure at the provider end (if you make sure the transit is also using two routers).

Therefore even if R1 and R2 are in the same physical location, maybe the same rack mounted on top of each other, that is a better solution than one big hunky router with redundant hardware. Having them at different locations is better of course but not always feasible.

Many dual-homed companies may start out with two routers and two transits but without dual links to each transit, as you describe above. That will cause significant disruption if one link goes down. It is not just about convergence between T1 and T2 but for a major part of the internet. Been there, done that - yes, you can be down for up to several minutes before everything is normal again. Assume Tier 1 transits and that contact to T1 was lost. This means T1 will have a peering session with T2 somewhere, but T1 will not allow peer-to-peer traffic to go via that link. All those peers will need to search for a different way to reach you, a way that does not transit T1 (unless they have a contract with T1).

Therefore, if being down for several minutes is not ok, you should invest in dual links to your transits. And connect those to two different routers. If possible with a guarantee the transits use two routers at their end and that divergent fiber paths are used etc.

Regards,

Baldur

Hello Baldur,

Many dual-homed companies may start out with two routers and two
transits but without dual links to each transit, as you describe
above. That will cause significant disruption if one link goes
down. It is not just about convergence between T1 and T2 but for
a major part of the internet. Been there, done that - yes, you can
be down for up to several minutes before everything is normal
again. Assume Tier 1 transits and that contact to T1 was lost.
This means T1 will have a peering session with T2 somewhere,
but T1 will not allow peer-to-peer traffic to go via that link.
All those peers will need to search for a different way to reach
you, a way that does not transit T1 (unless they have a contract
with T1).

Therefore, if being down for several minutes is not ok, you
should invest in dual links to your transits. And connect those
to two different routers. If possible with a guarantee the
transits use two routers at their end and that divergent fiber
paths are used etc.

That is not my experience *at all*. I have always seen my prefixes
converge in a couple of seconds upstream (vs 2 different Tier1's).
That is with a double-digit number of announcements. Maybe if you
announce tens of thousands of prefixes as a large Tier 2, things are
more problematic, that I can't tell. Or maybe you hit some old-school
route dampening somewhere down the path. Maybe there is another reason
for this. But even if 3 AS hops are involved I don't really understand
how they would spend *minutes* to converge after receiving your BGP
withdraw message.

When I saw *minutes* of brownouts in connectivity it was always
because of ingress prefix convergence (or the lack thereof, due to
slow FIB programming, then temporary internal routing loops, nasty
things like that, but never external).

I agree there are a number of reasons (including best convergence) to
have completely diversified connections to a single transit AS.
Another reason is that when you manually reroute traffic for a certain
AS path (say transit 2 has an always-congested PNI towards a third-party
ASN), you may not have an alternative to the congested path when
your other transit provider goes away. But I never saw minutes of
brownout because of upstream -> downstream -> downstream convergence
(or whatever the scenario looks like).

lukas

Therefore, if being down for several minutes is not ok, you
should invest in dual links to your transits. And connect those
to two different routers. If possible with a guarantee the
transits use two routers at their end and that divergent fiber
paths are used etc.

That is not my experience at all. I have always seen my prefixes
converge in a couple of seconds upstream (vs 2 different Tier1’s).

This is a bit old but probably still the case:

https://labs.ripe.net/Members/vastur/the-shape-of-a-bgp-update

Quote: “To conclude, we observe that BGP route updates tend to converge globally in just a few minutes. The propagation of newly announced prefixes happens almost instantaneously, reaching 50% visibility in just under 10 seconds, revealing a highly responsive global system. Prefix withdrawals take longer to converge and generate nearly 4 times more BGP traffic, with the visibility dropping below 10% only after approximately 2 minutes”.

Unfortunately they did not test the case of withdrawal from one router while having the prefix still active at another.

When I saw minutes of brownouts in connectivity it was always
because of ingress prefix convergence (or the lack thereof, due to
slow FIB programming, then temporary internal routing loops, nasty
things like that, but never external).

That is also a significant problem. In the case of a single transit connection per router, two routers and two providers, there will be a lot of internal convergence between your two routers in the case of a link failure. That is also avoided by both routers having the same provider connections. That way a router may still have to invalidate many routes, but there will be no loops, and the router has loop-free alternatives loaded into memory already (to the other provider). Plus you can use the simple trick of having a default route as a fallback.
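
A toy illustration of that default-route trick (Python; the addresses and the longest-prefix-match helper are made up, and real FIB behaviour is of course more involved):

    import ipaddress

    # Default route towards the surviving exit acts as a safety net while
    # the many specific routes are still being invalidated/reprogrammed.
    fib = {
        "0.0.0.0/0": "via-other-router",
        "203.0.113.0/24": "via-T1",   # one of many specifics learned from T1
    }

    def lookup(dst):
        matches = [p for p in fib if ipaddress.ip_address(dst) in ipaddress.ip_network(p)]
        best = max(matches, key=lambda p: ipaddress.ip_network(p).prefixlen)
        return fib[best]

    print(lookup("203.0.113.10"))   # via-T1 while the specific route is valid

    # The T1 link fails; the specific route goes away before everything
    # has reconverged, but traffic simply falls through to the default.
    del fib["203.0.113.0/24"]
    print(lookup("203.0.113.10"))   # via-other-router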

Regards

Baldur

Baldur Norddahl
Sent: Wednesday, February 12, 2020 7:57 PM

> Therefore, if being down for several minutes is not ok, you should
> invest in dual links to your transits. And connect those to two
> different routers. If possible with a guarantee the transits use two
> routers at their end and that divergent fiber paths are used etc.

That is not my experience *at all*. I have always seen my prefixes
converge in a couple of seconds upstream (vs 2 different Tier1's).

This is a bit old but probably still the case:

The Shape of a BGP Update | RIPE Labs

Quote: "To conclude, we observe that BGP route updates tend to
converge globally in just a few minutes. The propagation of newly
announced prefixes happens almost instantaneously, reaching 50%
visibility in just under 10 seconds, revealing a highly responsive
global system. Prefix withdrawals take longer to converge and generate
nearly 4 times more BGP traffic, with the visibility dropping below 10% only after approximately 2 minutes".

Unfortunately they did not test the case of withdrawal from one router
while having the prefix still active at another.

Yes, that's unfortunate.
Although I'm thinking the convergence time would be highly dependent on the first-hop upstream providers involved in the "local repair" for the affected AS - once that is done, it doesn't matter that the whole world still routes traffic for the affected AS towards the original first-hop upstream AS, as long as that AS has a valid detour route.
And I guess the topology of this first-hop outskirt of the affected AS involved in the "local repair" would dictate the convergence time.
E.g. if your upstream A box happens to have a direct (usable) link/session to an upstream B box - winner; however, the higher the number of boxes involved in the "local-repair" detour that need to be told "A no more, now B is the way to go", the longer the convergence time.
- but if a significant portion of the Internet gets the withdraw within 2 minutes, I'm wondering how long it could take for a typical "local-repair" string of BGP speakers to all get the memo.
- but realistically, how many BGP speakers could that be - ranging from a minimum of 2 to a maximum of, say, ~6?
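
A crude way to put numbers on that (Python; the per-hop figures below are pure assumptions - real values depend on MRAI/advertisement-interval settings, update processing load and so on):

    # Rough estimate of "local repair" convergence: every BGP speaker in
    # the detour string has to receive the withdraw/update, rerun best
    # path and propagate it. Per-hop delays are illustrative guesses.

    def local_repair_estimate(speakers, per_hop_processing_s=0.5, per_hop_mrai_s=5.0):
        best = speakers * per_hop_processing_s                      # updates sent immediately
        worst = speakers * (per_hop_processing_s + per_hop_mrai_s)  # every hop waits out its timer
        return best, worst

    for n in (2, 4, 6):
        best, worst = local_repair_estimate(n)
        print(f"{n} speakers: ~{best:.0f}s best case, ~{worst:.0f}s worst case")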
   

When I saw *minutes* of brownouts in connectivity it was always
because of ingress prefix convergence (or the lack thereof, due to
slow FIB programming, then temporary internal routing loops, nasty
things like that, but never external).

That is also a significant problem. In the case of a single transit
connection per router, two routers and two providers, there will be a
lot of internal convergence between your two routers in the case of a
link failure. That is also avoided by both routers having the same provider connections.
That way a router may still have to invalidate many routes, but there
will be no loops, and the router has loop-free alternatives loaded into
memory already (to the other provider). Plus you can use the simple
trick of having a default route as a fallback.

This is a very good point actually; indeed, since the box has two transit sessions, in case of a failure of only one of them it will still retain all the prefixes in the FIB - it will just need to reprogram a few next-hops to point towards the other eBGP/iBGP speakers, whoever offers the best path. And reprogramming next-hops is significantly faster (with hierarchical FIBs, anyway).
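
To sketch why that is (Python; a toy stand-in for a hierarchical FIB, not any particular vendor's implementation): the prefixes all resolve through a shared path-list, so the failure is repaired by rewriting one indirection rather than walking every prefix.

    # Toy hierarchical FIB: prefixes point at a shared path-list entry.
    path_lists = {
        "PL-TRANSIT-A": {"next_hop": "192.0.2.1"},   # primary exit
        "PL-TRANSIT-B": {"next_hop": "192.0.2.2"},   # pre-resolved backup exit
    }

    # Many prefixes, all referencing the same path-list key.
    fib = {f"10.{i}.0.0/16": "PL-TRANSIT-A" for i in range(200)}

    def forward(prefix):
        return path_lists[fib[prefix]]["next_hop"]

    print(forward("10.5.0.0/16"))    # 192.0.2.1

    # Transit A's session fails: rewrite the one shared path-list entry...
    path_lists["PL-TRANSIT-A"] = path_lists["PL-TRANSIT-B"]

    # ...and every prefix now resolves via the backup, no per-prefix churn.
    print(forward("10.5.0.0/16"))    # 192.0.2.2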

adam

This, for us. We pick up transit and peering in multiple cities around the world. The cute boxes at the time were the MX80 and ASR9001. These have since run out of steam for us, and in many sites, our best option was the MX480, as there was no other high-performance, non-redundant device that made sense to us. However, just months after upgrading to the MX480, the MX204 launched. So now, we focus on the MX204 for peering and transit. Our network is so massively distributed that it doesn't make sense to aggregate multiple exchange points or transit providers in a single location. And if a device in one location were to fail, there is sufficient coverage across the backbone to pick up the slack. We also use separate devices for transit and peering. Of course, transit providers (and some exchange points) don't really enjoy this model with us, as they'd like to sell us a multi-site contract, which doesn't make any sense to us.

Mark.

Not in our case, where only 15% of our traffic is handled by our transit
providers.

85% of our traffic comes from peering.

Then again, we have a single connection to each of the major 7 transit
providers, spread across multiple cities. But I appreciate that not many
operators can be in this position.

Mark.