Multi-homed clients and BGP timers

Hi all,

I've got numerous single-site 100Mb fibre clients who have backup SDSL
links to my PoP. The two services terminate on separate
distribution/access routers.

The CPE that peers with my fibre router sets a community, and my end sets
the local preference to 150 based on it. The CPE also sets a higher local
preference on prefixes learnt from the fibre router. The SDSL
router-to-CPE session leaves the default preference in place. Both of my
PE routers send default-originate to the CPE. There is (generally) no
traffic that should ever be on the SDSL link while the fibre is up.
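For illustration, the matching on my end looks something like this (a
sketch, not my actual config; the community value, names, AS number, and
neighbor address are made up, Cisco IOS syntax assumed):

```text
! Sketch, assuming Cisco IOS. The CPE tags its routes with a
! community; the fibre PE matches it and raises local preference.
ip community-list standard CPE-FIBRE permit 64762:150
!
route-map FROM-CPE permit 10
 match community CPE-FIBRE
 set local-preference 150
!
router bgp 64600
 neighbor 192.0.2.10 route-map FROM-CPE in
```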

Both of the PE routers then advertise the learnt client route up into
the core:

*>i 208.70.107.128/28   172.16.104.22   0   150   0 64762 i
* i                     172.16.104.23   0   100   0 64762 i

My problem is the noticeable delay for switchover when the fibre happens
to go down (God forbid).

I would like to know if BGP timer adjustment is the way to address this,
or if there is a better/different way. It's fair to say that the fibre
doesn't 'flap'. Based on operational experience, if there is a problem
with the fibre network, it's down for the count.

While I'm at it, I've got another couple of questions:

- whatever technique you might recommend to reduce the convergence
throughout the network, can the same principles be applied to iBGP as well?

- if I need to down core2, what is the quickest and easiest way to
ensure that all gear connected to the cores will *quickly* switch to
preferring core1?

Steve

From experience, I've found that you need to keep the timers in sync with all your peers. Something like this for every peer in your BGP config:

neighbor xxx.xx.xx.x timers 30 60

Make sure this is communicated to your peer as well, so that their timer settings mirror yours.

Zaid

Zaid Ali wrote:

From experience, I've found that you need to keep the timers in sync with all your peers. Something like this for every peer in your BGP config:

neighbor xxx.xx.xx.x timers 30 60

Make sure this is communicated to your peer as well, so that their timer settings mirror yours.

Thankfully, at this point we manage the CPE of all clients who peer with
us, and so far, the clients advertise our own space back to us. I'll go
back to looking at adequate timer settings for my environment.

All it takes is a quick phone call to the client IT people to inform
them that a change will be made, and to ask when they'd prefer I do it
(in the event something goes south). Also thankfully, I'm within a quick
walk/drive of these sites, which I've found to be a comfort during the
last year while I've climbed the BGP learning curve. One of my clients
in particular leaves me with quite a few resources (fibre connections,
hardware) to *test* with between site and PoP :wink:

Cheers, and thanks!

Steve

neighbor xxx.xx.xx.x timers 30 60

Make sure this is communicated to your peer as well, so that their timer settings mirror yours.

Thankfully at this point, we manage all CPE of any clients who peer with
us, and so far, the clients advertise our own space back to us. I'll go
back to looking at adequate timer settings for my environment.

Of course, given that the lowest BGP holdtime is selected
when the session is being established, you don't really need
to change the CPE side; all you need to do is make the
change on the network side and reset the session. It's also
typically a good idea to increase the keepalive frequency
when employing lower holdtimes, so that transient keepalive
loss (or loss of updates, which act as implicit keepalives)
doesn't cause any unnecessary instability.

Also, there are usually global values you can set for all
BGP neighbors in most implementations, as well as the per-peer
configuration illustrated above. The former requires less
configuration bits if you're comfortable with setting the
values globally.

If you want to converge a little faster than BGP holdtimes
allow, and the fiber link is directly between the routers, you
might look at something akin to Cisco's "bgp fast-external-fallover",
which immediately resets the session if the link layer is
reset or lost.
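Something like this (a sketch, assuming Cisco IOS; the AS number is made
up, and since the knob is on by default you'd usually only notice its
negated form in a config):

```text
! Sketch, assuming Cisco IOS. fast-external-fallover immediately
! tears down eBGP sessions to directly connected peers when the
! link to them goes down, instead of waiting for the hold timer.
router bgp 64600
 bgp fast-external-fallover
```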

While I'm at it, I've got another couple of questions:

- whatever technique you might recommend to reduce the convergence
throughout the network, can the same principles be applied to iBGP as well?

Depending on your definition of convergence, yes. If you're
referring to update advertisements as opposed to session or
router failures, though, MRAI tweaks and/or less iBGP hierarchy
might be the way to go. Then again, there are lots of side
effects with these as well.

- if I need to down core2, what is the quickest and easiest way to
ensure that all gear connected to the cores will *quickly* switch to
preferring core1?

Use your IGP's mechanisms, akin to the IS-IS overload bit or
OSPF stub-router (max-metric) advertisement.
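For example (a sketch, assuming Cisco IOS; the OSPF process ID is made
up):

```text
! Sketch, assuming Cisco IOS. Before taking a core down, make the
! IGP steer traffic around it without withdrawing its routes.
router isis
 set-overload-bit
!
! Or, for OSPF: advertise all non-stub links at maximum metric.
router ospf 1
 max-metric router-lsa
```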

-danny

If you want to converge a little faster than BGP holdtimes
allow, and the fiber link is directly between the routers, you
might look at something akin to Cisco's "bgp fast-external-fallover",
which immediately resets the session if the link layer is
reset or lost.

Also worth considering: BFD for BGP, and UDLD, will help identify link failures faster (if all of your equipment supports them; YMMV, etc.).

Deepak

Danny McPherson wrote:

neighbor xxx.xx.xx.x timers 30 60

Make sure that this is communicated to your peer as well so that
their timer setting are reflected the same.

Thankfully at this point, we manage all CPE of any clients who peer with
us, and so far, the clients advertise our own space back to us. I'll go
back to looking at adequate timer settings for my environment.

Of course, given that the lowest BGP holdtime is selected
when the session is being established, you don't really need
to change the CPE side, all you need to do is make the
change on the network side and reset the session. And it's
typically a good idea to set the keepalive interval to a
higher frequency when employing lower holdtimes such that
transient keepalive loss (or updates, which act as implicit
keepalives) don't cause any unnecessary instability.

Also, there are usually global values you can set for all
BGP neighbors in most implementations, as well as the per-peer
configuration illustrated above. The former requires less
configuration bits if you're comfortable with setting the
values globally.

I remember reading that the lowest value is the one used, but thanks for
the reminder. In this case, since I *can* change it at the CPE, I may as
well. That way, in the event that I move on (or get hit by a bus) and
the next person moves the connection to a new router, the CPE will win.

Also... the global setting is a great idea. Unfortunately, connected to
this router that handles these fibre connections are a couple of local
peers that I don't want to change the 'defaults' for.

I can't remember if timers can be set at a peer-group level, so I'll
look that up and go from there. That will be my best option given what
is connected to this router.

If you want to converge a little faster than BGP holdtimes
allow, and the fiber link is directly between the routers, you
might look at something akin to Cisco's "bgp fast-external-fallover",
which immediately resets the session if the link layer is
reset or lost.

Well, unfortunately, the local PUC owns the fibre, and they have a
switch aggregating all of their fibre in a star pattern. They then trunk
the VLANs to me across two redundant pairs. I'm in the process of
persuading them to allow me to put my own gear in their location so I
can manage it myself (no risk of port-monitoring, no risk of their ops
fscking up my clients, etc.). This way, they connect from their
client-facing converter into whatever port on my switch I tell them.

With that said, and as I said before, L3 and below rarely fails. I'll
look into fast-external-fallover. It may be worth it here.

While I'm at it, I've got another couple of questions:

- whatever technique you might recommend to reduce the convergence
throughout the network, can the same principles be applied to iBGP as
well?

Depending on your definition of convergence, yes. If you're
referring to update advertisements as opposed to session or
router failures, though, MRAI tweaks and/or less iBGP hierarchy
might be the way to go. Then again, there are lots of side
effects with these as well.

I suppose I might not completely understand what I am asking.

- pe1 has iBGP peerings with p1 and p2, and pe1 has p2 as its next hop
in the FIB for prefix X (both cores have prefix X in their routing
tables via a different edge device)
- p2 suddenly falls off the network

Perhaps it's late enough on Friday night after a long day for me to not
be thinking correctly, but I can't figure out exactly what the delay
time would be for a client connected to pe1 to re-reach prefix X if p2
goes down hard.

- if I need to down core2, what is the quickest and easiest way to
ensure that all gear connected to the cores will *quickly* switch to
preferring core1?

Use your IGP's mechanisms, akin to the IS-IS overload bit or
OSPF stub-router (max-metric) advertisement.

I will certainly look into your suggestions. I have only a backbone area
in OSPF carrying loopbacks and infrastructure, but don't quite
understand the entire OSPF protocol yet.

Thanks Danny,

Steve

Steve Bertrand wrote:

Well, unfortunately, the local PUC owns the fibre, and they have a
switch aggregating all of their fibre in a star pattern. They then trunk
the VLANs to me across two redundant pairs. I'm in the process of
persuading them to allow me to put my own gear in their location so I
can manage it myself (no risk of port-monitor, no risk of their ops
fscking up my clients etc). This way, they connect from their
client-facing converter into whatever port in my switch I tell them.

Correct me if I'm wrong, but wasn't this exactly the type of situation that BFD was designed to detect and help with?

Jack

Jack Bates wrote:

Steve Bertrand wrote:

Well, unfortunately, the local PUC owns the fibre, and they have a
switch aggregating all of their fibre in a star pattern. They then trunk
the VLANs to me across two redundant pairs. I'm in the process of
persuading them to allow me to put my own gear in their location so I
can manage it myself (no risk of port-monitor, no risk of their ops
fscking up my clients etc). This way, they connect from their
client-facing converter into whatever port in my switch I tell them.

Correct me if I'm wrong, but wasn't this exactly the type of situation
that BFD was designed to detect and help with?

I don't know, but I'm printing it[1] anyway to take home and read. It's
been mentioned a few times, and clearly worth learning about.

Thanks,

Steve

[1] http://bgp.potaroo.net/ietf/all-ids/draft-ietf-bfd-v4v6-1hop-09.txt

For BFD to work, you need:

* ISR + 12.4(15)T (or later)
* 7200 with 12.4T or 12.2SRx
* 7600/6500/GSR + 12.2SRB (or later)
* ASR

A complete list is at the bottom of this document:

http://www.cisco.com/en/US/docs/ios/12_0s/feature/guide/fs_bfd.html

You'll find some more BFD details and usage guidelines here:

http://www.nil.com/ipcorner/bfd/
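To make it concrete, a minimal sketch (assuming a BFD-capable Cisco IOS
image from the list above; the timers, interface, AS number, and
neighbor address are purely illustrative):

```text
! Sketch, assuming a BFD-capable Cisco IOS image. Enable BFD on
! the interface facing the peer, then tie the BGP session to it
! so the session drops shortly after the forwarding path fails.
interface GigabitEthernet0/0
 bfd interval 300 min_rx 300 multiplier 3
!
router bgp 64600
 neighbor 192.0.2.1 fall-over bfd
```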

Best regards
Ivan

http://www.ioshints.info/about
http://blog.ioshints.info/

If you want to converge a little faster than BGP holdtimes
allow, and the fiber link is directly between the routers, you
might look at something akin to Cisco's "bgp
fast-external-fallover", which immediately resets the session
if the link layer is reset or lost.

For fast external fallover, your physical interface has to go down. Inside
your network you could use BGP fast fallover, which drops the BGP session
once the IGP route to the neighbor is lost.

Fast fallover with EBGP multihop is described here:

http://wiki.nil.com/EBGP_load_balancing_with_EBGP_session_between_loopback_interfaces

Ivan

http://www.ioshints.info/about

From experience, I've found that you need to keep the timers in sync with all your peers. Something like this for every peer in your BGP config:

neighbor xxx.xx.xx.x timers 30 60

30 60 isn't a good choice: it means that if a keepalive comes in after 30.1 seconds, the session can expire after 60.0 seconds while the next keepalive wouldn't be there until 60.1 seconds. A hold time of only twice the keepalive interval leaves no room for a single late or lost keepalive.

The other side will typically use hold time / 3 (rounded down) as their keepalive interval. If you set the hold time to something not exactly divisible by three, all three of those keepalives fit within the hold time.

I've often recommended 5 16 in the past, but that's a bit on the short side: some less robust BGP implementations are single-threaded and may not be able to send keepalives every 15 seconds when they're very busy.

The minimum possible hold time is 3.

If you only change the setting at your end, you can raise it again unilaterally when bad stuff happens; if the other end also sets it, you'll have to change it at both ends, as the hold time is negotiated and the lowest value is used.
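The negotiation itself is simple enough to sketch (illustrative Python, not from any real BGP implementation; per RFC 4271 the session uses the smaller of the two proposed hold times, and keepalives are commonly sent at one third of that):

```python
# Sketch of BGP hold-time negotiation (RFC 4271): each side
# proposes a hold time, the smaller one wins, and keepalives are
# typically sent at one third of the negotiated hold time.

def negotiate(proposed_ours: int, proposed_theirs: int) -> tuple[int, float]:
    """Return (negotiated hold time, derived keepalive interval)."""
    hold = min(proposed_ours, proposed_theirs)
    return hold, hold / 3

# We configure "timers 5 16"; the peer leaves the 60/180 default.
hold, keepalive = negotiate(16, 180)
# The session runs with a 16-second hold time and keepalives
# roughly every 5.3 seconds, so three keepalives get a chance to
# arrive before the hold timer expires.
```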

If you really want fast failover, terminate the fiber in the BGP router and make sure fast-external-fallover is on (I think it's the default).

For manual failover, simply shut down the BGP sessions on the router that you don't want to handle traffic at that time. If you have peer groups, you can do "neighbor peergroup shutdown" for the fastest results. Shutting down interfaces is not such a good idea, because then the routing protocols have to time out.
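In IOS terms that would be something like (a sketch; the peer-group name and AS number are made up):

```text
! Sketch, assuming Cisco IOS. Shutting down a peer group drops
! all of its BGP sessions at once; traffic shifts as soon as the
! neighbors process the session teardown.
router bgp 64600
 neighbor CUSTOMERS shutdown
```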

I would agree, BFD is the ideal way to go. I've wanted our upstream
provider to use BFD on our OSPF and iBGP links, but they said they're still
testing it internally. They're quite gun-shy about implementing it because the
existing configuration is stable -- they don't want a new protocol creating
unnecessary failovers. I'm just looking to cut failovers from the existing
12 to 45 seconds (depending on the direction) to a second or two.

Frank

We have customers set up in the same way you do. We only use Cisco (both
PoP routers and managed CPE) and use

neighbor xxx.xxx.xxx.xxx timers 5 15

on the PoP routers with great success. We haven't found any drawbacks so far.

// OK

* Iljitsch van Beijnum:

30 60 isn't a good choice: it means that if a keepalive comes in after
30.1 seconds, the session can expire after 60.0 seconds while the next
keepalive wouldn't be there until 60.1 seconds.

Wouldn't the underlying TCP retry sooner than that?

What's the BCP for BGP timers at exchange points?

I imagine if everyone did something low like 5-15 rather than the default
60-180, the CPU usage increase could be significant given a high number of peers.

Keeping in mind that "bgp fast-external-fallover" is of no use at an
exchange since the fabric is likely to stay up when a peer has gone down,
and BFD would need to be negotiated peer-by-peer, is there a
recommendation other than the default 60-180?

Would going below 60-180 without first discussing it with your peers tend
to piss them off?

Chris

I suspect that, given update messages serve as implicit
keepalives, it's *extremely* rare that an actual keepalive
message is needed in global routing environments.

-danny

Hi Chris,

.-- My secret spy satellite informs me that at Mon, 25 May 2009, Chris Caputo wrote:

Would going below 60-180 without first discussing it with your peers tend
to piss them off?

60-180 is fairly conservative. It's the Cisco default, I believe, while
Juniper's defaults are 30-90. I never pissed anyone off with those :wink:

Cheers,
Andree

For those in multivendor environments, it's also worth being aware that since 7.6R1, JunOS sets the minimum BGP hold timer to 20 seconds. If I were creating a standard timer config to deploy consistently on customer peers (and needed something on the fast side in timer terms), I would need to take that into account.

(And yes, there is of course a way to override the 20s hold timer, but it's not a supported config last time I checked)

j.

Steve Bertrand wrote:

My problem is the noticeable delay for switchover when the fibre happens
to go down (God forbid).

I would like to know if BGP timer adjustment is the way to address this,
or if there is a better/different way. It's fair to say that the fibre
doesn't 'flap'. Based on operational experience, if there is a problem
with the fibre network, it's down for the count.

Thanks to all for the great feedback. In summary, I've learnt:

- Even though BFD would be a fantastic solution and would require only
minimal changes (to my strict uRPF setup), it's a non-starter, as I
don't meet all of the requirements that Ivan pointed out

- fast-external-fallover is already enabled by default, but for it to be
effective, the interface has to physically go into the down state. In my
case, although not impossible, that is extremely unlikely

- adjusting BGP timers is the best option, given it's really the only
one left. Although I generally try to keep consistency among all
equipment (if I set the timers at one end, I would set them the same at
the other), Iljitsch recommended leaving the CPE end alone, so that if
something bad happens, access to the CPE won't be necessary to revert
the change

- I'm going to set the timers to 5/16. I like the idea of the extra
second on top of three keepalive intervals: it ensures that at least
three keepalives have a chance to make it before the hold timer
expires

Cheers!

Steve