<Keepalives are temporarily in throttle due to closed TCP window>

I am having difficulty maintaining my BGP session from my 6509 with a
Sup720-3BXL to a 7206 VXR NPE-400. The session bounces every 3
minutes. I do have other IBGP sessions that are established with no
problems; however, this is the only IBGP peer that is bouncing
regularly.

cr1.AUSTTXEE#show ip bgp neighbors 67.214.64.100

<SNIP>

  BGP state = Established, up for 00:02:54

  Last read 00:00:53, last write 00:02:54, hold time is 180, keepalive
interval is 60 seconds

  Keepalives are temporarily in throttle due to closed TCP window

  Neighbor capabilities:

    Route refresh: advertised and received(new)

    Address family IPv4 Unicast: advertised and received

  Message statistics:

What exactly does this message mean, and how do I stabilize this? Any
help will be appreciated.

Michael Ruiz

Network Engineer

Office 210-448-0040

Cell 512-744-3826

mruiz@telwestservices.com

"I don't measure a man's success by how high he climbs but how high he
bounces when he hits bottom."

- General George S. Patton Jr.

How am I doing? Please email my Director of Engineering Jared Martin
with any feedback at: jmartin@telwestservices.com


This is most likely an MTU problem. Your SYN/SYN+ACK goes thru, but then the first fullsize MSS packet is sent, and it's not getting to the destination. 3 minutes is the dead timer for keepalives, which are not getting thru either because of the stalled TCP session.
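For what it's worth, you can usually see the stall directly from the 6509 with
something like the following (the exact fields vary a bit by IOS release, and
the TCB handle is just whatever show tcp brief gives you for the
67.214.64.100 connection):

show tcp brief
! take the TCB handle for the 67.214.64.100 session from the output above;
! a send window stuck at 0 plus climbing retransmits is the "closed TCP
! window" stall described above
show tcp tcb <TCB-from-show-tcp-brief>
! and this shows what MSS the session actually negotiated
show ip bgp neighbors 67.214.64.100 | include max data segment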

* Mikael Abrahamsson:

What exactly does this message mean, and how do I stabilize this? Any
help will be appreciated.

This is most likely an MTU problem.

Does IOS enable PMTUD for BGP sessions by default these days? The 476
(or something like that) MTU is unlikely an issue. There could be a
forwarding bug which causes drops dependent on packet size, though.

* Mikael Abrahamsson:

What exactly does this message mean, and how do I stabilize this? Any
help will be appreciated.

This is most likely an MTU problem.

Does IOS enable PMTUD for BGP sessions by default these days? The 476
(or something like that) MTU is unlikely an issue. There could be a
forwarding bug which causes drops dependent on packet size, though.

I am not sure. I think it is, but I went ahead and put in the command
manually.

Here is more of the configuration having to do with TCP:

ip tcp selective-ack
ip tcp window-size 65535
ip tcp synwait-time 10
ip tcp path-mtu-discovery

Every time I turn those on (plus timestamping), it breaks something. The
last time I tried it broke ftp based transfers of new IOS, had to
disable or use tftp to get a non-corrupted image (SRA). The time before
that, it occasionally caused bgp keepalives to be missed and thus
dropped the session (SXF). It may work now, or there may be more subtle
Cisco bugs lurking, who knows. :)

You can confirm what MSS is actually being used in show ip bgp neighbor,
under the "max data segment" line. I believe in modern code there is a
way to turn on pmtud for all bgp neighbors (or individual ones) which
may or may not depend on the global ip tcp path-mtu-discovery setting. I
don't recall off the top of my head, but you should be able to confirm
what size messages you're actually trying to send. FWIW I've run
extensive tests on BGP with > 9000 byte MSS (though numbers that large
are completely irrelevant, since bgp's maximum message size is 4096
bytes) and never hit a problem. I once saw a bug where Cisco
miscalculated the MSS when doing tcp md5 (off by the number of bytes
that the tcp option would take, I forget which direction), but I'm sure
that's fixed now too. :)
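If your code has it (treat this as a rough sketch -- the knob only shows up
in newer trains, and <your-asn> is obviously a placeholder), the per-neighbor
version looks something like:

router bgp <your-asn>
 neighbor 67.214.64.100 transport path-mtu-discovery
!
! then check what TCP actually negotiated:
show ip bgp neighbors 67.214.64.100 | include max data segment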

Every time I turn those on (plus timestamping), it breaks something. The
last time I tried it broke ftp based transfers of new IOS, had to
disable or use tftp to get a non-corrupted image (SRA). The time before
that, it occasionally caused bgp keepalives to be missed and thus
dropped the session (SXF). It may work now, or there may be more subtle
Cisco bugs lurking, who knows. :)

I tried that, no dice. I thought it would actually work.

You can confirm what MSS is actually being used in show ip bgp neighbor,
under the "max data segment" line. I believe in modern code there is a
way to turn on pmtud for all bgp neighbors (or individual ones) which
may or may not depend on the global ip tcp path-mtu-discovery setting. I
don't recall off the top of my head, but you should be able to confirm
what size messages you're actually trying to send. FWIW I've run
extensive tests on BGP with > 9000 byte MSS (though numbers that large
are completely irrelevant, since bgp's maximum message size is 4096
bytes) and never hit a problem. I once saw a bug where Cisco
miscalculated the MSS when doing tcp md5 (off by the number of bytes
that the tcp option would take, I forget which direction), but I'm sure
that's fixed now too. :)

Below is a snapshot of the neighbor in question.

Datagrams (max data segment is 4410 bytes):
Rcvd: 6 (out of order: 0), with data: 4, total data bytes: 278

Could the problem be that the total data bytes exceed the max data
segment size?

Below is the router (7206 NPE-400) I am trying to establish the BGP
session with.

<snip>
Description: cr1.AUSTTXEE
Member of peer-group TelWest-iBGP for session parameters
  BGP version 4, remote router ID 67.214.64.97
  BGP state = Established, up for 00:00:02
  Last read 00:00:02, hold time is 180, keepalive interval is 60 seconds
  Neighbor capabilities:
    Route refresh: advertised and received(old & new)
    Address family IPv4 Unicast: advertised and received
  Message statistics:
<snip>

Datagrams (max data segment is 4410 bytes):
Rcvd: 4 (out of order: 0), with data: 1, total data bytes: 64
bytes: 259
cr2.CRCHTXCB#

The maximum BGP message size is 4096 and there is no padding, so you
would need a heck of a lot of overhead to get another 300+ bytes on
there. I'd say the answer is no, unless you're running this over MPLS
over GRE over MPLS over IPSec over MPLS over... well, you get the
picture. :)

It's possible that your link isn't actually capable of passing 4096-ish
byte packets for whatever reason. A quick way to validate or eliminate
that theory is to do some pings from the router with different size
payloads, sourced from your side of the /30 and pinging the far side,
and using the df-bit to prevent fragmentation. Failing that, make sure
you aren't doing anything stupid with your control plane policers;
maybe try turning those off to see if there is an improvement.
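Something along these lines, with the /30 addresses filled in (the one-line
size/df-bit/source keywords need reasonably recent IOS; on older code use the
interactive extended ping and answer yes to the "Set DF bit" prompt):

! sweep upward; if 1500 works but anything bigger dies silently, there's
! your blackhole
ping <far side of the /30> size 1500 df-bit source <your side of the /30>
ping <far side of the /30> size 1501 df-bit source <your side of the /30>
ping <far side of the /30> size 4470 df-bit source <your side of the /30>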

And more specifically, possibly an interface MTU (or ip mtu, I forget which).

If there is a mismatch between ends of a link, in one direction, MTU-sized packets get sent, and the other end sees those as "giants".

I've seen situations where the MTU is calculated incorrectly, when using some technology that adds a few bytes (e.g. VLAN tags, MPLS tags, etc.).

On Cisco boxes, when talking to other Cisco boxes, even.

Take a look at the interfaces over which the peering session runs, at both ends.
I.e., is this the only BGP session *over that interface*, for the local box?

(It might not be the end you think it's at, BTW.)
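A quick look on both boxes, something like the following (substitute the
actual peering interfaces):

! an MTU mismatch on one end tends to show up as giants/input errors on
! the other
show interfaces <peering-interface> | include MTU|giants|input errors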

Oh, and if you find something, please, let us know.
War stories make for great bar BOFs at NANOG meetings. :)

Brian

Take a look at the interfaces over which the peering session runs, at both ends.
I.e., is this the only BGP session *over that interface*, for the local box?

You are going to find this even more strange. I have two routers that
are communicating over the same transport medium and are actually in the
same rack. One router is a Cisco 7606 which has an IBGP session
established with my Cisco 6509. Both have Sup720-3BXLs and 1 GB of
memory. Ironically, from the 6509's perspective, I cannot seem to
maintain a session with my 7206 VXR, which has two directly connected
DS-3s. In order for my 6509 to establish an IBGP session with my 7606,
it has to go through the 7206 VXR. Crazy, right?

Yeah, I can already tell this is going to be a *War Story*, as you said. :)

And more specifically, possibly an interface MTU (or ip mtu, I forget
which).

If there is a mismatch between ends of a link, in one direction,
MTU-sized packets get sent, and the other end sees those as "giants".

Well if the interface or ip mtu was smaller on one end, this would
result in a lower mss negotiation and you would just have smaller but
working packets. The bad situation is when there is a layer 2 device in
the middle which eats the big packets and doesn't generate an ICMP
needfrag. For example, if there was a 1500-byte only ethernet switch in
the middle of this link, it would drop anything > 1500 bytes and prevent
path mtu discovery from working, resulting in silent blackholing. I was
assuming that wasn't the case here based on the 4474 mtu (was assuming
sonet links or something), but looking at the original message he
doesn't say what media or what might be in the middle, so it's possible
4474 is a manually configured mtu causing blackholing.

I've seen situations where the MTU is calculated incorrectly, when
using some technology that adds a few bytes (e.g. VLAN tags, MPLS
tags, etc.).

Even when things are working as intended, different vendors mean
different things when they talk about MTU. For example, Juniper and
Cisco disagree as to whether the mtu should include layer 2 or .1q tag
overhead, resulting in inconsistent MTU numbers which are not only
different between the vendors, but which can change depending on what
type of trunk you're running between the devices. Enabling > 1500 byte
MTUs is a dangerous game if you don't know what you're doing, or if
you're connected to other people who are sloppy and don't fully verify
their MTU settings on every link.

War stories make for great bar BOFs at NANOG meetings. :)

Never-ending supply of those things. :)

I was assuming that wasn't the case here based on the 4474 mtu (was
assuming sonet links or something), but looking at the original message
he doesn't say what media or what might be in the middle, so it's
possible 4474 is a manually configured mtu causing blackholing.

Here is the network architecture from the Cisco 6509 to the 7206 VXR.
The 6509 has a successful BGP session established with another router,
Cisco 7606 w/ Sup720-3bxls. The 7606 and 7206 VXR are connected
together by a Cisco 3550 switch. In order for the 6509 to establish the
IBGP session to the 7606, it has to pass through two DS-3s, go through
the 7206 VXR, out the Fast E, through the Cisco 3550, and then to the
7606. I checked the MTUs on the 3550s and I am seeing the Fast E
interfaces are still showing 1500 bytes. Would increasing the MTU size
on the switches cause any harm?

I checked the MTUs on the 3550s and I am seeing the Fast E
interfaces are still showing 1500 bytes. Would increasing the MTU size
on the switches cause any harm?

The 3550s are very limited with respect to MTU - the standard model
can only do up to 1546 bytes, while I believe the -12G model can do
2000 bytes. In any case - you won't get a 4470 byte packet through a
3550. Also, changing the MTU on the 3550 requires a reboot.

Steinar Haug, Nethelp consulting, sthaug@nethelp.no

As other people have said, this definitely sounds like an MTU problem.
Basically you're trying to pass 4470 byte BGP packets over a link that
drops anything bigger than 1500. The session will establish because all
the setup packets are small, but the tcp session will stall as soon as
you try to send routes across it.

What should be happening here is the 6509 will generate a 4470 byte
packet because it sees the directly connected interface as a DS3 and
doesn't know the path is incapable of supporting > 1500 bytes end to
end. The layer 3 device on the mtu choke point, in this case the faste
interface on the 7206vxr, should be configured to a 1500 byte mtu. This
will cause the 7206vxr to generate an ICMP needfrag when the 4470 byte
packet comes along, and cause path mtu discovery to lower the MSS on the
IBGP session. Either a) you have the mtu misconfigured on that 7206vxr
port, b) your router is misconfigured not to generate the icmp, c)
something in the middle is misconfigured to filter this necessary icmp
packet, or d) some other screwup probably related to one of the above.
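For (b) and (c), the usual suspects are "no ip unreachables" on the
choke-point interface or an over-eager filter/CoPP policy eating the ICMPs.
Roughly, on the 7206VXR (the interface number below is a guess -- use
whichever FastE faces the 3550):

show running-config interface FastEthernet0/0 | include mtu|unreachables
! "no ip unreachables" there suppresses exactly the ICMP fragmentation-needed
! messages that path mtu discovery depends on.  You can also watch for them
! being generated (careful with debugs on a busy box):
debug ip icmp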

Generally speaking increasing the MTU size on a switch can never hurt
anything, but having an insufficiently large MTU on the switch is what
will break you the most (as is happening here). The problem occurs when
you increase the MTU on the layer 3 routers to something beyond what the
layer 2 link in the middle is capable of supporting. Layer 3 devices
will either fragment (deprecated) or generate ICMP NeedFrags which will
cause path MTU discovery to shrink the MSS. Layer 2 devices are
incapable of doing this, so you MUST NOT set the layer 3 MTU above what
the layer 2 link is capable of handling.

Now that said, increasing the mtu on the 3550 won't work here because
3550 MTU support is terrible. The only option you have is to configure
the MTU of all interfaces to 1546 with the "system mtu 1546" command,
followed by a reload. This is not big enough to pass your 4470 byte
packets, and will also break any MTU dependent configuration you might
be running. For example, after you do this, any OSPF speakers on your
3550 will have to have their MTUs adjusted as well, or OSPF will not
come back up due to the interface MTU mismatch. For more details see:

http://www.cisco.com/en/US/products/hw/switches/ps700/products_configuration_example09186a008010edab.shtml#c4
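For completeness, the knob in question looks roughly like this -- and again,
1546 still won't pass 4470-byte packets, so on its own it doesn't fix you:

! on the 3550, in config mode; only takes effect after a reload
system mtu 1546
!
! after the reload, verify:
show system mtu
! any OSPF speakers on the 3550 will then need matching MTUs on their
! neighbors (or "ip ospf mtu-ignore" on the affected interfaces) before
! adjacencies come back up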

Your best bet (in order of most preferable to least) is to a) fix
whatever is breaking path mtu discovery on the 7206vxr in the first
place, b) force the mss of the ibgp session to something under 1460, or
c) lower the mtu on the ds3 interface to 1500.
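Rough sketches of (b) and (c) -- the serial interface name is made up, and
the global MSS knob only exists in newer code, so check your release first:

! (b) clamp the MSS of TCP sessions the router itself terminates
ip tcp mss 1400
!
! (c) cap the DS3; "ip mtu 1500" caps the IP packets without touching the
!     layer 2 MTU, or lower the interface mtu outright
interface Serial1/0
 ip mtu 1500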

RAS wrote:

[ lots of good stuff elided for brevity ]

c) lower the mtu on the ds3 interface to 1500.

This will have another benefit, if it is done to all such interfaces on the two devices.
(Where by "all such interfaces", I mean "everything with set-able MTU > 1500".)

Configuring one common MTU size on the interfaces means the buffer pool on the box will switch from several pools of varying sizes to one pool.
The number of buffers per pool gets pro-rated by speed * buffer size, so high-speed, high-MTU interfaces get a gigantic chunk of buffer space.

Once you reduce things to one pool, you significantly reduce the likelihood of buffer starvation.

Note that the discussion on benefits to buffers is old info, and may even be moot these days, but buffers are fundamental enough that I doubt it.
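You can eyeball the carving before and after with:

show buffers
! both the public pools (small/middle/big/...) and the interface pools are
! listed; watch for "misses"/"failures" creeping up under load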

However, the original problem, iBGP not working, will definitely be resolved by this.

Note also, changing this often won't take effect until a reboot, and/or may result in buffer re-carving with an attendant "hit" of up to 30 seconds of not forwarding packets (!!). You've been warned...

In other words, plan this carefully, and make sure you have remote hands available or are on site. This qualifies as "deep voodoo". ;)

Brian

Either a) you have the mtu misconfigured on that 7206vxr

That part is where I am at a loss. How is it the 6509 can establish an
IBGP session with the 7606 when it has to go through the 7206 VXR? The
DS-3s are connected to the 7206 VXR. To add more depth to the story: I
have 8 IBGP sessions that are connected to the 7206 VXR that have been
up and running for over a year. Some of the sessions traverse the DS-3s
and/or GigE long-haul connections. There are a total of 10 core routers
that are a mixture of Cisco 7606s, 6509s, and 7206 VXRs w/ NPE-400s or
G1s. Only this one IBGP session out of 9 routers is not being
established. Since I have a switch between the 7606 and 7206, I plan to
put a packet capture server there and see what I can see.

And is that the one that traverses the 3550 with the 1500 byte MTU?
Re-read what we said. You should be able to test the MTU theory by
disabling path-mtu-discovery, which will cause the MSS to fall back to
the 536-byte default (derived from the 576-byte minimum MTU).
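i.e. roughly, on whichever box has ip tcp path-mtu-discovery configured
(clear the session so the new MSS takes effect):

configure terminal
 no ip tcp path-mtu-discovery
 end
clear ip bgp 67.214.64.100
! then see what TCP fell back to:
show ip bgp neighbors 67.214.64.100 | include max data segment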