MTU

Hi

What is best practice regarding choosing MTU on transit links?

Until now we have used the default of 1500 bytes. I now have a project where
we peer directly with another small ISP. However we need a backup so we
figured a GRE tunnel on a common IP transit carrier would work. We want to
avoid the troubles you get by having an effective MTU smaller than 1500
inside the tunnel, so the IP transit carrier agreed to configure a MTU of
9216.

Obviously I only need to increase my MTU by the size of the GRE header. But
I am thinking is there any reason not to go all in and ask every peer to go
to whatever max MTU they can support? My own equipment will do MTU of 9600
bytes.
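
For reference, the arithmetic: basic GRE (no key/checksum options) adds 4
bytes plus a 20-byte outer IPv4 header, i.e. 24 bytes of overhead, so a
1500-byte inner MTU needs at least 1524 bytes on the underlay, and a
9216-byte underlay leaves room for a 9192-byte inner MTU. A rough IOS-style
sketch (interface names and addresses are placeholders; exact syntax and
limits vary by platform):

    ! Underlay interface towards the IP transit carrier, set to what the
    ! carrier actually guarantees (9216 here).
    interface TenGigabitEthernet0/0/0
     mtu 9216
    !
    ! GRE tunnel to the other ISP:
    ! inner IP MTU = 9216 - 20 (outer IPv4) - 4 (GRE) = 9192.
    interface Tunnel0
     tunnel source TenGigabitEthernet0/0/0
     tunnel destination 198.51.100.1
     ip mtu 9192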

On the other hand, none of my customers will see any actual difference
because they are end users with CPE equipment that expects a 1500 byte MTU.
Trying to deliver jumbo frames to the end users is probably going to end
badly.

Regards,

Baldur

See below:

    http://mailman.nanog.org/pipermail/nanog/2016-March/084598.html

You can reliably run Jumbo frames in your own network core, and also to
another network that can guarantee you the same (which would typically
be under some form of commercial, private arrangement like an NNI).

Across the Internet, 1,500 bytes is still safest, simply because that is
pretty much the standard. Trying to achieve Jumbo frames across an
Internet link (which includes links to your upstreams, links to your
peers and links to your customers) is an exercise in pain.

Mark.

❦ 22 July 2016 14:01 CEST, Baldur Norddahl <baldur.norddahl@gmail.com>:

Until now we have used the default of 1500 bytes. I now have a project where
we peer directly with another small ISP. However we need a backup so we
figured a GRE tunnel on a common IP transit carrier would work. We want to
avoid the troubles you get by having an effective MTU smaller than 1500
inside the tunnel, so the IP transit carrier agreed to configure a MTU of
9216.

Obviously I only need to increase my MTU by the size of the GRE header. But
I am thinking is there any reason not to go all in and ask every peer to go
to whatever max MTU they can support? My own equipment will do MTU of 9600
bytes.

You should always match the MTU of the remote end. So, if your transit
carrier configured 9216 on its side, you should do the same on yours.
There is no MTU discovery at the L2 layer: if you set the MTU of your
interface to 9600 and you happen to route a 9500-byte packet, it will be
silently dropped by your transit carrier.
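
Put differently: configure to the figure the remote end actually supports,
not your local maximum. A minimal IOS-style illustration (the interface
name is a placeholder; syntax varies by vendor):

    ! The carrier configured 9216 on its side, so match it exactly here,
    ! even though the local hardware could go to 9600.
    interface TenGigabitEthernet0/1/0
     mtu 9216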

This topic seems to come up more often lately, much like it did during
IPsec-related deployments. I standardize on 9,000 as an easy number so I
don't have to split hairs over the small differences (read: 9,214 vs.
9,216) that some vendors have.

My experience has been that making a few phone calls and agreeing on 9,000
is simple enough. I've only experienced one situation in which the MTU must
match, and that is on OSPF neighbor relationships, for which John T. Moy's
book (OSPF - Anatomy of an Internet Routing Protocol) clearly explains why
MTU became an issue during development of that protocol. As more and more
of us choose, or are forced, to support 'jumbo' frames to accommodate Layer
2 extensions (DCI [Data Center Interconnects]), I find myself helping my
customers work with their carriers to ensure that jumbo frames are
supported, and frequently reminding them to confirm that jumbo frames are
enabled not only on the primary path(s) but on any possible backup path as
well. I've had customers experience DCI-related outages because their
provider performed maintenance on the primary path and the re-route was
sent across a path that did not support jumbo frames.

As always, YMMV, but I personally feel that having those discussions, and
doing the implementation, with your internal network team as well as with
all of your providers is time well spent.

Later,
-chris

Hi Baldur,

On a link containing only routers, you can safely increase the MTU to
any mutually agreed value with these caveats:

1. Not all equipment behaves well with large packets. It's supposed to,
but you know what they say.

2. No protocol guarantees that every device on the link has the same
MTU. It's a manual configuration task on each device and if the
maximum receive unit on any device should happen to be less than the
maximum transmit unit on any other, you will be intermittently
screwed.

This includes virtual links like the GRE tunnel. If you can guarantee
the GRE tunnel travels a 9k path, you can set a slightly smaller MTU
on the tunnel itself.

MTU should never be increased above 1500 on a link containing
workstations and servers unless you know for certain that packets
emitted on that link will never traverse the public Internet. Path MTU
discovery on the Internet is broken. It was a poor design (it broke the
end-to-end principle), and over the years we've misimplemented it so
badly that it has no serious production-level reliability.

Where practical, it's actually a good idea to detune your servers to a
1460 or lower packet size in order to avoid problems transiting those
parts of the Internet which have allowed themselves to fall beneath a
1500 byte MTU. This is often accomplished by asking the firewall to
adjust the TCP MSS value in flight.
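
If the box doing the clamping is an IOS-style router rather than a
dedicated firewall, the in-flight adjustment looks roughly like this (the
1400-byte value is just an illustrative "1460 or lower" choice):

    ! Rewrite the MSS option in TCP SYNs crossing the LAN-facing interface
    ! so hosts never negotiate segments larger than 1400 bytes.
    interface GigabitEthernet0/2
     ip tcp adjust-mss 1400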

Regards,
Bill Herrin

My experience has been that making a few phone calls and agreeing on 9,000
is simple enough. I've only experienced one situation in which the MTU must
match, and that is on OSPF neighbor relationships, for which John T. Moy's
book (OSPF - Anatomy of an Internet Routing Protocol) clearly explains why
MTU became an issue during development of that protocol. As more and more
of us choose, or are forced, to support 'jumbo' frames to accommodate Layer
2 extensions (DCI [Data Center Interconnects]), I find myself helping my
customers work with their carriers to ensure that jumbo frames are
supported, and frequently reminding them to confirm that jumbo frames are
enabled not only on the primary path(s) but on any possible backup path as
well. I've had customers experience DCI-related outages because their
provider performed maintenance on the primary path and the re-route was
sent across a path that did not support jumbo frames.

DCI links tend to be private in nature, and 100% on-net or off-net with
guarantees (NNI).

The question here is about the wider Internet.

As always, YMMV, but I personally feel that having those discussions, and
doing the implementation, with your internal network team as well as with
all of your providers is time well spent.

I don't disagree.

The issue comes with the other networks beyond your provider: their
providers/peers, and those networks' own providers/peers in turn, are
something you cannot control.

This falls into the same category as the "Can QoS markings be honored
across the Internet?" cases.

Mark.

Worth reading this on choosing the MTU on transit links.

http://blog.apnic.net/2014/12/15/ip-mtu-and-tcp-mss-missmatch-an-evil-for-network-performance/

-Sad

Hi

What is best practice regarding choosing MTU on transit links?

Until now we have used the default of 1500 bytes. I now have a project where
we peer directly with another small ISP. However we need a backup so we
figured a GRE tunnel on a common IP transit carrier would work. We want to
avoid the troubles you get by having an effective MTU smaller than 1500
inside the tunnel, so the IP transit carrier agreed to configure a MTU of
9216.

Obviously I only need to increase my MTU by the size of the GRE header. But
I am thinking is there any reason not to go all in and ask every peer to go
to whatever max MTU they can support? My own equipment will do MTU of 9600
bytes.

If you're just doing this for the GRE overhead and given that you're talking about backup over transit and possibly $deity-knows-where paths, TBH I might just lean towards pinning your L3 MTU inside the tunnel to 1500 bytes and configuring IP fragmentation post-encap. Not pretty, but probably fewer chances for WTF moments than trying to push >1500 on a transit path.
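
For the archives, on IOS-style gear that approach looks roughly like the
sketch below. It relies on the outer GRE/IP header not carrying DF, so the
1524-byte encapsulated packet is fragmented by the tunnel headend and
reassembled at the far endpoint; behaviour and CPU cost vary a lot by
platform, so treat this as a sketch rather than a recipe:

    interface Tunnel0
     tunnel source Loopback0
     tunnel destination 198.51.100.1
     ! Keep the full 1500 bytes visible to traffic inside the tunnel...
     ip mtu 1500
     ! ...and leave 'tunnel path-mtu-discovery' disabled, so DF is not
     ! copied to the outer header and the oversized GRE packet can be
     ! fragmented on the way out.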

This *might* be coloured by my past fights with having to force GRE through a 1500-byte path and trying to make that transparent to transit traffic, but there you have it...

What I noticed a few years ago was that BGP convergence time was faster
with a higher MTU. A full BGP table load took half the time at MTU 9192
compared to 1500. Of course, BGP has to be allowed to use the higher MTU.

Anyone else observed something similar?

I have read about others experiencing this, and did some testing a few months back -- my experience was that for low latency links, there was a measurable but not huge difference. For high latency links, with Juniper anyway, there was a negligible difference, because the TCP window size is hard-coded at something small (16384?), so that ends up being the limit more than the TCP slow-start issues that MTU helps with.

With that said, we run MTU at >9000 on all of our transit links and all of our internal links, with no problems. Make sure to test by sending pings with do-not-fragment set at the maximum size configured, and without do-not-fragment just slightly larger than the maximum size configured, to make sure that there are no configuration mismatches due to vendor differences.
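
Concretely, that test is along these lines (IOS-style one-line ping shown;
'size' here is the full IP datagram size, and the address is a placeholder):

    ! With DF set, the largest configured size must go through unfragmented.
    ping 203.0.113.1 size 9216 df-bit repeat 5
    ! Without DF, slightly larger than the configured maximum should still
    ! succeed (via fragmentation); a failure here points at a mismatch.
    ping 203.0.113.1 size 9217 repeat 5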

Best Regards,
-Phil Rosenthal

Quite an obvious thing: BGP by default on Cisco and Juniper will use up to the maximum allowed 4k message size per packet, which for typical unicast IPv4/v6 helps pack all the attributes in with the prefixes. This not only lowers CPU load on the sending side but also on the receiving end, and helps with routing convergence.

There was a draft to use up to 9k for BGP messaging, but I believe it's buried somewhere outside the town called "our current version RFC".
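
On IOS-style boxes, letting BGP's TCP session actually take advantage of a
bigger interface MTU is typically a matter of enabling path MTU discovery
for the router's own TCP sessions (a sketch; the exact knobs differ per
platform and OS release):

    ! Let locally-originated TCP sessions (BGP included) discover and use
    ! the path MTU instead of falling back to a small default MSS.
    ip tcp path-mtu-discovery
    ! 'show ip bgp neighbors' should then report a correspondingly larger
    ! max data segment for the session.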

I tested a Cisco CRS-1 (or maybe it was already upgraded to a CRS-3) against a Juniper MX480 or MX960 on a link with about 10 ms latency. It was iBGP carrying internal routes plus a full BGP table (both ways).
I think the bottleneck was the CPU on the CRS side, and maxing out the MSS helped a lot. I recall later doing tests Juniper to Juniper, and indeed the gain was not as big, but it was still visible.

The Juniper command 'show system connections' showed an MSS of around 9 kB. I didn't check the TCP window size.

What I noticed a few years ago was that BGP convergence time was faster
with a higher MTU. A full BGP table load took half the time at MTU 9192
compared to 1500. Of course, BGP has to be allowed to use the higher MTU.

Anyone else observed something similar?

I have read about others experiencing this, and did some testing a few
months back -- my experience was that for low latency links, there was a
measurable but not huge difference. For high latency links, with Juniper
anyway, there was a negligible difference, because the TCP window size is
hard-coded at something small (16384?), so that ends up being the limit
more than the TCP slow-start issues that MTU helps with.

I think the Cisco default window size is 16KB but you can change it with
ip tcp window-size NNN

Lee

Yes, of course. A larger MSS means fewer BGP update messages are needed,
so convergence can complete sooner.

The problem is that eBGP sessions are generally run between different
networks, where co-ordinating MTU can be an issue.

Mark.

* Baldur Norddahl

What is best practice regarding choosing MTU on transit links?

Until now we have used the default of 1500 bytes. I now have a
project where we peer directly with another small ISP. However we need
a backup so we figured a GRE tunnel on a common IP transit carrier
would work. We want to avoid the troubles you get by having an
effective MTU smaller than 1500 inside the tunnel, so the IP transit
carrier agreed to configure a MTU of 9216.

Your use case as described above puzzles me. You should already see your
peer's routes being advertised to you via the transit provider, and vice
versa. If your direct peering fails, the traffic should start flowing
via the transit provider automatically. So unless there's something
else going on here that you're not telling us, there should be no need
for the GRE tunnel.

That said, it should work, as long as the MTU is increased on both ends
and the transit network guarantees it will transport the jumbos.

We're doing something similar, actually. We have multiple sites
connected with either dark fibre or DWDM, but not always in a redundant
fashion. So instead we run GRE tunnels through transit (with increased
MTU) between selected sites to achieve full redundancy. This has worked
perfectly so far. It's only used for our intra-AS IP/MPLS traffic
though, not for eBGP like you're considering.
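
For anyone doing the same sums: on top of the 24 bytes of GRE/IPv4
overhead, each MPLS label adds another 4 bytes, and an L2VPN payload also
carries the customer's Ethernet header. A rough IOS-style sketch of where
that lands on the tunnel (values are illustrative, and not every platform
accepts 'mpls mtu' on tunnel interfaces):

    interface Tunnel0
     ! A 9216-byte underlay minus 20 (outer IPv4) and 4 (GRE) leaves 9192
     ! for the labelled packet; 'mpls mtu' bounds label stack plus payload.
     ip mtu 9192
     mpls mtu 9192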

Obviously I only need to increase my MTU by the size of the GRE
header. But I am thinking is there any reason not to go all in and
ask every peer to go to whatever max MTU they can support? My own
equipment will do MTU of 9600 bytes.

I'd say it's not worth the trouble unless you know you're actually going
to use it for something. If I were your peer, I'd certainly need you to
give me a good reason to deviate from my standard templates first...

On the other hand, none of my customers will see any actual difference
because they are end users with CPE equipment that expects a 1500
byte MTU. Trying to deliver jumbo frames to the end users is probably
going to end badly.

Depends on the end user, I guess. Residential? Agreed. Business? Who
knows - maybe they would like to run fat GRE tunnels through your
network? In any case: 1500 by default, other values only by request.

Tore

I did not say we were doing internet peering...

In case you are wondering, we are actually running L2VPN tunnels over MPLS.

Regards,

Baldur

* Baldur Norddahl

I did not say we were doing internet peering...

Uhm. When you say that you peer with another ISP (and keep in mind what
the "I" in ISP stands for) while giving no further details, folks are
going to assume that you're talking about a standard eBGP peering with
inet/inet6 unicast NLRIs.

In case you are wondering, we are actually running L2VPN tunnels over
MPLS.

Okay. Well, I see no reason why using GRE tunnels for this purpose
shouldn't work; it does for us (using mostly VPLS and Martini tunnels).

That said, I've never tried extending our MPLS backbone outside of
our own administrative domain or autonomous system. That sounds like a
really scary prospect to me, but I'll admit I've never given serious
consideration to such an arrangement before. Hopefully you know what
you're doing.

Tore

Well, you can extend your MPLS-based services outside your domain
through an NNI. Fair point: you generally won't run MPLS with your NNI
partner, but they will carry your services across their own MPLS network
toward their destination at the B-end.

With such an arrangement, the two sides can co-ordinate so that
capabilities are mirrored between the different networks, even though
neither NNI partner has end-to-end control.

Mark.

This has been well known for years:

http://morse.colorado.edu/~epperson/courses/routing-protocols/handouts/bgp_scalability_IETF.ppt

You have to adjust the MTU, input queues and such. The default TCP stack is very conservative.

- Jared
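
For the IOS-minded, the knobs being referred to are along these lines (the
values are merely examples of "less conservative than the default", not
recommendations):

    ! Bigger receive window for the router's own TCP sessions (BGP
    ! included); the thread above mentions the 16 KB default.
    ip tcp window-size 65535
    !
    interface TenGigabitEthernet0/0/0
     ! Deeper input queue so bursts of update packets are not dropped.
     hold-queue 1500 in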