Dear NANOGers,
It irks me that today, the effective MTU of the internet is 1500 bytes, while more and more equipment can handle bigger packets.
What do you guys think about a mechanism that allows hosts and routers on a subnet to automatically discover the MTU they can use towards other systems on the same subnet, so that:
1. It's no longer necessary to limit the subnet MTU to that of the least capable system
2. It's no longer necessary to manage 1500 byte+ MTUs manually
Any additional issues that such a mechanism would have to address?
>>>>> "Iljitsch" == Iljitsch van Beijnum <iljitsch@muada.com> writes:
> Dear NANOGers,
> It irks me that today, the effective MTU of the internet is 1500
> bytes, while more and more equipment can handle bigger packets.
> What do you guys think about a mechanism that allows hosts and
> routers on a subnet to automatically discover the MTU they can use
> towards other systems on the same subnet, so that:
> 1. It's no longer necessary to limit the subnet MTU to that of the
> least capable system
> 2. It's no longer necessary to manage 1500 byte+ MTUs manually
> Any additional issues that such a mechanism would have to address?
Wouldn't that work only if the switch in the middle of your neat
office LAN is a real switch (i.e. one that doesn't flood oversize
packets to hosts that can't handle them, possibly crashing their NIC
drivers), and is itself capable of larger MTUs?
Pf
Well, yes, being compatible with stuff that doesn't support larger packets pretty much goes without saying. I don't think there is any need to worry about crashing drivers: packets that are longer than they should be are a common error condition that drivers are supposed to handle without incident. (They often keep a "giant" count.)
A more common problem would be two hosts that support jumbo frames with a switch in the middle that doesn't. So it's necessary to test for this and avoid sending excessive numbers of large packets when something in the middle doesn't support them.
What do you guys think about a mechanism that allows hosts and
routers on a subnet to automatically discover the MTU they can use
towards other systems on the same subnet, so that:
1. It's no longer necessary to limit the subnet MTU to that of the
least capable system
2. It's no longer necessary to manage 1500 byte+ MTUs manually
To me this sounds like adding complexity for a rather small pay-off.
And then we'd have to ask the IXP people: would they enable this
feature if it was available? If so, why don't they offer a high-MTU
VLAN today?
And in the end, the pay-off of a larger MTU is quite small; perhaps
some interrupts are saved, but I'm not sure how relevant that is
with poll()-based NIC drivers. Of course, a bigger pay-off
would be that users could use tunneling and still offer 1500
bytes to the LAN.
IXP peeps, why are you not offering a high-MTU VLAN option?
From my point of view, this is the biggest reason why we today
generally don't have a higher end-to-end MTU.
I know that some IXPs do, e.g. Netnod, but generally it's
not offered even though many users would opt to use it.
Thanks,
The internet is broken: too many firewalls dropping ICMP, too many hard-coded systems that work for the 'default' but don't actually allow for alternative parameters that should work according to the RFCs.
If you can fix all that, then it might work.
Alternatively, if you can redesign path MTU discovery, that might work too...
Martin Levy suggested this to me only two weeks ago. He had an idea of sending two packets initially, one 'default' and one at the higher MTU; if the higher one gets dropped somewhere, you can quickly spot it and revert to 'default' behaviour.
I think his explanation was more complicated, but it was an interesting idea.
Steve
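The dual-probe idea above can be sketched as decision logic (a hypothetical sketch, not any real protocol; the probe sizes and the notion of a cooperating echo responder are assumptions):

```python
# Sketch of the dual-probe idea: send a 'default' probe and a jumbo probe
# at the same time, then pick the MTU based on which ones were answered.
# Probe sizes and the echo mechanism are assumed for illustration.

DEFAULT_MTU = 1500

def choose_mtu(replies, probe_sizes=(1500, 9000)):
    """Return the largest probed size that was acknowledged, falling back
    to the default if even the 1500-byte probe was lost."""
    acked = [size for size in probe_sizes if size in replies]
    return max(acked) if acked else DEFAULT_MTU

print(choose_mtu({1500, 9000}))  # both probes answered -> 9000
print(choose_mtu({1500}))        # jumbo probe dropped -> stay at 1500
```

The appeal is that the fallback is immediate: losing the jumbo probe costs nothing, since the 'default' probe already establishes the safe size.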
Netnod in Sweden offers an MTU 4470 option.
On the other hand, it's not so easy operationally, since for instance Juniper and Cisco calculate MTU differently.
But I don't really see much benefit in trying to raise the end-system MTU above the standard ethernet MTU; if you think it's operationally troublesome with PMTUD now, imagine when everybody is running a different MTU.
The biggest benefit would be if the transport networks that people run PPPoE and other tunneled traffic over allowed for whatever MTU is needed to carry unfragmented 1500-byte tunneled packets, so we could ensure that all hosts on the internet actually have a 1500-byte IP MTU transparently.
* swmike@swm.pp.se (Mikael Abrahamsson) [Thu 12 Apr 2007, 14:07 CEST]:
Last I heard, the IEEE won't go along, and they're the ones who
standardize 802.3.
A few years ago, the IETF was considering various jumbogram options.
As best I recall, that was the official response from the relevant
IEEE folks: "no". They're concerned with backward compatibility.
Perhaps that has changed (and I certainly don't remember who sent that
note).
--Steve Bellovin, http://www.cs.columbia.edu/~smb
I agree. The throughput gains are small. You’re talking about a difference between a 4% header overhead versus a 1% header overhead (for TCP).
One could argue a decreased pps impact on intermediate systems, but when factoring in the existing packet size distribution on the Internet and the perceived adjustment seen by a migration to 4470 MTU support, the gains remain small.
Development costs and the OpEx costs of implementation and support will, likely, always outweigh the gains.
Gian Anthony Constantine
* Steven M. Bellovin:
A few years ago, the IETF was considering various jumbogram options.
As best I recall, that was the official response from the relevant
IEEE folks: "no". They're concerned with backward compatibility.
Gigabit ethernet has already broken backwards compatibility and is
essentially point-to-point, so the old compatibility concerns no
longer apply. Jumbo frame opt-in could even be controlled with a
protocol above layer 2.
I agree. The throughput gains are small. You're talking about a difference between a 4% header overhead versus a 1% header overhead (for TCP).
6% including ethernet overhead and assuming the very common TCP timestamp option.
One could argue a decreased pps impact on intermediate systems, but when factoring in the existing packet size distribution on the Internet and the perceived adjustment seen by a migration to 4470 MTU support, the gains remain small.
Average packet size on the internet has been fairly constant at around 500 bytes for the past 10 years or so, from my vantage point. You only need to make 7% of all packets 9000 bytes to double that average. This means that you can have twice the amount of data transferred for the same amount of per-packet work. If you're at 100% of your CPU or TCAM capacity today, that is a huge win. On the other hand, if you need to buy equipment that can do line rate at 64 bytes per packet anyway, it doesn't matter much.
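A quick back-of-the-envelope check of the claim above (the ~500-byte average is the figure quoted in the thread, not a measurement):

```python
# If the current average packet is ~500 bytes, what fraction of packets
# must become 9000-byte jumbos for the average to double?

def avg_size(avg_now, jumbo_fraction, jumbo=9000):
    """New average if jumbo_fraction of packets become jumbos and the
    rest keep the current average size."""
    return (1 - jumbo_fraction) * avg_now + jumbo_fraction * jumbo

# Solve 500*(1-x) + 9000*x = 1000 for x:
x = 500 / (9000 - 500)
print(round(x * 100, 1))           # -> 5.9 percent, roughly the 7% quoted
print(round(avg_size(500, 0.07)))  # -> 1095, i.e. more than double
```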
There are other benefits too, though. For instance, TCP can go much faster with bigger packets. Additional tunnel/VPN overhead isn't as bad.
Development costs and the OpEx costs of implementation and support will, likely, always outweigh the gains.
Gains will go up as networks get faster and faster; implementation costs should approach zero over time, and support shouldn't be an issue if it works fully automatically.
Others mentioned ICMP filtering and PMTUD problems. Filtering shouldn't be an issue for a mechanism that is local to a subnet, and even if it is, there's still no problem as long as the mechanism takes the opposite approach from PMTUD. With PMTUD, the assumption is that large packets work, and extra messages result in a smaller packet size. By exchanging large messages that indicate the capability to exchange large messages, form and function align: if an indication that large messages are possible isn't received, the larger size isn't used and there are no problems.
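The "advertise by example" approach described above could look roughly like this (a sketch; the frame name and format are made up for illustration — the point is that the advertisement is itself padded to the jumbo size, so merely receiving it proves the path carries frames that big):

```python
# Capability advertisement padded to the jumbo size: if a neighbour never
# receives it, nothing changes and 1500 stays in effect. The "MTU-ADVERT"
# format here is purely illustrative.

def build_advertisement(jumbo_mtu):
    """An announcement frame padded out to jumbo_mtu bytes."""
    return (b"MTU-ADVERT:%d" % jumbo_mtu).ljust(jumbo_mtu, b"\x00")

def learned_mtu(received):
    """Receiver side: the usable size is simply the size that arrived."""
    assert received.startswith(b"MTU-ADVERT:")
    return len(received)

advert = build_advertisement(9000)
print(learned_mtu(advert))  # -> 9000: the frame made it through intact
```

This is the fail-safe property described in the message: a dropped or filtered advertisement silently leaves the conservative 1500-byte default in place.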
I'm neither attacking nor defending the idea; I'm merely reporting.
I'll also note that the IETF is very unlikely to challenge IEEE on
this. There's an informal agreement on who owns which standards. The
IETF resents attempts at modifications to its standards by other
standards bodies; by the same token, it tries to avoid doing that to
others.
--Steve Bellovin, http://www.cs.columbia.edu/~smb
Last I heard, the IEEE won't go along, and they're the ones who
standardize 802.3.
I knew there was a reason we use ethernet II rather than IEEE 802.3 for IP. 
A few years ago, the IETF was considering various jumbogram options.
As best I recall, that was the official response from the relevant
IEEE folks: "no". They're concerned with backward compatibility.
Obviously keeping the same maximum packet size when moving from 10 to 100 to 1000 to 10000 Mbps is suboptimal. However, if the newer standards were to mandate a larger maximum packet size, a station connected to a 10/100/1000 switch at 1000 Mbps would be able to send packets that a 10 Mbps station wouldn't be able to receive. (And the 802.3 length field starts clashing with ethernet II type codes.)
However, to a large degree this ship has sailed because many vendors implement jumboframes. If we can fix the interoperability issue at layer 3 for IP that the IEEE can't fix at layer 2 for 802.3, then I don't see how anyone could have a problem with that. Also, such a mechanism would obviously be layer 2 agnostic, so in theory, it doesn't step on the IEEE's turf at all.
I think it's a great idea operationally: less work for the routers and more efficient use of bandwidth. It would also be useful to devise some way to at least partially reassemble fragmented frames at links capable of large MTUs. Since most PCs are on a subnet with an MTU of 1500 (or 1519), packets would still be limited to 1500 bytes or fragmented before they reach the higher speed links.
The problem with bringing this to fruition on the internet is going to be cost and effort. The AT&Ts and Verizons of the world are going to see this as a major upgrade without much benefit or profit. The Ciscos and Junipers are going to say the same thing when they have to write this into their code, plus interoperability with other vendors' implementations of it.
Iljitsch van Beijnum <iljitsch@muada.com> wrote to the NANOG list (nanog@merit.edu) on 04/12/2007, subject "Thoughts on increasing MTUs on the internet".
* Steven M. Bellovin:
> A few years ago, the IETF was considering various jumbogram options.
> As best I recall, that was the official response from the relevant
> IEEE folks: "no". They're concerned with backward compatibility.
Gigabit ethernet has already broken backwards compatibility and is
essentially point-to-point, so the old compatibility concerns no
longer apply. Jumbo frame opt-in could even be controlled with a
protocol above layer 2.
I'm neither attacking nor defending the idea; I'm merely reporting.
I just wanted to point out that the main reason why this couldn't be
done without breaking backwards compatibility is gone (a shared physical
medium with unknown and unforeseeable receiver capabilities).
I'll also note that the IETF is very unlikely to challenge IEEE on
this.
It's certainly unwise to do so before PMTUD works without ICMP
support. 
Niels Bakker wrote:
* swmike@swm.pp.se (Mikael Abrahamsson) [Thu 12 Apr 2007, 14:07 CEST]:
IXP peeps, why are you not offering high MTU VLAN option?
Biggest benefit would be if the transport network people run PPPoE and
other tunneled traffic over, would allow for whatever MTU needed to
carry unfragmented 1500 byte tunneled packets, so we could assure that
all hosts on the internet actually have 1500 IP MTU transparently.
How much traffic from DSLAM to service provider is currently being
exchanged across IXPs?
How much l2 vpn traffic is being exchanged across the public internet?
(my money is on a lot)
I agree. The throughput gains are small. You’re talking about a difference between a 4% header overhead versus a 1% header overhead (for TCP).
One of the “benefits” of a larger MTU is that, during the additive increase phase, or after recovering from congestion, you reach full speed sooner. It also means that if you do hit congestion, you throw away more data and, given the length of flows, are probably more likely to cause congestion in the first place…
Out of curiosity how is this calculated?
[ytti@ytti.fi ~]% echo "1450/(1+7+6+6+2+1500+4+12)*100"|bc -l
94.27828348504551365400
[ytti@ytti.fi ~]% echo "8950/(1+7+6+6+2+9000+4+12)*100"|bc -l
99.02633325957070148200
[ytti@ytti.fi ~]%
I calculated less than 5% going from 1500 to 9000, with ethernet
overhead and the TCP timestamp added. What did I miss?
Or compared without tcp timestamp and 1500 to 4470.
[ytti@ytti.fi ~]% echo "1460/(1+7+6+6+2+1500+4+12)*100"|bc -l
94.92847854356306892000
[ytti@ytti.fi ~]% echo "4410/(1+7+6+6+2+4470+4+12)*100"|bc -l
97.82608695652173913000
Less than 3%.
However, I don't think it matters whether it's 1% or 10%; the bigger
benefit would be to give 1500 bytes end-to-end, even with e.g. IPsec
to the office.
Or compared without tcp timestamp and 1500 to 4470.
[ytti@ytti.fi ~]% echo "1460/(1+7+6+6+2+1500+4+12)*100"|bc -l
94.92847854356306892000
[ytti@ytti.fi ~]% echo "4410/(1+7+6+6+2+4470+4+12)*100"|bc -l
97.82608695652173913000
Apparently 4470 - 40 is too hard for me.
[ytti@ytti.fi ~]% echo "4430/(1+7+6+6+2+4470+4+12)*100"|bc -l
98.26974267968056787900
so ~3.3%
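For anyone reproducing the arithmetic, the bc one-liners above can be folded into one function (same constants as in the calculations above: preamble 7 + SFD 1 + two 6-byte MACs + EtherType 2 + FCS 4 + inter-frame gap 12 bytes on the wire, 40 bytes of IP+TCP headers, no TCP options):

```python
# TCP goodput as a fraction of bytes on the wire, per the bc math above.
WIRE_OVERHEAD = 7 + 1 + 6 + 6 + 2 + 4 + 12  # 38 bytes outside the MTU

def tcp_efficiency(mtu, ip_tcp_headers=40):
    """Fraction of wire bytes that are TCP payload (no TCP options)."""
    return (mtu - ip_tcp_headers) / (mtu + WIRE_OVERHEAD)

for mtu in (1500, 4470, 9000):
    print(mtu, round(tcp_efficiency(mtu) * 100, 2))
# -> 1500 94.93, 4470 98.27, 9000 99.14
```

So the raw efficiency gain from 1500 to 4470 is about 3.3 points, and from 1500 to 9000 about 4.2, consistent with the corrected figures above.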
I did a rough, top-of-the-head calculation with ~60 bytes of header (ETH, IP, TCP) against 1500 and 4470 (a mistake, on my part, not to use 9216).
I still think the cost outweighs the gain, though there are some reasonable arguments for the increase.
Gian Anthony Constantine