I-D on operational MTU/fragmentation issues in tunneling

Hi all,

I've written a very short (about 5 pages of meat) Internet-Draft
describing the issues and operational approaches to the problems faced
with doing tunneling in the network -- as these issues kept coming up
again and again with IP-in-IP, GRE, L2TP, etc. The approaches may be
different for passive monitoring ('wiretapping' etc.) and 'active'
tunneling.

The document is about to be IETF Last Called for Informational RFC,
but prior to that, I'd like to solicit comments/feedback/review from
the people here because I'm 100% sure a lot of people have been faced
with these issues (we certainly have..).

Please send comments to me by the end of this week, either on- or
off-list, as you deem appropriate.

Find it at:
http://www.ietf.org/internet-drafts/draft-savola-mtufrag-network-tunneling-01.txt

Abstract
   Tunneling techniques such as IP-in-IP when deployed in the middle of
   the network, typically between routers, have certain issues regarding
   how large packets can be handled: whether such packets would be
   fragmented and reassembled (and how), whether Path MTU Discovery
   would be used, or how this scenario could be operationally avoided.
   This memo justifies why this is a common, non-trivial problem, and
   goes on to describe the different solutions and their characteristics
   at some length.

Well, tunnels suck. No news there.

    It is interesting to note that at least one implementation provides a
    special knob to fragment the inner packet prior to encapsulation even
    if the DF bit has been set -- this is non-compliant behaviour, but
    possibly has been required in certain tightly controlled passive
    monitoring scenarios. Such a setup wouldn't work for packets which
    have already been fragmented if they needed to be fragmented again,
    though.

Why would it be impossible to refragment fragments???

I have a setup with dial-up over L2TP that doesn't support an MTU bigger than 576 (which is completely unnecessary of course, but try telling the people at the other end of the L2TP thingy that) so I clear the DF bit for all incoming packets that have to go through the PPP/L2TP tunnel. Works like a charm. (Surprisingly, all users seem to have systems that are capable of reassembling 1.5 kB packets now.)
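For illustration, clearing the DF bit on a packet amounts to masking one flag and recomputing the IPv4 header checksum. This is only a minimal sketch in Python, not the configuration used in the setup above; the function names are made up for the example, and it assumes it is handed the raw IPv4 header bytes.

```python
import struct

DF_MASK = 0x4000  # "Don't Fragment" bit in the 16-bit flags/fragment-offset field


def ipv4_checksum(header: bytes) -> int:
    """16-bit one's-complement checksum over an IPv4 header."""
    if len(header) % 2:
        header += b"\x00"
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def clear_df(header: bytes) -> bytes:
    """Return a copy of the header with DF cleared and the checksum redone."""
    hdr = bytearray(header)
    flags_frag = struct.unpack_from("!H", hdr, 6)[0]
    struct.pack_into("!H", hdr, 6, flags_frag & ~DF_MASK & 0xFFFF)
    struct.pack_into("!H", hdr, 10, 0)  # zero the checksum field first
    struct.pack_into("!H", hdr, 10, ipv4_checksum(bytes(hdr)))
    return bytes(hdr)
```

A header whose checksum is valid sums to zero under `ipv4_checksum`, which gives a quick way to verify the result.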

But I don't understand why anyone would want to use tunnels in the backbone. That's what VLANs are for. And if you don't use ether, you aren't bound by yester-millennium's 1500 byte MTU anyway.

In IPv6 there is the interesting problem that there are already many tunnels all over the place that often have a 1280 byte MTU, so tunneling over that can't be done because of the mandatory minimum MTU of 1280 bytes.

Hi Pekka and others,

Please send comments to me by the end of this week, either on- or
off-list, as you deem appropriate.

With the risk of stating the obvious I would say that normally, PMTUD
should do the trick. After all, there is no real difference between the
lower MTU of a tunnel and the lower MTU of any other link. With this in
mind, the real problem can be found on networks and hosts that block
ICMP-host-unreachables (or simply all ICMP traffic for "security"
reasons). Taking this one step further, one might realise that we (as a
networking community) are looking for a technical solution to compensate
for the lack of knowledge of the end-user administrators or webmasters.

In my work I have been using tunnels quite a lot, and have dealt with a
lot of issues regarding PMTUD problems. For end-users behind a tunnel,
the best solution is usually to turn PMTUD off completely, such as

[root@bofh root]# sysctl -w net.inet.tcp.path_mtu_discovery=0
net.inet.tcp.path_mtu_discovery: 1 -> 0

on a FreeBSD box. I agree that this is far less efficient than it should
be, but that's always the flipside of the tunnel-coin. Another option
would be to simply strip the DF bit on your tunnel entrance point, but
that would be rather undesirable..

Sabri Berisha wrote:

... or vendors of equipment that these people use. There are plenty of
vendors out there who make loadsharing-equipment for the enterprise that
doesn't handle all these cases.

It's just a myth that this is a simple user ignorance issue, it's a much
bigger problem than that, it's a vendor ignorance issue as well.

Mikael Abrahamsson wrote:

Thanks to you, and all who have replied (both off and on-list). I was
pleasantly surprised at the amount of review I've received. Keep them
coming! I'll try to respond/react to them shortly.

I'll respond to both posts on this list in one message:

> The document is about to be IETF Last Called for Informational RFC,
> but prior to that, I'd like to solicit comments/feedback/review from
> the people here because I'm 100% sure a lot of people have been faced
> with these issues (we certainly have..).

Well, tunnels suck. No news there.

    It is interesting to note that at least one implementation provides a
    special knob to fragment the inner packet prior to encapsulation even
    if the DF bit has been set -- this is non-compliant behaviour, but
    possibly has been required in certain tightly controlled passive
    monitoring scenarios. Such a setup wouldn't work for packets which
    have already been fragmented if they needed to be fragmented again,
    though.

Why would it be impossible to refragment fragments???

True -- thanks for catching this. I had a brain fart when I thought
that there isn't enough information in the IP header to do that. But
as long as you don't exhaust the IP identification number space, it's
OK..
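To make the point concrete: a fragment carries its own offset and MF flag, so it can be split again by adding the new offsets to the existing one and keeping MF set on every piece except the last piece of the last fragment. The following is only a sketch in Python under simplifying assumptions (a fixed 20-byte header, no IP options); the `Fragment` type and function names are invented for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Fragment:
    ident: int      # IP Identification, shared by all pieces of one datagram
    offset: int     # byte offset of this payload within the original datagram
    more: bool      # MF ("more fragments") flag
    payload: bytes


def fragment(frag: Fragment, mtu: int, header_len: int = 20) -> List[Fragment]:
    """Split a packet, or an existing fragment, so each piece fits in mtu."""
    if header_len + len(frag.payload) <= mtu:
        return [frag]
    # Every piece except the last must carry a multiple of 8 payload bytes.
    chunk = (mtu - header_len) // 8 * 8
    if chunk <= 0:
        raise ValueError("MTU too small to carry any payload")
    pieces = []
    for start in range(0, len(frag.payload), chunk):
        last = start + chunk >= len(frag.payload)
        pieces.append(Fragment(frag.ident,
                               frag.offset + start,
                               frag.more or not last,  # keep MF if more followed
                               frag.payload[start:start + chunk]))
    return pieces


def reassemble(frags: List[Fragment]) -> bytes:
    """Join payloads in offset order (assumes a complete, non-overlapping set)."""
    return b"".join(f.payload for f in sorted(frags, key=lambda f: f.offset))
```

Fragmenting a 1480-byte payload for a 1000-byte MTU and then fragmenting the results again for a 576-byte MTU still reassembles to the original payload, since all pieces share one Identification value.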

But I don't understand why anyone would want to use tunnels in the
backbone. That's what VLANs are for. And if you don't use ether, you
aren't bound by yester-millennium's 1500 byte MTU anyway.

I don't think it's quite as simple as that. First, even if you used
Ethernet, you would seem to have to require that all the tunnel entry
and exit points reside in the same Ethernet VLAN "space". That is,
all the entry/exit points would have to be hooked to the Ethernet
switch core network (somehow), or that the routers would support some
kind of VLAN 'passthrough' -- encapsulating the VLAN's traffic to some
other interface's VLAN.

These may hold in some situations, but not in general.

Remember that the problem comes up especially if you need to tunnel
beyond the "domain" where you have a high MTU (or can use VLANs). If
you can assume that.. well, that's one solution proposed in the draft.

In IPv6 there is the interesting problem that there are already many
tunnels all over the place that often have a 1280 byte MTU, so
tunneling over that can't be done because of the mandatory minimum MTU
of 1280 bytes.

Actually, it can be done, see RFC2473 ('Generic Packet Tunneling in
IPv6'). The entry point trying to encapsulate a 1280 byte packet in
1280 byte MTU just has to do some fragmentation, see section 7.1 (b).
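As a rough illustration of that decision, here is a small Python sketch of how a tunnel entry point could choose between the cases, following my reading of RFC2473 section 7.1. It assumes plain IPv6-in-IPv6 with a 40-byte outer header and no extension headers; the function name and the returned strings are made up for the example.

```python
IPV6_MIN_MTU = 1280   # mandatory minimum link MTU for IPv6
IPV6_HEADER = 40      # outer IPv6 header added by the tunnel (assumed, no options)


def tunnel_entry_action(inner_len: int, path_mtu: int,
                        tunnel_header_len: int = IPV6_HEADER) -> str:
    """Decide what the entry point does with an inner packet of inner_len bytes."""
    tunnel_mtu = path_mtu - tunnel_header_len
    if inner_len <= tunnel_mtu:
        return "encapsulate"
    if inner_len > IPV6_MIN_MTU:
        # Case (a): source is told to send smaller packets.
        advertised = max(tunnel_mtu, IPV6_MIN_MTU)
        return f"discard, send Packet Too Big (MTU {advertised})"
    # Case (b): inner packet is at most 1280 bytes but does not fit.
    return "encapsulate, then fragment the tunnel packet"
```

With a 1280-byte path MTU the tunnel MTU is 1240, so a 1280-byte inner packet falls into case (b), while a 1500-byte one gets a Packet Too Big advertising 1280.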

..........

Hi Pekka and others,

> Please send comments to me by the end of this week, either on- or
> off-list, as you deem appropriate.

With the risk of stating the obvious I would say that normally, PMTUD
should do the trick. [...]

For some (mostly host-based) tunnels, yes. But the point is that if
you insert such a tunnel in the middle of the network, where you have
e.g. Internet traffic from millions of nodes passing through in both
directions, just counting on PMTUD would require that your network
originate billions of Packet Too Big messages each day, and depend
on the fact that the users have not blocked the ICMPs. Further, there
are also passive monitoring applications (like wiretaps) where you
DON'T want anyone to know something "fishy" is going on.

So, in practice, I fail to see how PMTUD or the like would really work
in the more generic environments than just host-based or "last-hop"
tunnels.

Unfortunately yes. In fact, I quite recently found a problem in
Riverstone's SSR2000s, which just drop host-unreachables on
tcp-loadbalanced connections.. However, we still need to ask the
question "do we keep finding workarounds for other people's poor
administration/implementation of technical solutions?"..

The technical solution for MTU problems is Path MTU Discovery. If a
vendor fails to implement it, one should not buy its equipment. If an
end-user breaks his own connectivity, he/she needs education.

That would be the ideal world. In the less-than ideal world we have to
find a way to defeat the cluelessness (excuse the language) of vendors
and end-users.

You mean something like Packetization Layer Path MTU Discovery (PLPMTUD)?

http://www.ietf.org/internet-drafts/draft-ietf-pmtud-method-02.txt

http://www.psc.edu/~mathis/MTU/pmtud/

Sam

Sam Stickland wrote:

Sabri Berisha wrote:

Hi Pekka and others,

Please send comments to me by the end of this week, either on- or
off-list, as you deem appropriate.

With the risk of stating the obvious I would say that normally, PMTUD
should do the trick.

On today's Internet everything is more reliable than PMTUD.

How about replacing it completely with something more inband, less prone to firewall breakage?

You mean something like Packetization Layer Path MTU Discovery (PLPMTUD)?

http://www.ietf.org/internet-drafts/draft-ietf-pmtud-method-02.txt

http://www.psc.edu/~mathis/MTU/pmtud/

Sam

Thanks for raising this to the forefront. I had been aware of this I-D in a previous form; it is also referenced in the I-D linked in the parent post.

It's a very ingenious mechanism to allow discovery while still delivering packets, and it looks like a big improvement over what we live with now.

--Downsides of the I-D, most of which apply to the current PMTUD as well
* it's pretty complex and needs to be re-implemented in every L4 protocol.
* data delivery can be interrupted pending retransmission of dropped probe packets (if not sent concurrently)
* data packets can only be sent concurrently in different sized packets if the L4 layer supports detecting duplicate data
* it does not operate on the layer it is meant to interrogate. IOW -- it's an L4 protocol feature concerned with L3 properties
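To give a feel for the probing idea, here is a deliberately simplified Python sketch: search for the largest packet size that gets through, treating a lost probe as "too big". This is not the algorithm from the I-D, just a plain binary search; `send_probe` is a hypothetical callback standing in for "send a probe of this size and report whether it was acknowledged".

```python
from typing import Callable


def probe_search(send_probe: Callable[[int], bool], low: int, high: int) -> int:
    """Return the largest size in [low, high] that send_probe accepts.

    Assumes a packet of size `low` is already known to get through and that
    every size up to the path MTU succeeds (no random loss).
    """
    while low < high:
        mid = (low + high + 1) // 2
        if send_probe(mid):
            low = mid          # probe arrived: the path carries at least mid
        else:
            high = mid - 1     # probe lost: assume it was too big
    return low
```

For a simulated path that silently drops anything over 1476 bytes, `probe_search(lambda size: size <= 1476, 576, 1500)` settles on 1476. A real implementation also has to tell congestion loss from size-related loss, which is part of the complexity noted in the list above.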

Other ideas I mentioned that may very well be unworkable or naive.
I would appreciate any pointers to any prior discussion for any of them.

All these do NOT need to set the DF bit.

* A probing mechanism that does not turn on the DF bit would not interrupt data flow with dropped probes. The protocol would need to support being informed by the remote site of the max payload size received. It can then use this as the outgoing value, or as an indication to fall back to a previous value and/or reset a timer for when to try a higher packet size again. Except for spoofing concerns this naturally belongs in the L3 protocol. A cookie option might mitigate spoofing concerns.

This could be implemented in an L3 or L4 protocol. An L3 protocol implementation could leave the upper L4 protocol the decision to turn the L3 one off, turn its own mechanism off, or use both.

One gotcha: hops that optimize by fragmenting into equal-sized (or otherwise sized) packets that do not clearly correspond to the actual link MTU. An implementation would need heuristics to catch this, instead of merely using the returned value.
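A small Python sketch of that gotcha, under the assumption of IPv4 with 20-byte headers: the same packet crossing the same link produces very different "largest fragment seen" values depending on whether the hop fills each fragment to the MTU or splits evenly. Both function names are invented for the example.

```python
import math
from typing import List


def _split(payload_len: int, chunk: int, header_len: int) -> List[int]:
    """Return on-the-wire fragment sizes when payload is cut into chunk bytes."""
    sizes = []
    remaining = payload_len
    while remaining > 0:
        take = min(chunk, remaining)
        sizes.append(take + header_len)
        remaining -= take
    return sizes


def greedy_sizes(payload_len: int, mtu: int, header_len: int = 20) -> List[int]:
    """Fill every fragment up to the link MTU; the last one takes the rest."""
    return _split(payload_len, (mtu - header_len) // 8 * 8, header_len)


def balanced_sizes(payload_len: int, mtu: int, header_len: int = 20) -> List[int]:
    """Use the same number of fragments, but make them roughly equal."""
    max_chunk = (mtu - header_len) // 8 * 8
    count = math.ceil(payload_len / max_chunk)
    chunk = math.ceil(payload_len / count / 8) * 8  # keep the 8-byte alignment
    return _split(payload_len, chunk, header_len)
```

For a 1480-byte payload over a 1400-byte link, the greedy split yields fragments of 1396 and 124 bytes, while the balanced split yields 764 and 756, so a receiver reporting the largest fragment would underestimate the link MTU by almost half.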

* A protocol that is dedicated completely to path MTU discovery would be a nice addition to the stack's toolbox and would be fairly useful for any protocol on the stack that does not have its own method, or for some reason cannot trust its own method's results, or just wants a second opinion.
This is out-of-band enough that successful or unsuccessful operation should not affect the main traffic flow of interest. A UDP protocol would need to use cookie values to prevent easy spoofs. Heuristics might also be necessary.

* An IP option that, when present, triggers a new ICMP message, "Fragmented and Delivered", with frag size and link size as values. A returned cookie or packet header contents would minimize spoofs.

* The above without the new IP option.

It now occurs to me that I should take this over to the WG.......oh well. I have already written it. Sorry for the BW.