IP Fragmentation

Glen_Kent · August 20, 2008, 4:13pm

Hi,

Do transit routers in the wild actually get to do IP fragmentation
these days? I was wondering if routers actually do it or not, because
the source usually discovers the path MTU and sends its data with the
least supported MTU. Is this true?

Even if this is true, it would break for multicast IP. The source
cannot determine which receivers will be interested in the traffic
or what capacities the links connecting them support. So a
source would send IP packets of some size, and there's a chance that
one of the routers *may* have to fragment those IP packets before
passing them on to the next router.

I would wager that the vendors and operators would want to avoid IP
fragmentation since that's usually done in SW (unless you've got a very
powerful ASIC or your box is NP based).

Thanks,
Glen

Jim_Logajan · August 20, 2008, 5:19pm

Glen Kent wrote:

Do transit routers in the wild actually get to do IP fragmentation
these days? I was wondering if routers actually do it or not, because
the source usually discovers the path MTU and sends its data with the
least supported MTU. Is this true?

I believe that is only true for TCP over IPv4. UDP over IPv4 per se doesn't involve any path MTU discovery. Some UDP applications may in fact attempt MTU discovery and self-limit the size of their packets, but that's not part of the UDP protocol.

A hypothetical "real world" example of where very large UDP packets might occur is SNMP. An SNMP "get" or "set" operation generally has to fit inside a single UDP datagram, and UDP allows datagrams of up to 64 KB. If an SNMP object value is a really long string (say 2000 bytes), it will exceed the 1500-byte MTU that most Ethernet interfaces expect, so I believe fragmentation will occur at the originating system. On the other hand, some systems support Ethernet jumbo frames, so I believe it is possible that a default gateway router would be the first network element forced to fragment the datagram.

IPv6 is a different (and more complex) story of course - fragmentation is only supposed to occur on end points - even for UDP.

Quick experiment you can try if you have a Unix-like system handy: use ping (and/or ping6 or an IPv6-aware ping) and supply it with a "-s" data size parameter of, say, 2000. That makes a larger-than-normal packet that can't fit into a standard Ethernet frame. Use wireshark or ethereal to see what happens. If your Ethernet cards support jumbo frames, use the mtu parameter of ifconfig to set the MTU larger than 1500. Repeat the experiment, sending the large pings to both local and remote systems.

Even if this is true, it would break for multicast IP. The source
cannot determine which receivers will be interested in the traffic
or what capacities the links connecting them support. So a
source would send IP packets of some size, and there's a chance that
one of the routers *may* have to fragment those IP packets before
passing them on to the next router.

I would wager that the vendors and operators would want to avoid IP
fragmentation since that's usually done in SW (unless you've got a very
powerful ASIC or your box is NP based).

I'm not sure how to address the above points since there appear to be some incorrect assumptions at play. It all depends on whether the Don't Fragment (DF) bit is set in IPv4 and how the source application responds to any resulting ICMP error responses (if the DF is set and one of the routes requires fragmentation).
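
As a rough illustration of why the DF bit matters, here is a minimal Python sketch of the choice an IPv4 router faces when a packet is headed for a link. The names `Action` and `forwarding_action` are made up for this example and do not come from any router implementation.

```python
from enum import Enum


class Action(Enum):
    FORWARD = "forward unchanged"
    FRAGMENT = "fragment, then forward"
    DROP_AND_SIGNAL = "drop, send ICMP type 3 code 4 (frag needed)"


def forwarding_action(packet_len: int, df_set: bool, egress_mtu: int) -> Action:
    """Decide what to do with an IPv4 packet bound for a link with egress_mtu."""
    if packet_len <= egress_mtu:
        return Action.FORWARD
    if df_set:
        # Too big and the sender forbade fragmentation: drop the packet and
        # tell the source, which is the signal PMTU discovery relies on.
        return Action.DROP_AND_SIGNAL
    # Too big but fragmentation is allowed: the router splits the packet.
    return Action.FRAGMENT


print(forwarding_action(2028, df_set=False, egress_mtu=1500).value)
print(forwarding_action(2028, df_set=True, egress_mtu=1500).value)
```

What the source does next is up to the host: it only learns about the problem in the DF case, and only if the ICMP message actually reaches it.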

Leo_Bicknell1 · August 20, 2008, 5:24pm

In a message written on Wed, Aug 20, 2008 at 09:43:44PM +0530, Glen Kent wrote:

Do transit routers in the wild actually get to do IP fragmentation
these days? I was wondering if routers actually do it or not, because
the source usually discovers the path MTU and sends its data with the
least supported MTU. Is this true?

Yes.

A GigE jumbo frames host (9120) to a standard POS interface (4420)
to a DS3 customer (1500) happens, and the GigE->POS and POS->DS3
routers must both do fragmentation.
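
To make the two-stage case concrete, the following Python sketch fragments one packet at each hop. It assumes the three numbers above are IP MTUs and that the IPv4 header is 20 bytes with no options; the function names are illustrative only.

```python
IP_HEADER_LEN = 20  # assumption: IPv4 header without options


def fragment(offset, length, more, mtu):
    """Split one (offset, payload_length, more_fragments) piece to fit mtu.

    Offsets and lengths count bytes of the original datagram's payload.
    Every piece except the last must carry a multiple of 8 bytes.
    """
    max_payload = (mtu - IP_HEADER_LEN) // 8 * 8
    if length <= max_payload:
        return [(offset, length, more)]
    pieces = []
    pos = 0
    while pos < length:
        size = min(max_payload, length - pos)
        last = pos + size == length
        # The final piece inherits the MF flag of the piece being split.
        pieces.append((offset + pos, size, more if last else True))
        pos += size
    return pieces


def forward(pieces, mtu):
    """Pass a list of pieces across a link, fragmenting where needed."""
    out = []
    for offset, length, more in pieces:
        out.extend(fragment(offset, length, more, mtu))
    return out


original = [(0, 9120 - IP_HEADER_LEN, False)]  # one 9120-byte packet
after_pos = forward(original, 4420)            # GigE -> POS router
after_ds3 = forward(after_pos, 1500)           # POS -> DS3 router
print(len(after_pos), "fragments on the POS link")
print(len(after_ds3), "fragments on the DS3 link")
```

Under these assumptions the packet becomes 3 fragments on the POS link and 7 on the DS3 link, and the second router is re-fragmenting fragments, so some pieces on the last hop are smaller than the link would allow.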

I would wager that the vendors and operators would want to avoid IP
fragmentation since that's usually done in SW (unless you've got a very
powerful ASIC or your box is NP based).

As far as I know the "big" routers all do it in hardware with no real
performance penalty, but I haven't studied it in detail.

Jim_Shankland1 · August 20, 2008, 5:48pm

Leo Bicknell wrote:

In a message written on Wed, Aug 20, 2008 at 09:43:44PM +0530, Glen Kent wrote:

Do transit routers in the wild actually get to do IP fragmentation
these days? [...]

Yes.

A GigE jumbo frames host (9120) to a standard POS interface (4420)
to a DS3 customer (1500) happens, and the GigE->POS and POS->DS3
routers must both do fragmentation.

From the application (as opposed to network operator) point of view,
the big problem with fragmentation is that if you lose one fragment
in transit, all the fragments eventually get discarded, even if they've
made it all the way to the destination. This hurts performance and
wastes resources. So you may be better off not sending those jumbo
frames in the first place.

If your packet loss rate, end-to-end, is epsilon, and epsilon is so
small that even several times epsilon is negligible, then maybe you
don't care. But you're clearly now relying on a higher standard of
performance from the network fabric than you otherwise would be.
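
The "several times epsilon" point can be checked with a few lines of Python. This is a sketch that assumes each fragment is lost independently with the same probability, which real networks (with bursty loss) do not guarantee.

```python
def datagram_loss(fragment_loss: float, fragments: int) -> float:
    """Probability that at least one of `fragments` fragments is lost.

    Assumes independent loss per fragment; losing any one fragment means
    the whole datagram is eventually discarded at reassembly.
    """
    return 1.0 - (1.0 - fragment_loss) ** fragments


for n in (1, 2, 6):
    print(f"{n} fragment(s): {datagram_loss(0.01, n):.4f}")
```

For small epsilon the result is close to n times epsilon, so a datagram split into six fragments sees roughly six times the loss rate of an unfragmented one.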

Way back when, before my beard was gray, Sun came out with the Sun-4
servers, based on the new SPARC architecture. These were then widely
deployed as NFS servers for Sun-3 desktops. The default NFS blocksize
was 8K, the default (maybe only) transport was UDP. Sun-3 would make
a read request, Sun-4 would send an 8K+ UDP response, which would
get fragmented into a burst of 6 IP fragments, Sun-3 would get
the first 3 or 4 before falling behind (this was, after all, the
blistering fast 10 megabit Ethernet) and dropping a fragment.
Eventually, the reassembly would time out, all the received fragments
would get discarded, NFS would resend ... lather, rinse, repeat.
Setting the NFS read and write sizes to 1460 fixed this by avoiding
fragmentation.

This concludes today's presentation from the history channel.

Jim Shankland

Valdis_Kletnieks · August 20, 2008, 6:04pm

Hypothetically true. Unfortunately, enough places do bozo firewalling and drop
the ICMP Frag Needed packets to severely limit the utility of PMTU Discovery.

Iljitsch_van_Beijnum · August 20, 2008, 6:09pm

Yet all OSes have it enabled and there is no fallback to fragmentation in PMTUD: if your system doesn't get the ICMP messages, your session is dead in the water.

John_Lee1 · August 20, 2008, 6:10pm

Glen,

With the v4 networks that I have worked on in the past, they did not do end to end MTU discovery before sending packets. The TTL had to be set appropriately so that if you had low speed links, for example, the packet and response would get through in time. On our DS3 (T3) and OC-3c packet links we did 4k, 9k, and 16k packet sizes for video and file transfers.

At the other end of the spectrum are civilian and military systems with tactical links, both wired and radio, with low bit rates and header compression on IP and TCP packets. Speeds range from 300 to 9,600 bps, plus 16k, 32k, 64k and Nx64k bps links that can do packet fragmentation and add proprietary ECC codes for the radio links. Some systems strip the IP packet and use standard or non-standard link layer protocols across the mediums. Some of these systems are store and forward, so the computer/router connected to the low speed link will ack the packet for the high speed network connection and buffer it until it can be sent on the lower speed system.

IMHO current IPv6 protocols ignore the lower end segment by specifying that the lowest link MTU be the MTU for the entire circuit and by not allowing fragmentation in the network. I do not see this as an efficient use of high speed network resources, and local link management can handle fragmentation just fine.

John (ISDN) Lee

A slightly different History Channel.

Colin · August 20, 2008, 6:57pm

Well obviously, ICMP is only used by hackers to DDoS you. Everyone knows that, especially all the banks. It's even more important to obliterate PMTU discovery when you're using HTTPS - for security, you know.

Sorry, I spent the better part of today bashing my head against the wall trying to fix MSS and PMTU issues somewhere, which were being aggravated by the tragic programming of the Linux l2tpns package...

Tim_Sanderson1 · August 20, 2008, 8:37pm

The "network" may not but the end hosts may try. Many client operating systems perform PMTU discovery by default. Some also do black hole probing, which can also change the MTU.

Sam_Stickland1 · August 20, 2008, 10:07pm

Iljitsch van Beijnum wrote:

Fernando_Gont1 · August 25, 2008, 10:27am

IPv4 minimum MTU is 68 bytes, not 536. 536 is the minimum fragment re-assembly buffer size. Falling back to 536-byte packets does not guarantee that sessions will be kept up.

Kind regards,

Iljitsch_van_Beijnum · August 25, 2008, 10:56am

http://support.microsoft.com/kb/925280

IPv4 minimum MTU is 68 bytes,

That's kind of like "a human being can live without food for four to six weeks". It's not a recommendation.

536 is the minimum fragment re-assembly buffer size. Falling back to 536-byte packets does not guarantee that sessions will be kept up.

But:

"PMTU black hole router detection is triggered on a TCP connection when TCP starts retransmitting full-sized segments with the DF flag set. TCP resets the PTMU for the connection to 536 bytes. Then, TCP retransmits its segments when the DF flag is clear."

Simon_Leinen · August 26, 2008, 9:40pm

Sam Stickland writes:

Iljitsch van Beijnum wrote:

Yet all OSes have it enabled and there is no fallback to
fragmentation in PMTUD: if your system doesn't get the ICMP
messages, your session is dead in the water.

Windows Vista/2008 has black hole detection enabled by default. It's
not massively elegant, but it will keep sessions up (falls back to
536 byte MTU).

http://support.microsoft.com/kb/925280
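
The fallback described in that KB article (quoted earlier in the thread) can be modelled as a tiny state machine. This Python sketch is not Microsoft's implementation: the `Connection` class is invented for illustration, and the retransmission threshold is a hypothetical value because the quoted text does not give one.

```python
from dataclasses import dataclass

FALLBACK_PMTU = 536        # value taken from the quoted KB text
RETRANSMIT_THRESHOLD = 2   # hypothetical; the quote gives no number


@dataclass
class Connection:
    pmtu: int = 1500
    df: bool = True
    full_size_retransmits: int = 0

    def on_retransmit(self, segment_was_full_size: bool) -> None:
        """Called each time a segment has to be retransmitted."""
        if not (self.df and segment_was_full_size):
            return
        self.full_size_retransmits += 1
        if self.full_size_retransmits >= RETRANSMIT_THRESHOLD:
            # Suspected PMTU black hole: shrink the PMTU and stop setting DF
            # so that routers on the path may fragment if they need to.
            self.pmtu = FALLBACK_PMTU
            self.df = False


conn = Connection()
conn.on_retransmit(segment_was_full_size=True)
conn.on_retransmit(segment_was_full_size=True)
print(conn.pmtu, conn.df)
```

The point of clearing DF at the same time is that the session can survive even if 536 bytes is still too large for some link on the path.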

Note that there's a new IETF specification (RFC 4821) for
("Packetization Layer") Path MTU discovery, which doesn't rely on ICMP
messages to work. If what I wrote here

http://kb.pert.geant2.net/PERTKB/PathMTU

is correct, this has been implemented in recent (>= 2.6.17) Linux
kernels. I don't know of any other OSes that have this yet - not that
they'd tell me (but they could go and edit the page above, that's why
it's a Wiki).
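
The general idea of discovering the path MTU without ICMP can be sketched as a search driven only by whether probes are acknowledged. The Python below is a simplified illustration, not the RFC 4821 algorithm or the Linux code: `probe` is a hypothetical callback that sends one packet of the given size and reports whether the peer acknowledged it.

```python
from typing import Callable


def search_pmtu(probe: Callable[[int], bool], low: int, high: int) -> int:
    """Return the largest size in [low, high] for which probe(size) succeeds.

    `low` must already be known to work. No ICMP feedback is consulted;
    a lost probe is simply treated as "too big".
    """
    while low < high:
        mid = (low + high + 1) // 2
        if probe(mid):
            low = mid        # probe got through, raise the floor
        else:
            high = mid - 1   # probe lost, lower the ceiling
    return low


# Simulated path whose real MTU is 1400 and which silently drops anything
# larger, as a firewall that filters ICMP would make it appear.
print(search_pmtu(lambda size: size <= 1400, low=536, high=9000))
```

A real implementation also has to distinguish a probe lost to congestion from one lost to size, which this sketch ignores.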

Glen_Kent · August 28, 2008, 11:44pm

I'm not sure how to address the above points since there appear to be some
incorrect assumptions at play. It all depends on whether the Don't Fragment
(DF) bit is set in IPv4 and how the source application responds to any
resulting ICMP error responses (if the DF is set and one of the routes
requires fragmentation).

OK, so what happens if a transit router does not support IP
fragmentation and it receives a packet which is bigger than the
outgoing link's MTU. Should it simply drop the packet or proactively
send an ICMP Dest Unreachable error (Frag required) to the peer?

I understand that routers usually must send this error only when
fragmentation is required and they receive a packet with the DF bit set.
However, in this case the router would drop the packet (because it doesn't
support fragmentation), so sending an ICMP error back to the host,
warning it that its packets will get dropped, seems to be the better
option.

OTOH, what do most implementations do if they send a regular IP
packet and receive an ICMP dest unreachable - Fragmentation reqd
message back? Do they fragment this packet and then send it out, or
is this message silently ignored?

Glen

Fernando_Gont1 · August 28, 2008, 11:46pm

You may want to have a look at this IETF I-D: http://www.gont.com.ar/drafts/icmp-attacks/draft-ietf-tcpm-icmp-attacks-03.txt. The PMTUD modification described in the draft ships (at least) in OpenBSD and NetBSD.

Thanks!

Kind regards,

tony1athome · August 28, 2008, 11:57pm

OK, so what happens if a transit router does not support IP
fragmentation

All IPv4 routers are supposed to support fragmentation per RFC 1812 (Router
Requirements), section 4.2.2.7.

Tony

Glen_Kent · August 29, 2008, 12:14am

I understand, but the question is what if they don't?

Or let me rephrase the question.

What do standard implementations do if they send a regular IP packet
(no DF bit set) and receive an ICMP dest unreachable - Fragmentation
reqd message back? Do they fragment this packet and then send it out
again with the MTU reported in the ICMP error message, or is the ICMP
error message silently ignored?

Glen

Iljitsch_van_Beijnum · August 29, 2008, 9:18am

Then the internet breaks.

Valdis_Kletnieks · August 29, 2008, 5:19pm

I understand, but the question is what if they don't?

If it's an alleged router, and it doesn't know how to frag a packet, it's
probably so brain-damaged that it can't send a recognizable 'Frag Needed'
ICMP back either. At that point, all bets are off...

What do standard implementations do if they send a regular IP packet
(no DF bit set) and receive an ICMP dest unreachable - Fragmentation
reqd message back? Do they fragment this packet and then send it out
again with the MTU reported in the ICMP error message, or is the ICMP
error message silently ignored?

A quick perusal of the current Linux 2.6 net/ipv4/icmp.c source says this:

        case ICMP_FRAG_NEEDED:
                if (ipv4_config.no_pmtu_disc) {
                        LIMIT_NETDEBUG(KERN_INFO "ICMP: " NIPQUAD_FMT ": "
                                                 "fragmentation needed "
                                                 "and DF set.\n",
                                       NIPQUAD(iph->daddr));
                } else {
                        info = ip_rt_frag_needed(net, iph,
                                                 ntohs(icmph->un.frag.mtu),
                                                 skb->dev);

In other words, if we're configured to do PMTU discovery, we cut back the MTU,
and if PMTUD is disabled, we make a note in the kernel log that something odd
happened and keep going. Note that it's by definition "odd", because if PMTUD
is disabled, we didn't *send* a packet with the DF bit set, so any ICMP error
complaining about a DF bit we didn't set is considered spurious.
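
For readers who want to see where that MTU value comes from on the wire, here is a small Python sketch that pulls the next-hop MTU out of an ICMP "fragmentation needed" message (type 3, code 4, with the MTU in the last two bytes of the 8-byte ICMP header as laid out in RFC 1191). The function name is made up, and checksum validation is deliberately skipped.

```python
import struct

ICMP_DEST_UNREACH = 3
ICMP_FRAG_NEEDED = 4


def next_hop_mtu(icmp: bytes):
    """Return the next-hop MTU from an ICMP 'frag needed' message, else None."""
    if len(icmp) < 8:
        return None
    icmp_type, code, _checksum, _unused, mtu = struct.unpack("!BBHHH", icmp[:8])
    if icmp_type != ICMP_DEST_UNREACH or code != ICMP_FRAG_NEEDED:
        return None
    # A value of 0 means the router did not report an MTU (pre-RFC 1191).
    return mtu


# Hand-built header for illustration; the checksum is left at 0 for brevity.
sample = struct.pack("!BBHHH", 3, 4, 0, 0, 1400)
print(next_hop_mtu(sample))
```

The `"!"` in the format string selects network byte order, which does the same job as the `ntohs()` call in the kernel excerpt above.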