IP Fragmentation - Not reliable over the Internet?

This is what I'm concerned about:

"""
1. If I originate IP packet fragments, such as an 8000 byte NFS packet
broken into 1500 byte fragments, what's the probability of some host
before the other endpoint dropping one or all of those fragments?
"""

For wide area NFS I would be using TCP not UDP. If you can't use
TCP you should ensure that the firewalls at both ends pass fragmented
UDP packets. NFS is generally not open to the world so fragmentation
and NFS is essentially a local issue. Fragments don't get routinely
dropped in the core.

Ensure that the firewalls at both ends pass ICMP/ICMPv6 PTB. Only
idiots block all ICMP/ICMPv6. Yes there are a lot of idiots in the
world.
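
To put rough numbers on the question being answered here, a minimal
sketch in Python. It assumes IPv4 with a 20-byte header and no options,
and it assumes each fragment is lost independently with the same
probability; neither assumption comes from the thread, and the function
names are illustrative only.

```python
import math


def fragment_count(ip_payload_len: int, mtu: int = 1500, ip_header_len: int = 20) -> int:
    """Number of IPv4 fragments needed to carry ip_payload_len bytes."""
    # Every fragment except the last must carry a multiple of 8 payload bytes.
    per_fragment = (mtu - ip_header_len) // 8 * 8
    return math.ceil(ip_payload_len / per_fragment)


def delivery_probability(fragments: int, per_packet_loss: float) -> float:
    """Chance the whole datagram arrives, assuming independent loss per fragment."""
    # Reassembly fails if any single fragment is missing.
    return (1.0 - per_packet_loss) ** fragments


if __name__ == "__main__":
    n = fragment_count(8000)
    print(n, delivery_probability(n, 0.01))
```

Under those assumptions an 8000 byte payload needs 6 fragments, and a
1% per-packet loss rate leaves roughly a 94% chance that the whole
datagram arrives, against 99% for a single unfragmented packet.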

This is what I'm concerned about:

"""
1. If I originate IP packet fragments, such as an 8000 byte NFS packet
broken into 1500 byte fragments, what's the probability of some host
before the other endpoint dropping one or all of those fragments?
"""

For wide area NFS I would be using TCP not UDP. If you can't use
TCP you should ensure that the firewalls at both ends pass fragmented
UDP packets. NFS is generally not open to the world so fragmentation
and NFS is essentially a local issue. Fragments don't get routinely
dropped in the core.

However, passing fragmented UDP packets has its own (undesirable)
set of security implications.

Of course running NFS over an unencrypted path in the wild is, well,
something with an additional (undesirable) set of security implications.
(IOW, this should be happening inside a VPN)

Ensure that the firewalls at both ends pass ICMP/ICMPv6 PTB. Only
idiots block all ICMP/ICMPv6. Yes there are a lot of idiots in the
world.

+1 This cannot be stressed enough.

Owen
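
For anyone writing the filter, a small sketch of what "pass PTB" means
in terms of ICMP type and code. The type/code values are the standard
ones (ICMPv4 type 3 code 4, "fragmentation needed and DF set", and
ICMPv6 type 2, "Packet Too Big"); the function name and its arguments
are illustrative and not taken from any particular firewall.

```python
ICMPV4_DEST_UNREACHABLE = 3   # ICMPv4 type
ICMPV4_FRAG_NEEDED = 4        # ICMPv4 code under type 3
ICMPV6_PACKET_TOO_BIG = 2     # ICMPv6 type


def is_packet_too_big(ip_version: int, icmp_type: int, icmp_code: int) -> bool:
    """True if the ICMP message is the one Path MTU Discovery depends on."""
    if ip_version == 4:
        return icmp_type == ICMPV4_DEST_UNREACHABLE and icmp_code == ICMPV4_FRAG_NEEDED
    if ip_version == 6:
        # ICMPv6 Packet Too Big is identified by its type alone.
        return icmp_type == ICMPV6_PACKET_TOO_BIG
    return False
```

A rule set that drops everything for which this returns True is the
"block all ICMP" case being complained about.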

Mark Andrews wrote:

Ensure that the firewalls at both ends pass ICMP/ICMPv6 PTB. Only
idiots block all ICMP/ICMPv6. Yes there are a lot of idiots in the
world.

The worst idiots are people who designed ICMPv6 [RFC2463] as:

         (e.2) a packet destined to an IPv6 multicast address (there are
               two exceptions to this rule: (1) the Packet Too Big
               Message - Section 3.2 - to allow Path MTU discovery to
               work for IPv6 multicast, and (2) the Parameter Problem
               Message, Code 2 - Section 3.4 - reporting an unrecognized
               IPv6 option that has the Option Type highest-order two
               bits set to 10), or

which makes it necessary, unless you are idiots, to filter ICMPv6
PTB against certain packets, including but not limited to,
multicast ones.

            Masataka Ohta

In a study using the RIPE Atlas probes, we used a heuristic to
figure out where the fragments were dropped. For the Atlas probes
where IP fragments did not arrive, there is a high likelihood that
the problem is with the last hop to the Atlas probe.

i wonder if this is correlated with the high number of probes being
behind nats.

randy

That would be a viable explanation, although we have not tried to
fingerprint the probes to figure out if this was true.

If we rerun the experiments in the future, we should spend more
effort on identifying the router/middlebox that is causing the IP
fragmentation problems (drops or blocking PMTUD ICMP).

-- Benno

Maybe this provides a bit of insight:

From a test last week from all RIPE Atlas probes to a single "known
good" MTU 1500 host I compared probes where I had both a ping test with
ipv4.len 1020 and ipv4.len 1502.

behind NAT probes: 12% 1020 bytes ping worked while 1502 failed
non-NATted probes:  6% ""

hth,
Emile Aben
RIPE NCC

this needs publication on your adventure game of a web site, please. it
will seriously 'inform' some discussion going back and forth on ietf
lists.

randy

could you please test with ipv6?

thanks!

randy

This is what I see for various IPv6 payloads (large ICMPv6 echo
requests) from all RIPE Atlas probes that were available at the time to
a single "known good" MTU 1500 destination:

plen  fail%  nr_probes
 100   9.64       1266
 500   9.34       1039
1000   9.94       1298
1240   9.94       1308
1241  11.62       1300
1440  12.70        890
1441  14.70       1306
1460  15.18       1304
1461  19.84       1290
1462  22.02       1294

plen: IPv6 payload length (ie. not including 40byte IPv6 header)
fail%: percentage of probes that didn't get any of the 5 pkts that were
sent. Note that there is a large baseline failure rate in IPv6 on RIPE
Atlas probes [1], which would explain the ~10% failure rate for the
smaller packets.
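
To make the steps in the table easier to see, here is a small Python
sketch that converts plen to total packet size (plen + 40) and subtracts
a baseline. Using the plen 100 row as the baseline is an assumption made
for illustration, not part of the measurement; the numbers themselves
are copied from the table above.

```python
IPV6_HEADER_LEN = 40

# (plen, fail%, nr_probes), copied from the table above.
RESULTS = [
    (100, 9.64, 1266), (500, 9.34, 1039), (1000, 9.94, 1298),
    (1240, 9.94, 1308), (1241, 11.62, 1300), (1440, 12.70, 890),
    (1441, 14.70, 1306), (1460, 15.18, 1304), (1461, 19.84, 1290),
    (1462, 22.02, 1294),
]


def excess_failure(results, baseline_plen=100):
    """Return (total packet size, fail% above the baseline row) per row."""
    baseline = next(fail for plen, fail, _ in results if plen == baseline_plen)
    return [
        (plen + IPV6_HEADER_LEN, round(fail - baseline, 2))
        for plen, fail, _ in results
    ]


if __name__ == "__main__":
    for size, excess in excess_failure(RESULTS):
        print(f"{size:5d} bytes  {excess:+6.2f} points over baseline")
```

Seen this way, the failure rate steps up at total sizes of 1281, 1481
and 1501 bytes, i.e. just past 1280 (the IPv6 minimum link MTU), 1480
and 1500. The 1480 step would fit a path with 20 bytes of tunnel
overhead, but that is a guess and not something the data confirms.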

I plan to do more analysis and start writing this up on RIPE Labs over
the next few days.

cheers,
Emile Aben
RIPE NCC

[1]
https://labs.ripe.net/Members/stephane_bortzmeyer/how-many-atlas-probes-believe-they-have-ipv6-but-are-wrong

I mostly agree. I will argue that the actual path of an IP datagram is end to end, so the question is not the core, but the end to end path.

That said, with today's congestion control algorithms, TCP does pretty badly with an other-than-negligible loss rate, so a path that is usable end to end has a negligible loss rate, and fragments have a negligible probability of being dropped. The probability of sending a fragmented message and having it arrive at the intended destination is therefore only negligibly smaller than the probability of sending an unfragmented message and having it arrive.

The primary argument against that is firewall behavior, in which firewalls are programmed to drop fragments with high probability.

If we had a protocol that sat atop IP and did what fragmentation does that we could expect all non-TCP/SCTP protocols to use, I would have a very different viewpoint. But, playing the ball where it lies, the primary change I would recommend would be to support any firewall rule that permitted dropping the first fragment of a fragmented datagram in which the first fragment did NOT include the entire IP header and the entire subsequent header. I would also expect a host to keep a fragment of a datagram no more than some stated number of seconds (I might pick "two"), with express permission to drop it more rapidly should the need arise. I would *not* support a rule that simply dropped fragments, or a protocol change that disallowed them.
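
A rough Python sketch of the rule described above, for IPv4 only. It treats TCP and UDP as the only "subsequent headers" it understands and lets everything else through; that restriction, the function names, and the helper that builds test packets are all assumptions for illustration. The two second timeout is the figure suggested above.

```python
import struct

TCP, UDP = 6, 17
MIN_TRANSPORT_HEADER = {TCP: 20, UDP: 8}  # bytes; illustrative subset only
REASSEMBLY_TIMEOUT_SECONDS = 2.0          # the "two" suggested above


def should_drop_first_fragment(packet: bytes) -> bool:
    """True if a first fragment lacks the full IP or transport header.

    Packets too short to hold their own IPv4 header are also reported.
    """
    if len(packet) < 20:
        return True
    ihl = (packet[0] & 0x0F) * 4
    if ihl < 20 or len(packet) < ihl:
        return True
    (flags_and_offset,) = struct.unpack_from("!H", packet, 6)
    more_fragments = bool(flags_and_offset & 0x2000)
    fragment_offset = flags_and_offset & 0x1FFF
    if fragment_offset != 0 or not more_fragments:
        return False  # not the first fragment of a fragmented datagram
    protocol = packet[9]
    needed = MIN_TRANSPORT_HEADER.get(protocol)
    if needed is None:
        return False  # unknown protocol: this sketch does not judge it
    payload = packet[ihl:]
    if len(payload) < needed:
        return True
    if protocol == TCP:
        # TCP options extend the header; the data offset field gives its length.
        return len(payload) < (payload[12] >> 4) * 4
    return False


def fragment_expired(first_seen: float, now: float) -> bool:
    """True once a held fragment has outlived the reassembly timeout."""
    return now - first_seen > REASSEMBLY_TIMEOUT_SECONDS


def build_ipv4(protocol: int, flags_and_offset: int, payload: bytes) -> bytes:
    """Hypothetical helper: a 20-byte IPv4 header (checksum left 0) plus payload."""
    header = struct.pack(
        "!BBHHHBBH4s4s", 0x45, 0, 20 + len(payload), 1, flags_and_offset,
        64, protocol, 0, bytes([192, 0, 2, 1]), bytes([192, 0, 2, 2]),
    )
    return header + payload
```

A UDP first fragment carrying only 4 bytes past the IP header would be dropped by this check, while one carrying the full 8-byte UDP header would pass. Later fragments are never dropped by it, which matches the distinction drawn above between this rule and one that simply drops fragments.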

If I send a packet out as a legitimate series of fragments, what is the chance
that they will get dropped somewhere in the middle of the path between the
emitting host and the receiving host?

To my thinking, the answer to that question is basically "pretty close to 0 and
if that changes in the core, very bad things will happen."

I mostly agree. I will argue that the actual path of an IP datagram is end to end, so the question is not the core, but the end to end path.

That said, with today's congestion control algorithms, TCP does pretty badly with an other-than-negligible loss rate, so a path that is usable end to end has a negligible loss rate, and fragments have a negligible probability of being dropped. The probability of sending a fragmented message and having it arrive at the intended destination is therefore only negligibly smaller than the probability of sending an unfragmented message and having it arrive.

Yes, the path is end-to-end and things happening near the end-points can be bad for a particular conversation.

My point is that if somewhere in the core starts doing bad things to fragments on a regular basis, it will be very bad
for massive numbers of users and not just the localized damage one would expect from something closer to the edge.

Otherwise, we are saying the same thing.

The primary argument against that is firewall behavior, in which firewalls are programmed to drop fragments with high probability.

Which fortunately tend to be located at the edge and not in the core.

If we had a protocol that sat atop IP and did what fragmentation does that we could expect all non-TCP/SCTP protocols to use, I would have a very different viewpoint. But, playing the ball where it lies, the primary change I would recommend would be to support any firewall rule that permitted dropping the first fragment of a fragmented datagram in which the first fragment did NOT include the entire IP header and the entire subsequent header. I would also expect a host to keep a fragment of a datagram no more than some stated number of seconds (I might pick "two"), with express permission to drop it more rapidly should the need arise. I would *not* support a rule that simply dropped fragments, or a protocol change that disallowed them.

I think I mostly agree, but I'd need to think it through a bit more than I can at the moment.

Owen

I know I'm digging up an old thread here but I've spent some time
analyzing some of the significant changes that Apple has made to the
Facetime protocol, apparently with a huge focus on IP packet size to
avoid fragmentation issues:

http://blog.krisk.org/2013/09/apples-new-facetime-sip-perspective.html

I'm betting they've had HUGE problems with IP+UDP MTU issues over the
last three years...

This is now published on RIPE Labs. For the adventurous:
https://labs.ripe.net/Members/emileaben/ripe-atlas-packet-size-matters

regards,
Emile Aben
RIPE NCC

some hours back, i posted the url to the ietf list arguing frag

thanks a million

randy