IP Fragmentation - Not reliable over the Internet?

Christopher_Palmer · August 27, 2013, 12:01am

I am trolling for information/community wisdom.

What is the probability that a random path between two Internet hosts will traverse a middlebox that drops or otherwise barfs on fragmented IPv4 packets?

If anyone has any data or anecdotes, please feel free to send an off-list email or whatever.

Thanks!

Valdis_Kletnieks · August 27, 2013, 5:02am

The fact you're posting indicates that you already know the practical
answer: "Often enough that you need to take defensive measures".

But there are really several separate questions here:

1) What is the probability that a given path ends up fragging a packet
because it isn't MTU 1500 end-to-end?

2) What is the probability that a frag needed is detected by a router
that then botches it?

2a) What is the probability that the router does it right but the source node
shoots itself in the foot by requesting PMTUD, but then blocks inbound ICMP for
"security reasons"?

3) What is the probability that one router correctly frags a packet, but
a subsequent box (most likely a firewall or target host) botches the
re-assembly or other handling?

4) When confronted with the fact that there's a very high correlation between
the level of technical clue that results in procuring and deploying a broken
device, and the level of technical clue available to resolve the problem
when you try to contact them, what's the appropriate beverage?

Saku_Ytti1 · August 27, 2013, 6:55am

[ytti@ytti.fi ~]% ssh ring ring-all -t90 ping -s 1473 -c2 -w3 ip.fi|pastebinit
http://p.ip.fi/KA7N

[ytti@sci ~]% curl -s http://p.ip.fi/KA7N|grep transmitted|wc -l
224
[ytti@sci ~]% curl -s http://p.ip.fi/KA7N|grep "0 received"|wc -l
10

UUOC wc, but that's how I roll.

224 vantage points, 10 failed.
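
For anyone who wants to repeat the tally without grep, the pipeline above can be mirrored in a few lines of Python. This is only a sketch: the sample text is made up (it is not the real paste), and the regular expression assumes the summary line format printed by Linux iputils ping.

```python
import re

# Assumed summary line format (Linux iputils ping), e.g.
# "2 packets transmitted, 0 received, 100% packet loss, time 2999ms"
SUMMARY = re.compile(r"(\d+) packets transmitted, (\d+) (?:packets )?received")

def tally(output: str) -> tuple[int, int]:
    """Return (vantage points that reported, vantage points with zero replies)."""
    total = failed = 0
    for _sent, received in SUMMARY.findall(output):
        total += 1
        if int(received) == 0:
            failed += 1
    return total, failed

# Hypothetical sample output, one summary line per vantage point.
sample = """\
2 packets transmitted, 2 received, 0% packet loss, time 1001ms
2 packets transmitted, 0 received, 100% packet loss, time 2999ms
2 packets transmitted, 1 received, 50% packet loss, time 1000ms
"""

total, failed = tally(sample)
print(total, failed)

# The figures quoted in the post: 10 failures out of 224 vantage points.
print(f"{10 / 224:.1%}")
```

Parsing the received count as a number avoids the substring match that `grep "0 received"` relies on, which would also count a line such as "10 received" if more than nine packets were sent.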

Owen_DeLong · August 27, 2013, 7:34am

That's a lot of questions he didn't ask.

As I read it, the question he asked is:

If I send a packet out as a legitimate series of fragments, what is the chance
that they will get dropped somewhere in the middle of the path between the
emitting host and the receiving host?

To my thinking, the answer to that question is basically "pretty close to 0 and
if that changes in the core, very bad things will happen."

Owen

Emile_Aben · August 27, 2013, 8:45am

Same tests from RIPE Atlas pings towards
nl-ams-as3333.anchors.atlas.ripe.net today:

48 byte ping: 42 out of 3406 vantage points fail (1.0%)
1473 byte ping: 180 out of 3540 vantage points fail (5.1%)

Of the 180 vantage points that failed for the 1473 byte ping, 142 were
successful in receiving at least 1 reply for the 48 byte ping.

Measurement IDs in RIPE Atlas are 1019675 and 1019676.

Emile Aben
RIPE NCC

Tony_Finch · August 27, 2013, 9:25am

This question is important for large EDNS packets so you'll find some
recent practical investigations from the perspective of people interested
in DNSSEC. For instance, a couple of presentations from Roland van
Rijswijk:

https://ripe64.ripe.net/presentations/91-20120418_-RIPE64-Ljubljana-DNSSEC-_UDP_issues.pdf
http://toronto45.icann.org/meetings/toronto2012/presentation-dnssec-fragmentation-17oct12-en.pdf

Tony.

Jaap_Akkerhuis · August 27, 2013, 10:04am

    This question is important for large EDNS packets so you'll find some
    recent practical investigations from the perspective of people interested
    in DNSSEC. For instance, a couple of presentations from Roland van
    Rijswijk:

    https://ripe64.ripe.net/presentations/91-20120418_-RIPE64-Ljubljana-DNSSEC-_UDP_issues.pdf
    http://toronto45.icann.org/meetings/toronto2012/presentation-dnssec-fragmentation-17oct12-en.pdf

Related to this, and maybe of interest, is the following blog post:
<https://www.nlnetlabs.nl/blog/2013/06/04/pmtud4dns/>.

jaap

Saku_Ytti1 · August 27, 2013, 11:24am

Nice, it's starting to almost sound like data rather than anecdote: both
tests indicate that 4-5% of vantage points have fragmentation issues.

Much larger number than I intuitively had in mind.

Leo_Bicknell1 · August 27, 2013, 2:04pm

I'm pretty sure the failure rate is higher, and here's why.

The #1 cause of fragments being dropped is firewalls. Too many admins configuring a firewall do not understand fragments or how to properly put them in the rules.

Where do firewalls exist? Typically protecting things with public IP space, that is (some) corporate networks and banks of content servers in data centers. This also includes on-box firewalls for Internet servers, ipfw or iptables on the server is just as likely to be part of the problem.

Now, where are RIPE probes? Most RIPE probes are probably either with somewhat clueful ISP operators, or at Internet Clueful engineer's personal connectivity (home, or perhaps a box in a colo). RIPE probes have already significantly self-selected for people who like non-broken connectivity. What's more, the ping test was probably to some "known good" host(s), rather than a broad selection of Internet hosts, so effectively it was only testing the probe end, not both ends.

Basically, I see RIPE probes as an almost best-case scenario for this sort of broken behavior.

I bet the ICSI Netalyzr folks have somewhat better data, perhaps skewed a bit towards broken connections as people run Netalyzr when their connection is broken! I suspect reality is somewhere between those two bookends.

Valdis_Kletnieks · August 27, 2013, 2:33pm

    That's a lot of questions he didn't ask.

This isn't your first rodeo. You should know by now that the question
actually asked, the question *meant* to be asked, and the question that
actually needed answering are often 3 different things.

    If I send a packet out as a legitimate series of fragments, what is the chance
    that they will get dropped somewhere in the middle of the path between the
    emitting host and the receiving host?

    To my thinking, the answer to that question is basically "pretty close to 0 and
    if that changes in the core, very bad things will happen."

Saku Ytti and Emile Aben have numbers that say otherwise. And there must
be a significantly bigger percentage of failures than "pretty close to 0",
or Path MTU Discovery wouldn't have a reputation of being next to useless.

Blake_Dunlap1 · August 27, 2013, 3:00pm

And then you have other issues like networks that arbitrarily set DF on all
packets passing through them. That burnt a good three days of my life back
in the day.

-Blake

Dave_Brockman · August 27, 2013, 5:25pm

It's not just firewalls.... border-routers are also apt to have ACLs
like these[1]:

ip access-list extended BORDER-IN
10 deny tcp any any fragments
20 deny udp any any fragments
30 deny icmp any any fragments
40 deny ip any any fragments

I see these a *LOT* on customer routers, before the packets even get
to the firewall....

Regards,

dtb

1. I found it most recently at
http://hurricanelabs.com/blog/cisco-security-routers/ but I know there
are many other "guides" that include these as part of their ACL.
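
To make concrete what such rules key on, here is a minimal Python sketch that classifies an IPv4 header by its fragmentation fields. The field layout follows RFC 791. The helper `header()` is hypothetical and exists only to build demo input. It is an assumption here, to be checked against the vendor documentation, that the `fragments` keyword matches only non-initial fragments (offset greater than zero).

```python
import struct

MF_FLAG = 0x2000      # "more fragments" bit in the flags/offset field
OFFSET_MASK = 0x1FFF  # 13-bit fragment offset, in units of 8 bytes

def fragment_kind(ipv4_header: bytes) -> str:
    """Classify a packet as 'unfragmented', 'initial' or 'non-initial'."""
    # Bytes 6-7 of the IPv4 header hold 3 flag bits plus the 13-bit offset.
    (flags_and_offset,) = struct.unpack_from("!H", ipv4_header, 6)
    offset = flags_and_offset & OFFSET_MASK
    more = bool(flags_and_offset & MF_FLAG)
    if offset:
        return "non-initial"
    return "initial" if more else "unfragmented"

def header(flags_and_offset: int) -> bytes:
    """Hypothetical helper: build a bare 20-byte header (checksum left at 0)."""
    return struct.pack("!BBHHHBBH4s4s", 0x45, 0, 20, 0,
                       flags_and_offset, 64, 17, 0, bytes(4), bytes(4))

print(fragment_kind(header(0x4000)))         # DF set, no fragmentation
print(fragment_kind(header(MF_FLAG)))        # first fragment of a series
print(fragment_kind(header(MF_FLAG | 185)))  # middle fragment, offset 185 * 8
print(fragment_kind(header(185)))            # last fragment
```

Under that assumption, only the packets classified as non-initial would hit the deny entries. The first fragment would get through, but the receiver still could not reassemble the datagram.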

William_Herrin · August 27, 2013, 5:45pm

Hi Christopher,

I think there might be three rather different questions here:

1. If I originate IP packet fragments, such as an 8000 byte NFS packet
broken into 1500 byte fragments, what's the probability of some host
before the other endpoint dropping one or all of those fragments?

2. If I send an IP packet that's too large for the path and *don't*
set the don't-fragment bit, what's the chance that the router with the
too-small next hop will fail to correctly fragment that packet (or
that the correctly fragmented packet will fall into trap #1 above)?

3. If I send an IP packet that's too large for the path and *do* set
the don't-fragment bit, what's the chance of failing to receive the
"packet too big" message it causes the intermediate router to send?

Are you after the answer to one in particular?

Regards,
Bill Herrin
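
As a worked example for question 1, the sketch below computes how an 8000-byte datagram would be split for a 1500-byte MTU. It assumes the 8000 bytes are the IP payload (UDP header plus NFS data) and that the IPv4 header is 20 bytes with no options. The function name is made up for illustration.

```python
def fragment_layout(payload_len: int, mtu: int = 1500, ip_header: int = 20):
    """Return (offset_bytes, data_bytes, more_fragments) for each fragment."""
    # Every fragment except the last must carry a multiple of 8 data bytes,
    # because the offset field counts in 8-byte units.
    per_fragment = (mtu - ip_header) // 8 * 8
    layout = []
    offset = 0
    while offset < payload_len:
        size = min(per_fragment, payload_len - offset)
        more = offset + size < payload_len
        layout.append((offset, size, more))
        offset += size
    return layout

for offset, size, more in fragment_layout(8000):
    print(f"offset={offset:5d} data={size:4d} on-wire={size + 20:4d} MF={int(more)}")
```

Under these assumptions the datagram becomes six fragments: five carrying 1480 bytes each and a final one carrying 600. Losing any one of them loses the whole datagram, which is why a per-fragment drop rate matters more as the fragment count grows.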

Owen_DeLong · August 27, 2013, 6:47pm

No, their numbers describe what happens to single packets of differing sizes.

Nothing they did describes results of actually fragmented packets.

Owen

Tore_Anderson1 · August 28, 2013, 6:05am

* Owen DeLong

Emile_Aben · August 28, 2013, 9:26am

For Saku: yes. For me: that was my intention, but later I discovered the
Atlas ping does include the ICMP header in its 'size' parameter so what
I did in effect was 1473 + 20 = 1493 (and not the 1501 I intended).

Redid the tests to a "known good" destination where I knew interface MTU
(1500) and could tcpdump which confirmed that I was looking at
fragmentation. I also took an offline recommendation to do different
packet sizes to try to distinguish fragmentation issues from general
corruption-based packet loss.

Results:
size = ICMP packet size, add 20 for IPv4 packet size
fail% = % of vantage points where 5 packets were sent, 0 were received.
#size  fail%  vantage points
100    0.88   2963
300    0.77   3614
500    0.88   1133
700    1.07   3258
900    1.13   3614
1000   1.04   770
1100   2.04   3525
1200   1.91   3303
1300   1.76   681
1400   2.06   3014
1450   2.53   3597
1470   3.01   2192
1470   3.12   3592
1473   4.96   3566
1475   4.96   3387
1480   6.04   679
1480   4.93   3492 [*]
1481   9.86   3489
1482   9.81   3567
1483   9.94   3118

There is a ~5 percentage point difference in fail% going up from 1480 to 1481.
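
The size arithmetic behind that step can be checked with a short Python sketch. The rows are copied from the table, using the [*] rerun for 1480, and the 1500-byte MTU is the destination interface MTU stated in the post. The names are made up for illustration.

```python
IPV4_HEADER = 20
MTU = 1500  # interface MTU of the destination, as stated in the post

# (size, fail%) rows copied from the table; 1480 is the [*] rerun.
rows = [(1470, 3.12), (1473, 4.96), (1475, 4.96), (1480, 4.93),
        (1481, 9.86), (1482, 9.81), (1483, 9.94)]

def needs_fragmentation(size: int) -> bool:
    """True when size plus the IPv4 header no longer fits in one MTU."""
    return size + IPV4_HEADER > MTU

for size, fail in rows:
    tag = "fragmented" if needs_fragmentation(size) else "fits"
    print(f"{size} -> {size + IPV4_HEADER} bytes on the wire, {tag}, fail {fail}%")

by_size = dict(rows)
jump = round(by_size[1481] - by_size[1480], 2)
print(f"jump at the MTU boundary: {jump} percentage points")
```

Size 1480 gives a 1500-byte IPv4 packet, the largest that fits unfragmented. Size 1481 is the first that must be fragmented, which is where the failure rate roughly doubles.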

As to interpreting this: Leo Bicknell's observations (this is to a
"known good" host, and the RIPE Atlas vantage points may very well have
a clueful-operator bias) stand, so interpret with care. Also: roughly
2/3 of these vantage points are behind NATs that may also have some
firewall(ish) behaviour.

Hope this data point helps interpreting the magnitude of IPv4
fragmentation problems.

Emile Aben
RIPE NCC

[*] redid the 'size 1480' experiment because the first time around it
had significantly fewer vantage points.

Owen_DeLong · August 29, 2013, 2:22am

Has the path MTU been measured for all vantage point pairs?

Is it known to be 1500 or just the end-point MTUs?

That could affect your results very differently.

Owen

Benno_Overeinder · August 29, 2013, 8:24am

    I'm pretty sure the failure rate is higher, and here's why.

    The #1 cause of fragments being dropped is firewalls. Too many
    admins configuring a firewall do not understand fragments or how to
    properly put them in the rules.

    Where do firewalls exist? Typically protecting things with public
    IP space, that is (some) corporate networks and banks of content
    servers in data centers. This also includes on-box firewalls for
    Internet servers, ipfw or iptables on the server is just as likely
    to be part of the problem.

In a study using the RIPE Atlas probes, we have used a heuristic to
figure out where the fragments were dropped. And from the Atlas
probes where IP fragments did not arrive, there is a high likelihood
the problem is with the last hop to the Atlas probe. All other
situations are with the router just before the last hop. We did not
find any problems in the core. Of course this was a rather limited
study using the RIPE Atlas probes in a certain setting.

For the full report, see "Discovering Path MTU Black Holes on the
Internet Using the RIPE Atlas",
http://www.nlnetlabs.nl/downloads/publications/pmtu-black-holes-msc-thesis.pdf.

    Now, where are RIPE probes? Most RIPE probes are probably either
    with somewhat clueful ISP operators, or at Internet Clueful
    engineer's personal connectivity (home, or perhaps a box in a
    colo). RIPE probes have already significantly self-selected for
    people who like non-broken connectivity. What's more, the ping
    test was probably to some "known good" host(s), rather than a broad
    selection of Internet hosts, so effectively it was only testing the
    probe end, not both ends.

With help from RIPE NCC (many thanks), we did measurements both ways.

Cheers,

-- Benno

Emile_Aben · August 29, 2013, 10:22am

I didn't, but see
http://www.nlnetlabs.nl/downloads/publications/pmtu-black-holes-msc-thesis.pdf
Fig 23 (page 24) for path MTU data from roughly a year ago (thanks
Benno for posting that link).

Emile

Christopher_Palmer · August 30, 2013, 12:51am

This is what I'm concerned about:

"""
1. If I originate IP packet fragments, such as an 8000 byte NFS packet broken into 1500 byte fragments, what's the probability of some host before the other endpoint dropping one or all of those fragments?
"""

Big thanks to everyone who has sent thoughts already, really quite helpful.