PMTUD for IPv4 Multicast - How?

Chris_Marget · August 31, 2015, 4:12pm

I recently discovered that my routers weren't generating ICMP Type 3 Code 4
(unreachable, DF-bit) messages in response to too-big IPv4 multicast
packets with DF=1.

At first, I thought this was a bug, but then learned that RFCs 1112, 1122
and 1812 all specify that ICMP unreachables not be sent in response to
multicast packets.

RFC1981 (PMTUD for IPv6), on the other hand, is explicit that PMTUD works
for multicast flows, that the path MTU for a multicast flow is the smallest
MTU available anywhere in the distribution tree, and that a single
multicast packet may provoke many ICMP unreachables from routers along the
tree.
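
To make the RFC1981 model concrete, here is a minimal sketch of both ideas: the path MTU of a multicast flow as the smallest link MTU in the tree, and the number of "too big" errors one oversized packet could provoke. The tree layout, the node names and the one-error-per-dropping-link assumption are all hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Tuple

# Hypothetical distribution tree: node -> list of (downstream node, link MTU).
Tree = Dict[str, List[Tuple[str, int]]]


def tree_path_mtu(tree: Tree) -> int:
    """Smallest link MTU anywhere in the distribution tree."""
    return min(mtu for links in tree.values() for _, mtu in links)


def count_too_big_errors(tree: Tree, root: str, packet_size: int) -> int:
    """Count links that would drop the packet (assumed: one error per link).

    The packet stops at the first too-small link on each branch, so links
    downstream of a drop are never reached.
    """
    errors = 0
    stack = [root]
    while stack:
        node = stack.pop()
        for child, mtu in tree.get(node, []):
            if packet_size > mtu:
                errors += 1          # dropped here; error goes to the source
            else:
                stack.append(child)  # forwarded; keep walking this branch
    return errors


if __name__ == "__main__":
    example = {
        "source": [("r1", 1500)],
        "r1": [("r2", 1500), ("r3", 1400)],
        "r2": [("rx-a", 1280), ("rx-b", 1500)],
        "r3": [("rx-c", 1300)],
    }
    print(tree_path_mtu(example))                         # 1280
    print(count_too_big_errors(example, "source", 1500))  # 2
```

In a wide tree the second number grows with the number of constrained branches, which is the "many ICMP unreachables" case described above.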

Further complicating matters, the default Linux behavior (ip_no_pmtu_disc =
0) sets the DF bit on all packets unless the application is explicit
(setsockopt()) that DF be cleared. This behavior strikes me as a
troublesome assumption (that the application will interpret unreachables)
in the case of unicast UDP sockets, and downright broken (because traffic
will be dropped silently) in the case of multicast UDP sockets.
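
For reference, a minimal sketch of what "being explicit" looks like on Linux, assuming the IP_MTU_DISCOVER socket option is the setsockopt() in question. The numeric fallbacks are the Linux values from <linux/in.h>, used in case the Python socket module does not export the names; the helper name is made up for illustration and the code is Linux-only.

```python
import socket

# Linux values from <linux/in.h>; fallbacks in case this Python build's
# socket module does not export the names. Not valid on other systems.
IP_MTU_DISCOVER = getattr(socket, "IP_MTU_DISCOVER", 10)
IP_PMTUDISC_DONT = getattr(socket, "IP_PMTUDISC_DONT", 0)  # never set DF
IP_PMTUDISC_DO = getattr(socket, "IP_PMTUDISC_DO", 2)      # always set DF


def make_udp_sender(allow_fragmentation: bool) -> socket.socket:
    """Return a UDP socket whose DF policy is chosen explicitly (Linux)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    mode = IP_PMTUDISC_DONT if allow_fragmentation else IP_PMTUDISC_DO
    sock.setsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER, mode)
    return sock


if __name__ == "__main__":
    sender = make_udp_sender(allow_fragmentation=True)
    # Read the option back to confirm DF will be left clear.
    print(sender.getsockopt(socket.IPPROTO_IP, IP_MTU_DISCOVER))
    sender.close()
```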

I'm struggling to grok the rationale behind not sending unreachables in
response to multicast packets. It seems to me that our networks put IPv4
multicast speakers in a position where it's impossible for them to do the
right thing.

Does anybody understand why PMTUD for IPv4 multicast flows is disabled in
routers?

Is there a secret lever to enable it in Cisco IOS?

What should a responsible IPv4 multicast application do when receivers are
flung far and wide with un-knowable MTUs in the transit path?

Thanks,

/chris

Valdis_Kletnieks · August 31, 2015, 4:37pm

For the exact same reason that replying to an ICMP Echo Request sent to
your broadcast address is generally considered a Bad Idea.

The obvious solution is "Doctor, it hurts when I do that" "Don't do that anymore".

Don't send multicast packets with DF set.

Chris_Marget · August 31, 2015, 4:51pm

>> At first, I thought this was a bug, but then learned that RFCs 1112, 1122
>> and 1812 all specify that ICMP unreachables not be sent in response to
>> multicast packets.
>>
>> I'm struggling to grok the rationale behind not sending unreachables in
>> response to multicast packets. It seems to me that our networks put IPv4
>> multicast speakers in a position where it's impossible for them to do the
>> right thing.
>
> For the exact same reason that replying to an ICMP Echo Request sent to
> your broadcast address is generally considered a Bad Idea.
>
> The obvious solution is "Doctor, it hurts when I do that" "Don't do that
> anymore".

It's not as obvious to me as it is to you. I mean, v6 *requires* exactly
this behavior, so it can't be all that bad, can it?

> Don't send multicast packets with DF set.

Are you asserting that the default behavior of the Linux kernel (setting DF
on multicast packets) is wrong then?

I'll probably come around, but I've not yet concluded that "screw it,
fragment my traffic, I don't care" is the stance that a conscientious
application should be taking.

/chris

sthaug · August 31, 2015, 7:49pm

> > At first, I thought this was a bug, but then learned that RFCs 1112, 1122
> > and 1812 all specify that ICMP unreachables not be sent in response to
> > multicast packets.
>
> > I'm struggling to grok the rationale behind not sending unreachables in
> > response to multicast packets. It seems to me that our networks put IPv4
> > multicast speakers in a position where it's impossible for them to do the
> > right thing.
>
> For the exact same reason that replying to an ICMP Echo Request sent to
> your broadcast address is generally considered a Bad Idea.
>
> The obvious solution is "Doctor, it hurts when I do that" "Don't do that
> anymore".
>

> It's not as obvious to me as it is to you. I mean, v6 *requires* exactly
> this behavior, so it can't be all that bad, can it?

ICMP replies to multicast packets can cause ICMP "implosion". This is
not a new discussion - see for instance

http://mailman.nanog.org/pipermail/nanog/2012-June/048685.html

Steinar Haug, Nethelp consulting, sthaug@nethelp.no

William_Herrin · August 31, 2015, 8:38pm

It's a shame we handle path MTU as a layer 3 problem that gets an ICMP
response from a middlebox. It'd make more sense to truncate the
packet, set a flag, and then let layer 4 at the recipient deal with
negotiating a new size with the sender. You know, end to end principle
and all. That'd eliminate the problems with firewall-blocked protocols
and routers using private IP addresses, the usual culprits for pmtud
breakage.

It'd also let multicast protocols make reasonable choices for that
particular protocol without being stuck with the stack's default.

-Bill

Masataka_Ohta · August 31, 2015, 9:17pm

Chris Marget wrote:

>> For the exact same reason that replying to an ICMP Echo Request sent to
>> your broadcast address is generally considered a Bad Idea.
>>
>> The obvious solution is "Doctor, it hurts when I do that" "Don't do that
>> anymore".

And, it implies that some ISPs will filter all the ICMPv6 PTB including
those generated against unicast ones, which means PMTUDv6 won't work.

Filtering ICMPv6 PTB generated against multicast packets but not unicast
ones is not very easy.

> It's not as obvious to me as it is to you. I mean, v6 *requires* exactly
> this behavior, so it can't be all that bad, can it?

Yes, of course.

See

Design by committee - Wikipedia

which is why we should avoid IPv6 entirely, especially because NAT,
with its 48bit effective address space, is fair enough and, for
theoretical purity, NAT can be modified to have full end to end
transparency (draft-ohta-e2e-nat-00),
or, UPnP capable NAT already practically have the transparency.

> I'll probably come around, but I've not yet concluded that "screw it,
> fragment my traffic, I don't care" is the stance that a conscientious
> application should be taking.

Don't you care, for routers, generating ICMP PTB is as burdensome
as generating fragments?

Masataka Ohta

PS

Pages 87-101 of

ftp://chacha.hpcl.titech.ac.jp/2014/infra5.ppt

is my presentation at APNIC32 on the problem.

Chris_Marget · August 31, 2015, 9:28pm

>> I'll probably come around, but I've not yet concluded that "screw it,
>> fragment my traffic, I don't care" is the stance that a conscientious
>> application should be taking.
>
> Don't you care, for routers, generating ICMP PTB is as burdensome
> as generating fragments?

I don't think so. If PMTUD is working (big IF, I know), the ICMP PTB
generation is a one-time thing (or once per 10 minutes or whatever)
and can be rate limited with little impact. Fragmenting transit
traffic, on the other hand, needs to be done for every transit packet.

Chris_Marget · August 31, 2015, 9:34pm

>> It's not as obvious to me as it is to you. I mean, v6 *requires* exactly
>> this behavior, so it can't be all that bad, can it?
>
> ICMP replies to multicast packets can cause ICMP "implosion". This is
> not a new discussion - see for instance
>
> http://mailman.nanog.org/pipermail/nanog/2012-June/048685.html

Thanks very much for the pointer to that discussion. "ICMP implosion"
has been a helpful search term.

The position taken there appears to boil down to:
- The IPv6 requirement to generate "too big" messages *really is a problem*
- RFC2463 should not have made the exception which allows sending these messages
- Multicast PMTUD should not be a thing
- Multicast speakers should send un-fragmentable minimum-sized packets

I remain fuzzy on exactly the nature of the implosion problem. Is the
concern that I might DDoS myself by sending un-fragmentable traffic?

It's hard for me to recognize this as a problem, but I'm working on
it. It seems to me that as a multicast speaker, the influx of ICMP
errors is both desirable (I set DF because I intend to react) and
under my control.

It certainly beats sending minimum-sized packets, which appears to be
the recommendation in the linked discussion.

If somebody would be so kind as to detail the disastrous nature of the
implosion, that would be helpful.

William_Herrin · August 31, 2015, 9:38pm

No, it isn't.

When a router fragments a packet, it has to fragment the next and the
next and the next. Maybe tens or hundreds of thousands of packets
before the end of that one user's session.

When a router generates a PTB, there is no next. PTB is a soft
failure. The origin must correct the error (by reducing packet size)
before communication can succeed.

There are potentially several orders of magnitude of difference in the
burden on the router.
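
A rough sketch of the per-packet side of that comparison, assuming plain IPv4 with a 20-byte header and no options. It computes the fragments a router would have to emit for a single oversized packet; the function name and the example numbers are illustrative only.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options


def fragment_lengths(total_length: int, mtu: int) -> list:
    """Total lengths (header included) of the fragments of one IPv4 packet."""
    if total_length <= mtu:
        return [total_length]
    # Every fragment except the last carries a multiple of 8 payload bytes.
    chunk = (mtu - IPV4_HEADER) // 8 * 8
    if chunk <= 0:
        raise ValueError("MTU too small to carry any payload")
    payload = total_length - IPV4_HEADER
    sizes = []
    while payload > 0:
        take = min(chunk, payload)
        sizes.append(take + IPV4_HEADER)
        payload -= take
    return sizes


if __name__ == "__main__":
    per_packet = fragment_lengths(1500, 1400)
    print(per_packet)                 # [1396, 124]
    # A session of 100,000 such packets means 100,000 fragmentation
    # operations and this many fragments emitted by the router:
    print(100_000 * len(per_packet))  # 200000
```

The PTB side of the comparison is a single message per hold-off period, provided the sender actually reacts to it.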

Regards,
Bill Herrin

Masataka_Ohta · September 1, 2015, 12:42am

William Herrin wrote:

> It'd make more sense to truncate the
> packet, set a flag, and then let layer 4 at the recipient deal with
> negotiating a new size with the sender.

For routers, truncating the packet and setting a flag is as
burdensome as fragmentation or ICMP generation.

Moreover, just with plain fragmentation enabled IPv4 packets, layer
4 can deal similarly.

> You know, end to end principle and all.

PMTUD requires "knowledge and help" (quote from the end to end
argument) of all the intermediate routers. That is, you apply the
end to end argument completely wrongly.

> That'd eliminate the problems with firewall-blocked protocols
> and routers using private IP addresses, the usual culprits for pmtud
> breakage.

With your approach, you will find firewalls dropping truncated packets.

Masataka Ohta

Masataka_Ohta · September 1, 2015, 12:49am

William Herrin wrote:

>> for routers, generating ICMP PTB is as burdensome
>> as generating fragments?
>
> No, it isn't.

Yes, it is. Generating an ICMP PTB is as burdensome as
fragmenting a packet.

> When a router fragments a packet, it has to fragment the next and the
> next and the next. Maybe tens or hundreds of thousands of packets
> before the end of that one user's session.

Not necessarily, because transport layer can react against fragmented
packets.

> When a router generates a PTB, there is no next. PTB is a soft
> failure. The origin must correct the error (by reducing packet size)

What if, the origin does not reduce packet size?

Masataka Ohta

Masataka_Ohta · September 1, 2015, 12:55am

Chris Marget wrote:

>>> I'll probably come around, but I've not yet concluded that "screw it,
>>> fragment my traffic, I don't care" is the stance that a conscientious
>>> application should be taking.
>>
>> Don't you care, for routers, generating ICMP PTB is as burdensome
>> as generating fragments?
>
> I don't think so. If PMTUD is working (big IF, I know),

Yup.

> the ICMP PTB
> generation is a one-time thing (or once per 10 minutes or whatever)

A meaningful interval of retry is not 10 minutes but RTT measured
at layer 4 or above.

> Is the concern that I might DDoS myself

Or, with spoofed source addresses, someone else.

Masataka Ohta

Mark_Andrews2 · September 1, 2015, 1:05am

> William Herrin wrote:
>
>>> for routers, generating ICMP PTB is as burdensome
>>> as generating fragments?
>>
>> No, it isn't.
>
> Yes, it is. Generating an ICMP PTB is as burdensome as
> fragmenting a packet.

Well it could be done at wire speed. It just requires more complicated
hardware. Routers usually punt it to the cpu but there is no real
reason that they have to do that. There is no theoretical reason
why it has to be more burdensome than forwarding a packet. It's an
implementation choice.

>> When a router fragments a packet, it has to fragment the next and the
>> next and the next. Maybe tens or hundreds of thousands of packets
>> before the end of that one user's session.
>
> Not necessarily, because transport layer can react against fragmented
> packets.
>
>> When a router generates a PTB, there is no next. PTB is a soft
>> failure. The origin must correct the error (by reducing packet size)
>
> What if, the origin does not reduce packet size?

The communication fails. Additionally routers normally rate limit
PTB generation thereby reducing cpu loads to an acceptable level
which is the whole point of moving the fragmentation to the originating
node.
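
As an illustration of the rate-limiting idea, here is a minimal token-bucket sketch. Nothing above says which algorithm any particular router uses, so the token bucket, the class name and the parameters are assumptions; the point is only that error generation can be capped independently of the offending packet rate.

```python
import time


class TokenBucket:
    """Allow at most `rate_per_sec` events per second, with a small burst."""

    def __init__(self, rate_per_sec, burst, clock=time.monotonic):
        self.rate = float(rate_per_sec)
        self.burst = float(burst)
        self.clock = clock            # injectable so the limiter is testable
        self.tokens = float(burst)    # start with a full bucket
        self.last = clock()

    def allow(self):
        """Return True if one more error message may be generated now."""
        now = self.clock()
        # Refill for the time elapsed since the last call, capped at burst.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False                  # over the limit: drop silently


if __name__ == "__main__":
    limiter = TokenBucket(rate_per_sec=1, burst=2)
    # Ten oversized packets arrive back to back; only the burst gets errors.
    print(sum(limiter.allow() for _ in range(10)))
```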

Masataka_Ohta · September 1, 2015, 1:21am

Mark Andrews wrote:

>> Yes, it is. Generating an ICMP PTB is as burdensome as
>> fragmenting a packet.
>
> Well it could be done at wire speed.

Both of them could be.

> There is no theoretical reason
> why it has to be more burdensome than forwarding a packet.

That's not my point.

> The communication fails.

It depends on layer 4 and above.

> Additionally routers normally rate limit
> PTB generation thereby reducing cpu loads to an acceptable level
> which is the whole point of moving the fragmentation to the originating
> node.

Routers can rate limit fragment generation, too.

Masataka Ohta

Chris_Marget · September 1, 2015, 3:47am

> Chris Marget wrote:
>
>>>> I'll probably come around, but I've not yet concluded that "screw it,
>>>> fragment my traffic, I don't care" is the stance that a conscientious
>>>> application should be taking.
>>>
>>> Don't you care, for routers, generating ICMP PTB is as burdensome
>>> as generating fragments?
>>
>> I don't think so. If PMTUD is working (big IF, I know),
>
> Yup.
>
>> the ICMP PTB
>> generation is a one-time thing (or once per 10 minutes or whatever)
>
> A meaningful interval of retry is not 10 minutes but RTT measured
> at layer 4 or above.

I took the 10 minute value from RFC1191's recommendation about when
it's appropriate to try larger MTU sizes. One ICMP message should hold
the sender off for 10 minutes (or whatever), just like it does with
unicast traffic.
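
A minimal sketch of that hold-off, assuming a sender that keeps one PMTU estimate per flow and goes back to its original size once the timer expires. The class and method names are made up, and resetting straight to the initial MTU is a simplification.

```python
import time

PROBE_AFTER_SECONDS = 600  # the 10 minute value discussed above


class PathMtuEstimate:
    """Per-flow PMTU estimate that is lowered by PTB and aged back up."""

    def __init__(self, initial_mtu, clock=time.monotonic):
        self.initial_mtu = initial_mtu
        self.mtu = initial_mtu
        self.clock = clock         # injectable so the timer is testable
        self.lowered_at = None     # time of the last reduction, if any

    def on_too_big(self, next_hop_mtu):
        """Handle one ICMP 'fragmentation needed' carrying a next-hop MTU."""
        if next_hop_mtu < self.mtu:
            self.mtu = next_hop_mtu
            self.lowered_at = self.clock()

    def current(self):
        """Return the size to use now, retrying the larger size after 10 min."""
        if (self.lowered_at is not None
                and self.clock() - self.lowered_at >= PROBE_AFTER_SECONDS):
            self.mtu = self.initial_mtu
            self.lowered_at = None
        return self.mtu


if __name__ == "__main__":
    estimate = PathMtuEstimate(1500)
    estimate.on_too_big(1400)
    print(estimate.current())  # 1400 until the hold-off expires
```

With this shape a well-behaved sender provokes at most one round of errors per hold-off period, regardless of how many packets it sends in between.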

Would you explain why a router might need to generate ICMP PTB at a
rate corresponding to intervals of RTT? I don't see why the error rate
would be correlated with path length. Anyway, RTT isn't something that
necessarily exists with multicast applications.

>> Is the concern that I might DDoS myself
>
> Or, with spoofed source addresses, someone else.

The latter concern isn't unique to this case and applies to many
(most? all?) types of reflection attacks. Indeed, many other
protocols have disabled potentially useful features in order to thwart
reflection attacks which rely on spoofed source addresses. At least in
the case of those protocols, we have the choice to enable monlist,
ip_respond_to_echo_broadcast or whatever as appropriate for the
environment.

I'm still not sure where this leaves the application which wants to do
the right thing.
- Send 1500 byte frames and expect fragmentation?
- Guess at the least of all likely path MTUs?
- Send 576 byte frames?
- Build a feedback mechanism into the application?
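
For the fixed-size options in that list, the arithmetic can be sketched as follows, assuming a 20-byte IPv4 header without options and an 8-byte UDP header. The function names are made up, and a real application would also need its own sequence header so receivers can reassemble the pieces.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
UDP_HEADER = 8    # bytes


def max_udp_payload(mtu: int) -> int:
    """Largest UDP payload that fits in one unfragmented packet of size mtu."""
    return mtu - IPV4_HEADER - UDP_HEADER


def split_message(data: bytes, mtu: int) -> list:
    """Split an application message into payloads that each fit the MTU."""
    size = max_udp_payload(mtu)
    if size <= 0:
        raise ValueError("MTU too small for any UDP payload")
    return [data[i:i + size] for i in range(0, len(data), size)]


if __name__ == "__main__":
    print(max_udp_payload(1500))  # 1472
    print(max_udp_payload(576))   # 548
    pieces = split_message(b"x" * 1200, 576)
    print([len(p) for p in pieces])  # [548, 548, 104]
```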