[NANOG] Microsoft.com PMTUD black hole?

Brandon_Butterworth1 · May 6, 2008, 7:58pm

Has anyone else here seen problems with microsoft/msn/hotmail/live.com
sites not performing PMTUD correctly?

I used to see it a lot when hosting on windows was popular and people
realised they needed a firewall or decided to add a load balancer
but broke PMTUD by leaving it enabled on the servers.

I've not heard of it for some time so those people got
a clue or moved to something else (or everyone worked around them)

brandon

Iljitsch_van_Beijnum · May 6, 2008, 8:26pm

Many years ago I had occasion to terminate dial-up service over L2TP from modem pools operated by a service provider who shall remain nameless to protect the guilty. This service had the unfortunate tendency to drop all packets larger than 576 bytes. So we needed to negotiate a 576-byte MTU over PPP.

We then got many complaints from users who dialed in using ISDN routers (yes this was a while ago) because of broken path MTU discovery. The behavior that Microsoft exhibits was EXTREMELY common in those days, and I have no reason to assume it's any less common today. (I also see it regularly with IPv6.) What I did was clear the DF bit on packets going out to the L2TP virtual interfaces so the packets could be fragmented.

A more common approach is to rewrite the MSS option in all TCP SYNs with a smaller value so there won't be TCP segments large enough to trigger the problem. AFAIK, all boxes that do PPPoE do this.

All of this even went so far that the IETF came up with RFC 4821, which will do path MTU discovery by correlating lost packets with packet sizes to determine the path MTU rather than depend on ICMP messages.
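As a rough illustration of the idea behind RFC 4821 (probe with different packet sizes and infer the path MTU from which probes get through, without relying on ICMP), here is a minimal sketch. The function name, the `probe_ok` callback, and the 576/1500 search bounds are assumptions made for the example. A plain binary search is only one possible search strategy, and a real implementation also has to tell size-related loss apart from ordinary congestion loss.

```python
from typing import Callable


def probe_path_mtu(probe_ok: Callable[[int], bool],
                   low: int = 576, high: int = 1500) -> int:
    """Return the largest size in [low, high] for which probe_ok(size) is True.

    Assumes (hypothetically) that `low` is already known to work and that
    probe_ok(size) reports whether a probe of that size was acknowledged.
    """
    while low < high:
        # Round up so the loop always makes progress when high == low + 1.
        mid = (low + high + 1) // 2
        if probe_ok(mid):
            low = mid        # probe got through: the path MTU is at least mid
        else:
            high = mid - 1   # probe was lost: assume the path MTU is below mid
    return low


if __name__ == "__main__":
    # Simulated path whose narrowest hop passes packets of up to 1492 bytes.
    print(probe_path_mtu(lambda size: size <= 1492))
```

In practice `probe_ok` would be driven by the transport layer's own acknowledgements rather than a lambda.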

Nathan_Anderson · May 6, 2008, 8:57pm

Brandon Butterworth wrote:

I used to see it a lot when hosting on windows was popular and people
realised they needed a firewall or decided to add a load balancer
but broke PMTUD by leaving it enabled on the servers.

Yeah, but this is Microsoft's OWN server farm we are talking about here, not some small podunk IIS-based hosting provider.

...well, you may be right. I am probably giving MS too much credit here.

On another note, someone pointed out to me off-list that I apparently tyop'd "hostmaster" when I sent the e-mail to MS. I have since re-sent it to the properly-spelled address and again promptly received a "User unknown" bounceback.

Nathan_Anderson · May 6, 2008, 9:29pm

Iljitsch van Beijnum wrote:

A more common approach is to rewrite the MSS option in all TCP SYNs

[snip]

Yeah, we do this now, but the software that we have been using for PPPoE termination as well as for a huge portion of our clients (MikroTik RouterOS) doesn't do it correctly in my estimation when you flip on the automatic "change-tcp-mss" option...it rewrites the MSS in ALL SYNs passing through it, either coming OR going. This has the effect of breaking communication with other hosts that actually have a SMALLER MSS than our PPPoE customers since our client will get a SYN+ACK from the remote host that we have rewritten to reflect a larger MSS than the remote host is capable of dealing with. Because MikroTik rewrote both the SYNs generated by us as well as received by us, our customer's host is now under the impression that the lowest MSS between the two hosts matches its own.

At least that's the best theory I've come up with. We can write (and have written) custom IP manglers on the MikroTik boxes that only touch SYNs generated by our clients, and only when the MSS is larger than a certain value (in order to honor MSSes even lower than that allowed by their PPPoE gateway). But it's a PITA to deal with. I'd just rather everyone follow protocol. Although we can't always expect everyone to do it by the book, I don't think it is too much to ask that those who operate sizable networks that nearly everyone is required to interact with on a daily basis (read: Microsoft) act responsibly.
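The rule described above (only touch SYNs generated by clients, and only lower the MSS, never raise it) can be sketched roughly as follows. This is not MikroTik configuration, just the decision logic; the function name, the `from_client` flag, and the 1452-byte ceiling (a 1492-byte PPPoE MTU minus 40 bytes of IP and TCP headers) are assumptions for illustration.

```python
def clamp_mss(advertised_mss: int, from_client: bool, ceiling: int = 1452) -> int:
    """Return the MSS value to leave in a SYN passing through the gateway.

    ceiling is a hypothetical value: assumed PPPoE MTU of 1492 minus
    20 bytes of IPv4 header and 20 bytes of TCP header.
    """
    if not from_client:
        # SYNs (and SYN+ACKs) from remote hosts are left untouched.
        return advertised_mss
    # Only ever lower the value, so an MSS already below the ceiling is honored.
    return min(advertised_mss, ceiling)


if __name__ == "__main__":
    print(clamp_mss(1460, from_client=True))   # clamped down to the ceiling
    print(clamp_mss(536, from_client=True))    # smaller MSS is kept as-is
    print(clamp_mss(1460, from_client=False))  # remote side is not rewritten
```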

All of this even went so far that the IETF came up with RFC 4821, which will do path MTU discovery by correlating lost packets with packet sizes to determine the path MTU rather than depend on ICMP messages.

What's funny is that I ran my tests from a Windows XP host with the recently-released Service Pack 3 installed, which is supposed to activate Microsoft's "PMTUD Black Hole Router Detection" by default (available pre-SP3 but apparently not turned on without a registry change). I haven't read up on exactly how it's supposed to work, but I think the basic idea is that if the TCP connection is negotiated properly but it doesn't get a response beyond that, it will try lower and lower MSSes until it does.

However it works (or doesn't as the case may be), it didn't make a lick of difference. I waited and waited for content to be delivered to me until eventually Microsoft's end sent me a TCP RST.

While I was poking at this, though, I had a thought...most IP stacks I believe keep a path MTU cache of some sort. I know Windows does: if I send an ICMP packet with DF set that is larger than the PPPoE gateway can handle, I get something similar to the following:

C:\Documents and Settings\nathana>ping 64.126.160.1 -f -l 1472

Pinging 64.126.160.1 with 1472 bytes of data:

Reply from 64.126.142.249: Packet needs to be fragmented but DF set.
Packet needs to be fragmented but DF set.
[...]

Next time that I try the same thing, Windows doesn't even bother trying to send the packet. It looks at its PMTU table for that IP, and already KNOWS it is too big:

C:\Documents and Settings\nathana>ping 64.126.160.1 -f -l 1472

Pinging 64.126.160.1 with 1472 bytes of data:

Packet needs to be fragmented but DF set.
[...]

However, even when trying this with www.msnbc.msn.com, and with the MSNBC entry in its PMTU cache (and its IP set statically in my 'hosts' file so that Akamai/MS round-robin DNS doesn't screw with me during the test), when I tried to build a TCP connection to MSNBC from this same host, Windows told the remote host it had a 1460 MSS.
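For reference, the 1472-byte ping payload and the 1460-byte MSS in the tests above both follow from a 1500-byte MTU. A small sketch of the arithmetic, assuming IPv4 and TCP headers without options; the 1492-byte figure used for a PPPoE link is an assumption, since the thread does not state the gateway's actual MTU.

```python
IPV4_HEADER = 20  # bytes, no IP options
ICMP_HEADER = 8   # bytes, echo request header
TCP_HEADER = 20   # bytes, no TCP options


def max_ping_payload(mtu: int) -> int:
    """Largest echo payload that fits in one unfragmented packet."""
    return mtu - IPV4_HEADER - ICMP_HEADER


def tcp_mss(mtu: int) -> int:
    """MSS a host would advertise for a link with the given MTU."""
    return mtu - IPV4_HEADER - TCP_HEADER


if __name__ == "__main__":
    for mtu in (1500, 1492):  # Ethernet, and an assumed PPPoE MTU
        print(mtu, max_ping_payload(mtu), tcp_mss(mtu))
```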

Now, although that makes sense, in order to avoid issues like the one we are facing with Microsoft, would it not make _more_ sense for the stack to look at the PMTU cache first, and then adjust its own MSS just for connections to that one host? Maybe even send out an MTU - 40 ICMP packet to the host that we want to build a TCP connection with FIRST to get an ICMP type 3 code 4 response from the router in-between with the smaller MTU?

That would put the burden of PMTUD on the host requesting the TCP session rather than on the one responding, but if hosts were "smarter" like this it seems to me it might smooth out some of these issues. The remote end could be "broken" with respect to PMTUD but it wouldn't matter.

Thoughts?

Nathan_Anderson · May 6, 2008, 9:32pm

Nathan Anderson/FSR wrote:

[...]

connections to that one host? Maybe even send out an MTU - 40 ICMP

:s/40/sized. Brain fart.

Iljitsch_van_Beijnum · May 7, 2008, 5:22am

A more common approach is to rewrite the MSS option in all TCP SYNs
with a smaller value so there won't be TCP segments large enough to
trigger the problem. AFAIK, all boxes that do PPPoE do this.

And just the other day, you were saying:

Very few people out there use an MTU significantly below 1500 bytes. A
1500-byte MTU will give you an _average_ packet size of ~1000 on long-
lived TCP flows because there is one tiny ACK for every two full size
data segments.

Right. Why is that noteworthy?
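The ~1000-byte average quoted above can be checked with a short calculation. The sketch assumes a 40-byte ACK (IPv4 and TCP headers, no options) and exactly one ACK per two full-size segments; the function and parameter names are made up for the example.

```python
def average_packet_size(mtu: int = 1500, ack_size: int = 40,
                        data_per_ack: int = 2) -> float:
    """Average size over data_per_ack full-size segments plus one ACK."""
    return (data_per_ack * mtu + ack_size) / (data_per_ack + 1)


if __name__ == "__main__":
    print(round(average_packet_size()))
```

With these assumptions the result is (2 * 1500 + 40) / 3, or about 1013 bytes.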

I have a lot more to say about MTU issues in this draft about negotiating MTUs between two hosts/routers on a subnet so jumbo frames can be deployed without manual configuration:

http://www.ietf.org/internet-drafts/draft-van-beijnum-multi-mtu-02.txt

Apparently, there's a *reason* why RFC1122, section 3.3.3 says:

        It is generally desirable to avoid local fragmentation and to
        choose EMTU_S low enough to avoid fragmentation in any gateway
        along the path. In the absence of actual knowledge of the
        minimum MTU along the path, the IP layer SHOULD use
        EMTU_S <= 576 whenever the destination address is not on a
        connected network, and otherwise use the connected network's
        MTU.

Tell it to Microsoft and their ICMP-filtering friends...

Iljitsch_van_Beijnum · May 7, 2008, 5:29am

No. This would add significant delay because you'd have to give the other side enough time to respond to the large packet (also sending a large packet on something like GPRS/EDGE is a waste of bandwidth and battery power) while if there is ICMP filtering, there won't be a response, which is exactly the reason why we're in this bind in the first place (along with the stupid idea that DF should be set for ALL packets rather than just once in a while).

And adjusting the MSS based on ephemeral information is the wrong thing to do in the first place. The path MTU can vary. Once you've advertised a small MSS you can never increase it.

It is incredibly unprofessional that people enable PMTUD, then break it and require the rest of the world to implement workarounds. Either use PMTUD properly by accepting the ICMP messages or turn PMTUD off.

Bjorn_Mork · May 7, 2008, 8:10am

Iljitsch van Beijnum <iljitsch@muada.com> writes:

Many years ago I had occasion to terminate dial-up service over L2TP
from modem pools operated by a service provider who shall remain
nameless to protect the guilty. This service had the unfortunate
tendency to drop all packets larger than 576 bytes. So we needed to
negotiate a 576-byte MTU over PPP.

We then got many complaints from users who dialed in using ISDN
routers (yes this was a while ago) because of broken path MTU
discovery. The behavior that Microsoft exhibits was EXTREMELY common
in those days, and I have no reason to assume it's any less common
today. (I also see it regularly with IPv6.) What I did was clear the
DF bit on packets going out to the L2TP virtual interfaces so the
packets could be fragmented.

Right. I once stumbled across a SOHO-router doing just that. I never
understood why, but now you've given at least one explanation how it
could appear to be a good idea.

I can also provide the reason why we found it to be an extremely bad
idea at the time: Some (most? all?) systems won't set both the DF flag
and the identification field at the same time. If you clear the DF flag
without changing the identification field, you might end up with
fragmented packets that are impossible to reassemble. Which was why I
stumbled across the DF-clearing SOHO-router in the first place. The
random problems it generated were extremely difficult to debug, and when
we started we truly believed that we had a problem with a layer 4 load
balancing switch.

Note: There are solutions that will both clear the DF flag and generate
a new id. E.g. http://www.openbsd.org/faq/pf/scrub.html

This is the proper way to clear DF, if you must. Never just clear it.
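To make the point concrete, here is a minimal sketch of clearing DF on an IPv4 header while also assigning a fresh identification value and fixing the header checksum. It is an illustration only, not how pf implements scrubbing; the function names are hypothetical, and the input is assumed to be just the IPv4 header bytes (IHL * 4 bytes), not the whole packet.

```python
import secrets
import struct
from typing import Optional

DF_FLAG = 0x4000  # "don't fragment" bit in the flags/fragment-offset word


def ipv4_checksum(header: bytes) -> int:
    """Ones' complement sum over the 16-bit words of the header."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def clear_df_with_new_id(header: bytes, new_id: Optional[int] = None) -> bytes:
    """Return a copy of the header with DF cleared and a new identification."""
    if new_id is None:
        new_id = secrets.randbits(16)  # fresh ID so fragments can be matched up
    hdr = bytearray(header)
    struct.pack_into("!H", hdr, 4, new_id)              # identification field
    flags_frag = struct.unpack_from("!H", hdr, 6)[0]
    struct.pack_into("!H", hdr, 6, flags_frag & ~DF_FLAG)
    struct.pack_into("!H", hdr, 10, 0)                  # zero checksum first
    struct.pack_into("!H", hdr, 10, ipv4_checksum(bytes(hdr)))
    return bytes(hdr)
```

A header that arrives with DF set and an identification of zero leaves with DF cleared and a usable ID, so any fragments created downstream can still be reassembled.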

Bjørn

Nathan_Anderson · May 7, 2008, 6:50pm

Iljitsch van Beijnum wrote:

No. This would add significant delay because you'd have to give the other side enough time to respond to the large packet (also sending a large packet on something like GPRS/EDGE is a waste of bandwidth and battery power) while if there is ICMP filtering, there won't be a response, which is exactly the reason why we're in this bind in the first place

I admit the idea needs tweaking (at best), and it was just a stray thought :-), but 1) even if there is ICMP filtering happening way at the other end, I (the TCP initiator) will still get a response from the router in the middle (RITM) that is reducing the total path MTU if I try to send a packet through it larger than the actual path MTU, and 2) if I don't get a response to my single large packet (either from a RITM or the other end) in a timely fashion (less than a second?), then the client/initiator may just assume that path MTU == local MTU and will set its MSS accordingly (which is no different than what is happening now), until it has a reason to think differently.

Also, if there is already something in the local PMTU cache for a single host address, I'm not sure I follow why it would be a bad idea for the TCP initiator to consult that cache when preparing the SYN. Although, on second thought, I suppose it is possible (and, in more than a few cases, likely) that in instances of route path asymmetry, the PMTU of the path from the initiator to the server may be different than the PMTU of the path back from the server to the client. Hmmm.

Okay, scratch that idea then.

Nathan_Anderson · May 7, 2008, 7:38pm

Yes, but my point was precisely that one OR the other side (server OR client) is going to NOT have the ICMP-munching firewall in between itself and the "RITM" as I have affectionately been calling it (although it is definitely possible that there are two ICMP-munchers on either side of the RITM).

And case #2 is exactly what is occurring right now _anyway_: hosts assume that path MTU == local MTU even if there is already an active PMTU cache entry from a recent earlier communication with the remote host. So I don't see how making that assumption _after_ making an honest attempt at actively determining whether or not it is actually the case is any more broken than the way things are already being done.

The problem is that, as I realized at the end of the message you quoted, there are potentially multiple paths between the same two hosts, and the path that the packet takes in one direction is not guaranteed to be the same path that the packet takes in the opposite direction.

Tomas_L_Byrnes · May 7, 2008, 7:43pm

I'm not sure what the issue is here.

Just about every modern firewall I've used has an option to enable PMTU
on interfaces, while blocking all other ICMP.

Is MS not running something manufactured in the last 10 years at their
perimeter?

Nathan_Anderson · May 7, 2008, 9:00pm

Tomas L. Byrnes wrote:

I'm not sure what the issue is here.

Just about every modern firewall I've used has an option to enable PMTU
on interfaces, while blocking all other ICMP.

Is MS not running something manufactured in the last 10 years at their
perimeter?

Not sure, but you actually jumped into a subthread of the original conversation, this one about other possible ways of dealing with black hole "ICMP-munchers" in a pre-emptive fashion. I had a brainstorm that I thought would be workable, which is what we were discussing here. Apparently, it turns out my idea was no good.

The original discussion about MS blocking ICMP to their own servers, which is the discussion it sounds like you are looking for, is over that-a-way... *points*

Matthew_Petach2 · May 12, 2008, 4:19pm

Unless things have changed drastically since we parted ways, it's a simple
ACL applied on all edge interfaces. It should be possible for them to modify
it to allow the list of ICMP subtypes listed at
http://www.cymru.com/Documents/icmp-messages.html

It would *certainly* make troubleshooting easier for the poor folks at
Microsoft, since one side effect of the edge filter being set that way
meant we couldn't traceroute outside the network; the port unreachable
messages never made it back, so everything outside the edge routers
was all just stars.

Of course, that was in a former lifetime, so it's entirely possible and
probable things have changed considerably since then. ^_^;;

Matt
(speaking only for myself, not for my current employer, and most
certainly not for my previous employer who I'm still somewhat bitter
at, not having gotten any of my hardware back yet...)