[NANOG] Microsoft.com PMTUD black hole?

Hello,

Has anyone else here seen problems with microsoft/msn/hotmail/live.com sites not performing PMTUD correctly? For a while now, people on our network have complained of poor microsoft.com reachability, and we discovered that we can work around the issue by rewriting the MSS on all TCP SYNs as they leave our network.
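For anyone unfamiliar with the MSS workaround, here is a minimal sketch of the arithmetic behind it. The 1492-byte path MTU is only an illustrative assumption (a typical PPPoE value), not a figure from our network, and `clamped_mss` is a hypothetical helper name, not any router's actual API.

```python
IPV4_HEADER = 20  # bytes, assuming no IP options
TCP_HEADER = 20   # bytes, assuming no TCP options


def clamped_mss(path_mtu: int, advertised_mss: int) -> int:
    """Return the MSS value to write into an outgoing SYN.

    The MSS is lowered only when the advertised value would produce
    packets larger than the path MTU; it is never raised.
    """
    ceiling = path_mtu - IPV4_HEADER - TCP_HEADER
    if ceiling <= 0:
        raise ValueError("path MTU too small to carry a TCP segment")
    return min(advertised_mss, ceiling)


# Assumed example: a host advertises 1460 but the path only carries 1492 bytes.
print(clamped_mss(1492, 1460))  # prints 1452
```

Because the remote end never sends a segment larger than the MSS it saw in the SYN, its packets fit the path without any ICMP feedback, which is why the workaround hides the problem.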

I recently watched the whole conversation between msn.com and a host on our network (with the MSS rewrite disabled), and if I'm reading it right, we are following the PMTUD protocol correctly by sending back ICMP type 3 code 4 (destination unreachable, fragmentation needed), but all Microsoft hosts seem to ignore this and continue to send our host packets that are too large for the path.
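As a reference for what a sender is expected to read out of that ICMP message, here is a small sketch that pulls the next-hop MTU from the 8-byte ICMP header as laid out in RFC 1191. The function name `next_hop_mtu` and the sample bytes are made up for illustration; the checksum is not validated here.

```python
import struct


def next_hop_mtu(icmp: bytes):
    """Return the next-hop MTU from an ICMP 'fragmentation needed' message.

    Returns None if the message is too short or is not type 3 code 4.
    A returned 0 means the router did not fill in the RFC 1191 field.
    """
    if len(icmp) < 8:
        return None
    # Layout: type (1), code (1), checksum (2), unused (2), next-hop MTU (2)
    icmp_type, code, _checksum, _unused, mtu = struct.unpack("!BBHHH", icmp[:8])
    if icmp_type == 3 and code == 4:
        return mtu
    return None


# Hypothetical message advertising a 1492-byte next hop (checksum left as 0).
sample = struct.pack("!BBHHH", 3, 4, 0, 0, 1492)
print(next_hop_mtu(sample))  # prints 1492
```

A sender that receives this is supposed to lower its path MTU estimate for that destination and retransmit in smaller packets; if the message is filtered before it arrives, the sender never learns the value.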

I hope I'm wrong and that it is we who are doing something stupid, but after cruising Google for a while, I found a multitude of other complaints from people connected to other ISPs specifically about not being able to reach Microsoft web sites. It seems crazy that MS could have PMTUD broken for so long with nobody ever raising a complaint to them directly, though, which makes me wonder if there is another answer here that I'm missing.

I sent the following message to a couple of addresses that I gleaned from ARIN WHOIS for the IP block in question and threw hostmaster in there just in case it went somewhere, but noc@microsoft.com appears to be defunct. I have yet to receive acknowledgment of receipt from the other address.

Are there any microsoft.com admins that hang out here that can comment on this or get in touch with me, or is there perhaps someone on here with connections to the Microsoft NOC?

(BTW, I stripped the referenced libpcap attachment off of this message to the list just so that I wouldn't accidentally incur the wrath of NANOG...if y'all want to see it, I'm happy to post it.)

Thanks,

I thought I'd post a few constructive comments on this thread. (Full disclosure: I am an ex-Microsoft employee. I do not speak for the company; I'm just trying to help out the network community.)

1) Yes, Microsoft blocks ICMP for the most part, which will break Path MTU Discovery. This is a known issue. If you run into it, it's most likely because the servers you are trying to talk to in MS-land don't have black hole router detection turned on.

2) Instead of trying to get all the various ACLs and firewalls in Microsoft fixed to allow PMTUD, you are more likely to experience joy if you can contact the server owners. Ask if they have black hole router detection turned on, and if not, if they can do so.

3) So how do you get in contact with the server owners or MSN's networking people? msnalert@microsoft.com is your best bet. That's the email address monitored by the basic Tier 1 "Service Operations Center". They cut tickets, follow scripts, and do very basic front line work. They probably won't be able to fix the problem for you, but they CAN get you in touch with the right people.

4) FINDING the right people can be a challenge, even internally. Microsoft is a very big company, and it's far from centralized. Be specific about which URLs and IPs you are having trouble with, and be prepared to bounce around a bit. The people who run microsoft.com's servers aren't the same group that does hotmail, etc. Have patience, and try to get ticket numbers for tracking as much as possible.

5) Try to give a realistic estimate of how many users are being impacted by the problem. Your problem will be triaged as it moves through various groups, and yes, the response time may not be what you want. Your problem is one fire among many, and there aren't enough firefighters.

6) Be nice. Seriously. People love to hate Microsoft, and sometimes take it out on the poor overworked geeks who are trying to actually make things better. Every vulnerability, BSOD, or Vista delay is not the fault of the network or systems engineer you get in touch with. ;-)
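For readers who haven't met black hole router detection (points 1 and 2 above), here is a rough sketch of the idea, not the actual Windows implementation. It assumes a simple rule: after a few unacknowledged sends of a full-size segment, suspect that ICMP is being dropped and fall back to a conservative MSS. The names `send_with_blackhole_detection` and `path_delivers`, the retry count, and the 1492-byte path are all illustrative assumptions.

```python
FALLBACK_MSS = 536  # conservative value: 576-byte datagram minus 40 bytes of headers


def send_with_blackhole_detection(mss, path_delivers, max_retries=3):
    """Return the MSS that finally gets a segment acknowledged.

    path_delivers(size) is a stand-in for "send a segment of this size and
    report whether it was acknowledged"; here it simulates the network.
    """
    for _attempt in range(max_retries):
        if path_delivers(mss):
            return mss
    # Repeated silence on a full-size segment: suspect a PMTUD black hole
    # and retry with the small fallback size instead of waiting for ICMP.
    if mss > FALLBACK_MSS and path_delivers(FALLBACK_MSS):
        return FALLBACK_MSS
    raise ConnectionError("no acknowledgement even at the fallback MSS")


def narrow_path(size):
    # Simulated path: carries at most 1492 bytes, with 40 bytes of headers.
    return size + 40 <= 1492


print(send_with_blackhole_detection(1460, narrow_path))  # prints 536
```

The point is that the server recovers on its own even when every ICMP message is filtered, which is why asking the server owners to enable it is usually quicker than getting ACLs changed.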

* netgeek@bgp4.net (Janet Sullivan) [Thu 08 May 2008, 23:35 CEST]:

1) Yes, Microsoft blocks ICMP for the most part, which will break Path MTU Discovery. This is a known issue. If you run into it, it's most likely because the servers you are trying to talk to in MS-land don't have black hole router detection turned on.

I find it hilarious that one part of the company had to come up with a hack to work around the inability of another part of the company to understand how TCP/IP works.

  -- Niels.