MTU to CDNs

Hi,

N00b here trying to understand why certain CDNs such as Cloudflare have
issues when my MTU is low. For instance, if I am using PPTP and the MTU is
at 1300 it won't work. If I increase it to 1478 it may or may not work.

TIA.

PMTUD has a lot of trouble working reliably when the destination of
the PTB is a stateless load-balancer.

If your tunnel or host clamps the MSS to the appropriate value it can
support, it is highly likely that connection attempts to the same
destination will work fine.
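
A minimal sketch of that clamp on a Linux router or tunnel endpoint
(assuming iptables and that the box forwards the tunnel traffic):

  # Rewrite the MSS on forwarded TCP SYNs down to what the outgoing path MTU allows:
  iptables -t mangle -A FORWARD -p tcp --tcp-flags SYN,RST SYN \
          -j TCPMSS --clamp-mss-to-pmtu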

I've done some measurements over the internet in the past year or
so, and 1400-byte packets with the DF bit set seem to make it just fine.
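
A quick way to repeat that kind of check from a Linux host (example
hostname; the equivalent flag is -D on BSD/macOS ping):

  # 1372 bytes of ICMP payload + 28 bytes of ICMP/IP headers = a 1400-byte packet,
  # sent with DF so nothing on the path is allowed to fragment it:
  ping -M do -s 1372 -c 5 www.example.com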

  - Jared

CDNs (or anyone using a load balancer in front of multiple server instances) need to
assume that traffic may be encapsulated (4in6, 6in4, 464XLAT) and lower their
interface MTUs so that all traffic generated can be encapsulated without
fragmentation or PTBs being generated.

This is only going to get worse as more and more eyeballs are being forced
into IPv4-as-a-service scenarios.
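
On a Linux origin server, the "lower the interface MTU" advice above might
look something like this (illustrative values and interface/gateway names):

  # Leave headroom for an encapsulation header on everything this host sends:
  ip link set dev eth0 mtu 1460
  # ...or keep the link at 1500 and just cap the MSS advertised to TCP peers:
  ip route change default via 192.0.2.1 dev eth0 advmss 1400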

Wait, what? MTU 1300 fails but 1478 sometimes works? Or was 1300 a typo
and you meant 1500?

This is understandable, but if this is also an operational practice we as the operational community want to condone (people using solutions where PMTUD doesn't work), then we also need to make sure that all applications do PLPMTUD (RFC 4821, Packetization Layer Path MTU Discovery). This is currently NOT the case, and from what I can tell, there isn't even an IETF document saying this is the best current practice.

So, is this something we want to say? We should talk about that.
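
For what it's worth, Linux does carry a TCP-only implementation of RFC 4821,
but it is off by default (a sketch; value 1 only probes after a suspected
black hole, 2 probes always):

  # Enable Packetization Layer PMTUD for TCP:
  sysctl -w net.ipv4.tcp_mtu_probing=1
  # Segment size to start probing from when PMTUD appears to have failed:
  sysctl -w net.ipv4.tcp_base_mss=1024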

❦ 8 January 2018 15:08 -0800, joel jaeggli <joelja@bogus.com>:

N00b here trying to understand why certain CDNs such as Cloudflare have
issues when my MTU is low. For instance, if I am using PPTP and the MTU is
at 1300 it won't work. If I increase it to 1478 it may or may not work.

PMTUD has a lot of trouble working reliably when the destination of
the PTB is a stateless load-balancer.

More explanations are available here:
Path MTU discovery in practice

Vincent,

Thanks. That URL explained a lot.

If I were an ISP (I'm not) and a CDN came and said "we want to be inside
you" (ewww), why wouldn't I say "sure: let's jumbo"?

Not even "asking for a friend": I genuinely don't understand why a CDN
who colocates and is not using a public exchange, but is inside your
transit boundary (which I am told is actually a big thing now), would
not drive to the packet size which works in your switching gear.

I understand that CDN/DC praxis now drives to cheap dumb switches, but
even dumb switches like bigger packets, don't they? Less forwarding
decision cost, for more throughput?

Because the CDN delivers to your customers, not to you. It's your customers'
link requirements that are the ones you need to worry about. If you support
jumbo frames to all of your customers and their gear also supports jumbo
frames, then sure, go ahead and use jumbo frames; otherwise use the lowest
common denominator MTU when transmitting. This is less than 1500 on
today's Internet, and encapsulated traffic is reasonably common.

  embedded CDN <--> NAT64 <--> CLAT <--> client
               1500       14XX      1500
  embedded CDN <-->  B4  <-->  6RD  <--> client
               1500      14XX       1500

Now you can increase the first 1500 easily. The rest of the path not so
easily.
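
For a rough sense of where the 14XX numbers come from (illustrative header
sizes; the exact overhead depends on the mechanism and options in use):

  # 4in6 / 464XLAT / MAP-E (extra IPv6 header):  1500 - 40 = 1460
  # 6in4 / 6RD             (extra IPv4 header):  1500 - 20 = 1480
  # DS-Lite B4 -> AFTR     (IPv4-in-IPv6):       1500 - 40 = 1460
  # PPPoE                                        1500 -  8 = 1492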

thanks. good answer. low risk answer. an "it will work" answer.

If it's a variant of the "the last mile is your problem" problem, I'm ok
with that. If it's a consequence of the middleware deployment, I feel
like it's more tangibly bad decision logic, but it's real.

-G

The reason is most customers are at a lower MTU size. Let's say I can
send you a 9K packet. If you receive that frame, and realize you need
to fragment, then it's your router's job to slice 9000 into 5 x 1500.
I may have caused you to hit your exception path (which could be expensive)
as well as made your PPS load 5x larger downstream.

This doesn't even account for the fact that you may have a speed
mismatch, whereby I am sending at 100Gb+ and your outputs may be only 10G.

If you're then doing DSL + PPPoE and your customers really see an MTU
of 1492 or less, then another device has to fragment those 5x packets yet again.

For server-to-server traffic, 9K makes a lot of sense: it reduces the packet processing
and increases the throughput. If your consumer-electronics WiFi gear or switch
can't handle >1500, and doesn't even have a setting for layer-2 >1500, the
cost is just too high. Much easier for me to send 5x packets in the first place
and be more compatible.

Like many things, I'd love for this to be as simple and purist as you
purport. I might even be willing to figure out if at $DayJob we could see
a benefit from doing this, but from the servers to switches to routers to
a partner interface... it's a lot of things to make sure are just right.

Plus... can your phone do >1500 MTU on the WiFi? Where's that setting?

(mumbling about CSLIP and MRUs from back in the day)

- Jared

In practice, no, because the packet you sent had the "don't fragment"
bit set. That means my router is not allowed to fragment the packet.
Instead, I must send the originating host an ICMP destination
unreachable packet stating that the largest packet I can send further
is 1500 bytes.

You might receive my ICMP message. You might not. After all, I am not
the host you were looking for.
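
One way to see whether those ICMP messages ever make it back is to watch
for "fragmentation needed" on the server (hypothetical interface name):

  # ICMP type 3 (destination unreachable), code 4 (fragmentation needed and DF set):
  tcpdump -ni eth0 'icmp[0] == 3 and icmp[1] == 4'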

Good luck.

Regards,
Bill Herrin

P.S. This makes Linux servers happy:

# Clamp the MSS advertised in outgoing SYNs down to 1240, so segments fit
# paths narrower than the local MTU even when PMTUD fails:
iptables -t mangle --insert POSTROUTING --proto tcp \
        --tcp-flags SYN,RST,FIN SYN --match tcpmss --mss 1241:65535 \
        --jump TCPMSS --set-mss 1240

Let's say I can
send you a 9K packet. If you receive that frame, and realize you need
to fragment, then it's your router's job to slice 9000 into 5 x 1500.

In practice, no, because the packet you sent had the "don't fragment"
bit set. That means my router is not allowed to fragment the packet.
Instead, I must send the originating host an ICMP destination
unreachable packet stating that the largest packet I can send further
is 1500 bytes.

You might receive my ICMP message. You might not. After all, I am not
the host you were looking for.

This gets especially bad in cases such as anycast, where the return path may be asymmetrical and could result in delivery of the ICMP PTB message to a different anycast instance, or to a stateless load balancer that is incapable of determining which machine originated the packet being referenced.

One of the many reasons I continue to question the wisdom of using anycast for multi-packet transactions.

Owen

Let's say I can
send you a 9K packet. If you receive that frame, and realize you need
to fragment, then it's your router's job to slice 9000 into 5 x 1500.

In practice, no, because the packet you sent had the "don't fragment"
bit set.

Which packet? Is there a specific CDN that does this? I’d be curious to see
data vs speculation.

That means my router is not allowed to fragment the packet.
Instead, I must send the originating host an ICMP destination
unreachable packet stating that the largest packet I can send further
is 1500 bytes.

You might receive my ICMP message. You might not. After all, I am not
the host you were looking for.

:-)

Nor is it likely the reply.

- Jared

Howdy,

Path MTU discovery (which sets the DF bit on TCP packets) is enabled
by default on -every- operating system that's shipped for decades now.
If you don't want it, you have to explicitly disable it. Disabling it
for any significant quantity of traffic is considered antisocial since
routers generally can't fragment in the hardware fast path.
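
On Linux the relevant knob is a sysctl; a sketch of checking it (0 means
PMTUD is performed, which is the default on typical kernels):

  # 0 = do PMTUD: set DF on TCP and never fragment locally (default)
  # 1 = don't do PMTUD: leave DF clear and allow on-path fragmentation
  sysctl net.ipv4.ip_no_pmtu_disc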

Regards,
Bill

❦ 19 January 2018 08:53 +1000, George Michaelson <ggm@algebras.org>:

If I were an ISP (I'm not) and a CDN came and said "we want to be inside
you" (ewww), why wouldn't I say "sure: let's jumbo"?

Most traffic would be with clients limited to at most 1500 bytes.

I don't mind letting the client premises routers break down 9000-byte
packets. My ISP controls end-to-end connectivity, and 80% of people even let
our techs change settings on their computer; this would allow me to give a
~5% increase in speeds and less network congestion for end users, for a
one-time $60 service many people would want. It's also where the internet
should be heading...

Not to beat a dead horse (re: IPv6), but why hasn't the entire internet just
moved to a 9000-byte (or 9600-byte L2) MTU? It was created for the jump to
gigabit... That's 4 orders of magnitude ago. The internet backbone shouldn't
be shuffling around 1500-byte packets at 1 Tbps. That means if you want to
layer-3 that data, you need a router capable of more than half a billion
packets/s of forwarding capacity. On the other hand, with even just a
9000-byte MTU, TCP/IP overhead is reduced 6-fold, and forwarding capacity
needs just 100 or so Mpps. Routers that forward at that rate can be found
for less than $2k.
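
For a back-of-the-envelope feel for those packet rates (pure arithmetic,
ignoring L2 framing; real traffic is a mix of sizes, which is where the
larger figures come from):

  # packets per second = bits per second / (packet size in bytes * 8)
  echo $(( 1000000000000 / (1500 * 8) ))   # ~83 Mpps at 1 Tbps if every packet were 1500 bytes
  echo $(( 1000000000000 / (9000 * 8) ))   # ~14 Mpps at 1 Tbps with 9000-byte packets
  echo $(( 1000000000000 / ( 250 * 8) ))   # ~500 Mpps at 1 Tbps with a small (~250-byte) average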

As usual, there are 5-10 (or more) factors playing into this. Some, in random order:

1. IEEE hasn't standardised >1500-byte Ethernet packets.
2. DSL/WiFi chips typically don't support > ~2300, because reasons.
3. Because of 2, most SoC Ethernet chips don't either.
4. There is no standardised way to understand/probe the L2 MTU to your next hop (ARP/ND plus probing whether the value actually works).
5. PMTUD doesn't always work.
6. PLPMTUD generally hasn't been implemented, in protocols or in hosts.
7. Some implementations have been optimized to work on packets < 2000 bytes (they allocate 2k of buffer memory per packet) and actually perform worse if they have to support larger packets; 9k is ill-fitting across 2^X values.
8. Because of all the above, a mixed-MTU LAN doesn't work, and it's going to be mixed-MTU unless you control all devices (which is typically not the case outside of the datacenter).
9. The PPS problem in hosts and routers was solved by hardware offload to NICs and by forwarding NPUs/ASICs with lookup speeds high enough that PPS is no longer a big problem.

On the value to choose for "large MTU", 9000 for the edge and 9180 for the core is what I advocate, after a non-trivial amount of looking into this. All major core routing platforms work with 9180 (with JunOS only supporting this after 2015 or something). So if we want to standardise on an MTU that all devices should support, it's 9180, but we'd typically use 9000 in RAs sent to devices.
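
As a sketch of what that split looks like on a Linux-based router
(hypothetical interface names; the MTU option in RAs would come from the RA
daemon, e.g. radvd's AdvLinkMTU):

  # 9180 on the routed core-facing interface, 9000 toward the LAN that hosts see:
  ip link set dev core0 mtu 9180
  ip link set dev lan0  mtu 9000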

If we want a higher MTU to be deployable across the Internet, we need to make it incrementally deployable. Some key things to achieve that:

1. Get something like https://tools.ietf.org/html/draft-van-beijnum-multi-mtu-05 implemented.
2. Go to the IETF and get a document published that advises all protocols to support PLPMTUD (RFC 4821).

1 is to enable mixed-MTU LANs.
2 is to enable large-MTU hosts to actually be able to communicate when PMTUD doesn't work.

With this in place (wait ~10 years), a larger MTU becomes incrementally deployable, which means it'll be deployable on the Internet, and IEEE might actually agree to standardise >1500-byte packets for Ethernet.

Other than people improperly blocking ICMP, when does PMTUD not work? Honest question, not troll.