ECN

Hello

I have a customer that believes my network has an ECN problem. We do not, we just move packets. But how do I prove it?

Is there a tool that checks for ECN trouble? Ideally something I could run on the NLNOG Ring network.

I believe it likely that it is the destination that has the problem.

Regards,

Baldur

> Hello
>
> I have a customer that believes my network has an ECN problem. We do not, we just move packets. But how do I prove it?

Are you saying that none of your routers support ECN or that you think ECN only applies to endpoints?

> Is there a tool that checks for ECN trouble? Ideally something I could run on the NLNOG Ring network.
>
> I believe it likely that it is the destination that has the problem.

I’d say start by asking the reporter to provide a PCAP of the problem. Then review the packet trace for clues about where to place tap points
in your network, so you can find where ECN handling is (or should be) occurring and where the opposite is happening.

Owen

> Hello
>
> I have a customer that believes my network has an ECN problem. We do
> not, we just move packets. But how do I prove it?
>
> Is there a tool that checks for ECN trouble? Ideally something I could
> run on the NLNOG Ring network.
>
> I believe it likely that it is the destination that has the problem.

Hi Baldur

I believe I may be that customer :)

First of all, thank you for looking into the issue! We've been having
great fun over on the ecn-sane mailing list trying to figure out what's
going on. I'll summarise below, but see this thread for the discussion
and debugging details:
https://lists.bufferbloat.net/pipermail/ecn-sane/2019-November/000527.html

The short version is that the problem appears to come from a combination
of the ECMP routing in your network, and Cloudflare's heavy use of
anycast. Specifically, a router in your network appears to be doing ECMP
by hashing on the packet header, *including the ECN bits*. This breaks
TCP connections with ECN because the TCP SYN (with no ECN bits set) end
up taking a different path than the rest of the flow (which is marked as
ECT(0)). When the destination is anycasted, this means that the data
packets go to a different server than the SYN did. This second server
doesn't recognise the connection, and so replies with a TCP RST. To fix
this, simply exclude the ECN bits (or the whole TOS byte) from your
router's ECMP hash.

For a longer exposition, see below. You should be able to verify this
from somewhere else in the network, but if there's anything else you
want me to test, do let me know. Also, would you mind sharing the router
make and model that does this? We're trying to collect real-world
examples of network problems caused by ECN and this is definitely an
interesting example.

-Toke

The long version:

From my end I can see that I have two paths to Cloudflare; which is
taken appears to be based on a hash of the packet header, as can be seen
by varying the source port:

$ traceroute -q 1 --sport=10000 104.24.125.13
traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
1 _gateway (10.42.3.1) 0.357 ms
2 albertslund-edge1-lo.net.gigabit.dk (185.24.171.254) 4.707 ms
3 customer-185-24-168-46.ip4.gigabit.dk (185.24.168.46) 1.283 ms
4 te0-1-1-5.rcr21.cph01.atlas.cogentco.com (149.6.137.49) 1.667 ms
5 netnod-ix-cph-blue-9000.cloudflare.com (212.237.192.246) 1.406 ms
6 104.24.125.13 (104.24.125.13) 1.322 ms

$ traceroute -q 1 --sport=10001 104.24.125.13
traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
1 _gateway (10.42.3.1) 0.293 ms
2 albertslund-edge1-lo.net.gigabit.dk (185.24.171.254) 3.430 ms
3 customer-185-24-168-38.ip4.gigabit.dk (185.24.168.38) 1.194 ms
4 10ge1-2.core1.cph1.he.net (216.66.83.101) 1.297 ms
5 be2306.ccr42.ham01.atlas.cogentco.com (130.117.3.237) 6.805 ms
6 149.6.142.130 (149.6.142.130) 6.925 ms
7 104.24.125.13 (104.24.125.13) 1.501 ms

This is fine in itself. However, the problem stems from the fact that
the ECN bits in the IP header are also included in the ECMP hash (-t
sets the TOS byte; with the RFC 3168 codepoints, -t 1 is ECT(1) on the
wire and -t 2 is ECT(0)):

$ traceroute -q 1 --sport=10000 104.24.125.13 -t 1
traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
1 _gateway (10.42.3.1) 0.336 ms
2 albertslund-edge1-lo.net.gigabit.dk (185.24.171.254) 6.964 ms
3 customer-185-24-168-46.ip4.gigabit.dk (185.24.168.46) 1.056 ms
4 te0-1-1-5.rcr21.cph01.atlas.cogentco.com (149.6.137.49) 1.512 ms
5 netnod-ix-cph-blue-9000.cloudflare.com (212.237.192.246) 1.313 ms
6 104.24.125.13 (104.24.125.13) 1.210 ms

$ traceroute -q 1 --sport=10000 104.24.125.13 -t 2
traceroute to 104.24.125.13 (104.24.125.13), 30 hops max, 60 byte packets
1 _gateway (10.42.3.1) 0.339 ms
2 albertslund-edge1-lo.net.gigabit.dk (185.24.171.254) 2.565 ms
3 customer-185-24-168-38.ip4.gigabit.dk (185.24.168.38) 1.301 ms
4 10ge1-2.core1.cph1.he.net (216.66.83.101) 1.339 ms
5 be2306.ccr42.ham01.atlas.cogentco.com (130.117.3.237) 6.570 ms
6 149.6.142.130 (149.6.142.130) 6.888 ms
7 104.24.125.13 (104.24.125.13) 1.785 ms
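
For reference, the TOS byte splits into a six-bit DSCP field and a two-bit ECN field. Here is a minimal Python sketch of that split, using the RFC 3168 codepoint names (the same ones tcpdump prints, e.g. "tos 0x2,ECT(0)"). The function name split_tos is made up for illustration:

```python
# ECN codepoints from RFC 3168: the low two bits of the TOS byte.
ECN_NAMES = {0b00: "Not-ECT", 0b01: "ECT(1)", 0b10: "ECT(0)", 0b11: "CE"}


def split_tos(tos):
    """Split an 8-bit TOS value into (dscp, ecn_name). Hypothetical helper."""
    if not 0 <= tos <= 0xFF:
        raise ValueError("TOS must fit in one byte")
    dscp = tos >> 2    # upper six bits: DiffServ codepoint
    ecn = tos & 0b11   # lower two bits: ECN field
    return dscp, ECN_NAMES[ecn]


if __name__ == "__main__":
    for tos in (0x00, 0x01, 0x02, 0x03):
        dscp, ecn = split_tos(tos)
        print(f"tos=0x{tos:02x} dscp={dscp} ecn={ecn}")
```

A hash that only looks at the DSCP value (tos >> 2) gets the same input for all four values in the loop.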

So why is this a problem? The TCP SYN packet first needs to negotiate
ECN, so it is sent without any ECN bits set in the header; after
negotiation succeeds, the data packets will be marked as ECT(0). But
because that becomes part of the ECMP hash, those packets will take
another path. And since the destination is anycasted, that means they
will also end up at a different endpoint. This second endpoint won't
recognise the connection, and reply with a TCP RST. This is clearly
visible in tcpdump; notice the different TOS values, and that the RST
packet has a different TTL than the SYN-ACK:

12:21:47.816359 IP (tos 0x0, ttl 64, id 25687, offset 0, flags [DF], proto TCP (6), length 60)
    10.42.3.130.34420 > 104.24.125.13.80: Flags [SEW], cksum 0xf2ff (incorrect -> 0x0853), seq 3345293502, win 64240, options [mss 1460,sackOK,TS val 4248691972 ecr 0,nop,wscale 7], length 0
12:21:47.823395 IP (tos 0x0, ttl 58, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    104.24.125.13.80 > 10.42.3.130.34420: Flags [S.E], cksum 0x9f4a (correct), seq 1936951409, ack 3345293503, win 29200, options [mss 1400,nop,nop,sackOK,nop,wscale 10], length 0
12:21:47.823479 IP (tos 0x0, ttl 64, id 25688, offset 0, flags [DF], proto TCP (6), length 40)
    10.42.3.130.34420 > 104.24.125.13.80: Flags [.], cksum 0xf2eb (incorrect -> 0x503e), seq 1, ack 1, win 502, length 0
12:21:47.823665 IP (tos 0x2,ECT(0), ttl 64, id 25689, offset 0, flags [DF], proto TCP (6), length 117)
    10.42.3.130.34420 > 104.24.125.13.80: Flags [P.], cksum 0xf338 (incorrect -> 0xc1d4), seq 1:78, ack 1, win 502, length 77: HTTP, length: 77
  GET / HTTP/1.1
  Host: 104.24.125.13
  User-Agent: curl/7.66.0
  Accept: */*
  
12:21:47.825485 IP (tos 0x2,ECT(0), ttl 60, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    104.24.125.13.80 > 10.42.3.130.34420: Flags [R], cksum 0x3a65 (correct), seq 1936951410, win 0, length 0
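
If you want to look for this symptom in a capture automatically, something like the following rough Python sketch could work. It assumes "tcpdump -v" style text output as shown above, and flags a RST whose TTL differs from the SYN-ACK sent by the same source address and port. The regexes and function names are my own, the sample lines are trimmed copies of the trace above, and a TTL difference is only a hint (paths can change for other reasons), not proof:

```python
import re

# First line of each packet: timestamp, then "IP (tos 0x.., ttl N, ...".
HEADER = re.compile(r"^\S+ IP \(tos (0x[0-9a-f]+).*?ttl (\d+)")
# Indented second line: "src > dst: Flags [..]".
DETAIL = re.compile(r"^\s+(\S+) > (\S+): Flags \[([^\]]+)\]")

SAMPLE = """\
12:21:47.816359 IP (tos 0x0, ttl 64, id 25687, offset 0, flags [DF], proto TCP (6), length 60)
    10.42.3.130.34420 > 104.24.125.13.80: Flags [SEW], seq 3345293502, length 0
12:21:47.823395 IP (tos 0x0, ttl 58, id 0, offset 0, flags [DF], proto TCP (6), length 52)
    104.24.125.13.80 > 10.42.3.130.34420: Flags [S.E], seq 1936951409, length 0
12:21:47.825485 IP (tos 0x2,ECT(0), ttl 60, id 0, offset 0, flags [DF], proto TCP (6), length 40)
    104.24.125.13.80 > 10.42.3.130.34420: Flags [R], seq 1936951410, length 0
"""


def parse_packets(text):
    """Pair each header line with the detail line that follows it."""
    packets, pending = [], None
    for line in text.splitlines():
        header = HEADER.match(line)
        if header:
            pending = (int(header.group(1), 16), int(header.group(2)))
            continue
        detail = DETAIL.match(line)
        if detail and pending:
            tos, ttl = pending
            packets.append({"src": detail.group(1), "tos": tos,
                            "ttl": ttl, "flags": detail.group(3)})
            pending = None
    return packets


def rst_ttl_mismatch(packets):
    """True if a RST has a different TTL than the SYN-ACK from the same source."""
    synack_ttl = {p["src"]: p["ttl"] for p in packets
                  if p["flags"].startswith("S.")}
    return any("R" in p["flags"] and p["src"] in synack_ttl
               and p["ttl"] != synack_ttl[p["src"]] for p in packets)


if __name__ == "__main__":
    print(rst_ttl_mismatch(parse_packets(SAMPLE)))
```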

The fix is to stop hashing on the ECN bits when doing ECMP. You could
keep hashing on the diffserv part of the TOS field if you want, but I
think it would also be fine to just exclude the TOS field entirely from
the hash.
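
To illustrate the difference, here is a toy Python model of the path selection. It is only a sketch: real routers use vendor-specific hardware hashes, and I have no idea what the actual hash function in this router is. SHA-256 is a stand-in, and the names ecmp_path and split_flows are made up. It counts how many flows see their SYN (Not-ECT, TOS 0x00) and their data packets (ECT(0), TOS 0x02) land on different paths under three choices of hash input:

```python
import hashlib


def ecmp_path(pkt, n_paths, mode):
    """Pick a path index for a packet (toy model, not a real router hash).

    mode "tos"    : 5-tuple plus the full TOS byte (the broken behaviour)
    mode "dscp"   : 5-tuple plus the six DSCP bits only
    mode "5tuple" : 5-tuple only
    """
    fields = [pkt["src"], pkt["dst"], pkt["proto"], pkt["sport"], pkt["dport"]]
    if mode == "tos":
        fields.append(pkt["tos"])
    elif mode == "dscp":
        fields.append(pkt["tos"] >> 2)  # drop the two ECN bits
    elif mode != "5tuple":
        raise ValueError(f"unknown mode: {mode}")
    digest = hashlib.sha256(repr(fields).encode()).digest()
    return int.from_bytes(digest[:4], "big") % n_paths


def split_flows(mode, n_paths=2, ports=range(10000, 11000)):
    """Count flows whose SYN and ECT(0) data packets pick different paths."""
    count = 0
    for sport in ports:
        base = {"src": "10.42.3.130", "dst": "104.24.125.13",
                "proto": 6, "sport": sport, "dport": 80}
        syn = dict(base, tos=0x00)   # SYN is sent as Not-ECT
        data = dict(base, tos=0x02)  # data packets are marked ECT(0)
        if ecmp_path(syn, n_paths, mode) != ecmp_path(data, n_paths, mode):
            count += 1
    return count


if __name__ == "__main__":
    for mode in ("tos", "dscp", "5tuple"):
        print(mode, split_flows(mode))
```

With the "dscp" and "5tuple" inputs the SYN and data packets always hash identically, because the ECN bits never reach the hash. With "tos", a share of the flows is split across paths.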

This sounds like a bug on Cloudflare’s end (cause trying to do anycast TCP is... out of spec to say the least), not a bug in ECN/ECMP.

Even without anycast, ECMP shouldn't hash on the ECN bits. Doing so will split the flow over multiple paths; avoiding that is the whole point of doing the flow-based hashing in the first place.

Anycast "only" turns a potential degradation of TCP performance into a hard failure... :)

-Toke

Not ideal, sure, but if it’s only for the SYN (as you seem to indicate), splitting the flow shouldn’t cause material performance degradation?

It is certainly odd, but it's definitely a "thing."

https://archive.nanog.org/meetings/nanog37/presentations/matt.levine.pdf

Not to condone what cloudflare is doing, but…

An ECN connection will have different bits on various packets for the duration of the connection – pure ACKs (ACKs not piggybacking on data) will have the ECN bits as 00b, while all other packets will have either 01b, 10b (when no congestion was experienced) or 11b (when congestion was experienced). So using the ECN bits as part of the hash would affect performance throughout the life of the connection.

as one of the authors of that talk, it definitely is “a thing”, has been for years and years and years, and indeed, mostly works.

t

It does when the split flows land in different anycast origin POPs. Making a few assumptions from the traceroutes, the ECMP paths are sending some packets to Hamburg and some to Denmark. Each POP may be getting parts of what should be a single TCP stream, and I doubt they have anything to cope with that (another assumption).

I am testing disabling our use of ECMP as it is not strictly necessary and we are moving to a new platform anyway. Waiting for feedback from the customer to hear if this fixes the issue.

In any case, is it not recommended that operators of anycast services proxy packets that arrive at the wrong place, to avoid this kind of issue?

Regards,

Baldur

In typical anycast deployments there is no feasible way to figure out where the "right place" is.

It would be very interesting if you could share what equipment you're using that is doing ECMP hashing based on ECN bits. That vendor needs to fix that or people should avoid their devices.

ZTE M6000-S V3.00.20(3.40.1)

We are moving away from this platform so I can not be bothered with requesting a fix. In the past they have made fixes for us, so I believe they would also fix this issue if we asked them to do so.

Also I would like to state that I have not personally verified that the equipment is doing hashing based on the ECN bits. I just turned off ECMP so the customer can test. If it works we will either let ECMP stay off or move the customer to the new platform.

Regards,

Baldur

Not true. The hash result should identify a discrete flow; more
importantly, a discrete flow should not result in two different hash
values. Using the whole TOS byte breaks this promise and thus breaks ECMP.

Platforms allow you to configure which bytes are part of the hash
calculation. The whole TOS byte should not be used, as a discrete flow
is expected to carry differing ECN bits during congestion. Toke has
diagnosed the problem correctly; the solution is to remove TOS from the
ECMP hash calculation.

Like it or not (and I really don’t), the majority of modern CDNs are using TCP over Anycast.

It’s ugly and it’s prone to problems like this. It’s nice to see a customer with know-how actually publicizing and digging into the problem.

Until now, I believe an unknown number of customers have been suffering in silence or have been relegated to the ISP’s “We can’t reproduce your problem” bin without resolution.

I’ve had lots of discussions on the subject and the usual end result is “It’s too hard to measure or quantify and there’s no visible contingent of impacted users”.

Now we at least have one visible impacted user.

Owen

Hello,

> This sounds like a bug on Cloudflare’s end (cause trying to do anycast TCP is... out of spec to say the least), not a bug in ECN/ECMP.

> Not true. The hash result should identify a discrete flow; more
> importantly, a discrete flow should not result in two different hash
> values. Using the whole TOS byte breaks this promise and thus breaks ECMP.
>
> Platforms allow you to configure which bytes are part of the hash
> calculation. The whole TOS byte should not be used, as a discrete flow
> is expected to carry differing ECN bits during congestion. Toke has
> diagnosed the problem correctly; the solution is to remove TOS from the
> ECMP hash calculation.

In fact I believe hashing on anything beyond the 5-tuple is just a bad
idea. Here are some examples (not quite as straightforward as the
TOS/ECN case here):

TTL:
https://mailman.nanog.org/pipermail/nanog/2018-September/096871.html

IPv6 flow label:
https://blog.apnic.net/2018/01/11/ipv6-flow-label-misuse-hashing/
https://pc.nanog.org/static/published/meetings/NANOG71/1531/20171003_Jaeggli_Lightning_Talk_Ipv6_v1.pdf
https://www.youtube.com/watch?v=b0CRjOpnT7w

Lukas

Yes true.

Equal Cost MultiPath (ECMP) consistency over the life of a TCP connection is not a promise. Anycasters would love it to be but it’s not.

ECMP’s only promise is that packets for a particular connection will tend to prefer a particular path so that throughput doesn’t suffer overly much from the packet reordering you’d get by round-robining the packets on different links. Choosing an alternate path during congestion is a perfectly reasonable thing for ECMP to do.

Don’t blame the network. This is Cloudflare choosing not to handle the anycast spray corner case because it happens rarely enough, with symptoms obscure enough, that they only occasionally get called on the carpet. Their BGP announcements make the claim they’re ready for your packet at any of their sites, but they’re not.

Regards,
Bill Herrin

> This sounds like a bug on Cloudflare’s end (cause trying to do anycast TCP is... out of spec to say the least), not a bug in ECN/ECMP.

Errrrrr. I really don't think that there is any sort of spec that
covers that :P

Using Anycast for TCP is incredibly common - the DNS root servers for
one obvious example.
More TCP centric well-known examples are Fastly and LinkedIn -
LinkedIn in particular did a really good podcast on their experience
with this.

There is also a good NANOG talk from the ~2000s (?) on people using
TCP anycast for long lived (serving ISO files, which were long-lived
in those days) flows, and how reliable it is - perhaps that's the talk
Todd mentioned?

W

RFC 7094 (https://tools.ietf.org/html/rfc7094) describes the pitfalls & risks of using TCP with an anycast address. It recognizes that there are valid use cases for it, though.

Specifically, section 3.1 says this:

   Most stateful transport protocols (e.g., TCP), without modification,
   do not understand the properties of anycast; hence, they will fail
   probabilistically, but possibly catastrophically, when using anycast
   addresses in the presence of "normal" routing dynamics.
...

* Saku Ytti

> Not true. The hash result should identify a discrete flow; more
> importantly, a discrete flow should not result in two different hash
> values. Using the whole TOS byte breaks this promise and thus breaks ECMP.
>
> Platforms allow you to configure which bytes are part of the hash
> calculation. The whole TOS byte should not be used, as a discrete flow
> is expected to carry differing ECN bits during congestion. Toke has
> diagnosed the problem correctly; the solution is to remove TOS from the
> ECMP hash calculation.

Agreed. This also goes for the other bits, so the whole byte must be excluded.

For example, the OpenSSH client will by default change the code point from zero (during authentication) to af21/cs1 (when it enters an interactive/non-interactive session).

I have experienced this break IPv6 SSH sessions to an anycasted SSH server instance that was reached through old Juniper DPC cards with ECMP enabled. Symptom was that authentication went fine, only for the connection to be reset immediately after (unless default IPQoS config was changed). The «solution» was to simply disable ECMP for all IPv6 traffic, since I could not figure out how to make the Juniper exclude the DiffServ byte from the ECMP hash calculation.
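
To show how an application ends up changing the DiffServ bits in the middle of a connection, here is a small Python sketch using the standard IP_TOS socket option. It uses IPv4 for brevity (my case above was IPv6, where the equivalent is the traffic class), and it is not how OpenSSH itself is written. AF21 is DSCP 18 and CS1 is DSCP 8; the helper name set_tos is made up:

```python
import socket

# DSCP values shifted left past the two ECN bits to form the TOS byte.
AF21 = 18 << 2  # 0x48
CS1 = 8 << 2    # 0x20


def set_tos(sock, tos):
    """Set the TOS byte on a socket and return what the kernel reports back."""
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_TOS, tos)
    return sock.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)


with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
    # A fresh socket starts with TOS 0, as during SSH authentication.
    before = s.getsockopt(socket.IPPROTO_IP, socket.IP_TOS)
    # Switching to AF21 mid-connection changes the TOS byte of all later
    # packets, and therefore the input of any hash that includes it.
    after = set_tos(s, AF21)
    print(hex(before), hex(after))
```

Any ECMP hash that includes the TOS byte sees a different input from that point on, even though the 5-tuple of the connection has not changed.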

Tore