Unless you specifically configure true “per-packet” on your LAG:
Well, not exactly the same thing. (But it’s my mistake, I was referring to L3 balancing, not L2 interface stuff.)
load-balance per-packet will cause massive reordering, because it's random spray, caring about nothing except equal loading of the members. It's a last-resort option that will cause tons of reordering. (And they call that out quite clearly in the docs.) If you don't care about reordering, it's great.
load-balance adaptive generally did a decent enough job last time I used it much. stateful was hit or miss; sometimes it tested amazing, other times not so much. But it wasn't a primary requirement, so I never dove into why.
Well, not exactly the same thing. (But it's my mistake, I was referring to L3 balancing, not L2 interface stuff.)
Fair enough.
load-balance per-packet will cause massive reordering, because it's random spray, caring about nothing except equal loading of the members. It's a last-resort option that will cause tons of reordering. (And they call that out quite clearly in the docs.) If you don't care about reordering, it's great.
load-balance adaptive generally did a decent enough job last time I used it much.
Yep, pretty much my experience too.
stateful was hit or miss; sometimes it tested amazing, other times not so much. But it wasn't a primary requirement, so I never dove into why.
Never tried stateful.
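For anyone following along, here is a minimal Python sketch of the distinction being discussed (my own toy model, nothing vendor-specific; the member count and the 5-tuple are made up): per-flow hashing pins every packet of a flow to one member, while per-packet spray ignores the flow entirely, which is exactly where the reordering risk comes from.

# Toy model (not vendor code) contrasting per-flow hashing with per-packet
# spray on a LAG. A flow is a 5-tuple; hashing pins it to one member.
import random
import zlib

MEMBERS = 4  # hypothetical number of LAG member links

def member_by_hash(flow):
    """Per-flow: hash the 5-tuple, so every packet of a flow uses one member."""
    return zlib.crc32(repr(flow).encode()) % MEMBERS

def member_by_spray(_flow):
    """Per-packet: pick a member at random, ignoring the flow entirely."""
    return random.randrange(MEMBERS)

if __name__ == "__main__":
    flow = ("10.0.0.1", "192.0.2.1", 6, 49152, 443)  # made-up 5-tuple
    print("hash  :", [member_by_hash(flow) for _ in range(8)])   # same member 8x
    print("spray :", [member_by_spray(flow) for _ in range(8)])  # any member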
Moving 802.1Q trunks from N x 10Gbps LAGs to native 100Gbps links resolved this load-balancing conundrum for us. Of course, it works well because we spread these router<=>switch links across several 100Gbps ports, so no single trunk is ever that busy, even for customers buying N x 10Gbps services.
Yes, this has been my understanding of, specifically, Juniper's
forwarding complex.
Correct, the packet is sprayed to some PPE, and PPEs do not run in deterministic time. After the PPEs there is a reorder block that restores flow order, if it has to.
EZchip is the same with its TOPs.
Packets are chopped into near-same-size cells, sprayed across all available fabric links by the PFE logic and given a sequence number, and the protocol engines ensure oversubscription is managed by a request-grant mechanism between PFEs.
This isn't the mechanism that causes reordering; it's the ingress and egress lookup, where the Packet or PacketHead is sprayed to some PPE, where it can occur.
You can find some patents on it:
When a PPE 315 has finished processing a header, it notifies a Reorder
Block 321. The Reorder Block 321 is responsible for maintaining order
for headers belonging to the same flow, and pulls a header from a PPE
315 when that header is at the front of the queue for its reorder
flow.
Note that this reorder happens even when you have exactly 1 ingress interface and exactly 1 egress interface; as long as you have enough PPS, you will reorder outside flows, even without the fabric being involved.
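To make the reorder-block behaviour concrete, here is a small Python toy model (mine, not the patent's actual implementation; the flow names, counts, and random "processing times" are invented): headers complete out of order across the PPE pool, and the reorder block releases each one only when it is at the front of its flow's queue, so order within a flow is restored while different flows interleave freely.

# Toy model of the reorder step described above: packet heads finish on PPEs in
# non-deterministic time, and a reorder block releases each head only when it
# is at the front of its flow's queue.
import random
from collections import defaultdict

random.seed(1)

# (flow_id, sequence_within_flow) for two interleaved flows arriving in order
arrivals = [("A", i // 2) if i % 2 == 0 else ("B", i // 2) for i in range(12)]

# Pretend each PPE takes a random amount of time: sorting by a random key gives
# the (possibly scrambled) order in which headers come out of the PPE pool.
completions = sorted(arrivals, key=lambda _: random.random())

# Reorder block: hold a completed header until its predecessor in the same
# flow has been released, then release it.
released = []
waiting = defaultdict(set)
next_seq = defaultdict(int)
for flow, seq in completions:
    waiting[flow].add(seq)
    while next_seq[flow] in waiting[flow]:
        released.append((flow, next_seq[flow]))
        waiting[flow].remove(next_seq[flow])
        next_seq[flow] += 1

print("out of the PPEs :", completions)
print("after reorder   :", released)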
Per-packet LB is one of those ideas that is great at a conceptual level, but in practice is obviously out of touch with reality. Kind of like the EIGRP protocol from Cisco and its use of the load, reliability, and MTU metrics.
TCP looks quite different in 2023 than it did in 1998. It should handle
packet reordering quite gracefully;
Maybe, and even if it isn't, TCP may be modified. But that is not my primary point.
ECMP, in general, means the paths consist of multiple routers and links. The links have various bandwidths, and other traffic may be merged at multi-access links or on routers. It is then hopeless for the load-balancing points to control the buffers of the routers in the paths and the delays caused by those buffers, which makes per-packet load balancing hopeless.
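As a back-of-the-envelope illustration of why uncontrolled buffering delay matters (my numbers, not anything measured in this thread): two consecutive packets of one flow sent down different paths arrive out of order as soon as the queueing-delay skew between the paths exceeds the sender's inter-packet gap, which at 10G line rate is only about a microsecond.

# Back-of-the-envelope Python sketch: per-packet balancing over two paths whose
# buffering delay differs reorders a flow as soon as the delay skew exceeds the
# sender's inter-packet gap. All numbers are invented for illustration.
GAP_MS = 1500 * 8 / 10e9 * 1e3              # ~0.0012 ms: back-to-back 1500B at 10G

for skew_ms in (0.0, 0.0005, 0.005, 0.5):   # extra queueing delay on path B
    arrival_1 = 0.0 + skew_ms               # packet 1: sent at t=0, takes path B
    arrival_2 = GAP_MS + 0.0                # packet 2: sent one gap later, path A
    verdict = "reordered" if arrival_2 < arrival_1 else "in order"
    print(f"queueing skew {skew_ms:6.4f} ms -> {verdict}")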
However, as I wrote to Mark Tinka;
: If you have multiple parallel links over which many slow
: TCP connections are running, which should be your assumption,
with "multiple parallel links", which are single hop
pathes, it is possible for the load balancing point
to control amount of buffer occupancy of the links
and delays caused by the buffers almost same, which
should eliminate packet reordering within a flow,
especially when " many slow TCP connections are
running".
And, simple round robin should be good enough
for most of the cases (no lab testing at all, yet).
A little more aggressive approach is to fully
share a single buffer by all the parallel links.
But as it is not compatible with router architecture
today, I did not proposed the approach.
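A minimal sketch of that single-hop argument (my own toy model, with made-up arrival numbers): two equal point-to-point links fed by round robin keep their backlogs within one packet of each other, so consecutive packets of a slow flow see nearly identical delay.

# Toy model of round robin onto equal point-to-point links: the per-link
# backlogs stay within one packet of each other, so per-flow delay is nearly
# identical on both links. Arrival pattern and constants are invented.
import itertools, random

LINKS = 2
SERVICE = 1.0                  # time to serialize one packet (equal links)
queues = [0.0] * LINKS         # current backlog per link, in packet-times
rr = itertools.cycle(range(LINKS))

random.seed(7)
max_spread = 0.0
for _tick in range(10_000):
    # both links drain one packet-time per tick
    queues = [max(0.0, q - SERVICE) for q in queues]
    # a random number of packets arrives from many slow flows this tick
    for _ in range(random.randint(0, 2)):
        queues[next(rr)] += SERVICE
    max_spread = max(max_spread, abs(queues[0] - queues[1]))

print("largest backlog difference between the links:", max_spread, "packet-time(s)")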
Those multi-metrics are in IS-IS as well (if you don't use wide metrics). And I agree those are not for common cases, but I wouldn't be shocked if someone has a legitimate MTR use case where different metric-type topologies are very useful. But as long as we keep the context to the Internet, true.
100%, reordering does not work for the Internet, not without changing all end hosts. And by changing those, it's not immediately obvious how we end up in a better place; if we wait a bit longer to signal packet loss, we likely end up in a worse place, as reordering is just so dang rare today, because congestion control choices have made sure no one reorders, or customers will yell at you, yet packet loss remains common.
Perhaps if congestion control used latency or FEC instead of loss, we could tolerate reordering while not underperforming under loss, but I'm sure in the decades following that decision we'd learn new ways in which we don't understand any of this.
But for non-Internet applications, where you control the hosts, per-packet is used and needed; I think HPC applications, GPU farms, etc. are the users who asked JNPR to implement this.
set interfaces ae2 aggregated-ether-options load-balance per-packet
I ran per-packet on a Juniper LAG 10 years ago. It produced 100%
perfect traffic distribution. But the reordering was insane, and the
applications could not tolerate it.
Unfortunately that is not strict round-robin load balancing. I do not
know about any equipment that offers actual round-robin
load-balancing.
Juniper's solution will cause way too much packet reordering for TCP to
handle. I am arguing that strict round-robin load balancing will
function better than hash-based in a lot of real-world
scenarios.
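To illustrate the kind of real-world scenario that argument has in mind (my own toy numbers, nothing measured): a per-flow hash must place each flow entirely on one member, so with a couple of elephant flows the member loads depend on where the hash happens to land them, while round-robin splits every flow's bytes evenly, at the cost of reordering.

# Toy illustration (invented flow names and byte counts): per-flow hashing can
# only place whole flows, so member imbalance is hostage to where the
# elephants land; strict round-robin splits every flow's bytes evenly.
from itertools import product

flows = {                         # hypothetical flows: name -> bytes carried
    "elephant-1": 9_000_000_000,
    "elephant-2": 9_500_000_000,
    "mouse-1": 50_000_000,
    "mouse-2": 40_000_000,
}
MEMBERS = 2

# Every way a per-flow hash could land these four flows on the two members:
imbalances = []
for placement in product(range(MEMBERS), repeat=len(flows)):
    load = [0] * MEMBERS
    for member, volume in zip(placement, flows.values()):
        load[member] += volume
    imbalances.append(max(load) - min(load))

print("per-flow hashing, bytes of imbalance: best %d, worst %d"
      % (min(imbalances), max(imbalances)))
print("strict round-robin: ~0 (every flow's bytes split across both members)")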
Well, not exactly the same thing. (But it's my mistake, I was referring to
L3 balancing, not L2 interface stuff.)
That should be the correct thing to be referring to.
load-balance per-packet will cause massive reordering,
If the buffering delay of ECMP paths cannot be controlled, yes.
because it's random
spray , caring about nothing except equal loading of the members.
Equal loading of point-to-point links between two routers by (weighted) round robin means mostly the same buffering delay, which won't cause massive reordering.
Since TCP does not know whether a duplicate ACK is caused by a lost
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
segment or just a reordering of segments, it waits for a small number
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
of duplicate ACKs to be received. It is assumed that if there is
just a reordering of the segments, there will be only one or two
duplicate ACKs before the reordered segment is processed, which will
then generate a new ACK. If three or more duplicate ACKs are
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
received in a row, it is a strong indication that a segment has been
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
lost.
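For what it's worth, the quoted behaviour is easy to model. A small Python sketch of it (my own toy, one byte per segment, a duplicate-ACK threshold of three as in the text above): a segment arriving one position late generates a single duplicate ACK, a segment displaced further generates three and triggers a spurious fast retransmit, and a lost segment keeps generating duplicates until the sender retransmits.

# Toy model of cumulative ACKs and the three-duplicate-ACK heuristic quoted
# above. Arrival orders are invented; segments are numbered from 0.
def acks_for(arrival_order):
    """Cumulative ACK the receiver emits for each arriving segment."""
    received, acks = set(), []
    for seg in arrival_order:
        received.add(seg)
        next_expected = 0
        while next_expected in received:
            next_expected += 1
        acks.append(next_expected)          # receiver always ACKs the next expected segment
    return acks

def max_dup_run(acks):
    """Longest run of repeated (duplicate) ACK values, which the threshold looks at."""
    best = run = 0
    for prev, cur in zip(acks, acks[1:]):
        run = run + 1 if cur == prev else 0
        best = max(best, run)
    return best

cases = {
    "slightly reordered": [0, 1, 3, 2, 4, 5, 6, 7],   # segment 2 arrives one place late
    "deeply reordered":   [0, 1, 3, 4, 5, 2, 6, 7],   # segment 2 arrives three places late
    "lost":               [0, 1, 3, 4, 5, 6, 7],      # segment 2 never arrives
}
for name, order in cases.items():
    acks = acks_for(order)
    dups = max_dup_run(acks)
    verdict = "fast retransmit" if dups >= 3 else "no retransmit"
    print(f"{name:18s} ACKs {acks} -> {dups} duplicate ACK(s), {verdict}")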
Unfortunately that is not strict round-robin load balancing.
Oh? What is it then, if it's not spraying successive packets across member links?
I do not
know about any equipment that offers actual round-robin
load-balancing.
Cisco had both per-destination and per-packet. Is that not it in the networking world?
Juniper's solution will cause way too much packet reordering for TCP to
handle. I am arguing that strict round-robin load balancing will
function better than hash-based in a lot of real-world
scenarios.
Ummh, no, it won't.
If it did, it would have been widespread. But it's not.
I believe the suggestion is that round-robin out-performs random spray. Random spray is what the HPC world is asking for, not round-robin.
Now I've not operated such a network where per-packet is useful, so I'm not sure why you'd want round-robin over random spray, but I can easily see why you'd want either a) random traffic or b) random spray. If neither is true, i.e. you have strict round-robin and non-random traffic, say every other packet is a big data delivery and every other packet is a small ACK, you can easily synchronise one link to 100% util and another to near 0% if you do true round-robin, but not if you do random spray.
I don't see a downside random spray would have over round-robin, but I wouldn't be shocked if there is one.
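That synchronisation pathology is easy to see in a toy model (mine, with invented packet sizes): with traffic that strictly alternates 1500-byte data packets and 64-byte ACKs over two links, strict round-robin puts every big packet on one link and every ACK on the other, while random spray averages out.

# Toy model of the round-robin synchronisation pathology: alternating
# 1500B/64B packets over 2 links. Traffic pattern is invented.
import itertools, random

random.seed(0)
packets = [1500 if i % 2 == 0 else 64 for i in range(100_000)]   # data, ACK, data, ACK, ...

rr = itertools.cycle(range(2))
rr_bytes, spray_bytes = [0, 0], [0, 0]
for size in packets:
    rr_bytes[next(rr)] += size                  # strict round-robin
    spray_bytes[random.randrange(2)] += size    # random spray

total = sum(packets)
print("round-robin byte share :", [round(b / total, 3) for b in rr_bytes])
print("random spray byte share:", [round(b / total, 3) for b in spray_bytes])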
I see this thread is mostly starting to loop around two debates:
1) Reordering is not a problem
 - if you control the application, you can make it zero problem
 - if you use the TCP shipping in Android, iOS, macOS, Windows, Linux, or BSD, reordering is in practice as bad as packet loss
 - the people on this list who know this don't know it because they read it; they know it because they got caught with their pants down and learned it, because they had reordering and TCP performance was destroyed, even at very low reorder rates
 - we could design TCP congestion control that is very tolerant to reordering, but I cannot say if it would be an overall win or loss
2) Reordering won't happen with per-packet, if there is no congestion and latencies are equal
 - receiving distributed routers (~all of them) do not have global synchronisation; they make no guarantee that ingress order is honored on egress when ingress is >1 interface, and the amount of reordering this alone causes will destroy customer expectations of TCP performance
 - we could quite easily guarantee order as long as the interfaces are in the same hardware complex, but it would be very difficult to guarantee it between hardware complexes
Oh? What is it then, if it's not spraying successive packets across
member links?
It sprays the packets more or less randomly across links, and each link
then does individual buffering. It introduces an unnecessary random
delay to each packet, when it could just place them successively on the
next link.
Ummh, no, it won't.
If it did, it would have been widespread. But it's not.
It seems optimistic to argue that we have reached perfection in
networking.
The Linux TCP stack does not immediately start backing off when it
encounters packet reordering. In the server world, packet-based
round-robin is a fairly common interface bonding strategy, with the
accompanying reordering, and generally it performs great.
At a previous $dayjob at a Tier 1, we would only support LAG for a customer L2/3 service if the ports were on the same card. The response we gave if customers pushed back was “we don’t consider LAG a form of circuit protection, so we’re not going to consider physical resiliency in the design”, which was true, because we didn’t, but it was beside the point. The real reason was that getting our switching/routing platform to actually run traffic symmetrically across a LAG, which most end users considered expected behavior in a LAG, required a reconfiguration of the default hash, which effectively meant that [switching/routing vendor]'s TAC wouldn’t help when something invariably went wrong. So it wasn’t that it wouldn’t work (my recollection at least is that everything ran fine in lab environments) but we didn’t trust the hardware vendor support.
We've had the odd bug here and there with LAGs for things like VRRP, BFD, etc. But we have not run into that specific issue before on ASR1000s, ASR9000s, CRS-Xs, and MX. 98% of our network is Juniper nowadays, but even when we ran Cisco and had LAGs across multiple line cards, we didn't see this problem.
The only hashing issue we had with LAGs was when we tried to carry Layer 2 traffic across them in the core. But this was just a limitation of the CRS-X, and it happened also on member links of a LAG that shared the same line card.
If you have
Linux - 1RU cat-or-such - Router - Internet
round-robin between the Linux box and the 1RU is mostly gonna work, because it satisfies the a) non-congested, b) equal-RTT, c) non-distributed (single-pipeline ASIC switch, honoring ingress order on egress) requirements. But it is quite a special case, and of course there is then only round-robin on one link in one direction.
Between 3.6 and 4.4 all multipath in Linux was broken, and I still to this day help people with multipath problems, complaining it doesn't perform (in a LAN!).
3.6 introduced the FIB to replace the flow cache, and made multipath essentially random.
4.4 replaced random with a hash.
When I ask them 'do you see reordering', people mostly reply 'no', because they look at a PCAP and it doesn't look important to the human observer; it is such an insignificant amount. Invariably the problem goes away with hashing. (netstat -s is better than intuition on a PCAP.)
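If anyone wants numbers instead of intuition, the counters netstat -s summarizes can also be read straight out of /proc/net/netstat. A rough Python sketch (Linux only; the counter names listed are examples and vary by kernel version):

# Read TcpExt counters related to reordering/out-of-order delivery from
# /proc/net/netstat (the same data 'netstat -s' summarizes). Counter names
# differ between kernels, so treat the ones below as examples.
INTERESTING = ("TCPOFOQueue", "TCPSACKReorder", "TCPTSReorder", "TCPRenoReorder")

def tcp_ext_counters(path="/proc/net/netstat"):
    counters = {}
    with open(path) as fh:
        lines = fh.read().splitlines()
    # The file alternates header lines ("TcpExt: name name ...") and value lines.
    for header, values in zip(lines[::2], lines[1::2]):
        if header.startswith("TcpExt:"):
            names = header.split()[1:]
            nums = [int(v) for v in values.split()[1:]]
            counters.update(zip(names, nums))
    return counters

if __name__ == "__main__":
    c = tcp_ext_counters()
    for name in INTERESTING:
        print(f"{name:16s} {c.get(name, 'n/a')}")

A steadily climbing out-of-order or reorder counter on the receiver tends to show the problem long before anything jumps out of a capture by eye.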