interesting troubleshooting

I just ran into an issue that I thought was worth sharing with the NANOG community. With recently increased visibility on keeping the Internet running smoothly, I thought that sharing this small experience could benefit everyone.

I was contacted by my NOC to investigate a LAG that was not distributing traffic evenly among the members to the point where one member was congested while the utilization on the LAG was reasonably low. Looking at my netflow data, I was able to confirm that this was caused by a single large flow of ESP traffic. Fortunately, I was able to shift this flow to another path that had enough headroom available so that the flow could be accommodated on a single member link.

With the increase in remote workers and VPN traffic that won’t hash across multiple paths, I thought this anecdote might help someone else track down a problem that might not be so obvious.

Please take this message in the spirit in which it was intended and refrain from the snarky “just upgrade your links” comments.
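For anyone wondering why the LAG hash can't help here, a toy sketch of the usual 5-tuple hash shows the problem (the field names and the CRC are my own illustration, not any vendor's actual algorithm): ESP has no ports for the hash to chew on, so the whole tunnel lands on one member.

    import zlib

    MEMBERS = 4  # number of LAG member links

    def member_for(src_ip, dst_ip, proto, sport=0, dport=0):
        # ESP (IP protocol 50) exposes no L4 ports, so sport/dport stay 0.
        key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
        return zlib.crc32(key) % MEMBERS

    # Many TCP sessions spread across all members...
    print({member_for("198.51.100.10", "203.0.113.20", 6, 40000 + i, 443)
           for i in range(1000)})                     # -> typically {0, 1, 2, 3}

    # ...but one big site-to-site ESP tunnel is a single, unchanging key:
    print(member_for("198.51.100.10", "203.0.113.20", 50))  # always the same member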

Do we know which specific VPN technologies are harder to hash in a
meaningful way for load balancing purposes than others?

If the outcome of this troubleshooting is a list of recommendations
about which VPN approaches to use, and which ones to avoid (because of
the issue you described), that'll be a great outcome.

Kind regards,

Job

It’s the protocol 50 IPSEC VPNs. They are very sensitive to path changes and reordering as well.

If you’re tunneling more than 5 or 10Gb/s of IPSEC it’s likely going to be a bad day when you find a low speed link in the middle. Generally providers with these types of flows have both sides on the same network vs going off-net as they’re not stable on peering links that might change paths.

You also need to watch out to ensure you’re not on some L2VPN type product that bumps up against a barrier. I know it’s a stressful time for many networks and systems people as traffic shifts. Good luck out there!

- Jared

Hey Nimrod,

I was contacted by my NOC to investigate a LAG that was not distributing traffic evenly among the members to the point where one member was congested while the utilization on the LAG was reasonably low. Looking at my netflow data, I was able to confirm that this was caused by a single large flow of ESP traffic. Fortunately, I was able to shift this flow to another path that had enough headroom available so that the flow could be accommodated on a single member link.

With the increase in remote workers and VPN traffic that won't hash across multiple paths, I thought this anecdote might help someone else track down a problem that might not be so obvious.

This problem is called an elephant flow. Some vendors have a solution for
this: dynamically monitoring utilisation and remapping the
hashResult => egressInt table to create a bias that offsets the elephant
flow.

One particular example:

Ideally, VPN providers would be defensive and use the UDP source port
(SPORT) for entropy, the way MPLSoUDP does.
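To make that concrete, here is a rough sketch of the idea (purely illustrative; the function and the port-range choice are mine, not any particular implementation): the tunnel ingress hashes the inner flow keys once and encodes the result in the outer UDP source port, so transit routers regain per-inner-flow entropy without parsing the payload.

    import zlib

    def entropy_sport(inner_src, inner_dst, inner_proto, inner_sport, inner_dport):
        h = zlib.crc32(f"{inner_src}|{inner_dst}|{inner_proto}|"
                       f"{inner_sport}|{inner_dport}".encode())
        # Stay in the ephemeral range so middleboxes are less likely to object.
        return 49152 + (h % 16384)

    # Two inner flows between the same endpoints now differ in the outer header:
    print(entropy_sport("10.0.0.5", "10.1.0.9", 6, 51515, 443))
    print(entropy_sport("10.0.0.5", "10.1.0.9", 6, 51516, 443))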

Once upon a time, Nimrod Levy <nimrodl@gmail.com> said:

With the increase in remote workers and VPN traffic that won't hash across
multiple paths, I thought this anecdote might help someone else track down
a problem that might not be so obvious.

Last week I ran into an issue where traffic between my home and work
networks had high latency, but only to certain IPs (even different IPs
on the same server). Since my work network peers with my home provider,
I was able to go to the provider's NOC, and they were very helpful (they
ended up turning up more bandwidth). I expect this was also a case of
one LAG member being congested, and my problem IP pairs were hashing to
that member.

My traffic wasn't VPN (SSH, with ping/mtr for testing), but it is
possible that somebody else's was - I didn't get detailed with the other
NOC.

There are several caveats to doing dynamic monitoring and remapping of
flows; one of the biggest challenges is that it puts extra demands on the
line cards tracking the flows, especially as the number of flows rises to
large values. I recommend reading
https://www.juniper.net/documentation/en_US/junos/topics/topic-map/load-balancing-aggregated-ethernet-interfaces.html#id-understanding-aggregated-ethernet-load-balancing
before configuring it.

“Although the feature performance is high, it consumes significant amount of line card memory. Approximately, 4000 logical interfaces or 16 aggregated Ethernet logical interfaces can have this feature enabled on supported MPCs. However, when the Packet Forwarding Engine hardware memory is low, depending upon the available memory, it falls back to the default load balancing mechanism.”

What is that old saying?

Oh, right: There Ain’t No Such Thing As A Free Lunch. ^_^;;

Matt

A few years ago we did a presentation about what can happen when hashing
for load balancing purposes doesn't work well (be it for IP or L2VPN
traffic). I think some of the information is still relevant, as there
really isn't much difference between the problem living in the
underlay network's implementation of the algorithms or in the properties
of the envelope that encompasses the overlay network packet.

    video of younger job + jeff: https://www.youtube.com/watch?v=cXSwoKu9zOg
    slides: https://archive.nanog.org/meetings/nanog57/presentations/Tuesday/tues.general.SnijdersWheeler.MACaddresses.14.pdf

Kind regards,

Job

Was that large flow in a single LSP? Is this something that FAT LSP would fix?

-Steve

I would expect it to be true of any site to site VPN data flow. The
whole idea is for the guy in the middle to be unable to deduce
anything about the flow. If the technology provides hints about which
packets match the same subflow, it isn't doing a very good job.

Regards,
Bill Herrin

Hey Matthew,

There are *several* caveats to doing dynamic monitoring and remapping of
flows; one of the biggest challenges is that it puts extra demands on the
line cards tracking the flows, especially as the number of flows rises to
large values. I recommend reading
Load Balancing on Aggregated Ethernet Interfaces | Junos OS | Juniper Networks
before configuring it.

You are confusing two features: stateful and adaptive. I was proposing
adaptive, which just remaps the table and is essentially free; it is not
flow aware. The number of hash results is a small, bounded number, whereas
the number of flow states is a very large, unbounded one.

No.

FAT adds an additional MPLS label for entropy: the ingress PE calculates a
flow hash based on the traditional flow keys and injects that flow number as
an MPLS label, so a transit LSR can use the MPLS labels for balancing
without being able to parse the frame. Similarly, a VPN provider could do
that and inject the flow hash as the SPORT at the time of tunneling, by
looking at the inside packet. Any defensive VPN provider should do this, as
it would be a competitive advantage.
Now, on some vendors, like Juniper and Nokia, a transit LSR can look inside
the pseudowire L3 packet for flow keys, so you don't even need FAT for this.
Others, like the ASR9k, cannot, and you'll need FAT for them.
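A rough sketch of the concept (the label values and the hash are invented for illustration; this is not a real FAT implementation): the ingress PE derives a flow label from the payload's flow keys and pushes it at the bottom of the stack, so a transit LSR can balance on the label stack alone.

    import zlib

    def fat_label_stack(transport_label, pw_label, inner_flow_keys):
        # Flow labels live in the 20-bit label space and must avoid the
        # reserved values 0-15.
        h = zlib.crc32("|".join(map(str, inner_flow_keys)).encode())
        flow_label = 16 + (h % ((1 << 20) - 16))
        # A transit LSR hashes over [transport, pw, flow] without parsing the frame.
        return [transport_label, pw_label, flow_label]

    print(fat_label_stack(299776, 16001, ("10.0.0.5", "10.1.0.9", 6, 51515, 443)))
    print(fat_label_stack(299776, 16001, ("10.0.0.6", "10.1.0.9", 6, 51999, 443)))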

But all of this requires that there is entropy to use; if it's truly
just a single fat flow, then you won't balance it. Then you have to
create a bias in the hashResult => egressInt table, which by default is
fair: each egressInt owns the same number of hashResults. For elephant
flows you want the congested egressInt to be mapped to a smaller number
of hashResults.
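A toy model of that biasing (the bucket count and the "move 20% of the buckets" figure are arbitrary, not any vendor's adaptive implementation): the table starts out fair, and when one member is congested by an elephant flow, some of its hash buckets are simply remapped to the other members.

    BUCKETS = 256
    MEMBERS = [0, 1, 2, 3]

    # Default: fair mapping, every member owns the same number of buckets.
    table = [MEMBERS[b % len(MEMBERS)] for b in range(BUCKETS)]

    def bias_away_from(table, congested, fraction=0.2):
        """Move `fraction` of the congested member's buckets to the others."""
        others = [m for m in MEMBERS if m != congested]
        owned = [b for b, m in enumerate(table) if m == congested]
        for i, bucket in enumerate(owned[:int(len(owned) * fraction)]):
            table[bucket] = others[i % len(others)]
        return table

    table = bias_away_from(table, congested=2)
    print({m: table.count(m) for m in MEMBERS})  # member 2 now owns fewer buckets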

So the three or four times we tried to get FAT going (in a multi-vendor
network), it simply didn't work.

Have you (or anyone else) had any luck with it, in practice?

Mark.

Yeah, we run it in a multi-vendor network (JNPR, CSCO, NOK); it works.

I would also recommend people exclusively use CW+FAT and disable
LSR payload heuristics (the JNPR default; by default it won't do heuristics
when the CW is present, though it can be made to do so with the CW too).

Only between Cisco boxes.

I still don't understand why the vendors cannot make it work in one direction only (the low-end platform would only need to remove an extra label, no need to inspect traffic).
That would help us a lot, since the majority of our traffic is downstream to the customer.

It is signalled separately for TX and RX, and some vendors do allow you
to signal it for only one direction.

(skipping up the thread some)

It’s the protocol 50 IPSEC VPNs. They are very sensitive to path changes and reordering as well.

If you’re tunneling more than 5 or 10Gb/s of IPSEC it’s likely going to be a bad day when you find a low speed link in the middle. Generally providers with these types of flows have both sides on the same network vs going off-net as they’re not stable on peering links that might change paths.

A bunch of times the advice given to folk in this situation is: "add
more entropy", which for IPsec/GRE/etc. VPNs really means more
endpoints.
For instance, adding 3 more IPs on either side for tunnel
egress/ingress will make the flows (ideally) smaller and more likely
to hash across different links in the intermediary network(s). This
also moves the load balancing back behind the customer prem, so ideally
perhaps even the N x M flows are now balanced a little better as well.

sometimes this works, sometimes it's hard to accomplish :(
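A sketch of what that looks like from the sender's side (the addresses and the selection hash are made up for illustration): with several tunnel source and destination addresses, each inner flow is pinned to one src/dst pair, so the core sees up to N x M smaller ESP flows instead of one elephant.

    import zlib

    LOCAL_ENDPOINTS = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]
    REMOTE_ENDPOINTS = ["198.51.100.1", "198.51.100.2", "198.51.100.3"]

    def outer_addresses(inner_flow_keys):
        h = zlib.crc32("|".join(map(str, inner_flow_keys)).encode())
        src = LOCAL_ENDPOINTS[h % len(LOCAL_ENDPOINTS)]
        dst = REMOTE_ENDPOINTS[(h >> 8) % len(REMOTE_ENDPOINTS)]
        return src, dst  # up to N x M distinct outer pairs for transit routers to hash on

    print(outer_addresses(("10.0.0.5", "10.1.0.9", 6, 51515, 443)))
    print(outer_addresses(("10.0.0.7", "10.1.0.9", 6, 51700, 443)))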

Yep, the RFC (RFC 6391) gives this option.
Does Juniper MX/ACX series support it?
I know for sure Cisco doesn't.

We weren't as successful (MX480 ingress/egress devices transiting a CRS
core).

In the end, we updated our policy to avoid running LAGs in the
backbone and to go ECMP instead. Even with l2vpn payloads, that spreads
a lot more evenly.

Mark.

I don't know how well-known this is, and it may not be something many people would want to do, but Enterasys switches, now part of Extreme's portfolio, allow "round-robin" as a load-sharing algorithm on LAGs.

see e.g.

https://gtacknowledge.extremenetworks.com/articles/How_To/How-to-configure-LACP-Output-Algorithm-as-Round-Robin

This may not be the only product line supporting this.

So Junos does support both per-flow and per-packet load balancing on
LAGs on Trio line cards.

We tested this back in 2014 for a few months, and while the spread is
excellent (obviously), it creates a lot of out-of-order frame delivery
conditions, and all the pleasure & joy that goes along with that.
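A toy illustration of why per-packet spraying reorders a flow (the latencies and gaps are invented numbers): if the member links differ even slightly in delay, consecutive packets of the same flow overtake each other, whereas per-flow hashing keeps them on one member and in order.

    MEMBER_LATENCY_MS = [0.40, 0.52, 0.40, 0.61]   # per-member one-way delay (made up)

    def arrivals_round_robin(n_packets, gap_ms=0.05):
        # packet i leaves at i*gap and is sprayed onto member i % 4
        arr = [(i, i * gap_ms + MEMBER_LATENCY_MS[i % 4]) for i in range(n_packets)]
        return [seq for seq, _ in sorted(arr, key=lambda x: x[1])]

    def arrivals_per_flow(n_packets, gap_ms=0.05):
        # the whole flow hashes to one member, so ordering is preserved
        arr = [(i, i * gap_ms + MEMBER_LATENCY_MS[0]) for i in range(n_packets)]
        return [seq for seq, _ in sorted(arr, key=lambda x: x[1])]

    print(arrivals_round_robin(8))  # -> [0, 2, 1, 4, 6, 3, 5, 7]: out of order
    print(arrivals_per_flow(8))     # -> [0, 1, 2, 3, 4, 5, 6, 7]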

So we switched back to per-flow load balancing, and more recently, where
we run LAGs (802.1Q trunks between switches and an MX480 in the data
centre), we've gone 100Gbps so we don't have to deal with all this
anymore :-).

Mark.