interesting troubleshooting

I just ran into an issue that I thought was worth sharing with the NANOG community. With recently increased visibility on keeping the Internet running smoothly, I thought that sharing this small experience could benefit everyone.

I was contacted by my NOC to investigate a LAG that was not distributing traffic evenly among the members to the point where one member was congested while the utilization on the LAG was reasonably low. Looking at my netflow data, I was able to confirm that this was caused by a single large flow of ESP traffic. Fortunately, I was able to shift this flow to another path that had enough headroom available so that the flow could be accommodated on a single member link.

With the increase in remote workers and VPN traffic that won’t hash across multiple paths, I thought this anecdote might help someone else track down a problem that might not be so obvious.

Please take this message in the spirit in which it was intended and refrain from the snarky “just upgrade your links” comments.
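For anyone wondering why the LAG hash can't help here, a toy sketch of the usual 5-tuple hash shows the problem (the field names and the CRC are my own illustration, not any vendor's actual algorithm): ESP has no ports for the hash to chew on, so the whole tunnel lands on one member.

    import zlib

    MEMBERS = 4  # number of LAG member links

    def member_for(src_ip, dst_ip, proto, sport=0, dport=0):
        # ESP (IP protocol 50) exposes no L4 ports, so sport/dport stay 0.
        key = f"{src_ip}|{dst_ip}|{proto}|{sport}|{dport}".encode()
        return zlib.crc32(key) % MEMBERS

    # Many TCP sessions spread across all members...
    print({member_for("198.51.100.10", "203.0.113.20", 6, 40000 + i, 443)
           for i in range(1000)})                     # -> typically {0, 1, 2, 3}

    # ...but one big site-to-site ESP tunnel is a single, unchanging key:
    print(member_for("198.51.100.10", "203.0.113.20", 50))  # always the same member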

Do we know which specific VPN technologies are harder to hash in a
meaningful way for load balancing purposes than others?

If the outcome of this troubleshooting is a list of recommendations
about which VPN approaches to use, and which ones to avoid (because of
the issue you described), that'll be a great outcome.

Kind regards,

Job

It’s the protocol 50 IPSEC VPNs. They are very sensitive to path changes and reordering as well.

If you’re tunneling more than 5 or 10Gb/s of IPSEC it’s likely going to be a bad day when you find a low speed link in the middle. Generally providers with these types of flows have both sides on the same network vs going off-net as they’re not stable on peering links that might change paths.

You also need to watch out to ensure you’re not on some L2VPN type product that bumps up against a barrier. I know it’s a stressful time for many networks and systems people as traffic shifts. Good luck out there!

- Jared

Hey Nimrod,

I was contacted by my NOC to investigate a LAG that was not distributing traffic evenly among the members to the point where one member was congested while the utilization on the LAG was reasonably low. Looking at my netflow data, I was able to confirm that this was caused by a single large flow of ESP traffic. Fortunately, I was able to shift this flow to another path that had enough headroom available so that the flow could be accommodated on a single member link.

With the increase in remote workers and VPN traffic that won't hash across multiple paths, I thought this anecdote might help someone else track down a problem that might not be so obvious.

This problem is called an elephant flow. Some vendors have a solution for
this: dynamically monitoring utilisation and remapping the
hashResult => egressInt table to create a bias that offsets the elephant
flow.

One particular example:

Ideally, VPN providers would be defensive and use the UDP source port
(SPORT) for entropy, the way MPLSoUDP does.
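To make that concrete, here is a rough sketch of the idea (purely illustrative; the function and the port-range choice are mine, not any particular implementation): the tunnel ingress hashes the inner flow keys once and encodes the result in the outer UDP source port, so transit routers regain per-inner-flow entropy without parsing the payload.

    import zlib

    def entropy_sport(inner_src, inner_dst, inner_proto, inner_sport, inner_dport):
        h = zlib.crc32(f"{inner_src}|{inner_dst}|{inner_proto}|"
                       f"{inner_sport}|{inner_dport}".encode())
        # Stay in the ephemeral range so middleboxes are less likely to object.
        return 49152 + (h % 16384)

    # Two inner flows between the same endpoints now differ in the outer header:
    print(entropy_sport("10.0.0.5", "10.1.0.9", 6, 51515, 443))
    print(entropy_sport("10.0.0.5", "10.1.0.9", 6, 51516, 443))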

Once upon a time, Nimrod Levy <nimrodl@gmail.com> said:

With the increase in remote workers and VPN traffic that won't hash across
multiple paths, I thought this anecdote might help someone else track down
a problem that might not be so obvious.

Last week I ran into an issue where traffic between my home and work
networks had high latency, but only to certain IPs (even different IPs
on the same server). Since my work network peers with my home provider,
I was able to go to the provider's NOC, and they were very helpful (they
ended up turning up more bandwidth). I expect this was also a case of
one LAG member being congested, and my problem IP pairs were hashing to
that member.

My traffic wasn't VPN (SSH, with ping/mtr for testing), but it is
possible that somebody else's was - I didn't get detailed with the other
NOC.

There are several caveats to doing dynamic monitoring and remapping of
flows; one of the biggest challenges is that it puts extra demands on the
line cards tracking the flows, especially as the number of flows rises to
large values. I recommend reading
https://www.juniper.net/documentation/en_US/junos/topics/topic-map/load-balancing-aggregated-ethernet-interfaces.html#id-understanding-aggregated-ethernet-load-balancing
before configuring it.

“Although the feature performance is high, it consumes significant amount of line card memory. Approximately, 4000 logical interfaces or 16 aggregated Ethernet logical interfaces can have this feature enabled on supported MPCs. However, when the Packet Forwarding Engine hardware memory is low, depending upon the available memory, it falls back to the default load balancing mechanism.”

What is that old saying?

Oh, right: There Ain’t No Such Thing As A Free Lunch. ^_^;;

Matt

A few years ago we did a presentation about what can happen when hashing
for load balancing purposes doesn't work well (be it for IP or L2VPN
traffic). I think some of the information is still relevant, as there
really isn't much difference between the problem living in the
underlay network's implementation of the algorithms or in the properties
of the envelope that encompasses the overlay network packet.

    video of younger job + jeff: https://www.youtube.com/watch?v=cXSwoKu9zOg
    slides: https://archive.nanog.org/meetings/nanog57/presentations/Tuesday/tues.general.SnijdersWheeler.MACaddresses.14.pdf

Kind regards,

Job

Was that large flow in a single LSP? Is this something that FAT LSP would fix?

-Steve

I would expect it to be true of any site to site VPN data flow. The
whole idea is for the guy in the middle to be unable to deduce
anything about the flow. If the technology provides hints about which
packets match the same subflow, it isn't doing a very good job.

Regards,
Bill Herrin

Hey Matthew,

There are *several* caveats to doing dynamic monitoring and remapping of
flows; one of the biggest challenges is that it puts extra demands on the
line cards tracking the flows, especially as the number of flows rises to
large values. I recommend reading
Load Balancing on Aggregated Ethernet Interfaces | Junos OS | Juniper Networks
before configuring it.

You are confusing two features: stateful and adaptive. I was proposing
adaptive, which just remaps the table and is essentially free; it is not
flow aware. The number of hash results is a small, bounded number, whereas
the number of flow states is a very large, unbounded one.

No.

FAT adds an additional MPLS label for entropy: the ingress PE calculates a
flow hash based on the traditional flow keys and injects that flow number as
an MPLS label, so a transit LSR can use the MPLS labels for balancing
without being able to parse the frame. Similarly, a VPN provider could do
that and inject the flow hash as the SPORT at the time of tunneling, by
looking at the inside packet. Any defensive VPN provider should do this, as
it would be a competitive advantage.
Now, on some vendors, like Juniper and Nokia, a transit LSR can look inside
the pseudowire L3 packet for flow keys, so you don't even need FAT for this.
Others, like the ASR9k, cannot, and you'll need FAT for them.
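A rough sketch of the concept (the label values and the hash are invented for illustration; this is not a real FAT implementation): the ingress PE derives a flow label from the payload's flow keys and pushes it at the bottom of the stack, so a transit LSR can balance on the label stack alone.

    import zlib

    def fat_label_stack(transport_label, pw_label, inner_flow_keys):
        # Flow labels live in the 20-bit label space and must avoid the
        # reserved values 0-15.
        h = zlib.crc32("|".join(map(str, inner_flow_keys)).encode())
        flow_label = 16 + (h % ((1 << 20) - 16))
        # A transit LSR hashes over [transport, pw, flow] without parsing the frame.
        return [transport_label, pw_label, flow_label]

    print(fat_label_stack(299776, 16001, ("10.0.0.5", "10.1.0.9", 6, 51515, 443)))
    print(fat_label_stack(299776, 16001, ("10.0.0.6", "10.1.0.9", 6, 51999, 443)))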

But all of this requires that there is entropy to use; if it's truly
just a single fat flow, then you won't balance it. Then you have to
create a bias in the hashResult => egressInt table, which by default is
fair: each egressInt owns the same number of hashResults. For elephant
flows you want the congested egressInt to be mapped to a smaller number
of hashResults.
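A toy model of that biasing (the bucket count and the "move 20% of the buckets" figure are arbitrary, not any vendor's adaptive implementation): the table starts out fair, and when one member is congested by an elephant flow, some of its hash buckets are simply remapped to the other members.

    BUCKETS = 256
    MEMBERS = [0, 1, 2, 3]

    # Default: fair mapping, every member owns the same number of buckets.
    table = [MEMBERS[b % len(MEMBERS)] for b in range(BUCKETS)]

    def bias_away_from(table, congested, fraction=0.2):
        """Move `fraction` of the congested member's buckets to the others."""
        others = [m for m in MEMBERS if m != congested]
        owned = [b for b, m in enumerate(table) if m == congested]
        for i, bucket in enumerate(owned[:int(len(owned) * fraction)]):
            table[bucket] = others[i % len(others)]
        return table

    table = bias_away_from(table, congested=2)
    print({m: table.count(m) for m in MEMBERS})  # member 2 now owns fewer buckets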

So the three or four times we tried to get FAT going (in a multi-vendor
network), it simply didn't work.

Have you (or anyone else) had any luck with it, in practice?

Mark.

Yeah, we run it in a multi-vendor network (JNPR, CSCO, NOK); it works.

I would also recommend people exclusively use CW+FAT and disable
LSR payload heuristics (the JNPR default; by default it won't do heuristics
when the CW is present, though it can be made to do so with the CW too).

Only between Cisco boxes.

I still don't understand why the vendors cannot make it work in one direction only (the low-end platform would only need to remove an extra label, no need to inspect traffic).
That would help us a lot, since the majority of our traffic is downstream to the customer.

It is signalled separately for TX and RX, and some vendors do allow you
to signal it for only one direction.

(skipping up the thread some)

It’s the protocol 50 IPSEC VPNs. They are very sensitive to path changes and reordering as well.

If you’re tunneling more than 5 or 10Gb/s of IPSEC it’s likely going to be a bad day when you find a low speed link in the middle. Generally providers with these types of flows have both sides on the same network vs going off-net as they’re not stable on peering links that might change paths.

A bunch of times the advice given to folk in this situation is: "add
more entropy", which for IPsec/GRE/etc. VPNs really means more
endpoints.
For instance, adding 3 more IPs on either side for tunnel
egress/ingress will make the flows (ideally) smaller and more likely
to hash across different links in the intermediary network(s). This
also moves the load balancing back behind the customer prem, so ideally
perhaps even the N x M flows are now balanced a little better as well.

sometimes this works, sometimes it's hard to accomplish :(
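A sketch of what that looks like from the sender's side (the addresses and the selection hash are made up for illustration): with several tunnel source and destination addresses, each inner flow is pinned to one src/dst pair, so the core sees up to N x M smaller ESP flows instead of one elephant.

    import zlib

    LOCAL_ENDPOINTS = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]
    REMOTE_ENDPOINTS = ["198.51.100.1", "198.51.100.2", "198.51.100.3"]

    def outer_addresses(inner_flow_keys):
        h = zlib.crc32("|".join(map(str, inner_flow_keys)).encode())
        src = LOCAL_ENDPOINTS[h % len(LOCAL_ENDPOINTS)]
        dst = REMOTE_ENDPOINTS[(h >> 8) % len(REMOTE_ENDPOINTS)]
        return src, dst  # up to N x M distinct outer pairs for transit routers to hash on

    print(outer_addresses(("10.0.0.5", "10.1.0.9", 6, 51515, 443)))
    print(outer_addresses(("10.0.0.7", "10.1.0.9", 6, 51700, 443)))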

Yep, the RFC (RFC 6391) gives this option.
Does Juniper MX/ACX series support it?
I know for sure Cisco doesn't.

We weren't as successful (MX480 ingress/egress devices transiting a CRS
core).

In the end, we updated our policy to avoid running LAGs in the
backbone and to go ECMP instead. Even with l2vpn payloads, that spreads
a lot more evenly.

Mark.

I don't know how well-known this is, and it may not be something many people would want to do, but Enterasys switches, now part of Extreme's portfolio, allow "round-robin" as a load-sharing algorithm on LAGs.

see e.g.

https://gtacknowledge.extremenetworks.com/articles/How_To/How-to-configure-LACP-Output-Algorithm-as-Round-Robin

This may not be the only product line supporting this.

So Junos does support both per-flow and per-packet load balancing on
LAGs on Trio line cards.

We tested this back in 2014 for a few months, and while the spread is
excellent (obviously), it creates a lot of out-of-order frame delivery
conditions, and all the pleasure & joy that goes along with that.
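A toy illustration of why per-packet spraying reorders a flow (the latencies and gaps are invented numbers): if the member links differ even slightly in delay, consecutive packets of the same flow overtake each other, whereas per-flow hashing keeps them on one member and in order.

    MEMBER_LATENCY_MS = [0.40, 0.52, 0.40, 0.61]   # per-member one-way delay (made up)

    def arrivals_round_robin(n_packets, gap_ms=0.05):
        # packet i leaves at i*gap and is sprayed onto member i % 4
        arr = [(i, i * gap_ms + MEMBER_LATENCY_MS[i % 4]) for i in range(n_packets)]
        return [seq for seq, _ in sorted(arr, key=lambda x: x[1])]

    def arrivals_per_flow(n_packets, gap_ms=0.05):
        # the whole flow hashes to one member, so ordering is preserved
        arr = [(i, i * gap_ms + MEMBER_LATENCY_MS[0]) for i in range(n_packets)]
        return [seq for seq, _ in sorted(arr, key=lambda x: x[1])]

    print(arrivals_round_robin(8))  # -> [0, 2, 1, 4, 6, 3, 5, 7]: out of order
    print(arrivals_per_flow(8))     # -> [0, 1, 2, 3, 4, 5, 6, 7]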

So we switched back to per-flow load balancing, and more recently, where
we run LAGs (802.1Q trunks between switches and an MX480 in the data
centre), we've gone 100Gbps so we don't have to deal with all this
anymore :-).

Mark.