Strange connectivity issue Frontier EVPL

Jay_Hennigan · November 6, 2020, 4:59pm

We have a strange issue that defies logic. We have an NNI at our POP with Frontier serving as an aggregation circuit, with different customers on different VLANs. It's working well for several customers.

Bringing up a new customer shows roughly half of the IP addresses unreachable across the link, as if there's some kind of load-balancing or hashing function that's misdirecting half of the traffic. It's consistent: if an address is reachable, it's always reachable. If it's not reachable, it's never reachable. Everything ARPs fine.
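One way to test the hashing theory is to look for a bit pattern that cleanly separates the reachable addresses from the unreachable ones. The following is a minimal sketch, not anything taken from the actual circuit: the helper names `candidate_features` and `separating_features`, the choice of candidate bits, and the sample addresses (documentation range 192.0.2.0/24) are all assumptions for illustration.

```python
import ipaddress
from functools import reduce


def candidate_features(addr: str) -> dict:
    """Compute a few one-bit functions of an IPv4 address that a hash might key on."""
    ip = int(ipaddress.IPv4Address(addr))
    octets = ip.to_bytes(4, "big")
    # Low 8 bits of the address, one feature per bit.
    feats = {f"bit{n}": (ip >> n) & 1 for n in range(8)}
    # Least significant bit of the XOR of all four octets.
    feats["xor_octets_lsb"] = reduce(lambda a, b: a ^ b, octets) & 1
    # Parity of the number of set bits in the whole address.
    feats["popcount_parity"] = bin(ip).count("1") & 1
    return feats


def separating_features(reachable, unreachable):
    """Return the features that are constant within each group and differ between groups."""
    hits = []
    for name in candidate_features("0.0.0.0"):
        good = {candidate_features(a)[name] for a in reachable}
        bad = {candidate_features(a)[name] for a in unreachable}
        if len(good) == 1 and len(bad) == 1 and good != bad:
            hits.append(name)
    return hits


# Made-up sample data; substitute the addresses observed on the real link.
reachable = ["192.0.2.10", "192.0.2.12", "192.0.2.44"]
unreachable = ["192.0.2.11", "192.0.2.13", "192.0.2.45"]
print(separating_features(reachable, unreachable))
```

If some feature splits the two groups perfectly across a large sample, that supports the idea that something in the path is hashing on IP header fields. An empty result only rules out the handful of candidates tried here.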

The Frontier circuit is layer 2, so it shouldn't care about IP addresses. The Frontier tech shows no trouble. They changed the on-premises RAD device. We've triple-checked configurations, torn down and rebuilt the subinterface, etc., with no joy.

Any suggestions?

Matt_Hoppes · November 6, 2020, 5:08pm

Could you be running up against a MAC table limit on the circuit?

Aaron · November 6, 2020, 5:27pm

EVPL (E-Line) should not be learning MACs, so MAC table size should be a non-issue. The exception is if someone somewhere has constructed a 2-part bridge domain (in MEF-speak, an E-Tree or E-LAN of sorts), which would have MAC learning. Then Matt's question comes into play.

-Aaron

Jay_Hennigan · November 6, 2020, 5:34pm

Unlikely. The only MACs that should be in play are our gateway on our PE router and the customer's router, and both are in the address table and ARP. At layer 3, the customer can consistently reach about 50% of the IP addresses attempted.

Jeff_Richmond · November 6, 2020, 5:39pm

Jay, I previously ran the engineering org over there, so I sent this to my old team to look at, including the best engineer I know when it comes to the RADs. I will pass along anything they come back with.

Thanks,
-Jeff

Will · November 6, 2020, 5:54pm

I have similar Frontier NNIs out of One Wilshire, some 1 Gig, some 10 Gig.

While I haven't seen the half-IP-reachable issue you describe I have spent
days and days chasing performance issues on them. I finally got gig
line-rate capable iperf3 boxes at both ends and see distinct differences
in single-TCP stream performance vs running 3-4 streams, and the difference
disappears like clockwork at "unbusy hours" (1am-7am) every day.

After running hundreds of tests and adjusting my buffering and RED on both
ends of these circuits I just have come to the conclusion that they have
some LAGs somewhere "in the middle" that get busy during the day, and
they don't care if I have to run 4 TCP streams to max a 1gig circuit.
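The single-stream versus multi-stream gap is what per-flow hashing across LAG members would be expected to produce: one TCP stream always lands on one member and can only use that member's spare capacity, while several streams can spread out. A rough sketch of that reasoning follows. The 4-member LAG, the CRC32-based 5-tuple hash, the spare-capacity figures, and the function names are all assumptions for illustration; the real topology and hash inside the carrier network are unknown.

```python
import zlib


def member_for_flow(src_ip, dst_ip, proto, src_port, dst_port, members=4):
    """Pick a LAG member by hashing the 5-tuple (illustrative hash, not any vendor's)."""
    key = f"{src_ip}|{dst_ip}|{proto}|{src_port}|{dst_port}".encode()
    return zlib.crc32(key) % members


def achievable_mbps(flows, free_mbps_per_member, circuit_mbps=1000):
    """Estimate aggregate throughput for a set of flows.

    Flows that hash to the same member share that member's spare capacity,
    so the total is the spare capacity summed over the distinct members in
    use, capped at the circuit rate.
    """
    used = {
        member_for_flow(*flow, members=len(free_mbps_per_member))
        for flow in flows
    }
    return min(sum(free_mbps_per_member[m] for m in used), circuit_mbps)


# Hypothetical busy-hour state: each member has 300 Mbps to spare.
busy = [300, 300, 300, 300]
one_stream = [("198.51.100.1", "203.0.113.1", "tcp", 50000, 5201)]
four_streams = [
    ("198.51.100.1", "203.0.113.1", "tcp", 50000 + i, 5201) for i in range(4)
]
print("1 stream :", achievable_mbps(one_stream, busy), "Mbps")
print("4 streams:", achievable_mbps(four_streams, busy), "Mbps")
```

Under this model a single stream can never exceed the spare capacity of one member, and the gap closes off-hours when every member has headroom equal to the circuit rate.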

It makes browser-based speedtests look really bad but otherwise the
circuits are usable. We're trying to replace the worst ones with
wavelength services.

-Will Orton

Mike_Lyon · November 6, 2020, 6:14pm

What hardware is on each side?

Jay_Hennigan · November 6, 2020, 6:31pm

On our aggregate side an ASR920. Customer has a RAD device as the Frontier handoff. We've seen the same issue with multiple devices at the customer side including a laptop direct to the RAD.

Mike_Lyon · November 6, 2020, 6:46pm

I recently saw a very similar problem when Wave migrated us off of their antiquated 6500 to a brand new ASR920. EVPL had been working flawlessly for years on the 6500, but then stopped working when migrated to the ASR. We tried multiple ports on the ASR and then even another brand new ASR, with the same problem. We moved the circuit over to another (different) antiquated 6500 and all was good.

On my side, I was using a Mikrotik. I had the port in a bridge group and was seeing all the MAC addresses across the link, but for some reason they weren’t showing up in the ARP table of the Mikrotik. I tried a couple of other Mikrotik devices, same thing. I installed a dumb gigabit switch in the middle, same thing. However, when my laptop was plugged in, that worked.

So yes, I've seen the same weird behavior. As to how to fix it, no idea.

-Mike

Karsten_Thomann · November 6, 2020, 6:48pm

It sounds a bit like load balancing with one broken link…

Have you verified, for example with acl counters at both sides of the link, in which direction the packets are dropped?

As the customer has changed the devices, does the ASR use a MAC starting with 4 or 6?
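The 4-or-6 question presumably refers to a known failure mode on MPLS pseudowires carried without a control word: some transit routers inspect the first nibble after the label stack to guess whether the payload is IPv4 (4) or IPv6 (6) for load balancing, so an Ethernet frame whose destination MAC begins with 4 or 6 can be misparsed. Whether the Frontier transport is built that way is an assumption. A small sketch for checking a MAC address, with made-up example addresses and hypothetical helper names:

```python
import re


def first_nibble(mac: str) -> int:
    """Return the first hex digit of a MAC in colon, dash, or Cisco dotted form."""
    digits = re.sub(r"[^0-9A-Fa-f]", "", mac)
    if len(digits) != 12:
        raise ValueError(f"not a 48-bit MAC address: {mac!r}")
    return int(digits[0], 16)


def looks_like_ip_header(mac: str) -> bool:
    """True if the first nibble is 4 or 6, the IPv4/IPv6 version values."""
    return first_nibble(mac) in (4, 6)


# Example addresses are invented for illustration.
for mac in ("4c:77:6d:00:00:01", "00:1e:bd:00:00:01", "6c9c.ed00.0001"):
    print(mac, looks_like_ip_header(mac))
```

The destination MAC is what sits first in the frame, so both the PE gateway MAC and the customer router MAC are worth checking, one for each direction of traffic.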

My only idea at the moment is to generate load on the link with a UDP traffic generator toward an address that does not work end to end, and let them check where the traffic dies within their network.
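A dedicated traffic generator is the better tool, but a steady, identifiable UDP stream can be sketched in a few lines. This is a minimal example under stated assumptions: the function name `send_udp_probe`, its defaults, and the sequence-number payload layout are all invented here, and Python will not come close to line rate.

```python
import socket
import time


def send_udp_probe(dst_ip, dst_port, count=1000, size=512, pps=200, src_port=0):
    """Send `count` UDP datagrams at roughly `pps` packets per second.

    Each payload starts with an 8-byte big-endian sequence number and is
    zero-padded to `size` bytes (payloads are never shorter than 8 bytes).
    A fixed `src_port` keeps the 5-tuple constant so counters along the
    path see a single flow. Returns the number of datagrams sent.
    """
    padding = b"\x00" * max(size - 8, 0)
    interval = 1.0 / pps
    sent = 0
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("", src_port))
        for seq in range(count):
            sock.sendto(seq.to_bytes(8, "big") + padding, (dst_ip, dst_port))
            sent += 1
            time.sleep(interval)
    return sent
```

Calling, for example, `send_udp_probe("192.0.2.11", 5001, src_port=40000)` against one of the addresses that never works (the address and ports here are placeholders) gives the carrier a constant flow to follow hop by hop with interface or ACL counters.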

Karsten

Mike_Hammett · November 6, 2020, 8:18pm

This is my biggest complaint about non-wavelength transport. The provider is overselling a port somewhere in the circuit, unless it’s a wave.

Aaron · November 6, 2020, 8:57pm

My coworker is having similar issues with PS Lightwave and Alpheus/Logix
from San Antonio to Houston, where some things work and some things don't.

-Aaron

Tim_Burke1 · November 7, 2020, 12:23am

I'm amazed you can get *anything* to work with Logix involved. Haven't heard of many issues with PSLightwave in Houston, however... they seem to be one of the only halfway decent options here.

Adam_Korab2 · November 11, 2020, 4:18am

As it happens, I've just recently turned up a peering circuit with PSL in Houston, and their senior engineer is clue++

Naturally, he's on vacation this week, but [Aaron], ping me unicast if I might be able to assist, lend eyeballs, or make an introduction for you guys next week.

--Adam
