10G switch drops traffic for a split second

I recently upgraded my core network from 1G to 10G, and since the upgrade I
have noticed that my 10G switch, during peak traffic (1500 Mbps, 100,000 pps),
seems to drop traffic for a split second across all ports and all
VLANs. I immediately replaced the switch with a different brand/model and
the problem persists.

Sometimes traffic drops to zero, other times it drops to 50%. The problem is
very random but seems to occur much more frequently during high PPS (pushing
high traffic with iperf does not induce the problem).

Could this be MTU? I've tried flow control, hard-coding duplex, STP on/off,
etc.

I'm at a loss. Any ideas?

TJ Trout
Volt Broadband

What model switch?
What's the config look like, all L2 or L3 as well?

Luke Guillory
Network Operations Manager

Tel: 985.536.1212
Fax: 985.536.0300
Email: lguillory@reservetele.com

Reserve Telecommunications
100 RTC Dr
Reserve, LA 70084

Without more detail, I'm grasping at straws here, but see this recent
thread about QoS and microbursts on the juniper-nsp list:

https://puck.nether.net/pipermail/juniper-nsp/2016-November/033692.html

Do you have ports with different speeds connected?

Another idea: Are you using Spanning Tree Protocol and seeing lots of
TCNs?

If you have congestion on outgoing interfaces, you are most likely running out of packet buffer space on your switch. Campus-class switches in particular have small buffers, 4 MB or so, which can run out during high bursts and interface congestion. With some switches you can alleviate the problem by moving congested interfaces to ports with a separate buffer pool, but you have to check with your switch vendor or the documentation whether your switches have shared or split buffer pools. Or just replace your switches with ones that have deeper buffers.

Tomi
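
To put a rough number on the buffer point above, here is a minimal Python sketch of how long a small buffer survives a line-rate burst. The 4 MB figure is the one quoted above; the assumption that the whole buffer is available to one congested 1G egress port, and the helper name `time_to_fill_buffer`, are illustrative only.

```python
def time_to_fill_buffer(buffer_bytes, ingress_bps, egress_bps):
    """Seconds until a buffer overflows while ingress exceeds egress."""
    excess_bps = ingress_bps - egress_bps
    if excess_bps <= 0:
        return float("inf")  # the queue never grows
    return buffer_bytes * 8 / excess_bps


# Assumed figures: 4 MB of buffer, a 10G sender bursting into a 1G port.
BUFFER_BYTES = 4 * 1000 * 1000
fill_s = time_to_fill_buffer(BUFFER_BYTES, 10e9, 1e9)
print(f"buffer full after {fill_s * 1e3:.2f} ms of line-rate burst")
```

Under those assumptions the buffer lasts only a few milliseconds, which is far shorter than a typical one-second counter interval.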

As others have pointed out, you probably have a switch with small buffers.

If you also have flow control enabled and something triggers it to pause packet forwarding, your small-buffer switch might fill up all (shared) buffers on that port, and now you're dropping traffic to all ports.

So trying to find if you have something where flow control is enabled and is being triggered might be something worthwhile to do, and also perhaps just turn off flow control on all ports to make sure.

Luke;

All L2, no L3. Only 4 VLANs. 2 peers trunked to a router, which trunks back
to 2 devices (microwave backhauls).

Chuck;

All ports are 10G except the 2 peers, which are 1G and trunk back to a 10G
port for the router WAN.

No TCNs.

Brian;

I have tried an IBM G8124 and a Ubiquiti ES-16-XG; both show the exact same
drops across all ports, which makes me think it's a config issue: MTU, FC,
something.

Andrew;

I have tried with FC disabled, but I will try that one more time.

Mikael;

Is it possible to overrun the buffers of a switch with a 320 Gbps backplane
with only 1.5 Gbps of traffic? I think the switch is rated for 140M PPS and
I'm only pushing 100k PPS.

Yes, it is absolutely possible to overrun the buffers. Any kind of
backpressure (FC) from hosts, or 10G->1G transitions, can easily cause
it. Even if you're not over 1G averaged across a 10s window, if the 10G
sender sends too many frames back to back (say, from sendfile()-type
API calls), BOOM, the switch drops frames.
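
As a rough illustration of that 10G->1G case, here is a small Python simulation of one back-to-back burst hitting a tail-drop queue. Everything numeric in it is an assumption for the sake of the example (1500-byte frames, 4 MB of buffer available to the 1G port, a 20,000-frame burst), and `simulate_burst` is a hypothetical helper, not a model of any particular switch.

```python
def simulate_burst(frames, frame_bytes, in_bps, out_bps, buffer_bytes):
    """Return (peak_queue_bytes, dropped_frames) for one back-to-back burst."""
    gap_s = frame_bytes * 8 / in_bps        # arrival spacing at ingress line rate
    drain_per_gap = out_bps * gap_s / 8     # bytes the egress port sends per gap
    queue = 0.0
    peak = 0.0
    dropped = 0
    for _ in range(frames):
        queue = max(0.0, queue - drain_per_gap)  # egress drains between arrivals
        if queue + frame_bytes > buffer_bytes:
            dropped += 1                         # tail drop: no room for this frame
        else:
            queue += frame_bytes
            peak = max(peak, queue)
    return peak, dropped


# Assumed scenario: 20,000 x 1500-byte frames sent back to back at 10G
# toward a 1G port that has 4 MB of buffer to itself.
FRAMES, FRAME_BYTES = 20_000, 1500
peak, dropped = simulate_burst(FRAMES, FRAME_BYTES, 10e9, 1e9, 4_000_000)
avg_mbps = FRAMES * FRAME_BYTES * 8 / 1e6   # averaged over a 1 s window
print(f"1 s average: {avg_mbps:.0f} Mbps, peak queue: {peak:.0f} B, dropped: {dropped}")
```

The burst averages well under 1G over a full second, yet the queue still overflows, because the drops are decided on a millisecond timescale.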

I plan on disabling FC on everything tonight, I've done that before but I
want to be sure.

Is there anything that can be done about the transition from the 2 x 1G
peers to the 10G router trunk? Should I be rate limiting the VLAN for the
peers at 1G so the 10G router isn't trying to send more than 1G?

This thread reminded me of a blog post that struck me as useful 5 years
ago, and again today. Measuring throughput, when dealing with buffers and
troubleshooting errors and packet loss, must be done at a sub-one-second
sampling rate.

http://blog.serverfault.com/2011/06/27/per-second-measurements-dont-cut-it/

Beckman
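
A quick way to see why per-second numbers hide the problem is to compare a one-second average with per-millisecond rates for the same traffic. The sketch below uses made-up counters (a quiet link with a single 20 ms burst at 10G line rate); the sample data and the `rates_mbps` helper are hypothetical.

```python
def rates_mbps(byte_counts, interval_s):
    """Convert per-interval byte counts to Mbps for each interval."""
    return [count * 8 / interval_s / 1e6 for count in byte_counts]


# Hypothetical counters: 1000 samples of 1 ms each. The link is idle
# except for a 20 ms burst at 10G line rate (1.25 MB per millisecond).
samples = [0] * 1000
for i in range(500, 520):
    samples[i] = 1_250_000

per_ms = rates_mbps(samples, 0.001)
per_second = rates_mbps([sum(samples)], 1.0)

print(f"1 s average: {per_second[0]:.0f} Mbps")
print(f"1 ms peak:   {max(per_ms):.0f} Mbps")
```

With this data the one-second view shows a link running at about 200 Mbps, while the one-millisecond view shows it saturated at 10 Gbps for 20 ms.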

Yeah you also have to look for not so obvious things like MAC Pause
frames sent/received...QoS counters, all sorts of VERY platform
specific stuff. Right royal pain, especially since some do not expose
these statistics at all.

Here is the video Facebook did last year on monitoring, managing and troubleshooting large-scale networks, which covers this subject as well.

https://www.youtube.com/watch?v=BRY9xwg5nAU

Luke Guillory

If your switch is the typical small-buffered-switch that has become more and more common the past few years, then the entire switch might have buffer to keep packets for 0.1ms or less. So if someone says "flow control off" for 0.1ms, depending on the implementation, you might then start seeing packet drops on all ports until that device turns flow control back on.
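
For a sense of scale, here is a back-of-the-envelope Python sketch of how long a single PAUSE frame can hold off a sender and how much traffic piles up meanwhile. The quantum size (512 bit times) and the 16-bit pause_time field come from IEEE 802.3 Annex 31B; the 1.5 Gbps offered load is the peak figure from earlier in the thread, and assuming all of it is headed for the paused port is a deliberate worst case.

```python
PAUSE_QUANTUM_BITS = 512    # one pause quantum = 512 bit times
MAX_PAUSE_QUANTA = 0xFFFF   # pause_time is a 16-bit field


def pause_duration_s(quanta, link_bps):
    """How long a PAUSE of `quanta` quanta silences a link of `link_bps`."""
    return quanta * PAUSE_QUANTUM_BITS / link_bps


def bytes_arriving(duration_s, offered_bps):
    """Bytes that must be buffered (or dropped) while the port is paused."""
    return duration_s * offered_bps / 8


max_pause = pause_duration_s(MAX_PAUSE_QUANTA, 10e9)
backlog = bytes_arriving(max_pause, 1.5e9)   # assumed worst case, see above
print(f"max pause on 10G: {max_pause * 1e3:.2f} ms, backlog: {backlog / 1e3:.0f} kB")
```

One maximum-length pause on a 10G port lasts a bit over 3 ms, which is many times the 0.1 ms of buffering described above, and the pausing device can keep renewing it.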

I always disabled flow control on the theory that VoIP and flow control
are incompatible.
Just out of curiosity: does anyone have it enabled? If so, why?

Lee

Generally speaking, allowing any ethernet switch to *send* PAUSE frames is a
very bad idea, causing external head-of-line blocking and congestion spreading.

OTOH, a decent use-case of flow control is for subrate services.
For example, 622 Mbps microwave link with gigabit ethernet interfaces
ultimately needs to use flow control to properly inform the connected
equipment that this is only "622M ethernet" link and not a gigabit one.

     M.