10GE TOR port buffers (was Re: 10G switch recommendation)

Hi,

Is there a reason why switch vendors' 1U TOR 10GE aggregation switches are
all cut-through, with no models offering deep buffers?
I've been looking at all the vendors I can think of, and they all offer the same kinds of models.

TOR switches are cut-through with little buffers, and chassis-based
boxes have deep buffers.

TOR:
Juniper EX4500 208KB/10GE (4MB shared per PFE)
Cisco 4900M 728KB/10GE (17.5MB shared)
Cisco Nexus 3064 140KB/10GE (9MB shared)
Cisco Nexus 5000 680KB/10GE
Force10 S2410 I can't find it anymore, but it wasn't much
Arista 7148SX 123KB/10GE (80KB per port plus 5MB dynamic)
Arista 7050S 173KB/10GE (9MB shared)
Brocade VDX 6730-32 170KB/10GE
Brocade TurboIron 24X 85KB/10GE
HP 6600-24XG 4500KB/10GE
HP 5820-24XG-SFP+ 87KB/10GE
Extreme Summit X650 375KB/10GE

Chassis:
Juniper EX8200-8XS 512MB/10GE
Cisco WS-X6708-10GE 32MB/10GE (or 24MB)
Cisco N7K-M132XP-12 36MB/10GE
Arista DCS-7548S-LC 48MB/10GE
Brocade BR-MLX-10Gx8-X 128MB/10GE (not sure)

1GE aggregation:
Force10 S60 1250MB shared
HP 5830 3000MB shared

I am at a loss as to why there are no 10GE TOR switches with deep buffers.
Apparently there is a need for deep buffers, as the vendors make them
available in chassis linecards.
There are also deep-buffer 1GE aggregation switches.

Is there some (technical) reason for this?
I can imagine some vendors would say that you need to scale up to a
chassis if you need deep buffers, but at least one vendor should be
able to win quite a few customers with a deep-buffer 10G TOR switch.

I understand that flow-control should prevent loss from microbursts,
but my customers see adverse effects, with strongly negative
performance, if they let flow-control do its thing.

Any pointers as to why this is, or whether there is a solution for
microburst loss, would be greatly appreciated.

Thanks,

Bas

I'd take some of these with a grain of salt, though for the EX8200-8XS
the PDF does indeed agree.

The HP 6600 is a store-and-forward switch, not cut-through. The HP reps that I have dealt with seem to be pretty open to sharing architecture drawings of their stuff, so I bet you could probably get your hands on the same one that I have. Their NDA is a mutual disclosure, though, so that might make things tough depending on your organization's policies.

Tom

But do you generally agree that "the market" has a requirement for a
deep-buffer TOR switch?

Or am I crazy for thinking that my customers need such a solution?

Bas

In a message written on Fri, Jan 27, 2012 at 10:40:03PM +0100, bas wrote:

But do you generally agree that "the market" has a requirement for a
deep-buffer TOR switch?

Or am I crazy for thinking that my customers need such a solution?

You're crazy. :-)

You need to google "bufferbloat"; while the aim there has been more
at (SOHO) routers that have absurd (multi-second) buffers, the
concepts at play apply here as well.

Let's say you have a VOIP application with 250ms of jitter tolerance,
and you're going 80ms across country. You then add in a switch on
one end that has 300ms of buffer.

Oops, you go way over, but only from time to time when the switch
buffer is full, getting 300+80ms of latency for a few packets.
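
As a rough sanity check on those figures (purely illustrative sketch in
Python, using the numbers from the example above):

    # Jitter-budget check for the VoIP example above (illustrative numbers).
    path_delay_ms = 80        # cross-country delay from the example
    buffer_worst_ms = 300     # worst-case queuing delay in the deep-buffered switch
    jitter_budget_ms = 250    # what the VoIP application tolerates

    worst_case_ms = path_delay_ms + buffer_worst_ms
    print(worst_case_ms, worst_case_ms > jitter_budget_ms)  # 380 True: blows the budget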

Dropped packets are a _GOOD_ thing. If your Ethernet switch can't
get the packet out another port in ~1-2ms it should drop it. The
output port is congested, and congestion is what tells the sender to
back off. If you buffer the packets you get congestion collapse,
which is far worse for throughput in the end, and in particular has
severely detrimental effects on the others on the LAN, not just the box
filling the buffers.

A network dropping packets is healthy, telling the upstream boxes
to throttle to the appropriate speeds with packet loss, which is how
TCP operates. I can't tell you how many times I've seen network
engineers tell me "no matter how big I make the buffers performance
gets worse and worse". Well duh, you're just introducing more and
more latency into your network, and making TCP backoff fail rather
than work properly. I go in and slash their 50-100 packet buffers
down to 5 and magically the network performs great, even when full.

Now, how much buffer do you need? One packet is the minimum. If
you can't buffer one packet it becomes hard to reach 100% utilization
on a link. Anyone who's tried with a pure cut-through switch can
tell you it tops out around 90% (with multiple senders to a single
egress). Amazingly, one packet of buffer almost entirely fixes the
problem.

When I can manually set the buffers, I generally go for 1ms of buffers
on high speed (e.g. 10GE) links, and might increase that to as much as
15 ms of buffers on extremely low speed links, like sub-T1.

Remember, your RTT will vary (jitter) +- the sum of all buffers on all
hops along the path. A 10 hop path with 15ms per hop could see 150ms of
jitter if all links go between full and not full!
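
For concreteness, here is a rough sketch of what a buffer depth expressed
in milliseconds means in bytes, and how per-hop buffers add up to
worst-case jitter (Python, numbers purely illustrative):

    # Bytes needed to hold a given number of milliseconds of traffic at line rate.
    def buffer_bytes(link_bps, depth_ms):
        return link_bps / 8 * depth_ms / 1000

    print(buffer_bytes(10e9, 1))    # 10GE, 1 ms  -> 1,250,000 bytes (~1.25 MB)
    print(buffer_bytes(1.5e6, 15))  # ~T1, 15 ms  -> ~2,800 bytes

    # Worst-case added jitter is roughly the sum of the per-hop buffer depths:
    hops, per_hop_ms = 10, 15
    print(hops * per_hop_ms)        # -> 150 ms across the path, as above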

Buffers in most network gear are bad; don't do it.

Hi,

In a message written on Fri, Jan 27, 2012 at 10:40:03PM +0100, bas wrote:

But do you generally agree that "the market" has a requirement for a
deep-buffer TOR switch?

Or am I crazy for thinking that my customers need such a solution?

You're crazy. :-)

You need to google "bufferbloat"; while the aim there has been more
at (SOHO) routers that have absurd (multi-second) buffers, the
concepts at play apply here as well.

While your reasoning holds truth, it does not explain why the expensive
chassis solution makes my customers happy and the cheaper TOR
solution makes my customers unhappy...

Bufferbloat does not matter to them, as jitter and latency do not matter.
As long as the TCP window size negotiation is not reset, the total
amount of bits/sec increases for them.

If deep buffers are bad I would expect high-end chassis solutions not
to offer them either.
But the market seems to offer expensive deep buffer chassis solutions
and cheap (per 10GE) TOR solutions.
IMHO there is no clear reason why the expensive solution is not offered
in a 1U box...

My customers want to buffer 10 to 24 x 10GE into 1 or 2 10GE uplinks;
to do this they need some buffers...

Bas

Buffers in most network gear are bad; don't do it.

+1

I'm amazed at how many will spend money on switches with more buffering but won't take steps to ease the congestion. Part of the problem is trying to convince non-technical people that packet loss in and of itself doesn't have to be a bad thing, that it allows applications to adapt to network conditions. They can use tools to see packet loss, and that gives them something to complain about. They don't know how to interpret jitter or understand what impact it has on their applications. They just know that they can run some packet blaster, see a packet dropped, and want that to go away, so we end up in "every packet is precious" mode.

They would rather have a download that starts and stops and starts and stops than one that progresses smoothly from start to finish, and trying to explain to them that performance is "bursty" because nobody wants to allow a packet to be dropped sails right over their heads.

They'll accept crappy performance with no packet loss before they will accept better overall performance with an occasional lost packet.

If an application is truly intolerant of packet loss, then you need to address the congestion, not get bigger buffers.

While I agree _again_!!!!!

It does not explain why TOR boxes have little buffers and chassis boxes
have many...

Because that is what customers think they want so that is what they sell. Customers don't realize that the added buffers are killing performance.

I have had network sales reps tell me "you want this switch over here, it has bigger buffers" when that is exactly the opposite of what I want unless I am sending a bunch of UDP through very brief microbursts. If you are sending TCP streams, what you want is less buffering. Spend the extra money on more bandwidth to relieve the congestion.

Going to 4 10G aggregated uplinks instead of 2 might get you a much better performance boost than increasing buffers. But it really depends on the end to end application.

In a message written on Fri, Jan 27, 2012 at 11:30:14PM +0100, bas wrote:

While your reasoning holds truth, it does not explain why the expensive
chassis solution makes my customers happy and the cheaper TOR
solution makes my customers unhappy...

Bufferbloat does not matter to them, as jitter and latency do not matter.
As long as the TCP window size negotiation is not reset, the total
amount of bits/sec increases for them.

I obviously don't know your application. The bufferbloat problem
exists for 99.99% of the standard applications in the world. There
are, however, a few corner cases. For instance, if you want to
move a _single_ TCP stream at more than 1Gbps you need deep buffers.
Dropping a single packet slows throughput too much due to a slow-start
event. For most of the world with hundreds or thousands of TCP
streams across a single port, such problems never occur.
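
The back-of-the-envelope for that corner case is the bandwidth-delay
product; a quick sketch (Python, with my own illustrative rate and RTT,
not figures from anyone's actual network):

    # Classic rule of thumb: a single TCP flow needs roughly one bandwidth-delay
    # product of buffer to ride out a loss/recovery episode without the pipe draining.
    def bdp_bytes(rate_bps, rtt_ms):
        return rate_bps / 8 * rtt_ms / 1000

    print(bdp_bytes(3e9, 80))  # single 3 Gb/s stream, 80 ms RTT -> 30,000,000 bytes (~30 MB)
    # ~30 MB is chassis-linecard territory, far beyond a shallow-buffered TOR.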

If deep buffers are bad I would expect high-end chassis solutions not
to offer them either.
But the market seems to offer expensive deep buffer chassis solutions
and cheap (per 10GE) TOR solutions.

The margin on a top-of-rack switch is very low. 48 port gige with
10GE uplinks are basically commodity boxes, with plenty of competition.
Saving $100 on the bill of materials by cutting out some buffer
makes the box more competitive when it's at a $2k price point.

In contrast, large, modular chassis have a much higher margin. They are
designed with great flexibility, to take things like firewall modules
and SSL accelerator cards. There are configs where you want some (not
much) buffer due to these active appliances in the chassis, plus it is
easier to hide an extra $100 of RAM in a $100k box.

Also, as was pointed out to me privately, it is important to look
at adaptive queue management features. The most famous is WRED, but
there are other choices. Having a queue management solution on your
routers and switches that works in concert with the congestion control
mechanism used by the end stations always results in better goodput.
Many of the low end switches have limited or no AQM choices, while the
higher end switches with fancier ASICs can default to something like
WRED. Be sure it is the deeper buffers that are making the difference,
and not simply some queue management.
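
For what it's worth, the basic idea is simple enough to sketch; this is
the generic textbook RED/WRED drop decision, not any particular vendor's
implementation:

    import random

    # Generic RED/WRED-style early-drop decision based on average queue depth.
    def red_drop(avg_queue, min_th, max_th, max_p):
        if avg_queue < min_th:
            return False      # short queue: never drop
        if avg_queue >= max_th:
            return True       # long queue: always drop
        # In between, drop probability rises linearly toward max_p, nudging
        # TCP senders to back off before the queue (and its latency) fills up.
        return random.random() < max_p * (avg_queue - min_th) / (max_th - min_th)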

Hi,

Also, these TOR boxes go to my (more expensive ASR9K and MX) boxes, so
from a CAPEX standpoint I simply do not want to give them more ports
than required.

Hi,

The margin on a top-of-rack switch is very low. 48 port gige with
10GE uplinks are basically commodity boxes, with plenty of competition.
Saving $100 on the bill of materials by cutting out some buffer
makes the box more competitive when it's at a $2k price point.

The 10GE TOR switches I listed earlier list from $20K to $100K,
so the actual purchase cost for us would be $10K to $30K.
$500 for some (S)(Q)(bla)RAM shouldn't hold back a vendor from
releasing a bitchin' switch...

Again, this argument does not explain why there are 1GE aggregation
switches with deep buffers...

Also, as was pointed out to me privately, it is important to look
at adaptive queue management features. The most famous is WRED, but
there are other choices. Having a queue management solution on your
routers and switches that works in concert with the congestion control
mechanism used by the end stations always results in better goodput.
Many of the low end switches have limited or no AQM choices, while the
higher end switches with fancier ASICs can default to something like
WRED. Be sure it is the deeper buffers that are making the difference,
and not simply some queue management.

All true... Still no reason not to offer a deep-buffer TOR...

While I agree _again_!!!!!

It does not explain why TOR boxes have little buffers and chassis boxes
have many...

You need proportionally more buffer when you need to drain 16 x 10Gig
into 4 x 10Gig than when you're trying to drain 10Gb/s into 2 x 1Gb/s.

There's a big BOM-wise incentive not to use off-chip DRAM buffer in a
merchant-silicon single-chip switch vs. something that's more complex.
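
To put a number on that fan-in point (rough sketch, burst figures purely
illustrative):

    # Bytes that must be buffered (or dropped) while ingress exceeds egress.
    def queue_growth_bytes(in_bps, out_bps, burst_ms):
        return max(in_bps - out_bps, 0) / 8 * burst_ms / 1000

    print(queue_growth_bytes(16 * 10e9, 4 * 10e9, 1))  # 16x10G into 4x10G: 15,000,000 bytes per ms of burst
    print(queue_growth_bytes(10e9, 2 * 1e9, 1))        # 10G into 2x1G:      1,000,000 bytes per ms of burst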

In a message to George Bonser, Subject: Re: 10GE TOR port buffers (was
Re: 10G switch recommendation):

While I agree _again_!!!!!

It does not explain why TOR boxes have little buffers and chassis
boxes have many...

Because that is what customers think they want so that is what they
sell. Customers don't realize that the added buffers are killing
performance.

It is possible, trivial in fact, to buy a switch that has a buffer too
small to provide stable performance at some high fraction of its uplink
utilization. You can differentiate between the enterprise/soho 1gig
switch you bought to support your ip-phones and wireless APs and the
datacenter-spec 1U TOR along these lines.

It is also possible, and in fact easy, to have enough buffer to accumulate
latency in places where you should be discarding packets earlier.

I'd rather not be in either situation, but in the latter I can police my
way out of it.

Hi All,

While I agree _again_!!!!!

It does not explain why TOR boxes have little buffers and chassis boxes
have many...

You need proportionally more buffer when you need to drain 16 x 10Gig
into 4 x 10Gig than when you're trying to drain 10Gb/s into 2 x 1Gb/s.

There's a big BOM-wise incentive not to use off-chip DRAM buffer in a
merchant-silicon single-chip switch vs. something that's more complex.

I'm almost ready to throw in the towel and declare myself a loony...
I can imagine at least one vendor ignoring the extra BOM capex and
simply trying to please #$%^#@! like me.

C-NSP has been full of threads about the appalling microburst
performance of the 6500 for years...

And people who care have been using something other than a c6500 for
years. It's a 15-year-old architecture, and it's had a pretty good run,
but it's 2012.

An EX8200 has 512MB per port on non-oversubscribed 10Gig ports and 42MB
per port on 1Gig ports. That's a lot of RAM.

To take this back to actual TORs:

A Broadcom 56840-based switch has something in the neighborhood of 9MB
available for packet buffer on chip; if you need more, then more DRAMs
are in order. While the TOR can cut-through switch, the chassis can't.
The TOR is also probably not built with off-chip CAM (there are examples
of off-chip CAM as well) for much the same reason.

There are a couple of reasons for this: first, dropping the amount of buffer space decreases the cost of the hardware. Secondly, you really only need large buffers when you need to shape traffic. Shaping traffic is important if you're stepping down from a faster port to a slower port (a common use case for a blade switch like a c6500), or if you're running qos on the port and need to implement sophisticated queuing and policing. You can't run qos effectively without having generous buffers, which is why LAN switches typically have very little buffer space and metro Ethernet switches typically have lots.

In the case of a tor switch, the use case is typically one where you're not stepping down from a higher speed to a lower speed, and where you don't really need fancy qos. So as it's not generally needed for the sort of things that tor switches are used for, it's not added to the hardware spec.

Nick

It is also possible, and in fact easy, to have enough buffer to accumulate
latency in places where you should be discarding packets earlier.

I'd rather not be in either situation, but in the latter I can police my
way out of it.

That is why I added the "it depends on the end to end application" caveat.

I assumed that since he was asking about a "top of rack" (TOR) switch, he was actually using it as a top-of-rack switch, and that adding a couple more uplinks to his core would be cheaper than replacing all the hardware. Not understanding the topology and the application makes good recommendations a crapshoot, at best.