Broadcom vs Mellanox based platforms

Kasper_Adel · June 4, 2018, 5:41am

Hello

I’m asked to evaluate switching platforms that have different forwarding
chips but the same OS.

Assuming these vendors give the same SDK and similar documentation/support,
what comparison points should I consider, other than the obvious ones
(price, features, bps, pps)?

I’m thinking: how do I validate their claims about the capability to do
leaf/spine architectures, ToR/gateways, telemetry, serviceability, facilities
to troubleshoot packet drops or FIB programming misses, hidden tools, etc.?

It would be great if anyone can give some thoughts on this, especially if
you have tried one or both.

Thanks
Kim

ff1 · June 4, 2018, 8:33am

Hi Kim,

I'll share the key lessons I have learned since I started working on high-speed software networking in 2006, when everyone was laughing at me because I claimed I could achieve 10Gbps networking with a CPU.

CPU is less important than memory/QPI
On x86, the memory subsystem includes things like Cache Boxes, the Home Agent and DRAM controllers. The Home Agent is responsible for knowing which CPU node holds a given cacheline, so it can become a centralized bottleneck. DRAM controllers have a queue of pending DRAM requests (instruction pipeline, data prefetch, data...). QPI routing may also severely impact performance. I remember using a 4-socket system that delivered half the performance of a 2-socket system because of either bad QPI routing programming by the BIOS or a hardware issue.
An order of magnitude to keep in mind: at 100Gbps line rate with 64-byte packets, touching each packet and its associated 64-byte metadata cacheline consumes roughly a full DRAM channel. As an example, and not counting the application data to be leveraged (FIB, DNS database...), a 100Gbps DPDK bridging application requires 3 memory channels per port (to reach line rate, if the IO allows it). There is a lot more to say, but I'll let you do your own research.
BTW, why would you want to do 100Gbps line rate (or very close to it)? To ensure that each node has the capacity to resist a DDoS attack powered by DPDK/ODP/native "applications".
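As a rough sanity check of the DRAM-channel estimate above, here is a minimal Python sketch of the arithmetic. It assumes 20 bytes of per-frame Ethernet overhead (preamble plus inter-frame gap), one access per packet and per metadata cacheline, and a DDR4-2400 channel as the yardstick; none of these figures come from the post itself, and the function names are made up for illustration.

```python
# Back-of-the-envelope check. All constants are assumptions, not measurements.
ETH_OVERHEAD_BYTES = 20  # preamble + SFD (8 bytes) + inter-frame gap (12 bytes)
CACHELINE_BYTES = 64

def packets_per_second(link_bps, frame_bytes):
    """Line-rate packet rate for a given Ethernet frame size."""
    wire_bits = (frame_bytes + ETH_OVERHEAD_BYTES) * 8
    return link_bps / wire_bits

def dram_bytes_per_second(link_bps, frame_bytes, metadata_bytes=CACHELINE_BYTES):
    """Bytes/s moved if each packet and its metadata are touched exactly once."""
    return packets_per_second(link_bps, frame_bytes) * (frame_bytes + metadata_bytes)

# Assumed yardstick: one DDR4-2400 channel, 2400 MT/s * 8 bytes = 19.2 GB/s peak.
DDR4_2400_CHANNEL_BYTES_PER_S = 2400e6 * 8

pps = packets_per_second(100e9, 64)
traffic = dram_bytes_per_second(100e9, 64)
print(f"{pps / 1e6:.1f} Mpps")  # 148.8 Mpps
print(f"{traffic / 1e9:.2f} GB/s, "
      f"{traffic / DDR4_2400_CHANNEL_BYTES_PER_S:.2f} channels")  # 19.05 GB/s, 0.99 channels
```

This counts a single pass over the data. Every additional read or write of the packet (RX DMA write, CPU access, TX DMA read) adds to the total, which is one plausible way to arrive at several channels per port.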

PCI is your enemy (or not that good a friend)
PCI chipset behavior is complex. The typical max payload on x86 is 256 bytes. So I assumed that using a 1KB max payload to support the average 670-byte internet packet size would give better results... But no: early DMA transaction acknowledgement is disabled if the payload is not 256 bytes, so performance dropped significantly.
You may have an embedded switch on the NIC, so you think that offloading will give you a benefit. It does at low speed, but you can't build a 50Gbps service chain because most NICs sit in PCI x8 Gen3 slots, which are limited to about 50Gbps of bandwidth.
So the conclusion is: don't try to understand those limits; create a testbed that really mimics the target "size" and topology of your use case, and measure.
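To see roughly where a figure like 50Gbps for a Gen3 x8 slot can come from, here is a small Python sketch. The 8 GT/s lane rate and 128b/130b encoding are standard PCIe Gen3 figures; the 24 bytes of per-transaction overhead is an assumed round number for headers, framing and CRC, and the function names are hypothetical. Descriptor fetches, doorbells and flow-control traffic are not modeled and reduce the usable bandwidth further.

```python
def pcie_raw_gbps(gt_per_s, lanes, encoding=128 / 130):
    """Raw link bandwidth in Gbps after line encoding."""
    return gt_per_s * lanes * encoding

def pcie_payload_gbps(raw_gbps, payload_bytes, tlp_overhead_bytes=24):
    """Payload bandwidth when every transaction carries `payload_bytes`.

    tlp_overhead_bytes is an assumption; the real value depends on the
    header format and link configuration.
    """
    return raw_gbps * payload_bytes / (payload_bytes + tlp_overhead_bytes)

raw = pcie_raw_gbps(8.0, 8)  # Gen3 x8
print(f"raw: {raw:.1f} Gbps")
for payload in (64, 128, 256):
    print(f"{payload:4d}-byte payloads: {pcie_payload_gbps(raw, payload):.1f} Gbps")
```

The smaller the payload per transaction, the larger the share of the link spent on overhead, which is why small packets hurt more than the raw link rate suggests. Measuring on the real testbed remains the only reliable answer.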

Don't do tests at 10Gbps if your target is 100Gbps.
Starting at 50Gbps you will be bumping into the PCI DMA transaction rate barrier. Unless you have a smart IO model (multiple packets per DMA transaction, see Netcope for instance) supported in zero-copy by the SDK architecture, you won't reach line rate or have room left for an application (zero-copy of data or metadata reduction can save a DRAM channel for the application at this "speed"). I think (but am not sure) you can squeeze two packets into a buffer with Mellanox cards: that can be instrumental in reaching 50Gbps line rate, but I don't know if DPDK supports this feature.

Don't do pps at the switch level if your target is fast VM application behavior.
Measuring that a software switch can do 10Gbps line rate with 64-byte packets does not help at all to predict TCP application performance in a VM. Factors such as GRO/GSO support are more important, because the limiting factor is how far the TCP window opens.
I measured web traffic over IPsec links between VMs. The key performance factor was the latency of the switching/IPsec combo: if latency is above a certain level, the TCP window of the endpoints does not open and the in-between software switches become under-utilized.
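The link between latency and TCP throughput can be illustrated with the usual window/RTT bound: a sender cannot have more than one window of data in flight per round trip. The Python sketch below uses hypothetical window sizes and round-trip times chosen only for illustration; they are not the values from the IPsec measurement.

```python
def max_tcp_throughput_bps(window_bytes, rtt_seconds):
    """Upper bound on throughput for one TCP flow: one window per round trip."""
    return window_bytes * 8 / rtt_seconds

def window_needed_bytes(target_bps, rtt_seconds):
    """Bandwidth-delay product: window required to fill the pipe."""
    return target_bps * rtt_seconds / 8

# Hypothetical example: a 64 KiB window across paths with different latency.
for rtt in (100e-6, 1e-3, 10e-3):
    gbps = max_tcp_throughput_bps(64 * 1024, rtt) / 1e9
    print(f"RTT {rtt * 1e3:6.2f} ms -> at most {gbps:.2f} Gbps")

# Window needed to sustain 10 Gbps at 1 ms RTT (about 1.25 MB).
print(f"{window_needed_bytes(10e9, 1e-3):.0f} bytes")
```

A tenfold increase in latency through the switch/IPsec path cuts the achievable per-flow throughput tenfold for the same window, regardless of how many packets per second the switch can forward.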

My view is that if you use a hardware-specific SDK to build your hardware-specific application, you will get the best out of the hardware. The gains can range from 30% to 100% depending on the hardware, so it is not negligible (you may have to prove this assertion ;-). One major reason is the ability to use exactly the software metadata you need, which may shrink to a single cache line or even to no software metadata at all, as you could leverage the hardware descriptor directly. The other reason is the ability to leverage the device's native IO model, which DPDK may not support. The price to pay is hardware or vendor dependence.

FF

PS1: You may want to clarify your search: you haven't stated whether your interest is L2 or L3 switching, or whether you are considering bare-metal, container or VM switching.
If you want L3, then you probably want to focus on VPP, Contrail or Snabb rather than the low-level packet IO frameworks. With the latest Intel AVF technology, DPDK is almost irrelevant for VPP and actually slows things down on the same hardware (Intel XL710 card).
Additionally, the kernel community is working on AF_XDP, which may be relevant for your case.

PS2: I am not sure NANOG is the best list to discuss the technical details you want. That said, it may be the best place to discuss the use cases or realistic testbed setup.

Jean_Delestre · June 4, 2018, 3:24pm

Hello

Not sure how to compare, but I'd be very interested in the result of your
work!

Chris_Grundemann · June 4, 2018, 4:24pm

Mellanox commissioned a report along these lines from Tolly in 2016:
https://www.mellanox.com/related-docs/tolly/tolly-report-performance-evaluation-2016-march.pdf

Obviously a grain of salt is needed with any commissioned study - but it
will at least point you to some tests and methodologies that you can use...

Tom_Hill · June 4, 2018, 5:23pm

I'd start with a software vendor that supports both. The Cumulus Linux
docs are pretty good, and available online:

https://docs.cumulusnetworks.com/display/DOCS

Caveats for Mellanox Spectrum, and various Broadcom ASICs, are usually
listed in boxouts where appropriate. There's a whole page on *tested*
scales.

The software vendors are the ones that get access to the people at both
companies that /really/ know where the limitations are, so you're more
likely to find the best information dealing with one of them.

HTH,

Nick_Hilliard3 · June 4, 2018, 7:25pm

Power draw. Depending on your hosting costs, the differences in power draw between each chipset may impact the total cost of ownership over the accounting depreciation period of the kit. It's worth measuring and putting into your cost analysis.
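A minimal Python sketch of how a measured power difference could be folded into a cost comparison. The wattage, electricity price, depreciation period and PUE factor below are placeholder values, not figures for any real chipset or facility.

```python
def energy_cost(watts, years, price_per_kwh, pue=1.0):
    """Energy cost of a constant load over a period.

    pue scales the load for cooling and facility overhead (assumed factor).
    """
    hours = years * 365 * 24
    return watts / 1000 * hours * price_per_kwh * pue

# Placeholder inputs: 50 W difference per switch, 5-year depreciation,
# 0.15 currency units per kWh, PUE of 1.5.
per_switch = energy_cost(watts=50, years=5, price_per_kwh=0.15, pue=1.5)
print(f"extra cost per switch: {per_switch:.2f}")
print(f"extra cost for 200 switches: {200 * per_switch:.2f}")
```

Multiplied across a fleet, a modest per-switch difference can approach the purchase-price gap between platforms, which is why it is worth measuring under realistic load rather than taking datasheet figures.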

Nick

Sylvain_COUTANT · June 5, 2018, 12:57pm

Not willing to troll too much, but ... besides price, the main difference
is about disclosed chip specs.

Broadcom gives marketing specs (basic features, bps, pps) only. Chipset
documentation is disclosed only to some partners and is under NDA.

Mellanox is open about specs, afaik drivers are open source, etc.

Mellanox's commissioned study mentioned in another reply is a good starting
point. Think what you want about the results, but without public specs
from Broadcom, people have to experiment to understand how their silicon
works and where the (undisclosed) limits are. Good luck with this if you
expect predictable results under load ...

/Sylvain.