100G - Whitebox

I first sent to an IX-specific mailing list, but as I have yet to see the message hit the list, I figured I would post it here as well.

We've had multiple requests for 100G interfaces (instead of Nx10G) and no one seems to care about the 40G interfaces we have available.

Looking at cost-effective options brings us to whitebox switches. Obviously there's a wide range of hardware vendors, and there are a few OSes available as well. Cumulus seems to be the market leader, while IP Infusion seems to be the most feature-rich.

We're not doing any automation on the switches at this time, so it would still need decent manual configuration. It wouldn't need a Cisco-centric CLI, as we're quite comfortable managing standard Linux-type config files. We're not going all-in on some overlay either, given that we wouldn't be replacing our entire infrastructure, only supplementing it where we need 100G. I know that LINX has gone IP Infusion. What OS would be appropriate for our usage? I'm not finding many good comparisons of the OSes out there. I'm assuming any of them would work, but there may be gotchas that a "cheapest that meets requirements" approach doesn't quite unveil.

Any particular hardware platforms to go towards or avoid? Broadcom Tomahawk seems to be quite popular with varying control planes. LINX went Edgecore, which was on my list given my experience with other Accton brands. Fiberstore has a switch where they actually publish the pricing vs. a bunch of mystery.

Thoughts?

<snip>

Any particular hardware platforms to go towards or avoid? Broadcom Tomahawk seems to be quite popular with varying control planes. LINX went Edgecore, which was on my list given my experience with other Accton brands. Fiberstore has a switch where they actually publish the pricing vs. a bunch of mystery.

Tomahawk and Tomahawk 2 have precious little in the way of packet buffer (e.g. as little as 4 x 4MB for the original Tomahawk), which might be a problem in an environment where you need to rate-convert 100G-attached peers to a big bundle of 10s.
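To put that 4MB figure in perspective, here's a back-of-the-envelope sketch (a simple fluid model with assumed port speeds, not a datasheet calculation) of how long one such buffer slice can absorb a 100G-to-10G speed mismatch before tail drops begin:

```python
def overflow_time_ms(buffer_bytes, ingress_gbps, egress_gbps):
    """Time until a buffer fills when ingress exceeds egress."""
    excess_bps = (ingress_gbps - egress_gbps) * 1e9   # surplus bits/s
    return buffer_bytes * 8 / excess_bps * 1e3        # milliseconds

# One 4MB buffer slice, 100G ingress draining into a 10G egress:
print(f"{overflow_time_ms(4 * 2**20, 100, 10):.2f} ms")  # ~0.37 ms
```

Less than half a millisecond of burst absorption before packets start dropping, which is why the mismatch matters.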

Whitebox Broadcom DNX/Jericho is somewhat less common, but it does exist.

That 40G is less popular is, I think, no surprise. It was, and largely remains, consigned to datacenter applications.

DNX/Jericho would have sufficient buffers to handle the rate conversions?

You could try Mellanox. Some of the promotional stuff I've seen/heard indicates their focus on appropriately sized buffers and low packet loss on rate conversion.

Why don't we just swap out your 40G switch for a 100G switch? You've had the 40G one for a while, and we anticipate upgrades every 18-24 months.
    
                -Bill

The only viable merchant silicon chip that would be useful for an IXP is from the StrataDNX family, which houses the Jericho/Qumran/Petra/Arad chips from Broadcom. Having no packet buffer in the exchange point will shred performance significantly, especially when one of your bursty 100G customers starts sending data to 1/10G customers.

To the best of my knowledge, the only ones that offer DNX in whitebox fashion are Agema and Edgecore. But why whitebox? Except on a very few occasions, whitebox just means "I like paying for hardware and software on different invoices"; the TCO is just the same. As an exchange point, I also see that it can be hard to reap the benefits of all the hipster shit going on in these NOS startups: you want spanning-tree, port-security, something to load-balance over links, and perhaps an overlay technology such as VXLAN if the IXP becomes too big and distributed. This is almost too easy.

Whenever I see unbuffered mixed-speed IXPs, I ask if I can pay 25% of the port cost, since that is actually how much oomph I would get through the port.

// hugge @ 2603

Fredrik Korsbäck wrote:

The only viable merchant silicon chip that would be useful for an IXP
is from the StrataDNX family, which houses the
Jericho/Qumran/Petra/Arad chips from Broadcom. Having no packet
buffer in the exchange point will shred performance significantly,
especially when one of your bursty 100G customers starts sending data
to 1/10G customers.

To the best of my knowledge, the only ones that offer DNX in
whitebox fashion are Agema and Edgecore. But why whitebox? Except on
a very few occasions, whitebox just means "I like paying for hardware
and software on different invoices"; the TCO is just the same. As an
exchange point, I also see that it can be hard to reap the benefits
of all the hipster shit going on in these NOS startups: you want
spanning-tree, port-security, something to load-balance over links,
and perhaps an overlay technology such as VXLAN if the IXP becomes
too big and distributed. This is almost too easy.

so yeah, hmm.

spanning-tree: urgh.
port-security: no thank you. Static ACLs only.
core ecmp/lag load-balance: don't use vpls.

Buffering is hard, and it comes down to having a good understanding of
the cost/benefit analysis of your core network and your stakeholders'
requirements. The main problem category which IXPs will find
difficult to handle is the set of situations where two participant
networks are exchanging individual traffic streams at a rate which is
comparable to the egress port speed of the receiving network.

This could be 100G-connected devices sending 50G-80G traffic streams to
other 100G-connected devices, but the other main situation where this
would occur would be high speed CDNs sending traffic to access networks
where the individual ISP->customer links would be provisioned in roughly
the same speed category as the IXP-ISP link. For example, a small
provider doing high speed gpon or docsis, with a small IXP port, e.g.
because it's only for a couple of small peers, or maybe it's a backup
port or something.

In that situation, tcp streams will flood the IXP port at rates which
are comparable to the ISP-access layer. If you're not buffering in that
situation, the ISP will end up in trouble because they'll drop packets
like crazy and the IXP can end up in trouble because shared buffers will
be exhausted and may be unavailable for other IXP participants.
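That shared-buffer exhaustion is worth spelling out, because it makes this a neighbour problem rather than a two-party problem. Here's a toy fluid model (the pool size, timestep, and naive FIFO sharing are all assumptions for illustration, not any real MMU's behaviour, which would have per-port guarantees and dynamic thresholds): one bursty 100G-to-10G flow pins the shared pool, and an unrelated well-behaved flow on another port can no longer find room for even a single frame.

```python
POOL = 32e6          # assumed 32 Mbit shared buffer pool
STEP = 1e-4          # 0.1 ms timesteps

def simulate(ms):
    """Count timesteps where an unrelated flow finds no buffer room."""
    pool_used = 0.0
    starved_steps = 0
    for _ in range(ms * 10):  # ms must be an integer
        # bursty flow: 100G in, 10G out -> 90 Gbit/s surplus queued
        pool_used = min(pool_used + 90e9 * STEP, POOL)
        # the well-behaved flow needs room for one 1500-byte frame
        if POOL - pool_used < 1500 * 8:
            starved_steps += 1
    return starved_steps

print(simulate(5), "of 50 timesteps starved the well-behaved flow")
```

Within half a millisecond the pool is pinned, and it stays pinned for the rest of the burst.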

Mostly you can engineer around this, but it's not as simple as saying
that small-buffer switches aren't suitable for an IXP. They can be, but
it depends on the network engineering requirements of the ixp
participants, and how the ixp is designed. No simple answers here,
sorry :-(

The flip side of this is that individual StrataDNX asics don't offer the
raw performance grunt of the StrataXGS family, and there is a serious
difference in costs and performance between a 1U box with a single
tomahawk and a 2U box with a 4-way jericho config, make no mistake about it.

Otherwise, white boxes can reduce costs if you know what you're doing,
and you're spot on that TCO should be one of the main determinants if
the performance characteristics are otherwise similar. TCO is a strange
beast though. Everything from kit capex to FLS to depreciation term
counts towards TCO, so it's important to understand your cost base and
organisational costing model thoroughly before making any decisions.

Nick

Could you please elaborate on this?

How do you engineer around having basically no buffers at all, especially when those very small buffers are shared between ports?

Mikael Abrahamsson wrote:

Mostly you can engineer around this, but it's not as simple as saying
that small-buffer switches aren't suitable for an IXP.

Could you please elaborate on this?

How do you engineer around having basically no buffers at all, and
especially if these very small buffers are shared between ports.

You assess and measure, then choose the appropriate set of tools to
deal with your requirements, at a cost appropriate for your
financials; i.e. the same as in any engineering situation.

At an IXP, it comes down to the maximum size of tcp stream you expect to
transport. This will vary depending on the stakeholders at the IXP,
which usually depends on the size of the IXP. Larger IXPs will have a
wider traffic remit and probably a much larger variance in this regard.
Smaller IXPs typically transport content-to-access-network traffic,
which is usually well-behaved.

Traffic drops in the core need to be kept to a minimum, particularly
during normal operation. Eliminating traffic drops entirely is
unnecessary and unwanted because of how IP works (tcp relies on
occasional loss as a congestion signal), so in your core you need to
aim for either link overengineering or else enough buffering to ensure
that site-to-site latency does not exceed X ms and packet loss stays
under Y%. Each option has a cost implication.
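The latency side of that trade-off is straightforward to put numbers on: if queueing delay on an egress port must stay under some budget, the queue can never be allowed to hold more than rate times budget. A minimal sketch (the 1 ms budget is an assumed example target, not a recommendation):

```python
def max_queue_bytes(egress_gbps, latency_budget_ms):
    """Deepest queue that still drains within the latency budget."""
    return egress_gbps * 1e9 / 8 * latency_budget_ms / 1e3

for gbps in (10, 100):
    mb = max_queue_bytes(gbps, 1) / 2**20
    print(f"{gbps}G port, 1 ms budget: {mb:.2f} MB of queue")
    # 10G -> ~1.19 MB, 100G -> ~11.92 MB
```

Deeper buffers than that buy burst absorption at the cost of blowing the latency target, which is exactly the cost implication being weighed here.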

At the IXP participant edge, there is a different set of constraints
which will depend on what's downstream of the participant, where the
traffic flows are, what size they are, etc. In general, traffic loss at
the IXP handoff will tend only to be a problem if there is a disparity
between the bandwidth left available on the egress direction and the
maximum link speed downstream of the IXP participant.

For example, a content network has servers which inject content at 10G,
which connects through a 100G IXP port. The egress IXP port is a
mid-loaded 1G link which connects through to 10mbit WISP customers. In
this case, the ixp will end up doing negligible buffering because most
of the buffering load will be handled on the WISP's internal
infrastructure, specifically at the core-to-10mbit handoff. The IXP
port might end up dropping a packet or two during the initial tcp burst,
but that is likely to be latency specific and won't particularly harm
overall performance because of tcp slow start.

On the other hand, if it were a mid-loaded 1G link with 500mbit access
customers on the other side (e.g. docsis / gpon / ftth), then the IXP
would end up being the primary buffering point between the content
source and destination and this would cause problems. The remedy here
is either for the ixp to move the customer to a buffered port (e.g.
different switch), or for the access customer to upgrade their link.

If you want to push 50G-80G streams through an IXP, I'd argue that you
really shouldn't: this is very expensive to engineer properly, and
you're almost certainly better off with a pni.

This approach works better on some networks than others. The larger the
IXP, the more difficult it is to manage this, both in terms of core and
edge provisioning, i.e. the greater the requirement for buffering in
both situations because you have a greater variety of streaming scales
per network. So although this isn't going to work as well for top-10
ixps as for mid- or smaller-scale ixps, where it works, it can provide
similar quality of service at a significantly lower cost base.

IOW, know your requirements and choose your tools to match. Same as
with all engineering.

Nick

Hello Mike,

Looking at cost-effective options brings us to whitebox switches. Obviously there's a wide range of hardware vendors, and there are a few OSes available as well. Cumulus seems to be the market leader, while IP Infusion seems to be the most feature-rich.

You may want to take a look at this paper before choosing your NOS:

https://hal-univ-tlse3.archives-ouvertes.fr/hal-01276379v1

It describes an OpenFlow SDN approach that dumbs down whitebox
switches so as to fully avoid broadcast traffic, while performing as
well as a complex VPLS fabric.

This design has been operating consistently on the Toulouse Internet
Exchange since 2015, using Pica8 gear.

So maybe a CLI and a feature-full NOS will better suit your needs, but
for an IXP, the programmatic approach has been demonstrated to work
just fine.

Best regards,

Very fascinating read. Thank you for posting.

In terms of 1G-to-10G steps, it looks like UCSC has done some of that homework already.

https://people.ucsc.edu/~warner/Bufs/summary

"Ability to buffer 6 Mbytes is sufficient for a 10 Gb/s sender and a 1 Gb/s receiver." I'd suspect 10x would be appropriate for 100G - 10G (certainly made more accurate by testing).

http://people.ucsc.edu/~warner/I2-techs.ppt

Looking through their table ( https://people.ucsc.edu/~warner/buffer.html ), it looks like more switches than not in the non-100G realm have just enough buffer to handle one, possibly two, mismatches at a time. Some fall just short, and others are woefully inadequate.
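The "10x" hunch above is simply the UCSC 6 MB result scaled linearly with sender speed. That linear scaling is an assumption that would need testing, as already noted, but for what it's worth the arithmetic is:

```python
UCSC_BASELINE_MB = 6   # measured: 10 Gb/s sender into a 1 Gb/s receiver

def scaled_buffer_mb(sender_gbps, baseline_sender_gbps=10):
    """Linearly scale the UCSC figure with sender speed (an assumption)."""
    return UCSC_BASELINE_MB * sender_gbps / baseline_sender_gbps

print(scaled_buffer_mb(100), "MB suggested for a 100G -> 10G mismatch")
# -> 60.0 MB
```

Comparing that against the per-port share of a 4 x 4MB Tomahawk pool makes it clear why these chips struggle with this traffic pattern.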

Mike,

Whether it becomes a practical problem depends on the use case, and by
that I mean buffers can cut both ways. If buffers are too small,
traffic can be dropped and, even worse, other traffic can be affected
depending on factors like ASIC design and HOLB. If they are too large,
latency- or order-sensitive traffic can be adversely affected.

We're still dealing with the same limitations of switching that were
identified 30+ years ago as the technology was developed. Sure, we have
better chips, the option of bigger buffers, and years of experience to
help minimize those limitations, but they still exist and likely always
will with switching.

Honestly, at this point it comes down to understanding what the use
case is, understanding the nuances of each vendor's offerings, and
determining where things line up. Then test, test, test.

-- Stephen