Thousands of hosts on a gigabit LAN, maybe not

Some people I know (yes really) are building a system that will have
several thousand little computers in some racks. Each of the
computers runs Linux and has a gigabit ethernet interface. It occurs
to me that it is unlikely that I can buy an ethernet switch with
thousands of ports, and even if I could, would I want a Linux system
to have 10,000 entries or more in its ARP table.

Most of the traffic will be from one node to another, with
considerably less to the outside. Physical distance shouldn't be a
problem since everything's in the same room, maybe the same rack.

What's the rule of thumb for number of hosts per switch, cascaded
switches vs. routers, and whatever else one needs to design a dense
network like this? TIA

R's,
John

Sounds interesting. I wouldn't do more than a /23 (assuming IPv4) per subnet. Join them all together with a fast L3 switch. I'm still trying to visualize what several thousand tiny computers in a single rack might look like, other than a cabling nightmare. 1000 RJ-45 switch ports is a good chunk of a rack by itself.
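For what it's worth, a quick Python sketch of that math, purely illustrative; the 10,000-host figure is just the number being thrown around in this thread, and the /23 ceiling is the suggestion above:

import ipaddress
import math

total_hosts = 10_000                      # assumed total, per the thread
subnet = ipaddress.ip_network("10.0.0.0/23")
usable = subnet.num_addresses - 2         # minus network and broadcast addresses
print(f"usable hosts per /23: {usable}")                           # 510
print(f"/23 subnets needed:   {math.ceil(total_hosts / usable)}")  # 20

So roughly twenty /23s behind one L3 switch would cover it.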

Chuck

Also consider the pain of IPv6's link-local games. Look at the nvo3 WG and its predecessor (which shouldn't really have existed anyway, but whatever; apparently my mind helped me forget about the pain involved with that WG).

I think: why one LAN? Why not just small subnets (/26, or /24 max?)... or do it all in v6 on /64s, with one per rack or one per ~200 hosts.

Some people I know (yes really) are building a system that will have
several thousand little computers in some racks.

Very cool-ly crazy.

Each of the
computers runs Linux and has a gigabit ethernet interface. It occurs
to me that it is unlikely that I can buy an ethernet switch with
thousands of ports, and even if I could, would I want a Linux system
to have 10,000 entries or more in its ARP table.

Agreed. :) You don't really want 10,000 entries in a routing FIB table either, but I was seriously encouraged by the work going on in Linux 4.0 and 4.1 to improve those lookups.

I'd love to know the actual scalability of some modern
routing protocols (isis, babel, ospfv3, olsrv2, rpl) with that
many nodes too....

Most of the traffic will be from one node to another, with
considerably less to the outside. Physical distance shouldn't be a
problem since everything's in the same room, maybe the same rack.

That is an awful lot of ports to fit in a rack (48 ports x 36 2U slots in the rack (and is that too high?) = 1728 ports). A thought: you could make it meshier using multiple interfaces per tiny Linux box. Put, say, 3-6 interfaces on each and have a very few switches interconnecting given clusters (with multiple paths to each switch). That would reduce your ARP table (and FIB table) by a lot, at the cost of adding hops...
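A rough Python sketch of that port math, plus the effect of segmenting on per-host neighbor-table size (all the numbers here are assumptions, not recommendations):

ports_per_switch = 48
two_u_slots_per_rack = 36      # the guess above; adjust for your rack
print("switched ports per rack:", ports_per_switch * two_u_slots_per_rack)  # 1728

total_hosts = 10_000
for hosts_per_segment in (total_hosts, 500, 250, 62):   # flat LAN vs. smaller segments
    print(f"{hosts_per_segment:>6} hosts per segment -> "
          f"~{hosts_per_segment - 1} ARP/ND neighbors per host")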

What's the rule of thumb for number of hosts per switch, cascaded
switches vs. routers, and whatever else one needs to design a dense
network like this? TIA

The 4096 figure is the size of the VLAN ID space (i.e. the maximum number of VLANs), not a per-VLAN host limit; hosts per VLAN are really bounded by subnet size and switch MAC/ARP tables. Still a lot either way.

Another approach might be max density on a switch (48 ports?) per cluster, routed (not switched) over 10GigE to another 10GigE+ switch.

I'd love to know the rules of thumb here also; I imagine some must exist for those in the VM or VXLAN worlds.

- The more switches a packet has to go through, the higher the latency, so your response times may deteriorate if you cascade too many switches. Legend says up to 4 is a good number; beyond that you risk creating a big mess.

- The more switches you add to the same subnet, the more of your bandwidth gets consumed by broadcasts (a rough back-of-the-envelope estimate follows this list).
http://en.wikipedia.org/wiki/Broadcast_radiation

- If you have only one connection between each pair of switches, the inter-switch links are limited to that rate (1 Gbps in this case), possibly creating a bottleneck depending on your application and how exactly it behaves. Consider aggregating uplinks.

- Bundling too many Ethernet cables will cause interference (cross-talk),
so keep that in mind. I'd purchase F/S/FTP cables and the like.
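To put a rough number on the broadcast point above, a back-of-the-envelope sketch in Python; the one-ARP-request-per-second-per-host rate is a pure assumption, so measure your own workload:

hosts = 10_000
arp_per_host_per_sec = 1.0     # assumption, not a measurement
frame_bytes = 64               # minimum Ethernet frame, ignoring preamble/IFG

bps = hosts * arp_per_host_per_sec * frame_bytes * 8
print(f"aggregate broadcast load: {bps / 1e6:.1f} Mbit/s")   # ~5.1 Mbit/s

The bandwidth itself is modest; the bigger cost is that every one of the 10,000 hosts has to receive and process every one of those frames.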

Here I am going off on a tangent: if your friends want to build a "supercomputer", there's a way to calculate the most "efficient" number of nodes given your constraints (e.g. via linear optimization). This could save you time, money, and headaches. An example: maximize the number of TFLOPS while minimizing the number of nodes (i.e. the number of switch ports). Just a quick thought.
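As a toy illustration of that idea (a real formulation would go to an LP/ILP solver; the node specs and port budget below are entirely made up):

port_budget = 2000             # hypothetical number of switch ports available
node_types = {                 # name: (TFLOPS per node, switch ports per node) - invented
    "small":  (0.05, 1),
    "medium": (0.20, 2),
    "large":  (0.80, 4),
}

name, (tflops, ports) = max(node_types.items(),
                            key=lambda kv: kv[1][0] / kv[1][1])  # best TFLOPS per port
count = port_budget // ports
print(f"pick '{name}': {count} nodes, {count * tflops:.0f} TFLOPS total")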

Unless you have some dire need to get these all on the same broadcast domain, those kinds of numbers on a single L2 would send me running for the hills, for lots of reasons, some of which you've identified.

I'd find a good L3 switch, put no more than ~200-500 IPs on each L2, and let the switch handle gluing it together at L3. With the proper hardware, this is a fully line-rate operation and should have no real downsides aside from splitting up the broadcast domains (if you do need multicast, make sure your gear can do it). With a divide-and-conquer approach, you shouldn't have problems fitting the L2+L3 tables into even a pretty modest L3 switch.

The densest chassis switches I know of will get you about 96 ports per RU (48 ports each on a half-width blade, though you need breakout panels to get standard RJ45 8P8C connectors since the blades have MRJ21s), less rack overhead for power supplies, management, etc. That should get you ~2000 ports per rack [1]. Such switches can be quite expensive. The trend seems to be toward stacking pizza boxes these days, though. Get the number of ports you need per rack (you're presumably not putting all 10,000 nodes in a single rack) and aggregate up one or two layers. This gives you a pretty good candidate for your L2/L3 split.

[1] Purely as an example, you can cram 3x Brocade MLX-16 chassis into a 42U rack (with 0RU to spare). That gives you 48 slots for line cards. Leaving at least one slot in each chassis for 10Gb or 100Gb uplinks to something else, 45x48 = 2160 1000BASE-T ports (electrically) in a 42U rack, and you'll need 45 more RU somewhere for breakout patch panels!

John Levine wrote:

Some people I know (yes really) are building a system that will have
several thousand little computers in some racks. Each of the
computers runs Linux and has a gigabit ethernet interface. It occurs
to me that it is unlikely that I can buy an ethernet switch with
thousands of ports, and even if I could, would I want a Linux system
to have 10,000 entries or more in its ARP table.

Most of the traffic will be from one node to another, with
considerably less to the outside. Physical distance shouldn't be a
problem since everything's in the same room, maybe the same rack.

What's the rule of thumb for number of hosts per switch, cascaded
switches vs. routers, and whatever else one needs to design a dense
network like this? TIA

It's become fairly commonplace to build supercomputers out of clusters of 100s or 1000s of commodity PCs; see, for example:
www.rocksclusters.org
http://www.rocksclusters.org/presentations/tutorial/tutorial-1.pdf
or
http://www.dodlive.mil/files/2010/12/CondorSupercomputerbrochure_101117_kb-3.pdf (a cluster of 1760 playstations at AFRL Rome Labs)

Interestingly, all the documentation I can find is heavy on the software layers used to cluster resources - but there's little about hardware configuration other than pretty pictures of racks with lots of CPUs and lots of wires.

If the people you know are trying to do something similar, it might be worth nosing around the Rocks community, or making some phone calls. I expect that interconnect architecture and latency might be a bit of an issue for this sort of application.

Miles Fidelman

Linux has a (configurable) limit on the neighbor table. I know in RHEL variants, the default has been 1024 neighbors for a while.

net.ipv4.neigh.default.gc_thresh3
net.ipv4.neigh.default.gc_thresh2
net.ipv4.neigh.default.gc_thresh1

net.ipv6.neigh.default.gc_thresh3
net.ipv6.neigh.default.gc_thresh2
net.ipv6.neigh.default.gc_thresh1

These may be rough guidelines for performance or arbitrary limits someone thought would be a good idea. Either way, you'll need to increase these numbers if you're using IP on Linux at this scale.
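A minimal sketch of bumping them by writing the /proc/sys equivalents of the sysctls listed above; the values are illustrative, not tuned recommendations. Run as root, and put the same settings in /etc/sysctl.conf (e.g. "net.ipv4.neigh.default.gc_thresh3 = 16384") so they survive reboots.

import pathlib

thresholds = {"gc_thresh1": 4096, "gc_thresh2": 8192, "gc_thresh3": 16384}  # example values

for family in ("ipv4", "ipv6"):
    for name, value in thresholds.items():
        path = pathlib.Path(f"/proc/sys/net/{family}/neigh/default/{name}")
        path.write_text(f"{value}\n")
        print(f"set {path} = {value}")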

Although not explicitly stated, I would assume that these computers may be virtualized or inside some sort of blade chassis (which reduces the number of physical cables to a switch). Strictly speaking, I see no hardware limitation in your way, as most top-of-rack switches will easily do a few thousand or tens of thousands of MAC entries, and a few thousand hosts can fit inside a single IPv4 or IPv6 subnet. There are some pretty dense switches if you actually do need 1000 ports, but as others have stated, you'll use a good portion of the rack for cable and connectors.

--Blake

You may want to look at a Clos / leaf-spine architecture. This design tends to be optimized for east-west traffic, scales easily as bandwidth needs grow, and keeps things simple: L2/L3 boundary at the ToR switch, L3 ECMP from leaf to spine. Not a lot of complexity, and it scales fairly high on both leaves and spines.
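A rough sizing sketch for that kind of leaf/spine design; the 48x1G + 4x10G leaf profile and the 10,000-host target are assumptions for illustration only:

import math

hosts           = 10_000
leaf_down_ports = 48      # 1G host-facing ports per leaf/ToR (assumed)
leaf_up_ports   = 4       # 10G uplinks per leaf, one to each spine (assumed)

leaves  = math.ceil(hosts / leaf_down_ports)              # 209 leaves
spines  = leaf_up_ports                                   # 4 spines, one per uplink
oversub = (leaf_down_ports * 1) / (leaf_up_ports * 10)    # 1G down vs. 10G up

print(f"leaves: {leaves}, spines: {spines} (each spine needs {leaves} ports)")
print(f"per-leaf oversubscription: {oversub:.1f}:1")

With ~200 leaves, each spine needs a couple hundred ports, so you're either into big chassis spines or an extra tier.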

Sk.

Forgot to mention: you might also want to check out Beowulf clusters. There's an email list at http://www.beowulf.org/ - probably some useful info in the list archives, and maybe a good place to post your query.

Miles

Miles Fidelman wrote:

to have 10,000 entries or more in its ARP table.

Agreed. :) You don't really want 10,000 entries in a routing FIB table either, but I was seriously encouraged by the work going on in Linux 4.0 and 4.1 to improve those lookups.

One obvious way to deal with that is to put some manageable number of
hosts on a subnet and route traffic between the subnets. I think we
can assume they'll all have 10/8 addresses, and I'm not too worried
about performance to the outside world, just within the network.
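For illustration, carving 10/8 into per-cluster /24s is trivial; the cluster count below is just an assumption of ~250 hosts per cluster:

import ipaddress
import itertools

clusters = 40   # assumed: ~10,000 hosts at ~250 per cluster
plan = list(itertools.islice(
    ipaddress.ip_network("10.0.0.0/8").subnets(new_prefix=24), clusters))
print(f"{plan[0]} ... {plan[-1]}  ({len(plan)} x /24, with most of 10/8 still free)")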

R's,
John

Agree with many of the other comments. Smaller subnets (the /23 suggestion sounds good) with L3 between the subnets.

<off topic>
The first thing that came to mind was "Bitcoin farm!" then "Ask Bitmaintech" and then "I'd be more worried about the number of fans and A/C units".
</off topic>

Brian

* lists.nanog@monmotha.net (Brandon Martin) [Fri 08 May 2015, 21:42 CEST]:

[1] Purely as an example, you can cram 3x Brocade MLX-16 chassis into a 42U rack (with 0RU to spare). That gives you 48 slots for line cards.

You really can't. Cables need to come from the top, not from the sides, or they'll block the path of other linecards.

  -- Niels.

<off topic>
The first thing that came to mind was "Bitcoin farm!" then "Ask Bitmaintech" and then "I'd be more worried about the number of fans and A/C units".
</off topic>

I promise, no bitcoins involved.

R's,
John

Morrow's comment about the ARMD WG notwithstanding, there might be some useful context in https://tools.ietf.org/html/draft-karir-armd-statistics-01

Cheers,
-Benson

Hum, good point. "Cram" may not be a strong enough term. :) It'd work on the horizontal-slot chassis types (4/8 slot), but not the vertical (16/32 slot).

You might be able to make it fit if you didn't care about maintainability, I guess. There's some room to maneuver if you don't care about being able to get the power supplies out, too. I don't recommend this approach... Those MRJ21 cables are not easy to work with as it is.

Some people I know (yes really) are building a system that will have
several thousand little computers in some racks.

How many racks?
How many computers per rack unit? How many computers per rack?
(How are you handling power?)
How big is each computer?

Do you want network cabling to be contained to each rack? Or do you want to run the cable to a central networking/switching rack?

Hmmmm, even a 6513 fully populated with PoE 48-port line cards (which could let you do power and network in the same cable, I think? Does PoE work on gigabit these days?) would get you 12*48 = 576 ports.

So... a 48U rack minus 15U (I think the 6513 is 15U total) leaves you 33U. Can you fit 576 systems in 33U?

Each of the computers runs Linux and has a gigabit ethernet interface.

Copper?

It occurs to me that it is unlikely that I can buy an ethernet switch with thousands of ports

6515?

... and even if I could, would I want a Linux system to have 10,000 entries or more in its ARP table.

Add more RAM. That's always the answer. LOL.

Most of the traffic will be from one node to another, with
considerably less to the outside. Physical distance shouldn't be a
problem since everything's in the same room, maybe the same rack.

What's the rule of thumb for number of hosts per switch, cascaded
switches vs. routers, and whatever else one needs to design a dense
network like this? TIA

We need more data.

The real answer to this is being able to cram them into a single chassis which can multiplex the network through a backplane. Something like the HP Moonshot ARM system or the way others like Google build high density compute with integrated Ethernet switching.

Phil

Though a bit off-topic: I ran into this project at the CascadeIT conference. I'm currently in corporate IT that is Notes/Windows based, so I haven't had a good place to test it, but the concept is very interesting. The distributed way they monitor would greatly reduce bandwidth overhead.

http://assimproj.org

The Assimilation Project is designed to discover and monitor
infrastructure, services, and dependencies on a network of potentially
unlimited size, without significant growth in centralized resources. The
work of discovery and monitoring is delegated uniformly in tiny pieces to
the various machines in a network-aware topology - minimizing network
overhead and being naturally geographically sensitive.

The main ideas are:

   - distribute discovery throughout the network, doing most discovery locally
   - distribute the monitoring as broadly as possible in a network-aware fashion
   - use autoconfiguration and zero-network-footprint discovery techniques to monitor most resources automatically, during the initial installation and during ongoing system addition and maintenance

I won't pretend to know best practices, but my inclination would be to connect the devices to 48-port L2 ToR switches with 2-4 SFP+ uplink ports (a number of vendors have options for this), with the 10Gbit ports aggregated to a 10Gbit core L2/L3 switch stack (ditto). I'm not sure I'd attempt this without 10Gbit to the edge switches, due to Rafael's aforementioned point about the bottleneck/loss of multiple ports for trunking.

  Not knowing the architectural constraints, I'd probably go with others' advice of limiting L2 zones to 200-500 hosts, which would probably amount to 4-10 edge switches per VLAN.
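Rough numbers behind that, assuming 48-port edge switches, 2-4 x 10G uplinks, and 200-500 hosts per L2 zone as described above (all assumed figures):

host_ports = 48     # 1G host ports per edge switch (assumed)

for uplinks in (2, 4):
    oversub = (host_ports * 1) / (uplinks * 10)
    print(f"{uplinks} x 10G uplinks -> {oversub:.2f}:1 oversubscription per edge switch")

for zone in (200, 500):
    print(f"{zone}-host L2 zone -> ~{zone // host_ports} fully loaded edge switches per VLAN")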

  Dang. The more I think about this project, the more expensive it sounds.

      Jima