FYI: An Easy way to build a server cluster without top of rack switches (MEMO)

NAOTO_MATSUMOTO1 · February 12, 2015, 7:32am

Hi all!

We wrote up a tips memo on the concept of "an easy way to build a server
cluster without top of rack switches".

This model reduces switch and cable costs and gives high network
durability through a lightweight and simple configuration.

If you are interested, please try this concept out yourself.

An Easy way to build a server cluster without top of rack switches (MEMO)
http://slidesha.re/1EduYXM

Best regards,

Dan_Eckert · February 13, 2015, 10:08pm

I'm having a hard time seeing how this reduces cable costs or increases network durability. Each individual server is well connected to 3-4 other servers in the rack, but the rack still only has two uplinks. For many servers in the rack you're adding 3-4 routing hops between an end node and the rack uplink.

Additionally, with only 2 external links tied to 2 specific nodes, you introduce more risks. If one of the uplink nodes fails, you've got to re-route all of the nodes that were using it as the shortest path to now exit through the other uplink node -- the worst case in the example then increases from the original 4-hops-to-exit to now 7-hops-to-exit.
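
The hop-count argument can be checked on a toy topology. The sketch below is an assumption, not the layout from the memo: eight servers wired as a simple chain with an uplink at each end (the names `s0`..`s7` and `hops_to_exit` are hypothetical). It runs a breadth-first search from the working uplink nodes to count each server's hops out of the rack, then repeats with one uplink node failed.

```python
from collections import deque

def hops_to_exit(links, uplinks, failed=frozenset()):
    """Return {server: hops to leave the rack}, via BFS from working uplink nodes."""
    adj = {}
    for a, b in links:
        if a in failed or b in failed:
            continue  # a failed server takes its links down with it
        adj.setdefault(a, set()).add(b)
        adj.setdefault(b, set()).add(a)
    # An uplink node is one hop (its own uplink cable) away from the outside.
    dist = {u: 1 for u in uplinks if u not in failed}
    queue = deque(dist)
    while queue:
        node = queue.popleft()
        for nxt in adj.get(node, ()):
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    return dist

# Hypothetical rack: s0 - s1 - ... - s7, with uplinks on s0 and s7.
servers = [f"s{i}" for i in range(8)]
links = list(zip(servers, servers[1:]))
uplinks = {"s0", "s7"}

healthy = hops_to_exit(links, uplinks)
degraded = hops_to_exit(links, uplinks, failed={"s0"})
print(max(healthy.values()))   # 4
print(max(degraded.values()))  # 7
```

On this toy chain the worst case goes from 4 hops to 7 when one uplink node fails; a denser mesh would change the numbers but not the method.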

As far as cable costs go, you might have slightly shorter cables, but a far more complex wiring pattern -- so in essence you're trading off a small amount of cable cost for a higher amount of installation and troubleshooting cost.

Also, using this layout, you dramatically reduce the effective bandwidth available between devices, since per-device links now have to be used for backhaul/transport in addition to device-specific traffic.

Finally, you have to manage per-server routing service configurations to make this work -- more points of failure and increased setup/troubleshooting cost. In a ToR switch scenario, you do one config on one switch, plug in the cables, and you're done -- problems happen, you go to the one switch, not chasing a needle through a haystack of interconnected servers.

If your RU count is worth more than the combination of increased installation, server configuration, troubleshooting, latency, and capacity costs, then this is a good solution. Either way, it's a neat idea and a fun thought experiment to work through.

Thanks!
Dan

Ken_Chase1 · February 14, 2015, 4:09pm

We did something similar way back in the day (2001?) when GbE switches were
ridiculously expensive and we wanted many nodes instead of expensive gear. The
(deplorably hot!) NatSemi 83820 GbE cards, however, were a mere $40 or so.

Uplink for loading data via NFS/control was the onboard FE (via desktop 8 port
Surecoms), but 2x GBE was used for inter-node. GROMACS, a molecular modeller,
only talked to adjacent nodes, so we just hooked up a linear network
A-B-C-D-E-F-A in a loop.

With 40 nodes, though, some nodes had 3 cards in them so that we could
effectively make two separate smaller cluster loops (A-B-C-A and D-E-F-D, e.g.)
without having to visit the machine and move cards around.

Perfectly reasonable where A talks to B and C or F only. A ridiculous concept
for A talking to C however. Latency on the network was our big thing for
GROMACS' speed of course, thus the GBE, so multihop would have been totally
anathema.
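
The adjacency point can be put in numbers. Assuming a plain bidirectional ring of n nodes (the function name `ring_hops` is hypothetical), the hop count between two nodes is the shorter way around the loop:

```python
def ring_hops(n, a, b):
    """Hops between positions a and b on a bidirectional ring of n nodes."""
    d = abs(a - b) % n
    return min(d, n - d)

ring = "ABCDEF"  # the A-B-C-D-E-F-A loop
for target in "BFCD":
    hops = ring_hops(len(ring), ring.index("A"), ring.index(target))
    print("A ->", target, hops)
```

Neighbours B and F are one hop from A, while C is already two hops and D three, which is where a latency-bound code starts to suffer.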

While our cluster ran each job about half as fast (with no net gain in speed
beyond 16-20 nodes due to latency catching up with us) as the far pricier
competing quote's InfiniBand-based solution, their total of 8 nodes was no
match for us running 5 jobs in parallel on our 40 nodes for the same cost.
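
As a rough worked example of that trade-off, the sketch below compares aggregate throughput using only the ratios quoted above. It assumes the 8-node InfiniBand cluster would run one job at a time at a relative speed of 1.0; that assumption, and the helper name `relative_throughput`, are not from the original post.

```python
def relative_throughput(jobs_in_parallel, speed_per_job):
    """Jobs completed per unit time, relative to one job at speed 1.0."""
    return jobs_in_parallel * speed_per_job

# Assumption: the InfiniBand quote runs a single job at a time.
infiniband = relative_throughput(jobs_in_parallel=1, speed_per_job=1.0)
# From the post: 5 parallel jobs on 40 nodes, each about twice as slow.
gbe_loops = relative_throughput(jobs_in_parallel=5, speed_per_job=0.5)

print(gbe_loops / infiniband)  # 2.5
```

Under those assumptions the cheap cluster finishes about 2.5 times as many jobs per unit time for the same money, even though each individual job takes twice as long.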

Considering the lab had multiple grad students in it, there was ample
opportunity for running multiple jobs at the same time - while this may have
thrashed the CPU cache (and increased our memory requirements slightly) in
terms of pure compute efficiency, the end throughput per dollar and
happiness-per-grad-student was far higher.

Feel free to trawl the archives on beowulf-l ca. 2001-2 for more details of
dirt cheap cluster design (or reply to me directly).

Here's some pics of the cluster, but please keep in mind we were young and
foolish.

http://sizone.org/m/i/velocet/cluster_construction/133-3371_IMG.JPG.html

/kc

NAOTO_MATSUMOTO1 · February 16, 2015, 3:23am

Hi Dan and Ken.

I respect your great work.

Certainly, our scenario is a network classic, and it is not a "one size
fits all" network architecture.
Many people tried to build centralized and decentralized networks many
years ago, and some of them produced implementations like this one.

Interconnect of K computer (torus fusion)
https://www.fujitsu.com/global/Images/fujitsu-hpc-roadmap-beyond-petascale-computing.pdf

I agree with your points. Our approach needs simpler physical and logical
network engineering, and a change of mindset, I think.
(e.g. network cabling procedure, troubleshooting, and handling)

But some people need a more cost-effective server cluster environment and
do not care about network latency, as in low-end web hosting.

e.g. Intel Diversity of Server workloads http://bit.ly/1BgFH65 [JPG].

Nowadays, many people do not use Dijkstra and automaton theory on the
server side, but they are a great mechanism for network durability if they
are kept under control.
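
To illustrate the durability point, here is a minimal sketch of Dijkstra's algorithm recomputing a route after a link fails. The four-server topology, the link costs, and the function names `shortest_path` and `without_link` are assumptions for illustration only and are not taken from the memo.

```python
import heapq

def shortest_path(graph, src, dst):
    """Dijkstra: return (cost, path) from src to dst, or (inf, []) if unreachable."""
    best = {src: 0}
    prev = {}
    heap = [(0, src)]
    while heap:
        cost, node = heapq.heappop(heap)
        if node == dst:
            path = [dst]
            while path[-1] != src:
                path.append(prev[path[-1]])
            return cost, path[::-1]
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        for nxt, weight in graph.get(node, {}).items():
            new_cost = cost + weight
            if new_cost < best.get(nxt, float("inf")):
                best[nxt] = new_cost
                prev[nxt] = node
                heapq.heappush(heap, (new_cost, nxt))
    return float("inf"), []

def without_link(graph, a, b):
    """Copy of graph with the a-b link removed in both directions."""
    return {n: {m: w for m, w in nbrs.items() if {n, m} != {a, b}}
            for n, nbrs in graph.items()}

# Hypothetical square of servers: A-B-D and A-C-D, with C-D the costlier link.
graph = {
    "A": {"B": 1, "C": 1},
    "B": {"A": 1, "D": 1},
    "C": {"A": 1, "D": 2},
    "D": {"B": 1, "C": 2},
}
print(shortest_path(graph, "A", "D"))                          # (2, ['A', 'B', 'D'])
print(shortest_path(without_link(graph, "B", "D"), "A", "D"))  # (3, ['A', 'C', 'D'])
```

When the B-D link is removed, the same computation falls back to the longer A-C-D route without any manual change, which is the behaviour a per-server routing daemon would provide.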

Ethernet NIC bandwidth is increasing day by day, and the cost is
decreasing too.

I say again, our scenario is not a "one size fits all" network architecture,
but we believe it will be of use for some people's work.

Best regards,

NAOTO_MATSUMOTO1 · February 20, 2015, 2:50pm

BTW: We have another memo that can be combined with this scenario, as below.

High Availability Server Clustering without ILB (Internal Load Balancer) (MEMO)
http://slidesha.re/1vld6uB