Layer 2 vs. Layer 3 to TOR

Guys,

I am wondering how many of you are doing layer 3 to top-of-rack switches and what the pros and cons are. Also, if you are doing layer 3 to top of rack, do you guys have any links to published white papers on it?

Thanks,
Raj Singh

We are heading towards that type of deployment beginning next year with
Juniper EX4200 switches in a redundant configuration. This will be pure
Layer 2 on the switches, and they will "uplink" to Juniper M10i's for
Layer 3... the power savings, space savings, etc. over traditional
Cisco 6500 chassis (plus all the cabling between cabinets, which in our
case is a nightmare) made this a pretty easy choice... and the price
too. ;)

Somewhere on Juniper's website in the product info section they have
deployment whitepapers on this kind of stuff if that's of interest....

Hope this helps..

Paul

Dani Roisman gave an excellent talk on this subject at NANOG 46 in Philadelphia:

   http://www.nanog.org/meetings/nanog46/abstracts.php?pt=MTQwOCZuYW5vZzQ2&nm=nanog46

  Steve

We are actually looking at going Layer 3 all the way to the top of rack and making each rack its own /24. This gives us flexibility during maintenance (no spanning tree to worry about). Also, troubleshooting during outages is much easier using common tools like ping and traceroute.

I want to make sure this is something other people are doing out there, and I want to know if anyone has run into any issues with this setup.

Thanks,
Raj Singh | Director Network Engineering

Steve Feldman wrote:

Guys,

I am wondering how many of you are doing layer 3 to top-of-rack
switches and what the pros and cons are. Also, if you are doing layer
3 to top of rack, do you guys have any links to published white papers
on it?

Dani Roisman gave an excellent talk on this subject at NANOG 46 in
Philadelphia:

http://www.nanog.org/meetings/nanog46/abstracts.php?pt=MTQwOCZuYW5vZzQ2&nm=nanog46

I'd always wondered how you make a subnet available across racks with L3
rack switching. It seems that you don't.

~Seth

It's possible, with prior planning. You can have the uplinks be layer 2
trunks, with a layer 3 SVI on one of the trunked VLANs acting as your
actual routed uplink.

It requires planning in advance regarding which VLANs are trunked where,
etc., but it allows you to do layer 3 termination at the top of rack for
single servers while still offering VLANs that span multiple layer 3
switches, with HSRP at distribution, for systems/services that require a
common broadcast domain.
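
As a rough sketch (IOS-style; the interface names, VLAN numbers, and
addresses here are all hypothetical), the TOR side might look something
like this, with the HSRP gateway for the spanned VLAN living at
distribution:

    ! TOR switch: VLAN 10 is rack-local and routed here; VLAN 100 is
    ! the routed uplink; VLAN 200 is passed through at layer 2 for
    ! hosts that need a broadcast domain spanning racks.
    interface TenGigabitEthernet1/49
     description trunk uplink to distribution
     switchport mode trunk
     switchport trunk allowed vlan 100,200
    !
    interface Vlan10
     ip address 10.1.10.1 255.255.255.0
    !
    interface Vlan100
     description routed uplink SVI riding the trunk
     ip address 10.0.100.2 255.255.255.252
    !
    ! Distribution switch: HSRP gateway for the spanned VLAN 200.
    interface Vlan200
     ip address 10.2.0.2 255.255.255.0
     standby 1 ip 10.2.0.1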

If you use stackable switches, you can stack across cabinets (up to 3 cabinets with 1-meter Cisco 3750 StackWise cables), and uplink on the ends. It's a pretty solid layout if you plan your port needs properly based on NIC density and cabinet size, plus you can cable cleanly to an adjacent cabinet's switch if necessary.

Slightly off-topic: consider offloading 100Mb connections like PDUs, DRAC/iLO, etc. to lower-cost switches to get the most out of your premium ports.

-Tim

Agreed. We use Netgear gigabit unmanaged switches for what Tim suggests to
save the higher-cost-per-port switchports for server gear.

-brandon

Not just that: you can also use lower-cost switches to move your management fully out-of-band with respect to your production traffic. This can work well in times of catastrophe.

Nick

Seth Mattinen wrote:

I'd always wondered how you make a subnet available across racks with L3
rack switching. It seems that you don't.

You could route /32s within your L3 environment, or maybe even leverage something like VPLS - not sure of any TOR-level switches that can MPLS-pseudowire a port into a VPLS cloud, though.
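
One way to do the /32 idea (a rough IOS-style sketch; all addresses are
made up): keep the service address on the server's loopback, pin a host
route on the TOR toward the server's rack-local address, and
redistribute it:

    ! 192.0.2.57 is the server's loopback/service address; 10.1.10.57
    ! is its address on the rack-local subnet.
    ip route 192.0.2.57 255.255.255.255 10.1.10.57
    !
    router ospf 1
     redistribute static subnets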

Kinda makes L2 and spanning tree sound like a great option, doesn't it?

Hej,

We are actually looking at going Layer 3 all the way to the top of rack
and making each rack its own /24.

what a waste of IPs and unnecessary loss of flexibility!

This gives us flexibility during maintenance (no spanning tree to worry about).

If you use a simple setup for aggregation, you do not need xSTP. Even with
redundancy included, RTG (in Cisco terms: Flex Links) will be sufficient.
Spanning L2 across more than one rack is dirty when you do L3 on the TORs,
because you need to build a Virtual Chassis or VPLS tunnels (not sure if
the EX4200 does the latter as of today).
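
(As a sketch, the Cisco Flex Links equivalent is a single command on the
primary uplink - interface names here are hypothetical:)

    interface GigabitEthernet1/49
     description primary uplink; Gi1/50 takes over if this link fails
     switchport backup interface GigabitEthernet1/50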

Also, troubleshooting during outages is much easier using common tools
like ping and traceroute.

Oh, c'mon. Yes, Layer 2 is a wild jungle compared to clean routing, but
tracing isn't that magical there. You have LLDP, MAC address tables, ARP
tables...

I want to make sure this is something other people are doing out there,
and I want to know if anyone has run into any issues with this setup.

From the design POV, it is a clean and nice concept to do L3 on the
TOR switches, but in real life it does not work very well. Every time I
played with such a setup, with every vendor I've seen, the conclusion is
always the same:

Let routers route and let switches switch.

Switches that are supposed to do routing never scale, almost always provide
immature implementations of common L3 features, and run into capacity
problems far too fast (tables too small for firewall rules and route
entries, no full IPv6 capabilities, sometimes expensive licenses needed for
stuff like IS-IS...).

I understand the wish to keep broadcast domains small and network paths
deterministic and clean, but the switches you can buy today for not too
much money aren't ready yet.

So my hint is: look at model #4 from the NANOG presentation mentioned above.

My 2 Euro-Cents,

.m

Raj Singh wrote:

We are actually looking at going Layer 3 all the way to the top of rack and making each rack its own /24. This gives us flexibility during maintenance (no spanning tree to worry about). Also, troubleshooting during outages is much easier using common tools like ping and traceroute.

I'm confused about where STP fits into this. If you're doing /24s to each switch, why even bring STP into the picture? Do /31s to each TOR switch and use OSPF or IS-IS. I don't know too many people who have not had an awful experience with STP at some point.
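
A minimal sketch of that approach (IOS-style; the interface names,
addresses, and OSPF process number are all hypothetical):

    ! Routed /31 point-to-point uplink from the TOR switch.
    interface TenGigabitEthernet1/49
     description routed uplink to distribution
     no switchport
     ip address 10.0.0.1 255.255.255.254
     ip ospf network point-to-point
    !
    ! Rack-local /24, kept passive in the IGP.
    interface Vlan10
     ip address 10.1.10.1 255.255.255.0
    !
    router ospf 1
     passive-interface Vlan10
     network 10.0.0.0 0.0.0.1 area 0
     network 10.1.10.0 0.0.0.255 area 0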

Excerpts from David Coulson's message of Thu Nov 12 13:07:35 -0800 2009:

You could route /32s within your L3 environment, or maybe even leverage
something like VPLS - not sure of any TOR-level switches that can
MPLS-pseudowire a port into a VPLS cloud, though.

I was recently looking into this (a top-of-rack VPLS PE box). There don't
seem to be any obvious options, though the new Juniper MX80 sounds like it
can do this. It's 2 RU, and it looks like it can take a DPC card or comes
in a fixed 48-port GigE variety.

I like the idea of doing IP routing to a top-of-rack or edge device, but
have found others to be skeptical.

Are there any applications that absolutely *have* to sit on the same
LAN/broadcast domain and can't be configured to use unicast or multicast
IP?

--j

Jonathan Lassoff wrote:

I was recently looking into this (a top-of-rack VPLS PE box). There don't
seem to be any obvious options, though the new Juniper MX80 sounds like it
can do this. It's 2 RU, and it looks like it can take a DPC card or comes
in a fixed 48-port GigE variety.

The MX-series boxes are pretty nice. The MX80 should be able to do VPLS PE, though I've never tried it on that box - an MX240 did it pretty well last time I tried. I've no clue how the cost of that switch compares to a Cisco 4900 or something (not that a 4900 is anything special - L3 is all in software).

Are there any applications that absolutely *have* to sit on the same
LAN/broadcast domain and can't be configured to use unicast or multicast
IP?

The biggest hurdle we hit when trying to do TOR L3 (Cisco 4948s with /24s routed to each one) was devices that either required multiple physical Ethernet connections that we typically use LACP with, or environments that do IP takeover for redundancy. Both are obviously easy enough to work around if you run an IGP on your servers, but that was just insanely complex for our environment. It's hard to convince people that an HP-UX box needs to work like a router now.
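
(For reference, the "IGP on your servers" workaround is conceptually just
something like this Quagga ospfd.conf snippet - addresses hypothetical,
with the service address on a loopback; the pain is operational, not
syntactic:)

    ! Advertise the rack-local subnet and a loopback-hosted service
    ! address into OSPF from the server itself.
    router ospf
     ospf router-id 10.1.10.57
     network 10.1.10.0/24 area 0.0.0.0
     network 192.0.2.57/32 area 0.0.0.0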

So now we have a datacenter full of 4948s doing pure L2 and spanning tree... What a waste :)

Hi,

Are there any applications that absolutely *have* to sit on the same
LAN/broadcast domain and can't be configured to use unicast or multicast
IP?

Yes. There are at least some implementations of iSCSI and the accompanying
management services (e.g., for redundancy) that do not work well over
routed connections. Generally, storage services can be difficult to route.

Further, some aspects of VMware (clusters), including management, "need" L2
connectivity, for example when you want to dynamically shift VMs from one
hardware node to another transparently, and so on.

The same applies to several load balancing and/or redundancy/failover mechanisms.

rgds,

.m

I believe the issue will become a moot point in the next 12 months when
vendors begin to ship switches with TRILL.

TRILL is basically a layer 2 routing protocol that will replace spanning
tree. It will allow you to connect several uplinks, utilize all the
bandwidth of the uplinks, prevent loops, and find the best path to the
destination through the switch fabric. Think of it like OSPF for layer
2.

It should be shipping within the next 6 to 9 months.

I believe TRILL will render this discussion moot. It should be shipping
on gear from various vendors within the next year.

For both the 4948/4948-10GE and the 4900M, L3 is in hardware. For the
4948/4948-10GE, IPv6 is in software; for the 4900M, it is in hardware.

I would suggest doing a VC (Virtual Chassis) with the TOR switches. That
way you can have "one" switch for a lot of racks (I believe 10 would be
the upper limit if using Juniper). If you have a VC, you can do L3 and L2
where needed on every rack that the VC covers.

// Olof

* Jonathan Lassoff

Are there any applications that absolutely *have* to sit on the same
LAN/broadcast domain and can't be configured to use unicast or multicast
IP?

FCoE comes to mind.