Load balancing in routers

Does anybody know what load balancing algorithms are used
by most routers? Where can I find more information about
this?

thanks
Abhi

depends on make of router, model, switching method, routing protocol

best to go read www.cisco.com www.juniper.net www.3com.com or any vendor
you're interested in

Steve

That is somewhat of a vague question, but let's see here... The first
question is at what layer the load balancing is going to happen, which is
basically either layer 2, layer 3, or "other".

Layer 2 switches (and routers using multiple layer 2 circuits) do a form
of load balancing based on the MAC addresses of the sender and receiver.
Probably the versions most people are familiar with are Cisco EtherChannel
and IEEE 802.3ad. A single virtual port is presented to the world, composed
of multiple physical "member" ports. The actual forwarding of the frames
is determined by a relatively simple hash on the src and dst MAC addresses,
which aims to keep a "flow" of frames between any given stations on the
same physical port, to prevent frame reordering and other such nastiness.
While the aggregate bandwidth across such a trunk may be that of multiple
physical links, the bandwidth of any given src/dst flow is limited to a
single physical link. Usually the hash is admin configurable because of
the nature of load balancing on layer 2 addresses. For example, if you had
1 router with 2 ports to a switch, and 2 devices hanging off that switch
transmitting to the 1 router, without an admin configurable hash it is
quite possible to end up with all "flows" on a single link. Failure
recovery is usually configurable too: if 1 link in a bundle of 8 goes
down, you can either drop 1/8th of your packets, move the traffic from
that 1 interface to only 1 other interface, or change your hash algorithm
and redistribute all traffic evenly among all interfaces. Unless
explicitly configured, most layer 2 devices will not have multiple
simultaneous forwarding paths due to the possibility of a loop.
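
As a very rough illustration (Python sketch only; the field selection and
hash here are made up, not Cisco's or anyone else's actual algorithm), the
member-port choice boils down to something like:

    # Illustrative sketch of src/dst MAC hashing onto EtherChannel-style
    # member ports.  Hypothetical hash, not any vendor's real algorithm.
    def member_port(src_mac: bytes, dst_mac: bytes, num_ports: int) -> int:
        """Pick a physical member port for a frame from its MAC addresses."""
        # XOR the low bytes of source and destination MAC, then reduce
        # modulo the number of member ports.  Every frame between the same
        # two stations lands on the same port, so a "flow" is never
        # reordered across links.
        key = src_mac[-1] ^ dst_mac[-1]
        return key % num_ports

    a = bytes.fromhex("0000a1b2c3d4")
    b = bytes.fromhex("0000e5f60708")
    print(member_port(a, b, 4))   # same member port every time for this pair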

Layer 3 devices usually do a form of load balancing called "equal cost"
forwarding. If you have two routes to a single prefix (say you have two
physical links), and both have the same routing "cost", packets may be
load balanced across those links. Some mechanisms (for example Cisco CEF)
can do this on a per-destination (flow-based) basis, to prevent packet
reordering. But some protocols can't support this, for example UDP or ICMP
traceroutes usually don't get grouped into a "flow", so you can see this
kind of load balancing in practice on the internet when you get back
traceroute answers from different probes on the same hop.
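
Conceptually (illustrative Python only; the keying is hypothetical and this
is not CEF's actual algorithm), equal-cost per-destination forwarding looks
something like:

    # Sketch of equal-cost forwarding: one prefix, two next hops, and a
    # per-destination choice between them.  Hypothetical, not CEF itself.
    import ipaddress

    # Two equal-cost routes to the same prefix (say, over two physical links).
    routes = {ipaddress.ip_network("192.0.2.0/24"): ["linkA", "linkB"]}

    def next_hop(dst: str) -> str:
        dst_addr = ipaddress.ip_address(dst)
        for prefix, hops in routes.items():
            if dst_addr in prefix:
                # Per-destination balancing: key on the destination so every
                # packet to the same host takes the same link (no reordering).
                return hops[int(dst_addr) % len(hops)]
        raise LookupError("no route")

    print(next_hop("192.0.2.10"))   # always the same link for this host
    print(next_hop("192.0.2.11"))   # may come out the other link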

"Other" layer devices, such as layer 4-7 load balancers (which only
loosely fit under the definition of "router") may have other algorithms,
such as round robin, or least weighted. In order to do more advanced
things, such as non-equal capacity load balancing, you need to have
knowledge of the "load" on a link (or the servers in the case of 4-7 load
balancers). This is something that "routers" have typically avoided, and
I'm not aware of any router vendors who attempt to do load balancing based
on the load of a link.
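
For instance, a plain weighted round robin across two unequal servers is
about as simple as the following sketch (illustrative Python; server names
and weights are invented, and this is not any particular product's
scheduler):

    # Sketch of weighted round robin, the sort of thing a layer 4-7 load
    # balancer might do across unequal-capacity servers.
    import itertools

    servers = {"web1": 3, "web2": 1}   # web1 gets 3 picks for every 1 to web2

    def weighted_round_robin(weights):
        """Yield server names in proportion to their configured weights."""
        expanded = [name for name, w in weights.items() for _ in range(w)]
        return itertools.cycle(expanded)

    scheduler = weighted_round_robin(servers)
    for _ in range(8):
        print(next(scheduler))   # web1, web1, web1, web2, web1, web1, web1, web2

Doing the same thing for links rather than servers would mean knowing the
load on each link, which is exactly the part routers have stayed away from.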

Did you have any more specific questions?

Richard A Steenbergen wrote:

In order to do more advanced
things, such as non-equal capacity load balancing, you need to have
knowledge of the "load" on a link (or the servers in the case of 4-7 load
balancers). This is something that "routers" have typically avoided, and
I'm not aware of any router vendors who attempt to do load balancing based
on the load of a link.

cisco's EIGRP can do it, but it is disabled by default, and not recommended.

Not exactly a "router vendor", but many of the "route-optimization
vendors" can implement this on top of bgp by modifying routes based upon
link utilization, loss, latency, etc..

-jba

Layer 3 devices usually do a form of load balancing called "equal cost"
forwarding. If you have two routes to a single prefix (say you have two
physical links), and both have the same routing "cost", packets may be
load balanced across those links. Some mechanisms (for example Cisco CEF)
can do this on a per-destination (flow-based) basis, to prevent packet
reordering.

I seem to remember fast switching was per-destination, and CEF was
round robin. But it seems CEF is now per-destination as well in IOS 12.2.
Round robin is optional.

But some protocols can't support this, for example UDP or ICMP
traceroutes usually don't get grouped into a "flow", so you can see this
kind of load balancing in practice on the internet when you get back
traceroute answers from different probes on the same hop.

Routers usually don't really take full flow information into account, but
only look at the destination IP address or do a hash over some fields. So
usually traceroute doesn't behave differently from regular traffic.

This link answers the original question for another router vendor:

http://www.juniper.net/techpubs/software/junos51/swconfig51-policy/html/policy-actions-config10.html#1015470

you remember incorrectly.

by default, CEF uses a hash based on both src & dst to determine the path
to take. somewhat paradoxically this is referred to as "per-destination"
load-balancing (or "deterministic").

on many platforms, you can reconfigure CEF to use a per-packet distribution.

"ip load-sharing XX" is the interface command to set the policy.

"per-destination" historically is what fast-switching used to do -- and it did a particularly bad job of handling large amounts of traffic sourced to one ip-address (such as a news-server or proxy-server). this was particularly apparent for multiple <E1 links load-balanced using equal cost routes.

cheers,

lincoln.

I seem to remember fast switching was per-destination, and CEF was
round robin. But it seems CEF is now per-destination as well in IOS 12.2.
Round robin is optional.

CEF is flow-hashed, and the hash seems to include both source and
destination, and seems to include the port numbers. This is by observing
the behaviour of flows hitting various members of the F.ROOT-SERVERS.NET
set, each of whom sends F's address to several upstream routers using OSPF.
CEF works like a charm -- the load is never split by more than 45-55 and
that's damn good for wire speed hashing in my view.

We used CEF in 11.x and it behaved the same way. It was never round-robin
in any way we could observe.

You're right. I was thinking of process switching.

According to:
http://www.ils.unc.edu/dempsey/186s00/reorderingpaper.pdf

packet reordering at MAE East was extremely common a few years ago. Does
anyone have information whether this is still happening?

I don't think flow-caching is necessarily due to CEF.

Even on dinky 2500 & 2600 series where you don't run CEF, load balancing
over multiple links uses a flow-hashed method. If you want per-packet load
distribution you have to specifically enable it by saying "no ip
route-cache" on each interface.

Paul's statement about CEF is interesting. It's probably the first public
statement I've ever heard where someone was praising CEF. Usually
discussions about CEF are accompanied by liberal amounts of swearing...

Joe

If by "round-robin" you mean by destination only, then this is correct. However, if
you mean strict per-packet load sharing regardless of flow, then CEF does have this
capability, although the default behavior is the flow-based load sharing you describe.

http://www.cisco.com/univercd/cc/td/doc/product/software/ios120/12cgcr/switch_c/xcprt2/xccefc.htm#33184

However, IIRC, code stability issues have plagued this feature in many IOS
releases; I recall Intermedia selling a "bonded T1" product that used this feature, and
supporting it was...not pleasant.

-C

Thus spake "joe mcguckin" <joe@via.net>

I don't think flow-caching is necessarily due to CEF.

Even on dinky 2500 & 2600 series where you don't run CEF,
load balancing over multiple links uses a flow-hashed method.

For the 312534906703247th time:

Switching   Balancing per-
---------   --------------
Process     packet
Fast        dest (net)
Flow        flow
CEF*        src-dest pair
CEF**       packet

* by default
** with "ip load-sharing per-packet" on incoming interface

If you want per-packet load distribution you have to specifically
enable it by saying "no ip route-cache" on each interface.

If you want to crater your router, sure. Otherwise, I'd consider one of the
other options.

S

A few comments:

I don't think flow-caching is necessarily due to CEF.

CEF, afaik, is unaware of flows.

Even on dinky 2500 & 2600 series where you don't run CEF,

Many people run CEF on 2600's, it's about the only way to get to the
cisco-advertised PPS on the box.

load balancing over multiple links uses a flow-hashed method. If you
want per-packet load distribution you have to specifically enable it by
saying "no ip route-cache" on each interface.

That is very deadly, please, don't anyone actually try that.

CEF load balancing, IIRC, had two options, specifiable on a per-interface
basis -- 'per-packet', and 'per-destination'. Both have obvious meanings.

Newer IOS's seem to have a defaulting mechanism available in global config
mode, but being a weirdo, I don't trust it. I still specify it
per-interface.

We use this in several scenarios, specifically for load-balancing T1's,
and it amazingly works well, with the links often being in balance to the
tune of 1 to 3%. I've seen similar performance at DS3 rates.

Paul's statement about CEF is interesting. It's probably the first public
statement I've ever heard where someone was praising CEF. Usually
discussions about CEF are accompanied by liberal amounts of swearing...

I dunno; except for some silliness in 12.1(8a)E[1-4] on a MSFC2, we've
seen general goodness from CEF from 2600, 3600, 4700, 5300, 7200, 7500.

Then again, we're not UU or Sprint, and don't have the traffic loading
they do.

Joe


-- Alex Rubenstein, AR97, K2AHR, alex@nac.net, latency, Al Reuben --
-- Net Access Corporation, 800-NET-ME-36, http://www.nac.net --

EIGRP is certainly off by default, as are all routing protocols. You
may not recommend it, but we have lots of customers who like it; it's
not like _cisco_ doesn't recommend it.

There's another way to do this with MPLS-TE, but not everybody likes
that, either. :)

eric

Well, most complaints revolve around distributed "dCEF"; CEF itself is just
the (tm) of using an mtrie data structure to create a prefix table that is
pre-populated and optimized to be used for forwarding only (a FIB).

When you use dCEF, the individual linecards or VIPs each have their own
copy of the FIB, and make the forwarding decisions on their own processor
without having to consult the central route-processor. The problem comes
when you have a routing change (I've never calculated the average rate of
BGP churn but I'd guess it's at least 1-2 changes per second) and need to
rebuild the FIBs, sometimes the individual line cards get "confused" and
put the wrong destination in their FIB. Thus traffic coming in on a
specific source linecard starts being forwarded to the wrong destination
interface, which can make for some very difficult diagnosis (and the
aforementioned swearing).

But other than that, CEF works just fine. Newer architectures (for example
Juniper, which does its FIB work in the IP2) don't have any other legacy
route-caches at all. One of the benefits of having a guaranteed FIB for
doing all longest prefix match lookups is that you can design your RIB so
it is optimized for what it does most, insertions and deletions. Many RIB
applications improve greatly when they no longer need a Patricia tree.
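
For what it's worth, the heart of any FIB is just a longest-prefix-match
lookup; a toy version over a binary trie (illustrative Python, nothing like
a real mtrie implementation) looks like this:

    # Toy longest-prefix-match FIB on a binary trie.  Purely illustrative;
    # a real mtrie/FIB is a very different (and much faster) structure.
    import ipaddress

    class Fib:
        def __init__(self):
            self.root = {}   # node = {0: child, 1: child, "nh": next hop}

        def add(self, prefix: str, next_hop: str):
            net = ipaddress.ip_network(prefix)
            bits = format(int(net.network_address), "032b")[:net.prefixlen]
            node = self.root
            for b in bits:
                node = node.setdefault(int(b), {})
            node["nh"] = next_hop

        def lookup(self, dst: str):
            """Walk the trie, remembering the most specific match seen."""
            node, best = self.root, None
            for b in format(int(ipaddress.ip_address(dst)), "032b"):
                if "nh" in node:
                    best = node["nh"]
                node = node.get(int(b))
                if node is None:
                    return best
            return node.get("nh", best)

    fib = Fib()
    fib.add("10.0.0.0/8", "linkA")
    fib.add("10.1.0.0/16", "linkB")
    print(fib.lookup("10.1.2.3"))   # linkB -- the more specific /16 wins
    print(fib.lookup("10.9.9.9"))   # linkA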

To quote Avi Freedman, "Customer Enragement Feature".
To quote Majdi Abbas, "John Chambers owes me a pony".

> load balancing over multiple links uses a flow-hashed method. If you
> want per-packet load distribution you have to specifically enable it by
> saying "no ip route-cache" on each interface.

That is very deadly, please, don't anyone actually try that.

How so? So it uses a little more cpu, but that may not be relevant in
a lot of applications (like down at the T1 level).

I've had a customer on the end of 8 T1, no ip route cache, on a 4700
(their end) and a 7206/300 (my end). 4700 runs a little hot, but survives.

Similarly, I currently have a couple of 4*T1, a 3*T1, and several 2*T1
on PA-MC-T3 ports on a 7206/300 with no issues whatsoever. Max cpu
usage is 35%. Everything works.

Now, contrast that with my first use of cef, this was back when the
only cef configuration was "ip cef" or something similar. Very
difficult to screw things up when the config is a one-liner, and yet
when I turned this on the 7206 immediately crashed.

-mark

If by "round-robin" you mean by destination only, then this is
correct.

The term "round-robin" refers to a schedule which cycles
through some number of things in a fixed order.

A packet arrives and the router makes a forwarding decision.
The things that it can cycle through are entries in a forwarding table.

Those entries can be either physical paths (ie. one packet
goes over this link, next packet over the next and so forth).
Call that "per-packet load-sharing"

The entries could be cached destinations.
Call that "per-destination load-sharing"

The entries could be based on source/destination pairs.
Call that "flow-based" or "src/dst-based".

The entries could be based on the whole "5-tuple" of
source and destination address, IP protocol, source
and destination port numbers (and ToS value)?
Call that "full-flow".

Now, how effective any of these schemes are at sharing load
at a bit level (how links operate) vs. packet level (how
forwarding decisions are made) has been the subject of
debate for some time.
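
A toy simulation makes the point (illustrative Python; the traffic mix,
sizes, and "hash" are all invented): a flow-based scheme can split the
forwarding decisions perfectly evenly while the bits on each link stay
lopsided, simply because flows are not all the same size.

    # Packet/flow-level vs bit-level sharing: the flow split is even, the
    # byte split follows wherever the few big flows happened to land.
    import random

    random.seed(7)
    LINKS = 2
    flows_per_link = [0] * LINKS
    bits_per_link  = [0] * LINKS

    for flow_id in range(40):
        # mostly small flows, the odd big one ("mice and elephants")
        if random.random() < 0.1:
            flow_bits = 400 * 1500 * 8     # one big transfer
        else:
            flow_bits = 10 * 64 * 8        # a few small packets
        link = flow_id % LINKS             # stand-in for a flow hash
        flows_per_link[link] += 1
        bits_per_link[link] += flow_bits

    print("flows per link:", flows_per_link)   # exactly even: [20, 20]
    print("bits per link: ", bits_per_link)    # usually lopsided, depending
                                               # on where the big flows land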

However, if you [mean] strict per-packet load sharing regardless
of flow, then CEF does have this capability, although the default
behavior is the flow-based load sharing you describe.

http://www.cisco.com/univercd/cc/td/doc/product/software/ios120/12cgcr/switch_c/xcprt2/xccefc.htm#33184

However, IIRC, code stability issues have plagued this feature in
many IOS releases; I recall Intermedia selling a "bonded T1" product
that used this feature, and supporting it was...not pleasant.

-C

I'm not sure about stability, I think it's been stable.
However, there's a fundamental issue with the GSR architecture
which affects its applicability there.

On the GSR, forwarding decisions are made on ingress to the router;
but, per-packet is configured on the outbound interface. The ingress
interface has no way to keep track of which outbound interface a
packet was last sent to; that would require some kind of counter.
I believe some kind of workaround may have been coded for this
problem recently, but it might have significant performance impact
and/or negate other "extended" feature sets.
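
Put another way, a hash is stateless and any ingress card can compute it by
itself, while strict per-packet rotation needs a counter that all the cards
would have to share. A rough sketch of the difference (illustrative Python,
not how the GSR actually works; link names are invented):

    # Stateless hash selection vs stateful round-robin rotation.
    # Purely a sketch of the architectural difference, not real router code.
    links = ["POS0/0", "POS1/0"]

    def pick_by_hash(src: int, dst: int) -> str:
        # Stateless: every ingress card gets the same answer on its own.
        return links[(src ^ dst) % len(links)]

    class RoundRobin:
        # Stateful: "last" would have to be shared or synchronized across
        # every card that can forward traffic toward this route.
        def __init__(self):
            self.last = -1
        def pick(self) -> str:
            self.last = (self.last + 1) % len(links)
            return links[self.last]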

Tony

a large router running at low bandwidth will be fine, but as was previously
said, if you do this on most properly sized routers you will use all the
cpu

another thing is you will see increased latency and jitter as your packets
individually queue for cpu process time

Steve

>> > load balancing over multiple links uses a flow-hashed method. If you
>> > want per-packet load distribution you have to specifically enable it by
>> > saying "no ip route-cache" on each interface.
>>
>> That is very deadly, please, don't anyone actually try that.

How so? So it uses a little more cpu, but that may not be relevant in
a lot of applications (like down at the T1 level).

Besides just driving up the CPU load through the roof for no real reason,
process switching produced per-packet load balancing. This is not a
desirable thing, since it introduces packet reordering which can be VERY
detrimental to TCP performance. Just think, if you had a slightly
different cable length, packets could spend more time on one wire than
another, and become totally out of sync.

      Input flow
           1
           2
           3
           4
           5
           6

     Link 1   Link 2
        1        2
        3        4
        5        6

           2
           1
           3
           4
           6
           5
      Output flow

I've had a customer on the end of 8 T1, no ip route cache, on a 4700
(their end) and a 7206/300 (my end). 4700 runs a little hot, but survives.

Similarly, I currently have a couple of 4*T1, a 3*T1, and several 2*T1
on PA-MC-T3 ports on a 7206/300 with no issues whatsoever. Max cpu
usage is 35%. Everything works.

If all you want to do is a few T1's on an NPE300, you'll be fine. I'm
certain Alex is used to doing more and scraping every last packet out of
his routers. :)

Now, contrast that with my first use of cef, this was back when the
only cef configuration was "ip cef" or something similar. Very
difficult to screw things up when the config is a one-liner, and yet
when I turned this on the 7206 immediately crashed.

It's really not much more complex now. I saw some "CEF Watchdog" (to check
for dCEF corruption) type functionality in recent 12.0S builds, but on a
7200 it doesn't matter since nothing is distributed.

As for your crash... Well, my first guess is that you were running the
"wrong" IOS image. 7200's are simple enough that they are usually safe to
run whatever the "newest" code on. That practice that will get you burned
on GSR's. But in the end... It's Cisco, what do you expect. Call TAC or
try again with new code. :slight_smile:

another thing is you will see increased latency and jitter as your packets
individually queue for cpu process time

Thanks, that statement is significantly different than:

1) That is very deadly
2) If you want to crater your router, sure

both referring to "no ip route-cache"

I was just pushing for more moderate statements :)

-mark