DNS Based Load Balancers

I'm soliciting recommendations for DNS based load balancers. Currently, we have Cisco Global Site Selectors deployed but have reached a limit for the number of active HTTP HEAD checks we can perform. This lack of scalability is restricting us severely with regard to the number of customers we can deploy for our product, which requires a separate HTTP HEAD check per IP per customer.

I am hoping to receive recommendations for devices which allow for DNS based load balancing (round robin and proximity based) as well as HTTP health checks (including content based health checks). It must be scalable to, at least, 2000 active checks and active answers.
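To make the requirement concrete, here is a minimal sketch of the logic being asked for: one check per IP per customer, an optional content match, and round robin over whatever passes. It is not how any of the named products work. `Check`, `is_healthy`, `healthy_ips`, `round_robin`, and the `probe` callable are hypothetical names, and the probe is passed in so that the sketch makes no assumptions about how the HTTP request is actually sent.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Callable, Dict, Iterator, List, Tuple

# Hypothetical result of one probe: (HTTP status code, response body).
# A HEAD response carries no body, so a content-based check needs a GET.
ProbeResult = Tuple[int, str]


@dataclass(frozen=True)
class Check:
    customer: str      # one check per IP per customer, as in the post
    ip: str
    expect: str = ""   # optional content match; empty means status-only


def is_healthy(check: Check, result: ProbeResult) -> bool:
    """Healthy when the status is 200 and any expected content is present."""
    status, body = result
    return status == 200 and check.expect in body


def healthy_ips(checks: List[Check],
                probe: Callable[[Check], ProbeResult]) -> Dict[str, List[str]]:
    """Run every check and group the passing IPs by customer."""
    pools: Dict[str, List[str]] = {}
    for check in checks:
        pools.setdefault(check.customer, [])
        if is_healthy(check, probe(check)):
            pools[check.customer].append(check.ip)
    return pools


def round_robin(ips: List[str]) -> Iterator[str]:
    """Rotate through the healthy answers for one customer."""
    return cycle(ips)
```

At 2000 active checks the scaling concern is how often `probe` has to run, not this bookkeeping.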

I am currently investigating the Netscaler DNS offering as well as F5's 3DNS (or whatever they've changed the name to).

F5 BigIP appears quite good. If you add their 3DNS software, you get
wide-IP's as well.

I'm soliciting recommendations for DNS based load balancers.

my recommendation is: "don't do it." for background, see:

http://www.ops.ietf.org/lists/namedroppers/namedroppers.2002/msg02168.html
http://www.cctec.com/maillists/nanog/current/msg03572.html
http://www.cctec.com/maillists/nanog/current/msg00671.html

Having implemented F5's 3DNS product for a large entertainment company, I'd like to wholeheartedly agree with Paul.

Please, for the love of god, don't do it.

matto

--matt@snark.net------------------------------------------<darwin><
   Moral indignation is a technique to endow the idiot with dignity.
                                                 - Marshall McLuhan

So, you guys have been pretty clear on what he shouldn't do.

What should he do as an alternative to using DNS for a proximity based
solution?

-Dave

In the above posts, you claim it is a protocol violation. Would you mind pointing out exactly which part of the protocol has been violated? Specifically, I do not see where "offering back a different rrset based on criteria like source ip address ... is a protocol violation" [quote from Paul Vixie, second URL above] violates the protocol. However, I do admit you know more about the protocol than I do, so could you please educate us?

Also, I note that "Stupid DNS tricks" have been in use for at least a decade now and seem to work just fine. A significant fraction of Internet traffic is based on these "tricks", so it can't be horrifically bad. Of course, the 'Net is resilient, so the fact "doing X has not killed the Internet" does not prove X is good. However, "Paul saying X is bad" does not prove X is bad either. So let's have the logic behind your statement that these tricks are somehow bad for the Internet.

One strong way to say things are bad is if everyone did it, it would take down the Internet. I submit that the Internet would not die if everyone did this. I also submit it is better than relying on BGP to load balance. If you care to argue any of those points, I'll be happy to explain my reasoning. Otherwise, I think the onus is on you to support your claim.

dave@rightmedia.com ("David Temkin") writes:

So, you guys have been pretty clear on what he shouldn't do.

What should he do as an alternative to using DNS for a proximity based
solution?

http://www.redbooks.ibm.com/redbooks/pdfs/sg245858.pdf
http://www.cisco.com/univercd/cc/td/doc/product/iaabu/distrdir/dd2501/ovr.htm
http://www.radware.com/content/products/library/faq_wsd.pdf
http://www.foundrynet.com/solutions/appNotes/GSLB.html
http://www.ifi.unizh.ch/ifiadmin/staff/rofrei/DA/DA_Arbeiten_2000/Masutti_Oliver.pdf

note that several of these describe or offer a dns-based solution as an option,
but they all describe session-level redirection and most recommend that (as i
do) and some even say "using dns for this is bad" (as i do, but for different
reasons.)

The problem being that most of what you linked to below is either A) out
of date, or B) the only way to get proximity based load balancing (GSLB
type stuff) with them is with DNS tricks.

Breaking it down in order:

The IBM solution hasn't been updated since 1999. It also seems
relatively proprietary.

The Cisco solution relies on either doing HTTP redirects (which is
useless if you're not doing HTTP) or DNS.

Both Foundry and Radware rely 100% on DNS to do their GSLB. You can do
local load balancing on both boxes without, however.

The last link is an outdated thesis paper that makes reference moreso
to local load balancing and not global.

It seems that in lieu of a real, currently produced solution, the only
option is presently DNS to meet the requirements. Others have sent me
off-list stuff they're working on, but none of it's ready for prime
time.

-Dave

From: owner-nanog@merit.edu [mailto:owner-nanog@merit.edu] On
Behalf Of Paul Vixie
Sent: Sunday, July 02, 2006 2:03 PM
To: nanog@merit.edu
Subject: Re: DNS Based Load Balancers

dave@rightmedia.com ("David Temkin") writes:

> So, you guys have been pretty clear on what he shouldn't do.
>
> What should he do as an alternative to using DNS for a proximity based
> solution?

http://www.redbooks.ibm.com/redbooks/pdfs/sg245858.pdf
http://www.cisco.com/univercd/cc/td/doc/product/iaabu/distrdir/dd2501/ovr.htm
http://www.radware.com/content/products/library/faq_wsd.pdf
http://www.foundrynet.com/solutions/appNotes/GSLB.html
http://www.ifi.unizh.ch/ifiadmin/staff/rofrei/DA/DA_Arbeiten_2000/Masutti_Oliver.pdf

Would you mind giving us a little more to go on than "the love of god" before making strategic architectural decisions?

Just in case we like to decide things for ourselves. :)

For instance, was F5's implementation flawed, or do you have a reason to dislike the basic idea? And why?

was it proximity or just loadbalancing he was trying to accomplish? I
didn't hear/see which was the purpose actually :(

The problem being that most of what you linked to below is either A) out
of date, or B) the only way to get proximity based load balancing (GSLB
type stuff) with them is with DNS tricks.

"most of", huh? let's have a looksie.

Breaking it down in order:

The IBM solution hasn't been updated since 1999. It also seems
relatively proprietary.

the ibm white paper i referred you to was written in 1999. websphere is
quite current, and its implementation of GSLB functionality has been updated
plenty since 1999. and the competitors james baldwin said he was eval'ing
(cisco, f5) are certainly patent-holders offering proprietary solutions.

The Cisco solution relies on either doing HTTP redirects (which is
useless if you're not doing HTTP) or DNS.

james baldwin said he was using the cisco solution today, so clearly HTTP is
the main target. i can't think of a protocol requiring GSLB that isn't HTTP
based (either web browsing or web services). FTP just isn't a growth industry
and the transaction processing systems i know of (the ones that aren't based
on HTTP, that is) have GSLB hooks built into them.

IOW, either you can do GSLB with session redirects, or you don't need GSLB.

Both Foundry and Radware rely 100% on DNS to do their GSLB. You can do
local load balancing on both boxes without, however.

did you read the same radware white paper i did? in

  http://www.radware.com/content/products/library/faq_wsd.pdf

it says that they can do session level redirects. so, less than 100% of
radware is dns. i can see that i misread the foundry whitepaper i ref'd
(perhaps we both saw most readily that data which fit our preconceptions?)

The last link is an outdated thesis paper that makes reference moreso
to local load balancing and not global.

why is it "outdated"? as a survey of the desired functionality it's still
pretty good background. no new GSLB has been invented since then, surely?

It seems that in lieu of a real, currently produced solution, the only
option is presently DNS to meet the requirements. Others have sent me
off-list stuff they're working on, but none of it's ready for prime
time.

well, i see that fezhead is dead. but 3-party TCP is alive and well:
<http://www.cs.bu.edu/~best/res/projects/DPRClusterLoadBalancing/>.

see also <http://www.tenereillo.com/GSLBPageOfShame.htm>
and <http://www.tenereillo.com/GSLBPageOfShameII.htm>.

the references sections of those last three are particularly informative.

Paul - I'm still eagerly waiting your reply to Patrick's questions.

Here at least we finally have something to read other than relying on blind faith, but
the author is so convinced DNS based GSLB doesn't work[1] (and gives good examples
of why it doesn't). However, these are all pretty much theoretical examples, and there's
no explanation of why DNS based CDNs do in fact work so well in practice[2].

[1] FSVO "doesn't work" that is...
[2] I was going to say "appear to work so well", but that's unfair use of sarcasm - I know just how well at least one CDN works :)

Without getting into a massive back and forth, I just want to make 3
points:

1) Websphere is proprietary to IBM and requires their servers. It's not
scalable to other applications. It's also not targeted to the same
market as, say, F5.

2) There are definitely protocols that require GSLB that aren't HTTP.
Off the top of my head: RTSP/MMS, VoIP services. I'd say that, at the
very least, VoIP protocols are the killer app for GSLB moreso than HTTP.
Surely the internet isn't only the web, right?

3) TCP-redirect solutions, such as the Radware one you pointed out, do
not work in large scales. Have you ever met anyone who's actually
implemented that in a large scale? The solution they point to they
don't even sell anymore (the WSD-DS/NP). If you talk to their sales,
they'll point you at the DNS based solution because they know that doing
Triangulation is a joke. Triangulation and NAT-based methods both
crumble under any sort of DoS and provide no site isolation.

Pete Tenereillo's papers are interesting, but they're also slanted and
ignore other implementation methods of DNS GSLB. How about handing out
NS records instead of A records? That's a method that would make
large parts of his papers irrelevant.

My main point here is that each solution has its evils, and when faced
with a choice, he needs to evaluate what method works best for him.
Anyone could just as easily say that Triangulation and NAT are a hack
just the same as GSLB DNS is a hack. Akamai and UltraDNS will actually
sell you GSLB without even buying localized hardware to do it - are
these bad services, too? Patrick said it best: Just in case we like to
decide things for ourselves.

-Dave

Without getting into a massive back and forth, I just want to make 3
points:

as long as the back-and-forth remains informative and constructive, i'll play:

1) Websphere is proprietary to IBM and requires their servers. It's not
scalable to other applications. It's also not targeted to the same
market as, say, F5.

websphere is a trade name for a family of products and services. the GSLB
component is able to play as a proxy to someone else's web server. (don't
take my word for it, call an ibm salesweenie.)

2) There are definitely protocols that require GSLB that aren't HTTP.
Off the top of my head: RTSP/MMS, VoIP services. I'd say that, at the
very least, VoIP protocols are the killer app for GSLB moreso than HTTP.
Surely the internet isn't only the web, right?

according to <http://www.isc.org/pubs/tn/isc-tn-2004-2.html>, the internet
is much larger than the web. but i'm not sure what you're replying to. i
said that session level redirection would be possible in all cases where
GSLB was needed. voip has session level redirection (several kinds).

3) TCP-redirect solutions, such as the Radware one you pointed out, do
not work in large scales. Have you ever met anyone who's actually
implemented that in a large scale? The solution they point to they
don't even sell anymore (the WSD-DS/NP). If you talk to their sales,
they'll point you at the DNS based solution because they know that doing
Triangulation is a joke. Triangulation and NAT-based methods both
crumble under any sort of DoS and provide no site isolation.

i did not know radware has given up on wsd. but i don't see an explanation
of what you mean by "not work in large scales" beyond "radware gave up". i
gave another reference to third-party TCP, have you looked at it or surveyed
the rest of the field to find out how asymmetric IP (satellite downlink,
terrestrial uplink) and third-party TCP is working for the various pacific
islands who depend on it?

Pete Tenereillo's papers are interesting, but they're also slanted and
ignore other implementation methods of DNS GSLB. How about handing out
NS records instead of A records? That's a method that would make
large parts of his papers irrelevant.

just as one can always find an example that supports one's preconceptions,
one can always find a single counterexample that will support one's
prejudices. i'm sure that any technology can be successfully demo'd or
successfully counter-demo'd. this conversation started out as "what DNS
GSLB should i use?" and then "if DNS GSLB is such a bad idea then what do
you propose as an alternative?" and now it's "every alternative has known
failure modes that are as bad as DNS GSLB's worst case." does that mean
we're done with the informative and constructive part of this thread?

My main point here is that each solution has its evils, and when faced
with a choice, he needs to evaluate what method works best for him.
Anyone could just as easily say that Triangulation and NAT are a hack
just the same as GSLB DNS is a hack. Akamai and UltraDNS will actually
sell you GSLB without even buying localized hardware to do it - are
these bad services, too? Patrick said it best: Just in case we like to
decide things for ourselves.

nobody ever got fired for buying akamai's or ultradns's DNS GSLB services,
that's for sure.

just as one can always find an example that supports one's
preconceptions, one can always find a single counterexample
that will support one's prejudices. i'm sure that any
technology can be successfully demo'd or successfully
counter-demo'd. this conversation started out as "what DNS
GSLB should i use?" and then "if DNS GSLB is such a bad idea
then what do you propose as an alternative?" and now it's
"every alternative has known failure modes that are as bad as
DNS GSLB's worst case." does that mean we're done with the
informative and constructive part of this thread?

I don't think anyone disagrees with you there. I just felt that any
comprehensive answer should go beyond "DNS GSLB is broken, don't use
it".

  As someone who administers a rather large GSLB network, both appliance
and service provider based, and who has administered triangulation and
BGP-based methods in the past, I can honestly say that thus far the DNS
implementation has been far less broken. Does that mean that someone else
feels differently? I sure hope so.

> My main point here is that each solution has its evils, and when
> faced with a choice, he needs to evaluate what method works best for him.
> Anyone could just as easily say that Triangulation and NAT are a hack
> just the same as GSLB DNS is a hack. Akamai and UltraDNS will actually
> sell you GSLB without even buying localized hardware to do it - are
> these bad services, too? Patrick said it best: Just in case we like
> to decide things for ourselves.

nobody ever got fired for buying akamai's or ultradns's DNS
GSLB services, that's for sure.

Very true, but does that mean they're a viable alternative for him? Or
are they just as broken as hardware vendor GSLB?
The local load balancing piece can be served by any number of hardware
appliances or software products.

-Dave

Would you mind giving us a little more to go on than "the love of god" before making strategic architectural decisions?

Just in case we like to decide things for ourselves. :)

Patrick, I am sorry if I have hit a nerve with you- it seems you've got a vested interest in the answer to this question, and I appreciate your position.

For instance, was F5's implementation flawed, or do you have a reason to dislike the basic idea? And why?

For the record, what I _should_ have advised the OP was "for the love of god, don't try to do this yourself with an appliance." I wholeheartedly encourage him to give his local Akamai sales rep a call. I am sorry for the confusion and angst my brevity has caused.

cheers,
matto


Matt,

A few quick questions for you. If you have the time to answer, it would be
appreciated (questions inline):

From: owner-nanog@merit.edu [mailto:owner-nanog@merit.edu] On Behalf Of
Matt Ghali
Sent: 04 July 2006 07:21
To: Patrick W. Gilmore
Cc: nanog@merit.edu
Subject: Re: DNS Based Load Balancers

> Would you mind giving us a little more to go on than "the love of
> god" before making strategic architectural decisions?
>
> Just in case we like to decide things for ourselves. :)

Patrick, I am sorry if I have hit a nerve with you- it seems you've
got a vested interest in the answer to this question, and I
appreciate your position.

> For instance, was F5's implementation flawed, or do you have a reason to
> dislike the basic idea? And why?

For the record, what I _should_ have advised the OP was "for the
love of god, don't try to do this yourself with an appliance." I
wholeheartedly encourage him to give his local Akamai sales rep a
call. I am sorry for the confusion and angst my brevity has caused.

We work with a couple of different technologies here - our own GSS's, cache
farms and also external CDNs (for overflow). This is an area that is
currently under evaluation for a quite significant expansion.

Are you able to give some kind of description as to the problems you
experienced whilst using your own appliances? It would be very useful to be
able to avoid making the same mistakes.

Sam

As someone who has also deployed GSLB's with hardware appliances I
would also like to know real world problems and issues people are
running into "today" on modern GSLB implementations and not
theoretical ones. As far as I can tell, our GSLB deployment was very
straightforward and works flawlessly.

geographic load balancing via an appliance can be reduced to the criteria they have available to decide what answer to give a query.

- Pings and traceroutes are both subject to rapid state change. Paths and latencies change for a number of reasons not related to network proximities. Traceroute hops in particular are a terrible metric to use in judging proximity, as it could be very easy for a 14-hop path inside the US to trump a 4-hop transatlantic path. Pings/traceroutes also take a long time, and are only valuable for repeat queries from the same client, dumping the first on some default pool. Not so load balanced.

- BGP aspath length. This is actually probably the 'best' data that a geographical load balancing system can use. The data is detailed and metrics for any inbound connection are already in the 'db'. However, expecting a corporate IT or ops department to configure bgp peering on their load balancer is probably expecting a bit much. To the best of my knowledge, no appliance uses aspath length.

- Maps of RIR allocations and their geographic locations. OK. I can see how these might be useful for balancing traffic roughly across global regions, but the lack of granularity makes this a somewhat elaborate way to skin this particular cat. Also, we all know how well RIR allocation corresponds to actual location in the real world. At work, I am pleasantly surprised by 'geolocation' tools that claim my office in Redwood City is actually in Washington DC or London.
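The hop-count objection in the first bullet can be shown with a small sketch. The numbers are made up for illustration, and `Measurement`, `closest_by_hops`, and `closest_by_rtt` are hypothetical names rather than anything an appliance exposes. The point is only that the two metrics can pick opposite sites for the same client.

```python
from typing import Dict, NamedTuple


class Measurement(NamedTuple):
    hops: int       # traceroute hop count between the site and the client
    rtt_ms: float   # round-trip time between the site and the client


def closest_by_hops(sites: Dict[str, Measurement]) -> str:
    """Pick the site with the fewest traceroute hops."""
    return min(sites, key=lambda name: sites[name].hops)


def closest_by_rtt(sites: Dict[str, Measurement]) -> str:
    """Pick the site with the lowest round-trip time."""
    return min(sites, key=lambda name: sites[name].rtt_ms)


# Made-up numbers for a client in the US: a 14-hop domestic path
# and a 4-hop path that crosses the Atlantic.
SAMPLE = {
    "us-site": Measurement(hops=14, rtt_ms=35.0),
    "eu-site": Measurement(hops=4, rtt_ms=95.0),
}
```

With these sample values, `closest_by_hops(SAMPLE)` selects the transatlantic site while `closest_by_rtt(SAMPLE)` selects the domestic one.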

If you're looking to distribute traffic across several data centers, across many geographic regions, why not anycast a set of auth nameservers, with each pointing at their own data center in answers? This solution probably gives you a better 'correct' hit rate than any commercial appliance, and can be implemented yourself, or with the help of a commercial provider who specializes in this sort of thing. No vendor lock-in, no arm and a leg for a substandard PC in a rack mount case. (but you don't get the big fancy logo either)
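A rough sketch of that anycast idea, under stated assumptions: the site names, the `www.example.com.` record, and the documentation-range addresses are all hypothetical, and `answer_query` stands in for whatever authoritative server software is actually used. Each instance serves a zone that differs only in the address it hands out, and routing toward the shared anycast prefix decides which instance a query reaches.

```python
from typing import Dict

# Hypothetical per-site zone data. Every instance announces the same
# anycast nameserver address, but each answers with its own data center.
SITE_ZONES: Dict[str, Dict[str, str]] = {
    "sjc": {"www.example.com.": "192.0.2.10"},
    "iad": {"www.example.com.": "198.51.100.10"},
    "lhr": {"www.example.com.": "203.0.113.10"},
}


def answer_query(instance: str, qname: str) -> str:
    """Return the A record the nameserver instance at `instance` serves.

    The instance never inspects the source address of the query; which
    instance receives it is decided by routing to the anycast prefix.
    """
    return SITE_ZONES[instance][qname]
```

The nameserver itself makes no proximity decision here. The selection is pushed into the routing system, which is why no per-query measurement is needed.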

matto


As someone who has also deployed GSLB's with hardware appliances I
would also like to know real world problems and issues people are
running into "today" on modern GSLB implementations and not
theoretical ones. As far as I can tell, our GSLB deployment was very
straightforward and works flawlessly.

GSLBs based on DNS have one significant shortcoming that no one here has yet
mentioned: they are performing their magic on the location of the
_nameserver_ that issued the query.

this can be VERY different to that of the ACTUAL location of the client.

for example, Akamai always sends me off to a server farm in Northern
California, because that's where my DNS query is originating from.

that is almost the exact opposite side of the planet from where I'm coming
from.......
irony is that there is an akamai cluster about 10 feet away from where my
[subsequent] http requests originate from...

sure - perhaps this isn't the norm - split-tunnel VPNs being what they are -
but it's a perfect example of why GSLB based on DNS ain't perfect.
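A small sketch of why this happens, assuming a hypothetical prefix-to-site table (the prefixes are documentation ranges, and `site_for` and `gslb_answer` are made-up names). The DNS GSLB only ever sees the resolver's source address, so the client's own address cannot enter the decision.

```python
import ipaddress
from typing import Dict

# Hypothetical map from source prefixes to the site a DNS GSLB would pick.
SITE_BY_PREFIX: Dict[str, str] = {
    "192.0.2.0/24": "northern-california",
    "203.0.113.0/24": "near-client",
}
DEFAULT_SITE = "northern-california"


def site_for(address: str) -> str:
    """Pick a site from whatever source address is visible."""
    ip = ipaddress.ip_address(address)
    for prefix, site in SITE_BY_PREFIX.items():
        if ip in ipaddress.ip_network(prefix):
            return site
    return DEFAULT_SITE


def gslb_answer(resolver_ip: str, client_ip: str) -> str:
    """The query arrives from the resolver, so client_ip is unavailable
    to the decision and is ignored here on purpose."""
    return site_for(resolver_ip)
```

For a client at 203.0.113.7 whose queries go out through a resolver at 192.0.2.53, `gslb_answer` returns the Northern California site even though `site_for` applied to the client address would have chosen the nearby one.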

cheers,

lincoln.