Anycast 101

I got some messages from people who weren't exactly clear on how anycast works and fails. So let me try to explain...

In IPv6, there are three ways to address a packet: one-to-one (unicast), one-to-many (multicast), or one-to-any (anycast). Like multicast addresses, anycast addresses are shared by a group of systems, but a packet addressed to the group address is only delivered to a single member of the group. IPv6 neighbor discovery provides a kind of "round robin ARP" functionality that allows anycast to work on local subnets.

Anycast DNS is a very different beast. Unlike IPv6, IPv4 has no specific support for anycast, and the point here is to distribute the group address very widely rather than over a single subnet anyway. So what happens is that a BGP announcement covering the service address is sourced in different locations, and each location is basically configured to think it's the "owner" of the address.

The idea is that BGP will see the different paths towards the different anycast instances, and select the best one. Now note that the only real benefit of doing this is reducing the network distance between the users and the service. (Some people cite DoS benefits but DoSsers play the distribution game too, and they're much better at it.)
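To make this concrete, here is a toy sketch in Python of how a single vantage point ends up using exactly one instance (instance names and AS numbers are made up for illustration):

    # Every instance announces the same prefix; plain BGP best-path
    # selection (reduced here to "shortest AS path wins") picks one.
    announcements = {
        # instance -> AS path as seen from this vantage point
        "root-x.ams": [64501, 64496],
        "root-x.lhr": [64502, 64510, 64496],
        "root-x.iad": [64503, 64511, 64512, 64496],
    }

    def best_path(paths):
        """Shortest AS path wins; real BGP has many more tiebreaks."""
        return min(paths.items(), key=lambda kv: len(kv[1]))

    instance, path = best_path(announcements)
    print(f"anycast traffic from here goes to {instance} via {path}")

A vantage point in another region sees different path lengths and so silently picks a different instance; that's the whole trick.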

Anycast is now deployed for a significant number of root and gtld servers. Before anycast, most of those servers were located in the US, and most of the rest of the world suffered significant latency in querying them. Due to limitations in the DNS protocol, it's not possible to increase the number of authoritative DNS servers for a zone beyond around 13. With anycast, a much larger part of the world now has regional access to the root and com and net zones, and probably many more that I don't know about.

However, there are some issues. The first one is that different packets can end up at different anycast instances. This can happen when BGP reconverges after some network event (or after an anycast instance goes offline and stops announcing the anycast prefix), but under some very specific circumstances it can also happen with per packet load balancing. Most DNS traffic consists of single packets, but the DNS also uses TCP for queries sometimes, and when intermediate MTUs are small there may be fragmentation.
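For the TCP case, a toy state machine (not a real TCP stack, and the instance names are invented) shows why a mid-connection switch is fatal: the second instance has no state for a connection opened against the first.

    class AnycastInstance:
        """Minimal stand-in for a DNS server speaking TCP."""
        def __init__(self, name):
            self.name = name
            self.connections = set()

        def receive(self, conn, flags):
            if flags == "SYN":
                self.connections.add(conn)     # handshake creates state here
                return f"{self.name}: SYN-ACK"
            if conn in self.connections:
                return f"{self.name}: data accepted"
            return f"{self.name}: RST (unknown connection)"

    a = AnycastInstance("instance-A")
    b = AnycastInstance("instance-B")
    print(a.receive(1, "SYN"))   # the handshake lands on instance A
    # BGP reconverges (or load balancing flips) mid-connection:
    print(b.receive(1, "ACK"))   # instance B never saw the SYN -> RST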

Another issue is the increased risk of fate sharing. In the old root setup, it was very unlikely for a non-single-homed network to see all the root DNS servers behind the same next hop address. With anycast, this is much more likely to happen. The pathological case is one where a small network connects to one or more transit networks and has local/regional peering, and then sees an anycast instance for all root servers over peering. If something bad then happens to the peering connection (the peering router melts down, a peer pulls an AS7007, the peering fabric goes down, or worse, starts flapping), all the anycasted addresses become unreachable at the same time.

Obviously this won't amount to total unreachability in practice (well, unless a certain TLD has only two addresses that are both anycast, in which case your mileage may vary), but even if 5 or 8 or 12 addresses become unreachable, the timeouts get bad enough for users to notice.
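Some back-of-the-envelope arithmetic makes the point (assuming, hypothetically, a resolver that walks the address list in order with a fixed two-second per-address timeout and no parallelism; real resolvers differ, but the shape holds):

    PER_SERVER_TIMEOUT = 2.0   # seconds before trying the next address (assumed)
    TOTAL_SERVERS = 13

    for dead in (5, 8, 12):
        worst_case = dead * PER_SERVER_TIMEOUT
        print(f"{dead:2d} of {TOTAL_SERVERS} addresses down: up to "
              f"{worst_case:.0f}s before a working server answers")
    # 5 down -> 10s, 8 down -> 16s, 12 down -> 24s: long enough to notice.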

The 64000 ms timeout query is: at what point do the downsides listed above (along with troubleshooting hell) start to overtake the benefit of better latency? I think the answer lies in the answers to these three questions:

- How good is BGP at selecting the lowest-latency path?
- How fast is BGP convergence?
- What percentage of queries go to the first or fastest server in the list?

Iljitsch van Beijnum wrote:

Due to limitations in the DNS protocol, it's not possible to increase the number of authoritative DNS servers for a zone beyond around 13.

I believe you misspelled, "Due to people who do not understand the DNS
protocol being allowed to configure firewalls..."

I got some messages from people who weren't exactly clear on how
anycast works and fails. So let me try to explain...

Nice try.

Anycast is now deployed for a significant number of root and gtld
servers. Before anycast, most of those servers were located in the US,
and most of the rest of the world suffered significant latency in
querying them. Due to limitations in the DNS protocol, it's not
possible to increase the number of authoritative DNS servers for a zone
beyond around 13. With anycast, a much larger part of the world now has
regional access to the root and com and net zones, and probably many
more that I don't know about.

Think of this also as a reliability measure. If a region of the world has
poor connectivity to the so-called "Internet core" (Remember the Sri Lanka
international fiber outage a few months ago?), a loss of international
connectivity can mean a loss of DNS, which breaks even local connectivity.

However, there are some issues. The first one is that different packets
can end up at different anycast instances. This can happen when BGP
reconverges after some network event (or after an anycast instance goes
offline and stops announcing the anycast prefix), but under some very
specific circumstances it can also happen with per packet load
balancing. Most DNS traffic consists of single packets, but the DNS
also uses TCP for queries sometimes, and when intermediate MTUs are
small there may be fragmentation.

You're misunderstanding how per-packet load balancing is generally used.

Per-packet load balancing works very well when you've got two identical
circuits between the same two routers, and you want to make sure neither
circuit fills up while the other has spare capacity.

Using per-packet load balancing on non-identical paths (in your example,
out different peering or transit connections) doesn't work. Even when
connecting to a unicast host, the packets would arrive out of order,
leading to some really nasty performance problems. If anybody is using
per-packet load balancing in that sort of situation, anycast DNS is the
least of their problems.

Another issue is the increased risk of fate sharing. In the old root
setup, it was very unlikely for a non-single homed network to see all
the root DNS servers behind the same next hop address. With anycast,
this is much more likely to happen. The pathological case is one where
a small network connects to one or more transit networks and has
local/regional peering, and then sees an anycast instance for all root
servers over peering. If then something bad happens to the peering
connection (peering router melts down, a peer pulls an AS7007, peering
fabric goes down, or worse, starts flapping), all the anycasted
addresses become unreachable at the same time.

You appear to be assuming that every anycast server in the world announces
routes for every anycasted address.

The general Anycast rule is that for however many anycasted IP addresses
you have serving a zone, you have that many separate sets of anycast
nodes. So, if you have a zone served by anyns1, anyns2, and anyns3, there
will be a set of nodes that is anyns1, a set of nodes that is anyns2, and
a set of nodes that is anyns3. Different servers, different routers,
and probably different physical locations. Are there scenarios where an
outage would lead to a loss of all of the anycast clouds? Of course, but
those scenarios would apply to Unicast servers as well.

The potentially valid point you've made is about switching servers during
BGP convergence. As such, anycast might well be inappropriate for
long-term stateful connections. However, BGP reconvergence should be
relatively rare, DNS queries finish quickly, and DNS is good about failing
over to another DNS server IP address when a query fails. If your example
is a network whose entire routing table is reconverging, and they're
changing their routes to all the name servers for a particular zone, their
network performance is going to be pretty bad until convergence finishes
anyway.
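To illustrate how simple that failover is, a standard-library Python sketch (the addresses are documentation placeholders, not real name servers):

    import socket

    NAMESERVERS = ["192.0.2.1", "192.0.2.2", "192.0.2.3"]  # placeholders

    def query_first_responding(wire_query, timeout=2.0):
        """Try each name server address in turn, failing over on timeout."""
        for addr in NAMESERVERS:
            s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
            s.settimeout(timeout)
            try:
                s.sendto(wire_query, (addr, 53))
                return s.recvfrom(512)[0]   # classic 512-byte UDP limit
            except socket.timeout:
                continue                    # fail over to the next address
            finally:
                s.close()
        raise RuntimeError("all name servers timed out")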

Obviously this won't amount to total unreachability in practice
(well, unless a certain TLD has only two addresses that are both
anycast, in which case your mileage may vary), but even if 5 or 8 or
12 addresses become unreachable, the timeouts get bad enough for users
to notice.

Right, but if you're losing 5 or 8 or 12 diverse routes at the same time,
your problem probably has very little to do with anycast.

-Steve

My question:

I noticed that people always talk about BGP when
they talk about anycast DNS server farms.

But is there any problem, or anything that must be
taken care of, when anycast is employed within a DNS
server farm within a MAN?

What I mean is, if we want to employ anycast in a
cache server farm which is located within a big OSPF
network, is there anything problematic? Or should we
consider anycast only when a root server is to be
installed?

Some people said it's not needed to set up anycast in
a MAN because the DNS system in such a situation is
very small (less than 10 Sun servers).

regards

Joe

--- Iljitsch van Beijnum <iljitsch@muada.com> wrote:

a message of 68 lines which said:

and then sees an anycast instance for all root servers over
peering. If something bad then happens to the peering connection

...

but even if 5 or 8 or 12 addresses become unreachable, the timeouts
get bad enough for users to notice.

We can turn this into a Good Practice: do not put an instance of every
root name server on any given exchange point.

Actually, this is only a theoretical issue; the current maximum seems
to be only three (at the LINX in London).
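The practice is easy to check mechanically; a quick sketch (the presence data and threshold are invented for illustration):

    # Flag any exchange point hosting "too many" distinct root letters.
    PRESENCE = {
        "LINX":   {"f", "i", "k"},
        "AMS-IX": {"f", "k"},
        "PAIX":   {"f"},
    }
    LIMIT = 3   # arbitrary threshold for "too much shared fate"

    for ixp, letters in sorted(PRESENCE.items()):
        verdict = "shared-fate warning" if len(letters) >= LIMIT else "ok"
        print(f"{ixp}: {sorted(letters)} -> {verdict}")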

Is there any problem, or anything that must be taken
care of, when anycast is employed within a DNS
server farm within a MAN?

What I mean is, if we want to employ anycast in a
cache server farm which is located within a big OSPF
network, is there anything problematic? Or should we
consider anycast only when a root server is to be
installed?

Since OSPF generally converges orders of magnitude faster than BGP, and it's easier to get OSPF to select the highest-bandwidth/lowest-latency path, this should be easier. The problem of how to revoke the anycast route when the service goes away is pretty much the same. The benefits (especially latency) are also likely to be smaller, though.
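The usual pattern on the anycast node is a health check that withdraws the route as soon as the daemon stops answering. A rough Python sketch; the announce/withdraw hooks are hypothetical stand-ins for whatever talks to your routing daemon:

    import socket
    import time

    def announce_route():
        pass  # hypothetical hook: have the routing daemon advertise the address

    def withdraw_route():
        pass  # hypothetical hook: stop advertising so the IGP converges away

    def dns_alive(addr="127.0.0.1", timeout=1.0):
        """Send a minimal query for ". NS" and see whether anything answers."""
        probe = (b"\x12\x34"                    # ID
                 b"\x01\x00"                    # flags: standard query, RD
                 b"\x00\x01"                    # QDCOUNT = 1
                 b"\x00\x00\x00\x00\x00\x00"    # AN/NS/ARCOUNT = 0
                 b"\x00"                        # QNAME = root
                 b"\x00\x02\x00\x01")           # QTYPE = NS, QCLASS = IN
        s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        s.settimeout(timeout)
        try:
            s.sendto(probe, (addr, 53))
            return len(s.recvfrom(512)[0]) > 0
        except OSError:
            return False
        finally:
            s.close()

    while True:
        if dns_alive():
            announce_route()
        else:
            withdraw_route()
        time.sleep(5)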

Some people said it's not needed to set up anycast in
a MAN because the DNS system in such a situation is
very small (less than 10 Sun servers).

The only problem that is unsolvable without anycast is getting response times below 25 ms or so in every corner of the world. In all other cases it all depends on the pros and cons of different ways to get the job done.

under some very
specific circumstances it can also happen with per packet load
balancing.

You're misunderstanding how per-packet load balancing is generally used.

I wasn't saying anything about how per-packet load balancing is generally used; the point is that it's possible for subsequent packets to end up at different anycast instances when a number of specific prerequisites exist. In short: a customer must do per-packet load balancing across two routers at the same ISP, and each of those routers must have a different preferred path to a different anycast instance. This isn't going to happen often, but it's not impossible, and it's not bad engineering on the customer's or ISP's part if it does, IMO.
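A toy of exactly that scenario (router and instance names made up):

    from itertools import cycle

    # Each ISP router's BGP best path for the anycast prefix.
    BEST_PATH = {"ISPrtr1": "anycast-instance-A",
                 "ISPrtr2": "anycast-instance-B"}

    routers = cycle(BEST_PATH)   # the customer's per-packet round robin
    packets = ["TCP SYN", "TCP ACK", "UDP fragment 1", "UDP fragment 2"]

    for pkt in packets:
        rtr = next(routers)
        print(f"{pkt:14s} -> {rtr} -> {BEST_PATH[rtr]}")
    # The SYN lands on instance A, the ACK on instance B; the
    # prerequisites above are all it takes.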

Using per-packet load balancing on non-identical paths (in your example,
out different peering or transit connections) doesn't work.

That's right, because BGP only installs two or more routes when the path attributes are identical or nearly identical.

However, the attributes may be different (different next hop, IGP metric, MED) inside the ISP network, but the differences can then go away at the next hop.

Even when connecting to a unicast host, the packets would arrive out of order,
leading to some really nasty performance problems. If anybody is using
per-packet load balancing in that sort of situation, anycast DNS is the
least of their problems.

Yes, this is why people are so terrified of per-packet load balancing. Most of this fear is unfounded, though: the only way to get consistently out-of-order packets (a few here and there don't matter) is when the links in the middle have the same or lower effective bandwidth than the links at the source edge. And even then it will mostly happen for packets of different sizes.
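The arithmetic behind that claim, as a quick sketch (the link speeds are illustrative):

    def serialization_us(size_bytes, mbps):
        """Time to clock a packet onto a link, in microseconds."""
        return size_bytes * 8 / mbps   # Mb/s is one bit per microsecond

    for mbps in (2, 34, 1000):
        # Window in which a 64-byte packet sent on the parallel link can
        # overtake a 1500-byte packet sent just before it.
        window = serialization_us(1500, mbps) - serialization_us(64, mbps)
        print(f"{mbps:5d} Mb/s links: overtake window ~{window:7.1f} us")
    # ~5744 us at 2 Mb/s versus ~11 us at 1 Gb/s: reordering is a
    # slow-link, mixed-packet-size phenomenon.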

You appear to be assuming that every anycast server in the world announces
routes for every anycasted address.

No. I'm not concerned about what happens at the anycasted ends; it's the way things look from any given vantage point in the network that matters.

Are there scenarios where an
outage would lead to a loss of all of the anycast clouds? Of course, but
those scenarios would apply to Unicast servers as well.

The assumption is that it's universally beneficial to see DNS addresses "close". While it is good to be able to see several addresses "close", it's better for redundancy when some are also seen "far away": when big failures happen, it's less likely that everything "close" _and_ everything "far away" is impacted at the same time.

Obviously this won't amount to total unreachability in practice
(well, unless a certain TLD has only two addresses that are both
anycast, in which case your mileage may vary), but even if 5 or 8 or
12 addresses become unreachable, the timeouts get bad enough for users
to notice.

Right, but if you're losing 5 or 8 or 12 diverse routes at the same time,
your problem probably has very little to do with anycast.

That's not the point. If this is better without anycast than with it, then it should go on the "con" list for anycast.

Well, there may be only three "at" the LINX, but from where I'm sitting, 7 are reachable over the AMS-IX, 4 over ISP #1, and one each over ISPs #2 and #3.

Interestingly enough, b, c, d and f all share this hop:

  6 portch1.core01.ams03.atlas.cogentco.com (195.69.144.124)

(195.69.144.0/23 is the AMS-IX exchange subnet.)

That's not the point. If this is better without anycast than with it,
then it should go on the "con" list for anycast.

People often confuse two separate technical things
here. One is the BGP anycast technique which allows
anycasting to be used in an IPv4 network, and the
other is the application of BGP anycasting to DNS
in an IPv4 network. It would be clearer if people
would prefix "anycast" with either BGP or DNS to make
it clear which they are talking about. Conceivably
there could be other applications that could be
distributed using BGP anycast. And if those applications
are designed knowing the quirks of BGP anycasting
then presumably they would have ways to overcome
some of the issues that affect DNS.

I would reword your statement as follows.

... then this should go on the "con" list for
DNS anycasting.

--Michael Dillon

Iljitsch van Beijnum wrote:

under some very
specific circumstances it can also happen with per packet load
balancing.

You're misunderstanding how per-packet load balancing is generally used.

I wasn't saying anything about how per-packet load balancing is generally used; the point is that it's possible for subsequent packets to end up at different anycast instances when a number of specific prerequisites exist. In short: a customer must do per-packet load balancing across two routers at the same ISP, and each of those routers must have a different preferred path to a different anycast instance. This isn't going to happen often, but it's not impossible, and it's not bad engineering on the customer's or ISP's part if it does, IMO.

You're wrong. That's VERY bad engineering!

PPLB requires 2 routers, one at each end of the link bundle.

More than 1 router at any end will lead to a lot more problems than
anycast, including multicast and any stateful protocol (like TCP).

For one thing, the load balancing will be in only one direction, and will
lead to congestion on the reverse path.... Self-defeating.

Iljitsch van Beijnum wrote:

Well, there may be only three "at" the LINX, but from where I'm sitting, 7 are reachable over the AMS-IX, 4 over ISP #1, and one each over ISPs #2 and #3.

Interestingly enough, b, c, d and f all share this hop:

6 portch1.core01.ams03.atlas.cogentco.com (195.69.144.124)

(195.69.144.0/23 is the AMS-IX exchange subnet.)

I'm beginning to wonder whether you're just an agent provocateur.

Assuming that your link to AMS-IX fails, your redundant attachments to
the world will provide the reachability to those same 7 via your other
links. All that shows is those 7 are topologically closer via that
path. You don't seem to have 13 paths. So?

FWIW, I see all DNS roots via the same BellSouth path. IFF BellSouth
fails, I'm sure that other paths will pick up the slack. I'm not
worried, because I've experienced BellSouth failures in the past, and
I've tested dropping each of my links from time to time to ensure that
routing works and I'm getting what I'm paying for....

Do you actually do any engineering, or just kibitzing?

In short: a customer must do per-packet load balancing across two routers at the same ISP, and each of those routers must have a different preferred path to a different anycast instance. This isn't going to happen often, but it's not impossible, and it's not bad engineering on the customer's or ISP's part if it does, IMO.

You're wrong. That's VERY bad engineering!

PPLB requires 2 routers, one at each end of the link bundle.

It doesn't really require that. Redundancy requires that the two links terminate on different routers at each end. Having one router at one end and two at the other is a good compromise in many situations.

More than 1 router at any end will lead to a lot more problems than
anycast, including multicast and any stateful protocol (like TCP).

How many people run multicast, exactly? And that's precisely why multicast is a separate SAFI: so you can have different multicast and unicast routing.

As for TCP, it would be very useful if someone were to run the following experiment:

> That's not the point. If this is better without anycast than with it,
> then it should go on the "con" list for anycast.

People often confuse two separate technical things
here. One is the BGP anycast technique which allows
anycasting to be used in an IPv4 network, and the
other is the application of BGP anycasting to DNS
in an IPv4 network. It would be clearer if people
would prefix "anycast" with either BGP or DNS to make

There is also MSDP anycasting, which is both pretty cool and
close to best common practice for anyone running MSDP.

Regards
Marshall

Iljitsch van Beijnum wrote:

It doesn't really require that. Redundancy requires that the two links terminate on different routers at each end. Having one router at one end and two at the other is a good compromise in many situations.

OK, now I'm sure you don't actually do any engineering.

In 25+ years, I've not found that router failure was a major or even
interesting problem. Link failures are probably 80%. Upstream failures
are probably another 5%, about the same as staff fumblefingers, power
failures, and customer misconfiguration that somehow affects routing --
like the idiots with the 5 character password last week that got rooted
and swamped their link so badly that BGP dropped.

I've lived through "inverse multiplexing", and BONDING, etc, etc....

Sure, I've had routers that had to be rebooted every week to overcome
a slow memory leak. But you're not fixing that....

A redundant router should be where it would be doing some good -- on a
diverse link to another upstream.

Now, let's say that your path through router 2 is several hundred, or
maybe a few thousand, miles longer than your path through router 3. You
are, after all, arguing that the paths are different enough that the
packets are going to end up at different anycast hosts, which is
generally equivalent to going into another network via a different
exchange point.

Have you just come up with a way to overcome the speed of light, or are
you arguing that doing per-packet load balancing over paths with
differences in latency of tens or hundreds of milliseconds wouldn't result
in out-of-order packets?
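The numbers are easy to work out (a sketch, assuming roughly 200 km/ms propagation in fiber):

    # Extra one-way delay from extra path length, at ~2/3 c in fiber.
    KM_PER_MS = 200.0

    for extra_km in (500, 2000, 8000):
        print(f"{extra_km:5d} km longer path -> "
              f"~{extra_km / KM_PER_MS:5.1f} ms extra one-way delay")
    # Alternating packets across paths that differ by thousands of km
    # guarantees reordering on exactly that scale.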

-Steve

Also, be mindful of ECMP.

   http://www.isc.org/pubs/tn/isc-tn-2004-1.html
   http://www.isc.org/pubs/tn/isc-tn-2004-1.txt

Joe

i don't think iljitsch is in a position to teach an "anycast 101" class.

here's my evidence:

i don't think iljitsch is in a position to teach an "anycast 101" class.

If anyone feels they can do better, please step up...

here's my evidence:

note-- harald asked us to move this thread off of ietf@, so i've done that.
iljitsch added ietf@ back to the headers in his reply to me. i'm taking it
back off again. iljitsch, please leave it off, respecting harald's wishes.

Hey! I missed this one. I'm on dnsop but it's pretty low on my to-read list.

Unfortunately, your evidence contains its share of errors, so I'm not sure you should be teaching the class either.

... It's possible for bad things to happen if:

1. some DNS server is anycast (TLD servers are worse than roots because the
root zone is so small)
2. fragmented UDP packets or TCP are used as a transport
3. a network is built such that packets entering it through router X may
prefer a different external link towards a certain destination than packets
entering it through router Y
4. a customer of this network is connected to two different routers
5. the customer enables per packet load balancing

#1 and #2 are normal, even though fragmented udp isn't very common nowadays.
#3 is extremely common. #4 is normal for high-end customers. and #5 will
only affect customers whose ISP shares an IGP with the anycast -- in other
words, "other customers of the same ISP".

Nope. Consider:

          +-------+   +-------+
         /|ISPrtr1+---+ACinstA|
+------+/ +-------+   +-------+
|source|
+------+\ +-------+   +-------+
         \|ISPrtr2+---+ACinstB|
          +-------+   +-------+

Where the anycast instances exchange routing information using BGP.

If there is no special BGP configuration in effect, ISPrtr1 will prefer the path to anycast instance A and ISPrtr2 the path to instance B, because an external path takes precedence over a same-length path learned over iBGP.

The current Cisco multipath BGP rules require the whole AS path to be the same (which would be the case in this diagram if both anycast instances use the same AS number), but older IOSes only require the next hop AS and the path length to be the same.
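As a sketch, the relevant slice of the decision process (attribute values invented; only local-pref, AS path length, and the eBGP-over-iBGP rule are modeled):

    def best(routes):
        """Higher local-pref, then shorter AS path, then eBGP over iBGP."""
        return min(routes, key=lambda r: (-r["local_pref"],
                                          len(r["as_path"]),
                                          r["ibgp"]))

    seen_by_ISPrtr1 = [
        {"via": "ACinstA (direct eBGP)", "local_pref": 100,
         "as_path": [64500], "ibgp": False},
        {"via": "ACinstB (iBGP via ISPrtr2)", "local_pref": 100,
         "as_path": [64500], "ibgp": True},
    ]
    print("ISPrtr1 prefers:", best(seen_by_ISPrtr1)["via"])
    # Same length, same local-pref: the eBGP route to instance A wins.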

Now the question is: how do we deal with this? I don't think removing
anycast wholesale makes sense and/or is feasible. Same thing for declaring
per packet load balancing an evil practice.

as i said the other day, "all power tools can kill." if you turn on PPLB
and it hurts, then turn it off until you can read the manual or take a class
or talk to an expert. PPLB is a link bundling technology. if you turn it
on in a non-parallel-path situation, it will hurt you, so, "don't do that."

Yes, per-packet load balancing will cause reordering, and if that's an issue you shouldn't use it. But if, with per-packet load balancing, packets end up at two different hosts, that's not the fault of the people who invented per-packet load balancing or the people who turned it on, but the fault of the people giving the same address to two different hosts.

A better solution would be to give network operators something that
enables them to make sure load balancing doesn't happen for anycasted
destinations. A good way to do this would be having an "anycast" or
"don't load balance" community in BGP, or publication of a list of
ASes and/or prefixes that shouldn't be load balanced because the
destinations are anycast.
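For concreteness, such a knob might look like this on the receiving side (a sketch; the community value is invented purely for illustration):

    NO_LB_COMMUNITY = "64496:666"   # hypothetical "anycast, don't load balance"

    def paths_to_install(candidates):
        """candidates: equally preferred BGP paths, best first."""
        best = candidates[0]
        if NO_LB_COMMUNITY in best["communities"]:
            return [best]        # pin anycast destinations to a single path
        return candidates        # otherwise, normal multipath

    paths = [{"via": "peerA", "communities": [NO_LB_COMMUNITY]},
             {"via": "peerB", "communities": [NO_LB_COMMUNITY]}]
    print(paths_to_install(paths))   # only the first path gets installed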

since PPLB won't affect BGP (since BGP is not multipath by default), this is
not an issue.

If the uncommon network setup exists, and pplb is turned on, the problem can manifest itself. The fact that someone had to turn on a feature that's turned off by default is immaterial. (There is no BGP by default to begin with.)

and they would know that PPLB is basically a link bundling technology used
when all members of the PPLB group start and end in the same router-pair;

It doesn't make much sense to have multiple links terminate on the same
router at both ends, as then both of those routers become single points of
failure.

i don't even know what conversation we're in any more. why does it matter
whether they are single points of failure, if this is the configuration for
which PPLB was intended?

There is no requirement that all packets between two hosts follow the same path, so people who do per-packet load balancing have the IP architecture on their side, unlike those who implement anycast. A little less blaming the victim would be in order. (Well, if there are any victims, because all of this happening is pretty unlikely.)

> as i said the other day, "all power tools can kill." if you turn on
> PPLB and it hurts, then turn it off until you can read the manual or
> take a class or talk to an expert. PPLB is a link bundling technology.
> if you turn it on in a non-parallel-path situation, it will hurt you, so,
> "don't do that."

Yes, per-packet load balancing will cause reordering, and if that's an
issue you shouldn't use it. But if, with per-packet load balancing,
packets end up at two different hosts, that's not the fault of the people
who invented per-packet load balancing or the people who turned it on,
but the fault of the people giving the same address to two different
hosts.

since i already know that Iljitsch isn't listening, i'm not interested in
debating him further. i would be interested in hearing from anybody else
who thinks that turning on pplb in an eyeball-centric isp that has multiple
upstream paths is a reasonable thing to do, even if there were no anycast
services deployed anywhere in the world. at the moment i am completely
certain that turning on pplb would be an irrational act, and would have a
significant performance-dooming effect on a client population behind it,
and that the times when pplb would actually be useful and helpful are very
limited, and that anycast doesn't even enter into the reasons why doing as
Iljitsch paints would be a bad idea.

but my mind is open, if anyone can speak from experience on the matter.

since i already know that Iljitsch isn't listening, i'm not interested in
debating him further.

[...]

but my mind is open, if anyone can speak from experience on the matter.

Right.