Facility-wide DR/Continuity

Hi All,

I'm attempting to devise a method which will provide continuous operation of certain resources in the event of a disaster at a single facility.

The types of resources that need to be available in the event of a disaster are ecommerce applications and other business critical resources.

Some of the questions I keep running into are:

                - Should the additional sites be connected to the primary site (and/or the Internet directly)?
                - What is the best way to handle the routing? Obviously two devices cannot occupy the same IP address at the same time, so how do you provide that instant 'cut-over'? I could see using application load balancers to do this, but then what if the load balancers themselves fail?

Any advice from folks on list or off who have done similar work is greatly appreciated.

Thanks,
-Drew

Avoid 'cut-over' entirely - go active/active/etc., and use DNS-based GSLB for the various system elements.
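The active/active GSLB suggestion above boils down to health-check-driven DNS answers. A minimal sketch of that decision logic, with hypothetical site names and documentation-range addresses:

```python
# Sketch of DNS-based GSLB decision logic: answer A-record queries only
# with the addresses of sites whose health check currently passes.
# Site names and addresses are illustrative placeholders.

SITES = {
    "us-east": "198.51.100.10",
    "us-west": "203.0.113.10",
}

def healthy_sites(health):
    """Return addresses of sites whose health check currently passes."""
    return [addr for site, addr in SITES.items() if health.get(site)]

def dns_answer(health, ttl=30):
    """Build the answer set; a short TTL bounds the failover lag."""
    up = healthy_sites(health)
    # If every check fails, answer with all sites rather than nothing
    # (a common GSLB "fail open" choice).
    return {"ttl": ttl, "addresses": up or list(SITES.values())}

print(dns_answer({"us-east": True, "us-west": False}))
```

Because no site is ever a cold standby, there is no cut-over moment: a failed site simply drops out of the answer set.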

Drew,

IMO, as your availability target goes up (99%, 99.9%, 99.99% ... 100%), your
price to implement will go up exponentially. That being said, much of
this will depend on what your budget is to achieve your
desired availability.
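To make the "nines" concrete, here is a quick back-of-envelope calculation of the downtime each availability target permits per year:

```python
# Allowed downtime per year for each availability target, to make the
# cost-vs-nines tradeoff concrete.

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

def downtime_minutes(availability_pct):
    """Minutes of downtime per year permitted at a given availability."""
    return MINUTES_PER_YEAR * (100 - availability_pct) / 100

for pct in (99.0, 99.9, 99.99, 99.999):
    print(f"{pct}% -> {downtime_minutes(pct):,.2f} min/year")
```

Roughly: two nines allows about 3.7 days a year, while five nines allows about five minutes, which is why the price curve bends so sharply.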

On the low end, you can:

  - Fail over to backup servers using DNS (but this may not be instant).
    Also consider a replication solution if your resources need fresh data.

On the high end, you can:

  - Run a secondary mirror location, replicating data
  - Set up BGP, announcing your IP block out of both locations
  - Use a private link interconnecting your sites for your iBGP session
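The high-end BGP option can be sketched as a router configuration at each site announcing the same block. This is an illustrative IOS-style fragment only, not a verified configuration; the AS numbers, the upstream neighbor, the prefix (RFC 5737 documentation ranges), and the private-link addressing are all placeholders:

```
! Illustrative config for one site; the other site mirrors it.
router bgp 64500
 ! eBGP session to this site's upstream provider
 neighbor 192.0.2.1 remote-as 64501
 ! iBGP session to the other site, run over the private inter-site link
 neighbor 10.255.0.2 remote-as 64500
 neighbor 10.255.0.2 update-source Loopback0
 ! Announce the same block from both sites
 network 198.51.100.0 mask 255.255.255.0
```

With both sites originating the prefix, convergence after a site loss is driven by BGP withdrawal rather than by clients noticing a DNS change.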

> I'm attempting to devise a method which will provide continuous
> operation of certain resources in the event of a disaster at a single facility.

Drew,

If you can afford it, stretch the LAN across the facilities via fiber
and rebuild the critical services as a load-balanced active-active
cluster. Then a facility failure and a routine server failure are
identical and are handled by the load balancer. F5s if you like
commercial solutions, Linux LVS if you're partial to open source as I
am. Then make sure you have an Internet entry into each location with
BGP.
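The LVS approach can be sketched as a keepalived fragment with one real server in each facility behind a single virtual service. This is an illustrative configuration only; the virtual and real addresses, algorithm, and timeouts are placeholders:

```
# Illustrative keepalived (Linux LVS) fragment: one virtual service
# balanced across real servers in both facilities.
virtual_server 198.51.100.10 80 {
    delay_loop 6
    lb_algo wlc          # weighted least-connections
    lb_kind DR           # direct routing
    protocol TCP

    real_server 10.0.1.11 80 {   # server at facility A
        TCP_CHECK {
            connect_timeout 3
        }
    }
    real_server 10.0.2.11 80 {   # server at facility B
        TCP_CHECK {
            connect_timeout 3
        }
    }
}
```

The health check is what makes a facility failure look like a routine server failure: a dead real server simply stops receiving connections.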

BTW, this tends to make maintenance easier too. Just remove servers
from the cluster when you need to work on them and add them back in
when you're done. Really reduces the off-hours maintenance windows.

This is how I did it when I worked at the DNC and it worked flawlessly.

If you can't afford the fiber or need to put the DR site too far away
for fiber to be practical, you can still build a network which
virtualizes your LAN. However, you then have to worry about issues
with the broadcast domain and traffic demand between the clustered
servers over the slower WAN.

It's doable. I've done it with VPNs over Internet T1s. But you'd better
have your developers on board early, and provide them with a
simulated environment so that they can get used to the idea of having
little bandwidth between the clustered servers.

> - Failover to backup servers using DNS (but may not be instant)

If your budget is more than a shoestring, save yourself some grief and
don't go down this road. Even with the TTLs set to 5 minutes, it takes
hours to get to two-nines recovery from a DNS change and months to get
to five-nines. The DNS protocol is designed to be able to recover
quickly, but the applications which use it, like web browsers, aren't.
Google "DNS pinning."
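A toy model makes the hours-to-recover point concrete. Assume (illustratively) that some fraction of clients honor the TTL and re-resolve within one TTL interval, while the rest, pinned browsers and misbehaving caches, drift over on a much longer timescale; all of the parameters below are assumptions for the sketch, not measurements:

```python
import math

# Toy model of client recovery after a DNS failover. A fraction
# `honor_ttl` of clients re-resolve within one TTL; the remainder
# (pinning browsers, broken caches) decay exponentially with mean
# `pinned_mean` minutes. All parameters are illustrative.

def recovered_fraction(t_minutes, ttl=5, honor_ttl=0.90, pinned_mean=240):
    """Fraction of clients on the new address t minutes after the change."""
    fast = honor_ttl if t_minutes >= ttl else honor_ttl * t_minutes / ttl
    slow = (1 - honor_ttl) * (1 - math.exp(-t_minutes / pinned_mean))
    return fast + slow

for t in (5, 60, 720):
    print(f"t={t:>4} min: {recovered_fraction(t):.4%} recovered")
```

Under these assumptions, 90% of clients recover within the 5-minute TTL, but two-nines recovery takes roughly nine hours, which is the qualitative behavior described above.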

Regards,
Bill Herrin

<snip>

> If you can't afford the fiber or need to put the DR site too far away
> for fiber to be practical, you can still build a network which
> virtualizes your LAN. However, you then have to worry about issues
> with the broadcast domain and traffic demand between the clustered
> servers over the slower WAN.
>
> It's doable. I've done it with VPNs over Internet T1's. But you better
> have your developers on board early and provide them with a
> simulated environment so that they can get used to the idea of having
> little bandwidth between the clustered servers.

In most cases, the fiber is affordable (a certain bandwidth provider out
there offers Layer 2 point to point anywhere on their network for very low
four digit prices). We recently put into place an active/active environment
with one end point in the US and the other end point in Amsterdam, and both
sides see the other as if they were on the same physical LAN segment. I've
found that, like you said, you *must* have the application developers
onboard early, as you can only do so much at the network level without the
app being aware.

-brandon

--

Brandon Galbraith
Mobile: 630.400.6992
FNAL: 630.840.2141

In an environment where a DR site is deemed critical, it is my experience
that critical business applications also have a test or development
environment associated with the production one. If you look at the problem
this way, then a DR site equipped with the test/development systems, with one
"instance" of production always available, would only be challenging in
terms of data sync. Various SAN solutions would resolve that (SAN syncing
over WAN/MAN/etc.). Virtualization of critical systems may also add some
benefits here: clone the critical VMs in the DR site, and in conjunction with
the storage being available, you'll be able to bring up these machines in
no time. Just make sure you have some sort of L2 available, maybe EoS, or
tunneling over an L3 connection; there's plenty of information out there if
you query for virtual machine mobility and inter-site connectivity.

Voice has to be considered as well. For PSTN, make arrangements with your
provider to re-route toll-free (8xx) numbers in case of disaster. VoIP may
add some extra capabilities in terms of reachability over the Internet, in
case your DR site cannot accommodate everyone. Customer service people, for
example, who are critical for interfacing with customers in a disaster (no
information means bigger losses and perception issues), have to be able to
connect even from home.

As far as an "immediate" switch from one site to the other, DNS is the
primary concern (unless some wise people have hardcoded IPs all over), but
there are other issues people tend to forget at the core of some clients.
Take the Oracle "fat" client and its TNS names: I've seen those associated
with IPs instead of host names, etc.

Disclaimer: the above is just one of many aspects. I have seen the DNS
comments already, so I won't repeat those.

HTH,

I would advise strongly against stretching a layer-2 topology across sites, if at all possible. Far better to go for layer-3 separation, work with the app/database/sysadmin folks to avoid dependence on direct adjacencies, and gain the topological freedom of routing.

    > Should the additional sites be connected to the primary site
    > (and/or the Internet directly)?

Yes, because any out-of-band synchronization method between the servers at
the production site and the servers at the DR site is likely to be more
difficult to manage. You could do UUCP over a serial line, but...

    > What is the best way to handle the routing? Obviously two devices
    > cannot occupy the same IP address at the same time, so how do you
    > provide that instant 'cut-over'?

This is one of the only instances in which I like NATs. Set up a NAT
between the two sites to do static 1-to-1 mapping of each site into a
different range for the other, so that the DR servers have the same IP
addresses as their production masters, but have a different IP address to
synchronize with.
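Bill's 1-to-1 mapping is just an offset translation between two equal-sized ranges. A minimal sketch using Python's standard `ipaddress` module, with documentation-range prefixes as placeholders:

```python
import ipaddress

# Sketch of static 1-to-1 NAT between sites: each site sees the other's
# servers through a translated range, so a DR server can keep the same
# "real" address as its production master yet be reached for sync at a
# distinct translated address. Ranges are illustrative.

PROD_NET = ipaddress.ip_network("192.0.2.0/24")               # production site
PROD_AS_SEEN_FROM_DR = ipaddress.ip_network("198.51.100.0/24")  # translated view

def translate(addr, src_net, dst_net):
    """Map an address 1-to-1 from src_net to the same offset in dst_net."""
    offset = int(ipaddress.ip_address(addr)) - int(src_net.network_address)
    return str(ipaddress.ip_address(int(dst_net.network_address) + offset))

# The DR copy of production server .10 synchronizes with the translated
# address, avoiding a conflict with its own identical "real" address.
print(translate("192.0.2.10", PROD_NET, PROD_AS_SEEN_FROM_DR))
```

Because the mapping is purely positional, the NAT device needs no per-host state, just the two prefixes.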

                                -Bill

Or you use RFC1918 address space at each location, and NAT each side between
public anycasted space and your private IP space. Prevents internal IP
conflicts, having to deal with site to site NAT, etc.

-brandon

With all due respect, both of these posited choices are quite ugly and tend to lead to huge operational difficulties, susceptibility to DDoS, etc. Definitely not recommended except as a last resort in a difficult situation, IMHO.

I wouldn't go quite so far as to say that they have security implications,
but I definitely agree that these are solutions of last resort, and that
any live load-balanced solution is infinitely preferable to a stand-by
solution. Which, IMHO, is unlikely to ever work as hoped for. I was just
answering the question at hand, rather than the meta-question of whether
the question being asked was the right question. :-)

                                -Bill