Reliable Cloud host?

Does anyone have any recommendation for a reliable cloud host?

We require 1 or 2 very small virtual hosts to host some remote services to serve as backup to our main datacenter. One of these services is a DNS server, so it is important that it is up all the time.

We have been using Rackspace Cloud Servers. We just realized that they have absolutely no redundancy or failover after experiencing an outage that lasted more than 6 hours yesterday. I am appalled that they would offer something called "cloud" without having any failover at all.

Basic requirements:

1. Full redundancy with instant failover to other hypervisor hosts upon hardware failure (I thought this was a given!)
2. Actual support (with a phone number I can call)
3. Reasonable pricing (no, $800/month is not reasonable when I need a tiny 256 MB RAM server with <1 GB/mo of data transfers)

thanks,
-Randy

Godaddy? Servint.com? Amazon EC2?

-mike

This is actually a much harder problem to solve than it sounds, and gets progressively harder depending on what you mean by "failover".

At the very least, having two physical hosts capable of running your VM requires that your VM be stored on some kind of SAN (usually iSCSI based) storage system. Otherwise, two hosts have no way of accessing your VM's data if one were to die. This makes things an order of magnitude or higher more expensive.

But then all you've really done is moved your single point of failure to the SAN. Small SANs aren't economical, so you end up having tons of customers on one SAN. If it dies, tons of VMs are suddenly down. So you now need a redundant SAN capable of live-mirroring everyone's data. These aren't cheap either, and add a lot of complexity to things. (How to handle failover if it died mid-write, who has the most recent data after a total blackout, etc.)

And this is really just saying "If hardware fails, I want my VM to reboot on another host." If what you mean by high availability is "even if a physical host fails, I don't want a second of downtime, my VM can't reboot," you want something like VMware's ESXi High Availability modules, where your VM is actually running on two hosts at once, in lock-step, so if one fails the other takes over transparently. Licenses for this are ridiculously expensive and require some reasonably complex networking and storage systems.

And I still haven't touched on having to make sure both physical hosts capable of running your VM are on totally independent switches/power/etc, the SAN has multiple interfaces so it's not all going through one switch, etc.

I also haven't run into anyone deploying a high-availability/redundant system where they haven't accidentally ended up with a split-brain scenario (network isolation causes the backup node to think it's live, when the primary is still running). Carefully synchronizing things to prevent this is hard and fragile.

I'm not saying you can't have this feature, but it's not typical in "reasonably priced" cloud services, and nearly unheard of as something used automatically. Just moving your virtual machine from using local storage to iSCSI-backed storage drastically increases disk latency and caps the whole physical host's disk speed at 1 Gbps (10GbE adapters haven't seen much deployment among low-priced VM providers yet). Any provider who automatically provisions a virtual machine this way will get complaints that their servers are slow, which is true compared to someone selling VMs that use local storage. The "running your VM on two hosts at once" system has such a performance penalty, and costs so much in licensing, that you really need to NEED it for it not to be a ridiculous waste of resources.

Amazon comes sorta close to this, in that their storage is mostly-totally separate from the hosts running your code. But they have had failures knock out access to your storage, so it's still not where I think you're saying you want to be.

The moral of the story is that just because it's "in the cloud", it doesn't gain higher reliability unless you're specifically taking steps to ensure it. Most people solve this by taking things that are already distributable (like DNS) and setting up multiple DNS servers in different places - that's where all this "cloud stuff" really shines.
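
As a concrete illustration of the multiple-servers approach, here is a minimal sketch (assuming the third-party dnspython package; the zone name is a placeholder) that checks whether every advertised NS for a zone actually answers on its own:

import dns.resolver

ZONE = "example.com"  # placeholder zone

def check_nameservers(zone):
    for ns in dns.resolver.resolve(zone, "NS"):
        ns_name = str(ns.target)
        try:
            ns_addr = dns.resolver.resolve(ns_name, "A")[0].address
            r = dns.resolver.Resolver(configure=False)
            r.nameservers = [ns_addr]          # ask this one server directly
            answer = r.resolve(zone, "SOA", lifetime=3)
            print("%s (%s): OK, serial %d" % (ns_name, ns_addr, answer[0].serial))
        except Exception as exc:               # timeout, SERVFAIL, refused, ...
            print("%s: FAILED (%s)" % (ns_name, exc))

check_nameservers(ZONE)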

(please no stories about how you were able to make a redundant virtual machine run using 5 year old servers in your basement, I'm talking about something that's supportable on a provider scale, and isn't adding more single points of failure)

-- Kevin

With what some of those cloud providers are charging per-instance,
automatic hot standby is really not a given, but that could just be me :)

We use Amazon and are happy with them. With them, you would have to set
up your own failover operation but it's absolutely doable. They give you
all the tools you need (load balancing, EBS, etc) but it's up to you to
make it happen. We use their load-balancing feature with HTTP but it
looks like you could do it with any service (DNS, etc). As a result,
when they had their last huge outage (a whole datacenter), we lost some
of our instances but our customer-facing services remained available.
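
As a rough, hedged sketch of what "set it up yourself" can look like: the snippet below assumes the modern boto3 SDK and placeholder AMI/instance-type values, and simply launches one small instance in each of two Availability Zones so a single-AZ outage can't take out both:

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

for az in ("us-east-1a", "us-east-1b"):
    resp = ec2.run_instances(
        ImageId="ami-0123456789abcdef0",        # placeholder AMI
        InstanceType="t3.micro",                # placeholder size
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": az},     # spread across zones
    )
    print(az, resp["Instances"][0]["InstanceId"])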

Their support options are pretty good but you have to shell out for a
package to get them on the phone. Pricing for that is tied to how much
of their resources you are using.

-- Ben

We have found it effective at least for things like DNS and MX-backup to simply swap some VPS and/or physical colo with another ISP outside our geographic area. Both protocols are designed for that kind of redundancy. Definitely has limitations, but is also probably the cheapest solution.

- Mike

> We have been using Rackspace Cloud Servers. We just realized that
> they have absolutely no redundancy or failover after experiencing
> an outage that lasted more than 6 hours yesterday. I am appalled
> that they would offer something called "cloud" without having any
> failover at all.
>
> Basic requirements:
>
> 1. Full redundancy with instant failover to other hypervisor hosts
> upon hardware failure (I thought this was a given!)

> This is actually a much harder problem to solve than it sounds, and
> gets progressively harder depending on what you mean by "failover".
>
> At the very least, having two physical hosts capable of running your
> VM requires that your VM be stored on some kind of SAN (usually
> iSCSI based) storage system. Otherwise, two hosts have no way of
> accessing your VM's data if one were to die. This makes things an
> order of magnitude or higher more expensive.

This does not have to be true at all. Even having a fully fault-tolerant SAN in addition to spare servers should not cost much more than having separate RAID arrays inside each of the servers, when you are talking about 1,000s of servers (which Rackspace certainly has).

> But then all you've really done is moved your single point of failure
> to the SAN. Small SANs aren't economical, so you end up having tons
> of customers on one SAN. If it dies, tons of VMs are suddenly down.
> So you now need a redundant SAN capable of live-mirroring everyone's
> data. These aren't cheap either, and add a lot of complexity to
> things. (How to handle failover if it died mid-write, who has the
> most recent data after a total blackout, etc.)

NetApp. HA heads. Done. Add a DR site with replication, and you can survive a site failure, and be back up and running in less than an hour. I would think that the big datacenter guys already have this type of thing set up.

> And this is really just saying "If hardware fails, I want my VM to
> reboot on another host." If what you mean by high availability is
> "even if a physical host fails, I don't want a second of downtime,
> my VM can't reboot," you want something like VMware's ESXi High
> Availability modules, where your VM is actually running on two hosts
> at once, in lock-step, so if one fails the other takes over
> transparently. Licenses for this are ridiculously expensive and
> require some reasonably complex networking and storage systems.

I don't need that kind of HA, and understand that it is not going to be available. 15 minutes of downtime is fine. 6 hours is completely unacceptable, and it is false advertising to say you have a "Cloud" service and then to find out that you could have *indefinite* downtime.

> And I still haven't touched on having to make sure both physical
> hosts capable of running your VM are on totally independent
> switches/power/etc, the SAN has multiple interfaces so it's not all
> going through one switch, etc.

That is all just basic datacenter design. I have that level of redundancy with my extremely small datacenter. I only have 2 hypervisor hosts running around 12 VMs.

> I also haven't run into anyone deploying a
> high-availability/redundant system where they haven't accidentally
> ended up with a split-brain scenario (network isolation causes the
> backup node to think it's live, when the primary is still running).
> Carefully synchronizing things to prevent this is hard and fragile.

I've never had it happen. Not if you properly set up failover (look at STONITH).
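
For illustration only, the rule STONITH enforces boils down to a few lines. This is a toy sketch, not any real cluster manager's API; fence_primary and promote_self are hypothetical placeholders. The point is that a standby never promotes itself just because it lost contact; it promotes only after the old primary has provably been fenced:

def fence_primary():
    """Tell the fencing device (IPMI, switched PDU, ...) to power off the
    old primary. Must return True only once the device confirms it is down."""
    raise NotImplementedError("site-specific fencing goes here")

def promote_self():
    """Take over the floating IP / start answering as primary."""
    raise NotImplementedError("site-specific promotion goes here")

def on_primary_unreachable():
    # Losing heartbeats might only mean *we* are the isolated node
    # (the split-brain case), so never promote on silence alone.
    if fence_primary():
        promote_self()
    else:
        print("fencing unconfirmed: staying standby and paging a human")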

> I'm not saying you can't have this feature, but it's not typical in
> "reasonably priced" cloud services, and nearly unheard of as
> something used automatically. Just moving your virtual machine from
> using local storage to iSCSI-backed storage drastically increases
> disk latency and caps the whole physical host's disk speed at 1 Gbps

No, it doesn't. Haven't you heard of multipath? Using 4 x 1 Gb/s paths gives me about the same I/O as a local RAID array, with the added feature of failover if a link drops. Four 1 Gb/s ports are ridiculously cheap. And 10GbE is not nearly as expensive as it used to be.
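
Rough numbers behind that claim (decimal units, ignoring iSCSI/TCP overhead):

GBPS = 10 ** 9                       # bits per second

one_path   = 1 * GBPS / 8            # ~125 MB/s over a single 1 Gb/s link
four_paths = 4 * GBPS / 8            # ~500 MB/s aggregated by multipath

print("1 x 1 Gb/s  ~ %.0f MB/s" % (one_path / 1e6))
print("4 x 1 Gb/s  ~ %.0f MB/s" % (four_paths / 1e6))   # in the same ballpark
                                                         # as a small local RAID array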

> (10GbE adapters haven't seen much deployment among low-priced VM
> providers yet). Any provider who automatically provisions a virtual
> machine this way will get complaints that their servers are slow,
> which is true compared to someone selling VMs that use local
> storage. The "running your VM on two hosts at once" system has such
> a performance penalty, and costs so much in licensing, that you
> really need to NEED it for it not to be a ridiculous waste of
> resources.

I don't follow what you mean by "running the VM on two hosts." I just want my single virtual machine to be booted up on a spare hypervisor if there is a hypervisor failure. There are no license costs for that, and it should not have any performance implications at all.

> Amazon comes sorta close to this, in that their storage is
> mostly-totally separate from the hosts running your code. But they
> have had failures knock out access to your storage, so it's still
> not where I think you're saying you want to be.
>
> The moral of the story is that just because it's "in the cloud", it
> doesn't gain higher reliability unless you're specifically taking
> steps to ensure it. Most people solve this by taking things that are
> already distributable (like DNS) and setting up multiple DNS servers
> in different places - that's where all this "cloud stuff" really
> shines.

The funky problem with DNS specifically is that all the servers need to be up, or someone will get bad answers. Not having a preference system like MX records have hurts in this regard. Anycast fixes this to a certain degree, but anycast is another challenge for these hosting providers.
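
A quick way to see the missing-preference point (assuming dnspython, with a placeholder domain that actually publishes MX records): MX rdata carries an explicit preference value, NS rdata does not, so resolvers treat every listed nameserver as equally good:

import dns.resolver

DOMAIN = "example.org"   # placeholder; substitute a domain that publishes MX records

for mx in dns.resolver.resolve(DOMAIN, "MX"):
    print("MX  preference=%-3d %s" % (mx.preference, mx.exchange))

for ns in dns.resolver.resolve(DOMAIN, "NS"):
    print("NS  (no preference field)  %s" % ns.target)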

> (please no stories about how you were able to make a redundant
> virtual machine run using 5 year old servers in your basement, I'm
> talking about something that's supportable on a provider scale, and
> isn't adding more single points of failure)

I have actually done this :)

But, I also have a fully redundant system at our main office using very few components. We also have a DR site, connected with fiber. The challenge we have is if we run into routing issues upstream that are beyond our control. Hence the need to have a few things also hosted externally geographically and routing-wise.

Um. You and I apparently work in different clouds.

In my world, the SLAs I have agreed to state, roughly, that uptime is not guaranteed, nor is data recoverability. They suggest that that sort of thing is -my- problem to engineer and architect around.

I don't use Rackspace's cloud solution - but I haven't seen anything to suggest that they advertise their service any differently.

The "cloud" provides flexibility and rapid deployment at the expense of hands-on control and reliability (and SLAs).

Perhaps you forgot to read the SLA? Or can you show us where someone defines "Cloud" as "highly available" and "without indefinite downtime"?

> This does not have to be true at all. Even having a fully fault-tolerant SAN in addition to spare servers should not cost much more than having separate RAID arrays inside each of the servers, when you are talking about 1,000s of servers (which Rackspace certainly has).

When is your cloud offering going to be available to the public?

> I don't need that kind of HA, and understand that it is not going to be available. 15 minutes of downtime is fine. 6 hours is completely unacceptable, and it is false advertising to say you have a "Cloud" service and then to find out that you could have *indefinite* downtime.

I think you're assuming "cloud" means things that the provider does not. To me, "cloud" just means a VPS that I can create/destroy quickly whenever I feel like it, without any interaction from the provider's people, i.e., a few mouse clicks or an API call can provision a 10 GB CentOS VM with 256 MB of RAM. It's up and running before I could locate a CentOS install CD, and if I don't like it, a few clicks or an API call deletes it, reprovisions it, exchanges it for an Ubuntu server, etc. Cloud doesn't mean that if the node my VM(s) are on dies or crashes, my VMs boot up on an alternate node. That would certainly be a nice feature, but that's just a form of redundancy in a cloud... not a defining attribute of cloud.

> The funky problem with DNS specifically, is that all the servers need to be up, or someone will get bad answers. Not having a preference system,

DNS "handles" down servers. How would one of your DNS servers being down give someone bad answers? It won't give any answers, and another server will be queried. Or do you mean if storage "goes away", but your DNS server is still running, it'll either give nxdomain or stale data...depending on whether it had the data in memory or storage went away and updates began failing because of it?

Since it is the weekend, I can't resist writing down a little equation:

Marketing(cloud) <> Technology(cloud)

For some values of "cloud" perhaps?

p.s. tongue firmly in cheek

- Tony

Well, indeed, that is a valid point. All "cloud" means to me is that there is some abstracted instance of x and that it does not always relate to a particular physical device; indeed, it may well be spread around a few physical devices.

I don't think there is any implied magic "redundancy, automatic failover, move your instance to another bit of metal if something breaks" in there unless that's specifically stated.

caveat emptor

Hello,

> Does anyone have any recommendation for a reliable cloud host?
>
> We require 1 or 2 very small virtual hosts to host some remote services to serve as backup to our main datacenter. One of these services is a DNS server, so it is important that it is up all the time.
>
> We have been using Rackspace Cloud Servers. We just realized that they have absolutely no redundancy or failover after experiencing an outage that lasted more than 6 hours yesterday. I am appalled that they would offer something called "cloud" without having any failover at all.
>
> Basic requirements:
>
> 1. Full redundancy with instant failover to other hypervisor hosts upon hardware failure (I thought this was a given!)
> 2. Actual support (with a phone number I can call)
> 3. Reasonable pricing (no, $800/month is not reasonable when I need a tiny 256 MB RAM server with <1 GB/mo of data transfers)

Well, as everyone has been saying, unfortunately with "infrastructure"
clouds you have to engineer your setup to their standards to get
failover.

For example, Amazon (as mentioned in the thread) give a 99.95% uptime
SLA *if* you set up failover yourself across more than one
"Availability Zone" within a region. Details are at

and http://blog.rightscale.com/2008/03/26/setting-up-a-fault-tolerant-site-using-amazons-availability-zones/
(though clearer, this one is a bit of an advert). As mentioned, with
Amazon you can use support if you pay for it; it's not included as
standard.

If you fancy some help, though, people like RightScale sound like
exactly what you are after to make management much simpler for you,
but pricing for services like that can be a little high for small
setups, though they do have a free edition that may be suitable.

You can get the same kind of 99.95% SLA from other providers if you
follow their deployment guidelines regarding their type of "zones".
Microsoft will do it for not too much
(http://www.windowsazure.com/en-us/support/sla/), include online and
telephone support in the price, and are in the process of making Red
Hat Linux available.

But let's not forget simply buying the software as a service is also
an option, where failover becomes Someone Else's Problem. For DNS,
EasyDNS are rather good and not too expensive, and you can get a 100%
uptime guarantee if you want. A review of them regarding availability
is The Register's "When a DNS outage isn't an outrage".

Do let us know who you end up picking and how it goes.

Alex

Linode.com is not cloud based but they offer IP failover between VPS
instances at no additional charge. Their pricing is excellent, I have
had no downtime issues with them in 3+ years with 3 different
customers using them, and they have nice OOB and programmatic API
access for controlling VPS instances as well.

Max

Pardon the weird question:

Is the DNS service authoritative or recursive? If auth, you can solve this a few ways, either by giving the DNS name people point to multiple AAAA (and A) records pointing at a diverse set of instances. DNS is designed to work around a host being down. Same goes for MX and several other services. While it may make the service slightly slower, it's certainly not the end of the world.
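
A rough sketch of sanity-checking that kind of roll-your-own diversity (assuming dnspython; the hostname and port are placeholders): resolve every A and AAAA record behind the name and verify that each address still accepts a TCP connection on the service port:

import socket
import dns.resolver

NAME, PORT = "ns.example.net", 53    # placeholders

addresses = []
for rrtype in ("A", "AAAA"):
    try:
        addresses += [r.address for r in dns.resolver.resolve(NAME, rrtype)]
    except dns.resolver.NoAnswer:
        pass                         # no records of this type

for addr in addresses:
    try:
        socket.create_connection((addr, PORT), timeout=3).close()
        print("%s: reachable" % addr)
    except OSError as exc:
        print("%s: UNREACHABLE (%s)" % (addr, exc))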

Taking a mesh of services from Rackspace, EC2, The Planet, or any other number of hosting providers will allow you to roll your own.

The other solution is to go to a professional DNS service provider, e.g.: Dyn, Verisign, EveryDNS or NeuStar.

While you can run your own infrastructure, the barrier to operating it properly, to doing it "right", gets a bit higher each year. I was recently shown an attack graph of a ~200 Gb/s attack against a DNS server. *ouch*.

Sometimes being professional is knowing when to say "I can't do this justice myself, perhaps it's better/easier/cheaper to pay someone to do it right".

- Jared

(Disclosure: I work for one of the above named companies, but not in a capacity related to anything in this email).

> [...] For DNS, EasyDNS are rather good and not too expensive, and you
> can get a 100% uptime guarantee if you want. A review of them regarding
> availability is The Register's "When a DNS outage isn't an outrage".

I have been a very satisfied EasyDNS customer for about a decade and
concur with the article. Nothing is perfect, but the rapid response and
support I've received have always been top-notch.

> Do let us know who you end up picking and how it goes.

Indeed. "Cloud" outside of references to mists and objects in the sky is a
completely meaningless term for operators. In fact, it has made it harder
to differentiate between services (which I'm sure is the point).

As an operator (knowing how things can be subject to accelerated roll-out
when $business feels they are missing out), I wonder if a lot of these
"cloud" service bumps-in-the-road aren't just a symptom of not being fully
baked in.

~JasonG

> 1. Full redundancy with instant failover to other hypervisor hosts
> upon hardware failure (I thought this was a given!)

> This is actually a much harder problem to solve than it sounds, and
> gets progressively harder depending on what you mean by "failover".
>
> At the very least, having two physical hosts capable of running your
> VM requires that your VM be stored on some kind of SAN (usually
> iSCSI based) storage system. Otherwise, two hosts have no way of
> accessing your VM's data if one were to die. This makes things an
> order of magnitude or higher more expensive.

> This does not have to be true at all. Even having a fully fault-tolerant
> SAN in addition to spare servers should not cost much more than
> having separate RAID arrays inside each of the servers, when you
> are talking about 1,000s of servers (which Rackspace certainly has)

Randy,

You're kidding, right?

SAN storage costs the better part of an order of magnitude more than
server storage, which itself is several times more expensive than
workstation storage. That's before you duplicate the SAN and set up
the replication process so that cabinet and room level failures don't
take you out.

DR sites then create a ferocious (read: expensive) bandwidth
challenge. Data can't flush from the primary SAN's write cache until
the DR SAN acknowledges receipt. If you don't have enough bandwidth to
keep up under the heaviest daily loads, the cache quickly fills and
the writes block.
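
Some made-up numbers to illustrate the cache-fill problem: if peak writes exceed what the DR link can carry, the difference accumulates in the primary's write cache until writes block:

peak_write_MBps = 200          # hypothetical peak write load on the primary SAN
dr_link_MBps    = 1000 / 8.0   # a 1 Gb/s replication link ~ 125 MB/s
cache_MB        = 16 * 1024    # hypothetical 16 GB write cache

backlog_MBps = peak_write_MBps - dr_link_MBps     # cache fills at this rate
minutes_until_full = cache_MB / backlog_MBps / 60

print("cache fills at %.0f MB/s; writes block after ~%.1f minutes of peak load"
      % (backlog_MBps, minutes_until_full))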

I maintain 50ish VMs with about 30 different providers at the moment.
Not one of them attempts to do anything like what you describe.

> NetApp. HA heads. Done. Add a DR site with replication,
> and you can survive a site failure, and be back up and
> running in less than an hour. I would think that the big
> datacenter guys already have this type of thing set up.

That's expensive, and VMs are sold primarily on price. If you want high
reliability, you start with a dedicated colo server. Customers who
want DR in a VM environment buy two VMs and build data replication at
the app layer.

> Linode.com is not cloud based but they offer IP failover between VPS
> instances at no additional charge. Their pricing is excellent, I have
> had no downtime issues with them in 3+ years with 3 different
> customers using them, and they have nice OOB and programmatic API
> access for controlling VPS instances as well.

Hi Max,

I have had superb results from Linode and highly recommend them.
However, they're facilitating application level failover not keeping
your VM magically alive. And:

http://library.linode.com/linux-ha/ip-failover-heartbeat-pacemaker-ubuntu-10.04

"Both Linodes must reside in the same datacenter for IP failover"

So they don't support a full DR capability even if you're smart at the
app level.

> [...] For DNS, EasyDNS are rather good and not too expensive, and you
> can get a 100% uptime guarantee if you want. A review of them regarding
> availability is The Register's "When a DNS outage isn't an outrage".
>
> I have been a very satisfied EasyDNS customer for about a decade and
> concur with the article. Nothing is perfect, but the rapid response and
> support I've received have always been top-notch.

I have been a satisfied DNS Made Easy customer for many years.

Note: I am also an employee of DNS Made Easy. I was a customer for
years before I became an employee.

> Do let us know who you end up picking and how it goes.

Indeed. "Cloud" outside of references to mists and objects in the sky is a
completely meaningless term for operators. In fact, it has made it harder
to differentiate between services (which I'm sure is the point).

As an operator (knowing how things can be subject to accelerated roll-out
when $business feels they are missing out), I wonder if a lot of these
"cloud" service bumps-in-the-road aren't just a symptom of not being fully
baked in.

It depends on what you mean by "bumps-in-the-road"...

If you mean issues experienced by customers of cloud service providers,
then the most common issues are a symptom of not implementing redundancy
(anticipating failure) in their usage of the platform. There are a
whole lot of folks who believe that they can buy an instance from Vendor
=~ /.*cloud.*/ and all of their DR worries will magically be "taken care
of" by the platform. That isn't the case.

Amazon is usually pretty good at providing RFOs after issues. All of
their RFOs (that I have seen) include pointers to the Amazon redundancy
configuration documents that the affected customers did not follow
(which is why a platform issue turned into an outage for them).

DR in using cloud services is the same as DR has always been - look at
all potential failures and then implement redundancy where the
cost/benefit works out in favor of the redundancy. Document, test,
rinse, lather, repeat.

Rightscale and other services like it provide tools to help.

-DMM

If you want a system with 0 loss and 0 delay, start building your private network.

I never claimed your responses would be perfect, but it will certainly work well enough to avoid major problems. Or you can pay someone to do it for you. I'm not sure what a hosted DNS solution costs, and I'm geeky and run my own DNS on beta/RC-quality software as well ;).

What I do know is that my domain hasn't disappeared from the net wholesale as the name servers are "diverse-enough".

Is DNS performance important? Sure. Should everyone set their TTL to 30? No. Reaching a high percentage of the internet doesn't require such a high SLA. Note, I didn't say reaching the top sites. While super-old, http://www.zooknic.com/Domains/counts.html says > 111m named sites in a few gTLDs. I'm sure there are better stats, but most of them don't need the same DNS infrastructure that Google, Bing, Facebook, etc. require.

If your DNS fits on a VM in someone else's "cloud", you likely won't notice the difference. A few extra NS records will likely do the right thing and go unnoticed.

- Jared

> Pardon the weird question:
>
> Is the DNS service authoritative or recursive? If auth, you can
> solve this a few ways, either by giving the DNS name people point to
> multiple AAAA (and A) records pointing at a diverse set of
> instances.

Authoritative. But it's also not the only thing we are running that needs some geographic and route diversity.

> DNS is designed to work around a host being down. Same
> goes for MX and several other services. While it may make the
> service slightly slower, it's certainly not the end of the world.

Oh, how I wish this were true in practice. If I had a dollar for every time we had serious issues because one of a few authoritative DNS servers was not responding... OK, I wouldn't be rich, but this happens all the time. Caching servers out on the net get a "non-answer" because the server they chose to ask was down, and they cache that. They shouldn't do that, but they do, and there's nothing that can be done about it.

-Randy

No, actually, it won't. In practice, most end user applications
disregard the DNS TTL.

In some cases this is because of carelessness: The application does a
gethostbyname once when it starts, grabs the first IP address in the
list and retains it indefinitely. The gethostbyname function doesn't
even pass the TTL to the application. Ntpd is/used to be one of the
notable offenders, continuing to poll the dead address for years after
the server moved.
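
A small illustration of the carelessness problem (dnspython assumed for the second lookup; the hostname is a placeholder): gethostbyname() hands back a bare address with no TTL attached, so an application that calls it once at startup has nothing telling it when to look again, whereas an explicit re-resolve at least sees the zone's TTL:

import socket
import dns.resolver

NAME = "www.example.com"       # placeholder

addr = socket.gethostbyname(NAME)          # a bare string; no TTL comes back
print("gethostbyname: %s (nothing tells the app when to look again)" % addr)

answer = dns.resolver.resolve(NAME, "A")
print("re-resolve: %s, ttl=%ds (this is what honouring the TTL requires)"
      % ([r.address for r in answer], answer.rrset.ttl))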

In other cases disregarding the TTL was a deliberate design decision.
Web browser DNS Pinning is an example of this. All modern web browsers
implement a form of DNS Pinning where they refuse to try an alternate
IP address for a web server on subsequent TCP connections after making
the first successful contact. This plugs a javascript security leak
where a client side application could be made to scan the interior of
its user's firewall by switching the DNS back and forth between local
and remote addresses. In some cases this stuck-address behavior can
persist until the browser is completely closed and reopened, possibly
when the PC is rebooted weeks later.

The net result is that when you switch the IP address of your server,
a percentage of your users (declining over time) will be unable to
access it for hours, days, weeks or even years regardless of the DNS
TTL setting.

This isn't theoretical, by the way. I had to renumber a major web site
once. 1 hour TTL at the beginning of the process. Three month overlap
in which both addresses were online and the DNS pointed to the new
one. At the end of the three months a fraction of a percent of the
*real user traffic* was _still_ coming in the obsolete address. Using
the correct name in the Host: header, so the user wasn't deliberately
picking the IP address.

If you want DR that *works*, reroute the IP address.

Regards,
Bill Herrin

> 1. Full redundancy with instant failover to other hypervisor hosts
> upon hardware failure (I thought this was a given!)

> This is actually a much harder problem to solve than it sounds, and
> gets progressively harder depending on what you mean by "failover".
>
> At the very least, having two physical hosts capable of running your
> VM requires that your VM be stored on some kind of SAN (usually
> iSCSI based) storage system. Otherwise, two hosts have no way of
> accessing your VM's data if one were to die. This makes things an
> order of magnitude or higher more expensive.

> This does not have to be true at all. Even having a fully fault-tolerant
> SAN in addition to spare servers should not cost much more than
> having separate RAID arrays inside each of the servers, when you
> are talking about 1,000s of servers (which Rackspace certainly has)

> Randy,
>
> You're kidding, right?
>
> SAN storage costs the better part of an order of magnitude more than
> server storage, which itself is several times more expensive than
> workstation storage. That's before you duplicate the SAN and set up
> the replication process so that cabinet and room level failures don't
> take you out.

This is clearly becoming a not-NANOG-ish thread, however...

Failing to have central shared storage (iSCSI, NAS, SAN, whatever you
prefer) fails the smell test on a local enterprise-grade
virtualization cluster, much less a shared cloud service.

Some people have done tricks with distributing the data using one of
the research-ish shared filesystems, rather than separate shared
storage. That can be made to work if the host OS model and its
available shared filesystems work for you. It doesn't work for VMware
vCenter / vMotion-ish stuff as far as I know.

There are plenty of people doing non-enterprise-grade virtualization.
There's no mandate that you have the ability to migrate a virtual to
another node in realtime or restart it immediately on another node if
the first node dies suddenly. But anyone saying "we have a cloud" and
not providing that type of service is in marketing, not engineering.