Akamai server reliability

Hi,

Many moons ago, we got a set of Akamai servers. Over the years I think they replaced every one of them at least once. Last August we got another set of servers due to a move, and now two of those three servers have failed.

I still have the original server that started garlic.com in production after 11+ years, so I know servers can last a long time. I don't understand why Akamai failure rates are so high.

Is anyone else seeing high failure rates of Akamai servers at their facilities?

Roy

We had 3 boxes for 5-6 years without a problem. Then one of them failed.
We've since replaced that box 5-6 times in the last year. The replacement
boxes often come with non-spinning CPU fans and other issues, so I'm not
that surprised. The last replacement was a few months ago, though, so maybe
this one will stick around.

I think whoever is doing their refurbs isn't doing a very good job. They
never seem very concerned though.

Chris

Nope, just one bad box in many years.

         ---Mike

Out of the three Akamai servers we have, I think we've had two of them replaced in the past three or four years that we've had them. One was replaced several times. The replacement servers tend to be refurbished, and I've seen multiple things wrong with them when they arrive. If I recall correctly, one replacement wouldn't even boot successfully... it just kept crashing. Reloading the OS from an Akamai recovery CD had no effect. Shipping does cause problems; parts can come loose during transit.

The most common problem we see is failed hard drives and/or SCSI bus errors which are likely related to the hard drive failures. I'm surprised Akamai doesn't have any hardware RAID with hot swap yet (at least not in the boxes we have). It would be much less costly for them to ship a new hard drive than a whole new server each time a hard drive fails. I know the idea is to have very cheap boxes in clusters, but I wonder how much they're paying in shipping for replacing the cheap hardware.
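
Back-of-envelope, the shipping math looks pretty lopsided. Every dollar figure below is a made-up assumption for illustration, not anything from Akamai:

# Compare shipping a bare drive vs. a whole 1U server for each failure.
# All the numbers here are illustrative assumptions, not real Akamai costs.
DRIVE_SHIP_COST = 15.0           # assumed: small box, a couple of pounds
SERVER_SHIP_COST = 90.0          # assumed: 1U chassis, packed, both directions
FAILURES_PER_SITE_PER_YEAR = 6   # roughly the swap rate reported in this thread

def annual_shipping(per_swap_cost, swaps=FAILURES_PER_SITE_PER_YEAR):
    """Yearly shipping spend for one deployment site."""
    return per_swap_cost * swaps

if __name__ == "__main__":
    drive = annual_shipping(DRIVE_SHIP_COST)
    server = annual_shipping(SERVER_SHIP_COST)
    print("drives only: $%.0f/yr, whole servers: $%.0f/yr, delta: $%.0f/yr per site"
          % (drive, server, server - drive))

Multiply that delta across however many sites they run and it adds up.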

Lately we've had no major problems with our Akamai boxes. The one box does occasionally have weird SCSI hangs while the other two work nonstop, but for the most part it is fine.

Vinny Abello
Network Engineer
Server Management
vinny@tellurian.com
(973)300-9211 x 125
(973)940-6125 (Direct)
PGP Key Fingerprint: 3BC5 9A48 FC78 03D3 82E0 E935 5325 FBCB 0100 977A

Tellurian Networks - The Ultimate Internet Connection
http://www.tellurian.com (888)TELLURIAN

"Courage is resistance to fear, mastery of fear - not absence of fear" -- Mark Twain

Never underestimate the number of airbills that can be paid with a KISS strategy.

Anything else is trollage on NANOG.

Yep, that's true. Shipping is cheap, it's customs that's expensive and
time-consuming, and Akamai tends to avoid the kind of places where you
have to deal with a lot of customs.

                                -Bill

I'd note that the original poster didn't say 'broken' or 'outage' or
'non-functioning'... just the end result: "replacement".

So, is Akamai doing some fancy SMART detection and proactively
replacing boxes when they see a bad CPU fan, a failing disk, or memory
corruption, or are these hard box outages with no recourse but a
complete, immediate replacement?

(just curious as they don't let us play with these pieces/parts :-) )
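
For illustration, here's the kind of SMART polling I have in mind. This is a minimal sketch assuming smartmontools' smartctl binary is installed; I have no idea what Akamai actually runs, so treat it purely as a guess at the technique:

# Minimal disk-health poll, assuming smartmontools' `smartctl` is installed.
# A guess at the *kind* of check a CDN operator might run, not Akamai's code.
import subprocess

def disk_healthy(device="/dev/sda"):
    """Return True if the drive's SMART overall health self-assessment passes."""
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True,
    )
    # smartctl reports "PASSED" for ATA drives and "OK" for SCSI drives.
    return "PASSED" in result.stdout or "OK" in result.stdout

if __name__ == "__main__":
    for dev in ("/dev/sda", "/dev/sdb"):
        print(dev, "healthy" if disk_healthy(dev) else "flag for replacement")

Run something like that from cron and you could open a replacement ticket before the box ever drops off the network.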

As far as I can tell the only thing that will get a box replaced is if it
can't be booted/pinged. We've pointed out dead CPU fans before (even on
the incoming replacement boxes) and they've never seemed to care. If it
runs it runs. If it doesn't they replace the entire box.

Given all their redundancy I suppose that is probably the way to go.

Chris

Especially since Akamai doesn't pay for truck rolls and man hours to get the replacements done onsite.

Never underestimate the number of airbills that can be paid with a KISS strategy.

Especially since Akamai doesn't pay for truck rolls and man hours to get the replacements done onsite.

I'm sorry, isn't that exactly what an airbill *is* paying for -- to get the equipment on site?

The man hours? Really, we are talking about less than a single hour to replace a server, including all the mounting and repacking. The one man-hour they need per swap (no more than six a year by the look of it) should be more than offset by the value the ISP gets from not buying bandwidth to reach the content and from the improved performance.

If that model doesn't work for the ISP in question, they should ask Akamai to pull their gear.

DJ

As far as I can tell the only thing that will get a box replaced is if it
can't be booted/pinged. We've pointed out dead CPU fans before (even on
the incoming replacement boxes) and they've never seemed to care. If it
runs it runs. If it doesn't they replace the entire box.

Having built a fair number of machines to live for 5 years or longer in data centers I will never visit, there's relatively little that you want to triage onsite on a rackmount PC. Drives in hot-plug enclosures and removable power-supply modules are about it... Smart hands are good for racking and stacking, swapping disks, recabling the OOB, swapping media, and so forth. It's not really a good use of someone else's time to have them performing experimental surgery on PCs. Much better to simply ship out another one and ship the old one back in the same box.

A decent modern 1U chassis still has sufficient airflow to remain adequately cool with a couple of fans failed. Further, there are now enough sensors in a PC to tell when you're getting into trouble: RPM indicators for all the fans, intake, processor, and exhaust temperatures, thermal sensors in each of the drives, etc. Our success rate at identifying machines before they fail has gotten substantially better over time.
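
As a concrete example (a sketch only; the thresholds are illustrative and chassis-specific, and the /sys/class/hwmon layout assumes a reasonably current Linux kernel), walking the hwmon tree is enough to catch stalled fans and hot sensors before they take a box down:

# Walk Linux's /sys/class/hwmon tree and flag stalled fans or hot sensors.
# Thresholds are illustrative assumptions; sane values depend on the chassis.
from pathlib import Path

MIN_FAN_RPM = 500     # assumed: below this, treat the fan as stalled
MAX_TEMP_C = 70.0     # assumed: above this, treat the sensor as running hot

def check_hwmon():
    warnings = []
    for chip in Path("/sys/class/hwmon").glob("hwmon*"):
        for fan in chip.glob("fan*_input"):
            rpm = int(fan.read_text())
            if rpm < MIN_FAN_RPM:
                warnings.append("%s: %d rpm (possible dead fan)" % (fan, rpm))
        for temp in chip.glob("temp*_input"):
            celsius = int(temp.read_text()) / 1000.0  # values are millidegrees
            if celsius > MAX_TEMP_C:
                warnings.append("%s: %.1f C (running hot)" % (temp, celsius))
    return warnings

if __name__ == "__main__":
    for w in check_hwmon() or ["all sensors nominal"]:
        print(w)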

I didn't really get the impression that people were really complaining so
much (I certainly wasn't) as they were just pointing out there was an
issue.

However, I do think Akamai would be better off getting their issues with
their replacement boxes straightened out. I agree that we get value for
having the boxes on our network (and so do they, let's not forget).
However, it is a bit frustrating to replace the same box 3 times in less
than a month. Hauling a box down to the colo is no big deal but when the
box you are taking down there has a dead CPU fan and two dead case fans
it's hard not to think you might be wasting your time.

It isn't just that they are wasting my time. They are also wasting their
own time. It's the overall lack of efficiency that bothers me ;-]

Chris

Chris Owen wrote:

It isn't just that they are wasting my time. They are also wasting their
own time. It's the overall lack of efficiency that bothers me ;-]

Don't worry, it won't take long until Google parks their datacenter-in-a-container outside at the fiber junction and the content distribution guys are obsoleted overnight.

Pete

It isn't just that they are wasting my time. They are also wasting their
own time. It's the overall lack of efficiency that bothers me ;-]

i suspect you have a datapoint on how they're doing financially.
they ain't stoopid. they'll deal with it when the cost/benefit
gets high enough on their priority list. isn't the first time
that good s&m covers some technical gaps, and won't be the last.

randy

Deepak Jain wrote:

If that model doesn't work for the ISP in question, they should ask Akamai to pull their gear.

And hopefully they'll (someday) send servers in my direction - are their "minimum criteria" creeping upwards at the same rate overall Internet traffic did in the late 90s?

pt

To quote a science fiction story I'm fond of, "efficiency depends on
what you want to effish".

    --Steven M. Bellovin, http://www.cs.columbia.edu/~smb

I'm sorry, isn't that exactly what an airbill *is* paying for -- to get the equipment on site?

They also frequently need boxes power cycled. It got to be so frequent that we "gave them" a remote reboot switch for all their gear and told them how to use it. They still kept emailing us for reboots until I finally used a contact at Akamai to get the remote reboot info properly placed.
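
Once a reboot switch has any remote interface at all, the ping-then-cycle loop is trivial to automate. A rough sketch (the PDU URL below is purely hypothetical; substitute whatever your particular switch actually speaks, be it SNMP, telnet, or a web form):

# Ping-then-power-cycle watchdog sketch. The PDU URL is hypothetical;
# substitute the interface your remote reboot switch actually exposes.
import subprocess
import urllib.request

PDU_CYCLE_URL = "http://pdu.example.net/outlet/%d/cycle"  # hypothetical

def host_up(ip):
    """One ICMP echo; assumes a Unix-style `ping` binary on the PATH."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "2", ip],
        stdout=subprocess.DEVNULL,
    ).returncode == 0

def cycle_if_down(ip, outlet):
    if not host_up(ip):
        print("%s unreachable, cycling outlet %d" % (ip, outlet))
        urllib.request.urlopen(PDU_CYCLE_URL % outlet, timeout=10)

if __name__ == "__main__":
    cycle_if_down("192.0.2.10", outlet=3)  # example address/outlet mapping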

We've had our share of failed boxes, DOA boxes, boxes with components literally falling out of them on arrival, etc. I suspect it's just a sign of the box building having been farmed out to the cheapest available source. When you're building boxes in really large volume, what's a few missing screws here and there? :-)

The man hours? Really, we are talking about less than a single hour to replace a server, including all the mounting and repacking. The one man-hour they need per swap (no more than six a year by the look of it) should be more than offset by the value the ISP gets from not buying bandwidth to reach the content and from the improved performance.

I wouldn't count on that. With bandwidth prices continually falling and the ISP business changing (at least for us, dialup/DSL is dying and hosting is taking off), we now do more outbound than inbound; we no longer have spare outbound capacity to sell to Akamai, and the servers really don't save us anything except maybe a bit of latency.

If that model doesn't work for the ISP in question, they should ask Akamai to pull their gear.

Think of the man hours that'd take, ripping them out, boxing them up, etc. :-)

The impression I got was that they originally scattered their machines to
everyone who had a network with a growth plan and bought them a beer. Some
people even got/get paid to host them.

After the .com crash they started being a bit more careful about who
they gave them to and doing a bit more analysis as to whether a new
site was worth the trouble.

One way to get a cluster might be to suggest that you will make better
use of it than a nearby company that already has one but is much
smaller than you. I have heard of people trying this in Australia; no
idea how well it works.

I know people who were doing under 10Mb/s via their clusters, but they are
in Aus/NZ so the threshold might be higher elsewhere.

To quote a science fiction story I'm fond of, "efficiency depends on what you want to effish".

    --Steven M. Bellovin, http://www.cs.columbia.edu/~smb

Sci-fi injection!

(marking another beer owed)

  Gadi.

I still have the original server that started garlic.com in production
after 11+ years, so I know servers can last a long time. I don't
understand why Akamai failure rates are so high.

Applications which cause the disk to thrash will wear out
disk drives much more quickly than non-thrashing applications.
When I still ran USENET news servers back before cyclic file
systems were used, I remember that their hard drives died frequently,
often after less than a year of service, but those drives were
thrashing 24 by 7. You can hear drives thrashing and feel it
by touching the case. It is caused by almost completely random
access resulting in almost constant head movement.
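
If you want to see the effect yourself, here's a quick sketch comparing sequential reads with seek-heavy random reads over the same large file. Point it at a big file on a real spinning disk; on a small file the page cache will hide the difference:

# Contrast sequential vs. random reads over one file. On a spinning disk the
# random pattern forces constant head movement -- the thrashing described above.
import os
import random
import time

def read_pattern(path, block=4096, reads=2000, randomize=False):
    size = os.path.getsize(path)
    offsets = [i * block for i in range(reads)]
    if randomize:
        offsets = [random.randrange(0, max(size - block, 1)) for _ in range(reads)]
    start = time.monotonic()
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(block)
    return time.monotonic() - start

if __name__ == "__main__":
    path = "/var/tmp/testfile"  # point at a large file on a spinning disk
    print("sequential: %.2fs" % read_pattern(path))
    print("random:     %.2fs" % read_pattern(path, randomize=True))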

It is cost effective to just thrash cheap drives and
replace them when they die.

--Michael Dillon