Akamai server reliability

Hi,

Many moons ago, we got a set of Akamai servers. Over the years I think they replaced every one of them at least once. Last August we got another set of servers due to a move, and now two of those three servers have failed.

I still have the original server that started garlic.com in production after 11+ years, so I know servers can last a long time. I don't understand why Akamai failure rates are so high.

Is anyone else seeing high failure rates of Akamai servers at their facilities?

Roy

We had 3 boxes for 5-6 years without a problem. Then one of them failed.
We've since replaced that box 5-6 times in the last year. The replacement
boxes often come with non-spinning CPU fans and other issues, so I'm not
that surprised. The last replacement was a few months ago, though, so maybe
this one will stick around.

I think whoever is doing their refurbs isn't doing a very good job. They
never seem very concerned though.

Chris

Nope, just one bad box in many years.

         ---Mike

Out of the three Akamai servers we have, I think we've had two of them replaced in the past three or four years that we've had them. One was replaced several times. The replacement servers tend to be refurbished, and I've seen multiple things wrong with them when they arrive. If I recall correctly, one replacement wouldn't even boot successfully... it just kept crashing. Reloading the OS from an Akamai recovery CD had no effect. Shipping does cause problems; parts can come loose during transit.

The most common problem we see is failed hard drives and/or SCSI bus errors which are likely related to the hard drive failures. I'm surprised Akamai doesn't have any hardware RAID with hot swap yet (at least not in the boxes we have). It would be much less costly for them to ship a new hard drive than a whole new server each time a hard drive fails. I know the idea is to have very cheap boxes in clusters, but I wonder how much they're paying in shipping for replacing the cheap hardware.
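
Back-of-envelope, the shipping math looks pretty lopsided. Every dollar figure below is a made-up assumption for illustration, not anything from Akamai:

# Compare shipping a bare drive vs. a whole 1U server for each failure.
# All the numbers here are illustrative assumptions, not real Akamai costs.
DRIVE_SHIP_COST = 15.0           # assumed: small box, a couple of pounds
SERVER_SHIP_COST = 90.0          # assumed: 1U chassis, packed, both directions
FAILURES_PER_SITE_PER_YEAR = 6   # roughly the swap rate reported in this thread

def annual_shipping(per_swap_cost, swaps=FAILURES_PER_SITE_PER_YEAR):
    """Yearly shipping spend for one deployment site."""
    return per_swap_cost * swaps

if __name__ == "__main__":
    drive = annual_shipping(DRIVE_SHIP_COST)
    server = annual_shipping(SERVER_SHIP_COST)
    print("drives only: $%.0f/yr, whole servers: $%.0f/yr, delta: $%.0f/yr per site"
          % (drive, server, server - drive))

Multiply that delta across however many sites they run and it adds up.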

Lately we've had no major problems with our Akamai boxes. The one box does occasionally have weird SCSI hangs while the other two work nonstop, but for the most part it is fine.

Vinny Abello
Network Engineer
Server Management
vinny@tellurian.com
(973)300-9211 x 125
(973)940-6125 (Direct)
PGP Key Fingerprint: 3BC5 9A48 FC78 03D3 82E0 E935 5325 FBCB 0100 977A

Tellurian Networks - The Ultimate Internet Connection
http://www.tellurian.com (888)TELLURIAN

"Courage is resistance to fear, mastery of fear - not absence of fear" -- Mark Twain

Never underestimate the number of airbills that can be paid with a KISS strategy.

Anything else is trollage on NANOG.

Yep, that's true. Shipping is cheap, it's customs that's expensive and
time-consuming, and Akamai tends to avoid the kind of places where you
have to deal with a lot of customs.

                                -Bill

I'd note that the original poster didn't say 'broken' or 'outage' or
'non-functioning'... just the end result: "replacement".

So, is Akamai doing some fancy SMART detection and proactively
replacing boxes when they see a bad CPU fan, a failing disk, or memory
corruption, or are these hard box outages with no recourse but a
complete, immediate replacement?

(just curious as they don't let us play with these pieces/parts :-) )
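
For illustration, here's the kind of SMART polling I have in mind. This is a minimal sketch assuming smartmontools' smartctl binary is installed; I have no idea what Akamai actually runs, so treat it purely as a guess at the technique:

# Minimal disk-health poll, assuming smartmontools' `smartctl` is installed.
# A guess at the *kind* of check a CDN operator might run, not Akamai's code.
import subprocess

def disk_healthy(device="/dev/sda"):
    """Return True if the drive's SMART overall health self-assessment passes."""
    result = subprocess.run(
        ["smartctl", "-H", device],
        capture_output=True, text=True,
    )
    # smartctl reports "PASSED" for ATA drives and "OK" for SCSI drives.
    return "PASSED" in result.stdout or "OK" in result.stdout

if __name__ == "__main__":
    for dev in ("/dev/sda", "/dev/sdb"):
        print(dev, "healthy" if disk_healthy(dev) else "flag for replacement")

Run something like that from cron and you could open a replacement ticket before the box ever drops off the network.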

As far as I can tell the only thing that will get a box replaced is if it
can't be booted/pinged. We've pointed out dead CPU fans before (even on
the incoming replacement boxes) and they've never seemed to care. If it
runs it runs. If it doesn't they replace the entire box.

Given all their redundancy I suppose that is probably the way to go.

Chris

Especially since Akamai doesn't pay for truck rolls and man hours to get the replacements done onsite.

Never underestimate the number of airbills that can be paid with a KISS strategy.

Especially since Akamai doesn't pay for truck rolls and man hours to get the replacements done onsite.

I'm sorry, isn't that exactly what an airbill *is* paying for -- to get the equipment on site?

The man hours? Really, we are talking about less than a single hour to replace a server, including all the mounting and repacking. The one man-hour they need per swap (no more than six a year by the look of it) should be more than offset by the value the ISP gets from not buying bandwidth to reach the content and from the improved performance.

If that model doesn't work for the ISP in question, they should ask Akamai to pull their gear.

DJ

As far as I can tell the only thing that will get a box replaced is if it
can't be booted/pinged. We've pointed out dead CPU fans before (even on
the incoming replacement boxes) and they've never seemed to care. If it
runs it runs. If it doesn't they replace the entire box.

Having built a fair number of machines to live for 5 years or longer in data centers I will never visit, there's relatively little that you want to triage onsite on a rackmount PC. Drives in hot-plug enclosures and removable power-supply modules are about it... Smart hands are good for racking and stacking, swapping disks, recabling the OOB, swapping media, and so forth. It's not really a good use of someone else's time to have them performing experimental surgery on PCs. Much better to simply ship out another one and ship the old one back in the same box.

A decent modern 1U chassis still has sufficient airflow to remain adequately cool with a couple of fans failed. Further, there are now enough sensors in a PC to tell when you're getting into trouble: RPM indicators for all the fans, intake, processor, and exhaust temperatures, thermal sensors in each of the drives, etc. Our success rate at identifying machines before they fail has gotten substantially better over time.
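
As a concrete example (a sketch only; the thresholds are illustrative and chassis-specific, and the /sys/class/hwmon layout assumes a reasonably current Linux kernel), walking the hwmon tree is enough to catch stalled fans and hot sensors before they take a box down:

# Walk Linux's /sys/class/hwmon tree and flag stalled fans or hot sensors.
# Thresholds are illustrative assumptions; sane values depend on the chassis.
from pathlib import Path

MIN_FAN_RPM = 500     # assumed: below this, treat the fan as stalled
MAX_TEMP_C = 70.0     # assumed: above this, treat the sensor as running hot

def check_hwmon():
    warnings = []
    for chip in Path("/sys/class/hwmon").glob("hwmon*"):
        for fan in chip.glob("fan*_input"):
            rpm = int(fan.read_text())
            if rpm < MIN_FAN_RPM:
                warnings.append("%s: %d rpm (possible dead fan)" % (fan, rpm))
        for temp in chip.glob("temp*_input"):
            celsius = int(temp.read_text()) / 1000.0  # values are millidegrees
            if celsius > MAX_TEMP_C:
                warnings.append("%s: %.1f C (running hot)" % (temp, celsius))
    return warnings

if __name__ == "__main__":
    for w in check_hwmon() or ["all sensors nominal"]:
        print(w)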

I didn't really get the impression that people were really complaining so
much (I certainly wasn't) as they were just pointing out there was an
issue.

However, I do think Akamai would be better off getting their issues with
their replacement boxes straightened out. I agree that we get value for
having the boxes on our network (and so do they, let's not forget).
However, it is a bit frustrating to replace the same box 3 times in less
than a month. Hauling a box down to the colo is no big deal but when the
box you are taking down there has a dead CPU fan and two dead case fans
it's hard not to think you might be wasting your time.

It isn't just that they are wasting my time. They are also wasting their
own time. It's the overall lack of efficiency that bothers me ;-]

Chris

Chris Owen wrote:

It isn't just that they are wasting my time. They are also wasting their
own time. It's the overall lack of efficiency that bothers me ;-]

Don't worry, it won't take long until Google parks their datacenter-in-a-container outside at the fiber junction and the content distribution guys are obsoleted overnight.

Pete

It isn't just that they are wasting my time. They are also wasting their
own time. It's the overall lack of efficiency that bothers me ;-]

i suspect you have a datapoint on how they're doing financially.
they ain't stoopid. they'll deal with it when the cost/benefit
gets high enough on their priority list. isn't the first time
that good s&m covers some technical gaps, and won't be the last.

randy

Deepak Jain wrote:

If that model doesn't work for the ISP in question, they should ask Akamai to pull their gear.

And hopefully they'll (someday) send servers in my direction - are their "minimum criteria" creeping upwards at the same rate overall Internet traffic did in the late 90s?

pt

To quote a science fiction story I'm fond of, "efficiency depends on
what you want to effish".

    --Steven M. Bellovin, http://www.cs.columbia.edu/~smb

I'm sorry, isn't that exactly what an airbill *is* paying for -- to get the equipment on site?

They also frequently need boxes power cycled. It got to be so frequent that we "gave them" a remote reboot switch for all their gear and told them how to use it. They still kept emailing us for reboots until I finally used a contact at Akamai to get the remote reboot info properly placed.
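
Once a reboot switch has any remote interface at all, the ping-then-cycle loop is trivial to automate. A rough sketch (the PDU URL below is purely hypothetical; substitute whatever your particular switch actually speaks, be it SNMP, telnet, or a web form):

# Ping-then-power-cycle watchdog sketch. The PDU URL is hypothetical;
# substitute the interface your remote reboot switch actually exposes.
import subprocess
import urllib.request

PDU_CYCLE_URL = "http://pdu.example.net/outlet/%d/cycle"  # hypothetical

def host_up(ip):
    """One ICMP echo; assumes a Unix-style `ping` binary on the PATH."""
    return subprocess.run(
        ["ping", "-c", "1", "-W", "2", ip],
        stdout=subprocess.DEVNULL,
    ).returncode == 0

def cycle_if_down(ip, outlet):
    if not host_up(ip):
        print("%s unreachable, cycling outlet %d" % (ip, outlet))
        urllib.request.urlopen(PDU_CYCLE_URL % outlet, timeout=10)

if __name__ == "__main__":
    cycle_if_down("192.0.2.10", outlet=3)  # example address/outlet mapping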

We've had our share of failed boxes, DOA boxes, boxes with components literally falling out of them on arrival, etc. I suspect it's just a sign of the box building having been farmed out to the cheapest available source. When you're building boxes in really large volume, what's a few missing screws here and there? :-)

The man hours? Really, we are talking about less than a single hour to replace a server, including all the mounting and repacking. The one man-hour they need per swap (no more than six a year by the look of it) should be more than offset by the value the ISP gets from not buying bandwidth to reach the content and from the improved performance.

I wouldn't count on that. With bandwidth prices continually falling and the ISP business changing (at least for us, dialup/DSL is dying and hosting is taking off), we now do more outbound than inbound; we no longer have spare outbound capacity to sell to Akamai, and the servers really don't save us anything except maybe a bit of latency.

If that model doesn't work for the ISP in question, they should ask Akamai to pull their gear.

Think of the man hours that'd take, ripping them out, boxing them up, etc. :-)

The impression I got was that they originally scattered their machines to
everyone who had a network with a growth plan and bought them a beer. Some
people even got/get paid to host them.

After the .com crash they started being a bit more careful about who
they gave them to and doing a bit more analysis as to whether a new
site was worth the trouble.

One way to get a cluster might be to suggest that you will make better
use of it than a nearby company that already has one but is much
smaller than you. I have heard of people trying this in Australia; no
idea how well it works.

I know people who were doing under 10Mb/s via their clusters, but they are
in Aus/NZ so the threshold might be higher elsewhere.

To quote a science fiction story I'm fond of, "efficiency depends on what you want to effish".

    --Steven M. Bellovin, http://www.cs.columbia.edu/~smb

Sci-fi injection!

(marking another beer owed)

  Gadi.

I still have the original server that started garlic.com in production
after 11+ years, so I know servers can last a long time. I don't
understand why Akamai failure rates are so high.

Applications which cause the disk to thrash will wear out
disk drives much more quickly than non-thrashing applications.
When I still ran USENET news servers back before cyclic file
systems were used, I remember that their hard drives died frequently,
often after less than a year of service, but those drives were
thrashing 24 by 7. You can hear drives thrashing and feel it
by touching the case. It is caused by almost completely random
access resulting in almost constant head movement.
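
If you want to see the effect yourself, here's a quick sketch comparing sequential reads with seek-heavy random reads over the same large file. Point it at a big file on a real spinning disk; on a small file the page cache will hide the difference:

# Contrast sequential vs. random reads over one file. On a spinning disk the
# random pattern forces constant head movement -- the thrashing described above.
import os
import random
import time

def read_pattern(path, block=4096, reads=2000, randomize=False):
    size = os.path.getsize(path)
    offsets = [i * block for i in range(reads)]
    if randomize:
        offsets = [random.randrange(0, max(size - block, 1)) for _ in range(reads)]
    start = time.monotonic()
    with open(path, "rb") as f:
        for off in offsets:
            f.seek(off)
            f.read(block)
    return time.monotonic() - start

if __name__ == "__main__":
    path = "/var/tmp/testfile"  # point at a large file on a spinning disk
    print("sequential: %.2fs" % read_pattern(path))
    print("random:     %.2fs" % read_pattern(path, randomize=True))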

It is cost effective to just thrash cheap drives and
replace them when they die.

--Michael Dillon