Ahoy, SLA boffins!

_Bill_Woodcock · July 29, 2009, 4:34am

So I've embarked on the no-doubt-futile task of trying to interpret SLAs as empirically-verifiable technical specifications, rather than as marketing blather. And there's something that I'm finding particularly puzzling:

In most SLAs, there seem to be two separate guarantees proffered: one concerning "network availability" and one concerning "packet loss." Now, if I were to put my engineer hat on, and try to _imagine_ what the difference might be, I might imagine "network availability" to have something to do with layer-2 link status being presented as "up," while packet loss would be the percentage of packets dropped. But when I actually read SLAs, "network availability" is generally defined as the portion of the month that the path from the customer's local loop to the transit or peering routers was "available" to transmit packets. Packet loss, on the other hand, is generally defined as the portion of packets which are lost while crossing that exact same piece of network.

Now, what am I missing here? Is this one of those Heisenberg things, where "network availability" is the time the network _could have_ delivered a packet _when you weren't actually doing so_, while "packet loss" is the time the network _couldn't_ deliver a packet when you _were_ actually doing so?

Is "network availability" inherently unmeasurable on a network that's less than 100% utilized?

Am I over-thinking this?

Seriously, though, I know there are people who don't consider SLAs to be fantasy-fiction, and some of them must not be innumerate, and some subset of those must be on NANOG, and the intersection set might be equal to or greater than one, right? Can anybody explain this to me in a way I can translate into code, while still taking myself seriously?

-Bill

ianai · July 29, 2009, 4:42am

Yes. But not because you are coming to strange conclusions, but because (as you say in your first sentence), you are trying to put empirical / objective meaning to marketing blather.

I had a simple way to fix this. I defined a network as "down" with more than X% packet loss (usually with X in the 2-5 range, depending on other deal parameters). IMHO, a network with 5% packet loss -is- down. I don't know about you, but none of my customers will use my service if they have 5% loss. TCP is finicky! This receives the strongest credit because you cannot use the service.

Below X, you are not "down", just degraded, and therefore the link has some utility, but not 100% utility. This receives a credit, but not as strong a credit as being unable to use a link.

Oh, and, of course, if there is no light on the fiber, then we are (obviously) "down" as well.
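For anyone who wants this in code, here is a minimal sketch of the three-state scheme described above. The 5% default threshold comes from the 2-5 range mentioned; the credit weights (`CREDIT_WEIGHT`), the names, and the choice to treat zero loss as fully "up" are assumptions for illustration, not terms from any real contract.

```python
from enum import Enum


class LinkState(Enum):
    UP = "up"
    DEGRADED = "degraded"
    DOWN = "down"


def classify(loss_pct, has_light=True, down_threshold_pct=5.0):
    """Classify one measurement interval from its packet loss percentage."""
    if not has_light:
        return LinkState.DOWN          # no light on the fiber: down, full stop
    if loss_pct > down_threshold_pct:
        return LinkState.DOWN          # more than X% loss counts as down
    if loss_pct > 0.0:
        return LinkState.DEGRADED      # some loss, but at or below X%
    return LinkState.UP


# Hypothetical weights: the fraction of an interval's fee that is credited.
CREDIT_WEIGHT = {LinkState.UP: 0.0, LinkState.DEGRADED: 0.5, LinkState.DOWN: 1.0}


def credit_fraction(intervals):
    """intervals: iterable of (loss_pct, has_light). Returns the mean credit weight."""
    states = [classify(loss, light) for loss, light in intervals]
    if not states:
        return 0.0
    return sum(CREDIT_WEIGHT[s] for s in states) / len(states)
```

With four equal intervals of 0% loss, 3% loss, 10% loss, and a dark fiber, `credit_fraction([(0.0, True), (3.0, True), (10.0, True), (0.0, False)])` gives 0.625 under these made-up weights.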

Make sense?

Or am I over-thinking it?

Michael_Dillon4 · July 29, 2009, 11:04am

Am I over-thinking this?

Yes, I think so. Often a large component of an SLA is related to the
cost of compliance versus the cost of the penalty imposed. If it is
cheaper to pay the occasional penalty, rather than construct the
network to meet the SLA, then the network operator will often make a
purely sales/marketing decision to use the SLA without including
engineering/OPS in the discussion.

Also, the wording often refers to unplanned downtime so that any
planned downtime doesn't get counted in the non-availability measure.
And sometimes you find some allowance for packet drop during a limited
time period so that if you drop a thousand packets, it doesn't count
if it happens during the peak hour of the day or if all packets are
dropped in a few minutes timeframe.

Another limitation that I have seen refers to "core" network or "core"
PoPs meaning the part of the network in the major market area
(generally the USA and Western Europe) but not covering network or
PoPs in "fringe" areas.

I don't believe that there is any hard science behind SLAs, and
most engineering/OPS teams don't even know what the actual SLAs
being given to customers are. There are engineering targets that are
sometimes referred to as SLAs but they are not the Service Level
Agreement that is in signed customer contracts.

All that aside, it would be interesting to see some standards for
measuring and reporting things like "network availability" from an
engineering point of view.

--Michael Dillon

Andreas_Rich · July 29, 2009, 12:54pm

Bill,
To be brief, but hopefully not too fleeting, the majority of the
standards orgs - ITU, MEF - use packet loss to derive availability.
Loss% = the % of packets which were transmitted but not received by the
destination host. As for availability, loss is measured across some
time period. If during that period at least X% of the transmitted
packets were NOT lost, then the network is said to be available.
Typically a 20% figure is used, e.g. if at least 20% of the packets
transmitted during a 5-minute period were received, then the network
is said to be 100% Available for
that 5-minute time period. Some Carriers have taken this to the extreme
to say that if at least 1 packet was successfully transmitted then the
network was 100% Available for the time period.
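A rough sketch of that derivation, assuming you already have per-interval sent/received counters from some probe. The function names and the choice to skip intervals with no traffic are assumptions of this sketch, not anything taken from the ITU or MEF documents.

```python
def interval_available(sent, received, min_delivered_fraction=0.20):
    """True if enough packets got through; None if nothing was sent."""
    if sent == 0:
        return None  # no traffic in this interval, so nothing was measured
    return received / sent >= min_delivered_fraction


def availability(intervals, min_delivered_fraction=0.20):
    """intervals: iterable of (sent, received) per 5-minute period."""
    verdicts = [interval_available(s, r, min_delivered_fraction) for s, r in intervals]
    measured = [v for v in verdicts if v is not None]
    if not measured:
        raise ValueError("no interval carried any traffic")
    return sum(measured) / len(measured)


def overall_loss(intervals):
    """Fraction of all transmitted packets that never arrived."""
    pairs = list(intervals)
    sent = sum(s for s, _ in pairs)
    received = sum(r for _, r in pairs)
    if sent == 0:
        raise ValueError("no packets were sent")
    return 1 - received / sent
```

For `[(1000, 1000), (1000, 250), (1000, 100), (1000, 990)]` this reports 75% availability alongside roughly 41.5% overall loss, which is the gap being complained about. Dropping the threshold to `0.001` to approximate the "at least 1 packet" carriers turns the same data into 100% availability.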

Loss is a measure of the network's usability; Availability is .......??
(Meaningless??) What utility does a network have that is "Available"
yet sustaining a loss rate which renders it inoperable?

Rich

Leo_Bicknell1 · July 29, 2009, 2:33pm

I think the desired goal here is to separate the access SLA from
the backbone SLA. That is, consider a simple picture:

Network Cloud------Provider Edge Router-----Local Loop-----Customer Router

Network availability is the % of the time the customer router and
provider edge router can communicate, and is designed to measure
if the local loop is up. For instance, let's say the provider edge
router loses all its uplinks to the Network Cloud: your local loop
is up and functioning, but you have 100% packet loss to all destinations.

The "packet loss" SLA kicks in on a per-destination basis. Everything
is up and working, but the provider has a full circuit and is
dropping 20% of the packets on that link. You catch it, you get a
credit.

I think the technical reason why these are separate has to do with
the expectations. If my local loop is dropping 0.5% of the packets
due to errors, it is broken and must be fixed. If some random
destination on the Internet is dropping 0.5% of the packets well,
that's a normal day in the life of the network. Plus, if your local
loop takes errors then you get a credit. However, if there's a
full link in the backbone but none of your packets take it, and
thus you are unaffected, you don't.
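To make the split concrete, here is a small sketch that keeps the two numbers apart. It assumes you have two separate probe streams, one from the customer router to the provider edge router and one to each far-end destination; the function names and data shapes are made up for illustration.

```python
from collections import defaultdict


def access_availability(edge_probe_ok):
    """Fraction of customer-router -> provider-edge probes that succeeded."""
    results = list(edge_probe_ok)
    if not results:
        raise ValueError("no probes recorded")
    return sum(results) / len(results)


def per_destination_loss(probes):
    """probes: iterable of (destination, packets_sent, packets_received)."""
    sent = defaultdict(int)
    received = defaultdict(int)
    for destination, n_sent, n_received in probes:
        sent[destination] += n_sent
        received[destination] += n_received
    # Loss fraction per destination; destinations with no traffic are skipped.
    return {d: 1 - received[d] / sent[d] for d in sent if sent[d] > 0}
```

In the lost-uplinks case above, every edge probe succeeds, so `access_availability` returns 1.0, while `per_destination_loss` reports 1.0 (total loss) for every destination.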

Now, having said all that, and having been one of the people who've
attempted to communicate sane, rational, technical ideas to marketing
and legal, the chance that anything sane made it into the actual contract
is, well, nil.

Michael_Dillon4 · July 29, 2009, 4:59pm

Now, having said all that, and having been one of the people who've
attempted to communicate sane, rational, technical ideas to marketing
and legal, the chance that anything sane made it into the actual contract
is, well, nil.

I disagree.

If someone takes the trouble to publish a technical document describing a
sane technical way to measure a network SLA, and they also provide code
for measuring/calculating the SLA, then there is a good chance that the
industry will pick it up.

Look at 95th percentile billing. Dave Rand at Abovenet thought it up,
probably to simplify the billing process and keep billing overhead costs
down. Then UUNet picked it up and suddenly just about everyone was
offering a 95th percentile billing model.
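For reference, the calculation is small enough to fit in a few lines, which is probably part of why it spread. This is a sketch of the commonly described nearest-rank version, assuming one utilization sample per 5-minute interval over the billing period; individual providers differ in details (for example, whether inbound and outbound are combined), and none of those details come from this thread.

```python
def ninety_fifth_percentile(samples_mbps):
    """Nearest-rank 95th percentile of a list of utilization samples."""
    ordered = sorted(samples_mbps)
    if not ordered:
        raise ValueError("no samples")
    # ceil(0.95 * n) in integer arithmetic, to avoid floating-point rounding.
    rank = (95 * len(ordered) + 99) // 100
    return ordered[rank - 1]
```

With 95 samples at 10 Mb/s and 5 bursts at 900 Mb/s, the billable rate is 10: the top 5% of samples is simply discarded. Over a 30-day month of 5-minute samples (8640 of them), that means the busiest 432 samples, or 36 hours, are not billed.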

-- Michael Dillon

Net · July 29, 2009, 5:49pm

Aawaw

Holmes_David_A · July 29, 2009, 6:05pm

We use the BRIX active measurement system (BRIX now owned by EXFO) which
gathers round trip time, packet loss, and jitter randomly every minute
24x7x365 for our major backbone links to calculate SLAs. "Network
Availability" can be measured empirically using BRIX calculated values
of packet loss, and expressed in terms of #9's, which BRIX will also
calculate over any time period for which BRIX historical data is being
kept. BRIX historical data is kept on an embedded Oracle data base. BRIX
usually runs on a Solaris SMP server.
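For readers without such a system, converting a measured availability into a "number of 9's" is a one-line formula. The sketch below is generic arithmetic and is not the BRIX API or its method; the function names and the 30-day default are assumptions.

```python
import math


def nines(availability):
    """Number of nines for an availability in [0, 1], e.g. 0.999 -> about 3."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    if availability == 1.0:
        return math.inf  # no measured downtime at all
    return -math.log10(1.0 - availability)


def allowed_downtime_minutes(availability, period_days=30):
    """Downtime budget in minutes that the given availability leaves per period."""
    return (1.0 - availability) * period_days * 24 * 60
```

Three nines (0.999) leaves about 43 minutes of downtime in a 30-day month; each additional nine divides that budget by ten.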

herrin · July 29, 2009, 7:25pm

The SLAs I've looked at promise me that if their service is hard down
for a week (with no ambiguity whatsoever) they'll credit my bill for
upwards of 2% of the $50k/year or so I spend on the Internet
connection for my multi-million dollar online service.

So yeah, you're overthinking it. When they start coupling those SLAs
with some sort of serious business loss insurance, then paying
attention to the SLA and carefully examining what constitutes failure
may make some kind of sense at a technical level.

Regards,
Bill Herrin

JC_Dill · July 29, 2009, 8:19pm

William Herrin wrote:

Am I over-thinking this?

The SLAs I've looked at promise me that if their service is hard down
for a week (with no ambiguity whatsoever) they'll credit my bill for
upwards of 2% of the $50k/year or so I spend on the Internet
connection for my multi-million dollar online service.

I'm really surprised anyone considers this an SLA, or anything special in a business contract. I automatically expect to get a credit of 1.923% if the service were not provided for a period of 168 hours, no questions asked and no SLA required.

When service is simply not provided, there's nothing special about not having to pay for it. I don't know of any business where you can have a contract that requires you to pay your monthly/annual fee for services when said services are not provided. If you have a housekeeping or lawn service that is supposed to come once a week, and you have an annual contract with them for this service at $50/week, and they miss a week (provide no service) you don't pay them anyway for that missed week. You don't need an SLA in your contract with them to have this right to withhold payment for the period of time when the services are not provided *at all*.

An SLA comes into play when a service is degraded below the quality you contracted for. What credit do they give you when you have 168 hours of degraded service, e.g. 50% of the service level you specified in your RFQ? That's where your SLA comes in. The SLA specifies at what point your service is considered "degraded" (how much below the contracted service level, and how long of a time period is required before it is considered below grade) and what $credit you may receive when you are provided some service, but not to the level specified in your contract.

jc

herrin · July 29, 2009, 8:38pm

Hi JC,

Perhaps you miss my point: what the ISP is offering to pay me as a
result of a failure to deliver adequate service is so much less than
my loss for the same as to render the payment meaningless. I'm gonna
terminate the contract for nonperformance and hire someone who can get
the job done long before it's worth my time to chase you for an
SLA-based service credit. And we both know it. The only way I ever
chase you for an SLA credit is if I'm playing the blame game instead of
doing my job for my customers.

Regards,
Bill Herrin

Stephen_Sprunk3 · July 29, 2009, 9:52pm

JC Dill wrote:

William Herrin wrote:

The SLAs I've looked at promise me that if their service is hard
down for a week (with no ambiguity whatsoever) they'll credit my bill
for upwards of 2% of the $50k/year or so I spend on the Internet
connection for my multi-million dollar online service.

I'm really surprised anyone considers this an SLA, or anything special
in a business contract. I automatically expect to get a credit of
1.923% if the service were not provided for a period of 168 hours, no
questions asked and no SLA required.

When service is simply not provided, there's nothing special about not
having to pay for it.

Read your contract closely and you'll find that, except for an explicit
SLA clause (which will cost you extra), they make no guarantee that the
circuit will work at all and you'll still owe them money. On top of
that, the SLA payouts are usually capped at an amount _less_ than the
price increase due to demanding an SLA. If your circuit costs $2k/mo,
and it's down for an entire month, you'll probably still owe them at
least $1500 for that non-service -- and you could buy a non-SLA service
for the same $1500/mo.
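The arithmetic behind both positions can be sketched as follows. The 25% cap is only inferred from the $2k/$1500 figures above (a $500 maximum credit), and the function names and the 720-hour month are assumptions for illustration.

```python
def pro_rata_credit(fee, hours_down, hours_in_period):
    """Credit proportional to the time the service was not delivered."""
    if hours_in_period <= 0:
        raise ValueError("period must be positive")
    return fee * hours_down / hours_in_period


def capped_credit(fee, hours_down, hours_in_period, cap_fraction):
    """Pro-rata credit, limited to cap_fraction of the fee for the period."""
    return min(pro_rata_credit(fee, hours_down, hours_in_period), fee * cap_fraction)
```

One week out of a 52-week year is 1/52 of the fee, which is where the quoted 1.923% comes from (about $962 on $50k/year). For a $2k/mo circuit that is down for all 720 hours of a month, `capped_credit(2000, 720, 720, 0.25)` returns 500.0, so $1500 is still owed.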

(Savvy customers who are spending big bucks know how to negotiate these
terms to be more favorable, but most customers aren't savvy unless
they've already been burned by this.)

S

JC_Dill · July 30, 2009, 6:59am

Stephen Sprunk wrote:

Read your contract closely and you'll find that, except for an explicit
SLA clause (which will cost you extra), they make no guarantee that the
circuit will work at all and you'll still owe them money.

I am not a lawyer. However, over the years many lawyers have told me you can't have a legally enforceable contract that says (in essence) you owe me money even if I give you absolutely nothing in exchange (or vice versa). A legally enforceable contract must *always* have an exchange of consideration - I give you something (money, labor, tangible property, intangible property) in exchange for something you give me.

Many businesses try this type of crap all the time, but (according to the above-mentioned lawyers) it's not worth the paper it is written on. They include these clauses hoping the other party doesn't know their rights. However, contract law (e.g. the UCC) trumps unenforceable and illegal clauses in your contracts (this is why we *have* civil laws regarding civil contracts, otherwise there would be no point in civil laws at all). But please don't take my word for it; ask your own lawyer to review your contract and give you an opinion about the legality and enforceability of clauses of this type in your particular contract.

jc

Leigh_Porter · July 30, 2009, 9:23am

Indeed, that's why some companies have contract managers with experience of thieving gits who try to rip you off on SLAs. We have indeed been burned, and so our contracts worth any money now have real incentives for the vendors to come up with the goods and make what they sell us work. Even so, sometimes important stuff gets dropped because the vendor refuses to be bound by it, and then we get screwed over it.

Matthew_Petach2 · July 30, 2009, 9:50pm

Actually, SLA credits are useful in cases where it's not the only path
between two sites; if, for example, you have 12 OC192 links running
across the US, but your peak traffic on them doesn't exceed 80Gb
combined, having an OC192 down for a day or two won't really hurt
you; there's no reason to cancel the circuit, the rest of your links are
carrying the traffic just fine, but since one of the links failed to meet
its SLA, you might as well push the vendor to give you the SLA
credit back; it saves you some money, you have no lost customers,
you have no other impact to your business. It's not about playing
the "blame game", it's about giving the vendor an incentive to try
to run their system a bit more reliably.
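A quick way to check that headroom claim, as a sketch: it uses the OC-192 line rate of about 9.95 Gb/s (usable payload is somewhat lower) and assumes traffic can be spread evenly over the surviving links, which real routing may not achieve. The function names are made up.

```python
import math

OC192_GBPS = 9.95328  # OC-192 line rate; usable payload is a little lower


def survives_failures(link_count, link_gbps, peak_gbps, failed=1):
    """True if the remaining links can still carry the peak load."""
    remaining = max(link_count - failed, 0) * link_gbps
    return remaining >= peak_gbps


def max_tolerable_failures(link_count, link_gbps, peak_gbps):
    """Largest number of simultaneous link failures that still fits the peak."""
    needed = math.ceil(peak_gbps / link_gbps)
    return max(link_count - needed, 0)
```

With 12 links and an 80 Gb peak, nine links are needed, so up to three can be down at once before capacity becomes the problem. That is why a single outage here is a billing matter and not an emergency.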

Now, for single-homed customers depending on that one link,
I agree, an SLA is largely meaningless compared to the impact
of being down. But there are many cases where the SLA is
meaningful, and collecting SLA credits is worth it, without
there being a corresponding massive loss in revenue
associated with the outage.

Matt