How do you (not how do I) calculate 95th percentile?

I am wondering what other people are doing for 95th percentile calculations
these days. Not how you gather the data, but how often you check the
counter? Do you use averages or maximums over time periods to create the
buckets used for the 95th percentile calculation?

A lot of smaller folks check the counter every 5 min and use that same
value for the 95th percentile. Most of us larger folks need to check more
often to prevent 32bit counters from rolling over too often. Are you larger
folks averaging the retrieved values over a larger period? Using the
maximum within a larger period? Or just using your saved values?

This is curiosity only. A few years ago we compared the same data and the
answers varied wildly. It would appear from my latest check that it is
becoming more standardized on 5-minute averages, so I'm asking here on NANOG
as a reality check.

Note: I have the AboveNet, Savvis, Verio, etc. calculations. I'm wondering
if there are any other odd combinations out there.

Reply to me offlist. If there is interest I'll summarize the results
without identifying the source.
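
For anyone comparing notes, here is a rough sketch of the calculation itself in Python. The bucket size and the sample data are illustrative assumptions, not anyone's production code; the open question above is only where each bucket's value comes from (an average or a maximum over the 5 minutes).

    import math

    def ninety_fifth_percentile(samples_bps):
        # Sort the per-bucket rates, throw away the top 5% of buckets,
        # and bill on the highest value that remains.
        ordered = sorted(samples_bps)
        idx = math.ceil(0.95 * len(ordered)) - 1
        return ordered[max(idx, 0)]

    # Hypothetical 5-minute samples (bits/sec); a real 30-day month
    # has roughly 8,640 of them.
    samples = [42_000_000, 97_000_000, 61_000_000, 55_000_000]
    print(ninety_fifth_percentile(samples) / 1e6, "Mbps")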

Jo Rhett wrote:

I am wondering what other people are doing for 95th percentile calculations
these days. Not how you gather the data, but how often you check the
counter? Do you use averages or maximums over time periods to create the buckets used for the 95th percentile calculation?

We use maximums, every 5 minutes.

A lot of smaller folks check the counter every 5 min and use that same
value for the 95th percentile. Most of us larger folks need to check more often to prevent 32bit counters from rolling over too often.

Actually, a lot of people do 5 minutes... and I would say that larger companies don't check them more often because they are using 64 bit counters, as should anyone with over about 100Mbps of traffic.

Are you larger folks averaging the retrieved values over a larger period? Using the
maximum within a larger period? Or just using your saved values?

In our setup, as with a lot of people likely, any data that is older than 30 days is averaged. However, we store the exact maximums for the most current 30 days.
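
As a rough illustration of that retention scheme (Python; the one-day roll-up granularity and the names are my assumptions, not a description of their actual system):

    from datetime import timedelta

    def consolidate(datapoints, now, keep_days=30):
        # datapoints: list of (datetime, bps) 5-minute samples (maximums).
        # Keep the exact samples for the most recent `keep_days`; average
        # everything older into one value per calendar day.
        cutoff = now - timedelta(days=keep_days)
        recent = [(t, v) for t, v in datapoints if t >= cutoff]
        older = {}
        for t, v in datapoints:
            if t < cutoff:
                older.setdefault(t.date(), []).append(v)
        rolled_up = [(day, sum(vals) / len(vals))
                     for day, vals in sorted(older.items())]
        return recent, rolled_up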

A lot of smaller folks check the counter every 5 min and use that same
value for the 95th percentile. Most of us larger folks need to check more
often to prevent 32bit counters from rolling over too often. Are you larger
folks averaging the retrieved values over a larger period? Using the
maximum within a larger period? Or just using your saved values?

Most people are using 64 bit counters. This avoids the wrapping problem (assuming you don't have 100GE and poll more than once every 5 years :-)).

This is curiosity only. A few years ago we compared the same data and the
answers varied wildly. It would appear from my latest check that it is
becoming more standardized on 5-minute averages, so I'm asking here on NANOG
as a reality check.

Yup, 5 min seems to be the accepted time.

(I did this fast, and, who knows, I could be off by an order or two of magnitude.)

Most people are using 64 bit counters. This avoids the wrapping problem (assuming you don't have 100GE and poll more than once every 5 years :-)).

2^64 is 18,446,744,073,709,551,616 bytes.

100 GE (100,000,000,000 bits/sec) is 12,500,000,000 bytes/sec.

It would take 1,475,739,525 seconds, or 46.79 years for a counter wrap.
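
A quick way to sanity-check that arithmetic, and to see why 32-bit octet counters are the real problem (a sketch; the line rates are just examples):

    def wrap_seconds(counter_bits, rate_bps):
        # ifInOctets/ifOutOctets count bytes, so divide the bit rate by 8.
        return (2 ** counter_bits) / (rate_bps / 8.0)

    for rate_mbps in (100, 1_000, 10_000, 100_000):
        rate_bps = rate_mbps * 1_000_000
        print(f"{rate_mbps:>7} Mbps: 32-bit wraps in {wrap_seconds(32, rate_bps):,.0f} s, "
              f"64-bit in {wrap_seconds(64, rate_bps) / 86400 / 365:,.1f} years")

At 100 Mbps a 32-bit octet counter wraps in roughly 344 seconds, which is why 5-minute polling is already marginal there and 64-bit counters (or faster polling) start to matter, as noted above.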

>A lot of smaller folks check the counter every 5 min and use that same
>value for the 95th percentile. Most of us larger folks need to check more
>often to prevent 32bit counters from rolling over too often.

Actually, a lot of people do 5 minutes... and I would say that larger
companies don't check them more often because they are using 64 bit
counters, as should anyone with over about 100Mbps of traffic.

Counter size is an incomplete reason for polling interval.

If you need a 5 minute average and poll your routers once every five
minutes, what happens if an SNMP packet gets lost?

In the best case, a retransmission over Y seconds sees it through, but
now you've got 300+Y seconds in what was supposed to be a 300-second
average... your next datapoint will also now be a (300-Y)-second average
unless you schedule it into the future.

In the worst case, you've lost the datapoint entirely. This loses not
just the one datapoint ending in that five minute span, but also the
next datapoint. Sure, you can synthesize two 5 minute averages from
one 10 minute average (presuming your counters wouldn't roll), but this
is still a loss in data - one of those two datapoints should have been
higher than the other.
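
As an aside, a minimal sketch of one way to soften this: compute each rate from the counter delta over the actual inter-poll gap rather than assuming exactly 300 seconds. This is only an illustration, not what anyone above says they run:

    def rate_bps(prev_octets, prev_time, cur_octets, cur_time, counter_bits=64):
        # Counter delta over the *actual* inter-poll gap, so a poll that
        # arrives Y seconds late stretches the interval instead of
        # corrupting the average.
        delta = cur_octets - prev_octets
        if delta < 0:                   # counter wrapped (or was cleared)
            delta += 2 ** counter_bits
        return (delta * 8) / (cur_time - prev_time)   # octets -> bits/sec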

At a place of previous employ, we solved this problem by using a 30
second (!) polling interval, and a home-written (C, linking to
the UCD-SNMP library (now net-snmp)) polling engine that did its best
to emit and receive as many queries in as short a space of time as it
was able to (without flooding monitored devices).

In these circumstances, we could lose several datapoints and still
construct valid 5-minute averages from the pieces (combinations of 30,
60, 90 etc second averages, weighting each by the number of seconds
it represents within the 300-second span).
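
Roughly, that reassembly looks like this (a Python sketch rather than their C engine; the data layout is an assumption):

    def five_minute_average(pieces):
        # pieces: list of (seconds_covered, avg_bps) sub-interval averages
        # falling within one 300-second span -- e.g. the 30, 60, 90... second
        # pieces left over after some polls were lost.
        covered = sum(sec for sec, _ in pieces)
        if covered == 0:
            return None                 # nothing survived this span
        return sum(sec * avg for sec, avg in pieces) / covered

    # e.g. lost polls leave a 30 s, a 120 s and a 150 s piece:
    print(five_minute_average([(30, 80e6), (120, 95e6), (150, 88e6)]))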

Our operations staff also enjoyed being able to see graphical response
to changes in traffic balancing within half a minute...better, faster
feedback. Another factor that makes 'counter size' a bad indicator
for polling interval.

In our setup, as with a lot of people likely, any data that is older
than 30 days is averaged. However, we store the exact maximums for the
most current 30 days.

You keep no record? What do you do if a customer challenges their
bill? Synthesize 5 minute datapoints out of the larger averages?

I recommend keeping the 5 minute averages in perpetuity, even if that
means having an operator burn the data to CD and store it in a safe (not
under his desk in the pizza boxes, nor under his soft drink as a coaster).

Doh! You are 100% correct.

I didn't take into account the fact that the counters are if(In|Out)Octets and NOT if(In|Out)Bits.

The point is that 64-bit counters are not likely to roll :-)

Warren

David W. Hankins wrote:

A lot of smaller folks check the counter every 5 min and use that same
value for the 95th percentile. Most of us larger folks need to check more often to prevent 32bit counters from rolling over too often.

Actually, a lot of people do 5 minutes... and I would say that larger companies don't check them more often because they are using 64 bit counters, as should anyone with over about 100Mbps of traffic.

Counter size is an incomplete reason for polling interval.

Possibly incomplete, but a reason for some nonetheless, if all they can do is 32-bit counters.

If you need a 5 minute average and poll your routers once every five
minutes, what happens if an SNMP packet gets lost?

No one said it was "needed", just that it's what is done... and I agree with your reason for more frequent polling, rather than doing it because of counter roll.

In the best case, a retransmission over Y seconds sees it through, but
now you've got 300+Y seconds in what was supposed to be a 300-second
average... your next datapoint will also now be a (300-Y)-second average
unless you schedule it into the future.

In the worst case, you've lost the datapoint entirely. This loses not
just the one datapoint ending in that five minute span, but also the
next datapoint. Sure, you can synthesize two 5 minute averages from
one 10 minute average (presuming your counters wouldn't roll), but this
is still a loss in data - one of those two datapoints should have been
higher than the other.

In our setup, as with a lot of people likely, any data that is older than 30 days is averaged. However, we store the exact maximums for the most current 30 days.

You keep no record? What do you do if a customer challenges their
bill? Synthesize 5 minute datapoints out of the larger averages?

This isn't for customer billing. We don't bill customers on Mbps, but rather on total volume of GB transferred. That is an easy number to collect and doesn't depend on 5-minute intervals being successful. Right up until someone clears the counters ;-)
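
For contrast with the percentile math earlier in the thread, volume billing really is just a counter difference; a sketch (same wrap handling as before, and as noted it can't survive a mid-period counter clear):

    def gigabytes_transferred(octets_start, octets_end, counter_bits=64):
        # One reading at the start of the period, one at the end; missed
        # polls in between don't matter because the counter is cumulative.
        # A counter clear mid-period, on the other hand, loses the total.
        delta = octets_end - octets_start
        if delta < 0:                   # wrapped (or, alas, cleared)
            delta += 2 ** counter_bits
        return delta / 1e9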