Initial Route Server Stats for MAE-East

sorry if this seems so trivially obvious, but depending on what you are
trying to measure, sources of significant signal noise *are* one thing
to consider when doing an experimental design. using pings to Cisco
routers produces a signal with a number of different components, some
real signal, some noise, depending on what you are trying to
measure.

so exactly what ARE we trying to measure??? I might have a suggestion
if the problem was well-formed.

for instance, if you want to measure "router busy-ness" then CPU load
and SSE misses are probably good candidates. if you are trying to
find out WHY the routers are busy, then you have to take more data

If you are trying to establish how effectively (successfully?) the
routers are forwarding packets across MAE-EAST, then I believe you need
to do something more sophisticated than just ping routers. you could
inject "dye" packets which get forwarded and then you monitor how many
get across what paths and when.
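
To make the "dye" packet idea concrete, here is a minimal sketch of the bookkeeping only, not of the injection itself. It assumes each dye packet carries a unique id and that some collector on the far side reports which ids arrived; the function name, the path labels, and the sample data are all hypothetical, and timestamps (the "when") are left out to keep it short.

```python
from collections import defaultdict

def delivery_by_path(sent, received_ids):
    """Tally dye packets per path.

    sent:         iterable of (packet_id, path_label) for every injected packet
    received_ids: set of packet ids observed on the far side (assumed input)
    Returns {path_label: (delivered, injected, delivered_fraction)}.
    """
    injected = defaultdict(int)
    delivered = defaultdict(int)
    for packet_id, path in sent:
        injected[path] += 1
        if packet_id in received_ids:
            delivered[path] += 1
    return {
        path: (delivered[path], injected[path], delivered[path] / injected[path])
        for path in injected
    }

# Made-up example: six dye packets over two hypothetical paths.
sent = [(1, "A->B"), (2, "A->B"), (3, "A->C"),
        (4, "A->C"), (5, "A->C"), (6, "A->B")]
received_ids = {1, 2, 3, 6}
stats = delivery_by_path(sent, received_ids)
for path, (ok, total, frac) in sorted(stats.items()):
    print(f"{path}: {ok}/{total} delivered ({frac:.0%})")
```

The point of tallying per path is that it measures forwarding across the exchange directly, rather than inferring it from how quickly a router answers pings addressed to itself.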

is this a lot more complex than just pinging routers? yes; knowing facts
is often a lot harder than handwaving.

but the issue here *is* scientific experimental design.

You have to define WHAT you want to measure before you can design an
experiment to measure it. then you have to analyze the experiment to
verify whether it actually measures what you want. then you run the
experiment several times and analyze the output.

and most importantly, if you do a good enough job of defining WHAT you
want to measure, maybe some other enterprising folks will devise a
different method to measure the same phenomenology and run independent
experiments. this way you learn whether you are really measuring what
you thought and can compare results.

and if you do a good enough job of all this, you get to call it "science."

otherwise, you are just collecting data and beating on graphing

And while I very much appreciate the intuitive bent which gets us all
through the day, I haven't heard anything that looks like a definition
of what we are trying to measure.


One last set of comments. Mike asks what the measurement is for. When
BGP sessions break it is sometimes because of partial or complete packet
loss. I think some goals here include:

1) we want to be able to correlate BGP session failures with packet loss,
2) we would like to understand how the NAP performance contributes to
routing stability.

It is worth noting that some interesting correlations result as a
side effect: specifically, a strong correlation between packet loss
and gigaswitch utilization, and between packet delay and gigaswitch
utilization. Not perfect correlation, but somewhat strong.
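
For anyone who wants to put a number on "somewhat strong", one common choice is the Pearson correlation coefficient over paired samples. The sketch below is an assumption about how one might compute it, not a description of how the stats above were produced; the sample values are invented purely for illustration.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sample lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length, at least 2 samples")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

# Invented samples: switch utilization (%) and packet loss (%) per interval.
utilization = [10, 20, 30, 40, 50]
loss = [0.5, 1.0, 1.2, 2.5, 3.0]
r = pearson(utilization, loss)
print(f"r = {r:.2f}")
```

A value near 1 means the two series rise together; it says nothing about which one causes the other, which is exactly the experimental design question raised earlier in the thread.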

PS -
Dun & I did some further data analysis and calculated that 1% of the packet
loss can be attributed to the last of 5 packets (20%) transmitted having
a response time of greater than a second (which occurs 5%-10% of the time).
That is, the last packet response doesn't come back in the 1 second
timeout period and stats show this as a loss. The first four packets can
have delays > 1 second and still be counted, but the last cannot. We can attribute
approx. 1%-2% of the packet loss to this phenomenon.
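
The arithmetic behind that estimate, as I read it, can be sketched as follows. The assumptions are mine: the last packet is 1 of 5 (20%) of what is sent, a response takes longer than one second 5%-10% of the time, and the result is expressed as a fraction of all packets sent. The function name is hypothetical.

```python
def last_packet_artifact(packets_per_burst, p_slow_response):
    """Fraction of all sent packets miscounted as lost because the reply
    to the last packet in a burst arrived after the timeout.

    packets_per_burst: packets sent per measurement (5 in the text above)
    p_slow_response:   probability a reply takes longer than the timeout
    """
    share_of_last_packet = 1 / packets_per_burst
    return share_of_last_packet * p_slow_response

low = last_packet_artifact(5, 0.05)
high = last_packet_artifact(5, 0.10)
print(f"{low:.1%} to {high:.1%}")  # prints "1.0% to 2.0%"
```

If the timeout or the burst size were changed, the same two-factor product would give the new size of the artifact, which is one way to check whether it really is a measurement effect rather than real loss.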