And how exactly would you interpret the number returned by net_loss (int), in a column called "LOSS", in reference to reachability of a "hop" between two end points:
int net_loss(int at)
{
    /* nothing has been sent and answered (or timed out) yet */
    if ((host[at].xmit - host[at].transit) == 0)
        return 0;
    /* loss percentage, scaled by an extra factor of 1000 */
    return 1000 * (100 - (100.0 * host[at].returned / (host[at].xmit - host[at].transit)));
} ?
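For anyone who wants to check the arithmetic, here is a minimal standalone sketch of the same formula. Only the field names xmit, transit and returned come from the snippet above; the struct hop_stats, the helper names and the reading of transit as "probes still in flight" are assumptions made for illustration, not mtr's actual definitions.

```c
#include <assert.h>

/* Hypothetical stand-in for one entry of the host[] table quoted above. */
struct hop_stats {
    int xmit;     /* probes sent towards this hop */
    int transit;  /* assumed: probes still in flight, no verdict yet */
    int returned; /* replies received from this hop */
};

/* Same formula as net_loss(): loss percentage multiplied by 1000. */
int loss_millipercent(const struct hop_stats *h)
{
    int settled = h->xmit - h->transit;
    assert(settled >= 0);
    if (settled == 0)
        return 0; /* avoid dividing by zero before any probe has settled */
    return (int)(1000 * (100 - (100.0 * h->returned / settled)));
}

/* Convert the scaled value back to a plain percentage for display. */
double loss_percent(const struct hop_stats *h)
{
    return loss_millipercent(h) / 1000.0;
}
```

With 10 probes sent, none in flight and 9 replies, loss_millipercent() gives 10000, which reads as 10.000% loss in the "LOSS" column.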
If the hop(s) following the one you see loss for show no loss, then disregard the loss for that hop. Whatever it is, it obviously does not affect transit, which is what you really want to know.
I'd interpret it to mean you're hitting a control plane policer or
somesuch, with no actual bearing on end-to-end performance, judging
from the diagnostic output you've graciously provided us with.
I find myself giving this lecture several times a week to random
"gamer" customers upset that intermediary routers don't reply to their
pings at full line rate; I'd expect slightly better critical thinking
skills from the posters on this list, but I've been wrong before.
This is one of the most misunderstood concepts in properly reading output from a traceroute (mtr, visualtraceroute, whatever). Basically you are seeing loss of packets destined directly TO that router, not THRU it. Most often this is caused by 1) the router having rate limits applied to these packets so as not to bog down the CPU while it’s trying to perform its main function, forwarding packets, or 2) the router already being busy and placing a low priority on responding to those packets so as to leave CPU available for forwarding packets.
You can see from the trace that hops after that don’t show any loss. If that router were actually causing loss, then you would see the loss continue thru the rest of the trace. Since you don’t, you can assume that the router is experiencing one of the cases above. Of course there are always exceptions, but 99.9% of the time this is the case. This same concept applies to latency as well. If you see only a single hop with a high response time and everything afterwards is normal, it’s the same situation, except the router is taking longer to respond to you rather than ignoring you. You can test this by simply pinging the end destination: do you see the same loss and/or high latency? If not, you can disregard it.
And while we’re on the subject of reading this output, remember that traces only show you the forward path, not the reverse. Thanks to the wonders of asymmetric routing, at times it could be the return path that actually has the loss on it; the loss shown in the forward path only gives you an idea of where to begin troubleshooting.
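The rule of thumb above can be written down as a small check. This is only a sketch of the heuristic, not something mtr or traceroute does for you: the function name loss_persists, the loss_pct[] array layout (one entry per hop, destination last) and the tolerance parameter are all made up for illustration, and the usual exceptions still apply.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns 1 if the loss seen at hop i is also seen at every later hop,
 * i.e. it plausibly affects transit traffic; returns 0 otherwise.
 * loss_pct[] holds per-hop loss percentages, last entry = destination.
 * tolerance is a hypothetical noise threshold in percentage points.
 */
int loss_persists(const double *loss_pct, size_t n, size_t i, double tolerance)
{
    assert(loss_pct != NULL && i < n);
    if (loss_pct[i] <= tolerance)
        return 0; /* no meaningful loss at this hop */
    for (size_t k = i + 1; k < n; k++) {
        if (loss_pct[k] <= tolerance)
            return 0; /* a later hop is clean: only replies from hop i are lost */
    }
    return 1; /* loss carries through to the destination */
}
```

A trace with 40% loss at one middle hop and 0% everywhere after it returns 0 for that hop, which is the "disregard it" case; a trace where the loss starts at some hop and continues to the end returns 1 for that hop.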
I'd be interested in a discussion of this as well. To answer a slightly
different question, I usually point the "ping and traceroute" geeks to
Karl's wonderful treatise on the subject: http://www.iwl.com/Resources/Papers/icmp-echo_print.html.
> > If the hop(s) following the one you see loss for shows no loss, then
> > disregard the loss for that hop, obviously whatever it is, it does not
> > affect transit, which is what you really want to know.
> >
> > Is that correct?
> >
> This is one of the most misunderstood concepts in properly reading output
> from a traceroute (mtr, visualtraceroute, whatever). Basically you are
> seeing loss of packets destined directly *TO* that router, not THRU it. Most
no... not destined TO the router, destined THROUGH the router, but with a
TTL that happens to hit 0 ON that router.
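To make the TTL point concrete, here is a toy model of which router ends up answering a probe. It is a simulation under simplified assumptions (every router decrements the TTL by exactly one, nothing else drops the packet); the function name expiring_hop and its parameters are invented for this sketch.

```c
#include <assert.h>

/*
 * Toy model: a probe leaves the source with initial_ttl and has to cross
 * path_len routers before the destination. Each router decrements the
 * TTL; the router where it reaches 0 drops the probe and is the one that
 * sends back "time exceeded".
 * Returns the 1-based index of the answering router, or 0 if the probe
 * makes it all the way to the destination.
 */
int expiring_hop(int initial_ttl, int path_len)
{
    assert(initial_ttl > 0 && path_len >= 0);
    int ttl = initial_ttl;
    for (int hop = 1; hop <= path_len; hop++) {
        ttl--;              /* every router on the path decrements */
        if (ttl == 0)
            return hop;     /* expired here: this router has to reply */
    }
    return 0;               /* survived every router on the path */
}
```

The probe is addressed to the far end in every case; only the starting TTL decides which router along the way has to generate the reply.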
Very true. Most backbone kit on a tier 1 network is designed to switch
packets in a distributed fashion, shifting packets between ports/cards
over a backplane of some sort. On such kit, generating things such as a
TTL-exceeded packet is usually punted to a central processor (whose
primary task is to build route tables to hand off to the cards), which
deals with the task in a much slower and much lower priority way than
packets which transit the routing device. You also don't want your
central processor to have to deal with too much of this sort of thing,
which is (at least one of the reasons) why it's often rate limited.
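As a rough illustration of the rate limiting described above, here is a generic token bucket. It is not any vendor's actual policer: the struct policer, its fields and the function policer_allow are hypothetical, and real kit differs by platform and configuration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical token bucket guarding the central processor. */
struct policer {
    double tokens; /* replies currently allowed */
    double burst;  /* bucket depth (maximum tokens) */
    double rate;   /* tokens added per second */
    double last;   /* time of the last update, in seconds */
};

/* Returns 1 if a TTL-exceeded reply may be generated at time now, else 0. */
int policer_allow(struct policer *p, double now)
{
    assert(p != NULL && now >= p->last);
    p->tokens += (now - p->last) * p->rate; /* refill for elapsed time */
    if (p->tokens > p->burst)
        p->tokens = p->burst;               /* never exceed the bucket depth */
    p->last = now;
    if (p->tokens >= 1.0) {
        p->tokens -= 1.0;                   /* spend one token on this reply */
        return 1;
    }
    return 0;                               /* over the limit: no reply sent */
}
```

Probes that arrive once the bucket is empty simply go unanswered, which shows up as "loss" at that hop even though transit packets never touch this code path.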
which is also misunderstood by just about everyone but anyway... 'not
affecting transit' for reasons cited by yourself and min and adam already,
yes.
If your clocks are accurately synced, you can even get unidirectional
delay.
I usually run it like this:
./rtt -v <host>
you will need to run ./rtt_resp on the far end host.
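The unidirectional delay idea comes down to simple timestamp arithmetic. The sketch below is not how rtt or rtt_resp work internally (that is not shown here); the struct probe_times and its four timestamp fields are assumptions for illustration, with all values in microseconds.

```c
#include <assert.h>

/* Timestamps in microseconds; all four names are hypothetical. */
struct probe_times {
    long long t_send;     /* near-end clock: probe leaves */
    long long t_far_recv; /* far-end clock: probe arrives */
    long long t_far_send; /* far-end clock: reply leaves */
    long long t_recv;     /* near-end clock: reply arrives */
};

/* Forward one-way delay; only meaningful if both clocks are in sync. */
long long forward_delay_us(const struct probe_times *t)
{
    return t->t_far_recv - t->t_send;
}

/* Reverse one-way delay; same clock-sync caveat applies. */
long long reverse_delay_us(const struct probe_times *t)
{
    return t->t_recv - t->t_far_send;
}

/* Round trip minus far-end processing time; needs no clock sync. */
long long round_trip_us(const struct probe_times *t)
{
    assert(t->t_recv >= t->t_send);
    return (t->t_recv - t->t_send) - (t->t_far_send - t->t_far_recv);
}
```

With synced clocks the two one-way figures add up to the round trip, and a large gap between them is a hint that the forward and return paths differ.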
You can also use iperf or similar tools to help customers
diagnose network problems, but an easy/lightweight daemon on a few
hosts is always fairly easy to play with in a quick-and-dirty way...
I was really just pointing out that 'traceroute' or 'mtr' send packets
with increasing TTL to show 'loss' or 'delay' from place to place, I
wasn't trying to debate the ever-changing reasons why backbone equipment
might or might not answer 'ttl-expired' or 'unreachable' (or any
'exception traffic' really) in a 'timely' fashion. That issue changes with
the wind/os/hardware/model....
nice to see L3 sending in the answer police though. Thanks!
Yeah, it was a sweeping generalisation, hence the excessive use of words
such as "usually" and "most". I was trying to put the point across as
to why things are like this, for those that might be wondering why. The
main point was actually that the ability of a device (router, web server
etc) to deal with stuff _like_ ICMP message generation does not reflect
its ability to perform its main task.