The issue isn't knowing everything, it's making accusations of issues while you still
don't know how much you don't know. (~D. Rumsfeld) -- My customers in a nutshell
(they pay to be able to yell about random stuff I guess, and I provide that service!).
The OP didn't make any accusations however, and just asked what was going on (sorry
if I sounded harsh in reply). Once, Google having a 22.214.171.124 failure locally on
its (anycast?) DNS servers resulted in dozens of calls to us: "your server
hosting our site must be down!! Our website isn't working! People are calling us!".
Most of my work in these situations is spent proving it's not our fault.
mtr makes that very hard because it's a very subtle tool, and only gives partial
information. (I still think mtr is a killer app though!)
Consider this (fake, example) trace (columns: hop, host, loss %, packets sent):
6. 100ge13-1.core1.chi1.he.net 0.0% 10
7. 100ge14-1.core2.chi1.he.net 0.0% 10
8. 100ge3-1.core1.sjc2.he.net 30.0% 10
10. UNKNOWN-216-115-101-X.yahoo.com 10.0% 10
11. routerer-ext.ysv.freebsd.org 20.0% 10
12. wfe0.ysv.freebsd.org 30.0% 10
First off, the OP may have asked "whose fault is hop 9, Yahoo or HE?" and seen it
as an issue (hop 9 never answers, so it doesn't show up in the trace at all).
Ignoring that for now, the rest of the packet loss is an issue --
where is the problem though?
This is very tricky - it looks like hop 8 is at fault, of course - or is it
just dropping ICMP as it's allowed to? How did hop 10 get only 10% loss then, if
8 has 30? Is 8 then dropping ~20% (not statistically correct) of ICMP just cuz
it can, and then having a 'real' 10% loss on top of that?
Or is it hop 11? But hop 12 has more PL, so perhaps hop 12 is the issue
all along and 8, 10 and 11 are just dropping ICMP? Or is it 8, 11 and 12 doing
~10% each? (Not statistically correct.)
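
On the "not statistically correct" bits: if you assume the loss stages are independent (my assumption, real routers may not be that polite), the rates compound rather than add, because a packet has to survive every stage. A rough sketch, with `combined_loss` being a made-up helper name:

```python
def combined_loss(*rates):
    """Overall loss across chained, independent loss stages.

    A packet is delivered only if it survives every stage, so we
    multiply the survival probabilities and subtract from 1.
    """
    survive = 1.0
    for rate in rates:
        if not 0.0 <= rate <= 1.0:
            raise ValueError("loss rates must be between 0 and 1")
        survive *= 1.0 - rate
    return 1.0 - survive


# 20% ICMP rate limiting stacked on a 'real' 10% loss: 28%, not 30%
print(round(combined_loss(0.20, 0.10), 3))

# three hops each losing 10%: about 27%, not 30%
print(round(combined_loss(0.10, 0.10, 0.10), 3))
```

With only 10 probes per hop the difference between 28% and 30% is invisible anyway, which is rather the point.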
Can't say for sure - it's a probabilities game - and to be completely correct
about it, hop 6 isn't blameless either (it's just very unlikely to be at fault
statistically, though not impossible with only 10 pings per hop - a statistician
can calculate it for us).
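
Playing statistician for a moment: a minimal sketch of how little a clean 0.0% over 10 pings really proves. It assumes each probe is lost independently with the same probability, which is a simplification, and both function names are just illustrative.

```python
def p_clean_report(true_loss, probes):
    """Chance that a hop with the given true loss rate shows 0 lost probes."""
    return (1.0 - true_loss) ** probes


def max_loss_hidden(probes, confidence=0.95):
    """Largest true loss rate still consistent with 0 lost probes.

    Solves (1 - p) ** probes = 1 - confidence for p, i.e. the one-sided
    upper bound you get when nothing at all was lost.
    """
    return 1.0 - (1.0 - confidence) ** (1.0 / probes)


# A hop that really drops 10% still looks perfect about 35% of the time
print(round(p_clean_report(0.10, 10), 3))

# Upper bound on hidden loss after 10 clean probes vs 100 clean probes
print(round(max_loss_hidden(10), 3))
print(round(max_loss_hidden(100), 3))
```

So under these assumptions, 10 clean pings can't rule out a true loss of roughly a quarter, while 100 clean pings push that bound down to about 3%.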
This is why more pings are required to be sure of the situation - I like to do
-i 0.1 -c 100 so it completes quickly, before conditions change. Then you
can make a statistically valid pronouncement of where the problem MIGHT BE
within a useful confidence interval - however, without the return route we're
still largely in the dark as to the actual location of the issue. You can't be
'100% sure' with this stuff - technically speaking, it's all 'luck of the draw'.
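
To put a number on "useful confidence interval", here is a sketch using the Wilson score interval for a loss proportion. That's my choice of method, not something mtr reports, and it again assumes independent probes; `wilson_interval` is a hypothetical helper.

```python
import math


def wilson_interval(lost, sent, z=1.96):
    """Wilson score interval for the true loss rate (z=1.96 is ~95%)."""
    if sent <= 0 or not 0 <= lost <= sent:
        raise ValueError("need 0 <= lost <= sent and sent > 0")
    p = lost / sent
    denom = 1.0 + z * z / sent
    centre = (p + z * z / (2.0 * sent)) / denom
    half = z * math.sqrt(p * (1.0 - p) / sent + z * z / (4.0 * sent * sent)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


# Same observed 30% loss, very different certainty
for lost, sent in ((3, 10), (30, 100)):
    low, high = wilson_interval(lost, sent)
    print(f"{lost}/{sent} lost: true loss plausibly {low:.1%} to {high:.1%}")
```

With 10 pings, a reported 30% is compatible with anything from roughly 11% to 60%; with 100 pings it narrows to roughly 22% to 40%. Better, but still only about the forward path as seen through ICMP replies.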
(Beware: this one time, at band camp, some EtherChannel or equiv at HE was
showing PL only for specific IPs in any target subnet -- because they were XOR'ing
the source & target IP to load balance and one channel was wonky. Fun times
debugging that one: "WFM from here, what's your issue?")
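
For the curious, a toy model of why that looks so weird from the outside. I don't know what hash the real gear used; this assumes the simplest possible scheme, XOR of the two addresses modulo the number of member links, and `pick_link` plus the documentation-range addresses are made up for illustration.

```python
import ipaddress


def pick_link(src, dst, n_links):
    """Toy load-balancing hash: XOR the two IPv4 addresses, mod link count."""
    s = int(ipaddress.IPv4Address(src))
    d = int(ipaddress.IPv4Address(dst))
    return (s ^ d) % n_links


BAD_LINK = 3  # pretend this member link is the wonky one
src = "192.0.2.10"

# Neighbouring targets in one subnet land on different member links,
# so only some of them ever cross the broken one.
for last_octet in range(1, 9):
    dst = f"198.51.100.{last_octet}"
    link = pick_link(src, dst, 4)
    status = "LOSS" if link == BAD_LINK else "ok"
    print(f"{src} -> {dst}: link {link} {status}")
```

Change the source address and a different set of targets breaks, which is exactly how you end up with "WFM from here".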