Phantom packet loss is shown by pathping in connection with asymmetric routing, although there is no real loss.

Hello colleagues,

Maybe one of you can help me understand the phenomenon of packet loss
when using asymmetric routing?

I have customers who are complaining about packet loss, and they are
providing me with MTRs and pathpings (pathping is a sort of traceroute that
pings every hop it sees several times; it comes with Windows XP) that show
the loss starting at my routers and continuing through to their server (the
last hop). All users are coming from a dial-up network where the path from
them to our servers goes via a different carrier than the one we use to
route the traffic back to the dial-up user.
The interesting thing is that there is no loss at all when the users either
use a plain ping instead of pathping/mtr, or when I run a ping or even an
mtr from my server towards the dial-up customer.

The nasty thing is that there is de facto NO LOSS on the line, but the users
are seeing some sort of phantom loss.

The problem immediately disappears when I change the return path to the same
carrier as the path to us, so that we have symmetric routing again.

My assumption is that pathping and mtr somehow get confused by the ICMP
messages due to wrong timing or something like that. Any ideas?

Thanks,
Gunther

I have customers who are complaining about packet loss and they are
providing me with MTRs and pathpings (that's some sort of traceroute that
pings every hop it sees several times - comes with windows xp)

(if it comes with win xp, then that sounds interesting-yet-surprising -- it's more usually found at <http://www.bitwizard.nl/mtr/>).

[...]

The nasty thing is that there is de facto NO LOSS on the line, but the users
are seeing some sort of phantom loss.

The starting point for any investigation like this is to compare the traceroute that apparently shows loss or other problems with traceroutes from strategic points in the path back to the source.

If there's a congestion problem which is the cause of the concern then comparing traceroutes in both directions will usually help find it.

If there's no congestion problem, or the apparent problem is unusual latency or loss in the numbers mtr displays for particular routers in the path, then mtr's ICMP echo requests towards the control elements of particular routers are probably being deliberately rate-limited by the operators of those routers.
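As a rough way to read the per-hop numbers, here is a small Python sketch. It relies on a common rule of thumb that is not stated in this thread, so treat it as an assumption: loss shown at an intermediate hop only points to forwarding loss if it carries through to the final hop, while loss that exceeds what the destination shows more likely means that router is limiting its own ICMP replies. The function name and the sample figures are made up for illustration.

```python
def classify_hops(loss_by_hop):
    """Label each hop of a traceroute-style report.

    loss_by_hop: non-empty list of loss percentages, one per hop,
    with the last entry being the destination.
    Returns a list of (hop_number, loss, verdict) tuples.
    """
    final_loss = loss_by_hop[-1]
    last_hop = len(loss_by_hop)
    results = []
    for hop, loss in enumerate(loss_by_hop, start=1):
        if loss == 0:
            verdict = "clean"
        elif hop == last_hop:
            # Loss at the destination is what the user actually experiences.
            verdict = "end-to-end loss"
        elif loss > final_loss:
            # More loss than the destination sees: the excess was never
            # dropped in forwarding, so the router's replies are suspect.
            verdict = "likely rate-limited replies"
        else:
            verdict = "possibly real loss"
        results.append((hop, loss, verdict))
    return results


if __name__ == "__main__":
    # Hypothetical report: hop 2 shows 40% loss, destination shows none.
    for row in classify_hops([0, 40, 0, 0]):
        print(row)
```

Note that the case in this thread, where the loss is reported all the way to the last hop, would not be explained away by this heuristic alone, which is why comparing traces from both ends matters.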

Joe

Try varying the mtr interval, such as "-i .1" (must be root for <1). Does the packetloss significantly increase with this faster mtr? Try slower "-i 10". Does the packetloss significantly decrease or go away?

If the answer to both above questions is yes, then I would suspect ICMP rate limiting.
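The two questions above can be written down as a tiny decision helper. This is only a sketch: the function name, the three loss figures you would feed it (measured at a fast, a default, and a slow interval), and the 5-point margin for "significantly" are all assumptions, not anything mtr itself provides.

```python
def suspect_icmp_rate_limiting(loss_fast, loss_normal, loss_slow, margin=5.0):
    """Return True if loss grows with a faster probe interval and shrinks
    (or vanishes) with a slower one.

    All three arguments are loss percentages measured by hand at different
    probe intervals; margin is the hypothetical threshold for a
    'significant' change, in percentage points.
    """
    # Question 1: does loss significantly increase when probing faster?
    fast_is_worse = (loss_fast - loss_normal) >= margin
    # Question 2: does loss significantly decrease or go away when slower?
    slow_is_better = loss_slow == 0 or (loss_normal - loss_slow) >= margin
    return fast_is_worse and slow_is_better


if __name__ == "__main__":
    # Hypothetical measurements: 60% at -i .1, 20% at default, 0% at -i 10.
    print(suspect_icmp_rate_limiting(60.0, 20.0, 0.0))
```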

You could also try varying the speed of ping. Windows is pretty limited, but on unix you can do things like .1 second intervals ("-i .1" as root). Does a faster ping trigger this apparent loss? If so, ICMP rate limiting.

The only part that I don't get is that you can mtr to him without packetloss. Although the path in-between may be different, the final hop packetloss should exactly equal what he sees when mtring you. A round-trip is a round-trip, and results should be identical regardless of who originates. I can't think of any way this would be different unless echo and echo-reply were being rate limited independently.

My home ISP (apartment ethernet "t1" service, which is actually multiple T3s) has a Packeteer or something along that line. If I use ping, everything is fine since it goes so slow. If I use MTR, it works fine for the first few seconds then sees >90% packetloss on all hops from then on once the rate limiter burst bucket runs dry. Of course, TCP still sees no packetloss even when mtr is seeing this heavy rate limited loss...
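The "burst bucket runs dry" behaviour can be reproduced with a toy token-bucket model in Python. This is a minimal sketch under assumed numbers: a limiter that refills at 1 probe per second with a burst of 5 is purely hypothetical, and I have no idea what the real box is configured to do. It also treats the prober as a single stream, whereas mtr sends a probe per hop each interval, so the real effect would be stronger.

```python
def simulate_probes(interval_s, duration_s, rate_per_s=1.0, burst=5):
    """Return the fraction of probes dropped by a token-bucket limiter.

    interval_s: time between probes (like ping/mtr -i).
    duration_s: how long the test runs.
    rate_per_s, burst: hypothetical limiter settings.
    """
    tokens = float(burst)          # bucket starts full
    sent = 0
    dropped = 0
    for _ in range(int(round(duration_s / interval_s))):
        sent += 1
        if tokens >= 1.0:
            tokens -= 1.0          # probe passes and consumes a token
        else:
            dropped += 1           # bucket empty: probe is discarded
        # Refill for the time that elapses until the next probe.
        tokens = min(float(burst), tokens + rate_per_s * interval_s)
    return dropped / sent


if __name__ == "__main__":
    for interval in (1.0, 0.1):
        loss = simulate_probes(interval, 60.0)
        print(f"interval {interval}s: {loss:.0%} of probes dropped")
```

With these assumed settings, one probe per second never empties the bucket, while ten per second pass only during the initial burst and then lose most probes, which matches the pattern of ping looking clean and mtr looking terrible.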

The only part that I don't get is that you can mtr to him without
packetloss. Although the path in-between may be different, the final hop
packetloss should exactly equal what he sees when mtring you. A round-trip
is a round-trip, and results should be identical regardless of who
originates. I can't think of any way this would be different unless echo
and echo-reply were being rate limited independently.

If the tests were run at different times, then the packet loss
could be different. Perhaps the customer runs the tests during
his busy period, when he is concerned about making sure
there is no delay. Then, later in the day, after his busy
period is over, he takes the time to contact his ISP. The ISP
then runs some tests which show there is no packet loss
at all. To be sure this is not happening, synchronize the
tests and run them simultaneously.

Try tcptraceroute because this more accurately reflects
the traffic that is flowing.
http://michael.toren.net/code/tcptraceroute/

http://tracetcp.sourceforge.net/ is a windows tool
that is similar.

The open source tool LFT can be built to run on Windows
under cygwin http://pwhois.org/lft/ but they have this
warning on their page:

   Many people have complained about various problems on
   the Windows platform. Both LFT and the WhoB client
   compile and run well under Cygwin environments on
   Windows. Unfortunately, Microsoft's changes to the
   Windows IP stack (as of XP Service Pack 2) reduced
   their raw socket functionality significantly as part
   of their security bolstering process. These changes
   have effectively stopped LFT from working properly
   while using TCP. LFT's UDP tracing and other advanced
   features still work properly. For more information on
   Windows raw sockets, consult
   www.microsoft.com/technet/prodtechnol/winxppro/maintain/sp2netwk.mspx#EIAA

This may have nothing to do with your MTR issue, but it
does make one wonder whether a Windows machine is a safe
platform for performance testing. In any case, the LFT people
think that their non-TCP features still work properly
on Windows, and this is a tool that you can also run
on your end. Worth a try?

--Michael Dillon

I can't tell you what is going on. But I can ask: (a) why are you doing
asymmetric routing in the first place? and (b) is it possible that
the Microsoft versions of these tools are reporting errors BECAUSE of
the asymmetric routing?

For any non-trivial path, it seems to me that asymmetry in forward and return paths is normal. Symmetrical paths are the exception.

From another angle, how can anybody hope to ensure that all forward and return paths are identical when the only exit under their control is the one on the outbound path, at their own border?

Joe

If this is for their customers, it wasn't clear that the path went
outside their zone of control. I did wonder.