Anyone from Verizon/TATA on here? Possible Packet Loss

Hi guys,

We host a web application for a client and they've been complaining that it's been slow since yesterday. It seems fast from the locations I've tested and the system looks fine, so I suspected there was packet loss going on somewhere between them and our colo facility.

I did a few trace routes from our firewall to the client's IP and most of the time they look fine; however, I occasionally see some packet loss.

Good trace route:

1 65.61.0.97 0 msec 0 msec 0 msec
2 107.1.118.217 0 msec 10 msec 0 msec
3 69.139.194.21 0 msec 0 msec 0 msec
4 68.86.147.129 10 msec 10 msec 20 msec
5 68.86.94.169 20 msec 30 msec 20 msec
6 68.86.86.26 20 msec 20 msec 10 msec
7 216.6.87.97 10 msec 20 msec 20 msec
8 216.6.87.34 10 msec 20 msec 10 msec
9 152.63.34.22 20 msec 10 msec 20 msec
10 130.81.28.255 30 msec 30 msec 20 msec

Traceroutes with packet loss (8th hop):

1 65.61.0.97 0 msec 0 msec 0 msec
2 107.1.118.217 0 msec 10 msec 0 msec
3 69.139.194.21 0 msec 0 msec 0 msec
4 68.86.147.129 20 msec 10 msec 10 msec
5 68.86.94.169 20 msec 20 msec 30 msec
6 68.86.86.26 20 msec 20 msec 10 msec
7 216.6.87.97 10 msec 20 msec 30 msec
8 216.6.87.34 10 msec * 10 msec
9 152.63.34.22 140 msec 110 msec 20 msec
10 130.81.28.255 20 msec 30 msec 30 msec

1 65.61.0.97 10 msec 0 msec 0 msec
2 107.1.118.217 0 msec 10 msec 0 msec
3 69.139.194.21 0 msec 0 msec 0 msec
4 68.86.147.129 20 msec 20 msec 10 msec
5 68.86.94.169 30 msec 20 msec 20 msec
6 68.86.86.26 20 msec 20 msec 20 msec
7 216.6.87.97 20 msec 10 msec 20 msec
8 216.6.87.34 20 msec 40 msec *
9 152.63.34.22 20 msec 10 msec 10 msec
10 130.81.28.255 30 msec 30 msec 20 msec

It appears the 8th hop occasionally has packet loss.

Thanks,
Derek

This is not the proper way to interpret traceroute information. Also, 3
probes per hop are not sufficient to estimate packet loss statistically.

I suggest searching the archives regarding traceroute, or googling how to
interpret them with regard to packet loss, as what you posted does not
indicate what you think it does.
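
To put a rough number on why three probes say so little, here is a minimal Python sketch that computes a Wilson score interval for the loss rate. The function name, the 95% z-value, and the 9-of-100 comparison sample are illustrative assumptions, not anything taken from the traces above; it also assumes probes are independent, which real traffic may not honor.

```python
import math

def wilson_interval(losses, probes, z=1.96):
    """Approximate Wilson score interval for the true loss rate.

    z=1.96 corresponds to roughly 95% confidence (an assumption;
    pick whatever level suits you).
    """
    if probes <= 0:
        raise ValueError("probes must be positive")
    p = losses / probes
    denom = 1 + z ** 2 / probes
    center = (p + z ** 2 / (2 * probes)) / denom
    half = z * math.sqrt(p * (1 - p) / probes + z ** 2 / (4 * probes ** 2)) / denom
    return max(0.0, center - half), min(1.0, center + half)

# One lost probe out of three, as in a single traceroute hop.
low3, high3 = wilson_interval(1, 3)
# A hypothetical larger sample for comparison: 9 lost out of 100.
low100, high100 = wilson_interval(9, 100)

print(f"1/3 lost:   {low3:.1%} to {high3:.1%}")
print(f"9/100 lost: {low100:.1%} to {high100:.1%}")
```

The interval for one loss in three spans most of the range from zero to one, so a single `*` in a traceroute is compatible with almost any underlying loss rate.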

-Blake

Agreed. Derek should read "A Practical Guide to (Correctly)
Troubleshooting with Traceroute":
http://www.nanog.org/meetings/nanog45/presentations/Sunday/RAS_traceroute_N45.pdf

Thanks guys. That was an informative read. I will do some more troubleshooting.

Derek

After some further troubleshooting, I believe I have narrowed down the issue to one of Verizon's routers (130.81.28.255).

ping 130.81.28.255 repeat 100
Type escape sequence to abort.
Sending 100, 100-byte ICMP Echos to 130.81.28.255, timeout is 2 seconds:
?!!!!!!!!?!!!!!!!?!!!!!!!!?!!!!!!!!!!!!!!!?!!!!!!!!!!!!!!?!!!!!!!!!!!?
!!!!!!!!!!!!!!!!!!!!!!?!!!?!!!
Success rate is 91 percent (91/100), round-trip min/avg/max = 20/26/30 ms

I had my client send me the output of the ping command (100 pings) and a trace route.

Their 5th hop is 130.81.28.254 and one of the response times in their trace route was 175ms so the issue seems to be around there.

I asked them to open a ticket with Verizon to take a look.

Thanks,
Derek

Many (most?) routers deprioritize ICMP messages. Direct pings against the router are not informative about transit failures.

That router might be experiencing a high CPU load and so be unable to reply to ICMP in a timely manner, or QoS policies may be affecting the replies, depending on the kind of traffic the router deals with.

If packets are only being delayed/lost on that segment, I would start my analysis there.

I'm at home now. I also have Verizon FiOS and believe I am seeing the same thing our client saw. So you guys are saying that the response times in traceroutes might not always be accurate because routers deprioritize ICMP messages. Does that mean values from MTR aren't accurate? I fired up MTR and took 2 screenshots (http://imgur.com/a/RDyXO). What do you guys think? Most of the time the ping times seem fairly low; however, I occasionally see these spikes. It seems sporadic...

My boss also has FiOS and he is seeing the same thing. Pages load quickly most of the time and sometimes take a while to load.

Thanks,
Derek

To recap, traceroute, mtr, and similar utilities work by talking to each
successive router along a path. Because this is so, and because Any Given
Router may be too busy to deal with such packets in favor of "real" traffic
(most routers handle data packets on the line cards, while they may have
to expend actual CPU on things like ICMP), it's possible for a path with
perfect connectivity to show some intermediate hops completely missing --
No Reply At All, you might say -- to diagnostic tools.

The traces you show look pretty decent; I've seen much worse on links
with fine interactive shell session response. The time you have to
worry is when one router *and everything past it* shows packet loss of
roughly the same amount, or when ping times jump markedly at a given
spot (by which I mean, say, from 32 to 800ms, rather than from 32 to 125).
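
As a rough illustration of that rule of thumb, here is a small Python sketch that takes per-hop loss percentages (for example, copied by hand from an mtr report) and looks for the first hop whose loss carries through, at roughly the same level, to the last hop. The function name and the `threshold` and `tolerance` defaults are made-up assumptions for the example, not values from any tool.

```python
def first_persistent_loss_hop(loss_pct, threshold=1.0, tolerance=5.0):
    """Return the 1-based hop number where loss begins and persists.

    loss_pct:  per-hop loss percentages, in path order.
    threshold: loss below this is treated as noise (assumed value).
    tolerance: how far later hops may differ from the first lossy hop
               and still count as "roughly the same" (assumed value).
    Returns None when no hop's loss carries through to the final hop.
    """
    for i, loss in enumerate(loss_pct):
        if loss < threshold:
            continue
        tail = loss_pct[i:]
        if all(x >= threshold and abs(x - loss) <= tolerance for x in tail):
            return i + 1
    return None

# Loss at hop 8 only, with clean hops after it: most likely just a router
# that is slow to answer probes, not loss on the path.
print(first_persistent_loss_hop([0, 0, 0, 0, 0, 0, 0, 33, 0, 0]))

# Loss that starts at hop 4 and shows up on every later hop: worth a look.
print(first_persistent_loss_hop([0, 0, 0, 10, 9, 11, 10]))
```

The first call finds nothing to report because the hops past the lossy one are clean; the second points at hop 4, where the loss begins and continues to the destination.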

The short version, though, which most people are being uncharacteristically
too nice to say :-) is that this is still a tier 1 problem, and NANOG is
generally tier 3 or 4. :-)

You're welcome to take the issue up over on outages@outages.org, if you
like...

Cheers,
-- jra

Thanks guys. Sorry for the noise...

Derek