packet loss question

Hi all,

   I am writing because I do not understand what is happening. I ran mtr against our email server and www.teco.com, and the results are below. I am not a network engineer, so I am at a loss. I think what I am seeing may be a hand-off issue between Frontier and Level3 Miami2. If I am correct, what can I do?

   My system is running CentOS 6.5 Linux.

Thanks,

Phillip

(! 1011)-> sudo mtr -r netwolves.securence.com
HOST: xxxxx@netwolves.com        Loss%  Snt  Last  Avg  Best  Wrst  StDev
   1. 172.24.109.1 0.0% 10 0.6 0.6 0.6 0.7 0.0
   2. lo0-100.TAMPFL-VFTTP-322.gni 0.0% 10 3.2 2.0 1.0 4.3 1.2
   3. 172.99.44.214 0.0% 10 4.0 4.9 2.3 6.9 1.5
   4. ae8---0.scr02.mias.fl.fronti 0.0% 10 9.3 9.1 7.5 9.8 1.0
   5. ae1---0.cbr01.mias.fl.fronti 0.0% 10 8.9 9.1 7.6 9.7 0.7
   6. lag-101.ear3.Miami2.Level3.n 80.0% 10 9.0 8.9 8.8 9.0 0.1
   7. 10ge9-14.core1.mia1.he.net 0.0% 10 14.3 13.0 7.6 18.1 4.3
   8. 10ge1-1.core1.atl1.he.net 0.0% 10 25.6 33.2 22.4 99.7 23.6
   9. 10ge10-4.core1.chi1.he.net 0.0% 10 45.6 51.8 45.5 82.7 12.5
  10. 100ge14-2.core1.msp1.he.net 0.0% 10 53.6 63.9 53.6 125.2 21.8
  11. t4-2-usi-cr02-mpls-usinterne 0.0% 10 53.2 73.1 53.2 225.6 54.0
  12. v102.usi-cr04-mtka.usinterne 0.0% 10 53.2 53.9 53.2 55.3 0.6
  13. netwolves.securence.com 0.0% 10 53.4 53.9 53.4 55.4 0.7

(! 1014)-> sudo mtr -r www.teco.com
HOST: xxxxx@netwolves.com        Loss%  Snt  Last  Avg  Best  Wrst  StDev
   1. 172.24.109.1 0.0% 10 0.6 0.6 0.6 0.7 0.0
   2. lo0-100.TAMPFL-VFTTP-322.gni 0.0% 10 104.8 81.4 1.1 113.2 43.2
   3. 172.99.47.198 0.0% 10 115.0 77.8 2.9 115.0 40.2
   4. ae7---0.scr01.mias.fl.fronti 0.0% 10 111.1 80.2 8.5 113.5 41.3
   5. ae0---0.cbr01.mias.fl.fronti 0.0% 10 105.9 82.2 7.6 115.4 33.8
   6. lag-101.ear3.Miami2.Level3.n 70.0% 10 116.1 80.2 8.5 116.1 62.0
   7. NTT-level3-80G.Miami.Level3. 0.0% 10 110.0 81.5 9.0 120.3 41.9
   8. ae-3.r20.miamfl02.us.bb.gin. 0.0% 10 119.8 84.0 10.0 119.8 38.5
   9. ae-4.r23.asbnva02.us.bb.gin. 10.0% 10 137.4 107.6 30.1 142.7 45.7
  10. ae-2.r05.asbnva02.us.bb.gin. 0.0% 10 135.0 109.9 36.6 140.0 39.1
  11. xe-0-9-0-8.r05.asbnva02.us.c 0.0% 10 147.5 125.6 49.4 165.5 41.1
  12. 24.52.112.21 0.0% 10 158.6 124.0 49.6 161.3 41.5
  13. 24.52.112.42 0.0% 10 151.0 127.7 52.2 159.0 41.2
  14. ??? 100.0 10 0.0 0.0 0.0 0.0 0.0

Hi Philip,

I can't address your immediate concern, but I do have some hints
regarding traceroute:

1/ Please review the excellent presentation from RA{T,S}:
    https://www.nanog.org/meetings/nanog47/presentations/Sunday/RAS_Traceroute_N47_Sun.pdf
    https://www.youtube.com/watch?v=a1IaRAVGPEE

2/ When you run mtr, make sure to run it with the '-w' argument (wide
report mode) so the reverse DNS entries aren't truncated and are easier
to read.
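
    For example (a minimal sketch; the exact flags vary a bit between mtr
    versions, so check 'man mtr' on your CentOS box -- -c just sets the
    probe count):

        sudo mtr -rw -c 100 netwolves.securence.com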

Kind regards,

Job

No offence, but I swear that mtr should come with a license to use it. I get more
questions from people accusing us of network issues with mtr in hand...

You shouldn't care that there's 80% packet loss in the middle of your route, unless
you have actual traffic to lag-101.ear3.miami2.level3.net. I suspect you don't.
(If you did, you'd have mtr'd to it directly, of course.)

As for your second trace, the sudden jump from 0% packet loss on the second-to-last
hop to 100% on the last hop seems like firewalling to me. (Long discussion about the
probabilities of getting five 0%-loss hops in a row and 100% on an unfirewalled
endpoint elided. TL;DR: use more packets in your test: -i 0.1 -c 100, thanks.)
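
Something along those lines (a sketch using the flags suggested above; a sub-second
-i generally needs root, and very short intervals can also trip ICMP rate limiters
on the routers you're probing):

   sudo mtr -r -i 0.1 -c 100 www.teco.com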

If you have 0% packet loss to your target endpoint, is there an issue here?
What caused you to run mtr? 0% loss is pretty good. You could play Quake 1.0
through that loss and ping time. The +20 ms ATL<>CHI jump in the route you'd have
to take up with Einstein/Bill Nye/$deity.

For the 2nd trace, the 1st hop is your latency issue (plus the big jump from
Miami<>Ashburn; again, the limit is c).

ICMP is allowed to be dropped by intervening routers. Someone will quote an RFC
at us shortly.

Mtr without a return route is not that useful in figuring out packet loss,
because loss requires the packet to make it there and back. The loss could be anywhere
on the return route, which is probably not symmetrical. The internet stopped
being symmetrical 20+ years ago (if it ever even loosely was), so get a friend
to send you an mtr to your IP from the far side.
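
From their end that's just something like the line below, with your real public
address filled in (203.0.113.45 is only a placeholder, not the RFC1918 address
you see in hop 1):

   mtr -rw -c 100 203.0.113.45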

(I remember a project long ago, some cgi-bin (yeah, that long ago) that was
basically a full-path forward+reverse traceroute you could hit on a selected
server at the provider. Rather handy. Not sure if it's still a thing, or what it was
called.)

/kc

Phillip,

The data for netwolves.securence.com shows 0% loss between HOST and netwolves.securence.com. This is most certainly good. The 80% loss at the lag-101.ear3.Miami2.Level3.net hop simply indicates that that particular router was too busy to respond in a timely manner to an ICMP echo request because it was busy forwarding data traffic. There is no problem to solve for this connection.

The data for www.teco.com has a couple of busy hops. However, for as far as the trace succeeds (24.52.112.42) there is no effective loss end to end. The ??? response, similar to *** from traceroute, indicates that there is probably no route to the destination from that point. (Or there is a firewall blocking SNMP ECHO requests at that point.) Diagnosis may require contacting the operator of www.teco.com to confirm the system is actually online and operational. Contact information for teco.com is in whois.
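
One way to separate "firewall eating ICMP near the end" from "host actually down" is to probe the service itself rather than relying on echo requests. A rough sketch, assuming the system you care about is the web server on port 80 (the --tcp/--port options exist only in newer mtr builds; the curl check works anywhere):

   curl -sv -o /dev/null http://www.teco.com/
   sudo mtr -r -c 100 --tcp --port 80 www.teco.com

If the curl succeeds while the final hop never answers ICMP, a firewall is the likely explanation.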

Auxiliary information - traceroute data from a system in plymouth.mi.michigan.comcast.net shows similar results for both targets.

Are there other hosts difficult to reach?

James R. Cutler
James.cutler@consultant.com
PGP keys at http://pgp.mit.edu

sedit /blocking SNMP ECHO requests/blocking ICMP ECHO requests/

oops! dyslexic fingers.

James R. Cutler
James.cutler@consultant.com
PGP keys at http://pgp.mit.edu

Is it bad that the first thing that came to mind is "Oh FFS, another troll"?

Yes. It indicates that there was never a time when you did not know everything :)

-mel beckman

Hi Ken,

That's not correct. Routers might not generate an ICMP time-exceeded
packet for every packet whose TTL reaches zero, but that's not the
same thing. Routers dropping ICMP packets in transit would be bad.
Protocols like TCP depend on path MTU discovery and path MTU discovery
critically depends on ICMP.

Regards,
Bill Herrin

None taken,

   We are having issues with our email and with loading some web pages. I used mtr to try to find out if there is a possible connection issue. I just need to understand what is happening, and be able to explain the output showing the 80% packet loss. We are not pointing fingers, just looking to understand the issue better.

Thanks

And we know the whole internet handles MTU discovery properly
and doesn't just firewall all ICMP because 'hackers'.

(OP's issue may well be MTU discovery, especially if he's on broadband. Don't have
enough details. I just solved this exact problem a couple of weeks ago for a
client with a UBNT ERX by turning on its MTU hacking feature. Sites that
engaged in ICMP MTU blocking included cnn.com.)
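
(For the Linux-border-router crowd, the usual equivalent of that ERX knob is TCP MSS
clamping. A rough sketch with iptables, assuming eth0 is the WAN interface --
illustrative, not a drop-in config:

   iptables -t mangle -A FORWARD -o eth0 -p tcp --tcp-flags SYN,RST SYN \
       -j TCPMSS --clamp-mss-to-pmtu

That rewrites the MSS on SYNs heading out the WAN so TCP never tries to send
segments bigger than the path can carry, even when the ICMP frag-needed messages
get eaten.)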

I meant routers are allowed to drop ICMP request packets addressed to themselves, not
the packets to be transited. I wasn't clear.

/kc

Philip,

Quite often slow Web page loading and email transport -- termed an application-layer problem because basic transport seems unaffected -- is due to DNS problems, particularly reverse DNS for the IP addresses originating your Web queries. If you have non-existent or intermittent IN-ADDR entries for those IPs, the remote Web servers can be timing out if they have older configurations that, for example, do DNS lookups in order to log HTTP requests and block on completion, resulting in timeouts. Use "nslookup x.x.x.x" command line queries (nslookup is on Windows, Mac and UNIX/Linux) to see if you can resolve the public IP addresses your users originate queries from. You can find those addresses by visiting http://whatismyip.com from a problem desktop.
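
For example (203.0.113.45 is just a placeholder; substitute whatever public address whatismyip.com reports for your users):

   nslookup 203.0.113.45
   dig -x 203.0.113.45 +short

(dig comes from the bind-utils package on CentOS if it isn't already installed.)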

A second common cause of app-specific throughput problems, particularly where email is involved, is failed MTU discovery. The standard Internet MTU is 1500 bytes, but sometimes a router misconfiguration or change in encapsulation type along the path through your ISP lowers that to, say, 1492 or 1486 bytes (MTU is in increments of 8). The result is that whenever your web or email client sends a maximum MTU packet, the packet is dropped, resulting in connection impairment. Most HTTP and Email packets are not max-MTU in size, so you get very uneven performance simulating network congestion.

You can force the MTU to a lower number at your border to test this. You typically do this at your firewall; it's a setting on the WAN interface config. Temporarily lower that value dramatically, to something like 1440, and see if your problem goes away. If it does, you may need to permanently reduce MTU, so you should try other, progressively smaller values -- 1492, 1486, 1478, etc. -- until you find the largest one that works. I commonly see this when a customer switches ISPs from DSL to Cable. Cable providers are fond of stealing 8 or 16 bytes for their CMT headers in a way that breaks MTU discovery.
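
You can also probe the working path MTU directly from your CentOS box before touching the firewall. A quick sketch using Linux ping with the don't-fragment bit set; 1472 is 1500 minus the 28 bytes of IP and ICMP headers, so lower the -s size until replies come back and then add 28 to get the actual path MTU:

   ping -M do -s 1472 -c 4 www.teco.com
   ping -M do -s 1464 -c 4 www.teco.com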

A third frequent application-layer throughput debilitator is IPv6 misconfiguration. If you support IPv6 for your end users, they may be getting directed to IPv6 web or mail servers (which are generally preferred via DNS) but thwarted by IPv6 transport issues, which could be as simple as routing or MTU, or as complex as an invisible 6-over-4 NAT somewhere (such as at your upstream ISP). These problems generally require an IPv6-competent network engineer to resolve, but you can test by disabling IPv6 on your network (which also requires an IPv6-competent network engineer :)
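
Before bringing in that engineer, a quick way to see whether IPv6 is even the variable is to force each address family separately and compare (this assumes curl is installed and the site actually publishes an AAAA record; the hostname is just the one from your trace):

   curl -4 -s -o /dev/null -w 'IPv4: %{time_total}s\n' http://www.teco.com/
   curl -6 -s -o /dev/null -w 'IPv6: %{time_total}s\n' http://www.teco.com/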

I'm always amazed at how often these three causes are at the root of performance problems. So it's worth investigating each.

-mel beckman

All we are seeing here is control-plane filtering by intermediate routers. Unless packet loss numbers start at a router and the hops past it show the same or higher losses, it's not an actual issue with the transport path at that hop. Outside of your own domain of administrative control, you can't rely on intermediate routers responding to ICMP (they may filter it completely or rate-limit responses).

> And we know the whole internet handles MTU discovery properly
> and doesn't just firewall all ICMP because 'hackers'.
>
> (OP's issue may well be MTU discovery, especially if he's on broadband.
> Don't have enough details. I just solved this exact problem a couple of
> weeks ago for a client with a UBNT ERX by turning on its MTU hacking
> feature. Sites that engaged in ICMP MTU blocking included cnn.com.)

Heh. And because life isn't interesting enough, Amazon AWS has started
defaulting their larger VMs to a 9001 byte interface MTU.

> I meant routers are allowed to drop ICMP request packets addressed to
> themselves, not the packets to be transited. I wasn't clear.

I think you really meant echo-request packets to themselves, right? ;)

Regards,
Bill Herrin

There was a useful NANOG presentation somewhere that explained this really well, in particular reading traceroutes correctly.

Kind regards

James Greig

James,

You may be thinking of this presentation:

https://www.nanog.org/meetings/nanog47/presentations/Sunday/RAS_Traceroute_N47_Sun.pdf

-mel beckman

In message <25577FE1-6366-4D6D-B82E-A779193CB458@beckman.org>, Mel Beckman writes:

> Philip,
>
> Quite often slow Web page loading and email transport -- termed an
> application-layer problem because basic transport seems unaffected -- is
> due to DNS problems, particularly reverse DNS for the IP addresses
> originating your Web queries. If you have non-existent or intermittent
> IN-ADDR entries for those IPs, the remote Web servers can be timing out
> if they have older configurations that, for example, do DNS lookups in
> order to log HTTP requests and block on completion, resulting in
> timeouts. Use "nslookup x.x.x.x" command line queries (nslookup is on
> Windows, Mac and UNIX/Linux) to see if you can resolve the public IP
> addresses your users originate queries from. You can find those addresses
> by visiting http://whatismyip.com from a problem desktop.
>
> A second common cause of app-specific throughput problems, particularly
> where email is involved, is failed MTU discovery. The standard Internet
> MTU is 1500 bytes, but sometimes a router misconfiguration or change in
> encapsulation type along the path through your ISP lowers that to, say,
> 1492 or 1486 bytes (MTU is in increments of 8). The result is that
> whenever your web or email client sends a maximum MTU packet, the packet
> is dropped, resulting in connection impairment. Most HTTP and Email
> packets are not max-MTU in size, so you get very uneven performance
> simulating network congestion.

The Internet Standard MTUs are 68 octets for IPv4 (RFC 791) and
1280 octets for IPv6 (RFC 2460).

Every size greater than those is subject to negotiation. Now most
paths pass packets greater than those values. Ethernet is very
common and passes 1500.

Encapsulated / translated traffic is also very common and has MTUs
< 1500. It affects BOTH IPv4 and IPv6 data streams and will become
more common as we move from dual stack to IPv6-only, where IPv4 is a
service running on top of IPv6.

> The Internet Standard MTUs are 68 octets for IPv4 (RFC 791) and
> 1280 octets for IPv6 (RFC 2460).
>
> Every size greater than those is subject to negotiation. Now most
> paths pass packets greater than those values. Ethernet is very
> common and passes 1500.

Thanks for identifying the source; I wish more people did this.
My nitpick is that RFC791 doesn't label MTU=68 as "standard";
it says (section 3.2, p.25):

    Every internet module must be able to forward a datagram of 68
    octets without further fragmentation.

More to your point, RFC791 also says (section 3.1, p. 13):

    All hosts must be prepared to accept datagrams of up to 576
    octets (whether they arrive whole or in fragments). It is
    recommended that hosts only send datagrams larger than 576
    octets if they have assurance that the destination is prepared
    to accept the larger datagrams.

That's the one :)

Kind regards

James Greig

RFC791 was written during the internet's anti-standard era.

We reject: kings, presidents and voting. We believe in: rough consensus and running code

Hi Sean,

Lovely quote and all, but... do you mean that when RFC791 was
drafted the IETF didn't issue 'standards'? RFC791 was written by
Jon Postel for DARPA and AFAICT is foundational. It's referenced
by more than 420 RFCs. It begins, "This document specifies the
DoD Standard Internet Protocol." Seems about as official as the
times permitted.

Point was, 576 bytes is the minimum MTU for transporting IP
datagrams. Also see RFC1122/3.3.2, which references
RFC791 of course.

Best regards,