Large RTT, or: Why doesn't my ping traffic get discarded?

Here’s a question I haven’t bothered to ask until now. Can someone please help me understand why I receive a ping reply after almost 5 seconds? As I understand it, buffers in SP gear are generally 100ms. According to my math this round trip should have been discarded around the 1 second mark, even on a long path. Maybe I should buy a lottery ticket. I don’t get it. What is happening here?

Jason

64 bytes from 4.2.2.2: icmp_seq=392 ttl=54 time=4834.737 ms
64 bytes from 4.2.2.2: icmp_seq=393 ttl=54 time=4301.243 ms
64 bytes from 4.2.2.2: icmp_seq=394 ttl=54 time=3300.328 ms
64 bytes from 4.2.2.2: icmp_seq=396 ttl=54 time=1289.723 ms
Request timeout for icmp_seq 400
Request timeout for icmp_seq 401
64 bytes from 4.2.2.2: icmp_seq=398 ttl=54 time=4915.096 ms
64 bytes from 4.2.2.2: icmp_seq=399 ttl=54 time=4310.575 ms
64 bytes from 4.2.2.2: icmp_seq=400 ttl=54 time=4196.075 ms
64 bytes from 4.2.2.2: icmp_seq=401 ttl=54 time=4287.048 ms
64 bytes from 4.2.2.2: icmp_seq=403 ttl=54 time=2280.466 ms
64 bytes from 4.2.2.2: icmp_seq=404 ttl=54 time=1279.348 ms
64 bytes from 4.2.2.2: icmp_seq=405 ttl=54 time=276.669 ms

This is often due to high CPU load on the target device: if the device is under heavy load, answering ICMP Echo gets the lowest priority. With a well-known name server like 4.2.2.2, that seems unlikely. It could also be an intermediate hop or a routing loop. Do a traceroute to get more detailed per-hop statistics.

-mel

Keep in mind that ping reports round-trip time, so a device could be delaying the ping reply on the return trip. In these cases it helps to have a traceroute from both ends, to detect asymmetric routing and return-path congestion that is invisible in a traceroute from your end.
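
For the forward direction, something as simple as the following is a start (a sketch, assuming Linux traceroute; -I sends ICMP echo probes, which some networks treat differently than the default UDP probes). The reverse trace has to come from the far end, e.g. a looking glass:

    # forward path with default UDP probes, then with ICMP probes
    traceroute 4.2.2.2
    traceroute -I 4.2.2.2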

Hi Jason,

This usually means a problem on the Linux machine originating the
packet. It has lost the ARP for the next hop or something similar, so
the outbound ICMP packet is queued. The glitch repairs itself,
briefly, releasing the queued packets. Then it comes right back.

Regards,
Bill Herrin

There's this thing called bufferbloat...

Also, if this persists, you may consider disabling hardware rx/tx checksumming to see whether it clears up your results. Some NICs can get glitchy and cause this exact behavior.
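
If you want to try that, here is a sketch for a Linux host with ethtool installed (eth0 is a placeholder for your interface; on most distros these settings do not persist across a reboot):

    # show the current offload settings, then disable rx/tx checksum offload
    ethtool -k eth0
    ethtool -K eth0 rx off tx off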

GL

Hi Dave,

Yes, but I've seen this particular pattern before and it's generally
not bufferbloat. With bufferbloat you usually see consistently long
ping times: this ping is 3 seconds, the next ping is 2.9, the next is
3.2. This example has a descending pattern, spaced exactly the
interval at which the ICMP messages were sent. The descending pattern
indicates something went wrong with ARP, or a virtual machine was
starved for CPU time and didn't run for a couple of seconds, or
something like that.

Regards,
Bill Herrin

You didn’t tell us anything about your path or your endpoint, or whether you see this just with Lumen’s DNS servers or with other devices as well, so it is hard to guess what is going on here.

That said, I know I’ve seen this kind of behavior both with buffer bloat on consumer devices (particularly in the uplink direction) and on wifi networks (which can have surprisingly deep buffers, with retransmissions occurring at layer 1.5/2). My guess is that there is a software routing/switching device somewhere in the path (wifi AP, home router, Linux or BSD router, etc.).
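
A quick way to check for uplink bufferbloat on a consumer link (a sketch, assuming a Linux or macOS host with iperf3 and a reachable iperf3 server; iperf.example.net is a placeholder) is to watch ping while you saturate the uplink:

    # terminal 1: baseline latency
    ping 4.2.2.2
    # terminal 2: saturate the uplink for 30 seconds (iperf3 sends
    # client-to-server by default)
    iperf3 -c iperf.example.net -t 30
    # if the RTTs climb by hundreds of ms (or seconds) while the upload
    # runs and fall back afterwards, the uplink buffer is bloated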

Because there is no standard for discarding “old” traffic; the only discard is for packets that hop too many times. There is, however, a standard for decrementing TTL by 1 if a packet sits on a device for more than 1000ms, and of course we all know what happens when TTL hits zero. Based on that, your packet could have floated around for another 53 seconds. Having said that, I’m not sure many devices actually do this (but it’s not likely it would have had a significant impact on this traffic anyway).

There certainly aren't any temporal buffers in SP gear limiting the
buffer to 100ms, nor are there any mechanisms to temporally decrease
TTL or hop-limit. Some devices may expose a temporal configuration in
the UX, but that is just a multiplier for max_buffer_bytes: what gets
programmed is a fixed number of bytes, not a temporal limit as a
function of the observed traffic rate.
This is important because HW may support tens or even hundreds of
thousands of queues, so that it can offer a large number of logical
interfaces with HQoS and multiple queues each. If such a device is
run with a single logical interface that is low speed, either
physically or shaped, you may end up with very, very long temporal
queues. Not because people intend to queue for that long, but because
understanding all of this requires a lot of context and information
about the platform that isn't readily available, nor is it solved by
'just remove those buffers from devices physically, it's
bufferbloat'.
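
As a rough back-of-the-envelope sketch (the numbers here are
assumptions, just to show the scaling): a queue provisioned in bytes
for a fast port but drained by a slow shaped interface turns into
seconds of standing delay.

    # 12.5 MB of queue (100 ms worth at 1 Gb/s) drained at a 10 Mb/s shaped rate
    echo "scale=1; 12500000 * 8 / 10000000" | bc    # -> 10.0 seconds of queuing delay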

Like others have pointed out, there is not much information to go on
and this could be many things. One of those could be 'buffer bloat'
like Taht pointed out, and that might be true given the cyclical
nature of the ping, the buffer getting filled and drained. I don't
really think ARP/ND is a good candidate like Herrin suggested,
because the pattern is cyclical rather than a single event, but it's
not impossible.

We'd really need to see full mtr output, whether or not this affects
other destinations, whether it affects just ICMP or also DNS, and
ideally a reverse traceroute as well. I can tell that I'm not
observing the issue, nor did I expect to observe it, as I expect the
problem to be close to your network, and therefore to affect a lot of
destinations.
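
For reference, something like the following would be a useful starting
point (a sketch, assuming a Linux or macOS host with mtr installed;
4.2.2.2 is just the destination from the ping output above):

    # 100-cycle report toward the destination
    mtr --report --report-cycles 100 4.2.2.2
    # repeat toward a couple of unrelated destinations to see whether
    # they are affected too; a reverse trace needs help from the far end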

Suppose you have a loose network cable between your Linux server and a
switch. Layer 1. That RJ45 just isn't quite solid. It's mostly working
but not quite right. What does it look like at layer 2? One thing it
can look like is a periodic carrier flash where the NIC thinks it has
no carrier, then immediately thinks it has enough of a carrier to
negotiate speed and duplex. How does layer 3 respond to that?

1s: send ping toward default router
1.1s: ping response from remote server
2s: send ping toward default router
2.1s: ping response from remote server
2.5s: carrier down
2.501s: carrier up
3s: queue ping, arp for default router, no response
4s: queue ping, arp for default router, no response
5s: queue ping, arp for default router, no response
6s: queue ping, arp for default router, no response
7s: queue ping, arp for default router
7.01s: arp response, send all 5 queued pings but note that the
earliest is more than 4 seconds old.
7.1s: response from all 5 queued pings.

Cable still isn't right though, so in a few seconds or a few minutes
you're going to get another carrier flash and the pattern will repeat.

I've also seen some cheap switches get stuck doing this even after the
faulty cable connection is repaired, not clearing until a reboot.

Regards,
Bill Herrin

> Suppose you have a loose network cable between your Linux server and a
> switch. Layer 1. That RJ45 just isn't quite solid. It's mostly working
> but not quite right. What does it look like at layer 2? One thing it
> can look like is a periodic carrier flash where the NIC thinks it has
> no carrier, then immediately thinks it has enough of a carrier to
> negotiate speed and duplex. How does layer 3 respond to that?

Agreed. But then once the resolve happens and Linux floods the queued
pings out, the responses would come back ~immediately. So the delta
between the RTTs would remain at the send interval, in this case 1s.
In this case we see the RTT decreasing as if the buffer is being
purged, until it seems to be filled again, up until 5s or so.

I don't exclude the rationale, I just think it's not likely based on
the latencies observed. But at any rate, with so little data, my
confidence in including or excluding any specific explanation is low.

Howdy,

Not quite. The ping origination time isn't set when layer 3 decides
the packet can be delivered to layer 2, it's set when layer 7 drops
the packet on the stack. In other words: when the ping app "sends" the
packet, not when the NIC actually puts the packet on the wire or even
when the OS sends the packet over to the NIC. The time the packet
spends queued waiting for ARP to supply a next-hop MAC address counts
against the round trip time.
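
One way to watch this yourself is to capture ARP and ICMP on the wire
while it happens (a sketch, assuming a Linux host with tcpdump; eth0
is a placeholder for your interface). The echo requests only appear on
the wire after the ARP reply arrives, yet ping's reported RTTs include
the time spent queued:

    # wire timestamps for ARP and ICMP
    tcpdump -i eth0 -tt 'arp or icmp'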

When you see this pattern of descending ping times exactly one second
apart, where the responses all arrived at once, it's usually because
something in the path didn't have the next-hop MAC address for a
while, and then it did. And it's usually not something deep in the
network, because something deep would exhaust its transmission queue
long before it could queue several seconds' worth of pings.

If you want to prove this to yourself, set up a Linux box, install a
filter to drop arp replies (arptables or nftables), delete the arp
entry for your default router (arp -d) and then start pinging
something. When you -remove- the arp filter, you'll see the pattern in
the ping responses that Jason posted.
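
A minimal sketch of that experiment (assuming a Linux box with
arptables; 192.0.2.1 stands in for your default router and eth0 for
your interface, and the exact timing depends on the kernel's ARP retry
and neighbour-queue settings):

    # run as root: drop incoming ARP replies so the gateway can't be resolved
    arptables -A INPUT --opcode 2 -j DROP
    # flush the gateway's ARP entry
    arp -d 192.0.2.1
    # start pinging; echo requests queue while ARP resolution fails
    ping 4.2.2.2
    # a few seconds later, in another terminal, remove the filter
    arptables -D INPUT --opcode 2 -j DROP
    # the queued pings flush out at once and the replies show the same
    # descending, one-second-apart RTT pattern as in the output above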

You may get different results in other OSes. For example, Windows will
lose its DHCP address with the carrier flash, so when ping tries to
send the packet the network is unreachable. Because the stack
considers the network unreachable, the ping packet isn't queued and
the error is reported immediately to the application.

Regards,
Bill Herrin

Jerry Cloe wrote:

> Because there is no standard for discarding "old" traffic; the only
> discard is for packets that hop too many times. There is, however, a
> standard for decrementing TTL by 1 if a packet sits on a device for
> more than 1000ms, and of course we all know what happens when TTL
> hits zero. Based on that, your packet could have floated around for
> another 53 seconds.

Totally wrong, as the standard says TTL MUST be decremented by at
least one on every hop and MAY (but need not) be decremented further,
as specified by the IPv4 Router Requirements standard (RFC 1812):

    When a router forwards a packet, it MUST reduce the TTL by at least
    one. If it holds a packet for more than one second, it MAY decrement
    the TTL by one for each second.

As for IPv6,

    Unlike IPv4, IPv6 nodes are not required to enforce maximum packet
    lifetime. That is the reason the IPv4 "Time to Live" field was
    renamed "Hop Limit" in IPv6. In practice, very few, if any, IPv4
    implementations conform to the requirement that they limit packet
    lifetime, so this is not a change in practice.

            Masataka Ohta

Thanks for engaging with this. I was intentionally brief in my explanation. I have observed this behavior in congested networks for years and ignored it as an obvious symptom of the congestion. What has always piqued my curiosity though is just how long a ping can last.

In my case yesterday, I was at the airport at peak holiday travel and free wifi usage time. I expect a bad experience. I don’t expect a ping to return 5 seconds after originating it. I just imagine the network straining and groaning to get my ping back to me. It’s okay, man. Let it go.