Headscratcher of the week

Mike2 · May 31, 2013, 10:25pm

Gang,

In the interest of sharing 'the weird stuff' that makes the job of being an operator ... uh, fun? (is that the right word?), I would like to present the following two smokeping latency/packet-loss plots, which are by far the weirdest I have ever seen.

These plots are from our smokeping host out to a customer location. The customer is connected via DSL and runs PPPoE over it to connect to our access concentrator. There are about five physical infrastructure hops between the host and the customer: the switch, the BRAS, the switch again, then directly to the DSLAM, and then the customer on the end.

The 10 day plot:
http://picpaste.com/10_Day_graph-YV3IdvRV.png

The 30 hour plot:
http://picpaste.com/30_hour_graph-DrwzfhYJ.png

How can you possibly have a consistent increase in latency like that? I'd love to hear theories (or offers of beer, your choice!).

Happy Friday, all!

Mike-

Jonathan_Lassoff · May 31, 2013, 10:39pm

Those are some truly perplexing graphs. Quite strange that it appears
linear, as if something is slightly changing over time or
growing/shrinking at a constant-ish rate.

Do you have throughput or PPS graphs for the intermediate links as
well? Any similar correlations in the derivative slope?

My only hunch would be some intermediate buffer being increasingly
full over time, as some other application riding the path linearly
grows in packets/second or bits/second.
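
To put rough numbers on that hunch, here is a minimal sketch of how a steadily growing backlog in a FIFO buffer would show up as steadily growing latency. The link rate and backlog growth are made-up illustrative values, not measurements from this path, and the function names are hypothetical.

```python
from typing import List


def queueing_delay_ms(queued_bytes: float, link_bps: float) -> float:
    """Wait time for a packet that arrives behind queued_bytes on a FIFO link."""
    # bytes -> bits, seconds -> milliseconds, then divide by the drain rate
    return queued_bytes * 8.0 * 1000.0 / link_bps


def delay_over_time_ms(hours: int,
                       backlog_growth_bytes_per_hour: float,
                       link_bps: float) -> List[float]:
    """Queueing delay at each whole hour if the standing backlog grows linearly."""
    return [
        queueing_delay_ms(hour * backlog_growth_bytes_per_hour, link_bps)
        for hour in range(hours + 1)
    ]


if __name__ == "__main__":
    # ASSUMPTION: a 1 Mbit/s link whose standing queue grows 1250 bytes per hour.
    for hour, delay in enumerate(delay_over_time_ms(5, 1250.0, 1_000_000.0)):
        print(f"hour {hour}: {delay:.1f} ms of added latency")
```

Because the delay is simply backlog divided by link rate, a backlog that grows at a constant rate produces a straight-line latency ramp, which is why comparing the slope of the latency plot against the slope of the throughput or PPS graphs is a useful check.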

Cheers,
jof

Jeff_Kell · May 31, 2013, 10:39pm

OK, here's a wild guess from left field. Well, at least from the left field
where I made at least one game-saving catch.

We had a similar case some years back, but it was a ramp-up in overall
traffic we were looking at. If you're looking at latency, it could be
related to traffic (do you have traffic graphs?).

One particular user that was accustomed to Windows and trying to get
started with Linux was "playing games" with our NAT firewall. Rather
than file a request with us for a static NAT and firewall openings for
their "new" Linux server, they discovered that as long as they generated
some internet traffic periodically, they could defeat the NAT
translation timeout, and essentially keep a static outside IP.

Problem was, they "crontab"ed a "ping" of an outside server to run once
a minute. Just a "ping x.x.x.x".

Windows, as we know, defaults to pinging only 4 times and then quitting.

Linux does not.

So you might look for some "recurring scheduled event" on the customer's
end that might be cumulative rather than simply recurring.
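
As a rough illustration of "cumulative rather than simply recurring", the sketch below models that cron job: one new ping is started every minute, and on Linux none of them ever exits. The function names are hypothetical, and the one-second send interval is the usual Linux ping default rather than anything measured here.

```python
def ping_processes_alive(minutes_elapsed: int, stops_on_its_own: bool) -> int:
    """Ping processes still running after cron has fired once per minute."""
    if stops_on_its_own:
        # A ping that quits after a few echoes is done well before the next
        # cron run, so model it as at most one process alive at a time.
        return 1 if minutes_elapsed > 0 else 0
    # A ping that never quits: every cron run leaves one more process behind.
    return minutes_elapsed


def echo_requests_per_second(minutes_elapsed: int, interval_s: float = 1.0) -> float:
    """Aggregate ICMP echo rate from all never-ending pings started so far."""
    return ping_processes_alive(minutes_elapsed, stops_on_its_own=False) / interval_s


if __name__ == "__main__":
    for minutes in (1, 60, 24 * 60, 10 * 24 * 60):
        rate = echo_requests_per_second(minutes)
        print(f"after {minutes:>5} min: {rate:.0f} echo requests/s")
```

The point of the model is only that the load grows by a fixed step at each scheduled run, which gives a linear ramp instead of a repeating spike.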

Jeff

Brett_Frankenberger1 · June 1, 2013, 12:30am

Theory:

There's a stateful device (firewall, NAT, something else) in the path
that is creating state for every ICMP Echo Request it forwards and
(possibly) searching that state when forwarding the ICMP Echo Reply
responses, and never destroying that state, and either the create
operation or the search operation (or both) takes an amount of time
that is a linear function of the number of state entries.
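
A minimal sketch of that theory, assuming a device that keeps its state in an unsorted list and never expires entries. The class and function names are hypothetical, and nothing here claims to match how any particular firewall or NAT actually stores state.

```python
from typing import List, Tuple


class LeakyStateTable:
    """Toy state table: entries are appended and never removed."""

    def __init__(self) -> None:
        self.entries: List[Tuple[int, int]] = []

    def on_echo_request(self, ident: int, seq: int) -> None:
        # Create state for the forwarded request; nothing ever deletes it.
        self.entries.append((ident, seq))

    def on_echo_reply(self, ident: int, seq: int) -> int:
        """Linear search for the matching state; returns entries examined."""
        for examined, entry in enumerate(self.entries, start=1):
            if entry == (ident, seq):
                return examined
        return len(self.entries)


def lookup_cost_after(probes: int) -> int:
    """Entries examined for the newest reply after `probes` request/reply pairs."""
    table = LeakyStateTable()
    cost = 0
    for seq in range(probes):
        table.on_echo_request(ident=1, seq=seq)
        cost = table.on_echo_reply(ident=1, seq=seq)
    return cost


if __name__ == "__main__":
    for probes in (10, 1_000, 10_000):
        print(f"{probes:>6} probes sent -> {lookup_cost_after(probes)} entries examined")
```

Since the newest entry always sits at the end of the list, each reply has to walk past every older entry, so the per-packet work grows in step with the number of probes sent. Clearing the state (a reboot or table flush) would drop the cost back to the start of the ramp.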

-- Brett

Jake_Khuon4 · June 1, 2013, 12:40am

A variation of the buffer-filling theory is that there's some QoS/traffic shaping going on that is causing your ping packets to get classed and policed into an ever-depleting buffer pool.

I wonder what would happen to the pattern if you reset the interface. |8^)

Blake_Dunlap1 · June 1, 2013, 4:18am

I agree with the previous poster: table size progression and a corresponding
increase in search delay, probably related directly to the monitoring
itself, or at least to connection state of some kind.