high latency ds3 issue on unloaded line

Hello,

    I have a DS3 from Qwest that has daily issues with insane point-to-point latencies, sometimes exceeding 1000ms for hours on end, which then suddenly disappear. The latency does not appear to correspond with actual measured link utilization (less than 20Mbps most days).

    To make a long investigation short, the problem comes on during the day and then lets up late in the evening. I have tested and examined everything at the IP layer: it's not high utilization, an ACL, router CPU, or bad hardware, and there are no line errors or other issues visible in the interface or controller stats. Yes, I have flushed all hardware, and I have a 7204VXR/NPE-400 with this single DS3. The only clue seems to be millions of 'output drops' on Qwest's side.

    At night I can hit popular FTP mirrors from a directly attached server and watch my interface report about 100% utilization combined with my users and customers, so yeah, it really is a full line-rate DS3. Historically, MRTG always shows around 20Mbps or less utilization; it's only Smokeping that goes off, usually in the afternoon, when the point-to-point latencies between my router and Qwest start heading north, and consistently at that. I also have an in-house tool that takes 30-second snapshots of my DS3 interface in order to catch short bursts that would be smoothed out by MRTG's 5-minute average, but during these high-latency periods there are no spikes noted. For added confusion (or fun!), the latency can start at any utilization level - I've observed it while we were pulling just 12Mbps, and I have not had it while we were doing 34Mbps. Only the time of day seems to be the common factor.
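A 30-second snapshot tool of that kind can be as small as the sketch below. This is only a sketch, assuming net-snmp's snmpget is installed and the router answers SNMP v2c; the address, community string, and ifIndex are placeholders, not values from this thread.

#!/usr/bin/env python3
# Minimal 30-second interface-rate snapshot (sketch only). Counter wrap and
# error handling are ignored for brevity.
import subprocess
import time

HOST = "192.0.2.1"       # hypothetical router address
COMMUNITY = "public"     # hypothetical read-only community
IFINDEX = 1              # ifIndex of the DS3 interface
OID = "IF-MIB::ifHCOutOctets.%d" % IFINDEX

def get_out_octets():
    value = subprocess.check_output(
        ["snmpget", "-v2c", "-c", COMMUNITY, "-Oqv", HOST, OID])
    return int(value.split()[0])

prev = get_out_octets()
while True:
    time.sleep(30)
    cur = get_out_octets()
    # Counter64 delta over 30 seconds -> average Mbit/s for that window
    mbps = (cur - prev) * 8 / 30.0 / 1e6
    print("%s  %.1f Mbit/s" % (time.strftime("%Y-%m-%d %H:%M:%S"), mbps))
    prev = cur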

    Qwest has not been able to identify the issue, only confirm that - yeah, this really is happening when there is otherwise no real load on the line - and I am certain we have done everything to rule out the IP layer. They have put in a 'request' to move me to another router, but I am not hopeful of a resolution that way, as the router we're currently on doesn't appear to have this problem with any other subscriber.

    What I want to know is: is it possible that the underlying ATM/SONET that carries my DS3 from my facility is somehow oversubscribed or misconfigured? We have an OC12 fiber entrance and this is the only circuit provisioned on it, and in our small town the only other user on the ring with us is Comcast (according to the AT&T network engineer who installed this). I don't know enough about ATM/SONET to imagine conditions that would cause the issues I am seeing here, but every IP-layer tool I have only ever tells me there isn't an IP issue. I can ping from my router directly to the attached Qwest router and get > 1000ms, and then at other times (outside the problem window) I get 4ms.
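One simple way to document the >1000ms vs. 4ms pattern is to log one timestamped ping per minute to the attached Qwest router. The sketch below assumes a Linux-style ping; the target address and output file name are placeholders.

#!/usr/bin/env python3
# Sketch: log one timestamped RTT sample per minute to a CSV so the latency
# windows can be lined up against time of day.
import csv
import re
import subprocess
import time

TARGET = "192.0.2.2"       # hypothetical far-end /30 address on the DS3
LOGFILE = "rtt_log.csv"

while True:
    rtt = None                                  # None == timeout / loss
    try:
        out = subprocess.run(["ping", "-c", "1", "-W", "2", TARGET],
                             capture_output=True, text=True, timeout=5).stdout
        match = re.search(r"time=([\d.]+) ms", out)
        if match:
            rtt = float(match.group(1))
    except subprocess.TimeoutExpired:
        pass
    with open(LOGFILE, "a", newline="") as f:
        csv.writer(f).writerow([time.strftime("%Y-%m-%d %H:%M:%S"), rtt])
    time.sleep(60)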

    If anyone has laughs or beers to offer me, send 'em on cuz I could use both right about now....

Mike-

Have you taken some traffic captures to see what kind of traffic's coming
through? Could be an infected machine sending lots of small packets from
lots of spoofed addresses. I've seen that kind of thing cause issues with
older routers before.

Mike,

  I've seen issues similar to this when using the 12-port DS3 cards, Engine
0, in Cisco GSRs. Basically, if any single DS3 on any of the 12 ports is
full, then the buffers on the card fill up and every other port on that card
has its traffic queued, thus introducing latency. There are some things that
can be done to help with the situation but not much to actually resolve the
issue. One is to use WRED instead of FIFO on the interfaces at the provider
side. A more effective solution, but more invasive, is to set each
interface's tx queue to 1 so that when something is waiting on the blocked
port, packets are dropped instead of queued.

Applied Globally:
cos-queue-group 1-16                      ! define a CoS queue group named "1-16"
  precedence all random-detect-label 0    ! map all IP precedences to WRED label 0
  random-detect-label 0 1 16 1            ! WRED label 0: min 1, max 16, mark-prob 1

Interfaces with less than 45Mb:
  tx-cos 1-16                             ! apply the queue group to the tx side
  tx-queue-limit 1                        ! cap the transmit queue depth at 1

Mostly this is just done on the interfaces that are configured with a clock
rate of less than 45Mb. Of course, all this will need to be done on the Qwest
side. Since you only see this issue during the day, it points to a
traffic problem. If Qwest needs some proof, have them run these commands when
you're seeing the high latency:

execute-on slot <slot> show controllers frfab queue
execute-on slot <slot> show controllers tofab queue

That is, if they're using those cards on a GSR.

Hope that helps.

--chip
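A rough sanity check on the numbers involved: at DS3 line rate, one second of latency corresponds to several megabytes of data queued somewhere upstream, which is why a deep shared buffer on a line card can hurt this much, and why capping the tx queue trades that latency for drops. Back-of-the-envelope only, using the nominal 44.736Mbps rate and ignoring framing overhead:

# How much queued data does 1000 ms of latency imply at DS3 rate?
ds3_bps = 44.736e6       # nominal DS3 line rate, bits per second
delay_s = 1.0            # observed latency
queued_bytes = ds3_bps * delay_s / 8
print("%.1f MB of data buffered somewhere" % (queued_bytes / 1e6))          # ~5.6 MB
print("~%d full-size (1500-byte) packets queued" % (queued_bytes / 1500))   # ~3700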

Mike,

Your latencies that suddenly appear for several hours, then go away, and do this on a regular basis sound like a layer 2, facility-switching issue. As you indicated, "the problem comes on during the day and then lets up late in the evening," which sounds like the underlying facility is being switched around the "long side" of the SONET ring or another facility. Some carrier facilities are scheduled for one path or direction during the day, intended as the lower-latency path for interactive work, and then switch to a lower-cost, higher-latency path in the evening, when computer-to-computer backups do not care. If you can plot the times the issues start and end, and show that these occur daily during the week but not on weekends, that would be a strong indicator.

John (ISDN) Lee
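To make the plot John suggests concrete: assuming RTT samples have been logged to a CSV of "timestamp,rtt_ms" rows (for example by a logger like the one sketched earlier, or exported from Smokeping), a few lines can tabulate average latency by weekday and hour. The file name is a placeholder.

#!/usr/bin/env python3
# Sketch: average RTT by weekday and hour from "YYYY-MM-DD HH:MM:SS,rtt_ms"
# rows. A clear weekday/business-hours pattern would back the scheduled-path
# theory.
import csv
from collections import defaultdict
from datetime import datetime
from statistics import mean

DAYS = ["Mon", "Tue", "Wed", "Thu", "Fri", "Sat", "Sun"]
samples = defaultdict(list)                 # (weekday, hour) -> [rtt_ms, ...]

with open("rtt_log.csv") as f:
    for ts, rtt in csv.reader(f):
        if rtt and rtt != "None":           # skip lost samples
            t = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
            samples[(t.weekday(), t.hour)].append(float(rtt))

for (day, hour), rtts in sorted(samples.items()):
    print("%s %02d:00  avg %7.1f ms  (%d samples)"
          % (DAYS[day], hour, mean(rtts), len(rtts)))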

John

Even if this is happening, the distance you can travel at 2/3 the speed of light says there is something else going wrong here (1 second of latency is a very long time).

Kris
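The rough numbers behind Kris's point, assuming propagation at roughly 2/3 the speed of light in fiber:

# If a 1000 ms round trip were nothing but propagation delay, how much fiber
# would that represent?
C_KM_PER_S = 299_792          # speed of light in vacuum, km/s
v = C_KM_PER_S * 2 / 3        # ~200,000 km/s in fiber
rtt_s = 1.0
one_way_km = v * rtt_s / 2
print("%.0f km of fiber one way" % one_way_km)                   # ~100,000 km
print("~%.1f times around the Earth" % (one_way_km / 40_075))    # ~2.5 laps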

John Lee wrote:

Your latencies that suddenly appear for several hours, then go away, and do this on a regular basis sound like a layer 2, facility-switching issue. [...] If you can plot the times the issues start and end, and show that these occur daily during the week but not on weekends, that would be a strong indicator.

For 1000 ms latencies, that would be a VERY long "long side". I think there is something else going on here.

Kris,

No disagreement on the speed of light; the issue is how many OEO (optical-electrical-optical) conversions the signal goes through and how far it travels. When the facilities switched, in a previous life, the latencies usually took a jump of several hundred milliseconds, but not seconds. My question/issue is the repeatability and schedule of the lengthening delay, and what other activity or event would correlate with it.

John

We've had a similar issue with a few of our Qwest DS3s. The solution has been one of the following....

1) Qwest has oversubscribed the transit links on their ATM network that the DS3 rides, and during peak times of the day the transit link becomes congested, causing high latency unrelated to our traffic levels. So the congestion could be appearing beyond your local loop.

2) We also had an instance where Qwest had an issue with the PVC on the ATM switch we connected into that was causing > 500ms of latency. Like you, we are in a small town served by older ATM switches, so you might just see if they can rebuild both sides of the PVC to see if that clears it up. Sounds quacky, but after 12 hours of troubleshooting, that was the fix.

Ben

It would be quite the poorly implemented ATM-based transport system if
DS-3s were oversubscribed. We're not talking about a packet-based service;
it should be transported as a traditional SONET-mapped circuit.

Frank

Has anyone considered that this could simply be a case of a customer DS3
provisioned into an MPLS CCC/l2ckt-style upstream aggregate? I.e.,
PPP/HDLC in MPLS.

It seems best to first contact Q and ask exactly how this thing is provisioned.

-Tk