DSL and/or Routing Problems

Greetings NANOGers,

Yesterday we started noticing long delays on an ADSL connection. I spent most
of the day trying to track down the problem and getting nowhere. Telco says
they do not detect any problem on the line... so I am kind of lost. Anyone here
have any ideas? Here are the specifics:

This connection uses a Cisco 827 ADSL router and has several static IPs. All IPs
show identical delays. Using other circuits between the same two locations, we
do not see any delays.

Normally on this DSL connection, local can ping remote with packet transit times
around 60-70ms. Here is what we are seeing now:

# ping -s SOMEHOST 68 25; sleep 1; ping -s SOMEHOST 68 25
PING SOMEHOST: 68 data bytes
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=0. time=105. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=1. time=9132. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=2. time=8132. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=3. time=7132. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=4. time=6132. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=5. time=5133. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=6. time=4133. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=7. time=3133. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=8. time=2133. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=9. time=1133. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=10. time=133. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=11. time=104. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=12. time=110. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=13. time=109. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=14. time=112. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=15. time=106. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=16. time=114. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=17. time=107. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=18. time=109. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=19. time=106. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=20. time=112. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=21. time=106. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=22. time=108. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=23. time=106. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=24. time=110. ms

----SOMEHOST PING Statistics----
25 packets transmitted, 25 packets received, 0% packet loss
round-trip (ms) min/avg/max = 104/1918/9132
PING SOMEHOST: 68 data bytes
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=0. time=112. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=1. time=9131. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=2. time=8132. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=3. time=7132. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=4. time=6132. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=5. time=5132. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=6. time=4133. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=7. time=3132. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=8. time=2133. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=9. time=1133. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=10. time=133. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=11. time=111. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=12. time=106. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=13. time=109. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=14. time=116. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=15. time=108. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=16. time=107. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=17. time=113. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=18. time=106. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=19. time=107. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=20. time=108. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=21. time=108. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=22. time=105. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=23. time=109. ms
76 bytes from SOMEHOST (w.x.y.z): icmp_seq=24. time=106. ms

----SOMEHOST PING Statistics----
25 packets transmitted, 25 packets received, 0% packet loss
round-trip (ms) min/avg/max = 105/1918/9131

What really has me bugged is the pattern shown by the first dozen packets... why
the relatively quick first time, followed by a long but decreasing delay which
repeats every time you restart the ping (that's why I provided 2 samples)?

Despite the fact that Telco says there are not any line problems, we are seeing
a change in DSL performance compared to our benchmark. When we first started
noticing the problem yesterday, both in and out connections were using the Fast
path, but compared to the benchmark, the inbound speed had dropped to 576 and
the Capacity had jumped to 99%, plus we had some RS and CRC errors on both in
and out connections. Later in the day, the connection switched from using the
Fast path to the Interleave path (we did nothing on our end to cause this to
change) and the performance settled down to what is shown below under "DSL NOW."

DSL BENCHMARK:

: This connection uses a Cisco 827 ADSL router and has several static IPs. All IPs show identical delays. Using other circuits between the same two locations, we do not see any delays.

What's the weather like? :wink:

See if you can get the ADSL router to give you upstream/downstream noise margins and any other useful reporting ...

AR Driver Counters Display :
TX :|packets: 8597915 = direct: 2923483 + qued: 5674434
     > = oamF4: 0 + oamF5: 0 + others
     >fail count = chNoEr: 0 + dropped: 0
     >txMissIsr= 0, queCnt= 0, txOnGoing= 0
RX :|packets: 8924470 = toATM: 8919249 + loopback: 0 + errors
     > , where oamF4: 0, oamF5: 0
     >errors = crc: 5069 + mbuf: 0 + len: 0 + pad: 0 + strayed: 151
     >rxMissIsr= 0, queCnt= 0, nonAA= 0, sramErr= 0, reqSramMax= 6
     >dummyIsr = 256833, fpgaIsr = 14826785
VC( 0 to 3 ) : 08924319 00000000 00000000 00000000
VC( 4 to 7 ) : 00000000 00000000 00000000 00000000
VC( 8 to 11 ) : 00000000 00000000 00000000 00000000
VC( 12 to 15 ) : 00000151

Upstream Noise Margin
relative capacity occupation: 78%
noise margin upstream: 11.0 db
output power downstream: 16.0 dbm
attenuation upstream: 31.5 db
carrier load: number of bits per symbol(tone)
tone 0- 31: 00 00 00 04 67 77 66 65 66 66 66 66 55 54 43 00
tone 32- 63: 00 00 00 44 55 66 66 66 66 66 66 66 66 66 26 66
tone 64- 95: 66 65 55 54 45 55 55 44 44 44 44 44 44 43 33 22
tone 96-127: 22 22 02 22 22 20 00 00 00 00 00 00 00 00 00 00
tone 128-159: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
tone 160-191: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
tone 192-223: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
tone 224-255: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

Downstream Noise Margin
relative capacity occupation: 95%
noise margin downstream: 6.5 db
output power upstream: 12.0 dbm
attenuation downstream: 66.5 db
carrier load: number of bits per symbol(tone)
tone 0- 31: 00 00 00 04 67 77 66 65 66 66 66 66 55 54 43 00
tone 32- 63: 00 00 00 44 55 66 66 66 66 66 66 66 66 66 26 66
tone 64- 95: 66 65 55 54 45 55 55 44 44 44 44 44 44 43 33 22
tone 96-127: 22 22 02 22 22 20 00 00 00 00 00 00 00 00 00 00
tone 128-159: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
tone 160-191: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
tone 192-223: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
tone 224-255: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00

<snip>

Assuming it is not your ISP or that the telco is the ISP.

Don't believe them. Tell them to reset the port. Tell them to change the pairs. Tell them to switch your line to a different port on the DSLAM. Tell them to put you into a different CO. Tell them to dispatch a technician to test your line "at the NID". Get an FTP server with good connectivity on the Internet and upload/download to it, measuring your speed. Show the telco low bandwidth and packet loss. Do some flood pinging (carefully).

Test the line with a cheap Linksys or Netgear or SMC or D-Link or similar "broadband" residential router with ADSL modem (or even software [google for raspppoe for Windows; Linux has PPPoE software available as well, if that's what your setup uses]).

Spend a few dollars and get ADSL on another phone line if all that does not work.

For the money they make off an ADSL line, a telco is unlikely to do more than run the standard automated web testing thingy, say "Everything fine here!" and hope you don't call back and cost them more. That makes sense. The more support time and expertise expended on you, the less profit generated for them by your business.

I can't count the number of "Tests perfectly!" cases that get resolved mysteriously inside the telco after some more harassment. Furthermore, our experience on average is that the more the line costs per month, the better service you get on it. Typically, with any large number of circuits, you will find the right people in the telco who actually give a damn about you and can "get things done".

Joe

Jon,

I saw something like this once, and it was a router bug. Your ping trace
seems to be saying that after the first ping something gets stuck, and
ten more pings accumulate in a queue somewhere until a ten-second timer
expires, or maybe ten messages in a queue is a maximum that triggers some
action. Then the queued-up pings are all released at once and after that
things run normally.
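
One way to check this reading against the first trace: assuming the echo requests went out one second apart (the usual ping default, but not stated anywhere in the thread), adding each sequence number's send time to its round-trip time gives the moment the reply arrived. A minimal Python sketch, with the RTTs for icmp_seq 0 through 11 copied from the first run; the function and variable names are made up for illustration:

```python
# Round-trip times (ms) for icmp_seq 0..11, copied from the first ping run.
rtts_ms = [105, 9132, 8132, 7132, 6132, 5133, 4133, 3133, 2133, 1133, 133, 104]

# Assumption: one echo request per second (not confirmed in the thread).
SEND_INTERVAL_MS = 1000


def arrival_times(rtts, interval_ms=SEND_INTERVAL_MS):
    """Return when each reply arrived, in ms after the first request was sent."""
    return [seq * interval_ms + rtt for seq, rtt in enumerate(rtts)]


arrivals = arrival_times(rtts_ms)

# Replies for icmp_seq 1..10 are the ones with the long, decreasing delays.
delayed = arrivals[1:11]
spread_ms = max(delayed) - min(delayed)

for seq, t in enumerate(arrivals):
    print(f"icmp_seq={seq:2d} reply arrived at t={t} ms")
print(f"spread across icmp_seq 1..10: {spread_ms} ms")
```

Under that one-second assumption, the replies for icmp_seq 1 through 10 all arrive within a millisecond of each other, roughly 10.1 seconds after the first request went out. That is consistent with something holding them and then releasing them together.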

Very strange, but I'll bet it's some peculiarity deep in the guts of
whatever box you're pinging. Maybe a process stops running and gets
restarted by an inactivity timer. The ISP might be able to find
something in a log file or system trace that would be a clue to someone
familiar with the router's internals. Did they just have an equipment
change or a software upgrade? It doesn't feel like a link outage or
quality problem because it always starts after the first ping and lasts
for exactly ten more.

I also wonder if the ping anomaly is even the same problem as the one
that's causing the poor performance you're observing on your benchmarks.

Cheers, and good luck!

John Renwick

DSL BENCHMARK:

                     ATU-R (DS)              ATU-C (US)
Capacity Used:       72%                     21%

                     Interleave  Fast        Interleave  Fast
Speed (kbps):        0           960         0           256
Reed-Solomon EC:     0           0           0           0
CRC Errors:          0           0           0           0
Header Errors:       0           0           0           0
Bit Errors:          0           0
BER Valid sec:       0           0
BER Invalid sec:     0           0

DSL NOW:

                     ATU-R (DS)              ATU-C (US)
Capacity Used:       94%                     63%

                     Interleave  Fast        Interleave  Fast
Speed (kbps):        736         0           256         0
Reed-Solomon EC:     99          0           4           0
CRC Errors:          4           0           1           0
Header Errors:       3           0           0           0
Bit Errors:          0           0
BER Valid sec:       0           0
BER Invalid sec:     0           0

You've gone from fast path to interleaved. Interleaving can inject
up to 64ms of latency in each direction, on top of the normal line
latency (e.g., with a 12ms loop time, interleaving can bump that up
to 140ms). Interleaving trades latency for line stability; I'm not
sure of the specifics on that, however. Basically, you set your
latency tolerance on the DSLAM, up to 64ms each for upstream and
downstream, and depending on line conditions your latency will vary
between the base loop latency and the maximum allowed by your
tolerance. On a good line you won't see any latency injected; a poor
line will run right up to the tolerance and still retrain due to
errors.
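
The 140ms figure is just the base loop time plus the maximum interleave delay counted once per direction. A small Python sketch of that arithmetic; the function name is hypothetical, and the 64ms ceiling is the figure quoted above, not something checked against a particular DSLAM:

```python
def worst_case_rtt_ms(base_rtt_ms, max_interleave_delay_ms=64):
    """Base round trip plus the maximum interleave delay in each direction.

    Assumes the full delay is applied once upstream and once downstream,
    as described in the message above.
    """
    return base_rtt_ms + 2 * max_interleave_delay_ms


# The example from the text: a 12ms loop with up to 64ms added each way.
print(worst_case_rtt_ms(12))  # 12 + 64 + 64 = 140
```

On a clean line the added delay can be well below the ceiling, so this is an upper bound rather than a prediction.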

You need to ask the telco why they've changed you from fast path,
and request that you get put back to a fast path config. You MAY be
able to restrict your dsl modem to training fast path only if they
have your line set to auto for signaling.

Joshua Coombs

: >Greetings NANOGers,
: >
: >Yesterday we started noticing long delays on an ADSL connection.
: <snip>
: Assuming it is not your ISP or that the telco is the ISP.
: Don't believe them. Tell them to reset the port. Tell them to change the

NETAT! Never Ever Trust A Telco! Test, test and test some more on your
side and then demand they do the same.

I have even had to troubleshoot their network. I did the above, and
when it still didn't work, everyone (my boss, my boss's boss, data center
techs and the same level of telco folks) got on a conference call for
The Big Blame Party. It was, once again, their fault.

scott

: pairs. Tell them to switch your line to a different port on the DSLAM.
: Tell them to put you into a different CO. Tell them to dispatch a
: technician to test your line "at the NID". Get an FTP server with good
: connectivity on the Internet and upload/download to it, measuring your
: speed. Show the telco low bandwidth and packet loss. Do some flood
: pinging (carefully).
:
: Test the line with a cheap Linksys or Netgear or SMC or D-Link or similar
: "broadband" residential router with ADSL modem (or even software [google
: for raspppoe for Windows; Linux has PPPoE software available as well,
: if that's what your setup uses]).
:
: Spend a few dollars and get ADSL on another phone line if all that does
: not work.
:
: For the money they make off an ADSL line, a telco is unlikely to do more
: than run the standard automated web testing thingy, say "Everything
: fine here!" and hope you don't call back and cost them more. That makes
: sense. The more support time and expertise expended on you, the less
: profit generated for them by your business.
:
: I can't count the number of "Tests perfectly!" cases that get resolved
: mysteriously inside the telco after some more harassment. Furthermore,
: our experience on average is that the more the line costs per month, the
: better service you get on it. Typically, with any large number of
: circuits, you will find the right people in the telco who actually give
: a damn about you and can "get things done".
:
: Joe
:
:

I am betting you are running ping under Linux. There is some Linux bugette
(I think cosmetic) that occasionally causes ping to count down
(effectively) 9, 8, 7, 6, 5, 4, 3, 2, 1 secs. I once tracked it down and it
was something boring like DNS resolution in ping. Check with tcpdump that
you are /really/ seeing that delay. Oh, and make sure you are pinging
something unloaded, not a router that has better things to do. I have no
doubt you are seeing some delay (or you wouldn't have started looking), but
I would bet ping is putting you off the scent.
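
If name resolution is the suspect, one rough cross-check is to time a lookup on its own, outside of ping. A minimal Python sketch using the standard library resolver call; the function name is made up for illustration, and the address in the demo is only a stand-in so the sketch runs anywhere:

```python
import socket
import time


def time_name_lookup(host):
    """Return (seconds_taken, addresses) for resolving host via the system resolver."""
    start = time.perf_counter()
    infos = socket.getaddrinfo(host, None)
    elapsed = time.perf_counter() - start
    # Each entry's sockaddr starts with the address string.
    addresses = sorted({info[4][0] for info in infos})
    return elapsed, addresses


# A numeric literal is used here so no DNS query is needed to run the sketch.
# Substitute the hostname you are actually pinging (SOMEHOST in this thread).
elapsed, addresses = time_name_lookup("127.0.0.1")
print(f"lookup took {elapsed * 1000:.1f} ms -> {addresses}")
```

This only times the forward lookup. A lookup that takes seconds rather than milliseconds would point at the resolver instead of the DSL line; a fast one does not rule out other delays inside ping, so the packet capture is still the more direct check.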

Alex