Odd router brokenness

Mark_Radabaugh · November 23, 2011, 2:41pm

Since this list likes to speculate with few facts on a regular basis (and I'll admit to being as guilty as anyone), I'll throw this one out for opinions:

We were seeing very odd behavior on a Cogent circuit following a software upgrade to tol01.atlas. Two traceroutes:

mark@angola-gw> traceroute 74.125.226.6
traceroute to 74.125.226.6 (74.125.226.6), 30 hops max, 40 byte packets
  1 * * gi1-1.ccr01.tol01.atlas.cogentco.com (38.104.148.5) 110.315 ms
  2 te4-2.ccr01.sbn01.atlas.cogentco.com (154.54.7.154) 139.520 ms 196.910 ms 5.728 ms
  3 * * *
  4 * te0-5-0-5.ccr21.ord03.atlas.cogentco.com (154.54.44.174) 8.310 ms te0-0-0-7.ccr21.ord03.atlas.cogentco.com (154.54.25.70) 8.752 ms
  5 te0-0-0-0.ccr22.ord03.atlas.cogentco.com (154.54.24.214) 8.983 ms te0-1-0-0.ccr22.ord03.atlas.cogentco.com (66.28.4.66) 7.948 ms *
  6 * * te-9-1.car4.Chicago1.Level3.net (4.68.127.129) 26.127 ms
  7 GOOGLE-INC.car4.Chicago1.Level3.net (4.71.100.22) 38.132 ms 25.120 ms *
  8 * * 209.85.254.122 (209.85.254.122) 24.539 ms
  9 * 72.14.237.130 (72.14.237.130) 26.134 ms 72.14.237.108 (72.14.237.108) 25.021 ms
      MPLS Label=666803 CoS=4 TTL=1 S=1
 10 216.239.46.161 (216.239.46.161) 31.816 ms 35.702 ms 32.249 ms
 11 72.14.233.142 (72.14.233.142) 32.897 ms * *
 12 * yyz06s05-in-f6.1e100.net (74.125.226.6) 33.319 ms *

and a ping over the same path:

--- www.l.google.com ping statistics ---
675 packets transmitted, 323 packets received, 52.1% packet loss
round-trip min/avg/max/stddev = 12.834/28.831/129.743/28.987 ms

and at the same time:

mark@angola-gw> traceroute 38.100.128.10
traceroute to 38.100.128.10 (38.100.128.10), 30 hops max, 40 byte packets
  1 gi1-1.ccr01.tol01.atlas.cogentco.com (38.104.148.5) 4.445 ms 1.841 ms 1.713 ms
  2 te7-7.ccr02.cle04.atlas.cogentco.com (154.54.5.230) 5.318 ms te3-2.ccr02.cle04.atlas.cogentco.com (154.54.28.86) 4.755 ms te7-7.ccr02.cle04.atlas.cogentco.com (154.54.5.230) 4.982 ms
  3 te4-2.ccr01.pit02.atlas.cogentco.com (154.54.30.10) 7.997 ms te3-2.ccr01.pit02.atlas.cogentco.com (154.54.30.6) 7.736 ms te4-2.ccr01.pit02.atlas.cogentco.com (154.54.30.10) 8.177 ms
  4 te0-0-0-5.mpd21.dca01.atlas.cogentco.com (154.54.40.81) 17.197 ms te0-0-0-5.ccr22.dca01.atlas.cogentco.com (154.54.30.230) 16.907 ms te0-0-0-5.mpd21.dca01.atlas.cogentco.com (154.54.40.81) 17.008 ms
  5 te0-1-0-0.mpd22.dca01.atlas.cogentco.com (154.54.2.193) 17.358 ms te0-0-0-0.mpd22.dca01.atlas.cogentco.com (154.54.31.38) 17.196 ms te0-1-0-0.mpd22.dca01.atlas.cogentco.com (154.54.2.193) 18.690 ms
  6 te4-2.mpd01.iad03.atlas.cogentco.com (154.54.29.122) 17.885 ms * 18.537 ms
  7 cogentco.com (38.100.128.10) 17.836 ms !<10> 17.918 ms !<10> 17.833 ms !<10>

--- 38.100.128.10 ping statistics ---
236 packets transmitted, 236 packets received, 0.0% packet loss
round-trip min/avg/max/stddev = 22.717/27.942/128.011/12.236 ms
sh-3.2#

Works perfectly. There is no asymmetric routing in this scenario (only 1 BGP peer running during this test), and it is not due to traffic congestion. Initial speculation over the dropped packets in the trace to 74.125.226.6 was ICMP deprioritization, but the results are too consistent for that to make sense (I have dozens of traceroutes to the same destination, and they all look similar).
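One way to back up the "too consistent" claim with numbers is to tally lost probes per hop across many saved traceroutes. The following is only a sketch: it assumes output shaped like the traces above, where each `*` is an unanswered probe and each `N ms` is an answered one, and the function names are made up for illustration.

```python
import re

# A hop line starts with the hop number; header and MPLS label lines do not.
HOP_LINE = re.compile(r"^\s*(\d+)\s+(.*)$")

def probe_loss_by_hop(traceroute_text):
    """Map hop number -> (lost_probes, total_probes) for one traceroute run."""
    loss = {}
    for line in traceroute_text.splitlines():
        match = HOP_LINE.match(line)
        if match is None:
            continue
        tokens = match.group(2).split()
        lost = tokens.count("*")       # each '*' is a probe that got no reply
        answered = tokens.count("ms")  # each 'N ms' is a probe that got a reply
        loss[int(match.group(1))] = (lost, lost + answered)
    return loss

def merge_runs(runs):
    """Sum (lost, total) per hop across several parsed runs."""
    merged = {}
    for run in runs:
        for hop, (lost, total) in run.items():
            prev_lost, prev_total = merged.get(hop, (0, 0))
            merged[hop] = (prev_lost + lost, prev_total + total)
    return merged
```

Feeding the dozens of saved traces through `probe_loss_by_hop` and then `merge_runs` gives a per-hop loss ratio. If loss shows up at the first hop and stays at a similar level at every later hop, that points at the first router rather than at ICMP handling somewhere further along.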

I realize there is a long history of Cogent/L3 ugliness but I'm pretty sure that this issue has nothing to do with that subject.

Traceroutes and pings from the control plane of tol01.atlas sourced from 38.104.148.5 do not show any odd behavior. Inbound traffic (to us) is not affected by this. Our workaround while resolving this issue was to change local-pref on the affected prefixes to send traffic out our other providers.
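The local-pref workaround can be illustrated with a toy best-path comparison. This is a simplified sketch, not a full BGP decision process: it only models "highest local-pref wins, then shortest AS path", and the prefix, provider names, and values are hypothetical.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Route:
    prefix: str
    provider: str
    local_pref: int = 100  # a common default value
    as_path_len: int = 1

def best_route(routes):
    """Prefer the highest local-pref, then the shortest AS path."""
    return max(routes, key=lambda r: (r.local_pref, -r.as_path_len))

# Hypothetical routes for one affected prefix.
via_cogent = Route("74.125.226.0/24", "cogent", as_path_len=2)
via_other = Route("74.125.226.0/24", "other-provider", as_path_len=3)

before = best_route([via_cogent, via_other])
# Workaround: lower local-pref on the affected prefix learned via Cogent.
after = best_route([replace(via_cogent, local_pref=50), via_other])
```

Here `before.provider` is `cogent` and `after.provider` is `other-provider`, since local-pref is compared before AS path length.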

The issue started after a software upgrade to tol01.atlas and resolved after a (reported) reboot of tol01.atlas.

The question is: How does a router break in this manner? It appears to unintentionally be doing something different with traffic based on the source address, not the destination address. I realize this can be done intentionally - but that is not the case here (unless somebody isn't telling me something).

Saku_Ytti1 · November 23, 2011, 4:33pm

I don't think we can determine that it has anything to do with the source
address based on the data shown.
38.104.148.5 could very well be a 6500 with a somehow broken adjacency to
74.125.226.6, perhaps a hardware adjacency having an MTU of 0B, causing a punt
which is rate-limited by a different policer than the TTL-exceeded policer.
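To see how a punt policer could turn into steady partial loss rather than a total outage, here is a minimal token-bucket sketch. It is not a model of any particular platform: the offered rate, policer rate, and burst size are invented numbers chosen for illustration.

```python
def police(arrival_times_ms, rate_pps, burst_packets):
    """Count packets passed by a token-bucket policer.

    Tokens are tracked in integer thousandths of a packet so the
    arithmetic is exact: rate_pps tokens per second equals rate_pps
    milli-tokens per millisecond.
    """
    cost = 1000                       # one packet costs 1000 milli-tokens
    capacity = burst_packets * cost   # bucket depth
    tokens = capacity                 # bucket starts full
    last = None
    passed = 0
    for now in arrival_times_ms:
        if last is not None:
            # Refill for the elapsed time, capped at the bucket depth.
            tokens = min(capacity, tokens + (now - last) * rate_pps)
        last = now
        if tokens >= cost:
            tokens -= cost
            passed += 1
    return passed

# Hypothetical numbers: 100 pps offered for 10 s against a 50 pps policer.
arrivals = [i * 10 for i in range(1000)]
passed = police(arrivals, rate_pps=50, burst_packets=1)
loss_pct = 100.0 * (len(arrivals) - passed) / len(arrivals)
```

With these made-up rates the policer passes every second packet, so `loss_pct` comes out at 50.0. The point is only that traffic punted through a policer sees a stable loss fraction set by the ratio of offered to permitted rate, while traffic forwarded in hardware to other destinations is untouched.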

Keegan_Holley · November 23, 2011, 4:41pm

I've seen an ether-channel go south without the port showing down. Stuff hashed over
the good link was fine, stuff hashed over the bad link wasn't. Led to some
painful support calls from customers. I agree this list is a haven of
speculation and OT comments. In order to avoid making a bad problem worse,
you should probably contact Cogent.
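The hashing behaviour described here can be sketched as follows. Real port-channel hash functions are vendor- and configuration-specific; the XOR-and-modulo scheme and the documentation-range addresses below are assumptions for illustration only.

```python
import ipaddress

def member_link(src, dst, n_links):
    """Pick a bundle member from the source and destination addresses.

    Assumption: a simple XOR of the two addresses modulo the number of
    member links. Actual hardware hashes differ.
    """
    s = int(ipaddress.ip_address(src))
    d = int(ipaddress.ip_address(dst))
    return (s ^ d) % n_links

def affected_destinations(src, dsts, n_links, bad_link):
    """Return the destinations whose flows land on the silently failed member."""
    return [d for d in dsts if member_link(src, d, n_links) == bad_link]

# Hypothetical example: 4-member bundle, member 3 is bad but still "up".
dsts = [f"198.51.100.{i}" for i in range(8)]
broken = affected_destinations("192.0.2.1", dsts, n_links=4, bad_link=3)
```

In this toy setup `broken` is `['198.51.100.2', '198.51.100.6']`: the same source reaches six of the eight destinations fine and fails to the other two, with nothing on the port status to show for it.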

Mark_Radabaugh · November 23, 2011, 4:45pm

I was told the router was reloaded to resolve a CEF issue. Not sure what was wrong with 'clear cef linecard'.

Leigh_Porter · November 23, 2011, 5:00pm

Now *that* brings back memories!

Mark_Radabaugh · November 23, 2011, 5:01pm

It's fixed at this point. You are correct in that it was quite painful getting this escalated far enough to get it fixed. The tools that are available (at least that I know of) to try to prove the issue to level 1 and 2 support just don't get the job done.

It's the eternal problem of convincing L1/2 support that you really have a problem not of your own making.

Mark

Saku_Ytti1 · November 23, 2011, 5:05pm

Or just fixing the broken prefixes/adjacencies and opening a CTAC case about
what was wrong with them.

http://www.quickmeme.com/meme/35cet6/