Intermedia (ICIX) brokenness...

So who replaced Intermedia's routers with a lump of cheese? :

4 gw-swbell.digex.net (206.181.161.153) 8 ms 9 ms 10 ms
5 aus1-core3-pos1-0.atlas.icix.net (165.117.68.210) 9 ms 6 ms 7 ms
6 dfw3-core2-pos2-2.atlas.icix.net (165.117.68.218) 18 ms 15 ms 14 ms
7 dfw3-core1-pos7-0.atlas.icix.net (165.117.48.121) 28 ms 14 ms 14 ms
8 a3-0-14.crtntx1-cr12.bbnplanet.net (4.24.147.53) 1366 ms 1336 ms 1360 ms
9 p6-0.crtntx1-br2.bbnplanet.net (4.24.9.253) 1404 ms 1352 ms 1381 ms
10 p15-0.crtntx1-br1.bbnplanet.net (4.24.10.113) 1374 ms 1327 ms 1359 ms
11 p2-0.crtntx1-cr1.bbnplanet.net (4.24.5.2) 1368 ms 1392 ms 1364 ms
12 p3-0-0.aperian3.bbnplanet.net (4.24.104.98) 1423 ms 1328 ms 1369 ms
13 morannon.the-infinite.org (216.139.208.190) 1423 ms 1383 ms 1374 ms
$

... and the reverse ...

4 183.at-2-0-0.XR1.DFW7.ALTER.NET (152.63.96.241) 28 ms 28 ms 28 ms
5 191.ATM10-0-0.BR1.DFW7.ALTER.NET (146.188.241.177) 29 ms 30 ms 21 ms
6 137.39.52.34 (137.39.52.34) 29 ms 28 ms 28 ms
7 dfw3-core1-pos6-0.atlas.icix.net (165.117.48.133) 1370 ms 1377 ms 1369 ms
8 dfw3-core2-pos6-0.atlas.icix.net (165.117.48.122) 1306 ms 1357 ms 1353 ms
9 aus1-core3-pos3-0.atlas.icix.net (165.117.68.217) 36 ms 36 ms 36 ms
10 aus1-core1-pos9-0-0.atlas.icix.net (165.117.68.209) 37 ms 37 ms 37 ms
11 206.181.161.30 (206.181.161.30) 37 ms 38 ms austin-gw.swbell.com (206.181.161.154) 1363 ms
12 ded1-fa1-0-0.austtx.swbell.net (151.164.20.243) 38 ms 38 ms 37 ms
13 seton-medical-center-638746.cust-rtr.swbell.net (151.164.22.46) 34 ms 34 ms 34 ms
$

Simple troubleshooting...

So who replaced Intermedia's routers with a lump of cheese? :

7 dfw3-core1-pos7-0.atlas.icix.net (165.117.48.121) 28 ms 14 ms 14 ms
8 a3-0-14.crtntx1-cr12.bbnplanet.net (4.24.147.53) 1366 ms 1336 ms 1360 ms

6 137.39.52.34 (137.39.52.34) 29 ms 28 ms 28 ms
7 dfw3-core1-pos6-0.atlas.icix.net (165.117.48.133) 1370 ms 1377 ms 1369 ms

... would tell you that the problem is most likely a saturated link
between either AS2548 and AS1, or AS701 and AS2548.
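This kind of eyeball comparison can be mechanized. Below is a minimal
Python sketch (not anything the original posters ran): it parses the
responsive-hop line format shown above, takes the minimum RTT per hop, and
reports the adjacent hop pair with the largest latency jump -- which is where
the congested inter-AS link most likely sits.

```python
import re

def parse_hops(trace):
    """Parse 'N hostname (ip) a ms b ms c ms' lines into (name, min_rtt) pairs."""
    hops = []
    for line in trace.strip().splitlines():
        m = re.match(r"\s*\d+\s+(\S+)\s+\([\d.]+\)((?:\s+[\d.]+\s+ms)+)", line)
        if m:
            rtts = [float(x) for x in re.findall(r"([\d.]+)\s+ms", m.group(2))]
            hops.append((m.group(1), min(rtts)))
    return hops

def biggest_jump(hops):
    """Return the adjacent hop pair with the largest increase in min RTT."""
    return max(zip(hops, hops[1:]), key=lambda p: p[1][1] - p[0][1])

# Hop data copied from the forward trace above.
forward = """
7 dfw3-core1-pos7-0.atlas.icix.net (165.117.48.121) 28 ms 14 ms 14 ms
8 a3-0-14.crtntx1-cr12.bbnplanet.net (4.24.147.53) 1366 ms 1336 ms 1360 ms
"""

(a, ra), (b, rb) = biggest_jump(parse_hops(forward))
print(f"latency jumps {rb - ra:.0f} ms between {a} and {b}")
```

Using the minimum of the three probes per hop filters out one-off queueing
noise; a sustained jump in the minimum is what points at a full pipe rather
than a busy router CPU.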

[snip]

Personally, I'm still trying to figure out why Exodus, in all their
apparent wisdom (or lack thereof), has stopped using the GBLX OC-48s in
the former GlobalCenter facilities (or at least SNV3), and is now
shuttling all its traffic out a single Exodus OC-12. Prior to yesterday
these traces would've shown gblx.net routers (on different IPs), and would
never have touched an Exodus backbone...

fs1(1)# traceroute ops.sj.ipixmedia.com
traceroute to ops.sj.ipixmedia.com (64.209.175.20), 30 hops max, 40 byte packets
1 fw.i.eng.bamboo.com (192.168.12.254) 0.872 ms 0.756 ms 0.716 ms
2 gw.eng.bamboo.com (63.78.93.1) 1.961 ms 2.275 ms 2.076 ms
3 240.ATM3-0.GW4.PAO1.ALTER.NET (157.130.195.13) 2.960 ms 3.109 ms 2.935 ms
4 124.ATM2-0.XR2.PAO1.ALTER.NET (146.188.148.86) 3.175 ms 3.491 ms 3.251 ms
5 188.at-1-0-0.XR4.SCL1.ALTER.NET (152.63.51.138) 3.960 ms 4.013 ms 4.339 ms
6 194.ATM5-0.GW6.SCL1.ALTER.NET (152.63.52.53) 4.347 ms 4.280 ms 4.068 ms
7 exodus-OC12-gw.customer.alter.net (157.130.203.90) 4.523 ms 4.673 ms 4.879 ms
8 66.35.194.18 (66.35.194.18) 4.908 ms 4.647 ms 4.830 ms
9 bbr01-p4-1.snva03.exodus.net (209.185.9.85) 5.470 ms 5.497 ms 5.541 ms
10 64.15.192.3 (64.15.192.3) 5.761 ms 5.606 ms 5.458 ms
11 64.209.177.30 (64.209.177.30) 5.334 ms 5.737 ms 5.738 ms
12 ops.sj.ipixmedia.com (64.209.175.20) 5.648 ms 5.751 ms 5.697 ms

Reverse:

ops:~# traceroute fs.eng.bamboo.com
traceroute to fs.eng.bamboo.com (63.78.93.3), 30 hops max, 40 byte packets
1 wr1.sj.ipixmedia.com (64.209.175.3) 0.213 ms 0.155 ms 0.144 ms
2 64.209.177.18 (64.209.177.18) 0.196 ms 0.206 ms 0.162 ms
3 64.15.192.17 (64.15.192.17) 0.279 ms 0.23 ms 0.191 ms
4 bbr02-p0-3.sntc08.exodus.net (209.185.9.86) 0.927 ms 0.986 ms 0.852 ms
5 66.35.194.5 (66.35.194.5) 1.002 ms 0.902 ms 0.843 ms
6 POS2-0.GW6.SCL1.ALTER.NET (157.130.203.89) 2.571 ms 1.548 ms 1.43 ms
7 168.at-6-0-0.XR4.SCL1.ALTER.NET (152.63.52.62) 1.706 ms 1.653 ms 1.911 ms
8 152.63.51.141 (152.63.51.141) 2.414 ms 2.781 ms 2.405 ms
9 188.ATM5-0.GW4.PAO1.ALTER.NET (146.188.148.85) 2.544 ms 2.572 ms 2.791 ms
10 ipix-gw.customer.ALTER.NET (157.130.195.14) 4.946 ms 5.457 ms 4.969 ms
11 fs.eng.bamboo.com (63.78.93.3) 4.573 ms 4.984 ms 5.023 ms

Of course, this is probably a move I should've expected from Exodus, after
the mongolian flustercluck that was the AS change in SNV3. You'd think
they would do something like that carefully, as you can -seriously- bone
customers. But noooooo. One of our junior admins made the change (since
I was out of town, but hey, it's cut and paste!). He, and all of the
other affected customers in SNV3 on the conference call, were left on hold
for about half an hour (plus the call started half an hour late),
whereupon the Exodus engineering team popped back in and said "We're done
with our side, you guys go ahead!".

Now. Does it seem logical to kill connectivity over BOTH of your hosting
routers at once, thus killing every single BGP-running customer you have
that isn't physically in their cage at the time? Or would it seem better
to do what I assumed they'd do, which is do one router, wait for everyone
to make changes, then do the other?

I guess this is what happens when I assume intelligence at a
hosting/backbone provider.

*returns to watching the lights blink*

-j

The problem is almost certainly between AS2548 (Digex/Intermedia/icix) and
AS1 (Genuity/bbnplanet). But I think the reason that the Intermedia hop
times look like Swiss cheese in the return trace is tricky.

My guess is that Intermedia and Genuity have two physical paths between
their networks in Dallas, and that Intermedia is splitting traffic to AS1
between the two paths based on a combination of source and destination IP
address. One of the paths is overloaded going from AS2548 to AS1 -- the
other is not.

In the forward direction, where the trace packets crossing from AS2548 to
AS1 all share the same source and destination addresses, there is a clean
transition. In the reverse direction, where the source address of the ICMP
error message changes with every hop, the round-trip time may be very
short or very long, depending on the router involved. (Check out hop 11 in
the return trace.)
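That theory is easy to model. The sketch below uses a toy per-flow hash
over the (source, destination) address pair (the real hash on Intermedia's
routers is vendor-specific and unknown, and the IPs are just ones from the
traces above); the point is only that a fixed (src, dst) pair always lands
on the same path, while a source address that changes hop to hop flips
between the congested and uncongested paths:

```python
import hashlib

PATHS = {0: "path A (uncongested)", 1: "path B (congested)"}

def pick_path(src, dst):
    """Toy per-flow hash: pick one of two parallel links from (src, dst)."""
    h = hashlib.md5(f"{src}->{dst}".encode()).digest()
    return h[0] % 2

# Forward trace: every probe has the same src/dst, so the flow always
# hashes to the same link and the RTTs look consistent.
print("forward probes all take:",
      PATHS[pick_path("206.181.161.153", "216.139.208.190")])

# Reverse trace: each hop's ICMP time-exceeded is sourced from a different
# router address, so successive hops can hash to different links -- hence
# the 36 ms / 1370 ms Swiss cheese in the return trace.
for router in ["165.117.48.133", "165.117.48.122", "165.117.68.217"]:
    print(router, "->", PATHS[pick_path(router, "151.164.22.46")])
```

Any deterministic per-flow hash behaves this way; the specific routers and
split here are illustrative, not a claim about Intermedia's actual config.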

-steve

Dashbit -- The Leader In Internet Topology
www.dashbit.com www.traceloop.com