Famous operational issues

If we're just talking about outages historically, I recall the 1996 AOL email debacle, which had less to do with network mishaps than with DNS configuration…

As well, I believe the 2003 Northeast blackout was a great DR test that no one was expecting.

Of course, we also have the big non-events, such as Y2K…

Regards

Let’s hope you aren’t depending on a piece of medical equipment with a Y2038 issue to keep you alive.

Y2038 is everybody's problem!

Mark
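
For anyone who has not looked at the Y2038 arithmetic, here is a minimal sketch. It assumes the affected device keeps time as a signed 32-bit count of seconds since 1970-01-01 UTC; the helper names (`as_int32`, `to_utc`) are made up for illustration.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
INT32_MAX = 2**31 - 1  # largest value a signed 32-bit counter can hold


def as_int32(value: int) -> int:
    """Wrap an integer the way a signed 32-bit counter would (hypothetical helper)."""
    return (value + 2**31) % 2**32 - 2**31


def to_utc(seconds: int) -> datetime:
    """Convert seconds since the epoch to a UTC datetime without using time_t."""
    return EPOCH + timedelta(seconds=seconds)


last_good = to_utc(INT32_MAX)                  # last representable second
after_wrap = to_utc(as_int32(INT32_MAX + 1))   # one tick later, after the wrap

print(last_good.isoformat())   # 2038-01-19T03:14:07+00:00
print(after_wrap.isoformat())  # 1901-12-13T20:45:52+00:00
```

One second past the limit, the counter goes negative and the clock lands in 1901, which is why anything comparing timestamps or scheduling by them can misbehave.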

That was the one with the most severe impact for my company. Seven Frame Circuits (UUNET), and we all saw what an update can do.

Morris worm, November 1988. Much confusion and eventually the realization
that John Brunner had called it from 13 years out ("The Shockwave Rider", 1975).
But sloppy coding meant it could be defeated with one line of /bin/sh.

---rsk

When Boston University joined the internet proper ca 1984 I was in
charge of that group.

We accidentally* submitted an initial HOSTS.TXT file which included
some internally used one-character host names (A, B, C) and one which
began with a digit (3B, an AT&T 3B5), both illegal for HOSTS.TXT back
then.

This put the BSD Unix program which converted from HOSTS.TXT to Unix'
/etc/hosts format into an infinite loop filling /tmp, which in those
days crashed Unix, and the host often couldn't reboot successfully
without manual intervention.

On many, many hosts across the internet.

I hesitate to guess a number since scale has changed so much but some
of the more heated email claimed it brought down at least half the
internet by some count.

It was worsened by the fact that many hosts pulled and processed a new
HOSTS.TXT file via cron (time-based job scheduler) at midnight so no
one was around to fix and reboot systems.

The thread on the TCP-IP mailing list was: BU JOINS THE INTERNET!

It was a little embarrassing.

Today it probably would have landed me in Gitmo.

* There were two versions, the one we used internally, and the one to
be submitted which removed those host names. The wrong one got
submitted.
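
A minimal sketch of the kind of check that would have caught the two problems described above (one-character names and names beginning with a digit). The exact rule set is an assumption, loosely modeled on RFC 952 (letters, digits, "-" and ".", at most 24 characters); the function names and the sample name BU-CS are hypothetical, and this is not the original BSD converter.

```python
import re

# Assumed rules, loosely following RFC 952: first character a letter,
# last character a letter or digit, only letters, digits, '-' and '.'
# in between. The pattern needs at least two characters, so
# single-character names are rejected.
_NAME_RE = re.compile(r"[A-Za-z][A-Za-z0-9.-]*[A-Za-z0-9]")
MAX_LEN = 24  # assumed upper bound on name length


def is_legal_host_name(name: str) -> bool:
    """Return True if the name passes the assumed HOSTS.TXT rules."""
    return len(name) <= MAX_LEN and _NAME_RE.fullmatch(name) is not None


def split_names(names):
    """Partition names into (accepted, rejected); bad input is set aside, never retried."""
    accepted, rejected = [], []
    for name in names:
        (accepted if is_legal_host_name(name) else rejected).append(name)
    return accepted, rejected


accepted, rejected = split_names(["A", "B", "C", "3B", "BU-CS"])
print(accepted)  # ['BU-CS']
print(rejected)  # ['A', 'B', 'C', '3B']
```

The point of the second function is the failure mode: a converter that rejects and skips a bad entry degrades gracefully, whereas one that assumes every entry is well formed can spin forever on input it never expected.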

Well… pre-Internet, but the great Northeast fiber cut comes to mind (backhoe vs. fiber, backhoe won).

Miles Fidelman

I remember when the big carriers de-peered with Cogent in the early 2000s. They underestimated the number of websites being hosted by people using Cogent exclusively.

Justin Wilson
j2sw@j2sw.com

Cogentco still did not peer with Google and HE over IPv6 I guess.

The he.net side is interesting: you can see who their v4 transits are, but they suppress those routes over v6, yet (last I knew) they lacked community support for their customers to do similar route suppression.

I’m not a fan of it, but it makes the commercial discussions much easier each time those networks come by to shop services to me in a personal or professional capacity. “No, I need all the internet”.

- Jared

Thanks John.

This reminds me of two I've not seen anyone mention yet. Both,
coincidentally, were in the Chicago area, and I learned of them before
my entry into netops full time. One was a flood:

  <Chicago flood - Wikipedia>

The other, at the dawn of an earlier era:

  <TELECOM Digest OnLine - Sorted: Remembering the Great Telco Fire, May, 1988>

I wouldn't necessarily put those two in the top 3, but by some standard
for many they were certainly very significant and noteworthy.

John

(resent - to list this time)

Ahh, war stories. I like the one where I got a wake up call that our IRC server was on fire, together with the rest of the DC.

Not that widespread, but we reached Slashdot. :)

November 2002, University of Twente, The Netherlands. Some idiot wanted to be a hero. He deflated people's tires so that he could help inflate them. One morning he thought it would be a good idea to start a small fire and then extinguish it, so he would be the hero who stopped a fire. He failed and the building burned down. He got caught a few days later when he tried the same thing in a different building.

Almost all of the IT was in that building, including the core network, the uplinks to SURFNet (Dutch Educational Network) and the links to the 2000 students living on the campus. Ironically, a new DC was already being built, so that was ready for use a few weeks later.

As we had quite a network for 2002, we hosted, for instance, security.debian.org. The students all had 100Mbit in their rooms, so some of them also hosted some popular websites. One I can remember was an image sharing site.

Some students immediately created a backup network: DHCP server, DNS server with a catch-all, website explaining what was going on, IRC server, etc.

A local ISP offered to sponsor 50Mbit for the residents, which was connected via a microwave relay, and a temporary fiber was run through a ditch to connect two parts of the campus residences. At the end of the day all 2000 students had their internet connection back, although all behind a single 50Mbit link.

Syslog message from the local SURFNet router:

lo0.ar5.enschede1.surf.net 3613: Nov 20 07:20:50.927 UTC: %ENV_MON-2-TEMP: Hotpoint temp sensor(slot 18) temperature has reached WARNING level at 61(C)

(Disclaimer: Where I say we, I mean we as University. I wasn't working for the university, but was part of the students working on the backup network. There are probably some other people on list with some more details and I've probably missed some details, but this is the summary.)
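
As an aside, the syslog line quoted above follows the usual Cisco `%FACILITY-SEVERITY-MNEMONIC: text` layout, so it can be picked apart mechanically. A rough sketch, where the regexes and the function name `parse_env_alarm` are my own and only checked against this one line:

```python
import re

LINE = ("lo0.ar5.enschede1.surf.net 3613: Nov 20 07:20:50.927 UTC: "
        "%ENV_MON-2-TEMP: Hotpoint temp sensor(slot 18) temperature "
        "has reached WARNING level at 61(C)")

# %FACILITY-SEVERITY-MNEMONIC: free text
TAG_RE = re.compile(
    r"%(?P<facility>[A-Z0-9_]+)-(?P<severity>\d)-(?P<mnemonic>[A-Z0-9_]+): (?P<text>.*)"
)
# Slot number, alarm level and temperature from the free-text part
# (pattern written for this specific message only).
TEMP_RE = re.compile(
    r"slot (?P<slot>\d+)\).*?(?P<level>[A-Z]+) level at (?P<celsius>\d+)\(C\)"
)


def parse_env_alarm(line):
    """Return a dict of fields from a temperature alarm, or None if no tag is found."""
    tag = TAG_RE.search(line)
    if tag is None:
        return None
    fields = tag.groupdict()
    fields["severity"] = int(fields["severity"])
    detail = TEMP_RE.search(fields["text"])
    if detail is not None:
        fields["slot"] = int(detail["slot"])
        fields["level"] = detail["level"]
        fields["celsius"] = int(detail["celsius"])
    return fields


alarm = parse_env_alarm(LINE)
print(alarm["facility"], alarm["severity"], alarm["slot"], alarm["celsius"])
```

Severity 2 is "critical" on the standard syslog scale, which fits a sensor reading 61 C in a building that was on fire.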

Stolen isn’t nearly as exciting as what happens when your (used) 6509 arrives and
gets installed and operational before anyone realizes that the conductive packing
peanuts that it was packed in have managed to work their way into various midplane
connectors. Several hours later someone notices that the box is quite literally
smoldering in the colo, and the resulting combination of panic, fire drill, and
management antics ensues.

Owen

On that note, I’d be very interested in hearing stories of actual incidents that are the cause of why cardboard boxes are banned in many facilities, due to loose particulate matter getting into the air and setting off very sensitive fire detection systems.

Or maybe it’s more mundane and 99% of the reason is people unpack stuff and don’t always clean up properly after themselves.

We had a plastic bag sucked into the intake of a router in a
datacenter once that caused it to overheat and take the site down. We
had cameras in our cage and I remember seeing the photo from the site of
the colo (I'll protect their name just because) taken as the tech was on
the phone and pulled the bag out of the router.

  The time from the thermal warning syslog that it's getting warm
to overheat and shutdown is short enough you can't really get a tech to
the cage in time to prevent it.

  I also assume the latter above, which is that people have varying
definitions of clean.

  - Jared

I had a customer that tried to stack their servers, with no rails except on the bottommost one, using 2x4s between each server. Up until then I hadn't imagined anyone would want to fill their cabinet with wood, so I made a rule to ban wood and anything tangentially related (cardboard, paper, plastic, etc.). Easier to just ban all things. Fire reasons too, but mainly I thought a cabinet full of wood was too stupid to allow.

The "no wood" rule has become a fun story to tell everyone who asks how that ended up being a rule. The wood customer turned out to be a complete a-hole anyway, wood was just the tip of the iceberg.

Worked a chronic support call where their internet would bounce at noon every workday. The Cisco 1601 or 1700 router that had their T1 in it ended up being on top of a microwave. Weeks of troubleshooting and shipping new routers on this one.

Also had another one where the router was plugged into an outlet that was controlled by a light switch. We discovered this after shipping them two new routers.

Customer had their building remodeled and the techs couldn’t find the T1 Smartjack for the building. The contractor who did the remodel job decided it would be a good idea to cut out the section of wall where the telco equipment was and mount it to the ceiling. Its new location was in the ladies’ bathroom, above the drop ceiling, mounted to the building’s rafters 10’ in the air.

Customer needed a new router because the first one died. It was a machine shop and they had mounted the router to the wall next to a lathe or drill press that used oil to cool the bit while it was cutting. It looked like someone had dumped the router in a bucket of oil when we got it back.

Arrived at another large colo for a buildout, only to find that our ASR9K that had arrived 2 weeks earlier was stored outside on the loading dock, which has no roof or locked gate. I guess that’s why Cisco puts the plastic bag over the chassis when they’re shipped.

Colo techs at another large colo decided to unpack our router, which was a fully loaded half-rack chassis. Since they couldn’t lift it, they tipped the router on its side and walked it back by shifting the weight from one corner of the chassis to another, bending the chassis. I could see the scrape marks in the floor from it.

We had colo space on the top floor of an ATT CO where we put a Cisco 7513 to terminate about a dozen CHDS3s. The roof was leaking, and instead of fixing the roof, the fix was to put a sheet of plastic over our cabinet. It was more like a tent over the cabinet. A pool of water formed in a divot at the top and it was 120+ degrees under the plastic tarp.

Our office was in a work loft off an older building, and they had the AC units mounted to the ceiling with drip pans underneath them. Well, the pump for the drip pan on the 2nd-floor AC died. Whoever installed the drip pan didn’t secure it or center it under the AC unit. It filled up with water and, since it was unsecured and off-center, came crashing down with a few gallons of water. The water worked its way over to the wall and traveled down one story in the building. The floor below had all the telco equipment mounted to that same wall, and the water flowed right through a couple of ATT’s Cienas mounted there, shorting them out. I was at the Chicago NANOG Hackathon on Sunday and was called out to work that one :-/

Was working in the back of a cabinet that had -48 VDC power for a Cisco router when a screw fell and shorted out the power. My coworker, who was standing in front of the rack, wasn’t happy, because the ADC PowerWorx fuse panel was about 6" from his face where he was working. It had those little black alarm fuses with the spring-loaded arm. When it tripped, a nice shower of sparks flew right at his face. Luckily he wore glasses.

I was 18 at my first IT job and it was a brand-new building. I was plugging in a 208VAC 30A APC UPS in the server room; the electrician had just energized and checked the circuit. I plugged in the APC UPS and gave it a good turn for the twist-lock plug to catch and KA BAMB!!! Sparks came shooting out of the outlet at me. I think I pooped myself that day. Turns out the electricians had decided that a single-gang electrical box was good enough for a 208 VAC 30A outlet that barely fit in the box, and didn’t put any tape around the wire terminals. When they energized the circuit there was enough of an air gap that the hot screw didn’t ground out. When I gave it that good old twist while plugging in the APC, I grounded the hot screw to the side of the electrical box.

On the "stupid racking" front, I give you most of a rack dedicated to a single server. Not all that high a server, maybe 2U or so, but *way* too deep for the rack, so it had been installed vertically. By looping some fairly hefty chain through the handles on either side of the front of the chassis, and then bolting the four chain ends to the four rack posts. I wish I'd kept pictures of that one. Not flammable, but a serious WTF moment.

Cheers,
Tim.

Normally I reference this as an example of terrible government
bureaucracy, but in this case it's also how said bureaucracy can delay
operational changes.

I was a contractor for one of the many branches of the DoD in charge
of the network at a moderate-sized site. I'd been there about 4
months, and it was my first job with FedGov. I was sent a pair of
Cisco 6509-E routers, with all supervisors and blades needed, along
with a small mountain of SFPs, to replace the non-E 6509s we had
installed that were still using GBICs for their downlinks. These were
the distro switches for approximately half the site.

Problem was, we needed 84 new SC-LC fiber jumpers to replace the SC-SC
we had in place for the existing switch - GBICs to SFPs remember. We
hadn't received any with the shipment. So I reached out to the project
manager to ask about getting the fiber jumpers. "Oh, that should be
coming from the server farm folks, since it's being installed in a
server farm." Okay, that seems stupid to me, but $FedGov, who knows. I
tell him we're stalled out until we get those cables - we have the
routers configured and ready to go, just need the jumpers, can he get
them from the server farm folks? He'll do that.

It took FIFTEEN MONTHS to hash out who was going to pay for and order
the fiber jumpers. Any number of times as the months dragged on, I
seriously considered ordering them on Amazon Prime using my corporate
card. We had them installed a week and a half after we got them. Why
that long? Because we had to completely reconfigure the routers, and
after 15 months, the urgency just wasn't there.

By the way, the project ended up buying them, not the server farm team.

A few I remember:

. Some monitoring server's SCSI drive failed (we’re talking State/Province level govt)… Got a reply back stating it would take 6 months to get a replacement…

Ended up choosing to use my own drive instead of leaving something that could have been deadly unmonitored.

. Metro interruption during rush hour (for a pop of 4M) due to an overloaded power bar in an MMR (Meet Me Room) during an unplanned deployment;

. Cherry red and very angry looking 520-600V bus bar =D;

. Fire fighters hitting the building generator emergency STOP button because some neighbor reported smoke on top of the building during a black out…
( not their fault, local gov failure as usual )

. Some idiots poured gasoline into a large pipe under a bridge… ended up demonstrating the lack of diversity to the DCs on that urban island;

. Underground transformer blew up in downtown Mtl and took out the entire fiber bundle, demonstrating to those customers that their diversity was actually real =D.

(took them a year to get that fixed)

and

. Obviously: Any rack cabling I do…