Famous operational issues

Friends,

I'd like to start a thread about the most famous and widespread Internet
operational issues, outages or implementation incompatibilities you
have seen.

Which examples would make up your top three?

To get things started, I'd suggest the AS 7007 event is perhaps the
most notorious and likely to top many lists, including mine. So if
that is one for you, I'm asking for just two more.

I'm particularly interested in this as the first step in developing a
future NANOG session. I'd be especially interested in any issues
that also identify key individuals who might still be around and
interested in participating in a retrospective. I already have someone
who is willing to talk about AS 7007, and it shouldn't be hard to guess
who.

Thanks in advance for your suggestions,

John

This was a fantastic outage, one could really feel the tremors into the
far corners of the BGP default-free zone:

RIPE NCC and Duke University BGP Experiment — RIPE Labs

The experiment triggered a bug in some Cisco router models: affected
Ciscos would corrupt this specific BGP announcement ** ON OUTBOUND **.
Any peers of such Ciscos receiving this BGP update would (according to
then current RFCs) consider the BGP UPDATE corrupted, and would
subsequently tear down the BGP sessions with the Ciscos. Because the
corruption was not detected by the Ciscos themselves, whenever the
sessions would come back online again they'd reannounce the corrupted
update, causing a session tear down. Bounce ... Bounce ... Bounce ... at
global scale in both IBGP and EBGP! :-)

Luckily the industry took these and many other lessons to heart: in
2015 the IETF published RFC 7606 ("Revised Error Handling for BGP UPDATE
Messages") which specifies far more robust behaviour for BGP speakers.

Kind regards,

Job

https://blogs.oracle.com/internetintelligence/longer-is-not-always-better

Hi,

I don't want to classify or rate it, but I would name 9/11.

You can read about the impacts in the list archives, and there is also a presentation from NANOG 23 online.

Regards
Jörg

actually, the 129/8 incident was as damaging as 7007, but folk tend not
to remember it; maybe because it was a bit embarrassing

and the baltimore tunnel is a gift that gave a few times

and the quake/mudslides off taiwan

the tohoku quake was also fun, in some sense of the word

but the list of really damaging wet glass cuts is long

actually, the 129/8 incident

a friend pointed out that it was the 128/9 incident

but folk tend not to remember it

qed, eh? :-)

https://en.wikipedia.org/wiki/SQL_Slammer was interesting in that it was an application-layer issue that affected the network layer.

Damian

Since you said operational issues, instead of just outage...

How about MCI Worldcom's 10-day operational disaster in 1999.

http://www.cnn.com/TECH/computing/9908/23/network.nono.idg/
How not to handle a network outage

[...]
MCI WorldCom issued an alert to its sales force, which was given the option to deliver a notice to customers by e-mail, hand delivery or telephone – or not at all. After a deafening silence from company executives on the 10-day network outage, MCI WorldCom CEO Bernie Ebbers finally took the podium to discuss the situation. How did he explain the failure, and reassure customers that the network would not suffer such a failure in the future? He didn't. Instead, he blamed Lucent.
[...]

There are all the hilarious leaks and blocks.

Pakistan blocks YouTube and the announcement leaks internet-wide.
Turk Telekom (AS9121 IIRC) leaks a full table out to one of their providers.

So many routing level incidents they’re probably not even interesting any more, I suppose.

The huge power outages in the US northeast in 2003 (https://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.183.998&rep=rep1&type=pdf) were pretty decent.

Oh well, MCI in 1999 was all about…

Would this also extend to intentional actions that may have had unintended consequences, such as provider A intentionally de-peering provider B, or the monopoly telco for $country cutting itself off from the rest of the global Internet for various reasons (technical, political, or otherwise)?

That said, I’d still have to stick with AS7007, the Baltimore tunnel fire, and 9/11 as the most prominent examples of widespread issues/outages and how those issues were addressed.

Honorable mention: $vendor BGP bugs, either due to $vendor ignoring the relevant RFCs, implementing them incorrectly, or an outage exposing a design flaw that the RFCs didn't catch. Too many of those to list here :-)

jms

I was thinking about how we need a war stories nanog track. My favorite was being on call when the router was stolen.

Hi,

I was thinking about how we need a war stories nanog track. My favorite was
being on call when the router was stolen.

Wait... what? I would love to listen to that call between you and your manager.

But, here is one for you then. I was once called to a POP where one of our main
routers was down. Due to political reasons, my access had been revoked. My
manager told me to do whatever I needed to do to fix the problem; he would cover
my behind. I did, and I "gently" removed the door. My manager kept his word.

Another interesting one: entering a POP to find it flooded. Luckily there were
raised floors with only fiber underneath the floor panels. The NOC ignored the
warnings because "it was impossible for water to enter the building as it was
not raining". Yeah, but water pipes do burst from time to time.

But my favorite was pressing an undocumented combination of keys on a fire
alarm system which set off the Inergen protection without warning, immediately.
The noise and pressure of all that air entering the datacenter space with me
still in it is something I will never forget. Similar to the response of my
manager who, instead of asking me if I was ok, decided to try and light a piece
of paper. "Oh wow, it does work, I can't set anything on fire".

All of this was, obviously, in the late 1990s and early 2000s. These days,
things are -slightly- more professional.

Thanks,

Sabri

This reminds me of one of the Sprint CO's we were colo'd in. Access to the CLEC colo area was via a back door through the Men's room! One weekend, I had to make the drive to that site to deal with an access server issue, and I found they'd locked the back door to the Men's room from the colo floor side, so no access. Using supplies I found inside the CO, I managed to open the locked door and get to our gear. That route, being our only access route, was probably some kind of violation. Not all of our techs were guys.

While we never had a router stolen, we did have a flash card stolen from one of our routers in a WCOM colo facility (most customers in open relay racks). It was right after they'd upgraded the doors to the colo area from simplex locks to card access. I was pissed for quite some time that WCOM knew who was in there (due to the card access system), but refused to tell us. I figured it was probably one of their own people.

Biggest internet operational SUCCESS

1. Secure Shell (SSH) replaced TELNET. Nearly eliminated an entire class of security problems on the Internet. But then HTTP took over everything, so a good news/bad news.

2. Internet worms massively reduced by changed default configurations and default firewalls (Windows XP proved defaults could be changed). Still need to work on DDOS amplification.

3. Head of Line blocking in IX switches fixed (although I miss Stephen Stuart saying "I'm Sorry" at every NANOG for a decade). Was a huge problem, which is a non-problem now.

4. Classless Inter-Domain Routing and BGP4 changed how Internet routing worked across the entire backbone, and it worked! Vince Fuller et al rebuilt the aircraft in flight, without crashing.

5. Y2K was a huge success because a lot of people fixed things ahead of time, and almost nothing crashed (other than the National Security Agency's internal systems :-). I'll be retired before Y2038, so that's someone else's problem.

There was the outage in 2014 when we got to 512K routes. http://www.bgpmon.net/what-caused-todays-internet-hiccup/
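
If memory serves, the limit there was literally 512 * 1024 = 524,288 IPv4
entries on platforms shipped with the default TCAM carving, so a trivial
headroom check would have given weeks of warning as the global table crept
up on it. Something like this (a hypothetical Python sketch; the
get_route_count() stub is a placeholder for whatever telemetry you really
have):

# Hypothetical FIB headroom check. FIB_LIMIT matches the 2014 event;
# adjust for your own platform. get_route_count() is a stub.
FIB_LIMIT = 512 * 1024   # 524,288 IPv4 entries
WARN_AT = 0.90           # start complaining at 90% utilisation

def get_route_count():
    """Placeholder: return the current IPv4 FIB size from SNMP/telemetry."""
    return 500_000

def check_fib_headroom():
    count = get_route_count()
    used = count / FIB_LIMIT
    if count >= FIB_LIMIT:
        return f"CRITICAL: {count} routes, FIB full ({FIB_LIMIT} entries)"
    if used >= WARN_AT:
        return f"WARNING: {count} routes, {used:.0%} of {FIB_LIMIT} entries"
    return f"OK: {count} routes ({used:.0%} of {FIB_LIMIT} entries)"

if __name__ == "__main__":
    print(check_fib_headroom())   # WARNING: 500000 routes, 95% of 524288 entries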

    > I'd like to start a thread about the most famous and widespread Internet
    > operational issues, outages or implementation incompatibilities you
    > have seen.
    >
    > Which examples would make up your top three?

    This was a fantastic outage, one could really feel the tremors into the
    far corners of the BGP default-free zone:

    RIPE NCC and Duke University BGP Experiment — RIPE Labs

    [...]

In a similar fashion, a network I know had a massive outage when a
failing linecard corrupted IS-IS LSPs, triggering a flood of purges
and taking down the whole backbone.

This was pre-RFC 6232, so you can guess that resolving the issue was a real PITA.
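
For anyone who hasn't lived through it: a purge is just the LSP re-flooded
with its remaining lifetime set to zero and, pre-RFC 6232, no body, so
nothing in the packet tells you which router generated it. A rough Python
sketch of what the Purge Originator Identification TLV buys you (TLV layout
and system ID formatting simplified for illustration):

# Rough sketch: identify who generated an IS-IS purge, if the RFC 6232
# Purge Originator Identification TLV (type 13) is present. LSPs are
# modelled as plain dicts here; real encodings are of course binary.
POI_TLV_TYPE = 13

def purge_originator(lsp):
    """Return who to blame for a purge, if the LSP says."""
    if lsp.get("remaining_lifetime", 1) != 0:
        return "not a purge"
    for tlv_type, value in lsp.get("tlvs", []):
        if tlv_type == POI_TLV_TYPE:
            return f"purged by {value}"    # system ID of the purging IS
    return "purge of unknown origin"       # the pre-RFC 6232 situation

if __name__ == "__main__":
    pre6232 = {"remaining_lifetime": 0, "tlvs": []}
    post6232 = {"remaining_lifetime": 0,
                "tlvs": [(POI_TLV_TYPE, "1921.6800.1001")]}
    print(purge_originator(pre6232))    # purge of unknown origin
    print(purge_originator(post6232))   # purged by 1921.6800.1001

With the TLV, a flood of bogus purges points straight at the system ID of
the box with the failing linecard instead of leaving you to bisect the
backbone by hand.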

This kind of outage fuels my netops nightmares.

AS7007 is how I found NANOG. We (Digital Island; first job out
of college) were in 10-20 countries around the planet at the time.
All of them went down while we were in Cisco training. I kept
interrupting the class and telling my manager "everything's down!
We need to stop the training and get on it!" We didn't because I
was new and no one believed that much could go down all at once.
They assumed it was a monitoring glitch. So, the training
continued for a while until very senior engineers got involved.
One of the senior guys said something to the effect of "yeah, it's
all over NANOG." I said what is NANOG? I signed up for the list
and many of you have had to listen to me ever since... ;-)

scott

This reminds me of one of the Sprint CO's we were colo'd in.

Ah, Sprint. Nothing like using your railroad to run phone lines...
Our routers in San Jose colo were black from the soot of the trains.

Fondly remember a major Sprint outage in the early 90s. All our data
circuits in the southeast went down at once and there were major voice
outages in the entire southeast.

Turns out a storm caused a mudslide which in turn derailed a train
carrying toxic waste, resulting in a wave of 6-10' of toxic mud taking
out the Sprint voice POP for the whole southeast, because it was
conveniently located right on said railroad tracks.

We were a big enough customer that PLSC in Atlanta gave us the real
story when we asked for an ETA on repair. They couldn't give us one
immediately until the HAZMAT crew let them in. Turned out to be a total
loss of all gear.

They yanked every tech east of the Mississippi and a 7ESS was FedEx
overnighted (stolen from some customer in the Middle East?) and they had
to rebuild everything.

Was down less than 10 days. Good times.

For an operational perspective, I was part of the team trying to keep the
BBC website up and running through 9/11...

  http://www.slimey.org/bbc_ticket_10083.txt

Simon