Level3 worldwide emergency upgrade?

Ray_Wong · February 6, 2013, 9:58am

Does anyone have details on tonight's apparent worldwide emergency
router upgrade? All I managed to get out of the portal was 30 minutes,
"Service Affecting" (no kidding?) and the NOC line gave me the
recording about it and disconnected me.

-R>

JP1 · February 6, 2013, 11:04am

Nothing confirmed from my side, but the general guess I saw was that it was Juniper-related.

-J

_Stephane_Bortzmeyer · February 6, 2013, 11:11am

a message of 10 lines which said:

the general guess I saw was that it was Juniper-related.

Juniper Technical Bulletin PSN-2013-01-823, probably?

James_Jones1 · February 6, 2013, 11:13am

ugh!

Jason_Biel · February 6, 2013, 11:21am

That is the general guess.

Bret_Palsson · February 6, 2013, 11:29am

I just received this email from Level3

Peter_Ehiwe · February 6, 2013, 11:38am

Also received same ...

Jared_Mauch · February 6, 2013, 12:39pm

So, I'm wondering why it is so shocking that someone may have to push out some sort of impacting upgrade, either urgently or periodically, that it causes these emails on the list.

There seems to be some sort of psychological event happening in addition to the technological one.

In the past I've had to push out software fixes "urgently" due to various reasons, either being a software thing like the PSN or some weird hardware+software interaction that causes bad things to happen.

Would you rather your ISP not maintain their devices? Are the consequences "so bad" of a 30 minute outage that your business is severely impacted?

- Jared

Alex_Rubenstein · February 6, 2013, 12:57pm

Would you rather your ISP not maintain their devices? Are the
consequences "so bad" of a 30 minute outage that your business
is severely impacted?

- Jared

You had me up until that line.

That should be expanded a little ...

First, I'd say, yes - many businesses would be severely impacted and may even have consequential issues if they had to sustain a 30 minute outage. Suppose for a moment they couldn't process money machine transactions for 30 minutes; or Netflix couldn't serve content for 30 minutes; or youporn was offline for 30 minutes.

The question should be more along the lines of, "why aren't you multihomed in a way that would make a 30 minute outage (which is inevitable) irrelevant to you?"

Jonathan_Towne · February 6, 2013, 1:10pm

On Wed, Feb 06, 2013 at 07:57:06AM -0500, Alex Rubenstein scribbled:
# The question should be more along the lines of, "why aren't you multihomed in a way that would make a 30 minute outage (which is inevitable) irrelevant to you?"

The fun part of this emergency maintenance in the northeast USA was that even
folks who are multihomed felt it: Level3 managed to do this in a way that
kept BGP sessions up but killed the ability to actually pass traffic. I'm not
sure what they did that caused this, or whether anyone but northeast folks
were affected by it, but it sure was neat to be effectively blackholed in and
out of one of your provided circuits for a while.

Also, in the northeast, they managed to make it quite a bit more than a 30min
outage for many people; they even slid hours outside of their advertised
emergency window.

I do applaud them for what I can only assume was a *massive* undertaking:
emergency upgrading that many routers in such a short period of time.

-- Jonathan Towne
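
The failure mode described above (BGP session up, forwarding dead) is why session state alone is a weak health signal for a multihomed site. A minimal sketch of a check that combines control-plane state with end-to-end probe results might look like the following. Everything here is hypothetical: the LinkStatus type, the field names, and the 0.5 threshold are illustrative assumptions, not anything Level3 or any poster in this thread described using.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class LinkStatus:
    """Observed state of one upstream link (hypothetical inputs)."""
    name: str
    bgp_established: bool          # control plane: is the BGP session up?
    probe_results: Sequence[bool]  # data plane: did each recent end-to-end probe succeed?


def classify(link: LinkStatus, min_success_ratio: float = 0.5) -> str:
    """Return 'down', 'unknown', 'blackholed', or 'healthy' for a link.

    The threshold is an arbitrary illustrative value.
    """
    if not link.bgp_established:
        return "down"
    if not link.probe_results:
        # No data-plane evidence yet; do not assume the link forwards traffic.
        return "unknown"
    ratio = sum(link.probe_results) / len(link.probe_results)
    # Session is up, so the only question is whether traffic actually passes.
    return "healthy" if ratio >= min_success_ratio else "blackholed"


def usable_links(links: Sequence[LinkStatus]) -> List[str]:
    """Names of links that are both established and passing probes."""
    return [link.name for link in links if classify(link) == "healthy"]
```

A link classified as "blackholed" is the case reported here: the session never drops, so nothing withdraws the routes, and traffic keeps being sent into a path that does not forward it. Acting on that classification (for example, shutting the session or depreferencing the routes) is left out of the sketch because it depends on the router platform.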

Jared_Mauch · February 6, 2013, 1:28pm

Yeah, perhaps not as elegantly worded as I would have hoped, but there are many reasons things "go down". Just one of those elements is the internet part, there's also transport, power, and other elements that combine to make this complex system called the internet. If you N+N or N+1 your power, perhaps something similar for your connectivity is important. Or you just plan to be down/broken periodically for 30 minutes and have a plan to cover that.

The building where our NOC is located sometimes gets evacuated. Having a plan for that is important. During one visit, there was a small fire in the building (or so we were told). Certainly an unexpected event that disrupted us for ~30 minutes.

The handling and response of these events certainly is important. I do want to understand why and how it's so bad so if there are things as a SP in the community we can improve upon we can do that.

That's my real goal, not poking at people who are single homed and down.

- Jared

Alex_Rubenstein · February 6, 2013, 1:50pm

Yeah, perhaps not as elegantly worded as I would have hoped, but there are
many reasons things "go down". Just one of those elements is the internet
part, there's also transport, power, and other elements that combine to
make this complex system called the internet. If you N+N or N+1 your
power, perhaps something similar for your connectivity is important. Or you
just plan to be down/broken periodically for 30 minutes and have a plan to
cover that.

Agreed.

The building where our NOC is located sometimes gets evacuated. Having a
plan for that is important. During one visit, there was a small fire in the
building (or so we were told). Certainly an unexpected event that disrupted
us for ~30 minutes.

And, if it is important to you, you will have N+N NOCs, i.e., more than one, in different buildings, cities, or countries, depending on your requirement.

The handling and response of these events certainly is important. I do want
to understand why and how it's so bad so if there are things as a SP in the
community we can improve upon we can do that.

I suspect, as I touched on previously, the most noise will come from the people who are the least realistic, and least prepared. Personally, I live with the expectation that whatever it is (power, fiber, transport, ISP, highways, fuel delivery, etc.) will at some point be broken, degraded, or otherwise unavailable, and you have to plan accordingly.

Personally (and I speak for NAC) I/we don't care, really, if any upstream IP provider breaks; we have made appropriate plans to work around that in an automated fashion. Hope that answers your more general question.

Andrew_Sullivan1 · February 6, 2013, 3:10pm

My impression is mostly that people are left feeling uncomfortable by
a massive upgrade of this sort with so little communication about why
and so on. "Emergency work for five hours and 30 minutes
disconnection" that turns out to take longer than 30 minutes of
disconnection probably ought to come with some explanation (at least
after the fact).

Regards,

A

Ray_Wong · February 6, 2013, 3:43pm

Especially in the wake of one they already did recently. It's unsettling
to receive little communication, and even multihomed, there's always
the question of being pushed into overages around other providers.

Yes, short notice maintenance does happen. Better communication
happens much less often.

I was more looking for details, i.e. the sort of problem this is, as
it probably also means all my *other* providers are going to be
scrambling in the next few days/weeks/months, depending on what gear
they're all using. I'm out of the global infrastructure game myself
for a few years currently, but I still have to think ahead to the
network I do maintain.

-R>

Joel_Jaeggli · February 6, 2013, 4:04pm

So, I'm wondering why it is so shocking that someone may have to push out some sort of impacting upgrade, either urgently or periodically, that it causes these emails on the list.

My impression is mostly that people are left feeling uncomfortable by
a massive upgrade of this sort with so little communication about why
and so on. "Emergency work for five hours and 30 minutes
disconnection" that turns out to take longer than 30 minutes of
disconnection probably ought to come with some explanation (at least
after the fact).

Especially in the wake of one they already did recently. It's unsettling
to receive little communication, and even multihomed, there's always
the question of being pushed into overages around other providers.

Yes, short notice maintenance does happen. Better communication
happens much less often.

I received advance (24 hours) notification of maintenances over the last two days to circuits ranging in size from 100Mb/s to 10Gb/s in about a dozen locations. I assumed there would be further disruption as devices I'm not directly connected to were touched.

I was more looking for details, i.e. the sort of problem this is, as
it probably also means all my *other* providers are going to be
scrambling in the next few days/weeks/months, depending on what gear
they're all using.

All your other providers using that vendor have been scrambling for about a week as well. Junos devices should be upgraded.

Ray_Wong · February 6, 2013, 4:16pm

OK, having had that first cup of coffee, I can say perhaps the main
reason I was wondering is I've gotten used to Level3 always being on
top of things (and admittedly, rarely communicating). They've reached
the top by often being a black box of reliability, so it's (perhaps
unrealistically) surprising to see them caught by surprise. Anything
that pushes them into scramble mode causes me to lose a little sleep
anyway. The alternative to what they did seems likely for at least a
few providers who'll NOT manage to fix things in time, so I may well
be looking at longer outages from other providers, and need to issue
guidance to others on what to do if/when other links go down for
periods long enough that all the cost-bounding monitoring alarms start
to scream even louder.

I was also grumpy at myself for having not noticed advance
communication, which I still don't seem to have, though since I
outsourced my email to bigG, I've noticed I'm more likely to miss
things. Perhaps giving up maintaining that massive set of procmail
rules has cost me a bit more edge.

Related, of course, just because you design/run your network to
tolerate some issues doesn't mean you can also budget to be in support
contract as well. Knowing more about the exploit/fix might mean
trying to find a way to get free upgrades to some kit to prevent more
localized attacks to other types of gear, as well, though in this case
it's all about Juniper PR839412 then, so vendor specific, it seems?

There are probably more reasons to wish for more info, too. There's
still more of them (exploiters/attackers) than there are those of us
trying to keep things running smoothly and transparently, so anything
that smells of "OMG new exploit found!" also triggers my desire to
share information. The network bad guys share information far more
quickly and effectively than we do, it often seems.

-R>

PC11 · February 6, 2013, 4:24pm

Given the issue was announced a week ago, I'm surprised they didn't provide
some sort of emergency notification prior to the upgrade. However, I
certainly understand their immediate desire to deploy this update. I don't
think it's as bad as the BGP one from not too long ago in that exploit code is
not yet publicly available to my knowledge, but it certainly won't take
long.

Justin_M_Streiner · February 6, 2013, 4:34pm

If Level3 is pushing this upgrade because of a security vulnerability, like the recent Juniper PSN, any public notification will likely be tersely worded out of necessity.

You might be able to get more details by contacting your account team, but it's highly unlikely that you'll see the level of detail you're looking for in a public communication. That's not a knock against Level3, and most other carriers will likely be equally tight-lipped on the details.

jms

Siegel_David2 · February 6, 2013, 5:01pm

Hi Ray,

This topic reminds me of yesterday's discussion in the conference around getting some BCOPs drafted. It would be useful to confirm my own view of the BCOP around communicating security issues. My understanding for the best practice is to limit knowledge distribution of security-related problems both before and after the patches are deployed. You limit knowledge before the patch is deployed to prevent yourself from being exploited, but you also limit knowledge afterwards in order to limit potential damage to others (customers, competitors...the Internet at large). You also do not want to announce that you will be deploying a security patch until you have a fix in hand and know when you will deploy it (typically, next available maintenance window unless the cat is out of the bag and danger is real and imminent).

As a service provider, you should stay on top of security alerts from your vendors so that you can make your own decision about what action is required. I would not recommend relying on service provider maintenance bulletins or public operations mailing lists for obtaining this type of information. There is some information that can cause more harm than good if it is distributed in the wrong way and information relating to security vulnerabilities definitely falls into that category.

Dave

Joel_Jaeggli · February 6, 2013, 5:31pm

My impression is mostly that people are left feeling uncomfortable by
a massive upgrade of this sort with so little communication about why
and so on. "Emergency work for five hours and 30 minutes
disconnection" that turns out to take longer than 30 minutes of
disconnection probably ought to come with some explanation (at least
after the fact).

I was more looking for details, i.e. the sort of problem this is, as
it probably also means all my *other* providers are going to be
scrambling in the next few days/weeks/months, depending on what gear
they're all using. I'm out of the global infrastructure game myself
for a few years currently, but I still have to think ahead to the
network I do maintain.

If Level3 is pushing this upgrade because of a security vulnerability, like the recent Juniper PSN, any public notification will likely be tersely worded out of necessity.

The one that motivated us to upgrade is:

PR839412

I assume that applies to most people with interest in running current Junos. My imagination is pretty good so that got my attention.