PAIX Outages

I have heard rumors that S&D has been having persistent switch
problems with their switches at PAIX (Palo Alto), and I was kind of
wondering if anyone actually cared?

well, they've sure been having fun up at the six in seattle

randy

Personally I tend to suspect the general lack of uproar is a rather
unfortunate (for them) sign that PAIX is no longer relevant when it comes
to critical backbone infrastructure.

It looks like different folks have been seeing different levels of outages
depending upon which switch/card they are connected to, but I haven't been
able to find anyone who has seen fewer than 30 hits between April 16th and
the two outages this morning. Our ports have seen just under 28 hours of
total downtime so far this month, while some lucky people have only seen
around 6 hours.
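
For context, here is the back-of-the-envelope availability math on those
figures (a rough Python sketch; treating the window as the 28 days of
April to date is my own assumption):

    # Availability implied by the month-to-date downtime figures above.
    # Assumption: the observation window is April 1-28, i.e. 28 days.
    window_hours = 28 * 24  # 672 hours month to date

    for label, down_hours in [("our ports", 28.0), ("lucky ports", 6.0)]:
        availability = 1 - down_hours / window_hours
        print(f"{label}: {down_hours:g}h down -> {availability:.2%} available")

    # our ports: 28h down -> 95.83% available
    # lucky ports: 6h down -> 99.11% available

Either way, that is a long way from anything you would want carrying
mission critical peering.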

I'm not sure if anyone at S&D or Extreme actually has any real idea what
the problem is with these current switches, but given this amount of
downtime, they should have replaced every last component by now. If Extreme
can't fix them, there should be a pile of Black Diamonds sitting on the
curb waiting for trash day. In fact, 9/10ths of the way through writing
this e-mail, I got a call from S&D stating that they are doing exactly
that. :)

In the meantime, here are some of the more interesting snippets of what
has been tried on the current switches:

16 Apr 2005 20:19:53 GMT
We are currently experiencing some problems with 2 network cards in our
Palo Alto peering switch. This might be causing possible service
degradations. Switch Engineers are expecting new cards to replace the 2
suspected faulty network cards. These cards should be arriving in or
around 1 hour. Right after the cards arrive, we will be scheduling an
emergency maintenance window to get these cards replaced.

19 Apr 2005 14:16:07 GMT
The purpose of this emergency maintenance window is for Switch Engineers
to replace a faulty processor module card affecting the Bay Area Peering
customers. The estimated downtime will be 15 minutes.
(Actual downtime several hours)

19 Apr 2005 19:27:49 GMT
This is the final update regarding the problems experienced today with the
peering fabric. Our Switch Engineers corrected the problems during the
emergency maintenance window by replacing two line cards and 2 processor
cards in the Palo Alto switch. All peering sessions should be restored at
this time.

22 Apr 2005 21:56:15 GMT
The purpose of this emergency maintenance window is for engineers to
replace defective power supply units on the PAIX switch. No impact to your
services is expected.

24 Apr 2005 21:25:48 GMT
Our Switch Engineers will be conducting an emergency processor card
replacement at the Palo Alto site. The expected downtime while this
maintenance is being conducted will be 2 hours.

24 Apr 2005 21:36:18 GMT
Our Switch Engineers will be conducting an emergency chassis replacement
at the Palo Alto site. The expected downtime while this maintenance is
being conducted will be 3 hours.

25 Apr 2005 19:17:41 GMT
Our engineers have escalated the problems with the peering switch in Palo
Alto to 3rd level support at Extreme, the switch vendor. More details will
follow as they become available.

26 Apr 2005 03:00:34 GMT
Our Switch Engineers have advised us that the switch has been migrated to
a different power bus to rule out any power variables. Power is being
monitored for the next 24 hours.

28 Apr 2005 13:33:05 GMT
At approximately 6:05 AM local time, the peering switch rebooted itself.
Our switch engineers are investigating this issue and believe all sessions
are back to normal at this time. More details will be provided as they
become available.

When I see a stable switching platform going forward, and some service
credits for the massive outages we've all endured so far, I'll probably be
a lot less cranky about the entire situation. Until then I have to say, if
they keep this up they are going to need to change their name to "Switch
or Data".

Oh well, at least this didn't happen during the S&D sponsored NANOG. :)

In a message written on Thu, Apr 28, 2005 at 01:51:54PM -0400, Richard A Steenbergen wrote:

Personally I tend to suspect the general lack of uproar is a rather
unfortunate (for them) sign that PAIX is no longer relevant when it comes
to critical backbone infrastructure.

That, or a sign that operators are doing their job. There should be
enough redundancy in the system that the loss of any one site, for
whatever reason, doesn't cause a major, or even minor, disruption.

I'm not so sure you can draw that conclusion. At this point,
everyone's high traffic peers are private interconnects anyway. The
site is likely more important to them than the public fabric within
the site.

  A facility power outage would probably be a lot more painful than
some public fabric issues. Any traffic that works its way down to a
public fabric probably has other public fabrics to go to as well.

  --msa

If you have a Cisco router that craps out on a regular basis, Cisco will
tell you to get a second one. Some people find this to be a great
solution, while other people go buy a Juniper.

This probably isn't the way they wanted to announce this, but PAIX is
rolling out a new 10GE capable platform (the Extreme Aspen series).
Equinix is about to follow suit with their 10GE platform, and the only
other two modern competitive IX's in the US have already deployed new 10GE
capable platforms (NYIIX with Foundry MG8 and NOTA with Force10). Of
course the Europeans have had customers up on 10GE for 6 months now, and
at a fraction of the price that the US IX's will be charging, but let's
ignore that and focus on our own backwater continent right now. :)

At the moment, the US IX's largely price their ports as high as the market
will possibly bear (and then sometimes a few bucks more just as a kick in
the teeth), and largely don't have 10GE ports available for either
customers or multiple-site trunking. This means that most serious
providers don't even have the option of public peering at interesting
capacities, even if they weren't concerned about reliability issues. As
the US IX market finally gets its act together and rolls out 10GE, many
networks are going to start upgrading, and start putting much larger
amounts of traffic on them to save on PNI costs. After all, we both know
that due to current financial conditions not every network can afford to
have all of the spare PNI ports they would like to ensure that they have
sufficiently diverse/redundant interconnections with their peers, yes? :)

With these IX's poised to take another order of magnitude step (remember
the good old days when GE seemed too large?), they are about to get
another shot in the arm as far as being used for mission critical peering
infrastructure is concerned. But no matter how good an idea it may be to
make sure that you "always have diverse capacity at another location", if
one IX is having significantly higher numbers of disruptions than the
rest, the network operators are going to go elsewhere (well, after their 5
year contracts are up at any rate).

Besides, I don't think "and for when we go down, there is an Equinix
facility down the road" is really the marketing angle that Switch and Data
had in mind.

Yeah, what's the issue? US public peering ports are absurdly
overpriced. Has anyone had a laugh at the PAIX list prices
when seeing them for the first time? Considering LINX and
AMSIX have been their own companies for some years now, they
are doing an excellent job at being (too?) affordable, but it
surely works. Then when I think about the NYIIX woes some
time ago and other stuff (like the current PAIX trouble), I
cannot help but want to get rid of public peering, especially
in the US.

As another matter, I do not believe in public peering at all
when you have flows to a single peer that are more than half
of a full GE. Been there; it was not at all nice. I guess more
and more operators will have fewer and fewer public IX ports,
and the open peering coalition will start wondering at some
point... The AMSIX has a lot of 10G peers. While they just
take two ports, and the AMSIX is supposedly also redundant
(and cheap <g>), it is just a time bomb. How many times did
either LINX or AMSIX have issues (actually very rare!) and we
happily overloaded our peers' interfaces at the respective
other IX... Say what you want, but public peering (yes/no)
has a lot to do with your amount of traffic, and your peers.

Paying 3 to 4 times as much in the US for the very same thing,
I am sure I get even less value - and I'm pulling out.

(Well, since we stumbled onto this topic... thanks, ras!)

Alexander

and we
happily overloaded our peers' interfaces at the respective
other IX...

That sounds more like a planning issue than anything else. If you
have traffic going through a pipe, then you need to make sure you have
somewhere else to send it. If you are managing your peers properly,
private or public, there should be no issue.
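
That check is simple enough to mechanize. A minimal Python sketch of
the n-1 reasoning (port names and numbers are hypothetical, loosely
modeled on the LINX/AMSIX scenario in this thread):

    # n-1 planning check: if one IX port fails, can the remaining
    # ports absorb the traffic it was carrying? All names and figures
    # here are hypothetical, for illustration only.
    ports = {
        # port -> (capacity Mbit/s, current load Mbit/s)
        "LINX-ge0":  (1000, 300),
        "AMSIX-ge0": (1000, 800),
    }

    def survives_loss_of(failed):
        """True if the other ports have headroom for the failed port's load."""
        _, displaced = ports[failed]
        headroom = sum(cap - load
                       for name, (cap, load) in ports.items()
                       if name != failed)
        return displaced <= headroom

    for name in ports:
        print(name, "fails:", "ok" if survives_loss_of(name) else "congested")

    # LINX-ge0 fails: congested  (300 Mbit/s displaced, only 200 spare)
    # AMSIX-ge0 fails: congested (800 Mbit/s displaced, only 700 spare)

The hard part, as the rest of this thread shows, is getting the load and
capacity figures for ports you don't control.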

With public peering you simply never know how much spare
capacity your peer has. And would you expect your
peer with 400 Mbit/s total to have 400 reserved on his AMSIX
port for you when you see 300 at LINX and LINX goes down?

Been there, numerous times. I still tend to say - it depends
on your type of peers and traffic per peer.

Alexander

You also never know with private peering: backbone links.

Regards,
Daniel

With public peering you simply never know how much spare
capacity your peer has.

So with your key peers you talk to them and find out, but
I don't see how this is any different if you have a private
interconnect. Just because you have, say, an STM-1 into another peer
doesn't mean they have the STM-1 to carry the traffic out; given
your example below, I'd say it's even more unlikely.

And would you expect your peer
with 400 Mbit/s total to have 400 reserved on his AMSIX port
for you when you see 300 at LINX and LINX goes down?

Key ones, yes.

Been there, numerous times. I still tend to say - it depends
on your type of peers and traffic per peer.

But your point on public versus private doesn't alter those facts.

what makes this a public peering issue? i see a couple folks already made
the point i wanted to make, but just because you have capacity to a peer
(on a public interface or a dedicated PI) doesn't mean they aren't
aggregating at their side and/or have enough capacity to carry the traffic
where it needs to go

this is also about scale; i would hope you aren't peering 400Mb flows
across a 1Gb port at an IX, as this would imho not be good practice. if
your example were 40Mb then it would be different, or perhaps 400Mb on a
10Gb port.
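
as a sketch of that rule of thumb in Python (the 25% ceiling is my own
hypothetical number, picked to match the examples above; Alexander's
earlier "half of a full GE" line implies something similar):

    # Flag any single peer whose flow is a large fraction of the shared
    # IX port, since that one peer could congest it for everyone on
    # failover. THRESHOLD is a hypothetical judgment call chosen to
    # match the examples above, not any formal standard.
    THRESHOLD = 0.25

    PORT_MBPS = {"1GE": 1_000, "10GE": 10_000}

    def flow_ok(flow_mbps, port):
        return flow_mbps / PORT_MBPS[port] <= THRESHOLD

    for flow, port in [(400, "1GE"), (40, "1GE"), (400, "10GE")]:
        verdict = "fine" if flow_ok(flow, port) else "move it to a PNI"
        print(f"{flow} Mb/s on a {port} port: {verdict}")

    # 400 Mb/s on a 1GE port: move it to a PNI
    # 40 Mb/s on a 1GE port: fine
    # 400 Mb/s on a 10GE port: fine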

you might even argue there is more incentive to ensure public IX ports
have capacity, as congestion will affect multiple peers

Steve