cross connect reliability

All,
  Today I had yet another cross-connect fail at our colo provider. From
memory, this is the 6th cross-connect to fail while in service, in 4 years
and recently there was a bad SFP on their end as well. This seems like
a high failure rate to me. When I asked about the high failure rate,
they said that they run a lot of cables and there is a lot of jiggling
and wiggling... lots of chances to get bent out of whack from activity
near my patches and cables.
  Until a few years ago my time was spent mostly in single tenant data
centers, and it may be true that we made fewer cabling changes and made
less of a ruckus when cabling... but this still seems like a pretty high
failure rate at the colo.
  I am curious; what do you expect the average reliability of your FastE
or GigE copper cross-connects at a colo?

Thanks,
Mike

Michael J McCafferty wrote:

> I am curious; what do you expect the average reliability of your FastE
> or GigE copper cross-connects at a colo?

Never to fail? Seriously; if you're talking about a passive connection
(optical or electrical) like a patch panel, I'd expect it to keep going
forever unless someone damages it.

~Seth

Hello Michael:

From: Michael J McCafferty [mailto:mike@m5computersecurity.com]
Sent: Thursday, September 17, 2009 2:46 PM
To: nanog
Subject: cross connect reliability

> When I asked about the high failure rate,
> they said that they run a lot of cables and there is a lot of jiggling
> and wiggling... lots of chances to get bent out of whack from activity
> near my patches and cables.

I agree with their Reason for Outage, but it sounds like a design issue.
We prewire all of our switches to patch panels so they don't get touched
once they're installed. The patch panels are much more friendly to
insertions and removals than a 48 port 1-U switch. We also have
multiple connections on the fiber side to avoid those failures. With
all of that, we still have failures, but their effect and frequency are
minimized.

Mike

Seth Mattinen wrote:

> Never to fail? Seriously; if you're talking about a passive connection
> (optical or electrical) like a patch panel, I'd expect it to keep going
> forever unless someone damages it.

That's truly wishful thinking, as are the assumptions that insulate it from damaging factors. Nothing lasts forever.

Seth Mattinen wrote:

> Never to fail? Seriously; if you're talking about a passive connection
> (optical or electrical) like a patch panel, I'd expect it to keep going
> forever unless someone damages it.

Or until someone pulls out the wrong cable (which has happened to me).

Regards
Marshall

    Does the colo let anyone run cables or do they have approved
contractors? It sounds like a design issue to me in the way the cables
are treated. In 4 years at a busy colo we have had one copper cross connect
not act right. It would pass data but was flaky. We replaced it because it
was an easy run just to rule it out.

    I am assuming you are in shared space. If so I would investigate your
weak points (which I am sure you already are doing).
    
    Justin

Alex Balashov wrote:

> That's truly wishful thinking, as are the assumptions that insulate it
> from damaging factors. Nothing lasts forever.

What the OP is describing is abnormally high in my view.

Based purely on my own personal experience, the structured wiring I put
in at my parents' house in the mid-90s has never suffered a failure, is
still in use today, and it's in a residential environment with dogs and
cats. I'd expect a properly managed environment to fare at least as well
as that.

~Seth

Marshall Eubanks wrote:

> Or until someone pulls out the wrong cable (which has happened to me).

That's not a failure though. It's a disconnection. It happens but is readily attributable to a cause.

Random failures of a single port's connectivity... bizarre and annoying.
Whole switches? Seen it.
Whole panels? Seen it.
Whole blades? Seen it.

Single port on a switch or patch panel? Never.

[lots of stuff deleted].

We've seen cross-connects fail at sites like "E" and others. Generally speaking, it is a human-error issue and not a component failure one. Either people are being sloppy and aren't reading labels, or the labels aren't there.

In a cabinet situation, every cabinet does not necessarily home back to its own patch panel, so some trashing may occur -- it can be avoided with good design [cables in the back stay there, etc].

When you are talking about optics failing and they are providing "smart" cross-connects, almost anything is possible.

The true telltale is whether you have to call when the cross-connect goes down, or if it just "bounces". Either way, have them take you to their cross-connect room and show you their mess. Once you see it, you'll know what to expect going forward.

Deepak

Not that I would know from experience, but it is rumored that certain
telco techs in the NYC area can be persuaded to "borrow" other people's
pairs for less than a hundred dollars.
  
  Richard Golodner

We have a winner!

A famous one that can happen with some techs is that they make jumpers
from solid wire with generic RJ45 plugs (yes, I've seen this recently
from several folks who should know better). These will last somewhere
around a year (long enough to forget when they were installed) then
randomly fail from just fan vibration or slight breezes. There are RJ45
plugs made for solid wire (have 3 little prongs instead of 2, and they
are offset to straddle the wire) but I feel that even these can go bad.
I know if the techs are properly educated that this "will never happen"
(tm)... (till someone needs a custom-length jumper on a Sunday...)
(for which, one colo building has an Ace Hardware with most of the right
stuff, but unfortunately most don't).

As we all (should) know, all solid-wire cable should terminate in a
panel and proper short jumpers (pref. with molded strain-relief) are
used for the rest.

-- Pete

Not really. That's all too easy to diagnose and fix. Poorly terminated and/or mistreated cabling is far more likely. I wrote a long post about all the crap termination and poor treatment I've seen...but canceled the message.

Because no-one is stealing pairs anymore?

I once had a circuit go down because the fiber connector wasn't crimped
on correctly, and the fiber pulled out of the connector while a tech was
working in the cable tray nearby. After we opened a ticket about the
issue, said tech "fixed" it by shoving the fiber back into the connector
by hand, and walking away. Needless to say it went down again the next
day. Names withheld to protect the guilty and keep them from raising my
prices for heckling them in public, but the moral of the story is never
underestimate the laziness or stupidity of the cable monkeys some of
these places hire and let touch your routers. :)

It's just not as interesting or hard to troubleshoot as a poorly made patch cable that's had one conductor go open, one that only goes open when the wire is tugged a certain direction, nicked wires shorting, a switch port with its RX side burned out, an RJ45 plug whose mistreated tab no longer works so that, though it looks inserted in the port, it's really just kind of hanging there not making full/good contact, etc.

I would hope in any data center, "stealing pairs" doesn't happen as much as any of the above or taking pairs the tech genuinely thought were dead.

In their defense, that was clearly the fastest way to fix it. :)

Just not a very long term solution.

Having worked in high-traffic colo spaces around the world for the last ten years or so, in my experience this type of issue is very rare. If you are having this type of "quality" issue, I would sit down with your sales rep and ask to be stepped through their processes; there is obviously something that has gone VERY VERY WRONG.

Shane

You've never seen a single port go bad on a switch? I can't even count
the number of times I've seen that happen. Not that I'm suggesting
the OP wasn't the victim of a human error like unplugging the wrong port
and they just lied to him; that happens even more.

My favorite bizarre random failure story is a toss-up between one of
these two:

Story 1. Had a customer report that they weren't able to transfer this
one particular file over their connection. The transfer would start and
then at a certain point the tcp session would just lock up. After a lot
of head scratching, it turned out that for 8 ports on a 24 port FastE
switch blade, this certain combination of bytes caused the packet to be
dropped on this otherwise perfectly normal and functioning card, thus
stalling the tcp session while leaving everything around it unaffected.
If you moved them to a different port outside this group of 8, or used
https, or uuencoded it, it would go through fine.
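
To illustrate why re-encoding the file got it through, here is a minimal simulation, not a reconstruction of the actual fault. The trigger bytes (BAD_PATTERN), the chunk size, and both function names are made up for the example, and base64 stands in for uuencode since both map the data onto printable ASCII. The point is only that a payload-dependent drop disappears once the bytes on the wire change.

```python
import base64

# Hypothetical trigger sequence: the original report does not say which
# bytes were involved, so this value is purely illustrative.
BAD_PATTERN = bytes.fromhex("ffff0000ffff")


def faulty_port_forwards(frame: bytes) -> bool:
    """Simulate a port that silently drops any frame containing BAD_PATTERN."""
    return BAD_PATTERN not in frame


def first_dropped_chunk(data: bytes, chunk_size: int = 1024):
    """Return the offset of the first chunk the simulated port drops, or None.

    Note: a pattern that straddles a chunk boundary would be missed here;
    that is acceptable for a sketch but not for a real diagnostic tool.
    """
    for offset in range(0, len(data), chunk_size):
        if not faulty_port_forwards(data[offset:offset + chunk_size]):
            return offset
    return None


# A file whose raw bytes contain the trigger sequence at offset 3000.
payload = b"A" * 3000 + BAD_PATTERN + b"B" * 3000

raw_offset = first_dropped_chunk(payload)
# Base64 output uses only printable ASCII, so the raw trigger bytes
# (0xff and 0x00) cannot appear in it.
encoded_offset = first_dropped_chunk(base64.b64encode(payload))

print(raw_offset, encoded_offset)  # prints: 2048 None
```

The same reasoning explains why https worked: encryption changes the byte sequence that crosses the port, so the trigger combination never appears.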

Story 2. Had a customer report that they were getting extremely slow
transfers to another network, despite not being able to find any packet
loss. Shifting the traffic to a different port to reach the same network
resolved the problem. After removing the traffic and attempting to ping
the far side, I got the following:

<drop>
64 bytes from x.x.x.x: icmp_seq=1 ttl=61 time=0.194 ms
64 bytes from x.x.x.x: icmp_seq=2 ttl=61 time=0.196 ms
64 bytes from x.x.x.x: icmp_seq=3 ttl=61 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=0 ttl=61 time=4.159 ms
<drop>
64 bytes from x.x.x.x: icmp_seq=5 ttl=61 time=0.194 ms
64 bytes from x.x.x.x: icmp_seq=6 ttl=61 time=0.196 ms
64 bytes from x.x.x.x: icmp_seq=7 ttl=61 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=4 ttl=61 time=4.159 ms

After a little bit more testing, it turned out that every 4th packet
that was being sent to the peer's router was being queued until another
"4th packet" would come along and knock it out. If you increased the
interval time of the ping, you would see the amount of time the packet
spent in the queue increase. At one point I had it up to over 350
seconds (not milliseconds) that the packet stayed in the other router's
queue before that 4th packet came along and knocked it free. I suspect
it could have gone higher, but random scanning traffic on the internet
was coming in. When there was a lot of traffic on the interface you
would never see the packet loss, just reordering of every 4th packet and
thus slow tcp transfers. :)
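
For spotting this pattern without eyeballing the output, a minimal sketch along these lines may help. It assumes ping output in the format shown above; the function name find_reordered and the regular expression are illustrative choices, not anything from the original troubleshooting. It flags every reply whose icmp_seq arrives after a higher sequence number has already been seen.

```python
import re

# Matches the icmp_seq and time fields in lines such as:
# "64 bytes from x.x.x.x: icmp_seq=1 ttl=61 time=0.194 ms"
REPLY_RE = re.compile(r"icmp_seq=(\d+)\s+ttl=\d+\s+time=([\d.]+)\s*ms")


def find_reordered(lines):
    """Return (seq, rtt_ms) for each reply that arrives after a higher seq."""
    highest_seen = -1
    late = []
    for line in lines:
        match = REPLY_RE.search(line)
        if not match:
            continue  # skip "<drop>" markers and any other non-reply lines
        seq, rtt = int(match.group(1)), float(match.group(2))
        if seq < highest_seen:
            late.append((seq, rtt))
        else:
            highest_seen = seq
    return late


# The replies from the ping output above, with the "<drop>" markers omitted.
sample = """\
64 bytes from x.x.x.x: icmp_seq=1 ttl=61 time=0.194 ms
64 bytes from x.x.x.x: icmp_seq=2 ttl=61 time=0.196 ms
64 bytes from x.x.x.x: icmp_seq=3 ttl=61 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=0 ttl=61 time=4.159 ms
64 bytes from x.x.x.x: icmp_seq=5 ttl=61 time=0.194 ms
64 bytes from x.x.x.x: icmp_seq=6 ttl=61 time=0.196 ms
64 bytes from x.x.x.x: icmp_seq=7 ttl=61 time=0.183 ms
64 bytes from x.x.x.x: icmp_seq=4 ttl=61 time=4.159 ms
""".splitlines()

late = find_reordered(sample)
print(late)  # prints: [(0, 4.159), (4, 4.159)]
```

On the sample, the late replies are seq 0 and seq 4, four apart, which matches the "every 4th packet" behavior described above.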

In message <20090917234547.GT51443@gerbil.cluepon.net>, Richard A Steenbergen writes:

> Story 1. Had a customer report that they weren't able to transfer this
> one particular file over their connection. The transfer would start and
> then at a certain point the tcp session would just lock up. After a lot
> of head scratching, it turned out that for 8 ports on a 24 port FastE
> switch blade, this certain combination of bytes caused the packet to be
> dropped on this otherwise perfectly normal and functioning card, thus
> stalling the tcp session while leaving everything around it unaffected.
> If you moved them to a different port outside this group of 8, or used
> https, or uuencoded it, it would go through fine.

Seen that more than once. It's worse when it's in some router on the
other side of the planet and you're just a lowly customer.