ATT GigE issue on 11/19 in Kansas City

We lost several of our GigE links to AT&T for 6 hours on 11/19. Did anyone else see this and get a root cause from AT&T? All I can get is that they believe a change caused the issue.

We lost several (but not all) of our Optiman circuits on 11/19 at about 10:20am. We were told the root issue was that all VLANs in one of their switches had been accidentally deleted. We were never able to get any additional detail (like "how"), but service was restored at about 16:45.

+1 to the above - we received the following RFO from their NOC:

"All impacted VLANS were rebuilt to restore service. It is believed
there were some configuration changes that caused the VLAN troubles. A
case has been opened with Cisco to further investigate the root
cause."

***Stefan Mititelu
http://twitter.com/netfortius
http://www.linkedin.com/in/netfortius

Stefan wrote the following on 11/30/2011 8:53 AM:

That was my first thought as well. It would just surprise me if a huge provider like AT&T were using VTP instead of a provisioning tool that automates the manual pruning process to avoid issues like this. In either case I'm a customer and will likely never be told what went wrong. I'm OK with that so long as it doesn't happen again!
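
For what it's worth, "automating the pruning" can be as simple as a script that pushes an explicit allowed-VLAN list to each trunk from the provisioning database, instead of letting VTP add or delete VLANs for you. A minimal sketch, assuming a Cisco IOS switch reachable over SSH via the netmiko library; the hostname, credentials, interface, and VLAN IDs are placeholders, not anyone's real provisioning system:

    # Sketch: push an explicit allowed-VLAN list to a trunk rather than
    # relying on VTP. All device details and VLAN IDs are placeholders.
    from netmiko import ConnectHandler

    device = {
        "device_type": "cisco_ios",
        "host": "pe-switch-1.example.net",   # placeholder hostname
        "username": "provisioner",
        "password": "REDACTED",
    }

    trunk_interface = "GigabitEthernet0/1"
    allowed_vlans = [110, 220, 330]          # from the provisioning database

    commands = [
        f"interface {trunk_interface}",
        "switchport mode trunk",
        # Explicit list: nothing is trunked (or pruned) that isn't provisioned.
        "switchport trunk allowed vlan "
        + ",".join(str(v) for v in allowed_vlans),
    ]

    conn = ConnectHandler(**device)
    output = conn.send_config_set(commands)
    output += conn.save_config()
    conn.disconnect()
    print(output)

The point is that the allowed-VLAN list comes from a database of record, so a fat-fingered change on one switch can't propagate and wipe VLANs elsewhere.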

Brad Fleming wrote:

In either case I'm a customer and will likely never be told what went wrong. I'm OK with that so long as it doesn't happen again!

Does being told what happened somehow prevent it from happening again?

What is the utilitarian value in an RFO?

Joe

In either case I'm a customer and will likely never be told what went wrong. I'm OK with that so long as it doesn't happen again!

Does being told what happened somehow prevent it from happening again?

Nope. But if this same issue crops up again, we'll have to "work the system" harder and demand calls with knowledgeable people; not an easy task for a customer my size (I'm not Starbucks with thousands of sites). A single outage can be understood; seeing repeated issues means I want to know what's going wrong. If the issue is something easily mitigated and the service provider hasn't taken steps to mitigate it, I need to start looking for a different service provider. Everything has a little downtime every now and again, and I can live with that on lower-speed circuits.

What is the utilitarian value in an RFO?

To determine whether it's an honest mistake or a more systemic issue that should push me toward another option.

No. It doesn't prevent it from happening again. But at least you can
have them check for that same issue when it happens next time.

I guess the RFO gives the customer the feeling that the vendor was able
to isolate the issue and fix it; as opposed to "issue was resolved
before isolation".

- Miraaj Soni

What I have seen lately with telcos building and operating Metro Ethernet Forum (MEF) based Ethernet networks is that relatively inexperienced telco staff are in charge of configuring and operating the networks. These operational staff are often unaware of layer 2 Ethernet nuances that network engineers in an enterprise environment must know, or else. I have seen numerous instances of telco MEF layer 2 outages of 20-30 seconds, long enough for my layer 3 routing keep-alives to time out. Subsequent telco root cause analysis determined that spanning tree convergence brought down multiple links in the telco MEF network. One telco technician, assigned to Ethernet switch configuration, told me that a 20-30 second network hang is not really a big deal.
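
To put numbers on why a layer 2 reconvergence like that takes layer 3 adjacencies down with it, here is a small back-of-the-envelope sketch using classic 802.1D defaults and common keep-alive hold/dead timers. These are illustrative defaults only; actual timers depend on platform, protocol tuning, and whether the telco runs RSTP/MST, which converges far faster:

    # Illustrative only: classic 802.1D STP timers vs. common L3 dead/hold timers.
    MAX_AGE = 20        # seconds before a lost BPDU ages out
    FORWARD_DELAY = 15  # seconds spent in each of listening and learning

    direct_failure = 2 * FORWARD_DELAY              # 30s: listening + learning
    indirect_failure = MAX_AGE + 2 * FORWARD_DELAY  # 50s: age out BPDU first

    # Common default hold/dead timers (values vary by platform and config).
    l3_dead_timers = {
        "EIGRP hold (default)": 15,
        "OSPF dead (broadcast default)": 40,
        "BGP hold (RFC 4271 suggested)": 90,
    }

    for outage in (direct_failure, indirect_failure):
        print(f"\nL2 reconvergence of {outage}s:")
        for proto, dead in l3_dead_timers.items():
            verdict = "drops adjacency" if outage > dead else "survives"
            print(f"  {proto:32} {dead:>3}s -> {verdict}")

Even the best case (30 seconds) outlasts a default EIGRP hold timer and any aggressively tuned OSPF or BGP timers, which is why a "not really a big deal" layer 2 hang shows up as a routing outage for the customer.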

Brad Fleming wrote:

In either case I'm a customer and will likely never be told what went
wrong. I'm OK with that so long as it doesn't happen again!

Does being told what happened somehow prevent it from happening again?

What is the utilitarian value in an RFO?

"The outage was caused by an engineer turning off the wrong router, it
has been turned back on and service restored"
"The outage appears to have been caused by a bug in the routers
firmware, we are working with the vendor on a fix"
"There was an outage, now service is back up again"

For a brief isolated incident you probably don't care enough in any case to change providers (if you care about outages that much, you just divert traffic to your other redundant connections). But say you've had two outages in a week with that given as the explanation: which one makes you more inclined to go shopping for another provider?

Technically the first provider knows the cause of the outages and has fixed it, while the second one doesn't know for sure what the problem is and may or may not have fixed it; however, I suspect most people would not read it that way. As for the third provider, I don't think there's any way to interpret that statement to make them look good.

From a utilitarian point of view, the more detail customers get, the less angry they normally are, and I believe "less angry" is a generally accepted form of "happier" in the ISP world (at least some ISPs seem to think so). Therefore, for utilitarian reasons, you should write nice long detailed reports, unless the cause is incompetence, in which case you should probably just shut up and let people assume incompetence instead of confirming it, as confirming it might make them less happy. Although one could also argue that by being honest about incompetence your customers will likely change providers sooner, causing an overall increase in their level of happiness. This utilitarian thing is complicated.

- Mike

We use RANCID here, quite heavily, to help guide
provisioning engineers so they are better prepared for the
future, and actually understand what it is they are
configuring.

Pre-provisioning training is all well and good, but hands-on
experience always has the chance of "going the other way".
While RANCID is after-the-fact, it's a great tool for
refining what the folks on the ground know.

It certainly has helped us a great deal, over the years.
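
For anyone who hasn't used it, the core idea is simple: pull each device's config on a schedule, archive it, and diff it against the previous pull so a change like a mass VLAN deletion shows up immediately. A minimal sketch of that diff step only (not RANCID itself; the archive path and hostnames are made up for illustration):

    # Sketch of the config-diff idea behind tools like RANCID.
    # The archive directory and the fetch step are placeholders.
    import difflib
    from pathlib import Path

    ARCHIVE = Path("config-archive")   # hypothetical directory of saved configs

    def diff_config(hostname: str, new_config: str) -> str:
        """Compare a freshly pulled config against the last saved copy."""
        saved = ARCHIVE / f"{hostname}.cfg"
        old_lines = saved.read_text().splitlines() if saved.exists() else []
        new_lines = new_config.splitlines()

        diff = "\n".join(difflib.unified_diff(
            old_lines, new_lines,
            fromfile=f"{hostname} (previous)",
            tofile=f"{hostname} (current)",
            lineterm="",
        ))

        ARCHIVE.mkdir(exist_ok=True)
        saved.write_text(new_config)   # archive the new copy for the next run
        return diff                    # mail this to the ops list, RANCID-style

A run in which every "vlan" line vanished from a switch would produce a diff full of "-vlan ..." lines, which is exactly the kind of change you want flagged before the phone starts ringing.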

Mark.

When the RFO gets filtered through the marketing department, it gets
interesting, and totally useless. This is what we got as an official
RFO for an outsourced hosted VoIP service (carrier shall remain
nameless) that was for all practical purposes down hard for two DAYS due
to a botched planned software upgrade, verbatim and in its entirety:

"Coincident with this upgrade, we experienced an Operating System-level
failure on the underlying application server platform which had the
effect of defeating the redundancy paradigm designed into our service
architecture."