how to write an incident report

For those who don't recognise the name (presumably many people), Citylink are a disruptive high-speed metro transport provider in Wellington, New Zealand. They run the most elaborate and scary layer-2 switched ethernet network I've ever heard of, and the other week they ran into some problems which caused a prolonged outage.

Here's their writeup:

   http://news.clnz.net/2007/10/19#Loopback-Saturday-Discussion

Why don't I have any suppliers like this?

Joe

Wow! This has to be one of the best incident reports I have ever seen. It would be great if people took a page out of Citylink's book instead of the one paragraph "something died" type reports.

Probably because your suppliers run everything thru their lawyers. This obviously was posted straight via the NOC/OPs group. They'll continue to post incident reports like this until they get hit by their first lawsuit by someone similar to "spilling hot coffee on their lap".

"This take somewhat longer than it should have..."

"then we are pretty sure that the measures above (enforcing MAC count limits on every port, disabling keepalives on interswitch links, single homing all 2950's) will prevent the problem from reoccuring..."

There is enough info in that posting to bury them in frivolous lawsuits.

-Hank

There is enough info in that posting to bury them in frivolous lawsuits.

I say good for them then! Society is litigious enough without our engineers worrying about lawsuits. The moment you start tempering your true analysis of a situation to kow-tow to spin doctors is the moment your engineering badge should be revoked.

The world needs more honesty; not less.

Jason

PS This “Citylink” appears to be in New Zealand - perhaps they haven’t been invaded by lawyers yet?
PPS And it was an excellent analysis!

I've had a few responses like this, but I don't buy it. I've worked in many places, some in New Zealand and more elsewhere, where there was a general culture of fear about making public statements about operational incidents. I don't ever remember people sending proposed text to legal and having it pushed back with changes; what happened instead was that text wasn't written in the first place.

Maybe Simon's level of detail is such that no legal department would ever condone it. But there's such a tremendous distance between Simon's text and the usual "there are no known issues at this time" that I suspect people just aren't trying.

Joe

Only if they were stupid enough to move to the States.

                                -Bill

Maybe there was so much detail that the lawyers didn't understand
it. :-)

I've had a few responses like this, but I don't buy it. I've worked
in many places, some in New Zealand and more elsewhere, where there
was a general culture of fear about making public statements about
operational incidents. I don't ever remember people sending proposed
text to legal and having it pushed back with changes; what happened
instead was that text wasn't written in the first place.

Of course it wasn't. The only time public statements beyond the simple
"Network Status" update are made is when the outage is so huge that the
news media report it. Then the idea is to spin the problem as a freak
occurrence that no amount of money and planning (which the company of
course spent years and millions doing) would have prevented.

Legal and PR are going to take one look at the report and then ask what
the upside for the company is in releasing it. In most cases there will be
none so it won't happen. Techs know this so don't even bother.

In reality a large percentage of outages happen for "dumb" reasons and
publicising them just makes the company look bad (look at the previous
fault on the page).

Look at this Citylink outage, I'm sure the sales guys for rival companies
are right now working on their pitches for their customers' business
based on what has been posted.

"Look at these guys, they took down half the city and still don't know it
wasn't caused by hackers. Half the government was offline [1] all day
because they couldn't even get into their building after hours. Their
phones were off, their mail servers stopped working, they couldn't login
to their network themselves, and their websites were offline. They've
been having these sort of outages on a smaller scale for years and just
ignored them because they only affect one or two customers at a time."

[1] Roughly: Beehive = Whitehouse, RBNZ = Federal reserve, Bowen St = Parliament.

Maybe Simon's level of detail is such that no legal department would
ever condone it. But there's such a tremendous distance between
Simon's text and the usual "there are no known issues at this time"
that I suspect people just aren't trying.

Well I was pleasantly surprised at 365 Main's explanation of the problem
a while back.

http://www.365main.com/press_releases/pr_8_1_07_365_main_report.html

but once again that was a major event that couldn't be hidden.

Citylink is a slightly unusual company in its level of openness (although
getting less so) but I would guess that most people on this list would be
fired if they posted something like Simon's text without running it by
legal.

Look at this Citylink outage, I'm sure the sales guys for rival companies
are right now working on their pitches for their customers' business
based on what has been posted.

I think you greatly underestimate how customers react to the truth.


I think you greatly underestimate how customers react to the truth.
  

Indeed - “The Cluetrain Manifesto” is probably a good starting point to understand exactly that point.

MMC