Mitigating human error in the SP

Hello NANOG,

Long time listener, first time caller.

A recent organizational change at my company has put someone in charge
who is determined to make things perfect. We are a service provider,
not an enterprise company, and our business is doing provisioning work
during the day. We recently experienced an outage when an engineer,
troubleshooting a failed turn-up, changed the ethertype on the wrong
port losing both management and customer data on said device. This
isn't a common occurrence, and the engineer in question has a pristine
track record.

This outage, of a high profile customer, triggered upper management to
react by calling a meeting just days after. Put bluntly, we've been
told "Human errors are unacceptable, and they will be completely
eliminated. One is too many."

I am asking the respectable NANOG engineers....

What measures have you taken to mitigate human mistakes?

Have they been successful?

Any other comments on the subject would be appreciated, we would like
to come to our next meeting armed and dangerous.

Thanks!
Chad

Automated config deployment / provisioning. And sanity checking
before deployment.
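As a rough sketch of what I mean (the template, field names and allowed ethertype values below are invented for illustration, not taken from any real platform): render the turn-up config from structured service data and refuse to render anything that contradicts the inventory record, so a mistyped port never reaches a device in the first place.

# Sketch only: template, field names and allowed values are illustrative.
from jinja2 import Template

PORT_TEMPLATE = Template(
    "interface {{ port }}\n"
    " description CUST:{{ customer }}\n"
    " ethertype {{ ethertype }}\n"
)

VALID_ETHERTYPES = {"0x8100", "0x88a8", "0x9100"}  # whatever your platform actually supports

def build_config(service, inventory):
    # refuse to render anything that contradicts the inventory record
    if service["port"] not in inventory["free_ports"]:
        raise ValueError("port %s is not free per inventory" % service["port"])
    if service["ethertype"] not in VALID_ETHERTYPES:
        raise ValueError("unexpected ethertype %s" % service["ethertype"])
    return PORT_TEMPLATE.render(**service)

# the deployment tool only ever sees the rendered, pre-checked config
print(build_config(
    {"port": "ge-0/0/5", "customer": "ACME", "ethertype": "0x88a8"},
    {"free_ports": {"ge-0/0/5", "ge-0/0/6"}},
))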

A lab in which changes can be simulated and rehearsed ahead of time, new OS revisions tested, etc.

A DCN.

Vijay Gill had some real interesting insights into this in a presentation he gave back at NANOG 44:

http://www.nanog.org/meetings/nanog44/presentations/Monday/Gill_programatic_N44.pdf

His Blog article on "Infrastructure is Software" further expounds upon the benefits of such an approach - http://vijaygill.wordpress.com/2009/07/22/infrastructure-is-software/

That stuff is light years ahead of anything anybody is doing today (well, apart from maybe Vijay himself :wink: ...) but IMO it's where we need to start heading.

Stefan Fouant, CISSP, JNCIE-M/T
www.shortestpathfirst.net
GPG Key ID: 0xB5E3803D

If upper management believes humans can be required to make no errors, ask whether they have achieved that ideal for themselves. If they say yes, start a recorder and ask them how. When they get done, ask them why they think the solution that worked for them will scale to a broader population. (Don't worry, you won't get to the point of needing the recorder.)

Otherwise, as Suresh notes, the only way to eliminate human error completely is to eliminate the presence of humans in the activity.

For those processes retaining human involvement, procedures and interfaces can be designed to minimize human error. Well-established design specialty: human factors, usability, etc. Typically quite effective. Worth using.

d/

I'll say "as vijay gill notes" after Stefan posted those two very
interesting links. He's saying much the same as I did - in a great
deal more detail. Fascinating.

Hello NANOG,

Long time listener, first time caller.

A recent organizational change at my company has put someone in charge
who is determined to make things perfect. We are a service provider,
not an enterprise company, and our business is doing provisioning work
during the day. We recently experienced an outage when an engineer,
troubleshooting a failed turn-up, changed the ethertype on the wrong
port losing both management and customer data on said device. This
isn't a common occurrence, and the engineer in question has a pristine
track record.

Why didn't the customer have a backup link if their service was so
important to them and indirectly your upper management? If your
upper management are taking this problem that seriously, then your
*sales people* didn't do their job properly - they should be ensuring
that customers with high availability requirements have a backup link,
or aren't led to believe that the single-point-of-failure service will
be highly available.

This outage, of a high profile customer, triggered upper management to
react by calling a meeting just days after. Put bluntly, we've been
told "Human errors are unacceptable, and they will be completely
eliminated. One is too many."

If upper management don't understand that human error is a risk factor
that can't be completely eliminated, then I suggest "self-eliminating"
and finding yourself a job somewhere else. The only way you'll avoid
human error having any impact on production services is to not change
anything - which pretty much means not having a job anyway ...

Leaving the PHB rhetoric aside for a few moments, this comes down to two
things: 1. cost vs. return and 2. realisation that service availability is
a matter of risk management, not a product bolt-on that you can install in
your operations department in a matter of days.

Pilot error can be substantially reduced by a variety of different things,
most notably good quality training, good quality procedures and
documentation, lab staging of all potentially service-affecting operations,
automation of lots of tasks, good quality change management control,
pre/post project analysis, and basic risk analysis of all regular procedures.

You'll note that all of these things cost time and money to develop,
implement and maintain; also, depending on the operational service model
which you currently use, some of them may dramatically affect operational
productivity one way or another. This often leads to a significant
increase in staffing / resourcing costs in order to maintain similar levels
of operational service. It also tends to lead to inflexibility at various
levels, which can have a knock-on effect in terms of customer expectation.

Other things which will help your situation from a customer interaction
point of view are rigorous use of maintenance windows and good
communications to ensure that customers understand that there are risks
associated with maintenance.

Your management is obviously pretty upset about this incident. If they
want things to change, then they need to realise that reducing pilot error
is not just a matter of getting someone to bark at the tech people until
the problem goes away. They need to be fully aware at all levels that risk
management of this sort is a major undertaking for a small company, and
that it needs their full support and buy-in.

Nick

Humans make errors.

For your upper management to think they can build a foundation of reliability on the theory that humans won't make errors is self-deceiving.

But that isn't where the story ends. That's where it begins. Your infrastructure, processes and tools should all be designed with that in mind so as to reduce or eliminate the impact that human error will have on the reliability of the service you provide to your customers.

So, for the example you gave there are a few things that could be put in place. The first one, already mentioned by Chad, is that mission critical services should not be designed with single points of failure - that situation should be remediated.

Another question to be asked - since this was provisioning work being done, and it was apparently being done on production equipment, could the work have been done at a time of day (or night) when an error would not have been as much of a problem?

You don't say how long the outage lasted, but given the reaction by your upper management, I would infer that it lasted for a while. That raises the next question. Who besides the engineer making the mistake was aware of the fact that work on production equipment was occurring? The reason this is important is because having the NOC know that work is occurring would give them a leg up on locating where the problem is once they get the trouble notification.

Paul


Hello NANOG,

Long time listener, first time caller.

[snip]

What measures have you taken to mitigate human mistakes?

Have they been successful?

Any other comments on the subject would be appreciated, we would like
to come to our next meeting armed and dangerous.

Define your processes well, and have management sign off so there's no
blame game and people realize they are all on the same side. Use peer
review. Don't start automating until you have a working system, and
then get the humans out of the repetitive bits. Don't build monolithic
systems. Test your automation well. Be sure to have the symmetric
*de-provisioning* for any provisioning, else you will be relying on
humans to clean out the cruft instead of addressing the problem.
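On the symmetric de-provisioning point, a rough sketch of the shape I mean (the config snippets and field names are made up): have the service definition generate both its turn-up and its tear-down from the same record, so cleanup never depends on someone remembering what was added.

# Sketch only: config snippets and field names are illustrative.
from dataclasses import dataclass

@dataclass
class VlanService:
    port: str
    vlan: int
    customer: str

    def provision(self):
        return (f"interface {self.port}\n"
                f" description CUST:{self.customer}\n"
                f" vlan {self.vlan}\n")

    def deprovision(self):
        # generated from the same record as provision(), so the inverse
        # always exists and never has to be reconstructed by hand
        return (f"interface {self.port}\n"
                f" no vlan {self.vlan}\n"
                f" no description\n")

svc = VlanService(port="ge-0/0/5", vlan=120, customer="ACME")
print(svc.provision())
print(svc.deprovision())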

Extend accountability throughout the organization - replace commission-
minded sales folks with relationship-minded account management.

Always have OoB. Require vendors to be *useful* under OoB conditions,
at least to your more advanced employees.

Expect errors in the system and in execution; develop ways to check
for them and be prepared to modify methods, procedures and tools
without multiple years and inter-departmental bureaucracy. Change and
errors happen, so capitalize on those events to improve your service
and systems rather than emphasizing punishment.

Cheers,

Joe

Humans make errors.

For your upper management to think they can build a foundation of reliability on the theory that humans won't make errors is self-deceiving.

But that isn't where the story ends. That's where it begins. Your infrastructure, processes and tools should all be designed with that in mind so as to reduce or eliminate the impact that human error will have on the reliability of the service you provide to your customers.

So, for the example you gave there are a few things that could be put in place. The first one, already mentioned by Chad, is that mission critical services should not be designed with single points of failure - that situation should be remediated.

Agreed.

Another question to be asked - since this was provisioning work being done, and it was apparently being done on production equipment, could the work have been done at a time of day (or night) when an error would not have been as much of a problem?

As it stands now, businesses want to turn their services up when they
are in the office. We do all new turn-ups during the day; anything
requiring a roll or maintenance window is scheduled in the middle of
the night.

You don't say how long the outage lasted, but given the reaction by your upper management, I would infer that it lasted for a while. That raises the next question. Who besides the engineer making the mistake was aware of the fact that work on production equipment was occurring? The reason this is important is because having the NOC know that work is occurring would give them a leg up on locating where the problem is once they get the trouble notification.

The actual error happened when someone was troubleshooting a turn-up;
in the past the customer in question has had their ethertype set
wrong. It wasn't a provisioning problem so much as someone
troubleshooting, with the customer, why the service didn't come up.
Ironically, the NOC was on the phone when it happened; the switch was
rebooted almost immediately and the outage lasted five minutes.

Chad

We have solved 98% of this with standard configurations and templates.

To deviate from this requires management approval/exception approval after an evaluation of the business risks.

Automation of config building is not too hard, and certainly things like peer-groups (cisco) and regular groups (juniper) make it easier.

If you go for the holy grail, you want something that takes into account the following:

1) each phase in the provisioning/turn-up state
2) each phase in infrastructure troubleshooting (turn-up, temporary outage/temporary testing, production)
3) automated pushing of config via load override/commit replace to your config space.
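To make item 3 a bit more concrete, here is roughly what a template-driven full-config push could look like with Junos PyEZ. Treat it as a sketch rather than working tooling: the host, template and variables are placeholders, and exact argument names may differ between PyEZ versions.

# Sketch only: host, template and variables are placeholders.
from jnpr.junos import Device
from jnpr.junos.utils.config import Config

def push_full_config(host, template_path, template_vars):
    dev = Device(host=host, user="provisioning")
    dev.open()
    try:
        cu = Config(dev)
        cu.lock()
        # overwrite=True gives "load override" behaviour: the rendered
        # template replaces the whole candidate config, so the device ends
        # up matching what the template system says it should be
        cu.load(template_path=template_path,
                template_vars=template_vars,
                overwrite=True)
        print(cu.diff())  # show the operator what is about to change
        cu.commit(comment="automated provisioning push")
        cu.unlock()
    finally:
        dev.close()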

Obviously testing, etc. is important. I've found that whenever a human is involved, mistakes happen. There is also the "Software is imperfect" mantra that should be repeated. I find vendors at times have demanding customers who want perfection. Bugs happen, outages happen; the question is how you respond to these risks.

If you have poor handling of bugs, outages, etc. in your process, or are decision-gridlocked, very bad things happen.

- Jared

I would also point Chad to this book: http://bit.ly/cShEIo (Amazon link to Visible Ops).

It's very useful to have your management read it. You may or may not be able to or want to use a full ITIL process, but understanding how these policies and procedures can/should work, and using the ones that apply makes sense.

Change control, tracking, and configuration management are going to be key to avoiding mistakes, and being able to rapidly repair when one is made.

Unfortunately, most management that demands No Tolerance, Zero Error from operations won't read the book.

Good luck. I'd bet most of the people on this list have been there at one time or another.

Cheers,
-j

Chadwick Sorrell wrote:

This outage, of a high profile customer, triggered upper management to
react by calling a meeting just days after. Put bluntly, we've been
told "Human errors are unacceptable, and they will be completely
eliminated. One is too many."

Good, Fast, Cheap - pick any two. No, you can't have all three.

Here, Good is defined by your pointy-haired bosses as an impossible-to-achieve zero error rate.[1] Attempting to achieve this is either going to cost $$$, or your operations speed (how long it takes people to do things) is going to drop like a rock. Your first action should be to make sure upper management understands this so they can set the appropriate priorities on Good, Fast, and Cheap, and make the appropriate budget changes.

It's going to cost $$$ to hire enough people to have the staff necessary to double-check things in a timely manner, OR things are going to slow way down as the existing staff is burdened by the double-checking of everything and triple-checking of some things required to try to achieve a zero error rate. They will also need to spend $$$ on software (to automate as much as possible) and testing equipment. They will also never actually achieve a zero error rate, as this is an impossible task that no organization has ever achieved, no matter how much emphasis or money they pour into it (e.g. Windows vulnerabilities) or how important it is (see the Challenger, Columbia, and Mars Climate Orbiter incidents).

When you put a $$$ cost on trying to achieve a zero error rate, pointy-haired bosses are usually willing to accept a normal error rate. Of course, they want you to try to avoid errors, and there are a lot of simple steps you can take in that effort (basic checklists, automation, testing) which have been mentioned elsewhere in this thread and which will cost some money, but not the $$$ required to try to achieve a zero error rate. Make sure they understand that the budget they allocate for these changes will be strongly correlated with how Good (zero error rate) and Fast (quick operational responses to turn-ups and problems) the outcome of this initiative turns out to be.

jc

[1] http://www.godlessgeeks.com/LINKS/DilbertQuotes.htm

2. "What I need is a list of specific unknown problems we will encounter." (Lykes Lines Shipping)

6. "Doing it right is no excuse for not meeting the schedule." (R&D Supervisor, Minnesota Mining & Manufacturing/3M Corp.)

Interesting book, maybe I'll bring that to the next meeting. Thanks
for the heads up on that.

Those things and some of the others that have been mentioned will go a very long way to prevent the second occurrence.

Only training, adequate (number and quality) staff, and a quality-above-all-else culture have a prayer of preventing the first occurrence. (For sure, lots of the second-occurrence preventers may be part of that quality-first culture.)

Thanks for all the comments!

Automated config deployment / provisioning. And sanity checking
before deployment.

Easy to say, not so easy to do. For instance, that incorrect port was identified
by a number or name. Theoretically, if an automated tool pulls the number/name
from a database and issues the command, then the error cannot happen. But how
does the number/name get into the database?

I've seen a situation where a human being enters that number, copying it from
another application screen. We hope that it is done by copy/paste all the
time but who knows? And even copy/paste can make mistakes if the selection
is done by mouse by someone who isn't paying enough attention.

But wait! How did the other application come up with that number for copying?
Actually, it was copy-pasted from yet a third application, and that application
got it by copy/paste from a spreadsheet.

It is easy to create a tangled mess of OSS applications that are glued together
by lots of manual human effort creating numerous opportunities for human error.
So while I wholeheartedly support automation of network configuration, that is
not a magic bullet. You also need to pay attention to the whole process, the
whole chain of information flow.
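One cheap cross-check along that chain, sketched below (get_interface_description is a stand-in for however you actually query the device - NETCONF, SNMP, CLI scrape): before touching a port, verify that what the device says about it agrees with what the order says, and refuse to proceed otherwise.

# Sketch only: the lookup function stands in for a real device query.
def confirm_target_port(order, get_interface_description):
    """Refuse to touch a port whose live description doesn't match the order."""
    live_desc = get_interface_description(order["device"], order["port"])
    expected = "CUST:%s" % order["customer"]
    if expected not in live_desc:
        raise RuntimeError(
            "refusing change: %s %s is described as %r, expected %r"
            % (order["device"], order["port"], live_desc, expected))

# example with a canned lookup standing in for the real device query
descriptions = {("edge1", "ge-0/0/5"): "CUST:ACME 10G handoff"}
confirm_target_port(
    {"device": "edge1", "port": "ge-0/0/5", "customer": "ACME"},
    lambda dev, port: descriptions[(dev, port)],
)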

And there are other things that may be even more effective such as hiding your
human errors. This is commonly called a "maintenance window" and it involves
an absolute ban on making any network change, no matter how trivial, outside
of a maintenance window. The human error can still occur but because it is
in a maintenance window, the customer either doesn't notice, or if it is planned
maintenance, they don't complain because they are expecting a bit of disruption
and have agreed to the planned maintenance window.

That only leaves break-fix work which is where the most skilled and trusted
engineers work on the live network outside of maintenance windows to fix
stuff that is seriously broken. It sounds like the event in the original posting
was something like that, but perhaps not, because this kind of break-fix work
should only be done when there is already a customer-affecting issue.

By the way, even break-fix changes can, and should be, tested in a lab
environment before you push them onto the network.

--Michael Dillon

Never said it was, and never said foolproof either. Minimizing the
chance of error is what I'm after - and ssh'ing in + hand typing
configs isn't the way to go.

Use a known good template to provision stuff - and automatically
deploy it - and the chances of human error go down quite a lot. Getting
it down to zero defects from there is another kettle of fish altogether
- a much more expensive one, with dev/test, staging and production
environments, documented change processes, maintenance windows, etc.