Mitigating human error in the SP

Otherwise, as Suresh notes, the only way to eliminate human error completely is
to eliminate the presence of humans in the activity.

and, hence, by reference.....

Automated config deployment / provisioning.
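
For what it's worth, a minimal sketch of the sort of thing meant by automated config deployment -- render the config from a template, then force a human review of the diff before anything gets pushed. The template, the variable names, and the placeholder "deploy" step are all assumptions for illustration, not anyone's production tooling:

    # Purely illustrative sketch -- not real tooling. Render a device config
    # from a template, show a diff against the running config, and only
    # "deploy" (here just a print) after a human confirms. In practice the
    # push would go through your vendor's API or something like Ansible.
    import difflib
    from string import Template

    TEMPLATE = Template(
        "hostname $hostname\n"
        "interface $uplink\n"
        " description Uplink to $peer\n"
        " ip address $uplink_ip\n"
    )

    def render(values):
        # Fails loudly (KeyError) if a variable is missing, which is the point.
        return TEMPLATE.substitute(values)

    def diff(running, candidate):
        return "\n".join(difflib.unified_diff(
            running.splitlines(), candidate.splitlines(),
            fromfile="running", tofile="candidate", lineterm=""))

    if __name__ == "__main__":
        running = ("hostname edge1\n"
                   "interface ge-0/0/0\n"
                   " description old uplink\n"
                   " ip address 192.0.2.1/30\n")
        candidate = render({"hostname": "edge1", "uplink": "ge-0/0/0",
                            "peer": "transit-a", "uplink_ip": "192.0.2.1/30"})
        print(diff(running, candidate))
        if input("Push candidate config? [y/N] ").strip().lower() == "y":
            print("(deploy step would go here)")
        else:
            print("Aborted -- nothing pushed.")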

That's the funniest thing I've read all day... :wink:

A little pessimistic rant.... :wink:

Who writes the scripts that you use? Who writes the software that you use? There will always be at least one human somewhere, and where there is a human writing software tools, there is scope for bugs and unexpected issues. Whether inadvertent or not, they will always be there.

If the excrement is going to hit the proverbial fan, then try as you might to stop it, it will happen. Nothing in the IT / ISP / Telco world is ever going to be perfect; it is far too complex, with far too many dependencies. Yes, you might play in your perfect little labs until the cows come home ..... but there always has been, and always will be, an element of risk when you start making changes in production.

Face it: unless you follow the rigorous change control and development practices used for avionics or other high-risk environments, you are always going to be left with some element of risk.

How much risk your company is prepared to take is something for the men in black (suits) to decide, because it correlates directly with how much $$$ they are prepared to throw your way to help you mitigate that risk ..... :wink:

That's my 2 <insert_currency>, over ...... thanks for listening (or not!) .... :wink:

Agreed.

I'd say that 10 minutes of checklist creation at the onset of a change
plan, then 5 minutes of checklist revision/debrief per day, is time well
spent. After a couple of months, attitudes to SOPs usually change.
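
For concreteness, a toy sketch of the kind of lightweight checklist meant here -- the items and wording are invented, not a real SOP:

    # Toy example of a change checklist kept next to the change plan.
    # The items are made up; the value is that they are written down
    # before the change, ticked off during the window, and revised at
    # the daily debrief.
    CHECKLIST = [
        "Change window approved and announced",
        "Rollback config saved and tested",
        "Pre-change snapshot of routes/interfaces taken",
        "Change applied and verified against the plan",
        "Post-change snapshot compared with pre-change",
        "Debrief note written: what surprised us today?",
    ]

    def run(items):
        for number, item in enumerate(items, start=1):
            answer = input(f"{number}. {item} -- done? [y/N] ").strip().lower()
            if answer != "y":
                print("Stop here and resolve this step before continuing.")
                return False
        print("All steps complete.")
        return True

    if __name__ == "__main__":
        run(CHECKLIST)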

_insert duplicate of aviation-style check-listing and human factors
reporting thread here_

Gord

Add to that the stuff that always sounds like a cop-out, even to the victims--the "human error" made by people not on your payroll: the vendors responsible for the misleading (or absent) documentation, for the CLI stuff that doesn't work just the way a reasonable person would expect it to, for the hardware that fails dirty, and on and on--a very long list. All of it exacerbated by management that cheaps out on equipment, software, documentation, training, and staff.

Even with a lab stocked with a rich fabric of equipment, there will still be most of the other things to contend with.

A reasonable and competent management will not only provide what is needed for a reasonable error rate (which indeed can approach one over 5 nines) but will also provide the means of recovery when the inevitable happens. That might involve "needless" expense like additional staff, redundant equipment, alternate paths, ...
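
To put a number on "five nines", since it gets thrown around loosely, a quick back-of-the-envelope:

    # Quick arithmetic, nothing more: "five nines" as downtime and error budget.
    availability = 0.99999                      # 99.999%
    minutes_per_year = 365.25 * 24 * 60
    downtime = (1 - availability) * minutes_per_year
    print(f"Allowed downtime: {downtime:.2f} minutes per year")   # about 5.26
    print(f"Error budget: roughly 1 failure in {round(1 / (1 - availability))}")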

But it won't involve whippings until morale improves, or reductions in staff and funding until the errors go away.