Mitigating human error in the SP

Michael_Dillon4 · February 3, 2010, 1:36am

The actual error happened when someone was troubleshooting a turn-up,
where in the past the customer in question has had their ethertype set
wrong. It wasn't a provisioning problem as much as someone
troubleshooting why it didn't come up with the customer. Ironically,
the NOC was on the phone when it happened, and the switch was rebooted
almost immediately and the outage lasted 5 minutes.

This is why large operators have a "ready for service" protocol. The customer
is never billed until the service is officially RFS, and making it RFS requires
more than an operational network; it also requires the customer to agree in
writing that they have a fully functional connection.

This is another way of hiding human error, because now the up-down-up is
just part of the provisioning process. There is a record of the RFS date-time,
so if the customer complains about an outage BEFORE that point, they can
be politely reminded when RFS happened and that charging does not
start until AFTER that point.
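
As a rough illustration of that record-keeping, here is a minimal Python
sketch. The `Circuit` class, its field names, and the rule that billing runs
from the recorded RFS time onward are all hypothetical, not any operator's
actual system.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class Circuit:
    # Hypothetical record; the field names are illustrative only.
    circuit_id: str
    rfs_at: Optional[datetime] = None  # set when the customer signs off in writing

    def mark_rfs(self, signed_off_at: datetime) -> None:
        """Record the ready-for-service time once, at customer sign-off."""
        if self.rfs_at is not None:
            raise ValueError(f"{self.circuit_id} is already RFS")
        self.rfs_at = signed_off_at

    def is_billable_at(self, when: datetime) -> bool:
        """Charging applies only from the recorded RFS time onward."""
        return self.rfs_at is not None and when >= self.rfs_at

    def outage_counts(self, outage_start: datetime) -> bool:
        """An up-down-up before RFS is provisioning, not a customer outage."""
        return self.is_billable_at(outage_start)
```

The point of keeping `rfs_at` as a single write-once field is that there is
exactly one timestamp to point at when the question of "was that an outage?"
comes up.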

--Michael Dillon

David_Hiers · February 3, 2010, 2:38am

If your manager pretends that they can manage humans without a few
well-worn human factor books on their shelf, quit.

David

Steven_Bellovin · February 3, 2010, 2:44am

Yup. Or use a database and a template-driven compiler. See "Configuration management and security", IEEE Journal on Selected Areas in Communications, 27(3):268-274, April 2009, by myself and Randy Bush, http://www.cs.columbia.edu/~smb/papers/config-jsac.pdf (the system described is Randy's work, from many years ago).
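
To make the idea concrete, here is a minimal Python sketch of a database plus a template-driven compiler. It is not the system described in the paper: the `CUSTOMERS` rows, the `PORT_TEMPLATE` text, the allowed-ethertype list, and the config syntax itself are made up for illustration and do not match any particular vendor.

```python
from string import Template

# Hypothetical "database"; in practice these would be rows from a real DB.
CUSTOMERS = [
    {"port": "ge-0/0/1", "name": "ACME", "vlan": 101, "ethertype": "0x8100"},
    {"port": "ge-0/0/2", "name": "Globex", "vlan": 102, "ethertype": "0x88a8"},
]

# Assumed set of ethertypes this (imaginary) network permits on customer ports.
ALLOWED_ETHERTYPES = {"0x8100", "0x88a8", "0x9100"}

# Generic pseudo-config, not real vendor syntax.
PORT_TEMPLATE = Template(
    "interface $port\n"
    " description $name\n"
    " vlan $vlan\n"
    " ethertype $ethertype\n"
)


def validate(row):
    """Return a list of problems found in one database row."""
    errors = []
    if not 1 <= row["vlan"] <= 4094:
        errors.append(f"vlan {row['vlan']} out of range")
    if row["ethertype"] not in ALLOWED_ETHERTYPES:
        errors.append(f"ethertype {row['ethertype']} not allowed")
    return errors


def compile_config(rows):
    """Render every row, refusing to emit anything if any row is invalid."""
    problems = [(r["port"], e) for r in rows for e in validate(r)]
    if problems:
        raise ValueError(f"refusing to generate config: {problems}")
    return "".join(PORT_TEMPLATE.substitute(r) for r in rows)
```

Nobody types the ethertype into a router by hand in this model; a wrong value has to get past `validate` first, and the whole run fails rather than producing a partial config.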

--Steve Bellovin, http://www.cs.columbia.edu/~smb

Brian_Christopher_Ra · February 3, 2010, 1:47pm

Reminds me of the saying: nothing is foolproof given a sufficiently talented
fool. I do agree that checklists, peer reviews, parallel turn-ups, and lab
testing, when used and not jury-rigged, have helped me prepare for issues.
Usually the times I skipped those things are the times I kick myself for not
doing them. Another thing that helps is giving yourself enough time, doing what
you can ahead of time, and being ready on time. Just my two bits.

Ross_Vandegrift · February 3, 2010, 4:14pm

Vijay's stuff is fascinating. The vision is great. But in my
experience, the vendors and implementations basically ruin the dream
for anyone who doesn't have his pull.

I'm sure my software is nowhere close to being as sophisticated as
his, but my plans are pretty much in line with his suggestions. Some
problems I've run into that I don't see any kind of solution for:

1) Forwarding-impacting bugs: IOS bugs that are triggered by SNMP are
easily the #1 cause of our accidental service impact. Most seem to be
race conditions that require real-world config and forwarding load -
not something a small shop can afford to build a lab to reproduce. If
we stuck to manual deployment, we might have made a few mistakes but
would it have been worse? Maybe - but honestly, it could be a wash.

2) Vendor support is highly suspicious of automation: anytime I open a
ticket, even unrelated to an automated software process, the first
thing the vendor support demands is to disable all automation.
Juniper is by far the best about this, and they *still* don't actually
believe their own automation tools work. Cisco TAC's answer has
always been "don't ever use SNMP if it causes crashes!" Procurve
doesn't even bother to respond to tickets related to automation bugs,
even if they are remotely triggerable crashes in the default config.

3) Automation interfaces are largely unsupported: I imagine vendor
software development having one or two guys that are the masterminds
for SNMP/NETCONF/whatever - and that's it. When I have a question on
how to find a particular tool, or find a bug in an automation
function, I can often go months on a ticket with people that have no
idea what I'm talking about. What documentation exists is typically
incomplete or inconsistent across versions and product lines.

4) Related tools prevent reliable error reporting: as far as I can
tell, Net-SNMP returns random values if a request fails; if there's a
pattern, I've failed to discern it. expect is similar. ScreenOS's
SSH implementation always returns that a file copy failed. Procurve
only this year implemented ssh key-based auth in combination with
remote authentication. The best-of-breed seems to be an oft-pathetic
collection of tools.

5) Management support: developing automation software is hard - network
devices aren't nearly as easy to deal with as they should be. When I
spend weeks developing features that later cause IOS to spontaneously
reload, people that don't understand the relation to operational
impact start to advocate dismantling the automation just like the
vendors above.
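
On point 4, one defensive pattern is to never trust a tool's exit status or
its output on their own. The sketch below is a generic Python wrapper, not
anything specific to Net-SNMP or expect; `run_checked`, `ToolError`, and the
idea of validating output against a caller-supplied regex are assumptions made
for illustration.

```python
import re
import subprocess
from typing import Sequence


class ToolError(RuntimeError):
    """Raised whenever the external tool's result cannot be trusted."""


def run_checked(cmd: Sequence[str], expect: str, timeout: float = 10.0) -> str:
    """Run cmd and return its stdout only if every check passes.

    expect is a regex the whole (stripped) output must match, so a tool
    that exits 0 but prints garbage is still treated as a failure.
    """
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
    except (OSError, subprocess.TimeoutExpired) as exc:
        raise ToolError(f"{cmd[0]}: could not complete: {exc}") from exc
    if proc.returncode != 0:
        raise ToolError(f"{cmd[0]}: exit {proc.returncode}: {proc.stderr.strip()}")
    out = proc.stdout.strip()
    if re.fullmatch(expect, out) is None:
        raise ToolError(f"{cmd[0]}: unexpected output {out!r}")
    return out
```

For example, a counter poll could be wrapped as
`run_checked(["snmpget", "-v2c", "-c", community, "-Oqv", host, oid], r"\d+")`,
where `community`, `host`, and `oid` are placeholders. This does not fix a
tool that lies, but it turns "random values" into an exception the caller has
to handle.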

I'm sure we'll continue to build automated policy and configuration
tools. I'm just not convinced it's the panacea that everyone thinks.
Unless you're one of the biggest, it puts your network at someone
else's mercy - and that someone else doesn't care about your
operational expenses.

Ross

Christopher_Morrow · February 3, 2010, 4:35pm

Vendors build what will make them money. If you want a
device to do X, getting lots of your friends in the operator community to
agree and talk to the vendor with the same message helps the vendor
understand and prioritize the request.

If you want more/better/faster/simpler configuration via 'script' (program)
it makes sense to ask the vendor(s) for these capabilities...

-chris

Michael_Dillon4 · February 3, 2010, 10:57pm

> 3) Automation interfaces are largely unsupported:

CLI is an automation interface. Combine that with a management server
from which telnet sessions to the router can be managed, and you have
probably the lowest risk automation interface possible. This may force
you into building your own tools, but if you really want low risk, that's
the price you pay.
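
A home-built tool of that kind can be quite small. The Python sketch below
assumes a hypothetical `send` callable that wraps whatever telnet or SSH
session the management server holds and returns the device's reply to one
line; the `ERROR_MARKERS` strings are also an assumption and would need tuning
per platform.

```python
from typing import Callable, Iterable, List

# Strings that many CLIs print when they reject input.
# This list is an assumption; adjust it for the platforms you run.
ERROR_MARKERS = ("% Invalid", "% Incomplete", "% Ambiguous", "error:")


class CliRejected(RuntimeError):
    """Raised when the device's reply to a line looks like an error."""


def push_lines(send: Callable[[str], str], lines: Iterable[str]) -> List[str]:
    """Send lines one at a time, stopping at the first one the device rejects.

    Returns the list of lines that were accepted.
    """
    accepted: List[str] = []
    for line in lines:
        reply = send(line)
        if any(marker in reply for marker in ERROR_MARKERS):
            raise CliRejected(
                f"{line!r} rejected after {len(accepted)} accepted lines: "
                f"{reply.strip()!r}"
            )
        accepted.append(line)
    return accepted
```

Reading the reply to every line, instead of pasting a whole config blind, is
what keeps the risk low: the tool stops at the first line the router
complains about, the same way a careful human at the console would.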

> I'm sure we'll continue to build automated policy and configuration
> tools. I'm just not convinced it's the panacea that everyone thinks.
> Unless you're one of the biggest, it puts your network at someone
> else's mercy - and that someone else doesn't care about your
> operational expenses.

That is not a risk of automation. That is a risk of buy versus build.

More and more businesses of all sorts are beginning to take a new
look at their software and automated systems with a view towards
building and owning and maintaining the parts that really are business
critical for their unique business. In this brave new world, only the
non-essential stuff will be bought in as packages.

--Michael Dillon

David_Hiers · February 4, 2010, 6:08am

You can completely implement Vijay's most impressive stuff and simply
move the problem to a different level of abstraction.

No matter what you do, it still comes down to some geek banging on
some plastic thingy. I'm as likely to screw up an "Extensible
Entity-Attribute-Relationship" as I am an ACL.

David