Those that remember the discussion may find this article interesting:
http://abcnews.go.com/Health/wireStory?id=9394406
Owen
1. I grew up at the local airport watching my CFII pop train an
endless stream of pilots.
2. The checklist for my last production gear swap had over 400 steps
and 4 time/task gates (each with a rollback plan). As I did each
sequence of steps, I called it out, and someone read their copy of the
checklist and checked it off. An entire peanut gallery of rogues
watched the whole thing on LiveMeeting, waiting to pounce on the first
misstep or shortcut. (A rough sketch of the pattern is below.)
3. We migrated an entire nationwide phone system in 6 hours and
nobody noticed anything.
4. We met afterward in an after-action review, a practice I picked up
in the Army.
I'm more persistent than smart, and I tell ya, if you prep well
enough, you can hand your checklist to a stoned intern and you'll have
no worries at all.
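For what it's worth, here is a minimal sketch of that call-out /
check-off pattern if you mechanize it. It is hypothetical (the names,
steps, and confirm prompt are made up, not our actual MOP); the point
is just that every step carries its own rollback and nothing proceeds
without a second set of eyes:

# Hypothetical checklist runner: every step is called out, confirmed by
# a second person, and carries its own rollback. Steps are illustrative.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Step:
    description: str
    execute: Callable[[], None]
    rollback: Callable[[], None]

def run_checklist(steps: List[Step]) -> None:
    """Run steps in order; on failure or a missed confirmation,
    roll back the completed steps in reverse order."""
    done: List[Step] = []
    for number, step in enumerate(steps, start=1):
        print(f"Step {number}: {step.description}")
        if input("  second set of eyes confirms? [y/N] ").strip().lower() != "y":
            print("  not confirmed -- rolling back and stopping")
            break
        try:
            step.execute()
            done.append(step)
            print("  done, checked off")
        except Exception as err:
            print(f"  FAILED ({err}) -- rolling back")
            break
    else:
        print("Checklist complete.")
        return
    for step in reversed(done):
        print(f"Rolling back: {step.description}")
        step.rollback()

if __name__ == "__main__":
    run_checklist([
        Step("Announce maintenance window", lambda: None, lambda: None),
        Step("Swap the gear / cut traffic over", lambda: None, lambda: None),
        Step("Verify service end to end", lambda: None, lambda: None),
    ])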
David
I'm more persistent than smart, and I tell ya, if you prep well
enough, you can hand your checklist to a stoned intern and you'll
have no worries at all.
this works in a tech culture where folk follow mops (methods of
procedure) obsessively. my experience is that most north american
engineers are too smart to do that, and take shortcuts.
randy
Being a "North American Engineer", I resent that remark. =]
I _do_ create action plans and _do_ quarterback each step and _do_ slap down any attempt to deviate.
Eddy
I _do_ create action plans and _do_ quarterback each step and _do_
slap down any attempt to deviate.
imagine a network engineering culture where the concept of 'attempt to
deviate' just does not occur.
randy
=]
The networking group is under control.
It's the software engineers that start making edits to configs and code on the fly, improvisation at its finest. I guess my scope of interaction is greater than just networking. The hard part is that it's a peer situation: how do you elevate the members of another team who have a lesser standard of operation? Also, they feel it's fine to act like a cowboy and tackle problems on the fly, as long as the product is live before the window closes. Then there is the almighty "We can't back out, we already made too many changes" that makes me want to grab rope and attach it to the ceiling.
Have a Merry Christmas,
Eddy
Are you trying to suggest that this is something horrible, or that it's the future of network engineering?
I'm actually serious in asking the question, despite the grin.
-Dave
imagine a network engineering culture where the concept of 'attempt to
deviate' just does not occur.

Are you trying to suggest that this is something horrible, or that
it's the future of network engineering?
neither. it is one [type of] ops engineering culture, and a very
successful one. it seems, from this gaijin's naive point of view, to be
the common one in japan.
when i try to 'sell' configuration automation, they are confused by how
important it is to me. they have a hard time seeing the need because
mops just work. my read is that this is because people do not have the
arrogance to take shortcuts.
when one is raised knowing that one's responsibility to the group is
more important than how smart one may think that one is, mops work.
randy
I _do_ create action plans and _do_ quarterback each step and _do_
slap down any attempt to deviate.
imagine a network engineering culture where the concept of 'attempt to
deviate' just does not occur.
Are you trying to suggest that this is something horrible, or that
it's the future of network engineering? I'm actually serious in asking
the question, despite the grin.
Possibly, he is trying to hint at a connection with Nazis, so somebody will mention it, invoking Godwin's Law, and bringing a fruitless religious thread to a close.
There's a full range of methods, with "just do it" on one side, "deviation is grounds for dismissal" on the other, and plenty of shades of gray in between. I've seen both extremes result in excessive downtime. (How impromptu engineering can go wrong shouldn't take much imagination; the "no deviation" rule is especially hysterical when the backout plan doesn't work, but even without that, the "one thing didn't work exactly right, back it out and try again in two weeks" effect is destructive to both progress and morale.) Working with the dynamic and quality of the team is more important than any change management paradigm.
-Dave
imagine a network engineering culture where the concept of 'attempt to
deviate' just does not occur.

Are you trying to suggest that this is something horrible, or that
it's the future of network engineering?
The model of network engineering that grew up during the 1990s is forever gone unless you work in a smaller organization where people have to wear many hats. In the big ISPs, now identical to the big telcos, operations and engineering design duties are separated. The operations folks do not deviate from the written plans that they work with. If the slightest thing happens that is not in the plan, they roll back the changes as specified in the plan. They don't fix anything unless it is officially broken, with trouble tickets filed and escalations up to senior management. That is about the only time that operations people can get away with taking shortcuts and creative solutions.

On the other hand, the engineering design folks should spend a good part of their day trying things out, thinking up new ideas, and poking around equipment and software to see how far it can be pushed. Then, when they have learned something and are ready to implement it in the network, they write a detailed plan for operations. Then some other engineering folks test the heck out of that design to try to find fault with it. After all the faults are fixed, it goes to operations, and the engineering design folks move on to something else unless serious problems occur and operations needs a design engineer to approve some sensible action. The operations folk can't take the sensible action on their own because that would deviate from their plans, but getting the engineering design folks involved gives them an out for real emergencies.
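If you wanted to model that separation of duties in code (purely a hypothetical sketch, not anything any carrier actually runs), it amounts to a small state machine in which ops may only execute or roll back a written plan, and any deviation requires a design engineer's sign-off:

# Hypothetical sketch of the ops/design split as a change-plan state
# machine: ops may execute or roll back exactly as written; deviating
# requires design engineering sign-off. Roles and states are illustrative.

ALLOWED = {
    "drafted":   {"design": {"submit_for_test"}},
    "in_test":   {"design": {"approve", "reject"}},
    "approved":  {"ops": {"execute"}},
    "executing": {"ops": {"complete", "rollback"},
                  "design": {"authorize_deviation"}},
}

TRANSITIONS = {
    "submit_for_test": "in_test",
    "approve": "approved",
    "reject": "drafted",
    "execute": "executing",
    "complete": "done",
    "rollback": "rolled_back",
    "authorize_deviation": "executing",  # stays in execution, with an approved out
}

class ChangePlan:
    def __init__(self, name):
        self.name = name
        self.state = "drafted"

    def act(self, role, action):
        if action not in ALLOWED.get(self.state, {}).get(role, set()):
            raise PermissionError(
                f"{role} may not '{action}' while plan is '{self.state}'")
        self.state = TRANSITIONS[action]
        print(f"{self.name}: {role} did '{action}' -> {self.state}")

if __name__ == "__main__":
    plan = ChangePlan("core router upgrade")
    plan.act("design", "submit_for_test")
    plan.act("design", "approve")
    plan.act("ops", "execute")
    try:
        plan.act("ops", "authorize_deviation")  # ops cannot deviate on their own
    except PermissionError as err:
        print("blocked:", err)
    plan.act("design", "authorize_deviation")   # the "out" for real emergencies
    plan.act("ops", "complete")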
So the term "network engineering" is ambiguous, because a lot of people use it to mean the 90's-style job where engineering design activity and operational activity were all jumbled together. In some companies, taking the engineering design track not only means that you lose enable on the routers, but you lose all TACACS access and have to get authorisation from a VP just to ask for a copy of the running config on a production router. Some people like ops because they see a lot of stuff go by and learn from it, get their CCIE, and move into design engineering. Others like ops because they are scared of the responsibility for thinking up what to do next, and of making a mistake.

As far as I can see, the only way to get a job that mixes ops and design is to be in 3rd or 4th level support, the top of the technical escalation chain, where a few excellent design engineers do have enable on the routers because they fix important problems in near realtime. I suspect that it would be advantageous to have worked for a while in ops before moving into design engineering if you want to get into top-level support.

Take all this with a grain of salt. Every company does things a bit differently, and the terminology that is used is ambiguous. It would be interesting to see what others have to say about this answer.
--Michael Dillon
I think it's a pretty accurate summation of how these things work in a lot of big organizations, all over the world.
There's a detrimental side to it, in that in the engineering org, the near-complete siloing away from ops can lead to an ivory-tower/King Canute type of mentality; in the ops org, this phenomenon in turn can lead to increasing frustration and lowered morale, which in turn leads to apathy and poor customer service.
All too often, one ends up with mutually-hostile engineering and ops teams who waste time and energy actively working to frustrate one another's ambitions, rather than combining their efforts to design, build, and operate the best network possible. Which in turn leads to many of the frustrations experienced every day by the end-customer.
I think that one must keep in mind that there are two kinds of
check-lists. There is a takeoff list where you can always choose to go
back to the ramp and fly another day if something doesn't check out but
there is a different priority when someone is already in the air and
something goes wrong. You can't decide to land a different day. In
that case you must rely on experience and knowledge to handle the
situation as it presents itself. Sure, you can have some basic checks
for things even in an emergency but you can't know how the problem is
going to present itself ahead of time. In cases like that you have a set
of general parameters, but the person "at the controls" needs to have
leeway to both clearly identify the nature of the problem and mitigate
the same if possible and that might include calling in some extra eyes
in order to identify things that might be going on with applications or
other devices that aren't specifically network gear.
So you can put a lot of process around changes in advance but there
isn't quite as much to manage incidents that strike out of the clear
blue. Too much process at that point could impede progress in clearing
the issue. Capt. Sullenberger did not need to fill out an incident
report, bring up a conference bridge, and give a detailed description of
what was happening with his plane, the status of all subsystems, and his
proposed plan of action (subject to consensus of those on the conference
bridge) and get approval for deviation from his initial flight plan
before he took the required actions to land the plane as best as he
could under the circumstances. And while that example is extreme for
most networks, in that lives are not often at stake, some concepts are
the same (and there might be networks supporting various occupations on
this planet where lives might actually be at stake in the case of a
network failure during some sort of activity).
One of the most efficient shops I worked in was when the production
internet operation was owned by the engineering department. Corporate
operations owned the internal corporate IT, but engineering owned the
internet production data centers and network operations. If engineering
released a code revision that blew up the network, the VP of Engineering
was responsible for the entire picture, not just the software piece.
The same was true when a networking change blew up the application.
Having the responsibility for the entire "system" (software, hardware
platforms, and networking) under the same organization resulted in a
much smoother operation without backbiting, and greater access to and
sharing
of resources between the application engineers, the systems
administrators, and the network engineers.
Conversely, the ever-increasing outright hostility and contempt evinced towards their customers by airlines worldwide - especially US-based airlines - over the last decade or so, all in the name of 'regulations', offers a useful counterexample.
When it comes to larger organizations, this latter scenario is more the norm than what you describe, in my experience. Critical problems are left unresolved for days/weeks/months; if one attempts to report an issue which is causing problems for many of an organization's customers worldwide, but one isn't oneself a direct customer of said organization, one is as often as not ignored and shunted aside.
This isn't specific to the SP realm; it's simply a function of increased size, which leads to increased bureaucratization, which leads to dehumanization and the subordination of the organization's ostensible goals to internal politics, one-upmanship, and blame-laying, no matter the industry in question. The folks with a can-do attitude who're willing to buck the system in order to do the right thing for the customer stand out in stark contrast to their peers, and in many cases end up paying a price in terms of career advancement because of their willingness to Do The Right Thing.
'Process' is all too often merely a ruse designed to avoid responsibility, shift blame/liability, justify hiring lower-cost/unqualified employees whilst shedding expensive/competent employees, and indulge in empire-building. We've seen this throughout corporate America with the 'permanent Y2K' of SoX and HIPAA, and the increasing involvement of government in terms of telecommunications-related rule-making which ends up directly affecting SPs.
I'm a big advocate of standards and change-control, and not an advocate of seat-of-the-pants, midnight engineering - except when the latter is necessary, as in the examples you give.
Unfortunately, many folks who work in larger organizations are actively prohibited from indulging in fluid, situationally-appropriate problem resolution; and because of the aforementioned siloing of ops and engineering, their valuable first-hand experiences and the lessons learned thereby aren't taken into account during the design and rulemaking processes.
"*mayday mayday mayday. **Cactus fifteen thirty nine hit birds, we've lost
thrust (in/on) both engines we're turning back towards LaGuardia*" - Capt.
Sullenberger
Not exactly "detailed", but he definitely initiated an "incident report"
(the mayday), gave a "description of what was happening with his plane", the
"status of [the relevant] subsystems", and his proposed plan of action -
even in the order you've asked for!
His actions were then "subject to the consensus of those on the conference
bridge" (ie, ATC) who could have denied his actions if they believed they
would have made the situation worse (ie, if what they were proposing would
have had them on a collision course with another plane). In this case, the
conference bridge gave approval for his course of action ("ok uh, you need
to return to LaGuardia? turn left heading of uh two two zero." - ATC)
5 seconds before they made the above call they were reaching for the QRH
(Quick Reference Handbook), which contains checklists of the steps to take
in such a situation - including what to do in the event of loss of both
engines due to multiple birdstrikes. They had no need to confer with others
as to what actions to take to try and recover from the problem, or what
order to take them in, because that pre-work had already been carried out
when the check-lists were written.
Of course, at the end of the day, training, skill and experience played a
very large part in what transpired - but so did the actions of the people on
the "conference bridge" (You can't get much more of a "conference bridge"
than open radio frequencies), and the checklists they have for almost every
conceivable situation.
Scott.
I think any network engineer who sees a major problem is going to have a
"Houston, we have a problem" moment. And actually, he was telling the
ATC what he was going to need to do, he wasn't getting permission so
much as telling them what he was doing so traffic could be cleared out
of his way. First he told them he was returning to the airport, then he
asked about Teterboro; ATC called Teterboro to get a runway and inform
them of an inbound emergency, and then the Captain told ATC they were
going to be in the Hudson. And "I hit birds, have lost both
engines, and am turning back" results in a whole different chain of
events these days than "I have two guys banging on the cockpit door and
am returning" or simply turning back toward the airport with no
communication. And any network engineer is going to say something if he
sees CPU or bandwidth utilization hit the rail in either direction.
Saying something like "we just got flooded with thousands of /24 and
smaller wildly flapping routes from peer X and I am shutting off the BGP
session until they get their stuff straight" is different than "we just
got flooded with thousands of routes and it is blowing up the router and
all the other routers talking to it. Can I do something about it?"
And that illustrates a key point. In that case ATC was asking what the
pilot needed and was prepared to clear traffic, get emergency equipment
ready, and do whatever it took to give the person dealing with the
problem whatever they needed to resolve it in the best way possible.
ATC wasn't asking him if he was sure he had set the flaps at the right
angle, or "did you try to restart the engine" sorts of things.
What I was getting at is that sometimes too much process can get in the
way in an emergency and the time taken to implement such process can
result in a failure cascading through the network making the problem
much worse. I have much less of a problem with process surrounding
planned events. The more the better as long as it makes sense.
Migrations and additions and modifications *should* be well planned and
checklisted and have backout points and procedures. That is just good
operations when you have tight SLAs and tight maintenance windows with
customers you want to keep.
Happy Holidays
George
"mayday mayday mayday. Cactus fifteen thirty nine hit birds, we've lost
thrust (in/on) both engines we're turning back towards LaGuardia" -
Capt. Sullenberger
Not exactly "detailed", but he definitely initiated an "incident report"
(the mayday), gave a "description of what was happening with his plane",
the "status of [the relevant] subsystems", and his proposed plan of
action - even in the order you've asked for!
His actions were then "subject to the consensus of those on the
conference bridge" (ie, ATC) who could have denied his actions if they
believed they would have made the situation worse (ie, if what they were
proposing would have had them on a collision course with another plane).
In this case, the conference bridge gave approval for his course of
action ("ok uh, you need to return to LaGuardia? turn left heading of uh
two two zero." - ATC)
5 seconds before they made the above call they were reaching for the QRH
(Quick Reference Handbook), which contains checklists of the steps to
take in such a situation - including what to do in the event of loss of
both engines due to multiple birdstrikes. They had no need to confer
with others as to what actions to take to try and recover from the
problem, or what order to take them in, because that pre-work had
already been carried out when the check-lists were written.
Of course, at the end of the day, training, skill and experience played
a very large part in what transpired - but so did the actions of the
people on the "conference bridge" (You can't get much more of a
"conference bridge" than open radio frequencies), and the checklists
they have for almost every conceivable situation.
Scott.
Just to clear up a small point about pilots (I'm a pilot) - the
pilot-in-command has ultimate responsibility for his a/c and can ignore
whatever ATC tells him to do if he considers it contrary to the
safety of his flight (he may be asked to explain his actions later,
though). Now, usually ignoring ATC or keeping it in the dark about one's
intentions is not very clever - but controllers are not in the cockpit and
may misunderstand the situation or be simply mistaken about something (so
a pilot is encouraged to decline ATC instructions he considers to be in
error - informing ATC about it, of course).
But one of the first things a pilot does in an emergency is pull out
the appropriate emergency checklist. It is easy to forget to check
obvious things when the situation gets hectic (one of the distressingly
common causes of accidents is simply running out of fuel - either
because the pilot didn't do his homework on the ground (checking the
actual fuel level in the tanks, etc.) or because, when the engine
suddenly got quiet, he forgot to switch to another, non-empty, tank).
The mantra about priorities in both normal and emergency situations is
"Aviate-Navigate-Communicate", meaning that maintaining control of the
a/c always comes first, no matter what. Knowing where you are and where
you are going (and other pertinent situational awareness, such as the
condition of the a/c and the current plan of action) comes second.
Talking is the lowest priority.
Pre-planned emergency checklists may be a good idea for network
operators. Try the obvious (obvious when you're calm, that is) actions
first; if they fail to help, try to limit the damage. Only then go file
the ticket and talk to people who can investigate the situation in depth
and develop a fix.
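A rough sketch of what such a checklist might look like for a network
operator (entirely hypothetical steps, just to show the priority
ordering: cheap, obvious checks first, then damage limitation, and only
then the ticket and the in-depth investigation):

# Hypothetical emergency runbook sketch. Steps and phase names are
# illustrative only; the point is the ordering.

EMERGENCY_RUNBOOK = [
    # "Aviate": keep the network flying -- cheap, obvious checks first.
    ("check", "Is the device reachable out-of-band (console/OOB)?"),
    ("check", "Did a link, power supply, or BGP session just flap?"),
    ("check", "Was anything changed in the last window? If so, roll it back."),
    # "Navigate": limit the damage and understand the blast radius.
    ("mitigate", "Drain traffic from the affected device or shut the offending session."),
    ("mitigate", "Confirm the failure is contained (no cascading alarms)."),
    # "Communicate": only now file the ticket and pull in the specialists.
    ("escalate", "Open a ticket with timeline, symptoms, and actions already taken."),
    ("escalate", "Hand off to engineering for root cause and a permanent fix."),
]

def walk_runbook(runbook=EMERGENCY_RUNBOOK):
    """Walk the checklist in order, waiting for the operator to acknowledge each step."""
    for phase, action in runbook:
        input(f"[{phase:>8}] {action}  (press Enter when done) ")
    print("Runbook complete; continue the in-depth investigation.")

if __name__ == "__main__":
    walk_runbook()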
The way the aviation industry comes up with these checklists is,
basically, experience - it pays to debrief after recovery from every
problem not adequately fixed by existing procedures, find the common
ones, and develop a diagnostic procedure one could follow step-by-step
for those situations. (The non-punitive error and incident reporting
which actually shields pilots from FAA enforcement actions in most cases
also helps to collect real-world information on where and how pilots get
into trouble.)
The all-too-common multistep ticket escalation chains (which merely work
as delay lines in a significant portion of cases) are something to be
avoided.
Even better is to provide front-line personnel with some drilling in
diagnosing and recovering from common problems - starting with following
the checklist on a simulated outage in the lab, and then getting it down
to what pilots call "the flow" - a habitual, memorized procedure which is
performed first and then checked against the checklist.
Note that the use of checklists, drilling, and flows does not turn
pilots into a kind of robot - they still have to make decisions and
recognize and deal with situations not covered by the standard
procedures; what it does is speed up dealing with common tasks, reduce
mistakes, and free up mental capacity for thinking ahead.
The ISP industry has a long way to go until it reaches the same level of
sophistication in handling problems as aviation has.
--vadim
Well, to counter this one might look at the medical profession (doctors), which hasn't been able to embrace checklists at all (apart from in a few places); they still consider their profession to be a craft, just like most network engineers do.
It's the classic "good/fast/cheap, pick two". Aviation is slow and careful about bringing in new tech, and the same goes for health care; both are very conservative. We in the network business are still immature but quick and flexible; as time goes on, our services become more and more important, and so things settle in and slow down, but become more reliable. This is an evolution that'll take quite some time, but it has already changed a lot in the past 10 years.
There was quite a buzz regarding doctor checklists a few years back; I read several articles about it but now can't find the one I want. <http://www.healthbeatblog.org/2007/12/pilots-use-chec.html> talks a bit about the topic.
It seems that there's a logical fallacy floating around somewhere
(networks have parts and are complicated, airplanes and flight involve
lots of parts and are also complicated, therefore aircraft are like
networks). I assert that comparing 'packet switching' to an industry
that has its roots in the late 1800's and had its first "hello world"
moment in 1903 isn't terribly fruitful.
Further, aircraft are the asymptotic limit of 'singly homed transit.'
Because of this, I think one could argue that pilots and ATC must be
held to a different professional standard due to the nature of public
trust at risk. At the other end of our strawman spectrum, we have end
users who must accept the risk that their provider will be unable to
connect them to lolcats.com on occasion, perhaps as often as 0.01% per
year, and most are happy to accept this. Four nines survivability on
flights, clearly, won't work.
What I'm getting at is that after following this thread for a while,
I'm not convinced any amount of process-borrowing is going to solve
problems better, faster, or even avoid them in the first place. At
best, our craft is 1/3rd as "old" (if that's somehow a measure of
maturity) as flight and nobody is being sued to settle 200+ accidental
deaths because of our mistakes.
-Tk
Whimsical deviations don't belong in the maint execution; they belong
in the brainstorming and design. Gather more points of view during
the peer review of the specification of work. In my experience, good
engineering makes for bad drama (and conversely, if it is a "dramatic
save" then you have a bad engineer and likely a cowboy). Have a plan
that executes in stages, tests at checkpoints where partial completion
is possible, and has a fallback for each step. It's a great way to train
up junior people, document as you go, and expose flaws and lines of
future investigation, and if things go south you escalate to those who
can judge *reasonable* new directions.
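To make that concrete, here's a hypothetical sketch of a plan of that
shape (stage names and tests are made up): each stage carries a
checkpoint test and its own fallback, and stopping at a checkpoint with
partial completion is an acceptable outcome rather than a failure.

# Hypothetical staged-plan sketch: each stage has a checkpoint test and
# a fallback; partial completion at a passed checkpoint is acceptable.

from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Stage:
    name: str
    work: Callable[[], None]          # the change itself
    checkpoint: Callable[[], bool]    # test that must pass before continuing
    fallback: Callable[[], None]      # undo just this stage if the test fails

def execute_plan(stages: List[Stage]) -> None:
    for stage in stages:
        print(f"Stage: {stage.name}")
        stage.work()
        if stage.checkpoint():
            print("  checkpoint passed; partial completion is safe here")
            continue
        print("  checkpoint failed; running this stage's fallback and stopping")
        stage.fallback()
        print("  escalate to someone who can judge a reasonable new direction")
        return
    print("All stages complete.")

if __name__ == "__main__":
    execute_plan([
        Stage("Stage new config on the standby device", lambda: None, lambda: True, lambda: None),
        Stage("Shift a fraction of traffic", lambda: None, lambda: True, lambda: None),
        Stage("Shift the remaining traffic", lambda: None, lambda: True, lambda: None),
    ])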
To me, that kind of change management for non-automatable work is a
descendant of reasonable group work. If you have project-oriented
autonomous teams that stick to the guideposts of "your standards" and
"minimal disruptions / maximal uptime", then good work will emerge. As
for automation, it enables your expensive humans to do more smart
things, so it should always be incorporated into processes and be
something people move toward, IMO.
Cheers,
Joe