HE.net, Fremont-2 outage?

Alex Rubenstein wrote:
>> Yup. Related: "100% availability" is a marketing person's dream; it
>> sounds good in theory but is unattainable in practice, and is a
>> reliable sign of non-100%-reliability.
>
> You are confusing two different things.
>
> Availability != Reliability.

Pardon the interruption...

In the aforementioned statement, there appears an intense/flagrant
compartmentalization/separation of terms without sufficient
explanation.

Correct. It's even a bit more interesting than that; there's an
implication that marketing people will not really know the difference
and, having heard repeatedly about "high availability", may proceed to
use "availability" as a buzzword... I guess I was a bit more oblique
than intended.

Note that in being available, 'a' criterion to ensure
reliability is met. If one has the desire to delve into some of the
nuanced operational perspective, see: http://ow.ly/zmQg (pdf) or
http://ow.ly/zmTB (web friendly). The article is also available
through the IEEE Portal at Services Update (if one of the other links
appears to be unavailable, anytime).

I doubt marketing people will care. :-)

> For instance, an airplane is designed to be 100% reliable, but much less available. To keep a 747 from crashing (100% reliability) it needs significant downtime (not 100% available).

This explanation, aside from being unsatisfactory, is misleading.
Operating times and maintenance times are very much separate quantities.

And airplanes aren't 100% reliable regardless...

For a power system as a whole, though, one could see 100% availability
as a prereq for 100% reliability. Of course, you more closely approach
100% through redundancies... oops, should we introduce another term to
debate? :-)
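
To put rough numbers on the availability and redundancy point, here is a minimal sketch. It assumes the usual steady-state definition of availability, MTBF / (MTBF + MTTR), and it assumes that redundant units fail independently and that any one surviving unit can carry the load. The function names and the sample UPS figures are made up for illustration; they do not describe anyone's actual power system.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: fraction of time a unit is up."""
    return mtbf_hours / (mtbf_hours + mttr_hours)


def redundant_availability(unit_availability: float, units: int) -> float:
    """Availability of N parallel units.

    Assumes independent failures and that a single surviving unit is
    enough, so the system is down only when every unit is down.
    """
    return 1.0 - (1.0 - unit_availability) ** units


# Hypothetical UPS: fails about once a year, takes 8 hours to repair.
single = availability(mtbf_hours=8760.0, mttr_hours=8.0)

for n in (1, 2, 3):
    print(f"{n} unit(s): {redundant_availability(single, n):.9f}")
```

Each added unit moves the figure closer to 1.0, but under these assumptions it never gets there, and the independence assumption is exactly what a dropped wrench or a shared busbar tends to violate.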

>> And even for those who follow best practices... You can inspect and
>> maintain things until you're blue in the face. One day a contractor
>> will drop a wrench into a PDU or UPS or whatever and spectacular things
>> will happen.
>
> That's where policies, procedures and methods come in (read: SAS70)

For the operationally minded -- on one hand, there is an assumption here
that 'accidents' are not preventable;

You cannot eliminate accidents. Accidents represent things which are by
definition unforeseen and unplanned. Accidents may be reducible through
the use of good planning and practices. On one hand, one can foresee a
risk in resting a wrench near some energized busbars while needing one's
hands to do something else; you can define good practices that forbid
this sort of thing. Even that may not completely eliminate the practice;
there are plenty of examples of companies having good policies that are
disregarded by employees in the field. On the other hand, when Bruno is
moving a construction excavator around next door, suffers a heart attack,
and floors the controls such that the excavator rams your building and
the boom arm penetrates your wall and shoves a guy face-first into the
busbars, well, obviously we're talking extremely unlikely (I hope it's
obvious I'm even trying to be a bit ridiculous), but that's an Accident.
And they happen.

on the other hand, there is at
least an assumption being made here that SAS 70 is the curative for
'accidents.' To be brief, accounting for human behavior as an
underlying contributor to accidents can be a backbreaking and immensely
messy endeavor. In this respect, SAS 70 can only be assistive.

Correct. We can only hope to reduce accidents.

My original point was simply that I prefer people who recognize 100% as a
desirable-but-unobtainable goal.

... JG

Regarding Reliability and Availability:

1. Reliability and Availability are related, but not identical.
2. Systemic availability is, generally, the result of the combination of component
  reliability, component redundancy, policies, procedures, and discipline.
3. Policies, procedures, and discipline help to reduce and/or mitigate accidents.

In terms of accidents and human factors:

1. Accidents cannot be eliminated entirely, but, with proper procedures, policies, and
  discipline, most can be prevented.

2. Most accidents which cannot be eliminated can be mitigated, but, doing so
  often comes at a cost which exceeds the product of benefit and likelihood.
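
The cost comparison in point 2 can be sketched as a simple expected-value check. This is only an illustration: the function names, the dollar amounts, and the yearly likelihoods are all hypothetical, and real risk assessments involve far more than one multiplication.

```python
def expected_annual_loss(loss_per_event: float, events_per_year: float) -> float:
    """Product of benefit (loss avoided per event) and likelihood."""
    return loss_per_event * events_per_year


def mitigation_worthwhile(mitigation_cost_per_year: float,
                          loss_per_event: float,
                          events_per_year: float) -> bool:
    """True when the expected loss avoided exceeds what mitigation costs.

    Assumes, for simplicity, that the mitigation removes the risk entirely.
    """
    return expected_annual_loss(loss_per_event, events_per_year) > mitigation_cost_per_year


# Hypothetical figures: an excavator coming through the wall is very
# costly but very unlikely; a dropped wrench is cheaper but far likelier.
print(mitigation_worthwhile(50_000, 2_000_000, 0.00001))  # reinforced wall
print(mitigation_worthwhile(2_000, 250_000, 0.05))        # insulated busbar covers
```

With these made-up numbers the reinforced wall is not worth its cost while the busbar covers are, which is the kind of tradeoff point 2 describes.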

We could learn a lot about this from Aviation. Nowhere in human history has
more research, care, training, and discipline been applied to accident prevention,
mitigation, and analysis than in aviation. A few examples:

  NTSB investigations of EVERY US aircraft accident and published findings.
  NASA Aviation Safety Reporting System

  When NTSB finds a design flaw in an aircraft at fault for an accident there
  is a process by which that error gets translated into an Airworthiness Directive
  forcing aircraft owners to have the flaw corrected to continue operating the
  aircraft.

  When NTSB finds a training discrepancy, procedural problem, etc., there
  is a process by which those discrepancies are addressed through training,
  retraining, etc.

  For example, after a couple of accidents related to microbursts, NTSB and
  FAA determined that all pilots should undergo training on windshear and
  windshear avoidance, including microbursts.

  etc. (There are many more examples)

Owen

Owen,

I think if we conducted a poll, we would find that a disproportionate percentage of NANOG folks are also pilots (compared to the general population, anyway). I agree with you completely that aviation is a good model to follow if it is adapted where it makes sense.

All,
The real problem is the same human factors we have in aviation, which cause most accidents. Look at the list below and replace the word Pilot with Network Engineer or Support Tech or Programmer or whatever... and think about all the problems where something didn't work out right. It's because someone circumvented the rules, processes, and cross checks put in place to prevent the problem in the first place. Nothing can be made idiot proof because idiots are so creative.

-Robert
SEL/MEL Private Instrument

Listed here:

THE FIVE HAZARDOUS ATTITUDES
1. Anti-Authority:
"Don't tell me."
This attitude is found in people who do not like anyone telling them what to do. In a sense, they
are saying, "No one can tell me what to do." They may be resentful of having someone tell them
what to do, or may regard rules, regulations, and procedures as silly or unnecessary. However, it
is always your prerogative to question authority if you feel it is in error.

2. Impulsivity:
"Do it quickly."
This is the attitude of people who frequently feel the need to do something, anything, immediately.
They do not stop to think about what they are about to do; they do not select the best alternative,
and they do the first thing that comes to mind.

3. Invulnerability:
"It won't happen to me."
Many people feel that accidents happen to others, but never to them. They know accidents can
happen, and they know that anyone can be affected. They never really feel or believe that they will
be personally involved. Pilots who think this way are more likely to take chances and increase risk.

4. Macho:
"I can do it."
Pilots who are always trying to prove that they are better than anyone else are thinking, "I can do it
–I'll show them." Pilots with this type of attitude will try to prove themselves by taking risks in order
to impress others. While this pattern is thought to be a male characteristic, women are equally
susceptible.

5. Resignation:
"What's the use?"
Pilots who think, "What's the use?" do not see themselves as being able to make a great deal of
difference in what happens to them. When things go well, the pilot is apt to think that it is good luck.
When things go badly, the pilot may feel that someone is out to get me, or attribute it to bad luck.
The pilot will leave the action to others, for better or worse. Sometimes, such pilots will even go
along with unreasonable requests just to be a "nice guy."

Tellurian Networks - A Dell Perot Systems Company
http://www.tellurian.com | 888-TELLURIAN | 973-300-9211
"Well done is better than well said." - Benjamin Franklin

No, no commercial pilot ever flew overweight, or in weather below minimums,
or more than the max hours in a month.. never happens ;-) And there was never
a boss that 'pushed' them into it, for the sake of expediency or financial
gain, and the phrase.. 'Big Sky, Little Plane' was never uttered.. logbooks
never fudged and rules are always followed..

C(om)255379

Of course, all of those things have happened. However, if we started treating
networking errors more like the way we treat aviation errors, the reliability
of networking would improve dramatically. OTOH, if we did that, the cost
of networking would also probably gain a zero.

Owen
Commercial ASEL
Instrument Airplane

Owen DeLong wrote:

We could learn a lot about this from Aviation. Nowhere in human history has
more research, care, training, and discipline been applied to accident prevention,
mitigation, and analysis than in aviation. A few examples:

    NTSB investigations of EVERY US aircraft accident and published findings.

Ask any commercial pilot (and especially a commercial commuter flight pilot) what they think of NTSB investigations when the pilot had a "bad schedule" that doesn't allow enough time for adequate sleep. They will point out that lack of sleep can't be determined in an autopsy.

The NTSB routinely puts an accident down to "pilot error" even when pilots who regularly fly those routes and shifts are convinced that exhaustion (lack of sleep, long working days) was clearly involved. And for even worse news - the smaller the plane the more complicated it is to fly and the LESS rest the pilots receive in their overnight stays because commuter airlines are covered under part 135 while major airlines are covered under part 121. My ex flew turbo-prop planes for American Eagle (American Airlines commuter flights). It was common to have the pilot get off duty near 10 pm and be required to report back at 6 am. That's just 8 hours for rest. The "rest period" starts with a wait for a shuttle to the hotel, then the drive to the hotel (often 15 minutes or more from the airport) then check-in - it can add up to 30-45 minutes before the pilot is actually inside a hotel room. These overnight stays are in smaller towns like Santa Rosa, Fresno, Bakersfield, etc. Usually the pilots are put up at hotels that don't have a restaurant open this late, and no neighboring restaurants (even fast food) so the pilot doesn't get dinner. (There is no time for dinner in the flight schedule - they get at most 20 minutes of free time between arrival and take-off - enough time to get a bio-break and hit a vending machine but not enough time to actually get a meal.) Take a shower, get to bed at about 11:30. Set the alarm for 4:45 am and catch the shuttle back to the airport at 5:15 to get there before the 6:00 reporting time. In that "8 hour" rest period you get less than 6 hours of sleep - if you can fall asleep easily in a strange hotel.
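
For anyone who wants to check the arithmetic on that overnight, here is a small sketch using the times given above (off duty 10 pm, in bed 11:30 pm, alarm 4:45 am, report 6:00 am). The calendar date is arbitrary and the variable names are just for illustration.

```python
from datetime import datetime, timedelta

# Arbitrary date; only the clock times come from the schedule described above.
off_duty = datetime(2009, 1, 1, 22, 0)    # off duty near 10 pm
in_bed = datetime(2009, 1, 1, 23, 30)     # in bed at about 11:30 pm
alarm = datetime(2009, 1, 2, 4, 45)       # alarm at 4:45 am
report = datetime(2009, 1, 2, 6, 0)       # 6:00 am reporting time

rest_period = report - off_duty           # what the schedule counts as rest
sleep_window = alarm - in_bed             # best case: asleep the moment you lie down

print("scheduled rest:", rest_period)
print("possible sleep:", sleep_window)
print("rest not spent asleep:", rest_period - sleep_window)
```

Even in the best case, the possible sleep comes out well under the 8 hours the schedule counts as rest.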

Commuter route pilots have been fighting to get regulations changed to require longer overnight periods, and especially to get the required rest period changed to "behind the door" so that the airlines can't include the commute time to/from the airport in the "rest" period. This would force the airlines to select hotels closer to the airport or else allow longer overnight layovers - either way the pilots would get adequate rest. See:

http://asrs.arc.nasa.gov/publications/directline/dl5_one.htm

The NTSB does a great job with mechanical issues and with training issues, but they totally miss the boat when it comes to regulating adequate rest periods in the airline schedules.

To bring this back to NANOG territory, how many times have you or one of your network admins made a mistake when working with inadequate sleep - due to extra early start hours (needless 8 am meetings), or working long/late hours, or being called to work in the middle of the night?

Finally, having lived with a commercial aviation pilot for 5 years and having worked with network types for much longer, I can say that while there is some overlap between pilots and IT techs, there are also a LOT of people who go into computers (programming, network and system administration) who are totally unsuitable for the regimented environment required for commercial aviation - people who HATE following a lot of rules and regulations and fixed schedules. If you tried to impose FAA-type rules and regulations and airline schedules on an IT organization, you would have a revolt on your hands. Tread carefully when you consider emulating Aviation.

jc

Owen DeLong wrote:

We could learn a lot about this from Aviation. Nowhere in human history has
more research, care, training, and discipline been applied to accident prevention,
mitigation, and analysis than in aviation. A few examples:

   NTSB investigations of EVERY US aircraft accident and published findings.

Ask any commercial pilot (and especially a commercial commuter flight pilot) what they think of NTSB investigations when the pilot had a "bad schedule" that doesn't allow enough time for adequate sleep. They will point out that lack of sleep can't be determined in an autopsy.

As a point of information, I _AM_ a commercial pilot.

The NTSB routinely puts an accident down to "pilot error" even when pilots who regularly fly those routes and shifts are convinced that exhaustion (lack of sleep, long working days) was clearly involved. And for even worse news - the smaller the plane the more complicated it is to fly and the LESS rest the pilots receive in their overnight stays because commuter airlines are covered under part 135 while major airlines are covered under part 121. My ex flew turbo-prop planes for American Eagle (American Airlines commuter flights). It was common to have the pilot get off duty near 10 pm and be required to report back at 6 am. That's just 8 hours for rest. The "rest period" starts with a wait for a shuttle to the hotel, then the drive to the hotel (often 15 minutes or more from the airport) then check-in - it can add up to 30-45 minutes before the pilot is actually inside a hotel room. These overnight stays are in smaller towns like Santa Rosa, Fresno, Bakersfield, etc. Usually the pilots are put up at hotels that don't have a restaurant open this late, and no neighboring restaurants (even fast food) so the pilot doesn't get dinner. (There is no time for dinner in the flight schedule - they get at most 20 minutes of free time between arrival and take-off - enough time to get a bio-break and hit a vending machine but not enough time to actually get a meal.) Take a shower, get to bed at about 11:30. Set the alarm for 4:45 am and catch the shuttle back to the airport at 5:15 to get there before the 6:00 reporting time. In that "8 hour" rest period you get less than 6 hours of sleep - if you can fall asleep easily in a strange hotel.

Flying in such a state of exhaustion is, whether you like it or not, a form of pilot error.

A pilot who chooses to fly on such a schedule is making an error in judgment. Sure, there are
all kinds of pressures and employment issues that need to be resolved to reduce and eliminate
that pressure, and, I support the idea of updating the crew duty time regulations with that
in mind.

That does not change the fact that FAR 91.3 still applies:

Sec. 91.3

Responsibility and authority of the pilot in command.

(a) The pilot in command of an aircraft is directly responsible for, and is the final authority as to, the operation of that aircraft.
(b) In an in-flight emergency requiring immediate action, the pilot in command may deviate from any rule of this part to the extent required to meet that emergency.
(c) Each pilot in command who deviates from a rule under paragraph (b) of this section shall, upon the request of the Administrator, send a written report of that deviation to the Administrator.

A failure to declare him/herself to be incapable of safely completing the flight is a failure to meet
the requirements of 91.3(a).

Commuter route pilots have been fighting to get regulations changed to require longer overnight periods, and especially to get the required rest period changed to "behind the door" so that the airlines can't include the commute time to/from the airport in the "rest" period. This would force the airlines to select hotels closer to the airport or else allow longer overnight layovers - either way the pilots would get adequate rest. See:

One More Leg: The Commuter Pilot's Conundrum

And that would be a good change.

In part, that change is supported by the number of times that the NTSB has made statements
such as:

We find the probable cause of the accident was pilot error. We believe that fatigue was likely
a factor in the accident.

The NTSB does a great job with mechanical issues and with training issues, but they totally miss the boat when it comes to regulating adequate rest periods in the airline schedules.

No, you miss the boat on the relationship between the stakeholders.

The NTSB has repeatedly commented on the need for better regulations and better studies
of crew duty time requirements and fatigue as a factor in accidents and incidents.

However, the NTSB CANNOT change regulations. They investigate accidents and make
recommendations to the regulatory agencies. The FAA needs to be the one to change the
regulations. The FAA has not done a particularly good job in addressing this topic, whereas
they have done a better job in improving mechanical and training issues and have been
more likely to follow up on NTSB recommendations in these areas. In part, that is the
result of reduced pushback on the FAA in these areas from industry. After all, Boeing does
NOT want to publicly say "We think that this mechanical factor the NTSB just determined
as the cause of 400 fatalities isn't really an issue and the FAA should not issue an AD
to make us correct it."

On the other hand, it's much harder for the kind of public feedback loop that exists in
the above statement to apply to crew fatigue issues.

In any case, this has drifted well off the NANOG topic, and, I would be happy to discuss
the NTSB, FAA, etc. with you off-list if you wish.

To bring this back to NANOG territory, how many times have you or one of your network admins made a mistake when working with inadequate sleep - due to extra early start hours (needless 8 am meetings), or working long/late hours, or being called to work in the middle of the night?

Sure, this happens, but, it's not the only thing that happens.

Finally, having lived with a commercial aviation pilot for 5 years and having worked with network types for much longer, I can say that while there is some overlap between pilots and IT techs, there are also a LOT of people who go into computers (programming, network and system administration) who are totally unsuitable for the regimented environment required for commercial aviation - people who HATE following a lot of rules and regulations and fixed schedules. If you tried to impose FAA-type rules and regulations and airline schedules on an IT organization, you would have a revolt on your hands. Tread carefully when you consider emulating Aviation.

That's very true. I wasn't advocating that we should emulate aviation, so much as I was attempting
to point out that if you want to reduce accidents/incidents, there is a proven model for doing so
and that it comes at a cost. Today, we actually seem, and in my opinion, rightly so, to prefer
to live with the existing situation. However, given that is the choice we are making, we should
realize that is the choice we have made and accept the tradeoffs or make a different choice.

Owen

Owen DeLong wrote:

Owen DeLong wrote:

We could learn a lot about this from Aviation. Nowhere in human history has
more research, care, training, and discipline been applied to accident prevention,
mitigation, and analysis than in aviation. A few examples:

   NTSB investigations of EVERY US aircraft accident and published findings.

Ask any commercial pilot (and especially a commercial commuter flight pilot) what they think of NTSB investigations when the pilot had a "bad schedule" that doesn't allow enough time for adequate sleep. They will point out that lack of sleep can't be determined in an autopsy.

As a point of information, I _AM_ a commercial pilot.

There are commercial pilots who fly for a living, and there are those who have the certification but who don't fly for a living. Do you regularly fly for a commercial airline where your schedule is determined by the airline's needs, part 135 or part 121 rules, union rules, etc. with no ability to modify your work schedule to allow for adequate rest?

The NTSB routinely puts an accident down to "pilot error" even when pilots who regularly fly those routes and shifts are convinced that exhaustion (lack of sleep, long working days) was clearly involved. And for even worse news - the smaller the plane the more complicated it is to fly and the LESS rest the pilots receive in their overnight stays because commuter airlines are covered under part 135 while major airlines are covered under part 121. My ex flew turbo-prop planes for American Eagle (American Airlines commuter flights). It was common to have the pilot get off duty near 10 pm and be required to report back at 6 am. That's just 8 hours for rest. The "rest period" starts with a wait for a shuttle to the hotel, then the drive to the hotel (often 15 minutes or more from the airport) then check-in - it can add up to 30-45 minutes before the pilot is actually inside a hotel room. These overnight stays are in smaller towns like Santa Rosa, Fresno, Bakersfield, etc. Usually the pilots are put up at hotels that don't have a restaurant open this late, and no neighboring restaurants (even fast food) so the pilot doesn't get dinner. (There is no time for dinner in the flight schedule - they get at most 20 minutes of free time between arrival and take-off - enough time to get a bio-break and hit a vending machine but not enough time to actually get a meal.) Take a shower, get to bed at about 11:30. Set the alarm for 4:45 am and catch the shuttle back to the airport at 5:15 to get there before the 6:00 reporting time. In that "8 hour" rest period you get less than 6 hours of sleep - if you can fall asleep easily in a strange hotel.

Flying in such a state of exhaustion is, whether you like it or not, a form of pilot error.

There is no other effective option. Almost all the commuter airline schedules have these short overnights, and it's impossible for most pilots to avoid being scheduled to fly them. If you bid for these schedules you are expected to fly them. You can't just decide at 11:30 pm that you need more than 5 hours' rest and that you won't be getting up at 4:30 am to get to the airport by your 6:00 am report time, or decide when your alarm wakes you at 4:30 that you are too tired and are going to get another 2 hours' sleep, or decide at 7 pm that you are too exhausted from flying this schedule for 2 days and are not going to fly your last leg. If you do this *even once* you will get in very hot water with the company and if you do it repeatedly you will ultimately lose your job. They aren't going to change the schedule because it's "legal" under part 135.

A pilot who chooses to fly on such a schedule is making an error in judgment. Sure, there are
all kinds of pressures and employment issues that need to be resolved to reduce and eliminate
that pressure,

Right now there is no way to avoid putting your job in jeopardy by refusing to fly these unsafe schedules.

and, I support the idea of updating the crew duty time regulations with that
in mind.

That does not change the fact that FAR 91.3 still applies:

The airlines don't care. They draw up these unsafe schedules and expect pilots to magically be capable of flying them safely. If there's an accident it goes down as pilot error, but if you try to claim exhaustion and refuse to fly citing 91.3 on a repeated basis you WILL be fired. Catch 22.

Sounds a lot like working in IT with clueless management, doesn't it?

To bring this back to NANOG territory, how many times have you or one of your network admins made a mistake when working with inadequate sleep - due to extra early start hours (needless 8 am meetings), or working long/late hours, or being called to work in the middle of the night?

Sure, this happens, but, it's not the only thing that happens.

Finally, having lived with a commercial aviation pilot for 5 years and having worked with network types for much longer, I can say that while there is some overlap between pilots and IT techs, there are also a LOT of people who go into computers (programming, network and system administration) who are totally unsuitable for the regimented environment required for commercial aviation - people who HATE following a lot of rules and regulations and fixed schedules. If you tried to impose FAA-type rules and regulations and airline schedules on an IT organization, you would have a revolt on your hands. Tread carefully when you consider emulating Aviation.

That's very true. I wasn't advocating that we should emulate aviation, so much as I was attempting
to point out that if you want to reduce accidents/incidents, there is a proven model for doing so
and that it comes at a cost.

Agreed.

Today, we actually seem, and in my opinion, rightly so, to prefer
to live with the existing situation. However, given that is the choice we are making, we should
realize that is the choice we have made and accept the tradeoffs or make a different choice.

Fast (big/powerful), cheap, good - pick any two. :-)

jc

Owen,

We could learn a lot about this from Aviation. Nowhere in human history has
more research, care, training, and discipline been applied to accident
prevention, mitigation, and analysis than in aviation. A few examples:

Others later in this thread duly noted a definite relationship of
costs associated, which are clearly "worth it" given the particular
application of these methods [snipped]. However, I assert this is
warranted because of the specific public trust that commercial
aviation must be given. Additionally, this form of professional or
industry "standard" isn't unique in the world; you can find (albeit
small) parallels in most states' PE certification tracks and the like.

In the case of the big-I internet, I assert we can't (yet)
successfully argue that it's deserving of similar public trust. In
short, I'm arguing that big-I internet deserves special-pleading
status in these sorts of "instrument -> record -> improve" strawmen
and that we shouldn't apply similar concepts or regulation.

(Robert B. then responded):

All,
The real problem is the same human factors we have in aviation which cause most
accidents. Look at the list below and replace the word Pilot with Network
Engineer or Support Tech or Programmer or whatever... and think about all
the problems where something didn't work out right. It's because someone
circumvented the rules, processes, and cross checks put in place to prevent
the problem in the first place. Nothing can be made idiot proof because
idiots are so creative.

I'd like to suggest we also swap "bug" for "software defect" or
"hardware defect" - perhaps if operators started talking about
problems like engineers, we'd get more global buy-in for a
process-based solution.

I certainly like the idea of improving the state of affairs where
possible - especially the operator->device direction (i.e.,
fat-fingering an ACL, prefix list, community list, etc.). When people make
mistakes, it seems very wise to accurately record the entrance
criteria, the results of their actions, and ways to avoid it - then
share that with all operators (like at NANOG meetings!). The part I don't
like is being ultimately responsible for, or having to "design around", a
class of systemic problems which are entirely outside of an operator's
sphere of control.

What curve must we shift to get routers with hardware and software
that's all of a) fast, b) reliable, and c) cheap -- in the hopes that the
only problems left to solve indeed are human ones?

-Tk

Anton Kapela wrote:

What curve must we shift to get routers with hardware and software
that's all of a) fast, b) reliable, and c) cheap -- in the hopes that the
only problems left to solve indeed are human ones?

Fast, Reliable, Cheap - pick any two. No, you can't have all three.

The fastest (best) and most reliable *anything* can't be the cheapest one because someone will quickly seize the market opportunity to make one that is lower quality (slower) or less reliable and sell it for a lower price.

jc