outages, quality monitoring, trouble tickets, etc

......... Sean Donelan is rumored to have said:

] Customer service should be of interest to operations folks, at least
] to the extent the problems are getting reported to the right people to fix.

  It certainly is here.

] I doubt I can change anyone's mind that providing explanations to
] customers and non-customers when the network has problems is good for
] business.

  I agree with you that it is important.

] In the future I will simply recommend to customers to buy
] services from NSPs which do provide explanations when their networks
] fail. Since I haven't found a perfect network yet, I suspect it
] includes everyone on this list.

rope to the other guy. It just doesn't work, there's nothing in
the system to encourage it.

  What I mean by saying this is NOT that I don't think a per-NSP
  trouble reporting mechanism is a good idea. What I'm saying is
  that within our Internet arrangement today, I don't see that it's
  terribly capitalistically useful for NSP-A to advertise internal
  problems to NSP-B. There is no doubt in my mind that it IS
  terribly useful for NSP-A to advertise internal problems to
  NSP-A's customers, as well as to NSP-B if they inquire on behalf
  of NSP-B's customers wrt an outage internal to NSP-A. You're
  right the migration of customers is a good metric, but it's hard
  to quantify that migration wrt trouble reporting to management.

  A friend at MFS brings up a good point: the COREN
  agreement stipulated a trouble reporting list.

  Perhaps we could work to develop a scalable model of such for
  world wide Internet use, or adapt that to this.

  Any other suggestions?

] If you aren't providing the level of service I need, I'll go to someone who
] can. If XYZ's NOC gives me better service than ABC's NOC, I'll
] recommend XYZ to my customers.

  Adam Smith's rules _will_ follow us into the Internet. Agreed.

] > Sprint. I certainly see no reason why I should do this work for
] > you.
]
] Because it is in their self-interest? You are correct I can't make
] anyone run their network how I would like it run, not even MIDNET (GI).
]
] But I can point out long-term problems and code of silence is costing such
] providers money, and has already cost them customers.

  It's not a code of silence. That's my point, that being that
  historically when we are asked about problems we give darn good
  answers. That we don't directly advertise problem attention or
  resolution is not correlative to our response to requests.

  Should we provide darned good answers? - YES
  Should we provide automated Darned Good Answers to our customers?
                 - YES, it would be nice
                 but not a NEED,
                 rather a nifty
                 service (IMHO)

  Should we provide automated Darned Good Answers to other NSPs?
                 - YES, it would be nice
                 but not a NEED,
                 rather a nifty
                 service and
                 lower priority
                 than #2.

] I might call BARRNET because the University of California-Davis has
] reported problems reaching DRA to DRA's help desk, and the problem hasn't
] been resolved. No, BARRNET doesn't *have* to talk to me. And I will
] report the same back to the customer. However, I suspect it is in
] BARRNET's self-interest to work with me in resolving the problem
] to ensure UC-Davis has end-to-end reliability.

  I agree it is too. However, when I hear people complaining about bad
  NOCs, I think it is important to point out that there is no
  mechanism in place to hold those other NSPs accountable as the
  person complaining is rarely the customer of the NSP. Yes it's in
  our long term interest, but that doesn't mean there's something in
  place to encourage it other than honest intention.

] I track network reliability by dollars (not packet loss, not latency).
] I measure network providers, good and bad, by how many of our customers
] have used their own dollars to buy private lines to St. Louis because
] they couldn't get the reliability they needed from the network provider.

  Ouch.

] As I said before: Ideally I want a reliable network. If you can't
] provide a perfectly reliable network I want an explanation when I
] can't get through. And I want the problem fixed. The better the
] explanation, the longer I'm willing to give you to fix the problem. If
] I get no explanation, I expect the problem to already be fixed.

  This is a good point, and I have been more convinced that it is
  important.

  Because of this discussion I am going to work to develop an
  automated WWW status page.

] The current situation is the customer gets neither the explanation nor
] action solving the problem.

  I appreciate that NSP response is not always ideal. However, I
  would encourage all people who get a less than exceptional
  response from a NOC technician to escalate the question so as to
  improve the NOC quality. No, this isn't something you should have
  to do, and it's not something that makes anyone terribly proud but
  it does tend to improve the service by natural tech selection.

] Since the technicians seem to be having a very difficult time fixing
] the network, I thought upper management could meet my other goal. Give
] the customer an explanation.

  This is done when they ask, and due to your and others' concern, I
  am going to work to develop an automated web page showing down
  time problems.

] The Internet is a global cooperative network. If people don't cooperate,
] the global nature of the network fails.

  Agreed.

] Can't NSPs provide their customers an explanation at least as well as
] the US Post Office?

  Yes, it's possible, and due to this discussion, I am going to work
  to build one as nice as FedEx's.... Anyone want to volunteer
  joint development? :-)

  -alan

  Should we provide automated Darned Good Answers to our customers?
                 - YES, it would be nice
                 but not a NEED,
                 rather a nifty
                 service (IMHO)

Automated answers would be great...but what about implementation? "Press
1 for an automated status report...<click>" Keeping customer service
staff well-informed (perhaps via an internal automated system) might be a
better solution.

  Should we provide automated Darned Good Answers to other NSPs?
                 - YES, it would be nice
                 but not a NEED,
                 rather a nifty
                 service and
                 lower priority
                 than #2.

I'm afraid I have to disagree...in a network of the level of complexity
of today's Internet (in fact, in any system where communication between
two points is dependent on more than just an "upstream" entity),
connectivity issues are MORE likely to be caused by interaction with
other NSP's. Dissemination of problem information between providers
helps everyone diagnose difficulties and keep their customers better
informed with respect to current status and predictions for the near
future (solutions).

A mailing list for this purpose seems like overkill...if dozens of NSP's
were to be informed every time JoeNet has a problem, even if their
service were not to be affected, the noise overload would reduce the
informative value of the list, as well as provider attention to it. But
how to determine when a problem is important enough to be distributed?
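One possible answer to the "important enough" question, sketched in Python under loud assumptions: notices go only to providers whose service is affected, unless the severity crosses a broadcast threshold. The function name `recipients_for`, the numeric severity scale (1 = worst), and the `broadcast_at` parameter are all hypothetical and not something proposed in this thread.

```python
from typing import Iterable, List


def recipients_for(affected: Iterable[str], severity: int,
                   subscribers: Iterable[str], broadcast_at: int = 1) -> List[str]:
    """Return the providers that should be told about a problem.

    Assumed scale: severity 1 is the worst. Problems at or beyond the
    broadcast threshold go to every subscriber; everything else goes
    only to subscribers whose service is actually affected.
    """
    if severity <= broadcast_at:
        return sorted(set(subscribers))
    return sorted(set(subscribers) & set(affected))


if __name__ == "__main__":
    nsps = ["NSP-A", "NSP-B", "NSP-C"]
    # A minor JoeNet problem that only touches NSP-B stays off everyone else's desk.
    print(recipients_for({"NSP-B"}, 3, nsps))
    # A severity-1 event is broadcast to the whole list.
    print(recipients_for({"NSP-B"}, 1, nsps))
```

Deciding who is "affected" is the hard part and is left to the reporter here; the sketch only shows that the filtering itself is cheap once that judgment exists.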

A more interactive shared system (ticket-based?) makes more sense, but
may prove far more difficult to design. Problem classification, impact,
severity, and location are all issues here, as well as the problem of
associating such a record of a problem with its effects. That is, when
a provider "discovers" a problem, how are they to know if it has already
been "registered", and if so, how to reference the information associated
with it?
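To make the "already registered?" question concrete, here is a minimal Python sketch of a shared registry. It assumes that two reports describe the same problem when they agree on provider, location, and category. That matching rule, the class names, the ticket ID format, and the severity scale are all hypothetical illustrations, not a design anyone in this thread has proposed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class OutageKey:
    """Fields assumed to identify 'the same problem' across reports."""
    provider: str   # NSP where the fault is located
    location: str   # hypothetical free-form site or link label
    category: str   # hypothetical classification, e.g. "link-down"


@dataclass
class OutageRecord:
    ticket_id: str
    key: OutageKey
    severity: int   # assumed scale: 1 is the worst
    reports: List[Tuple[str, str]] = field(default_factory=list)


class OutageRegistry:
    """In-memory stand-in for the shared, ticket-based system."""

    def __init__(self) -> None:
        self._by_key: Dict[OutageKey, OutageRecord] = {}
        self._next = 1

    def lookup(self, key: OutageKey) -> Optional[OutageRecord]:
        """Answer 'has this already been registered?'."""
        return self._by_key.get(key)

    def report(self, key: OutageKey, severity: int,
               reporter: str, note: str) -> OutageRecord:
        """Attach a report to the existing ticket, or open a new one."""
        record = self._by_key.get(key)
        if record is None:
            record = OutageRecord(f"T{self._next:04d}", key, severity)
            self._next += 1
            self._by_key[key] = record
        else:
            # Keep the worst severity seen so far.
            record.severity = min(record.severity, severity)
        record.reports.append((reporter, note))
        return record


if __name__ == "__main__":
    registry = OutageRegistry()
    key = OutageKey("JoeNet", "DC-hub", "link-down")
    first = registry.report(key, 3, "NSP-A", "packet loss toward JoeNet")
    second = registry.report(key, 2, "NSP-B", "no route via JoeNet")
    print(first.ticket_id, second.ticket_id, first.severity)
```

Exact-match keys are the weak point: two NOCs will rarely describe a location or category identically, so a real system would need an agreed vocabulary or fuzzier matching.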

[need for explanations]

  This is a good point, and I have been more convinced that it is
  important.

  Because of this discussion I am going to work to develop an
  automated WWW status page.

Good response, but how sound is the choice of implementation? If there
is a problem with your network, there is no small chance that those most
interested in acquiring this information would not be able to reach your
server to do so.

] The current situation is the customer gets neither the explanation nor
] action solving the problem.

  I appreciate that NSP response is not always ideal. However, I
  would encourage all people who get a less than exceptional
  response from a NOC technician to escalate the question so as to
  improve the NOC quality. No, this isn't something you should have
  to do, and it's not something that makes anyone terribly proud but
  it does tend to improve the service by natural tech selection.

I hate to say it, but what may be needed here is standardization. NOC
operating procedure varies greatly between providers, and the proper
escalation, etc. of a problem may not be clear.

// Matt Zimmerman Chief of System Management NetRail, Inc.
// Work..........mdz@netrail.net | Play...gemini@alcor.netrail.net
// (703) 524-4800 [voice] (703) 524-4802 [data] (703) 534-5033 [fax]

connectivity issues are MORE likely to be caused by interaction with
other NSP's. Dissemination of problem information between providers
helps everyone diagnose difficulties and keep their customers better
informed with respect to current status and predictions for the near
future (solutions).

Agreed, but it has to be done in an "easy" manner. I'm sure that several
of the NSPs have concerns as to what this information will be used
for. Everyone likes to portray the image of having a 99.98%
uptime whenever possible, even though most folks realize that it just
plain isn't possible, at least today. This sort of leads into the
question of the various NOCs' integration with whatever central repository of
information we are shooting to provide. When provider X opens a ticket,
will it automatically be reflected in the 'central' database? I doubt
folks will go for that based on security alone. Or how about provider
X's NOC staff fire off an Email to incident-report@outages.com? How will
they be trained or reimbursed for their time spent on this service?

[..facts about how useless mailing lists are removed..]

A more interactive shared system (ticket-based?) makes more sense, but
may prove far more difficult to design. Problem classification, impact,
severity, and location are all issues here, as well as the problem of
associating such a record of a problem with its effects. That is, when
a provider "discovers" a problem, how are they to know if it has already
been "registered", and if so, how to reference the information associated
with it?

Such an idea is already being discussed in several smoke-filled rooms. :-)
Remedy/ARS has the ability to accept input for incident reports and
queries to its database via an Email form. One could write a Web page
containing the necessary parameters in a form, and then transpose that to an
Email sent to the AR system. Implementing such a system is really based
around cost issues, as the coding is relatively trivial. (CGIs come to mind)
(I used the above example because it's something we've done in the past
and I know works, there are probably others)
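As a rough illustration of the form-to-email step, the Python sketch below turns submitted form fields into a mail message using only the standard library. The `name: value` body layout, the field names, and the addresses are assumptions made for the example. The template that Remedy/ARS actually expects is not described in this thread and is not reproduced here.

```python
from email.message import EmailMessage


def form_to_email(fields: dict, sender: str, recipient: str) -> EmailMessage:
    """Build an incident-report email from submitted form fields.

    The body is one 'name: value' line per field, sorted by name.
    This layout is a placeholder, not the AR system's real template.
    """
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Incident report: {fields.get('summary', 'no summary')}"
    body = "\n".join(f"{name}: {value}" for name, value in sorted(fields.items()))
    msg.set_content(body)
    return msg


if __name__ == "__main__":
    # Hypothetical form submission; in a CGI script these values would
    # come from the posted form instead of a literal dict.
    form = {"summary": "T1 down", "provider": "JoeNet", "severity": "2"}
    message = form_to_email(form, "noc@example.net", "incident-report@example.net")
    print(message)
    # Sending is left out; smtplib.SMTP(...).send_message(message) is the usual route.
```

This keeps the coding as trivial as claimed; the cost is in agreeing on the field set and in whoever runs the receiving end.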

On the issue of connectivity -- agreed; some lonely site should not
be allowed to be the only host. However -- if connectivity between
certain NSPs also falls apart, you're equally screwed. Some sort of
distribution of the "centralized" source of information would be needed.
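From the reader's side, distribution could look like the Python sketch below: try each copy of the status information in turn and take the first one that answers. The mirror URLs and the `first_reachable` helper are hypothetical, and the fetch function is passed in so the logic can be exercised without a network. With `urllib.request.urlopen`, connection failures surface as `URLError`, which is a subclass of `OSError`, so the same `except` clause would cover a real fetcher.

```python
from typing import Callable, Iterable, Optional


def first_reachable(mirrors: Iterable[str],
                    fetch: Callable[[str], str]) -> Optional[str]:
    """Return the status text from the first mirror that responds.

    Mirrors are tried in order; any OSError (unreachable host,
    timeout, refused connection) moves on to the next one.
    Returns None if every mirror fails.
    """
    for url in mirrors:
        try:
            return fetch(url)
        except OSError:
            continue
    return None


if __name__ == "__main__":
    # Simulated network: the first copy sits behind the broken path.
    def fake_fetch(url: str) -> str:
        if "nsp-a" in url:
            raise OSError("no route to host")
        return f"status from {url}"

    copies = ["http://status.nsp-a.example/", "http://status.nsp-b.example/"]
    print(first_reachable(copies, fake_fetch))
```

This only covers reading; keeping the copies in sync across providers when the links between them are down is the harder half of the problem.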

I foresee the most difficult part of the process being (a) convincing all of
the associated Operations groups to share their outage information.
(b) Providing a simple mechanism for either the customer service or operations
staff to disseminate outage information to the "server" would be equally
challenging. If step (a) were to be overcome, I would assume that
writing a procedure to fit (b) would follow.

-jh-