I have to agree that this is all good information.
Your question on ITIL: my personal opinion is that ITIL best practices are great to apply in any environment. They make sense, particularly for change control.
However, as stated, it's also highly dependent on how many devices are being managed/monitored. I come from a NOC managing 8600+ network devices across 190+ countries.
Strict change management policies, windows, and approvers, all depending on the times relative to operations in the different countries.
We were growing so rapidly that we kept purchasing companies and bringing over their infrastructure, each time inheriting new ticket systems, etc.
NNM is by far one of my favorite choices for network monitoring. The issue with it is really the views and getting them organized in an easily viewable fashion.
RT is a great ticketing tool for specific needs. It allows for approvers and approval tracking of tickets. However, it isn't extremely robust.
I would recommend something like HP ServiceCenter since it can integrate and automate the alert output directly to tickets. This also allows the capability to use Alarmpoint for automated paging of your on-calls based on their schedules, by device, etc.
Not to say that I'm a complete HP fanboy, but I will say that it works extremely well. It's easy to use, and simplicity is the key to fewer mistakes.
Our equipment was 99% Cisco, so the combination worked extremely well.
Turnover: I firmly believe shift changes should be handed off verbally. Build a template for the day's top items or most critical issues. List the ongoing issues and any tickets being carried over, with their status. Allot 15 minutes for the team to sit down with the printout and review it.
Contracts/SLAs:
We placed all of our systems under a blanket 99.999% uptime critical SLA. However, this was a mistake on our part, caused by a lack of time to plan well while adapting to an ever-changing environment.
It would be best to set up your appliances/hardware in your ticket system and monitoring tool based on the SLA you intend to apply to each. Also ensure you include all hardware information: supply vendor, support vendor, support coverage, ETR from the vendor, and replacement time.
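As a rough sketch of what I mean, here's one record per monitored device capturing the SLA tier and vendor/support fields above. The field names and example hosts are my own invention, not from any particular ticketing product:

```python
from dataclasses import dataclass

@dataclass
class DeviceRecord:
    hostname: str
    sla_uptime_pct: float        # tier the device actually needs, e.g. 99.9 vs 99.999
    supply_vendor: str
    support_vendor: str
    support_coverage: str        # e.g. "24x7x4" or "8x5xNBD"
    vendor_etr_hours: float      # estimated time to repair quoted by the vendor
    replacement_time_hours: float

# A core gateway and a lab switch should not share one blanket SLA:
core = DeviceRecord("nycgw01", 99.999, "Cisco", "Cisco TAC", "24x7x4", 4.0, 2.0)
lab = DeviceRecord("labsw03", 99.0, "Cisco", "Internal", "8x5xNBD", 24.0, 48.0)
```

Keying monitoring thresholds and ticket priorities off a record like this is what avoids the blanket-SLA mistake we made.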
There are many tools that do automated discovery on your network and monitor changes to it. This is key if you have a changing environment. The more devices you have, the more difficult it is to pinpoint what a failed router or switch ACTUALLY affects upstream or downstream.
If this is your chance, take the opportunity to map your hardware/software dependencies. If a switch fails and it provides service to, for example, db01, and db01 drives a service in another location, then you should know that the failure reaches that far. It's far too common for companies to get so large that they have no idea what the impact of one port failure in xyz does to the entire infrastructure.
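That dependency map can be as simple as a graph you walk on failure. A minimal sketch, with the switch/db01 example from above (all names invented):

```python
from collections import deque

# Edges point from a piece of hardware to what depends on it.
deps = {
    "swA":    ["db01"],    # switch swA provides service to db01
    "db01":   ["app-eu"],  # db01 drives a service in another location
    "app-eu": [],
}

def impact(failed, graph):
    """Return everything affected, directly or indirectly, by a failure."""
    seen, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# One port failure on swA actually takes out db01 *and* app-eu:
impact("swA", deps)
```

Even a hand-maintained version of this beats finding out the blast radius during the outage.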
Next: build your monitoring infrastructure completely separate from the rest of the network. If you don't do switch redundancy (active/passive) on all of your systems, or NIC teaming (active/passive), then ensure you do it at least on your monitoring systems.
Build your logging out in a PCI/SOX fashion. Ensure you have remote logging on everything, with log retention based on your needs. Run Tripwire, with approved reports sent weekly for the systems requiring PCI/SOX monitoring.
Remember, if your monitoring systems go down, your NOC is blind. It's highly recommended that the NOC have gateway/jump box systems available to all parts of the network. Run the management network entirely on RFC 1918 address space for security.
Ensure all on-calls have access; use a VPN solution that requires a password plus a VPN key generator. Utilize TACACS/LDAP as much as you can. Tighten everything. Log everything. I can't say that enough.
Enforce password changes every 89 days, require strong, non-dictionary passwords, etc.
Build an internal site, use a wiki-based format, and allow the team the ability to add/modify with approval. Build a FAQ/knowledgebase. Possibly create a forum so your team can post extra tips, notes, and one-offs. Anything that may help new members, or people who run across something in the middle of the night they've never seen. This keeps you from waking your lead staff in the middle of the night.
On-calls: Always have a primary/secondary with a clear on-call procedure 'documented'.
Example (critical):
1. Issue occurs
2. Page on-call within 10 minutes
3. Allow 10 minutes for a return call
4. Page again
5. Allow 5 minutes
6. Page secondary
Etc.
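The steps above can be sketched as a simple escalation loop. Here page() and wait_for_callback() are stand-ins for whatever paging tool you actually use (Alarmpoint, an internal site, etc.), not any real API:

```python
def escalate(issue, primary, secondary, page, wait_for_callback):
    """Walk the documented on-call procedure; return whoever answered."""
    page(primary, issue)               # 2. page on-call within 10 minutes
    if wait_for_callback(minutes=10):  # 3. allow 10 minutes for a return call
        return primary
    page(primary, issue)               # 4. page again
    if wait_for_callback(minutes=5):   # 5. allow 5 minutes
        return primary
    page(secondary, issue)             # 6. page secondary
    if wait_for_callback(minutes=10):
        return secondary
    return None                        # no answer -- keep escalating up the chain
```

The point isn't the code, it's that the timings and order are written down somewhere a tool (or a half-asleep operator) can follow mechanically.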
Ensure the staff document every step they take and copy/paste every page they send into the ticket system.
Build templated paging formats. Understand that text messages on most carriers have hard length limits. Use something like:
Time InitialsofNOCPerson SystemAlerting Error CallbackNumber
(e.g. 14:05 KH nycgw01 System reports down 555-555-5555 xt103)
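A minimal sketch of that template, assuming a 160-character single-SMS limit (check your carriers' real limits):

```python
SMS_LIMIT = 160  # common single-SMS cap; actual carrier limits vary

def format_page(time, initials, system, error, callback):
    """Time InitialsofNOCPerson SystemAlerting Error CallbackNumber"""
    msg = f"{time} {initials} {system} {error} {callback}"
    if len(msg) > SMS_LIMIT:
        # Trim the error text, never the callback number, to fit one message.
        keep = len(error) - (len(msg) - SMS_LIMIT)
        msg = f"{time} {initials} {system} {error[:max(keep, 0)]} {callback}"
    return msg

format_page("14:05", "KH", "nycgw01", "System reports down", "555-555-5555 xt103")
```

Trimming from the error text is a deliberate choice: the callback number is the one field the on-call can't do without.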
Use a paging internal website/software or as mentioned, something like Alarmpoint.
There is nothing more frustrating for an on-call than to be paged and have no idea whom to call back, who paged them, or what the number is.
I've written so much my fingers hurt from these Blackberry keys. Hope this information helps a little.
Best of luck,
-Kevin
Excuse the spelling/punctuation... This is from my mobile.