I have to agree that this is all good information.
Your question on ITIL: my personal opinion is that ITIL best practices are great to apply in any environment. They make sense, particularly for change control.
However, as stated, it's also highly dependent on how many devices are being managed/monitored. I come from a NOC managing 8600+ network devices across 190+ countries.
Strict change management policies, windows, and approvers, all depending on the times relative to operations in the different countries.
We were growing so rapidly that we kept purchasing companies and bringing over their infrastructure, each time inheriting new ticket systems, etc.
NNM is by far one of my favorite choices for network monitoring. The issue with it is really the views and getting them organized in an easily viewable fashion.
RT is a great ticketing tool for specific needs. It allows for approvers and approval tracking of tickets. However, it isn't extremely robust.
I would recommend something like HP ServiceCenter since it can integrate and automate the alert output directly to tickets. This also allows the capability to use Alarmpoint for automated paging of your on-calls based on their schedules, by device, etc.
Not to say that I'm a complete HP fanboy, but I will say that it works extremely well. It's easy to use, and simplicity is the key to fewer mistakes.
Our equipment was 99% Cisco, so the combination worked extremely well.
Turnover: I firmly believe shift changes should be handed off verbally. Build a template for the day's top items or most critical issues. List the ongoing issues and any tickets being carried over, with their status. Allot 15 minutes for the team to sit down with the printout and review it.
Contracts/SLAs:
We placed all of our systems under a blanket 99.999% uptime critical SLA. However, this was a mistake on our part, caused by a lack of time to plan well while adapting to an ever-changing environment.
It would be best to set up your appliances/hardware in your ticket system and monitoring tool based on the SLA you intend to apply to each. Also ensure you include all hardware information: supply vendor, support vendor, support coverage, ETR from the vendor, and replacement time.
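As a rough sketch of what I mean, here's one record per monitored device capturing the SLA tier and vendor/support fields above. The field names and example hosts are my own invention, not from any particular ticketing product:

```python
from dataclasses import dataclass

@dataclass
class DeviceRecord:
    hostname: str
    sla_uptime_pct: float        # tier the device actually needs, e.g. 99.9 vs 99.999
    supply_vendor: str
    support_vendor: str
    support_coverage: str        # e.g. "24x7x4" or "8x5xNBD"
    vendor_etr_hours: float      # estimated time to repair quoted by the vendor
    replacement_time_hours: float

# A core gateway and a lab switch should not share one blanket SLA:
core = DeviceRecord("nycgw01", 99.999, "Cisco", "Cisco TAC", "24x7x4", 4.0, 2.0)
lab = DeviceRecord("labsw03", 99.0, "Cisco", "Internal", "8x5xNBD", 24.0, 48.0)
```

Keying monitoring thresholds and ticket priorities off a record like this is what avoids the blanket-SLA mistake we made.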
There are many tools that do automated discovery on your network and monitor changes to it. This is key if you have a changing environment. The more devices you have, the more difficult it is to pinpoint what a failed router or switch ACTUALLY affects upstream or downstream.
If this is your chance, take the opportunity to map your hardware/software dependencies. If a switch fails and it provides service to, for example, db01, and db01 drives a service in another location, then you should know that the failure reaches that far. It's far too common for companies to get so large that they have no idea what the impact of one port failure in xyz does to the entire infrastructure.
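That dependency map can be as simple as a graph you walk on failure. A minimal sketch, with the switch/db01 example from above (all names invented):

```python
from collections import deque

# Edges point from a piece of hardware to what depends on it.
deps = {
    "swA":    ["db01"],    # switch swA provides service to db01
    "db01":   ["app-eu"],  # db01 drives a service in another location
    "app-eu": [],
}

def impact(failed, graph):
    """Return everything affected, directly or indirectly, by a failure."""
    seen, queue = set(), deque([failed])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return seen

# One port failure on swA actually takes out db01 *and* app-eu:
impact("swA", deps)
```

Even a hand-maintained version of this beats finding out the blast radius during the outage.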
Next: build your monitoring infrastructure completely separate from the rest of the network. If you don't do switch redundancy (active/passive) on all of your systems, or NIC teaming (active/passive), then ensure you do it at least on your monitoring systems.
Build your logging out in a PCI/SOX fashion. Ensure you have remote logging on everything, with log retention based on your needs. Run Tripwire, with approved reports sent weekly for the systems requiring PCI/SOX monitoring.
Remember, if your monitoring systems go down, your NOC is blind. It's highly recommended that the NOC have gateway/jump box systems available to all parts of the network. Run the management network entirely on RFC 1918 address space for security.
Ensure all on-calls have access; use a VPN solution that requires a password plus a VPN key generator. Utilize TACACS/LDAP as much as you can. Tighten everything. Log everything. I can't say that enough.
Enforce password changes every 89 days, require strong, non-dictionary passwords, etc.
Build an internal site, use a wiki-based format, and allow the team the ability to add/modify with approval. Build a FAQ/knowledgebase. Possibly create a forum so your team can post extra tips, notes, and one-offs. Anything that may help new members, or people who run across something in the middle of the night they've never seen. This keeps you from waking your lead staff in the middle of the night.
On-calls: Always have a primary/secondary with a clear on-call procedure 'documented'.
Example (critical):
1. Issue occurs
2. Page on-call within 10 minutes
3. Allow 10 minutes for a return call
4. Page again
5. Allow 5 minutes
6. Page secondary
Etc.
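The steps above can be sketched as a simple escalation loop. Here page() and wait_for_callback() are stand-ins for whatever paging tool you actually use (Alarmpoint, an internal site, etc.), not any real API:

```python
def escalate(issue, primary, secondary, page, wait_for_callback):
    """Walk the documented on-call procedure; return whoever answered."""
    page(primary, issue)               # 2. page on-call within 10 minutes
    if wait_for_callback(minutes=10):  # 3. allow 10 minutes for a return call
        return primary
    page(primary, issue)               # 4. page again
    if wait_for_callback(minutes=5):   # 5. allow 5 minutes
        return primary
    page(secondary, issue)             # 6. page secondary
    if wait_for_callback(minutes=10):
        return secondary
    return None                        # no answer -- keep escalating up the chain
```

The point isn't the code, it's that the timings and order are written down somewhere a tool (or a half-asleep operator) can follow mechanically.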
Ensure the staff document every step they take and copy/paste every page they send into the ticket system.
Build templated paging formats. Understand that text messages on most carriers have hard length limits. Use something like:
Time InitialsofNOCPerson SystemAlerting Error CallbackNumber
(e.g. 14:05 KH nycgw01 System reports down 555-555-5555 xt103)
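A minimal sketch of that template, assuming a 160-character single-SMS limit (check your carriers' real limits):

```python
SMS_LIMIT = 160  # common single-SMS cap; actual carrier limits vary

def format_page(time, initials, system, error, callback):
    """Time InitialsofNOCPerson SystemAlerting Error CallbackNumber"""
    msg = f"{time} {initials} {system} {error} {callback}"
    if len(msg) > SMS_LIMIT:
        # Trim the error text, never the callback number, to fit one message.
        keep = len(error) - (len(msg) - SMS_LIMIT)
        msg = f"{time} {initials} {system} {error[:max(keep, 0)]} {callback}"
    return msg

format_page("14:05", "KH", "nycgw01", "System reports down", "555-555-5555 xt103")
```

Trimming from the error text is a deliberate choice: the callback number is the one field the on-call can't do without.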
Use a paging internal website/software or as mentioned, something like Alarmpoint.
There is nothing more frustrating for an on-call than to be paged and have no idea whom to call back, who paged them, or what the number is.
I've written so much my fingers hurt from these Blackberry keys. Hope this information helps a little.
Best of luck,
-Kevin
Excuse the spelling/punctuation... This is from my mobile.