We are looking for a new network monitoring system. Since there are so many operators on this list, I would like to know which NMS you use and why. Is there one that you really like, and others that you hate?
For free (open source) options, LibreNMS and NetXMS come highly recommended by many wireless ISPs on low budgets. However, I am not sure what commercial options are available or what their price points are.
For monitoring network device/interface data plane reachability with
ping, we are still using an ancient piece of open source software
called Autostatus. I find it invaluable for notifying us about
reachability issues with its simple-to-understand parent/child
relationships and graph-based fping methodology. It isn't perfect--it
doesn't scale very well, it doesn't have HA/clustering, it has no
fancy dependencies (just basic parent-child) and no event correlation,
no contact scheduling, no API, etc. but it is very easy to understand
why you are getting an alert or not and boiling that down to a single
point of failure and as such it provides reliable, trustable
information about data plane reachability from one vantage point on
For monitoring server & network service availability,
device/environmental health, etc. we are currently using Nagios. My
problems with it are that it has complex rules for how/when to perform
a specific health check and send or suppress a notification (and
perhaps bugs in our old version that never ever seems to send any Host
notifications except when it does) and the whole idea of "suppress the
Host check unless all Service checks for all services on the host are
down" doesn't really fit well with the idea of monitoring
device/interface reachability on routers & switches that make up a
complex graph of dependencies. Trying to shoehorn Nagios into
alerting on just the one IP address/device/interface that is causing
all the others behind it to be unreachable doesn't work very well.
You can't use Host Dependencies because Host checks are suppressed by
default, and Host Dependencies don't affect Service
Checks/notifications. Forcing Host checks to always run causes
performance problems. Creating a "Ping" service for every host
requires creating manual Service Dependencies between all the "Ping"
services on every Host. Then you end up with a complex configuration
that is very hard to understand. But for things like telling you when
a power supply or fan has died, or if the web service crashed, it
We did a survey of a bunch of open source tools to replace Nagios and
have settled on Icinga for its APIs, dynamic rules with pattern
matching and boolean logic, and compatibility with Nagios plugins.
But it still doesn't change the basic architectural choices of the
Nagios core engine and hence isn't a good fit for network
device/interface reachability monitoring IMO.
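To make the "Ping" service dependency burden described above concrete, here is a rough sketch that generates the Nagios servicedependency objects needed to chain "Ping" services along a parent/child map. The host names and the PARENTS map are made up for illustration, and the failure criteria are one plausible choice, not necessarily what was tried in the setup described.

```python
from typing import Dict

# Hypothetical child -> parent map; these host names are not from the thread.
PARENTS: Dict[str, str] = {
    "edge-sw1": "core-rtr1",
    "edge-sw2": "core-rtr1",
    "cpe-17": "edge-sw1",
}


def ping_dependencies(parents: Dict[str, str], service: str = "Ping") -> str:
    """Render one servicedependency object per child/parent pair."""
    blocks = []
    for child in sorted(parents):
        fields = [
            ("host_name", parents[child]),
            ("service_description", service),
            ("dependent_host_name", child),
            ("dependent_service_description", service),
            # Assumed criteria: never skip checks, but suppress notifications
            # while the parent's service is WARNING, UNKNOWN or CRITICAL.
            ("execution_failure_criteria", "n"),
            ("notification_failure_criteria", "w,u,c"),
        ]
        body = "\n".join(f"    {key:<32}{value}" for key, value in fields)
        blocks.append("define servicedependency {\n" + body + "\n}")
    return "\n\n".join(blocks)


if __name__ == "__main__":
    print(ping_dependencies(PARENTS))
```

Every link in the topology needs its own object, so the generated file grows with the number of links. That is one way to see why a hand-maintained version gets hard to follow.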
As a small operator, we mainly use Icinga for the reasons Chuck mentioned.
The API allows us to do updates based on configuration parameters we’ve created in a custom MySQL database.
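A minimal sketch of what that kind of sync could look like, assuming the Icinga 2 REST API on its default port 5665. The endpoint host name, the row layout (name, address, site), the "generic-host" template, and the custom variable are all assumptions for illustration; the actual schema and scripts are not shown in this thread. The function only builds the request, it does not send it.

```python
import json
from typing import Dict, Tuple
from urllib.parse import quote

# Hypothetical API endpoint; 5665 is the default Icinga 2 API port.
ICINGA_URL = "https://icinga.example.net:5665"


def build_host_request(row: Dict[str, str]) -> Tuple[str, str, str]:
    """Return (method, url, json_body) for creating one host object.

    `row` stands in for a record read from the custom database.
    """
    url = f"{ICINGA_URL}/v1/objects/hosts/{quote(row['name'], safe='')}"
    body = {
        "templates": ["generic-host"],  # assumed template name
        "attrs": {
            "address": row["address"],
            "check_command": "hostalive",
            "vars.site": row["site"],  # assumed custom variable
        },
    }
    return "PUT", url, json.dumps(body, sort_keys=True)


if __name__ == "__main__":
    example = {"name": "core-sw1", "address": "192.0.2.10", "site": "pop1"}
    for part in build_host_request(example):
        print(part)
```

Sending the request would additionally need an API user's credentials (HTTP basic auth) and an "Accept: application/json" header.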
Here’s a snapshot of what tends to work for me, along with my $0.02 of thoughts:
- Observium handles polling, graphing and alerting for SNMP exposed objects on network devices,
- I feel that a visual representation of the physical network topology is extremely helpful for many aspects of day-to-day operations, so InterMapper handles that,
- Syslog and SNMPTRAP collection, correlation and alerting is handled by Splunk,
- Netflow collection and graphing is handled by nfsen,
- Smokeping for what smokeping does (but I just discovered vaping this morning, which looks awesome and will get some love).
I believe that LibreNMS has at least some capability to use more robust graphing engines, which for me would be great; I find rrd a little limiting these days. I think it also has (better?) support for weathermap, so I could technically replace InterMapper with weathermap and collapse the tool chain a bit.
With streaming telemetry becoming more of a thing, there will definitely be a shift away from SNMP for things that are polled for statistics.
There are interesting Netflow tools like Elastiflow and pmacct that are more robust than nfsen. The latter has a ton of functionality that can produce some interesting data for purposes of traffic engineering, among other things. The former uses ELK so it’s inherently gorgeous and fast, but it requires a ton of resources depending on the number of flows/sec that you’re collecting.
Hope that helps.
Take a look at opennms.org
Scales very well. Lots of API hooks for integration with other data
sources and applications.
It is open source and they offer paid support services, one-time (e.g.
setup and training) or on-going support contracts.
I run OpenNMS currently, and the one problem I have is its very peculiar – one might say academic – terminology and structure. It’s not a point-and-click interface, despite being web-based. Instead, you must wrangle with pollers and responders and notifiers. Eventually I got my head around it, but it’s still pretty painful to use.
I run a mix of Cacti, Intermapper, and PRTG. PRTG is a very nice commercial product that offers a free version supporting up to 100 sensors.
Part 2 (see Part 1 for my epistles on Autostatus & Nagios).
To complement Autostatus and Nagios and to replace our ancient Cricket
SNMP graphing/trending solution, several years ago we had adopted
We've now replaced that with AKiPS, which I highly recommend. It does
your basic 1 minute SNMP graphing, but it also collects SNMP Traps &
Syslog feeds and can alert on custom matches & events as well as host
down via ping. Its main feature is its comprehensive vendor MIB
support--it supports almost every vendor's device we use out of the
box with no special configuration. They are constantly adding support
for new vendors/devices and they are pretty responsive to adding new
ones. AKiPS' weakness is in alerting--it makes no attempt at
dependencies or event correlation, so you can get flooded with events.
I still use a tool I wrote in perl nearly 20 years ago called
"MrPing." MrPing handles multi-dependency graphs.
A is reachable via either B or C.
If A and B are down but C is up, A being down is a separate failure
from B being down. I need to know about both.
If B and C are both down, A is unreachable. I don't want to receive
alerts about A because they'll distract me from the root cause of the
problem: that both B and C are down. The NMS should record that A is
unreachable but it should also tell me that A being unreachable is a
dependent failure that I can ignore until I fix the failures it
The NMSes I've paid attention to either don't support dependencies
well at all or support only simple hierarchical dependencies.
Resilient, professional networks simply aren't built that way.
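The A/B/C rule above can be sketched in a few lines. This is not MrPing's code, just an illustration under simple assumptions: every node is pinged from one vantage point, `upstreams` lists the neighbors a node can be reached through (an empty list means it is attached directly to the monitoring station), and a down node is treated as a dependent failure only when all of its upstreams are down too. The names `TOPOLOGY` and `classify_failures` are made up for this example.

```python
from typing import Dict, List, Set

# A is reachable via either B or C; B and C attach directly to the monitor.
TOPOLOGY: Dict[str, List[str]] = {"A": ["B", "C"], "B": [], "C": []}


def classify_failures(
    upstreams: Dict[str, List[str]], down: Set[str]
) -> Dict[str, str]:
    """Label each down node 'root' (alert on it) or 'dependent' (record only)."""
    result: Dict[str, str] = {}
    for node in sorted(down):
        paths = upstreams.get(node, [])
        if paths and all(path in down for path in paths):
            # Every way to reach this node has failed, so its own state is
            # unknown; the upstream failures are the ones to fix first.
            result[node] = "dependent"
        else:
            # At least one path is still up (or the node has no upstream),
            # so the node itself has failed.
            result[node] = "root"
    return result


if __name__ == "__main__":
    print(classify_failures(TOPOLOGY, {"A", "B"}))
    print(classify_failures(TOPOLOGY, {"A", "B", "C"}))
```

With A and B down but C up, both A and B come back as root failures. With all three down, only B and C are root failures and A is recorded as dependent.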
I think anybody looking for a be-all-end-all solution will find nothing but heartburn.
different suites have different strong suits, and deciding you are going to pursue one and ignore all others may mean living without a feature or set of features you may find really useful or eventually necessary. but maintaining multiple complete NMSes isn’t really tenable either.
all of that said, we use a combination of a couple. Nagios/Icinga because it’s been around forever (both in the world and in our network), and the power of script based checks, being able to write your own handlers and pretty much just leverage it as a framework you can shove questions into and get regular answers from is invaluable.
LibreNMS gives us the best pretty pictures, letting us monitor much much more than just interface traffic, out of the box. much more than cacti is capable of without a ton of work (i.e. down to the tx/rx power and temperature readings of individual SFPs). it scales relatively well; at least in theory. i will be able to tell you for sure later this year as we are near the limits of what we can monitor with a single polling device. alerting out of Libre into Slack has proven quite fantastic. we can spawn threads attached to anything from a BGP peer dropping or a CPU alert as we move to triage and solve, even if we are in the field or meetings or whatever.
we also still have cacti around for random one-offs. as great as Libre is, its poller can be a bit intense for some devices; so in those cases it’s safer for us to just have cacti graph the one or two OIDs we need specifically, without trolling all the other available sensors.
we ran OpenNMS for a bit, but it proved way too dumb to maintain a large (and growing) complex network without dedicating at least one or two people to the care and feeding of it.
Consider also open-source FlowViewer for netflow capture and analysis. A lot of very useful netflow based analytical tools in an easy UI. Sits on top of a robust set of Carnegie-Mellon's high-capacity SiLK netflow tools.
Regarding netflow/sflow/ipfix monitoring, we recently started using Elastiflow by Robert Cowart. It scales very well and has pretty visualizations. I cannot imagine what the paid/supported version has to offer.
seconded. the pains of maintaining ELK are made worthwhile by this alone.
Being a small business, we like to use mostly free and open source tools. Our network monitoring stack presently looks like:
Simple Reachability Monitoring (Ping) - uptimerobot.com
Just $4.5 per month for 50 monitors with 1-minute intervals (free if you are fine with 5-minute monitoring intervals). This is connected to our Slack channel and also sends an SMS when something goes offline.
Traffic & Device Monitoring - LibreNMS
A fork of Observium that adds the much-needed alerting feature that Observium only offers with its paid plans. We use it to monitor switch port traffic, BGP sessions, device health, etc.
Packet Inspection or Flow Monitoring - FastNetMon (https://fastnetmon.com/features/). The free edition is good enough for our needs.
As open source tools go, Smokeping is a great tool to add to your NMS arsenal:
LibreNMS + Weathermap for graphs, real-time, and alerting. Vaping for a simple Up/Degraded/Down dashboard (great replacement for Multiping/PingPlotter on a TV). Elastiflow for netflow.
I really really want to like OpenNMS, and would love to use it daily; I feel like it could handle many integrations well, but have never had the time to dedicate to fully diving into it. I have used it in the past for small setups (monitoring ~100 remotely managed routers/firewall) and it did well, after getting past some learning curves. I keep coming back to it every 6 months or so and trying the latest version.
Having come full circle through a number of network monitoring packages and dealt with their pros and cons, we ended up using Check_mk, moreover OMD… http://omdisto.org
We found OMD to be a very powerful combination of different packages: each can shine for its own strengths while the others complement it where it is weak!
There are many other threads on this topic as well. I can say +1 for check_mk though.
We run Iris - home-grown (South Africa), great support, small/nimble team that is able to fix issues, add features and give advice. Very flexible, captures plenty of data out-the-box, supports a ton of vendors and data points, etc. It’s a commercial solution, but not out of reach. Heck, even I can afford it :-). We moved from a Cacti/SmokePing/Observium/Zabbix combo to Iris 2 years ago. Much happiness. Mark.
+1 for Iris.
We've been with them for a couple of years now, and the support has been first class - quick incident response, fast fixes, and very approachable regarding feature requests.
They are based out of Cape Town, South Africa, but also have a US presence in the DC area.