Monitoring system recommendation

Dear Nanog community

We are currently planning to upgrade our monitoring system (Opsview) due to
scalability issues and I was wondering what you would recommend for monitoring
5000 hosts and 35000 services. We would like to use a monitoring system
that is compatible with the nagios plugin format, however we are not sure
if systems like Icinga/Shinken/Op5 are the way to go.

Is someone using systems like Op5 or Icinga2 for monitoring > 5000 hosts?
Would you recommend commercial systems like Sevone, Zabbix, etc instead of
open source ones?

Your input is really appreciated.

Thank you and have a great day

Regards

While not being completely drop-in compatible with Nagios plugins, Xymon
(Big Brother clone) is up to the task of monitoring this many
hosts/services.

Here's a page with a list of businesses who are publicly reporting their
use of Xymon and the number of hosts/services they're monitoring.
ServiceNow is the biggest I've seen with 569,869 hosts and 740,185
status messages (different service checks being reported back in). It's
really hard to find tools that can scale that large, but with the load
distributed to a few Xymon proxies which are reporting to your
centralized instance it will scale as large as you want.

https://en.wikibooks.org/wiki/System_Monitoring_with_Xymon/User_Guide/The_Xymon_Users_list#ServiceNow.2C_Inc.

I've used it for years and greatly prefer it to everything else due to
its simplicity and config format. I find nagios's config format
extremely tedious.

As for Nagios plugins: Nagios derives a plugin's status from its exit
code: 0 = OK (green), 1 = warning (yellow), 2 = critical (red), if I
recall correctly. If you just modify the plugin to execute a Xymon command
as the last step and report the color instead of the exit code, it should
work fine. There was a tool called "xynagios" that automatically made
nagios plugins work without modification but I haven't tried to use it
and don't know if it's still out there.
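
A minimal sketch of that wrapper idea, in Python. It runs a plugin, maps
its exit code to a Xymon color, and builds the text of a Xymon "status"
message. The color used for exit code 3 (UNKNOWN), the fallback for
unexpected codes, the helper names, and the host/test names are all
assumptions for illustration. Actually sending the message (for example
with the xymon client command) is left out.

```python
import subprocess
import sys

# Nagios plugin exit codes mapped to Xymon colors.
# 0/1/2 follow the description above; "clear" for 3 (UNKNOWN) is an assumption.
EXIT_TO_COLOR = {0: "green", 1: "yellow", 2: "red", 3: "clear"}


def run_plugin(argv):
    """Run a plugin and return (exit_code, first line of its output)."""
    proc = subprocess.run(argv, capture_output=True, text=True)
    lines = proc.stdout.strip().splitlines()
    return proc.returncode, (lines[0] if lines else "")


def xymon_status(host, test, exit_code, text):
    """Build a Xymon status message body for one plugin result."""
    # Unexpected exit codes are treated as red (assumption).
    color = EXIT_TO_COLOR.get(exit_code, "red")
    # As I understand the Xymon convention, dots in the hostname are
    # written as commas in status messages.
    return f"status {host.replace('.', ',')}.{test} {color} {text}"


if __name__ == "__main__":
    # Stand-in for a real plugin: prints one line and exits with code 1.
    fake_plugin = [
        sys.executable,
        "-c",
        "print('DISK WARNING - 85% used'); raise SystemExit(1)",
    ]
    code, text = run_plugin(fake_plugin)
    print(xymon_status("www.example.com", "disk", code, text))
```

The wrapper leaves the plugin itself untouched, which avoids maintaining
patched copies of every plugin.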

There are two things you might want to be aware of with Xymon: the
monitoring data is not encrypted on the wire; it's up to you to handle
that at the moment if you feel it is necessary. It also does not support
IPv6. There was a huge rewrite in progress for years to handle both of
these but it stalled out. Recently it has picked up a lot of development
steam and they're scrapping the major rewrite and back porting the
important things. I believe Xymon 4.4 will at least have the encrypted
transport.

> 5000 hosts and 35000 services. We would like to use a monitoring
> system that is compatible with the nagios plugin format, however we
> are not sure if systems like Icinga/Shinken/Op5 are the way to go.

At that kind of scale, you need to take a serious look at moving away from the Nagios plugin model. Any model based primarily on forking
external processes is going to hold you back.

> Is someone using systems like Op5 or Icinga2 for monitoring > 5000
> hosts?

This kind of scale is easily achieved with OpenNMS with appropriate hardware and planning, largely because it does everything in-process. We do offer limited support for NRPE as a transitional mechanism. Disclosure: I get paid to work with OpenNMS in a consulting capacity.

> Would you recommend commercial systems like Sevone, Zabbix, etc
> instead of open source ones?

Zabbix is open source. I know some of their team and would recommend putting them on your list.

I also know a number of brilliant people who work for SevOne, but I don't know much about their product.

-jeff

> We are currently planning to upgrade our monitoring system (Opsview) due
> to scalability issues and I was wondering what you would recommend for
> monitoring 5000 hosts and 35000 services. We would like to use a
> monitoring system that [...]

Another consideration is check_mk. We use it in our shop. The check_mk people wrapped a bunch of Python around the Nagios notification engine. You no longer need to worry about the tedium of Nagios config files; those are all built automatically from commands in a GUI or from a single configuration file.

Check_mk has a benchmarking page which scales to more hosts than you specified:
https://mathias-kettner.de/checkmk_checkmk_benchmarks.html

For an architecture diagram of how they use nagios for alerting, and python for scanning:
http://mathias-kettner.com/check_mk.html

If an included agent isn't available, new ones can be written.

We are quite happy with the solution. We've replaced cricket, cacti, nagios, observium, and a little bit of smokeping with this almost all-in-one tool.

Although I haven't ever scaled it that high, I've had a lot of luck using
Gearman (mod_gearman) to make Nagios horizontally scalable.

It allows you to use Nagios itself only as a scheduler and reporting UI,
and offload all of the actual probing to other servers. There'll be a
theoretical limit to the amount of scale you can get out of that due to
relying on a single Nagios instance to schedule checks and receive reports
of success, but I imagine it's much higher than your current requirements.

I once worked for Zenoss and still suggest them. Zenoss supports Nagios
plugins, and my $DAYJOB is at a Zenoss Partner who can help you achieve
your goals. If you need some help with Zenoss feel free to contact me off
list.

Andrew

A few things to note, as I prefer Zabbix over Nagios (it is backed by a real database and has more functionality):
- Zabbix actually is open source. You can buy support from them or from partners if you want.
- Zabbix can be distributed through a central server/proxy architecture to scale.
- Nagios plugins can be adapted for Zabbix, as the latter only needs a numerical value (no status or text).
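
One way to get such a numerical value out of an existing plugin is to read
the performance data that many Nagios plugins print after the "|"
character (label=value[UOM];warn;crit;min;max). The sketch below is only
an illustration of that idea: the function name is hypothetical, the
parser is deliberately simple, and how the number is then handed to Zabbix
is left out.

```python
import re

# One perfdata item: label=value[UOM]; the ;warn;crit;min;max tail is ignored.
_PERF_ITEM = re.compile(
    r"(?P<label>'[^']+'|[^\s=]+)=(?P<value>-?\d+(?:\.\d+)?)(?P<uom>[^\s;]*)"
)


def perfdata_values(plugin_output):
    """Return {label: float} for the perfdata part of one plugin output line."""
    if "|" not in plugin_output:
        return {}  # plugin printed no performance data
    perf = plugin_output.split("|", 1)[1]
    values = {}
    for match in _PERF_ITEM.finditer(perf):
        label = match.group("label").strip("'")
        values[label] = float(match.group("value"))
    return values


if __name__ == "__main__":
    line = "PING OK - rta 0.5ms | rta=0.512ms;100;500;0 pl=0%;20;60;0"
    print(perfdata_values(line))
```

Plugins that print no performance data would need a different approach,
such as mapping the exit code to a number.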

> Dear Nanog community
>
> We are currently planning to upgrade our monitoring system (Opsview) due to
> scalability issues and I was wondering what you would recommend for monitoring
> 5000 hosts and 35000 services. We would like to use a monitoring system
> that is compatible with the nagios plugin format, however we are not sure
> if systems like Icinga/Shinken/Op5 are the way to go.
>
> Is someone using systems like Op5 or Icinga2 for monitoring > 5000 hosts?
> Would you recommend commercial systems like Sevone, Zabbix, etc instead of
> open source ones?

We (op5) have customers running > 50,000 hosts and > 300,000 services. So
5,000 hosts is generally not a problem.

As mentioned by Jeff, the forking model *can* become a problem. Small
binaries that don't load a lot of libraries fork pretty fast. A test we
made some time ago showed a 15 minute load peak at 3.89 (on 24
cores/hyperthreads) when checking 100,000 services every 5 minutes. Check
latencies were 0.8 seconds max and 0.002 seconds avg. Average cpu load was
15%.

Specs for the machine used:
Dell PowerEdge R620
2x Intel Xeon E5-2620
24 GB ram
Dell PERC H710 hardware RAID card
RAID10 on 4x300GB 15kRPM SAS drives

So a single (now almost vintage) server can handle 300 plugin executions
per second without breaking a sweat. Scaling up is definitely a
possibility, but scaling out (using mod gearman, mk or merlin, all open
source) is available as well.
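
For reference, the arithmetic behind that rate can be sketched as below.
The 5-minute interval for the original poster's 35,000 services is an
assumption; the thread does not state their check interval.

```python
def checks_per_second(services, interval_seconds):
    """Average plugin executions per second if every service runs once per interval."""
    return services / interval_seconds


if __name__ == "__main__":
    benchmark = checks_per_second(100_000, 5 * 60)  # the test described above
    requested = checks_per_second(35_000, 5 * 60)   # assumes a 5-minute interval
    print(round(benchmark, 1), round(requested, 1))  # prints: 333.3 116.7
```

Under that assumption the requested load is roughly a third of what the
benchmark machine sustained.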

Complex plugins, for example check_vmware_api which loads the large VMware
perl SDK can get you in trouble though. I suggest you run a test with the
plugin mix you are planning to use.
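
A rough sketch of such a test, in Python: it runs each command a few times
and reports wall-clock time per execution, which includes process startup
and library loading. The two commands shown are placeholders (a trivial
interpreter start and one that imports a few modules), not real plugins;
substitute your own plugin command lines.

```python
import statistics
import subprocess
import sys
import time


def time_plugin(argv, runs=5):
    """Run one command `runs` times; return (mean_seconds, max_seconds)."""
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        subprocess.run(argv, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), max(durations)


if __name__ == "__main__":
    # Placeholder commands; replace with the plugins you plan to run.
    plugin_mix = {
        "trivial": [sys.executable, "-c", "pass"],
        "heavy-imports": [sys.executable, "-c", "import json, decimal, xml.dom.minidom"],
    }
    for name, argv in plugin_mix.items():
        mean_s, max_s = time_plugin(argv)
        print(f"{name}: mean {mean_s * 1000:.1f} ms, max {max_s * 1000:.1f} ms")
```

This runs the commands one at a time, so it only shows per-execution cost;
it does not reproduce the concurrency of a real scheduler.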

If scaling out is not an option, and you want to stay in the nagios/naemon
world, a custom worker can be developed to get rid of the loading overhead.
Documentation is available at
http://www.naemon.org/documentation/developer/workers.html

Full disclosure: I work as development team lead at op5

best regards
Mikael Falkvidd

We use Zabbix here pretty heavily, monitoring roughly 10,000 hosts, 13,000 interfaces, and a myriad of services.

-Brent

I'm not at that scale, but I've seen some fairly impressive performance searching through a friend's NetXMS system with a couple years of verbose syslog and monitoring to go through.

> Dear Nanog community
>
> [...snipped...]
>
> Your input is really appreciated.
>
> Thank you and have a great day
>
> Regards

I have not used OpenNMS in production. Does it work well under heavy load?

regards,
J

Yes, but it depends on the hardware. They support some pretty huge environments.
You have to have "enough" IOPS to keep up with the polling, DB and RRD data.
Then there will never be a "heavy" load...

I would contact them and, based on your needs, ask them what hardware you will need for your implementation.
You can get real world info from the mailing list: https://sourceforge.net/p/opennms/mailman/
I would suggest the opennms-discuss list.