SNMP - monitoring large number of devices

Hi all,

recently I have been tasked with an NMS project. The idea is to poll about
20 OIDs from 50k cable modems in less than 5 minutes (yes, I know that's
one million OIDs). Before you say check out some very professional and
expensive solutions, I would like to know whether there are any alternatives
like an open source "snmp framework". To be more descriptive, many of you
know how big the mess with snmp on cable modems is. You always first perform
an snmp walk in order to discover interfaces and then read the values for
those interfaces. As a cable modem can bundle more DS channels, one time you
can have one and another time you can have N+1 DS channels = interfaces. All
in all, I don't believe that there is something perfect out there when it
comes to tracking a huge number of cable modems, so I would like to know
whether there is any "snmp framework" that can be extended, and how you did
(or would) solve this problem.

Thank you.

Pavel,

AFAIK there are no frameworks that solve, or even come close to solving,
this problem. The first thing to learn is how to do asynchronous SNMP
calls that will let you send out requests without having to wait for the
response. How you do those will vary by the language that you're using.
Next, figure out how to scale the processing and persisting of the returns and
try and learn (without causing an impact) how many simultaneous SNMP
requests your CMTSs will deal with while at the same time handling their
normal load from customer traffic. BULKGETS are very handy, but they will
also cause problems because some platforms limit the max size of the SNMP
return.
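
To make the async idea concrete, here is a minimal Python sketch of fanning out requests while capping how many are in flight at once. It is only an illustration under assumptions: `fake_snmp_get` is a hypothetical stand-in that sleeps instead of talking SNMP, and `MAX_IN_FLIGHT = 200` is an arbitrary number, not a limit any CMTS is known to tolerate.

```python
import asyncio
import random

MAX_IN_FLIGHT = 200  # assumed cap; tune against what the CMTSs tolerate


async def fake_snmp_get(host, oids):
    """Hypothetical placeholder for a real async SNMP GET; sleeps, returns zeros."""
    await asyncio.sleep(random.uniform(0.001, 0.005))  # simulated round trip
    return {oid: 0 for oid in oids}


async def poll_all(hosts, oids, max_in_flight=MAX_IN_FLIGHT):
    """Poll every host, never keeping more than max_in_flight requests open."""
    limiter = asyncio.Semaphore(max_in_flight)

    async def poll_one(host):
        async with limiter:
            try:
                return host, await fake_snmp_get(host, oids)
            except Exception as exc:  # one dead modem must not stop the run
                return host, exc

    pairs = await asyncio.gather(*(poll_one(h) for h in hosts))
    return dict(pairs)


def run_demo(n_hosts=1000):
    """Poll n_hosts made-up addresses for one example OID (sysUpTime.0)."""
    hosts = [f"10.0.{i // 256}.{i % 256}" for i in range(n_hosts)]
    oids = ["1.3.6.1.2.1.1.3.0"]
    return asyncio.run(poll_all(hosts, oids))
```

Calling `run_demo(1000)` returns a dict keyed by host. The semaphore is the knob to turn down if the CMTSs start to suffer.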

From my experience, building your server-side programming to do 50,000
modems in <5 minutes is very doable, but unless you're dealing with more than
10 CMTSs you probably can't do it in production without impacting
performance. Cisco 10Ks still have absolute caps on the amount of SNMP
they will allow and other manufacturers and models do different things that
limit what you can do.

Scott Helms
Vice President of Technology
ZCorum
(678) 507-5000

We built our own system for this purpose and just spawn one process per device being polled. This seems to work out OK and many cores can make this work out. You can also just split the workload horizontally across multiple servers.
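
A rough sketch of the horizontal split, assuming devices are identified by a string such as a MAC address and that each server only needs to know which shard is its own. The function names are made up for illustration; this is not the system described above.

```python
import zlib
from collections import defaultdict


def shard_for(device_id, n_pollers):
    """Stable shard index: the same device always lands on the same poller."""
    return zlib.crc32(device_id.encode("utf-8")) % n_pollers


def partition(device_ids, n_pollers):
    """Group device ids by the poller that should own them."""
    shards = defaultdict(list)
    for dev in device_ids:
        shards[shard_for(dev, n_pollers)].append(dev)
    return shards
```

CRC32 is used here instead of Python's built-in `hash()` because string hashing is randomized per process by default, so different servers would disagree about who owns which modem.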

The challenges are as usual how to report from a dataset like that. Many systems exist to distribute workload across the servers.

Custom poller is really the way to go here IMHO. It’s not that hard but requires investment to build it and operate it.

- Jared

Zabbix is probably one of the more robust SNMP platforms. It is database-backed (Postgres, MySQL or Oracle) and scales pretty big. If this is more than a one-time event, you'll need some real horsepower and HDD space to keep all that data. It might be worth writing a custom Ruby/Python/Perl application on a platform that allows a clean fork just prior to the SNMP call. That lets each fork sit in the wait state for its own SNMP request without blocking the next one, and lets you manage the workload of the server by controlling the number of simultaneous forks.

I've done about 60,000 OID queries (over a few dozen devices) per 5
minutes using OpenNMS, which is Java based. At the scale you're looking at,
disk I/O would be a major performance issue (if using rrdtool). Google for
'Tuning RRD' for some tips that can make a significant difference.

recently I have been tasked with a NMS project. The idea is to poll about
20 OIDs from 50k cable modems in less than 5 minutes (yes, I know that's
one million OIDs). [...] You always first perform snmp
walk in order to discover interfaces and then read the values for those
interfaces.

Hi Pavel,

NMS is ever a trade off between collecting the data you need to manage
your system and minimizing the impact collection has on the equipment
being monitored.

You propose to spend somewhere between 2 Mbps and 10 Mbps on SNMP
queries alone 24/7/365 and then do something with the 10ish terabytes
per year of collected data.
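
For anyone who wants to check the arithmetic, here is a small Python sketch. The modem count, OID count and interval come from the original question. The per-OID byte figures (75 to 375 bytes on the wire, about 95 bytes stored per sample) are assumptions chosen so the numbers line up with the estimates above, not measurements.

```python
MODEMS = 50_000
OIDS_PER_MODEM = 20
INTERVAL_S = 5 * 60

oids_per_cycle = MODEMS * OIDS_PER_MODEM             # OIDs per 5-minute pass
oids_per_second = oids_per_cycle / INTERVAL_S        # sustained query rate
cycles_per_year = 365 * 24 * 3600 // INTERVAL_S      # 5-minute passes per year
samples_per_year = oids_per_cycle * cycles_per_year  # stored data points


def wire_mbps(bytes_per_oid):
    """Sustained bandwidth for an assumed request+response cost per OID."""
    return oids_per_second * bytes_per_oid * 8 / 1e6


def storage_tb_per_year(bytes_per_sample):
    """Yearly storage for an assumed on-disk cost per sample."""
    return samples_per_year * bytes_per_sample / 1e12


if __name__ == "__main__":
    print(f"{oids_per_second:.0f} OIDs/s")
    print(f"{wire_mbps(75):.1f} to {wire_mbps(375):.1f} Mbps")
    print(f"{storage_tb_per_year(95):.1f} TB/year")
```

Change the two byte assumptions to match your own packet captures and storage format; the rest follows from the question.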

This is less a network monitoring problem and more a big-data
map-reduce problem. Doable, although you'll likely have to break the
problem up in to a lot of parallel VMs. You'll need a software
developer with big data experience. So far as I know there is no
off-the-shelf software (commercial or open source) kitted out to
facilitate this specific activity. 95% of the software modules needed
to create such a thing are available as open source.

Frankly, it's probably also not a good idea. Lengthen your intervals,
don't collect every piece of information every interval and figure out
a way that you don't have to walk the oids every single time.
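
One possible way to avoid walking every time is to cache the discovered interface indexes and only re-walk when the cache is stale. The sketch below assumes a one-hour TTL and a caller-supplied `walk` function; both are illustrative choices, and `IndexCache` is a made-up name.

```python
import time


class IndexCache:
    """Remember each modem's discovered interface indexes for ttl_s seconds."""

    def __init__(self, ttl_s=3600.0, clock=time.monotonic):
        self.ttl_s = ttl_s
        self.clock = clock
        self._entries = {}  # modem -> (expires_at, indexes)

    def get(self, modem, walk):
        """Return cached indexes; call walk(modem) only if missing or stale."""
        now = self.clock()
        entry = self._entries.get(modem)
        if entry is not None and entry[0] > now:
            return entry[1]
        indexes = list(walk(modem))
        self._entries[modem] = (now + self.ttl_s, indexes)
        return indexes

    def invalidate(self, modem):
        """Force a fresh walk next time, e.g. after a GET hits a missing index."""
        self._entries.pop(modem, None)
```

A poller would call `cache.get(modem, walk)` before each GET and `cache.invalidate(modem)` whenever a cached index stops answering, which is how a change in the number of DS channels would show up.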

Thus spake Dan White (dwhite@olp.net) on Tue, Sep 29, 2015 at 03:37:51PM -0500:

Pavel,

It's all going to come down to how you deploy the polling system you select. Most server operating systems are going to struggle with that many transactions in a short period of time, no matter how awesome the polling engine is. Look for a distributed polling solution. If you can spread the connection load out a bit, it may be less of an issue. The worst issue will be populating your solution with that many devices and sensors to check, so look for an API or import tool! What do you think the network latency is between your polling location or locations and all of the cable modems? Do they respond pretty well to SNMP queries?

We use PRTG and it's very efficient with SNMP; it is one of the most efficient SNMP polling engines I have used. My quick math and experience tell me that with a PRTG core server and two or three remote PRTG probes (hardware based) you should be able to hit 6,000 SNMP polls a minute from each probe. PRTG will automatically attempt a multi-get. Go with three probes to be safe and I think you could hit over 50,000 with buffer room.
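
The quick math can be written out as below. It takes the 6,000 polls per minute per probe figure from above and assumes a 5-minute window. How many polls one modem costs is an assumption too: 1 if a multi-get fetches all 20 OIDs in a single poll, 20 if each OID ends up as its own poll.

```python
import math


def probes_needed(modems, polls_per_modem, polls_per_min_per_probe, window_min):
    """Probes required to finish one full pass inside the window."""
    capacity_per_probe = polls_per_min_per_probe * window_min
    return math.ceil(modems * polls_per_modem / capacity_per_probe)


# One multi-get per modem (assumed best case).
best_case = probes_needed(50_000, 1, 6_000, 5)

# One poll per OID (assumed worst case, multi-get not effective).
worst_case = probes_needed(50_000, 20, 6_000, 5)
```

The gap between the two cases is why it matters whether the multi-get really covers every OID on a modem.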

Easy platform, inexpensive compared to other licensed NMS systems. Free trial with unlimited sensors on their website. But their software engineering staff will state that this project would be stretching the design of the system and its focus.

Would love to hear what you figure out works best!

Sincerely,
Nick Ellermann – CTO & VP Cloud Services
BroadAspect

E: nellermann@broadaspect.com
P: 703-297-4639
F: 703-996-4443


I'm able to poll a few thousand CMs in a few seconds using perl's Net::SNMP and async calls. 50k seems pretty doable.

--Blake

@op,

Can you expand a little on the end goal: health, noise mitigation, NMS replacement, modem validation?

So we have used www.zenoss.org for many years. Individual collectors easily handle SNMP poll rates of 1.5k OIDs per second (450k per 5 minutes). As Zenoss Core is open source, it's probably worth a look for you.

-Joel

OpenNMS has a poller that will do what you want. The problem is figuring out what you wish to collect and how to use it. Most of the time it's not as simple as pointing at the modem and saying go.

I've added a few oids for some of the modems we support, just so I can get SNR on them. I don't usually add customer modems directly to monitoring unless I'm tracking a long term problem and want to watch the SNR for that customer for weeks.

I monitor our CMTSs with a threshold system that says: if the number of active modems decreases by around 20, then alert. This can cause false positives with modems migrating between cards, but if you tweak the numbers right it works okay.
We also have graphs for signal and other things on each CMTS.
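
For illustration, the threshold check could look something like this in Python. The class name and the per-CMTS bookkeeping are my own assumptions, not how OpenNMS implements thresholds; 20 is the figure mentioned above.

```python
class ModemCountWatch:
    """Alert when a CMTS loses at least `threshold` active modems between polls."""

    def __init__(self, threshold=20):
        self.threshold = threshold
        self._last = {}  # cmts name -> active modem count at previous poll

    def update(self, cmts, active_now):
        """Record the new count; return True if the drop warrants an alert."""
        previous = self._last.get(cmts)
        self._last[cmts] = active_now
        if previous is None:
            return False  # first sample, nothing to compare against
        return previous - active_now >= self.threshold
```

Feeding it the total per CMTS rather than per card would be one way to keep card migrations from tripping it, at the cost of not noticing a single card losing modems while another gains them.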

Now that I'm thinking about it, I believe I could get away with adding all our modems for SNR, then try to write something to add/remove them and keep it in sync with our provisioning system. I would need to make sure everything was in order so I don't get 400 emails when a site goes down, but it all should be possible. I'm not sure if the I/O would be worth it, but being able to aggregate some of the data and look at SNR across an entire plant would be nice. At one point I had a project to put modems at the tail end of each leg of a plant then monitor them. This is because we don't have monitor-able amplifiers. It never happened though.
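
The "keep it in sync with provisioning" part is mostly a set difference. A minimal sketch, assuming both systems can export a list of modem identifiers (MAC addresses here, purely as an example):

```python
def plan_sync(provisioned, monitored):
    """Return (to_add, to_remove) so monitoring matches provisioning."""
    provisioned, monitored = set(provisioned), set(monitored)
    to_add = sorted(provisioned - monitored)     # provisioned but not yet polled
    to_remove = sorted(monitored - provisioned)  # polled but no longer provisioned
    return to_add, to_remove
```

Run it on a schedule and apply the two lists through whatever import mechanism the monitoring system offers.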

The truth is that balancing a plant is easy enough once you're used to it, and the extra metrics you might get from doing some of these things aren't worth the long term I/O. We do have other (non-NMS) systems that will poll and get instantaneous results like this for entire plants. That has been very useful.

My guess is no matter what system you pick, you will either need to spend a couple of weeks hacking on it or pay someone to implement it. There isn't a turnkey system that does exactly what you want because 99% of network monitoring companies target systems rather than networks (the market is much larger..).

If you want to roll your own:

https://github.com/tobez/snmp-query-engine

I recently discovered this and wanted it years ago. I actually considered stripping the poller out of OpenNMS so there would be a bare-bones poller you could send oids to and get back results. The reason being that almost everyone who does SNMP does a bad job of it and is slow. So, don't start at the library layer and don't write your own thing (unless you have to..). You need asynchronous communication, bulk and gettable support, and you don't want to worry about max PDU size. That's what snmp-query-engine does (maybe.. I've just looked at the tin, I haven't used it)
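
To show what "worrying about max PDU size" means if you do write your own, one simple approach is to cap the number of OIDs per request and send several requests per modem. This is only a sketch of that idea. The cap of 8 used in the note below is arbitrary, and this is not a description of how snmp-query-engine handles it.

```python
def chunk_oids(oids, max_per_request):
    """Split an OID list into request-sized chunks, preserving order."""
    if max_per_request < 1:
        raise ValueError("max_per_request must be at least 1")
    return [
        oids[i:i + max_per_request]
        for i in range(0, len(oids), max_per_request)
    ]
```

With 20 OIDs and a cap of 8 this yields three requests per modem. A smarter poller would shrink the cap per device when it gets a tooBig error back.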

Second note about rolling your own: Skip whisper, rrdtool, mrtg, and any other single-system data collection. You want 1 million oids or more in 5 minutes? You need SSD for hardware and will probably want to distribute data writes eventually. Research things that make this easier. Cassandra based storage... but nothing good is fully formed. You should still probably begin with OpenTSDB, InfluxDB or another established time series database rather than rolling your own. They have warts but fixing the warts is better than creating new one-use TSDBs with their own flaws. See https://github.com/OpenNMS/newts/wiki/Comparison-of-TSDB

We have used ZenOss for a number of years at this scale (40k+ devices, at intervals of 1-5 minutes). It is possible to do if you have the hardware and storage performance to throw at it. We used OpenNMS before that and had to change due to scale. During that time we evaluated a number of the big name and big dollar solutions and none of them seemed to scale any better without significantly more hardware costs.
That's not to say ZenOss is perfect, we have plenty of headaches too.

Thank you all for your suggestions. I knew that NANOG is the perfect place
for these kinds of questions.
I will discuss your comments with my colleagues to see what would be the
best solution.
Once again, thank you all for your valuable suggestions. I hope I will
update you soon with some results/tests and of course more questions :)

Please try the following:

https://statseeker.com/

Check out Science Logic. Their platform is made to scale to these levels.

You might wish to check out https://github.com/tobez/snmp-query-engine, a
multiplexing SNMP query engine (disclosure: I am the author).

Some scripting is needed to instruct it to do the polling and fetch the
collected results. Currently there are only bindings for Perl
(Net::SNMP::QueryEngine::AnyEvent on metacpan.org),
but bindings for other dynamic languages should be pretty straightforward.

We routinely use it to collect comparable quantities of OIDs.

\Anton.