RE: Open-Source Network Management Tools

From: Alexei Roudnev
Sent: Friday, September 17, 2004 12:53 AM
To: Michael Smith;
Subject: Re: Open-Source Network Management Tools

I have always tried to avoid dealing with SNMP TRAPS, as they are the
most unreliable and inconvenient way of alerting (unfortunately, they
cannot be avoided totally).
We use 'syslog' (syslog-ng + home-written syslog analyzers +
commercial software, sometimes) when possible.

Unfortunately, SNMP TRAPS are what is available on the SONET
transport side of the network. There is no useful data to be gotten
from polling. In addition, the fact that TRAPS are proactive instead
of reactive means I am immediately aware of network events,
whereas I might miss something with a poll.

In addition, we have dry contact closures on these devices that TRAP
only; they cannot be polled. Thankfully, the number of these events is small
enough that syslog functions quite well.

Syslog has not been up to the task of working with the sheer volume
of TRAPS generated when there is a significant event on the optical
network. Sometimes we see the notification but not the resolution,
sometimes we see all but the last line of a TRAP message, and
sometimes we get nothing.



There is another problem with TRAPS:
- when I code monitoring, I always need 2 messages:

(We have a few monitoring scripts, and each one started out sending the
CRITICAL message only and ended up sending both messages - it is
impossible to work without having information _if condition still exists or

Unfortunately, neither SYSLOG nor SNMPTRAP has such positive notifications, which
makes them very difficult to use and limits them to a very small set of really
CRITICAL events.

I do not have this problem with POLL:
- poll the parameter, draw a chart;
- if the parameter exceeds the threshold, a 'SHORT FAILURE' event is raised (no
paging, just show a problem);
- if the 'SHORT FAILURE' persists for some time, it is converted into CRITICAL and
an alert is sent;
- when the problem is fixed, a RESTORED message is sent.
(See: ProactiveNetwork system; many open-source systems. Do not look at CA, a
good example of terrible design. BMC is something average.)
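
The polling steps above can be sketched roughly as a small state machine. This
is only an illustration, not the code of any system named here: the class name
PollMonitor, its fields, and the hold_seconds parameter are hypothetical, and
recording RESTORED after a mere SHORT FAILURE is an assumption.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

OK, SHORT_FAILURE, CRITICAL, RESTORED = "OK", "SHORT FAILURE", "CRITICAL", "RESTORED"


@dataclass
class PollMonitor:
    """Hypothetical poll-based monitor for one parameter."""
    threshold: float
    hold_seconds: float  # how long a SHORT FAILURE may last before escalation
    state: str = OK
    failed_since: Optional[float] = None
    history: List[Tuple[float, float]] = field(default_factory=list)  # for the chart
    events: List[Tuple[float, str]] = field(default_factory=list)     # event log

    def poll(self, now: float, value: float) -> Optional[str]:
        """Record one sample; return the event raised by it, if any."""
        self.history.append((now, value))
        event = None
        if value > self.threshold:
            if self.state == OK:
                # First sample over the threshold: show a problem, no paging.
                self.state, self.failed_since = SHORT_FAILURE, now
                event = SHORT_FAILURE
            elif (self.state == SHORT_FAILURE
                  and now - self.failed_since >= self.hold_seconds):
                # The failure has persisted: escalate, the caller should alert.
                self.state = CRITICAL
                event = CRITICAL
        else:
            if self.state != OK:
                # Assumption: RESTORED is recorded after any failure state.
                event = RESTORED
            self.state, self.failed_since = OK, None
        if event is not None:
            self.events.append((now, event))
        return event


monitor = PollMonitor(threshold=90.0, hold_seconds=120.0)
samples = [(0, 50.0), (60, 95.0), (120, 96.0), (180, 97.0), (240, 80.0)]
raised = [monitor.poll(t, v) for t, v in samples]
```

Paging is left to the caller, which would send an alert only when poll()
returns CRITICAL and a positive notification when it returns RESTORED.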

As a result, you can always see:
- the history of the parameter (so, if it is disk space, it is easy to understand
how much time you have, for example);
- the history of events (when it failed and when it was restored);
- whether someone else has already dealt with the problem.
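
As a rough illustration of the disk-space case, the time remaining can be
estimated from the polled history by simple linear extrapolation. The function
name and the (timestamp, used) sample layout below are assumptions made for
this sketch, and real disk usage rarely grows this evenly.

```python
from typing import List, Optional, Tuple


def seconds_until_full(history: List[Tuple[float, float]],
                       capacity: float) -> Optional[float]:
    """Estimate seconds until 'used' reaches 'capacity'.

    history holds (timestamp_seconds, used) samples, oldest first.
    Returns None when there is too little data or no growth to extrapolate.
    """
    if len(history) < 2:
        return None
    (t0, u0), (t1, u1) = history[0], history[-1]
    if t1 <= t0 or u1 <= u0:
        return None
    rate = (u1 - u0) / (t1 - t0)   # growth per second over the whole window
    return (capacity - u1) / rate  # remaining space divided by growth rate


# Usage grew from 50 to 60 units in one hour on a 100-unit disk.
remaining = seconds_until_full([(0.0, 50.0), (3600.0, 60.0)], capacity=100.0)
```

With these made-up samples the estimate comes out at about four hours. A
single TRAP carries no history, so no such estimate is possible from it.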

Without it... I receive a message

  ALERT, CRITICAL, server XXX, oid

I do not know (it's impossible) where to look - there is no parameter
associated with this message.
I do not know whether it was a short condition (maybe a disk was replaced in
the RAID) or whether it still exists (a DISK has failed now).
In retrospect, a manager cannot see how fast it was fixed.

All of this makes SNMP TRAPS very inconvenient (not to mention the possible
loss of events).