Open source Netflow analysis for monitoring AS-to-AS traffic

I share this concern, but in my experience the market simply does not
care at all what the data means. People happily graph L3 rate from
Junos and L2 rate from other boxes, using them interchangeably and
using them to determine whether or not there is congestion.
In reality, what you want is L1 speed, so you can
actually see if the interface is full or not. Luckily we are starting
to see more and more devices also support peak-buffer-util over the
previous N seconds, which is far more useful for congestion
monitoring. Unfortunately it is not in IF-MIB, so most will never
collect it.

Note, it is possible to get most Juniper gear to report L2 rate like
IF-MIB specifies, but it's a non-standard configuration option,
therefore very rarely used.
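
To put rough numbers on the L3/L2/L1 difference, here is a minimal Python sketch. It assumes untagged Ethernet, that the "L2" counter covers the MAC header and FCS, and that the "L3" counter covers only the IP packet. The function names are made up for illustration, and real counters on a given box may count differently.

```python
# Per-frame overhead in bytes, assuming untagged Ethernet (no 802.1Q tags).
L2_OVERHEAD = 14 + 4       # MAC header + FCS, on top of the L3 packet
L1_OVERHEAD = 7 + 1 + 12   # preamble + start-of-frame delimiter + inter-frame gap


def l1_bps_from_l2(l2_bps: float, pps: float) -> float:
    """Estimate the on-the-wire (L1) bit rate from an L2 bit rate and packet rate."""
    return l2_bps + pps * L1_OVERHEAD * 8


def l1_bps_from_l3(l3_bps: float, pps: float) -> float:
    """Estimate the on-the-wire (L1) bit rate from an L3 bit rate and packet rate."""
    return l3_bps + pps * (L2_OVERHEAD + L1_OVERHEAD) * 8


def utilisation(bps: float, link_bps: float) -> float:
    """Fraction of the link capacity in use."""
    return bps / link_bps


if __name__ == "__main__":
    link = 10e9            # 10GE
    pps = 14_880_952       # roughly line rate for 64-byte frames on 10GE
    l2 = pps * 64 * 8      # what an L2 counter would show
    l3 = pps * 46 * 8      # what an L3 counter would show
    print(f"L3 view: {utilisation(l3, link):.1%}")
    print(f"L2 view: {utilisation(l2, link):.1%}")
    print(f"L1 view: {utilisation(l1_bps_from_l2(l2, pps), link):.1%}")
```

With small packets the gap is large: the same completely full 10GE port looks about 55% utilised on an L3 graph and about 76% on an L2 graph. With large packets the per-frame overhead matters much less, which is probably why the mix-up so often goes unnoticed.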

I also wholeheartedly agree on inline templates being near peak
insanity. Huge complexity for upside that is completely beyond my
understanding. If I decide to collect a new metric, then punching in
the metric number+name somewhere is the least of my worries. The idea that
costs are lowered by having machines dynamically determine what is
being collected and monitored is just bizarre. Most of the cost of
starting to collect a new metric is figuring out how it is actionable,
what needs to happen to the metric to trigger a given action, and how
exactly we are extracting value from this action.
Netflow v9/v10 definitely should have done out-of-band templates, and
left it to the operator to communicate to the collector what it is
seeing.

Even exceedingly trivial things in v9/v10 entities can be broken for
years before anyone notices. For example, the original
sampling entities are deprecated and have been replaced with new entities
which communicate 'every N packets, sample C packets'. This is very
good, because it allows you to do stateless sampling while still
filling the export packet to MTU size or larger, keeping the export PPS
rate the same before and after axing the cache. However, by the time I was looking
into this, only pmacct correctly understood how to use these entities;
nfcapd and Arbor either didn't understand them or understood them
incorrectly (both were fixed in a timely manner by responsible
maintainers, thank you).
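
As a rough illustration of what a collector has to do with 'every N packets, sample C packets', here is a minimal Python sketch. The names `every_n` and `sample_c` are made up for this example and follow the wording above literally. If an exporter instead reports the gap between sampled runs, N would be that gap plus C, so check what your exporter actually sends.

```python
def scale_factor(every_n: int, sample_c: int) -> float:
    """Multiplier that turns sampled counts into estimated totals.

    Assumes the exporter samples `sample_c` packets out of every
    `every_n` packets, as described in the post above.
    """
    if sample_c <= 0 or every_n < sample_c:
        raise ValueError("need 0 < sample_c <= every_n")
    return every_n / sample_c


def estimate_totals(sampled_packets: int, sampled_bytes: int,
                    every_n: int, sample_c: int) -> tuple[float, float]:
    """Scale sampled packet and byte counts up to estimated real traffic."""
    k = scale_factor(every_n, sample_c)
    return sampled_packets * k, sampled_bytes * k


if __name__ == "__main__":
    # Hypothetical export: 10 sampled packets, 15000 sampled bytes,
    # with 2 packets sampled out of every 1000.
    print(estimate_totals(10, 15000, every_n=1000, sample_c=2))
```

Under these assumptions, a collector that ignores C and scales by N alone would over-report traffic by a factor of C. Because the error is a constant multiplier, the graphs still look plausible.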

Hi Peter,

Thanks for that link. I did read the spec, and while the definition itself is clear, the escape clause gives a lot of wiggle room:

> Hardware limitations may prevent an exact reporting of the underlying frame length, but an agent should attempt to be as accurate as possible.

I read that as, “the vendor will do whatever it pleases, and you should be grateful to receive a non-negative integer at all.” I could be too cynical, though.

Anyway, this particular vendor does other funny things (such as sometimes stripping the q-tag headers from the sampled frame; throttling the frame sampling on the box, but not adjusting the sampling interval in the sFlow exports) that make it a true joy to work with this gear. ;-)

Cheers,

– Steven
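
One way a collector might detect the throttling described above is to compare the advertised sampling rate with an effective rate derived from the sFlow v5 flow sample fields `sample_pool` and `sequence_number`. The sketch below is only an illustration: the `FlowSample` class and `effective_rate` function are hypothetical, and it assumes the agent maintains `sample_pool` honestly, which this particular gear may well not do.

```python
from dataclasses import dataclass
from typing import Optional

COUNTER_MOD = 2 ** 32  # sFlow v5 flow sample counters are 32-bit and wrap


@dataclass
class FlowSample:
    """Subset of the sFlow v5 flow sample header for one data source."""
    sequence_number: int  # incremented for every flow sample generated
    sampling_rate: int    # advertised 1-in-N sampling rate
    sample_pool: int      # packets that could have been sampled so far


def effective_rate(first: FlowSample, last: FlowSample) -> Optional[float]:
    """Packets seen per sample generated between two flow samples."""
    samples = (last.sequence_number - first.sequence_number) % COUNTER_MOD
    pool = (last.sample_pool - first.sample_pool) % COUNTER_MOD
    if samples == 0:
        return None  # not enough data to estimate
    return pool / samples


if __name__ == "__main__":
    # Hypothetical values: the agent advertises 1-in-1000 but generated only
    # 100 samples while 400,000 packets passed, so the effective rate is 1-in-4000.
    a = FlowSample(sequence_number=100, sampling_rate=1000, sample_pool=1_000_000)
    b = FlowSample(sequence_number=200, sampling_rate=1000, sample_pool=1_400_000)
    print(effective_rate(a, b), "vs advertised", b.sampling_rate)
```

If the effective rate is consistently well above the advertised one, scaling by the effective rate is probably closer to the truth. Random sampling makes the estimate noisy over short windows.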

This is very high on my todo list, notably because I don't want to reimplement Grafana. The API already exists (the current web interface uses it) but it is not "stable" (it may change in future versions).