Open source Netflow analysis for monitoring AS-to-AS traffic

Saku_Ytti1 · March 30, 2024, 5:25am

I share this concern, but in my experience the market simply does not
care at all what the data means. People happily graph L3 rate from
Junos and L2 rate from other boxes, using them interchangeably, and
use them to determine whether or not there is congestion. In reality,
what you want is L1 rate, so you can actually see whether the
interface is full. Luckily, more and more devices are starting to
support peak-buffer-util over the previous N seconds, which is far
more useful for congestion monitoring. Unfortunately it is not in
IF-MIB, so most will never collect it.

Note, it is possible to get most Juniper gear to report L2 rate like
IF-MIB specifies, but it's a non-standard configuration option,
therefore very rarely used.
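
To make the L3/L2/L1 gap above concrete, here is a minimal sketch that converts a graphed rate into L1 rate. It assumes plain untagged Ethernet (14-byte header, 4-byte FCS, 8-byte preamble/SFD, 12-byte minimum inter-frame gap) and that the L2 counter includes the FCS; the function names are made up for illustration.

```python
# Per-frame overhead in bytes, assuming untagged Ethernet (no VLAN tags, no MACsec).
L2_OVERHEAD = 14 + 4   # Ethernet header + FCS, on top of the L3 packet
L1_OVERHEAD = 8 + 12   # preamble/SFD + minimum inter-frame gap, on top of the L2 frame


def l1_bps_from_l2(l2_bps: float, pps: float) -> float:
    """Estimate L1 bit rate from an L2 bit rate and a packet rate."""
    return l2_bps + pps * L1_OVERHEAD * 8


def l1_bps_from_l3(l3_bps: float, pps: float) -> float:
    """Estimate L1 bit rate from an L3 bit rate and a packet rate."""
    return l3_bps + pps * (L2_OVERHEAD + L1_OVERHEAD) * 8


def utilisation(l1_bps: float, line_rate_bps: float) -> float:
    """Fraction of the line rate actually occupied on the wire."""
    return l1_bps / line_rate_bps
```

With 64-byte frames at line rate on a 10G port, the L2 graph shows only about 7.62 Gbps (64/84 of line rate) even though the interface is completely full, which is why the packet rate has to be part of the calculation.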

I also wholeheartedly agree on inline templates being near peak
insanity. Huge complexity for upside that is completely beyond my
understanding. If I decide to collect a new metric, then punching in
the metric number+name somewhere is the least of my worries. The idea
that costs are lowered by having machines dynamically determine what
is being collected and monitored is just bizarre. Most of the cost of
starting to collect a new metric is figuring out how it is actionable,
what needs to happen to the metric to trigger a given action, and how
exactly we are extracting value from this action.
Netflow v9/v10 definitely should have used out-of-band templates, and
left it to the operator to communicate to the collector what it is
seeing.

Even exceedingly trivial things in v9/v10 entities can be broken for
years before anyone notices. For example, the original sampling
entities are deprecated and have been replaced with new entities,
which communicate 'every N packets, sample C packets'. This is very
good, because it allows you to do stateless sampling while still
filling the export packet to MTU size or larger, keeping the export
PPS rate the same before and after axing the cache. However, by the
time I was looking into this, only pmacct correctly understood how to
use these entities; nfcapd and Arbor either didn't understand them or
understood them incorrectly (both were fixed in a timely manner by
responsible maintainers, thank you).
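
A minimal sketch of how a collector might scale counters from the newer sampling entities. It assumes they are the IPFIX elements samplingPacketInterval (305, packets sampled consecutively, the 'C' above) and samplingPacketSpace (306, packets skipped between sampled runs), so that N = interval + space; the function names are hypothetical and this is not how any particular collector implements it.

```python
def sampling_multiplier(packet_interval: int, packet_space: int) -> float:
    """Return the factor by which sampled counters must be scaled.

    packet_interval: packets sampled in a row (assumed IE 305).
    packet_space:    packets skipped before the next run (assumed IE 306).
    """
    if packet_interval <= 0 or packet_space < 0:
        raise ValueError("invalid sampling parameters")
    return (packet_interval + packet_space) / packet_interval


def scale_flow(packets: int, octets: int,
               packet_interval: int, packet_space: int) -> tuple[float, float]:
    """Scale the sampled packet and octet counts of one flow record."""
    m = sampling_multiplier(packet_interval, packet_space)
    return packets * m, octets * m
```

Note that sampling 1 packet then skipping 999 and sampling 10 packets then skipping 9990 both give a multiplier of 1000; a collector that reads only one of the two fields, or treats the space as the rate, will get the scaling wrong.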

Steven_Bakker · March 31, 2024, 9:53am

Hi Peter,

Thanks for that link. I did read the spec, and while the definition itself is clear, the escape clause gives a lot of wiggle room:

“Hardware limitations may prevent an exact reporting of the underlying frame length, but an agent should attempt to be as accurate as possible.”

I read that as, “the vendor will do whatever it pleases, and you should be grateful to receive a non-negative integer at all.” I could be too cynical, though.

Anyway, this particular vendor does other funny things (such as sometimes stripping the q-tag headers from the sampled frame; throttling the frame sampling on the box, but not adjusting the sampling interval in the sFlow exports) that make it a true joy to work with this gear.
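
One possible way to cross-check the advertised sampling rate is sketched below. It assumes the agent at least maintains the sFlow v5 flow sample fields `sequence_number` and `sample_pool` correctly (which is itself an assumption with gear like this), and treats both as 32-bit wrapping counters; the function name is made up for illustration.

```python
U32 = 2 ** 32  # both counters are assumed to be unsigned 32-bit and to wrap


def effective_sampling_rate(prev_seq: int, prev_pool: int,
                            cur_seq: int, cur_pool: int):
    """Estimate the real 1-in-N rate between two flow samples of one source.

    Returns None when no new samples were generated in between.
    """
    samples = (cur_seq - prev_seq) % U32   # samples the agent generated
    pool = (cur_pool - prev_pool) % U32    # packets that could have been sampled
    if samples == 0:
        return None
    return pool / samples
```

If the export claims 1-in-1000 but this estimate keeps coming out near 4000, the box is most likely throttling samples without telling the collector.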

Cheers,

– Steven

Vincent_Bernat · April 14, 2024, 9:41am

This is very high on my todo list, notably because I don't want to reimplement Grafana. The API already exists (the current web interface uses it) but it is not "stable" (it may change in future versions).

Javier_Gutierrez · June 5, 2024, 9:15pm

Hi everyone,
I’ve been trying to get Akvorado to work in my environment, but flow collection keeps stopping. The issue seems to be related to the number of exporters I have sending data. Can someone please share the maximum number of exporters they have gotten to work, and the flows/s rate, without the system crashing?

Thanks in advance for your answers.

Vincent_Bernat · June 8, 2024, 7:46am

Without much information, I think it is more likely that you are running out of disk space.

Dave_Taht · June 9, 2024, 2:55am

We are in the process of adding netflow collection to libreqos. Any potential testers using any of these backends described below out there?

Javier_Gutierrez · June 10, 2024, 2:05pm

After some troubleshooting I ended up having to increase my Kafka partitions as well as my ClickHouse collectors, as it seemed like ClickHouse would lag behind quite a bit.
I also had some issues with the server where CPU and RAM would max out all the time. My RAM usage is still quite high and seems to grow exponentially as the day goes by, but I don’t think it’s causing the issues anymore. Storage-wise I’m good, though.

Thanks for the advice.