sFlow vs netFlow/IPFIX

Crane_Todd · February 28, 2016, 8:06pm

This may be outside the scope of this list but I was wondering if anybody had advice or lessons learned on the whole sFlow vs netFlow debate. We are looking at using it for billing and influencing our SDN flows. It seems like everything I have found is biased (articles by companies who have commercial offerings for the "better" protocol).

Todd Crane

Nick_Hilliard3 · February 28, 2016, 10:40pm

Todd Crane wrote:

This may be outside the scope of this list but I was wondering if
anybody had advice or lessons learned on the whole sFlow vs netFlow
debate. We are looking at using it for billing and influencing our
sdn flows. It seems like everything I have found is biased (articles
by companies who have commercial offerings for the "better"
protocol)

There is a lot of religion floating around about this subject.

Netflow was designed to measure flows, and it turned out that the design
was robust enough for it to be more-or-less good enough for billing
purposes. It's "more or less" because on larger routers, you can't do
1:1 data export and you end up needing to do traffic sampling, at which
point you're billing based on realistic estimates rather than exact
data. That's fine if your contract with your customer says it's ok.
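
The "realistic estimates" point above can be made concrete with a little arithmetic. The following is a minimal sketch, not anything from a particular router or collector: the function names are hypothetical, and the error bound is the commonly cited packet-sampling approximation (roughly 196 * sqrt(1/c) percent at 95% confidence, where c is the number of samples), which assumes the sampling itself is unbiased.

```cpp
#include <cmath>
#include <cstdint>
#include <limits>

// Scale a sampled byte count up to an estimate of the real traffic.
// sampling_rate is N in "1 in N packets". Name is illustrative.
std::uint64_t estimate_bytes(std::uint64_t sampled_bytes, std::uint32_t sampling_rate) {
    return sampled_bytes * sampling_rate;
}

// Approximate 95% confidence bound on the relative error, in percent,
// for an estimate built from `samples` sampled packets.
// Assumes unbiased random sampling; returns infinity when there are no samples.
double percent_error_bound(std::uint64_t samples) {
    if (samples == 0) {
        return std::numeric_limits<double>::infinity();
    }
    return 196.0 * std::sqrt(1.0 / static_cast<double>(samples));
}
```

Under these assumptions, a customer whose bill is built from around 10,000 samples would be within roughly 2% of the true figure, and the bound tightens as more samples accumulate over the billing period.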

Netflow works by tracking individual flows in the data plane. This is
pretty complicated in practice and requires dedicated hardware to handle
it at line rate. You generally end up with two packet forwarding
engines on a hardware-forwarded router: one to handle the forwarding,
and the other to categorise and handle the flow data. This means that
netflow is expensive to design, build and run.
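
To illustrate what "tracking individual flows" involves, here is a rough software sketch of a flow cache keyed on the classic 5-tuple. All type and field names are hypothetical, and a `std::map` stands in for whatever hash table or hardware lookup a real router would use; the point is only that every packet needs a lookup plus a counter update.

```cpp
#include <cstddef>
#include <cstdint>
#include <map>
#include <tuple>

// Hypothetical 5-tuple flow key; field names are illustrative.
struct FlowKey {
    std::uint32_t src_ip;
    std::uint32_t dst_ip;
    std::uint16_t src_port;
    std::uint16_t dst_port;
    std::uint8_t protocol;

    bool operator<(const FlowKey& o) const {
        return std::tie(src_ip, dst_ip, src_port, dst_port, protocol) <
               std::tie(o.src_ip, o.dst_ip, o.src_port, o.dst_port, o.protocol);
    }
};

struct FlowCounters {
    std::uint64_t packets = 0;
    std::uint64_t bytes = 0;
};

class FlowCache {
public:
    // Called once per packet: find (or create) the flow and bump its counters.
    void account(const FlowKey& key, std::uint32_t packet_bytes) {
        FlowCounters& counters = flows_[key];
        counters.packets += 1;
        counters.bytes += packet_bytes;
    }

    // Returns the counters for a flow, or nullptr if it is not in the cache.
    const FlowCounters* find(const FlowKey& key) const {
        auto it = flows_.find(key);
        return it == flows_.end() ? nullptr : &it->second;
    }

    std::size_t size() const { return flows_.size(); }

private:
    std::map<FlowKey, FlowCounters> flows_;
};
```

The per-packet update is cheap; the cost is in holding state for every concurrent flow, which is why cache size matters so much on real hardware.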

Sflow is a simple packet header sampling mechanism. The only thing it
does is to pick out every 1 in N packets, and to try to figure out where
the headers stop and the data begins. The header is then forwarded to
the sflow collector, which is where all the smart stuff is done.
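
For comparison, the "1 in N" selection can be sketched in a few lines. This is an illustration only, assuming a randomized skip count with a mean of N (so the sampler does not lock onto periodic traffic); the class name and the seeded software RNG are assumptions, and real agents do this in the ASIC.

```cpp
#include <cstdint>
#include <random>

// Minimal 1-in-N packet sampler sketch. The skip count is drawn uniformly
// from [1, 2N-1], so on average one packet in N is selected.
// Assumes n >= 1 and that 2*n - 1 fits in 32 bits.
class PacketSampler {
public:
    PacketSampler(std::uint32_t n, std::uint32_t seed)
        : rng_(seed), skip_dist_(1, 2 * n - 1), countdown_(0) {
        countdown_ = skip_dist_(rng_);
    }

    // Call once per packet; returns true when this packet should be sampled.
    bool should_sample() {
        countdown_ -= 1;
        if (countdown_ > 0) {
            return false;
        }
        countdown_ = skip_dist_(rng_);  // pick the next skip count
        return true;
    }

private:
    std::mt19937 rng_;
    std::uniform_int_distribution<std::uint32_t> skip_dist_;
    std::uint32_t countdown_;
};
```

There is no per-flow state here at all, which is the contrast with the flow-tracking approach: the sampled headers are simply shipped to the collector.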

If your netflow / sflow packet sampling mechanism is accurate and your
router is configured appropriately for the quantity of flow data being
exported (i.e. it isn't dropping data samples due to overload), then for
the most part, there will be no substantial difference between using
sflow and sampled netflow (depending on the data flow type), assuming
that each protocol provides the data you're looking for.

Obviously, if your sampling mechanism is broken or your exporter is
overloaded, then both sflow and netflow will produce trash.

If you're using unsampled netflow, then netflow will be more accurate,
assuming you don't end up overflowing the netflow data export mechanism.

Anything which uses sampling - regardless of whether it's for netflow or
sflow - needs to be profiled before being pushed into production,
because you need to understand the limits of the sampling mechanism.
Hardware sampling often doesn't work properly or plateaus off at a
certain stage, dropping packets in the process. This can cause
unwelcome surprises.

Without knowing anything more about your requirements or your choice of
equipment, I'd suggest that sflow would probably be a better choice for
SDN tuning and probably netflow would be better for billing, but YMMV.

Nick

Phil · February 28, 2016, 11:15pm

What HW are you looking at, or are you rolling your own probes? Router/switch HW almost never does both. Netflow/IPFIX puts the flow intelligence in the router, but with that comes more limitations.

Sflow typically uses more BW because you are sending headers for each sampled packet. The sflow collector also needs more intelligence since it's doing flow correlation, AS matching, etc. instead of the router doing it. However, it is more flexible, since adding a new header like VXLAN or NSH is much easier to implement in some analysis SW than in router SW.

Phil

Baldur_Norddahl · February 28, 2016, 11:26pm

Around here they are currently voting on a law that will require unsampled
1:1 netflow on all data in an ISP network with more than 100 users. Then
store that data for 1 year, so the police and other parties can request a
copy (with a warrant but you are never allowed to tell anyone that they
came for the data and the judges will never say no).

My routers can apparently actually do 1:1 netflow and the documentation
does not state any limits on that. So maybe I am lucky?

To the original question: in this country sFlow only is apparently about to
become illegal.

Regards,

Baldur

Dobbins_Roland · February 29, 2016, 2:24am

That's interesting, given that most larger routers don't support 1:1.

Valdis_Kletnieks · February 29, 2016, 2:41am

In the war between reality and governmental paranoia, reality usually loses.

Pavel_Odintsov · February 29, 2016, 7:26am

Hello, folks!

I have a lot of experience with the sflow vs netflow battle, because my free
DDoS detection toolkit, fastnetmon, supports both capture methods.

You could look at this comparison table:
https://github.com/pavel-odintsov/fastnetmon/blob/master/docs/CAPTURE_BACKENDS.md

From my own experience, sflow should be selected if you are interested
in the internal packet payload (for DPI / DDoS detection) or you need a fast
reaction time for some actions (DDoS is the best example).

If you just need to count traffic, and you can accept a fairly long reaction
time and less accurate bandwidth data, you could select
netflow.

From a hardware point of view, almost all brand new switches support
sflow free of charge (no additional licenses or modules). But be
aware that Cisco does not support this protocol at all (that's pretty weird,
really). Also keep in mind that sflow is implemented in the hardware ASIC with
a little help from the CPU, so it's pretty fast and suitable for really any
traffic bandwidth.

I have experience with sflow analytics for a 1.5 Tb+ network and it's
working really well!

For netflow you sometimes need additional modules / software licenses,
and sometimes devices don't support it at all. And if you
have software-based devices (for example small SRX routers from Juniper),
netflow generation will be pretty expensive from a CPU point of view,
because netflow needs a fairly large amount of CPU resources for
aggregation.

Avi_Freedman1 · February 29, 2016, 7:27am

Re: limits -

For Cisco/Juniper it's in the low hundreds of thousands of flows/sec
per chipset/linecard for 1:1 NetFlow/IPFIX, I think.

Then of course, as has been mentioned, you'll need to be able to send
it and receive it to something - and store+query.

Avi Freedman
CEO, Kentik

<snip>

Dobbins_Roland · February 29, 2016, 7:32am

This does not match my experience. In particular, the implied canard about flow telemetry being inadequate for timely DDoS detection/classification/traceback grows tiresome, as it's used for that purpose every day, and works quite well.

If one is also using an IDMS-type device to mitigate DDoS traffic, the device sees the whole packet, anyways.

Avi_Freedman1 · February 29, 2016, 7:38am

This may be outside the scope of this list but I was wondering if anybody had advice or lessons learned on the whole sFlow vs netFlow debate. We are looking at using it for billing and influencing our SDN flows. It seems like everything I have found is biased (articles by companies who have commercial offerings for the "better" protocol).

Todd Crane

Most vendors that take "flow" take both so there tends not to be THAT much religion unless you talk to someone who hates being flooded with 1:1 flow, or debugging broken (usually NetFlow) implementations.

In our experience, they both basically work for ops use cases nowadays, for major vendors of routers, and most switches.

sFlow gives faster feedback and more accurate results (adding things up and multiplying by sample rates lands closer to SNMP counter data) than most NetFlow/IPFIX implementations. How much varies from slightly to extreme (if you're using Catalysts for NetFlow/IPFIX).

My thesis overall re: why sFlow 'just works' a bit better is that it's just so much easier to implement sFlow because there's no need to track flows (hash table or whatever data structure you need). Just grab samples of headers, parse, fill structs, and send.

That said, things are hugely less sucky than 10 or even 5 years ago in the NetFlow world, and for the right vendor and box and software it's possible to get NetFlow/IPFIX essentially as accurate.

And as has been noted, at least in theory some boxes that do tens to hundreds of gigabits (or low terabits) of traffic support 1:1, which you could in theory also do with sFlow as a transport, but I haven't seen a switch or router that does that. Re: 1:1 flow - the boxes supporting that are generally not the biggest purchasable from Cisco or Juniper, but are used as the big-boy backbone and border routers by a good number of multi-terabit networks, and even some multi-tens-of-terabit networks.

Good luck in your flow journeys.

Avi Freedman
CEO, Kentik

Pavel_Odintsov · February 29, 2016, 7:41am

Sorry, but I could not understand what issues you've found in sflow.
Could you describe them in detail?

Recently I gave a talk at RIPE 71 and showed the pattern of a real attack which
reached 6 Gbps in the first 30 seconds (just check slide 6 here
http://www.slideshare.net/pavel_odintsov/ripe71-fastnetmon-open-source-dos-ddos-mitigation).

An sflow device could offer a 3-4 second detection time in this
case. But netflow __could__ delay telemetry by up to 30 seconds (in case
of a huge syn/syn-ack flood, for example) and your network will experience
downtime.

But with sflow you could detect and mitigate this attack in seconds.
Does that make sense?

Dobbins_Roland · February 29, 2016, 7:53am

Could you describe them in detail?

Inconsistent stats, lack of ifindex information.

But netflow __could__ delay telemetry by up to 30 seconds (in case of a huge syn/syn-ack flood, for example) and your network will experience downtime.

This is incorrect, and reflects an inaccurate understanding of how NetFlow/IPFIX actually works, in practice. It's often repeated by those with little or no operational experience with NetFlow/IPFIX.

Pavel_Odintsov · February 29, 2016, 8:12am

What do you mean by lack of ifindex in sflow?

I can offer an example description of the sflow v5 sample structure (it's from
my C++ based sflow parser, but it's actually pretty simple to
understand):

        #include <cstdint>     // uint32_t
        #include <sys/types.h> // ssize_t (POSIX)

        // The struct name is illustrative; the fields are the ones from the post.
        struct sflow_sample_header_t {
            uint32_t sample_sequence_number; // sample sequence number
            uint32_t source_id_type;         // source id type
            uint32_t source_id_index;        // source id index
            uint32_t sampling_rate;          // sampling ratio (1 in N)
            uint32_t sample_pool;            // total packets that could have been sampled
            uint32_t drops_count;            // number of drops due to hardware overload
            uint32_t input_port_type;        // input port type
            uint32_t input_port_index;       // input port index
            uint32_t output_port_type;       // output port type
            uint32_t output_port_index;      // output port index
            uint32_t number_of_flow_records;
            ssize_t original_payload_length;
        };

As you can see, we have the source id and the sampling rate, and we definitely have
full information about source and destination ifindexes.
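
As a rough illustration of how a collector might fill such a structure, the sketch below reads eleven 32-bit fields from a received buffer. It assumes the fields arrive in the order listed above and in network byte order (sFlow datagrams are XDR-encoded, so integers are big-endian); the struct and function names are hypothetical and this is not the fastnetmon code.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical mirror of the wire fields listed above (names are illustrative).
struct FlowSampleHeader {
    std::uint32_t sample_sequence_number;
    std::uint32_t source_id_type;
    std::uint32_t source_id_index;
    std::uint32_t sampling_rate;
    std::uint32_t sample_pool;
    std::uint32_t drops_count;
    std::uint32_t input_port_type;
    std::uint32_t input_port_index;
    std::uint32_t output_port_type;
    std::uint32_t output_port_index;
    std::uint32_t number_of_flow_records;
};

// Read one big-endian 32-bit integer from four bytes.
std::uint32_t read_be32(const std::uint8_t* p) {
    return (static_cast<std::uint32_t>(p[0]) << 24) |
           (static_cast<std::uint32_t>(p[1]) << 16) |
           (static_cast<std::uint32_t>(p[2]) << 8) |
           static_cast<std::uint32_t>(p[3]);
}

// Parse the fixed part of a flow sample. Returns false if the buffer is too short.
bool parse_flow_sample_header(const std::uint8_t* data, std::size_t len,
                              FlowSampleHeader& out) {
    constexpr std::size_t kFieldCount = 11;
    if (data == nullptr || len < kFieldCount * 4) {
        return false;
    }
    std::uint32_t f[kFieldCount];
    for (std::size_t i = 0; i < kFieldCount; ++i) {
        f[i] = read_be32(data + 4 * i);
    }
    out = FlowSampleHeader{f[0], f[1], f[2], f[3], f[4], f[5],
                           f[6], f[7], f[8], f[9], f[10]};
    return true;
}
```

The input and output port index fields are where the interface (ifindex) information would be read from, under the layout assumption stated above.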

In addition to the sample structure (which consists of the first X bytes of
each sampled packet) we have counter structures, which work like good old
"snmp counters" and offer detailed information about the load on each
port.

It looks like you don't have much field experience with sflow. I can
help and offer some real field experience below.

Dobbins_Roland · February 29, 2016, 8:38am

It looks like you don't have much field experience with sflow. I can
help and offer some real field experience below.

I've already recounted my real-world operational experience with NetFlow.

I have my own netflow collector implementation for netflow v5, netflow v9 and IPFIX (just check my repository

Coding something and using something operationally are two different things. I'm not a coder, but I've used NetFlow operationally since 1998, primarily on Cisco platforms (some Junipers, but I don't know a lot about Juniper boxes).

So you know about the Mikrotik implementation of netflow (its
minimum possible active and inactive timeout is 60 seconds)?

Yes. That does not equate to a 60s delay in detection/classifying/tracing back a SYN-flood, or anything else.

Or what about old Cisco routers which support only 180 seconds as active timeouts?

I think you're referring to the *default* value for the active flow timer, which can of course be altered.

Could they offer affordable time for telemetry delivery?

Yes, because there has never been any such router, and also because cache size and other tunable parameters, as well as FIFOing out of flows when the cache is full, guarantees that very few flows of the type seen in DDoS traffic hang around in the cache for any appreciable length of time.
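
The interplay described here can be sketched roughly as follows. This is a simplified illustration with hypothetical names, not how any particular router implements it: a flow is exported when it has been idle for the inactive timeout, when it has been in the cache for the active timeout, or when the cache is full and the oldest entry is pushed out first (FIFO).

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>
#include <vector>

// Hypothetical cache entry; all times are in seconds.
struct CachedFlow {
    std::uint32_t id;
    std::uint64_t first_seen;
    std::uint64_t last_seen;
};

enum class ExportReason { None, InactiveTimeout, ActiveTimeout };

// Decide whether a cached flow should be exported at time `now`.
ExportReason export_reason(const CachedFlow& flow, std::uint64_t now,
                           std::uint64_t active_timeout,
                           std::uint64_t inactive_timeout) {
    if (now - flow.last_seen >= inactive_timeout) {
        return ExportReason::InactiveTimeout;  // no packets seen recently
    }
    if (now - flow.first_seen >= active_timeout) {
        return ExportReason::ActiveTimeout;    // long-lived flow, export anyway
    }
    return ExportReason::None;
}

// Insert a new flow; if the cache is already full, export the oldest entry first.
void insert_flow(std::deque<CachedFlow>& cache, std::size_t max_entries,
                 const CachedFlow& flow, std::vector<CachedFlow>& exported) {
    if (!cache.empty() && cache.size() >= max_entries) {
        exported.push_back(cache.front());
        cache.pop_front();
    }
    cache.push_back(flow);
}
```

In this simplified model, a flood that creates many new flows fills the cache quickly, so entries are exported through the FIFO path well before either timer would fire.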

Because not all netflow implementations are OK. And definitely some netflow implementations are broken.

You can search the archives on this list and see my previous detailed explanation of NetFlow caveats on Cisco 6500/7600 with EARL6 and EARL7 ASICs.

Your statements about it taking an inordinately long time to detect/classify/traceback SYN-floods and other types of DDoS attacks utilizing NetFlow implementations (with the exceptions of crippled implementations like the aforementioned EARL6/EARL7 and pre-Sup7 Cisco 4500) are simply untrue.

Pavel_Odintsov · February 29, 2016, 8:53am

Thanks for the detailed answer!

But actually it's a mistake to think I don't have real field experience
just because I'm a developer. In the world of big companies nobody gets to do
both ops and development. But I'm trying to stay close to both worlds, and
I can conclude it's definitely possible.

It's definitely possible thanks to my flexible company.

But regarding "I think you're referring to the *default* value for the
active flow timer, which can of course be altered": it's not about the
default. It's about the minimum possible value.

Mikrotik routers have the same issue. The minimum possible timeout is 60 / 60,
and it is impossible to decrease it.

Also, many routers cannot do sufficiently accurate netflow without
additional (and very expensive) line cards just for netflow
generation.

OK, we could handle some sort of SYN flood.

But what about a 20 Gbps http flood with valid requests, where each
customer is real (and not spoofed) and they are sending huge POST
requests and hanging on to the connection?

How will netflow handle a connection with a correct handshake and an established
http session that has not been closed for a while?

Actually it could wait for the active/inactive timeout, and you will get
bad news from the ops guys about network downtime. But sflow will handle
it with flying colors, without delay.

What about destination http host detection with netflow? Could it
extract the "host" header from netflow? And drop only part of the traffic to
our own host?

Definitely not. Netflow doesn't have any information about http headers, but
sflow does.

What about the same issue for a dns flood, when somebody floods some
particular host? You could detect this attack with netflow. But you could
not extract information about the particular type of DDoS attack and the attacked
domain.

When we are speaking about "very rough" DDoS attack mitigation and
filtering, we could use netflow.

But when we really care about network stability, customer service
SLA and the ability to filter malicious traffic with perfect precision, we
should use sflow.

I would really like to hear feedback about my vision.

Dobbins_Roland · February 29, 2016, 9:42am

It's not about default. It's about minimal possible.

To my knowledge, there has never been a Cisco router which only allowed an active flow timer value of 180s, which wasn't user-configurable. I would appreciate the details of any such router.

Mikrotik routers have the same issue. The minimum possible timeout is 60 / 60,
and it is impossible to decrease it.

As we've seen already from another poster in this thread, that isn't the case.

Also, many routers cannot do sufficiently accurate netflow without
additional (and very expensive) line cards just for netflow
generation.

I believe you're referring to PICs on Juniper routers, yes? Or perhaps the requirement for E3 or E5 linecards on Cisco 12Ks? Or maybe DFCs on Cisco 6500s/7600s? Or possibly M-series linecards on Cisco N7Ks (which are switches, of course)?

TANSTAAFL.

OK, we could handle some sort of SYN flood.

As noted previously, this is indeed the case.

But what about a 20 Gbps http flood with valid requests, where each
customer is real (and not spoofed) and they are sending huge POST
requests and hanging on to the connection?

Attacks of this nature generally leave a 'wake' or 'contrail' which is pretty easily spotted if one's statistical anomaly detection routines are optimal.

Actually it could wait for active/inactive timeout and you will get
bad news from ops guys about network downtime.

As a network ops guy, I can assure you that you are incorrect, largely because you don't seem to understand the interplay of active flow timer, inactive flow timer, NetFlow cache size, NetFlow cache FIFOing, and normal flow cache baselines.

But sflow will handle it with flying colors without delay.

NetFlow handles it with flying colors without delay.

What about destination http host detection with netflow? Could it
extract the "host" header from netflow? And drop only part of the traffic to
our own host?

Of course not, for classical flow telemetry templates - but that's when one drops from the macroanalytical to the microanalytical. And flow telemetry doesn't 'drop' anything.

For some reason, you don't mention Flexible NetFlow at all. It's true that it's taken a while to become practical to use (back when the then-Cisco NetFlow PM asked me to create the CLI grammar and syntax for FNF, I noted that it wouldn't take off until there was a decent control-plane interface for creating, configuring, and tearing down dynamic flow caches, as well as some degree of ASIC support on larger platforms), but now that the various 'SDN'-type provisioning mechanisms are being implemented, and now that at least partial FNF is supported to varying degrees on various ASICs, this will hopefully change.

Netflow doesn't have any information about http headers, but sflow does.

See above. This isn't necessary, and it isn't possible at scale with sFlow, either.

What about the same issue for a dns flood, when somebody floods some
particular host? You could detect this attack with netflow. But you could
not extract information about the particular type of DDoS attack and the attacked
domain.

There's no need to do this with flow telemetry. Once the attack has been detected/classified/traced back, one drops to the microanalytical for situationally-appropriate mitigation.

When we are speaking about "very rough" DDoS attack mitigation and
filtering, we could use netflow.

Not just 'very rough', see above.

But when we really care about network stability, customer service
SLA and the ability to filter malicious traffic with perfect precision, we
should use sflow.

This is demonstrably incorrect. Many of the largest networks in the world successfully utilize NetFlow telemetry for all these purposes; they have for many years, and will continue to do so.

[And, btw, nothing has 'perfect precision'.]

That doesn't mean that NetFlow (or IPFIX) is perfect, and it doesn't mean that all implementations are perfect, and it doesn't mean that the ability to get more information about traffic via FNF or IPFIX EE mechanisms isn't desirable. But you are simply wrong about the utility of NetFlow and/or IPFIX with classical flow templates.

I would really like to hear feedback about my vision.

See above.

Pavel_Odintsov · February 29, 2016, 9:59am

Thanks for the detailed answer!

I have only one question. Why are you against sFlow protocol telemetry
with such huge passion?

It's not a proprietary technology and not a product from yet another
big company. I'm not trying to sell anything because... there is nothing to
sell. Really, is there?

It's just yet another open standard for analyzing data, approved and
implemented as an RFC.

Somebody developed this standard and implemented it in ASICs (which are
very expensive to develop and produce, you know). That means
"somebody" really wants it and will definitely use it.

Actually, sflow is not as popular as netflow. But to be honest it's
a pretty young standard in comparison with netflow, and it implements
a slightly different approach, which will be useful in some cases.

For example, at huge Internet Exchanges you actually don't have any
netflow-enabled devices (just check the design architectures of AMS-IX,
DE-CIX, LINX or even MSK-IX).

Almost all IXes were developed with L2 in mind, and they actually don't have any
devices which could produce netflow.

So an IX could not use netflow even if it wanted to. But you vote for "sflow
is a weird protocol and you should avoid it".

How could an IX monitor traffic if it doesn't have netflow? So if they
follow your recommendations, they should drop the idea of traffic
monitoring altogether.

I do not like holy wars about something vs something. But actually, in
the modern network world every technology has its applicable usage, and it's
not a good idea to avoid it just because your religion (I'm speaking
about the netflow religion) prohibits it.

Actually you are writing this email from a company email address, so I could
conclude it's Arbor's vision and not your own. Could you clarify that?
Could I present your vision as Arbor's vision in public speeches /
presentations?

Thanks!

Dobbins_Roland · February 29, 2016, 10:14am

I have only one question. Why are you against sFlow protocol telemetry with such huge passion?

Because I've had very poor experiences with it. And it doesn't seem to scale very well.

Actually, sflow is not as popular as netflow. But to be honest it's
a pretty young standard in comparison with netflow, and it implements
a slightly different approach.

sFlow has been around for a while, though. It isn't new.

So IX could not use netflow even if they want.

This depends upon the devices utilized - there are actually some devices which can export layer-2 NetFlow.

There are other issues with NetFlow as it's currently generally implemented which are also concerns with IX scenarios, FYI. I will leave it as an exercise for you to find out what they are.

But you vote for "sflow is weird protocol and you should avoid it".

My view is that it's generally better to use NetFlow or IPFIX, where and when possible.

How could an IX monitor traffic if it doesn't have netflow? So if they
follow your recommendations, they should drop the idea of traffic
monitoring altogether.

Straw man. I never said that nor implied it. If sFlow is all that's available, then of course operators can and should use it.

But actually, in the modern network world every technology has its applicable usage, and it's
not a good idea to avoid it just because your religion (I'm speaking
about the netflow religion) prohibits it.

It isn't 'religion'. It's based upon the fact that a) my experiences with sFlow have been suboptimal and b) sFlow isn't generally available on large routers used at network edges.

Actually you are writing this email from a company email address, so I could conclude it's Arbor's vision and not your own.

No, that is incorrect. I speak only for myself. And as I previously noted, Arbor products support sFlow, and have for many years; I'm just not a big fan of it.

Could you clarify it?

I just did.

Could I use your vision as Arbor's vision in public speeches / presentations?

No, you may not, per the above. Arbor is telemetry-neutral; we aim to support all relevant telemetry formats in line with the expressed needs of our customers. And that includes sFlow.

These trollish, passive-aggressive rhetorical tactics grow wearisome. I will not reply any further to this thread, so as to avoid further spamming the list.

Nikolay_Shopik · February 29, 2016, 10:15am

Cisco Nexus switches support sflow, since they are broadcom based.

Saku_Ytti1 · February 29, 2016, 12:03pm

I find that strange, because if you're doing it in HW, doing hash
lookup for flow and adding packets and bytes to the counter is cheap.
It's expensive having lot of those flows, but incrementing their
packet and byte counter isn't.

I know that all JNPR Trio kit (MX, T, EX9k...) do 1:1. I guess if
you're doing it in LC CPU things are very different.