NetFlow - path from Routers to Collector

In a message written on Wed, Sep 02, 2015 at 12:29:25AM +0700, Roland Dobbins wrote:

Bad advice. No amount of money will fix major platforms that are not happy to export flow telemetry via router management ports.

Correct. And for a proper network you may not wish to have those connections from in-band ports to your OOB/management network everywhere.

* Valdis.Kletnieks@vt.edu (Valdis.Kletnieks@vt.edu) [Tue 01 Sep 2015, 22:13 CEST]:

And that box ended up in your rack, why exactly?

Because variety of flow telemetry delivery options isn't the #1 ranked purchasing decider. Otherwise no Cisco would ever have been sold.

<insert excuses that boil down to "We were willing to accept gear that didn't have this functionality because we were OK with sending flow telemetry over the inband network">

Everything is a tradeoff. Welcome to the real world, where we have to make things work rather than pose on mailing lists about what we think other people should have.

  -- Niels.

This approach is fine for bitty boxes handling small quantities of traffic.
If you want to handle netflow data export for large amounts of traffic, it
would be pretty dumb to push it through the management plane of the router.

Nick

Agree. Most OOB is lacking redundancy too, so a single failure can really take the shine off an OOB deployment. Especially when you've put your management traffic on it, including radius traffic, and you're using 802.1X. Found that out the hard way a few years ago.

Chuck

There's no law that says that you must only plug designated management ports into OOB/DCN management networks.

Because variety of flow telemetry delivery options isn't the #1 ranked purchasing decider.

Actually, it is more often than you think. No use routing packets if you can't see what they do.

Otherwise no Cisco would ever have been sold.

Which is utter nonsense, of course, since Cisco a) invented flow telemetry and b) has been the consistent leader in innovating flow telemetry (FNF, IPFIX, anyone?). The EARL6/EARL7 problems are the only stumbles Cisco has made in this regard.

Concur 100%. You must use a port capable of doing so.

Roland,

Please stop digging. Sounds like you haven’t used Cisco recently. I’m happy to elaborate privately.

- Jared

Even if you're using old, grandfathered equipment for it, there's no reason why your OOB/DCN can't have a reasonable degree of redundancy. Since, it's like, *what you use to control your entire network*.

Underinvesting in management capabilities and capacities has always been a problem, of course. Some organizations just won't learn until they've gone through a disaster or three.

My experience in running large networks is these ports often can’t handle the traffic
involved.

The packet path in a Juniper (for example, going from the PFE -> RE -> Ethernet) is very
sensitive to the jitter introduced by increased traffic loads, which may result in the box
becoming unstable.

Other platforms (e.g.: IOS-XR based) have issues with the MgmtEther interfaces
which make them inoperable for many use-cases. There are many technical details
that are easily overlooked by those not using the routers to their abilities, so
a small network (as Wes mentioned before with 2500s/T1s) still in use as OOB is unlikely to see
data rates comparable to what is seen from a large router exporting data from hundreds of
gigs of flows.

Often net flow vendors tell customers things that create more flow records
which equals slightly higher data resolution but no actual net difference
in results except for the lowest of bitrates.

Making sure your flow implementation is optimized (ingress only, relevant links only)
is one part of having it scale. I’ve seen many a solution that scales poorly
or requires dozens of boxes for datasets that don’t require it. It’s
easy to over-specify for an attack because of the “Think of the Children^WDDoS”
mentality that exists, but when you are on the receiving end of a large attack
there are better tools to use.

- Jared

Please stop digging,

Since I'm not digging, I've no reason to stop. I see and deal with the various quirks of more different platforms exporting flow telemetry than most folks, all day, every day, so I know just a little bit about this topic.

Sounds like you haven’t used Cisco recently.

I use Cisco all the time, thanks. They aren't perfect - no vendor is. They have various issues with their NetFlow implementations on various platforms - for example, bursts of wildly inaccurate flow statistics from CRS boxes when a linecard is rebooted, a problem which has persisted for years and is just now being addressed. Odd stuff with EARL8 on Sup2T/DFC4 in certain configurations, and so forth.

But Niels is grossly exaggerating. I get very usable flow telemetry from them in many, many networks. I deal with flow telemetry from many, many vendors/platforms, and I can confidently assert that Cisco are nowhere near the bottom of the heap when it comes to the verisimilitude and functionality of their flow telemetry export. Quite the opposite.

Most OOB is lacking redundancy too, so a single failure can really take the shine off an OOB deployment.

Even if you're using old, grandfathered equipment for it, there's no reason why your OOB/DCN can't have a reasonable degree of redundancy. Since, it's like, *what you use to control your entire network*.

Most networks use inband to manage them.

Underinvesting in management capabilities and capacities has always been a problem, of course. Some organizations just won't learn until they've gone through a disaster or three.

Yes. Let me know when the vendors catch up in this area. I often see people say to create a new network as job security vs making the inband network survive attacks or be provisioned properly.

Most people I’ve seen have little data or insight into their networks, or don’t have the level that they would desire as tools are expensive or impossible to justify due to capital costs. Tossing in a recurring opex cost of DC XC fee + transport + XC fee + redundant aggregation often doesn’t have the ROI you infer here. I’ve put together some models in this area. It seems to me the DC/real estate companies involved could make a lot (more) money by offering an OOB service that is 10Mb/s flat-rate for the same as an XC fee and compete with their customers.

Things continue to be a challenge as less equipment works with a serial console, and the developers of these embedded solutions don’t take into account the low-bitrate connections that are often used in last-resort situations.

We have a well oiled set of processes and checklists to monitor and test our management network. Patrick Gilmore has personally mocked me because of its method and technique, but the reality is it works.

- Jared

Sure, or a VRF, or whatever.

While that's not ideal, it's far better than doing management-plane stuff inband in the production network, though.

And those 2500 console concentrator connections are a great resource to have when everything goes haywire and you need something that lets you get to and actually type on the console. I'm not knocking them, and I understand that old, grandfathered equipment is used for these applications, and understand that in many cases they're underprovisioned for flow telemetry.

Which is why using VLANs, VRFs, whatever on the production network gear is completely understandable, and a lot of folks do just as you say.

Please stop digging,

Since I'm not digging, I've no reason to stop. I see and deal with the various quirks of more different platforms exporting flow telemetry than most folks, all day, every day, so I know just a little bit about this topic.

You are. Avi has said that the number of people with a network is outnumbered about 50:1 using his most favorable numbers. This means for your one example there are 50 people not doing this and the world hasn’t ended for them. If you aren’t listening to Avi, please trust me: you don’t need your own OOB network for flow, nor is putting your flow there going to provide you some magical value. If you can’t provision enough bandwidth for your telemetry data, you will obviously need to prune it back. 1:10k sampling works and you don’t need much more than that unless you’re at extremely low bitrates. Most attacks last under 1 hour, and even the small ones shout out in netflow data when you do a simple hash sort with the proper keys. You can even use QoS to mitigate if your goal is attack traffic, as they’re mostly UDP-based attacks; see draft-byrne-opsec-udp-advisory-00 for some advice/input.
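As a rough illustration of the "simple hash sort with the proper keys" idea, here is a minimal Python sketch. Everything in it is an assumption made for illustration: the record field names, the choice of (destination IP, protocol, source port) as the key, the sample records, and the function name top_keys are hypothetical and do not come from any particular collector. It scales sampled byte counts by the 1:10k sampling rate and sorts the aggregated keys.

```python
from collections import defaultdict

SAMPLING_RATE = 10_000  # 1:10k packet sampling, the ratio mentioned above


def top_keys(flows, sampling_rate=SAMPLING_RATE, top_n=3):
    """Aggregate sampled flow records in a hash table and sort by volume.

    The key (dst_ip, proto, src_port) is an illustrative choice that suits
    reflection/amplification traffic; other keys may fit other cases.
    """
    totals = defaultdict(int)
    for flow in flows:
        key = (flow["dst_ip"], flow["proto"], flow["src_port"])
        # Scale the sampled byte count up to an estimate of real traffic.
        totals[key] += flow["bytes"] * sampling_rate
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top_n]


# Hypothetical sampled records (documentation address ranges only).
flows = [
    {"dst_ip": "192.0.2.10", "proto": 17, "src_port": 123, "bytes": 468},
    {"dst_ip": "192.0.2.10", "proto": 17, "src_port": 123, "bytes": 468},
    {"dst_ip": "192.0.2.10", "proto": 17, "src_port": 123, "bytes": 468},
    {"dst_ip": "192.0.2.10", "proto": 17, "src_port": 123, "bytes": 468},
    {"dst_ip": "198.51.100.7", "proto": 6, "src_port": 443, "bytes": 1500},
    {"dst_ip": "203.0.113.5", "proto": 6, "src_port": 80, "bytes": 600},
]

top = top_keys(flows)
for key, est_bytes in top:
    print(key, est_bytes)
```

In this toy data set the UDP/123 traffic toward a single destination sorts to the top even though only four sampled records describe it, which is the sense in which attack traffic "shouts out" at high sampling ratios.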

I’ve shared my own input at recent NANOG meetings, including policers to keep the attacks under control.

Sounds like you haven’t used Cisco recently.

I use Cisco all the time, thanks. They aren't perfect - no vendor is. They have various issues with their NetFlow implementations on various platforms - for example, bursts of wildly inaccurate flow statistics from CRS boxes when a linecard is rebooted, a problem which has persisted for years and is just now being addressed. Odd stuff with EARL8 on Sup2T/DFC4 in certain configurations, and so forth.

I’m not talking about datacenter-class equipment that you seem so focused on, like the EARL7 with the TICO etc., that did software sampling out of the hardware TCAM and would be overrun.

But Niels is grossly exaggerating. I get very usable flow telemetry from them in many, many networks. I deal with flow telemetry from many, many vendors/platforms, and I can confidently assert that Cisco are nowhere near the bottom of the heap when it comes to the verisimilitude and functionality of their flow telemetry export. Quite the opposite.

What people often don’t see is true “scale”[1] of netflow. When you have enough attributes or want to actually look at your IPv6 there have been significant shortcomings. We had to remind the patent holder for netflow how to implement it for their own hardware.

- Jared

aside: will you be in Yokohama? We should get lunch/dinner.

[1] - I hate this word, vendors use it as an excuse to hardcode limits and to not properly respond to valid use cases

It happens. You can deny it all you like, but I've seen it happen, and the resultant confusion and additional time to resolve problems it causes.

  I think the key here is that Roland isn't often constrained by these financial considerations.

That's entirely true. I deal every day with customers who are, though.

  I would respectfully disagree with Roland here and agree with Job, Niels, etc.

I understand where you and they are coming from, in this regard. I just disagree, as well.

  A few networks have robust out of band networks, but most I've seen have an interesting mixture of things

Concur 100%.

and inband is usually the best method.

Let me be clear - OOB for flow telemetry can be actually provisioned on the same boxes which are handling the production network traffic. It isn't ideal, but it's better than running it truly inband in the production network, mixed in with customer traffic. VLANs, VRFs, whatever are a reasonable compromise, and a lot of folks do this.

Inband is a huge risk, especially in a world of multi-hundred Gb/sec reflection/amplification attacks (not to mention the other catastrophic failure scenarios). I know you sink a lot of UDP at the edges of your network to ameliorate this problem, but not all operators do that or agree with it either in principle or as a matter of optimal utility. I understand that this sort of thing is a decision that all network operators must make for themselves based upon their knowledge of their own networks and customer needs.

  Those that do have "separate" networks may actually be CoC services from another department in the same company riding the same P/PE devices (sometimes routers).

Yes, that's what I'm getting at above. It isn't ideal, but there's no reason to make the perfect the enemy of the merely good, agreed.

  I've seen oob networks on DSL, datacenter wifi or cable swaps through the fence to an adjacent rack.

Absolutely. All kinds of creative lashups to get console access in difficult situations (and, as you noted previously, an increasing number of devices don't support serial console at all, which is highly annoying).

You are, Avi has said that the number of people with a network is outnumbered about 50:1 using his most favorable numbers.

Again, to clarify - I count VLANs/VRFs as being sufficiently out-of-band to handle flow telemetry on a reasonable basis without mixing it in with customer traffic.

That changes the ratio.

This means for your one example there are 50 people not doing this and the world hasn’t ended for them. If you aren’t listening to Avi, please trust me, you don’t need your own OOB network for flow, nor is putting your flow there going to provide you some magical value.

I agree with you, Avi, and others that a dedicated OOB network *just for flow telemetry* doesn't make economic sense in most (any?) scenarios.

What I'm saying is that it oughtn't to be mixed in with customer data-plane traffic. Ideally, all management-plane traffic would traverse a separate physical infrastructure. Since we don't live in an ideal world, virtual separation is generally Good Enough.

1:10k sampling works and you don’t need much more than that unless you’re at extremely low bitrates. Most attacks last under 1 hour and even the small ones shout out in netflow data doing a simple hash sort algorithm with the proper keys

Concur 100%. I spend a lot of time explaining to customers that no, they don't need/want 1:1 even if they could get it, and that the 'wake' left by attack traffic stands out very well even at relatively high sampling ratios.

Most of the network-oriented folks seem to grasp this pretty quickly. It's generally the 'security' types who often seem conceptually/attitudinally incapable of understanding these principles.

You can even use QoS to mitigate if your goal is attack traffic as they’re mostly UDP based attacks, see: draft-byrne-opsec-udp-advisory-00 for some advice/input.

I know you do this, and I understand why. Not everyone agrees with this and does it, and I also understand why (not).

NTP is easy, because there's the timesync packet-size classification hook. It gets a little dicier with other things.
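The packet-size hook can be sketched roughly as follows. This is only an illustration under stated assumptions: a basic NTP time-sync packet carries a 48-byte payload, so over IPv4 without options it is 76 bytes long (20 IPv4 + 8 UDP + 48 NTP). Packets with authentication or extension fields are longer, so a real policy would probably need more allowed lengths. The function name and class labels are hypothetical.

```python
NTP_PORT = 123

# Assumption: plain NTP time sync over IPv4 with no IP options,
# no authentication (MAC) and no extension fields:
# 20 (IPv4) + 8 (UDP) + 48 (NTP) = 76 bytes.
TIMESYNC_IP_LENGTHS = {76}


def classify_udp(src_port, dst_port, ip_total_length):
    """Return a coarse traffic class for a UDP packet.

    'ntp-timesync' is traffic one would normally leave alone;
    'ntp-other' is the bucket a policer or QoS class might target.
    """
    if NTP_PORT not in (src_port, dst_port):
        return "not-ntp"
    if ip_total_length in TIMESYNC_IP_LENGTHS:
        return "ntp-timesync"
    return "ntp-other"


print(classify_udp(123, 40000, 76))   # ordinary time-sync reply
print(classify_udp(123, 40000, 468))  # larger packet from port 123
print(classify_udp(53, 40000, 76))    # not NTP at all
```

Other UDP protocols generally lack such a clean size signature, which is presumably why it "gets a little dicier" for them.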

I’ve shared my own input at recent NANOG meetings, including policers to keep the attacks under control.

And it's valuable experience to share, nobody disputes that.

I’m not talking about datacenter-class equipment that you seem so focused on, like the EARL7 with the TICO etc., that did software sampling out of the hardware TCAM and would be overrun.

I'm pretty sure the CRSes I referred to with the linecard-reboot issue in my example aren't datacenter-class equipment.

;>

What people often don’t see is true “scale”[1] of netflow. When you have enough attributes or want to actually look at your IPv6 there have been significant shortcomings. We had to remind the patent holder for netflow how to implement it for their own hardware.

This is very true. IPv6 flow telemetry is another area in which IPv4/IPv6 feature parity lags. Because of your focus on large-scale IPv6 deployment over the course of many years, you see and experience a lot more IPv6-related deficiencies than most folks.

aside: will you be in Yokohama? We should get lunch/dinner.

Yes, and yes.

;>

[1] - I hate this word, vendors use it as an excuse to hardcode limits and to not properly respond to valid use cases

Concur 100%.

Another annoying vendor trait is use-case obsession. In many contexts, the right answer is to understand that there is a baseline plateau of vitally necessary scaling (that word, again) capacity and required functionality which is universally applicable, irrespective of variations in particular use cases.

I recently had a discussion with someone who was asking me how many attack sources one typically sees in a given DDoS attack. My response was that there is no 'typical'; and that for IPv4, the theoretical potential is 2^32 sources, while in IPv6, the theoretical potential is for 2^128 sources.

It was a light-bulb moment.

;>

Other platforms (e.g.: IOS-XR based) have issues with the MgmtEther interfaces which make them inoperable for many use-cases.

I'm agreeing with you. Dedicated management ports on many boxes don't actually support important management-plane functions, like flow telemetry - which is nuts, but that's what happens.

There are many technical details that are easily overlooked by those not using the routers to their abilities, so a small network (as Wes mentioned before with 2500s/T1s) still in use as OOB is unlikely to see
data rates comparable to what is seen from a large router exporting data from hundreds of
gigs of flows.

That's true. I understand that even on large networks, the OOB/DCN is built from old, grandfathered equipment. I spend a lot of time helping network operators calculate optimal flow sampling rates, flow cache sizes, etc., and an important consideration in making optimal configuration choices is what the OOB/DCN network can handle.
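To make the "what the OOB/DCN network can handle" consideration concrete, here is a back-of-the-envelope Python sketch of export bandwidth. It assumes NetFlow v5 framing (24-byte header, 48-byte records, at most 30 records per datagram) over IPv4/UDP, fully packed datagrams, and no layer-2 overhead. The flow-record rate is a made-up input, not a measurement from anyone's network, and the function name is hypothetical.

```python
import math

# NetFlow v5 export datagram layout.
V5_HEADER_BYTES = 24
V5_RECORD_BYTES = 48
V5_MAX_RECORDS = 30

# Assumption: IPv4 (20 bytes) + UDP (8 bytes), layer-2 framing ignored.
UDP_IP_OVERHEAD_BYTES = 28


def export_bits_per_sec(flow_records_per_sec):
    """Estimate export bandwidth, assuming every datagram is fully packed."""
    datagrams = math.ceil(flow_records_per_sec / V5_MAX_RECORDS)
    per_datagram_overhead = V5_HEADER_BYTES + UDP_IP_OVERHEAD_BYTES
    total_bytes = (datagrams * per_datagram_overhead
                   + flow_records_per_sec * V5_RECORD_BYTES)
    return total_bytes * 8


# Hypothetical export rate of 30,000 flow records per second.
bps = export_bits_per_sec(30_000)
print(f"{bps / 1e6:.2f} Mb/s")
```

Under these assumptions, 30,000 records per second works out to roughly 12 Mb/s, which would already exceed a 10 Mb/s management link. Raising the sampling ratio (sampling fewer packets) or exporting only from relevant links lowers the record rate and therefore the export bandwidth.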

Often net flow vendors tell customers things that create more flow records which equals slightly higher data resolution but no actual net difference in results except for the lowest of bitrates.

Concur 100%. I spend a non-trivial amount of time talking folks down from the assumption that unnecessarily-low flow sampling ratios are required (these are mainly 'security' folks, not network engineers).

As with everything in life...

Mark.

Not very straightforward when you have a network spanning several
continents.

Mark.