Rate-limiting BCOP?

I'm curious whether the community would be willing to share their best practices, recommendations and thoughts on how they handle situations where a customer buys X amount of bandwidth, but the physical link is capable of Y, where Y > X. (Yes, I speak of policy-maps, tx/rx queues, etc.)

For example, it is arguably common to aggregate customer links at Layer 2, and then push them upstream to where they anchor at Layer 3. That Layer 2 <-> Layer 3 span could be a couple of meters or several kilometers.

So, as I see it, my options are:

* Rate-limit at the Layer 2 switch for both customer ingress/egress,

* Rate-limit at the Layer 3 router upstream, ingress/egress, or

* Some combination thereof? E.g.: Rate-limit my traffic towards the customer closer to the core, and rate-limit ingress closer to the edge?

I've done all three on some level in my travels, but in the past it's also often been vendor-centric, which hindered a scalable or "templateable" solution. (Some things police in only one direction, or only police well in one direction, etc.)

In case someone is interested in a tangible example, imagine an Arista switch and an ASR9k router. :-)

Thoughts?

> I've done all three on some level in my travels, but in the past it's
> also often been vendor-centric, which hindered a scalable or
> "templateable" solution. (Some things police in only one direction, or
> only police well in one direction, etc.)

I may misunderstand something here, but are you looking for
vendor-agnostic solutions to replace your vendor-specific ones? That is
an unrealistic goal, and it's not clear to me why it would be important
what spell incantation is needed at any given moment.

If you just need to 'rate-limit' and don't need to discriminate
traffic in any way, it's ezpz🍋sqz. The only thing you need to consider
is your step-down: is the ingress rate higher than the egress rate? If
you have a speed step-down, then you need to consider at which RTT you
still guarantee full rate on a single TCP session. If you police, you
need to allow a burst size which can take in the TCP window growth; if
you use a shaper, you need to configure buffers which can ingest it.
Consider the sender is 10Gbps, the RTT is 100ms, and the receiver is
1Gbps. The window the sender may burst is 9Gbps*100ms/2 = 56.25MB.
Beyond some point, say a sender more than 100ms of RTT away, you
contractually don't guarantee the customer full rate on a single TCP
session; you point the ticket at the product definition and close it.
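
For anyone who wants to reuse that arithmetic, here is a minimal Python
sketch of the same calculation; the burst_bytes helper is hypothetical,
and the 10G/1G/100ms figures are only the illustration above, not a
recommendation:

# Worst-case burst a policer's burst size or a shaper's buffer must absorb
# so a single TCP flow can still reach the contracted rate across a speed
# step-down (same arithmetic as the example above).
def burst_bytes(sender_bps, receiver_bps, rtt_s):
    return (sender_bps - receiver_bps) * rtt_s / 2 / 8  # bits -> bytes

print(burst_bytes(10e9, 1e9, 0.100) / 1e6, "MB")  # 10G -> 1G at 100ms RTT: 56.25 MB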

Now if you need to discriminate, things can get very complex.
Particularly if you do QoS at the L3 aggregation point at the L2 access
physical rate, you need to understand very well what the policer is
counting: the L1, L2 or L3 rate? And how to get it to count the L1 rate.
Ideally you'd do all QoS at the congestion point, at ANET, and not on
the subinterfaces. But often the access device is too dumb to do the QoS
you need.
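
To make the L1/L2/L3 distinction concrete, here is a small illustrative
Python sketch; the Ethernet and dot1q overheads are generic assumptions
on my part, not any platform's accounting values:

# Wire (L1) rate produced by a shaper that only counts L2 bytes.
# Assumes plain Ethernet with a single dot1q tag; adjust to your encapsulation.
PREAMBLE_SFD = 8            # preamble + start-of-frame delimiter, per frame
IFG = 12                    # minimum inter-frame gap, in byte times
L2_OVERHEAD = 14 + 4 + 4    # MACs/EtherType + dot1q tag + FCS

def l1_rate(shaper_bps, ip_len):
    l2 = ip_len + L2_OVERHEAD
    return shaper_bps * (l2 + PREAMBLE_SFD + IFG) / l2

for ip_len in (64, 512, 1500):
    print(ip_len, round(l1_rate(1e9, ip_len) / 1e9, 3), "Gbps on the wire")
# Small packets overshoot a 1Gbps access port by over 20%, which is why a
# QoS-unaware access device downstream ends up dropping in-contract traffic.

The same arithmetic run in reverse is roughly what a per-packet
accounting adjustment has to compensate for, if you want the shaper to
behave as if it counted L1.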

Let's take the complex example: you need to discriminate and do it all
in CSCO, while ANET does just BE and is QoS-unaware. We have a 10GE
interface on CSCO and the customer is 1GE-connected on ANET.

Configuration would be something like this:

interface TenGigE0/1/2/3/4.100
 service-policy output CUST:XYZ:PARENT account user-defined 28
 encapsulation dot1q 100
!
policy-map CUST:XYZ:PARENT
 class class-default
  service-policy CUST:XYZ:CHILD
  shape average 1 gbps
 !
 end-policy-map
!

RP/0/RSP1/CPU0:r15.labxtx01.us.bb#show run policy-map CUST:XYZ:CHILD
policy-map CUST:XYZ:CHILD
 class NC
  bandwidth percent 1
 !
 class AF
  bandwidth percent 20
 !
 class BE
  bandwidth percent 78
 !
 class LE
  bandwidth percent 1
 !
 class class-default
 !
 end-policy-map
!

Homework:
  - what RED curve to use
  - how much you should buffer in a given class at worst (in some cases
the burst cannot be configured small enough, and you need to offer a
lower-than-bought rate to be able to honor the QoS contract); a rough
sketch of this follows below
  - what the right shaper burst is
  - what to map to each class and how (tip: classify exclusively on the
ingress interface by set qos-group, and on egress match exclusively on
the qos-group in each class; this methodology translates across
platforms)
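
For the buffering and shaper-burst items, here is a rough sketch of the
arithmetic I would start from, assuming each class queue is sized for a
target worst-case delay at its guaranteed share of the 1Gbps parent
shaper; the delay targets below are placeholders, not recommendations:

# Per-class queue sizing for the CUST:XYZ:CHILD policy above, assuming a
# queue should hold 'delay' seconds of traffic at the class's guaranteed
# share of the 1Gbps parent shaper.
PARENT_BPS = 1e9
CLASSES = {          # name: (bandwidth percent, placeholder max delay in s)
    "NC": (1, 0.010),
    "AF": (20, 0.010),
    "BE": (78, 0.050),
    "LE": (1, 0.050),
}

for name, (pct, delay) in CLASSES.items():
    guaranteed_bps = PARENT_BPS * pct / 100
    queue_bytes = guaranteed_bps * delay / 8
    print(f"{name}: {guaranteed_bps / 1e6:.0f} Mbps guaranteed, "
          f"~{queue_bytes / 1e3:.0f} kB of queue")
# If the platform's smallest configurable burst/queue is larger than what
# this math allows, that is the case where you may have to offer a
# lower-than-bought rate to still honor the QoS contract.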

If your 'account user-defined' is wrong even by a byte, your
in-contract traffic will be dropped by ANET, because CSCO is then
admitting more than 1Gbps of ethL1 rate, and ANET is QoS-unaware, so it
will drop LE with just the same probability as AF.

A further complication: let's assume you are all-Tomahawk on the ASR9k.
Let's assume TenGigE0/1/2/3/4 as a whole is pushing 6Gbps of traffic
across all VLANs, everything is in-contract, and nothing is being
dropped for any VLAN in any class. Now VLAN 200 gets a DDoS attack of
20Gbps coming from a single backbone interface, i.e. we are offering
that tengig interface 26Gbps of traffic. What will happen is that all
VLANs start dropping packets QoS-unaware: 12.5Gbps is being dropped by
the ingress NPU, which is not aware of which VLAN the traffic is going
to, nor is it aware of the QoS policy on the egress VLAN. So VLAN 100
starts to see NC, AF, BE and LE drops, even though the offered rate in
VLAN 100 remains in-contract in all classes.
To mitigate this to a degree, on the backbone side of the ASR9k you need
to set the VoQ priority; you have 3 priorities. You could choose, for
example, BE as P2, NC+AF as P1 and LE as Pdefault. Then, if the attack
traffic to VLAN 200 is recognised and classified as LE, we will only see
VLAN 100's LE dropping (as well as every other VLAN's LE) instead of all
the classes.

To wish that this would be vendor-agnostic is just not realistic, as
there are very specific platform decisions which impact your QoS
design.

To stress how critical the accounting is if you do QoS in the 'wrong'
place: https://ytti.fi/after.png

In this picture BE is out of contract, while AFnb, AFb and EF are all
in-contract. However, the customer sees loss in all classes. This is
because the L3 device is shaping at the L2 rate, not the L1 rate, so it
is sending 1Gbps of L2 to the customer, which is more than 1Gbps on the
wire towards a port physically limited to 1Gbps, forcing the QoS-unaware
access device to drop. The only thing fixed at 130 is the accounting
parameter, which causes the L3 device to reduce the rate it can send and
causes it to start dropping; and as it is QoS-aware, it can honor the
contract, so all drops move to the out-of-contract class, BE.

> * Rate-limit at the Layer 2 switch for both customer ingress/egress,

In the past, we did this on the routers, as most switches only supported
ingress policing and egress shaping, often with very tiny buffers.

More recently, some switches do support both ingress and egress
policing. Being able to do this as close to the customer as possible is
always most effective, especially if you run LAGs between a switch and
upstream router.

> * Rate-limit at the Layer 3 router upstream, ingress/egress, or

This is how we used to do it, but it became problematic when you ran
LAGs between switches and routers.

However, between switches now supporting ingress/egress policing and a
move away from switch-router LAGs to native 100Gbps trunks, you can
police on either the router or the switch without much concern. The
choice is determined by the number of services customers buy on a
single switch port.

> * Some combination thereof? E.g.: Rate-limit my traffic towards the
> customer closer to the core, and rate-limit ingress closer to the edge?

Where we run LAGs between routers and switches, we police on the switch.

Where we run 100Gbps native trunks between switches and routers, we
police on the router depending on the type of service, e.g., a Q-in-Q
setup for a customer where different services delivered on the same
switch port have different policing requirements.

> I've done all three on some level in my travels, but in the past it's
> also often been vendor-centric, which hindered a scalable or
> "templateable" solution. (Some things police in only one direction, or
> only police well in one direction, etc.)

Yes, we've oscillated between different methods depending, particularly,
on what (switch) vendor we used.

> In case someone is interested in a tangible example, imagine an Arista
> switch and an ASR9k router. :-)

Arista do support ingress/egress policing (tested on the 7280R). The
previous Juniper EX4550s we ran only shaped on egress, and that was
problematic due to the small buffers they have.

You should have a lot more flexibility on the ASR9000 router, except in
cases where you need to police services delivered on a LAG.

Mark.

hey,

> Being able to do this as close to the customer as possible is always
> most effective, especially if you run LAGs between a switch and
> upstream router.

DDoS can be a problem in this scenario. Assuming the PEs have plenty of capacity available and you can afford to let DDoS traffic reach the PE, you would shape to the customer's contract speed, drop the DDoS traffic, and not congest your access device's uplink.

That only works provided you are using a strictly egress-queueing
platform, which the OP's ASR9k is not; its ingress NPU will drop
packets, causing all customers sharing the physical interface to suffer.

hey,

> That only works provided you are using a strictly egress-queueing
> platform, which the OP's ASR9k is not; its ingress NPU will drop
> packets, causing all customers sharing the physical interface to suffer.

Correct, QoS is a tricky thing that needs to be planned correctly. I was just pointing out additional benefits (or drawbacks, depending on where you look at it from).

That is one advantage of policing at the switch port, yes. But that
would be to manage traffic coming in from the customer.

If the attack traffic is coming from the Internet (toward the customer),
then policing on the router saves the router-switch trunk.

Either way, over-sizing router-switch trunks is always best.

Mark.

Saku Ytti
Sent: Friday, May 22, 2020 7:52 AM

> I've done all three on some level in my travels, but in the past it's
> also often been vendor-centric, which hindered a scalable or
> "templateable" solution. (Some things police in only one direction, or
> only police well in one direction, etc.)

> A further complication: let's assume you are all-Tomahawk on the ASR9k.
> Let's assume TenGigE0/1/2/3/4 as a whole is pushing 6Gbps of traffic across
> all VLANs, everything is in-contract, and nothing is being dropped for any
> VLAN in any class. Now VLAN 200 gets a DDoS attack of 20Gbps coming from a
> single backbone interface, i.e. we are offering that tengig interface 26Gbps
> of traffic. What will happen is that all VLANs start dropping packets
> QoS-unaware: 12.5Gbps is being dropped by the ingress NPU, which is not
> aware of which VLAN the traffic is going to, nor is it aware of the QoS
> policy on the egress VLAN.

Hmm, is that so?
Shouldn’t the egress FIA(NPU) be issuing fabric grants (via central Arbiters) to ingress FIA(NPU) for any of the VOQs all the way up till egress NPU's processing capacity, i.e. till the egress NPU can still cope with the overall pps rate (i.e. pps rate from fabric & pps rate from "edge" interfaces), subject to ingress NPU fairness of course?
Or in other words, shouldn't all or most of the 26Gbps end up on egress NPU, since it most likely has all the necessary pps processing capacity to deal with the packets at the rate they are arriving, and decide for each based on local classification and queuing policy whether to enqueue the packet or drop it?

Looking at my notes (from discussions with Xander Thuijs and Aleksandar Vidakovic):
  - Each 10G entity is represented by one VQI = 4 VOQs (one VOQ for each priority level).
  - The trigger for the back-pressure is the utilisation level of the RFD buffers.
  - RFD buffers hold the packets while the NP microcode is processing them. If you search for BRKSPG-2904: the more feature processing the packet goes through, the longer it stays in the RFD buffers.
  - RFD buffers are from-fabric feeder queues. Fabric-side backpressure kicks in if the RFD queues are more than 60% full.

So, according to the above, should the egress NPU be powerful enough to deal with 26Gbps of traffic coming from the fabric in addition to whatever business-as-usual duties it's performing (i.e., RFD queue utilization is below 60%), then no drops should occur on the ingress NPU (originating the DDoS traffic).

> Shouldn’t the egress FIA(NPU) be issuing fabric grants (via central Arbiters) to ingress FIA(NPU) for any of the VOQs all the way up till egress NPU's processing capacity, i.e. till the egress NPU can still cope with the overall pps rate (i.e. pps rate from fabric & pps rate from "edge" interfaces), subject to ingress NPU fairness of course?

This is how it works in, say, the MX. But in the ASR9k the VoQs are
artificially policed, no questions asked. And as the policers are
port-level, if you subdivide them via satellite or VLAN you'll have
collateral damage. Technically the policer is programmable, and there
is CLI, but the CLI config is a binary choice between two low values,
not arbitrary.

> Or in other words, shouldn't all or most of the 26Gbps end up on egress NPU, since it most likely has all the necessary pps processing capacity to deal with the packets at the rate they are arriving, and decide for each based on local classification and queuing policy whether to enqueue the packet or drop it?

No, as per the explanation given. Basically, don't subdivide ports, or
don't get attacked.