External BGP Controller for L3 Switch BGP routing

Hello,

A while back there was a discussion on how to do optimized (dynamic) BGP routing on an L3 switch which is only capable of handling a subset of the BGP routing table.

Someone pointed out that there was a project to do just that, and posted a link to a presentation by a European operator (Ireland?) who had written some code around ExaBGP to create such a setup.

(I am going by memory...) Needless to say, I am trying to find that link, or the name of that project.

Anyone who can help refresh my memory with the link (my search skills are failing to find that presentation!) would be greatly appreciated.

Many Thanks in Advance.

Faisal Imtiaz

Tore Anderson:

https://www.redpill-linpro.com/sysadvent/2016/12/09/slimming-routing-table.html

❦ 14 January 2017 05:24 GMT, Faisal Imtiaz <faisal@snappytelecom.net>:

A while back there was a discussion on how to do optimized (dynamic)
BGP routing on an L3 switch which is only capable of handling a subset
of the BGP routing table.

Someone pointed out that there was a project to do just that, and
posted a link to a presentation by a European operator (Ireland?)
who had written some code around ExaBGP to create such a setup.

Maybe: GitHub - dbarrosop/sir: SDN Internet Router

Hey,

Hi Saku,

> Slimming down the Internet routing table – /techblog

---
As described in a previous post, we’re testing an HPE Altoline 6920 in
our lab. The Altoline 6920 is, like other switches based on the
Broadcom Trident II chipset, able to handle up to 720 Gbps of
throughput, packing 48x10GbE + 6x40GbE ports in a compact 1RU chassis.
Its price is in all likelihood a single-digit percentage of the price
of a traditional Internet router with a comparable throughput rating.
---

This makes it sound like a small-FIB router costs a single-digit
percentage of a full-FIB one.

Do you know of any traditional «Internet scale» router that can do ~720
Gbps of throughput for less than 10x the price of a Trident II box? Or
even <100kUSD? (Disregarding any volume discounts.)

Also, having Trident on an Internet-facing interface may be suspect,
especially if you need to go from a fast interface to a slow or busy
interface, due to its very small packet buffers. This obviously won't
be much of a problem for inside-DC traffic.

Quite the opposite, changing between different interface speeds happens
very commonly inside the data centre (and most of the time it's done by
shallow-buffered switches using Trident II or similar chips).

One ubiquitous configuration has the servers and any external uplinks
attached with 10GE to leaf switches, which in turn connect to a 40GE
spine layer. In this config server<->server and server<->Internet
packets will need to change speed twice:

[server]-10GE-(leafX)-40GE-(spine)-40GE-(leafY)-10GE-[server/internet]

I suppose you could for example use a couple of MX240s or something as
a special-purpose leaf layer for external connectivity.
MPC5E-40G10G-IRB or something towards the 40GE spines and any regular
10GE MPC towards the exits. That way you'd only have one
shallow-buffered speed conversion remaining. But I'm very sceptical if
something like this makes sense after taking the cost/benefit ratio
into account.

Tore

In my setup, I use a BIRD instance to combine multiple full Internet
tables, and I use some filters to generate override routes to send to my
L3 switch for routing. The L3 switch is configured with a default route
to the main transit provider; if BIRD is down, the routing will be
unoptimized, but everything else remains operable until I fix that BIRD
instance.
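The override-route idea (and, as it turns out, SIR's approach) boils down to ranking prefixes by observed traffic and installing only the heavy hitters, with a default route catching the rest. A minimal sketch in Python, with hypothetical per-prefix byte counters:

```python
# Sketch of partial-FIB route selection: install only the prefixes that
# carry the most traffic; everything else follows the default route.
# The traffic counters below are hypothetical sample data.

def select_prefixes(traffic_bytes, fib_slots):
    """Return the busiest prefixes, up to the hardware FIB capacity."""
    ranked = sorted(traffic_bytes.items(), key=lambda kv: kv[1], reverse=True)
    return [prefix for prefix, _ in ranked[:fib_slots]]

traffic = {
    "203.0.113.0/24":  9_500_000_000,
    "198.51.100.0/24": 7_200_000_000,
    "192.0.2.0/24":          120_000,
    "100.64.0.0/10":          80_000,
}

# Pretend the switch has room for only two routes besides 0.0.0.0/0.
selected = select_prefixes(traffic, fib_slots=2)
print(selected)
```

In a real deployment the counters would come from sFlow/NetFlow and the selected routes would be announced to the switch over BGP (via BIRD or ExaBGP), as described above.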

I've asked around about why there isn't an L3 switch capable of handling
full tables; I really don't understand the difference/logic behind it.

I'm going to be keeping a close eye on this:

http://blogs.cisco.com/sp/a-bigger-helping-of-internet-please

Hey,

Do you know of any traditional «Internet scale» router that can do ~720
Gbps of throughput for less than 10x the price of a Trident II box? Or
even <100kUSD? (Disregarding any volume discounts.)

It's really hard to talk about pricing, as it's very dependent on many
factors. But I guess pretty much all Jericho boxes would fit that
bill? Arista will probably set you back anywhere in the 15-35k range,
will do a full table (for now) and has deep packet buffers. NCS5501 is
also sub-100k, even with external TCAM. Probably around 40k for a single
unit without external TCAM, and 60k with external TCAM, where you lose
8x10G and 2x100G ports.

But my comment wasn't really about what is available now; it was more
fundamentally about the economics of large FIBs or large buffers: they
are not inherently very BOM-expensive.

I wonder if true whitelabel is possible: would some 'real' HW vendor
of BRCM's size release HW docs openly? Then some integrator could start
selling the HW at BOM+10-20%, with no support and no software at all,
and the community could build the actual software on it.
It seems to me that what is keeping us away from near-BOM prices is
software engineering, and we cannot do it as a community, as HW docs
are not available.

Quite the opposite, changing between different interface speeds happens
very commonly inside the data centre (and most of the time it's done by
shallow-buffered switches using Trident II or similar chips).

The reason I said it won't be a problem inside the DC is low RTT, which
means small bursts. I'm talking about backend network infra in the DC,
not Internet facing. Anywhere you'll see large RTT and a
speed/availability step-down you'll need buffers (unless we change TCP
to pace window growth instead of bursting as it does now; AFAIK you
could already configure your Linux server to pace at the estimated BW,
but then you'd lose on congested links, as more aggressive TCP stacks
would beat you into oblivion).
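For reference, Linux exposes per-socket pacing via the SO_MAX_PACING_RATE socket option (enforced by the fq qdisc or TCP's internal pacing). A minimal sketch, assuming a Linux host; the constant's numeric value is 47 on Linux and may not be exported by older Python versions:

```python
import socket

# SO_MAX_PACING_RATE caps a socket's send rate in bytes per second
# (Linux >= 3.13). Fall back to the raw Linux value if this Python
# does not export the constant.
SO_MAX_PACING_RATE = getattr(socket, "SO_MAX_PACING_RATE", 47)

def pacing_rate_bytes(link_bits_per_sec):
    """Convert an estimated bottleneck bandwidth in bit/s to byte/s."""
    return link_bits_per_sec // 8

rate = pacing_rate_bytes(10_000_000_000)  # pace at an estimated ~10 Gb/s
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
try:
    sock.setsockopt(socket.SOL_SOCKET, SO_MAX_PACING_RATE, rate)
except OSError:
    pass  # non-Linux kernels will reject the option
sock.close()
```

As noted above, pacing at the estimated bandwidth only helps if competing traffic on the congested link is not more aggressive.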

I suppose you could for example use a couple of MX240s or something as
a special-purpose leaf layer for external connectivity.
MPC5E-40G10G-IRB or something towards the 40GE spines and any regular
10GE MPC towards the exits. That way you'd only have one
shallow-buffered speed conversion remaining. But I'm very sceptical if
something like this makes sense after taking the cost/benefit ratio
into account.

The MPC indeed is on a completely different level in BOM, as it's an
NPU with lookup and packets in DRAM, fairly complicated and
space-inefficient. But we have pipeline chips on the market with deep
buffers and full DFZ. There is no real reason the markup on them should
be significant; the control plane should cost more. This is why the
promise of the XEON router is odd to me, as it's a fundamentally very
expensive chip, combined with poorly predictable performance (jitter,
latency...)

* Saku Ytti <saku@ytti.fi>

The reason I said it won't be a problem inside the DC is low RTT, which
means small bursts. I'm talking about backend network infra in the DC,
not Internet facing. Anywhere you'll see large RTT and a
speed/availability step-down you'll need buffers (unless we change TCP
to pace window growth instead of bursting as it does now; AFAIK you
could already configure your Linux server to pace at the estimated BW,
but then you'd lose on congested links, as more aggressive TCP stacks
would beat you into oblivion).

But here you're talking about the RTT of each individual link, right,
not the RTT of the entire path through the Internet for any given flow?

Put it another way, my «Internet facing» interfaces are typically 10GEs with
a few (kilo)metres of dark fibre that x-connects into my IP-transit providers'
routers sitting in nearby rooms or racks (worst case somewhere else in
the same metro area). Is there any reason why I should need deep
buffers on those interfaces?

The IP-transit providers might need the deep buffers somewhere in their
networks, sure. But if so I'm thinking that's a problem I'm paying them
to not have to worry about.

BTW, in my experience buffering and tail-dropping is actually a
bigger problem inside the data centre, because of distributed
applications causing incast. So we get workarounds like DCTCP and BBR,
which are apparently cheaper than using deep-buffer switches everywhere.

Tore

Arista has a version of their switches that can handle a full table.

I think what the OP is asking about, though, is something like OpenFlow. Some have played around with using it to modify the switch's routing table based on the flows that exist. The same theory applies to the presentation link provided (we don't need the full table 99% of the time, so just insert what you need).

Using filters is an "old school" technique that's been around for a long time, and I don't think that's what he's asking.

It would be the RTT of the entire path through the Internet over which TCP establishes sessions. Longer RTT, larger window, bigger bursts.

So, yes, you'll need deeper buffers for serving IP transit, regardless of the local link (router-to-router) latency.

For data centres, internal traffic within the DC is low-latency end to end, so bursts are relatively small.

James

I thought your post was fairly self-explanatory, but people seem to be all over the place... except for what you actually asked about.

But here you're talking about the RTT of each individual link, right,
not the RTT of the entire path through the Internet for any given flow?

I'm talking about the end-to-end RTT, which will determine the window
size, which will determine the burst size. Your worst burst will be
half of the needed window size, and you need to be able to ingest this
burst at the sender's rate, regardless of the receiver's rate.

Put it another way, my «Internet facing» interfaces are typically 10GEs with
a few (kilo)metres of dark fibre that x-connects into my IP-transit providers'
routers sitting in nearby rooms or racks (worst case somewhere else in
the same metro area). Is there any reason why I should need deep
buffers on those interfaces?

Imagine a content network having a 40Gbps connection, a client having
a 10Gbps connection, and a lossless network between them with an RTT of
200ms. To achieve a 10Gbps rate the receiver needs a 10Gbps*200ms =
250MB window; in the worst case a 125MB window could grow into a 250MB
window, and the sender could send that 125MB as a 40Gbps burst.
This means the port the receiver is attached to needs to store the
125MB, as it's only serialising it at 10Gbps. If it cannot store it,
the window will shrink and the receiver cannot get 10Gbps.

This is quite a pathological example, but you can try with much less
pathological numbers, remembering the Trident II has 12MB of buffers.
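The arithmetic in this example is easy to check. A small sketch using the same numbers (1 MB = 10^6 bytes here):

```python
# Check the window/burst arithmetic from the example above.
# Rates in bits per second, RTT in seconds, results in bytes.

def window_bytes(rate_bps, rtt_s):
    """Bandwidth-delay product: window needed to fill the path."""
    return int(rate_bps * rtt_s / 8)

def burst_backlog_bytes(burst_bytes, in_bps, out_bps):
    """Bytes left queued when a burst arrives at in_bps but drains
    (serialises) at the slower out_bps."""
    return int(burst_bytes * (1 - out_bps / in_bps))

w = window_bytes(10e9, 0.200)               # 250 MB window: 10 Gb/s, 200 ms
b = burst_backlog_bytes(125e6, 40e9, 10e9)  # 125 MB burst in at 40G, out at 10G
print(w, b)
```

Even the transient backlog (about 94 MB) is roughly eight times the Trident II's 12 MB of shared buffer.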

❦ 16 January 2017 14:08 +0200, Saku Ytti <saku@ytti.fi>:

I wonder if true whitelabel is possible: would some 'real' HW vendor
of BRCM's size release HW docs openly? Then some integrator could start
selling the HW at BOM+10-20%, with no support and no software at all,
and the community could build the actual software on it.
It seems to me that what is keeping us away from near-BOM prices is
software engineering, and we cannot do it as a community, as HW docs
are not available.

Mellanox, with switches like the SN2700. I don't know how open the
hardware documentation is, but they are pushing support for their ASIC
directly into Linux (look at drivers/net/ethernet/mellanox/mlxsw). They
are also contributing to the switchdev framework, which will at some
point allow transparent acceleration of a Linux box (switching,
routing, tunneling, firewalling, etc.), as we already have with
Cumulus Linux.

The datasheet is quite scarce. There are 88k L2 forwarding entries, but
no word on L3. Buffer sizes are not mentioned. But I suppose that
someone interested would be able to get more detailed information.

* Saku Ytti

> Put it another way, my «Internet facing» interfaces are typically
> 10GEs with a few (kilo)metres of dark fibre that x-connects into my
> IP-transit providers' routers sitting in nearby rooms or racks
> (worst case somewhere else in the same metro area). Is there any
> reason why I should need deep buffers on those interfaces?

Imagine a content network having a 40Gbps connection, a client having
a 10Gbps connection, and a lossless network between them with an RTT of
200ms. To achieve a 10Gbps rate the receiver needs a 10Gbps*200ms =
250MB window; in the worst case a 125MB window could grow into a 250MB
window, and the sender could send that 125MB as a 40Gbps burst.
This means the port the receiver is attached to needs to store the
125MB, as it's only serialising it at 10Gbps. If it cannot store it,
the window will shrink and the receiver cannot get 10Gbps.

This is quite a pathological example, but you can try with much less
pathological numbers, remembering the Trident II has 12MB of buffers.

I totally get why the receiver needs bigger buffers if it's going to
shuffle that data out another interface at a slower speed.

But when you're a data centre operator you're (usually, anyway) mostly
transmitting data. And you can easily ensure that the interface speed
facing the servers is the same as the interface speed facing the ISP.

So if you consider this typical spine/leaf data centre network topology
(essentially the same one I posted earlier this morning):

(Server) --10GE--> (T2 leaf X) --40GE--> (T2 spine) --40GE-->
(T2 leaf Y) --10GE--> (IP-transit/"the Internet") --10GE--> (Client)

If I understand you correctly you're saying this is a "suspect" topology
that cannot achieve 10G transmission rate from server to client (or
from client to server for that matter) because of small buffers on my
"T2 leaf Y" switch (i.e., the one which has the Internet-facing
interface)?

If so would it solve the problem just replacing "T2 leaf Y" with, say,
a Juniper MX or something else with deeper buffers?

Or would it help to use (4x)10GE instead of 40GE for the links between
the leaf and spine layers too, so there was no change in interface
speeds along the path through the data centre towards the handoff to
the IPT provider?

Tore

Hey,

(Server) --10GE--> (T2 leaf X) --40GE--> (T2 spine) --40GE-->
(T2 leaf Y) --10GE--> (IP-transit/"the Internet") --10GE--> (Client)

If I understand you correctly you're saying this is a "suspect" topology
that cannot achieve 10G transmission rate from server to client (or
from client to server for that matter) because of small buffers on my
"T2 leaf Y" switch (i.e., the one which has the Internet-facing
interface)?

This mostly isn't suspect, depending on how utilised it is. If it's
verbatim like the above, then it's never going to be suspect, as
T2_leaf_Y is going to see large pauses between each frame coming from
the 40Gbps side; it won't need to store a large burst of 40Gbps
traffic, as no one is generating at 40Gbps, so it can cope with very
small buffers.

* Saku Ytti

Put it another way, my «Internet facing» interfaces are typically
10GEs with a few (kilo)metres of dark fibre that x-connects into my
IP-transit providers' routers sitting in nearby rooms or racks
(worst case somewhere else in the same metro area). Is there any
reason why I should need deep buffers on those interfaces?

Imagine a content network having a 40Gbps connection, a client having
a 10Gbps connection, and a lossless network between them with an RTT of
200ms. To achieve a 10Gbps rate the receiver needs a 10Gbps*200ms =
250MB window; in the worst case a 125MB window could grow into a 250MB
window, and the sender could send that 125MB as a 40Gbps burst.
This means the port the receiver is attached to needs to store the
125MB, as it's only serialising it at 10Gbps. If it cannot store it,
the window will shrink and the receiver cannot get 10Gbps.

This is quite a pathological example, but you can try with much less
pathological numbers, remembering the Trident II has 12MB of buffers.

I totally get why the receiver needs bigger buffers if it's going to
shuffle that data out another interface at a slower speed.

But when you're a data centre operator you're (usually, anyway) mostly
transmitting data. And you can easily ensure that the interface speed
facing the servers is the same as the interface speed facing the ISP.

Unlikely, given that the interfaces facing the servers are 1/10/25/50G
and the ones facing the ISP are n x 10G or n x 100G.

So if you consider this typical spine/leaf data centre network topology
(essentially the same one I posted earlier this morning):

(Server) --10GE--> (T2 leaf X) --40GE--> (T2 spine) --40GE-->
(T2 leaf Y) --10GE--> (IP-transit/"the Internet") --10GE--> (Client)

If I understand you correctly you're saying this is a "suspect" topology
that cannot achieve 10G transmission rate from server to client (or
from client to server for that matter) because of small buffers on my
"T2 leaf Y" switch (i.e., the one which has the Internet-facing
interface)?

You can externalize the cost of the buffer at the expense of latency
from the T2, e.g. by enabling flow control facing the host or other
high-capacity device, or by doing packet pacing on the server if the
network is fairly shallow.

If the question is how to ensure high link utilization, rather than
maximum throughput for this one flow, the buffer requirement may be
substantially lower.

e.g. if you are sizing based on

buffer = (round-trip delay * desired bandwidth) / sqrt(number of flows)

rather than buffer = round-trip delay * bandwidth
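Plugging the thread's earlier example numbers (10 Gb/s, 200 ms) into both rules shows how much the sqrt-of-flows sizing (often called the Stanford buffer-sizing rule) relaxes the requirement; the 10,000-flow count is an assumed figure for illustration:

```python
import math

# Classic rule-of-thumb buffer (RTT * bandwidth) versus the
# sqrt(N)-flows sizing rule, results in bytes.

def classic_buffer_bytes(rate_bps, rtt_s):
    return int(rate_bps * rtt_s / 8)

def stanford_buffer_bytes(rate_bps, rtt_s, n_flows):
    return int(rate_bps * rtt_s / 8 / math.sqrt(n_flows))

full = classic_buffer_bytes(10e9, 0.200)             # 250 MB
shared = stanford_buffer_bytes(10e9, 0.200, 10_000)  # 2.5 MB
print(full, shared)
```

With enough concurrent flows, the requirement drops to something a shallow-buffered switch can plausibly provide.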

If so would it solve the problem just replacing "T2 leaf Y" with, say,
a Juniper MX or something else with deeper buffers?

Broadcom Jericho/PTX/QFX, whatever; sure, it's plausible to have a
large buffer without using a feature-rich, extremely-large-FIB ASIC.

Or would it help to use (4x)10GE instead of 40GE for the links between
the leaf and spine layers too, so there was no change in interface
speeds along the path through the data centre towards the handoff to
the IPT provider?

It can reduce the demand on the buffer; you can, however, multiplex two
or more flows that might otherwise run at 10Gb/s onto the same LAG member.

In my setup, I use a BIRD instance to combine multiple full Internet
tables, and I use some filters to generate override routes to send to my
L3 switch for routing. The L3 switch is configured with a default route
to the main transit provider; if BIRD is down, the routing will be
unoptimized, but everything else remains operable until I fix that BIRD
instance.

I've asked around about why there isn't an L3 switch capable of handling
full tables; I really don't understand the difference/logic behind it.

In practice there are several merchant silicon implementations that
support the addition of external TCAMs; building them accordingly
increases the COGS and brings various performance and packaging
limitations.

Arista 7280R and Cisco NCS5500 are Broadcom Jericho-based devices that
are packaged accordingly.

Ethernet merchant silicon is heavily biased towards doing most if not
all of the I/O on the same ASIC, with limitations driven by gate size,
die size, heat dissipation, pin count, and so on.

There was a recent packet pushers episode with Pradeep Sindhu that
touched on some of these issues:

http://packetpushers.net/podcast/podcasts/show-315-future-networking-pradeep-sindhu/

Cisco and Arista are both able to squeeze a current full Internet table into the base space on their Jericho boxes, using the right space partitioning. Cisco added this in 6.1.2 without anything in the release notes, but you’ll notice they bumped the datasheet spec on the base 5502 to 1M FIB now where it used to be 256K. It works with the standard Internet table, but may not work if you have a ton of routes with lengths that do not work well with how the memory is carved up. Of course Jericho is more expensive than Trident.

Phil

Thank you for all the on-list and off-list replies.

The project I was looking for was/is called SIR (SDN Internet Router), and the original presentation was done by David Barroso.

Thanks to everyone who responded !

Regards.

Faisal Imtiaz
Snappy Internet & Telecom
7266 SW 48 Street
Miami, FL 33155
Tel: 305 663 5518 x 232

Help-desk: (305)663-5518 Option 2 or Email: Support@Snappytelecom.net