few big monolithic PEs vs many small PEs

Hi folks,

Recently I ran into a peculiar situation where we had to cap a couple of PEs
even though only half of the rather big chassis was populated with cards, the
reason being that the central RE/RP was not able to cope with the combined
number of routes/VRFs/BGP sessions/etc.

So this made me think about the best strategy for building out the SP edge
nowadays (yes, I'm aware of the centralize/decentralize pendulum swinging
every couple of years).
The conclusion I came to was that *currently the best approach would be to
use several medium to small (fixed) PEs to replace a big monolithic
chassis-based system.
So what I was thinking is:
Yes, it will cost a bit more (a router is more expensive than an LC).
We'll end up with more prefixes in the IGP, more BGP sessions, etc. -don't care.
But the benefits are fewer eggs in one basket, simplified and hence faster
testing in the case of specialized PEs, and obviously a better RP CPU/MEM to
port ratio.
Am I missing anything please?

*currently:
Yes, some old chassis systems or even multi-chassis systems used to support
additional RPs and offloading some of the processes onto those (e.g. BGP).
The problem is that these are custom hacks, and it's still a single OS which
needs the LCs/ASICs rebooted when it's upgraded, so the problem of too many
eggs in one basket still exists (yes, the Cisco NCS6k and the recent ASR9k
Lightspeed LCs are an exception).
And yes, there is the "node-slicing" approach from Juniper, where one can
offload the CP onto multiple x86 servers and assign LCs to each server
(virtual node), which would solve my full-chassis problem. But honestly, how
many of you are running such a setup? Exactly. And that's why I'd be hesitant
to deploy this solution in production just yet. I don't know of any other
vendor solution like this one, but who knows, maybe in 5 years this will be
the new standard. Anyway, I need a solution/strategy for the next 3-5 years.

I'd like to hear your thoughts on this conundrum.

adam

netconsultings.com
::carrier-class solutions for the telecommunications industry::

Hi Adam,

It depends on how big a router you need for your “small PE”.

Taking Juniper as an example, the MX204 is pretty unbeatable cost-wise if you can make do with its 4x QSFP28 & 8x SFP+ interfaces. There’s a very big gap between the MX204 and the first chassis-based router in the MX lineup, even if you only try to replicate the port configuration at first.

Best regards,
Martijn

PS, take note of the MX204 port profiles, not every combination of interface speeds is possible: https://apps.juniper.net/home/port-checker/

> The conclusion I came to was that *currently the best approach would be to
> use several medium to small (fixed) PEs to replace a big monolithic
> chassis-based system.

For availability I think it is the best approach to do many small edge
devices. Because software is terrible and will always be terrible. People
are bad at operating the devices and always will be. Hardware is
something we think about a lot when we think about redundancy, but it's
not that common a reason for an outage.
With more, smaller boxes the inevitable human cockup and software
defects will affect fewer customers. Why I believe this to be true is
because the events are sufficiently rare and once they happen, we
find a solution or at the very least a workaround rather fast. With full
inaction you could argue that having A3 and B1+B2 gives the same amount of
aggregate outage, as while an outage in B affects fewer customers, there
are two B nodes with equal probability of outage. But I argue that the
events are not independent, they are dependent, so the probability
calculation isn't straightforward. Once we get some rare software
defect or operator mistake on B1, we usually solve it before it
triggers on B2, making the aggregate downtime of the entire system lower.
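
To put rough numbers on that argument (all figures here are invented, purely
to illustrate the dependence effect):

    # Illustrative only: expected customer-hours of outage for one big PE (A3)
    # vs two half-sized PEs (B1+B2), with made-up numbers.
    customers = 3000        # customers behind the big PE
    p_defect = 0.5          # chance per year that a box hits the rare defect
    outage_hours = 4        # downtime per occurrence

    # One big box: every occurrence hits every customer.
    a3 = customers * p_defect * outage_hours

    # Two small boxes, events fully independent: same aggregate customer-hours.
    b_independent = 2 * (customers / 2) * p_defect * outage_hours

    # Dependent events: once B1 is hit, assume we find a workaround before the
    # same defect triggers on B2 most of the time.
    workaround_rate = 0.8
    b_dependent = (customers / 2) * p_defect * outage_hours \
        + (customers / 2) * p_defect * (1 - workaround_rate) * outage_hours

    print(a3, b_independent, b_dependent)   # 6000.0 6000.0 3600.0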

> Yes, it will cost a bit more (a router is more expensive than an LC).

Several of my employers have paid only for the LC. I don't think the CAPEX
difference is meaningful, but operating two separate devices may have
significant OPEX implications in electricity, rack space,
provisioning, maintenance, etc.

> And yes, there is the "node-slicing" approach from Juniper, where one can
> offload the CP onto multiple x86 servers and assign LCs to each server
> (virtual node), which would solve my full-chassis problem. But honestly, how
> many of you are running such a setup? Exactly. And that's why I'd be hesitant
> to deploy this solution in production just yet. I don't know of any other
> vendor solution like this one, but who knows, maybe in 5 years this will be
> the new standard. Anyway, I need a solution/strategy for the next 3-5 years.

Node slicing indeed seems like it can be a sufficient compromise here
between OPEX and availability. I believe (but don't know) that the shared
software risks are meaningfully reduced and that bringing down the whole
system is sufficiently rare to allow an availability upside compared to a
single large box.

hey,

> For availability I think it is the best approach to do many small edge
> devices.

This is also great for planned maintenance. ISSU has not really worked out for any of the vendors and with two small devices you can upgrade them independently.

Great for aggregation; it enables you to dual-home access devices into two separate PEs that will never be down at the same time, be it failure or planned maintenance (excluding physical issues like power/cooling, but dual-homing to two separate sites is always problematic for eyeball networks).

> Yes, it will cost a bit more (a router is more expensive than an LC).

I found the reverse to be true... chassis are cheap. Line cards are costly.

> I'd like to hear your thoughts on this conundrum.

So this depends on where you want to deliver your service, and the
function, in my opinion.

If you are talking about an IP/MPLS-enabled Metro-E network, then having
several, smaller routers spread across one or more rings is cheaper and
more effective.

If you are delivering services to large customers from within a data
centre, large edge routers make more sense, particularly given the
rising costs of co-location.

If you are providing BNG services, it depends on how you want to balance
ease of management vs. scale vs. cost. If you have the cash to spend,
decentralizing your BNGs across a region/city/country will give you
more scale and better redundancy, but could be more costly depending on
your per-box sizing, as well as an increase in management time. If you
want to improve management, you can have fewer boxes to cover large
parts of your region/city/country. But this may mean buying a very large
box to concentrate more users in fewer places.

If you are trying to combine Enterprise, Service Provider and Consumer
services in one chassis, well, as the saying goes, "If you are a
competitor, I approve of this message" :-).

Mark.

Hey Saku,

From: Saku Ytti <saku@ytti.fi>
Sent: Thursday, June 20, 2019 7:04 AM

> The conclusion I came to was that *currently the best approach would
> be to use several medium to small (fixed) PEs to replace a big
> monolithic chassis-based system.

> For availability I think it is the best approach to do many small edge devices.
> Because software is terrible and will always be terrible. People are bad at
> operating the devices and always will be. Hardware is something we think
> about a lot when we think about redundancy, but it's not that common a reason
> for an outage.
> With more, smaller boxes the inevitable human cockup and software defects
> will affect fewer customers. Why I believe this to be true is because the
> events are sufficiently rare and once they happen, we find a solution or at
> the very least a workaround rather fast. With full inaction you could argue
> that having A3 and B1+B2 gives the same amount of aggregate outage, as while
> an outage in B affects fewer customers, there are two B nodes with equal
> probability of outage. But I argue that the events are not independent, they
> are dependent, so the probability calculation isn't straightforward. Once we
> get some rare software defect or operator mistake on B1, we usually solve it
> before it triggers on B2, making the aggregate downtime of the entire system
> lower.

Yup, I agree.
Just on the human cockups, though: we're putting more and more automation in to help address the problem of human imperfection.
But automation can actually go both ways; some say it helps with the small day-to-day problems but occasionally creates a massive one.
So considering the B1 & B2 correlation: if operations on these are automated then, depending on how the automation system is designed/operated, one might not get the chance to reflect/assess on B1 before B2 is touched -so this might further complicate the equation for the aggregate system downtime computation.
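
One (purely hypothetical) way to keep that B1-before-B2 benefit under
automation is to build a soak period and a health gate into the rollout
logic, so B2 is never touched until B1 has proven itself; a minimal sketch,
with all names, checks and timings invented:

    import time

    def upgrade(device: str) -> None:
        print(f"upgrading {device}")       # placeholder for the real change

    def healthy(device: str) -> bool:
        return True                        # placeholder for real telemetry checks

    def staged_rollout(devices: list[str], soak_seconds: int = 3600) -> None:
        for device in devices:
            upgrade(device)
            time.sleep(soak_seconds)       # give a defect time to show itself
            if not healthy(device):
                print(f"{device} unhealthy, halting rollout")
                return                     # B2 is never touched if B1 misbehaves

    staged_rollout(["B1", "B2"], soak_seconds=5)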
  

> Yes, it will cost a bit more (a router is more expensive than an LC).

> Several of my employers have paid only for the LC. I don't think the CAPEX
> difference is meaningful, but operating two separate devices may have
> significant OPEX implications in electricity, rack space, provisioning,
> maintenance, etc.

> And yes, there is the "node-slicing" approach from Juniper, where one
> can offload the CP onto multiple x86 servers and assign LCs to each
> server (virtual node), which would solve my full-chassis problem. But
> honestly, how many of you are running such a setup? Exactly. And that's
> why I'd be hesitant to deploy this solution in production just yet. I
> don't know of any other vendor solution like this one, but who knows,
> maybe in 5 years this will be the new standard. Anyway, I need a
> solution/strategy for the next 3-5 years.

> Node slicing indeed seems like it can be a sufficient compromise here between
> OPEX and availability. I believe (but don't know) that the shared software
> risks are meaningfully reduced and that bringing down the whole system is
> sufficiently rare to allow an availability upside compared to a single large
> box.

I tend to agree, though as you say it's a compromise nevertheless.
If one needs to switch to a new version of fabric in order to support new line cards, or upgrade code on the base system for that matter, the whole thing (NFVI) needs to be power-cycled.

adam

With automation we break far, far less often, but when we do break, we
break far, far more. MTTR is also increased due to skill rot: in a
CLI-jockey network you break something every day and you have to
troubleshoot and fix it, so even fixing complex problems becomes routine.
With automation years may pass without complex outages; when they happen,
people panic and aren't able to act logically and focus on a single problem.

I am absolutely PRO automation. But I'm saying there is a cost.

Hey,

From: Tarko Tikan
Sent: Thursday, June 20, 2019 8:28 AM

hey,

> For availability I think it is the best approach to do many small edge
> devices.

> This is also great for planned maintenance. ISSU has not really worked out for
> any of the vendors and with two small devices you can upgrade them
> independently.

Yup, I guess no one is really using ISSU in production, and even with ISSU, currently, most of the NPUs on the market need to be power-cycled to load a new version of microcode, so there's packet loss on the data plane anyway.

> Great for aggregation; it enables you to dual-home access devices into two
> separate PEs that will never be down at the same time, be it failure or
> planned maintenance (excluding physical issues like power/cooling, but
> dual-homing to two separate sites is always problematic for eyeball
> networks).

Actually this is an interesting point you just raised.
(Note: the assumption for the below is single-homed customers, as a dual-homed customer would probably want to be at least site-diverse and pay a premium for that service.)
So what is the primary goal of using the aggregation/access layer? It's to achieve better utilization of the expensive router ports, right? (Hence "aggregation".)
And indeed there are cases where we connect customers directly onto the PEs, but then it's somehow OK for a line card to be part of just a single chassis (or a PE).
Now let's take it a step further: what if the line card is not inside the chassis anymore, because it's a fabric extender or a satellite card?
Why, all of a sudden, would we be uncomfortable having it be part of just a single chassis? (And there are tons of satellite/extender topologies to prove that this is a real concern among operators.)
So to circle back to a standalone aggregation device: should we try and complicate the design by creating this "fabric" (PEs as "spine" and aggregation devices as "leaf") in an attempt to increase resiliency, or should we treat each aggregation device as a unitary, indivisible part of a single PE, as if it were a card in a chassis -because if the economics worked it would be a card in a chassis?

adam

hey,

> So what is the primary goal of using the aggregation/access layer? It's to achieve better utilization of the expensive router ports, right? (Hence "aggregation".)

I'm in the eyeball business so saving router ports is not a primary concern.

Aggregation exists to aggregate downstream access devices like DSLAMs, OLTs etc. First of all they have interfaces that are not available in your typical PEs. Secondly they are physically located further downstream, closer to the customers. It is not economical or even physically possible to have an MPLS device next to every DSLAM, hence the aggregation.

Eyeball network topologies are very much driven by fiber layout that might have been built 10+ years ago following TDM network best practices (rings).

Ideally (and if your market situation and finances allow this) you want your access device (or in the PON case, perhaps even an OLT line card) to be the only SPOF. If you now uplink this access device to a PE, the PE line card becomes a SPOF for many access devices -let's say 40, as this is a typical port count.

If you don't want this to happen you can use a second fiber pair for a second uplink, but you typically don't have fiber to a second aggregation site. So your only option is to build on the same fiber (so that's a SPOF too) to the same site. If you now uplink to the same PE, you will still lose both uplinks during software upgrades.

Two devices will help with that, making aggregation upgrades invisible to customers and thus improving customer satisfaction. Again, it very much depends on the market; here, customers get nosy if they have more than one or two planned maintenances in a year (and this is not for some premium L3VPN service, but just internet).
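
To put a rough figure on that blast radius (the numbers are illustrative only):

    # Illustrative blast-radius arithmetic for the single-uplink case above.
    access_devices_per_pe_linecard = 40    # typical port count mentioned above
    subscribers_per_access_device = 500    # invented figure for a DSLAM/OLT

    # One PE line card failing (or one PE upgrade with both uplinks on it)
    # takes out this many subscribers at once:
    print(access_devices_per_pe_linecard * subscribers_per_access_device)  # 20000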

> And indeed there are cases where we connect customers directly onto
> the PEs, but then it's somehow OK for a line card to be part of just a
> single chassis (or a PE).

We'd typically do this for very high-speed ports (100Gbps), as it's
cheaper to aggregate 10Gbps-and-slower via an Ethernet switch trunking
to a router line card.

> Now let's take it a step further: what if the line card is not inside the chassis anymore, because it's a fabric extender or a satellite card?
> Why, all of a sudden, would we be uncomfortable having it be part of just a single chassis? (And there are tons of satellite/extender topologies to prove that this is a real concern among operators.)

I never quite saw the use-case for satellite ports. To me, it felt like
vendors trying to find ways to lock you into their revenue stream
forever, as many of these architectures do not play well with the other
kids. I'd rather keep it simple and have 802.1Q trunks between router
line cards and affordable Ethernet switches.

We are currently switching our Layer 2 aggregation ports in the data
centre from Juniper to Arista, talking to a Juniper edge router. I'd
have been in real trouble if I'd fallen for Juniper's satellite system,
as they have a number of shortfalls in the Layer 2 space, I feel.

> So to circle back to a standalone aggregation device: should we try and complicate the design by creating this "fabric" (PEs as "spine" and aggregation devices as "leaf") in an attempt to increase resiliency, or should we treat each aggregation device as a unitary, indivisible part of a single PE, as if it were a card in a chassis -because if the economics worked it would be a card in a chassis?

See my previous response to you.

Mark.

Hey Mark,

From: Mark Tinka
Sent: Thursday, June 20, 2019 3:27 PM

> Yes, it will cost a bit more (a router is more expensive than an LC).

> I found the reverse to be true... chassis are cheap. Line cards are costly.

Well yes, but if, say, I compare just a single line-card cost to a standalone fixed-format 1RU router with similar capacity, the card will always be cheaper; and then as I start adding cards on the left-hand side of the equation things should start to even out gradually (the problem is this gradual increase is just a theoretical exercise -there are no fixed PE products to do this with).
Yes, I can compare an MPC7 with an MX204, or an ASR9901 with some Tomahawk card(s) -probably not apples to apples?
But if I venture above 1/2RU then I'm back in chassis-based systems, paying extra for REs/RPs and fabric and fans and PSUs with every small PE I'm putting in, so then I'm talking about adding two new cards to an existing chassis or adding two new cards to a new chassis.

Also, one interesting CAPEX factor to consider is the connectivity back to the core: with many small PEs in a POP one would need a lot of ports on the core routers, and once again the aggregation factor is somewhat lost in doing so. Where I'd have had just a couple of PEs with 100G back to the core, now I'd need a bunch of 10s (bundled) or 40s, and would probably need additional cards in the core routers to accommodate the need for PE ports in the POP.
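
A quick illustrative calculation of that core-facing port pressure (all
numbers invented):

    # Core-facing ports per POP when the same edge is spread over many small
    # PEs instead of a couple of big ones. Figures are purely illustrative.
    big_pes, uplinks_per_big_pe = 2, 2          # 2 x 100G per big PE
    small_pes, uplinks_per_small_pe = 10, 4     # 2 x (2 x 10G bundle) per small PE

    print(big_pes * uplinks_per_big_pe)         # 4 x 100G core ports
    print(small_pes * uplinks_per_small_pe)     # 40 x 10G core ports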

>
> I'd like to hear your thoughts on this conundrum.

> So this depends on where you want to deliver your service, and the function,
> in my opinion.
>
> If you are talking about an IP/MPLS-enabled Metro-E network, then having
> several, smaller routers spread across one or more rings is cheaper and more
> effective.

Well, playing devil's advocate: having the metro rings built as dumb L1 or L2 with a pair of PEs at the top is cheaper -although not that much cheaper nowadays; the economics in this sector have changed significantly over the past years.

> If you are delivering services to large customers from within a data centre,
> large edge routers make more sense, particularly given the rising costs of
> co-location.

So this particular case, the major POPs, is actually where we ran into the problem of the RE/RP becoming full (too many VRFs/routes/BGP sessions) halfway through the chassis.
Hence I'm considering whether it's actually better to go with multiple small chassis and/or fixed-form PEs in the rack, as opposed to a half/full-rack chassis.

adam

From: Mark Tinka
Sent: Friday, June 21, 2019 9:07 AM

> And indeed there are cases where we connect customers directly onto
> the PEs, but then it's somehow OK for a line card to be part of just a
> single chassis (or a PE).

> We'd typically do this for very high-speed ports (100Gbps), as it's cheaper to
> aggregate 10Gbps-and-slower via an Ethernet switch trunking to a router line
> card.

> Now let's take it a step further: what if the line card is not inside the
> chassis anymore, because it's a fabric extender or a satellite card?
> Why, all of a sudden, would we be uncomfortable having it be part of just a
> single chassis? (And there are tons of satellite/extender topologies to
> prove that this is a real concern among operators.)

> I never quite saw the use-case for satellite ports. To me, it felt like vendors
> trying to find ways to lock you into their revenue stream forever, as many of
> these architectures do not play well with the other kids. I'd rather keep it
> simple and have 802.1Q trunks between router line cards and affordable
> Ethernet switches.
>
> We are currently switching our Layer 2 aggregation ports in the data centre
> from Juniper to Arista, talking to a Juniper edge router. I'd have been in real
> trouble if I'd fallen for Juniper's satellite system, as they have a number of
> shortfalls in the Layer 2 space, I feel.

I'd actually like to hear more on that if you don't mind.

> So to circle back to a standalone aggregation device: should we try and
> complicate the design by creating this "fabric" (PEs as "spine" and
> aggregation devices as "leaf") in an attempt to increase resiliency, or
> should we treat each aggregation device as a unitary, indivisible part of a
> single PE, as if it were a card in a chassis -because if the economics
> worked it would be a card in a chassis?

> See my previous response to you.

You actually haven't answered the question, I'm afraid :-)
So would you connect the (Juniper, now Arista) aggregation switch to at least two PEs in the POP (or all PEs in the POP, "fabric-style"), or would you consider a 1:1 mapping between an aggregation switch and a PE, please?

adam

> I'd actually like to hear more on that if you don't mind.

What part, Juniper's Ethernet switching portfolio?

> You actually haven't answered the question, I'm afraid :-)
> So would you connect the (Juniper, now Arista) aggregation switch to at least two PEs in the POP (or all PEs in the POP, "fabric-style"), or would you consider a 1:1 mapping between an aggregation switch and a PE, please?

Each edge router connects to its own aggregation switch (one or more,
depending on the number of ports required). The outgoing EX4550s we
used were set up in a VC for ease of management when we needed more ports
on a router-switch pair. But since Arista don't support VCs, each
switch would have an independent port to the edge router. Based upon
experience with VCs and the EX4550, that's not necessarily a bad thing,
as what you provision and what you actually get and can use are totally
different things.

We do not dual-home aggregation switches to edge routers; that's just
asking for STP issues (which we once faced when we thought we should be
fancy and provide VRRP services between 2 edge routers and their
associated aggregation switches).

Mark.

> Well yes, but if, say, I compare just a single line-card cost to a standalone fixed-format 1RU router with similar capacity, the card will always be cheaper; and then as I start adding cards on the left-hand side of the equation things should start to even out gradually (the problem is this gradual increase is just a theoretical exercise -there are no fixed PE products to do this with).
> Yes, I can compare an MPC7 with an MX204, or an ASR9901 with some Tomahawk card(s) -probably not apples to apples?

Yes, you can't always do that because not many vendors create 1U router
versions of their line cards. The MX204 is probably one of those that
comes reasonably close.

I'm not sure deciding whether you get an MPC7 line card or an MX204 will
be a meaningful exercise. You need to determine what fits your use-case.
For example, rather than buy MPC7 line cards to support 100Gbps
customers in our MX480's, it is easier to buy an MX10003. That way, we
can keep the MPC2 line cards in the MX480 chassis to support up to N x
10Gbps of customer links (aggregated to an Ethernet switch, of course)
and not pay the cost of trying to run 100Gbps services through the MX480.

The MX10003 would then be dedicated for 100Gbps customers (and 40Gbps),
meaning we can manage the ongoing operational costs of each type of
customer for a specific box.

We have thought about using MX204's to support 40Gbps and 100Gbps
customers, but there aren't enough ports on it for it to make sense,
particularly given those types of customers will want the routers they
connect to to have some kind of physical redundancy, which the MX204
does not have.

Our use-case for the MX204 is:

- Peering.
- Metro-E deployments for customers needing 10Gbps in the Access.

> Also, one interesting CAPEX factor to consider is the connectivity back to the core: with many small PEs in a POP one would need a lot of ports on the core routers, and once again the aggregation factor is somewhat lost in doing so. Where I'd have had just a couple of PEs with 100G back to the core, now I'd need a bunch of 10s (bundled) or 40s, and would probably need additional cards in the core routers to accommodate the need for PE ports in the POP.

Yes, that's not a small issue to scoff at, and you raise a valid concern
that could be easily overlooked if you adopted several smaller edge
routers in the data centre in favour of fewer large ones.

That said, you could do what we do and have a Layer 2 core switching
network, where you aggregate all routers in the data centre, so that you
are not running point-to-point links between routers and your core
boxes. For us, because of this, we still have plenty of slots left in
our CRS-8 chassis 5 years after deploying them, even though we are
supporting several 100's of Gbps worth of downstream router capacity.

> Well, playing devil's advocate: having the metro rings built as dumb L1 or L2 with a pair of PEs at the top is cheaper -although not that much cheaper nowadays; the economics in this sector have changed significantly over the past years.

A dumb Metro-E access with all the smarts in the core is cheap to build,
but expensive to operate.

You can't run away from the costs. You just have to decide whether you
want to pay costs in initial cash or in long-term operational headache.

> So this particular case, the major POPs, is actually where we ran into the problem of the RE/RP becoming full (too many VRFs/routes/BGP sessions) halfway through the chassis.
> Hence I'm considering whether it's actually better to go with multiple small chassis and/or fixed-form PEs in the rack, as opposed to a half/full-rack chassis.

Are you saying that even the fastest and biggest control plane on the
market for your chassis is unable to support your requirements (assuming
their cost did not stop you from looking at them in the first place)?

Mark.

“It is not economical or even physically possible to have an MPLS device next to every DSLAM, hence the aggregation.”

https://mikrotik.com/product/RB750r2 MSRP $39.95

I readily admit that this device isn’t large enough for most cases, but you can get cheap and small MPLS routers.

I was reading this and thought, ....planet earth is a single point of failure.

...but, I guess we build and design and connect as much redundancy (logic, hw, sw, power) as the customer requires and pays for.... and that we can truly accomplish.

-Aaron

Fate sharing is also an important concept in system design.

I don't know about you, but we keep two earths in active/standby. Sure, the power requirements are through the roof, but hey -- it's worth it.

Hi Adam,

Over the years I have been bitten multiple times by having fewer big
routers with either far too many services/customers connected to them
or too much traffic going through them. These days I always go for
smaller/more routers rather than fewer/larger routers.

One experience I have had is that when there is an outage on a large
PE, even when it still has spare capacity, the business impact can be
too much to handle (the support desk is overwhelmed, customers become
irate if you can't quickly tell them what all the impacted services are
and when service will be restored, the NMS has so many alarms it's not
clear what the problem is or where it's coming from, etc.).

I've seen networks place a change freeze on devices, with the exception
of changes that migrate customers or services off of the PE, because
any outage would create too great an impact on the business, or risk
the customers terminating their contract. I've also seen a change
freeze placed upon large PEs because the complexity was too great:
trying to work out the impact of a change on one of the original PEs
from when the network was first built, which is somehow linked to
virtually every service on the network in some obscure and
unforeseeable way.

This doesn't mean there isn't a place for large routers. For example,
in a typical network, by the time we get to the P-node layer in the
core we tend to have high levels of redundancy, i.e. any PE is
dual-homed to two or more P nodes and will have 100% redundant
capacity. Down at the access layer, customers may be connected to a
single access-layer device, or the access-layer device might have a
single backhaul link. So technically we have lots of customers,
services and traffic passing through larger P-node devices, but these
devices have a low rate of change / are low touch, perform a small
number of functions, are operationally simple, and are highly redundant.
Conversely, at the service edge, which I guess is your main concern
here, I'm all about more, smaller devices, with devices dedicated to a
single service.

I’ve tried to write some of my experiences here
(https://null.53bits.co.uk/index.php?page=few-larger-routers-vs.-many-smaller-routers).
The tl;dr version though is that there’s rarely a technical
restriction to having fewer large routers and it’s an
operational/business impact problem.

I'd like to hear from anyone who has had great success with fewer larger PEs.

Cheers,
James.

From: James Bensley <jwbensley@gmail.com>
Sent: Thursday, June 27, 2019 9:56 AM

> One experience I have had is that when there is an outage on a large PE,
> even when it still has spare capacity, the business impact can be too
> much to handle (the support desk is overwhelmed, customers become irate
> if you can't quickly tell them what all the impacted services are and
> when service will be restored, the NMS has so many alarms it's not clear
> what the problem is or where it's coming from, etc.).

I see what you mean; my hope is to address these challenges by having a "single source of truth" provisioning system that will have, among other things, HW-to-customer/service mappings, so the Ops team will be able to say that if a particular LC X fails then customers/services X, Y, Z will be affected.
But yes, I agree that with smaller PEs any failure fallout is minimized proportionally.
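
A toy sketch of that kind of lookup (the data model and all names are invented):

    # Hypothetical "single source of truth" lookup: which customers/services
    # sit behind a given line card. Inventory structure is illustrative only.
    inventory = {
        ("pe1.pop1", "FPC0"): ["cust-A l3vpn", "cust-B internet"],
        ("pe1.pop1", "FPC1"): ["cust-C l2vpn"],
        ("pe2.pop1", "FPC0"): ["cust-D internet", "cust-E l3vpn"],
    }

    def impacted_services(router: str, linecard: str) -> list[str]:
        """Services affected if this line card fails."""
        return inventory.get((router, linecard), [])

    print(impacted_services("pe1.pop1", "FPC0"))   # ['cust-A l3vpn', 'cust-B internet']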

> This doesn't mean there isn't a place for large routers. For example, in a
> typical network, by the time we get to the P-node layer in the core we tend
> to have high levels of redundancy, i.e. any PE is dual-homed to two or more P
> nodes and will have 100% redundant capacity.

Exactly; while the service-edge topology might be dynamic as a result of horizontal scaling, the core, on the other hand, I'd say should be fairly static and scaled vertically. That is, I wouldn't want to scale core routers horizontally and as a result have the core topology changing with every P scale-out iteration at any POP; that would be bad news for capacity planning and traffic engineering...

> I've tried to write some of my experiences here
> (https://null.53bits.co.uk/index.php?page=few-larger-routers-vs.-many-smaller-routers).
> The tl;dr version though is that there's rarely a technical restriction to
> having fewer large routers and it's an operational/business impact problem.

I'll give it a read, cheers.

adam