few big monolithic PEs vs many small PEs

From: Mark Tinka <mark.tinka@seacom.mu>
Sent: Friday, June 21, 2019 1:27 PM

> > So this particular case, the major POPs, is actually where we ran into the
> > problem of RE/RP becoming full (too many VRFs/Routes/BGP sessions)
> > halfway through the chassis.
> > Hence I'm considering whether it's actually better to go with multiple small
> > chassis and/or fixed form PEs in the rack as opposed to half/full rack chassis.

> Are you saying that even the fastest and biggest control plane on the market
> for your chassis is unable to support your requirements (assuming their cost
> did not stop you from looking at them in the first place)?

I believe it would, for a time, but it would require a SW upgrade, testing, etc. Even the newer SW in itself gave us better resource management and performance optimizations.
However, even with a powerful CP and streamlined SW, we'd still just be buying time while pushing the envelope.
Hence decentralization at the edge seems like a natural strategy to exit the ouroboros paradigm.

adam

Hi Adam,

My experience is that it is much more complex than that (although it
also depends on what sort of service you're offering); one can't
easily model the inter-dependency between multiple physical assets
like links, interfaces, line cards, racks, DCs, etc. and logical
services such as VRFs/L3VPNs, cloud-hosted proxies and the P&T edge.

Consider this, in my opinion, relatively simple example:
Three PEs in a triangle. Customer is dual-homed to PE1 and PE2 and
their link to PE1 is their primary/active link. Transit is dual-homed
to PE2 and PE3 and your hosted filtering service cluster is also
dual-homed to PE2 and PE3 to be near the Internet connectivity.

How will you record the inter-dependency whereby an outage on PE3
impacts the Customer? When that Customer sends traffic to PE1
(let's say all their operations are hosted in a public cloud provider),
and PE1 has learned the shortest path to 0/0 or ::/0 from PE2, the
Internet traffic is sent from PE1 to PE2, and from PE2 into your
filtering cluster. When the traffic comes back into PE2 after
passing through the filters, it is then sent to PE3, because the
transit provider attached to PE3 has a better route to the Customer's
destination (AWS/Azure/GCP/whatever) than the one directly attached to PE2.

That to me is a simple scenario, and it can be mapped with a
dependency tree. But in my experience, and maybe it's just me, things
are usually a lot more complicated than this. The root cause is
probably a bad design introducing too much complexity, which is
another vote for smaller PEs from me. With more service dedicated PEs
one can reduce or remove the possibility of piling multiple services
and more complexity onto the same PE(s).
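
To make that concrete, here is a rough, purely illustrative sketch
(hypothetical names, not anyone's production tooling) of the triangle
scenario captured as a dependency graph, walking backwards from a
failed PE to the customers it could impact:

    # Hypothetical dependency model of the triangle scenario above.
    # "A depends on B" is recorded as an edge A -> B; invented names only.
    from collections import defaultdict, deque

    deps = defaultdict(set)

    def depends_on(service, resource):
        """Record that 'service' depends on 'resource'."""
        deps[service].add(resource)

    depends_on("Customer", "PE1")                # primary access link
    depends_on("Customer", "PE2")                # backup access link
    depends_on("Customer", "InternetPath")       # cloud-hosted operations need Internet reachability
    depends_on("InternetPath", "FilterCluster")  # traffic is steered through the filtering service
    depends_on("InternetPath", "TransitPE3")     # best exit for the cloud prefixes in this example
    depends_on("FilterCluster", "PE2")
    depends_on("FilterCluster", "PE3")
    depends_on("TransitPE3", "PE3")

    def impacted_by(failed_resource):
        """Walk the graph backwards: who transitively depends on the failed resource?"""
        reverse = defaultdict(set)
        for svc, resources in deps.items():
            for r in resources:
                reverse[r].add(svc)
        seen, queue = set(), deque([failed_resource])
        while queue:
            node = queue.popleft()
            for dependant in reverse[node]:
                if dependant not in seen:
                    seen.add(dependant)
                    queue.append(dependant)
        return seen

    print(impacted_by("PE3"))
    # e.g. {'TransitPE3', 'FilterCluster', 'InternetPath', 'Customer'}

Even this toy version immediately over-reports: the filter cluster and
the transit are dual-homed, so a PE3-only failure doesn't necessarily
take them out, and capturing that kind of conditional dependency is
exactly where it stops being a simple tree.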

Most places I've seen (managed service providers) simply can't map the
complex inter-dependencies they have between physical and logical
infrastructure without having some super-bespoke and also complex
asset management / CMDB / CI system.

Cheers,
James.

I would tend to agree when the edge routers are massive, e.g., boxes
like the Cisco ASR9922 or the Juniper MX2020 are simply too large, and
present a real risk re: that level of customer aggregation (even for
low-revenue services such as Broadband). I don't think I'd ever justify
buying these towers to aggregate customers, mainly due to the risk.

For us, even the MX960 is too big, which is why we focus on the MX480
(ASR9906 being the equivalent). It's a happy medium between the small
and large end of the spectrum.

And as I mentioned before, we just look at a totally different box for
100Gbps customers.

Mark.

Well, this is one area where I can't meaningfully add value, since you
know your environment better than anyone else on this list.

Mark.

Which is one of the reasons we - painfully to the bean counters - insist
that routers are deployed for function.

We won't run peering and transit services on the same router.

We won't run SP and Enterprise on the same router as Broadband.

We won't run supporting services (DNS, RADIUS, WWW, FTP, Portals, NMS,
e.t.c.) on the same router where we terminate customers.

This level of distribution, although quite costly initially, means you
reduce the inter-dependency of services at a hardware level, and can
safely keep things apart so that when bits fail, you aren't committing
other services to the same fate.

Mark.

Agreed. This has worked well for me over time.

It’s costly in the initial capex outlay, but these boxes will have different upgrade/capacity-increase times and price points, so over time everything spreads out.

Massive iron upgrades require biblical business cases and epic battles to get the funds approved. Periodic small to medium PE upgrades are nicer on the annual budget and the forecasting.

Cheers,
James.

Yeah, if you want to name specific boxes, then yes, I've had similar experiences with the same boxen. Even the MX960 is slightly too big for a PE, depending on how you load it (port combinations).

Large boxes like the MX2020, ASR9922, NCS6K, etc. can only reasonably be used as P nodes, in my opinion.

Cheers,
James.

I’ve run into many providers where they had routers in the top 10 or 15 markets… and that was it. If you wanted a connection in South Bend or Indianapolis or New Orleans or Ohio or… you were backhauled potentially hundreds of miles to a nearby big market.

More, smaller POPs reduce the tromboning.

More, smaller POPs mean that one POP’s outage isn’t as disastrous, since traffic can reroute around it.

Big routers are also a lot more expensive. You have to squeeze more life out of them because they cost you hundreds of thousands of dollars, so you run them longer than you really should.

If you run more, smaller, $20k or $30k routers, you’ll replace them on a more reasonable cycle.

I really dislike centralized routing.

Mark.

Hi James,

From: James Bensley <jwbensley+nanog@gmail.com>
Sent: Thursday, June 27, 2019 1:48 PM

> >
> > > From: James Bensley <jwbensley@gmail.com>
> > > Sent: Thursday, June 27, 2019 9:56 AM
> > >
> > > One experience I have had is that when there is an outage on a
> > > large PE, even when it still has spare capacity, the business
> > > impact can be too much to handle (the support desk is overwhelmed,
> > > customers become irate if you can't quickly tell them what all the
> > > impacted services are, when service will be restored, the NMS has
> > > so many alarms it’s not clear what the problem is or where it's
> > > coming from, etc.).
> > >
> > I see what you mean; my hope is to address these challenges by having a
> > "single source of truth" provisioning system that will have, among other
> > things, HW-to-customer/service mapping, so the Ops team will be able to say
> > that if a particular LC X fails then customers/services X, Y, Z will be affected.
> > But yes, I agree that with smaller PEs any failure fallout is minimized
> > proportionally.

> Hi Adam,
>
> My experience is that it is much more complex than that (although it also
> depends on what sort of service you're offering); one can't easily model the
> inter-dependency between multiple physical assets like links, interfaces, line
> cards, racks, DCs, etc. and logical services such as VRFs/L3VPNs, cloud-hosted
> proxies and the P&T edge.
>
> Consider this, in my opinion, relatively simple example:
> Three PEs in a triangle. Customer is dual-homed to PE1 and PE2 and their link
> to PE1 is their primary/active link. Transit is dual-homed to PE2 and PE3 and
> your hosted filtering service cluster is also dual-homed to PE2 and PE3 to be
> near the Internet connectivity.

I agree the scenario you proposed is perfectly valid; it seems simple but might contain a high degree of complexity in terms of traffic patterns.
Thinking about this, I'd propose separating the problem into two parts.

The simpler one to solve is the physical resource allocation part of the problem.
This is where a hierarchical record of physical assets could give us the right answer to "what happens if this card fails?"
(example hierarchy: POP->PE->LineCard->PhysicalPort(s)->PhysicalPort(s)->Aggregation-SW->PhysicalPort(s)->Customer/Service)
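
As a very rough, hypothetical sketch (invented names, nothing like a real provisioning system), even a flat parent/child table over that hierarchy is enough to answer the "which customers/services hang off this line card?" question:

    # Hypothetical, minimal asset hierarchy: POP -> PE -> LineCard -> Port -> (Agg-SW ->) Customer/Service.
    children = {
        "POP-1": ["PE1", "PE2"],
        "PE1": ["PE1-LC0", "PE1-LC1"],
        "PE1-LC0": ["PE1-LC0-port0", "PE1-LC0-port1"],
        "PE1-LC0-port0": ["CUST-A-L3VPN"],
        "PE1-LC0-port1": ["AGG-SW-3"],
        "AGG-SW-3": ["CUST-B-L2VPN", "CUST-C-INTERNET"],
    }

    def affected_services(asset, tree=children):
        """Return every leaf (customer/service) hanging below a failed asset."""
        out = []
        for child in tree.get(asset, []):
            below = affected_services(child, tree)
            out.extend(below if below else [child])
        return out

    print(affected_services("PE1-LC0"))
    # -> ['CUST-A-L3VPN', 'CUST-B-L2VPN', 'CUST-C-INTERNET']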

The other part of the problem is much harder and has two sub-parts:
-The first sub-part is to model the interactions between a number of protocols so as to accurately predict traffic patterns under various failure conditions (a rough sketch of this idea follows below).
(I'd argue that this should, to some extent, be part of the design documentation and be well understood and tested during POC testing for a new design, although entropy...)
-And the trickier sub-part is to be able to map individual customer->service / service->customer traffic flows onto the first sub-part.
(I haven't given this sub-part much thought, so I can't really comment.)
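
To illustrate what I mean by the first sub-part, here's a very rough, hypothetical sketch: a toy topology (invented nodes and metrics) where you fail a node and recompute shortest paths to see how a flow shifts. Real protocol interactions (BGP best path, the filtering-service steering, policies, etc.) are of course far messier than a single shortest-path computation.

    import heapq

    # Toy topology only: IGP-style metrics, links assumed symmetric.
    links = {
        ("Customer", "PE1"): 1, ("Customer", "PE2"): 10,
        ("PE1", "PE2"): 5, ("PE2", "PE3"): 5, ("PE1", "PE3"): 20,
        ("PE2", "Filter"): 1, ("Filter", "PE3"): 1,
        ("PE2", "Transit"): 4, ("PE3", "Transit"): 1,
    }

    def shortest_path(src, dst, down=frozenset()):
        """Plain Dijkstra over the toy topology, skipping any node listed in 'down'."""
        graph = {}
        for (a, b), cost in links.items():
            if a in down or b in down:
                continue
            graph.setdefault(a, []).append((b, cost))
            graph.setdefault(b, []).append((a, cost))
        queue, visited = [(0, src, [src])], set()
        while queue:
            cost, node, path = heapq.heappop(queue)
            if node == dst:
                return cost, path
            if node in visited:
                continue
            visited.add(node)
            for nxt, c in graph.get(node, []):
                if nxt not in visited:
                    heapq.heappush(queue, (cost + c, nxt, path + [nxt]))
        return None  # unreachable under this failure set

    print(shortest_path("Customer", "Transit"))
    # (9, ['Customer', 'PE1', 'PE2', 'Filter', 'PE3', 'Transit'])
    print(shortest_path("Customer", "Transit", down=frozenset({"PE3"})))
    # (10, ['Customer', 'PE1', 'PE2', 'Transit'])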

adam

Mark Tinka
Sent: Thursday, June 27, 2019 4:31 PM

> > That to me is a simple scenario, and it can be mapped with a
> > dependency tree. But in my experience, and maybe it's just me, things
> > are usually a lot more complicated than this. The root cause is
> > probably a bad design introducing too much complexity, which is
> > another vote for smaller PEs from me. With more service dedicated PEs
> > one can reduce or remove the possibility of piling multiple services
> > and more complexity onto the same PE(s).
>
> Which is one of the reasons we - painfully to the bean counters - insist that
> routers are deployed for function.
>
> We won't run peering and transit services on the same router.
>
> We won't run SP and Enterprise on the same router as Broadband.
>
> We won't run supporting services (DNS, RADIUS, WWW, FTP, Portals, NMS,
> e.t.c.) on the same router where we terminate customers.
>
> This level of distribution, although quite costly initially, means you reduce the
> inter-dependency of services at a hardware level, and can safely keep things
> apart so that when bits fail, you aren't committing other services to the same
> fate.

If the PEs are sufficiently small, I'd even go further and split out L3VPN PEs vs L2VPN PEs, etc.; it's mostly because of streamlined/simplified HW and code certification testing.
But as with all the decentralize-centralize swings, one has to strike the balance just right and weigh the aggregation pros against the too-many-eggs-in-one-basket cons.

adam

On the VPN side, we sell more l2vpn than l3vpn. In fact, I don't believe
we've actually sold an l3vpn service, apart from the one we built to
deliver voice services.

l3vpn is a dying service in Africa. With everything in the cloud now,
everybody just wants a simple IP service.

Mark.

The NCS6000 was always designed as a core router to replace the CRS. We
just haven't seen the need for one, since the CRS-X we operate
(8-slot chassis) is still more than enough for our requirements.

But yes, all of these edge routers, nowadays, are very decent core boxes
also, particularly if you run a BGP-free core and have no need to
support non-Ethernet links to any reasonable degree in there.

Mark.