LDPv6 Census Check

Perhaps, but no-one planning to use SRv6 is going to invest in kit which can handle SRv6 but not the TE component, or deploy SRv6 on existing kit which can't handle TE.

Nick

On-package is not important, on-chip or off-chip is what matters, i.e.
    do you eat SERDES to connect memory or not.

[pmb]
Sorry meant to say on-die, not on-package.

Typically the time it takes to do those lookups is built into the system specs to attain the performance you need, with deterministic latency within certain bounds. There are certainly corner cases where you make tradeoffs, especially now that single NPUs are 10+ Tbps, but it's not really an MPLS vs. IPv4 vs. IPv6 thing. The other key is to do those types of accesses in a single pass, not traverse multiple hierarchy levels or do multiple operations. If you are tunneling, then the table for any of those types is going to be small on a mid-point router.

Except this implementation does not exist, but we can argue that is a
missing feature. We can argue we should be able to tell the lookup
engine that this CIDR is on-chip and contains host routes only. This is
certainly doable, and would make IP tunnels like MPLS tunnels in
lookup cost, just with a larger lookup key, which is not a significant cost.

But even if we had this (we don't; we do have it for MPLS), IP would still
be inferior: it is more tunneling overhead, i.e. I need more overspeed.
Technically MPLS is just a better tunneling header. I can understand
sentimental arguments for IPv4, and the market seems to appreciate those
arguments particularly well.
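
As a rough back-of-the-envelope sketch of the overhead point (my numbers, not anyone's measurements: a single 4-byte MPLS label vs. a plain 20-byte IPv4 or 40-byte IPv6 outer header, on an arbitrary 400-byte average packet):

    # Per-packet encapsulation overhead for one level of tunneling.
    # Assumptions: 1x 4-byte MPLS label vs. 20-byte IPv4 / 40-byte IPv6 outer
    # header; AVG_PKT is an arbitrary illustrative value.
    AVG_PKT = 400  # bytes

    overheads = {
        "MPLS (1 label)": 4,
        "IPv4-in-IPv4":  20,
        "IPv6 outer":    40,
    }

    for name, oh in overheads.items():
        overspeed = (AVG_PKT + oh) / AVG_PKT
        print(f"{name:15s} +{oh:2d}B -> {100 * (overspeed - 1):.1f}% extra capacity needed")

The bigger the outer header relative to the average packet, the more overspeed the core links need for the same goodput.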

From: David Sinn
Sent: Thursday, June 11, 2020 4:32 PM

However if you move away from large multi-chip systems to a system of
fixed form-factor, single-chip systems, labels fall apart at scale with
high ECMP. Needing to enumerate every possible path within the network,
or having to have a super-deep label stack, removes all of the perceived
benefits of cheap and simple.

Looks like the deployments you describe are large DC Clos/Benes fabrics; the
potentially deep label imposition would then be done by the VMs, right?
On transit nodes the 64x ECMP or super-deep label stacks are no problem for
the NPU/lookup process, as it's always just a top-label lookup resolving to a
single egress interface.

adam

I'm curious about this statement - have you hit practical ECMP issues
with label switching at scale?

We have ECMP'ed label switch paths with multiple paths for a single FEC
all over the place, and those work fine both on Cisco and Junos (of all
sizes), both for IPv4 and IPv6 FEC's. Have done for years.

Unless I misunderstand your concern.

Mark.

From: Mark Tinka <mark.tinka@seacom.mu>
Sent: Thursday, June 11, 2020 3:59 PM

> No, my line of reasoning is that if you have MPLS LSPs signalled over v4, I see no point in having them also signalled over v6 in parallel.

It's not about signaling IPv4 LSP's over IPv6.
LDPv4 creates IPv4 FEC's.
LDPv6 creates IPv6 FEC's.

The idea is to create IPv6 FEC's so that IPv6 traffic can be label-switched in
the network natively, allowing you to remove BGPv6 in a native dual-stack
core.
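
To illustrate with a toy model (not any vendor's implementation; the prefixes and label values below are made up): label bindings are keyed per address family, so an IPv6 prefix gets its own FEC and label, and transit routers can label-switch IPv6 without carrying BGPv6.

    # Toy label information base (LIB): bindings keyed by (address family, FEC).
    lib = {
        ("ipv4", "192.0.2.1/32"):    24001,  # LDPv4 binding for a v4 PE loopback
        ("ipv6", "2001:db8::1/128"): 24002,  # LDPv6 binding for a v6 PE loopback
    }

    def label_for(afi, fec):
        # Returns the outgoing label for a FEC, or None (fall back to IP forwarding).
        return lib.get((afi, fec))

    # With only LDPv4 (6PE-style), IPv6 reachability has to be resolved via the
    # v4 binding; with native LDPv6 the v6 FEC stands on its own.
    print(label_for("ipv6", "2001:db8::1/128"))  # -> 24002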

Right, I see: what you are striving to achieve is to migrate from BGP in the core to a BGP-free core, but without leveraging 6PE or 6VPE?

As you can see, just as with IPv4, IPv6 packets are now being MPLS-switched
in the core, allowing you to remove BGPv6 in the core and simplify
operations in that area of the network.

So this is native MPLSv6. It's not 6PE or 6VPE.

So, considering you already had v4 FECs, wouldn't it be simpler to do 6PE/6VPE? What do you see as the drawbacks of these compared to native MPLSv6, please?

> Apart from X months' worth of functionality, performance, scalability and
interworking testing; network-wide code upgrades to address the bugs
found during the testing process; and then finally rollout across the core and
possibly even migration from LDPv4 to LDPv6, involving dozens of people
from Arch, Design, OPS, Project Management, etc., with the potential for things
to break while making changes in a live network.

Which you wouldn't have to do with SRv6, because you trust the vendors?

Well, my point was that if v4 FECs were enough to carry v6 traffic, then I wouldn't need SRv6 or LDPv6; hence I'm curious to hear from you about the benefits of a v6 FEC over a v4 FEC (or, in other words, MPLSv6 vs. 6PE/6VPE).

adam

Right, I see: what you are striving to achieve is to migrate from BGP in the core to a BGP-free core, but without leveraging 6PE or 6VPE?

Yes sir.

So, considering you already had v4 FECs, wouldn't it be simpler to do 6PE/6VPE? What do you see as the drawbacks of these compared to native MPLSv6, please?

Because 6PE, for us, adds a lot more complexity in how we design the
network.

But most importantly, it creates a dependency for the success of IPv6 on
IPv4. If my IPv4 network were to break, for whatever reason, it would
take my IPv6 network down with it.

Years back, there was a nasty bug in the ASR920 that set an upper limit
on the MPLS label space it created FEC's for. Since Juniper sometimes
uses higher label numbers than Cisco, traffic between that ASR920 and
our Juniper network was blackholed. It took weeks to troubleshoot, Cisco
sent some engineering code, I confirmed it fixed the issue, and it was
rolled out generally. During that time when the ASR920 was unavailable
on IPv4, it was still reachable on IPv6.

There are also issues with the ASR920 and ME3600X/3800X routers, where
0/0 and ::/0 are the last routes to be programmed into the FIB when you run
BGP-SD. It can be a while before those boxes can reach the rest of the
world via default. IPv6 will get there faster.

I also remember another issue, back in 2015, where a badly-written IPv4
ACL kicked one of our engineers out of the box. Thankfully, he got back
in via IPv6.

I guess what I'm saying is we don't want to fate-share. IPv4 and IPv6
can operate independently. A failure mode in one of them does not
necessarily propagate to the other, in a native, dual-stack network. You
can deploy something in your IPv6 control/data plane without impacting
IPv4, and vice versa, if you want to roll out gracefully, without
impacting the other protocol.

6PE simply has too many moving parts to set up, compared to just adding
an IPv6 address to a router interface and updating your IGP. Slap on
LDPv6 for good measure, and you've achieved MPLSv6 forwarding without
all the 6PE faffing.

Well, my point was that if v4 FECs were enough to carry v6 traffic, then I wouldn't need SRv6 or LDPv6; hence I'm curious to hear from you about the benefits of a v6 FEC over a v4 FEC (or, in other words, MPLSv6 vs. 6PE/6VPE).

No need for 6PE deployment and day-to-day operation complexity.

A simplified and more native tunneling for IPv6-in-MPLSv6, rather than
IPv6-in-MPLSv4-on-IPv4.

No inter-dependence between IPv6 and IPv4.

Easier troubleshooting if one of the protocols is misbehaving, because
then you are working on just one protocol, and not trying to figure if
IPv4 or MPLSv4 are breaking IPv6, or vice versa.

For me, those 4 simple points help me sleep well at 3AM, meaning I can
stay up longer having more wine, in peace :-).

Mark.

I'm not sure what implementation you are saying doesn't exist. The Broadcom XGS line is all on-die. The two largest cloud providers are using them in their transport networks (to the best of my understanding). So I'm not sure if you're saying that no one is using small boxes like I'm describing, or what. And it doesn't have to be MPLS over IP. That is one option, but IPIP is another.

I'm saying an implementation which has off-chip tables and supports putting
some routes on-chip, so that you could have a full-table lookup for edge
packets and a fast exact lookup for the others. Of course we do have
platforms which have large LEM tables off-chip.

Again, feel free to look at only one small aspect and say that it is completely better in all cases. MPLS is not better in wide ECMP cases, full stop. SR doesn't help that when you actually look at the problems at massive scale, as I have done. You are continually on the trade-off spectrum between irrationally deep label stacks and enumerating all of the possible paths through the network, which burns all of your next-hop rewrites. At least if you want high-radix, single-chip systems. So this is not sentimentality around a protocol; it's the practical reality when you look at the problems at scale using commodity components. So if you want to optimize for cost and power (which is operational cost), MPLS is not where it's at.

I'm not sure why this deep label stack keeps popping up; if we need
multiple levels of tunneling, we need it in IP too, and it's if anything
more expensive in IP. In the cases I can think of in SR, you'll only look
up the top label or two, even if you might have 10 labels.
For every apples-to-apples case, MPLS tunnels are superior to IP
tunnels. If you want a cheap, very-small-FIB backbone, then all traffic
will need to be IP-tunneled to egress, and you get all the MPLS
problems plus more overhead and larger keys (larger keys are
not a big deal).

Now, if the discussion is whether we need tunnelling at all, that is a
very different discussion.

One unexpected benefit, I will say, with going native LDPv6 is that
MTR's for IPv6 destinations no longer report packet loss on the
intermediary core routers (CRS-X).

I know this was due to the control plane, and nothing to do with the
actual data plane, but it was always a chore explaining to customers why
MTR's for IPv4 destinations have 0% packet loss in our core, while IPv6
ones have 30% - 50% (in spite of the final end-host reporting 0% packet
loss).

Since going LDPv6, IPv6 traffic is now label-switched in the core, in
lieu of hop-by-hop IPv6 forwarding. The unforeseen-but-welcome side
effect is that customer packet loss MTR's for IPv6 destinations that
traverse the CRS-X core are as 0% as they are for IPv4 (even though we
haven't yet removed BGPv6 from the core due to IOS XE platforms that
don't yet run LDPv6).

One less trouble ticket to have to explain for our NOC; I'll gladly take
that...

As my Swedish friend would say, "That gives me an avenue of pleasure and
joy" :-).

Mark.

But why do you need full table lookup in the middle of the network? Why place that class of gear where it's not needed?

Some people do collapsed core. But this is getting a bit theoretical,
because we definitely could do this in IP; we could do some lookups
on-chip and some off-chip, should the market want it.

The label stack question is about the comparison between the two extremes of SR that you can be in. You either label your packet just for its ultimate destination, or you apply the stack of the points you want to pass through.

Quite, but transit won't inspect the stack, it doesn't have to care
about it, so it can be very deep.

In the former case you are, at the forwarding plane, equal to what you see with traditional MPLS today, with every node along the path needing to know how to reach the end-point. Yes, you have lowered label space from traditional MPLS, but that can be done with site-cast labels already. And, while the nodes don't have to actually swap labels, when you look at commodity implementations (across the last three generations since you want to do this with what is deployed, not wholesale replace the network) a null swap still ends up eating a unique egress next-hop entry. So from a hardware perspective, you haven't improved anything. Your ECMP group count is high.
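
As a crude sketch of that accounting (simplified assumptions, not a model of any particular ASIC): if the hardware burns one egress entry per label per path member, entries grow with destinations times paths, while shared IP rewrites grow only with the number of uplinks.

    # Hypothetical Clos stage: N_UPLINKS equal-cost uplinks, N_DESTS destinations
    # reachable over all of them.
    N_UPLINKS = 64
    N_DESTS = 1000

    # Label-per-destination with a null swap still consuming a unique egress
    # next-hop entry per (destination, uplink):
    label_entries = N_DESTS * N_UPLINKS

    # IP rewrite sharing: the rewrite (DMAC, egress port) is identical for every
    # destination using a given uplink, so one entry per uplink suffices:
    ip_shared_entries = N_UPLINKS

    print(label_entries, ip_shared_entries)  # 64000 vs. 64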

I don't disagree. What I'm trying to say is that however you tunnel, you have
the same issues. If you need to tunnel, then MPLS is a better tunnel
than IP. Ultimately both can be made LEM on-chip should the market want
it, so the difference left is the overhead of the tunnel, and MPLS
wins here hands down. This is objectively true; what the practical
market reality is may be different, because the market doesn't
optimise for the best solution.

So, yes, MPLS works fine if you want to buy big iron boxes. But that comes at a cost. So the point that MPLS is always better is not accurate. Engineering is about trade-offs, and there are trade-offs to be made when you optimize in a different direction, and that points away from MPLS and back to IPIP.

Always, if you need a tunnel. The fundamental question is how
much overhead we have and what our key width is. In both the IP tunnel
and MPLS tunnel cases we will assume a LEM lookup, to keep the lookup
cheap.

Hello,

Rewrites on MPLS are horrible from a memory perspective, as maintaining the state and label transitions to explore all possible discrete paths across the overall end-to-end path you are trying to take is hugely inefficient. Applying circuit switching to a packet network was bad from the start. SR doesn't resolve that, as you are stuck either with a global label problem and the associated inability to engineer your paths, or a label stack problem on ingress that means you need massive ASICs and memories there.

I don't think rewrites are horrible, just very flexible, and this can come at a certain price. With regard to your memory argument, that path engineering in vanilla TE takes a lot of forwarding slots, we should remind ourselves that this is not a design principle of MPLS. Discrete paths could also be signalled in MPLS with shared link-labels, so that you end up with the same big instructional headend packet as in SR. There are even implementations offering this.

IP at least gives you rewrite sharing, so in a lite core you have a way better trade-off on resources, especially in a heavily ECMP'ed network, such as one built of a massive number of open small boxes vs. a small number of huge opaque ones. Pick your poison, but saying one is inherently better than the other in all cases is just plain false.

If I understand this argument correctly, then it shouldn't be one, because "rewrite sharing" is irrelevant to the addressability of single nodes in a BGP network. Why a header lookup depth of 4B per label in engineered and non-engineered paths should be a bad requisite for h/w designers of modern networks is beyond me. In most MPLS networks (unengineered L3VPN) you need to read fewer header bytes than in, e.g., a VXLAN fabric to make ECMP work (24B vs. 20B).
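
The exact byte counts depend on which fields a given ASIC hashes on; the sketch below just sums header lengths under one set of assumptions, and is not meant to settle the figures above.

    # Bytes a parser must walk before it has the fields commonly hashed for ECMP,
    # under one set of assumptions (others are possible).
    cases = {
        # unengineered L3VPN: 2 labels, then the inner IPv4 header up to the addresses
        "MPLS L3VPN (2 labels + inner IPv4 addresses)": 2 * 4 + 20,
        # VXLAN: entropy usually sits in the outer UDP source port
        "VXLAN (outer IPv4 header + UDP source port)":  20 + 2,
    }
    for name, depth in cases.items():
        print(f"{name}: {depth} bytes")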

Saku Ytti wrote on 2020-06-12 12:10:

Unless you want ECMP, in which case it VERY much matters. But I guess since we are only talking about the theoretical instead of building an actual practical network, it doesn't matter.

Well, blatantly we are, because in the real world most of the value of
MPLS tunnels is not available as IP tunnels. Again, it is technically
entirely possible to replace MPLS tunnels with IP tunnels; it is just a
question of how much overhead you have in transporting the tunnel key and
how wide the keys are.

Should we design a rational cost-efficient solution, we should choose
the lowest overhead and narrowest working keys.

Sorry for jumping in in the middle of the discussion. As a side note, in the case of IPIP tunneling, shouldn't another protocol type be utilized in the MAC header? As I understand it, in VXLAN the VTEP IP is dedicated for this purpose, so receiving a packet with the VTEP DST IP already means "decapsulate and look up the next header". But in traditional routers loopback IPs are used for multiple purposes, and usually receiving a packet with the lo0 IP means punt it to the control plane. Isn't an additional differentiator needed here to tell a router which type of action it has to take? Or, as an alternative, if a dedicated set of IPs is used for tunneling, then another lookup table is needed for it, isn't it? And now it looks like we are arriving at a header structure and forwarding process similar to what we already have in MPLS, only with a different label format. Please correct me if I went off track somewhere in this logical chain.
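
Something like the following sketch is what I have in mind (hypothetical address blocks), i.e. the differentiator would be a dedicated decap prefix kept separate from the control-plane loopbacks:

    import ipaddress

    # Hypothetical address plan: loopbacks are punted; a dedicated decap block
    # means "strip the outer header and look up the inner destination".
    LOOPBACKS   = ipaddress.ip_network("198.51.100.0/24")
    DECAP_BLOCK = ipaddress.ip_network("203.0.113.0/24")

    def classify(outer_dst):
        addr = ipaddress.ip_address(outer_dst)
        if addr in DECAP_BLOCK:
            return "decapsulate, then look up inner header"
        if addr in LOOPBACKS:
            return "punt to control plane"
        return "ordinary LPM forward on outer header"

    print(classify("203.0.113.7"))   # tunnel endpoint -> decap
    print(classify("198.51.100.1"))  # router loopback -> punt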

To David's point about ECMP, I'd like to mention that in WAN networks the number of diverse paths is always limited, so having multiple links taking the same path doesn't make much sense. With current economics, 4x10G and 1x100G are usually close from a price POV. Obviously there are situations where multiple links are the only option, but how many, usually 4-8. Sure, if you need multiple 400G there is currently no option to go to higher speeds, but that's more of a DC use case than a WAN network. So ECMP in WAN networks isn't that big of a scale problem IMHO; also there are existing and proposed solutions for it, like SR.

Kind regards,
Andrey

I don't think a new EtherType is mandatory by any means. The biggest gain is
security. SRv6 will necessarily have a lot of issues where an
unauthorised packet gets treated as SRv6, which is much harder in an MPLS
network. Many real-life devices work very differently with EH chains
(with a massive performance drop, like can be 90%!). JNPR Trio will
parse up to N EHs, then drop if it cannot finish parsing. NOK FP will
parse up to N EHs, then forward if it cannot finish parsing (i.e. now
it bypasses TCP/UDP denies, as it didn't know it is TCP/UDP, or it
could have an SRv6 EH which it couldn't drop, as it didn't know it had
it).

But in terms of the big MPLS advantage, having guaranteed exact-match
lookups on a small space compared to LPM lookups on a large space:
we could guarantee this in IPIP tunnels too, without any
difference in the headers, other than the obligation/guarantee that all
LSR packets are IPIP-encapsulated with a small set of outer-packet
DADDRs.
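
Schematically (a sketch assuming all LSR-facing traffic is tunneled towards a small, known set of outer destinations; not tied to any particular chip), the core lookup can then be a flat exact-match table either way:

    # Exact-match (LEM) table: key is either an MPLS label or a tunnel-endpoint
    # address drawn from the small guaranteed set. No trie walk, no LPM.
    lem = {
        20001:         "swap 30001, out et-0/0/1",  # MPLS label as key
        "2001:db8::1": "decap, out et-0/0/2",       # outer DADDR as key
    }

    def core_lookup(key):
        return lem.get(key, "drop (not a valid core key)")

    print(core_lookup(20001))
    print(core_lookup("2001:db8::1"))

The remaining difference is only how wide the key is and how many bytes of header carry it, which is the overhead argument above.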

Okay, got you. I thought you were running into these problems on the
"usual suspect" platforms.

Yes, commodity hardware certainly has a number of trade-offs for the
cost benefit. We've been evaluating this path since 2013, and for our
use-case, it will actually make less sense because we are more large
capacity, small footprint, i.e., typical network service provider.

If we were a cloud provider operating at scale in several data centres,
our current model of running Cisco's, Juniper's, Arista's, etc., may
not necessarily be the right choice in 2020, particularly in the edge.

Mark.

< note that i am wearing arrcus hat >

I'll be honest, none of our customers has ever asked us to deploy RPKI and
ROV. Will they benefit from it? Sure. Is it about to become part of an
RFP requirements document? Probably not.

we have rov in rfps received from paying customers

randy

We have probably largely converged to the same place. Your vantage
point sees practical offerings where IPIP may make more sense to you
than MPLS; my vantage point definitely only implements the rich
features I need in MPLS tunnels (RSVP-TE, L2 pseudowires, FRR, L3 MPLS
VPN, all of which could of course technically be done with IPIPIP
tunnels). And in theory we agree that less is more.

ECMP appears to be your main pain point, the rich features are not
relevant, and you mentioned commodity hardware being able to hash on
IPIP. I feel this may be a very special case where the HW can do an IPIP
hash but not an MPLSIP hash. Out of curiosity, what is this hardware? Jericho
can do MPLSIP, and I know JNPR's pipeline offering, Paradise, can. Or
perhaps it's not even that the underlying hardware you have cannot
do it; it's that the NOS you are being offered is so focused on your
use-case that it doesn't do anything else reasonably, and then of course
that use-case is best by default.

One of the biggest challenges we found in leveraging commodity hardware
was locating a suitable OS that is not only fit-for-purpose in our
service provider environment, but that could also leverage the hardware
at its disposal to its fullest potential.

It's hard enough for one vendor to get both their own hardware and
software right most of the time. We posited it would be doubly hard for
an operator to marry hardware and software vendors that do not
necessarily co-ordinate with one another, if your goal is to run a
profit-oriented operational network.

Sure, the idea is great on paper, but there aren't that many shops that
can throw warm bodies at this problem like some of the more established
content and cloud folk.

If it was easy, I certainly wouldn't have started this thread in the
first place :-).

Mark.

I hope this becomes the norm, globally.

Mark.

Except that is actually the problem if you look at it in hardware. And to be very specific, I'm talking about commodity hardware, not flexible pipelines like you find in the MX and a number of the ASR's. I'm also talking about the more recent approach of using Clos in PoP's instead of "big iron" or chassis-based systems.

TE gives you the most powerful traffic engineering tool kit available. Naturally it has a bit more weight than just a single screwdriver. It lets you build nearly any kind of multipath transport, while that Clos thing is just one architecture hunting for the cheapest implementation of IP/LDP-style ECMP.

On those boxes, it's actually better not to do shared labels, as this pushes the ECMP decision to the ingress node. That does mean you have to enumerate every possible path (or some approximation) through the network; however, the action on the commodity gear is greatly reduced. It's a pure label swap, so you don't run into any egress next-hop problems. You definitely do on the ingress nodes. Very, very badly actually.

Actually, shared link-labels are not a swap but just a pop, similar to SR. But indeed this would shift your ECMP issue to the headend. So for your ECMP scaling there would still be an option left: use an implementation which offers a merge point with a single label towards all upstreams for a certain equal-cost multipath downstream. This does exist, so it would certainly fix your ECMP scaling problem. But advanced control-plane code is certainly not cheap, so in the end, as was already said before, if a simple and cheap platform can solve all your needs then it might be the better one. Let's see what problems we need to solve in five years again.

What I'm getting at is that IP allows rewrite sharing, in that what needs to change on two IP frames taking the same path but ultimately reaching different destinations is rewritten identically (e.g. DMAC, egress port). And, at least with IPIP, you are able to look at the inner frame for ECMP calculations. Depending on your MPLS design, that may not be the case. If you have too deep a label stack (3-5 labels depending on the ASIC), you can't look at the payload and you end up with polarization.

Not really, as you are still forced to rewrite on imposition for the simplest form of tunneling, and for TE as often as you need to go against your SPT as well; it's just happening on IP (and IP rewrites are more expensive than MPLS rewrites / forwarding operations).
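
To make the polarization point above concrete (a sketch; the hash function is a stand-in, not any vendor's, and the label and address values are made up), once the parser gives up before reaching the inner header, distinct flows hash on identical bytes and land on the same ECMP member:

    import zlib

    stack = [16001, 16002, 16003, 16004]           # 4-label stack (illustrative)
    flows = [("10.0.0.1", "10.9.9.9"), ("10.0.0.2", "10.8.8.8")]
    PARSE_DEPTH = 3                                # labels the ASIC can parse
    N_MEMBERS = 8                                  # ECMP group size

    for src, dst in flows:
        visible = stack[:PARSE_DEPTH]              # inner IPs are never reached
        key = ",".join(map(str, visible)).encode()
        member = zlib.crc32(key) % N_MEMBERS
        print(f"{src}->{dst}: hashed on {visible} -> member {member}")

Both flows print the same member: no spread, i.e. polarization.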