400G forwarding - how does it work?

Hi All,

I've been trying to understand how forwarding at 400G is possible,
specifically in this example, in relation to the Broadcom J2 chips,
but I don't think the mystery is anything specific to them...

According to the Broadcom Jericho2 BCM88690 data sheet it provides
4.8Tbps of traffic processing and supports packet forwarding at 2Bpps.
According to my maths that means it requires packet sizes of 300B to
reach line rate across all ports. The data sheet says packet sizes
above 284B, so I guess this is excluding some headers like the
inter-frame gap and CRC (nothing after the PHY/MAC needs to know about
them if the CRC is valid)? As I interpret the data sheet, J2 should
support a chassis of 12x 400Gbps ports at line rate with 284B packets
then.
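
Showing my working as a quick Python sketch (the 20B of preamble +
inter-frame gap per frame is my assumption about what the 284B figure
excludes):

    # Rough line-rate maths for a 4.8Tbps / 2Bpps chip
    throughput_bps = 4.8e12
    max_pps = 2e9

    # Minimum average packet size to use the full throughput at the max packet rate
    print(throughput_bps / max_pps / 8)        # 300.0 bytes

    # Per-port rate at 400G with 284B frames, assuming the 284B figure excludes
    # the 20B of preamble + inter-frame gap that still occupy the wire
    wire_bytes = 284 + 20
    pps_per_400g_port = 400e9 / (wire_bytes * 8)
    print(pps_per_400g_port)                   # ~164,473,684 pps per port
    print(12 * pps_per_400g_port / 1e9)        # ~1.97 Bpps across 12x 400G ports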

Jericho2 can be linked to a BCM16K for expanded packet forwarding
tables and lookup processing (i.e. to hold the full global routing
table; in such a case, forwarding lookups are offloaded to the
BCM16K). The BCM16K documentation suggests that it uses TCAM for exact
matching (e.g., for ACLs) in something called the "Database Array"
(with 2M 40b entries?), and SRAM for LPM (e.g., IP lookups) in
something called the "User Data Array" (with 16M 32b entries?).

A BCM16K supports 16 parallel searches, which means that each of the
12x 400G ports on a Jericho2 could perform a forwarding lookup at the
same time. This means that the BCM16K "only" needs to perform
forwarding look-ups at a linear rate of 1x 400Gbps, not 4.8Tbps, and
"only" for packets larger than 284 bytes, because that is the Jericho2
line-rate Pps rate. This means that each of the 16 parallel searches
in the BCM16K needs to support a rate of 164Mpps (164,473,684) to
reach 400Gbps. This is much more in the realm of the feasible, but still
pretty extreme...

1 second / 164473684 packets = 1 packet every 6.08 nanoseconds, which
is within the access time of TCAM and SRAM but this needs to include
some computing time too e.g. generating a key for a lookup and passing
the results along the pipeline etc. The BCM16K has a clock speed of
1GHz (1,000,000,000 cycles per second, or one cycle every nanosecond)
and supports an SRAM memory access in a single clock cycle (according
to the data sheet). If one cycle is required for an SRAM lookup, the
BCM16K only has 5 cycles to perform other computation tasks, and the
J2 chip needs to do the various header re-writes and various counter
updates etc., so how is this magic happening?!?
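
Putting that cycle budget into a few lines of Python (the single-cycle
SRAM access is my reading of the data sheet; everything else is just
division):

    pps_per_search = 164_473_684              # one 400G port's worth of 284B packets
    ns_per_packet = 1e9 / pps_per_search      # ~6.08 ns between packets

    clock_hz = 1e9                            # BCM16K clock: one cycle per nanosecond
    cycles_per_packet = ns_per_packet * clock_hz / 1e9   # ~6 cycles per packet

    sram_access_cycles = 1                    # single-cycle SRAM access (data sheet)
    print(cycles_per_packet - sram_access_cycles)        # ~5 cycles left for everything else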

The obvious answer is that it's not magic and my understanding is
fundamentally flawed, so please enlighten me.

Cheers,
James.

Once upon a time, James Bensley <jwbensley+nanog@gmail.com> said:

The obvious answer is that it's not magic and my understanding is
fundamentally flawed, so please enlighten me.

So I can't answer your specific question, but I just wanted to say
that your CPU analysis is simplistic and doesn't really match how CPUs
work now. Something can be "line rate" but not push the first packet
through in the shortest time. CPUs break operations down into a series
of very small operations and then run those operations in a pipeline,
with different parts of the CPU working on the micro operations for
different overall operations at the same time. The first object out of
the pipeline (packet destination calculated in this case) may take more
time, but then after that you keep getting a result every cycle/few
cycles.

For example, it might take 4 times as long to process the first packet,
but as long as the hardware can handle 4 packets in a queue, you'll get
a packet result every cycle after that, without dropping anything. So
maybe the first result takes 12 cycles, but then you can keep getting a
result every 3 cycles as long as the pipeline is kept full.
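
As a toy sketch of that latency-vs-throughput trade, using those same
hypothetical numbers (12 cycles to the first result, then one result
every 3 cycles):

    # Hypothetical pipeline: first result after 12 cycles, then one every 3 cycles
    first_result_cycles = 12
    steady_state_interval = 3

    def cycles_for(n_packets):
        return first_result_cycles + (n_packets - 1) * steady_state_interval

    print(cycles_for(1))      # 12 cycles for a single packet (latency)
    print(cycles_for(1000))   # 3009 cycles total, ~3 cycles/packet amortised (throughput)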

This type of pipelined+superscalar processing was a big deal with Cray
supercomputers, but made it down to PC-level hardware with the Pentium
Pro. It has issues (see all the Spectre and Retbleed CPU flaws with
branch prediction for example), but in general it allows a CPU to handle
a chain of operations faster than it can handle each operation
individually.

I'm not sure what your specific question is, so I'll answer my own question instead.

Q: how can we do lookups fast enough to achieve 'big number' per
second, while the underlying hardware inherently takes longer?

I.e. say a JNPR Trio PPE has many threads, and only one thread is
running; the rest of the threads are waiting for answers from memory.
That is, once we start pushing packets through the device, it takes a
long-ass time (like single-digit microseconds) before we see any packets
out - 1000x longer than your calculated single-digit nanoseconds.

So the most important bits are pipelining and parallelism. And this is substantially simplified, but hopefully it helps.

Pipelining basically means that you have a whole bunch of different operations that you need to perform to forward a packet. Lots of these are lookups into things like the FIB tables, the encap tables, the MAC tables, and literally dozens of other places where you store configuration and network state. Some of these are very small simple tables (“give me the value for a packet with TOS = 0b101”) and some are very complicated, like multi-level longest-prefix trees/tries that are built from lots of custom hardware logic and memory. It varies a lot from chip to chip, but there are on the order of 50-100 different tables for the current generation of “fast” chips doing lots of 400GE interfaces. Figuring out how to distribute all this forwarding state across all the different memory banks/devices in a big, fast chip is one of the Very Hard Problems that the chip makers and system vendors have to figure out.

So once you build out this pipeline, you’ve got a bunch of different steps that all happen sequentially. The “length” of the pipeline puts a floor on the latency for switching a single packet… if I have to do 25 lookups and they’re all dependent on the one before, it’s not possible for me to switch the packet in any less than 25 clocks…. BUT, if I have a bunch of hardware all running these operations at the same time, I can push the aggregate forwarding capacity way higher. This is the parallelism part. I can take multiple instances of these memory/logic pipelines, and run them in parallel to increase the throughput. Now there’s plenty of complexity in terms of HOW I do all that parallelism — figuring out whether I have to replicate entire memory structures or if I can come up with sneaky ways of doing multiple lookups more efficiently, but that’s getting into the magic secret sauce type stuff.

I work on/with a chip that can forward about 10B packets per second… so if we go back to the order-of-magnitude number that I’m doing about “tens” of memory lookups for every one of those packets, we’re talking about something like a hundred BILLION total memory lookups… and since memory does NOT give me answers in 1 picosecond… we get back to pipelining and parallelism.
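
To put a rough number on that, here is the same order-of-magnitude sum
in Python (the 50 lookups per packet and the per-engine rate are
illustrative assumptions of mine, not real device specs):

    pps = 10e9                 # ~10B packets per second
    lookups_per_packet = 50    # "tens" of memory lookups per packet

    total_lookups_per_sec = pps * lookups_per_packet
    print(total_lookups_per_sec)    # 5e11, i.e. hundreds of billions of lookups/second

    # If a single (hypothetical) memory/lookup engine answers one lookup per ns...
    per_engine_rate = 1e9
    print(total_lookups_per_sec / per_engine_rate)   # ~500 engines' worth of parallel work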

Hopefully that helps at least some.

Disclaimer: I’m a Cisco employee, these words are mine and not representative of anything awesome that I may or may not work on in my day job…

—lj

Thanks for the responses Chris, Saku…

Once upon a time, James Bensley <jwbensley+nanog@gmail.com> said:
> The obvious answer is that it's not magic and my understanding is
> fundamentally flawed, so please enlighten me.

So I can't answer your specific question, but I just wanted to say
that your CPU analysis is simplistic and doesn't really match how CPUs
work now.

It wasn't a CPU analysis because switching ASICs != CPUs.

I am aware of the x86 architecture, but know little of network ASICs,
so I was deliberately trying to not apply my x86 knowledge here, in
case it sent me down the wrong path. You made references to
typical CPU features;

For example, it might take 4 times as long to process the first packet,
but as long as the hardware can handle 4 packets in a queue, you'll get
a packet result every cycle after that, without dropping anything. So
maybe the first result takes 12 cycles, but then you can keep getting a
result every 3 cycles as long as the pipeline is kept full.

Yes, in the x86/x64 CPU world keeping the instruction cache and data
cache hot indeed results in optimal performance, and as you say modern
CPUs use parallel pipelines amongst other techniques like branch
prediction, SIMD, (N)UMA, and so on, but I would assume (because I
don’t know) that not all of the x86 feature set maps nicely to packet
processing in ASICs (VPP uses these techniques on COTS CPUs to
emulate a fixed pipeline, rather than a run-to-completion model).

You and Saku both suggest that heavy parallelism is the magic sauce;

Something can be "line rate" but not push the first packet
through in the shortest time.

I.e. say JNPR Trio PPE has many threads, and only one thread is
running, rest of the threads are waiting for answers from memory. That
is, once we start pushing packets through the device, it takes a long
ass time (like single digit microseconds) before we see any packets
out. 1000x longer than your calculated single digit nanoseconds.

In principle I accept this idea. But let's try and do the maths; I'd
like to properly understand;

The non-drop rate of the J2 is 2Bpps @ 284 bytes == 4.8Tbps; my
example scenario was a single J2 chip in a 12x400G device. If each
port is receiving 400G @ 284 bytes (164,473,684 pps), that’s one every
6.08 nanoseconds coming in. What kind of parallelism is required to
stop it dropping on ingress?

It takes, say, 5 microseconds to process and forward a packet (which
seems reasonable looking at some Arista data sheets which use J2 variants),
which means we need to be operating on 5,000ns / 6.08ns == 822 packets
per port simultaneously, so 9868 packets are being processed across
all 12 ports simultaneously, to stop ingress dropping on all
interfaces.
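
That is essentially Little's Law (packets in flight = arrival rate x
time in the system); as a quick sketch, using my assumed 5 microseconds:

    pps_per_port = 164_473_684    # 400G of 284B frames (incl. 20B preamble + IFG)
    latency_s = 5e-6              # assumed ~5us to process and forward a packet
    ports = 12

    in_flight_per_port = pps_per_port * latency_s
    print(in_flight_per_port)             # ~822 packets per port
    print(in_flight_per_port * ports)     # ~9868 packets chip-wide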

I think the latest generation Trio has 160 PPEs per PFE, but I’m not
sure how many threads per PPE. Older generations had 20
threads/contexts per PPE, so if it hasn’t increased that would make
for 3200 threads in total. That is a 1.6Tbps FD chip, although not
apples to apples of course, as Trio is run-to-completion too.

The Nokia FP5 has 1,200 cores (I have no idea how many threads per
core) and is rated for 4.8Tbps FD. Again it is doing something quite
different to a J2 chip, and again it's RTC.

J2 is a partially-fixed but slightly programmable pipeline, if I have
understood correctly, but definitely at the other end of the spectrum
compared to RTC. So are we to surmise that a J2 chip has circa 10k
parallel pipelines, in order to process 9868 packets in parallel?

I have no frame of reference here, but in comparison to Gen 6 Trio or
FP5, that seems very high to me (to the point where I assume I am
wrong).

Cheers,
James.

Hi Lawrence, thanks for your response.

This is the parallelism part. I can take multiple instances of these memory/logic pipelines, and run them in parallel to increase the throughput.

...

I work on/with a chip that can forward about 10B packets per second… so if we go back to the order-of-magnitude number that I’m doing about “tens” of memory lookups for every one of those packets, we’re talking about something like a hundred BILLION total memory lookups… and since memory does NOT give me answers in 1 picosecond… we get back to pipelining and parallelism.

What level of parallelism is required to forward 10Bpps? Or 2Bpps like
my J2 example :)

Cheers,
James.

I have no frame of reference here, but in comparison to Gen 6 Trio or
FP5, that seems very high to me (to the point where I assume I am
wrong).

No, you are right, FP has many, many more PPEs than Trio.

For a fair calculation, you compare how many lines FP has to how many
PPEs Trio has, because in Trio a single PPE handles the entire packet,
and all PPEs run identical ucode, performing the same work.

In FP each PPE in the line has its own function: the first PPE in the
line could be parsing the packet and extracting keys from it, the second
could be doing ingressACL, the 3rd ingressQoS, the 4th the ingress
lookup, and so forth.

Why choose this NP design instead of Trio design, I don't know. I
don't understand the upsides.

The downside is easy to understand: picture yourself as a ucode
developer, and you get the task to 'add this magic feature in the
ucode'. Implementing it in Trio seems trivial: add the code in the
ucode, rock on. On FP, you might have to go 'aww shit, I need to do
this before PPE5 but after PPE3 in the pipeline, but the instruction
cost it adds isn't in the budget that I have in PPE4; crap, now I need
to shuffle around and figure out which PPE in the line runs what
function to keep the PPS we promise to the customer.'

Let's look at it from another vantage point: let's cook up an IPv6
header with a crapton of EHs. In Trio, the PPE keeps churning away at
it, taking a long time, but eventually it gets there, or raises an
exception and gives up. Every other PPE in the box is fully available
to perform work. Same thing in FP? You have HOLB: the PPEs in the line
after this PPE are not doing anything and can't do anything until the
PPE before them in the line is done.

Today Cisco and Juniper do 'proper' CoPP, that is, they do ingressACL
before and after the lookup; before is normally needed for ingressACL,
but the after-lookup ingressACL is needed for CoPP (we only know after
the lookup if it is a control-plane packet). Nokia doesn't do this at
all, and I bet they can't do it, because if they added it in the core,
where it needs to be in the line, total PPS would go down, as there is
no budget for an additional ACL. Instead, all control-plane packets
from the ingress FP are sent to the control-plane FP, and inshallah we
don't congest the connection there, or the control-plane FP itself.

I suspect many folks know the exact answer for J2, but it’s likely under NDA to talk about said specific answer for a given thing.

Without being platform or device-specific, the core clock rate of many network devices is often in a “goldilocks” zone of (today) 1 to 1.5GHz, with a goal of 1 packet forwarded ‘per clock’. As LJ described the pipeline, that doesn’t mean a latency of 1 clock ingress-to-egress, but rather that every clock there is a forwarding decision from one ‘pipeline’, and the MPPS/BPPS packet rate is achieved by having enough pipelines in parallel to achieve that.
The number here is often “1” or “0.5”, so you can work the number backwards (e.g. it emits a packet every clock, or every 2nd clock).
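
To “work the number backwards” concretely, a rough sketch with purely
illustrative figures (not any particular vendor's part):

    headline_pps = 3.2e9      # hypothetical headline forwarding rate of some chip
    core_clock_hz = 1.0e9     # assumed core clock in the "goldilocks" zone

    for pkts_per_clock in (1.0, 0.5):   # a packet every clock, or every 2nd clock
        est_pipelines = headline_pps / (core_clock_hz * pkts_per_clock)
        print(pkts_per_clock, est_pipelines)   # -> roughly 3.2 or 6.4 parallel pipelines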

It’s possible to build an ASIC/NPU to run at a faster clock rate, but that gets back to what I’m hand-wavingly describing as “goldilocks”. Look up power vs frequency and you’ll see it’s non-linear.
Just as CPUs can scale by adding more cores (vs increasing frequency), the ~same holds true for network silicon, and you can go wider with multiple pipelines. But it’s not 10K parallel slices; there are some parallel parts, but there are multiple ‘stages’ on each doing different things.

Using your CPU comparison, there are some analogies here that do work:

  • you have multiple cpu cores that can do things in parallel – analogous to pipelines
  • they often share some common I/O (e.g. CPUs have PCIe, maybe sharing some DRAM or LLC) – maybe some lookup engines, or centralized buffer/memory
  • most modern CPUs are out-of-order execution, where, under the covers, a cache miss or DRAM fetch has a disproportionate hit on performance, so it’s hidden away from you as much as possible by speculative out-of-order execution
    – no direct analogy to this one - it’s unlikely most forwarding pipelines do speculative execution like a general purpose CPU does - but they definitely do ‘other work’ while waiting for a lookup to happen

A common-or-garden x86 is unlikely to achieve such a rate for a few different reasons:

  • if packets-in or packets-out go via DRAM, then you need sufficient DRAM (page opens/sec, DRAM bandwidth) to sustain at least one write and one read per packet. Look closer at DRAM and see its speed; pay attention to page opens/sec and what that consumes.
  • one ‘trick’ is to not DMA packets to DRAM but instead have it go into SRAM of some form - e.g. Intel DDIO, ARM Cache Stashing, which at least potentially saves you that DRAM write+read per packet
  • … but then do e.g. an LPM lookup, and best case that is back to one memory access per packet. Maybe it’s in L1/L2/L3 cache, but likely at large table sizes it isn’t.
  • … do more things to the packet (urpf lookups, counters) and it’s yet more lookups.

Software can achieve high rates, but note that a typical ASIC/NPU does on the order of >100 separate lookups per packet, and 100 counter updates per packet.
Just as forwarding in an ASIC or NPU is a series of tradeoffs, forwarding in software on generic CPUs is also a series of tradeoffs.
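
For a rough feel of why the DRAM path dominates, a back-of-envelope
sketch (the access-rate figure is an illustrative assumption, not a
measurement of any particular CPU or DIMM):

    # How many DRAM touches per packet can a software forwarder afford?
    dram_random_accesses_per_sec = 200e6   # assumed budget for the whole memory system
    target_pps = 100e6                     # a respectable software forwarding target

    print(dram_random_accesses_per_sec / target_pps)   # ~2 touches/packet before DRAM saturates
    # ...versus the >100 lookups + 100 counter updates per packet quoted above for ASICs/NPUs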

cheers,

lincoln.

It wasn’t a CPU analysis because switching ASICs != CPUs.

I am aware of the x86 architecture, but know little of network ASICs,
so I was deliberately trying to not apply my x86 knowledge here, in
case it sent me down the wrong path. You made references towards
typical CPU features;

A CPU is ‘jack of all trades, master of none’. An ASIC is ‘master of one specific thing’.

If a given feature or design paradigm found in a CPU fits with the use case the ASIC is being designed for, there’s no reason it cannot be used.

All high-performance networking devices on the market have a pipeline architecture.

The pipeline consists of “stages”.

ASICs have stages fixed to particular functions:

Well, some stages are driven by code these days (a little flexibility).

Juniper is pipeline-based too (like any ASIC). They just invented one special stage in 1996 for lookup (a sequential search by nibble in a big external memory tree) – it was public information up to the year 2000. It is a different principle from a TCAM search – performance is traded for flexibility/simplicity/cost.

Network Processors emulate stages on general-purpose ARM cores. It is a pipeline too (different cores for different functions, many cores for every function); it is just a virtual pipeline.

Ed/

Juniper is pipeline-based too (like any ASIC). They just invented one special stage in 1996 for lookup (a sequential search by nibble in a big external memory tree) – it was public information up to the year 2000. It is a different principle from a TCAM search – performance is traded for flexibility/simplicity/cost.

How do you define a pipeline? My understanding is that the fabric and WAN connections are in a chip called MQ; the ‘head’ of the packet, some 320B or so (a bit less on more modern Trio, I didn’t measure specifically), is then sent to the LU complex for lookup.
The LU then sprays packets to one of many PPEs, but once a packet hits a PPE, it is processed until done; it doesn’t jump to another PPE.
Reordering will occur; ordering is later restored within flows, but outside of flows reordering may remain.

I don’t know what the cores are, but I’m comfortable betting money they are not ARM. I know Cisco used to use EZchip in the ASR9k but is now jumping to their own NPU called Lightspeed, and Lightspeed, like the CRS-1 and ASR1k, uses Tensilica cores, which are decidedly not ARM.

Nokia, as mentioned, kind of has a pipeline, because a single packet hits every core in the line, and each core does a separate thing.

Nope, ASIC vendors are not ARM-based for the PFE. Every “stage” is a very specialized ASIC with limited programmability (not so limited for P4 and some latest-generation ASICs).

ARM cores are for Network Processors (NPs). ARM cores (with the proper microcode) could emulate any “stage” of an ASIC. It is the typical explanation for why NPs are more flexible than ASICs.

Stages are connected to the common internal memory where enriched packet headers are stored. The pipeline is just the order of stages to process these internal enriched headers.

The size of this internal header is the critical restriction of the ASIC, never disclosed or discussed (but people know it anyway for the most popular ASICs – it is possible to google “key buffer”).

Hint: the smallest one in the industry is 128 bytes, the biggest 384 bytes. It is not possible to process longer headers in one PFE pass.

A non-compressed SRv6 header could be 208 bytes (+TCP/UDP +VLAN +L2 +ASIC_internal_stuff). Hence the need for the compressed version.
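
For what it’s worth, here is one way to land on ~208 bytes (the 10-SID
segment list is an assumption; the 40/8/16 byte sizes come from the
IPv6 header and SRH formats):

    # Uncompressed SRv6: outer IPv6 header + SRH fixed part + one 128-bit SID per segment
    ipv6_header = 40
    srh_fixed = 8
    sid_bytes = 16
    sids = 10                 # assumed segment-list depth

    print(ipv6_header + srh_fixed + sids * sid_bytes)   # 208 bytes, before TCP/UDP/VLAN/L2/etc.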

It was a big marketing announcement from one famous ASIC vendor just a few years ago that some ASIC stages are capable of dynamically sharing common big external memory (used for MAC/IP/Filters).

It may be internal memory too for small scalability, but typically it is external memory. This memory is always discussed in detail – it is needed by the operations team.

It is only about headers. The packet itself (payload) is stored in a separate memory (buffer) that is not visible to the pipeline stages.

There were times when it was difficult to squeeze everything into one ASIC. Then one chip would prepare an internal (enriched) header and maybe do some processing (some simple stages), then send this header to the next chip for other “stages” (especially the complicated lookup, with external memory connected). It is a legacy artifact now.

Ed/

How do you define a pipeline?

For what it’s worth, and with just a cursory look through this email, and
without wishing to offend anyone’s knowledge:

a pipeline in processing is the division of the instruction cycle into a number of stages.
General purpose RISC processors are often organized into five such stages.
Under optimal conditions,
which can be fairly, albeit loosely,
interpreted as “one instruction does not affect its peers which are already in one of the stages”,
then a pipeline can increase the number of instructions retired per second,
often quoted as MIPS (millions of instructions per second)
by a factor equal to the number of stages in the pipeline.
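
As a toy illustration of that “factor equal to the number of stages”
claim (idealised, ignoring hazards and stalls):

    stages = 5                 # classic five-stage RISC pipeline
    instructions = 1_000_000

    unpipelined_cycles = instructions * stages      # each instruction takes 5 cycles alone
    pipelined_cycles = stages + (instructions - 1)  # fill the pipeline once, then retire 1/cycle

    print(unpipelined_cycles / pipelined_cycles)    # ~5x, roughly the number of stages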

Cheers,

Etienne

Pipeline Stages are like separate computers (with their own ALU) sharing the same memory.

In the ASIC case, the computers are of different types (different capabilities).

I have no frame of reference here, but in comparison to Gen 6 Trio or
FP5, that seems very high to me (to the point where I assume I am
wrong).

No, you are right, FP has many, many more PPEs than Trio.

Can you give any examples?

Why choose this NP design instead of Trio design, I don't know. I
don't understand the upsides.

I think one use case is fixed latency. If you have minimal variation in your traffic you can provide a guaranteed upper bound on latency. This should be possible with the RTC model too of course, just harder, because any variation in traffic at all will result in a different run-time duration, and I imagine it is easier to measure, find, and fix/tune chunks of code (running on separate cores, like in a pipeline) than more code all running on one core (like in RTC). So that's possibly a second benefit: maybe FP is easier to debug and measure changes on?

The downside is easy to understand: picture yourself as a ucode
developer, and you get the task to 'add this magic feature in the
ucode'. Implementing it in Trio seems trivial: add the code in the
ucode, rock on. On FP, you might have to go 'aww shit, I need to do
this before PPE5 but after PPE3 in the pipeline, but the instruction
cost it adds isn't in the budget that I have in PPE4; crap, now I need
to shuffle around and figure out which PPE in the line runs what
function to keep the PPS we promise to the customer.'

That's why we have packet recirc <troll face>

Let's look at it from another vantage point: let's cook up an IPv6
header with a crapton of EHs. In Trio, the PPE keeps churning away at
it, taking a long time, but eventually it gets there, or raises an
exception and gives up. Every other PPE in the box is fully available
to perform work. Same thing in FP? You have HOLB: the PPEs in the line
after this PPE are not doing anything and can't do anything until the
PPE before them in the line is done.

This is exactly the benefit of FP vs NPU: less flexible, but more throughput. The NPU has served us (the industry) well at the edge, and FP is serving us well in the core.

Today Cisco and Juniper do 'proper' CoPP, that is, they do ingressACL
before and after the lookup; before is normally needed for ingressACL,
but the after-lookup ingressACL is needed for CoPP (we only know after
the lookup if it is a control-plane packet). Nokia doesn't do this at
all, and I bet they can't do it, because if they added it in the core,
where it needs to be in the line, total PPS would go down, as there is
no budget for an additional ACL. Instead, all control-plane packets
from the ingress FP are sent to the control-plane FP, and inshallah we
don't congest the connection there, or the control-plane FP itself.

Interesting.

Cheers,
James.

“Pipeline”, in the context of networking chips, is not a terribly well-defined term! In some chips, you’ll have an almost-literal pipeline that is built from very rigid hardware logic blocks. The first block does exactly one part of the packet forwarding, then hands the packet (or just the header and metadata) to the second block, which does another portion of the forwarding. You build the pipeline out of as many blocks as you need to solve your particular networking problem, and voila!

The advantage here is that you can make things very fast and power-efficient, but they aren’t all that flexible, and deity help you if you ever need to do something in a different order than your pipeline!

You can also build a “pipeline” out of software functions - write up some Python code (because everyone loves Python, right?) where function A calls function B and so on. At some level, you’ve just built a pipeline out of different software functions. This is going to be a lot slower (C code will be faster but nowhere near as fast as dedicated hardware) but it’s WAY more flexible. You can more or less dynamically build your “pipeline” on a packet-by-packet basis, depending on what features and packet data you’re dealing with.

“Microcode” is really just a term we use for something like “really optimized and limited instruction sets for packet forwarding”. Just like an x86 or an ARM has some finite set of instructions that it can execute, so do current networking chips. The larger that instruction space is and the more combinations of those instructions you can store, the more flexible your code is. Of course, you can’t make that part of the chip bigger without making something else smaller, so there’s another tradeoff.

MOST current chips are really a hybrid/combination of these two extremes. You have some set of fixed logic blocks that do exactly One Set Of Things, and you have some other logic blocks that can be reconfigured to do A Few Different Things. The degree to which the programmable stuff is programmable is a major input to how many different features you can do on the chip, and at what speeds. Sometimes you can use the same hardware block to do multiple things on a packet if you’re willing to sacrifice some packet rate and/or bandwidth. The constant “law of physics” is that you can always do a given function in less power/space/cost if you’re willing to optimize for that specific thing – but you’re sacrificing flexibility to do it. The more flexibility (“programmability”) you want to add to a chip, the more logic and memory you need to add.

From a performance standpoint, on current “fast” chips, many (but certainly not all) of the “pipelines” are designed to forward one packet per clock cycle for “normal” use cases. (Of course we sneaky vendors get to decide what is normal and what’s not, but that’s a separate issue…) So if I have a chip that has one pipeline and it’s clocked at 1.25GHz, that means that it can forward 1.25 billion packets per second. Note that this does NOT mean that I can forward a packet in “a one-point-two-five-billionth of a second” – but it does mean that every clock cycle I can start on a new packet and finish another one. The length of the pipeline impacts the latency of the chip, although this part of the latency is often a rounding error compared to the number of times I have to read and write the packet into different memories as it goes through the system.

So if this pipeline can do 1.25 billion PPS and I want to be able to forward 10BPPS, I can build a chip that has 8 of these pipelines and get my performance target that way. I could also build a “pipeline” that processes multiple packets per clock, if I have one that does 2 packets/clock then I only need 4 of said pipelines… and so on and so forth. The exact details of how the pipelines are constructed and how much parallelism I built INSIDE a pipeline as opposed to replicating pipelines is sort of Gooky Implementation Details, but it’s a very very important part of doing the chip level architecture as those sorts of decisions drive lots of Other Important Decisions in the silicon design…
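
The same sums as a short sketch, with the numbers taken straight from
the paragraph above:

    import math

    target_pps = 10e9
    clock_hz = 1.25e9

    for pkts_per_clock in (1, 2):
        pipelines = math.ceil(target_pps / (clock_hz * pkts_per_clock))
        print(pkts_per_clock, pipelines)   # 1 pkt/clock -> 8 pipelines, 2 pkts/clock -> 4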

–lj

Mandatory slide of the laundry analogy for pipelining: https://cs.stanford.edu/people/eroberts/courses/soco/projects/risc/pipelining/index.html

As Lincoln said - all of us directly working with BCM/other silicon vendors have signed numerous NDAs.
However if you ask a well-crafted question - there’s always a way to talk about it ;)

In general, if we look at the whole spectrum, on one side there are massively parallelized “many-core” RTC ASICs, such as Trio, Lightspeed, and similar (as the last gasp of the Redback/Ericsson venture we built a 1400-HW-thread ASIC (Spider)).
On the other side of the spectrum are fixed-pipeline ASICs, from BCM Tomahawk at its extreme (max speed/radix, min features), moving through BCM Trident, Innovium, Barefoot (quite a different animal wrt programmability), etc. - usually with a shallow on-chip buffer only (100-200M).

In between we have so-called programmable-pipeline silicon; BCM DNX and Juniper Express are in this category: usually a combo of OCB + off-chip memory (most often HBM) (2-6G), and usually with line-rate/high-scale security/overlay encap/decap capabilities. They usually have highly optimized RTC blocks within a pipeline (RTC within a macro). The way and speed to access DBs and memories is evolving with each generation, and the number/speed of non-networking cores (usually ARM) keeps growing - OAM, INT, and local optimizations are the primary users of them.

Cheers,

Jeff

Nokia FP is like >1k, Juniper Trio is closer to 100 (earlier Trio LUs
had far fewer). I could give exact numbers for EA and YT if needed;
they are visible in the CLI and the end user can even profile them, to
see what ucode function they are spending their time on.

What do we call Nokia FP? Where you have a pipeline of identical cores
doing different things, and the packet has to hit each core in the line
in order? How do we contrast this with an NPU, where a given packet
hits exactly one core?

I think ASIC, NPU, pipeline, and RTC are all quite ambiguous terms. When
we say pipeline, usually people assume purpose-built, unique HW blocks
that the packet travels through (like DNX, Express), and not a pipeline
of fully flexible, identical cores like FP.

So I guess I would consider a 'true pipeline' to be a pipeline of unique
HW blocks, a 'true NPU' one where a given packet hits exactly 1 core, and
anything else as more or less a hybrid.

I expect once you get to the details of implementation, all of these
generalisations lose communicative power.