We have an ASR1006 with the following cards…
1 x ESP40
1 x SIP40
4 x SPA-1x10GE-L-V2
1 x 6TGE
1 x RP2
We’ve been having latency and packet loss during peak periods…
We notice all is good until we reach 50% utilization in the output of
‘show platform hardware qfp active datapath utilization summary’
Literally … 47% good… 48% good… 49% latency to next hop goes from 1ms to 15-20ms… 50% we see 1-2% packet-loss and 30-40ms latency… 53% we see 60-70ms latency and 8-10% packet loss.
Is this expected… that the ESP40 can only really push 20G before it starts to have performance issues?
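For reference, the companion checks to that utilization command are the QFP drop counters and the interface overrun/reliability counters, roughly like this (the interface name is just a placeholder for one of the 10G ports):

    show platform hardware qfp active datapath utilization summary
    show platform hardware qfp active statistics drop
    show interfaces TenGigabitEthernet0/0/0 | include reliability|overrun

If the drop counters climb in step with the utilization figure, that would at least confirm the loss is happening in the QFP rather than somewhere upstream.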
It has been many years since I used an ASR1000, but honestly, you have an ESP40 in a box with 10x10G interfaces? That’s a very underpowered processor for that job. The ESP40 was designed for a box with 1G interfaces and perhaps a couple of 10G ports. The ASR1000 is a CPU based box: everything goes back to the processor, and remember that Cisco math means half duplex, not full, so the 40G rating counts input and output combined.
We see similar problems on an ASR1006-X with ESP100 and MIP100. At about ~45 Gbit/s of traffic (on ~30k PPPoE sessions and ~700k CGN sessions) the QFP utilization skyrockets from ~45% straight to ~95%.
I don’t know if it’s the CGN sessions or the traffic/packets causing the load increase; the datasheet says it supports something like 10M sessions… but maybe not if you really intend to push packets through it?
We have not seen such spikes with much higher PPS but a lower CGN session count, for example when we had DDoS attacks against end customers.
I'm not sure what a CPU based box means here. ASR1k isn't using a general-purpose core like PQ3, INTC or AMD. Like CRS-1 and nPower, ASR1k has Cisco-made forwarding logic using cores from Tensilica (CPP10/popey I believe was 40 x Tensilica DI 570T; the next iteration was 64 cores).
I mean a router without ASIC-based forwarding, like a Juniper MX or Nokia 7750. The advantage of the 1k is that you don't need a services card for CGNAT, but the large disadvantage is that everything passes through the ESP processor, and this often leads to disappointing results under load.
I think the ASR1k NPU is a perfect analog for Juniper MX Trio or Nokia 7750 FP; these all fall under the very common description of an NPU. We could dive deep and explain why the 7750 and MX are vastly different in their choice of many small cores versus a few large ones, but ultimately they all easily fall under the NPU definition.
I haven't experienced that across about a dozen ASR1ks, though I just checked and we are not pushing any of our ESPs over 50% currently (the closest we have is an ESP40 doing 18Gbps). However, I'm pretty sure we've pushed older ESPs (5s, 10s, and 20s) to ~75% or so in the past.
Given the components you have, I would have expected your router to handle 40Gbps input and 40Gbps output. That could either be 40Gbps into the 6-port card [and 40Gbps out of the four 1-port cards], or 40Gbps of input spread across the 6-port and 1-port cards [and then output across both cards as well].
Despite other comments, I think your components are well matched. The only non-obvious thing here is that the 6-port card only has a ~40Gbps connection to the backplane, so you cannot use all 6 ports at full bandwidth. I think this router is well suited to handle 20-30Gbps of customer demand doing standard destination-based routing (if you're doing traffic shaping, NAT, tunnelling, or something else more involved than extended ACLs, you may need something beefier at those traffic levels).
Our total inbound bandwidth from upstreams is about 20G at max… so that really is the total bandwidth…
Now we are terminating about 1800 PPPoE sessions on the router as well, with policing applied to them, as well as shaping on a couple of our major downstream links.
Is anyone interested in making a few $ and taking a look for us, to see if we are really hitting capacity, or if some sort of tuning could be done to help us eke out a little bit more from this device before upgrading?
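For context, the shaper and policer drop counters on those links are visible per interface with something like this (the interface name is just a placeholder; I'm leaving out the per-session variant since I'd have to double-check its exact syntax):

    show policy-map interface TenGigabitEthernet0/2/0

That at least separates intentional policer/shaper drops from anything the ESP is dropping on its own.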
We had a similar issue about 4 years ago.
We were seeing packet loss and drops get progressively worse, and the router was falling over when it reached about 70% utilization.
We could see interface reliability go down and input errors due to overruns on the interfaces.
Cisco blamed it on microbursts that could not be handled under load.
"We were able to replicate this scenario in our lab as well.
QFP under high load generated input errors and overruns which in turn led to unicast failures/drops/latency.
The issue is not consistent with QFP % utilization as sometimes with even 80%+ traffic, we do not see the drops."
And they recommended removing traffic or upgrading the ESP.
One of our guys disabled NBAR on the router and the problem disappeared.
I would suggest taking a look at what features you are using and, if you can, trying to disable them to see if that makes any impact.
We then upgraded ESPs and all has been fine since.
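If you want to check whether NBAR is in the picture, it's roughly this (interface name is just an example; also keep in mind a QoS class-map using 'match protocol' pulls NBAR in even where protocol discovery isn't explicitly configured):

    show ip nbar protocol-discovery
    !
    ! disable per interface where it is not needed
    interface TenGigabitEthernet0/1/0
     no ip nbar protocol-discovery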
If you still need NetFlow to gain some visibility into what’s happening, you could check the sampling rate of your NetFlow export.
Usually 1/1000 (0.1%) is good. Maybe for you 1/1,000,000 could be good enough too.
If exports are unsampled (100%), then indeed there are some real-time performance penalties. Not many people need a fully accurate 100% NetFlow export; if you need 100% accuracy, then you need dedicated hardware.
0%, i.e. totally disabled, is also often good enough if you don’t need the visibility.
NetFlow is useful in my opinion, but maybe not in every case.
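For what it’s worth, a sampled Flexible NetFlow setup on IOS-XE looks roughly like the sketch below. All the names, the collector address, and the interface are placeholders, and 1-in-1000 is just an example rate; check the maximum sampler window your release supports before aiming for anything like 1/1,000,000, as I recall it being far smaller than that.

    flow record FLOW-REC-EXAMPLE
     match ipv4 source address
     match ipv4 destination address
     collect counter bytes
     collect counter packets
    !
    flow exporter FLOW-EXP-EXAMPLE
     destination 192.0.2.10
     transport udp 2055
    !
    flow monitor FLOW-MON-EXAMPLE
     record FLOW-REC-EXAMPLE
     exporter FLOW-EXP-EXAMPLE
    !
    sampler SAMPLE-1-IN-1000
     mode random 1 out-of 1000
    !
    interface TenGigabitEthernet0/0/0
     ip flow monitor FLOW-MON-EXAMPLE sampler SAMPLE-1-IN-1000 input

Sampling at the monitor level like this keeps most of the visibility while taking the flow-accounting work off the vast majority of packets.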