Catalyst 6500 High Switch Proc

Hello.

I've run into a bit of a snag and I hope some folks here may be able to enlighten. From time to time I check the 'sh platform hardware capacity' command on our Catalyst 6509s and have noticed this item:

CPU Resources
  CPU utilization: Module    5 seconds    1 minute    5 minutes
                   5  RP      1% /  0%          3%           4%
                   5  SP     82% / 27%         62%          73%

This is shown on two 6509 switches that we operate as Core layer devices. This value goes up to 85-90% during periods of peak traffic and I'm concerned that this may be a problem.
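For anyone who wants to watch this value over time instead of checking by hand, here is a rough Python sketch that pulls the RP/SP rows out of the text quoted above. The function names and the 70% threshold are made up for illustration, and the reading of the "82% / 27%" pair as total / interrupt-level is an assumption borrowed from 'sh proc cpu', not something stated in this thread.

```python
import re

# Output copied from the post above; real output may differ in spacing.
SAMPLE = """\
CPU Resources
  CPU utilization: Module 5 seconds 1 minute 5 minutes
                   5 RP 1% / 0% 3% 4%
                   5 SP 82% / 27% 62% 73%
"""

# module, processor, 5-second pair, 1-minute, 5-minute
ROW = re.compile(
    r"^\s*(\d+)\s+(RP|SP)\s+(\d+)%\s*/\s*(\d+)%\s+(\d+)%\s+(\d+)%\s*$"
)


def parse_cpu_rows(text):
    """Return one dict per RP/SP row found in the pasted output."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line)
        if match is None:
            continue  # header or unrelated line
        module, proc, sec5, sec5_intr, min1, min5 = match.groups()
        rows.append({
            "module": int(module),
            "proc": proc,
            "five_sec": int(sec5),
            # Assumed to be interrupt-level CPU, as in 'sh proc cpu'.
            "five_sec_intr": int(sec5_intr),
            "one_min": int(min1),
            "five_min": int(min5),
        })
    return rows


def over_threshold(rows, threshold=70):
    """Rows whose 5-minute average is at or above a (hypothetical) threshold."""
    return [row for row in rows if row["five_min"] >= threshold]


if __name__ == "__main__":
    for row in over_threshold(parse_cpu_rows(SAMPLE)):
        print(f"module {row['module']} {row['proc']}: {row['five_min']}% over 5 minutes")
```

How the text gets off the switch (SSH, expect script, and so on) is left out on purpose.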

'sh proc cpu' usually shows 10% or less.

I've gone over this document backwards and forwards and none of the situations outlined seem to apply here:
http://www.cisco.com/en/US/products/hw/switches/ps708/products_tech_note09186a00804916e0.shtml

One thing to note is that our main ACL for ingress traffic is applied here due to historical reasons. It's roughly 5000 single-host entries at present. We also use these devices for NDE.

I'm probably missing some other key details, but what could influence the SP like this? Any insight would be appreciated.

This should probably be on cisco-nsp rather than nanog, but...

5000 lines for ACL? I don't have any experience with ACLs of that size, but it sounds like a possible problem.

If you're doing NetFlow export and not doing sampled NetFlow, I'm guessing this is where your problem is. 'sh mls netflow table-contention detailed' might be able to confirm or rule this out.

* Jon Lewis:

I've run into a bit of a snag and I hope some folks here may be able
to enlighten. From time to time I check the 'sh platform hardware
capacity' command on our Catalyst 6509s and have noticed this item:

MSFC/PFC version is also relevant.

5000 lines for ACL? I don't have any experience with ACLs of that
size, but it sounds like a possible problem.

Yes, but it should be doable. I don't know the commands for the
current IOS releases, but "show tcam" (including "show tcam detail")
and "show fm interface" were quite helpful for designing ACLs for
efficient processing.

Jon Lewis wrote:

This is on a Sup720-3BXL by the way:

'sh mls netflow table-con detailed':
Earl in Module 5
Detailed Netflow CAM (TCAM and ICAM) Utilization

TCAM Utilization : 100%
ICAM Utilization : 6%
Netflow TCAM count : 262024
Netflow ICAM count : 8
Netflow Creation Failures : 2085847
Netflow CAM aliases : 0

This looks like the same issue I ran into not long ago. Switch your NetFlow over from full to sampled. You lose lots of data, but your hardware can't handle full NetFlow for your traffic level.

AFAIK, your only other options are to mess with the mls aging timers (shorten them) or buy cards with DFC and hope that gets you enough additional NetFlow capacity for the interfaces you're collecting on.

http://www.gossamer-threads.com/lists/cisco/nsp/94953
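To make the reading of those counters concrete, here is a small Python sketch that parses the "name : value" lines quoted above and flags the combination that matters here: a full TCAM together with a non-zero creation-failure count. The function names are made up for illustration, and the input format is assumed to be exactly what was pasted in this thread.

```python
# Counter lines copied from the output quoted in this thread.
SAMPLE = """\
Earl in Module 5
Detailed Netflow CAM (TCAM and ICAM) Utilization

TCAM Utilization : 100%
ICAM Utilization : 6%
Netflow TCAM count : 262024
Netflow ICAM count : 8
Netflow Creation Failures : 2085847
Netflow CAM aliases : 0
"""


def parse_counters(text):
    """Map each 'name : value' line to an integer, dropping any '%' sign."""
    counters = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        value = value.strip().rstrip("%")
        if not sep or not value.isdigit():
            continue  # banner lines and blanks
        counters[key.strip()] = int(value)
    return counters


def table_is_overflowing(counters):
    """True when the TCAM is full and flows are failing to be created."""
    return (
        counters.get("TCAM Utilization", 0) >= 100
        and counters.get("Netflow Creation Failures", 0) > 0
    )


if __name__ == "__main__":
    counters = parse_counters(SAMPLE)
    print(counters["Netflow TCAM count"], table_is_overflowing(counters))
```

As a side note, 262024 is just under 2^18 = 262144, which would fit a table of 256K entries that is essentially full. That capacity figure is an assumption here and is not stated anywhere in the thread.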

Hopefully he is not trying to use netflow for accounting/billing.
I use:

mls sampling packet-based 1024 8192

It is a convenient factor of ~1000 from the real numbers:
1 Gbit/s of traffic shows up as 1 Mbit/s. This has been accurate enough
for anything I have wanted to look at, like per-AS traffic.

- Kevin
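The scaling described above can be sketched in a few lines of Python. This assumes the first number in the command (1024) acts as a 1-in-N sampling rate, which matches the "factor of ~1000" in the message but is not confirmed anywhere in this thread. The function names are made up for illustration.

```python
# Assumed 1-in-N rate, taken from 'mls sampling packet-based 1024 8192'.
SAMPLING_RATE = 1024


def estimate_actual_bps(reported_bps, rate=SAMPLING_RATE):
    """Scale a rate seen by the collector back up to an estimate of real traffic."""
    return reported_bps * rate


def expected_reported_bps(actual_bps, rate=SAMPLING_RATE):
    """What the collector would roughly show for a given real traffic rate."""
    return actual_bps / rate


if __name__ == "__main__":
    one_gbit = 1_000_000_000
    # About 0.98 Mbit/s, i.e. the "1 Gbit/s shows up as 1 Mbit/s" rule of thumb.
    print(expected_reported_bps(one_gbit) / 1_000_000)
```

The estimate is statistical, so it is only as good as the sample; that is why it is described as fine for per-AS traffic views and not recommended for billing.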

Ross Vandegrift wrote:

One thing to note is that our main ACL for ingress traffic is applied here due to historical reasons. It's roughly 5000 single-host entries at present. We also use these devices for NDE.

On a Sup720-3BXL, if your ACL TCAM utilization is fine, this shouldn't
impact performance unless you're logging too much. Since you've been
over the CPU utilization doc, I'm guessing you know that.

"show platform hardware capacity acl" will give you a breakdown on
your ACL TCAM usage.

I'm probably missing some other key details, but what could influence the SP like this? Any insight would be appreciated.

Cisco says that Netflow-based features always handle the first packet
of a flow in software, but I don't know if this is the RP or the SP.
It would make sense if a first-flow packet that didn't need punting
hit the SP and not the RP. In that case, your traffic level with
netflow enabled could explain your high SP utilization.

It is a Sup720-3BXL. Based on the suggestions here, I went ahead and did 'no ip flow ingress' on all the interfaces just to see, and sure enough, the SP went down to about 10-15%. My colleague implemented packet count-based NetFlow sampling to attempt to reduce the 100% NetFlow TCAM usage, and it appears to be partially effective. The table still fills up frequently, so we'll have to do some more tweaking.
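For repeating the 'no ip flow ingress' test across many interfaces, a throwaway Python helper along these lines can generate the config lines to paste. The interface names below are hypothetical placeholders, and the helper only emits the command already mentioned in this thread (or its positive form to turn collection back on).

```python
def netflow_toggle_config(interfaces, enable):
    """Build config lines that add or remove 'ip flow ingress' per interface."""
    command = "ip flow ingress" if enable else "no ip flow ingress"
    lines = []
    for name in interfaces:
        lines.append(f"interface {name}")
        lines.append(f" {command}")
    return "\n".join(lines)


if __name__ == "__main__":
    # Hypothetical interface names; substitute the ones actually in use.
    ports = ["TenGigabitEthernet1/1", "TenGigabitEthernet1/2"]
    print(netflow_toggle_config(ports, enable=False))
```

Generating both directions from the same list makes it easy to put collection back exactly where it was after the test.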

I appreciate all the replies, public and private.