MLPPP Problem with Cisco 7513

I am hoping someone can shed some light on an interesting problem we are
having:

When we set up a customer for MLPPP, things tend to go well for a period
of time. Then - all of a sudden - we will begin to have problems with
our multilink bundles (generally only one at a time) and the only fix is
to reload our 7513. This problem happens on both of our 7513 routers
from time to time. After a reload, the problem stays gone for as long as
several months or, in the most recent case, only about 12 hours.

Once we see the problem, it is apparent only in one direction. For
example the customer can push the full capacity of their circuits to us,
but they cannot pull anything above about 300k back to them on a two-T1
bundle. This is the same every single time we have the problem.
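
To put those numbers in perspective, here is a rough back-of-the-envelope
calculation. It assumes each T1 is a full 24-timeslot circuit at 64 kbit/s
per timeslot and reads "300k" as 300 kbit/s; neither is spelled out above,
so treat both as assumptions.

```python
# Rough capacity check for a two-T1 MLPPP bundle.
# Assumptions (not stated in the post): full 24-timeslot T1s, and
# "300k" meaning 300 kbit/s. PPP/MLPPP header overhead is ignored.
DS0_KBPS = 64
TIMESLOTS_PER_T1 = 24


def bundle_capacity_kbps(t1_count):
    """Nominal payload rate of a bundle of full T1s, in kbit/s."""
    return t1_count * TIMESLOTS_PER_T1 * DS0_KBPS


def fraction_of_capacity(observed_kbps, t1_count):
    """Observed throughput as a fraction of the nominal bundle rate."""
    return observed_kbps / bundle_capacity_kbps(t1_count)


if __name__ == "__main__":
    print(f"bundle capacity: {bundle_capacity_kbps(2)} kbit/s")
    print(f"observed share:  {fraction_of_capacity(300, 2):.1%}")
```

Under those assumptions the affected direction runs at roughly a tenth of
the bundle's nominal rate, and well below what a single member T1 carries
on its own.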

We have changed multilink bundles, tried different types of switching
and route caching, turning on and off fragmentation - the only thing
that solves the problem is reloading the entire router. We can pull the
T1s from the multilink bundle and each individual T1 works great. No
line errors, no crc errors - nothing. No errors are apparent while in
MLPPP mode either. No throttles or anything similar.

We have had this problem in the past and it was recommended that we
upgrade the code on our 7513s. We are currently running version
12.2(13)T5 on both our 7513s and the customer's router. Upgrading
the code did not solve the problem. I have been unable to locate a Cisco
bug defining this type of problem for any version of their code.

This particular customer's T1s are both terminated on the same VIP (we
are running DMLP) but are terminated on separate PAs and hence separate
CT3s. We have noticed the problem even with T1 bundles on the exact same
PA and CT3. We are not doing multi-chassis DMLP. dCEF is enabled on both
routers; however, the problem remains the same even after disabling dCEF.

Here are configs and router info:

interface Multilink6
 description Eastgate Mall (s2/1/0/8:0 and s2/0/0/12:0) [20291]
 ip address 207.158.1.133 255.255.255.252
 no cdp enable
 ppp multilink
 ppp multilink interleave
 multilink-group 6

interface Serial2/0/0/12:0
 description Eastgate Mall #2
 no ip address
 encapsulation ppp
 no fair-queue
 ppp multilink
 multilink-group 6

interface Serial2/1/0/8:0
 description Eastgate Mall #1
 no ip address
 encapsulation ppp
 no fair-queue
 ppp multilink
 multilink-group 6
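
For anyone wanting to check member placement across many bundles, the
config above can be decoded with a small script. This is only a sketch: it
assumes the serial interface names follow the layout
Serial<slot>/<PA bay>/<port>/<T1>:<channel-group>, which matches the
description above (same VIP, separate PAs), and the function names are made
up for illustration.

```python
import re
from collections import defaultdict

# Assumed naming layout for channelized T3 port adapters on a VIP:
#   Serial<slot>/<pa-bay>/<port>/<t1>:<channel-group>
IFACE_RE = re.compile(r"^interface Serial(\d+)/(\d+)/(\d+)/(\d+):(\d+)$")
GROUP_RE = re.compile(r"^multilink-group (\d+)$")

SAMPLE_CONFIG = """\
interface Multilink6
ppp multilink
multilink-group 6

interface Serial2/0/0/12:0
encapsulation ppp
ppp multilink
multilink-group 6

interface Serial2/1/0/8:0
encapsulation ppp
ppp multilink
multilink-group 6
"""


def members_by_group(config_text):
    """Map multilink group -> list of (slot, bay, port, t1, channel)."""
    groups = defaultdict(list)
    current = None  # serial interface being parsed, if any
    for raw in config_text.splitlines():
        line = raw.strip()
        if line.startswith("interface "):
            m = IFACE_RE.match(line)
            # Non-serial interfaces (e.g. Multilink6) reset the context.
            current = tuple(int(x) for x in m.groups()) if m else None
        elif current is not None:
            g = GROUP_RE.match(line)
            if g:
                groups[int(g.group(1))].append(current)
    return dict(groups)


def placement(members):
    """Report whether all member links share a VIP slot and a PA bay."""
    slots = {m[0] for m in members}
    bays = {(m[0], m[1]) for m in members}
    return {"same_vip": len(slots) == 1, "same_pa": len(bays) == 1}


if __name__ == "__main__":
    for group, members in members_by_group(SAMPLE_CONFIG).items():
        print(group, members, placement(members))
```

For bundle 6 this reports both links in slot 2 but in different PA bays,
which is the layout described above.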

AR04#sh diag 2
Slot 2:
        Physical slot 2, ~physical slot 0xD, logical slot 2, CBus 0
        Microcode Status 0x4
        Master Enable, LED, WCS Loaded
        Board is analyzed
        Pending I/O Status: None
        EEPROM format version 1
        VIP2 R5K controller, HW rev 2.02, board revision D0
        Serial number: 17953368 Part number: 73-2167-05
        Test history: 0x00 RMA number: 00-00-00
        Flags: cisco 7000 board; 7500 compatible

        EEPROM contents (hex):
          0x20: 01 1E 02 02 01 11 F2 58 49 08 77 05 00 00 00 00
          0x30: 68 00 00 01 00 00 00 00 00 00 00 00 00 00 00 00

        Slot database information:
        Flags: 0x4 Insertion time: 0x41C8 (1d08h ago)

        Controller Memory Size: 32 MBytes DRAM, 4096 KBytes SRAM

        PA Bay 0 Information:
                CT3 single wide PA, 1 port
                EEPROM format version 1
                HW rev 1.00, Board revision A0
                Serial number: 17814822 Part number: 73-3037-01

        PA Bay 1 Information:
                CT3 single wide PA, 1 port
                EEPROM format version 1
                HW rev 1.00, Board revision A0
                Serial number: 09725065 Part number: 73-3037-01

        --Boot log begin--

Cisco Internetwork Operating System Software
IOS (tm) VIP Software (SVIP-DW-M), Version 12.2(13)T5, RELEASE SOFTWARE (fc1)
TAC Support: http://www.cisco.com/tac
Copyright (c) 1986-2003 by cisco Systems, Inc.
Compiled Wed 28-May-03 21:57 by nmasa
Image text-base: 0x60010930, data-base: 0x604C0000

AR04#sh ver
Cisco Internetwork Operating System Software
IOS (tm) RSP Software (RSP-JSV-M), Version 12.2(13)T5, RELEASE SOFTWARE (fc1)
TAC Support: http://www.cisco.com/tac
Copyright (c) 1986-2003 by cisco Systems, Inc.
Compiled Wed 28-May-03 22:00 by nmasa
Image text-base: 0x60010948, data-base: 0x61F0A000

ROM: System Bootstrap, Version 11.1(8)CA1, EARLY DEPLOYMENT RELEASE SOFTWARE (fc1)

AR04 uptime is 1 day, 8 hours, 24 minutes
System returned to ROM by reload at 09:45:48 UTC Tue Dec 23 2003
System image file is "slot0:rsp-jsv-mz.122-13.T5.bin"

cisco RSP4 (R5000) processor with 262144K/2072K bytes of memory.
R5000 CPU at 200Mhz, Implementation 35, Rev 2.1, 512KB L2 Cache
Last reset from power-on
G.703/E1 software, Version 1.0.
G.703/JT2 software, Version 1.0.
X.25 software, Version 3.0.0.
SuperLAT software (copyright 1990 by Meridian Technology Corp).
Bridging software.
TN3270 Emulation software.
Primary Rate ISDN software, Version 1.1.
Chassis Interface.
3 VIP2 R5K controllers (2 FastEthernet)(6 Channelized T3).
2 FastEthernet/IEEE 802.3 interface(s)
168 Serial network interface(s)
6 Channelized T3 port(s)
123K bytes of non-volatile configuration memory.

20480K bytes of Flash PCMCIA card at slot 0 (Sector size 128K).
8192K bytes of Flash internal SIMM (Sector size 256K).

Slave in slot 7 is running Cisco Internetwork Operating System Software
IOS (tm) RSP Software (RSP-DW-M), Version 12.2(13)T5, RELEASE SOFTWARE (fc1)
TAC Support: http://www.cisco.com/tac
Copyright (c) 1986-2003 by cisco Systems, Inc.
Compiled Wed 28-May-03 22:33 by nmasa
Slave: Loaded from system
Slave: cisco RSP4 (R5000) processor with 262144K bytes of memory.

Configuration register is 0x2102

Any help would be greatly appreciated.

Richard,

One bug I know of that could possibly be a match is:

CSCec00268
Externally found severe defect: Resolved (R)
Input drops and * throttles on PPP multilink interface

It is fixed in 12.2(15)T9. To verify, check
'sh int multilink <x>' and see whether the interface is under
throttle.

Router#sh int mu 2
Multilink2 is up, line protocol is up
    <snip>
Received 0 broadcasts, 0 runts, 0 giants, 0* throttles
                                           ^-- Here.

If that's not it, email me offline and I'll help you get
it resolved.
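
If there are many bundles to check, the throttle counter can be pulled out
of captured 'sh int' output with a short script. This is a minimal sketch,
assuming the counter always appears as "<n> throttles" or "<n>* throttles"
as in the sample line above, and that the asterisk is the marker to look
for; the function name is hypothetical.

```python
import re

# Matches the throttle counter in captured 'show interfaces' text, e.g.
# "0 giants, 0* throttles". The optional '*' is the marker pointed at in
# the sample output; the exact output layout is an assumption.
THROTTLE_RE = re.compile(r"(\d+)(\*?) throttles")


def throttle_status(show_int_output):
    """Return (count, starred), or None if no throttle counter is found."""
    m = THROTTLE_RE.search(show_int_output)
    if m is None:
        return None
    return int(m.group(1)), m.group(2) == "*"


if __name__ == "__main__":
    sample = "Received 0 broadcasts, 0 runts, 0 giants, 0* throttles"
    print(throttle_status(sample))
```

A starred result on a Multilink interface would be the case worth
comparing against the bug description.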

Thanks,
Rodney

Richard J. Sears said:

We have changed multilink bundles, tried different types of switching
and route caching, turning on and off fragmentation - the only thing

[snip]

dCEF is enabled on both
routers; however, the problem remains the same even after disabling dCEF.

Your last line, I think, is rather interesting.

I have a 7513 with essentially the same hardware as you (CT3s, FEs, etc.)
and was recently testing red/wred. I had read the notes about dCEF and how
it changes and limits some of what one can do with red/wred, since with
dCEF enabled the queueing happens on the VIP instead of the RSP.

During testing, I set up a wred config on several interfaces and everything
worked fine -- while using standard CEF. Later on, I enabled dCEF and began
to test (d)wred. I found the limitations on parameters were a killjoy, so I
tried backing out of dCEF/dwred.

Interestingly, I couldn't back out. Even if I negated all the wred commands
and disabled dCEF, the 'sh int' would still report the VIP was doing all the
work. After more attempts and various ideas, I just reloaded the damn thing
-- it came back up, as I had hoped, with the RSP doing wred rather than
the VIP.

So, I wonder whether the mlppp instability is due to some obscure or as yet
undiscovered dCEF bug, and also whether dCEF is really getting disabled
when you disable it. Maybe disable dCEF, then reload? It may also help
to get more familiar with the forwarding path data takes in this scenario;
I'm not familiar enough with Cisco's mlppp to know whether all the
encapsulation and multilink work happens on the RSP side or the VIP side.

Cisco Internetwork Operating System Software
IOS (tm) VIP Software (SVIP-DW-M), Version 12.2(13)T5, RELEASE SOFTWARE

Also, instead of following 12.2, maybe try the 12.0(S) train, if you don't
need specific features in 12.2. I've surveyed several other folks
recently about which version they run; the 12.0 S-line seems to be the
least-hated and most stable train. It sounds like (at this point) it'd be
worth trying just about anything to get the mlppp links stable. <G>

--Tk