OUTAGE: MCI/Worldcom frame-relay network

SEAN@dra.COM (Sean Donelan) writes:

I'm trying to post no more than once in 12 hours.

It's been another twelve hours.

Although I don't like all the red on our network map, it is showing how
resilient IP protocols are. My core network uses several different facility
providers and technologies, and for those customers who have more than one
connection: IP works. How else could I send this e-mail? However, for
customers with only one connection, they don't really care how many other
connections are working. They only care about their one connection.

MCI/Worldcom claims 75% of their frame-relay network is 'up.' This is
down from their previous claim on Friday that 90% of the frame-relay
network was 'up.' We still see a number of PVCs throughout the US and
Canada either completely down, or dropping so many frames as to be
unusable. I received similar reports from other MCI/Worldcom frame-relay
users throughout the US and even as far away as Germany.

To try to figure out what is going on MCI/Worldcom imposed a 'quiet time'
on their network, telling their engineers to stop touching things. This
helped somewhat. Portions of the network stabilized on their own, but there
is still massive congestion. The quiet time is over. Now MCI/Worldcom is
trying to reduce the congestion by disconnecting parts of the frame-relay
network.

Remember this is Sunday, normally a low-traffic day. Monday is underway
in Europe, and will be hitting the East Coast of the USA soon. Traffic
levels are going to increase on the network, assuming any traffic can
get through at all.

MCI/Worldcom's web site continues its long silence about this outage.
I've got to hand it to MCI/Worldcom; its engineers may have trouble
keeping a frame-relay network going, but its PR department has done
a great job keeping a lid on the story. Even MCI/Worldcom customers
can't find out what is happening.

http://cbs.marketwatch.com/archive/19990806/news/current/wcom.htx?source=blq/yhoo&dist=yhoo

  It would seem that WCOM has indeed been communicating with the
  public, as early as 13:33 EST, August 6th.

  This is my concern with the recent trend in using NANOG
  as a pulpit for bringing focus to certain network problems.

  The information displayed here is not always accurate, and
  often subjective.

  I do not mean to suggest that Sean's posts are inaccurate, only that
  it does happen, and the list goes for a while with no correction.

  Usually I'd argue that any information is better than no
  information, but when the quality is of suspect value, I'm
  not sure the viewpoint stands.

  I'd recommend that an alternate forum be used for outage
  notification, like Stan Barber's outage-discuss list.

  -alan

Thus spake Sean Donelan (SEAN@SDG.DRA.COM)

  http://cbs.marketwatch.com/archive/19990806/news/current/wcom.htx?source=blq/yhoo&dist=yhoo

What frame relay switch is causing MCI/Worldcom such grief? The above
article contains this statement:

    "From what we know about the disruptions in the U.S., our UUNet is not
     being affected because its network rides on a different platform,"
     Wagner said.

From what I know of UUNet's frame relay network (which may be woefully
outdated knowledge) they are using Cascade/Ascend/Lucent switches. This
would tend to make one suspect that the problems are associated with
non-Lucent switches rather than being some unusual effect within the frame
relay protocol itself.

Is this another vendor-related problem? If not, does it affect NNI
customers of Worldcom?

With the growing profile of MPLS, I suspect that more networks are
planning to roll out frame relay or ATM in their core in which case the
technical facts behind these events should be of great interest to many of
us on this list.

I am less interested in the length of time a network is down and more
interested in why it was down and how other operators could avoid the same
problem.

Yes. We have one PVC using NNI via MCI/Worldcom (got it from LDDS years
ago). It has been bouncing up and down for the past four days.
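
For anyone who wants to put numbers on "bouncing up and down", here is a
minimal sketch of counting state transitions and computing availability for
a single PVC. It assumes you already have time-ordered (timestamp, is_up)
samples from your own polling; the function names and the sample data are
hypothetical and do not come from any MCI/Worldcom or vendor tool.

```python
def count_flaps(samples):
    """Count up/down transitions in a time-ordered list of (timestamp, is_up)."""
    flaps = 0
    previous = None
    for _, is_up in samples:
        if previous is not None and is_up != previous:
            flaps += 1
        previous = is_up
    return flaps


def availability(samples, end_time):
    """Fraction of the window from the first sample to end_time spent 'up'."""
    if not samples:
        return 0.0
    up_seconds = 0
    # Pair each sample with the next one; the last sample runs to end_time.
    following = samples[1:] + [(end_time, None)]
    for (start, is_up), (stop, _) in zip(samples, following):
        if is_up:
            up_seconds += stop - start
    total = end_time - samples[0][0]
    return up_seconds / total if total else 0.0


# Hypothetical samples: seconds since the start of the window, PVC state.
samples = [(0, True), (60, False), (90, True), (200, False), (260, True)]
flaps = count_flaps(samples)
uptime_fraction = availability(samples, end_time=300)
```

A PVC that looks "up" at any single poll can still show a high flap count
over the window, which is closer to what a customer actually experiences.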

Actually, what I meant was whether this so-called packet storm has
extended to frame relay networks with NNI to Worldcom. The PVC will bounce
if Worldcom cannot maintain the PVC portion within their network or cannot
pass traffic through their network. But the public description so far
seems to imply that the problem propagates from one switch to another.
Assuming this is actually what is happening, I wondered whether the
problem was propagating into any other networks via NNI connections.

And if the answer is yes, then what vendor made the switches, etc. etc...?
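
The question can be framed with a toy model. This is only a sketch under
stated assumptions: switches are nodes, the "storm" spreads hop by hop to
every neighbor, and an NNI link either relays it or does not. The topology,
the switch names, and the cross_nni flag are all hypothetical; nothing here
reflects how the Worldcom switches or any vendor's software actually behave.

```python
from collections import deque


def flood(links, start, nni_links=frozenset(), cross_nni=False):
    """Return the set of switches reached by hop-by-hop flooding from start.

    links      -- iterable of (a, b) pairs, treated as bidirectional
    nni_links  -- set of frozenset({a, b}) pairs that are NNI boundaries
    cross_nni  -- assumption under test: does the storm cross an NNI?
    """
    adjacency = {}
    for a, b in links:
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    reached = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for peer in adjacency.get(node, ()):
            if frozenset((node, peer)) in nni_links and not cross_nni:
                continue  # the boundary contains the storm in this scenario
            if peer not in reached:
                reached.add(peer)
                queue.append(peer)
    return reached


# Hypothetical topology: three switches in one network, two in another,
# joined by a single NNI link.
links = [("wcom1", "wcom2"), ("wcom2", "wcom3"),
         ("wcom3", "other1"), ("other1", "other2")]
nni = {frozenset(("wcom3", "other1"))}

contained = flood(links, "wcom1", nni, cross_nni=False)
leaked = flood(links, "wcom1", nni, cross_nni=True)
```

If the real behavior matches the first case, NNI customers would see only
bouncing PVCs; if it matches the second, their own switches would be at risk.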

Does anyone have an idea as to what link-state algorithm the Cascade/Ascend
9000 switches run? I have been given conflicting reports: some say it is OSPF,
while others say it is OSPF-"like," since John Moy worked for Cascade and
developed a version of OSPF for the Cascade/Ascend switches. Does anyone
have more info on this? Also, if so, what version is supported on the
Amethyst, Jade v.1, v.2 releases? Apparently, this may have something to do
with all the instability issues with WCOM.

CJ

Michael Dillon wrote:

What frame relay switch is causing MCI/Worldcom such grief?

The last time we had a Bay Networks salesperson visit he stated that MCI
(this was before the merger) was an all-Bay frame network. Whether that
has changed or not, I can't say, but I can't see them ditching it all so
quickly.

And yep, having one Bay router left doing frame here, I'd love to know
what the problem is. In NYC recently, we've had tons of problems as Bell
Atlantic migrates from frame to "frame emulation" on ATM switches.

Charles

I think the Bay guy had a switch to sell you. They are now, and for a long
time have been, using Cascade (now Ascend).

I'm confused by your message. Which is more questionable: The MarketWatch
story which was clearly inaccurate -- it didn't even mention impact to the
industry they report on (i.e., CBOT) -- or the messages Sean posted? From
all appearances Sean's messages were timely, highly accurate, and very
informative. This is more than can be said of the news media coverage.

This is the second time I have witnessed you complaining about outage
information being posted and discussed in this forum. Why is information
sharing bad?

  I don't believe that information sharing is bad.

  I think it is good.
  
  I do not believe NANOG is appropriate for real time operational
  issues.

  That's all.

  -alan

  PS - sean's messages were not 'highly accurate' as they maintained
       that WCOM was not reporting on the problems, while they were.

And now.... Lucent!

What a wonderful industry we work in when we can use the phrase "or
whatever label they're sticking on the front this week" regularly.

- Forrest W. Christian (forrestc@imach.com) KD7EHZ

  I do not believe NANOG is appropriate for real time operational
  issues.

Let me see here .... What can we not do on NANOG:

We can't talk about cisco configuration issues as they are discussed
on cisco-nsp or something like that.

We can't talk about Routing Protocols - they belong on some other list
somewhere.

We can not talk about outages affecting most all of our customers - as
this isn't the appropriate forum.

I could go on and on and on....

Is there ANY topic which is ok to discuss on the nanog list?

- Forrest W. Christian (forrestc@imach.com) KD7EHZ

MCI (not Worldcom) used to use Bay BCNs in what used to be known as
Hyperstream service for frame switching along with GDC switches for ATM.

-dorian

From the Nanog website:

".....Establish a forum for the exchange of technical information"

I agree that we should use restraint with posts like "my line is
up" and "my line is down". This, however, IS the place to discuss routers and
switches and software pertinent to our everyday practices.
Don't get me wrong, I still love to find out that a provider of mine is
having trouble on the west coast long before their call center has gotten
word of it.

mike

Michael Heller
Sr. Systems Engineer
Earthweb, Inc.
212.448.4175
mikeh@earthweb.com