Replacement for Avaya CNA/RouteScience

Drew_Weaver · July 3, 2008, 11:50am

Howdy, for reasons that might be inappropriate to discuss on this list, we've decided to replace our Avaya/RouteScience box, and we're looking for recommendations on other 'BGP management appliances'.

We're aware of the Internap FCP product, but is there anything else out there besides 'oy, hire a BGP admin ya tool!' that anyone can offer?

As always, comments are appreciated.

-Drew

Michienne_Dixon · July 3, 2008, 2:36pm

Have you considered any of the options from Vyatta?
Aside from the "roll your own" community offerings they also have a
precompiled virtual appliance as well as a physical appliance you can
use.

Paul_WALL · July 3, 2008, 3:25pm

Going off this and previous posts, you'd be well-served to follow the
advice you sarcastically dispense, and hire an engineer.

Opex and capex (spread over a ~2 year product lifetime) costs for the
above solutions in a small (several gigabits, several transit
providers) environment are right up there with the salary of a junior
to mid-level networking professional in most markets. By hiring a
live human, you get not only somebody who can tweak localpref, but
also a critical thinker who can aid in troubleshooting outages and
help you plan for growth.

Paul

Koch_Christian · July 3, 2008, 4:07pm

what does vyatta have to do with route intelligence/optimization?

vyatta is just a router..

-c

Eric_Van_Tol · July 3, 2008, 4:29pm

I'd like to hire that engineer, please. Can you send me his resume? Here's the job description:

- Required to work 24x7x365.
- Must monitor all network egress points to examine latency, retransmissions, packet loss, link utilization, and link cost.
- Required to "tweak localpref" on an average of 5000 prefixes per day, based upon a combination of the above criteria.
- Required to write up a daily, weekly, and monthly report to be sent to all managers on said schedule.
- Must not require health or dental care.

These devices are not a replacement for an actual engineer. They are a supplement to the network to assist the engineer in doing what he should be doing - engineering and planning as opposed to resolving some other network's packet loss/blackhole/peering dispute/latency problem.
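
As a rough illustration of the per-prefix decision described in the list above (latency, loss, utilization, and cost feeding a localpref choice), here is a minimal Python sketch. The `PathSample` fields, the weights in `score`, and the provider names are all hypothetical; nothing here reflects how any particular appliance actually ranks paths.

```python
from dataclasses import dataclass


@dataclass
class PathSample:
    """One measurement of a destination prefix via one egress (hypothetical fields)."""
    provider: str
    latency_ms: float
    loss_pct: float
    utilization_pct: float
    cost_per_mbps: float


def score(sample: PathSample) -> float:
    """Return a penalty score; lower is better. Weights are arbitrary assumptions."""
    congestion_penalty = 100.0 if sample.utilization_pct > 90.0 else 0.0
    return (
        sample.latency_ms
        + 50.0 * sample.loss_pct
        + congestion_penalty
        + 2.0 * sample.cost_per_mbps
    )


def best_egress(samples: list[PathSample]) -> str:
    """Pick the provider with the lowest penalty score for this prefix."""
    return min(samples, key=score).provider


samples = [
    PathSample("transit_a", latency_ms=40.0, loss_pct=2.0, utilization_pct=50.0, cost_per_mbps=5.0),
    PathSample("transit_b", latency_ms=55.0, loss_pct=0.0, utilization_pct=60.0, cost_per_mbps=4.0),
    PathSample("transit_c", latency_ms=35.0, loss_pct=0.0, utilization_pct=95.0, cost_per_mbps=6.0),
]
print(best_egress(samples))
```

Repeating this for a few thousand prefixes, around the clock, is the part that is tedious to do by hand.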

-evt

Rob_Seastrom2 · July 3, 2008, 4:50pm

Eric Van Tol <eric@atlantech.net> writes:

> I'd like to hire that engineer, please. Can you send me his resume?
> Here's the job description:
>
> - Required to work 24x7x365.
> - Must monitor all network egress points to examine latency, retransmissions, packet loss, link utilization, and link cost.
> - Required to "tweak localpref" on an average of 5000 prefixes per day, based upon a combination of the above criteria.
> - Required to write up a daily, weekly, and monthly report to be sent to all managers on said schedule.
> - Must not require health or dental care.
>
> These devices are not a replacement for an actual engineer. They
> are a supplement to the network to assist the engineer in doing what
> he should be doing - engineering and planning as opposed to
> resolving some other network's packet loss/blackhole/peering
> dispute/latency problem.

You can certainly get close to the requirements stated above by
offering a decent salary and hiring a reasonably clued engineer with
an SP background. You may have to settle for IRC, WoW, or SecondLife
as daily recreational activity that doesn't buy you much (expressed in
your requirements list as "tweaking localpref").

My general experience with such boxes is that they're awfully good at
impressing the PHBs, but not something you can really defend from a
cost/benefit perspective. I really do need to go into the "custom
painted boxes with LCD screens on the front" business. I could make
"melons", like Tom Vu.

---Rob

Christian_Koch1 · July 3, 2008, 6:38pm

agreed. i see the most benefit from these boxes geared towards networks with
critical apps that are latency intensive and more than a handful of transit
providers than i do for a smaller provider..

depending on how many upstreams you're juggling, it's not that hard to create
some traffic engineering policies that can easily be modified, (whether by
hand or you use a script with a front end that can push the changes for you)
in order to re-route traffic in the event of issues with an SP network in
your end to end path..
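
To make the "script that can push the changes for you" idea concrete, here is a minimal Python sketch that only renders configuration text from a list of per-prefix localpref overrides. The function name `render_localpref_policy`, the route-map name, and the prefix-list naming scheme are assumptions for illustration; the output uses IOS-style route-map syntax, and actually pushing it to a router is left out.

```python
import ipaddress


def render_localpref_policy(route_map: str, overrides: list[tuple[str, int]]) -> str:
    """Render IOS-style config that sets local-preference per prefix.

    overrides is a list of (prefix, local_pref) pairs. Names generated here
    are hypothetical; adapt them to your own policy conventions.
    """
    lines = []
    seq = 10
    for index, (prefix, local_pref) in enumerate(overrides, start=1):
        # Raises ValueError on malformed prefixes before anything is rendered.
        network = ipaddress.ip_network(prefix)
        prefix_list = f"{route_map}-PL-{index}"
        lines.append(f"ip prefix-list {prefix_list} seq 5 permit {network}")
        lines.append(f"route-map {route_map} permit {seq}")
        lines.append(f" match ip address prefix-list {prefix_list}")
        lines.append(f" set local-preference {local_pref}")
        seq += 10
    # Final empty clause lets all other routes through unchanged.
    lines.append(f"route-map {route_map} permit {seq}")
    return "\n".join(lines)


config = render_localpref_policy(
    "TRANSIT-A-IN",
    [("192.0.2.0/24", 200), ("198.51.100.0/24", 80)],
)
print(config)
```

Keeping the overrides in a small table and regenerating the whole policy each time makes a change easy to review and easy to back out.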

personally i think manual traffic engineering and re-routing is one of the
more fun parts of engineering..

-christian

Eric_Van_Tol · July 3, 2008, 8:51pm

From: Christian Koch [mailto:christian@broknrobot.com]
Sent: Thursday, July 03, 2008 2:39 PM
To: Robert E. Seastrom
Cc: Eric Van Tol; nanog@nanog.org
Subject: Re: Replacement for Avaya CNA/RouteScience

> agreed. i see the most benefit from these boxes geared towards networks with critical apps that are latency intensive and more than a handful of transit providers than i do for a smaller provider..

Two questions:

First, what would you characterize as a "smaller provider"? One that has only one or two transits? If that's the case, then yes, I would definitely agree with you. However, once you go beyond just a few transits and peers, choosing which one to use for an unhealthy destination becomes tedious if you're trying to do it all manually. That said, I believe there is a stopping point at which the size of the network outgrows the need for such a device.

Second, can you provide an example of a network where users don't care about latency? I can't say that I've worked on tons of networks, but if "the internet is slow", our customers notice, even though they may not be using the latest in realtime streaming media protocols and apps.

> depending on how many upstreams you're juggling, it's not that hard to create some traffic engineering policies that can easily be modified, (whether by hand or you use a script with a front end that can push the changes for you) in order to re-route traffic in the event of issues with an SP network in your end to end path..

It *is* relatively simple to make routing changes manually, but wouldn't you agree that human error is the cause of most outages? Even the most skilled engineers/techs have days where their fingers are larger than normal. These devices, at least the one we use, make no changes to router configurations.

> personally i think manual traffic engineering and re-routing is one of the more fun parts of engineering..
>
> -christian

Yes, as long as the problem is interesting. Manually changing localpref on a route because of packet loss in someone else's network, several times per week, is not interesting to me or my staff. Nor is checking every transit link several times a day to make sure that we're not going over a commit when other transits have plenty of bandwidth to spare.

In my opinion, most of the value of these types of appliances is to help identify problem areas outside of your network, before end users notice them. I know firsthand that our route optimization appliance frees up my staff to work on other issues such as capacity planning, new service deployments, or discussing the latest MGS4 strategies. Well, hopefully not that last one.

-evt

Christian_Koch1 · July 4, 2008, 2:36am

imo, no more than 3-4 transit providers and maybe a presence at 1 or 2 ixps
with x amount of peers would be small

i'm not saying customers won't/don't care about latency, it's just not
difficult to route around the problematic nodes (unless SP A/B/C gets to it
first and band aid the issue until resolution), maybe i just don't see
enough issues to even recognize the problem?

agreed, human error is a big cause of a lot of issues.

well there are plenty of ways to manipulate traffic other than local_pref,
that is why i find it interesting, you have options.

i don't understand what the difficulty is in monitoring your bandwidth and
understanding your traffic patterns, if this is done properly, you can plan
capacity and execute your routing policies for optimal performance, and not
have to re-route/re-engineer traffic so often. does your traffic fluctuate
that much that you can't get a good grasp on what you're pushing, from who,
and when?

i definitely see value in appliances like the fcp and route science box, i
just think for a smaller provider it may not be necessary - or maybe i have
it backwards, and it is a better solution for a smaller provider so they
don't have to waste money on highly skilled engineers? maybe i am just
thinking "inside" the box at the moment, from an engineer's view.. if so my
apologies for steering off course

-christian

Ross_Vandegrift · July 4, 2008, 2:38pm

The FCP stinks at managing blackholing. There's supposedly new code
on the way to help with some of the blackhole avoidance, but I'll
believe it when I see it. It can only really control the outbound
path, so if someone else chooses a path to me that is blackholed between
us, there's not a lot it can do.

On the other hand, the best value of the FCP is commit management. It
does a fantastic job of making sure we pay the least amount of money
to our transit providers. No more manual balancing of traffic frees up
a lot of time, and having an automatic process for it means that we
never exceed commit on links that we don't have to.

The FCP produces lovely graphs and charts that describe this, which is
probably why people accuse it of being too PHB-friendly. But Internap
wasn't stupid - one of those pretty charts is cost savings the FCP has
accumulated this month vs. the natural BGP decision.

For a network with a heavy outbound bias, that quickly adds up to a
decent chunk of change.
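
For readers unfamiliar with commit management, the idea can be sketched in a few lines of Python: use up bandwidth that is already paid for on every link, and only then spill onto whichever provider has the cheapest overage. This is a simplified illustration under assumed numbers, not a description of how the FCP works; the `Transit` fields, prices, and the two-pass `allocate` function are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Transit:
    """A transit link with a committed rate and an overage price (hypothetical)."""
    name: str
    commit_mbps: float
    capacity_mbps: float
    overage_per_mbps: float


def allocate(demand_mbps: float, transits: list[Transit]) -> dict[str, float]:
    """Spread outbound demand so that paid-for commit is used before any overage."""
    allocation = {t.name: 0.0 for t in transits}
    remaining = demand_mbps
    # Pass 1: fill each link up to its commit, which is paid for regardless.
    for t in transits:
        take = min(remaining, t.commit_mbps)
        allocation[t.name] += take
        remaining -= take
    # Pass 2: send whatever is left to the cheapest overage first.
    for t in sorted(transits, key=lambda t: t.overage_per_mbps):
        take = min(remaining, t.capacity_mbps - allocation[t.name])
        allocation[t.name] += take
        remaining -= take
    if remaining > 0:
        raise ValueError("demand exceeds total transit capacity")
    return allocation


def overage_cost(allocation: dict[str, float], transits: list[Transit]) -> float:
    """Cost of traffic above commit, summed over all links."""
    return sum(
        max(0.0, allocation[t.name] - t.commit_mbps) * t.overage_per_mbps
        for t in transits
    )


transits = [
    Transit("A", commit_mbps=500.0, capacity_mbps=1000.0, overage_per_mbps=8.0),
    Transit("B", commit_mbps=300.0, capacity_mbps=1000.0, overage_per_mbps=5.0),
    Transit("C", commit_mbps=200.0, capacity_mbps=1000.0, overage_per_mbps=12.0),
]
plan = allocate(1200.0, transits)
print(plan, overage_cost(plan, transits))
```

A real system also has to map this target split onto actual prefixes and account for how the provider bills, which this sketch ignores.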

Ross

Rubens_Kuhl · July 5, 2008, 11:24am

If you already own Cisco gear, Cisco OER (which now has another
marketing name) might do the trick without buying any appliances, as
it runs on top of IOS.

Rubens

Tom_Sands · July 5, 2008, 2:39pm

Ross Vandegrift wrote:

> i definitely see value in appliances like the fcp and route science box, i
> just think for a smaller provider it may not be necessary - or maybe i have
> it backwards, and it is a better solution for a smaller provider so they
> don't have to waste money on highly skilled engineers? maybe i am just
> thinking "inside" the box at the moment, from an engineer's view.. if so my
> apologies for steering off course

We've used the FCP for quite a few years now, with "good" success. The point at which we started seeing it being worthwhile was about 4 providers. The challenges weren't about having qualified engineers or knowing the nature of your traffic; it was more a matter of being able to be dynamic, being aware of the impact of the prefixes/ASNs that you are making changes to, managing cost, etc.

In a content heavy network, where your traffic patterns vary greatly (based on clients/visitors all over the world), just knowing your traffic isn't enough.

The argument could probably be made that you could script some of this, but it still doesn't get you the same solution, so it partly depends on whether you need a complete solution. We reached a point where monitoring traffic, commits, costs, performance, etc. took a significant amount of time from an engineer (or 3). It's an ongoing thing, not a once-a-day change, and with all the factors involved in deciding to make a change, doing it with an engineer (using scripts and traffic data) becomes far less accurate than using an appliance designed for it. Some of the biggest challenges we hit using an engineer were: being able to "accurately" determine the amount of data you will be shifting when a change is made, based on a prefix or ASN; knowing what the performance impact looks like for that prefix or ASN when a change is made to send it via another provider; and monitoring your current active paths so that you are aware of performance problems you want to make a proactive change for.
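
The first of those challenges, estimating how much traffic a change will move, can be approximated from flow data. The Python sketch below sums bytes per destination prefix over a sampling interval and projects utilization on the link the prefix would move to. The flow record shape `(dst_ip, bytes)`, the function names, and the sample numbers are assumptions for illustration; real NetFlow handling (sampling rates, export delays) is not modeled.

```python
import ipaddress
from collections import defaultdict


def mbps_by_prefix(flows, prefixes, interval_s):
    """Sum flow bytes per destination prefix and convert to average Mbps.

    flows is an iterable of (dst_ip, byte_count) pairs seen during interval_s
    seconds. Each flow is counted against its longest matching prefix.
    """
    networks = sorted(
        (ipaddress.ip_network(p) for p in prefixes),
        key=lambda n: n.prefixlen,
        reverse=True,
    )
    totals = defaultdict(float)
    for dst_ip, byte_count in flows:
        address = ipaddress.ip_address(dst_ip)
        for network in networks:
            if address in network:
                totals[str(network)] += byte_count * 8 / interval_s / 1e6
                break
    return dict(totals)


def projected_utilization(current_mbps, shifted_mbps, capacity_mbps):
    """Fraction of link capacity in use after the shifted traffic arrives."""
    return (current_mbps + shifted_mbps) / capacity_mbps


flows = [
    ("192.0.2.10", 3_750_000_000),
    ("192.0.2.200", 1_875_000_000),
    ("198.51.100.5", 750_000_000),
]
rates = mbps_by_prefix(flows, ["192.0.2.0/24", "198.51.100.0/24"], interval_s=300)
print(rates)
print(projected_utilization(600.0, rates["192.0.2.0/24"], 1000.0))
```

This covers volume only; predicting the performance impact of the move still needs active measurement over the candidate path.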

> The FCP stinks at managing blackholing. There's supposedly new code
> on the way to help with some of the blackhole avoidance, but I'll
> believe it when I see it. It can only really control the outbound
> path, so if someone else chooses a path to me that is blackholed between
> us, there's not a lot it can do.

Agreed, none of the appliances I've seen are 100%, nor are they infinitely scalable. We've had numerous issues with blackhole problems and the FCP, and I too won't hold my breath for this to get resolved. Especially since in the last 5 yrs we've used this product, we've seen very little evolution in features and functionality.

We are actually at the point where we're outgrowing the abilities of the FCP, and I'm interested in the input on this thread to try and figure out what's next. The preferred method of data collection with the FCP is SPAN/Monitor; however, for our network/topology that doesn't scale well (not to mention their costs don't scale well either). They also support Netflow, but have a VERY limited ability to process it in any volume.