High throughput BGP links using Gentoo + stripped kernel

Hello Everyone,

We are running:

Gentoo Server on Dual Core Intel Xeon 3060, 2 GB RAM
Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet
Controller (rev 06)
Ethernet controller: Intel Corporation 82573E Gigabit Ethernet
Controller (rev 03)

2 BGP links from different providers, using Quagga, iptables, etc.

We are transmitting an average of 700 Mbps with packet sizes upwards of
900-1000 bytes when the traffic graph begins to flatten. We also start
experiencing some crashes at that point, and have not been able to
pinpoint the cause of those either.

I was hoping to get some feedback on what else we can strip from the
kernel. If you have a similar setup for a stable platform the .config
would be great!

Also, what are your thoughts on migrating to OpenBSD and bgpd? I'm not
sure if there would be a performance increase, but would the security
be even stronger?

Kind Regards,

Nick

Hello Nick,

Your email is pretty generic, so the likelihood of anyone being able to provide any actual help or advice is pretty low. I suggest you check out Vyatta.org; it's an open source router solution that uses Quagga for its underlying BGP management, and if you desire you can purchase a support package for a few grand a year.

Cheers,
Mike

Hi Nick,

You're done. You can buy more recent server hardware and get another
small bump. You may be able to tweak interrupt rates from the NICs as
well, trading latency for throughput. But basically you're done:
you've hit the upper bound of what slow-path (not hardware assisted)
networking can currently do.

Options:

1. Buy equipment with a hardware fast path, such as the higher end
Juniper and Cisco routers.

2. Split the load. Run multiple BGP routers and filter some portion of
the /8's on each of them. On your IGP, advertise /8's instead of a
default.
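For what it's worth, option 2 could be sketched in Quagga along these lines (a hypothetical prefix-list splitting v4 space in half; the AS number, neighbor address, and list name are all made-up placeholders):

```shell
# Sketch only: this router accepts just the lower half of IPv4 space from
# its upstream, so a second router can carry 128.0.0.0/1. All names and
# numbers here are hypothetical.
vtysh -c 'configure terminal' \
      -c 'ip prefix-list LOWER-HALF seq 5 permit 0.0.0.0/1 le 24' \
      -c 'router bgp 64512' \
      -c 'neighbor 192.0.2.1 prefix-list LOWER-HALF in'
```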

Regards,
Bill Herrin

I think you've misinterpreted his numbers. He's using 1 Gbit Ethernet interfaces, so that's 700 Mbit/s. He didn't mention if he'd done any IP stack tuning, or what sort of crashes he's having... but people have been doing higher bandwidth than this on Linux for years.

This is some fairly ancient hardware, so what you can get out of it will be limited. Though GigE should not be impossible.

The usual tricks are to make sure netfilter is not loaded, especially the conntrack/NAT based parts, as those will inspect every flow for state information. Either make sure those parts are compiled out or the modules/code never loads.

If you have any iptables/netfilter rules, make sure they are 1) stateless 2) properly organized (you can't just throw everything into FORWARD and expect it to be performant).
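A minimal sketch of the "properly organized" part: keep FORWARD short and jump to per-interface chains so most packets traverse only a few stateless rules (interface names and ports here are just examples):

```shell
# Dedicated chain per ingress interface; FORWARD itself stays tiny.
iptables -N fwd-eth0-in
iptables -A FORWARD -i eth0 -j fwd-eth0-in
# Stateless matches only: no -m state / -m conntrack, so no flow tracking.
iptables -A fwd-eth0-in -p tcp --dport 25 -j DROP
iptables -A fwd-eth0-in -j ACCEPT
```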

You could try setting IRQ affinity so both ports run on the same core; however, I'm not sure if that will help much as it's still the same cache and distance to memory. On modern NICs you can do tricks like tie RX of port 1 with TX of port 2. Probably not on that generation, though.
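Setting the affinity itself is just a bitmask write per IRQ (the IRQ numbers below are hypothetical; check /proc/interrupts for the real ones on your box):

```shell
# Find the NIC IRQs first.
grep eth /proc/interrupts
# Pin both to CPU1 (bitmask 0x2). IRQs 24 and 25 are placeholders.
echo 2 > /proc/irq/24/smp_affinity
echo 2 > /proc/irq/25/smp_affinity
```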

The 82571EB and 82573E are, while old, PCIe hardware, so there should not be any PCI bottlenecks, even with you having to bounce off that stone-age FSB the old CPU has. Not sure how well that generation of Intel NIC silicon does line rate, though.

But really you should get some newer hardware with on-CPU PCIe and memory controllers (and preferably QPI). That architectural jump really upped the networking throughput of commodity hardware, probably by orders of magnitude (people were doing 40Gbps routing using standard Linux 5 years ago).

Curious about vmstat output during saturation, and kernel version too. IPv4 routing changed significantly recently and IPv6 routing performance also improved somewhat.

Hello Michael,

I totally understand how my question is generic in nature. I will
definitely take a look at Vyatta and weigh the effort vs. benefit.
The purpose of my email is to see how people with similar setups
managed to get more out of their systems using kernel tweaks or
further stripping of their OS. In our case, we are using Gentoo.

Nick.

You might be maxing out your server's PCI bus throughput, so it might be a
better idea if you can get Ethernet NICs that sit on at least PCIe
x8 slots.

Nikola, thank you so much for your response! It kind of looks that
way, and we do have another candidate machine that has a PCIe 3 x8.
First thing, I never liked riser cards, and the candidate IBM x3250 M4
does use them. Not sure how much of a hit I will take for that.
Secondly, are there any proven Intel 4-port cards in PCIe 3, preferably
Pro/1000?

Leaving that aside, I take it you've configured some sort of CPU/PCI
affinity?

For interrupts we disabled CONFIG_HOTPLUG_CPU in the kernel and
assigned interrupts to the less-used core using the APIC. I am not sure
if there is anything more we can do?

As for migration to another OS, I find FreeBSD better in terms of network
performance. The last time I checked, OpenBSD was either lacking or in
the early stages of multi-core support.

I know I mentioned migration, but Gentoo has been really good to us,
and we grew really fond of her :). I hope I can tune it further before
retiring it as our OS of choice.

Nick.

I had two Dell R3xx 1U servers with quad GigE cards in them and a few small
BGP connections for a few years. They were running CentOS 5 + Quagga with a
bunch of stuff turned off. Worked extremely well. We also had really small
traffic back then.

Server hardware has become amazingly fast under the covers these days. It
certainly still can't match an ASIC-designed solution from Cisco etc., but
it should be able to push several gigabits of traffic.
In HPC storage applications, for example, we have multiple servers with
quad 40Gig and IB pushing ~40GB of traffic in fairly large blocks. It's not
networking, but it does demonstrate pushing data into daemon applications and
back down to the kernel at high rates.
Certainly a kernel routing table with no iptables and a small Quagga daemon
in the background can push similar rates.

In other words, get new hardware and design it for your flows.

> Hi Nick,
>
> You're done. You can buy more recent server hardware and get another
> small bump. You may be able to tweak interrupt rates from the NICs as
> well, trading latency for throughput. But basically you're done:
> you've hit the upper bound of what slow-path (not hardware assisted)
> networking can currently do.
>
> Options:
>
> 1. Buy equipment with a hardware fast path, such as the higher end
> Juniper and Cisco routers.
>
> 2. Split the load. Run multiple BGP routers and filter some portion of
> the /8's on each of them. On your IGP, advertise /8's instead of a
> default.
>
> Regards,
> Bill Herrin

Hey Bill, thanks for your reply!!!! Yeah, option 1... I think we
will do whatever it takes to avoid that route. I don't have a good
reason for it; it's just preference. Great manufacturers/products,
etc., we just like the flexibility we get with how things are set up
right now. Not to mention the extra rack space! Option 2 is exactly what
we are looking at. But before that, we are looking at upgrading to
PCIe 3 x8 or x16, as mentioned earlier, for that "small bump". If we hit
a 25% increase in throughput then that would keep the barracudas in
suits at bay. But for now, they are really breathing down my back... :)

N.

> This is some fairly ancient hardware, so what you can get out of it will
> be limited. Though GigE should not be impossible.

Agreed!!!

> The usual tricks are to make sure netfilter is not loaded, especially
> the conntrack/NAT based parts, as those will inspect every flow for state
> information. Either make sure those parts are compiled out or the
> modules/code never loads.
>
> If you have any iptables/netfilter rules, make sure they are 1)
> stateless 2) properly organized (you can't just throw everything into
> FORWARD and expect it to be performant).

We do use stateful iptables on our router, with some forward rules...
This is known to be one of our issues; not sure if having a separate
iptables box would be the best and only solution for this?

> You could try setting IRQ affinity so both ports run on the same core;
> however, I'm not sure if that will help much as it's still the same cache
> and distance to memory. On modern NICs you can do tricks like tie RX of
> port 1 with TX of port 2. Probably not on that generation, though.

Those figures include IRQ affinity tweaks at the kernel and APIC level.

> The 82571EB and 82573E are, while old, PCIe hardware, so there should not
> be any PCI bottlenecks, even with you having to bounce off that stone-age
> FSB the old CPU has. Not sure how well that generation of Intel NIC
> silicon does line rate, though.
>
> But really you should get some newer hardware with on-CPU PCIe and
> memory controllers (and preferably QPI). That architectural jump really
> upped the networking throughput of commodity hardware, probably by
> orders of magnitude (people were doing 40Gbps routing using standard
> Linux 5 years ago).

Any ideas on the setup? Maybe naming a chipset or interface, and
which xserver is the best candidate? Will google... :)

> Curious about vmstat output during saturation, and kernel version too.
> IPv4 routing changed significantly recently and IPv6 routing performance
> also improved somewhat.

Will get that output during peak on Monday for you guys. Newest kernel,
3.6 or 7...

Thank you so much for your insight,

Nick.

What we are having a hard time with right now is finding that
"perfect" setup without going the whitebox route. For example, the
x3250 M4 has one PCIe gen 3 x8 full length (great!), and one gen 2
x4 (not so good...). The ideal in our case would be a newish xserver
with two full-length gen 3 x8 or even x16 slots in a nice 1U form
factor, humming along and able to handle up to 64 GT/s of traffic,
firewall and NAT rules included.

Hope this is not considered noise to an old problem however, any help
is greatly appreciated, and will keep everyone posted on the final
numbers post upgrade.

N.

Not noise!

(oops, I keep forgetting to send with my nanog identity..)

> We do use stateful iptables on our router, with some forward rules...
> This is known to be one of our issues; not sure if having a separate
> iptables box would be the best and only solution for this?

Ah, statefulness/conntrack... once you load it you've kinda lost already. Sorry. Any gains from other tunables will likely be dwarfed by the CPU cycles spent by the kernel to track all connections. The more diverse the traffic, the more it will hurt. Connection tracking is just inherently non-scalable (and fragile, by the way).
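If you want to see how much tracking the box is actually doing, the counters are a quick check (paths vary by kernel version):

```shell
# Current entries vs. table ceiling; a table near its max also means drops.
sysctl net.netfilter.nf_conntrack_count net.netfilter.nf_conntrack_max
# Older kernels expose the same counter here instead:
cat /proc/sys/net/ipv4/netfilter/ip_conntrack_count 2>/dev/null
```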

However, the cheapest and simplest fix is probably just to throw more modern hardware at it. A Xeon E3 (or two for redundancy ;)) is quite cheap..

The long-term, scalable solution is a deeper network like you hinted at, with statefulness - if really needed at all - pushed as close to your edge and as far away from your border as possible. But... more boxes, more to manage, more power, more stuff that can fail, more redundancies needed... it adds up.

Then again if you are close to gig actual traffic already, you might want to at least think about future scalability..

<snip>

> Any ideas on the setup? Maybe naming a chipset or interface, and
> which xserver is the best candidate? Will google... :)

The big shift to integrated (and fast) I/O happened around 2008 IIRC; anything introduced after that is usually quite efficient at moving packets around, at least if Intel based. Even desktop i3/i5/i7 platforms can do 10gig as long as you make sure you put the network chips/cards on the CPU's PCIe controller lanes. With anything new it's hard to go wrong.

xserver?? xserve? That is quite old..

> Curious about vmstat output during saturation, and kernel version too.
> IPv4 routing changed significantly recently and IPv6 routing performance
> also improved somewhat.

> Will get that output during peak on Monday for you guys. Newest kernel,
> 3.6 or 7...

Good. That is at least fairly recent and has most of the more modern networking stuff (and better defaults).

I don't know about "only", but it'd have to come close to "best". iptables
(and stateful firewalling in general) is a pretty significant CPU and memory
sink. Definitely get rid of any stateful rules, preferably *all* the rules,
and apply them at a separate location. We've always had BGP routing
separated from firewalling, but we're currently migrating from
one-giant-core-firewall to lots-of-little-firewalls because our firewalls
are starting to cry a little. Nice thing is that horizontally scaling
firewalls is easy -- just whack 'em on each subnet instead of running
everything together. Core routing is a little harder to scale out
(although as has been described already, by no means impossible). The
important thing is to remove *anything* from your core routing boxes that
doesn't *absolutely* have to be there -- and stateful firewall rules are
*extremely* high on that list.

- Matt

That hardware should be fine to do two gig ports upstream, with another
two to go to your network?

I'd check with "vmstat 1" to see what your interrupt rate is like, if it's
above 40k/sec I'd check coalescing settings.
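Something like this, watching the "in" column (total interrupts/sec in vmstat's default layout):

```shell
# Five one-second samples; the "in" column is interrupts per second.
vmstat 1 5
# Per-source counts, to see which NIC the interrupts come from:
grep eth /proc/interrupts
```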

I also prefer OpenBSD/OpenBGPD myself. It's a simpler configuration, with fewer
things to "fix".

With Linux you have to disable reverse path filtering and screw around with iptables
to bypass stateful filtering. Then Quagga itself can be buggy. (My original
reason for shifting away from Linux was that Quagga didn't fix enough of Zebra's
bugs... although that was many years ago, and things may have improved a little since then,
but in my experience significantly buggy software tends to stay buggy even with fixing.)
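The rp_filter part is a couple of sysctls (a sketch; on a multi-homed border router strict reverse-path filtering can silently drop legitimately asymmetric traffic):

```shell
# Disable strict reverse path filtering globally and for new interfaces.
sysctl -w net.ipv4.conf.all.rp_filter=0
sysctl -w net.ipv4.conf.default.rp_filter=0
```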

With regards to security of OpenBSD versus Linux, you shouldn't be exposing any
services to the world with either. And it's more stability/configuration that would
push me to OpenBSD rather than performance.

And with regards to crashing, I'd try to figure out what's happening there quickly
before making radical changes. Is it running out of memory? Is Quagga dying? Is
there a default route that works when Quagga crashes? One issue I had was
Quagga crashing and leaving a whole lot of routes lingering in the table, and I had a
script that'd go through and purge them.
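The purge amounts to something like this (a hypothetical sketch, not the original script; it leans on iproute2's ability to flush routes by the protocol that installed them):

```shell
# If zebra has died, drop whatever routes it left behind in the kernel table.
if ! pgrep -x zebra >/dev/null; then
    ip route flush proto zebra
fi
```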

I'm also a bit confused about your dual upstreams with two Ethernet interfaces total:
are they both sharing one pipe, or are there some Broadcom or similar Ethernet interfaces
too? I've found Broadcom chipsets can be a bit problematic, and the only stability
issue I've ever had with OpenBSD is a Broadcom interface wedging for minutes under a DDoS
attack, which was a gigabit-ish speed DDoS with older hardware than yours.

Oh, to check coalescing settings under Linux use: "ethtool -c eth0; ethtool -c eth1"
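And to change them, the capital -C form sets the values (values illustrative; driver support varies by NIC generation):

```shell
# Batch more packets per interrupt: trades a little latency for throughput.
ethtool -C eth0 rx-usecs 125
ethtool -C eth1 rx-usecs 125
```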

Ben.

Do you actually need stateful filtering? A lot of people seem to think
that it's important, when really they're accomplishing little with it;
you can block ports etc. without it. And the idea of protecting hosts
from strange traffic is only really significant if the hosts have very
outdated TCP/IP stacks. And it breaks things like having multiple
routers.

There's an obscure NOTRACK rule you can use to cut down the number of
state entries, or remove state tracking for hosts that don't need it.

http://serverfault.com/questions/234560/how-to-turn-iptables-stateless

although googling for NOTRACK should find other things too.
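A minimal sketch of the raw-table version (this bypasses conntrack for everything; scope the match down if some hosts still need state):

```shell
# Skip connection tracking entirely for forwarded and locally generated traffic.
iptables -t raw -A PREROUTING -j NOTRACK
iptables -t raw -A OUTPUT -j NOTRACK
```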

Ben.

A base model E5 CPU is generally considered adequate, and has a direct link
between cache and PCIe, bypassing memory.

http://www.intel.com/content/www/us/en/io/data-direct-i-o-faq.html

The motherboard is likely to have an i350 chipset for Ethernet.

http://www.intel.com/content/www/us/en/ethernet-controllers/ethernet-i350-server-adapter-brief.html

Ben.

I believe PCI compliance requires it; other things like it probably do too.

~Seth

It's the rare ISP whose border routers are in scope for PCI compliance.