Gigabit Linux Routers

Hi All,
Sorry if this is a repeat topic. I've done a fair bit of trawling but can't
find anything concrete to base decisions on.

I'm hoping someone can offer some advice on suitable hardware and kernel
tweaks for using Linux as a router running bgpd via Quagga. We do this at
the moment and our box manages under the 100Mbps level very effectively.
Over the next year however we expect to push about 250Mbps outbound traffic
with very little inbound (50Mbps simultaneously) and I'm seeing differing
suggestions of what to do in order to move up to the 1Gbps level.

It seems even a dual core box with expensive NICs and some kernel tweaks
will accomplish this but we can't afford to get the hardware purchases
wrong. We'd be looking to buy one live and one standby box within the next
month or so. They will primarily run Quagga, with 'tc' for shaping.
We're in the UK if it makes any difference.

Any help massively appreciated, ideally from those doing the same in
production environments.

Thanks,

Chris

Chris <chris@ghostbusters.co.uk> writes:

I'm hoping someone can offer some advice on suitable hardware and kernel
tweaks for using Linux as a router running bgpd via Quagga.

There was a talk "Towards 10Gb/s open-source routing" at this year's
Linux-Kongress in Hamburg. Here are the slides:

http://data.guug.de/slides/lk2008/10G_preso_lk2008.pdf

cheers

Jens

I've been pretty happy running IBM x-series hardware using RHEL4. Usually it's PPS rather than throughput that will kill it, so if you're doing 250Mbit of DNS/I-mix/HTTP, you'll probably have very different results. There are some rx-ring tweaks for the NICs that are needed, but for the most part it's all out of the box (no custom kernel patches or such, just some sysctl settings).
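
To give a flavour of the kind of tweaks I mean (the device name and values here are only placeholders, not the settings from those boxes):

    # grow the NIC receive/transmit rings (e1000-class NICs; sizes are examples)
    ethtool -G eth0 rx 4096 tx 4096
    # the usual forwarding-related sysctls (again, example values only)
    sysctl -w net.ipv4.ip_forward=1
    sysctl -w net.core.netdev_max_backlog=10000
    sysctl -w net.core.rmem_max=16777216
    sysctl -w net.core.wmem_max=16777216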

I have two x3650s (Quad core) doing around 600-700Mbit/sec (40k pps) at around 20% CPU right now. No Quagga BGP, but that's minimal in terms of CPU. I've not been able to get much beyond 1Gb/sec on this environment because my ASAs are not configured to support more than one Gig into that particular network.

Chris wrote:

I don't think you will have any troubles with industry-standard hardware for
the rates you are quoting. When you get in excess of 300Mbps you have to
start worrying about PPS. When you are looking at >600Mbps then you
should pick out your system more carefully (TOE NICs, PCIe or PCI-X, a CPU
at over X GHz, fast RAM if you are doing a lot of BGP, tweaking your
Linux distribution and kernel, etc.).

You should be fine with any recent hardware. A cheap HP dl360 would
do a great job.

--p

Chris wrote:

Hi All,
Sorry if this is a repeat topic. I've done a fair bit of trawling but can't
find anything concrete to base decisions on.

I'm hoping someone can offer some advice on suitable hardware and kernel
tweaks for using Linux as a router running bgpd via Quagga. We do this at
the moment and our box manages under the 100Mbps level very effectively.
Over the next year however we expect to push about 250Mbps outbound traffic
with very little inbound (50Mbps simultaneously) and I'm seeing differing
suggestions of what to do in order to move up to the 1Gbps level.

Any recent hardware can do 1Gbps of routing from one NIC to another without issues. What you would need is PCI-Express cards, each with its own slot (try to avoid dual/quad port cards for I/O-intensive tasks).

Quagga with one full view and two feeds of about 5000 prefixes each consumes around 50MB of RAM. Putting a lot of RAM in the box will not help you increase performance.

You can also use a kernel with LC-Trie as route hashing algorithm to improve FIB lookups.

It seems even a dual core box with expensive NICs and some kernel tweaks
will accomplish this but we can't afford to get the hardware purchases
wrong. We'd be looking to buy one live and one standby box within the next
month or so. They will primarily run Quagga, with 'tc' for shaping.
We're in the UK if it makes any difference.

Regarding tc, make sure you use a scalable algorithm like HTB or HFSC, and tweak your rules so that a packet spends the least amount of time in the matching and classifying routines.
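
A minimal HTB sketch of what I mean (device name, rates and the single filter are placeholders, not a real policy):

    # HTB root with two leaf classes; keep the filter list short and cheap to evaluate
    tc qdisc add dev eth1 root handle 1: htb default 20
    tc class add dev eth1 parent 1: classid 1:1 htb rate 900mbit
    tc class add dev eth1 parent 1:1 classid 1:10 htb rate 250mbit ceil 900mbit
    tc class add dev eth1 parent 1:1 classid 1:20 htb rate 650mbit ceil 900mbit
    # one simple u32 match: outbound web traffic (source port 80) into class 1:10
    tc filter add dev eth1 parent 1: protocol ip prio 1 u32 match ip sport 80 0xffff flowid 1:10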

Any help massively appreciated, ideally from those doing the same in
production environments.

At 100Mbps FDX full load (routing traffic from one NIC to another) on a 2.53 GHz Celeron box with 512MB of RAM, the load average stays between 0.00 and 0.02.

You've given me lots to think about! Thanks for all the input so far.

A few queries for the replies if I may. My brain is whirring.

Chris: You're right and I'm tempted. I've almost had my arm twisted to go
down the proprietary route as I have some Cisco experience, but I have become
pretty familiar with Quagga and tc.

David: May I ask which NICs you use in the IBM boxes? I see the Intels
recommended by Mike have dual ports on one board (the docs say "Two complete
Gigabit Ethernet connections in a single device • Lower latency due to one
electrical load on the bus").

Patrick: That's what I was hoping to hear :) It's not the world's biggest
network.

Michael: Thanks very much. We have three upstreams. I guess 2GB of RAM would
cover many more sessions.

Eugeniu: That's very useful. The Intel dual port NICs mentioned aren't any
good then I presume (please see my comment to David).

Thanks again,

Chris

The boxes (3650s) came with Broadcom BCM5708 on-board, but I push most of my traffic over these:

1c:00.1 Ethernet controller: Intel Corporation 82571EB Gigabit Ethernet Controller (rev 06)
        Subsystem: Intel Corporation PRO/1000 PT Dual Port Server Adapter
        Flags: bus master, fast devsel, latency 0, IRQ 58
        Memory at c7ea0000 (32-bit, non-prefetchable) [size=128K]
        Memory at c7e80000 (32-bit, non-prefetchable) [size=128K]
        I/O ports at 6020 [size=32]
        Capabilities: [c8] Power Management version 2
        Capabilities: [d0] Message Signalled Interrupts: 64bit+ Queue=0/0 Enable+
        Capabilities: [e0] Express Endpoint IRQ 0
        Capabilities: [100] Advanced Error Reporting

There are four Intel ports in the boxes, so traffic may or may not stay on the same card.

* Eugeniu Patrascu:

You can also use a kernel with LC-Trie as route hashing algorithm to
improve FIB lookups.

Do you know if it's possible to switch off the route cache? Based on
my past experience, it was a major source of routing performance
dependency on traffic patterns (it's basically flow-based forwarding).

Anyway, with very few flows, we get quite decent performance (several
hundred megabits five-minute peak, and we haven't bothered tuning
yet), running on mid-range single-socket server boards and Intel NICs
(PCI-X, this is all 2006 hardware). We use a router-on-a-stick
configuration with VLAN separation between all hosts to get a decent
number of ports.
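
For illustration, the sub-interface setup is nothing more exotic than this (interface names, VLAN IDs and addresses are invented):

    # 802.1q sub-interfaces for router-on-a-stick
    modprobe 8021q
    vconfig add eth0 100
    vconfig add eth0 200
    ip addr add 192.0.2.1/24 dev eth0.100
    ip addr add 198.51.100.1/24 dev eth0.200
    ip link set dev eth0.100 up
    ip link set dev eth0.200 up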

My concern with PC routing (in the WAN area) is a lack of WAN NICs
with properly maintained kernel drivers.

Chris wrote:

Hi All,
Sorry if this is a repeat topic. I've done a fair bit of trawling but can't
find anything concrete to base decisions on.

I'm hoping someone can offer some advice on suitable hardware and kernel
tweaks for using Linux as a router running bgpd via Quagga. We do this at
the moment and our box manages under the 100Mbps level very effectively.
Over the next year however we expect to push about 250Mbps outbound traffic
with very little inbound (50Mbps simultaneously) and I'm seeing differing
suggestions of what to do in order to move up to the 1Gbps level.

As somebody else said, it's more pps than bits you need to worry about.
The Intel NICs can do a full gigabit without any difficulty, if packet
size is large enough. But they buckle somewhere around 300Kpps. 300K
100-byte packets is only 240 Mb/s. On the other hand, you mentioned
your traffic is mostly outbound, which makes me think you might be a
content provider. In that case, you'll know what your average packet
size is -- and it should be a lot bigger than 100 bytes. For that type
of traffic, using a Linux router up to, say, 1.5-2 Gb/s is pretty trivial.
You can do more than that, too, but have to start getting a lot more careful
about hardware selection, tuning, etc.
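
To make the arithmetic explicit (throughput = pps x packet size x 8 bits):

    echo $(( 300000 *  100 * 8 / 1000000 ))   # ->  240 Mbit/s at 100-byte packets
    echo $(( 300000 * 1000 * 8 / 1000000 ))   # -> 2400 Mbit/s at 1000-byte packets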

The other issue is the number of concurrent flows. The actual route
table size is unimportant -- it's the size of the route cache that
matters. Unfortunately, I have no figures here. But I did once
convert a router from limited routes (quagga, 10K routes) to full routes
(I think about 200K routes at the time), with absolutely no measurable
impact. There were only a few thousand concurrent flows, and that
number did not change -- and that's the one that might have made a
difference.
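
If you want rough figures on a 2.6 box, the route cache does expose some counters; the paths below are from memory, so treat them as a pointer rather than gospel:

    # per-CPU route cache statistics; the first hex field is the entry count
    cat /proc/net/stat/rt_cache
    # the cached entries themselves (can be long)
    ip route show cache | head
    # garbage-collection limits for the cache
    sysctl net.ipv4.route.gc_thresh net.ipv4.route.max_size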

I hope this is helpful.

Jim

Just as another source of info here, I'm running:

Dual Core Intel Xeon 3060 @ 2.4GHz
2 GB RAM (it says "Mem: 2059280k total, 1258500k used, 800780k
free, 278004k buffers" right now)
2 of these on the motherboard: Ethernet controller: Intel Corporation
82571EB Gigabit Ethernet Controller (rev 06) (port-channel bonded to my
switch)
One other card with 2 ports: Ethernet controller: Intel Corporation
82573E Gigabit Ethernet Controller (Copper) (rev 03)
Gentoo Linux with a fairly small kernel with FIB_TRIE enabled.

I'm taking in 2 full BGP feeds, running a decent number of iptables rules, and
I've hit 1.2 Gbps with no problems. At this point, I just don't have
anything behind the router to push more than that.
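
For reference, the bgpd side of a two-upstream setup is only a handful of lines; this fragment uses placeholder ASNs, addresses and prefixes, not my real ones:

    ! /etc/quagga/bgpd.conf -- minimal two-upstream fragment (all values are placeholders)
    router bgp 64600
     bgp router-id 192.0.2.1
     network 192.0.2.0/24
     neighbor 198.51.100.1 remote-as 64601
     neighbor 198.51.100.1 description upstream-a
     neighbor 203.0.113.1 remote-as 64602
     neighbor 203.0.113.1 description upstream-b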

Florian Weimer wrote:

* Eugeniu Patrascu:

My concern with PC routing (in the WAN area) is a lack of WAN NICs
with properly maintained kernel drivers.
  

Depending on your WAN interface, there's actually a decent amount of
stuff out there. The cheaper alternative for me has actually always been
to get some old Cisco hardware with the proper interfaces and use it for
media conversion. I have a 6500 with Sup1As in it. It can't take BGP
feeds with the amount of memory it has, but with the right cards, it
will give my router Ethernet and push a few million pps with no problem.

Sounds like he's getting Ethernet from his provider though, so this
probably isn't an issue.

Florian Weimer wrote:

* Eugeniu Patrascu:

You can also use a kernel with LC-Trie as route hashing algorithm to
improve FIB lookups.

Do you know if it's possible to switch off the route cache? Based on
my past experience, it was a major source of routing performance
dependency on traffic patterns (it's basically flow-based forwarding).

I don't understand your question.

In the kernel, when you compile it, you have two options:
- hash based route algorithm
- lc-trie based route algorithm

From what I've read on the internet about the latter algorithm, it's supposed to be faster for route lookups with large routing tables (like a global routing table).
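
As far as I remember, those two options correspond to a compile-time choice under the advanced router options; you can check which one your kernel was built with (config symbol names from memory of 2.6.x kernels):

    # exactly one of these should be set
    grep -E 'CONFIG_IP_FIB_(HASH|TRIE)=' /boot/config-$(uname -r)    # or zcat /proc/config.gz
    #   CONFIG_IP_FIB_HASH=y    hash-based FIB (the default)
    #   CONFIG_IP_FIB_TRIE=y    LC-trie FIB
    # with CONFIG_IP_FIB_TRIE_STATS=y the kernel also exposes lookup statistics:
    cat /proc/net/fib_triestat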

Anyway, with very few flows, we get quite decent performance (several
hundred megabits five-minute peak, and we haven't bothered tuning
yet), running on mid-range single-socket server boards and Intel NICs
(PCI-X, this is all 2006 hardware). We use a router-on-a-stick
configuration with VLAN separation between all hosts to get a decent
number of ports.

In that configuration you'll split the available bandwidth on the NIC and also get less throughput, because server NICs are not optimized for "same interface switching".

My concern with PC routing (in the WAN area) is a lack of WAN NICs
with properly maintained kernel drivers.

Usually it's better to get a dedicated router for that kind of stuff than bother with PC WAN cards.

All the responses have been really helpful. Thanks to everyone for being
friendly and for taking the time to answer in detail.
I've asked a hardware provider to quote for a couple of x86 boxes and I'll
look for suitable Intel NICs too.

Jim: We're a very small ISP and have a full mix of packet sizes on the
network but the vast majority is outbound on port 80 so hopefully that'll
help.

Any more input will of course be considered. I may post the NIC models for
approval if I'm scratching my head again :)

Thanks,

Chris

Just FYI, the more recent Intel hardware has multiple hardware TX/RX queues,
implemented via separate (IIRC) PCIe channels, and Linux/FreeBSD are growing
support to handle these multiple queues via multiple kernel threads, i.e.
multiple CPUs handling packet forwarding.
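
As a rough illustration of what that looks like from userland (interface names, IRQ numbers and CPU masks here are just examples):

    # with MSI-X capable NICs, each RX/TX queue gets its own interrupt vector
    grep eth /proc/interrupts
    # pin individual vectors to CPUs via the affinity mask
    echo 1 > /proc/irq/58/smp_affinity     # queue 0 -> CPU0
    echo 2 > /proc/irq/59/smp_affinity     # queue 1 -> CPU1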

The trick is whether they can pull it off in a way that scales the FIB
and RIB lookups and updates across 4 core (and more) boxes.

But 40kpps is absolutely doable on one CPU. Some of the FreeBSD guys working
on it are looking at supporting 1M+ pps on 10GE cards (in the public source
tree), so... :)

Adrian

Chris wrote:

Eugeniu: That's very useful. The Intel dual port NICs mentioned aren't any
good then I presume (please see my comment to David).

Actually it depends on the motherboard chipset. Some chipsets allocate an interrupt per slot, and when you have lots of traffic between two ports on a dual-port card the interrupt load will increase dramatically, but it should still get you to 1Gbps; at higher speeds... it depends.

It's advisable to use a 2.6 kernel, as the network stack, compared to 2.4, is much better and you can achieve higher speeds.

Greetings all,

We are a software development firm that currently delivers our install ISOs via Sourceforge. We need to start serving them ourselves for marketing reasons and are therefore increasing our bandwidth and getting a 2nd ISP in our datacenter. Both ISPs will be delivering 100Mbit/sec links. We don't expect to increase that for the next year or so and expect average traffic to be about 40-60Mbit/sec.

We are planning to run two OpenBSD based firewalls (with CARP and pf) running OpenBGP in order to connect to the two ISPs.

I saw from previous email that Quagga was recommended as opposed to OpenBGP. Any further comments on that? Also, any comments on the choice of OpenBSD vs. Linux?

I don't want to start a religious war :) Just curious about what most folks are doing and what their experiences have been.

Thanks in advance,

Marc Runkel
Technical Operations Manager
Untangle, Inc.

OpenBSD SMP support is quite limited. NetBSD SMP is quite limited. FreeBSD and Linux
seem to be running better. :)

Adrian

[snip]

Greetings all,

We are a software development firm that currently delivers our install ISOs via Sourceforge.
We need to start serving them ourselves for marketing reasons and are therefore increasing
our bandwidth and getting a 2nd ISP in our datacenter. Both ISPs will be delivering
100mbit/sec links. We don't expect to increase that for the next year or so and expect
average traffic to be about 40-60mbit/sec.

We are planning to run two OpenBSD based firewalls (with CARP and pf) running OpenBGP
in order to connect to the two ISPs.

I saw from previous email that Quagga was recommended as opposed to OpenBGP. Any
further comments on that? Also, any comments on the choice of OpenBSD vs. Linux?

IMO, the performance and utility of OpenBSD as a routing/networking
platform is unmatched by any other open source platform. OpenBGPD
(recent 4-byte ASN issues notwithstanding) has been very stable for us
in production (running roughly equivalent traffic levels to what
you're discussing), and the best part is that you get stateful
transparent failover with CARP, filtering/redirection with pf, load
balancing all the way up through layer7 with relayd, and a host of
other excellent tools for the network engineer's toolkit, all
included, and all integrated. Then of course there are the wider issues:
OpenBSD's track record on security and networking in comparison with the
other OSS platforms, and the smaller pool of folks to draw on who are
experienced in running and tuning OpenBSD (although any reasonably
competent UNIX admin should be able to adapt to it in a few days, given
the generally clean layout and high degree of internal consistency).

advocacy@openbsd.org is down the hall, so I'll stop there. :)

As Adrian said, there are other platforms with better SMP
implementations ... but my experience has been that for small and
mid-size sites, CPU utilization on a reasonably modern x86-based
router is the least of one's worries.

The recent Facebook engineering post on scaling memcached to 200-300K
UDP requests/sec/node may be germane here (in particular, patches to
make IRQ handling more intelligent become very useful at the traffic
levels being discussed).
http://www.facebook.com/note.php?note_id=39391378919&id=9445547199&index=0

/sf

Give Click a try - it is an alternative forwarding plane for Linux that ran much faster than regular Linux forwarding a few years ago, and I imagine it still would.

The XORP routing suite supports various different FIBs, including Click.

http://read.cs.ucla.edu/click/