TCP time_wait and port exhaustion for servers

RFC 793 (somewhat arbitrarily) sets the MSL at 2 minutes, so 2MSL (how
long a socket is held in TIME_WAIT before being cleaned up) works out
to 4 minutes.

Linux is a little more reasonable about this and has it baked into the
source as 60 seconds in "/usr/src/linux/include/net/tcp.h":
#define TCP_TIMEWAIT_LEN (60*HZ)

Since there is no way to change this through /proc (probably a good
idea, to keep users from messing with it), I am considering rebuilding
the kernel with a lower TCP_TIMEWAIT_LEN to deal with the following
issue.

With a 60 second timeout on TIME_WAIT, local ports are tied up and
unavailable for new outgoing connections (in this case from a proxy
server). The default local port range on Linux can easily be
adjusted; but even when bumped up to a range of 32K ports, the 60
second timeout means you can only sustain about 500 new connections
per second before you run out of ports.

There are two options to try to deal with this, tcp_tw_reuse and
tcp_tw_recycle, but both seem less than ideal. tcp_tw_reuse doesn't
appear to be effective in situations where you're sustaining 500+ new
connections per second rather than a small burst. tcp_tw_recycle seems
like too big a hammer and has been reported to cause problems with
NATed connections.

The best solution seems to be trying to keep TIME_WAIT in place, but
being faster about it.

30 seconds would get you to 1000 connections a second; 15 to 2000, and
10 seconds to about 3000 a second.
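
To sanity-check those figures, here is a quick C sketch of the
arithmetic (a round pool of 30,000 usable local ports is assumed; each
new outgoing connection ties one port up for the full TIME_WAIT period):

#include <stdio.h>

int main(void)
{
        const int pool = 30000;                    /* usable local ports */
        const int timewait[] = { 60, 30, 15, 10 }; /* TIME_WAIT, seconds */

        /* sustainable rate is roughly pool / TIME_WAIT */
        for (int i = 0; i < 4; i++)
                printf("TIME_WAIT %2ds -> ~%d new connections/sec\n",
                       timewait[i], pool / timewait[i]);
        return 0;
}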

A few questions:

Does anyone have any data on how typical it is for TIME_WAIT to be
necessary beyond 10 seconds on a modern network?
Has anyone done some research on how low you can make TIME_WAIT safely?
Is this a terrible idea? What alternatives are there? Keep in mind
this is a proxy server making outgoing connections as the source of
the problem; so things like SO_REUSEADDR which work for reusing
sockets for incoming connections don't seem to do much in this
situation.

Anyone running large proxies or load balancers have this situation?
If so what is your solution?

Ray,

> With a 60 second timeout on TIME_WAIT, local ports are tied up and
> unavailable for new outgoing connections (in this case from a proxy
> server). The default local port range on Linux can easily be
> adjusted; but even when bumped up to a range of 32K ports, the 60
> second timeout means you can only sustain about 500 new connections
> per second before you run out of ports.

Is it 500 new connections per second per {protocol, remote address,
remote port} tuple that's too few for your proxy? (OK, this tuple is
more or less equivalent to just {remote address} if we're talking about
a web proxy.) Just curious.

Regards,
András

This would be outgoing connections sourced from the IP of the proxy,
destined to whatever remote website (so 80 or 443) requested by the
user.

Essentially it's a modified Squid service that is used to filter HTTP
for CIPA compliance (required by the government), to keep children in
public schools from stumbling onto inappropriate content.

Like most web traffic, the majority of these connections open and
close in under a second. When there is enough traffic from users
behind the proxy to generate over 500 new outgoing connections per
second, sustained, users start hitting errors because there are no
local ports left for Squid to use; they're all tied up in TIME_WAIT.

Here is an example of netstat totals on a box we're seeing the behavior on:

   10 LAST_ACK
   32 LISTEN
    5 SYN_RECV
    5 CLOSE_WAIT
  756 ESTABLISHED
   26 FIN_WAIT1
   40 FIN_WAIT2
    5 CLOSING
   10 SYN_SENT
481947 TIME_WAIT

As a band-aid we've opened up the local port range to allow up to 50K
local ports with /proc/sys/net/ipv4/ip_local_port_range, but they're
brushing up against that limit again at peak times.

It's a shame because memory and CPU-wise the box isn't breaking a sweat.

Enabling TW_REUSE (/proc/sys/net/ipv4/tcp_tw_reuse) doesn't seem to
have any effect for this case.
Using TW_RECYCLE drops the TIME_WAIT count to about 10K instead of
50K, but everything I read online says to avoid using TW_RECYCLE
because it will break things horribly.

Someone responded off-list saying that TIME_WAIT is controlled by
/proc/sys/net/ipv4/tcp_fin_timeout, but that is just incorrect
information that has been parroted on a lot of blogs. There is no
relation between fin_timeout and TCP_TIMEWAIT_LEN.

This level of use seems to translate into about 250 Mbps of traffic on
average, FWIW.

> This would be outgoing connections sourced from the IP of the proxy,
> destined to whatever remote website (so 80 or 443) requested by the
> user.
>
> Essentially it's a modified Squid service that is used to filter HTTP
> for CIPA compliance (required by the government), to keep children in
> public schools from stumbling onto inappropriate content.
>
> Like most web traffic, the majority of these connections open and
> close in under a second. When there is enough traffic from users
> behind the proxy to generate over 500 new outgoing connections per
> second, sustained, users start hitting errors because there are no
> local ports left for Squid to use; they're all tied up in TIME_WAIT.
>
> Here is an example of netstat totals on a box we're seeing the behavior on:
>
>     10 LAST_ACK
>     32 LISTEN
>      5 SYN_RECV
>      5 CLOSE_WAIT
>    756 ESTABLISHED
>     26 FIN_WAIT1
>     40 FIN_WAIT2
>      5 CLOSING
>     10 SYN_SENT
> 481947 TIME_WAIT
>
> As a band-aid we've opened up the local port range to allow up to 50K
> local ports with /proc/sys/net/ipv4/ip_local_port_range, but they're
> brushing up against that limit again at peak times.

We've found it necessary to use address pools to source outgoing connections from our DC devices in order to prevent collisions with ports in TIME_WAIT for some particularly high-traffic destinations. It kinda sucks to burn a /28 or shorter per outbound proxy, but there you have it.

You could simply add another IP address to the server's source-
address pool, which effectively gives you another 32K (or whatever
value you have for the local port range) identifiers.

Owen

Stupid question but how does 500 x 60 = 481947? To have that many
connections in TIME_WAIT on a 60 second timer, you'd need more like
8000 connections per second, wouldn't you?

Regards,
Bill Herrin

Isn't TIME_WAIT based on disconnections, not connections?

Sure, assuming all connections are of equal duration, the disconnection
rate would be roughly equal to the connection rate, and of course over
the long term they will trend toward equality, but that doesn't mean
the peak number of connections in TIME_WAIT won't be greater than the
incoming connection rate would suggest.

Owen

For each second that goes by, you remove X ports from the available
pool for new connections, and each stays unavailable for however long
TCP_TIMEWAIT_LEN is set to (60 seconds by default on Linux).

In this case it's making quick connections for HTTP requests (most of
which finish in less than a second).

Say you have a pool of 30,000 ports and 500 new connections per second
(typical):
1 second goes by and you're down to 29500 available
10 seconds go by and you're down to 25000
30 seconds go by and you're down to 15000
at 59 seconds you're down to 500
at 60 the first 500 come back out of TIME_WAIT, so you hold steady at
500 available (29500 tied up in TIME_WAIT). Everyone is happy.

Now say that you're seeing an average of 550 connections a second.
Suddenly there aren't any available ports to use.

So, your first option is to bump up the range of allowed local ports;
easy enough, but even if you open it up as much as you can and go from
1025 to 65535, that's still only 64000 ports; with your 60 second
TCP_TIMEWAIT_LEN, you can sustain an average of 1000 connections a
second.

Our problem is that our busy sites are easily peaking at that 1000-
connections-a-second average, and when we enable TCP_TW_RECYCLE we see
them go past that to 1500 or so connections per second sustained.

Unfortunately, TCP_TW_RECYCLE is a little too blunt a hammer and breaks TCP.

From what I've read and heard from others, in a high-connection
environment the key is really to drop down the TCP_TIMEWAIT_LEN.

My question is basically, "how low can you go?"

There seems to be consensus around 20 seconds being safe, 15 being
99% OK, and 10 or less being problematic.

So if I rebuild the kernel to use a 20 second timeout, then that 30000
port pool can sustain 1500, and a 60000 port pool can sustain 3000
connections per second.
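
For concreteness, the rebuild would just be a one-line change to that
header before recompiling; a sketch, assuming the 20 second figure and
that nothing else in the tree depends on the 60 second constant:

#define TCP_TIMEWAIT_LEN (20*HZ)  /* stock value is (60*HZ) */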

The software could be rewritten to round-robin through IP addresses
for outgoing requests, but I'm trying to avoid that.

There is an extra 7 on that number; it was 48194 (I was sitting at a
different PC, so I typed it instead of copy-pasting).

The thing is, Linux doesn't behave quite that way.

If you do an anonymous connect(), that is you socket() and then
connect() without a bind() in the middle, then the limit applies *per
destination IP:port pair*. So, you should be able to do 30,000
connections to 192.168.1.1 port 80, another 30,000 connections to
192.168.1.2 port 80, and so on.

You should only fail if you A) bump against the top of NR_OPEN or B)
try to do a massive number of TCP connections to the same remote IP
address.

Try it: set up a listener on discard that just closes the connection
and repeat connect() to 127.0.0.5 until you get an error. Then confirm
that you're out of ports:

telnet 127.0.0.5 9
Trying 127.0.0.5...
telnet: Unable to connect to remote host: Cannot assign requested address

And confirm that you can still make outbound connections to a
different IP address:

telnet 127.0.0.4 9
Trying 127.0.0.4...
Connected to 127.0.0.4.
Escape character is '^]'.
Connection closed by foreign host.
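
If you'd rather script that experiment than drive it by hand, here is a
rough C sketch along the same lines. It assumes, as above, that
something is listening on the discard port (9) of both loopback
addresses, and that this end performs the active close so the
client-side ports are the ones left sitting in TIME_WAIT:

#include <stdio.h>
#include <string.h>
#include <errno.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Connect to ip:9 and close immediately.  If we close first, our side
 * holds the TIME_WAIT state and the local port stays tied up. */
static int hit(const char *ip)
{
        struct sockaddr_in dst;
        int s = socket(AF_INET, SOCK_STREAM, 0);

        if (s < 0)
                return -1;
        memset(&dst, 0, sizeof dst);
        dst.sin_family = AF_INET;
        dst.sin_port = htons(9);
        inet_pton(AF_INET, ip, &dst.sin_addr);
        if (connect(s, (struct sockaddr *)&dst, sizeof dst) < 0) {
                int saved = errno;
                close(s);
                errno = saved;
                return -1;
        }
        close(s);
        return 0;
}

int main(void)
{
        long n = 0;

        while (hit("127.0.0.5") == 0)     /* hammer a single destination */
                n++;
        printf("gave up after %ld connects to 127.0.0.5: %s\n",
               n, strerror(errno));
        if (hit("127.0.0.4") == 0)        /* another destination still works */
                printf("127.0.0.4 still connects fine\n");
        return 0;
}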

Regards,
Bill Herrin

It's kind of a hack, but you don't have to rewrite the software to get different source IPs for different connections. On Linux, you could do the following:

*) Keep your normal default route
*) Configure extra IPs as aliases (eth0:0, eth0:1,...) on the proxy
*) Split up the internet into however many subnets you have proxy host IPs
*) Route each part of the internet to your default gateway, tacking on "dev eth0:n"

This will make the IP on eth0:n the default source address for reaching that part of the internet.

Of course you probably won't get very good load balancing of connections over your IPs that way, but it's better than nothing and a really quick fix that would give you immediate additional capacity.

I was going to also suggest, that to get better balancing, you could periodically (for some relatively short period) rotate the internet subnet routes such that you'd change which parts of the internet were pointed at which dev eth0:n every so many seconds or minutes, but that's kind of annoying to people like me (similar to the problem I recently posted about with AT&T 3G data web proxy). Having your software round robin the source IPs would probably introduce the same problem/effect.

10:17PM lenovo:~% sudo sysctl -a |grep wait
net.netfilter.nf_conntrack_tcp_timeout_fin_wait = 120
net.netfilter.nf_conntrack_tcp_timeout_close_wait = 60
net.netfilter.nf_conntrack_tcp_timeout_time_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_fin_wait = 120
net.ipv4.netfilter.ip_conntrack_tcp_timeout_close_wait = 60
net.ipv4.netfilter.ip_conntrack_tcp_timeout_time_wait = 120
10:17PM lenovo:~%

?

We use this to work around the default limit on our internal load balancers.

HIH.

>> For each second that goes by, you remove X ports from the available
>> pool for new connections, and each stays unavailable for however long
>> TCP_TIMEWAIT_LEN is set to (60 seconds by default on Linux).
>>
>> In this case it's making quick connections for HTTP requests (most of
>> which finish in less than a second).
>>
>> Say you have a pool of 30,000 ports and 500 new connections per second
>> (typical):
>> 1 second goes by and you're down to 29500 available
>> 10 seconds go by and you're down to 25000
>> 30 seconds go by and you're down to 15000
>> at 59 seconds you're down to 500
>> at 60 the first 500 come back out of TIME_WAIT, so you hold steady at
>> 500 available (29500 tied up in TIME_WAIT). Everyone is happy.

> The thing is, Linux doesn't behave quite that way.
>
> If you do an anonymous connect(), that is you socket() and then
> connect() without a bind() in the middle, then the limit applies *per
> destination IP:port pair*. So, you should be able to do 30,000
> connections to 192.168.1.1 port 80, another 30,000 connections to
> 192.168.1.2 port 80, and so on.

The socket api is missing a bind + connect call which restricts the
source address when making the connect. This is needed when you
are required to use a fixed source address.

If you want to get into software rewriting, the simplest thing I might come up with would be to put TCBs in some form of LRU list and, at a point where you need a port back, close the TCB that least recently did anything. My understanding is that this was implemented 15 years ago to manage SYN attacks, and could be built on to manage this form of "attack".

Or, change the period of time a TCB is willing to stay in time-wait. Instead of 60 seconds, make it 10.

Hi Mark,

There are ways around this problem in Linux. For example, you can mark
a packet with iptables based on the uid of the process which created
it, and then you can NAT the source address based on the mark. A
little messy, but the tools are there.

Anyway, Ray didn't indicate that he needed a fixed source address
other than the one the machine would ordinarily choose for itself.

Regards,
Bill Herrin

I'm trying to imagine how even 10 could be problematic nowadays. Have you
found people reporting specific issues with 10?

-Terry

>>> The thing is, Linux doesn't behave quite that way.
>>>
>>> If you do an anonymous connect(), that is you socket() and then
>>> connect() without a bind() in the middle, then the limit applies *per
>>> destination IP:port pair*. So, you should be able to do 30,000
>>> connections to 192.168.1.1 port 80, another 30,000 connections to
>>> 192.168.1.2 port 80, and so on.
>>
>> The socket api is missing a bind + connect call which restricts the
>> source address when making the connect. This is needed when you
>> are required to use a fixed source address.

> Hi Mark,
>
> There are ways around this problem in Linux. For example, you can mark
> a packet with iptables based on the uid of the process which created
> it, and then you can NAT the source address based on the mark. A
> little messy, but the tools are there.

And not available to the ordinary user. Nameservers potentially run
into this limit. This is something The Open Group needs to address when
updating the next revision of the socket API in POSIX.

> Anyway, Ray didn't indicate that he needed a fixed source address
> other than the one the machine would ordinarily choose for itself.

But he didn't say it wasn't required either.

Mark

I can say for certain that it was implemented (at least) twice that long ago (circa 1983) in a TCP implementation for a particular memory constrained environment ("640K should be good enough for anybody") :).

Regards,
-drc

In article <xs4all.20121205220127.7F6F12CA0F17@drugs.dv.isc.org> you write:

>> The thing is, Linux doesn't behave quite that way.
>>
>> If you do an anonymous connect(), that is you socket() and then
>> connect() without a bind() in the middle, then the limit applies *per
>> destination IP:port pair*. So, you should be able to do 30,000
>> connections to 192.168.1.1 port 80, another 30,000 connections to
>> 192.168.1.2 port 80, and so on.
>
> The socket api is missing a bind + connect call which restricts the
> source address when making the connect. This is needed when you
> are required to use a fixed source address.

William was talking about the destination address. Linux (and I would
hope any other network stack) can really open a million connections
from one source address, as long as it's not to one destination address
but to lots of different ones. It's not the (srcip,srcport) tuple that
needs to be unique; it's the (srcip,srcport,dstip,dstport) tuple.

Anyway, you can actually bind to a source address and still have a
dynamic source port; just use port 0. Lots of tools do this.

(for example, strace nc -s 127.0.0.2 127.0.0.1 22 and see what it does)
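
To make that concrete, here is a minimal sketch of the pattern in C:
bind() the outgoing socket to a chosen source address with port 0, so
the kernel still picks the ephemeral port, then connect(). The
addresses and the helper name are made up for illustration; cycling new
connections across several configured source addresses this way
multiplies the usable port space (the round-robin idea mentioned
earlier in the thread).

#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Open a TCP connection to dst_ip:dst_port, sourced from src_ip.
 * sin_port = 0 in the bind() means the kernel chooses the local port. */
int connect_from(const char *src_ip, const char *dst_ip, int dst_port)
{
        struct sockaddr_in src, dst;
        int s = socket(AF_INET, SOCK_STREAM, 0);

        if (s < 0)
                return -1;

        memset(&src, 0, sizeof src);
        src.sin_family = AF_INET;
        src.sin_port = htons(0);               /* dynamic source port */
        inet_pton(AF_INET, src_ip, &src.sin_addr);

        memset(&dst, 0, sizeof dst);
        dst.sin_family = AF_INET;
        dst.sin_port = htons(dst_port);
        inet_pton(AF_INET, dst_ip, &dst.sin_addr);

        if (bind(s, (struct sockaddr *)&src, sizeof src) < 0 ||
            connect(s, (struct sockaddr *)&dst, sizeof dst) < 0) {
                close(s);
                return -1;
        }
        return s;
}

/* e.g. connect_from("192.0.2.10", "198.51.100.80", 80); pick src_ip
 * round-robin from the box's addresses to spread the load. */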

Mike.

Eventually the bind call fails. The output below is "counter: dest
address in hex".

16376: 1a003ff9
16377: 1a003ffa
bind: before bind: Can't assign requested address
16378: 1a003ffb
connect: Can't assign requested address
bind: before bind: Can't assign requested address

and if you remove the bind() the connect fails

16378: 1a003ffb
16379: 1a003ffc
connect: Can't assign requested address
16380: 1a003ffd

this is with a simple loop

  socket()
  ioctl(FIONBIO)    /* set non-blocking */
  bind(addr++:80)
  connect()

I had a firewall dropping the connection attempts