Shady areas of TCP window autotuning?

Hi all,

   TCP window autotuning is part of several OSes today. However, the actual
implementations behind this buzzword differ significantly and might impose
negative side-effects on our networks - which I'd like to discuss here.
There seem to be two basic approaches, which differ in their main principle:

#1: autotuning tries to set the rx window to a sensible value for a given RTT
#2: autotuning just ensures that the rx window is always bigger than the
    congestion window of the sender, i.e. it never limits the flow
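
As a rough illustration, here is a toy sketch of the two policies in
Python (not actual kernel code; the growth rule in #2 is a simplified
stand-in for Linux-style dynamic right-sizing, and all inputs are
hypothetical):

    # Toy models of the two receive-window policies described above.

    def rwnd_approach_1(link_bps: float, rtt_s: float) -> int:
        """#1: advertise roughly one bandwidth-delay product for the path."""
        return int(link_bps * rtt_s / 8)                 # bytes

    def rwnd_approach_2(rwnd: int, bytes_rcvd_last_rtt: int) -> int:
        """#2: keep the advertised window comfortably ahead of whatever
        the sender delivered in the last RTT, so flow control never
        becomes the limiting factor."""
        if 2 * bytes_rcvd_last_rtt >= rwnd:
            rwnd *= 2                                    # grows, never shrinks
        return rwnd

Note that #2 has no notion of the path's intrinsic RTT at all, which is
what makes the feedback loop described below possible.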

While both approaches succeed in achieving high throughput on high-RTT paths,
their behaviour on low-RTT paths is very different - mainly due to the fact
that #2 suffers from the "spiraling death" syndrome: when RTT increases due
to queueing at the bottleneck point, autotuning reacts by increasing the
advertised window, which again increases RTT...
So the net effect of #2 is that after a very short TCP connection lifetime
it might advertise an extremely large RX window compared to the BDP of the
path:

RTT when idle | Max advertised window #1 | Max advertised window #2

Briefly? They're correct - the rx advertised window has nothing to do with congestion control and everything to do with flow control.

The problem you've described *is* a problem, but not because of its effects on congestion control -- the problem it causes is one we call a lack of agility: it takes longer for control signals to take effect if you're doing things like fast-forwarding a YouTube movie that's being delivered over TCP.

If you want patches for Linux that properly decrease the window size, I can send them to you out-of-band.

But in general, TCP's proper behavior is to try to fill up the bottleneck buffer. This isn't a huge problem *in general*, but it can be fairly annoying on, e.g., cable modems with oversized buffers, which are fairly common. But that's pretty fundamental to the way TCP is designed. Otherwise, you WILL sacrifice throughput at other times.

   -Dave

In a message written on Mon, Mar 16, 2009 at 10:15:37AM +0100, Marian Ďurkovič wrote:

This, however, doesn't seem to be of any concern to the TCP maintainers
of #2, who claim that the receiver is not supposed to assist in congestion
control in any way. Instead, they advise everyone to use advanced queue
management, RED or other congestion-control mechanisms at the sender and at
every network device to avoid this behaviour.

I think the advice here is good, but it actually overlooks the
larger problem.

Many edge devices have queues that are way too large.

What appears to happen is vendors don't auto-size queues. Something
like a cable or DSL modem may be designed for a maximum speed of
10Mbps, and the vendor sizes the queue appropriately. The service
provider then deploys the device at 2.5Mbps, which means roughly
(as it can be more complex) the queue should be 1/4th the size.
However the software doesn't auto-size the buffer to the link speed,
and the operator doesn't adjust the buffer size in their config.

The result is that if the vendor targeted 100ms of buffer you now
have 400ms of buffer, and really bad lag.
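
Back-of-the-envelope, assuming the queue is sized in bytes and drains at
line rate (the two rates are the ones from the example above):

    # A queue sized for 100 ms at the design speed becomes 400 ms of
    # delay at the deployed speed.

    design_bps, deployed_bps, target_delay_s = 10e6, 2.5e6, 0.100

    queue_bytes = design_bps * target_delay_s / 8    # sized once, at 10 Mbps
    actual_delay = queue_bytes * 8 / deployed_bps    # but drained at 2.5 Mbps
    print(f"{queue_bytes / 1e3:.0f} kB of buffer -> "
          f"{actual_delay * 1e3:.0f} ms of delay at 2.5 Mbps")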

As network operators we have to get out of the mindset that "packet
drops are bad". While that may be true when planning the backbone
to have sufficient bandwidth, it's the exact opposite of true when
managing congestion at the edge. Reducing the buffer to ~50ms
of bandwidth makes the users a lot happier, and allows TCP to work.
TCP needs drops to manage to the right speed.

My wish is for the vendors to step up. I would love to be able to
configure my router/cable modem/dsl box with "queue-size 50ms" and
have it compute, for the current link speed, 50ms of buffer. Sure,
I can do that by hand and turn it into "queue 20 packets", but that
is very manual and must be done for every different link speed (at
least, at slower speeds). Operators don't adjust because it is too
much work.
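
For what it's worth, the by-hand conversion is simple enough that it
really could live in the CLI. A rough sketch of what a hypothetical
"queue-size 50ms" knob would have to compute (the link speeds and the
1500-byte packet size are just assumptions):

    # Turn a delay target into a packet count for the current link speed.

    def queue_packets(link_bps: float, target_s: float = 0.050,
                      pkt_bytes: int = 1500) -> int:
        return max(1, round(link_bps * target_s / (pkt_bytes * 8)))

    for mbps in (1.5, 2.5, 6, 10, 45):
        print(f"{mbps:5.1f} Mbps -> queue {queue_packets(mbps * 1e6)} packets")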

If network operators could get the queue sizes fixed, then it might be
worth worrying about the behavior you describe; however, I suspect 90%
of the problem you describe would also go away.

The result is that if the vendor targeted 100ms of buffer you now
have 400ms of buffer, and really bad lag.

Well, this is one of the reasons why I hate the fact that we're
effectively stuck in a 1500-byte MTU world. My customers are vastly
concerned with the quantity of data they can transmit per unit of
latency. You may be more familiar with this termed "throughput".
Customers beat us operators and engineers up over it every day. TCP
window tuning does help that if you can manage the side effects. A
larger default layer 2 MTU (why we didn't change this when GE came
out, I will never understand) would help even more by reducing the
total number of frames necessary to transmit a packet across a given
wire.
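
Rough numbers on the frame-count point (the 9000-byte jumbo size and the
header assumptions are mine, purely for illustration):

    # Frames and per-frame overhead needed to move 1 MB of TCP payload at
    # two layer-2 MTUs, assuming plain Ethernet + IPv4 + TCP, no options.

    payload, l3l4_hdrs = 1_000_000, 20 + 20       # bytes of data; IP + TCP

    for mtu in (1500, 9000):
        per_frame = mtu - l3l4_hdrs               # payload carried per frame
        frames = -(-payload // per_frame)         # ceiling division
        overhead = frames * (l3l4_hdrs + 38)      # + Ethernet framing/preamble/IFG
        print(f"MTU {mtu}: {frames} frames, ~{overhead / 1e3:.1f} kB of overhead")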

As network operators we have to get out of the mindset that "packet
drops are bad"

Well, that's easier said than done, and arguably not realistic. I got
started in this business when 1-3% packet loss was normal and
expected. As the network has grown, the expectation of 0% loss in all
cases has grown with it. You have to remember that in the early days,
the network itself was expected to guarantee data delivery (e.g. X.25).
Then the network improved and that burden was cast onto the host
devices. Well, technology has continued to improve to the point where
you literally can expect 0% packet loss in relatively confined
areas (say, provider X in Los Angeles to user Y in San Jose). But as
you go further afield, such as from LAX to Israel, expectations have
to change. Today, that mindset is not always there.

As you allude to, this has also bred applications that are almost
entirely intolerant of packet loss and extremely sensitive to
jitter. (VoIP people, are you listening?) Real-time gaming is a great
example. Back in the days when 99% of us were on modems, any loss or
varying delay between the client and the server made the difference
between an enjoyable session and nothing but frustration, and it was
often hit and miss. A congested or dirty link in the middle of the
path destroyed the user's experience. This is further compounded by
the ever-increasing international participation in some of these
services, which means that 24x7 requirements render the customers and
their users more and more sensitive to maintenance activities. (There
can be areas where there is no "after hours" in which to do this
stuff.) Add to this that, as media companies expand their use of the
network, customers have forced providers to write performance-based
metrics into their SLAs that, rather than simple uptime, now require
often arbitrary guarantees of latency and data loss, and you've got a
real problem for operations and engineering.

Techniques that can help improve network integrity are worth
exploring. The difficulty is in proving these techniques under a wide
array of circumstances, getting them properly adopted, and not having
vendors or customers arbitrarily break them because of improper
understanding, poor implementations, or bad configs (PMTUD, anyone?).

Going forward, this sort of thing is going to be more and more
important and harder and harder to get right. I'm actually glad to see
this particular thread appear and will be quite interested in what
people have to say on the matter.

-Wayne

Hi,

It was my understanding that (most) cable modems are L2 devices -- how is it
that they have a buffer, other than what the network processor needs to
switch packets?

Frank

In my mind, the problem is that they tend to use FIFO, not that the queues are too large.

This is most likely due to the enormous price competition in the market, where you might lose a DSL CPE deal because you charged $1 per unit more than the competition.

What we need is ~100ms of buffer and fair-queue or equivalent at both ends of the end-user link (unless it's 100 meg or more, where 5ms buffers and FIFO tail-drop seem to work just fine), because a 1 meg uplink (ADSL) with a 200ms buffer is just bad for the customer experience. If they can't figure out how to do fair-queue properly, they might as well just do WRED 30 ms / 50 ms (100% drop probability at 50ms), or even taildrop at 50ms.
It's very rare today that an end user is helped by anything buffering their packets for more than 50ms.
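
To turn those millisecond figures into the byte thresholds most gear
actually wants configured, a quick sketch (the 1 Mbps uplink is the rate
from the example; queue-depth units of course vary by platform):

    # Convert "start dropping at 30 ms of queue, drop everything at 50 ms"
    # into byte thresholds for a given uplink speed.

    def wred_thresholds(link_bps: float, min_s: float = 0.030,
                        max_s: float = 0.050):
        return link_bps * min_s / 8, link_bps * max_s / 8    # bytes

    lo, hi = wred_thresholds(1e6)
    print(f"1 Mbps uplink: start dropping at ~{lo / 1e3:.2f} kB, "
          f"tail-drop at ~{hi / 1e3:.2f} kB")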

I've done some testing with fairly fast links and big buffers (T3/OC3 and real routers), and with FIFO and tuned TCP windows (single session) it's easy to get 100ms of buffering, which is just pointless.

So either smaller buffers and FIFO, or large buffers and some kind of intelligent queue handling.

* Marian Ďurkovič:

   TCP window autotuning is part of several OSes today. However, the actual
implementations behind this buzzword differ significantly and might impose
negative side-effects on our networks - which I'd like to discuss here.
There seem to be two basic approaches, which differ in their main principle:

This has been discussed previously on the netdev list:

  <http://thread.gmane.org/gmane.linux.network/121674>

You may want to review the discussion over there before replying on
NANOG.

Many edge devices have queues that are way too large.

What appears to happen is vendors don't auto-size queues. Something
like a cable or DSL modem may be designed for a maximum speed of
10Mbps, and the vendor sizes the queue appropriately. The service
provider then deploys the device at 2.5Mbps, which means roughly
(as it can be more complex) the queue should be 1/4th the size.
However the software doesn't auto-size the buffer to the link speed,
and the operator doesn't adjust the buffer size in their config.

The result is that if the vendor targeted 100ms of buffer you now
have 400ms of buffer, and really bad lag.

This is a very good point. Let me add that it also happens on every
autosensing 10/100/1000Base-T Ethernet port, which typically does not
auto-reduce buffers when the actual negotiated speed is not 1 Gbps.

As network operators we have to get out of the mindset that "packet
drops are bad". While that may be true when planning the backbone
to have sufficient bandwidth, it's the exact opposite of true when
managing congestion at the edge. Reducing the buffer to ~50ms
of bandwidth makes the users a lot happier, and allows TCP to work.
TCP needs drops to manage to the right speed.

My wish is for the vendors to step up. I would love to be able to
configure my router/cable modem/dsl box with "queue-size 50ms" and
have it compute, for the current link speed, 50ms of buffer.

Reducing buffers to 50 msec clearly avoids excessive queueing delays,
but let's look at this from a wider perspective:

1) initially we had a system where hosts were using fixed 64 kB buffers,
which was unable to achieve good performance over high-BDP paths

2) OS maintainers have fixed this by means of buffer autotuning, so the
host buffer size is no longer the problem

3) the above fix introduces unacceptable delays into networks and users
are complaining, especially if autotuning approach #2 is used

4) network operators will fix the problem by reducing buffers to e.g. 50 msec

So at the end of the day, we'll again have a system which is unable to
achieve good performance over high-BDP paths, since with reduced buffers
we'll have an underbuffered bottleneck in the path which will prevent full
link utilization if RTT > 50 msec. Thus all the above exercises will end up
in almost the same situation as before (of course YMMV).
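
To put rough numbers on that, using the usual single-flow rule of thumb
(one bandwidth-delay product of bottleneck buffer for Reno); the 10 Mbps
edge link and the RTTs below are made-up illustrations:

    # With one long-lived Reno flow and drop-tail FIFO, the bottleneck
    # needs roughly B = C * RTT of buffer to ride out the post-loss
    # window halving.  A 50 ms cap is fine only for RTTs up to ~50 ms.

    link_bps, buffer_cap_s = 10e6, 0.050

    for rtt_ms in (20, 50, 100, 200):
        need = link_bps * rtt_ms / 1000 / 8      # B = C * RTT, in bytes
        have = link_bps * buffer_cap_s / 8
        verdict = "ok" if have >= need else "underbuffered for a lone Reno flow"
        print(f"RTT {rtt_ms:3d} ms: need ~{need / 1e3:5.1f} kB, "
              f"have {have / 1e3:5.1f} kB -> {verdict}")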

Something is seriously wrong, isn't it?

And yes, I opened this topic last week on the Linux netdev mailing list and
tried hard to persuade those people that some less aggressive approach is
probably necessary to achieve a good balance between the requirements of
fastest possible throughput and fairness in the network. But the maintainers
simply didn't want to listen :-(

        M.

The Ethernet is typically faster than the upstream cable channel. So
it needs some place to put the data that arrives from the Ethernet port
until it gets sent upstream.

This has nothing to do with layer 2 / layer 3. Any device connecting
between media of different speeds (or connecting more than two ports --
creating the possibility of contention) would need some amount of
buffering.
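
A quick illustration of the speed mismatch (the rates and burst size are
hypothetical, not any particular DOCSIS profile):

    # Data arrives from the LAN faster than the upstream channel drains
    # it, so the modem has to queue the difference.

    lan_bps, upstream_bps, burst_bytes = 100e6, 2e6, 1_000_000

    arrival_time = burst_bytes * 8 / lan_bps                # burst fully received
    queued = burst_bytes - upstream_bps / 8 * arrival_time  # not yet sent upstream
    print(f"~{queued / 1e3:.0f} kB queued in the modem, taking "
          f"{queued * 8 / upstream_bps:.1f} s to drain at 2 Mbps")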

     -- Brett

In a message written on Tue, Mar 17, 2009 at 08:46:50AM +0100, Mikael Abrahamsson wrote:

In my mind, the problem is that they tend to use FIFO, not that the queues
are too large.

We could quickly get lost in queuing science, but at a high level you
are most correct that both are a problem.

What we need is ~100ms of buffer and fair-queue or equivalent at both
ends of the end-user link (unless it's 100 meg or more, where 5ms buffers
and FIFO tail-drop seem to work just fine), because a 1 meg uplink (ADSL)
with a 200ms buffer is just bad for the customer experience. If they
can't figure out how to do fair-queue properly, they might as well just do
WRED 30 ms / 50 ms (100% drop probability at 50ms), or even taildrop at 50ms.
It's very rare today that an end user is helped by anything buffering
their packets for more than 50ms.

Some of this technology exists, just not where it can do a lot of
good. Some fancier CPE devices know how to queue VoIP in a priority
queue, and elevate some games. This works great when the cable
modem or DSL modem is integrated, but when you buy a "router" and
hook it to your provider-supplied DSL or cable modem it does no
good. I hate to suggest such a thing, but perhaps a protocol for a
modem to communicate a committed rate to a router would be a good
thing...

I'd also like to point out that where this technology exists today it's
almost never used. How many 2600's and 3600's have you seen
terminating T1's or DS-3's that don't have anything changed from
the default FIFO queue? I am particularly fond of the DS-3 frame
circuits with 100 PVCs, each with 40 packets of buffer. 4000
packets of buffer on a DS-3. No wonder performance is horrid.
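
The math on that last example, assuming full-size 1500-byte packets
(ATM/frame overhead ignored for simplicity):

    pvcs, pkts_per_pvc, pkt_bytes, ds3_bps = 100, 40, 1500, 45e6

    total_pkts = pvcs * pkts_per_pvc                 # 4000 packets of buffer
    delay = total_pkts * pkt_bytes * 8 / ds3_bps     # worst-case queueing delay
    print(f"{total_pkts} packets -> ~{delay:.2f} s of potential queueing on a DS-3")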

In a message written on Tue, Mar 17, 2009 at 09:47:39AM +0100, Marian Ďurkovič wrote:

Reducing buffers to 50 msec clearly avoids excessive queueing delays,
but let's look at this from the wider perspective:

1) initially we had a system where hosts were using fixed 64 kB buffers
This was unable to achieve good performance over high BDP paths

Note that the host buffer, which generally should be 2 * Bandwidth
* Delay is, well, basically unrelated to the hop by hop network
buffers.

2) OS maintainers have fixed this by means of buffer autotuning, where
the host buffer size is no longer the problem.

3) the above fix introduces unacceptable delays into networks and users
are complaining, especially if autotuning approach #2 is used

4) network operators will fix the problem by reducing buffers to e.g. 50 msec

So at the end of the day, we'll again have a system which is unable to
achieve good performance over high BDP paths, since with reduced buffers
we'll have an underbuffered bottleneck in the path which will prevent full
link utilization if RTT>50 msec. Thus all the above exercises will end up
in having almost the same situation as before (of course YMMV).

This is an incorrect conclusion. The host buffer has to wait a full
RTT for an ack to return, so it has to buffer a full RTT of data
and then some. Hop-by-hop buffers only have to buffer until an
output port on the same device is free. This is why a router with
20 10GE interfaces can have a 75-packet-deep queue on each interface
and work fine: the packet only sits there until a 10GE output
interface is available (a few microseconds).
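
(Even a completely full queue drains quickly - assuming ~1500-byte
packets, which is my assumption, not a stated figure:)

    pkts, pkt_bytes, port_bps = 75, 1500, 10e9
    drain_us = pkts * pkt_bytes * 8 / port_bps * 1e6
    print(f"a full {pkts}-packet queue drains in ~{drain_us:.0f} microseconds at 10GE")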

The problems are related: as TCP goes faster there is an increased
probability it will fill the buffer at any particular hop; but that
means a link is full and TCP is hitting the maximum speed for that
path anyway. Reducing the buffer size (to a point) /does not slow/
TCP, it reduces the feedback loop time. It provides less jitter
to the user, which is good for VoIP and ssh and the like.

However, if the hop-by-hop buffers are filling and there is lag and
jitter, that's a sign the hop-by-hop buffers were always too large.
99.99% of devices ship with buffers that are too large.

Leo Bicknell wrote:

As network operators we have to get out of the mindset that "packet
drops are bad".

They are bad.

TCP needs drops to manage to the right speed.

This is what's bad. TCP should be slightly more intelligent and start considering RTT jitter as its primary source of congestion information.
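
Something Vegas-like, in spirit - a toy sketch of delay-based window
adjustment (the per-RTT update rule is a simplification and the
alpha/beta thresholds are illustrative, not any shipping implementation):

    # Compare expected vs. actual throughput each RTT and back off when
    # the gap (i.e. estimated queueing) grows, instead of waiting for loss.

    def vegas_update(cwnd: float, base_rtt: float, current_rtt: float,
                     alpha: float = 2.0, beta: float = 4.0) -> float:
        expected = cwnd / base_rtt                  # pkts/s with empty queues
        actual = cwnd / current_rtt                 # pkts/s actually achieved
        backlog = (expected - actual) * base_rtt    # est. packets queued in the path
        if backlog < alpha:
            return cwnd + 1                         # path looks empty: grow
        if backlog > beta:
            return cwnd - 1                         # queue building: back off early
        return cwnd                                 # in the sweet spot: hold

The catch, as noted further down the thread, is that a sender like this
backs off before a competing loss-based Reno flow does.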

Designing L2 network performance to optimize an L3 protocol is backwards.

Or use a transport-layer protocol that optimizes delay end-to-end.
http://tools.ietf.org/html/draft-shalunov-ledbat-congestion-00

TCP Vegas did this but sadly it never became popular.
(It doesn't compete well with Reno.)

Tony.

FWIW, Compound TCP does this (shipping with Vista, but disabled by default.) There are other delay-based or delay-sensitive TCP flavors, too.

Lars

> So at the end of the day, we'll again have a system which is unable to
> achieve good performance over high BDP paths, since with reduced buffers
> we'll have an underbuffered bottleneck in the path which will prevent full
> link utilization if RTT>50 msec. Thus all the above exercises will end up
> in having almost the same situation as before (of course YMMV).

This is an incorrect conclusion. The host buffer has to wait for
an RTT for an ack to return, so it has to buffer a full RTT of data
and then some. Hop by hop buffers only have to buffer until an
output port on the same device is free.

[snip]

However, if the hop-by-hop buffers are filling and there is lag and
jitter, that's a sign the hop-by-hop buffers were always too large.
99.99% of devices ship with buffers that are too large.

Vendors size the buffers according to principles outlined, e.g., here:

http://tiny-tera.stanford.edu/~nickm/papers/sigcomm2004.pdf

It's fine to have smaller buffers in the high-speed core, but at the edge you
still need to buffer a full RTT's worth if you want to fully utilize the link
with TCP Reno. Thus my conclusion holds - if we reduce buffers at the
bottleneck point to 50 msec, flows with RTT > 50 msec will suffer from
reduced throughput.
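
The headline result of that paper is why the core and the edge differ here:
for n desynchronized long-lived flows, a buffer of B = C * RTT / sqrt(n)
suffices, which is tiny in the core and a full BDP at an edge carrying one
dominant flow. Roughly (the link rates, RTT and flow counts below are
made-up illustrations):

    from math import sqrt

    def buffer_bytes(link_bps: float, rtt_s: float, flows: int) -> float:
        """Buffer per the sqrt(n) rule, in bytes."""
        return link_bps * rtt_s / sqrt(flows) / 8

    print(f"core: 10 Gbps, 10000 flows -> {buffer_bytes(10e9, 0.100, 10000) / 1e6:.2f} MB")
    print(f"edge: 10 Mbps,     1 flow  -> {buffer_bytes(10e6, 0.100, 1) / 1e3:.0f} kB")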

Anyway, we probably have no other choice in situations where the only
available queueing is FIFO. And if this gets implemented on a larger scale,
it could even have a positive side-effect - it might finally motivate OS
maintainers to seriously consider deploying some delay-sensitive variant of
TCP, since Reno will no longer give them the best results.

        M.

In a message written on Wed, Mar 18, 2009 at 09:04:42AM +0100, Marian Ďurkovič wrote:

It's fine to have smaller buffers in the high-speed core, but at the edge you
still need to buffer for full RTT if you want to fully utilize the link with
TCP Reno. Thus my conclusion holds - if we reduce buffers at the bottleneck
point to 50 msec, flows with RTT>50 msec would suffer from reduced throughput.

Ah, I understand your point now. There is a balance to be had at
the edge: the tuning to support a single TCP stream at full bandwidth
and the tuning to reduce latency and jitter are on some level
incompatible, so one must strike a balance between the two.

Of course...

Anyway, we probably have no other choice in situations where the only
available queueing is FIFO. And if this gets implemented on a larger scale,
it could even have a positive side-effect - it might finally motivate OS
maintainers to seriously consider deploying some delay-sensitive variant of
TCP, since Reno will no longer give them the best results.

Many of the problems can be mitigated with well-known queueing
strategies. WRED, priority queues, and other options have been
around long enough that I refuse to believe they add significant cost.
Rather, I think the problem is one of awareness: far too few network
engineers seem to really understand what effect the queuing options
have on traffic.