"Does TCP Need an Overhaul?" (internetevolution, via slashdot)

in <http://www.internetevolution.com/author.asp?section_id=499&doc_id=150113>
larry roberts says:

  ..., last year a new alternative to using output queues, called "flow
  management" was introduced. This concept finally solves the TCP
  unfairness problem and leads to my answer: Fix the network, not TCP.

  ...

  What is really necessary is to detect just the flows that need to slow
  down, and selectively discard just one packet at the right time, but
  not more, per TCP cycle. Discarding too many will cause a flow to
  stall -- we see this when Web access takes forever.

  Flow management requires keeping information on each active flow,
  which currently is inexpensive and allows us to build an intelligent
  process that can precisely control the rate of every flow as needed to
  insure no overloads. Thus, there are now two options for network
  equipment:

   o Random discards from output queues -- creates much TCP
     unfairness

   o Intelligent rate control of every flow -- eliminates most TCP
     unfairness

  ...
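
To make the mechanism concrete, here is a minimal sketch of per-flow selective discard in the spirit of what Roberts describes -- purely illustrative, not Anagran's actual algorithm; the flow key, the per-flow fair-rate budget, and the "one drop per TCP cycle" interval are all assumptions:

```python
# Illustrative sketch only. Assumes we can see every packet on the link and
# that one drop per "TCP cycle" (roughly one RTT) is enough to make a
# conforming TCP sender back off.
import time
from collections import namedtuple

FlowState = namedtuple("FlowState", "bytes_seen window_start last_drop")

class FlowManager:
    def __init__(self, fair_rate_bytes_per_sec, rtt_estimate=0.1):
        self.fair_rate = fair_rate_bytes_per_sec   # per-flow budget
        self.rtt = rtt_estimate                    # assumed "TCP cycle"
        self.flows = {}   # (src, dst, sport, dport, proto) -> FlowState

    def should_drop(self, flow_key, pkt_len):
        now = time.time()
        st = self.flows.get(flow_key, FlowState(0, now, 0.0))
        if now - st.window_start > self.rtt:       # restart measurement window
            st = FlowState(0, now, st.last_drop)
        st = st._replace(bytes_seen=st.bytes_seen + pkt_len)
        over_budget = st.bytes_seen > self.fair_rate * self.rtt
        recently_dropped = (now - st.last_drop) < self.rtt
        drop = over_budget and not recently_dropped   # at most one drop per cycle
        if drop:
            st = st._replace(last_drop=now)
        self.flows[flow_key] = st
        return drop
```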

i wouldn't want to get in an argument with somebody who was smart and savvy
enough to invent packet switching during the year i entered kindergarten,
but, somebody told me once that keeping information on every flow was *not*
"inexpensive." should somebody tell dr. roberts?

(i'd hate to think that everybody would have to buy roberts' (anagran's)
Fast Flow Technology at every node of their network to make this work. that
doesn't sound "inexpensive" to me.)

I suppose he could try to sell it... and people with larger networks
could see if keeping state on a few million active flows per device is
'expensive' or 'inexpensive'. Perhaps it's less expensive than it
seems it would be.

Oh, will this be in linecard RAM? main-cpu-RAM? calculated per
ASIC/port or overall for the whole box/system? How about deconflicting
overlapping ip-space (darn that mpls!!!)? And what about asymmetric flows?

I had thought the flow-routing thing was a dead end subject long ago?

-Chris

And you can't get high speeds with Ethernet; you get too many
collisions. Besides, it doesn't have any fairness properties.
Clearly, you need token ring.

Oh yeah, they fixed those.

I have no idea if it's economically feasible or not -- technology and
costs change, and just because something wasn't possible 5 years ago
doesn't mean it isn't possible today.

It does strike me that any such scheme would be implemented on access
routers, not backbone routers, for lots of good and sufficient
reasons. That alone makes it more feasible.

I also note that many people are using NetFlow, which shares some of
the same properties as this scheme.

As for the need -- well, it does deal with the BitTorrent problem,
assuming that that is indeed a problem.

Bottom line: I have no idea if it makes any economic sense, but I'm not
willing to rule it out without analysis.

    --Steve Bellovin, http://www.cs.columbia.edu/~smb

Paul Vixie wrote:
[..]

i wouldn't want to get in an argument with somebody who was smart and savvy
enough to invent packet switching during the year i entered kindergarten,
but, somebody told me once that keeping information on every flow was *not*
"inexpensive." should somebody tell dr. roberts?

Isn't the reason that "NetFlow" (or v10, which is the IETF/Cisco-named IPFIX) exists a side-effect of having routers doing "flow based routing", aka "keeping an entry per IP flow, thus using that entry for every next packet to quickly select the outgoing interface instead of having to go through all the prefixes"?
The flows are in those boxes, but only exported for stats purposes with NetFlow/IPFIX/sFlow/etc. Apparently it was not as fast as they liked it to be and there were other issues. Thus, what exactly is new here in his boxes that has not been tried and failed before?

Greets,
  Jeroen

This is essentially correct. NetFlow was originally intended as a switching mechanism, but it then became apparent that the information in the cache, and the ability to export it as telemetry, were of more value, as there were other, more efficient methods of moving the packets around.

I don't claim to understand this area more than Dr. Roberts either, but to paraphrase George Santayana:

"Those who do not understand SACK are condemned to re-implement it." (or fight it)

Years ago I worked on a project that was supposed to allow significant over-subscription of bandwidth to unmodified clients/servers by tracking the full state of all TCP traffic going through it. When the output interface started filling up, we'd do tricks like delaying the packets and selectively dropping packets "fairly" so that any one flow wasn't penalized too harshly unless it was monopolizing the bandwidth. To sum up many months of work that went nowhere:

As long as you didn't drop more packets than SACK could handle (generally 2 packets in-flight), dropping packets is pretty ineffective at causing TCP to slow down. As long as the packets are making it there quickly, and SACK retransmits happen fast enough to keep the window full... you aren't slowing TCP down much. Of course if you're intercepting TCP packets you could disable SACK, but that strikes me as worse than doing nothing.

If you are dropping enough packets to stall SACK, you're dropping a lot of packets. With a somewhat standard 32K window and 1500 byte packets, to lose 3 non-contiguous packets inside a 32K window you're talking about 13% packet loss within one window. I would be very annoyed if I were on a connection that did this regularly. You get very little granularity when it comes to influencing a SACK flow - too little loss and SACK handles it without skipping a beat. Too much loss and you're severely affecting the connection.
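
A quick back-of-the-envelope check of that figure, assuming a 32 KB window and 1500-byte segments:

```python
# How much loss does it take to put three non-contiguous holes inside one
# 32 KB window of 1500-byte segments?
window_bytes = 32 * 1024
segment_bytes = 1500                                  # ignoring header overhead
segments_per_window = window_bytes / segment_bytes    # ~21.8 segments in flight
loss_rate = 3 / segments_per_window                   # ~0.137
print(f"{segments_per_window:.1f} segments, {loss_rate:.1%} loss per window")
```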

You've also got fast retransmit, New Reno, BIC/CUBIC, as well as host parameter caching to limit the effect of packet loss on recovery time. I don't doubt that someone else could do a better job than I did in this field, but I'd be really curious to know how much of an effect an intermediary router can have on a TCP flow with SACK that doesn't cause more packet loss than anyone would put up with for interactive sessions.

The biggest thing we learned was that end user perceived speed is something completely different from flow speed. Prioritizing UDP/53 and TCP setup packets had a bigger impact than anything else we did, from a user's perspective. If either of those got delayed/dropped, pages would appear to stall while loading, and the delay between a click and visible results could greatly increase. This annoyed users far more than a slow download.

Mark UDP/53 and tcpflags(syn) packets as high priority. If you want to get really fancy and can track TCP state, prioritize the first 2k of client->server and server->client HTTP traffic so that the request and reply headers pass uninterrupted. Those two changes made our client happier than anything else we did, at far, far less cost.
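
A minimal sketch of that classification logic (hypothetical packet fields; in a real deployment this would be a QoS policy on the router rather than Python, but the decision tree is the same):

```python
# Toy classifier: DNS and TCP-setup packets get top priority, the first
# ~2 KB of an HTTP exchange gets the next band, everything else best-effort.
# `pkt` is a hypothetical parsed-packet object, not a real library type.
HIGH, MEDIUM, BEST_EFFORT = 0, 1, 2
TCP_SYN = 0x02

def classify(pkt, bytes_seen_on_flow):
    if pkt.proto == "udp" and 53 in (pkt.sport, pkt.dport):
        return HIGH                      # DNS lookups
    if pkt.proto == "tcp" and (pkt.flags & TCP_SYN):
        return HIGH                      # connection setup (SYN / SYN-ACK)
    if pkt.proto == "tcp" and 80 in (pkt.sport, pkt.dport) and bytes_seen_on_flow < 2048:
        return MEDIUM                    # HTTP request/response headers
    return BEST_EFFORT                   # bulk transfer
```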

-- Kevin

> i wouldn't want to get in an argument with somebody who was smart and savvy
> enough to invent packet switching during the year i entered kindergarten,
> but, somebody told me once that keeping information on every flow was *not*
> "inexpensive." should somebody tell dr. roberts?

Isn't the reason that "NetFlow" (or v10, which is the IETF/Cisco-named
IPFIX) exists a side-effect of having routers doing "flow based
routing", aka "keeping an entry per IP flow, thus using that entry for
every next packet to quickly select the outgoing interface instead of
having to go through all the prefixes"?

flow-cache based forwarding can work perfectly fine provided:
- your flow table doesn't overflow
- you don't need to invalidate large amounts of your table at once (e.g. a
next-hop change)

this was the primary reason why Cisco went from 'fast-switch' to 'CEF' which
uses a FIB.

the problem back then was that when you had large numbers of invalidated
flow-cache entries due to a next-hop change, that next-hop change was typically
caused by something in the routing table - and then you had a problem, because
you wanted all your router CPU for recalculating the next-best paths yet
you couldn't take advantage of any shortcut information, so you dropped
to a 'slow path' of forwarding.
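
A toy illustration of that trade-off: the flow cache makes the per-packet lookup cheap, but a route change forces you to invalidate a potentially huge number of entries at exactly the moment the control plane is busiest (illustrative only; real implementations are rather more involved):

```python
# Toy flow-cache forwarder. The "slow path" stands in for a full
# longest-prefix-match over the routing table; the cache is the per-flow
# shortcut. A route change wipes the cache, pushing traffic to the slow path
# while the CPU is already busy recomputing routes.
class FlowCacheForwarder:
    def __init__(self, fib_lookup):
        self.fib_lookup = fib_lookup   # slow path: longest-prefix-match function
        self.cache = {}                # (src, dst) -> next_hop

    def forward(self, src, dst):
        key = (src, dst)
        if key in self.cache:          # fast path: one hash lookup per packet
            return self.cache[key]
        next_hop = self.fib_lookup(dst)   # slow path: walk the prefixes
        self.cache[key] = next_hop
        return next_hop

    def on_route_change(self):
        # Simplest correct behaviour: invalidate everything. With millions of
        # cached flows, this is exactly the CPU spike described above.
        self.cache.clear()
```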

for a long long time now, Cisco platforms with netflow have primarily had
netflow as an _accounting_ mechanism and generally not as the primary
forwarding path.

some platforms (e.g. cat6k) have retained a flow-cache that CAN be used to
influence forwarding decisions, and that has been the basis for how things like
NAT can be done in hardware (where per-flow state is necessary), but the
primary forwarding mechanism even on that platform has been CEF in hardware
since Supervisor-2 came out.

no comment on the merits of the approach by Larry, anything i'd say would be
through rose-coloured glasses anyway.

cheers,

lincoln.
(work:ltd@cisco.com)

I'm not quite sure about with this Fast Flow technology is exactly what it's really doing that isn't being done already by the DPI vendors (eg. Cisco SCE2000 on the Cisco webpage claims to be able to track 2million unidirectional flows).

Am I missing something?

MMC

Paul Vixie wrote:

...

Reworded:

What I'm not quite sure about is what exactly "Fast Flow" is really doing that isn't already being done by the DPI vendors (eg. the Cisco SCE2000, which on the Cisco webpage claims to be able to track 2 million unidirectional flows).

Am I missing something?

Aside from my horribly broken attempt at English?

:-)

MMC

You've also got fast retransmit, New Reno, BIC/CUBIC, as well as host
parameter caching to limit the effect of packet loss on recovery time. I
don't doubt that someone else could do a better job than I did in this
field, but I'd be really curious to know how much of an effect an
intermediary router can have on a TCP flow with SACK that doesn't cause more
packet loss than anyone would put up with for interactive sessions.

my takeaway from the web site was that one of the ways p2p is bad is that
it tends to start several parallel tcp sessions from the same client (i guess
think of bittorrent where you're getting parts of the file from several folks
at once). since each one has its own state machine, each will try to sense
the end to end bandwidth-delay product. thus, on headroom-free links, each
will get 1/Nth of that link's bandwidth, which could be (M>1)/Nth aggregate,
and apparently this is unfair to the other users depending on that link.
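
To make the arithmetic concrete with made-up numbers: on a saturated link shared by ten single-flow users and one user running five parallel flows, per-flow fairness hands that one user a third of the link:

```python
# Per-flow fairness vs per-user share on a saturated link (toy numbers).
single_flow_users = 10
p2p_flows = 5                        # one user running M=5 parallel TCP sessions
total_flows = single_flow_users + p2p_flows
p2p_user_share = p2p_flows / total_flows     # ~33% of the link for one user
other_user_share = 1 / total_flows           # ~6.7% each for everyone else
print(f"p2p user: {p2p_user_share:.0%}, each single-flow user: {other_user_share:.1%}")
```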

i guess i can see the point, if i squint just right. nobody wants to get
blown off the channel because someone else gamed the fairness mechanisms.
(on the other hand, some tcp stacks are deliberately overaggressive in ways
that don't require M>1 connections to get (M>1)/Nth of a link's bandwidth.
on the internet, generally speaking, if someone else says fairness be damned,
then fairness will be damned.)

however, i'm not sure that all TCP sessions having one endpoint in common or
even all those having both endpoints in common ought to share fate. one of
those endpoints might be a NAT box with M>1 users behind it, for example.

in answer to your question about SACK, it looks like they simulate a slower
link speed for all TCP sessions that they guess are in the same flow-bundle.
thus, all sessions in that flow-bundle see a single shared contributed
bandwidth-delay product from any link served by one of their boxes.

You've also got fast retransmit, New Reno, BIC/CUBIC, as well as host
parameter caching to limit the effect of packet loss on recovery time. I
don't doubt that someone else could do a better job than I did in this
field, but I'd be really curious to know how much of an effect an
intermediary router can have on a TCP flow with SACK that doesn't cause more
packet loss than anyone would put up with for interactive sessions.

my takeaway from the web site was that one of the ways p2p is bad is that
it tends to start several parallel tcp sessions from the same client (i guess
think of bittorrent where you're getting parts of the file from several folks
at once). since each one has its own state machine, each will try to sense
the end to end bandwidth-delay product. thus, on headroom-free links, each
will get 1/Nth of that link's bandwidth, which could be (M>1)/Nth aggregate,
and apparently this is unfair to the other users depending on that link.

This is true. But it's not just bittorrent that does this. IE8 opens up to 6 parallel TCP sockets to a single server, Firefox can be tweaked to open an arbitrary number (and a lot of "Download Optimizers" do exactly that), etc. Unless you're keeping a lot of state on the history of what each client is doing, it's going to be hard to tell the difference between 6 IE sockets downloading cnn.com rapidly and bittorrent masquerading as HTTP.

i guess i can see the point, if i squint just right. nobody wants to get
blown off the channel because someone else gamed the fairness mechanisms.
(on the other hand, some tcp stacks are deliberately overaggressive in ways
that don't require M>1 connections to get (M>1)/Nth of a link's bandwidth.
on the internet, generally speaking, if someone else says fairness be damned,
then fairness will be damned.)

Exactly. I'm nervously waiting for the first bittorrent client to have its own TCP engine built into it that plays even more unfairly. I seem to remember a paper that described a client sending ACKs faster than it was actually receiving the data from several well connected servers, which ended up bringing in enough traffic to completely swamp their university's pipes.

As soon as P2P authors realize they can get around caps by not playing by the rules, you'll be back to putting hard limits on each subscriber - which is where we are now. I'm not saying some fancier magic couldn't be put over top of that, but that's all depending on everyone to play by the rules to begin with.

however, i'm not sure that all TCP sessions having one endpoint in common or
even all those having both endpoints in common ought to share fate. one of
those endpoints might be a NAT box with M>1 users behind it, for example.

in answer to your question about SACK, it looks like they simulate a slower
link speed for all TCP sessions that they guess are in the same flow-bundle.
thus, all sessions in that flow-bundle see a single shared contributed
bandwidth-delay product from any link served by one of their boxes.

Yeah, I guess the point I was trying to make is that once you throw SACK into the equation you lose the assumption that if you drop TCP packets, TCP slows down. Before New Reno, fast-retransmit and SACK this was true and very easy to model. Now you can drop a considerable number of packets and TCP doesn't slow down very much, if at all. If you're worried about data that your clients are downloading you're either throwing away data from the server (which is wasting bandwidth getting all the way to you) or throwing away your clients' ACKs. Lost ACKs do almost nothing to slow down TCP unless you've thrown them *all* away.
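
The reason lost ACKs cost so little is that TCP ACKs are cumulative: any later ACK covers the ones that were dropped, so the sender's notion of acknowledged data only ever moves forward. A trivial illustration:

```python
# Cumulative ACKs: dropping individual ACKs is harmless as long as a later
# one gets through -- the highest ACK seen tells the sender everything.
acks_sent_by_receiver = [1000, 2000, 3000, 4000, 5000]   # cumulative ACK numbers
acks_that_survived = [1000, 5000]                        # three of five were dropped
print(max(acks_that_survived))   # 5000: sender still knows all data was received
```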

I'm not saying all of this is completely useless, but it's relying a lot on the fact that the people you're trying to rate limit are going to be playing by the same rules you intended. This makes me really wish that something like ECN had taken off - any router between the two end-points can say "slow this connection down" and (if both ends are playing by the rules) they do so without wasting time on retransmits.
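
For reference, the router-side half of ECN (RFC 3168) boils down to a very small decision at a congested queue; a simplified sketch, with the two-bit IP ECN field spelled out:

```python
# Simplified RFC 3168 router behaviour: when active queue management decides
# a packet should be "punished", mark it instead of dropping it -- provided
# the endpoints advertised ECN capability in the IP header's ECN field.
NOT_ECT, ECT1, ECT0, CE = 0b00, 0b01, 0b10, 0b11   # two-bit IP ECN codepoints

def congestion_action(ecn_field):
    """Return (new_ecn_field, drop?) for a packet selected by the AQM."""
    if ecn_field in (ECT0, ECT1):
        return CE, False       # set Congestion Experienced, deliver the packet
    return ecn_field, True     # endpoints not ECN-capable: fall back to a drop
```

The receiver then echoes the mark back in the TCP header (the ECE flag), and the sender reduces its window as if the packet had been lost, without the cost of a retransmission.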

-- Kevin

in answer to your question about SACK, it looks like they simulate a slower
link speed for all TCP sessions that they guess are in the same flow-bundle.
thus, all sessions in that flow-bundle see a single shared contributed
bandwidth-delay product from any link served by one of their boxes.

Yeah, I guess the point I was trying to make is that once you throw SACK into the equation you lose the assumption that if you drop TCP packets, TCP slows down. Before New Reno, fast-retransmit and SACK this was true and very easy to model. Now you can drop a considerable number of packets and TCP doesn't slow down very much, if at all. If you're worried about data that your clients

That's only partially correct: TCP doesn't _time out_, but it still cuts its sending window in half (ergo, it cuts the rate at which it sends in half). The TCP sending rate computations are unchanged by either NewReno or SACK; the difference is that NR and SACK are much more efficient at getting back on their feet after the loss and:
  a) Are less likely to retransmit packets they've already sent
  b) Are less likely to go into a huge timeout and therefore back to slow-start

You can force TCP into basically whatever sending rate you want by dropping the right packets.
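
A toy model of that knob, under idealised AIMD assumptions (congestion avoidance only, one halving per induced loss, no timeouts): the average window, and hence the rate, is set almost entirely by how often a loss is forced.

```python
# Idealised AIMD: cwnd grows by one segment per RTT and is halved every
# `rtts_between_drops` RTTs. The average window tracks the drop interval,
# which is the knob an in-path box can turn by choosing when to drop.
def average_cwnd(rtts_between_drops, rounds=10_000, start=10.0):
    cwnd, total = start, 0.0
    for i in range(rounds):
        total += cwnd
        cwnd += 1.0                        # congestion avoidance: +1 MSS per RTT
        if (i + 1) % rtts_between_drops == 0:
            cwnd = max(cwnd / 2.0, 2.0)    # multiplicative decrease, 2-MSS floor
    return total / rounds

for period in (8, 32, 128):
    print(period, round(average_cwnd(period), 1))   # roughly 12, 48, 192 segments
```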

are downloading you're either throwing away data from the server (which is wasting bandwidth getting all the way to you) or throwing away your clients' ACKs. Lost ACKs do almost nothing to slow down TCP unless you've thrown them *all* away.

You're definitely tossing useful data. One can argue that you're going to do that anyway at the bottleneck link, but I'm not sure I've had enough espresso to make that argument yet. :-)

I'm not saying all of this is completely useless, but it's relying a lot on the fact that the people you're trying to rate limit are going to be playing by the same rules you intended. This makes me really wish that something like ECN had taken off - any router between the two end-points can say "slow this connection down" and (if both ends are playing by the rules) they do so without wasting time on retransmits.

Yup.

  -Dave

I suggest reading the excellent page:
"High-Speed TCP Variants":
http://kb.pert.geant2.net/PERTKB/TcpHighSpeedVariants

Enough material there to keep NANOG readers busy all weekend long.

-Hank

I suggest reading the excellent page:
"High-Speed TCP Variants":
http://kb.pert.geant2.net/PERTKB/TcpHighSpeedVariants

[Charles N Wyble]

Hank,

This is in fact an excellent resource. Thanks for sharing it!

outstanding compilation Hank, thanks !!!

Rgds
-Jorge

The flows are in those boxes, but only exported for stats purposes
with NetFlow/IPFIX/sFlow/etc. Apparently it was not
as fast as they liked it to be and there were other issues.
Thus, what exactly is new here in his boxes that has not been
tried and failed before?

Roberts is selling a product to put in at the edge of your
WAN to solve packet loss problems in the core network. Since
most ISPs don't have packet loss problems in the core, but
most enterprise networks *DO* have problems in the core, I think
that Roberts is selling a box with less magic, and more science
behind what it does. People seem to be assuming that Roberts
is trying to put these boxes in the biggest ISP networks which
does not appear to be the case. I expect that he is smart enough
to realize that there are too many flows in such networks. On
the other hand, Enterprise WANs are small enough to feasibly
implement flow discards, yet big enough to be pushing the
envelope of the skill levels of enterprise networking people.

In addition Enterprise WANs can live with the curse of QoS, i.e. that
you have to punish some network users when you implement QoS. In the
Enterprise, this is acceptable because not all users have the same
cost/benefit. If flow switching lets them sweat their network assets
harder, then they will be happy.

--Michael Dillon

It is the result of the Geant2 SA3 working group.

There is much more there than just TCP variants.

Regards,
Hank

Kevin Day wrote:

Yeah, I guess the point I was trying to make is that once you throw SACK into the equation you lose the assumption that if you drop TCP packets, TCP slows down. Before New Reno, fast-retransmit and SACK this was true and very easy to model. Now you can drop a considerable number of packets and TCP doesn't slow down very much, if at all. If you're worried about data that your clients are downloading you're either throwing away data from the server (which is wasting bandwidth getting all the way to you) or throwing away your clients' ACKs. Lost ACKs do almost nothing to slow down TCP unless you've thrown them *all* away.

If this was true, surely it would mean that drop models such as WRED/RED are becoming useless?

Sam

As long as you didn't drop more packets than SACK could handle (generally 2 packets in-flight) dropping packets is pretty ineffective at causing TCP to slow down.

It shouldn't be. TCP hovers around the maximum bandwidth that a path will allow (if the underlying buffers are large enough). It increases its congestion window in congestion avoidance until a packet is dropped, then the congestion window shrinks but it also starts growing again.

If you read "The macroscopic behavior of the TCP Congestion Avoidance algorithm" by Mathis et al you'll see that TCP performance conforms to:

bandwidth = (MSS / RTT) * (C / sqrt(p))

Where MSS is the maximum segment size, RTT the round trip time, C a constant close to 1 and p the packet loss probability.

Since the overshooting of the congestion window causes congestion = packet loss, you end up at some equilibrium of bandwidth and packet loss. Or, for a given link: number of flows, bandwidth and packet loss.
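
Plugging hypothetical numbers into that formula (MSS 1460 bytes, a 100 ms RTT, C = 1) shows how sharply the loss rate caps throughput:

```python
# Mathis et al. estimate: bandwidth ~= (MSS / RTT) * C / sqrt(p)
from math import sqrt

def mathis_bps(p, mss_bytes=1460, rtt_s=0.100, c=1.0):
    return (mss_bytes * 8 / rtt_s) * c / sqrt(p)

for p in (1e-5, 1e-4, 1e-3, 1e-2):
    print(f"p = {p:g}: {mathis_bps(p) / 1e6:5.1f} Mbit/s")
# p = 1e-05:  36.9 Mbit/s    p = 0.0001: 11.7 Mbit/s
# p = 0.001:   3.7 Mbit/s    p = 0.01:    1.2 Mbit/s
```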

I'm sure this behavior isn't any different in the presence of SACK.

However, the caveat is that the congestion window never shrinks below two maximum segment sizes. If packet loss is such that you reach that size, then more packet loss will not slow down sessions. Note that for short RTTs you can still move a fair amount of data in this state, but any lost packet means a retransmission timeout, which stalls the session.

You've also got fast retransmit, New Reno, BIC/CUBIC, as well as host parameter caching to limit the effect of packet loss on recovery time.

The really interesting one is TCP Vegas, which doesn't need packet loss to slow down. But Vegas is a bit less aggressive than Reno (which is what's widely deployed) or New Reno (which is also deployed but not so widely). This is a disincentive for users to deploy it, but it would be good for service providers. Additional benefit is that you don't need to keep huge numbers of buffers in your routers and switches because Vegas flows tend to not overshoot the maximum available bandwidth of the path.
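
For comparison, the core of Vegas' congestion avoidance is a per-RTT estimate of how many segments it has queued in the network, adjusted before any loss happens (a simplified sketch; alpha and beta are typical Vegas thresholds, in segments):

```python
# Simplified TCP Vegas congestion-avoidance step, run once per RTT.
def vegas_update(cwnd, base_rtt, current_rtt, alpha=2, beta=4):
    expected = cwnd / base_rtt                 # rate if nothing were queued
    actual = cwnd / current_rtt                # rate actually being achieved
    queued = (expected - actual) * base_rtt    # ~segments sitting in queues
    if queued < alpha:
        return cwnd + 1        # path looks underused: grow the window
    if queued > beta:
        return cwnd - 1        # queues building: back off before any loss
    return cwnd                # in the sweet spot: hold steady
```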

As long as you didn't drop more packets than SACK could handle (generally 2 packets in-flight) dropping packets is pretty ineffective at causing TCP to slow down.

It shouldn't be. TCP hovers around the maximum bandwidth that a path will allow (if the underlying buffers are large enough). It increases its congestion window in congestion avoidance until a packet is dropped, then the congestion window shrinks but it also starts growing again.

I'm sure this behavior isn't any different in the presence of SACK.

At least in FreeBSD, packet loss handled by SACK recovery changes the congestion window behavior. During a SACK recovery, the congestion window is clamped down to allow no more than 2 additional segments in flight, but that only lasts until the recovery is complete, after which the window quickly recovers. (That's significantly glossing over a lot of details that probably only matter to those who already know them - don't shoot me for that not being 100% accurate :-) )

I don't believe that Linux or Windows are quite that aggressive with SACK recovery, but I'm less familiar there.

As a quick example on two FreeBSD 7.0 boxes attached directly over GigE, with New Reno, fast retransmit/recovery, and 256K window sizes, with an intermediary router simulating packet loss. A single HTTP TCP session going from a server to client.

SACK enabled, 0% packet loss: 780Mbps
SACK disabled, 0% packet loss: 780Mbps

SACK enabled, 0.005% packet loss: 734Mbps
SACK disabled, 0.005% packet loss: 144Mbps (19.6% the speed of having SACK enabled)

SACK enabled, 0.01% packet loss: 664Mbps
SACK disabled, 0.01% packet loss: 88Mbps (13.3%)

However, this falls apart pretty fast when the packet loss is high enough that SACK doesn't spend enough time outside the recovery phase. It's still much better than without SACK though:

SACK enabled, 0.1% packet loss: 48Mbps
SACK disabled, 0.1% packet loss: 36Mbps (75%)

However, the caveat is that the congestion window never shrinks below two maximum segment sizes. If packet loss is such that you reach that size, then more packet loss will not slow down sessions. Note that for short RTTs you can still move a fair amount of data in this state, but any lost packet means a retransmission timeout, which stalls the session.

True, a longer RTT changes this effect. Same test, but instead of back-to-back GigE, this is going over a real-world trans-atlantic link:

SACK enabled, 0% packet loss: 2.22Mbps
SACK disabled, 0% packet loss: 2.23Mbps

SACK enabled, 0.005% packet loss: 2.03Mbps
SACK disabled, 0.005% packet loss: 1.95Mbps (96%)

SACK enabled, 0.01% packet loss: 2.01Mbps
SACK disabled, 0.01% packet loss: 1.94Mbps (96%)

SACK enabled, 0.1% packet loss: 1.93Mbps
SACK disabled, 0.1% packet loss: 0.85Mbps (44%)

(No, this wasn't a scientifically valid test there, but the best I can do for an early Monday morning)

You've also got fast retransmit, New Reno, BIC/CUBIC, as well as host parameter caching to limit the effect of packet loss on recovery time.

The really interesting one is TCP Vegas, which doesn't need packet loss to slow down. But Vegas is a bit less aggressive than Reno (which is what's widely deployed) or New Reno (which is also deployed but not so widely). This is a disincentive for users to deploy it, but it would be good for service providers. Additional benefit is that you don't need to keep huge numbers of buffers in your routers and switches because Vegas flows tend to not overshoot the maximum available bandwidth of the path.

It would be very nice if more network-friendly protocols were in use, but with "download optimizers" for Windows that crank the TCP window sizes way up, the general move to solving latency by opening more sockets, and P2P doing whatever it can to evade ISP detection - it's probably a bit late.

-- Kevin