buffer bloat and packet pacing


In the past few years there's been a lot of talk about reducing buffer
depths, and many seem to think vendors are throwing memory on the
chips for the fun of it.

Let's look at a particularly pathological case. Assume the sender is
a CDN with a 40GE-connected server and the receiver is 10GE-connected,
with 300ms of latency between them.

10Gbps * 300ms = 375MB is the window size the client needs in order to
fill its pipe from the 40GE sender.
However, TCP does not normally pace packets inside the window, so the
40GE server will send the window as fast as it can, at line rate in
the best case, instead of limiting itself to 10Gbps. The receiver's
link can only serialise it out at 10GE, so the majority of that 375MB
ends up in the sender-side switch/router buffers.
If we can't buffer that, the receiver cannot receive at 10Gbps,
as the window size will shrink. Is this a problem? What rate should you
be able to expect, and at what latency? Usually contracts with
customers don't limit the bandwidth achievable at a given
latency, and writing such a limit down might make you appear inferior to your
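
For anyone who wants to rerun the numbers above, here is a rough
sketch in Python. It assumes a simple fluid model: the whole window is
sent as one back-to-back burst at sender line rate, nothing else
shares the path, and the queue forms at the 40GE-to-10GE step-down.
The function names are made up for illustration.

```python
def bdp_bytes(rate_bps, rtt_s):
    """Bandwidth-delay product in bytes: the window needed to fill the pipe."""
    return rate_bps * rtt_s / 8


def peak_queue_bytes(window_bytes, sender_bps, bottleneck_bps):
    """Peak queue when a full window arrives at sender_bps and drains at
    bottleneck_bps (fluid model, single unpaced burst, no other traffic)."""
    if sender_bps <= bottleneck_bps:
        return 0.0
    # While the burst arrives, the bottleneck drains a fraction
    # bottleneck_bps / sender_bps of it; the rest has to sit in a buffer.
    return window_bytes * (1 - bottleneck_bps / sender_bps)


window = bdp_bytes(10e9, 0.300)
queue = peak_queue_bytes(window, 40e9, 10e9)
print(f"window needed: {window / 1e6:.0f} MB")
print(f"peak queue at the step-down: {queue / 1e6:.2f} MB")
```

Under those assumptions the window comes out at 375 MB and the peak
queue at 281.25 MB, i.e. three quarters of the window.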

Perhaps this is an unrealistic case, but if you run the numbers for
much less pathological cases, you'll still end up with much larger
buffer needs than a large number of switch chips out there can meet.

Some new ones, like the JNPR QFX10k and Broadcom Jericho, come with much
larger buffers than their predecessors, and will be able to deal with
what I hope are most practical cases.

Linux these days actually does have a bandwidth estimator for TCP
sessions, but it's not used by default for anything; it's just there
for other layers to consume so they can do something about it. And I
believe in 'tc' you can use these to cause packet pacing inside a
QUIC and MinimaLT, AFAIK, do bandwidth estimation and packet pacing by default.

In a perfect world, we'd be done now. The receiver-side switch can do
with very small buffers; a few packets should suffice. However, if the
network itself is congested, the bandwidth estimate keeps sinking,
these well-behaved streams lose to the aggressive TCP streams, and
you end up with 0bps estimates.
So perhaps the bandwidth estimator should be application-aware and
never report a lower estimate than what is practical for the given
application, so that it can compete fairly with aggressive streams
up to the required rate.
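
To make the pacing idea concrete, here is a small Python sketch. It is
not how Linux, QUIC or MinimaLT implement it; it only shows the
arithmetic of spacing packets at an estimated rate, plus the
application floor suggested above. All names and the 1500-byte packet
size are assumptions for illustration.

```python
def pacing_gap_s(packet_bytes, rate_bps):
    """Time between packet starts so that the average rate equals rate_bps."""
    return packet_bytes * 8 / rate_bps


def floored_rate_bps(estimate_bps, app_floor_bps):
    """Never pace below what the application needs (hypothetical policy)."""
    return max(estimate_bps, app_floor_bps)


def send_times(n_packets, packet_bytes, rate_bps):
    """Departure times, in seconds, for a paced train of equal-size packets."""
    gap = pacing_gap_s(packet_bytes, rate_bps)
    return [i * gap for i in range(n_packets)]


# A collapsed estimate (0 bps) is lifted to an assumed 5 Mbps floor.
rate = floored_rate_bps(0, 5e6)
print(send_times(3, 1500, rate))
```

Paced at 10Gbps, 1500-byte packets leave about 1.2 microseconds apart,
so a 10GE egress never sees more than roughly one packet queued from
this flow.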

Information I'd love to have: how large do TCP window sizes peak at
in real networks? Some CDN must be collecting these stats. I'd love
to see rough statistics. <1% go over 100MB? 2% between 50MB-100MB? ...
A few large brackets of the distribution of window sizes from some
CDN offering content download (GGC and OpenConnect are not
interesting, as they won't send large files).
Also, are some CDNs already implementing packet pacing inside the
window? If so, how? Do they have a lower limit to it?

Some related URLs:

> only serialise them 10GE out, causing majority of that 375MB ending up
> in the sender side switch/router buffers.


Optimally, but TCP slow start will generally stop this from happening on
well-behaved sending-side stacks, so you end up ramping up quickly to path
rate rather than to egress line rate from the sender side. Also, regardless
of an individual flow's buffering requirements, the intermediate path will
be catering for large numbers of flows, so while it's interesting to talk
about 375MB of intermediate path buffers, this is shared buffer space and
any attempt on the part of an individual sender to (ab)use the entire path
buffer will end up causing RED/WRED for everyone else.

Otherwise, this would be a fascinating talk if people had real world data.


This assumes the network is congested and unable to reach its potential
rate. If it can reach its potential rate, eventually the window will
scale to 375MB and the pathological flooding will occur.
Mostly the network is congested and the pathological case cannot
happen: the egress cannot absorb the floods, so the window cannot grow
to the needed size, which also means the potential rate will not be
reached and the rate will be something less than 10Gbps. Essentially
we threw the baby out with the bath water, kind of like protecting
from DoS by killing the victim.
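
A quick way to see how much rate is lost when the window cannot grow:
a window-limited flow moves at most one window per round trip. The
sketch below assumes a steady state with no loss recovery; the
function name and the 100MB example are made up for illustration.

```python
def achievable_rate_bps(window_bytes, rtt_s, line_rate_bps):
    """Upper bound on throughput: one window per RTT, capped at line rate."""
    return min(line_rate_bps, window_bytes * 8 / rtt_s)


for window_mb in (375, 100):
    rate = achievable_rate_bps(window_mb * 1e6, 0.300, 10e9)
    print(f"{window_mb} MB window at 300ms: {rate / 1e9:.2f} Gbps")
```

With the full 375MB window the flow can reach 10Gbps; if the window is
held to 100MB, it tops out at roughly 2.67Gbps on the same path.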

The original analysis is flawed because it assumes latency is constant.
Any analysis has to include the fact that buffering changes latency.

If you start with a 300ms path (by propagation delay, switching latency,
etc.), and 375MB of buffers on a 10G port, then, when the buffers
fill, you end up with a 600ms path[1]. And a 375MB window is no longer
sufficient to keep the pipe full.

Instead, you need a 750MB buffer.

But now the latency is 900ms.

And so on. This doesn't converge. Every byte of filled buffer is
another byte you need in the window if you're going to fill the pipe.

Not accounting for this is part of the reason the original analysis is
flawed. The end result is that you always run out of window or run out
of buffer (causing packet loss).
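
The spiral can be reproduced with a few lines of Python. This is only
a sketch of the argument as stated, assuming the buffer is always
sized to the current window and is completely full each round; the
function name is made up.

```python
def buffer_latency_spiral(base_rtt_s, rate_bps, rounds):
    """Return (window_bytes, rtt_when_buffer_full_s) for each round."""
    rows = []
    rtt = base_rtt_s
    for _ in range(rounds):
        # Window needed to fill the pipe at the current path latency.
        window_bytes = rate_bps * rtt / 8
        # A full buffer of that size adds its own drain time to the path.
        filled_rtt = base_rtt_s + window_bytes * 8 / rate_bps
        rows.append((window_bytes, filled_rtt))
        rtt = filled_rtt
    return rows


for window, rtt in buffer_latency_spiral(0.300, 10e9, 4):
    print(f"window {window / 1e6:.0f} MB -> latency {rtt * 1e3:.0f} ms")
```

Each round adds another 375MB of window and another 300ms of latency,
matching the 375MB/600ms and 750MB/900ms steps above, and it never
settles.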

Here's a paper that shows you don't need buffers equal to
bandwidth*delay to get near capacity:
http://www.cs.bu.edu/~matta/Papers/hstcp-globecom04.pdf
(I'm not endorsing it. Just pointing it out as a datapoint.)

     -- Brett

[1] 0.300 + 375E6 * 8 / 10E9 = 600ms

Hey Brett,

> Here's a paper that shows you don't need buffers equal to
> bandwidth*delay to get near capacity:
> http://www.cs.bu.edu/~matta/Papers/hstcp-globecom04.pdf
> (I'm not endorsing it. Just pointing it out as a datapoint.)

A quick glance makes me believe the S and D nodes are equal bandwidth,
but only the R1-R2 bandwidth is explicitly stated. S1, D1, Sn, Dn are
only ever mentioned in the topology. If the Sender is the same or lower
rate than the Destination, then we really shouldn't need almost any
buffering. The issue should only come up when the Sender is
significantly higher rate than the Destination and the network is not
limiting them.

> Hey Brett,
>
>> Here's a paper that shows you don't need buffers equal to
>> bandwidth*delay to get near capacity:
>> http://www.cs.bu.edu/~matta/Papers/hstcp-globecom04.pdf
>> (I'm not endorsing it. Just pointing it out as a datapoint.)
>
> A quick glance makes me believe the S and D nodes are equal bandwidth,
> but only the R1-R2 bandwidth is explicitly stated. S1, D1, Sn, Dn are
> only ever mentioned in the topology. If the Sender is the same or lower
> rate than the Destination, then we really shouldn't need almost any
> buffering.

Unless Sender is higher than R1-R2.

> The issue should only come up when the Sender is significantly higher
> rate than the Destination and the network is not limiting them.

I didn't read it in detail either, but at first glance, it appears to
me that the model is infinite bandwidth and zero latency between S and
R1, and D and R2, with queueing happening in R1.

That's not going to give materially different results than having S-R1
be 4 times R1-R2, and R2-D being the same as R1-R2. So it fits well
with the original discussion here of 40G into 10G.

     -- Brett

Can anyone provide references on this top so I can educate myself?

This e-mail and any attachments thereto is intended only for use by the addressee(s) named herein and may be proprietary and/or legally privileged. If you are not the intended recipient of this e-mail, you are hereby notified that any dissemination, distribution or copying of this email, and any attachments thereto, without the prior written permission of the sender is strictly prohibited. If you receive this e-mail in error, please immediately telephone or e-mail the sender and permanently delete the original copy and any copy of this e-mail, and any printout thereof. All documents, contracts or agreements referred or attached to this e-mail are SUBJECT TO CONTRACT. The contents of an attachment to this e-mail may contain software viruses that could damage your own computer system. While Hibernia Networks has taken every reasonable precaution to minimize this risk, we cannot accept liability for any damage that you sustain as a result of software viruses. You should carry out your own virus checks before opening any attachment.

About every edition of Packet Pushers Podcast for the last 18 months would
be a good start probably. That'll keep you busy.


This might be of help



There's also a quite comprehensive survey from an academic angle:


A bit more effort will be required on your part to get the most out
of it, but one potentially in-depth resource would be Nick Feamster's
Software Defined Networking course, currently available through



For a more academic perspective:
"Software-defined networking: A comprehensive survey"

Hi Rod,

Ivan Pepelnjak's blog is a good source of information about SDN (and
what it is not).

Blog link:

Ivan's presentation at last RIPE meeting in Amsterdam:

Software Defined Networks - Four Years Later


> Can anyone provide references on this top so I can educate myself?

All of that for 11 1/2 words?


I think it's time to change my SMTP greeting to:

220-By submitting e-mail to this server, you agree all legal
disclaimers are null and void.
220 You also agree that I am awesome.


I like that. Unfortunately, I no longer operate a mail host.

I have been trying to figure out how to mechanically route messages containing them to the spam sump.

IANAL, but I think an interesting case would be trying to enforce that crap in a situation involving unsolicited email (as in this case).

It would be hard to prove that you implicitly agreed to the constraints
mentioned within the email merely by receiving it and reading it.
Even EULAs require you to check a box or click "I Accept."

Does anybody have a citation that legal disclaimers attached to
publicly posted mail aren't null and void? Seems to me that
what they're trying to say is "Sorry, we're too lame to use
PGP or similar on actually sensitive e-mail"...

hi valdis

> Does anybody have a citation that legal disclaimers attached to
> publicly posted mail aren't null and void? Seems to me that
> what they're trying to say is "Sorry, we're too lame to use
> PGP or similar on actually sensitive e-mail"...

i keep wondering why "they" keep using sniffable clear text
smtp/imap/pop3 instead of at least the encrypted versions

the problem also is both ends, the sender and the receiver
and all the laptops/desktops need to be configured

more importantly, why not just use https based webmail
or even smtps encrypted google mail where less setup and
configuring would be needed for sender and receiver

have a nice weekend