Westnet and Utah outage

This belongs on the end2end-interest list or IPPM or elsewhere, but
I'll save a lot of people going through the archives.

In order to get X bandwidth on a given TCP flow you need to have an
average window size of X * RTT. This is expressed in terms of TCP
segments N = (X * RTT) / MSS (or more correctly the segment size in
use rather than MSS). To sustain an average window of N segments, you
must ideally reach a steady state where you cut cwnd (the congestion
window) in half, then grow linearly, fluctuating between 2/3 and 4/3
of the target size. This would mean one drop every 2/3 N windows, so
DropRate expressed as the time between drops is 2/3 N * RTT. In one
RTT, on average, X * RTT of data flows. In practice, you rarely drop
at the perfect time, so the
constant 2/3 (call it K) can be raised to somewhere between 1 and 2.
Multiplying the time between drops by X gives the data sent between
drops, and since N = (X * RTT) / MSS, DropRate = K * X * RTT * X *
RTT / MSS. Units are b/s * sec * b/s * sec / b, or b. The DropRate
expressed in bits can be converted to seconds or packets (divide by X
or by MSS). This type of analysis
is courtesy of the good folks at PSC (Matt, Jamshid, et al).

For example, to get 40 Mb/s at 70 msec RTT with a 4096 byte MSS, you
can afford one error about every 6 seconds (K=1) or 1 in 7,300
packets. If you look at 56 Kb/s with a 512 byte MSS at the same RTT,
you get a very interesting result. You need one error every 66 msec
or 1 error in 0.9 packets. This gives a good incentive to increase
delay. At 250 msec, you get a result of one error in 11.7 packets
(much better!).
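
For anyone who wants to redo the arithmetic, here is a minimal sketch
of the calculation in Python. The function name and its arguments are
mine, not from the PSC analysis, and it assumes MSS is given in bytes,
rates are in bits per second, and the 56 Kb/s case uses the same 70
msec RTT as the first example.

```python
def drop_interval(rate_bps, rtt_s, mss_bytes, k=1.0):
    """Return (N, seconds between drops, packets between drops).

    Hypothetical helper: implements DropRate = K * (X * RTT)^2 / MSS
    from the text, with MSS converted from bytes to bits.
    """
    mss_bits = mss_bytes * 8
    n = rate_bps * rtt_s / mss_bits                      # N = (X * RTT) / MSS
    drop_bits = k * (rate_bps * rtt_s) ** 2 / mss_bits   # DropRate in bits
    # Divide by X for seconds, by MSS for packets.
    return n, drop_bits / rate_bps, drop_bits / mss_bits


if __name__ == "__main__":
    cases = [
        (40e6, 0.070, 4096),   # 40 Mb/s, 70 msec, 4096 byte MSS
        (56e3, 0.070, 512),    # 56 Kb/s, 70 msec (assumed), 512 byte MSS
        (56e3, 0.250, 512),    # 56 Kb/s, 250 msec, 512 byte MSS
    ]
    for rate, rtt, mss in cases:
        n, secs, pkts = drop_interval(rate, rtt, mss)
        print(f"X={rate:.0f} b/s RTT={rtt * 1e3:.0f} msec MSS={mss}: "
              f"N={n:.2f} segments, one drop per {secs:.3f} sec, "
              f"1 in {pkts:.1f} packets")
```

With K=1 this should land close to the figures quoted above; passing
k=2.0 shows how much the tolerable loss relaxes at the other end of the
range.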

Another interesting point to note is that you need 3 duplicate ACKs
for TCP fast retransmit to work, so your window must be at least 4
segments (and should be more). If you have a very large number of TCP
flows, where on average people get less than 1200 baud or so, the
delay you need to make TCP work well starts to exceed the magic 3
second boundary. This was discussed ad nauseam on end2end-interest.
An important result is that you need more queueing than the delay
bandwidth product for severely congested links. Another is that there
is a limit to the number of active TCP flows that can be supported per
bandwidth. One suggestion to address the latter problem is to further
drop segment size if cwnd is less than 4 segments in size and/or when
estimated RTT gets into the seconds range.
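
A rough way to see both limits is to ask what RTT a flow needs in
order to keep a given number of segments in flight at a given rate,
and how many such flows fit on a link. The sketch below is only
illustrative: the function names are made up, the 512 byte segment
size is borrowed from the earlier example, and the 3 second ceiling is
just a parameter.

```python
def rtt_for_window(rate_bps, window_segments, mss_bytes):
    """RTT in seconds at which window_segments of mss_bytes each
    exactly sustains rate_bps (window = rate * RTT)."""
    return window_segments * mss_bytes * 8 / rate_bps


def max_flows(link_bps, window_segments, mss_bytes, max_rtt_s):
    """Rough upper bound on active flows sharing link_bps equally if
    each must keep window_segments in flight with RTT <= max_rtt_s."""
    per_flow_bps = window_segments * mss_bytes * 8 / max_rtt_s
    return int(link_bps // per_flow_bps)


if __name__ == "__main__":
    # Assumed values: 1200 b/s per flow, 512 byte segments.
    for window in (1, 4):
        rtt = rtt_for_window(1200, window, 512)
        print(f"window of {window} segment(s) at 1200 b/s needs "
              f"RTT = {rtt:.1f} sec")
    # Assumed values: 56 Kb/s link, 4 segment window, 3 sec ceiling.
    print("flows on a 56 Kb/s link:", max_flows(56e3, 4, 512, 3.0))
```

Under these assumptions even a single 512 byte segment at 1200 b/s
takes about 3.4 seconds to drain, which is why shrinking the segment
size is one of the few knobs left at that point.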

This analysis of how much loss is acceptable to TCP may not be outside
the bounds of an informational RFC, but so far none exists.

Curtis