Questions about Internet Packet Losses

Hello, and best wishes for what's left of 1997. Now, if you would, ...

Below are some questions I hope you'll help me answer about packet loss on
the Internet.

Here are two paragraphs taken from:

http://www.merit.edu/~ipma/netnow/docs/info.html

"Early experiments with NetNow show that 30% packet loss between public
exchange points is common for major Internet service providers during peak
usage periods. The initial investigation also suggests that loss rates are
closely related to bandwidth usage and congestion problems. Although some
of the packet loss is inadvertent, a large percentage of the public
exchange point connectivity problems reflect intentional engineering
decisions by Internet service providers based on commercial settlement
issues.

"The high packet loss may not generally reflect problems seen by the
majority of customers of the larger network service providers. In fact,
increasing levels of Internet traffic are not traversing the public
exchange points. Instead, many large service providers are migrating their
inter-provider traffic to private exchange points, or direct connections to
other providers. Merit is working closely with providers to develop tools
and infrastructure that more closely reflect Internet performance as
observed by the majority of backbone customers."

Questions:

Are you familiar with this packet loss data from Merit? If not, please see
above URL.

Is Merit's packet loss data (NetNow) credible? Do packet losses in the
Internet now average between 2% and 4% daily? Are 30% packet losses common
during peak periods? Is there any evidence that Internet packet losses are
trending up or down?

If Merit's data is not correct, where has Merit gone wrong? Where is there
better data?

Were Merit's data correct, what would be the impact of 30% packet losses on
opening up TCP connections? On TCP throughput, say through a 28.8Kbps
modem? On Web throughput, since so many TCP connections are involved? On
DNS look-ups? On email transport?

How big a problem is HTTP's opening of so many TCP connections? Does TCP
need to operate differently than it does now when confronted routinely with
30% packet losses and quarter-second transit delays? What is the proper
response of an IP-based protocol, like TCP, as packet losses climb? Try
harder or back off or what? How robust are various widespread TCP/IP
implementations in the face of 30% packet loss and quarter-second transit
delays?

Is the Internet's sometimes bogging down due mainly to packet losses or
busy servers or what, or does the Internet not bog down?

What fraction of Internet traffic still goes through public exchange points
and therefore sees these kinds of packet losses? What fraction of Internet
traffic originates and terminates within a single ISP?

Where is the data on packet losses experienced by traffic that does not go
through public exchange points?

If 30% loss impacts are noticeable, what should be done to eliminate the
losses or reduce their impacts on Web performance and reliability?

Are packet losses due mainly to transient queue buffer overflows of user
traffic or to discards by overburdened routing processors or something else?

What does Merit mean when they say that some of these losses are
intentional because of settlement issues? Are ISPs cooperating
intelligently in the carriage of Internet traffic, or are ISPs competing
destructively, to the detriment of them and their customers?

Any help you can offer on these questions would be appreciated.

/Bob Metcalfe, InfoWorld

Bob,

You quote:

   "Although some
   of the packet loss is inadvertent, a large percentage of the public
   exchange point connectivity problems reflect intentional engineering
   decisions by Internet service providers based on commercial settlement
   issues.

I think that this is an _extremely_ dangerous assertion on Merit's part.
As always, ascribing intent rather than raw data requires much more
justification which I have yet to see.

   Are you familiar with this packet loss data from Merit? If not, please see
   above URL.

Am now... :wink:

   Is Merit's packet loss data (NetNow) credible? Do packet losses in the
   Internet now average between 2% and 4% daily? Are 30% packet losses common
   during peak periods? Is there any evidence that Internet packet losses are
   trending up or down?

Yes, that matches my instinctive feel. I don't have concrete data that
either corroborates or disputes their numbers, nor any that speaks to
loss-rate trends.

   Were Merit's data correct, what would be the impact of 30% packet losses on
   opening up TCP connections?

TCP is pretty damn robust. Opening a connection is still likely to work.
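A rough back-of-the-envelope sketch bears this out. Assuming independent
per-packet loss and roughly five SYN retransmissions (typical of BSD-derived
stacks of the era; the retry count here is an assumption, not a measurement):

```python
# Chance a TCP three-way handshake eventually completes under
# independent per-packet loss. Assumes the client retransmits its
# SYN up to `attempts` times (~5 was typical of mid-90s stacks).

def handshake_success(loss, attempts=5):
    # One attempt succeeds if the SYN and the returning SYN-ACK
    # both survive: probability (1 - loss)**2.
    per_try = (1 - loss) ** 2
    return 1 - (1 - per_try) ** attempts

for loss in (0.02, 0.30):
    print(f"loss={loss:.0%}: P(connect) = {handshake_success(loss):.4f}")
```

Even at 30% loss, better than 96% of connection attempts eventually get
through, which is why opening a connection still "works" even when using
it is miserable.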

   On TCP throughput, say through a 28.8Kbps
   modem? On Web throughput, since so many TCP connections are involved? On
   DNS look-ups? On email transport?

As you might imagine, that kind of packet loss rate is 'highly detrimental'
to throughput. If you're asking for concrete numbers, I don't have them,
but I've lived through them. Qualitatively, it means that interactive
usage is intolerable. On the bright side, email works just fine.
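For a rough sense of scale, the well-known steady-state approximation
(throughput on the order of MSS / (RTT * sqrt(p))) can be plugged in.
Note this model is only derived for small loss rates, so the 30% figure
below is purely illustrative of the degradation, not a prediction:

```python
import math

# Rough steady-state TCP throughput bound: MSS / (RTT * sqrt(loss)).
# At high loss rates timeouts dominate and real throughput is far
# worse, so treat these as optimistic upper bounds.

def tcp_throughput_bps(mss_bytes, rtt_s, loss):
    return (mss_bytes * 8) / (rtt_s * math.sqrt(loss))

# 536-byte MSS (a common default then), quarter-second RTT
for loss in (0.02, 0.30):
    bps = tcp_throughput_bps(536, 0.25, loss)
    print(f"loss={loss:.0%}: ~{bps / 1000:.1f} kbit/s upper bound")
```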

   How big a problem is HTTP's opening of so many TCP connections?

It's a very significant problem. It decreases the average packet size,
thereby making routers work much harder. It generates many more packets
than necessary, and then closes down the connection after a very short
transfer. In short, it's a horribly inefficient use of the net.

   Does TCP need to operate differently than it does now when confronted
   routinely with 30% packet losses and quarter-second transit delays?

Your question presumes that we should live with the 30% losses. We should
not. TCP does passably well at surviving such brown-outs and I would not
suggest changes for that cause. Note that there are other changes I'd like
to see, such as more use of Path MTU Discovery and fixing HTTP, which are
much more important. The quarter-second transit delays fall into two
categories. The first is transient delay, mostly caused by routing
transients; obviously we need to minimize those. The second is normal
propagation delay, where using larger windows would help a great deal. I
don't think that many TCP implementations allocate sufficient buffering
today to truly be efficient.
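The bandwidth-delay product makes the buffering point concrete; the link
speeds below are illustrative choices, not measurements:

```python
# Window needed to keep a path full: bandwidth * round-trip time.
# With a quarter-second RTT, even modest links need far more than
# the 4-8 KB socket buffers common in mid-90s stacks.

def bdp_bytes(bandwidth_bps, rtt_s):
    return bandwidth_bps * rtt_s / 8

for label, bps in (("T1 (1.544 Mbit/s)", 1_544_000),
                   ("T3 (45 Mbit/s)", 45_000_000)):
    print(f"{label}: need ~{bdp_bytes(bps, 0.25) / 1024:.0f} KB of window")
```

A T1 at that delay already wants roughly 47 KB of window; a default 8 KB
buffer leaves the pipe mostly empty no matter how clean the path is.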

   What is the proper
   response of an IP-based protocol, like TCP, as packet losses climb? Try
   harder or back off or what?

Back off. Slow start is the accepted algorithm. Trying harder only
increases congestion.

   How robust are various widespread TCP/IP
   implementations in the face of 30% packet loss and quarter-second transit
   delays?

I have yet to see a significant problem with robustness.

   Is the Internet's sometimes bogging down due mainly to packet losses or
   busy servers or what, or does the Internet not bog down?

That depends on your definitions. "The Internet" as a whole does not bog
down. It's a modular system and there are localized problems and
congestion which result in poor service to a wide-ranging set of users.
The causes of the problems vary. I've seen lots of really slow servers,
congested access links, unhappy routers, congested interconnects, etc.

   Where is the data on packet losses experienced by traffic that does not go
   through public exchange points?

I suspect that you'd have to ask the parties involved in the private
exchange point. I suspect that no such statistics are currently kept, or
if they are, the parties would not be willing to disclose them. Thus
IPPM...

   If 30% loss impacts are noticeable, what should be done to eliminate the
   losses or reduce their impacts on Web performance and reliability?

Ah... Yes, loss rates of 30% are noticeable and painful. There are
literally hundreds of things that can and should be done to improve
things. Let's see, just off the top of my head:

- more private interconnects are necessary in the long term to scale the
  network. We cannot have interconnects of infinite bandwidth as hardware
  simply doesn't scale as quickly as demand. Thus, we need to invoke
  parallelism. I think that this is already happening in a reasonable way.
- more bandwidth. Of course, faster is better. OC3 SONET technology is
  quickly becoming an obvious upgrade path from today's T3 backbones.
- better routers. Current implementations have many shortcomings which
  aggravate instability.
- accurate reporting. There seems to be a trend to find a problem and get
  everyone hyped up over it, far in excess of reality. We spend time
  dealing with such issues rather than doing beneficial engineering.
- improved protocols. We have an ongoing scalability problem with our
  routing protocols.
- fixed host stacks. Using the full MTU would be a boon. Recent data
  indicates that >40% of the packets out there are 40 bytes.
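On that last point, the arithmetic is stark. Assuming a 20-byte IPv4
header plus a 20-byte TCP header with no options, a 40-byte packet is a
bare ACK carrying no payload at all:

```python
# Header overhead versus packet size: a 40-byte packet is all header
# (20-byte IP + 20-byte TCP, no options), while a full 1500-byte
# Ethernet MTU packet is ~97% payload.

HEADERS = 40  # IPv4 + TCP, no options

def payload_fraction(packet_bytes):
    return max(packet_bytes - HEADERS, 0) / packet_bytes

for size in (40, 552, 1500):
    print(f"{size:4d}-byte packet: {payload_fraction(size):.0%} payload")
```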

   Are packet losses due mainly to transient queue buffer overflows of user
   traffic or to discards by overburdened routing processors or something else?

"mainly" is a dangerous quantifier given that there's no hard data. My
intuition says that sheer congestion is the most serious problem, followed
closely by router implementation.

   What does Merit mean when they say that some of these losses are
   intentional because of settlement issues?

I think you really need to ask Merit that. I could find no justification
for that on their Web page.

   Are ISPs cooperating
   intelligently in the carriage of Internet traffic, or are ISPs competing
   destructively, to the detriment of them and their customers?

Ummm... I see them cooperating. "intelligently" is in the eye of the
beholder. Certainly there are some who are being anti-social.

Tony

If your provider has 30% packet loss you need to look at a new provider. I
think most providers have little packet loss. This is a ping -c 1000 from
one of my servers in Arlington, VA to a router at PAIX.

--- 205.215.63.18 ping statistics ---
1000 packets transmitted, 1000 packets received, 0% packet loss
round-trip min/avg/max = 77.3/80.0/127.3 ms

I know you are sad that the net did not fall apart, but most of us are
able to keep up. The nice thing is that the price of bandwidth is starting
to drop; we have some OC-3 circuits that cost just a little more than a DS3.

P.S. Yes, the delay is up there, but we are installing a DS3 from Palo Alto
to Arlington, so packets between Arlington and Palo Alto will no longer
need to route through Atlanta or Chicago.

Nathan Stratton President, NetRail,Inc.