Lossy Cogent p2p experiences?

William Herrin wrote:

No, not at all. First, though you explain slow start,
it has nothing to do with long fat pipes. The long fat
pipe problem is addressed by window scaling (and SACK).
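
(As a back-of-envelope illustration of the long fat pipe problem; the
figures below are illustrative, not from this thread. Without window
scaling, the classic 64 KB receive window caps throughput at
window/RTT, far below what a long fat pipe can carry.)

    # Max TCP throughput is bounded by window / RTT.
    # Without RFC 1323 window scaling, the window tops out at 64 KB.

    def max_throughput_bps(window_bytes: float, rtt_s: float) -> float:
        return window_bytes * 8 / rtt_s

    # Illustrative long fat pipe: 10 Gb/s link, 100 ms RTT.
    print(max_throughput_bps(64 * 1024, 0.100) / 1e6)  # ~5.2 Mb/s

    # Window needed to fill that pipe = bandwidth-delay product:
    print(10e9 / 8 * 0.100 / 1e6)  # 125 MB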

So, I've actually studied this in real-world conditions and TCP
behaves exactly as I described in my previous email for exactly the
reasons I explained.

Yes, of course, which is my point. Your problem is that your
point about slow start has nothing to do with long fat pipes.

> Window scaling and SACK make it possible for TCP to grow to consume
> the entire end-to-end pipe when the pipe is at least as large as the
> originating interface and -empty- of other traffic.

Totally wrong.

Unless the pipe is long and fat, plain TCP without window scaling
or SACK will grow to consume the entire end-to-end pipe when the
pipe is at least as large as the originating interface and -empty-
of other traffic.

> Those
> conditions are rarely found in the real world.

It is usual for TCP to consume all the available bandwidth.

The exceptions, not so rare in the real world, are plain TCP
connections over long fat pipes.

          Masataka Ohta

Well, it doesn't show up in long slow pipes because the low
transmission speed spaces out the packets, and it doesn't show up in
short fat pipes because there isn't enough delay to cause the
burstiness. So I don't know how you figure it has nothing to do with
long fat pipes, but you're plain wrong.
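
(A rough sketch of the in-flight numbers behind this point, with
illustrative figures of my own choosing: slow start doubles the burst
each RTT, so the amount of data the pipe holds bounds how large the
back-to-back bursts can grow.)

    # Pipe capacity in 1500-byte packets = rate * RTT / MSS.
    MSS = 1500  # bytes

    def packets_in_flight(rate_bps: float, rtt_s: float) -> float:
        return rate_bps / 8 * rtt_s / MSS

    print(packets_in_flight(1e6, 0.100))   # long slow:  ~8 packets
    print(packets_in_flight(10e9, 0.001))  # short fat:  ~833 packets
    print(packets_in_flight(10e9, 0.100))  # long fat:   ~83,333 packets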

Regards,
Bill Herrin

William Herrin wrote:

Well, it doesn't show up in long slow pipes because the low
transmission speed spaces out the packets,

Wrong. That is a phenomenon of slow access links and a fast
backbone, which has nothing to do with this thread.

If the backbone is as slow as the access link, no such "spacing
out" is possible.

and it doesn't show up in
short fat pipes because there's not enough delay to cause the
burstiness.

With a short pipe, the burst speed shows up continuously,
without interruption.

> So I don't know how you figure it has nothing to do with
> long fat pipes,

That's your problem.

            Masataka Ohta

Nick Hilliard wrote:

Are you saying you thought a 100G Ethernet link actually consisting
of 4 parallel 25G links, which is an example of "equal speed multi
parallel point to point links", was relying on hashing?

This is an excellent example of what we're not talking about in this thread.

Not "we", but "you".

A 100G serdes is an unbuffered mechanism which includes a PLL, and this allows the style of clock/signal synchronisation required for the deserialised 4x25G lanes to be reserialised at the far end. This is one of the mechanisms used for packet / cell / bit spray, and it works really well.

That's why I mentioned round robin, instead of a fully shared
buffer, as the proper solution for that case.
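
(For concreteness, a minimal sketch of the difference; the link names
and flow-key choice here are mine, not from the thread. Round robin
spreads packets evenly by construction; per-flow hashing pins each
flow to one link, so one elephant flow can fill a single link while
the others sit idle.)

    from itertools import cycle
    import zlib

    LINKS = ["link0", "link1", "link2", "link3"]  # equal-speed parallel links

    # Round robin: next link for every packet, perfectly even load.
    rr = cycle(LINKS)
    def pick_round_robin() -> str:
        return next(rr)

    # Per-flow hashing: all of a flow's packets stick to one link.
    def pick_hashed(src: str, dst: str, sport: int, dport: int) -> str:
        key = f"{src}:{sport}->{dst}:{dport}".encode()
        return LINKS[zlib.crc32(key) % len(LINKS)]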

This thread is talking about buffered transmission links on routers / switches on systems which provide no clocking synchronisation and not even a guarantee that the bearer circuits have comparable latencies. ECMP / hash based load balancing is a crock, no doubt about it;

See the first three lines of this mail to find that I explicitly
mentioned "equal speed multi parallel point to point links" as the
context for round robin.

As I already told you:

: In theory, you can always fabricate unrealistic counter examples
: against theories by ignoring essential assumptions of the theories.

you keep ignoring essential assumptions for no good purpose.

            Masataka Ohta

Cogent support has been about as bad as you can get. Everything is great, clean your fiber, iperf isn’t a good test, install a physical loop, oh wait, we don't want that, so go pull it back off, new updates come at three to seven day intervals, etc. If the performance had never been good to begin with, I’d have just attributed this to their circuits, but since it worked until late June, I know something has changed. I’m hoping someone else has run into this and maybe knows of some hints I could give them to investigate. To me it sounds like there’s a rate limiter / policer defined somewhere in the circuit, or an overloaded interface/device we’re forced to traverse, but they assure me this is not the case and claim to have destroyed and rebuilt the logical circuit.

Sure smells like port buffer issues somewhere in the middle (mismatched deep/shallow buffers, or something configured to support jumbo frames but with buffers not optimized for them).

No... you are saying that.

Mark.

It is amusing how he tried to pivot the discussion. Nobody was talking about how lane transport in optical modules works.

Mark.

Mark Tinka wrote:

Are you saying you thought a 100G Ethernet link actually consisting
of 4 parallel 25G links, which is an example of "equal speed multi
parallel point to point links", was relying on hashing?

No...

So, though you wrote:

>> If you have multiple parallel links over which many slow
>> TCP connections are running, which should be your assumption,
>> the proper thing to do is to use the links with round robin
>> fashion without hashing. Without buffer bloat, packet
>> reordering probability within each TCP connection is
>> negligible.
>
> So you mean, what... per-packet load balancing, in lieu of per-flow
> load balancing?

you now recognize that per-flow load balancing is not a very
good idea.

Good.

you are saying that.

See above to find my statement of "without hashing".

            Masataka Ohta

You keep moving the goal posts. Stay on-topic.

I was asking you to clarify your post as to whether you were speaking of per-flow or per-packet load balancing. You did not do that, but I did not return to that question because your subsequent posts implied that you were talking about per-packet load balancing.

And just because I said per-flow load balancing has been the gold standard for the last 25 years does not mean it is the best solution. It just means it is the gold standard.

I recognize what happens in the real world, not in the lab or textbooks.

Mark.

Fun fact about the real world: devices do not internally guarantee
order. That is, even if you have identical-latency links and zero
congestion, order is not guaranteed between packet1 arriving on
interface I1 and packet2 arriving on interface I2; which packet
goes out first on interface E1 is unspecified.

This is because packets can be sprayed across multiple lookup
engines, so order is lost even for packets coming exclusively from
interface1. After the lookup, order is restored per _flow_; it is
not restored between flows, so packets coming from interface1 with
random ports won't leave interface2 in the same order.

So order is only restored inside a single lookup complex (interfaces
are not guaranteed to be in the same complex) and only for actual
flows.

It is designed this way because no one runs networks that rely on
ordering outside these parameters, and no one even knows their kit
works like this, because they don't have to.
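
(A toy model of the behaviour described above; the field names and
structure are illustrative, not any vendor's implementation. Packets
are re-sequenced per flow key after parallel lookup, and no ordering
is enforced between flows.)

    from collections import defaultdict

    def flow_key(pkt: dict) -> tuple:
        # Packets sharing this 5-tuple are kept in order.
        return (pkt["src"], pkt["dst"], pkt["proto"],
                pkt["sport"], pkt["dport"])

    def resequence(lookup_output: list) -> list:
        # lookup_output arrives in arbitrary order; "seq" was stamped
        # at ingress. Order is restored within each flow only; the
        # interleaving between flows remains unspecified.
        per_flow = defaultdict(list)
        for pkt in lookup_output:
            per_flow[flow_key(pkt)].append(pkt)
        out = []
        for pkts in per_flow.values():
            out.extend(sorted(pkts, key=lambda p: p["seq"]))
        return out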

Saku Ytti wrote:

Fun fact about the real world: devices do not internally guarantee
order. That is, even if you have identical-latency links and zero
congestion, order is not guaranteed between packet1 arriving on
interface I1 and packet2 arriving on interface I2; which packet
goes out first on interface E1 is unspecified.

So, you lack fundamental knowledge of the E2E argument, which is
fully applicable to situations in the real-world Internet.

In the foundational paper on the E2E argument, published in 1984:

  End-To-End Arguments in System Design
  https://groups.csail.mit.edu/ana/Publications/PubPDFs/End-to-End%20Arguments%20in%20System%20Design.pdf

reordering is recognized as an issue in both the real and the
theoretical world:

  3.4 Guaranteeing FIFO Message Delivery
  Ensuring that messages arrive at the receiver in the same
  order in which they are sent is another function usually
  assigned to the communication subsystem.

which means, according to the paper, the network's "function" of
guaranteeing ordering cannot be complete or correct, and, unlike
you, I'm fully aware of it.

> This is because packets can be sprayed across multiple lookup
> engines, so order is lost even for packets coming exclusively from
> interface1. After the lookup, order is restored per _flow_; it is
> not restored between flows, so packets coming from interface1 with
> random ports won't leave interface2 in the same order.

That is a broken argument for how identification of flows by
intelligent intermediate entities could work; it goes against the
E2E argument and the reality that initiated this thread.

In the real world, according to the E2E argument, attempts to identify
flows by intelligent intermediate entities are just harmful from the
beginning, which is why flow-driven architectures, including that of
MPLS, are broken and hopeless.

I really hope you understand the meaning of "intelligent intermediate
entities" in the context of the E2E argument.

            Masataka Ohta

What's the difference between theory and practice? In theory, there is
no difference.

William Herrin wrote:

I recognize what happens in the real world, not in the lab or textbooks.

What's the difference between theory and practice?

With respect to the fact that there are so many wrong theories
and wrong practices, there is no difference.

In theory, there is no difference.

Especially because the real world includes labs and textbooks;
as such, all the theories, including all the wrong ones, exist
in the real world.

          Masataka Ohta

Mark Tinka <mark@tinka.africa> writes:

And just because I said per-flow load balancing has been the gold
standard for the last 25 years, does not mean it is the best
solution. It just means it is the gold standard.

TCP looks quite different in 2023 than it did in 1998. It should handle
packet reordering quite gracefully; in the best case the NIC will
reassemble the out-of-order TCP packets into a 64k packet and the OS
will never even know they were reordered. Unfortunately current
equipment does not seem to offer per-packet load balancing, so we cannot
test how well it works.

It is possible that per-packet load balancing will work a lot better
today than it did in 1998, especially if the equipment does buffering
before load balancing and the links happen to be fairly short and not
very diverse.

Switching back to per-packet would solve quite a lot of problems,
including elephant flows and bad hashing.

I would love to hear about recent studies.

/Benny

TCP looks quite different in 2023 than it did in 1998. It should handle
packet reordering quite gracefully; in the best case the NIC will

I think the opposite is true: TCP was designed to be order-agnostic.
But everyone uses CUBIC, and for CUBIC reordering is the same as
packet loss. This is a good trade-off. You need to decide whether you
want to recover fast from occasional packet loss or be tolerant of
reordering.

The moment the receiver sees the segment after the one it expects,
it ACKs the last in-order segment again, signalling packet loss and
causing an unnecessary resend and a window-size reduction.
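
(A minimal sketch of that receiver behaviour, using standard
cumulative-ACK logic rather than anything CUBIC-specific: a reordered
segment is indistinguishable from a lost one until it arrives, and
the resulting duplicate ACKs trigger a spurious fast retransmit.)

    def acks_for(arrivals: list) -> list:
        # Return the cumulative ACK emitted for each arriving
        # segment number, starting from expected segment 0.
        expected, buffered, acks = 0, set(), []
        for seg in arrivals:
            if seg == expected:
                expected += 1
                while expected in buffered:  # drain reorder buffer
                    buffered.discard(expected)
                    expected += 1
            else:
                buffered.add(seg)            # gap: duplicate ACK
            acks.append(expected)
        return acks

    # Segment 1 delayed behind 2, 3 and 4: three duplicate ACKs,
    # enough to fast-retransmit a segment that was never lost.
    print(acks_for([0, 2, 3, 4, 1]))  # [1, 1, 1, 1, 5]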

will never even know they were reordered. Unfortunately current
equipment does not seem to offer per-packet load balancing, so we cannot
test how well it works.

For example, Juniper offers true per-packet load balancing, I think
mostly used in high-performance computing.

For example, Juniper offers true per-packet load balancing, I think
mostly used in high-performance computing.

At least on MX, what Juniper calls ‘per-packet’ is really ‘per-flow’.

Cisco did it too with CEF supporting "ip load-sharing per-packet" at the interface level.

I am not sure this is still supported on modern code/boxes.

Mark.

Unless you specifically configure true "per-packet" on your LAG:

 set interfaces ae2 aggregated-ether-options load-balance per-packet

I ran per-packet on a Juniper LAG 10 years ago. It produced 100% perfect traffic distribution. But the reordering was insane, and the applications could not tolerate it.

If your applications can tolerate reordering, per-packet is fine. In the public Internet space, it seems we aren't there yet.

Mark.

TCP looks quite different in 2023 than it did in 1998. It should handle
packet reordering quite gracefully; in the best case the NIC will
reassemble the out-of-order TCP packets into a 64k packet and the OS
will never even know they were reordered. Unfortunately current
equipment does not seem to offer per-packet load balancing, so we cannot
test how well it works.

I ran per-packet load balancing on a Juniper LAG between 2015 and 2016. Let's just say I won't be doing that again.

It balanced beautifully, but OoO packets made customers' lives impossible. So we went back to adaptive load balancing.

It is possible that per-packet load balancing will work a lot better
today than it did in 1998, especially if the equipment does buffering
before load balancing and the links happen to be fairly short and not
very diverse.

Switching back to per-packet would solve quite a lot of problems,
including elephant flows and bad hashing.

I would love to hear about recent studies.

2016 is not 1998, and certainly not 2023... but I've not heard of Internet-based applications getting any better at handling OoO packets.

Open to new info.

100Gbps ports have given us some breathing room, as have larger buffers on Arista switches, which let us move bandwidth management down to the user-facing port rather than the upstream router. Clever Trio + Express chips have also enabled reasonably even traffic distribution with per-flow load balancing.

We shall revisit the per-flow vs. per-packet problem when 100Gbps becomes as rampant as 10Gbps did.

Mark.

Yes, this has been my understanding of, specifically, Juniper's forwarding complex.

Packets are chopped into near-same-size cells, sprayed across all available fabric links by the PFE logic, given a sequence number, and protocol engines ensure oversubscription is managed by a request-grant mechanism between PFEs.
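
(A highly simplified model of that mechanism; the cell size, tagging
and shuffle below are illustrative, not Juniper's actual design.
Cells are sprayed across links and reassembled by sequence number, so
arrival order across the fabric doesn't matter.)

    import random

    CELL = 64  # illustrative cell size, bytes

    def spray(packet: bytes, n_links: int) -> list:
        # Chop into (seq, link, payload) cells, round robin on links.
        cells = [(i, i % n_links, packet[i * CELL:(i + 1) * CELL])
                 for i in range((len(packet) + CELL - 1) // CELL)]
        random.shuffle(cells)  # fabric links deliver in any order
        return cells

    def reassemble(cells: list) -> bytes:
        return b"".join(c[2] for c in sorted(cells))  # sort by seq

    pkt = bytes(range(256)) * 2
    assert reassemble(spray(pkt, 4)) == pkt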

I'm not sure what mechanisms other vendors implement, but certainly OoO cells in the Juniper forwarding complex are not a concern within the same internal system itself.

Mark.