Lossy Cogent P2P experiences?

Sorry for the confusion - let me provide some background context since we deployed the PTX ages ago (and core nodes are typically boring).

The issue we ran into had to do with our deployment tooling, which was based on 'enhanced-hash-key', the knob required for MPCs on the MX.

The tooling used to deploy the PTX was largely built on what we use to deploy the MX, with tweaks for the critically different items. At the time, we did not know that the PTX required 'hash-key' as opposed to 'enhanced-hash-key'. So nothing got deployed on the PTX specifically for load balancing (we might have assumed the feature was non-existent or incomplete at the time).

So the "surprise" I speak of is how well it all worked with load balancing across LAG's and EoMPLS traffic compared to the CRS-X, despite not having any load balancing features explicitly configured, which is still the case today.

It works, so we aren't keen to break it.

Mark.

Mark Tinka wrote:

I wouldn't call 50 megabit/s an elephant flow

Fair point.

Both of you are totally wrong, because the proper thing to do
here is to police, if *at all*, based on total traffic without
detecting any flows.

100 50Mbps flows are as harmful as 1 5Gbps flow.

Moreover, as David Hubbard wrote:

> I’ve got a non-rate-limited 10gig circuit

there is no point in policing.

Detection of elephant flows was wrongly considered useful
in flow-driven architectures, to automatically bypass L3
processing for those flows, back when L3 processing
capability was wrongly considered limited.

Then the topology-driven architecture of MPLS appeared, even
though topology-driven is really flow-driven (you can't put
inner MPLS labels on packets without knowing detailed routing
information at the destinations, which is hidden at the source
through route aggregation and obtained on demand after
detecting flows).

            Masataka Ohta

I don't think it's as much an issue of flow detection as it is the core's ability to balance the Layer 2 payload across multiple links effectively.

At our shop, we understand the limitations of trying to carry large EoMPLS flows across an IP/MPLS network that is, primarily, built to carry IP traffic.

While some vendors have implemented adaptive load balancing algorithms on decent (if not custom) silicon that can balance EoMPLS flows as well as they can IP flows, it is hit-and-miss depending on the code, hardware, vendor, etc.
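
To illustrate what the hash has to work with, here is a minimal sketch in plain Python, with made-up labels and addresses rather than any particular vendor's implementation: if the hash only sees the outer label stack, every frame of a pseudowire carries the same key and lands on the same link, while a hash that can also see into the payload spreads the traffic out.

# Minimal sketch (hypothetical labels/addresses): why an EoMPLS pseudowire
# tends to land on a single core link when the hash only sees the outer
# label stack, and why payload-aware hashing spreads it out.
import zlib

CORE_LINKS = 8  # e.g. 8 equal-cost core links / LAG members

def pick_link(key_fields):
    # Hash a tuple of header fields onto one of the core links.
    key = "|".join(str(f) for f in key_fields).encode()
    return zlib.crc32(key) % CORE_LINKS

# One pseudowire: the transport and VC labels are identical for every frame.
label_stack = (100042, 299792)
inner_flows = [("10.0.0.1", f"10.0.1.{i}", 49152 + i, 443) for i in range(20)]

# Label-stack-only hashing: every frame maps to the same member link.
print({pick_link(label_stack) for _ in inner_flows})

# Hashing that also sees the inner IP/L4 fields: the same traffic spreads out.
print({pick_link(label_stack + flow) for flow in inner_flows})

Whether a given platform can actually look that deep into the payload is exactly the hit-and-miss part.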

In our case, our ability to load balance EoMPLS flows as well as we do IP flows has improved since we moved to the PTX1000/10001 for our core routers. But even then, we will not sell anything above 40Gbps as an EoMPLS service. Once it gets there, time for EoDWDM. At least, until 800Gbps or 1Tbps Ethernet ports become both technically viable and commercially feasible.

For as long as core links are based on 100Gbps and 400Gbps ports, optical carriage for 40Gbps and above is more sensible than EoMPLS.

Mark.

Mark Tinka wrote:

it is the core's ability to balance the Layer 2 payload across multiple links effectively.

Wrong. It can be performed only at the edges by policing total
incoming traffic without detecting flows.

While some vendors have implemented adaptive load balancing algorithms

There are no such algorithms because, as I wrote:

: 100 50Mbps flows are as harmful as 1 5Gbps flow.

            Masataka Ohta

Wrong. It can be performed only at the edges by policing total
incoming traffic without detecting flows.

I am not talking about policing in the core, I am talking about detection in the core.

Policing at the edge is pretty standard. You can police a 50Gbps EoMPLS flow coming in from a customer port at the edge. If you've got N x 10Gbps links in the core and the core is unable to look into that flow deeply enough to hash it across all those 10Gbps links, you can end up putting all, or a good chunk, of that 50Gbps of EoMPLS traffic onto a single 10Gbps link in the core, despite all the other 10Gbps links having ample capacity available.
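
As a rough back-of-the-envelope sketch of that (plain Python, numbers made up): per-flow hashing spreads many small flows reasonably well, but a single large flow can only ever occupy whichever member link its hash selects.

# Rough sketch (invented numbers): per-flow load distribution across
# 5 x 10Gbps core links. Many small flows spread out; one large flow can
# only ever occupy whichever member link its hash selects.
import random

LINKS = 5  # 5 x 10Gbps core links

def per_link_load(flows_gbps):
    load = [0.0] * LINKS
    for rate in flows_gbps:
        load[random.randrange(LINKS)] += rate  # stand-in for a header hash
    return load

random.seed(1)
# 100 x 0.5Gbps flows: the load spreads out and no link is anywhere near full.
print(per_link_load([0.5] * 100))
# 1 x 50Gbps EoMPLS flow: the whole thing lands on a single 10Gbps link and
# overflows it, while the other four links sit idle.
print(per_link_load([50.0]))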

There are no such algorithms because, as I wrote:

: 100 50Mbps flows are as harmful as 1 5Gbps flow.

Do you operate a large-scale IP/MPLS network? Because I do, and I know what I see with the equipment we deploy.

You are welcome to deny it all you want, however. Not much I can do about that.

Mark.

This is quite an unusual opinion. Maybe you could explain?

Nick

Mark Tinka wrote:

Wrong. It can be performed only at the edges by policing total
incoming traffic without detecting flows.

I am not talking about policing in the core, I am talking about detection in the core.

I'm not talking about detection at all.

Policing at the edge is pretty standard. You can police a 50Gbps EoMPLS flow coming in from a customer port in the edge. If you've got N x 10Gbps links in the core and the core is unable to detect that flow in depth to hash it across all those 10Gbps links, you can end up putting all or a good chunk of that 50Gbps of EoMPLS traffic into a single 10Gbps link in the core, despite all other 10Gbps links having ample capacity available.

Relying on hashing is a poor way to offer wide bandwidth.

If you have multiple parallel links over which many slow
TCP connections are running, which should be your assumption,
the proper thing to do is to use the links in round-robin
fashion without hashing. Without buffer bloat, the packet
reordering probability within each TCP connection is
negligible.

Faster TCP may suffer from packet reordering during slight
congestion, but the effect is like that of RED.

Anyway, in this case, the situation is:

:Moreover, as David Hubbard wrote:
:> I've got a non-rate-limited 10gig circuit

So, if you internally have 10 parallel 1G circuits expecting
perfect hashing over them, it is not "non-rate-limited 10gig".

            Masataka Ohta

If you have multiple parallel links over which many slow
TCP connections are running, which should be your assumption,
the proper thing to do is to use the links in round-robin
fashion without hashing. Without buffer bloat, the packet
reordering probability within each TCP connection is
negligible.

So you mean, what... per-packet load balancing, in lieu of per-flow load balancing?

So, if you internally have 10 parallel 1G circuits expecting
perfect hashing over them, it is not "non-rate-limited 10gig".

It is understood in the operator space that "rate limiting" generally refers to policing at the edge/access.

The core is always abstracted, and that is just capacity planning and management by the operator.

Mark.

Can you provide some real world data to back this position up?

What you said reminds me of the old saying: in theory, there's no difference between theory and practice, but in practice there is.

Nick

Mark Tinka wrote:

So you mean, what... per-packet load balancing, in lieu of per-flow load balancing?

Why do you think you can rely on the existence of flows?

So, if you internally have 10 parallel 1G circuits expecting
perfect hashing over them, it is not "non-rate-limited 10gig".

It is understood in the operator space that "rate limiting" generally refers to policing at the edge/access.

And nothing beyond, of course.

The core is always abstracted, and that is just capacity planning and management by the operator.

ECMP, surely, is too abstract a concept to properly manage/operate
simple situations with equal-speed multiple parallel point-to-point links.

            Masataka Ohta

Nick Hilliard wrote:

the proper thing to do is to use the links in round-robin
fashion without hashing. Without buffer bloat, the packet
reordering probability within each TCP connection is
negligible.

Can you provide some real world data to back this position up?

See, for example, the famous "Sizing Router Buffers" paper.

With the thousands of TCP connections in the backbone
recognized by the paper, buffers holding thousands of packets
won't cause packet reordering.

What you said reminds me of the old saying: in theory, there's no difference between theory and practice, but in practice there is.

In theory, you can always fabricate unrealistic counterexamples
against theories by ignoring the theories' essential assumptions.

In this case, "Without buffer bloat" is an essential assumption.

            Masataka Ohta

Why do you think you can rely on the existence of flows?

You have not quite answered my question - but I will assume you are in favour of per-packet load balancing.

I have deployed per-packet load balancing before, ironically, trying to deal with large EoMPLS flows in a LAG more than a decade ago. I won't be doing that again... OoO packets are nasty at scale.
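
For what it's worth, a quick sketch of why that happens (plain Python, latencies made up): round robin spreads a single flow's packets across the members, and any difference in per-link delay reorders them at the far end, which a per-flow hash avoids by pinning the flow to one member.

# Sketch (invented latencies): per-packet round robin over LAG members whose
# one-way delays differ slightly, as real bearer circuits do.
from itertools import cycle

link_delay_ms = [5.00, 5.23, 5.11, 5.37]  # four members, unequal delay
links = cycle(range(len(link_delay_ms)))

SEND_INTERVAL_MS = 0.05  # back-to-back packets of a single fast flow
arrivals = []
for seq in range(12):
    link = next(links)  # round robin, no hashing
    arrivals.append((seq * SEND_INTERVAL_MS + link_delay_ms[link], seq))

# Sequence numbers as they arrive at the far end: out of order.
print([seq for _, seq in sorted(arrivals)])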

And nothing beyond, of course.

No serious operator polices in the core.

ECMP, surely, is too abstract a concept to properly manage/operate
simple situations with equal-speed multiple parallel point-to-point links.

I must have been doing something wrong for the last 25 years.

Mark.

Hi David,

That sounds like normal TCP behavior over a long fat pipe. After
establishment, TCP sends a burst of 10 packets at wire speed. There's
a long delay and then they basically get acked all at once so it sends
another burst of 20 packets this time. This doubling burst repeats
itself until one of the bursts overwhelms the buffers of a mid-path
device, causing one or a bunch of them to be lost. That kicks it out
of "slow start" so that it stops trying to double the window size
every time. Depending on how aggressive your congestion control
algorithm is, it then slightly increases the window size until it
loses packets, and then falls back to a smaller size.
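
If you want to see the shape of that, here is a toy model (plain Python, buffer size made up, initial burst of 10 as above) of the doubling bursts running into a mid-path buffer:

# Toy model (invented buffer size): the sender doubles its burst every round
# trip until one burst overruns the free buffer at a mid-path device and
# packets get dropped, ending slow start.
MIDPATH_BUFFER_PKTS = 120  # free buffer available to this flow at the bottleneck
burst = 10                 # initial burst of 10 packets

rtt = 0
while burst <= MIDPATH_BUFFER_PKTS:
    rtt += 1
    print(f"RTT {rtt}: burst of {burst} packets absorbed, window doubles")
    burst *= 2

print(f"RTT {rtt + 1}: burst of {burst} packets overruns the buffer, "
      f"~{burst - MIDPATH_BUFFER_PKTS} dropped, slow start ends")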

It actually takes quite a while for the packets to spread out over the
whole round trip time. They like to stay bunched up in bursts. If
those bursts align with other users' traffic and overwhelm a midpoint
buffer again, well, there you go.

I have a hypothesis that TCP performance could be improved by
intentionally spreading out the early packets. Essentially, upon
receiving an ack to the first packet that contained data, start a rate
limiter that allows only one packet per 1/20th of the round trip time
to be sent for the next 20 packets. I left the job where I was looking
at that and haven't been back to it.

Regards,
Bill Herrin

I can see how this conclusion could potentially be reached in specific styles of lab configs, but the real world is more complicated and the assumptions you've made don't hold there, especially the implicit ones. Buffer bloat will make this problem worse, but small buffers won't eliminate the problem.

That isn't to say that packet / cell spray arrangements can't work. There are some situations where they can work reasonably well, given specific constraints, e.g. limited distance transmission path and path congruence with far-side reassembly (!), but these are the exception. Usually this only happens inside network devices rather than between devices, but occasionally you see products on the market which support this between devices with varying degrees of success.

Generally in real world situations on the internet, packet reordering will happen if you use round robin, and this will impact performance for higher speed flows. There are several reasons for this, but mostly they boil down to a lack of control over the exact profile of the packets that the devices are expected to transmit, and no guarantee that the individual bearer channels have identical transmission characteristics. Then multiply that across the N load-balanced hops that each flow will take between source and destination. It's true that per-hash load balancing is a nuisance, but it works better in practice on larger heterogeneous networks than RR.

Nick

Nick Hilliard wrote:

In this case, "Without buffer bloat" is an essential assumption.

I can see how this conclusion could potentially be reached in
specific styles of lab configs,

I'm not interested in how poorly you configure your
lab.

but the real world is more complicated and

And this thread was initiated because of unreasonable
behavior apparently caused by stupid attempts at
automatic flow detection followed by policing.

That is the real world.

Moreover, it has been well known both in theory and
practice that a flow-driven architecture relying on
automatic detection of flows does not scale and is
no good, though MPLS relies on that broken flow-driven
architecture.

> Generally in real world situations on the internet, packet reordering
> will happen if you use round robin, and this will impact performance
> for higher speed flows.

That is the point I already stated. You don't have to repeat
it.

> It's true that per-hash load
> balancing is a nuisance, but it works better in practice on larger
> heterogeneous networks than RR.

Here, you implicitly assume a large number of slower-speed flows,
contradicting your statement about "higher speed flows".

          Masataka Ohta

William Herrin wrote:

Hi David,

That sounds like normal TCP behavior over a long fat pipe.

No, not at all. First, though you explain slow start,
it has nothing to do with a long fat pipe. The long fat
pipe problem is addressed by window scaling (and SACK).

As David Hubbard wrote:

: I've got a non-rate-limited 10gig circuit

and

: The initial and recurring packet loss occurs on any flow of
: more than ~140 Mbit.

the problem is caused not by the wire-speed limitation of a "fat"
pipe but by artificial policing at ~140Mbps.

            Masataka Ohta

Mark Tinka wrote:

ECMP, surely, is too abstract a concept to properly manage/operate
simple situations with equal-speed multiple parallel point-to-point links.

I must have been doing something wrong for the last 25 years.

Are you saying you thought a 100G Ethernet link actually consisting
of 4 parallel 25G links, which is an example of "equal-speed multiple
parallel point-to-point links", was relying on hashing?

            Masataka Ohta

this is an excellent example of what we're not talking about in this thread.

A 100G serdes is an unbuffered mechanism which includes a PLL, and this allows the style of clock/signal synchronisation required for the deserialised 4x25G lanes to be reserialised at the far end. This is one of the mechanisms used for packet / cell / bit spray, and it works really well.

This thread is talking about buffered transmission links on routers / switches on systems which provide no clocking synchronisation and not even a guarantee that the bearer circuits have comparable latencies. ECMP / hash based load balancing is a crock, no doubt about it; it's just less crocked than other approaches where there are no guarantees about device and bearer circuit behaviour.

Nick

So, I've actually studied this in real-world conditions and TCP
behaves exactly as I described in my previous email for exactly the
reasons I explained. If you think it doesn't, you don't know what
you're talking about.

Window scaling and SACK make it possible for TCP to grow to consume
the entire end-to-end pipe when the pipe is at least as large as
the originating interface and -empty- of other traffic. Those
conditions are rarely found in the real world.

Regards,
Bill Herrin