Re: overly timid congestion control with amazon prime live video

I replied to Dan off list to investigate.

Any Prime Video quality issues reported to an ISP by customers or CDN issues can be sent directly to primevideo-isp-us@amazon.com .

Thanks,
Sean

Do you have an explanation for the question he asked? I am sure it would be of interest to many here.

Shane

A couple cake notes below...

> Do you have an explanation for the question he asked? I am sure it would be of interest to many here.
>
> Shane

> >
> > I replied to Dan off list to investigate.
> >
> > Any Prime Video quality issues reported to an ISP by customers or CDN issues can be sent directly to primevideo-isp-us@amazon.com .
> >
> > Thanks,
> > Sean
>
>>>
>>> While streaming football last night from AT&T fiber (AS7018), I
>>> noticed the video quality went way down when I did a large download on
>>> another system. I have gigabit fiber but I'm using Linux tc to
>>> throttle my network traffic. I've configured cake with a 200mbit
>>> limit, and I also use a low BQL setting to further ensure low latency
>>> for low-bandwidth traffic.

OK, there are several things about cake that seem to be conflated here.
Glad you are using it!

1) BQL is for outbound traffic only and provides backpressure at the
native rate of the interface. If you further apply cake's shaper to a
rate below that, BQL hardly enters the picture.
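(For anyone unfamiliar: BQL is tuned per tx queue via sysfs. A minimal
sketch of what a "low BQL setting" usually means, with the interface
and queue names here being assumptions:

    # cap the bytes BQL lets sit in the driver's tx-0 ring (names assumed)
    echo 3000 > /sys/class/net/eth0/queues/tx-0/byte_queue_limits/limit_max

With cake shaping to 200Mbit, the standing queue builds in cake
instead, so this knob rarely matters.)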

2) Applying a 200Mbit inbound limit to a gig fiber connection is
overkill. Worst case, it should be at 50% of the line rate; we
generally recommend 85%. People keep trying to go above 95%, and that
fails to control slow start. A fixed limit also does not account for
how congested the backhaul is, and we certainly see people trying
desperately to figure that out - see the cake-autorate project for a
5G example.

3) By default cake is in triple-isolate mode, which does per-host/per-flow
fq. This means that if you have two devices asking for the bandwidth,
one with 1 flow and the other with dozens, each device will get roughly
half the bandwidth. fq_codel, on the other hand, shares on a pure
per-flow basis. We put per-host fq in there because of torrents and, to
some extent, web traffic (which typically opens 15 flows at a time).

However, if cake is on a NATted router, the per-host mode fails unless
you apply the "nat" option on the invocation. Arguably we should have
made nat the default.
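For reference, a rough sketch of the sort of setup being discussed
(interface names and the ifb redirect are assumptions, not Dan's
actual config):

    # egress: shape below line rate; "nat" keeps per-host fairness
    # working behind masquerading
    tc qdisc replace dev eth0 root cake bandwidth 200mbit nat

    # ingress: redirect inbound traffic to an ifb device and shape there
    ip link add ifb0 type ifb
    ip link set ifb0 up
    tc qdisc add dev eth0 handle ffff: ingress
    tc filter add dev eth0 parent ffff: protocol all matchall \
        action mirred egress redirect dev ifb0
    tc qdisc replace dev ifb0 root cake bandwidth 200mbit nat ingress

The "ingress" keyword tells cake it is shaping on the wrong side of the
bottleneck, so packets it drops still count against the configured rate.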

If you have demand for less bandwidth than your fair share, you
experience near-zero latency and no loss for your flow. At 200Mbit,
assuming nat mode was on, your Amazon flow (probably running below
20mbit) should have seen no congestion or loss at all while another
machine merrily downloaded at 180, dropping the occasional packet to
keep it under control.
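If you want to verify which flows are actually taking the drops, cake's
stats are quite readable (device name is whatever you shaped on, as in
the sketch above):

    # per-tin drop/mark/backlog counters, plus peak and average delay
    tc -s qdisc show dev ifb0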

Anyway, moving on...

>>> IOW, my Linux router will drop packets across the board rather
>>> liberally in the face of large downloads, but I've always seen streams

Arguably cake drops fewer packets than any other AQM we know of. It
still does not drop early and fast enough to control slow start on
short RTTs. I keep trying to get people to deploy ECN.
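For the hosts you control, the usual knob is (a sketch, not a blanket
recommendation):

    # request ECN on outgoing TCP connections and accept it when offered
    sysctl -w net.ipv4.tcp_ecn=1

Cake then CE-marks ECN-capable flows by default rather than dropping them.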

>>> fight back for their share of the bandwidth -- except for amazon's.
>>>
>>> The live stream appears to use UDP on a non-standard port (not 443).
>>> Does anyone know what amazon has done to cause their congestion
>>> control algorithms to yield so much bandwidth and not fight for their
>>> fair share?

Anyway, this is a good question, if that was the observed behavior. A
packet capture showing it not recovering after a loss or delay would
be interesting.
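Something along these lines would do it (the interface and the player's
address are placeholders):

    # grab headers of the stream's UDP traffic for later analysis
    tcpdump -i eth0 -s 128 -w prime-live.pcap 'udp and host 192.168.1.50'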

> A couple cake notes below...

Hey Dave, thanks for replying and for all your hard work on AQM and latency.

> Do you have an explanation for the question he asked? I am sure it would be of interest to many here.

Sean noted that live video uses a custom protocol called Sye that
estimates available throughput. My understanding is that this estimate
can end up too low compared to what TCP achieves when the network is
slow or congested.

That is, in cases where TCP would buffer and then be able to move a
large amount of data at the expense of latency, Sye may instead push
much less data, sacrificing quality for latency.

IMO amazon streaming should fall back to TCP in these conditions,
perhaps noting to the customer their stream is no longer "live" but is
slightly delayed to improve video quality. Ideally a network that
could push on average 10mbit/s consistently over TCP could also push
10mbit/s consistently over a custom UDP protocol, but when this is not
the case (due to any number of bizarre real-world conditions), the
system should detect this and reset. Giving the customer a delayed but
high-quality nearly-live stream would (again, IMO) be a better
experience than a live but poor-quality video.

Of course I could probably have achieved this myself by simply
rewinding the live stream, but I was blinded to this option at the
time by my surprise and amazement at how poor the video quality was.

> 2) Applying a 200Mbit inbound limit to a gig fiber connection is
> overkill. Worst case, it should be at 50% of the line rate; we
> generally recommend 85%.

The reasoning for 200mbit is it's about 50% of best-case real-world
802.11 performance across a house. The goal is to keep buffers in the
APs as empty as possible. I'd rather enforce this on the APs, but
lacking that ability I do it for all traffic from the router to the
rest of the network.
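Concretely, that amounts to something like this on the LAN-facing
interface (the interface name is a placeholder, not my exact config):

    # shape everything heading toward the wifi side to ~200mbit so the
    # queue builds here, under cake, rather than in the APs
    tc qdisc replace dev eth1 root cake bandwidth 200mbit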

> If you have demand for less bandwidth than your fair share, you
> experience near-zero latency and no loss for your flow. At 200Mbit,
> assuming nat mode was on, your Amazon flow (probably running below
> 20mbit) should have seen no congestion or loss at all while another
> machine merrily downloaded at 180, dropping the occasional packet to
> keep it under control.

You're absolutely right. It's very possible the issue I experienced
was due to slowness at the wireless network level, and not Linux
traffic shaping.

> >>> The live stream appears to use UDP on a non-standard port (not 443).
> >>> Does anyone know what amazon has done to cause their congestion
> >>> control algorithms to yield so much bandwidth and not fight for their
> >>> fair share?

> Anyway, this is a good question, if that was the observed behavior. A
> packet capture showing it not recovering after a loss or delay would
> be interesting.

My guess is that since Sye prioritizes live data over throughput, it
will essentially by design deliver poor quality in situations where
bandwidth is limited and TCP streams are vying to use as much of it as
they can. This unfortunately describes a lot of home networks using
wifi in real-world conditions.

-- Dan

> > A couple cake notes below...

> Hey Dave, thanks for replying and for all your hard work on AQM and latency.

I'm just the loudest...

> > Do you have an explanation for the question he asked? I am sure it would be of interest to many here.

> Sean noted that live video uses a custom protocol called Sye that
> estimates available throughput. My understanding is that this estimate
> can end up too low compared to what TCP achieves when the network is
> slow or congested.
>
> That is, in cases where TCP would buffer and then be able to move a
> large amount of data at the expense of latency, Sye may instead push
> much less data, sacrificing quality for latency.
>
> IMO amazon streaming should fall back to TCP in these conditions,
> perhaps noting to the customer their stream is no longer "live" but is
> slightly delayed to improve video quality. Ideally a network that
> could push on average 10mbit/s consistently over TCP could also push
> 10mbit/s consistently over a custom UDP protocol, but when this is not
> the case (due to any number of bizarre real-world conditions), the
> system should detect this and reset. Giving the customer a delayed but
> high-quality nearly-live stream would (again, IMO) be a better
> experience than a live but poor-quality video.
>
> Of course I could probably have achieved this myself by simply
> rewinding the live stream, but I was blinded to this option at the
> time by my surprise and amazement at how poor the video quality was.

> > 2) Applying a 200Mbit inbound limit to a gig fiber connection is
> > overkill. Worst case, it should be at 50% of the line rate; we
> > generally recommend 85%.

> The reasoning for 200mbit is it's about 50% of best-case real-world
> 802.11 performance across a house. The goal is to keep buffers in the
> APs as empty as possible. I'd rather enforce this on the APs, but
> lacking that ability I do it for all traffic from the router to the
> rest of the network.

These days I am a huge fan of the mt79 wifi chipset, either the
GL.iNet MT6000 or the new OpenWrt One.

> > If you have demand for less bandwidth than your fair share, you
> > experience near-zero latency and no loss for your flow. At 200Mbit,
> > assuming nat mode was on, your Amazon flow (probably running below
> > 20mbit) should have seen no congestion or loss at all while another
> > machine merrily downloaded at 180, dropping the occasional packet to
> > keep it under control.

> You're absolutely right. It's very possible the issue I experienced
> was due to slowness at the wireless network level, and not Linux
> traffic shaping.

> > >>> The live stream appears to use UDP on a non-standard port (not 443).
> > >>> Does anyone know what amazon has done to cause their congestion
> > >>> control algorithms to yield so much bandwidth and not fight for their
> > >>> fair share?
>
> > Anyway, this is a good question, if that was the observed behavior. A
> > packet capture showing it not recovering after a loss or delay would
> > be interesting.

> My guess is that since Sye prioritizes live data over throughput, it
> will essentially by design deliver poor quality in situations where
> bandwidth is limited and TCP streams are vying to use as much of it as
> they can. This unfortunately describes a lot of home networks using
> wifi in real-world conditions.

It sounds like a protocol that could be improved.