CDN Overload?

Mike_Hammett · September 19, 2016, 5:34pm

I participate on a few other mailing lists focused on eyeball networks. For a couple years I've been hearing complaints that this CDN or that CDN was behaving badly. It's been severely ramping up the past few months. There have been some wild allegations, but I would like to develop a bit more standardized evidence collection. Initially LimeLight was the only culprit, but recently it has been Microsoft as well. I'm not sure if there have been any others.

The principal complaint is that upstream of whatever is doing the rate limiting for a given customer there is significantly more capacity being utilized than the customer has purchased. This could happen briefly as TCP adjusts to the capacity limitation, but in some situations this has persisted for days at a time. I'll list out a few situations as best as I can recall them. Some of these may even be merges of a couple situations. The point is to show the general issue and develop a better process for collecting what exactly is happening at the time and how to address it.

One situation had approximately 45 megabit/s of capacity being used up by a customer that had a 1.5 megabit/s plan. All other traffic normally held itself within the 1.5 megabit/s, but this particular CDN sent excessively more for extended periods of time.

A frequent occurrence has someone with a single-digit megabit/s limitation consuming 2x - 3x more than their plan on the other side of the rate limiter.

Last month on my own network I saw someone with 2x - 3x being consumed upstream and they had *190* connections downloading said data from Microsoft.

The past week or two I've been hearing of people only having a single connection downloading at more than their plan rate.

These situations effectively shut out all other Internet traffic to that customer or even portion of the network for low capacity NLOS areas. It's a DoS caused by downloads. What happened to the days of MS BITS and you didn't even notice the download happening? A lot of these guys think that the CDNs are just a pile of dicks looking to ruin everyone's day and I'm certain that there are at least a couple people at each CDN that aren't that way.

Lots of rambling, sure. What do I need to have these guys collect as evidence of a problem and who should they send it to?

Jared_Mauch · September 19, 2016, 8:19pm

I think the growing gap between those with high speed links and so-called slower links will be an ongoing issue. I’ve heard of tricks that the SP can do to avoid super large windows from occurring, but with the increased focus on tcp fast open and this data stuffing, I would expect the impact for these less studied low-speed links to get worse.

I’ve been helping a local WISP prepare their fiber OSP installation to try and mitigate some of the problems they have with local capacity and to work around the worsening NLOS situations that occur with annual tree growth.

- Jared

Jon_Lewis1 · September 20, 2016, 1:49am

It sounds like either the rate-limiting just isn't working, or the CDNs are trying too hard to ramp up the transfer rate in spite of your dropping some/most of the packets. I assume drops are happening either as part of the rate-limiting/policing, or simply as a result of trying to stuff 45mbit/s onto a 1.5mbit/s pipe...roughly 96.7% packet loss...and they're not slowing down at the sender?!?
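A quick sanity check on that loss figure, as a minimal sketch: it assumes a steady offered rate and a limiter that simply discards everything above the plan rate. The function name `drop_fraction` is made up for illustration.

```python
def drop_fraction(offered_mbps: float, limit_mbps: float) -> float:
    """Fraction of offered traffic a limiter must discard at steady state."""
    if offered_mbps <= 0:
        raise ValueError("offered rate must be positive")
    # Nothing needs to be dropped if the sender stays within the limit.
    return max(0.0, 1.0 - limit_mbps / offered_mbps)


# The 45 mbit/s onto 1.5 mbit/s case from this thread.
print(f"{drop_fraction(45, 1.5):.1%}")
```

For 45 mbit/s offered against a 1.5 mbit/s limit this comes out at roughly 97%, so only about one packet in thirty gets through.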

This is kind of a funny problem though, because CDNs get paid to deliver data, and they get compared/graded according to who can deliver the bits the fastest...and here you are complaining that they're delivering the bits too fast (or at least faster than you'd like them to).

Mike_Hammett · September 20, 2016, 3:05am

http://www.theregister.co.uk/2016/06/08/is_win_10_ignoring_sysadmins_qos_settings/

This explains the recent situations (well, not really an explanation, but a bit more information from other people). Not so much for the ones going back a year or two.

George_Skorup1 · September 20, 2016, 6:14am

I have witnessed this issue first hand for several years. Four for sure, maybe five or six. The very first one I remember is a customer doing Usenet downloads and using what he called an "internet download manager" which I assumed was screwing with TCP ACKs. I believe he was a 4Mbps user at the time and this download manager thing was causing 2 to maybe 2.5x his subscribed rate, as Mike says, on the upstream facing router interface. He shut down or uninstalled the software and it stopped. Yes, this customer is on PTMP fixed wireless. Traffic policing was taking place via MikroTik simple queue at the site router. I could cut his downstream rate in half and it would follow with double still hitting the backhaul. I could also move his queue all the way to the border router and it was still there at double rate.

BTW, we still have this guy as a customer on fixed wireless. He's been on 25/5Mbps for over a year. And we're about to upgrade him to 50/10Mbps with new gear. 25/5 and 50/10 is a far cry from this claimed "slow" WISP service. This shit ain't cheap to get to bumfsck Illinois so farmer Joe can watch porn and his kids can watch Netflix at the same time. Yup, we have slow NLOS service too, because customers decide they want the rural life buried in a mile of trees while "needing" the city benefits. If you want the gigabits, then move outta the sticks. Running a hundred combined miles of fiber to get to 20 customers that want to pay less than $50/mo is not feasible. /rant off

Another time, maybe three years ago, we had a customer on Canopy 5.7 FSK at 4/1Mbps using the built-in QoS. He was watching Netflix and I saw 8Mbps hitting the AP's ethernet interface. I thought the Canopy scheduler was broken. Until I looked deeper and saw that it was working exactly as designed, with a 50% discard rate on his VC. I want to say this was from LLNW at the time. I could be totally wrong about that, I really don't remember.

Now let's move to the Windows 10 updates. A 'buried in the sticks' customer on Canopy 900 FSK. 1.5Mbps/384k. Multiple streams from Microsoft and LLNW at the same time. LLNW alone had maybe 10 streams going and was sending at over 15Mbps on average and at worst about 25Mbps... to a 1.5Mbps subscriber. I could throw in a MikroTik queue upstream which only moved the problem as that 15-25Mbps was still hitting backhaul links. And when I have a 100Mbps link going into the site, 25Mbps is a lot.
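To put those numbers side by side, here is a rough sketch that turns an offered rate, a plan rate and a backhaul size into the three figures that matter for a report. The function `overload_summary` and its field names are hypothetical, and it treats the rates as steady averages.

```python
def overload_summary(offered_mbps: float, plan_mbps: float, backhaul_mbps: float) -> dict:
    """Summarize how far one subscriber's inbound traffic exceeds the plan."""
    return {
        # How many times the plan rate is arriving upstream of the limiter.
        "overload_ratio": offered_mbps / plan_mbps,
        # Traffic carried over the backhaul only to be discarded.
        "discarded_mbps": max(0.0, offered_mbps - plan_mbps),
        # Share of the backhaul link taken by this one subscriber.
        "backhaul_share": offered_mbps / backhaul_mbps,
    }


# The 15-25 Mbps offered to a 1.5 Mbps subscriber behind a 100 Mbps link.
for offered in (15.0, 25.0):
    s = overload_summary(offered, plan_mbps=1.5, backhaul_mbps=100.0)
    print(f"{offered:.0f} Mbps offered: {s['overload_ratio']:.1f}x plan, "
          f"{s['discarded_mbps']:.1f} Mbps discarded, "
          f"{s['backhaul_share']:.0%} of backhaul")
```

At the 25Mbps peak that is a quarter of the 100Mbps link spent on one 1.5Mbps subscriber, with 23.5Mbps of it thrown away at the limiter.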

We've had numerous customers call in for the last month or two with 'teh innernets is down, my phoen wyfy don't work either'. No, your Windows 10 updates are overloading your service. Shut off your PC to use your internet service. Telling a customer those exact words is ridiculous, but we have to do it.

We had a known issue with a particular licensed microwave vendor's radios that we have in use. It was the ethernet buffer becoming saturated at nowhere near the RF link capacity. They put out a new software release and that was resolved. And that was well before this Windows 10 update overload stuff started.

Normal TCP congestion control behavior works perfectly fine. It's not the network. It's the sender not doing normal TCP stuffs. I don't know why the CDNs and/or Microsoft thinks this is a good idea, but to me, it looks like a DDoS. I'm on some of the same lists as Mike and we know of many others reporting similar issues. A couple to the tune of 50-100Mbps overload destined for 5 or 10Mbps tier subscribers. So thanks to Mike for trying to get a conversation going on this topic. And it's not just us red headed step children WISPs.

Matthew_Walster · September 20, 2016, 7:44am

on Canopy 900 FSK. 1.5Mbps/384k. Multiple streams from Microsoft and LLNW
at the same time. LLNW alone had maybe 10 streams going and was sending at
over 15Mbps on average and at worst about 25Mbps... to a 1.5Mbps
subscriber. I could throw in a MikroTik queue upstream which only moved the
problem as that 15-25Mbps was still hitting backhaul links. And when I have
a 100Mbps link going into the site, 25Mbps is a lot.

Maybe I'm being naive but this sounds like an issue primarily with buffers.
Police rather than shape the traffic, and reduce the burst size, and a lot
of this should disappear...
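For anyone unsure what "police with a small burst" means in practice, here is a minimal sketch of a single-rate token bucket policer. It is not how MikroTik or Canopy implement their limiters; the class and function names are invented for illustration, and the traffic is a constant-rate stream that ignores any reaction from the sender.

```python
from dataclasses import dataclass


@dataclass
class TokenBucketPolicer:
    rate_bps: float       # committed rate in bits per second
    burst_bytes: int      # bucket depth in bytes (the "burst size")
    tokens: float = 0.0   # current bucket fill in bytes
    last_ts: float = 0.0  # timestamp of the previous packet

    def __post_init__(self) -> None:
        self.tokens = float(self.burst_bytes)  # start with a full bucket

    def allow(self, ts: float, size_bytes: int) -> bool:
        """Return True if the packet conforms; non-conforming packets are dropped."""
        elapsed = max(0.0, ts - self.last_ts)
        self.last_ts = ts
        # Refill at the committed rate, capped at the bucket depth.
        self.tokens = min(self.burst_bytes, self.tokens + elapsed * self.rate_bps / 8)
        if self.tokens >= size_bytes:
            self.tokens -= size_bytes
            return True
        return False


def simulate(offered_bps: float, rate_bps: float, burst_bytes: int,
             seconds: float = 10.0, pkt_bytes: int = 1500) -> float:
    """Feed evenly spaced packets through the policer; return forwarded bits/s."""
    policer = TokenBucketPolicer(rate_bps, burst_bytes)
    interval = pkt_bytes * 8 / offered_bps
    count = round(seconds * offered_bps / (pkt_bytes * 8))
    passed = sum(pkt_bytes for i in range(count) if policer.allow(i * interval, pkt_bytes))
    return passed * 8 / seconds


print(f"forwarded: {simulate(15e6, 1.5e6, burst_bytes=3000) / 1e6:.2f} Mbps")
```

Note what the sketch does and does not show: the policer holds the forwarded rate near the plan rate, but the excess has already crossed the upstream link by the time it is dropped. Whether the sender then backs off is exactly the question raised in this thread.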

M

Mike_Hammett · September 20, 2016, 11:45am

What do most broadband platforms do for rate limiting?

Mike_Hammett · September 20, 2016, 7:18pm

This is what I'm asking of them:

Florian_Weimer · September 20, 2016, 8:34pm

* Jon Lewis:

This is kind of a funny problem though, because CDNs get paid to
deliver data, and they get compared/graded according to who can
deliver the bits the fastest...and here you are complaining that
they're delivering the bits too fast (or at least faster than you'd
like them to).

Surely CDNs bill packets which are subsequently dropped by the
network.

Baldur_Norddahl · September 21, 2016, 9:02am

How come we have never seen this problem? We have a ton of DSL and many of those are slow, but no customer complaints about overloaded lines from CDN networks.

Could it be that the way you throttle the bandwidth is defective? It is easy to blame the other guy but could it be that you are doing it wrong?

Regards,

Baldur

Mike_Hammett · September 21, 2016, 11:45am

Likewise, why was it never an issue before and why does it only affect certain types of traffic from certain CDNs?

Josh_Reynolds1 · September 21, 2016, 12:36pm

With so many geographically diverse complaints on many hardware routing and
switching platforms, I'm going to go with a "no".

Spyros_Kakaroukas · September 21, 2016, 1:47pm

That's interesting.

Once, a few years/jobs/etc ago, I observed a flow from mobile youtube being really really bursty, peaking at 40-50mbps on a 10mbps circuit, but that was the only time I've ever seen such an issue. After that one flow died, it never happened again.

That aside, I do work for a business-focused mostly-wireless SP at the moment and I haven't had any issues with CDNs so far. The only similar incidents I can recall involved customers running programs like aspera and signiant which, when misconfigured, can result in quite some volume coming your way.

My thoughts and words are my own.

Spyros

Mike_Hammett · September 21, 2016, 2:08pm

https://goo.gl/forms/LvgFRsMdNdI8E9HF3

I have made this into a Google Form to make it easier to track compared to randomly formatted responses on multiple mailing lists, Facebook Groups, etc.

Baldur_Norddahl · September 21, 2016, 2:32pm

It appears all complaints are from SPs doing wireless. I am going to go with
a yes and put forth a thesis that these guys have a common factor somewhere.
It could be equipment from some popular vendor of wireless or maybe some
common method to throttle that is popular in the wireless community.

I note that while we have slow links we have no throttling or bandwidth
management going on except for the buffering that happens in the DSLAM.

Also there is no way to cheat. If you send 4 mbps to a 2 mbps DSL it will
drop half of the traffic and TCP will not survive that. The CDN would have
an effective transfer rate approaching zero for that customer. That seems
to be a rather bad business proposal seen from the view of the CDN so they
would not do that. The other customers will be unaffected as the DSLAM
itself has plenty of capacity.

Regards

Baldur

Mike_Hammett · September 21, 2016, 2:40pm

I've had DSL and AE service providers respond with the issues.

So far there is not a common element other than CDNs.

That's the point of the questions I'm asking, to gather a ton of information and then figure out how to act on it.

You're assuming that the CDNs are using an unmolested, vanilla TCP stack. That may not be the case, especially if doing something like Fast TCP.

Mike_Hammett · September 22, 2016, 12:30am

https://docs.google.com/spreadsheets/d/1Jdm0dOBf81kSnXEvVfI6ZJbWFNt5AbYUV8CDxGwLSm8/edit?usp=sharing

I have made the anonymized answers public. This will obviously have some bias to it given that I mostly know fixed wireless operators, but I'm hoping this gets some good distribution to catch more platforms.

martinhannigan · September 22, 2016, 1:19am

Mike,

I will forward to the requisite group for a look. Have you brought this to our attention previously? I don't see anything. If you did, please forward me the ticket numbers or message(s) (peering@ is best) so we can track down and see if someone already has it in queue.

Jared alluded to fasttcp a few emails ago. Astute man.

Best,

Martin Hannigan
AS 20940 // AS 32787

Mike_Hammett · September 22, 2016, 1:29am

Thanks Marty. I have only experienced this on my network once and it was directly with Microsoft, so I haven't done much until a couple days ago when I started this campaign. I don't know if anyone else has brought this to anyone's attention. I just sent an e-mail to Owen when I saw yours.

martinhannigan · September 22, 2016, 1:40am

No problem.

If you can drop a pcap file somewhere we can reach (and drop me an email where) that was created during the event that'd be great.
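Before sending a capture along, it may help to confirm that it really shows the overload. The following is a rough sketch, assuming a classic libpcap file with microsecond timestamps (not pcapng) taken upstream of the rate limiter; `per_second_mbps` is a hypothetical helper that uses only the Python standard library.

```python
import struct
from collections import defaultdict


def per_second_mbps(pcap_bytes: bytes) -> dict:
    """Return {unix_second: Mbps} using each packet's original on-wire length."""
    magic = pcap_bytes[:4]
    if magic == b"\xd4\xc3\xb2\xa1":
        endian = "<"  # file written on a little-endian host
    elif magic == b"\xa1\xb2\xc3\xd4":
        endian = ">"  # file written on a big-endian host
    else:
        raise ValueError("not a classic microsecond-resolution pcap file")

    record = struct.Struct(endian + "IIII")  # ts_sec, ts_usec, incl_len, orig_len
    bits = defaultdict(int)
    offset = 24  # skip the 24-byte global header
    while offset + record.size <= len(pcap_bytes):
        ts_sec, _ts_usec, incl_len, orig_len = record.unpack_from(pcap_bytes, offset)
        # orig_len counts the full packet even when the capture was truncated.
        bits[ts_sec] += orig_len * 8
        offset += record.size + incl_len
    return {sec: total / 1e6 for sec, total in sorted(bits.items())}


if __name__ == "__main__":
    import sys
    with open(sys.argv[1], "rb") as fh:
        for sec, mbps in per_second_mbps(fh.read()).items():
            print(sec, f"{mbps:.2f} Mbps")
```

Run against a capture filtered to one subscriber, the per-second rates can be compared directly with the plan rate. Counting `orig_len` rather than `incl_len` means a capture with a short snap length still gives the right rate.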

Thanks again, and great use of the list.

Best,

Martin Hannigan
AS 20940 // AS 32787