I would appreciate if someone from Comcast could contact me about this.
We’re having serious throughput issues with our AS20326 pushing packets to Comcast over v4. Our transfers either run at the full line speed of the Comcast customer’s modem, or they’re seemingly capped at 200-300 KB/s. The behavior appears to be almost stateful, as if the speed is decided when the connection starts: as long as a transfer starts fast it stays fast for its full length, and if it starts slow it stays slow. Traces look reasonable, and we’ve currently influenced the path onto GTT in both directions. If we prepend and reroute on our side, the exact same issue will happen on another transit provider.
This issue does not affect v6, which runs at full speed on every attempt. This may be regionalized to the Comcast Pittsburgh market.
This is most widely affecting our Linux mirror repository server: http://mirror.pit.teraswitch.com/
Our colocation customers who host VPN systems have also recently started noticing bottlenecks for their employees who connect over Comcast.
That's actually kinda normal TCP behavior. The two dominant
factors in TCP throughput are the round-trip time (RTT) and how
large the congestion window has grown prior to the first lost
packet. Other factors (including later mild packet loss) tend to
move the needle so slowly you might not notice it moving at all.
One of the interesting patterns with TCP is that the sender tends
to shove out all the packets it can in the first few percent of
the RTT and then sits idle. When the bandwidth is relatively high,
the receiver receives and acks them all in a short time window as
well. As a result you get these high-bandwidth spurts where packet
loss due to full buffers is likely, even though for most of the
RTT no packets are being transmitted at all. It can take several
minutes for the packets to spread out across the RTT, and by then
the congestion window (and hence the throughput) is firmly
established.
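To put rough numbers on that (this is just a back-of-the-envelope
sketch of mine, not a measurement of your path), the classic
Mathis approximation says steady-state throughput is roughly
MSS / (RTT * sqrt(p)) times a small constant. The MSS, RTT, and
loss rates below are made-up illustrative values:

from math import sqrt

# Mathis et al. approximation: throughput ~ (MSS / RTT) * (C / sqrt(p)),
# with C around 1.22 for Reno-style loss recovery. Purely illustrative.
def mathis_throughput_bps(mss_bytes, rtt_s, loss_rate, c=1.22):
    """Approximate steady-state TCP throughput in bytes per second."""
    return (mss_bytes / rtt_s) * (c / sqrt(loss_rate))

# Same hypothetical 30 ms path, different loss experienced while
# the congestion window was still growing:
for p in (0.0001, 0.01):
    print(f"loss {p:.4f}: ~{mathis_throughput_bps(1460, 0.030, p) / 1024:,.0f} KB/s")

With those made-up numbers the low-loss flow lands around
5,800 KB/s and the lossy one around 580 KB/s, which is in the same
ballpark as the "full speed or a few hundred KB/s" split described
above.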
We started getting the wave of complaints over the last two weeks or so. The initial few issues were reported perhaps up to a month ago, but were chalked up to being “an issue out on the internet.”
Have you tried running a test to see if there may be ECMP issues? I wrote a rudimentary script once ( https://pastebin.com/TTWEj12T ) that might help here. The script is written to detect packet loss on multiple ECMP paths, but you might be able to modify it for throughput.
The rationale behind my thinking is that if certain ECMP links are oversubscribed, the TCP sessions hashed onto those paths will stay "low" bandwidth, while sessions that win the ECMP lottery and pass through a non-congested ECMP path may show better performance.
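As a very rough illustration of that kind of test (this is not the pastebin script, just a sketch; the URL, transfer size, and attempt count below are placeholders), you can open a series of fresh TCP connections to the same file, so each one picks up a new ephemeral source port and possibly hashes onto a different ECMP member, and then compare per-connection throughput. A strongly bimodal result would fit the oversubscribed-member theory.

#!/usr/bin/env python3
# Sketch only: time READ_BYTES over many brand-new TCP connections.
# Each connection gets a fresh ephemeral source port, so it may hash
# onto a different ECMP/LAG member than the previous one.
import time
import urllib.request

TEST_URL = "http://mirror.pit.teraswitch.com/some-large-file"  # placeholder object
READ_BYTES = 20 * 1024 * 1024   # read 20 MB per attempt
ATTEMPTS = 20

def one_transfer():
    """Fetch READ_BYTES over a new connection; return throughput in KB/s."""
    start = time.monotonic()
    got = 0
    with urllib.request.urlopen(TEST_URL) as resp:
        while got < READ_BYTES:
            chunk = resp.read(64 * 1024)
            if not chunk:
                break
            got += len(chunk)
    return got / (time.monotonic() - start) / 1024

for i in range(ATTEMPTS):
    try:
        print(f"attempt {i:2d}: {one_transfer():8.1f} KB/s")
    except OSError as err:
        print(f"attempt {i:2d}: failed ({err})")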
And for a slightly more formal package to do this,
there’s UDPing, developed by the amazing networking
team at Yahoo; it was written to identify intermittent
issues affecting a single link in an ECMP or L2-hashed
aggregate link pathway.
It does have the disadvantage of being designed for
one-way measurement in each direction; that decision
was intentional, to ensure each direction was measuring
a completely known, deterministic pathway based on the
hash values in the packets, without the return trip potentially
obscuring or complicating identification of problematic links.
But if you have access to both the source and destination ends
of the connection, it’s a wonderful tool to narrow down exactly
where the underlying problem on a hashed ECMP/aggregate
link is.
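For anyone who can't deploy UDPing itself but does control both
ends, the same idea can be roughed out in a few lines of Python:
pin the full 5-tuple so every probe rides the same hashed member,
number the probes, and let the receiver tally gaps with no return
traffic involved. This is not UDPing, just the concept; the
address, ports, probe count, and interval are placeholders.

#!/usr/bin/env python3
# One-way UDP probe sketch (run "probe.py send" on one end, no
# argument on the other). Fixed source and destination ports keep
# every probe on the same hashed ECMP/LAG member; vary SRC_PORT to
# walk across hash buckets.
import socket
import struct
import sys
import time

DEST = ("192.0.2.10", 9000)   # placeholder receiver address/port
SRC_PORT = 40001              # change this to sample another hash bucket
COUNT = 1000
INTERVAL = 0.01               # 10 ms between probes

def sender():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("", SRC_PORT))                  # fixed source port => fixed hash
    for seq in range(COUNT):
        s.sendto(struct.pack("!Id", seq, time.time()), DEST)
        time.sleep(INTERVAL)

def receiver():
    s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    s.bind(("", DEST[1]))
    s.settimeout(5.0)                       # stop 5 s after the last probe
    seen = set()
    try:
        while True:
            data, _addr = s.recvfrom(64)
            seq, _sent_at = struct.unpack("!Id", data[:12])
            seen.add(seq)
    except socket.timeout:
        pass
    print(f"received {len(seen)}/{COUNT}, lost {COUNT - len(seen)}")

if __name__ == "__main__":
    sender() if sys.argv[1:] == ["send"] else receiver()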