v6 cdn problems

Pete_Carah · November 8, 2014, 10:55pm

To preface this: I'm on fios in the Baltimore area, using an HE tunnel
terminating in ashburn
(*still* no native v6 on fios). Speedtest shows little or no
congestion, and didn't change significantly when I reduced mtu by 8.
(interestingly, speedtest.net usually reads faster than verizon's
internal speedtest, and rarely averages less than my billed speed.)

I've recently had problems (started a few weeks ago with www.att.com,
4-5 days ago with *.google.com)
that happy eyeballs unfortunately doesn't help with.
att.com uses akamai, google uses their own cdn (per dns; I don't know if
there are any connections
behind the scenes.) This occurs on several google sites, all of which
resolve to the same netblocks from here (maps.google.com,
www.google.com, maps.gstatic.com, and at least one of the ad servers).

Symptom with akamai is that it connects immediately then data transfer
times out.
With google, symptom involves both slow connection, and data transfer
timing out. I don't know if chrome's happy eyeballs is working since if
it was, and absent address caching, I shouldn't see the slow connection.

v6 connections to my hosts in Los Angeles (not on HE address space, but
we do peer with them on
any2) work fine transferring graphics and large database files both
ways, so I'm pretty sure I don't have an mtu problem nor some other fios
or HE problem. Just to be sure, I dropped the 1500 to 1492 on the fios
link and did the same adjustment to the mtu on my tunnel (becomes
1472). No change on my hosts. att.com appears a little better, though
still very slow. Google shows no change at all.
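
The mtu arithmetic above can be checked with a short sketch. It assumes the HE tunnel is plain 6in4 (protocol 41), which costs one 20-byte IPv4 header per packet; the function name is made up for illustration.

```python
# Outer IPv4 header added by 6in4 encapsulation (assumed tunnel type).
IPV4_HEADER = 20


def tunnel_mtu(link_mtu: int, encap_overhead: int = IPV4_HEADER) -> int:
    """Largest IPv6 packet that still fits in one encapsulated packet."""
    return link_mtu - encap_overhead


for link in (1500, 1492):
    print(f"link mtu {link} -> tunnel mtu {tunnel_mtu(link)}")
# link mtu 1500 -> tunnel mtu 1480
# link mtu 1492 -> tunnel mtu 1472
```

This matches the 1472 quoted above for the reduced link mtu.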

I saw some of the same problem yesterday from Frederick on comcast (only
to google, didn't look at att), but couldn't take the time to do
traceroutes. If desired, I'm likely to go out there tomorrow and can do
that too. (we use a freebsd+pf router there).

Is this a provisioning problem where v6 eyeballs are outstripping cdn
provisioning (since win7 and 8 both default to v6)? Or is something
else going on? Since this seems to affect more than one cdn, it seems odd.

I run my own resolver locally instead of using verizon's, and my own
(vyatta) router instead of theirs. (Actually I'm still using theirs as a
bridge to talk to the set-top box; I don't know if Motorola still makes the
coax terminal that would bridge it directly...)

Resolve and traceroutes of relevant sites:

[pete@port5 ~]$ host www.att.com
www.att.com is an alias for prod-www.zr-att.com.akadns.net.
prod-www.zr-att.com.akadns.net is an alias for www.att.com.edgekey.net.
www.att.com.edgekey.net is an alias for e2318.dscb.akamaiedge.net.
e2318.dscb.akamaiedge.net has address 23.76.217.145
e2318.dscb.akamaiedge.net has IPv6 address 2600:807:320:202:9200::90e
e2318.dscb.akamaiedge.net has IPv6 address 2600:807:320:202:8600::90e

Traceroute (v4) to this shows it is in Newark or environs:
[pete@port5 ~]$ traceroute e2318.dscb.akamaiedge.net
traceroute to e2318.dscb.akamaiedge.net (23.76.217.145), 30 hops max, 60 byte packets
1 rtr.east.altadena.net (192.168.170.1) 2.008 ms 2.450 ms 3.091 ms
2 L300.BLTMMD-VFTTP-64.verizon-gni.net (108.3.150.1) 9.021 ms 9.054 ms 9.045 ms
3 G0-7-4-5.BLTMMD-LCR-21.verizon-gni.net (100.41.195.94) 10.670 ms 10.683 ms 10.677 ms
4 ae4-0.RES-BB-RTR2.verizon-gni.net (130.81.209.230) 9.002 ms ae20-0.RES-BB-RTR1.verizon-gni.net (130.81.151.112) 8.995 ms so-1-1-0-0.RES-BB-RTR1.verizon-gni.net (130.81.199.2) 8.953 ms
5 * * *
6 * * *
7 0.xe-5-0-4.XL3.EWR6.ALTER.NET (140.222.225.73) 51.202 ms 41.102 ms 40.345 ms
8 0.ae1.XL4.EWR6.ALTER.NET (140.222.228.41) 33.065 ms TenGigE0-6-0-3.GW8.EWR6.ALTER.NET (152.63.19.158) 11.550 ms TenGigE0-6-0-6.GW8.EWR6.ALTER.NET (152.63.25.10) 11.659 ms
9 TenGigE0-7-0-1.GW8.EWR6.ALTER.NET (152.63.19.166) 19.854 ms akamai-gw.customer.alter.net (152.179.185.126) 1766.402 ms TenGigE0-7-0-7.GW8.EWR6.ALTER.NET (152.63.25.30) 18.227 ms
10 akamai-gw.customer.alter.net (152.179.185.126) 1747.269 ms a23-76-217-145.deploy.static.akamaitechnologies.com (23.76.217.145) 10.672 ms 11.263 ms

Traceroute6 to it appears to be local (but it is hard to tell). Next-to-last hop looks like either a router or
load-balancer is overloaded. Same for the v4 traceroute...

[pete@port5 ~]$ traceroute6 www.att.com
traceroute to www.att.com (2600:807:320:202:9200::90e), 30 hops max, 80 byte packets
1 rtr.east.altadena.net (2001:470:e160:1::1) 5.182 ms 5.274 ms 5.254 ms
2 altadenamd-1.tunnel.tserv13.ash1.ipv6.he.net (2001:470:7:126::1) 11.452 ms 15.040 ms 18.592 ms
3 ge4-12.core1.ash1.he.net (2001:470:0:90::1) 20.273 ms 20.574 ms 20.567 ms
4 eqx.br6.iad8.verizonbusiness.com (2001:504:0:2::701:1) 20.522 ms 20.232 ms 20.475 ms
5 * * *
6 2600:802:44f::2 (2600:802:44f::2) 1283.113 ms 1296.115 ms 1296.181 ms
7 2600:807:320:200::1743:f397 (2600:807:320:200::1743:f397) 20.181 ms 16.169 ms 14.073 ms

[pete@port5 ~]$ host www.google.com
www.google.com has address 74.125.228.16
www.google.com has address 74.125.228.20
www.google.com has address 74.125.228.17
www.google.com has address 74.125.228.19
www.google.com has address 74.125.228.18
www.google.com has IPv6 address 2607:f8b0:4004:800::1012

Traceroute (v4) to this shows something odd, but I don't know where "burl" is for verizon. Also I appear to
hit two nodes for the terminal. At least one google node appears to be in ashburn (or environs):
[pete@port5 ~]$ traceroute www.google.com
traceroute to www.google.com (74.125.228.20), 30 hops max, 60 byte packets
1 rtr.east.altadena.net (192.168.170.1) 2.646 ms 2.816 ms 3.536 ms
2 L300.BLTMMD-VFTTP-64.verizon-gni.net (108.3.150.1) 4.109 ms 4.194 ms 4.186 ms
3 G0-7-4-4.BLTMMD-LCR-22.verizon-gni.net (130.81.170.84) 7.928 ms 8.096 ms 8.088 ms
4 ae20-0.PHIL-BB-RTR1.verizon-gni.net (130.81.151.120) 10.881 ms ae20-0.PHIL-BB-RTR2.verizon-gni.net (130.81.151.124) 11.074 ms so-6-1-0-0.PHIL-BB-RTR2.verizon-gni.net (130.81.199.4) 11.047 ms
5 0.xe-7-0-2.XL2.IAD8.ALTER.NET (152.63.4.93) 14.872 ms 0.xe-3-0-1.XL3.IAD8.ALTER.NET (152.63.3.61) 37.703 ms 12.268 ms
6 0.xe-9-2-0.GW9.IAD8.ALTER.NET (152.63.36.30) 12.866 ms 0.xe-11-2-1.GW9.IAD8.ALTER.NET (152.63.42.2) 14.442 ms 0.xe-9-2-0.GW9.IAD8.ALTER.NET (152.63.36.30) 11.918 ms
7 * 0.xe-10-1-1.GW9.IAD8.ALTER.NET (152.63.35.113) 16.901 ms pool-96-236-104-66.burl.east.verizon.net (96.236.104.66) 136.110 ms
8 pool-96-236-104-66.burl.east.verizon.net (96.236.104.66) 137.977 ms 216.239.46.248 (216.239.46.248) 13.875 ms pool-96-236-104-66.burl.east.verizon.net (96.236.104.66) 134.602 ms
9 216.239.46.248 (216.239.46.248) 15.918 ms 10.708 ms 10.162 ms
10 72.14.238.173 (72.14.238.173) 11.347 ms 12.111 ms iad23s05-in-f20.1e100.net (74.125.228.20) 12.769 ms

Corresponding traceroute6 (shows lack of reverse on most hits...):
[pete@port5 ~]$ traceroute6 www.google.com
traceroute to www.google.com (2607:f8b0:4004:800::1011), 30 hops max, 80 byte packets
1 rtr.east.altadena.net (2001:470:e160:1::1) 1.640 ms 1.811 ms 1.801 ms
2 altadenamd-1.tunnel.tserv13.ash1.ipv6.he.net (2001:470:7:126::1) 11.977 ms 15.279 ms 19.265 ms
3 ge4-12.core1.ash1.he.net (2001:470:0:90::1) 19.779 ms 20.776 ms 22.303 ms
4 2001:4860:1:1:0:1b1b:0:d (2001:4860:1:1:0:1b1b:0:d) 22.267 ms 22.514 ms 22.507 ms
5 2001:4860::1:0:9ff (2001:4860::1:0:9ff) 22.508 ms 22.471 ms 22.455 ms
6 2001:4860:0:1::551 (2001:4860:0:1::551) 22.467 ms 19.139 ms 19.116 ms
7 2607:f8b0:8000:18::c (2607:f8b0:8000:18::c) 19.054 ms 2607:f8b0:8000:18::f (2607:f8b0:8000:18::f) 7.716 ms 2607:f8b0:4004:800::1b (2607:f8b0:4004:800::1b) 8.379 ms

Again shows multiple terminals for the given address.

Ping works fine to all of the addresses, both v4 and v6, and the att one
always connects immediately. The google one doesn't always.

When I disable the HE tunnel, (and restart the browser - apparently
happy-eyeballs caches), everything works just fine, so the problem does
appear to relate to v6.

For reference, I mostly use chrome in linux. My daughter sees the same
problem with google, mostly using chrome in win 7. I see the problem
with firefox (in linux) also (to both sites).

-- Pete

Hugo_Slabbert · November 8, 2014, 11:00pm

Possibly https://puck.nether.net/pipermail/outages/2014-November/007421.html ?

Frank_Bulk1 · November 8, 2014, 11:02pm

The Google angle is also being discussed on outages. Initial suspicions are ICMPv6 Packet Too Big (PTB) packets not flowing through tunneled connections.

Frank

Jeroen_Massar · November 8, 2014, 11:10pm

[..]

Symptom with akamai is that it connects immediately then data transfer
times out.
With google, symptom involves both slow connection, and data transfer
timing out.

See amongst others:

and already reported in various places, eg ipv6-ops@ etc.

Akamai is working on it as they have noted in various places already,
(thanks to Marty etc).

Google does not seem to be home. They used to have a handy
ipv6@google.com address, but alas, that does not exist anymore.
And it does not look like their own employees actually use IPv6, otherwise
they would have noticed it themselves, or, you know, their monitoring
systems would be showing that lots of connections are hanging and never
actually properly finishing.

I don't know if chrome's happy eyeballs is working since if
it was, and absent address caching, I shouldn't see the slow connection.

Chrome's Happy Eyeballs does not kick in once the TCP session has been
established. (At least that is what it looks like on OSX). Hence, when the
session gets stuck it is waiting for the TCP timeout to happen before it
retries. It then does seem to remember that that connection is broken.
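
A toy model of why that is, not Chrome's actual code: the choice between families is made purely on which TCP handshake finishes first, so a path that completes the handshake but then stalls on large packets still wins. The function name and the 250 ms head start are assumptions for illustration (250 ms is the default suggested in RFC 8305, not a measured browser setting).

```python
from typing import Optional


def winning_family(v6_connect_ms: Optional[float],
                   v4_connect_ms: Optional[float],
                   head_start_ms: float = 250.0) -> Optional[str]:
    """Return the family whose handshake completes first.

    A connect time of None means the handshake never completes.
    IPv4 is only started head_start_ms after IPv6.
    """
    v6_done = v6_connect_ms
    v4_done = None if v4_connect_ms is None else head_start_ms + v4_connect_ms
    if v6_done is None and v4_done is None:
        return None
    if v6_done is None:
        return "IPv4"
    if v4_done is None:
        return "IPv6"
    return "IPv6" if v6_done <= v4_done else "IPv4"


# Handshake over the tunnel is quick, so IPv6 is chosen even if the
# data transfer afterwards hangs on an mtu black hole.
print(winning_family(20.0, 12.0))   # IPv6
# Only a failed or very slow handshake triggers the fallback.
print(winning_family(None, 12.0))   # IPv4
```

Nothing in this model looks at what happens after the handshake, which is the point: small SYN/ACK packets get through fine and the breakage only shows up with full-size data packets.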

[..]

Is this a provisioning problem where v6 eyeballs are outstripping cdn
provisioning (since win7 and 8 both default to v6)? Or is something
else going on? Since this seems to affect more than one cdn, it seems odd.

Coincidence it seems.

[..]

When I disable the HE tunnel, (and restart the browser - apparently
happy-eyeballs caches), everything works just fine, so the problem does
appear to relate to v6.

*TEMPORARY* null routing the relevant prefixes on your *CPE* resolves
the problems you are seeing as then your local router reports !N and
your browser falls back to IPv4, which then works again.
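
As a rough sketch of how one might build such routes: the snippet below derives a covering prefix for two of the addresses from the output earlier in this thread and prints Linux iproute2 commands for them. The /48 length is an assumption picked for illustration (check what the CDN actually announces), and the helper name is made up. Other routers, vyatta included, have their own syntax.

```python
import ipaddress


def covering_prefix(address: str, prefix_len: int) -> ipaddress.IPv6Network:
    """Return the network of the given length that contains the address."""
    return ipaddress.ip_network(f"{address}/{prefix_len}", strict=False)


# Addresses taken from the host output earlier in the thread.
observed = ["2607:f8b0:4004:800::1012", "2600:807:320:202:9200::90e"]

for addr in observed:
    net = covering_prefix(addr, 48)  # /48 is an assumed, illustrative length
    # An "unreachable" route makes the router answer with an ICMPv6
    # unreachable error instead of silently dropping the packet.
    print(f"ip -6 route add unreachable {net}")
# ip -6 route add unreachable 2607:f8b0:4004::/48
# ip -6 route add unreachable 2600:807:320::/48
```

A silent blackhole route would not do here, since the browser only falls back quickly when it gets the error back.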

Greets,
Jeroen

Pete_Carah · November 9, 2014, 1:13am

[..]

Symptom with akamai is that it connects immediately then data transfer
times out.
With google, symptom involves both slow connection, and data transfer
timing out.

See amongst others:

So, the current Akamai IPv6 problem
Forum - Problems with IPV6 on connect,facebook.net :: SixXS - IPv6 Deployment & Tunnel Broker
Forum - Many Google sub-sites unresponsive over IPv6 as of Nov 7 (fonts.gstatic.com, others) :: SixXS - IPv6 Deployment & Tunnel Broker

and already reported in various places, eg ipv6-ops@ etc.

Another list to subscribe...

*TEMPORARY* null routing the relevant prefixes on your *CPE* resolves
the problems you are seeing as then your local router reports !N and
your browser falls back to IPv4, which then works again.

Disgusting but necessary. At least I don't have to do this on verizon's
actiontec.

-- Pete

Pete_Carah · November 9, 2014, 1:17am

So, I can do this fine. How do we get my proverbial grandmother to do it?
(or even my daughter, who at least knows what a router is, but only that
it contains suitable magic).

-- Pete

Joel_Jaeggli · November 9, 2014, 9:53pm

The Google angle is also being discussed on outages. Initial suspicions are PTB packets not flowing through tunneled connections.

you can also have problems in the other direction e.g. if your tunnel
ingress sends a ptb towards a load balanced service it may not arrive.

if you're tunneled it does help a lot if your mss reflects the cost of
the tunnel you know exists.
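
for the record, the mss that goes with a given mtu is simple arithmetic. a minimal sketch, assuming a tcp header with no options and no ipv6 extension headers; the function name is made up for illustration.

```python
IPV6_HEADER = 40  # fixed IPv6 header, bytes
TCP_HEADER = 20   # TCP header without options, bytes


def tcp_mss_v6(path_mtu: int) -> int:
    """Largest TCP payload per segment for IPv6 at the given path MTU."""
    return path_mtu - IPV6_HEADER - TCP_HEADER


for mtu in (1500, 1480, 1472, 1280):
    print(f"mtu {mtu} -> mss {tcp_mss_v6(mtu)}")
# mtu 1500 -> mss 1440
# mtu 1480 -> mss 1420
# mtu 1472 -> mss 1412
# mtu 1280 -> mss 1220
```

a host that advertises the mss for its real tunnel mtu never asks the far end for segments that need a ptb in the first place.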

Christopher_Morrow · November 9, 2014, 10:00pm

to be clear, folk who care do know about the problem and are working
on a solution...

Jeroen_Massar · November 10, 2014, 5:51am

Google does not seem to be home.

Note that you skipped the rest:

"Google does not seem to be home. They used to have a handy
ipv6@google.com address, but alas, that does not exist anymore."

There used to be a handy ipv6@google address for reporting things. This
nowadays bounces.

to be clear, folk who care do know about the problem and are working
on a solution...

The problem Google was having was already resolved according to Damian
as noted on the outages list. Seems those archives don't update at the
moment, hence:

http://permalink.gmane.org/gmane.org.operators.ipv6/10232

Greets,
Jeroen

Christopher_Morrow · November 10, 2014, 8:10am

yes, it changed to noc@ I think. and yup, damian (and a few other
folk) beat the mtu issue with a cold trout.

Jeroen_Massar · November 10, 2014, 8:17am

There used to be a handy ipv6@google address for reporting things. This
nowadays bounces.

yes, it changed to noc@ I think.

Thus, in case of an IPv6 issue, contacting noc@google.com is the right
thing to do? Good to hear that the folks there are aware of IPv6.

and yup, damian (and a few other folk) beat the mtu issue with a cold trout.

Thanks for that.

From a message by Lorenzo:

Some very nice broken IPv6 networks at Google and Akamai

it seems Google is breaking PMTUD on purpose preferring to force the
MSS to a minimum value instead.

But the problem there is not PMTUD, but what is described in:

Which makes sense on a Google-scale of connections. I am not sure that
breaking PMTUD and forcing MSS is the correct answer though. Forcing MSS
is likely a good intermediary step, actually fixing the load-balancer is
a better one though.

I am now wondering if that is what is hitting Akamai too, as that would
explain the problem being seen: contacting the same IP sometimes works
and sometimes does not; which could be a result of the real endnode not
always seeing the correct ICMP and thus knowing the correct MTU.
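
A small simulation of that failure mode, under the assumption that the load-balancer picks a backend by hashing source and destination address: the TCP flow comes from the client, but the PTB comes from some router along the path, so it hashes independently. All addresses are from the documentation prefix and the backend count is hypothetical.

```python
import hashlib

BACKENDS = 4  # hypothetical number of servers behind one service address


def pick_backend(src: str, dst: str, n: int = BACKENDS) -> int:
    """Stateless backend choice from a hash of source and destination."""
    digest = hashlib.sha256(f"{src}>{dst}".encode()).digest()
    return int.from_bytes(digest[:8], "big") % n


service = "2001:db8:ffff::1"
client = "2001:db8:1::10"

# The backend that holds the client's TCP session.
tcp_backend = pick_backend(client, service)

# PTB messages are sourced by routers on the path, not by the client.
routers = [f"2001:db8:2::{i:x}" for i in range(1, 1001)]
delivered = sum(pick_backend(r, service) == tcp_backend for r in routers)

print(f"{delivered} of {len(routers)} simulated PTB senders reach "
      f"the backend holding the TCP session")
```

With this kind of hashing only about one in BACKENDS of the PTBs ends up on the right node, which would fit the "sometimes works, sometimes does not" behaviour for a single IP.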

Greets,
Jeroen