I'm an engineer working on QUIC at Google.
Sorry for the delayed response. This was an issue with QUIC traffic (from
Chrome users) to Google for some users behind NATs. The issue started at
small volumes on 2015-09-09 when we started rolling out an optimization to
our frontends for QUIC users behind NATs, and then increased in magnitude
on 2015-09-22. We rolled back the feature by 2015-09-23 12:40 PDT, which
did clear all issues.
Here are the technical details:
As we have posted about QUIC before, and as was pointed out earlier in this
thread, QUIC runs over UDP. For users behind a NAT, QUIC relies upon NATs
maintaining port bindings when the UDP path is in active use. To ensure
that the NAT does not time out with outstanding requests, QUIC sends a
keepalive ping after 15 seconds. To optimize for the case when a NAT times
out, we implemented a feature in our frontend to allow the port to migrate
on open QUIC connections -- the frontend would send a GOAWAY to the client
to indicate the connection should be re-established, but outstanding
requests could be completed. This triggered a previously unknown bug in
Chrome, in which it tries to send new requests on an open QUIC connection
that has already received a GOAWAY. Hence the failures.
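
To illustrate the intended client behavior, here is a minimal sketch in Python of a connection pool that honors GOAWAY. This is not Chrome's code: the names Session, SessionPool, on_goaway, and send_request are hypothetical, and the sketch assumes only what is described above, namely that requests already in flight may complete on a connection that received a GOAWAY, while new requests must go to a different connection.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical stand-in for one open QUIC connection."""
    name: str
    going_away: bool = False  # set once a GOAWAY has been received
    active_requests: list = field(default_factory=list)

    def on_goaway(self):
        # In-flight requests may still finish; only new requests are refused.
        self.going_away = True

    def can_accept_new_request(self):
        return not self.going_away


class SessionPool:
    """Hypothetical pool that picks a session for each new request."""

    def __init__(self):
        self.sessions = []
        self._counter = 0

    def _open_session(self):
        self._counter += 1
        session = Session(name=f"session-{self._counter}")
        self.sessions.append(session)
        return session

    def send_request(self, request):
        # Reuse only a session that has not received a GOAWAY.
        # The bug described above corresponds to skipping this check.
        for session in self.sessions:
            if session.can_accept_new_request():
                break
        else:
            session = self._open_session()
        session.active_requests.append(request)
        return session


pool = SessionPool()
first = pool.send_request("hanging GET")
first.on_goaway()                         # frontend asks client to reconnect
second = pool.send_request("new GET")     # must not land on `first`
```

In this sketch the hanging GET stays on the first session, and the new request causes a second session to be opened.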
As this happens only following a NAT rebinding, these failures were
isolated to networks with a lot of NAT rebindings on active connections.
This bug was particularly visible on connections to Gmail and Drive because
they use hanging GETs, which last multiple minutes. All new requests
using timed-out connections would have immediately failed as well.
After rolling back the feature, we have started working on fixing the
underlying issue and improving how we detect and avoid issues like this in
the future.
-- Ian Swett