Google's peering, GGC, and congestion management

Hi,

In its peering documentation [Google Edge Network],
Google claims that it can drive peering links at 100% utilisation:

Congestion management

Peering ports with Google can be run at 100% capacity in the short term,
with low (<1-2%) packet loss. Please note that an ICMP ping may display
packet loss due to ICMP rate limiting on our platforms. Please contact
us to arrange a peering upgrade.

How do they achieve this?

More generally, is there any published work on how Google serves content
from its CDN, the Google Global Cache? I'm especially interested in two
aspects:

- for a given eyeball network, on which basis are the CDN nodes selected?

- is Google able to spread traffic over distinct peering links for the
  same eyeball network, in case some of the peering links become
  congested? If so, how do they measure congestion?

Thanks for your input,
Baptiste

In its peering documentation [Google Edge Network],
Google claims that it can drive peering links at 100% utilisation:

Congestion management

Peering ports with Google can be run at 100% capacity in the short term,
with low (<1-2%) packet loss. Please note that an ICMP ping may display
packet loss due to ICMP rate limiting on our platforms. Please contact
us to arrange a peering upgrade.

How do they achieve this?

The 100% number is silly. My guess? They’re at 98%.

That is easily doable because all the traffic is coming from them. Coordinate the httpd on each of the servers to serve traffic at X bytes per second, ensure you have enough buffer in the switches for micro-bursts, check the NICs for silliness such as jitter, and so on. It is non-trivial, but definitely solvable.
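
A minimal sketch of that coordination idea (Python, with made-up port
capacity, headroom, and server count - none of these figures come from
Google):

    # Hypothetical figures throughout: a 100G peering port, 2% headroom,
    # 48 servers. The cap would be pushed to the HTTP daemon / traffic
    # shaper on each box; here we just print the assignment.
    PORT_CAPACITY_BPS = 100e9
    HEADROOM = 0.98
    servers = ["srv%02d" % i for i in range(48)]

    per_server_cap_bps = PORT_CAPACITY_BPS * HEADROOM / len(servers)
    for srv in servers:
        print("%s: cap egress at %.0f Mbit/s" % (srv, per_server_cap_bps / 1e6))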

Google is not the only company that can do this. Akamai has done it far longer. And Akamai has a much more difficult traffic mix, with -paying customers- to deal with.

More generally, is there any published work on how Google serves content
from its CDN, the Google Global Cache? I'm especially interested in two
aspects:

- for a given eyeball network, on which basis are the CDN nodes selected?

As for picking which GGC node serves each eyeball, that is called “mapping”. It varies among the different CDNs. Netflix drives it mostly from the client. That has some -major- advantages over other CDNs. Google has in the past (I haven’t checked in over a year) done it by giving each user a different URL, although I think they use DNS now. Akamai uses mostly DNS, although they have at least experimented with other ways. Etc., etc.
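
For illustration only, here is a toy sketch of the DNS-based flavour of
mapping - not Google's or Akamai's actual mechanism. The resolver
prefixes and node addresses are invented documentation ranges:

    # A toy model of DNS-based mapping: the authoritative server answers
    # with a cache node chosen from a table keyed on the querying
    # resolver's prefix.
    import ipaddress

    RESOLVER_TO_NODE = {
        ipaddress.ip_network("192.0.2.0/24"):   "198.51.100.10",  # GGC in ISP A
        ipaddress.ip_network("203.0.113.0/24"): "198.51.100.20",  # GGC in ISP B
    }
    DEFAULT_NODE = "198.51.100.1"  # fall back to a central cluster

    def pick_node(resolver_ip):
        addr = ipaddress.ip_address(resolver_ip)
        for prefix, node in RESOLVER_TO_NODE.items():
            if addr in prefix:
                return node
        return DEFAULT_NODE

    print(pick_node("192.0.2.53"))   # -> 198.51.100.10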

- is Google able to spread traffic over distinct peering links for the
same eyeball network, in case some of the peering links become
congested? If so, how do they measure congestion?

Yes. Easily.

User 1 asks for Stream 1, Google sends them to Node 1. Google notices Link 1 is near full. User 2 asks for Stream 2, Google sends them to Node 2, which uses Link 2.

This is possible for any set of Users, Streams, Nodes, and Links.

It is even possible to send User 2 to Node 2 when User 2 wants Stream 1. Or to send User 1 to Node 2 for their second request despite the fact that they just got a stream from Node 1. There are few, if any, restrictions on the combinations.

Remember, they control the servers. All CDNs (that matter) can do this. They can re-direct users with different URLs, different DNS responses, 302s, etc., etc. It is not BGP.
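
A hedged sketch of that redirect idea, assuming an HTTP-302-based front
end. The node names, link capacities, and load figures are invented:

    # The point is only that the party issuing the redirect can consult
    # link utilisation before choosing where to send the next request.
    NODES = {
        "node1": {"capacity_bps": 10e9, "load_bps": 9.6e9},  # Link 1, near full
        "node2": {"capacity_bps": 10e9, "load_bps": 4.0e9},  # Link 2, plenty left
    }

    def pick_node(threshold=0.9):
        # Least-utilised node first; prefer any node under the threshold.
        ranked = sorted(NODES, key=lambda n: NODES[n]["load_bps"] / NODES[n]["capacity_bps"])
        for name in ranked:
            if NODES[name]["load_bps"] / NODES[name]["capacity_bps"] < threshold:
                return name
        return ranked[0]

    def redirect(stream_id):
        # A real front end would return this as an HTTP 302 Location header.
        return "302 Location: https://%s.cdn.example/%s" % (pick_node(), stream_id)

    print(redirect("stream1"))   # served from node2 while Link 1 is hot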

Everything is much easier when you are one of the end points. (Or both, like with Netflix.) When you are just an ISP shuffling packets you neither send nor receive, things are both simpler and harder.

You would not need to control the servers to do this. All you need is the
usual hash function of src+dst ip+port to map sessions into buckets and
then dynamically compute how big a fraction of the buckets to route through
a different path.
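
Something like the following, assuming just two egress paths (the
bucket count and the alternate-path share are arbitrary):

    # Hash the flow key into buckets, then steer a fraction of the
    # buckets onto the alternate path. In a router this would be done in
    # hardware per packet.
    import zlib

    NUM_BUCKETS = 256
    alt_share = 0.25   # fraction of buckets steered to the alternate path

    def bucket(src_ip, dst_ip, src_port, dst_port):
        key = ("%s|%s|%d|%d" % (src_ip, dst_ip, src_port, dst_port)).encode()
        return zlib.crc32(key) % NUM_BUCKETS

    def egress_path(src_ip, dst_ip, src_port, dst_port):
        # Buckets below the cut-off take the alternate path, the rest the
        # primary; a flow stays on one path as long as alt_share is stable.
        if bucket(src_ip, dst_ip, src_port, dst_port) < NUM_BUCKETS * alt_share:
            return "path-2"
        return "path-1"

    print(egress_path("192.0.2.10", "198.51.100.20", 40000, 443))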

A bit surprising that this is not a standard feature on routers.

Regards,

Baldur

The reason routers do not do that is that what you suggest would not work.

First, you make the incorrect assumption that inbound will never exceed outbound. Almost all CDN nodes have far more capacity between the servers and the router than the router has to the rest of the world. And CDN nodes are probably the least complicated example in large networks. The only way to ensure A < B is to control A or B - and usually A.

Second, the router has no idea how much traffic is coming in at any particular moment. Unless you are willing to move streams mid-flow, you can’t guarantee this will work even if sum(in) < sum(out). Your idea would put Flow N on Port X when the SYN (or SYN/ACK) hits. How do you know how many Mbps that flow will be? You do not, therefore you cannot do it right. And do not say you’ll wait for the first few packets and then move it. Flows are not static.

Third…. Actually, since 1 & 2 are each sufficient to show why it doesn’t work, not sure I need to go through the next N reasons. But there are plenty more.

The reason routers do not do that is that what you suggest would not work.

Of course it will work, and it is in fact exactly the same as your own
suggestion, just implemented in the network. Besides, it _is already_ a
standard feature; it is called equal-cost multipath routing. The only
difference is dynamically changing the weights among the paths.

First, you make the incorrect assumption that inbound will never exceed
outbound. Almost all CDN nodes have far more capacity between the servers
and the router than the router has to the rest of the world. And CDN nodes
are probably the least complicated example in large networks. The only way
to ensure A < B is to control A or B - and usually A.

I make absolutely no assumptions about ingress (towards the ASN), as we
have no control over that. There is no requirement that routing be
symmetric, and it is the responsibility of whoever controls the ingress
to do something if the port is overloaded in that direction. In the case
of a CDN, however, ingress traffic will be minimal. Netflix does not take
much data in from its customers; it is all egress traffic towards the
customers, and the CDN is in control of that. The same goes for Google.

Two non-CDN peers could use the system, but if the traffic level is
symmetric then they had better both do it.

Second, the router has no idea how much traffic is coming in at any
particular moment. Unless you are willing to move streams mid-flow, you
can’t guarantee this will work even if sum(in) < sum(out). Your idea would
put Flow N on Port X when the SYN (or SYN/ACK) hits. How do you know how
many Mbps that flow will be? You do not, therefore you cannot do it right.
And do not say you’ll wait for the first few packets and then move it. Flows
are not static.

Flows can move at any time in a BGP network. As we are talking about
CDNs, we can assume that we have many, many small flows (compared to
port size). We can be fairly sure that traffic will not make huge jumps
from one second to the next - you will have a nice curve here. You know
exactly how much traffic you had in the last time period, both out
through the congested port and through the alternative paths.
Recalculating the weights is just a matter of assuming that the next
time period will be the same, or that the delta will be the same. It is
a classic control-loop problem. TCP is trying to do much the same, by
the way.

You can adjust how close to 100% you want the algorithm to hit. If it
performs badly, give it a little bit more space.

If the time period is one second, flows can move at most once a second,
and very few flows would be likely to move. You could get a few
out-of-order packets on your flow, which is not such a big issue in a
rare event.
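
A rough sketch of that control loop, assuming the hash buckets described
above and a one-second interval. The capacity, target utilisation, and
gain are all made-up numbers:

    # The loop would run once per measurement interval on the router,
    # nudging the share of buckets that take the alternate path.
    PORT_CAPACITY_BPS = 10e9
    TARGET_UTILISATION = 0.95   # how close to 100% we dare to run the port
    GAIN = 0.5                  # how aggressively to correct per interval

    def update_alt_share(alt_share, primary_load_bps):
        error = primary_load_bps / PORT_CAPACITY_BPS - TARGET_UTILISATION
        # Port too hot: move more buckets away. Port has headroom: pull
        # some buckets back onto the primary path.
        return min(1.0, max(0.0, alt_share + GAIN * error))

    alt_share = 0.0
    for load in [8.0e9, 9.4e9, 9.8e9, 9.9e9, 9.2e9]:   # fake per-second samples
        alt_share = update_alt_share(alt_share, load)
        print("alternate-path share: %.3f" % alt_share)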

Third…. Actually, since 1 & 2 are each sufficient to show why it doesn’t
work, not sure I need to go through the next N reasons. But there are
plenty more.

There are more reasons why this problem is hard to do on the servers :-).

Regards,

Baldur

The reason routers do not do that is that what you suggest would not work.

Of course it will work, and it is in fact exactly the same as your own
suggestion, just implemented in the network. Besides, it _is already_ a
standard feature; it is called equal-cost multipath routing. The only
difference is dynamically changing the weights among the paths.

You are confused. But I think I see the source of your confusion.

Perhaps you are only considering a single port on a multi-port router with many paths to the same destination. Sure, if you want to say when Port X gets full (FSVO “full”), move some flows to the second best path. Yes, that is physically possible.

However, that is a tiny fraction of CDN Mapping. Plus you are making a vast number of assumptions - not the least of which is that there _is_ another port to move traffic to. How many CDN nodes have you seen? Do you think most of them have a ton of ports to a slew of different networks? Or do they plonk a bunch of servers behind a single router (or switch!) connected to a single network (since most of them are _inside_ that network)?

My original point is that the CDN can control how much traffic is sent to each destination. Routers cannot do this.

BTW: What you suggest breaks a lot of other things - which may or may not be a good trade-off for avoiding congesting individual ports. But the idea of making identical IP path decisions inside a single router non-deterministic is... let’s call it questionable.

First, you make the incorrect assumption that inbound will never exceed
outbound. Almost all CDN nodes have far more capacity between the servers
and the router than the router has to the rest of the world. And CDN nodes
are probably the least complicated example in large networks. The only way
to ensure A < B is to control A or B - and usually A.

I make absolutely no assumptions about ingress (towards the ASN), as we
have no control over that. There is no requirement that routing be
symmetric, and it is the responsibility of whoever controls the ingress
to do something if the port is overloaded in that direction. In the case
of a CDN, however, ingress traffic will be minimal. Netflix does not take
much data in from its customers; it is all egress traffic towards the
customers, and the CDN is in control of that. The same goes for Google.

Two non-CDN peers could use the system, but if the traffic level is
symmetric then they had better both do it.

You are still confused.

I have 48 servers connected @ GigE to a router with 4 x 10G outbound. When all 48 get nailed, where in the hell does the extra 8 Gbps go?

While if I own the CDN, I can easily ensure those 48 servers never push more than 40 Gbps. Or even 20 Gbps to any single destination. Or even 10 Mbps to any single destination.

The CDN can ensure the router is -never- congested. The router itself cannot do that.
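
Spelling out the arithmetic, plus the kind of cap a CDN operator could
impose (the 48 x 1G and 4 x 10G figures come from the example above; the
flat per-server cap is just one possible policy):

    servers, nic_gbps = 48, 1.0
    uplinks, uplink_gbps = 4, 10.0

    max_in = servers * nic_gbps      # 48 Gbps can arrive at the router
    max_out = uplinks * uplink_gbps  # only 40 Gbps can leave it
    print("excess if uncapped: %.0f Gbps" % (max_in - max_out))   # 8 Gbps

    per_server_cap = max_out / servers   # ~833 Mbps each keeps the router safe
    print("per-server cap: %.0f Mbps" % (per_server_cap * 1000))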

Second, the router has no idea how much traffic is coming in at any
particular moment. Unless you are willing to move streams mid-flow, you
can’t guarantee this will work even if sum(in) < sum(out). Your idea would
put Flow N on Port X when the SYN (or SYN/ACK) hits. How do you know how
many Mbps that flow will be? You do not, therefore you cannot do it right.
And do not say you’ll wait for the first few packets and then move it. Flows
are not static.

Flows can move at any time in a BGP network. As we are talking about
CDNs, we can assume that we have many, many small flows (compared to
port size). We can be fairly sure that traffic will not make huge jumps
from one second to the next - you will have a nice curve here. You know
exactly how much traffic you had in the last time period, both out
through the congested port and through the alternative paths.
Recalculating the weights is just a matter of assuming that the next
time period will be the same, or that the delta will be the same. It is
a classic control-loop problem. TCP is trying to do much the same, by
the way.

You can adjust how close to 100% you want the algorithm to hit. If it
performs badly, give it a little bit more space.

If the time period is one second, flows can move at most once a second,
and very few flows would be likely to move. You could get a few
out-of-order packets on your flow, which is not such a big issue in a
rare event.

This makes me lean towards my original impression that you are only considering a single port on a single router.

Perhaps that is what the OP meant. If so, sure, have at it.

If they are interested in how CDN Mapping works, this is not even close.

Third…. Actually, since 1 & 2 are each sufficient to show why it doesn’t
work, not sure I need to go through the next N reasons. But there are
plenty more.

There are more reasons why this problem is hard to do on the servers :-).

The problem is VERY hard on the servers. Or, more precisely, on the control plane (which is frequently not on the servers themselves).

But the difference between “it's hard” and “it's un-possible” is kinda important.

Remember, they control the servers. All CDNs (that matter) can do
this. They can re-direct users with different URLs, different DNS
responses, 302s, etc., etc. It is not BGP.

Of course, some other CDNs don't use DNS, and instead use BGP by
anycasting target IP addresses locally. The challenge with this is
that those CDNs need to have their own IP addresses in the markets
they serve, while the DNS-based CDNs can use IP addresses of the
local network with whom they host.

I find the latter easier for ISPs, but I'm sure many of the CDNs find
the former easier for them, particularly with the lack of IPv4 space in
all but one region.

Mark.
