Exclusive: Inside Google Spanner, the Largest Single Database on Earth

By Cade Metz 11.26.12 6:30 AM

Each morning, when Andrew Fikes sat down at his desk inside Google
headquarters in Mountain View, California, he turned on the “VC” link to New
York.

VC is Google shorthand for video conference. Looking up at the screen on his
desk, Fikes could see Wilson Hsieh sitting inside a Google office in
Manhattan, and Hsieh could see him. They also ran VC links to a Google office
in Kirkland, Washington, near Seattle. Their engineering team spanned three
offices in three different parts of the country, but everyone could still
chat and brainstorm and troubleshoot without a moment’s delay, and this is
how Google built Spanner.

“You walk into our cubes, and we’ve got VC on — all the time,” says Fikes,
who joined Google in 2001 and now ranks among the company’s distinguished
software engineers. “We’ve been doing this for years. It lowers all the
barriers to communication that you typically have.”

‘As a distributed-systems developer, you’re taught from — I want to say
childhood — not to trust time. What we did is find a way that we could trust
time — and understand what it meant to trust time.’

— Andrew Fikes

The arrangement is only appropriate. Much like the engineering team that
created it, Spanner is something that stretches across the globe while
behaving as if it’s all in one place. Unveiled this fall after years of hints
and rumors, it’s the first worldwide database worthy of the name — a database
designed to seamlessly operate across hundreds of data centers and millions
of machines and trillions of rows of information.

Spanner is a creation so large, some have trouble wrapping their heads around
it. But the end result is easily explained: With Spanner, Google can offer a
web service to a worldwide audience, but still ensure that something
happening on the service in one part of the world doesn’t contradict what’s
happening in another.

Google’s new-age database is already part of the company’s online ad system —
the system that makes its millions — and it could signal where the rest of
the web is going. Google caused a stir when it published a research paper
detailing Spanner in mid-September, and the buzz was palpable among the
hard-core computer systems engineers when Wilson Hsieh presented the paper at
a conference in Hollywood, California, a few weeks later.

“It’s definitely interesting,” says Raghu Murty, one of the chief engineers
working on the massive software platform that underpins Facebook — though he
adds that Facebook has yet to explore the possibility of actually building
something similar.

Google’s web operation is significantly more complex than most, and it’s
forced to build custom software that’s well beyond the scope of most online
outfits. But as the web grows, its creations so often trickle down to the
rest of the world.

Before Spanner was revealed, many didn’t even think it was possible. Yes, we
had “NoSQL” databases capable of storing information across multiple data
centers, but they couldn’t do so while keeping that information “consistent”
— meaning that someone looking at the data on one side of the world sees the
same thing as someone on the other side. The assumption was that consistency
was barred by the inherent delays that come when sending information between
data centers.

But in building a database that was both global and consistent, Google’s
Spanner engineers did something completely unexpected. They have a history of
doing the unexpected. The team includes not only Fikes and Hsieh, who oversaw
the development of BigTable, Google’s seminal NoSQL database, but also
legendary Googlers Jeff Dean and Sanjay Ghemawat and a long list of other
engineers who worked on such groundbreaking data-center platforms as
Megastore and Dremel.

This time around, they found a new way of keeping time.

“As a distributed systems developer, you’re taught from — I want to say
childhood — not to trust time,” says Fikes. “What we did is find a way that
we could trust time — and understand what it meant to trust time.”

Time Is of the Essence

On the net, time is of the essence. Yes, in running a massive web service,
you need things to happen quickly. But you also need a means of accurately
keeping track of time across the many machines that underpin your service.
You have to synchronize the many processes running on each server, and you
have to synchronize the servers themselves, so that they too can work in
tandem. And that’s easier said than done.

Typically, data-center operators keep their servers in sync using what’s
called the Network Time Protocol, or NTP. This is essentially an online
service that connects machines to the official atomic clocks that keep time
for organizations across the world. But because it takes time to move
information across a network, this method is never completely accurate, and
sometimes, it breaks altogether. In July, several big-name web operations
experienced problems — including Reddit, Gawker, and Mozilla — because their
software wasn’t prepared to handle a “leap second” that was added to the
world’s atomic clocks.
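
A back-of-the-envelope sketch shows why network delay caps NTP-style
accuracy: a client can only estimate a server’s clock offset from the
timestamps of a round trip, so its error floor is roughly half the round-trip
time. In the toy Python below, query_server() is an invented stand-in for the
real protocol exchange, and the 42-millisecond skew is made up.

import time

def query_server():
    """Pretend exchange: returns the server's receive and transmit times."""
    server_time = time.time() + 0.042      # imagine the server is 42 ms ahead
    return server_time, server_time        # (t1, t2)

def estimate_offset():
    t0 = time.time()                       # client transmit time
    t1, t2 = query_server()                # server receive / transmit times
    t3 = time.time()                       # client receive time
    offset = ((t1 - t0) + (t2 - t3)) / 2   # classic NTP offset estimate
    uncertainty = (t3 - t0) / 2            # no better than half the round trip
    return offset, uncertainty

if __name__ == "__main__":
    off, err = estimate_offset()
    print(f"offset ~ {off * 1000:.1f} ms, +/- {err * 1000:.1f} ms")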

‘We wanted something that we were confident in. It’s a time reference that’s
owned by Google.’

— Andrew Fikes

But with Spanner, Google discarded the NTP in favor of its own time-keeping
mechanism. It’s called the TrueTime API. “We wanted something that we were
confident in,” Fikes says. “It’s a time reference that’s owned by Google.”
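
The Spanner paper describes TrueTime as an API that reports time not as a
single number but as an interval guaranteed to contain the true time, plus
two queries for asking whether a timestamp is definitely past or definitely
future. The Python sketch below mimics that shape; the fixed 7-millisecond
uncertainty is a placeholder, not Google’s actual bound.

import time
from dataclasses import dataclass

@dataclass
class TTInterval:
    earliest: float      # true time is no earlier than this
    latest: float        # ... and no later than this

class TrueTime:
    def __init__(self, uncertainty_s: float = 0.007):
        self.eps = uncertainty_s             # assumed worst-case clock error

    def now(self) -> TTInterval:
        t = time.time()
        return TTInterval(t - self.eps, t + self.eps)

    def after(self, t: float) -> bool:
        # True only if t has definitely passed, even in the worst case.
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        # True only if t has definitely not yet arrived.
        return self.now().latest < t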

Rather than rely on outside clocks, Google equips its Spannerized data
centers with its own atomic clocks and GPS (global positioning system)
receivers, not unlike the one in your iPhone. Tapping into a network of
satellites orbiting the Earth, a GPS receiver can pinpoint your location, but
it can also tell time.

These time-keeping devices connect to a certain number of master servers, and
the master servers shuttle time readings to other machines running across the
Google network. Basically, each machine on the network runs a daemon — a
background software process — that is constantly checking with masters in the
same data center and in other Google data centers, trying to reach a
consensus on what time it is. In this way, machines across the Google network
can come pretty close to running a common clock.
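
As a rough illustration of what such a daemon does, the toy code below treats
each master’s reply as an interval of possible times and intersects them; per
the Spanner paper, the real daemons use a variant of Marzullo’s algorithm,
which also rejects masters that are broken or lying, something this sketch
skips. The readings are invented.

def intersect(intervals):
    """Return the range of times consistent with every master's reply."""
    lo = max(earliest for earliest, _ in intervals)
    hi = min(latest for _, latest in intervals)
    if lo > hi:
        raise ValueError("masters disagree; an outlier needs to be rejected")
    return lo, hi

# Example replies from three masters, as (earliest, latest) in epoch seconds:
readings = [
    (1_353_000_000.004, 1_353_000_000.010),
    (1_353_000_000.002, 1_353_000_000.008),
    (1_353_000_000.005, 1_353_000_000.011),
]

print(intersect(readings))   # roughly (1353000000.005, 1353000000.008)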

‘The System Responds — And Not a Human’

How does this bootstrap a worldwide database? Thanks to the TrueTime service,
Google can keep its many machines in sync — even when they span multiple data
centers — and this means they can quickly store and retrieve data without
stepping on each other’s toes.

“We can commit data at two different locations — say the West Coast [of the
United States] and Europe — and still have some agreed-upon ordering between
them,” Fikes says. “So, if the West Coast write happens first and then the
one in Europe happens, the whole system knows that — and there’s no
possibility of them being viewed in a different order.”
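
That guarantee rests on what the Spanner paper calls commit wait: a write is
stamped with the top of TrueTime’s current uncertainty interval and then held
back until that timestamp is provably in the past, so any transaction that
starts later, anywhere, gets a strictly larger timestamp. A minimal sketch,
again assuming a made-up 7-millisecond uncertainty:

import time

EPS = 0.007                              # assumed clock uncertainty

def tt_now():
    t = time.time()
    return t - EPS, t + EPS              # (earliest, latest)

def commit(write):
    _, latest = tt_now()
    commit_ts = latest                   # timestamp the write pessimistically
    while tt_now()[0] <= commit_ts:      # "commit wait": sit out the uncertainty
        time.sleep(0.001)
    write["visible_at"] = commit_ts      # only now make the write visible
    return commit_ts

if __name__ == "__main__":
    ts = commit({"row": "west-coast-write"})
    print(f"committed at {ts:.6f}")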

‘By using highly accurate clocks and a very clever time API, Spanner allows
server nodes to coordinate without a whole lot of communication.’

— Andy Gross

According to Andy Gross — the principal architect at Basho, an outfit that
builds an open source database called Riak that’s designed to run across
thousands of servers — database designers typically seek to synchronize
information across machines by having them talk to each other. “You have to
do a whole lot of communication to decide the correct order for all the
transactions,” he told us this fall, when Spanner was first revealed.

The problem is that this communication can bog down the network — and the
database. As Max Schireson — the president of 10gen, maker of the NoSQL
database MongoDB — told us: “If you have large numbers of people accessing
large numbers of systems that are globally distributed so that the delay in
communications between them is relatively long, it becomes very hard to keep
everything synchronized. If you increase those factors, it gets even harder.”

So Google took a completely different tack. Rather than struggle to improve
communication between servers, it gave them a new way to tell time. “That was
probably the coolest thing about the paper: using atomic clocks and GPS to
provide a time API,” says Facebook’s Raghu Murty.

In harnessing time, Google can build a database that’s both global and
consistent, but it can also make its services more resilient in the face of
network delays, data-center outages, and other software and hardware snafus.
Basically, Google uses Spanner to accurately replicate its data across
multiple data centers — and quickly move between replicas as need be. In
other words, the replicas are consistent too.

When one replica is unavailable, Spanner can rapidly shift to another. But it
will also move between replicas simply to improve performance. “If you have
one replica and it gets busy, your latency is going to be high. But if you
have four other replicas, you can choose to go to a different one, and trim
that latency,” Fikes says.
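
A toy version of that failover behavior, with invented replica names and
latencies: reads simply go to the fastest replica that is still healthy, and
“flip” elsewhere when it is not.

replicas = [
    {"name": "us-west", "latency_ms": 12,  "healthy": True},
    {"name": "us-east", "latency_ms": 48,  "healthy": True},
    {"name": "europe",  "latency_ms": 95,  "healthy": False},  # outage
    {"name": "asia",    "latency_ms": 130, "healthy": True},
]

def pick_replica(replicas):
    candidates = [r for r in replicas if r["healthy"]]
    if not candidates:
        raise RuntimeError("no replicas available")
    return min(candidates, key=lambda r: r["latency_ms"])

print(pick_replica(replicas)["name"])   # -> us-west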

One effect, Fikes explains, is that Google spends less money managing its
system. “When there are outages, things just sort of flip — client machines
access other servers in the system,” he says. “It’s a much easier service
story…. The system responds — and not a human.”

Spanning Google’s Footsteps

Some have questioned whether others can follow in Google’s footsteps — and
whether they would even want to. When we spoke to Andy Gross, he guessed that
even Google’s atomic clocks and GPS receivers would be prohibitively
expensive for most operations.

Yes, rebuilding the platform would be a massive undertaking. Google has
already spent four and a half years on the project, and Fikes — who helped
build Google’s web history tool, its first product search service, and Google
Answers, as well as BigTable — calls Spanner the most difficult thing he has
ever worked on. What’s more, there are countless logistical issues that need
dealing with.

‘The important thing to think about is that this is a service that is
provided to the data center. The costs of that are amortized across all the
servers in your fleet. The cost per server is some incremental amount — and
you weigh that against the types of things we can do for that.’

— Andrew Fikes

As Fikes points out, Google had to install GPS antennas on the roofs of its
data centers and connect them to the hardware below. And, yes, you do need
two separate types of time keepers. Hardware always fails, and your time
keepers must fail at, well, different times. “The atomic clocks provide
stability if there is a GPS issue,” he says.
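
A tiny sketch of that fallback, with invented master names: when no GPS-fed
master has a clean signal, the per-machine daemons can keep trusting the
atomic-clock masters, whose error grows slowly instead of disappearing
outright.

def usable_masters(masters):
    gps = [m for m in masters if m["kind"] == "gps" and m["signal_ok"]]
    # Fall back to atomic-clock masters when GPS is unavailable.
    return gps or [m for m in masters if m["kind"] == "atomic"]

masters = [
    {"name": "gps-master-1",    "kind": "gps", "signal_ok": False},
    {"name": "gps-master-2",    "kind": "gps", "signal_ok": False},
    {"name": "atomic-master-1", "kind": "atomic"},
]

print([m["name"] for m in usable_masters(masters)])   # -> ['atomic-master-1']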

But according to Fikes, these are relatively inexpensive devices. The GPS
units aren’t as cheap as those in your iPhone, but like Google’s atomic
clocks, they cost no more than a few thousand dollars apiece. “They’re sort
of in the order of the cost of an enterprise server,” he says, “and there are
a lot of different vendors of these devices.” When we discussed the matter
with Jeff Dean — one of Google’s primary infrastructure engineers and another
name on the Spanner paper — he indicated much the same.

Fikes also makes a point of saying that the TrueTime service does not require
specialized servers. The time keepers are kept in racks alongside the servers,
and again, they need only connect to some machines in the data center.

“You can think of it as only a handful of these devices being in each data
center. They’re boxes. You buy them. You plug them into your rack. And you’re
going to connect to them over Ethernet,” Fikes says. “The important thing to
think about is that this is a service that is provided to the data center.
The costs of that are amortized across all the servers in your fleet. The
cost per server is some incremental amount — and you weigh that against the
types of things we can do for that.”

No, Spanner isn’t something every website needs today. But the world is
moving in its general direction. Though Facebook has yet to explore something
like Spanner, it is building a software platform called Prism that will run
the company’s massive number crunching tasks across multiple data centers.

Yes, Google’s ad system is enormous, but it benefits from Spanner in ways
that could benefit so many other web services. The Google ad system is an
online auction — where advertisers bid to have their ads displayed as someone
searches for a particular item or visits particular websites — and the
appearance of each ad depends on data describing the behavior of countless
advertisers and web surfers across the net. With Spanner, Google can juggle
this data on a global scale, and it can still keep the whole system in sync.

As Fikes put it, Spanner is just the first example of Google taking advantage
of its new hold on time. “I expect there will be many others,” he says. He
means other Google services, but there’s a reason the company has now shared
its Spanner paper with the rest of the world.

Illustration by Ross Patton