The key to the speed and reliability of Google search is cutting up data into chunks, its top engineer said.
Behind the scenes
Urs Hoelzle, Google vice president of operations and vice president of engineering, offered a rare behind-the-scenes tour of Google's architecture on Wednesday. Hoelzle spoke here at EclipseCon 2005, a conference on the open source, extensible platform for software tools.
To deal with the more than 10 billion Web pages and tens of terabytes of information on Google's servers, the company combines cheap machines with plenty of redundancy, Hoelzle said. Its commodity servers cost around $1,000 apiece, and Google's architecture places them into interconnected nodes.
All machines run on a stripped-down Linux kernel. The distribution is Red Hat, but Hoelzle said Google doesn't use much of the distro. Moreover, Google has created its own patches for things that haven't been fixed in the original kernel.
"The downside to cheap machines is, you have to make them work together reliably," Hoelzle said. "These things are cheap and easy to put together. The problem is, these things break."
In fact, at Google, many will fail every day. So, Google has automated methods of dealing with machine failures, allowing it to build a fast, highly reliable service with cheap hardware.
Google replicates the Web pages it caches by splitting them up into pieces it calls "shards." The shards are small enough that several can fit on one machine. And they're replicated on several machines, so that if one breaks, another can serve up the information. The master index is also split up among several servers, and that set also is replicated several times. The engineers call these "chunk servers."
As a search query comes into the system, it hits a Web server, then is split into chunks of service. One set of index servers contains the index; one set of machines contains one full index. To actually answer a query, Google has to use one complete set of servers. Since that set is replicated as a fail-safe, it also increases throughput, because if one set is busy, a new query can be routed to the next set, which drives down search time per box.
In parallel, clusters of document servers contain copies of Web pages that Google has cached. Hoelzle said that the refresh rate is from one to seven days, with an average of two days. That's mostly dependent on the needs of the Web publishers.
"One surprising limitation is we can't crawl as fast as we would like, because [smaller] webmasters complain," he said.
Each set of document servers contains one copy of the Web. These machines are responsible for delivering the content snippets that show searchers relevant text from the page.
"When we have your top 10 results, they get sent to the document servers, which load the 10 result pages into memory," Hoelzle said. "Then you parse through them and find the best snippet that contains all the query words."
Prepared for failure
Google uses three software systems built in-house to route queries, balance server loads and make programming easier.
The Google File System was written specifically to deal with the cheap machines that will fail.
"We take our files and chunk them up, then you randomly distribute the chunks across different machines, making sure each chunk has at least two copies that are not physically adjacent -- not on same power strip or same switch," Hoelzle said. "We try to make sure that even if one copy goes away, another copy is still here." Chunks typically are 64 megabytes and are replicated three times.
All this replication makes it easier to make changes, Hoelzle said. Google simply takes one replica at a time offline, updates it, then plugs the machines back in.
Because these chunks are randomly distributed all over, Google needs a master containing metadata to keep track of where the chunks are. When a query comes into the system, the file system master tells it which chunk server has the data. "From there on, you just talk to the chunk servers," he said.
Client machines are responsible for dealing with fault tolerance. If a client requests a file from the specified chunk server and gets no response within the designated time period, it uses the meta information to locate another chunk server, while sending the file master a hint that the first chunk server might have died. If the master confirms the chunk went out, it will replicate the chunks that were on it to another server, making sure that the information is replicated at least the minimum number of times.
"You were vulnerable for only a very brief period," he said.
To enable Google programmers to write applications to run in parallel on 1,000 machines, engineers created the Map/Reduce Framework in 2004.
"The Map/Reduce Framework provides automatic and efficient parallelization and distribution," Hoelzle said. "It's fault tolerant and it does the I/O scheduling, being a little bit smart about where the data lives."
Programmers write two simple functions, map and reduce, to create a long list of key/value pairs. Then, the mapping function produces other key/value pairs. "You just map one pair to another pair," he said.
For example, if an application is needed to count URLs on one host, the programmer would take the URL and the contents and map them into the pair consisting of hostname and 1. "This produces an intermediate set of key/value pairs with different values."
Next, a reduction operation takes all the outputs that have the same key and combines them to produce a single output.
"Map/Reduce is simplified large-scale data processing," Hoelzle said, "a very simple abstraction that makes it possible to write programs that run over these terabytes of data with little effort."
Global work queue
The third homegrown application is Google's Global Work Queue, which is for scheduling.
Global Work Queue works like old-time batch processing. It schedules queries into batch jobs and places them on pools of machines. The setup is optimized for running random computations over tons of data.
"Mostly, you want to split the huge task into lots of small chunks, which provides even load balancing across machines," Hoelzle said. The idea is to have more tasks than machines so machines are never idle.
Hoelzle also demonstrated how Google uses its massive architecture to learn from data. It analyzes the most common misspellings of queries, and uses that information to power the function that suggests alternate spellings for queries.
The company also is applying machine learning to its system to give better results. Theoretically, he said, if someone searches for "Bay Area cooking class," the system should know that "Berkeley courses: vegetarian cooking" is a good match even though it contains none of the query words.
To do this, the system tries to cluster concepts into "reasonably coherent" subclusters that seem related. These clusters, some tiny and some huge, are named automatically. Then, when a query comes in, the system produces a probability score for the various clusters. This kind of machine learning has had little success in academic trials, Hoelzle said, because they didn't have enough data. "If you have enough data, you get reasonably good answers out of it."
In addition to improving query results, Google uses this learning to better deliver contextual ads for its AdSense service to Web publishers, as well as to more accurately cluster news stories within Google News.
Google's redundancy theory works on a meta level, as well, according to Hoelzle. One literal meltdown -- a fire at a datacenter in an undisclosed location -- brought out six fire trucks but didn't crash the system.
"You don't have just one data center," he said, "you have multiples."