Can distributed processing search for web pages?

Inspired by projects that harness spare time on PCs, one programmer wants to hand back control of Internet searching to users.

Are you worried about Google's growing dominance of the Internet search market? Alex Chudnovsky certainly is. To develop a community-led alternative, the Birmingham-based Russian programmer is building a new type of search engine. By harnessing the power of distributed computing, he's already managed to build an index that covers 1bn web pages.

He has called his venture Majestic-12, possibly a reference to the alleged secret committee formed after the 1947 Roswell UFO incident. His passion for technically challenging work stems from worries about Google's iron grip on the market, its tight control of search results, and even whether some sites are indexed at all.

"Because of their success, they have effectively created a monopoly in the virtual world. Monopolies never end up well for consumers," says Chudnovsky, who has developed search engines and other software for leading UK retailers. "I want to build the biggest UK search index."

He has a challenge. The market research firm Nielsen/NetRatings says Google has a UK market share of more than 60%, with Yahoo and MSN trailing by a substantial margin. "To many, search and Google are synonymous. Its dominance increases bit by bit each month," says a spokesman.

Chudnovsky, though, wants to use the technique that has worked so well - at least for recruitment - for the Search for Extraterrestrial Intelligence (Seti), cancer searches, climate change modelling and, most recently, cracking a set of encrypted messages sent from a submarine during the second world war. Distributed computing lets many participants do little bits of work to create a huge result, using spare time on their computers.

Big shoes to fill

Chudnovsky has a huge result to follow. Google stopped publicising the size of its search index when it reached 8bn pages four months ago. "We maintain the largest collection of documents searchable on the web," says a Google spokeswoman. "We estimate this expanded search index to be more than three times as large as any other search engine. We update the entire index about once a month, and some areas more frequently."

In his latest book "An Introduction to Search Engines and Web Navigation" Mark Levene, the professor of computer science at Birkbeck College, says Google has more than 15,000 servers and "crawls" - examines for indexing - 3,000 URLs per second. (Other estimates have ranged from 31,000 to 79,000 servers.)

Your home PC is clearly no match. For example, you cannot crawl more than 1m pages a day on a 2Mbps broadband connection. At that rate, it would take you 8,000 days (about 22 years) to acquire a Google-sized but hopelessly out-of-date index.

The solution? Recruit like-minded people who donate computer time, as they do for Seti@home and other projects. "Google's database is about 8bn pages, so fewer than 10,000 people taking part in this project can recrawl the whole of Google's database every single day," says Chudnovsky.

A large-scale distributed crawling project has been attempted before and involved thousands at its peak. Danny Sullivan, the editor-in-chief of Search Engine Watch, points to Looksmart's Grub project of 2003, which is no longer operational.

Majestic-12's volunteers - 60 so far - are crawling about 50m pages a day using unlimited broadband connections and software that runs in the background. Over the past few months, 7bn pages have been crawled although, at 1bn pages, the completed index lags behind for now. This index is stored centrally to enable the Majestic-12 distributed search engine to return fast, relevant results.

"Ideally, I'd like to distribute the search index," says Chudnovsky. This is a challenging proposition that would see duplicate chunks of a huge index distributed between broadband-connected PCs.
There are also parallels with peer-to-peer systems such as Gnutella, which share music, films and software. A small-scale experiment with one country, perhaps Finland, may happen later this year.

Professor Jon Crowcroft of Cambridge University says this type of collaborative web crawling and indexing is very reasonable. "Many search engines do this to reduce the traffic load returning to a single central site - distributing the index itself is OK, so long as you have an efficient mechanism to search the index."

These efforts also interest Professor Levene. "I hope the project succeeds. People finding novel ways of doing crawling or search is good for the competition," he says.

Should Google, Yahoo, and MSN be worried? "It would be hard to push Google out of the way - they're just going to buy you out."

Chudnovsky's aspirations are more community-minded, helping to develop a search engine that users control. Nevertheless, his innovative code might revitalise searches on corporate websites or, more controversially, assist with search engine optimisation. But as video, images and music are added to burgeoning search engine indices, crawling and search tasks will need to become more distributed.
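
For readers who want to check the crawl arithmetic quoted in this article, the figures can be reproduced in a few lines of Python. This is only a rough sketch using the article's round numbers (a ceiling of 1m pages a day per 2Mbps connection and an 8bn-page index); the function and constant names are illustrative assumptions and are not taken from Majestic-12's software.

```python
import math

# Round figures quoted in the article (not measured values).
PAGES_PER_DAY_PER_PC = 1_000_000      # ceiling for one PC on a 2Mbps line
GOOGLE_INDEX_PAGES = 8_000_000_000    # last index size Google publicised


def days_for_one_pc(index_pages, pages_per_day=PAGES_PER_DAY_PER_PC):
    """Days a single PC needs to crawl the whole index once."""
    return index_pages / pages_per_day


def volunteers_for_daily_recrawl(index_pages, pages_per_day=PAGES_PER_DAY_PER_PC):
    """PCs needed to recrawl the whole index every day (hypothetical helper)."""
    return math.ceil(index_pages / pages_per_day)


if __name__ == "__main__":
    days = days_for_one_pc(GOOGLE_INDEX_PAGES)
    print(f"One PC: {days:,.0f} days, roughly {days / 365:.0f} years")
    print(f"Daily recrawl: {volunteers_for_daily_recrawl(GOOGLE_INDEX_PAGES):,} PCs")

    # Reported current rate: 60 volunteers crawling about 50m pages a day.
    print(f"Per volunteer: about {50_000_000 / 60:,.0f} pages a day")
```

On these assumptions a single PC needs 8,000 days, and 8,000 PCs could recrawl the index daily, which is consistent with Chudnovsky's "fewer than 10,000 people". The reported rate of about 50m pages a day from 60 volunteers works out a little below the 1m-a-day ceiling for each machine.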

