Distributed web crawling is a distributed computing technique whereby Internet search engines employ many computers to index the Internet via web crawling. Such systems may allow users to voluntarily offer their own computing and bandwidth resources towards crawling web pages. By spreading the load of these tasks across many computers, costs that would otherwise be spent on maintaining large computing clusters are avoided.

Types of Policies
– Cho and Garcia-Molina studied two types of assignment policies: dynamic and static
– With dynamic assignment, a central server assigns new URLs to crawling processes on the fly
– Dynamic assignment can balance the load across crawlers and add or remove downloader processes
– With static assignment, a fixed rule defined before the crawl determines which crawler receives each new URL
– Static assignment requires an exchange of URLs between crawling processes for links that cross site boundaries
Dynamic Assignment
– With dynamic assignment, a central server assigns new URLs to crawling processes dynamically
– Dynamic assignment balances the load across crawlers and allows downloader processes to be added or removed
– In a large crawler configuration, the DNS resolver and the per-site queues are also distributed
– In a small crawler configuration, a central DNS resolver and central queues per website feed the distributed downloaders
– Because the central server can become a bottleneck, most of the workload must be transferred to the distributed crawling processes for large crawls (a minimal sketch of this approach follows the list)
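
To make the dynamic approach concrete, here is a minimal sketch rather than a reference implementation: a single-process Python simulation in which a thread-safe queue stands in for the central server, and the names (CentralDispatcher, crawler_worker) are illustrative assumptions, not parts of any real crawler. Idle crawlers request the next URL from the dispatcher and report newly discovered links back to it, which is what makes load balancing and adding or removing downloaders straightforward.

```python
# Sketch of dynamic assignment: a central dispatcher holds the URL frontier
# and hands URLs to whichever crawler asks next. All names are illustrative.
import queue
import threading


class CentralDispatcher:
    """Central server that assigns new URLs to crawlers on demand."""

    def __init__(self, seed_urls):
        self.frontier = queue.Queue()
        for url in seed_urls:
            self.frontier.put(url)

    def next_url(self, timeout=1.0):
        """Called by an idle crawler; returns None if the frontier stays empty."""
        try:
            return self.frontier.get(timeout=timeout)
        except queue.Empty:
            return None

    def submit(self, url):
        """Crawlers report newly discovered links back to the central server."""
        self.frontier.put(url)


def crawler_worker(name, dispatcher):
    """One downloader process; more workers can be started or stopped at any time."""
    while True:
        url = dispatcher.next_url()
        if url is None:
            break
        # Downloading and link extraction would go here; we only log the assignment.
        print(f"{name} crawling {url}")


if __name__ == "__main__":
    dispatcher = CentralDispatcher(["http://example.com/", "http://example.org/"])
    workers = [
        threading.Thread(target=crawler_worker, args=(f"crawler-{i}", dispatcher))
        for i in range(3)
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
```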
Static Assignment
– A fixed rule, stated from the beginning of the crawl, defines how to assign new URLs to the crawlers
– A hashing function can transform URLs (or, better, complete website names) into the index of the corresponding crawling process
– An exchange of URLs must occur between crawling processes for external links that point from a site assigned to one process to a site assigned to another
– URL exchange should be done in batches, several URLs at a time, to reduce communication overhead
– The most cited URLs in the collection should be known by all crawling processes before the crawl, for example using data from a previous crawl (a sketch of hash-based assignment with batched exchange follows the list)
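
The sketch below illustrates static assignment under similar assumptions: a hash of the website's host name picks the owning crawling process, and links owned by another process are buffered and exchanged in batches to reduce communication overhead. The crawler count, batch size, and class names are hypothetical choices for the example, not values from any particular system.

```python
# Sketch of static assignment: a fixed hash of the host name picks the owning
# crawler; foreign links are buffered and exchanged in batches.
import hashlib
from collections import defaultdict
from urllib.parse import urlparse

NUM_CRAWLERS = 4   # assumed number of crawling processes
BATCH_SIZE = 100   # assumed batch size for URL exchange


def assign_crawler(url):
    """Map a URL to a crawling-process index by hashing its host name."""
    host = urlparse(url).netloc
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_CRAWLERS


class StaticCrawler:
    def __init__(self, index):
        self.index = index
        self.outgoing = defaultdict(list)  # per-destination buffers of foreign URLs

    def handle_link(self, url):
        owner = assign_crawler(url)
        if owner == self.index:
            self.enqueue_locally(url)
        else:
            # External link owned by another process: buffer it for a batch exchange.
            self.outgoing[owner].append(url)
            if len(self.outgoing[owner]) >= BATCH_SIZE:
                self.flush(owner)

    def enqueue_locally(self, url):
        print(f"crawler {self.index} queues {url}")

    def flush(self, owner):
        batch = self.outgoing.pop(owner, [])
        if batch:
            # In a real system this would be a network send to crawler `owner`.
            print(f"crawler {self.index} sends {len(batch)} URLs to crawler {owner}")


if __name__ == "__main__":
    crawler = StaticCrawler(index=0)
    for link in ["http://example.com/a", "http://example.org/b", "http://example.net/c"]:
        crawler.handle_link(link)
    for owner in list(crawler.outgoing):
        crawler.flush(owner)
```

Hashing the host name rather than the full URL keeps every page of a website with the same crawling process, which also helps when enforcing per-site politeness rules.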
Implementations
– Most modern commercial search engines use distributed web crawling
– Google and Yahoo use thousands of individual computers for crawling
– Newer projects enlist volunteers to join distributed crawling using personal computers
– LookSmart’s Grub distributed web-crawling project uses less structured collaboration
– Wikia acquired Grub from LookSmart in 2007
Drawbacks
– According to the FAQ of Nutch, an open-source search engine project, the bandwidth savings from distributed web crawling are not significant
– A successful search engine requires more bandwidth to serve query result pages than its crawler needs to download pages