
Distributed web crawling

Types of Policies

Cho and Garcia-Molina studied two types of assignment policy:

- Dynamic assignment: a central server assigns new URLs to crawling processes at run time.
- Static assignment: a fixed rule, defined before the crawl starts, determines which process is responsible for each new URL, which requires crawling processes to exchange URLs among themselves.

Dynamic Assignment

- A central server assigns new URLs to crawlers dynamically, which lets it balance load across processes and add or remove downloader processes during the crawl.
- In a large crawler configuration, the DNS resolver and the queues are themselves distributed.
- In a small crawler configuration, there is a central DNS resolver and a central queue per website.
- For large crawls, the workload must be transferred from the central server to the distributed crawling processes.

Static Assignment

- A fixed rule, known from the start of the crawl, defines how new URLs are assigned to crawlers.
- A common rule is a hashing function that maps a URL to the index of the crawling process responsible for it.
- Because pages link across sites, crawling processes must send the external links they discover to the processes responsible for them.
- This exchange of URLs should be done in batches, since transferring URLs one at a time adds significant communication overhead.
- The most cited URLs should be known to all crawling processes before the crawl begins, to reduce the amount of URL exchange needed.

Implementations

- Most modern commercial search engines use distributed web crawling; Google and Yahoo have used thousands of individual computers to crawl the Web.
- Newer projects enlist volunteers, who join the crawl using their personal computers.
- LookSmart's Grub distributed web-crawling project used a less structured form of collaboration; Wikia acquired Grub from LookSmart in 2007.

Drawbacks

- According to the Nutch FAQ, distributed web crawling does not significantly save bandwidth for a search engine overall: a successful search engine requires more bandwidth to serve query result pages than its crawler needs to download pages.
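The hashing rule used in static assignment can be sketched as follows. This is a minimal illustration, not taken from any particular crawler: the function name `assign`, the constant `NUM_CRAWLERS`, and the choice of hashing the host name (so that every page of a site lands on the same process, keeping per-site politeness rules local) are all assumptions made for the example.

```python
import hashlib
from urllib.parse import urlparse

NUM_CRAWLERS = 4  # fixed before the crawl starts; a static rule cannot change this mid-crawl


def assign(url: str) -> int:
    """Map a URL to the index of the crawling process responsible for it.

    Hashing the host name (rather than the full URL) sends all pages of
    one site to the same process. A stable hash (MD5 here) is used instead
    of Python's built-in hash(), which varies between interpreter runs.
    """
    host = urlparse(url).netloc.lower()
    digest = hashlib.md5(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_CRAWLERS


# Pages of the same site always go to the same crawler:
assert assign("http://example.com/page1") == assign("http://example.com/page2")
```

One consequence of such a fixed rule is visible here: links to hosts that hash to a different index are "external" from the local process's point of view and must be handed over to their owner.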
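The batched URL exchange described under static assignment could look like the following sketch. It assumes each crawling process buffers discovered external links per owning process and ships a buffer only when it reaches a threshold; the names `UrlExchanger`, `BATCH_SIZE`, and the `send_batch` callback are illustrative, not part of any real system.

```python
from collections import defaultdict

BATCH_SIZE = 100  # illustrative threshold; real crawlers would tune this


class UrlExchanger:
    """Buffer external links per owning crawler and ship them in batches."""

    def __init__(self, my_index: int, send_batch):
        self.my_index = my_index
        self.send_batch = send_batch      # callable(owner, urls) that performs the transfer
        self.buffers = defaultdict(list)  # owner index -> URLs waiting to be sent

    def found_link(self, url: str, owner: int) -> None:
        if owner == self.my_index:
            return  # internal link: handled locally, no exchange needed
        self.buffers[owner].append(url)
        if len(self.buffers[owner]) >= BATCH_SIZE:
            self.flush(owner)

    def flush(self, owner: int) -> None:
        """Send any pending URLs for one owner (also called at end of crawl)."""
        if self.buffers[owner]:
            self.send_batch(owner, self.buffers[owner])
            self.buffers[owner] = []
```

With this scheme a process performs one transfer per `BATCH_SIZE` external links rather than one per link, which is the overhead reduction the batching requirement is after.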