Distributed web crawling

Distributed web crawling is a distributed computing technique whereby Internet search engines employ many computers to index the Internet via web crawling. Such systems may allow users to voluntarily offer their own computing and bandwidth resources for crawling web pages. By spreading the load of these tasks across many computers, costs that would otherwise be spent on maintaining large computing clusters are avoided.


Types of Policies
– Cho and Garcia-Molina studied two types of assignment policies: dynamic assignment and static assignment
– With dynamic assignment, a central server assigns new URLs to crawlers at run time, which allows load balancing and adding or removing downloader processes
– With static assignment, a fixed rule determines which crawler handles each URL, which requires some exchange of URLs between crawling processes

Dynamic Assignment
– Central server assigns new URLs to crawlers dynamically
– Dynamic assignment balances load across crawlers and lets downloader processes be added or removed during the crawl
– Small crawler configuration: a central DNS resolver and central queues per website, with distributed downloaders
– Large crawler configuration: the DNS resolver and the queues are distributed as well
– The central server can become a bottleneck, so for large crawls most of the workload must be transferred to the distributed crawling processes (see the sketch below)
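
As a rough, single-process illustration, here is a minimal sketch of dynamic assignment: a central coordinator owns the URL frontier, idle crawlers pull their next URL from it, and newly discovered links are reported back. The Coordinator class and its method names are hypothetical, not taken from any real crawler.

```python
from collections import deque

class Coordinator:
    """Hypothetical central server that hands out URLs on demand."""

    def __init__(self, seed_urls):
        self.frontier = deque(seed_urls)   # central queue of pending URLs
        self.seen = set(seed_urls)         # avoid re-crawling duplicates

    def next_url(self):
        # Any idle crawler calls this; assignment is dynamic because the
        # coordinator decides at run time who downloads what.
        return self.frontier.popleft() if self.frontier else None

    def report(self, discovered_urls):
        # Crawlers send back the links found on downloaded pages; unseen
        # URLs join the frontier. New downloader processes can start
        # polling next_url() at any time, and idle ones can simply stop.
        for url in discovered_urls:
            if url not in self.seen:
                self.seen.add(url)
                self.frontier.append(url)
```

In a real deployment each crawler would be a separate process talking to the coordinator over the network; the coordinator itself is the bottleneck noted above, which is why large-crawler configurations distribute the DNS resolver and the queues too.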

Static Assignment
– A fixed rule, stated before the crawl begins, defines how new URLs are assigned to crawlers
– A hashing function transforms a URL (or, better, the complete website name) into the index of the crawling process responsible for it
– External links from a site owned by one crawling process to a site owned by another force some exchange of URLs between processes
– URL exchange should be done in batches, several URLs at a time, to reduce communication overhead
– The most cited URLs should be known by all crawling processes before the crawl, for example from data of a previous crawl (see the sketch below)
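
A minimal sketch of this scheme, assuming a fixed number of crawler processes; hashing the hostname rather than the full URL keeps every page of a site on the same process. All names here (owner, route, send_batch, the batch size) are illustrative, and send_batch is a stub standing in for whatever inter-process transport a real crawler would use.

```python
import hashlib
from urllib.parse import urlparse

NUM_CRAWLERS = 8      # assumed fixed for the whole crawl (static assignment)
BATCH_SIZE = 1000     # assumed batch size for URL exchange

local_frontier = []                            # this process's own queue
outboxes = [[] for _ in range(NUM_CRAWLERS)]   # buffered URLs per peer

def owner(url: str) -> int:
    """Hash the website name to the index of its crawling process."""
    host = urlparse(url).netloc
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_CRAWLERS

def send_batch(dest: int, urls: list) -> None:
    # Stub: a real system would ship these over the network to process
    # `dest`; batching amortizes the per-message overhead.
    print(f"shipping {len(urls)} URLs to crawler {dest}")

def route(url: str, my_index: int) -> None:
    """Keep our own URLs locally; buffer external links for batch exchange."""
    dest = owner(url)
    if dest == my_index:
        local_frontier.append(url)
    else:
        outboxes[dest].append(url)
        if len(outboxes[dest]) >= BATCH_SIZE:
            send_batch(dest, outboxes[dest])
            outboxes[dest].clear()
```

sha1 is used rather than Python's built-in hash() because the latter is randomized per process, while static assignment needs every process to compute the same owner for the same host. Pre-seeding each process with the most cited URLs from a previous crawl keeps popular links out of these exchange batches.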

Implementations
– Most modern commercial search engines use distributed web crawling
– Google and Yahoo use thousands of individual computers for crawling
– Newer projects enlist volunteers to join distributed crawling using personal computers
– LookSmart’s Grub distributed web-crawling project uses less structured collaboration
– Wikia acquired Grub from LookSmart in 2007

Drawbacks
– According to the Nutch FAQ, distributed web crawling doesn't significantly save bandwidth
– A successful search engine requires more bandwidth to serve query result pages than its crawler needs to download pages, so the bandwidth saved by distributing the crawl is comparatively small


