Glossary Term
Distributed web crawling
Types of Policies
- Cho and Garcia-Molina studied two types of assignment policies: dynamic and static
- Dynamic assignment: a central server assigns new URLs to crawling processes at run time
- Dynamic assignment can balance the load across crawlers and add or remove downloader processes during the crawl
- Static assignment: a fixed rule, set before the crawl starts, determines which crawler handles each new URL
- Static assignment requires crawling processes to exchange URLs that fall under another process's assignment
Dynamic Assignment
- A central server assigns new URLs to crawling processes dynamically at run time (see the sketch after this list)
- This allows the load to be balanced and downloader processes to be added or removed while the crawl is running
- In a large crawler configuration, the DNS resolver and the URL queues are also distributed
- In a small crawler configuration, there is a central DNS resolver and central per-site queues, with distributed downloaders
- For large crawls, most of the workload must be transferred to the distributed crawling processes
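A minimal sketch of what dynamic assignment could look like, assuming a single central dispatcher that tracks per-crawler load; the names (CentralDispatcher, next_assignment, and so on) are illustrative, not taken from any particular crawler.

```python
from collections import deque

class CentralDispatcher:
    """Hypothetical central server for dynamic URL assignment.

    Keeps a frontier of discovered URLs and hands each one to the
    crawling process that currently has the fewest URLs in flight,
    so load stays balanced as crawlers are added or removed.
    """

    def __init__(self, crawler_ids):
        self.frontier = deque()                          # URLs waiting to be assigned
        self.pending = {cid: 0 for cid in crawler_ids}   # URLs in flight per crawler

    def add_crawler(self, crawler_id):
        # Downloader processes can join (or leave) during the crawl.
        self.pending.setdefault(crawler_id, 0)

    def remove_crawler(self, crawler_id):
        self.pending.pop(crawler_id, None)

    def submit_urls(self, urls):
        # New URLs discovered by any crawler are reported back to the server.
        self.frontier.extend(urls)

    def next_assignment(self):
        # Assign the next URL to the least-loaded crawler, if any work remains.
        if not self.frontier or not self.pending:
            return None
        crawler_id = min(self.pending, key=self.pending.get)
        url = self.frontier.popleft()
        self.pending[crawler_id] += 1
        return crawler_id, url

    def mark_done(self, crawler_id):
        if self.pending.get(crawler_id, 0) > 0:
            self.pending[crawler_id] -= 1


# Example: two crawlers, the dispatcher spreads the work between them.
dispatcher = CentralDispatcher(["crawler-1", "crawler-2"])
dispatcher.submit_urls(["http://example.org/", "http://example.com/a", "http://example.com/b"])
while (assignment := dispatcher.next_assignment()) is not None:
    crawler_id, url = assignment
    print(crawler_id, "downloads", url)
    dispatcher.mark_done(crawler_id)
```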
Static Assignment
- A fixed rule, defined from the beginning of the crawl, determines how new URLs are assigned to crawling processes
- A hashing function can map a URL (typically its host name) to the index of the responsible crawling process (see the sketch after this list)
- When a page contains links to sites assigned to other processes, those URLs must be exchanged between crawling processes
- The URL exchange should be done in batch, several URLs at a time, to reduce communication overhead
- The most cited URLs should be known by all crawling processes before the crawl (e.g. from a previous crawl), so they do not need to be exchanged
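A sketch of how a hash-based static assignment rule and batched URL exchange might look, assuming the crawl is partitioned by host name across a fixed number of processes; NUM_CRAWLERS, BATCH_SIZE, and send_batch are illustrative placeholders rather than part of any real crawler's API.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit

NUM_CRAWLERS = 4    # assumed fixed number of crawling processes
BATCH_SIZE = 100    # assumed threshold before buffered URLs are exchanged

def assigned_crawler(url: str) -> int:
    """Map a URL to a crawler index via a stable hash of its host name.

    Hashing the host (rather than the full URL) keeps every page of a
    site on the same crawler, so per-site politeness rules stay local.
    """
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % NUM_CRAWLERS

# Outgoing URLs destined for other crawlers, buffered per destination
# so they can be sent in batches instead of one message per URL.
outgoing = defaultdict(list)

def route_discovered_url(url: str, my_index: int, local_frontier: list) -> None:
    target = assigned_crawler(url)
    if target == my_index:
        local_frontier.append(url)       # ours: crawl it locally
    else:
        outgoing[target].append(url)     # someone else's: buffer for batch exchange
        if len(outgoing[target]) >= BATCH_SIZE:
            send_batch(target, outgoing.pop(target))

def send_batch(target: int, urls: list) -> None:
    # Placeholder for the actual transport (e.g. a message queue or RPC call).
    print(f"sending {len(urls)} URLs to crawler {target}")
```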
Implementations
- Most modern commercial search engines use distributed web crawling
- Google and Yahoo use thousands of individual computers for crawling
- Newer projects use a less structured form of collaboration, enlisting volunteers who crawl using their personal computers
- LookSmart used this technique to power its Grub distributed web-crawling project
- Wikia acquired Grub from LookSmart in 2007
Drawbacks
- According to the Nutch FAQ, distributed web crawling does not significantly save bandwidth
- A successful search engine requires more bandwidth to serve query result pages than its crawler needs to download pages