Web Crawler Overview and Policies
– A web crawler is also known as a spider, ant, automatic indexer, or Web scutter.
– A web crawler starts with a list of URLs called seeds.
– As the crawler visits these URLs, it identifies all the hyperlinks in the retrieved web pages and adds them to the crawl frontier.
– URLs from the crawl frontier are recursively visited according to a set of policies (a minimal crawl loop is sketched at the end of this section).
– If the crawler is performing web archiving, it copies and saves the information as it goes.
– The archived web pages are stored in a repository designed to store and manage the collection.
– The behavior of a web crawler is determined by a combination of policies: selection policy, re-visit policy, politeness policy, and parallelization policy.
– The selection policy determines which pages to download.
– The re-visit policy determines when to check for changes to the pages.
– The politeness policy ensures that the crawler does not overload web sites.
– The parallelization policy coordinates distributed web crawlers.
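To make the seed-and-frontier behaviour above concrete, here is a minimal breadth-first crawl loop in Python. It is illustrative only: it ignores the re-visit, politeness, and parallelization policies, and the seed URL shown in the comment is a placeholder.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkExtractor(HTMLParser):
    """Collect the href targets of <a> tags found in an HTML page."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(seeds, max_pages=100):
    """Breadth-first crawl starting from the seed URLs."""
    frontier = deque(seeds)              # the crawl frontier: URLs waiting to be visited
    visited = set()                      # URLs already fetched
    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except (OSError, ValueError):
            continue                     # skip unreachable or malformed URLs
        visited.add(url)
        extractor = LinkExtractor()
        extractor.feed(html)             # identify hyperlinks in the retrieved page
        for link in extractor.links:
            absolute = urljoin(url, link)        # resolve relative links
            if absolute.startswith(("http://", "https://")) and absolute not in visited:
                frontier.append(absolute)        # add new URLs to the frontier
    return visited

# Example (placeholder seed URL):
# crawl(["https://example.org/"])
```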
Crawling Strategies and Experiments
– Large search engines cover only a portion of the publicly available web.
– The importance of a page is determined by its intrinsic quality, popularity in terms of links or visits, and even its URL.
– Researchers have conducted studies and experiments to improve web crawling strategies.
– One study on crawl scheduling compared ordering metrics such as breadth-first, backlink count, and partial PageRank calculations.
– Another crawl experiment found that a breadth-first crawl captures pages with high PageRank early in the crawl.
– OPIC (On-line Page Importance Computation) is a crawl-ordering strategy in which every page starts with an equal amount of "cash"; when a page is crawled, its cash is distributed equally among the pages it points to, and the crawler fetches pages holding the most cash first (a simplified sketch appears at the end of this section).
– Simulation experiments on subsets of the web have compared strategies such as breadth-first, depth-first, random ordering, and an omniscient strategy.
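The following is a simplified, illustrative sketch of the OPIC idea described above, not the published algorithm: every page starts with the same amount of "cash", the page holding the most cash is crawled next, and its cash is then split equally among the pages it links to. The link graph and page names are invented for the example.

```python
def opic_crawl_order(graph, initial_cash=1.0):
    """Simplified OPIC-style ordering: fetch the page holding the most cash,
    then distribute its cash equally among the pages it links to."""
    cash = {page: initial_cash for page in graph}   # every page starts with equal cash
    order = []
    while len(order) < len(graph):
        # Pick the not-yet-crawled page with the largest cash balance.
        page = max((p for p in graph if p not in order), key=lambda p: cash[p])
        order.append(page)
        out_links = [t for t in graph[page] if t in cash]
        if out_links:
            share = cash[page] / len(out_links)     # distribute the page's cash
            for target in out_links:
                cash[target] += share
        cash[page] = 0.0                            # the crawled page's cash is spent
    return order

# Toy link graph (invented page names): A links to B and C, and so on.
toy_graph = {"A": ["B", "C"], "B": ["C"], "C": ["A"], "D": ["C"]}
print(opic_crawl_order(toy_graph))
```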
Crawling Techniques and Considerations
– A crawler may want to seek out only HTML pages and avoid all other MIME types.
– To do so, it can issue an HTTP HEAD request to learn a resource's MIME type before downloading the resource itself.
– Alternatively, it can examine the URL and request only resources whose URLs end in certain characters, such as .html, .htm, or a slash; some crawlers also avoid URLs containing a "?" in order to sidestep dynamically generated content and spider traps.
– URL normalization (also called URL canonicalization) standardizes URLs in a consistent manner so that the crawler does not fetch the same resource more than once (a sketch appears after this list). Typical steps include:
– Converting URLs to lowercase.
– Removing "." and ".." path segments.
– Adding a trailing slash to the non-empty path component.
– A path-ascending crawler ascends to every path in each URL it intends to crawl; for example, given a URL such as http://example.org/a/b/page.html (a hypothetical address), it would also attempt to crawl /a/b/, /a/, and /.
– Focused crawlers (also known as topical crawlers) attempt to download only pages that are relevant to a predefined topic or set of topics.
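Below is a minimal sketch of the URL normalization steps listed above. It assumes that lowercasing applies to the scheme and host only (paths are usually case-sensitive) and that appending a trailing slash to every non-empty path is acceptable; production crawlers are typically more careful, for instance with file-like paths such as /page.html.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

def normalize_url(url):
    """Normalize a URL so that equivalent spellings map to one canonical string."""
    scheme, netloc, path, query, fragment = urlsplit(url)
    scheme = scheme.lower()                            # the scheme is case-insensitive
    netloc = netloc.lower()                            # so is the host name
    path = posixpath.normpath(path) if path else "/"   # remove '.' and '..' segments
    if not path.endswith("/"):
        path += "/"                                    # trailing slash on the non-empty path
    return urlunsplit((scheme, netloc, path, query, fragment))

# Both spellings normalize to http://example.org/a/c/
print(normalize_url("HTTP://Example.ORG/a/b/../c"))
print(normalize_url("http://example.org/a/c/"))
```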
Web Crawler Policies and Architecture
– Crawlers can have a crippling impact on server performance.
– The costs of using Web crawlers include consumption of network resources, server overload (especially when a single site is accessed too frequently), poorly written crawlers that can crash servers or routers, and personal crawlers that, if deployed by too many users, can disrupt networks and Web servers.
– The robots exclusion protocol helps indicate which parts of a server should not be accessed by crawlers.
– Some commercial search engines support a non-standard Crawl-delay: parameter in robots.txt that indicates the number of seconds to wait between successive requests (see the robots.txt sketch after this list).
– A parallel crawler runs multiple crawling processes in parallel; to avoid downloading the same page more than once, it needs a policy for assigning newly discovered URLs to processes (one common host-based assignment is sketched after this list).
– A highly optimized architecture is essential for a crawler.
– Building a high-performance system presents challenges in system design, I/O and network efficiency, and robustness.
– Lack of detail in published crawler designs prevents reproducibility.
– Crawling can have unintended consequences, such as exposing resources that were never meant to be publicly available or pages that reveal potentially vulnerable software versions.
– Site owners can reduce this risk by letting crawlers reach only the public parts of a site and keeping private or transactional pages (such as login pages) off-limits.
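The robots exclusion protocol and the Crawl-delay parameter mentioned above can be honoured with Python's standard urllib.robotparser, as in the sketch below; the site URL and user-agent name are placeholders.

```python
import time
from urllib import robotparser

ROBOTS_URL = "https://example.org/robots.txt"    # placeholder site
USER_AGENT = "ExampleCrawler"                    # hypothetical crawler name

rp = robotparser.RobotFileParser()
rp.set_url(ROBOTS_URL)
rp.read()                                        # fetch and parse robots.txt

url = "https://example.org/some/page.html"       # placeholder page
if rp.can_fetch(USER_AGENT, url):                # obey Disallow rules
    delay = rp.crawl_delay(USER_AGENT) or 1      # obey Crawl-delay if present
    time.sleep(delay)
    # ... fetch the page here ...
else:
    print("robots.txt disallows crawling", url)
```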
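One common way to assign URLs among parallel crawling processes, sketched below under the assumption of a fixed number of workers, is to hash each URL's host name so that all URLs from one site are handled by a single process, which keeps two processes from fetching the same page. The URLs shown are placeholders.

```python
import hashlib
from urllib.parse import urlsplit

NUM_WORKERS = 4   # assumed number of crawling processes

def assign_worker(url, num_workers=NUM_WORKERS):
    """Map a URL to a worker index so that every URL of a given host
    is always handled by the same crawling process."""
    host = urlsplit(url).netloc.lower()
    digest = hashlib.sha1(host.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_workers

for u in ["https://example.org/a",            # placeholder URLs
          "https://example.org/b",
          "https://example.net/index.html"]:
    print(assign_worker(u), u)                # same host -> same worker index
```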
Crawler Identification and Other Considerations
– Web crawlers typically identify themselves to a Web server using the User-agent field of an HTTP request (see the sketch after this list).
– The deep web consists of web pages accessible only through queries to a database.
– Visual web scrapers/crawlers structure data into columns and rows based on user requirements.
– Various crawler architectures exist for general-purpose crawlers.
– Automatic indexing, Gnutella crawler, Web archiving, Webgraph, and Website mirroring software are related topics.
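As an illustration of the identification described above, a crawler can set the User-agent header on each request; the agent string and contact URL below are invented placeholders, not a required format.

```python
from urllib.request import Request, urlopen

# Hypothetical agent string: crawler name, version, and a page describing it.
USER_AGENT = "ExampleCrawler/1.0 (+https://example.org/crawler-info)"

request = Request("https://example.org/", headers={"User-Agent": USER_AGENT})
with urlopen(request, timeout=10) as response:
    body = response.read()
print(len(body), "bytes fetched as", USER_AGENT)
```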
A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).
Web search engines and some other websites use Web crawling or spidering software to update their web content or indices of other sites' web content. Web crawlers copy pages for processing by a search engine, which indexes the downloaded pages so that users can search more efficiently.
Crawlers consume resources on visited systems and often visit sites unprompted. Issues of schedule, load, and "politeness" come into play when large collections of pages are accessed. Mechanisms exist for public sites not wishing to be crawled to make this known to the crawling agent. For example, including a robots.txt file can request bots to index only parts of a website, or nothing at all.
The number of Internet pages is extremely large; even the largest crawlers fall short of making a complete index. For this reason, search engines struggled to give relevant search results in the early years of the World Wide Web, before 2000. Today, relevant results are given almost instantly.
Crawlers can validate hyperlinks and HTML code. They can also be used for web scraping and data-driven programming.
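As a small illustration of hyperlink validation, the sketch below issues an HTTP HEAD request for each link and reports the status code; the URLs are placeholders and error handling is minimal.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen

def check_link(url):
    """Return the HTTP status code for url, using a HEAD request so the
    resource itself is not downloaded."""
    request = Request(url, method="HEAD")
    try:
        with urlopen(request, timeout=10) as response:
            return response.status
    except HTTPError as err:
        return err.code            # e.g. 404 for a broken link
    except URLError:
        return None                # unreachable host or network problem

for link in ["https://example.org/", "https://example.org/missing"]:
    print(link, check_link(link))  # placeholder links
```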