
Web crawler

Web Crawler Overview and Policies

- A web crawler is also known as a spider, an ant, an automatic indexer, or a Web scutter.
- A web crawler starts with a list of URLs to visit, called the seeds.
- As the crawler visits these URLs, it identifies the hyperlinks in the retrieved web pages and adds them to the crawl frontier.
- URLs from the crawl frontier are visited recursively according to a set of policies.
- If the crawler is performing web archiving, it copies and saves the information as it goes.
- The archived web pages are stored in a repository designed to store and manage the collection.
- The behavior of a web crawler is determined by a combination of policies: a selection policy, a re-visit policy, a politeness policy, and a parallelization policy.
- The selection policy determines which pages to download.
- The re-visit policy determines when to check for changes to the pages.
- The politeness policy keeps the crawler from overloading websites.
- The parallelization policy coordinates distributed web crawlers.

Crawling Strategies and Experiments

- Even large search engines cover only a portion of the publicly available web.
- The importance of a page depends on its intrinsic quality, its popularity in terms of links or visits, and even its URL.
- Researchers have studied crawl scheduling experimentally: one study compared ordering metrics such as breadth-first, backlink count, and partial PageRank calculations.
- Another crawl experiment found that a breadth-first crawl captures pages with high PageRank early in the crawl.
- OPIC (On-line Page Importance Computation) is a crawling strategy whose algorithm distributes "cash" to pages in proportion to their importance.
- Simulation experiments on subsets of the web have compared strategies such as breadth-first, depth-first, random ordering, and an omniscient strategy.

Crawling Techniques and Considerations

- Crawlers may seek out only HTML pages.
- An HTTP HEAD request can be made to determine a resource's MIME type before downloading it.
- Certain characters in a URL can be used to request specific resources.
- URL normalization standardizes URLs in a consistent manner so that the same resource is not crawled multiple times. Typical steps include:
  - converting URLs to lowercase;
  - removing "." and ".." path segments;
  - adding trailing slashes to the non-empty path component.
- A path-ascending crawler ascends to every path in each URL it intends to crawl.
- Focused crawlers download pages that are similar to each other.

Web Crawler Policies and Architecture

- Crawlers can have a crippling impact on server performance.
- The costs of using web crawlers include network resources, server overload, poorly written crawlers, and disruptive personal crawlers.
- The robots exclusion protocol (robots.txt) lets site operators indicate which parts of a server should not be accessed by crawlers.
- Commercial search engines support the Crawl-delay parameter to indicate the interval between requests.
- A parallel crawler runs multiple crawl processes in parallel.
- A highly optimized architecture is essential for a high-performance crawler.
- Building such a system presents challenges in system design, I/O and network efficiency, and robustness.
- The lack of detail in published crawler designs prevents others from reproducing the work.
- Web crawling can lead to unintended consequences and compromises.
- Security measures should be in place to prevent unauthorized access during crawling.

Crawler Identification and Other Considerations

- Web crawlers identify themselves to a web server using the User-agent field of an HTTP request.
- The deep web consists of web pages accessible only through queries to a database.
- Visual web scrapers/crawlers structure data into columns and rows based on user requirements.
- Various crawler architectures exist for general-purpose crawlers.
- Related topics include automatic indexing, the Gnutella crawler, web archiving, the Webgraph, and website mirroring software.
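The URL normalization steps listed earlier can be sketched with Python's standard library. This is a minimal sketch: real normalizers handle more cases (default ports, percent-encoding), and lowercasing is limited here to the scheme and host, since path case is significant on most servers.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit

def normalize(url):
    """Normalize a URL so the same resource is not crawled twice:
    lowercase the scheme and host, collapse '.' and '..' path
    segments, and preserve a trailing slash on directory-like paths."""
    parts = urlsplit(url)
    # posixpath.normpath resolves '.' and '..' segments but strips
    # a trailing slash, so restore it when the original path had one.
    path = posixpath.normpath(parts.path or "/")
    if parts.path.endswith("/") and not path.endswith("/"):
        path += "/"
    # Drop the fragment: it is never sent to the server.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(),
                       path, parts.query, ""))

print(normalize("HTTP://Example.COM/a/b/../c/./d/"))  # http://example.com/a/c/d/
print(normalize("http://example.com"))                # http://example.com/
```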
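A crawler can honor the robots exclusion protocol and the Crawl-delay parameter described above using Python's `urllib.robotparser`. A minimal sketch follows; the robots.txt content and the crawler name are made up for illustration, and the rules are parsed from a string here rather than fetched over HTTP.

```python
from urllib import robotparser

# Hypothetical robots.txt served by example.com (illustrative only).
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""

rp = robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())   # a real crawler would use rp.set_url(...) + rp.read()

# Check the politeness rules before fetching each URL.
print(rp.can_fetch("MyCrawler/1.0", "https://example.com/index.html"))  # True
print(rp.can_fetch("MyCrawler/1.0", "https://example.com/private/x"))   # False
print(rp.crawl_delay("MyCrawler/1.0"))  # 10
```

A polite crawler would sleep for the returned delay (here 10 seconds) between successive requests to the same host, and skip any URL for which `can_fetch` returns False.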