Glossary Term
Web crawler
Web Crawler Overview and Policies
- A web crawler is also known as a spider, ant, automatic indexer, or Web scutter.
- A web crawler starts with a list of URLs called seeds.
- As the crawler visits these URLs, it identifies all the hyperlinks in the retrieved web pages and adds them to the crawl frontier.
- URLs from the crawl frontier are recursively visited according to a set of policies (see the sketch after this list).
- If the crawler is performing web archiving, it copies and saves the information as it goes.
- The archived web pages are stored in a repository designed to store and manage the collection.
- The behavior of a web crawler is determined by a combination of policies: selection policy, re-visit policy, politeness policy, and parallelization policy.
- The selection policy determines which pages to download.
- The re-visit policy determines when to check for changes to the pages.
- The politeness policy ensures that the crawler does not overload web sites.
- The parallelization policy coordinates distributed web crawlers.
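To make the seed/frontier/policy flow above concrete, here is a minimal single-threaded crawl loop in Python. It is a sketch, not any particular crawler's design: the names (crawl, max_pages, delay_seconds) are illustrative, selection is reduced to FIFO (breadth-first) ordering, and politeness is reduced to a fixed delay between requests.

```python
import time
import urllib.parse
import urllib.request
from html.parser import HTMLParser


class LinkExtractor(HTMLParser):
    """Collects href values from anchor tags."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, max_pages=100, delay_seconds=1.0):
    frontier = list(seeds)            # URLs waiting to be visited
    visited = set()                   # URLs already fetched
    while frontier and len(visited) < max_pages:
        url = frontier.pop(0)         # FIFO selection -> breadth-first order
        if url in visited:
            continue
        try:
            with urllib.request.urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except Exception:
            continue                  # skip unreachable or unreadable pages
        visited.add(url)
        extractor = LinkExtractor()
        extractor.feed(html)
        for href in extractor.links:
            absolute = urllib.parse.urljoin(url, href)   # resolve relative links
            if absolute not in visited:
                frontier.append(absolute)                # grow the crawl frontier
        time.sleep(delay_seconds)     # politeness: pause between requests
    return visited


# Hypothetical usage: crawl(["https://example.com/"], max_pages=10)
```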
Crawling Strategies and Experiments
- Large search engines cover only a portion of the publicly available web.
- The importance of a page is determined by its intrinsic quality, popularity in terms of links or visits, and even its URL.
- Researchers have studied and experimented with different crawl-ordering strategies to improve coverage and efficiency.
- One study of crawl scheduling compared ordering metrics such as breadth-first, backlink count, and partial PageRank calculations.
- Another crawl experiment found that a breadth-first crawl captures pages with high PageRank early in the crawl.
- OPIC (On-line Page Importance Computation) is a crawling strategy in which each page starts with an equal amount of "cash" that is distributed among the pages it links to when it is crawled; the cash a page accumulates serves as an estimate of its importance (see the sketch after this list).
- Simulation experiments have been conducted on subsets of the web to compare strategies like breadth-first, depth-first, random ordering, and omniscient strategy.
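The following is a simplified sketch of OPIC-style cash distribution, assuming the link graph is already known and held in a hypothetical in-memory dict. The real algorithm runs online as pages are fetched and uses a virtual page to absorb cash from dangling links; those details are omitted here.

```python
def opic_order(graph, steps=1000):
    """Repeatedly 'crawl' the page holding the most cash and distribute its
    cash equally among its out-links; the cash that has flowed through a page
    since the start is its importance estimate."""
    pages = list(graph)
    cash = {p: 1.0 / len(pages) for p in pages}   # equal initial cash
    history = {p: 0.0 for p in pages}             # cash that has flowed through each page
    order = []

    for _ in range(steps):
        page = max(cash, key=cash.get)            # next page to fetch: highest cash
        order.append(page)
        amount, cash[page] = cash[page], 0.0
        history[page] = history.get(page, 0.0) + amount
        targets = graph.get(page) or pages        # dangling page: spread cash everywhere
        share = amount / len(targets)
        for target in targets:
            cash[target] = cash.get(target, 0.0) + share

    by_importance = sorted(history, key=history.get, reverse=True)
    return by_importance, order


# Tiny hypothetical link graph: each key links to the pages in its list.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
importance, crawl_order = opic_order(graph, steps=50)
print(importance)
```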
Crawling Techniques and Considerations
- Crawlers may seek out only HTML pages and avoid all other MIME types.
- A crawler may issue an HTTP HEAD request to determine a resource's MIME type before requesting the full resource.
- A crawler may also examine the URL itself and request only resources whose URLs end with certain characters, such as .html, .htm, or a slash.
- URL normalization (also called URL canonicalization) standardizes URLs in a consistent manner so that the crawler does not fetch the same resource more than once (see the sketch after this list). Typical normalizations include:
  - Conversion of the URL to lowercase.
  - Removal of "." and ".." segments.
  - Adding a trailing slash to the non-empty path component.
- A path-ascending crawler ascends to every path in each URL it intends to crawl; given a URL such as http://example.com/a/b/page.html, it will also attempt to crawl /a/b/, /a/, and /.
- Focused crawlers (also called topical crawlers) attempt to download only pages that are similar to each other, i.e., pages relevant to a pre-defined topic.
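As a rough illustration of the MIME-type check and URL normalization steps above, here is a hedged sketch using only the Python standard library. The exact normalization rules vary between crawlers, and the is_html helper and the ExampleBot user-agent string are assumptions made for this example.

```python
import posixpath
import urllib.request
from urllib.parse import urlsplit, urlunsplit


def is_html(url, user_agent="ExampleBot/0.1"):
    """HEAD request: transfer only the headers, then inspect Content-Type."""
    request = urllib.request.Request(url, method="HEAD",
                                     headers={"User-agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.headers.get_content_type() == "text/html"


def normalize_url(url):
    """Lowercase the scheme and host, resolve '.' and '..' segments, and add
    a trailing slash to directory-like paths (rules vary between crawlers)."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    netloc = parts.netloc.lower()
    path = posixpath.normpath(parts.path) if parts.path else "/"
    # normpath drops trailing slashes, so restore one on extension-less paths
    if not posixpath.splitext(path)[1] and not path.endswith("/"):
        path += "/"
    return urlunsplit((scheme, netloc, path, parts.query, parts.fragment))


assert normalize_url("HTTP://Example.COM/a/b/../c") == "http://example.com/a/c/"
```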
Web Crawler Policies and Architecture
- Crawlers can have a crippling impact on server performance.
- Costs of using Web crawlers include consumption of network resources, server overload when a server is accessed too frequently, poorly written crawlers that can crash servers or routers, and personal crawlers that can disrupt networks and servers if deployed by too many users.
- The robots exclusion protocol (robots.txt) lets administrators indicate which parts of a server should not be accessed by crawlers (see the sketch after this list).
- Some commercial search engines honor an informal Crawl-delay parameter in robots.txt that indicates the number of seconds to wait between successive requests.
- A parallel crawler runs multiple crawling processes in parallel, aiming to maximize the download rate while avoiding repeated downloads of the same page.
- A highly optimized architecture is essential for a crawler.
- Building a high-performance system presents challenges in system design, I/O and network efficiency, and robustness.
- Lack of detail in published crawler designs prevents reproducibility.
- Web crawling can have unintended consequences, such as a compromise or data breach when a search engine indexes resources that should not be publicly available.
- Apart from standard web application security measures, site owners can reduce exposure by allowing search engines to index only the public parts of a site and blocking crawler access to transactional or private pages, such as login pages.
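Below is a minimal sketch of honoring robots.txt and an optional Crawl-delay value with Python's standard-library parser. The host, paths, and the ExampleBot name are hypothetical.

```python
# A robots.txt file like the following (hypothetical) tells compliant
# crawlers which paths to skip and how long to wait between requests:
#
#   User-agent: *
#   Disallow: /private/
#   Crawl-delay: 10
import urllib.robotparser

robots = urllib.robotparser.RobotFileParser()
robots.set_url("https://example.com/robots.txt")
robots.read()                                    # fetch and parse robots.txt

bot = "ExampleBot/0.1"                           # hypothetical crawler name
if robots.can_fetch(bot, "https://example.com/some/page.html"):
    delay = robots.crawl_delay(bot)              # None when no Crawl-delay is given
    print("allowed; wait", delay or 1.0, "seconds between requests")
else:
    print("disallowed by robots.txt")
```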
Crawler Identification and Other Considerations
- Web crawlers identify themselves to a Web server using the User-agent field of an HTTP request (see the sketch at the end of this list).
- The deep web consists of web pages that can be reached only by submitting queries to a database, so regular crawlers cannot find them unless links point to them.
- Visual web scrapers/crawlers structure data into columns and rows based on user requirements.
- Various crawler architectures exist for general-purpose crawlers.
- Automatic indexing, Gnutella crawler, Web archiving, Webgraph, and Website mirroring software are related topics.
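As a closing illustration of crawler identification, here is a minimal sketch of a request that announces itself through the User-agent header; the bot name and contact URL are placeholders, not a real crawler.

```python
import urllib.request

# Placeholder identity: a real crawler would use its own name and contact page.
request = urllib.request.Request(
    "https://example.com/",
    headers={"User-agent": "ExampleBot/0.1 (+https://example.com/bot-info)"},
)
with urllib.request.urlopen(request, timeout=10) as response:
    print(response.status, response.headers.get("Content-Type"))
```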