Web crawler

Web Crawler Overview and Policies
A Web crawler is also known as a spider, an ant, an automatic indexer, or a Web scutter.
– A Web crawler starts with a list of URLs to visit, called seeds.
– It identifies hyperlinks in the pages it downloads and adds them to the crawl frontier (a minimal sketch of this seed-and-frontier loop appears after this list).
– A crawler can archive websites, saving the downloaded information in a repository.
– The repository stores HTML pages as distinct files.
– Because of the sheer volume of the Web, a crawler can download only a limited number of pages.
Web crawler behavior is determined by selection policy, re-visit policy, politeness policy, and parallelization policy.
– Selection policy determines which pages to download.
– Re-visit policy determines when to check for changes to the pages.
– Politeness policy helps avoid overloading web sites.
– Parallelization policy coordinates distributed web crawlers.
– Large search engines index only a portion of the publicly available web.
– An importance metric is used to prioritize which pages to download first.
– Importance can depend on a page's intrinsic quality, its popularity in terms of links or visits, and even its URL.
– Designing a good selection policy is challenging because the crawler must work with partial information: the complete set of web pages is not known during crawling.
– Ordering strategies such as breadth-first and backlink count have been studied.
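
As a concrete illustration of the seed-and-frontier loop described above, here is a minimal breadth-first crawler sketch in Python using only the standard library. The names `crawl` and `LinkExtractor`, the page limit, and the timeout are illustrative choices, not part of any particular production crawler.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen


class LinkExtractor(HTMLParser):
    """Collect href attributes from <a> tags on a downloaded page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)


def crawl(seeds, max_pages=50):
    """Breadth-first crawl: start from the seed URLs, follow hyperlinks,
    and stop after max_pages downloads (a crude selection policy)."""
    frontier = deque(seeds)   # the crawl frontier
    visited = set()
    repository = {}           # URL -> raw HTML, a minimal "repository"

    while frontier and len(visited) < max_pages:
        url = frontier.popleft()
        if url in visited:
            continue
        visited.add(url)
        try:
            with urlopen(url, timeout=10) as response:
                html = response.read().decode("utf-8", errors="replace")
        except OSError:
            continue          # skip unreachable pages
        repository[url] = html

        parser = LinkExtractor()
        parser.feed(html)
        for href in parser.links:
            absolute = urljoin(url, href)   # resolve relative links
            if absolute.startswith("http") and absolute not in visited:
                frontier.append(absolute)

    return repository
```

A real crawler would add URL normalization, robots.txt checks, and politeness delays (sketched later in this article) before running this loop against live sites.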

Research on Crawling Strategies
– Junghoo Cho et al. studied crawling scheduling policies using a data set from stanford.edu.
– Najork and Wiener performed an actual crawl on 328 million pages using breadth-first ordering.
– Abiteboul designed a crawling strategy based on an algorithm called OPIC (On-line Page Importance Computation).
– Boldi et al. conducted simulations on subsets of the Web to compare breadth-first, depth-first, and random ordering strategies.
– These studies aimed to improve the efficiency and effectiveness of web crawlers.
– Crawling strategies such as OPIC and per-site queue length have been found to outperform breadth-first crawling.
– Using a previous crawl to guide the current one can be very effective.
– Simulation on subsets of the Web showed the effectiveness of different crawling strategies.
– Community-based algorithms can discover good seeds for crawling.
– Crawling web pages with high PageRank from different communities can be more efficient.

Techniques and Considerations in Web Crawling
– Crawlers may seek only HTML pages and avoid other MIME types.
– An HTTP HEAD request can reveal a resource's MIME type before the entire resource is downloaded (a sketch of this check, together with URL normalization, follows this list).
– Alternatively, a crawler may only request URLs ending in .html, .htm, .asp, or similar, on the assumption that these point to HTML resources.
– Crawlers may avoid requesting resources that appear to be dynamically produced, to stay out of spider traps.
– URL rewriting, however, can make the strategy of avoiding URLs containing a '?' unreliable.
– URL normalization is the process of modifying and standardizing URLs.
– Types of normalization include converting URLs to lowercase and removing . and .. segments.
– Adding trailing slashes to the non-empty path component is another normalization technique.
– URL normalization helps avoid crawling the same resource multiple times.
– Different normalization techniques can be performed to achieve consistency in URL representation.
– Path-ascending crawlers aim to download as many resources as possible from a particular website.
– They ascend to every path in each URL they intend to crawl; given a seed URL such as http://llama.org/hamster/monkey/page.html, for example, they also attempt /hamster/monkey/, /hamster/, and /.
– This makes them effective at finding isolated resources, or resources for which no inbound link would have been found during regular crawling.
– Focused crawlers attempt to download only pages that are relevant to a given topic or query.
– A page's similarity to the query can be predicted from anchor text, from the full content of already-visited pages, or by other methods.
– The performance of focused crawling depends mostly on the richness of links within the specific topic being searched.
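
The HEAD-based MIME check and the URL normalization steps listed above might look roughly like the following standard-library sketch. The function names `looks_like_html` and `normalize_url` are illustrative, and which trailing-slash convention to apply is a policy choice left to the crawler's author.

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit
from urllib.request import Request, urlopen


def looks_like_html(url):
    """Send an HTTP HEAD request and inspect the Content-Type header,
    so the full resource is downloaded only when it appears to be HTML."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=10) as response:
            content_type = response.headers.get("Content-Type", "")
    except OSError:
        return False
    return content_type.split(";")[0].strip().lower() == "text/html"


def normalize_url(url):
    """Normalize a URL so the same resource is not crawled twice:
    lowercase the scheme and host, resolve '.' and '..' path segments,
    drop the fragment, and use '/' when the path is empty."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    path = posixpath.normpath(path) if path else "/"
    if path == ".":
        path = "/"
    if not path.startswith("/"):
        path = "/" + path
    return urlunsplit((scheme.lower(), netloc.lower(), path, query, ""))


print(normalize_url("HTTP://Example.COM/a/b/../c/./d.html#section"))
# -> http://example.com/a/c/d.html
```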

Challenges and Considerations in Web Crawling
– Two common re-visiting policies are the uniform policy (re-visit every page at the same frequency) and the proportional policy (re-visit a page more often the more frequently it changes).
– Somewhat counter-intuitively, the uniform policy outperforms the proportional policy in terms of average freshness.
– The proportional policy allocates more resources to frequently updating pages, which are often stale again before the next visit.
– An optimal re-visiting policy therefore penalizes elements that change too often.
– It also keeps accesses to individual pages evenly spaced to minimize the expected obsolescence time.
– Crawlers can degrade site performance and overload servers, especially when they fetch many pages in a short time.
– The costs of web crawling include network resources and server load on the visited sites.
– Poorly written crawlers can crash servers or download pages they cannot handle.
– Personal crawlers, if deployed by too many users, can disrupt networks and web servers.
– The robots exclusion protocol (robots.txt) lets administrators indicate which parts of their servers crawlers should not access (a politeness sketch using robots.txt appears after this list).
– A parallel crawler runs multiple crawling processes at once to maximize the download rate.
– A policy is needed to avoid downloading the same page more than once.
– Assigning the new URLs discovered during crawling is crucial, because the same URL can be found by two different processes.
– The goal is to minimize the overhead from parallelization while avoiding repeated downloads.
– A highly optimized architecture is crucial for a web crawler.
– Building one raises challenges in system design, I/O and network efficiency, and robustness and manageability.
– The crawling algorithms and architecture of major search engines are often kept as business secrets.
– The lack of detail in published designs prevents others from reproducing the work.
– Concerns about search engine spamming also limit how much of these algorithms is disclosed.
– Web crawling can have unintended consequences and lead to a compromise if sensitive resources are indexed.
– Website owners generally want their pages indexed broadly, but indiscriminate indexing can result in a data breach.
– Search engines may index resources that were meant to remain private.
– Security measures should therefore be taken to keep sensitive information out of crawlers' reach.
– Balancing broad indexing against security remains a challenge for site owners and crawler operators alike.
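
Below is a rough sketch of the politeness mechanics mentioned above (robots exclusion plus a per-host delay), using Python's standard `urllib.robotparser`. The user-agent string, the two-second delay, and the helper names `allowed_by_robots` and `wait_politely` are illustrative assumptions rather than fixed conventions.

```python
import time
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleCrawler/1.0"   # illustrative crawler name
CRAWL_DELAY = 2.0                   # seconds between requests to one host

_robots_cache = {}   # host -> RobotFileParser
_last_request = {}   # host -> time of the previous request to that host


def allowed_by_robots(url):
    """Fetch and cache robots.txt for the URL's host, then ask whether
    this crawler's user agent may fetch the given URL."""
    parts = urlsplit(url)
    host = parts.netloc
    if host not in _robots_cache:
        parser = RobotFileParser()
        parser.set_url(f"{parts.scheme}://{host}/robots.txt")
        try:
            parser.read()
        except OSError:
            pass   # network failure: the unread parser denies all fetches
        _robots_cache[host] = parser
    return _robots_cache[host].can_fetch(USER_AGENT, url)


def wait_politely(url):
    """Sleep just long enough that successive requests to the same host
    are at least CRAWL_DELAY seconds apart."""
    host = urlsplit(url).netloc
    if host in _last_request:
        elapsed = time.monotonic() - _last_request[host]
        if elapsed < CRAWL_DELAY:
            time.sleep(CRAWL_DELAY - elapsed)
    _last_request[host] = time.monotonic()
```

A crawling loop would call `allowed_by_robots(url)` before fetching and `wait_politely(url)` immediately before each request; sites that publish a Crawl-delay directive can also be honored via `RobotFileParser.crawl_delay()`.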

Identifying Crawlers and Specialized Crawling
– Web crawlers typically identify themselves to a web server through the User-agent field of an HTTP request (a sketch of setting this header appears after this list).
– Web administrators examine the user-agent field in server logs to determine which crawlers have visited the web server and how often.
– Some administrators use tools to identify, track, and verify web crawlers.
– Spambots and malicious crawlers may not provide identifying information or may mask their identity.
– Identifying crawlers is useful for contacting owners, stopping problematic crawlers, and knowing when web pages will be indexed.
– The deep web contains web pages that are only accessible through database queries.
– Regular crawlers cannot find deep web pages without links pointing to them.
– Google’s Sitemaps protocol and mod_oai are intended to allow discovery of these deep-web resources.
– Deep web crawling increases the number of web links to be crawled.
– Screen scraping and specialized software can be used to target and aggregate data from deep web sources.
– Visual web scrapers/crawlers crawl pages and structure data based on user requirements.
– Visual crawlers require less programming ability compared to classic crawlers.
– Users "teach" a visual crawler by example through its interface, and the crawler then follows those patterns across semi-structured data sources.
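
A crawler can announce its identity simply by setting the User-Agent header on every request, as in the short sketch below; the agent string and contact URL shown are placeholders that an operator would replace with their own.

```python
from urllib.request import Request, urlopen

# Illustrative agent string: names the crawler and points administrators
# to a page describing it, so its operator can be contacted.
USER_AGENT = "ExampleCrawler/1.0 (+https://example.com/crawler-info)"


def fetch(url):
    """Download a page while announcing the crawler's identity through
    the User-Agent header of the HTTP request."""
    request = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(request, timeout=10) as response:
        return response.read()
```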

Web crawler (Wikipedia)

A Web crawler, sometimes called a spider or spiderbot and often shortened to crawler, is an Internet bot that systematically browses the World Wide Web and that is typically operated by search engines for the purpose of Web indexing (web spidering).

Architecture of a Web crawler

Web search engines and some other websites use Web crawling or spidering software to update their web content or indices of other sites' web content. Web crawlers copy pages for processing by a search engine, which indexes the downloaded pages so that users can search more efficiently.

Crawlers consume resources on visited systems and often visit sites unprompted. Issues of schedule, load, and "politeness" come into play when large collections of pages are accessed. Mechanisms exist for public sites not wishing to be crawled to make this known to the crawling agent. For example, including a robots.txt file can request bots to index only parts of a website, or nothing at all.

The number of Internet pages is extremely large; even the largest crawlers fall short of making a complete index. For this reason, search engines struggled to give relevant search results in the early years of the World Wide Web, before 2000. Today, relevant results are given almost instantly.

Crawlers can validate hyperlinks and HTML code. They can also be used for web scraping and data-driven programming.
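
As a small example of the link-validation use mentioned above, the following sketch (an illustrative helper, not part of any specific crawler) asks a server for just the response status of a URL so broken links can be flagged.

```python
from urllib.error import HTTPError, URLError
from urllib.request import Request, urlopen


def check_link(url):
    """Return the HTTP status code for a URL, or None if it is unreachable,
    so that broken hyperlinks can be reported."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=10) as response:
            return response.status
    except HTTPError as error:
        return error.code    # e.g. 404 for a broken link
    except URLError:
        return None          # DNS failure, refused connection, etc.
```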
