Glossary Term

Focused crawler

Definition and Purpose of Focused Crawlers
- A focused crawler collects only those web pages that satisfy a specific predicate, rather than crawling the web indiscriminately.
- It prioritizes the crawl frontier and manages hyperlink exploration to stay on target.
- Predicates can be based on domains, topics, or other page properties.
- Focused crawlers can exploit web directories, text indexes, or backlinks as sources of candidate pages.
- Predicting the relevance of an unvisited page before downloading it is crucial.

Techniques and Approaches in Focused Crawling
- The anchor text of incoming links can serve as a relevance predictor.
- Topical crawling was introduced by Filippo Menczer.
- Text classifiers are used to score and prioritize URLs in the crawl frontier.
- Some focused crawlers employ reinforcement learning to guide exploration.
- Classifiers can be trained on context graphs and page text content.

Semantic Focused Crawlers and Ontology-based Categorization
- Semantic focused crawlers use domain ontologies to relate pages to topical concepts.
- Ontologies can be updated automatically during the crawling process.
- Support vector machines have been used to update the content of ontological concepts.
- Crawlers also target pages carrying semantic markup such as RDFa, Microformats, and Microdata.
- Bandit-based selection strategies are used to make such crawling more efficient.

Factors Affecting the Performance of Focused Crawlers
- The richness of links within the target topic strongly affects performance.
- Focused crawling usually relies on general web search engines to provide starting points.
- Seed selection significantly influences crawling efficiency.
- A whitelist strategy restricts crawling to a set of high-quality seed URLs.
- Performance studies explain why focused crawling succeeds on broad topics.

Related Research and Techniques in Focused Crawling
- Studies compare crawl prioritization policies and their effects on the link popularity of collected pages.
- Breadth-first crawling from popular seed pages tends to collect pages with large PageRank early.
- Methods for detecting stale pages have been reported to improve crawling.
- Visual information can be used to recognize common areas across web pages.
- Academic search engines have evolved crawling strategies based on whitelists and blacklists.
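The frontier prioritization mentioned above can be sketched as a best-first queue in which URLs with higher predicted relevance are fetched first. This is a minimal illustration, not any particular crawler's implementation; the class and method names are hypothetical.

```python
import heapq

class CrawlFrontier:
    """Best-first crawl frontier: URLs with the highest predicted
    relevance are popped first (hypothetical minimal sketch)."""

    def __init__(self):
        self._heap = []      # entries: (negated score, insertion order, url)
        self._seen = set()   # avoid re-enqueueing URLs we already know
        self._counter = 0    # tie-breaker keeps ordering stable

    def push(self, url, relevance):
        # relevance: predicted probability that the page satisfies the predicate
        if url in self._seen:
            return
        self._seen.add(url)
        heapq.heappush(self._heap, (-relevance, self._counter, url))
        self._counter += 1

    def pop(self):
        # return the most promising unvisited URL
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)
```

A real crawler would refresh these scores as new evidence (e.g. anchor text of newly seen links) arrives; here the score is fixed at insertion time for simplicity.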
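Using anchor text as a relevance predictor can be illustrated with a crude term-overlap score, standing in for the trained text classifiers the literature actually uses; the function and its scoring rule are illustrative assumptions.

```python
import re

def anchor_text_score(anchor_text, topic_terms):
    """Fraction of topic terms appearing in the anchor text.
    A crude stand-in for a trained text classifier."""
    tokens = set(re.findall(r"[a-z]+", anchor_text.lower()))
    if not topic_terms:
        return 0.0
    hits = sum(1 for term in topic_terms if term in tokens)
    return hits / len(topic_terms)
```

In practice a classifier trained on page text or context graphs would replace this heuristic, but the interface is the same: map the text around a link to a score used to rank the frontier.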
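The bandit-based selection strategies mentioned above balance exploiting link sources that have yielded relevant pages against exploring others. A minimal epsilon-greedy sketch, assuming "arms" are hosts (or link contexts) and the reward is 1 when a fetched page turns out to be relevant:

```python
import random

class EpsilonGreedyArmSelector:
    """Epsilon-greedy bandit over crawl 'arms' (e.g. hosts);
    a hypothetical sketch, not a specific published strategy."""

    def __init__(self, arms, epsilon=0.1, seed=None):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.counts = {arm: 0 for arm in arms}    # pulls per arm
        self.values = {arm: 0.0 for arm in arms}  # mean observed reward

    def select(self):
        # with probability epsilon explore a random arm, else exploit
        if self.rng.random() < self.epsilon:
            return self.rng.choice(sorted(self.counts))
        return max(self.values, key=self.values.get)

    def update(self, arm, reward):
        # incremental update of the mean relevance reward for this arm
        self.counts[arm] += 1
        self.values[arm] += (reward - self.values[arm]) / self.counts[arm]
```

More refined strategies (e.g. UCB-style confidence bounds) follow the same select/update loop but pick arms using uncertainty estimates rather than a fixed exploration rate.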