Glossary Term: Focused crawler
Definition and Purpose of Focused Crawlers
- A focused crawler collects only those web pages that satisfy a specific property or predicate
- It carefully prioritizes the crawl frontier and manages the exploration of hyperlinks so that mostly relevant pages are fetched
- Predicates may be based on domains, topics of interest, or other page properties
- Focused crawlers can exploit web directories, full-text indexes, or backlink information to find candidate pages
- Predicting the relevance of a page before downloading it is crucial, since unvisited pages are known only through the links that point to them
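The frontier prioritization described above can be sketched as a best-first crawl driven by a priority queue. This is a minimal illustration, not any particular crawler's implementation; the hooks `score`, `fetch`, and `extract_links` are hypothetical caller-supplied functions standing in for a relevance predictor, a page downloader, and a link extractor.

```python
import heapq

def crawl(seeds, score, fetch, extract_links, limit=100):
    """Best-first crawl: always fetch the unvisited URL with the
    highest predicted relevance. All hooks are caller-supplied."""
    # heapq is a min-heap, so negate scores to pop the best URL first
    frontier = [(-score(url), url) for url in seeds]
    heapq.heapify(frontier)
    visited = set()
    fetched = []
    while frontier and len(fetched) < limit:
        _, url = heapq.heappop(frontier)
        if url in visited:
            continue
        visited.add(url)
        page = fetch(url)
        fetched.append(url)
        # Enqueue outlinks, scored before they are ever downloaded
        for link in extract_links(page):
            if link not in visited:
                heapq.heappush(frontier, (-score(link), link))
    return fetched
```

On a toy link graph, a high-scoring branch is explored before a low-scoring sibling even when the low-scoring link was discovered first.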
Techniques and Approaches in Focused Crawling
- The anchor text of a link can be used to predict the relevance of the page it points to
- Topical crawling was introduced by Filippo Menczer
- Text classifiers are used to prioritize the crawl frontier
- Reinforcement learning is employed by some focused crawlers
- Context graphs and text content are used to train classifiers
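As a toy illustration of anchor text as a relevance predictor, the sketch below scores a link by the overlap between its anchor text and a set of topic terms. Real focused crawlers train classifiers (e.g. naive Bayes or SVMs) on text content or context graphs; the function name and scoring rule here are illustrative assumptions, not a documented technique.

```python
def anchor_relevance(anchor_text, topic_terms):
    """Score a link by the fraction of topic terms that appear
    in its anchor text (a crude stand-in for a trained classifier)."""
    words = set(anchor_text.lower().split())
    if not topic_terms:
        return 0.0
    return sum(1 for term in topic_terms if term in words) / len(topic_terms)
```

Such a score could serve as the `score` hook when prioritizing a crawl frontier.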
Semantic Focused Crawlers and Ontology-based Categorization
- Semantic focused crawlers use domain ontologies
- Ontologies can be updated during the crawling process
- Support vector machines are used to update ontological concepts
- Pages carrying semantic markup such as RDFa, Microformats, and Microdata can be crawled and extracted
- Bandit-based selection strategies are used for efficient crawling
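A bandit-based selection strategy treats crawl sources (e.g. hosts) as arms whose payoff is the fraction of relevant pages they yield. The epsilon-greedy sketch below is one simple bandit policy, chosen here for illustration under the assumption that per-host relevance statistics are tracked; the function name and `stats` layout are hypothetical.

```python
import random

def bandit_select(hosts, stats, epsilon=0.1, rng=random):
    """Epsilon-greedy choice of which host to crawl next.
    `stats` maps host -> (relevant_pages, fetched_pages)."""
    # With probability epsilon, explore a random host
    if rng.random() < epsilon:
        return rng.choice(hosts)
    # Otherwise exploit the host with the best observed relevance rate;
    # unexplored hosts (no fetches yet) are tried first
    def relevance_rate(host):
        relevant, total = stats.get(host, (0, 0))
        return relevant / total if total else float("inf")
    return max(hosts, key=relevance_rate)
```

With `epsilon=0` the choice is deterministic, which makes the exploit branch easy to verify.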
Factors Affecting the Performance of Focused Crawlers
- The richness of links within the target topic strongly affects crawler performance
- Focused crawling usually relies on a general web search engine to provide starting points (seeds)
- Seed selection significantly influences crawling efficiency
- A whitelist strategy restricts the crawl to a set of high-quality seed URLs and the pages reachable from them
- Seed-based performance studies help explain why focused crawling can succeed even on broad topics
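One simple reading of the whitelist strategy is a URL filter that admits only pages on trusted seed hosts. The sketch below assumes a host-level whitelist; real systems may whitelist at finer granularity (paths, URL patterns), so this is an illustrative simplification.

```python
from urllib.parse import urlparse

def allowed(url, whitelist_hosts):
    """Whitelist strategy: admit a URL only if its host
    belongs to the set of trusted seed hosts."""
    return urlparse(url).netloc in whitelist_hosts
```

Applied inside a crawl loop, such a filter keeps the frontier confined to the high-quality region the seeds define.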
Related Research and Techniques in Focused Crawling
- Studies have examined various crawl prioritization policies and their effect on the link popularity of fetched pages
- Breadth-first crawling from popular seed pages tends to collect high-PageRank pages early in the crawl
- Methods for detecting stale (poorly maintained) pages have been proposed to improve crawling
- Recognition of common areas in web pages using visual information
- Evolution of crawling strategy for academic search engines using whitelists and blacklists