Focused crawler

Definition and Purpose of Focused Crawlers
– A focused crawler collects web pages that meet specific criteria
– Prioritizes the crawl frontier and manages hyperlink exploration
– Predicates can be based on domain, topics, or page properties
– Focused crawlers can use web directories, text indexes, or backlinks
– Prediction of relevance is crucial before downloading unvisited pages

Techniques and Approaches in Focused Crawling
– Anchor text of links can be used as a predictor
– Topical crawling was introduced by Filippo Menczer
– Text classifiers are used to prioritize the crawl frontier
– Reinforcement learning is employed by some focused crawlers
– Context graphs and text content are used to train classifiers

Semantic Focused Crawlers and Ontology-based Categorization
– Semantic focused crawlers use domain ontologies
– Ontologies can be updated during the crawling process
– Support vector machines are used to update ontological concepts
– Markup languages like RDFa, Microformats, and Microdata are crawled
– Bandit-based selection strategies are used for efficient crawling

Factors Affecting the Performance of Focused Crawlers
– Richness of links in the specific topic affects performance
– Focused crawling relies on general web search engines for starting points
– Seed selection significantly influences crawling efficiency
– Whitelist strategy limits crawling to high-quality seed URLs
– Performance studies explain why focused crawling succeeds on broad topics

Related Research and Techniques in Focused Crawling
– Various crawl prioritization policies and their effects on link popularity
– Breadth-first crawling from popular seed pages collects large-PageRank pages
– Detection of stale pages has been reported for improved crawling
– Recognition of common areas in web pages using visual information
– Evolution of crawling strategy for academic search engines using whitelists and blacklists

Focused crawler (Wikipedia)

A focused crawler is a web crawler that collects Web pages that satisfy some specific property, by carefully prioritizing the crawl frontier and managing the hyperlink exploration process. Some predicates may be based on simple, deterministic and surface properties. For example, a crawler's mission may be to crawl pages from only the .jp domain. Other predicates may be softer or comparative, e.g., "crawl pages about baseball", or "crawl pages with large PageRank". An important page property pertains to topics, leading to 'topical crawlers'. For example, a topical crawler may be deployed to collect pages about solar power, swine flu, or even more abstract concepts like controversy while minimizing resources spent fetching pages on other topics. Crawl frontier management may not be the only device used by focused crawlers; they may use a Web directory, a Web text index, backlinks, or any other Web artifact.
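
The difference between a hard predicate and a soft one can be illustrated with a minimal sketch. The function names and the word-overlap scoring rule here are assumptions made for illustration, not part of any particular crawler; a real topical crawler would typically use a trained classifier instead of term counting.

```python
from urllib.parse import urlparse


def in_jp_domain(url: str) -> bool:
    """Hard predicate: true only if the URL's host is under the .jp domain."""
    host = (urlparse(url).hostname or "").lower()
    return host == "jp" or host.endswith(".jp")


def topic_score(text: str, topic_terms: set[str]) -> float:
    """Soft predicate (illustrative): fraction of words that are topic terms."""
    words = text.lower().split()
    if not words:
        return 0.0
    return sum(w in topic_terms for w in words) / len(words)
```

A hard predicate yields a yes/no answer that can be checked from the URL alone, while a soft predicate yields a score that the crawler can use to rank candidate pages against each other.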

A focused crawler must predict the probability that an unvisited page will be relevant before actually downloading the page. A possible predictor is the anchor text of links; this was the approach taken by Pinkerton in a crawler developed in the early days of the Web. Topical crawling was first introduced by Filippo Menczer. Chakrabarti et al. coined the term 'focused crawler' and used a text classifier to prioritize the crawl frontier. Andrew McCallum and co-authors also used reinforcement learning to focus crawlers. Diligenti et al. traced the context graph leading up to relevant pages, and their text content, to train classifiers. A form of online reinforcement learning has been used, along with features extracted from the DOM tree and text of linking pages, to continually train classifiers that guide the crawl. In a review of topical crawling algorithms, Menczer et al. show that simple strategies are very effective for short crawls, while more sophisticated techniques such as reinforcement learning and evolutionary adaptation can give the best performance over longer crawls. It has been shown that spatial information is important for classifying Web documents.
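
As a rough sketch of how anchor text can prioritize the crawl frontier, the example below runs a best-first crawl over a small in-memory link graph. Everything in it is hypothetical: `TOY_WEB`, `TOPIC_TERMS`, and the term-overlap score stand in for real pages and for the trained classifiers used in the work cited above.

```python
import heapq

# Hypothetical toy Web: page -> list of (anchor text, target page).
TOY_WEB = {
    "seed": [("solar power basics", "a"), ("celebrity gossip", "b")],
    "a": [("solar panel prices", "c"), ("contact us", "d")],
    "b": [("more gossip", "e")],
    "c": [], "d": [], "e": [],
}

# Assumed topic vocabulary for the illustration.
TOPIC_TERMS = {"solar", "power", "panel"}


def anchor_score(anchor: str) -> float:
    """Predict relevance of an unvisited page from its anchor text alone."""
    words = anchor.lower().split()
    return sum(w in TOPIC_TERMS for w in words) / len(words) if words else 0.0


def crawl(seed: str, budget: int) -> list[str]:
    """Visit up to `budget` pages, always expanding the best-scored link."""
    # heapq is a min-heap, so scores are negated; the counter breaks ties
    # in favor of links discovered earlier.
    frontier = [(-1.0, 0, seed)]
    seen = {seed}
    order = []
    counter = 1
    while frontier and len(order) < budget:
        _, _, url = heapq.heappop(frontier)
        order.append(url)  # stands in for downloading the page
        for anchor, target in TOY_WEB.get(url, []):
            if target not in seen:
                seen.add(target)
                heapq.heappush(frontier, (-anchor_score(anchor), counter, target))
                counter += 1
    return order


print(crawl("seed", 4))  # ['seed', 'a', 'c', 'b']
```

With a budget of four pages, the crawl follows the two solar-related links before it touches any off-topic page, which is the behavior a focused crawler aims for when resources are limited.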

Another type of focused crawler is the semantic focused crawler, which uses domain ontologies to represent topical maps and to link Web pages with relevant ontological concepts for selection and categorization. In addition, ontologies can be updated automatically during the crawling process. Dong et al. introduced such an ontology-learning-based crawler, which uses a support vector machine to update the content of ontological concepts while crawling Web pages.

Crawlers are also focused on page properties other than topics. Cho et al. study a variety of crawl prioritization policies and their effects on the link popularity of fetched pages. Najork and Wiener show that breadth-first crawling, starting from popular seed pages, leads to collecting large-PageRank pages early in the crawl. Refinements involving detection of stale (poorly maintained) pages have been reported by Eiron et al. A kind of semantic focused crawler that makes use of the idea of reinforcement learning has been introduced by Meusel et al.; it uses online classification algorithms in combination with a bandit-based selection strategy to efficiently crawl pages with markup languages like RDFa, Microformats, and Microdata.
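
To give a feel for what a bandit-based selection strategy looks like, here is a minimal epsilon-greedy sketch. It is not the algorithm of Meusel et al.; epsilon-greedy is simply one common bandit strategy, and the class name, the arms (for example, hosts to crawl next), and the reward definition (1.0 if a fetched page contains the wanted markup, else 0.0) are all assumptions made for illustration.

```python
import random


class EpsilonGreedySelector:
    """Choose which arm (e.g. a host) to fetch from next."""

    def __init__(self, arms, epsilon=0.1, seed=0):
        self.epsilon = epsilon
        self.rng = random.Random(seed)
        self.pulls = {arm: 0 for arm in arms}
        self.reward_sum = {arm: 0.0 for arm in arms}

    def mean(self, arm):
        """Average reward observed so far for this arm (0.0 if untried)."""
        n = self.pulls[arm]
        return self.reward_sum[arm] / n if n else 0.0

    def select(self):
        arms = list(self.pulls)
        # Try every arm once before trusting the averages.
        untried = [a for a in arms if self.pulls[a] == 0]
        if untried:
            return untried[0]
        # Explore with probability epsilon, otherwise exploit the best arm.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(arms)
        return max(arms, key=self.mean)

    def update(self, arm, reward):
        """Record the reward observed after fetching a page from this arm."""
        self.pulls[arm] += 1
        self.reward_sum[arm] += reward
```

In a crawl loop, the crawler would call `select()` to pick a host, fetch one page from it, and then call `update()` with the observed reward. Exploration keeps the crawler from abandoning a host after a single unlucky page, while exploitation directs most of the fetches toward hosts that have paid off so far.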

The performance of a focused crawler depends on the richness of links in the specific topic being searched, and focused crawling usually relies on a general web search engine for providing starting points. Davison presented studies on Web links and text that explain why focused crawling succeeds on broad topics; similar studies were presented by Chakrabarti et al. Seed selection can be important for focused crawlers and can significantly influence crawling efficiency. A whitelist strategy is to start the focused crawl from a list of high-quality seed URLs and limit the crawling scope to the domains of these URLs. These high-quality seeds should be selected from a list of URL candidates accumulated over a sufficiently long period of general web crawling. The whitelist should be updated periodically after it is created.
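
The whitelist strategy can be sketched as a scope check applied to every discovered link. The helper names and the choice to treat subdomains of a seed's host as in scope are assumptions for this illustration; a production crawler might instead compare registrable domains using a public-suffix list.

```python
from urllib.parse import urlparse


def build_whitelist(seed_urls):
    """Collect the host names of the seed URLs."""
    hosts = set()
    for url in seed_urls:
        host = urlparse(url).hostname
        if host:
            hosts.add(host.lower())
    return hosts


def in_scope(url, whitelist):
    """True if the URL's host is a whitelisted host or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in whitelist)
```

For example, with seeds on `example.edu`, a link to `cs.example.edu` would be followed, while a link to `notexample.edu` would be dropped before it ever reaches the frontier.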
