Glossary Term
Spider trap
Definition and Purpose of Spider Traps
- Spider traps are sets of web pages that cause a web crawler to make an effectively infinite number of requests or, in some cases, to crash.
- They can be created intentionally or unintentionally.
- Intentional traps are used to catch spambots or other crawlers that waste a website's bandwidth.
- Unintentional traps often arise from dynamically generated pages, such as calendars that always link to a next date, or from algorithmically generated L=A=N=G=U=A=G=E poetry (see the sketch after this list).
- There is no algorithm to detect all spider traps.
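To make the calendar case concrete, here is a minimal sketch of an unintentional trap: a tiny site whose every page links to the "next month," so its URL space is unbounded even though it holds no real content. It uses only the Python standard library; the /calendar path, port, and start date are illustrative assumptions rather than details from any real site.

```python
# Sketch of an unintentional calendar-style spider trap (illustrative).
# Every page links to the next month, so a naive crawler that follows
# every link will keep requesting pages forever.
from http.server import BaseHTTPRequestHandler, HTTPServer

class CalendarHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # Expect paths like /calendar/2024/05; fall back to a start month.
        parts = [p for p in self.path.split("/") if p]
        try:
            year, month = int(parts[1]), int(parts[2])
        except (IndexError, ValueError):
            year, month = 2024, 1
        # There is always a "next month" link -- this is the trap.
        ny, nm = (year + 1, 1) if month == 12 else (year, month + 1)
        body = (f"<html><body><h1>{year}-{month:02d}</h1>"
                f'<a href="/calendar/{ny}/{nm:02d}">next month</a>'
                f"</body></html>").encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), CalendarHandler).serve_forever()
```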
Impact on Web Crawlers
- Spider traps waste a crawler's resources and lower its productivity, since requests are spent on pages with no useful content.
- Poorly written crawlers can crash outright when they encounter a trap.
- Polite web crawlers are affected to a lesser degree than impolite ones, because rate limiting bounds how quickly a trap can consume their time and bandwidth.
- A legitimate, polite bot avoids a trap entirely when the site marks it in robots.txt, as described under "Politeness and Spider Traps" below; a defensive heuristic is sketched after this list.
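Because no algorithm detects every trap, crawlers fall back on defensive heuristics. The sketch below shows one common combination, with every threshold an illustrative assumption: cap the link depth, cap the number of pages fetched per host, and reject suspiciously long URLs (a symptom of traps that grow the URL on each hop).

```python
# Defensive heuristics against spider traps (a sketch; the limits are
# illustrative assumptions, not values from any particular crawler).
from collections import defaultdict
from urllib.parse import urlparse

MAX_DEPTH = 10            # how many links deep we are willing to follow
MAX_PAGES_PER_HOST = 500  # budget of fetches per host
MAX_URL_LENGTH = 2000     # ever-growing URLs are a classic trap symptom

pages_per_host: defaultdict[str, int] = defaultdict(int)

def should_fetch(url: str, depth: int) -> bool:
    """Return False for URLs that look like they lead into a trap."""
    if depth > MAX_DEPTH:
        return False                 # suspiciously deep link chain
    if len(url) > MAX_URL_LENGTH:
        return False                 # URL keeps accreting path segments
    host = urlparse(url).netloc
    if pages_per_host[host] >= MAX_PAGES_PER_HOST:
        return False                 # this host has used up its budget
    pages_per_host[host] += 1
    return True
```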
Politeness and Spider Traps
- Polite web crawlers alternate requests between different hosts.
- They do not request documents from the same server more than once every few seconds.
- Politeness therefore limits how much of a crawler's time a single trap can consume.
- Sites commonly list trap URLs in robots.txt, which keeps polite bots that honor the file out of the trap altogether.
- Impolite bots that disregard robots.txt are still caught by the trap (see the sketch after this list).
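The sketch below puts both halves of that politeness story together, using Python's standard urllib.robotparser: check the host's robots.txt before fetching (so a trap listed under Disallow is never entered) and enforce a minimum delay between requests to the same host. The user-agent name ExampleBot and the five-second delay are hypothetical choices, not standard values.

```python
# A polite pre-fetch check: honor robots.txt and rate-limit per host.
# ExampleBot and the 5-second delay are illustrative assumptions.
import time
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleBot"
MIN_DELAY = 5.0  # seconds between requests to the same host

_robots: dict[str, RobotFileParser] = {}
_last_hit: dict[str, float] = {}

def polite_allowed(url: str) -> bool:
    """Check robots.txt, then wait out the per-host delay before fetching."""
    host = urlparse(url).netloc
    if host not in _robots:
        rp = RobotFileParser(f"https://{host}/robots.txt")
        rp.read()                    # fetch and parse the host's robots.txt
        _robots[host] = rp
    if not _robots[host].can_fetch(USER_AGENT, url):
        return False                 # the site excluded us, e.g. a trap path
    wait = MIN_DELAY - (time.monotonic() - _last_hit.get(host, 0.0))
    if wait > 0:
        time.sleep(wait)             # never hammer a single server
    _last_hit[host] = time.monotonic()
    return True
```

A crawler that gates every fetch through a check like this stays out of marked traps entirely and, even inside an unmarked one, wastes resources far more slowly than an impolite crawler would.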
Related Concepts
- The robots exclusion standard (robots.txt) is the mechanism sites use to steer well-behaved crawlers away from traps.
- Web crawlers are the programs that spider traps ensnare.
- Broader search-and-retrieval technologies sometimes listed alongside include Z39.50, Search/Retrieve Web Service (SRW), Search/Retrieve via URL (SRU), OpenSearch, Representational State Transfer (REST), and the wide area information server (WAIS).
References
- Techopedia provides information on spider traps.
- Neil M. Hennessy's work discusses L=A=N=G=U=A=G=E poetry on the web.
- Portent is a source for spider trap information.
- Thesitewizard.com provides guidance on setting up robots.txt.
- The DEV Community offers insights on building a polite web crawler.