Definition and Purpose of Spider Traps
– Spider traps are sets of web pages that can cause web crawlers to make an unbounded number of requests or, if poorly constructed, to crash.
– They can be intentionally or unintentionally created.
– Intentional traps are used to catch spambots or other crawlers that waste a website's bandwidth.
– They can be unintentionally created by calendars or algorithmically generated language poetry.
– There is no algorithm to detect all spider traps.
Impact on Web Crawlers
– Spider traps waste the resources of web crawlers.
– They lower the productivity of web crawlers.
– Poorly written crawlers can crash when encountering spider traps.
– Polite web crawlers are affected to a lesser degree than impolite ones.
– When an intentional trap is excluded in robots.txt, a legitimate polite bot does not fall into it at all.
Politeness and Spider Traps
– Polite web crawlers alternate requests between different hosts.
– Polite crawlers do not request documents from the same server too frequently.
– Politeness helps reduce the impact of spider traps on web crawlers (see the sketch after this list).
– The use of robots.txt can prevent polite bots from falling into traps.
– Impolite bots that disregard robots.txt are still caught by such traps.
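These rules can be sketched in a few lines of Python using only the standard library. The bot name, delay value, and helper functions below are illustrative assumptions rather than any particular crawler's implementation; the robots.txt handling relies on urllib.robotparser. Because the crawler consults robots.txt before requesting anything, a trap the site owner has fenced off with a Disallow rule is simply never visited.

```python
import time
from urllib.parse import urlsplit
from urllib.request import Request, urlopen
from urllib.robotparser import RobotFileParser

USER_AGENT = "ExampleBot"   # hypothetical bot name, for illustration only
PER_HOST_DELAY = 5.0        # assumed minimum seconds between hits to one host

_robots = {}                # host -> parsed robots.txt rules
_last_hit = {}              # host -> time of the previous request to that host

def allowed(url):
    """Honor robots.txt: a deliberately placed trap that the site owner
    disallows (e.g. a `Disallow:` rule) is never requested at all."""
    host = urlsplit(url).netloc
    if host not in _robots:
        rp = RobotFileParser(f"https://{host}/robots.txt")
        try:
            rp.read()
        except OSError:
            pass            # robots.txt unreachable; rules stay empty (permissive)
        _robots[host] = rp
    return _robots[host].can_fetch(USER_AGENT, url)

def polite_fetch(url):
    """Fetch a URL, never hitting the same host more than once per delay window."""
    host = urlsplit(url).netloc
    last = _last_hit.get(host)
    if last is not None:
        wait = PER_HOST_DELAY - (time.monotonic() - last)
        if wait > 0:
            time.sleep(wait)   # a real crawler would requeue and switch hosts instead
    _last_hit[host] = time.monotonic()
    req = Request(url, headers={"User-Agent": USER_AGENT})
    with urlopen(req, timeout=10) as resp:
        return resp.read()

# Usage: check robots.txt first, then fetch at a polite rate.
# if allowed(url):
#     html = polite_fetch(url)
```

A production crawler would also interleave requests to different hosts from a queue rather than sleeping, but the per-host delay above captures the same politeness constraint.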
Related Concepts
– Robots exclusion standard is related to spider traps.
– Web crawlers are closely associated with spider traps.
– Z39.50, Search/Retrieve Web Service, and Search/Retrieve via URL are related technologies.
– OpenSearch and Representational State Transfer (REST) are relevant concepts.
– Wide area information server (WAIS) is another related technology.
References
– Techopedia provides information on spider traps.
– Neil M Hennessy’s work discusses L=A=N=G=U=A=G=E poetry on the web.
– Portent is a source for spider trap information.
– Thesitewizard.com provides guidance on setting up robots.txt.
– The DEV Community offers insights on building a polite web crawler.
A spider trap (or crawler trap) is a set of web pages that may intentionally or unintentionally be used to cause a web crawler or search bot to make an infinite number of requests or cause a poorly constructed crawler to crash. Web crawlers are also called web spiders, from which the name is derived. Spider traps may be created to "catch" spambots or other crawlers that waste a website's bandwidth. They may also be created unintentionally by calendars that use dynamic pages with links that continually point to the next day or year.
Common techniques used include the following (a minimal example of a dynamic calendar trap follows the list):
- Indefinitely deep directory structures, such as http://example.com/bar/foo/bar/foo/bar/foo/bar/...
- Dynamic pages that produce an unbounded number of documents for a web crawler to follow; examples include calendars and algorithmically generated language poetry.
- Documents filled with a large number of characters, crashing the lexical analyzer that parses the document.
- Documents with session IDs based on required cookies.
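As an illustration of the dynamic-page case, the sketch below uses only Python's standard library to serve a calendar in which every page links to the "next day", so the set of reachable URLs never ends. The path scheme and port are invented for the example.

```python
from datetime import date, timedelta
from http.server import BaseHTTPRequestHandler, HTTPServer

class CalendarTrap(BaseHTTPRequestHandler):
    """Unintentional trap: every page links to the 'next day', forever."""

    def do_GET(self):
        # Expect paths like /calendar/2024-05-01; anything else starts from today.
        try:
            day = date.fromisoformat(self.path.rsplit("/", 1)[-1])
        except ValueError:
            day = date.today()
        nxt = day + timedelta(days=1)
        body = f'<a href="/calendar/{nxt.isoformat()}">next day</a>'.encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(body)

if __name__ == "__main__":
    HTTPServer(("localhost", 8000), CalendarTrap).serve_forever()
```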
There is no algorithm to detect all spider traps. Some classes of traps can be detected automatically, but new, unrecognized traps arise quickly.
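The classes of traps that can be detected automatically are typically screened with simple crawler-side heuristics, such as capping URL depth, flagging repeated path segments, and ignoring session-ID query parameters. The thresholds and function name below are illustrative assumptions rather than a standard algorithm, and genuinely new traps will still slip past checks like these.

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit

# Illustrative thresholds; real crawlers tune these per deployment.
MAX_PATH_DEPTH = 12
MAX_SEGMENT_REPEATS = 3
SESSION_PARAMS = {"sessionid", "sid", "phpsessid", "jsessionid"}

def looks_like_trap(url: str) -> bool:
    """Heuristic screen for URLs that are likely part of a crawler trap."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]

    # 1. Indefinitely deep paths.
    if len(segments) > MAX_PATH_DEPTH:
        return True

    # 2. The same path segment repeated many times (e.g. /bar/foo/bar/foo/...).
    if segments and max(Counter(segments).values()) > MAX_SEGMENT_REPEATS:
        return True

    # 3. Session IDs in the URL multiply otherwise-identical documents.
    if SESSION_PARAMS & {k.lower() for k in parse_qs(parts.query)}:
        return True

    return False

# Example: the repeating path pattern from the list above is flagged.
assert looks_like_trap("http://example.com/bar/foo/bar/foo/bar/foo/bar/foo")
assert not looks_like_trap("http://example.com/articles/2024/spider-traps")
```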