Glossary Term
robots.txt
History and Standardization of robots.txt
- Proposed by Martijn Koster on the www-talk mailing list in February 1994
- Prompted by a badly behaved web crawler written by Charles Stross
- Became a de facto standard for web crawlers
- Google proposed it as a formal Internet standard in July 2019; the IETF published it as RFC 9309 in September 2022
- The robots.txt file is placed at the root of the website hierarchy (e.g., https://www.example.com/robots.txt)
- Contains instructions for web robots in a simple, line-based text format (see the sketch after this list)
- Instructs robots which web pages they may and may not access
- Primarily honored by web crawlers from search engines such as Google
- Each subdomain, and each protocol and port, needs its own robots.txt file
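A minimal sketch of the format; the paths and the ExampleBot name are illustrative, not part of any real site:

```
# Applies to all crawlers: keep out of /private/
User-agent: *
Disallow: /private/

# Applies only to one hypothetical crawler: exclude it from the whole site
User-agent: ExampleBot
Disallow: /
```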
Security and Limitations of robots.txt
- Compliance with robots.txt is voluntary; the file is advisory, not an enforcement mechanism
- Malicious web robots are free to ignore its instructions, and some do
- Security through obscurity is discouraged by standards bodies
- NIST specifically recommends against relying on secrecy for system security
- Because robots.txt publicly lists the very paths a site wants hidden, it should never be relied upon as a security control
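Compliance is implemented entirely on the crawler side. A minimal sketch using Python's standard urllib.robotparser, with a placeholder domain and bot name:

```python
from urllib import robotparser

# Download and parse the site's robots.txt (example.com is a placeholder).
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# A polite crawler consults the parsed rules before every request;
# nothing in the protocol stops a rogue crawler from skipping this check.
url = "https://example.com/private/report.html"
if rp.can_fetch("ExampleBot", url):
    print("robots.txt permits fetching", url)
else:
    print("robots.txt disallows fetching", url)
```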
Alternatives to robots.txt
- Robots identify themselves by the user-agent string they pass to the web server
- The server can be configured to return a failure response or alternative content to specific user agents (see the sketch after this list)
- Some sites have humans.txt files for human-readable information
- Some sites redirect humans.txt to an About page
- Google formerly hosted a joke file, killer-robots.txt, instructing the Terminator not to kill the company founders
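A minimal sketch of the server-side approach, using Apache's mod_rewrite; the bot name ExampleBadBot is hypothetical:

```apache
# Refuse every request whose User-Agent matches the hypothetical
# crawler name "ExampleBadBot" (case-insensitive) with 403 Forbidden.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ExampleBadBot [NC]
RewriteRule .* - [F,L]
```

Unlike robots.txt, this refusal is enforced by the server itself, so it applies even to crawlers that ignore the exclusion protocol.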
Nonstandard extensions of robots.txt
- The nonstandard Crawl-delay directive lets site owners throttle how often bots visit
- Interpretation of the crawl-delay value varies by crawler
- Yandex treats the value as the number of seconds to wait between successive visits
- Bing defines it as the size of a time window during which BingBot will access the site only once
- Google does not honor Crawl-delay; it instead provides a Search Console interface for controlling Googlebot's visit rate
- Some crawlers support a Sitemap directive in robots.txt
- Multiple Sitemap entries may appear in the same robots.txt file
- Sitemaps are specified as full URLs (see the combined sketch after this list)
- Google Search Console also provides an interface for submitting and managing Sitemaps
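A sketch of a robots.txt combining both extensions; the host and sitemap URLs are placeholders:

```
User-agent: *
# Meaning varies: seconds between visits (Yandex) or a one-visit time window (Bing)
Crawl-delay: 10

# Full URLs; several Sitemap lines may appear in one file
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml
```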
Meta tags, headers, and related concepts
- Robots exclusion directives can also be applied through meta tags and HTTP response headers (see the sketches after this list)
- Robots meta tags cannot be used for non-HTML files such as images or PDFs
- The X-Robots-Tag HTTP header can be attached to non-HTML files via server configuration, e.g., .htaccess or httpd.conf
- A noindex meta tag excludes an HTML page from indexing
- A noindex value in the X-Robots-Tag response header does the same for any file type
- ads.txt: a standard for listing authorized ad sellers.
- security.txt: a file for reporting security vulnerabilities.
- Automated Content Access Protocol: a failed proposal to extend robots.txt.
- BotSeer: an inactive search engine for robots.txt files.
- Distributed web crawling: a technique for distributing web crawling tasks.
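Minimal sketches of the meta-tag and header mechanisms described above; the PDF filename is a placeholder:

```html
<!-- In the <head> of an HTML page: ask crawlers not to index it -->
<meta name="robots" content="noindex">
```

```apache
# Apache (mod_headers) sketch: attach the same directive to a non-HTML
# file, where a meta tag is impossible, via an HTTP response header.
<Files "annual-report.pdf">
  Header set X-Robots-Tag "noindex"
</Files>
```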