History and Standardization of robots.txt
– Proposed by Martijn Koster on the www-talk mailing list in February 1994
– Provoked by Charles Stross’ badly-behaved web crawler
– Became a de facto standard for web crawlers
– Proposed to the IETF for formal standardization by Google in July 2019; published as RFC 9309 in September 2022
– Robots.txt file is placed in the root of the website hierarchy
– Contains instructions for web robots in a specific format
– Instructs robots which web pages they can and cannot access
– Important for web crawlers from search engines like Google
– Each subdomain and protocol/port needs its own robots.txt file
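The format described above can be illustrated with a minimal robots.txt, served from the site root (the paths and bot name here are hypothetical):

```
# Applies to all crawlers
User-agent: *
Disallow: /private/
Disallow: /tmp/

# Applies only to one specific crawler
User-agent: ExampleBot
Disallow: /
```

Each record pairs one or more User-agent lines with the Disallow (and, in many implementations, Allow) rules that apply to those robots.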
Security and Limitations of robots.txt
– Robots.txt compliance is voluntary and advisory
– Malicious web robots may ignore robots.txt instructions
– Security through obscurity is discouraged by standards bodies
– NIST recommends against relying on secrecy for system security
– Robots.txt should not be solely relied upon for security purposes
Alternatives to robots.txt
– Robots identify themselves to the web server with a User-Agent request header
– The server can be configured to return an error or alternative content to specific user agents
– Some sites have humans.txt files for human-readable information
– Some sites redirect humans.txt to an About page
– Google previously hosted a joke killer-robots.txt file instructing the Terminator not to harm the company founders
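A sketch of the server-side alternative, assuming nginx and a hypothetical crawler name; note that ill-behaved bots can forge their User-Agent, so this is not a strong defense:

```
# Inside an nginx server block: refuse a specific (hypothetical) crawler
if ($http_user_agent ~* "ExampleBadBot") {
    return 403;
}
```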
Nonstandard extensions of robots.txt
– Crawl-delay directive allows throttling of bot visits
– Interpretation of crawl-delay value depends on the crawler
– Yandex uses crawl-delay as the number of seconds to wait between visits
– Bing defines crawl-delay as the size of a time window for accessing a site
– Google provides a search console interface for controlling bot visits
– Some crawlers support the Sitemap directive in robots.txt
– Allows multiple Sitemaps in the same robots.txt file
– Sitemaps are specified with full URLs
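The nonstandard directives above might appear together in a robots.txt like this (the sitemap URLs are made up for illustration; Google ignores Crawl-delay in favor of its Search Console setting):

```
User-agent: *
Crawl-delay: 10

Sitemap: https://example.com/sitemap-pages.xml
Sitemap: https://example.com/sitemap-news.xml
```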
Meta tags, headers, and related concepts
– Robots exclusion directives can also be applied through meta tags and HTTP headers
– Robots meta tags cannot be used for non-HTML files
– An X-Robots-Tag header can be added to responses for non-HTML files via .htaccess or httpd.conf
– A noindex meta tag can be used to exclude a page from indexing
– A noindex HTTP response header can also be used to exclude a page from indexing
– ads.txt: a standard for listing authorized ad sellers
– security.txt: a file telling researchers how to report security vulnerabilities
– Automated Content Access Protocol: a failed proposal to extend robots.txt
– BotSeer: an inactive search engine for robots.txt files
– Distributed web crawling: a technique for spreading crawling tasks across many machines
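For illustration, the same noindex directive expressed as a meta tag versus an HTTP header (the .htaccess fragment assumes Apache with mod_headers enabled):

```html
<!-- In the <head> of an HTML page -->
<meta name="robots" content="noindex">
```

```
# In .htaccess: send X-Robots-Tag for non-HTML files such as PDFs
<FilesMatch "\.pdf$">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>
```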
robots.txt is the filename used for implementing the Robots Exclusion Protocol, a standard used by websites to indicate to visiting web crawlers and other web robots which portions of the website they are allowed to visit.
This relies on voluntary compliance. Not all robots comply with the standard; indeed, email harvesters, spambots, malware and robots that scan for security vulnerabilities may very well start with the portions of the website they have been asked (by the Robots Exclusion Protocol) to stay out of.
The "robots.txt" file can be used in conjunction with sitemaps, another robot inclusion standard for websites.
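As a sketch of how a well-behaved robot honors these rules, Python's standard urllib.robotparser can evaluate a robots.txt before fetching; the rules and URLs below are hypothetical:

```python
# A polite crawler checks robots.txt before fetching a URL.
# Here the rules are parsed from an inline string for illustration.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /private/
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Allowed: not under /private/
print(parser.can_fetch("MyCrawler", "https://example.com/public/page.html"))   # True
# Disallowed: matches the Disallow rule
print(parser.can_fetch("MyCrawler", "https://example.com/private/data.html"))  # False
```

In a real crawler one would call `parser.set_url(...)` and `parser.read()` to fetch the live file, then consult `can_fetch()` before every request.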