Glossary Term
robots.txt
History and Standardization of robots.txt
- Proposed by Martijn Koster on the www-talk mailing list in February 1994
- Prompted by a badly behaved web crawler written by Charles Stross
- Became a de facto standard for web crawlers
- Google proposed it as a formal Internet standard in July 2019; the IETF published it as RFC 9309 in September 2022
- The robots.txt file is placed at the root of the website hierarchy (e.g., https://www.example.com/robots.txt)
- Contains instructions for web robots in a simple, line-based text format (see the sketch after this list)
- Instructs robots which web pages they may and may not access
- Primarily honored by web crawlers from search engines such as Google
- Each subdomain, and each protocol and port, needs its own robots.txt file
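A minimal sketch of the format; the paths and the ExampleBot name are illustrative, not part of any real site:

```
# Applies to all crawlers: keep out of /private/
User-agent: *
Disallow: /private/

# Applies only to one hypothetical crawler: exclude it from the whole site
User-agent: ExampleBot
Disallow: /
```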
Security and Limitations of robots.txt
- Compliance with robots.txt is voluntary; the file is advisory, not an enforcement mechanism
- Malicious web robots are free to ignore its instructions, and some do
- Security through obscurity is discouraged by standards bodies
- NIST specifically recommends against relying on secrecy for system security
- Because robots.txt publicly lists the very paths a site wants hidden, it should never be relied upon as a security control
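Compliance is implemented entirely on the crawler side. A minimal sketch using Python's standard urllib.robotparser, with a placeholder domain and bot name:

```python
from urllib import robotparser

# Download and parse the site's robots.txt (example.com is a placeholder).
rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

# A polite crawler consults the parsed rules before every request;
# nothing in the protocol stops a rogue crawler from skipping this check.
url = "https://example.com/private/report.html"
if rp.can_fetch("ExampleBot", url):
    print("robots.txt permits fetching", url)
else:
    print("robots.txt disallows fetching", url)
```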
Alternatives to robots.txt
- Robots identify themselves by the user-agent string they pass to the web server
- The server can be configured to return a failure response or alternative content to specific user agents (see the sketch after this list)
- Some sites have humans.txt files for human-readable information
- Some sites redirect humans.txt to an About page
- Google formerly hosted a joke file, killer-robots.txt, instructing the Terminator not to kill the company founders
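A minimal sketch of the server-side approach, using Apache's mod_rewrite; the bot name ExampleBadBot is hypothetical:

```apache
# Refuse every request whose User-Agent matches the hypothetical
# crawler name "ExampleBadBot" (case-insensitive) with 403 Forbidden.
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ExampleBadBot [NC]
RewriteRule .* - [F,L]
```

Unlike robots.txt, this refusal is enforced by the server itself, so it applies even to crawlers that ignore the exclusion protocol.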
Nonstandard extensions of robots.txt
- The nonstandard Crawl-delay directive lets site owners throttle how often bots visit
- Interpretation of the crawl-delay value varies by crawler
- Yandex treats the value as the number of seconds to wait between successive visits
- Bing defines it as the size of a time window during which BingBot will access the site only once
- Google does not honor Crawl-delay; it instead provides a Search Console interface for controlling Googlebot's visit rate
- Some crawlers support a Sitemap directive in robots.txt
- Multiple Sitemap entries may appear in the same robots.txt file
- Sitemaps are specified as full URLs (see the combined sketch after this list)
- Google Search Console also provides an interface for submitting and managing Sitemaps
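A sketch of a robots.txt combining both extensions; the host and sitemap URLs are placeholders:

```
User-agent: *
# Meaning varies: seconds between visits (Yandex) or a one-visit time window (Bing)
Crawl-delay: 10

# Full URLs; several Sitemap lines may appear in one file
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/news-sitemap.xml
```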
Meta tags, headers, and related concepts
- Robots exclusion directives can also be applied through meta tags and HTTP response headers (see the sketches after this list)
- Robots meta tags cannot be used for non-HTML files such as images or PDFs
- The X-Robots-Tag HTTP header can be attached to non-HTML files via server configuration, e.g., .htaccess or httpd.conf
- A noindex meta tag excludes an HTML page from indexing
- A noindex value in the X-Robots-Tag response header does the same for any file type
- ads.txt: a standard for listing authorized ad sellers.
- security.txt: a file for reporting security vulnerabilities.
- Automated Content Access Protocol: a failed proposal to extend robots.txt.
- BotSeer: an inactive search engine for robots.txt files.
- Distributed web crawling: a technique for distributing web crawling tasks.
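Minimal sketches of the meta-tag and header mechanisms described above; the PDF filename is a placeholder:

```html
<!-- In the <head> of an HTML page: ask crawlers not to index it -->
<meta name="robots" content="noindex">
```

```apache
# Apache (mod_headers) sketch: attach the same directive to a non-HTML
# file, where a meta tag is impossible, via an HTTP response header.
<Files "annual-report.pdf">
  Header set X-Robots-Tag "noindex"
</Files>
```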