This specification is published and recommended for implementation. Backwards-compatible additions may occur in MINOR versions; breaking changes only in MAJOR versions, with deprecation notice. See specification conventions for status definitions.
robots-ai.txt Specification
AI Crawler-Specific Access Directives File Format
This specification defines the structure and requirements for robots-ai.txt files — plain text files that provide AI crawler-specific access directives. The file follows robots.txt syntax conventions and supplements the standard robots.txt with targeted rules for AI training and inference crawlers.
§1 Overview
What This File Does
The robots-ai.txt file provides supplementary crawler access directives specifically for AI systems. While standard robots.txt applies to all crawlers, robots-ai.txt allows site owners to declare AI-specific policies:
- Different rules for AI training crawlers vs. search crawlers
- Granular control over which AI systems can access content
- Crawl rate preferences for AI-specific bots
- Commentary explaining the intent behind rules
Why It Matters for AI Visibility
The proliferation of AI crawlers with different user agents has made crawler management complex. Site owners may want to:
- Allow AI search systems but block training crawlers
- Permit some AI companies but block others
- Set AI-specific rate limits
- Document their AI crawler policy clearly
The robots-ai.txt file supplements but does not replace the standard robots.txt. AI crawlers should respect robots.txt first. This file provides additional, AI-specific guidance.
§2 File Location
Primary Location
The robots-ai.txt file MUST be placed in the website's root directory:
https://example.com/robots-ai.txt
URL Requirements
- The file MUST be served with content type text/plain; charset=utf-8
- The URL MUST be accessible without authentication
- HTTPS is strongly recommended
Relationship to robots.txt
The robots-ai.txt file lives alongside robots.txt but at a different path:
https://example.com/robots.txt # Standard robots exclusion
https://example.com/robots-ai.txt # AI-specific supplementary rules
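The location and content-type requirements above can be sketched as two small helpers. This is a minimal illustration, not a reference implementation: the function names `robots_ai_url` and `has_required_content_type` are hypothetical, and the content-type check assumes a strict reading in which a missing charset parameter fails.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_ai_url(page_url: str) -> str:
    """Return the root-level robots-ai.txt URL for the host serving page_url."""
    parts = urlsplit(page_url)
    if not parts.scheme or not parts.netloc:
        raise ValueError(f"not an absolute URL: {page_url!r}")
    # Keep scheme and host; drop the path, query string and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots-ai.txt", "", ""))


def has_required_content_type(header_value: str) -> bool:
    """Check a Content-Type header value against text/plain; charset=utf-8."""
    media_type, _, params = header_value.partition(";")
    if media_type.strip().lower() != "text/plain":
        return False
    # Parameters are name=value pairs separated by semicolons.
    for param in params.split(";"):
        name, _, value = param.partition("=")
        if name.strip().lower() == "charset":
            return value.strip().strip('"').lower() == "utf-8"
    # Assumption: a response without an explicit charset does not qualify.
    return False
```

Neither helper performs a network request; a real checker would also confirm that the URL is reachable without authentication.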
§3 Format Specification
File Format
| Property | Requirement |
|---|---|
| Encoding | UTF-8 (required) |
| Line endings | LF (Unix-style) recommended; CRLF accepted |
| Syntax | robots.txt syntax conventions |
| Comments | Lines starting with # are comments |
Basic Structure
# robots-ai.txt for Example Company
# Supplementary AI crawler directives
User-agent: GPTBot
Allow: /
Disallow: /private/
User-agent: CCBot
Disallow: /
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
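A publisher who keeps crawler policy in structured form could generate a file with this shape. The sketch below is one possible approach; the function name `render_robots_ai` and the dictionary keys (`user_agent`, `allow`, `disallow`, `crawl_delay`) are assumptions made for illustration and are not defined by this specification.

```python
def render_robots_ai(groups, sitemap=None, comments=()):
    """Render rule groups as robots-ai.txt text with LF line endings."""
    lines = [f"# {text}" for text in comments]
    for group in groups:
        if lines:
            lines.append("")  # blank line between groups for readability
        lines.append(f"User-agent: {group['user_agent']}")
        lines.extend(f"Allow: {path}" for path in group.get("allow", ()))
        lines.extend(f"Disallow: {path}" for path in group.get("disallow", ()))
        if "crawl_delay" in group:
            lines.append(f"Crawl-delay: {group['crawl_delay']}")
    if sitemap:
        if lines:
            lines.append("")
        lines.append(f"Sitemap: {sitemap}")
    return "\n".join(lines) + "\n"


example = render_robots_ai(
    [
        {"user_agent": "GPTBot", "allow": ["/"], "disallow": ["/private/"]},
        {"user_agent": "CCBot", "disallow": ["/"], "crawl_delay": 10},
    ],
    sitemap="https://example.com/sitemap.xml",
    comments=["robots-ai.txt for Example Company"],
)
```

When writing the result to disk, encode it as UTF-8 so that the file meets the encoding requirement in the table above.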
Syntax Rules
This file follows standard robots.txt syntax:
- User-agent: identifies the crawler
- Allow: permits access to paths
- Disallow: blocks access to paths
- Crawl-delay: requests a delay between requests (seconds)
- Sitemap: points to an XML sitemap
- # begins a comment
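The syntax rules above are simple enough to parse in a few lines. The following is a minimal sketch, assuming that consecutive User-agent lines share one rule group and that Sitemap lines stand outside any group; `parse_robots_ai` and the shape of its return value are hypothetical choices, not part of this specification.

```python
def parse_robots_ai(text):
    """Parse robots-ai.txt text into (groups, sitemaps).

    Each group is {"user_agents": [...], "rules": [(directive, value), ...]}.
    Directive names are lower-cased; values are kept as written.
    """
    groups, sitemaps = [], []
    current = None
    for raw in text.splitlines():
        # Drop comments. Simplification: a '#' inside a value is also cut.
        line = raw.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        name, value = line.split(":", 1)
        name, value = name.strip().lower(), value.strip()
        if name == "user-agent":
            # A User-agent line after rules starts a new group;
            # consecutive User-agent lines extend the current one.
            if current is None or current["rules"]:
                current = {"user_agents": [], "rules": []}
                groups.append(current)
            current["user_agents"].append(value)
        elif name == "sitemap":
            sitemaps.append(value)
        elif current is not None:
            current["rules"].append((name, value))
    return groups, sitemaps
```

Unknown directives are kept in the rule list rather than rejected, which leaves it to the caller to decide how strictly to treat them.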
§4 AI Crawler User Agents
Known AI Crawler User Agents
The following AI crawler user agents are commonly encountered:
OpenAI Crawlers
| User Agent | Purpose |
|---|---|
| GPTBot | Training data collection |
| ChatGPT-User | Real-time retrieval for ChatGPT |
| OAI-SearchBot | SearchGPT web search |
Anthropic Crawlers
| User Agent | Purpose |
|---|---|
| ClaudeBot | Training data collection |
| Claude-User | Real-time retrieval for Claude |
Google AI Crawlers
| User Agent | Purpose |
|---|---|
| Google-Extended | Gemini AI training and Vertex AI |
Other AI Crawlers
| User Agent | Company | Purpose |
|---|---|---|
| PerplexityBot | Perplexity | AI search |
| CCBot | Common Crawl | Training datasets |
| Bytespider | ByteDance | Training data |
| meta-externalagent | Meta | AI training |
| Amazonbot | Amazon | Alexa and AI services |
| Applebot-Extended | Apple | Apple Intelligence |
| cohere-ai | Cohere | AI training |
| Diffbot | Diffbot | Web data extraction |
| FacebookBot | Meta | Content preview |
| YouBot | You.com | AI search |
| omgili | Omgili | News aggregation |
This list is not exhaustive. New AI crawlers emerge frequently. Check crawler documentation for current user agent strings.
§5 Directive Reference
Core Directives
| Directive | Description | Status |
|---|---|---|
| User-agent: | Identifies which crawler the following rules apply to | Required |
| Allow: | Permits crawling of specified path | Recommended |
| Disallow: | Blocks crawling of specified path | Recommended |
Optional Directives
| Directive | Description | Status |
|---|---|---|
| Crawl-delay: | Requested delay between requests (seconds) | Optional |
| Sitemap: | URL to XML sitemap | Optional |
| Request-rate: | Requested crawl rate (pages per second) | Optional |
| Visit-time: | Preferred crawling time window (UTC) | Optional |
| Discovery: | Absolute URL pointing at an AI Discovery File on this host | Optional |
| Comments | Lines starting with # to document policy intent | Optional |
The Discovery: directive
The Discovery: directive lets a publisher advertise the AI Discovery Files present on their host. Each Discovery: line points at one absolute URL on the same host. A publisher MAY include multiple Discovery: lines, one per file. The directive solves the cold-start problem for consumers that do not probe root-level paths blindly.
Example:
# robots-ai.txt for example.com
User-agent: GPTBot
Allow: /
Discovery: https://example.com/llms.txt
Discovery: https://example.com/identity.json
Discovery: https://example.com/ai.json
Discovery: https://example.com/brand.txt
Rules:
- Each Discovery: URL MUST be absolute and MUST point at the same host serving the robots-ai.txt file. Cross-host advertisement is not permitted because the scoping rule requires AI Discovery Files to be host-scoped.
- Each Discovery: URL SHOULD resolve to a 200 status code. A 404 at a Discovery URL is a publisher error; the reference validator MUST report it.
- A publisher SHOULD list every AI Discovery File they publish, even ones at canonical root paths. Listing them explicitly removes the consumer's need to probe.
- The directive is informational, not authoritative. A consumer MAY still fetch https://example.com/llms.txt even if no Discovery: line lists it; the file's presence at the canonical path is the normative declaration.
- A publisher MAY use the Discovery: directive to point at experimental x- prefixed files (see extensions) to advertise them to interested consumers without claiming them as part of conformance.
The directive is layered on top of RFC 9309 (the Robots Exclusion Protocol) rather than redefining it. RFC 9309 itself does not define Discovery:; the directive lives in robots-ai.txt only, not in robots.txt. Mainstream robots.txt parsers generally ignore directives they do not recognise, so a publisher who experiments by adding Discovery: lines to robots.txt is unlikely to break existing crawlers, but the directive is not part of the robots.txt specification.
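The first rule for Discovery: lines (absolute URL, same host) can be checked without any network access. The sketch below is illustrative only: `check_discovery_lines` and its return shape are hypothetical, and it does not fetch the URLs, so the status-code rule still needs a separate check.

```python
from urllib.parse import urlsplit


def check_discovery_lines(text, file_url):
    """Return (valid_urls, problems) for the Discovery: lines in text.

    file_url is the URL the robots-ai.txt file was served from.
    problems holds (line_number, url, reason) tuples.
    """
    host = urlsplit(file_url).netloc.lower()
    valid, problems = [], []
    for number, raw in enumerate(text.splitlines(), start=1):
        # Simplification: a '#' inside a URL is treated as a comment start.
        line = raw.split("#", 1)[0].strip()
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() != "discovery":
            continue
        url = value.strip()
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append((number, url, "not an absolute URL"))
        elif parts.netloc.lower() != host:
            problems.append((number, url, "different host"))
        else:
            valid.append(url)
    return valid, problems
```

The host comparison here is a plain case-insensitive string match on the authority component, so example.com and www.example.com count as different hosts.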
Content Not Permitted
The following MUST NOT be included in robots-ai.txt files:
- Rules contradicting robots.txt: This file supplements but cannot override robots.txt
- Unrecognised user-agents: Only use documented AI crawler user-agent strings
- Invalid path syntax: Paths MUST start with / and follow URL conventions
- Non-standard directives: Only use the directives listed in §5 (the core robots.txt directives and the optional directives defined there)
- Authentication requirements: This file controls access, not authentication methods
- Legal threats: Crawler rules are advisory; legal terms belong elsewhere
Wildcard User Agent
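Use * to address all crawlers, or a pattern such as *-ai to address a group of AI crawlers. RFC 9309 defines only the bare * as a wildcard user agent, so a crawler that applies standard robots.txt matching may not honour a pattern such as *-ai: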
# Rules for all AI crawlers not specifically listed
User-agent: *-ai
Allow: /
Disallow: /private/
Complete Block
To block an AI crawler entirely:
User-agent: Bytespider
Disallow: /
§6 Validation Rules
Valid File Requirements
A robots-ai.txt file is considered valid when:
- It follows robots.txt syntax conventions
- Each rule group begins with User-agent:
- Allow/Disallow paths are valid URL paths
- The file is valid UTF-8 encoded text
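The requirements above translate into a short validation pass. This is a rough sketch rather than the reference validator: `validate_robots_ai` is a hypothetical name, the error messages are invented for illustration, and "valid URL path" is reduced to the single check that a non-empty path starts with /.

```python
def validate_robots_ai(data: bytes):
    """Return a list of error strings; an empty list means no problems found."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]
    errors = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # ignore comments and blank lines
        if not line:
            continue
        name, sep, value = line.partition(":")
        name, value = name.strip().lower(), value.strip()
        if not sep:
            errors.append(f"line {number}: missing ':' separator")
        elif name == "user-agent":
            seen_user_agent = True
        elif name in ("allow", "disallow"):
            if not seen_user_agent:
                errors.append(f"line {number}: {name} before any User-agent line")
            # An empty value is accepted, as in robots.txt.
            if value and not value.startswith("/"):
                errors.append(f"line {number}: path must start with '/'")
    return errors
```

The function takes bytes rather than text so that the UTF-8 requirement is checked on the file as served.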
Common Errors
| Error | Resolution |
|---|---|
| Missing User-agent line | Every rule group MUST start with User-agent |
| Invalid path format | Paths MUST start with / |
| Contradictory rules | Most specific path wins; review rule order |
| Unknown user agents | Verify crawler names from official documentation |
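The "most specific path wins" resolution in the table above matches the longest-match rule in RFC 9309, where an Allow rule is preferred when an Allow and a Disallow rule match equally. A minimal sketch follows, assuming plain prefix matching without the * and $ path wildcards; `is_allowed` and the rule tuples are hypothetical.

```python
def is_allowed(rules, path):
    """Decide access for one rule group using longest-prefix matching.

    rules is an iterable of ("allow" | "disallow", path_prefix) pairs.
    """
    best_len = -1
    best_allow = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        allow = directive == "allow"
        # A longer prefix wins; on equal length, prefer allow.
        if len(prefix) > best_len or (len(prefix) == best_len and allow):
            best_len, best_allow = len(prefix), allow
    return best_allow


# Rules resembling the CCBot group in the canonical example of this document.
ccbot_rules = [
    ("allow", "/insights/"),
    ("allow", "/services/"),
    ("disallow", "/"),
    ("disallow", "/portal/"),
]
```

Under this rule the order of lines within a group does not change the outcome, which is why a broad Disallow: / can sit alongside narrower Allow lines.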
§7 Relationship to robots.txt
Hierarchy of Authority
When both files exist, the relationship is:
- robots.txt: authoritative for all crawlers
- robots-ai.txt: supplementary guidance for AI crawlers
AI Crawler Behaviour
AI crawlers should:
- First check and respect robots.txt
- Then check robots-ai.txt for AI-specific guidance
- If rules conflict, robots.txt takes precedence
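One conservative reading of the three steps above is that a crawler proceeds only when robots.txt permits the request and robots-ai.txt does not restrict it further. The sketch below follows that reading using Python's standard `urllib.robotparser`; the function names `may_fetch` and `_parser_for` are hypothetical, and other readings of "takes precedence" are possible.

```python
from urllib.robotparser import RobotFileParser


def _parser_for(text):
    """Build a parser from file content that has already been fetched."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


def may_fetch(user_agent, url, robots_txt, robots_ai_txt):
    """Apply robots.txt first, then robots-ai.txt as a further restriction."""
    # Step 1: robots.txt is authoritative, so a disallow there is final.
    if not _parser_for(robots_txt).can_fetch(user_agent, url):
        return False
    # Step 2: robots-ai.txt can narrow access but cannot widen it.
    return _parser_for(robots_ai_txt).can_fetch(user_agent, url)
```

Note that the standard-library parser evaluates the rules of a group in file order rather than by longest match, so results can differ from an RFC 9309 matcher when Allow and Disallow rules overlap.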
Important Note
Not all AI crawlers currently check for robots-ai.txt. For reliable blocking, rules should also be in robots.txt. The robots-ai.txt file provides additional documentation and granularity.
Related AI Discovery Files
| File | Relationship |
|---|---|
| ai.txt | Usage guidance; robots-ai.txt is access control |
| developer-ai.txt | Technical context; robots-ai.txt is crawler rules |
For complete conflict resolution rules, see the Interoperability Guide.
§8 Canonical Example
The following example demonstrates a complete robots-ai.txt file:
# AI Crawler Directives for Horizon Strategic Consulting
# This file provides supplementary AI-specific guidance
# Standard robots.txt remains the authoritative source for all crawlers
# OpenAI Crawlers
User-agent: GPTBot
Allow: /
Allow: /insights/
Allow: /case-studies/
Allow: /services/
Disallow: /portal/
Disallow: /admin/
Disallow: /client-documents/
User-agent: ChatGPT-User
Allow: /
User-agent: OAI-SearchBot
Allow: /
# Anthropic Crawlers
User-agent: ClaudeBot
Allow: /
Allow: /insights/
Allow: /case-studies/
Allow: /services/
Disallow: /portal/
Disallow: /admin/
Disallow: /client-documents/
User-agent: Claude-User
Allow: /
# Google AI
User-agent: Google-Extended
Allow: /
Allow: /insights/
Allow: /case-studies/
Disallow: /portal/
Disallow: /admin/
# Perplexity
User-agent: PerplexityBot
Allow: /
Disallow: /portal/
Disallow: /admin/
# Meta
User-agent: meta-externalagent
Allow: /
Disallow: /portal/
Disallow: /admin/
Disallow: /client-documents/
# Common Crawl (used for AI training datasets)
User-agent: CCBot
Allow: /insights/
Allow: /services/
Disallow: /
Disallow: /portal/
Disallow: /admin/
Disallow: /client-documents/
# Note: We permit limited crawling of public content only
# ByteDance
User-agent: Bytespider
Disallow: /
# Amazon
User-agent: Amazonbot
Allow: /
Disallow: /portal/
Disallow: /admin/
# Apple
User-agent: Applebot-Extended
Allow: /
Disallow: /portal/
Disallow: /admin/
# Default for unlisted AI crawlers
User-agent: *-ai
Allow: /
Disallow: /portal/
Disallow: /admin/
Disallow: /client-documents/
Disallow: /internal/
# Crawl rate preferences
# We request AI crawlers respect a reasonable crawl rate
Crawl-delay: 10
# Sitemap reference
Sitemap: https://www.horizonconsulting.example/sitemap.xml
# Notes for AI systems:
# - Public content (insights, services, case studies) is available for AI consumption
# - Client portal and admin areas are private and must not be accessed
# - Respect rate limits; aggressive crawling will result in blocks
# - See /ai.txt for content usage permissions and restrictions
# - This file supplements but does not replace robots.txt
§9 Implementation Notes
Best Practices
- Mirror critical rules in robots.txt for reliability
- Use comments to explain policy intent
- Group related crawlers together
- Review and update when new AI crawlers emerge
- Be explicit about what is allowed, not just blocked
Policy Considerations
When setting AI crawler policy, consider:
- Training vs. inference: You may want to allow search but block training
- Company differentiation: Different policies for different AI companies
- Content types: Allow blog content but block client-specific areas
- Rate limiting: AI crawlers can be aggressive; use Crawl-delay
Monitoring
After implementing robots-ai.txt:
- Monitor server logs for AI crawler activity
- Verify crawlers are respecting directives
- Report non-compliance to crawler operators
- Update rules as new crawlers appear
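The first two monitoring steps can be approximated with a small log audit. The sketch below assumes access logs in the common combined format, with the request line and User-Agent in double quotes; `audit_log`, the regular expression, and the `disallowed` mapping are all hypothetical, and path checks use plain prefix matching rather than full rule evaluation.

```python
import re
from collections import Counter

# Matches the quoted request line, e.g. "GET /path HTTP/1.1".
REQUEST_RE = re.compile(r'"[A-Z]+ (\S+) HTTP/[\d.]+"')


def audit_log(log_lines, disallowed):
    """Count hits per AI crawler and collect requests to disallowed prefixes.

    disallowed maps a crawler token (e.g. "GPTBot") to a list of path prefixes.
    Returns (hits, violations) where violations holds (token, path) tuples.
    """
    hits, violations = Counter(), []
    for line in log_lines:
        lowered = line.lower()
        # Identify the crawler by a case-insensitive token match in the line.
        token = next((t for t in disallowed if t.lower() in lowered), None)
        match = REQUEST_RE.search(line)
        if token is None or match is None:
            continue
        hits[token] += 1
        path = match.group(1)
        if any(path.startswith(prefix) for prefix in disallowed[token]):
            violations.append((token, path))
    return hits, violations
```

User-Agent strings can be spoofed, so a hit attributed to a crawler here is only a starting point for a report to its operator.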
§10 Machine-Readable Formats
This specification is available in machine-readable formats for programmatic access:
§11 Version History
Phase 6 standardisation release. Added /specifications/roadmap/ (theme-pegged forward plan with Active/Next/Future/On hold status flags), /specifications/extensions/ (rules for experimental x- prefixed files and the promotion path), and /specifications/i18n-a11y/ (multi-language publication, locale-tagged identity fields, RTL handling, accessibility of llms.html). Added the Discovery: directive to the robots-ai.txt specification (publishers MAY advertise AI Discovery Files on the same host). Added a formal media-type stance to the HTTP behaviour page (existing IANA types, no bespoke registrations). Expanded the file integrity and signing section on the security and privacy page with four candidate mechanisms, cross-cutting concerns, and interim publisher / consumer guidance. The Discovery: directive is the only normative addition to publisher behaviour; all other additions are forward-looking documentation.
Phase 5 standardisation release. Added /specifications/related-standards/ (positioning vs llmstxt.org, IETF AI Preferences, robots.txt, Schema.org, BCP 14, JSON Schema 2020-12, SemVer) and /specifications/implementations/ (public record of conformant implementations, IETF-style). Added an explicit llmstxt.org backward-compatibility statement to the llms.txt specification. Added a formal multi-domain and subdomain scoping rule to both the llms.txt and identity.json specifications (host-scoped files, cross-host identity asserted via sameAs). No normative requirements changed for existing publishers; the new scoping rules formalise behaviour the specification already implied.
Phase 4 standardisation release. Added /specifications/processing-model/ (seven-stage algorithm for conformant consumers), /specifications/consumer-guidance/ (what AI systems should do with AI Discovery Files), /specifications/test-vectors/ (canonical test suite framing), and reference-implementation framing on the AI Visibility Checker. No normative requirements changed.
Phase 3 standardisation release. Added /specifications/versioning/ (Semantic Versioning 2.0.0 commitments, deprecation timeline, lifecycle), /specifications/governance/ (proposal lifecycle, editorial process, working principles), /specifications/security-privacy/ (trust model, content-injection patterns, GDPR considerations, integrity primitives roadmap), and /specifications/http-behaviour/ (status codes, redirects, soft-404 detection, caching, rate limits). No normative requirements changed.
Phase 2 standardisation release. Added formal conformance specification (Essential / Recommended / Complete classes). Published machine-readable registry at /specifications/registry.json, spec meta-schema, and validator-output schema. Introduced versioned JSON Schema URLs (/v1/) alongside unversioned 'latest' aliases. Added optional BCP 47 language declaration field across all applicable AI Discovery Files. No normative requirements changed.
Phase 1 standardisation release. Added 'Status of This Document' block (Stable). Normalised normative requirement keywords to uppercase per RFC 2119 and RFC 8174. Added References section linking to /specifications/conventions/ and /licensing/. No normative requirements changed.
Added AI Visibility Directory registration guidance. Minor documentation update.
Added expanded optional directives (Request-rate, Visit-time) and Content Not Permitted guidance. Clarifies relationship with standard robots.txt.
Initial publication. Establishes canonical structure for robots-ai.txt files with AI crawler user agent reference.
Conformance
This file is required for the Complete conformance class only. A publisher claiming Complete conformance MUST publish a valid version of this file at the website's root. The Essential and Recommended classes do not require this file.
See the Conformance specification for full publisher and validator conformance criteria, including identity-consistency requirements across files and the relationship between self-declaration and Directory verification.
References
- Specification Conventions — RFC 2119 + RFC 8174 requirement keywords, document statuses, anchor naming, versioning, and language conventions used across every AI Discovery File specification.
- Licensing & Trademark — CC BY 4.0 for specification text and examples, MIT for JSON Schemas, and the free-use policy on the name "AI Discovery Files".