Specification Version 1.7.0
Published
Last Modified
Status Stable

This specification is published and recommended for implementation. Backwards-compatible additions may occur in MINOR versions; breaking changes only in MAJOR versions, with deprecation notice. See specification conventions for status definitions.

robots-ai.txt Specification

AI Crawler-Specific Access Directives File Format

Abstract

This specification defines the structure and requirements for robots-ai.txt files — plain text files that provide AI crawler-specific access directives. The file follows robots.txt syntax conventions and supplements the standard robots.txt with targeted rules for AI training and inference crawlers.

§1 Overview

What This File Does

The robots-ai.txt file provides supplementary crawler access directives specifically for AI systems. While standard robots.txt applies to all crawlers, robots-ai.txt allows site owners to declare AI-specific policies:

  • Different rules for AI training crawlers vs. search crawlers
  • Granular control over which AI systems can access content
  • Crawl rate preferences for AI-specific bots
  • Commentary explaining the intent behind rules

Why It Matters for AI Visibility

The proliferation of AI crawlers with different user agents has made crawler management complex. Site owners may want to:

  • Allow AI search systems but block training crawlers
  • Permit some AI companies but block others
  • Set AI-specific rate limits
  • Document their AI crawler policy clearly

Important

The robots-ai.txt file supplements but does not replace the standard robots.txt. AI crawlers should respect robots.txt first. This file provides additional, AI-specific guidance.

§2 File Location

Primary Location

The robots-ai.txt file MUST be placed in the website's root directory:

https://example.com/robots-ai.txt

URL Requirements

  • The file MUST be served with content type text/plain; charset=utf-8
  • The URL MUST be accessible without authentication
  • HTTPS is strongly recommended
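
The location and content-type requirements above can be checked mechanically. The following is a minimal sketch, not a reference validator: the function name `check_location` is hypothetical, and it assumes the caller has already fetched the file and passes in the final URL and the `Content-Type` header value. The authentication requirement is not covered because it needs a live request.

```python
from urllib.parse import urlsplit


def check_location(url: str, content_type: str) -> list[str]:
    """Return a list of problems with where and how robots-ai.txt is served."""
    problems = []
    parts = urlsplit(url)
    if parts.path != "/robots-ai.txt":
        problems.append("file is not at the root path /robots-ai.txt")
    if parts.scheme != "https":
        problems.append("HTTPS is strongly recommended")
    # Normalise the header: lower-case it and split off the parameters.
    media, _, params = content_type.lower().partition(";")
    if media.strip() != "text/plain":
        problems.append("content type is not text/plain")
    # Assumption: the charset parameter is unquoted, as written in this section.
    if params.replace(" ", "") != "charset=utf-8":
        problems.append("charset is not utf-8")
    return problems


print(check_location("https://example.com/robots-ai.txt", "text/plain; charset=utf-8"))
```

An empty list means the URL and header satisfy the checks this sketch covers.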

Relationship to robots.txt

The robots-ai.txt file lives alongside robots.txt but at a different path:

https://example.com/robots.txt      # Standard robots exclusion
https://example.com/robots-ai.txt   # AI-specific supplementary rules

§3 Format Specification

File Format

Property      Requirement
Encoding      UTF-8 (required)
Line endings  LF (Unix-style) recommended; CRLF accepted
Syntax        robots.txt syntax conventions
Comments      Lines starting with # are comments

Basic Structure

# robots-ai.txt for Example Company
# Supplementary AI crawler directives

User-agent: GPTBot
Allow: /
Disallow: /private/

User-agent: CCBot
Disallow: /

Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml

Syntax Rules

This file follows standard robots.txt syntax:

  • User-agent: identifies the crawler
  • Allow: permits access to paths
  • Disallow: blocks access to paths
  • Crawl-delay: asks the crawler to wait between successive requests (value in seconds)
  • Sitemap: points to XML sitemap
  • # begins a comment
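
To illustrate how these directives group together, here is a small parsing sketch. It is illustrative only: the names `Group` and `parse_robots_ai` are hypothetical, and it assumes that a blank line closes the current rule group and that any directive outside a group (such as the Crawl-delay and Sitemap lines in the Basic Structure example) is file-level. Other parsers may treat blank lines differently.

```python
from dataclasses import dataclass, field


@dataclass
class Group:
    user_agents: list[str] = field(default_factory=list)
    rules: list[tuple[str, str]] = field(default_factory=list)  # (directive, value)


def parse_robots_ai(text: str) -> tuple[list[Group], list[tuple[str, str]]]:
    """Split a robots-ai.txt body into rule groups and file-level directives."""
    groups: list[Group] = []
    file_level: list[tuple[str, str]] = []
    current = None
    for raw in text.splitlines():
        stripped = raw.strip()
        if not stripped:
            current = None  # assumption: a blank line ends the current group
            continue
        line = stripped.split("#", 1)[0].strip()  # drop comments
        if ":" not in line:
            continue
        name, value = line.split(":", 1)
        name, value = name.strip().lower(), value.strip()
        if name == "user-agent":
            # Consecutive User-agent lines share one group.
            if current is None or current.rules:
                current = Group()
                groups.append(current)
            current.user_agents.append(value)
        elif name in ("sitemap", "discovery") or current is None:
            file_level.append((name, value))
        else:
            current.rules.append((name, value))
    return groups, file_level
```

Applied to the Basic Structure example, this yields one group for GPTBot, one for CCBot, and two file-level directives.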

§4 AI Crawler User Agents

Known AI Crawler User Agents

The following AI crawler user agents are commonly encountered:

OpenAI Crawlers

User Agent     Purpose
GPTBot         Training data collection
ChatGPT-User   Real-time retrieval for ChatGPT
OAI-SearchBot  SearchGPT web search

Anthropic Crawlers

User Agent   Purpose
ClaudeBot    Training data collection
Claude-User  Real-time retrieval for Claude

Google AI Crawlers

User Agent       Purpose
Google-Extended  Gemini AI training and Vertex AI

Other AI Crawlers

User Agent          Company       Purpose
PerplexityBot       Perplexity    AI search
CCBot               Common Crawl  Training datasets
Bytespider          ByteDance     Training data
meta-externalagent  Meta          AI training
Amazonbot           Amazon        Alexa and AI services
Applebot-Extended   Apple         Apple Intelligence
cohere-ai           Cohere        AI training
Diffbot             Diffbot       Web data extraction
FacebookBot         Meta          Content preview
YouBot              You.com       AI search
omgili              Omgili        News aggregation

Note

This list is not exhaustive. New AI crawlers emerge frequently. Check crawler documentation for current user agent strings.

§5 Directive Reference

Core Directives

Directive    Description                                            Status
User-agent:  Identifies which crawler the following rules apply to  Required
Allow:       Permits crawling of specified path                     Recommended
Disallow:    Blocks crawling of specified path                      Recommended

Optional Directives

Directive      Description                                                     Status
Crawl-delay:   Requested delay between requests (seconds)                      Optional
Sitemap:       URL to XML sitemap                                              Optional
Request-rate:  Requested crawl rate (pages per second)                         Optional
Visit-time:    Preferred crawling time window (UTC)                            Optional
Discovery:     Absolute URL pointing at an AI Discovery File on this host      Optional
Comments       Lines starting with # to document policy intent                 Optional

The Discovery: directive

The Discovery: directive lets a publisher advertise the AI Discovery Files present on their host. Each Discovery: line points at one absolute URL on the same host. A publisher MAY include multiple Discovery: lines, one per file. The directive solves the cold-start problem for consumers that do not probe root-level paths blindly.

Example:

# robots-ai.txt for example.com
User-agent: GPTBot
Allow: /

Discovery: https://example.com/llms.txt
Discovery: https://example.com/identity.json
Discovery: https://example.com/ai.json
Discovery: https://example.com/brand.txt

Rules:

  1. Each Discovery: URL MUST be absolute and MUST point at the same host serving the robots-ai.txt file. Cross-host advertisement is not permitted because the scoping rule requires AI Discovery Files to be host-scoped.
  2. Each Discovery: URL SHOULD resolve to a 200 status code. A 404 at a Discovery URL is a publisher error; the reference validator MUST report it.
  3. A publisher SHOULD list every AI Discovery File they publish, even ones at canonical root paths. Listing them explicitly removes the consumer's need to probe.
  4. The directive is informational, not authoritative. A consumer MAY still fetch https://example.com/llms.txt even if no Discovery: line lists it; the file's presence at the canonical path is the normative declaration.
  5. A publisher MAY use the Discovery: directive to point at experimental x- prefixed files (see extensions) to advertise them to interested consumers without claiming them as part of conformance.

The directive is layered on top of RFC 9309 (the Robots Exclusion Protocol) rather than redefining it. RFC 9309 itself does not define Discovery:; the directive lives in robots-ai.txt only, not in robots.txt. Mainstream robots.txt parsers ignore directives they do not recognise, so a publisher who experiments by adding Discovery: lines to robots.txt will not break existing crawlers, but the directive is not part of the robots.txt specification.
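
Rule 1 (absolute URL, same host) lends itself to a static check. The sketch below is an illustration under stated assumptions: `check_discovery_urls` is a hypothetical helper, and "same host" is compared by hostname only, ignoring port. Rule 2 (the URL resolves with a 200 status) needs a network request and is left out.

```python
from urllib.parse import urlsplit


def check_discovery_urls(robots_ai_url: str, text: str) -> list[str]:
    """Report Discovery: lines that are not absolute or not on the serving host."""
    host = urlsplit(robots_ai_url).hostname
    errors = []
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        name, sep, value = line.partition(":")
        if not sep or name.strip().lower() != "discovery":
            continue
        target = urlsplit(value.strip())
        if not target.scheme or not target.netloc:
            errors.append(f"line {number}: Discovery URL is not absolute")
        elif target.hostname != host:
            errors.append(f"line {number}: Discovery URL is on a different host")
    return errors
```

For the example file above served from https://example.com/robots-ai.txt, the function returns an empty list.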

Content Not Permitted

The following MUST NOT be included in robots-ai.txt files:

  • Rules contradicting robots.txt: This file supplements but cannot override robots.txt
  • Unrecognised user-agents: Only use documented AI crawler user-agent strings
  • Invalid path syntax: Paths MUST start with / and follow URL conventions
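  • Non-standard directives: Only use the directives listed in the Directive Reference above (standard robots.txt directives plus the optional directives this specification defines)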
  • Authentication requirements: This file controls access, not authentication methods
  • Legal threats: Crawler rules are advisory; legal terms belong elsewhere

Wildcard User Agent

Use * or pattern matching for groups of AI crawlers:

# Rules for all AI crawlers not specifically listed
User-agent: *-ai
Allow: /
Disallow: /private/
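
This section does not define exact matching semantics for patterns such as *-ai. As one possible reading, the sketch below treats the pattern as a case-insensitive glob and prefers the pattern with the most literal characters. Both the interpretation and the helper names `agent_matches` and `best_pattern` are assumptions made for illustration. Note that Python's fnmatch also gives special meaning to ? and [ ], which a real implementation might want to disable.

```python
from fnmatch import fnmatchcase
from typing import Optional


def agent_matches(pattern: str, product_token: str) -> bool:
    """Case-insensitive glob match of a User-agent pattern against a crawler name."""
    return fnmatchcase(product_token.lower(), pattern.lower())


def best_pattern(patterns: list[str], product_token: str) -> Optional[str]:
    """Pick the matching pattern with the most literal (non-wildcard) characters."""
    matches = [p for p in patterns if agent_matches(p, product_token)]
    return max(matches, key=lambda p: len(p.replace("*", "")), default=None)


print(best_pattern(["GPTBot", "*-ai", "*"], "cohere-ai"))
```

Under this reading, an exact name beats *-ai, which in turn beats a bare *.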

Complete Block

To block an AI crawler entirely:

User-agent: Bytespider
Disallow: /

§6 Validation Rules

Valid File Requirements

A robots-ai.txt file is considered valid when:

  • It follows robots.txt syntax conventions
  • Each rule group begins with User-agent:
  • Allow/Disallow paths are valid URL paths
  • The file is valid UTF-8 encoded text
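
The requirements above can be turned into a simple linter. This is a rough sketch rather than the reference validator: `validate_robots_ai` is a hypothetical name, it works on the raw bytes of the file, and it assumes that an empty Allow or Disallow value is acceptable, as in traditional robots.txt usage.

```python
def validate_robots_ai(data: bytes) -> list[str]:
    """Check encoding, group structure, and path format of a robots-ai.txt body."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]
    errors = []
    seen_user_agent = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments
        if not line:
            continue
        name, sep, value = line.partition(":")
        name, value = name.strip().lower(), value.strip()
        if not sep:
            errors.append(f"line {number}: expected 'Directive: value'")
        elif name == "user-agent":
            seen_user_agent = True
        elif name in ("allow", "disallow"):
            if not seen_user_agent:
                errors.append(f"line {number}: {name} appears before any User-agent line")
            # Assumption: an empty value is tolerated; non-empty paths need a leading /.
            if value and not value.startswith("/"):
                errors.append(f"line {number}: path must start with /")
    return errors
```

An empty result means the file passed the checks covered here; it does not prove full conformance.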

Common Errors

Error                    Resolution
Missing User-agent line  Every rule group MUST start with User-agent
Invalid path format      Paths MUST start with /
Contradictory rules      Most specific path wins; review rule order
Unknown user agents      Verify crawler names from official documentation
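
"Most specific path wins" is the longest-match rule familiar from robots.txt. The sketch below shows one way to apply it to the rules of a single group. It assumes plain prefix matching with no * or $ wildcards, and it lets Allow win when an Allow and a Disallow rule have the same length; `is_allowed` is a hypothetical helper name.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Apply the longest matching rule; rules are (directive, path prefix) pairs."""
    best_len = -1
    allowed = True  # no matching rule means the path is allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # A longer prefix is more specific; on a tie, Allow takes priority.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len, allowed = len(prefix), is_allow
    return allowed


ccbot_rules = [
    ("Allow", "/insights/"),
    ("Allow", "/services/"),
    ("Disallow", "/"),
    ("Disallow", "/portal/"),
]
print(is_allowed(ccbot_rules, "/insights/report"), is_allowed(ccbot_rules, "/about"))
```

With these rules, a blanket Disallow: / can coexist with narrower Allow lines because the narrower rule is the more specific match.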

§7 Relationship to robots.txt

Hierarchy of Authority

When both files exist, the relationship is:

  1. robots.txt — authoritative for all crawlers
  2. robots-ai.txt — supplementary guidance for AI crawlers

AI Crawler Behaviour

AI crawlers should:

  1. First check and respect robots.txt
  2. Then check robots-ai.txt for AI-specific guidance
  3. If rules conflict, robots.txt takes precedence
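
The three steps above can be sketched as a small decision function. This is one literal reading of the precedence rule, not the normative conflict-resolution procedure (which lives in the Interoperability Guide). It assumes each file has already been evaluated for the crawler and path in question, producing True (allowed), False (disallowed), or None (the file is absent or has no applicable rule). The name `resolve_access` is hypothetical.

```python
from typing import Optional


def resolve_access(robots_txt: Optional[bool], robots_ai_txt: Optional[bool]) -> bool:
    """Combine per-file verdicts; robots.txt wins whenever it has an opinion."""
    if robots_txt is not None:
        return robots_txt       # steps 1 and 3: robots.txt takes precedence
    if robots_ai_txt is not None:
        return robots_ai_txt    # step 2: AI-specific guidance fills the gap
    return True                 # neither file has an applicable rule


print(resolve_access(False, True))
```

Because robots.txt always wins under this reading, a block that exists only in robots-ai.txt is effective only where robots.txt is silent, which is one more reason to mirror critical rules in robots.txt.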

Important Note

Not all AI crawlers currently check for robots-ai.txt. For reliable blocking, rules should also be in robots.txt. The robots-ai.txt file provides additional documentation and granularity.

Related AI Discovery Files

File              Relationship
ai.txt            Usage guidance; robots-ai.txt is access control
developer-ai.txt  Technical context; robots-ai.txt is crawler rules

See Also

For complete conflict resolution rules, see the Interoperability Guide.

§8 Canonical Example

The following example demonstrates a complete robots-ai.txt file:

Complete Example
# AI Crawler Directives for Horizon Strategic Consulting
# This file provides supplementary AI-specific guidance
# Standard robots.txt remains the authoritative source for all crawlers

# OpenAI Crawlers
User-agent: GPTBot
Allow: /
Allow: /insights/
Allow: /case-studies/
Allow: /services/
Disallow: /portal/
Disallow: /admin/
Disallow: /client-documents/

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Anthropic Crawlers
User-agent: ClaudeBot
Allow: /
Allow: /insights/
Allow: /case-studies/
Allow: /services/
Disallow: /portal/
Disallow: /admin/
Disallow: /client-documents/

User-agent: Claude-User
Allow: /

# Google AI
User-agent: Google-Extended
Allow: /
Allow: /insights/
Allow: /case-studies/
Disallow: /portal/
Disallow: /admin/

# Perplexity
User-agent: PerplexityBot
Allow: /
Disallow: /portal/
Disallow: /admin/

# Meta
User-agent: meta-externalagent
Allow: /
Disallow: /portal/
Disallow: /admin/
Disallow: /client-documents/

# Common Crawl (used for AI training datasets)
User-agent: CCBot
Allow: /insights/
Allow: /services/
Disallow: /
Disallow: /portal/
Disallow: /admin/
Disallow: /client-documents/
# Note: We permit limited crawling of public content only

# ByteDance
User-agent: Bytespider
Disallow: /

# Amazon
User-agent: Amazonbot
Allow: /
Disallow: /portal/
Disallow: /admin/

# Apple
User-agent: Applebot-Extended
Allow: /
Disallow: /portal/
Disallow: /admin/

# Default for unlisted AI crawlers
User-agent: *-ai
Allow: /
Disallow: /portal/
Disallow: /admin/
Disallow: /client-documents/
Disallow: /internal/

# Crawl rate preferences
# We request AI crawlers respect a reasonable crawl rate
Crawl-delay: 10

# Sitemap reference
Sitemap: https://www.horizonconsulting.example/sitemap.xml

# Notes for AI systems:
# - Public content (insights, services, case studies) is available for AI consumption
# - Client portal and admin areas are private and must not be accessed
# - Respect rate limits; aggressive crawling will result in blocks
# - See /ai.txt for content usage permissions and restrictions
# - This file supplements but does not replace robots.txt

§9 Implementation Notes

Best Practices

  • Mirror critical rules in robots.txt for reliability
  • Use comments to explain policy intent
  • Group related crawlers together
  • Review and update when new AI crawlers emerge
  • Be explicit about what is allowed, not just blocked

Policy Considerations

When setting AI crawler policy, consider:

  • Training vs. inference: You may want to allow search but block training
  • Company differentiation: Different policies for different AI companies
  • Content types: Allow blog content but block client-specific areas
  • Rate limiting: AI crawlers can be aggressive; use Crawl-delay
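
As an illustration of the "allow search but block training" policy, the sketch below generates a robots-ai.txt body from two lists of user agents. The grouping into TRAINING and RETRIEVAL follows the purposes listed in §4; the function name `build_policy` and the private path used in the example are hypothetical.

```python
# Grouping taken from the Purpose column in section 4; adjust to your own policy.
TRAINING = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot"]
RETRIEVAL = ["ChatGPT-User", "OAI-SearchBot", "Claude-User", "PerplexityBot"]


def build_policy(blocked: list[str], allowed: list[str], private_paths: list[str]) -> str:
    """Emit one rule group per crawler: full block, or allow with private paths excluded."""
    lines = ["# Block training crawlers, allow AI search and retrieval"]
    for agent in blocked:
        lines += ["", f"User-agent: {agent}", "Disallow: /"]
    for agent in allowed:
        lines += ["", f"User-agent: {agent}", "Allow: /"]
        lines += [f"Disallow: {path}" for path in private_paths]
    return "\n".join(lines) + "\n"


print(build_policy(TRAINING, RETRIEVAL, ["/portal/"]))
```

Generating the file from a single source of truth also makes it easier to mirror the same rules into robots.txt.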

Monitoring

After implementing robots-ai.txt:

  • Monitor server logs for AI crawler activity
  • Verify crawlers are respecting directives
  • Report non-compliance to crawler operators
  • Update rules as new crawlers appear
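
A first pass at log monitoring can be as simple as counting requests per AI crawler. The sketch below assumes access-log lines that include the User-Agent string (as in the common "combined" format) and uses plain substring matching, so a URL that happens to contain a crawler name would also be counted. The agent list is a subset of §4, and `count_ai_hits` is a hypothetical helper.

```python
from collections import Counter

AI_AGENTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot",
    "Claude-User", "PerplexityBot", "CCBot", "Bytespider",
]


def count_ai_hits(log_lines: list[str]) -> Counter:
    """Count log lines per AI crawler using case-insensitive substring matching."""
    hits: Counter = Counter()
    for line in log_lines:
        lowered = line.lower()
        for agent in AI_AGENTS:
            if agent.lower() in lowered:
                hits[agent] += 1
                break  # count each request once
    return hits
```

Comparing the requested paths for a blocked crawler against its Disallow rules is a natural next step for spotting non-compliance.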

§10 Machine-Readable Formats

This specification is available in machine-readable formats for programmatic access:

§11 Version History

1.7.0

Phase 6 standardisation release. Added /specifications/roadmap/ (theme-pegged forward plan with Active/Next/Future/On hold status flags), /specifications/extensions/ (rules for experimental x- prefixed files and the promotion path), and /specifications/i18n-a11y/ (multi-language publication, locale-tagged identity fields, RTL handling, accessibility of llms.html). Added the Discovery: directive to the robots-ai.txt specification (publishers MAY advertise AI Discovery Files on the same host). Added a formal media-type stance to the HTTP behaviour page (existing IANA types, no bespoke registrations). Expanded the file integrity and signing section on the security and privacy page with four candidate mechanisms, cross-cutting concerns, and interim publisher / consumer guidance. The Discovery: directive is the only normative addition to publisher behaviour; all other additions are forward-looking documentation.

1.6.0

Phase 5 standardisation release. Added /specifications/related-standards/ (positioning vs llmstxt.org, IETF AI Preferences, robots.txt, Schema.org, BCP 14, JSON Schema 2020-12, SemVer) and /specifications/implementations/ (public record of conformant implementations, IETF-style). Added an explicit llmstxt.org backward-compatibility statement to the llms.txt specification. Added a formal multi-domain and subdomain scoping rule to both the llms.txt and identity.json specifications (host-scoped files, cross-host identity asserted via sameAs). No normative requirements changed for existing publishers; the new scoping rules formalise behaviour the specification already implied.

1.5.0

Phase 4 standardisation release. Added /specifications/processing-model/ (seven-stage algorithm for conformant consumers), /specifications/consumer-guidance/ (what AI systems should do with AI Discovery Files), /specifications/test-vectors/ (canonical test suite framing), and reference-implementation framing on the AI Visibility Checker. No normative requirements changed.

1.4.0

Phase 3 standardisation release. Added /specifications/versioning/ (Semantic Versioning 2.0.0 commitments, deprecation timeline, lifecycle), /specifications/governance/ (proposal lifecycle, editorial process, working principles), /specifications/security-privacy/ (trust model, content-injection patterns, GDPR considerations, integrity primitives roadmap), and /specifications/http-behaviour/ (status codes, redirects, soft-404 detection, caching, rate limits). No normative requirements changed.

1.3.0

Phase 2 standardisation release. Added formal conformance specification (Essential / Recommended / Complete classes). Published machine-readable registry at /specifications/registry.json, spec meta-schema, and validator-output schema. Introduced versioned JSON Schema URLs (/v1/) alongside unversioned 'latest' aliases. Added optional BCP 47 language declaration field across all applicable AI Discovery Files. No normative requirements changed.

1.2.0

Phase 1 standardisation release. Added 'Status of This Document' block (Stable). Normalised normative requirement keywords to uppercase per RFC 2119 and RFC 8174. Added References section linking to /specifications/conventions/ and /licensing/. No normative requirements changed.

1.1.1

Added AI Visibility Directory registration guidance. Minor documentation update.

1.1.0

Added expanded optional directives (Request-rate, Visit-time) and Content Not Permitted guidance. Clarifies relationship with standard robots.txt.

1.0.0

Initial publication. Establishes canonical structure for robots-ai.txt files with AI crawler user agent reference.

Conformance

This file is required for the Complete conformance class only. A publisher claiming Complete conformance MUST publish a valid version of this file at the website's root. The Essential and Recommended classes do not require this file.

See the Conformance specification for full publisher and validator conformance criteria, including identity-consistency requirements across files and the relationship between self-declaration and Directory verification.

References

  • Specification Conventions — RFC 2119 + RFC 8174 requirement keywords, document statuses, anchor naming, versioning, and language conventions used across every AI Discovery File specification.
  • Licensing & Trademark — CC BY 4.0 for specification text and examples, MIT for JSON Schemas, and the free-use policy on the name "AI Discovery Files".

Register in the AI Visibility Directory

Once your AI Discovery Files are published, register your website in the AI Visibility Directory — the verified registry of websites implementing AI Discovery Files. Registration validates your implementation and lists your site for AI systems and industry peers to discover.

Basic Listing

Card entry in the directory with automated file validation. Open to any site with a valid llms.txt file. No cost.

Full Listing (Recommended)

Dedicated profile page on the directory with dofollow backlinks to your website — a genuine SEO authority signal from a topically relevant, verified source. Includes an attribution badge and enhanced visibility.