Specification Version 1.7.1

Published 12 January 2026

Last Modified 11 June 2026

Status Stable

This specification is published and recommended for implementation. Backwards-compatible additions may occur in MINOR versions; breaking changes only in MAJOR versions, with deprecation notice. See specification conventions for status definitions.

robots-ai.txt Specification

AI Crawler-Specific Access Directives File Format

Abstract

This specification defines the structure and requirements for robots-ai.txt files — plain text files that provide AI crawler-specific access directives. The file follows robots.txt syntax conventions and supplements the standard robots.txt with targeted rules for AI training and inference crawlers.

§1 Overview

What This File Does

The robots-ai.txt file provides supplementary crawler access directives specifically for AI systems. While standard robots.txt applies to all crawlers, robots-ai.txt allows site owners to declare AI-specific policies:

Different rules for AI training crawlers vs. search crawlers
Granular control over which AI systems can access content
Crawl rate preferences for AI-specific bots
Commentary explaining the intent behind rules

Why It Matters for AI Visibility

The proliferation of AI crawlers with different user agents has made crawler management complex. Site owners may want to:

Allow AI search systems but block training crawlers
Permit some AI companies but block others
Set AI-specific rate limits
Document their AI crawler policy clearly

Important

The robots-ai.txt file supplements but does not replace the standard robots.txt. AI crawlers should respect robots.txt first. This file provides additional, AI-specific guidance.

§2 File Location

Primary Location

The robots-ai.txt file MUST be placed in the website's root directory:

https://example.com/robots-ai.txt

URL Requirements

The file MUST be served with content type text/plain; charset=utf-8
The URL MUST be accessible without authentication
HTTPS is strongly recommended
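
A validator might check the URL requirements above roughly as follows. This is a minimal sketch rather than part of the specification: the function name check_response_headers and its return shape are hypothetical, and it inspects only a URL and a Content-Type header value instead of performing a real fetch.

```python
from email.message import Message
from urllib.parse import urlsplit


def check_response_headers(url: str, content_type: str) -> list[str]:
    """Return a list of problems; an empty list means the checks passed.

    Hypothetical helper: the name and return shape are illustrative only.
    """
    problems = []

    # Parse the header with the stdlib MIME parser so the media type and
    # the charset parameter are compared case-insensitively.
    msg = Message()
    msg["Content-Type"] = content_type
    if msg.get_content_type() != "text/plain":
        problems.append("content type is not text/plain")
    if msg.get_content_charset() != "utf-8":
        problems.append("charset is not utf-8")

    # HTTPS is recommended rather than required, so it is only a warning.
    if urlsplit(url).scheme != "https":
        problems.append("warning: HTTPS is strongly recommended")
    return problems
```

The authentication requirement is not covered here; checking it would need an actual unauthenticated request.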

Relationship to robots.txt

The robots-ai.txt file lives alongside robots.txt but at a different path:

https://example.com/robots.txt      # Standard robots exclusion
https://example.com/robots-ai.txt   # AI-specific supplementary rules
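
Because both files sit at fixed root paths, a consumer can derive their locations from any URL on the host. The sketch below assumes a hypothetical helper name, policy_urls, and uses only the standard library.

```python
from urllib.parse import urlsplit, urlunsplit


def policy_urls(page_url: str) -> dict[str, str]:
    """Map any URL on a host to that host's two root-level policy files."""
    parts = urlsplit(page_url)
    root = (parts.scheme, parts.netloc)
    # Path, query and fragment of the input are discarded on purpose:
    # both files live at the root of the host.
    return {
        "robots.txt": urlunsplit((*root, "/robots.txt", "", "")),
        "robots-ai.txt": urlunsplit((*root, "/robots-ai.txt", "", "")),
    }
```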

§3 Format Specification

File Format

Property	Requirement
Encoding	UTF-8 (required)
Line endings	LF (Unix-style) recommended; CRLF accepted
Syntax	robots.txt syntax conventions
Comments	Lines starting with `#` are comments

Basic Structure

# robots-ai.txt for Example Company
# Supplementary AI crawler directives

User-agent: GPTBot
Allow: /
Disallow: /private/

User-agent: CCBot
Disallow: /

Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml

Syntax Rules

This file follows standard robots.txt syntax:

User-agent: identifies the crawler
Allow: permits access to paths
Disallow: blocks access to paths
Crawl-delay: asks for a minimum delay between successive requests (seconds)
Sitemap: points to XML sitemap
# begins a comment
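
The syntax rules above can be read by a very small parser. The following is a sketch under stated assumptions, not a reference implementation: the name parse_groups is hypothetical, a blank line is assumed to close the current group (which matches how the examples in this specification are laid out), and any directive found outside a group is recorded as file-level.

```python
SAMPLE = """\
# robots-ai.txt for Example Company

User-agent: GPTBot
Allow: /
Disallow: /private/

User-agent: CCBot
Disallow: /

Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
"""


def parse_groups(text):
    """Return (groups, other): rule groups plus directives outside any group."""
    groups, other = [], []
    current = None
    last_was_agent = False
    for raw in text.splitlines():
        if not raw.strip():
            # Assumption: a truly blank line ends the current group.
            current, last_was_agent = None, False
            continue
        line = raw.split("#", 1)[0].strip()      # drop comments
        key, sep, value = line.partition(":")
        if not sep:
            continue                              # comment-only or malformed line
        key, value = key.strip().lower(), value.strip()
        if key == "user-agent":
            # Consecutive User-agent lines share one group.
            if not last_was_agent:
                current = {"agents": [], "rules": []}
                groups.append(current)
            current["agents"].append(value)
            last_was_agent = True
            continue
        last_was_agent = False
        if key in ("allow", "disallow", "crawl-delay") and current is not None:
            current["rules"].append((key, value))
        else:
            other.append((key, value))
    return groups, other
```

Under these assumptions, parsing SAMPLE yields two groups (GPTBot and CCBot), with Crawl-delay and Sitemap reported as file-level directives. A parser that ignores blank lines would instead attach Crawl-delay to the CCBot group.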

§4 AI Crawler User Agents

Known AI Crawler User Agents

The following AI crawler user agents are commonly encountered:

OpenAI Crawlers

User Agent	Purpose
`GPTBot`	Training data collection
`ChatGPT-User`	Real-time retrieval for ChatGPT
`OAI-SearchBot`	Search indexing for ChatGPT search

Anthropic Crawlers

User Agent	Purpose
`ClaudeBot`	Training data collection
`Claude-User`	Real-time retrieval for Claude
`Claude-SearchBot`	Search indexing for Claude search features

Anthropic's earlier Claude-Web and anthropic-ai user agents are retired and no longer appear in Anthropic's crawler documentation. The three agents above are controlled independently: allowing one does not allow the others.

Google AI Crawlers

User Agent	Purpose
`Google-Extended`	Gemini AI training and Vertex AI

Other AI Crawlers

User Agent	Company	Purpose
`PerplexityBot`	Perplexity	AI search
`CCBot`	Common Crawl	Training datasets
`Bytespider`	ByteDance	Training data
`meta-externalagent`	Meta	AI training
`Amazonbot`	Amazon	Alexa and AI services
`Applebot-Extended`	Apple	Apple Intelligence
`cohere-ai`	Cohere	AI training
`Diffbot`	Diffbot	Web data extraction
`FacebookBot`	Meta	Content preview
`YouBot`	You.com	AI search
`omgili`	Omgili	News aggregation

Note

This list is not exhaustive. New AI crawlers emerge frequently. Check crawler documentation for current user agent strings.
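
One way to use the tables above is to look for a known token inside a raw User-Agent header. The sketch below is illustrative only: identify_ai_crawler is a hypothetical name, the mapping copies a subset of the tokens listed in this section, and case-insensitive substring matching is an assumption rather than a rule of this specification.

```python
# Subset of the tokens listed above, mapped to their operators.
KNOWN_AI_TOKENS = {
    "GPTBot": "OpenAI",
    "ChatGPT-User": "OpenAI",
    "OAI-SearchBot": "OpenAI",
    "ClaudeBot": "Anthropic",
    "Claude-User": "Anthropic",
    "Claude-SearchBot": "Anthropic",
    "PerplexityBot": "Perplexity",
    "CCBot": "Common Crawl",
    "Bytespider": "ByteDance",
    "meta-externalagent": "Meta",
    "Amazonbot": "Amazon",
}


def identify_ai_crawler(user_agent_header: str):
    """Return (token, operator) for the first known token found, else None."""
    header = user_agent_header.lower()
    # Check longer tokens first so a short token never shadows a longer one.
    for token in sorted(KNOWN_AI_TOKENS, key=len, reverse=True):
        if token.lower() in header:
            return token, KNOWN_AI_TOKENS[token]
    return None
```

A User-Agent header is trivially spoofed, so a match identifies what a client claims to be, not what it is.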

§5 Directive Reference

Core Directives

Directive	Description	Status
`User-agent:`	Identifies which crawler the following rules apply to	Required
`Allow:`	Permits crawling of specified path	Recommended
`Disallow:`	Blocks crawling of specified path	Recommended

Optional Directives

Directive	Description	Status
`Crawl-delay:`	Requested delay between requests (seconds)	Optional
`Sitemap:`	URL to XML sitemap	Optional
`Request-rate:`	Requested crawl rate (pages per second)	Optional
`Visit-time:`	Preferred crawling time window (UTC)	Optional
`Discovery:`	Absolute URL pointing at an AI Discovery File on this host	Optional
Comments	Lines starting with # to document policy intent	Optional

The `Discovery:` directive

The Discovery: directive lets a publisher advertise the AI Discovery Files present on their host. Each Discovery: line points at one absolute URL on the same host. A publisher MAY include multiple Discovery: lines, one per file. The directive solves the cold-start problem for consumers that do not probe root-level paths blindly.

Example:

# robots-ai.txt for example.com
User-agent: GPTBot
Allow: /

Discovery: https://example.com/llms.txt
Discovery: https://example.com/identity.json
Discovery: https://example.com/ai.json
Discovery: https://example.com/brand.txt

Rules:

Each Discovery: URL MUST be absolute and MUST point at the same host serving the robots-ai.txt file. Cross-host advertisement is not permitted because the scoping rule requires AI Discovery Files to be host-scoped.
Each Discovery: URL SHOULD resolve to a 200 status code. A 404 at a Discovery URL is a publisher error; the reference validator MUST report it.
A publisher SHOULD list every AI Discovery File they publish, even ones at canonical root paths. Listing them explicitly removes the consumer's need to probe.
The directive is informational, not authoritative. A consumer MAY still fetch https://example.com/llms.txt even if no Discovery: line lists it; the file's presence at the canonical path is the normative declaration.
A publisher MAY use the Discovery: directive to point at experimental files whose names carry the x- prefix (see extensions), advertising them to interested consumers without claiming them as part of conformance.

The directive is layered on top of RFC 9309 (the Robots Exclusion Protocol) rather than redefining it. RFC 9309 itself does not define Discovery:; the directive lives in robots-ai.txt only, not in robots.txt. Mainstream robots.txt parsers ignore directives they do not recognise, so a publisher who experiments by adding Discovery: lines to robots.txt is unlikely to break existing crawlers, but the directive is not part of the robots.txt specification.
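
The absolute-URL and same-host rules can be checked without any network access. The sketch below assumes a hypothetical function name, check_discovery_lines, and does not test whether each URL resolves to a 200 status; that part would need a real request.

```python
from urllib.parse import urlsplit


def check_discovery_lines(file_url: str, text: str) -> list[tuple[str, str]]:
    """Return (url, problem) pairs for Discovery: lines that break the rules."""
    host = urlsplit(file_url).netloc.lower()
    problems = []
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()       # drop comments
        key, sep, value = line.partition(":")
        if not sep or key.strip().lower() != "discovery":
            continue
        url = value.strip()
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            problems.append((url, "not an absolute URL"))
        elif parts.netloc.lower() != host:
            problems.append((url, "points at a different host"))
    return problems
```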

Content Not Permitted

The following MUST NOT be included in robots-ai.txt files:

Rules contradicting robots.txt: This file supplements but cannot override robots.txt
Unrecognised user-agents: Only use documented AI crawler user-agent strings
Invalid path syntax: Paths MUST start with / and follow URL conventions
Unsupported directives: Only use the directives listed in this section
Authentication requirements: This file controls access, not authentication methods
Legal threats: Crawler rules are advisory; legal terms belong elsewhere

Wildcard User Agent

Use * or pattern matching for groups of AI crawlers:

# Rules for all AI crawlers not specifically listed
User-agent: *-ai
Allow: /
Disallow: /private/
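
This section does not spell out how a pattern such as *-ai is matched, and RFC 9309 defines only the bare * for User-agent lines. The sketch below therefore rests on assumptions: shell-style glob matching, case-insensitive comparison, and a priority order of exact name, then pattern, then bare *. The function name select_group is hypothetical.

```python
from fnmatch import fnmatchcase


def select_group(crawler_token: str, group_names: list[str]):
    """Pick the group name that applies to a crawler, or None if none does."""
    token = crawler_token.lower()
    names = {name.lower(): name for name in group_names}

    # 1. An exact name match is the most specific choice.
    if token in names:
        return names[token]

    # 2. Otherwise try glob patterns other than the bare "*", longest first.
    patterns = sorted(
        (n for n in names if "*" in n and n != "*"), key=len, reverse=True
    )
    for pattern in patterns:
        if fnmatchcase(token, pattern):
            return names[pattern]

    # 3. Fall back to the catch-all group if the file has one.
    return names.get("*")
```

Under these assumptions, *-ai covers tokens that end in -ai, such as cohere-ai, but not tokens such as PerplexityBot.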

Complete Block

To block an AI crawler entirely:

User-agent: Bytespider
Disallow: /

§6 Validation Rules

Valid File Requirements

A robots-ai.txt file is considered valid when:

It follows robots.txt syntax conventions
Each rule group begins with User-agent:
Allow/Disallow paths are valid URL paths
The file is valid UTF-8 encoded text
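
The requirements above translate into a short set of checks. This is a minimal sketch, not the reference validator: the name validate and the wording of the messages are hypothetical, and only the UTF-8, User-agent-first and leading-slash rules are covered.

```python
def validate(data: bytes) -> list[str]:
    """Return human-readable errors; an empty list means no problem found."""
    try:
        text = data.decode("utf-8")
    except UnicodeDecodeError:
        return ["file is not valid UTF-8"]

    errors = []
    in_group = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()       # drop comments
        if not line:
            continue
        key, sep, value = line.partition(":")
        key, value = key.strip().lower(), value.strip()
        if not sep:
            errors.append(f"line {number}: expected 'Directive: value'")
        elif key == "user-agent":
            in_group = True
        elif key in ("allow", "disallow"):
            if not in_group:
                errors.append(f"line {number}: {key} appears before any User-agent")
            if not value.startswith("/"):
                errors.append(f"line {number}: path must start with '/'")
    return errors
```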

Common Errors

Error	Resolution
Missing User-agent line	Every rule group MUST start with User-agent
Invalid path format	Paths MUST start with `/`
Contradictory rules	Most specific (longest) matching path wins; review overlapping Allow and Disallow paths
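
The "most specific path wins" resolution can be illustrated with the longest-match rule from RFC 9309. The sketch assumes plain path prefixes (no * or $ wildcards) and a hypothetical function name, is_allowed; when an Allow and a Disallow rule are equally specific, it prefers Allow, as RFC 9309 suggests.

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Apply the longest-match rule to (directive, path_prefix) pairs."""
    best_len = -1
    allowed = True                      # no matching rule means allowed
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # A longer prefix is more specific; on a tie, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed
```

With this rule the order of lines inside a group does not affect the outcome; only the length of the matching paths does.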
Unknown user agents	Verify crawler names from official documentation

§7 Relationship to robots.txt

Hierarchy of Authority

When both files exist, the relationship is:

robots.txt — authoritative for all crawlers
robots-ai.txt — supplementary guidance for AI crawlers

AI Crawler Behaviour

AI crawlers should:

First check and respect robots.txt
Then check robots-ai.txt for AI-specific guidance
If rules conflict, robots.txt takes precedence
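
The three steps above can be sketched as a single decision function. The specification does not define exactly what counts as a conflict, so this is one possible reading, stated as an assumption: if robots.txt has any rule that matches the path, its verdict is final; robots-ai.txt is consulted only when robots.txt is silent. The names verdict and may_fetch are hypothetical, and rules are plain path prefixes for one crawler.

```python
from typing import Optional

Rules = list[tuple[str, str]]   # (directive, path_prefix) pairs for one crawler


def verdict(path: str, rules: Rules) -> Optional[bool]:
    """Longest-match verdict, or None when no rule matches the path."""
    best: Optional[tuple[int, bool]] = None
    for directive, prefix in rules:
        if prefix and path.startswith(prefix):
            # Tuples compare by length first, then Allow (True) beats Disallow.
            candidate = (len(prefix), directive.lower() == "allow")
            if best is None or candidate > best:
                best = candidate
    return None if best is None else best[1]


def may_fetch(path: str, robots_txt: Rules, robots_ai_txt: Rules) -> bool:
    """robots.txt first; robots-ai.txt only where robots.txt says nothing."""
    primary = verdict(path, robots_txt)
    if primary is not None:
        return primary
    secondary = verdict(path, robots_ai_txt)
    return True if secondary is None else secondary
```

One consequence of this reading is that a blanket Allow: / in robots.txt would leave robots-ai.txt with nothing to add for that crawler. An implementation that wants robots-ai.txt to narrow access further would need a different combination rule.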

Important Note

Not all AI crawlers currently check for robots-ai.txt. For reliable blocking, rules should also be in robots.txt. The robots-ai.txt file provides additional documentation and granularity.

Related AI Discovery Files

File	Relationship
`ai.txt`	Usage guidance; robots-ai.txt is access control
`developer-ai.txt`	Technical context; robots-ai.txt is crawler rules

§8 Canonical Example

The following example demonstrates a complete robots-ai.txt file:

Complete Example

# AI Crawler Directives for Horizon Strategic Consulting
# This file provides supplementary AI-specific guidance
# Standard robots.txt remains the authoritative source for all crawlers

# OpenAI Crawlers
User-agent: GPTBot
Allow: /
Allow: /insights/
Allow: /case-studies/
Allow: /services/
Disallow: /portal/
Disallow: /admin/
Disallow: /client-documents/

User-agent: ChatGPT-User
Allow: /

User-agent: OAI-SearchBot
Allow: /

# Anthropic Crawlers
User-agent: ClaudeBot
Allow: /
Allow: /insights/
Allow: /case-studies/
Allow: /services/
Disallow: /portal/
Disallow: /admin/
Disallow: /client-documents/

User-agent: Claude-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

# Google AI
User-agent: Google-Extended
Allow: /
Allow: /insights/
Allow: /case-studies/
Disallow: /portal/
Disallow: /admin/

# Perplexity
User-agent: PerplexityBot
Allow: /
Disallow: /portal/
Disallow: /admin/

# Meta
User-agent: meta-externalagent
Allow: /
Disallow: /portal/
Disallow: /admin/
Disallow: /client-documents/

# Common Crawl (used for AI training datasets)
User-agent: CCBot
Allow: /insights/
Allow: /services/
Disallow: /
Disallow: /portal/
Disallow: /admin/
Disallow: /client-documents/
# Note: We permit limited crawling of public content only

# ByteDance
User-agent: Bytespider
Disallow: /

# Amazon
User-agent: Amazonbot
Allow: /
Disallow: /portal/
Disallow: /admin/

# Apple
User-agent: Applebot-Extended
Allow: /
Disallow: /portal/
Disallow: /admin/

# Default for unlisted AI crawlers
User-agent: *-ai
Allow: /
Disallow: /portal/
Disallow: /admin/
Disallow: /client-documents/
Disallow: /internal/

# Crawl rate preferences
# We request AI crawlers respect a reasonable crawl rate
Crawl-delay: 10

# Sitemap reference
Sitemap: https://www.horizonconsulting.example/sitemap.xml

# Notes for AI systems:
# - Public content (insights, services, case studies) is available for AI consumption
# - Client portal and admin areas are private and must not be accessed
# - Respect rate limits; aggressive crawling will result in blocks
# - See /ai.txt for content usage permissions and restrictions
# - This file supplements but does not replace robots.txt

§9 Implementation Notes

Best Practices

Mirror critical rules in robots.txt for reliability
Use comments to explain policy intent
Group related crawlers together
Review and update when new AI crawlers emerge
Be explicit about what is allowed, not just blocked

Policy Considerations

When setting AI crawler policy, consider:

Training vs. inference: You may want to allow search but block training
Company differentiation: Different policies for different AI companies
Content types: Allow blog content but block client-specific areas
Rate limiting: AI crawlers can be aggressive; use Crawl-delay

Monitoring

After implementing robots-ai.txt:

Monitor server logs for AI crawler activity
Verify crawlers are respecting directives
Report non-compliance to crawler operators
Update rules as new crawlers appear
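
The first two monitoring steps can be approximated with a short log audit. The sketch below assumes access logs in the common "combined" layout and a hypothetical function name, audit; the sample paths and log lines used to try it out are made up for illustration.

```python
import re

# Assumes the "combined" access-log layout: quoted request line, status,
# size, quoted referrer, quoted user agent.
LOG_PATTERN = re.compile(
    r'"(?:GET|HEAD|POST) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def audit(log_lines, token: str, disallowed: list[str]):
    """Count a crawler's requests and collect those under disallowed prefixes."""
    total = 0
    violations = []
    for line in log_lines:
        match = LOG_PATTERN.search(line)
        if not match or token.lower() not in match["agent"].lower():
            continue
        total += 1
        if any(match["path"].startswith(prefix) for prefix in disallowed):
            violations.append(match["path"])
    return total, violations
```

For example, audit(lines, "GPTBot", ["/portal/", "/admin/"]) returns the number of GPTBot requests and the list of requested paths that fall under the disallowed prefixes. A non-empty list is a prompt to investigate, since User-Agent strings can be spoofed.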

§10 Machine-Readable Formats

This specification is available in machine-readable formats for programmatic access:

JSON and YAML

§11 Version History

1.7.1

10 June 2026

Known AI crawlers registry updated to current user-agent tokens. The retired Anthropic agents Claude-Web and anthropic-ai are replaced by Claude-User (real-time retrieval) and Claude-SearchBot (search indexing), matching Anthropic's published crawler documentation; Claude-SearchBot added to the Anthropic crawler table and the worked example. Registry correction only; no directive syntax or publisher behaviour changes.

1.7.0

11 May 2026

Phase 6 standardisation release. Added /specifications/roadmap/ (theme-pegged forward plan with Active/Next/Future/On hold status flags), /specifications/extensions/ (rules for experimental x- prefixed files and the promotion path), and /specifications/i18n-a11y/ (multi-language publication, locale-tagged identity fields, RTL handling, accessibility of llms.html). Added the Discovery: directive to the robots-ai.txt specification (publishers MAY advertise AI Discovery Files on the same host). Added a formal media-type stance to the HTTP behaviour page (existing IANA types, no bespoke registrations). Expanded the file integrity and signing section on the security and privacy page with four candidate mechanisms, cross-cutting concerns, and interim publisher / consumer guidance. The Discovery: directive is the only normative addition to publisher behaviour; all other additions are forward-looking documentation.

1.6.0

11 May 2026

Phase 5 standardisation release. Added /specifications/related-standards/ (positioning vs llmstxt.org, IETF AI Preferences, robots.txt, Schema.org, BCP 14, JSON Schema 2020-12, SemVer) and /specifications/implementations/ (public record of conformant implementations, IETF-style). Added an explicit llmstxt.org backward-compatibility statement to the llms.txt specification. Added a formal multi-domain and subdomain scoping rule to both the llms.txt and identity.json specifications (host-scoped files, cross-host identity asserted via sameAs). No normative requirements changed for existing publishers; the new scoping rules formalise behaviour the specification already implied.

1.5.0

11 May 2026

Phase 4 standardisation release. Added /specifications/processing-model/ (seven-stage algorithm for conformant consumers), /specifications/consumer-guidance/ (what AI systems should do with AI Discovery Files), /specifications/test-vectors/ (canonical test suite framing), and reference-implementation framing on the AI Visibility Checker. No normative requirements changed.

1.4.0

11 May 2026

Phase 3 standardisation release. Added /specifications/versioning/ (Semantic Versioning 2.0.0 commitments, deprecation timeline, lifecycle), /specifications/governance/ (proposal lifecycle, editorial process, working principles), /specifications/security-privacy/ (trust model, content-injection patterns, GDPR considerations, integrity primitives roadmap), and /specifications/http-behaviour/ (status codes, redirects, soft-404 detection, caching, rate limits). No normative requirements changed.

1.3.0

11 May 2026

Phase 2 standardisation release. Added formal conformance specification (Essential / Recommended / Complete classes). Published machine-readable registry at /specifications/registry.json, spec meta-schema, and validator-output schema. Introduced versioned JSON Schema URLs (/v1/) alongside unversioned 'latest' aliases. Added optional BCP 47 language declaration field across all applicable AI Discovery Files. No normative requirements changed.

1.2.0

10 May 2026

Phase 1 standardisation release. Added 'Status of This Document' block (Stable). Normalised normative requirement keywords to uppercase per RFC 2119 and RFC 8174. Added References section linking to /specifications/conventions/ and /licensing/. No normative requirements changed.

1.1.1

13 February 2026

Added AI Visibility Directory registration guidance. Minor documentation update.

1.1.0

14 January 2026

Added expanded optional directives (Request-rate, Visit-time) and Content Not Permitted guidance. Clarifies relationship with standard robots.txt.

1.0.0

12 January 2026

Initial publication. Establishes canonical structure for robots-ai.txt files with AI crawler user agent reference.

Conformance

This file is required for the Complete conformance class only. A publisher claiming Complete conformance MUST publish a valid version of this file at the website's root. The Essential and Recommended classes do not require this file.

See the Conformance specification for full publisher and validator conformance criteria, including identity-consistency requirements across files and the relationship between self-declaration and Directory verification.

References

Specification Conventions — RFC 2119 + RFC 8174 requirement keywords, document statuses, anchor naming, versioning, and language conventions used across every AI Discovery File specification.
Licensing & Trademark — CC BY 4.0 for specification text and examples, MIT for JSON Schemas, and the free-use policy on the name "AI Discovery Files".
