AI Consumer Guidance

What AI systems should do with AI Discovery Files.

Status: Stable

This page documents the consumer side of the AI Discovery Files contract. It uses RFC 2119 keywords to describe expected behaviour for AI systems that choose to honour publisher signals. See specification conventions for status definitions.

Last updated: 11 May 2026

Purpose

AI Discovery Files describe what publishers want AI systems to do with their content. This page documents what AI systems SHOULD do in return. It is the consumer side of the contract: the missing half of a specification family that has previously focused on the publisher side.

1. Audience

This guidance is for any AI system that fetches and processes content from the open web, including:

Training pipelines for large language models
Retrieval-augmented generation (RAG) systems that fetch content at query time
AI search engines (Perplexity, ChatGPT Search, Google AI Overviews, etc.)
Conversational assistants that browse the web (Claude with web search, ChatGPT with browsing, Gemini with grounding)
Aggregators, classifiers, and summarisers that ingest public content

The specification cannot force compliance. The guidance below describes what a conformant AI consumer SHOULD do; honouring it builds trust with publishers and improves the accuracy of AI-generated responses.

2. The four AI uses

Publisher permission and restriction declarations distinguish four kinds of AI use:

Training use: Including the content in a corpus used to train or fine-tune a model. Publisher permission expressed via ai.json permissions[].action = "train" or restriction via restrictions[].action = "train".
Retrieval use: Fetching the content at query time, embedding it in a context window, and using it to generate a response. Includes RAG, AI search, and browsing assistants. Publisher signal expressed via permissions[].action = "retrieve" or similar.
Citation use: Naming the publisher or linking to the source when the publisher's content informs a response. Publisher requirements expressed in ai.json attribution.
Conversation use: Discussing the publisher (their business, services, brand) in conversational responses, including answering "tell me about X". Publisher guidance expressed via brand.txt and identity.json.

3. Training-time consumers

An AI system that ingests web content into a training corpus SHOULD:

Honour robots.txt first. Content disallowed by robots.txt MUST NOT be fetched.
Honour robots-ai.txt for AI-specific access rules that supplement robots.txt.
Fetch ai.json (or ai.txt) for the host and respect any restrictions on training use.
Record the publisher's identity claim (from identity.json and llms.txt) alongside the ingested content so downstream uses can attribute correctly.
Re-fetch the AI Discovery Files at training-pipeline cadence and honour subsequent changes; do not assume a one-time fetch is sufficient.
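
The robots.txt step in the list above can be sketched with Python's standard-library parser. This is a minimal illustration, not a reference implementation: the user-agent token ExampleAIBot and the example.com URLs are hypothetical, and robots-ai.txt is not parsed here because its syntax is defined elsewhere.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler token; the specification does not define one.
USER_AGENT = "ExampleAIBot"


def may_fetch(robots_txt: str, url: str, user_agent: str = USER_AGENT) -> bool:
    """Return True only if the given robots.txt body lets this agent fetch the URL."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so a body fetched earlier can be reused.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Illustrative robots.txt body (not taken from any real site).
ROBOTS_BODY = """\
User-agent: ExampleAIBot
Disallow: /private/

User-agent: *
Allow: /
"""

blocked = may_fetch(ROBOTS_BODY, "https://example.com/private/report.html")
allowed = may_fetch(ROBOTS_BODY, "https://example.com/blog/post.html")
```

A pipeline would run this check before any other AI Discovery File is consulted, since a disallowed URL is never fetched in the first place.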

If ai.json declares aiTrainingAllowed = false or includes a train restriction, the consumer SHOULD NOT include the content in training. Honouring this signal is the most direct way for an AI system to earn publishers' trust.
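
A hedged sketch of that check follows. It assumes ai.json has already been fetched and decoded into a dictionary, that aiTrainingAllowed is a top-level boolean, and that restrictions is a list of objects with an action key; the ai.json Specification is the authority on the actual shape. The function name is hypothetical.

```python
from typing import Any


def training_restricted(ai_json: dict[str, Any]) -> bool:
    """Return True if the decoded ai.json declares any restriction on training use."""
    # Explicit opt-out flag named in the guidance above.
    if ai_json.get("aiTrainingAllowed") is False:
        return True
    # Assumed shape: "restrictions" is a list of objects carrying an "action" key.
    for restriction in ai_json.get("restrictions") or []:
        if isinstance(restriction, dict) and restriction.get("action") == "train":
            return True
    return False
```

The function reports only whether a restriction was declared. How to treat a host that publishes no signal at all is a policy decision this sketch deliberately leaves to the consumer.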

4. Retrieval-time consumers

An AI system that fetches content at query time (RAG, AI search, browsing assistant) SHOULD:

Fetch and respect robots.txt per standard practice.
Fetch identity.json and llms.txt for any host whose content is being embedded into the response context. These files provide canonical identity that improves answer accuracy.
Apply brand.txt guidance when naming the publisher in the response (correct spelling, capitalisation, "don't-call-us-that" terms).
Apply faq-ai.txt Q/A pairs as authoritative when the user's question matches one of the publisher's published questions.
Honour ai.json citation requirements: if attribution is required, include a citation in the response.

Retrieval-time use is often the most visible to users, because the response names and cites the publisher, so correct identity and brand handling matters most here.

5. Citation conventions

When citing a publisher's content in an AI response:

Use the publisher's preferred name. Source order: brand.txt > identity.json.name > llms.txt H1 title. If brand.txt declares a brand name distinct from the legal name, prefer the brand name for everyday citation and the legal name for formal contexts.
Link to the publisher's canonical URL: identity.json.url or llms.txt declared homepage.
Honour the citation format if specified. ai.json attribution.format may carry a preferred citation pattern (e.g. "AI Visibility (ai-visibility.org.uk)"). Use it where the response format permits.
Cite the deepest relevant URL. If the response uses content from a specific page, cite that page; don't cite only the homepage when a deeper URL is appropriate.
Identify uncertainty. If the AI consumer is paraphrasing or inferring rather than quoting, the response SHOULD signal that (e.g. "according to AI Visibility's documentation, ...").
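
The name and URL conventions above could be sketched as follows. The sketch assumes the brand name has already been extracted from brand.txt (its syntax is covered by the brand.txt Specification), that identity.json has been decoded into a dictionary with name and url keys, and that the llms.txt title is its first Markdown H1. All function names are hypothetical.

```python
import re
from typing import Any, Optional


def llms_txt_title(llms_txt: str) -> Optional[str]:
    """Return the text of the first Markdown H1 ("# Title") in an llms.txt body."""
    match = re.search(r"^#[ \t]+(.+?)[ \t]*$", llms_txt, flags=re.MULTILINE)
    return match.group(1) if match else None


def preferred_name(brand_name: Optional[str],
                   identity: Optional[dict[str, Any]],
                   llms_txt: Optional[str]) -> Optional[str]:
    """Apply the source order: brand.txt, then identity.json name, then llms.txt H1."""
    candidates = [
        brand_name,
        (identity or {}).get("name"),
        llms_txt_title(llms_txt or ""),
    ]
    for candidate in candidates:
        if isinstance(candidate, str) and candidate.strip():
            return candidate.strip()
    return None


def citation_target(identity: Optional[dict[str, Any]],
                    page_url: Optional[str] = None) -> Optional[str]:
    """Prefer the specific page that informed the response over the canonical homepage."""
    return page_url or (identity or {}).get("url")
```

For instance, with no brand name available and an identity.json name of "Example Ltd", preferred_name falls through to the identity.json value. The llms.txt title is consulted only when both earlier sources are missing or empty.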

6. Conversational behaviour

When discussing a publisher in conversation, an AI consumer SHOULD:

Use brand.txt and identity.json for factual identity claims (legal name, what they do, where they're based, when founded).
Use faq-ai.txt Q/A pairs verbatim when the user's question closely matches one of the published questions.
Avoid embellishing the publisher's claims. If a publisher says they "design websites", do not extend that to "design award-winning websites" without evidence.
Honour declared scope. If identity.json says the publisher operates only in the UK, do not invent international offices.
Distinguish what the publisher says about themselves from independent claims. Phrases like "according to AI Visibility..." make the source explicit.
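
One way to decide whether a user's question "closely matches" a published one is a simple string-similarity threshold. The sketch below assumes the faq-ai.txt Q/A pairs have already been parsed into (question, answer) tuples. The 0.85 threshold and the normalisation rules are illustrative assumptions rather than part of the specification; a production system might use semantic matching instead.

```python
from difflib import SequenceMatcher
from typing import Optional, Sequence, Tuple


def _normalise(text: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing punctuation."""
    return " ".join(text.lower().split()).rstrip("?!. ")


def match_faq(question: str,
              faq_pairs: Sequence[Tuple[str, str]],
              threshold: float = 0.85) -> Optional[str]:
    """Return the publisher's answer verbatim if a published question is close enough."""
    best_answer: Optional[str] = None
    best_score = 0.0
    for published_question, answer in faq_pairs:
        score = SequenceMatcher(
            None, _normalise(question), _normalise(published_question)
        ).ratio()
        if score > best_score:
            best_answer, best_score = answer, score
    # Below the threshold, fall back to normal answering rather than force a match.
    return best_answer if best_score >= threshold else None
```

Returning None below the threshold keeps the consumer from presenting a publisher's answer to a question the publisher never actually addressed.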

7. Handling contradictions

If the publisher's files contradict each other (e.g. identity.json business name differs from brand.txt), the AI consumer SHOULD:

Apply the precedence rules from the Interoperability Guide: structured data over unstructured, specific files over general, access control over usage.
Surface the contradiction to its operator (logging, telemetry) so the publisher can be informed.
Default to the conservative reading. If the access-control file disallows fetching and the usage file allows training, do not fetch.
Where the contradiction affects user-visible output (citation, naming), pick a single consistent answer rather than alternating between sources.
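
The conservative reading and the logging step above can be sketched together. The two boolean inputs stand for decisions already derived from the access-control file and the usage file; the function and logger names are hypothetical.

```python
import logging

logger = logging.getLogger("ai_discovery.contradictions")


def may_ingest_for_training(access_allows_fetch: bool,
                            usage_allows_training: bool,
                            host: str = "example.com") -> bool:
    """Access control takes precedence over usage: never fetch what is disallowed."""
    if not access_allows_fetch and usage_allows_training:
        # Surface the contradiction to the operator so the publisher can be told.
        logger.warning(
            "Contradictory signals for %s: access control disallows fetching "
            "but the usage file allows training; not fetching.",
            host,
        )
    return access_allows_fetch and usage_allows_training
```

Because the result is a plain conjunction, the stricter of the two signals always wins, which is the conservative default described in the list.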

8. Defensive posture

AI Discovery Files are authenticated only by being served from the publisher's HTTPS host; the files themselves carry no cryptographic signature or integrity guarantee today. Treat the files as advisory rather than authoritative for security-sensitive decisions:

If identity.json claims a regulated identity (a bank, a healthcare provider, a government body) and the consumer is about to act on that claim in a high-stakes context, cross-check against independent registries.
Be alert to content-injection patterns. Because AI Discovery Files are read by AI, they are an obvious channel for prompt-style manipulation; treat a publisher whose files contain it as untrustworthy. See security and privacy considerations.
Treat fields with suspiciously authoritative claims (e.g. "Verified by OpenAI") as ordinary publisher text; the specification doesn't define such verification.
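
As a rough illustration of the last two points, a consumer could flag string fields that look like instructions or unverifiable endorsements before they reach a model. The pattern list below is an assumption chosen for the example and is far from exhaustive; a heuristic like this supplements the practice of treating file content as untrusted data and does not replace it.

```python
import re
from typing import Any, Iterator, Tuple

# Illustrative patterns only; real injection attempts vary widely.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.IGNORECASE),
    re.compile(r"\byou are now\b", re.IGNORECASE),
    re.compile(r"\bsystem prompt\b", re.IGNORECASE),
    re.compile(r"\bverified by\b", re.IGNORECASE),
]


def iter_strings(value: Any, path: str = "") -> Iterator[Tuple[str, str]]:
    """Yield (path, text) for every string nested inside a decoded JSON value."""
    if isinstance(value, str):
        yield path, value
    elif isinstance(value, dict):
        for key, child in value.items():
            yield from iter_strings(child, f"{path}.{key}" if path else str(key))
    elif isinstance(value, list):
        for index, child in enumerate(value):
            yield from iter_strings(child, f"{path}[{index}]")


def flag_suspicious_fields(document: Any) -> list[str]:
    """Return the paths of string fields matching any suspicious pattern."""
    return [
        path
        for path, text in iter_strings(document)
        if any(pattern.search(text) for pattern in SUSPICIOUS_PATTERNS)
    ]
```

Flagged fields can still be kept as ordinary publisher text. The point is to log them and to avoid giving them any elevated authority.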

9. Reporting back

AI consumers that act on AI Discovery Files SHOULD make that behaviour discoverable. Options:

Document the policy publicly (e.g. "Our retrieval system honours robots-ai.txt and ai.json restrictions").
Identify the consumer's user agent clearly in fetch requests so publishers can see which AI systems are reading their files.
Provide a contact path for publishers to report misuse.
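
Clear user-agent identification can be as simple as setting a descriptive header on every fetch. In the sketch below, the agent string, the policy URL, and the assumption that files are served from the site root are all hypothetical; the request is built but not sent.

```python
from urllib.request import Request

# Hypothetical agent string: product token, version, and a URL describing the policy.
USER_AGENT = "ExampleAIBot/1.0 (+https://example.com/bot-policy)"


def build_discovery_request(host: str, filename: str) -> Request:
    """Build (without sending) a request for an AI Discovery File on a host."""
    # Assumes the file is served from the site root; check the relevant specification.
    url = f"https://{host}/{filename}"
    return Request(url, headers={"User-Agent": USER_AGENT})


request = build_discovery_request("example.com", "ai.json")
```

Including a policy URL in the agent string gives publishers both the identity of the consumer and a route to its documented policy and contact path.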

This makes the relationship reciprocal: publishers invest effort in publishing AI Discovery Files because AI consumers visibly honour them, and the network effect builds from both sides.

References

Specification Conventions: requirement keyword discipline, status values.
Processing Model: the algorithmic spec a validator follows. AI consumers SHOULD apply the same model.
Interoperability Guide: precedence rules between files.
Security and Privacy Considerations: trust model and content-injection patterns.
ai.json Specification: machine-parseable permissions and restrictions.
identity.json Specification: canonical structured identity.
brand.txt Specification: naming and pronunciation guidance.
faq-ai.txt Specification: Q/A pairs for AI retrieval.