AI Consumer Guidance
What AI systems should do with AI Discovery Files.
This page documents the consumer side of the AI Discovery Files contract. It uses RFC 2119 keywords to describe expected behaviour for AI systems that choose to honour publisher signals. See specification conventions for status definitions.
AI Discovery Files describe what publishers want AI systems to do with their content. This page documents what AI systems SHOULD do in return. It is the consumer side of the contract: the missing half of a specification family that has previously focused on the publisher side.
1. Audience
This guidance is for any AI system that fetches and processes content from the open web, including:
- Training pipelines for large language models
- Retrieval-augmented generation (RAG) systems that fetch content at query time
- AI search engines (Perplexity, ChatGPT Search, Google AI Overviews, etc.)
- Conversational assistants that browse the web (Claude with web search, ChatGPT with browsing, Gemini with grounding)
- Aggregators, classifiers, and summarisers that ingest public content
The specification cannot force compliance. The guidance below describes what a conformant AI consumer SHOULD do; honouring it builds trust with publishers and improves the accuracy of AI-generated responses.
2. The four AI uses
Publisher permission and restriction declarations distinguish four kinds of AI use:
- Training use: Including the content in a corpus used to train or fine-tune a model. Publisher permission expressed via ai.json permissions[].action = "train" or restriction via restrictions[].action = "train".
- Retrieval use: Fetching the content at query time, embedding it in a context window, and using it to generate a response. Includes RAG, AI search, and browsing assistants. Publisher signal expressed via permissions[].action = "retrieve" or similar.
- Citation use: Naming the publisher or linking to the source when the publisher's content informs a response. Publisher requirements expressed in ai.json attribution.
- Conversation use: Discussing the publisher (their business, services, brand) in conversational responses, including answering "tell me about X". Publisher guidance expressed via brand.txt and identity.json.
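As an illustration of how a consumer might read the action-based signals above, the following is a minimal sketch that reports a publisher's declared stance on one action. It assumes ai.json has already been fetched and parsed into a dict, and that permissions and restrictions are lists of objects with an "action" key, as the field paths above suggest. The function name declared_stance and the three return labels are hypothetical, not part of the specification.

```python
def declared_stance(ai_json: dict, action: str) -> str:
    """Return 'restricted', 'permitted', or 'unspecified' for one action.

    Assumes ai.json was already parsed into a dict. The return labels are
    illustrative and not defined by the specification.
    """
    def actions(key: str) -> set:
        entries = ai_json.get(key) or []
        return {e.get("action") for e in entries if isinstance(e, dict)}

    # Assumption: when both lists name the same action, take the
    # conservative reading and let the restriction win.
    if action in actions("restrictions"):
        return "restricted"
    if action in actions("permissions"):
        return "permitted"
    return "unspecified"
```

For example, a document with permissions [{"action": "retrieve"}] and restrictions [{"action": "train"}] would yield "permitted" for "retrieve" and "restricted" for "train".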
3. Training-time consumers
An AI system that ingests web content into a training corpus SHOULD:
- Honour robots.txt first. Content disallowed by robots.txt MUST NOT be fetched.
- Honour robots-ai.txt for AI-specific access rules that supplement robots.txt.
- Fetch ai.json (or ai.txt) for the host and respect any restrictions on training use.
- Record the publisher's identity claim (from identity.json and llms.txt) alongside the ingested content so downstream uses can attribute correctly.
- Re-fetch the AI Discovery Files at training-pipeline cadence and honour subsequent changes; do not assume a one-time fetch is sufficient.
If ai.json declares aiTrainingAllowed = false or includes a train restriction, the consumer SHOULD NOT include the content in training. Honouring this signal is the strongest demonstration of trust between AI systems and publishers.
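The training-time checks can be sketched roughly as below. This is a simplified illustration, not a reference implementation: it assumes the robots.txt body and the parsed ai.json are already in hand, uses Python's standard urllib.robotparser for the robots.txt step, and leaves out robots-ai.txt because its format is not described on this page. The function name may_train_on and the user agent string are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def may_train_on(url: str, robots_txt: str, ai_json: dict, user_agent: str) -> bool:
    """Decide whether a URL may enter a training corpus (illustrative only)."""
    # Step 1: robots.txt comes first; disallowed content is never fetched.
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    if not robots.can_fetch(user_agent, url):
        return False

    # Step 2: an explicit aiTrainingAllowed = false in ai.json.
    if ai_json.get("aiTrainingAllowed") is False:
        return False

    # Step 3: a "train" entry in restrictions[].
    for entry in ai_json.get("restrictions") or []:
        if isinstance(entry, dict) and entry.get("action") == "train":
            return False

    return True
```

A pipeline would call this again each time it re-fetches the files, so that a publisher's later opt-out takes effect.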
4. Retrieval-time consumers
An AI system that fetches content at query time (RAG, AI search, browsing assistant) SHOULD:
- Fetch and respect robots.txt per standard practice.
- Fetch identity.json and llms.txt for any host whose content is being embedded into the response context. These files provide canonical identity that improves answer accuracy.
- Apply brand.txt guidance when naming the publisher in the response (correct spelling, capitalisation, "don't-call-us-that" terms).
- Apply faq-ai.txt Q/A pairs as authoritative when the user's question matches one of the publisher's published questions.
- Honour ai.json citation requirements: if attribution is required, include a citation in the response.
Retrieval-time access is often the most visible to users (the response cites the publisher), so this is where correct identity and brand handling matters most.
5. Citation conventions
When citing a publisher's content in an AI response:
- Use the publisher's preferred name. Source order: brand.txt > identity.json.name > llms.txt H1 title. If brand.txt declares a brand name distinct from the legal name, prefer the brand name for everyday citation and the legal name for formal contexts.
- Link to the publisher's canonical URL: identity.json.url or llms.txt declared homepage.
- Honour the citation format if specified. ai.json attribution.format may carry a preferred citation pattern (e.g. "AI Visibility (ai-visibility.org.uk)"). Use it where the response format permits.
- Cite the deepest relevant URL. If the response uses content from a specific page, cite that page; don't cite only the homepage when a deeper URL is appropriate.
- Identify uncertainty. If the AI consumer is paraphrasing or inferring rather than quoting, the response SHOULD signal that (e.g. "according to AI Visibility's documentation, ...").
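The naming and linking rules above might be combined along these lines. This is a sketch under several assumptions: the brand name has already been extracted from brand.txt by the caller (its format is not described here), identity.json is a parsed dict with "name" and "url" keys, the llms.txt title is its first Markdown H1 line, and the preferred citation format is used verbatim as a label. The helper names and the shape of the returned dict are hypothetical.

```python
from typing import Optional


def llms_txt_title(llms_txt: str) -> Optional[str]:
    """Return the first Markdown H1 ("# Title") found in llms.txt, if any."""
    for line in llms_txt.splitlines():
        if line.startswith("# "):
            return line[2:].strip() or None
    return None


def build_citation(
    page_url: Optional[str],
    brand_name: Optional[str],
    identity: dict,
    llms_txt: str = "",
    preferred_format: Optional[str] = None,
) -> dict:
    """Assemble citation parts (illustrative; the dict shape is an assumption)."""
    # Name precedence: brand.txt > identity.json.name > llms.txt H1 title.
    name = brand_name or identity.get("name") or llms_txt_title(llms_txt)
    # Prefer the specific page that was used; fall back to the canonical URL.
    url = page_url or identity.get("url")
    # Use the publisher's preferred citation pattern when one is supplied.
    label = preferred_format or name
    return {"label": label, "name": name, "url": url}
```

Passing the page URL first keeps the "deepest relevant URL" rule as the default, with the homepage used only when no specific page informed the response.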
6. Conversational behaviour
When discussing a publisher in conversation, an AI consumer SHOULD:
- Use brand.txt and identity.json for factual identity claims (legal name, what they do, where they're based, when founded).
- Use faq-ai.txt Q/A pairs verbatim when the user's question closely matches one of the published questions.
- Avoid embellishing the publisher's claims. If a publisher says they "design websites", do not extend that to "design award-winning websites" without evidence.
- Honour declared scope. If identity.json says the publisher operates only in the UK, do not invent international offices.
- Distinguish what the publisher says about themselves from independent claims. Phrases like "according to AI Visibility..." make the source explicit.
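One way to decide whether a user's question "closely matches" a published question is a simple string-similarity check, sketched below with Python's standard difflib. This page does not define what counts as a close match, so the similarity measure and the 0.85 threshold are assumptions. The sketch also assumes the Q/A pairs have already been parsed out of faq-ai.txt into (question, answer) tuples; the name match_faq is hypothetical.

```python
from difflib import SequenceMatcher
from typing import List, Optional, Tuple


def _normalise(text: str) -> str:
    # Lower-case, collapse whitespace, and drop trailing punctuation.
    return " ".join(text.lower().split()).rstrip("?!. ")


def match_faq(
    user_question: str,
    faq_pairs: List[Tuple[str, str]],
    threshold: float = 0.85,  # assumed cut-off, not defined by the specification
) -> Optional[str]:
    """Return the published answer verbatim if a question matches closely."""
    wanted = _normalise(user_question)
    best_answer, best_score = None, 0.0
    for question, answer in faq_pairs:
        score = SequenceMatcher(None, wanted, _normalise(question)).ratio()
        if score > best_score:
            best_answer, best_score = answer, score
    return best_answer if best_score >= threshold else None
```

Returning None below the threshold leaves the consumer free to answer from other sources rather than forcing a poor match.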
7. Handling contradictions
If the publisher's files contradict each other (e.g. identity.json business name differs from brand.txt), the AI consumer SHOULD:
- Apply the precedence rules from the Interoperability Guide: structured data over unstructured, specific files over general, access control over usage.
- Surface the contradiction to its operator (logging, telemetry) so the publisher can be informed.
- Default to the conservative reading. If the access-control file disallows fetching and the usage file allows training, do not fetch.
- Where the contradiction affects user-visible output (citation, naming), pick a single consistent answer rather than alternating between sources.
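The conservative default and the "single consistent answer" rule can be sketched as two small helpers. Both are illustrative: the precedence order is passed in by the caller rather than hard-coded, since the authoritative rules live in the Interoperability Guide, and the logger name and function names are hypothetical.

```python
import logging
from typing import Dict, List, Optional

logger = logging.getLogger("discovery.contradictions")  # hypothetical logger name


def may_fetch_for_training(access_allows_fetch: bool, usage_allows_training: bool) -> bool:
    """Conservative reading: both signals must allow the action."""
    if access_allows_fetch != usage_allows_training:
        # Surface the disagreement to the operator.
        logger.warning(
            "access control (%s) and usage (%s) signals disagree",
            access_allows_fetch,
            usage_allows_training,
        )
    return access_allows_fetch and usage_allows_training


def pick_display_name(candidates: Dict[str, str], precedence: List[str]) -> Optional[str]:
    """Pick one name deterministically so output never alternates between sources."""
    if len(set(candidates.values())) > 1:
        logger.warning("publisher files disagree on name: %s", candidates)
    for source in precedence:
        if candidates.get(source):
            return candidates[source]
    return None
```

Because the choice depends only on the precedence list, the same inputs always produce the same name across responses.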
8. Defensive posture
AI Discovery Files are authenticated only by the HTTPS connection to the publisher's host; the files themselves carry no cryptographic integrity guarantee today. Treat the files as advisory rather than authoritative for security-sensitive decisions:
- If identity.json claims a regulated identity (a bank, a healthcare provider, a government body) and the consumer is about to act on that claim in a high-stakes context, cross-check against independent registries.
- Be alert to content-injection patterns. AI Discovery Files are read by AI, so a publisher who embeds prompt-style manipulation in them should be treated as untrustworthy. See security and privacy considerations.
- Treat fields with suspiciously authoritative claims (e.g. "Verified by OpenAI") as ordinary publisher text; the specification doesn't define such verification.
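A consumer might run a lightweight screen over parsed files before passing any field into a prompt. The sketch below walks a parsed JSON document and reports the paths of string fields that match a few heuristic patterns. The pattern list is purely illustrative and far from complete; it is an assumption of this example, not something the specification defines, and a match is a reason for caution rather than proof of manipulation.

```python
import re
from typing import List

# Illustrative heuristics only; real deployments would need a broader approach.
SUSPICIOUS_PATTERNS = [
    re.compile(r"ignore (all |any )?(previous|prior|above) instructions", re.I),
    re.compile(r"\byou are now\b", re.I),
    re.compile(r"\bsystem prompt\b", re.I),
    re.compile(r"\bverified by\b", re.I),
]


def flag_suspicious_fields(document: dict) -> List[str]:
    """Return the paths of string fields that match any heuristic pattern."""
    flagged: List[str] = []

    def walk(node, path: str) -> None:
        if isinstance(node, dict):
            for key, value in node.items():
                walk(value, f"{path}.{key}" if path else str(key))
        elif isinstance(node, list):
            for index, value in enumerate(node):
                walk(value, f"{path}[{index}]")
        elif isinstance(node, str):
            if any(p.search(node) for p in SUSPICIOUS_PATTERNS):
                flagged.append(path)

    walk(document, "")
    return flagged
```

Flagged fields can then be logged and handled as ordinary, unverified publisher text instead of being treated as instructions or verified claims.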
9. Reporting back
AI consumers that act on AI Discovery Files SHOULD make it discoverable. Options:
- Document the policy publicly (e.g. "Our retrieval system honours robots-ai.txt and ai.json restrictions").
- Identify the consumer's user agent clearly in fetch requests so publishers can see which AI systems are reading their files.
- Provide a contact path for publishers to report misuse.
This makes the relationship reciprocal: publishers invest effort in publishing AI Discovery Files because AI consumers visibly honour them, and the network effect builds from both sides.
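Identifying the user agent can be as simple as setting a descriptive User-Agent header on every fetch. The sketch below builds such a request with Python's standard urllib.request without sending it. The product name, version, and information URL are placeholders, and the "Product/Version (+URL)" layout is a common crawler convention assumed here rather than a requirement of this guidance.

```python
import urllib.request


def build_discovery_request(
    url: str, product: str, version: str, info_url: str
) -> urllib.request.Request:
    """Build a request whose User-Agent names the consumer and a contact page."""
    # Assumed layout: "Product/Version (+info URL)", a common crawler convention.
    user_agent = f"{product}/{version} (+{info_url})"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Example with placeholder values; nothing is sent over the network here.
request = build_discovery_request(
    "https://example.com/ai.json", "ExampleAIBot", "1.0", "https://example.com/bot"
)
```

Pointing the information URL at the publicly documented policy and contact path ties the three options above together.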
References
- Specification Conventions: requirement keyword discipline, status values.
- Processing Model: the algorithmic spec a validator follows. AI consumers SHOULD apply the same model.
- Interoperability Guide: precedence rules between files.
- Security and Privacy Considerations: trust model and content-injection patterns.
- ai.json Specification: machine-parseable permissions and restrictions.
- identity.json Specification: canonical structured identity.
- brand.txt Specification: naming and pronunciation guidance.
- faq-ai.txt Specification: Q/A pairs for AI retrieval.