Security and Privacy Considerations
Cross-cutting security and privacy guidance for publishers and consumers of AI Discovery Files.
This guidance is published and current. It documents responsibilities that publishers and consumers SHOULD observe; specific normative requirements are surfaced in the individual specifications that introduce them. See specification conventions for status definitions.
Last updated:
AI Discovery Files describe a publisher's identity, scope, permissions, and context to AI systems. The files sit at predictable root-level paths, are publicly fetched, and are meant to be trusted by consumers. This page documents the security and privacy implications of that contract, what the specification does and does not enforce, and what publishers and consumers SHOULD do to handle the gaps responsibly.
1. Trust model
The trust model for AI Discovery Files rests on three assumptions:
- The publisher controls the host. A file served from https://example.com/ai.json is assumed to represent the operator of example.com. The HTTPS host is the trust anchor.
- The transport is HTTPS. Consumers SHOULD fetch files over HTTPS. Files served over plain HTTP carry no integrity guarantees beyond what the transport provides; consumers MAY treat them as untrusted.
- The publisher accepts public disclosure. Anything written in an AI Discovery File is public. There is no privacy boundary inside the file format itself.
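As a rough illustration of the first two assumptions, a consumer might classify a file URL before fetching it. This is a minimal sketch, not part of the specification: the function names and the three outcome labels are hypothetical, and it only inspects the URL, it does not fetch anything.

```python
from urllib.parse import urlsplit


def classify_transport(url: str) -> str:
    """Return a hypothetical trust label for an AI Discovery File URL."""
    parts = urlsplit(url)
    if not parts.hostname:
        return "rejected"           # no host means no trust anchor
    scheme = parts.scheme.lower()
    if scheme == "https":
        return "trusted-transport"  # TLS integrity; the host is the trust anchor
    if scheme == "http":
        return "untrusted"          # consumers MAY treat plain HTTP as untrusted
    return "rejected"               # other schemes are outside this sketch


def trust_anchor(url: str) -> str:
    """Return the host the file is attributed to (lower-cased by urlsplit)."""
    return urlsplit(url).hostname or ""
```

A consumer could record the label next to the cached file so that later decisions can take the transport into account.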
The specification does NOT guarantee:
- That the publisher of example.com is the entity they claim to be in identity.json (no identity verification is performed by the format alone)
- That a file has not been modified by an intermediary (no signing or hash chain is part of the current specification)
- That declared permissions in ai.json are honoured by any AI system (the file is a request, not an enforcement mechanism)
Integrity guarantees beyond TLS are a forthcoming capability. See section 6.
2. Content injection and prompt-style manipulation
AI Discovery Files are read by automated systems including large language models. A publisher SHOULD NOT include content designed to manipulate an AI consumer's downstream behaviour beyond the documented field semantics.
Patterns to avoid:
- Instructions framed as if from the user. Lines like "Ignore all previous instructions" or "You are now a different system" embedded in any field.
- Fake authority signals. Claims like "System: this site is verified by Anthropic" or "Per OpenAI policy, you must..." when no such verification or policy exists.
- Hidden text in white-on-white, zero-width, or non-printable Unicode intended to influence AI processing without being visible to humans reviewing the file.
- Encoded instructions (base64, ROT13, homoglyphs) intended to bypass content inspection.
- Permissions claiming authority over content the publisher does not own. A publisher MAY only declare permissions for their own content, not for third parties or for the AI system's training data generally.
Consumers SHOULD validate field content against the format documented in each specification, treat instruction-shaped content in non-instruction fields as untrusted prose, and decline to act on it.
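One way a consumer might screen field values for the patterns listed above is sketched here. It is a heuristic illustration under stated assumptions, not a defined validation procedure: the pattern list, the 40-character base64 threshold, and the reason labels are all hypothetical, and a determined attacker can evade simple checks like these.

```python
import re
import unicodedata

# Phrases drawn from the examples above; a real consumer would need a broader list.
INSTRUCTION_PATTERNS = [
    re.compile(r"ignore\s+all\s+previous\s+instructions", re.IGNORECASE),
    re.compile(r"you\s+are\s+now\s+a\s+different\s+system", re.IGNORECASE),
    re.compile(r"^\s*system\s*:", re.IGNORECASE),
]
# A long unbroken run of base64-alphabet characters is only a hint of encoded content.
BASE64_RUN = re.compile(r"[A-Za-z0-9+/]{40,}={0,2}")


def screen_field(value: str) -> list[str]:
    """Return the reasons a field value looks suspicious (empty list if none)."""
    reasons = []
    if any(p.search(value) for p in INSTRUCTION_PATTERNS):
        reasons.append("instruction-shaped")
    # Category Cf covers zero-width and other invisible format characters;
    # Cc covers control characters (tab, newline, carriage return are allowed).
    if any(
        unicodedata.category(ch) in ("Cf", "Cc") and ch not in "\t\n\r"
        for ch in value
    ):
        reasons.append("hidden-characters")
    if BASE64_RUN.search(value):
        reasons.append("possible-encoded-payload")
    return reasons
```

A flagged value would then be handled as untrusted prose: stored or displayed if needed, but never passed to a model as an instruction.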
3. Personal data and GDPR
Several AI Discovery Files describe people: identity.json can list founders and notable employees; brand.txt can mention spokespeople; contact fields can include personal email addresses or telephone numbers.
Publishers SHOULD treat the file as a public statement to which all UK GDPR / EU GDPR principles apply:
- Lawful basis. Identifying a named individual in a public file requires a lawful basis under Article 6. For employees, this is typically a contractual or legitimate-interests basis; for public-facing roles (founders, executives, press contacts) already named on the company website, the basis is usually established.
- Minimisation. Include only the personal data the file's purpose requires. A press contact's office telephone number is appropriate; their home address is not.
- Right to erasure. Individuals named in the file MAY request removal. Publishers SHOULD have a documented process for honouring such requests promptly.
- Special category data. Health, religion, sexuality, ethnicity, political opinions, and trade union membership SHOULD NOT appear in AI Discovery Files. There is no field defined for such data; including it would be a misuse of the format.
For sole traders, single-person businesses, and home-based businesses, the "company contact" is often the operator's personal contact, so the line between business data and personal data matters. Publishers SHOULD use a dedicated business email address and a separate business contact channel where practical.
4. Access control
AI Discovery Files describe usage preferences. They do NOT enforce access control. A file declaring "AI-Training: No" is a request; it does not technically prevent an AI crawler from training on the content.
For enforcement, publishers SHOULD layer multiple mechanisms:
- robots.txt at the protocol level: blocks compliant crawlers from fetching content. robots.txt is an external standard and always takes precedence; AI Discovery Files MUST NOT contradict it. See the Interoperability Guide for the precedence rules.
- robots-ai.txt at the protocol level for AI-specific crawlers: supplements robots.txt with directives that target AI user agents specifically.
- HTTP-level access control for content that must not be fetched at all: authentication, IP allowlists, Cloudflare rules. The AI Discovery Files specification does not replace these.
- ai.txt and ai.json for usage declarations once content is fetched: declares permitted and prohibited uses (training, citation, summarisation, etc.). Compliant AI systems honour these.
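The ordering of these layers, as a compliant consumer might apply it, can be sketched as follows. The sketch assumes the usage declaration has already been parsed into a plain mapping of purpose to boolean; that mapping, the purpose names, and the outcome labels are hypothetical and do not reflect the actual ai.json schema. HTTP-level access control is enforced by the server and is therefore not modelled here.

```python
from urllib.robotparser import RobotFileParser


def decide(robots_txt: str, usage: dict[str, bool], agent: str,
           url: str, purpose: str) -> str:
    """Apply robots.txt first, then the usage declaration for fetched content."""
    robots = RobotFileParser()
    robots.parse(robots_txt.splitlines())
    if not robots.can_fetch(agent, url):
        return "do-not-fetch"             # robots.txt takes precedence
    if usage.get(purpose) is False:
        return "fetched-use-prohibited"   # publisher declared this use off-limits
    return "no-prohibition-declared"


ROBOTS = "User-agent: *\nDisallow: /private/\n"
USAGE = {"training": False, "citation": True}  # hypothetical parsed declaration
```

With these inputs, a training use of a public page is declined while a citation use is not, and nothing under /private/ is fetched at all.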
A publisher relying solely on AI Discovery Files for enforcement is relying on goodwill. The files are useful and increasingly respected, but they are not a security boundary.
5. Logging and request fingerprinting
The files at /llms.txt, /ai.txt, etc. are fetched by AI crawlers, validators, the AI Visibility Checker, indexing tools, and curious humans. Each fetch leaves a log line on the publisher's server.
Implications:
- Server access logs may contain personal data (IP addresses, user agents) under GDPR. Standard access-log retention policies apply.
- The pattern of files a publisher chooses to publish reveals editorial choices (e.g. a site that publishes robots-ai.txt with broad Disallow rules is signalling an AI-cautious stance). This is intentional and public.
- Validators (including the AI Visibility Checker) SHOULD identify themselves with a clear User-Agent string and honour robots.txt when fetching publisher files for validation.
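A validator following the last point might build its requests along these lines. This is a sketch only: the User-Agent string and the about-page URL are invented for illustration and do not describe any real tool, and the robots.txt rules are assumed to have been fetched and parsed already.

```python
from typing import Optional
from urllib.request import Request
from urllib.robotparser import RobotFileParser

# Hypothetical identity for an illustrative validator; not a real product string.
VALIDATOR_UA = "ExampleValidator/0.1 (+https://validator.example/about)"


def build_validation_request(url: str, robots: RobotFileParser) -> Optional[Request]:
    """Return an identified request, or None if robots.txt disallows the fetch."""
    if not robots.can_fetch(VALIDATOR_UA, url):
        return None
    return Request(url, headers={"User-Agent": VALIDATOR_UA})
```

Including a contact or information URL in the string gives publishers a way to recognise the fetches in their logs and to ask questions about them.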
6. File integrity and signing
The current specification does not include file-integrity primitives. A consumer that fetches identity.json over HTTPS gets the transport-level integrity TLS provides but no signed assertion that the file's content originated from the claimed publisher. This section documents the current baseline, the candidate mechanisms for a future signing capability, and the cross-cutting concerns that hold signing back to a future MAJOR release.
6.1 Current baseline (1.x)
For the entire 1.x line of the specification, the trust model relies on:
- HTTPS transport (mandated for all files; HTTP-served files are explicitly untrusted)
- The publisher's control of their host (a file at https://example.com/identity.json is "what example.com says it is")
- Independent cross-checks (Companies House, trademark registries, the AI Visible Directory) for high-stakes decisions
This is enough for the specification's stated purpose: helping AI systems discover, interpret, and safely use publishers' machine-readable identity and context. It is not enough to defend against an attacker who controls the publisher's hosting (e.g. via DNS hijack or compromised hosting credentials).
6.2 Candidate signing mechanisms
The maintainer is tracking four candidate mechanisms for a future MAJOR release. Each has different trade-offs:
- DKIM-style detached signatures via DNS. Publishers store a signing public key in a DNS TXT record at a well-known subdomain. A detached signature accompanies each AI Discovery File (either as a sidecar file or as a header). Consumers verify the signature against the DNS-published key. Pros: reuses DKIM operational practice; DNS already anchors host identity. Cons: ties signing to DNS operators; DNS TXT records have size limits; key rotation requires DNS coordination.
- Sigstore-style transparency-log signing. Publishers obtain short-lived signing certificates from a public certificate authority (e.g. Sigstore's Fulcio), sign the files, and the signing event is recorded in a public transparency log. Consumers verify both the signature and the log inclusion proof. Pros: short-lived certificates remove long-term key custody burden; transparency log gives third parties a way to detect malicious signing. Cons: mature for software supply chain, less mature for static content; requires Sigstore-like infrastructure to remain operational.
- JOSE / JWS attached signatures. Each file (or a manifest of files) is wrapped in a JWS structure. The signing key is published at a well-known URL on the publisher's host (e.g. /.well-known/ai-discovery-keys.json). Pros: JOSE is well-understood and well-implemented across languages. Cons: wraps the file content in a JOSE envelope, changing the on-the-wire format; tooling integration cost.
- HTTPS-anchored manifest with content hashes. A single signed manifest file (e.g. /.well-known/ai-discovery-manifest.json) contains SHA-256 hashes of every other AI Discovery File. Consumers fetch the manifest, verify its signature, and then verify each individual file against its hash. Pros: single signature covers all files; individual files stay unmodified. Cons: manifest staleness becomes a failure mode; manifest signing key custody still has to be solved (likely via one of the three mechanisms above).
The maintainer has not committed to any single mechanism. The roadmap entry on integrity / signing (see roadmap) records the status as On hold (2.0 territory) while these designs are evaluated.
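To make the manifest candidate more concrete, the hash-comparison step could look roughly like the sketch below. No manifest format has been specified, so the path-to-hex-digest mapping used here is an assumption, as are the function names and result labels. Verifying the manifest's own signature is deliberately left out, since that depends on whichever key mechanism is eventually chosen.

```python
import hashlib


def sha256_hex(body: bytes) -> str:
    """Hex-encoded SHA-256 digest of a file body."""
    return hashlib.sha256(body).hexdigest()


def verify_against_manifest(manifest: dict[str, str],
                            files: dict[str, bytes]) -> dict[str, str]:
    """Compare each fetched file with the hash the manifest lists for its path."""
    results = {}
    for path, expected in manifest.items():
        body = files.get(path)
        if body is None:
            results[path] = "missing"    # listed but not fetched: possibly stale
        elif sha256_hex(body) == expected.lower():
            results[path] = "match"
        else:
            results[path] = "mismatch"
    for path in files.keys() - manifest.keys():
        results[path] = "unlisted"       # fetched but not covered by the manifest
    return results
```

The "missing" and "unlisted" outcomes correspond to the staleness failure mode noted above: a manifest that lags behind the published files produces them even when nothing malicious has happened.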
6.3 Cross-cutting concerns
Signing is held back to a future MAJOR release because it has implications that ripple across the entire ecosystem:
- Key management. Whichever mechanism is chosen, publishers must generate, store, and protect a signing key. Small-business publishers (the bulk of the AI Discovery Files audience) do not have key-management practice; a signing requirement that demands one effectively prices them out of conformance.
- Key rotation. Long-lived keys are a liability. The chosen mechanism MUST support routine rotation without breaking historical verifications. This is non-trivial: a signature made with a key that has since been rotated out must still be checked against that key's validity and revocation status at signing time, not at verification time.
- Downstream caching. CDNs cache AI Discovery Files. A signature attached to the cached body must remain valid after intermediate caching. CDNs that mangle whitespace, normalise line endings, or rewrite headers will invalidate signatures unless the signing canonicalisation explicitly accounts for them.
- Validator behaviour. Validators (the AI Visibility Checker, third-party tools, AI consumers) must verify signatures consistently. A validator that fails open on a missing or invalid signature gives publishers no incentive to sign correctly; a validator that fails closed locks legitimate publishers out during transition. The deprecation timeline for signed-by-default would need to be at least 12 months (per the versioning policy).
- Revocation. A publisher who discovers their signing key has been compromised needs a path to revoke. The chosen mechanism MUST define how revocation propagates to consumers without requiring real-time lookups for every file fetch (which would re-introduce the centralised dependency the specification deliberately avoids).
- Conformance class membership. Today, a publisher signs nothing. After signing ships, the question is whether Essential / Recommended / Complete classes include a "signed" requirement. The maintainer's working position is that signing will introduce a separate axis (Signed vs Unsigned) rather than collapsing into the existing class hierarchy.
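The downstream-caching concern is easier to see with a small example. The sketch below shows one possible canonicalisation for text files, applied before hashing or signing, so that line-ending and trailing-whitespace changes made by a cache do not alter the digest. The specific rules are an assumption chosen for illustration; no canonicalisation has been defined, and this one does nothing about rewritten headers or re-serialised JSON.

```python
import hashlib


def canonicalise(body: bytes) -> bytes:
    """Normalise the edits a cache is most likely to make to a text file."""
    text = body.decode("utf-8")
    text = text.replace("\r\n", "\n").replace("\r", "\n")      # line endings
    lines = [line.rstrip(" \t") for line in text.split("\n")]  # trailing whitespace
    while lines and lines[-1] == "":
        lines.pop()                                            # trailing blank lines
    return ("\n".join(lines) + "\n").encode("utf-8")


def canonical_digest(body: bytes) -> str:
    """SHA-256 over the canonical form rather than the raw bytes."""
    return hashlib.sha256(canonicalise(body)).hexdigest()
```

Publisher and consumer would both have to apply exactly the same rules, which is why the canonicalisation needs to be fixed by the specification rather than left to individual implementations.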
6.4 What publishers can do now
While integrity primitives are pending, publishers SHOULD adopt operational practices that reduce the risk of unauthorised modification:
- Read-only file deployment. Publish AI Discovery Files via an immutable deploy pipeline (e.g. CI/CD with code review), not via an admin UI that allows ad-hoc edits.
- Separate CDN purge credentials. If the CDN purge credential is the only barrier to a file being replaced at the edge, treat it with the same care as production secrets.
- Documented change history. Keep a public changelog of substantive changes to identity.json, ai.json, and other files where mismatch would be exploitable. The AI Visible Directory records change history for participating publishers.
- Repository hosting. Mirror AI Discovery Files in a public repository (e.g. GitHub) where history is tamper-evident. Consumers that suspect a recent change at https://example.com/identity.json can cross-check against the repository.
- Identity cross-publication. Cross-publish identity assertions (the sameAs array in identity.json) to independent platforms (Companies House, LinkedIn, the AI Visible Directory) so a corrupted file can be detected by reconciliation against external sources.
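The repository cross-check mentioned above could be as simple as diffing the live file against the mirrored copy. The following sketch assumes both versions have already been retrieved as text; the function name and the "mirror/" and "live/" labels are hypothetical.

```python
import difflib


def diff_against_mirror(live: str, mirrored: str,
                        name: str = "identity.json") -> list[str]:
    """Return unified-diff lines from the mirrored copy to the live file."""
    return list(
        difflib.unified_diff(
            mirrored.splitlines(),
            live.splitlines(),
            fromfile=f"mirror/{name}",
            tofile=f"live/{name}",
            lineterm="",
        )
    )
```

An empty result means the live file matches the mirror. A non-empty result is not proof of tampering, since the mirror may simply lag behind a legitimate change, but it tells a reviewer exactly which lines to look at.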
6.5 What consumers should do now
Until file integrity ships, consumers SHOULD:
- Treat AI Discovery Files as advisory rather than authoritative for security-sensitive decisions
- Cross-check identity claims against independent signals where stakes are high
- Honour the publisher's ai.json training and retrieval permissions without requiring proof that the file is signed
- Cache files conservatively; refetch on substantive contradictions with previously-observed values
- Report obviously-tampered files (e.g. a file claiming an identity that contradicts every other signal on the publisher's site) via the AI Visible Directory's feedback channel where applicable
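The "refetch on substantive contradictions" practice might be approximated as below. This is a sketch under assumptions: the "name" field and the rule for what counts as substantive are hypothetical, sameAs values are assumed to be strings, and a real consumer would choose its own watched fields and thresholds.

```python
def contradictions(cached: dict, fresh: dict,
                   fields: tuple[str, ...] = ("name", "sameAs")) -> list[str]:
    """List the watched fields whose values conflict between two observations."""
    changed = []
    for field in fields:
        old, new = cached.get(field), fresh.get(field)
        if isinstance(old, list) and isinstance(new, list):
            # For link arrays, losing every previously seen entry is the signal;
            # additions or partial changes are treated as ordinary updates.
            if old and not set(old) & set(new):
                changed.append(field)
        elif old is not None and old != new:
            changed.append(field)
    return changed
```

A non-empty result would prompt a refetch and, if the conflict persists, a cross-check against independent signals before the new values replace the cached ones.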
References
- Specification Conventions: editorial and structural conventions referenced throughout this page.
- Versioning and Deprecation Policy: how integrity primitives will be introduced (MAJOR release with deprecation timeline).
- Interoperability Guide: precedence rules between robots.txt, robots-ai.txt, and the AI Discovery Files.
- robots-ai.txt Specification: AI-crawler-specific directives that supplement robots.txt.
- ai.json Specification: machine-parseable usage permissions and restrictions.
- UK GDPR guidance (ICO): external reference for personal-data handling.