Processing Model

How conformant consumers process AI Discovery Files.

Status Stable

This processing model is published and recommended for any implementation that consumes AI Discovery Files. The reference validator (AI Visibility Checker) follows this model. See specification conventions for status definitions.

Last updated:

Purpose

This page documents the algorithm a conformant consumer applies when processing a publisher's AI Discovery Files. The model is deterministic: given the same input (target host, registry, and clock time), two conformant implementations MUST produce equivalent normalised output. The output is the contract; the steps below explain how to produce it.

1. Overview

The processing model has seven stages, executed in order. Each stage takes the output of the previous stage and contributes to the final normalised summary documented by the validator-output schema.

  1. Discover: enumerate which files to fetch from the target host.
  2. Fetch: retrieve each file over HTTPS, observing redirects and rate limits.
  3. Validate: check each fetched file against its specification.
  4. Resolve identity: derive a single identity from the available files.
  5. Resolve permissions: derive a single permission set.
  6. Detect contradictions: surface disagreements between files.
  7. Emit normalised summary: produce a result conforming to the validator-output schema.

2. Discover

Given a target host (e.g. example.com):

  1. Load /specifications/registry.json from this site (or a cached copy). The registry is the canonical list of AI Discovery Files and their conventional paths.
  2. For each entry in registry.files, construct the file's URL as https://<host><path>. Example: ADF-001 has path = "/llms.txt", so the URL is https://example.com/llms.txt.
  3. The consumer MAY also honour a publisher-supplied discovery index (forthcoming as part of Phase 6) that points to non-default file locations.

The result of stage 1 is an ordered list of (id, name, path, url) tuples.
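
The URL construction in step 2 can be sketched as follows. This is a minimal illustration, not the reference implementation: the registry entry keys (id, name, path) are assumed to mirror the tuple fields named above, and the sample registry is a hypothetical excerpt in which only ADF-001 and its path come from this page.

```python
from typing import NamedTuple


class Target(NamedTuple):
    id: str
    name: str
    path: str
    url: str


def discover(host: str, registry: dict) -> list[Target]:
    """Build the ordered (id, name, path, url) list from registry["files"]."""
    targets = []
    for entry in registry["files"]:  # registry order is preserved
        path = entry["path"]
        url = f"https://{host}{path}"
        targets.append(Target(entry["id"], entry["name"], path, url))
    return targets


# Hypothetical registry excerpt; the "name" value is an assumption.
sample_registry = {
    "files": [{"id": "ADF-001", "name": "llms.txt", "path": "/llms.txt"}]
}
targets = discover("example.com", sample_registry)
```

Keeping the registry order intact matters later, because the order of files[] in the output is part of the determinism requirement.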

3. Fetch

For each URL produced in stage 1:

  1. Issue an HTTPS GET request with a clear User-Agent identifying the validator.
  2. Follow redirects per HTTP Behaviour: up to 5 hops, no HTTPS-to-HTTP downgrade, warn on cross-host redirects.
  3. Record the final URL, HTTP status, Content-Type, Content-Length, and elapsed time.
  4. Apply soft-404 detection: a 200 response with the wrong Content-Type or unparseable content is reclassified as 404.
  5. Honour 429 with Retry-After; back off exponentially on repeated 429 or 5xx; cap total fetch time per file at a reasonable bound (10 seconds is typical).

The result of stage 2 is a fetch record per file: { found, httpStatus, redirectedTo, contentType, contentLength, body }.
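
The redirect policy in step 2 and the Content-Type half of soft-404 detection in step 4 can be expressed as pure functions, which keeps them testable without network access. The sketch below is illustrative only: the function names and the expected_types parameter are hypothetical, and the "unparseable content" check, retries, and time budget are left out.

```python
from urllib.parse import urlsplit

MAX_HOPS = 5  # redirect limit stated in step 2


def check_redirect_chain(urls: list[str]) -> tuple[bool, list[str]]:
    """urls[0] is the requested URL; each later entry is a redirect target.

    Returns (allowed, messages). A rejected chain returns the reason;
    an allowed chain returns any cross-host warnings.
    """
    if len(urls) - 1 > MAX_HOPS:
        return False, ["too many redirects"]
    warnings = []
    for prev, nxt in zip(urls, urls[1:]):
        p, n = urlsplit(prev), urlsplit(nxt)
        if p.scheme == "https" and n.scheme != "https":
            return False, [f"HTTPS downgrade: {prev} -> {nxt}"]
        if p.hostname != n.hostname:
            warnings.append(f"cross-host redirect: {p.hostname} -> {n.hostname}")
    return True, warnings


def classify(status: int, content_type: str, expected_types: tuple[str, ...]) -> int:
    """Reclassify a 200 response with an unexpected media type as 404."""
    if status == 200:
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type not in expected_types:
            return 404
    return status
```

For example, a site that serves its HTML error page with status 200 at /llms.txt would be classified as 404 when the expected media type is text/plain.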

4. Validate

For each file with found = true:

  1. Parse the body according to the file's format. JSON files parse to a JSON document; text files parse line-by-line per their specification.
  2. For JSON files with a published schema (ai.json, identity.json), validate against the schema referenced by the file's $schema property. Accept either the versioned (/v1/) or unversioned URL.
  3. For text files, apply the format rules documented on the corresponding spec page (e.g. llms.txt structure, ai.txt key:value pairs, brand.txt naming sections).
  4. Record diagnostics: structural errors as errors (mark valid = false); softer concerns (placeholder content, missing recommended fields, malformed BCP 47 tags) as warnings (file may still be valid).

The result of stage 3 is a per-file validation record: { id, name, url, httpStatus, contentType, found, valid, errors[], warnings[] }.
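
As a rough sketch of steps 1, 2, and 4 for a JSON file, the function below parses the body and checks the $schema reference, separating errors from warnings. Several details are assumptions made for illustration: the schema URLs are placeholders, treating a missing $schema as a warning rather than an error is a guess, and full schema validation (which would need a JSON Schema library) is omitted.

```python
import json

# Placeholder URLs standing in for the versioned and unversioned schema locations.
ACCEPTED_SCHEMAS = {
    "https://example.org/schemas/v1/ai.schema.json",
    "https://example.org/schemas/ai.schema.json",
}


def validate_json_file(body: str, accepted_schemas: set[str]) -> dict:
    """Return the valid/errors/warnings part of a validation record."""
    errors: list[str] = []
    warnings: list[str] = []
    try:
        doc = json.loads(body)
    except json.JSONDecodeError as exc:
        # A structural error marks the file invalid.
        return {"valid": False, "errors": [f"invalid JSON: {exc.msg}"], "warnings": []}
    if not isinstance(doc, dict):
        errors.append("top-level value must be an object")  # assumed rule
    else:
        schema = doc.get("$schema")
        if schema is None:
            warnings.append("missing $schema")  # assumed to be a softer concern
        elif schema not in accepted_schemas:
            errors.append(f"unrecognised $schema: {schema}")
    return {"valid": not errors, "errors": errors, "warnings": warnings}
```

A file that produces only warnings still reports valid = true, matching step 4.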

5. Resolve identity

Multiple files carry identity claims. The resolution algorithm walks them in order of authority and produces a single canonical identity:

  1. identity.json is authoritative for structured identity (legal name, brand name, services, founding date, contacts). Use its values as the baseline.
  2. brand.txt is authoritative for naming, pronunciation, and "don't-call-us-that" guidance. If brand.txt declares a brand name distinct from identity.json's name, prefer brand.txt for naming and identity.json for legal name.
  3. llms.txt provides narrative context. Use its prose to enrich the identity (description, summary, services) but do not let it override structured values from identity.json or brand.txt.
  4. ai.txt and ai.json identity hints (name, url) are cross-checked for consistency with the above. Disagreements are recorded as contradictions in stage 6.

Where a publisher publishes only a subset of these files, resolve identity from what's available. A site at the Essential conformance class has only llms.txt and ai.txt; the resolved identity is necessarily less structured than for a site at the Complete class.

See the Interoperability Guide for the full precedence matrix.
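
The precedence described in steps 1 to 3 can be approximated by layering dictionaries from lowest to highest authority. This sketch assumes each file has already been parsed into a flat dictionary, and the key names (legalName, brandName, description, summary, services) are hypothetical; the actual field names are defined by the individual file specifications.

```python
from typing import Optional

NARRATIVE_KEYS = ("description", "summary", "services")


def resolve_identity(
    identity_json: Optional[dict] = None,
    brand_txt: Optional[dict] = None,
    llms_txt: Optional[dict] = None,
) -> dict:
    """Merge identity claims; missing files are simply skipped."""
    resolved: dict = {}
    # Lowest authority: llms.txt contributes narrative enrichment only.
    if llms_txt:
        for key in NARRATIVE_KEYS:
            if key in llms_txt:
                resolved[key] = llms_txt[key]
    # identity.json is the structured baseline and overrides narrative values.
    if identity_json:
        resolved.update(identity_json)
    # brand.txt wins for naming only; the legal name stays with identity.json.
    if brand_txt and "brandName" in brand_txt:
        resolved["brandName"] = brand_txt["brandName"]
    return resolved
```

With only llms.txt available, as at the Essential class, the result contains just the narrative keys, which reflects the "less structured" outcome noted above.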

6. Resolve permissions

Two files carry permission declarations, and the robots files sit above both for access control:

  1. ai.json is authoritative for structured permissions and restrictions. Use its permissions[] and restrictions[] arrays as the baseline.
  2. ai.txt is the human-readable summary. Cross-check that its claims are consistent with ai.json. Disagreements are recorded as contradictions.
  3. robots.txt and robots-ai.txt sit above both ai.txt and ai.json for access control. A robots.txt or robots-ai.txt rule that disallows fetching content overrides any permission declared in ai.json or ai.txt for that content.

The resolved permission set is the union of declared permissions, minus anything contradicted by a higher-precedence file. A consumer MAY choose to apply stricter rules than the publisher declared (defensive posture) but MUST NOT relax them.
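
One way to picture the rule above is with plain sets. The sketch below is a simplification under stated assumptions: permissions are modelled as opaque strings (the labels "summarise", "train", and "quote" are invented for illustration), and the overridden set stands for whatever a higher-precedence file has contradicted. The real permissions[] and restrictions[] entries in ai.json may be richer than strings.

```python
def resolve_permissions(declared_by_file: dict[str, set[str]], overridden: set[str]) -> set[str]:
    """Union of declared permissions, minus anything a higher-precedence file contradicts."""
    union: set[str] = set().union(*declared_by_file.values())
    return union - overridden


def apply_consumer_policy(resolved: set[str], consumer_allowed: set[str]) -> set[str]:
    """A consumer may tighten the resolved set but never widen it.

    Intersection guarantees the result is a subset of what the publisher allowed.
    """
    return resolved & consumer_allowed


resolved = resolve_permissions(
    {"ai.json": {"summarise", "train"}, "ai.txt": {"summarise"}},
    overridden={"train"},  # e.g. blocked by a robots.txt rule
)
```

Using intersection for the consumer policy makes the MUST NOT relax requirement hold by construction: anything the consumer lists that the publisher did not permit is dropped.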

7. Detect contradictions

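Contradictions are surfaced as structured diagnostics in the output. They include the identity disagreements recorded in stage 4 and the permission disagreements recorded in stage 5.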

Each contradiction includes the files involved, the field that disagrees, a one-line description, and a severity (error blocks the corresponding conformance class; warning does not).
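
A contradiction record with the four parts listed above might look like the following. The class and function names are hypothetical, and the comparison is deliberately naive (exact equality on string values); a real validator would likely normalise values before comparing.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Contradiction:
    files: tuple[str, ...]
    field: str
    description: str
    severity: str  # "error" or "warning"


def compare_field(
    field: str,
    values_by_file: dict[str, Optional[str]],
    severity: str = "warning",
) -> Optional[Contradiction]:
    """Return a Contradiction if files that declare the field disagree."""
    present = {f: v for f, v in values_by_file.items() if v is not None}
    if len(set(present.values())) <= 1:
        return None  # zero or one distinct value: nothing to report
    files = tuple(sorted(present))  # sorted for stable output
    description = f"{field} differs between {', '.join(files)}"
    return Contradiction(files, field, description, severity)
```

Sorting the file names keeps the structured part of the diagnostic stable across runs, which supports the determinism requirement.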

8. Emit normalised summary

The final stage produces a result conforming to validator-output.schema.json. The output includes the files[] array, the conformance verdict, and optional diagnostics[].

Determinism: given identical input (same target, same registry, same clock time), two conformant validators MUST produce equivalent output. The order of files[], the wording of structured fields, and the conformance verdict MUST agree. Optional diagnostics[] wording MAY differ between validators.
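
To compare two validators' outputs for equivalence, one option is to drop the free-text diagnostics and serialise the rest canonically. This is a sketch of one possible comparison, not a procedure defined by the schema; the "conformance" key is a hypothetical name for the verdict field.

```python
import json


def comparable_form(summary: dict) -> str:
    """Canonical JSON of everything except diagnostics[], whose wording may differ."""
    trimmed = {k: v for k, v in summary.items() if k != "diagnostics"}
    # sort_keys normalises object key order; list order (e.g. files[]) is preserved.
    return json.dumps(trimmed, sort_keys=True, separators=(",", ":"))


a = {
    "files": [{"id": "ADF-001", "valid": True}],
    "conformance": "Essential",
    "diagnostics": ["llms.txt looks fine"],
}
b = {
    "conformance": "Essential",
    "diagnostics": ["OK: llms.txt"],
    "files": [{"valid": True, "id": "ADF-001"}],
}
equivalent = comparable_form(a) == comparable_form(b)
```

Because list order is preserved, two outputs whose files[] entries appear in a different order compare as different, consistent with the requirement that the order of files[] MUST agree.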

9. Caching and re-checking

Conformant consumers SHOULD cache fetch results for a reasonable interval (1 hour is typical) to avoid hammering publisher hosts. A consumer that re-checks SHOULD honour conditional requests (If-Modified-Since, If-None-Match) and treat 304 as "no change".
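
The caching behaviour can be sketched with three small helpers. The cache entry layout (body, etag, last_modified, fetched_at) is an assumption made for this example, and the one-hour TTL is the typical value mentioned above rather than a required one.

```python
CACHE_TTL_SECONDS = 3600  # "1 hour is typical"


def is_fresh(entry: dict, now: float) -> bool:
    """True while the cached entry is within the TTL and needs no re-check."""
    return now - entry["fetched_at"] < CACHE_TTL_SECONDS


def conditional_headers(entry: dict) -> dict[str, str]:
    """Build If-None-Match / If-Modified-Since from stored validators."""
    headers: dict[str, str] = {}
    if entry.get("etag"):
        headers["If-None-Match"] = entry["etag"]
    if entry.get("last_modified"):
        headers["If-Modified-Since"] = entry["last_modified"]
    return headers


def refresh_on_304(entry: dict, now: float) -> dict:
    """On 304 the body is unchanged; only the fetch timestamp moves forward."""
    return {**entry, "fetched_at": now}
```

A consumer would call is_fresh before any network activity, send conditional_headers on the re-check, and call refresh_on_304 when the publisher answers 304.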

For Directory verification specifically, re-checking happens on a quarterly schedule per the conformance specification.

10. Implementation notes

The AI Visibility Checker is the reference implementation of this processing model. Its source is published as part of the AI Visibility project; conformant alternative validators SHOULD produce equivalent output for the same target.

Implementation guidance for validator authors:

References