Research Methodology

This document describes how we collect, validate, and score AI Discovery File adoption data across the web's most prominent domains. We publish our methodology in full to enable scrutiny and reproducibility.

Domain Selection

We crawl two overlapping lists of prominent websites:

  • Global Top 1,000 — The top 1,000 domains worldwide by traffic rank (sourced from the Tranco List, a research-grade domain ranking that aggregates multiple lists to resist manipulation).
  • UK Top 1,000 — The top 1,000 domains popular in the United Kingdom.

The lists are combined and deduplicated. When a domain appears in both lists, the global entry takes precedence and the UK duplicate is discarded. After deduplication, the actual crawl covers approximately 1,995 unique domains. This selection captures the websites most likely to be referenced by AI systems when answering user questions.
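
The merge rule above can be sketched in a few lines. This is a minimal illustration rather than the crawler's own code: the function name, the record fields, and the sample domains are assumptions made for the example.

```python
def merge_domain_lists(global_list, uk_list):
    """Combine two ranked lists, keeping the global entry when a domain is in both."""
    by_domain = {}
    for source, domains in (("global", global_list), ("uk", uk_list)):
        for rank, domain in enumerate(domains, start=1):
            key = domain.strip().lower()
            # setdefault keeps the first record seen, so the global list,
            # which is processed first, wins over a UK duplicate.
            by_domain.setdefault(key, {"domain": key, "source": source, "rank": rank})
    return list(by_domain.values())


# Hypothetical sample data: bbc.co.uk appears in both lists.
merged = merge_domain_lists(
    ["google.com", "bbc.co.uk", "wikipedia.org"],
    ["bbc.co.uk", "gov.uk"],
)
```

In this sample the merged list holds four domains, and bbc.co.uk keeps its global record.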

Crawl Process

The crawler is a Cloudflare Worker that runs quarterly (January, April, July, October). Each crawl:

  1. Fetches domain lists from KV storage, deduplicates (global list takes precedence), and splits domains into processing batches of 25.
  2. For each domain, fetches robots.txt first, then all 10 AI Discovery Files sequentially at their canonical URLs (e.g., https://example.com/llms.txt, https://example.com/ai.json), with a 500ms delay between each request.
  3. Validates each file against its specification — checking required fields, format compliance, and content quality (see File Validation).
  4. Checks robots.txt for AI-specific crawler directives, testing 15 known AI user agents (see AI Crawler Analysis).
  5. Fetches comparison files (security.txt, humans.txt, ads.txt) for adoption context.
  6. Fetches the homepage and checks for Schema.org structured data (see Schema.org Detection).
  7. Computes a readiness tier (0–5) for each domain using the combinatorial scoring model (see Readiness Scoring).
  8. Aggregates results into a summary report once all batches complete.
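
The batching in step 1 and the concurrency limit described under Request Behaviour might look like the sketch below. The production crawler is a Cloudflare Worker; this Python version only illustrates the idea, and every name in it (split_into_batches, process_batch, fake_crawl) is hypothetical.

```python
import asyncio

BATCH_SIZE = 25    # domains per batch (from the text)
CONCURRENCY = 5    # domains in flight at once (from the text)


def split_into_batches(domains, batch_size=BATCH_SIZE):
    """Split the deduplicated domain list into fixed-size batches."""
    return [domains[i:i + batch_size] for i in range(0, len(domains), batch_size)]


async def process_batch(batch, crawl_domain, concurrency=CONCURRENCY):
    """Crawl one batch while holding at most `concurrency` domains open."""
    semaphore = asyncio.Semaphore(concurrency)

    async def guarded(domain):
        async with semaphore:
            return await crawl_domain(domain)

    # gather returns results in the same order as the input batch.
    return await asyncio.gather(*(guarded(d) for d in batch))


in_flight = 0
peak = 0


async def fake_crawl(domain):
    """Stand-in for the real per-domain crawl; records peak concurrency."""
    global in_flight, peak
    in_flight += 1
    peak = max(peak, in_flight)
    await asyncio.sleep(0)  # yield control, as a network fetch would
    in_flight -= 1
    return {"domain": domain}


batches = split_into_batches([f"site{i}.example" for i in range(1995)])
first_results = asyncio.run(process_batch(batches[0], fake_crawl))
```

With 1,995 domains this produces 80 batches, the last one holding 20 domains.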

Request Behaviour

The crawler identifies itself as ADF-Adoption-Crawler/1.0. All requests:

  • Use a 10-second timeout per file fetch
  • Follow up to 3 redirects (manually tracked — see Redirect Classification)
  • Include a 500ms delay between consecutive requests to the same domain (10 files + 4 comparison files + homepage = minimum 7.5 seconds per domain)
  • Process batches with 5 concurrent domains, further limiting aggregate request rate
  • Truncate response bodies at 50KB per file and 100KB for homepage fetches to prevent memory exhaustion
  • Respect robots.txt (the crawler checks for self-blocking and records it)
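
The request budget above can be checked with a short calculation. The sketch assumes that "KB" means 1,024 bytes and that batches run one after another with 5 domains in flight; neither detail is stated in the text, so the crawl-time figure is only a rough lower bound.

```python
ADF_FILES = 10
COMPARISON_FILES = 4        # robots.txt, security.txt, humans.txt, ads.txt
HOMEPAGE_FETCHES = 1
DELAY_SECONDS = 0.5
CONCURRENT_DOMAINS = 5
FILE_LIMIT = 50 * 1024      # assumption: 1 KB = 1,024 bytes
HOMEPAGE_LIMIT = 100 * 1024

requests_per_domain = ADF_FILES + COMPARISON_FILES + HOMEPAGE_FETCHES
min_seconds_per_domain = requests_per_domain * DELAY_SECONDS


def crawl_lower_bound_minutes(domain_count):
    """Delay-only lower bound, ignoring network time and timeouts."""
    return domain_count * min_seconds_per_domain / CONCURRENT_DOMAINS / 60


def truncate_body(body, is_homepage=False):
    """Cap a response body (bytes) at the per-type limit."""
    limit = HOMEPAGE_LIMIT if is_homepage else FILE_LIMIT
    return body[:limit]
```

For 1,995 domains the delay-only bound is about 50 minutes, which leaves room for network latency and timeouts within the reported 2–3 hour crawl time.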

Files Checked

We check for all 10 AI Discovery Files defined in the ADF specification:

Code | File | Path | Purpose
ADF-001 | llms.txt | /llms.txt | Project description for LLMs
ADF-002 | llm.txt | /llm.txt | Redirect alias for llms.txt
ADF-003 | llms.html | /llms.html | Human-readable version of llms.txt
ADF-004 | ai.txt | /ai.txt | AI permissions and preferences
ADF-005 | ai.json | /ai.json | Machine-parseable AI metadata
ADF-006 | identity.json | /identity.json | Structured identity data
ADF-007 | brand.txt | /brand.txt | Brand naming and terminology
ADF-008 | faq-ai.txt | /faq-ai.txt | AI-optimised FAQ content
ADF-009 | developer-ai.txt | /developer-ai.txt | Technical context for developers
ADF-010 | robots-ai.txt | /robots-ai.txt | AI-specific crawler access rules

File Validation

Each file goes through a multi-stage validation pipeline. The crawler first determines whether the file genuinely exists (handling redirects and soft 404s), then validates its structure against the ADF specification.

Quality Tiers

Every ADF file is assigned one of four quality levels based on structural validation:

Quality | Logic | Meaning
Complete | All required checks pass and at least one recommended check passes | Meets all required rules and includes at least one recommended best-practice field
Minimal | All required checks pass, zero recommended checks pass | Structurally valid but missing recommended fields
Invalid | One or more required checks fail | File exists but does not conform to the specification
Not Found | HTTP 404, soft 404, invalid redirect, or fetch error | File does not exist at the expected URL

For readiness scoring purposes, valid means a quality of "minimal" or "complete" — files rated "invalid" do not count toward a domain's valid ADF total.
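
The four quality levels reduce to a small decision function. The sketch below assumes the per-file checks have already been run and are passed in as lists of booleans; the function names are illustrative only.

```python
def classify_quality(found, required_results, recommended_results):
    """Map check outcomes to one of the four quality levels."""
    if not found:
        return "not_found"      # 404, soft 404, invalid redirect, or fetch error
    if not all(required_results):
        return "invalid"        # at least one required check failed
    if any(recommended_results):
        return "complete"       # all required pass, at least one recommended passes
    return "minimal"            # all required pass, no recommended pass


def is_valid(quality):
    """Only minimal and complete files count toward readiness scoring."""
    return quality in ("minimal", "complete")
```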

Structural Checks per File Type

Each ADF file type has its own set of required and recommended checks. For example:

  • llms.txt (ADF-001) — Required: contains a # Heading, a > blockquote, ## subsections, and 50+ words. Recommended: 3+ subsections, contact info.
  • ai.txt (ADF-004) — Required: [identity], [permissions], [restrictions] sections. Recommended: [attribution], [contact], [content-types].
  • ai.json (ADF-005) — Required: valid JSON, name, url, permissions array, restrictions array. Recommended: $schema reference, attribution object.
  • identity.json (ADF-006) — Required: valid JSON, name, type, url, description (20+ characters). Recommended: alternateNames, contactPoints.
  • brand.txt (ADF-007) — Required: [official-names], [incorrect-names], [naming-rules]. Recommended: [brand-voice], [key-people], [quotation-policy].
  • faq-ai.txt (ADF-008) — Required: Q: and A: pairs, at least 3 Q&A pairs. Recommended: category headings, answers averaging 50+ characters.

Full validation rules for all 10 file types are defined in the ADF specification.
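
As an illustration of one file type, the ai.json checks listed above could be expressed as follows. This is a sketch under assumptions: the exact field tests (for instance, requiring name and url to be non-empty strings) are guesses, and the ADF specification remains the authority.

```python
import json


def check_ai_json(body):
    """Run the required and recommended ai.json checks on a response body."""
    try:
        data = json.loads(body)
    except json.JSONDecodeError:
        return {"required": {"valid_json": False}, "recommended": {}}
    if not isinstance(data, dict):
        data = {}  # valid JSON, but not an object: the field checks below fail

    def non_empty_string(key):
        value = data.get(key)
        return isinstance(value, str) and bool(value.strip())

    required = {
        "valid_json": True,
        "name": non_empty_string("name"),
        "url": non_empty_string("url"),
        "permissions": isinstance(data.get("permissions"), list),
        "restrictions": isinstance(data.get("restrictions"), list),
    }
    recommended = {
        "schema_ref": "$schema" in data,
        "attribution": isinstance(data.get("attribution"), dict),
    }
    return {"required": required, "recommended": recommended}
```

The two dictionaries map directly onto the quality levels: all required values true with at least one recommended value true corresponds to "complete".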

Soft 404 Detection

Many web servers return an HTTP 200 status code for missing pages, serving a custom error page instead of a proper 404. The crawler applies six detection heuristics, checked in order:

  1. Body too small — Response body under 50 bytes is treated as empty/placeholder.
  2. Content-Type mismatch — A .txt or .json file returning text/html indicates a custom error page.
  3. HTML body in non-HTML file — Body starts with <!doctype, <html, or <head when the file should be plain text or JSON.
  4. Title tag contains 404 — The <title> element contains "404" or "not found".
  5. Common error strings — The first 2,000 characters of the body are checked for 11 known error phrases: "page not found", "page could not be found", "nothing was found", "404 error", "error 404", "does not exist", "no longer available", and others.
  6. WordPress catch-all — Non-HTML file under 1,000 bytes containing wp-content, indicating a WordPress theme's custom 404 response.

Files matching any heuristic are classified as "not found" rather than "invalid". This distinction matters: a soft 404 means the site has no ADF file, whereas "invalid" means the site attempted to create one but made structural errors.
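
The six heuristics can be sketched as an ordered chain that returns the first match. The function and label names are hypothetical, and the phrase list contains only the seven phrases named above, since the remaining four are not given in this document.

```python
import re

# Only the phrases named in the text; the full list has 11 entries.
ERROR_PHRASES = (
    "page not found", "page could not be found", "nothing was found",
    "404 error", "error 404", "does not exist", "no longer available",
)


def detect_soft_404(path, content_type, body):
    """Return the name of the first matching heuristic, or None."""
    expects_html = path.endswith(".html")   # llms.html may legitimately be HTML
    lowered = body.lower()
    size = len(body.encode("utf-8"))

    if size < 50:
        return "body_too_small"
    if not expects_html and "text/html" in content_type.lower():
        return "content_type_mismatch"
    if not expects_html and lowered.lstrip().startswith(("<!doctype", "<html", "<head")):
        return "html_body"
    title = re.search(r"<title[^>]*>(.*?)</title>", lowered, re.DOTALL)
    if title and ("404" in title.group(1) or "not found" in title.group(1)):
        return "title_404"
    if any(phrase in lowered[:2000] for phrase in ERROR_PHRASES):
        return "error_phrase"
    if not expects_html and size < 1000 and "wp-content" in lowered:
        return "wordpress_catch_all"
    return None
```

Any non-None result would map to the "not found" quality level rather than "invalid".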

Redirect Classification

When a file URL returns a 3xx redirect, the crawler follows up to 3 hops and classifies the redirect chain to determine whether the file genuinely exists at the destination:

Redirect Type | Example | Treated As
Protocol upgrade | http → https | Valid: file exists
www canonicalization | example.com → www.example.com | Valid: file exists
llm.txt → llms.txt | /llm.txt → /llms.txt (same domain) | Valid: expected ADF-002 behaviour
Homepage catch-all | /ai.txt → / or /index.php | Not found: server redirects unknown URLs to homepage
External redirect | example.com → otherdomain.com | Not found: different domain
Path redirect | /ai.txt → /error or /custom-page | Not found: likely custom error handler

This classification prevents false positives from servers that redirect all unknown URLs to the homepage or a custom error page instead of returning a 404 status code.
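
A simplified version of this classification compares the requested URL with the final URL of the chain. The sketch below is an assumption-laden illustration: the function name, the returned labels, and the choice of "/" and "/index.php" as the only homepage markers are taken from the examples above rather than from the crawler itself.

```python
from urllib.parse import urlsplit


def _host(url):
    """Hostname without a leading www., so www canonicalization compares equal."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def classify_redirect(source_url, final_url):
    """Return (outcome, reason) for a redirect from source_url to final_url."""
    if _host(source_url) != _host(final_url):
        return "not_found", "external redirect"
    src_path = urlsplit(source_url).path or "/"
    dst_path = urlsplit(final_url).path or "/"
    if src_path == dst_path:
        # Only the scheme and/or the www prefix changed.
        return "valid", "protocol upgrade or www canonicalization"
    if src_path == "/llm.txt" and dst_path == "/llms.txt":
        return "valid", "llm.txt alias"
    if dst_path in ("/", "/index.php"):
        return "not_found", "homepage catch-all"
    return "not_found", "path redirect"
```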

Readiness Scoring

Each domain receives an AI Readiness Tier from 0 to 5. The tier is computed from three inputs:

  1. Valid ADF count — Number of ADF files with quality "minimal" or "complete" (files rated "invalid" do not count)
  2. AI crawler classification — The domain's robots.txt policy toward AI user agents (see AI Crawler Analysis)
  3. Schema.org presence — Whether the homepage contains structured data markup

The rules are evaluated in order — the first matching rule determines the tier:

Tier | Label | Rule (first match wins)
5 | AI-Optimised | 3+ valid ADFs AND explicitly allows AI crawlers AND Schema.org present
4 | AI-Ready | 1+ valid ADFs AND allows AI (explicitly or by default) AND Schema.org present
3a | Partially Ready | 1+ valid ADFs AND blocks some or all AI crawlers
3b | Partially Ready | 0 valid ADFs AND allows AI AND Schema.org present
2 | Passive | 0 valid ADFs AND does not block all AI crawlers
1 | Actively Blocking | Blocks all AI crawlers AND 0 valid ADFs
0 | Unaware | Catch-all: no ADFs, no AI crawler mentions, no Schema.org

Key design decisions in this model:

  • Tier 5 requires explicit AI crawler permission — having no AI policy (no_ai_policy) qualifies for Tier 4 at most, because silence is not the same as consent.
  • Tier 3 has two entry paths — a domain can reach Tier 3 either by having ADF files but blocking AI crawlers (a contradiction), or by having good structural foundations (Schema.org + open crawlers) but no ADF files yet.
  • Schema.org is required for Tiers 4–5 — structured data demonstrates an understanding of machine-readable signals, not just ADF adoption.
  • "Allows AI" includes both explicit allows and no policy — the absence of blocking directives in robots.txt means AI crawlers are permitted by default.
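
The first-match evaluation can be sketched as follows. This is an illustrative reading of the rule table, not the crawler's source: the function name and the policy strings other than no_ai_policy are assumptions. Read literally, the final catch-all is reached by combinations that no earlier row covers, for example valid ADFs with open crawlers but no Schema.org.

```python
def readiness_tier(valid_adfs, policy, has_schema):
    """Return the tier row ("5", "4", "3a", "3b", "2", "1", "0") for a domain.

    policy is one of the hypothetical strings: "blocks_all",
    "blocks_selectively", "rate_limits", "explicitly_allows", "no_ai_policy".
    """
    explicit = policy == "explicitly_allows"
    allows = policy in ("explicitly_allows", "no_ai_policy")
    blocks_any = policy in ("blocks_all", "blocks_selectively")
    blocks_all = policy == "blocks_all"

    # Rules are checked top-down; the first match decides the tier.
    if valid_adfs >= 3 and explicit and has_schema:
        return "5"
    if valid_adfs >= 1 and allows and has_schema:
        return "4"
    if valid_adfs >= 1 and blocks_any:
        return "3a"
    if valid_adfs == 0 and allows and has_schema:
        return "3b"
    if valid_adfs == 0 and not blocks_all:
        return "2"
    if valid_adfs == 0 and blocks_all:
        return "1"
    return "0"  # catch-all for anything the rows above do not cover
```

For example, a domain with three valid ADFs and Schema.org markup but no AI policy matches the Tier 4 row, because Tier 5 requires an explicit allow.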

AI Crawler Analysis

We parse each domain's robots.txt and check for directives targeting 15 known AI user agents:

User Agent | Company | Purpose
GPTBot | OpenAI | Training
ChatGPT-User | OpenAI | Retrieval
OAI-SearchBot | OpenAI | Search
ClaudeBot | Anthropic | Training
Claude-User | Anthropic | Retrieval
Google-Extended | Google | Training
CCBot | Common Crawl | Training
Bytespider | ByteDance | Training
PerplexityBot | Perplexity | Search
Applebot-Extended | Apple | Training
FacebookBot | Meta | Preview
meta-externalagent | Meta | Training
cohere-ai | Cohere | Training
Amazonbot | Amazon | Training
Diffbot | Diffbot | Extraction

For each agent, the crawler checks whether robots.txt contains a matching User-agent: block and classifies the directives within it. Agents not explicitly mentioned inherit the wildcard (User-agent: *) behaviour. The crawler tracks whether each agent is effectively blocked — either by a direct Disallow: / or by wildcard inheritance.
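
The per-agent lookup with wildcard inheritance might look like the sketch below. It is deliberately simplified and its names are hypothetical: it treats only an exact "Disallow: /" as blocking and does not model Allow overrides or path matching, which a full RFC 9309 parser would handle.

```python
def parse_groups(robots_txt):
    """Map each lowercased user-agent token to its list of (directive, value) rules."""
    groups = {}
    current_agents = []
    collecting_agents = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not collecting_agents:
                current_agents = []           # a new group starts here
                collecting_agents = True
            current_agents.append(value.lower())
            groups.setdefault(value.lower(), [])
        elif field in ("allow", "disallow", "crawl-delay"):
            collecting_agents = False
            for agent in current_agents:
                groups[agent].append((field, value))
    return groups


def agent_status(groups, agent):
    """Report whether an agent is named and whether it is effectively blocked."""
    key = agent.lower()
    mentioned = key in groups
    # Agents that are not named inherit the wildcard group.
    rules = groups[key] if mentioned else groups.get("*", [])
    return {"mentioned": mentioned, "blocked": ("disallow", "/") in rules}


sample = (
    "User-agent: GPTBot\nUser-agent: CCBot\nDisallow: /\n\n"
    "User-agent: ClaudeBot\nAllow: /\n\n"
    "User-agent: *\nDisallow: /private/\n"
)
groups = parse_groups(sample)
```

In this sample GPTBot and CCBot are blocked directly, ClaudeBot is explicitly allowed, and an unnamed agent such as PerplexityBot inherits the wildcard group, which does not block the whole site.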

Access Classification

Each domain is classified into one of five access policies based on the aggregate behaviour across all 15 agents:

  1. Blocks All AI — All 15 AI agents are effectively blocked (directly or via wildcard inheritance).
  2. Blocks Selectively — Some AI agents are blocked while others are allowed, or some are blocked but none are explicitly allowed.
  3. Rate-Limits AI — AI agents are mentioned only with Crawl-delay directives, no allow/disallow rules.
  4. Explicitly Allows — At least one AI agent has an explicit Allow directive, and none are blocked.
  5. No AI Policy — No AI-specific directives in robots.txt (AI crawlers are permitted by default).
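
The aggregate classification can be sketched as a function over per-agent outcomes. The sketch assumes each agent has already been reduced to one of four hypothetical states ("blocked", "allowed", "rate_limited", "unmentioned"); how partial-path rules map onto these states is not covered by the text and is left out.

```python
AI_AGENTS = (
    "GPTBot", "ChatGPT-User", "OAI-SearchBot", "ClaudeBot", "Claude-User",
    "Google-Extended", "CCBot", "Bytespider", "PerplexityBot",
    "Applebot-Extended", "FacebookBot", "meta-externalagent", "cohere-ai",
    "Amazonbot", "Diffbot",
)


def classify_access(statuses):
    """statuses maps each agent name to its effective per-agent state."""
    values = [statuses.get(agent, "unmentioned") for agent in AI_AGENTS]
    blocked = values.count("blocked")
    if blocked == len(values):
        return "blocks_all"
    if blocked > 0:
        return "blocks_selectively"
    if "allowed" in values:
        return "explicitly_allows"   # at least one explicit allow, none blocked
    if "rate_limited" in values:
        return "rate_limits"         # only Crawl-delay rules for AI agents
    return "no_ai_policy"
```

An agent blocked through wildcard inheritance would be passed in as "blocked", which is how a bare "User-agent: *" with "Disallow: /" ends up as Blocks All AI.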

Comparison Standards

For context, we check for the presence of established web standards at their canonical locations:

Standard | Path | Validation | Specification
robots.txt | /robots.txt | Contains User-agent: and Disallow: directives | RFC 9309
security.txt | /.well-known/security.txt | Contains a Contact: field | RFC 9116
humans.txt | /humans.txt | Non-trivial content (20+ characters) | humanstxt.org
ads.txt | /ads.txt | Contains seller entries with DIRECT or RESELLER | IAB Tech Lab

This comparison contextualises ADF adoption relative to other standards that also require placing files at the domain root. It answers the question: how does ADF adoption compare with existing, well-established web conventions?
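
The presence checks for the comparison files are simple enough to sketch with regular expressions. The patterns below are one plausible reading of the validation rules listed above, not the crawler's actual expressions, and the function names are hypothetical.

```python
import re


def _has_field(text, field):
    """True if some line starts with 'field:' (case-insensitive)."""
    return re.search(rf"^\s*{re.escape(field)}\s*:", text, re.I | re.M) is not None


def check_comparison_file(name, body):
    """Return True if the body passes the presence check for the named standard."""
    text = body.strip()
    if name == "robots.txt":
        return _has_field(text, "User-agent") and _has_field(text, "Disallow")
    if name == "security.txt":
        return _has_field(text, "Contact")
    if name == "humans.txt":
        return len(text) >= 20
    if name == "ads.txt":
        # A seller entry is a comma-separated record naming DIRECT or RESELLER.
        return re.search(r"^[^#\n]*,\s*(DIRECT|RESELLER)\b", text, re.I | re.M) is not None
    raise ValueError(f"unknown comparison file: {name}")
```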

Schema.org Detection

Schema.org is checked separately by fetching the homepage and scanning the first 100KB of HTML for any of three markup formats:

  1. JSON-LD — Presence of application/ld+json in the page source
  2. Microdata — Presence of itemtype="https://schema.org/..." attributes
  3. RDFa — Presence of vocab="https://schema.org/..." attributes

This is a presence check, not a quality assessment. We do not validate the correctness or depth of the structured data — only that the homepage includes at least one Schema.org reference in any supported format.
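
A presence check of this kind could be written as below. The sketch makes two assumptions that go beyond the text: it also accepts http:// Schema.org URLs, and it slices the first 100K characters rather than bytes. The function name is hypothetical.

```python
import re

MAX_HOMEPAGE_CHARS = 100 * 1024  # approximation of the 100KB scan window


def detect_schema_org(html):
    """Report which Schema.org markup formats appear in the scanned window."""
    window = html[:MAX_HOMEPAGE_CHARS].lower()
    return {
        "json_ld": "application/ld+json" in window,
        "microdata": re.search(r'itemtype\s*=\s*["\']https?://schema\.org/', window) is not None,
        "rdfa": re.search(r'vocab\s*=\s*["\']https?://schema\.org', window) is not None,
    }


def has_schema_org(html):
    """True if any of the three formats is present."""
    return any(detect_schema_org(html).values())
```

Only the boolean from has_schema_org would feed into readiness scoring; the per-format breakdown is useful mainly for reporting.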

Known Limitations

  • Sample scope — We crawl approximately 2,000 top domains. Adoption rates in the broader web (small businesses, long-tail sites, non-English markets) may differ significantly. Our data represents the largest, most visible websites.
  • Domain-level only — We check the root domain only (e.g., example.com), not subdomains or subpaths. Organisations hosting ADF files on subdomains (e.g., docs.example.com/llms.txt) would be missed.
  • Point-in-time snapshot — Each quarterly crawl captures the state at crawl time. Websites may add or remove files between crawls. Trends between quarters are indicative, not continuous.
  • Soft 404 heuristics — Despite six detection methods, some edge cases may be misclassified. A site that serves a valid ai.txt file containing the phrase "page not found" in its content would be a false positive. We err on the side of caution (flagging as not found) rather than accepting ambiguous files.
  • CDN and caching — Some domains serve cached or CDN-edge content that may differ from origin. The crawler sees whatever the CDN serves.
  • Redirects — We follow up to 3 redirects. Domains with longer redirect chains are marked as errored. Our redirect classification uses heuristics (homepage catch-all detection) that may occasionally misclassify intentional redirects.
  • Schema.org detection is shallow — We check for the presence of Schema.org markup in three formats (JSON-LD, Microdata, RDFa) but do not validate correctness, depth, or relevance. A page with a single, minimal Organization block is treated identically to one with comprehensive markup.
  • Body truncation — File bodies are capped at 50KB and homepage bodies at 100KB. Exceptionally large files may be validated on a truncated version, though this is rare in practice for root-level text and JSON files.
  • HTTPS only — The crawler only checks https:// URLs. Domains that are only available over HTTP (without TLS) are not covered.
  • Specification evolution — The ADF specification may evolve between crawls. We validate against the specification version current at crawl time. Historical data reflects the rules in effect when it was collected.

Reproducibility

To support external verification, we publish the following:

  • Domain lists — Sourced from the Tranco List (global) and UK domain rankings, frozen at the start of each crawl.
  • Crawler identity: the User-Agent header is ADF-Adoption-Crawler/1.0
  • Validation rules — Per-file required and recommended checks are documented in the ADF specification.
  • Scoring model — The readiness tier calculation is fully described above, with no opaque weights or proprietary algorithms.
  • Raw data — CSV downloads are available for every completed crawl on the research page.

Data Availability

All crawl data is available for download as CSV files on the research page, licensed under CC BY 4.0. Downloads include:

  • Summary CSV — Per-domain results with readiness tier, valid ADF count, and AI crawler classification.
  • Scores CSV — Detailed per-file quality tiers for every domain checked.
  • Blocking CSV — Per-agent blocking data for every domain.

When citing this research, please reference the specific quarter (e.g., "Q1 2026") and link to the methodology page.

Crawl Schedule

Crawls run quarterly on the 1st of January, April, July, and October at 04:00 UTC. Domains are processed in batches of 25, with 5 domains crawled concurrently within each batch, and results are aggregated automatically once all batches complete. The full crawl typically completes within 2–3 hours. Results are published within one week of the crawl completing.

The first crawl was conducted in February 2026, outside the regular quarterly schedule described above.

Frequently Asked Questions

How many websites are included in the ADF adoption crawl?

We crawl two lists: the Global Top 1,000 domains, sourced from the Tranco List (a research-grade domain ranking that aggregates multiple popularity lists to resist manipulation), and the UK Top 1,000 domains, drawn from UK domain rankings. After deduplication (where the global entry takes precedence), the crawl covers approximately 1,995 unique domains. These represent the websites most likely to be referenced by AI systems when answering user questions.

How are AI Readiness Tiers calculated?

Each domain receives a tier from 0 (Unaware) to 5 (AI-Optimised) based on three measurable inputs: the number of valid AI Discovery Files found, the domain's AI crawler policy in robots.txt, and whether Schema.org structured data is present on the homepage. The scoring uses a first-match-wins model — rules are evaluated top-down and the first matching condition determines the tier. There are no opaque weights or proprietary algorithms. The full tier calculation logic is published above.

How does the crawler detect soft 404 pages?

Many web servers return an HTTP 200 status code for pages that don't exist, serving a custom error page instead of a proper 404. Our crawler applies six detection heuristics in sequence: checking for response bodies under 50 bytes, Content-Type mismatches (e.g., text/html returned for a .txt file), HTML markup in non-HTML files, <title> tags containing "404" or "not found", 11 common error page phrases, and WordPress-specific catch-all patterns. Files matching any heuristic are classified as "not found" rather than counted as invalid ADF files.

How often is the crawl performed?

Crawls run quarterly — on the 1st of January, April, July, and October at 04:00 UTC. Each crawl processes approximately 2,000 domains in batches of 25 with 5 concurrent domains per batch, and typically completes within 2–3 hours. Results are reviewed and published within one week. Historical data from every completed crawl is archived and available for download on the research page.

Is the research data available for download?

Yes. All crawl data is freely available as CSV downloads on the research page, licensed under CC BY 4.0. Three files are provided per quarter: a domain summary (readiness tier, valid ADF count, AI crawler classification), detailed per-file quality scores, and per-agent blocking data. We encourage researchers and journalists to cite the specific quarter (e.g., "Q1 2026") and link to this methodology page.

What User-Agent does the crawler identify as?

The crawler identifies itself as ADF-Adoption-Crawler/1.0. All requests include a 500ms delay between consecutive fetches to the same domain, and batches are processed with limited concurrency (5 domains at a time) to prevent overwhelming target servers. The crawler respects robots.txt and records whether it is blocked by a domain's crawler directives.

What counts as a "valid" AI Discovery File?

A file is considered valid if it achieves a quality rating of "minimal" or "complete" after structural validation. "Minimal" means all required checks for that file type pass (e.g., llms.txt must contain a heading, blockquote, subsections, and 50+ words). "Complete" means all required checks pass and at least one recommended best-practice field is also present. Files that exist but fail required checks are rated "invalid" and do not count toward a domain's valid ADF total for readiness scoring.

Why is Schema.org included in the readiness scoring?

Schema.org structured data demonstrates that a website already understands machine-readable signals — a prerequisite for the kind of deliberate, infrastructure-level approach that AI Discovery Files represent. A domain with Schema.org markup and open AI crawler policies but no ADF files yet (Tier 3) is closer to AI readiness than one with no structured data at all (Tier 2). Schema.org is required for Tiers 4 and 5 because the highest readiness levels reflect a comprehensive commitment to machine interpretability, not just ADF file presence.
