How AI Search Actually Works: Inside ChatGPT, Claude, Gemini, and Perplexity

AI search is not thinking, it's retrieving

When you ask ChatGPT a question that references anything recent, specific, or local, something happens before the model generates a word of its reply. It runs a search. A real search, against a real index, returning real URLs. The answer you see is assembled from the pages that came back.

This is the single most important fact about how AI search works, and it's the one most people get wrong. Large language models don't "know" which plumber to recommend in Kettering or which accountancy firm serves SaaS founders in Manchester. They retrieve that information, just-in-time, from an underlying search index or a direct web fetch. The generative part is the last step. The retrieval is where your website either gets invited in or ignored.

Understanding the pipeline matters because everything people argue about with AI visibility (keywords, backlinks, schema, AI Discovery Files, brand mentions) maps to a specific stage of it. You can't optimise for a system you can't see. So here's the system.

The five-stage retrieval pipeline

Infographic: the five stages of the AI search pipeline (question, fan-out, document retrieval, re-ranking, chat reply). Every AI search query runs through the same five stages: understand, fan out, retrieve, re-rank, generate. The output you see is the last step. The stage that decides whether you appear is the second-to-last.

Every modern AI search system runs five stages for every query. The details vary, but the shape doesn't.

1. Understanding the question

The AI receives a raw natural-language prompt and rewrites it into the shape its retrieval system expects. Ambiguities get resolved (is "Apple" the company or the fruit?), acronyms get expanded, and entities get tagged against an internal knowledge graph. If your brand name is the same as something bigger or more famous, this is the stage at which your identity dies. The AI simply resolves to the bigger entity and you never make the candidate list.

This is also where the AI decides whether to retrieve at all. Questions about the capital of France get answered from the model's internal weights. Questions about anything recent, specific, commercial, or local trigger retrieval. The bar for triggering retrieval is lower than most people assume: even moderately factual "what is" queries now route to web search by default in ChatGPT, Gemini, and Claude.

2. Query fan-out

One question usually becomes several internal searches. If you ask "What's the best CRM for a small plumbing business in the UK?", the AI will likely fan that out into sub-queries covering: best CRM small business, CRM for trades, UK-specific CRM pricing, plumbing software integrations, and so on. Google has a patent describing exactly this for AI Mode.

Fan-out is why niche sites still have a shot. You don't need to match the user's exact words. You need to match one of the sub-queries the AI generates on their behalf. A plumbing software company that ranks for "field service management for plumbers" can surface in an answer to a CRM question that never mentions field service management, because one of the fan-out queries did.
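As a rough illustration of why matching one sub-query is enough, here is a minimal sketch of fan-out followed by candidate merging. Everything in it is an assumption made for illustration: the hardcoded sub-queries stand in for whatever the model would generate, and TOY_INDEX with its example URLs stands in for a real search index.

```python
# Hypothetical toy index: sub-query -> URLs a search backend might return.
TOY_INDEX = {
    "best crm small business": ["crm-reviews.example/top10", "bigcrm.example/smb"],
    "crm for trades": ["tradesoft.example/crm", "crm-reviews.example/top10"],
    "field service management for plumbers": ["plumbsoft.example/fsm"],
}


def fan_out(question: str) -> list[str]:
    """Stand-in for the model's query rewriting; a real system generates these."""
    return [
        "best crm small business",
        "crm for trades",
        "field service management for plumbers",
    ]


def retrieve_candidates(sub_queries: list[str], index: dict) -> dict:
    """Merge results from every sub-query, remembering which ones matched each URL."""
    matched_by: dict[str, list[str]] = {}
    for sub_query in sub_queries:
        for url in index.get(sub_query, []):
            matched_by.setdefault(url, []).append(sub_query)
    return matched_by


question = "What's the best CRM for a small plumbing business in the UK?"
candidates = retrieve_candidates(fan_out(question), TOY_INDEX)
for url, queries in candidates.items():
    print(url, "<-", queries)
```

The plumbing software page enters the candidate pool through a single sub-query, even though the user's question never mentioned field service management.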

3. Retrieval

The sub-queries get issued against a live index. Which index depends on the system. ChatGPT's search uses Bing. Perplexity runs its own index plus third-party providers. Google AI Overviews and AI Mode use Google's index natively. Claude fetches directly from the open web. Gemini grounds on Google Search.

Each sub-query returns a handful of candidate pages. The AI doesn't read them in full. It reads the passages that match. A retrieval might surface your page because one paragraph happens to answer the query, even if the rest of the page is about something slightly different.

Two consequences follow. First, passage-level structure matters: a page broken into clear, self-contained paragraphs with specific claims is easier to retrieve than a wall of text. Second, if your site is unreachable at this stage (blocked crawlers, server errors, JavaScript-only content, Cloudflare bot challenges), nothing else in the pipeline helps. Our blocking checklist covers the most common failure points, and if your robots.txt blocks every crawler outright, AI can never recommend you at all.
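To make "the AI matches the passage, not the page" concrete, here is a deliberately simple sketch that splits a page into paragraphs and scores each one by word overlap with a sub-query. Real systems use embeddings and learned rankers, so treat the overlap score, the sample page, and the function names as assumptions for illustration only.

```python
import re


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_passage(page_text: str, query: str) -> tuple[float, str]:
    """Return (score, passage) for the paragraph sharing the most query words."""
    passages = [p.strip() for p in page_text.split("\n\n") if p.strip()]
    query_words = tokenize(query)
    if not passages or not query_words:
        return 0.0, ""
    scored = [(len(query_words & tokenize(p)) / len(query_words), p) for p in passages]
    return max(scored, key=lambda item: item[0])


# Hypothetical page: only the middle paragraph answers the sub-query.
PAGE = (
    "We started in 2009 as a two-person team and love what we do.\n\n"
    "Our field service management software for plumbers handles scheduling, "
    "quoting, and invoicing.\n\n"
    "Contact us for a demo."
)

score, passage = best_passage(PAGE, "field service management for plumbers")
print(round(score, 2), "|", passage)
```

Only one paragraph carries the match. The surrounding "about us" and contact text contribute nothing, which is why self-contained paragraphs with specific claims retrieve better than a wall of text.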

4. Re-ranking

Illustration: five candidate website cards with a magnifying glass on the chosen one. Retrieval produces candidates. Re-ranking decides which one actually gets cited. The re-ranker rewards extractable passages and clear identity signals more than it rewards raw authority.

The re-ranker is the stage that decides whether your page ends up cited or just sits in a pool of unused candidates. It looks at each candidate passage and scores it on a few things that matter far more than traditional SEO metrics.

Passage-level relevance comes first: does this chunk of text actually answer the sub-query it was retrieved for? Then comes extractability: can the AI lift a concise, factual statement from it without having to interpret or paraphrase heavily? Then comes identity clarity: does the passage make it obvious which entity is being described? Then comes corroboration: does what this passage says agree with what other independent sources say?
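No AI vendor publishes its re-ranking formula, so the following is only a toy model of the four criteria above. The weights, the 0-to-1 scores, and the example URLs are all invented for illustration; the point is the shape of the calculation, not the numbers.

```python
from dataclasses import dataclass

# Hypothetical weights; real re-rankers are learned models, not fixed sums.
WEIGHTS = {
    "relevance": 0.40,
    "extractability": 0.25,
    "identity_clarity": 0.20,
    "corroboration": 0.15,
}


@dataclass
class Candidate:
    url: str
    relevance: float         # does the passage answer the sub-query?
    extractability: float    # can a standalone fact be lifted out?
    identity_clarity: float  # is it obvious which entity is described?
    corroboration: float     # do independent sources agree?


def score(candidate: Candidate) -> float:
    """Weighted sum of the four signals, each expected in the range 0..1."""
    return sum(weight * getattr(candidate, name) for name, weight in WEIGHTS.items())


def rerank(candidates: list[Candidate], top_k: int = 3) -> list[Candidate]:
    """Return the top_k candidates, highest score first."""
    return sorted(candidates, key=score, reverse=True)[:top_k]


pool = [
    Candidate("big-brand.example/solutions", 0.6, 0.3, 0.5, 0.7),
    Candidate("small-specialist.example/guide", 0.9, 0.9, 0.9, 0.5),
]
for candidate in rerank(pool):
    print(f"{score(candidate):.2f}  {candidate.url}")
```

Under these assumed weights, the better-corroborated big-brand page still loses to the specialist page because it scores poorly on relevance and extractability.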

This is the stage where sites with strong domain authority but vague, repetitive content lose to smaller sites with specific, extractable claims. Mike King, co-founder of iPullRank, put it bluntly:

"It's not enough for your brand to have, like, 500 million mentions scattered across the Internet. If they're not relevant, they don't even matter."
Mike King, co-founder of iPullRank, interviewed by Advanced Web Ranking (verify quote at source)

When I first read that, it reframed something I'd been seeing in our research data. In our Q2 2026 crawl of 1,905 top websites, plenty of domains with enormous backlink profiles had zero AI Discovery Files and no consistent identity declaration. They'll lose the re-ranker race to smaller sites with machine-readable identity even though they'd win a backlink audit. Mentions without extractable relevance are dead weight inside a retrieval pipeline. Authority transfers only if the passage itself survives the re-ranker's scoring.

5. Generation

Finally, the model takes the top two to six re-ranked passages and generates an answer grounded in them. It attaches citation links to the specific sentences those passages supported. It may paraphrase, compress, or synthesise across sources, but the factual content is anchored to the retrieved passages, not to the model's internal knowledge.

This is why the same question asked twice can produce answers with different citations: the retrieval and re-ranking steps are probabilistic. Your goal isn't to guarantee citation on any single query. Your goal is to be in the high-probability candidate pool often enough that you get cited regularly.
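A back-of-envelope way to think about "cited regularly": if each relevant query cites you with probability p, and queries are treated as independent (a simplifying assumption that real traffic won't fully satisfy), the chance of at least one citation across n queries is 1 - (1 - p)^n. The figures below are illustrative, not measurements.

```python
def p_cited_at_least_once(p: float, n: int) -> float:
    """Probability of one or more citations in n independent queries."""
    return 1 - (1 - p) ** n


# Illustrative per-query citation probabilities over 20 relevant queries.
for p in (0.02, 0.10, 0.30):
    print(f"p={p:.2f} -> {p_cited_at_least_once(p, 20):.3f}")
```

Small improvements in per-query probability compound quickly over many queries, which is why being in the candidate pool consistently matters more than winning any one prompt.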

How ChatGPT, Claude, Gemini, and Perplexity differ

The shape of the pipeline is universal. The implementation details aren't. Here's where the four major systems diverge, and why that matters.

ChatGPT triggers web search on roughly 20 to 35% of prompts, delegates retrieval to Bing, and cites inline. Its re-ranker leans heavily on authority signals that overlap with traditional SEO, but weights identity clarity more than Google historically has. If your business is easy to disambiguate, ChatGPT tends to cite you consistently once it finds you.

Claude fetches directly from the open web when web access is enabled, which means it's more sensitive to whether your site is technically reachable. Server errors, aggressive bot blocking, and JavaScript-heavy content hurt Claude citations more than they hurt ChatGPT. Claude leans hard on clear, machine-readable identity files; you can see this site's own llms.txt as a reference example.

Gemini and Google AI Overviews use Google's search infrastructure, which means anything that helps Google visibility (technical SEO, E-E-A-T, Schema.org, quality backlinks) helps AI Overviews. The re-ranker then applies its own extra scoring on top. BrightEdge tracking puts AI Overviews in roughly 48% of monitored queries as of early 2026, and much higher in some industries: 88% for healthcare, 83% for education, 82% for B2B technology. If you operate in any of those verticals, AI Overviews probably already intercepts most of your prospective traffic.

Perplexity is the outlier. It searches aggressively, cites generously, and is the engine most likely to surface smaller, specialised sites. Third-party tracking consistently shows Perplexity citing at a much higher rate than ChatGPT per user session, partly because it's designed as an answer engine first and a chatbot second. If you're working on AI visibility and want fast feedback, Perplexity is usually where you see results land first.

The practical implication: test your visibility across all four, not just one. AI recommendations are inconsistent across engines and across runs. One prompt in one engine tells you almost nothing. For anyone wondering which of these to pay for personally, we cover the consumer comparison separately in which AI subscription is best for the average user, and for developers weighing the model choice inside Claude Code, see whether Claude Fable 5 is worth it.
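If you record your test runs by hand, a few lines are enough to turn them into per-engine citation rates. The log below is made-up sample data, and the (engine, was_cited) format is just one assumed way of recording results.

```python
from collections import defaultdict

# Hypothetical log of repeated prompts: (engine, was our site cited?).
runs = [
    ("chatgpt", True), ("chatgpt", False), ("chatgpt", False), ("chatgpt", False),
    ("perplexity", True), ("perplexity", True), ("perplexity", False), ("perplexity", True),
    ("claude", False), ("claude", False),
    ("gemini", True), ("gemini", False),
]


def citation_rates(log: list[tuple[str, bool]]) -> dict[str, float]:
    """Share of runs per engine in which the site was cited."""
    totals: dict[str, int] = defaultdict(int)
    hits: dict[str, int] = defaultdict(int)
    for engine, cited in log:
        totals[engine] += 1
        hits[engine] += int(cited)
    return {engine: hits[engine] / totals[engine] for engine in totals}


rates = citation_rates(runs)
for engine, rate in sorted(rates.items()):
    print(f"{engine:<11}{rate:.0%}")
```

With only a handful of runs per engine the rates are noisy, so repeat each prompt several times before drawing conclusions.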

Why most websites never reach the re-ranker

The uncomfortable part of the pipeline is that most websites never make it past stage three. They're filtered out during retrieval, before the re-ranker even has a chance to score them.

Our quarterly crawl data shows the picture clearly. As of Q2 2026, only 7.2% of the top 1,905 websites publish any AI Discovery File. The most popular, llms.txt, sits at 4.9%. Roughly one in five top sites actively blocks GPTBot, ClaudeBot, or PerplexityBot via robots.txt. Cloudflare now blocks AI crawlers by default on new domains, which affects a further slice of the web silently.
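You can check the robots.txt part yourself with Python's standard library. The sketch below parses a sample robots.txt offline and reports which of the three crawlers named above would be refused. The sample file and the example.com URL are assumptions, and the check says nothing about blocks applied at the CDN or firewall level.

```python
from urllib.robotparser import RobotFileParser

# Sample robots.txt for illustration: blocks GPTBot, allows everyone else.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""

AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot"]


def blocked_crawlers(robots_txt: str, url: str = "https://example.com/") -> list[str]:
    """Return the AI crawler user agents that robots.txt forbids from fetching url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [bot for bot in AI_CRAWLERS if not parser.can_fetch(bot, url)]


print(blocked_crawlers(ROBOTS_TXT))
```

To test a live site, RobotFileParser also offers set_url() followed by read(), which fetches the file over the network instead of parsing a string.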

Beyond blocking, three quieter filter problems kick in during retrieval:

Unreachable content. JavaScript-rendered pages that need full browser execution often get skipped. Many AI crawlers fetch HTML and extract text; anything that isn't in the initial payload may be invisible.
Thin or diluted passages. Pages where the key information is buried under unrelated boilerplate rarely win passage-level retrieval. The AI matches the passage, not the page.
Ambiguous identity. Pages that don't make clear who the business is (no Schema.org Organization markup, no consistent "About" language, no identity file) are harder for the retrieval system to attach to any entity, so they tend not to get retrieved for brand-specific queries.
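The first of these problems is easy to approximate: extract the text from the raw HTML, without executing any JavaScript, and see whether your key claim is there. This is a simplified sketch that assumes a crawler reading only the initial payload; the two HTML snippets are invented examples.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, ignoring anything inside <script> or <style>."""

    def __init__(self) -> None:
        super().__init__()
        self.chunks: list[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    """Text a non-rendering crawler could read from the initial HTML."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(extractor.chunks)


SERVER_RENDERED = "<html><body><h1>Acme Plumbing</h1><p>Emergency plumber in Kettering.</p></body></html>"
JS_ONLY = '<html><body><div id="root"></div><script>render("Emergency plumber in Kettering.")</script></body></html>'

for label, html in (("server-rendered", SERVER_RENDERED), ("js-only", JS_ONLY)):
    print(label, "->", "Kettering" in visible_text(html))
```

Both pages look identical in a browser, but only the server-rendered one exposes the claim to a crawler that fetches HTML and extracts text.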

Lily Ray wrote recently on her Substack:

"By clearly stating important details about your company and its products and services in unambiguous language, you can increase your chances of being cited in AI search."
Lily Ray, Senior Director of SEO & Head of Organic Research at Amsive, writing on her Substack (verify quote at source)

That line stuck with me because it's both obvious and almost universally ignored. Most "About Us" pages describe a company in aspirational marketing language that no retrieval system can cleanly anchor to a factual claim. Our own checker runs across thousands of sites weekly and the pattern is the same: the sites that get cited aren't the loudest, they're the least ambiguous. Being unambiguous isn't a brand instinct, so writers fight it; retrieval pipelines reward it, so AI visibility depends on it.

The three signals that decide citations

Assuming your site is technically reachable and your identity is resolvable, three signals determine whether the re-ranker picks you over a competitor passage:

Extractability. Can the AI lift a standalone factual sentence from your page without editing? Short paragraphs, clear topic sentences, defined terms, and structured lists all help. Walls of marketing prose hurt. This is why long-form explainer content with concrete claims tends to outperform heavily designed landing pages for AI visibility, even if the landing page converts better for human traffic.

Identity clarity. Does the passage leave zero doubt about which entity it's describing? Schema.org Organization with a sameAs array pointing to Wikidata, LinkedIn, and Companies House dramatically reduces ambiguity. So does an identity.json file. The AI isn't checking these manually; they feed into the knowledge graph that the query-understanding stage consults.

Corroboration. Does what your page says agree with what independent third parties say? If your site claims you're the UK's leading X and no other source corroborates, the re-ranker discounts the claim. If three independent sources (Wikipedia, a trade publication, a review platform) agree on your services and positioning, the claim gets weighted much higher.
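For the identity clarity signal, this is roughly what Schema.org Organization markup with a sameAs array looks like when generated from Python. The property names are standard Schema.org; the business details and every URL and identifier are placeholders you would replace with your own.

```python
import json

# Placeholder business details and identifiers; substitute your real ones.
organization = {
    "@context": "https://schema.org",
    "@type": "Organization",
    "name": "Example Agency Ltd",
    "url": "https://example.com/",
    "description": "Web design agency based in Kettering specialising in WordPress and AI visibility.",
    "address": {
        "@type": "PostalAddress",
        "addressLocality": "Kettering",
        "addressCountry": "GB",
    },
    "sameAs": [
        "https://www.wikidata.org/wiki/Q00000000",
        "https://www.linkedin.com/company/example-agency",
        "https://find-and-update.company-information.service.gov.uk/company/00000000",
    ],
}

json_ld = json.dumps(organization, indent=2)
print(f'<script type="application/ld+json">\n{json_ld}\n</script>')
```

The description deliberately reuses the same plain wording you would want third parties to use, since consistent phrasing is what the corroboration signal looks for.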

If these three feel familiar, it's because they map exactly onto what we call AI Visibility Checking: validating whether a website can be correctly discovered, interpreted, trusted, and safely used by AI systems. The definition was written to describe the retrieval pipeline's inputs, and this section explains why.

The corroboration network

Corroboration is the silent half of the re-ranking score. AI systems prefer claims that multiple independent sources agree on, so your visibility depends partly on what else says what you say.

The corroboration part deserves a section of its own because it's the signal website owners have the least direct control over, and the one that tends to have the largest multiplier effect on citations.

AI re-rankers triangulate. If your site says you're a web design agency based in Kettering that specialises in WordPress and AI visibility, the re-ranker wants to see that corroborated by independent sources. A Wikipedia article (if the business is notable enough), a Companies House filing, a LinkedIn company page, a review on Trustpilot, a trade-association directory listing, a feature in an industry publication. Each of these is a separate node in a network, and the network is what the AI treats as ground truth.

If your owned signals (website, llms.txt, Schema.org) and your third-party signals (Wikipedia, LinkedIn, press, reviews) say the same thing in similar language, the AI can treat your claim with confidence. If they drift (different founding dates, different descriptions of services, different positioning), the AI de-risks by either omitting you or hedging its phrasing. That's the same "signal drift" pattern that makes ambiguous identity such a common invisible problem.
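Signal drift is simple to audit once you have written down what each source says. The sketch below compares a few facts across sources and reports the fields that disagree. The source names, fields, and values are invented sample data, and the exact-match comparison is a simplification, since real descriptions rarely match word for word.

```python
# Hypothetical facts collected by hand from each source.
sources = {
    "website": {"founded": "2014", "location": "Kettering", "service": "WordPress web design"},
    "linkedin": {"founded": "2015", "location": "kettering", "service": "WordPress web design"},
    "companies_house": {"founded": "2014", "location": "Kettering"},
}


def find_drift(records: dict[str, dict[str, str]]) -> dict[str, dict[str, str]]:
    """Return each field whose values differ between sources, with who says what."""
    fields = sorted({field for record in records.values() for field in record})
    drift = {}
    for field in fields:
        values = {name: record[field] for name, record in records.items() if field in record}
        # Compare case-insensitively so "kettering" and "Kettering" count as agreeing.
        if len({value.strip().lower() for value in values.values()}) > 1:
            drift[field] = values
    return drift


for field, values in find_drift(sources).items():
    print(field, "->", values)
```

Here only the founding date drifts. A mismatch like that is small for a human reader but is exactly the kind of disagreement that makes a claim harder to corroborate.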

There's no shortcut here. Building the corroboration network is the slow work of being described consistently by other people, which is why it can't be faked and why it's the highest-value investment for AI visibility.

What this means for your website

The pipeline isn't hostile. It's just indifferent. It'll cite whatever sits cleanly in its candidate pool and scores well in its re-ranker. The checklist for getting there is short:

Be reachable. No aggressive crawler blocking, no JavaScript-only content, no opaque Cloudflare challenges for GPTBot, ClaudeBot, PerplexityBot, or GoogleOther.
Be extractable. Write the factual parts of your site in short, standalone paragraphs. Avoid marketing prose for anything you want cited.
Be unambiguous. Publish Schema.org Organization, publish llms.txt and identity.json, and use identical phrasing for your core services across your owned assets.
Be corroborated. Make sure Wikipedia, LinkedIn, Companies House, trade directories, and reviews describe you using the same language your site does.
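For the "be unambiguous" step, here is a minimal sketch of generating an llms.txt file. It assumes the layout described in the public llms.txt proposal (a title heading, a short blockquote summary, then sections of annotated links); the business name, URLs, and the build_llms_txt helper are all hypothetical. identity.json is left out because its schema isn't described in this article.

```python
def build_llms_txt(name: str, summary: str, sections: dict[str, list[tuple[str, str, str]]]) -> str:
    """Assemble llms.txt text: title, blockquote summary, then link sections."""
    lines = [f"# {name}", "", f"> {summary}", ""]
    for heading, links in sections.items():
        lines.append(f"## {heading}")
        lines.append("")
        for title, url, note in links:
            lines.append(f"- [{title}]({url}): {note}")
        lines.append("")
    return "\n".join(lines).rstrip() + "\n"


# Placeholder content; reuse the exact phrasing from your site and Schema.org markup.
llms_txt = build_llms_txt(
    name="Example Agency Ltd",
    summary="Web design agency based in Kettering specialising in WordPress and AI visibility.",
    sections={
        "Core pages": [
            ("Services", "https://example.com/services", "What we offer and who it is for"),
            ("About", "https://example.com/about", "Company facts, location, and founding date"),
        ],
    },
)
print(llms_txt)
```

The summary line is the same sentence used in the other owned assets, so the file reinforces your identity signals instead of adding a fourth variant.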

None of these are growth hacks. They're the minimum viable input for a retrieval pipeline that's quietly processing hundreds of millions of queries a day. The sites that do them well will get cited regularly. The sites that don't will keep assuming AI search is unpredictable when actually it's just predictable about something they aren't measuring.

See which stage of the pipeline your site fails at

The AI Visibility Checker runs your site through the same checks an AI retrieval system does: accessibility, identity clarity, AI Discovery Files, and extractability. You get a deterministic score in under a minute, with specific fixes for each failing signal.

Frequently asked questions

How does ChatGPT search the web?

When ChatGPT decides a query needs fresh information, it issues one or more real-time search requests to an underlying web index (Bing, in most cases), retrieves candidate pages, extracts relevant passages, then synthesises an answer with citations. It doesn't rely on its training data for anything time-sensitive. On roughly 20 to 35% of prompts, ChatGPT triggers a live web search, which works out to hundreds of millions of retrievals a day.

Is AI search the same as RAG (retrieval-augmented generation)?

Yes, most practical AI search is RAG. The model doesn't "know" the answer from training alone; it retrieves relevant passages from a live source, then generates a response grounded in those passages. Perplexity, Google AI Overviews, Google AI Mode, ChatGPT Search, Claude with web access, and Gemini Grounding all use variants of this pattern.

Do all AI search engines use the same index?

No. ChatGPT search largely uses Bing. Perplexity runs its own index supplemented by third-party sources. Google AI Overviews and AI Mode use Google's index. Claude fetches directly from the open web. The site that gets cited depends partly on which index the AI is reading from, which is why cross-engine visibility testing matters.

What is query fan-out in AI search?

Query fan-out is the step where an AI system decomposes one user question into several related sub-queries, issues them all in parallel, then merges the results. A single question like "best CRM for a plumbing business" can produce six or eight internal searches. Your site only needs to match one of them to enter the candidate pool.

Does ranking #1 on Google mean I'll be cited in AI search?

Not reliably. There's significant overlap, because AI systems often start from a search index and trust the same authority signals Google does, but the re-ranking step weights different things: passage-level extractability, identity clarity, and corroboration across independent sources. A page that's hard to summarise or whose business identity is ambiguous can be outranked by a less-authoritative page that's easier to parse.

How do AI Discovery Files fit into the pipeline?

AI Discovery Files like llms.txt and identity.json give AI systems a short, unambiguous declaration of who you are and what you do, sitting alongside your HTML content. They don't bypass the pipeline; they make the understanding and re-ranking stages far more confident, because your identity signals agree with each other instead of contradicting.

Can I force ChatGPT or Claude to cite my website?

No, and anyone selling that capability is selling fiction. What you can do is make sure your site is accessible, that your identity signals are consistent, and that your content is easy to extract as standalone passages. That puts you in the candidate pool. The re-ranker decides the rest. You can check whether your site meets these conditions with the AI Visibility Checker.