Is Your Website Blocking AI? A Technical Visibility Checklist

An estimated 27% of websites unintentionally block AI crawlers through CDN rules, WAF settings, or broad robots.txt directives. This technical checklist covers every barrier we see and how to fix each one.

The invisible problem

Your website might be invisible to AI, and you'd never know it.

When someone asks ChatGPT, Claude, or Gemini about your industry, your competitors show up. You don't. Not because your content is weak or your site is new, but because something technical is preventing AI systems from accessing your information in the first place.

This isn't a fringe concern. As Mark McNeece explains in our expert Q&A on AI Visibility, an estimated 27% of websites unintentionally block AI crawlers through CDN rules, WAF configurations, or overly broad robots.txt directives. They didn't choose to be invisible. Their infrastructure made the choice for them.

The difference between a website that AI systems can cite and one they ignore often comes down to a handful of configuration settings. A robots.txt rule that blocks GPTBot. A firewall that treats ClaudeBot as a threat. A missing llms.txt file that would have told AI systems exactly who you are and what you do.

This checklist walks through every technical barrier we see when running AI visibility checks across hundreds of websites. Each item is something you can verify and fix today. For a broader view of what AI search engines need from your site beyond just crawler access, see our guide on how to appear in AI search results.

How AI systems access your website

Before working through the checklist, it helps to understand how AI crawlers behave differently from search engine crawlers.

Traditional search crawlers like Googlebot visit your pages to build a search index. They follow links, read HTML, respect cache headers, and send referral traffic back when users click results. The exchange is simple: you let the crawler in, and you get search visibility in return.

AI crawlers work differently. They sit inside a wider retrieval pipeline, and understanding that pipeline helps explain why a single blocked crawler can knock you out of citations entirely. Systems like GPTBot (OpenAI), ClaudeBot (Anthropic), and PerplexityBot fetch content to train language models or to answer user queries directly. They often read raw HTML without executing JavaScript. They look for specific files in your root directory. And the traffic ratio is dramatically skewed: where Google crawls roughly 5 times per referral it sends, Anthropic's crawl-to-referral ratio has been measured at over 38,000 to 1.

"If a human were doing a task... you might go to five websites. Your agent... will often go to a thousand times the number of sites. So it might go to 5,000 sites. And that's real traffic, and that's real load."

Matthew Prince
CEO, Cloudflare

That line, "and that's real load," landed hard the first time I read it. We'd been thinking about AI visibility as a content problem: are your files in the right place, is the information accurate? But Prince is talking about something more visceral. Thousands of bots hammering your server for data they may never send a single visitor back for. When we started running AI visibility checks across client sites, we kept finding servers that were buckling under crawler traffic the site owners didn't even know existed. They were paying for hosting to serve human visitors, and half their bandwidth was going to machines. Cloudflare's 2026 threat report found that 94% of login attempts now come from automated bots. That number puts the scale of non-human traffic into perspective.

Cloudflare estimates bot traffic will overtake human traffic by 2027. That's not a distant prediction. It's next year. The question isn't whether to engage with AI crawlers. It's whether you're engaging on your terms, or being excluded by default.

Checklist: Crawler access

This is where most unintentional blocking happens. Your robots.txt file and server-level rules determine which crawlers can reach your content.

Review your robots.txt for AI-specific rules

Open https://yourdomain.com/robots.txt in your browser. Look for rules targeting these user agents:

  • GPTBot (OpenAI, ChatGPT)
  • ClaudeBot and Claude-Web (Anthropic, Claude)
  • PerplexityBot (Perplexity)
  • Google-Extended (Google Gemini training)
  • Applebot-Extended (Apple Intelligence)
  • CCBot (Common Crawl, used by many AI models)
  • Meta-ExternalAgent (Meta AI)

If you see Disallow: / for any of these, your site is invisible to that AI system. This is a deliberate choice for some publishers, but many businesses have these rules without realising it, sometimes added by a CMS plugin or a security template. The 365i Robots.txt Checker can parse your file and test specific URL paths against your crawl rules to spot exactly what's blocked.
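The same review can be scripted with Python's standard-library robots.txt parser. The sketch below is a minimal illustration rather than a full checker: the function name blocked_ai_agents and the sample file are hypothetical, and it only tests one path at a time (the site root by default).

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the list above.
AI_AGENTS = [
    "GPTBot", "ClaudeBot", "Claude-Web", "PerplexityBot",
    "Google-Extended", "Applebot-Extended", "CCBot", "Meta-ExternalAgent",
]


def blocked_ai_agents(robots_txt: str, path: str = "/") -> list[str]:
    """Return the AI user agents that may not fetch `path` under these rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [agent for agent in AI_AGENTS if not parser.can_fetch(agent, path)]


# Hypothetical robots.txt of the kind a plugin or security template might add.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /wp-admin/
"""

print(blocked_ai_agents(SAMPLE))
```

To run it against a live site, paste the contents of your own robots.txt in place of SAMPLE. Note that Python's parser applies rules in file order, which can differ slightly from how individual crawlers resolve conflicting rules.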

Paul Calvano's HTTP Archive analysis found that almost 21% of the top 1,000 websites now include rules for GPTBot, and that across the wider web the number of sites with GPTBot rules grew from near zero to over 500,000 in under two years. Many of those rules are blanket blocks added reactively, not strategically. With AI models advancing rapidly (Anthropic's Claude Mythos can now autonomously discover zero-day vulnerabilities), the systems trying to read your website are only getting more capable. Blocking them cuts you off from an increasingly powerful audience.

Check for wildcard blocks

A common mistake is using a broad wildcard rule that catches more than intended:

# This blocks ALL bots, including AI crawlers
User-agent: *
Disallow: /

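If your robots.txt contains only this rule, every AI crawler is locked out, because a crawler with no user-agent group of its own falls back to the * group. Either add a dedicated group with Allow rules for each crawler you want, or replace the blanket Disallow with rules that block only specific paths.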
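To see how a dedicated group changes the outcome, here is a small sketch that compares a blanket block with a targeted alternative. The TARGETED file and the /private/ path are hypothetical examples, not a recommended policy, and is_allowed is just a helper name for this illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, agent: str, path: str = "/") -> bool:
    """Check one user agent and path against robots.txt content."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


BLANKET = """\
User-agent: *
Disallow: /
"""

# Hypothetical fix: keep the default block, but give chosen crawlers their own
# group. The Disallow line comes first because Python's parser uses the first
# matching rule; crawlers that use longest-match rules reach the same result.
TARGETED = BLANKET + """
User-agent: GPTBot
User-agent: ClaudeBot
Disallow: /private/
Allow: /
"""

for agent in ("GPTBot", "ClaudeBot", "PerplexityBot"):
    print(agent, is_allowed(BLANKET, agent), is_allowed(TARGETED, agent))
```

Crawlers named in their own group follow that group and ignore the wildcard, while every other bot still hits the default block.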

Check your CDN and WAF rules

This is the most common source of unintentional blocking. Content delivery networks and web application firewalls often classify AI crawlers as suspicious traffic. Cloudflare, Sucuri, Wordfence, and similar security tools may challenge or block requests from AI user agents by default.

Check your CDN dashboard for:

  • Bot management rules that block "unrecognised" user agents
  • Rate limiting thresholds that AI crawlers might exceed
  • JavaScript challenge pages that block non-browser clients
  • Country-based blocking that might exclude AI crawler IP ranges

If you're on managed hosting, your hosting provider should be able to whitelist known AI crawler user agents while keeping malicious bot protection active.
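A rough way to spot user-agent-based blocking is to request the same page with different User-Agent headers and compare the status codes. The sketch below rests on some assumptions: the User-Agent values are simplified stand-ins (real crawlers send longer strings), the function names are hypothetical, and many WAFs also check source IP ranges, so a matching status here does not prove the real crawler gets through.

```python
import urllib.error
import urllib.request

# Simplified, illustrative User-Agent values (assumptions, not the exact
# strings the real crawlers send).
USER_AGENTS = {
    "browser": "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 Chrome/120.0 Safari/537.36",
    "GPTBot": "Mozilla/5.0 (compatible; GPTBot/1.0)",
    "ClaudeBot": "Mozilla/5.0 (compatible; ClaudeBot/1.0)",
}


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status for one GET request sent with the given User-Agent."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as error:
        return error.code  # 403, 429, 503 and similar arrive as exceptions


def treated_differently(statuses: dict[str, int], baseline: str = "browser") -> dict[str, int]:
    """Return the agents whose status differs from the baseline agent's status."""
    expected = statuses[baseline]
    return {agent: status for agent, status in statuses.items()
            if agent != baseline and status != expected}


# Example usage (performs real requests, so it is left commented out):
# statuses = {name: fetch_status("https://yourdomain.com/", ua)
#             for name, ua in USER_AGENTS.items()}
# print(treated_differently(statuses))
```

If the browser entry returns 200 while a crawler entry returns 403 or 429, a bot rule is a likely cause and worth checking in your CDN or WAF dashboard.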

Checklist: Server configuration

Even when crawlers aren't blocked, poor server configuration can prevent them from reliably accessing your content.

Response times

AI crawlers are less patient than you might expect. If your server takes more than a few seconds to respond, crawlers move on. Google's own crawler ignores content that takes too long to fetch. AI crawlers behave similarly.

Test your server response time (not page load time, which includes client-side rendering). Your Time to First Byte (TTFB) should be under 500ms consistently. If it's regularly above 1 second, your hosting infrastructure needs attention.
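If you'd like to measure this yourself, the following sketch times the first byte of a response and compares a set of samples against the 500ms and 1 second thresholds above. The function names and the middle "watch" band are our own assumptions for illustration, and the measurement includes DNS lookup and TLS handshake, so treat it as an upper bound on TTFB.

```python
import statistics
import time
import urllib.request


def measure_ttfb_ms(url: str, timeout: float = 10.0) -> float:
    """Approximate TTFB in milliseconds: request start to first body byte."""
    start = time.perf_counter()
    with urllib.request.urlopen(url, timeout=timeout) as response:
        response.read(1)
    return (time.perf_counter() - start) * 1000


def ttfb_verdict(samples_ms: list[float]) -> str:
    """Classify a set of TTFB samples against the thresholds in the text."""
    if max(samples_ms) < 500:
        return "ok"  # consistently under 500ms
    if statistics.median(samples_ms) > 1000:
        return "needs attention"  # regularly above 1 second
    return "watch"  # assumed middle band: occasional slow responses


# Example usage (performs real requests, so it is left commented out):
# samples = [measure_ttfb_ms("https://yourdomain.com/") for _ in range(5)]
# print(ttfb_verdict(samples))
```

Taking several samples at different times of day gives a better picture than a single request, particularly on shared hosting.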

For sites on shared hosting where noisy-neighbour effects cause intermittent slowdowns, moving to a dedicated WordPress hosting environment or managed cloud server with isolated resources can make a measurable difference to crawler reliability.

SSL/TLS configuration

AI crawlers require valid SSL certificates. Expired, self-signed, or misconfigured certificates cause connection failures that happen silently. A browser shows a warning that a visitor can click past, so the site can still appear to work, but crawlers simply fail and move on.

Verify that your certificate is valid, covers all subdomains you use, and includes a proper certificate chain. Free certificates from Let's Encrypt work perfectly well; the important thing is that they're current and correctly installed. The 365i HTTPS Inspector scans for mixed content, insecure resources, and SSL/TLS configuration issues that might not be visible in a browser.
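As a rough self-check, Python's ssl module can attempt the same strict handshake a crawler would and report how long the certificate has left. This is a minimal sketch with hypothetical function names. It validates the chain and hostname using the system trust store, but it does not test every subdomain or look for mixed content.

```python
import socket
import ssl
import time
from typing import Optional


def fetch_not_after(host: str, port: int = 443, timeout: float = 10.0) -> str:
    """Return the certificate's notAfter string.

    Raises ssl.SSLError (for example SSLCertVerificationError) when the
    certificate is expired, self-signed, or does not match the hostname.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Convert a notAfter string into days remaining (negative if expired)."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


# Example usage (opens a real connection, so it is left commented out):
# print(days_until_expiry(fetch_not_after("yourdomain.com")))
```

A handshake that raises an error here is a reasonable sign that strict clients, crawlers included, are failing in the same way.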

HTTP response codes

Your AI Discovery Files and key pages must return HTTP 200. Common issues:

  • 301/302 redirect chains: A file at /llms.txt that redirects to /en/llms.txt that redirects to https://www.example.com/en/llms.txt. Each hop increases the chance a crawler gives up.
  • 403 Forbidden: Server permissions blocking access to .txt or .json files in your root directory.
  • 404 Not Found: The most basic failure. Files that simply don't exist yet.
  • 503 Service Unavailable: Server under load or in maintenance mode when a crawler visits.
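To make redirect chains visible, you can follow them one hop at a time instead of letting the HTTP client resolve them silently. The sketch below is one possible approach with hypothetical names (http_fetch, trace). The fetch function is passed in as a parameter so the tracing logic can be exercised without network access.

```python
import urllib.error
import urllib.request
from urllib.parse import urljoin

REDIRECT_CODES = (301, 302, 303, 307, 308)


class _NoRedirect(urllib.request.HTTPRedirectHandler):
    def redirect_request(self, req, fp, code, msg, headers, newurl):
        return None  # surface 3xx responses instead of following them


def http_fetch(url: str, timeout: float = 10.0):
    """Return (status, Location header or None) without following redirects."""
    opener = urllib.request.build_opener(_NoRedirect)
    try:
        with opener.open(url, timeout=timeout) as response:
            return response.status, None
    except urllib.error.HTTPError as error:
        return error.code, error.headers.get("Location")


def trace(url: str, fetch=http_fetch, max_hops: int = 5):
    """Follow redirects hop by hop and return [(url, status), ...]."""
    hops = []
    for _ in range(max_hops + 1):
        status, location = fetch(url)
        hops.append((url, status))
        if status not in REDIRECT_CODES or not location:
            break
        url = urljoin(url, location)  # resolve relative Location values
    return hops


# Example usage (performs real requests, so it is left commented out):
# for hop_url, status in trace("https://yourdomain.com/llms.txt"):
#     print(status, hop_url)
```

A healthy result is a single hop with status 200. Anything longer shows exactly where each redirect, 403, 404, or 503 occurs.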

Content-Type headers

Your llms.txt should be served as text/plain. Your ai.json and identity.json should be served as application/json. If your server returns the wrong MIME type, parsers may reject the content even though it's valid.

Most web servers handle this correctly by default based on file extensions, but CMS platforms and custom routing can override defaults. You can test with curl -I https://yourdomain.com/llms.txt in your terminal, or use the 365i HTTP Header Inspector to check response headers, redirect chains, and status codes without leaving your browser.
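Once you have the Content-Type value from the response headers, the comparison itself is simple. This sketch assumes the expectations described above (.txt as text/plain, .json as application/json). The function names are hypothetical, and parameters such as charset are ignored.

```python
EXPECTED_TYPES = {".txt": "text/plain", ".json": "application/json"}


def media_type(content_type_header: str) -> str:
    """Strip parameters such as '; charset=utf-8' and normalise case."""
    return content_type_header.split(";", 1)[0].strip().lower()


def content_type_ok(path: str, content_type_header: str) -> bool:
    """Check a file path against the media type expected for its extension."""
    for extension, expected in EXPECTED_TYPES.items():
        if path.lower().endswith(extension):
            return media_type(content_type_header) == expected
    return True  # no expectation recorded for this extension


print(content_type_ok("/llms.txt", "text/plain; charset=utf-8"))
print(content_type_ok("/identity.json", "text/html"))
```

A result of False for identity.json served as text/html usually points to a CMS or routing rule returning an HTML page in place of the file.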

Checklist: AI Discovery Files

Even with crawler access sorted and server configuration right, AI systems still need structured information about your business. That's what AI Discovery Files provide.

These machine-readable files sit in your website's root directory and tell AI systems who you are, what you do, and how to represent you accurately. Without them, AI systems guess based on whatever they can scrape from your pages. With them, you control the narrative.

Start with llms.txt

The llms.txt specification defines the most widely adopted AI Discovery File. It's a plain text Markdown file that gives AI systems a structured overview of your business: name, description, services, exclusions, and contact information.

If you haven't created one yet, our step-by-step guide walks through the process. If you're unsure whether your business needs one, the short answer is: almost certainly yes.

Add supporting files

The full AI Discovery Files specification defines ten files, each serving a different purpose. After llms.txt, the highest-priority files are:

  • identity.json (ADF-006): Structured business identity data in JSON format
  • ai.txt (ADF-004): AI-specific permissions and preferences
  • brand.txt (ADF-007): Official business name, naming rules, and terminology

You don't need all ten on day one. But the more files you publish, the harder it becomes for AI systems to misrepresent you. The quick start guide has a recommended implementation order.

Check for consistency

If your llms.txt says "Acme Web Solutions" but your identity.json says "Acme Digital Ltd" and your Schema.org markup says "ACME", AI systems can't confidently determine which name is correct. Inconsistency across files undermines the trust signal that AI Discovery Files are meant to provide.
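A simple consistency check is to pull the business name from each source and compare the values. The sketch below makes some assumptions: it reads the first H1 heading of llms.txt as the name, and it assumes identity.json and the Schema.org JSON-LD both expose a top-level "name" field, which may not match your files exactly. The function names are hypothetical.

```python
import json


def llms_txt_name(llms_txt: str):
    """Return the text of the first '# ' heading, or None if there isn't one."""
    for line in llms_txt.splitlines():
        if line.startswith("# "):
            return line[2:].strip()
    return None


def distinct_names(llms_txt: str, identity_json: str, schema_json_ld: str) -> set:
    """Collect the distinct business names found across the three sources."""
    names = {
        llms_txt_name(llms_txt),
        json.loads(identity_json).get("name"),   # assumed field name
        json.loads(schema_json_ld).get("name"),  # Schema.org 'name' property
    }
    names.discard(None)
    return names


found = distinct_names(
    "# Acme Web Solutions\n\n> Web design and hosting.",
    '{"name": "Acme Digital Ltd"}',
    '{"@type": "Organization", "name": "ACME"}',
)
print(sorted(found))
```

More than one distinct value means the sources disagree. The comparison is deliberately exact, so "ACME" and "Acme" count as different names.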

Run your domain through the AI Visibility Checker to catch identity inconsistencies, missing files, and format errors in one scan.

Don't want to create these files yourself?

The AI Discovery Files Service Pack includes all ten files, professionally written and deployed to your website. 365i handles the technical work so you don't have to.

Checklist: Content structure

AI crawlers read your site differently from human visitors. Most don't execute JavaScript, and they don't scroll or watch videos. What they do is parse HTML and look for structured data.

JavaScript rendering

If your website relies on client-side JavaScript to render core content (React, Vue, Angular single-page applications), most AI crawlers will see an empty or near-empty page. Googlebot executes JavaScript, but GPTBot, ClaudeBot, and PerplexityBot typically don't. They read the raw HTML that your server sends.

Server-side rendering (SSR) or static site generation (SSG) ensures that crawlers see your content. If you're on a JavaScript framework, verify that your pages return complete HTML without JavaScript execution.
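One way to verify this is to count the words a non-JavaScript client would actually see in the raw HTML. The following sketch uses only the standard library. The class and function names are hypothetical, and the two sample pages are invented for illustration.

```python
from html.parser import HTMLParser


class _VisibleText(HTMLParser):
    """Collect text outside <script> and <style> elements."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.words = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.words.extend(data.split())


def visible_word_count(raw_html: str) -> int:
    """Count the words present in the HTML without running any JavaScript."""
    parser = _VisibleText()
    parser.feed(raw_html)
    parser.close()
    return len(parser.words)


SPA_SHELL = ('<html><head><title>App</title></head><body><div id="root"></div>'
             '<script>render("all the real content lives in here");</script></body></html>')
SSR_PAGE = ('<html><body><h1>Acme Web Solutions</h1>'
            '<p>We build and host websites for small businesses.</p></body></html>')

print(visible_word_count(SPA_SHELL), visible_word_count(SSR_PAGE))
```

Feed it the output of curl (not the rendered page from your browser). A count close to zero on a content-heavy page suggests the content only appears after JavaScript runs.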

Structured data and Schema.org

JSON-LD Schema.org markup gives AI systems machine-readable context about your content. At minimum, include Organization schema with your business name, URL, and contact information. Add Article schema to blog posts, LocalBusiness to location pages, and FAQPage to FAQ sections.
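As an illustration of the minimum Organization markup, this sketch builds the JSON-LD and wraps it in the script tag that goes in your page's HTML. The helper name and all the values passed to it are placeholders, and the choice of a ContactPoint for contact details is one reasonable option rather than the only valid structure.

```python
import json


def organization_json_ld(name: str, url: str, telephone: str, email: str) -> str:
    """Return a <script> element containing minimal Organization JSON-LD."""
    data = {
        "@context": "https://schema.org",
        "@type": "Organization",
        "name": name,
        "url": url,
        "contactPoint": {
            "@type": "ContactPoint",
            "contactType": "customer service",
            "telephone": telephone,
            "email": email,
        },
    }
    body = json.dumps(data, indent=2)
    return f'<script type="application/ld+json">\n{body}\n</script>'


# Placeholder values for illustration only.
snippet = organization_json_ld(
    "Acme Web Solutions", "https://www.example.com",
    "+44 20 7946 0000", "hello@example.com",
)
print(snippet)
```

Generating the markup from a single source of data also helps keep the business name identical across every page.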

Schema.org and AI Discovery Files are complementary, not alternatives. Schema.org is embedded in your HTML and tied to individual pages. AI Discovery Files are standalone documents providing a complete business overview. You need both for full AI visibility.

"The appearance of AI bot user agents in so many websites over a short period reflects site owners' sentiment toward content scraping."

Paul Calvano
Performance Architect, Etsy

This quote resonated with something we've seen first-hand. Calvano's data shows GPTBot going from zero robots.txt appearances to over 500,000 in just two years. That's not a measured response; it's a flinch. And I get it. When we first saw how aggressively some AI crawlers were hitting sites, the instinct to block everything felt rational. But sitting with it longer, and after watching businesses wonder why ChatGPT never mentions them, the cost of that reflex became obvious. They blocked the scraping, but they also blocked the citations, the recommendations, and the visibility. The smarter move, the one we keep coming back to, is to let AI crawlers in but use AI Discovery Files to control what they learn. Give them the information on your terms rather than slamming the door.

Meta robots and X-Robots-Tag

Check that your pages don't include <meta name="robots" content="noindex"> on pages you want AI systems to access. Also check for X-Robots-Tag HTTP headers that your server or CDN might add. These headers can block indexing globally without any visible indication in your HTML.
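Both signals can be checked together once you have a page's raw HTML and response headers. The sketch below is a minimal version with hypothetical names. It only looks for the generic robots meta tag and a noindex value in X-Robots-Tag, not bot-specific variants.

```python
from html.parser import HTMLParser


class _RobotsMeta(HTMLParser):
    """Collect directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives += [d.strip().lower() for d in content.split(",")]


def noindex_sources(raw_html: str, headers: dict) -> list:
    """Return where a noindex directive was found: in the HTML, the headers, or both."""
    found = []
    parser = _RobotsMeta()
    parser.feed(raw_html)
    if "noindex" in parser.directives:
        found.append("meta robots")
    for key, value in headers.items():
        if key.lower() == "x-robots-tag" and "noindex" in value.lower():
            found.append("X-Robots-Tag")
    return found


print(noindex_sources('<meta name="robots" content="noindex, nofollow">', {}))
print(noindex_sources("<p>Hello</p>", {"X-Robots-Tag": "noindex"}))
```

The second call shows the case that is easy to miss: the HTML is clean, but a header added by the server or CDN still carries the directive.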

WordPress considerations

WordPress powers over 40% of the web, and it introduces a few AI visibility issues specific to the platform.

Security plugins and bot blocking

Popular plugins like Wordfence, Sucuri Security, and iThemes Security include bot-blocking features that can reject AI crawlers. Check your security plugin settings for bot filtering rules and whitelist legitimate AI user agents.

Caching and robots.txt

Some WordPress caching plugins generate a robots.txt automatically or modify your existing one. Check that your live robots.txt (the one that crawlers actually see) matches what you expect. Plugins like Yoast SEO and Rank Math also manage robots.txt rules, and conflicting settings can create unexpected blocks.

AI Discovery Files on WordPress

The AI Discovery Files WordPress plugin generates and serves all ten AI Discovery Files directly from your dashboard. It handles the formatting, serves files at the correct URLs, and keeps everything consistent. If you're on WordPress, it's the simplest path to full AI visibility.

For WordPress sites that need more control over their hosting environment, bot management, and server configuration, 365i's WordPress hosting is built for exactly this kind of fine-grained control. And if you're migrating from a host that doesn't give you the access you need, free migrations make the switch painless.

Test your AI visibility

Once you've worked through this checklist, verify everything in one place.

Automated checking

The AI Visibility Checker scans your domain across four dimensions: AI Discovery File presence, identity consistency, crawler access, and structural readiness. Each dimension is scored, and you get specific recommendations for anything that needs attention.

It's free, takes under a minute, and doesn't require an account. If you've made changes based on this checklist, run a scan to confirm everything is working.

Manual verification

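For a quick manual check, request each of these URLs with curl -I, or load it in your browser with the developer tools open, so you can see the status code and Content-Type: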

  • https://yourdomain.com/robots.txt (should return 200, check for AI crawler rules)
  • https://yourdomain.com/llms.txt (should return 200, text/plain)
  • https://yourdomain.com/ai.txt (should return 200, text/plain)
  • https://yourdomain.com/identity.json (should return 200, application/json)

If any of these return a 404, 403, or redirect, you've found something to fix.

List your site

Once your AI Discovery Files are in place and validated, submit your site to the AI Discovery Files Directory. The directory is monitored by AI crawlers and helps accelerate the discovery of your files. You can also browse the directory to see how other organisations have implemented their AI visibility infrastructure. For an example of how quickly a properly configured site can become AI-visible, see our case study of a three-week-old site that topped AI search.

Need expert help?

The AI Discovery Files Service Pack covers everything in this checklist. 365i writes, validates, and deploys all ten AI Discovery Files to your website. One fixed price, no ongoing subscription.

Frequently asked questions

How do I check if my website is blocking AI crawlers?

The fastest way is to run your domain through the AI Visibility Checker, which tests crawler access, AI Discovery Files, and structural readiness in one scan. You can also check manually by reviewing your robots.txt for AI-specific user-agent rules, testing server responses with curl, and verifying that your AI Discovery Files return HTTP 200.

Which AI crawlers should I allow in robots.txt?

At minimum, allow GPTBot (OpenAI/ChatGPT), ClaudeBot (Anthropic/Claude), PerplexityBot (Perplexity), Google-Extended (Gemini), and Applebot-Extended (Apple Intelligence). These are the primary AI systems that businesses and consumers use to find information. Block only crawlers you have a specific reason to exclude.

What is the difference between blocking AI crawlers and blocking search engines?

Search engine crawlers like Googlebot index your pages for search results. AI crawlers like GPTBot and ClaudeBot retrieve content to train models or answer user queries directly. Blocking search engines removes you from search results. Blocking AI crawlers removes you from AI-generated answers, citations, and recommendations. They use separate user-agent strings, so you can control access independently.

Do AI Discovery Files replace robots.txt?

No. They serve different purposes. robots.txt controls which crawlers can access which parts of your site. AI Discovery Files tell AI systems who you are, what you do, and how to represent you. Think of robots.txt as the door policy and AI Discovery Files as the business card you hand over once someone is inside.

Does my hosting provider affect AI visibility?

Yes. Server response times, uptime, SSL configuration, and WAF/CDN settings all affect whether AI crawlers can reliably access your content. Aggressive rate limiting, bot-detection rules that misidentify AI crawlers as threats, and slow response times can all prevent AI systems from reading your site. Managed hosting with configurable bot rules gives you more control.

Will allowing AI crawlers increase my server costs?

On most hosting plans, no. Shared and managed hosting absorbs AI crawler traffic within your existing plan. Serverless or pay-per-request hosting is the exception: unmanaged AI crawler traffic can cause billing spikes on those platforms. The solution isn't to block crawlers entirely, but to manage access with proper rate limits and caching. AI Discovery Files also help by giving crawlers a concise summary, reducing the need to crawl your entire site.

How long does it take for AI systems to find my AI Discovery Files?

AI systems don't fetch files in real time for every query. They crawl periodically, similar to search engines. After publishing your files, expect a few days to several weeks before AI systems incorporate the information. You can speed this up by submitting your site to the AI Discovery Files Directory, which AI crawlers monitor.

Can I block AI training but still appear in AI answers?

Partially. Some crawlers separate training from retrieval. Google-Extended controls Gemini training without affecting Google Search. OpenAI also documents separate agents: GPTBot collects training data, while OAI-SearchBot and ChatGPT-User handle search and user-initiated fetches, so check which of them your rules cover before assuming a GPTBot block affects only training. The robots-ai.txt specification provides more granular control, letting you specify permissions for different AI use cases.

Sources