Robots.txt for AI Crawlers: How to Allow GPTBot, ClaudeBot, and PerplexityBot
Your robots.txt file controls which AI crawlers can access your site. Here's how to configure it for maximum AI visibility.
The robots.txt file tells web crawlers what they can and can't access on your website. Traditionally, this meant Googlebot and Bingbot. But now there's a new generation of AI crawlers — and many websites are blocking them without realizing it.
If AI crawlers can't access your site, your business won't appear in AI-generated answers from ChatGPT, Claude, Perplexity, or Google AI Overviews. This guide covers every AI crawler you need to know about and how to configure robots.txt correctly.
Check your robots.txt instantly
Foglift scans your robots.txt and checks every AI crawler listed below. Free, no signup required.
What the Data Shows About AI Crawler Blocking
Before configuring your robots.txt, it helps to understand the landscape. AI agent traffic grew over 6,900% year-over-year in 2025, making this decision increasingly consequential.
A 2025 BuzzStream study of top news publishers found that 79% block AI training bots via robots.txt, while 71% also block AI retrieval bots. The most-blocked crawlers: ClaudeBot (69% of sites), PerplexityBot (67%), and GPTBot (62%).
Cloudflare's Q1 2026 data reveals that 89.4% of AI crawler traffic is training or mixed-purpose, while only 8% is search-related and just 2.2% responds to actual user queries in real time. This distinction matters for your blocking strategy.
The blocking paradox
BuzzStream found that 70.6% of sites blocking ChatGPT-User still appeared in AI citations — blocking via robots.txt does not reliably prevent AI from citing your content. But publishers who blocked AI crawlers experienced a 23.1% decline in total monthly visits and a 13.9% drop in human-only browsing. The takeaway: blocking costs you traffic but doesn't prevent citation.
The Three-Tier Crawler Framework (2026)
As of 2026, major AI companies no longer use a single crawler. They've split into three tiers — and your robots.txt strategy needs to account for each:
| Purpose | OpenAI | Anthropic | Perplexity |
|---|---|---|---|
| Training | GPTBot | ClaudeBot | PerplexityBot |
| Search indexing | OAI-SearchBot | Claude-SearchBot | — |
| User browsing | ChatGPT-User | Claude-User | Perplexity-User |
Key insight: Blocking the training bot does not block the search or user-browsing bots. If you block GPTBot but allow OAI-SearchBot, ChatGPT search can still index your site — but the base model won't train on your content. Anthropic confirmed all three of its bots honor robots.txt independently (Search Engine Journal, 2026).
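This independence is easy to verify locally with Python's standard-library robots.txt parser. A minimal sketch (the sample rules are illustrative, not a recommendation):

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that blocks OpenAI's training bot but allows its search bot.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: OAI-SearchBot
Allow: /
"""

parser = RobotFileParser()
parser.parse(SAMPLE.splitlines())

# Each bot is matched against its own user-agent group, so blocking
# GPTBot has no effect on OAI-SearchBot.
print(parser.can_fetch("GPTBot", "https://example.com/post"))         # False
print(parser.can_fetch("OAI-SearchBot", "https://example.com/post"))  # True
```

The same check works for Anthropic's three bots: a `Disallow` group for ClaudeBot leaves Claude-SearchBot and Claude-User unaffected.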
Complete AI Crawler Reference Table
Here's the complete list of AI crawlers as of April 2026, updated to include Apple, Meta, DuckDuckGo, and Common Crawl agents that launched in 2024–2025. ClaudeBot has approximately doubled its crawl rate between Q3 2025 and Q1 2026, suggesting Anthropic is scaling retrieval infrastructure significantly (TechnologyChecker.io, Q1 2026). Meta-ExternalAgent launched in July 2024 for Llama training (Meta Developers, 2024); Applebot-Extended launched June 2024 as an AI-training opt-out control for Apple Intelligence (Apple Support, 2024).
| Crawler | Company | Tier | Powers | User-Agent |
|---|---|---|---|---|
| GPTBot | OpenAI | Training | Model training data | GPTBot |
| OAI-SearchBot | OpenAI | Search | ChatGPT search indexing | OAI-SearchBot |
| ChatGPT-User | OpenAI | User | Real-time web browsing | ChatGPT-User |
| ClaudeBot | Anthropic | Training | Model training data | ClaudeBot |
| Claude-SearchBot | Anthropic | Search | Claude search indexing | Claude-SearchBot |
| Claude-User | Anthropic | User | User-requested browsing | Claude-User |
| PerplexityBot | Perplexity AI | Training | Perplexity indexing | PerplexityBot |
| Perplexity-User | Perplexity AI | User | Real-time retrieval | Perplexity-User |
| Google-Extended | Google | Training | Gemini, AI Overviews | Google-Extended |
| Amazonbot | Amazon | Mixed | Alexa answers, Amazon AI | Amazonbot |
| Bytespider | ByteDance | Training | TikTok / Doubao AI | Bytespider |
| cohere-ai | Cohere | Training | Cohere AI products | cohere-ai |
| Applebot-Extended | Apple | Training (opt-out signal) | Apple Intelligence foundation models | Applebot-Extended |
| Meta-ExternalAgent | Meta | Training | Llama / Meta AI training | meta-externalagent |
| Meta-ExternalFetcher | Meta | User | Meta AI user-requested fetches | meta-externalfetcher |
| DuckAssistBot | DuckDuckGo | User | DuckAssist cited answers | DuckAssistBot |
| CCBot | Common Crawl | Training (dataset) | Open dataset used by many LLMs | CCBot |
Sources: Search Engine Journal (Dec 2025), ALM Corp (2026), Anthropic documentation, Apple Support (119829, 120320), Meta Developers (externalagent crawler docs), DuckDuckGo Help Pages (duckassistbot), Common Crawl (commoncrawl.org/ccbot). Note: anthropic-ai and Claude-Web are deprecated — use ClaudeBot, Claude-SearchBot, and Claude-User instead.
Three crawlers worth their own callout:
Applebot-Extended is an opt-out signal, not a separate crawler. It does not fetch pages of its own. Apple's regular Applebot does the crawling; blocking Applebot-Extended tells Apple not to use the content for generative-AI training, while still leaving you indexable for Siri, Spotlight, and Apple search. Roughly 6–7% of high-traffic sites block it today — mostly news publishers including The New York Times, The Financial Times, The Atlantic, Vox Media, and Condé Nast (Apple Support, 2025).

Meta-ExternalFetcher can bypass robots.txt for user-initiated URLs. Meta states that facebookexternalhit and meta-externalfetcher may ignore robots.txt when a user explicitly provides a URL as context to a Meta AI product — the same carve-out that Perplexity-User and ChatGPT-User apply. If you need a hard block on user-triggered fetches, you need firewall rules, not robots.txt alone (Meta Developers, 2026).

CCBot blocks propagate slowly. Common Crawl publishes snapshots quarterly, and older snapshots live forever in derivative training datasets like The Pile and RedPajama. Blocking CCBot today removes you from future snapshots; content you already published remains in circulation for years (Common Crawl, 2025). Treat it as a long-horizon decision, not an instant kill switch.
Recommended robots.txt Configuration
For most businesses that want maximum AI visibility, use this robots.txt:
```
# Standard search engine crawlers
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /

# OpenAI (all three tiers)
User-agent: GPTBot
Allow: /

User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

# Anthropic (all three tiers)
User-agent: ClaudeBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

# Perplexity (both tiers)
User-agent: PerplexityBot
Allow: /

User-agent: Perplexity-User
Allow: /

# Google AI features
User-agent: Google-Extended
Allow: /

# Amazon AI
User-agent: Amazonbot
Allow: /

# Default: allow everything else, but block private/admin areas
User-agent: *
Allow: /
Disallow: /admin/
Disallow: /api/
Disallow: /auth/

Sitemap: https://yoursite.com/sitemap.xml
```

Note that the private-area Disallow rules sit directly inside the `User-agent: *` group; a blank line before them could start a new group and leave them unattached in strict parsers.
How to Selectively Allow or Block AI Crawlers
The three-tier framework enables a nuanced strategy: allow search and user-browsing bots (so you appear in AI answers) while blocking training bots (so your content isn't used to train models). Here's how:
```
# ALLOW search + user bots (appear in AI answers)
User-agent: OAI-SearchBot
Allow: /

User-agent: ChatGPT-User
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: Claude-User
Allow: /

User-agent: Perplexity-User
Allow: /

# BLOCK training bots (prevent model training)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Google-Extended
Disallow: /
```
Note: Even with this selective approach, blocking training crawlers means AI models won't have up-to-date knowledge of your site. They may still cite you from cached data, but it won't be current. And as BuzzStream's data shows, blocking doesn't reliably prevent citations — 70.6% of sites blocking ChatGPT-User still appeared in AI citations.
Compliance caveat: Bytespider (ByteDance) claimed robots.txt compliance but was observed accessing disallowed paths on test sites within 30 days of applying a block. Consider firewall-level blocking for crawlers you don't trust to honor robots.txt.
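A firewall-level block enforces the rule at the web server instead of relying on the crawler's good behavior. A minimal sketch for nginx (assuming the rule is placed inside your `server` block; the user-agent pattern is an example, adjust to the crawlers you want to refuse):

```nginx
# Return 403 to Bytespider regardless of what robots.txt says.
# ~* makes the regex match case-insensitively.
if ($http_user_agent ~* "bytespider") {
    return 403;
}
```

Note that user-agent strings can be spoofed; for stronger enforcement, combine this with IP-range or reverse-DNS verification where the crawler's operator publishes one.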
How to Edit robots.txt on Popular Platforms
WordPress
Edit via Yoast SEO plugin: SEO → Tools → File editor → robots.txt. Or create/edit the file at your site root.
Squarespace
Go to Settings → SEO → scroll to "Additional Robots.txt Rules" and add your AI crawler rules there.
Wix
Go to Dashboard → Settings → SEO (Google) → SEO Tools → Robots.txt Editor.
Shopify
Shopify auto-generates robots.txt. Edit it via theme.liquid or use a Shopify robots.txt app.
Next.js / Vercel
Create a robots.ts file in your app/ directory or add a static robots.txt in public/.
Common Mistakes
- Using a wildcard Disallow that blocks AI crawlers — `User-agent: *` followed by `Disallow: /` blocks everything, including AI crawlers.
- Not checking platform defaults — Some CMS platforms add AI crawler blocks automatically. Always check after setup.
- Blocking GPTBot but expecting ChatGPT visibility — GPTBot is how OpenAI's models learn about your site. Without it (and OAI-SearchBot), ChatGPT has to rely on third-party indexes like Bing.
- Forgetting to add a sitemap reference — Always include `Sitemap: https://yoursite.com/sitemap.xml` at the end of robots.txt.
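The wildcard-Disallow and missing-sitemap mistakes can be caught mechanically. A minimal sketch in Python (the `lint_robots` helper is hypothetical, written for this article):

```python
def lint_robots(text: str) -> list[str]:
    """Flag two common robots.txt mistakes: a wildcard full block and a missing sitemap."""
    warnings: list[str] = []
    current_agents: list[str] = []
    in_rules = False
    has_sitemap = False
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line or ":" not in line:
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if in_rules:           # a rule line ended the previous group
                current_agents = []
                in_rules = False
            current_agents.append(value)
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and "*" in current_agents:
                warnings.append("'User-agent: *' with 'Disallow: /' blocks everything, including AI crawlers")
        elif field == "sitemap":
            has_sitemap = True
    if not has_sitemap:
        warnings.append("no Sitemap: line found")
    return warnings
```

Running it on `"User-agent: *\nDisallow: /\n"` yields both warnings; a file that allows crawling and ends with a Sitemap line yields none.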
How to Verify Your Configuration
- Visit `yoursite.com/robots.txt` in your browser
- Check that GPTBot, ClaudeBot, and PerplexityBot are not blocked
- Run a free Website Audit — it checks all AI crawlers and shows exactly which are blocked
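The crawler check can also be scripted with Python's standard library. A minimal sketch (`AI_BOTS` and `blocked_ai_crawlers` are names made up for this example; pass in the text of your own robots.txt):

```python
from urllib.robotparser import RobotFileParser

AI_BOTS = [
    "GPTBot", "OAI-SearchBot", "ChatGPT-User",
    "ClaudeBot", "Claude-SearchBot", "Claude-User",
    "PerplexityBot", "Perplexity-User", "Google-Extended",
]

def blocked_ai_crawlers(robots_text: str, probe_url: str = "https://example.com/") -> list[str]:
    """Return the AI crawlers that this robots.txt blocks from fetching probe_url."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return [bot for bot in AI_BOTS if not parser.can_fetch(bot, probe_url)]

# Example: a robots.txt that only blocks Anthropic's training bot.
print(blocked_ai_crawlers("User-agent: ClaudeBot\nDisallow: /\n"))  # ['ClaudeBot']
```

An empty list means every crawler in `AI_BOTS` can reach the probe URL under the rules you supplied.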
Frequently Asked Questions
What is GPTBot?
GPTBot is OpenAI's web crawler that indexes content for ChatGPT and other OpenAI products. OpenAI now operates a three-tier system: GPTBot (training), OAI-SearchBot (search indexing), and ChatGPT-User (real-time browsing). GPTBot is the most-blocked AI crawler via robots.txt — 62% of top news sites block it (BuzzStream, 2025).
Should I allow AI crawlers on my website?
For most businesses, yes. AI agent traffic grew over 6,900% year-over-year in 2025. Allowing AI crawlers means your site can appear in AI-generated answers from ChatGPT, Perplexity, and Google AI Overviews. BuzzStream found that publishers blocking AI crawlers experienced a 23.1% decline in total monthly visits — and blocking doesn't reliably prevent citations, since 70.6% of sites blocking ChatGPT-User still appeared in AI citations.
Does Squarespace block AI crawlers?
Yes, some Squarespace sites block AI crawlers by default in their robots.txt. Check your site's robots.txt to confirm, and contact Squarespace support if you need to modify it.
What AI crawlers should I allow?
The key AI crawlers to allow in 2026 are: GPTBot, OAI-SearchBot, and ChatGPT-User (OpenAI); ClaudeBot, Claude-SearchBot, and Claude-User (Anthropic); PerplexityBot and Perplexity-User (Perplexity); Google-Extended (Google AI features); and Amazonbot (Amazon/Alexa). Each AI company now uses multiple crawlers — blocking the training bot alone does not block the search or user-browsing bots.
How do I check if AI crawlers are blocked on my site?
Visit yoursite.com/robots.txt and look for Disallow rules targeting GPTBot, ClaudeBot, or PerplexityBot. Or use Foglift's free Website Audit — it automatically checks AI crawler access as part of the AI Visibility Score.
Check your AI crawler status
Instant scan. See which AI crawlers can access your site.
Generate your robots.txt
Use our free AI Robots.txt Generator to create an optimized robots.txt with the right AI crawler settings.
Sources & Further Reading
- BuzzStream, “Which News Sites Block AI Crawlers in 2025? [New Data],” 2025. 79% of top news sites block AI training bots; 70.6% of blocking sites still cited; 23.1% traffic decline for blockers.
- TechnologyChecker.io, “We Analyzed robots.txt Across Cloudflare's Network: Which AI Crawlers Get Blocked Most and Why,” Q1 2026. ClaudeBot doubled crawl rate Q3 2025 → Q1 2026.
- Search Engine Journal, “Cloudflare Report: Googlebot Tops AI Crawler Traffic,” 2025. 89.4% of AI crawler traffic is training/mixed; 2.2% real-time user queries.
- Search Engine Land, “Googlebot dominates web crawling in 2025 as AI bots surge,” 2025. GPTBot share decreased from 35.46% to 28.97% as blocking increased.
- Paul Calvano, “AI Bots and Robots.txt,” Aug 2025. Analysis of AI crawler blocking patterns across the web.
- Search Engine Journal, “Complete Crawler List For AI User-Agents,” Dec 2025. Verified user-agent list from real server logs with IP validation.
- Search Engine Journal, “Anthropic's Claude Bots Make Robots.txt Decisions More Granular,” 2026. Three-bot framework: ClaudeBot, Claude-SearchBot, Claude-User.
- ALM Corp, “ClaudeBot, Claude-User & Claude-SearchBot: Anthropic's Three-Bot Framework,” 2026. Each bot honors robots.txt independently.
- Apple Support, “About Applebot” (article 119829) and “Applebot model training and individual privacy rights” (article 120320), updated 2025. Applebot-Extended is an AI-training opt-out signal, not a distinct crawler; Applebot continues to index for Siri, Spotlight, and Apple search regardless.
- Meta for Developers, “Meta crawler documentation” (developers.facebook.com/docs/sharing/webmasters/crawler), 2024–2026. Meta-ExternalAgent launched July 2024 for training; Meta-ExternalFetcher handles user-initiated fetches and may bypass robots.txt when a user supplies an explicit URL.
- DuckDuckGo Help Pages, “Is DuckAssistBot related to DuckDuckGo?” 2025. DuckAssistBot/1.2 crawls on-demand for DuckAssist answers; robots.txt changes take effect within 72 hours; data is not used for model training.
- Common Crawl, “CCBot” (commoncrawl.org/ccbot), 2025. CCBot now runs on dedicated IP ranges with reverse DNS for verification; Common Crawl snapshots feed derivative training datasets (The Pile, RedPajama, C4), so blocks propagate slowly.
Related Reading
- How to Optimize Your Website for ChatGPT
- What Is Generative Engine Optimization (GEO)?
- GEO vs SEO: What's the Difference?
- How to Appear in AI-Generated Answers
- Free Website Audit: What to Check
Fundamentals: Learn about GEO (Generative Engine Optimization) and AEO (Answer Engine Optimization) — the two frameworks for optimizing your content for AI search engines.