
Which AI Crawlers Should You Allow in robots.txt?


Allow AI crawlers selectively. If public search visibility matters, usually allow search-related crawlers such as OAI-SearchBot, PerplexityBot, Claude-SearchBot and Googlebot on pages you want discovered. If content reuse or model-training exposure is the bigger concern, block or restrict training and dataset crawlers such as GPTBot, ClaudeBot, Google-Extended and CCBot. Treat ChatGPT-User, Claude-User and Perplexity-User as a separate user-triggered category, and never use robots.txt as protection for private content.

The Short Answer

The best default policy is not "allow every AI bot" and not "block every AI bot." It is a role-based policy:

  1. Allow search and citation crawlers on public pages that should be discoverable.
  2. Block or restrict training and dataset crawlers when content reuse or model-training exposure is the bigger risk.
  3. Treat user-triggered fetchers such as ChatGPT-User, Claude-User and Perplexity-User as their own category and monitor them separately.
  4. Never rely on robots.txt to protect content that must stay private.

That last point is not a footnote. robots.txt is a preference signal for compliant crawlers. It can reduce access by bots that honor it, but it is not authentication, authorization, paywall enforcement or legal consent management by itself. It also exposes the paths you list, so it is a poor place to advertise private URLs.

Decision rule: allow access where crawlability supports public discovery; disallow access where the main outcome is unwanted training or dataset reuse; use security controls for anything that must actually stay private.

Allowing a crawler only removes an access barrier. It does not guarantee AI citations, rankings, recommendations, traffic, answer inclusion or share of voice. If the business question is whether your brand appears in ChatGPT, Gemini, Perplexity or Google AI features, crawler access is only one technical input; you still need to track AI citations for your website.

Use A Policy Matrix First

Most bad robots.txt policies start as copied bot lists. A better policy starts with intent: what should this crawler be able to do, and what risk does that create for your site?

Use this matrix before editing the file.

| Crawler role | Examples | Usually allow when | Usually block or restrict when | What to verify |
| --- | --- | --- | --- | --- |
| Search or citation crawlers | OAI-SearchBot, PerplexityBot, Claude-SearchBot | Public pages should be eligible for AI search, cited answers or search-result quality systems. | The page is thin, outdated, legally sensitive, licensed in a way that limits reuse, or not meant to be discoverable. | Live robots.txt, status codes, CDN/WAF rules, source verification and whether important pages return usable HTML. |
| Training or model-development crawlers | GPTBot, ClaudeBot, Google-Extended, CCBot | The organization accepts future use of eligible public content for model improvement, broad datasets or related systems. | Content is licensed, premium, high-cost to produce, legally sensitive, or the business has no appetite for model-training exposure. | Whether the bot is truly training-related, whether the token is a real HTTP user agent, and whether old content may already exist elsewhere. |
| User-triggered fetchers | ChatGPT-User, Claude-User, Perplexity-User | Users should be able to ask an AI product to retrieve or summarize public URLs from your site. | User-directed retrieval creates abuse, load, compliance or paywall issues that need product and security review. | Logs by user agent, IP verification where available, WAF actions, rate limits and whether requests behave like single fetches rather than scheduled crawls. |
| Traditional search crawlers | Googlebot, Bingbot | The site wants normal organic search visibility and eligibility for search features built on the search index. | You intentionally want a URL out of search and have a safer indexing control in place. | Search crawler access, indexability, snippets, canonical tags, server response consistency and host-specific robots rules. |
| Unknown or spoofed bots | Generic AI user agents, browser-like scrapers, unverifiable claims | Rarely by default. Treat them as untrusted until verified. | The source cannot be verified, ignores rules, causes load or requests private-looking paths. | Reverse DNS, provider IP ranges, request patterns, ASN, WAF events and whether the claimed user agent matches a known provider. |

A compact starter policy might look like this for a site that wants AI search visibility but does not want future model-training crawls. Do not paste it over your existing file without checking your subdomains, sitemap, CMS rules and CDN behavior.

User-agent: OAI-SearchBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow:

The example is intentionally selective. It allows search-related crawlers while disallowing several training or dataset-oriented tokens. A publisher with a licensing strategy may choose differently. A SaaS site with mostly product and documentation pages may choose differently again. The important part is that each rule maps to a purpose.

Crawlers You Usually Allow For Visibility

Allow these crawlers only for public, useful pages that you are comfortable making discoverable. Do not allow a crawler because the user-agent name sounds important. Allow it because the page should be eligible to be found, cited or used as a search source. In practice, that means prioritizing public pages that AI search can cite, not every URL that happens to return a 200.

OpenAI: OAI-SearchBot Is Not GPTBot

For OpenAI, separate OAI-SearchBot, GPTBot and ChatGPT-User.

OAI-SearchBot is the crawler to evaluate when your goal is visibility in ChatGPT search features. If you block it, you are creating a direct access barrier for that search-related system. If you allow it, you are only making access possible; your page still needs to be useful, reachable, parseable and relevant.

GPTBot is different. It is associated with crawling content that may be used to train OpenAI generative AI foundation models. A site can allow OAI-SearchBot while disallowing GPTBot. That is often the cleanest distinction for teams that want public answer visibility without saying yes to training crawls.

OpenAI also uses ChatGPT-User for certain user actions. That is not the same as automatic search crawling, and it should not be used as proof that a page is appearing in ChatGPT search answers. OpenAI says its search-related crawling may take about 24 hours to adjust after a robots.txt change, so do not judge the policy from an immediate log sample.

Anthropic: Claude-SearchBot Is Not ClaudeBot

For Anthropic, the same split matters. Claude-SearchBot is search-related. ClaudeBot is tied to collecting public web content that could contribute to model training. Claude-User supports user-directed retrieval when someone asks Claude to access web content.

If your goal is visibility and accuracy in Claude search-style experiences, evaluate Claude-SearchBot separately from ClaudeBot. Blocking ClaudeBot for training reasons should not automatically mean blocking Claude-SearchBot for search-related discovery, unless your content policy requires a stricter stance.

Anthropic says its bots honor standard robots.txt directives and support the non-standard Crawl-delay extension where appropriate. That can be useful if the issue is load rather than consent. If the issue is sensitive content, use real access controls instead of crawl delay.
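To make the load-versus-consent distinction concrete, here is a minimal sketch for a site that accepts ClaudeBot crawling but wants to slow it down, while keeping Claude-SearchBot fully open. The 10-second value is an arbitrary placeholder, and Crawl-delay support should be confirmed against Anthropic's current documentation before you rely on it.

User-agent: Claude-SearchBot
Allow: /

User-agent: ClaudeBot
Crawl-delay: 10
Allow: /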

Perplexity: PerplexityBot Is The Search Crawler To Watch

PerplexityBot is the crawler to evaluate for Perplexity search and answer sourcing. It is designed to surface and link websites in Perplexity results, not to crawl content for AI foundation-model pretraining. If Perplexity visibility matters, important public pages should not be blocked in robots.txt, by the CDN, or by a WAF rule that returns 403, 429 or a challenge page.

Perplexity-User should be handled separately. It represents user-triggered activity, not the same scheduled crawl pattern as PerplexityBot. If you see it in logs, treat it as evidence of retrieval, not as proof of search index inclusion or citation.

Perplexity is also a place where WAF review matters. robots.txt can allow the bot on paper while the CDN or WAF still blocks the real request at the edge because of bot-score thresholds, geography, missing JavaScript, rate limits or a managed security rule. If you intentionally allow PerplexityBot, verify that the request source is legitimate and that your security layer returns usable content.
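One rough way to spot that kind of mismatch is to compare what a browser-style request and a crawler-style request receive from the edge. The sketch below uses Python's third-party requests library and an approximated PerplexityBot user-agent string (copy the current value from Perplexity's documentation); a spoofed user agent cannot reproduce the provider's real IP reputation or bot score, so treat the result as a first-pass signal, not proof.

import requests  # third-party: pip install requests

# Approximated strings; replace with the current values from provider docs and your own browser.
CRAWLER_UA = "Mozilla/5.0 (compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot)"
BROWSER_UA = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

def probe(url: str) -> None:
    """Fetch the same URL with two user agents and compare status code and body size."""
    for label, ua in (("browser", BROWSER_UA), ("crawler", CRAWLER_UA)):
        resp = requests.get(url, headers={"User-Agent": ua}, timeout=15)
        print(f"{label:8} status={resp.status_code} bytes={len(resp.content)}")

probe("https://example.com/pricing")  # hypothetical URL; test your own high-value pages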

Googlebot Controls Google Search Access

For Google AI features such as AI Overviews and AI Mode, the practical crawler decision is still about Google Search eligibility. Googlebot access matters because those features are part of Google Search and use Search systems. Blocking Googlebot to opt out of AI Overviews is a blunt move because it can affect normal Google Search crawling and eligibility.

Google-Extended is a separate control token in robots.txt, not a normal HTTP crawler user agent you should expect to see in logs. It is used to manage whether content Google has crawled may be used for Gemini-related model training and certain grounding uses outside normal Search. Google says Google-Extended does not affect Google Search. That makes it a good example of why crawler policy must separate Search visibility from model-training and AI product-use controls.
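In robots.txt terms, that separation can be as small as the sketch below: Googlebot keeps its normal Search access while the Google-Extended token opts the same content out of the uses it controls. Adapt it to your own rules rather than pasting it verbatim.

User-agent: Googlebot
Allow: /

User-agent: Google-Extended
Disallow: /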

Red flag: if a crawler is allowed in robots.txt but receives 403, 429, CAPTCHA, empty HTML or blocked assets from the CDN or WAF, it is not effectively allowed. The file and the live response must agree.

Crawlers You May Block For Training Or Dataset Use

Blocking training-oriented crawlers can be a reasonable policy, especially for publishers, documentation sites, research-heavy teams, paid-content businesses and any organization with licensing, contractual or legal constraints. The key is to block the right agents without accidentally harming search discovery.

Start with these common examples:

  1. GPTBot, associated with crawling content that may be used to train OpenAI models.
  2. ClaudeBot, tied to collecting public web content that could contribute to Anthropic model training.
  3. Google-Extended, the robots.txt token that controls Gemini-related training and certain grounding uses outside normal Search.
  4. CCBot, the Common Crawl crawler that feeds broad public datasets.

This is not a universal blocklist. It is a policy set to evaluate. A public-domain archive, open-source project or research site may intentionally allow dataset crawlers because broad reuse is aligned with its mission. A commercial publisher may block them because the content is expensive to produce and licensed under specific terms. A product site may choose a hybrid policy: allow documentation, block gated research, and review blog or comparison pages separately.

The decision should be made at page or section level when the site structure supports it. For example, a company might allow search crawlers on /blog/, /docs/ and /features/, but disallow training crawlers from /research/ or /reports/. Another site may block training crawlers globally because its content rights are too mixed to separate safely.
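A path-scoped version of that first example might look like the sketch below. The directory names mirror the hypothetical structure above; real rules need to match your actual paths, trailing slashes and CMS rewrites, and anything not listed stays crawlable for these agents by default.

User-agent: GPTBot
Disallow: /research/
Disallow: /reports/

User-agent: ClaudeBot
Disallow: /research/
Disallow: /reports/

User-agent: OAI-SearchBot
Allow: /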

Do not oversell what the block does. A robots.txt disallow can signal future crawl preferences to compliant bots. It does not delete old crawl snapshots, erase content from third-party sites, remove syndicated copies, override contractual terms, or guarantee that every AI system has never seen the page. If content has already been widely quoted, indexed, archived or republished, the practical risk review has to include those sources too.

Practical conclusion: blocking GPTBot, ClaudeBot, Google-Extended or CCBot can be sensible even while allowing OAI-SearchBot, Claude-SearchBot, PerplexityBot and Googlebot. That is not inconsistent. It is the point of a selective policy.

User-Triggered Fetchers Are Different

User-triggered fetchers are easy to misread because they look like AI traffic but do not mean the same thing as scheduled crawling. ChatGPT-User, Claude-User and Perplexity-User may appear when a person asks an AI product to retrieve, summarize or use a URL. That user action changes the access question.

Do not count these requests as AI search indexing. A single ChatGPT-User request to a pricing page may mean someone pasted the URL into ChatGPT. A Claude-User request may mean someone asked Claude to inspect a public page. A Perplexity-User request may reflect a user-directed action rather than a normal index-building crawl.

The policy question is practical:

  1. Should users be able to ask AI tools to retrieve this public page?
  2. Does the page include content that is public but contractually sensitive?
  3. Would user-triggered retrieval bypass a commercial flow, login expectation or paywall?
  4. Does the fetcher cause server load or trigger security rules?
  5. Can the provider and request source be verified well enough for your risk model?

For many public marketing, documentation and support pages, monitoring is enough. For paywalled, customer-specific, trial-only, staging or legally sensitive content, do not rely on robots.txt. Require authentication and enforce access at the application, CDN or network layer.

Also expect provider differences. OpenAI says ChatGPT-User is initiated by a user and robots.txt rules may not apply in the same way. Perplexity documents Perplexity-User as a user-requested fetcher that generally ignores robots.txt. Anthropic documents Claude-User as a separate controllable bot for user-initiated retrieval. The safe operational habit is to classify these agents separately in logs, review current provider documentation before making policy changes, and avoid using one provider's behavior as a rule for every AI tool.
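A minimal way to keep those categories separate is to bucket requests by user-agent substring when reviewing access logs. The sketch below assumes a combined-format log where the user agent is the last quoted field, and it uses plain substring matching, which is only a first-pass heuristic because user agents can be spoofed; pair it with IP or reverse-DNS verification before drawing conclusions.

import re
from collections import Counter
from pathlib import Path

# Substring -> category map; extend it as providers add or rename bots.
CATEGORIES = {
    "OAI-SearchBot": "search", "Claude-SearchBot": "search", "PerplexityBot": "search",
    "Googlebot": "search", "bingbot": "search",
    "GPTBot": "training", "ClaudeBot": "training", "CCBot": "training",
    "ChatGPT-User": "user-triggered", "Claude-User": "user-triggered",
    "Perplexity-User": "user-triggered",
}

UA_PATTERN = re.compile(r'"([^"]*)"\s*$')  # assumes the UA is the final quoted field

def classify(user_agent: str) -> str:
    for token, category in CATEGORIES.items():
        if token.lower() in user_agent.lower():
            return category
    return "other"

counts = Counter()
for line in Path("access.log").read_text(errors="replace").splitlines():
    match = UA_PATTERN.search(line)
    if match:
        counts[classify(match.group(1))] += 1

print(counts.most_common())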

Robots.txt Rules That Backfire

The riskiest robots.txt mistakes are not subtle. They usually come from copying a large list, editing the wrong host, or treating a crawl preference file like a security system.

Red Flags To Fix Before You Ship

  1. Private or sensitive paths listed in a file that anyone can read.
  2. A crawler allowed in robots.txt but blocked at the CDN or WAF with 403, 429, CAPTCHA or empty HTML.
  3. Bot lists copied from another site without mapping each user agent to a purpose.
  4. Rules edited on one host while other subdomains serve a different or outdated file.
  5. Training-crawler blocks written so broadly that they also cut off search crawlers and harm normal discovery.

The private-path issue deserves special attention. A robots.txt file is public. If you list sensitive directories there, you may make them easier to discover. If a URL must not be seen, put it behind authentication, remove public links to it, use appropriate indexing controls where applicable, and make sure the server refuses unauthorized access.

Syntax also matters. Robots rules are simple, but simple files still break. Mixed casing, duplicated groups, accidental whitespace, generated comments, plugin overrides and environment-specific templates can produce different behavior than expected. Always test the live file, not the file you think is deployed.
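One low-effort way to test the live file is Python's built-in robots.txt parser, sketched below against hypothetical hosts and a hypothetical path. Stdlib parsing can differ in small ways from how individual crawlers interpret the rules, so treat the output as a sanity check, not a guarantee.

from urllib import robotparser

HOSTS = ["https://example.com", "https://www.example.com", "https://docs.example.com"]  # hypothetical hosts
AGENTS = ["OAI-SearchBot", "GPTBot", "PerplexityBot", "ClaudeBot", "Googlebot"]
TEST_PATH = "/blog/some-article"  # replace with a real high-value URL

for host in HOSTS:
    parser = robotparser.RobotFileParser()
    parser.set_url(f"{host}/robots.txt")
    parser.read()  # fetches the live file; network errors surface here
    for agent in AGENTS:
        verdict = "allow" if parser.can_fetch(agent, f"{host}{TEST_PATH}") else "disallow"
        print(f"{host:32} {agent:16} {verdict}")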

How To Verify The Policy

Do not stop after editing robots.txt. A policy is only useful when the live site, security stack and logs agree with it.

Use this verification sequence:

  1. Open the live robots.txt for every relevant host and protocol. Check the apex domain, www, blog, docs, help center, app subdomain, staging host and any CDN hostname that can serve public content.
  2. Confirm the file is publicly reachable with the expected status code and content. Do not rely only on a CMS setting or a repository file.
  3. Check the exact user-agent groups for OAI-SearchBot, GPTBot, ChatGPT-User, ClaudeBot, Claude-SearchBot, Claude-User, PerplexityBot, Perplexity-User, Googlebot, Google-Extended and CCBot where relevant.
  4. Test representative public URLs from each important section: homepage, product pages, category pages, documentation, blog, pricing, support and any high-value answer pages.
  5. Review server, CDN and WAF logs for user agent, timestamp, host, path, status code, source IP, cache status, WAF action and response size.
  6. Verify bot identity with provider-published IP ranges, reverse DNS or documented verification methods where available. Do not trust the user-agent string alone.
  7. Look for mismatches: 200 for regular browsers but 403, 429, CAPTCHA, timeout, blank HTML or blocked assets for the crawler.
  8. Recheck after caches and crawler systems have had time to refresh. OpenAI says search-related robots.txt changes can take about 24 hours to adjust; other systems may refresh on their own schedules.
  9. Separate access evidence from visibility evidence. Logs can show that a crawler fetched a URL. They do not show whether the page was cited, recommended, ranked or shown in an answer.

For Google, use Search Console and server logs together. For Googlebot verification, use Google's published verification methods rather than trusting the label in the request header. For OpenAI, Anthropic, Perplexity and Common Crawl, use the provider's current published ranges or reverse-DNS guidance where available, then keep those checks updated. For Perplexity in particular, WAF rules should combine user-agent matching with source verification rather than one signal alone. IP ranges and bot names can change.
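For the reverse-DNS style of verification, a minimal sketch is: resolve the requesting IP to a hostname, check that the hostname belongs to an expected provider domain, then resolve that hostname forward and confirm it maps back to the same IP. The suffixes below are illustrative examples; confirm the current values, and whether a provider publishes IP ranges instead, in each provider's documentation.

import socket

# Example suffixes only; verify current values in each provider's documentation.
EXPECTED_SUFFIXES = {
    "Googlebot": (".googlebot.com", ".google.com"),
}

def verify_by_reverse_dns(ip: str, crawler: str) -> bool:
    """Return True if reverse and forward DNS both point at an expected provider domain."""
    suffixes = EXPECTED_SUFFIXES.get(crawler)
    if not suffixes:
        return False  # no rDNS rule configured; fall back to published IP-range checks
    try:
        hostname, _, _ = socket.gethostbyaddr(ip)                # reverse lookup
        if not hostname.endswith(suffixes):
            return False
        _, _, forward_ips = socket.gethostbyname_ex(hostname)    # forward lookup
        return ip in forward_ips
    except (socket.herror, socket.gaierror):
        return False

print(verify_by_reverse_dns("66.249.66.1", "Googlebot"))  # replace with an IP from your own logs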

The final step is a business check. If you allowed search crawlers because you want AI visibility, monitor whether the pages are actually appearing as sources, citations or brand references in the answer surfaces your audience uses. If you blocked training crawlers because of content risk, confirm that the policy is documented internally so future CMS, CDN or migration work does not silently undo it.

The Bottom Line

Use robots.txt as a crawler governance file, not as a magic AI visibility switch. The clean policy is selective: allow search-related crawlers for public pages that should be discoverable, restrict training and dataset crawlers when content reuse risk is unacceptable, and keep user-triggered fetchers in their own monitoring bucket.

For most sites, that means allowing OAI-SearchBot, PerplexityBot, Claude-SearchBot and Googlebot where public discovery matters. It often means blocking or restricting GPTBot, ClaudeBot, Google-Extended and CCBot when the organization does not want eligible public content used for future model training or broad datasets. It also means treating ChatGPT-User, Claude-User and Perplexity-User as retrieval events, not as ordinary crawlers.

Revisit the policy whenever providers rename bots, your content rights change, the site moves behind a new CDN or WAF, or the business changes its AI visibility strategy. If the real question is whether your brand is cited, recommended or visible in AI answers, pair crawler access checks with recurring AI rank, citation, competitor and brand visibility monitoring.

Frequently Asked Questions

Should I allow GPTBot in robots.txt?
Allow GPTBot only if you are comfortable with eligible public content being crawled for possible use in OpenAI model training. If your goal is ChatGPT search visibility, the more relevant crawler is OAI-SearchBot. Many sites reasonably allow OAI-SearchBot while disallowing GPTBot.
What is the difference between GPTBot and OAI-SearchBot?
OAI-SearchBot is OpenAI's search-related crawler for surfacing websites in ChatGPT search features. GPTBot is associated with crawling content that may be used for training OpenAI generative AI foundation models. They should be handled as separate robots.txt decisions.
Does blocking AI crawlers stop my content from appearing in AI answers?
Not completely. Blocking a compliant crawler can reduce future access through that crawler, but it does not remove content from historical datasets, third-party sources, search indexes or user-provided context. It also does not control every AI product in the same way.
Can robots.txt protect private content from AI crawlers?
No. robots.txt is a crawl preference file for compliant bots, not an access-control system. Protect private, paid, staging, admin and customer content with authentication, authorization, network controls and appropriate indexing directives.
