Robots.txt Checker & Validator — Free Online Tool

Free robots.txt checker, validator, and tester. Validate syntax, find crawl blockers, and confirm AI crawlers (GPTBot, ClaudeBot, PerplexityBot) can reach your site. Built by Ram Lakhan, Delhi SEO specialist.

Updated May 2026

Key statistics

Pushing a staging robots.txt with Disallow: / to production can wipe out 80% of a typical site's organic traffic within two weeks.

Source · Ahrefs Technical SEO Audit Report, 2025

62% of enterprise sites audited in 2025 had at least one robots.txt rule unintentionally blocking a section of indexable pages.

Source · Botify Enterprise SEO Study, 2025

Sites that explicitly allow GPTBot, ClaudeBot, and PerplexityBot are cited 2.4x more often in generative engine outputs than sites that block or omit those user-agents.

Source · Profound AI Citation Study, Q1 2026

Chapter About this tool

What it does and why it matters.

A robots.txt checker fetches your site's robots.txt file from the domain root, parses every User-agent, Allow, Disallow, and Sitemap directive, and flags syntax errors, accidental crawl blocks, and missing AI-crawler permissions. The most common mistake is a misplaced Disallow: / that hides the whole site from Google; pages can start dropping out of the index within 48-72 hours. Use the validator above to check any URL, or follow the step-by-step guide below to write a robots.txt from scratch.

What this free robots.txt validator checks

Enter any public URL above and the tool fetches the robots.txt file at /robots.txt on that domain. It then runs eight validation passes: (1) file accessibility — is the file returning HTTP 200 and served as text/plain; (2) UTF-8 encoding compliance per the RFC 9309 robots specification; (3) directive syntax — every Allow/Disallow line has a valid path; (4) User-agent grouping — no orphan rules; (5) wildcard correctness — * and $ used per Google's parser rules; (6) sitemap discoverability — at least one Sitemap directive present and reachable; (7) AI crawler access — GPTBot, ClaudeBot, PerplexityBot, Google-Extended, CCBot explicitly handled; (8) accidental blocks on CSS, JavaScript, or image files Google needs for rendering.
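
A few of these passes can be approximated in a short script. The following is a minimal sketch in Python, not the checker's actual implementation: the function name lint_robots and the wording of its messages are assumptions made for illustration, and it covers only the grouping, path-syntax, and Sitemap passes.

```python
# Illustrative linter; names and messages are hypothetical, not the tool's own.
KNOWN_DIRECTIVES = {"user-agent", "allow", "disallow", "sitemap", "crawl-delay"}


def lint_robots(text: str) -> list[str]:
    """Return human-readable problems found in a robots.txt body."""
    problems = []
    seen_user_agent = False
    has_sitemap = False
    for number, raw in enumerate(text.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        if ":" not in line:
            problems.append(f"line {number}: missing ':' separator")
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field not in KNOWN_DIRECTIVES:
            problems.append(f"line {number}: unknown directive '{field}'")
        elif field == "user-agent":
            seen_user_agent = True
        elif field in ("allow", "disallow"):
            if not seen_user_agent:
                # An orphan rule: it belongs to no User-agent group.
                problems.append(f"line {number}: rule appears before any User-agent")
            if value and not value.startswith(("/", "*")):
                problems.append(f"line {number}: path should start with '/' or '*'")
        elif field == "sitemap":
            has_sitemap = True
            if not value.startswith(("http://", "https://")):
                problems.append(f"line {number}: Sitemap must be an absolute URL")
    if not has_sitemap:
        problems.append("no Sitemap directive found")
    return problems
```

Calling lint_robots on the text of a fetched file returns an empty list when none of these three passes finds a problem. Accessibility, encoding, and AI-crawler checks would need separate logic.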

Why robots.txt validation is critical for SEO in 2026

Google's crawl budget is finite. Every wasted crawl on an admin URL, faceted search parameter, or staging environment is a crawl you do not get on a real revenue page. A clean robots.txt directs that budget to the URLs that matter. Get it wrong and you see one of three failure modes: (a) full deindex, where a stray Disallow: / removes the entire site within days; (b) partial blackout, where an overly broad wildcard hides a section like /blog/ or /products/; (c) silent quality drop, where Google cannot render your pages because CSS or JavaScript is blocked, which hurts how it evaluates the page and pushes you down the SERPs.

For Delhi-area B2B and SaaS clients I work with through SEO services in Delhi, the robots.txt is the first file I audit in every engagement. I have rescued sites that lost 60-80% of their organic traffic from a single deployment that pushed a staging-environment robots.txt to production. Validation takes thirty seconds and prevents months of recovery work.

The eight robots.txt rules every site should follow

(1) Place the file at the root of every host. example.com/robots.txt, blog.example.com/robots.txt, and m.example.com/robots.txt are three separate files for three separate hosts; subdomains do not inherit. (2) Keep it under 500 KiB. Google ignores everything past that size. (3) Use absolute URLs in Sitemap directives. Sitemap: https://example.com/sitemap.xml, not /sitemap.xml. (4) Never block CSS, JS, or images Google needs to render the page. Modern Googlebot needs the same files a real browser needs. Block them and Google sees a broken page.

(5) Use the most specific User-agent rule that applies. When Google sees both User-agent: * and User-agent: Googlebot, only the Googlebot group applies to Googlebot; the rules under * are ignored for that crawler. (6) Test wildcards carefully. Disallow: /*.pdf$ blocks every URL that ends in .pdf. Disallow: /*.pdf (no $) blocks every URL containing .pdf anywhere in the path or query string, which is usually not what you want. (7) Do not use robots.txt to hide private content. Disallow only prevents crawling, not indexing: Google can still index a URL it learns about from external links. Use noindex meta tags or HTTP auth for genuinely private pages. (8) Explicitly allow AI crawlers. GPTBot, ClaudeBot, PerplexityBot, and CCBot are the crawlers behind ChatGPT, Claude, Perplexity, and the Common Crawl dataset most LLMs train on, and Google-Extended is the token that controls whether your content is used for Gemini. Blocking them can remove your brand from generative engine outputs.
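
The wildcard behavior in rule (6) is easy to test before shipping a rule. Here is a minimal sketch, assuming the common interpretation where * matches any run of characters and a trailing $ anchors the end of the URL; rule_matches is a hypothetical helper, not part of any library or of Google's parser.

```python
import re


def rule_matches(pattern: str, path: str) -> bool:
    """Match a robots.txt path pattern against a URL path (plus query string)."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape each literal piece, then join the pieces with a regex wildcard.
    regex = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    if anchored:
        regex += "$"
    # re.match anchors at the start, so an un-anchored rule acts as a prefix match.
    return re.match(regex, path) is not None


print(rule_matches("/*.pdf$", "/files/report.pdf"))                  # True
print(rule_matches("/*.pdf$", "/docs/guide.pdf-archive/index.html"))  # False
print(rule_matches("/*.pdf", "/docs/guide.pdf-archive/index.html"))   # True
```

The last two calls show the difference the $ makes: without it, any URL that merely contains .pdf is caught.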

How robots.txt affects AI search and GEO in 2026

Generative engines crawl the web with named bots. GPTBot feeds ChatGPT. ClaudeBot feeds Claude. PerplexityBot feeds Perplexity. Google-Extended controls whether your content trains future Gemini models — independent of regular Googlebot crawling for search. CCBot powers Common Crawl, the dataset used by most open-source LLMs.

The robots.txt is the primary opt-out signal these crawlers respect. If you block them, you remove your brand from the corpus generative engines pull from when synthesizing answers, which is the foundation of Generative Engine Optimization (GEO). Some sites block them deliberately to prevent training data harvesting; that is a defensible business decision, but it comes with a hard tradeoff: little or no presence in AI engine outputs. For a brand investing in AEO and GEO, explicit Allow rules for these bots are non-negotiable. See our full SEO vs AEO vs GEO comparison for the strategic context.
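
To see how a given file treats these bots, one option is Python's standard urllib.robotparser module. This is a rough sketch under stated assumptions: the function name ai_crawler_access and the example.com URL are placeholders, and robotparser uses simple prefix matching, so it will not reproduce Google-style wildcard rules.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens named in this article.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended", "CCBot"]


def ai_crawler_access(robots_txt: str, url: str = "https://example.com/") -> dict[str, bool]:
    """Report, per AI crawler token, whether the given URL may be fetched."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no network call
    return {bot: parser.can_fetch(bot, url) for bot in AI_CRAWLERS}


sample = "User-agent: GPTBot\nDisallow: /\n\nUser-agent: *\nDisallow: /admin/\n"
print(ai_crawler_access(sample))
```

In this sample only GPTBot is blocked; the other tokens fall through to the * group, which leaves the home page crawlable.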

Common robots.txt mistakes that destroy rankings

Mistake one: pushing the staging robots.txt to production. Staging environments typically have User-agent: * followed by Disallow: / to keep the site out of Google. Forget to swap it before launch and the production site goes dark. I have seen this kill 80% of organic traffic in two weeks. Always verify the live robots.txt within minutes of any deployment.
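
A post-deployment check for this failure can be automated. The sketch below is one possible approach, not a standard tool: blocks_entire_site is a hypothetical helper that looks for a bare Disallow: / inside a group naming the given user-agent, and it deliberately ignores any Allow overrides in the same group.

```python
def blocks_entire_site(robots_txt: str, agent: str = "*") -> bool:
    """True if a group naming `agent` contains a bare 'Disallow: /'."""
    current_agents: list[str] = []
    in_rules = False  # True once the current group has started listing rules
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if in_rules:  # a User-agent line after rules starts a new group
                current_agents, in_rules = [], False
            current_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            in_rules = True
            if field == "disallow" and value == "/" and agent.lower() in current_agents:
                return True
    return False


staging = "User-agent: *\nDisallow: /\n"
production = "User-agent: *\nDisallow: /admin/\n"
print(blocks_entire_site(staging), blocks_entire_site(production))  # True False
```

Run against the live file right after a deploy, a True result is the signal to roll back before crawlers pick up the change.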

Mistake two: blocking faceted parameters too aggressively. Ecommerce sites often try to block filtered URLs like ?color=red&size=large using Disallow: /*?*, which also blocks every UTM-tagged URL, search result page, and paginated URL that carries a query string. Use targeted parameter blocks instead, or better, use canonical tags; Google retired the URL Parameters tool in Search Console in 2022, so parameter handling there is no longer an option. Mistake three: blocking the sitemap. Listing Disallow: /sitemap.xml is more common than it sounds. Always validate that the sitemap URL itself is allowed. Mistake four: case mismatches. Disallow: /Admin/ does not block /admin/ because robots.txt path rules are case-sensitive.
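
Mistakes two and four can both be reproduced with a few sample URLs. The following sketch assumes Google-style matching (* as a wildcard, trailing $ as an end anchor); the helper blocked, the sample URLs, and the targeted patterns /*?*color= and /*?*size= are illustrative choices, not rules taken from a real site.

```python
import re


def blocked(pattern: str, url_path: str) -> bool:
    """Google-style match: '*' is a wildcard, a trailing '$' anchors the end."""
    body = pattern[:-1] if pattern.endswith("$") else pattern
    regex = ".*".join(map(re.escape, body.split("*")))
    regex += "$" if pattern.endswith("$") else ""
    return re.match(regex, url_path) is not None


URLS = [
    "/shoes?color=red&size=large",     # faceted URL we do want to block
    "/pricing?utm_source=newsletter",  # campaign landing page
    "/blog?page=2",                    # pagination
]

# The blanket rule catches every URL that has a query string.
blanket = [u for u in URLS if blocked("/*?*", u)]
# Targeted rules catch only the facet parameters.
targeted = [u for u in URLS if blocked("/*?*color=", u) or blocked("/*?*size=", u)]

print(len(blanket), len(targeted))           # 3 1
print(blocked("/Admin/", "/admin/"))         # False: path matching is case-sensitive
```

The blanket pattern blocks all three URLs, while the targeted patterns block only the faceted one.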

Mistake five: relying on robots.txt for security. The file is publicly readable. Listing /secret-admin/ in Disallow advertises its existence to anyone who reads your robots.txt. Use HTTP authentication or IP allowlisting for sensitive areas. Mistake six: relying on Crawl-delay. Google ignores the Crawl-delay directive entirely. Bing and some other bots respect it, but support is inconsistent. If you have server-load issues, use each engine's webmaster tools where a crawl-rate setting is available rather than depending on robots.txt.
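
To find out whether a file sets Crawl-delay at all, and for which bots, Python's urllib.robotparser exposes a crawl_delay method. A small sketch, with crawl_delays as a hypothetical wrapper name and the sample file invented for illustration:

```python
from typing import Optional
from urllib.robotparser import RobotFileParser


def crawl_delays(robots_txt: str, agents: list[str]) -> dict[str, Optional[int]]:
    """Return the Crawl-delay value that applies to each agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.crawl_delay(agent) for agent in agents}


sample = "User-agent: bingbot\nCrawl-delay: 10\n\nUser-agent: *\nDisallow: /admin/\n"
print(crawl_delays(sample, ["bingbot", "Googlebot"]))
```

A value reported for Googlebot would only tell you the directive is present in the file; as noted above, Google does not act on it.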

Robots.txt vs noindex vs canonical: when to use which

The three controls solve different problems and confusing them is the most common technical-SEO mistake I see. Robots.txt controls crawling. A URL blocked in robots.txt can still appear in Google's index if linked from elsewhere, but Google will usually show it without a description. The noindex meta tag controls indexing. The URL must be crawlable for Google to read the noindex directive, so noindex on a robots.txt-blocked URL is functionally useless. Canonical tags control which version Google indexes when duplicate content exists. They are hints, not directives.

Rule of thumb: use robots.txt to save crawl budget on URLs Google should not bother fetching at all (admin pages, large media files, parameterized search results). Use noindex meta tags or X-Robots-Tag HTTP headers for URLs Google should crawl but not show in results (thank-you pages, internal search results, low-value tag archives). Use canonical tags for duplicate or near-duplicate content (paginated pages, sort orders, mobile/desktop variants).
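
The conflict between Disallow and noindex can be detected mechanically. The sketch below uses only the Python standard library; noindex_is_hidden and the helper class are hypothetical names, the check covers the meta tag only (not the X-Robots-Tag header), and robotparser's prefix matching stands in for a real crawler's rule evaluation.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class _RobotsMetaFinder(HTMLParser):
    """Collects the content of <meta name="robots"> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: list[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "meta" and (attributes.get("name") or "").lower() == "robots":
            self.directives.append((attributes.get("content") or "").lower())


def noindex_is_hidden(robots_txt: str, url: str, html: str, agent: str = "Googlebot") -> bool:
    """True when the page carries noindex but robots.txt stops the crawler from reading it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    finder = _RobotsMetaFinder()
    finder.feed(html)
    has_noindex = any("noindex" in directive for directive in finder.directives)
    return has_noindex and not parser.can_fetch(agent, url)


robots = "User-agent: *\nDisallow: /thank-you/\n"
page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
print(noindex_is_hidden(robots, "https://example.com/thank-you/", page))  # True
```

A True result means the noindex tag can never take effect: either lift the Disallow so the tag can be read, or drop the tag and rely on the crawl block alone.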

How to fix a robots.txt blocking your site

Step one: confirm the block. Open your live /robots.txt URL in an incognito window and check that the file actually loads and shows the rules you expect. Step two: use Google's robots.txt report in Search Console to see what Googlebot last fetched. Step three: identify the offending rule, usually a stray Disallow: / or an overly broad wildcard. Step four: ship the fix and verify the live file matches what you intended. Step five: submit affected URLs through GSC's URL Inspection tool with "Request Indexing" to accelerate recovery. Recovery typically takes 7-21 days depending on how many URLs were blocked and your site's authority.

Combine this validator with the rest of the technical-SEO toolset: the schema generator to ship structured data once pages are crawlable, the llms.txt checker to verify your AI-engine manifest is reachable, and the page speed analyzer to confirm Core Web Vitals are not blocking rendering. For deeper context, read the long-form robots.txt optimization guide and the related crawl budget guide.

Robots.txt is the cheapest and highest-leverage piece of technical SEO you will ever do. Five minutes of attention prevents the failure mode that has cost more clients their rankings than any other single mistake — a staging robots.txt accidentally pushed to production. In 2026 it also decides whether your brand is allowed into the training data and live retrieval indexes of every major AI engine. Treat it as production code, not a config afterthought.
Ram Lakhan · SEO Specialist, New Delhi

Chapter Frequently asked

Robots.txt Checker & Validator — Free Online Tool: questions

What is a robots.txt file?

A robots.txt file is a plain-text file placed at the root of a domain (example.com/robots.txt) that tells search engine crawlers and AI bots which URLs on the site they can fetch. It uses the Robots Exclusion Protocol defined in RFC 9309 and supports four main directives: User-agent (which bot the rule applies to), Allow (paths the bot can crawl), Disallow (paths the bot must not crawl), and Sitemap (where to find the XML sitemap). It controls crawling, not indexing — a URL blocked in robots.txt can still appear in Google search results without a description if linked from elsewhere.

How do I check whether my robots.txt is working?

Run three checks. First, open https://yourdomain.com/robots.txt directly in a browser and confirm the file loads with HTTP 200 status and the content matches what you intended. Second, paste your URL into the validator above to scan for syntax errors, accidental blocks on CSS/JavaScript, and missing AI crawler permissions. Third, open Google Search Console, navigate to Settings, then robots.txt report — Google shows the last fetched version and any parse errors. If all three checks pass and your sitemap directive resolves, your robots.txt is functioning correctly.
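
The status code in the first check matters because RFC 9309 tells crawlers to treat failures differently from a successful fetch. The mapping below is a simplified sketch of that guidance; the function name and the returned labels are made up for illustration, and individual crawlers may deviate from it.

```python
def crawl_policy_for_status(status: int) -> str:
    """Map the HTTP status of a robots.txt fetch to a coarse crawler behavior."""
    if 200 <= status < 300:
        return "parse"         # read and obey the rules in the body
    if 300 <= status < 400:
        return "follow-redirect"
    if 400 <= status < 500:
        return "allow-all"     # RFC 9309 "unavailable": the crawler may fetch any URL
    if 500 <= status < 600:
        return "disallow-all"  # RFC 9309 "unreachable": assume complete disallow
    return "unknown"


for code in (200, 404, 503):
    print(code, crawl_policy_for_status(code))
```

The practical consequence is that a missing file (404) opens the whole site to crawling, while a server error (5xx) on /robots.txt can pause crawling of the whole site.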

Can a robots.txt mistake deindex my site?

Yes, very quickly. A single line, Disallow: / under User-agent: *, blocks the entire site from being crawled. Pages can start dropping from Google's index within 48-72 hours, and most disappear within two weeks. The most common cause is pushing a staging environment robots.txt to production without swapping the directive. The fix is to deploy a corrected robots.txt and request reindexing through Google Search Console for affected URLs. Full recovery typically takes 7-21 days depending on site authority and how many URLs were affected.

Should I allow AI crawlers like GPTBot, ClaudeBot, and PerplexityBot?

For most brands investing in AI search visibility, allow them explicitly. GPTBot feeds ChatGPT, ClaudeBot feeds Claude, PerplexityBot feeds Perplexity, Google-Extended controls inclusion in Gemini training data, and CCBot powers Common Crawl which most open-source LLMs train on. Sites that allow these bots are cited in generative engine outputs 2.4x more often than sites that block them, according to Profound's Q1 2026 AI Citation Study. If you have a defensible reason to protect content from training (proprietary research, paid content) you can block selectively, but understand the tradeoff: zero presence in AI engine outputs for that content.

What is the difference between robots.txt Disallow and noindex?

Robots.txt Disallow prevents crawling. Noindex meta tag prevents indexing. The distinction matters because they solve different problems and combining them incorrectly creates bugs. Disallow tells the bot not to fetch the URL at all — but the URL can still appear in search results if Google learns about it from external links, just without a description. Noindex tells the bot it can fetch the URL but must not include it in search results — which requires the bot to actually fetch the page to see the directive. So putting noindex on a robots.txt-blocked URL does nothing, because Google never reads the noindex tag. Use Disallow to save crawl budget on URLs Google should never visit. Use noindex on URLs Google should visit but not index, like thank-you pages or low-quality archives.

How often does Google refetch robots.txt?

Google typically refetches robots.txt every 24 hours, but the exact frequency depends on your site's crawl rate and Google's caching policy. After deploying a fix, Google may use a cached version of the old file for up to a day. To force an immediate refetch, open Google Search Console, go to Settings, then robots.txt report, and click 'Request a recrawl' for your robots.txt URL. This is standard procedure after any robots.txt change you want to take effect quickly — for example, after fixing an accidental site-wide block.

What does the Sitemap directive do?

The Sitemap directive tells crawlers where to find your XML sitemap so they can discover all the URLs you want indexed. It uses an absolute URL: 'Sitemap: https://yourdomain.com/sitemap.xml'. You can include multiple Sitemap lines for sites with several sitemap files. The Sitemap directive is independent of User-agent groups: it applies to all crawlers and can be placed anywhere in the file. Including it is a best practice that helps both classic search engines and AI crawlers discover content faster. Google, Bing, and Yandex support this directive, and other crawlers that follow the sitemaps.org convention can use it as well.
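
Extracting the Sitemap lines and checking that each one is absolute takes a few lines with the standard library. This sketch assumes Python 3.8 or later, where RobotFileParser.site_maps() is available; sitemap_report is a hypothetical helper name.

```python
from urllib.parse import urlsplit
from urllib.robotparser import RobotFileParser


def sitemap_report(robots_txt: str) -> dict[str, bool]:
    """Map each declared sitemap URL to True if it is an absolute http(s) URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    sitemaps = parser.site_maps() or []  # site_maps() returns None when no line exists
    report = {}
    for url in sitemaps:
        parts = urlsplit(url)
        report[url] = parts.scheme in ("http", "https") and bool(parts.netloc)
    return report


sample = (
    "Sitemap: https://example.com/sitemap.xml\n"
    "Sitemap: /news-sitemap.xml\n"
    "User-agent: *\n"
    "Disallow:\n"
)
print(sitemap_report(sample))
```

The relative /news-sitemap.xml entry is reported as False. Checking that each URL actually resolves would require a network request, which this sketch leaves out.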

Why is my robots.txt blocking CSS or JavaScript?

It is usually accidental, caused by overly broad Disallow patterns. Common culprits: Disallow: /wp-content/ on WordPress sites (blocks themes and plugins Google needs), Disallow: /static/ on Jamstack sites (blocks all build assets), and Disallow: /*.js$, written to block specific scripts but matching every URL that ends in .js. Modern Googlebot renders pages like a browser. If CSS or JavaScript is blocked, Google sees a broken layout, which can hurt how it evaluates the page and reduce rankings. Use Google Search Console's URL Inspection tool with the 'Test live URL' option to see exactly what Google can and cannot render on any page.
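
For prefix rules such as /wp-content/ or /static/, a quick audit of a page's asset URLs is possible with urllib.robotparser. This is a sketch under assumptions: blocked_assets and the asset URLs are illustrative, and because robotparser does not implement * or $ wildcards, a rule like /*.js$ needs a wildcard-aware matcher instead.

```python
from urllib.robotparser import RobotFileParser


def blocked_assets(robots_txt: str, asset_urls: list[str], agent: str = "Googlebot") -> list[str]:
    """Return the asset URLs the given agent is not allowed to fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in asset_urls if not parser.can_fetch(agent, url)]


robots = "User-agent: *\nDisallow: /wp-content/\n"
assets = [
    "https://example.com/wp-content/themes/site/style.css",
    "https://example.com/assets/app.js",
]
print(blocked_assets(robots, assets))
```

Here the theme stylesheet is reported as blocked while /assets/app.js stays reachable.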

Do subdomains share the main domain's robots.txt?

No. Each subdomain has its own robots.txt file. blog.example.com/robots.txt is a completely separate file from example.com/robots.txt and from m.example.com/robots.txt. Subdomains do not inherit rules from the apex domain. This is a frequent source of confusion — teams add Disallow rules to the main site's robots.txt expecting them to cover a blog subdomain, then wonder why the blog is still being crawled. Each host needs its own properly configured robots.txt.
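
Since the file is resolved per host, the robots.txt that governs a page can be derived from the page URL itself. A tiny sketch, with robots_url as a hypothetical helper name:

```python
from urllib.parse import urlsplit


def robots_url(page_url: str) -> str:
    """Return the robots.txt location that governs the given page URL."""
    parts = urlsplit(page_url)
    # Scheme and host (including any subdomain and port) identify the file.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"


print(robots_url("https://blog.example.com/posts/hello?x=1"))
print(robots_url("https://example.com/pricing"))
```

The two calls return different files, https://blog.example.com/robots.txt and https://example.com/robots.txt, which is why rules on the main site never reach the blog.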

How large can a robots.txt file be?

Google parses up to 500 kibibytes (KiB) of any robots.txt file. Everything beyond that limit is ignored. For most sites this is more than enough — a typical robots.txt is under 5 KiB. If your file is approaching 500 KiB, you likely have a problem worth solving structurally: instead of listing thousands of individual URLs, use pattern-based wildcards or move some directives to URL-level controls like noindex meta tags. Other crawlers (Bing, Yandex, AI bots) have similar or stricter limits.
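
Checking a file against the limit is a matter of counting encoded bytes rather than characters. A minimal sketch, assuming the file is UTF-8; size_report and the constant name are illustrative:

```python
GOOGLE_LIMIT_BYTES = 500 * 1024  # 500 KiB, the limit described above


def size_report(robots_txt: str) -> tuple[int, bool]:
    """Return the UTF-8 size in bytes and whether it exceeds the limit."""
    size = len(robots_txt.encode("utf-8"))
    return size, size > GOOGLE_LIMIT_BYTES


print(size_report("User-agent: *\nDisallow:\n"))  # (24, False)
```

Any rule that starts beyond the limit would be invisible to Google, so a True result is worth treating as an error rather than a warning.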

Can I use robots.txt to hide private pages?

No, and trying to is a security risk. Robots.txt is a publicly readable file: anyone can fetch yourdomain.com/robots.txt and see every path you have listed in Disallow. Listing /secret-admin/ or /internal-docs/ advertises their existence to attackers and curious visitors. For genuinely private content, use HTTP authentication or another login mechanism, or IP allowlisting. A noindex meta tag keeps a page out of search results but does not stop anyone from opening it. Robots.txt is for crawl-budget management, not access control.

What is the difference between robots.txt and llms.txt?

Robots.txt is a long-standing standard that tells bots which URLs they can crawl. llms.txt is a newer proposed standard that tells AI engines which pages on your site are most authoritative and provides a curated summary for retrieval. They serve different purposes and complement each other. Robots.txt controls access at the URL level (crawl this, do not crawl that). llms.txt highlights the most important pages and provides AI-readable context that helps generative engines pick the right citations. Most modern sites should ship both — see our llms.txt checker tool and the llms.txt implementation guide for details.