How robots.txt Works, and Common Mistakes That Hurt SEO

robots.txt is a small plain-text file with an outsized effect on a site's search visibility: a single misplaced line can block search engines from crawling the entire site. Understanding exactly what it does, and what it doesn't, prevents the most common and most damaging mistakes.

What robots.txt actually controls

robots.txt tells well-behaved web crawlers (Googlebot, Bingbot, and others) which parts of a site they may crawl, meaning visit and read. It lives at a fixed location at the root of the host (yoursite.com/robots.txt), and its rules apply only to that host, so a subdomain such as blog.yoursite.com needs its own file. Crawlers fetch it before crawling other URLs on that host and usually cache it for a while, so changes may take some time to be picked up. It's a voluntary convention, not a security mechanism: it doesn't prevent access, it only asks crawlers not to visit certain paths, and malicious bots can simply ignore it.
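
To make the "allowed to crawl" idea concrete, here is a minimal sketch using Python's standard-library urllib.robotparser. The robots.txt content and the URLs are made up for illustration, and the helper name is_allowed is hypothetical. Note that this parser does simple prefix matching and may not treat wildcards the way individual search engines do.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, for illustration only.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules permit user_agent to crawl url."""
    parser = RobotFileParser()
    # parse() takes the file's lines, so no network request is made here.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(ROBOTS_TXT, "Googlebot", "https://yoursite.com/admin/login"))  # False
print(is_allowed(ROBOTS_TXT, "Googlebot", "https://yoursite.com/blog/post"))    # True
```

The check only reports what a compliant crawler would do. Nothing in it stops a client that chooses to ignore the file.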

Crawling vs. indexing — a distinction that trips people up

Blocking a page in robots.txt prevents crawlers from reading its content, but it doesn't reliably keep the URL out of search results. If other pages link to a blocked URL, Google can still index the URL itself, typically showing it with no description because it was never allowed to read the page. To actually keep a page out of search results, use a noindex meta tag on the page itself (or the equivalent X-Robots-Tag HTTP response header). Crawlers have to be able to fetch the page to read that instruction, which is the opposite of blocking it in robots.txt. The two tools solve different problems, and combining them incorrectly (blocking a page in robots.txt while also marking it noindex) prevents the noindex instruction from ever being seen.
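
The conflict described above can be detected mechanically. The sketch below is one possible check, not a definitive audit: it assumes you already have the robots.txt text and the page's HTML as strings, it looks only at a meta tag named "robots" in the HTML, and the names RobotsMetaFinder and noindex_is_hidden are hypothetical.

```python
from html.parser import HTMLParser
from urllib.robotparser import RobotFileParser


class RobotsMetaFinder(HTMLParser):
    """Records whether a <meta name="robots"> tag contains noindex."""

    def __init__(self) -> None:
        super().__init__()
        self.noindex = False

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # Attribute values can be None (e.g. <meta name>), so normalise them.
        attributes = {name: (value or "") for name, value in attrs}
        if attributes.get("name", "").lower() == "robots":
            directives = attributes.get("content", "").lower().split(",")
            if "noindex" in [d.strip() for d in directives]:
                self.noindex = True


def noindex_is_hidden(robots_txt: str, url: str, html: str, agent: str = "*") -> bool:
    """True when the page asks for noindex but robots.txt blocks crawlers from reading it."""
    rules = RobotFileParser()
    rules.parse(robots_txt.splitlines())
    finder = RobotsMetaFinder()
    finder.feed(html)
    return finder.noindex and not rules.can_fetch(agent, url)


# Hypothetical inputs, for illustration only.
robots = "User-agent: *\nDisallow: /private/\n"
page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'

print(noindex_is_hidden(robots, "https://yoursite.com/private/report", page))  # True
print(noindex_is_hidden(robots, "https://yoursite.com/public/report", page))   # False
```

A True result means the noindex instruction exists but compliant crawlers would never fetch the page to see it.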

The mistake that blocks an entire site

Under User-agent: *, the single line Disallow: / blocks crawlers from the entire site, not just one section. It is often left over from a staging or development environment, where blocking all crawling is intentional and correct, and then carried over when the site goes live, silently telling every search engine to stop crawling the production site. Check robots.txt after any site migration, redesign, or platform change: the mistake is common, and its effect, a slow, quiet drop in search visibility, isn't always immediately obvious.
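
A post-launch check for this mistake could look something like the sketch below. It uses Python's standard-library urllib.robotparser and treats "the homepage is not crawlable" as a proxy for a site-wide block, which is an assumption of this example. The function name blocks_entire_site and the sample file contents are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def blocks_entire_site(robots_txt: str, user_agent: str = "*") -> bool:
    """Return True if the rules stop user_agent from crawling the site root."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    # If even "/" is disallowed, a prefix rule such as "Disallow: /" is in effect.
    return not parser.can_fetch(user_agent, "/")


# Hypothetical file contents, for illustration only.
STAGING = "User-agent: *\nDisallow: /\n"
PRODUCTION = "User-agent: *\nDisallow: /admin/\n"

print(blocks_entire_site(STAGING))     # True
print(blocks_entire_site(PRODUCTION))  # False
```

In practice a check like this would be run against the live file after each deployment, for example as part of a release checklist.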

A sensible baseline robots.txt

Most sites need very little in robots.txt: block genuinely non-public areas (admin panels, internal search result pages, staging paths), and include a reference to the sitemap so crawlers can find it easily:

# Applies to all crawlers
User-agent: *
# Block specific non-public paths, if any
Disallow: /admin/
Sitemap: https://yoursite.com/sitemap.xml

If there's nothing specific to block, an empty Disallow: line (allowing everything) plus a sitemap reference is a perfectly valid and common configuration.
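
As a rough illustration of both configurations, the sketch below assembles the file text in Python. The function name build_robots_txt and its parameters are hypothetical, and the sitemap URL is the placeholder used above.

```python
from typing import Optional, Sequence


def build_robots_txt(sitemap_url: str, disallowed_paths: Optional[Sequence[str]] = None) -> str:
    """Build a baseline robots.txt that applies to all crawlers."""
    lines = ["User-agent: *"]
    if disallowed_paths:
        lines.extend(f"Disallow: {path}" for path in disallowed_paths)
    else:
        # An empty Disallow value allows everything.
        lines.append("Disallow:")
    lines.append("")
    lines.append(f"Sitemap: {sitemap_url}")
    return "\n".join(lines) + "\n"


# Nothing to block: empty Disallow plus a sitemap reference.
print(build_robots_txt("https://yoursite.com/sitemap.xml"))

# One non-public path blocked.
print(build_robots_txt("https://yoursite.com/sitemap.xml", ["/admin/"]))
```

Keeping the list of blocked paths short and explicit makes it easier to spot a stray site-wide rule during review.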

Try it yourself

Our Robots.txt Generator builds a correctly formatted robots.txt file, and our Sitemap Validator checks that your linked sitemap is valid and error-free.

This guide reflects general, publicly known crawling standards, which individual search engines may interpret with minor variations.

Frequently asked questions

Does blocking a page in robots.txt remove it from Google search results?

Not reliably. If other pages link to the blocked URL, it can still appear in search results, just without a description, since Google was never allowed to read its content. To reliably keep a page out of results, put a noindex tag on the page itself and leave the page crawlable so the tag can be read.

What does "Disallow: /" do?

It blocks well-behaved crawlers from the entire site — every page, not just one section. This is appropriate for a staging environment but should never remain on a live production site.

Is robots.txt a security feature?

No — it's a voluntary request to well-behaved crawlers, not an access control mechanism. Anything genuinely sensitive should be protected with authentication, not just excluded in robots.txt.