robots.txt and Sitemap.xml: A Practical Guide
These two files sit at the root of your domain and shape how search engines interact with your site. robots.txt tells crawlers what they should and shouldn't request. sitemap.xml tells them what pages exist. Together, they give you control over crawl behaviour without touching a line of application code.
What robots.txt does
A robots.txt file is a set of directives for web crawlers. When Googlebot (or any well-behaved crawler) arrives at your site, the first thing it requests is /robots.txt. The file tells it which paths are off-limits.
Important: robots.txt is a request, not access control. It relies on crawlers voluntarily obeying the rules. Malicious bots will ignore it. Never use robots.txt to hide sensitive content — use authentication or noindex meta tags instead.
robots.txt syntax
The file uses four main directives:
- User-agent: specifies which crawler the rules apply to. Use * for all crawlers, or a specific name like Googlebot.
- Disallow: blocks a path or pattern. Disallow: /admin/ prevents crawling anything under /admin/.
- Allow: overrides a broader Disallow. Useful for permitting a specific file inside a blocked directory.
- Sitemap: points crawlers to your sitemap URL. This line can appear anywhere in the file and applies globally regardless of User-agent blocks.
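Putting the four directives together, a minimal robots.txt might look like this (the paths and sitemap URL are illustrative placeholders):

```text
# Block the admin area for all crawlers,
# except one publicly useful page inside it.
User-agent: *
Disallow: /admin/
Allow: /admin/help.html

Sitemap: https://example.com/sitemap.xml
```

Blank lines separate groups of rules; the Sitemap line stands alone and applies to all crawlers.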
Common robots.txt mistakes
Accidentally blocking the entire site
A single misplaced Disallow: / under User-agent: * blocks every crawler from every page. This happens more often than you'd think — staging sites ship to production with their restrictive robots.txt still in place. One line can deindex your entire site within days.
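This is the two-line pattern to watch for on a production deploy:

```text
User-agent: *
Disallow: /
```

Those two lines ask every well-behaved crawler to skip the entire site. If you need to keep a staging environment out of search, HTTP authentication or noindex headers are safer than a robots.txt you might forget to swap out.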
Blocking CSS and JavaScript
Google needs to render your pages to understand them. Blocking /css/ or /js/ prevents Googlebot from rendering your site correctly, which can hurt rankings. Unless a directory contains genuinely private assets, don't block it.
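If render-critical assets live inside an otherwise-blocked directory, an Allow rule can reopen just those paths (directory names here are illustrative):

```text
User-agent: Googlebot
Disallow: /assets/
Allow: /assets/css/
Allow: /assets/js/
```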
Using robots.txt for noindex
Blocking a page in robots.txt prevents crawling, but it doesn't remove it from search results. If Google already knows about the URL (from backlinks, sitemaps, or previous crawls), it may keep it indexed with a "No information is available for this page" snippet. Use a noindex meta tag or HTTP header to actually deindex a page.
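To actually deindex a page, use one of these two mechanisms (snippets are illustrative):

```html
<meta name="robots" content="noindex">
```

or, for non-HTML resources like PDFs, the equivalent HTTP response header:

```text
X-Robots-Tag: noindex
```

Note that the page must stay crawlable for either to work: if robots.txt blocks the URL, Googlebot never fetches it and never sees the noindex.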
What sitemap.xml does
A sitemap lists the URLs you want search engines to know about. It helps with discovery — especially for new sites, large sites, or pages that are poorly linked internally. A sitemap is not a guarantee of indexing. Google will still evaluate whether each URL is worth indexing based on content quality, crawl budget, and other signals.
Sitemap format
A sitemap is an XML file with a <urlset> root element containing one <url> block per page. Each block requires a <loc> (the full URL) and optionally includes <lastmod> (last modification date), <changefreq>, and <priority>. In practice, Google ignores changefreq and priority — only loc and lastmod matter.
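A minimal sitemap containing the two fields Google uses might look like this (URLs and dates are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2024-01-10</lastmod>
  </url>
</urlset>
```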
Sitemap index files
Individual sitemaps are capped at 50,000 URLs and 50MB uncompressed. For larger sites, use a sitemap index — an XML file that lists multiple sitemap files. This lets you split URLs by section (blog, products, categories) and keep each file manageable. Most CMS platforms and frameworks generate sitemap indexes automatically once you exceed the limit.
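A sitemap index uses a &lt;sitemapindex&gt; root with one &lt;sitemap&gt; entry per file (filenames here are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
  </sitemap>
</sitemapindex>
```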
Referencing your sitemap in robots.txt
Add a Sitemap: directive with the full URL to your sitemap at the bottom of your robots.txt file. For example: Sitemap: https://example.com/sitemap.xml. This is the simplest way to ensure every crawler discovers your sitemap without needing to submit it manually in search console tools.
Testing your setup
After making changes, check your robots.txt in Google Search Console's robots.txt report. For sitemaps, submit the URL in Search Console and check for errors — common issues include non-canonical URLs, URLs blocked by robots.txt, and URLs returning non-200 status codes. A site audit tool can flag these conflicts automatically.
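One of those conflicts — sitemap URLs that your own robots.txt blocks — can also be caught with a quick script. Here's a sketch using only the Python standard library; it works on the raw file contents, so fetching them (and the example domain and paths) are assumptions left to the caller:

```python
# Hypothetical sketch: cross-check sitemap URLs against robots.txt rules
# offline, using only the Python standard library. File contents are
# passed in as strings; fetching them is left to the caller.
from urllib.robotparser import RobotFileParser
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_xml: str) -> list[str]:
    """Extract every <loc> value from a <urlset> sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]

def blocked_urls(robots_txt: str, urls: list[str],
                 agent: str = "Googlebot") -> list[str]:
    """Return the sitemap URLs that robots.txt disallows for this agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(agent, u)]

# Example inputs (placeholder domain and paths):
robots = "User-agent: *\nDisallow: /admin/\n"
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/admin/login</loc></url>
</urlset>"""

conflicts = blocked_urls(robots, sitemap_urls(sitemap))
```

Any URL in `conflicts` is listed in the sitemap but blocked from crawling — a signal that either the sitemap or the robots.txt needs fixing.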
Audit your full technical SEO setup
AuditZap checks your robots.txt, sitemap, meta tags, and 20 more technical SEO factors in a single scan — no account required to start.
Run a free audit