robots.txt and Sitemap.xml: A Practical Guide
These two files sit at the root of your domain and shape how search engines interact with your site. robots.txt tells crawlers what they should and shouldn't request. sitemap.xml tells them what pages exist. Together, they give you control over crawl behaviour without touching a line of application code.
What robots.txt does
A robots.txt file is a set of directives for web crawlers. When Googlebot (or any well-behaved crawler) arrives at your site, the first thing it requests is /robots.txt. The file tells it which paths are off-limits.
Important: robots.txt is a request, not access control. It relies on crawlers voluntarily obeying the rules. Malicious bots will ignore it. Never use robots.txt to hide sensitive content — use authentication or noindex meta tags instead.
robots.txt syntax
The file uses four main directives:
- User-agent: specifies which crawler the rules apply to. Use * for all crawlers, or a specific name like Googlebot.
- Disallow: blocks a path or pattern. Disallow: /admin/ prevents crawling anything under /admin/.
- Allow: overrides a broader Disallow. Useful for permitting a specific file inside a blocked directory.
- Sitemap: points crawlers to your sitemap URL. This line can appear anywhere in the file and applies globally regardless of User-agent blocks.
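Putting the four directives together, a minimal robots.txt might look like this (the paths and sitemap URL are illustrative placeholders):

```text
# Block the admin area for all crawlers,
# except one publicly useful page inside it.
User-agent: *
Disallow: /admin/
Allow: /admin/help.html

Sitemap: https://example.com/sitemap.xml
```

Blank lines separate groups of rules; the Sitemap line stands alone and applies to all crawlers.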
Common robots.txt mistakes
Accidentally blocking the entire site
A single misplaced Disallow: / under User-agent: * blocks every crawler from every page. This happens more often than you'd think — staging sites ship to production with their restrictive robots.txt still in place. One line can deindex your entire site within days.
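This is the two-line pattern to watch for on a production deploy:

```text
User-agent: *
Disallow: /
```

Those two lines ask every well-behaved crawler to skip the entire site. If you need to keep a staging environment out of search, HTTP authentication or noindex headers are safer than a robots.txt you might forget to swap out.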
Blocking CSS and JavaScript
Google needs to render your pages to understand them. Blocking /css/ or /js/ prevents Googlebot from rendering your site correctly, which can hurt rankings. Unless a directory contains genuinely private assets, don't block it.
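If render-critical assets live inside an otherwise-blocked directory, an Allow rule can reopen just those paths (directory names here are illustrative):

```text
User-agent: Googlebot
Disallow: /assets/
Allow: /assets/css/
Allow: /assets/js/
```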
Using robots.txt for noindex
Blocking a page in robots.txt prevents crawling, but it doesn't remove it from search results. If Google already knows about the URL (from backlinks, sitemaps, or previous crawls), it may keep it indexed with a "No information is available for this page" snippet. Use a noindex meta tag or HTTP header to actually deindex a page.
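To actually deindex a page, use one of these two mechanisms (snippets are illustrative):

```html
<meta name="robots" content="noindex">
```

or, for non-HTML resources like PDFs, the equivalent HTTP response header:

```text
X-Robots-Tag: noindex
```

Note that the page must stay crawlable for either to work: if robots.txt blocks the URL, Googlebot never fetches it and never sees the noindex.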
What sitemap.xml does
A sitemap lists the URLs you want search engines to know about. It helps with discovery — especially for new sites, large sites, or pages that are poorly linked internally. A sitemap is not a guarantee of indexing. Google will still evaluate whether each URL is worth indexing based on content quality, crawl budget, and other signals.
Sitemap format
A sitemap is an XML file with a <urlset> root element containing one <url> block per page. Each block requires a <loc> (the full URL) and optionally includes <lastmod> (last modification date), <changefreq>, and <priority>. In practice, Google ignores changefreq and priority — only loc and lastmod matter.
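A minimal sitemap containing the two fields Google uses might look like this (URLs and dates are placeholders):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>https://example.com/</loc>
    <lastmod>2024-01-15</lastmod>
  </url>
  <url>
    <loc>https://example.com/about</loc>
    <lastmod>2024-01-10</lastmod>
  </url>
</urlset>
```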
Sitemap index files
Individual sitemaps are capped at 50,000 URLs and 50MB uncompressed. For larger sites, use a sitemap index — an XML file that lists multiple sitemap files. This lets you split URLs by section (blog, products, categories) and keep each file manageable. Most CMS platforms and frameworks generate sitemap indexes automatically once you exceed the limit.
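A sitemap index uses a &lt;sitemapindex&gt; root with one &lt;sitemap&gt; entry per file (filenames here are illustrative):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://example.com/sitemap-blog.xml</loc>
  </sitemap>
  <sitemap>
    <loc>https://example.com/sitemap-products.xml</loc>
  </sitemap>
</sitemapindex>
```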
Referencing your sitemap in robots.txt
Add a Sitemap: directive with the full URL to your sitemap at the bottom of your robots.txt file. For example: Sitemap: https://example.com/sitemap.xml. This is the simplest way to ensure every crawler discovers your sitemap without needing to submit it manually in search console tools.
Testing your setup
After making changes, check your robots.txt in Google Search Console's robots.txt report. For sitemaps, submit the URL in Search Console and check for errors — common issues include non-canonical URLs, URLs blocked by robots.txt, and URLs returning non-200 status codes. A site audit tool can flag these conflicts automatically.
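One of those conflicts — sitemap URLs that your own robots.txt blocks — can also be caught with a quick script. Here's a sketch using only the Python standard library; it works on the raw file contents, so fetching them (and the example domain and paths) are assumptions left to the caller:

```python
# Hypothetical sketch: cross-check sitemap URLs against robots.txt rules
# offline, using only the Python standard library. File contents are
# passed in as strings; fetching them is left to the caller.
from urllib.robotparser import RobotFileParser
import xml.etree.ElementTree as ET

SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def sitemap_urls(sitemap_xml: str) -> list[str]:
    """Extract every <loc> value from a <urlset> sitemap."""
    root = ET.fromstring(sitemap_xml)
    return [loc.text.strip() for loc in root.iter(f"{SITEMAP_NS}loc")]

def blocked_urls(robots_txt: str, urls: list[str],
                 agent: str = "Googlebot") -> list[str]:
    """Return the sitemap URLs that robots.txt disallows for this agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [u for u in urls if not parser.can_fetch(agent, u)]

# Example inputs (placeholder domain and paths):
robots = "User-agent: *\nDisallow: /admin/\n"
sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/admin/login</loc></url>
</urlset>"""

conflicts = blocked_urls(robots, sitemap_urls(sitemap))
```

Any URL in `conflicts` is listed in the sitemap but blocked from crawling — a signal that either the sitemap or the robots.txt needs fixing.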
Audit your full technical SEO setup
AuditZap checks your robots.txt, sitemap, meta tags, and 20 more technical SEO factors in a single scan — no account required to start.
Run a free audit