CyberLens documentation

robots.txt

Practical guide to what the robots.txt file does, how CyberLens reads it today, and when action is actually worth taking.

Severity: Informational
Estimated fix time: 5-15 min
Technical level: Beginner
Applies to: WordPress, Static HTML, Apache, Nginx

What it is

The robots.txt file tells crawlers which parts of a site they may scan. It is not an access-control mechanism, it is not a security mechanism, and it does not control indexing directly. Its role is to guide bot traffic and, when needed, point crawlers to the sitemap.

robots.txt affects crawling, not indexing. If the file is missing, that is usually not a critical issue on its own: Google generally behaves as if no special crawl restrictions were declared. If the file returns a server error, the situation is more serious because crawling may slow down or pause.

Info: crawling means fetching and reading a page. Indexing means deciding whether that page belongs in search results. They are related, but they are not the same thing.

robots.txt is a plain text file placed at the site root, for example https://example.com/robots.txt, and it follows the Robots Exclusion Protocol to communicate crawl instructions.

It must live exactly at the root of the site. A file placed in a subfolder such as /assets/robots.txt is ignored. The filename is case-sensitive and should stay lowercase.

  • A file at https://example.com/robots.txt does not automatically apply to URLs served over http://example.com/.
  • It does not cover subdomains such as sub.example.com.
  • It does not extend to other ports such as :8080.

Technical note: every subdomain and every protocol or port combination needs its own robots.txt file.
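
The scoping rule above can be illustrated with a short Python sketch. It derives the robots.txt location that governs a given page URL by keeping only the protocol, host, and port. The function name robots_url_for is an assumption made for this example, not part of CyberLens.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_url_for(page_url: str) -> str:
    """Return the robots.txt URL that governs page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and netloc (host plus optional port);
    # drop the path, query string, and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


for url in (
    "https://example.com/shop/item?id=1",
    "http://example.com/shop/",
    "https://sub.example.com/",
    "https://example.com:8080/app",
):
    print(url, "->", robots_url_for(url))
```

Each of the four URLs maps to a different robots.txt location, which is why each combination needs its own file.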

Why it matters

  • Crawling, not indexing. Blocking a page with Disallow prevents Google from reading it, but does not guarantee that it stays out of search results.
  • Crawl budget. For small sites this is usually not a major concern. For larger sites, ecommerce catalogs, or sites with many URLs, robots.txt helps steer crawling toward the most useful sections.
  • Sitemap discovery. It gives crawlers a clear place to discover the XML sitemap.

Info: robots.txt does not prevent indexing by itself. A blocked page can still appear in search results if it receives external links: Google may index the URL without reading the page content.

Warning: if you block a page with Disallow, Google cannot read a noindex tag on that page because it cannot crawl it. The result may be the opposite of what you want: the page can remain indexed. If you need a page out of the index, let it be crawled and use noindex in the HTML.

Since September 2019, Google no longer supports a noindex directive placed inside robots.txt.
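
To make the noindex trap concrete, the sketch below uses Python's standard urllib.robotparser module to test whether a crawler may fetch a page. If the answer is False, a noindex tag in that page's HTML cannot be read. The helper name can_crawl and the /private/ rule are assumptions for illustration. Python's parser applies rules in file order rather than by longest match, so its answers can differ from Google's when Allow and Disallow rules overlap.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# Hypothetical rules used only for this example.
RULES = """\
User-agent: *
Disallow: /private/
"""

# Blocked from crawling: a noindex tag on this page would never be read.
print(can_crawl(RULES, "Googlebot", "https://example.com/private/report.html"))  # False

# Crawlable: a noindex tag in this page's HTML can take effect.
print(can_crawl(RULES, "Googlebot", "https://example.com/blog/post.html"))  # True
```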

How CyberLens checks it

CyberLens fetches the robots.txt file and checks:

  • whether the file is present at the expected location, /robots.txt;
  • the returned HTTP status, such as 2xx, 4xx, 5xx, or 429;
  • a quick content preview when the file is reachable;
  • whether a Sitemap directive is present;
  • whether the file includes an explicit reference to the XML sitemap.
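
The checks listed above could be approximated as follows. This is a minimal Python sketch under stated assumptions, not CyberLens's actual implementation: the function names summarize_robots and fetch_robots, the 200-character preview length, and the keys of the returned dictionary are all invented for illustration.

```python
import urllib.error
import urllib.request


def summarize_robots(status: int, body: str, preview_chars: int = 200) -> dict:
    """Condense a robots.txt response into status, preview, and sitemap lines."""
    reachable = 200 <= status < 300
    sitemaps = []
    if reachable:
        for line in body.splitlines():
            # Drop trailing comments, then split "Field: value" on the first colon.
            field, sep, value = line.split("#", 1)[0].partition(":")
            if sep and field.strip().lower() == "sitemap" and value.strip():
                sitemaps.append(value.strip())
    return {
        "status": status,
        "reachable": reachable,
        "preview": body[:preview_chars] if reachable else "",
        "sitemaps": sitemaps,
    }


def fetch_robots(base_url: str, timeout: float = 10.0) -> dict:
    """Fetch <base_url>/robots.txt and summarize it (performs a network request)."""
    url = base_url.rstrip("/") + "/robots.txt"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            text = resp.read().decode("utf-8", "replace")
            return summarize_robots(resp.status, text)
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses are raised as exceptions; keep the status code.
        return summarize_robots(err.code, "")
```

Keeping the parsing in summarize_robots separate from the network call makes the logic easy to test without a live server. Connection failures (urllib.error.URLError) are deliberately left unhandled in this sketch.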

Future enhancements: deeper rule analysis, detection of full-site blocks or blocked critical assets, and more detailed syntax validation may be added later.

Possible findings

Each finding stands on its own, so readers can jump straight to the one shown in the report.

Missing robots.txt (404/410)

Severity: Low / Informational

The file does not exist. Google behaves as if no special crawl restrictions were declared through robots.txt. That does not guarantee that every page will be crawled, because crawl behavior still depends on internal links, site quality, available crawl time, and other signals. The most practical downside is that there is no native place to advertise the sitemap.

Unreachable robots.txt (5xx / 429)

Severity: Critical

The server returns a server error such as 5xx or a rate-limiting response such as 429. Google may treat this as a server-side problem and slow down or pause crawling of the whole site until the file becomes reachable again. This is the most urgent case.

Sitemap not declared

Severity: Low / Moderate

The file exists but does not reference the XML sitemap. It is not an error, but adding the sitemap usually makes new content easier to discover.

Tip: a missing robots.txt is rarely urgent by itself. A robots.txt file returning a server error such as 5xx or 429 should be checked first.

Future enhancements: more granular findings, such as full-site blocks, blocked CSS or JS assets, or invalid syntax, may become part of later versions of the check.

Suggested order of priority:

  1. Unreachable file (5xx/429): check server health immediately.
  2. Missing file (404/410): decide whether to publish one, especially if you want to declare the sitemap clearly.
  3. Sitemap not declared: add it during the next technical update of the site.
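
The mapping from an observed result to a finding and severity can be written as a small decision function. The sketch below mirrors the three findings described in this guide. The function name classify and the labels returned for uncovered cases are assumptions, and this is not how CyberLens itself necessarily assigns severity.

```python
def classify(status: int, has_sitemap: bool) -> tuple[str, str]:
    """Return (finding, severity) for a robots.txt response."""
    # Server errors and rate limiting are the most urgent case.
    if status == 429 or 500 <= status < 600:
        return ("Unreachable robots.txt", "Critical")
    if status in (404, 410):
        return ("Missing robots.txt", "Low / Informational")
    if 200 <= status < 300:
        if not has_sitemap:
            return ("Sitemap not declared", "Low / Moderate")
        return ("No finding", "None")
    # Other statuses (for example 403 or an unresolved redirect)
    # are not covered by this guide, so the sketch leaves them open.
    return ("Unclassified", "Unknown")


print(classify(503, False))
print(classify(404, False))
print(classify(200, False))
```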

How to fix it

Quick fix with CyberLens

If CyberLens reports that robots.txt is missing, you can use the built-in generator to create a simple starter file.

The generated file uses a straightforward configuration that fits most sites:

User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml

After downloading it, upload it to the public root of the site so it is reachable at:

https://example.com/robots.txt

If you do not manage the site files directly, send the content to your webmaster or hosting provider.
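
For readers who prefer to script the starter file, a rough Python equivalent is shown below. It is only a sketch of what such a generator might do, not the CyberLens generator itself. The function name starter_robots_txt is an assumption, and the sitemap URL is a placeholder to replace with your own.

```python
from urllib.parse import urlsplit


def starter_robots_txt(sitemap_url: str) -> str:
    """Build a permissive starter robots.txt that declares one sitemap."""
    parts = urlsplit(sitemap_url)
    # The Sitemap line expects a full URL, not a relative path.
    if parts.scheme not in ("http", "https") or not parts.netloc:
        raise ValueError("sitemap_url must be an absolute http(s) URL")
    lines = [
        "User-agent: *",
        "Disallow:",
        f"Sitemap: {sitemap_url}",
    ]
    return "\n".join(lines) + "\n"


print(starter_robots_txt("https://example.com/sitemap.xml"), end="")
```

The empty Disallow line means no path is blocked for any crawler.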

WordPress

Make sure the file does not block admin-ajax.php, which many themes and plugins rely on for dynamic content:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

If you use an SEO plugin, confirm that it generates the sitemap directive correctly.
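
The WordPress rules above work because Google picks the most specific (longest) matching rule, and prefers Allow when two rules tie. The sketch below is a simplified Python model of that behavior: it handles plain path prefixes only and ignores the * and $ wildcards. The function name is_allowed and the rule list format are assumptions for illustration.

```python
def is_allowed(rules: list[tuple[str, str]], path: str) -> bool:
    """Apply longest-match precedence to (directive, prefix) rules."""
    best_len = -1
    allowed = True  # no matching rule means the path may be crawled
    for directive, prefix in rules:
        # An empty pattern matches nothing; skip non-matching prefixes.
        if not prefix or not path.startswith(prefix):
            continue
        is_allow = directive.lower() == "allow"
        # The longer rule wins; on a tie, Allow wins.
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


WORDPRESS_RULES = [
    ("Disallow", "/wp-admin/"),
    ("Allow", "/wp-admin/admin-ajax.php"),
]

print(is_allowed(WORDPRESS_RULES, "/wp-admin/admin-ajax.php"))  # True
print(is_allowed(WORDPRESS_RULES, "/wp-admin/options.php"))     # False
print(is_allowed(WORDPRESS_RULES, "/blog/hello-world/"))        # True
```

Parsers that use first-match ordering instead, including Python's urllib.robotparser, can report admin-ajax.php as blocked for the same file, so test with a tool that follows longest-match precedence.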

Static site

Create a robots.txt file and publish it at the site root, not inside a subfolder. A minimal example is:

User-agent: *
Disallow:
Sitemap: https://example.com/sitemap.xml

Apache / hosting panel

If the file returns 5xx or 429:

  • check the server logs to identify the cause;
  • confirm that a security module or WAF is not blocking legitimate crawler requests;
  • make sure robots.txt is not behind authentication or server rules that fail for bots.

Nginx / server configuration

  • check that the location block serving /robots.txt does not return avoidable errors or redirects;
  • review any rate-limiting rules that may answer 429 to crawlers;
  • confirm that the file is served from the correct site root.
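
On either Apache or Nginx, one quick way to spot a WAF or rate-limiting rule that treats crawlers differently is to request robots.txt with several User-Agent headers and compare the statuses. The Python sketch below assumes the helper names http_status and blocked_agents. Changing the User-Agent does not reproduce IP-based rules, so a clean result here does not rule out every cause.

```python
import urllib.error
import urllib.request
from typing import Callable, Iterable


def http_status(url: str, user_agent: str) -> int:
    """Return the HTTP status for url when requested with user_agent (network call)."""
    req = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(req, timeout=10) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code


def blocked_agents(
    url: str,
    user_agents: Iterable[str],
    fetch: Callable[[str, str], int] = http_status,
) -> dict[str, int]:
    """Map each user agent that received 429 or 5xx to the status it got."""
    blocked = {}
    for agent in user_agents:
        status = fetch(url, agent)
        if status == 429 or 500 <= status < 600:
            blocked[agent] = status
    return blocked


AGENTS = [
    "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    "Mozilla/5.0 (X11; Linux x86_64) robots-check-example",  # made-up browser-like agent
]
# Example call (performs real requests):
# print(blocked_agents("https://example.com/robots.txt", AGENTS))
```

If only the crawler-like agent appears in the result, a user-agent rule is a likely suspect. If every agent appears, look at general server health first.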

Info for non-technical users: if robots.txt returns 5xx or 429 and you do not manage the server yourself, you do not need to guess. Contact the hosting provider, explain that the file is returning a server-side error, and ask for a check.

Warning: Google follows a robots.txt redirect for up to five hops. If the redirect chain goes beyond that or fails, the file is effectively treated like a 404.
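
The five-hop rule can be modeled with a short loop. The Python sketch below is an illustration of the behavior described in the warning, not Google's code: the name effective_status and the fetch callback (which returns a status and an optional Location value) are assumptions. Passing the fetcher in as a parameter lets the logic be tried without network access.

```python
from typing import Callable, Optional, Tuple
from urllib.parse import urljoin

# A fetcher takes a URL and returns (status, Location header or None).
Fetch = Callable[[str], Tuple[int, Optional[str]]]

REDIRECT_CODES = (301, 302, 303, 307, 308)


def effective_status(url: str, fetch: Fetch, max_hops: int = 5) -> int:
    """Follow redirects from url; report 404 if the chain is too long or broken."""
    # One initial request plus at most max_hops redirected requests.
    for _ in range(max_hops + 1):
        status, location = fetch(url)
        if status not in REDIRECT_CODES:
            return status
        if not location:
            return 404  # redirect without a target: the chain fails
        url = urljoin(url, location)
    # Still redirecting after max_hops hops: treat like a missing file.
    return 404


# Simulated site: two hops, then the real file.
PAGES = {
    "http://example.com/robots.txt": (301, "https://example.com/robots.txt"),
    "https://example.com/robots.txt": (301, "https://www.example.com/robots.txt"),
    "https://www.example.com/robots.txt": (200, None),
}
print(effective_status("http://example.com/robots.txt", PAGES.__getitem__))  # 200
```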

How this appears in CyberLens

In the scan report, the robots.txt finding appears with:

  • whether the file is present or missing;
  • the observed HTTP status;
  • a quick content preview when the file is reachable;
  • whether the sitemap reference is present or absent;
  • the current severity, based on the result observed during the scan: missing file, server error, or missing sitemap reference.

When robots.txt is missing, CyberLens may offer a simple generator for creating a basic file to download and upload to the site.

Note: in the current version, CyberLens is not a complete line-by-line robots.txt parser. The goal of this view is quick, useful interpretation, not advanced Robots Exclusion Protocol (REP) debugging.