
Robots Implementation Guide for Developers

What are Robots?

Robots directives tell crawlers what they should and should not crawl and index on your site. There are three ways to implement them, and you should only use one of these methods for any given page:

    • Robots.txt file
    • Robots meta tag
    • X-Robots HTTP Header

The process crawlers take when they visit your website is as follows:

  • A crawler comes along and, before it accesses any pages on your site, it looks for a robots.txt file
  • If it finds a robots.txt file, it does or doesn’t crawl pages based on those directives
  • On each page it is allowed to crawl, it then looks for a robots meta tag or X-Robots-Tag header to tell it whether or not to index the page and follow the links on it to other pages
  • If it doesn’t find a meta tag or X-Robots-Tag header, it indexes the page and follows its links anyway. This is the default.

Whereas robots.txt directives give bots suggestions for which of a website’s pages they should crawl, robots meta directives provide firmer instructions on how to index and serve a page’s content.

In most cases, a meta robots tag with the parameters “noindex, follow” should be used to restrict indexation, rather than a robots.txt file disallow.

The robots.txt file is used to guide a search engine as to which directories and files it should crawl. It does not stop content from being indexed and listed in search results.

The noindex robots meta tag tells search engines not to include content in search results and, if the content has already been indexed before, then they should drop the content entirely. It does not stop search engines from crawling content.
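As a quick illustration of the difference (the page path here is hypothetical), the first snippet below stops compliant crawlers from fetching a page but does not reliably keep it out of search results, while the second lets them fetch the page and tells them not to index it:

# robots.txt: blocks crawling of the page, but the URL can still be indexed if it is linked to
User-agent: *
Disallow: /example-page/

<!-- robots meta tag in the page's <head>: allows crawling, but blocks indexing -->
<meta name="robots" content="noindex, follow">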

Robots.txt

robots.txt is a text file which sits in the root folder of your website. You can access it via FTP or your hosting login panel. 

The file must be named exactly robots.txt, not Robots.txt or ROBOTS.TXT.

The robots.txt file is used to guide a search engine as to which directories and files it should crawl. It does not stop content from being indexed and listed in search results.

Reasons to use Robots.txt

Robots.txt files are best for disallowing a whole section of a site, such as a category.

Point Bots Away From Private Folders – Preventing bots from crawling your private folders makes them much harder to find and index. However, robots.txt does not stop all bots from crawling pages, so if you have private information that you don’t want to be publicly searchable, choose a more secure approach, such as password protection, to keep visitors from viewing confidential pages.

Keep Resources Under Control / Crawl Budget – Each time a bot crawls your site, it uses crawl budget. Sites with thousands of pages, such as ecommerce sites, can use up crawl budget very quickly, as crawlers will try to fetch every page, file and image.

You can use robots.txt to stop bots from accessing individual scripts, PDFs and images, so that more of your crawl budget is spent on the pages which actually matter.

Specify Location Of Your Sitemap – Let crawlers know where your sitemap is located so that they can find your pages and crawl your site more easily.

Robots.txt directives

There are various commands you can use in a robots.txt file to tell crawlers what to do. Bear in mind that not all bots and crawlers will respect robots.txt and therefore you should not rely on it to prevent content from being indexed and crawled. 

Disallow

To block all web crawlers from all content use:

User-agent: *
Disallow: /

To block a specific folder from being crawled by all crawlers:

User-agent: *
Disallow: /example-subfolder/

This would block an individual pdf file:

User-agent: *
Disallow: /directory/some-pdf.pdf

This would block an individual page:

User-agent: *
Disallow: /useless_file.html

This would block an individual image:

User-agent: Googlebot-Image
Disallow: /images/dogs.jpg

To block all images from being crawled, you can use this:

User-agent: Googlebot-Image
Disallow: /

Remember – Even if you disallow a page in the robots.txt file, Google may still index the URL (without crawling it) if there are internal or external links pointing to it, so a disallow is not a reliable way to keep a page out of search results.

Be careful with trailing slashes on directories! For example:

User-agent: *
Disallow: /directory

This rule (with no trailing slash) blocks search engines from accessing all of the following:

  • /directory
  • /directory/
  • /directory-name-1
  • /directory-name.html
  • /directory-name.php
  • /directory-name.pdf

This is because a rule without a trailing slash is a simple prefix match: it blocks every URL whose path starts with /directory. Appending a forward slash (Disallow: /directory/) limits the rule to the contents of that one directory.

Also bear in mind that rules are matched from the start of the URL path, so Disallow: /directory/ would not block /my-folder/cards/directory/.
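If you do want to block a directory name wherever it appears in the path, you would need wildcard rules along these lines (the directory name is illustrative, and wildcards are covered in the next section):

User-agent: *
Disallow: /directory/
Disallow: /*/directory/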

Wildcards

An asterisk (*) is a wildcard. It can be used in the User-agent line to address all crawlers, or within a rule path to match any sequence of characters, e.g.

User-agent: *
Disallow: /*.html

This would stop crawlers from accessing any URLs that contain .html.

Using an asterisk (*) matches all crawlers except the various AdsBot crawlers, which must be named explicitly. For example:

# Example 1: Block only Googlebot

User-agent: Googlebot
Disallow: /

# Example 2: Block Googlebot and Adsbot

User-agent: Googlebot
User-agent: AdsBot-Google
Disallow: /

# Example 3: Block all but AdsBot crawlers

User-agent: *
Disallow: /

The wildcard is supported by Google, Bing, Yahoo and Ask.

General Directives

To allow all web crawlers access to all content use:

User-agent: *
Disallow:

To block a specific crawler from crawling your site:

User-agent: Bingbot
Disallow: /

To block a specific crawler from crawling your site, but allow all others:

User-agent: Unnecessarybot
Disallow: /
User-agent: *
Allow: /

 

Allow

You might occasionally see an allow in a robots.txt file. This is used to tell a crawler it can crawl part of a subfolder previously not allowed – it is basically an override.  

User-agent: *
Allow: /media/terms-and-conditions.pdf
Disallow: /media/

Using the Allow and Disallow directives together you can tell search engines they can access a specific file or page within a directory that’s otherwise disallowed.

In the example above all search engines are not allowed to access the /media/ directory, except for the file /media/terms-and-conditions.pdf.

You could also allow just a single crawler (in this case Googlebot-News) to access the site while disallowing all others, like this:

User-agent: Googlebot-news
Allow: /
User-agent: *
Disallow: /

If you wanted to block your pages for most crawlers but still allow Mediapartners-Google to analyse them, you would use this (each crawler follows only the group of rules that most specifically matches its user agent, so the blanket Disallow does not override the Mediapartners-Google group):

User-agent: *
Disallow: /
User-agent: Mediapartners-Google
Allow: /

The allow directive is supported by Google and Bing. 

Using $

The $ sign is used to specify the end of a URL, for example:

User-agent: *
Disallow: /*.php$

In the example above, search engines aren’t allowed to access any URLs that end with .php.

In this example, Googlebot cannot crawl any GIF files:

User-agent: Googlebot
Disallow: /*.gif$ 

In this example, no XLS files can be crawled by Googlebot:

User-agent: Googlebot
Disallow: /*.xls$

However, URLs with parameters, e.g. https://example.com/page.php?lang=en, would not be disallowed, as the URL doesn’t end with .php.
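If you want parameterised URLs like that to be blocked as well, drop the $ so the rule becomes a “contains” match rather than an “ends with” match:

User-agent: *
Disallow: /*.php

This matches any URL containing .php, including /page.php?lang=en.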

 

Path Matches

Here are some example path matches:
/ Matches the root and any lower level URL.
/* Equivalent to /. The trailing wildcard is ignored.
/$ Matches only the root. Any lower level URL is allowed for crawling.
/fish

Matches any path that starts with /fish.

Matches:

  • /fish
  • /fish.html
  • /fish/salmon.html
  • /fishheads
  • /fishheads/yummy.html
  • /fish.php?id=anything

Doesn’t match:

  • /Fish.asp
  • /catfish
  • /?id=fish
  • /desert/fish

Note: Matching is case-sensitive.

/fish*

Equivalent to /fish. The trailing wildcard is ignored.

Matches:

  • /fish
  • /fish.html
  • /fish/salmon.html
  • /fishheads
  • /fishheads/yummy.html
  • /fish.php?id=anything

Doesn’t match:

  • /Fish.asp
  • /catfish
  • /?id=fish
  • /desert/fish

/fish/

Matches anything in the /fish/ folder.

Matches:

  • /fish/
  • /animals/fish/
  • /fish/?id=anything
  • /fish/salmon.htm

Doesn’t match:

  • /fish
  • /fish.html
  • /Fish/Salmon.asp

/*.php

Matches any path that contains .php.

Matches:

  • /index.php
  • /filename.php
  • /folder/filename.php
  • /folder/filename.php?parameters
  • /folder/any.php.file.html
  • /filename.php/

Doesn’t match:

  • / (even if it maps to /index.php)
  • /windows.PHP

/*.php$

Matches any path that ends with .php.

Matches:

  • /filename.php
  • /folder/filename.php

Doesn’t match:

  • /filename.php?parameters
  • /filename.php/
  • /filename.php5
  • /windows.PHP

/fish*.php

Matches any path that contains /fish and .php, in that order.

Matches:

  • /fish.php
  • /fishheads/catfish.php?parameters

Doesn’t match: /Fish.PHP

 

Comments

Comments are preceded by a # and can either be placed at the start of a line or after a directive on the same line. Everything after the # will be ignored. These comments are meant for humans only. For example:

# Don’t allow access to the /wp-admin/ directory for all robots.

User-agent: *
Disallow: /wp-admin/

Blocking Parameter URLs

You can disallow all parameterised URLs with a wildcard rule like this:

Disallow: /*?*

However, this is rarely a good idea. It can block pages you don’t want blocked (for example, blog URLs that legitimately use parameters), and it stops crawlers from even looking at those pages.
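If you genuinely need to block parameterised URLs, it is usually safer to target the specific parameters that cause problems rather than every query string. For example (the sort parameter is purely illustrative):

User-agent: *
Disallow: /*?sort=
Disallow: /*&sort=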

Robots.txt Order of Rules

When matching robots.txt rules to URLs, crawlers use the most specific rule based on the length of the rule path. In case of conflicting rules, including those with wildcards, Google uses the least restrictive rule.

Sample conflict situations

http://example.com/page

allow: /p

disallow: /

Applicable rule: allow: /p, because it’s more specific.

http://example.com/folder/page

allow: /folder

disallow: /folder

Applicable rule: allow: /folder, because in case of conflicting rules, Google uses the least restrictive rule.

http://example.com/page.htm

allow: /page

disallow: /*.htm

Applicable rule: disallow: /*.htm, because the rule path is longer and it matches more characters in the URL, so it’s more specific.

http://example.com/page.php5

allow: /page

disallow: /*.ph

Applicable rule: allow: /page, because in case of conflicting rules, Google uses the least restrictive rule.

http://example.com/

allow: /$

disallow: /

Applicable rule: allow: /$, because it’s more specific.

http://example.com/page.htm

allow: /$

disallow: /

Applicable rule: disallow: /, because the allow rule only applies on the root URL.

Specifying Your Sitemap

You should always try to specify the sitemap in the robots.txt file as it helps Google to find it. 

You do this by showing crawlers where it is, like this:

User-agent: *
Disallow: /wp-admin/

Sitemap: https://www.example.com/sitemap_index.xml

You can specify multiple sitemaps too:

User-agent: *
Disallow:

Sitemap: https://www.example.com/people.xml
Sitemap: https://www.example.com/blog-posts.xml

Always use the absolute (complete) URL for the sitemap, not just /sitemap.xml

Specifying A Crawl Delay

You can also use robots.txt to ask crawlers, unofficially, to crawl your website more slowly if they are overloading or even crashing the site. You do this by adding a crawl delay, like this: Crawl-delay: 8

You can see how it works in action here, with the crawl delay only applicable to Bingbot in this robots.txt file:

User-agent: *
Disallow: /search/
Disallow: /compare/

User-agent: BingBot
Disallow: /search/
Disallow: /compare/
Crawl-delay: 10

Sitemap: https://www.example.com/sitemap.xml

Google’s crawler, Googlebot, does not support the Crawl-delay directive, so there’s no point defining a crawl delay for Google. Manage Googlebot’s crawl rate in Search Console instead.

The Crawl-delay directive is an unofficial directive used to prevent overloading servers with too many requests. 

Adding Crawl-delay to your robots.txt file is only a temporary fix.

Robots.txt Best Practices & Tips

  • Google has indicated that a robots.txt file is generally cached for up to 24 hours. It’s important to take this into consideration when you make changes in your robots.txt file.
  • Google applies the most specific matching rule (the one with the longest path), while some other crawlers simply read the file from top to bottom and obey the first rule that matches
  • By blocking unimportant pages with robots.txt, Googlebot can spend more of your crawl budget on the pages that actually matter. Don’t block CSS!
  • You can use a robots.txt file to block resource files such as unimportant image, script, or style files, if the page does not need those resources to be rendered. However, if the absence of these resources makes the page harder for Google’s crawler to understand, don’t block them, or search engines will not be able to analyse the page properly.
  • Rules are case-sensitive. For instance, disallow: /file.asp applies to https://www.example.com/file.asp, but not https://www.example.com/FILE.asp.
  • The # character marks the beginning of a comment.
  • Each directive needs to go on a separate line
  • Google no longer supports the NoIndex directive in robots.txt
  • You can only define one group of directives per search engine. Having multiple groups of directives for one search engine confuses them. 
  • If your content is already indexed, blocking access to it in robots.txt will not remove it from Google’s index
  • You need a separate robots.txt file for each subdomain, e.g. https://website.example.com/robots.txt
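Pulling several of these tips together, a minimal robots.txt might look something like this (the paths and sitemap URL are purely illustrative, not a recommendation for every site):

# This group applies to all crawlers. Keep one group of rules per crawler.
User-agent: *
Disallow: /admin/    # rules are case-sensitive, so this does not block /Admin/
Disallow: /search/

# Always use the absolute URL for the sitemap
Sitemap: https://www.example.com/sitemap.xml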

 

WordPress Robots.txt

Ideally, you want to allow and block the following for WordPress installations in robots.txt:

User-Agent: *
Allow: /wp-content/uploads/
Disallow: /wp-content/plugins/
Disallow: /wp-admin/

You can find the robots.txt file in the root directory via FTP, or you can access robots.txt via various plugins such as Yoast SEO. 

Testing Robots.txt

You can test to see what’s blocked using the robots.txt testing tool on Google Search Console: https://www.google.com/webmasters/tools/robots-testing-tool

More Information on Robots.txt

For Joomla – https://docs.joomla.org/Robots.txt_file 

Wix – https://support.wix.com/en/article/editing-your-sites-robotstxt-file 

Shopify – https://help.shopify.com/en/manual/promoting-marketing/seo/editing-robots-txt 

Magento – https://docs.magento.com/user-guide/marketing/search-engine-robots.html

Robots Meta Tag

The robots meta tag is different from the robots.txt file. The robots.txt file is better for excluding whole directories, whereas the robots meta tag is best for individual pages.

It looks like this: <meta name="robots" content="noindex, nofollow">

It goes in the <head> section of a web page and it tells crawlers whether to index and follow a page or not. 

If you don’t add a robots meta tag, the default of index, follow is assumed.

It does not stop search engines from crawling content, but it can stop them from indexing content.

Meta Tag Options

You can use these options with the meta tag – 

  • all – this is the default and is the same as index,follow
  • noindex – don’t index this page
  • noarchive – don’t show a cached link in search results. If you don’t specify this directive, Google may generate a cached page and users may access it through the search results.
  • index – this is the default – you don’t need to add this as it’s already implied
  • follow – Even if the page isn’t indexed, follow all the links on a page and pass on the link juice
  • nofollow – Don’t follow any links on a page or pass along any link equity.
  • noimageindex – Don’t index any images on a page
  • notranslate – don’t offer to translate this page in search results
  • nositelinkssearchbox – don’t show a site links search box for this page
  • nosnippet – Do not show a text snippet or video preview in the search results for this page. A static image thumbnail (if available) may still be visible, when it results in a better user experience. This applies to all forms of search results (at Google: web search, Google Images, Discover).
  • none – Equivalent to using both the noindex and nofollow tags simultaneously.
  • unavailable_after – Search engines should no longer index this page after a particular date.
  • indexifembedded – you can tell Google you’d still like your content indexed when it’s embedded through iframes and similar HTML tags in other pages, even when the content page has the noindex tag.

You can also implement max-snippet, max-image-preview and max-video-preview, which set the maximum number of characters in a snippet, the maximum size of an image preview and the maximum number of seconds of a video to show.
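For example, a tag combining these limits might look like this (the values are purely illustrative):

<meta name="robots" content="max-snippet:50, max-image-preview:large, max-video-preview:30">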

 

Noindex, Nofollow

The noindex, nofollow tag looks like this:

<meta name="robots" content="noindex, nofollow">

This meta robots tag tells crawlers to not index the page in the search engines and to not follow any links.

If the page has already been indexed before, then search engines should remove the page from the search results.

It does not stop search engines from crawling content. 

Shopify Robots Meta Tag

You can add a noindex tag to individual pages using the <head> section of your theme.liquid layout file like this:

{% if handle contains 'page-name' %}

<meta name="robots" content="noindex, follow">

{% endif %}

X-Robots HTTP Header Tag

In order to use the x-robots-tag, you’ll need access to your website’s header via PHP, its .htaccess file, or its server configuration file. If you do not have access to any of these, you will need to use meta robots tags to instruct crawlers.

The x-robots-tag goes in the HTTP response header of a page or file and looks like this:

X-Robots-Tag: noindex, nofollow

You do not need to use both meta robots and the x-robots-tag on the same page – doing so would be redundant.

You can use the x-robots header for pages or files where you can’t use the robots meta tag, such as PDF files, videos and other non-HTML resources. You can also use it to specify crawling directives that are applied globally across a site, and its support for regular expressions allows a higher level of flexibility than the robots meta tag.

You add the X-Robots-Tag to a website’s HTTP response headers via the .htaccess or httpd.conf file on Apache, or via the server configuration (.conf) file on NGINX.

You can use the same options for X-Robots as the meta tag options above. 

Apache

Add X-Robots For Certain File Types

For example, to add a noindex, nofollow X-Robots-Tag to the HTTP response for all .PDF files across an entire site, add the following snippet to the site’s root .htaccess file or httpd.conf file on Apache:

<Files ~ "\.pdf$">
   Header set X-Robots-Tag "noindex, nofollow"
</Files>

Or if you wanted to noindex, noarchive and nosnippet all .doc and .pdf files you would add this in Apache:

<FilesMatch "\.(doc|pdf)$">
  Header set X-Robots-Tag "noindex, noarchive, nosnippet"
</FilesMatch>

You can also apply this to images across the whole site for example in Apache:

<Files ~ "\.(png|jpe?g|gif)$">
  Header set X-Robots-Tag "noindex"
</Files>

Add X-Robots to Block Individual Files

You could also block individual files like this in Apache:

# the htaccess file must be placed in the directory of the matched file.
<Files "unicorn.pdf">
  Header set X-Robots-Tag "noindex, nofollow"
</Files>

The syntax is as follows for Apache to block any individual file:

<FilesMatch "filename">
  Header set X-Robots-Tag "noindex, nofollow"
</FilesMatch>

Blocking Whole Directories with X-Robots

The X-Robots tag isn’t suitable for protecting entire folders in one go on a global level. However, you can create a separate .htaccess file to go in a folder you want to protect. The HTTP header will then apply to all pages and files within that folder. 

Say for example you want to prevent /wp-admin/ folder in WordPress from being indexed. You need to create a new .htaccess file with the following rule:

Header set X-Robots-Tag "noindex, nofollow"

Then you can upload this to the /wp-admin/ folder via FTP. Every page in the /wp-admin/ folder will now serve the X-Robots HTTP header tag with the noindex, nofollow directives. 

Blocking Certain Crawlers with X-Robots

To set the header for an individual crawler, you can do this:

Header set X-Robots-Tag "googlebot: noindex, nofollow"

NGINX

Add X-Robots For Certain File Types

For example, to add a noindex, nofollow X-Robots-Tag to the HTTP response for all .PDF files across an entire site, add the following snippet to the site’s .conf file on NGINX:

location ~* \.pdf$ {
  add_header X-Robots-Tag "noindex, nofollow";
}

Or if you wanted to noindex, noarchive and nosnippet all .doc and .pdf files you would add this in NGINX:

location ~* \.(doc|pdf)$ {
    add_header X-Robots-Tag "noindex, noarchive, nosnippet";
}

You can also apply this to images across the whole site for example in the site’s .conf file on NGINX:

location ~* \.(png|jpe?g|gif)$ {
  add_header X-Robots-Tag "noindex";
}

Add X-Robots to Block Individual Files

You could also block individual files like this in the site’s .conf file on NGINX:

location = /secrets/unicorn.pdf {
  add_header X-Robots-Tag "noindex, nofollow";
}

The syntax is as follows to block any individual file:

location = /path/to/filename {
  add_header X-Robots-Tag "noindex, nofollow";
}

PHP

You can also use PHP to set X-Robots tags like this:

header('X-Robots-Tag: noindex, nofollow');

Or this

<?php header('X-Robots-Tag: index, archive'); ?>

They need to go at the very top of the file, before any output is sent to the browser, in order to work correctly.
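As a minimal sketch (the page content is illustrative), a PHP template that serves a noindex, nofollow header might look like this:

<?php
// Set the header before any HTML or other output is sent,
// otherwise PHP can no longer modify the response headers.
header('X-Robots-Tag: noindex, nofollow');
?>
<!DOCTYPE html>
<html>
<head><title>Example page</title></head>
<body><p>This page is served with a noindex, nofollow X-Robots-Tag header.</p></body>
</html>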

Checking X-Robots HTTP Headers

To check the tag using Google Search Console, go to URL Inspection, and click on Test live URL and View crawled page. You’ll see the information about the HTTP response in the More info section.
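You can also check the headers from the command line with a tool such as curl, where the -I flag fetches only the response headers (the URL below is a placeholder):

curl -I https://www.example.com/documents/example.pdf

If the tag is set correctly, a line such as X-Robots-Tag: noindex, nofollow will appear in the output.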

How To Stop Content From Being Indexed

If you want to stop content being indexed, you:

MUST use the NOINDEX tag
and
you MUST allow search engines to crawl the content.

If search engines CANNOT crawl the content then they CANNOT see the NOINDEX meta tag and therefore CANNOT exclude the content from search results.
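Putting that together, a page you want removed from the index (the /old-page/ path below is hypothetical) should be left crawlable in robots.txt and carry the noindex directive on the page itself, or the equivalent X-Robots-Tag header:

# robots.txt: do not disallow the page, or crawlers will never see the noindex
User-agent: *
Disallow:

<!-- in the <head> of /old-page/ -->
<meta name="robots" content="noindex, follow">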

Need a hand? Check out our SEO Services, get in touch or reach out for a free audit!
