A robots.txt file, also known as the robots exclusion protocol or standard, is a small text file that lives in the root directory of a website. Designed to work with search engines, it has also become an easy SEO win waiting to be claimed. The robots.txt file acts as a guideline for search engine crawlers, telling them which pages, files or folders can be crawled and which ones cannot.
To view a robots.txt file, simply type in the root domain and then add /robots.txt to the end of the URL.
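For example, a typical robots.txt might contain nothing more than the lines below (the blocked path and sitemap URL here are purely illustrative):
User-agent: *
# Keep crawlers out of a private area of the site
Disallow: /private/
# Point crawlers to the XML sitemap
Sitemap: https://www.example.com/sitemap.xml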
Why Is a robots.txt File Important for Your Website?
It helps prevent crawling of duplicate pages by search engine bots.
It helps in keeping parts of the website private (i.e. not to show in Search Results).
Using robots.txt prevents server overloading.
It helps prevent wastage of Google’s “crawl budget”.
How to Find Your robots.txt File?
If a robots.txt file has already been created, it can be accessed at www.example.com/robots.txt
How to Create a robots.txt File?
To create a new robots.txt file, open a blank “.txt” document and start writing directives.
For example, if you want to disallow all search engines from crawling your /admin/ directory, it should look similar to this:
User-agent: *
Disallow: /admin/
Where to Save Your robots.txt File?
The robots.txt file needs to be uploaded to the root directory of the domain to which it applies.
For example, to control crawling behaviour on www.bthrust.com, the robots.txt file should be accessible at www.bthrust.com/robots.txt.
Basic Format of robots.txt
Let’s Understand the robots.txt Format Line by Line
1. User-agent
A robots.txt file consists of one or more blocks of commands or directives, each starting with its own user-agent line. This “user-agent” is the name of the specific spider the block addresses. A search engine spider will always pick the block that best matches its name.
There are various user-agents, but the most prominent ones for SEO are:
1. Google: Googlebot
2. Bing: Bingbot
3. Baidu: Baiduspider
4. Google Image: Googlebot-Image
5. Yahoo: Slurp
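For example, in the file below Googlebot follows only its own block, while every other crawler follows the * block (the directory name is purely illustrative):
User-agent: *
# Applies to every crawler without a more specific block
Disallow: /example-private/
User-agent: Googlebot
# Googlebot ignores the * block above and follows this one instead
Disallow: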
Note: It is best to copy each crawler’s user-agent name exactly as documented, even though major search engines generally match the user-agent value case-insensitively (the paths in your directives, by contrast, are case-sensitive). Google’s documented user-agent is “Googlebot”, not “googlebot”, so the following is not the recommended form:
User-agent: googlebot
Disallow:
The recommended form would be:
User-agent: Googlebot
Disallow:
2. Sitemap Directive
This directive is used to specify the location of your sitemap(s) for the search engines.
Declaring an XML Sitemap in robots.txt gives search engines a supplementary signal that the sitemap exists and where to find it.
Sitemap includes the pages you want the search engines to crawl and index. The code should look like this:
Sitemap: https://www.example.com/sitemap.xml
The sitemap tells the search engine crawlers how many pages there are to be crawled, when each page was last modified, how important each page is relative to the others, and how often the page is likely to be updated.
The sitemap directive does not need to be repeated or duplicated for each and every user-agent; it applies to all of them.
It is best to include sitemap directives either at the start or towards the end of the robots.txt file.
A file with the sitemap directive at the beginning should look like:
Sitemap: https://www.example.com/sitemap.xml
A file with the sitemap directive at the end should look like:
User-agent: Googlebot
Disallow: /blog/
Allow: /blog/post-title/
Sitemap: https://www.example.com/sitemap.xml
3. Wildcard/Regular Expressions
The star (*) wildcard is used to assign directives to all user-agents.
Every time a new user-agent is declared, it acts like a clean slate. Essentially, the directives declared for the first user-agent do not apply to the second, or third and so on.
A crawler follows only the group of rules that most specifically matches its own name.
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
The above rules block all bots except Googlebot from crawling the site.
4. Some Starter Tips:
Each directive should start on a new line.
Incorrect
User-agent: * Disallow: /directory/ Disallow: /another-directory/
Correct
User-agent: *
Disallow: /directory/
Disallow: /another-directory/
Wildcards (*) can be used both to apply directives to all user-agents and to match URL patterns within those directives.
User-agent: *
Disallow: /products/it-solutions
Disallow: /products/seo-solutions
Disallow: /products/graphic-solutions
Listing every URL like this is not very effective, and it is best to keep the rules as simple as possible, as shown below, by blocking all files and pages in the /products/ directory with a single directive:
User-agent: *
Disallow: /products/
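A wildcard can also sit inside the URL pattern itself. As a sketch, assuming the site uses a hypothetical ?sort= parameter on its listing pages, the rule below would block every URL that contains that parameter:
User-agent: *
# Blocks e.g. /products/?sort=price or /services?sort=newest (illustrative URLs)
Disallow: /*?sort=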
Use the $ sign to mark the end of the URL path when you want to allow or disallow particular file types, such as PDFs, for the search engines.
User-agent: *
Allow: /*.pdf$
Disallow: /*.jpg$
Each user-agent should be declared only once. Search engines simply compile all the blocks for the same user-agent into one and follow all of the rules, so repeating it only adds clutter, as shown below.
User-agent: Bingbot
Disallow: /a/
User-agent: Bingbot
Disallow: /b/
The above code should be written as follows:
User-agent: Bingbot
Disallow: /a/
Disallow: /b/
The search engine will not crawl either of these folders in both cases, but it is still far more beneficial to be direct and concise. The chances of mistakes and errors are also reduced when there are fewer commands to write and follow.
In the case of a missing robots.txt file, search engine crawlers crawl through all the publicly available pages of the website and add them to their index.
Even if a URL is not disallowed in robots.txt and does not appear in the XML sitemap, it can still be indexed by search engines unless a noindex robots meta tag is implemented on that page.
If search engines cannot understand the directives in the file for any reason, bots can still access the website and will simply disregard the directives in the robots.txt file.
Use a single robots.txt file for all subdirectories under a single domain.
5. Non-Standard robots.txt Directives
The Allow and Disallow commands are not case-sensitive; the values, however, are case-sensitive. As shown below, /photo/ is not the same as /Photo/, but Disallow is the same as disallow.
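A quick sketch, using a hypothetical /photo/ directory:
User-agent: *
# Blocks /photo/holiday.jpg but NOT /Photo/holiday.jpg
# Writing "disallow" in lowercase would work exactly the same way
Disallow: /photo/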
There can be more than one Disallow directive, specifying which segments of the website the spider cannot access.
An empty Disallow directive allows the spider to access all segments of the website, as it essentially means nothing is being disallowed. The command would look like:
User-agent: *
Disallow:
To block all search engines that listen to robots.txt from crawling your site, the command would look like:
User-agent: *
Disallow: /
The “Allow” directive was not originally part of the standard, but most search engines now follow it, making it simple to allow one page inside a disallowed directory:
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Without the “Allow” directive, one would have to disallow each file individually instead of blocking the whole directory, which is a tedious task.
Write precise “Allow” and “Disallow” commands, otherwise the two may conflict with each other:
User-agent: *
Disallow: /blog/
Allow: /blog
In Google and Bing, the directive with the longest matching path (the most characters) is followed.
Bthrust.com Example
User-agent: *
Disallow: /blog/
Allow: /blog
With the above code, bthrust.com/blog/ and the pages in the blog folder will be disallowed in spite of an Allow directive (5 characters, /blog) for such pages, because the Disallow directive has a longer path value (6 characters, /blog/).
Most Commonly Used robots.txt Commands
No access for all crawlers
User-agent: *
Disallow: /
All access for all crawlers
User-agent: *
Disallow:
Block one sub directory for all crawlers
User-agent: *
Disallow: /folder/
Block one sub directory for all crawlers with only one file allowed
User-agent: *
Disallow: /folder/
Allow: /folder/page.html
Block one file for all crawlers
User-agent: *
Disallow: /this-file-is-not-for-you.pdf
Block one file type for all crawlers
User-agent: *
Disallow: /*.pdf$
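Putting several of these rules together, a complete robots.txt for a typical site might look something like the following (all paths and the sitemap URL are illustrative):
User-agent: *
# Block a private directory, but allow one page inside it
Disallow: /admin/
Allow: /admin/public-page.html
# Block all PDF files
Disallow: /*.pdf$
Sitemap: https://www.example.com/sitemap.xml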
Uses of a robots.txt File
Web pages: For web pages, robots.txt can be used to regulate crawling traffic and avoid crawling of unimportant or similar pages on the website. However, robots.txt should not be used to hide web pages from Google, as other pages can point to the hidden page with descriptive text, and the page would then be indexed without being visited.
Media files: robots.txt can be used to manage crawl traffic and to prevent image, video and audio files from appearing in Google search results. This, however, does not stop other users or pages from linking to the file in question.
Resource files: robots.txt can be used to block resource files like certain images, scripts, or style files. Google's crawler might find it harder to understand the web page in the absence of such resources, and that can lower its rankings.
Why Your WordPress Needs a robots.txt File
Every search engine bot has a maximum crawl limit for each website, i.e. X number of pages to be crawled in a crawl session. If the bot is unable to go through all the pages on a website, it will come back and continue crawling in the next session, and that slows down how quickly your pages get indexed and can hamper your website’s rankings.
This can be fixed by disallowing search bots from crawling unnecessary pages like admin pages, private data, etc.
Disallowing unnecessary pages saves the crawl quota for the site, which in turn helps the search engines crawl more pages on the site and index them faster than before.
A default WordPress robots.txt should look like this:
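WordPress generates this default file virtually, on the fly; its contents are usually along the following lines, though they can vary with your setup:
User-agent: *
# Keep crawlers out of the admin area, but allow the AJAX endpoint
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php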
To add more rules, create a new text file named “robots.txt” and upload it to replace the previous virtual file. This can be done in any text editor, as long as the file is saved in .txt format.
Creating a New WordPress robots.txt File:
Below, we explain three methods of implementing robots.txt.
Method 1: Yoast SEO
The most popular SEO plug-in for WordPress is Yoast SEO, due to its ease of use and performance.
Yoast SEO allows you to optimize your posts and pages to ensure the best usage of your keywords.
It’s Doable in 3 Simple Steps
Step 1. Enable the advanced settings toggle from the Features tab in the Yoast SEO dashboard.
Note: Yoast SEO has its own default rules, which override any existing virtual robots.txt file.
The Right Solution for Every Business
Do you want your business to reach new heights? If you do, we can certainly help your business with the perfect blend of SEO and custom software solutions. In fact, we have helped many businesses achieve massive success over the years with our solutions.