A robots.txt file, also known as the robots exclusion protocol or robots exclusion standard, is a small text file placed at the root of a website. It was designed to work with search engines, and when used well it also supports SEO. The robots.txt file acts as a guideline for search engine crawlers, telling them which pages, files or folders they may crawl and which they may not.
To view a robots.txt file, simply type in the root domain and add /robots.txt to the end of the URL.
Why Is robots.txt Important for Your Website?
It helps prevent crawling of duplicate pages by search engine bots.
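It helps keep parts of the website out of search results. It is not a privacy mechanism, since the file itself is public and a blocked URL can still be indexed if other pages link to it.
Using robots.txt helps prevent server overload from crawler traffic.
It helps prevent wastage of Google’s “crawl budget.”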
How to Find Your robots.txt File?
If a robots.txt file has already been created then it can be accessed through www.example.com/robots.txt
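The same lookup can be done in code. The sketch below derives the robots.txt address from any page URL on a site; the function name robots_txt_url and the sample page URL are assumptions made for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def robots_txt_url(page_url: str) -> str:
    """Return the robots.txt URL for the host that serves page_url."""
    parts = urlsplit(page_url)
    # Keep the scheme and host, replace the path, drop query and fragment.
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


location = robots_txt_url("https://www.example.com/blog/post?id=1")
print(location)
```

Because the file always lives at the root, the path and query string of the page are simply discarded.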
How to Create a robots.txt File?
To create a new robots.txt file, open a blank “.txt” document and start writing directives.
For example, if you want to disallow all search engines from crawling your /admin/ directory, it should look similar to this:
User-agent: *
Disallow: /admin/
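A rule like this can be tested locally before it is uploaded. The sketch below uses Python's standard-library urllib.robotparser, which is one parser among many and may differ from a given search engine; the bot name AnyBot and the example.com URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def can_crawl(rules: list[str], url: str, agent: str = "AnyBot") -> bool:
    """Parse the given robots.txt lines and check one URL for one agent."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch(agent, url)


good_rules = ["User-agent: *", "Disallow: /admin/"]
admin_allowed = can_crawl(good_rules, "https://www.example.com/admin/login")
about_allowed = can_crawl(good_rules, "https://www.example.com/about")

# This parser skips an unknown field name such as "User agent" (no hyphen),
# so the Disallow line after it never becomes part of a group.
typo_rules = ["User agent : *", "Disallow: /admin/"]
typo_allowed = can_crawl(typo_rules, "https://www.example.com/admin/login")

print(admin_allowed, about_allowed, typo_allowed)
```

The last check illustrates why the field name should be spelled exactly as User-agent: with this parser, the misspelled version leaves /admin/ unprotected.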
Where to Save Your robots.txt File?
The robots.txt file needs to be uploaded to the root directory of the domain to which it applies.
In order to control crawling behaviour on www.bthrust.com, the robots.txt file should be accessible from the root of that domain.
Basic Format of robots.txt
Let's Understand the robots.txt Format Line by Line
1. User-agent
A robots.txt file consists of one or more blocks of directives, each starting with its own user-agent line. The “user-agent” is the name of the specific spider the block addresses. A search engine spider will always pick the block that best matches its name.
There are various user-agents, but the most prominent ones for SEO are:
1. Google: Googlebot
2. Bing: Bingbot
3. Baidu: Baiduspider
4. Google Image: Googlebot-Image
5. Yahoo: Slurp
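Note: Contrary to a common belief, user-agent names are not case sensitive in robots.txt. The Robots Exclusion Protocol (RFC 9309) requires crawlers to match the name case-insensitively, and Google documents the same behaviour, so Googlebot treats the following block as addressed to it:
User-agent: googlebot
Disallow:
It is still good practice to write the name the way the search engine publishes it, so the conventional form would be:
User-agent: Googlebot
Disallow: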
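How a parser treats the capitalisation of a user-agent name can be checked locally. The sketch below uses Python's standard-library urllib.robotparser, which is only one implementation and not a statement about any particular search engine; the /private/ path and the example.com URL are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

# The group is written in lower case on purpose.
rules = """\
User-agent: googlebot
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

url = "https://www.example.com/private/report.html"
blocked_for_googlebot = not parser.can_fetch("Googlebot", url)
blocked_for_bingbot = not parser.can_fetch("Bingbot", url)

print(blocked_for_googlebot, blocked_for_bingbot)
```

With this parser, the lower-case group still applies to a crawler that identifies itself as Googlebot, while a crawler with a different name is unaffected.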
2. Sitemap Directive
This directive is used to specify the location of your sitemap(s) for the search engines.
An XML Sitemap declaration in robots.txt provides a supplementary signal regarding the presence of XML Sitemaps for search engines.
A sitemap lists the pages you want the search engines to crawl and index. The directive should look like this:
Sitemap: https://www.example.com/sitemap.xml
The sitemap tells the search engine crawlers how many pages there are to be crawled, when each page was last modified, and how often each page is likely to be updated.
The sitemap directive does not need to be repeated for each user-agent. It applies to all of them.
It is best to place sitemap directives either at the start or at the end of the robots.txt file.
A file with the sitemap directive at the beginning should look like:
Sitemap: https://www.example.com/sitemap.xml
A file with the sitemap directive at the end should look like:
User-agent: Googlebot
Disallow: /blog/
Allow: /blog/post-title/
Sitemap: https://www.example.com/sitemap.xml
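Sitemap declarations can also be read back programmatically. The sketch below uses the site_maps() method of Python's standard-library urllib.robotparser (available from Python 3.8) on the example file; it is an illustration of one parser, not of how search engines process the line.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: Googlebot
Disallow: /blog/
Allow: /blog/post-title/
Sitemap: https://www.example.com/sitemap.xml
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# site_maps() returns every Sitemap URL in the file, whichever
# user-agent group the line happens to sit next to.
sitemaps = parser.site_maps()
print(sitemaps)
```

The result is a list covering the whole file, which reflects the point that the sitemap directive is independent of any user-agent.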
3. Wildcard/Regular Expressions
The star (*) wildcard is used to assign directives to all user-agents.
Every time a new user-agent is declared, it acts like a clean slate. Essentially, the directives declared for the first user-agent do not apply to the second, the third and so on.
A crawler follows only the block whose user-agent line matches its name most specifically.
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
The above rules block all bots except Googlebot from crawling the site.
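The way a crawler chooses between a named block and the * block can be tried out locally. The sketch below assumes a file that blocks every crawler by default and opens the site to Googlebot, and checks it with Python's standard-library urllib.robotparser; the page URL is an assumption for illustration.

```python
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: *
Disallow: /

User-agent: Googlebot
Allow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

page = "https://www.example.com/services/"
googlebot_allowed = parser.can_fetch("Googlebot", page)
bingbot_allowed = parser.can_fetch("Bingbot", page)

print(googlebot_allowed, bingbot_allowed)
```

Googlebot is matched by its own block and ignores the * block entirely, while Bingbot has no block of its own and falls back to the * rules.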
4. Some Starter Tips:
Each directive should start on a new line.
Incorrect
User-agent: * Disallow: /directory/ Disallow: /another-directory/
Correct
User-agent: *
Disallow: /directory/
Disallow: /another-directory/
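The effect of cramming directives onto one line can be seen with a parser. In the sketch below, Python's standard-library urllib.robotparser reads both versions; other parsers may react differently, and the bot name and URL are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_allowed(robots_txt: str, url: str) -> bool:
    """Parse robots_txt and report whether a generic bot may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch("AnyBot", url)


one_line = "User-agent: * Disallow: /directory/ Disallow: /another-directory/"
multi_line = """\
User-agent: *
Disallow: /directory/
Disallow: /another-directory/
"""

url = "https://www.example.com/directory/page.html"
one_line_allowed = is_allowed(one_line, url)
multi_line_allowed = is_allowed(multi_line, url)

print(one_line_allowed, multi_line_allowed)
```

In the one-line version this parser reads everything after the first colon as the user-agent name, so no Disallow rule is ever registered and the page stays crawlable.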
Wildcards (*) can be used to apply directives to user-agents as well as to match URL patterns when declaring the said directives.
User-agent: *
Disallow: /products/it-solutions
Disallow: /products/seo-solutions
Disallow: /products/graphic-solutions
Listing every path separately, however, is not efficient, and it’s best to keep the rules as simple as possible. The following blocks all files and pages in the /products/ directory:
User-agent: *
Disallow: /products/
Use the $ sign to mark the end of the URL path when allowing or disallowing content by file type, such as PDF files.
User-agent: *
Allow: /*.pdf$
Disallow: /*.jpg$
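To see what * and $ mean in a path, the pattern can be translated into a regular expression. This is a minimal sketch of the matching idea rather than any search engine's actual code; the helper names robots_pattern_to_regex and path_matches and the sample paths are assumptions for illustration.

```python
import re


def robots_pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a compiled regex."""
    anchored = pattern.endswith("$")  # a trailing $ pins the end of the URL
    if anchored:
        pattern = pattern[:-1]
    # Escape literal text; each * becomes "any run of characters".
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile(body + (r"\Z" if anchored else ""))


def path_matches(pattern: str, path: str) -> bool:
    # re.match anchors at the start, which mirrors prefix matching of rules.
    return robots_pattern_to_regex(pattern).match(path) is not None


pdf_hit = path_matches("/*.pdf$", "/files/guide.pdf")
pdf_with_query = path_matches("/*.pdf$", "/files/guide.pdf?page=2")
prefix_hit = path_matches("/products/", "/products/seo-solutions")

print(pdf_hit, pdf_with_query, prefix_hit)
```

The second check shows the effect of $: a URL that continues past .pdf no longer matches, whereas a rule without $ matches any URL that starts with the given path.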
Each user-agent should be declared only once. Search engines combine all the rules declared for the same user-agent into one group and follow all of them, as shown below.
User-agent: Bingbot
Disallow: /a/
User-agent: Bingbot
Disallow: /b/
The above code should be written as follows:
User-agent: Bingbot
Disallow: /a/
Disallow: /b/
In both versions the crawler will not crawl either of these folders, but it is still far more beneficial to be direct and concise. The chance of mistakes is also lower when there are fewer directives to write and follow.
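The merging step can be sketched in a few lines. The function below is a simplified illustration of how repeated groups might be combined, not the code of any real crawler; the name merge_groups and the handling of only Allow and Disallow lines are assumptions.

```python
from collections import defaultdict


def merge_groups(robots_txt: str) -> dict[str, list[tuple[str, str]]]:
    """Collect Allow/Disallow rules per user-agent, merging repeated groups."""
    merged = defaultdict(list)
    current_agents: list[str] = []
    reading_agents = False
    for raw in robots_txt.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if ":" not in line:
            continue
        field, value = (part.strip() for part in line.split(":", 1))
        field = field.lower()
        if field == "user-agent":
            if not reading_agents:  # a new group starts here
                current_agents = []
                reading_agents = True
            current_agents.append(value.lower())
        elif field in ("allow", "disallow"):
            reading_agents = False
            for agent in current_agents:
                merged[agent].append((field, value))
    return dict(merged)


split_version = "User-agent: Bingbot\nDisallow: /a/\nUser-agent: Bingbot\nDisallow: /b/\n"
single_version = "User-agent: Bingbot\nDisallow: /a/\nDisallow: /b/\n"

print(merge_groups(split_version))
print(merge_groups(single_version))
```

Both inputs produce the same rule list for Bingbot, which is why the shorter form loses nothing and is easier to maintain.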
If the robots.txt file is missing, search engine crawlers crawl all the publicly available pages of the website and may add them to their index.
If a URL is neither disallowed in robots.txt nor listed in the XML sitemap, it can still be indexed by search engines unless a robots meta tag with noindex is implemented on that page.
If search engines cannot understand the directives in the file for any reason, bots can still access the website and disregard the directives in the robots.txt file.
Use a single robots.txt file for all subdirectories under a single domain.
5. Non-Standard robots.txt Directives
The Allow and Disallow directive names are not case sensitive, but their values are. For example, /photo/ is not the same as /Photo/, but Disallow is the same as disallow.
There can be more than one Disallow directive, specifying which segments of the website the spider cannot access.
An empty Disallow directive allows the spider to have access to all segments of the website as it essentially means nothing is being disallowed and the command would look like:
User-agent: *
Disallow:
To block all search engines that honor robots.txt from crawling your site, the command would look like:
User-agent: *
Disallow: /
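The difference between an empty Disallow and a Disallow of / is a single character, so it is worth checking. The sketch below compares the two files with Python's standard-library urllib.robotparser; the bot name and URL are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser


def may_crawl(rules: list[str], url: str) -> bool:
    """Parse the robots.txt lines and check url for a generic bot."""
    parser = RobotFileParser()
    parser.parse(rules)
    return parser.can_fetch("AnyBot", url)


url = "https://www.example.com/any/page.html"

# An empty Disallow value blocks nothing.
open_site = may_crawl(["User-agent: *", "Disallow:"], url)

# A single slash matches every path on the site.
closed_site = may_crawl(["User-agent: *", "Disallow: /"], url)

print(open_site, closed_site)
```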
“Allow” was not part of the original standard, but most search engines now follow it, which makes it easy to allow one page inside a disallowed directory.
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Without the “Allow” directive, one would have to disallow every other file individually, which is a tedious task.
The “allow” and “disallow” directives must be written precisely, otherwise the two may conflict.
User-agent: *
Disallow: /blog/
Allow: /blog
In Google and Bing, the directive with the most characters is followed.
Bthrust.com Example
User-agent: *
Disallow: /blog/
Allow: /blog
With the above code, bthrust.com/blog/ and the pages in the blog folder will be disallowed in spite of the allow directive (5 characters), because the disallow directive has the longer path value (6 characters).
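The longest-path rule can be written out as a short function. This is a simplified sketch that ignores wildcards and only compares plain path prefixes; the function name is_path_allowed is hypothetical, and letting Allow win a tie in length is an assumption based on the least restrictive rule being preferred.

```python
def is_path_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """Apply the matching rule with the longest path; no match means allowed."""
    best_length = -1
    allowed = True
    for directive, rule_path in rules:
        if not rule_path or not path.startswith(rule_path):
            continue  # an empty value or a non-matching prefix is ignored
        is_allow = directive.lower() == "allow"
        longer = len(rule_path) > best_length
        tie_for_allow = len(rule_path) == best_length and is_allow
        if longer or tie_for_allow:
            best_length = len(rule_path)
            allowed = is_allow
    return allowed


rules = [("Disallow", "/blog/"), ("Allow", "/blog")]

post_allowed = is_path_allowed("/blog/post-title/", rules)
blog_root_allowed = is_path_allowed("/blog", rules)
about_allowed = is_path_allowed("/about", rules)

print(post_allowed, blog_root_allowed, about_allowed)
```

For /blog/post-title/ both rules match, and the 6-character Disallow path beats the 5-character Allow path. The bare /blog URL is matched only by the Allow rule, so it stays crawlable.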
Most Commonly Used robots.txt Commands
No access for all crawlers
User-agent: *
Disallow: /
All access for all crawlers
User-agent: *
Disallow:
Block one sub directory for all crawlers
User-agent: *
Disallow: /folder/
Block one sub directory for all crawlers with only one file allowed
User-agent: *
Disallow: /folder/
Allow: /folder/page.html
Block one file for all crawlers
User-agent: *
Disallow: /this-file-is-not-for-you.pdf
Block one file type for all crawlers
User-agent: *
Disallow: /*.pdf$
Uses of a robots.txt File
Page Type: Description
Web page: For web pages, robots.txt can be used to regulate crawling traffic and to avoid crawling of unimportant or similar pages on the website. robots.txt should not be used to hide web pages from Google, because other pages can link to the hidden page with descriptive text, and the URL could then be indexed without the page ever being crawled.
Media files: robots.txt can be used to manage crawl traffic, and to prevent image, video and audio files from appearing in the Google search results. This, however, doesn’t stop other users or pages from linking to the file in question.
Resource file: robots.txt can be used to block resource files like certain images, scripts, or style files. Google's crawler might find it harder to understand the web page in the absence of such resources, which could hurt the page's rankings.
Why Your WordPress Needs a robots.txt File
Every search engine bot has a maximum crawl limit for each website, i.e. X number of pages to be crawled in a crawl session. If the bot is unable to go through all the pages on a website, it will come back and continue crawling in the next session, which can delay indexing and hamper your website’s rankings.
This can be fixed by disallowing search bots from crawling unnecessary pages like the admin pages, private data etc.
Disallowing unnecessary pages saves the crawl quota for the site, which in turn helps the search engines crawl more pages on the site and index them faster than before.
A default WordPress robots.txt should look like this:
In order to add more rules, one needs to create a new text file named “robots.txt” and upload it to replace the previous virtual file. This can be done in any text editor as long as the file is saved in .txt format.
Creating a New WordPress robots.txt File:
Below we explain 3 methods of implementing robots.txt.
Method 1: Yoast SEO
The most popular SEO plug-in for WordPress is Yoast SEO, due to its ease of use and performance.
Yoast SEO allows the optimization of our posts and pages to ensure the best usage of our keywords.
It’s Doable in 3 Simple Steps
Step 1. Enable the advanced settings toggle on the Features tab of the Yoast dashboard.
Note: Yoast SEO has its own default rules, which override any existing virtual robots.txt file.