Posted On: 04 March 2020
A robots.txt file, also known as the robots exclusion protocol or standard, is a tiny text file that sits in the root of most websites. Designed to work with search engines, it has also become an SEO boost waiting to be used. The robots.txt file acts as a guideline for search engine crawlers, telling them which pages, files, or folders may be crawled and which may not.
To view a robots.txt file, simply type in the root domain and then add /robots.txt to the end of the URL.
www.example.com/robots.txt
User-agent: *
Disallow: /admin/
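If you prefer to check a file programmatically, the minimal Python sketch below (using only the standard library, with example.com as a stand-in domain) fetches and prints a site's robots.txt:

import urllib.request

# Fetch a site's robots.txt and print its contents (example.com is a placeholder)
url = "https://www.example.com/robots.txt"
with urllib.request.urlopen(url) as response:
    print(response.read().decode("utf-8", errors="replace"))

Different search engines crawl with different user-agents; some of the most common are: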
1. Google: Googlebot
2. Bing: Bingbot
3. Baidu: Baiduspider
4. Google Image: Googlebot-Image
5. Yahoo: Slurp
Note: Always use the exact user-agent name as documented by the search engine. The following example is incorrect because Google's user-agent is "Googlebot", not "googlebot":
User-agent: googlebot
Disallow:
The correct example would be:
User-agent: Googlebot
Disallow:
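As an illustration (the paths here are hypothetical), separate groups can target the different crawlers listed above, with a final group for every other bot:

User-agent: Googlebot
Disallow: /not-for-google/
User-agent: Bingbot
Disallow: /not-for-bing/
User-agent: *
Disallow: /private/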
Sitemap: https://www.example.com/sitemap.xml
The Sitemap directive can be placed at the top or the bottom of the robots.txt file, for example:
Sitemap: https://www.example.com/sitemap.xml
User-agent: Googlebot
Disallow: /blog/
Allow: /blog/post-title/
User-agent: Googlebot
Disallow: /blog/
Allow: /blog/post-title/
Sitemap: https://www.example.com/sitemap.xml
User-agent: *
Disallow: /
User-agent: Googlebot
Allow: /
The above rules block all bots except Googlebot from crawling the site.
Incorrect
User-agent: * Disallow: /directory/ Disallow: /another-directory/
Correct
User-agent: *
Disallow: /directory/
Disallow: /another-directory/
For example, to block a handful of specific pages in the /products/ directory, you could list each one:
User-agent: *
Disallow: /products/it-solutions
Disallow: /products/seo-solutions
Disallow: /products/graphic-solutions
Listing every page like this is not very efficient, however, and it is best to keep rules (and any wildcards) as simple as possible. The single rule below blocks all files and pages in the /products/ directory:
User-agent: *
Disallow: /products/
The $ symbol matches the end of a URL. For example, the rules below allow crawlers to fetch PDF files but block all .jpg images:
User-agent: *
Allow: /*.pdf$
Disallow: /*.jpg$
User-agent: Bingbot
Disallow: /a/
User-agent: Bingbot
Disallow: /b/
The above code should be written as follows:
User-agent: Bingbot
Disallow: /a/
Disallow: /b/
Bingbot will still avoid crawling both folders, but it is far more beneficial to be direct and concise. The chances of mistakes and errors are also reduced when there are fewer directives to write and follow.
An empty Disallow value allows crawlers to access the entire site:
User-agent: *
Disallow:
A Disallow value of / blocks the entire site:
User-agent: *
Disallow: /
The Allow directive can open up a path inside a blocked folder, as in this common WordPress example:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
When Allow and Disallow rules conflict, as in the example below, search engines need a way to decide which one wins:
User-agent: *
Disallow: /blog/
Allow: /blog
In Google and Bing, the directive with the longest matching path (the most characters) is the one that is followed.
BThrust.com Example
User-agent: *
Disallow: /blog/
Allow: /blog
With the above code, bthrust.com/blog/ and the pages inside the blog folder will be disallowed in spite of the Allow directive (/blog, 5 characters), because the Disallow directive has the longer path value (/blog/, 6 characters).
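To make the longest-match rule concrete, here is a small illustrative Python sketch (not Google's actual parser, and ignoring wildcards) that picks the winning directive by comparing matching path lengths:

# Simplified model of the "longest matching path wins" rule used by Google and Bing.
def winning_rule(url_path, rules):
    """rules is a list of (directive, path) pairs, e.g. ("Allow", "/blog")."""
    matches = [(directive, path) for directive, path in rules
               if url_path.startswith(path)]
    if not matches:
        return ("Allow", "")  # nothing matches, so crawling is allowed
    return max(matches, key=lambda rule: len(rule[1]))  # longest path wins

rules = [("Disallow", "/blog/"), ("Allow", "/blog")]
print(winning_rule("/blog/post-title/", rules))  # ('Disallow', '/blog/') -> blocked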
Here are some useful robots.txt examples.
Block the entire site for all crawlers:
User-agent: *
Disallow: /
Allow all crawlers to access everything:
User-agent: *
Disallow:
Block a folder for all crawlers:
User-agent: *
Disallow: /folder/
Block a folder but allow one page inside it:
User-agent: *
Disallow: /folder/
Allow: /folder/page.html
Block a single file:
User-agent: *
Disallow: /this-file-is-not-for-you.pdf
Block all PDF files:
User-agent: *
Disallow: /*.pdf$
Page Type | Description
---|---
Web page | For web pages, robots.txt can be used to manage crawl traffic and avoid crawling unimportant or similar pages on the website. However, robots.txt should not be used to hide web pages from Google: other pages can still point to the "hidden" page with descriptive text, and the page could be indexed without ever being visited.
Media files | robots.txt can be used to manage crawl traffic and to prevent image, video, and audio files from appearing in Google search results. This, however, does not stop other users or pages from linking to the file in question.
Resource files | robots.txt can be used to block resource files such as certain images, scripts, or style files. Google's crawler may find it harder to understand a page in the absence of such resources, which can hurt how the page is analysed and ranked.
Every search engine bot has a maximum crawl limit for each website, i.e. a set number of pages to be crawled in a crawl session. If the bot is unable to go through all the pages on a website, it will return and continue crawling in the next session, and that can hamper your website's rankings.
This can be fixed by disallowing search bots from crawling unnecessary pages such as admin pages, private data, and so on.
Disallowing unnecessary pages saves the site's crawl quota, which in turn helps search engines crawl more of the important pages and index them faster than before.
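For instance, a robots.txt sketch like the one below (the paths are purely illustrative) keeps crawlers away from admin screens, internal search results, and checkout pages so that the crawl budget is spent on real content:

User-agent: *
Disallow: /admin/
Disallow: /search/
Disallow: /checkout/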
A default WordPress robots.txt should look like this:
User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
WordPress automatically creates a virtual robots.txt file in the server's main (root) folder when the website is set up.
Thisismywebsite.com -> website
Thisismywebsite.com/robots.txt -> to access robots.txt file
You should see something similar to the code below; it's a very basic robots.txt file:
User-agent: *
Disallow: /wp-admin/
Disallow: /wp-includes/
Allow: /wp-admin/admin-ajax.php
To add more rules, create a new text file named "robots.txt" and upload it as a replacement for the virtual file. This can be done in any text editor, as long as the file is saved in plain .txt format.
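As a sketch, an extended WordPress robots.txt might look like this (the extra rules and the sitemap URL are only examples, not requirements):

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-login.php
Sitemap: https://thisismywebsite.com/sitemap.xml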
The most popular SEO plug-in for WordPress is Yoast SEO, due to its ease of use and performance.
Yoast SEO allows the optimization of our posts and pages to ensure the best usage of our keywords.
It's Doable in 3 Simple Steps
Step 1. Enable the advanced settings toggle from the Features tab in the Yoast dashboard.
Step 2. Go to Tools and then File editor. You will see the .htaccess file and a robots.txt creation button. Clicking "Create robots.txt" opens a text area where the robots.txt file can be modified.
Step 3. Make sure to save any changes made to the robots.txt file so that all your edits are retained.
Other than being a lighter and faster plugin, All in One SEO makes creating a robots.txt file just as easy as Yoast SEO does.
Step 1: Simply navigate to All in One SEO and then to the Feature Manager page on the dashboard.
Step 2: Inside, there is a tool labelled Robots.txt, with a bright Activate button right under it.
Step 3: A new robots.txt screen should pop up; there you can add new rules, make changes, or delete certain rules altogether.
Step 4: All in One SEO also allows the blocking of "bad bots" straight from the plugin.
Step 1: Creating a .txt file is one of the easiest things: simply open Notepad and type in your desired directives.
Step 2: Save the file as a .txt file.
Step 3: Once the file has been created and saved, connect to the website via FTP.
Step 4: Establish the FTP connection to the site.
Step 5: Navigate to the public_html folder.
Step 6: All that is left to do is upload the robots.txt file from your system onto the server.
Step 7: That can be done by simply dragging and dropping it, or by right-clicking the file in the FTP client's local file listing and uploading it.
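If you prefer to script the upload, the sketch below uses Python's built-in ftplib; the host, username, and password are placeholders you would replace with your own FTP credentials.

import ftplib

# Placeholder credentials - replace with your own FTP details
FTP_HOST = "ftp.example.com"
FTP_USER = "username"
FTP_PASS = "password"

with ftplib.FTP(FTP_HOST, FTP_USER, FTP_PASS) as ftp:
    ftp.cwd("public_html")               # move to the site's root folder
    with open("robots.txt", "rb") as f:
        # STOR uploads the local robots.txt to the server
        ftp.storbinary("STOR robots.txt", f)
    print(ftp.nlst("robots.txt"))        # confirm the file is now on the server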
1) When you create or update your robots.txt file, Google picks up the new version automatically; alternatively, you can test it in Google Search Console before you make the changes live.
2) Google Search Console is a collection of tools provided by Google to monitor how your content will appear in search.
3) In Search Console, there is an editor field where we can test our robots.txt.
4) The platform checks the file for any technical errors, and if any are found, they are pointed out for you.
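Before (or alongside) testing in Search Console, you can run a quick local check with Python's built-in urllib.robotparser; note that this parser follows the original robots exclusion standard and may not reproduce every detail of Google's own matching rules, and the URLs below are only examples.

from urllib import robotparser

rp = robotparser.RobotFileParser()
rp.set_url("https://www.example.com/robots.txt")  # example URL
rp.read()                                          # fetch and parse the file

# Check whether a given crawler may fetch a given URL
print(rp.can_fetch("Googlebot", "https://www.example.com/blog/post-title/"))
print(rp.can_fetch("*", "https://www.example.com/wp-admin/"))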
For the website to excel on a global level, one needs to make sure that the search engine bots are crawling only the important and relevant information.
A properly configured robots.txt helps searchers and bots reach the best parts of the domain and supports a rise in search engine rankings.
Regularly check the coverage report in Google Search Console for any issues related to robots.txt updates.
Some Common Issues Are:
Submitted URL blocked by robots.txt - This error is typically caused when a URL blocked by robots.txt is also present in your XML sitemap.
Solution #1: Remove the blocked URL from the XML sitemap.
Solution #2: Check for any disallow rules within the robots.txt file and either allow that particular URL or remove the disallow rule.
You can choose either solution depending on your priorities, i.e. whether you actually want the URL to be blocked or not.
Indexed, though blocked by robots.txt - This warning basically means you have accidentally tried to exclude a page or resource from Google's search results for which disallowing in robots.txt isn't the correct solution; Google found the page from other sources and indexed it anyway.
Solution - Remove the crawl block and instead use a noindex meta robots tag or an X-Robots-Tag HTTP header to prevent indexing.
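As a sketch of what the header approach looks like, a response for a file that should stay out of the index would include the header like this (the content type is only an example):

HTTP/1.1 200 OK
Content-Type: application/pdf
X-Robots-Tag: noindex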
Crawl budget is an important SEO concept that is often neglected. It is, in essence, the number of pages on your domain that a search engine's crawlers will go over in a given period.
The crawl rate is "a tentative balance" between Googlebot's desire to crawl a domain and the need to ensure the server is not overloaded.
Bonus info:
Other Techniques to Optimise Crawl Budget:
<link rel="alternate" hreflang="lang_code" href="url_of_page" />
should be included in the page's <head> section. Even though Google can find alternate language versions of a page on its own, it is better to clearly indicate the language or region of specific pages to avoid wasting crawl budget.
The meta robots tag provides extra functions which are very page-specific in nature and can't be implemented in a robots.txt file. robots.txt lets us control the crawling of web pages and resources by search engines; meta robots tags, on the other hand, let us control the indexing of pages and the crawling of links on a page. Meta tags are most efficient when used to disallow individual files or pages, whereas robots.txt works to its optimum capacity when used to disallow whole sections of a site.
The difference between the two lies in how they function: robots.txt is the standard way of communicating with crawlers and other bots, and it lets you set specific commands that keep crawlers away from the areas of the website that shouldn't be crawled.
Meta robots tags are exactly what the name suggests, a tag. They guide the search engine as to what to follow and what not to follow. Both can be used together, as neither one has any sort of authority over the other.
The meta robots tag should be placed in the <head> section of the website and would look like: <meta name="robots" content="noindex">
<meta name="robots" content="follow">
<meta name="robots" content="nofollow">
<meta name="robots" content="index">
<meta name="robots" content="noindex">
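Multiple directives can also be combined in a single tag, separated by commas, for example:
<meta name="robots" content="noindex, nofollow">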
Do you want your business to touch new heights? If you do, we can certainly help your business with the perfect blend of SEO and custom software solutions. In fact, we have helped many businesses achieve massive success over the years with our solutions.
Drop Us A Line To Know How BThrust Can Turn Your Goals Into Reality. Contact Us For SEO, Custom Software Or Other IT Services We Offer!