Every account has its own page-crawling limit. How can you use that limit wisely, especially if you need certain pages to be crawled? The solution is simple: you can customize our crawler with rules that point it to the pages you want scanned. All you need to do is specify the robots.txt rules in the Site Auditor settings section.
What is robots.txt?
The robots.txt file is one of the primary ways of telling crawlers where they can and can’t go on a website. To customize our bot, you set the appropriate rules in this file.
How to customize robots.txt for our crawler?
There are common rules for all bots. Let's take a closer look!
Disallow: (left empty) or Allow: / - any crawler may scan the whole website without exceptions.
Disallow: / - scanning of the entire website is restricted.
Disallow: /directory/file.html - scanning of a certain file is restricted.
Disallow: /dir/ - our crawler won’t scan the given directory.
Disallow: /dir - scanning is restricted for any URL whose path starts with /dir.
Also, you can specify a start page. What does that mean? Our bot will begin crawling your website from the URL you have specified. You will find this option very useful if you have, for instance, a multilingual website, or one with categories that don’t need to be scanned.
Let’s consider the website ‘example.com’, with English as the main version and French as the second one. If you want our crawler to scan each version separately, you should specify rules in the robots.txt.
English version. First of all, enter the start page (the URL where the crawler should begin scanning). After that, specify rules like this:
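A plausible setup for the English version, assuming the French pages live under /fr/ and the start page field is filled in the Site Auditor settings, might look like:

```
Start page: http://example.com/

User-Agent: RSiteAuditor
Allow: /
```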
In this case, only the pages that match the rule will be scanned.
If you want to retrieve results only for the English version of the website (‘example.com’), you should set the following directives:
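For example, assuming the French version lives under the /fr/ path, the directives could be:

```
User-Agent: RSiteAuditor
Disallow: /fr/
```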
In that case, the French version will be ignored, and you won’t see data for /fr/ pages that might otherwise confuse you.
Another case is when you have dated content on your website, like an infinite calendar. There may be thousands of pages, if not millions, that look like:
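For instance (the exact URL pattern here is an assumption, with date-based pages under a ‘history’ category):

```
http://example.com/history/2016/05/04/
http://example.com/history/2016/05/05/
http://example.com/history/2016/05/06/
...
```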
If you want to disallow the ‘history’ category, you should create a rule like this:
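Assuming the category sits at the /history/ path, the rule would be:

```
User-Agent: RSiteAuditor
Disallow: /history/
```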
If you want to scan only specific years, such as 2015 and 2016, you have to add directives like these:
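A sketch of such rules, assuming the year folders sit directly under /history/ (the more specific Allow directives take precedence over the broader Disallow):

```
User-Agent: RSiteAuditor
Allow: /history/2015/
Allow: /history/2016/
Disallow: /history/
```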
With these permissions, the crawler will scan only the /history/2016/ and /history/2015/ paths within the /history/ directory.
You can also allow or disallow scanning by our crawler in the same robots.txt file where you set the rules for all bots. All you need to do is specify directives for RSiteAuditor (our crawler).
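For example (the /private/ path here is hypothetical), you can keep your general rules for all bots and add a separate group for our crawler:

```
User-Agent: *
Disallow: /private/

User-Agent: RSiteAuditor
Allow: /
```

Crawlers obey the group that matches their user agent, so RSiteAuditor would follow its own rules instead of the general ones.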
Furthermore, you can set the priority for our crawler among other crawling bots using this simple line:
User-Agent: RSiteAuditor # priority!
If for some reason you have restricted access to your website for the RSiteAuditor bot and later want it to scan your website again, don’t forget to check that your robots.txt allows it.