Parse your robots.txt file the same way Google's crawlers do
Choose a Googlebot, paste your robots.txt file into the text area, and enter the path you'd like to check.
You must make sure the path you want to check follows the format specified by RFC 3986, because this library does not perform full normalization of URI parameters: matching is done according to the REP specification only when the URI is in that format. This is exactly the behaviour of Google's open source project.
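For example, one way to put a path into that form before checking it is to percent-encode any characters outside the unreserved set. A minimal sketch, using Python's standard library purely for illustration (this is encoding only, not full normalization):

    from urllib.parse import quote

    # Percent-encode a path so it follows RFC 3986 before checking it,
    # leaving '/' and unreserved characters untouched. Note this does not
    # perform full normalization (e.g. it does not resolve dot-segments).
    path = "/bar/söme folder/file.html"
    print(quote(path, safe="/"))  # /bar/s%C3%B6me%20folder/file.html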
Why does this exist?
- It used to be possible to test specific robots.txt configurations in Search Console, but the tester did not match real googlebot behaviour, and Google eventually deprecated that tool.
- Google published an open source project containing the code their crawlers use to parse robots.txt, but:
  - It needs to be compiled, which requires at least a modicum of C++ skills
  - It doesn't contain the Google-specific logic that googlebot-image and other Google crawlers use
This site's own robots.txt file exposes the problem of using the open source checker without a wrapper. You can copy and paste it here to see that both googlebot and googlebot-image should be DISALLOWED from crawling /bar/. Without handling this kind of case specifically, the open source project will not get this right.
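As an illustration of the pattern (this is not the site's actual robots.txt, just a hypothetical file with the same shape), consider:

    User-agent: googlebot
    Disallow: /bar/

    User-agent: *
    Allow: /

Here googlebot-image has no ruleset of its own, so under Google's documented crawler behaviour it should fall back to the googlebot ruleset and be blocked from /bar/; the raw open source parser, asked only about googlebot-image, falls through to the * ruleset instead and reports the path as allowed.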
How does this tool differ from Google's open source project?
Apart from some minor tweaks to make it available on the web, the only substantive change is to allow it to take an ordered pair of user agents, comma-separated (e.g. googlebot-image,googlebot), in order to mimic how Google's crawlers behave in the wild. You can read the documentation for how they should work here. The key practical differences are that some Google crawlers fall back on googlebot directives if there is no ruleset for their own user agent, and that some only obey directives that specifically target them and ignore User-agent: * rulesets.
We have verified this against real googlebot-image behaviour in the wild, and assume it holds for the other Google crawlers that are described as operating this way.
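To make the fallback concrete, here is a heavily simplified sketch of the decision it describes. This is not Google's matcher and not this tool's code: group parsing and path matching are reduced to the bare minimum (no wildcards, no longest-match precedence), and all names are illustrative.

    def parse_groups(robots_txt):
        """Map each lower-cased user-agent token to its list of (rule, path) pairs."""
        groups, current_agents, expecting_agents = {}, [], True
        for raw in robots_txt.splitlines():
            line = raw.split("#", 1)[0].strip()
            if ":" not in line:
                continue
            field, value = (part.strip() for part in line.split(":", 1))
            field = field.lower()
            if field == "user-agent":
                if not expecting_agents:
                    current_agents = []          # a new group starts here
                current_agents.append(value.lower())
                groups.setdefault(value.lower(), [])
                expecting_agents = True
            elif field in ("allow", "disallow"):
                for agent in current_agents:
                    groups[agent].append((field, value))
                expecting_agents = False
        return groups

    def choose_group(groups, user_agent_pair):
        """Return the ruleset for the first agent in the comma-separated pair
        that has one, falling back to '*' only if neither does."""
        for agent in user_agent_pair.lower().split(","):
            if agent in groups:
                return agent, groups[agent]
        return ("*", groups["*"]) if "*" in groups else (None, [])

Given the pair googlebot-image,googlebot, choose_group returns the googlebot-image ruleset if one exists, otherwise the googlebot ruleset, and only falls back to * after that, which is the fallback described above.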
Who are you anyway?
I'm Will Critchlow. I am CEO of SearchPilot and I am unhealthily interested in how robots.txt files work.
You can follow me on Twitter here or email me: [email protected].
What else do we need to know to use this tool?
If you select Googlebot Image, News or Video, the tool runs the underlying parser with the input googlebot-<sub>,googlebot, which first seeks a robots.txt ruleset targeting the specific crawler and, only if that is not present, parses the robots.txt file as googlebot. The same happens if you enter googlebot-image or similar in the "other" box. You can also provide a comma-separated pair of user agents (with no spaces) in the "other" box; this behaves as described above, seeking a ruleset targeting the first user agent and parsing as the second if none is found.
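Put another way, the crawler selections are just shorthand for the ordered pairs described above. An illustrative mapping, assuming the obvious substitutions for <sub> (the Python dict itself is not the tool's code):

    # Illustrative only: what each selection amounts to as a parser input.
    SELECTION_TO_INPUT = {
        "Googlebot Image": "googlebot-image,googlebot",
        "Googlebot News": "googlebot-news,googlebot",
        "Googlebot Video": "googlebot-video,googlebot",
    }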
If you select an AdsBot or the AdSense option (user agent mediapartners-google) then it will only respect rulesets that specifically target that user agent and will ignore User-agent: * blocks.
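A minimal sketch of that difference, in the same illustrative style as the earlier choose_group example (groups is a mapping from user-agent token to its rules; this is not the real implementation):

    def choose_group_for_ads_crawler(groups, user_agent):
        """AdsBot and mediapartners-google only obey rulesets that name them
        explicitly; the '*' ruleset is ignored entirely."""
        agent = user_agent.lower()
        if agent in groups:
            return agent, groups[agent]
        return None, []  # no specific ruleset -> nothing applies to this crawler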
More Information
- Google's open source robots.txt parser
- My speculation of how Google crawlers like googlebot-image parse robots.txt files (this tool uses a version of the open source parser built from a branch that includes these changes)
- In order to be able to call it from Python, I modified the open source project to output information in a structured way. You can view this branch of my fork here