Non-encoding of URL queries causing Regex robots.txt rule matches to be missed #28
Comments
Sounds like a good fix. I do wonder what the right behavior should be if the rule is |
I think as long as there is consistency in how rules and URLs are encoded, then decoding won't be necessary. I couldn't find any official word on the right behaviour, but I'd expect (for example)

Your point made me look into the code a little further, and I notice that the "query" part is also not touched for a rule string. E.g.:
Asterisks (as well as a few other special characters) are ignored, but
In this way, any URL, regardless of formatting, should match a corresponding rule, regardless of formatting. Because they'll all be encoded in the same way.
So perhaps it's also worth considering encoding the "params" and "fragment" parts in the same way. They probably wouldn't actually be present in any rules, but |
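As a rough sketch of what "encoded in the same way" could mean in practice: run both the rule pattern and the URL through the same percent-encoding before matching. The safe-character set below is purely illustrative, not Protego's actual choice.

```python
from urllib.parse import quote

# Illustrative only: apply the *same* percent-encoding to both the rule
# pattern and the URL before matching, so "q=" and "q%3D" can't diverge.
def normalize(text, safe="/%*$?&@+,;:~"):
    return quote(text, safe=safe)

rule = normalize("/en-uk/*q=*relevance*")
url = normalize("/en-uk/new-wildlife-range/c/4018?q=%3Arelevance%3Atype%3AHedgehog%2BFood")
print(rule)  # /en-uk/*q%3D*relevance*
print(url)   # /en-uk/new-wildlife-range/c/4018?q%3D%3Arelevance%3Atype%3AHedgehog%2BFood
```

Because "%" is kept in the safe set, existing escapes such as %3A are left alone, while any bare "=" is encoded on both sides, so the two forms stay comparable.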
It looks like Google published a standard last September, which seems to detail how the encoding/decoding should be handled. Now it is a matter of actually following it. |
Hi, we have recently switched from Reppy to Protego for our robots.txt parser. All seems fine, except we noticed a few differences between Reppy and Protego in the URLs we are crawling - essentially, Protego appeared to be allowing access to URLs that should be blocked. It seems that Protego follows the Google specification and Reppy does not, so differences should be expected. However, we noticed that the official Google Robots Tester blocked access to these URLs also - so there seems to be an error here.
The string in the robots.txt file that appears to be ignored is
/en-uk/*q=*relevance*
and an example of a URL that was not being filtered by this string is /en-uk/new-wildlife-range/c/4018?q=%3Arelevance%3Atype%3AHedgehog%2BFood
Here is the output from Google Robots Tester showing that this URL should be blocked by the aforementioned string:
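The tester's screenshot isn't reproduced here, but the expected outcome can be approximated with a small standalone check. The rule_to_regex helper below is a rough stand-in for how REP-style matchers handle the * wildcard and the $ anchor; it is not Protego's actual code.

```python
import re

def rule_to_regex(rule):
    # Escape the rule, then restore "*" as ".*" and a trailing "$" as an
    # end-of-string anchor, roughly as robots.txt matchers do.
    pattern = re.escape(rule).replace(r"\*", ".*")
    if pattern.endswith(r"\$"):
        pattern = pattern[:-2] + "$"
    return re.compile(pattern)

rule = "/en-uk/*q=*relevance*"
url = "/en-uk/new-wildlife-range/c/4018?q=%3Arelevance%3Atype%3AHedgehog%2BFood"
print(bool(rule_to_regex(rule).match(url)))  # True -> the URL should be disallowed
```

Matching the raw rule against the raw URL succeeds; the miss only appears once the two sides are encoded inconsistently, as described below.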
Having looked at the Protego code, we believe that we have found where this apparent error comes from. We also think we have a fix for it, and will happily submit the fix for your scrutiny as we'd like to know if there are unforeseen consequences from this change.
The problem involves the ASCII hex-encoding of the URL string. Protego splits the URL into parts, e.g.:
It then encodes symbols in the "path" part of this, removes the "scheme" and "netloc" parts, and reassembles the URL to compare with all the rules in robots.txt. The issue we're seeing is that it only encodes the symbols in the "path" part. The "query" part is left alone.
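A sketch of that behaviour using urllib.parse directly (Protego's internals may differ in detail, and the host example.com is just a placeholder, but the effect on this URL is the same):

```python
from urllib.parse import urlparse, quote

url = "https://example.com/en-uk/new-wildlife-range/c/4018?q=%3Arelevance%3Atype%3AHedgehog%2BFood"
parts = urlparse(url)
# parts.path  = '/en-uk/new-wildlife-range/c/4018'
# parts.query = 'q=%3Arelevance%3Atype%3AHedgehog%2BFood'

# Only the path is percent-encoded; the query is reattached untouched.
path = quote(parts.path, safe="/%")
reassembled = path + ("?" + parts.query if parts.query else "")
print(reassembled)
# /en-uk/new-wildlife-range/c/4018?q=%3Arelevance%3Atype%3AHedgehog%2BFood
```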
We end up with this as the URL to be checked:
/en-uk/new-wildlife-range/c/4018?q=%3Arelevance%3Atype%3AHedgehog%2BFood
When a regex search is applied to it using this string:
/en-uk/.*?q%3D.*?relevance.*?
a match isn't found, because the = in the URL hasn't been encoded to %3D.
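To make the mismatch concrete, here is a minimal reproduction; re.search stands in for whatever matching Protego performs internally:

```python
import re

pattern = re.compile(r"/en-uk/.*?q%3D.*?relevance.*?")
unencoded_query_url = "/en-uk/new-wildlife-range/c/4018?q=%3Arelevance%3Atype%3AHedgehog%2BFood"
print(bool(pattern.search(unencoded_query_url)))  # False: "q=" never becomes "q%3D"
```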
The fix we have is simple: it just encodes the "query" part in the same way as the "path" part. So instead we end up with this URL:
/en-uk/new-wildlife-range/c/4018?q%3D%3Arelevance%3Atype%3AHedgehog%2BFood
With that change, the URL is matched by the regex string correctly, and crawler access is blocked.
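A sketch of the effect of that fix, assuming the query is passed through the same quote() call as the path (the safe set "/%" is an assumption for illustration, not the exact patch):

```python
import re
from urllib.parse import quote

pattern = re.compile(r"/en-uk/.*?q%3D.*?relevance.*?")
path = "/en-uk/new-wildlife-range/c/4018"
query = "q=%3Arelevance%3Atype%3AHedgehog%2BFood"

# Encode the query with the same safe set as the path ("%" kept so
# existing escapes aren't double-encoded), then reattach it.
fixed_url = quote(path, safe="/%") + "?" + quote(query, safe="/%")
print(fixed_url)
# /en-uk/new-wildlife-range/c/4018?q%3D%3Arelevance%3Atype%3AHedgehog%2BFood
print(bool(pattern.search(fixed_url)))  # True: the rule now matches, so access is blocked
```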
Is this likely to cause any unforeseen issues?
Thanks