
Support custom robots.txt #5086

Merged: 10 commits merged into master on Jan 16, 2019
Conversation

@humitos (Member) commented Jan 10, 2019

My idea behind supporting this is (a minimal sketch of the flow follows at the end of this comment):

  1. check for a user's custom robots.txt file
  2. if it exists, serve it with our own rules appended at the end
  3. if it does not exist, just return a 404, which allows all agents on all pages

If we agree on this, we will need to remove our NGINX rules from here, here, and here.

Another thing to consider is that we are adding /builds/ to the robots.txt file, so if the user has a /builds/ directory in their documentation, it will be ignored by robots. We should probably split our robots.txt into one for readthedocs.org and another for readthedocs.io. (see Eric's comment below)

Closes #3161
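
A minimal sketch of the flow above, under assumptions: find_user_robots_txt, RTD_ROBOTS_RULES, and the rtd_build_path() lookup are illustrative names, not the identifiers this PR actually uses.

import os

from django.http import Http404, HttpResponse

# Assumed extra rules to append (we add /builds/ per the comment above).
RTD_ROBOTS_RULES = 'Disallow: /builds/\n'


def find_user_robots_txt(project):
    # Assumed lookup: robots.txt at the root of the default version's HTML.
    root = project.rtd_build_path(project.get_default_version())
    return os.path.join(root, 'robots.txt')


def serve_robots_txt(request, project):
    # Step 1: check for a user's custom robots.txt file.
    fullpath = find_user_robots_txt(project)
    if os.path.exists(fullpath):
        # Step 2: serve it with our own rules appended at the end.
        with open(fullpath) as fh:
            content = fh.read() + '\n' + RTD_ROBOTS_RULES
        return HttpResponse(content, content_type='text/plain')
    # Step 3: no custom file; a 404 lets crawlers default to "allow all".
    raise Http404()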

default_robots_fullpath = os.path.join(settings.MEDIA_ROOT, 'robots.txt')

if not version_slug:
    version_slug = project.get_default_version()
Member Author (humitos):

We should consider making the same decision here as for custom 404 pages (#2551 (comment)) about which version to use.

@humitos requested a review from a team, January 10, 2019 11:46
@ericholscher (Member) commented Jan 10, 2019

I believe robots.txt only gets evaluated at the root level of a domain, so serving it under /en/latest/ wouldn't do anything.

A robots.txt file lives at the root of your site.

https://support.google.com/webmasters/answer/6062596?hl=en

@ericholscher (Member) commented:

I think the implementation here needs to be at the NGINX level, and we can only serve one per project, so it either needs to be configured in the YAML/DB or come from the "default version". Not sure of the best implementation.

@ericholscher (Member) commented:

if it does exist, serve it by appending our own at the end, we need to disallow /sustainability/click/

This is only for the .org. I think our existing robots.txt file makes sense on the .org, but I don't believe we need anything custom of our own for subdomains. This is indeed a bug.

@stsewd (Member) commented Jan 10, 2019

Serving from the default version makes sense. Otherwise we would end up adding another setting in the DB allowing the user to choose which version to serve robots.txt from, so I guess managing everything with the default branch is enough.

@ericholscher (Member) commented:

I was actually thinking of a text field in the DB with the contents of the robots.txt file, but serving it off disk from the default version is certainly easier, and doesn't add an additional DB field.

symlink = PublicSymlink(project)
if (settings.DEBUG or constants.PRIVATE in serve_docs) and privacy_level == constants.PRIVATE: # yapf: disable # noqa
symlink = PrivateSymlink(project)
basepath = symlink.project_root
Member:

Will this file ever exist? I feel like we should be finding it from the default version's HTML root, not from the project_root.

Member Author (humitos):

The default version is appended by resolve_path.

At this point, filename is /en/latest/robots.txt in my case. Then I remove the initial / and join it with project_root, which ends up being /home/humitos/rtfd/code/readthedocs.org/public_web_root/test-builds/en/latest/robots.txt (in my local instance), and that file does exist.
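
A hypothetical illustration of that join, using the values from this comment:

import os

filename = '/en/latest/robots.txt'  # produced by resolve_path()
basepath = '/home/humitos/rtfd/code/readthedocs.org/public_web_root/test-builds'

# Strip the leading '/' so os.path.join doesn't discard basepath.
fullpath = os.path.join(basepath, filename.lstrip('/'))
# -> .../public_web_root/test-builds/en/latest/robots.txt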

Member:

Ah, I see. 👍

@humitos force-pushed the humitos/custom-robots-txt branch 2 times, most recently from e233b13 to 0590d04, January 10, 2019 16:16
@@ -22,6 +22,10 @@
handler404 = server_error_404

subdomain_urls = [
url((r'robots.txt$'.format(**pattern_opts)),
Member:

Don't believe we need the format here.
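
For illustration, the same route without the .format() call (the escaped dot is an extra suggestion, and the view name is assumed):

url(r'robots\.txt$', serve_robots_txt),  # serve_robots_txt: assumed view name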

if os.path.exists(fullpath):
return HttpResponse(open(fullpath).read(), content_type='text/plain')

raise Http404()
Member:

I wonder if we want to 404 here, or return a default Allow: *?

Member Author (humitos):

If the robots.txt is not found, it's assumed that the crawler can access all the content. That said, I think it's better to make it explicit.
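
A minimal sketch of the explicit allow-all alternative raised above; an illustration only, not what the PR ships:

from django.http import HttpResponse

# Explicit allow-all robots.txt instead of relying on crawlers treating
# a 404 as "everything is allowed".
DEFAULT_ROBOTS_TXT = 'User-agent: *\nAllow: /\n'


def default_robots_response():
    return HttpResponse(DEFAULT_ROBOTS_TXT, content_type='text/plain')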

@ericholscher (Member) left a comment:

This looks good. Needs a guide for users, and then I think we can ship it. 👍

If we add similar logic for the 404, we should also write up a blog post about it.

@agjohnson agjohnson added Needed: design decision A core team decision is required PR: work in progress Pull request is not ready for full review labels Jan 10, 2019
@ericholscher ericholscher added Accepted Accepted issue on our roadmap and removed Needed: design decision A core team decision is required labels Jan 10, 2019
"""
if project.privacy_level == constants.PRIVATE:
# If project is private, there is nothing to communicate to the bots.
raise Http404()
Contributor:

Related: what do we do if the project's default version is private? It seems we'd be exposing something potentially private without this check.

Member Author (humitos):

Good point.

I think we need to make a decision here:

  1. expose the robots.txt (I don't really think this will "expose anything sensitive"); even if your default version is private, you will want to communicate what to do with the other ones.
  2. disallow the whole site (doesn't make too much sense to me)
  3. other?

I'd go for 1).

Member:

Exposing the robots.txt exposes the fact that the project exists, which is definitely a security issue in some cases.

Member Author (humitos):

Mmm... good point.

So, for those cases (default version private or project private), we should probably return 404. What do you think?

Member:

Seems safest, especially since it's a new feature. If we get more requests from users we can add more logic here, but doing the safest thing to start feels right.

Member Author (humitos):

I added more cases, like when the version is not active or is not built, where I think it also makes sense to return 404.
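
A hedged sketch of the guard clauses discussed in this thread; the Version attributes (privacy_level, active, built) are assumed from the Read the Docs models of this era:

from django.http import Http404

from readthedocs.projects import constants


def robots_txt_guards(project):
    # A private project's robots.txt would reveal that the project exists.
    if project.privacy_level == constants.PRIVATE:
        raise Http404()

    version = project.versions.get(slug=project.get_default_version())
    # Same reasoning for a private default version; and nothing is served
    # for an inactive or unbuilt version, so there is nothing to crawl.
    if (
        version.privacy_level == constants.PRIVATE
        or not version.active
        or not version.built
    ):
        raise Http404()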

@humitos (Member Author) commented Jan 14, 2019

This looks good. Needs a guide for users, and then I think we can ship it. +1

I wrote the documentation as a FAQ. Please take a look and let me know if you think it should be written another way.

@humitos humitos removed the PR: work in progress Pull request is not ready for full review label Jan 14, 2019
@ericholscher (Member) left a comment:

I think this could pretty easily be a Guide instead of a FAQ, though I don't feel strongly. FAQs just feel like something that will not be found as easily as a guide on the topic. Not going to block shipping on it.

This looks 💯 for the .org, but @agjohnson might have other concerns around privacy, so probably good to get his thoughts before merge.

@davidfischer (Contributor) commented:

This logic could be extended for the favicon.ico

It could also be extended for sitemap.xml!

@davidfischer (Contributor) left a comment:

This looks great, and I'm excited by this as I think it will give docs authors a little more power over their SEO and appearance to search engines. For small projects, that probably isn't a huge deal, but for bigger ones I think it's important.

@humitos (Member Author) commented Jan 14, 2019

It could also be extended for sitemap.xml!

Yes! We have an issue for this at #557. I want to work on it sooner rather than later.

@humitos (Member Author) commented Jan 14, 2019

OK! Now that we have consensus on this, I will add some test cases to be safe with the logic and merge it after that.

@humitos humitos added the Needed: tests Tests are required label Jan 14, 2019
@humitos humitos self-assigned this Jan 15, 2019
@humitos humitos removed the Needed: tests Tests are required label Jan 16, 2019
@humitos (Member Author) commented Jan 16, 2019

Tests added. I'm merging after tests pass.

@humitos humitos merged commit f06271b into master Jan 16, 2019
The delete-merged-branch bot deleted the humitos/custom-robots-txt branch, January 16, 2019 16:25