
Support custom robots.txt #5086

Merged: 10 commits merged into master on Jan 16, 2019
Conversation

@humitos (Member) commented Jan 10, 2019

My idea behind supporting this is (a minimal sketch of the flow follows at the end of this comment):

  1. check for a user's custom robots.txt file
  2. if it exists, serve it with our own rules appended at the end
  3. if it does not exist, just return a 404, which allows all agents on all pages

If we agree on this, we will need to remove our NGINX rules from here, here, and here.

Another thing to consider is that we are adding /builds/ to the robots.txt file, so if the user has a /builds/ directory in their documentation, it will be ignored by robots. We should probably split our robots.txt into one for readthedocs.org and another for readthedocs.io. (see Eric's comment below)

Closes #3161
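
A minimal sketch of the flow above, under assumptions: find_user_robots_txt, RTD_ROBOTS_RULES, and the rtd_build_path() lookup are illustrative names, not the identifiers this PR actually uses.

import os

from django.http import Http404, HttpResponse

# Assumed extra rules to append (we add /builds/ per the comment above).
RTD_ROBOTS_RULES = 'Disallow: /builds/\n'


def find_user_robots_txt(project):
    # Assumed lookup: robots.txt at the root of the default version's HTML.
    root = project.rtd_build_path(project.get_default_version())
    return os.path.join(root, 'robots.txt')


def serve_robots_txt(request, project):
    # Step 1: check for a user's custom robots.txt file.
    fullpath = find_user_robots_txt(project)
    if os.path.exists(fullpath):
        # Step 2: serve it with our own rules appended at the end.
        with open(fullpath) as fh:
            content = fh.read() + '\n' + RTD_ROBOTS_RULES
        return HttpResponse(content, content_type='text/plain')
    # Step 3: no custom file; a 404 lets crawlers default to "allow all".
    raise Http404()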

default_robots_fullpath = os.path.join(settings.MEDIA_ROOT, 'robots.txt')

if not version_slug:
    version_slug = project.get_default_version()
Member Author (humitos):

We should consider making the same decision here as for custom 404 pages (#2551 (comment)) about which version to use.

@humitos requested a review from a team, January 10, 2019 11:46
@ericholscher (Member) commented Jan 10, 2019

I believe robots.txt only gets evaluated at the root level of a domain, so serving it under /en/latest/ wouldn't do anything.

A robots.txt file lives at the root of your site.

https://support.google.com/webmasters/answer/6062596?hl=en

@ericholscher (Member) commented:

I think the implementation here needs to be at the NGINX level, and we can only serve one per project, so it either needs to be configured in the YAML/DB or come from the "default version". Not sure of the best implementation.

@ericholscher (Member) commented:

if it does exist, serve it by appending our own at the end, we need to disallow /sustainability/click/

This is only for the .org. I think our existing robots.txt file makes sense on the .org, but I don't believe we need anything custom of our own for subdomains. This is indeed a bug.

@stsewd (Member) commented Jan 10, 2019

Serving from the default version makes sense. Otherwise we would end up adding another setting in the DB allowing the user to choose which version to serve robots.txt from, so I guess managing everything with the default branch is enough.

@ericholscher (Member) commented:

I was actually thinking of a text field in the DB with the contents of the robots.txt file, but serving it off disk from the default version is certainly easier, and doesn't add an additional DB field.

symlink = PublicSymlink(project)
if (settings.DEBUG or constants.PRIVATE in serve_docs) and privacy_level == constants.PRIVATE: # yapf: disable # noqa
symlink = PrivateSymlink(project)
basepath = symlink.project_root
Member:

Will this file ever exist? I feel like we should be finding it from the default version's HTML root, not from the project_root.

Member Author (humitos):

The default version is appended by resolve_path.

At this point, filename is /en/latest/robots.txt in my case. Then I remove the initial / and join it with project_root, which ends up being /home/humitos/rtfd/code/readthedocs.org/public_web_root/test-builds/en/latest/robots.txt (in my local instance), and that file does exist.
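
A hypothetical illustration of that join, using the values from this comment:

import os

filename = '/en/latest/robots.txt'  # produced by resolve_path()
basepath = '/home/humitos/rtfd/code/readthedocs.org/public_web_root/test-builds'

# Strip the leading '/' so os.path.join doesn't discard basepath.
fullpath = os.path.join(basepath, filename.lstrip('/'))
# -> .../public_web_root/test-builds/en/latest/robots.txt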

Member:

Ah, I see. 👍

@humitos force-pushed the humitos/custom-robots-txt branch 2 times, most recently from e233b13 to 0590d04, January 10, 2019 16:16
@@ -22,6 +22,10 @@
handler404 = server_error_404

subdomain_urls = [
url((r'robots.txt$'.format(**pattern_opts)),
Member:

Don't believe we need the format here.
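
For illustration, the same route without the .format() call (the escaped dot is an extra suggestion, and the view name is assumed):

url(r'robots\.txt$', serve_robots_txt),  # serve_robots_txt: assumed view name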

if os.path.exists(fullpath):
return HttpResponse(open(fullpath).read(), content_type='text/plain')

raise Http404()
Member:

I wonder if we want to 404 here, or return a default Allow: *?

Member Author (humitos):

If the robots.txt is not found, it's assumed that the crawler can access all the content. That said, I think it's better to make it explicit.
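
A minimal sketch of the explicit allow-all alternative raised above; an illustration only, not what the PR ships:

from django.http import HttpResponse

# Explicit allow-all robots.txt instead of relying on crawlers treating
# a 404 as "everything is allowed".
DEFAULT_ROBOTS_TXT = 'User-agent: *\nAllow: /\n'


def default_robots_response():
    return HttpResponse(DEFAULT_ROBOTS_TXT, content_type='text/plain')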

@ericholscher (Member) left a comment:

This looks good. Needs a guide for users, and then I think we can ship it. 👍

If we add similar logic for the 404, we should also write up a blog post about it.

@agjohnson agjohnson added Needed: design decision A core team decision is required PR: work in progress Pull request is not ready for full review labels Jan 10, 2019
@ericholscher ericholscher added Accepted Accepted issue on our roadmap and removed Needed: design decision A core team decision is required labels Jan 10, 2019
"""
if project.privacy_level == constants.PRIVATE:
# If project is private, there is nothing to communicate to the bots.
raise Http404()
Contributor:

Related: what do we do if the project's default version is private? It seems we'd be exposing something potentially private without this check.

Member Author (humitos):

Good point.

I think we need to make a decision here:

  1. expose the robots.txt (I don't really think this will "expose anything sensitive"); even if your default version is private, you will want to communicate what to do with the other ones.
  2. disallow the whole site (doesn't make too much sense to me)
  3. other?

I'd go for 1).

Member:

Exposing the robots.txt exposes the fact that the project exists, which is definitely a security issue in some cases.

Member Author (humitos):

Mmm... good point.

So, for those cases (default version private or project private), we should probably return 404. What do you think?

Member:

Seems safest, especially since it's a new feature. If we get more requests from users we can add more logic here, but doing the safest thing to start feels right.

Member Author (humitos):

I added more cases, like when the version is not active or is not built, where I think it also makes sense to return 404.
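
A hedged sketch of the guard clauses discussed in this thread; the Version attributes (privacy_level, active, built) are assumed from the Read the Docs models of this era:

from django.http import Http404

from readthedocs.projects import constants


def robots_txt_guards(project):
    # A private project's robots.txt would reveal that the project exists.
    if project.privacy_level == constants.PRIVATE:
        raise Http404()

    version = project.versions.get(slug=project.get_default_version())
    # Same reasoning for a private default version; and nothing is served
    # for an inactive or unbuilt version, so there is nothing to crawl.
    if (
        version.privacy_level == constants.PRIVATE
        or not version.active
        or not version.built
    ):
        raise Http404()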

@humitos (Member Author) commented Jan 14, 2019

This looks good. Needs a guide for users, and then I think we can ship it. +1

I wrote the documentation as a FAQ. Please take a look and let me know if you think it should be written another way.

@humitos humitos removed the PR: work in progress Pull request is not ready for full review label Jan 14, 2019
@ericholscher (Member) left a comment:

I think this could pretty easily be a Guide instead of a FAQ, though I don't feel strongly. FAQs just feel like something that will not be found as easily as a guide on the topic. Not going to block shipping on it.

This looks 💯 for the .org, but @agjohnson might have other concerns around privacy, so probably good to get his thoughts before merge.

@davidfischer (Contributor) commented:

This logic could be extended for the favicon.ico

It could also be extended for sitemap.xml!

@davidfischer (Contributor) left a comment:

This looks great, and I'm excited by this as I think it will give docs authors a little more power over their SEO and appearance to search engines. For small projects, that probably isn't a huge deal, but for bigger ones I think it's important.

@humitos (Member Author) commented Jan 14, 2019

It could also be extended for sitemap.xml!

Yes! We have an issue for this at #557. I want to work on it sooner rather than later.

@humitos (Member Author) commented Jan 14, 2019

OK! Now that we have consensus on this, I will add some test cases to be safe with the logic and merge it after that.

@humitos humitos added the Needed: tests Tests are required label Jan 14, 2019
@humitos humitos self-assigned this Jan 15, 2019
@humitos humitos removed the Needed: tests Tests are required label Jan 16, 2019
@humitos (Member Author) commented Jan 16, 2019

Tests added. I'm merging after tests pass.

@humitos humitos merged commit f06271b into master Jan 16, 2019
The delete-merged-branch bot deleted the humitos/custom-robots-txt branch, January 16, 2019 16:25