Generate a sitemap index / Allow custom sitemap.xml #5391

Closed
humitos opened this issue Mar 4, 2019 · 11 comments
Labels
Feature (New feature), Needed: design decision (A core team decision is required)

Comments

@humitos
Member

humitos commented Mar 4, 2019

We already generate a sitemap.xml for all projects by default. However, we don't take into account any sitemap.xml generated by Sphinx.

This issue is the continuation of #557 and this specific comment about creating a global sitemap index at root pointing to the ones that are in subpaths.

Related: #6903

@humitos added the labels Feature (New feature) and Needed: design decision (A core team decision is required) on Mar 4, 2019
@aditya-prayaga

Hi @humitos, does this feature still need amendments, or is it ready to implement? If it is ready, can you provide some insights on how to accomplish this? Thank you.

@humitos
Member Author

humitos commented Mar 11, 2019

@Aditya-369 the issue is under "Design decision" (https://docs.readthedocs.io/en/latest/contribute.html#initial-triage) and still needs some discussion.

However, if you follow the links from the description you will find some extra context and proposals about how to implement it. If you want, you can read them all and make a more specific proposal on how this could be implemented. Discussing a specific proposal would be better and easier. Thanks for the interest!

This is definitely something that we want to have as a good feature.

@strophy

strophy commented Mar 12, 2019

I noticed sitemap.xml started being generated recently, thanks for this extremely useful development! I have just read through the previous discussion and PR. A couple of points (can open separate issues if you prefer):

  • The generated sitemap currently gets the URL by calling get_docs_url, which by definition returns an http address. But this is usually a 301 redirect to an https URL rather than an actual page, which may result in a penalty with some search engines. Sitemaps should only point to actual pages (references: Google, Bing) or the crawler may begin losing trust in the sitemap.
  • The hreflang for regional variations in the sitemap follows the format of the URL language slug generated by Sphinx, e.g. zh_CN for Chinese (China). This is invalid syntax for hreflang in a sitemap (reference), where a hyphen must be used instead, e.g. zh-CN. Alternatively, it is also valid to define the script of the language instead, e.g. zh-Hans for Simplified Chinese.
  • The sort order should prioritise the user-selected default version (Admin > Versions > Default Version) in the backend instead of setting latest to highest priority. In many cases (and also by definition), latest points at development documentation while stable is the version most people should be using, and should appear first in search results.
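
To illustrate the hreflang point, here is a minimal sketch of the conversion, assuming the language slug only needs its underscore replaced with a hyphen. The helper name `to_hreflang` is hypothetical and is not part of Read the Docs.

```python
def to_hreflang(language_slug: str) -> str:
    """Convert a Sphinx-style language slug to hyphenated hreflang syntax.

    Hypothetical helper: it only swaps underscores for hyphens and does
    not validate the language or region codes.
    """
    return language_slug.replace("_", "-")


if __name__ == "__main__":
    for slug in ("zh_CN", "pt_BR", "en"):
        print(slug, "->", to_hreflang(slug))
```

Slugs without a region part, such as `en`, pass through unchanged.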

I am looking forward to further development of this feature and want to contribute if possible. I'm not much of a coder but willing to learn or help testing on my fairly large and complex documentation. My preferred implementation would be to add an option in conf.py to generate a user-controlled sitemap at https://$url/$lang/$version/sitemap.xml and group these together in an automatically generated sitemap index at https://$url/sitemap_index.xml, and then specifying this file in robots.txt.
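
As a rough sketch of the layout proposed here, the following builds a sitemap index that points at one sitemap.xml per language and version. The function name `build_sitemap_index`, the example domain, and the list of (language, version) pairs are assumptions for illustration; this is not how Read the Docs generates its sitemap.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "http://www.sitemaps.org/schemas/sitemap/0.9"


def build_sitemap_index(base_url, lang_versions):
    """Return sitemap index XML with one <sitemap> entry per (lang, version).

    Hypothetical helper: assumes each sitemap lives at
    <base_url>/<lang>/<version>/sitemap.xml.
    """
    root = ET.Element("sitemapindex", xmlns=SITEMAP_NS)
    for lang, version in lang_versions:
        sitemap = ET.SubElement(root, "sitemap")
        loc = ET.SubElement(sitemap, "loc")
        loc.text = f"{base_url.rstrip('/')}/{lang}/{version}/sitemap.xml"
    body = ET.tostring(root, encoding="unicode")
    return '<?xml version="1.0" encoding="UTF-8"?>\n' + body


if __name__ == "__main__":
    print(build_sitemap_index(
        "https://docs.example.com",
        [("en", "stable"), ("en", "latest")],
    ))
```

The resulting file could then be referenced from robots.txt with a `Sitemap:` line, as suggested above.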

@humitos
Member Author

humitos commented Mar 12, 2019

@strophy I appreciate your feedback here.

> The generated sitemap currently gets the URL by calling get_docs_url, which by definition returns an http address

I think the docstring of that method is wrong. For now, it only returns HTTP when it's a custom domain, because we can't guarantee that it has SSL set up (see #4641).

# This is from current production's server
In [1]: docs = Project.objects.get(slug='docs')

In [2]: docs.get_docs_url()
Out[2]: 'https://docs.readthedocs.io/en/stable/'

In [3]: pip = Project.objects.get(slug='pip')

In [4]: pip.get_docs_url()
Out[4]: 'http://pip.pypa.io/en/stable/'

> A couple of points (can open separate issues if you prefer):

Yes, please. This issue is about generating a sitemap index, and your suggestions/reports are about bugs in the current implementation. I'd appreciate it if you created one issue per problem. Thanks!

@skirpichev
Contributor

@humitos, your current sitemap.xml can't be configured from the project side. That would be handy if you want to exclude some versions from indexing (Google Search Console treats it as an error when you submit URLs that are blocked by robots.txt).

@humitos
Member Author

humitos commented Sep 9, 2019

@skirpichev I'm not sure I follow your issue. Can you expand and give an example of what you are trying to do?

@skirpichev
Contributor

@humitos, I'm not sure it's a real issue; maybe it's a minor one. But let's suppose you want to disable certain versions in the readthedocs docs. Your docs suggest this variant with robots.txt. But the project's sitemap.xml will still list these "disallowed" versions. Google Search Console considers this a misconfiguration.

@humitos
Member Author

humitos commented Sep 10, 2019

If you disable Versions from your Project, they are not going to be shown in the sitemap.xml.

This issue is about other, more complex cases. Examples:

  • being able to define your own sitemap.xml instead of RTD generating one automatically for you
  • creating a sitemap index at the root that points to other sitemap.xml from other directories (see https://www.sitemaps.org/protocol.html#index)

@alexdlaird

alexdlaird commented Aug 25, 2020

> If you disable Versions from your Project, they are not going to be shown in the sitemap.xml.
>
> This issue is about other, more complex cases. Examples:
>
>   • being able to define your own sitemap.xml instead of RTD generating one automatically for you
>   • creating a sitemap index at the root that points to other sitemap.xml from other directories (see https://www.sitemaps.org/protocol.html#index)

This is true if you disable a version (make it inactive), but it is not true if you hide a version. As a result, crawlers get confused, because the hidden version is added to Disallow in robots.txt but still remains in sitemap.xml. It's unclear to me whether there is a purpose for this; it seems to me that hidden versions should be removed from the generated sitemap.xml if they're also going to be disallowed, the same as disabled versions. Regardless, I documented my workaround here.

This specific example can be seen in pyngrok's documentation. 4.1.9, for example, is active but hidden, so it does not show up in the menu anymore (but we want permalinks to continue working), yet it still shows up in the auto-generated sitemap.
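
For anyone who wants to check their own project for this mismatch, here is a minimal sketch using Python's standard `urllib.robotparser`. The function name `disallowed_sitemap_urls`, the example domain, and the sample robots.txt content are assumptions for illustration, not actual Read the Docs output.

```python
from urllib.robotparser import RobotFileParser


def disallowed_sitemap_urls(robots_txt, sitemap_urls, user_agent="*"):
    """Return the sitemap URLs that robots.txt blocks for the given agent.

    Hypothetical helper: takes the robots.txt content as a string and
    a list of URLs already extracted from sitemap.xml.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return [url for url in sitemap_urls if not parser.can_fetch(user_agent, url)]


if __name__ == "__main__":
    # Made-up example: a hidden version is disallowed but still listed.
    robots_txt = "User-agent: *\nDisallow: /en/4.1.9/\n"
    sitemap_urls = [
        "https://docs.example.com/en/stable/",
        "https://docs.example.com/en/4.1.9/",
    ]
    for url in disallowed_sitemap_urls(robots_txt, sitemap_urls):
        print("listed in sitemap but disallowed:", url)
```

Any URL this reports is one that a crawler would find in the sitemap but be told not to fetch.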

@humitos changed the title from "Generate a sitemap index" to "Generate a sitemap index / Allow custom sitemap.xml" on Aug 24, 2022
@humitos
Member Author

humitos commented Aug 24, 2022

I just got a support request from a user saying that the sitemap.xml generated by Read the Docs does not work as they expected.

> being able to define your own sitemap.xml instead of RTD generating one automatically for you

I think this should be the way to go, similar to what we do with robots.txt. That way, users could generate their sitemap.xml exactly the way they want if the one created by Read the Docs is not enough for them.

@humitos
Member Author

humitos commented Apr 9, 2024

It seems there is no need to build a feature to allow users to define a custom sitemap.xml since they can just define one by using a custom robots.txt. Example:

User-agent: *
Allow: /

Sitemap: https://docs.example.com/en/stable/sitemap.xml

Read more about this at https://docs.readthedocs.io/en/stable/reference/sitemaps.html#custom-sitemap-xml
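
As a quick way to confirm that a custom robots.txt advertises the intended sitemap, here is a small sketch using the standard library's `RobotFileParser.site_maps()` (available in Python 3.8 and later). The helper name `declared_sitemaps` is hypothetical, and the robots.txt content is the example above with a placeholder domain.

```python
from urllib.robotparser import RobotFileParser


def declared_sitemaps(robots_txt):
    """Return the Sitemap URLs declared in a robots.txt string.

    Hypothetical helper: returns an empty list when none are declared.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.site_maps() or []


if __name__ == "__main__":
    robots_txt = (
        "User-agent: *\n"
        "Allow: /\n"
        "\n"
        "Sitemap: https://docs.example.com/en/stable/sitemap.xml\n"
    )
    print(declared_sitemaps(robots_txt))
```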

I'm closing this issue since we already have documented how to achieve this goal. If you consider there are still missing pieces here, please open new issues.

@humitos closed this as completed on Apr 9, 2024

5 participants