WebHost: Add robots.txt #3157
Conversation
It might then be reasonable to just have `return app.send_static_file('robots_archipelago.gg.txt')`
How do you propose testing that outside of the production environment?
Not sure what your current tests look like or which file in `test/webhost/` would be responsible for this (feel free to point me there and I can write something up), but assuming TDD, I'd think it'd roughly be a test of whether `….txt` is served or not.
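A rough shape for such a test, as a sketch: the helper name `should_serve_robots` and the config key are hypothetical stand-ins here, since the real check presumably lives inside the Flask route rather than a free function.

```python
import unittest


def should_serve_robots(config: dict) -> bool:
    """Hypothetical helper: serve /robots.txt (blocking crawlers)
    unless the site has opted in to being indexed."""
    return not config.get("ASSET_RIGHTS", False)


class TestRobotsTxt(unittest.TestCase):
    def test_served_by_default(self):
        # With no config entry, crawlers should be blocked.
        self.assertTrue(should_serve_robots({}))

    def test_suppressed_when_asset_rights_true(self):
        # The main site sets the flag and allows indexing.
        self.assertFalse(should_serve_robots({"ASSET_RIGHTS": True}))


if __name__ == "__main__":
    unittest.main()
```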
I went with the config file route. We have a preexisting …
Gotcha, makes sense!
I don't like controlling the behaviour via import; someone importing the module later and breaking this seems like a likely future bug.
WebHostLib/robots.py (outdated):

```python
@cache.cached()
@app.route('/robots.txt')
def robots():
    configpath = os.path.abspath("config.yaml")
```
Within the function, none of this is needed; just `app.config["ASSET_RIGHTS"]` suffices.
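A sketch of what this suggestion seems to point toward: config loading stays at app startup, and the route only reads `app.config` at request time. This assumes Flask; the 404 behaviour and the inline file body are illustrative, not the PR's exact code.

```python
from flask import Flask, abort

app = Flask(__name__)
# In the real app, config.yaml would already have been loaded at startup.
app.config.setdefault("ASSET_RIGHTS", False)


@app.route('/robots.txt')
def robots():
    # Read the flag at request time; no config loading inside the route.
    if app.config["ASSET_RIGHTS"]:
        abort(404)  # site wants to be indexed, so serve no robots.txt
    return ("User-agent: *\nDisallow: /\n",
            200, {"Content-Type": "text/plain"})
```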
WebHostLib/robots.py (outdated):

```python
configpath = os.path.abspath("config.yaml")
if not os.path.exists(configpath):
    configpath = os.path.abspath(Utils.user_path('config.yaml'))

if os.path.exists(configpath) and not app.config["TESTING"]:
    import yaml
    app.config.from_file(configpath, yaml.safe_load)
```
Suggested change (delete these lines):

```python
configpath = os.path.abspath("config.yaml")
if not os.path.exists(configpath):
    configpath = os.path.abspath(Utils.user_path('config.yaml'))
if os.path.exists(configpath) and not app.config["TESTING"]:
    import yaml
    app.config.from_file(configpath, yaml.safe_load)
```
While this is a great idea, Google's own developer documentation says that it will still index pages blocked by robots.txt if other pages link to it:
They only block it completely if the page has a meta noindex tag, like `<meta name="robots" content="noindex">`.
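There is also a header equivalent, `X-Robots-Tag: noindex`, which Google documents as honoured the same way as the meta tag. A minimal Flask sketch of setting it site-wide (reusing the `ASSET_RIGHTS` key here is an assumption, not something the PR does):

```python
from flask import Flask

app = Flask(__name__)
app.config.setdefault("ASSET_RIGHTS", False)


@app.route('/')
def index():
    return "hello"


@app.after_request
def add_noindex(response):
    # Header equivalent of <meta name="robots" content="noindex">:
    # tells compliant crawlers not to index the page even when
    # other sites link to it.
    if not app.config["ASSET_RIGHTS"]:
        response.headers["X-Robots-Tag"] = "noindex"
    return response
```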
Good catch @powerlord; I didn't know Google did that. Assuming y'all still want to differentiate, it might do to just have that config option show a banner, on by default, so satellite sites will say something at the top like: …
* Add a `robots.txt` file to prevent crawlers from scraping the site
* Added `ASSET_RIGHTS` entry to config.yaml to control whether `/robots.txt` is served or not
* Always import robots.py, determine config in route function
* Finish writing a comment
* Remove unnecessary redundant import and config
What is this fixing or adding?

This adds a `robots.txt` file to the WebHost, which will prevent any bot that respects the file from including the site in search results. The main site, https://archipelago.gg, should set a config value in `config.yaml` to disable `robots.txt` and allow itself to be indexed.

I also added an `ASSET_RIGHTS = false` entry to `config.yaml`. If false, `/robots.txt` will be served. The `app.config` value defaults to false if the file is not present.

How was this tested?

I ran the WebHost locally and accessed the file at `localhost/robots.txt`.
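For context, the thread doesn't show the static file's exact contents, but a `robots.txt` that blocks all compliant crawlers from the whole site conventionally reads:

```
User-agent: *
Disallow: /
```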