Change robots to deny by default with allowlist #10292
Conversation
changelog: Bug Fixes, Robots, Improve consistency of robots.txt crawling directives
public/robots.txt
Disallow: /
Allow: /$
Allow: /es$
Allow: /fr$
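For context, the $ suffix in these rules is the end-of-URL anchor supported by major crawlers such as Googlebot: Allow: /es$ permits exactly /es, not /es/ or /es/page. A rough Ruby sketch of that matching behavior (simplified; real matchers also handle * wildcards and longest-match precedence between Allow and Disallow):

```ruby
# Simplified robots.txt path matching: a trailing "$" anchors the
# rule to the end of the URL path; otherwise a rule is a prefix match.
# (Real implementations also handle "*" wildcards and resolve
# Allow/Disallow conflicts by longest matching rule.)
def rule_matches?(rule_path, url_path)
  if rule_path.end_with?('$')
    url_path == rule_path.chomp('$')
  else
    url_path.start_with?(rule_path)
  end
end

puts rule_matches?('/es$', '/es')   # anchored rule: exact match only
puts rule_matches?('/es$', '/es/')  # does not match with trailing slash
puts rule_matches?('/', '/help')    # "Disallow: /" prefix-matches everything
```

So the allowlist above exposes only the locale root pages while the Disallow: / prefix rule covers everything else.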
Do we want to try to do this programmatically with https://github.com/18F/identity-idp/blob/4087aba3d76b0607fdf7f441781ff581d203ba0b/lib/idp/constants.rb#L3?
Yeah, you're probably right, we could do this with a custom route + controller and it'd probably be better, and it would also let us use URL route helpers as well.
are you suggesting making that a live file served by a Rails controller, or an ERB that we write out to public/ as a build step?
I was thinking a controller, and implemented a first pass in 3ee1ae8, though I do like the idea of a static file since it wouldn't be something we expect to change. Then again, we probably don't get much traffic to this so hopefully it's not a big deal either way?
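A minimal sketch of what generating the file from the locales constant might look like, assuming a list like the one in lib/idp/constants.rb linked above (the constant name and method here are illustrative, not the actual implementation in 3ee1ae8; a Rails controller action could render the result with render plain:, content_type: 'text/plain'):

```ruby
# Hypothetical locales list standing in for the constant in
# lib/idp/constants.rb (names are assumptions, not the real code).
AVAILABLE_LOCALES = %w[en es fr].freeze

# Builds a deny-by-default robots.txt body, allowing only the root
# page for each locale. A "User-agent: *" group line is assumed.
def robots_txt_body(locales)
  lines = ['User-agent: *', 'Disallow: /', 'Allow: /$']
  lines += locales.map { |locale| "Allow: /#{locale}$" }
  lines.join("\n") + "\n"
end

puts robots_txt_body(AVAILABLE_LOCALES)
```

Generating the body from the constant keeps robots.txt in sync if locales are ever added, which is the main advantage over a hand-maintained static file.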
zachmargolis left a comment:
LGTM but where's the robots_controller_spec.rb?
I found it hiding behind the "Ready for review" button 😂 Added in 2b15d49
🛠 Summary of changes
Updates robots.txt to disallow all by default, with few exceptions. Also removes the redundant robots <meta> tag.

Related Slack discussion: https://gsa-tts.slack.com/archives/C0NGESUN5/p1711054927329839
Why?
- robots.txt was incomplete and allowed crawling beyond what we would expect to be allowed
- The page-level robots meta directive preventing indexing is redundant (related resource)

Open questions:
- Both /en/ and /en are valid. We tend to link to the one without the final trailing slash, and it seems reasonable to allow crawling of only one version of the page and not both.

📜 Testing Plan