
Change robots to deny by default with allowlist #10292

Merged: aduth merged 6 commits into main from aduth-crawling on Mar 25, 2024
Conversation

@aduth (Contributor) commented Mar 22, 2024

🛠 Summary of changes

Updates robots.txt to disallow all crawling by default, with a few explicit exceptions. Also removes the now-redundant robots <meta> tag.

Related Slack discussion: https://gsa-tts.slack.com/archives/C0NGESUN5/p1711054927329839

Why?

  • The previous robots.txt was incomplete and allowed crawling of more pages than we would expect
  • Since very few pages are expected to be crawlable, it's easier to list them explicitly
  • With nearly all pages blocked from crawling, it's no longer necessary to also include a <meta> directive preventing indexing (related resource)

Open questions:

  • Trailing slash? Technically both /en/ and /en are valid. We tend to link to the version without the trailing slash, and it seems reasonable to allow crawling of only one version of each page, not both.
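On the trailing-slash question: in robots.txt pattern matching, a trailing `$` anchors a rule to the end of the URL, so `Allow: /en$` permits /en but not /en/. A simplified sketch of that matching behavior (the method name here is hypothetical, and real crawlers implement the full spec, including `*` wildcards):

```ruby
# Simplified robots.txt rule matching, ignoring "*" wildcards.
# A trailing "$" anchors the rule to the end of the URL path,
# which is how "Allow: /en$" can permit /en without also
# permitting /en/ or /en/anything.
def rule_matches?(rule, path)
  if rule.end_with?('$')
    path == rule.chomp('$') # exact match only
  else
    path.start_with?(rule)  # prefix match
  end
end

rule_matches?('/en$', '/en')    # => true
rule_matches?('/en$', '/en/')   # => false
rule_matches?('/', '/anything') # => true (why "Disallow: /" blocks everything)
```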

📜 Testing Plan

  1. Visit http://localhost:3000/robots.txt
  2. Verify entries match expected crawlable routes

aduth added 3 commits March 22, 2024 14:07

changelog: Bug Fixes, Robots, Improve consistency of robots.txt crawling directives

    Disallow: /
    Allow: /$
    Allow: /es$
    Allow: /fr$
Contributor

Contributor Author

Yeah, you're probably right, we could do this with a custom route + controller and it'd probably be better, and it would also let us use URL route helpers as well.

Contributor

are you suggesting making that a live file served by a Rails controller, or an ERB that we write out to public/ as a build step?

Contributor Author

> are you suggesting making that a live file served by a Rails controller, or an ERB that we write out to public/ as a build step?

I was thinking a controller, and implemented a first pass in 3ee1ae8, though I do like the idea of a static file since it wouldn't be something we expect to change. Then again, we probably don't get much traffic to this so hopefully it's not a big deal either way?
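A controller-backed robots.txt might look roughly like the sketch below. This is an illustrative guess, not the code in 3ee1ae8: the allowlist, method name, and render call are all assumptions.

```ruby
# Hypothetical sketch of a deny-by-default robots.txt built in Ruby.
# The allowlisted locale roots and the "$" end anchors mirror the
# directives discussed in this PR; all names here are illustrative.
ALLOWED_PATHS = ['/', '/es', '/fr'].freeze

def robots_body(allowed_paths = ALLOWED_PATHS)
  lines = ['User-agent: *', 'Disallow: /']
  lines += allowed_paths.map { |path| "Allow: #{path}$" }
  lines.join("\n") << "\n"
end

# In a Rails controller action this could be served as plain text, e.g.:
#   render plain: robots_body, content_type: 'text/plain'
puts robots_body
```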

@aduth aduth marked this pull request as draft March 22, 2024 18:18
@zachmargolis (Contributor) left a comment

LGTM but where's the robots_controller_spec.rb?

@aduth (Contributor Author) commented Mar 25, 2024

> but where's the robots_controller_spec.rb?

I found it hiding behind the "Ready for review" button 😂

Added in 2b15d49

@aduth aduth marked this pull request as ready for review March 25, 2024 13:23
@aduth aduth merged commit dfb19f6 into main Mar 25, 2024
@aduth aduth deleted the aduth-crawling branch March 25, 2024 14:21