fix: return empty noindex webpage when crawlers hit specific pages #8744
Perfect, but could you add a test?
When a crawler hits nested facets (e.g. /category/popcorn-with-caramel/data-quality-error/nutrition-value-total-over-105), we return a blank HTML page with a noindex directive to prevent the crawler from overloading our servers.
The User-Agent has changed: see https://bot.seekport.com/
Codecov Report
@@            Coverage Diff             @@
##             main    #8744      +/-   ##
==========================================
- Coverage   48.78%   48.73%   -0.06%
==========================================
  Files         117      117
  Lines       21882    21908      +26
  Branches     4869     4872       +3
==========================================
+ Hits        10676    10677       +1
- Misses       9903     9927      +24
- Partials     1303     1304       +1
Crawling bots can't visit every page and crawl OFF continuously. We want to limit crawlers to the most interesting pages, so we return a noindex HTML page on most facet pages (except the most interesting ones, such as brand, category, ...).
Kudos, SonarCloud Quality Gate passed!
@@ -564,6 +569,24 @@ sub analyze_request ($request_ref) {
		$request_ref->{text} = 'index-pro';
	}

	# Return noindex empty HTML page for web crawlers that crawl specific facet pages
	if (($request_ref->{is_crawl_bot} eq 1) and (defined $request_ref->{tagtype})) {
The "eq" operator converts both values to strings before comparing; it's best to use == for numbers.
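To illustrate the point about "eq" versus ==, here is a minimal standalone sketch. The `$request_ref` hash below is a hypothetical stand-in for the one in the diff, not the project's actual request object.

```perl
use strict;
use warnings;

# Hypothetical request hash, standing in for $request_ref in the diff.
my $request_ref = {is_crawl_bot => 1};

# String comparison: both operands are stringified, so "1.0" and "1" differ.
my $string_match = ("1.0" eq 1) ? 1 : 0;

# Numeric comparison: both operands are converted to numbers, so 1.0 equals 1.
my $numeric_match = ("1.0" == 1) ? 1 : 0;

# For a numeric flag such as is_crawl_bot, == states the intent directly.
my $is_bot = ($request_ref->{is_crawl_bot} == 1) ? 1 : 0;

print "eq: $string_match, ==: $numeric_match, flag: $is_bot\n";
# prints "eq: 0, ==: 1, flag: 1"
```

With a flag that is always exactly 0 or 1 both operators give the same answer, so this is mostly about expressing intent and avoiding surprises if the value is ever stored as "1.0" or similar.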
An analysis of nginx logs showed that 6% of our traffic came from Bing bot. Half of those queries were "facet" queries, which involve aggregate MongoDB queries and consume a lot of resources.
See https://openfoodfacts.slack.com/archives/C1FPYCWM7/p1690454042958259 discussion for more context.
As a result, we decided to prevent known crawlers from crawling nested facet pages (2 facets). This should drastically limit the number of crawlable pages and reduce server load (especially on the DB server).
I checked locally with a custom User-Agent: nested facet pages are blocked as expected.
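For readers who want the gist of the behaviour described above, here is a rough standalone sketch, not the code from this PR. The pattern list, the helper names (`is_crawl_bot`, `count_facets`, `should_block`, `noindex_page`) and the assumption that path segments pair up as tagtype/tag are all hypothetical.

```perl
use strict;
use warnings;

# Hypothetical crawler User-Agent patterns; the project's real list may differ.
my @crawler_patterns = (qr/bingbot/i, qr/googlebot/i, qr/seekport/i);

# Returns 1 if the User-Agent matches a known crawler pattern, 0 otherwise.
sub is_crawl_bot {
    my ($user_agent) = @_;
    return 0 unless defined $user_agent;
    foreach my $pattern (@crawler_patterns) {
        return 1 if $user_agent =~ $pattern;
    }
    return 0;
}

# Counts facets in a path, assuming segments come in tagtype/tag pairs
# (an assumption made for this sketch only).
sub count_facets {
    my ($path) = @_;
    my @segments = grep { length } split m{/}, $path;
    return int(scalar(@segments) / 2);
}

# A known crawler requesting a nested facet page (2 or more facets) is blocked.
sub should_block {
    my ($user_agent, $path) = @_;
    return (is_crawl_bot($user_agent) and count_facets($path) >= 2) ? 1 : 0;
}

# Blank page carrying a noindex directive.
sub noindex_page {
    return '<!DOCTYPE html><html><head><meta name="robots" content="noindex"></head><body></body></html>';
}

my $path = '/category/popcorn-with-caramel/data-quality-error/nutrition-value-total-over-105';
my $ua   = 'Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)';
my $body = should_block($ua, $path) ? noindex_page() : 'render the normal page';
print "$body\n";
```

Serving a near-empty page avoids the aggregate MongoDB query entirely, while the noindex directive tells the crawler not to index the page.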