@justinjung04 justinjung04 commented May 29, 2025

What this PR does:

This PR includes two changes:

1. Change error code of resource exhausted query rejection from 429 to 503

This changes the error code returned to the user from 422 to 500 (Prometheus currently has a static mapping, which I'm trying to make customizable).

We want to return 5xx rather than 4xx because resource exhaustion on a node is a system failure rather than a user error.
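To make the 4xx versus 5xx distinction concrete, here is a minimal Go sketch that classifies the two rejection codes named above. The `errorClass` helper is hypothetical and is not part of this PR; it only illustrates why 503 reads as a system failure while 429 reads as a caller problem.

```go
package main

import (
	"fmt"
	"net/http"
)

// errorClass is a hypothetical helper: it reports whether an HTTP status
// code falls in the client-error (4xx) or server-error (5xx) range.
func errorClass(code int) string {
	switch {
	case code >= 400 && code < 500:
		return "client error"
	case code >= 500 && code < 600:
		return "server error"
	default:
		return "not an error"
	}
}

func main() {
	// Codes taken from the description above: the rejection moves from 429 to 503.
	before := http.StatusTooManyRequests   // 429
	after := http.StatusServiceUnavailable // 503
	fmt.Printf("%d: %s\n", before, errorClass(before))
	fmt.Printf("%d: %s\n", after, errorClass(after))
}
```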

2. Change configuration

I'm planning to add more features to the query protection, and make it available for all query components (query rejection is just one of many features we can implement). Thus, I wanted to change the configuration to be more flexible:

ingester:
  query_protection:
    rejection:
      enabled: true
      threshold:
        cpu_utilization: 0.9
        heap_utilization: 0.9

The new config will be easier to extend as we add more query protection features:

ingester:
  query_protection:
    rejection:
      ...
    other_protection_feature:
      enabled: true
      threshold:
        cpu_utilization: 0.9
        heap_utilization: 0.9
        some_other_threshold: 1
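As a rough illustration of how the nested layout could map onto Go types, here is a minimal sketch. The type names, the `ShouldReject` method, the "zero disables a threshold" rule, and the `>=` comparison are all assumptions made for this example; they are not taken from the actual implementation in this PR.

```go
package main

import "fmt"

// Threshold mirrors the threshold block in the YAML above (hypothetical type).
type Threshold struct {
	CPUUtilization  float64 `yaml:"cpu_utilization"`
	HeapUtilization float64 `yaml:"heap_utilization"`
}

// Rejection mirrors the rejection block (hypothetical type).
type Rejection struct {
	Enabled   bool      `yaml:"enabled"`
	Threshold Threshold `yaml:"threshold"`
}

// QueryProtection groups protection features; a future feature would be
// added as one more field next to Rejection.
type QueryProtection struct {
	Rejection Rejection `yaml:"rejection"`
}

// ShouldReject reports whether a query should be rejected for the given
// utilization values. Assumption: a threshold of 0 means "not set".
func (r Rejection) ShouldReject(cpu, heap float64) bool {
	if !r.Enabled {
		return false
	}
	t := r.Threshold
	overCPU := t.CPUUtilization > 0 && cpu >= t.CPUUtilization
	overHeap := t.HeapUtilization > 0 && heap >= t.HeapUtilization
	return overCPU || overHeap
}

func main() {
	qp := QueryProtection{Rejection: Rejection{
		Enabled:   true,
		Threshold: Threshold{CPUUtilization: 0.9, HeapUtilization: 0.9},
	}}
	fmt.Println(qp.Rejection.ShouldReject(0.95, 0.40)) // CPU above threshold
	fmt.Println(qp.Rejection.ShouldReject(0.50, 0.40)) // both below threshold
}
```

Keeping each feature in its own struct means a new protection feature can carry its own `enabled` flag and thresholds without touching the rejection settings.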

Which issue(s) this PR fixes:
Fixes #

Checklist

  • Tests updated
  • Documentation added
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

Signed-off-by: Justin Jung <[email protected]>
@justinjung04 justinjung04 changed the title Query rejection Query protection - change configs for query rejection May 30, 2025
Signed-off-by: Justin Jung <[email protected]>
@justinjung04 justinjung04 changed the title Query protection - change configs for query rejection Change error code and configuration of resource-based limiter May 30, 2025
Signed-off-by: Justin Jung <[email protected]>
@justinjung04 justinjung04 marked this pull request as ready for review May 30, 2025 19:49

@yeya24 yeya24 left a comment

Thanks. This change makes sense to me

CHANGELOG.md Outdated
* [CHANGE] Ingester: Remove EnableNativeHistograms config flag and instead gate keep through new per-tenant limit at ingestion. #6718
* [CHANGE] StoreGateway/Alertmanager: Add default 5s connection timeout on client. #6603
* [CHANGE] Validate a tenantID when to use a single tenant resolver. #6727
* [CHANGE] Ingester/StoreGateway: Change error code and configuration of resource-based limiter. #6771

We can skip the changelog entry: the resource-based limiter has not been released yet, so this is not a user-facing change. Alternatively, you can add #6771 to the original changelog entry.

Signed-off-by: Justin Jung <[email protected]>
@yeya24 yeya24 merged commit a4ebd34 into cortexproject:master Jun 4, 2025
17 of 18 checks passed