Distributor ingestion rate limit increased for retries due to ingestion failure #3804

@pracucci

Description

The distributor's ingestion rate limiter consumes tokens as soon as a request is received, before anything is written to the ingesters:

```go
if !d.ingestionRateLimiter.AllowN(now, userID, totalN) {
```

In the event of an ingester outage (e.g. 2+ ingesters unavailable), this means each tenant remote write request consumes tokens from the tenant's rate limiter even though no samples were successfully ingested. The client (e.g. Prometheus) retries the write, consuming further tokens, until the tenant eventually hits the rate limit regardless of whether any samples have actually been ingested.
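
To make the failure mode concrete, here is a minimal sketch using the standard golang.org/x/time/rate limiter (not Cortex's per-tenant wrapper; all numbers are illustrative) showing how retries of the same failed write drain the token bucket:

```go
// A minimal sketch (not Cortex code) of how consuming tokens on receipt
// lets retries of the same failed write drain a tenant's budget.
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// Hypothetical tenant limit: 1,000 samples/sec with a 10x burst.
	limiter := rate.NewLimiter(rate.Limit(1000), 10000)

	now := time.Now() // frozen, so no tokens refill between attempts
	batch := 2000     // samples in one remote write request

	// The same logical write, retried while ingesters are down: every
	// attempt consumes tokens even though nothing was ingested, and the
	// 6th attempt is rejected outright.
	for attempt := 1; attempt <= 6; attempt++ {
		allowed := limiter.AllowN(now, batch)
		fmt.Printf("attempt %d: allowed=%v, tokens left=%.0f\n",
			attempt, allowed, limiter.TokensAt(now))
	}
}
```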

The burst should protect against this, but in a relatively long outage we would end up consuming the burst too (e.g. we set the burst to 10x the rate limit).
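
For illustration, with made-up numbers: a limit of 10,000 samples/sec and a 10x burst gives 100,000 burst tokens. If retried batches keep arriving at 20,000 samples/sec during the outage, tokens drain at a net 10,000/sec (consumption minus refill), so the whole burst is gone after roughly 10 seconds and every write is rejected from then on, even though none were ingested.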

I'm wondering if a better approach would be to check whether enough tokens are available in the rate limiter when the request is received, but to actually consume them only after the samples have been successfully written to the ingesters. Due to concurrency, the actual accepted rate could exceed the limit, but we would err in favour of the customer instead of rate limiting writes we haven't actually ingested.
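
A minimal sketch of that idea, again against golang.org/x/time/rate rather than Cortex's per-tenant wrapper (push, writeToIngesters, and errRateLimited are hypothetical names introduced for illustration):

```go
// A sketch of check-then-consume: availability is checked on receipt,
// but tokens are debited only after a successful write.
package main

import (
	"errors"
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

var errRateLimited = errors.New("ingestion rate limit exceeded")

func push(limiter *rate.Limiter, totalN int, writeToIngesters func() error) error {
	now := time.Now()

	// Check only: reject early if the tenant is already over its limit.
	// No tokens are consumed by this check.
	if limiter.TokensAt(now) < float64(totalN) {
		return errRateLimited
	}

	if err := writeToIngesters(); err != nil {
		// Write failed: nothing was consumed, so client retries don't
		// drain the tenant's budget.
		return err
	}

	// Write succeeded: debit the tokens now. ReserveN (unlike AllowN)
	// always debits, possibly into debt, so concurrent requests that all
	// passed the check above are still accounted for; the accepted rate
	// may transiently exceed the limit, erring in the tenant's favour.
	limiter.ReserveN(now, totalN)
	return nil
}

func main() {
	limiter := rate.NewLimiter(rate.Limit(1000), 10000)

	// A failed write leaves the bucket untouched...
	_ = push(limiter, 2000, func() error { return errors.New("2 ingesters unhealthy") })
	// ...so the retry is not penalised for it.
	fmt.Println(push(limiter, 2000, func() error { return nil })) // <nil>
}
```

An alternative with the same library would be to ReserveN on receipt and Cancel() the reservation when the write fails, which restores the tokens as far as possible given reservations made in the meantime.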

Related discussions:
