ring: Fix pathological case when an entire zone leaves #672

Merged: 1 commit into main from 56quarters/ring-missing-zone, Mar 27, 2025

56quarters
Contributor

What this PR does:

This change improves performance in the case where an entire zone is not ACTIVE and the replication set is meant to be extended. Previously, when an entire zone was unavailable, the ring kept searching by examining every single token, trying to find an ACTIVE instance in the required zone. This meant thousands of iterations looking for a host that could never be found.

This change keeps track of the number of hosts that we have examined in each zone. It returns early once we have either found the hosts in each zone we need OR we have examined all hosts in the zone and so know that we won't find one.
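
The early-exit idea can be illustrated with a minimal, self-contained sketch. This is not the dskit implementation: the `instance` and `token` types, `pickOnePerZone`, and `buildExampleRing` are hypothetical names invented for illustration, and the sketch assumes the simplest case of wanting one ACTIVE instance per zone. It only shows how counting the hosts examined per zone lets the walk stop once every zone is either satisfied or exhausted.

```go
package main

import "fmt"

// instance is a hypothetical, simplified stand-in for a ring member.
type instance struct {
	ID     string
	Zone   string
	Active bool
}

// token maps a ring position to the instance that owns it (hypothetical type).
type token struct {
	Value uint32
	Owner *instance
}

// pickOnePerZone walks the ring from index start and picks at most one
// ACTIVE instance per zone. hostsPerZone must hold the number of distinct
// instances in every zone. The walk stops as soon as each zone is either
// satisfied (an ACTIVE instance was found) or exhausted (all of its hosts
// were examined). It returns the picks and the number of tokens visited.
func pickOnePerZone(tokens []token, start int, hostsPerZone map[string]int) ([]*instance, int) {
	seen := make(map[string]bool)    // instance IDs already examined
	examined := make(map[string]int) // distinct hosts examined per zone
	found := make(map[string]bool)   // zones that already have a pick
	var picked []*instance
	resolved := 0 // zones that are satisfied or exhausted
	steps := 0

	for i := 0; i < len(tokens) && resolved < len(hostsPerZone); i++ {
		steps++
		inst := tokens[(start+i)%len(tokens)].Owner
		if seen[inst.ID] {
			continue // another token of a host we already looked at
		}
		seen[inst.ID] = true
		if found[inst.Zone] {
			continue // this zone already has its instance
		}
		examined[inst.Zone]++
		if inst.Active {
			found[inst.Zone] = true
			picked = append(picked, inst)
			resolved++
		} else if examined[inst.Zone] == hostsPerZone[inst.Zone] {
			resolved++ // every host in the zone was checked; none is ACTIVE
		}
	}
	return picked, steps
}

// buildExampleRing builds 3 zones with 2 instances each and 4 tokens per
// instance (24 tokens in total). No instance in zone "c" is ACTIVE.
func buildExampleRing() ([]token, map[string]int) {
	instances := []*instance{
		{ID: "a1", Zone: "a", Active: true},
		{ID: "a2", Zone: "a", Active: true},
		{ID: "b1", Zone: "b", Active: true},
		{ID: "b2", Zone: "b", Active: true},
		{ID: "c1", Zone: "c", Active: false},
		{ID: "c2", Zone: "c", Active: false},
	}
	hostsPerZone := make(map[string]int)
	for _, inst := range instances {
		hostsPerZone[inst.Zone]++
	}
	var tokens []token
	for round := 0; round < 4; round++ {
		for _, inst := range instances {
			tokens = append(tokens, token{Value: uint32(len(tokens)), Owner: inst})
		}
	}
	return tokens, hostsPerZone
}

func main() {
	tokens, hostsPerZone := buildExampleRing()
	picked, steps := pickOnePerZone(tokens, 0, hostsPerZone)
	fmt.Printf("picked %d instances after visiting %d of %d tokens\n",
		len(picked), steps, len(tokens))
}
```

In this toy ring the walk stops after 6 tokens instead of visiting all 24, because zone "c" is marked exhausted as soon as both of its hosts have been examined.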

Which issue(s) this PR fixes:

N/A

Checklist

  • Tests updated
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]

@56quarters
Contributor Author

Comparison against a commit from early January which predates my changes in #632:

$ benchstat prev.txt current.txt 
goos: linux
goarch: amd64
pkg: github.com/grafana/dskit/ring
cpu: Intel(R) Core(TM) i9-10885H CPU @ 2.40GHz
                           │   prev.txt    │             current.txt             │
                           │    sec/op     │   sec/op     vs base                │
Ring_Get_OneZoneLeaving-16   8329.3µ ± 13%   114.6µ ± 5%  -98.62% (p=0.000 n=10)

                           │   prev.txt   │           current.txt            │
                           │     B/op     │     B/op      vs base            │
Ring_Get_OneZoneLeaving-16   10.27Ki ± 0%   10.27Ki ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

                           │  prev.txt  │          current.txt           │
                           │ allocs/op  │ allocs/op   vs base            │
Ring_Get_OneZoneLeaving-16   12.00 ± 0%   12.00 ± 0%  ~ (p=1.000 n=10) ¹
¹ all samples are equal

@56quarters 56quarters requested review from pstibrany and a team March 25, 2025 16:57
Member

@pstibrany pstibrany left a comment

I haven't looked at ring code in a while, and originally I thought I'd found a bug in the PR, but after staring at it for some more time, I see I was wrong: the change seems legit and makes sense to me.

This change improves performance in the case where an entire zone is
not ACTIVE and the replication set is meant to be extended. Previously,
when an entire zone was unavailable, the ring kept searching for
instances by looking at every single token trying to find an instance
in the required zone that was ACTIVE. This meant thousands of iterations
to find a host that would never work.

This change keeps track of the number of hosts that we have examined
in each zone. It returns early once we have either found the hosts in
each zone we need _OR_ we have examined all hosts in the zone and so
know that we won't find one.

Signed-off-by: Nick Pillitteri <[email protected]>
@56quarters 56quarters force-pushed the 56quarters/ring-missing-zone branch from 67a03bf to 16780f3 Compare March 26, 2025 13:53
@56quarters 56quarters marked this pull request as ready for review March 26, 2025 15:41
Member

@julienduchesne julienduchesne left a comment

It makes sense to me, not super familiar with ring code though

@julienduchesne julienduchesne requested a review from a team March 26, 2025 15:48
@56quarters 56quarters merged commit 60d867e into main Mar 27, 2025
5 checks passed
@56quarters 56quarters deleted the 56quarters/ring-missing-zone branch March 27, 2025 16:40
56quarters added a commit to grafana/mimir that referenced this pull request Mar 27, 2025
Specifically, this pulls in the following dskit PRs:

* grafana/dskit#672
* grafana/dskit#669
* grafana/dskit#668

Signed-off-by: Nick Pillitteri <[email protected]>
56quarters added a commit to grafana/mimir that referenced this pull request Mar 27, 2025
Update to the latest dskit commit to pull in grafana/dskit#672
which improves performance of ring operations in clients (queriers,
distributors) when zone-awareness is enabled and an entire zone is
not "ACTIVE".

Signed-off-by: Nick Pillitteri <[email protected]>
56quarters added a commit to grafana/mimir that referenced this pull request Mar 27, 2025
* chore: update to latest dskit for ring performance fix

Update to the latest dskit commit to pull in grafana/dskit#672
which improves performance of ring operations in clients (queriers,
distributors) when zone-awareness is enabled and an entire zone is
not "ACTIVE".

Signed-off-by: Nick Pillitteri <[email protected]>

* Lint

Signed-off-by: Nick Pillitteri <[email protected]>

---------

Signed-off-by: Nick Pillitteri <[email protected]>