Skip to content

Conversation

@tbg
Copy link
Member

@tbg tbg commented Nov 13, 2025

Leases:

This PR refactors rebalanceStore by extracting the lease rebalancing logic into a standalone rebalanceLeases method on rebalanceState. The method encapsulates the logic for handling lease transfers when a local store is CPU overloaded, improving code organization and reducing the complexity of rebalanceStore.

As part of this refactor, we also fix a bug where the global leaseTransferCount was incorrectly used to determine if the current store transferred leases. The fix introduces a local counter that tracks lease transfers made during processing of a single store. It feels a bit sad to fix a bug without adding a test, but this felt too good to pass up, and the refactors are precursors to plenty of testing in the context of adding a notion of replica disposition.

The refactoring was done incrementally in small, reviewable commits:

  1. Indent the block with explicit braces (mechanical whitespace change)
  2. Add local counter and fix the bug
  3. Wrap the logic in an inline function
  4. Extract to method (done mechanically by the author using IDE tooling)

There should be zero behavior changes in this work (aside from the bug fix).

Replicas:

This PR refactors rebalanceStore by extracting the replica rebalancing logic into a standalone rebalanceReplicas method on rebalanceState. This separation improves code organization by isolating the replica movement logic from the lease rebalancing logic, making the codebase easier to understand and maintain.

The refactoring was done incrementally in small, reviewable commits:

  1. Added a code block wrapper around the replica rebalancing logic
  2. Moved the doneShedding variable declaration into the block
  3. Added method scaffolding with an IIF wrapper to set up the call site
  4. Extracted the code into the rebalanceReplicas method (done manually using IDE tooling)

There should be zero behavior changes in this PR.

Informs #157757.

Epic: CRDB-55052

@blathers-crl
Copy link

blathers-crl bot commented Nov 13, 2025

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@tbg tbg added the do-not-merge bors won't merge a PR with this label. label Nov 13, 2025
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@tbg tbg changed the title mmaprototype: extract rebalanceLeases from rebalanceStore mmaprototype: extract rebalance{Leases,Replicas} from rebalanceStore Nov 13, 2025
@tbg tbg force-pushed the mma-rebalace-lease-refactor branch from 7a0f2b4 to 4d15fdc Compare November 13, 2025 16:56
tbg added 12 commits November 13, 2025 21:47
Wrap the lease rebalancing logic in an explicit block scope to prepare for
extraction into a separate method. This is a pure whitespace change with no
logic modifications.

This change makes the code that will become `considerRebalanceLeases` clearly
delineated, setting up for the subsequent refactoring steps.
Add a local counter to track lease transfers made during processing of a
single store, and use it to fix the bug where the global counter was
incorrectly used to determine if this store transferred leases.

The local counter is incremented alongside the global counter, which is
still needed for loop termination across all stores. The check at the
end of lease rebalancing now correctly uses the local counter to decide
whether to skip replica transfers for this specific store.
Wrap the lease rebalancing logic in an inline function that takes all
necessary parameters and returns whether to skip replica moves. This
prepares the code for extraction to a separate method.

The function converts early returns to return statements with the skip
flag, making the control flow explicit and ready for method extraction.
Extract the lease rebalancing logic into a separate method
`(*rebalanceState).rebalanceLeases`. This improves code organization by
separating the lease rebalancing concern from the main rebalance logic.

The method takes all necessary parameters and returns whether to skip
replica moves. The extraction was done mechanically by the author using
IDE tooling.
Add a new code block (braces) around the replica rebalancing logic
in rebalanceStore to prepare for extraction into a separate method.
This is a mechanical whitespace change that will be followed by
further refactoring steps.
…block

Move the doneShedding variable declaration from the beginning of
rebalanceStore into the code block that contains the replica rebalancing
logic. This narrows the variable's scope to where it's actually used and
prepares for extracting this code into a separate method.
Wrap the replica rebalancing logic in an immediately invoked function
(IIF) and add a call to a new rebalanceReplicas method (currently a
no-op). This sets up both the call site and method signature, making it
straightforward to move the code into the method in the next step.
…s method

Move the replica rebalancing logic from rebalanceStore into a new
rebalanceReplicas method. The code was moved manually using IDE tooling
to ensure accurate extraction. This completes the refactoring to separate
replica rebalancing from the main rebalanceStore flow.
Remove the `now time.Time` parameter from `rebalanceLeases` and use
`rs.now` instead. This completes the refactoring to store `now` in
`rebalanceState` rather than passing it as a parameter.

The method now uses `rs.now` for checking the delay after failed
changes. The call site in `rebalanceStore` no longer passes the `now`
parameter.

This is the final step in removing the `now` parameter from all
rebalance methods.
Move `ctx context.Context` to the first parameter position in
`rebalanceStore` and `rebalanceLeases` to follow Go conventions. All
rebalance methods in this file now consistently have `ctx` as the first
parameter.

Updated all call sites accordingly.
Wenyi correctly pointed out that (rs rebalanceState) is too reminiscent of (rs
rangeState)

(re rebalanceEnv) has no such problem.

Epic: CRDB-55052
It's referenced a lot, so this declutters the code.
@tbg tbg force-pushed the mma-rebalace-lease-refactor branch from 4d15fdc to d236fa5 Compare November 13, 2025 20:47
Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Completed a self-review from this - would get my LGTM, but I'm (kinda sorta, really LLM wrote most of the code, I just told it what to do) the author, so appreciate an independent look.
Auto-borsing ok, but probably this will fail some lint or some such.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


-- commits line 77 at r9:
I had this carried out in multiple steps but it looks like it made just one commit but still somehow assumed the user could know about the intermediate steps. Sigh. Good enough I guess. Should've caught that during commit msg review.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 378 at r2 (raw file):

				rs.cs.addPendingRangeChange(leaseChange)
				rs.changes = append(rs.changes, leaseChange)
				rs.leaseTransferCount++

We can probably remove these "global" counters (this one and the one for replicas). I'll leave this like this for now, just pointing it out.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 240 at r3 (raw file):

			topKRanges := ss.adjusted.topKRanges[localStoreID]
			n := topKRanges.len()
			doneShedding := false

This is a bit tricky. There's a doneShedding var that this shadows (in line 200 above), that var was previously used for both lease and the replica rebalancing code (which is still below, near the end of the method). I think this is correct though but I'll check on the final state just to make sure.

edit: yeah this is fine. This local doneShedding moves into rebalanceLeases, so we have a doneShedding left here that's only used for replica transfers, and moves with it to its own method subsequently. r6 carries out that move.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 393 at r3 (raw file):

				if rs.leaseTransferCount >= rs.maxLeaseTransferCount {
					log.KvDistribution.VInfof(ctx, 2, "reached max lease transfer count %d, returning", rs.maxLeaseTransferCount)
					break

this breaks for i := 0; ..., i.e. the topKranges iteration, so that we fall through to the end of the function and can still hit the logging.
I don't love it, but will do for now.


pkg/kv/kvserver/allocator/mmaprototype/cluster_state_rebalance_stores.go line 412 at r3 (raw file):

				log.KvDistribution.VInfof(ctx, 2, "skipping replica transfers for s%d: done shedding=%v, lease_transfers=%d",
					store.StoreID, doneShedding, localLeaseTransferCount)
				return true

// shouldSkipReplicaMoves

(haven't gotten the LLM to realize when such comments are helpful yet).

@tbg tbg removed the do-not-merge bors won't merge a PR with this label. label Nov 13, 2025
@tbg tbg requested review from sumeerbhola and wenyihu6 November 13, 2025 21:14
Copy link
Contributor

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm: I found the self-review comments helpful for confirming my understanding of the commit by commit changes.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)

Copy link
Contributor

@wenyihu6 wenyihu6 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola)


-- commits line 12 at r2:
Wanted to check my understanding - is it correct that this bug existed prior to these cleanups - we were incorrectly using this as a global counter

? I ran the tests with --rewrite and no asim changes. Were we just not hitting this

@tbg tbg marked this pull request as ready for review November 14, 2025 09:03
@tbg tbg requested review from a team as code owners November 14, 2025 09:03
Copy link
Member Author

@tbg tbg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know, will keep doing this then! Somehow I also get something out of reviewing it on reviewable, even though I already review it in the IDE. Somehow it's hard to review in the same detail in Cursor, I'm not exactly sure why.

bors r+

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @sumeerbhola and @wenyihu6)


-- commits line 12 at r2:

Previously, wenyihu6 (Wenyi Hu) wrote…

Wanted to check my understanding - is it correct that this bug existed prior to these cleanups - we were incorrectly using this as a global counter

? I ran the tests with --rewrite and no asim changes. Were we just not hitting this

Yes, these were existing bugs (both for lease transfer count and replica transfer count). I'll start writing tests for rebalanceStores soon and am planning on looking at coverage.

@craig
Copy link
Contributor

craig bot commented Nov 14, 2025

@craig craig bot merged commit 4c8e2ec into cockroachdb:master Nov 14, 2025
26 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants