@kvoli
Contributor

@kvoli kvoli commented Sep 6, 2024

Backport 2/2 commits from #130117 on behalf of @kvoli.

/cc @cockroachdb/release


Introduce the ranges.decommissioning gauge metric, which represents
the number of ranges with at least one replica on a decommissioning
node.

The metric is reported by the leaseholder, or if there is no valid
leaseholder, the first live replica in the descriptor, similar to
(under|over)-replication metrics.
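
The reporting rule above can be sketched roughly as follows. This is an illustrative model only, not the kvserver implementation: `ReplicaID`, `rangeState`, and `reporter` are hypothetical names, and a zero `leaseholder` is assumed to mean "no valid leaseholder".

```go
package main

import "fmt"

// ReplicaID is a hypothetical identifier type used only for this sketch.
type ReplicaID int

// rangeState is a hypothetical, simplified view of a single range.
type rangeState struct {
	descriptor  []ReplicaID        // replicas in descriptor order
	leaseholder ReplicaID          // assumed: 0 means no valid leaseholder
	live        map[ReplicaID]bool // whether each replica's node is live
}

// reporter returns the replica that reports range-level metrics: the
// leaseholder when there is a valid one, otherwise the first live replica
// in the descriptor. The boolean is false when no replica qualifies.
func reporter(r rangeState) (ReplicaID, bool) {
	if r.leaseholder != 0 {
		return r.leaseholder, true
	}
	for _, id := range r.descriptor {
		if r.live[id] {
			return id, true
		}
	}
	return 0, false
}

func main() {
	r := rangeState{
		descriptor: []ReplicaID{1, 2, 3},
		live:       map[ReplicaID]bool{2: true, 3: true},
	}
	id, ok := reporter(r)
	fmt.Println(id, ok) // prints "2 true": no leaseholder, replica 1 is not live
}
```

Picking a single reporter per range keeps the gauge from double counting a range across the stores that hold its replicas.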

The metric can be used to approximately identify the distribution of
decommissioning work remaining across nodes, as the leaseholder replica
is responsible for triggering the replacement of decommissioning
replicas for its own range.
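
As a rough illustration of how the gauge spreads across nodes, the sketch below counts, per reporting node, the ranges that have at least one replica on a decommissioning node. The types and the function name are hypothetical and exist only for this example; the real metric is computed inside kvserver.

```go
package main

import "fmt"

// NodeID is a hypothetical identifier type used only for this sketch.
type NodeID int

// rangeInfo is a hypothetical, simplified view of a range: the node that
// reports metrics for it (normally the leaseholder's node) and the nodes
// holding its replicas.
type rangeInfo struct {
	reportingNode NodeID
	replicaNodes  []NodeID
}

// decommissioningRangesByNode returns, for each reporting node, the number
// of ranges with at least one replica on a decommissioning node.
func decommissioningRangesByNode(
	ranges []rangeInfo, decommissioning map[NodeID]bool,
) map[NodeID]int {
	counts := make(map[NodeID]int)
	for _, r := range ranges {
		for _, n := range r.replicaNodes {
			if decommissioning[n] {
				counts[r.reportingNode]++
				break // count each range at most once
			}
		}
	}
	return counts
}

func main() {
	ranges := []rangeInfo{
		{reportingNode: 1, replicaNodes: []NodeID{1, 2, 4}},
		{reportingNode: 1, replicaNodes: []NodeID{1, 3, 4}},
		{reportingNode: 2, replicaNodes: []NodeID{1, 2, 3}},
		{reportingNode: 2, replicaNodes: []NodeID{2, 3, 4}},
	}
	counts := decommissioningRangesByNode(ranges, map[NodeID]bool{4: true})
	fmt.Println(counts[1], counts[2]) // prints "2 1"
}
```

In this toy input, node 1 has more replacement work left to trigger than node 2, which is the kind of skew the metric is meant to surface.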

Informs: #130085
Release note (ops change): The ranges.decommissioning metric is added,
representing the number of ranges which have a replica on a
decommissioning node.


When kv.enqueue_in_replicate_queue_on_problem.interval is set to a
positive value, leaseholder replicas of ranges which have
decommissioning replicas are enqueued into the replicate queue once per
interval.

When kv.enqueue_in_replicate_queue_on_problem.interval is set to 0,
no enqueueing on decommissioning takes place outside of the regular
replica scanner.

The recommended value for users enabling the enqueue (non-zero) is 15
minutes, e.g.,

SET CLUSTER SETTING
kv.enqueue_in_replicate_queue_on_problem.interval = '900s';

Resolves: #130085
Release note: None


Release justification: Low-risk observability change, plus a behavior change that is disabled by default and, when enabled, alleviates a class of decommission stalls.

@kvoli kvoli added the backport label Sep 6, 2024
@kvoli kvoli self-assigned this Sep 6, 2024
@blathers-crl

blathers-crl bot commented Sep 6, 2024

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Backports should only be created for serious issues or test-only changes.
  • Backports should not break backwards-compatibility.
  • Backports should change as little code as possible.
  • Backports should not change on-disk formats or node communication protocols.
  • Backports should not add new functionality (except as defined here).
  • Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
  • All backports must be reviewed by the owning area's TL. For more information as to how that review should be conducted, please consult the backport policy.
If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed version test added).
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
  • Your backport must be accompanied by a post to the appropriate Slack channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this backport.

@cockroach-teamcity
Member

This change is Reviewable

@kvoli kvoli changed the title from "release-23.2: kvserver: enqueue decom ranges at an interval behind a setting" to "release-23.1: kvserver: enqueue decom ranges at an interval behind a setting" Sep 6, 2024
Introduce the `ranges.decommissioning` gauge metric, which represents
the number of ranges with at least one replica on a decommissioning
node.

The metric is reported by the leaseholder, or if there is no valid
leaseholder, the first live replica in the descriptor, similar to
(under|over)-replication metrics.

The metric can be used to approximately identify the distribution of
decommissioning work remaining across nodes, as the leaseholder replica
is responsible for triggering the replacement of decommissioning
replicas for its own range.

Informs: cockroachdb#130085
Release note (ops change): The `ranges.decommissioning` metric is added,
representing the number of ranges which have a replica on a
decommissioning node.

When `kv.enqueue_in_replicate_queue_on_problem.interval` is set to a
positive value, leaseholder replicas of ranges which have
decommissioning replicas are enqueued into the replicate queue once per
interval.

When `kv.enqueue_in_replicate_queue_on_problem.interval` is set to 0,
no enqueueing on decommissioning takes place outside of the regular
replica scanner.

The recommended value for users enabling the enqueue (non-zero) is at
least 15 minutes, e.g.,

```
SET CLUSTER SETTING
kv.enqueue_in_replicate_queue_on_problem.interval = '900s';
```

Resolves: cockroachdb#130085
Informs: cockroachdb#130199
Release note: None
@kvoli kvoli force-pushed the backport-release-23.1-130117 branch from 05c8596 to 5945d37 Compare September 9, 2024 14:06
@kvoli kvoli marked this pull request as ready for review September 9, 2024 14:06
@kvoli kvoli requested review from a team as code owners September 9, 2024 14:06
@kvoli kvoli requested review from arulajmani, kyle-a-wong and nicktrav and removed request for a team September 9, 2024 14:06
@kvoli
Contributor Author

kvoli commented Sep 10, 2024

TYFTR!

@kvoli kvoli merged commit 804cb5b into cockroachdb:release-23.1 Sep 10, 2024
@kvoli kvoli deleted the backport-release-23.1-130117 branch September 10, 2024 18:45
@crl-codesys-jira crl-codesys-jira added the T-kv label Aug 12, 2025

Labels

backport, T-kv, v23.1.28

4 participants