
Conversation

@kvoli (Contributor) commented Nov 13, 2023

It was possible for a replica to become stuck in a processing state in a
queue's replica set. This could occur when a replica had recently been
removed from purgatory for processing, but was then destroyed or had its
replica ID changed before being processed.

When this occurred, the replica could never be processed by the queue
again, potentially leading to decommission stalls, constraint violations,
or under- or over-replication.

Remove the replica from the queue's replica set upon encountering a
replica which was destroyed, or whose replica ID changed, while processing
purgatory. This prevents the replica from becoming stuck in a processing
state in the queue's replica set.

Fixes: #112761
Fixes: #110761

Release note (bug fix): Store queues no longer leave purgatory replicas
which have changed replica IDs, or which have been destroyed, stuck in a
state where they cannot be processed by the respective queue again once
re-added.
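
For illustration, a minimal sketch of the shape of the fix in the purgatory-processing path. The ID-mismatch check and the removeFromReplicaSetLocked call are taken from the snippet quoted in the review below; the surrounding loop, the getReplica helper, and the item fields are assumptions about context rather than the actual implementation.

	// Sketch of purgatory processing in pkg/kv/kvserver/queue.go (assumed
	// structure). Previously, a replica that had been destroyed or whose
	// replica ID had changed was skipped but left in the queue's replica
	// set, so it could never be re-added and processed again.
	for _, item := range purgatoryItems {
		repl, err := bq.getReplica(item.rangeID) // assumed helper
		if err != nil || item.replicaID != repl.ReplicaID() {
			// The replica was destroyed or its ID changed: drop it from
			// the replica set so a future maybeAdd can enqueue it again.
			bq.mu.Lock()
			bq.removeFromReplicaSetLocked(item.rangeID)
			bq.mu.Unlock()
			continue
		}
		// ... otherwise, process the replica as before ...
	}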

@blathers-crl (bot) commented Nov 13, 2023

It looks like your PR touches production code but doesn't add or edit any test code. Did you consider adding tests to your PR?

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@kvoli kvoli self-assigned this Nov 13, 2023
@cockroach-teamcity (Member) commented

This change is Reviewable

@kvoli kvoli force-pushed the 231113.fix-queue-processing-err branch from 9f562f4 to a1fa0fb on November 13, 2023 19:42
@kvoli kvoli force-pushed the 231113.fix-queue-processing-err branch 7 times, most recently from b0c9a35 to 2fbf672 on November 17, 2023 00:04
@kvoli kvoli changed the title kvserver: [dnm] ensure repl removal from queue set kvserver: ensure repl removal from queue set Nov 17, 2023
@kvoli kvoli force-pushed the 231113.fix-queue-processing-err branch from 2fbf672 to c8060ec on November 17, 2023 00:16
@kvoli kvoli added backport-23.1.x PAST MAINTENANCE SUPPORT: 23.1 patch releases via ER request only backport-23.2.x PAST MAINTENANCE SUPPORT: 23.2 patch releases via ER request only labels Nov 17, 2023
@kvoli kvoli force-pushed the 231113.fix-queue-processing-err branch 4 times, most recently from 3e6646a to bb2bacd on November 21, 2023 15:25
@kvoli kvoli changed the title kvserver: ensure repl removal from queue set kvserver: remove changed replicas in purgatory from replica set Nov 21, 2023
@kvoli kvoli marked this pull request as ready for review November 21, 2023 18:37
@kvoli kvoli requested a review from a team as a code owner November 21, 2023 18:37
@kvoli kvoli requested review from andrewbaptist and nvb November 21, 2023 18:37
@nvb (Contributor) left a comment

Nice job tracking this down. Sometimes the hardest bugs to diagnose require the fewest lines of code to resolve.

Reviewed 2 of 2 files at r1, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @kvoli)


pkg/kv/kvserver/queue.go line 1276 at r1 (raw file):

			if err != nil || item.replicaID != repl.ReplicaID() {
				bq.mu.Lock()
				bq.removeFromReplicaSetLocked(item.rangeID)

This is now mostly symmetrical with the handling of replica ID changes in baseQueue.pop, which gives confidence in this approach.
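
Roughly, the analogous handling in the pop path might look like the sketch below; this is an assumed illustration of the symmetry being described, not the actual baseQueue.pop code.

	// Assumed sketch: when popping a queued item, a stale replica ID
	// likewise causes the item to be dropped from the replica set rather
	// than being returned for processing.
	if item.replicaID != repl.ReplicaID() {
		bq.removeFromReplicaSetLocked(item.rangeID)
		continue
	}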


pkg/kv/kvserver/queue_test.go line 912 at r1 (raw file):

	const rmReplCount = 2
	repls[0].replicaID = 2
	if err := tc.store.RemoveReplica(context.Background(), repls[1], repls[1].Desc().NextReplicaID, RemoveOptions{

nit: ctx is in scope.


pkg/kv/kvserver/queue_test.go line 939 at r1 (raw file):

			return errors.Errorf("expected 0 purgatory replicas; got %d", v)
		}
		// Verify there are no replicas left in the replica set after finishing

Do we want to test that repls[0] (with the new replica ID) can now be added back into the queue? It may be worth pulling out a separate, smaller test case for that.

@kvoli kvoli force-pushed the 231113.fix-queue-processing-err branch from bb2bacd to 7e3d35e on November 22, 2023 21:12
@kvoli (Contributor, Author) left a comment

Thanks for the review! It certainly was satisfying writing the (short) fix.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @andrewbaptist and @nvanbenschoten)


pkg/kv/kvserver/queue_test.go line 912 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: ctx is in scope.

Updated to use the ctx in scope.


pkg/kv/kvserver/queue_test.go line 939 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Do we want to test that repls[0] (with the new replica ID) can now be added back into the queue? It may be worth pulling out a separate, smaller test case for that.

Good idea, I added this below.
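
Roughly, the added check might look like the sketch below. The maybeAdd call and the success counter appear in the r2 snippet further down; the SucceedsSoon wrapper and the exact assertion are assumptions for illustration.

	// Re-add the replica whose ID changed and verify the queue can now
	// process it (assumes it was removed from the replica set during the
	// purgatory processing above).
	beforeSuccessCount := bq.successes.Count()
	bq.maybeAdd(ctx, repls[0], hlc.ClockTimestamp{})
	testutils.SucceedsSoon(t, func() error {
		if c := bq.successes.Count(); c != beforeSuccessCount+1 {
			return errors.Errorf("expected %d successes; got %d", beforeSuccessCount+1, c)
		}
		return nil
	})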

@nvb (Contributor) left a comment

:lgtm:

Reviewed 1 of 1 files at r2, all commit messages.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @andrewbaptist and @kvoli)


pkg/kv/kvserver/queue_test.go line 964 at r2 (raw file):

	beforeSuccessCount := bq.successes.Count()
	beforeFailureCount := bq.failures.Count()
	bq.maybeAdd(context.Background(), repls[0], hlc.ClockTimestamp{})

nit: ctx is in scope here as well.

@kvoli kvoli force-pushed the 231113.fix-queue-processing-err branch from 7e3d35e to a24ba7f on November 23, 2023 00:04
@kvoli (Contributor, Author) left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @andrewbaptist and @nvanbenschoten)


pkg/kv/kvserver/queue_test.go line 964 at r2 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

nit: ctx is in scope here as well.

Updated.

@kvoli (Contributor, Author) commented Nov 23, 2023

bors r=nvanbenschoten

@craig (bot) commented Nov 23, 2023

Build failed:

@kvoli (Contributor, Author) commented Nov 23, 2023

Retrying

bors r=nvanbenschoten

@craig (bot) commented Nov 23, 2023

Build succeeded:

@craig craig bot merged commit f71ed12 into cockroachdb:master Nov 23, 2023
nvb added a commit to nvb/cockroach that referenced this pull request Oct 14, 2024
Closes cockroachdb#132546.

No patch release of v22.2 and earlier has cockroachdb#114365, so they have the
potential to be flaky when evacuating nodes using zone configs. Don't
use zone config movement for these versions.

Release note: None
nvb added a commit to nvb/cockroach that referenced this pull request Oct 15, 2024
Closes cockroachdb#132546.

No patch release of v22.2 and earlier has cockroachdb#114365, so they have the
potential to be flaky when evacuating nodes using zone configs. Don't
use zone config movement for these versions.

Release note: None
nvb added a commit to nvb/cockroach that referenced this pull request Oct 15, 2024
No patch release of v22.2 and earlier has cockroachdb#114365, so they have the
potential to be flaky when evacuating nodes using zone configs. Don't
use zone config movement for these versions.

Release note: None
nvb added a commit to nvb/cockroach that referenced this pull request Oct 16, 2024
Closes cockroachdb#132546.

No patch release of v22.2 and earlier has cockroachdb#114365, so they have the
potential to be flaky when evacuating nodes using zone configs. Don't
use zone config movement for these versions.

Release note: None