
kvserver: replicas on decommissioning node never being replaced #130199

@kvoli

Description


Describe the problem

We have observed replicas on a decommissioning node (<3) never being replaced by the corresponding range leaseholder nodes over a period of 80 minutes.

The stall was resolved by manually enqueueing the blocking ranges into the leaseholder's replicate queue.

This is surprising, as the replica scanner should check each replica, against every store queue, once every 10 minutes. Manually enqueueing the ranges via the advanced debug page without skipping the shouldQueue check succeeded, which demonstrates that if the scanner had called shouldQueue on these replicas, they would have been enqueued. There is no indication that this occurred.
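
For reference, a minimal sketch of the manual workaround from SQL rather than the debug page. This assumes the crdb_internal.kv_enqueue_replica builtin (the SQL counterpart of the advanced debug page's enqueue action, taking a range ID, queue name, and a skip-shouldQueue flag) and the replicas column of crdb_internal.ranges_no_leases; the store ID 99 is a placeholder for the decommissioning store:

# Sketch: find ranges that still hold a replica on the decommissioning
# store (store ID 99 is a placeholder) and enqueue each into the
# replicate queue WITHOUT skipping the shouldQueue check.
# (This may need to be run against the leaseholder node rather than node 1.)
rp sql $cluster:1 -- -e "
    SELECT range_id,
           crdb_internal.kv_enqueue_replica(range_id, 'replicate', false)
    FROM crdb_internal.ranges_no_leases
    WHERE replicas @> ARRAY[99];"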

To Reproduce

Attempts at reproducing the issue haven't been successful so far. The methods tested are shown below.

Details

Set up the cluster and run the workload:

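# Note: "rp" below appears to be shorthand for "roachprod".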
export cluster=austen-decom-repro
roachprod create $cluster -n 41
roachprod put $cluster ./artifacts/cockroach cockroach
roachprod start $cluster:1-40
rp sql $cluster:1 -- -e 'CREATE DATABASE kv2'
rp sql $cluster:1 -- -e 'ALTER RANGE default CONFIGURE ZONE USING num_replicas = 5'
rp sql $cluster:1 -- -e 'ALTER DATABASE kv2 CONFIGURE ZONE USING num_replicas = 5'
rp run $cluster:1 -- './cockroach workload run kv --init --splits=6000 --min-block-bytes=16384 --max-block-bytes=16384 --insert-count=100000000 --max-rate=100 {pgurl:1}'
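
A quick sanity check, not in the original steps, to confirm the zone configuration applied before kicking off the loops:

# Optional: verify the 5x replication factor took effect.
rp sql $cluster:1 -- -e 'SHOW ZONE CONFIGURATION FROM DATABASE kv2'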

Chain decommissions:

for iteration in $(seq 1 10000); do
    echo "Starting iteration $iteration"

    # Loop through nodes
    for node in $(seq 20 40); do
        echo "Processing node $node (Iteration $iteration)"

        # Run drain command
        rp run $cluster:$node -- './cockroach node drain --insecure --self'

        # Run decommission command
        rp run $cluster:$node -- './cockroach node decommission --insecure --self'

        # Wipe the node
        rp wipe $cluster:$node

        # Put artifacts
        rp put $cluster:$node ./artifacts/cockroach

        # Start the node
        rp start $cluster:$node

        echo "Finished processing node $node (Iteration $iteration)"
        echo "------------------------"
    done

    echo "Finished iteration $iteration"
    echo "========================"
done
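
A monitoring sketch (not part of the original reproduction) that can run alongside the loop; cockroach node status --decommission reports the replica count remaining on each decommissioning node, so a count that stops shrinking for well over 10 minutes would indicate the stall described above:

# Watch replicas drain off decommissioning nodes once a minute.
while true; do
    rp run $cluster:1 -- './cockroach node status --decommission --insecure'
    sleep 60
done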

Restart a node every 3 minutes:

for iteration in $(seq 1 10000); do
    # Loop through nodes
    for node in $(seq 2 19); do
        echo "Restarting node $node (Iteration $iteration)"

        # Restart the node
        rp stop $cluster:$node
        # Start the node
        rp start $cluster:$node

        sleep 180

        echo "Finished restarting node $node (Iteration $iteration)"
        echo "------------------------"
    done

    echo "Finished iteration $iteration"
    echo "========================"
done

Expected behavior

The replica scanner enqueues decommissioning ranges into the replicate queue at least once every replica_count * 100ms (the minimum scanner interval) or 10 minutes, whichever is greater.

Decommissioning therefore does not stall for this reason.
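
As a back-of-envelope check for this reproduction cluster (assuming the 6,000 initial splits and 5x replication above, spread roughly evenly across 40 stores), the 10-minute bound is the one that should apply:

# ~6000 ranges * 5 replicas / 40 stores ≈ 750 replicas per store.
# 750 * 100ms = 75s, well under 10 minutes, so the scanner should
# visit every replica at least once per 10-minute window.
echo "$(( 6000 * 5 / 40 )) replicas/store; $(( 6000 * 5 / 40 / 10 ))s min scan interval"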

Environment:

  • CockroachDB v23.1.22. Other versions may be affected, but the issue has so far only been observed on a v23.1.22 cluster.

Additional context

Manual intervention is required to complete a decommission.

Jira issue: CRDB-41920

Labels

  • A-kv-distribution: Relating to rebalancing and leasing.
  • C-bug: Code not up to spec/doc, specs & docs deemed correct. Solution expected to change code/behavior.
  • O-support: Would prevent or help troubleshoot a customer escalation - bugs, missing observability/tooling, docs.
  • P-3: Issues/test failures with no fix SLA.
  • T-kv: KV Team.
  • branch-release-23.1: Used to mark GA and release blockers, technical advisories, and bugs for 23.1.
