Skip to content

raftstore: allow leader transfer if conf change applied on transferee#17643

Merged
ti-chi-bot[bot] merged 18 commits intotikv:masterfrom
hhwyt:master
Nov 6, 2024
Merged

raftstore: allow leader transfer if conf change applied on transferee#17643
ti-chi-bot[bot] merged 18 commits intotikv:masterfrom
hhwyt:master

Conversation

@hhwyt
Copy link
Contributor

@hhwyt hhwyt commented Oct 14, 2024

What is changed and how it works?

Issue Number: Close #17363

What's Changed:

Allow leader transfer if conf change applied on transferee.

The purpose of this PR is to address the issue where transfer-leader can be blocked by pending conf-change or other admin commands on the current leader, even though these commands may have already been applied on the transferee. This issue is particularly critical in scenarios where the leader’s I/O operations are hung. Our goal is to ensure swift transfer-leader operations in such situations to maintain region availability.

The transfer-leader operation can be categorized into two scenarios:

  1. Transfer-Leader without Pessimistic Locks: This follows the classic model described in the Raft thesis. In the current implementation, the transfer operation can only be blocked by pending conf-change commands on the leader, as seen in the ready_to_transfer_leader function.
  2. Transfer-Leader with Pessimistic Locks: This is an optimization in TiKV, where the transfer operation serializes the leader’s pessimistic locks and replicates them via Raft logs to the transferee. This approach enhances the availability of the pessimistic lock. After proposing these locks, the leader proposes a TransferLeader command. However, this command may conflict with other pending admin commands on the leader, as seen in the propose_check_epoch->admin_cmd_epoch_lookup logic. The transfer-leader operation will be blocked if there is a pending conflicting admin command.

Scenario 2 is a subset of scenario 1. Therefore, to discuss the impact of transfer-leader, we only need to focus on scenario 1.

From a logical standpoint, transfer-leader can be viewed as an application-layer trigger of an election timeout. The raft-level correctness established in the Raft thesis implies that transfer-leader should not be blocked by any pending admin commands on the leader. So our focus is on ensuring application-level correctness. We must guarantee that the side-effects(e.g., register region-merge tick for region merge) of any admin commands are consistently applied to the new leader (transferee) compared to the old leader, and do not cause any availability issues.

Analysis of transfer-leader meets pending admin commands on the leader, yet applied on the transferee:

  1. ConfChange
    • AddNode:
      • For a new peer transferee, leader cannot send MsgTransferLeader to that peer since it is not included in the leader's voter list.
      • For other peer transferees, the transfer-leader is equivalent to an election timeout.
    • RemoveNode:
      • For a removed peer transferee, leader can send MsgTransferLeader to that peer since it is still included in the leader's voter list. However, if the transferee has already applied the RemoveNode, it becomes a tombstone peer and will reject any raft messages, including the MsgTransferLeader.
      • For others peer transferees, the transfer-leader is equivalent to an election timeout.
    • AddLearnerNode:
      • For a new-demoted learner peer transferee, the leader can send MsgTransferLeader to that peer since it is still regarded as a follower from the leader’s perspective. However, the learner transferee will reject the MsgTransferLeader message based on the check in maybe_reject_transfer_leader_msg.
      • For a new-added learner peer transferee, the leader will not send MsgTransferLeader because it does not recognize this learner.
      • For other non-learner peers, the transfer-leader is equivalent to an election timeout.
  2. Region Split
    • For the old region, the transfer-leader is equivalent to an election timeout.
    • For the new region, to prevent this linear consistency issue, only new peers on the same node as the old region leader can initiate elections immediately after applying a region split. However, in this scenario, if a transfer-leader interleaves with the region split, the new region may not detect the old region leader on the same node when the region split is applied (see on_ready_split_region->maybe_campaign). This prevents the new region from initiating an election until an election timeout occurs, causing availability issues. A potential solution is to use an ExtraMessage to inform the transferee to initiate a new region election after transfer-leader (might be implemented in this PR) with election timeouts serving as a fallback in case lost this message.
  3. Region Merge
    • PrepareMerge
      • For the transferee, since the merge tick is registered on every peer, it can proceed to the next step of the region merge once it becomes the leader. During a region merge, the source region is neither readable nor writable, so it doesn’t matter which peer is the leader.
    • CommitMerge:
      • For the source region, it becomes tombstone, will reject the MsgTransferLeader.
      • For the target region, the transfer-leader is equivalent to an election timeout.
    • RollbackMerge
      • For any regions, after applying, the state returns to prior, and the transfer-leader is equivalent to an election timeout.
  4. FlashBack
    • PrepareFlashback: If a transfer-leader occurs during data writing in the prepare phase, it results in an error, prompting a client retry. Flashback supports arbitrary retries without impacting correctness.
    • FinishFlashback: The transfer-leader is equivalent to an election timeout.
  5. ComputeHash/VerifyHash/CompactLog: No impact.

Finally, we conclude that the transfer-leader can safely proceed even when there are pending admin commands on the leader.

Before fix

image
image

After fix

image

Feel free to modify any part as per your preference!

Related changes

  • PR to update pingcap/docs/pingcap/docs-cn:
  • Need to cherry-pick to the release branch

Check List

Tests

  • Unit test
  • Integration test
  • Manual test
  • No code

Side effects

  • Performance regression: Consumes more CPU
  • Performance regression: Consumes more Memory
  • Breaking backward compatibility

Release note

None

@ti-chi-bot ti-chi-bot bot added release-note-none Denotes a PR that doesn't merit a release note. do-not-merge/needs-triage-completed dco-signoff: yes Indicates the PR's author has signed the dco. contribution This PR is from a community contributor. needs-ok-to-test Indicates a PR created by contributors and need ORG member send '/ok-to-test' to start testing. labels Oct 14, 2024
@ti-chi-bot
Copy link
Contributor

ti-chi-bot bot commented Oct 14, 2024

Hi @hhwyt. Thanks for your PR.

I'm waiting for a tikv member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@ti-chi-bot ti-chi-bot bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Oct 14, 2024
@hhwyt
Copy link
Contributor Author

hhwyt commented Oct 14, 2024

TODO:This issue should be verify with a manual test.

@hhwyt
Copy link
Contributor Author

hhwyt commented Oct 14, 2024

/cc @glorv @Connor1996 @overvenus @LykxSassinator PTAL, thx.

"not ready to transfer leader; target peer has an unapplied conf change";
"region_id" => self.region_id,
"store_id" => self.peer.get_store_id(),
"peer_id" => peer_id,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"peer_id" => peer_id,
"target_peer_id" => peer_id,

"peer_id" mostly refer to the current peer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

pd_client.region_leader_must_not_be(region_id, new_peer(3, 3));

pd_client.transfer_leader(region_id, new_peer(4, 4), vec![]);
pd_client.region_leader_must_be(region_id, new_peer(4, 4));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please also test the scenario that a peer is demoting from voter to learner, and the leader tries to transfer leadership to this peer because leader's apply is block.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very helpful comment!

hhwyt added 2 commits October 17, 2024 19:54
Signed-off-by: hhwyt <hhwyt1@gmail.com>
Signed-off-by: hhwyt <hhwyt1@gmail.com>
let pending_snapshot = self.is_handling_snapshot() || self.has_pending_snapshot();
// shouldn't transfer leader to witness peer or non-witness waiting data
if self.is_witness() || self.wait_data
if self.is_witness() || is_learner(&self.peer) || self.wait_data
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's checked by ready_to_transfer_leader

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the leader hasn't applied, the check in ready_to_transfer_leader is based on stale configuration, leading in incorrect result.

hhwyt added 3 commits October 18, 2024 16:12
Signed-off-by: hhwyt <hhwyt1@gmail.com>
Signed-off-by: hhwyt <hhwyt1@gmail.com>
Signed-off-by: hhwyt <hhwyt1@gmail.com>
Signed-off-by: hhwyt <hhwyt1@gmail.com>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
"target_applied_index" => index
"transferee_applied_index" => index

We use the term "transferee" instead of "target".

@hhwyt hhwyt force-pushed the master branch 3 times, most recently from 68e4abd to a3c3dd9 Compare October 22, 2024 17:35
@LykxSassinator
Copy link
Contributor

rest LGTM

@ti-chi-bot ti-chi-bot bot added needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. labels Nov 13, 2024
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-8.5: #17811.

@ti-chi-bot ti-chi-bot bot removed the needs-cherry-pick-release-7.5 Should cherry pick this PR to release-7.5 branch. label Nov 13, 2024
ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this pull request Nov 13, 2024
close tikv#17363

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-7.1: #17812.

ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this pull request Nov 13, 2024
close tikv#17363

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-7.5: #17813.

@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-8.1: #17814.

ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this pull request Nov 13, 2024
close tikv#17363

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
@ti-chi-bot
Copy link
Member

In response to a cherrypick label: new pull request created to branch release-6.5: #17815.

ti-chi-bot pushed a commit to ti-chi-bot/tikv that referenced this pull request Nov 13, 2024
close tikv#17363

Signed-off-by: ti-chi-bot <ti-community-prow-bot@tidb.io>
ti-chi-bot bot added a commit that referenced this pull request Nov 13, 2024
…#17643) (#17811)

close #17363

Allow leader transfer if conf change applied on transferee.

Signed-off-by: hhwyt <hhwyt1@gmail.com>

Co-authored-by: hhwyt <hhwyt1@gmail.com>
Co-authored-by: ti-chi-bot[bot] <108142056+ti-chi-bot[bot]@users.noreply.github.com>
Co-authored-by: Bisheng Huang <hbisheng@gmail.com>
@ti-chi-bot ti-chi-bot bot removed the needs-cherry-pick-release-6.5 Should cherry pick this PR to release-6.5 branch. label Feb 5, 2025
hhwyt added a commit to ti-chi-bot/tikv that referenced this pull request Feb 5, 2025
…tikv#17643)

close tikv#17363

Allow leader transfer if conf change applied on transferee.

Signed-off-by: hhwyt <hhwyt1@gmail.com>

Co-authored-by: Bisheng Huang <hbisheng@gmail.com>

(cherry picked from commit 5182652)
Signed-off-by: hhwyt <hhwyt1@gmail.com>
hhwyt added a commit to ti-chi-bot/tikv that referenced this pull request Feb 5, 2025
…tikv#17643)

close tikv#17363

Allow leader transfer if conf change applied on transferee.

Signed-off-by: hhwyt <hhwyt1@gmail.com>

Co-authored-by: Bisheng Huang <hbisheng@gmail.com>

(cherry picked from commit 5182652)
Signed-off-by: hhwyt <hhwyt1@gmail.com>
hhwyt added a commit to ti-chi-bot/tikv that referenced this pull request Feb 5, 2025
…tikv#17643)

close tikv#17363

Allow leader transfer if conf change applied on transferee.

Signed-off-by: hhwyt <hhwyt1@gmail.com>

Co-authored-by: Bisheng Huang <hbisheng@gmail.com>

(cherry picked from commit 5182652)
Signed-off-by: hhwyt <hhwyt1@gmail.com>
ti-chi-bot bot pushed a commit that referenced this pull request Feb 7, 2025
…#17643) (#17815)

close #17363

Allow leader transfer if conf change applied on transferee.

Signed-off-by: hhwyt <hhwyt1@gmail.com>

Co-authored-by: hhwyt <hhwyt1@gmail.com>
Co-authored-by: Bisheng Huang <hbisheng@gmail.com>
hhwyt added a commit to ti-chi-bot/tikv that referenced this pull request Feb 11, 2025
…tikv#17643)

close tikv#17363

Allow leader transfer if conf change applied on transferee.

Signed-off-by: hhwyt <hhwyt1@gmail.com>

Co-authored-by: Bisheng Huang <hbisheng@gmail.com>

(cherry picked from commit 5182652)
Signed-off-by: hhwyt <hhwyt1@gmail.com>
ti-chi-bot bot pushed a commit that referenced this pull request Feb 11, 2025
…#17643) (#17813)

close #17363

Allow leader transfer if conf change applied on transferee.

Signed-off-by: hhwyt <hhwyt1@gmail.com>

Co-authored-by: hhwyt <hhwyt1@gmail.com>
Co-authored-by: Bisheng Huang <hbisheng@gmail.com>
hhwyt added a commit to hhwyt/tikv that referenced this pull request Feb 14, 2025
…tikv#17643)

close tikv#17363

Allow leader transfer if conf change applied on transferee.

Signed-off-by: hhwyt <hhwyt1@gmail.com>

Co-authored-by: Bisheng Huang <hbisheng@gmail.com>

(cherry picked from commit 5182652)
Signed-off-by: hhwyt <hhwyt1@gmail.com>
ti-chi-bot bot pushed a commit that referenced this pull request Feb 17, 2025
…#17643) (#17814)

close #17363

Allow leader transfer if conf change applied on transferee.

Signed-off-by: hhwyt <hhwyt1@gmail.com>

Co-authored-by: hhwyt <hhwyt1@gmail.com>
Co-authored-by: Bisheng Huang <hbisheng@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved contribution This PR is from a community contributor. dco-signoff: yes Indicates the PR's author has signed the dco. lgtm needs-cherry-pick-release-7.1 Should cherry pick this PR to release-7.1 branch. needs-cherry-pick-release-8.1 Should cherry pick this PR to release-8.1 branch. needs-cherry-pick-release-8.5 Should cherry pick this PR to release-8.5 branch. ok-to-test Indicates a PR is ready to be tested. release-note-none Denotes a PR that doesn't merit a release note. size/XL Denotes a PR that changes 500-999 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

qps continued to drop to zero during one of tikv io hang

7 participants