Limit RW separation to remote store enabled clusters and update recovery flow #16760
❌ Gradle check result for a932d59: Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
This PR includes multiple changes to search replica recovery.
1. Change search-only replica copies to recover as empty store instead of PEER. This runs a store recovery that syncs segments from the remote store directly and eliminates any primary communication.
2. Remove search replicas from the in-sync allocation ID set and update the routing table to exclude them from allAllocationIds. This ensures primaries aren't tracking or validating the routing table for any search replica's presence.
3. Change search replica validation to require remote store.
There are versions of the above changes that are still possible with primary-based node-node replication, but I don't think they are worth making at this time.
Signed-off-by: Marc Handalian <[email protected]>
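The three changes above can be illustrated with a minimal Java sketch. It is not the code from this PR: `SearchReplicaRecoverySketch`, `ShardCopy`, `RecoverySource`, and every method name below are hypothetical stand-ins, chosen only to show the intended behavior (search-only replicas recover from an empty store, are left out of the in-sync allocation ID set, and are rejected when remote store is not enabled).

```java
import java.util.List;
import java.util.Set;
import java.util.stream.Collectors;

// Hypothetical illustration only: none of these types are OpenSearch classes.
final class SearchReplicaRecoverySketch {

    // Simplified stand-in for the recovery sources named in the description.
    enum RecoverySource { EMPTY_STORE, PEER }

    // Simplified stand-in for a shard copy in the routing table.
    static final class ShardCopy {
        final String allocationId;
        final boolean primary;
        final boolean searchOnly;

        ShardCopy(String allocationId, boolean primary, boolean searchOnly) {
            this.allocationId = allocationId;
            this.primary = primary;
            this.searchOnly = searchOnly;
        }
    }

    // Change 1: a search-only replica starts from an empty store and syncs
    // segments from the remote store; a regular replica still recovers from its peer.
    static RecoverySource recoverySourceFor(ShardCopy replica) {
        if (replica.primary) {
            throw new IllegalArgumentException("this sketch only covers replica copies");
        }
        return replica.searchOnly ? RecoverySource.EMPTY_STORE : RecoverySource.PEER;
    }

    // Change 2: search-only replicas are excluded from the in-sync allocation ID set,
    // so the primary never tracks them.
    static Set<String> inSyncAllocationIds(List<ShardCopy> copies) {
        return copies.stream()
            .filter(copy -> !copy.searchOnly)
            .map(copy -> copy.allocationId)
            .collect(Collectors.toSet());
    }

    // Change 3: requesting search replicas is only valid when remote store is enabled.
    static void validateSearchReplicaCount(int searchReplicas, boolean remoteStoreEnabled) {
        if (searchReplicas > 0 && !remoteStoreEnabled) {
            throw new IllegalArgumentException("search replicas require a remote store enabled cluster");
        }
    }
}
```

In this sketch, a shard group with one primary, one regular replica, and one search-only replica yields an in-sync set containing only the first two allocation IDs, and `validateSearchReplicaCount(1, false)` throws.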
… the AllAllocationIds set in the routing table Signed-off-by: Marc Handalian <[email protected]>
…e store cluster. This check had previously only checked for segrep Signed-off-by: Marc Handalian <[email protected]>
…ases Signed-off-by: Marc Handalian <[email protected]>
… a remote store cluster." reverting this, we already check for remote store earlier. This reverts commit 48ca1a3. Signed-off-by: Marc Handalian <[email protected]>
… only writers when red Signed-off-by: Marc Handalian <[email protected]>
This commit adds PR feedback and recovery tests post node restart. Signed-off-by: Marc Handalian <[email protected]>
Apologies for the rebase; I had to fix the DCO check on an old commit.
❌ Gradle check result for cf68380: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❌ Gradle check result for eaa38d9: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❌ Gradle check result for 002323e: FAILURE. Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
The backport to
To backport manually, run these commands in your terminal:
# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/OpenSearch/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/OpenSearch/backport-2.x
# Create a new branch
git switch --create backport/backport-16760-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 8191de85856d291507d09a7fd425908843ed8675
# Push it to GitHub
git push --set-upstream origin backport/backport-16760-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/OpenSearch/backport-2.x
Then, create a pull request where the
…ery flow (opensearch-project#16760)
* Update search only replica recovery flow
This PR includes multiple changes to search replica recovery.
1. Change search only replica copies to recover as empty store instead of PEER. This will run a store recovery that syncs segments from remote store directly and eliminate any primary communication.
2. Remove search replicas from the in-sync allocation ID set and update routing table to exclude them from allAllocationIds. This ensures primaries aren't tracking or validating the routing table for any search replica's presence.
3. Change search replica validation to require remote store.
There are versions of the above changes that are still possible with primary based node-node replication, but I don't think they are worth making at this time.
Signed-off-by: Marc Handalian <[email protected]>
* more coverage
Signed-off-by: Marc Handalian <[email protected]>
* add changelog entry
Signed-off-by: Marc Handalian <[email protected]>
* add assertions that Search Replicas are not in the in-sync id set nor the AllAllocationIds set in the routing table
Signed-off-by: Marc Handalian <[email protected]>
* update async task to only run if the FF is enabled and we are a remote store cluster. This check had previously only checked for segrep
Signed-off-by: Marc Handalian <[email protected]>
* clean up max shards logic
Signed-off-by: Marc Handalian <[email protected]>
* remove search replicas from check during renewPeerRecoveryRetentionLeases
Signed-off-by: Marc Handalian <[email protected]>
* Revert "update async task to only run if the FF is enabled and we are a remote store cluster." reverting this, we already check for remote store earlier. This reverts commit 48ca1a3.
Signed-off-by: Marc Handalian <[email protected]>
* Add more tests for failover case
Signed-off-by: Marc Handalian <[email protected]>
* Update remotestore restore logic and add test ensuring we can restore only writers when red
Signed-off-by: Marc Handalian <[email protected]>
* Fix Search replicas to honor node level recovery limits
Signed-off-by: Marc Handalian <[email protected]>
* Fix translog UUID mismatch on existing store recovery. This commit adds PR feedback and recovery tests post node restart.
Signed-off-by: Marc Handalian <[email protected]>
* Fix spotless
Signed-off-by: Marc Handalian <[email protected]>
* Fix bug with remote restore and add more tests
Signed-off-by: Marc Handalian <[email protected]>
---------
Signed-off-by: Marc Handalian <[email protected]>
(cherry picked from commit 8191de8)
Description
This PR includes multiple changes to search replica recovery to further decouple these shards from primaries.
Related Issues
Resolves #15952
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.