Add two-phase/staged approach to cluster administration #181
Conversation
This commit intentionally does not compile. It will be squashed into the following commit after development and code review.
Add staged clustering as discussed in issue #177. This new approach to cluster administration operates in two phases: users first stage intended cluster changes (join/leave/remove), then commit the changes once they are satisfied with how those changes will impact the cluster. This is supported by turning `riak_core_claimant` into a gen_server and adding a new API for staging and committing changes. `riak_core_console` is also updated to provide an interface to this new API.

Change the claimant logic to no longer automatically transition `joining` nodes to `valid` nodes; this change is necessary to support staged joining of nodes. Change the node removal logic to accept a provided seed for random number generation, so that the staged clustering system calculates the same removal plan during both the planning and commit phases.
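For illustration, driving the new API might look like the following sketch (the claimant entry points and return shapes here are assumptions based on the description above, not necessarily the final API):

```erlang
%% Hedged sketch: stage a leave, review the computed plan, then commit.
%% Assumes plan/0 returns the staged changes plus the resulting ring(s)
%% and commit/0 applies them; exact return values may differ.
stage_review_commit(LeavingNode) ->
    ok = riak_core_claimant:leave_member(LeavingNode),
    case riak_core_claimant:plan() of
        {error, Reason} ->
            {error, Reason};
        {ok, Changes, _NextRings} ->
            %% An operator would inspect Changes here before committing.
            io:format("Staged changes: ~p~n", [Changes]),
            riak_core_claimant:commit()
    end.
```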
Add support for the `replace node new-node` operation, which stages a request for `node` to hand off all of its partitions to `new-node` and then leave the cluster. This is similar to a leave, but directs handoff to a specific replacement node.

Add support for the `force-replace node new-node` operation, which stages a request that reassigns all partitions owned by `node` to `new-node` and removes `node` from the cluster. This is similar to a force-remove, but reassigns partitions to a specific replacement node.
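Staging the two new operations might then look like this (again a sketch; the `replace/2` and `force_replace/2` names are assumed from the operation names above):

```erlang
%% Stage a graceful replacement: 'dev3' hands off all partitions to 'dev5'
%% and then leaves. Nothing changes until the plan is committed.
ok = riak_core_claimant:replace('dev3@127.0.0.1', 'dev5@127.0.0.1'),

%% Stage a forced replacement: partitions owned by 'dev4' are reassigned
%% to 'dev6' without handoff, and 'dev4' is removed from the cluster.
ok = riak_core_claimant:force_replace('dev4@127.0.0.1', 'dev6@127.0.0.1').
```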
Fix `riak-admin join` so that it once again works despite the changes made to node joining as part of the new staged clustering. This is accomplished by adding auto-joining logic to the claimant that transitions to `valid` any `joining` nodes that were added with the traditional join rather than the newer staged join.
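A rough sketch of that auto-join pass (the `'$autojoin'` member-meta marker is hypothetical, used only to illustrate distinguishing traditional joins from staged ones):

```erlang
%% Return the joining nodes that used the traditional join and can
%% therefore be promoted to valid without an explicit plan/commit.
auto_joining_nodes(Ring) ->
    Joining = riak_core_ring:members(Ring, [joining]),
    [N || N <- Joining,
          riak_core_ring:get_member_meta(Ring, N, '$autojoin') =:= true].
```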
The code in here all looks fine. +1 for this part.
Update claim_v2 to compute claim deltas for non-active owning nodes, to handle a corner case hit by the new staged clustering code. Change `riak_core_ring_manager:ring_trans` to allow code to provide a reason when ignoring the transaction. Update `riak_core_claimant` plan/commit to return reasons for failures. Fix a bug in `riak_core_claimant:compute_all_next_rings` where two rings were considered equal when they were not. Fix a bug in `riak_core_ring:future_ring` where leaving nodes that still owned partitions would incorrectly be changed to the exiting state. Add code that enables querying and setting handoff concurrency settings.
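As a hedged illustration of the `ring_trans` change, a transaction fun can now return a reason alongside `ignore` (the ring mutation below is a hypothetical placeholder):

```erlang
riak_core_ring_manager:ring_trans(
  fun(Ring, _Args) ->
          case riak_core_ring:claimant(Ring) =:= node() of
              false ->
                  %% Previously only a bare 'ignore' was possible; the
                  %% reason can now be reported back to the caller.
                  {ignore, not_claimant};
              true ->
                  {new_ring, apply_staged_changes(Ring)} %% hypothetical
          end
  end, []).
```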
If I set the transfer limit to 0 across the cluster, I can still join/leave nodes and partitions will be transferred. That seems wrong, no?
Also, negative numbers are allowed as the value for `transfer_limit`, and passing non-numeric input yields an ugly backtrace on the console.
I'm +1 on the staged cluster stuff, but the transfer_limit stuff seems flaky. Someone else needs to review it, as I'll be gone.
If you set transfer limit to 0, you shouldn't see transfers occur. Keep in mind that newly started nodes will need to have their transfer limits set to 0 as well. What you may have been observing is ownership behavior in an empty (or near empty) cluster. The transfer limit prevents transfers from occurring, not ownership from changing. An empty partition that is changing owners will change owners regardless of the transfer limit setting -- there is no transfer of data necessary, so ownership will just be reassigned after the appropriate transitions / handshakes.
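Concretely, using the querying/setting calls added in the commit above (assuming `set_concurrency/1` and `get_concurrency/0` are the entry points), a limit of 0 only gates data movement:

```erlang
%% Stop new data transfers on this node; every node in the cluster needs
%% the same setting for a cluster-wide freeze. Ownership of empty
%% partitions can still change, since no data transfer is required.
riak_core_handoff_manager:set_concurrency(0),
0 = riak_core_handoff_manager:get_concurrency().
```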
Added code that should validate the provided transfer limit, rejecting negative numbers as well as non-numeric values.
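Such validation might look roughly like the following (a hypothetical helper, not necessarily the committed code):

```erlang
%% Accept only non-negative integers as a transfer limit; anything else
%% yields a tagged error instead of a backtrace.
parse_transfer_limit(Str) ->
    try list_to_integer(Str) of
        N when N >= 0 -> {ok, N};
        _ -> {error, negative_limit}
    catch
        error:badarg -> {error, not_an_integer}
    end.
```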
Ah yes, I didn't have much, or any, data in the cluster, so that explains the behaviour I saw.
Transfer limit validation works for me, +1
I've taken a run at merging the claimv3/sim code with the staged claim on jdm-merge-staged-clustering. I made the branch off current master and did a `git merge --no-ff` of the branch on this pull request. I ended up exposing `remove_indices` in `riak_core_claimant` so that the sim code could drive that directly rather than refactoring to use more of the claimant code. There were three issues during the merge.

The merged branch passes the riak_test `verify_build_cluster` and `verify_staged_clustering` tests, but it fails on the `verify_{claimant,leave,down}` tests (as does this gh177-staged-clustering branch for me).
Merge of 'gh177-staged-clustering' and 'master' had a few issues. Re-add the "clear member metadata on leave" logic that was removed in the merge. Re-add `riak_core_ring:cancel_transfers/1`, which was also removed in the merge. Perform minor code clean-up. Fix some typespecs to make dialyzer happier (more work still needed). This finalizes integration into 'master' for #181
All integration on branch
This pull-request adds staged clustering as discussed in issue #177. This new approach to cluster administration operates in two phases. Users first stage intended cluster changes (join/leave/remove), then commit the changes once they are satisfied with how those changes will impact the cluster.
This pull-request also adds the new `replace` and `force-replace` operations.

The `replace node new-node` operation stages a request for `node` to hand off all of its partitions to `new-node` and then leave the cluster. This is similar to a `leave`, but directs handoff to a specific replacement node.

The `force-replace node new-node` operation stages a request that reassigns all partitions owned by `node` to `new-node` and removes `node` from the cluster. This is similar to a `force-remove`, but reassigns partitions to a specific replacement node.

This feature addition is implemented through a series of refactorings and code additions, and is best reviewed by looking at the history of the commits rather than just at the resulting code.
The first commit moves the claimant logic from `riak_core_ring.erl` into a new `riak_core_claimant.erl`. This commit is more or less a straightforward cut/paste and intentionally doesn't compile cleanly.

The second commit then refactors `riak_core_ring.erl` and `riak_core_claimant.erl` so that everything compiles and works as it did before the code split. The diff between these two commits should make clear what was changed. After review, these two commits will be squashed together to ensure there is no broken commit in the history.

Note: only the changes to the claimant code itself need review; the pre-existing claimant code has long since been reviewed. Future documentation and code refactoring is planned, but that is outside the scope of this pull-request.
The next four commits add the staged clustering code. The first refactors the claimant code so that it can be run without side-effects. The second adds the staged clustering code itself, the third adds the new `replace` and `force-replace` operations, and the fourth fixes an issue with the older non-staged join operation.

There is a related `basho/riak` pull-request that updates `riak-admin` to support the new staged approach.

There is also a related `basho/riak_test` pull-request that implements an appropriate integration test.