Add two-phase/staged approach to cluster administration #181
Conversation
This commit intentionally does not compile. It will be squashed into the following commit after development and code review.
Add staged clustering as discussed in issue #177. This new approach to cluster administration operates in two phases: users first stage intended cluster changes (join/leave/remove), then commit the changes once they are satisfied with how those changes will impact the cluster. This is supported by turning `riak_core_claimant` into a gen_server and adding a new API for staging and committing changes. `riak_core_console` is also updated to provide an interface to this new API.

Change the claimant logic to no longer automatically transition `joining` nodes to `valid` nodes; this change is necessary to support staged joining of nodes. Change the node removal logic to accept a provided seed for random number generation, so that the staged clustering system calculates the same removal plan during both the planning and commit phases.
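For illustration, driving the new API might look like the following sketch (the claimant entry points and return shapes here are assumptions based on the description above, not necessarily the final API):

```erlang
%% Hedged sketch: stage a leave, review the computed plan, then commit.
%% Assumes plan/0 returns the staged changes plus the resulting ring(s)
%% and commit/0 applies them; exact return values may differ.
stage_review_commit(LeavingNode) ->
    ok = riak_core_claimant:leave_member(LeavingNode),
    case riak_core_claimant:plan() of
        {error, Reason} ->
            {error, Reason};
        {ok, Changes, _NextRings} ->
            %% An operator would inspect Changes here before committing.
            io:format("Staged changes: ~p~n", [Changes]),
            riak_core_claimant:commit()
    end.
```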
Add support for the `replace node new-node` operation, which stages a request for `node` to hand off all of its partitions to `new-node` and then leave the cluster. This is similar to a leave, but directs handoff to a specific replacement node.

Add support for the `force-replace node new-node` operation, which stages a request that reassigns all partitions owned by `node` to `new-node` and removes `node` from the cluster. This is similar to a force-remove, but reassigns partitions to a specific replacement node.
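Staging the two new operations might then look like this (again a sketch; the `replace/2` and `force_replace/2` names are assumed from the operation names above):

```erlang
%% Stage a graceful replacement: 'dev3' hands off all partitions to 'dev5'
%% and then leaves. Nothing changes until the plan is committed.
ok = riak_core_claimant:replace('dev3@127.0.0.1', 'dev5@127.0.0.1'),

%% Stage a forced replacement: partitions owned by 'dev4' are reassigned
%% to 'dev6' without handoff, and 'dev4' is removed from the cluster.
ok = riak_core_claimant:force_replace('dev4@127.0.0.1', 'dev6@127.0.0.1').
```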
Fix `riak-admin join` so that it once again works despite the changes made to node joining as part of the new staged clustering. This is accomplished by adding auto-joining logic to the claimant that transitions to `valid` any `joining` nodes that were added with the traditional join rather than the newer staged join.
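A rough sketch of that auto-join pass (the `'$autojoin'` member-meta marker is hypothetical, used only to illustrate distinguishing traditional joins from staged ones):

```erlang
%% Return the joining nodes that used the traditional join and can
%% therefore be promoted to valid without an explicit plan/commit.
auto_joining_nodes(Ring) ->
    Joining = riak_core_ring:members(Ring, [joining]),
    [N || N <- Joining,
          riak_core_ring:get_member_meta(Ring, N, '$autojoin') =:= true].
```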
The code in here all looks fine. +1 for this part.
Update claim_v2 to compute claim deltas for non-active owning nodes, to handle a corner case hit by the new staged clustering code. Change `riak_core_ring_manager:ring_trans` to allow code to provide a reason when ignoring the transaction. Update `riak_core_claimant` plan/commit to return reasons for failures. Fix a bug in `riak_core_claimant:compute_all_next_rings` where two rings were considered equal when they were not. Fix a bug in `riak_core_ring:future_ring` where leaving nodes that still owned partitions would incorrectly be changed to the exiting state. Add code that enables querying and setting handoff concurrency settings.
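As a hedged illustration of the `ring_trans` change, a transaction fun can now return a reason alongside `ignore` (the ring mutation below is a hypothetical placeholder):

```erlang
riak_core_ring_manager:ring_trans(
  fun(Ring, _Args) ->
          case riak_core_ring:claimant(Ring) =:= node() of
              false ->
                  %% Previously only a bare 'ignore' was possible; the
                  %% reason can now be reported back to the caller.
                  {ignore, not_claimant};
              true ->
                  {new_ring, apply_staged_changes(Ring)} %% hypothetical
          end
  end, []).
```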
If I set the transfer limit to 0 across the cluster, I can still join/leave nodes and partitions will be transferred. That seems wrong, no?
Also, negative numbers are allowed as the value for `transfer_limit`, and passing non-numeric input yields an ugly backtrace on the console.
I'm +1 on the staged cluster stuff, but the transfer_limit stuff seems flaky. Someone else needs to review it, as I'll be gone.
If you set transfer limit to 0, you shouldn't see transfers occur. Keep in mind that newly started nodes will need to have their transfer limits set to 0 as well. What you may have been observing is ownership behavior in an empty (or near empty) cluster. The transfer limit prevents transfers from occurring, not ownership from changing. An empty partition that is changing owners will change owners regardless of the transfer limit setting -- there is no transfer of data necessary, so ownership will just be reassigned after the appropriate transitions / handshakes.
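Concretely, using the querying/setting calls added in the commit above (assuming `set_concurrency/1` and `get_concurrency/0` are the entry points), a limit of 0 only gates data movement:

```erlang
%% Stop new data transfers on this node; every node in the cluster needs
%% the same setting for a cluster-wide freeze. Ownership of empty
%% partitions can still change, since no data transfer is required.
riak_core_handoff_manager:set_concurrency(0),
0 = riak_core_handoff_manager:get_concurrency().
```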
Added code that should validate the provided transfer limit, rejecting negative numbers as well as non-numeric values.
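Such validation might look roughly like the following (a hypothetical helper, not necessarily the committed code):

```erlang
%% Accept only non-negative integers as a transfer limit; anything else
%% yields a tagged error instead of a backtrace.
parse_transfer_limit(Str) ->
    try list_to_integer(Str) of
        N when N >= 0 -> {ok, N};
        _ -> {error, negative_limit}
    catch
        error:badarg -> {error, not_an_integer}
    end.
```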
Ah yes, I didn't have much, or any, data in the cluster, so that explains the behaviour I saw.
Transfer limit validation works for me, +1
I've taken a run at merging the claimv3/sim code with the staged claim on jdm-merge-staged-clustering. I made the branch off current master and did a `git merge --no-ff` of the branch on this pull request. I ended up exposing `remove_indices` in `riak_core_claimant` so that the sim code could drive that directly rather than refactoring to use more of the claimant code. There were three issues during the merge.

The merged branch passes the riak_test `verify_build_cluster` and `verify_staged_clustering` tests, but it fails on the `verify_{claimant,leave,down}` tests (as does this gh177-staged-clustering branch for me).
Merge of 'gh177-staged-clustering' and 'master' had a few issues. Re-add the "clear member metadata on leave" logic that was removed in the merge. Re-add `riak_core_ring:cancel_transfers/1`, which was also removed in the merge. Perform minor code clean-up. Fix some typespecs to make dialyzer happier (more work still needed). This finalizes integration into 'master' for #181
All integration on branch
This pull-request adds staged clustering as discussed in issue #177. This new approach to cluster administration operates in two phases. Users first stage intended cluster changes (join/leave/remove), then commit the changes once they are satisfied with how those changes will impact the cluster.
This pull-request also adds the new `replace` and `force-replace` operations.

The `replace node new-node` operation stages a request for `node` to hand off all of its partitions to `new-node` and then leave the cluster. This is similar to a `leave`, but directs handoff to a specific replacement node.

The `force-replace node new-node` operation stages a request that reassigns all partitions owned by `node` to `new-node` and removes `node` from the cluster. This is similar to a `force-remove`, but reassigns partitions to a specific replacement node.

This feature addition is implemented through a series of refactorings and code additions, and is best reviewed by looking at the history of the commits rather than just at the resulting code.
The first commit moves the claimant logic from `riak_core_ring.erl` into a new `riak_core_claimant.erl`. This commit is more or less a straightforward cut/paste and intentionally doesn't compile cleanly.

The second commit then refactors `riak_core_ring.erl` and `riak_core_claimant.erl` so that everything compiles and works as it did before the code split. The diff between these two commits should make clear what was changed. After review, these two commits will be squashed together to ensure there is no broken commit in the history.

Note: only the changes to the claimant code itself need review; the pre-existing claimant code has long since been reviewed. Future documentation and code refactoring is planned, but that is outside the scope of this pull-request.
The next four commits add the staged clustering code. The first refactors the claimant code so that it can be run without side-effects. The second adds the staged clustering code itself, the third adds the new `replace` and `force-replace` operations, and the fourth fixes an issue with the older non-staged join operation.

There is a related `basho/riak` pull-request that updates `riak-admin` to support the new staged approach.

There is also a related `basho/riak_test` pull-request that implements an appropriate integration test.