
jtuple (Contributor) commented May 25, 2012

This pull-request adds staged clustering as discussed in issue #177. This new approach to cluster administration operates in two phases. Users first stage intended cluster changes (join/leave/remove), then commit the changes when ready and satisfied with how those changes will impact the cluster.

This pull-request also adds the new replace and force-replace operations.

The `replace node new-node` operation stages a request for `node` to hand off all of its partitions to `new-node` and then leave the cluster. This is similar to a leave, but directs handoff to a specific replacement node.

The `force-replace node new-node` operation stages a request that reassigns all partitions owned by `node` to `new-node` and removes `node` from the cluster. This is similar to a force-remove, but reassigns partitions to a specific replacement node.
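For illustration, the two-phase workflow might look like the following from an attached Erlang console. This is a hedged sketch: the function names (`stage/2`, `plan/0`, `commit/0`, `clear/0`) and the staged-action terms are assumptions based on this pull-request's description, not a confirmed API.

```erlang
%% Hypothetical sketch of the staged workflow; function names and action
%% terms are assumptions based on this PR's description.

%% Stage a replacement of dev1 by dev4, and a leave of dev2:
ok = riak_core_claimant:stage('dev1@127.0.0.1', {replace, 'dev4@127.0.0.1'}),
ok = riak_core_claimant:stage('dev2@127.0.0.1', leave),

%% Inspect the computed plan; nothing changes in the cluster yet:
{ok, Changes, NextRings} = riak_core_claimant:plan(),

%% Commit only when satisfied (or abandon staged changes with clear/0):
ok = riak_core_claimant:commit().
```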

This feature addition is implemented through a series of refactoring and code additions, and is best reviewed looking at the history of the commits rather than just at the resulting code.

The first commit moves the claimant logic from riak_core_ring.erl into a new riak_core_claimant.erl. This commit is more or less a straightforward cut/paste and intentionally doesn't compile cleanly.

The second commit then refactors riak_core_ring.erl and riak_core_claimant.erl so that things compile and work as they did before the code split. The diff between these two commits should make it clear what was changed. After review, these two commits will be squashed together to ensure there is no broken commit in the history.

Note: only changes to the claimant code itself need review; the pre-existing claimant code has long since been reviewed. Future documentation and code refactoring is planned, but is outside the scope of this pull-request.

The next four commits add the staged clustering code. The first refactors the claimant code so that it can be run without side-effects. The second adds the staged clustering code itself, the third adds the new replace and force-replace operations, and the fourth fixes an issue with the older non-staged join operation.

There is a related basho/riak pull-request that updates riak-admin to support the new staged approach.

There is also a related basho/riak_test pull-request that implements an appropriate integration test.

jtuple added 6 commits May 23, 2012 14:05
This commit intentionally does not compile. It will be squashed into
the following commit after development and code review.

Add staged clustering as discussed in issue #177. This new approach
to cluster administration operates in two phases. Users first stage
intended cluster changes (join/leave/remove), then commit the changes
when ready and satisfied with how those changes will impact the
cluster. This is supported by turning `riak_core_claimant` into a
gen_server and adding a new API for scheduling and committing staged
changes. Also, `riak_core_console` is updated to provide an interface
to this new API.

Change the claimant logic to no longer automatically transition
`joining` nodes to `valid` nodes. This change is necessary to support
staged joining of nodes.

Change the node removal logic to allow use of a provided seed for
random number generation, to enable the staged clustering system to
calculate the same removal plan during both the planning and commit
phases.
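The seeded-removal idea can be sketched in plain Erlang. This is an illustrative toy, not the actual riak_core code: by threading an explicit seed through the side-effect-free `random` API, the same choice is reproduced whenever the same seed is supplied, which is what lets the planning and commit phases compute identical removal plans.

```erlang
-module(seeded_pick).
-export([pick/2]).

%% Deterministically pick one element of List for a given {A,B,C} seed,
%% using the functional random API so no process state is involved.
pick({_, _, _} = Seed, List) ->
    {N, _NewState} = random:uniform_s(length(List), Seed),
    lists:nth(N, List).

%% pick({1,2,3}, [a,b,c]) returns the same element on every call with
%% that seed, so both phases agree on the plan.
```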

Add support for the `replace node new-node` operation that stages a
request for `node` to hand off all of its partitions to `new-node` and
then leave the cluster. This is similar to a leave, but directs
handoff to a specific replacement node.

Add support for the `force-replace node new-node` operation that stages
a request that reassigns all partitions owned by node to new-node
and removes node from the cluster. This is similar to a force-remove,
but reassigns partitions to a specific replacement node.
Fix 'riak-admin join' so that it once again works despite the changes
made to node joining as part of the new staged clustering work. This
is accomplished by adding auto-join logic to the claimant that
transitions to 'valid' any 'joining' nodes that were joined to the
cluster using the traditional join rather than the newer staged join.
Vagabond (Contributor) commented Jun 4, 2012

The code in here all looks fine. +1 for this part.

Update claim_v2 to compute claim deltas for non-active owning nodes, to
handle corner case hit by new staged clustering code.

Change riak_core_ring_manager:ring_trans to allow code to provide a reason
when ignoring the transaction.

Update riak_core_claimant plan/commit to return reasons for failures.

Fix bug with riak_core_claimant:compute_all_next_rings where two rings were
considered equal when they were not.

Fix bug with riak_core_ring:future_ring where leaving nodes that still owned
partitions would incorrectly be changed to the exiting state.

Add code that enables querying and setting handoff concurrency settings.
Vagabond (Contributor) commented Jun 9, 2012

If I set the transfer limit to 0 across the cluster, I can still join/leave nodes and partitions will be transferred. That seems wrong, no?

Vagabond (Contributor) commented Jun 9, 2012

Negative numbers are also accepted as the value for transfer_limit, and passing non-numeric input yields an ugly backtrace on the console.

Vagabond (Contributor) commented Jun 9, 2012

I'm +1 on the staged cluster stuff, but the transfer_limit stuff seems flaky. Someone else needs to review it, as I'll be gone.

jtuple (Contributor, Author) commented Jun 11, 2012

If you set transfer limit to 0, you shouldn't see transfers occur. Keep in mind that newly started nodes will need to have their transfer limits set to 0 as well. What you may have been observing is ownership behavior in an empty (or near empty) cluster. The transfer limit prevents transfers from occurring, not ownership from changing. An empty partition that is changing owners will change owners regardless of the transfer limit setting -- there is no transfer of data necessary, so ownership will just be reassigned after the appropriate transitions / handshakes.
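As a hedged sketch of the distinction above, and assuming `riak_core_handoff_manager:set_concurrency/1` is the setter added by the handoff-concurrency commit in this pull-request, disabling transfers cluster-wide might look like:

```erlang
%% Assumed API: set_concurrency/1 from the handoff-concurrency commit.
%% A limit of 0 stops data transfers on those nodes, but empty partitions
%% can still change owners, since no data needs to move.
{ok, Ring} = riak_core_ring_manager:get_my_ring(),
Nodes = riak_core_ring:all_members(Ring),
rpc:multicall(Nodes, riak_core_handoff_manager, set_concurrency, [0]).
```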

jtuple (Contributor, Author) commented Jun 11, 2012

Added code that validates the provided transfer limit, rejecting negative numbers as well as non-numeric values.
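The validation in question probably reduces to something like this standalone sketch (illustrative, not the actual riak_core_console code): accept only console input that parses to a non-negative integer.

```erlang
-module(limit_check).
-export([parse_limit/1]).

%% Accept only strings that parse to an integer >= 0.
parse_limit(Str) ->
    try list_to_integer(Str) of
        N when N >= 0 -> {ok, N};
        _Negative     -> error
    catch
        error:badarg -> error   %% non-numeric input
    end.

%% parse_limit("4")   -> {ok, 4}
%% parse_limit("-1")  -> error
%% parse_limit("abc") -> error
```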

Vagabond (Contributor) commented:

Ah yes, I didn't have much, or any, data in the cluster, so that explains the behaviour I saw.

jonmeredith (Contributor) commented:

transfer limits validation works for me, +1

jonmeredith (Contributor) commented:

I've taken a run at merging the claimv3/sim code with the staged claim on jdm-merge-staged-clustering. I made the branch off current master and did a `git merge --no-ff` of the branch on this pull request.

I ended up exposing remove_indices in riak_core_claimant so that the sim code could invoke it directly, rather than refactoring to use more of the claimant code.

There were three issues during the merge:

  1. Conflict between the capabilities stuff and the autojoin stuff at https://github.com/basho/riak_core/blob/jdm-merge-staged-clustering/src/riak_core.erl#L146 - I guessed capability first, then join, but maybe that's wrong.
  2. There was a commit 07e68d6 that cleared metadata from the ring on exit and I couldn't see that on the new branch. Is that a fix from 1.1.x that should have been merged in?
  3. Not really a merge issue, but I noticed the arguments at https://github.com/basho/riak_core/blob/jdm-merge-staged-clustering/src/riak_core_claimant.erl#L874 were swapped relative to the next call. It's harmless at the moment, but it could bite somebody who changes it.

The merged branch passes riak_test verify_build_cluster and verify_staged_clustering, but it fails on verify_{claimant,leave,down} tests (as does this gh177-staged-clustering branch for me).

jtuple added a commit that referenced this pull request Jun 11, 2012
Merge of 'gh177-staged-clustering' and 'master' had a few issues.

Re-add "clear member metadata on leave" logic that was removed in merge.
Re-add riak_core_ring:cancel_transfers/1 that was removed in merge.
Perform minor code clean-up.
Fix some typespecs to make dialyzer happier (more work still needed).

This finalizes integration into 'master' for #181
@jtuple
Copy link
Contributor Author

jtuple commented Jun 11, 2012

All integration on branch jdm-merge-staged-clustering is done. Merging that branch into master and pushing.

ghost assigned jtuple Jun 11, 2012
jtuple merged commit 952dd27 into master Jun 11, 2012
Licenser pushed a commit to Licenser/riak_test_core that referenced this pull request Apr 7, 2013
@seancribbs seancribbs deleted the gh177-staged-clustering branch April 1, 2015 22:59