docs: auto-promoting nodes #10913

maxenglander · 2019-07-20T20:47:54Z

Propose ~~v3.4~~ v3.5 support for having a cluster automatically promote
a learner node to a voting member when added to a cluster with
member add --learner --auto.

The changes proposed for ~~v3.4~~ v3.5 in this PR are implemented in #10887.

codecov-io · 2019-07-20T21:39:09Z

Codecov Report

Merging #10913 into master will increase coverage by 0.34%.
The diff coverage is 16.36%.

@@            Coverage Diff             @@
##           master   #10913      +/-   ##
==========================================
+ Coverage   65.91%   66.26%   +0.34%     
==========================================
  Files         403      403              
  Lines       37021    37067      +46     
==========================================
+ Hits        24403    24563     +160     
+ Misses      11112    11007     -105     
+ Partials     1506     1497       -9

Impacted Files	Coverage Δ
etcdctl/ctlv3/command/printer.go	`0.00% <0.00%> (ø)`
etcdctl/ctlv3/command/printer_json.go	`0.00% <0.00%> (ø)`
etcdserver/server.go	`77.03% <0.00%> (+0.30%)`	⬆️
pkg/tlsutil/cipher_suites.go	`100.00% <ø> (ø)`
mvcc/watchable_store.go	`84.75% <75.00%> (-0.20%)`	⬇️
clientv3/watch.go	`90.40% <100.00%> (+0.40%)`	⬆️
raft/raft.go	`90.37% <100.00%> (-0.62%)`	⬇️
proxy/grpcproxy/watcher.go	`85.71% <0.00%> (-8.17%)`	⬇️
clientv3/client.go	`71.42% <0.00%> (-2.63%)`	⬇️
integration/bridge.go	`90.07% <0.00%> (-2.30%)`	⬇️
... and 23 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7cc2f8a...2b87fc8. Read the comment docs.

jingyih · 2019-07-20T21:40:10Z

cc @gyuho @jpbetz @xiang90 @WIZARD-CXY

jingyih · 2019-07-20T21:49:42Z

Thanks for making the PR. I think this is a good starting point. Let's use this PR to gather comments and suggestions on API changes as well as implementation details regarding learner promotion in upcoming and future releases.

jingyih · 2019-07-20T21:50:24Z

docs/server-learner.rst

@@ -9,7 +9,7 @@ Learner
    Gyuho Lee (github.com/gyuho, *Amazon.com*),
    Joe Betz (github.com/jpbetz, *Google Inc.*)



Please add your name here:)

jingyih · 2019-07-20T22:19:56Z

docs/server-learner.rst

@@ -140,6 +142,25 @@ etcd server must not transfer leadership to learner, since it may still lag behi

 Internally in Raft, second ``MemberAdd`` call to learner node promotes it to a voting member. Leader maintains the progress of each follower and learner. If learner has not completed its snapshot message, reject promote request. Only accept promote request if and only if: The learner node is in a healthy state. The learner is in sync with leader or the delta is within the threshold (e.g. the number of entries to replicate to learner is less than 1/10 of snapshot count, which means it is less likely that even after promotion leader would not need send snapshot to the learner). All these logic are hard-coded in ``etcdserver`` package and not configurable.

+*Expose "Auto-promoting" node type to "MemberAdd" API.*
+
+etcd client adds an “AutoPromote” flag to ``MemberAdd`` API for auto-promoting learner nodes. This flag is only effective when used in conjunction with the “IsLearner” flag, and is ignored otherwise. An etcd server handler applies membership change entry with ``pb.ConfChangeAddAutoPromotingNode`` type. Once the command has been applied, a server joins the cluster with ``etcd --initial-cluster-state=existing`` flag.


Is "AutoPromote" flag only needed in short term and will eventually be deprecated?

Hey @jingyih, I think you could definitely make a case for that.

On the other hand, I think you could also make a case for removing the --learner (and IsLearner) flag, because in v3.5 the default and only behavior of of newly added members is to be a learner (until manually or automatically promoted). It might be more expressive to keep --auto-promote and allow operators to specify --auto-promote=false to add a standby learner.

maxenglander · 2019-07-25T01:29:19Z

Hi @jingyih @gyuho @jpbetz @xiang90 @WIZARD-CXY

Pardon the ping, wanted see if anyone has additional feedback on this proposal and associated PR?

jingyih · 2019-07-25T01:50:02Z

I added an item for this in tomorrow's etcd community meeting to ask for reviews and comments.

https://docs.google.com/document/d/16XEGyPBisZvmmoIHSZzv__LoyOeluC5a4x353CX0SIM

xiang90 · 2019-08-12T20:57:38Z

There are three cases for auto-promote:

promote a learner to voter regardless of the cluster size.
promote a learner to voter if the cluster size is smaller than the expected size (for example, a voter is removed from the cluster)
promote a learner to voter if some other voter is unhealthy for some user defined duration (remove the unhealthy node and replace it with a learner)

maxenglander · 2019-08-12T23:50:15Z

@xiang90 thanks for the comment. Does that mean you do not think there is a 4th case (the case I have proposed in this PR) to have new voters be added first as learners, which are then auto-promoted upon catching up with leader? Or is 4th case == 1st case?

xiang90 · 2019-08-12T23:56:09Z

@maxenglander

Sorry, I mean there are additional 3 cases we should be aware of. I am fine with addressing the one you proposed here.

From UX perspective, I hope them can be consistent. So I mentioned other cases for you to consider.

maxenglander · 2019-08-13T01:00:28Z

Thanks for the clarification @xiang90, what you're saying about consistency makes sense to me. I can imagine, for example, having additional possible values for --auto-promote such as --auto-promote=when-caught-up-with-leader, --auto-promote=when-a-voter-is-unhealty-for=30s, etc.

stale · 2020-04-06T18:26:01Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

maxenglander · 2020-04-26T05:06:23Z

👋
Am revisiting this. Has been a while!

Based on the comments above and the current state of learner doc, I am thinking of something like this for the CLI and PB APIs.

CLI

Current learner behavior

Do not automatically promote to reader or voter. Operator must manually trigger promotion.

member add --learner

The above command is equivalent to:

member add --learner --auto=false --min-progress=90 --delay=0

Promote to voter

Automatically promotes learner to voter when is has achieved at least 90% progress towards being in-sync with leader.

member add --learner --auto --min-progress=90

Automatically promote learner to voter when the cluster has less than three voters for 5 seconds.

member add --learner --auto --min-voters=3 --delay=5000

Automatically promote learner to voter when the cluster has less than current number of members for 10 seconds.

member add --learner --auto --min-members=existing --delay=10000

Promote to reader

Promote to read-only node when progress is 50% of leader's. Allow reading possibly stale data.

member add --learner --auto --promote-to=reader \
           --min-progress=50 --read-consistency=weak

Promote to read-only node when any member (voter or otherwise) becomes unhealthy. Disallow reading possibly stale data.

member add --learner --auto --promote-to=reader \
           --max-unhealthy-members=0 --read-consistency=strong

Combine promotion rules

Combines promotion rules so that the learner is promoted either...

To a voter when all of the conditions below hold:
- Progress is at least 95%
- Any voter becomes unhealthy
- Above conditions hold for 15 seconds or more
To a reader when its progress is 100% of leader's for at least 3 seconds

member add --learner \
           --promote-rule=auto:true,min-progress:95,max-unhealthy-voters=0,delay=15000 \
           --promote-rule=auto:true,to:reader,min-progress:100,delay=3000

In theory, the above rules could be interpreted so that the learner is first promoted to a reader when it has caught up with the leader. Later, if any voter becomes unhealthy, the learner (now a reader) could be promoted to a voter. We wouldn't allow the learner to be promoted to a voter, and later a reader (presumably that would be a demotion).

PB

For the protobuf API, was thinking of something like this.

message MemberMonitor {
  enum Op {
    GREATER_EQUAL = 0;
    LESS = 1;
  }

  enum Type {
    PROGRESS = 0;
    UNHEALTHY_VOTERS = 1;
    MEMBER_COUNT = 2;
    VOTER_COUNT = 3;
  }

  Type type = 1;
  Op op = 2;
  uint64 threshold = 3;
  uint64 delay = 4;
}

message ReaderOptions {
  enum ReadConsistency {
    STRONG = 0;
    WEAK = 1;
  }
}

message MemberPromoteRule {
  enum TargetRole {
    READER = 0;
    VOTER = 1;
  }

  TargetRole promoteTo = 1;
  repeated MemberMonitor monitors = 2;
  optional ReaderOptions readerOptions = 3;
}

message MemberAddRequest {
  // peerURLs is the list of URLs the added member will use to communicate with the cluster.
  repeated string peerURLs = 1;
  // isLearner indicates if the added member is raft learner.
  bool isLearner = 2;

  // Zero or more rules which govern the automatic promotion of learners.
  // If no rules are specified, the learner is not automatically promoted, but
  // the operator may manually trigger a promotion..
  repeated PromoteRule promoteRules = 3;
}

If all that is too much, can go with something a little simpler, e.g. with something that does not support combinations of rules, and which flattens all possible monitors into the MemberAddRequest.

Looking forward to feedback. In the meantime, am going to start making adjustments to this PR to support --auto, --min-progress, and --delay, as well the subset of the PB changes above needed to support those options.

maxenglander · 2020-05-14T04:33:51Z

Pardon the ping @jingyih and @xiang90. I've tried to make adjustments to address some of the comments you've provided me here.

There are three cases for auto-promote:

promote a learner to voter regardless of the cluster size.

promote a learner to voter if the cluster size is smaller than the expected size (for example, a voter is removed from the cluster)

promote a learner to voter if some other voter is unhealthy for some user defined duration (remove the unhealthy node and replace it with a learner)

My last comment above describes how I am currently imagining the MemberAdd API would evolve to support these different cases. I've updated this PR to describe one step of the evolution, and updated this PR to demonstrate a possible implementation.

Looking forward to your feedback.

maxenglander · 2020-06-13T22:16:20Z

Hey @jingyih @xiang90 hope you won't mind if I bump this again and ask for feedback on the latest design proposal and PR. Thanks!

jingyih · 2020-06-20T15:51:01Z

Thanks for the detailed propose! I do not quite follow your PB section, maybe it makes more sense to me when I see it in the PR. Currently there is no "reader" member in etcd. Is this something you plan on adding?

maxenglander · 2020-06-20T20:21:38Z

Hi @jingyih thanks for following up.

In my comment above I included the concept of a "reader" member because it is mentioned in the learner docs as planned for 3.5. With this comment I wanted to show how, even though the changes in this PR do not include any reference to a "reader", they could easily be extended to support that type of member.

I wasn't planning on adding support for a "reader" member in this PR (or in this one either), but I would happy to help with that effort in another PR.

jingyih · 2020-06-22T10:04:10Z

Thanks! Could you mention this topic in this week's etcd community meeting and ask for people's comment and review? The behavior you defined in CLI section looks good to me.

maxenglander · 2020-06-25T19:36:31Z

Hi @jingyih sorry I couldn't make it today due to a conflict. Will try to make the next one. In the meantime, let me know if there's anything else I can do to move this along.

jingyih · 2020-06-26T15:54:33Z

My suggestion of the next step is to start implement member add --learner --auto. I am not sure if people are going to agree with other flags you are proposing. But adding an --auto flag and the mechanism to auto promote leaner should be fine - I think they are required for auto promoting feature.

maxenglander · 2020-06-30T05:37:26Z

@jingyih Sounds good, it's implemented here.

It currently has a --min-progress flag, but it defaults to 90%, so users can just specify member add --learner --auto and it will automatically promote when learner is at least 90% towards leader's progress. User's can change this to have a higher or lower threshold e.g. --min-progress 95.

I'm happy to remove the flag if people don't like it, but I think it's useful for showing how other rules (e.g. --min-unhealthy-voters) can be accommodated by the PB API and rule evaluation logic.

maxenglander · 2020-07-30T18:46:04Z

@gyuho thanks for the chat during community meeting, here is the PR I referenced with design proposal for auto-promoting nodes. I've tried to follow @jingyih and @xiang90 recommendations above as much as possible.

The proposal here lays out long-term API design, which will eventually support automatically promoting nodes based on multiple conditions (# voters in cluster, # healthy nodes). The implementation PR in #10887 implements a subset of this proposal: auto promote based on log progress of learner.

maxenglander · 2020-09-09T15:51:35Z

Hi @jingyih @gyuho just checking in to see if you have any thoughts here about moving forward with this. Thanks!

stale · 2020-12-08T16:58:57Z

This issue has been automatically marked as stale because it has not had recent activity. It will be closed after 21 days if no further activity occurs. Thank you for your contributions.

maxenglander · 2020-12-10T03:24:36Z

/stale remove

maxenglander · 2021-01-14T02:39:38Z

Hi @jingyih @gyuho just checking in again. Anything I can do to help move this forward?

spzala

@maxenglander thanks for the PR, and fyi that the project Doc is moved to https://github.com/etcd-io/website/ so this PR should be closed and be created in the doc repo now. Sorry it took long in the review process, but we will try to review and merge quickly in the doc repo. Thanks!

maxenglander · 2021-02-17T01:12:49Z

Thanks you @spzala I'll get this moved over.

spzala · 2021-02-17T15:19:48Z

Thanks you @spzala I'll get this moved over.

Yrw, and thank you @maxenglander

jingyih reviewed Jul 20, 2019

View reviewed changes

xiang90 added the type/feature label Aug 12, 2019

xiang90 added this to the etcd-v3.5 milestone Aug 12, 2019

Marlinc mentioned this pull request Nov 12, 2019

Add replacement nodes as learner nodes cbws/etcd-operator#31

Open

stale bot added the stale label Apr 6, 2020

stale bot removed the stale label Apr 26, 2020

docs: learner auto-promotion

2b87fc8

maxenglander force-pushed the auto-promote-learners-doc branch from 13de299 to 2b87fc8 Compare May 14, 2020 04:12

maxenglander requested a review from jingyih May 14, 2020 04:36

jingyih self-assigned this Jun 17, 2020

stale bot added the stale label Dec 8, 2020

stale bot removed the stale label Dec 10, 2020

nate-double-u mentioned this pull request Feb 2, 2021

Migrate documentation: Remove docs from etcd-io/etcd #12660

Merged

spzala reviewed Feb 10, 2021

View reviewed changes

maxenglander closed this Feb 17, 2021

maxenglander deleted the auto-promote-learners-doc branch February 17, 2021 01:13

This was referenced Feb 17, 2021

content/docs/next/learning: propose v3.5 auto-promotion in design-learner.md etcd-io/website#107

Closed

What's the plan for version 3.5? When will it be released? #12330

Closed

maxenglander mentioned this pull request Feb 20, 2021

etcdserver: add ability to auto-promote learners to voters #10887

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

docs: auto-promoting nodes #10913

docs: auto-promoting nodes #10913

maxenglander commented Jul 20, 2019 •

edited

Loading

codecov-io commented Jul 20, 2019 •

edited

Loading

jingyih commented Jul 20, 2019

jingyih commented Jul 20, 2019

jingyih Jul 20, 2019

jingyih Jul 20, 2019

maxenglander Jul 21, 2019 •

edited

Loading

maxenglander commented Jul 25, 2019

jingyih commented Jul 25, 2019

xiang90 commented Aug 12, 2019

maxenglander commented Aug 12, 2019

xiang90 commented Aug 12, 2019

maxenglander commented Aug 13, 2019

stale bot commented Apr 6, 2020

maxenglander commented Apr 26, 2020 •

edited

Loading

maxenglander commented May 14, 2020

maxenglander commented Jun 13, 2020 •

edited

Loading

jingyih commented Jun 20, 2020

maxenglander commented Jun 20, 2020 •

edited

Loading

jingyih commented Jun 22, 2020

maxenglander commented Jun 25, 2020

jingyih commented Jun 26, 2020

maxenglander commented Jun 30, 2020

maxenglander commented Jul 30, 2020 •

edited

Loading

maxenglander commented Sep 9, 2020

stale bot commented Dec 8, 2020

maxenglander commented Dec 10, 2020

maxenglander commented Jan 14, 2021

spzala left a comment

maxenglander commented Feb 17, 2021

spzala commented Feb 17, 2021

		@@ -9,7 +9,7 @@ Learner
		Gyuho Lee (github.com/gyuho, Amazon.com),
		Joe Betz (github.com/jpbetz, Google Inc.)

docs: auto-promoting nodes #10913

docs: auto-promoting nodes #10913

Conversation

maxenglander commented Jul 20, 2019 • edited Loading

codecov-io commented Jul 20, 2019 • edited Loading

Codecov Report

jingyih commented Jul 20, 2019

jingyih commented Jul 20, 2019

jingyih Jul 20, 2019

Choose a reason for hiding this comment

jingyih Jul 20, 2019

Choose a reason for hiding this comment

maxenglander Jul 21, 2019 • edited Loading

Choose a reason for hiding this comment

maxenglander commented Jul 25, 2019

jingyih commented Jul 25, 2019

xiang90 commented Aug 12, 2019

maxenglander commented Aug 12, 2019

xiang90 commented Aug 12, 2019

maxenglander commented Aug 13, 2019

stale bot commented Apr 6, 2020

maxenglander commented Apr 26, 2020 • edited Loading

CLI

Current learner behavior

Promote to voter

Promote to reader

Combine promotion rules

PB

maxenglander commented May 14, 2020

maxenglander commented Jun 13, 2020 • edited Loading

jingyih commented Jun 20, 2020

maxenglander commented Jun 20, 2020 • edited Loading

jingyih commented Jun 22, 2020

maxenglander commented Jun 25, 2020

jingyih commented Jun 26, 2020

maxenglander commented Jun 30, 2020

maxenglander commented Jul 30, 2020 • edited Loading

maxenglander commented Sep 9, 2020

stale bot commented Dec 8, 2020

maxenglander commented Dec 10, 2020

maxenglander commented Jan 14, 2021

spzala left a comment

Choose a reason for hiding this comment

maxenglander commented Feb 17, 2021

spzala commented Feb 17, 2021

maxenglander commented Jul 20, 2019 •

edited

Loading

codecov-io commented Jul 20, 2019 •

edited

Loading

maxenglander Jul 21, 2019 •

edited

Loading

maxenglander commented Apr 26, 2020 •

edited

Loading

maxenglander commented Jun 13, 2020 •

edited

Loading

maxenglander commented Jun 20, 2020 •

edited

Loading

maxenglander commented Jul 30, 2020 •

edited

Loading