Ability to add raft node in witness mode. #8934

Closed
janardhan1993 opened this issue Nov 29, 2017 · 5 comments

Comments

@janardhan1993

We would like to add a raft node as witness. Witness replicas vote for leaders and for committing writes, but do not store a full copy of the data, can't become the leader, and can't serve reads. They make it easier to achieve quorums for writes without the storage and compute resources that are required by normal replicas.

Is it possible to do this in the application layer, without modifying the current etcd raft code?

@xiang90
Contributor

xiang90 commented Nov 29, 2017

> They make it easier to achieve quorums for writes

I do not understand this. Why not have a smaller quorum then?

@janardhan1993
Author

It would be useful when the number of nodes in the cluster is even, say 2: if 1 node goes down, we lose quorum. With a witness replica we would use the resources of only 2 nodes but could tolerate the failure of one node.

@xiang90
Contributor

xiang90 commented Nov 29, 2017

> With witness replica we would use resources for 2 nodes only but would tolerate failure of one node.

Is it? If the actual raft node goes down, all data is lost and nothing can make progress. If the witness node goes down, quorum cannot be reached either. It is worse than having a 1-node or 2-node raft cluster.

@janardhan1993
Author

Sorry, I meant we can have 2 raft nodes and 1 witness replica. Then if 1 node goes down, data won't be lost.
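The quorum arithmetic behind the configurations discussed so far can be checked with a small sketch. It assumes witnesses count as ordinary voters and that quorum is a simple majority. The function names `quorum` and `tolerated` are illustrative only and are not part of etcd's raft package.

```go
package main

import "fmt"

// quorum returns the majority size for a given number of voting members.
func quorum(voters int) int {
	return voters/2 + 1
}

// tolerated returns how many voters can fail while a majority remains.
func tolerated(voters int) int {
	return voters - quorum(voters)
}

func main() {
	// Hypothetical configurations taken from the discussion above.
	// Assumption: a witness votes like any other member.
	configs := []struct {
		name            string
		data, witnesses int
	}{
		{"2 raft nodes", 2, 0},
		{"1 raft node + 1 witness", 1, 1},
		{"2 raft nodes + 1 witness", 2, 1},
	}
	for _, c := range configs {
		voters := c.data + c.witnesses
		fmt.Printf("%-26s voters=%d quorum=%d tolerated failures=%d\n",
			c.name, voters, quorum(voters), tolerated(voters))
	}
}
```

Under these assumptions, only the third configuration keeps a majority after one voter fails, and it does so with two full copies of the data rather than three.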

@uluyol

uluyol commented Jun 18, 2018

I don't mean to comment on very old threads, but this is an important problem that I would like to see solved.

I think what @janardhan1993 was getting at was witness replicas as they exist in non-Raft consensus protocols. Google, for instance, appears to use them in Spanner and previously used them in Megastore. Unlike Raft replicas, Paxos replicas do not need to store the full operation log, which allows the use of F+1 application replicas, F+1 potential leaders, F+1 active Paxos log replicas, and F witnesses.

By having F+1 nodes act as app replicas (e.g. etcd logic and state), leaders, and active log replicas, typical usage will consume only F+1 nodes' worth of resources. When a node fails, one of the witnesses is used for the Paxos log. When that node comes back or another node replaces it, the log entries stored on the witness are used to recover data, whereupon the data stored on the witness can be freed through checkpointing.

Since witnesses will rarely be used, this allows average utilization of 3 replicas to tolerate 2 failures, not 1, saving CPU, bandwidth, I/O, and storage resources.

To the best of my knowledge, witnesses cannot be used with Raft because Raft requires that replicas maintain a complete prefix of the log. This prevents a replica that does not have the latest log entries from becoming leader, breaking fault tolerance if witnesses are used. A Paxos-based system could elect a node that is behind as leader, and have the new leader obtain the missing log entries by running Prepare-Accept logic, bringing it up-to-date.
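The restriction described here is Raft's election rule: a voter grants its vote only if the candidate's log is at least as up-to-date as its own, comparing the term of the last entry first and the last index second. Below is a minimal sketch of that comparison. The type `logPos` and the function `atLeastAsUpToDate` are hypothetical names for illustration, not etcd's API, and the log positions are made-up values.

```go
package main

import "fmt"

// logPos identifies the last entry in a node's log.
type logPos struct {
	term  uint64
	index uint64
}

// atLeastAsUpToDate reports whether the candidate's log is at least as
// up-to-date as the voter's. A higher last term wins; when the last
// terms are equal, the longer log (higher last index) wins.
func atLeastAsUpToDate(candidate, voter logPos) bool {
	if candidate.term != voter.term {
		return candidate.term > voter.term
	}
	return candidate.index >= voter.index
}

func main() {
	// Made-up positions: a current replica and a node missing recent entries.
	current := logPos{term: 3, index: 120}
	lagging := logPos{term: 2, index: 80}

	fmt.Println("current replica votes for lagging node:", atLeastAsUpToDate(lagging, current))
	fmt.Println("lagging node votes for current replica:", atLeastAsUpToDate(current, lagging))
}
```

A node that skipped log entries while acting as a witness would fail this check against any up-to-date voter, which is the obstacle the comment points at.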

I believe that fixing this problem requires making changes to Raft, so it might be out of scope for the etcd developers. At the same time, saving 40% of resources with F=2 would be no small feat.
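The 40% figure follows from comparing the 2F+1 full replicas a majority-quorum deployment needs with the F+1 active replicas of the witness scheme. The sketch below reproduces that arithmetic under the simplifying assumption that idle witnesses consume no resources at all. The function names are illustrative.

```go
package main

import "fmt"

// fullReplicas is the number of full replicas needed to tolerate
// f failures with majority quorums and no witnesses.
func fullReplicas(f int) int {
	return 2*f + 1
}

// activeReplicas is the number of nodes doing real work when f of the
// members are witnesses that normally sit idle.
func activeReplicas(f int) int {
	return f + 1
}

// savedPercent is the share of nodes that become idle witnesses,
// truncated to a whole percent. Assumption: idle witnesses cost nothing.
func savedPercent(f int) int {
	return 100 * (fullReplicas(f) - activeReplicas(f)) / fullReplicas(f)
}

func main() {
	for f := 1; f <= 3; f++ {
		fmt.Printf("F=%d: %d full replicas vs %d active + %d witnesses, about %d%% saved\n",
			f, fullReplicas(f), activeReplicas(f), f, savedPercent(f))
	}
}
```

In practice witnesses still need some CPU and network to vote, so the real saving would be somewhat below this bound.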


4 participants