In this paper, Kotla et al. propose Zyzzyva, a new protocol that uses speculation to reduce the cost and simplify the design of BFT state machine replication. Zyzzyva is also built on one primary replica, which proposes an order on client requests to the other replicas. Unlike traditional protocols, replicas in Zyzzyva speculatively execute requests without an agreement protocol to determine the exact order. However, replicas may diverge and send different responses to clients. In this case, Zyzzyva let clients resolve inconsistency of responses received from replicas. In Zyzzyva, it relies on clients to trigger slow-path processing when responses from replicas don’t match. To help a client determine the consistency, each replica appends history information to the response so that the client can judge whether there exist any inconsistencies by inspecting the sequence numbers and history prefixes. By leveraging clients in this way, Zyzzyva largely improves the overall performance of the whole system. Its replication costs, processing overheads, and communication latencies approach the theoretical lower bounds.
- The authors introduce a new design of BFT protocol, Zyzzyva, which reduces the cost and simplifies the design of traditional BFT state machine replication by responding immediately to the client.
- Besides describing the design details of Zyzzyva, including agreement, view-change, and checkpoint sub-protocols, the authors also discuss several optimizations to improve performance and reduce system cost. The optimizations cover replacing signatures with MACs, separating agreement from execution, request batching, caching out of order requests, read-only optimization, single execution response, and preferred quorums.
- The authors conduct extensive evaluations of throughput, latency, fault scalability, and performance during failures. Moreover, they compare Zyzzyva with other existing approaches, including PBFT, HQ, and Q/U.
- Zyzzyva simplifies the design of BFT replicated services by approaching the lower bounds in almost every key metric, including replication cost, throughput, latency, and fault scalability.
- In Zyzzyva, replicas keep executing request speculatively without waiting to know that requests with lower sequence numbers have completed. This is why Zyzzyva can achieve not just lower latency but also higher throughput.
- In Zyzzyva, a faulty primary can cause a correct replica to execute requests in the wrong sequence. The Zyzzyva protocol itself ensures that clients will not act on the inconsistent wrong responses, but it requires the divergent replicas to correct themselves by fetching a checkpoint and log of subsequent requests from the other replicas, apply the checkpoint, and replay the log. However, this recovery from misspeculation can be expensive, because the checkpoint includes the state of the application and the last response sent to each client.