Casper node failover #13

Open
przemyslaw opened this issue Apr 10, 2024 · 0 comments

Context
Casper node operations currently have no automatic failover, so a failure of the primary node can disrupt service. The proposed concept is a primary (main) and secondary (slave) node setup in which each node is monitored and failover is triggered when specific conditions are met.

Goal
The goal of this task is to implement a failover module for the Casper node, or to adopt an existing one. The module should keep network services highly available by automatically switching to a backup node when the primary node fails. The switch should cause only a minimal performance drop, and the mechanism must avoid double-signing so that the validator is not penalized.

Requirements

  • Dual-node architecture: one main and at least one slave.
  • Each node should have two public keys: one for main operations and the second for failover scenarios.
  • Nodes must regularly ping each other, with intervals and monitoring duration configurable through settings.
  • The failover process should activate the slave node as the primary if the main node becomes unresponsive for a specified period.
  • Include internal replication between main and slave nodes to prevent performance degradation during failover.
  • The system must avoid double-signing, drawing on experience from existing solutions such as Horcrux in the Cosmos SDK ecosystem.
  • The solution should seamlessly revert to the original configuration once the main node is operational again.
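
The ping, failover, and revert requirements above can be read as a small state machine running on the slave. The following is a minimal sketch of that reading, not a design decision: `FailoverConfig`, `SlaveMonitor`, `recovery_window_s`, and the default values are all hypothetical names and numbers introduced for illustration, and timestamps are passed in explicitly so the logic can be exercised without a network.

```python
from dataclasses import dataclass
from enum import Enum


class Role(Enum):
    ACTIVE = "active"      # node is currently acting as primary
    STANDBY = "standby"    # node is replicating and waiting


@dataclass
class FailoverConfig:
    # All values are illustrative assumptions, meant to come from settings.
    ping_interval_s: float = 2.0      # how often the nodes ping each other
    failure_timeout_s: float = 10.0   # silence required before failover
    recovery_window_s: float = 30.0   # healthy time required before reverting


class SlaveMonitor:
    """Runs on the slave: decides when to take over and when to hand back."""

    def __init__(self, config: FailoverConfig, now: float):
        self.config = config
        self.role = Role.STANDBY
        self.last_pong = now        # last time the main answered a ping
        self.healthy_since = now    # start of the main's current healthy streak

    def on_pong(self, now: float) -> None:
        # A reply after a long silence starts a new healthy streak.
        if now - self.last_pong > self.config.failure_timeout_s:
            self.healthy_since = now
        self.last_pong = now

    def tick(self, now: float) -> Role:
        silent_for = now - self.last_pong
        if self.role is Role.STANDBY:
            if silent_for > self.config.failure_timeout_s:
                self.role = Role.ACTIVE    # main unresponsive: take over
        else:
            main_alive = silent_for <= 2 * self.config.ping_interval_s
            stable = now - self.healthy_since >= self.config.recovery_window_s
            if main_alive and stable:
                self.role = Role.STANDBY   # main is back: revert
        return self.role
```

Note that this sketch only decides roles. A network partition between the two nodes looks the same as a dead main, so role switching alone does not satisfy the double-signing requirement; that needs a separate guard on the signing path.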

References:

  • Review existing failover mechanisms in blockchain systems, such as the custom Tendermint fault-tolerance applications by Farbole, Figment, and CertusOne.
  • Analyze the Horcrux threshold Tendermint signer as a model for the Casper node failover system.
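
As a starting point for the double-signing requirement, one common safeguard is a persisted high-water mark: a signer refuses anything at or below the highest height it has already signed. The sketch below illustrates only that idea and is far simpler than a threshold signer such as Horcrux. `SignGuard`, `try_sign`, the JSON state file, and the use of a single integer `height` are hypothetical; the issue does not specify what unit a Casper node signs over or where such state would live.

```python
import json
import os
import tempfile


class SignGuard:
    """Refuses to sign at or below the highest height already signed."""

    def __init__(self, state_path: str):
        self.state_path = state_path
        self.last_signed = self._load()

    def _load(self) -> int:
        try:
            with open(self.state_path) as f:
                return int(json.load(f)["last_signed"])
        except FileNotFoundError:
            return -1  # nothing has been signed yet

    def _persist(self, height: int) -> None:
        # Write to a temp file, then rename, so a crash never leaves a torn file.
        directory = os.path.dirname(os.path.abspath(self.state_path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "w") as f:
            json.dump({"last_signed": height}, f)
            f.flush()
            os.fsync(f.fileno())
        os.replace(tmp, self.state_path)

    def try_sign(self, height: int) -> bool:
        """Return True if signing at `height` is allowed, recording it first."""
        if height <= self.last_signed:
            return False
        self._persist(height)      # record BEFORE any signature is released
        self.last_signed = height
        return True
```

A guard like this protects a single process across restarts. For a main and slave pair, the recorded state would have to be shared or replicated between the nodes before the slave is allowed to sign, which is where the internal replication requirement comes in.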