Prove Network Reliability and Fault Tolerance #1505

Closed
5 tasks
ch1bo opened this issue Jul 16, 2024 · 1 comment
Labels
superseded (an item that may get superseded by a related feature), 💭 idea (an idea or feature request)

Comments

Collaborator

ch1bo commented Jul 16, 2024

Why

Currently, in our test strategy hierarchy, we are not covering scenarios where real networking failures can occur.

We strive for long-living heads, so providing proofs of resiliency and fault tolerance is essential.

This will open future opportunities to explore different network protocols (such as UDP) and implementations.

What

Motivated by #1436, we aim to:

  • Create a test that stress tests the network layer in the case of three or more intermittently failing peers. A failing peer is defined as a peer that fails to send, receive, or persist network messages.
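
The definition of a failing peer above can be made concrete with a small fault model. The sketch below is only illustrative: `Fault`, `FlakyPeer`, and the `failure_rate` parameter are hypothetical names, not part of hydra-node, and the seeded random choice merely stands in for real network or disk failures.

```python
import random
from dataclasses import dataclass, field
from enum import Enum


class Fault(Enum):
    """The three ways a peer can fail, per the definition above."""
    SEND = "send"
    RECEIVE = "receive"
    PERSIST = "persist"


@dataclass
class FlakyPeer:
    """Hypothetical stand-in for a peer that fails intermittently."""
    name: str
    failure_rate: float  # probability in [0, 1] that a single operation fails
    rng: random.Random = field(default_factory=random.Random)
    faults: list = field(default_factory=list)  # log of injected faults

    def attempt(self, operation: Fault) -> bool:
        """Return True if the operation succeeds, False if a fault is injected."""
        if self.rng.random() < self.failure_rate:
            self.faults.append(operation)
            return False
        return True


# Three intermittently failing peers, each seeded so a run can be reproduced.
peers = [
    FlakyPeer(f"peer-{i}", failure_rate=0.3, rng=random.Random(i))
    for i in range(3)
]
```

Logging every injected fault alongside the seed is one way to replay a failing run; the actual test would replace `attempt` with real fault injection.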

To achieve this, we want to prove through chaos engineering that the off-chain protocol can survive and recover from network failures.

For that, we need to:

Both require us to build:

  • cluster bootstrap script

    A script that sets up the environment with the required components and infrastructure specifications.

    It should prepare:

    • 1 shared cardano node.
      • the slot length should be configurable via a script argument.
    • 3 hydra nodes, each with its own volume and hydraw instance.
      • each with a custom event-sink configured.
    • the hydra explorer
  • test driver

    • an HTTP server that waits for operator commands to execute a plan.
    • holds a copy of the signing keys.
    • runs execution plans (orchestration scripts) on demand.
      • plans send HTTP requests to each hydraw instance.
      • plans can be configured to wait for the last submitted transaction to be confirmed in a snapshot before processing the next step (executing one step at a time); this configurability makes plans reusable for stress testing.
      • plans can be empty.
      • plans can contain failure instructions that, during execution, change the infrastructure to cause network failures (e.g., delaying, dropping, duplicating, or re-ordering packets) or shut down nodes to cause service unavailability and restart them after some period of time.
      • a plan stops when it cannot make progress after several retries.
    • client inputs and failure instructions can also be introduced manually by the operator during execution.
      • that means plans maintain WS connections to each hydra node.
    • failure instructions are executed using one of these tools:
  • test observer

    • provides real-time dashboards and reports about the status of the cluster.
    • keeps track of and reports:
      • the latest version of each hydra node state: L̂, S⁻, Uω, txα, txω, T̂.
      • all messages sent, received and discarded between each pair of peers.
      • the latest version of the L1 head UTxO state (the same as shown in the explorer details).
    • this is achieved by:
      • querying and forwarding the L1 state from the hydra explorer api.
      • pulling and aggregating L2 events from each of the configured hydra event sinks.
      • reading the network messages (sent and received).

        Here, we still need a better approach for where to read this information from:

image
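
The plan semantics listed for the test driver (one step at a time, optional wait for snapshot confirmation, failure instructions, stopping after repeated retries) can be sketched as a small interpreter. This is a minimal sketch under assumptions: `Submit`, `InjectFailure`, `run_plan`, and the `submit` / `is_confirmed` / `inject` callbacks are hypothetical placeholders for the HTTP requests to hydraw, the snapshot check, and the failure tooling, none of which are specified in this issue.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class Submit:
    """Send one client input to a given hydraw instance (hypothetical step)."""
    node: str
    payload: str


@dataclass
class InjectFailure:
    """Introduce a failure such as 'delay', 'loss', or 'shutdown' (hypothetical step)."""
    node: str
    kind: str


Step = Union[Submit, InjectFailure]


def run_plan(
    plan: List[Step],
    submit: Callable[[Submit], str],
    is_confirmed: Callable[[str], bool],
    inject: Callable[[InjectFailure], None],
    wait_for_confirmation: bool = True,
    max_retries: int = 3,
    retry_delay: float = 0.0,
) -> int:
    """Run steps in order and return how many steps completed.

    Stops early when a submitted transaction is still unconfirmed
    after max_retries checks. An empty plan completes zero steps.
    """
    completed = 0
    for step in plan:
        if isinstance(step, InjectFailure):
            inject(step)
        else:
            tx_id = submit(step)
            if wait_for_confirmation:
                for _ in range(max_retries):
                    if is_confirmed(tx_id):
                        break
                    time.sleep(retry_delay)
                else:
                    # No confirmation after all retries: stop the plan.
                    return completed
        completed += 1
    return completed
```

Passing `wait_for_confirmation=False` makes the same plan submit inputs back to back, which is one possible way to reuse it for stress testing.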
ch1bo added the 💭 idea label Jul 16, 2024
Contributor

ffakenz commented Jul 30, 2024

Closing in favor of #1532

ffakenz closed this as completed Jul 30, 2024
ffakenz closed this as not planned (won't fix, can't repro, duplicate, stale) Jul 30, 2024
ch1bo added the superseded label Jul 30, 2024