Prove Network Reliability and Fault Tolerance #1505

ch1bo · 2024-07-16T08:03:34Z

Why

Currently, in our test strategy hierarchy, we are not covering scenarios where real networking failures can occur.

We strive for long-living heads, so providing proofs of resiliency and fault tolerance is essential.

This will open future opportunities to explore different network protocols (such as UDP) and implementations.

What

Motivated by #1436, we aim to:

Create a test that stress tests the network layer in the case of three or more intermittently failing peers. A failing peer is defined as a peer that fails to send, receive, or persist network messages.

To achieve this, we want to prove through chaos engineering that the off-chain protocol can survive and recover from network failures.

For that, we need to:

Prepare a local demo.
Run it on a real cluster in the cloud.

Both require us to build:

cluster bootstrap script

A script that sets up the environment with the required components and infrastructure specifications.

It should prepare:
- 1 shared cardano node.
  - its slot length is configured through a script argument.
- 3 hydra nodes, each with its own volume and hydraw instance.
  - each with a custom event-sink configured.
- the hydra explorer
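
The component list above could be captured as a small declarative description that the bootstrap script consumes. The following is only a minimal sketch under stated assumptions: `ClusterSpec`, `HydraNodeSpec`, `build_cluster_spec` and all service and volume names are hypothetical, chosen for illustration, and do not correspond to any existing Hydra tooling.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class HydraNodeSpec:
    name: str
    volume: str      # dedicated volume for this node (hypothetical naming)
    hydraw: str      # hydraw instance paired with this node
    event_sink: str  # identifier of the custom event sink for this node


@dataclass
class ClusterSpec:
    slot_length_seconds: float            # taken from the script argument
    cardano_node: str = "cardano-node"    # the single shared cardano node
    explorer: str = "hydra-explorer"
    hydra_nodes: List[HydraNodeSpec] = field(default_factory=list)


def build_cluster_spec(slot_length_seconds: float, peers: int = 3) -> ClusterSpec:
    """Describe one shared cardano node, `peers` hydra nodes and the explorer."""
    if slot_length_seconds <= 0:
        raise ValueError("slot length must be positive")
    if peers < 1:
        raise ValueError("at least one hydra node is required")
    spec = ClusterSpec(slot_length_seconds=slot_length_seconds)
    for i in range(1, peers + 1):
        spec.hydra_nodes.append(
            HydraNodeSpec(
                name=f"hydra-node-{i}",
                volume=f"hydra-node-{i}-data",
                hydraw=f"hydraw-{i}",
                event_sink=f"event-sink-{i}",
            )
        )
    return spec
```

Calling `build_cluster_spec(0.1)` yields three hydra node entries, each with its own volume, hydraw instance and event sink; how such a description gets turned into containers or cloud resources is left open here.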

test driver

- an HTTP server that waits for operator commands to execute a plan.
- holds a copy of the signing keys.
- runs execution plans (orchestration scripts) on demand.
  - plans send HTTP requests to each hydraw instance.
  - plans can be configured to wait for the last submitted transaction to be confirmed in a snapshot before processing the next step (one step at a time); keeping this configurable makes the driver reusable for stress-testing.
  - plans can be empty.
  - plans can contain failure instructions that, during execution, either change the infrastructure to cause network failures (e.g., delayed, lost, duplicated, or re-ordered packets) or shut nodes down to cause service unavailability and restart them after some period of time.
  - plans stop when they cannot make progress after several attempts/retries.
- client inputs and failure instructions can be introduced manually by the operator during execution.
  - that means plans maintain WS connections to each hydra node.
- failure instructions are executed using one of these tools:
  - Jepsen
  - Pumba
  - Blockade
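
The plan semantics listed above (one step at a time, optional wait for confirmation, failure instructions, stopping after several attempts) can be sketched as follows. This is an illustrative sketch, not a proposed implementation: `SubmitTx`, `InjectFault`, `PlanStalled` and `run_plan` are hypothetical names, and the `submit`, `inject` and `is_confirmed` callbacks stand in for the HTTP requests to hydraw, the chosen fault-injection tool and the snapshot confirmation check.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


@dataclass
class SubmitTx:
    peer: str  # which hydraw instance receives the request
    tx: str    # opaque transaction payload


@dataclass
class InjectFault:
    kind: str       # e.g. "delay", "loss", "duplicate", "reorder", "shutdown"
    peer: str
    seconds: float  # how long the fault lasts before the node is restored


Step = Union[SubmitTx, InjectFault]


class PlanStalled(Exception):
    """Raised when a step cannot make progress within the allowed attempts."""


def run_plan(
    steps: List[Step],
    submit: Callable[[SubmitTx], None],
    inject: Callable[[InjectFault], None],
    is_confirmed: Callable[[SubmitTx], bool],
    wait_for_confirmation: bool = True,
    max_attempts: int = 3,
) -> int:
    """Execute steps in order and return how many were executed."""
    executed = 0
    for step in steps:
        if isinstance(step, InjectFault):
            inject(step)
        else:
            submit(step)
            if wait_for_confirmation:
                # Poll for confirmation; any delay between polls is left
                # to the is_confirmed callback in this sketch.
                for _ in range(max_attempts):
                    if is_confirmed(step):
                        break
                else:
                    raise PlanStalled(f"{step.tx} not confirmed after {max_attempts} attempts")
        executed += 1
    return executed
```

An empty plan simply returns `0`. Passing `wait_for_confirmation=False` submits steps back to back, which is the mode a stress test would use.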

test observer

- provides real-time dashboards and reports about the status of the cluster.
- keeps track of and reports:
  - the latest version of each hydra node state: L̂, S⁻, Uω, txα, txω, T̂.
  - all messages sent, received and discarded between each pair of peers.
  - the latest version of the L1 head utxo state: same as in explorer details
- this is achieved by:
  - querying and forwarding the L1 state from the hydra explorer api.
  - pulling and aggregating L2 events from each of the configured hydra event sinks.
  - reading the network messages (sent and received).
    Here, we still need to decide where to read this information from:
    - perhaps another kind of event sink?
    - maybe the trace-dispatcher library?
    - what about cardano-tracer?
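
Whichever source ends up providing the network messages, the per-pair bookkeeping described above could look roughly like this. It is a minimal sketch that assumes events arrive as `(sender, receiver, outcome)` tuples; that event shape and the `tally_messages` helper are hypothetical and not part of any existing event sink.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple

OUTCOMES = {"sent", "received", "discarded"}


def tally_messages(
    events: Iterable[Tuple[str, str, str]],
) -> Dict[Tuple[str, str], Dict[str, int]]:
    """Count sent, received and discarded messages for each (sender, receiver) pair."""
    tally = defaultdict(Counter)
    for sender, receiver, outcome in events:
        if outcome not in OUTCOMES:
            raise ValueError(f"unknown outcome: {outcome}")
        tally[(sender, receiver)][outcome] += 1
    # Convert to plain dicts so the result is easy to serialise for a report.
    return {pair: dict(counts) for pair, counts in tally.items()}
```

A gap between the `sent` and `received` counts of a pair during a fault window is the kind of signal a dashboard could surface.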

ffakenz · 2024-07-30T09:28:13Z

Closing in favor of #1532

ch1bo assigned ffakenz Jul 16, 2024

ch1bo added the 💭 idea label (an idea or feature request) Jul 16, 2024

ch1bo mentioned this issue Jul 23, 2024

NOT MERGE: Less reliable persistence prototype #1495

Closed

4 tasks

ffakenz mentioned this issue Jul 25, 2024

Packet loss fault injection test #1532

Closed

3 tasks

ffakenz closed this as completed Jul 30, 2024

ffakenz closed this as not planned Jul 30, 2024

ch1bo added the superseded label (an item that may get superseded by a related feature) Jul 30, 2024

ch1bo mentioned this issue Sep 2, 2024

Spike: Use raft consensus for networking #1591

Closed
