Spike: Use raft consensus for networking #1591

Closed
ch1bo opened this issue Sep 2, 2024 · 7 comments
ch1bo commented Sep 2, 2024

Why

We created a new test suite about resilience of our network stack in #1532 (see also #1106, #1436, #1505). With this in place, we can now explore various means to reach our goal of a crash-tolerant network layer.

This fairly old research paper explored various consensus protocols used in the blockchain space and reminds us of the correspondence between consensus and broadcasts:

the form of consensus relevant for blockchain is technically known as atomic broadcast

Furthermore, it listed at least one of these early, permissioned blockchains that achieved crash-tolerance of $t < n/2$ by simply using etcd with its Raft consensus algorithm.

What

  • Run fault injection tests with a hydra-node network connected through etcd
  • Create a PR with results of the spike, but do not merge it

How

  • Create a Hydra.Network.Etcd network component that implements broadcast using the replicated log of etcd
    • Move authentication into application
    • Re-use --peer from command line
    • Fork etcd when starting
    • Message type + sender as key
    • Use watch and revisions to be notified of messages missed while offline (see the sketch after this list)
  • Maybe: compact messages (would this upper-bound resilience to faults?) or use leases
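
A minimal sketch of this broadcast scheme in plain etcdctl terms; the msg/ReqTx/alice key layout (message type + sender) and the hex encoding are illustrative assumptions, not the final design:

# broadcast = put the (hex-encoded) message under a key combining message type and sender
etcdctl put "msg/ReqTx/alice" "$(xxd -p message.bin | tr -d '\n')"

# deliver = watch all message keys; --rev allows catching up on messages missed while offline
etcdctl watch --prefix "msg/" --rev 42 -w json
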
@ch1bo ch1bo added the spike label Sep 2, 2024
@ch1bo ch1bo changed the title Spike: Using raft consensus for networking Spike: Use raft consensus for networking Sep 2, 2024
@ch1bo ch1bo self-assigned this Sep 2, 2024
@ch1bo ch1bo linked a pull request Sep 13, 2024 that will close this issue

ch1bo commented Sep 13, 2024

After identifying a required change in the semantics of our NetworkComponents in #1624, I started the implementation yesterday and have achieved so far:

  • Initialization of an etcd cluster by re-using the existing --peer (and other) command line arguments (see the clustering sketch after this list)
  • Sending and receiving of messages: very basic, hex-encoded, using etcdctl as "client" and by polling a single key
  • First manual testing and some end-to-end tests / benchmarks working -> still fragile (likely because of the process calls and polling)
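
For context, a sketch of how such a static etcd cluster is bootstrapped; the names, addresses, ports and the exact mapping from --peer arguments are assumptions for illustration:

# each hydra-node forks an etcd instance roughly like this, derived from its --peer arguments
etcd --name alice \
  --listen-client-urls http://127.0.0.1:2379 \
  --advertise-client-urls http://127.0.0.1:2379 \
  --listen-peer-urls http://0.0.0.0:5001 \
  --initial-advertise-peer-urls http://172.16.238.10:5001 \
  --initial-cluster 'alice=http://172.16.238.10:5001,bob=http://172.16.238.20:5001,carol=http://172.16.238.30:5001' \
  --initial-cluster-state new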

Notes so far:

  • First step: starting an etcd instance in the background of the network component.
  • When making etcd available to hydra-node (repl), I stumbled over cached paths (somewhere in dist-newstyle)
  • Separate --host and --port options are annoying, we should just parse a full Host.
  • etcd has a few command line arguments to get right, this guide was helpful: https://etcd.io/docs/v3.5/op-guide/clustering/
  • etcd client libraries are not in the best state in Haskell land, so resorted to just invoking etcdctl for now (this will have horrible performance)
    • Need to marshal binary encoded messages to etcdctl put through stdin using hex
    • Can use etcdctl get -w json to get a JSON encoded result with base64 encoded values (of hex encoded bytes)
    • Should really use a gRPC or at least an HTTP/JSON client (see the marshalling sketch after this list)
  • Who should be signing and verifying messages? Re-use the authentication layer or build it into the etcd network component (and re-use keys for transport-level security)?
    • Decided to re-use withAuthentication and do some plumbing
    • Expand list of valid senders to include "us" (because we do not do short-cuts anymore)
  • Stopping the last etcd instance does not work properly. It seems like it's not gracefully handling the SIGTERM while reconnecting to other nodes.
  • Surprisingly, the etcd network component even works if we just poll a single key in a busy loop and re-deliver this message over and over. At least in a manual, interactive test using the hydra-tui.
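
A sketch of that marshalling dance with standard tools (xxd, jq, base64); the key name is illustrative:

# send: hex-encode the binary message and hand it to etcdctl put
etcdctl put "msg/alice" "$(xxd -p message.bin | tr -d '\n')"

# receive: etcdctl get -w json base64-encodes the stored (hex) value, so decode twice
etcdctl get "msg/alice" -w json | jq -r '.kvs[0].value' | base64 -d | xxd -r -p > received.bin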


ch1bo commented Sep 17, 2024

Continued and got the networked offline-mode nodes and the 3-node benchmarks to run today:

λ cabal bench hydra-cluster --benchmark-options="datasets datasets/3-nodes.json --timeout 1000s"
[...]
Writing results to: /tmp/bench-3-nodes.json-701cf5c6814573b8/results.csv
Confirmed txs/Total expected txs: 9000/9000 (100.00 %)
Average confirmation time (ms): 53.090586988
P99: 65.27539939ms
P95: 62.45322745ms
P50: 53.0196145ms
Invalid txs: 0
Benchmark bench-e2e: FINISH

It's likely slow because the current state of the branch still invokes etcdctl put for sending and parses the stdout of etcdctl watch.

Next steps

  • Figure out how to handle unreliable network connections
    • Submission while not connected to the majority seems to be just swallowed
    • How can a minority catch up after reconnecting? From which revision on to watch?
  • Run fault injection tests against this
  • Summarize findings

Notes from today are in the logbook: https://github.com/cardano-scaling/hydra/wiki/Logbook-2024-H1#sn-on-using-etcd-for-networking


ch1bo commented Sep 23, 2024

Persisted some of the original thoughts about Network component properties in #1649.

Also, the updated architecture diagram below mentions the same properties and shows how etcd is embedded in the hydra-node:

[architecture diagram]


ch1bo commented Sep 23, 2024

Got the etcd-based network now running in our fault injection test suite, both locally and using GitHub Actions: https://github.com/cardano-scaling/hydra/actions/runs/10998675568

Notes from making it work:

  • Benchmarks using docker containers failed because etcd is sensitive to inconsistencies between the --host/--port and --peer command line arguments.

  • Tested with various packet loss percentages: 10, 15, 20 and 25 percent all still seem fine!

  • With 40 percent packet loss I see the bench driver fail: Something went wrong while waiting for all confirmations: at bench/Bench/EndToEnd.hs:482:33 in main:Bench.EndToEnd: waitNext: ConnectionClosed and later a timeout.

  • As the inlined call to pumba makes clear, this is only testing outbound traffic from alice to bob and carol!

    nix run github:noonio/pumba/noon/add-flake -- -l debug netem \
      --duration 20m \
      --target 172.16.238.20 \
      --target 172.16.238.30 \
      loss --percent ${percent} \
      "re2:hydra-node-1" &
    
  • This pumba invocation should result in 10 percent packet loss to alice:

    nix run github:noonio/pumba/noon/add-flake -- -l debug netem \
      --duration 20m \
      --target 172.16.238.10 \
      loss --percent 10 \
      "re2:hydra-node-"

Next steps:

  • Use pumba (or similar) to test crashes of a node


ch1bo commented Sep 25, 2024

To run the packet drop tests locally, we just need to use the instructions from the fault tolerance GitHub action (also updated on this branch). They are also persisted here for easy reproducibility.

Fault injection

Build the pumba-compatible docker image:

nix build .#docker-hydra-node-for-netem
./result | docker load

Set up a local devnet like in https://hydra.family/head-protocol/docs/getting-started/, but using the updated local hydra-node from the .#demo dev shell:

cd demo
./prepare-devnet.sh
docker compose up -d cardano-node
sudo chmod a+w devnet/node.socket
export CARDANO_NODE_SOCKET_PATH=devnet/node.socket
./seed-devnet.sh $(which cardano-cli) $(which hydra-node)

Start the pumba-compatible hydra nodes:

docker compose -f docker-compose.yaml -f docker-compose-netem.yaml up -d hydra-node-{1,2,3}

Now we can run the workload with:

mkdir -p benchmarks
nix run .#hydra-cluster.components.benchmarks.bench-e2e -- demo \
  --output-directory=benchmarks \
  --scaling-factor=100 \
  --timeout=1000s \
  --testnet-magic 42 \
  --node-socket=devnet/node.socket \
  --hydra-client=localhost:4001 \
  --hydra-client=localhost:4002 \
  --hydra-client=localhost:4003

While this runs, we can start dropping packets from alice with, e.g. 10 percent likelihood:

nix run github:noonio/pumba/noon/add-flake -- -l debug netem \
  --duration 20m \
  --target 172.16.238.20 \
  --target 172.16.238.30 \
  loss --percent 10 \
  "re2:hydra-node-1"

And/or drop packets to alice with, e.g. 10 percent likelihood:

nix run github:noonio/pumba/noon/add-flake -- -l debug netem \
  --duration 20m \
  --target 172.16.238.10 \
  loss --percent 10 \
  "re2:hydra-node-"


ch1bo commented Sep 26, 2024

Solved the two scenarios (sketched below):

  • Broadcast while on a minority cluster -> persist outbound messages and retry
  • Deliver messages when (re-)connecting to the majority cluster -> keep a last known revision and watch starting from it
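
In etcdctl terms the two fixes look roughly like this; the msg/ prefix, file name and retry policy are illustrative assumptions:

# broadcast: a put can fail while connected to only a minority, so buffer and retry
until etcdctl put "msg/alice" "$(xxd -p message.bin | tr -d '\n')" --command-timeout=3s; do
  echo "put failed (no majority?), retrying..." >&2
  sleep 1
done

# delivery: resume the watch right after the last revision we already processed
last_rev=$(cat last-known-revision)
etcdctl watch --prefix "msg/" --rev "$((last_rev + 1))" -w json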

Kept notes in the logbook and will summarize findings tomorrow: https://github.com/cardano-scaling/hydra/wiki/Logbook-2024-H1#2024-09-26


ch1bo commented Sep 27, 2024

Summary of Raft-based network using etcd

  • Seems very robust, but "overkill"?
  • Requires etcd as runtime dependency
  • Similar user experience
    • Full, static topology with listing everyone as --peer
    • Simpler configuration via peer discovery possible
  • Introspectable network / cluster state
    • could improve debugging experience
  • etcd has a few features we have not thought of yet (see the flag sketch after this list)
    • use TLS to secure peer connections
    • separate advertised and binding addresses
  • Unclear yet whether this (mis-)use of etcd would work in the long run
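
For reference, a sketch of the etcd flags behind those two features (peer TLS and separate advertised vs. binding addresses); the certificate paths and addresses are made up:

etcd --name alice \
  --listen-peer-urls https://0.0.0.0:2380 \
  --initial-advertise-peer-urls https://alice.example.com:2380 \
  --peer-cert-file certs/alice-peer.crt \
  --peer-key-file certs/alice-peer.key \
  --peer-trusted-ca-file certs/peer-ca.crt \
  --peer-client-cert-auth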

Approach

  • Broadcast implemented as put to the etcd cluster works well
    • Using individual keys per peer is not needed, but handy for debugging
  • Need an outbound buffer to handle failure to put
    • unsuccessful when disconnected / connected to only a minority cluster
    • this highlights the possibility that broadcast can fail -> discuss this abstraction with the researchers!
  • Relying on the overall revision to deliver on recovery works well
    • This also de-duplicates any past messages
    • Maybe not needed as our protocol seems resilient against duplication
  • Cluster configuration would be less fragile if we knew everyone's --node-id
  • Have not tested the compaction feature, but it could be used to prune the revision history and bound memory/disk usage (see the sketch after this list)
  • PeerConnected and PeerDisconnected would need to change
    • we might get some information from etcd
    • more likely though, it is our own connectivity that is interesting
    • distinguish between connected (to majority) or disconnected (only on minority)
  • Protocol versioning would need to change
    • could be trivial by using versioned keys or values
  • Re-used Authenticate network component to provide message signing
    • Alternative: move message signing / verification into application logic
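
A sketch of how compaction could bound the revision history, assuming we track the oldest revision every peer has already processed (the variable name is hypothetical):

# drop all revisions older than the one every peer has already processed
etcdctl compaction "$oldest_processed_revision"

# reclaim the disk space freed by compaction
etcdctl defrag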

Performance

  • Slower when comparing average confirmation time benchmarks with GitHub Actions runs:
    • single node: 4ms -> 40ms
    • three nodes: 20ms -> 100ms
  • Faster when comparing average confirmation time benchmarks on my machine:
    • single node: 26ms -> 26ms
    • three nodes: 100ms -> 50ms
  • Unclear if faster or slower?
    • Possible explanation: the workload is spread across more threads and GitHub-hosted runners may have poor multi-thread performance
  • Definitely some potential for improvement, as the implementation spawns an etcdctl process on every broadcast (see the sketch after this list)
  • Also, persisting the last known revision is a synchronous write (could be asynchronous, best-effort)
  • Is etcd ensuring durability of its write-ahead log?
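
One way to avoid spawning etcdctl per broadcast would be etcd's built-in gRPC gateway, which accepts HTTP/JSON with base64-encoded keys and values; a sketch, with endpoint address and key name as assumptions:

# put via the HTTP/JSON gateway of etcd v3.5 (key and value must be base64-encoded)
curl -sS -X POST http://127.0.0.1:2379/v3/kv/put \
  -d "{\"key\": \"$(printf 'msg/alice' | base64)\", \"value\": \"$(base64 < message.bin | tr -d '\n')\"}"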

@noonio noonio assigned ch1bo and unassigned ch1bo Sep 30, 2024
@noonio noonio closed this as completed Oct 3, 2024