
net/discovery: File persistence for AddrCache #8839

Merged
Sajjon merged 53 commits into master from cyon/persist_peers_cache on Jul 2, 2025

Conversation


@Sajjon Sajjon commented Jun 12, 2025

Implementation of #8758

Description

The Authority Discovery crate has been changed so that the `AddrCache` is persisted to `persisted_cache_file_path`, a JSON file in the `net_config_path` folder controlled by `NetworkConfiguration`.

`AddrCache` is JSON-serialized (`serde_json::to_string_pretty`) and persisted to file:

  • periodically (every 10 minutes)
  • on shutdown
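
The write side is a plain `std::fs::write` of an already-serialized snapshot; a minimal std-only sketch (the function name and the pre-serialized JSON argument are illustrative, not the PR's actual API):

```rust
use std::{fs, io, path::Path};

/// Write an already-serialized cache snapshot to disk.
/// Sketch only: the real worker serializes the AddrCache with
/// serde_json before handing the string to `fs::write`.
fn persist_snapshot(path: &Path, json: &str) -> io::Result<()> {
    // `fs::write` creates the file if missing and truncates it otherwise,
    // so each periodic tick replaces the previous snapshot wholesale.
    fs::write(path, json)
}
```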

Furthermore, this persisted `AddrCache` file is read on worker start-up; if it does not exist, or deserialization fails, a new empty cache is used.
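
The read-or-fall-back-to-empty pattern can be sketched with std alone (the real code deserializes an `AddrCache` via serde; here the "cache" is a plain string and any read failure simply yields the default value):

```rust
use std::{fs, path::Path};

/// Read a persisted snapshot on start-up, falling back to an empty
/// value when the file is missing or unreadable. Sketch only: the
/// real worker deserializes an AddrCache and falls back to an empty
/// cache on any error.
fn load_or_default(path: &Path) -> String {
    fs::read_to_string(path).unwrap_or_default()
}
```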

`AddrCache` is made `Serialize`/`Deserialize` thanks to `PeerId` and `Multiaddr` being made `Serialize`/`Deserialize`.

Implementation

The worker takes a spawner, which is used in the run loop of the worker, where at an interval we try to persist the `AddrCache`. We won't persist the `AddrCache` if `persisted_cache_file_path: Option<PathBuf>` is `None`, which it is when `NetworkConfiguration`'s `net_config_path` is `None`. We spawn a new task each time the interval "ticks" (once every 10 minutes), and it uses `fs::write` (there is also `tokio::fs::write`, but that requires tokio's `fs` feature flag, which is not activated, so I chose not to use it). If the worker shuts down, we try to persist one final time without using the spawner.
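
The tick-and-shutdown flow above can be sketched with a std-only loop; the channel-based shutdown signal, names, and the configurable interval are illustrative assumptions (the real worker spawns the periodic writes via `SpawnNamed` on a fixed 10-minute interval):

```rust
use std::{path::PathBuf, sync::mpsc, thread, time::Duration};

/// Sketch of the persistence loop: on every `interval` tick we write a
/// snapshot; on shutdown we persist one final time inline and exit.
fn run_persist_loop(
    path: Option<PathBuf>,
    interval: Duration,
    shutdown: mpsc::Receiver<()>,
    snapshot: impl Fn() -> String,
) {
    // A `None` path (e.g. no net_config_path configured) disables persistence.
    let Some(path) = path else { return };
    loop {
        match shutdown.recv_timeout(interval) {
            // Interval "tick": persist (the real worker does this in a
            // freshly spawned task).
            Err(mpsc::RecvTimeoutError::Timeout) => {
                let _ = std::fs::write(&path, snapshot());
            }
            // Shutdown (signal received or sender dropped): final inline
            // persist without the spawner, then exit the loop.
            _ => {
                let _ = std::fs::write(&path, snapshot());
                return;
            }
        }
    }
}
```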

Changes

  • New crate dependency: `serde_with`, for the `SerializeDisplay` and `DeserializeFromStr` macros
  • `WorkerConfig` in the authority-discovery crate has a new field, `persisted_cache_directory: Option<PathBuf>`
  • The `Worker` constructor in the authority-discovery crate now takes a new parameter, `spawner: Arc<dyn SpawnNamed>`
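
`serde_with`'s `SerializeDisplay`/`DeserializeFromStr` work by routing serialization through a type's `Display` impl and deserialization through its `FromStr` impl. A std-only sketch of that roundtrip with a hypothetical `PeerId` newtype (not the real libp2p type):

```rust
use std::{fmt, str::FromStr};

/// Hypothetical stand-in for a peer identifier that serializes as a
/// string, mirroring what SerializeDisplay/DeserializeFromStr rely on.
#[derive(Debug, PartialEq)]
struct PeerId(String);

impl fmt::Display for PeerId {
    // Serialization path: the type's human-readable form.
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{}", self.0)
    }
}

impl FromStr for PeerId {
    type Err = String;
    // Deserialization path: parse the string form back, rejecting
    // obviously invalid input.
    fn from_str(s: &str) -> Result<Self, Self::Err> {
        if s.is_empty() {
            return Err("empty peer id".into());
        }
        Ok(PeerId(s.to_string()))
    }
}
```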

Tests

  • authority-discovery tests are changed to use the tokio runtime (`#[tokio::test]`), and we pass a test worker config with a tempdir for `persisted_cache_directory`

net_config_path

Here are the `net_config_path` locations (from `NetworkConfiguration`), i.e. the folder this PR saves the serialized `AddrCache` in:

dev

cargo build --release && ./target/release/polkadot --dev

shows => /var/folders/63/fs7x_3h16svftdz4g9bjk13h0000gn/T/substratey5QShJ/chains/rococo_dev/network/authority_discovery_addr_cache.json

kusama

cargo build --release && ./target/release/polkadot --chain kusama --validator

shows => ~/Library/Application Support/polkadot/chains/ksmcc3/network/authority_discovery_addr_cache.json

Caution

The node shut down automatically with a scary error.

Essential task `overseer` failed. Shutting down service.
TCP listener terminated with error error=Custom { kind: Other, error: "A Tokio 1.x context was found, but it is being shutdown." }
Installed transports terminated, ignore if the node is stopping
Litep2p backend terminated
Error:
  0: Other: Essential task failed.

This may be expected/correct, but I just wanted to flag it; expand the output below to see the full log.

Or did I break anything?

Full log with scary error (expand me 👈):
$ ./target/release/polkadot --chain kusama --validator
2025-06-19 14:34:35 ----------------------------
2025-06-19 14:34:35 This chain is not in any way
2025-06-19 14:34:35       endorsed by the
2025-06-19 14:34:35      KUSAMA FOUNDATION
2025-06-19 14:34:35 ----------------------------
2025-06-19 14:34:35 Parity Polkadot
2025-06-19 14:34:35 ✌️  version 1.18.5-e6b86b54d31
2025-06-19 14:34:35 ❤️  by Parity Technologies <admin@parity.io>, 2017-2025
2025-06-19 14:34:35 📋 Chain specification: Kusama
2025-06-19 14:34:35 🏷  Node name: glamorous-game-6626
2025-06-19 14:34:35 👤 Role: AUTHORITY
2025-06-19 14:34:35 💾 Database: RocksDb at /Users/alexandercyon/Library/Application Support/polkadot/chains/ksmcc3/db/full
2025-06-19 14:34:39 Creating transaction pool txpool_type=SingleState ready=Limit { count: 8192, total_bytes: 20971520 } future=Limit { count: 819, total_bytes: 2097152 }
2025-06-19 14:34:39 🚀 Using prepare-worker binary at: "/Users/alexandercyon/Developer/Rust/polkadot-sdk/target/release/polkadot-prepare-worker"
2025-06-19 14:34:39 🚀 Using execute-worker binary at: "/Users/alexandercyon/Developer/Rust/polkadot-sdk/target/release/polkadot-execute-worker"
2025-06-19 14:34:39 Local node identity is: 12D3KooWPVh77R44wZwySBys262Jh4BSbpMFxtvQNmi1EpdcwDDW
2025-06-19 14:34:39 Running litep2p network backend
2025-06-19 14:34:40 💻 Operating system: macos
2025-06-19 14:34:40 💻 CPU architecture: aarch64
2025-06-19 14:34:40 📦 Highest known block at #1294645
2025-06-19 14:34:40 〽️ Prometheus exporter started at 127.0.0.1:9615
2025-06-19 14:34:40 Running JSON-RPC server: addr=127.0.0.1:9944,[::1]:9944
2025-06-19 14:34:40 🏁 CPU single core score: 1.35 GiBs, parallelism score: 1.44 GiBs with expected cores: 8
2025-06-19 14:34:40 🏁 Memory score: 63.75 GiBs
2025-06-19 14:34:40 🏁 Disk score (seq. writes): 2.92 GiBs
2025-06-19 14:34:40 🏁 Disk score (rand. writes): 727.56 MiBs
2025-06-19 14:34:40 CYON: 🔮 Good, path set to: /Users/alexandercyon/Library/Application Support/polkadot/chains/ksmcc3/network/authority_discovery_addr_cache.json
2025-06-19 14:34:40 🚨 Your system cannot securely run a validator.
Running validation of malicious PVF code has a higher risk of compromising this machine.
Secure mode is enabled only for Linux
and a full secure mode is enabled only for Linux x86-64.
You can ignore this error with the `--insecure-validator-i-know-what-i-do` command line argument if you understand and accept the risks of running insecurely. With this flag, security features are enabled on a best-effort basis, but not mandatory.
More information: https://docs.polkadot.com/infrastructure/running-a-validator/operational-tasks/general-management/#secure-your-validator
2025-06-19 14:34:40 Successfully persisted AddrCache on disk
2025-06-19 14:34:40 subsystem exited with error subsystem="candidate-validation" err=FromOrigin { origin: "candidate-validation", source: Context("could not enable Secure Validator Mode for non-Linux; check logs") }
2025-06-19 14:34:40 Starting workers
2025-06-19 14:34:40 Starting approval distribution workers
2025-06-19 14:34:40 👶 Starting BABE Authorship worker
2025-06-19 14:34:40 Starting approval voting workers
2025-06-19 14:34:40 Starting main subsystem loop
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="candidate-validation"
2025-06-19 14:34:40 Starting with an empty approval vote DB.
2025-06-19 14:34:40 subsystem finished unexpectedly subsystem=Ok(())
2025-06-19 14:34:40 🥩 BEEFY gadget waiting for BEEFY pallet to become available...
2025-06-19 14:34:40 Received `Conclude` signal, exiting
2025-06-19 14:34:40 Conclude
2025-06-19 14:34:40 received `Conclude` signal, exiting
2025-06-19 14:34:40 received `Conclude` signal, exiting
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="availability-recovery"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="bitfield-distribution"
2025-06-19 14:34:40 Approval distribution worker 3, exiting because of shutdown
2025-06-19 14:34:40 Approval distribution worker 2, exiting because of shutdown
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="dispute-distribution"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="chain-selection"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="pvf-checker"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="availability-store"
2025-06-19 14:34:40 Approval distribution worker 1, exiting because of shutdown
2025-06-19 14:34:40 Approval distribution worker 0, exiting because of shutdown
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="approval-voting"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="approval-distribution"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="chain-api"
2025-06-19 14:34:40 Approval distribution stream finished, most likely shutting down
2025-06-19 14:34:40 Approval distribution stream finished, most likely shutting down
2025-06-19 14:34:40 Approval distribution stream finished, most likely shutting down
2025-06-19 14:34:40 Approval distribution stream finished, most likely shutting down
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="provisioner"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="availability-distribution"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="runtime-api"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="candidate-backing"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="collation-generation"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="gossip-support"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="approval-voting-parallel"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="bitfield-signing"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="collator-protocol"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="statement-distribution"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="network-bridge-tx"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="network-bridge-rx"
2025-06-19 14:34:41 subsystem exited with error subsystem="prospective-parachains" err=FromOrigin { origin: "prospective-parachains", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2025-06-19 14:34:41 subsystem exited with error subsystem="dispute-coordinator" err=FromOrigin { origin: "dispute-coordinator", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2025-06-19 14:34:41 Essential task `overseer` failed. Shutting down service.
2025-06-19 14:34:41 TCP listener terminated with error error=Custom { kind: Other, error: "A Tokio 1.x context was found, but it is being shutdown." }
2025-06-19 14:34:41 Installed transports terminated, ignore if the node is stopping
2025-06-19 14:34:41 Litep2p backend terminated
Error:
   0: Other: Essential task failed.

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.

🤔

kusama -d /my/custom/path

cargo build --release && ./target/release/polkadot --chain kusama --validator --unsafe-force-node-key-generation -d /my/custom/path

shows => ./my/custom/path/chains/ksmcc3/network/ for net_config_path

test

I've configured a `WorkerConfig` with a `tempfile` temp directory for all tests. To my surprise, I had to call `fs::create_dir_all` in order for the tempdir to actually be created.
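
That matches `std::fs` behavior: `fs::write` does not create missing parent directories, so writing into a not-yet-created directory fails. A small sketch of the fix (the helper name is illustrative):

```rust
use std::{fs, io, path::Path};

/// `fs::write` fails if the parent directory does not exist, so create
/// the whole parent chain first, as the tests had to do.
fn write_creating_parents(path: &Path, contents: &str) -> io::Result<()> {
    if let Some(parent) = path.parent() {
        fs::create_dir_all(parent)?;
    }
    fs::write(path, contents)
}
```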

@Sajjon Sajjon changed the title Add CodableAddrCache (Encode/Decode) with TryFrom/From for AddrCache Add SerializableAddrCache (Serialize/Deserialize) with TryFrom/From for AddrCache Jun 12, 2025
@Sajjon Sajjon force-pushed the cyon/persist_peers_cache branch 4 times, most recently from 15fcb7e to 004ca5f Compare June 16, 2025 11:38
@Sajjon Sajjon changed the title Add SerializableAddrCache (Serialize/Deserialize) with TryFrom/From for AddrCache net/discovery: File persistence for AddrCache Jun 16, 2025
@Sajjon Sajjon added T0-node This PR/Issue is related to the topic “node”. T8-polkadot This PR/Issue is related to/affects the Polkadot network. labels Jun 16, 2025
@Sajjon Sajjon force-pushed the cyon/persist_peers_cache branch from 35a31a0 to a5925bf Compare June 16, 2025 13:54
@lexnv lexnv requested review from dmitry-markin and lexnv June 16, 2025 14:14
@Sajjon Sajjon added this pull request to the merge queue Jul 2, 2025
@github-merge-queue github-merge-queue bot removed this pull request from the merge queue due to failed status checks Jul 2, 2025
@paritytech-workflow-stopper

All GitHub workflows were cancelled due to the failure of one of the required jobs.
Failed workflow url: https://github.com/paritytech/polkadot-sdk/actions/runs/16021153898
Failed job name: test-linux-stable-runtime-benchmarks

@Sajjon Sajjon enabled auto-merge July 2, 2025 11:58
@Sajjon Sajjon added this pull request to the merge queue Jul 2, 2025
Merged via the queue into master with commit ee6d22b Jul 2, 2025
241 checks passed
@Sajjon Sajjon deleted the cyon/persist_peers_cache branch July 2, 2025 13:13
@paritytech-release-backport-bot

Created backport PR for stable2412:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-8839-to-stable2412
git worktree add --checkout .worktree/backport-8839-to-stable2412 backport-8839-to-stable2412
cd .worktree/backport-8839-to-stable2412
git reset --hard HEAD^
git cherry-pick -x ee6d22b94d9a93ac5989d4cce2f20a604b86214b
git push --force-with-lease

@paritytech-release-backport-bot

Created backport PR for stable2503:

Please cherry-pick the changes locally and resolve any conflicts.

git fetch origin backport-8839-to-stable2503
git worktree add --checkout .worktree/backport-8839-to-stable2503 backport-8839-to-stable2503
cd .worktree/backport-8839-to-stable2503
git reset --hard HEAD^
git cherry-pick -x ee6d22b94d9a93ac5989d4cce2f20a604b86214b
git push --force-with-lease

paritytech-release-backport-bot bot pushed a commit that referenced this pull request Jul 2, 2025
(Commit message: verbatim copy of the PR description above.)

Co-authored-by: Alexandru Vasile <60601340+lexnv@users.noreply.github.com>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: alvicsam <alvicsam@gmail.com>
(cherry picked from commit ee6d22b)
@paritytech-release-backport-bot

Successfully created backport PR for stable2506:

EgorPopelyaev added a commit that referenced this pull request Jul 4, 2025
Backport #8839 into `stable2506` from Sajjon.

See the
[documentation](https://github.com/paritytech/polkadot-sdk/blob/master/docs/BACKPORT.md)
on how to use this bot.


---------

Co-authored-by: Alexander Cyon <Sajjon@users.noreply.github.com>
Co-authored-by: Alexandru Vasile <60601340+lexnv@users.noreply.github.com>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: alvicsam <alvicsam@gmail.com>
Co-authored-by: Egor_P <egor@parity.io>
Co-authored-by: Alexander Cyon <alex.cyon@parity.io>
ordian added a commit that referenced this pull request Jul 24, 2025
* master: (91 commits)
  Add extra information to the harmless error logs during validate_transaction (#9047)
  `sp-tracing`: Remove `test-utils` feature (#9063)
  add try-state check for staking roles -- staker cannot be nominator a… (#9034)
  net/discovery: File persistence for `AddrCache` (#8839)
  dispute-coordinator: handle race with offchain disabling (#9050)
  Align parameters for `EventEmitter::emit_sent_event` (#9057)
  Fetch parent block `api_version` (#9059)
  [XCM Precompile] Rename functions and improve docs in the Solidity interface (#9023)
  Cleanup and improvements for `ControlledValidatorIndices` (#8896)
  reenable 0001-parachains-pvf (#9046)
  Add optional auto-rebag within on-idle (#8684)
  Fix flaxy 0003-block-building-warp-sync test - one more approach (#8974)
  [Staking] [AHM] Fixes insufficient slashing of nominators (and some other small issues). (#8937)
  chore: Bump bounded-collections dep (#9004)
  XCMP and DMP improvements (#8860)
  EPMB/unsigned: fixed multi-page winner computation (#8987)
  Always send full parent header, not only hash, part of collation response (#8939)
  revive: Precompiles should return dummy code when queried (#9001)
  Fix confusing log messages in network protocol behaviour (#8819)
  Fix pallet_migrations benchmark when FailedMigrationHandler emits events (#8694)
  ...
alvicsam added a commit that referenced this pull request Oct 17, 2025
(Commit message: verbatim copy of the PR description above.)
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="bitfield-distribution"
2025-06-19 14:34:40 Approval distribution worker 3, exiting because of shutdown
2025-06-19 14:34:40 Approval distribution worker 2, exiting because of shutdown
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="dispute-distribution"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="chain-selection"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="pvf-checker"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="availability-store"
2025-06-19 14:34:40 Approval distribution worker 1, exiting because of shutdown
2025-06-19 14:34:40 Approval distribution worker 0, exiting because of shutdown
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="approval-voting"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="approval-distribution"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="chain-api"
2025-06-19 14:34:40 Approval distribution stream finished, most likely shutting down
2025-06-19 14:34:40 Approval distribution stream finished, most likely shutting down
2025-06-19 14:34:40 Approval distribution stream finished, most likely shutting down
2025-06-19 14:34:40 Approval distribution stream finished, most likely shutting down
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="provisioner"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="availability-distribution"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="runtime-api"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="candidate-backing"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="collation-generation"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="gossip-support"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="approval-voting-parallel"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="bitfield-signing"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="collator-protocol"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="statement-distribution"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="network-bridge-tx"
2025-06-19 14:34:40 Terminating due to subsystem exit subsystem="network-bridge-rx"
2025-06-19 14:34:41 subsystem exited with error subsystem="prospective-parachains" err=FromOrigin { origin: "prospective-parachains", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2025-06-19 14:34:41 subsystem exited with error subsystem="dispute-coordinator" err=FromOrigin { origin: "dispute-coordinator", source: SubsystemReceive(Generated(Context("Signal channel is terminated and empty."))) }
2025-06-19 14:34:41 Essential task `overseer` failed. Shutting down service.
2025-06-19 14:34:41 TCP listener terminated with error error=Custom { kind: Other, error: "A Tokio 1.x context was found, but it is being shutdown." }
2025-06-19 14:34:41 Installed transports terminated, ignore if the node is stopping
2025-06-19 14:34:41 Litep2p backend terminated
Error:
   0: Other: Essential task failed.

Backtrace omitted. Run with RUST_BACKTRACE=1 environment variable to display it.
Run with RUST_BACKTRACE=full to include source snippets.
```

🤔

</details>

## `kusama -d /my/custom/path`
```sh
cargo build --release && ./target/release/polkadot --chain kusama --validator --unsafe-force-node-key-generation -d /my/custom/path
```
shows `net_config_path` resolving to `/my/custom/path/chains/ksmcc3/network/`

## `test`

I've configured a `WorkerConfig` with a temporary directory (via `tempfile`)
for all tests. To my surprise, I had to call `fs::create_dir_all` for the
tempdir to actually exist on disk.

---------

Co-authored-by: Alexandru Vasile <60601340+lexnv@users.noreply.github.com>
Co-authored-by: cmd[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: alvicsam <alvicsam@gmail.com>

Labels

- `A4-backport-stable2503`: Pull request must be backported to the stable2503 release branch
- `A4-backport-stable2506`: Pull request must be backported to the stable2506 release branch
- `T0-node`: This PR/Issue is related to the topic "node".
- `T8-polkadot`: This PR/Issue is related to/affects the Polkadot network.

8 participants