@lexnv lexnv commented May 16, 2025

This PR greatly improves WebSocket connection stability by relying on the internal buffers of tungstenite instead of buffering at a higher level. The fix passes messages through to the tungstenite socket directly.

This is a long-standing issue (silently reproducible as IO errors on all older versions) that manifested as a decryption error after the state fixes.

Issue context:

  • node is under stress due to handling multiple substreams
  • the issue affected only long running WebSocket substreams and manifested as an IO error from crypto/noise decoding
  • tungstenite WebSocketStream already has a 128KiB buffer for writing
  • litep2p has a redundant 8 KiB buffer for writing
  • litep2p buffered multiple packets internally, and tungstenite accepted the batch. I expect this creates a wrongly framed packet that fails to decode at the crypto/noise level
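
The framing hypothesis in the last bullet can be illustrated with a toy example. The sketch below is not litep2p or tungstenite code. It assumes a byte stream in which every frame carries a 2-byte big-endian length prefix, and the names `encode_frame` and `split_frames` are hypothetical helpers invented for this illustration. It only shows why a single misplaced byte in a batched write is enough to break every later frame:

```rust
/// Prefix `payload` with its length as a 2-byte big-endian integer.
/// (Hypothetical helper; assumes payloads of at most 65535 bytes.)
fn encode_frame(payload: &[u8]) -> Vec<u8> {
    let len = u16::try_from(payload.len()).expect("payload too large for a u16 prefix");
    let mut out = len.to_be_bytes().to_vec();
    out.extend_from_slice(payload);
    out
}

/// Split a byte stream into length-prefixed frames.
/// Returns an error as soon as a prefix does not match the bytes available.
fn split_frames(mut bytes: &[u8]) -> Result<Vec<Vec<u8>>, String> {
    let mut frames = Vec::new();
    while !bytes.is_empty() {
        if bytes.len() < 2 {
            return Err("truncated length prefix".to_string());
        }
        let len = u16::from_be_bytes([bytes[0], bytes[1]]) as usize;
        let rest = &bytes[2..];
        if rest.len() < len {
            return Err(format!(
                "frame claims {len} bytes, only {} available",
                rest.len()
            ));
        }
        frames.push(rest[..len].to_vec());
        bytes = &rest[len..];
    }
    Ok(frames)
}
```

Concatenating `encode_frame(b"first")` and `encode_frame(b"second")` and passing the result to `split_frames` yields the two payloads back. If one byte is dropped from the end of the first frame before the second is appended, the decoder consumes part of the second frame's prefix as payload and then reads a bogus length, so `split_frames` returns an error. In an encrypted transport, a shifted boundary like this would hand the wrong bytes to the decryptor.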

Investigation

We have noted several errors that manifested as crypto/noise decoding failures on our Kusama validators:

litep2p::crypto::noise: failed to decrypt message error=Decrypt

Upon further investigation, the errors affected only WebSocket connections. The issue could be reproduced by running a local node in Kusama with more than 500 peers in and out, as well as by running subp2p-explorer with adjusted protocols:

2025-05-15T14:58:08.095961Z ERROR {peer_id=peer_id=12D3KooWGsDvWrbApFTCpF8h7YCKHuvJbok6HAq5ZnPgE9LGWnsv}:
litep2p::crypto::noise: failed to decrypt message for bigger buffers error=Decrypt peer=PeerId("12D3KooWSa5SbCHGKpNeSs3Qak2TrM5gTkEBrPfvo6TyxhUpEHeu")

2025-05-15T14:58:08.096419Z DEBUG {peer_id=peer_id=12D3KooWGsDvWrbApFTCpF8h7YCKHuvJbok6HAq5ZnPgE9LGWnsv}:
litep2p::websocket::connection: connection closed with error peer=PeerId("12D3KooWSa5SbCHGKpNeSs3Qak2TrM5gTkEBrPfvo6TyxhUpEHeu") error=Decode(Io(Custom { kind: Other, error: "failed to decrypt message bigger buffers: decrypt error 12D3KooWSa5SbCHGKpNeSs3Qak2TrM5gTkEBrPfvo6TyxhUpEHeu" }))

The issue also reproduced on the zombienet PR, which uses litep2p:

2025-05-14 09:37:30.805  INFO tokio-runtime-worker sync: Warp sync is complete, continuing with state sync.    

2025-05-14 09:37:33.189 ERROR tokio-runtime-worker litep2p::crypto::noise: failed to decrypt message error=Decrypt
2025-05-14 09:37:33.283 ERROR tokio-runtime-worker litep2p::crypto::noise: failed to decrypt message error=Decrypt
2025-05-14 09:37:34.764 ERROR tokio-runtime-worker litep2p::crypto::noise: failed to decrypt message error=Decrypt
	
2025-05-14 09:37:35.656  INFO tokio-runtime-worker substrate: ⚙️  State sync, Downloading state, 22%, 2.21 Mib (0 peers), best: #0 (0xc5e7…d059), finalized #0 (0xc5e7…d059), ⬇ 707.8kiB/s ⬆ 0.5kiB/s    
	
2025-05-14 09:37:40.657  INFO tokio-runtime-worker substrate: ⚙️  State sync, Downloading state, 22%, 2.21 Mib (3 peers), best: #0 (0xc5e7…d059), finalized #0 (0xc5e7…d059), ⬇ 1.0kiB/s ⬆ 1.0kiB/s    

Testing Done

Performance

Tested the performance with litep2p-perf using the following branch:

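| Run    | Status     | Data Size  | Time (s) | Bandwidth (Mbit/s) |
|--------|------------|------------|----------|--------------------|
| Before | Uploaded   | 256.00 MiB | 15.1152  | 135.49             |
| Before | Downloaded | 256.00 MiB | 13.2296  | 154.80             |
| After  | Uploaded   | 256.00 MiB | 15.7178  | 130.30             |
| After  | Downloaded | 256.00 MiB | 13.2435  | 154.64             |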

From the performance table, we are within 4% of the original buggy implementation: upload bandwidth drops by about 3.8% and download bandwidth by about 0.1%. I would attribute this to normal run-to-run variation in our results. Therefore, the performance is effectively unaffected.
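
For reference, the bandwidth figures and the relative change between the two runs can be recomputed from the table. This is a minimal sketch; the function names are hypothetical, and it assumes the bandwidth column counts 1 MiB as 8 Mbit, which is what the reported numbers work out to:

```rust
/// Bandwidth in Mbit/s for `mib` mebibytes transferred in `secs` seconds,
/// counting 1 MiB as 8 Mbit (assumption inferred from the table's figures).
fn bandwidth_mbit_s(mib: f64, secs: f64) -> f64 {
    mib * 8.0 / secs
}

/// Relative change from `before` to `after`, in percent.
/// A negative value means the "after" run is slower.
fn relative_change_pct(before: f64, after: f64) -> f64 {
    (after - before) / before * 100.0
}
```

With the table's values, `relative_change_pct(135.49, 130.30)` is roughly -3.8 for uploads and `relative_change_pct(154.80, 154.64)` is roughly -0.1 for downloads.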

Repro Case

I have added a custom user protocol as part of our testing to catch these errors.

  • The protocol opens 16 outbound substreams on the connection established event. Therefore, it will handle 16 outbound substreams and 16 inbound substreams
  • The outbound substreams will push a configurable number of packets, each of size 128 bytes, to the remote peer, while the inbound substreams will read the same number of packets from the remote peer.

Before this PR, TCP was unaffected while WebSocket reproduced the decrypt failure. After this PR, the test passes.
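
The shape of that workload can be sketched with the standard library alone. The code below is not the actual test and does not use litep2p: in-process channels stand in for substreams, only one direction is modelled, and `run_workload` is a hypothetical name. It shows 16 writers each pushing a configurable number of 128-byte packets while 16 readers verify what they receive:

```rust
use std::sync::mpsc;
use std::thread;

const SUBSTREAMS: usize = 16;
const PACKET_SIZE: usize = 128;

/// Run the toy workload and return the number of bytes each reader received.
/// Channels stand in for substreams here; this is an illustration only.
fn run_workload(packets_per_substream: usize) -> Vec<usize> {
    let mut writers = Vec::with_capacity(SUBSTREAMS);
    let mut readers = Vec::with_capacity(SUBSTREAMS);

    for id in 0..SUBSTREAMS {
        let (tx, rx) = mpsc::channel::<Vec<u8>>();

        // Writer: push fixed-size packets filled with the substream id.
        writers.push(thread::spawn(move || {
            for _ in 0..packets_per_substream {
                tx.send(vec![id as u8; PACKET_SIZE]).expect("reader is alive");
            }
            // `tx` is dropped here, which ends the reader's loop.
        }));

        // Reader: check size and content of every packet, count the bytes.
        readers.push(thread::spawn(move || {
            let mut received = 0;
            for packet in rx {
                assert_eq!(packet.len(), PACKET_SIZE);
                assert!(packet.iter().all(|&b| b == id as u8));
                received += packet.len();
            }
            received
        }));
    }

    for writer in writers {
        writer.join().expect("writer panicked");
    }
    readers
        .into_iter()
        .map(|reader| reader.join().expect("reader panicked"))
        .collect()
}
```

Calling `run_workload(100)` should return 16 entries of 12800 bytes each. In the real test the equivalent check happens over an encrypted connection, where a corrupted or misframed packet surfaces as a decrypt failure rather than a content mismatch.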

Closes: paritytech/polkadot-sdk#8525

@lexnv lexnv self-assigned this May 16, 2025
@lexnv lexnv added the bug Something isn't working label May 16, 2025
lexnv added 3 commits May 16, 2025 14:32
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
Signed-off-by: Alexandru Vasile <[email protected]>
@lexnv lexnv merged commit 276b190 into master May 22, 2025
8 checks passed
@lexnv lexnv deleted the lexnv/fix-ws-stability branch May 22, 2025 15:48
lexnv added a commit that referenced this pull request May 26, 2025
## [0.9.5] - 2025-05-26

This release primarily focuses on strengthening the stability of the
websocket transport. We've resolved an issue where higher-level
buffering was causing the Noise protocol to fail when decoding messages.

We've also significantly improved connectivity between litep2p and
Smoldot (the Substrate-based light client). Empty frames are now handled
correctly, preventing handshake timeouts and ensuring smoother
communication.

Finally, we've carried out several dependency updates to keep the
library current with the latest versions of its underlying components.

### Fixed

- substream/fix: Allow empty payloads with 0-length frame
([#395](#395))
- websocket: Fix connection stability on decrypt messages
([#393](#393))

### Changed

- crypto/noise: Show peerIDs that fail to decode
([#392](#392))
- cargo: Bump yamux to 0.13.5 and tokio to 1.45.0
([#396](#396))
- ci: Enforce and apply clippy rules
([#388](#388))
- build(deps): bump ring from 0.16.20 to 0.17.14
([#389](#389))
- Update hickory-resolver 0.24.2 -> 0.25.2
([#386](#386))

cc @paritytech/networking

---------

Signed-off-by: Alexandru Vasile <[email protected]>
github-merge-queue bot pushed a commit to paritytech/polkadot-sdk that referenced this pull request Jun 3, 2025
# Litep2p Becomes the Default Network Backend

This PR finalizes the [litep2p](https://github.com/paritytech/litep2p)
integration and makes it the default network backend for substrate-based
chains.

## Litep2p Improvements

After the stabilization, a forum post will follow with up to date
information and more accurate measurements of the live impact of
litep2p.

### CPU Usage Reduction

**Litep2p consumes roughly 2x less CPU than the libp2p alternative**.
This frees up resources for other use cases (subsystems) and enables
running nodes on more cost-efficient hardware.

This data comes from the `networking::libp2p-node` metric
of a live Kusama validator. It represents the CPU time spent on
polling the networking task. Litep2p CPU consumption is on the left,
using roughly 1.3x CPUs, while libp2p on the right uses roughly 2.9-3x
CPUs:

![Screenshot 2025-05-26 at 15 23
22](https://github.com/user-attachments/assets/17bf1ed8-b887-423e-b131-f0bbf146919e)


This metric has been collected by the NodeExporter of a live Kusama
validator. Litep2p CPU consumption is on the left, using roughly 230
CPU units, while libp2p on the right uses roughly 350 CPU units. This
makes litep2p ~1.52 times more efficient:

![Screenshot 2025-05-26 at 15 24
33](https://github.com/user-attachments/assets/8923cb56-241d-4e1d-9593-33c5def2ff4d)



### DHT Improvements and Authority Discovery

Litep2p is able to discover peers faster via the Kademlia protocol than
libp2p. This behavior manifests in faster discovery times for
validators. For context, libp2p discovers 1K DHT records (authority
records) in approximately 10 minutes, while litep2p discovers them in
just 2.5 minutes (for more info see
#7077 (comment)).

This will help with issues we've seen with libp2p that cause validators
to not receive rewards:
- #8548

### Stable Sync Peers

Litep2p presents a more stable peer count in comparison with the libp2p
backend. This ensures we can sync up faster than libp2p to the tip of
the chain. In an older experiment, litep2p synced to the tip of the chain
in 526s, compared to 803s for libp2p. The stability of connections shows
improvements for other protocols as well:

![Screenshot 2025-05-26 at 15 01
59](https://github.com/user-attachments/assets/ac3607ba-a551-49e5-9a50-f5150a6b619f)

The previous image shows on the left the litep2p version and on the
right the libp2p version.


### Revert Kusama Enablement
This PR reverts #7866.
Litep2p is now enabled by default, so we no longer need to selectively
enable it on different chains.

### Litep2p 0.9.5

This release primarily focuses on strengthening the stability of the
websocket transport. We've resolved an issue where higher-level
buffering was causing the Noise protocol to fail when decoding messages.

We've also significantly improved connectivity between litep2p and
Smoldot (the Substrate-based light client). Empty frames are now handled
correctly, preventing handshake timeouts and ensuring smoother
communication.

Finally, we've carried out several dependency updates to keep the
library current with the latest versions of its underlying components.

Fixed:
- substream/fix: Allow empty payloads with 0-length frame
([#395](paritytech/litep2p#395))
- websocket: Fix connection stability on decrypt messages
([#393](paritytech/litep2p#393))

Changed:
- crypto/noise: Show peerIDs that fail to decode
([#392](paritytech/litep2p#392))
- cargo: Bump yamux to 0.13.5 and tokio to 1.45.0
([#396](paritytech/litep2p#396))
- ci: Enforce and apply clippy rules
([#388](paritytech/litep2p#388))
- build(deps): bump ring from 0.16.20 to 0.17.14
([#389](paritytech/litep2p#389))
- Update hickory-resolver 0.24.2 -> 0.25.2
([#386](paritytech/litep2p#386))


### Fix peerset reserve only mode

This has been moved to PR #8650 for ease of reviewing.
The PR rejects non-reserved peers in the reserved-only mode of the
litep2p notification peerset.

---------

Signed-off-by: Alexandru Vasile <[email protected]>
pgherveou pushed a commit to paritytech/polkadot-sdk that referenced this pull request Jun 11, 2025
alvicsam pushed a commit to paritytech/polkadot-sdk that referenced this pull request Oct 17, 2025