[Fix] Harden the router's resolver #3540

Merged
vicsn merged 13 commits into ProvableHQ:staging from ljedrz:fix/hardened_resolver
Jul 7, 2025

Conversation

@ljedrz
Collaborator

@ljedrz ljedrz commented Mar 14, 2025

While investigating a potential issue with some trusted peers being periodically dropped, I noticed many instances of "Unable to resolve the (...) address" in the log extracts from different networks. I believe most of them are triggered unnecessarily, but we need to be sure, and this PR aims to address this.

The proposed changes are as follows:

  • 1e8dc49 - changes the dual-lock setup of the resolver to a single-lock one in order to avoid any possibility of mismatch between the address maps; it should also slightly improve its performance
  • 260e84b - the inbound method is "fed" from a lower-level queue which doesn't have an awareness of the address resolver, so the entries that fail to resolve there are basically guaranteed to be post-disconnect "stragglers" and may be ignored (instead of triggering potentially many redundant disconnect attempts, which result in further resolver-related warnings)
  • 6bb8a74 - this swaps the order of disconnect-related operations, altering the resolver only after a peer is no longer marked as connected; this will avoid situations where an outbound message is greenlit to be sent to a peer (who is marked as connected) only to fail at address resolution right afterwards, triggering a bogus warning
  • bccf29a - this is a loosely-related drive-by; we should clear any peer-related cache before marking them as a candidate for connections, in order to avoid a (highly unlikely) scenario where the peer is reconnected to while having outdated cache entries, or even having new and applicable cache entries cleared
  • 7ee66b4 - when a peer sends us a Message::Disconnect, we shouldn't report it as a protocol violation; this is mostly a cleanup of one or two misleading logs
  • b673d7b - since I've seen some instances of the heartbeat process reporting lingering inactive peers, we should have a fallback cleanup of high-level connection artifacts in case the resolver can't find the physically connected address

Update: the PR was rebased due to a conflict, and while these commit hashes have changed, their contents or order haven't.
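
To illustrate the first and third items in the list above, here is a minimal sketch of a resolver that keeps both address maps behind a single lock, together with a disconnect routine that unmarks the peer as connected before touching the resolver. It is not the code from this PR: the names `AddressMaps`, `Resolver`, `Router`, `connect`, `disconnect` and `resolve_for_send` are hypothetical, and `std::sync::RwLock` is used only to keep the example dependency-free.

```rust
use std::collections::{HashMap, HashSet};
use std::net::SocketAddr;
use std::sync::RwLock;

/// Both directions of the address mapping (hypothetical layout).
/// Keeping them in one struct lets a single lock guard both maps.
#[derive(Default)]
struct AddressMaps {
    /// Listener address -> physically connected (ephemeral) address.
    to_connected: HashMap<SocketAddr, SocketAddr>,
    /// Physically connected (ephemeral) address -> listener address.
    to_listener: HashMap<SocketAddr, SocketAddr>,
}

/// Single-lock resolver: every update changes both maps under one write
/// guard, so readers never observe one map updated without the other.
#[derive(Default)]
struct Resolver {
    maps: RwLock<AddressMaps>,
}

impl Resolver {
    fn insert_peer(&self, listener: SocketAddr, connected: SocketAddr) {
        let mut maps = self.maps.write().unwrap();
        maps.to_connected.insert(listener, connected);
        maps.to_listener.insert(connected, listener);
    }

    fn remove_peer(&self, listener: SocketAddr) {
        let mut maps = self.maps.write().unwrap();
        let removed = maps.to_connected.remove(&listener);
        if let Some(connected) = removed {
            maps.to_listener.remove(&connected);
        }
    }

    fn get_connected(&self, listener: SocketAddr) -> Option<SocketAddr> {
        self.maps.read().unwrap().to_connected.get(&listener).copied()
    }

    fn get_listener(&self, connected: SocketAddr) -> Option<SocketAddr> {
        self.maps.read().unwrap().to_listener.get(&connected).copied()
    }
}

/// Hypothetical router holding the set of connected peers and the resolver.
#[derive(Default)]
struct Router {
    connected_peers: RwLock<HashSet<SocketAddr>>,
    resolver: Resolver,
}

impl Router {
    fn connect(&self, listener: SocketAddr, connected: SocketAddr) {
        // Register the mapping before the peer becomes visible as connected.
        self.resolver.insert_peer(listener, connected);
        self.connected_peers.write().unwrap().insert(listener);
    }

    fn disconnect(&self, listener: SocketAddr) {
        // 1. Stop treating the peer as connected, so that no new outbound
        //    message is approved for it.
        self.connected_peers.write().unwrap().remove(&listener);
        // 2. Only then remove the address mapping from the resolver.
        self.resolver.remove_peer(listener);
    }

    /// Returns the address to send to, or `None` if the peer is not connected.
    fn resolve_for_send(&self, listener: SocketAddr) -> Option<SocketAddr> {
        if !self.connected_peers.read().unwrap().contains(&listener) {
            return None;
        }
        self.resolver.get_connected(listener)
    }
}
```

With this ordering, a sender that passes the connected check during a concurrent `disconnect` can still resolve the address, because the mapping is removed only after the peer has been unmarked. With the reverse order, the same sender could pass the check and then fail at resolution, which is the bogus warning described above.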

Collaborator

@niklaslong niklaslong left a comment

A nice tightening up of peer tracking! Did a first pass and the current changeset looks good 👍

@howardwu
Member

@ljedrz Do we need to apply the same changes from Router to Gateway?

@ljedrz
Collaborator Author

ljedrz commented Mar 26, 2025

@howardwu not necessarily; my recommendation would be to first introduce these changes, and then perform a new analysis of the logs, looking for protocol violation false positives and potential connection stability issues. These changes will make the picture much clearer.

@joske
Contributor

joske commented Apr 10, 2025

@ljedrz @niklaslong Which logs are you talking about? Were you able to reproduce the issue yourself?

@ljedrz
Collaborator Author

ljedrz commented Apr 10, 2025

@joske I was analyzing the logs of one of the Canarynet clients before and after these changes.

@joske
Contributor

joske commented Apr 10, 2025

Could you share those logs?

@vicsn
Collaborator

vicsn commented Apr 11, 2025

> Could you share those logs?

I recall very often seeing the errors Lukasz mentioned. I suggest you just run your own local canary client, as he suggests; if you don't pass any peers, you should connect to bootstrap nodes, which will connect you to others.

Contributor

@kaimast kaimast left a comment

Left a few comments. Sorry if they are too nitpicky...

Could you resolve the conflicts with staging, as well? Hopefully we can get this merged in the coming days.

ljedrz added 12 commits July 1, 2025 09:23
Signed-off-by: ljedrz <ljedrz@users.noreply.github.com>
@ljedrz ljedrz force-pushed the fix/hardened_resolver branch from b673d7b to 67b9d81 July 1, 2025 08:27
@ljedrz ljedrz requested review from bendyarm and kaimast July 1, 2025 08:29
@ljedrz
Collaborator Author

ljedrz commented Jul 1, 2025

Rebased (there were no changes to the original commits), and applied the review comments.

@ljedrz
Collaborator Author

ljedrz commented Jul 1, 2025

The devnet CI job has failed, but I can't reproduce it locally; can't restart it either.

Update: it was a deadlock which I avoided locally due to starting the nodes less quickly. Fixed.

@ljedrz ljedrz marked this pull request as ready for review July 1, 2025 09:12
Contributor

@kaimast kaimast left a comment

LGTM!

@vicsn vicsn merged commit 3c4a705 into ProvableHQ:staging Jul 7, 2025
2 checks passed

8 participants