subservers: fail litd startup on critical sub-server error #1269

Antisys wants to merge 3 commits into lightninglabs:master
Conversation
The mock FetchConfig implementations were setting fields (Manager, ChainNotifier, Chain, Signer, FeeEstimator, Wallet, KeyRing, Sweeper) that only exist when the corresponding RPC build tags are active. Without those tags, the Config structs are empty, causing compilation failures on plain `go build ./...`. Remove the field initializers since CreateSubServer only needs a non-nil Config to extract permissions — the field values are unused. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
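The build-tag issue fixed by this commit can be illustrated with a minimal sketch. The tag name, file names, and field shown below are assumptions for illustration, not litd's actual layout; the point is that a struct field which only exists in tagged builds cannot be referenced from code compiled without the tag:

```go
// config_dev.go — compiled only when the walletrpc tag is active:
//
//	//go:build walletrpc
//	type Config struct {
//		Wallet *walletrpc.Config
//	}
//
// config.go — compiled without the tag:
//
//	//go:build !walletrpc
//	type Config struct{}
//
// A mock that writes Config{Wallet: ...} therefore fails to compile on a
// plain `go build ./...`. Passing an empty, non-nil Config is enough for
// permission extraction, so dropping the field initializers fixes the build.
```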
When taproot-assets-mode is not disabled, lnd calls into tapd at runtime. If tapd fails to start, litd previously continued in a broken state where those calls would silently fail. Mark tapd as a critical sub-server so that its startup failure is fatal to litd. The implementation uses a two-pass approach: all sub-servers are attempted regardless of individual failures, then an error is returned if any critical server failed. This avoids non-determinism from Go's random map iteration order. Fixes lightninglabs#1181. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request improves the robustness of the litd startup sequence by introducing a critical sub-server failure policy. By allowing specific sub-servers to be designated as essential, the system can now gracefully halt startup if a critical dependency fails, rather than proceeding in a partially broken state. The changes also include a more deterministic startup flow and cleanup of mock configurations to resolve build issues.
Code Review
This pull request introduces a mechanism to designate specific sub-servers as critical, ensuring that their failure to start or connect results in a fatal error for the main process. The sub-server manager was updated to track these critical components and return errors from its startup and connection methods, while still attempting to initialize all other servers. Additionally, the PR simplifies mock configurations in the perms package and includes comprehensive unit tests for the new logic. Feedback was provided to address non-deterministic error reporting; currently, if multiple critical sub-servers fail, the returned error depends on the random iteration order of the internal map. It is recommended to capture only the first encountered error to ensure consistent behavior.
```go
if s.critical[ss.Name()] {
	criticalErr = fmt.Errorf("critical "+
		"sub-server %s failed to "+
		"start: %w", ss.Name(), err)
}
```
The current implementation overwrites criticalErr on each new critical failure. This means if multiple critical sub-servers fail, only the error from the last one encountered will be returned. Due to Go's random map iteration, this makes the returned error non-deterministic, which contradicts one of the goals of this change as stated in the pull request description.
To ensure deterministic error reporting, you should only capture the first critical error.
Suggested change:

```diff
-if s.critical[ss.Name()] {
+if s.critical[ss.Name()] && criticalErr == nil {
 	criticalErr = fmt.Errorf("critical "+
 		"sub-server %s failed to "+
 		"start: %w", ss.Name(), err)
 }
```
```go
if s.critical[ss.Name()] {
	criticalErr = fmt.Errorf("critical "+
		"sub-server %s failed to "+
		"connect: %w", ss.Name(), err)
}
```
Similar to StartIntegratedServers, this implementation overwrites criticalErr on each new critical failure. This makes the returned error non-deterministic if multiple critical sub-servers fail to connect, which goes against the goal of avoiding non-determinism from map iteration.
To ensure deterministic error reporting, you should only capture the first critical error.
Suggested change:

```diff
-if s.critical[ss.Name()] {
+if s.critical[ss.Name()] && criticalErr == nil {
 	criticalErr = fmt.Errorf("critical "+
 		"sub-server %s failed to "+
 		"connect: %w", ss.Name(), err)
 }
```
If multiple critical sub-servers fail, the returned error should not depend on Go's random map iteration order. Capture only the first critical error encountered. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/gemini review
Code Review
This pull request introduces a mechanism to designate specific sub-servers as critical within the subservers.Manager, ensuring that their failure to start or connect is treated as a fatal error. The Manager now tracks these servers and returns errors during initialization, with the Taproot Assets sub-server specifically marked as critical when enabled. These changes include simplified mock configurations and new unit tests to verify the fatal error logic. I have no feedback to provide.
Hi @Antisys, thank you for your PR contribution! Unfortunately, this issue already has a linked PR which is actively being worked on.
Thanks for the heads up @ViktorT-11 — makes sense to consolidate. I'll follow #1183 and contribute there if I have anything useful to add. Closing this. |
Thanks a lot @Antisys! |
Summary
Closes #1181.
- Mark `tapd` as a critical sub-server when `taproot-assets-mode` is not `disable`, so litd fails to start if tapd fails to start (preventing a broken state where lnd calls into a non-running tapd)
- `StartIntegratedServers`/`ConnectRemoteSubServers`: attempt all sub-servers first, then return an error if any critical one failed — avoids non-determinism from Go's random map iteration order
- Fix builds in `perms/mock.go` and `perms/mock_dev.go`, where Config struct fields were referenced without the required RPC build tags

Changes
- `subservers/manager.go`: add `critical` set + `SetCritical()`, return errors from `StartIntegratedServers`/`ConnectRemoteSubServers` for critical servers
- `terminal.go`
- `subservers/manager_test.go`
- `perms/mock.go`, `perms/mock_dev.go`

Test plan
- `go build $(go list ./... | grep -v /itest)` — clean
- `go test $(go list ./... | grep -v /itest)` — all pass
- `go test -race ./subservers/...` — no races
- Manual: `taproot-assets-mode=integrated` and a tapd config that will fail → litd should exit with an error
- Manual: `taproot-assets-mode=disable` and a broken tapd config → litd should start normally

🤖 Generated with Claude Code
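The `critical` set + `SetCritical()` bookkeeping listed under Changes can be sketched as a minimal stand-in. The `Manager` shown here is an illustration, not litd's full `subservers.Manager`:

```go
package main

// Manager sketches the critical-server bookkeeping: a set of sub-server
// names whose startup or connection failure should be fatal to litd.
type Manager struct {
	critical map[string]bool
}

// NewManager returns a Manager with an empty critical set.
func NewManager() *Manager {
	return &Manager{critical: make(map[string]bool)}
}

// SetCritical marks the named sub-server so that its failure to start or
// connect aborts litd startup.
func (m *Manager) SetCritical(name string) {
	m.critical[name] = true
}

// IsCritical reports whether the named sub-server has been marked critical.
func (m *Manager) IsCritical(name string) bool {
	return m.critical[name]
}
```

In this design, `terminal.go` would call `SetCritical` for tapd whenever `taproot-assets-mode` is not `disable`, before the startup loop runs.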