subservers: fail litd startup on critical sub-server error #1269

Closed
Antisys wants to merge 3 commits into lightninglabs:master from Antisys:feat/fail-litd-on-critical-subserver-error

Antisys commented Mar 28, 2026

Summary

Closes #1181.

  • Mark tapd as a critical sub-server whenever taproot-assets-mode is not set to disable, so that litd fails to start if tapd fails to start (preventing a broken state where lnd calls into a non-running tapd)
  • Use a two-pass approach in StartIntegratedServers/ConnectRemoteSubServers: attempt all sub-servers first, then return an error if any critical one failed. This avoids non-determinism from Go's random map iteration order
  • Fix a pre-existing build failure in perms/mock.go and perms/mock_dev.go, where Config struct fields were referenced without the required RPC build tags

Changes

| File | What |
| --- | --- |
| subservers/manager.go | Add critical set + SetCritical(), return errors from StartIntegratedServers/ConnectRemoteSubServers for critical servers |
| terminal.go | Mark TAP as critical when enabled, handle new error returns |
| subservers/manager_test.go | 5 unit tests (critical failure, non-critical failure, no critical, disabled critical, status verification) |
| perms/mock.go, perms/mock_dev.go | Remove field initializers for empty (untagged) Config structs |

Test plan

  • go build $(go list ./... | grep -v /itest) — clean
  • go test $(go list ./... | grep -v /itest) — all pass
  • go test -race ./subservers/... — no races
  • Integration test: start litd with taproot-assets-mode=integrated and a tapd config that will fail → litd should exit with error
  • Integration test: start litd with taproot-assets-mode=disable and a broken tapd config → litd should start normally

🤖 Generated with Claude Code

Antisys and others added 2 commits March 28, 2026 18:20
The mock FetchConfig implementations were setting fields (Manager,
ChainNotifier, Chain, Signer, FeeEstimator, Wallet, KeyRing, Sweeper)
that only exist when the corresponding RPC build tags are active.
Without those tags, the Config structs are empty, causing compilation
failures on plain `go build ./...`.

Remove the field initializers since CreateSubServer only needs a
non-nil Config to extract permissions — the field values are unused.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
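
To illustrate the build-tag problem described in the commit message above, the following sketch shows why an empty, non-nil Config is enough when the caller only reads permissions. Everything here is a hypothetical simplification: `Config`, `fetchConfig` and `createSubServer` are stand-ins written for this example, not the lnd or litd definitions, and the permission URI is only illustrative.

```go
package main

import (
	"errors"
	"fmt"
)

// Config mimics a sub-server Config as it looks when its RPC build tag is
// absent: a struct without fields. Writing &Config{Signer: s} against this
// definition would fail to compile, which is the kind of failure the
// commit message describes.
type Config struct{}

// fetchConfig is a hypothetical mock that returns a non-nil, empty Config
// and therefore compiles with or without the build tag.
func fetchConfig() *Config {
	return &Config{}
}

// createSubServer is a hypothetical stand-in: it only requires a non-nil
// Config and never reads its fields before returning the permissions.
func createSubServer(cfg *Config) ([]string, error) {
	if cfg == nil {
		return nil, errors.New("config must not be nil")
	}

	// Illustrative permission URI, assumed for this example.
	return []string{"/signrpc.Signer/SignMessage"}, nil
}

func main() {
	perms, err := createSubServer(fetchConfig())
	if err != nil {
		fmt.Println("unexpected error:", err)
		return
	}
	fmt.Println("permissions:", perms)
}
```
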
When taproot-assets-mode is not disabled, lnd calls into tapd at
runtime. If tapd fails to start, litd previously continued in a
broken state where those calls would silently fail.

Mark tapd as a critical sub-server so that its startup failure is
fatal to litd. The implementation uses a two-pass approach: all
sub-servers are attempted regardless of individual failures, then
an error is returned if any critical server failed. This avoids
non-determinism from Go's random map iteration order.

Fixes lightninglabs#1181.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request improves the robustness of the litd startup sequence by introducing a critical sub-server failure policy. By allowing specific sub-servers to be designated as essential, the system can now gracefully halt startup if a critical dependency fails, rather than proceeding in a partially broken state. The changes also include a more deterministic startup flow and necessary cleanup of mock configurations to resolve build issues.

Highlights

  • Critical Sub-server Enforcement: Introduced a mechanism to mark specific sub-servers as 'critical', ensuring that if they fail to start or connect, the entire litd startup process is aborted.
  • TAP Integration: Configured the tapd sub-server as critical whenever taproot-assets-mode is enabled, preventing litd from running in a broken state where lnd expects a non-running tapd.
  • Deterministic Startup: Updated sub-server startup logic to use a two-pass approach, attempting all sub-servers first before returning errors, which avoids non-deterministic behavior caused by Go's map iteration order.
  • Build Fixes: Resolved build failures in perms/mock.go and perms/mock_dev.go by removing references to untagged Config struct fields.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a mechanism to designate specific sub-servers as critical, ensuring that their failure to start or connect results in a fatal error for the main process. The sub-server manager was updated to track these critical components and return errors from its startup and connection methods, while still attempting to initialize all other servers. Additionally, the PR simplifies mock configurations in the perms package and includes comprehensive unit tests for the new logic. Feedback was provided to address non-deterministic error reporting; currently, if multiple critical sub-servers fail, the returned error depends on the random iteration order of the internal map. It is recommended to capture only the first encountered error to ensure consistent behavior.

Comment thread subservers/manager.go Outdated
Comment on lines +153 to +157
    if s.critical[ss.Name()] {
        criticalErr = fmt.Errorf("critical "+
            "sub-server %s failed to "+
            "start: %w", ss.Name(), err)
    }

medium

The current implementation overwrites criticalErr on each new critical failure. This means if multiple critical sub-servers fail, only the error from the last one encountered will be returned. Due to Go's random map iteration, this makes the returned error non-deterministic, which contradicts one of the goals of this change as stated in the pull request description.

To ensure deterministic error reporting, you should only capture the first critical error.

Suggested change
-    if s.critical[ss.Name()] {
-        criticalErr = fmt.Errorf("critical "+
-            "sub-server %s failed to "+
-            "start: %w", ss.Name(), err)
-    }
+    if s.critical[ss.Name()] && criticalErr == nil {
+        criticalErr = fmt.Errorf("critical "+
+            "sub-server %s failed to "+
+            "start: %w", ss.Name(), err)
+    }

Comment thread subservers/manager.go Outdated
Comment on lines +187 to +191
    if s.critical[ss.Name()] {
        criticalErr = fmt.Errorf("critical "+
            "sub-server %s failed to "+
            "connect: %w", ss.Name(), err)
    }

medium

Similar to StartIntegratedServers, this implementation overwrites criticalErr on each new critical failure. This makes the returned error non-deterministic if multiple critical sub-servers fail to connect, which goes against the goal of avoiding non-determinism from map iteration.

To ensure deterministic error reporting, you should only capture the first critical error.

Suggested change
-    if s.critical[ss.Name()] {
-        criticalErr = fmt.Errorf("critical "+
-            "sub-server %s failed to "+
-            "connect: %w", ss.Name(), err)
-    }
+    if s.critical[ss.Name()] && criticalErr == nil {
+        criticalErr = fmt.Errorf("critical "+
+            "sub-server %s failed to "+
+            "connect: %w", ss.Name(), err)
+    }

If multiple critical sub-servers fail, the returned error should not
depend on Go's random map iteration order. Capture only the first
critical error encountered.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Antisys (Author) commented Mar 28, 2026

/gemini review

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a mechanism to designate specific sub-servers as critical within the subservers.Manager, ensuring that their failure to start or connect is treated as a fatal error. The Manager now tracks these servers and returns errors during initialization, with the Taproot Assets sub-server specifically marked as critical when enabled. These changes include simplified mock configurations and new unit tests to verify the fatal error logic. I have no feedback to provide.

ViktorT-11 (Contributor) commented Mar 30, 2026

Hi @Antisys,

Thank you for your contribution. Unfortunately, this issue already has a linked PR that is actively being worked on: #1183. This PR is therefore redundant, and I have to recommend that you close it and contribute to the already-open PR instead. You can contribute there by adding a code review or by expressing your need for this functionality. Let me know if there's anything I can clarify further!

Antisys (Author) commented Mar 30, 2026

Thanks for the heads up @ViktorT-11 — makes sense to consolidate. I'll follow #1183 and contribute there if I have anything useful to add. Closing this.

@Antisys Antisys closed this Mar 30, 2026

ViktorT-11 (Contributor)

Thanks a lot @Antisys!

Development

Successfully merging this pull request may close these issues.

Fail litd startup on tapd startup error when taproot-assets-mode=enable