Design Proposal Fixing Stuck Payments by ziggie1984 · Pull Request #10137 · lightningnetwork/lnd

ziggie1984 · 2025-08-07T09:11:32Z

In the design document I outlined how I want to fix the problem of stuck payments by introducing a new subsystem called AttemptHandler for more details checkout the design doc.

Waiting for Concept ACKs before starting the implementation.

cc @Roasbeef, @yyforyongyu

Also see the last commit for a temporary fix solution of the problem we encountered in the issue. I think it is acceptable to go with a temp fix and in the long run introduce an independant AttemptHandler for the attempts of the payments.

gemini-code-assist

Summary of Changes

Hello @ziggie1984, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive design proposal aimed at resolving the critical issue of stuck in-flight payments within the lnd system. It outlines a new architectural component, the AttemptHandler service, which will independently and reliably manage the resolution of HTLC attempts, thereby enhancing the overall robustness and user experience of the payment system.

Highlights

Problem Identification: This design proposal identifies a critical flaw in the current lnd payment architecture where payments can become permanently stuck in the INFLIGHT state. The root cause is attributed to the tight coupling and premature termination of resultCollector goroutines when the paymentLifecycle exits.
Proposed Solution: AttemptHandler Service: A new, long-lived AttemptHandler service is proposed to decouple HTLC attempt resolution from the payment lifecycle. This service will be responsible for the guaranteed, independent, and reliable resolution of all in-flight HTLC attempts, ensuring results are processed even if the original payment lifecycle terminates.
Robust Concurrency Management: The design details a robust concurrency and memory management strategy for the AttemptHandler using a buffered channel as a semaphore. This mechanism provides backpressure, preventing unbounded resource consumption and ensuring the system can safely handle a high volume of in-flight attempts without premature abandonment.
Architectural Advantages: The proposal highlights several advantages, including improved correctness by solving the stuck payment problem, enhanced robustness through backpressure, better architectural integrity via clear separation of concerns, and increased testability and maintainability of the payment system.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in issue comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a design proposal for an AttemptHandler service to address the issue of stuck in-flight payments. The proposal is well-structured, clearly identifying the problem and presenting a robust, decoupled architecture as a solution. The use of a semaphore for backpressure and concurrency control is a solid design choice.

I've added a few comments to the design document to suggest minor improvements and clarifications, such as correcting a type name, proposing a simplification of the AttemptInfo struct by leveraging an existing TODO in the codebase, and aligning a code snippet with the project's style guide. Overall, this is an excellent proposal that promises to significantly enhance the reliability of the payment system.

gemini-code-assist · 2025-08-07T09:12:49Z

attempt_handler_design.md

+// AttemptHandler is a router-level service responsible for the guaranteed,
+// independent, and reliable resolution of all in-flight HTLC attempts.
+type AttemptHandler struct {
+	cfg                *routing.Router


The type *routing.Router for the cfg field appears to be a typo. In the lnd codebase, routing is a package name, and the main router struct is ChannelRouter. For the AttemptHandler to fulfill its responsibilities as described in the design (e.g., interacting with ControlTower and MissionControl), it would need access to the router's configuration, which is typically done via an instance of *routing.ChannelRouter.

Suggested change

cfg *routing.Router

cfg *routing.ChannelRouter

gemini-code-assist · 2025-08-07T09:12:49Z

attempt_handler_design.md

+type AttemptInfo struct {
+	AttemptID      uint64
+	PaymentHash    lntypes.Hash
+	Route          route.Route
+	errorDecryptor htlcswitch.ErrorDecryptor
+	NotifyChannel  chan<- *htlcswitch.PaymentResult // Optional
+}


The proposed AttemptInfo struct includes Route and errorDecryptor fields, which implies that the paymentLifecycle is still responsible for creating the ErrorDecryptor. The existing codebase contains TODO comments suggesting that this responsibility should be moved into the htlcswitch to be handled on-demand for failed HTLCs. This would be more efficient (avoids creating decryptors for successful HTLCs) and would simplify the interface.

This design proposal is a great opportunity to incorporate this improvement. Consider simplifying AttemptInfo and modifying htlcswitch.GetAttemptResult to not require an ErrorDecryptor. The switch can construct it internally when needed, as it has access to the database to retrieve the route.

A simplified AttemptInfo could look like:

type AttemptInfo struct { AttemptID uint64 PaymentHash lntypes.Hash NotifyChannel chan<- *htlcswitch.PaymentResult // Optional }

This would make the new AttemptHandler service even cleaner and more decoupled.

gemini-code-assist · 2025-08-07T09:12:49Z

attempt_handler_design.md

+// processAttempt is the main goroutine for handling a single attempt. It waits
+// indefinitely for a result from the switch.
+func (h *AttemptHandler) processAttempt(attempt *AttemptInfo) {


The comment for processAttempt doesn't fully align with the project's style guide.¹

Function comments should begin with the function name.

The current comment, "is the main goroutine...", is passive. It should actively describe what the function does.

An improved comment would be more direct and adhere to the naming convention.

Style Guide References

Suggested change

// processAttempt is the main goroutine for handling a single attempt. It waits

// indefinitely for a result from the switch.

func (h *AttemptHandler) processAttempt(attempt *AttemptInfo) {

// processAttempt processes a single HTLC attempt and waits indefinitely for a

// result from the switch.

func (h *AttemptHandler) processAttempt(attempt *AttemptInfo) {

Footnotes

Function comments must begin with the function name. (link) ↩

ziggie1984 · 2025-08-07T13:26:12Z

@gemini-code-assist review

gemini-code-assist

Code Review

This pull request introduces a design document for a long-term fix to stuck payments and a temporary code fix. The design for the AttemptHandler is well-reasoned and addresses the core issue of coupled payment lifecycles. The temporary fix in payment_lifecycle.go correctly uses a sync.WaitGroup to prevent the lifecycle from exiting while HTLCs are still in flight, which should resolve the immediate problem. My review includes a minor suggestion on the design document to improve decoupling and feedback on the temporary fix to align it with the project's documentation and style guide standards.

gemini-code-assist · 2025-08-07T13:29:58Z

attempt_handler_design.md

+type AttemptHandler struct {
+	cfg                *routing.Router
+	concurrencyLimiter chan struct{} // Used as a semaphore
+	wg                 sync.WaitGroup
+	quit               chan struct{}
+}


The design proposal is excellent and clearly articulates the problem and a robust solution. One minor point for consideration is the AttemptHandler's dependency on the entire *routing.Router struct. To further improve decoupling and testability, as highlighted in the advantages section, it might be beneficial to define a more granular interface that provides only the methods the AttemptHandler requires from the router. This would adhere to the interface segregation principle and make the dependencies of the new service more explicit.

gemini-code-assist · 2025-08-07T13:29:59Z

routing/payment_lifecycle.go

+// resultCollectorLoop is a dedicated goroutine that handles HTLC attempt
+// results. It continues running even after the payment lifecycle exits to
+// ensure no results are lost.


The function comment is confusing. It refers to the function as resultCollectorLoop and describes it as a long-running goroutine that continues after the lifecycle exits. This seems to describe the proposed AttemptHandler service rather than the actual waitForOutstandingResults function, which is a blocking call that drains results before the lifecycle exits. The comment should be updated to accurately reflect the function's behavior in this temporary fix.¹

// waitForOutstandingResults blocks until all outstanding HTLCs have been // resolved. This is a temporary fix to prevent the payment lifecycle from // exiting while HTLCs are still in-flight, which could lead to stuck // payments.

Style Guide References

Footnotes

Function comments must begin with the function name and accurately describe its purpose and assumptions. (link) ↩

yyforyongyu · 2025-08-07T13:35:18Z

attempt_handler_design.md

+
+The root cause is the **tight coupling** between the `paymentLifecycle` (the process that sends a payment) and the `resultCollector` (the goroutine that waits for the payment's result). The `resultCollector`'s lifetime is tied directly to the `paymentLifecycle`.
+
+When the `paymentLifecycle` exits for any reason other than definitive success or failure—such as a context timeout, user cancellation, or a route-finding error—it closes a `quit` channel. This action immediately terminates all of its associated `resultCollector` goroutines.


I'm not sure this analysis is correct - if the payment has pending htlcs and timed out, the payment is still considered as inflight and will wait for attempt results? Think it's handled in decideNextStep.

Yeah it is more detailed than that, the problem is the following:

In case of the issue, we fail the payment in the context check, but we still have the non-failed payment as a reference because we load the payment before checking the context. Now there are conditions that in decideNextStep we will continue launching a payment, but when registering the attempt we fetch the new payment see that we cannot register a new attempt becauase the payment is failed and we hit the case: exitWithError

I think if we really want to make "sending a payment" non-blocking, I think there is no other way than separating the attempt collection, otherwise we will not be able to exit a payment after the timeout hit.

ok I see what's going here - the checkContext will change the db payment - in that case why don't we just reload the payment afterwards, so the first step is to check context, then reload the payment.

I think the new design means we now have a second loop to handle the payment state, and I think it's better to have a single source to handle it. We shouldn't have any remaining attempts out there.

This commit addresses a bug that caused payments to get "stuck" in a pending state because the main payment process could terminate before all asynchronous payment attempts (HTLCs) reported their results. The fix uses a sync.WaitGroup to ensure the payment lifecycle waits for every result to be collected, guaranteeing a final payment state. However, this is a temporary solution because it can block waiting for results even after the payment timeout is reached. This side effect is currently considered acceptable because sendAttempt is only used in contexts where this new blocking behavior is not an issue: either with no timeout specified, or in a fully asynchronous manner where the caller doesn't wait for the result anyway.

ziggie1984 · 2025-08-07T14:39:34Z

Temporary fix in #10141, before we consider a bigger change.

ziggie1984 · 2025-08-07T14:44:28Z

Closing this one in favour of #10141

design proposal fixing Stuck Payments

75b3561

gemini-code-assist bot reviewed Aug 7, 2025

View reviewed changes

ziggie1984 force-pushed the stuck-payments-fix branch 2 times, most recently from c296daa to 0254a3d Compare August 7, 2025 13:33

yyforyongyu reviewed Aug 7, 2025

View reviewed changes

ziggie1984 force-pushed the stuck-payments-fix branch from 0254a3d to 8437f5b Compare August 7, 2025 13:49

saubyk assigned ziggie1984 Aug 7, 2025

ziggie1984 closed this Aug 7, 2025


		The root cause is the tight coupling between the `paymentLifecycle` (the process that sends a payment) and the `resultCollector` (the goroutine that waits for the payment's result). The `resultCollector`'s lifetime is tied directly to the `paymentLifecycle`.

		When the `paymentLifecycle` exits for any reason other than definitive success or failure—such as a context timeout, user cancellation, or a route-finding error—it closes a `quit` channel. This action immediately terminates all of its associated `resultCollector` goroutines.

Conversation

ziggie1984 commented Aug 7, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Aug 7, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Aug 7, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Aug 7, 2025

Choose a reason for hiding this comment

Style Guide References

Footnotes

Uh oh!

ziggie1984 commented Aug 7, 2025

Uh oh!

gemini-code-assist bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist bot Aug 7, 2025

Choose a reason for hiding this comment

Uh oh!

gemini-code-assist bot Aug 7, 2025

Choose a reason for hiding this comment

Style Guide References

Footnotes

Uh oh!

yyforyongyu Aug 7, 2025

Choose a reason for hiding this comment

Uh oh!

ziggie1984 Aug 7, 2025

Choose a reason for hiding this comment

Uh oh!

ziggie1984 Aug 7, 2025

Choose a reason for hiding this comment

Uh oh!

yyforyongyu Aug 7, 2025

Choose a reason for hiding this comment

Uh oh!

ziggie1984 commented Aug 7, 2025

Uh oh!

ziggie1984 commented Aug 7, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

ziggie1984 commented Aug 7, 2025 •

edited

Loading