Skip to content

Design Proposal Fixing Stuck Payments#10137

Closed
ziggie1984 wants to merge 2 commits intolightningnetwork:masterfrom
ziggie1984:stuck-payments-fix
Closed

Design Proposal Fixing Stuck Payments#10137
ziggie1984 wants to merge 2 commits intolightningnetwork:masterfrom
ziggie1984:stuck-payments-fix

Conversation

@ziggie1984
Copy link
Collaborator

@ziggie1984 ziggie1984 commented Aug 7, 2025

In the design document I outlined how I want to fix the problem of stuck payments by introducing a new subsystem called AttemptHandler for more details checkout the design doc.

Waiting for Concept ACKs before starting the implementation.

cc @Roasbeef, @yyforyongyu

Also see the last commit for a temporary fix solution of the problem we encountered in the issue. I think it is acceptable to go with a temp fix and in the long run introduce an independant AttemptHandler for the attempts of the payments.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Summary of Changes

Hello @ziggie1984, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a comprehensive design proposal aimed at resolving the critical issue of stuck in-flight payments within the lnd system. It outlines a new architectural component, the AttemptHandler service, which will independently and reliably manage the resolution of HTLC attempts, thereby enhancing the overall robustness and user experience of the payment system.

Highlights

  • Problem Identification: This design proposal identifies a critical flaw in the current lnd payment architecture where payments can become permanently stuck in the INFLIGHT state. The root cause is attributed to the tight coupling and premature termination of resultCollector goroutines when the paymentLifecycle exits.
  • Proposed Solution: AttemptHandler Service: A new, long-lived AttemptHandler service is proposed to decouple HTLC attempt resolution from the payment lifecycle. This service will be responsible for the guaranteed, independent, and reliable resolution of all in-flight HTLC attempts, ensuring results are processed even if the original payment lifecycle terminates.
  • Robust Concurrency Management: The design details a robust concurrency and memory management strategy for the AttemptHandler using a buffered channel as a semaphore. This mechanism provides backpressure, preventing unbounded resource consumption and ensuring the system can safely handle a high volume of in-flight attempts without premature abandonment.
  • Architectural Advantages: The proposal highlights several advantages, including improved correctness by solving the stuck payment problem, enhanced robustness through backpressure, better architectural integrity via clear separation of concerns, and increased testability and maintainability of the payment system.
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.

Feature Command Description
Code Review /gemini review Performs a code review for the current pull request in its current state.
Pull Request Summary /gemini summary Provides a summary of the current pull request in its current state.
Comment @gemini-code-assist Responds in comments when explicitly tagged, both in issue comments and review comments.
Help /gemini help Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a design proposal for an AttemptHandler service to address the issue of stuck in-flight payments. The proposal is well-structured, clearly identifying the problem and presenting a robust, decoupled architecture as a solution. The use of a semaphore for backpressure and concurrency control is a solid design choice.

I've added a few comments to the design document to suggest minor improvements and clarifications, such as correcting a type name, proposing a simplification of the AttemptInfo struct by leveraging an existing TODO in the codebase, and aligning a code snippet with the project's style guide. Overall, this is an excellent proposal that promises to significantly enhance the reliability of the payment system.

// AttemptHandler is a router-level service responsible for the guaranteed,
// independent, and reliable resolution of all in-flight HTLC attempts.
type AttemptHandler struct {
cfg *routing.Router

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The type *routing.Router for the cfg field appears to be a typo. In the lnd codebase, routing is a package name, and the main router struct is ChannelRouter. For the AttemptHandler to fulfill its responsibilities as described in the design (e.g., interacting with ControlTower and MissionControl), it would need access to the router's configuration, which is typically done via an instance of *routing.ChannelRouter.

Suggested change
cfg *routing.Router
cfg *routing.ChannelRouter

Comment on lines +92 to +98
type AttemptInfo struct {
AttemptID uint64
PaymentHash lntypes.Hash
Route route.Route
errorDecryptor htlcswitch.ErrorDecryptor
NotifyChannel chan<- *htlcswitch.PaymentResult // Optional
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The proposed AttemptInfo struct includes Route and errorDecryptor fields, which implies that the paymentLifecycle is still responsible for creating the ErrorDecryptor. The existing codebase contains TODO comments suggesting that this responsibility should be moved into the htlcswitch to be handled on-demand for failed HTLCs. This would be more efficient (avoids creating decryptors for successful HTLCs) and would simplify the interface.

This design proposal is a great opportunity to incorporate this improvement. Consider simplifying AttemptInfo and modifying htlcswitch.GetAttemptResult to not require an ErrorDecryptor. The switch can construct it internally when needed, as it has access to the database to retrieve the route.

A simplified AttemptInfo could look like:

type AttemptInfo struct {
	AttemptID      uint64
	PaymentHash    lntypes.Hash
	NotifyChannel  chan<- *htlcswitch.PaymentResult // Optional
}

This would make the new AttemptHandler service even cleaner and more decoupled.

Comment on lines +106 to +108
// processAttempt is the main goroutine for handling a single attempt. It waits
// indefinitely for a result from the switch.
func (h *AttemptHandler) processAttempt(attempt *AttemptInfo) {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The comment for processAttempt doesn't fully align with the project's style guide.1

  • Function comments should begin with the function name.
  • The current comment, "is the main goroutine...", is passive. It should actively describe what the function does.

An improved comment would be more direct and adhere to the naming convention.

Style Guide References

Suggested change
// processAttempt is the main goroutine for handling a single attempt. It waits
// indefinitely for a result from the switch.
func (h *AttemptHandler) processAttempt(attempt *AttemptInfo) {
// processAttempt processes a single HTLC attempt and waits indefinitely for a
// result from the switch.
func (h *AttemptHandler) processAttempt(attempt *AttemptInfo) {

Footnotes

  1. Function comments must begin with the function name. (link)

@ziggie1984
Copy link
Collaborator Author

@gemini-code-assist review

Copy link

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a design document for a long-term fix to stuck payments and a temporary code fix. The design for the AttemptHandler is well-reasoned and addresses the core issue of coupled payment lifecycles. The temporary fix in payment_lifecycle.go correctly uses a sync.WaitGroup to prevent the lifecycle from exiting while HTLCs are still in flight, which should resolve the immediate problem. My review includes a minor suggestion on the design document to improve decoupling and feedback on the temporary fix to align it with the project's documentation and style guide standards.

Comment on lines +83 to +88
type AttemptHandler struct {
cfg *routing.Router
concurrencyLimiter chan struct{} // Used as a semaphore
wg sync.WaitGroup
quit chan struct{}
}

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The design proposal is excellent and clearly articulates the problem and a robust solution. One minor point for consideration is the AttemptHandler's dependency on the entire *routing.Router struct. To further improve decoupling and testability, as highlighted in the advantages section, it might be beneficial to define a more granular interface that provides only the methods the AttemptHandler requires from the router. This would adhere to the interface segregation principle and make the dependencies of the new service more explicit.

Comment on lines +94 to +96
// resultCollectorLoop is a dedicated goroutine that handles HTLC attempt
// results. It continues running even after the payment lifecycle exits to
// ensure no results are lost.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

The function comment is confusing. It refers to the function as resultCollectorLoop and describes it as a long-running goroutine that continues after the lifecycle exits. This seems to describe the proposed AttemptHandler service rather than the actual waitForOutstandingResults function, which is a blocking call that drains results before the lifecycle exits. The comment should be updated to accurately reflect the function's behavior in this temporary fix.1

// waitForOutstandingResults blocks until all outstanding HTLCs have been
// resolved. This is a temporary fix to prevent the payment lifecycle from
// exiting while HTLCs are still in-flight, which could lead to stuck
// payments.

Style Guide References

Footnotes

  1. Function comments must begin with the function name and accurately describe its purpose and assumptions. (link)

@ziggie1984 ziggie1984 force-pushed the stuck-payments-fix branch 2 times, most recently from c296daa to 0254a3d Compare August 7, 2025 13:33

The root cause is the **tight coupling** between the `paymentLifecycle` (the process that sends a payment) and the `resultCollector` (the goroutine that waits for the payment's result). The `resultCollector`'s lifetime is tied directly to the `paymentLifecycle`.

When the `paymentLifecycle` exits for any reason other than definitive success or failure—such as a context timeout, user cancellation, or a route-finding error—it closes a `quit` channel. This action immediately terminates all of its associated `resultCollector` goroutines.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this analysis is correct - if the payment has pending htlcs and timed out, the payment is still considered as inflight and will wait for attempt results? Think it's handled in decideNextStep.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it is more detailed than that, the problem is the following:

In case of the issue, we fail the payment in the context check, but we still have the non-failed payment as a reference because we load the payment before checking the context. Now there are conditions that in decideNextStep we will continue launching a payment, but when registering the attempt we fetch the new payment see that we cannot register a new attempt becauase the payment is failed and we hit the case: exitWithError

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we really want to make "sending a payment" non-blocking, I think there is no other way than separating the attempt collection, otherwise we will not be able to exit a payment after the timeout hit.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok I see what's going here - the checkContext will change the db payment - in that case why don't we just reload the payment afterwards, so the first step is to check context, then reload the payment.

I think the new design means we now have a second loop to handle the payment state, and I think it's better to have a single source to handle it. We shouldn't have any remaining attempts out there.

This commit addresses a bug that caused payments to get "stuck" in
a pending state because the main payment process could terminate
before all asynchronous payment attempts (HTLCs) reported their
results. The fix uses a sync.WaitGroup to ensure the payment
lifecycle waits for every result to be collected, guaranteeing a
final payment state.

However, this is a temporary solution because it can block
waiting for results even after the payment timeout is reached.
This side effect is currently considered acceptable because
sendAttempt is only used in contexts where this new blocking
behavior is not an issue: either with no timeout specified, or
in a fully asynchronous manner where the caller doesn't wait for
the result anyway.
@ziggie1984
Copy link
Collaborator Author

Temporary fix in #10141, before we consider a bigger change.

@ziggie1984
Copy link
Collaborator Author

Closing this one in favour of #10141

@ziggie1984 ziggie1984 closed this Aug 7, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants