Design Proposal Fixing Stuck Payments#10137
Design Proposal Fixing Stuck Payments#10137ziggie1984 wants to merge 2 commits intolightningnetwork:masterfrom
Conversation
There was a problem hiding this comment.
Summary of Changes
Hello @ziggie1984, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a comprehensive design proposal aimed at resolving the critical issue of stuck in-flight payments within the lnd system. It outlines a new architectural component, the AttemptHandler service, which will independently and reliably manage the resolution of HTLC attempts, thereby enhancing the overall robustness and user experience of the payment system.
Highlights
- Problem Identification: This design proposal identifies a critical flaw in the current
lndpayment architecture where payments can become permanently stuck in theINFLIGHTstate. The root cause is attributed to the tight coupling and premature termination ofresultCollectorgoroutines when thepaymentLifecycleexits. - Proposed Solution:
AttemptHandlerService: A new, long-livedAttemptHandlerservice is proposed to decouple HTLC attempt resolution from the payment lifecycle. This service will be responsible for the guaranteed, independent, and reliable resolution of all in-flight HTLC attempts, ensuring results are processed even if the original payment lifecycle terminates. - Robust Concurrency Management: The design details a robust concurrency and memory management strategy for the
AttemptHandlerusing a buffered channel as a semaphore. This mechanism provides backpressure, preventing unbounded resource consumption and ensuring the system can safely handle a high volume of in-flight attempts without premature abandonment. - Architectural Advantages: The proposal highlights several advantages, including improved correctness by solving the stuck payment problem, enhanced robustness through backpressure, better architectural integrity via clear separation of concerns, and increased testability and maintainability of the payment system.
Using Gemini Code Assist
The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.
Invoking Gemini
You can request assistance from Gemini at any point in your pull request via creating an issue comment (i.e. comment on the pull request page) using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands.
| Feature | Command | Description |
|---|---|---|
| Code Review | /gemini review |
Performs a code review for the current pull request in its current state. |
| Pull Request Summary | /gemini summary |
Provides a summary of the current pull request in its current state. |
| Comment | @gemini-code-assist | Responds in comments when explicitly tagged, both in issue comments and review comments. |
| Help | /gemini help |
Displays a list of available commands. |
Customization
To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.
Limitations & Feedback
Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments or fill out our survey to provide feedback.
You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.
Footnotes
-
Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩
There was a problem hiding this comment.
Code Review
This pull request introduces a design proposal for an AttemptHandler service to address the issue of stuck in-flight payments. The proposal is well-structured, clearly identifying the problem and presenting a robust, decoupled architecture as a solution. The use of a semaphore for backpressure and concurrency control is a solid design choice.
I've added a few comments to the design document to suggest minor improvements and clarifications, such as correcting a type name, proposing a simplification of the AttemptInfo struct by leveraging an existing TODO in the codebase, and aligning a code snippet with the project's style guide. Overall, this is an excellent proposal that promises to significantly enhance the reliability of the payment system.
| // AttemptHandler is a router-level service responsible for the guaranteed, | ||
| // independent, and reliable resolution of all in-flight HTLC attempts. | ||
| type AttemptHandler struct { | ||
| cfg *routing.Router |
There was a problem hiding this comment.
The type *routing.Router for the cfg field appears to be a typo. In the lnd codebase, routing is a package name, and the main router struct is ChannelRouter. For the AttemptHandler to fulfill its responsibilities as described in the design (e.g., interacting with ControlTower and MissionControl), it would need access to the router's configuration, which is typically done via an instance of *routing.ChannelRouter.
| cfg *routing.Router | |
| cfg *routing.ChannelRouter |
| type AttemptInfo struct { | ||
| AttemptID uint64 | ||
| PaymentHash lntypes.Hash | ||
| Route route.Route | ||
| errorDecryptor htlcswitch.ErrorDecryptor | ||
| NotifyChannel chan<- *htlcswitch.PaymentResult // Optional | ||
| } |
There was a problem hiding this comment.
The proposed AttemptInfo struct includes Route and errorDecryptor fields, which implies that the paymentLifecycle is still responsible for creating the ErrorDecryptor. The existing codebase contains TODO comments suggesting that this responsibility should be moved into the htlcswitch to be handled on-demand for failed HTLCs. This would be more efficient (avoids creating decryptors for successful HTLCs) and would simplify the interface.
This design proposal is a great opportunity to incorporate this improvement. Consider simplifying AttemptInfo and modifying htlcswitch.GetAttemptResult to not require an ErrorDecryptor. The switch can construct it internally when needed, as it has access to the database to retrieve the route.
A simplified AttemptInfo could look like:
type AttemptInfo struct {
AttemptID uint64
PaymentHash lntypes.Hash
NotifyChannel chan<- *htlcswitch.PaymentResult // Optional
}This would make the new AttemptHandler service even cleaner and more decoupled.
| // processAttempt is the main goroutine for handling a single attempt. It waits | ||
| // indefinitely for a result from the switch. | ||
| func (h *AttemptHandler) processAttempt(attempt *AttemptInfo) { |
There was a problem hiding this comment.
The comment for processAttempt doesn't fully align with the project's style guide.1
- Function comments should begin with the function name.
- The current comment, "is the main goroutine...", is passive. It should actively describe what the function does.
An improved comment would be more direct and adhere to the naming convention.
Style Guide References
| // processAttempt is the main goroutine for handling a single attempt. It waits | |
| // indefinitely for a result from the switch. | |
| func (h *AttemptHandler) processAttempt(attempt *AttemptInfo) { | |
| // processAttempt processes a single HTLC attempt and waits indefinitely for a | |
| // result from the switch. | |
| func (h *AttemptHandler) processAttempt(attempt *AttemptInfo) { |
Footnotes
|
@gemini-code-assist review |
There was a problem hiding this comment.
Code Review
This pull request introduces a design document for a long-term fix to stuck payments and a temporary code fix. The design for the AttemptHandler is well-reasoned and addresses the core issue of coupled payment lifecycles. The temporary fix in payment_lifecycle.go correctly uses a sync.WaitGroup to prevent the lifecycle from exiting while HTLCs are still in flight, which should resolve the immediate problem. My review includes a minor suggestion on the design document to improve decoupling and feedback on the temporary fix to align it with the project's documentation and style guide standards.
| type AttemptHandler struct { | ||
| cfg *routing.Router | ||
| concurrencyLimiter chan struct{} // Used as a semaphore | ||
| wg sync.WaitGroup | ||
| quit chan struct{} | ||
| } |
There was a problem hiding this comment.
The design proposal is excellent and clearly articulates the problem and a robust solution. One minor point for consideration is the AttemptHandler's dependency on the entire *routing.Router struct. To further improve decoupling and testability, as highlighted in the advantages section, it might be beneficial to define a more granular interface that provides only the methods the AttemptHandler requires from the router. This would adhere to the interface segregation principle and make the dependencies of the new service more explicit.
routing/payment_lifecycle.go
Outdated
| // resultCollectorLoop is a dedicated goroutine that handles HTLC attempt | ||
| // results. It continues running even after the payment lifecycle exits to | ||
| // ensure no results are lost. |
There was a problem hiding this comment.
The function comment is confusing. It refers to the function as resultCollectorLoop and describes it as a long-running goroutine that continues after the lifecycle exits. This seems to describe the proposed AttemptHandler service rather than the actual waitForOutstandingResults function, which is a blocking call that drains results before the lifecycle exits. The comment should be updated to accurately reflect the function's behavior in this temporary fix.1
// waitForOutstandingResults blocks until all outstanding HTLCs have been
// resolved. This is a temporary fix to prevent the payment lifecycle from
// exiting while HTLCs are still in-flight, which could lead to stuck
// payments.Style Guide References
Footnotes
c296daa to
0254a3d
Compare
|
|
||
| The root cause is the **tight coupling** between the `paymentLifecycle` (the process that sends a payment) and the `resultCollector` (the goroutine that waits for the payment's result). The `resultCollector`'s lifetime is tied directly to the `paymentLifecycle`. | ||
|
|
||
| When the `paymentLifecycle` exits for any reason other than definitive success or failure—such as a context timeout, user cancellation, or a route-finding error—it closes a `quit` channel. This action immediately terminates all of its associated `resultCollector` goroutines. |
There was a problem hiding this comment.
I'm not sure this analysis is correct - if the payment has pending htlcs and timed out, the payment is still considered as inflight and will wait for attempt results? Think it's handled in decideNextStep.
There was a problem hiding this comment.
Yeah it is more detailed than that, the problem is the following:
In case of the issue, we fail the payment in the context check, but we still have the non-failed payment as a reference because we load the payment before checking the context. Now there are conditions that in decideNextStep we will continue launching a payment, but when registering the attempt we fetch the new payment see that we cannot register a new attempt becauase the payment is failed and we hit the case: exitWithError
There was a problem hiding this comment.
I think if we really want to make "sending a payment" non-blocking, I think there is no other way than separating the attempt collection, otherwise we will not be able to exit a payment after the timeout hit.
There was a problem hiding this comment.
ok I see what's going here - the checkContext will change the db payment - in that case why don't we just reload the payment afterwards, so the first step is to check context, then reload the payment.
I think the new design means we now have a second loop to handle the payment state, and I think it's better to have a single source to handle it. We shouldn't have any remaining attempts out there.
This commit addresses a bug that caused payments to get "stuck" in a pending state because the main payment process could terminate before all asynchronous payment attempts (HTLCs) reported their results. The fix uses a sync.WaitGroup to ensure the payment lifecycle waits for every result to be collected, guaranteeing a final payment state. However, this is a temporary solution because it can block waiting for results even after the payment timeout is reached. This side effect is currently considered acceptable because sendAttempt is only used in contexts where this new blocking behavior is not an issue: either with no timeout specified, or in a fully asynchronous manner where the caller doesn't wait for the result anyway.
0254a3d to
8437f5b
Compare
|
Temporary fix in #10141, before we consider a bigger change. |
|
Closing this one in favour of #10141 |
In the design document I outlined how I want to fix the problem of stuck payments by introducing a new subsystem called AttemptHandler for more details checkout the design doc.
Waiting for Concept ACKs before starting the implementation.
cc @Roasbeef, @yyforyongyu
Also see the last commit for a temporary fix solution of the problem we encountered in the issue. I think it is acceptable to go with a temp fix and in the long run introduce an independant AttemptHandler for the attempts of the payments.