diff --git a/attempt_handler_design.md b/attempt_handler_design.md new file mode 100644 index 00000000000..22778d52cdd --- /dev/null +++ b/attempt_handler_design.md @@ -0,0 +1,143 @@ +# Design Proposal: The `AttemptHandler` Service for Reliable Payment Resolution + +**Author:** ziggie +**Date:** August 7, 2025 +**Status:** Proposed + +## 1. Problem Statement: Stuck In-Flight Payments + +The current payment architecture in `lnd` suffers from a critical flaw where payments can become permanently stuck in the `INFLIGHT` state. + +### Root Cause + +The root cause is the **tight coupling** between the `paymentLifecycle` (the process that sends a payment) and the `resultCollector` (the goroutine that waits for the payment's result). The `resultCollector`'s lifetime is tied directly to the `paymentLifecycle`. + +When the `paymentLifecycle` exits for any reason other than definitive success or failure—such as a context timeout, user cancellation, or a route-finding error—it closes a `quit` channel. This action immediately terminates all of its associated `resultCollector` goroutines. + +The HTLC attempts that were already sent to the `htlcswitch`, however, are still active on the network. Because their collectors are gone, `lnd` is no longer listening for their results. This leaves the payment stuck in limbo, consuming node resources and providing a poor user experience. + +### Current Flawed Architecture + +```mermaid +sequenceDiagram + participant P as paymentLifecycle + participant R as resultCollector + participant S as htlcswitch + + P->>+R: 1. Launch goroutine + P->>S: 2. Send HTLC + + Note right of P: Lifecycle waits... + + alt On Timeout, Error, or Cancellation + P--xR: 4. KILLS resultCollector! + end + + S-->>R: 3. Result is sent back from switch + Note over R,S: But no one is listening! Result is ABANDONED. +``` + +## 2. Proposed Solution: The `AttemptHandler` Service + +We propose a new, registration-based architecture that decouples attempt resolution from the payment lifecycle. This is achieved by introducing a new, long-lived service: the **`AttemptHandler`**. + +The `AttemptHandler` is a router-level service whose lifetime is tied to the router itself, not to any individual payment. + +### New Decoupled Architecture + +```mermaid +sequenceDiagram + participant P as paymentLifecycle
(Short-Lived) + participant A as AttemptHandler
(Long-Lived) + participant S as htlcswitch
(Source of Truth) + participant D as Database /
Mission Control + + P->>S: 1. Send HTLC + P->>A: 2. Register Attempt(info) + Note right of P: Lifecycle is now free
to exit at any time. + + A->>S: 3. GetAttemptResult() + Note left of A: Waits indefinitely for result.
Manages concurrency. + + S-->>A: 4. Return Result(success/fail) + + A->>D: 5. SettleAttempt() / FailAttempt() + A->>D: 6. Report to Mission Control + + alt If lifecycle is still active + A-->>P: 7. Notify(Result) + end +``` + +## 3. Detailed Design + +### 3.1. The `AttemptHandler` Component + +A new file, `lnd/attempt_handler.go`, will be created to house the service. + +```go +// lnd/attempt_handler.go + +// AttemptHandler is a router-level service responsible for the guaranteed, +// independent, and reliable resolution of all in-flight HTLC attempts. +type AttemptHandler struct { + cfg *routing.Router + concurrencyLimiter chan struct{} // Used as a semaphore + wg sync.WaitGroup + quit chan struct{} +} + +// AttemptInfo holds all the necessary information for the AttemptHandler to +// process a single HTLC attempt to completion. +type AttemptInfo struct { + AttemptID uint64 + PaymentHash lntypes.Hash + Route route.Route + errorDecryptor htlcswitch.ErrorDecryptor + NotifyChannel chan<- *htlcswitch.PaymentResult // Optional +} + +// RegisterAttempt registers a new attempt for independent result processing. +// It provides backpressure if the node is at its concurrency limit. +func (h *AttemptHandler) RegisterAttempt(attempt *AttemptInfo) error { + // ... +} + +// processAttempt is the main goroutine for handling a single attempt. It waits +// indefinitely for a result from the switch. +func (h *AttemptHandler) processAttempt(attempt *AttemptInfo) { + // ... +} +``` + +### 3.2. Concurrency and Memory Management + +This is the most critical aspect of the design. To prevent unbounded resource consumption, we will use a **buffered channel as a semaphore**. + +1. **Concurrency Limit:** The `concurrencyLimiter` channel will be initialized with a fixed capacity (e.g., `5000`). +2. **Acquiring a Slot:** Before spawning a `processAttempt` goroutine, `RegisterAttempt` must acquire a "slot" by sending a struct to the channel (`h.concurrencyLimiter <- struct{}{}`). +3. **Providing Backpressure:** If the channel is full, the `default` case in the `select` statement will be hit immediately. `RegisterAttempt` will return an `ErrMaxCapacity` error. This safely stops the `paymentLifecycle` from creating new attempts when the node is overloaded, preventing memory exhaustion. +4. **Releasing a Slot:** The `processAttempt` goroutine releases its slot in a `defer` block (`<-h.concurrencyLimiter`), ensuring the slot is always returned, even on panics. +5. **No Premature Timeouts:** The `processAttempt` goroutine will **not** have its own timeout. It will wait as long as necessary for the `htlcswitch` to return a result, as an HTLC cannot be stuck forever without the switch eventually resolving it (e.g., via a chain event after a force-close). + +This semaphore mechanism provides a robust, built-in, and efficient way to manage memory and concurrency without ever prematurely abandoning a legitimate in-flight HTLC. + +## 4. Rejected Alternative: Callbacks in the Switch + +An alternative design was considered where the `paymentLifecycle` would pass a `callback` function into the `htlcswitch`. The switch would then be responsible for executing this callback upon attempt resolution. + +This alternative was **rejected** for the following key reasons: + +1. **Poor Separation of Concerns:** It pollutes the `htlcswitch`—a low-level routing engine—with high-level application concerns like database updates and mission control reporting. +2. **Tangled Dependencies:** It would force the `htlcswitch` package to import high-level packages, creating a messy dependency graph. +3. **Misplaced Responsibility:** It makes the switch responsible for managing the concurrency of thousands of callback goroutines, a responsibility for which it is not designed. + +The `AttemptHandler` service provides a much cleaner architecture by properly isolating these application-level concerns. + +## 5. Advantages of the `AttemptHandler` Design + +1. **Correctness:** It fully solves the stuck payment problem by decoupling result collection from the payment lifecycle. +2. **Robustness:** The backpressure mechanism prevents unbounded resource consumption. +3. **Architectural Integrity:** It maintains a clean separation of concerns and a clear, hierarchical dependency graph. +4. **Testability:** Each component (`paymentLifecycle`, `AttemptHandler`, `htlcswitch`) can be tested in isolation more easily. +5. **Maintainability:** The logic for handling attempt results is centralized in one place, making it easier to debug and extend in the future. diff --git a/routing/payment_lifecycle.go b/routing/payment_lifecycle.go index 7a6443ab66a..ddd68b2d859 100644 --- a/routing/payment_lifecycle.go +++ b/routing/payment_lifecycle.go @@ -4,6 +4,7 @@ import ( "context" "errors" "fmt" + "sync" "time" "github.com/btcsuite/btcd/btcec/v2" @@ -47,6 +48,10 @@ type paymentLifecycle struct { currentHeight int32 firstHopCustomRecords lnwire.CustomRecords + // wg is used to wait for all result collectors to finish before the + // payment lifecycle exits. + wg sync.WaitGroup + // quit is closed to signal the sub goroutines of the payment lifecycle // to stop. quit chan struct{} @@ -86,6 +91,61 @@ func newPaymentLifecycle(r *ChannelRouter, feeLimit lnwire.MilliSatoshi, return p } +// waitForOutstandingResults is a dedicated goroutine that handles HTLC attempt +// results. It makes sure that if the resumePayment exits we still collect all +// outstanding results. This is only a temporary solution and should be +// separating the attempt handling out of the payment lifecyle. +// +// NOTE: must be run in a goroutine. +func (p *paymentLifecycle) waitForOutstandingResults() { + log.Debugf("Payment %v: starting outstanding results collector", + p.identifier) + + defer log.Debugf("Payment %v: outstanding results collector stopped", + p.identifier) + + for { + select { + case result := <-p.resultCollected: + if result == nil { + log.Debugf("Payment %v: received nil "+ + "result, stopping collector", + p.identifier) + + return + } + + log.Debugf("Payment %v: processing result for "+ + "attempt %v", p.identifier, + result.attempt.AttemptID) + + // Handle the result. This will update the payment + // in the database. + _, err := p.handleAttemptResult( + result.attempt, result.result, + ) + if err != nil { + log.Errorf("Payment %v: failed to handle "+ + "result for attempt %v: %v", + p.identifier, result.attempt.AttemptID, + err) + } + + case <-p.quit: + log.Debugf("Payment %v: quit signal received in "+ + "result collector", p.identifier) + + return + + case <-p.router.quit: + log.Debugf("Payment %v: router quit signal received "+ + "in result collector", p.identifier) + + return + } + } +} + // calcFeeBudget returns the available fee to be used for sending HTLC // attempts. func (p *paymentLifecycle) calcFeeBudget( @@ -219,6 +279,13 @@ func (p *paymentLifecycle) resumePayment(ctx context.Context) ([32]byte, log.Errorf("Payment %v with status=%v failed: %v", p.identifier, status, err) + // We need to wait for all outstanding results to be collected + // before exiting. + // + // NOTE: This is only a temporary solution and should be + // separating the attempt handling out of the payment lifecyle. + go p.waitForOutstandingResults() + return [32]byte{}, nil, err } @@ -226,6 +293,12 @@ func (p *paymentLifecycle) resumePayment(ctx context.Context) ([32]byte, // critical error during path finding. lifecycle: for { + // We need to check the context before reloading the payment + // state. + if err := p.checkContext(ctx); err != nil { + return exitWithErr(err) + } + // We update the payment state on every iteration. currentPayment, ps, err := p.reloadPayment() if err != nil { @@ -246,14 +319,6 @@ lifecycle: // 3. create HTLC attempt. // 4. send HTLC attempt. // 5. collect HTLC attempt result. - // - // Before we attempt any new shard, we'll check to see if we've - // gone past the payment attempt timeout, or if the context was - // cancelled, or the router is exiting. In any of these cases, - // we'll stop this payment attempt short. - if err := p.checkContext(ctx); err != nil { - return exitWithErr(err) - } // Now decide the next step of the current lifecycle. step, err := p.decideNextStep(payment) @@ -367,6 +432,9 @@ func (p *paymentLifecycle) checkContext(ctx context.Context) error { return fmt.Errorf("FailPayment got %w", err) } + return fmt.Errorf("payment %v failed with reason: %v", + p.identifier, reason) + case <-p.router.quit: return fmt.Errorf("check payment timeout got: %w", ErrRouterShuttingDown) @@ -437,6 +505,16 @@ func (p *paymentLifecycle) requestRoute( // stop signals any active shard goroutine to exit. func (p *paymentLifecycle) stop() { + log.Debugf("Stopping payment lifecycle for payment %v ...", + p.identifier) + + // We still wait for all result collectors to finish before exiting + // the payment lifecycle. + // + // NOTE: This is only a temporary solution and should be separating the + // attempt handling out of the payment lifecyle. + p.wg.Wait() + close(p.quit) } @@ -461,7 +539,10 @@ func (p *paymentLifecycle) collectResultAsync(attempt *channeldb.HTLCAttempt) { log.Debugf("Collecting result for attempt %v in payment %v", attempt.AttemptID, p.identifier) + p.wg.Add(1) go func() { + defer p.wg.Done() + result, err := p.collectResult(attempt) if err != nil { log.Errorf("Error collecting result for attempt %v in "+ @@ -483,14 +564,18 @@ func (p *paymentLifecycle) collectResultAsync(attempt *channeldb.HTLCAttempt) { } // Signal that a result has been collected. + // + // NOTE: We don't listen to the payment lifecycle quit channel + // here, because we always resolve the result collector before + // exiting the payment lifecycle which is guaranteed by the + // wait group. + // + // NOTE: This is only a temporary solution and should be + // separating the attempt handling out of the payment lifecyle. select { // Send the result so decideNextStep can proceed. case p.resultCollected <- r: - case <-p.quit: - log.Debugf("Lifecycle exiting while collecting "+ - "result for payment %v", p.identifier) - case <-p.router.quit: } }()