diff --git a/attempt_handler_design.md b/attempt_handler_design.md
new file mode 100644
index 00000000000..22778d52cdd
--- /dev/null
+++ b/attempt_handler_design.md
@@ -0,0 +1,143 @@
+# Design Proposal: The `AttemptHandler` Service for Reliable Payment Resolution
+
+**Author:** ziggie
+**Date:** August 7, 2025
+**Status:** Proposed
+
+## 1. Problem Statement: Stuck In-Flight Payments
+
+The current payment architecture in `lnd` suffers from a critical flaw where payments can become permanently stuck in the `INFLIGHT` state.
+
+### Root Cause
+
+The root cause is the **tight coupling** between the `paymentLifecycle` (the process that sends a payment) and the `resultCollector` (the goroutine that waits for the payment's result). The `resultCollector`'s lifetime is tied directly to the `paymentLifecycle`.
+
+When the `paymentLifecycle` exits for any reason other than definitive success or failure—such as a context timeout, user cancellation, or a route-finding error—it closes a `quit` channel. This action immediately terminates all of its associated `resultCollector` goroutines.
+
+The HTLC attempts that were already sent to the `htlcswitch`, however, are still active on the network. Because their collectors are gone, `lnd` is no longer listening for their results. This leaves the payment stuck in limbo, consuming node resources and providing a poor user experience.
+
+### Current Flawed Architecture
+
+```mermaid
+sequenceDiagram
+ participant P as paymentLifecycle
+ participant R as resultCollector
+ participant S as htlcswitch
+
+ P->>+R: 1. Launch goroutine
+ P->>S: 2. Send HTLC
+
+ Note right of P: Lifecycle waits...
+
+ alt On Timeout, Error, or Cancellation
+ P--xR: 4. KILLS resultCollector!
+ end
+
+ S-->>R: 3. Result is sent back from switch
+ Note over R,S: But no one is listening! Result is ABANDONED.
+```
+
+## 2. Proposed Solution: The `AttemptHandler` Service
+
+We propose a new, registration-based architecture that decouples attempt resolution from the payment lifecycle. This is achieved by introducing a new, long-lived service: the **`AttemptHandler`**.
+
+The `AttemptHandler` is a router-level service whose lifetime is tied to the router itself, not to any individual payment.
+
+### New Decoupled Architecture
+
+```mermaid
+sequenceDiagram
+ participant P as paymentLifecycle
(Short-Lived)
+ participant A as AttemptHandler
(Long-Lived)
+ participant S as htlcswitch
(Source of Truth)
+ participant D as Database /
Mission Control
+
+ P->>S: 1. Send HTLC
+ P->>A: 2. Register Attempt(info)
+ Note right of P: Lifecycle is now free
to exit at any time.
+
+ A->>S: 3. GetAttemptResult()
+ Note left of A: Waits indefinitely for result.
Manages concurrency.
+
+ S-->>A: 4. Return Result(success/fail)
+
+ A->>D: 5. SettleAttempt() / FailAttempt()
+ A->>D: 6. Report to Mission Control
+
+ alt If lifecycle is still active
+ A-->>P: 7. Notify(Result)
+ end
+```
+
+## 3. Detailed Design
+
+### 3.1. The `AttemptHandler` Component
+
+A new file, `lnd/attempt_handler.go`, will be created to house the service.
+
+```go
+// lnd/attempt_handler.go
+
+// AttemptHandler is a router-level service responsible for the guaranteed,
+// independent, and reliable resolution of all in-flight HTLC attempts.
+type AttemptHandler struct {
+ cfg *routing.Router
+ concurrencyLimiter chan struct{} // Used as a semaphore
+ wg sync.WaitGroup
+ quit chan struct{}
+}
+
+// AttemptInfo holds all the necessary information for the AttemptHandler to
+// process a single HTLC attempt to completion.
+type AttemptInfo struct {
+ AttemptID uint64
+ PaymentHash lntypes.Hash
+ Route route.Route
+ errorDecryptor htlcswitch.ErrorDecryptor
+ NotifyChannel chan<- *htlcswitch.PaymentResult // Optional
+}
+
+// RegisterAttempt registers a new attempt for independent result processing.
+// It provides backpressure if the node is at its concurrency limit.
+func (h *AttemptHandler) RegisterAttempt(attempt *AttemptInfo) error {
+ // ...
+}
+
+// processAttempt is the main goroutine for handling a single attempt. It waits
+// indefinitely for a result from the switch.
+func (h *AttemptHandler) processAttempt(attempt *AttemptInfo) {
+ // ...
+}
+```
+
+### 3.2. Concurrency and Memory Management
+
+This is the most critical aspect of the design. To prevent unbounded resource consumption, we will use a **buffered channel as a semaphore**.
+
+1. **Concurrency Limit:** The `concurrencyLimiter` channel will be initialized with a fixed capacity (e.g., `5000`).
+2. **Acquiring a Slot:** Before spawning a `processAttempt` goroutine, `RegisterAttempt` must acquire a "slot" by sending a struct to the channel (`h.concurrencyLimiter <- struct{}{}`).
+3. **Providing Backpressure:** If the channel is full, the `default` case in the `select` statement will be hit immediately. `RegisterAttempt` will return an `ErrMaxCapacity` error. This safely stops the `paymentLifecycle` from creating new attempts when the node is overloaded, preventing memory exhaustion.
+4. **Releasing a Slot:** The `processAttempt` goroutine releases its slot in a `defer` block (`<-h.concurrencyLimiter`), ensuring the slot is always returned, even on panics.
+5. **No Premature Timeouts:** The `processAttempt` goroutine will **not** have its own timeout. It will wait as long as necessary for the `htlcswitch` to return a result, as an HTLC cannot be stuck forever without the switch eventually resolving it (e.g., via a chain event after a force-close).
+
+This semaphore mechanism provides a robust, built-in, and efficient way to manage memory and concurrency without ever prematurely abandoning a legitimate in-flight HTLC.
+
+## 4. Rejected Alternative: Callbacks in the Switch
+
+An alternative design was considered where the `paymentLifecycle` would pass a `callback` function into the `htlcswitch`. The switch would then be responsible for executing this callback upon attempt resolution.
+
+This alternative was **rejected** for the following key reasons:
+
+1. **Poor Separation of Concerns:** It pollutes the `htlcswitch`—a low-level routing engine—with high-level application concerns like database updates and mission control reporting.
+2. **Tangled Dependencies:** It would force the `htlcswitch` package to import high-level packages, creating a messy dependency graph.
+3. **Misplaced Responsibility:** It makes the switch responsible for managing the concurrency of thousands of callback goroutines, a responsibility for which it is not designed.
+
+The `AttemptHandler` service provides a much cleaner architecture by properly isolating these application-level concerns.
+
+## 5. Advantages of the `AttemptHandler` Design
+
+1. **Correctness:** It fully solves the stuck payment problem by decoupling result collection from the payment lifecycle.
+2. **Robustness:** The backpressure mechanism prevents unbounded resource consumption.
+3. **Architectural Integrity:** It maintains a clean separation of concerns and a clear, hierarchical dependency graph.
+4. **Testability:** Each component (`paymentLifecycle`, `AttemptHandler`, `htlcswitch`) can be tested in isolation more easily.
+5. **Maintainability:** The logic for handling attempt results is centralized in one place, making it easier to debug and extend in the future.
diff --git a/routing/payment_lifecycle.go b/routing/payment_lifecycle.go
index 7a6443ab66a..ddd68b2d859 100644
--- a/routing/payment_lifecycle.go
+++ b/routing/payment_lifecycle.go
@@ -4,6 +4,7 @@ import (
"context"
"errors"
"fmt"
+ "sync"
"time"
"github.com/btcsuite/btcd/btcec/v2"
@@ -47,6 +48,10 @@ type paymentLifecycle struct {
currentHeight int32
firstHopCustomRecords lnwire.CustomRecords
+ // wg is used to wait for all result collectors to finish before the
+ // payment lifecycle exits.
+ wg sync.WaitGroup
+
// quit is closed to signal the sub goroutines of the payment lifecycle
// to stop.
quit chan struct{}
@@ -86,6 +91,61 @@ func newPaymentLifecycle(r *ChannelRouter, feeLimit lnwire.MilliSatoshi,
return p
}
+// waitForOutstandingResults is a dedicated goroutine that handles HTLC attempt
+// results. It makes sure that if the resumePayment exits we still collect all
+// outstanding results. This is only a temporary solution and should be
+// separating the attempt handling out of the payment lifecyle.
+//
+// NOTE: must be run in a goroutine.
+func (p *paymentLifecycle) waitForOutstandingResults() {
+ log.Debugf("Payment %v: starting outstanding results collector",
+ p.identifier)
+
+ defer log.Debugf("Payment %v: outstanding results collector stopped",
+ p.identifier)
+
+ for {
+ select {
+ case result := <-p.resultCollected:
+ if result == nil {
+ log.Debugf("Payment %v: received nil "+
+ "result, stopping collector",
+ p.identifier)
+
+ return
+ }
+
+ log.Debugf("Payment %v: processing result for "+
+ "attempt %v", p.identifier,
+ result.attempt.AttemptID)
+
+ // Handle the result. This will update the payment
+ // in the database.
+ _, err := p.handleAttemptResult(
+ result.attempt, result.result,
+ )
+ if err != nil {
+ log.Errorf("Payment %v: failed to handle "+
+ "result for attempt %v: %v",
+ p.identifier, result.attempt.AttemptID,
+ err)
+ }
+
+ case <-p.quit:
+ log.Debugf("Payment %v: quit signal received in "+
+ "result collector", p.identifier)
+
+ return
+
+ case <-p.router.quit:
+ log.Debugf("Payment %v: router quit signal received "+
+ "in result collector", p.identifier)
+
+ return
+ }
+ }
+}
+
// calcFeeBudget returns the available fee to be used for sending HTLC
// attempts.
func (p *paymentLifecycle) calcFeeBudget(
@@ -219,6 +279,13 @@ func (p *paymentLifecycle) resumePayment(ctx context.Context) ([32]byte,
log.Errorf("Payment %v with status=%v failed: %v", p.identifier,
status, err)
+ // We need to wait for all outstanding results to be collected
+ // before exiting.
+ //
+ // NOTE: This is only a temporary solution and should be
+ // separating the attempt handling out of the payment lifecyle.
+ go p.waitForOutstandingResults()
+
return [32]byte{}, nil, err
}
@@ -226,6 +293,12 @@ func (p *paymentLifecycle) resumePayment(ctx context.Context) ([32]byte,
// critical error during path finding.
lifecycle:
for {
+ // We need to check the context before reloading the payment
+ // state.
+ if err := p.checkContext(ctx); err != nil {
+ return exitWithErr(err)
+ }
+
// We update the payment state on every iteration.
currentPayment, ps, err := p.reloadPayment()
if err != nil {
@@ -246,14 +319,6 @@ lifecycle:
// 3. create HTLC attempt.
// 4. send HTLC attempt.
// 5. collect HTLC attempt result.
- //
- // Before we attempt any new shard, we'll check to see if we've
- // gone past the payment attempt timeout, or if the context was
- // cancelled, or the router is exiting. In any of these cases,
- // we'll stop this payment attempt short.
- if err := p.checkContext(ctx); err != nil {
- return exitWithErr(err)
- }
// Now decide the next step of the current lifecycle.
step, err := p.decideNextStep(payment)
@@ -367,6 +432,9 @@ func (p *paymentLifecycle) checkContext(ctx context.Context) error {
return fmt.Errorf("FailPayment got %w", err)
}
+ return fmt.Errorf("payment %v failed with reason: %v",
+ p.identifier, reason)
+
case <-p.router.quit:
return fmt.Errorf("check payment timeout got: %w",
ErrRouterShuttingDown)
@@ -437,6 +505,16 @@ func (p *paymentLifecycle) requestRoute(
// stop signals any active shard goroutine to exit.
func (p *paymentLifecycle) stop() {
+ log.Debugf("Stopping payment lifecycle for payment %v ...",
+ p.identifier)
+
+ // We still wait for all result collectors to finish before exiting
+ // the payment lifecycle.
+ //
+ // NOTE: This is only a temporary solution and should be separating the
+ // attempt handling out of the payment lifecyle.
+ p.wg.Wait()
+
close(p.quit)
}
@@ -461,7 +539,10 @@ func (p *paymentLifecycle) collectResultAsync(attempt *channeldb.HTLCAttempt) {
log.Debugf("Collecting result for attempt %v in payment %v",
attempt.AttemptID, p.identifier)
+ p.wg.Add(1)
go func() {
+ defer p.wg.Done()
+
result, err := p.collectResult(attempt)
if err != nil {
log.Errorf("Error collecting result for attempt %v in "+
@@ -483,14 +564,18 @@ func (p *paymentLifecycle) collectResultAsync(attempt *channeldb.HTLCAttempt) {
}
// Signal that a result has been collected.
+ //
+ // NOTE: We don't listen to the payment lifecycle quit channel
+ // here, because we always resolve the result collector before
+ // exiting the payment lifecycle which is guaranteed by the
+ // wait group.
+ //
+ // NOTE: This is only a temporary solution and should be
+ // separating the attempt handling out of the payment lifecyle.
select {
// Send the result so decideNextStep can proceed.
case p.resultCollected <- r:
- case <-p.quit:
- log.Debugf("Lifecycle exiting while collecting "+
- "result for payment %v", p.identifier)
-
case <-p.router.quit:
}
}()