143 changes: 143 additions & 0 deletions attempt_handler_design.md
@@ -0,0 +1,143 @@
# Design Proposal: The `AttemptHandler` Service for Reliable Payment Resolution

**Author:** ziggie
**Date:** August 7, 2025
**Status:** Proposed

## 1. Problem Statement: Stuck In-Flight Payments

The current payment architecture in `lnd` suffers from a critical flaw where payments can become permanently stuck in the `INFLIGHT` state.

### Root Cause

The root cause is the **tight coupling** between the `paymentLifecycle` (the process that sends a payment) and the `resultCollector` (the goroutine that waits for the payment's result). The `resultCollector`'s lifetime is tied directly to the `paymentLifecycle`.

When the `paymentLifecycle` exits for any reason other than definitive success or failure—such as a context timeout, user cancellation, or a route-finding error—it closes a `quit` channel. This action immediately terminates all of its associated `resultCollector` goroutines.
**Member:**

I'm not sure this analysis is correct - if the payment has pending htlcs and timed out, the payment is still considered inflight and will wait for attempt results? I think it's handled in `decideNextStep`.

**Collaborator (author):**

Yeah, it is more detailed than that; the problem is the following:

In the case of this issue, we fail the payment in the context check, but we still hold a reference to the non-failed payment because we load the payment before checking the context. There are conditions under which `decideNextStep` will continue launching a payment, but when registering the attempt we fetch the new payment, see that we cannot register a new attempt because the payment is failed, and we hit the `exitWithError` case.

**Collaborator (author):**

If we really want to make "sending a payment" non-blocking, I think there is no other way than separating the attempt collection; otherwise we will not be able to exit a payment after the timeout hits.

**Member:**

OK, I see what's going on here: `checkContext` will change the db payment. In that case, why don't we just reload the payment afterwards, so the first step is to check the context and then reload the payment?

I think the new design means we now have a second loop to handle the payment state, and I think it's better to have a single source handling it. We shouldn't have any remaining attempts out there.


The HTLC attempts that were already sent to the `htlcswitch`, however, are still active on the network. Because their collectors are gone, `lnd` is no longer listening for their results. This leaves the payment stuck in limbo, consuming node resources and providing a poor user experience.

### Current Flawed Architecture

```mermaid
sequenceDiagram
participant P as paymentLifecycle
participant R as resultCollector
participant S as htlcswitch

P->>+R: 1. Launch goroutine
P->>S: 2. Send HTLC

Note right of P: Lifecycle waits...

alt On Timeout, Error, or Cancellation
P--xR: 4. KILLS resultCollector!
end

S-->>R: 3. Result is sent back from switch
Note over R,S: But no one is listening! Result is ABANDONED.
```

## 2. Proposed Solution: The `AttemptHandler` Service

We propose a new, registration-based architecture that decouples attempt resolution from the payment lifecycle. This is achieved by introducing a new, long-lived service: the **`AttemptHandler`**.

The `AttemptHandler` is a router-level service whose lifetime is tied to the router itself, not to any individual payment.

### New Decoupled Architecture

```mermaid
sequenceDiagram
participant P as paymentLifecycle<br>(Short-Lived)
participant A as AttemptHandler<br>(Long-Lived)
participant S as htlcswitch<br>(Source of Truth)
participant D as Database /<br>Mission Control

P->>S: 1. Send HTLC
P->>A: 2. Register Attempt(info)
Note right of P: Lifecycle is now free<br>to exit at any time.

A->>S: 3. GetAttemptResult()
Note left of A: Waits indefinitely for result.<br>Manages concurrency.

S-->>A: 4. Return Result(success/fail)

A->>D: 5. SettleAttempt() / FailAttempt()
A->>D: 6. Report to Mission Control

alt If lifecycle is still active
A-->>P: 7. Notify(Result)
end
```

## 3. Detailed Design

### 3.1. The `AttemptHandler` Component

A new file, `lnd/attempt_handler.go`, will be created to house the service.

```go
// lnd/attempt_handler.go

// AttemptHandler is a router-level service responsible for the guaranteed,
// independent, and reliable resolution of all in-flight HTLC attempts.
type AttemptHandler struct {
	cfg *routing.Router

	concurrencyLimiter chan struct{} // Used as a semaphore.
	wg                 sync.WaitGroup
	quit               chan struct{}
}

// AttemptInfo holds all the necessary information for the AttemptHandler to
// process a single HTLC attempt to completion.
type AttemptInfo struct {
	AttemptID      uint64
	PaymentHash    lntypes.Hash
	Route          route.Route
	errorDecryptor htlcswitch.ErrorDecryptor
	NotifyChannel  chan<- *htlcswitch.PaymentResult // Optional.
}

// RegisterAttempt registers a new attempt for independent result processing.
// It provides backpressure if the node is at its concurrency limit.
func (h *AttemptHandler) RegisterAttempt(attempt *AttemptInfo) error {
	// ...
}

// processAttempt is the main goroutine for handling a single attempt. It waits
// indefinitely for a result from the switch.
func (h *AttemptHandler) processAttempt(attempt *AttemptInfo) {
	// ...
}
```

**Review comment (medium) on `cfg *routing.Router`:**

The type `*routing.Router` for the `cfg` field appears to be a typo. In the `lnd` codebase, `routing` is a package name, and the main router struct is `ChannelRouter`. For the `AttemptHandler` to fulfill its responsibilities as described in the design (e.g., interacting with `ControlTower` and `MissionControl`), it would need access to the router's configuration, which is typically done via an instance of `*routing.ChannelRouter`.

Suggested change:

```go
cfg *routing.ChannelRouter
```

**Review comment (medium) on the `AttemptHandler` struct:**

The design proposal is excellent and clearly articulates the problem and a robust solution. One minor point for consideration is the `AttemptHandler`'s dependency on the entire `*routing.Router` struct. To further improve decoupling and testability, as highlighted in the advantages section, it might be beneficial to define a more granular interface that provides only the methods the `AttemptHandler` requires from the router. This would adhere to the interface segregation principle and make the dependencies of the new service more explicit.

**Review comment (medium) on `AttemptInfo`:**

The proposed `AttemptInfo` struct includes `Route` and `errorDecryptor` fields, which implies that the `paymentLifecycle` is still responsible for creating the `ErrorDecryptor`. The existing codebase contains TODO comments suggesting that this responsibility should be moved into the `htlcswitch` to be handled on-demand for failed HTLCs. This would be more efficient (it avoids creating decryptors for successful HTLCs) and would simplify the interface.

This design proposal is a great opportunity to incorporate this improvement. Consider simplifying `AttemptInfo` and modifying `htlcswitch.GetAttemptResult` to not require an `ErrorDecryptor`. The switch can construct it internally when needed, as it has access to the database to retrieve the route.

A simplified `AttemptInfo` could look like:

```go
type AttemptInfo struct {
	AttemptID     uint64
	PaymentHash   lntypes.Hash
	NotifyChannel chan<- *htlcswitch.PaymentResult // Optional.
}
```

This would make the new `AttemptHandler` service even cleaner and more decoupled.

**Review comment (medium) on `processAttempt`:**

The comment for `processAttempt` doesn't fully align with the project's style guide: function comments must begin with the function name, and the current comment ("is the main goroutine...") is passive; it should actively describe what the function does.

Suggested change:

```go
// processAttempt processes a single HTLC attempt and waits indefinitely for a
// result from the switch.
func (h *AttemptHandler) processAttempt(attempt *AttemptInfo) {
```

### 3.2. Concurrency and Memory Management

This is the most critical aspect of the design. To prevent unbounded resource consumption, we will use a **buffered channel as a semaphore**.

1. **Concurrency Limit:** The `concurrencyLimiter` channel will be initialized with a fixed capacity (e.g., `5000`).
2. **Acquiring a Slot:** Before spawning a `processAttempt` goroutine, `RegisterAttempt` must acquire a "slot" by sending a struct to the channel (`h.concurrencyLimiter <- struct{}{}`).
3. **Providing Backpressure:** If the channel is full, the `default` case in the `select` statement will be hit immediately. `RegisterAttempt` will return an `ErrMaxCapacity` error. This safely stops the `paymentLifecycle` from creating new attempts when the node is overloaded, preventing memory exhaustion.
4. **Releasing a Slot:** The `processAttempt` goroutine releases its slot in a `defer` block (`<-h.concurrencyLimiter`), ensuring the slot is always returned, even on panics.
5. **No Premature Timeouts:** The `processAttempt` goroutine will **not** have its own timeout. It will wait as long as necessary for the `htlcswitch` to return a result, as an HTLC cannot be stuck forever without the switch eventually resolving it (e.g., via a chain event after a force-close).

This semaphore mechanism provides a robust, built-in, and efficient way to manage memory and concurrency without ever prematurely abandoning a legitimate in-flight HTLC.

## 4. Rejected Alternative: Callbacks in the Switch

An alternative design was considered where the `paymentLifecycle` would pass a `callback` function into the `htlcswitch`. The switch would then be responsible for executing this callback upon attempt resolution.

This alternative was **rejected** for the following key reasons:

1. **Poor Separation of Concerns:** It pollutes the `htlcswitch`—a low-level routing engine—with high-level application concerns like database updates and mission control reporting.
2. **Tangled Dependencies:** It would force the `htlcswitch` package to import high-level packages, creating a messy dependency graph.
3. **Misplaced Responsibility:** It makes the switch responsible for managing the concurrency of thousands of callback goroutines, a responsibility for which it is not designed.

The `AttemptHandler` service provides a much cleaner architecture by properly isolating these application-level concerns.

## 5. Advantages of the `AttemptHandler` Design

1. **Correctness:** It fully solves the stuck payment problem by decoupling result collection from the payment lifecycle.
2. **Robustness:** The backpressure mechanism prevents unbounded resource consumption.
3. **Architectural Integrity:** It maintains a clean separation of concerns and a clear, hierarchical dependency graph.
4. **Testability:** Each component (`paymentLifecycle`, `AttemptHandler`, `htlcswitch`) can be tested in isolation more easily.
5. **Maintainability:** The logic for handling attempt results is centralized in one place, making it easier to debug and extend in the future.
109 changes: 97 additions & 12 deletions routing/payment_lifecycle.go
@@ -4,6 +4,7 @@
"context"
"errors"
"fmt"
"sync"
"time"

"github.com/btcsuite/btcd/btcec/v2"
@@ -47,6 +48,10 @@
currentHeight int32
firstHopCustomRecords lnwire.CustomRecords

// wg is used to wait for all result collectors to finish before the
// payment lifecycle exits.
wg sync.WaitGroup

// quit is closed to signal the sub goroutines of the payment lifecycle
// to stop.
quit chan struct{}
@@ -86,6 +91,61 @@
return p
}

// waitForOutstandingResults is a dedicated goroutine that handles HTLC attempt
// results. It makes sure that even if resumePayment exits, we still collect
// all outstanding results. This is only a temporary solution; the attempt
// handling should eventually be separated out of the payment lifecycle.
//
// NOTE: must be run in a goroutine.
func (p *paymentLifecycle) waitForOutstandingResults() {
log.Debugf("Payment %v: starting outstanding results collector",
p.identifier)

defer log.Debugf("Payment %v: outstanding results collector stopped",
p.identifier)

for {
select {
case result := <-p.resultCollected:
if result == nil {
log.Debugf("Payment %v: received nil "+
"result, stopping collector",
p.identifier)

return
}

log.Debugf("Payment %v: processing result for "+
"attempt %v", p.identifier,
result.attempt.AttemptID)

// Handle the result. This will update the payment
// in the database.
_, err := p.handleAttemptResult(
result.attempt, result.result,
)
if err != nil {
log.Errorf("Payment %v: failed to handle "+
"result for attempt %v: %v",
p.identifier, result.attempt.AttemptID,
err)
}

case <-p.quit:
log.Debugf("Payment %v: quit signal received in "+
"result collector", p.identifier)

return

case <-p.router.quit:
log.Debugf("Payment %v: router quit signal received "+
"in result collector", p.identifier)

return
}
}
}

// calcFeeBudget returns the available fee to be used for sending HTLC
// attempts.
func (p *paymentLifecycle) calcFeeBudget(
@@ -219,13 +279,26 @@
log.Errorf("Payment %v with status=%v failed: %v", p.identifier,
status, err)

// We need to wait for all outstanding results to be collected
// before exiting.
//
// NOTE: This is only a temporary solution; the attempt
// handling should eventually be separated out of the
// payment lifecycle.
go p.waitForOutstandingResults()

return [32]byte{}, nil, err
}

// We'll continue until either our payment succeeds, or we encounter a
// critical error during path finding.
lifecycle:
for {
// We need to check the context before reloading the payment
// state.
if err := p.checkContext(ctx); err != nil {
return exitWithErr(err)
}

// We update the payment state on every iteration.
currentPayment, ps, err := p.reloadPayment()
if err != nil {
@@ -246,14 +319,6 @@
// 3. create HTLC attempt.
// 4. send HTLC attempt.
// 5. collect HTLC attempt result.
//
// Before we attempt any new shard, we'll check to see if we've
// gone past the payment attempt timeout, or if the context was
// cancelled, or the router is exiting. In any of these cases,
// we'll stop this payment attempt short.
if err := p.checkContext(ctx); err != nil {
return exitWithErr(err)
}

// Now decide the next step of the current lifecycle.
step, err := p.decideNextStep(payment)
@@ -367,6 +432,9 @@
return fmt.Errorf("FailPayment got %w", err)
}

return fmt.Errorf("payment %v failed with reason: %v",
p.identifier, reason)

Check failure (GitHub Actions / Lint code) on line 436 in `routing/payment_lifecycle.go`: non-wrapping format verb for `fmt.Errorf`. Use `%w` to format errors (`errorlint`).

case <-p.router.quit:
return fmt.Errorf("check payment timeout got: %w",
ErrRouterShuttingDown)
@@ -437,6 +505,16 @@

// stop signals any active shard goroutine to exit.
func (p *paymentLifecycle) stop() {
log.Debugf("Stopping payment lifecycle for payment %v ...",
p.identifier)

// We still wait for all result collectors to finish before exiting
// the payment lifecycle.
//
// NOTE: This is only a temporary solution; the attempt handling
// should eventually be separated out of the payment lifecycle.
p.wg.Wait()

close(p.quit)
}

@@ -461,7 +539,10 @@
log.Debugf("Collecting result for attempt %v in payment %v",
attempt.AttemptID, p.identifier)

p.wg.Add(1)
go func() {
defer p.wg.Done()

result, err := p.collectResult(attempt)
if err != nil {
log.Errorf("Error collecting result for attempt %v in "+
@@ -483,14 +564,18 @@
}

// Signal that a result has been collected.
//
// NOTE: We don't listen to the payment lifecycle quit channel
// here, because we always resolve the result collector before
// exiting the payment lifecycle which is guaranteed by the
// wait group.
//
// NOTE: This is only a temporary solution; the attempt
// handling should eventually be separated out of the payment
// lifecycle.
select {
// Send the result so decideNextStep can proceed.
case p.resultCollected <- r:

case <-p.quit:
log.Debugf("Lifecycle exiting while collecting "+
"result for payment %v", p.identifier)

case <-p.router.quit:
}
}()