Conversation

@thomash-acinq (Member) commented Jul 21, 2023

We do not drop HTLCs yet; the purpose is to collect data first.

We add:

  1. An endorsement bit in `UpdateAddHtlc`. This follows blip-0004: experimental endorsement signaling in update_add_htlc (lightning/blips#27).
  2. A local reputation system: for each pair (origin node, endorsement value), we compute its reputation as the total fees that were paid divided by the total fees that would have been paid if all HTLCs had been fulfilled. When considering an HTLC to relay, we only forward it if the reputation of its source is higher than the occupancy of the outgoing channel (see the sketch after this list).
  3. A limit on the number of small HTLCs per channel. We allow only very few small HTLCs per channel, so that it's not possible to block large HTLCs using only small ones (similar to "Add a channel congestion control mechanism" #2330, but continuous).
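
To make item 2 concrete, here's a minimal sketch of the reputation rule in Scala. It is illustrative only: names like `Reputation`, `attempt` and `shouldRelay` are not the actual eclair API, and the real implementation also accounts for how long each HTLC is held, which is omitted here.

```scala
// Reputation of one (origin node, endorsement value) pair: the ratio of
// fees actually earned to fees we would have earned if every HTLC
// (including the ones still pending) had been fulfilled.
case class Reputation(paidFees: Double, totalFees: Double) {
  def score: Double = if (totalFees == 0) 0.0 else paidFees / totalFees

  // A new relay attempt immediately counts in the denominator, so
  // pending HTLCs lower the confidence of subsequent ones.
  def attempt(fee: Double): Reputation = copy(totalFees = totalFees + fee)

  // On settlement, only a fulfilled HTLC adds its fee to the numerator;
  // a failed HTLC keeps weighing on the denominator.
  def settle(fee: Double, fulfilled: Boolean): Reputation =
    if (fulfilled) copy(paidFees = paidFees + fee) else this
}

// Forwarding rule from item 2: relay only if the source's reputation is
// higher than the occupancy (fraction of slots/liquidity already used)
// of the outgoing channel.
def shouldRelay(reputation: Reputation, outgoingOccupancy: Double): Boolean =
  reputation.score > outgoingOccupancy
```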

@codecov-commenter commented Jul 21, 2023

Codecov Report

Merging #2716 (deab085) into master (12adf87) will increase coverage by 0.04%.
Report is 1 commit behind head on master.
The diff coverage is 100.00%.


@@            Coverage Diff             @@
##           master    #2716      +/-   ##
==========================================
+ Coverage   85.82%   85.86%   +0.04%     
==========================================
  Files         216      218       +2     
  Lines       18126    18209      +83     
  Branches      771      749      -22     
==========================================
+ Hits        15556    15636      +80     
- Misses       2570     2573       +3     
| Files | Coverage Δ |
| --- | --- |
| ...re/src/main/scala/fr/acinq/eclair/NodeParams.scala | 93.47% <100.00%> (+0.08%) ⬆️ |
| ...ir-core/src/main/scala/fr/acinq/eclair/Setup.scala | 75.29% <100.00%> (+0.14%) ⬆️ |
| ...in/scala/fr/acinq/eclair/channel/ChannelData.scala | 100.00% <ø> (ø) |
| ...in/scala/fr/acinq/eclair/channel/Commitments.scala | 96.93% <100.00%> (+0.11%) ⬆️ |
| ...ain/scala/fr/acinq/eclair/channel/Monitoring.scala | 96.15% <100.00%> (+0.23%) ⬆️ |
| ...in/scala/fr/acinq/eclair/channel/fsm/Channel.scala | 85.80% <100.00%> (+0.15%) ⬆️ |
| ...ain/scala/fr/acinq/eclair/payment/Monitoring.scala | 98.30% <100.00%> (+0.09%) ⬆️ |
| .../scala/fr/acinq/eclair/payment/PaymentPacket.scala | 90.82% <100.00%> (ø) |
| ...a/fr/acinq/eclair/payment/relay/ChannelRelay.scala | 96.03% <100.00%> (+0.16%) ⬆️ |
| ...fr/acinq/eclair/payment/relay/ChannelRelayer.scala | 100.00% <100.00%> (ø) |
| ... and 9 more |  |

... and 3 files with indirect coverage changes

@t-bast (Member) left a comment

Thanks, it's now clearer to me how we assign reputation to our peers. I've made a few comments on the code itself, some of which (the easy ones) I fixed in #2893.

The reputation algorithm itself looks good to me; let's try it out and see what results we get in practice and during simulations.

However, I don't think the way we interact with the reputation recorder makes the most sense.
You are storing a relay attempt as soon as we start relaying, before we know whether we actually send HTLCs out or not.
This leads to the weird `CancelRelay` command and an inconsistency between channel relay and trampoline relay.
In the trampoline case, if we can't find a route or can't send outgoing HTLCs, we will treat this as a failure, which is incorrect.
This can probably even be used to skew our reputation algorithm.
It's also pretty invasive, especially in the NodeRelay component...

It seems to me that it would make more sense if we implemented the following flow:

  1. Once we start relaying (`ChannelRelay` / `NodeRelay`), we obtain the confidence value with `GetConfidence` and will include it in `CMD_ADD_HTLC`.
  2. At that point, we DON'T update the reputation to take this payment into account, because we don't know yet if it will be relayed.
  3. In `Channel.scala`, when we actually send an outgoing `UpdateAddHtlc`, we emit an `OutgoingHtlcAdded` event to the event stream, which contains the outgoing HTLC and its `Origin.Hot`.
  4. In `Channel.scala`, when an outgoing HTLC is failed or fulfilled, we emit an `OutgoingHtlcFailed` / `OutgoingHtlcFulfilled` event to the event stream.
  5. The reputation recorder listens to those events and updates the internal reputation state accordingly.
  6. We don't use the `relayId` but rather the outgoing `channel_id` and `htlc_id`, combined with the origin, to group HTLCs.
  7. For trampoline payments, since the reputation recorder has the `Origin` information, it can wait for all outgoing HTLCs to be settled to correctly account for the fees / timestamps.

I believe this better matches what we're trying to accomplish: the only thing the reputation recorder actually needs to know to update reputation is when outgoing HTLCs are sent and when they're settled.
It also provides more accurate relay data to ensure we're updating the reputation correctly, and has much less impact on the ChannelRelay / NodeRelay actors (which should simplify testing).
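
A rough sketch of steps 3–5 in Scala, reusing the toy `Reputation` from the sketch in the PR description; the event and actor shapes below are simplified stand-ins for the eclair types, not the real ones:

```scala
import akka.actor.typed.Behavior
import akka.actor.typed.scaladsl.Behaviors

// Step 6: HTLCs are grouped by (outgoing channel_id, htlc_id), not relayId.
case class HtlcKey(channelId: String, htlcId: Long)

// Steps 3 and 4: events the channel publishes on the event stream.
case class OutgoingHtlcAdded(key: HtlcKey, originNode: String, endorsed: Boolean, fee: Double)
case class OutgoingHtlcFulfilled(key: HtlcKey)
case class OutgoingHtlcFailed(key: HtlcKey)

sealed trait Command
final case class WrappedAdded(e: OutgoingHtlcAdded) extends Command
final case class WrappedFulfilled(e: OutgoingHtlcFulfilled) extends Command
final case class WrappedFailed(e: OutgoingHtlcFailed) extends Command

// Step 5: the recorder subscribes to those events and is the only place
// where reputation state is updated.
object ReputationRecorder {
  def apply(pending: Map[HtlcKey, OutgoingHtlcAdded],
            reputations: Map[(String, Boolean), Reputation]): Behavior[Command] =
    Behaviors.receiveMessage {
      case WrappedAdded(e) =>
        val k = (e.originNode, e.endorsed)
        val r = reputations.getOrElse(k, Reputation(0, 0)).attempt(e.fee)
        apply(pending + (e.key -> e), reputations + (k -> r))
      case WrappedFulfilled(e) => settle(pending, reputations, e.key, fulfilled = true)
      case WrappedFailed(e)    => settle(pending, reputations, e.key, fulfilled = false)
    }

  private def settle(pending: Map[HtlcKey, OutgoingHtlcAdded],
                     reputations: Map[(String, Boolean), Reputation],
                     key: HtlcKey, fulfilled: Boolean): Behavior[Command] =
    pending.get(key) match {
      case Some(e) =>
        val k = (e.originNode, e.endorsed)
        val r = reputations.getOrElse(k, Reputation(0, 0)).settle(e.fee, fulfilled)
        apply(pending - key, reputations + (k -> r))
      case None => Behaviors.same
    }
}
```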

Can you try that, or let me know if you think that it wouldn't be as good as the currently implemented flow?

@thomash-acinq (Member, Author) commented
The reason for the weird `CancelRelay` is that we need to take pending HTLCs into account in the reputation. If we receive two HTLCs at once, we don't want both of them to enjoy the same reputation: the second one should be penalized. If we only update the reputation after we have decided to relay, we can get a data race.
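
Concretely, with toy numbers and the illustrative `Reputation` sketch from the PR description, accounting for the attempt at decision time means a second simultaneous HTLC is judged against a lower score:

```scala
val r0 = Reputation(paidFees = 80, totalFees = 100) // score = 0.80
// First HTLC is accepted: its fee goes into the denominator right away.
val r1 = r0.attempt(fee = 25)                       // score = 80 / 125 = 0.64
// A second simultaneous HTLC is now judged against 0.64, not 0.80, so it
// no longer enjoys the same reputation as the first one.
```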

@t-bast (Member) commented Aug 2, 2024

> The reason for the weird `CancelRelay` is that we need to take pending HTLCs into account in the reputation. If we receive two HTLCs at once, we don't want both of them to enjoy the same reputation: the second one should be penalized. If we only update the reputation after we have decided to relay, we can get a data race.

But you're not doing this for trampoline relay, so that race can already be exploited anyway? I don't think this matters much in practice though, because:

  • that race is hard to exploit, because between the call to the `ReputationRecorder` and the outgoing HTLC there will be at most a few milliseconds
  • exploiting that race requires ensuring that we receive the incoming `update_add_htlc` at exactly the same time, and network delays cannot be trivially manipulated
  • at some point we will add a randomized delay before forwarding HTLCs, because it's good for privacy (and was discussed in Oakland), which will make this race almost impossible to exploit

@thomash-acinq (Member, Author) commented

> 1. Once we start relaying (`ChannelRelay` / `NodeRelay`), we obtain the confidence value with `GetConfidence` and will include it in `CMD_ADD_HTLC`.
> 2. At that point, we DON'T update the reputation to take this payment into account, because we don't know yet if it will be relayed.
> 3. In `Channel.scala`, when we actually send an outgoing `UpdateAddHtlc`, we emit an `OutgoingHtlcAdded` event to the event stream, which contains the outgoing HTLC and its `Origin.Hot`.
> 4. In `Channel.scala`, when an outgoing HTLC is failed or fulfilled, we emit an `OutgoingHtlcFailed` / `OutgoingHtlcFulfilled` event to the event stream.
> 5. The reputation recorder listens to those events and updates the internal reputation state accordingly.
> 6. We don't use the `relayId` but rather the outgoing `channel_id` and `htlc_id`, combined with the origin, to group HTLCs.
> 7. For trampoline payments, since the reputation recorder has the `Origin` information, it can wait for all outgoing HTLCs to be settled to correctly account for the fees / timestamps.

I've tried doing that in #2897.
For channel relays it works fine; for trampoline, however, I'm running into some problems:

  • We can't wait for outgoing HTLCs to be settled to compute the fees: we need to update the reputation as soon as we start relaying, and for that we need to know the fees.
  • Even when all HTLCs associated with a trampoline relay fail, it's not necessarily the end of this relay, because we may retry.

It seems to me that solving this would require adding more complexity than this refactoring was removing.

@t-bast (Member) commented Aug 9, 2024

> We can't wait for outgoing HTLCs to be settled to compute the fees: we need to update the reputation as soon as we start relaying, and for that we need to know the fees.

It seems to me that we're trying to make trampoline fit into a box where it doesn't actually fit. One important aspect of trampoline is that the sender does not choose the outgoing channels and does not choose the fees: they allocate a total fee budget, and the trampoline node tries to relay within that budget. The trampoline node will ensure that it earns at least its usual channel routing fees, otherwise it won't relay the payment. If the trampoline node is well connected, or the sender over-allocated fees, the trampoline node earns more than its usual routing fees: but I'm not sure that this extra fee should count towards reputation?

So I think we could handle trampoline relays in a simplified way that gets rid of those issues, by using the channel routing fees instead of trying to take the extra trampoline fees into account: when sending an outgoing HTLC with a trampoline origin, the fee we allocate to it in the reputation algorithm should just be this outgoing channel's routing fee (which can be included in the relay event, since we have access to our channel update in the channel data).
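
Concretely, the fee credited to each outgoing HTLC would just be the standard `channel_update` routing fee; a sketch with illustrative values:

```scala
// Standard channel_update fee parameters (values are illustrative).
case class RelayFees(feeBaseMsat: Long, feeProportionalMillionths: Long)

// fee = fee_base_msat + amount * fee_proportional_millionths / 1_000_000
def routingFee(fees: RelayFees, amountMsat: Long): Long =
  fees.feeBaseMsat + amountMsat * fees.feeProportionalMillionths / 1000000L

// For a trampoline relay, each outgoing HTLC is credited with its own
// channel's routing fee, regardless of the sender's total fee budget.
val fees = RelayFees(feeBaseMsat = 1000, feeProportionalMillionths = 100)
val allocated = routingFee(fees, amountMsat = 50000000L) // 1000 + 5000 = 6000 msat
```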

If the payment succeeds and we want to grant a reputation bonus when we earned more than our channel routing fees, that should be easy to do as well, by splitting the extra fee between all the outgoing channels. But I'm not sure we should do this: we can't really match an outgoing HTLC to a specific incoming channel one-to-one, so it's probably better to just count our channel routing fees?

Do you think that model would make sense, or am I missing something?

@thomash-acinq (Member, Author) commented

That seems like a good solution indeed; I'll try it.

@thomash-acinq force-pushed the endorse-htlc branch 2 times, most recently from 527cc06 to 59d312b on June 23, 2025 14:47
@thomash-acinq requested a review from t-bast on June 23, 2025 14:48
@t-bast (Member) left a comment

I haven't looked at the logic inside `Reputation.scala` and `ReputationRecorder.scala` yet, but I've reviewed the interaction with the existing actors, and it's nicely non-invasive; looks mostly good to me 👍

@t-bast (Member) left a comment

Looks good on the concept; comments are mostly about code and architecture. During yesterday's spec meeting, Carla asked that you write a gist detailing the steps of your reputation tracking algorithm in English / pseudo-code, which will let them compare it to what they're doing and verify that the implementation correctly matches the high-level algorithm. Can you create a public gist for this?

@thomash-acinq requested a review from t-bast on July 9, 2025 16:30
@thomash-acinq merged commit 43c3986 into master on Jul 11, 2025 (1 of 2 checks passed)
@t-bast deleted the endorse-htlc branch on July 31, 2025 09:22