
Fix duplicate channel_updates in auditDb when restarting #1918

Closed

Conversation

thomash-acinq (Member)

Because the previous channel_update is never part of the channel state when restarting eclair, we always think it has changed and add it to the audit database again.
We now check whether the channel update has actually changed when restoring it.
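As a minimal standalone sketch of that idea (simplified stand-in types, not eclair's actual API; the real comparison is Announcements.areSame, which ignores signature and timestamp):

// Simplified stand-ins for illustration; eclair's real check is
// fr.acinq.eclair.router.Announcements.areSame.
final case class ChannelUpdate(feeBaseMsat: Long, feeProportionalMillionths: Long, cltvExpiryDelta: Int)

object RestoreCheck {
  // compare only the relayed parameters, like areSame does
  def areSame(u1: ChannelUpdate, u2: ChannelUpdate): Boolean = u1 == u2

  // should a channel_update seen at restore be treated as a change
  // (and therefore written to the audit db)?
  def changedOnRestore(restored: ChannelUpdate, previous_opt: Option[ChannelUpdate]): Boolean =
    previous_opt match {
      case Some(previous) => !areSame(restored, previous) // unchanged => skip the duplicate row
      case None           => true                         // nothing to compare against => record it
    }
}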

@t-bast (Member) left a comment

This is a dangerous change, because it impacts downstream components and how we re-emit channel_updates to our peers, which may lead to spec violations if we make a mistake.

(state, nextState, stateData, nextStateData) match {
  // ORDER MATTERS!
  case (WAIT_FOR_INIT_INTERNAL, OFFLINE, _, normal: DATA_NORMAL) =>
    // LocalChannelUpdate is already published when restoring the channel
@t-bast (Member) commented Aug 23, 2021
I preferred having it here; it was clearer when it was explicitly done in the transition phases. We've spent a lot of time ensuring the retransmit logic works properly, so I'd avoid changing that.

@thomash-acinq (Member, Author)

I understand, but the problem is that at this point we don't have enough information to know whether the channel update has changed.

@thomash-acinq (Member, Author)

I've added a channelUpdateBeforeRestore_opt in DATA_NORMAL so that we have access to it in the transition function and put everything back there.
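Roughly, the shape of that change (only the field name channelUpdateBeforeRestore_opt comes from the comment above; the types are simplified stand-ins, not eclair's real DATA_NORMAL):

final case class ChannelUpdate(feeBaseMsat: Long, feeProportionalMillionths: Long, cltvExpiryDelta: Int)

final case class DATA_NORMAL(
  channelUpdate: ChannelUpdate,
  // the update that was in the channel state before the restart, if any
  channelUpdateBeforeRestore_opt: Option[ChannelUpdate]
)

// in the transition function we can now tell whether the restored update
// actually differs from the pre-restore one before persisting/re-emitting it
def updateChangedOnRestore(d: DATA_NORMAL): Boolean =
  !d.channelUpdateBeforeRestore_opt.contains(d.channelUpdate)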

@pm47 (Member) commented Aug 23, 2021

I still don't get what caused the issue precisely. Didn't the following code ensure that we wouldn't emit duplicate updates?

val channelUpdate1 = if (Announcements.areSame(candidateChannelUpdate, normal.channelUpdate)) {
  // if there was no configuration change we keep the existing channel update
  normal.channelUpdate
} else {
  log.info("refreshing channel_update due to configuration changes old={} new={}", normal.channelUpdate, candidateChannelUpdate)
  candidateChannelUpdate
}

@thomash-acinq (Member, Author)

> I still don't get what caused the issue precisely. Didn't the following code ensure that we wouldn't emit duplicate updates?

This code only reuses the previous channel update if it is still valid. But after that, without knowing what the previous update was, we can't tell whether it has changed.

@codecov-commenter commented Aug 23, 2021

Codecov Report

Merging #1918 (ef31de1) into master (9a0fc14) will increase coverage by 0.22%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##           master    #1918      +/-   ##
==========================================
+ Coverage   87.52%   87.74%   +0.22%     
==========================================
  Files         159      159              
  Lines       12104    12103       -1     
  Branches      502      506       +4     
==========================================
+ Hits        10594    10620      +26     
+ Misses       1510     1483      -27     
Impacted Files Coverage Δ
...in/scala/fr/acinq/eclair/channel/ChannelData.scala 100.00% <ø> (ø)
...ire/internal/channel/version0/ChannelCodecs0.scala 96.03% <ø> (ø)
...ire/internal/channel/version1/ChannelCodecs1.scala 98.94% <ø> (ø)
...ire/internal/channel/version2/ChannelCodecs2.scala 100.00% <ø> (ø)
...ire/internal/channel/version3/ChannelCodecs3.scala 100.00% <ø> (ø)
...c/main/scala/fr/acinq/eclair/channel/Channel.scala 88.13% <100.00%> (+2.01%) ⬆️
...main/scala/fr/acinq/eclair/db/DbEventHandler.scala 92.53% <100.00%> (-0.43%) ⬇️
...n/scala/fr/acinq/eclair/router/Announcements.scala 100.00% <100.00%> (ø)
...clair/channel/publish/ReplaceableTxPublisher.scala 87.05% <0.00%> (+0.07%) ⬆️
...lockchain/bitcoind/rpc/ExtendedBitcoinClient.scala 91.42% <0.00%> (+0.25%) ⬆️
... and 3 more

@pm47 (Member) commented Aug 24, 2021

Here is an idea: we move the "update channel_update according to conf" code (https://github.com/ACINQ/eclair/blob/master/eclair-core/src/main/scala/fr/acinq/eclair/channel/Channel.scala#L289-L306) to the SYNCING state (there is already related code there: https://github.com/ACINQ/eclair/blob/c5046582649832248bc858522e4ebc18502e5759/eclair-core/src/main/scala/fr/acinq/eclair/channel/Channel.scala#L1668-1676).

This has the following benefits:

  • the change in channel_update would be caught by the onTransition handler
  • we do not change the general logic of emitting channel_updates (I agree with @t-bast's concerns)
  • it prevents us from generating thousands of new channel_updates at startup when we update the conf, and will instead do it at the first reconnection.

OTOH, it also means that we would run this code at each reconnection.
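A rough sketch of that flow, with illustrative names and simplified types (not the actual eclair code):

// when (re)entering SYNCING, rebuild the channel_update from the current conf;
// if it differs, store the new one, and the existing onTransition handler
// picks up the state-data change
final case class ChannelUpdate(feeBaseMsat: Long, feeProportionalMillionths: Long, cltvExpiryDelta: Int)
final case class DATA_NORMAL(channelUpdate: ChannelUpdate)

def refreshOnSyncing(d: DATA_NORMAL, fromConf: ChannelUpdate): DATA_NORMAL =
  if (d.channelUpdate == fromConf) d // no conf change: nothing new for onTransition to see
  else d.copy(channelUpdate = fromConf) // changed state data is what onTransition reacts to

The downside noted above is visible here: refreshOnSyncing would run on every reconnection, not only after a restart.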

@thomash-acinq (Member, Author) commented Aug 24, 2021

> • the change in channel_update would be caught by the onTransition handler
> • we do not change the general logic of emitting channel_updates (I agree with @t-bast's concerns)

This would emit an additional LocalChannelUpdate compared to the current situation. Unless we stop emitting it on WAIT_FOR_INIT_INTERNAL -> OFFLINE, which would actually simplify things.

@pm47 (Member) commented Aug 24, 2021

> > the change in channel_update would be caught by the onTransition handler
> > we do not change the general logic of emitting channel_updates (I agree with @t-bast's concerns)
>
> This would emit an additional LocalChannelUpdate compared to the current situation.

Only if the conf changes, right? Alternatively, instead of updating the channel_update manually, we could have the channel send a CMD_UPDATE_RELAY_FEE to itself, which ensures we go to the normal flow, and is simpler to reason about. There are details though (e.g. need to add the cltv expiry delta to the command), and we still need to update channel_updates inline, e.g. in handleAddDisconnected.
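Sketched out with simplified, illustrative types (eclair's real CMD_UPDATE_RELAY_FEE is richer, and per the caveat above it would need to carry the cltv expiry delta):

final case class ChannelUpdate(feeBaseMsat: Long, feeProportionalMillionths: Long, cltvExpiryDelta: Int)
final case class CMD_UPDATE_RELAY_FEE(feeBaseMsat: Long, feeProportionalMillionths: Long, cltvExpiryDelta: Int)

// on restore, diff the stored update against the conf; if they differ, produce
// the same command a fee-update API call would, so the regular flow runs
def commandOnRestore(stored: ChannelUpdate, fromConf: ChannelUpdate): Option[CMD_UPDATE_RELAY_FEE] =
  if (stored == fromConf) None // conf unchanged: no command, no extra LocalChannelUpdate
  else Some(CMD_UPDATE_RELAY_FEE(fromConf.feeBaseMsat, fromConf.feeProportionalMillionths, fromConf.cltvExpiryDelta))
// the channel would then send the command to itself and reuse the normal handler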

> Unless we stop emitting it on WAIT_FOR_INIT_INTERNAL -> OFFLINE, which would actually simplify things.

That is a more risky change, even if sending a LocalChannelUpdate at this transition does feel hacky. At first glance, removing this shouldn't impact direct peers, since we are disconnected at this step. The router is another story though: LocalChannelUpdates are how it learns about private channels.

@thomash-acinq (Member, Author)

> > This would emit an additional LocalChannelUpdate compared to the current situation.
>
> Only if the conf changes, right?

Yes, only if it changes. And in that case it will first emit an outdated channel update and then the correct one.

If we don't want to change the current flow between states, I think adding channelUpdateBeforeRestore_opt to DATA_NORMAL is the best solution. Anything else will probably change some behavior of the system.

@thomash-acinq (Member, Author)

I've tried multiple approaches, and it doesn't seem possible to make it work by updating the channelUpdate in the SYNCING state.
The problem is the WAIT_FOR_INIT_INTERNAL -> OFFLINE transition. If we stop emitting the LocalChannelUpdate from there, then when nothing has changed no LocalChannelUpdate is emitted at all, which is bad for the router. If we keep emitting the LocalChannelUpdate from there, it may be an outdated one if there was a config change, which leads to useless work.
Also, we would do additional work every time we reconnect even though it's only needed after a restart. Here is how that would look: #1920

pm47 added a commit that referenced this pull request Aug 25, 2021
This is an alternative to #1918 and #1920. It's very close to the
latter, except that we check the conf only once, in the
`WAIT_FOR_INIT_INTERNAL` state, as opposed to at each reconnection in
`SYNCING`.

We do not change the `channel_update` in `WAIT_FOR_INIT_INTERNAL`, which
allows us to set `previousChannelUpdate_opt=Some(normal.channelUpdate)`
in the transition and fix the duplicate bug in the audit db.

If there is a change in conf, there will be an additional
`LocalChannelUpdate` emitted, but only at reconnection, and following
the regular update flow, which should protect us against regressions.

We do handle `CMD_UPDATE_RELAY_FEES` in both `OFFLINE` and `SYNCING`,
because there may be a race between `CMD_UPDATE_RELAY_FEES` and
`ChannelReestablish` if the conf changes at restore. And there was no good
reason to behave differently in those states anyway.
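The transition-side check described here could look roughly like this (previousChannelUpdate_opt is the name from the commit message; everything else is a simplified stand-in):

final case class ChannelUpdate(feeBaseMsat: Long, feeProportionalMillionths: Long, cltvExpiryDelta: Int)

// on the WAIT_FOR_INIT_INTERNAL -> OFFLINE transition, only treat the update
// as new if it differs from the one restored from disk
def shouldRecordInAuditDb(previousChannelUpdate_opt: Option[ChannelUpdate], current: ChannelUpdate): Boolean =
  !previousChannelUpdate_opt.contains(current) // Some(unchanged) => false => no duplicate row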
pm47 added a commit that referenced this pull request Sep 2, 2021
This is very similar to #1918 (ef31de1), but we use the internal
actor state instead of the `ChannelData`.

Pros:
- conceptually simple, low risk of regression
- `ChannelData` stays untouched
Cons:
- now there is a `var` in the channel

Those are the minimal changes to have the simplest diff, but we can
include some improvements made in ef31de1.
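For contrast, the shape of the var-based variant (illustrative names, simplified; not eclair's actual code):

final case class ChannelUpdate(feeBaseMsat: Long, feeProportionalMillionths: Long, cltvExpiryDelta: Int)

class ChannelSketch {
  // the var the "Cons" above refers to: mutable actor-local state instead of
  // a field in the persisted ChannelData
  private var updateAtRestore_opt: Option[ChannelUpdate] = None

  def onRestore(restored: ChannelUpdate): Unit =
    updateAtRestore_opt = Some(restored)

  // consulted in the transition handler; cleared after use so later
  // transitions behave exactly as before
  def changedSinceRestore(current: ChannelUpdate): Boolean = {
    val changed = !updateAtRestore_opt.contains(current)
    updateAtRestore_opt = None
    changed
  }
}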
@thomash-acinq (Member, Author)

#1934

thomash-acinq deleted the no-duplicate-channel-update-on-restart branch on September 9, 2021.