multi: integrate new RBF co-op flow into the server+peer #9575
Roasbeef merged 24 commits into lightningnetwork:rbf-staging
Force-pushed c6cc135 to 8d4a3d7, then 8d4a3d7 to 84969ec.
In the next commit, we'll start checking feature bits to decide how to init the chan closer. In the future, we can move the current chan closer handling logic into an `MsgEndpoint`, which'll allow us to get rid of the explicit chan closer map and direct handling.
In this commit, we use the interfaces we created in the prior commit to make a new method capable of spinning up the new rbf coop closer.
In this commit, we add a new composite chanCloserFsm type. This'll allow us to store a single value that might be a negotiator or an RBF closer. In a follow-up commit, we'll use this to conditionally create the new RBF closer.
Force-pushed 84969ec to 22cad99.
Updated the base branch.
yyforyongyu left a comment
I think it's close. We need to fix the flakes in CI, which show up in both the unit tests and the itests. In addition, the msgmux logger is not registered:

```diff
diff --git a/log.go b/log.go
index e4c1d2ebe..5d1f2567f 100644
--- a/log.go
+++ b/log.go
@@ -44,6 +44,7 @@ import (
 	"github.com/lightningnetwork/lnd/lnwallet/chanfunding"
 	"github.com/lightningnetwork/lnd/lnwallet/rpcwallet"
 	"github.com/lightningnetwork/lnd/monitoring"
+	"github.com/lightningnetwork/lnd/msgmux"
 	"github.com/lightningnetwork/lnd/netann"
 	"github.com/lightningnetwork/lnd/peer"
 	"github.com/lightningnetwork/lnd/peernotifier"
@@ -202,6 +203,7 @@ func SetupLoggers(root *build.SubLoggerManager, interceptor signal.Interceptor)
 	)
 	AddV1SubLogger(root, graphdb.Subsystem, interceptor, graphdb.UseLogger)
 	AddSubLogger(root, chainio.Subsystem, interceptor, chainio.UseLogger)
+	AddSubLogger(root, msgmux.Subsystem, interceptor, msgmux.UseLogger)
 }

 // AddSubLogger is a helper method to conveniently create and register the
```

I ran the test rbf_coop_close locally. The normal flow works as expected, but the restart behavior seems weird; maybe it's related to the flake.
It's weird to see this error log once the connection is reestablished:

```
2025-03-07 17:18:23.706 [DBG] DISC syncer.go:450: Starting GossipSyncer([Bob])
...
2025-03-07 17:18:23.706 [TRC] MSGX msg_router.go:246: MsgRouter: unable to route msg msgmux.PeerMsg
2025-03-07 17:18:23.706 [ERR] PEER brontide.go:2136: Peer([Bob]): resend failed: unable to fetch channel sync messages for peer [Bob]@127.0.0.1:49436: unable to find closed channel summary
```
Then Alice sent a new shutdown request using 1 sat/vb, although it used 5 sat/vb before the reconnection:

```
2025-03-07 17:18:23.765 [INF] CHCL rbf_coop_transitions.go:488: ChannelPoint([ChanPoint: Alice=>Bob]): channel flushed! proceeding with co-op close
2025-03-07 17:18:23.765 [INF] CHCL rbf_coop_transitions.go:516: ChannelPoint([ChanPoint: Alice=>Bob]): using ideal_fee=1 sat/vb, absolute_fee=0.00000193 BTC
2025-03-07 17:18:23.765 [DBG] PFSM state_machine.go:591: FSM(rbf_chan_closer([ChanPoint: Alice=>Bob])): Adding new internal event to queue event=(*chancloser.SendOfferEvent)(0x1400286d010)({
 TargetFeeRate: (chainfee.SatPerVByte) 1 sat/vb
})
```
```go
// party (for w/e reason), but crashed before the close was complete.
//
//nolint:ll
type shutdownInit = fn.Option[fn.Either[*htlcswitch.ChanClose, channeldb.ShutdownInfo]]
```
I think it'd be better if we could avoid using this kind of data structure in the future. fn.Either is typically used to convey a result or an error. Though we can use it to carry either value, there's no guarantee that it won't be in an invalid state where both Right and Left are set. In contrast, I think this is guaranteed in Haskell via compile-time checks.
Aside from that, from a robustness perspective, every state should be handled. This construct alone creates four states: 1) it's None; 2) it's Some and Either.Right; 3) it's Some and Either.Left; 4) it's Some and both Either.Left and Either.Right are set (or the Either is empty?). Across the codebase we just use WhenSome, MapOption, or WhenLeft, avoiding the questions: what if it should be None but it's Some? What if the Option is None and the mapping does nothing? Etc.
Moving forward I think it's best to come back to the zen of Go: stick to interfaces to decompose complexity. I will create a proposal for that.
> fn.Either is typically used to convey a result or an error, though we can use it to carry either, but there's no guarantee that it won't be in an invalid state where both Right and Left are set
I think you're thinking of a result type? Such a type can be made using an either. The alternative to the usage here is creating a slim interface that the relevant structs implement, then using that as the map value. The advantage of this approach vs. that one is that we retain the core types at the definition site (you can see exactly which two values are used here).
> though we can use it to carry either, but there's no guarantee that it won't be in an invalid state where both Right and Left are set
So they can't actually both be set; it's either-or. The left vs. right values are private variables, and the exposed constructors only let you set one or the other. You don't always need to handle both instances, just like you sometimes do a type assertion and do nothing if it isn't the type you need to act on.
> Aside from that, from robustness's perspective, every state should be handled. This construct alone creates four states - 1) it's none; 2) it's some and Either.Right 3) it's some and Either.Left 4) it's some and Either.Left and Either.Right (or Either is empty?)
So those states also exist if this were an interface in a map (other than that 4th state, which isn't possible). It can be empty, or one of the many structs that implement the interface (which can't be known by looking at the map definition). There's no exhaustive checking by the compiler; instead you need to find all the structs that implicitly implement the type. With the either, we can be explicit about what we're actually storing here vs. the normal implicit interface implementations.
> Across the codebase we just use WhenSome, MapOption or WhenLeft to avoid answering the question what if it should be none but it's some, what is option is none and the mapping does nothing, etc.
We're not avoiding answering the question with these functions. At times you want to do something if it's set, and nothing otherwise. Those functions let you express those patterns.
I think the advantage of using this type where applicable is that you can be explicit about what is being stored/operated on. The alternative is to store an interface whose sole purpose is bundling several types, but which types are stored there is implicit (once you arrive at the definition, you need to search further to see which structs can be stored there, so you can make sure that all the relevant cases are handled).
```go
alicePendingUpdate = aliceCloseUpdate.GetClosePending()
require.NotNil(ht, aliceCloseUpdate)
require.Equal(
	ht, alicePendingUpdate.FeePerVbyte,
```
This got me thinking: maybe there should be a field to tell the user that the RBF failed? Something like Succeeded?
The issue with that is that we can't always detect if it failed or not. For Neutrino, we have no idea. One option we have is to always try to do a fee bump if the user wants one. The side effect of that is that we may end up paying more than the user expects, e.g.: they try to do 5 sat/vb, but 8 sat/vb is what'll actually trigger a replacement.
I'm open to modifying the broadcast flow in another PR. It'll need some changes in the protofsm executor, and result in some additional states in the state machine (BroadcastPending -> BroadcastFailed -> BroadcastSuccess, etc.). I want to restrict this PR to just fundamental blockers to merge, as we still need to do further interop, and get the rc into users' hands so we can get feedback on the RPC/CLI changes.
Force-pushed 22cad99 to fefd583.
So Alice is the one that initiated the initial shutdown via RPC request. We only store the shutdown message on disk, not the fee rate we used before the restart (see `channeldb.ShutdownInfo`). We have behavior to restart the shutdown flow anew for the RBF update; at the same time, we'll then go do a new manual update. The "unable to route message" log is normal; I'll update the log there to log the type of the inner msg (the wire msg). This'll happen for any message that we don't have a registered endpoint for. The close summary log is there because the channel hasn't fully closed yet.
Here's the current test flake. It actually doesn't have anything to do with the restart logic. This is the location that's failing: lnd/itest/lnd_coop_close_rbf_test.go, lines 95 to 103 at fefd583. We're doing a fee update that we know we can't pay for. We don't even send any p2p messages here, but an error is sent back to the user over the RPC stream. AFAICT, I see the logs that show a failure, but it doesn't actually reach the RPC client, resulting in a timeout 🤔... looking into it now.
In this commit, we fully integrate the new RBF close state machine into the peer. For the restart case after shutdown, we can short-circuit the existing logic, as the new FSM will handle retransmitting the shutdown message itself and doesn't need to delegate that duty to the link. Unlike the existing state machine, we're able to restart the flow to sign a coop close with a new, higher fee rate. In this case, we can now send multiple updates to the RPC caller, one for each newly signed coop close transaction. To implement the async flush case, we'll launch a new goroutine to wait until the state machine reaches the `ChannelFlushing` state, then we'll register the hook. We don't do this at startup, as otherwise the channel may _already_ be flushed, triggering an invalid state transition.
For now, we disallow the option to be used with the taproot chans option, as the new flow hasn't yet been updated for nonce usage.
This fixes some existing race conditions, as the `finalizeChanClosure` function was being called from outside the main event loop.
If we hit an error, we want to wipe the state machine state, which also includes removing the old endpoint.
This'll allow us to notify the caller each time a new coop close transaction with a higher fee rate is signed.
Resp is always nil, so we actually need to log event.Update here.
In this commit, we extend `CloseChannelAssertPending` with new args that returns the raw close status update (as we have more things we'd like to assert), and also allows us to pass in a custom fee rate.
Force-pushed b97d255 to 4f24870.
We'll properly handle a protocol error due to user input by halting, and sending the error back to the user. When a user goes to issue a new update, based on which state we're in, we'll either kick off the shutdown, or attempt a new offer. This matches the new spec update where we'll only send `Shutdown` once per connection.
In this commit, we alter the existing co-op close flow to enable RBF bumps after reconnection. With the new RBF close flow, it's possible that after a successful round _and_ a reconnection, either side wants to do another fee bump. Typically we route these requests through the switch, but in this case the link no longer exists in the switch, so any request to fee bump again would find that the link doesn't exist. In this commit, we implement a workaround wherein if we have an RBF chan closer active and the link isn't in the switch, then we just route the request directly to the chan closer via the peer. Once we have the chan closer, we can use the exact same flow as before.
Force-pushed 4f24870 to 4b9294e.
The itest has both sides try to close multiple times, each time with increasing fee rates. We also test the reconnection case, bad RBF updates, and instances where the local party can't actually pay for fees.
With this commit, we make sure we set the right height hint, even if the channel is a zero conf channel.
Force-pushed 2ed12b1 to 3801f3a.
I analyzed the logs from this failed build, and I think the initial flow doesn't seem right. The following observations are taken from after Alice started the first close channel request; for some unknown reason, Bob will always attempt a coop close even though he didn't request it:

- Bob received Alice's shutdown msg and replied with a shutdown msg.
- Bob then started his own shutdown process once the channel was flushed, although the RPC was not called.
- Bob then sent a closing complete with his own fee rate.
- He also received Alice's closing complete.
- The rest of the msgs also show Bob broadcasting two closing txns: one of his own, and the other from Alice.
- Bob finally started the coop process here, which means he shouldn't have attempted one before.

Alice's log is attached for reference; we can determine the msg order from the timestamps.
```go
_, aliceCloseUpdate = ht.CloseChannelAssertPending(
	alice, chanPoint, false,
	lntest.WithCoopCloseFeeRate(aliceRejectedFeeRate),
	lntest.WithLocalTxNotify(), lntest.WithSkipMempoolCheck(),
```
Since the coop close would fail, there's no need to skip the mempool check because the old txid is still in the mempool?
We still need this here, just to skip the final assertion in CloseChannelAssertPending. With the API, a close only fails (gRPC error) when we can't pay for the specified fee. Spurious RBF updates are suppressed in the form of a response that returns the target fee rate.
Crypt-iQ left a comment
LGTM pending linter, nice work 🦖
This fixes an issue in the itests in the restart case. We'd see an error like:

```
2025-03-12 23:41:10.754 [ERR] PFSM state_machine.go:661: FSM(rbf_chan_closer(2f20725d9004f7fda7ef280f77dd8d419fd6669bda1a5231dd58d6f6597066e0:0)): Unable to apply event err="invalid state transition: received *chancloser.SendOfferEvent while in ClosingNegotiation(local=LocalOfferSent(proposed_fee=0.00000193 BTC), remote=ClosePending(txid=07229915459cb439bdb8ad4f5bf112dc6f42fca0192ea16a7d6dd05e607b92ae, party=Remote, fee_rate=1 sat/vb))"
```

We resolve this by waiting to send in the new request until the old one has been completed.
Force-pushed 82cb727 to 7060765.
Final PR split off from #8453. This contains commits that pertain primarily to exposing the new RBF coop close to the end user via updates to the rpcserver, server, and peer.