-
Notifications
You must be signed in to change notification settings - Fork 421
Add ChannelDataPersister trait and point ChainMonitor to it. #681
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add ChannelDataPersister trait and point ChainMonitor to it. #681
Conversation
Codecov Report
@@ Coverage Diff @@
## master #681 +/- ##
==========================================
+ Coverage 91.34% 91.36% +0.01%
==========================================
Files 37 37
Lines 21974 22067 +93
==========================================
+ Hits 20073 20162 +89
- Misses 1901 1905 +4
Continue to review full report at Codecov.
|
|
just confirming that there's no plan to persist anything before the funding tx is constructed. i.e. if the node restarts after a channel open but before |
Channel data is persisted at the same time as before this PR. Looking in |
595db8a to
5bb24fb
Compare
ariard
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approach ACK.
I still need to think about the consistency requirements with regards to higher layer of channel monitors but overall the flexibility offered is super cool :)
| if let Some(orig_monitor) = monitors.get_mut(&outpoint) { | ||
| log_trace!(self.logger, "Updating Channel Monitor for channel {}", log_funding_info!(orig_monitor)); | ||
| if let Err(e) = orig_monitor.update_monitor(&update, &self.broadcaster, &self.logger) { | ||
| return Err(e) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per changes at https://github.com/rust-bitcoin/rust-lightning/pull/679/files#diff-dfe4c687889c83110048d0bb7a250ee2R147 we shouldn't return early, instead opting to store the new updated ChannelMonitor. Somewhat awkwardly, we probably need to only do so on some potential MonitorUpdateErrors - We should probably migrate towards those errors containing a flag to indicate if the situation implies a bogus update or simply a locked channel.e
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok cool, I interpret this as: it'd be a good idea to add that "bogus update or locked channel" flag you mention in this PR.
|
Looking great so far to me, bu definitely curious about the persistence pipeline for memory -> disk during state negotiation procedures. I think that is where revisiting incremental updates would be very helpful, as those ought to be way less performance-heavy. |
|
This is great. My progress had been blocked with the lack of this feature in RL. Previously I could not know precisely when I should take SCB, because when a channel is closed there is no way to know about it besides polling the ChannelManager. |
5bb24fb to
bd7e4b9
Compare
Hmm this PR isn't doesn't have the feature of notifying you on channel close (well, technically you'll get a callback on force close, but not on coop close). I filed #695 |
jkczyz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Running out of steam tonight. Will continue reviewing tomorrow.
35b34be to
3ae7184
Compare
ff43ee6 to
627871c
Compare
lightning/src/ln/data_persister.rs
Outdated
| /// supposed to be persisting bytes of data, backup sources may be considered | ||
| /// "in consensus" if a majority of them agree on the current state and are on | ||
| /// the highest known commitment number. See individual methods for more info. | ||
| pub trait ChannelDataPersister: Send + Sync { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
At the risk of being pedantic: how about simply ChannelPersister? In other words, would removing the word "data" lead to any ambiguity? Or should this rather be called ChannelMonitorPersister if it needs to differentiate from persisting Channels?
From a different angle, data_persister as used here as a module name and later as a variable name is ambiguous when reading code. Generally speaking, "data" usually can be substituted with some abstraction (e.g. channel) to provide clarity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, should we factor in that we intend to persist other data through this module in the future as well (e.g. ChannelManagers, Channels, etc)? Maybe it should just be called the Persister?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would say that ChannelMonitor and Channel should be persisted atomically together, otherwise inconsistency could result. So that could be part of this trait.
Perhaps the non-channel-specific things in ChannelManager should be persisted using a different trait. That could be named e.g. NodePersister or such, since it includes non-channel relevant things like the announcement serial number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point. The thing is that iiuc, stuff that needs to be persisted is pretty much divided into ChannelMonitors (which need to be time sensitively persisted whenever a channel is updated) and everything else, which should be persisted sometimes and would suck to lose but isn't at the same level of sensitivity.
How often the user wants to persist the less-sensitive stuff is subjective. We could make that decision for them (like the ChannelDataPersister does) but different users may have different fault-tolerance preferences. Not sure what the solution is here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok discussed offline, update:
There's only 2 structs that need to be persisted: (1) the ChannelMonitors, and (2) the ChannelManager. The ChannelMonitors are sensitive data, and them being out of date could result in funds lost. However, the ChannelManager is not sensitive, and if e.g. a node starts up with an out of date ChannelManager, this will result in channels being force closed, but there's no chance for loss of funds.
So basically this PR is gonna focus on persisting ChannelMonitors, and we'll figure out how ChannelManager persistence is gonna work later. And the remaining open question is what to name the ChannelDataPersister.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I see several fields in Channel that might be critical for correct functioning so need persistence, including:
channel_statechannel_outboundvalue_to_self_msatshort_channel_id(also needed to restoreChannelHoldercorrectly)
and maybe others. They probably won't cause loss of funds if out of date, but the system just won't work when restarted and restored from disk.
Also, ChannelHolder might have information that is required in order to be able to claim HTLCs, so if those are lost we can lose the HTLC value.
If we lose ChannelManager.last_node_announcement_serial, what is the impact?
Lastly, the KeysManager inside the ChannelManager has some state, including channel_child_index and rand_bytes_child_index. I'm not sure what is the impact if we end up with the same keyset for two channels.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and maybe others. They probably won't cause loss of funds if out of date, but the system just won't work when restarted and restored from disk.
You need to store the whole channel policy as announced to your counterparty, including its own policy, otherwise you may have channel policy discrepancies and your counterparty closing the channel unexpectedly (e.g a max_accepted_htlcs violation).
Also, ChannelHolder might have information that is required in order to be able to claim HTLCs, so if those are lost we can lose the HTLC value.
Good point, you need to store HTLC-relay data too, otherwise you may lost an in-flight update_fail_htlc form an downstream channel to an upstream one.
I don't think we need to store the Events as they should be re-generated if we re-apply the state correctly to every component.
If we lose ChannelManager.last_node_announcement_serial, what is the impact?
BOLT7:
if node_id is NOT previously known from a channel_announcement message, OR if timestamp is NOT greater than the last-received node_announcement from this node_id: SHOULD ignore the message.
Lastly, the KeysManager inside the ChannelManager has some state, including channel_child_index and rand_bytes_child_index. I'm not sure what is the impact if we end up with the same keyset for two channels.
Fund safety, so we should be careful storing it.
lightning/src/ln/data_persister.rs
Outdated
| /// | ||
| /// [`update_id`]: struct.ChannelMonitorUpdate.html#structfield.update_id | ||
| /// [`ChannelMonitor::write_for_disk`]: ../../chain/channelmonitor/struct.ChannelMonitor.html#method.write_for_disk | ||
| fn persist_channel_data(&self, id: OutPoint, data: &ChannelMonitor<Self::Keys>) -> Result<(), ChannelMonitorUpdateErr>; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Depending on the outcome of the above naming discussion, consider naming this persist_new_channel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In our discussion in the call, it seemed like our current idea of what our sample module will look like is: on persist_channel_data, the monitor update will be somehow persisted, and persistence of the whole ChannelManager will be kicked off in the background.
Given this, maybe Persister is the best name, and then indeed we can rename these methods persist_new_channel / update_channel?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, ChainMonitor will use the persister as currently implemented in this PR. ChannelManager can fire an event for persisting the ChannelManager that users can then process in the background. The latter can be implemented in a follow-up PR.
Regarding naming, Persister is probably fine so long as we don't need another trait to persist something else. Or possibly simply Persist to follow the loose rust trait naming conventions, which I went with in the chain module traits. Additionally, this trait could be moved to channelmonitor.rs which would make for a less ambiguous, module-qualified name: channelmonitor::Persist.
I don't have strong opinion yet on the function names. IIRC, we even discussed renaming ChannelMonitor (which could affect these function names), but I'm forgetting where that conversation went. I think I was opposed to it but can't recall the alternative name. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I switched to the names you suggested and channelmonitor::Persist. I think it's much nicer.
| pub struct MonitorUpdateError { | ||
| /// The human-readable error message string. | ||
| pub msg: &'static str, | ||
| /// A monitor update may have an error but still need to be persisted to disk. | ||
| pub persist_updated_monitor: bool | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we make this error an enum instead of adding a bool field. Not sure about naming but maybe something akin to the variants in ChannelMonitorUpdateErr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lmk what you think of the naming.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, w/r/t the naming of ChannelMonitorUpdateErr and MonitorUpdateError -- what do you think of just renaming MonitorUpdateError to InternalMonitorErr?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Based on our offline discussion, I'm starting to become of the mind that these should be reduced to a single error type. My reasoning is as follows:
ChannelMonitor's methods can return errors.chain::Watch's interface is defined in terms ofChannelMonitor,ChannelMonitorUpdate, andChannelMonitorUpdateErr.ChainMonitoris an implementation ofchain::Watch, but there may be other implementations of the trait.
Therefore, whether a ChannelMonitor must be persisted isn't necessarily always an internal detail handled by ChainMonitor; another implementation of chain::Watch or someone using ChannelMonitor directly would also need to do so.
My thought then would be to have one error type where the PermanentFailure variant has a bool indicating if it should be persisted.
That said, I'd like to get a better idea about where in the code that the "failure but persist" error will occur. I couldn't quite make that out from #681 (comment). Additionally, the docs for PermanentFailure indicate it can be used in two different ways. This makes me wonder if it should then be two different variants (and if so would this would remove the need for the bool?).
Thoughts? Let me know if I misunderstood anything.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, I think they're definitely very different in their purpose, though I don't necessarily think that implies they can't be the same type -
MonitorUpdateErroris an error type returned by RL to the user to indicate that the update provided to the monitor was somehow bogus. It instructs the user whether they should still persist the updated object to disk, but in any case the update was not fully applied and ready to move forward.ChannelMonitorUpdateErris an error type generated by the user instructing theChannel{,Manager}what the current state of the monitor update is. It is maybe more aptly namedChannelMonitorUpdateStatus(and replacing theResultwith just a general done/in-flight/failed enum).
The should-persist flag is definitely very awkward, and we should take pains in the documentation to point out that it is only relevant if you are running in a multi-primary redundant ChannelMonitor setup where each ChannelMonitor can unilaterally broadcast a transaction. In other words, for most folks, we can pretend its not there. See #679 for the one case where its ever set/matters. For normal folks, the MonitorUpdateError is just a human-readable error string that's arguably only useful for opening an issue on our github.
Indeed, for more complicated setups, arguably, they start to blend a bit more, but I do think there's value in them being different.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good observation. I like where you're going with making ChannelMonitorUpdateErr a status instead of an error.
May need to think on this more, but just to throw an idea out there: would it make sense to have the chain::Watch interface return something like Result<ChannelMonitorUpdateStatus, MonitorUpdateError>? Possibly renaming these in some way, of course.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Result<ChannelMonitorUpdateStatus, MonitorUpdateError>
Hmm, interesting point. I suppose we could, though it doesn't mean anything to the ChannelManager which receives it - Err(_) would be converted to a PermanentFailure status either way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Result<ChannelMonitorUpdateStatus, MonitorUpdateError>Hmm, interesting point. I suppose we could, though it doesn't mean anything to the
ChannelManagerwhich receives it -Err(_)would be converted to aPermanentFailurestatus either way.
Not 100% sure if I follow that last point. Presumably, ChannelManager would process either the status if the Result is ok or the error if it is err using handle_monitor_err!.
It's the ChainMonitor that currently converts any MonitorUpdateError to a PermanentFailure. In what I'm floating, ChainMonitor would examine the Result and persist if needed and then simply return the Result as is back to ChannelManager.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, my point was only that that would be communicating extra information to the ChannelManager that it doesnt care about - PermanentFailure is the same to it as Err(_).
edb2855 to
4ff66ef
Compare
TheBlueMatt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
New code looks pretty good, a few comments and I still need to thoroughly scan the new documentation, but can do after 649. Don't have strong feelings on naming.
| } | ||
| if should_persist { | ||
| match self.data_persister.update_channel_data(outpoint, &update, orig_monitor) { | ||
| Err(e) => return Err(e), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This must not "upgrade" a PermanentFailure in res to a TemporaryFailure.
lightning-data-persister/src/lib.rs
Outdated
| { | ||
| let mut f = fs::File::create(&tmp_filename)?; | ||
| monitor.write_for_disk(&mut f)?; | ||
| f.sync_all()?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For this to be sensible on OSX, it looks like you need at least rust stdlib 1.36, at least going by this commit: rust-lang/rust@d602a6b. This should be documented somwehre.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I documented this but only by leaving a comment. Do you think it'd be ideal to add a README for gotchas like this? (It'd be somewhat redundant to the comments in the file.)
5cc5671 to
2d88541
Compare
ef92009 to
9ca8891
Compare
jkczyz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Largely happy where this is now! 👍 Happy to discuss the testing strategy more offline.
| // Test failure to rename in the process of atomically creating a channel | ||
| // monitor's file. We induce this failure by making the `tmp` file a | ||
| // directory. | ||
| // Explanation: given "from" = the file being renamed, "to" = the | ||
| // renamee that already exists: Windows should fail because it'll fail | ||
| // whenever "to" is a directory, and Unix should fail because if "from" is a | ||
| // file, then "to" is also required to be a file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't this actually test temp file creation (line 81) rather than the rename (line 85)? It's a good test to have regardless, though I think in both cases creating a read-only version of the temp and final files, respectively, may be sufficient to handle any platform... shooting from the hip on the last part.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think in both cases creating a read-only version of the temp and final files, respectively, may be sufficient to handle any platform
I wasn't sure what this part meant.
(Btw I remembered last night you're on vacation right now, I'm just responding to your responses for posterity). But, good catch and I kept this test and added an actual rename faliure unit test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was just thinking for either test, the presence of a read-only file ("to" or "from" depending on the test) might be sufficient to trigger the case without using a directory. Using a directory is an extra detail for the reader (different type vs invalid permission), so if possible it may be better to avoid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That doesn't seem work, at least on unix. This is what I tried changing the test to: https://i.imgur.com/yOHqQB5.png (sorry not the most readable). Lmk if i'm missing something
ariard
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this really needs to be fixed, otherwise remove_dir_all is wiping out all the intent of persister as it doesn't check for emptiness ?
New tests sound good.
fc5d755 to
d546c91
Compare
Yeah really good catch that Jeff also noticed, fixed. |
07de61f to
ed6101f
Compare
ariard
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tested ACK ed6101f
TheBlueMatt
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. A few nits, only really worth addressing if you're squashing/making other changes anyway. What say you @jkczyz
The utilities will still be part of cfg(test). The purpose of this is to enable other crates in the workspace to use the test utilities.
Implementors of Persist are responsible for persisting new ChannelMonitors and ChannelMonitor updates to disk/backups.
- The ChainMonitor should: Whenever a new channel is added or updated, these updates should be conveyed to the persister and persisted to disk. Even if the update errors while it's being applied, the updated monitor still needs to be persisted.
d3aeaf8 to
8a9e1f3
Compare
jkczyz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. A few nits, only really worth addressing if you're squashing/making other changes anyway. What say you @jkczyz
LGTM. Sorry about the delay! Feel free to address any comments in a follow-up, if needed.
| // Test failure to rename in the process of atomically creating a channel | ||
| // monitor's file. We induce this failure by making the `tmp` file a | ||
| // directory. | ||
| // Explanation: given "from" = the file being renamed, "to" = the | ||
| // renamee that already exists: Windows should fail because it'll fail | ||
| // whenever "to" is a directory, and Unix should fail because if "from" is a | ||
| // file, then "to" is also required to be a file. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Was just thinking for either test, the presence of a read-only file ("to" or "from" depending on the test) might be sufficient to trigger the case without using a directory. Using a directory is an extra detail for the reader (different type vs invalid permission), so if possible it may be better to avoid.
| // Set up a dummy channel and force close. This will produce a monitor | ||
| // that we can then use to test persistence. | ||
| let chanmon_cfgs = create_chanmon_cfgs(2); | ||
| let node_cfgs = create_node_cfgs(2, &chanmon_cfgs); | ||
| let node_chanmgrs = create_node_chanmgrs(2, &node_cfgs, &[None, None]); | ||
| let nodes = create_network(2, &node_cfgs, &node_chanmgrs); | ||
| let chan = create_announced_chan_between_nodes(&nodes, 0, 1, InitFeatures::known(), InitFeatures::known()); | ||
| nodes[1].node.force_close_channel(&chan.2); | ||
| let mut added_monitors = nodes[1].chain_monitor.added_monitors.lock().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid this setup, could you test the persister using write_channel_datalike the earlier tests? Likewise in the next test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm, it's nice to be able to test that it returns a PermanentError (since ultimately that's what we care about)
Intended to be a cross-platform implementation of the channelmonitor::Persist trait. This adds a new lightning-persister crate, that uses the newly exposed lightning crate's test utilities. Notably, this crate is pretty small right now. However, due to future plans to add more data persistence (e.g. persisting the ChannelManager, etc) and a desire to avoid pulling in filesystem usage into the core lightning package, it is best for it to be separated out. Note: Windows necessitates the use of OpenOptions with the `write` permission enabled to `sync_all` on a newly opened channel's data file.
If a persister returns a temporary failure, the channel monitor should be able to be put on ice and then revived later. If a persister returns a permanent failure, the channel should be force closed.
This function does not necessarily write to disk, it can serialize to anything that implements Writer.
8a9e1f3 to
cda8fb7
Compare
|
One or two tests may need a small follow-up, but looks like mostly all comments addressed :) |
Intended to be a simple way for users to know where and how to put their
backup and persistence logic.
Sanity check/architecture reviews appreciated.
TODOS:
std::pathfsync'ing the parent directory on writeDepends on #713 and #649.