commitlog: Support traversal without opening the log #1103
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -80,6 +80,13 @@ impl<T> Commitlog<T> { | |
/// Open the log at root directory `root` with [`Options`]. | ||
/// | ||
/// The root directory must already exist. | ||
/// | ||
/// Note that opening a commitlog involves I/O: some consistency checks are | ||
/// performed, and the next writing position is determined. | ||
/// | ||
/// This is only necessary when opening the commitlog for writing. See the | ||
/// free-standing functions in this module for how to traverse a read-only | ||
/// commitlog. | ||
pub fn open(root: impl Into<PathBuf>, opts: Options) -> io::Result<Self> { | ||
let inner = commitlog::Generic::open(repo::Fs::new(root), opts)?; | ||
|
||
|
@@ -390,3 +397,85 @@ impl<T: Send + Sync + 'static> Commitlog<T> { | |
rx | ||
} | ||
} | ||
|
||
/// Obtain an iterator which traverses the commitlog located at the `root` | ||
/// directory from the start, yielding [`Commit`]s. | ||
/// | ||
/// Starts the traversal without the upfront I/O imposed by [`Commitlog::open`]. | ||
/// See [`Commitlog::commits`] for more information. | ||
pub fn commits(root: impl Into<PathBuf>) -> io::Result<impl Iterator<Item = Result<Commit, error::Traversal>>> { | ||
commits_from(root, 0) | ||
} | ||
|
||
/// Obtain an iterator which traverses the commitlog located at the `root` | ||
/// directory starting from `offset` and yielding [`Commit`]s. | ||
/// | ||
/// Starts the traversal without the upfront I/O imposed by [`Commitlog::open`]. | ||
/// See [`Commitlog::commits_from`] for more information. | ||
pub fn commits_from( | ||
root: impl Into<PathBuf>, | ||
offset: u64, | ||
) -> io::Result<impl Iterator<Item = Result<Commit, error::Traversal>>> { | ||
commitlog::commits_from(repo::Fs::new(root), DEFAULT_LOG_FORMAT_VERSION, offset) | ||
} | ||
Is the only difference between these and the `commitlog::` variants the abstraction over the repo? Would it be better to just have the ones in `commitlog` for this and let the caller call (In my case, a custom `TarRepo` is used to consume the logs without extracting them from the `*.tar` archive first)
Most of what you’re using isn’t exported from the crate. We can change that, but I’ll probably be molested to add docstrings everywhere 😿 |
||
|
||
/// Obtain an iterator which traverses the commitlog located at the `root` | ||
/// directory from the start, yielding [`Transaction`]s. | ||
/// | ||
/// Starts the traversal without the upfront I/O imposed by [`Commitlog::open`]. | ||
/// See [`Commitlog::transactions`] for more information. | ||
pub fn transactions<'a, D, T>( | ||
root: impl Into<PathBuf>, | ||
de: &'a D, | ||
) -> io::Result<impl Iterator<Item = Result<Transaction<T>, D::Error>> + 'a> | ||
where | ||
D: Decoder<Record = T>, | ||
D::Error: From<error::Traversal>, | ||
T: 'a, | ||
{ | ||
transactions_from(root, 0, de) | ||
} | ||
|
||
/// Obtain an iterator which traverses the commitlog located at the `root` | ||
/// directory starting from `offset` and yielding [`Transaction`]s. | ||
/// | ||
/// Starts the traversal without the upfront I/O imposed by [`Commitlog::open`]. | ||
/// See [`Commitlog::transactions_from`] for more information. | ||
pub fn transactions_from<'a, D, T>( | ||
root: impl Into<PathBuf>, | ||
offset: u64, | ||
de: &'a D, | ||
) -> io::Result<impl Iterator<Item = Result<Transaction<T>, D::Error>> + 'a> | ||
where | ||
D: Decoder<Record = T>, | ||
D::Error: From<error::Traversal>, | ||
T: 'a, | ||
{ | ||
commitlog::transactions_from(repo::Fs::new(root), DEFAULT_LOG_FORMAT_VERSION, offset, de) | ||
} | ||
|
||
/// Traverse the commitlog located at the `root` directory from the start and | ||
/// "fold" its transactions into the provided [`Decoder`]. | ||
/// | ||
/// Starts the traversal without the upfront I/O imposed by [`Commitlog::open`]. | ||
/// See [`Commitlog::fold_transactions`] for more information. | ||
pub fn fold_transactions<D>(root: impl Into<PathBuf>, de: D) -> Result<(), D::Error> | ||
where | ||
D: Decoder, | ||
D::Error: From<error::Traversal> + From<io::Error>, | ||
{ | ||
fold_transactions_from(root, 0, de) | ||
} | ||
|
||
/// Traverse the commitlog located at the `root` directory starting from `offset` | ||
/// and "fold" its transactions into the provided [`Decoder`]. | ||
/// | ||
/// Starts the traversal without the upfront I/O imposed by [`Commitlog::open`]. | ||
/// See [`Commitlog::fold_transactions_from`] for more information. | ||
pub fn fold_transactions_from<D>(root: impl Into<PathBuf>, offset: u64, de: D) -> Result<(), D::Error> | ||
where | ||
D: Decoder, | ||
D::Error: From<error::Traversal> + From<io::Error>, | ||
{ | ||
commitlog::fold_transactions_from(repo::Fs::new(root), DEFAULT_LOG_FORMAT_VERSION, offset, de) | ||
} |
Is it worth having these `transactions_from` variants here? Maybe I'm not aware of existing code to persist the tables' schema state, or how to skip entire tx ranges using these functions, but it seems that unless the caller can persist their decoder state, they always have to replay from 0. For code under development, where a full replay is needed on code changes, going through the full snapshot on every iteration is too slow, and `--release` isn't friendly inside the debugger.
For analytics, the custom decoder calls `decode_record_fn` and only replays records for the system tables to create the schemas needed to decode rows, and processes everything else on the fly from within the Visitor.
For information, I'm only using `commits`. This is because, given the large snapshot at the beginning of the log, it starts at offset 0 to load the schema information, then skips to offset 163 to load the relevant table's snapshot, and then skips all the way to the first post-snapshot commit. This saves most of the startup time without having to persist the state of the schemas, i.e. many millions of records don't have to be decoded at all before getting to the interesting data.
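The skipping strategy described in this comment can be sketched roughly as follows. This is a self-contained model, not the crate's API: `MockCommit`, `mock_commits_from`, and `read_ranges` are hypothetical stand-ins, and the sketch only assumes that a traversal can be started cheaply at any offset.

```rust
/// Hypothetical stand-in for a commit, identified only by the offset of its
/// first transaction.
#[derive(Debug, Clone, PartialEq)]
pub struct MockCommit {
    pub min_tx_offset: u64,
}

/// Hypothetical stand-in for `commits_from`: yields the commits whose offset
/// is greater than or equal to `offset`.
pub fn mock_commits_from<'a>(
    log: &'a [MockCommit],
    offset: u64,
) -> impl Iterator<Item = &'a MockCommit> + 'a {
    log.iter().skip_while(move |c| c.min_tx_offset < offset)
}

/// For each `(start, count)` pair, start a fresh traversal at `start` and
/// take at most `count` commits. Everything in between is never visited.
pub fn read_ranges<'a>(log: &'a [MockCommit], ranges: &[(u64, usize)]) -> Vec<&'a MockCommit> {
    let mut out = Vec::new();
    for &(start, count) in ranges {
        out.extend(mock_commits_from(log, start).take(count));
    }
    out
}
```

With ranges such as `[(0, 1), (163, 1), (first_post_snapshot, usize::MAX)]`, only the schema commit, the relevant snapshot commit, and the post-snapshot tail are visited.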
Ah, never mind most of that, I just realized `Decoder.decode_record` is meant to call `decode_record_fn`, and the tx could be filtered there instead.
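Filtering inside the decoder might look roughly like the sketch below. `MockDecoder` is a simplified, hypothetical trait and does not match the signature of the real `Decoder`; `SkipRange` and `replay` are likewise made up for illustration.

```rust
use std::ops::Range;

/// Simplified, hypothetical stand-in for a decoder. The real `Decoder`
/// trait has a different signature.
pub trait MockDecoder {
    type Record;

    /// Return `Some(record)` to keep the transaction at `tx_offset`,
    /// or `None` to skip it without decoding.
    fn decode_record(&self, tx_offset: u64, raw: &[u8]) -> Option<Self::Record>;
}

/// Decoder that ignores every transaction whose offset falls in `skip`.
pub struct SkipRange {
    pub skip: Range<u64>,
}

impl MockDecoder for SkipRange {
    type Record = String;

    fn decode_record(&self, tx_offset: u64, raw: &[u8]) -> Option<String> {
        if self.skip.contains(&tx_offset) {
            return None; // filtered out: the payload is never decoded
        }
        Some(String::from_utf8_lossy(raw).into_owned())
    }
}

/// Feed every `(offset, payload)` pair to the decoder and collect what it keeps.
pub fn replay<D: MockDecoder>(de: &D, raw_log: &[(u64, &[u8])]) -> Vec<D::Record> {
    raw_log
        .iter()
        .filter_map(|(offset, raw)| de.decode_record(*offset, raw))
        .collect()
}
```

Note that in this model the skipped transactions are still read from the log; only the decoding work is avoided.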
`_from` is useful for forensics and for replication.
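As an illustration of the replication case, a follower that remembers the next offset it needs can resume from there instead of replaying from 0. The sketch below is a toy model under that assumption: `Follower` and `catch_up` are hypothetical, and the leader's log is reduced to a slice of offsets.

```rust
/// Hypothetical follower that tracks the next offset it has not yet applied.
pub struct Follower {
    pub next_offset: u64,
    pub applied: Vec<u64>,
}

impl Follower {
    /// Apply everything in `leader_log` at or after `next_offset`, in the
    /// spirit of a `*_from(offset)` traversal. Returns how many entries
    /// were applied.
    pub fn catch_up(&mut self, leader_log: &[u64]) -> usize {
        let start = self.next_offset;
        let mut count = 0;
        for &offset in leader_log.iter().skip_while(|&&o| o < start) {
            self.applied.push(offset);
            self.next_offset = offset + 1;
            count += 1;
        }
        count
    }
}
```

Calling `catch_up` repeatedly only applies the entries that appeared since the previous call.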