commitlog: Support traversal without opening the log #1103
I am not sure I understand what this does. Does this defer the cost of
Oh I see, it's just not opening the last segment for writing?
Right -- we usually want to open the log for writing, and later get an iterator. Not covered in this patch is replaying and then getting a writable commitlog without traversing the last segment twice.
Traversing the commitlog without also making it available for writing would still require upfront I/O imposed by the `open` constructor. Avoid that by introducing free-standing functions which start traversal right away.
The latter I would actually not try, because bootstrapping becomes quite convoluted (replay db without durability -> obtain valid commitlog writer -> set as db durability). Instead, we can make this faster using an offset index. @lcodes Do you need
    offset: u64,
) -> io::Result<impl Iterator<Item = Result<Commit, error::Traversal>>> {
    commitlog::commits_from(repo::Fs::new(root), DEFAULT_LOG_FORMAT_VERSION, offset)
}
Is the only difference between these and the `commitlog::` variants the abstraction over the repo?
Would it be better to just have the ones in `commitlog` for this and let the caller call `repo::Fs::new` manually?
(In my case, a custom `TarRepo` is used to consume the logs without extracting them from the `*.tar` archive first.)
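To make the question concrete, here is a minimal sketch of what "generic over the repo" could look like. Everything in it is hypothetical: the `Repo` trait, `MemRepo`, and `segments_from` are simplified stand-ins written for this comment, not the crate's actual types or signatures. The in-memory repo plays the role a tar-backed one would.

```rust
use std::collections::BTreeMap;
use std::io::{self, Cursor, Read};

/// Hypothetical, simplified repo abstraction (NOT the crate's real trait):
/// anything that can list its segments and open one for reading.
pub trait Repo {
    type Segment: Read;
    /// Offset of the first commit in each segment, in ascending order.
    fn segment_offsets(&self) -> io::Result<Vec<u64>>;
    fn open_segment(&self, offset: u64) -> io::Result<Self::Segment>;
}

/// In-memory repo; a tar-backed repo would implement the same trait.
pub struct MemRepo {
    pub segments: BTreeMap<u64, Vec<u8>>,
}

impl Repo for MemRepo {
    type Segment = Cursor<Vec<u8>>;

    fn segment_offsets(&self) -> io::Result<Vec<u64>> {
        Ok(self.segments.keys().copied().collect())
    }

    fn open_segment(&self, offset: u64) -> io::Result<Self::Segment> {
        self.segments
            .get(&offset)
            .cloned()
            .map(Cursor::new)
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "no such segment"))
    }
}

/// Free-standing traversal: no writer is opened, and a segment is only read
/// when the iterator reaches it. Starts at the segment that may contain
/// `offset` and continues through all later segments.
pub fn segments_from<R: Repo>(
    repo: R,
    offset: u64,
) -> io::Result<impl Iterator<Item = io::Result<(u64, Vec<u8>)>>> {
    let offsets = repo.segment_offsets()?;
    // Index of the last segment that starts at or before `offset`.
    let start = offsets.partition_point(|&o| o <= offset).saturating_sub(1);
    Ok(offsets
        .into_iter()
        .skip(start)
        .map(move |o| -> io::Result<(u64, Vec<u8>)> {
            let mut buf = Vec::new();
            repo.open_segment(o)?.read_to_end(&mut buf)?;
            Ok((o, buf))
        }))
}
```

With this shape, a filesystem convenience wrapper would only construct the default repo and forward to the generic function, so callers with a custom repo lose nothing by calling the generic one directly.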
Most of what you’re using isn’t exported from the crate. We can change that, but I’ll probably be pestered to add docstrings everywhere 😿
    })
}

pub fn transactions_from<'a, R, D, T>(
Is it worth having these `transactions_from` variants here? Maybe I'm not aware of existing code to persist the table schema state, or of how to skip entire tx ranges using these functions, but it seems that unless callers can persist their decoder state, they always have to replay from 0. For code under development, where a full replay is needed after every code change, going through the full snapshot on every iteration is too slow, and `--release` isn't friendly inside the debugger.
For analytics, the custom decoder calls `decode_record_fn` and only replays records for the system tables to create the schemas needed to decode rows, and processes everything else on the fly from within the `Visitor`.
For information, I'm only using `commits`. This is because, given the large snapshot at the beginning of the log, it starts at offset 0 to load the schema information, then skips to offset 163 to load the relevant table's snapshot, and then skips all the way to the first post-snapshot commit. This saves most of the startup time without having to persist the state of the schemas, i.e. many millions of records don't have to be decoded at all before getting to the interesting data.
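The skipping pattern described above can be sketched roughly as follows. This is an illustration under assumptions, not the crate's API: `Commit` is reduced to a single offset field, `commits_from` here is a stand-in over an in-memory slice, and `commits_in_ranges` is a hypothetical helper. The range boundaries other than 163 are placeholders the caller would have to supply.

```rust
use std::ops::Range;

/// Simplified stand-in for a commit: only the offset of its first transaction.
#[derive(Debug, Clone, PartialEq)]
pub struct Commit {
    pub min_tx_offset: u64,
}

/// Stand-in for a `commits_from`-style function over an in-memory log that is
/// sorted by offset: yields every commit at or after `offset`.
pub fn commits_from(log: &[Commit], offset: u64) -> impl Iterator<Item = &Commit> {
    let start = log.partition_point(|c| c.min_tx_offset < offset);
    log[start..].iter()
}

/// Hypothetical helper: restart traversal at the beginning of each wanted
/// range and stop at its end, so commits between ranges are never decoded.
pub fn commits_in_ranges<'a>(
    log: &'a [Commit],
    ranges: &'a [Range<u64>],
) -> impl Iterator<Item = &'a Commit> + 'a {
    ranges.iter().flat_map(move |r| {
        commits_from(log, r.start).take_while(move |c| c.min_tx_offset < r.end)
    })
}
```

A caller would pass one range for the schema records starting at 0, one starting at 163 for the relevant table's snapshot, and an open-ended range beginning at the first post-snapshot commit.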
Ah, never mind most of that; I just realized `Decoder.decode_record` is meant to call `decode_record_fn`, and the tx could be filtered there instead.
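A rough sketch of filtering inside the decoder, assuming a much-simplified trait: the `Decoder` trait below, its `decode_record` signature, and the `SkipBefore` and `Utf8` types are all hypothetical and written only for illustration. The crate's real `Decoder` and `decode_record_fn` will differ.

```rust
/// Hypothetical, simplified decoder trait (NOT the crate's real signature).
pub trait Decoder {
    type Record;
    /// Returns `None` when the record is skipped or cannot be decoded.
    fn decode_record(&self, tx_offset: u64, bytes: &[u8]) -> Option<Self::Record>;
}

/// Toy decoder that interprets the payload as UTF-8 text.
pub struct Utf8;

impl Decoder for Utf8 {
    type Record = String;

    fn decode_record(&self, _tx_offset: u64, bytes: &[u8]) -> Option<String> {
        String::from_utf8(bytes.to_vec()).ok()
    }
}

/// Wrapper that skips every transaction before `first_wanted` without
/// invoking the inner decoder at all.
pub struct SkipBefore<D> {
    pub inner: D,
    pub first_wanted: u64,
}

impl<D: Decoder> Decoder for SkipBefore<D> {
    type Record = D::Record;

    fn decode_record(&self, tx_offset: u64, bytes: &[u8]) -> Option<D::Record> {
        if tx_offset < self.first_wanted {
            return None; // filtered out before any decoding work happens
        }
        self.inner.decode_record(tx_offset, bytes)
    }
}
```

The point of the wrapper is that the filter lives in one place and the inner decoder stays unaware of which offsets are interesting.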
`_from` is useful for forensics and for replication.