commitlog: Support traversal without opening the log #1103

kim · 2024-04-17T14:46:35Z

Traversing the commitlog without also making it available for writing would still require upfront I/O imposed by the open constructor.

Avoid that by introducing free-standing functions which start traversal right away.

cloutiertyler · 2024-04-18T07:20:58Z

I am not sure I understand what this does. Does this defer the cost of open until the first .next call? I couldn't really work it out from the code at a glance.

cloutiertyler · 2024-04-18T07:22:13Z

Oh I see, it's just not opening the last segment for writing?

kim · 2024-04-18T09:34:50Z

Right -- we usually want to open the log for writing, and later get an iterator.

Not covered in this patch is replaying and then getting a writable commitlog without traversing the last segment twice.


kim · 2024-04-18T12:17:57Z

The latter I would actually not try, because bootstrapping becomes quite convoluted (replay db without durability -> obtain valid commitlog writer -> set as db durability). Instead, we can make this faster using an offset index.
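To make the offset-index idea concrete, here is a minimal sketch. It assumes a sparse index that maps the transaction offset of sampled commits to their byte position in a segment, so that a reader can seek close to a target offset instead of scanning from the start. `OffsetIndex`, `append`, and `seek_position` are hypothetical names for illustration and are not taken from the commitlog crate.

```rust
/// Hypothetical sparse index: `(tx offset, byte position)` pairs,
/// kept sorted by tx offset. Not the crate's actual data structure.
#[derive(Debug, Default)]
pub struct OffsetIndex {
    entries: Vec<(u64, u64)>,
}

impl OffsetIndex {
    /// Record that the commit starting at `tx_offset` begins at `byte_pos`.
    /// Entries are expected to be appended in increasing offset order.
    pub fn append(&mut self, tx_offset: u64, byte_pos: u64) {
        debug_assert!(self.entries.last().map_or(true, |&(last, _)| last < tx_offset));
        self.entries.push((tx_offset, byte_pos));
    }

    /// Byte position of the last indexed commit whose offset is
    /// `<= tx_offset`: the place to seek to before scanning forward.
    /// Returns `None` if no indexed commit starts at or before `tx_offset`.
    pub fn seek_position(&self, tx_offset: u64) -> Option<u64> {
        // Number of entries whose offset is <= tx_offset (binary search).
        let n = self.entries.partition_point(|&(off, _)| off <= tx_offset);
        n.checked_sub(1).map(|i| self.entries[i].1)
    }
}
```

With such an index, finding the starting point for a traversal is a binary search rather than a scan over every commit that precedes the requested offset.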

@lcodes Do you need commitlog::Generic to be public?

lcodes · 2024-04-19T14:10:28Z

crates/commitlog/src/lib.rs

+    offset: u64,
+) -> io::Result<impl Iterator<Item = Result<Commit, error::Traversal>>> {
+    commitlog::commits_from(repo::Fs::new(root), DEFAULT_LOG_FORMAT_VERSION, offset)
+}


Is the only difference between these and the commitlog:: variants the abstraction over the repo?

Would it be better to just have the ones in commitlog for this and let the caller call repo::Fs::new manually?

(In my case, a custom TarRepo is used to consume the logs without extracting them from the *.tar archive first)
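The question above is about which layer should be generic over the repo. As a rough sketch of that shape, the following uses a hypothetical `Repo` trait and an in-memory `MemRepo` standing in for something like a tar-backed repo. None of these names or signatures are taken from the crate, and the traversal yields raw segment bytes rather than decoded commits.

```rust
use std::collections::BTreeMap;
use std::io::{self, Cursor, Read};

/// Hypothetical stand-in for a segment repository: it can list segments
/// and open one for reading. The crate's real trait may look different.
pub trait Repo {
    type Reader: Read;
    /// Offsets of all existing segments, in ascending order.
    fn segment_offsets(&self) -> io::Result<Vec<u64>>;
    /// Open the segment starting at `offset` for reading only.
    fn open_segment(&self, offset: u64) -> io::Result<Self::Reader>;
}

/// In-memory repo, playing the role of a custom (e.g. archive-backed) repo.
#[derive(Default)]
pub struct MemRepo {
    pub segments: BTreeMap<u64, Vec<u8>>,
}

impl Repo for MemRepo {
    type Reader = Cursor<Vec<u8>>;

    fn segment_offsets(&self) -> io::Result<Vec<u64>> {
        Ok(self.segments.keys().copied().collect())
    }

    fn open_segment(&self, offset: u64) -> io::Result<Self::Reader> {
        self.segments
            .get(&offset)
            .cloned()
            .map(Cursor::new)
            .ok_or_else(|| io::Error::new(io::ErrorKind::NotFound, "no such segment"))
    }
}

/// Read-only traversal, generic over the repo: yields the raw bytes of each
/// segment in order. Nothing is opened for writing.
pub fn segments<R: Repo>(repo: R) -> io::Result<impl Iterator<Item = io::Result<Vec<u8>>>> {
    let offsets = repo.segment_offsets()?;
    Ok(offsets.into_iter().map(move |offset| -> io::Result<Vec<u8>> {
        let mut buf = Vec::new();
        repo.open_segment(offset)?.read_to_end(&mut buf)?;
        Ok(buf)
    }))
}
```

In this shape, a filesystem-specific convenience function would only construct the repo and forward to the generic one, which is the trade-off being asked about.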

kim · 2024-04-19

Most of what you're using isn't exported from the crate. We can change that, but I'll probably be pestered to add docstrings everywhere 😿

lcodes · 2024-04-19T14:28:20Z

crates/commitlog/src/commitlog.rs

+    })
+}
+
+pub fn transactions_from<'a, R, D, T>(


Is it worth having these transactions_from variants here? Maybe I'm not aware of existing code to persist the table schema state, or of how to skip entire tx ranges using these functions, but it seems that unless callers can persist their decoder state, they always have to replay from 0. For code under development, where a full replay is needed after every code change, going through the full snapshot on each iteration is too slow, and --release isn't friendly inside the debugger.

For analytics, the custom decoder calls decode_record_fn and only replays records for the system tables to create the schemas needed to decode rows, and processes everything else on the fly from within the Visitor.

For reference, I'm only using commits. Given the large snapshot at the beginning of the log, the traversal starts at offset 0 to load the schema information, then skips to offset 163 to load the relevant table's snapshot, and then skips all the way to the first post-snapshot commit. This saves most of the startup time without having to persist the schema state, i.e. many millions of records don't have to be decoded at all before getting to the interesting data.
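The skip pattern described here can be sketched as restarting a commit traversal at a few chosen offsets. The `Commit` struct, the slice-based `commits_from`, and `replay_windows` below are simplified, hypothetical stand-ins and not the crate's API. In this toy version the skipping is a linear scan over an in-memory slice, so it only illustrates the call pattern, not the I/O savings.

```rust
/// Hypothetical, simplified commit: a run of `n` transactions starting at
/// `min_tx_offset`. The real `Commit` type carries more than this.
#[derive(Clone, Debug, PartialEq)]
pub struct Commit {
    pub min_tx_offset: u64,
    pub n: u64,
}

/// Stand-in for a `commits_from`-style function over an in-memory log:
/// iterate commits, starting with the first one that ends after `offset`.
pub fn commits_from(log: &[Commit], offset: u64) -> impl Iterator<Item = &Commit> {
    log.iter().skip_while(move |c| c.min_tx_offset + c.n <= offset)
}

/// Visit only commits inside the given half-open `[start, end)` offset
/// windows, restarting the traversal at each window's start. Returns the
/// starting offsets of the visited commits.
pub fn replay_windows(log: &[Commit], windows: &[(u64, u64)]) -> Vec<u64> {
    let mut visited = Vec::new();
    for &(start, end) in windows {
        for commit in commits_from(log, start).take_while(|c| c.min_tx_offset < end) {
            visited.push(commit.min_tx_offset);
        }
    }
    visited
}
```

A caller would pass one window for the schema commits, one for the relevant table's snapshot, and an open-ended one (ending at `u64::MAX`) for everything after the snapshot.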

Ah, never mind most of that: I just realized Decoder.decode_record is meant to call decode_record_fn, and the tx could be filtered there instead.

kim · 2024-04-19

The _from variants are useful for forensics and for replication.

commitlog: Support traversal without opening the log

161582d

Traversing the commitlog without also making it available for writing would still require upfront I/O imposed by the `open` constructor. Avoid that by introducing free-standing functions which start traversal right away.

kim force-pushed the kim/commitlog2/iter-ro branch from 3c5a9e8 to 161582d Compare April 18, 2024 10:18

lcodes approved these changes Apr 19, 2024


kim added this pull request to the merge queue Apr 19, 2024

github-merge-queue bot removed this pull request from the merge queue due to failed status checks Apr 19, 2024

kim added this pull request to the merge queue Apr 19, 2024

Merged via the queue into master with commit 06d5481 Apr 19, 2024
6 checks passed

kim deleted the kim/commitlog2/iter-ro branch April 20, 2024 01:03
