Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

clp-s: Add support for decompression in ascending timestamp order. #440

Merged
merged 8 commits into from
Jun 14, 2024

Conversation

gibber9809
Copy link
Contributor

Description

This PR adds support to decompress a clp-s archive in timestamp order with the --ordered command line option. This is implemented by loading all tables into memory, putting them into a min-heap ordered by the timestamp of the next record in each table, and popping from the heap until all records have been decompressed. The timestamp is pulled from the authoritative timestamp column specified at compression time, and if a table does not contain such a column the timestamp is 0 for ordering purposes.

Note: we don't sort by timestamp within each table, so the logs can end up slightly out of timestamp order (this is probably preferred anyway, since it brings us closer to log order).

This seems to add minimal decompression speed overhead (about 2.5%) for fairly typical archives, though would likely have higher overhead for large archives or archives with many tables (due to extra memory allocations compared to decompressing one table at a time). Similarly there can be significant memory overhead, particularly for large archives.

Validation performed

  • Validated that records end up sorted in timestamp order after decompression
  • Validated that in order decompression results in identical set of records compared to out of order decompression

@gibber9809 gibber9809 requested a review from wraymo June 13, 2024 19:24
decompression_options.add_options()(
"ordered",
po::bool_switch(&m_ordered_decompression),
"Enable in-order decompression for this archive"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to make it clear that "order" is based on timestamp?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, I had some vague idea that we'd transparently use timestamp or MOT depending on what was available in the archive, but being explicit is best for now.

*/
epochtime_t get_next_timestamp() const { return m_get_timestamp(); }

bool done() const { return m_cur_message >= m_num_messages; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we want to add a description for this method?

auto& schema_reader
= create_schema_reader(schema_id, should_extract_timestamp, should_marshal_records);
create_schema_reader(
m_schema_reader,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to rename it to initialize_schema_reader?

return m_schema_reader;
}

std::vector<std::shared_ptr<SchemaReader>> ArchiveReader::load_all_tables() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you want to rename it to read_all_tables (we have read_table above)?

components/core/src/clp_s/ArchiveReader.cpp Outdated Show resolved Hide resolved
@@ -49,8 +51,11 @@ class JsonConstructor {
void store();

private:
void construct_in_order(FileWriter& writer);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a description for this method?

@gibber9809 gibber9809 requested a review from wraymo June 13, 2024 23:54
Copy link
Contributor

@wraymo wraymo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Since we only support ascending order for timestamps, should we explicitly state this in the command-line description and the commit message?

@gibber9809 gibber9809 changed the title clp-s: Implement decompression in timestamp order. clp-s: Implement decompression in ascending timestamp order. Jun 14, 2024
@gibber9809 gibber9809 requested a review from wraymo June 14, 2024 15:16
Copy link
Contributor

@wraymo wraymo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the commit message, what about "clp-s: Add support for decompression in ascending timestamp order."?

@gibber9809 gibber9809 changed the title clp-s: Implement decompression in ascending timestamp order. clp-s: Add support for decompression in ascending timestamp order. Jun 14, 2024
@gibber9809 gibber9809 merged commit 8cfa96c into y-scope:main Jun 14, 2024
11 checks passed
jackluo923 pushed a commit to jackluo923/clp that referenced this pull request Dec 4, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants