clp-s: Add support for decompression in ascending timestamp order. #440
Conversation
decompression_options.add_options()(
        "ordered",
        po::bool_switch(&m_ordered_decompression),
        "Enable in-order decompression for this archive"
Do we want to make it clear that "order" is based on timestamp?
Yeah, I had some vague idea that we'd transparently use timestamp or MOT depending on what was available in the archive, but being explicit is best for now.
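For illustration, a minimal sketch of what a more explicit option description could look like (the exact wording here is an assumption, not necessarily what was merged):

decompression_options.add_options()(
        "ordered",
        po::bool_switch(&m_ordered_decompression),
        // Spell out that "ordered" means ascending timestamp order
        "Enable decompression in ascending timestamp order for this archive"
);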
*/
epochtime_t get_next_timestamp() const { return m_get_timestamp(); }

bool done() const { return m_cur_message >= m_num_messages; }
Do we want to add a description for this method?
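As a sketch, a description along these lines would fit (the wording is an assumption):

/**
 * @return Whether all messages in this reader have been consumed.
 */
bool done() const { return m_cur_message >= m_num_messages; }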
auto& schema_reader | ||
= create_schema_reader(schema_id, should_extract_timestamp, should_marshal_records); | ||
create_schema_reader( | ||
m_schema_reader, |
Do you want to rename it to initialize_schema_reader?
    return m_schema_reader;
}

std::vector<std::shared_ptr<SchemaReader>> ArchiveReader::load_all_tables() {
Do you want to rename it to read_all_tables (we have read_table above)?
@@ -49,8 +51,11 @@ class JsonConstructor {
    void store();

private:
    void construct_in_order(FileWriter& writer);
Can you add a description for this method?
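Something along these lines, as a sketch (the exact wording is an assumption):

/**
 * Reads records from the archive and writes them to the given file in
 * ascending timestamp order.
 * @param writer
 */
void construct_in_order(FileWriter& writer);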
Great work! Since we only support ascending order for timestamps, should we explicitly state this in the command-line description and the commit message?
For the commit message, what about "clp-s: Add support for decompression in ascending timestamp order."?
…-scope#440) Co-authored-by: wraymo <[email protected]>
Description
This PR adds support to decompress a clp-s archive in timestamp order with the --ordered command line option. This is implemented by loading all tables into memory, putting them into a min-heap ordered by the timestamp of the next record in each table, and popping from the heap until all records have been decompressed. The timestamp is pulled from the authoritative timestamp column specified at compression time; if a table does not contain such a column, its timestamp is 0 for ordering purposes.

Note: we don't sort by timestamp within each table, so the logs can end up slightly out of timestamp order (this is probably preferred anyway, since it brings us closer to log order).
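To make the merge concrete, below is a self-contained toy sketch of the heap-based ordering described above. The Table type, its pop_next method, and the sample data are stand-ins for illustration only; the PR's actual SchemaReader interface decodes records lazily and marshals them to JSON.

#include <cstdint>
#include <cstdio>
#include <memory>
#include <queue>
#include <vector>

using epochtime_t = int64_t;

// Toy stand-in for a decompressed table: a pre-decoded run of record
// timestamps. This only models the ordering logic.
struct Table {
    std::vector<epochtime_t> timestamps;
    size_t cur{0};

    bool done() const { return cur >= timestamps.size(); }
    epochtime_t get_next_timestamp() const { return timestamps[cur]; }
    epochtime_t pop_next() { return timestamps[cur++]; }
};

int main() {
    // Two tables whose records interleave in time. A table without an
    // authoritative timestamp column would report timestamp 0 here.
    std::vector<std::shared_ptr<Table>> tables{
            std::make_shared<Table>(Table{{1, 4, 6}}),
            std::make_shared<Table>(Table{{2, 3, 5}})
    };

    // std::priority_queue is a max-heap, so invert the comparison to pop the
    // table with the smallest next-record timestamp first.
    auto cmp = [](std::shared_ptr<Table> const& a, std::shared_ptr<Table> const& b) {
        return a->get_next_timestamp() > b->get_next_timestamp();
    };
    std::priority_queue<
            std::shared_ptr<Table>,
            std::vector<std::shared_ptr<Table>>,
            decltype(cmp)>
            heap{cmp};
    for (auto const& table : tables) {
        if (false == table->done()) {
            heap.push(table);
        }
    }

    // Pop the table with the smallest next timestamp, emit one record, then
    // re-insert the table keyed on its new next timestamp.
    while (false == heap.empty()) {
        auto table = heap.top();
        heap.pop();
        std::printf("%lld\n", static_cast<long long>(table->pop_next()));
        if (false == table->done()) {
            heap.push(table);
        }
    }
    return 0;
}

This prints 1 through 6 in ascending order. Each pop/re-push costs O(log k) for k tables, which is consistent with the small measured overhead below.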
This seems to add minimal decompression speed overhead (about 2.5%) for fairly typical archives, though it would likely have higher overhead for large archives or archives with many tables (due to extra memory allocations compared to decompressing one table at a time). Similarly, there can be significant memory overhead, particularly for large archives.
Validation performed