clp-s: Add support for chunking output into different files during timestamp-ordered decompression #451

Merged: gibber9809 merged 7 commits into y-scope:main on Jun 25, 2024

gibber9809
Contributor

Description

This PR adds support for chunking the output of timestamp-ordered decompression into several files, where each file has at most the number of records specified in the command line argument. The argument --ordered-chunk-split-threshold <value> can be used in conjunction with the --ordered argument during decompression to trigger this feature.

Validation performed

  • Tested edge case where every record ends up in the same chunk
  • Tested edge case where num_records % chunk_size == 0
  • Tested case where num_records % chunk_size > 0
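
For readers who want to reason about the three cases above, here is a minimal sketch of the chunk arithmetic. The helper `chunk_sizes` is hypothetical and is not part of this PR; it only assumes that each chunk holds at most `chunk_size` records and that the last chunk holds whatever remains.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the PR): returns how many records land in each
// output chunk when `num_records` records are split into chunks of at most
// `chunk_size` records.
std::vector<std::size_t> chunk_sizes(std::size_t num_records, std::size_t chunk_size) {
    // Assumption: a chunk size of zero is treated as a programming error here.
    assert(chunk_size > 0 && "chunk_size must be positive");
    std::vector<std::size_t> sizes;
    while (num_records > 0) {
        // Fill the current chunk completely, or with whatever is left.
        std::size_t const n = num_records < chunk_size ? num_records : chunk_size;
        sizes.push_back(n);
        num_records -= n;
    }
    return sizes;
}
```

Under these assumptions, 5 records with a chunk size of 10 give one chunk, 6 records with a chunk size of 3 give two full chunks, and 7 records with a chunk size of 3 give two full chunks plus a final chunk of one record.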

gibber9809 requested a review from wraymo June 18, 2024 18:58
Contributor

wraymo left a comment

Great work! Most of the comments are about style changes.

components/core/src/clp_s/CommandLineArguments.cpp (outdated, resolved)
components/core/src/clp_s/CommandLineArguments.cpp (outdated, resolved)
Comment on lines 84 to 98
auto finish_chunk = [&](bool open_new_writer) {
    writer.close();
    std::string new_file_name = std::string(src_path) + "_" + std::to_string(first_timestamp)
                                + "_" + std::to_string(last_timestamp) + ".jsonl";
    auto new_file_path = std::filesystem::path(new_file_name);
    std::error_code ec;
    std::filesystem::rename(src_path, new_file_path, ec);
    if (ec) {
        throw OperationFailed(ErrorCodeFailure, __FILE__, __LINE__, ec.message());
    }

    if (open_new_writer) {
        writer.open(src_path, FileWriter::OpenMode::CreateForWriting);
    }
};
Contributor

Do you have any concerns about making it a private method?

Contributor Author

I think it's easier to read this way, but if you prefer a private method I can change it.
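
To make the pattern in this thread easier to follow outside the codebase, here is a rough standalone sketch of a writer that finalizes each chunk by renaming it to `<src_path>_<first_timestamp>_<last_timestamp>.jsonl`, as the lambda under review does. Everything else is an assumption for illustration: the `Record` struct and the `write_chunks` function are hypothetical, `std::ofstream` stands in for `FileWriter`, `std::runtime_error` stands in for `OperationFailed`, and the surrounding loop is a guess at how the lambda might be driven rather than the code in `JsonConstructor.cpp`.

```cpp
#include <cstddef>
#include <cstdint>
#include <filesystem>
#include <fstream>
#include <stdexcept>
#include <string>
#include <system_error>
#include <vector>

// Illustrative record type (an assumption, not the PR's data structure).
struct Record {
    std::int64_t timestamp;
    std::string json;
};

// Writes `records` (assumed to be sorted by ascending timestamp) to `src_path`
// and renames the file to "<src_path>_<first>_<last>.jsonl" every time
// `chunk_size` records have been written. Returns the finished chunk paths.
std::vector<std::filesystem::path> write_chunks(
        std::vector<Record> const& records,
        std::filesystem::path const& src_path,
        std::size_t chunk_size
) {
    if (0 == chunk_size) {
        throw std::invalid_argument("chunk_size must be positive");
    }
    std::vector<std::filesystem::path> chunk_paths;
    if (records.empty()) {
        return chunk_paths;
    }

    std::ofstream writer(src_path, std::ios::trunc);
    std::int64_t first_timestamp = 0;
    std::int64_t last_timestamp = 0;
    std::size_t num_records_in_chunk = 0;

    // Same shape as the lambda under review: close, rename, optionally reopen.
    auto finish_chunk = [&](bool open_new_writer) {
        writer.close();
        auto new_file_path = std::filesystem::path(
                src_path.string() + "_" + std::to_string(first_timestamp) + "_"
                + std::to_string(last_timestamp) + ".jsonl"
        );
        std::error_code ec;
        std::filesystem::rename(src_path, new_file_path, ec);
        if (ec) {
            throw std::runtime_error(ec.message());
        }
        chunk_paths.push_back(new_file_path);
        if (open_new_writer) {
            writer.open(src_path, std::ios::trunc);
        }
    };

    for (std::size_t i = 0; i < records.size(); ++i) {
        if (0 == num_records_in_chunk) {
            first_timestamp = records[i].timestamp;
        }
        last_timestamp = records[i].timestamp;
        writer << records[i].json << '\n';
        ++num_records_in_chunk;
        if (num_records_in_chunk == chunk_size) {
            // Only reopen the writer if more records remain.
            finish_chunk(i + 1 < records.size());
            num_records_in_chunk = 0;
        }
    }
    if (num_records_in_chunk > 0) {
        finish_chunk(false);  // Flush the final, partially filled chunk.
    }
    return chunk_paths;
}
```

Keeping `finish_chunk` as a lambda lets it capture `writer`, `first_timestamp`, and `last_timestamp` by reference; a private method would need those passed in or stored as members, which is the trade-off being discussed here.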

components/core/src/clp_s/JsonConstructor.cpp (outdated, resolved)
components/core/src/clp_s/JsonConstructor.cpp (outdated, resolved)
components/core/src/clp_s/JsonConstructor.cpp (outdated, resolved)
components/core/src/clp_s/CommandLineArguments.hpp (outdated, resolved)
        po::value<size_t>(&m_ordered_chunk_split_threshold)
                ->default_value(m_ordered_chunk_split_threshold),
        "Number of records to include in each output chunk when decompressing records "
        "in order"
Contributor

Suggested change
- "in order"
+ "in timestamp ascending order"

Contributor Author

I replaced it with "in ascending timestamp order" instead.

components/core/src/clp_s/CommandLineArguments.cpp (outdated, resolved)
gibber9809 requested a review from wraymo June 20, 2024 15:21
decompression_options.add_options()(
        "ordered",
        po::bool_switch(&m_ordered_decompression),
        "Enable decompression in ascending timestamp order for this archive"
)(
        "ordered-chunk-split-threshold",
Contributor

Do you want to update the name of this argument?

gibber9809 requested a review from wraymo June 21, 2024 16:05
Contributor

wraymo left a comment

For the PR title, what about "clp-s: Add support for chunking output into different files during timestamp-ordered decompression"?

gibber9809 changed the title from "clp-s: Support chunking output into different files during timestamp-ordered decompression" to "clp-s: Add support for chunking output into different files during timestamp-ordered decompression" on Jun 25, 2024
gibber9809 merged commit 01d5737 into y-scope:main Jun 25, 2024
11 checks passed
jackluo923 pushed a commit to jackluo923/clp that referenced this pull request Dec 4, 2024