feat(clp-s): Chunk output by size (in bytes) during ordered decompression. #600

Merged: gibber9809 merged 10 commits into y-scope:main on Nov 20, 2024

@gibber9809 gibber9809 commented Nov 19, 2024

Description

This PR changes chunking during ordered decompression to target a chunk size in bytes instead of a number of records. Generating chunks this way should give the log viewer more consistent memory usage and computation cost per chunk.

Chunks are generated by appending records until the total chunk size reaches or exceeds the value given by the --ordered-chunk-size argument. That is, --ordered-chunk-size acts as a threshold rather than a hard limit: each chunk is closed once it reaches this size, so chunks may slightly exceed it, and the final chunk may be smaller.

As before, specifying --ordered-chunk-size 0 (which is also the default) will result in the output not being divided into chunks.
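The chunking rule described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from JsonConstructor.cpp: chunk_by_size is a hypothetical helper, and records are modelled as already-serialized strings held in memory rather than written to output files.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not from the PR): groups serialized records into
// chunks by accumulated byte size. A target of 0 disables chunking.
std::vector<std::vector<std::string>> chunk_by_size(
        std::vector<std::string> const& records,
        std::size_t target_chunk_size
) {
    std::vector<std::vector<std::string>> chunks;
    std::vector<std::string> current;
    std::size_t chunk_size{};

    for (auto const& record : records) {
        current.push_back(record);
        chunk_size += record.length();  // length() counts bytes

        // Close the chunk only after appending the record that reaches the
        // target, so a chunk can overshoot the target by up to one record.
        if (0 != target_chunk_size && chunk_size >= target_chunk_size) {
            chunks.push_back(std::move(current));
            current.clear();
            chunk_size = 0;
        }
    }

    // Flush the trailing partial chunk without ever emitting an empty one.
    if (false == current.empty()) {
        chunks.push_back(std::move(current));
    }
    return chunks;
}
```

With records of 4, 2, 6 and 1 bytes and a target of 5, this yields three chunks of 6, 6 and 1 bytes; with a target of 0, all four records land in a single chunk.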

Validation performed

  • Validated that all records for an archive are decompressed into a single chunk when using --ordered-chunk-size 0
  • Validated that generated chunk size closely corresponds to requested size
  • Validated that --target-chunk-size argument still functions correctly in the package.

Summary by CodeRabbit

  • New Features

    • Enhanced command-line argument parsing for improved user experience.
    • Clearer error messages for invalid inputs and specific command requirements.
    • Updated help text for command-line options to clarify their usage, including units for chunk sizes.
  • Bug Fixes

    • Improved error handling for file reading, including specific messages for file not found and opening errors.
  • Chores

    • Streamlined variable initialization for better code maintainability.

coderabbitai bot commented Nov 19, 2024

Walkthrough

The pull request modifies the command-line argument parsing in the CommandLineArguments class, enhancing error handling and input validation for various commands. It includes improved logging for file reading errors and clearer descriptions of command options. Additionally, the initialization of member variables in both CommandLineArguments and JsonConstructorOption has been updated to default initialization. The JsonConstructor class also sees changes in variable management within the construct_in_order method, focusing on chunk size handling.

Changes

File Path | Change Summary
--- | ---
components/core/src/clp_s/CommandLineArguments.cpp | Enhanced error handling and logging for file reading; refined command-line option parsing and validation.
components/core/src/clp_s/CommandLineArguments.hpp | Renamed m_ordered_chunk_size to m_target_ordered_chunk_size and switched it to default initialization.
components/core/src/clp_s/JsonConstructor.cpp | Modified variable initialization and control flow in construct_in_order; streamlined chunk handling logic.
components/core/src/clp_s/JsonConstructor.hpp | Renamed ordered_chunk_size to target_ordered_chunk_size and switched it to default initialization.
components/clp-package-utils/clp_package_utils/scripts/decompress.py | Updated help text for --target-chunk-size argument to clarify unit of measurement.
components/clp-package-utils/clp_package_utils/scripts/native/decompress.py | Updated help text for --target-chunk-size argument to clarify unit of measurement.
components/core/src/clp_s/clp-s.cpp | Renamed parameter option.ordered_chunk_size to option.target_ordered_chunk_size in main function.
components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py | Renamed command-line argument from --ordered-chunk-size to --target-ordered-chunk-size in command construction.

Possibly related PRs

  • feat(clp-package): Add support for extracting JSON streams from archives. #569: The changes in this PR involve adding JSON extraction capabilities, which relate to the modifications in the main PR that enhance command-line argument parsing for options like target-ordered-chunk-size. Both PRs focus on improving the handling of command-line arguments and options related to data extraction processes.

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b3a36ff and 3ae4886.

📒 Files selected for processing (1)
  • components/core/src/clp_s/CommandLineArguments.cpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/core/src/clp_s/CommandLineArguments.cpp

@gibber9809 gibber9809 requested a review from haiqi96 November 19, 2024 18:10
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
components/core/src/clp_s/JsonConstructor.hpp (1)

33-33: Add documentation for ordered_chunk_size member

While the initialization change to default initialization is good, this new functionality needs documentation. Consider adding a comment explaining that this member controls the chunk size in bytes during ordered decompression, and that a value of 0 means no chunking.

Apply this diff to add documentation:

+    /// Size in bytes for chunking during ordered decompression.
+    /// When set to 0 (default), output will not be divided into chunks.
     size_t ordered_chunk_size{};
components/core/src/clp_s/JsonConstructor.cpp (1)

162-164: Document potential chunk size variance

Consider adding a comment explaining that chunks may slightly exceed the requested size since we finalize only after exceeding the threshold. This behaviour aligns with the PR objectives but should be documented for maintainability.

+    // Note: Actual chunk size may slightly exceed ordered_chunk_size since we
+    // always include the complete record that causes the size threshold to be exceeded
     if (0 != m_option.ordered_chunk_size && chunk_size >= m_option.ordered_chunk_size) {
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 12a5f8d and e67de5c.

📒 Files selected for processing (4)
  • components/core/src/clp_s/CommandLineArguments.cpp (1 hunks)
  • components/core/src/clp_s/CommandLineArguments.hpp (1 hunks)
  • components/core/src/clp_s/JsonConstructor.cpp (2 hunks)
  • components/core/src/clp_s/JsonConstructor.hpp (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • components/core/src/clp_s/CommandLineArguments.hpp
🔇 Additional comments (4)
components/core/src/clp_s/JsonConstructor.cpp (3)

84-86: LGTM: Clean variable initialization

The initialization of tracking variables follows modern C++ practices and provides clear intent.


168-168: LGTM: Proper handling of final chunk

The condition correctly ensures that partial chunks are properly finalized while preventing the creation of empty files.


152-160: Verify buffer length accumulation for multi-byte characters

The implementation correctly accumulates buffer lengths for chunk size tracking. However, let's verify that the buffer length calculation handles multi-byte characters consistently.

✅ Verification successful

Let me check the specific UTF-8 handling in the JsonConstructor and related components.


Let me check how the buffer is populated and if there's any string encoding conversion happening in the message processing chain.


Let me check one final aspect - how the actual message content is handled in the schema reader implementation.


Buffer length calculation is safe for multi-byte characters

The implementation is correct as it uses std::string::length() which returns the number of bytes (not characters) in the UTF-8 encoded string. The chunk size accumulation is byte-accurate since:

  1. The message content is handled as raw bytes throughout the processing chain
  2. No character-level operations or encoding conversions are performed
  3. The buffer's byte length is directly used for chunk size calculations
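To illustrate the byte-versus-character distinction this verification relies on, here is a small sketch. count_code_points is a hypothetical helper that is not part of clp-s, and it assumes its input is valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts Unicode code points in a valid UTF-8 string by
// skipping continuation bytes, which always have the bit pattern 10xxxxxx.
std::size_t count_code_points(std::string const& utf8) {
    std::size_t count{};
    for (unsigned char const byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;
        }
    }
    return count;
}
```

For the string "h\xC3\xA9llo" (the word "héllo" encoded as UTF-8), std::string::length() reports 6 bytes while count_code_points reports 5 characters. Byte-based chunk sizing uses the former, which matches what ends up on disk.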
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any UTF-8 or multi-byte character handling in the codebase
rg -l "utf-?8|encoding|multi.*byte|mb.*len|wcs.*len" --type cpp

Length of output: 2737


Script:

#!/bin/bash
# Check how buffer and string handling is done in JsonConstructor and related files
rg -A 5 "buffer\.(length|size|c_str)" components/core/src/clp_s/JsonConstructor.cpp
rg -A 5 "std::string.*buffer" components/core/src/clp_s/JsonConstructor.cpp

# Check if there's any specific string encoding handling
rg -A 5 "encoding|utf-?8" components/core/src/clp_s/JsonConstructor.cpp components/core/src/clp_s/JsonParser.cpp

Length of output: 2136


Script:

#!/bin/bash
# Check how messages are processed and how the buffer is populated
rg -A 10 "get_next_message.*buffer" components/core/src/clp_s/
rg -A 10 "class.*SchemaReader" components/core/src/clp_s/

# Check if there's any string encoding conversion in the schema reader
rg -A 10 "class.*JsonReader" components/core/src/clp_s/

Length of output: 2578


Script:

#!/bin/bash
# Check the SchemaReader implementation for message handling
rg -A 10 "void.*get_next_message" components/core/src/clp_s/SchemaReader.cpp components/core/src/clp_s/SchemaReader.hpp

# Look for any string encoding or conversion in the message processing chain
rg -A 5 "string.*message" components/core/src/clp_s/SchemaReader.cpp components/core/src/clp_s/SchemaReader.hpp

Length of output: 5055

components/core/src/clp_s/CommandLineArguments.cpp (1)

308-309: LGTM! Clear and accurate description update.

The updated description accurately reflects that the chunk size is measured in bytes, which aligns with the PR's objective of modifying the chunking behavior to focus on byte-size rather than record count.

components/core/src/clp_s/JsonConstructor.hpp (outdated; resolved)
-        if (0 != m_option.ordered_chunk_size
-            && num_records_marshalled >= m_option.ordered_chunk_size)
-        {
+        if (0 != m_option.ordered_chunk_size && chunk_size >= m_option.ordered_chunk_size) {
Contributor

Maybe we can remove 0 != m_option.ordered_chunk_size. See comments below

Contributor Author

I think it's worth keeping this behaviour because it's pretty much the only way an actual CLI user (including me) will use this command.

Contributor

Sure

@@ -30,7 +30,7 @@ struct JsonConstructorOption {
     std::string archive_id;
     std::string output_dir;
     bool ordered{false};
-    size_t ordered_chunk_size{0};
+    size_t ordered_chunk_size{};
Contributor

@haiqi96 haiqi96 Nov 19, 2024

Nit: @kirkrodrigues, I wonder whether we should give it a non-zero default value. A default of 0 may look unintuitive to users.

If ordered decompression has no use case other than the log viewer, we could use a default like 128 MB, which is what CLP uses. Or we could just set it to a large number.

Contributor Author

I'd prefer to keep the current default behaviour. We can add more to the description of the argument to describe that 0 gives you no chunking.

I think it makes more sense for the package to pick its behaviour inside of the package config instead of forcing it into the CLI.

Contributor

@haiqi96 haiqi96 Nov 19, 2024

Right, now I agree that it should be in the package config. It isn't there yet, though. Maybe I should change this in another PR so that the default is configurable through the package.

haiqi96 commented Nov 19, 2024

Can we agree on a new default for the log viewer? It's currently set to 100,000 records.

Please also update the default values in components/clp-package-utils/clp_package_utils/scripts/decompress.py

@@ -299,7 +299,7 @@ def main(argv):
     json_extraction_parser = command_args_parser.add_parser(EXTRACT_JSON_CMD)
     json_extraction_parser.add_argument("archive_id", type=str, help="Archive ID")
     json_extraction_parser.add_argument(
-        "--target-chunk-size", type=int, help="Target chunk size.", required=True
+        "--target-chunk-size", type=int, help="Target chunk size (B)."
Contributor

I am fine with removing the default value, but can you make sure the script still works without specifying target-chunk-size?

Contributor Author

Yep, already tested on my end and it works.

haiqi96 previously approved these changes Nov 19, 2024

@haiqi96 haiqi96 left a comment

The change looks good to me. @kirkrodrigues do you want to do another pass?

components/core/src/clp_s/CommandLineArguments.cpp (outdated; resolved)
@@ -178,7 +178,7 @@ class CommandLineArguments {
     size_t m_max_document_size{512ULL * 1024 * 1024};  // 512 MB
     bool m_structurize_arrays{false};
     bool m_ordered_decompression{false};
-    size_t m_ordered_chunk_size{0};
+    size_t m_ordered_chunk_size{};
Member

How about renaming this to m_target_ordered_chunk_size to indicate that it's a threshold that may be exceeded? Ditto for the CLI option name and JsonConstructorOption.

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (3)
components/core/src/clp_s/JsonConstructor.cpp (1)

162-166: Consider improving readability of chunk size condition

While the logic is correct, the condition could be more readable by combining the checks into a single expression or using early return.

Here's a suggested improvement:

-        if (0 != m_option.target_ordered_chunk_size
-            && chunk_size >= m_option.target_ordered_chunk_size)
+        const bool should_finalize_chunk = (m_option.target_ordered_chunk_size > 0 
+            && chunk_size >= m_option.target_ordered_chunk_size);
+        if (should_finalize_chunk)
components/core/src/clp_s/CommandLineArguments.cpp (2)

305-309: Enhance the description of target-ordered-chunk-size parameter.

The description is clear but could be more specific about the behaviour when the chunk size is exceeded. Consider adding details about the actual chunking behaviour.

-                    "Chunk size (B) for each output file when decompressing records in log order."
-                    " When set to 0, no chunking is performed."
+                    "Target chunk size (B) for each output file when decompressing records in log order."
+                    " Records are added to a chunk until the size exceeds this value."
+                    " When set to 0, no chunking is performed."

372-375: Improve error message clarity for the ordered-chunk-size validation.

The error message should reference the exact command-line flag name for better user experience.

-                throw std::invalid_argument("ordered-chunk-size must be used with ordered argument"
+                throw std::invalid_argument("--target-ordered-chunk-size must be used with --ordered"
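The validation discussed here can be sketched in isolation as follows. DecompressionArgs and validate are hypothetical stand-ins for illustration, not the real CommandLineArguments class, and the exact condition is an assumption inferred from the error message: a non-zero chunk size is rejected unless ordered decompression was requested.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical stand-in for the parsed options; not the real
// CommandLineArguments class.
struct DecompressionArgs {
    bool ordered{false};
    std::size_t target_ordered_chunk_size{};
};

// Assumed rule: chunking only makes sense for ordered decompression, so a
// non-zero target without --ordered is treated as a usage error.
void validate(DecompressionArgs const& args) {
    if (0 != args.target_ordered_chunk_size && false == args.ordered) {
        throw std::invalid_argument(
                "--target-ordered-chunk-size must be used with --ordered"
        );
    }
}
```

Naming the flags exactly as the user typed them, including the leading dashes, lets the message be matched against the help text directly.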
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 29bbe19 and 81826fc.

📒 Files selected for processing (6)
  • components/core/src/clp_s/CommandLineArguments.cpp (2 hunks)
  • components/core/src/clp_s/CommandLineArguments.hpp (2 hunks)
  • components/core/src/clp_s/JsonConstructor.cpp (2 hunks)
  • components/core/src/clp_s/JsonConstructor.hpp (1 hunks)
  • components/core/src/clp_s/clp-s.cpp (1 hunks)
  • components/job-orchestration/job_orchestration/executor/query/extract_stream_task.py (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/core/src/clp_s/CommandLineArguments.hpp
🔇 Additional comments (5)
components/core/src/clp_s/JsonConstructor.hpp (2)

33-33: LGTM! The variable name accurately reflects its purpose.

The rename to target_ordered_chunk_size better indicates that this is a target size that may be slightly exceeded during chunking, which aligns well with the size-based chunking behaviour.


33-33: Size validation could be strengthened

The previous review comment about adding size validation to prevent excessive memory usage is still applicable.

components/core/src/clp_s/JsonConstructor.cpp (3)

84-86: LGTM: Clean variable initialization

The default initialization of tracking variables follows modern C++ practices and maintains clear intent.


170-171: LGTM: Proper handling of final chunk

The condition ensures that any remaining data is properly handled and no records are lost, which is essential for data integrity.


152-160: 🛠️ Refactor suggestion

Consider integer overflow protection

When accumulating chunk_size, there's a potential risk of integer overflow when adding buffer.length(). Consider adding a check to ensure the addition won't overflow.

Here's a suggested improvement:

-        chunk_size += buffer.length();
+        if (chunk_size > std::numeric_limits<size_t>::max() - buffer.length()) {
+            throw OperationFailed(
+                ErrorCodeFailure,
+                __FILE__,
+                __LINE__,
+                "Chunk size would overflow"
+            );
+        }
+        chunk_size += buffer.length();
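The overflow guard suggested above can also be written as a standalone helper. This is only a sketch: checked_add is a hypothetical name, and it reports failure through its return value instead of throwing, since the constructor signature of OperationFailed is not shown in this conversation.

```cpp
#include <cstddef>
#include <limits>

// Hypothetical helper: stores lhs + rhs in `out` and returns true, or returns
// false without touching `out` if the sum would wrap around.
bool checked_add(std::size_t lhs, std::size_t rhs, std::size_t& out) {
    // Rearranged so the comparison itself cannot overflow.
    if (lhs > std::numeric_limits<std::size_t>::max() - rhs) {
        return false;
    }
    out = lhs + rhs;
    return true;
}
```

Unsigned overflow is well defined in C++ (it wraps), so the risk here is not undefined behaviour but a silently wrong chunk_size that would delay or prevent chunk finalization.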

Comment on lines +68 to 69
command.append("--target-ordered-chunk-size")
command.append(str(extract_json_config.target_chunk_size))
Contributor

💡 Codebase verification

Potential inconsistency found in argument name transition

There appears to be a remaining reference to the old ordered-chunk-size in an error message within components/core/src/clp_s/CommandLineArguments.cpp. While the parameter name has been updated to target-ordered-chunk-size, the error message still uses the old terminology, which could cause confusion.

  • components/core/src/clp_s/CommandLineArguments.cpp: Update error message to use the new argument name target-ordered-chunk-size instead of ordered-chunk-size
🔗 Analysis chain

The argument name change aligns with PR objectives

The update from --ordered-chunk-size to --target-ordered-chunk-size correctly implements the requested change for byte-based chunking. The implementation properly handles the optional nature of the parameter.

Let's verify the consistency of this change across the codebase:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the argument name change is consistent across the codebase
# Expected: No occurrences of the old argument name remain

# Check for any remaining instances of the old argument name
rg --type py --type cpp --type hpp "ordered-chunk-size"

# Verify the new argument name is used consistently
rg --type py --type cpp --type hpp "target-ordered-chunk-size"

Length of output: 174


Script:

#!/bin/bash
# Let's try without the hpp extension and use a more focused search
# Check for any remaining instances of the old argument name
echo "Searching for old argument name:"
rg --type py --type cpp "ordered-chunk-size"

echo -e "\nSearching for new argument name:"
rg --type py --type cpp "target-ordered-chunk-size"

# Also search in any potential test files or configuration
echo -e "\nSearching in all files (including others):"
rg "ordered-chunk-size"

Length of output: 1408

components/core/src/clp_s/clp-s.cpp (resolved)
@kirkrodrigues kirkrodrigues left a comment

For the PR title, how about:

feat(clp-s): Chunk output by size (in bytes) during ordered decompression.

@gibber9809 gibber9809 changed the title from "feat(clp-s): Chunk output during ordered decompression by size in bytes." to "feat(clp-s): Chunk output by size (in bytes) during ordered decompression." Nov 20, 2024
@gibber9809 gibber9809 merged commit ee7e493 into y-scope:main Nov 20, 2024
21 checks passed
3 participants