clp-s: Implement table packing #466
Conversation
Great work!
@@ -108,6 +108,8 @@ class CommandLineArguments {

     size_t get_ordered_chunk_size() const { return m_ordered_chunk_size; }

+    size_t get_min_table_size() const { return m_minimum_table_size; }
Should the method name be consistent with the variable name?
-    size_t store(ZstdCompressor& compressor) override;
+    void store(ZstdCompressor& compressor) override;

+    size_t get_total_header_size() const override { return sizeof(size_t); }
Do we need to add this method for other classes?
The default implementation that returns 0 works for all of the other column writers. The clp string column writer is the only one that has an extra header to record the size of the encoded variables column.
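To illustrate the pattern being described, here is a minimal sketch — a simplified stand-in for the actual clp-s class hierarchy, with the class bodies reduced to hypothetical skeletons: the base writer supplies a zero default, and only the writer with an extra header overrides it.

```cpp
#include <cstddef>

// Hypothetical, simplified mirror of the pattern discussed above.
class BaseColumnWriter {
public:
    virtual ~BaseColumnWriter() = default;

    // Most writers emit no header, so the default contribution is zero.
    virtual size_t get_total_header_size() const { return 0; }
};

class ClpStringColumnWriter : public BaseColumnWriter {
public:
    // This writer prefixes its data with the size of the encoded variables
    // column, so it reports one extra size_t of header.
    size_t get_total_header_size() const override { return sizeof(size_t); }
};
```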
/**
 * Returns the total size of the header data that will be written to the compressor. This header
 * size plus the sum of sizes returned by add_value is equal to the total size of data that will
After the code change, add_value will not return any size?
It does, but by reference. I'll change it to return by value so that it's less confusing.
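A minimal sketch of the two signatures being contrasted — the class and member names here are hypothetical, chosen only to make the difference concrete:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical simplification of the change discussed above.
class Int64Column {
public:
    // Before: the appended size is reported through a reference parameter,
    // which is easy to overlook at the call site.
    void add_value_by_ref(int64_t value, size_t& size) {
        m_values.push_back(value);
        size = sizeof(int64_t);
    }

    // After: returning the size by value makes the data flow explicit.
    size_t add_value(int64_t value) {
        m_values.push_back(value);
        return sizeof(int64_t);
    }

private:
    std::vector<int64_t> m_values;
};
```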
 * size plus the sum of sizes returned by add_value is equal to the total size of data that will
 * be written to the compressor in bytes.
 *
 * @return the total size of header data that will
The description is not complete?
};
std::sort(schemas.begin(), schemas.end(), comp);

size_t current_table_size = 0;
Do you think we should come up with a better name, since in line 203 we use table_offset?
I'll change them to current_stream_offset and stream_offset.
// table metadata schema
// # num tables <64 bit>
// # [offset into file <64 bit> uncompressed size <64 bit>]+
// # num schemas <64 bit>
// # [table id <64 bit> offset into table <64 bit> schema id <32 bit> num messages <64 bit>]+
Maybe we should add a description where we declare these two variables (covering the format and the usage)?
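For illustration, a minimal sketch of writing metadata in the documented layout. The struct names and the generic `Compressor` parameter are assumptions made for the sketch; only the `write_numeric_value` call and the field order come from the snippets in this thread.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical records mirroring the documented metadata layout.
struct TableMetadata {
    uint64_t offset_into_file;
    uint64_t uncompressed_size;
};

struct SchemaTableMetadata {
    uint64_t table_id;
    uint64_t offset_into_table;
    int32_t schema_id;
    uint64_t num_messages;
};

// Writes the metadata in the order given by the schema comment above.
// `Compressor` stands in for the Zstd-backed writer used by clp-s.
template <typename Compressor>
void write_table_metadata(
        Compressor& compressor,
        std::vector<TableMetadata> const& tables,
        std::vector<SchemaTableMetadata> const& schemas
) {
    compressor.write_numeric_value(static_cast<uint64_t>(tables.size()));
    for (auto const& t : tables) {
        compressor.write_numeric_value(t.offset_into_file);
        compressor.write_numeric_value(t.uncompressed_size);
    }
    compressor.write_numeric_value(static_cast<uint64_t>(schemas.size()));
    for (auto const& s : schemas) {
        compressor.write_numeric_value(s.table_id);
        compressor.write_numeric_value(s.offset_into_table);
        compressor.write_numeric_value(s.schema_id);
        compressor.write_numeric_value(s.num_messages);
    }
}
```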
    m_table_metadata_compressor.write_numeric_value(uncompressed_size);
}

m_table_metadata_compressor.write_numeric_value(schema_metadata.size());
for (auto& [schema_id, num_messages, table_id, table_offset] : schema_metadata) {
Can we make the order of these four values in the tuple consistent with the writing order?
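A hypothetical fragment illustrating the suggestion — the tuple fields are reordered here (an assumption, chosen to match the write sequence in the metadata schema) so the structured binding reads in the same order the values are written:

```cpp
// Fields declared in write order: table id, offset into table, schema id, num messages.
std::vector<std::tuple<uint64_t, uint64_t, int32_t, uint64_t>> schema_metadata;

for (auto& [table_id, table_offset, schema_id, num_messages] : schema_metadata) {
    m_table_metadata_compressor.write_numeric_value(table_id);
    m_table_metadata_compressor.write_numeric_value(table_offset);
    m_table_metadata_compressor.write_numeric_value(schema_id);
    m_table_metadata_compressor.write_numeric_value(num_messages);
}
```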
This reverts commit 794a732.
Great work! One thing to note is that we may need to clarify the distinction between schema and table, as now a table refers to something like a merged table.
 * ask for the same buffer to be reused to read multiple different tables: this can save memory
 * allocations, but can only be used when tables are read one at a time.
 */
std::shared_ptr<char[]> read_table(size_t table_id, bool reuse_buffer);
Can we add descriptions for parameters and the return value?
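A hypothetical sketch of the requested documentation — the parameter descriptions are assumptions inferred from the surrounding discussion, not text from the PR:

```cpp
/**
 * Reads the table with the given ID into a buffer.
 *
 * @param table_id ID of the table to read
 * @param reuse_buffer Whether to reuse the previously allocated buffer;
 *        only safe when tables are read one at a time
 * @return A shared_ptr to a buffer containing the requested table
 */
std::shared_ptr<char[]> read_table(size_t table_id, bool reuse_buffer);
```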
 * Reads table metadata from the provided compression stream. Must be invoked before reading
 * tables.
 */
void read_metadata(ZstdDecompressor& decompressor);

/**
 * Opens a file reader for the tables section. Must be invoked before reading tables.
 */
void open_tables(std::string const& tables_file_path);
Add descriptions for the parameters?
 * @param table_id
 * @param buf
 * @param buf_size
 * @return a shared_ptr to a buffer containing the requested table
It doesn't return any value?
void close();

/**
 * Decompresses a table with a given table_id and returns it. This function must be called
Same here.
FileReader m_tables_reader;
ZstdDecompressor m_tables_decompressor;
TableReaderState m_state{TableReaderState::Uninitialized};
size_t m_previous_table_id{0ULL};
Since we use prev in other places, do you think it's better to change it to m_prev_table_id?
Co-authored-by: wraymo <[email protected]>
Right, yeah that will probably be too confusing to anyone new to this code. We could change the terminology to "streams" or "packed streams" or "packed compression streams" to disambiguate from schema tables, and rename TableReader to PackedStreamReader or something? I kind of like PackedStreamReader since it can hold the double meaning of multiple things packed within a stream, and multiple streams packed together in a file. We should definitely give it some thought and clear up the terminology before merging at any rate.
Yeah, PackedStreamReader is better. Not sure if @kirkrodrigues has any better ideas.
Seems reasonable.
This PR should have everything it needs to get merged, so could you take another look @wraymo? If you don't have the bandwidth to do the final review soon, I'll probably go ahead and make a few more changes so the metadata format is forward-compatible with future encoding plans.
I'm going to go ahead and slightly extend the table packing metadata in the way that I explain in that doc that I shared with you.
@@ -171,6 +175,18 @@ class ArchiveReader {
        bool should_marshal_records
    );

    /**
     * Reads a table with given ID from the table reader. If read_stream is called multiple times in
PackedStreamReader?
 */
using schema_map_it = decltype(m_id_to_schema_writer)::iterator;
std::vector<schema_map_it> schemas;
std::vector<std::tuple<size_t, size_t>> stream_metadata;
Can we reuse PackedStreamMetadata here?
I'd prefer leaving this as is for now, and considering a refactor when we come back to add support for storing different columns as different compression streams.
Co-authored-by: wraymo <[email protected]>
Walkthrough

The changes involve substantial updates to the clp_s component.

Changes
Sequence Diagram(s)

sequenceDiagram
participant User
participant CommandLineArguments
participant ArchiveWriter
participant ArchiveReader
participant PackedStreamReader
User->>CommandLineArguments: Provide --min-table-size
CommandLineArguments->>ArchiveWriter: Set min_table_size
ArchiveWriter->>ArchiveReader: Prepare to read tables
ArchiveReader->>PackedStreamReader: Read packed streams
PackedStreamReader->>ArchiveReader: Return stream data
ArchiveReader->>ArchiveWriter: Write data with updated schema
Actionable comments posted: 12
🧹 Outside diff range and nitpick comments (22)
components/core/src/clp_s/SchemaWriter.cpp (1)

12-16: LGTM: Improved initialization and simplified size accumulation.

The changes to append_message are well-implemented:

- The use of uniform initialization for count and total_size is a good modern C++ practice.
- The simplification of total_size accumulation improves readability and reduces the chance of errors.

These modifications maintain the method's original functionality while enhancing code quality.

Consider using size_t instead of int for count to match the type used for sizes throughout the method:

- int count{};
+ size_t count{};

This change would provide better type consistency and potentially avoid implicit conversions.

Also applies to: 20-20, 25-25
components/core/src/clp_s/SchemaWriter.hpp (2)

49-51: LGTM! Consider adding a unit test.

The addition of the get_total_uncompressed_size method is well-implemented and aligns with the PR objectives. It provides a clean way to retrieve the total in-memory size of the managed tables.

Consider adding a unit test for this new method to ensure it correctly reports the total uncompressed size. Would you like me to generate a sample unit test or open a GitHub issue for this task?
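A hypothetical sketch of what such a test could look like — the Catch2 framework, the test wiring, and the appending helpers are assumptions, not code from this PR:

```cpp
#include <catch2/catch.hpp>

// Sketch only: assumes a default-constructible SchemaWriter and the
// get_total_uncompressed_size accessor discussed above.
TEST_CASE("SchemaWriter tracks total uncompressed size", "[SchemaWriter]") {
    SchemaWriter writer;

    // A freshly constructed writer should report zero bytes.
    REQUIRE(0 == writer.get_total_uncompressed_size());

    // After appending columns and messages, the reported size should equal
    // the sum of the column header sizes and the appended value sizes.
    // (Appending helpers omitted here; they depend on the writer's setup.)
}
```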
55-55: LGTM! Consider adding a comment.

The addition of the m_total_uncompressed_size member variable is appropriate and aligns with the PR objectives. The use of uniform initialization is a good C++ practice.

Consider adding a brief comment to explain the purpose of this member variable, for example:

// Tracks the total uncompressed size of all managed tables
size_t m_total_uncompressed_size{};

components/core/src/clp_s/JsonParser.hpp (1)
35-35: LGTM! Consider adding documentation for the new member variable.

The addition of min_table_size aligns well with the PR objectives for implementing table packing. This new configuration option will allow users to specify the minimum size for tables, which is crucial for the table packing algorithm.

To improve code clarity, consider adding a brief comment explaining the purpose and usage of min_table_size. For example:

// Minimum size (in bytes) for a table before it's considered for packing
size_t min_table_size;

components/core/src/clp_s/ColumnWriter.cpp (3)
9-12: Approved change with a minor suggestion.

The removal of the size_t return type simplifies the method and aligns it with the changes made to other store methods. This is a good improvement.

For consistency with modern C++ style, consider removing the unused size variable:

void Int64ColumnWriter::store(ZstdCompressor& compressor) {
-    size_t size = m_values.size() * sizeof(int64_t);
-    compressor.write(reinterpret_cast<char const*>(m_values.data()), size);
+    compressor.write(
+            reinterpret_cast<char const*>(m_values.data()),
+            m_values.size() * sizeof(int64_t)
+    );
}

This change would eliminate an unnecessary variable and make the code slightly more concise.
19-22: Approved change with a minor suggestion for consistency.

The removal of the size_t return type is a good improvement, consistent with the changes made to Int64ColumnWriter::store.

For consistency, consider removing the unused size variable:

void FloatColumnWriter::store(ZstdCompressor& compressor) {
-    size_t size = m_values.size() * sizeof(double);
-    compressor.write(reinterpret_cast<char const*>(m_values.data()), size);
+    compressor.write(
+            reinterpret_cast<char const*>(m_values.data()),
+            m_values.size() * sizeof(double)
+    );
}

This change would make the code slightly more concise and consistent with the suggested improvement for Int64ColumnWriter::store.
29-32: Approved change with a minor suggestion for consistency.

The removal of the size_t return type is a good improvement, consistent with the changes made to other store methods.

For consistency, consider removing the unused size variable:

void BooleanColumnWriter::store(ZstdCompressor& compressor) {
-    size_t size = m_values.size() * sizeof(uint8_t);
-    compressor.write(reinterpret_cast<char const*>(m_values.data()), size);
+    compressor.write(
+            reinterpret_cast<char const*>(m_values.data()),
+            m_values.size() * sizeof(uint8_t)
+    );
}

This change would make the code slightly more concise and consistent with the suggested improvements for other store methods.

components/core/src/clp_s/ArchiveWriter.hpp (2)
24-24: LGTM! Consider adding documentation for the new option.

The addition of min_table_size to the ArchiveWriterOption struct is appropriate and aligns with the PR objectives. The type and naming are suitable.

Consider adding a brief comment to explain the purpose and expected values of min_table_size. For example:

/// Minimum size threshold for packing tables together (in bytes)
size_t min_table_size;

163-163: LGTM! Consider initializing with a default value.

The addition of m_min_table_size to the ArchiveWriter class is appropriate and consistent with the class's naming conventions.

Consider initializing m_min_table_size with a sensible default value instead of zero. This could prevent potential issues if the value is not explicitly set. For example:

size_t m_min_table_size{1024};  // Default to 1 KB

Ensure that the chosen default value aligns with the expected usage of this parameter.
components/core/src/clp_s/ColumnWriter.hpp (1)

30-47: LGTM! Consider adding more detailed documentation.

The changes to the BaseColumnWriter class improve the design and provide more flexibility. The new add_value signature is more intuitive, and the addition of get_total_header_size allows for separate tracking of header sizes.

Consider adding more detailed documentation for the get_total_header_size method, explaining its purpose and how it relates to the overall size calculation process.

components/core/src/clp_s/ArchiveReader.hpp (2)
95-99: LGTM: Method renaming improves clarity.

The renaming of read_table to read_schema_table enhances code readability and aligns with the discussed terminology improvements.

Consider updating the method's documentation to reflect the new name and clarify its purpose in the context of schema tables.

206-208: LGTM: New member variables support packed stream functionality.

The addition of m_stream_buffer, m_stream_buffer_size, and m_cur_stream_id appropriately supports the new packed stream reading functionality. The use of std::shared_ptr for the buffer and the initialization of size variables to 0 are good practices.

Consider adding brief inline comments to explain the purpose of these new member variables, especially their role in supporting the read_stream method.

components/core/src/clp_s/CommandLineArguments.hpp (1)
178-178: LGTM: New member variable added correctly. Consider using a named constant.

The new m_minimum_table_size member variable is well-implemented. It follows the class's naming conventions and is initialized with a clear value.

For improved readability and maintainability, consider defining a named constant for the 1 MB value. For example:

static constexpr size_t ONE_MEGABYTE = 1ULL * 1024 * 1024;
size_t m_minimum_table_size{ONE_MEGABYTE};  // 1 MB

This approach makes the code more self-documenting and easier to update if needed.
components/core/src/clp_s/clp-s.cpp (1)
92-92: LGTM! Consider adding a comment for consistency.

The addition of the min_table_size option is well-placed and aligns with the PR's objective of implementing table packing. This change allows for configurable minimum table size thresholds, which is crucial for the new feature.

For consistency with other options, consider adding a brief comment explaining the purpose of min_table_size, similar to the comments for other options in the CommandLineArguments class. This would improve code readability and maintainability.

components/core/src/clp_s/JsonParser.cpp (1)
34-34: LGTM! Consider adding a comment for clarity.

The addition of min_table_size to m_archive_options is consistent with the PR objectives and aligns well with the existing code structure. This change supports the new table packing feature by providing a size threshold for combining small tables.

Consider adding a brief comment explaining the purpose of min_table_size for improved code readability. For example:

+ // Minimum size threshold for table packing
m_archive_options.min_table_size = option.min_table_size;
components/core/src/clp_s/SchemaReader.cpp (2)

40-46: LGTM! Consider adding error handling for null buffer.

The changes to the load function look good. The new implementation simplifies buffer management by relying on an external buffer, which could potentially improve performance and reduce memory overhead.

Consider adding a null check for the stream_buffer to ensure it's not empty before dereferencing it. You could add something like this at the beginning of the function:

if (!stream_buffer) {
    throw OperationFailed(ErrorCodeInvalidArgument, __FILENAME__, __LINE__);
}

Line range hint 1-624: Consider refactoring and improving documentation for better maintainability.

While the changes made to the load function are good, the overall complexity of this file suggests that it might benefit from some refactoring and additional documentation. Consider the following suggestions:

- Break down large functions into smaller, more manageable pieces.
- Add more inline comments explaining complex logic, especially in the JSON serialization and schema handling sections.
- Consider creating separate classes for handling different aspects of the schema reading process (e.g., JSON serialization, schema tree management).
- Add comprehensive documentation for public methods, including their purpose, parameters, and return values.

These improvements could enhance the maintainability and readability of the code, making it easier for other developers to understand and modify in the future.
components/core/src/clp_s/CommandLineArguments.cpp (1)

163-167: LGTM! Consider clarifying the option description.

The addition of the --min-table-size option aligns well with the PR objectives for improving compression efficiency. The implementation is correct and consistent with the existing code structure.

Consider slightly modifying the option description to be more explicit:

- "Minimum size (B) for a packed table before it gets compressed."
+ "Minimum size (in bytes) for a packed table before it gets compressed."

This change makes it clearer that the size is specified in bytes, which could be helpful for users unfamiliar with the notation.
components/core/src/clp_s/PackedStreamReader.hpp (2)
80-86
: Use scoped enumeration forPackedStreamReaderState
.Converting
PackedStreamReaderState
to a scoped enumeration (enum class
) enhances type safety and prevents accidental misuse of the enumerators.Apply the following diff:
-enum PackedStreamReaderState { +enum class PackedStreamReaderState { Uninitialized, MetadataRead, PackedStreamsOpened, PackedStreamsOpenedAndMetadataRead, ReadingPackedStreams };Remember to qualify enumerator usage with the enum class name, e.g.,
PackedStreamReaderState::Uninitialized
.
73-73
: Consider markingread_stream
asnodiscard
.The method
read_stream
modifies its parameters by reference. To prevent accidental misuse where the updated parameters are ignored, consider marking the method with[[nodiscard]]
.Apply the following diff:
+[[nodiscard]] void read_stream(size_t stream_id, std::shared_ptr<char[]>& buf, size_t& buf_size);
components/core/src/clp_s/PackedStreamReader.cpp (1)
87-89
: Document the sequential access requirement forstream_id
The condition
if (m_prev_stream_id >= stream_id)
enforces thatstream_id
must be greater thanm_prev_stream_id
, effectively requiring streams to be read in increasing order. If this sequential access is intended, consider documenting this requirement in the class interface or method documentation to inform users.components/core/src/clp_s/ArchiveWriter.cpp (1)
146-180
: Improve documentation clarity for the metadata schemaThe added comments provide detailed information about the packed stream metadata schema. To enhance readability and maintain consistency, consider reformatting the documentation. Aligning the indentation and using bullet points more effectively can make the structure clearer.
📒 Files selected for processing (19)
- components/core/src/clp_s/ArchiveReader.cpp (7 hunks)
- components/core/src/clp_s/ArchiveReader.hpp (4 hunks)
- components/core/src/clp_s/ArchiveWriter.cpp (3 hunks)
- components/core/src/clp_s/ArchiveWriter.hpp (2 hunks)
- components/core/src/clp_s/CMakeLists.txt (1 hunks)
- components/core/src/clp_s/ColumnWriter.cpp (2 hunks)
- components/core/src/clp_s/ColumnWriter.hpp (7 hunks)
- components/core/src/clp_s/CommandLineArguments.cpp (1 hunks)
- components/core/src/clp_s/CommandLineArguments.hpp (2 hunks)
- components/core/src/clp_s/JsonParser.cpp (1 hunks)
- components/core/src/clp_s/JsonParser.hpp (1 hunks)
- components/core/src/clp_s/PackedStreamReader.cpp (1 hunks)
- components/core/src/clp_s/PackedStreamReader.hpp (1 hunks)
- components/core/src/clp_s/SchemaReader.cpp (1 hunks)
- components/core/src/clp_s/SchemaReader.hpp (4 hunks)
- components/core/src/clp_s/SchemaWriter.cpp (1 hunks)
- components/core/src/clp_s/SchemaWriter.hpp (1 hunks)
- components/core/src/clp_s/clp-s.cpp (1 hunks)
- components/core/src/clp_s/search/Output.cpp (1 hunks)
🧰 Additional context used
🪛 cppcheck
components/core/src/clp_s/ArchiveWriter.cpp
[error] 191-191: Exception thrown in function declared not to throw exceptions.
(throwInNoexceptFunction)
🔇 Additional comments (39)
components/core/src/clp_s/SchemaWriter.cpp (3)
7-7: LGTM: Appropriate update to track total uncompressed size.

The addition of the column header size to m_total_uncompressed_size is consistent with the PR's objective of tracking the total in-memory size of tables. This change will contribute to more accurate size calculations for table packing.

Line range hint 1-43: Overall assessment: Well-implemented changes supporting table packing.

The modifications to SchemaWriter.cpp are well-aligned with the PR objectives:

- Improved tracking of uncompressed sizes in append_column and append_message.
- Simplified store method, removing redundant size calculations.
- Consistent use of modern C++ practices.

These changes effectively support the implementation of table packing while maintaining code clarity and efficiency. The refactoring suggestions provided are minor and do not impact the overall quality of the implementation.

29-32: LGTM: Simplified store method aligns with new size tracking approach.

The changes to the store method are appropriate:

- Changing the return type to void is consistent with the new approach of tracking sizes incrementally.
- The simplified method body correctly focuses on storing column data without redundant size tracking.

These modifications align well with the PR objectives and the changes made in other methods.

To ensure this change doesn't break any existing code that might have relied on the returned size, let's verify the usage of this method:

✅ Verification successful

Verified: No external usages expect a return value from SchemaWriter::store.

The verification confirms that there are no calls to SchemaWriter::store expecting a return value. Therefore, changing the return type to void is safe and aligns with the updated size tracking approach.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any uses of SchemaWriter::store that expect a return value
# Search for SchemaWriter::store method calls
rg --type cpp -A 5 'SchemaWriter::store|schemaWriter\.store'

Length of output: 456
components/core/src/clp_s/SchemaWriter.hpp (1)
44-44
: LGTM! Verify impact on dependent code.The change to the
store
method signature aligns with the PR objectives. The removal of the return value is consistent with the new approach of tracking the total in-memory size internally.Please run the following script to verify that no other parts of the codebase are relying on the removed return value:
✅ Verification successful
Verified: No dependent code relies on the removed return value of
SchemaWriter::store()
.The shell script did not find any usages of
SchemaWriter::store()
that expect a return value. This confirms that the change to the method signature does not impact other parts of the codebase.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check for usages of SchemaWriter::store that expect a return value # Test: Search for SchemaWriter::store usages. Expect: No assignments from the method call. rg --type cpp -e "SchemaWriter::store" -e "\.store\(" | grep -v "void.*store"Length of output: 835
Script:
#!/bin/bash # Description: Check for SchemaWriter::store() usages that assign its return value rg --type cpp 'SchemaWriter::store\s*\(' | grep '='Length of output: 51
components/core/src/clp_s/JsonParser.hpp (1)
35-35
: Summary: Minimal change with potential impact on table packingThe addition of
min_table_size
toJsonParserOption
is a focused change that supports the new table packing feature. While the modification in this file is minimal, it has the potential to significantly impact the compression efficiency of small tables as described in the PR objectives.To ensure the full benefit of this change:
- Verify that the table packing implementation in other files (e.g.,
SchemaWriter
) correctly utilizes this new option.- Consider adding a brief comment in the code to explain the purpose of
min_table_size
.- Update any relevant documentation or user guides to explain how to use this new option effectively.
components/core/src/clp_s/ColumnWriter.cpp (4)
4-7
: Excellent improvement to the method signature!The change from using an output parameter to returning the size directly is a great improvement. It simplifies the method interface and aligns with modern C++ best practices. This modification enhances readability and reduces the potential for errors related to output parameter usage.
14-17
: Excellent consistency in method signature improvement!The changes to
FloatColumnWriter::add_value
are consistent with those made toInt64ColumnWriter::add_value
. This modification simplifies the method interface, improves readability, and aligns with modern C++ best practices. Well done!
24-27
: Excellent consistency in method signature improvement!The changes to
BooleanColumnWriter::add_value
are consistent with those made to otheradd_value
methods. This modification simplifies the method interface, improves readability, and aligns with modern C++ best practices. Great job maintaining consistency across different column writer classes!
Line range hint
34-86
: Consistent improvements across remaining column writer classes.The changes made to the
add_value
andstore
methods ofClpStringColumnWriter
,VariableStringColumnWriter
, andDateStringColumnWriter
are consistent with the improvements made to the previously reviewed classes. These changes include:
- Modifying
add_value
methods to returnsize_t
instead of using an output parameter.- Changing
store
methods to returnvoid
instead ofsize_t
.These modifications maintain consistency across all column writer classes, simplify method interfaces, and align with modern C++ practices. The core functionality of each method remains intact.
Consider applying the same minor improvement suggested for other
store
methods by removing unusedsize
variables where applicable.Overall, these changes represent a significant improvement in code consistency and readability across the entire file.
components/core/src/clp_s/ArchiveWriter.hpp (1)
Line range hint
1-179
: Verify the usage ofm_min_table_size
in class methods.The changes to add
min_table_size
toArchiveWriterOption
andm_min_table_size
toArchiveWriter
look good. However, it's important to ensure that this new member variable is properly utilized in the relevant class methods.Please run the following script to check for the usage of
m_min_table_size
in the implementation file:If there are no matches, consider updating the relevant methods (e.g.,
store_tables()
) to use this new member variable for table packing logic.✅ Verification successful
Usage of
m_min_table_size
VerifiedThe
m_min_table_size
member variable is properly set and utilized withinArchiveWriter.cpp
, ensuring the newmin_table_size
option is effectively integrated.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check for the usage of m_min_table_size in ArchiveWriter implementation # Test: Search for m_min_table_size usage in ArchiveWriter.cpp rg --type cpp "m_min_table_size" "components/core/src/clp_s/ArchiveWriter.cpp"Length of output: 222
components/core/src/clp_s/ColumnWriter.hpp (6)
62-64
: LGTM! Changes are consistent with the base class.The updates to the
add_value
andstore
method signatures in theInt64ColumnWriter
class align with the changes made in theBaseColumnWriter
class, maintaining consistency across the inheritance hierarchy.
79-81
: LGTM! Changes are consistent with the base class.The updates to the
add_value
andstore
method signatures in theFloatColumnWriter
class align with the changes made in theBaseColumnWriter
class, maintaining consistency across the inheritance hierarchy.
96-98
: LGTM! Changes are consistent with the base class.The updates to the
add_value
andstore
method signatures in theBooleanColumnWriter
class align with the changes made in theBaseColumnWriter
class, maintaining consistency across the inheritance hierarchy.
120-124
: LGTM! Changes are consistent and address previous concerns.The updates to the
add_value
andstore
method signatures in theClpStringColumnWriter
class align with the changes made in theBaseColumnWriter
class. The addition of theget_total_header_size
method is appropriate for this specific column type.Regarding the previous comment about adding this method to other classes: As gibber9809 mentioned, the default implementation returning 0 works for all other column writers. The
ClpStringColumnWriter
is the only one that needs a custom implementation due to its extra header for recording the size of the encoded variables column.
176-178
: LGTM! Changes are consistent with the base class.The updates to the
add_value
andstore
method signatures in theVariableStringColumnWriter
class align with the changes made in theBaseColumnWriter
class, maintaining consistency across the inheritance hierarchy.
194-196
: LGTM! Changes are consistent with the base class.The updates to the
add_value
andstore
method signatures in theDateStringColumnWriter
class align with the changes made in theBaseColumnWriter
class, maintaining consistency across the inheritance hierarchy.components/core/src/clp_s/CMakeLists.txt (1)
68-69
: LGTM! New files for PackedStreamReader added correctly.The addition of
PackedStreamReader.cpp
andPackedStreamReader.hpp
to theCLP_S_SOURCES
list is correct and aligns with the PR objectives for implementing table packing.Let's verify that these files exist in the correct directory:
✅ Verification successful
Verification Successful: Files Exist as Expected.
Both
PackedStreamReader.cpp
andPackedStreamReader.hpp
are present in thecomponents/core/src/clp_s
directory, confirming that the additions align with the PR objectives for implementing table packing.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the existence of PackedStreamReader files # Test: Check if the files exist fd -t f "PackedStreamReader\.(cpp|hpp)$" components/core/src/clp_sLength of output: 166
components/core/src/clp_s/ArchiveReader.hpp (5)
13-13
: LGTM: New header file inclusion is appropriate.The inclusion of "PackedStreamReader.hpp" aligns with the PR objectives of implementing table packing functionality.
178-188
: LGTM: Newread_stream
method is well-implemented and documented.The introduction of the
read_stream
method aligns with the PR objectives for implementing table packing. The method's signature and documentation provide clear guidance on its usage and purpose, particularly regarding buffer reuse for optimizing memory allocations.
200-200
: LGTM: Member variable update improves consistency with new terminology.The renaming of
m_id_to_table_metadata
tom_id_to_schema_metadata
and the corresponding type change toSchemaMetadata
align well with the PR's focus on schema-level operations and the new packed stream approach.
202-202
: LGTM: Member variable update aligns with packed stream implementation.The change from
FileReader m_tables_file_reader
toPackedStreamReader m_stream_reader
accurately reflects the new packed stream approach. This update also implements the previously suggested naming convention, improving code clarity.
Line range hint
1-212
: Overall LGTM: Changes effectively implement table packing functionality.The modifications to the
ArchiveReader
class successfully implement the table packing functionality as outlined in the PR objectives. The changes, including the newread_stream
method, updated member variables, and improved terminology, collectively enhance the class's ability to handle packed streams efficiently. The code maintains good practices in memory management and provides clear documentation for new features.components/core/src/clp_s/CommandLineArguments.hpp (1)
111-112
: LGTM: New getter method added correctly.The new
get_minimum_table_size()
method is well-implemented. It follows the class's naming conventions for getter methods, is properly const-qualified, and returns the correct type.components/core/src/clp_s/SchemaReader.cpp (1)
Line range hint
46-52
: LGTM! Buffer initialization and column loading look good.The changes to initialize the
BufferViewReader
with the adjustedstream_buffer
pointer are correct and consistent with the new function signature. The rest of the function, including the column loading and error checking, remains unchanged and appropriate.components/core/src/clp_s/search/Output.cpp (1)
Line range hint
87-91
: LGTM! Method updated to useread_schema_table()
.The change from
read_table()
toread_schema_table()
aligns with the new terminology for packed streams. This update is consistent with the PR objectives and discussions.Let's verify that there are no remaining instances of
read_table()
that might need updating:✅ Verification successful
Verification Successful: No Remaining
read_table()
Instances FoundAll instances of
read_table()
have been successfully updated toread_schema_table()
.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check for any remaining instances of 'read_table()' in the codebase. # Search for 'read_table(' in all C++ files rg --type cpp 'read_table\(' -C 3Length of output: 35
components/core/src/clp_s/PackedStreamReader.hpp (1)
76-76
: Potential out-of-bounds access inget_uncompressed_stream_size
.The method
get_uncompressed_stream_size
usesm_stream_metadata.at(stream_id)
, which will throw an exception ifstream_id
is out of bounds. Whileat()
performs bounds checking, it might be clearer to document or handle this possibility explicitly.To ensure that all calls to
get_uncompressed_stream_size
use validstream_id
values, run the following script:components/core/src/clp_s/SchemaReader.hpp (3)
4-4
: LGTMThe inclusion of
<memory>
is appropriate for the use ofstd::shared_ptr
.
50-53
: LGTMThe new
SchemaMetadata
structure properly includesstream_id
,stream_offset
, andnum_messages
.
282-282
: LGTMThe use of
std::shared_ptr<char[]>
form_stream_buffer
is appropriate.components/core/src/clp_s/ArchiveWriter.cpp (1)
212-213
: Ensure correct handling when flushing compression streamsThe condition:
if (current_stream_offset > m_min_table_size || schemas.size() == schema_metadata.size()) {
might skip flushing when
current_stream_offset
equalsm_min_table_size
. Verify that this logic correctly handles all intended cases to prevent potential off-by-one errors.components/core/src/clp_s/ArchiveReader.cpp (9)
31-31
: Initialization of Packed StreamsThe call to
m_stream_reader.open_packed_streams
correctly initializes the packed streams using the archive path. This aligns with the new approach for accessing archives.
41-54
: Proper Handling of Metadata Reading and Error CheckingThe code effectively reads the metadata and handles potential errors from
try_read_numeric_value
. The check fornum_separate_column_schemas
ensures that unsupported schemas are appropriately flagged, preventing processing of incompatible data.
99-114
: Accurate Calculation of Uncompressed SizeThe logic for calculating
uncompressed_size
and updating the schema metadata is correctly implemented. It handles transitions between different streams and updates the metadata accordingly.
117-120
: Final Metadata Update for Last SchemaThe code correctly calculates the uncompressed size for the last schema after the loop, ensuring that all schema metadata entries have accurate size information.
Line range hint
132-151
: Schema Table Reading Logic UpdatedThe
read_schema_table
method properly retrieves the schema metadata fromm_id_to_schema_metadata
and loads the data using the new stream-based approach. This update reflects the shift from table-based to stream-based operations.
157-167
: Efficient Reading of All Schema TablesThe
read_all_tables
method has been updated to utilize the new stream-based operations effectively. It iterates overm_schema_ids
, initializes each schema reader, and loads the corresponding data, ensuring all tables are read correctly.
274-274
: Initialization with Updated Number of MessagesIn
initialize_schema_reader
, the use ofnum_messages
fromm_id_to_schema_metadata
ensures that the schema reader is accurately initialized with the correct number of messages, enhancing data consistency.
321-322
: Correct Usage of Updatedread_schema_table
MethodThe
store
method correctly utilizes the updatedread_schema_table
method to read messages from all schema readers, ensuring that data is accurately written to the provided writer.
340-347
: Proper Resource Cleanup inclose
MethodThe
close
method effectively releases all resources by closingm_stream_reader
, clearing metadata maps, and resetting buffers. This ensures no resource leaks occur upon closing the archive reader.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (4)
components/core/src/clp_s/ArchiveWriter.cpp (4)
17-17: New member variable added correctly.

The addition of m_min_table_size is appropriate for the new table packing feature. It's correctly initialized from the ArchiveWriterOption.

Consider adding a comment explaining the purpose of this variable, especially its role in determining when to flush metadata in the store_tables method.
: Excellent documentation of the new metadata schemaThe detailed comment block provides a clear and comprehensive explanation of the packed stream metadata schema. This documentation is crucial for understanding the new table packing feature and will greatly assist future maintainers.
Consider adding a brief note about why this approach was chosen (e.g., to enhance compression efficiency for small tables) to provide additional context.
197-223
: Table packing logic implemented effectivelyThe new logic for packing tables into streams is well-implemented and aligns with the PR objectives. The use of
m_min_table_size
provides flexible control over stream creation, which is excellent.To improve clarity, consider adding a comment explaining the significance of the condition on line 212:
// Flush the current stream if it exceeds the minimum size or if this is the last schema if (current_stream_offset > m_min_table_size || schemas.size() == schema_metadata.size()) { // ... (existing code) }This will help future readers understand the logic behind stream flushing more quickly.
231-234
: Clear explanation for future extensibilityThe comment explaining that the current implementation doesn't store large tables as separate columns is helpful. It provides context for the zero value and hints at potential future enhancements.
Consider adding a TODO comment to make it easier for future developers to find this spot when implementing separate column storage:
// TODO: Implement support for storing large tables as separate columns size_t const num_separate_column_schemas{0};
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- components/core/src/clp_s/ArchiveReader.hpp (4 hunks)
- components/core/src/clp_s/ArchiveWriter.cpp (3 hunks)
- components/core/src/clp_s/PackedStreamReader.cpp (1 hunks)
- components/core/src/clp_s/PackedStreamReader.hpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- components/core/src/clp_s/ArchiveReader.hpp
- components/core/src/clp_s/PackedStreamReader.cpp
- components/core/src/clp_s/PackedStreamReader.hpp
🧰 Additional context used
🪛 cppcheck
components/core/src/clp_s/ArchiveWriter.cpp
[error] 191-191: Exception thrown in function declared not to throw exceptions.
(throwInNoexceptFunction)
🔇 Additional comments (3)
components/core/src/clp_s/ArchiveWriter.cpp (3)
3-5
: Appropriate header inclusions addedThe addition of
<algorithm>
and<tuple>
headers is necessary for the new functionality in thestore_tables
method. Good job on including only the required headers.
225-242
: Metadata writing implemented correctlyThe implementation of writing the collected metadata to the compressor is correct and follows the schema described in the earlier comment block. The use of structured bindings enhances readability, which is excellent.
Both stream metadata and schema metadata are handled appropriately, ensuring that all necessary information is stored for later retrieval during decompression.
Line range hint
1-285
: Comprehensive implementation of table packing featureThe changes in this file successfully implement the table packing feature as described in the PR objectives. The new metadata schema and writing logic are well-designed and should effectively address the issue of compression ratio outliers for small tables.
Key improvements:
- Introduction of a flexible minimum table size for packing.
- Sorting of schemas based on uncompressed size for optimal packing.
- Clear and detailed documentation of the new metadata schema.
- Efficient implementation of metadata collection and writing.
These changes should significantly enhance compression efficiency, especially for cases where array-structurization generates numerous small tables. The code is well-commented and structured, which will aid in future maintenance and potential extensions of the feature.
Great job on this implementation!
🧰 Tools
🪛 cppcheck
[error] 191-191: Exception thrown in function declared not to throw exceptions.
(throwInNoexceptFunction)
using schema_map_it = decltype(m_id_to_schema_writer)::iterator;
std::vector<schema_map_it> schemas;
std::vector<std::tuple<size_t, size_t>> stream_metadata;
std::vector<std::tuple<size_t, size_t, int32_t, size_t>> schema_metadata;
🛠️ Refactor suggestion
Consider using structs instead of tuples for metadata
While using std::tuple
for metadata storage is efficient, it may reduce code readability due to unnamed fields. Consider defining custom structs with descriptive member names to improve clarity and maintainability.
Example:
struct StreamMetadata {
size_t file_offset;
size_t uncompressed_size;
};
struct SchemaMetadata {
size_t stream_id;
size_t stream_offset;
int32_t schema_id;
size_t num_messages;
};
Then update the declarations:
std::vector<StreamMetadata> stream_metadata;
std::vector<SchemaMetadata> schema_metadata;
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
components/core/src/clp_s/ArchiveWriter.cpp (1)
150-154: Minor grammatical improvements in comments.

To enhance clarity, consider adding commas and correcting verb agreement in the comments.

Apply this diff to improve the comments:

- * streams is identified by a 64 bit stream id. In the first half of the metadata we identify
+ * streams are identified by a 64-bit stream ID. In the first half of the metadata, we identify

- * In the second half of the metadata we record how many schema tables there are,
+ * In the second half of the metadata, we record how many schema tables there are,
📒 Files selected for processing (1)
- components/core/src/clp_s/ArchiveWriter.cpp (3 hunks)
🧰 Additional context used
🪛 cppcheck
components/core/src/clp_s/ArchiveWriter.cpp
[error] 191-191: Exception thrown in function declared not to throw exceptions.
(throwInNoexceptFunction)
🔇 Additional comments (6)
components/core/src/clp_s/ArchiveWriter.cpp (6)
3-4: Header files are appropriately included.

The addition of <algorithm> and <tuple> is necessary for std::sort and std::tuple used later in the code.
17-17: Proper initialization of m_min_table_size.

Assigning option.min_table_size to m_min_table_size correctly integrates the new table packing feature.
146-176: Comprehensive documentation of packed stream metadata schema.

The detailed comments explaining the packed stream metadata schema enhance code readability and maintainability.
181-184: Consider defining custom structs instead of using std::tuple.

Using custom structs with descriptive member names for stream_metadata and schema_metadata could improve code readability and maintainability.
191-193: Handle potential exceptions in the comparison lambda.

Ensure that get_total_uncompressed_size() is noexcept, or handle exceptions within the lambda to maintain exception safety during the sort operation.

🧰 Tools
🪛 cppcheck
[error] 191-191: Exception thrown in function declared not to throw exceptions.
(throwInNoexceptFunction)
212-223: Proper handling of stream boundaries in table packing.

The logic correctly determines when to close the current compression stream based on m_min_table_size or when all schemas have been processed, ensuring efficient packing.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (3)
components/core/src/clp_s/CommandLineArguments.hpp (1)
180-180: Consider adding documentation about the minimum table size purpose.

The variable initialization and naming are consistent with the codebase. However, it would be helpful to document that this represents the minimum size threshold for table packing, as mentioned in the PR objectives.

Consider adding a detailed comment like this:

- size_t m_minimum_table_size{1ULL * 1024 * 1024};  // 1 MB
+ // Minimum size threshold (1 MB) for packing small tables into a single compression stream
+ size_t m_minimum_table_size{1ULL * 1024 * 1024};

components/core/src/clp_s/ArchiveReader.cpp (1)
341-348: Enhance cleanup robustness.

Consider adding null checks before clearing resources to prevent potential issues during partial initialization scenarios.

- m_stream_reader.close();
- m_table_metadata_file_reader.close();
+ if (m_stream_reader) m_stream_reader.close();
+ if (m_table_metadata_file_reader) m_table_metadata_file_reader.close();

components/core/src/clp_s/CommandLineArguments.cpp (1)
163-167: LGTM! Consider adding value validation.

The new --min-table-size option is well-integrated with the existing compression options. However, consider adding validation to ensure the value is positive and reasonable.

Add validation after parsing, similar to other size parameters:

if (parsed_command_line_options.count("min-table-size") > 0) {
+    if (m_minimum_table_size <= 0) {
+        throw std::invalid_argument("min-table-size must be greater than zero.");
+    }
}
📒 Files selected for processing (8)
- components/core/src/clp_s/ArchiveReader.cpp (7 hunks)
- components/core/src/clp_s/ArchiveReader.hpp (4 hunks)
- components/core/src/clp_s/CMakeLists.txt (1 hunks)
- components/core/src/clp_s/CommandLineArguments.cpp (1 hunks)
- components/core/src/clp_s/CommandLineArguments.hpp (2 hunks)
- components/core/src/clp_s/SchemaReader.cpp (1 hunks)
- components/core/src/clp_s/SchemaReader.hpp (4 hunks)
- components/core/src/clp_s/clp-s.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- components/core/src/clp_s/ArchiveReader.hpp
🧰 Additional context used
🔇 Additional comments (9)
components/core/src/clp_s/CMakeLists.txt (1)
68-69: The build configuration changes look good, eh!

The addition of PackedStreamReader source files follows the project's build patterns and maintains alphabetical ordering in the source list.

Let's verify that all required dependencies for the packed stream functionality are present:

✅ Verification successful

Build configuration is properly set up for packed stream functionality, eh!

After verification, the build configuration is complete and correct:

- The CMakeLists.txt includes all necessary compression-related dependencies: ZStd::ZStd is properly linked in the target libraries.
- Required compression utilities (ZstdCompressor.hpp/cpp and ZstdDecompressor.hpp/cpp) are included in CLP_S_SOURCES.
- The PackedStreamReader implementation correctly:
  - Includes the required ZstdDecompressor.hpp
  - Uses the compression functionality for stream handling
  - Has proper integration with the existing compression infrastructure
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify dependencies for packed stream functionality # Test 1: Check for compression-related dependencies rg -l "zstd|compress" CMakeLists.txt # Test 2: Check for any new header includes in PackedStreamReader ast-grep --pattern 'include "$_"' PackedStreamReader.hppLength of output: 222
Script:
#!/bin/bash # Let's try with correct paths and broader search # Find CMakeLists.txt files fd CMakeLists.txt # Then search for compression-related dependencies in found files fd CMakeLists.txt --exec rg -l "zstd|compress" {} # Find PackedStreamReader files fd "PackedStreamReader\.(cpp|hpp)" # Check includes in any found PackedStreamReader files fd "PackedStreamReader\.hpp" --exec ast-grep --pattern 'include "$_"' {}Length of output: 1169
Script:
#!/bin/bash # Let's check the content of the relevant CMakeLists.txt for compression dependencies cat components/core/src/clp_s/CMakeLists.txt # Let's also check for any compression-related includes or dependencies rg -p "zstd|compress" components/core/src/clp_s/PackedStreamReader.hpp || true rg -p "zstd|compress" components/core/src/clp_s/PackedStreamReader.cpp || true # Check for any related target dependencies rg -p "target_link_libraries.*clp-s" components/core/src/clp_s/CMakeLists.txtLength of output: 8192
components/core/src/clp_s/CommandLineArguments.hpp (1)
111-112: LGTM! The getter method follows the class conventions.

The implementation is clean and consistent with other getter methods in the class.
components/core/src/clp_s/ArchiveReader.cpp (5)
31-31: LGTM: Stream reader initialization aligns with table packing feature.

The change from table file reader to packed stream reader correctly implements the new architecture for handling packed tables.
Line range hint 132-152: LGTM: Schema table reading implementation.

The schema-based table reading correctly integrates with the packed streams architecture and properly handles buffer reuse.
322-326: LGTM: Store method properly handles schema order.

The store method correctly uses schema IDs to maintain the proper order of tables during storage.
351-364: LGTM: Stream reading implementation.

The stream reading implementation correctly handles buffer management and stream switching.
41-120: Verify metadata integrity across packed streams.

The metadata reading implementation looks solid, but we should verify that the uncompressed size calculations are accurate across different stream configurations.
✅ Verification successful
Let me try a different approach to verify the metadata integrity implementation.
The previous search was too specific. Let's broaden it to find relevant test cases and related validation code that can help verify the metadata integrity implementation.
Based on the code analysis from the search results, I can now provide a final verification response.
Metadata integrity implementation is correctly implemented
The implementation in ArchiveReader.cpp correctly handles stream metadata integrity through several mechanisms:
- Stream offset validation: Checks that offsets don't exceed stream sizes (line 41-43 in ArchiveReader.cpp)
- Proper uncompressed size calculation:
- For same stream: uses the offset difference (stream_offset - prev_metadata.stream_offset)
- For different streams: uses the stream size minus the offset (get_uncompressed_stream_size() - prev_metadata.stream_offset)
- Consistent metadata tracking through PackedStreamReader, which maintains stream metadata with file offsets and uncompressed sizes
- Complete validation chain from reading to schema loading
The implementation ensures that stream boundaries are respected and uncompressed sizes are accurately calculated across stream transitions.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify that stream offsets are properly ordered within each stream # and that uncompressed sizes don't exceed stream sizes # Search for test files that verify stream metadata integrity rg -l "test.*stream.*metadata|test.*packed.*stream" --type cpp # Look for potential metadata validation functions ast-grep --pattern 'void $func($$$) { $$$ stream_offset $$$ uncompressed $$$ }'Length of output: 156
Script:
#!/bin/bash # Search for test files related to ArchiveReader fd "test.*\.cpp" | xargs rg -l "ArchiveReader" # Look for test cases involving stream metadata rg -A 5 "TEST.*Archive.*Reader" --type cpp # Find related validation code rg -A 3 "stream_offset|uncompressed_size" --type cpp components/core/src/clp_s/Length of output: 15231
components/core/src/clp_s/clp-s.cpp (1)
94-94
: LGTM! Please verify minimum table size validation.
The addition of the min_table_size parameter aligns well with the table packing feature. The implementation follows the established pattern for option configuration.
Let's verify the parameter validation:
✅ Verification successful
Let me analyze the results. I can see that:
- The minimum table size has a default value of 1 MB (1ULL * 1024 * 1024)
- It's configurable via command line arguments using boost::program_options
- The parameter has a description: "Minimum size (B) for a packed table before it gets compressed"
Let me verify if there are any additional validations or constraints on this parameter:
Let me check one more thing to see if there are any validations in the constructor or initialization:
Let me try one last search to find any validation logic:
Parameter validation appears sufficient
The minimum table size parameter implementation is well-structured with:
- A sensible default value of 1 MB (1ULL * 1024 * 1024)
- Command-line configuration support via boost::program_options
- Clear documentation indicating its purpose for packed table compression
- Implicit validation through the size_t type, ensuring non-negative values
Since size_t naturally prevents negative values and the default value of 1 MB provides a reasonable starting point, no additional validation appears necessary for this use case.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for minimum table size validation in CommandLineArguments
# Expected: Find validation logic for minimum table size parameter

# Search for minimum table size validation
rg -A 5 "minimum.*table.*size" --type cpp

Length of output: 2468
Script:
#!/bin/bash
# Search for any validation or constraints on minimum table size
rg -A 5 "m_minimum_table_size" --type cpp

Length of output: 1859
Script:
#!/bin/bash
# Search for CommandLineArguments constructor and validate method
ast-grep --pattern 'CommandLineArguments::CommandLineArguments($$$) { $$$ }'
ast-grep --pattern 'CommandLineArguments::validate($$$) { $$$ }'

Length of output: 149
Script:
#!/bin/bash
# Search for validation methods in CommandLineArguments
rg -B 2 -A 5 "void.*validate" --type cpp components/core/src/clp_s/CommandLineArguments

# Also check for any error handling related to minimum table size
rg -B 2 -A 5 "throw.*minimum.*table" --type cpp

Length of output: 226
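For context, wiring up such an option with boost::program_options typically looks like the sketch below. The option name min-table-size and the surrounding scaffolding are illustrative assumptions, not the project's actual code:

```cpp
#include <cstddef>
#include <iostream>
#include <boost/program_options.hpp>

namespace po = boost::program_options;

int main(int argc, char** argv) {
    // 1 MB default, matching the default value found above.
    size_t min_table_size = 1ULL * 1024 * 1024;

    po::options_description options{"Compression options"};
    options.add_options()(
            "min-table-size",
            po::value<size_t>(&min_table_size)->default_value(min_table_size),
            "Minimum size (B) for a packed table before it gets compressed"
    );

    po::variables_map parsed;
    po::store(po::parse_command_line(argc, argv, options), parsed);
    po::notify(parsed);

    // size_t is unsigned, so the stored value can never be negative.
    std::cout << "min table size: " << min_table_size << " B\n";
    return 0;
}
```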
components/core/src/clp_s/SchemaReader.hpp (1)
139-144
: Duplicate Comment: Consider passing stream_buffer by const reference
The previous suggestion to pass stream_buffer as const std::shared_ptr<char[]>& is still valid. This change would avoid unnecessary increments of the reference count on the shared pointer, potentially improving performance.
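Roughly, the suggested signature change looks like this sketch (simplified declarations, not the actual ones in SchemaReader.hpp):

```cpp
#include <cstddef>
#include <memory>

// Pass-by-value: every call copies the shared_ptr, atomically incrementing
// and later decrementing the reference count.
void load_by_value(std::shared_ptr<char[]> stream_buffer, size_t offset, size_t uncompressed_size);

// Pass-by-const-reference: the caller's reference count is untouched; the
// callee can still copy the pointer into a member when it needs ownership.
void load_by_cref(
        std::shared_ptr<char[]> const& stream_buffer,
        size_t offset,
        size_t uncompressed_size
);
```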
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (2)
components/core/src/clp_s/SchemaReader.hpp (1)
139-144
: Documentation needs clarification about buffer ownership
The method documentation should specify:
- Buffer ownership semantics (whether the method takes ownership)
- Buffer lifetime requirements
- Whether the buffer can be modified
Example documentation:
/**
 * Loads the encoded messages from a shared buffer starting at a given offset
 * @param stream_buffer Shared buffer containing the encoded messages. The buffer must remain valid
 * for the lifetime of the SchemaReader instance
 * @param offset Starting offset within the buffer
 * @param uncompressed_size Size of the uncompressed data
 */
components/core/src/clp_s/ArchiveWriter.cpp (1)
146-180
: Consider adding size threshold documentation
The metadata schema documentation is thorough. Consider adding a note about how the m_min_table_size threshold influences the packing of tables into streams, helping future maintainers understand the relationship between table sizes and stream creation.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- components/core/src/clp_s/ArchiveReader.cpp (7 hunks)
- components/core/src/clp_s/ArchiveWriter.cpp (3 hunks)
- components/core/src/clp_s/SchemaReader.hpp (4 hunks)
🧰 Additional context used
🪛 cppcheck
components/core/src/clp_s/ArchiveWriter.cpp
[error] 191-191: Exception thrown in function declared not to throw exceptions.
(throwInNoexceptFunction)
🔇 Additional comments (10)
components/core/src/clp_s/SchemaReader.hpp (2)
52-56
: LGTM: SchemaMetadata struct changes align with table packing design
The renamed struct and new fields (stream_id, stream_offset) effectively support the table packing feature by tracking the location of schema data within packed streams.
287-287
: Verify buffer cleanup handling
The change from unique_ptr to shared_ptr requires verification of proper buffer cleanup, especially in error cases.
components/core/src/clp_s/ArchiveWriter.cpp (2)
3-4
: LGTM: New includes and member initialization
The additions properly support the table packing feature. The includes are required for sorting schemas and storing metadata, while the member initialization correctly sets up the minimum table size threshold.
Also applies to: 17-17
225-242
: LGTM: Metadata writing implementation
The implementation correctly follows the documented metadata schema, properly writing both stream and schema metadata sections. The code is well-structured and maintains clear separation between the two metadata sections.
components/core/src/clp_s/ArchiveReader.cpp (6)
31-31
: LGTM: Stream reader initialization aligns with table packing feature
The change from table file reader to packed streams reader accurately reflects the new architecture.
83-85
: LGTM: Proper stream offset validation
The code correctly validates that stream offsets don't exceed the uncompressed stream size, preventing potential buffer overruns.
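In spirit, that guard reduces to something like the following sketch (illustrative names only):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Rejects schema metadata whose offset points past the end of the
// decompressed stream it claims to live in.
void validate_stream_offset(uint64_t stream_offset, uint64_t uncompressed_stream_size) {
    if (stream_offset > uncompressed_stream_size) {
        throw std::runtime_error(
                "Stream offset " + std::to_string(stream_offset)
                + " exceeds uncompressed stream size "
                + std::to_string(uncompressed_stream_size)
        );
    }
}
```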
275-275
: LGTM: Proper use of schema metadata
The initialization correctly uses the schema metadata to set the number of messages.
322-326
: LGTM: Proper schema order preservation
The store method correctly uses schema_ids to ensure tables are processed in the intended order.
341-348
: LGTM: Comprehensive cleanup of stream resources
The close method properly cleans up all stream-related resources in a logical order.
99-120
: Verify metadata calculation logic for edge cases
The uncompressed size calculation uses a complex prev_metadata pattern. While the logic appears sound, it's worth verifying the handling of edge cases.
✅ Verification successful
Let me try a different approach to search for relevant test files.
This should help us find relevant test files that might verify the metadata calculation logic, even if they don't explicitly mention "metadata calculation" in their names.
Additionally, let's look at the actual test implementations:
Let me search for tests specifically related to the schema metadata calculation.
This should help us find any test cases or implementations that verify the schema metadata calculation logic, particularly around the uncompressed size calculation.
Based on the code analysis, I can now provide a final assessment of the metadata calculation logic.
Metadata calculation logic is correctly implemented with proper edge case handling
The uncompressed size calculation in ArchiveReader.cpp follows a robust pattern that handles both same-stream and cross-stream scenarios:
- For same stream (when stream_id == prev_metadata.stream_id): uses the direct offset difference, stream_offset - prev_metadata.stream_offset
- For different streams: uses the stream size minus the offset, m_stream_reader.get_uncompressed_stream_size(prev_metadata.stream_id) - prev_metadata.stream_offset
- The final schema's size is properly calculated using the same pattern
The implementation includes proper initialization checks (prev_metadata_initialized) and maintains consistency across stream boundaries. The metadata is stored in a map (m_id_to_schema_metadata) that ensures each schema's metadata is properly tracked.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for test cases that verify metadata calculation
rg -l "test.*metadata.*calculation|test.*uncompressed.*size" --type cpp

Length of output: 73
Script:
#!/bin/bash
# Search for test files related to ArchiveReader
fd "test.*\.cpp$" | rg -l "ArchiveReader"

# Also search for any usage or tests of uncompressed size calculation
rg -l "uncompressed.*size" --type cpp

Length of output: 2386
Script:
#!/bin/bash
# Look for test cases involving stream size or metadata in the test files
rg "TEST.*Archive.*|uncompressed.*size|stream.*size" -A 5 --type cpp

Length of output: 113781
Script:
#!/bin/bash
# Look for tests or implementations related to schema metadata
rg -A 5 "SchemaMetadata|schema.*metadata" --type cpp

# Also look for any test files that might verify the schema metadata
fd "test.*\.cpp$" | rg -l "Schema"

Length of output: 9738
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (5)
components/core/src/clp_s/ArchiveWriter.hpp (4)
24-24
: Add documentation for the min_table_size parameter.
Please add a documentation comment explaining that this parameter controls the minimum size threshold for table packing. This will help future maintainers understand its purpose and expected values.
 struct ArchiveWriterOption {
     boost::uuids::uuid id;
     std::string archives_dir;
     int compression_level;
     bool print_archive_stats;
+    /// Minimum size threshold in bytes for packing tables into a single compression stream
     size_t min_table_size;
 };
36-43
: Add class-level documentation for StreamMetadata.
While the implementation is sound, please add documentation explaining this struct's role in the table packing feature. Consider documenting each member's purpose as well.
+/**
+ * Metadata for a packed compression stream containing one or more tables
+ */
 struct StreamMetadata {
     StreamMetadata(uint64_t file_offset, uint64_t uncompressed_size)
             : file_offset(file_offset),
               uncompressed_size(uncompressed_size) {}

+    /// Offset of this stream in the archive file
     uint64_t file_offset{};
+    /// Total uncompressed size of all tables in this stream
     uint64_t uncompressed_size{};
 };
45-61
: Add comprehensive documentation for SchemaMetadata.
Please add documentation to clarify:
- The struct's role in mapping schema tables to packed streams
- The relationship between SchemaMetadata and StreamMetadata
- The purpose of each member variable
+/**
+ * Metadata mapping a schema table to its location within a packed compression stream
+ */
 struct SchemaMetadata {
     SchemaMetadata(
             uint64_t stream_id,
             uint64_t stream_offset,
             int32_t schema_id,
             uint64_t num_messages
     )
             : stream_id(stream_id),
               stream_offset(stream_offset),
               schema_id(schema_id),
               num_messages(num_messages) {}

+    /// ID of the packed stream containing this schema table
     uint64_t stream_id{};
+    /// Offset of this table within its packed stream
     uint64_t stream_offset{};
+    /// ID of the schema this table belongs to
     int32_t schema_id{};
+    /// Number of messages in this table
     uint64_t num_messages{};
 };
190-190
: Add documentation for m_min_table_size member variable.
Please add a documentation comment explaining this member's role in the table packing process.
 size_t m_encoded_message_size{};
 size_t m_uncompressed_size{};
 size_t m_compressed_size{};
 std::string m_id;
 std::string m_archive_path;
 std::string m_encoded_messages_dir;
 std::shared_ptr<VariableDictionaryWriter> m_var_dict;
 std::shared_ptr<LogTypeDictionaryWriter> m_log_dict;
 std::shared_ptr<LogTypeDictionaryWriter> m_array_dict;  // log type dictionary for arrays
 std::shared_ptr<TimestampDictionaryWriter> m_timestamp_dict;
 std::shared_ptr<clp::GlobalMySQLMetadataDB> m_metadata_db;
 int m_compression_level{};
 bool m_print_archive_stats{};
+/// Minimum size threshold in bytes for packing tables into a single compression stream
 size_t m_min_table_size{};
components/core/src/clp_s/ArchiveWriter.cpp (1)
145-179
: Consider documenting size constraints and performance implications
The metadata schema documentation is thorough. Consider adding:
- Typical/recommended ranges for stream sizes
- Performance implications of different packing thresholds
- Memory overhead considerations for the two-section approach
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- components/core/src/clp_s/ArchiveWriter.cpp (3 hunks)
- components/core/src/clp_s/ArchiveWriter.hpp (3 hunks)
🧰 Additional context used
🪛 cppcheck
components/core/src/clp_s/ArchiveWriter.cpp
[error] 191-191: Exception thrown in function declared not to throw exceptions.
(throwInNoexceptFunction)
🔇 Additional comments (3)
components/core/src/clp_s/ArchiveWriter.hpp (1)
Line range hint
1-205
: Overall implementation looks good!
The changes effectively support the table packing feature with well-structured metadata types. The implementation aligns well with the PR objectives of combining small tables into single compression streams for better efficiency.
components/core/src/clp_s/ArchiveWriter.cpp (2)
3-4
: LGTM: Clean initialization of table packing threshold
The addition of m_min_table_size and the algorithm header properly sets up the foundation for table packing.
Also applies to: 16-16
224-240
: Consider adding error handling for metadata writing
The metadata writing sequence is critical for archive integrity. Consider adding error handling and validation (a rough sketch follows the list below):
- Verify write operations succeed
- Add checksums for metadata integrity
- Handle potential I/O errors
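A rough sketch of the first suggestion, assuming a hypothetical writer that reports failure via a boolean return (the real compressor's error reporting may differ):

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the archive's metadata writer.
struct MetadataWriter {
    bool write_numeric(uint64_t value);  // returns false on I/O failure
};

// Writes the per-stream section of the metadata, aborting on the first
// failed write so a truncated section can't masquerade as a valid one.
void write_stream_section(MetadataWriter& writer, std::vector<uint64_t> const& stream_offsets) {
    if (false == writer.write_numeric(stream_offsets.size())) {
        throw std::runtime_error("Failed to write stream count");
    }
    for (auto offset : stream_offsets) {
        if (false == writer.write_numeric(offset)) {
            throw std::runtime_error("Failed to write stream offset");
        }
    }
}
```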
* ffi: Add support for serializing/deserializing auto-generated and user-generated schema tree node IDs. (y-scope#557)
  Co-authored-by: kirkrodrigues <[email protected]>
* clp: Add missing C++ standard library includes in IR parsing files. (y-scope#561)
  Co-authored-by: kirkrodrigues <[email protected]>
* log-viewer-webui: Update `yscope-log-viewer` to the latest version (which uses `clp-ffi-js`). (y-scope#562)
* package: Upgrade dependencies to resolve security issues. (y-scope#536)
* clp-s: Implement table packing (y-scope#466)
  Co-authored-by: wraymo <[email protected]>
  Co-authored-by: Kirk Rodrigues <[email protected]>
  Co-authored-by: wraymo <[email protected]>
* log-viewer-webui: Update `yscope-log-viewer` to the latest version. (y-scope#565)
* ci: Switch GitHub macOS build workflow to use macos-13 (x86) and macos-14 (ARM) runners. (y-scope#566)
* core: Add support for user-defined HTTP headers in `NetworkReader`. (y-scope#568)
  Co-authored-by: Lin Zhihao <[email protected]>
  Co-authored-by: Xiaochong Wei <[email protected]>
* chore: Update to the latest version of yscope-dev-utils. (y-scope#574)
* build(core): Upgrade msgpack to v7.0.0. (y-scope#575)
* feat(ffi): Update IR stream protocol version handling in preparation for releasing the kv-pair IR stream format: (y-scope#573)
  - Bump the IR stream protocol version to 0.1.0 for the kv-pair IR stream format.
  - Treat the previous IR stream format's versions as backwards compatible.
  - Differentiate between backwards-compatible and supported versions during validation.
  Co-authored-by: kirkrodrigues <[email protected]>
* fix(taskfiles): Trim trailing slash from URL prefix in `download-and-extract-tar` (fixes y-scope#577). (y-scope#578)
* fix(ffi): Correct `clp::ffi::ir_stream::Deserializer::deserialize_next_ir_unit`'s return value when failing to read the next IR unit's type tag. (y-scope#579)
* fix(taskfiles): Update `yscope-log-viewer` sources in `log-viewer-webui-clients` sources list (fixes y-scope#576). (y-scope#580)
* fix(cmake): Add Homebrew path detection for `mariadb-connector-c` to fix macOS build failure. (y-scope#582)
  Co-authored-by: kirkrodrigues <[email protected]>
* refactor(ffi): Make `get_schema_subtree_bitmap` a public method of `KeyValuePairLogEvent`. (y-scope#581)
* ci: Schedule GitHub workflows to daily run to detect failures due to upgraded dependencies or environments. (y-scope#583)
* docs: Update the required version of task. (y-scope#567)
* Add pr check workflow
--------
Co-authored-by: kirkrodrigues <[email protected]>
Co-authored-by: Junhao Liao <[email protected]>
Co-authored-by: Henry8192 <[email protected]>
Co-authored-by: Devin Gibson <[email protected]>
Co-authored-by: wraymo <[email protected]>
Co-authored-by: wraymo <[email protected]>
Co-authored-by: Xiaochong(Eddy) Wei <[email protected]>
Co-authored-by: Xiaochong Wei <[email protected]>
Co-authored-by: haiqi96 <[email protected]>
Co-authored-by: wraymo <[email protected]>
Co-authored-by: Kirk Rodrigues <[email protected]>
Co-authored-by: wraymo <[email protected]>
Description
This PR implements table packing: we combine small tables into one compression stream until they reach a certain size threshold, in order to avoid having many tiny compression streams. This helps avoid outliers in compression ratio, particularly when we enable features like array-structurization, which can create many small tables.
On the compression side the key differences are that (1) SchemaWriter now keeps track of the total in-memory size of the table it owns instead of determining it after writing to a compression stream; (2) before compression, tables are sorted by that in-memory size, and smaller tables are packed together in sequence until their combined size reaches a certain threshold; and (3) table metadata has been changed to accommodate table packing.
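A minimal sketch of that packing pass, under the assumption of simplified stand-in types (the real logic lives in ArchiveWriter):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a schema table waiting to be compressed.
struct PendingTable {
    int32_t schema_id{};
    size_t in_memory_size{};
};

// Where a table ends up: which packed stream, and at what offset within it.
struct Placement {
    int32_t schema_id{};
    uint64_t stream_id{};
    uint64_t stream_offset{};
};

// Sorts tables by in-memory size and packs them into a stream until the
// combined size reaches min_table_size, then starts a new stream.
std::vector<Placement> pack_tables(std::vector<PendingTable> tables, size_t min_table_size) {
    std::sort(tables.begin(), tables.end(), [](PendingTable const& a, PendingTable const& b) {
        return a.in_memory_size < b.in_memory_size;
    });

    std::vector<Placement> placements;
    uint64_t stream_id{0};
    uint64_t stream_offset{0};
    for (auto const& table : tables) {
        placements.push_back({table.schema_id, stream_id, stream_offset});
        stream_offset += table.in_memory_size;
        if (stream_offset >= min_table_size) {
            // This stream is big enough; the next table opens a new one.
            ++stream_id;
            stream_offset = 0;
        }
    }
    return placements;
}
```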
On the decompression side, TableReader is responsible for decompressing the packed streams and ensuring tables are read in the correct order. Most of the rest of the change is contained in SchemaReader. The logic for reading table metadata is split between TableReader and SchemaReader: TableReader reads metadata about individual compression streams, while SchemaReader reads metadata about how schema tables map to those streams. Also note that schema tables now need to be read in the order they appear in the table metadata, which can differ from schema ID order.
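And a sketch of the decompression-side bookkeeping: reading tables in metadata order and only switching packed streams when the stream ID changes (again with hypothetical stand-in types):

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical per-table metadata in the order it appears in the archive's
// table metadata (which can differ from schema ID order).
struct TableEntry {
    int32_t schema_id{};
    uint64_t stream_id{};
    uint64_t stream_offset{};
};

// Placeholder for decompressing a whole packed stream into one buffer; a
// real implementation would seek to the stream's file offset and inflate it.
std::shared_ptr<char[]> decompress_stream(uint64_t stream_id) {
    (void)stream_id;
    return std::shared_ptr<char[]>{new char[1024]{}};
}

void read_tables_in_metadata_order(std::vector<TableEntry> const& entries) {
    std::shared_ptr<char[]> buffer;
    uint64_t loaded_stream_id{UINT64_MAX};
    for (auto const& entry : entries) {
        if (nullptr == buffer || entry.stream_id != loaded_stream_id) {
            // Decompress each packed stream once; every table packed into it
            // reuses the same buffer at a different offset.
            buffer = decompress_stream(entry.stream_id);
            loaded_stream_id = entry.stream_id;
        }
        char const* table_start = buffer.get() + entry.stream_offset;
        // ... hand table_start to the schema reader for entry.schema_id ...
        (void)table_start;
    }
}
```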
Note: this PR deliberately leaves the uncompressed size of individual schema tables out of the table metadata. Uncompressed size can be derived from the other metadata we do store, and storing it in addition to the metadata offsets would actually increase the amount of work needed to check that an archive isn't corrupt while decompressing it.
Validation performed
Summary by CodeRabbit
New Features
- New PackedStreamReader class for improved management of reading packed streams.
- Updated ArchiveWriter for better handling of compression streams and schema tables.

Improvements
- Updated ArchiveReader for more efficient schema and metadata management.
- Updated ArchiveWriter to optimize metadata processing and writing.
- Updated SchemaReader for better data loading and timestamp handling.

Bug Fixes
- Adjusted ArchiveReader and ArchiveWriter to align with new schema handling logic.

Documentation