clp-s: Implement table packing #466
Conversation
Great work!
@@ -108,6 +108,8 @@ class CommandLineArguments {

     size_t get_ordered_chunk_size() const { return m_ordered_chunk_size; }

+    size_t get_min_table_size() const { return m_minimum_table_size; }
Should the method name be consistent with the variable name?
-    size_t store(ZstdCompressor& compressor) override;
+    void store(ZstdCompressor& compressor) override;

+    size_t get_total_header_size() const override { return sizeof(size_t); }
Do we need to add this method for other classes?
The default implementation that returns 0 works for all of the other column writers. The clp string column writer is the only one that has an extra header to record the size of the encoded variables column.
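To illustrate the pattern being described, here is a minimal sketch — a simplified stand-in for the actual clp-s class hierarchy, with the class bodies reduced to hypothetical skeletons: the base writer supplies a zero default, and only the writer with an extra header overrides it.

```cpp
#include <cstddef>

// Hypothetical, simplified mirror of the pattern discussed above.
class BaseColumnWriter {
public:
    virtual ~BaseColumnWriter() = default;

    // Most writers emit no header, so the default contribution is zero.
    virtual size_t get_total_header_size() const { return 0; }
};

class ClpStringColumnWriter : public BaseColumnWriter {
public:
    // This writer prefixes its data with the size of the encoded variables
    // column, so it reports one extra size_t of header.
    size_t get_total_header_size() const override { return sizeof(size_t); }
};
```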
/**
 * Returns the total size of the header data that will be written to the compressor. This header
 * size plus the sum of sizes returned by add_value is equal to the total size of data that will
After the code change, add_value will not return any size?
It does, but by reference. I'll change it to return by value so that it's less confusing.
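A minimal sketch of the two signatures being contrasted — the class and member names here are hypothetical, chosen only to make the difference concrete:

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical simplification of the change discussed above.
class Int64Column {
public:
    // Before: the appended size is reported through a reference parameter,
    // which is easy to overlook at the call site.
    void add_value_by_ref(int64_t value, size_t& size) {
        m_values.push_back(value);
        size = sizeof(int64_t);
    }

    // After: returning the size by value makes the data flow explicit.
    size_t add_value(int64_t value) {
        m_values.push_back(value);
        return sizeof(int64_t);
    }

private:
    std::vector<int64_t> m_values;
};
```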
 * size plus the sum of sizes returned by add_value is equal to the total size of data that will
 * be written to the compressor in bytes.
 *
 * @return the total size of header data that will
The description is not complete?
};
std::sort(schemas.begin(), schemas.end(), comp);

size_t current_table_size = 0;
Do you think we should come up with a better name, since in line 203 we use table_offset?
I'll change them to current_stream_offset and stream_offset.
// table metadata schema
// # num tables <64 bit>
// # [offset into file <64 bit> uncompressed size <64 bit>]+
// # num schemas <64 bit>
// # [table id <64 bit> offset into table <64 bit> schema id <32 bit> num messages <64 bit>]+
Maybe we should add a description where we declare these two variables (covering the format and the usage)?
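For illustration, a minimal sketch of writing metadata in the documented layout. The struct names and the generic `Compressor` parameter are assumptions made for the sketch; only the `write_numeric_value` call and the field order come from the snippets in this thread.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical records mirroring the documented metadata layout.
struct TableMetadata {
    uint64_t offset_into_file;
    uint64_t uncompressed_size;
};

struct SchemaTableMetadata {
    uint64_t table_id;
    uint64_t offset_into_table;
    int32_t schema_id;
    uint64_t num_messages;
};

// Writes the metadata in the order given by the schema comment above.
// `Compressor` stands in for the Zstd-backed writer used by clp-s.
template <typename Compressor>
void write_table_metadata(
        Compressor& compressor,
        std::vector<TableMetadata> const& tables,
        std::vector<SchemaTableMetadata> const& schemas
) {
    compressor.write_numeric_value(static_cast<uint64_t>(tables.size()));
    for (auto const& t : tables) {
        compressor.write_numeric_value(t.offset_into_file);
        compressor.write_numeric_value(t.uncompressed_size);
    }
    compressor.write_numeric_value(static_cast<uint64_t>(schemas.size()));
    for (auto const& s : schemas) {
        compressor.write_numeric_value(s.table_id);
        compressor.write_numeric_value(s.offset_into_table);
        compressor.write_numeric_value(s.schema_id);
        compressor.write_numeric_value(s.num_messages);
    }
}
```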
    m_table_metadata_compressor.write_numeric_value(uncompressed_size);
}

m_table_metadata_compressor.write_numeric_value(schema_metadata.size());
for (auto& [schema_id, num_messages, table_id, table_offset] : schema_metadata) {
Can we make the order of these four values in the tuple consistent with the writing order?
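A hypothetical fragment illustrating the suggestion — the tuple fields are reordered here (an assumption, chosen to match the write sequence in the metadata schema) so the structured binding reads in the same order the values are written:

```cpp
// Fields declared in write order: table id, offset into table, schema id, num messages.
std::vector<std::tuple<uint64_t, uint64_t, int32_t, uint64_t>> schema_metadata;

for (auto& [table_id, table_offset, schema_id, num_messages] : schema_metadata) {
    m_table_metadata_compressor.write_numeric_value(table_id);
    m_table_metadata_compressor.write_numeric_value(table_offset);
    m_table_metadata_compressor.write_numeric_value(schema_id);
    m_table_metadata_compressor.write_numeric_value(num_messages);
}
```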
This reverts commit 794a732.
Great work! One thing to note is that we may need to clarify the distinction between schema and table, as now a table refers to something like a merged table.
 * ask for the same buffer to be reused to read multiple different tables: this can save memory
 * allocations, but can only be used when tables are read one at a time.
 */
std::shared_ptr<char[]> read_table(size_t table_id, bool reuse_buffer);
Can we add descriptions for parameters and the return value?
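A hypothetical sketch of the requested documentation — the parameter descriptions are assumptions inferred from the surrounding discussion, not text from the PR:

```cpp
/**
 * Reads the table with the given ID into a buffer.
 *
 * @param table_id ID of the table to read
 * @param reuse_buffer Whether to reuse the previously allocated buffer;
 *        only safe when tables are read one at a time
 * @return A shared_ptr to a buffer containing the requested table
 */
std::shared_ptr<char[]> read_table(size_t table_id, bool reuse_buffer);
```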
 * Reads table metadata from the provided compression stream. Must be invoked before reading
 * tables.
 */
void read_metadata(ZstdDecompressor& decompressor);

/**
 * Opens a file reader for the tables section. Must be invoked before reading tables.
 */
void open_tables(std::string const& tables_file_path);
Add descriptions for the parameters?
 * @param table_id
 * @param buf
 * @param buf_size
 * @return a shared_ptr to a buffer containing the requested table
It doesn't return any value?
void close();

/**
 * Decompresses a table with a given table_id and returns it. This function must be called
Same here.
FileReader m_tables_reader;
ZstdDecompressor m_tables_decompressor;
TableReaderState m_state{TableReaderState::Uninitialized};
size_t m_previous_table_id{0ULL};
Since we use prev in other places, do you think it's better to change it to m_prev_table_id?
Co-authored-by: wraymo <[email protected]>
Right, yeah that will probably be too confusing to anyone new to this code. We could change the terminology to "streams" or "packed streams" or "packed compression streams" to disambiguate from schema tables, and rename TableReader to PackedStreamReader or something? I kind of like PackedStreamReader since it can hold the double meaning of multiple things packed within a stream, and multiple streams packed together in a file. We should definitely give it some thought and clear up the terminology before merging at any rate.
Yeah, PackedStreamReader is better. Not sure if @kirkrodrigues has any better ideas.
Seems reasonable.
This PR should have everything it needs to get merged, so could you take another look @wraymo? If you don't have the bandwidth to do the final review soon, I'll probably go ahead and make a few more changes so the metadata format is forward-compatible with future encoding plans.
I'm going to go ahead and slightly extend the table packing metadata in the way that I explain in that doc that I shared with you.
@@ -171,6 +175,18 @@ class ArchiveReader {
        bool should_marshal_records
    );

    /**
     * Reads a table with given ID from the table reader. If read_stream is called multiple times in
PackedStreamReader?
 */
using schema_map_it = decltype(m_id_to_schema_writer)::iterator;
std::vector<schema_map_it> schemas;
std::vector<std::tuple<size_t, size_t>> stream_metadata;
Can we reuse PackedStreamMetadata here?
I'd prefer leaving this as is for now, and considering a refactor when we come back to add support for storing different columns as different compression streams.
Co-authored-by: wraymo <[email protected]>
Walkthrough

The changes involve substantial updates to the clp_s component.

Changes
Sequence Diagram(s)

sequenceDiagram
participant User
participant CommandLineArguments
participant ArchiveWriter
participant ArchiveReader
participant PackedStreamReader
User->>CommandLineArguments: Provide --min-table-size
CommandLineArguments->>ArchiveWriter: Set min_table_size
ArchiveWriter->>ArchiveReader: Prepare to read tables
ArchiveReader->>PackedStreamReader: Read packed streams
PackedStreamReader->>ArchiveReader: Return stream data
ArchiveReader->>ArchiveWriter: Write data with updated schema
Actionable comments posted: 12
🧹 Outside diff range and nitpick comments (22)
components/core/src/clp_s/SchemaWriter.cpp (1)

12-16: LGTM: Improved initialization and simplified size accumulation.

The changes to append_message are well-implemented:

- The use of uniform initialization for count and total_size is a good modern C++ practice.
- The simplification of total_size accumulation improves readability and reduces the chance of errors.

These modifications maintain the method's original functionality while enhancing code quality.

Consider using size_t instead of int for count to match the type used for sizes throughout the method:

- int count{};
+ size_t count{};

This change would provide better type consistency and potentially avoid implicit conversions.

Also applies to: 20-20, 25-25
components/core/src/clp_s/SchemaWriter.hpp (2)

49-51: LGTM! Consider adding a unit test.

The addition of the get_total_uncompressed_size method is well-implemented and aligns with the PR objectives. It provides a clean way to retrieve the total in-memory size of the managed tables.

Consider adding a unit test for this new method to ensure it correctly reports the total uncompressed size. Would you like me to generate a sample unit test or open a GitHub issue for this task?
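A hypothetical sketch of what such a test could look like — the Catch2 framework, the test wiring, and the appending helpers are assumptions, not code from this PR:

```cpp
#include <catch2/catch.hpp>

// Sketch only: assumes a default-constructible SchemaWriter and the
// get_total_uncompressed_size accessor discussed above.
TEST_CASE("SchemaWriter tracks total uncompressed size", "[SchemaWriter]") {
    SchemaWriter writer;

    // A freshly constructed writer should report zero bytes.
    REQUIRE(0 == writer.get_total_uncompressed_size());

    // After appending columns and messages, the reported size should equal
    // the sum of the column header sizes and the appended value sizes.
    // (Appending helpers omitted here; they depend on the writer's setup.)
}
```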
55-55: LGTM! Consider adding a comment.

The addition of the m_total_uncompressed_size member variable is appropriate and aligns with the PR objectives. The use of uniform initialization is a good C++ practice.

Consider adding a brief comment to explain the purpose of this member variable, for example:

// Tracks the total uncompressed size of all managed tables
size_t m_total_uncompressed_size{};

components/core/src/clp_s/JsonParser.hpp (1)
35-35: LGTM! Consider adding documentation for the new member variable.

The addition of min_table_size aligns well with the PR objectives for implementing table packing. This new configuration option will allow users to specify the minimum size for tables, which is crucial for the table packing algorithm.

To improve code clarity, consider adding a brief comment explaining the purpose and usage of min_table_size. For example:

// Minimum size (in bytes) for a table before it's considered for packing
size_t min_table_size;

components/core/src/clp_s/ColumnWriter.cpp (3)
9-12: Approved change with a minor suggestion.

The removal of the size_t return type simplifies the method and aligns it with the changes made to other store methods. This is a good improvement.

For consistency with modern C++ style, consider removing the unused size variable:

void Int64ColumnWriter::store(ZstdCompressor& compressor) {
-    size_t size = m_values.size() * sizeof(int64_t);
-    compressor.write(reinterpret_cast<char const*>(m_values.data()), size);
+    compressor.write(
+            reinterpret_cast<char const*>(m_values.data()),
+            m_values.size() * sizeof(int64_t)
+    );
}

This change would eliminate an unnecessary variable and make the code slightly more concise.
19-22: Approved change with a minor suggestion for consistency.

The removal of the size_t return type is a good improvement, consistent with the changes made to Int64ColumnWriter::store.

For consistency, consider removing the unused size variable:

void FloatColumnWriter::store(ZstdCompressor& compressor) {
-    size_t size = m_values.size() * sizeof(double);
-    compressor.write(reinterpret_cast<char const*>(m_values.data()), size);
+    compressor.write(
+            reinterpret_cast<char const*>(m_values.data()),
+            m_values.size() * sizeof(double)
+    );
}

This change would make the code slightly more concise and consistent with the suggested improvement for Int64ColumnWriter::store.
29-32: Approved change with a minor suggestion for consistency.

The removal of the size_t return type is a good improvement, consistent with the changes made to other store methods.

For consistency, consider removing the unused size variable:

void BooleanColumnWriter::store(ZstdCompressor& compressor) {
-    size_t size = m_values.size() * sizeof(uint8_t);
-    compressor.write(reinterpret_cast<char const*>(m_values.data()), size);
+    compressor.write(
+            reinterpret_cast<char const*>(m_values.data()),
+            m_values.size() * sizeof(uint8_t)
+    );
}

This change would make the code slightly more concise and consistent with the suggested improvements for other store methods.

components/core/src/clp_s/ArchiveWriter.hpp (2)
24-24: LGTM! Consider adding documentation for the new option.

The addition of min_table_size to the ArchiveWriterOption struct is appropriate and aligns with the PR objectives. The type and naming are suitable.

Consider adding a brief comment to explain the purpose and expected values of min_table_size. For example:

/// Minimum size threshold for packing tables together (in bytes)
size_t min_table_size;

163-163: LGTM! Consider initializing with a default value.

The addition of m_min_table_size to the ArchiveWriter class is appropriate and consistent with the class's naming conventions.

Consider initializing m_min_table_size with a sensible default value instead of zero. This could prevent potential issues if the value is not explicitly set. For example:

size_t m_min_table_size{1024};  // Default to 1 KB

Ensure that the chosen default value aligns with the expected usage of this parameter.
components/core/src/clp_s/ColumnWriter.hpp (1)

30-47: LGTM! Consider adding more detailed documentation.

The changes to the BaseColumnWriter class improve the design and provide more flexibility. The new add_value signature is more intuitive, and the addition of get_total_header_size allows for separate tracking of header sizes.

Consider adding more detailed documentation for the get_total_header_size method, explaining its purpose and how it relates to the overall size calculation process.

components/core/src/clp_s/ArchiveReader.hpp (2)
95-99: LGTM: Method renaming improves clarity.

The renaming of read_table to read_schema_table enhances code readability and aligns with the discussed terminology improvements.

Consider updating the method's documentation to reflect the new name and clarify its purpose in the context of schema tables.

206-208: LGTM: New member variables support packed stream functionality.

The addition of m_stream_buffer, m_stream_buffer_size, and m_cur_stream_id appropriately supports the new packed stream reading functionality. The use of std::shared_ptr for the buffer and the initialization of size variables to 0 are good practices.

Consider adding brief inline comments to explain the purpose of these new member variables, especially their role in supporting the read_stream method.

components/core/src/clp_s/CommandLineArguments.hpp (1)
178-178: LGTM: New member variable added correctly. Consider using a named constant.

The new m_minimum_table_size member variable is well-implemented. It follows the class's naming conventions and is initialized with a clear value.

For improved readability and maintainability, consider defining a named constant for the 1 MB value. For example:

static constexpr size_t ONE_MEGABYTE = 1ULL * 1024 * 1024;
size_t m_minimum_table_size{ONE_MEGABYTE};  // 1 MB

This approach makes the code more self-documenting and easier to update if needed.
components/core/src/clp_s/clp-s.cpp (1)
92-92: LGTM! Consider adding a comment for consistency.

The addition of the min_table_size option is well-placed and aligns with the PR's objective of implementing table packing. This change allows for configurable minimum table size thresholds, which is crucial for the new feature.

For consistency with other options, consider adding a brief comment explaining the purpose of min_table_size, similar to the comments for other options in the CommandLineArguments class. This would improve code readability and maintainability.

components/core/src/clp_s/JsonParser.cpp (1)
34-34: LGTM! Consider adding a comment for clarity.

The addition of min_table_size to m_archive_options is consistent with the PR objectives and aligns well with the existing code structure. This change supports the new table packing feature by providing a size threshold for combining small tables.

Consider adding a brief comment explaining the purpose of min_table_size for improved code readability. For example:

+ // Minimum size threshold for table packing
m_archive_options.min_table_size = option.min_table_size;
components/core/src/clp_s/SchemaReader.cpp (2)

40-46: LGTM! Consider adding error handling for null buffer.

The changes to the load function look good. The new implementation simplifies buffer management by relying on an external buffer, which could potentially improve performance and reduce memory overhead.

Consider adding a null check for the stream_buffer to ensure it's not empty before dereferencing it. You could add something like this at the beginning of the function:

if (!stream_buffer) {
    throw OperationFailed(ErrorCodeInvalidArgument, __FILENAME__, __LINE__);
}

Line range hint 1-624: Consider refactoring and improving documentation for better maintainability.

While the changes made to the load function are good, the overall complexity of this file suggests that it might benefit from some refactoring and additional documentation. Consider the following suggestions:

- Break down large functions into smaller, more manageable pieces.
- Add more inline comments explaining complex logic, especially in the JSON serialization and schema handling sections.
- Consider creating separate classes for handling different aspects of the schema reading process (e.g., JSON serialization, schema tree management).
- Add comprehensive documentation for public methods, including their purpose, parameters, and return values.

These improvements could enhance the maintainability and readability of the code, making it easier for other developers to understand and modify in the future.
components/core/src/clp_s/CommandLineArguments.cpp (1)

163-167: LGTM! Consider clarifying the option description.

The addition of the --min-table-size option aligns well with the PR objectives for improving compression efficiency. The implementation is correct and consistent with the existing code structure.

Consider slightly modifying the option description to be more explicit:

- "Minimum size (B) for a packed table before it gets compressed."
+ "Minimum size (in bytes) for a packed table before it gets compressed."

This change makes it clearer that the size is specified in bytes, which could be helpful for users unfamiliar with the notation.
components/core/src/clp_s/PackedStreamReader.hpp (2)
80-86
: Use scoped enumeration forPackedStreamReaderState
.Converting
PackedStreamReaderState
to a scoped enumeration (enum class
) enhances type safety and prevents accidental misuse of the enumerators.Apply the following diff:
-enum PackedStreamReaderState { +enum class PackedStreamReaderState { Uninitialized, MetadataRead, PackedStreamsOpened, PackedStreamsOpenedAndMetadataRead, ReadingPackedStreams };Remember to qualify enumerator usage with the enum class name, e.g.,
PackedStreamReaderState::Uninitialized
.
73-73
: Consider markingread_stream
asnodiscard
.The method
read_stream
modifies its parameters by reference. To prevent accidental misuse where the updated parameters are ignored, consider marking the method with[[nodiscard]]
.Apply the following diff:
+[[nodiscard]] void read_stream(size_t stream_id, std::shared_ptr<char[]>& buf, size_t& buf_size);
components/core/src/clp_s/PackedStreamReader.cpp (1)
87-89
: Document the sequential access requirement forstream_id
The condition
if (m_prev_stream_id >= stream_id)
enforces thatstream_id
must be greater thanm_prev_stream_id
, effectively requiring streams to be read in increasing order. If this sequential access is intended, consider documenting this requirement in the class interface or method documentation to inform users.components/core/src/clp_s/ArchiveWriter.cpp (1)
146-180
: Improve documentation clarity for the metadata schemaThe added comments provide detailed information about the packed stream metadata schema. To enhance readability and maintain consistency, consider reformatting the documentation. Aligning the indentation and using bullet points more effectively can make the structure clearer.
📒 Files selected for processing (19)
- components/core/src/clp_s/ArchiveReader.cpp (7 hunks)
- components/core/src/clp_s/ArchiveReader.hpp (4 hunks)
- components/core/src/clp_s/ArchiveWriter.cpp (3 hunks)
- components/core/src/clp_s/ArchiveWriter.hpp (2 hunks)
- components/core/src/clp_s/CMakeLists.txt (1 hunks)
- components/core/src/clp_s/ColumnWriter.cpp (2 hunks)
- components/core/src/clp_s/ColumnWriter.hpp (7 hunks)
- components/core/src/clp_s/CommandLineArguments.cpp (1 hunks)
- components/core/src/clp_s/CommandLineArguments.hpp (2 hunks)
- components/core/src/clp_s/JsonParser.cpp (1 hunks)
- components/core/src/clp_s/JsonParser.hpp (1 hunks)
- components/core/src/clp_s/PackedStreamReader.cpp (1 hunks)
- components/core/src/clp_s/PackedStreamReader.hpp (1 hunks)
- components/core/src/clp_s/SchemaReader.cpp (1 hunks)
- components/core/src/clp_s/SchemaReader.hpp (4 hunks)
- components/core/src/clp_s/SchemaWriter.cpp (1 hunks)
- components/core/src/clp_s/SchemaWriter.hpp (1 hunks)
- components/core/src/clp_s/clp-s.cpp (1 hunks)
- components/core/src/clp_s/search/Output.cpp (1 hunks)
🧰 Additional context used
🪛 cppcheck
components/core/src/clp_s/ArchiveWriter.cpp
[error] 191-191: Exception thrown in function declared not to throw exceptions.
(throwInNoexceptFunction)
🔇 Additional comments (39)
components/core/src/clp_s/SchemaWriter.cpp (3)
7-7: LGTM: Appropriate update to track total uncompressed size.

The addition of the column header size to m_total_uncompressed_size is consistent with the PR's objective of tracking the total in-memory size of tables. This change will contribute to more accurate size calculations for table packing.

Line range hint 1-43: Overall assessment: Well-implemented changes supporting table packing.

The modifications to SchemaWriter.cpp are well-aligned with the PR objectives:

- Improved tracking of uncompressed sizes in append_column and append_message.
- Simplified store method, removing redundant size calculations.
- Consistent use of modern C++ practices.

These changes effectively support the implementation of table packing while maintaining code clarity and efficiency. The refactoring suggestions provided are minor and do not impact the overall quality of the implementation.

29-32: LGTM: Simplified store method aligns with new size tracking approach.

The changes to the store method are appropriate:

- Changing the return type to void is consistent with the new approach of tracking sizes incrementally.
- The simplified method body correctly focuses on storing column data without redundant size tracking.

These modifications align well with the PR objectives and the changes made in other methods.

To ensure this change doesn't break any existing code that might have relied on the returned size, let's verify the usage of this method:

✅ Verification successful

Verified: No external usages expect a return value from SchemaWriter::store.

The verification confirms that there are no calls to SchemaWriter::store expecting a return value. Therefore, changing the return type to void is safe and aligns with the updated size tracking approach.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any uses of SchemaWriter::store that expect a return value
# Search for SchemaWriter::store method calls
rg --type cpp -A 5 'SchemaWriter::store|schemaWriter\.store'

Length of output: 456
components/core/src/clp_s/SchemaWriter.hpp (1)
44-44
: LGTM! Verify impact on dependent code.The change to the
store
method signature aligns with the PR objectives. The removal of the return value is consistent with the new approach of tracking the total in-memory size internally.Please run the following script to verify that no other parts of the codebase are relying on the removed return value:
✅ Verification successful
Verified: No dependent code relies on the removed return value of
SchemaWriter::store()
.The shell script did not find any usages of
SchemaWriter::store()
that expect a return value. This confirms that the change to the method signature does not impact other parts of the codebase.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check for usages of SchemaWriter::store that expect a return value # Test: Search for SchemaWriter::store usages. Expect: No assignments from the method call. rg --type cpp -e "SchemaWriter::store" -e "\.store\(" | grep -v "void.*store"Length of output: 835
Script:
#!/bin/bash # Description: Check for SchemaWriter::store() usages that assign its return value rg --type cpp 'SchemaWriter::store\s*\(' | grep '='Length of output: 51
components/core/src/clp_s/JsonParser.hpp (1)
35-35
: Summary: Minimal change with potential impact on table packingThe addition of
min_table_size
toJsonParserOption
is a focused change that supports the new table packing feature. While the modification in this file is minimal, it has the potential to significantly impact the compression efficiency of small tables as described in the PR objectives.To ensure the full benefit of this change:
- Verify that the table packing implementation in other files (e.g.,
SchemaWriter
) correctly utilizes this new option.- Consider adding a brief comment in the code to explain the purpose of
min_table_size
.- Update any relevant documentation or user guides to explain how to use this new option effectively.
components/core/src/clp_s/ColumnWriter.cpp (4)
4-7
: Excellent improvement to the method signature!The change from using an output parameter to returning the size directly is a great improvement. It simplifies the method interface and aligns with modern C++ best practices. This modification enhances readability and reduces the potential for errors related to output parameter usage.
14-17
: Excellent consistency in method signature improvement!The changes to
FloatColumnWriter::add_value
are consistent with those made toInt64ColumnWriter::add_value
. This modification simplifies the method interface, improves readability, and aligns with modern C++ best practices. Well done!
24-27
: Excellent consistency in method signature improvement!The changes to
BooleanColumnWriter::add_value
are consistent with those made to otheradd_value
methods. This modification simplifies the method interface, improves readability, and aligns with modern C++ best practices. Great job maintaining consistency across different column writer classes!
Line range hint
34-86
: Consistent improvements across remaining column writer classes.The changes made to the
add_value
andstore
methods ofClpStringColumnWriter
,VariableStringColumnWriter
, andDateStringColumnWriter
are consistent with the improvements made to the previously reviewed classes. These changes include:
- Modifying
add_value
methods to returnsize_t
instead of using an output parameter.- Changing
store
methods to returnvoid
instead ofsize_t
.These modifications maintain consistency across all column writer classes, simplify method interfaces, and align with modern C++ practices. The core functionality of each method remains intact.
Consider applying the same minor improvement suggested for other
store
methods by removing unusedsize
variables where applicable.Overall, these changes represent a significant improvement in code consistency and readability across the entire file.
components/core/src/clp_s/ArchiveWriter.hpp (1)
Line range hint
1-179
: Verify the usage ofm_min_table_size
in class methods.The changes to add
min_table_size
toArchiveWriterOption
andm_min_table_size
toArchiveWriter
look good. However, it's important to ensure that this new member variable is properly utilized in the relevant class methods.Please run the following script to check for the usage of
m_min_table_size
in the implementation file:If there are no matches, consider updating the relevant methods (e.g.,
store_tables()
) to use this new member variable for table packing logic.✅ Verification successful
Usage of
m_min_table_size
VerifiedThe
m_min_table_size
member variable is properly set and utilized withinArchiveWriter.cpp
, ensuring the newmin_table_size
option is effectively integrated.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check for the usage of m_min_table_size in ArchiveWriter implementation # Test: Search for m_min_table_size usage in ArchiveWriter.cpp rg --type cpp "m_min_table_size" "components/core/src/clp_s/ArchiveWriter.cpp"Length of output: 222
components/core/src/clp_s/ColumnWriter.hpp (6)
62-64
: LGTM! Changes are consistent with the base class.The updates to the
add_value
andstore
method signatures in theInt64ColumnWriter
class align with the changes made in theBaseColumnWriter
class, maintaining consistency across the inheritance hierarchy.
79-81
: LGTM! Changes are consistent with the base class.The updates to the
add_value
andstore
method signatures in theFloatColumnWriter
class align with the changes made in theBaseColumnWriter
class, maintaining consistency across the inheritance hierarchy.
96-98
: LGTM! Changes are consistent with the base class.The updates to the
add_value
andstore
method signatures in theBooleanColumnWriter
class align with the changes made in theBaseColumnWriter
class, maintaining consistency across the inheritance hierarchy.
120-124
: LGTM! Changes are consistent and address previous concerns.The updates to the
add_value
andstore
method signatures in theClpStringColumnWriter
class align with the changes made in theBaseColumnWriter
class. The addition of theget_total_header_size
method is appropriate for this specific column type.Regarding the previous comment about adding this method to other classes: As gibber9809 mentioned, the default implementation returning 0 works for all other column writers. The
ClpStringColumnWriter
is the only one that needs a custom implementation due to its extra header for recording the size of the encoded variables column.
176-178
: LGTM! Changes are consistent with the base class.The updates to the
add_value
andstore
method signatures in theVariableStringColumnWriter
class align with the changes made in theBaseColumnWriter
class, maintaining consistency across the inheritance hierarchy.
194-196
: LGTM! Changes are consistent with the base class.The updates to the
add_value
andstore
method signatures in theDateStringColumnWriter
class align with the changes made in theBaseColumnWriter
class, maintaining consistency across the inheritance hierarchy.components/core/src/clp_s/CMakeLists.txt (1)
68-69
: LGTM! New files for PackedStreamReader added correctly.The addition of
PackedStreamReader.cpp
andPackedStreamReader.hpp
to theCLP_S_SOURCES
list is correct and aligns with the PR objectives for implementing table packing.Let's verify that these files exist in the correct directory:
✅ Verification successful
Verification Successful: Files Exist as Expected.
Both
PackedStreamReader.cpp
andPackedStreamReader.hpp
are present in thecomponents/core/src/clp_s
directory, confirming that the additions align with the PR objectives for implementing table packing.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify the existence of PackedStreamReader files # Test: Check if the files exist fd -t f "PackedStreamReader\.(cpp|hpp)$" components/core/src/clp_sLength of output: 166
components/core/src/clp_s/ArchiveReader.hpp (5)
13-13
: LGTM: New header file inclusion is appropriate.The inclusion of "PackedStreamReader.hpp" aligns with the PR objectives of implementing table packing functionality.
178-188
: LGTM: Newread_stream
method is well-implemented and documented.The introduction of the
read_stream
method aligns with the PR objectives for implementing table packing. The method's signature and documentation provide clear guidance on its usage and purpose, particularly regarding buffer reuse for optimizing memory allocations.
200-200
: LGTM: Member variable update improves consistency with new terminology.The renaming of
m_id_to_table_metadata
tom_id_to_schema_metadata
and the corresponding type change toSchemaMetadata
align well with the PR's focus on schema-level operations and the new packed stream approach.
202-202
: LGTM: Member variable update aligns with packed stream implementation.The change from
FileReader m_tables_file_reader
toPackedStreamReader m_stream_reader
accurately reflects the new packed stream approach. This update also implements the previously suggested naming convention, improving code clarity.
Line range hint
1-212
: Overall LGTM: Changes effectively implement table packing functionality.The modifications to the
ArchiveReader
class successfully implement the table packing functionality as outlined in the PR objectives. The changes, including the newread_stream
method, updated member variables, and improved terminology, collectively enhance the class's ability to handle packed streams efficiently. The code maintains good practices in memory management and provides clear documentation for new features.components/core/src/clp_s/CommandLineArguments.hpp (1)
111-112
: LGTM: New getter method added correctly.The new
get_minimum_table_size()
method is well-implemented. It follows the class's naming conventions for getter methods, is properly const-qualified, and returns the correct type.components/core/src/clp_s/SchemaReader.cpp (1)
Line range hint
46-52
: LGTM! Buffer initialization and column loading look good.The changes to initialize the
BufferViewReader
with the adjustedstream_buffer
pointer are correct and consistent with the new function signature. The rest of the function, including the column loading and error checking, remains unchanged and appropriate.components/core/src/clp_s/search/Output.cpp (1)
Line range hint
87-91
: LGTM! Method updated to useread_schema_table()
.The change from
read_table()
toread_schema_table()
aligns with the new terminology for packed streams. This update is consistent with the PR objectives and discussions.Let's verify that there are no remaining instances of
read_table()
that might need updating:✅ Verification successful
Verification Successful: No Remaining
read_table()
Instances FoundAll instances of
read_table()
have been successfully updated toread_schema_table()
.🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Check for any remaining instances of 'read_table()' in the codebase. # Search for 'read_table(' in all C++ files rg --type cpp 'read_table\(' -C 3Length of output: 35
components/core/src/clp_s/PackedStreamReader.hpp (1)
76-76
: Potential out-of-bounds access inget_uncompressed_stream_size
.The method
get_uncompressed_stream_size
usesm_stream_metadata.at(stream_id)
, which will throw an exception ifstream_id
is out of bounds. Whileat()
performs bounds checking, it might be clearer to document or handle this possibility explicitly.To ensure that all calls to
get_uncompressed_stream_size
use validstream_id
values, run the following script:components/core/src/clp_s/SchemaReader.hpp (3)
4-4
: LGTMThe inclusion of
<memory>
is appropriate for the use ofstd::shared_ptr
.
50-53
: LGTMThe new
SchemaMetadata
structure properly includesstream_id
,stream_offset
, andnum_messages
.
282-282
: LGTMThe use of
std::shared_ptr<char[]>
form_stream_buffer
is appropriate.components/core/src/clp_s/ArchiveWriter.cpp (1)
212-213
: Ensure correct handling when flushing compression streamsThe condition:
if (current_stream_offset > m_min_table_size || schemas.size() == schema_metadata.size()) {
might skip flushing when
current_stream_offset
equalsm_min_table_size
. Verify that this logic correctly handles all intended cases to prevent potential off-by-one errors.components/core/src/clp_s/ArchiveReader.cpp (9)
31-31
: Initialization of Packed StreamsThe call to
m_stream_reader.open_packed_streams
correctly initializes the packed streams using the archive path. This aligns with the new approach for accessing archives.
41-54
: Proper Handling of Metadata Reading and Error CheckingThe code effectively reads the metadata and handles potential errors from
try_read_numeric_value
. The check fornum_separate_column_schemas
ensures that unsupported schemas are appropriately flagged, preventing processing of incompatible data.
99-114
: Accurate Calculation of Uncompressed SizeThe logic for calculating
uncompressed_size
and updating the schema metadata is correctly implemented. It handles transitions between different streams and updates the metadata accordingly.
117-120
: Final Metadata Update for Last SchemaThe code correctly calculates the uncompressed size for the last schema after the loop, ensuring that all schema metadata entries have accurate size information.
Line range hint
132-151
: Schema Table Reading Logic UpdatedThe
read_schema_table
method properly retrieves the schema metadata fromm_id_to_schema_metadata
and loads the data using the new stream-based approach. This update reflects the shift from table-based to stream-based operations.
157-167
: Efficient Reading of All Schema TablesThe
read_all_tables
method has been updated to utilize the new stream-based operations effectively. It iterates overm_schema_ids
, initializes each schema reader, and loads the corresponding data, ensuring all tables are read correctly.
274-274
: Initialization with Updated Number of MessagesIn
initialize_schema_reader
, the use ofnum_messages
fromm_id_to_schema_metadata
ensures that the schema reader is accurately initialized with the correct number of messages, enhancing data consistency.
321-322
: Correct Usage of Updatedread_schema_table
MethodThe
store
method correctly utilizes the updatedread_schema_table
method to read messages from all schema readers, ensuring that data is accurately written to the provided writer.
340-347
: Proper Resource Cleanup inclose
MethodThe
close
method effectively releases all resources by closingm_stream_reader
, clearing metadata maps, and resetting buffers. This ensures no resource leaks occur upon closing the archive reader.
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (4)
components/core/src/clp_s/ArchiveWriter.cpp (4)
17-17: New member variable added correctly.

The addition of m_min_table_size is appropriate for the new table packing feature. It's correctly initialized from the ArchiveWriterOption.

Consider adding a comment explaining the purpose of this variable, especially its role in determining when to flush metadata in the store_tables method.
: Excellent documentation of the new metadata schemaThe detailed comment block provides a clear and comprehensive explanation of the packed stream metadata schema. This documentation is crucial for understanding the new table packing feature and will greatly assist future maintainers.
Consider adding a brief note about why this approach was chosen (e.g., to enhance compression efficiency for small tables) to provide additional context.
197-223
: Table packing logic implemented effectivelyThe new logic for packing tables into streams is well-implemented and aligns with the PR objectives. The use of
m_min_table_size
provides flexible control over stream creation, which is excellent.To improve clarity, consider adding a comment explaining the significance of the condition on line 212:
// Flush the current stream if it exceeds the minimum size or if this is the last schema if (current_stream_offset > m_min_table_size || schemas.size() == schema_metadata.size()) { // ... (existing code) }This will help future readers understand the logic behind stream flushing more quickly.
231-234
: Clear explanation for future extensibilityThe comment explaining that the current implementation doesn't store large tables as separate columns is helpful. It provides context for the zero value and hints at potential future enhancements.
Consider adding a TODO comment to make it easier for future developers to find this spot when implementing separate column storage:
// TODO: Implement support for storing large tables as separate columns size_t const num_separate_column_schemas{0};
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- components/core/src/clp_s/ArchiveReader.hpp (4 hunks)
- components/core/src/clp_s/ArchiveWriter.cpp (3 hunks)
- components/core/src/clp_s/PackedStreamReader.cpp (1 hunks)
- components/core/src/clp_s/PackedStreamReader.hpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- components/core/src/clp_s/ArchiveReader.hpp
- components/core/src/clp_s/PackedStreamReader.cpp
- components/core/src/clp_s/PackedStreamReader.hpp
🧰 Additional context used
🪛 cppcheck
components/core/src/clp_s/ArchiveWriter.cpp
[error] 191-191: Exception thrown in function declared not to throw exceptions.
(throwInNoexceptFunction)
🔇 Additional comments (3)
components/core/src/clp_s/ArchiveWriter.cpp (3)
3-5
: Appropriate header inclusions addedThe addition of
<algorithm>
and<tuple>
headers is necessary for the new functionality in thestore_tables
method. Good job on including only the required headers.
225-242
: Metadata writing implemented correctlyThe implementation of writing the collected metadata to the compressor is correct and follows the schema described in the earlier comment block. The use of structured bindings enhances readability, which is excellent.
Both stream metadata and schema metadata are handled appropriately, ensuring that all necessary information is stored for later retrieval during decompression.
Line range hint
1-285
: Comprehensive implementation of table packing featureThe changes in this file successfully implement the table packing feature as described in the PR objectives. The new metadata schema and writing logic are well-designed and should effectively address the issue of compression ratio outliers for small tables.
Key improvements:
- Introduction of a flexible minimum table size for packing.
- Sorting of schemas based on uncompressed size for optimal packing.
- Clear and detailed documentation of the new metadata schema.
- Efficient implementation of metadata collection and writing.
These changes should significantly enhance compression efficiency, especially for cases where array-structurization generates numerous small tables. The code is well-commented and structured, which will aid in future maintenance and potential extensions of the feature.
Great job on this implementation!
🧰 Tools
🪛 cppcheck
[error] 191-191: Exception thrown in function declared not to throw exceptions.
(throwInNoexceptFunction)
using schema_map_it = decltype(m_id_to_schema_writer)::iterator;
std::vector<schema_map_it> schemas;
std::vector<std::tuple<size_t, size_t>> stream_metadata;
std::vector<std::tuple<size_t, size_t, int32_t, size_t>> schema_metadata;
🛠️ Refactor suggestion
Consider using structs instead of tuples for metadata
While using std::tuple
for metadata storage is efficient, it may reduce code readability due to unnamed fields. Consider defining custom structs with descriptive member names to improve clarity and maintainability.
Example:
struct StreamMetadata {
size_t file_offset;
size_t uncompressed_size;
};
struct SchemaMetadata {
size_t stream_id;
size_t stream_offset;
int32_t schema_id;
size_t num_messages;
};
Then update the declarations:
std::vector<StreamMetadata> stream_metadata;
std::vector<SchemaMetadata> schema_metadata;
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
components/core/src/clp_s/ArchiveWriter.cpp (1)
150-154: Minor grammatical improvements in comments.

To enhance clarity, consider adding commas and correcting verb agreement in the comments.

Apply this diff to improve the comments:

- * streams is identified by a 64 bit stream id. In the first half of the metadata we identify
+ * streams are identified by a 64-bit stream ID. In the first half of the metadata, we identify

- * In the second half of the metadata we record how many schema tables there are,
+ * In the second half of the metadata, we record how many schema tables there are,
📒 Files selected for processing (1)
- components/core/src/clp_s/ArchiveWriter.cpp (3 hunks)
🧰 Additional context used
🪛 cppcheck
components/core/src/clp_s/ArchiveWriter.cpp
[error] 191-191: Exception thrown in function declared not to throw exceptions.
(throwInNoexceptFunction)
🔇 Additional comments (6)
components/core/src/clp_s/ArchiveWriter.cpp (6)
3-4: Header files are appropriately included.

The addition of <algorithm> and <tuple> is necessary for std::sort and std::tuple used later in the code.
17-17: Proper initialization of m_min_table_size.

Assigning option.min_table_size to m_min_table_size correctly integrates the new table packing feature.
146-176: Comprehensive documentation of packed stream metadata schema.

The detailed comments explaining the packed stream metadata schema enhance code readability and maintainability.
181-184: Consider defining custom structs instead of using std::tuple.

Using custom structs with descriptive member names for stream_metadata and schema_metadata could improve code readability and maintainability.
191-193: Handle potential exceptions in the comparison lambda.

Ensure that get_total_uncompressed_size() is noexcept, or handle exceptions within the lambda to maintain exception safety during the sort operation.

🧰 Tools
🪛 cppcheck
[error] 191-191: Exception thrown in function declared not to throw exceptions.
(throwInNoexceptFunction)
212-223: Proper handling of stream boundaries in table packing.

The logic correctly determines when to close the current compression stream based on m_min_table_size or when all schemas have been processed, ensuring efficient packing.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (3)
components/core/src/clp_s/CommandLineArguments.hpp (1)
180-180: Consider adding documentation about the minimum table size purpose.

The variable initialization and naming are consistent with the codebase. However, it would be helpful to document that this represents the minimum size threshold for table packing, as mentioned in the PR objectives.

Consider adding a detailed comment like this:

- size_t m_minimum_table_size{1ULL * 1024 * 1024};  // 1 MB
+ // Minimum size threshold (1 MB) for packing small tables into a single compression stream
+ size_t m_minimum_table_size{1ULL * 1024 * 1024};

components/core/src/clp_s/ArchiveReader.cpp (1)
341-348: Enhance cleanup robustness.

Consider adding null checks before clearing resources to prevent potential issues during partial initialization scenarios.

- m_stream_reader.close();
- m_table_metadata_file_reader.close();
+ if (m_stream_reader) m_stream_reader.close();
+ if (m_table_metadata_file_reader) m_table_metadata_file_reader.close();

components/core/src/clp_s/CommandLineArguments.cpp (1)
163-167: LGTM! Consider adding value validation.

The new --min-table-size option is well-integrated with the existing compression options. However, consider adding validation to ensure the value is positive and reasonable.

Add validation after parsing, similar to other size parameters:

if (parsed_command_line_options.count("min-table-size") > 0) {
+    if (m_minimum_table_size <= 0) {
+        throw std::invalid_argument("min-table-size must be greater than zero.");
+    }
}
📒 Files selected for processing (8)
- components/core/src/clp_s/ArchiveReader.cpp (7 hunks)
- components/core/src/clp_s/ArchiveReader.hpp (4 hunks)
- components/core/src/clp_s/CMakeLists.txt (1 hunks)
- components/core/src/clp_s/CommandLineArguments.cpp (1 hunks)
- components/core/src/clp_s/CommandLineArguments.hpp (2 hunks)
- components/core/src/clp_s/SchemaReader.cpp (1 hunks)
- components/core/src/clp_s/SchemaReader.hpp (4 hunks)
- components/core/src/clp_s/clp-s.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- components/core/src/clp_s/ArchiveReader.hpp
🧰 Additional context used
🔇 Additional comments (9)
components/core/src/clp_s/CMakeLists.txt (1)
68-69: The build configuration changes look good, eh!

The addition of PackedStreamReader source files follows the project's build patterns and maintains alphabetical ordering in the source list.

Let's verify that all required dependencies for the packed stream functionality are present:

✅ Verification successful

Build configuration is properly set up for packed stream functionality, eh!

After verification, the build configuration is complete and correct:

- The CMakeLists.txt includes all necessary compression-related dependencies: ZStd::ZStd is properly linked in the target libraries.
- Required compression utilities (ZstdCompressor.hpp/cpp and ZstdDecompressor.hpp/cpp) are included in CLP_S_SOURCES.
- The PackedStreamReader implementation correctly:
  - Includes the required ZstdDecompressor.hpp
  - Uses the compression functionality for stream handling
  - Has proper integration with the existing compression infrastructure
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify dependencies for packed stream functionality # Test 1: Check for compression-related dependencies rg -l "zstd|compress" CMakeLists.txt # Test 2: Check for any new header includes in PackedStreamReader ast-grep --pattern 'include "$_"' PackedStreamReader.hppLength of output: 222
Script:
#!/bin/bash # Let's try with correct paths and broader search # Find CMakeLists.txt files fd CMakeLists.txt # Then search for compression-related dependencies in found files fd CMakeLists.txt --exec rg -l "zstd|compress" {} # Find PackedStreamReader files fd "PackedStreamReader\.(cpp|hpp)" # Check includes in any found PackedStreamReader files fd "PackedStreamReader\.hpp" --exec ast-grep --pattern 'include "$_"' {}Length of output: 1169
Script:
#!/bin/bash # Let's check the content of the relevant CMakeLists.txt for compression dependencies cat components/core/src/clp_s/CMakeLists.txt # Let's also check for any compression-related includes or dependencies rg -p "zstd|compress" components/core/src/clp_s/PackedStreamReader.hpp || true rg -p "zstd|compress" components/core/src/clp_s/PackedStreamReader.cpp || true # Check for any related target dependencies rg -p "target_link_libraries.*clp-s" components/core/src/clp_s/CMakeLists.txtLength of output: 8192
components/core/src/clp_s/CommandLineArguments.hpp (1)
111-112: LGTM! The getter method follows the class conventions.

The implementation is clean and consistent with other getter methods in the class.
components/core/src/clp_s/ArchiveReader.cpp (5)
31-31: LGTM: Stream reader initialization aligns with table packing feature.

The change from table file reader to packed stream reader correctly implements the new architecture for handling packed tables.
Line range hint 132-152: LGTM: Schema table reading implementation.

The schema-based table reading correctly integrates with the packed streams architecture and properly handles buffer reuse.
322-326: LGTM: Store method properly handles schema order.

The store method correctly uses schema IDs to maintain the proper order of tables during storage.
351-364: LGTM: Stream reading implementation.

The stream reading implementation correctly handles buffer management and stream switching.
41-120: Verify metadata integrity across packed streams.

The metadata reading implementation looks solid, but we should verify that the uncompressed size calculations are accurate across different stream configurations.
✅ Verification successful
Let me try a different approach to verify the metadata integrity implementation.
The previous search was too specific. Let's broaden it to find relevant test cases and related validation code that can help verify the metadata integrity implementation.
Based on the code analysis from the search results, I can now provide a final verification response.
Metadata integrity implementation is correctly implemented
The implementation in ArchiveReader.cpp correctly handles stream metadata integrity through several mechanisms:
- Stream offset validation: Checks that offsets don't exceed stream sizes (line 41-43 in ArchiveReader.cpp)
- Proper uncompressed size calculation:
- For same stream: uses the offset difference (stream_offset - prev_metadata.stream_offset)
- For different streams: uses the stream size minus the offset (get_uncompressed_stream_size() - prev_metadata.stream_offset)
- Consistent metadata tracking through PackedStreamReader, which maintains stream metadata with file offsets and uncompressed sizes
- Complete validation chain from reading to schema loading
The implementation ensures that stream boundaries are respected and uncompressed sizes are accurately calculated across stream transitions.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash # Description: Verify that stream offsets are properly ordered within each stream # and that uncompressed sizes don't exceed stream sizes # Search for test files that verify stream metadata integrity rg -l "test.*stream.*metadata|test.*packed.*stream" --type cpp # Look for potential metadata validation functions ast-grep --pattern 'void $func($$$) { $$$ stream_offset $$$ uncompressed $$$ }'Length of output: 156
Script:
#!/bin/bash # Search for test files related to ArchiveReader fd "test.*\.cpp" | xargs rg -l "ArchiveReader" # Look for test cases involving stream metadata rg -A 5 "TEST.*Archive.*Reader" --type cpp # Find related validation code rg -A 3 "stream_offset|uncompressed_size" --type cpp components/core/src/clp_s/Length of output: 15231
components/core/src/clp_s/clp-s.cpp (1)
94-94
: LGTM! Please verify minimum table size validation.
The addition of the min_table_size parameter aligns well with the table packing feature. The implementation follows the established pattern for option configuration.
Let's verify the parameter validation:
✅ Verification successful
Let me analyze the results. I can see that:
- The minimum table size has a default value of 1 MB (1ULL * 1024 * 1024)
- It's configurable via command line arguments using boost::program_options
- The parameter has a description: "Minimum size (B) for a packed table before it gets compressed"
Let me verify if there are any additional validations or constraints on this parameter:
Let me check one more thing to see if there are any validations in the constructor or initialization:
Let me try one last search to find any validation logic:
Parameter validation appears sufficient
The minimum table size parameter implementation is well-structured with:
- A sensible default value of 1 MB (1ULL * 1024 * 1024)
- Command-line configuration support via boost::program_options
- Clear documentation indicating its purpose for packed table compression
- Implicit validation through the size_t type, ensuring non-negative values
Since size_t naturally prevents negative values and the default value of 1 MB provides a reasonable starting point, no additional validation appears necessary for this use case.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for minimum table size validation in CommandLineArguments
# Expected: Find validation logic for minimum table size parameter

# Search for minimum table size validation
rg -A 5 "minimum.*table.*size" --type cpp

Length of output: 2468
Script:
#!/bin/bash
# Search for any validation or constraints on minimum table size
rg -A 5 "m_minimum_table_size" --type cpp

Length of output: 1859
Script:
#!/bin/bash
# Search for CommandLineArguments constructor and validate method
ast-grep --pattern 'CommandLineArguments::CommandLineArguments($$$) { $$$ }'
ast-grep --pattern 'CommandLineArguments::validate($$$) { $$$ }'

Length of output: 149
Script:
#!/bin/bash
# Search for validation methods in CommandLineArguments
rg -B 2 -A 5 "void.*validate" --type cpp components/core/src/clp_s/CommandLineArguments

# Also check for any error handling related to minimum table size
rg -B 2 -A 5 "throw.*minimum.*table" --type cpp

Length of output: 226
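For context, wiring up such an option with boost::program_options typically looks like the sketch below. The option name min-table-size and the surrounding scaffolding are illustrative assumptions, not the project's actual code:

```cpp
#include <cstddef>
#include <iostream>
#include <boost/program_options.hpp>

namespace po = boost::program_options;

int main(int argc, char** argv) {
    // 1 MB default, matching the default value found above.
    size_t min_table_size = 1ULL * 1024 * 1024;

    po::options_description options{"Compression options"};
    options.add_options()(
            "min-table-size",
            po::value<size_t>(&min_table_size)->default_value(min_table_size),
            "Minimum size (B) for a packed table before it gets compressed"
    );

    po::variables_map parsed;
    po::store(po::parse_command_line(argc, argv, options), parsed);
    po::notify(parsed);

    // size_t is unsigned, so the stored value can never be negative.
    std::cout << "min table size: " << min_table_size << " B\n";
    return 0;
}
```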
components/core/src/clp_s/SchemaReader.hpp (1)
139-144
: Duplicate Comment: Consider passing stream_buffer by const reference
The previous suggestion to pass stream_buffer as const std::shared_ptr<char[]>& is still valid. This change would avoid unnecessary increments of the reference count on the shared pointer, potentially improving performance.
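Roughly, the suggested signature change looks like this sketch (simplified declarations, not the actual ones in SchemaReader.hpp):

```cpp
#include <cstddef>
#include <memory>

// Pass-by-value: every call copies the shared_ptr, atomically incrementing
// and later decrementing the reference count.
void load_by_value(std::shared_ptr<char[]> stream_buffer, size_t offset, size_t uncompressed_size);

// Pass-by-const-reference: the caller's reference count is untouched; the
// callee can still copy the pointer into a member when it needs ownership.
void load_by_cref(
        std::shared_ptr<char[]> const& stream_buffer,
        size_t offset,
        size_t uncompressed_size
);
```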
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (2)
components/core/src/clp_s/SchemaReader.hpp (1)
139-144
: Documentation needs clarification about buffer ownership
The method documentation should specify:
- Buffer ownership semantics (whether the method takes ownership)
- Buffer lifetime requirements
- Whether the buffer can be modified
Example documentation:
/**
 * Loads the encoded messages from a shared buffer starting at a given offset
 * @param stream_buffer Shared buffer containing the encoded messages. The buffer must remain valid
 * for the lifetime of the SchemaReader instance
 * @param offset Starting offset within the buffer
 * @param uncompressed_size Size of the uncompressed data
 */
components/core/src/clp_s/ArchiveWriter.cpp (1)
146-180
: Consider adding size threshold documentation
The metadata schema documentation is thorough. Consider adding a note about how the m_min_table_size threshold influences the packing of tables into streams, helping future maintainers understand the relationship between table sizes and stream creation.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- components/core/src/clp_s/ArchiveReader.cpp (7 hunks)
- components/core/src/clp_s/ArchiveWriter.cpp (3 hunks)
- components/core/src/clp_s/SchemaReader.hpp (4 hunks)
🧰 Additional context used
🪛 cppcheck
components/core/src/clp_s/ArchiveWriter.cpp
[error] 191-191: Exception thrown in function declared not to throw exceptions.
(throwInNoexceptFunction)
🔇 Additional comments (10)
components/core/src/clp_s/SchemaReader.hpp (2)
52-56
: LGTM: SchemaMetadata struct changes align with table packing design
The renamed struct and new fields (stream_id, stream_offset) effectively support the table packing feature by tracking the location of schema data within packed streams.
287-287
: Verify buffer cleanup handling
The change from unique_ptr to shared_ptr requires verification of proper buffer cleanup, especially in error cases.
components/core/src/clp_s/ArchiveWriter.cpp (2)
3-4
: LGTM: New includes and member initialization
The additions properly support the table packing feature. The includes are required for sorting schemas and storing metadata, while the member initialization correctly sets up the minimum table size threshold.
Also applies to: 17-17
225-242
: LGTM: Metadata writing implementation
The implementation correctly follows the documented metadata schema, properly writing both stream and schema metadata sections. The code is well-structured and maintains clear separation between the two metadata sections.
components/core/src/clp_s/ArchiveReader.cpp (6)
31-31
: LGTM: Stream reader initialization aligns with table packing feature
The change from table file reader to packed streams reader accurately reflects the new architecture.
83-85
: LGTM: Proper stream offset validation
The code correctly validates that stream offsets don't exceed the uncompressed stream size, preventing potential buffer overruns.
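In spirit, that guard reduces to something like the following sketch (illustrative names only):

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Rejects schema metadata whose offset points past the end of the
// decompressed stream it claims to live in.
void validate_stream_offset(uint64_t stream_offset, uint64_t uncompressed_stream_size) {
    if (stream_offset > uncompressed_stream_size) {
        throw std::runtime_error(
                "Stream offset " + std::to_string(stream_offset)
                + " exceeds uncompressed stream size "
                + std::to_string(uncompressed_stream_size)
        );
    }
}
```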
275-275
: LGTM: Proper use of schema metadata
The initialization correctly uses the schema metadata to set the number of messages.
322-326
: LGTM: Proper schema order preservation
The store method correctly uses schema_ids to ensure tables are processed in the intended order.
341-348
: LGTM: Comprehensive cleanup of stream resources
The close method properly cleans up all stream-related resources in a logical order.
99-120
: Verify metadata calculation logic for edge cases
The uncompressed size calculation uses a complex prev_metadata pattern. While the logic appears sound, it's worth verifying the handling of edge cases.
✅ Verification successful
Let me try a different approach to search for relevant test files.
This should help us find relevant test files that might verify the metadata calculation logic, even if they don't explicitly mention "metadata calculation" in their names.
Additionally, let's look at the actual test implementations:
Let me search for tests specifically related to the schema metadata calculation.
This should help us find any test cases or implementations that verify the schema metadata calculation logic, particularly around the uncompressed size calculation.
Based on the code analysis, I can now provide a final assessment of the metadata calculation logic.
Metadata calculation logic is correctly implemented with proper edge case handling
The uncompressed size calculation in ArchiveReader.cpp follows a robust pattern that handles both same-stream and cross-stream scenarios:
- For same stream (when stream_id == prev_metadata.stream_id): uses the direct offset difference, stream_offset - prev_metadata.stream_offset
- For different streams: uses the stream size minus the offset, m_stream_reader.get_uncompressed_stream_size(prev_metadata.stream_id) - prev_metadata.stream_offset
- The final schema's size is properly calculated using the same pattern
The implementation includes proper initialization checks (prev_metadata_initialized) and maintains consistency across stream boundaries. The metadata is stored in a map (m_id_to_schema_metadata) that ensures each schema's metadata is properly tracked.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for test cases that verify metadata calculation
rg -l "test.*metadata.*calculation|test.*uncompressed.*size" --type cpp

Length of output: 73
Script:
#!/bin/bash
# Search for test files related to ArchiveReader
fd "test.*\.cpp$" | rg -l "ArchiveReader"

# Also search for any usage or tests of uncompressed size calculation
rg -l "uncompressed.*size" --type cpp

Length of output: 2386
Script:
#!/bin/bash
# Look for test cases involving stream size or metadata in the test files
rg "TEST.*Archive.*|uncompressed.*size|stream.*size" -A 5 --type cpp

Length of output: 113781
Script:
#!/bin/bash
# Look for tests or implementations related to schema metadata
rg -A 5 "SchemaMetadata|schema.*metadata" --type cpp

# Also look for any test files that might verify the schema metadata
fd "test.*\.cpp$" | rg -l "Schema"

Length of output: 9738
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (5)
components/core/src/clp_s/ArchiveWriter.hpp (4)
24-24
: Add documentation for the min_table_size parameter.
Please add a documentation comment explaining that this parameter controls the minimum size threshold for table packing. This will help future maintainers understand its purpose and expected values.
 struct ArchiveWriterOption {
     boost::uuids::uuid id;
     std::string archives_dir;
     int compression_level;
     bool print_archive_stats;
+    /// Minimum size threshold in bytes for packing tables into a single compression stream
     size_t min_table_size;
 };
36-43
: Add class-level documentation for StreamMetadata.
While the implementation is sound, please add documentation explaining this struct's role in the table packing feature. Consider documenting each member's purpose as well.
+/**
+ * Metadata for a packed compression stream containing one or more tables
+ */
 struct StreamMetadata {
     StreamMetadata(uint64_t file_offset, uint64_t uncompressed_size)
             : file_offset(file_offset),
               uncompressed_size(uncompressed_size) {}

+    /// Offset of this stream in the archive file
     uint64_t file_offset{};
+    /// Total uncompressed size of all tables in this stream
     uint64_t uncompressed_size{};
 };
45-61
: Add comprehensive documentation for SchemaMetadata.
Please add documentation to clarify:
- The struct's role in mapping schema tables to packed streams
- The relationship between SchemaMetadata and StreamMetadata
- The purpose of each member variable
+/**
+ * Metadata mapping a schema table to its location within a packed compression stream
+ */
 struct SchemaMetadata {
     SchemaMetadata(
             uint64_t stream_id,
             uint64_t stream_offset,
             int32_t schema_id,
             uint64_t num_messages
     )
             : stream_id(stream_id),
               stream_offset(stream_offset),
               schema_id(schema_id),
               num_messages(num_messages) {}

+    /// ID of the packed stream containing this schema table
     uint64_t stream_id{};
+    /// Offset of this table within its packed stream
     uint64_t stream_offset{};
+    /// ID of the schema this table belongs to
     int32_t schema_id{};
+    /// Number of messages in this table
     uint64_t num_messages{};
 };
190-190
: Add documentation for m_min_table_size member variable.
Please add a documentation comment explaining this member's role in the table packing process.
 size_t m_encoded_message_size{};
 size_t m_uncompressed_size{};
 size_t m_compressed_size{};
 std::string m_id;
 std::string m_archive_path;
 std::string m_encoded_messages_dir;
 std::shared_ptr<VariableDictionaryWriter> m_var_dict;
 std::shared_ptr<LogTypeDictionaryWriter> m_log_dict;
 std::shared_ptr<LogTypeDictionaryWriter> m_array_dict;  // log type dictionary for arrays
 std::shared_ptr<TimestampDictionaryWriter> m_timestamp_dict;
 std::shared_ptr<clp::GlobalMySQLMetadataDB> m_metadata_db;
 int m_compression_level{};
 bool m_print_archive_stats{};
+/// Minimum size threshold in bytes for packing tables into a single compression stream
 size_t m_min_table_size{};
components/core/src/clp_s/ArchiveWriter.cpp (1)
145-179
: Consider documenting size constraints and performance implications
The metadata schema documentation is thorough. Consider adding:
- Typical/recommended ranges for stream sizes
- Performance implications of different packing thresholds
- Memory overhead considerations for the two-section approach
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- components/core/src/clp_s/ArchiveWriter.cpp (3 hunks)
- components/core/src/clp_s/ArchiveWriter.hpp (3 hunks)
🧰 Additional context used
🪛 cppcheck
components/core/src/clp_s/ArchiveWriter.cpp
[error] 191-191: Exception thrown in function declared not to throw exceptions.
(throwInNoexceptFunction)
🔇 Additional comments (3)
components/core/src/clp_s/ArchiveWriter.hpp (1)
Line range hint
1-205
: Overall implementation looks good!
The changes effectively support the table packing feature with well-structured metadata types. The implementation aligns well with the PR objectives of combining small tables into single compression streams for better efficiency.
components/core/src/clp_s/ArchiveWriter.cpp (2)
3-4
: LGTM: Clean initialization of table packing threshold
The addition of m_min_table_size and the algorithm header properly sets up the foundation for table packing.
Also applies to: 16-16
224-240
: Consider adding error handling for metadata writing
The metadata writing sequence is critical for archive integrity. Consider adding error handling and validation (a rough sketch follows the list below):
- Verify write operations succeed
- Add checksums for metadata integrity
- Handle potential I/O errors
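A rough sketch of the first suggestion, assuming a hypothetical writer that reports failure via a boolean return (the real compressor's error reporting may differ):

```cpp
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for the archive's metadata writer.
struct MetadataWriter {
    bool write_numeric(uint64_t value);  // returns false on I/O failure
};

// Writes the per-stream section of the metadata, aborting on the first
// failed write so a truncated section can't masquerade as a valid one.
void write_stream_section(MetadataWriter& writer, std::vector<uint64_t> const& stream_offsets) {
    if (false == writer.write_numeric(stream_offsets.size())) {
        throw std::runtime_error("Failed to write stream count");
    }
    for (auto offset : stream_offsets) {
        if (false == writer.write_numeric(offset)) {
            throw std::runtime_error("Failed to write stream offset");
        }
    }
}
```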
* ffi: Add support for serializing/deserializing auto-generated and user-generated schema tree node IDs. (y-scope#557)
  Co-authored-by: kirkrodrigues <[email protected]>
* clp: Add missing C++ standard library includes in IR parsing files. (y-scope#561)
  Co-authored-by: kirkrodrigues <[email protected]>
* log-viewer-webui: Update `yscope-log-viewer` to the latest version (which uses `clp-ffi-js`). (y-scope#562)
* package: Upgrade dependencies to resolve security issues. (y-scope#536)
* clp-s: Implement table packing (y-scope#466)
  Co-authored-by: wraymo <[email protected]>
  Co-authored-by: Kirk Rodrigues <[email protected]>
  Co-authored-by: wraymo <[email protected]>
* log-viewer-webui: Update `yscope-log-viewer` to the latest version. (y-scope#565)
* ci: Switch GitHub macOS build workflow to use macos-13 (x86) and macos-14 (ARM) runners. (y-scope#566)
* core: Add support for user-defined HTTP headers in `NetworkReader`. (y-scope#568)
  Co-authored-by: Lin Zhihao <[email protected]>
  Co-authored-by: Xiaochong Wei <[email protected]>
* chore: Update to the latest version of yscope-dev-utils. (y-scope#574)
* build(core): Upgrade msgpack to v7.0.0. (y-scope#575)
* feat(ffi): Update IR stream protocol version handling in preparation for releasing the kv-pair IR stream format: (y-scope#573)
  - Bump the IR stream protocol version to 0.1.0 for the kv-pair IR stream format.
  - Treat the previous IR stream format's versions as backwards compatible.
  - Differentiate between backwards-compatible and supported versions during validation.
  Co-authored-by: kirkrodrigues <[email protected]>
* fix(taskfiles): Trim trailing slash from URL prefix in `download-and-extract-tar` (fixes y-scope#577). (y-scope#578)
* fix(ffi): Correct `clp::ffi::ir_stream::Deserializer::deserialize_next_ir_unit`'s return value when failing to read the next IR unit's type tag. (y-scope#579)
* fix(taskfiles): Update `yscope-log-viewer` sources in `log-viewer-webui-clients` sources list (fixes y-scope#576). (y-scope#580)
* fix(cmake): Add Homebrew path detection for `mariadb-connector-c` to fix macOS build failure. (y-scope#582)
  Co-authored-by: kirkrodrigues <[email protected]>
* refactor(ffi): Make `get_schema_subtree_bitmap` a public method of `KeyValuePairLogEvent`. (y-scope#581)
* ci: Schedule GitHub workflows to daily run to detect failures due to upgraded dependencies or environments. (y-scope#583)
* docs: Update the required version of task. (y-scope#567)
* Add pr check workflow
--------
Co-authored-by: kirkrodrigues <[email protected]>
Co-authored-by: Junhao Liao <[email protected]>
Co-authored-by: Henry8192 <[email protected]>
Co-authored-by: Devin Gibson <[email protected]>
Co-authored-by: wraymo <[email protected]>
Co-authored-by: wraymo <[email protected]>
Co-authored-by: Xiaochong(Eddy) Wei <[email protected]>
Co-authored-by: Xiaochong Wei <[email protected]>
Co-authored-by: haiqi96 <[email protected]>
Co-authored-by: wraymo <[email protected]>
Co-authored-by: Kirk Rodrigues <[email protected]>
Co-authored-by: wraymo <[email protected]>
Description
This PR implements table packing: we combine small tables into one compression stream until they reach a certain size threshold, in order to avoid having many tiny compression streams. This helps avoid outliers in compression ratio, particularly when we enable features like array-structurization, which can create many small tables.
On the compression side the key differences are that (1) SchemaWriter now keeps track of the total in-memory size of the table it owns instead of determining it after writing to a compression stream; (2) before compression, tables are sorted by that in-memory size, and smaller tables are packed together in sequence until their combined size reaches a certain threshold; and (3) table metadata has been changed to accommodate table packing.
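A minimal sketch of that packing pass, under the assumption of simplified stand-in types (the real logic lives in ArchiveWriter):

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a schema table waiting to be compressed.
struct PendingTable {
    int32_t schema_id{};
    size_t in_memory_size{};
};

// Where a table ends up: which packed stream, and at what offset within it.
struct Placement {
    int32_t schema_id{};
    uint64_t stream_id{};
    uint64_t stream_offset{};
};

// Sorts tables by in-memory size and packs them into a stream until the
// combined size reaches min_table_size, then starts a new stream.
std::vector<Placement> pack_tables(std::vector<PendingTable> tables, size_t min_table_size) {
    std::sort(tables.begin(), tables.end(), [](PendingTable const& a, PendingTable const& b) {
        return a.in_memory_size < b.in_memory_size;
    });

    std::vector<Placement> placements;
    uint64_t stream_id{0};
    uint64_t stream_offset{0};
    for (auto const& table : tables) {
        placements.push_back({table.schema_id, stream_id, stream_offset});
        stream_offset += table.in_memory_size;
        if (stream_offset >= min_table_size) {
            // This stream is big enough; the next table opens a new one.
            ++stream_id;
            stream_offset = 0;
        }
    }
    return placements;
}
```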
On the decompression side, TableReader is responsible for decompressing the packed streams and ensuring tables are read in the correct order. Most of the rest of the change is contained in SchemaReader. The logic for reading table metadata is split between TableReader and SchemaReader: TableReader reads metadata about individual compression streams, while SchemaReader reads metadata about how schema tables map to those streams. Also note that schema tables now need to be read in the order they appear in the table metadata, which can differ from schema ID order.
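And a sketch of the decompression-side bookkeeping: reading tables in metadata order and only switching packed streams when the stream ID changes (again with hypothetical stand-in types):

```cpp
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical per-table metadata in the order it appears in the archive's
// table metadata (which can differ from schema ID order).
struct TableEntry {
    int32_t schema_id{};
    uint64_t stream_id{};
    uint64_t stream_offset{};
};

// Placeholder for decompressing a whole packed stream into one buffer; a
// real implementation would seek to the stream's file offset and inflate it.
std::shared_ptr<char[]> decompress_stream(uint64_t stream_id) {
    (void)stream_id;
    return std::shared_ptr<char[]>{new char[1024]{}};
}

void read_tables_in_metadata_order(std::vector<TableEntry> const& entries) {
    std::shared_ptr<char[]> buffer;
    uint64_t loaded_stream_id{UINT64_MAX};
    for (auto const& entry : entries) {
        if (nullptr == buffer || entry.stream_id != loaded_stream_id) {
            // Decompress each packed stream once; every table packed into it
            // reuses the same buffer at a different offset.
            buffer = decompress_stream(entry.stream_id);
            loaded_stream_id = entry.stream_id;
        }
        char const* table_start = buffer.get() + entry.stream_offset;
        // ... hand table_start to the schema reader for entry.schema_id ...
        (void)table_start;
    }
}
```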
Note: this PR deliberately leaves the uncompressed size of individual schema tables out of the table metadata. Uncompressed size can be derived from the other metadata we do store, and storing it in addition to the metadata offsets would actually increase the amount of work needed to check that an archive isn't corrupt while decompressing it.
Validation performed
Summary by CodeRabbit
New Features
- New PackedStreamReader class for improved management of reading packed streams.
- Updated ArchiveWriter for better handling of compression streams and schema tables.

Improvements
- Updated ArchiveReader for more efficient schema and metadata management.
- Updated ArchiveWriter to optimize metadata processing and writing.
- Updated SchemaReader for better data loading and timestamp handling.

Bug Fixes
- Adjusted ArchiveReader and ArchiveWriter to align with new schema handling logic.

Documentation