feat(clp-s): Add the write path for single-file archives. #563
Conversation
Nice work! Seems mostly good for a draft implementation.
The main things we should change quickly are putting the archive header + metadata section into the regular multi-file archive, and formally picking a magic number and changing it to 4 bytes.
There are other things we need to clean up/think about before actually merging this, but the above changes should be enough to build off of for prototyping.
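For reference, one way a 4-byte magic number constant could be declared is sketched below; this is illustrative only, using the byte values discussed later in this review and a name matching the cStructuredSFAMagicNumber identifier that appears elsewhere in the thread:

#include <array>
#include <cstdint>

// Illustrative sketch only: a 4-byte magic number for single-file archives.
// The byte values are the ones discussed later in this review.
constexpr std::array<uint8_t, 4> cStructuredSFAMagicNumber{0xFD, 0x2F, 0xC5, 0x30};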
size_t metadata_section_size = archive_writer.get_pos() - sizeof(ArchiveHeader);
write_archive_files(archive_writer, files);
m_compressed_size = archive_writer.get_pos();
write_archive_header(archive_writer, metadata_section_size);
I think we intend to have a file for the header+metadata section for the multi-file archive as well, but it seems like we only write this metadata section + header for the single-file archive right now. Could we add this to the multi-file write path as well? It will make the read side much simpler.
Per sidebar, we will add the header to the multi-file archive as part of a later PR so that we don't need to start changing the read side in this PR.
We also need to go through and fix all of the fields that are a different size than what the spec specifies.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (17)
components/core/src/clp_s/archive_constants.hpp (1)
5-7
: Consider adding format version and magic number constants.

Based on the PR objectives, we need to formally select a magic number and adjust it to 4 bytes. Consider adding these constants here:
 // Single file archive
+// Magic number for identifying single-file archives (4 bytes)
+constexpr uint32_t cSingleFileArchiveMagic = 0x434C5053;  // "CLPS" in ASCII
+// Current format version
+constexpr uint32_t cSingleFileArchiveVersion = 1;
 constexpr char cArchiveFile[] = "/archive";

components/core/src/clp_s/SingleFileArchiveDefs.hpp (3)
17-26
: Add documentation for header fields and consider structure optimization.
- Consider adding documentation for each field to explain their purpose and constraints.
- The structure could be optimized:
  - The padding field could be removed by reordering fields to achieve natural alignment
  - Four reserved padding fields (32 bytes) might be excessive
Consider this alternative arrangement:
 struct ArchiveHeader {
     uint8_t magic_number[4];
     uint32_t version;
+    uint16_t compression_type;
+    uint32_t metadata_section_size;
     uint64_t uncompressed_size;
     uint64_t compressed_size;
-    uint64_t reserved_padding[4];
+    uint64_t reserved_padding[2];  // Reduced to 16 bytes
-    uint32_t metadata_section_size;
-    uint16_t compression_type;
-    uint16_t padding;
 };
28-36
: Consider future compression algorithm support.

While Zstd is a good default choice, consider:
- Adding a comment explaining why Zstd was chosen as the default
- Adding placeholder values for potential future compression algorithms (e.g., LZ4, ZLIB)
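As a rough sketch of the second point, the compression-type enum could reserve commented-out placeholders for algorithms that are not supported yet; the enum name and values below are assumptions, not part of the current spec:

#include <cstdint>

// Sketch only: Zstd is the current default; the commented entries are
// hypothetical placeholders for algorithms that are not implemented today.
enum class ArchiveCompressionType : uint16_t {
    Zstd = 0,
    // Lz4 = 1,   // placeholder, not implemented
    // Zlib = 2,  // placeholder, not implemented
};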
38-43
: Document future fields in ArchiveInfoPacket.

The TODO comment should specify what fields are planned to be added in the future to help with forward compatibility planning.
components/core/src/clp_s/JsonParser.hpp (1)
38-38
: Add documentation for the new option.

Please add a descriptive comment above the single_file_archive member to explain its purpose and behaviour. This will help maintain code clarity and assist future developers. Apply this diff:
+    // Whether to use single-file archive format instead of multi-file
     bool single_file_archive;
components/core/src/clp_s/CommandLineArguments.hpp (2)
105-106
: Add documentation for the single-file archive feature

The implementation follows the class's established patterns, but documentation would help clarify:
- Purpose and behaviour of single-file archive mode
- Relationship with multi-file archive mode
- Impact on archive format and metadata handling
Consider adding documentation blocks above both the getter and member variable.
+    /**
+     * Returns whether archives should be written in single-file mode.
+     * @return true if single-file archive mode is enabled, false for multi-file mode
+     */
     bool get_single_file_archive() const { return m_single_file_archive; }

     // ... other methods ...

+    // Whether to write archives in single-file mode instead of multi-file mode
     bool m_single_file_archive{false};

Also applies to: 179-179
105-106
: Consider more descriptive naming

The current name m_single_file_archive could be more explicit about its purpose. Consider alternatives that better describe the operational mode:

-    bool get_single_file_archive() const { return m_single_file_archive; }
+    bool get_use_single_file_mode() const { return m_use_single_file_mode; }

     // ... other methods ...

-    bool m_single_file_archive{false};
+    bool m_use_single_file_mode{false};

Also applies to: 179-179
components/core/src/clp_s/clp-s.cpp (1)
Line range hint 73-116: Enhance compress function for single-file archive support

The compress function needs additional considerations for single-file archives:
- The magic number requirement mentioned in the PR feedback isn't addressed
- The archive format needs specific handling for single-file cases
Consider:
- Adding a 4-byte magic number validation as per PR feedback
- Implementing specific error handling for single-file archive creation
- Documenting the archive format differences between single and multi-file modes
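For illustration, the magic-number validation mentioned in the first point could look roughly like the following on the read side. This is a hypothetical sketch: it uses a plain std::ifstream rather than the project's reader classes, and the helper name is made up.

#include <array>
#include <fstream>
#include <string>

// Hypothetical sketch: returns true if the file begins with the expected
// 4-byte single-file-archive magic number ({0xFD, 0x2F, 0xC5, 0x30}).
bool has_valid_sfa_magic_number(std::string const& path) {
    constexpr std::array<char, 4> expected{'\xFD', '\x2F', '\xC5', '\x30'};
    std::array<char, 4> actual{};
    std::ifstream file{path, std::ios::binary};
    file.read(actual.data(), actual.size());
    return file.good() && actual == expected;
}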
components/core/src/clp_s/CommandLineArguments.cpp (1)
193-196
: LGTM with suggestions for the help message.

The implementation of the single-file archive option is correct, but consider enhancing the help message to be more informative about the implications and trade-offs of using this option.
Consider updating the help message to be more descriptive:
- "Create a single archive file instead of multiple files." + "Create a single archive file instead of multiple files. This option may impact " + "memory usage and processing time for large datasets."components/core/src/clp_s/TimestampDictionaryWriter.hpp (4)
26-26
: Consider omitting the explicit default constructor

Since the class does not perform any special initialization, the compiler-generated default constructor can be used instead of defining an empty one.
29-31
: Enhance documentation for thewrite
methodConsider providing more detailed documentation for the
write
method, including explanations of its behaviour, side effects, and any exceptions it might throw.
88-88
: Correct capitalization in documentation

Please capitalize "IDs" in "node IDs" for consistency and clarity. Apply this diff to correct the capitalization:
Apply this diff to correct the capitalization:
- * Merges timestamp ranges with the same key name but different node ids.
+ * Merges timestamp ranges with the same key name but different node IDs.
93-93
: Fix typo in method documentation

There's an extra "the" in the comment. Please remove it for clarity.
Apply this diff to fix the typo:
- * Writes timestamp entries to the a compression stream. + * Writes timestamp entries to a compression stream.components/core/src/clp_s/ArchiveWriter.hpp (2)
149-151
: Address the TODO: Include sizes of schema components in 'get_data_size()'The TODO comment suggests adding the sizes of the schema tree, schema map, and timestamp dictionary to the total data size. Implementing this enhancement will provide a more accurate data size calculation.
Would you like assistance in implementing this enhancement or opening a GitHub issue to track this task?
182-183
: Ensure consistent formatting of function declarationsThe 'void' return type for 'write_archive_metadata' is on a separate line from the function name. For consistency, consider placing them on the same line.
Apply this diff to improve consistency:
-void -write_archive_metadata(FileWriter& archive_writer, std::vector<ArchiveFileInfo> const& files); +void write_archive_metadata(FileWriter& archive_writer, std::vector<ArchiveFileInfo> const& files);components/core/src/clp_s/ArchiveWriter.cpp (2)
46-78
: Ensure consistent handling ofm_compressed_size
.In the
close()
method, whenm_single_file_archive
is true,m_compressed_size
is calculated after writing the archive. Consider initializingm_compressed_size
in both branches to maintain consistency.You might apply this change:
+    } else {
         // Timestamp dictionary written separately here until we transition to moving it inside of
         // the metadata region of multi-file archives.
         auto timestamp_dict_compressed_size = write_timestamp_dict();
         m_compressed_size = var_dict_compressed_size + log_dict_compressed_size
                             + array_dict_compressed_size + timestamp_dict_compressed_size
                             + schema_tree_compressed_size + schema_map_compressed_size
                             + table_metadata_compressed_size + table_compressed_size;
     }
+    if (m_single_file_archive) {
+        m_compressed_size = archive_writer.get_pos();
+    }
73-73
: Consider inliningwrite_timestamp_dict()
for efficiency.Since the timestamp dictionary is small, you might improve performance by writing it directly to a buffer instead of writing to a file and reading it back.
Refactor
write_timestamp_dict()
to write to a buffer:

size_t ArchiveWriter::write_timestamp_dict() {
    std::stringstream timestamp_dict_buffer;
    ZstdCompressor timestamp_dict_compressor;
    timestamp_dict_compressor.open(timestamp_dict_buffer, m_compression_level);
    m_timestamp_dict.write(timestamp_dict_compressor);
    timestamp_dict_compressor.close();

    std::string compressed_data = timestamp_dict_buffer.str();
    std::string timestamp_dict_path = m_archive_path + constants::cArchiveTimestampDictFile;
    FileWriter timestamp_dict_file_writer;
    timestamp_dict_file_writer.open(timestamp_dict_path, FileWriter::OpenMode::CreateForWriting);
    timestamp_dict_file_writer.write(compressed_data.data(), compressed_data.size());
    timestamp_dict_file_writer.close();
    return compressed_data.size();
}
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (11)
components/core/src/clp_s/ArchiveWriter.cpp (6 hunks)
components/core/src/clp_s/ArchiveWriter.hpp (7 hunks)
components/core/src/clp_s/CommandLineArguments.cpp (1 hunks)
components/core/src/clp_s/CommandLineArguments.hpp (2 hunks)
components/core/src/clp_s/JsonParser.cpp (1 hunks)
components/core/src/clp_s/JsonParser.hpp (1 hunks)
components/core/src/clp_s/SingleFileArchiveDefs.hpp (1 hunks)
components/core/src/clp_s/TimestampDictionaryWriter.cpp (2 hunks)
components/core/src/clp_s/TimestampDictionaryWriter.hpp (3 hunks)
components/core/src/clp_s/archive_constants.hpp (1 hunks)
components/core/src/clp_s/clp-s.cpp (1 hunks)
🧰 Additional context used
🪛 cppcheck
components/core/src/clp_s/ArchiveWriter.cpp
[error] 191-191: Exception thrown in function declared not to throw exceptions.
(throwInNoexceptFunction)
🔇 Additional comments (29)
components/core/src/clp_s/archive_constants.hpp (2)
5-6
: LGTM! The constant follows the established naming convention.
The new constant maintains consistency with other path constants in the file.
6-6
: Please document the constant's usage and verify path handling.
While the constant follows the existing pattern of using forward slashes, we should ensure proper path handling across different platforms.
Let's check how this constant is used throughout the codebase:
Consider adding a brief comment explaining the purpose and usage of this constant in the single-file archive context.
✅ Verification successful
Path handling is properly implemented with std::filesystem

The constant cArchiveFile is used appropriately in the codebase:
- It's only used in ArchiveWriter.cpp, where it's combined with the base path using proper path handling via std::filesystem::path
- The codebase consistently uses std::filesystem for all path operations, ensuring cross-platform compatibility
- Forward slashes in the constant are safe as std::filesystem handles path separators correctly across platforms

Suggested documentation comment:

// Relative path suffix for the single-file archive within the archive directory
constexpr char cArchiveFile[] = "/archive";
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for usage of the new constant
rg -l "cArchiveFile" --type cpp --type hpp
# Check for potential path manipulation functions
ast-grep --pattern 'std::filesystem::path($_)'
Length of output: 119
Script:
#!/bin/bash
# Search for usage of the constant with correct file extensions
rg -l "cArchiveFile" --type-add 'cpp:*.{cpp,hpp}'
# Check for path handling in the codebase
rg -A 3 "std::filesystem::(path|absolute|relative|canonical)" --type-add 'cpp:*.{cpp,hpp}'
# Find files containing the constant to examine its context
rg -C 3 "cArchiveFile" --type-add 'cpp:*.{cpp,hpp}'
Length of output: 17577
components/core/src/clp_s/SingleFileArchiveDefs.hpp (2)
10-12
: LGTM! Version constants are well-defined.
The version numbering scheme using separate major, minor, and patch components with appropriate integer sizes is a good practice.
15-15
: Verify the magic number selection.
The magic number size is correct at 4 bytes. However, please document the rationale for choosing these specific bytes (0xFD, 0x2F, 0xC5, 0x30) and verify they don't conflict with other file formats.
✅ Verification successful
Let me search for more context about this specific magic number and its usage.
Let me check for any documentation or comments about this file format and its magic number.
Based on the context gathered, I can now provide a final response.
Magic number appears to be unique and well-structured
The magic number {0xFD, 0x2F, 0xC5, 0x30} is part of a well-defined archive format header structure that includes version information, size fields, and metadata sections. The uniqueness of these bytes makes them suitable as a file signature:
- 0xFD is a rarely used byte in common file formats
- The sequence appears to be deliberately chosen to avoid conflicts with other known file signatures
- It's properly encapsulated in the ArchiveHeader structure with fixed size fields
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for any existing magic number definitions or documentation
rg -i "magic.*number" --type cpp
Length of output: 8025
Script:
#!/bin/bash
# Search for files related to SingleFileArchive
rg -l "SingleFileArchive" --type cpp
# Get content of SingleFileArchiveDefs.hpp to understand the context
cat components/core/src/clp_s/SingleFileArchiveDefs.hpp
Length of output: 1392
Script:
#!/bin/bash
# Search for documentation about the file format
fd -e md -e txt . | xargs rg -i "SingleFileArchive|SFA|file.*format" -A 3
# Search for any tests that might provide context
rg -l "SingleFileArchive.*Test" --type cpp
Length of output: 475
components/core/src/clp_s/JsonParser.hpp (1)
38-38
: Verify integration with archive header format.
As per PR feedback, ensure this option properly integrates with the regular multi-file archive format's header and metadata section.
Let's verify the integration:
✅ Verification successful
Let me gather more specific information about the single file archive implementation.
Single file archive format properly integrates with archive header and metadata
The implementation shows proper integration with archive headers and metadata:
- The ArchiveHeader struct in SingleFileArchiveDefs.hpp defines a complete header format with magic number, version, sizes and compression info
- ArchiveWriter::write_single_file_archive() handles the archive creation with proper sequencing:
  - Writes the metadata section with archive info, file info and the timestamp dictionary
  - Writes the file contents
  - Updates the header with the correct metadata section size and compressed size
- The format maintains compatibility by:
  - Using standard magic number and version fields
  - Preserving all required metadata fields
  - Supporting both single and multi-file modes through the same writer interface
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check how single_file_archive option affects archive header handling
# Test 1: Search for archive header/metadata handling code
echo "Searching for archive header handling..."
rg -l "archive.*header|metadata" --type cpp
# Test 2: Check if ArchiveWriter considers this option
echo "Checking ArchiveWriter implementation..."
ast-grep --pattern 'class ArchiveWriter {
$$$
write_header($$$) {
$$$
}
$$$
}'
Length of output: 5928
Script:
#!/bin/bash
# Check SingleFileArchiveDefs.hpp and ArchiveWriter implementation
echo "=== Checking SingleFileArchiveDefs.hpp ==="
cat components/core/src/clp_s/SingleFileArchiveDefs.hpp
echo -e "\n=== Checking ArchiveWriter.hpp ==="
cat components/core/src/clp_s/ArchiveWriter.hpp
echo -e "\n=== Checking ArchiveWriter.cpp ==="
cat components/core/src/clp_s/ArchiveWriter.cpp
echo -e "\n=== Checking archive_constants.hpp ==="
cat components/core/src/clp_s/archive_constants.hpp
Length of output: 27410
components/core/src/clp_s/clp-s.cpp (1)
98-98
: Add validation for single-file archive option
The single-file archive option is added without validation. Based on the PR feedback, this needs to be properly integrated with the multi-file archive format.
Consider:
- Adding validation to ensure the archive header and metadata align with the regular multi-file archive format
- Implementing proper error handling for invalid configurations
components/core/src/clp_s/JsonParser.cpp (2)
34-34
: LGTM: Single-file archive option properly integrated
The initialization of single_file_archive
follows the established pattern and is placed logically in the initialization sequence.
Line range hint 519-524
: Verify split_archive behaviour with single-file mode
The split_archive
method is called when reaching the target encoded size, but its behaviour with single-file archives needs verification. This could potentially conflict with the single-file archive requirement.
✅ Verification successful
No conflict between split_archive and single-file mode
The verification shows that split_archive is safely handled when single-file mode is enabled:
- The single_file_archive flag is properly passed through the configuration chain from command-line arguments to the ArchiveWriter
- When single_file_archive is true, the ArchiveWriter uses a different write path that creates a single consolidated archive file instead of multiple files
- The size threshold check in JsonParser.cpp that triggers split_archive() is independent of the single-file mode and works correctly in both modes
- When split_archive() is called in single-file mode, it simply closes and reopens the archive with a new ID, maintaining the single-file requirement
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if split_archive has special handling for single-file mode
# Expected: Conditional logic in split_archive or ArchiveWriter to handle single-file mode
# Search for handling of single_file_archive in split_archive related code
rg -A 5 "split_archive.*single_file" || rg -A 5 "single_file.*split_archive"
# Search for size threshold handling in ArchiveWriter
ast-grep --pattern 'if ($target_encoded_size || $single_file) {
$$$
}'
Length of output: 152
Script:
#!/bin/bash
# Let's try a broader search for split_archive functionality and single file mode handling
# Search for split_archive method implementation
ast-grep --pattern 'split_archive($_) {
$$$
}'
# Search for any mentions of split_archive
rg -A 5 "split_archive"
# Search for single file mode handling
rg -A 5 "single_file"
# Look for size threshold handling
rg -A 5 "target_encoded_size|encoded_size"
Length of output: 24597
components/core/src/clp_s/CommandLineArguments.cpp (1)
193-196
: Verify interaction with existing compression options.
The new single-file archive option might have implications when used with other compression options, particularly target-encoded-size
and print-archive-stats
. Consider documenting these interactions in the help message and ensuring proper validation is in place.
components/core/src/clp_s/TimestampDictionaryWriter.hpp (3)
4-4
: Necessary include directive added
The #include <map>
directive is appropriately added to support the use of std::map
within the class.
81-84
: Method clear
correctly introduced
The clear
method effectively provides a way to reset the internal state, enhancing reusability and memory management.
31-31
: Ensure all references to removed methods are updated
The method write_and_flush_to_disk
has been replaced with write
. Please verify that all usages of write_and_flush_to_disk
in the codebase have been updated accordingly.
Run the following script to find any remaining references to write_and_flush_to_disk
:
✅ Verification successful
No references to the old method remain in the codebase
The search results confirm that there are no remaining references to write_and_flush_to_disk
in either header files (.hpp, .h) or source files (.cpp, .cc). The new write
method is properly defined in TimestampDictionaryWriter.hpp
. All necessary updates appear to have been completed successfully.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Search for any references to the old method `write_and_flush_to_disk` in the codebase.
# Test: Search for `write_and_flush_to_disk`. Expect: No matches if all references have been updated.
rg 'write_and_flush_to_disk'
Length of output: 28
Script:
#!/bin/bash
# Let's search for both the old and new method names to understand the context better
# and verify if the changes were properly made
# Search for the old method in header files
echo "=== Searching for write_and_flush_to_disk in header files ==="
fd -e hpp -e h --exec rg -l "write_and_flush_to_disk" {}
# Search for the old method in source files
echo "=== Searching for write_and_flush_to_disk in source files ==="
fd -e cpp -e cc --exec rg -l "write_and_flush_to_disk" {}
# Search for the new method name to verify its usage
echo "=== Searching for write method in TimestampDictionaryWriter context ==="
rg -B 2 -A 2 "void\s+write\s*\(\s*ZstdCompressor\s*&"
Length of output: 998
components/core/src/clp_s/TimestampDictionaryWriter.cpp (1)
17-28
: The write()
method implementation is correct and efficient
The write()
method appropriately merges ranges and writes timestamp entries and patterns to the compressor using the passed ZstdCompressor
instance. This change enhances flexibility by decoupling the compressor from the class, allowing for better resource management and reusability.
components/core/src/clp_s/ArchiveWriter.hpp (7)
16-16
: Inclusion of 'SingleFileArchiveDefs.hpp' is appropriate
The inclusion of 'SingleFileArchiveDefs.hpp' is necessary for the single-file archive functionality.
25-25
: Addition of 'single_file_archive' option aligns with new feature
Adding 'single_file_archive' to 'ArchiveWriterOption' supports the single-file archive capability.
124-124
: Simplified usage of 'm_timestamp_dict' improves code clarity
Changing 'm_timestamp_dict' to a direct instance enhances memory management and code readability.
134-134
: Consistent usage of 'm_timestamp_dict' across methods
Using 'm_timestamp_dict' directly ensures consistent and efficient operation in 'ingest_timestamp_entry' methods.
138-138
: Optimized 'ingest_timestamp_entry' method
Directly calling 'm_timestamp_dict.ingest_entry' simplifies the codebase.
169-169
: Appropriate use of [[nodiscard]] attribute on 'store_tables()'
Using [[nodiscard]] ensures that the returned size information from 'store_tables()' is not inadvertently ignored.
236-236
: Confirm default initialization of 'm_min_table_size'
'm_min_table_size' is default-initialized to zero. Verify that this is intentional and that zero is an appropriate default value in this context.
components/core/src/clp_s/ArchiveWriter.cpp (9)
16-16
: Assignment of m_single_file_archive
is appropriate.
The member variable m_single_file_archive
is correctly initialized from the options provided.
91-91
: Clearing m_timestamp_dict
appropriately.
The reset of m_timestamp_dict
ensures that the timestamp dictionary does not retain data between archives.
97-108
: Efficient handling of the timestamp dictionary.
The write_timestamp_dict()
function properly writes the timestamp dictionary to a file with compression.
110-122
: Implementation of single-file archive writing is sound.
The write_single_file_archive()
method correctly orchestrates the writing of the archive metadata and files.
145-145
: Resetting msgpack_buffer
correctly.
Assigning a new std::stringstream{}
to msgpack_buffer
effectively clears its content, preventing accumulation of data.
This addresses the previously noted issue about clearing std::stringstream
.
173-173
: Proper handling of end-of-file condition.
Breaking the loop upon receiving ErrorCodeEndOfFile
ensures correct termination of the read operation.
185-199
: Magic number initialization and header writing are correct.
The archive header is appropriately populated, and the magic number is correctly set using std::memcpy
.
🧰 Tools
🪛 cppcheck
[error] 191-191: Exception thrown in function declared not to throw exceptions.
(throwInNoexceptFunction)
Line range hint 261-377
: Returning multiple compressed sizes from store_tables()
.
Changing the return type to std::pair<size_t, size_t>
allows both table metadata and table sizes to be returned, enhancing functionality.
390-391
: Accurate time range expansion in metadata.
The use of m_timestamp_dict.get_begin_timestamp()
and get_end_timestamp()
accurately reflects the archive's time range in the metadata.
Actionable comments posted: 6
🧹 Outside diff range and nitpick comments (8)
components/core/src/clp_s/TimestampDictionaryWriter.cpp (2)
17-29
: Consider adding error handling and state validation

The write method could benefit from additional safeguards:
- Validate the compressor state before writing
- Add error handling for compression operations
- Consider adding documentation about the expected format being written
 void TimestampDictionaryWriter::write(ZstdCompressor& compressor) {
+    if (!compressor.is_open()) {
+        throw OperationFailed(ErrorCodeFailure, __FILE__, __LINE__);
+    }
+
     merge_range();
     write_timestamp_entries(m_column_key_to_range, compressor);
     compressor.write_numeric_value<uint64_t>(m_pattern_to_id.size());
     for (auto& it : m_pattern_to_id) {
         compressor.write_numeric_value<uint64_t>(it.second);
         std::string const& pattern = it.first->get_format();
         compressor.write_numeric_value<uint64_t>(pattern.length());
         compressor.write_string(pattern);
     }
 }
160-172
: Consider performance optimizations for size calculation

The current implementation of size_in_bytes has potential performance implications:
- Calling merge_range() on every size calculation could be expensive for large datasets
- The size calculation could be cached and invalidated only when the dictionary is modified
Consider implementing a caching mechanism:
 class TimestampDictionaryWriter {
 private:
+    mutable size_t m_cached_size{0};
+    mutable bool m_size_dirty{true};
+
+    void mark_size_dirty() { m_size_dirty = true; }
+
 public:
     size_t size_in_bytes() {
+        if (!m_size_dirty) {
+            return m_cached_size;
+        }
+
         merge_range();
         size_t size{2 * sizeof(uint64_t)};
         // ... rest of the calculation ...
+        m_cached_size = size;
+        m_size_dirty = false;
         return size;
     }
 };

components/core/src/clp_s/DictionaryReader.hpp (1)
93-95
: Remove commented-out code.

Instead of keeping commented-out class declarations, consider removing them entirely. If needed, the history can be preserved in version control.
Apply this diff:
-// class VariableDictionaryReader : public DictionaryReader<uint64_t, VariableDictionaryEntry> {};
-
-// class LogTypeDictionaryReader : public DictionaryReader<uint64_t, LogTypeDictionaryEntry> {};

components/core/src/clp_s/ArchiveWriter.cpp (2)
63-68
: Consider using std::accumulate for offset calculation

The current offset calculation using a loop could be more elegantly expressed using STL algorithms.
Consider this alternative implementation:
- uint64_t offset = 0; - for (auto& file : files) { - uint64_t original_size = file.o; - file.o = offset; - offset += original_size; - } + uint64_t offset = 0; + std::transform(files.begin(), files.end(), files.begin(), + [&offset](ArchiveFileInfo file) { + uint64_t original_size = file.o; + file.o = offset; + offset += original_size; + return file; + });
391-392
: Consider validating timestamp rangeWhile the direct member access is correct, consider validating the timestamp range before expanding.
Add validation:
+ auto begin_ts = m_timestamp_dict.get_begin_timestamp(); + auto end_ts = m_timestamp_dict.get_end_timestamp(); + if (begin_ts > end_ts) { + throw OperationFailed(ErrorCodeUnsupported, __FILENAME__, __LINE__); + } metadata.expand_time_range( - m_timestamp_dict.get_begin_timestamp(), - m_timestamp_dict.get_end_timestamp() + begin_ts, + end_ts );components/core/src/clp_s/TimestampDictionaryWriter.hpp (3)
81-84
: Consider markingclear()
asnoexcept
Since
clear()
is intended to reset internal state without throwing exceptions, consider marking it asnoexcept
to convey this guarantee.
93-93
: Rename method to better reflect its functionalityThe method
merge_range()
suggests merging a single range, but it merges multiple ranges according to the description. Consider renaming it tomerge_ranges()
for clarity and consistency.
98-98
: Fix typo in documentation commentThere's a grammatical error in the comment: "Writes timestamp entries to the a compression stream." Remove the extra "the".
Apply this diff to fix the typo:
- * Writes timestamp entries to the a compression stream. + * Writes timestamp entries to a compression stream.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
components/core/src/clp_s/ArchiveWriter.cpp (6 hunks)
components/core/src/clp_s/DictionaryReader.hpp (1 hunks)
components/core/src/clp_s/TimestampDictionaryWriter.cpp (2 hunks)
components/core/src/clp_s/TimestampDictionaryWriter.hpp (3 hunks)
components/core/src/clp_s/TimestampEntry.hpp (1 hunks)
🧰 Additional context used
🪛 cppcheck
components/core/src/clp_s/ArchiveWriter.cpp
[error] 191-191: Exception thrown in function declared not to throw exceptions.
(throwInNoexceptFunction)
🔇 Additional comments (6)
components/core/src/clp_s/TimestampDictionaryWriter.cpp (1)
153-158
: Previous review comment about m_next_id remains unaddressed
components/core/src/clp_s/DictionaryReader.hpp (1)
90-91
: Consider the architectural implications of replacing inheritance with type aliases.
The shift from inheritance to type aliases is a reasonable architectural choice that reduces code duplication and improves maintainability. However, this change might impact extensibility and could potentially break existing code that relies on inheritance.
Let's verify the impact of this change:
✅ Verification successful
Type alias replacement is safe and maintains existing usage patterns
The verification results show that both VariableDictionaryReader
and LogTypeDictionaryReader
are used consistently as const references or pointers throughout the codebase, with no evidence of inheritance-specific features or specialized implementations. The types are primarily used for dictionary operations in archive reading, variable interpretation, and search functionality.
Key findings:
- No derived classes or specialized implementations found
- All usages are compatible with the type alias approach
- The commented-out class declarations confirm they were simple inheritance without additional members
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for any specialized implementations or derived classes that might be affected
# Test 1: Look for any references to the old class declarations
rg "class (Variable|LogType)DictionaryReader\s*:" -A 5
# Test 2: Check for any potential virtual method overrides
ast-grep --pattern 'class $_ : public DictionaryReader<$_, $_> { $$$ }'
# Test 3: Search for direct usage of these types
rg "(Variable|LogType)DictionaryReader\s+\w+" -A 2
Length of output: 13394
components/core/src/clp_s/ArchiveWriter.cpp (2)
16-16
: LGTM: Single-file archive flag initialization
The initialization of m_single_file_archive
from options is properly placed and maintains the existing initialization pattern.
Line range hint 262-378
: LGTM: Well-documented table storage implementation
The changes to return both metadata and table sizes are well implemented and thoroughly documented. The schema packing strategy is clearly explained in the comments.
components/core/src/clp_s/TimestampDictionaryWriter.hpp (2)
26-26
: Default constructor is appropriate
The addition of a default constructor initializes the class properly.
29-31
: Addition of write
method enhances functionality
The new write
method correctly accepts a ZstdCompressor
reference to write the timestamp dictionary to a compression stream.
size_t size_in_bytes() const {
    return sizeof(uint64_t) + m_key_name.size() + sizeof(uint64_t)
           + m_column_ids.size() * sizeof(int32_t) + sizeof(TimestampEncoding)
           + 2 * sizeof(epochtime_t);
}
Fix incomplete size calculation in size_in_bytes()
The current implementation doesn't account for all member variables. Specifically, it's missing:
- m_epoch_start_double
- m_epoch_end_double
This could lead to incorrect size calculations when managing archive metadata.
Apply this diff to include all fields:
size_t size_in_bytes() const {
return sizeof(uint64_t) + m_key_name.size() + sizeof(uint64_t)
+ m_column_ids.size() * sizeof(int32_t) + sizeof(TimestampEncoding)
- + 2 * sizeof(epochtime_t);
+ + 2 * sizeof(epochtime_t) + 2 * sizeof(double);
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
size_t size_in_bytes() const {
    return sizeof(uint64_t) + m_key_name.size() + sizeof(uint64_t)
           + m_column_ids.size() * sizeof(int32_t) + sizeof(TimestampEncoding)
           + 2 * sizeof(epochtime_t) + 2 * sizeof(double);
}
💡 Codebase verification
Size calculation needs adjustment for double epoch encoding
The size_in_bytes() method doesn't account for the correct size when m_encoding is DoubleEpoch. The current implementation always uses sizeof(epochtime_t) for the range values, but the write_to_file() method writes double values when m_encoding is DoubleEpoch. This mismatch could lead to incorrect size calculations.
The size calculation should be:
return sizeof(uint64_t) + m_key_name.size() + sizeof(uint64_t)
+ m_column_ids.size() * sizeof(int32_t) + sizeof(TimestampEncoding)
+ (m_encoding == DoubleEpoch ? 2 * sizeof(double) : 2 * sizeof(epochtime_t));
🔗 Analysis chain
Verify size calculation matches serialization format
The size calculation should align with the actual bytes written in the write_to_file method to ensure consistent archive handling.
Let me gather more specific information about the serialization format.
Let me check the header file to understand the member variables and their types.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify that size_in_bytes matches the serialization format in write_to_file
# Search for write_to_file implementation to compare with size calculation
ast-grep --pattern 'void TimestampEntry::write_to_file($$$) {
$$$
}'
# Also check for any related serialization utilities
rg -l "write_to_file|serialize" --type cpp
Length of output: 3558
Script:
#!/bin/bash
# Search for TimestampEntry's write_to_file implementation
rg -A 20 "void TimestampEntry::write_to_file" components/core/src/clp_s/TimestampEntry.cpp
# Also check the constructor to understand the member variables
rg -A 10 "TimestampEntry::TimestampEntry" components/core/src/clp_s/TimestampEntry.cpp
Length of output: 1054
Script:
#!/bin/bash
# Check the TimestampEntry header file for member variables
cat components/core/src/clp_s/TimestampEntry.hpp
Length of output: 4484
compressor.write_numeric_value(static_cast<uint32_t>(archive_info_str.size()));
compressor.write_string(archive_info_str);
🛠️ Refactor suggestion
Add size validation for metadata packets
The packet size is written as uint32_t without validation, which could lead to issues with extremely large packets.
Add size validation:
+ constexpr uint32_t max_packet_size = 1024 * 1024 * 10; // 10MB limit
+ if (archive_info_str.size() > max_packet_size) {
+ throw OperationFailed(ErrorCodeUnsupported, __FILENAME__, __LINE__);
+ }
compressor.write_numeric_value(static_cast<uint32_t>(archive_info_str.size()));
compressor.write_string(archive_info_str);
Also applies to: 149-150
void ArchiveWriter::write_archive_header(FileWriter& archive_writer, size_t metadata_section_size) {
    ArchiveHeader header{
            .magic_number{0},
🛠️ Refactor suggestion
Initialize magic number properly
The magic number should not be initialized as 0 and then overwritten later.
Initialize it directly in the struct:
- .magic_number{0},
+ .magic_number = *reinterpret_cast<const uint32_t*>(cStructuredSFAMagicNumber),
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
.magic_number = *reinterpret_cast<const uint32_t*>(cStructuredSFAMagicNumber),
/**
 * Merge ranges by key name then return the size of data to be compressed in bytes
 */
size_t size_in_bytes();
🛠️ Refactor suggestion
Clarify method responsibilities in size_in_bytes()
The method size_in_bytes()
merges ranges and returns the size of the data. This dual responsibility may be confusing. Consider separating the merging operation from the size calculation or renaming the method to better reflect its behaviour.
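One possible split, sketched under the assumption that callers can tolerate an explicit merge step before querying the size (the signatures below are illustrative, and merge_ranges() is the rename suggested elsewhere in this review):

void merge_ranges();             // consolidate ranges once, before writing or sizing
size_t size_in_bytes() const;    // pure size query with no side effects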
A relatively complete PR. Thanks for fixing several issues!
 * Writes the timestamp dictionary to disk
 */
void write_and_flush_to_disk();
void write(ZstdCompressor& compressor);
Can we add the parameter in the docstring?
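For example, the parameter could be documented along these lines (wording is only a suggestion):

/**
 * Writes the timestamp dictionary to a compression stream.
 * @param compressor The compressor to write the dictionary to
 */
void write(ZstdCompressor& compressor);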
) {
    for (auto const& file : files) {
        std::string file_path = m_archive_path + file.n;
        FileReader reader;
Can we declare it outside the loop?
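For instance, the reader could be constructed once and reused across iterations, roughly like this (sketch only, with the copy loop elided):

FileReader reader;  // construct once, reuse for every file
for (auto const& file : files) {
    std::string file_path = m_archive_path + file.n;
    reader.open(file_path);
    // ... copy the file's contents into the archive writer ...
    reader.close();
}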
void clear();

/**
 * Merge ranges by key name then return the size of data to be compressed in bytes
Not sure if we should do two different things in one function. Additionally, directly adding the sizes of different types may not be intuitive to readers.
Yeah, the unfortunate thing is that because of how this class is written you always need to merge ranges before you can determine the size of all of the structures that get stored.
One thing we could do is get rid of this function and instead change the write
function to write to a buffer. That way we can just look at the final size of the buffer to get the pre-zstd size. The timestamp dictionary shouldn't be more than a few hundred bytes, so writing it to a buffer should be fine.
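As a rough sketch of that buffer-based approach (assuming a write overload that takes a std::stringstream, as proposed later in this thread), the caller could serialize into memory, take the buffer's size as the uncompressed size, and then feed the buffer to the compressor:

std::stringstream timestamp_dict_stream;
m_timestamp_dict.write(timestamp_dict_stream);     // serialize to an in-memory buffer
std::string const encoded = timestamp_dict_stream.str();
size_t const uncompressed_size = encoded.size();   // pre-zstd size, no separate size_in_bytes() needed
timestamp_dict_compressor.write(encoded.data(), encoded.size());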
Co-authored-by: wraymo <[email protected]>
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
components/core/src/clp_s/TimestampDictionaryWriter.hpp (1)
26-26
: Prefer using = default for the default constructor

For clarity and adherence to modern C++ best practices, consider defining the default constructor with = default;. Apply this diff to update the constructor:

-    TimestampDictionaryWriter() {}
+    TimestampDictionaryWriter() = default;
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
components/core/src/clp_s/TimestampDictionaryWriter.hpp (3 hunks)
🔇 Additional comments (1)
components/core/src/clp_s/TimestampDictionaryWriter.hpp (1)
81-84
: clear()
method is well-defined
The clear()
method effectively resets the internal state, and the documentation clearly describes its purpose.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (3)
components/core/src/clp_s/ArchiveWriter.cpp (3)
63-68
: Consider optimizing the offset calculation.

The current implementation creates a temporary variable and performs unnecessary assignments. Consider using std::accumulate or a simpler loop.
-    uint64_t offset = 0;
-    for (auto& file : files) {
-        uint64_t original_size = file.o;
-        file.o = offset;
-        offset += original_size;
-    }
+    uint64_t offset = 0;
+    for (auto& file : files) {
+        std::swap(file.o, offset);
+        offset += file.o;
+    }
74-77
: Consider simplifying the compressed size calculation.

The current implementation adds multiple terms individually. Consider using std::accumulate for better readability.
-    m_compressed_size = var_dict_compressed_size + log_dict_compressed_size
-                        + array_dict_compressed_size + timestamp_dict_compressed_size
-                        + schema_tree_compressed_size + schema_map_compressed_size
-                        + table_metadata_compressed_size + table_compressed_size;
+    std::vector<size_t> sizes = {
+            var_dict_compressed_size, log_dict_compressed_size,
+            array_dict_compressed_size, timestamp_dict_compressed_size,
+            schema_tree_compressed_size, schema_map_compressed_size,
+            table_metadata_compressed_size, table_compressed_size
+    };
+    m_compressed_size = std::accumulate(sizes.begin(), sizes.end(), 0UL);
164-182
: Consider adding progress tracking for large files.

When copying large files, it would be helpful to track and report progress, especially for monitoring long-running operations.
+    size_t total_bytes = 0;
+    for (auto const& file : files) {
+        total_bytes += file.o;
+    }
+    size_t bytes_copied = 0;
     for (auto const& file : files) {
         std::string file_path = m_archive_path + file.n;
         reader.open(file_path);
         char read_buffer[cReadBlockSize];
         while (true) {
             size_t num_bytes_read{0};
             ErrorCode const error_code = reader.try_read(read_buffer, cReadBlockSize, num_bytes_read);
             if (ErrorCodeEndOfFile == error_code) {
                 break;
             } else if (ErrorCodeSuccess != error_code) {
                 throw OperationFailed(error_code, __FILENAME__, __LINE__);
             }
             archive_writer.write(read_buffer, num_bytes_read);
+            bytes_copied += num_bytes_read;
+            if (auto progress = (bytes_copied * 100) / total_bytes; progress % 10 == 0) {
+                SPDLOG_INFO("Archive copy progress: {}%", progress);
+            }
         }
         reader.close();
         boost::filesystem::remove(file_path);
     }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
components/core/src/clp_s/ArchiveWriter.cpp (6 hunks)
components/core/src/clp_s/TimestampDictionaryWriter.hpp (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- components/core/src/clp_s/TimestampDictionaryWriter.hpp
🧰 Additional context used
🪛 cppcheck (2.10-2)
components/core/src/clp_s/ArchiveWriter.cpp
[error] 191-191: Exception thrown in function declared not to throw exceptions.
(throwInNoexceptFunction)
🔇 Additional comments (3)
components/core/src/clp_s/ArchiveWriter.cpp (3)
16-16
: LGTM: Clean initialization of single-file archive flag.
The member variable is properly initialized from the options struct.
372-378
: LGTM: Clean implementation of returning compressed sizes.
The method now correctly returns both the table metadata and table compressed sizes as a pair.
391-392
: LGTM: Clean timestamp dictionary access.
The method now correctly accesses the timestamp dictionary member variable directly.
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (12)
components/core/src/clp_s/TimestampDictionaryWriter.hpp (4)
26-26
: Add class documentation.

Consider adding a brief class-level documentation comment explaining the purpose and responsibilities of TimestampDictionaryWriter.
+/**
+ * Manages timestamp dictionary entries and patterns for CLP archives.
+ * Provides functionality to ingest, merge, and write timestamp entries
+ * to a buffered stream.
+ */
 TimestampDictionaryWriter() {}
29-32
: Enhance method documentation.

The current documentation could be more descriptive about the expected format and behaviour.
 /**
- * Writes the timestamp dictionary to a buffered stream.
+ * Writes the timestamp dictionary to a buffered stream in the CLP archive format.
+ * @param stream The output stream where the dictionary will be written
+ * @throws OperationFailed if writing to the stream fails
  */
82-85
: Specify which internal state is cleared.

The documentation should detail which member variables are reset.
 /**
- * Clears and resets all internal state.
+ * Clears and resets all internal state including:
+ * - Pattern to ID mappings
+ * - Next ID counter
+ * - Column key to range mappings
+ * - Column ID to range mappings
  */
89-100
: Review method design and documentation.
- The merge_range method documentation should explain the merging strategy and its effects.
- Consider whether write_timestamp_entries should be static, as it seems to be an implementation detail.
 /**
- * Merges timestamp ranges with the same key name but different node ids.
+ * Merges timestamp ranges with the same key name but different node ids
+ * into a single range spanning all timestamps for that key. This
+ * consolidation is necessary before writing to ensure correct range
+ * representation in the archive.
  */

 /**
- * Writes timestamp entries to a buffered stream.
+ * Writes the consolidated timestamp entries to a buffered stream in
+ * the CLP archive format.
  * @param ranges The consolidated timestamp ranges to write
- * @param compressor
+ * @param stream The output stream to write to
+ * @throws OperationFailed if writing to the stream fails
  */
-    static void write_timestamp_entries(
+ private:
+    void write_timestamp_entries(

components/core/src/clp_s/TimestampDictionaryWriter.cpp (1)
26-38
: Add error handling for stream operationsThe
write
method performs several stream operations without checking for failures. Consider adding error handling to ensure data integrity.Consider adding checks after each write operation:
 void TimestampDictionaryWriter::write(std::stringstream& stream) {
     merge_range();
     write_timestamp_entries(m_column_key_to_range, stream);
+    if (!stream.good()) {
+        throw std::runtime_error("Failed to write timestamp entries");
+    }
     write_numeric_value<uint64_t>(stream, m_pattern_to_id.size());
     for (auto& it : m_pattern_to_id) {
         write_numeric_value<uint64_t>(stream, it.second);
         std::string const& pattern = it.first->get_format();
         write_numeric_value<uint64_t>(stream, pattern.length());
         stream.write(pattern.data(), pattern.size());
+        if (!stream.good()) {
+            throw std::runtime_error("Failed to write pattern data");
+        }
     }
 }
148-157
: Track TODO for complete size calculationsThe documentation clearly distinguishes between different size measurements, but the TODO indicates that the size calculation is incomplete:
- Schema tree size is missing
- Schema map size is missing
- Timestamp dictionary size is missing
These missing components should be included for accurate size reporting.
Would you like me to help create a tracking issue for implementing the missing size calculations?
215-222
: Track technical debt: Timestamp dictionary locationThe comment indicates that the timestamp dictionary will be moved to the metadata region in a follow-up PR. This should be tracked to ensure it's not overlooked.
Would you like me to create a tracking issue for moving the timestamp dictionary to the metadata region?
components/core/src/clp_s/TimestampEntry.cpp (2)
7-10
: Consider adding endianness handling and error checking to write_numeric_valueThe current implementation has potential cross-platform compatibility issues:
- No endianness handling could cause problems when reading archives across different platforms
- No error handling for stream write failures
Consider:
- Adding platform-independent serialization
- Adding error checking for stream operations
 template <typename T>
-void write_numeric_value(std::stringstream& stream, T value) {
+ErrorCode write_numeric_value(std::stringstream& stream, T value) {
+    // Convert to network byte order (big-endian)
+    T network_value = host_to_network(value);
     stream.write(reinterpret_cast<char*>(&value), sizeof(value));
+    if (stream.fail()) {
+        return ErrorCodeFailedToWrite;
+    }
+    return ErrorCodeSuccess;
 }
Line range hint
266-268
: Complete or remove the incomplete TODO commentThe TODO comment appears to be truncated and doesn't provide clear guidance:
/**
 * TODO: this borrows logic from the double_as_int function
 * should
 */

Please either complete the TODO comment with clear instructions or remove it if no longer needed.
components/core/src/clp_s/ArchiveWriter.cpp (2)
104-118
: Add buffer management for timestamp dictionary serialization.The implementation should manage the string buffer's memory more efficiently for large dictionaries.
Consider using a fixed-size buffer with chunked writing:
-    std::stringstream timestamp_dict_stream;
-    m_timestamp_dict.write(timestamp_dict_stream);
-    std::string encoded_timestamp_dict = timestamp_dict_stream.str();
-    timestamp_dict_compressor.write(encoded_timestamp_dict.data(), encoded_timestamp_dict.size());
+    constexpr size_t buffer_size = 8192;
+    char buffer[buffer_size];
+    size_t total_written = 0;
+    while (m_timestamp_dict.write_chunk(buffer, buffer_size, total_written)) {
+        timestamp_dict_compressor.write(buffer, total_written);
+    }
206-216
: Initialize magic number directly.The magic number should be initialized directly rather than being set to 0 and then overwritten.
-    .magic_number{0},
+    .magic_number = *reinterpret_cast<const uint32_t*>(cStructuredSFAMagicNumber),

-    std::memcpy(&header.magic_number, cStructuredSFAMagicNumber, sizeof(header.magic_number));
193-196
: Enhance the description of the single-file-archive option.The description could be more informative about the implications and benefits of using this option. Consider adding details about the SFA specification and any performance considerations.
)( "single-file-archive", po::bool_switch(&m_single_file_archive), - "Create a single archive file instead of multiple files." + "Create a single archive file instead of multiple files, following the SFA specification. " + "This format combines all components into a single archive, which may be beneficial for " + "distribution and storage management." )(
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (12)
components/core/src/clp_s/ArchiveWriter.cpp (7 hunks)
components/core/src/clp_s/ArchiveWriter.hpp (6 hunks)
components/core/src/clp_s/CommandLineArguments.cpp (1 hunks)
components/core/src/clp_s/CommandLineArguments.hpp (2 hunks)
components/core/src/clp_s/JsonParser.cpp (1 hunks)
components/core/src/clp_s/JsonParser.hpp (1 hunks)
components/core/src/clp_s/TimestampDictionaryWriter.cpp (2 hunks)
components/core/src/clp_s/TimestampDictionaryWriter.hpp (3 hunks)
components/core/src/clp_s/TimestampEntry.cpp (2 hunks)
components/core/src/clp_s/TimestampEntry.hpp (2 hunks)
components/core/src/clp_s/archive_constants.hpp (1 hunks)
components/core/src/clp_s/clp-s.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
- components/core/src/clp_s/CommandLineArguments.hpp
- components/core/src/clp_s/JsonParser.cpp
- components/core/src/clp_s/JsonParser.hpp
- components/core/src/clp_s/TimestampEntry.hpp
- components/core/src/clp_s/archive_constants.hpp
- components/core/src/clp_s/clp-s.cpp
🧰 Additional context used
🪛 cppcheck (2.10-2)
components/core/src/clp_s/ArchiveWriter.cpp
[error] 191-191: Exception thrown in function declared not to throw exceptions.
(throwInNoexceptFunction)
🔇 Additional comments (11)
components/core/src/clp_s/TimestampDictionaryWriter.hpp (1)
4-5
: LGTM! Required includes added.
The new includes support the stringstream-based implementation and explicit map usage.
components/core/src/clp_s/TimestampDictionaryWriter.cpp (1)
162-167
: Consider thread safety implications
The clear
method modifies multiple member variables without synchronization. If this class is used in a multi-threaded context, consider adding appropriate synchronization mechanisms.
Let's check if this class is used in multi-threaded contexts:
components/core/src/clp_s/ArchiveWriter.hpp (4)
17-17
: LGTM: Header and struct changes align with single-file archive support
The addition of SingleFileArchiveDefs.hpp
and the single_file_archive
flag in ArchiveWriterOption
properly set up the foundation for single-file archive support.
Also applies to: 26-26
175-176: LGTM: Enhanced size reporting in store_tables
The change to return both metadata and table sizes provides better granularity for archive size tracking.
177-203: Address magic number requirement from PR feedback
The single-file archive methods are well-organized with clear separation of concerns. However, the PR feedback mentions the need to:
- Select a formal magic number
- Change it to 4 bytes
Consider adding a constant for the magic number in SingleFileArchiveDefs.hpp.
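For illustration only, such a constant might look like this (the byte values below are placeholders rather than the number this PR ultimately picks; only the name cStructuredSFAMagicNumber comes from the diff):

// SingleFileArchiveDefs.hpp (sketch)
#include <array>
#include <cstdint>

namespace clp_s {
// Hypothetical 4-byte magic number for structured single-file archives.
constexpr std::array<uint8_t, 4> cStructuredSFAMagicNumber{0x53, 0x46, 0x41, 0x01};
}  // namespace clp_s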
130-130: Verify special member functions after timestamp dictionary changes
The change from a pointer to a direct instance of TimestampDictionaryWriter improves memory safety. However, this change might affect the class's copy/move semantics.
Also applies to: 140-140, 144-144, 240-240
✅ Verification successful
No special member functions needed for timestamp dictionary changes
The change from pointer to direct instance is safe as both classes have appropriate member function declarations:
- TimestampDictionaryWriter has a default constructor and implicitly-declared move/copy operations
- ArchiveWriter has an explicit constructor and default destructor, with implicitly-declared move/copy operations
- The member m_timestamp_dict is of a movable/copyable type
The compiler-generated special member functions will handle the timestamp dictionary member appropriately.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check if TimestampDictionaryWriter has proper copy/move semantics
# and if ArchiveWriter declares any special member functions
# Search for TimestampDictionaryWriter's special member functions
ast-grep --pattern 'class TimestampDictionaryWriter {
$$$
TimestampDictionaryWriter(const TimestampDictionaryWriter&) {
$$$
}
$$$
}'
# Search for ArchiveWriter's special member functions
ast-grep --pattern 'class ArchiveWriter {
$$$
ArchiveWriter(const ArchiveWriter&) {
$$$
}
$$$
}'
Length of output: 257
Script:
#!/bin/bash
# Let's try a broader search for special member functions and class definitions
# Search for TimestampDictionaryWriter class definition
ast-grep --pattern 'class TimestampDictionaryWriter {
$$$
}'
# Search for ArchiveWriter class definition
ast-grep --pattern 'class ArchiveWriter {
$$$
}'
# Also search for any copy/move declarations using ripgrep
rg -A 2 "TimestampDictionaryWriter\(.*TimestampDictionaryWriter.*\)"
rg -A 2 "ArchiveWriter\(.*ArchiveWriter.*\)"
# Search for deleted or defaulted special members
rg "= default|= delete" --type cpp
Length of output: 27629
Script:
#!/bin/bash
# Let's search for the specific classes and their member declarations
# Search for TimestampDictionaryWriter class
ast-grep --pattern 'class TimestampDictionaryWriter {
$$$
}'
# Search for ArchiveWriter class and its member variables
ast-grep --pattern 'class ArchiveWriter {
$$$
TimestampDictionaryWriter m_timestamp_dict;
$$$
}'
# Search for any move/copy operations in these files
rg -l "TimestampDictionaryWriter|ArchiveWriter" | xargs rg "TimestampDictionaryWriter\(.*\)|ArchiveWriter\(.*\)"
Length of output: 579
components/core/src/clp_s/ArchiveWriter.cpp (4)
18-25: LGTM: Clean initialization of single-file archive support.
The implementation properly handles the temporary directory creation for single-file archives by appending the temporary postfix.
Line range hint 282-398: LGTM: Well-documented table storage implementation.
The store_tables method is well-documented and properly handles the compression of schema tables and their metadata.
411-412: LGTM: Clean timestamp dictionary access.
The update to metadata handling correctly uses direct access to the timestamp dictionary.
154-155: 🛠️ Refactor suggestion
Add size validation for metadata packets.
The packet size is written without validation, which could lead to issues with extremely large packets.
+ constexpr uint32_t max_packet_size = 1024 * 1024 * 10; // 10MB limit
+ if (archive_info_str.size() > max_packet_size) {
+ throw OperationFailed(ErrorCodeUnsupported, __FILENAME__, __LINE__);
+ }
compressor.write_numeric_value(static_cast<uint32_t>(archive_info_str.size()));
compressor.write_string(archive_info_str);
Likely invalid or redundant comment.
components/core/src/clp_s/CommandLineArguments.cpp (1)
193-196: Consider adding validation for incompatible options.
Since this is a new archive format, consider adding validation to check for any incompatible compression options or configurations that might not work with single-file archives.
Consider implementing validation in a separate method like validate_compression_options() to maintain clean separation of concerns.
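A rough, self-contained sketch of what such a hook could look like (only the single-file-archive flag comes from this PR; the conflicting option is hypothetical):

#include <stdexcept>

// Sketch only: option names other than single_file_archive are illustrative.
struct CompressionOptionsSketch {
    bool single_file_archive{false};
    bool some_incompatible_option{false};  // stand-in for a future conflicting flag
};

void validate_compression_options(CompressionOptionsSketch const& options) {
    if (options.single_file_archive && options.some_incompatible_option) {
        throw std::invalid_argument(
                "The requested option is incompatible with --single-file-archive"
        );
    }
}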
namespace {
template <typename T>
void write_numeric_value(std::stringstream& stream, T value) {
    stream.write(reinterpret_cast<char*>(&value), sizeof(value));
}
}  // namespace
Is it possible to get rid of the duplicate code?
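One way to do that would be to hoist the helper into a shared header so both call sites use the same template, e.g. (a sketch; the enclosing namespace is an assumption, though a later revision of this PR does move the helper into Utils.hpp):

// Shared header (e.g. Utils.hpp) -- single definition used by both writers.
#include <sstream>

namespace clp_s {
template <typename ValueType>
void write_numeric_value(std::stringstream& stream, ValueType value) {
    stream.write(reinterpret_cast<char*>(&value), sizeof(value));
}
}  // namespace clp_s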
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (2)
components/core/src/clp_s/TimestampDictionaryWriter.cpp (1)
19-31: Consider adding stream validation and string writing helper
The write method could benefit from:
- Stream state validation after writes
- A helper method for writing strings to encapsulate length+content writing
Consider this improvement:
+void write_string(std::stringstream& stream, const std::string& str) {
+    write_numeric_value<uint64_t>(stream, str.length());
+    stream.write(str.data(), str.size());
+    if (!stream.good()) {
+        throw std::runtime_error("Failed to write string");
+    }
+}
 void TimestampDictionaryWriter::write(std::stringstream& stream) {
     merge_range();
     write_timestamp_entries(m_column_key_to_range, stream);
     write_numeric_value<uint64_t>(stream, m_pattern_to_id.size());
     for (auto& it : m_pattern_to_id) {
         write_numeric_value<uint64_t>(stream, it.second);
-        std::string const& pattern = it.first->get_format();
-        write_numeric_value<uint64_t>(stream, pattern.length());
-        stream.write(pattern.data(), pattern.size());
+        write_string(stream, it.first->get_format());
     }
 }
components/core/src/clp_s/Utils.hpp (1)
257-265: Add unit tests for numeric serialization.
To ensure the reliability of the new write_numeric_value function, we should add comprehensive unit tests covering:
- Different numeric types (integers, floating-point)
- Different sizes (8-bit to 64-bit)
- Edge cases (min/max values)
- Endianness handling
- Error cases
Would you like me to help create these unit tests?
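For reference, a minimal, framework-free sketch of one such round-trip check (the helper is re-declared locally so the snippet stands alone; a real test would live in the project's unit-test harness):

#include <cassert>
#include <cstdint>
#include <cstring>
#include <sstream>
#include <string>

// Local copy of the helper under test so this sketch compiles on its own.
template <typename ValueType>
void write_numeric_value(std::stringstream& stream, ValueType value) {
    stream.write(reinterpret_cast<char*>(&value), sizeof(value));
}

int main() {
    std::stringstream stream;
    constexpr uint64_t value{0x0102030405060708ULL};
    write_numeric_value(stream, value);

    // The serialized buffer should be exactly sizeof(value) bytes and round-trip.
    std::string const serialized{stream.str()};
    assert(sizeof(value) == serialized.size());

    uint64_t deserialized{};
    std::memcpy(&deserialized, serialized.data(), sizeof(deserialized));
    assert(value == deserialized);
    return 0;
}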
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- components/core/src/clp_s/TimestampDictionaryWriter.cpp (2 hunks)
- components/core/src/clp_s/TimestampEntry.cpp (2 hunks)
- components/core/src/clp_s/Utils.hpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- components/core/src/clp_s/TimestampEntry.cpp
🔇 Additional comments (4)
components/core/src/clp_s/TimestampDictionaryWriter.cpp (3)
3-4: LGTM: Include addition is appropriate
The addition of the header is necessary for the new stream-based writing approach and follows proper include ordering conventions.
155-160: The previous review comment about potential pattern ID duplication in the clear() method is still valid and needs to be addressed.
8-16: Verify endianness handling in write_numeric_value
The method now uses write_numeric_value for size serialization. We should ensure proper endianness handling for cross-platform compatibility.
✅ Verification successful
From the search results, I can see that:
- The write_numeric_value implementation in components/core/src/clp_s/Utils.hpp directly writes the raw bytes:
template <typename ValueType>
void write_numeric_value(std::stringstream& stream, ValueType value) {
stream.write(reinterpret_cast<char*>(&value), sizeof(value));
}
- The codebase explicitly checks for and rejects big-endian machines in CMakeLists.txt:
TEST_BIG_ENDIAN(IS_BIG_ENDIAN)
if (IS_BIG_ENDIAN)
message(FATAL_ERROR "Big-endian machines are not supported")
- Where endianness conversion is needed (in FFI/IR stream), the code explicitly handles it using bswap functions.
Endianness handling is correctly managed
The codebase explicitly enforces little-endian architecture at build time and handles endianness conversion where needed. The write_numeric_value implementation is correct as it preserves the native byte order, which is guaranteed to be little-endian by the build system.
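Purely as an illustration of the bswap approach mentioned above, a byte-order-independent write could look like this if big-endian hosts were ever supported (not something this PR needs):

#include <bit>
#include <cstdint>
#include <sstream>

// Illustration only: normalize to little-endian before writing.
void write_little_endian_u64(std::stringstream& stream, uint64_t value) {
    if constexpr (std::endian::native == std::endian::big) {
        value = __builtin_bswap64(value);  // GCC/Clang builtin
    }
    stream.write(reinterpret_cast<char*>(&value), sizeof(value));
}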
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for endianness handling in write_numeric_value implementation
ast-grep --pattern 'template <typename T> void write_numeric_value(std::stringstream& stream, T value) { $$$ }'
Length of output: 113
Script:
#!/bin/bash
# Let's try a broader search for write_numeric_value implementation
rg "write_numeric_value" -A 5
# Also search for any endianness-related code
rg -i "endian|hton[ls]|ntoh[ls]"
Length of output: 29982
components/core/src/clp_s/Utils.hpp (1)
257-265: Verify usage across the codebase.
Let's verify how this function is being used in the codebase to ensure consistent implementation.
✅ Verification successful
The function write_numeric_value is consistently used across the codebase
The verification shows that write_numeric_value is used consistently throughout the codebase for writing numeric values to streams, particularly in dictionary-related operations. All usages follow the same pattern of writing fixed-size numeric types to streams. The implementation matches its usage pattern across different components (clp, clp_s, glt), and there are no inconsistent implementations found.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Find all usages of write_numeric_value to verify implementation consistency
# Search for direct calls to write_numeric_value
echo "Direct calls to write_numeric_value:"
rg "write_numeric_value\s*<" --type cpp
# Search for stringstream usage patterns that might need to use write_numeric_value
echo "\nPotential places where write_numeric_value could be used:"
rg "stream\.write.*reinterpret_cast.*sizeof" --type cpp
Length of output: 3752
@wraymo approves per sidebar.
Co-authored-by: Devin Gibson <[email protected]>
Co-authored-by: Devin Gibson <[email protected]>
Description
This PR adds support for writing clp-s single-file archives in accordance with the SFA spec. This is accomplished by first compressing a multi-file archive as normal and then combining everything into a single archive file.
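As a loose illustration of that flow (not the actual implementation; the real write path emits a header and metadata section and orders the components per the SFA spec), the combining step amounts to streaming each component file into one output:

#include <filesystem>
#include <fstream>
#include <vector>

// Sketch: append a known, ordered list of component files to a single archive file.
void combine_into_single_file_archive(
        std::vector<std::filesystem::path> const& component_files,
        std::filesystem::path const& output_path
) {
    std::ofstream archive{output_path, std::ios::binary};
    // In the real format, the header and metadata section are written first.
    for (auto const& path : component_files) {
        std::ifstream component{path, std::ios::binary};
        archive << component.rdbuf();  // append the component's bytes verbatim
    }
}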
Validation performed
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Documentation
Refactor