Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(clp-s): Add the write path for single-file archives. #563

Merged
merged 21 commits into from
Nov 27, 2024

Conversation

wraymo
Copy link
Contributor

@wraymo wraymo commented Oct 24, 2024

Description

This PR adds support for writing clp-s single file archives in accordance with the SFA spec. This is accomplished by first compressing a multi-file archive as normal then combining everything together into a single archive.

Validation performed

  • Validated that SFAs are well-formed by succesfully decompressing and searching on them
  • Validated by inspection (and by implementing a reader) that the new metadata and header agree with the spec

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced support for single-file archives, allowing users to create a single compressed file instead of multiple files.
    • Added new methods for managing archive metadata and writing single-file archives.
    • Enhanced JSON parsing to accommodate single-file archive options.
    • New command line option for single-file archive creation.
  • Bug Fixes

    • Improved error handling and state management in the archiving process.
  • Documentation

    • Updated comments and documentation for clarity on new functionalities and methods.
  • Refactor

    • Restructured classes and methods to streamline archive handling and improve internal logic.

Copy link
Contributor

coderabbitai bot commented Oct 24, 2024

Walkthrough

The changes in this pull request focus on enhancing the functionality of the ArchiveWriter and related classes, particularly in handling single-file archives. Key updates include the introduction of new member variables and methods, modifications to existing method signatures, and improvements to command line argument parsing. The TimestampDictionaryWriter class has also been restructured to streamline its operations. Additionally, a new file defining structures for single-file archives has been added, along with updates to related classes and methods to support these enhancements.

Changes

File Change Summary
components/core/src/clp_s/ArchiveWriter.cpp Added member variable m_single_file_archive, modified close method to differentiate between single and multi-file archives, added write_timestamp_dict method, updated store_tables return type to std::pair<size_t, size_t>.
components/core/src/clp_s/ArchiveWriter.hpp Added bool single_file_archive to ArchiveWriterOption, updated store_tables return type, added methods for single-file archive handling.
components/core/src/clp_s/CommandLineArguments.cpp Introduced single-file-archive option in command line argument parsing.
components/core/src/clp_s/CommandLineArguments.hpp Added member variable m_single_file_archive and getter method get_single_file_archive().
components/core/src/clp_s/JsonParser.cpp Added single_file_archive to m_archive_options structure in constructor.
components/core/src/clp_s/JsonParser.hpp Added bool single_file_archive to JsonParserOption struct.
components/core/src/clp_s/SingleFileArchiveDefs.hpp Introduced definitions and structures for managing single-file archives, including ArchiveHeader, ArchiveCompressionType, and related structures.
components/core/src/clp_s/TimestampDictionaryWriter.cpp Replaced write_and_flush_to_disk with write, added clear method, removed open and close methods.
components/core/src/clp_s/TimestampDictionaryWriter.hpp Updated constructor and method signatures, removed file management methods, added write and clear methods.
components/core/src/clp_s/archive_constants.hpp Added constant cTmpPostfix for temporary file postfix.
components/core/src/clp_s/clp-s.cpp Modified compress function to include single_file_archive parameter.
components/core/src/clp_s/TimestampEntry.hpp Updated method write_to_file to write_to_stream, changing parameter type from ZstdCompressor& to std::stringstream&.

Possibly related PRs

  • clp-s: Implement table packing #466: The changes in ArchiveWriter.cpp and ArchiveWriter.hpp regarding the handling of archive formats and metadata are related to the overall archiving functionality, which may connect with the changes in ArchiveReader that also deal with metadata and schema reading.
  • feat(clp-s): Chunk output by size (in bytes) during ordered decompression. #600: The modifications to the CommandLineArguments class, specifically the renaming of ordered_chunk_size to target_ordered_chunk_size, directly relate to the changes in the main PR that involve chunk size handling in the ArchiveWriter class. This indicates a cohesive approach to managing chunk sizes across different components.

Suggested reviewers

  • wraymo

📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 4c7a50f and c698487.

📒 Files selected for processing (1)
  • components/core/src/clp_s/Utils.hpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/core/src/clp_s/Utils.hpp

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice work! Seems mostly good for a draft implementation.

Main things we should change quickly is putting the archive header + metadata section into the regular multi-file archive, and also formally pick a magic number + change the magic number to 4 bytes.

There are other things we need to clean up/think about before actually merging this, but the above should changes should be enough to build off of for prototyping.

components/core/src/clp_s/ArchiveWriter.cpp Outdated Show resolved Hide resolved
components/core/src/clp_s/ArchiveWriter.cpp Outdated Show resolved Hide resolved
size_t metadata_section_size = archive_writer.get_pos() - sizeof(ArchiveHeader);
write_archive_files(archive_writer, files);
m_compressed_size = archive_writer.get_pos();
write_archive_header(archive_writer, metadata_section_size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we intend to have a file for the header+metadata section for the multi-file archive as well, but it seems like we only write this metadata section + header for the single-file archive right now. Could we add this to the multi-file write path as well? It will make the read side much simpler.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Per sidebar we will add the header to multi-file archive as part of a later PR so that we don't need to start changing the read side in this PR.

components/core/src/clp_s/ArchiveWriter.cpp Show resolved Hide resolved
components/core/src/clp_s/SingleFileArchiveDefs.hpp Outdated Show resolved Hide resolved
components/core/src/clp_s/ArchiveWriter.cpp Outdated Show resolved Hide resolved
@gibber9809
Copy link
Contributor

Nice work! Seems mostly good for a draft implementation.

Main things we should change quickly is putting the archive header + metadata section into the regular multi-file archive, and also formally pick a magic number + change the magic number to 4 bytes.

There are other things we need to clean up/think about before actually merging this, but the above should changes should be enough to build off of for prototyping.

Also need to go through and fix all of the fields that are a different size than what the spec specifies.

@gibber9809 gibber9809 changed the title clp-s: Add the read path for single-file archive clp-s: Add the write path for single-file archive Oct 29, 2024
@gibber9809 gibber9809 marked this pull request as ready for review November 6, 2024 21:00
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (17)
components/core/src/clp_s/archive_constants.hpp (1)

5-7: Consider adding format version and magic number constants.

Based on the PR objectives, we need to formally select a magic number and adjust it to 4 bytes. Consider adding these constants here:

 // Single file archive
+// Magic number for identifying single-file archives (4 bytes)
+constexpr uint32_t cSingleFileArchiveMagic = 0x434C5053;  // "CLPS" in ASCII
+// Current format version
+constexpr uint32_t cSingleFileArchiveVersion = 1;
 constexpr char cArchiveFile[] = "/archive";
components/core/src/clp_s/SingleFileArchiveDefs.hpp (3)

17-26: Add documentation for header fields and consider structure optimization.

  1. Consider adding documentation for each field to explain their purpose and constraints.
  2. The structure could be optimized:
    • padding field could be removed by reordering fields to achieve natural alignment
    • Four reserved padding fields (32 bytes) might be excessive

Consider this alternative arrangement:

 struct ArchiveHeader {
     uint8_t magic_number[4];
     uint32_t version;
+    uint16_t compression_type;
+    uint32_t metadata_section_size;
     uint64_t uncompressed_size;
     uint64_t compressed_size;
-    uint64_t reserved_padding[4];
+    uint64_t reserved_padding[2];  // Reduced to 16 bytes
-    uint32_t metadata_section_size;
-    uint16_t compression_type;
-    uint16_t padding;
 };

28-36: Consider future compression algorithm support.

While Zstd is a good default choice, consider:

  1. Adding a comment explaining why Zstd was chosen as the default
  2. Adding placeholder values for potential future compression algorithms (e.g., LZ4, ZLIB)

38-43: Document future fields in ArchiveInfoPacket.

The TODO comment should specify what fields are planned to be added in the future to help with forward compatibility planning.

components/core/src/clp_s/JsonParser.hpp (1)

38-38: Add documentation for the new option.

Please add a descriptive comment above the single_file_archive member to explain its purpose and behaviour. This will help maintain code clarity and assist future developers.

Apply this diff:

+    // Whether to use single-file archive format instead of multi-file
     bool single_file_archive;
components/core/src/clp_s/CommandLineArguments.hpp (2)

105-106: Add documentation for the single-file archive feature

The implementation follows the class's established patterns, but documentation would help clarify:

  • Purpose and behaviour of single-file archive mode
  • Relationship with multi-file archive mode
  • Impact on archive format and metadata handling

Consider adding documentation blocks above both the getter and member variable.

+    /**
+     * Returns whether archives should be written in single-file mode.
+     * @return true if single-file archive mode is enabled, false for multi-file mode
+     */
     bool get_single_file_archive() const { return m_single_file_archive; }

     // ... other methods ...

+    // Whether to write archives in single-file mode instead of multi-file mode
     bool m_single_file_archive{false};

Also applies to: 179-179


105-106: Consider more descriptive naming

The current name m_single_file_archive could be more explicit about its purpose. Consider alternatives that better describe the operational mode:

-    bool get_single_file_archive() const { return m_single_file_archive; }
+    bool get_use_single_file_mode() const { return m_use_single_file_mode; }

     // ... other methods ...

-    bool m_single_file_archive{false};
+    bool m_use_single_file_mode{false};

Also applies to: 179-179

components/core/src/clp_s/clp-s.cpp (1)

Line range hint 73-116: Enhance compress function for single-file archive support

The compress function needs additional considerations for single-file archives:

  1. The magic number requirement mentioned in the PR feedback isn't addressed
  2. The archive format needs specific handling for single-file cases

Consider:

  1. Adding a 4-byte magic number validation as per PR feedback
  2. Implementing specific error handling for single-file archive creation
  3. Documenting the archive format differences between single and multi-file modes
components/core/src/clp_s/CommandLineArguments.cpp (1)

193-196: LGTM with suggestions for the help message.

The implementation of the single-file archive option is correct, but consider enhancing the help message to be more informative about the implications and trade-offs of using this option.

Consider updating the help message to be more descriptive:

-                    "Create a single archive file instead of multiple files."
+                    "Create a single archive file instead of multiple files. This option may impact "
+                    "memory usage and processing time for large datasets."
components/core/src/clp_s/TimestampDictionaryWriter.hpp (4)

26-26: Consider omitting the explicit default constructor

Since the class does not perform any special initialization, the compiler-generated default constructor can be used instead of defining an empty one.


29-31: Enhance documentation for the write method

Consider providing more detailed documentation for the write method, including explanations of its behaviour, side effects, and any exceptions it might throw.


88-88: Correct capitalization in documentation

Please capitalize "IDs" in "node IDs" for consistency and clarity.

Apply this diff to correct the capitalization:

-     * Merges timestamp ranges with the same key name but different node ids.
+     * Merges timestamp ranges with the same key IDs but different node IDs.

93-93: Fix typo in method documentation

There's an extra "the" in the comment. Please remove it for clarity.

Apply this diff to fix the typo:

-     * Writes timestamp entries to the a compression stream.
+     * Writes timestamp entries to a compression stream.
components/core/src/clp_s/ArchiveWriter.hpp (2)

149-151: Address the TODO: Include sizes of schema components in 'get_data_size()'

The TODO comment suggests adding the sizes of the schema tree, schema map, and timestamp dictionary to the total data size. Implementing this enhancement will provide a more accurate data size calculation.

Would you like assistance in implementing this enhancement or opening a GitHub issue to track this task?


182-183: Ensure consistent formatting of function declarations

The 'void' return type for 'write_archive_metadata' is on a separate line from the function name. For consistency, consider placing them on the same line.

Apply this diff to improve consistency:

-void
-write_archive_metadata(FileWriter& archive_writer, std::vector<ArchiveFileInfo> const& files);
+void write_archive_metadata(FileWriter& archive_writer, std::vector<ArchiveFileInfo> const& files);
components/core/src/clp_s/ArchiveWriter.cpp (2)

46-78: Ensure consistent handling of m_compressed_size.

In the close() method, when m_single_file_archive is true, m_compressed_size is calculated after writing the archive. Consider initializing m_compressed_size in both branches to maintain consistency.

You might apply this change:

+    } else {
         // Timestamp dictionary written separately here until we transition to moving it inside of
         // the metadata region of multi-file archives.
         auto timestamp_dict_compressed_size = write_timestamp_dict();
         m_compressed_size = var_dict_compressed_size + log_dict_compressed_size
                             + array_dict_compressed_size + timestamp_dict_compressed_size
                             + schema_tree_compressed_size + schema_map_compressed_size
                             + table_metadata_compressed_size + table_compressed_size;
     }
+    if (m_single_file_archive) {
+        m_compressed_size = archive_writer.get_pos();
+    }

73-73: Consider inlining write_timestamp_dict() for efficiency.

Since the timestamp dictionary is small, you might improve performance by writing it directly to a buffer instead of writing to a file and reading it back.

Refactor write_timestamp_dict() to write to a buffer:

size_t ArchiveWriter::write_timestamp_dict() {
    std::stringstream timestamp_dict_buffer;
    ZstdCompressor timestamp_dict_compressor;
    timestamp_dict_compressor.open(timestamp_dict_buffer, m_compression_level);
    m_timestamp_dict.write(timestamp_dict_compressor);
    timestamp_dict_compressor.close();
    std::string compressed_data = timestamp_dict_buffer.str();
    std::string timestamp_dict_path = m_archive_path + constants::cArchiveTimestampDictFile;
    FileWriter timestamp_dict_file_writer;
    timestamp_dict_file_writer.open(timestamp_dict_path, FileWriter::OpenMode::CreateForWriting);
    timestamp_dict_file_writer.write(compressed_data.data(), compressed_data.size());
    timestamp_dict_file_writer.close();
    return compressed_data.size();
}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9f6a6ce and 5b67e39.

📒 Files selected for processing (11)
  • components/core/src/clp_s/ArchiveWriter.cpp (6 hunks)
  • components/core/src/clp_s/ArchiveWriter.hpp (7 hunks)
  • components/core/src/clp_s/CommandLineArguments.cpp (1 hunks)
  • components/core/src/clp_s/CommandLineArguments.hpp (2 hunks)
  • components/core/src/clp_s/JsonParser.cpp (1 hunks)
  • components/core/src/clp_s/JsonParser.hpp (1 hunks)
  • components/core/src/clp_s/SingleFileArchiveDefs.hpp (1 hunks)
  • components/core/src/clp_s/TimestampDictionaryWriter.cpp (2 hunks)
  • components/core/src/clp_s/TimestampDictionaryWriter.hpp (3 hunks)
  • components/core/src/clp_s/archive_constants.hpp (1 hunks)
  • components/core/src/clp_s/clp-s.cpp (1 hunks)
🧰 Additional context used
🪛 cppcheck
components/core/src/clp_s/ArchiveWriter.cpp

[error] 191-191: Exception thrown in function declared not to throw exceptions.

(throwInNoexceptFunction)

🔇 Additional comments (29)
components/core/src/clp_s/archive_constants.hpp (2)

5-6: LGTM! The constant follows the established naming convention.

The new constant maintains consistency with other path constants in the file.


6-6: Please document the constant's usage and verify path handling.

While the constant follows the existing pattern of using forward slashes, we should ensure proper path handling across different platforms.

Let's check how this constant is used throughout the codebase:

Consider adding a brief comment explaining the purpose and usage of this constant in the single-file archive context.

✅ Verification successful

Path handling is properly implemented with std::filesystem

The constant cArchiveFile is used appropriately in the codebase:

  • It's only used in ArchiveWriter.cpp where it's combined with the base path using proper path handling via std::filesystem::path
  • The codebase consistently uses std::filesystem for all path operations, ensuring cross-platform compatibility
  • Forward slashes in the constant are safe as std::filesystem handles path separators correctly across platforms

Suggested documentation comment:

// Relative path suffix for the single-file archive within the archive directory
constexpr char cArchiveFile[] = "/archive";
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for usage of the new constant
rg -l "cArchiveFile" --type cpp --type hpp

# Check for potential path manipulation functions
ast-grep --pattern 'std::filesystem::path($_)'

Length of output: 119


Script:

#!/bin/bash
# Search for usage of the constant with correct file extensions
rg -l "cArchiveFile" --type-add 'cpp:*.{cpp,hpp}'

# Check for path handling in the codebase
rg -A 3 "std::filesystem::(path|absolute|relative|canonical)" --type-add 'cpp:*.{cpp,hpp}'

# Find files containing the constant to examine its context
rg -C 3 "cArchiveFile" --type-add 'cpp:*.{cpp,hpp}'

Length of output: 17577

components/core/src/clp_s/SingleFileArchiveDefs.hpp (2)

10-12: LGTM! Version constants are well-defined.

The version numbering scheme using separate major, minor, and patch components with appropriate integer sizes is a good practice.


15-15: Verify the magic number selection.

The magic number size is correct at 4 bytes. However, please document the rationale for choosing these specific bytes (0xFD, 0x2F, 0xC5, 0x30) and verify they don't conflict with other file formats.

✅ Verification successful

Let me search for more context about this specific magic number and its usage.


Let me check for any documentation or comments about this file format and its magic number.


Based on the context gathered, I can now provide a final response.

Magic number appears to be unique and well-structured

The magic number {0xFD, 0x2F, 0xC5, 0x30} is part of a well-defined archive format header structure that includes version information, size fields, and metadata sections. The uniqueness of these bytes makes them suitable as a file signature:

  • 0xFD is a rarely used byte in common file formats
  • The sequence appears to be deliberately chosen to avoid conflicts with other known file signatures
  • It's properly encapsulated in the ArchiveHeader structure with fixed size fields
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any existing magic number definitions or documentation
rg -i "magic.*number" --type cpp

Length of output: 8025


Script:

#!/bin/bash
# Search for files related to SingleFileArchive
rg -l "SingleFileArchive" --type cpp

# Get content of SingleFileArchiveDefs.hpp to understand the context
cat components/core/src/clp_s/SingleFileArchiveDefs.hpp

Length of output: 1392


Script:

#!/bin/bash
# Search for documentation about the file format
fd -e md -e txt . | xargs rg -i "SingleFileArchive|SFA|file.*format" -A 3

# Search for any tests that might provide context
rg -l "SingleFileArchive.*Test" --type cpp

Length of output: 475

components/core/src/clp_s/JsonParser.hpp (1)

38-38: Verify integration with archive header format.

As per PR feedback, ensure this option properly integrates with the regular multi-file archive format's header and metadata section.

Let's verify the integration:

✅ Verification successful

Let me gather more specific information about the single file archive implementation.


Single file archive format properly integrates with archive header and metadata

The implementation shows proper integration with archive headers and metadata:

  • The ArchiveHeader struct in SingleFileArchiveDefs.hpp defines a complete header format with magic number, version, sizes and compression info
  • ArchiveWriter::write_single_file_archive() handles the archive creation with proper sequencing:
    1. Writes metadata section with archive info, file info and timestamp dictionary
    2. Writes file contents
    3. Updates header with correct metadata section size and compressed size
  • The format maintains compatibility by:
    • Using standard magic numbers and version fields
    • Preserving all required metadata fields
    • Supporting both single and multi-file modes through the same writer interface
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check how single_file_archive option affects archive header handling

# Test 1: Search for archive header/metadata handling code
echo "Searching for archive header handling..."
rg -l "archive.*header|metadata" --type cpp

# Test 2: Check if ArchiveWriter considers this option
echo "Checking ArchiveWriter implementation..."
ast-grep --pattern 'class ArchiveWriter {
  $$$
  write_header($$$) {
    $$$
  }
  $$$
}'

Length of output: 5928


Script:

#!/bin/bash
# Check SingleFileArchiveDefs.hpp and ArchiveWriter implementation
echo "=== Checking SingleFileArchiveDefs.hpp ==="
cat components/core/src/clp_s/SingleFileArchiveDefs.hpp

echo -e "\n=== Checking ArchiveWriter.hpp ==="
cat components/core/src/clp_s/ArchiveWriter.hpp

echo -e "\n=== Checking ArchiveWriter.cpp ==="
cat components/core/src/clp_s/ArchiveWriter.cpp

echo -e "\n=== Checking archive_constants.hpp ==="
cat components/core/src/clp_s/archive_constants.hpp

Length of output: 27410

components/core/src/clp_s/clp-s.cpp (1)

98-98: Add validation for single-file archive option

The single-file archive option is added without validation. Based on the PR feedback, this needs to be properly integrated with the multi-file archive format.

Consider:

  1. Adding validation to ensure the archive header and metadata align with the regular multi-file archive format
  2. Implementing proper error handling for invalid configurations
components/core/src/clp_s/JsonParser.cpp (2)

34-34: LGTM: Single-file archive option properly integrated

The initialization of single_file_archive follows the established pattern and is placed logically in the initialization sequence.


Line range hint 519-524: Verify split_archive behaviour with single-file mode

The split_archive method is called when reaching the target encoded size, but its behaviour with single-file archives needs verification. This could potentially conflict with the single-file archive requirement.

✅ Verification successful

No conflict between split_archive and single-file mode

The verification shows that split_archive is safely handled when single-file mode is enabled:

  • The single_file_archive flag is properly passed through the configuration chain from command-line arguments to the ArchiveWriter
  • When single_file_archive is true, the ArchiveWriter uses a different write path that creates a single consolidated archive file instead of multiple files
  • The size threshold check in JsonParser.cpp that triggers split_archive() is independent of the single-file mode and works correctly in both modes
  • When split_archive() is called in single-file mode, it simply closes and reopens the archive with a new ID, maintaining the single-file requirement
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if split_archive has special handling for single-file mode
# Expected: Conditional logic in split_archive or ArchiveWriter to handle single-file mode

# Search for handling of single_file_archive in split_archive related code
rg -A 5 "split_archive.*single_file" || rg -A 5 "single_file.*split_archive"

# Search for size threshold handling in ArchiveWriter
ast-grep --pattern 'if ($target_encoded_size || $single_file) {
  $$$
}'

Length of output: 152


Script:

#!/bin/bash
# Let's try a broader search for split_archive functionality and single file mode handling

# Search for split_archive method implementation
ast-grep --pattern 'split_archive($_) {
  $$$
}'

# Search for any mentions of split_archive
rg -A 5 "split_archive"

# Search for single file mode handling
rg -A 5 "single_file"

# Look for size threshold handling
rg -A 5 "target_encoded_size|encoded_size"

Length of output: 24597

components/core/src/clp_s/CommandLineArguments.cpp (1)

193-196: Verify interaction with existing compression options.

The new single-file archive option might have implications when used with other compression options, particularly target-encoded-size and print-archive-stats. Consider documenting these interactions in the help message and ensuring proper validation is in place.

components/core/src/clp_s/TimestampDictionaryWriter.hpp (3)

4-4: Necessary include directive added

The #include <map> directive is appropriately added to support the use of std::map within the class.


81-84: Method clear correctly introduced

The clear method effectively provides a way to reset the internal state, enhancing reusability and memory management.


31-31: Ensure all references to removed methods are updated

The method write_and_flush_to_disk has been replaced with write. Please verify that all usages of write_and_flush_to_disk in the codebase have been updated accordingly.

Run the following script to find any remaining references to write_and_flush_to_disk:

✅ Verification successful

No references to the old method remain in the codebase

The search results confirm that there are no remaining references to write_and_flush_to_disk in either header files (.hpp, .h) or source files (.cpp, .cc). The new write method is properly defined in TimestampDictionaryWriter.hpp. All necessary updates appear to have been completed successfully.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for any references to the old method `write_and_flush_to_disk` in the codebase.

# Test: Search for `write_and_flush_to_disk`. Expect: No matches if all references have been updated.
rg 'write_and_flush_to_disk'

Length of output: 28


Script:

#!/bin/bash
# Let's search for both the old and new method names to understand the context better
# and verify if the changes were properly made

# Search for the old method in header files
echo "=== Searching for write_and_flush_to_disk in header files ==="
fd -e hpp -e h --exec rg -l "write_and_flush_to_disk" {}

# Search for the old method in source files
echo "=== Searching for write_and_flush_to_disk in source files ==="
fd -e cpp -e cc --exec rg -l "write_and_flush_to_disk" {}

# Search for the new method name to verify its usage
echo "=== Searching for write method in TimestampDictionaryWriter context ==="
rg -B 2 -A 2 "void\s+write\s*\(\s*ZstdCompressor\s*&" 

Length of output: 998

components/core/src/clp_s/TimestampDictionaryWriter.cpp (1)

17-28: The write() method implementation is correct and efficient

The write() method appropriately merges ranges and writes timestamp entries and patterns to the compressor using the passed ZstdCompressor instance. This change enhances flexibility by decoupling the compressor from the class, allowing for better resource management and reusability.

components/core/src/clp_s/ArchiveWriter.hpp (7)

16-16: Inclusion of 'SingleFileArchiveDefs.hpp' is appropriate

The inclusion of 'SingleFileArchiveDefs.hpp' is necessary for the single-file archive functionality.


25-25: Addition of 'single_file_archive' option aligns with new feature

Adding 'single_file_archive' to 'ArchiveWriterOption' supports the single-file archive capability.


124-124: Simplified usage of 'm_timestamp_dict' improves code clarity

Changing 'm_timestamp_dict' to a direct instance enhances memory management and code readability.


134-134: Consistent usage of 'm_timestamp_dict' across methods

Using 'm_timestamp_dict' directly ensures consistent and efficient operation in 'ingest_timestamp_entry' methods.


138-138: Optimized 'ingest_timestamp_entry' method

Directly calling 'm_timestamp_dict.ingest_entry' simplifies the codebase.


169-169: Appropriate use of [[nodiscard]] attribute on 'store_tables()'

Using [[nodiscard]] ensures that the returned size information from 'store_tables()' is not inadvertently ignored.


236-236: Confirm default initialization of 'm_min_table_size'

'm_min_table_size' is default-initialized to zero. Verify that this is intentional and that zero is an appropriate default value in this context.

components/core/src/clp_s/ArchiveWriter.cpp (9)

16-16: Assignment of m_single_file_archive is appropriate.

The member variable m_single_file_archive is correctly initialized from the options provided.


91-91: Clearing m_timestamp_dict appropriately.

The reset of m_timestamp_dict ensures that the timestamp dictionary does not retain data between archives.


97-108: Efficient handling of the timestamp dictionary.

The write_timestamp_dict() function properly writes the timestamp dictionary to a file with compression.


110-122: Implementation of single-file archive writing is sound.

The write_single_file_archive() method correctly orchestrates the writing of the archive metadata and files.


145-145: Resetting msgpack_buffer correctly.

Assigning a new std::stringstream{} to msgpack_buffer effectively clears its content, preventing accumulation of data.

This addresses the previously noted issue about clearing std::stringstream.


173-173: Proper handling of end-of-file condition.

Breaking the loop upon receiving ErrorCodeEndOfFile ensures correct termination of the read operation.


185-199: Magic number initialization and header writing are correct.

The archive header is appropriately populated, and the magic number is correctly set using std::memcpy.

🧰 Tools
🪛 cppcheck

[error] 191-191: Exception thrown in function declared not to throw exceptions.

(throwInNoexceptFunction)


Line range hint 261-377: Returning multiple compressed sizes from store_tables().

Changing the return type to std::pair<size_t, size_t> allows both table metadata and table sizes to be returned, enhancing functionality.


390-391: Accurate time range expansion in metadata.

The use of m_timestamp_dict.get_begin_timestamp() and get_end_timestamp() accurately reflects the archive's time range in the metadata.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Outside diff range and nitpick comments (8)
components/core/src/clp_s/TimestampDictionaryWriter.cpp (2)

17-29: Consider adding error handling and state validation

The write method could benefit from additional safeguards:

  1. Validate the compressor state before writing
  2. Add error handling for compression operations
  3. Consider adding documentation about the expected format being written
 void TimestampDictionaryWriter::write(ZstdCompressor& compressor) {
+    if (!compressor.is_open()) {
+        throw OperationFailed(ErrorCodeFailure, __FILE__, __LINE__);
+    }
+
     merge_range();
     write_timestamp_entries(m_column_key_to_range, compressor);

     compressor.write_numeric_value<uint64_t>(m_pattern_to_id.size());
     for (auto& it : m_pattern_to_id) {
         compressor.write_numeric_value<uint64_t>(it.second);

         std::string const& pattern = it.first->get_format();
         compressor.write_numeric_value<uint64_t>(pattern.length());
         compressor.write_string(pattern);
     }
 }

160-172: Consider performance optimizations for size calculation

The current implementation of size_in_bytes has potential performance implications:

  1. Calling merge_range() on every size calculation could be expensive for large datasets
  2. The size calculation could be cached and invalidated only when the dictionary is modified

Consider implementing a caching mechanism:

 class TimestampDictionaryWriter {
 private:
+    mutable size_t m_cached_size{0};
+    mutable bool m_size_dirty{true};
+
+    void mark_size_dirty() {
+        m_size_dirty = true;
+    }
+
 public:
     size_t size_in_bytes() {
+        if (!m_size_dirty) {
+            return m_cached_size;
+        }
+
         merge_range();
         size_t size{2 * sizeof(uint64_t)};
         // ... rest of the calculation ...
+        m_cached_size = size;
+        m_size_dirty = false;
         return size;
     }
 };
components/core/src/clp_s/DictionaryReader.hpp (1)

93-95: Remove commented-out code.

Instead of keeping commented-out class declarations, consider removing them entirely. If needed, the history can be preserved in version control.

Apply this diff:

-// class VariableDictionaryReader : public DictionaryReader<uint64_t, VariableDictionaryEntry> {};
-
-// class LogTypeDictionaryReader : public DictionaryReader<uint64_t, LogTypeDictionaryEntry> {};
components/core/src/clp_s/ArchiveWriter.cpp (2)

63-68: Consider using std::accumulate for offset calculation

The current offset calculation using a loop could be more elegantly expressed using STL algorithms.

Consider this alternative implementation:

-        uint64_t offset = 0;
-        for (auto& file : files) {
-            uint64_t original_size = file.o;
-            file.o = offset;
-            offset += original_size;
-        }
+        uint64_t offset = 0;
+        std::transform(files.begin(), files.end(), files.begin(),
+            [&offset](ArchiveFileInfo file) {
+                uint64_t original_size = file.o;
+                file.o = offset;
+                offset += original_size;
+                return file;
+            });

391-392: Consider validating timestamp range

While the direct member access is correct, consider validating the timestamp range before expanding.

Add validation:

+    auto begin_ts = m_timestamp_dict.get_begin_timestamp();
+    auto end_ts = m_timestamp_dict.get_end_timestamp();
+    if (begin_ts > end_ts) {
+        throw OperationFailed(ErrorCodeUnsupported, __FILENAME__, __LINE__);
+    }
     metadata.expand_time_range(
-            m_timestamp_dict.get_begin_timestamp(),
-            m_timestamp_dict.get_end_timestamp()
+            begin_ts,
+            end_ts
     );
components/core/src/clp_s/TimestampDictionaryWriter.hpp (3)

81-84: Consider marking clear() as noexcept

Since clear() is intended to reset internal state without throwing exceptions, consider marking it as noexcept to convey this guarantee.


93-93: Rename method to better reflect its functionality

The method merge_range() suggests merging a single range, but it merges multiple ranges according to the description. Consider renaming it to merge_ranges() for clarity and consistency.


98-98: Fix typo in documentation comment

There's a grammatical error in the comment: "Writes timestamp entries to the a compression stream." Remove the extra "the".

Apply this diff to fix the typo:

- * Writes timestamp entries to the a compression stream.
+ * Writes timestamp entries to a compression stream.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 5b67e39 and eaa0982.

📒 Files selected for processing (5)
  • components/core/src/clp_s/ArchiveWriter.cpp (6 hunks)
  • components/core/src/clp_s/DictionaryReader.hpp (1 hunks)
  • components/core/src/clp_s/TimestampDictionaryWriter.cpp (2 hunks)
  • components/core/src/clp_s/TimestampDictionaryWriter.hpp (3 hunks)
  • components/core/src/clp_s/TimestampEntry.hpp (1 hunks)
🧰 Additional context used
🪛 cppcheck
components/core/src/clp_s/ArchiveWriter.cpp

[error] 191-191: Exception thrown in function declared not to throw exceptions.

(throwInNoexceptFunction)

🔇 Additional comments (6)
components/core/src/clp_s/TimestampDictionaryWriter.cpp (1)

153-158: Previous review comment about m_next_id remains unaddressed

components/core/src/clp_s/DictionaryReader.hpp (1)

90-91: Consider the architectural implications of replacing inheritance with type aliases.

The shift from inheritance to type aliases is a reasonable architectural choice that reduces code duplication and improves maintainability. However, this change might impact extensibility and could potentially break existing code that relies on inheritance.

Let's verify the impact of this change:

✅ Verification successful

Type alias replacement is safe and maintains existing usage patterns

The verification results show that both VariableDictionaryReader and LogTypeDictionaryReader are used consistently as const references or pointers throughout the codebase, with no evidence of inheritance-specific features or specialized implementations. The types are primarily used for dictionary operations in archive reading, variable interpretation, and search functionality.

Key findings:

  • No derived classes or specialized implementations found
  • All usages are compatible with the type alias approach
  • The commented-out class declarations confirm they were simple inheritance without additional members
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any specialized implementations or derived classes that might be affected

# Test 1: Look for any references to the old class declarations
rg "class (Variable|LogType)DictionaryReader\s*:" -A 5

# Test 2: Check for any potential virtual method overrides
ast-grep --pattern 'class $_ : public DictionaryReader<$_, $_> { $$$ }'

# Test 3: Search for direct usage of these types
rg "(Variable|LogType)DictionaryReader\s+\w+" -A 2

Length of output: 13394

components/core/src/clp_s/ArchiveWriter.cpp (2)

16-16: LGTM: Single-file archive flag initialization

The initialization of m_single_file_archive from options is properly placed and maintains the existing initialization pattern.


Line range hint 262-378: LGTM: Well-documented table storage implementation

The changes to return both metadata and table sizes are well implemented and thoroughly documented. The schema packing strategy is clearly explained in the comments.

components/core/src/clp_s/TimestampDictionaryWriter.hpp (2)

26-26: Default constructor is appropriate

The addition of a default constructor initializes the class properly.


29-31: Addition of write method enhances functionality

The new write method correctly accepts a ZstdCompressor reference to write the timestamp dictionary to a compression stream.

Comment on lines 122 to 126
size_t size_in_bytes() const {
return sizeof(uint64_t) + m_key_name.size() + sizeof(uint64_t)
+ m_column_ids.size() * sizeof(int32_t) + sizeof(TimestampEncoding)
+ 2 * sizeof(epochtime_t);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Fix incomplete size calculation in size_in_bytes()

The current implementation doesn't account for all member variables. Specifically, it's missing:

  • m_epoch_start_double
  • m_epoch_end_double

This could lead to incorrect size calculations when managing archive metadata.

Apply this diff to include all fields:

     size_t size_in_bytes() const {
         return sizeof(uint64_t) + m_key_name.size() + sizeof(uint64_t)
                + m_column_ids.size() * sizeof(int32_t) + sizeof(TimestampEncoding)
-               + 2 * sizeof(epochtime_t);
+               + 2 * sizeof(epochtime_t) + 2 * sizeof(double);
     }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
size_t size_in_bytes() const {
return sizeof(uint64_t) + m_key_name.size() + sizeof(uint64_t)
+ m_column_ids.size() * sizeof(int32_t) + sizeof(TimestampEncoding)
+ 2 * sizeof(epochtime_t);
}
size_t size_in_bytes() const {
return sizeof(uint64_t) + m_key_name.size() + sizeof(uint64_t)
+ m_column_ids.size() * sizeof(int32_t) + sizeof(TimestampEncoding)
+ 2 * sizeof(epochtime_t) + 2 * sizeof(double);
}

💡 Codebase verification

Size calculation needs adjustment for double epoch encoding

The size_in_bytes() method doesn't account for the correct size when m_encoding is DoubleEpoch. The current implementation always uses sizeof(epochtime_t) for the range values, but the write_to_file() method writes double values when m_encoding is DoubleEpoch. This mismatch could lead to incorrect size calculations.

The size calculation should be:

return sizeof(uint64_t) + m_key_name.size() + sizeof(uint64_t)
       + m_column_ids.size() * sizeof(int32_t) + sizeof(TimestampEncoding)
       + (m_encoding == DoubleEpoch ? 2 * sizeof(double) : 2 * sizeof(epochtime_t));
🔗 Analysis chain

Verify size calculation matches serialization format

The size calculation should align with the actual bytes written in the write_to_file method to ensure consistent archive handling.


Let me gather more specific information about the serialization format.


Let me check the header file to understand the member variables and their types.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that size_in_bytes matches the serialization format in write_to_file

# Search for write_to_file implementation to compare with size calculation
ast-grep --pattern 'void TimestampEntry::write_to_file($$$) {
  $$$
}'

# Also check for any related serialization utilities
rg -l "write_to_file|serialize" --type cpp

Length of output: 3558


Script:

#!/bin/bash
# Search for TimestampEntry's write_to_file implementation
rg -A 20 "void TimestampEntry::write_to_file" components/core/src/clp_s/TimestampEntry.cpp

# Also check the constructor to understand the member variables
rg -A 10 "TimestampEntry::TimestampEntry" components/core/src/clp_s/TimestampEntry.cpp

Length of output: 1054


Script:

#!/bin/bash
# Check the TimestampEntry header file for member variables
cat components/core/src/clp_s/TimestampEntry.hpp

Length of output: 4484

Comment on lines +140 to +141
compressor.write_numeric_value(static_cast<uint32_t>(archive_info_str.size()));
compressor.write_string(archive_info_str);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Add size validation for metadata packets

The packet size is written as uint32_t without validation, which could lead to issues with extremely large packets.

Add size validation:

+    constexpr uint32_t max_packet_size = 1024 * 1024 * 10; // 10MB limit
+    if (archive_info_str.size() > max_packet_size) {
+        throw OperationFailed(ErrorCodeUnsupported, __FILENAME__, __LINE__);
+    }
     compressor.write_numeric_value(static_cast<uint32_t>(archive_info_str.size()));
     compressor.write_string(archive_info_str);

Also applies to: 149-150


void ArchiveWriter::write_archive_header(FileWriter& archive_writer, size_t metadata_section_size) {
ArchiveHeader header{
.magic_number{0},
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Initialize magic number properly

The magic number should not be initialized as 0 and then overwritten later.

Initialize it directly in the struct:

-            .magic_number{0},
+            .magic_number = *reinterpret_cast<const uint32_t*>(cStructuredSFAMagicNumber),
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
.magic_number{0},
.magic_number = *reinterpret_cast<const uint32_t*>(cStructuredSFAMagicNumber),

components/core/src/clp_s/ArchiveWriter.cpp Outdated Show resolved Hide resolved
Comment on lines 86 to 89
/**
* Merge ranges by key name then return the size of data to be compressed in bytes
*/
size_t size_in_bytes();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Clarify method responsibilities in size_in_bytes()

The method size_in_bytes() merges ranges and returns the size of the data. This dual responsibility may be confusing. Consider separating the merging operation from the size calculation or renaming the method to better reflect its behaviour.

@gibber9809 gibber9809 changed the title clp-s: Add the write path for single-file archive feat(clp-s): Add the write path for single-file archives. Nov 19, 2024
Copy link
Contributor Author

@wraymo wraymo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A relatively complete PR. Thanks for fixing several issues!

* Writes the timestamp dictionary to disk
*/
void write_and_flush_to_disk();
void write(ZstdCompressor& compressor);
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add the parameter in the docstring?

) {
for (auto const& file : files) {
std::string file_path = m_archive_path + file.n;
FileReader reader;
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we declare it outside the loop?

components/core/src/clp_s/TimestampDictionaryWriter.hpp Outdated Show resolved Hide resolved
void clear();

/**
* Merge ranges by key name then return the size of data to be compressed in bytes
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure if we should do two different things in one function. Additionally, directly adding the sizes of different types may not be intuitive to readers.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, the unfortunate thing is that because of how this class is written you always need to merge ranges before you can determine the size of all of the structures that get stored.

One thing we could do is get rid of this function and instead change the write function to write to a buffer. That way we can just look at the final size of the buffer to get the pre-zstd size. The timestamp dictionary shouldn't be more than a few hundred bytes, so writing it to a buffer should be fine.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
components/core/src/clp_s/TimestampDictionaryWriter.hpp (1)

26-26: Prefer using = default for the default constructor

For clarity and adherence to modern C++ best practices, consider defining the default constructor with = default;.

Apply this diff to update the constructor:

-        TimestampDictionaryWriter() {}
+        TimestampDictionaryWriter() = default;
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between b46d4c1 and bec587b.

📒 Files selected for processing (1)
  • components/core/src/clp_s/TimestampDictionaryWriter.hpp (3 hunks)
🔇 Additional comments (1)
components/core/src/clp_s/TimestampDictionaryWriter.hpp (1)

81-84: clear() method is well-defined

The clear() method effectively resets the internal state, and the documentation clearly describes its purpose.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
components/core/src/clp_s/ArchiveWriter.cpp (3)

63-68: Consider optimizing the offset calculation.

The current implementation creates a temporary variable and performs unnecessary assignments. Consider using std::accumulate or a simpler loop.

-        uint64_t offset = 0;
-        for (auto& file : files) {
-            uint64_t original_size = file.o;
-            file.o = offset;
-            offset += original_size;
-        }
+        uint64_t offset = 0;
+        for (auto& file : files) {
+            std::swap(file.o, offset);
+            offset += file.o;
+        }

74-77: Consider simplifying the compressed size calculation.

The current implementation adds multiple terms individually. Consider using std::accumulate for better readability.

-        m_compressed_size = var_dict_compressed_size + log_dict_compressed_size
-                            + array_dict_compressed_size + timestamp_dict_compressed_size
-                            + schema_tree_compressed_size + schema_map_compressed_size
-                            + table_metadata_compressed_size + table_compressed_size;
+        std::vector<size_t> sizes = {
+            var_dict_compressed_size, log_dict_compressed_size,
+            array_dict_compressed_size, timestamp_dict_compressed_size,
+            schema_tree_compressed_size, schema_map_compressed_size,
+            table_metadata_compressed_size, table_compressed_size
+        };
+        m_compressed_size = std::accumulate(sizes.begin(), sizes.end(), 0UL);

164-182: Consider adding progress tracking for large files.

When copying large files, it would be helpful to track and report progress, especially for monitoring long-running operations.

+    size_t total_bytes = 0;
+    for (auto const& file : files) {
+        total_bytes += file.o;
+    }
+    size_t bytes_copied = 0;
     for (auto const& file : files) {
         std::string file_path = m_archive_path + file.n;
         reader.open(file_path);
         char read_buffer[cReadBlockSize];
         while (true) {
             size_t num_bytes_read{0};
             ErrorCode const error_code
                     = reader.try_read(read_buffer, cReadBlockSize, num_bytes_read);
             if (ErrorCodeEndOfFile == error_code) {
                 break;
             } else if (ErrorCodeSuccess != error_code) {
                 throw OperationFailed(error_code, __FILENAME__, __LINE__);
             }
             archive_writer.write(read_buffer, num_bytes_read);
+            bytes_copied += num_bytes_read;
+            if (auto progress = (bytes_copied * 100) / total_bytes; progress % 10 == 0) {
+                SPDLOG_INFO("Archive copy progress: {}%", progress);
+            }
         }
         reader.close();
         boost::filesystem::remove(file_path);
     }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between bec587b and 816526e.

📒 Files selected for processing (2)
  • components/core/src/clp_s/ArchiveWriter.cpp (6 hunks)
  • components/core/src/clp_s/TimestampDictionaryWriter.hpp (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/core/src/clp_s/TimestampDictionaryWriter.hpp
🧰 Additional context used
🪛 cppcheck (2.10-2)
components/core/src/clp_s/ArchiveWriter.cpp

[error] 191-191: Exception thrown in function declared not to throw exceptions.

(throwInNoexceptFunction)

🔇 Additional comments (3)
components/core/src/clp_s/ArchiveWriter.cpp (3)

16-16: LGTM: Clean initialization of single-file archive flag.

The member variable is properly initialized from the options struct.


372-378: LGTM: Clean implementation of returning compressed sizes.

The method now correctly returns both the table metadata and table compressed sizes as a pair.


391-392: LGTM: Clean timestamp dictionary access.

The method now correctly accesses the timestamp dictionary member variable directly.

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (12)
components/core/src/clp_s/TimestampDictionaryWriter.hpp (4)

26-26: Add class documentation.

Consider adding a brief class-level documentation comment explaining the purpose and responsibilities of TimestampDictionaryWriter.

+/**
+ * Manages timestamp dictionary entries and patterns for CLP archives.
+ * Provides functionality to ingest, merge, and write timestamp entries
+ * to a buffered stream.
+ */
 TimestampDictionaryWriter() {}

29-32: Enhance method documentation.

The current documentation could be more descriptive about the expected format and behaviour.

 /**
- * Writes the timestamp dictionary to a buffered stream.
+ * Writes the timestamp dictionary to a buffered stream in the CLP archive format.
+ * @param stream The output stream where the dictionary will be written
+ * @throws OperationFailed if writing to the stream fails
  */

82-85: Specify which internal state is cleared.

The documentation should detail which member variables are reset.

 /**
- * Clears and resets all internal state.
+ * Clears and resets all internal state including:
+ * - Pattern to ID mappings
+ * - Next ID counter
+ * - Column key to range mappings
+ * - Column ID to range mappings
  */

89-100: Review method design and documentation.

  1. The merge_range method documentation should explain the merging strategy and its effects.
  2. Consider whether write_timestamp_entries should be static, as it seems to be an implementation detail.
 /**
- * Merges timestamp ranges with the same key name but different node ids.
+ * Merges timestamp ranges with the same key name but different node ids
+ * into a single range spanning all timestamps for that key. This
+ * consolidation is necessary before writing to ensure correct range
+ * representation in the archive.
  */

 /**
- * Writes timestamp entries to a buffered stream.
+ * Writes the consolidated timestamp entries to a buffered stream in
+ * the CLP archive format.
  * @param ranges The consolidated timestamp ranges to write
- * @param compressor
+ * @param stream The output stream to write to
+ * @throws OperationFailed if writing to the stream fails
  */
- static void write_timestamp_entries(
+ private:
+  void write_timestamp_entries(
components/core/src/clp_s/TimestampDictionaryWriter.cpp (1)

26-38: Add error handling for stream operations

The write method performs several stream operations without checking for failures. Consider adding error handling to ensure data integrity.

Consider adding checks after each write operation:

 void TimestampDictionaryWriter::write(std::stringstream& stream) {
     merge_range();
     write_timestamp_entries(m_column_key_to_range, stream);
+    if (!stream.good()) {
+        throw std::runtime_error("Failed to write timestamp entries");
+    }

     write_numeric_value<uint64_t>(stream, m_pattern_to_id.size());
     for (auto& it : m_pattern_to_id) {
         write_numeric_value<uint64_t>(stream, it.second);
         std::string const& pattern = it.first->get_format();
         write_numeric_value<uint64_t>(stream, pattern.length());
         stream.write(pattern.data(), pattern.size());
+        if (!stream.good()) {
+            throw std::runtime_error("Failed to write pattern data");
+        }
     }
 }
components/core/src/clp_s/ArchiveWriter.hpp (2)

148-157: Track TODO for complete size calculations

The documentation clearly distinguishes between different size measurements, but the TODO indicates that the size calculation is incomplete:

  • Schema tree size is missing
  • Schema map size is missing
  • Timestamp dictionary size is missing

These missing components should be included for accurate size reporting.

Would you like me to help create a tracking issue for implementing the missing size calculations?


215-222: Track technical debt: Timestamp dictionary location

The comment indicates that the timestamp dictionary will be moved to the metadata region in a follow-up PR. This should be tracked to ensure it's not overlooked.

Would you like me to create a tracking issue for moving the timestamp dictionary to the metadata region?

components/core/src/clp_s/TimestampEntry.cpp (2)

7-10: Consider adding endianness handling and error checking to write_numeric_value

The current implementation has potential cross-platform compatibility issues:

  1. No endianness handling could cause problems when reading archives across different platforms
  2. No error handling for stream write failures

Consider:

  1. Adding platform-independent serialization
  2. Adding error checking for stream operations
 template <typename T>
-void write_numeric_value(std::stringstream& stream, T value) {
+ErrorCode write_numeric_value(std::stringstream& stream, T value) {
+    // Convert to network byte order (big-endian)
+    T network_value = host_to_network(value);
     stream.write(reinterpret_cast<char*>(&value), sizeof(value));
+    if (stream.fail()) {
+        return ErrorCodeFailedToWrite;
+    }
+    return ErrorCodeSuccess;
 }

Line range hint 266-268: Complete or remove the incomplete TODO comment

The TODO comment appears to be truncated and doesn't provide clear guidance:

/**
 * TODO: this borrows logic from the double_as_int function
 * should
 */

Please either complete the TODO comment with clear instructions or remove it if no longer needed.

components/core/src/clp_s/ArchiveWriter.cpp (2)

104-118: Add buffer management for timestamp dictionary serialization.

The implementation should manage the string buffer's memory more efficiently for large dictionaries.

Consider using a fixed-size buffer with chunked writing:

-    std::stringstream timestamp_dict_stream;
-    m_timestamp_dict.write(timestamp_dict_stream);
-    std::string encoded_timestamp_dict = timestamp_dict_stream.str();
-    timestamp_dict_compressor.write(encoded_timestamp_dict.data(), encoded_timestamp_dict.size());
+    constexpr size_t buffer_size = 8192;
+    char buffer[buffer_size];
+    size_t total_written = 0;
+    while (m_timestamp_dict.write_chunk(buffer, buffer_size, total_written)) {
+        timestamp_dict_compressor.write(buffer, total_written);
+    }

206-216: Initialize magic number directly.

The magic number should be initialized directly rather than being set to 0 and then overwritten.

-            .magic_number{0},
+            .magic_number = *reinterpret_cast<const uint32_t*>(cStructuredSFAMagicNumber),
-    std::memcpy(&header.magic_number, cStructuredSFAMagicNumber, sizeof(header.magic_number));
components/core/src/clp_s/CommandLineArguments.cpp (1)

193-196: Enhance the description of the single-file-archive option.

The description could be more informative about the implications and benefits of using this option. Consider adding details about the SFA specification and any performance considerations.

     )(
             "single-file-archive",
             po::bool_switch(&m_single_file_archive),
-            "Create a single archive file instead of multiple files."
+            "Create a single archive file instead of multiple files, following the SFA specification. "
+            "This format combines all components into a single archive, which may be beneficial for "
+            "distribution and storage management."
     )(
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 816526e and 62c43bd.

📒 Files selected for processing (12)
  • components/core/src/clp_s/ArchiveWriter.cpp (7 hunks)
  • components/core/src/clp_s/ArchiveWriter.hpp (6 hunks)
  • components/core/src/clp_s/CommandLineArguments.cpp (1 hunks)
  • components/core/src/clp_s/CommandLineArguments.hpp (2 hunks)
  • components/core/src/clp_s/JsonParser.cpp (1 hunks)
  • components/core/src/clp_s/JsonParser.hpp (1 hunks)
  • components/core/src/clp_s/TimestampDictionaryWriter.cpp (2 hunks)
  • components/core/src/clp_s/TimestampDictionaryWriter.hpp (3 hunks)
  • components/core/src/clp_s/TimestampEntry.cpp (2 hunks)
  • components/core/src/clp_s/TimestampEntry.hpp (2 hunks)
  • components/core/src/clp_s/archive_constants.hpp (1 hunks)
  • components/core/src/clp_s/clp-s.cpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (6)
  • components/core/src/clp_s/CommandLineArguments.hpp
  • components/core/src/clp_s/JsonParser.cpp
  • components/core/src/clp_s/JsonParser.hpp
  • components/core/src/clp_s/TimestampEntry.hpp
  • components/core/src/clp_s/archive_constants.hpp
  • components/core/src/clp_s/clp-s.cpp
🧰 Additional context used
🪛 cppcheck (2.10-2)
components/core/src/clp_s/ArchiveWriter.cpp

[error] 191-191: Exception thrown in function declared not to throw exceptions.

(throwInNoexceptFunction)

🔇 Additional comments (11)
components/core/src/clp_s/TimestampDictionaryWriter.hpp (1)

4-5: LGTM! Required includes added.

The new includes support the stringstream-based implementation and explicit map usage.

components/core/src/clp_s/TimestampDictionaryWriter.cpp (1)

162-167: Consider thread safety implications

The clear method modifies multiple member variables without synchronization. If this class is used in a multi-threaded context, consider adding appropriate synchronization mechanisms.

Let's check if this class is used in multi-threaded contexts:

components/core/src/clp_s/ArchiveWriter.hpp (4)

17-17: LGTM: Header and struct changes align with single-file archive support

The addition of SingleFileArchiveDefs.hpp and the single_file_archive flag in ArchiveWriterOption properly set up the foundation for single-file archive support.

Also applies to: 26-26


175-176: LGTM: Enhanced size reporting in store_tables

The change to return both metadata and table sizes provides better granularity for archive size tracking.


177-203: ⚠️ Potential issue

Address magic number requirement from PR feedback

The single-file archive methods are well-organized with clear separation of concerns. However, the PR feedback mentions the need to:

  1. Select a formal magic number
  2. Change it to 4 bytes

Consider adding a constant for the magic number in SingleFileArchiveDefs.hpp.


130-130: Verify special member functions after timestamp dictionary changes

The change from a pointer to a direct instance of TimestampDictionaryWriter improves memory safety. However, this change might affect the class's copy/move semantics.

Also applies to: 140-140, 144-144, 240-240

✅ Verification successful

No special member functions needed for timestamp dictionary changes

The change from pointer to direct instance is safe as both classes have appropriate member function declarations:

  • TimestampDictionaryWriter has a default constructor and implicitly-declared move/copy operations
  • ArchiveWriter has an explicit constructor and default destructor, with implicitly-declared move/copy operations
  • The member m_timestamp_dict is of a movable/copyable type

The compiler-generated special member functions will handle the timestamp dictionary member appropriately.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check if TimestampDictionaryWriter has proper copy/move semantics
# and if ArchiveWriter declares any special member functions

# Search for TimestampDictionaryWriter's special member functions
ast-grep --pattern 'class TimestampDictionaryWriter {
  $$$
  TimestampDictionaryWriter(const TimestampDictionaryWriter&) {
    $$$
  }
  $$$
}'

# Search for ArchiveWriter's special member functions
ast-grep --pattern 'class ArchiveWriter {
  $$$
  ArchiveWriter(const ArchiveWriter&) {
    $$$
  }
  $$$
}'

Length of output: 257


Script:

#!/bin/bash
# Let's try a broader search for special member functions and class definitions

# Search for TimestampDictionaryWriter class definition
ast-grep --pattern 'class TimestampDictionaryWriter {
  $$$
}'

# Search for ArchiveWriter class definition
ast-grep --pattern 'class ArchiveWriter {
  $$$
}'

# Also search for any copy/move declarations using ripgrep
rg -A 2 "TimestampDictionaryWriter\(.*TimestampDictionaryWriter.*\)"
rg -A 2 "ArchiveWriter\(.*ArchiveWriter.*\)"

# Search for deleted or defaulted special members
rg "= default|= delete" --type cpp

Length of output: 27629


Script:

#!/bin/bash
# Let's search for the specific classes and their member declarations

# Search for TimestampDictionaryWriter class
ast-grep --pattern 'class TimestampDictionaryWriter {
  $$$
}'

# Search for ArchiveWriter class and its member variables
ast-grep --pattern 'class ArchiveWriter {
  $$$
  TimestampDictionaryWriter m_timestamp_dict;
  $$$
}'

# Search for any move/copy operations in these files
rg -l "TimestampDictionaryWriter|ArchiveWriter" | xargs rg "TimestampDictionaryWriter\(.*\)|ArchiveWriter\(.*\)"

Length of output: 579

components/core/src/clp_s/ArchiveWriter.cpp (4)

18-25: LGTM: Clean initialization of single-file archive support.

The implementation properly handles the temporary directory creation for single-file archives by appending the temporary postfix.


Line range hint 282-398: LGTM: Well-documented table storage implementation.

The store_tables method is well-documented and properly handles the compression of schema tables and their metadata.


411-412: LGTM: Clean timestamp dictionary access.

The update to metadata handling correctly uses direct access to the timestamp dictionary.


154-155: 🛠️ Refactor suggestion

Add size validation for metadata packets.

The packet size is written without validation, which could lead to issues with extremely large packets.

+    constexpr uint32_t max_packet_size = 1024 * 1024 * 10; // 10MB limit
+    if (archive_info_str.size() > max_packet_size) {
+        throw OperationFailed(ErrorCodeUnsupported, __FILENAME__, __LINE__);
+    }
     compressor.write_numeric_value(static_cast<uint32_t>(archive_info_str.size()));
     compressor.write_string(archive_info_str);

Likely invalid or redundant comment.

components/core/src/clp_s/CommandLineArguments.cpp (1)

193-196: Consider adding validation for incompatible options.

Since this is a new archive format, consider adding validation to check for any incompatible compression options or configurations that might not work with single-file archives.

Consider implementing validation in a separate method like validate_compression_options() to maintain clean separation of concerns.

Comment on lines 6 to 11
namespace {
template <typename T>
void write_numeric_value(std::stringstream& stream, T value) {
stream.write(reinterpret_cast<char*>(&value), sizeof(value));
}
} // namespace
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to get rid of the duplicate code?

Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
components/core/src/clp_s/TimestampDictionaryWriter.cpp (1)

19-31: Consider adding stream validation and string writing helper

The write method could benefit from:

  1. Stream state validation after writes
  2. A helper method for writing strings to encapsulate length+content writing

Consider this improvement:

+void write_string(std::stringstream& stream, const std::string& str) {
+    write_numeric_value<uint64_t>(stream, str.length());
+    stream.write(str.data(), str.size());
+    if (!stream.good()) {
+        throw std::runtime_error("Failed to write string");
+    }
+}

 void TimestampDictionaryWriter::write(std::stringstream& stream) {
     merge_range();
     write_timestamp_entries(m_column_key_to_range, stream);
     write_numeric_value<uint64_t>(stream, m_pattern_to_id.size());
     for (auto& it : m_pattern_to_id) {
         write_numeric_value<uint64_t>(stream, it.second);
-        std::string const& pattern = it.first->get_format();
-        write_numeric_value<uint64_t>(stream, pattern.length());
-        stream.write(pattern.data(), pattern.size());
+        write_string(stream, it.first->get_format());
     }
 }
components/core/src/clp_s/Utils.hpp (1)

257-265: Add unit tests for numeric serialization.

To ensure the reliability of the new write_numeric_value function, we should add comprehensive unit tests covering:

  • Different numeric types (integers, floating-point)
  • Different sizes (8-bit to 64-bit)
  • Edge cases (min/max values)
  • Endianness handling
  • Error cases

Would you like me to help create these unit tests?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 62c43bd and 1e2be7e.

📒 Files selected for processing (3)
  • components/core/src/clp_s/TimestampDictionaryWriter.cpp (2 hunks)
  • components/core/src/clp_s/TimestampEntry.cpp (2 hunks)
  • components/core/src/clp_s/Utils.hpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • components/core/src/clp_s/TimestampEntry.cpp
🔇 Additional comments (4)
components/core/src/clp_s/TimestampDictionaryWriter.cpp (3)

3-4: LGTM: Include addition is appropriate

The addition of the header is necessary for the new stream-based writing approach and follows proper include ordering conventions.


155-160: **** The previous review comment about potential pattern ID duplication in the clear() method is still valid and needs to be addressed.


8-16: Verify endianness handling in write_numeric_value

The method now uses write_numeric_value for size serialization. We should ensure proper endianness handling for cross-platform compatibility.

✅ Verification successful

From the search results, I can see that:

  1. The write_numeric_value implementation in components/core/src/clp_s/Utils.hpp directly writes the raw bytes:
void write_numeric_value(std::stringstream& stream, ValueType value) {
    stream.write(reinterpret_cast<char*>(&value), sizeof(value));
}
  1. The codebase explicitly checks for and rejects big-endian machines in CMakeLists.txt:
TEST_BIG_ENDIAN(IS_BIG_ENDIAN)
if (IS_BIG_ENDIAN)
    message(FATAL_ERROR "Big-endian machines are not supported")
  1. Where endianness conversion is needed (in FFI/IR stream), the code explicitly handles it using bswap functions.

Endianness handling is correctly managed

The codebase explicitly enforces little-endian architecture at build time and handles endianness conversion where needed. The write_numeric_value implementation is correct as it preserves the native byte order, which is guaranteed to be little-endian by the build system.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for endianness handling in write_numeric_value implementation
ast-grep --pattern 'template <typename T> void write_numeric_value(std::stringstream& stream, T value) { $$$ }'

Length of output: 113


Script:

#!/bin/bash
# Let's try a broader search for write_numeric_value implementation
rg "write_numeric_value" -A 5

# Also search for any endianness-related code
rg -i "endian|hton[ls]|ntoh[ls]"

Length of output: 29982

components/core/src/clp_s/Utils.hpp (1)

257-265: Verify usage across the codebase.

Let's verify how this function is being used in the codebase to ensure consistent implementation.

✅ Verification successful

The function write_numeric_value is consistently used across the codebase

The verification shows that write_numeric_value is used consistently throughout the codebase for writing numeric values to streams, particularly in dictionary-related operations. All usages follow the same pattern of writing fixed-size numeric types to streams. The implementation matches its usage pattern across different components (clp, clp_s, glt), and there are no inconsistent implementations found.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Find all usages of write_numeric_value to verify implementation consistency

# Search for direct calls to write_numeric_value
echo "Direct calls to write_numeric_value:"
rg "write_numeric_value\s*<" --type cpp

# Search for stringstream usage patterns that might need to use write_numeric_value
echo "\nPotential places where write_numeric_value could be used:"
rg "stream\.write.*reinterpret_cast.*sizeof" --type cpp

Length of output: 3752

components/core/src/clp_s/Utils.hpp Outdated Show resolved Hide resolved
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 1e2be7e and 4c7a50f.

📒 Files selected for processing (1)
  • components/core/src/clp_s/Utils.hpp (1 hunks)

components/core/src/clp_s/Utils.hpp Show resolved Hide resolved
@gibber9809 gibber9809 self-requested a review November 27, 2024 19:19
Copy link
Contributor

@gibber9809 gibber9809 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@wraymo approves per sidebar.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants