-
Notifications
You must be signed in to change notification settings - Fork 74
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(core-clp)!: Migrate archive metadata file format to MessagePack. #700
Conversation
WalkthroughThe pull request introduces significant changes to the Changes
Possibly related PRs
Suggested reviewers
Tip 🌐 Web search-backed reviews and chat
📜 Recent review detailsConfiguration used: CodeRabbit UI 📒 Files selected for processing (2)
🚧 Files skipped from review as they are similar to previous changes (2)
⏰ Context from checks skipped due to timeout of 90000ms (12)
Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media? 🪧 TipsChatThere are 3 ways to chat with CodeRabbit:
Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments. CodeRabbit Commands (Invoked using PR comments)
Other keywords and placeholders
CodeRabbit Configuration File (
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (5)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (2)
5-5
: Remove unused header if not required.
<fmt/core.h>
does not appear to be used in this file. Consider removing it to keep dependencies minimal.-#include <fmt/core.h>
29-46
: Address potential large file memory usage.
Reading the entire metadata file into memory (std::vector<char> buf(file_stat.st_size)
) could pose memory concerns if the file is large. Consider chunk-based reading or additional checks (e.g., file size limit) to reduce risk.components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (2)
15-15
: Fixed compression type constant.
Using a static constexpr string_view for the "ZSTD" compression type is clear. If you anticipate multiple compression methods in the future, ensure expandability.
122-132
: MSGPACK_DEFINE_MAP ordering.
Keeping the newly added fields in a stable order can help maintain backward compatibility. Consider documenting any assumptions about field ordering in public references.components/core/src/clp/streaming_archive/reader/Archive.cpp (1)
41-41
: Explicit naming for metadata object.
Usingauto const metadata = ArchiveMetadata::create_from_file_reader(file_reader);
is readable. If multiple metadata objects are used in the future, a more specific variable name (e.g.,archive_metadata
) may be beneficial.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp
(3 hunks)components/core/src/clp/streaming_archive/ArchiveMetadata.hpp
(5 hunks)components/core/src/clp/streaming_archive/reader/Archive.cpp
(1 hunks)
🧰 Additional context used
📓 Path-based instructions (3)
components/core/src/clp/streaming_archive/reader/Archive.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (12)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (2)
3-3
: Validate necessity of the <sys/stat.h> header.
It seems this header is required for thestruct stat
usage at lines 30–31. Please confirm it remains necessary if file statistics are still being used.
59-62
: Confirm write operation’s success path.
You rely on theFileWriter
to throw on failure. While this is acceptable, providing explicit handling or logging around partial writes can offer more clarity for troubleshooting.components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (10)
7-7
: Include for encoding methods constants.
Ensuringffi::encoding_methods.hpp
is included allows proper default values for new member fields. Nicely done.
11-11
: Essential include for MessagePack.
Themsgpack.hpp
header is essential for the serialisation and deserialisation logic used in this class.
36-36
: Default constructor utility.
Providing a default constructor can simplify certain use cases or testing scenarios when creatingArchiveMetadata
objects.
49-49
: Concise method documentation.
The comment for the factory method clarifies its purpose and expected usage.
51-59
: Thorough docstring for MessagePack creation.
These lines effectively describe howcreate_from_file_reader
works and the possible exceptions. Excellent detail in the error-handling notes.
95-97
: Getter for variable encoding methods version.
Returning a const reference avoids unnecessary copies and ensures data consistency.
99-101
: Getter for variables schema version.
Likewise, returning a const reference aligns with your efficiency pattern and design style.
103-105
: Getter for compression type.
Maintains consistency with other getters. Straightforward and efficient.
114-119
: Comprehensive documentation for write operation.
Explicitly mentioning exceptions aids users of this method in planning robust error handling.
148-150
: Initial values for new fields.
Pre-setting defaults form_variable_encoding_methods_version
,m_variables_schema_version
, andm_compression_type
ensures consistent behaviour across all code paths.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Nitpick comments (3)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (2)
5-5
: Consider removing unused include.The
fmt/core.h
header appears to be unused in the current implementation.
59-62
: Improve error handling and memory efficiency in serialization.The current implementation:
- Doesn't handle potential msgpack::pack errors
- Uses intermediate string buffer, causing double memory allocation
Consider using msgpack's direct streaming to file writer or a more memory-efficient approach:
- std::ostringstream buf; - msgpack::pack(buf, *this); - auto const& string_buf = buf.str(); - file_writer.write(string_buf.data(), string_buf.size()); + // Option 1: Use msgpack's streaming interface + msgpack::packer<FileWriter> packer(file_writer); + packer.pack(*this); + + // Option 2: Pre-calculate buffer size + size_t buffer_size = msgpack::pack_calculate_size(*this); + std::vector<char> buffer(buffer_size); + msgpack::pack(buffer.data(), *this); + file_writer.write(buffer.data(), buffer_size);components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)
95-105
: Add documentation for new getter methods.The new getter methods lack documentation comments explaining their purpose and return values.
Add documentation similar to other methods in the class:
+ /** + * @return The version of variable encoding methods used in this archive + */ [[nodiscard]] auto get_variable_encoding_methods_version() const -> std::string const& { return m_variable_encoding_methods_version; } + /** + * @return The version of variables schema used in this archive + */ [[nodiscard]] auto get_variables_schema_version() const -> std::string const& { return m_variables_schema_version; } + /** + * @return The compression algorithm type used in this archive + */ [[nodiscard]] auto get_compression_type() const -> std::string const& { return m_compression_type; }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp
(3 hunks)components/core/src/clp/streaming_archive/ArchiveMetadata.hpp
(5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
⏰ Context from checks skipped due to timeout of 90000ms (11)
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: lint-check (macos-latest)
🔇 Additional comments (1)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)
7-15
: LGTM! Well-organized includes and constants.The new include and compression type constant are appropriately placed and follow the project's conventions.
MSGPACK_DEFINE_MAP( | ||
MSGPACK_NVP("archive_format_version", m_archive_format_version), | ||
MSGPACK_NVP("variable_encoding_methods_version", m_variable_encoding_methods_version), | ||
MSGPACK_NVP("variables_schema_version", m_variables_schema_version), | ||
MSGPACK_NVP("compression_type", m_compression_type), | ||
MSGPACK_NVP("creator_id", m_creator_id), | ||
MSGPACK_NVP("begin_timestamp", m_begin_timestamp), | ||
MSGPACK_NVP("end_timestamp", m_end_timestamp), | ||
MSGPACK_NVP("uncompressed_size", m_uncompressed_size), | ||
MSGPACK_NVP("compressed_size", m_compressed_size) | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add missing creation_idx to MessagePack serialization.
The creation_idx field is initialized in the constructor but not included in the MessagePack serialization map.
Add the missing field:
MSGPACK_DEFINE_MAP(
MSGPACK_NVP("archive_format_version", m_archive_format_version),
MSGPACK_NVP("variable_encoding_methods_version", m_variable_encoding_methods_version),
MSGPACK_NVP("variables_schema_version", m_variables_schema_version),
MSGPACK_NVP("compression_type", m_compression_type),
MSGPACK_NVP("creator_id", m_creator_id),
+ MSGPACK_NVP("creation_idx", m_creation_idx),
MSGPACK_NVP("begin_timestamp", m_begin_timestamp),
MSGPACK_NVP("end_timestamp", m_end_timestamp),
MSGPACK_NVP("uncompressed_size", m_uncompressed_size),
MSGPACK_NVP("compressed_size", m_compressed_size)
);
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
MSGPACK_DEFINE_MAP( | |
MSGPACK_NVP("archive_format_version", m_archive_format_version), | |
MSGPACK_NVP("variable_encoding_methods_version", m_variable_encoding_methods_version), | |
MSGPACK_NVP("variables_schema_version", m_variables_schema_version), | |
MSGPACK_NVP("compression_type", m_compression_type), | |
MSGPACK_NVP("creator_id", m_creator_id), | |
MSGPACK_NVP("begin_timestamp", m_begin_timestamp), | |
MSGPACK_NVP("end_timestamp", m_end_timestamp), | |
MSGPACK_NVP("uncompressed_size", m_uncompressed_size), | |
MSGPACK_NVP("compressed_size", m_compressed_size) | |
); | |
MSGPACK_DEFINE_MAP( | |
MSGPACK_NVP("archive_format_version", m_archive_format_version), | |
MSGPACK_NVP("variable_encoding_methods_version", m_variable_encoding_methods_version), | |
MSGPACK_NVP("variables_schema_version", m_variables_schema_version), | |
MSGPACK_NVP("compression_type", m_compression_type), | |
MSGPACK_NVP("creator_id", m_creator_id), | |
MSGPACK_NVP("creation_idx", m_creation_idx), | |
MSGPACK_NVP("begin_timestamp", m_begin_timestamp), | |
MSGPACK_NVP("end_timestamp", m_end_timestamp), | |
MSGPACK_NVP("uncompressed_size", m_uncompressed_size), | |
MSGPACK_NVP("compressed_size", m_compressed_size) | |
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davemarco is this an issue?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems like the humans are having a chat. I'll hop back into my burrow for now. If you need me again, just tag @coderabbitai in a new comment, and I'll come hopping out!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps. Thanks @coderabbitai. I just realized that this field was omitted from sfa format, so left out here unintentionally. I believe this field is an archive order for split archives, so maybe it was omitted because the archives are not split. We can either have private branch add this back, since it is harmless. Or maybe I can just default initialize it as zero.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davemarco Thanks for the explanation! Since the archives aren't split, defaulting creation_idx
to zero in the MessagePack serialization seems like a reasonable approach. Would you like me to provide an updated suggestion with this change?
✏️ Learnings added
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:122-132
Timestamp: 2025-01-29T21:09:06.779Z
Learning: The `creation_idx` field in `ArchiveMetadata` is used for tracking archive order in split archives. When archives aren't split, it can be safely initialized to zero.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I think either works. I guess we can omit creation_idx for now, but add a comment to explain it's for format consistency. Then we can open another issue tracking the effort to handle this field properly in private branch
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it back, since I assume this is used somewhere. I will add issue to doc. Note it is default initialized to zero, so even if missing in private branch, it will show up as zero when deserialized (if missing), so not much of a problem
@@ -38,7 +38,7 @@ void Archive::open(string const& path) { | |||
archive_format_version_t format_version{}; | |||
try { | |||
FileReader file_reader{metadata_file_path}; | |||
ArchiveMetadata const metadata{file_reader}; | |||
auto const metadata = ArchiveMetadata::create_from_file_reader(file_reader); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since the FileReader goes out of scope right after the metadata is created, can the method takes metadata_file_path as its parameter and create a FileReader internally?
Edit: Now i see that you are just trying to reuse the original signature, which makes sense.. but maybe we can also improve the method interface in this PR since it's a few lines of change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you feel about taking a clp::array instead as input. It will move some of the code (the read and stat) into archive reader, but then this constructor can also be used for sfa. Like for sfa, we can't pass in a filereader (since no way to get the size of metadata in larger file by itself, and the filepath is also for the entire sfa)?
We could also have 2 constructors. One for file path, and one for buffer. And they just share a method to deserialize from buffer? Let me know your thoughts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It will move some of the code (the read and stat) into archive reader, but then this constructor can also be used for sfa.
Is this planned in a future PR? I think it makes sense, but the change you propose might be easier to reason about when we are actually making the changes that include sfa.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay for now, I can take in a file path. I can add the buffer feature later when sfa reader is implemented
@@ -102,6 +145,9 @@ class ArchiveMetadata { | |||
// The size of the archive | |||
uint64_t m_compressed_size{0}; | |||
uint64_t m_dynamic_compressed_size{0}; | |||
std::string m_variable_encoding_methods_version{ffi::cVariableEncodingMethodsVersion}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The new coding standard encourages using std::string_view where possible. is it possible to use string_view instead?
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); | ||
} | ||
|
||
std::vector<char> buf(file_stat.st_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@LinZhihao-723 should we use clp::array instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved to clp::array
auto clp_rc = file_reader.try_fstat(file_stat); | ||
if (clp::ErrorCode::ErrorCode_Success != clp_rc) { | ||
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
another way to write this code is:
Do we prefer one way to another? @LinZhihao-723
auto clp_rc = file_reader.try_fstat(file_stat); | |
if (clp::ErrorCode::ErrorCode_Success != clp_rc) { | |
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); | |
} | |
if (auto const clp_rc = file_reader.try_fstat(file_stat); | |
clp::ErrorCode::ErrorCode_Success != clp_rc) { | |
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); | |
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Supposedly second is better since the variable is not needed in remaining scope
clp_rc = file_reader.try_read_exact_length(buf.data(), buf.size()); | ||
if (clp::ErrorCode::ErrorCode_Success != clp_rc) { | ||
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ditto, maybe can be rewritten as
clp_rc = file_reader.try_read_exact_length(buf.data(), buf.size()); | |
if (clp::ErrorCode::ErrorCode_Success != clp_rc) { | |
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); | |
} | |
if (auto const clp_rc = file_reader.try_read_exact_length(buf.data(), buf.size()); | |
clp::ErrorCode::ErrorCode_Success != clp_rc) { | |
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); | |
} |
* @throw `ArchiveMetadata::OperationFailed` if stat or read operation on metadata file fails. | ||
* @throw `msgpack::unpack_error` if data cannot be unpacked into MessagePack object. | ||
* @throw `msgpack::type_error` if MessagePack object can't be converted to | ||
* `ArchiveMetadata`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can it be at end of the preivious line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Addressed comment, and provided feedback for couple questions
auto clp_rc = file_reader.try_fstat(file_stat); | ||
if (clp::ErrorCode::ErrorCode_Success != clp_rc) { | ||
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Supposedly second is better since the variable is not needed in remaining scope
MSGPACK_DEFINE_MAP( | ||
MSGPACK_NVP("archive_format_version", m_archive_format_version), | ||
MSGPACK_NVP("variable_encoding_methods_version", m_variable_encoding_methods_version), | ||
MSGPACK_NVP("variables_schema_version", m_variables_schema_version), | ||
MSGPACK_NVP("compression_type", m_compression_type), | ||
MSGPACK_NVP("creator_id", m_creator_id), | ||
MSGPACK_NVP("begin_timestamp", m_begin_timestamp), | ||
MSGPACK_NVP("end_timestamp", m_end_timestamp), | ||
MSGPACK_NVP("uncompressed_size", m_uncompressed_size), | ||
MSGPACK_NVP("compressed_size", m_compressed_size) | ||
); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added it back, since I assume this is used somewhere. I will add issue to doc. Note it is default initialized to zero, so even if missing in private branch, it will show up as zero when deserialized (if missing), so not much of a problem
@@ -38,7 +38,7 @@ void Archive::open(string const& path) { | |||
archive_format_version_t format_version{}; | |||
try { | |||
FileReader file_reader{metadata_file_path}; | |||
ArchiveMetadata const metadata{file_reader}; | |||
auto const metadata = ArchiveMetadata::create_from_file_reader(file_reader); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How do you feel about taking a clp::array instead as input. It will move some of the code (the read and stat) into archive reader, but then this constructor can also be used for sfa. Like for sfa, we can't pass in a filereader (since no way to get the size of metadata in larger file by itself, and the filepath is also for the entire sfa)?
We could also have 2 constructors. One for file path, and one for buffer. And they just share a method to deserialize from buffer? Let me know your thoughts
throw OperationFailed(ErrorCode_Failure, __FILENAME__, __LINE__); | ||
} | ||
|
||
std::vector<char> buf(file_stat.st_size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I moved to clp::array
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (4)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (2)
37-39
: Enhance the default constructor documentation.While the comment explains the MessagePack requirement, consider making it more explicit:
- // The class must be constructible to convert from msgpack::object. - // https://github.com/msgpack/msgpack-c/wiki/v2_0_cpp_adaptor + /** + * Default constructor required for MessagePack deserialization. + * @see https://github.com/msgpack/msgpack-c/wiki/v2_0_cpp_adaptor + * @note This constructor should not be used directly. + */
117-122
: Enhance write_to_file documentation.Consider adding more details about the MessagePack serialization format and potential size implications.
/** - * Packs `ArchiveMetadata` to MessagePack and writes to the open file. + * Serializes the metadata to MessagePack format and writes to the open file. + * Note: MessagePack format may result in slightly larger file sizes compared + * to the previous binary format. * * @param file_writer Writer for archive metadata file. * @throw FileWriter::OperationFailed if failed to write. */components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (2)
5-6
: Remove unused include.The
fmt/core.h
header appears to be unused in this file. Consider removing it to keep the includes clean and minimize compilation time.-#include <fmt/core.h>
63-66
: Improve write_to_file performance and error handling.The current implementation has two areas for improvement:
- Double buffering could be avoided by writing directly to the file
- Missing error handling for the write operation
void ArchiveMetadata::write_to_file(FileWriter& file_writer) const { - std::ostringstream buf; - msgpack::pack(buf, *this); - auto const& string_buf = buf.str(); - file_writer.write(string_buf.data(), string_buf.size()); + // Pack directly to a pre-sized buffer to avoid double buffering + auto const packed_size = msgpack::pack_calculate_size(*this); + clp::Array<char> buf(packed_size); + msgpack::pack_raw(buf.data(), packed_size, *this); + + if (auto const clp_rc = file_writer.try_write(buf.data(), buf.size()); + clp::ErrorCode::ErrorCode_Success != clp_rc) { + throw OperationFailed(clp_rc, __FILENAME__, __LINE__); + } }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp
(3 hunks)components/core/src/clp/streaming_archive/ArchiveMetadata.hpp
(5 hunks)
🧰 Additional context used
📓 Path-based instructions (2)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
📓 Learnings (2)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (1)
Learnt from: haiqi96
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.cpp:29-46
Timestamp: 2025-01-29T22:16:52.665Z
Learning: In C++ code, when throwing exceptions, prefer throwing the original return code (RC) from lower-level operations instead of a generic error code to preserve error context and aid in debugging.
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:122-132
Timestamp: 2025-01-29T21:09:06.878Z
Learning: The `creation_idx` field in `ArchiveMetadata` is used for tracking archive order in split archives. When archives aren't split, it can be safely initialized to zero.
⏰ Context from checks skipped due to timeout of 90000ms (12)
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: build (macos-latest)
🔇 Additional comments (7)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (5)
5-16
: LGTM! The new includes and constants are well-organized.The use of
std::string_view
aligns with the coding standards, and the compression type constant follows proper naming conventions.
53-64
: LGTM! Well-documented factory method.The static factory method is well-designed with comprehensive error documentation.
98-109
: LGTM! Consistent getter implementations.The new getters follow a consistent pattern and properly use
std::string_view const&
as return types.
152-154
: LGTM! Well-structured member variables.The new member variables properly use
std::string_view
for constant strings and follow consistent initialization patterns.
125-136
: Verify MessagePack field order impact.The order of fields in MSGPACK_DEFINE_MAP might affect backward compatibility when reading older archives.
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (2)
33-44
: Well-implemented error handling and buffer management.Good practices observed:
- Using the recommended if-with-initializer syntax
- Throwing original return codes for better error context
- Using
clp::Array
for safe buffer management
31-50
: 🛠️ Refactor suggestionAdd file size validation to prevent excessive memory allocation.
While the implementation follows good practices for error handling and buffer management, it lacks size validation before allocating memory for the file contents.
auto ArchiveMetadata::create_from_file_reader(FileReader& file_reader) -> ArchiveMetadata { struct stat file_stat{}; if (auto const clp_rc = file_reader.try_fstat(file_stat); clp::ErrorCode::ErrorCode_Success != clp_rc) { throw OperationFailed(clp_rc, __FILENAME__, __LINE__); } + // Validate file size to prevent excessive memory allocation + constexpr size_t max_metadata_size = 1024 * 1024; // 1MB should be more than enough for metadata + if (file_stat.st_size <= 0 || file_stat.st_size > max_metadata_size) { + throw OperationFailed(ErrorCode_BadParam, __FILENAME__, __LINE__); + } + clp::Array<char> buf(file_stat.st_size);Likely invalid or redundant comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed to pass in file path instead of fileReader @haiqi96
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 2
🧹 Nitpick comments (3)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)
16-16
: Add documentation for the compression type constant.Add a comment to explain the purpose and usage of this constant.
Apply this diff to add documentation:
+/** + * Constant representing ZSTD compression type for archive metadata. + */ static constexpr std::string_view cCompressionTypeZstd = "ZSTD";components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (2)
5-5
: Remove unused include.The
fmt/core.h
header appears to be unused in the current implementation.-#include <fmt/core.h>
65-68
: Optimize string buffer allocation.Consider reserving space in the string buffer to avoid reallocations during serialization.
std::ostringstream buf; + // Reserve approximate size based on typical metadata size + buf.str().reserve(1024); msgpack::pack(buf, *this); auto const& string_buf = buf.str();
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp
(3 hunks)components/core/src/clp/streaming_archive/ArchiveMetadata.hpp
(5 hunks)components/core/src/clp/streaming_archive/reader/Archive.cpp
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- components/core/src/clp/streaming_archive/reader/Archive.cpp
🧰 Additional context used
📓 Path-based instructions (2)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
📓 Learnings (2)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (1)
Learnt from: haiqi96
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.cpp:29-46
Timestamp: 2025-01-29T22:16:52.665Z
Learning: In C++ code, when throwing exceptions, prefer throwing the original return code (RC) from lower-level operations instead of a generic error code to preserve error context and aid in debugging.
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:122-132
Timestamp: 2025-01-29T21:09:06.878Z
Learning: The `creation_idx` field in `ArchiveMetadata` is used for tracking archive order in split archives. When archives aren't split, it can be safely initialized to zero.
⏰ Context from checks skipped due to timeout of 90000ms (12)
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: build (macos-latest)
🔇 Additional comments (12)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (2)
32-52
: 🛠️ Refactor suggestionAdd file size validation to prevent excessive memory allocation.
The implementation should validate the file size before allocating memory for the buffer.
Apply this diff to add file size validation:
auto ArchiveMetadata::create_from_file(std::string_view path) -> ArchiveMetadata { auto file_reader = FileReader(std::string(path)); struct stat file_stat{}; if (auto const clp_rc = file_reader.try_fstat(file_stat); clp::ErrorCode::ErrorCode_Success != clp_rc) { throw OperationFailed(clp_rc, __FILENAME__, __LINE__); } + // Validate file size to prevent excessive memory allocation + constexpr size_t max_metadata_size = 1024 * 1024; // 1MB should be more than enough for metadata + if (file_stat.st_size <= 0 || file_stat.st_size > max_metadata_size) { + throw OperationFailed(ErrorCode_BadParam, __FILENAME__, __LINE__); + } + clp::Array<char> buf(file_stat.st_size); if (auto const clp_rc = file_reader.try_read_exact_length(buf.data(), buf.size()); clp::ErrorCode::ErrorCode_Success != clp_rc) { throw OperationFailed(clp_rc, __FILENAME__, __LINE__); }Likely invalid or redundant comment.
32-46
: 🛠️ Refactor suggestionAdd file size validation to prevent excessive memory allocation.
While the implementation correctly handles file operations, it should validate the file size before allocating memory.
if (auto const clp_rc = file_reader.try_fstat(file_stat); clp::ErrorCode::ErrorCode_Success != clp_rc) { throw OperationFailed(clp_rc, __FILENAME__, __LINE__); } + // Validate file size to prevent excessive memory allocation + constexpr size_t max_metadata_size = 1024 * 1024; // 1MB should be sufficient for metadata + if (file_stat.st_size <= 0 || file_stat.st_size > max_metadata_size) { + throw OperationFailed(ErrorCode_BadParam, __FILENAME__, __LINE__); + } + clp::Array<char> buf(file_stat.st_size);Likely invalid or redundant comment.
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (10)
37-40
: LGTM!The documentation clearly explains why the default constructor is necessary for MessagePack deserialization.
54-64
: LGTM!The documentation is comprehensive and accurately describes the method's behaviour, including all possible exceptions.
99-109
: LGTM!The getter methods are well-implemented and consistent with the existing codebase style.
118-123
: LGTM!The documentation clearly describes the method's purpose and potential exceptions.
126-137
: LGTM!The MessagePack map definition includes all necessary fields for serialization.
5-16
: LGTM! Good use of modern C++ features.The use of std::string_view and constexpr for the compression type constant is appropriate.
37-40
: LGTM! Clear documentation.The comment effectively explains why the default constructor is necessary for MessagePack deserialization.
126-137
: LGTM! Complete serialization map.All necessary fields are included in the MessagePack serialization map, including creation_idx for format consistency.
99-110
: LGTM! Consistent getter implementation.The new getter methods follow the same pattern as existing ones and correctly return const references.
153-155
: Consider using std::string instead of std::string_view for member variables.While std::string_view is efficient, storing it as a member variable can lead to dangling references if the referenced string's lifetime ends. Since these are configuration values that don't change frequently, the overhead of std::string would be negligible.
std::ostringstream buf; | ||
msgpack::pack(buf, *this); | ||
auto const& string_buf = buf.str(); | ||
file_writer.write(string_buf.data(), string_buf.size()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add error handling for the write operation.
The write operation should be checked for errors to ensure the data is written successfully.
Apply this diff to add error handling:
void ArchiveMetadata::write_to_file(FileWriter& file_writer) const {
std::ostringstream buf;
msgpack::pack(buf, *this);
auto const& string_buf = buf.str();
- file_writer.write(string_buf.data(), string_buf.size());
+ if (auto const clp_rc = file_writer.try_write(string_buf.data(), string_buf.size());
+ clp::ErrorCode::ErrorCode_Success != clp_rc)
+ {
+ throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
+ }
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
std::ostringstream buf; | |
msgpack::pack(buf, *this); | |
auto const& string_buf = buf.str(); | |
file_writer.write(string_buf.data(), string_buf.size()); | |
std::ostringstream buf; | |
msgpack::pack(buf, *this); | |
auto const& string_buf = buf.str(); | |
if (auto const clp_rc = file_writer.try_write(string_buf.data(), string_buf.size()); | |
clp::ErrorCode::ErrorCode_Success != clp_rc) | |
{ | |
throw OperationFailed(clp_rc, __FILENAME__, __LINE__); | |
} |
std::string_view m_variable_encoding_methods_version{ffi::cVariableEncodingMethodsVersion}; | ||
std::string_view m_variables_schema_version{ffi::cVariablesSchemaVersion}; | ||
std::string_view m_compression_type{cCompressionTypeZstd}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Use std::string instead of std::string_view for member variables.
Using std::string_view for member variables that store string literals is risky as it doesn't own the string data. If the string literal's lifetime isn't guaranteed, this could lead to dangling references.
Apply this diff to use std::string:
- std::string_view m_variable_encoding_methods_version{ffi::cVariableEncodingMethodsVersion};
- std::string_view m_variables_schema_version{ffi::cVariablesSchemaVersion};
- std::string_view m_compression_type{cCompressionTypeZstd};
+ std::string m_variable_encoding_methods_version{std::string(ffi::cVariableEncodingMethodsVersion)};
+ std::string m_variables_schema_version{std::string(ffi::cVariablesSchemaVersion)};
+ std::string m_compression_type{std::string(cCompressionTypeZstd)};
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
std::string_view m_variable_encoding_methods_version{ffi::cVariableEncodingMethodsVersion}; | |
std::string_view m_variables_schema_version{ffi::cVariablesSchemaVersion}; | |
std::string_view m_compression_type{cCompressionTypeZstd}; | |
std::string m_variable_encoding_methods_version{std::string(ffi::cVariableEncodingMethodsVersion)}; | |
std::string m_variables_schema_version{std::string(ffi::cVariablesSchemaVersion)}; | |
std::string m_compression_type{std::string(cCompressionTypeZstd)}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The stings are constexpression
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@davemarco Ah, you're absolutely right. Since these are constexpr strings with static storage duration, using std::string_view
is actually the correct and more efficient approach. I apologize for the confusion in my previous review.
✏️ Learnings added
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.726Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me. as long as you ensure that the PR is tested.
for title, how about
feat(core-clp)!: Migrate archive metadata to MessagePack format and add new archive encoding fields.
I lightly tested at beginning, but will do more extensive tests before merging. Like test clg, and run unit tests. Also I think @kirkrodrigues, may want to lightly review before merge? If not let us know |
throw OperationFailed(clp_rc, __FILENAME__, __LINE__); | ||
} | ||
|
||
clp::Array<char> buf(file_stat.st_size); | ||
if (auto const clp_rc = file_reader.try_read_exact_length(buf.data(), buf.size()); | ||
clp::ErrorCode::ErrorCode_Success != clp_rc) | ||
{ | ||
throw OperationFailed(clp_rc, __FILENAME__, __LINE__); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Generally, exceptions should only be used for exceptional circumstances. I don't think corrupt data is exceptional since it can occur under normal circumstances (e.g., an interrupted download, a drive failure, etc.). Thus, I think this method should return these errors as error codes. To support a return value that can be an error code OR an object, you can use outcome. See clp::ffi::get_schema_subtree_bitmap
for an example.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function is called in a try-catch block, where any exception reading metadata is critical and re thrown. I think it would look weird to return std::result here , then throw an exception if there is an error, inside the existing try-catch block. It may be possible to get rid of the try-catch block entirely, and just use std::result, but then any undocumented exceptions in msgpack/file_reader may be caught higher up in code, and not in metadata handler like they are now. Lmk ur thoughts
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. So from a guideline perspective, we should:
- Catch the MessagePack exceptions and return an error
- I think the FileReader exceptions should be truly exception, so we can leave them as exceptions for the top-layer of code to catch; but if not, then we should ideally use calls that return an error.
- Remove the external try-catch and change it into an error check
- Change Archive::open to return an error
I think (1) & (2) is doable in this PR (but feel free to disagree if it seems like too much code to change). (3) will probably be larger than the scope of this PR. So if we just do (1) & (2), to be able to propagate the error to the caller, I'd throw an exception.
Let's just make sure we're on the same page with the above before we go ahead with the change (don't want you to have to change too much).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general I am still a little suspect about using std::result now that i have looked at a bit more. We need to use since this function returns ArchiveMetadata
. We don't have custom error codes defined in this namespace yet (see Zhihao's custom codes), and dosen't look like the standard clp error codes have been migrated? It may be out-of-scope of this PR to add these new error codes, unless you want to do a new PR first, to add support for new codes into this class, or migrate clp codes or something else?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can use regular error code, and pass in a pointer to default constructed ArchivedMetadata, but that is a bit gross
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Uh, we usually use the generic error codes (I gave a bad example, lol), but I guess since it isn't clean, let's leave it for now. I'll review the rest of the PR when I get a chance today.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
♻️ Duplicate comments (1)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (1)
64-69
:⚠️ Potential issueAdd error handling for the write operation.
The write operation should check for errors to ensure data is written successfully.
std::ostringstream buf; msgpack::pack(buf, *this); auto const& string_buf = buf.str(); - file_writer.write(string_buf.data(), string_buf.size()); + if (auto const clp_rc = file_writer.try_write(string_buf.data(), string_buf.size()); + clp::ErrorCode::ErrorCode_Success != clp_rc) + { + throw OperationFailed(clp_rc, __FILENAME__, __LINE__); + }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp
(3 hunks)components/core/src/clp/streaming_archive/ArchiveMetadata.hpp
(4 hunks)components/core/src/clp/streaming_archive/reader/Archive.cpp
(1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- components/core/src/clp/streaming_archive/reader/Archive.cpp
- components/core/src/clp/streaming_archive/ArchiveMetadata.hpp
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
📓 Learnings (1)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (1)
Learnt from: haiqi96
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.cpp:29-46
Timestamp: 2025-01-29T22:16:52.665Z
Learning: In C++ code, when throwing exceptions, prefer throwing the original return code (RC) from lower-level operations instead of a generic error code to preserve error context and aid in debugging.
⏰ Context from checks skipped due to timeout of 90000ms (12)
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, false)
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: build (macos-latest)
- GitHub Check: build-macos (macos-13, true)
🔇 Additional comments (2)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp (2)
3-9
: LGTM! All new includes are necessary.The added includes support the new MessagePack serialization functionality and file operations.
41-46
: LGTM! Proper error handling with original return codes.The implementation correctly propagates the original error codes from lower-level operations, which aids in debugging.
auto ArchiveMetadata::create_from_file(std::string_view file_path) -> ArchiveMetadata { | ||
FileReader file_reader{std::string(file_path)}; | ||
struct stat file_stat{}; | ||
if (auto const clp_rc = file_reader.try_fstat(file_stat); | ||
clp::ErrorCode::ErrorCode_Success != clp_rc) | ||
{ | ||
throw OperationFailed(clp_rc, __FILENAME__, __LINE__); | ||
} | ||
|
||
clp::Array<char> buf{static_cast<size_t>(file_stat.st_size)}; | ||
if (auto const clp_rc = file_reader.try_read_exact_length(buf.data(), buf.size()); | ||
clp::ErrorCode::ErrorCode_Success != clp_rc) | ||
{ | ||
throw OperationFailed(clp_rc, __FILENAME__, __LINE__); | ||
} | ||
|
||
ArchiveMetadata metadata; | ||
msgpack::object_handle const obj_handle = msgpack::unpack(buf.data(), buf.size()); | ||
msgpack::object const obj = obj_handle.get(); | ||
obj.convert(metadata); | ||
return metadata; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🛠️ Refactor suggestion
Add file size validation to prevent excessive memory allocation.
Consider adding a size limit check before allocating memory for the buffer.
if (auto const clp_rc = file_reader.try_fstat(file_stat);
clp::ErrorCode::ErrorCode_Success != clp_rc)
{
throw OperationFailed(clp_rc, __FILENAME__, __LINE__);
}
+ constexpr size_t max_metadata_size = 1024 * 1024; // 1MB limit for metadata
+ if (file_stat.st_size <= 0 || file_stat.st_size > max_metadata_size) {
+ throw OperationFailed(ErrorCode_BadParam, __FILENAME__, __LINE__);
+ }
+
clp::Array<char> buf{static_cast<size_t>(file_stat.st_size)};
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
auto ArchiveMetadata::create_from_file(std::string_view file_path) -> ArchiveMetadata { | |
FileReader file_reader{std::string(file_path)}; | |
struct stat file_stat{}; | |
if (auto const clp_rc = file_reader.try_fstat(file_stat); | |
clp::ErrorCode::ErrorCode_Success != clp_rc) | |
{ | |
throw OperationFailed(clp_rc, __FILENAME__, __LINE__); | |
} | |
clp::Array<char> buf{static_cast<size_t>(file_stat.st_size)}; | |
if (auto const clp_rc = file_reader.try_read_exact_length(buf.data(), buf.size()); | |
clp::ErrorCode::ErrorCode_Success != clp_rc) | |
{ | |
throw OperationFailed(clp_rc, __FILENAME__, __LINE__); | |
} | |
ArchiveMetadata metadata; | |
msgpack::object_handle const obj_handle = msgpack::unpack(buf.data(), buf.size()); | |
msgpack::object const obj = obj_handle.get(); | |
obj.convert(metadata); | |
return metadata; | |
} | |
auto ArchiveMetadata::create_from_file(std::string_view file_path) -> ArchiveMetadata { | |
FileReader file_reader{std::string(file_path)}; | |
struct stat file_stat{}; | |
if (auto const clp_rc = file_reader.try_fstat(file_stat); | |
clp::ErrorCode::ErrorCode_Success != clp_rc) | |
{ | |
throw OperationFailed(clp_rc, __FILENAME__, __LINE__); | |
} | |
constexpr size_t max_metadata_size = 1024 * 1024; // 1MB limit for metadata | |
if (file_stat.st_size <= 0 || file_stat.st_size > max_metadata_size) { | |
throw OperationFailed(ErrorCode_BadParam, __FILENAME__, __LINE__); | |
} | |
clp::Array<char> buf{static_cast<size_t>(file_stat.st_size)}; | |
if (auto const clp_rc = file_reader.try_read_exact_length(buf.data(), buf.size()); | |
clp::ErrorCode::ErrorCode_Success != clp_rc) | |
{ | |
throw OperationFailed(clp_rc, __FILENAME__, __LINE__); | |
} | |
ArchiveMetadata metadata; | |
msgpack::object_handle const obj_handle = msgpack::unpack(buf.data(), buf.size()); | |
msgpack::object const obj = obj_handle.get(); | |
obj.convert(metadata); | |
return metadata; | |
} |
Co-authored-by: kirkrodrigues <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (2)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (2)
51-61
: Validate path parameters and file accessibility.While the documentation states that
create_from_file
throws exceptions if the file cannot be read, consider verifying thatfile_path
is not empty or invalid. This can provide more descriptive errors or early detection of misconfiguration.
102-106
: Ensure documentation consistency for error cases.If
write_to_file
can throw due to I/O failures, consider documenting such potential exceptions or clarifying thatFileWriter
handles them elsewhere.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp
(3 hunks)components/core/src/clp/streaming_archive/ArchiveMetadata.hpp
(4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- components/core/src/clp/streaming_archive/ArchiveMetadata.cpp
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
📓 Learnings (1)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (2)
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.869Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:122-132
Timestamp: 2025-01-29T21:09:06.878Z
Learning: The `creation_idx` field in `ArchiveMetadata` is used for tracking archive order in split archives. When archives aren't split, it can be safely initialized to zero.
⏰ Context from checks skipped due to timeout of 90000ms (12)
- GitHub Check: ubuntu-focal-static-linked-bins
- GitHub Check: centos-stream-9-static-linked-bins
- GitHub Check: ubuntu-jammy-static-linked-bins
- GitHub Check: ubuntu-focal-dynamic-linked-bins
- GitHub Check: centos-stream-9-dynamic-linked-bins
- GitHub Check: ubuntu-jammy-dynamic-linked-bins
- GitHub Check: build-macos (macos-14, true)
- GitHub Check: build-macos (macos-13, false)
- GitHub Check: lint-check (ubuntu-latest)
- GitHub Check: build-macos (macos-13, true)
- GitHub Check: lint-check (macos-latest)
- GitHub Check: build (macos-latest)
🔇 Additional comments (3)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (3)
5-7
: No issues found with the newly added includes.Including
<string_view>
and<msgpack.hpp>
is consistent with the new usage of string views and MessagePack functionalities.
33-35
: Good addition of the default constructor.Having a default constructor is necessary for msgpack conversion. This ensures safe deserialisation into an
ArchiveMetadata
object without imposing extra burdens on the caller.
49-49
: No substantive changes.This is merely a comment line update with no functional impact.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the PR title, how about:
feat(core-clp)!: Migrate archive metadata file format to MessagePack.
I looked through this again prior to merging and noticed an issue. The compressed size for the metadata itself is calculated based on old metadata size. It is not the easiest thing to fix. I propose 3 options: Option A
Option B Option C |
How about let's only add the metadata file's size to |
write_to_file is actually called whenever a segment is closed, so they would accumulate. Even if called only once, or some behavior to handle that issue, there could also be issues with global_metadatadb archive update. Say we update global metadataDB with archive size before actually writting metadata to file. |
Can we do the following?
|
^ i think the above should work. ill let u know if any issues, will request ur review when updated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Nitpick comments (3)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (3)
61-61
: Consider using a more descriptive parameter name.The parameter name
file_path
would be more descriptive than justpath
.-[[nodiscard]] static auto create_from_file(std::string_view file_path) -> ArchiveMetadata; +[[nodiscard]] static auto create_from_file(std::string_view archive_file_path) -> ArchiveMetadata;
102-107
: Add exception documentation.Document potential exceptions that could be thrown during MessagePack serialization or file writing.
/** * Packs this instance into a MessagePack object and writes it to the open file. * * @param file_writer + * @throw msgpack::type_error if serialization fails + * @throw FileWriter::OperationFailed if writing to file fails */
131-131
: Add documentation for m_metadata_size.The purpose and usage of this new member variable should be documented.
uint64_t m_compressed_size{0}; + // The size of the serialized metadata in bytes uint64_t m_metadata_size{0};
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
components/core/src/clp/streaming_archive/ArchiveMetadata.cpp
(3 hunks)components/core/src/clp/streaming_archive/ArchiveMetadata.hpp
(5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- components/core/src/clp/streaming_archive/ArchiveMetadata.cpp
🧰 Additional context used
📓 Path-based instructions (1)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (1)
Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}
: - Prefer false == <expression>
rather than !<expression>
.
📓 Learnings (1)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (2)
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:153-155
Timestamp: 2025-01-30T19:26:33.869Z
Learning: When working with constexpr strings (string literals with static storage duration), std::string_view is the preferred choice for member variables as it's more efficient and safe, avoiding unnecessary memory allocations.
Learnt from: davemarco
PR: y-scope/clp#700
File: components/core/src/clp/streaming_archive/ArchiveMetadata.hpp:122-132
Timestamp: 2025-01-29T21:09:06.878Z
Learning: The `creation_idx` field in `ArchiveMetadata` is used for tracking archive order in split archives. When archives aren't split, it can be safely initialized to zero.
🔇 Additional comments (3)
components/core/src/clp/streaming_archive/ArchiveMetadata.hpp (3)
33-35
: LGTM! Clear explanation for the default constructor.The comment effectively explains why a default constructor is needed for MessagePack deserialization.
109-117
: LGTM! MessagePack serialization map is complete.All essential metadata fields are included in the serialization map.
7-7
: 🛠️ Refactor suggestionUse angle brackets for library includes.
Library headers should use angle brackets instead of quotes.
-#include "msgpack.hpp" +#include <msgpack.hpp>Likely invalid or redundant comment.
@kirkrodrigues I did the fix. Note I made a slight change. I did not
Instead I set Therefore I did this because MsgPack will only take one variable |
Hey I thought about this more yesterday night, and I think your idea is clever , but still a bit suspicious. Like fundamentally there is a chicken & egg problem, where you don't know the size of the metadata until after you write it, making it difficult to include the metadata_size in the msgpack metadata itself. There are some weird effects:
I am still partial to Option C - just dropping the metadata from the compressed size. Let me know your thoughts. |
Discussed offline new plan to remove metadata size from compressed size. Made changes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Approved pending Haiqi's final approval.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quickly took a look. The final version looks reasonable.
Description
This PR is breaking
MIgrates the archive metadata format from a byte stream to MessagePack Map (same as sfa spec). Using MessagePack allows us to more easily store variable length metadata like strings. In previous format, a size specifier was required for any variable length field.
Note the metadata file will take up slightly more space, but likely small relative to archive.
Lastly the metadata size is now excluded from the compressed size. Now that the metadata is variable length, it is not trivial to incorporate the metadata size in the metadata itself (there is a circularity problem). The simplest solution is just to remove it from compressed size value.
This PR is sfa related like:
#698
Validation performed
Tested compression/decompression of archive.
Summary by CodeRabbit
Summary by CodeRabbit
Refactor
ArchiveMetadata
class to use MessagePack serialization for more efficient file reading and writing.create_from_file
method.ArchiveMetadata
in theArchive::open
method.New Features
ArchiveMetadata
.ArchiveMetadata
to a file.Improvements