-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add support for tracking matched and unmatched capture groups in RegexAST
nodes using integer-based tags; Add support for serializing RegexAST
nodes.
#38
Conversation
… tag to matching_variable_id; Use full names for vars (r->rule); Clarify if states are NFA or DFA
…iteral arguments; Use const& for non-literals; Use auto where possible; Use uint32_t over int for ids; replace begin() and end() with cbegin() and cend()
…(); Add docstrign to RegexDFAStatePair
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (4)
src/log_surgeon/finite_automata/RegexAST.hpp (4)
4-6
: Good addition of the fmt library for string formatting.The inclusion of the fmt library is a great choice for modern C++ string formatting. It improves readability and maintainability of the code.
Consider adding a comment explaining why C++20's is not used, to provide context for future maintainers:
// Using fmt library instead of C++20 <format> for broader compiler support #include <fmt/core.h> #include <fmt/ranges.h>
80-94
: Well-implemented tag management methods.The new methods for managing subtree positive tags and negative tags provide a clear and efficient interface. The use of std::move in set methods is a good practice for performance.
Consider adding [[nodiscard]] to the getter method for consistency:
[[nodiscard]] auto get_subtree_positive_tags() const -> std::set<uint32_t> const& { return m_subtree_positive_tags; }
102-113
: Well-implemented negative tag serialization.The serialize_negative_tags method is well-implemented, using fmt::format consistently with the new approach. The logic for formatting negative tags is clear and efficient.
Consider pre-allocating the formatted_tags vector to improve performance:
std::vector<std::string> formatted_tags; formatted_tags.reserve(m_negative_tags.size()); for (auto const tag : m_negative_tags) { formatted_tags.push_back(fmt::format("<~{}>", tag)); }
Line range hint
1-1001
: Overall well-implemented serialization and tagging mechanism.The changes consistently implement serialization across all AST node types and introduce a valuable tagging mechanism for tracking capture groups. The use of the fmt library for string formatting improves code readability and maintainability.
However, there are a few areas that require attention:
- Unicode character handling in serialization methods of RegexASTLiteral, RegexASTInteger, and RegexASTGroup.
- Potential performance optimization in serialize_negative_tags method.
Please address these issues to ensure robust handling of Unicode characters and optimal performance.
Consider creating a utility function for Unicode character serialization to ensure consistent handling across all AST node types. This would improve code reusability and make it easier to maintain correct Unicode handling throughout the codebase.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- src/log_surgeon/SchemaParser.hpp (3 hunks)
- src/log_surgeon/finite_automata/RegexAST.hpp (22 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- src/log_surgeon/SchemaParser.hpp
🔇 Additional comments not posted (6)
src/log_surgeon/finite_automata/RegexAST.hpp (6)
26-38
: Well-documented tagging mechanism.The updated class documentation clearly explains the purpose and usage of the new tagging mechanism. It provides excellent context for understanding the class structure and behaviour, which will be valuable for maintainers and new developers working on this code.
74-78
: Good addition of the pure virtual serialize method.The introduction of the pure virtual serialize method in the base class ensures that all derived classes will implement serialization. The method is correctly marked as [[nodiscard]] and const, following best practices.
711-716
: Well-implemented changes in RegexASTOr.The constructor has been updated to correctly set negative tags for both left and right children, which is crucial for the new tagging mechanism. The serialize method is well-implemented, including proper null checks and using fmt::format consistently.
Also applies to: 725-736
744-747
: Well-implemented changes in RegexASTCat.The constructor has been updated to correctly set and add subtree positive tags for both left and right children, which aligns with the new tagging mechanism. The serialize method is well-implemented, including proper null checks and using fmt::format consistently.
Also applies to: 760-771
781-783
: Well-implemented changes in RegexASTMultiplication.The constructor has been updated to correctly set subtree positive tags for the operand, aligning with the new tagging mechanism. The serialize method is well-implemented, handling the infinite case correctly and using fmt::format consistently.
Also applies to: 821-833
575-587
: Well-implemented changes in RegexASTCapture.The constructor has been updated to include a tag parameter and correctly set subtree positive tags, including the new tag. This aligns well with the new tagging mechanism. The serialize method is well-implemented, including the tag in the output and using fmt::format consistently.
Also applies to: 841-851
…or serializing negative tags use reserve and emplace_back.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
.clang-format (1)
77-77
: LGTM! Consider adding a comment for clarity.The change to include 'fmt' in the regex pattern for library headers is appropriate, given the introduction of the 'fmt' library for formatting serialized ASTs. This modification ensures that 'fmt' headers are categorized correctly.
For improved clarity, consider adding a comment explaining the purpose of this regex pattern. For example:
- - Regex: "<(Catch2|fmt)" + # Catch2 (testing framework) and fmt (formatting library) headers + - Regex: "<(Catch2|fmt)"This comment would help future maintainers understand why these specific libraries are grouped together.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (4)
- .clang-format (1 hunks)
- CMakeLists.txt (4 hunks)
- src/log_surgeon/finite_automata/RegexAST.hpp (21 hunks)
- tests/test-lexer.cpp (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
- CMakeLists.txt
- src/log_surgeon/finite_automata/RegexAST.hpp
- tests/test-lexer.cpp
…out-of-order comments); Add comment explaining why <format> isn't sued to cmake
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PR title suggestion:
"Add support for tracking match and unmatched capture groups in RegexAST nodes using integer-based tags; Add support for serializing RegexAST nodes."
We don't need to mention unit test in the PR title
RegexAST
nodes using integer-based tags; Add support for serializing RegexAST
nodes.
References
Description
CaptureAST
indicating the id for the capture group.fmt
library to format serialized ASTs.Validation performed
Summary by CodeRabbit
New Features
Bug Fixes
Tests