Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for tracking matched and unmatched capture groups in RegexAST nodes using integer-based tags; Add support for serializing RegexAST nodes. #38

Merged
merged 77 commits into from
Sep 30, 2024

Conversation

SharafMohamed
Copy link
Contributor

@SharafMohamed SharafMohamed commented Sep 13, 2024

References

  • PR depends on PR#37.

Description

  • Add positive tag to AST node type CaptureAST indicating the id for the capture group.
  • Add negative tags to all AST node types indicating which capture groups are implicitly not matched when traversing the node.
  • Use fmt library to format serialized ASTs.

Validation performed

  • Add unit-test that serializes the tagged AST and compares against an expected string.

Summary by CodeRabbit

  • New Features

    • Enhanced package management for the log_surgeon library with conditional dependencies.
    • Introduced tagging and serialization features for regex AST structures.
    • Added functionality for unique ID generation in regex capture groups.
  • Bug Fixes

    • Improved handling of dependencies for Microsoft.GSL::GSL.
  • Tests

    • Expanded test coverage for the Schema class to validate regex AST serialization with named capture groups.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (4)
src/log_surgeon/finite_automata/RegexAST.hpp (4)

4-6: Good addition of the fmt library for string formatting.

The inclusion of the fmt library is a great choice for modern C++ string formatting. It improves readability and maintainability of the code.

Consider adding a comment explaining why C++20's is not used, to provide context for future maintainers:

// Using fmt library instead of C++20 <format> for broader compiler support
#include <fmt/core.h>
#include <fmt/ranges.h>

80-94: Well-implemented tag management methods.

The new methods for managing subtree positive tags and negative tags provide a clear and efficient interface. The use of std::move in set methods is a good practice for performance.

Consider adding [[nodiscard]] to the getter method for consistency:

[[nodiscard]] auto get_subtree_positive_tags() const -> std::set<uint32_t> const& {
    return m_subtree_positive_tags;
}

102-113: Well-implemented negative tag serialization.

The serialize_negative_tags method is well-implemented, using fmt::format consistently with the new approach. The logic for formatting negative tags is clear and efficient.

Consider pre-allocating the formatted_tags vector to improve performance:

std::vector<std::string> formatted_tags;
formatted_tags.reserve(m_negative_tags.size());
for (auto const tag : m_negative_tags) {
    formatted_tags.push_back(fmt::format("<~{}>", tag));
}

Line range hint 1-1001: Overall well-implemented serialization and tagging mechanism.

The changes consistently implement serialization across all AST node types and introduce a valuable tagging mechanism for tracking capture groups. The use of the fmt library for string formatting improves code readability and maintainability.

However, there are a few areas that require attention:

  1. Unicode character handling in serialization methods of RegexASTLiteral, RegexASTInteger, and RegexASTGroup.
  2. Potential performance optimization in serialize_negative_tags method.

Please address these issues to ensure robust handling of Unicode characters and optimal performance.

Consider creating a utility function for Unicode character serialization to ensure consistent handling across all AST node types. This would improve code reusability and make it easier to maintain correct Unicode handling throughout the codebase.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between ce2fc76 and 9643138.

📒 Files selected for processing (2)
  • src/log_surgeon/SchemaParser.hpp (3 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (22 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/log_surgeon/SchemaParser.hpp
🔇 Additional comments not posted (6)
src/log_surgeon/finite_automata/RegexAST.hpp (6)

26-38: Well-documented tagging mechanism.

The updated class documentation clearly explains the purpose and usage of the new tagging mechanism. It provides excellent context for understanding the class structure and behaviour, which will be valuable for maintainers and new developers working on this code.


74-78: Good addition of the pure virtual serialize method.

The introduction of the pure virtual serialize method in the base class ensures that all derived classes will implement serialization. The method is correctly marked as [[nodiscard]] and const, following best practices.


711-716: Well-implemented changes in RegexASTOr.

The constructor has been updated to correctly set negative tags for both left and right children, which is crucial for the new tagging mechanism. The serialize method is well-implemented, including proper null checks and using fmt::format consistently.

Also applies to: 725-736


744-747: Well-implemented changes in RegexASTCat.

The constructor has been updated to correctly set and add subtree positive tags for both left and right children, which aligns with the new tagging mechanism. The serialize method is well-implemented, including proper null checks and using fmt::format consistently.

Also applies to: 760-771


781-783: Well-implemented changes in RegexASTMultiplication.

The constructor has been updated to correctly set subtree positive tags for the operand, aligning with the new tagging mechanism. The serialize method is well-implemented, handling the infinite case correctly and using fmt::format consistently.

Also applies to: 821-833


575-587: Well-implemented changes in RegexASTCapture.

The constructor has been updated to include a tag parameter and correctly set subtree positive tags, including the new tag. This aligns well with the new tagging mechanism. The serialize method is well-implemented, including the tag in the output and using fmt::format consistently.

Also applies to: 841-851

src/log_surgeon/finite_automata/RegexAST.hpp Outdated Show resolved Hide resolved
src/log_surgeon/finite_automata/RegexAST.hpp Outdated Show resolved Hide resolved
src/log_surgeon/finite_automata/RegexAST.hpp Show resolved Hide resolved
src/log_surgeon/SchemaParser.cpp Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
.clang-format (1)

77-77: LGTM! Consider adding a comment for clarity.

The change to include 'fmt' in the regex pattern for library headers is appropriate, given the introduction of the 'fmt' library for formatting serialized ASTs. This modification ensures that 'fmt' headers are categorized correctly.

For improved clarity, consider adding a comment explaining the purpose of this regex pattern. For example:

-  - Regex: "<(Catch2|fmt)"
+  # Catch2 (testing framework) and fmt (formatting library) headers
+  - Regex: "<(Catch2|fmt)"

This comment would help future maintainers understand why these specific libraries are grouped together.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between dac2122 and a5eae39.

📒 Files selected for processing (4)
  • .clang-format (1 hunks)
  • CMakeLists.txt (4 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (21 hunks)
  • tests/test-lexer.cpp (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • CMakeLists.txt
  • src/log_surgeon/finite_automata/RegexAST.hpp
  • tests/test-lexer.cpp

…out-of-order comments); Add comment explaining why <format> isn't sued to cmake
Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR title suggestion:
"Add support for tracking match and unmatched capture groups in RegexAST nodes using integer-based tags; Add support for serializing RegexAST nodes."
We don't need to mention unit test in the PR title

@SharafMohamed SharafMohamed changed the title Add tags to AST to track match and unmatched capture groups; Add serialization of AST; Add unit-test for tagged AST. Add support for tracking matched and unmatched capture groups in RegexAST nodes using integer-based tags; Add support for serializing RegexAST nodes. Sep 30, 2024
@SharafMohamed SharafMohamed merged commit c410af4 into y-scope:main Sep 30, 2024
7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants