Bug-Fix: Add negative tags for RegexMultiplicationAST with min=0; Update README to include intersection-test. #41

Merged
merged 122 commits into from
Oct 7, 2024

Conversation

SharafMohamed
Contributor

@SharafMohamed SharafMohamed commented Sep 13, 2024

References

  • Depends on PR#40.

Description

Previously, RegexASTMultiplication was missing the negative tags needed for generating a tagged NFA. Namely, for a regex repetition (e.g., R{0,N} or R*) containing a capture group, the zero-repetition case means the capture group is not matched, so a negative tag must be added for it. To handle this, we do the following:

  • Create an empty regex AST node, ∅.
  • Treat R{0,N} as R{1,N} | ∅
  • Treat R* as R+|∅
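
The rewrite above can be sketched with a toy serializer. This is a minimal illustration, not log-surgeon's actual AST classes: the `Repetition` struct and the `serialize` function are hypothetical, and the output format simply mirrors the expected strings used in this PR's unit tests (e.g. `()|(a{1,10})` and `(<~0>)|((a)<0>{1,10})`).

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical stand-in for a repeated sub-expression; not log-surgeon's real AST.
struct Repetition {
    std::string body;                    // serialized operand, e.g. "a"
    std::optional<uint32_t> capture_id;  // set when the operand is a capture group
    uint32_t min;
    uint32_t max;
};

// Serializes `rep`, rewriting R{0,N} as R{1,N} | ∅ so that the empty branch
// can carry the negative tag of an unmatched capture group.
auto serialize(Repetition const& rep) -> std::string {
    auto operand = rep.body;
    std::string negative_tag;
    if (rep.capture_id.has_value()) {
        auto const id = std::to_string(rep.capture_id.value());
        operand = "(" + rep.body + ")<" + id + ">";
        negative_tag = "<~" + id + ">";
    }
    auto const quantifier = [&rep](uint32_t lower_bound) {
        return "{" + std::to_string(lower_bound) + "," + std::to_string(rep.max) + "}";
    };
    if (0 != rep.min) {
        // At least one repetition is required, so no empty branch is needed.
        return operand + quantifier(rep.min);
    }
    // min == 0: emit the empty node (with the negative tag, if any) OR'd with R{1,N}.
    return "(" + negative_tag + ")|(" + operand + quantifier(1) + ")";
}
```

With this sketch, an untagged `a{0,10}` and a tagged `(?<letter>a){0,10}` differ only in the empty branch: the former serializes the empty node as `()`, the latter as `(<~0>)`.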

Validation performed

  • Create unit-tests for repetition regex.

Summary by CodeRabbit

  • New Features

    • Introduced a new UniqueIdGenerator class for generating unique IDs.
    • Added a new derived class RegexASTEmpty to enhance regex AST structure.
    • Enhanced regex parsing capabilities with improved handling of various input cases.
    • Updated example programs with clearer descriptions and new functionality.
    • Modified variable declarations in example programs to include naming conventions for clarity.
  • Bug Fixes

    • Improved regex parsing capabilities and handling of various input cases.
  • Tests

    • Expanded test coverage for regex AST serialization with a new helper function.
  • Chores

    • Updated dependency management to include the Boost library.

SharafMohamed and others added 30 commits September 11, 2024 20:07
… tag to matching_variable_id; Use full names for vars (r->rule); Clarify if states are NFA or DFA
…for clairty that nothing is shared b/w tests
…egexASTgroup with min = 1 OR'd with RegexASTEmpty
…iteral arguments; Use const& for non-literals; Use auto where possible; Use uint32_t over int for ids; replace begin() and end() with cbegin() and cend()

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
tests/test-lexer.cpp (2)

76-77: Ensure consistency in add_variable method calls.

While the changes to use string concatenation for var_schema are appropriate, there's an inconsistency in how add_variable is called compared to the previous section. Consider using string_view here as well for consistency:

-schema.add_variable(var_schema, -1);
+schema.add_variable(string_view(var_schema), -1);

146-161: LGTM: Comprehensive test cases for repetition regex.

The new test section for repetition regex is well-structured and covers a good range of scenarios. The use of the test_regex_ast helper function improves readability and consistency.

Consider adding a comment explaining the difference between tagged and untagged ASTs for clarity:

// Repetition without capture groups: untagged and tagged ASTs are the same
test_regex_ast("capture:a{0,10}", U"()|(a{1,10})");
// ...

// Repetition with capture groups: untagged and tagged ASTs are different
test_regex_ast("capture:(?<letter>a){0,10}", U"(<~0>)|((a)<0>{1,10})");
// ...
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 083b29d and f29cb07.

📒 Files selected for processing (1)
  • tests/test-lexer.cpp (6 hunks)
🧰 Additional context used
🔇 Additional comments (5)
tests/test-lexer.cpp (5)

1-1: LGTM: New includes and using declarations are appropriate.

The added include statements and using declarations are necessary for the new helper functions and improve code readability.

Also applies to: 3-4, 6-6, 16-16, 18-19


37-53: LGTM: Well-structured helper function declarations.

The anonymous namespace and function declarations with docstrings are well-structured and provide clear information about the helper functions' purpose and parameters.


59-60: LGTM: Improved variable schema construction and method call.

The changes to use string concatenation for var_schema and string_view for the add_variable method call are appropriate and align with previous suggestions.


165-177: LGTM: Well-implemented test_regex_ast function.

The implementation of the test_regex_ast function is correct and aligns with its declared purpose. The use of fmt::format for string formatting and the u32string_to_string helper function for conversion is appropriate.


179-182: LGTM: Efficient implementation of u32string_to_string.

The implementation of the u32string_to_string function is correct and efficient for converting UTF-32 to UTF-8. This addresses the previous suggestion to use a standard library function for UTF-8 encoding.

@SharafMohamed SharafMohamed changed the title from "Bug-Fix: Add negative tags for RegexMultiplicationAST with min=0." to "Bug-Fix: Add negative tags for RegexMultiplicationAST with min=0; Update README to include intersection-test." Oct 7, 2024

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (6)
examples/README.md (2)

8-9: Great addition of the new example program!

The introduction of the intersect-test program expands the range of functionality demonstrated in the examples. This is a valuable addition for users exploring the project's capabilities.

Consider adding a brief explanation or link to resources about DFAs (Deterministic Finite Automata) for users who might be unfamiliar with the concept. This could further improve the accessibility of the README.


30-30: Command added for the new example program.

The addition of the command to run the intersect-test program is consistent with the introduction of this new example.

For consistency with the other example commands, consider adding an explanation of any arguments required for the intersect-test program. If no arguments are needed, it might be helpful to explicitly state this to avoid confusion.

tests/test-lexer.cpp (3)

37-53: LGTM: Well-structured utility function declarations

The addition of an unnamed namespace for local helper functions is a good practice. The function declarations are well-formatted and use appropriate parameter types. The docstrings provide clear explanations of the functions' purposes.

Consider updating the docstring for u32string_to_string to be more precise:

- * Convert the characters in a 32-byte unicode string into 4 bytes to generate a 8-byte unciode
+ * Convert a UTF-32 string to a UTF-8 string.

This change more accurately describes the function's purpose without getting into implementation details.


145-170: LGTM: Comprehensive test cases for repetition regex

The new test cases for repetition regex are well-structured and cover a wide range of scenarios, including both tagged and untagged ASTs. The use of the test_regex_ast function makes the tests concise and easy to read. The complex repetition test case is particularly valuable for ensuring correct behavior in more intricate scenarios.

To improve readability, consider adding a brief comment before each group of related test cases to explain what aspect of repetition regex they are testing. For example:

// Basic repetition without capture groups
test_regex_ast("capture:a{0,10}", U"()|(a{1,10})");
test_regex_ast("capture:a{5,10}", U"a{5,10}");
// ... more tests ...

// Repetition with capture groups
test_regex_ast("capture:(?<letter>a){0,10}", U"(<~0>)|((a)<0>{1,10})");
// ... more tests ...

This would make it easier for other developers to understand the purpose of each group of tests at a glance.


173-186: LGTM: Well-implemented test_regex_ast function

The implementation of the test_regex_ast function is clear and aligns well with its intended purpose. The use of dynamic_cast for type checking is appropriate in this testing context.

To improve error reporting when a test fails, consider using Catch2's REQUIRE_THAT with a string matcher. This will provide more detailed output about string differences:

REQUIRE_THAT(actual_string, Catch::Matchers::Equals(expected_string));

This change would make it easier to identify discrepancies between the actual and expected serialized ASTs when a test fails.

src/log_surgeon/SchemaParser.cpp (1)

249-257: Improved handling of regex range matches with zero minimum

The changes to the regex_match_range_rule function correctly implement the handling of regex repetition constructs with a minimum value of zero, which is a key objective of this PR. The approach is consistent with the regex_match_zero_or_more_rule function, ensuring uniform behaviour across different regex constructs.

For consistency, consider using if (0 == min) instead of if (min == 0) to match the style used in other parts of the codebase.

-    if (min == 0) {
+    if (0 == min) {
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between f29cb07 and 4379bc5.

📒 Files selected for processing (5)
  • examples/README.md (2 hunks)
  • examples/intersect-test.cpp (2 hunks)
  • src/log_surgeon/SchemaParser.cpp (5 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (3 hunks)
  • tests/test-lexer.cpp (6 hunks)
🧰 Additional context used
🔇 Additional comments (12)
examples/README.md (1)

3-6: Excellent clarification of example programs!

The updates provide specific names for the example programs and clearly explain their purpose. This improvement enhances the README's usefulness for users exploring the project's examples.

examples/intersect-test.cpp (2)

32-32: Improved variable naming convention

The change to prefix the "search" variable with "search:" aligns with the new naming convention used throughout the file. This modification enhances clarity by explicitly indicating the purpose of the variable within the schema.


54-59: Enhanced schema variable declarations

The modifications to the schema variable declarations improve code readability and self-documentation. By prefixing each regex pattern with its corresponding variable name (e.g., "int:", "float:", etc.), the purpose of each pattern becomes immediately clear. This change maintains consistency with the new naming convention and facilitates easier understanding and maintenance of the schema.

tests/test-lexer.cpp (2)

1-6: LGTM: New includes added appropriately

The new includes for <codecvt>, <locale>, and <string_view> are necessary for the added utility functions. Good job on keeping the includes organized and relevant to the changes made.


16-21: LGTM: Appropriate using declarations added

The new using declarations for std::codecvt_utf8, std::string_view, and std::u32string are correctly added and align with the new includes. These are used effectively in the new utility functions.

src/log_surgeon/SchemaParser.cpp (4)

7-8: Improved string handling and new regex AST type added

The addition of the <string_view> header and the new RegexASTEmptyByte type alias are positive changes. Using string_view can lead to more efficient string handling, especially for read-only operations. The new RegexASTEmptyByte type likely supports the handling of empty matches in regex operations, which aligns with the PR objectives.

Also applies to: 35-36


85-85: Performance improvement in try_schema_string function signature

The change from string const& to string_view const in the function signature is a good optimization. Using string_view can lead to better performance, especially for read-only operations on strings, as it avoids unnecessary copying and allows for more efficient string handling.
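
To illustrate the point about avoiding copies, here is a minimal sketch of a function that accepts a `std::string_view`. The helper `split_variable_schema` is hypothetical and not part of log-surgeon; it only assumes the `name:regex` schema-line shape seen in the examples and tests of this PR.

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <utility>

// Hypothetical helper, not part of log-surgeon: splits "name:regex" at the first ':'.
// Both halves are views into the caller's buffer, so no characters are copied.
auto split_variable_schema(std::string_view schema)
        -> std::optional<std::pair<std::string_view, std::string_view>> {
    auto const colon_pos = schema.find(':');
    if (std::string_view::npos == colon_pos) {
        return std::nullopt;
    }
    return std::make_pair(schema.substr(0, colon_pos), schema.substr(colon_pos + 1));
}
```

The same function accepts both a string literal and a `std::string` without constructing a temporary `std::string`. The trade-off is lifetime: the returned views are only valid while the buffer passed in is alive.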


204-208: Improved handling of zero-or-more regex matches

The changes to the regex_match_zero_or_more_rule function significantly improve the handling of regex patterns with zero-or-more repetitions. By treating R* as R+ | ∅, the function now correctly handles negative tags, which is a key objective of this PR. The use of RegexASTEmptyByte and RegexASTOrByte provides a more accurate representation of the regex AST for these cases.


Line range hint 1-1023: Overall improvements in regex handling and consistency

The changes in this file successfully address the PR objectives by improving the handling of regex repetition constructs, particularly those with a minimum value of zero. The modifications to regex_match_zero_or_more_rule and regex_match_range_rule are consistent and well-implemented. The addition of RegexASTEmptyByte and the use of string_view contribute to more efficient and accurate regex parsing.

These changes enhance the capability of the schema parser to handle complex regex patterns, especially those involving optional matches. The code maintains good readability and consistency throughout the file.

src/log_surgeon/finite_automata/RegexAST.hpp (3)

127-162: Implementation of RegexASTEmpty class is appropriate.

The RegexASTEmpty class correctly represents an empty node in the regex AST, and its methods are properly implemented.


273-273: Addition of default constructor to RegexASTGroup.

The default constructor RegexASTGroup() = default; ensures that the class can be default-initialized, which is beneficial.


695-698: Correct implementation of serialize() method for RegexASTEmpty.

The serialize() method correctly returns the serialization of negative tags for an empty node.


@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
src/log_surgeon/finite_automata/RegexAST.hpp (1)

127-162: LGTM! New RegexASTEmpty class is well-implemented.

The new RegexASTEmpty class is correctly implemented and aligns with the PR objectives. It provides a clean way to handle regex repetitions with a minimum of 0.

Consider adding an assertion in the add_to_nfa method to ensure that the nfa and end_state parameters are not null. This can help catch potential issues early:

 auto add_to_nfa(
         [[maybe_unused]] RegexNFA<NFAStateType>* nfa,
         [[maybe_unused]] NFAStateType* end_state
 ) const -> void override {
+    assert(nfa != nullptr && end_state != nullptr);
     // Do nothing as adding an empty node to the NFA is a null operation.
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 4379bc5 and 0e99f67.

📒 Files selected for processing (2)
  • src/log_surgeon/finite_automata/RegexAST.hpp (4 hunks)
  • tests/test-lexer.cpp (5 hunks)
🧰 Additional context used
🔇 Additional comments (10)
tests/test-lexer.cpp (8)

1-6: LGTM: New include directives added

The new include directives for <codecvt>, <locale>, and <string_view> are appropriate for the added functionality in this file.


16-21: LGTM: New using declarations added

The new using declarations for std::codecvt_utf8, std::string_view, std::u32string, and std::wstring_convert are consistent with the added include directives and are used in the new functionality.


37-53: LGTM: Helper functions added in unnamed namespace

The addition of test_regex_ast and u32string_to_string functions in an unnamed namespace is appropriate. The function declarations are well-documented with clear comments explaining their purpose and parameters.


59-60: Consider using std::string::operator+= for efficient string concatenation

The current implementation uses string concatenation to create var_schema. While this works, it can be less efficient for multiple concatenations. Consider using std::string::operator+= for more efficient string concatenation:

string var_schema = var_name;
var_schema += ':';
var_schema += "123";  // or the regex pattern

This approach minimizes temporary string creations and can be more efficient, especially for longer strings or multiple concatenations.

Also applies to: 76-77


123-144: LGTM: Comprehensive test case added for AST with tags

This new test case thoroughly validates the serialization of a regex AST with named capture groups. It tests complex patterns and verifies the correct tagging of positive and negative matches in the serialized output. The use of test_regex_ast function enhances readability and maintainability.


146-189: LGTM: Comprehensive test cases added for repetition regex

These new test cases thoroughly cover various scenarios of repetition regex, including:

  • Repetition without capture groups
  • Repetition with capture groups
  • Capture groups with repetition
  • Complex repetition patterns

The tests validate both tagged and untagged AST outputs, ensuring correct handling of repetition constructs in different contexts. This comprehensive set of tests will help maintain the integrity of the regex parsing functionality.


193-205: LGTM: Implementation of test_regex_ast function

The implementation of test_regex_ast function is correct and efficient. It properly creates a Schema, adds a variable, and compares the serialized output with the expected result. The use of dynamic_cast for type checking is appropriate in this context.


207-212: LGTM: Implementation of u32string_to_string function

The implementation of u32string_to_string function correctly converts char32_t strings to UTF-8 strings. While it uses deprecated components (std::wstring_convert and std::codecvt_utf8), it's understood that this is the current recommendation until a new stable version is released.
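
For readers who want to avoid the deprecated components, the conversion can also be written by hand. The following is an illustrative alternative, not the helper used in this PR; the name `utf32_to_utf8` is hypothetical, and the sketch assumes every input code point is a valid Unicode scalar value (no surrogates, nothing above U+10FFFF).

```cpp
#include <string>
#include <string_view>

// Hand-rolled UTF-32 to UTF-8 encoder; an illustrative alternative to
// std::wstring_convert. Assumes all code points are valid Unicode scalar values.
auto utf32_to_utf8(std::u32string_view input) -> std::string {
    std::string output;
    for (char32_t const code_point : input) {
        if (code_point < 0x80) {
            // 1 byte: 0xxxxxxx
            output.push_back(static_cast<char>(code_point));
        } else if (code_point < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            output.push_back(static_cast<char>(0xC0 | (code_point >> 6)));
            output.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
        } else if (code_point < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            output.push_back(static_cast<char>(0xE0 | (code_point >> 12)));
            output.push_back(static_cast<char>(0x80 | ((code_point >> 6) & 0x3F)));
            output.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
        } else {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            output.push_back(static_cast<char>(0xF0 | (code_point >> 18)));
            output.push_back(static_cast<char>(0x80 | ((code_point >> 12) & 0x3F)));
            output.push_back(static_cast<char>(0x80 | ((code_point >> 6) & 0x3F)));
            output.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
        }
    }
    return output;
}
```

Since the serialized ASTs in these tests are mostly ASCII, only the first branch is exercised in practice; the remaining branches matter once non-ASCII literals appear in a regex.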

src/log_surgeon/finite_automata/RegexAST.hpp (2)

Line range hint 865-870: LGTM! Improved serialization in RegexASTMultiplication.

The changes to the serialize method in RegexASTMultiplication enhance the output format by using std::u32string consistently and correctly handling the infinite case. This improvement aligns well with the PR objectives.


Line range hint 1-1054: Overall changes are well-implemented and align with PR objectives.

The addition of the RegexASTEmpty class and the improvements to the RegexASTMultiplication class are the main changes in this file. These modifications effectively address the handling of regex repetition constructs with a minimum value of zero, as outlined in the PR objectives. The changes are well-integrated into the existing codebase and do not introduce any apparent issues.

Member

@LinZhihao-723 LinZhihao-723 left a comment

The PR title looks good to me.

@LinZhihao-723 LinZhihao-723 merged commit bb06e57 into y-scope:main Oct 7, 2024
7 checks passed
SharafMohamed added a commit to SharafMohamed/log-surgeon that referenced this pull request Oct 8, 2024