Bug-Fix: Add negative tags for `RegexMultiplicationAST` with `min=0`; Update README to include `intersection-test`. #41

SharafMohamed · 2024-09-13T18:03:46Z

References

Depends on PR#40.

Description

Previously, RegexASTMultiplication was missing the negative tags needed to generate a tagged NFA. Specifically, for a regex repetition (e.g. R{0,N} or R*) containing a capture group, the zero-repetition case means the capture group is not matched, so a negative tag must be added for that case. To handle this, we do the following:

- Create an empty regex AST node, ∅.
- Treat R{0,N} as R{1,N} | ∅.
- Treat R* as R+ | ∅.
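
The rewrite above can be sketched with a toy AST. This is a minimal illustration under stated assumptions, not log-surgeon's implementation: the node types `Literal`, `Capture`, `Empty`, `Repeat`, `Or` and the helper `make_repetition` are hypothetical, and the serialization format only imitates the strings quoted in this PR's review comments (positive tag `<0>`, negative tag `<~0>`). Only bounded repetition is modeled.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-ins for AST nodes; not log-surgeon's classes.
struct Node {
    virtual ~Node() = default;
    [[nodiscard]] virtual auto serialize() const -> std::string = 0;
};

using NodePtr = std::unique_ptr<Node>;

struct Literal : Node {
    explicit Literal(char c) : m_char{c} {}
    [[nodiscard]] auto serialize() const -> std::string override {
        return std::string(1, m_char);
    }
    char m_char;
};

// Capture group: serialized as "(child)<tag>", where <tag> is the positive tag.
struct Capture : Node {
    Capture(NodePtr child, std::uint32_t tag) : m_child(std::move(child)), m_tag{tag} {}
    [[nodiscard]] auto serialize() const -> std::string override {
        return "(" + m_child->serialize() + ")<" + std::to_string(m_tag) + ">";
    }
    NodePtr m_child;
    std::uint32_t m_tag;
};

// Empty node: consumes no input, but carries a negative tag "<~id>" for every
// capture group that is skipped when the repetition matches zero times.
struct Empty : Node {
    explicit Empty(std::vector<std::uint32_t> tags) : m_negative_tags(std::move(tags)) {}
    [[nodiscard]] auto serialize() const -> std::string override {
        std::string result;
        for (auto const tag : m_negative_tags) {
            result += "<~" + std::to_string(tag) + ">";
        }
        return result;
    }
    std::vector<std::uint32_t> m_negative_tags;
};

struct Repeat : Node {
    Repeat(NodePtr operand, std::uint32_t min, std::uint32_t max)
            : m_operand(std::move(operand)), m_min{min}, m_max{max} {}
    [[nodiscard]] auto serialize() const -> std::string override {
        return m_operand->serialize() + "{" + std::to_string(m_min) + ","
               + std::to_string(m_max) + "}";
    }
    NodePtr m_operand;
    std::uint32_t m_min;
    std::uint32_t m_max;
};

struct Or : Node {
    Or(NodePtr left, NodePtr right) : m_left(std::move(left)), m_right(std::move(right)) {}
    [[nodiscard]] auto serialize() const -> std::string override {
        return "(" + m_left->serialize() + ")|(" + m_right->serialize() + ")";
    }
    NodePtr m_left;
    NodePtr m_right;
};

// Builds R{min,max}. When min is 0, builds ∅ | R{1,max} instead, so the
// zero-repetition branch can carry the negative tags of R's capture groups.
[[nodiscard]] auto make_repetition(
        NodePtr operand,
        std::uint32_t min,
        std::uint32_t max,
        std::vector<std::uint32_t> capture_tags
) -> NodePtr {
    assert(min <= max && 1 <= max);
    if (0 == min) {
        return std::make_unique<Or>(
                std::make_unique<Empty>(std::move(capture_tags)),
                std::make_unique<Repeat>(std::move(operand), 1U, max)
        );
    }
    return std::make_unique<Repeat>(std::move(operand), min, max);
}
```

Under these assumptions, `a{0,10}` serializes as `()|(a{1,10})`, while a captured `a` with tag 0 repeated `{0,10}` serializes as `(<~0>)|((a)<0>{1,10})`. An unbounded `R*` would follow the same pattern with `R+` in place of `R{1,N}`.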

Validation performed

Create unit-tests for repetition regex.

Summary by CodeRabbit

New Features
- Introduced a new UniqueIdGenerator class for generating unique IDs.
- Added a new derived class RegexASTEmpty to enhance regex AST structure.
- Enhanced regex parsing capabilities with improved handling of various input cases.
- Updated example programs with clearer descriptions and new functionality.
- Modified variable declarations in example programs to include naming conventions for clarity.
Bug Fixes
- Improved regex parsing capabilities and handling of various input cases.
Tests
- Expanded test coverage for regex AST serialization with a new helper function.
Chores
- Updated dependency management to include the Boost library.

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)

tests/test-lexer.cpp (2)
76-77: Ensure consistency in add_variable method calls.

While the changes to use string concatenation for var_schema are appropriate, there's an inconsistency in how add_variable is called compared to the previous section. Consider using string_view here as well for consistency:
-schema.add_variable(var_schema, -1);
+schema.add_variable(string_view(var_schema), -1);
146-161: LGTM: Comprehensive test cases for repetition regex.

The new test section for repetition regex is well-structured and covers a good range of scenarios. The use of the test_regex_ast helper function improves readability and consistency.

Consider adding a comment explaining the difference between tagged and untagged ASTs for clarity:
// Repetition without capture groups: untagged and tagged ASTs are the same
test_regex_ast("capture:a{0,10}", U"()|(a{1,10})");
// ...

// Repetition with capture groups: untagged and tagged ASTs are different
test_regex_ast("capture:(?<letter>a){0,10}", U"(<~0>)|((a)<0>{1,10})");
// ...

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 083b29d and f29cb07.

📒 Files selected for processing (1)

tests/test-lexer.cpp (6 hunks)

🧰 Additional context used

🔇 Additional comments (5)

tests/test-lexer.cpp (5)

1-1: LGTM: New includes and using declarations are appropriate.

The added include statements and using declarations are necessary for the new helper functions and improve code readability.

Also applies to: 3-4, 6-6, 16-16, 18-19

37-53: LGTM: Well-structured helper function declarations.

The anonymous namespace and function declarations with docstrings are well-structured and provide clear information about the helper functions' purpose and parameters.

59-60: LGTM: Improved variable schema construction and method call.

The changes to use string concatenation for var_schema and string_view for the add_variable method call are appropriate and align with previous suggestions.

165-177: LGTM: Well-implemented test_regex_ast function.

The implementation of the test_regex_ast function is correct and aligns with its declared purpose. The use of fmt::format for string formatting and the u32string_to_string helper function for conversion is appropriate.

179-182: LGTM: Efficient implementation of u32string_to_string.

The implementation of the u32string_to_string function is correct and efficient for converting UTF-32 to UTF-8. This addresses the previous suggestion to use a standard library function for UTF-8 encoding.

coderabbitai

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (6)

examples/README.md (2)

8-9: Great addition of the new example program!

The introduction of the intersect-test program expands the range of functionality demonstrated in the examples. This is a valuable addition for users exploring the project's capabilities.

Consider adding a brief explanation or link to resources about DFAs (Deterministic Finite Automata) for users who might be unfamiliar with the concept. This could further improve the accessibility of the README.

30-30: Command added for the new example program.

The addition of the command to run the intersect-test program is consistent with the introduction of this new example.

For consistency with the other example commands, consider adding an explanation of any arguments required for the intersect-test program. If no arguments are needed, it might be helpful to explicitly state this to avoid confusion.
tests/test-lexer.cpp (3)
37-53: LGTM: Well-structured utility function declarations

The addition of an unnamed namespace for local helper functions is a good practice. The function declarations are well-formatted and use appropriate parameter types. The docstrings provide clear explanations of the functions' purposes.

Consider updating the docstring for u32string_to_string to be more precise:
- * Convert the characters in a 32-byte unicode string into 4 bytes to generate a 8-byte unciode
+ * Convert a UTF-32 string to a UTF-8 string.
This change more accurately describes the function's purpose without getting into implementation details.

145-170: LGTM: Comprehensive test cases for repetition regex

The new test cases for repetition regex are well-structured and cover a wide range of scenarios, including both tagged and untagged ASTs. The use of the test_regex_ast function makes the tests concise and easy to read. The complex repetition test case is particularly valuable for ensuring correct behavior in more intricate scenarios.

To improve readability, consider adding a brief comment before each group of related test cases to explain what aspect of repetition regex they are testing. For example:
// Basic repetition without capture groups
test_regex_ast("capture:a{0,10}", U"()|(a{1,10})");
test_regex_ast("capture:a{5,10}", U"a{5,10}");
// ... more tests ...

// Repetition with capture groups
test_regex_ast("capture:(?<letter>a){0,10}", U"(<~0>)|((a)<0>{1,10})");
// ... more tests ...
This would make it easier for other developers to understand the purpose of each group of tests at a glance.

173-186: LGTM: Well-implemented test_regex_ast function

The implementation of the test_regex_ast function is clear and aligns well with its intended purpose. The use of dynamic_cast for type checking is appropriate in this testing context.

To improve error reporting when a test fails, consider using Catch2's REQUIRE_THAT with a string matcher. This will provide more detailed output about string differences:
REQUIRE_THAT(actual_string, Catch::Matchers::Equals(expected_string));
This change would make it easier to identify discrepancies between the actual and expected serialized ASTs when a test fails.
src/log_surgeon/SchemaParser.cpp (1)
249-257: Improved handling of regex range matches with zero minimum

The changes to the regex_match_range_rule function correctly implement the handling of regex repetition constructs with a minimum value of zero, which is a key objective of this PR. The approach is consistent with the regex_match_zero_or_more_rule function, ensuring uniform behaviour across different regex constructs.

For consistency, consider using if (0 == min) instead of if (min == 0) to match the style used in other parts of the codebase.
-    if (min == 0) {
+    if (0 == min) {

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between f29cb07 and 4379bc5.

📒 Files selected for processing (5)

examples/README.md (2 hunks)
examples/intersect-test.cpp (2 hunks)
src/log_surgeon/SchemaParser.cpp (5 hunks)
src/log_surgeon/finite_automata/RegexAST.hpp (3 hunks)
tests/test-lexer.cpp (6 hunks)

🧰 Additional context used

🔇 Additional comments (12)

examples/README.md (1)

3-6: Excellent clarification of example programs!

The updates provide specific names for the example programs and clearly explain their purpose. This improvement enhances the README's usefulness for users exploring the project's examples.

examples/intersect-test.cpp (2)

32-32: Improved variable naming convention

The change to prefix the "search" variable with "search:" aligns with the new naming convention used throughout the file. This modification enhances clarity by explicitly indicating the purpose of the variable within the schema.

54-59: Enhanced schema variable declarations

The modifications to the schema variable declarations improve code readability and self-documentation. By prefixing each regex pattern with its corresponding variable name (e.g., "int:", "float:", etc.), the purpose of each pattern becomes immediately clear. This change maintains consistency with the new naming convention and facilitates easier understanding and maintenance of the schema.

tests/test-lexer.cpp (2)

1-6: LGTM: New includes added appropriately

The new includes for <codecvt>, <locale>, and <string_view> are necessary for the added utility functions. Good job on keeping the includes organized and relevant to the changes made.

16-21: LGTM: Appropriate using declarations added

The new using declarations for std::codecvt_utf8, std::string_view, and std::u32string are correctly added and align with the new includes. These are used effectively in the new utility functions.

src/log_surgeon/SchemaParser.cpp (4)

7-8: Improved string handling and new regex AST type added

The addition of the <string_view> header and the new RegexASTEmptyByte type alias are positive changes. Using string_view can lead to more efficient string handling, especially for read-only operations. The new RegexASTEmptyByte type likely supports the handling of empty matches in regex operations, which aligns with the PR objectives.

Also applies to: 35-36

85-85: Performance improvement in try_schema_string function signature

The change from string const& to string_view const in the function signature is a good optimization. Using string_view can lead to better performance, especially for read-only operations on strings, as it avoids unnecessary copying and allows for more efficient string handling.
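
As a rough illustration of why a `string_view` parameter suits read-only parsing, the sketch below splits a `name:regex` schema string (the shape used by the test inputs in this PR, such as `capture:a{0,10}`) without copying any characters. The helper `split_variable_schema` and the alias `SchemaParts` are hypothetical and are not part of log-surgeon; splitting at the first colon is an assumption made for the example.

```cpp
#include <cassert>
#include <optional>
#include <string_view>
#include <utility>

// Hypothetical alias: (variable name, regex pattern), both views into the input.
using SchemaParts = std::pair<std::string_view, std::string_view>;

// Hypothetical helper: splits "name:regex" at the first ':'.
// Accepts string literals, std::string, or sub-views without allocating.
[[nodiscard]] auto split_variable_schema(std::string_view const schema)
        -> std::optional<SchemaParts> {
    auto const colon_pos{schema.find(':')};
    if (std::string_view::npos == colon_pos) {
        return std::nullopt;
    }
    assert(colon_pos < schema.size());
    // substr() on a string_view returns another view; no characters are copied.
    return SchemaParts{schema.substr(0, colon_pos), schema.substr(colon_pos + 1)};
}
```

The returned views are only valid while the original buffer is alive, so a caller that needs to keep the parts beyond that point should copy them into `std::string` objects.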

204-208: Improved handling of zero-or-more regex matches

The changes to the regex_match_zero_or_more_rule function significantly improve the handling of regex patterns with zero-or-more repetitions. By treating R* as R+ | ∅, the function now correctly handles negative tags, which is a key objective of this PR. The use of RegexASTEmptyByte and RegexASTOrByte provides a more accurate representation of the regex AST for these cases.

Line range hint 1-1023: Overall improvements in regex handling and consistency

The changes in this file successfully address the PR objectives by improving the handling of regex repetition constructs, particularly those with a minimum value of zero. The modifications to regex_match_zero_or_more_rule and regex_match_range_rule are consistent and well-implemented. The addition of RegexASTEmptyByte and the use of string_view contribute to more efficient and accurate regex parsing.

These changes enhance the capability of the schema parser to handle complex regex patterns, especially those involving optional matches. The code maintains good readability and consistency throughout the file.

src/log_surgeon/finite_automata/RegexAST.hpp (3)

127-162: Implementation of RegexASTEmpty class is appropriate.

The RegexASTEmpty class correctly represents an empty node in the regex AST, and its methods are properly implemented.

273-273: Addition of default constructor to RegexASTGroup.

The default constructor RegexASTGroup() = default; ensures that the class can be default-initialized, which is beneficial.

695-698: Correct implementation of serialize() method for RegexASTEmpty.

The serialize() method correctly returns the serialization of negative tags for an empty node.

coderabbitai

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)

src/log_surgeon/finite_automata/RegexAST.hpp (1)
127-162: LGTM! New RegexASTEmpty class is well-implemented.

The new RegexASTEmpty class is correctly implemented and aligns with the PR objectives. It provides a clean way to handle regex repetitions with a minimum of 0.

Consider adding an assertion in the add_to_nfa method to ensure that the nfa and end_state parameters are not null. This can help catch potential issues early:
 auto add_to_nfa(
         [[maybe_unused]] RegexNFA<NFAStateType>* nfa,
         [[maybe_unused]] NFAStateType* end_state
 ) const -> void override {
+    assert(nfa != nullptr && end_state != nullptr);
     // Do nothing as adding an empty node to the NFA is a null operation.
 }

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 4379bc5 and 0e99f67.

📒 Files selected for processing (2)

src/log_surgeon/finite_automata/RegexAST.hpp (4 hunks)
tests/test-lexer.cpp (5 hunks)

🧰 Additional context used

🔇 Additional comments (10)

tests/test-lexer.cpp (8)
1-6: LGTM: New include directives added

The new include directives for <codecvt>, <locale>, and <string_view> are appropriate for the added functionality in this file.

16-21: LGTM: New using declarations added

The new using declarations for std::codecvt_utf8, std::string_view, std::u32string, and std::wstring_convert are consistent with the added include directives and are used in the new functionality.

37-53: LGTM: Helper functions added in unnamed namespace

The addition of test_regex_ast and u32string_to_string functions in an unnamed namespace is appropriate. The function declarations are well-documented with clear comments explaining their purpose and parameters.

59-60: Consider using std::string::operator+= for efficient string concatenation

The current implementation uses string concatenation to create var_schema. While this works, it can be less efficient for multiple concatenations. Consider using std::string::operator+= for more efficient string concatenation:
string var_schema = var_name;
var_schema += ':';
var_schema += "123";  // or the regex pattern
This approach minimizes temporary string creations and can be more efficient, especially for longer strings or multiple concatenations.

Also applies to: 76-77

123-144: LGTM: Comprehensive test case added for AST with tags

This new test case thoroughly validates the serialization of a regex AST with named capture groups. It tests complex patterns and verifies the correct tagging of positive and negative matches in the serialized output. The use of test_regex_ast function enhances readability and maintainability.

146-189: LGTM: Comprehensive test cases added for repetition regex

These new test cases thoroughly cover various scenarios of repetition regex, including:

Repetition without capture groups

Repetition with capture groups

Capture groups with repetition

Complex repetition patterns

The tests validate both tagged and untagged AST outputs, ensuring correct handling of repetition constructs in different contexts. This comprehensive set of tests will help maintain the integrity of the regex parsing functionality.

193-205: LGTM: Implementation of test_regex_ast function

The implementation of test_regex_ast function is correct and efficient. It properly creates a Schema, adds a variable, and compares the serialized output with the expected result. The use of dynamic_cast for type checking is appropriate in this context.

207-212: LGTM: Implementation of u32string_to_string function

The implementation of u32string_to_string function correctly converts char32_t strings to UTF-8 strings. While it uses deprecated components (std::wstring_convert and std::codecvt_utf8), it's understood that this is the current recommendation until a new stable version is released.
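
For readers who want to avoid the deprecated facilities, one option is a small hand-written encoder. The sketch below is not the `u32string_to_string` from this PR: the name `utf32_to_utf8` is hypothetical, and replacing invalid code points (surrogates and values above U+10FFFF) with U+FFFD is a design choice assumed for the example.

```cpp
#include <cassert>
#include <string>
#include <string_view>

// Hypothetical helper; not the u32string_to_string added in this PR.
[[nodiscard]] auto utf32_to_utf8(std::u32string_view const input) -> std::string {
    constexpr char32_t cReplacement{0xFFFD};
    std::string output;
    output.reserve(input.size());
    for (char32_t code_point : input) {
        // Surrogates and values above U+10FFFF are not valid scalar values.
        bool const is_surrogate{0xD800 <= code_point && code_point <= 0xDFFF};
        if (code_point > 0x10FFFF || is_surrogate) {
            code_point = cReplacement;
        }
        assert(code_point <= 0x10FFFF);
        if (code_point < 0x80) {
            // 1 byte: 0xxxxxxx
            output.push_back(static_cast<char>(code_point));
        } else if (code_point < 0x800) {
            // 2 bytes: 110xxxxx 10xxxxxx
            output.push_back(static_cast<char>(0xC0 | (code_point >> 6)));
            output.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
        } else if (code_point < 0x10000) {
            // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
            output.push_back(static_cast<char>(0xE0 | (code_point >> 12)));
            output.push_back(static_cast<char>(0x80 | ((code_point >> 6) & 0x3F)));
            output.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
        } else {
            // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
            output.push_back(static_cast<char>(0xF0 | (code_point >> 18)));
            output.push_back(static_cast<char>(0x80 | ((code_point >> 12) & 0x3F)));
            output.push_back(static_cast<char>(0x80 | ((code_point >> 6) & 0x3F)));
            output.push_back(static_cast<char>(0x80 | (code_point & 0x3F)));
        }
    }
    return output;
}
```

This trades a dependency on `std::wstring_convert` for a few lines of bit manipulation; whether that is worthwhile for a test helper is a judgment call.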
src/log_surgeon/finite_automata/RegexAST.hpp (2)

Line range hint 865-870: LGTM! Improved serialization in RegexASTMultiplication.

The changes to the serialize method in RegexASTMultiplication enhance the output format by using std::u32string consistently and correctly handling the infinite case. This improvement aligns well with the PR objectives.

Line range hint 1-1054: Overall changes are well-implemented and align with PR objectives.

The addition of the RegexASTEmpty class and the improvements to the RegexASTMultiplication class are the main changes in this file. These modifications effectively address the handling of regex repetition constructs with a minimum value of zero, as outlined in the PR objectives. The changes are well-integrated into the existing codebase and do not introduce any apparent issues.

LinZhihao-723

The PR title looks good to me.

SharafMohamed and others added 30 commits September 11, 2024 20:07

Bug-fix for unicode array sizes

a6274ec

Merge remote-tracking branch 'upstream/main' into nfa-cleanup-pr

186d239

Move LexicalRule to its own class; Change name to variable_id; Change…

4f122c6

… tag to matching_variable_id; Use full names for vars (r->rule); Clarify if states are NFA or DFA

Additional fix for swapping meaning of tag

c24f6e1

Another additional fix for swapping meaning of tag

33582da

Fix up some comments

3338ec7

Fix comment grammar

3cd3c0f

Add tags to AST; Serialize AST for testing; Add unit-test for testing…

e05acbb

… added tags

Use using to condense code; Use a unique schema object for each test …

5e61e83

…for clarity that nothing is shared b/w tests

Add has_capture_groups(); Add unit-test for has_capture_groups()

082090d

Create and use RegexASTEmpty to split RegexASTgroup with min=0 into R…

2c6d94e

…egexASTgroup with min = 1 OR'd with RegexASTEmpty

Add unit-test for 0 repetition regex

4e02f24

Add more tests for repetition regex

bb3c543

Return by value in literal getters; Use const instead of const& for l…

54027ad

…iteral arguments; Use const& for non-literals; Use auto where possible; Use uint32_t over int for ids; replace begin() and end() with cbegin() and cend()

Refactor new_state()

e58274f

Rename get_first_matching_variable_ids() to get_matching_variable_ids…

1321871

…(); Add docstring to RegexDFAStatePair

Remove redundant docstrings

c904755

Remove has_capture_groups()

ffe9a0f

Const and auto changes

913ed1a

Changed AST add functions to indicate the AST are being added to the …

7aa8a92

…NFA; Made add to nfa functions const

Merge branch 'nfa-cleanup-pr' into comment-cleanup

77e44a5

Merged with previous PR

d1d87e7

Merge branch 'tagged-ast' into pre-tagged-nfa-cleanup

f386a3b

Merge branch 'pre-tagged-nfa-cleanup' into regex-ast-empty

0c600d7

Change add in RegexASTEmpty to add_to_nfa

bedad75

Update src/log_surgeon/finite_automata/RegexAST.hpp

053d057

Co-authored-by: Lin Zhihao <[email protected]>

updated examples to use

a822307

Merge branch 'nfa-cleanup-pr' into comment-cleanup

0b9603a

TODO to clarify RegexAST class is actually nodes in the AST

2ef84d1

Merge branch 'main' into comment-cleanup

83bd518

Add a more complex repetition case.

a3bf94f

coderabbitai bot reviewed Oct 7, 2024

SharafMohamed added 2 commits October 7, 2024 11:10

Minor fixes.

9cff8ec

Fix test-interpretation and update README

4379bc5

SharafMohamed changed the title ~~Bug-Fix: Add negative tags for RegexMultiplicationAST with min=0.~~ Bug-Fix: Add negative tags for RegexMultiplicationAST with min=0; Update README to include intersection-test. Oct 7, 2024

coderabbitai bot reviewed Oct 7, 2024

examples/intersect-test.cpp (resolved)

tests/test-lexer.cpp (resolved)

tests/test-lexer.cpp (outdated, resolved)

SharafMohamed added 2 commits October 7, 2024 12:47

Update RegexMultiplication serialize to wrap its operand in brackets;…

d12e91f

… Improve format for complex regex repetition test-case .

Add needed clang-format tags.

12871f5

LinZhihao-723 requested changes Oct 7, 2024

tests/test-lexer.cpp (outdated, resolved)

tests/test-lexer.cpp (outdated, resolved)

tests/test-lexer.cpp (outdated, resolved)

tests/test-lexer.cpp (outdated, resolved)

tests/test-lexer.cpp (outdated, resolved)

SharafMohamed and others added 4 commits October 7, 2024 16:38

Remove redundant use of fmt::format and directly call u32string_to_st…

7623202

…ring. Co-authored-by: Lin Zhihao <[email protected]>

Fix docstring grammar.

b2308ab

Co-authored-by: Lin Zhihao <[email protected]>

Fix docstring grammar and typo.

acbd3f5

Co-authored-by: Lin Zhihao <[email protected]>

Add [[nodiscard]].

0e99f67

Co-authored-by: Lin Zhihao <[email protected]>

coderabbitai bot reviewed Oct 7, 2024

Combine test-lexer.cpp helpers into one namespace.

5a19cee

LinZhihao-723 approved these changes Oct 7, 2024

LinZhihao-723 merged commit bb06e57 into y-scope:main Oct 7, 2024
7 checks passed

SharafMohamed added a commit to SharafMohamed/log-surgeon that referenced this pull request Oct 8, 2024

Merge with y-scope#41.

7401910

coderabbitai bot mentioned this pull request Oct 8, 2024

Add support for tagged NFA; Use uint32_t to replace int for IDs. #42

Merged

This was referenced Oct 31, 2024

Constrain NFA states to contain only one negative transition. #46

Merged

Replace the integer capture group tag ID with a dedicated 'Tag' class. #48

Merged

coderabbitai bot mentioned this pull request Nov 18, 2024

feat: Split NFA positive tags into start and end transitions to encapsulate a capture group. #50

Merged
