Bug-Fix: Add negative tags for `RegexMultiplicationAST` with min=0; Update README to include `intersection-test`. #41
… tag to matching_variable_id; Use full names for vars (r->rule); Clarify if states are NFA or DFA
…for clairty that nothing is shared b/w tests
…egexASTgroup with min = 1 OR'd with RegexASTEmpty
…iteral arguments; Use const& for non-literals; Use auto where possible; Use uint32_t over int for ids; replace begin() and end() with cbegin() and cend()
…(); Add docstrign to RegexDFAStatePair
…NFA; Made add to nfa functions const
Co-authored-by: Lin Zhihao <[email protected]>
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (2)
tests/test-lexer.cpp (2)
76-77: Ensure consistency in `add_variable` method calls.
While the changes to use string concatenation for `var_schema` are appropriate, there's an inconsistency in how `add_variable` is called compared to the previous section. Consider using `string_view` here as well for consistency:
    -schema.add_variable(var_schema, -1);
    +schema.add_variable(string_view(var_schema), -1);
146-161: LGTM: Comprehensive test cases for repetition regex.
The new test section for repetition regex is well-structured and covers a good range of scenarios. The use of the `test_regex_ast` helper function improves readability and consistency.
Consider adding a comment explaining the difference between tagged and untagged ASTs for clarity:
    // Repetition without capture groups: untagged and tagged ASTs are the same
    test_regex_ast("capture:a{0,10}", U"()|(a{1,10})");
    // ...
    // Repetition with capture groups: untagged and tagged ASTs are different
    test_regex_ast("capture:(?<letter>a){0,10}", U"(<~0>)|((a)<0>{1,10})");
    // ...
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- tests/test-lexer.cpp (6 hunks)
🧰 Additional context used
🔇 Additional comments (5)
tests/test-lexer.cpp (5)
1-1: LGTM: New includes and using declarations are appropriate.
The added include statements and using declarations are necessary for the new helper functions and improve code readability.
Also applies to: 3-4, 6-6, 16-16, 18-19
37-53: LGTM: Well-structured helper function declarations.
The anonymous namespace and function declarations with docstrings are well-structured and provide clear information about the helper functions' purpose and parameters.
59-60: LGTM: Improved variable schema construction and method call.
The changes to use string concatenation for `var_schema` and `string_view` for the `add_variable` method call are appropriate and align with previous suggestions.
165-177: LGTM: Well-implemented `test_regex_ast` function.
The implementation of the `test_regex_ast` function is correct and aligns with its declared purpose. The use of `fmt::format` for string formatting and the `u32string_to_string` helper function for conversion is appropriate.
179-182: LGTM: Efficient implementation of `u32string_to_string`.
The implementation of the `u32string_to_string` function is correct and efficient for converting UTF-32 to UTF-8. This addresses the previous suggestion to use a standard library function for UTF-8 encoding.
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (6)
examples/README.md (2)
8-9: Great addition of the new example program!
The introduction of the `intersect-test` program expands the range of functionality demonstrated in the examples. This is a valuable addition for users exploring the project's capabilities.
Consider adding a brief explanation or link to resources about DFAs (Deterministic Finite Automata) for users who might be unfamiliar with the concept. This could further improve the accessibility of the README.
30-30: Command added for the new example program.
The addition of the command to run the `intersect-test` program is consistent with the introduction of this new example.
For consistency with the other example commands, consider adding an explanation of any arguments required for the `intersect-test` program. If no arguments are needed, it might be helpful to explicitly state this to avoid confusion.
tests/test-lexer.cpp (3)
37-53: LGTM: Well-structured utility function declarations
The addition of an unnamed namespace for local helper functions is a good practice. The function declarations are well-formatted and use appropriate parameter types. The docstrings provide clear explanations of the functions' purposes.
Consider updating the docstring for `u32string_to_string` to be more precise:
    - * Convert the characters in a 32-byte unicode string into 4 bytes to generate a 8-byte unciode
    + * Convert a UTF-32 string to a UTF-8 string.
This change more accurately describes the function's purpose without getting into implementation details.
145-170: LGTM: Comprehensive test cases for repetition regex
The new test cases for repetition regex are well-structured and cover a wide range of scenarios, including both tagged and untagged ASTs. The use of the `test_regex_ast` function makes the tests concise and easy to read. The complex repetition test case is particularly valuable for ensuring correct behavior in more intricate scenarios.
To improve readability, consider adding a brief comment before each group of related test cases to explain what aspect of repetition regex they are testing. For example:
    // Basic repetition without capture groups
    test_regex_ast("capture:a{0,10}", U"()|(a{1,10})");
    test_regex_ast("capture:a{5,10}", U"a{5,10}");
    // ... more tests ...
    // Repetition with capture groups
    test_regex_ast("capture:(?<letter>a){0,10}", U"(<~0>)|((a)<0>{1,10})");
    // ... more tests ...
This would make it easier for other developers to understand the purpose of each group of tests at a glance.
173-186: LGTM: Well-implemented `test_regex_ast` function
The implementation of the `test_regex_ast` function is clear and aligns well with its intended purpose. The use of `dynamic_cast` for type checking is appropriate in this testing context.
To improve error reporting when a test fails, consider using Catch2's `REQUIRE_THAT` with a string matcher. This will provide more detailed output about string differences:
    REQUIRE_THAT(actual_string, Catch::Matchers::Equals(expected_string));
This change would make it easier to identify discrepancies between the actual and expected serialized ASTs when a test fails.
src/log_surgeon/SchemaParser.cpp (1)
249-257: Improved handling of regex range matches with zero minimum
The changes to the `regex_match_range_rule` function correctly implement the handling of regex repetition constructs with a minimum value of zero, which is a key objective of this PR. The approach is consistent with the `regex_match_zero_or_more_rule` function, ensuring uniform behaviour across different regex constructs.
For consistency, consider using `if (0 == min)` instead of `if (min == 0)` to match the style used in other parts of the codebase.
    - if (min == 0) {
    + if (0 == min) {
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- examples/README.md (2 hunks)
- examples/intersect-test.cpp (2 hunks)
- src/log_surgeon/SchemaParser.cpp (5 hunks)
- src/log_surgeon/finite_automata/RegexAST.hpp (3 hunks)
- tests/test-lexer.cpp (6 hunks)
🧰 Additional context used
🔇 Additional comments (12)
examples/README.md (1)
3-6: Excellent clarification of example programs!
The updates provide specific names for the example programs and clearly explain their purpose. This improvement enhances the README's usefulness for users exploring the project's examples.
examples/intersect-test.cpp (2)
32-32: Improved variable naming convention
The change to prefix the "search" variable with "search:" aligns with the new naming convention used throughout the file. This modification enhances clarity by explicitly indicating the purpose of the variable within the schema.
54-59: Enhanced schema variable declarations
The modifications to the schema variable declarations improve code readability and self-documentation. By prefixing each regex pattern with its corresponding variable name (e.g., "int:", "float:", etc.), the purpose of each pattern becomes immediately clear. This change maintains consistency with the new naming convention and facilitates easier understanding and maintenance of the schema.
tests/test-lexer.cpp (2)
1-6: LGTM: New includes added appropriately
The new includes for `<codecvt>`, `<locale>`, and `<string_view>` are necessary for the added utility functions. Good job on keeping the includes organized and relevant to the changes made.
16-21: LGTM: Appropriate using declarations added
The new using declarations for `std::codecvt_utf8`, `std::string_view`, and `std::u32string` are correctly added and align with the new includes. These are used effectively in the new utility functions.
src/log_surgeon/SchemaParser.cpp (4)
7-8: Improved string handling and new regex AST type added
The addition of the `<string_view>` header and the new `RegexASTEmptyByte` type alias are positive changes. Using `string_view` can lead to more efficient string handling, especially for read-only operations. The new `RegexASTEmptyByte` type likely supports the handling of empty matches in regex operations, which aligns with the PR objectives.
Also applies to: 35-36
85-85: Performance improvement in `try_schema_string` function signature
The change from `string const&` to `string_view const` in the function signature is a good optimization. Using `string_view` can lead to better performance, especially for read-only operations on strings, as it avoids unnecessary copying and allows for more efficient string handling.
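As an illustration of the copy-free reads that `string_view` allows, the sketch below splits a `name:regex` schema string (the shape used in the tests, e.g. `capture:a{0,10}`) at its first colon. The helper `split_var_schema` is hypothetical and is not part of `SchemaParser`; it only shows that the returned pieces are views into the caller's buffer rather than new strings.

```cpp
#include <string_view>
#include <utility>

// Hypothetical helper for illustration; not a log-surgeon API.
// Splits "name:regex" at the first ':' and returns views into the input,
// so no characters are copied. If there is no ':', the regex part is empty.
auto split_var_schema(std::string_view var_schema)
        -> std::pair<std::string_view, std::string_view> {
    auto const colon_pos = var_schema.find(':');
    if (std::string_view::npos == colon_pos) {
        return {var_schema, std::string_view{}};
    }
    return {var_schema.substr(0, colon_pos), var_schema.substr(colon_pos + 1)};
}
```

Because the result only refers to the original buffer, the caller must keep that buffer alive for as long as the views are used.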
204-208: Improved handling of zero-or-more regex matches
The changes to the `regex_match_zero_or_more_rule` function significantly improve the handling of regex patterns with zero-or-more repetitions. By treating `R*` as `R+ | ∅`, the function now correctly handles negative tags, which is a key objective of this PR. The use of `RegexASTEmptyByte` and `RegexASTOrByte` provides a more accurate representation of the regex AST for these cases.
Line range hint 1-1023: Overall improvements in regex handling and consistency
The changes in this file successfully address the PR objectives by improving the handling of regex repetition constructs, particularly those with a minimum value of zero. The modifications to `regex_match_zero_or_more_rule` and `regex_match_range_rule` are consistent and well-implemented. The addition of `RegexASTEmptyByte` and the use of `string_view` contribute to more efficient and accurate regex parsing.
These changes enhance the capability of the schema parser to handle complex regex patterns, especially those involving optional matches. The code maintains good readability and consistency throughout the file.
src/log_surgeon/finite_automata/RegexAST.hpp (3)
127-162: Implementation of `RegexASTEmpty` class is appropriate.
The `RegexASTEmpty` class correctly represents an empty node in the regex AST, and its methods are properly implemented.
273-273: Addition of default constructor to `RegexASTGroup`.
The default constructor `RegexASTGroup() = default;` ensures that the class can be default-initialized, which is beneficial.
695-698: Correct implementation of `serialize()` method for `RegexASTEmpty`.
The `serialize()` method correctly returns the serialization of negative tags for an empty node.
… Improve format for complex regex repetition test-case .
…ring. Co-authored-by: Lin Zhihao <[email protected]>
Co-authored-by: Lin Zhihao <[email protected]>
Co-authored-by: Lin Zhihao <[email protected]>
Co-authored-by: Lin Zhihao <[email protected]>
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
src/log_surgeon/finite_automata/RegexAST.hpp (1)
127-162: LGTM! New `RegexASTEmpty` class is well-implemented.
The new `RegexASTEmpty` class is correctly implemented and aligns with the PR objectives. It provides a clean way to handle regex repetitions with a minimum of 0.
Consider adding an assertion in the `add_to_nfa` method to ensure that the `nfa` and `end_state` parameters are not null. This can help catch potential issues early:
     auto add_to_nfa(
             [[maybe_unused]] RegexNFA<NFAStateType>* nfa,
             [[maybe_unused]] NFAStateType* end_state
     ) const -> void override {
    +    assert(nfa != nullptr && end_state != nullptr);
         // Do nothing as adding an empty node to the NFA is a null operation.
     }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- src/log_surgeon/finite_automata/RegexAST.hpp (4 hunks)
- tests/test-lexer.cpp (5 hunks)
🧰 Additional context used
🔇 Additional comments (10)
tests/test-lexer.cpp (8)
1-6: LGTM: New include directives added
The new include directives for `<codecvt>`, `<locale>`, and `<string_view>` are appropriate for the added functionality in this file.
16-21: LGTM: New using declarations added
The new using declarations for `std::codecvt_utf8`, `std::string_view`, `std::u32string`, and `std::wstring_convert` are consistent with the added include directives and are used in the new functionality.
37-53: LGTM: Helper functions added in unnamed namespace
The addition of `test_regex_ast` and `u32string_to_string` functions in an unnamed namespace is appropriate. The function declarations are well-documented with clear comments explaining their purpose and parameters.
59-60: Consider using `std::string::operator+=` for efficient string concatenation
The current implementation uses string concatenation to create `var_schema`. While this works, it can be less efficient for multiple concatenations. Consider using `std::string::operator+=` for more efficient string concatenation:
    string var_schema = var_name;
    var_schema += ':';
    var_schema += "123"; // or the regex pattern
This approach minimizes temporary string creations and can be more efficient, especially for longer strings or multiple concatenations.
Also applies to: 76-77
123-144: LGTM: Comprehensive test case added for AST with tags
This new test case thoroughly validates the serialization of a regex AST with named capture groups. It tests complex patterns and verifies the correct tagging of positive and negative matches in the serialized output. The use of the `test_regex_ast` function enhances readability and maintainability.
146-189: LGTM: Comprehensive test cases added for repetition regex
These new test cases thoroughly cover various scenarios of repetition regex, including:
- Repetition without capture groups
- Repetition with capture groups
- Capture groups with repetition
- Complex repetition patterns
The tests validate both tagged and untagged AST outputs, ensuring correct handling of repetition constructs in different contexts. This comprehensive set of tests will help maintain the integrity of the regex parsing functionality.
193-205: LGTM: Implementation of `test_regex_ast` function
The implementation of the `test_regex_ast` function is correct and efficient. It properly creates a Schema, adds a variable, and compares the serialized output with the expected result. The use of `dynamic_cast` for type checking is appropriate in this context.
207-212: LGTM: Implementation of `u32string_to_string` function
The implementation of the `u32string_to_string` function correctly converts `char32_t` strings to UTF-8 strings. While it uses deprecated components (`std::wstring_convert` and `std::codecvt_utf8`), it's understood that this is the current recommendation until a new stable version is released.
src/log_surgeon/finite_automata/RegexAST.hpp (2)
Line range hint 865-870: LGTM! Improved serialization in `RegexASTMultiplication`.
The changes to the `serialize` method in `RegexASTMultiplication` enhance the output format by using `std::u32string` consistently and correctly handling the infinite case. This improvement aligns well with the PR objectives.
Line range hint 1-1054: Overall changes are well-implemented and align with PR objectives.
The addition of the `RegexASTEmpty` class and the improvements to the `RegexASTMultiplication` class are the main changes in this file. These modifications effectively address the handling of regex repetition constructs with a minimum value of zero, as outlined in the PR objectives. The changes are well-integrated into the existing codebase and do not introduce any apparent issues.
The PR title looks good to me.
References
Description
Previously, `RegexASTMultiplication` was missing negative tags needed for generating a tagged-NFA. Namely, for regex repetition (e.g. `R{0,N}` or `R*`) containing a capture group, the 0 case indicates the capture group is not matched. In this case we need to add a negative tag. As a result we do the following:
- `R{0,N}` is treated as `R{1,N} | ∅`
- `R*` is treated as `R+ | ∅`
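The rewrite can be sketched on plain strings, as below. This is only an illustration under stated assumptions, not the code in `SchemaParser.cpp`: `rewrite_repetition` is a hypothetical helper, it handles finite bounds only, and it mirrors the untagged serialization that the PR's tests expect (for example `a{0,10}` serializing as `()|(a{1,10})`).

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical helper for illustration; not part of log-surgeon.
// Renders `operand{min,max}` after the rewrite described above:
// a zero minimum becomes an explicit empty alternative OR'd with {1,max}.
// Assumes finite bounds with min <= max and max >= 1.
auto rewrite_repetition(std::string const& operand, std::uint32_t min, std::uint32_t max)
        -> std::string {
    assert(min <= max && max >= 1);
    auto const with_bounds = [&](std::uint32_t lower) -> std::string {
        return operand + "{" + std::to_string(lower) + "," + std::to_string(max) + "}";
    };
    if (0 == min) {
        // R{0,N} -> ∅ | R{1,N}; the empty branch is where a negative tag can attach.
        return "()|(" + with_bounds(1) + ")";
    }
    // A non-zero minimum needs no rewrite.
    return with_bounds(min);
}
```

Making the empty branch explicit gives the zero-repetition case its own node in the AST, which is where the negative tag for an unmatched capture group is recorded in the tagged serialization.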
Validation performed
Summary by CodeRabbit
New Features
- `UniqueIdGenerator` class for generating unique IDs.
- `RegexASTEmpty` to enhance regex AST structure.
Bug Fixes
Tests
Chores