Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

refactor: Remove redundant regex prefix from class names and fix capitalization. #59

Open
wants to merge 409 commits into
base: main
Choose a base branch
from

Conversation

SharafMohamed
Copy link
Contributor

@SharafMohamed SharafMohamed commented Dec 5, 2024

References

  • Depends on PR#58.
  • To review in parallel with PR#58, diff against PR#58 locally. In the repo run:
git fetch upstream pull/58/head:pr-58
git fetch upstream pull/59/head:pr-59
git diff pr-58 pr-59

Description

Additions:

  • Add <algorithm> to TaggedTransition.hpp as reordering of includes caused by file name changes causes previous clang-tidy warning to become a compiler error.

Renames:

  • DFAByteState to ByteDfaState
  • DFAUTF8State to Utf8DfaState
  • NFAByteState to ByteNfaState
  • NFAUTF8State to Utf8NfaState
  • Template parameters named DfaState/DfaStateType to TypedDfaState
  • Template parameters named NfaStateType to TypedNfaState
  • RegexDFA* to Dfa*
  • RegexNFA* to Nfa*
  • NFA* to Nfa*
  • DFA* to Dfa*
  • LALR1Parser to Lalr1Parser

Does not rename:

  • *AST* as there are too many instances of this and it will create conflicts with future PRs that are already submitted. This will be fixed in separate clang-tidy PRs for each file in the future.

Validation performed

Previously existing tests still succeed.

Summary by CodeRabbit

Release Notes

  • New Features

    • Introduced a new Dfa class for representing Deterministic Finite Automata.
    • Added methods for state management and intersection operations in the Dfa class.
  • Renaming and Refactoring

    • Renamed RegexNFA to Nfa and RegexDFA to Dfa across the codebase for consistency.
    • Updated related types and method signatures to reflect the new naming conventions.
    • Changed casing of LALR1Parser to Lalr1Parser for uniformity.
  • Documentation

    • Updated documentation comments for consistency, including renaming references to LALR1Parser to Lalr1Parser.
  • Bug Fixes

    • Corrected type references in various classes and tests to ensure proper functionality with the new naming conventions.

These changes enhance clarity and maintainability while introducing new capabilities for finite automata handling.

SharafMohamed and others added 30 commits October 31, 2024 09:42
…Should fix this so the ordering is determinstic.
…ename tag members to make it clear they are positions.
…eterministic. This is fine as tags are unique across disjoint subtrees.
…ypedefs to the top of the file to fix compilation error.
@SharafMohamed SharafMohamed changed the title refactor: Remove regex prefix from class names if it is redundant. refactor: Remove redundant regex prefix from class names. Dec 8, 2024
@SharafMohamed SharafMohamed changed the title refactor: Remove redundant regex prefix from class names. refactor: Remove redundant regex prefix from class names and fix capitalization. Dec 8, 2024
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (9)
src/log_surgeon/finite_automata/Dfa.hpp (1)

58-58: Address the TODO: Implement UTF-8 handling

The get_intersect method contains a TODO comment indicating that UTF-8 (multi-byte transitions) handling needs to be implemented. Please consider adding support for UTF-8 to ensure correct functionality with multi-byte character transitions.

Would you like assistance in generating the implementation for UTF-8 handling?

src/log_surgeon/finite_automata/DfaStatePair.hpp (1)

71-71: Address the TODO: Implement UTF-8 handling

The get_reachable_pairs method contains a TODO comment about handling UTF-8 (multi-byte transitions). Please consider implementing support for UTF-8 to properly handle multi-byte character transitions in DFA state pairs.

Would you like assistance in implementing UTF-8 handling in this method?

src/log_surgeon/Lexer.hpp (1)

49-49: Pass 'uint32_t' by value instead of by const reference

Since uint32_t is a small fundamental type, passing it by value is more efficient and clearer. Change the parameter uint32_t const& id to uint32_t id.

Apply this change:

-        auto add_rule(uint32_t const& id, std::unique_ptr<finite_automata::RegexAST<NfaStateType>> rule)
+        auto add_rule(uint32_t id, std::unique_ptr<finite_automata::RegexAST<NfaStateType>> rule)
src/log_surgeon/finite_automata/Nfa.hpp (1)

Line range hint 164-164: TODO comment needs attention.

The comment indicates missing UTF-8 case handling. This could affect the functionality of the NFA for UTF-8 encoded input.

Would you like me to help implement the UTF-8 case handling or create a GitHub issue to track this task?

src/log_surgeon/finite_automata/NfaState.hpp (1)

130-130: Consider using enum class member directly.

Instead of NfaStateType::Utf8 == state_type, consider state_type == NfaStateType::Utf8 for consistency with the style used elsewhere in the file.

-    if constexpr (NfaStateType::Utf8 == state_type) {
+    if constexpr (state_type == NfaStateType::Utf8) {
src/log_surgeon/Lalr1Parser.hpp (1)

279-279: Remove commented-out code.

The commented-out member variable declaration should be removed as it adds noise to the codebase.

-    /* Lexer<NfaStateType, DfaStateType> m_lexer; */
src/log_surgeon/Lexer.tpp (1)

Line range hint 44-83: Consider extracting state transition logic.

The state machine implementation in the scan method is quite complex. Consider extracting the state transition logic into a separate method to improve maintainability.

+  auto transition_state(DfaStateType const* current_state, unsigned char next_char) -> DfaStateType* {
+    if (m_is_delimiter[next_char] || input_buffer.log_fully_consumed() || !m_has_delimiters) {
+      if (current_state->is_accepting()) {
+        m_match = true;
+        m_type_ids = &(current_state->get_matching_variable_ids());
+        m_match_pos = prev_byte_buf_pos;
+        m_match_line = m_line;
+      }
+    }
+    return current_state->next(next_char);
+  }
src/log_surgeon/Lalr1Parser.tpp (1)

Line range hint 580-628: Consider using string streams for error reporting.

The current string concatenation approach in error reporting could be inefficient. Consider using std::ostringstream for better performance and maintainability.

-    std::string error_string = "Schema:" + std::to_string(line_num + 1) + ":"
-                               + std::to_string(consumed_input.size() + 1)
-                               + ": error: " + error_type + "\n";
+    std::ostringstream error_stream;
+    error_stream << "Schema:" << (line_num + 1) << ":"
+                 << (consumed_input.size() + 1)
+                 << ": error: " << error_type << "\n";
src/log_surgeon/finite_automata/RegexAST.hpp (1)

716-717: LGTM! Implementation details are consistently updated.

The template parameter renaming has been correctly applied across all implementation methods. However, there's a minor opportunity for improvement in the error messages.

Consider updating the error messages in the RegexASTGroup constructors to use consistent formatting and avoid repetition of the explanation about escaping special characters.

Extract the common error message to a constant:

static constexpr char const* const INVALID_BRACKET_EXPRESSION_ERROR =
    "A bracket expression in the schema contains illegal characters, "
    "remember to escape special characters. "
    "Refer to README-Schema.md for more details.";

Then use it in the constructors:

-        throw std::runtime_error("RegexASTGroup1: right == nullptr: A bracket expression in the "
-                                 "schema contains illegal characters, remember to escape special "
-                                 "characters. Refer to README-Schema.md for more details.");
+        throw std::runtime_error(fmt::format("RegexASTGroup1: right == nullptr: {}", 
+                                           INVALID_BRACKET_EXPRESSION_ERROR));

Also applies to: 721-722, 739-740, 770-771, 799-800, 830-831, 892-893, 949-950, 989-990, 1006-1007, 1014-1015, 1020-1021, 1041-1042, 1058-1059, 1074-1075

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 99b5b08 and 6e65a3e.

📒 Files selected for processing (32)
  • CMakeLists.txt (2 hunks)
  • examples/intersect-test.cpp (4 hunks)
  • src/log_surgeon/BufferParser.hpp (2 hunks)
  • src/log_surgeon/Lalr1Parser.cpp (2 hunks)
  • src/log_surgeon/Lalr1Parser.hpp (5 hunks)
  • src/log_surgeon/Lalr1Parser.tpp (20 hunks)
  • src/log_surgeon/Lexer.hpp (6 hunks)
  • src/log_surgeon/Lexer.tpp (18 hunks)
  • src/log_surgeon/LexicalRule.hpp (2 hunks)
  • src/log_surgeon/LogParser.cpp (5 hunks)
  • src/log_surgeon/LogParser.hpp (4 hunks)
  • src/log_surgeon/Parser.hpp (1 hunks)
  • src/log_surgeon/Parser.tpp (3 hunks)
  • src/log_surgeon/ReaderParser.hpp (2 hunks)
  • src/log_surgeon/SchemaParser.cpp (1 hunks)
  • src/log_surgeon/SchemaParser.hpp (4 hunks)
  • src/log_surgeon/finite_automata/Dfa.hpp (1 hunks)
  • src/log_surgeon/finite_automata/DfaState.hpp (1 hunks)
  • src/log_surgeon/finite_automata/DfaStatePair.hpp (1 hunks)
  • src/log_surgeon/finite_automata/DfaStateType.hpp (1 hunks)
  • src/log_surgeon/finite_automata/Nfa.hpp (6 hunks)
  • src/log_surgeon/finite_automata/NfaState.hpp (6 hunks)
  • src/log_surgeon/finite_automata/NfaStateType.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (31 hunks)
  • src/log_surgeon/finite_automata/RegexDFA.hpp (0 hunks)
  • src/log_surgeon/finite_automata/RegexDFA.tpp (0 hunks)
  • src/log_surgeon/finite_automata/RegexDFAStateType.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexNFAStateType.hpp (0 hunks)
  • src/log_surgeon/finite_automata/TaggedTransition.hpp (4 hunks)
  • tests/CMakeLists.txt (1 hunks)
  • tests/test-NFA.cpp (1 hunks)
  • tests/test-lexer.cpp (2 hunks)
💤 Files with no reviewable changes (3)
  • src/log_surgeon/finite_automata/RegexNFAStateType.hpp
  • src/log_surgeon/finite_automata/RegexDFA.tpp
  • src/log_surgeon/finite_automata/RegexDFA.hpp
✅ Files skipped from review due to trivial changes (5)
  • src/log_surgeon/finite_automata/RegexDFAStateType.hpp
  • src/log_surgeon/finite_automata/DfaStateType.hpp
  • src/log_surgeon/finite_automata/NfaStateType.hpp
  • src/log_surgeon/ReaderParser.hpp
  • src/log_surgeon/BufferParser.hpp
🧰 Additional context used
📓 Path-based instructions (19)
src/log_surgeon/Lalr1Parser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/Parser.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/LogParser.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Dfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/SchemaParser.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/LexicalRule.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/DfaStatePair.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/Lalr1Parser.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/DfaState.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/LogParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/TaggedTransition.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

examples/intersect-test.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-NFA.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-lexer.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Nfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/Lexer.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/NfaState.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegexAST.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/SchemaParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (9)
src/log_surgeon/LexicalRule.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
CMakeLists.txt (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:37-90
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In this codebase, prefer code clarity over efficiency optimizations unless efficiency is a critical concern.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/TaggedTransition.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/TaggedTransition.hpp:16-37
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `src/log_surgeon/finite_automata/TaggedTransition.hpp`, the classes `PositiveTaggedTransition` and `NegativeTaggedTransition` currently do not share enough functionality to justify refactoring into a common base class.
examples/intersect-test.cpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:37-90
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In this codebase, prefer code clarity over efficiency optimizations unless efficiency is a critical concern.
tests/test-NFA.cpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:37-90
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In this codebase, prefer code clarity over efficiency optimizations unless efficiency is a critical concern.
tests/test-lexer.cpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#38
File: src/log_surgeon/finite_automata/RegexAST.hpp:698-705
Timestamp: 2024-11-10T16:46:53.300Z
Learning: Using `fmt::join` on a `vector<uint32_t>` containing digits correctly concatenates them into a string representation.
src/log_surgeon/finite_automata/Nfa.hpp (5)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:37-90
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In this codebase, prefer code clarity over efficiency optimizations unless efficiency is a critical concern.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:127-128
Timestamp: 2024-11-10T16:46:58.543Z
Learning: `RegexNFAUTF8State` is defined as a type alias for `RegexNFAState<RegexNFAStateType::UTF8>`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:99-105
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `src/log_surgeon/finite_automata/RegexNFA.hpp`, it's acceptable to have constructors without the `explicit` specifier. Do not suggest adding `explicit` to constructors in this file.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:442-456
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In the C++ file `src/log_surgeon/finite_automata/RegexNFA.hpp`, for the `RegexNFA::serialize()` function, prioritize code clarity over efficiency when handling string operations.
src/log_surgeon/finite_automata/NfaState.hpp (3)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:127-128
Timestamp: 2024-11-10T16:46:58.543Z
Learning: `RegexNFAUTF8State` is defined as a type alias for `RegexNFAState<RegexNFAStateType::UTF8>`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:37-90
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In this codebase, prefer code clarity over efficiency optimizations unless efficiency is a critical concern.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/RegexAST.hpp (5)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#38
File: src/log_surgeon/finite_automata/RegexAST.hpp:663-669
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `RegexASTLiteral::serialize()`, to properly handle Unicode characters beyond the ASCII range, cast `m_character` to `char32_t` and use `U"{}{}"` in `fmt::format`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:127-128
Timestamp: 2024-11-10T16:46:58.543Z
Learning: `RegexNFAUTF8State` is defined as a type alias for `RegexNFAState<RegexNFAStateType::UTF8>`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:37-90
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In this codebase, prefer code clarity over efficiency optimizations unless efficiency is a critical concern.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
🔇 Additional comments (43)
src/log_surgeon/finite_automata/DfaState.hpp (1)

40-40: Use of 'false == expression' aligns with coding guidelines

The use of false == m_matching_variable_ids.empty() complies with the coding guideline to prefer false == <expression> over !<expression>.

src/log_surgeon/LogParser.cpp (5)

20-21: Type Aliases Updated for Consistency

The using statements for finite_automata::DfaByteState and finite_automata::NfaByteState have been correctly updated to reflect the new naming conventions, promoting consistency across the codebase.


65-68: Constructor Template Parameters Updated Appropriately

The usage of RegexAST<NfaByteState> and RegexASTLiteral<NfaByteState> in initializing first_timestamp_regex_ast and r1 correctly reflects the updated type names, ensuring consistency with the new NfaByteState type.


143-145: Delimiter Group Construction Uses Updated Types

The creation of RegexASTGroup<NfaByteState> and the subsequent concatenation with rule->m_regex_ptr using RegexASTCat<NfaByteState> correctly apply the new naming conventions, maintaining consistency throughout the finite automata implementation.


196-196: Lexer Reference Updated with New Type Names

The assignment to token.m_type_ids_ptr now references Lexer<NfaByteState, DfaByteState>::cTokenUncaughtStringTypes, reflecting the updated type aliases and ensuring correct linkage to the lexer constants.


262-262: Consistent Update of Lexer Reference

Similarly, the assignment to curr_token.m_type_ids_ptr correctly references Lexer<NfaByteState, DfaByteState>::cTokenUncaughtStringTypes, maintaining consistency with the updated type names.

src/log_surgeon/Lalr1Parser.cpp (2)

1-1: Header Include Updated to Reflect Renamed File

The include directive has been updated to #include "Lalr1Parser.hpp", aligning with the new file name and ensuring that the correct header file is included.


11-14: Simplified Member Variable Initialization

The constructor for NonTerminal now initializes m_children_start using m_next_children_start without class qualification, enhancing code readability. The increment of m_next_children_start is also simplified, which is appropriate since both are static member variables of the same class.

src/log_surgeon/Parser.hpp (3)

8-8: Template Parameter Names Standardized

The template parameters have been renamed to NfaStateType and DfaStateType, promoting consistent and clear naming conventions across the codebase.


15-15: Method Signature Updated with New Template Parameters

The add_rule method now correctly uses std::unique_ptr<finite_automata::RegexAST<NfaStateType>> for the rule parameter, reflecting the updated template parameter names and ensuring type consistency.


20-20: Member Variable Type Updated for Consistency

The member variable m_lexer is now declared as Lexer<NfaStateType, DfaStateType>, aligning with the updated template parameter names and maintaining consistency throughout the parser implementation.

src/log_surgeon/LexicalRule.hpp (6)

9-9: Template Parameter Renamed for Clarity

The template parameter has been renamed from NFAStateType to NfaStateType, improving readability and aligning with the naming conventions used across the project.


15-15: Constructor Parameter Type Updated

The constructor for LexicalRule now accepts a std::unique_ptr<finite_automata::RegexAST<NfaStateType>> for the regex parameter, correctly reflecting the updated template parameter name.


24-24: Method Signature Adjusted to New NFA Type

The add_to_nfa method signature has been updated to accept a pointer to finite_automata::Nfa<NfaStateType>, ensuring compatibility with the renamed NFA class and maintaining consistency in the finite automata usage.


28-28: Return Type Updated in get_regex Method

The return type of the get_regex method is now finite_automata::RegexAST<NfaStateType>*, reflecting the updated template parameter and providing the correct type information.


35-35: Member Variable Type Correctly Updated

The member variable m_regex is now of type std::unique_ptr<finite_automata::RegexAST<NfaStateType>>, aligning with the updated template parameter and ensuring type consistency within the class.


38-39: Definition of add_to_nfa Method Updated

The definition of the add_to_nfa method now uses the updated template parameter NfaStateType and accepts a pointer to finite_automata::Nfa<NfaStateType>, maintaining consistency with the class declaration and the finite automata implementation.

tests/CMakeLists.txt (1)

8-10: LGTM! File renaming changes are consistent with PR objectives.

The updates to source file paths correctly reflect the removal of the "Regex" prefix and the capitalization changes.

Also applies to: 14-16

src/log_surgeon/Parser.tpp (1)

10-11: LGTM! Template parameter naming changes are consistent.

The capitalization changes in template parameters align with the PR objectives while maintaining the code's functionality.

Also applies to: 32-35, 44-46

src/log_surgeon/finite_automata/TaggedTransition.hpp (2)

18-20: LGTM! Template parameter naming changes are consistent.

The capitalization changes in template parameters align with the PR objectives while maintaining the code's functionality.

Also applies to: 28-28, 32-32, 39-39, 50-50, 56-59, 66-66, 75-75, 82-82, 96-96


18-20: Verify consistent usage of NfaStateType across the codebase.

Let's ensure the new naming convention is consistently applied across all related files.

✅ Verification successful

NfaStateType template parameter is consistently used across the codebase

The verification shows that NfaStateType is used consistently throughout the codebase, particularly in the finite automata implementation. The template parameter is properly used in:

  • Template class declarations
  • Function parameters
  • Type aliases
  • Documentation comments

All occurrences follow the same CamelCase naming convention, and there are no instances of inconsistent naming patterns (like NFAStateType or nfa_state_type) in the template parameter usage.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for any remaining instances of NFAStateType in the codebase
# that might have been missed during the renaming.

echo "Checking for remaining NFAStateType occurrences..."
rg "NFAStateType" --type cpp --type hpp

echo "Checking for any mixed usage of NfaStateType/NFAStateType..."
rg -i "nfastatetype" --type cpp --type hpp

Length of output: 380


Script:

#!/bin/bash
# Let's try with correct file extensions and more specific search
echo "Checking for any NFA state type usage..."
rg -i "nfastatetype|nfa_state_type" -g "*.h" -g "*.hpp" -g "*.cpp"

# Let's also check template parameters specifically
echo -e "\nChecking template parameters..."
ast-grep --pattern 'template <$$$> class $_ { $$$ }'

Length of output: 26668

examples/intersect-test.cpp (2)

8-11: LGTM! Using declarations updated consistently

The using declarations have been updated to use the new class names (Dfa, DfaByteState, Nfa, NfaByteState), aligning with the PR objectives.


20-20: LGTM! Type declarations updated consistently

All type declarations and variable instantiations have been updated to use the new class names:

  • ByteLexicalRule using NfaByteState
  • Function parameter using Dfa
  • Variable declarations using Nfa

The changes maintain consistency with the PR objectives while preserving the original functionality.

Also applies to: 24-24, 43-43, 45-45, 81-81

src/log_surgeon/SchemaParser.hpp (3)

8-8: LGTM! Include path updated

The include path has been updated to use the renamed Lalr1Parser.hpp, maintaining consistency with the capitalization fixes.


49-49: LGTM! Type declarations in SchemaVarAST updated

The RegexAST template parameter has been updated to use NfaByteState instead of RegexNFAByteState, aligning with the PR objectives.

Also applies to: 58-58


71-72: LGTM! Class inheritance updated

The SchemaParser class inheritance has been updated to use the new state type names (NfaByteState, DfaByteState) and properly formatted with consistent indentation.

src/log_surgeon/LogParser.hpp (3)

9-9: LGTM! Include path updated

The include path has been updated to use the renamed Lalr1Parser.hpp, maintaining consistency with the capitalization fixes.


18-18: LGTM! Class inheritance updated

The LogParser class inheritance has been updated to use the new state type names (NfaByteState, DfaByteState), aligning with the PR objectives.


29-29: LGTM! Documentation updated

The exception documentation has been updated to reference Lalr1Parser instead of LALR1Parser, maintaining consistency with the new naming convention.

Also applies to: 38-38

CMakeLists.txt (1)

71-73: LGTM! File renaming is consistent.

The renaming of files follows a consistent pattern:

  • Removes redundant "Regex" prefix
  • Fixes capitalization (LALR1 -> Lalr1, DFA -> Dfa, NFA -> Nfa)
  • Updates all related files together

Also applies to: 99-105

src/log_surgeon/finite_automata/Nfa.hpp (1)

1-2: LGTM! Class and header guard renaming is consistent.

The renaming follows the PR objectives:

  • Removes redundant "Regex" prefix
  • Updates header guard to match
  • Template parameter uses consistent capitalization (NfaStateType)

Also applies to: 20-21

tests/test-NFA.cpp (1)

10-10: LGTM! Type aliases updated consistently.

The changes maintain consistency with the class renaming:

  • Include path updated to new Nfa.hpp
  • All type aliases updated to use Nfa and NfaByteState
  • No functional changes to the test case required

Also applies to: 16-16, 23-31

tests/test-lexer.cpp (1)

21-32: LGTM! Type alias updates are consistent with PR objectives.

The changes correctly update the type aliases to use the new naming convention, removing the redundant "Regex" prefix while maintaining the same functionality.

src/log_surgeon/finite_automata/NfaState.hpp (2)

1-2: LGTM! Header guard follows the new naming convention.

The header guard has been correctly updated to match the new file name.


20-24: LGTM! Class and type alias updates are consistent.

The changes correctly:

  • Rename the class template from RegexNFAState to NfaState
  • Update type aliases to use the new naming convention
  • Fix capitalization (UTF8 → Utf8)
src/log_surgeon/Lalr1Parser.hpp (1)

205-206: LGTM! Template parameter names follow the new convention.

The template parameters have been correctly updated to use the new naming convention, maintaining consistency with other files.

src/log_surgeon/Lexer.tpp (3)

20-21: LGTM! Template parameter renaming is consistent.

The renaming of template parameters from NFAStateType/DFAStateType to NfaStateType/DfaStateType follows proper naming conventions and is consistently applied across all method signatures.

Also applies to: 41-42, 169-170, 302-303, 319-320, 333-334, 344-345, 357-358, 376-377


377-381: LGTM! Type usage is correctly updated.

The type usage in the generate() method has been properly updated to use the new type names while maintaining the same functionality.


Line range hint 255-276: Verify stack usage in wildcard handling.

The stack-based DFA state traversal could potentially consume excessive memory for complex patterns. Consider:

  1. Adding a maximum stack depth limit
  2. Implementing stack size checks
src/log_surgeon/SchemaParser.cpp (1)

19-37: LGTM! Type alias updates are consistent.

The type alias updates from RegexNFAByteState to NfaByteState are consistently applied across all declarations and maintain proper type relationships.

src/log_surgeon/Lalr1Parser.tpp (1)

56-57: LGTM! Template parameter renaming is consistent throughout the parser implementation.

The renaming of template parameters maintains consistency with the changes in other files and follows the project's new naming conventions.

Also applies to: 68-69, 77-78, 85-86, 112-113, 153-154, 163-164, 193-194, 213-214, 236-237, 269-270, 301-302, 385-386, 421-422, 427-428, 433-434

src/log_surgeon/finite_automata/RegexAST.hpp (2)

27-28: LGTM! The renaming changes are consistent.

The renaming of the template parameter from NFAStateType to NfaStateType and the corresponding updates to method signatures maintain consistency with the PR objectives while preserving the existing functionality.

Also applies to: 43-45, 73-78


160-161: LGTM! Template parameter renaming is consistent across derived classes.

The renaming changes have been consistently applied across all derived classes:

  • RegexASTEmpty
  • RegexASTLiteral
  • RegexASTInteger
  • RegexASTGroup
  • RegexASTOr
  • RegexASTCat
  • RegexASTMultiplication
  • RegexASTCapture

The changes maintain the existing class hierarchy and relationships.

Also applies to: 190-191, 239-240, 294-295, 427-428, 490-491, 553-554, 631-632

src/log_surgeon/finite_automata/DfaState.hpp Outdated Show resolved Hide resolved
src/log_surgeon/Lexer.hpp Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (10)
src/log_surgeon/finite_automata/NfaState.hpp (4)

51-53: Ensure consistent use of const qualifiers in function parameters

The functions add_positive_tagged_start_transition and add_epsilon_transition have different const qualifiers for their dest_state parameters (NfaState const* vs. NfaState*). For consistency and to prevent unintended modifications, consider using NfaState const* for all parameters that are not modified within the function.

Apply this diff to harmonize the const qualifiers:

 auto add_epsilon_transition(NfaState const* epsilon_transition) -> void {
     m_epsilon_transitions.push_back(epsilon_transition);
 }

70-72: Uniformly apply const qualifiers to function parameters

Similarly, in the add_byte_transition function, consider changing the dest_state parameter to NfaState const* to maintain consistency and ensure immutability of the pointed object within the function.

Apply this diff:

 auto add_byte_transition(uint8_t byte, NfaState const* dest_state) -> void {
     m_bytes_transitions[byte].push_back(dest_state);
 }

74-76: Use const pointers in return types for getter functions

In the get_epsilon_transitions function, you return a vector of NfaState*. To maintain const correctness and prevent unintended modifications to the states, consider returning std::vector<NfaState const*> const&.

Apply this diff:

 [[nodiscard]] auto get_epsilon_transitions() const -> std::vector<NfaState const*> const& {
     return m_epsilon_transitions;
 }

Also, update the member variable accordingly:

- std::vector<NfaState*> m_epsilon_transitions;
+ std::vector<NfaState const*> m_epsilon_transitions;

Line range hint 168-236: Refactor the serialize method for improved readability

The serialize method is lengthy and contains repetitive code for handling optional values and error checking. Refactoring parts of this method into smaller helper functions would enhance readability and maintainability.

src/log_surgeon/finite_automata/DfaStatePair.hpp (1)

71-71: Offer assistance for TODO: Handle UTF-8 transitions

There's a TODO comment indicating the need to handle UTF-8 (multi-byte transitions) in the get_reachable_pairs method. Would you like assistance in implementing this functionality or opening a new GitHub issue to track this task?

src/log_surgeon/finite_automata/Dfa.hpp (1)

58-58: Offer assistance for TODO: Handle UTF-8 transitions

There's a TODO comment indicating the need to handle UTF-8 (multi-byte transitions) in the get_intersect method. Would you like assistance in implementing this functionality or opening a new GitHub issue to track this task?

src/log_surgeon/Lexer.hpp (1)

126-128: Avoid returning const references to primitive types

Methods such as get_has_delimiters(), is_delimiter(uint8_t byte), and is_first_char(uint8_t byte) return bool const&. Returning a const reference to a primitive type like bool is unnecessary and can lead to inefficiencies. Consider returning bool by value instead.

Also applies to: 131-135

src/log_surgeon/Lexer.tpp (3)

Line range hint 169-301: Consider extracting common logic between scan methods.

The scan_with_wildcard method largely duplicates code from the scan method. Consider extracting the common logic into a private helper method to improve maintainability and reduce code duplication.


459-466: Address or remove commented-out UTF-8 code.

The commented-out UTF-8 specific code in the nfa_to_dfa method should either be implemented or removed if no longer needed. If this is a planned feature, consider creating a tracking issue.

Would you like me to help create a GitHub issue to track the UTF-8 support implementation?

Also applies to: 484-491


Tagged transitions should not be treated as epsilon transitions

Based on the codebase analysis, treating tagged transitions as epsilon transitions is incorrect. The implementation shows that tagged transitions serve distinct purposes:

  1. Positive tagged start/end transitions are used for capture group boundaries
  2. Negative tagged transitions are used for negative lookahead assertions
  3. Epsilon transitions are separate and have their own dedicated implementation

The TODO comment should be removed and tagged transitions should be handled according to their specific semantics rather than being treated as epsilon transitions. This affects pattern matching behaviour, particularly for:

  • Capture group boundaries tracking
  • Negative lookahead assertion evaluation
  • State reachability computation
🔗 Analysis chain

Line range hint 407-407: Review epsilon transition handling for tagged transitions.

The current implementation treats tagged transitions as epsilon transitions. This might not be the intended behavior and could affect pattern matching accuracy.


Let me search for more context about the implementation of tagged transitions.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for usage of tagged transitions in tests
rg -A 5 "tagged.*transition" "tests/"

Length of output: 7665


Script:

#!/bin/bash
# Search for implementation of tagged transitions in source files
rg -A 5 "tagged.*transition" "src/"

# Search for any epsilon transition handling
rg -A 5 "epsilon.*transition" "src/"

Length of output: 27502

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9c2ad81 and c4fc96b.

📒 Files selected for processing (14)
  • src/log_surgeon/Lalr1Parser.hpp (5 hunks)
  • src/log_surgeon/Lalr1Parser.tpp (20 hunks)
  • src/log_surgeon/Lexer.hpp (6 hunks)
  • src/log_surgeon/Lexer.tpp (18 hunks)
  • src/log_surgeon/LexicalRule.hpp (2 hunks)
  • src/log_surgeon/Parser.hpp (1 hunks)
  • src/log_surgeon/Parser.tpp (3 hunks)
  • src/log_surgeon/finite_automata/Dfa.hpp (1 hunks)
  • src/log_surgeon/finite_automata/DfaStatePair.hpp (1 hunks)
  • src/log_surgeon/finite_automata/Nfa.hpp (6 hunks)
  • src/log_surgeon/finite_automata/NfaState.hpp (6 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (31 hunks)
  • src/log_surgeon/finite_automata/RegexDFAStateType.hpp (1 hunks)
  • src/log_surgeon/finite_automata/TaggedTransition.hpp (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (8)
  • src/log_surgeon/Parser.hpp
  • src/log_surgeon/Parser.tpp
  • src/log_surgeon/LexicalRule.hpp
  • src/log_surgeon/Lalr1Parser.hpp
  • src/log_surgeon/finite_automata/TaggedTransition.hpp
  • src/log_surgeon/finite_automata/Nfa.hpp
  • src/log_surgeon/finite_automata/RegexAST.hpp
  • src/log_surgeon/Lalr1Parser.tpp
🧰 Additional context used
📓 Path-based instructions (5)
src/log_surgeon/finite_automata/RegexDFAStateType.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Dfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/DfaStatePair.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/Lexer.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/NfaState.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (2)
src/log_surgeon/finite_automata/RegexDFAStateType.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:127-128
Timestamp: 2024-11-10T16:46:58.543Z
Learning: `RegexNFAUTF8State` is defined as a type alias for `RegexNFAState<RegexNFAStateType::UTF8>`.
src/log_surgeon/finite_automata/NfaState.hpp (3)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:127-128
Timestamp: 2024-11-10T16:46:58.543Z
Learning: `RegexNFAUTF8State` is defined as a type alias for `RegexNFAState<RegexNFAStateType::UTF8>`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:37-90
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In this codebase, prefer code clarity over efficiency optimizations unless efficiency is a critical concern.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
🔇 Additional comments (4)
src/log_surgeon/finite_automata/NfaState.hpp (1)

Line range hint 125-145: Verify condition in add_interval to avoid excluding zero

The condition if (interval.first != 0 && interval.first <= interval.second) might inadvertently exclude intervals starting at zero. This could lead to missed transitions for intervals that begin with zero. Please verify that this condition behaves as intended.

Please double-check the logic of this condition to ensure intervals starting at zero are correctly handled, or adjust the condition if necessary.

src/log_surgeon/Lexer.hpp (1)

127-129: Avoid returning a const reference to a std::unique_ptr

Returning a const& to a std::unique_ptr in get_dfa() can lead to confusion and potential misuse. Consider returning a raw pointer or a reference to the managed object instead.

src/log_surgeon/Lexer.tpp (2)

20-21: LGTM! Consistent template parameter renaming.

The template parameter renaming from NFAStateType/DFAStateType to TypedNfaState/TypedDfaState has been applied consistently throughout the file, aligning with the project's naming conventions.

Also applies to: 41-42, 169-170, 302-303, 319-320, 333-334, 344-345, 357-358, 365-366, 376-377, 391-392, 428-429


379-379: Verify the impact of DFA ignoring tags.

The TODO comment indicates that the DFA ignores tags in patterns like "capture:user=(?<user_id>\d+)". This could lead to incorrect pattern matching behavior.

src/log_surgeon/finite_automata/RegexDFAStateType.hpp Outdated Show resolved Hide resolved
Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before we proceed, a quick question: do we really need to differentiate DFA state type and NFA state type? Can we have a generic type to mark whether the input is byte or unicode?

Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any header guard issue within this file should be tracked in #65

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should ensure the namings in NFA unit tests are also sync with this PR:

  • Rename test-NFA.cpp to test-Nfa.cpp
  • Rename ByteNFA to ByteNfa in the unit test file

@LinZhihao-723
Copy link
Member

Before we proceed, a quick question: do we really need to differentiate DFA state type and NFA state type? Can we have a generic type to mark whether the input is byte or unicode?

Just to clarify: we don't necessarily do this in the current PR, but it's some design decision we should make

@SharafMohamed
Copy link
Contributor Author

SharafMohamed commented Dec 19, 2024

Before we proceed, a quick question: do we really need to differentiate DFA state type and NFA state type? Can we have a generic type to mark whether the input is byte or unicode?

I thought one of the other PRs already posted combine these into a single type, but it seems that's not the case. I'll add a PR to combine the types.

@LinZhihao-723
Copy link
Member

Before we proceed, a quick question: do we really need to differentiate DFA state type and NFA state type? Can we have a generic type to mark whether the input is byte or unicode?

I thought one of the other PRs already posted combine these into a single type, but it seems that's not the case. I'll add a PR to combine the types.

Do u think we can do it in this PR? 'cuz reviewing the current renaming changes and then reviewing another PR that has a lot of overlapped overwrite would duplicate the work

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants