Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Extract RegexNFAState and tagged transition classes into their own files. #47

Merged
merged 172 commits into from
Nov 7, 2024

Conversation

SharafMohamed
Copy link
Contributor

@SharafMohamed SharafMohamed commented Nov 2, 2024

Summary by CodeRabbit

  • New Features

    • Introduced new classes for managing finite automata states and transitions, enhancing regular expression parsing capabilities.
    • Added support for tagged transitions in finite automata, improving state management.
  • Bug Fixes

    • Removed outdated classes and methods, streamlining the codebase for better performance.
  • Documentation

    • Updated test configurations to include new source files for comprehensive unit testing.

… tag to matching_variable_id; Use full names for vars (r->rule); Clarify if states are NFA or DFA
…for clairty that nothing is shared b/w tests
…egexASTgroup with min = 1 OR'd with RegexASTEmpty
…iteral arguments; Use const& for non-literals; Use auto where possible; Use uint32_t over int for ids; replace begin() and end() with cbegin() and cend()
Copy link

coderabbitai bot commented Nov 2, 2024

Walkthrough

The pull request introduces changes aimed at enhancing the finite automata functionality within the log parsing library. It includes the addition of new header files for RegexNFAState, RegexNFAStateType, and TaggedTransition, while also modifying the existing RegexNFA.hpp to remove certain classes and methods related to tagged transitions. The changes streamline the state management and transition handling mechanisms, reflecting a significant refactor of the finite automata implementation.

Changes

File Path Change Summary
CMakeLists.txt Added new source files: RegexNFAState.hpp, RegexNFAStateType.hpp, TaggedTransition.hpp.
src/log_surgeon/finite_automata/RegexNFA.hpp Removed RegexNFAStateType enum, RegexNFAState, PositiveTaggedTransition, and NegativeTaggedTransition classes. Simplified RegexNFAState structure and removed transition management methods.
src/log_surgeon/finite_automata/RegexNFAState.hpp Introduced RegexNFAState template class with methods for managing transitions and state serialization.
src/log_surgeon/finite_automata/RegexNFAStateType.hpp Added RegexNFAStateType enum.
src/log_surgeon/finite_automata/TaggedTransition.hpp Introduced PositiveTaggedTransition and NegativeTaggedTransition classes with serialization methods.
tests/CMakeLists.txt Updated to include the new source files in the SOURCES_LOG_SURGEON variable for unit testing.

Possibly related PRs

Suggested reviewers

  • LinZhihao-723

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Outside diff range and nitpick comments (5)
src/log_surgeon/finite_automata/RegexNFAStateType.hpp (2)

7-10: Consider adding documentation for the enum class values.

Please add documentation explaining the distinction between Byte and UTF8 state types and their specific use cases in the NFA implementation.

 namespace log_surgeon::finite_automata {
+/**
+ * Represents the type of state in a Regex NFA.
+ * 
+ * Byte: Processes input one byte at a time
+ * UTF8: Processes input as UTF-8 encoded characters
+ */
 enum class RegexNFAStateType : uint8_t {
     Byte,
     UTF8
 };

12-13: Add documentation for the template class.

Consider adding documentation explaining the purpose of RegexNFAState and how the state type parameter affects its behaviour.

+/**
+ * Represents a state in a Regex NFA with specific character handling behaviour.
+ * 
+ * @tparam state_type Determines how the state processes input characters
+ *                    (byte-by-byte or as UTF-8 sequences)
+ */
 template <RegexNFAStateType state_type>
 class RegexNFAState;
src/log_surgeon/finite_automata/TaggedTransition.hpp (2)

18-20: Clarify ownership semantics of dest_state; consider using references or smart pointers

Currently, dest_state is a raw pointer to a const NFAStateType. If the transition classes do not own dest_state, consider using a const reference (NFAStateType const&) to indicate non-ownership, enhancing clarity and safety. If ownership needs to be managed, using smart pointers like std::shared_ptr or std::unique_ptr can prevent potential memory issues.

Also applies to: 42-44


71-71: Confirm that the serialization format and tag ordering meet system requirements

When serializing tags using fmt::join(m_tags, ","), the tags are ordered according to std::set, resulting in an ascending order. Please verify that this ordering aligns with the expected serialization format. If tag order is not significant or a different order is required, adjustments to the container type or serialization logic may be necessary.

Also applies to: 82-82

src/log_surgeon/finite_automata/RegexNFA.hpp (1)

Line range hint 78-78: Reminder: Address the TODO to handle the UTF-8 case.

Handling the UTF-8 case is important to ensure proper processing of Unicode inputs.

Would you like assistance in implementing support for the UTF-8 case or opening a GitHub issue to track this task?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between a882dde and 9da470d.

📒 Files selected for processing (6)
  • CMakeLists.txt (1 hunks)
  • src/log_surgeon/finite_automata/RegexNFA.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexNFAState.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexNFAStateType.hpp (1 hunks)
  • src/log_surgeon/finite_automata/TaggedTransition.hpp (1 hunks)
  • tests/CMakeLists.txt (1 hunks)
🔇 Additional comments (6)
src/log_surgeon/finite_automata/RegexNFAStateType.hpp (1)

1-19: Well-structured header with good separation of concerns.

The header file is well-organized with:

  • Proper include guards
  • Minimal includes
  • Clear type definitions
  • Good use of type safety with enum class
  • Appropriate use of templates and type aliases
tests/CMakeLists.txt (1)

7-9: LGTM! The new header files are properly integrated.

The addition of these header files aligns with the PR objective of separating classes from RegexNFA.hpp. The files are correctly placed in the finite_automata directory and included in the build configuration.

Let's verify that these files exist and were properly separated from RegexNFA.hpp:

✅ Verification successful

The new header files are properly separated and integrated

The verification confirms that:

  • All three header files exist in the correct location
  • The classes have been properly separated:
    • RegexNFAState is now in its own file
    • RegexNFAStateType contains the enum class definition
    • TaggedTransition contains both PositiveTaggedTransition and NegativeTaggedTransition classes
  • RegexNFA.hpp correctly includes the new RegexNFAState header
  • The classes are no longer present in RegexNFA.hpp
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence of new header files and their separation from RegexNFA.hpp

# Test 1: Check if the new files exist
echo "Checking if new files exist..."
fd -t f "RegexNFAState.hpp|RegexNFAStateType.hpp|TaggedTransition.hpp" src/log_surgeon/finite_automata/

# Test 2: Verify that these classes are no longer in RegexNFA.hpp
echo "Verifying separation from RegexNFA.hpp..."
rg -i "class\s+(RegexNFAState|TaggedTransition)" src/log_surgeon/finite_automata/RegexNFA.hpp

Length of output: 522


Script:

#!/bin/bash
# Let's verify the content of these files and their relationships

# Check if RegexNFA.hpp includes these new headers
echo "Checking includes in RegexNFA.hpp..."
rg -A 1 "#include.*?(RegexNFAState|RegexNFAStateType|TaggedTransition)" src/log_surgeon/finite_automata/RegexNFA.hpp

# Verify that the classes are properly defined in their respective files
echo "Checking class definitions in new files..."
for file in RegexNFAState.hpp RegexNFAStateType.hpp TaggedTransition.hpp; do
    echo "=== Content of $file ==="
    rg "class\s+\w+" "src/log_surgeon/finite_automata/$file" -A 2
done

# Check if these files are properly guarded
echo "Checking header guards..."
for file in RegexNFAState.hpp RegexNFAStateType.hpp TaggedTransition.hpp; do
    echo "=== Header guards in $file ==="
    rg "#ifndef.*?_HPP|#define.*?_HPP" "src/log_surgeon/finite_automata/$file"
done

Length of output: 2449

CMakeLists.txt (1)

100-102: LGTM! The new header files are properly integrated.

The addition of these header files aligns with the PR objective of separating classes from RegexNFA.hpp. The files are correctly placed in the finite_automata directory and will be included in the installation.

Let's verify that all the new header files exist:

✅ Verification successful

All new header files are present in the correct location

All three header files have been successfully verified to exist in the src/log_surgeon/finite_automata/ directory:

  • RegexNFAState.hpp
  • RegexNFAStateType.hpp
  • TaggedTransition.hpp
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence of the new header files
# Test: Check if all new header files exist. Expect: All files should be found.

for file in "RegexNFAState.hpp" "RegexNFAStateType.hpp" "TaggedTransition.hpp"; do
  fd --type f "^${file}$" || echo "Missing file: ${file}"
done

Length of output: 514

src/log_surgeon/finite_automata/RegexNFA.hpp (2)

19-19: LGTM!

Including <log_surgeon/finite_automata/RegexNFAState.hpp> ensures that the necessary definitions are available after moving RegexNFAState into its own file.


Line range hint 109-112: Verify the safety of unchecked optional access in the serialize method.

The use of .value() on state->serialize(state_ids) assumes it always returns a valid value. Please ensure that serialize(state_ids) cannot return an empty std::optional, to avoid potential exceptions.

src/log_surgeon/finite_automata/RegexNFAState.hpp (1)

100-100: ⚠️ Potential issue

Undefined identifier cSizeOfByte

The identifier cSizeOfByte is used but not defined within this file or included headers. Ensure that cSizeOfByte is defined or properly included.

To verify where cSizeOfByte is defined, you can run:

Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some small changes before we merge

src/log_surgeon/finite_automata/RegexNFAStateType.hpp Outdated Show resolved Hide resolved
src/log_surgeon/finite_automata/RegexNFAState.hpp Outdated Show resolved Hide resolved
src/log_surgeon/finite_automata/TaggedTransition.hpp Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
src/log_surgeon/finite_automata/RegexNFAState.hpp (1)

114-115: Consider using auto for the return type.

Since the return type is void, you can simplify the function declaration by using auto for the return type:

-auto RegexNFAState<state_type>::add_interval(Interval interval, RegexNFAState* dest_state) -> void {
+auto RegexNFAState<state_type>::add_interval(Interval interval, RegexNFAState* dest_state) {

This is a minor nitpick and a matter of personal preference. Using auto for void return types can make the code more concise without sacrificing readability.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9da470d and 4576d7d.

📒 Files selected for processing (3)
  • src/log_surgeon/finite_automata/RegexNFAState.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexNFAStateType.hpp (1 hunks)
  • src/log_surgeon/finite_automata/TaggedTransition.hpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/log_surgeon/finite_automata/RegexNFAStateType.hpp
🧰 Additional context used
📓 Learnings (2)
src/log_surgeon/finite_automata/RegexNFAState.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:127-128
Timestamp: 2024-11-02T09:13:56.755Z
Learning: `RegexNFAUTF8State` is defined as a type alias for `RegexNFAState<RegexNFAStateType::UTF8>`.
src/log_surgeon/finite_automata/TaggedTransition.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/TaggedTransition.hpp:16-37
Timestamp: 2024-11-02T09:18:31.046Z
Learning: In `src/log_surgeon/finite_automata/TaggedTransition.hpp`, the classes `PositiveTaggedTransition` and `NegativeTaggedTransition` currently do not share enough functionality to justify refactoring into a common base class.
🔇 Additional comments (3)
src/log_surgeon/finite_automata/TaggedTransition.hpp (2)

16-43: Well-structured implementation of PositiveTaggedTransition.

The PositiveTaggedTransition class is cleanly implemented with clear constructors and accessor methods. The use of std::optional in the serialize method appropriately handles cases where the destination state is not found.


46-73: Well-structured implementation of NegativeTaggedTransition.

The NegativeTaggedTransition class is well-designed, and the handling of multiple tags using std::set<uint32_t> is appropriate for ensuring uniqueness and order. The serialize method effectively generates a string representation of the transition.

src/log_surgeon/finite_automata/RegexNFAState.hpp (1)

1-215: LGTM!

The RegexNFAState class is well-structured and provides a comprehensive set of methods for managing NFA states and transitions. The code follows good practices such as using const and [[nodiscard]] where appropriate, and the error handling in the serialize method is robust. The class is properly templated to handle different state types, and the type aliases RegexNFAByteState and RegexNFAUTF8State provide convenient shortcuts. Overall, the code is clean, modular, and easy to understand.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (2)
src/log_surgeon/finite_automata/RegexNFAState.hpp (2)

37-38: Consider passing tags by value and moving.

The constructor takes tags by value and then moves it into m_negative_tagged_transition. To avoid an unnecessary copy, consider taking tags by rvalue reference (std::set<uint32_t>&&) and moving it directly.

Apply this change:

- RegexNFAState(std::set<uint32_t> tags, RegexNFAState const* dest_state)
+ RegexNFAState(std::set<uint32_t>&& tags, RegexNFAState const* dest_state)

79-79: Consider marking the method as const.

The get_tree_transitions method returns a const reference to m_tree_transitions but is not marked as const itself. Consider adding the const qualifier to align with the return type and allow calling it on const instances.

Apply this change:

- auto get_tree_transitions() -> Tree const& { return m_tree_transitions; }
+ [[nodiscard]] auto get_tree_transitions() const -> Tree const& { return m_tree_transitions; }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 9da470d and 4576d7d.

📒 Files selected for processing (3)
  • src/log_surgeon/finite_automata/RegexNFAState.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexNFAStateType.hpp (1 hunks)
  • src/log_surgeon/finite_automata/TaggedTransition.hpp (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/log_surgeon/finite_automata/RegexNFAStateType.hpp
🧰 Additional context used
📓 Learnings (2)
src/log_surgeon/finite_automata/RegexNFAState.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:127-128
Timestamp: 2024-11-02T09:13:56.755Z
Learning: `RegexNFAUTF8State` is defined as a type alias for `RegexNFAState<RegexNFAStateType::UTF8>`.
src/log_surgeon/finite_automata/TaggedTransition.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/TaggedTransition.hpp:16-37
Timestamp: 2024-11-02T09:18:31.046Z
Learning: In `src/log_surgeon/finite_automata/TaggedTransition.hpp`, the classes `PositiveTaggedTransition` and `NegativeTaggedTransition` currently do not share enough functionality to justify refactoring into a common base class.
🔇 Additional comments (17)
src/log_surgeon/finite_automata/RegexNFAState.hpp (13)

21-22: LGTM!

The template class declaration looks good. The class is parameterized by the RegexNFAStateType enum, allowing for different state types.


24-25: LGTM!

The type aliases RegexNFAByteState and RegexNFAUTF8State are defined correctly using the RegexNFAState template class with the appropriate RegexNFAStateType values.


32-32: LGTM!

The default constructor is defined correctly.


34-35: LGTM!

The constructor taking a single tag and destination state is implemented correctly, initializing the m_positive_tagged_transitions member.


40-42: LGTM!

The set_accepting and is_accepting methods are implemented correctly, allowing the acceptance status to be set and retrieved.


44-50: LGTM!

The set_matching_variable_id and get_matching_variable_id methods are implemented correctly, allowing the matching variable ID to be set and retrieved.


52-60: LGTM!

The get_positive_tagged_transitions and get_negative_tagged_transition methods are implemented correctly, returning the respective tagged transitions.


62-68: LGTM!

The add_epsilon_transition and get_epsilon_transitions methods are implemented correctly, allowing epsilon transitions to be added and retrieved.


70-77: LGTM!

The add_byte_transition and get_byte_transitions methods are implemented correctly, allowing byte transitions to be added and retrieved.


87-87: LGTM!

The add_interval method declaration looks good. It takes an Interval and a destination state pointer as parameters.


97-98: LGTM!

The serialize method declaration looks good. It takes a const reference to a map of states to their unique identifiers and returns an optional string.


115-159: LGTM!

The add_interval method implementation looks correct. It handles adding transitions based on the provided interval, either as byte transitions or tree transitions depending on the state type.


161-212: LGTM!

The serialize method implementation looks correct. It serializes the NFA state into a string representation, handling byte transitions, epsilon transitions, positive and negative tagged transitions, and the accepting tag. It also performs error handling by returning std::nullopt if the serialization of any transition fails.

src/log_surgeon/finite_automata/TaggedTransition.hpp (4)

1-76: LGTM!

The new TaggedTransition.hpp file is well-structured and follows best practices. The classes PositiveTaggedTransition and NegativeTaggedTransition are properly encapsulated and provide the necessary functionality for representing tagged transitions in a finite automaton.


16-43: ****

As discussed in a previous review, the classes PositiveTaggedTransition and NegativeTaggedTransition currently do not share enough functionality to justify refactoring into a common base class.

Also applies to: 46-73


71-72: ****

As mentioned in a previous review, using std::set<uint32_t> for m_tags is appropriate for now, as this is non-critical code. The change to std::vector<uint32_t> will be tracked in a future PR.


26-38: Verify error handling for missing destination states.

The serialize methods in both PositiveTaggedTransition and NegativeTaggedTransition return std::nullopt if the destination state is not found in the provided state_ids map. Ensure that the calling code appropriately handles this case to avoid unexpected behaviour.

Run the following script to verify the error handling:

Also applies to: 56-68

✅ Verification successful

Let me analyze the verification results. The code shows that:

  1. The serialize method in RegexNFAState properly handles std::nullopt returns from both positive and negative tagged transitions by propagating the std::nullopt up the call chain.

  2. The main caller of serialize is in RegexNFA::serialize() which generates state IDs from a BFS traversal order, ensuring all states are included in the state_ids map. This is evident from the comment: "state_ids is well-formed as its generated from get_bfs_traversal_order".

  3. The error handling for missing destination states is properly implemented at both the transition level (in TaggedTransition.hpp) and state level (in RegexNFAState.hpp).

Error handling for missing destination states is properly implemented.

The codebase shows a robust error handling chain:

  • Tagged transitions return std::nullopt if destination state is not found
  • States propagate the std::nullopt from transitions
  • The main serialization code ensures state IDs are well-formed through BFS traversal
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify error handling for missing destination states in `serialize` methods.

# Test 1: Search for `serialize` method calls. Expect: Calls are wrapped in error handling logic.
rg -A 10 $'auto.+serialize'

# Test 2: Search for `std::nullopt` usage. Expect: Appropriate handling of `std::nullopt` return values.
rg -A 10 $'std::nullopt'

Length of output: 37979

Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the PR title, how about:

Extract `RegexNFAState` and tagged transition classes into their own files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants