Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: separate the concept of captures and tags; lexer now tracks mapping from variables to capture to tags to registers. #72

Open
wants to merge 510 commits into
base: main
Choose a base branch
from

Conversation

SharafMohamed
Copy link
Contributor

@SharafMohamed SharafMohamed commented Jan 13, 2025

Description

  • Previously tags were being used to refer to a single capture group, as well as the start and end markers for a capture group's position in the NFA.
    • The former has been changed to be referred to as a capture.
    • The latter is now stored as a unique unsigned integer.
  • To simplify information tracking and ownership transfer, the lexer is now responsible for keeping track of all the relational information it will need after parsing. This includes:
    • A map from each variable id to the capture id's for the groups the variable contains.
    • A map from each capture to its start and end tag.
    • A map from each tag to its final register.
  • Fix filer order in cmake.
  • Use *_id_t aliases to make it more clear what the intended purpose of the maps are.

Validation performed

  • Added new unit-test for the lexer's base functionality.
  • Added new unit-test for the lexer's capture group functionality. This includes testing the maps that can currently be assigned.

Summary by CodeRabbit

Based on the comprehensive changes across multiple files, here are the updated release notes:

  • New Features

    • Introduced a new Capture class to replace the previous Tag system.
    • Enhanced lexer functionality with improved capture group handling.
    • Added methods for retrieving capture and tag information.
  • Improvements

    • Refined type management with new type aliases like symbol_id_t and register_id_t.
    • Updated NFA and regex AST to use a capture-based tagging mechanism.
    • Improved error handling and unique ID generation for captures.
  • Refactoring

    • Replaced tag-based system with a more robust capture-based approach.
    • Updated method signatures and internal data structures across multiple components.
    • Streamlined serialization and transition handling.
  • Testing

    • Added new unit tests for Capture functionality.
    • Updated existing tests to reflect new capture and tagging mechanisms.

These changes represent a significant architectural improvement in the log parsing and regex handling system.

Copy link

coderabbitai bot commented Jan 13, 2025

Walkthrough

This pull request introduces a comprehensive refactoring of the log surgeon's finite automata and lexer components, primarily focusing on replacing the concept of "tags" with "captures". The changes span multiple files across the project, including CMakeLists.txt, header files, and test cases. The refactoring involves renaming classes, updating method signatures, and modifying how captures are managed throughout the codebase, with a particular emphasis on improving the representation and handling of regex capture groups.

Changes

File Change Summary
CMakeLists.txt Removed Tag.hpp, added Capture.hpp, restored PrefixTree.cpp, PrefixTree.hpp, and RegexAST.hpp
src/log_surgeon/Lexer.hpp Added type aliases: register_id_t, symbol_id_t, tag_id_t; added methods for capture and tag ID retrieval
src/log_surgeon/Lexer.tpp Updated method signatures to use symbol_id_t; added capture name uniqueness validation
src/log_surgeon/LexicalRule.hpp Added get_captures() method to return capture information
src/log_surgeon/SchemaParser.cpp Replaced Tag.hpp include with Capture.hpp
src/log_surgeon/finite_automata/* Comprehensive renaming from Tag to Capture; updated method signatures to use tag IDs
tests/ Added test-capture.cpp; removed test-tag.cpp; updated existing test cases

Sequence Diagram

sequenceDiagram
    participant Lexer
    participant Nfa
    participant RegexAST
    participant Capture

    Lexer->>Nfa: Create NFA with rules
    Nfa->>RegexAST: Process regex rules
    RegexAST->>Capture: Generate capture groups
    Nfa-->>Lexer: Return NFA with capture information
Loading

Possibly related PRs

Suggested reviewers

  • LinZhihao-723
✨ Finishing Touches
  • 📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Nitpick comments (9)
src/log_surgeon/finite_automata/Dfa.hpp (1)

Line range hint 121-134: Consider adding error handling for invalid input in get_intersect

While the method correctly implements the intersection operation, it might benefit from validation of the input DFA pointer and appropriate error handling.

Consider adding input validation:

 template <typename TypedDfaState>
 auto Dfa<TypedDfaState>::get_intersect(Dfa const* dfa_in) const -> std::set<uint32_t> {
+    if (nullptr == dfa_in) {
+        throw std::invalid_argument("Input DFA pointer cannot be null");
+    }
     std::set<uint32_t> schema_types;
     // ... rest of the implementation
src/log_surgeon/finite_automata/Nfa.hpp (1)

122-122: Code formatting issue detected

The code does not adhere to the project's code formatting guidelines. Please run clang-format on the affected file to resolve formatting issues.

🧰 Tools
🪛 GitHub Actions: lint

[error] 122-122: Code should be clang-formatted

src/log_surgeon/finite_automata/RegexAST.hpp (1)

113-113: Code formatting issue detected

The code does not adhere to the project's code formatting guidelines at line 113. Please run clang-format on the affected file to resolve formatting issues.

🧰 Tools
🪛 GitHub Actions: lint

[error] 113-113: Code should be clang-formatted

tests/test-capture.cpp (1)

23-34: Consider adding more test cases for constructors.

While the current tests cover basic copy and move operations, consider adding tests for:

  • Copy/move with empty strings
  • Self-assignment test
  • Chain assignment test (a = b = c)
src/log_surgeon/Lexer.hpp (1)

131-169: Consider adding validation for capture_id parameter.

While the error handling is good, consider adding a debug assertion or validation for the capture_id parameter in get_registers_for_capture() to catch potential programming errors early.

 [[nodiscard]] auto get_registers_for_capture(symbol_id_t capture_id
 ) const -> std::optional<std::pair<register_id_t, register_id_t>> {
+    assert(capture_id < m_id_symbol.size() && "Invalid capture_id");
     auto const tag_ids = get_tag_ids_for_capture_id(capture_id);
tests/test-nfa.cpp (1)

47-48: Consider using std::move for better performance.

The initialization of ByteNfa could be more efficient by moving rules instead of copying it.

-    ByteNfa const nfa{rules};
+    ByteNfa const nfa{std::move(rules)};
🧰 Tools
🪛 GitHub Actions: lint

[error] 47-65: Code should be clang-formatted

src/log_surgeon/Lexer.tpp (1)

381-394: Enhance the error message for duplicate captures.

The error message could be more descriptive to help users identify which capture names are duplicated.

-                throw std::invalid_argument("`m_rules` contains capture names that are not unique.");
+                throw std::invalid_argument(fmt::format("Duplicate capture name '{}' found in rules.", capture_name));
🧰 Tools
🪛 GitHub Actions: lint

[error] 390-390: Code should be clang-formatted

tests/test-lexer.cpp (2)

215-297: Consider extracting helper functions for better test readability.

The test case could be more maintainable by extracting common setup and verification logic into helper functions.

// Helper function for lexer setup
auto setup_lexer(string_view schema) -> log_surgeon::lexers::ByteLexer {
    log_surgeon::lexers::ByteLexer lexer;
    lexer.add_delimiters({' ', '\n'});
    // ... rest of setup code ...
    return lexer;
}

// Helper function for token verification
auto verify_token(const log_surgeon::Token& token, const log_surgeon::lexers::ByteLexer& lexer,
                 string_view expected_type, string_view expected_value) -> void {
    REQUIRE(nullptr != token.m_type_ids_ptr);
    REQUIRE(1 == token.m_type_ids_ptr->size());
    REQUIRE(expected_type == lexer.m_id_symbol[token.m_type_ids_ptr->at(0)]);
    REQUIRE(expected_value == token.to_string_view());
}

359-400: Track missing test coverage for register functionality.

The TODO comments indicate that tests for register-related functionality are pending implementation. This should be tracked to ensure complete test coverage.

Would you like me to create a GitHub issue to track the implementation of these test cases?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e08f728 and bc78c49.

📒 Files selected for processing (18)
  • CMakeLists.txt (1 hunks)
  • src/log_surgeon/Lexer.hpp (5 hunks)
  • src/log_surgeon/Lexer.tpp (3 hunks)
  • src/log_surgeon/LexicalRule.hpp (1 hunks)
  • src/log_surgeon/SchemaParser.cpp (4 hunks)
  • src/log_surgeon/finite_automata/Capture.hpp (2 hunks)
  • src/log_surgeon/finite_automata/Dfa.hpp (3 hunks)
  • src/log_surgeon/finite_automata/Nfa.hpp (3 hunks)
  • src/log_surgeon/finite_automata/NfaState.hpp (3 hunks)
  • src/log_surgeon/finite_automata/PrefixTree.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (20 hunks)
  • src/log_surgeon/finite_automata/RegisterHandler.hpp (2 hunks)
  • src/log_surgeon/finite_automata/TaggedTransition.hpp (3 hunks)
  • tests/CMakeLists.txt (2 hunks)
  • tests/test-capture.cpp (1 hunks)
  • tests/test-lexer.cpp (2 hunks)
  • tests/test-nfa.cpp (4 hunks)
  • tests/test-tag.cpp (0 hunks)
💤 Files with no reviewable changes (1)
  • tests/test-tag.cpp
✅ Files skipped from review due to trivial changes (2)
  • src/log_surgeon/finite_automata/PrefixTree.hpp
  • src/log_surgeon/finite_automata/Capture.hpp
🧰 Additional context used
📓 Path-based instructions (12)
src/log_surgeon/finite_automata/RegisterHandler.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Dfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-nfa.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-capture.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/LexicalRule.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/SchemaParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/NfaState.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-lexer.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/TaggedTransition.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Nfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegexAST.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/Lexer.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (6)
src/log_surgeon/finite_automata/RegisterHandler.hpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T22:25:35.608Z
Learning: In the `RegisterHandler` class in `src/log_surgeon/finite_automata/RegisterHandler.hpp`, the methods `add_register` and `append_position` rely on `emplace_back` and `m_prefix_tree.insert` to handle exceptions correctly and maintain consistent state without requiring additional exception handling.
src/log_surgeon/SchemaParser.cpp (1)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
src/log_surgeon/finite_automata/NfaState.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:127-128
Timestamp: 2024-11-10T16:46:58.543Z
Learning: `RegexNFAUTF8State` is defined as a type alias for `RegexNFAState<RegexNFAStateType::UTF8>`.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/finite_automata/TaggedTransition.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/TaggedTransition.hpp:16-37
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `src/log_surgeon/finite_automata/TaggedTransition.hpp`, the classes `PositiveTaggedTransition` and `NegativeTaggedTransition` currently do not share enough functionality to justify refactoring into a common base class.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
src/log_surgeon/finite_automata/RegexAST.hpp (3)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.
src/log_surgeon/Lexer.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:37-90
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In this codebase, prefer code clarity over efficiency optimizations unless efficiency is a critical concern.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T22:25:35.608Z
Learning: In the `RegisterHandler` class in `src/log_surgeon/finite_automata/RegisterHandler.hpp`, the methods `add_register` and `append_position` rely on `emplace_back` and `m_prefix_tree.insert` to handle exceptions correctly and maintain consistent state without requiring additional exception handling.
🪛 GitHub Actions: lint
tests/test-nfa.cpp

[error] 47-65: Code should be clang-formatted

src/log_surgeon/LexicalRule.hpp

[error] 28-28: Code should be clang-formatted

src/log_surgeon/SchemaParser.cpp

[error] 1-1: Code should be clang-formatted

src/log_surgeon/Lexer.tpp

[error] 4-4: Code should be clang-formatted


[error] 390-390: Code should be clang-formatted

src/log_surgeon/finite_automata/Nfa.hpp

[error] 122-122: Code should be clang-formatted

src/log_surgeon/finite_automata/RegexAST.hpp

[error] 113-113: Code should be clang-formatted

⏰ Context from checks skipped due to timeout of 90000ms (2)
  • GitHub Check: build (ubuntu-latest, Release)
  • GitHub Check: build (ubuntu-latest, Debug)
🔇 Additional comments (29)
src/log_surgeon/finite_automata/Dfa.hpp (2)

5-5: Well-structured include additions!

Making the dependencies on <map>, <stack>, and RegisterHandler.hpp explicit improves maintainability by clearly showing the required headers.

Also applies to: 8-8, 14-14


82-82: Good defensive programming practice!

Initializing the state pointer to nullptr is a good practice that prevents undefined behaviour and makes debugging easier.

src/log_surgeon/finite_automata/TaggedTransition.hpp (1)

14-15: Introduced tag_id_t typedef for clarity

Defining tag_id_t as an alias for std::uint32_t improves code readability and maintainability.

src/log_surgeon/finite_automata/Nfa.hpp (1)

43-43: Optimized constructor by passing vector by const reference

Changing the constructor parameter to accept rules by const reference avoids unnecessary copying, improving performance.

tests/test-capture.cpp (3)

8-11: LGTM! Basic test case is well-implemented.

The test verifies the fundamental functionality of the get_name() method.


13-16: LGTM! Empty string handling is properly tested.

The test ensures that empty capture names are handled correctly.


18-21: LGTM! Special characters test is comprehensive.

The test verifies that special characters in capture names are preserved correctly.

src/log_surgeon/LexicalRule.hpp (1)

26-28: LGTM! Method implementation is correct.

The get_captures() method is well-designed:

  • Properly marked with [[nodiscard]]
  • Returns const pointers to prevent modification
  • Correctly delegates to the underlying regex implementation
🧰 Tools
🪛 GitHub Actions: lint

[error] 28-28: Code should be clang-formatted

src/log_surgeon/finite_automata/RegisterHandler.hpp (2)

5-5: LGTM! Include is properly placed.

The <cstdint> include is correctly added to support the new type alias.


21-21: LGTM! Type alias enhances code clarity.

The register_id_t type alias improves code readability and maintainability by providing a meaningful name for the register identifier type.

src/log_surgeon/Lexer.hpp (3)

27-29: LGTM! Type aliases enhance code clarity.

The type aliases improve code readability and maintainability by providing meaningful names for different identifier types.


59-61: LGTM! Exception documentation is clear.

The comment properly documents the exception that can be thrown when duplicate capture names are encountered.


197-199: LGTM! Member variables support the new mapping functionality.

The new member variables properly support the mapping between variables, captures, tags, and registers.

tests/test-nfa.cpp (3)

58-66: LGTM! Test expectations updated to match the new capture-to-tag mapping.

The changes correctly reflect the new implementation where captures are mapped to numerical tag indices.


Line range hint 69-101: LGTM! Consistent update of test expectations.

The changes maintain consistency with the new capture-to-tag mapping implementation.


Line range hint 106-132: LGTM! Test expectations properly updated.

The changes complete the consistent update of test expectations for the new capture-to-tag mapping implementation.

src/log_surgeon/finite_automata/NfaState.hpp (2)

35-40: LGTM! Good use of value types over pointers.

The change from Tag const* to tag_id_t reduces memory management complexity and aligns with modern C++ practices.


54-56: LGTM! Consistent use of value types.

The method signature update maintains consistency with the constructor changes, using tag_id_t throughout the class.

src/log_surgeon/Lexer.tpp (2)

396-401: LGTM! Proper implementation of capture-to-tag mapping.

The code correctly processes and stores the relationships between captures and their corresponding tag IDs.


403-403: Verify the impact of DFA ignoring captures.

The TODO comment indicates that DFA treats capture groups as regular expressions. Please verify if this limitation affects the correctness of capture group functionality.

✅ Verification successful

DFA capture group limitation is a known design choice

The TODO comment accurately reflects the current implementation. The DFA treats capture expressions like "capture:user=(?<user_id>\d+)" as simple patterns "capture:user=\d+". This is a known limitation in the design, not a bug:

  • Tests in test-lexer.cpp show that capture groups are supported in the API
  • The DFA implementation intentionally simplifies these patterns
  • Capture group processing likely happens at a different stage of processing
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Search for tests that verify capture group functionality with DFA.
rg -A 5 "TEST_CASE.*capture.*DFA" tests/

Length of output: 42


Script:

#!/bin/bash
# Search for capture-related code and tests
echo "=== Capture-related tests ==="
rg -l "TEST_CASE.*capture" tests/

echo -e "\n=== Capture implementation ==="
ast-grep --pattern 'capture'

echo -e "\n=== DFA implementation ==="
rg -l "DFA|deterministic.*automaton" src/

Length of output: 2361


Script:

#!/bin/bash
# Examine capture tests and DFA implementation
echo "=== Capture Tests ==="
rg -A 10 "TEST_CASE.*capture" tests/test-lexer.cpp tests/test-capture.cpp

echo -e "\n=== DFA Capture Handling ==="
rg -A 5 "capture" src/log_surgeon/finite_automata/Dfa.hpp

echo -e "\n=== NFA to DFA Conversion ==="
rg -A 10 "convert.*to.*dfa|to_dfa" src/log_surgeon/finite_automata/

Length of output: 1133

tests/test-lexer.cpp (1)

299-358: LGTM! Comprehensive test coverage for capture group functionality.

The test case thoroughly verifies the new capture-to-tag mapping features, including variable IDs, capture IDs, and tag IDs relationships.

src/log_surgeon/SchemaParser.cpp (4)

13-13: LGTM!

The include statement for Capture.hpp is correctly placed and aligns with the transition from tags to captures.


205-205: LGTM!

The comment has been correctly updated to use "captures" terminology, maintaining consistency with the codebase changes.


251-251: LGTM!

The comment has been correctly updated to use "captures" terminology, maintaining consistency with the codebase changes.


170-170: Verify non-null requirement for Capture.

The change from Tag to Capture is consistent. However, based on previous learnings, ensure that the Capture instance is always non-null when passed to RegexASTCapture.

Run this script to verify the non-null requirement:

✅ Verification successful

Non-null requirement is enforced by design

The Capture class design and RegexASTCapture's constructor guarantees that Capture instances cannot be null. The make_unique<Capture>(r4->m_name) call is safe as it satisfies the required string parameter constructor.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify that Capture instances are always non-null when passed to RegexASTCapture.

# Search for any null checks or assignments of Capture instances
ast-grep --pattern 'RegexASTCapture($$$, nullptr)' || ast-grep --pattern 'RegexASTCapture($$$, NULL)'

# Search for any potential null assignments to Capture variables
rg -A 2 'Capture.*=.*null|Capture.*=.*nullptr|Capture.*=.*NULL'

Length of output: 118


Script:

#!/bin/bash
# Find Capture class definition and usage
ast-grep --pattern 'class Capture {
  $$$
}'

# Find RegexASTCapture constructor definition
ast-grep --pattern 'RegexASTCapture::RegexASTCapture($$$)'

# Find all Capture construction patterns
ast-grep --pattern 'new Capture($$$)' || ast-grep --pattern 'make_unique<Capture>($$$)'

# Find all RegexASTCapture construction patterns
ast-grep --pattern 'new RegexASTCapture($$$)' || ast-grep --pattern 'make_unique<RegexASTCapture>($$$)'

Length of output: 287


Script:

#!/bin/bash
# Search for Capture and RegexASTCapture in header files
fd -e hpp -e h --exec cat {} \; | rg -A 10 'class (Capture|RegexASTCapture)'

# Find all make_unique constructions
rg -A 2 'make_unique<(Capture|RegexASTCapture)>'

# Find constructor implementations
rg -A 5 '(Capture|RegexASTCapture)::\1'

Length of output: 1320

tests/CMakeLists.txt (2)

27-34: LGTM!

The addition of test-capture.cpp and the improved formatting enhance the maintainability of the test suite.


5-7: Verify source file list consistency.

The addition of Capture.hpp aligns with the transition from tags to captures. However, Nfa.hpp and NfaState.hpp appear to be duplicated in the source list.

Run this script to verify the source file list:

CMakeLists.txt (2)

96-96: LGTM!

The addition of Capture.hpp is correctly placed within the finite_automata section.


102-104: Verify the necessity of restored files.

The restoration of PrefixTree and RegexAST files should be verified against the capture functionality requirements.

Run this script to verify file dependencies:

✅ Verification successful

Restoration of files is necessary and correct

The PrefixTree and RegexAST files are essential components of the system:

  • PrefixTree is required by RegisterHandler for position tracking
  • RegexAST provides critical capture group functionality
  • Together they form a cohesive pattern matching and position tracking system
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check dependencies between restored files and Capture functionality.

# Search for Capture-related usage in restored files
echo "Checking PrefixTree.cpp for Capture usage:"
rg -A 3 "Capture" "src/log_surgeon/finite_automata/PrefixTree.cpp" || true

echo "Checking PrefixTree.hpp for Capture usage:"
rg -A 3 "Capture" "src/log_surgeon/finite_automata/PrefixTree.hpp" || true

echo "Checking RegexAST.hpp for Capture usage:"
rg -A 3 "Capture" "src/log_surgeon/finite_automata/RegexAST.hpp" || true

Length of output: 4173


Script:

#!/bin/bash
# Check relationships between PrefixTree and RegexAST/Capture components

echo "Checking RegexAST.hpp for PrefixTree usage:"
rg -A 3 "PrefixTree" "src/log_surgeon/finite_automata/RegexAST.hpp" || true

echo "Checking for files using PrefixTree:"
rg --files-with-matches "PrefixTree" || true

echo "Checking implementation files for relationships:"
rg -l "class.*(?:RegexAST|PrefixTree)" || true

Length of output: 770


Script:

#!/bin/bash
# Examine RegisterHandler implementation

echo "Checking RegisterHandler for component relationships:"
rg -A 5 "class RegisterHandler" "src/log_surgeon/finite_automata/RegisterHandler.hpp" || true

echo "Checking RegisterHandler usage of components:"
rg -A 3 "(?:PrefixTree|RegexAST|Capture)" "src/log_surgeon/finite_automata/RegisterHandler.hpp" || true

echo "Checking test implementation for component integration:"
rg -A 3 "(?:PrefixTree|RegexAST|Capture)" "tests/test-register-handler.cpp" || true

Length of output: 2057

src/log_surgeon/finite_automata/Dfa.hpp Show resolved Hide resolved
src/log_surgeon/finite_automata/TaggedTransition.hpp Outdated Show resolved Hide resolved
src/log_surgeon/finite_automata/RegexAST.hpp Show resolved Hide resolved
src/log_surgeon/LexicalRule.hpp Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/log_surgeon/finite_automata/RegexAST.hpp (1)

650-660: ⚠️ Potential issue

Avoid throwing exceptions in the initializer list

Throwing exceptions in the member initializer list can lead to issues with object construction and exception safety. It's safer to perform null checks in the constructor body and throw exceptions from there.

Apply this diff to move the null checks into the constructor body:

 RegexASTCapture(
         std::unique_ptr<RegexAST<TypedNfaState>> group_regex_ast,
         std::unique_ptr<Capture> capture
 )
-        : m_group_regex_ast{(
-                  nullptr == group_regex_ast
-                          ? throw std::invalid_argument("Group regex AST cannot be null")
-                          : std::move(group_regex_ast)
-          )},
-          m_capture{
-                  nullptr == capture ? throw std::invalid_argument("Capture cannot be null")
-                                     : std::move(capture)
-          } {
+        {
+            if (nullptr == group_regex_ast) {
+                throw std::invalid_argument("Group regex AST cannot be null");
+            }
+            if (nullptr == capture) {
+                throw std::invalid_argument("Capture cannot be null");
+            }
+            m_group_regex_ast = std::move(group_regex_ast);
+            m_capture = std::move(capture);
             RegexAST<TypedNfaState>::set_subtree_positive_captures(
                     m_group_regex_ast->get_subtree_positive_captures()
             );
             RegexAST<TypedNfaState>::add_subtree_positive_captures({m_capture.get()});
         }
🧹 Nitpick comments (4)
tests/test-lexer.cpp (3)

215-297: Consider adding test cases for edge cases.

The test case thoroughly validates the basic functionality but could benefit from additional test cases for:

  • Empty input
  • Multiple consecutive delimiters
  • Input containing only delimiters

359-360: Track TODOs in the issue tracker.

The TODOs indicate pending implementation of determinization and simulation features. These should be tracked in the issue tracker for better visibility and follow-up.

Would you like me to create GitHub issues to track these TODOs?

Also applies to: 399-400


342-357: Consider adding negative test cases for mapping validation.

The mapping validation tests only cover the happy path. Consider adding test cases for:

  • Non-existent variable IDs
  • Non-existent capture IDs
  • Invalid tag IDs
src/log_surgeon/finite_automata/Nfa.hpp (1)

95-99: Consider using stable identifiers instead of raw pointers as unordered_map keys

Using Capture const* as keys in an unordered_map relies on the stability of pointer values, which can be error-prone if objects are moved or deleted. To enhance safety and maintainability, consider using unique IDs or smart pointers with custom hash and equality functions as keys.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bc78c49 and 2c89216.

📒 Files selected for processing (4)
  • src/log_surgeon/Lexer.hpp (5 hunks)
  • src/log_surgeon/finite_automata/Nfa.hpp (3 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (21 hunks)
  • tests/test-lexer.cpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • src/log_surgeon/Lexer.hpp
🧰 Additional context used
📓 Path-based instructions (3)
tests/test-lexer.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Nfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegexAST.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)
src/log_surgeon/finite_automata/RegexAST.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: build (ubuntu-latest, Release)
  • GitHub Check: build (ubuntu-latest, Debug)
  • GitHub Check: build (macos-latest, Release)
  • GitHub Check: build (macos-latest, Debug)
🔇 Additional comments (4)
tests/test-lexer.cpp (2)

9-9: LGTM!

The new includes and using declarations are appropriate for the added test cases.

Also applies to: 15-15


362-397: LGTM!

The token recognition tests with captures are thorough and well-structured.

src/log_surgeon/finite_automata/Nfa.hpp (2)

43-43: Efficient parameter passing by const reference

Good use of a constant reference in the Nfa constructor to avoid unnecessary copies of rules.


20-28: Verify thread safety of UniqueIdGenerator

If UniqueIdGenerator is used in multithreaded contexts, the generate_id() method is not thread-safe due to the use of the post-increment operator. Consider using std::atomic<uint32_t> for current_id or synchronizing access with a mutex to ensure thread safety.

Run the following script to check for usage of UniqueIdGenerator in multithreaded contexts:

@SharafMohamed SharafMohamed changed the title feat: lexer now tracks mapping from variables to capture to tags to registers. feat: separate the concept of captures and tags; lexer now tracks mapping from variables to capture to tags to registers. Jan 13, 2025
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

♻️ Duplicate comments (1)
src/log_surgeon/finite_automata/TaggedTransition.hpp (1)

55-56: 🛠️ Refactor suggestion

Pass vector by const reference to improve efficiency.

Consider passing tag_ids by const reference to avoid unnecessary copying of the vector.

-NegativeTaggedTransition(std::vector<tag_id_t> tag_ids, TypedNfaState const* dest_state)
+NegativeTaggedTransition(std::vector<tag_id_t> const& tag_ids, TypedNfaState const* dest_state)
🧹 Nitpick comments (1)
src/log_surgeon/Lexer.tpp (1)

381-394: Enhance the error message for duplicate capture names.

The error message could be more descriptive by including the duplicate capture name.

Apply this diff to improve the error message:

-                throw std::invalid_argument("`m_rules` contains capture names that are not unique."
+                throw std::invalid_argument(fmt::format(
+                    "`m_rules` contains duplicate capture name: '{}'", capture_name)
                );
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c89216 and f88a780.

📒 Files selected for processing (7)
  • src/log_surgeon/Lexer.tpp (3 hunks)
  • src/log_surgeon/LexicalRule.hpp (1 hunks)
  • src/log_surgeon/SchemaParser.cpp (4 hunks)
  • src/log_surgeon/finite_automata/PrefixTree.hpp (1 hunks)
  • src/log_surgeon/finite_automata/RegexAST.hpp (21 hunks)
  • src/log_surgeon/finite_automata/TaggedTransition.hpp (3 hunks)
  • tests/test-nfa.cpp (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (4)
  • src/log_surgeon/finite_automata/PrefixTree.hpp
  • src/log_surgeon/LexicalRule.hpp
  • tests/test-nfa.cpp
  • src/log_surgeon/SchemaParser.cpp
🧰 Additional context used
📓 Path-based instructions (2)
src/log_surgeon/finite_automata/TaggedTransition.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegexAST.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (2)
src/log_surgeon/finite_automata/TaggedTransition.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/TaggedTransition.hpp:16-37
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `src/log_surgeon/finite_automata/TaggedTransition.hpp`, the classes `PositiveTaggedTransition` and `NegativeTaggedTransition` currently do not share enough functionality to justify refactoring into a common base class.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
src/log_surgeon/finite_automata/RegexAST.hpp (2)
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.
Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.
⏰ Context from checks skipped due to timeout of 90000ms (3)
  • GitHub Check: build (ubuntu-latest, Release)
  • GitHub Check: build (ubuntu-latest, Debug)
  • GitHub Check: build (macos-latest, Debug)
🔇 Additional comments (9)
src/log_surgeon/finite_automata/TaggedTransition.hpp (2)

14-14: LGTM! Good use of type alias.

The introduction of tag_id_t improves code readability and provides better type safety.


23-24: LGTM! Good transition from pointers to IDs.

The changes effectively replace Tag pointers with tag_id_t, which:

  • Reduces memory usage
  • Simplifies memory management
  • Maintains the same functionality

Also applies to: 40-40, 44-44

src/log_surgeon/Lexer.tpp (3)

7-7: LGTM!

The addition of <stdexcept> header is appropriate for using std::invalid_argument exception.


362-374: LGTM!

The parameter type change from uint32_t to symbol_id_t improves type safety and semantic clarity.


397-402: LGTM!

The mapping between captures and tag IDs is well-structured and maintains the relationship necessary for tracking ownership transfer after parsing.

src/log_surgeon/finite_automata/RegexAST.hpp (4)

33-38: LGTM!

The documentation updates accurately reflect the transition from tags to captures and clearly explain the role of captures in the AST.


155-157: LGTM!

The member variable changes from tags to captures are consistent and well-named, reflecting the new design.


650-664: Move null checks to constructor body.

As previously discussed, throwing exceptions in the initializer list can lead to issues with object construction and exception safety.

Apply this diff to move the null checks into the constructor body:

 RegexASTCapture(
         std::unique_ptr<RegexAST<TypedNfaState>> group_regex_ast,
         std::unique_ptr<Capture> capture
-)
-        : m_group_regex_ast{(
-              nullptr == group_regex_ast
-                      ? throw std::invalid_argument("Group regex AST cannot be null")
-                      : std::move(group_regex_ast)
-          )},
-          m_capture{
-              nullptr == capture ? throw std::invalid_argument("Capture cannot be null")
-                                 : std::move(capture)
-          } {
+) {
+    if (nullptr == group_regex_ast) {
+        throw std::invalid_argument("Group regex AST cannot be null");
+    }
+    if (nullptr == capture) {
+        throw std::invalid_argument("Capture cannot be null");
+    }
+    m_group_regex_ast = std::move(group_regex_ast);
+    m_capture = std::move(capture);
     RegexAST<TypedNfaState>::set_subtree_positive_captures(
             m_group_regex_ast->get_subtree_positive_captures()
     );
     RegexAST<TypedNfaState>::add_subtree_positive_captures({m_capture.get()});
 }

954-961: LGTM!

The serialization changes properly handle captures and maintain Unicode support through std::u32string.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
tests/test-lexer.cpp (2)

65-72: Track the TODO comment about register value checks.

The test_scanning_input function is well-implemented, but there's a TODO comment about checking register values that should be tracked.

Would you like me to create a GitHub issue to track the implementation of register value checks when simulation is implemented?

Also applies to: 128-153


312-346: Track the TODO comment about register-related checks.

The test case provides good coverage for capture groups, but there's a TODO comment about checking get_register_for_tag_id and get_registers_for_capture that should be tracked.

Would you like me to create a GitHub issue to track the implementation of these register-related checks when determinization is implemented?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f88a780 and 0cc6c24.

📒 Files selected for processing (1)
  • tests/test-lexer.cpp (5 hunks)
🧰 Additional context used
📓 Path-based instructions (1)
tests/test-lexer.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

⏰ Context from checks skipped due to timeout of 90000ms (4)
  • GitHub Check: build (ubuntu-latest, Release)
  • GitHub Check: build (ubuntu-latest, Debug)
  • GitHub Check: build (macos-latest, Release)
  • GitHub Check: build (macos-latest, Debug)
🔇 Additional comments (4)
tests/test-lexer.cpp (4)

9-9: LGTM! Well-organized header inclusions and using declarations.

The header inclusions follow a good organization pattern (standard library headers first, followed by project headers), and the using declarations are properly scoped.

Also applies to: 15-25


58-63: LGTM! Well-structured utility function for lexer initialization.

The initialize_lexer function is well-documented and follows good practices:

  • Clear parameter documentation
  • Proper handling of delimiters
  • Good error handling for duplicate symbols

Also applies to: 93-126


296-310: LGTM! Comprehensive test coverage for basic lexer functionality.

The test case provides good coverage for:

  • Token recognition for valid patterns
  • Handling of uncaught strings

347-349: LGTM! Good test coverage for token matching scenarios.

The test provides comprehensive coverage for:

  • Exact pattern matching
  • Handling of non-matching patterns
  • Handling of partial matches

Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First round of review. It is highly recommended to configure clang-tidy in CLion or the editor you're using so that many errors can be caught before code review.

Comment on lines +133 to +137
auto const capture_ids{m_var_id_to_capture_ids.find(var_id)};
if (m_var_id_to_capture_ids.end() == capture_ids) {
return std::nullopt;
}
return capture_ids->second;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto const capture_ids{m_var_id_to_capture_ids.find(var_id)};
if (m_var_id_to_capture_ids.end() == capture_ids) {
return std::nullopt;
}
return capture_ids->second;
if (m_var_id_to_capture_ids.contains(var_id)) {
return m_var_id_to_capture_ids.at(var_id);
}
return std::nullopt;

How about rewriting in this way to avoid using iterators (shouldn't have any performance difference).
Same for the rest

}

[[nodiscard]] auto get_tag_ids_for_capture_id(symbol_id_t const capture_id
) const -> std::optional<std::pair<tag_id_t, tag_id_t>> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Headers for std::optional and std::pair are not directly included

@@ -122,8 +128,48 @@ class Lexer {
return m_dfa;
}

std::unordered_map<std::string, uint32_t> m_symbol_id;
std::unordered_map<uint32_t, std::string> m_id_symbol;
[[nodiscard]] auto get_capture_ids_for_var_id(symbol_id_t const var_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about renaming to get_capture_ids_from_var_id

return tag_ids->second;
}

[[nodiscard]] auto get_register_for_tag_id(tag_id_t const tag_id
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about get_reg_id_xxx instead of get_register since the return value is an ID.

Comment on lines +160 to +168
auto const tag_ids{get_tag_ids_for_capture_id(capture_id)};
if (tag_ids.has_value()) {
auto const start_reg{get_register_for_tag_id(tag_ids.value().first())};
auto const end_reg{get_register_for_tag_id(tag_ids.value().second())};
if (start_reg.has_value() && end_reg.has_value()) {
return std::make_pair(start_reg.value(), end_reg.value());
}
}
return std::nullopt;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
auto const tag_ids{get_tag_ids_for_capture_id(capture_id)};
if (tag_ids.has_value()) {
auto const start_reg{get_register_for_tag_id(tag_ids.value().first())};
auto const end_reg{get_register_for_tag_id(tag_ids.value().second())};
if (start_reg.has_value() && end_reg.has_value()) {
return std::make_pair(start_reg.value(), end_reg.value());
}
}
return std::nullopt;
auto const optional_tag_ids{get_tag_ids_for_capture_id(capture_id)};
if (false == optional_tag_ids.has_value()) {
return std::nullopt;
}
auto const [start_tag_id, end_tag_id]{optional_tag_ids.value()};
auto const optional_start_reg_id{get_register_for_tag_id(start_tag_id)};
if (false == optional_tag_ids.has_value()) {
return std::nullopt;
}
auto const optional_end_reg_id{get_register_for_tag_id(end_tag_id)};
if (false == optional_tag_ids.has_value()) {
return std::nullopt;
}
return {optional_start_reg_id.value(), optional_end_reg_id.value()};

How about rewriting in this way to:

  • Follow the convention of using optional_ prefix for optional variables.
  • Reduce the level of nested if statements.

REQUIRE(lexer.m_symbol_id.end() != var_id);

string captureName{cCaptureName};
auto const capture_id{lexer.m_symbol_id.find(captureName)};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same, capture_id_it, or use contains and at instead

auto const capture_id{lexer.m_symbol_id.find(captureName)};
REQUIRE(lexer.m_symbol_id.end() != capture_id);

auto capture_ids{lexer.get_capture_ids_for_var_id(var_id->second)};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • optional_capture_ids
  • auto const

REQUIRE(1 == capture_ids.value().size());
REQUIRE(capture_id->second == capture_ids.value()[0]);

auto tag_ids{lexer.get_tag_ids_for_capture_id(capture_ids.value()[0])};
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • optional_tag_ids
  • auto const

auto initialize_lexer(Schema schema, ByteLexer& lexer) -> void;

/**
* Scans the given input to ensure the correct behavior.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We might need a more detailed doc string to explain what this helper checks.

Comment on lines +344 to +345
// TODO: add check for get_register_for_tag_id and get_registers_for_capture when
// determinization is implemented.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
// TODO: add check for get_register_for_tag_id and get_registers_for_capture when
// determinization is implemented.
// TODO: Add check for `get_register_for_tag_id` and `get_registers_for_capture` when
// TDFA's determination is implemented.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (5)
src/log_surgeon/UniqueIdGenerator.hpp (1)

5-13: Improve class implementation with member initialization and nodiscard attribute.

The class can be made more concise and safer by using member initialization and adding the nodiscard attribute to prevent accidental discarding of generated IDs.

 namespace log_surgeon {
 class UniqueIdGenerator {
 public:
-    UniqueIdGenerator() : current_id{0} {}
-
     [[nodiscard]] auto generate_id() -> uint32_t { return current_id++; }
 
 private:
-    uint32_t current_id;
+    uint32_t current_id{0};
 };
 }  // namespace log_surgeon
CMakeLists.txt (2)

Line range hint 1-1: Address CMake deprecation warning.

Update the CMake minimum version to use the range syntax for better compatibility.

-cmake_minimum_required(VERSION 3.5.1)
+cmake_minimum_required(VERSION 3.5.1...3.27)

71-84: Improve file ordering in SOURCE_FILES.

The finite automata files should be alphabetically ordered for better maintainability.

     src/log_surgeon/FileReader.hpp
     src/log_surgeon/finite_automata/Capture.hpp
-    src/log_surgeon/finite_automata/Dfa.hpp
-    src/log_surgeon/finite_automata/DfaState.hpp
-    src/log_surgeon/finite_automata/DfaStatePair.hpp
-    src/log_surgeon/finite_automata/Nfa.hpp
-    src/log_surgeon/finite_automata/NfaState.hpp
-    src/log_surgeon/finite_automata/PrefixTree.cpp
-    src/log_surgeon/finite_automata/PrefixTree.hpp
-    src/log_surgeon/finite_automata/RegexAST.hpp
-    src/log_surgeon/finite_automata/RegisterHandler.hpp
-    src/log_surgeon/finite_automata/StateType.hpp
-    src/log_surgeon/finite_automata/TaggedTransition.hpp
-    src/log_surgeon/finite_automata/UnicodeIntervalTree.hpp
-    src/log_surgeon/finite_automata/UnicodeIntervalTree.tpp
+    src/log_surgeon/finite_automata/Dfa.hpp
+    src/log_surgeon/finite_automata/DfaState.hpp
+    src/log_surgeon/finite_automata/DfaStatePair.hpp
+    src/log_surgeon/finite_automata/Nfa.hpp
+    src/log_surgeon/finite_automata/NfaState.hpp
+    src/log_surgeon/finite_automata/PrefixTree.cpp
+    src/log_surgeon/finite_automata/PrefixTree.hpp
+    src/log_surgeon/finite_automata/RegexAST.hpp
+    src/log_surgeon/finite_automata/RegisterHandler.hpp
+    src/log_surgeon/finite_automata/StateType.hpp
+    src/log_surgeon/finite_automata/TaggedTransition.hpp
+    src/log_surgeon/finite_automata/UnicodeIntervalTree.hpp
+    src/log_surgeon/finite_automata/UnicodeIntervalTree.tpp
src/log_surgeon/finite_automata/Nfa.hpp (2)

120-124: Initialize all member variables in constructor.

Consider explicitly initializing all member variables in the initialization list for better clarity.

 Nfa<TypedNfaState>::Nfa(std::vector<LexicalRule<TypedNfaState>> const& rules)
-        : m_root{new_state()} {
+        : m_states{}
+        , m_capture_to_tag_ids{}
+        , m_root{new_state()}
+        , m_unique_id_generator{} {
     for (auto const& rule : rules) {
         rule.add_to_nfa(this);
     }

112-113: Track TODO for lexer limitation.

The TODO comment indicates a design limitation in the lexer that should be addressed.

Would you like me to create an issue to track this TODO for improving the lexer's handling of capture group names?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0cc6c24 and d38393d.

📒 Files selected for processing (4)
  • CMakeLists.txt (2 hunks)
  • src/log_surgeon/UniqueIdGenerator.hpp (1 hunks)
  • src/log_surgeon/finite_automata/Nfa.hpp (3 hunks)
  • tests/CMakeLists.txt (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • tests/CMakeLists.txt
🧰 Additional context used
📓 Path-based instructions (2)
src/log_surgeon/UniqueIdGenerator.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Nfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

🪛 GitHub Actions: Build
CMakeLists.txt

[warning] 1-1: CMake Deprecation Warning: Compatibility with CMake < 3.10 will be removed from a future version of CMake. Update the VERSION argument or use ... syntax.

🔇 Additional comments (2)
src/log_surgeon/finite_automata/Nfa.hpp (2)

21-28: Improve class documentation.

The documentation needs to be more precise and follow proper formatting.

 /**
- * Represents a NFA(non-deterministic finite automata) for recognizing a language based on the set
- * of rules used during initialization. Currently use as an intermediate model for generating the
- * DFA.
+ * Represents a Non-Deterministic Finite Automaton (NFA) designed to recognize a language based on
+ * a set of rules provided during initialization. This class serves as an intermediate
+ * representation used for generating the corresponding Deterministic Finite Automaton (DFA).
  *
- * Currently we assume all capture groups have unique names.
+ * NOTE: It is assumed that all capture groups have unique names, even across different rules.
  * @tparam TypedNfaState
  */

128-139: Improve implementation of get_or_create_capture_tags.

The method can be made more concise and safer with const variables.

 auto Nfa<TypedNfaState>::get_or_create_capture_tags(Capture const* capture
 ) -> std::pair<tag_id_t, tag_id_t> {
-    auto const existing_tags{m_capture_to_tag_ids.find(capture)};
-    if (m_capture_to_tag_ids.end() == existing_tags) {
-        auto start_tag{m_unique_id_generator.generate_id()};
-        auto end_tag{m_unique_id_generator.generate_id()};
-        auto new_tags{std::make_pair(start_tag, end_tag)};
-        m_capture_to_tag_ids.emplace(capture, new_tags);
-        return new_tags;
+    if (false == m_capture_to_tag_ids.contains(capture)) {
+        m_capture_to_tag_ids.emplace(
+                capture,
+                m_unique_id_generator.generate_id(),
+                m_unique_id_generator.generate_id()
+        );
     }
-    return existing_tags->second;
+    return m_capture_to_tag_ids.at(capture);
 }

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants