feat: separate the concept of captures and tags; lexer now tracks mapping from variables to capture to tags to registers. #72

SharafMohamed · 2025-01-13T10:54:20Z

Description

Previously tags were being used to refer to a single capture group, as well as the start and end markers for a capture group's position in the NFA.
- The former has been changed to be referred to as a capture.
- The latter is now stored as a unique unsigned integer.
To simplify information tracking and ownership transfer, the lexer is now responsible for keeping track of all the relational information it will need after parsing. This includes:
- A map from each variable id to the capture id's for the groups the variable contains.
- A map from each capture to its start and end tag.
- A map from each tag to its final register.
Fix filer order in cmake.
Use *_id_t aliases to make it more clear what the intended purpose of the maps are.

Validation performed

Added new unit-test for the lexer's base functionality.
Added new unit-test for the lexer's capture group functionality. This includes testing the maps that can currently be assigned.

Summary by CodeRabbit

Based on the comprehensive changes across multiple files, here are the updated release notes:

New Features
- Introduced a new Capture class to replace the previous Tag system.
- Enhanced lexer functionality with improved capture group handling.
- Added methods for retrieving capture and tag information.
Improvements
- Refined type management with new type aliases like symbol_id_t and register_id_t.
- Updated NFA and regex AST to use a capture-based tagging mechanism.
- Improved error handling and unique ID generation for captures.
Refactoring
- Replaced tag-based system with a more robust capture-based approach.
- Updated method signatures and internal data structures across multiple components.
- Streamlined serialization and transition handling.
Testing
- Added new unit tests for Capture functionality.
- Updated existing tests to reflect new capture and tagging mechanisms.

These changes represent a significant architectural improvement in the log parsing and regex handling system.

Co-authored-by: Lin Zhihao <[email protected]>

…reflect functionality change.

coderabbitai · 2025-01-13T10:54:29Z

Walkthrough

This pull request introduces a comprehensive refactoring of the log surgeon's finite automata and lexer components, primarily focusing on replacing the concept of "tags" with "captures". The changes span multiple files across the project, including CMakeLists.txt, header files, and test cases. The refactoring involves renaming classes, updating method signatures, and modifying how captures are managed throughout the codebase, with a particular emphasis on improving the representation and handling of regex capture groups.

Changes

File	Change Summary
`CMakeLists.txt`	Removed `Tag.hpp`, added `Capture.hpp`, restored `PrefixTree.cpp`, `PrefixTree.hpp`, and `RegexAST.hpp`
`src/log_surgeon/Lexer.hpp`	Added type aliases: `register_id_t`, `symbol_id_t`, `tag_id_t`; added methods for capture and tag ID retrieval
`src/log_surgeon/Lexer.tpp`	Updated method signatures to use `symbol_id_t`; added capture name uniqueness validation
`src/log_surgeon/LexicalRule.hpp`	Added `get_captures()` method to return capture information
`src/log_surgeon/SchemaParser.cpp`	Replaced `Tag.hpp` include with `Capture.hpp`
`src/log_surgeon/finite_automata/*`	Comprehensive renaming from `Tag` to `Capture`; updated method signatures to use tag IDs
`tests/`	Added `test-capture.cpp`; removed `test-tag.cpp`; updated existing test cases

Sequence Diagram

sequenceDiagram
    participant Lexer
    participant Nfa
    participant RegexAST
    participant Capture

    Lexer->>Nfa: Create NFA with rules
    Nfa->>RegexAST: Process regex rules
    RegexAST->>Capture: Generate capture groups
    Nfa-->>Lexer: Return NFA with capture information

Possibly related PRs

Add support for tracking matched and unmatched capture groups in RegexAST nodes using integer-based tags; Add support for serializing RegexAST nodes. #38: The changes in this PR involve the addition of tags to the AST to track matched and unmatched capture groups, which relates to the modifications in the CMakeLists.txt file in the main PR that also involves files related to finite automata.
Bug-Fix: Add negative tags for RegexMultiplicationAST with min=0; Update README to include intersection-test. #41: This PR introduces negative tags for RegexMultiplicationAST, which aligns with the changes in the main PR that involve the management of capture groups and their representation in the code.
feat: Split NFA positive tags into start and end transitions to encapsulate a capture group. #50: This PR focuses on splitting NFA positive tags into start and end transitions, which is relevant to the changes in the main PR that involve the management of capture groups and their transitions.

Suggested reviewers

LinZhihao-723

✨ Finishing Touches

📝 Generate Docstrings (Beta)

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share

🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

Review comments: Directly reply to a review comment made by CodeRabbit. Example:
- I pushed a fix in commit <commit_id>, please review it.
- Generate unit testing code for this file.
- Open a follow-up GitHub issue for this discussion.
Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
- @coderabbitai generate unit testing code for this file.
- @coderabbitai modularize this function.
PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
- @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
- @coderabbitai read src/utils.ts and generate unit testing code.
- @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
- @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

@coderabbitai pause to pause the reviews on a PR.
@coderabbitai resume to resume the paused reviews.
@coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
@coderabbitai full review to do a full review from scratch and review all the files again.
@coderabbitai summary to regenerate the summary of the PR.
@coderabbitai generate docstrings to generate docstrings for this PR. (Beta)
@coderabbitai resolve resolve all the CodeRabbit review comments.
@coderabbitai configuration to show the current CodeRabbit configuration for the repository.
@coderabbitai help to get help.

Other keywords and placeholders

Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (`.coderabbit.yaml`)

You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
Please see the configuration documentation for more information.
If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

Visit our Documentation for detailed information on how to use CodeRabbit.
Join our Discord Community to get help, request features, and share feedback.
Follow us on X/Twitter for updates and announcements.

coderabbitai

Actionable comments posted: 4

🧹 Nitpick comments (9)

src/log_surgeon/finite_automata/Dfa.hpp (1)
Line range hint 121-134: Consider adding error handling for invalid input in get_intersect

While the method correctly implements the intersection operation, it might benefit from validation of the input DFA pointer and appropriate error handling.

Consider adding input validation:
 template <typename TypedDfaState>
 auto Dfa<TypedDfaState>::get_intersect(Dfa const* dfa_in) const -> std::set<uint32_t> {
+    if (nullptr == dfa_in) {
+        throw std::invalid_argument("Input DFA pointer cannot be null");
+    }
     std::set<uint32_t> schema_types;
     // ... rest of the implementation
src/log_surgeon/finite_automata/Nfa.hpp (1)

122-122: Code formatting issue detected

The code does not adhere to the project's code formatting guidelines. Please run clang-format on the affected file to resolve formatting issues.

🧰 Tools

🪛 GitHub Actions: lint

[error] 122-122: Code should be clang-formatted

src/log_surgeon/finite_automata/RegexAST.hpp (1)

113-113: Code formatting issue detected

The code does not adhere to the project's code formatting guidelines at line 113. Please run clang-format on the affected file to resolve formatting issues.

🧰 Tools

🪛 GitHub Actions: lint

[error] 113-113: Code should be clang-formatted

tests/test-capture.cpp (1)

23-34: Consider adding more test cases for constructors.

While the current tests cover basic copy and move operations, consider adding tests for:

Copy/move with empty strings

Self-assignment test

Chain assignment test (a = b = c)
src/log_surgeon/Lexer.hpp (1)
131-169: Consider adding validation for capture_id parameter.

While the error handling is good, consider adding a debug assertion or validation for the capture_id parameter in get_registers_for_capture() to catch potential programming errors early.
 [[nodiscard]] auto get_registers_for_capture(symbol_id_t capture_id
 ) const -> std::optional<std::pair<register_id_t, register_id_t>> {
+    assert(capture_id < m_id_symbol.size() && "Invalid capture_id");
     auto const tag_ids = get_tag_ids_for_capture_id(capture_id);
tests/test-nfa.cpp (1)
47-48: Consider using std::move for better performance.

The initialization of ByteNfa could be more efficient by moving rules instead of copying it.
-    ByteNfa const nfa{rules};
+    ByteNfa const nfa{std::move(rules)};
🧰 Tools

🪛 GitHub Actions: lint

[error] 47-65: Code should be clang-formatted
src/log_surgeon/Lexer.tpp (1)
381-394: Enhance the error message for duplicate captures.

The error message could be more descriptive to help users identify which capture names are duplicated.
-                throw std::invalid_argument("`m_rules` contains capture names that are not unique.");
+                throw std::invalid_argument(fmt::format("Duplicate capture name '{}' found in rules.", capture_name));
🧰 Tools

🪛 GitHub Actions: lint

[error] 390-390: Code should be clang-formatted
tests/test-lexer.cpp (2)
215-297: Consider extracting helper functions for better test readability.

The test case could be more maintainable by extracting common setup and verification logic into helper functions.
// Helper function for lexer setup
auto setup_lexer(string_view schema) -> log_surgeon::lexers::ByteLexer {
    log_surgeon::lexers::ByteLexer lexer;
    lexer.add_delimiters({' ', '\n'});
    // ... rest of setup code ...
    return lexer;
}

// Helper function for token verification
auto verify_token(const log_surgeon::Token& token, const log_surgeon::lexers::ByteLexer& lexer,
                 string_view expected_type, string_view expected_value) -> void {
    REQUIRE(nullptr != token.m_type_ids_ptr);
    REQUIRE(1 == token.m_type_ids_ptr->size());
    REQUIRE(expected_type == lexer.m_id_symbol[token.m_type_ids_ptr->at(0)]);
    REQUIRE(expected_value == token.to_string_view());
}
359-400: Track missing test coverage for register functionality.

The TODO comments indicate that tests for register-related functionality are pending implementation. This should be tracked to ensure complete test coverage.

Would you like me to create a GitHub issue to track the implementation of these test cases?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between e08f728 and bc78c49.

📒 Files selected for processing (18)

CMakeLists.txt (1 hunks)
src/log_surgeon/Lexer.hpp (5 hunks)
src/log_surgeon/Lexer.tpp (3 hunks)
src/log_surgeon/LexicalRule.hpp (1 hunks)
src/log_surgeon/SchemaParser.cpp (4 hunks)
src/log_surgeon/finite_automata/Capture.hpp (2 hunks)
src/log_surgeon/finite_automata/Dfa.hpp (3 hunks)
src/log_surgeon/finite_automata/Nfa.hpp (3 hunks)
src/log_surgeon/finite_automata/NfaState.hpp (3 hunks)
src/log_surgeon/finite_automata/PrefixTree.hpp (1 hunks)
src/log_surgeon/finite_automata/RegexAST.hpp (20 hunks)
src/log_surgeon/finite_automata/RegisterHandler.hpp (2 hunks)
src/log_surgeon/finite_automata/TaggedTransition.hpp (3 hunks)
tests/CMakeLists.txt (2 hunks)
tests/test-capture.cpp (1 hunks)
tests/test-lexer.cpp (2 hunks)
tests/test-nfa.cpp (4 hunks)
tests/test-tag.cpp (0 hunks)

💤 Files with no reviewable changes (1)

tests/test-tag.cpp

✅ Files skipped from review due to trivial changes (2)

src/log_surgeon/finite_automata/PrefixTree.hpp
src/log_surgeon/finite_automata/Capture.hpp

🧰 Additional context used

📓 Path-based instructions (12)

src/log_surgeon/finite_automata/RegisterHandler.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Dfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-nfa.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-capture.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/LexicalRule.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/SchemaParser.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/NfaState.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

tests/test-lexer.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/TaggedTransition.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Nfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegexAST.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/Lexer.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (6)

src/log_surgeon/finite_automata/RegisterHandler.hpp (1)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T22:25:35.608Z
Learning: In the `RegisterHandler` class in `src/log_surgeon/finite_automata/RegisterHandler.hpp`, the methods `add_register` and `append_position` rely on `emplace_back` and `m_prefix_tree.insert` to handle exceptions correctly and maintain consistent state without requiring additional exception handling.

src/log_surgeon/SchemaParser.cpp (1)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.

src/log_surgeon/finite_automata/NfaState.hpp (2)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:127-128
Timestamp: 2024-11-10T16:46:58.543Z
Learning: `RegexNFAUTF8State` is defined as a type alias for `RegexNFAState<RegexNFAStateType::UTF8>`.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.

src/log_surgeon/finite_automata/TaggedTransition.hpp (2)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/TaggedTransition.hpp:16-37
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `src/log_surgeon/finite_automata/TaggedTransition.hpp`, the classes `PositiveTaggedTransition` and `NegativeTaggedTransition` currently do not share enough functionality to justify refactoring into a common base class.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.

src/log_surgeon/finite_automata/RegexAST.hpp (3)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexNFAState.hpp:0-0
Timestamp: 2024-11-13T20:02:13.737Z
Learning: In `src/log_surgeon/finite_automata/RegexNFAState.hpp`, the constructor `RegexNFAState(std::set<Tag const*> tags, RegexNFAState const* dest_state)` has been updated to use `std::vector<Tag const*> tags` instead of `std::set`.

src/log_surgeon/Lexer.hpp (2)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#42
File: src/log_surgeon/finite_automata/RegexNFA.hpp:37-90
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In this codebase, prefer code clarity over efficiency optimizations unless efficiency is a critical concern.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#56
File: src/log_surgeon/finite_automata/RegisterHandler.hpp:0-0
Timestamp: 2024-11-27T22:25:35.608Z
Learning: In the `RegisterHandler` class in `src/log_surgeon/finite_automata/RegisterHandler.hpp`, the methods `add_register` and `append_position` rely on `emplace_back` and `m_prefix_tree.insert` to handle exceptions correctly and maintain consistent state without requiring additional exception handling.

🪛 GitHub Actions: lint

tests/test-nfa.cpp

[error] 47-65: Code should be clang-formatted

src/log_surgeon/LexicalRule.hpp

[error] 28-28: Code should be clang-formatted

src/log_surgeon/SchemaParser.cpp

[error] 1-1: Code should be clang-formatted

src/log_surgeon/Lexer.tpp

[error] 4-4: Code should be clang-formatted

[error] 390-390: Code should be clang-formatted

src/log_surgeon/finite_automata/Nfa.hpp

[error] 122-122: Code should be clang-formatted

src/log_surgeon/finite_automata/RegexAST.hpp

[error] 113-113: Code should be clang-formatted

⏰ Context from checks skipped due to timeout of 90000ms (2)

GitHub Check: build (ubuntu-latest, Release)
GitHub Check: build (ubuntu-latest, Debug)

🔇 Additional comments (29)

src/log_surgeon/finite_automata/Dfa.hpp (2)

5-5: Well-structured include additions!

Making the dependencies on <map>, <stack>, and RegisterHandler.hpp explicit improves maintainability by clearly showing the required headers.

Also applies to: 8-8, 14-14

82-82: Good defensive programming practice!

Initializing the state pointer to nullptr is a good practice that prevents undefined behaviour and makes debugging easier.

src/log_surgeon/finite_automata/TaggedTransition.hpp (1)

14-15: Introduced tag_id_t typedef for clarity

Defining tag_id_t as an alias for std::uint32_t improves code readability and maintainability.

src/log_surgeon/finite_automata/Nfa.hpp (1)

43-43: Optimized constructor by passing vector by const reference

Changing the constructor parameter to accept rules by const reference avoids unnecessary copying, improving performance.

tests/test-capture.cpp (3)

8-11: LGTM! Basic test case is well-implemented.

The test verifies the fundamental functionality of the get_name() method.

13-16: LGTM! Empty string handling is properly tested.

The test ensures that empty capture names are handled correctly.

18-21: LGTM! Special characters test is comprehensive.

The test verifies that special characters in capture names are preserved correctly.

src/log_surgeon/LexicalRule.hpp (1)

26-28: LGTM! Method implementation is correct.

The get_captures() method is well-designed:

Properly marked with [[nodiscard]]

Returns const pointers to prevent modification

Correctly delegates to the underlying regex implementation

🧰 Tools

🪛 GitHub Actions: lint

[error] 28-28: Code should be clang-formatted

src/log_surgeon/finite_automata/RegisterHandler.hpp (2)

5-5: LGTM! Include is properly placed.

The <cstdint> include is correctly added to support the new type alias.

21-21: LGTM! Type alias enhances code clarity.

The register_id_t type alias improves code readability and maintainability by providing a meaningful name for the register identifier type.

src/log_surgeon/Lexer.hpp (3)

27-29: LGTM! Type aliases enhance code clarity.

The type aliases improve code readability and maintainability by providing meaningful names for different identifier types.

59-61: LGTM! Exception documentation is clear.

The comment properly documents the exception that can be thrown when duplicate capture names are encountered.

197-199: LGTM! Member variables support the new mapping functionality.

The new member variables properly support the mapping between variables, captures, tags, and registers.

tests/test-nfa.cpp (3)

58-66: LGTM! Test expectations updated to match the new capture-to-tag mapping.

The changes correctly reflect the new implementation where captures are mapped to numerical tag indices.

Line range hint 69-101: LGTM! Consistent update of test expectations.

The changes maintain consistency with the new capture-to-tag mapping implementation.

Line range hint 106-132: LGTM! Test expectations properly updated.

The changes complete the consistent update of test expectations for the new capture-to-tag mapping implementation.

src/log_surgeon/finite_automata/NfaState.hpp (2)

35-40: LGTM! Good use of value types over pointers.

The change from Tag const* to tag_id_t reduces memory management complexity and aligns with modern C++ practices.

54-56: LGTM! Consistent use of value types.

The method signature update maintains consistency with the constructor changes, using tag_id_t throughout the class.
src/log_surgeon/Lexer.tpp (2)

396-401: LGTM! Proper implementation of capture-to-tag mapping.

The code correctly processes and stores the relationships between captures and their corresponding tag IDs.

403-403: Verify the impact of DFA ignoring captures.

The TODO comment indicates that DFA treats capture groups as regular expressions. Please verify if this limitation affects the correctness of capture group functionality.

✅ Verification successful

DFA capture group limitation is a known design choice

The TODO comment accurately reflects the current implementation. The DFA treats capture expressions like "capture:user=(?<user_id>\d+)" as simple patterns "capture:user=\d+". This is a known limitation in the design, not a bug:

Tests in test-lexer.cpp show that capture groups are supported in the API

The DFA implementation intentionally simplifies these patterns

Capture group processing likely happens at a different stage of processing
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Search for tests that verify capture group functionality with DFA.
rg -A 5 "TEST_CASE.*capture.*DFA" tests/
Length of output: 42

Script:
#!/bin/bash
# Search for capture-related code and tests
echo "=== Capture-related tests ==="
rg -l "TEST_CASE.*capture" tests/

echo -e "\n=== Capture implementation ==="
ast-grep --pattern 'capture'

echo -e "\n=== DFA implementation ==="
rg -l "DFA|deterministic.*automaton" src/
Length of output: 2361

Script:
#!/bin/bash
# Examine capture tests and DFA implementation
echo "=== Capture Tests ==="
rg -A 10 "TEST_CASE.*capture" tests/test-lexer.cpp tests/test-capture.cpp

echo -e "\n=== DFA Capture Handling ==="
rg -A 5 "capture" src/log_surgeon/finite_automata/Dfa.hpp

echo -e "\n=== NFA to DFA Conversion ==="
rg -A 10 "convert.*to.*dfa|to_dfa" src/log_surgeon/finite_automata/
Length of output: 1133
tests/test-lexer.cpp (1)

299-358: LGTM! Comprehensive test coverage for capture group functionality.

The test case thoroughly verifies the new capture-to-tag mapping features, including variable IDs, capture IDs, and tag IDs relationships.
src/log_surgeon/SchemaParser.cpp (4)

13-13: LGTM!

The include statement for Capture.hpp is correctly placed and aligns with the transition from tags to captures.

205-205: LGTM!

The comment has been correctly updated to use "captures" terminology, maintaining consistency with the codebase changes.

251-251: LGTM!

The comment has been correctly updated to use "captures" terminology, maintaining consistency with the codebase changes.

170-170: Verify non-null requirement for Capture.

The change from Tag to Capture is consistent. However, based on previous learnings, ensure that the Capture instance is always non-null when passed to RegexASTCapture.

Run this script to verify the non-null requirement:

✅ Verification successful

Non-null requirement is enforced by design

The Capture class design and RegexASTCapture's constructor guarantees that Capture instances cannot be null. The make_unique<Capture>(r4->m_name) call is safe as it satisfies the required string parameter constructor.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Verify that Capture instances are always non-null when passed to RegexASTCapture.

# Search for any null checks or assignments of Capture instances
ast-grep --pattern 'RegexASTCapture($$$, nullptr)' || ast-grep --pattern 'RegexASTCapture($$$, NULL)'

# Search for any potential null assignments to Capture variables
rg -A 2 'Capture.*=.*null|Capture.*=.*nullptr|Capture.*=.*NULL'
Length of output: 118

Script:
#!/bin/bash
# Find Capture class definition and usage
ast-grep --pattern 'class Capture {
  $$$
}'

# Find RegexASTCapture constructor definition
ast-grep --pattern 'RegexASTCapture::RegexASTCapture($$$)'

# Find all Capture construction patterns
ast-grep --pattern 'new Capture($$$)' || ast-grep --pattern 'make_unique<Capture>($$$)'

# Find all RegexASTCapture construction patterns
ast-grep --pattern 'new RegexASTCapture($$$)' || ast-grep --pattern 'make_unique<RegexASTCapture>($$$)'
Length of output: 287

Script:
#!/bin/bash
# Search for Capture and RegexASTCapture in header files
fd -e hpp -e h --exec cat {} \; | rg -A 10 'class (Capture|RegexASTCapture)'

# Find all make_unique constructions
rg -A 2 'make_unique<(Capture|RegexASTCapture)>'

# Find constructor implementations
rg -A 5 '(Capture|RegexASTCapture)::\1'
Length of output: 1320
tests/CMakeLists.txt (2)

27-34: LGTM!

The addition of test-capture.cpp and the improved formatting enhance the maintainability of the test suite.

5-7: Verify source file list consistency.

The addition of Capture.hpp aligns with the transition from tags to captures. However, Nfa.hpp and NfaState.hpp appear to be duplicated in the source list.

Run this script to verify the source file list:
CMakeLists.txt (2)

96-96: LGTM!

The addition of Capture.hpp is correctly placed within the finite_automata section.

102-104: Verify the necessity of restored files.

The restoration of PrefixTree and RegexAST files should be verified against the capture functionality requirements.

Run this script to verify file dependencies:

✅ Verification successful

Restoration of files is necessary and correct

The PrefixTree and RegexAST files are essential components of the system:

PrefixTree is required by RegisterHandler for position tracking

RegexAST provides critical capture group functionality

Together they form a cohesive pattern matching and position tracking system
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:
#!/bin/bash
# Description: Check dependencies between restored files and Capture functionality.

# Search for Capture-related usage in restored files
echo "Checking PrefixTree.cpp for Capture usage:"
rg -A 3 "Capture" "src/log_surgeon/finite_automata/PrefixTree.cpp" || true

echo "Checking PrefixTree.hpp for Capture usage:"
rg -A 3 "Capture" "src/log_surgeon/finite_automata/PrefixTree.hpp" || true

echo "Checking RegexAST.hpp for Capture usage:"
rg -A 3 "Capture" "src/log_surgeon/finite_automata/RegexAST.hpp" || true
Length of output: 4173

Script:
#!/bin/bash
# Check relationships between PrefixTree and RegexAST/Capture components

echo "Checking RegexAST.hpp for PrefixTree usage:"
rg -A 3 "PrefixTree" "src/log_surgeon/finite_automata/RegexAST.hpp" || true

echo "Checking for files using PrefixTree:"
rg --files-with-matches "PrefixTree" || true

echo "Checking implementation files for relationships:"
rg -l "class.*(?:RegexAST|PrefixTree)" || true
Length of output: 770

Script:
#!/bin/bash
# Examine RegisterHandler implementation

echo "Checking RegisterHandler for component relationships:"
rg -A 5 "class RegisterHandler" "src/log_surgeon/finite_automata/RegisterHandler.hpp" || true

echo "Checking RegisterHandler usage of components:"
rg -A 3 "(?:PrefixTree|RegexAST|Capture)" "src/log_surgeon/finite_automata/RegisterHandler.hpp" || true

echo "Checking test implementation for component integration:"
rg -A 3 "(?:PrefixTree|RegexAST|Capture)" "tests/test-register-handler.cpp" || true
Length of output: 2057

src/log_surgeon/finite_automata/Dfa.hpp

src/log_surgeon/finite_automata/TaggedTransition.hpp

src/log_surgeon/finite_automata/RegexAST.hpp

src/log_surgeon/LexicalRule.hpp

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

src/log_surgeon/finite_automata/RegexAST.hpp (1)

650-660: ⚠️ Potential issue

Avoid throwing exceptions in the initializer list

Throwing exceptions in the member initializer list can lead to issues with object construction and exception safety. It's safer to perform null checks in the constructor body and throw exceptions from there.

Apply this diff to move the null checks into the constructor body:

 RegexASTCapture(
         std::unique_ptr<RegexAST<TypedNfaState>> group_regex_ast,
         std::unique_ptr<Capture> capture
 )
-        : m_group_regex_ast{(
-                  nullptr == group_regex_ast
-                          ? throw std::invalid_argument("Group regex AST cannot be null")
-                          : std::move(group_regex_ast)
-          )},
-          m_capture{
-                  nullptr == capture ? throw std::invalid_argument("Capture cannot be null")
-                                     : std::move(capture)
-          } {
+        {
+            if (nullptr == group_regex_ast) {
+                throw std::invalid_argument("Group regex AST cannot be null");
+            }
+            if (nullptr == capture) {
+                throw std::invalid_argument("Capture cannot be null");
+            }
+            m_group_regex_ast = std::move(group_regex_ast);
+            m_capture = std::move(capture);
             RegexAST<TypedNfaState>::set_subtree_positive_captures(
                     m_group_regex_ast->get_subtree_positive_captures()
             );
             RegexAST<TypedNfaState>::add_subtree_positive_captures({m_capture.get()});
         }

🧹 Nitpick comments (4)

tests/test-lexer.cpp (3)

215-297: Consider adding test cases for edge cases.

The test case thoroughly validates the basic functionality but could benefit from additional test cases for:

Empty input

Multiple consecutive delimiters

Input containing only delimiters

359-360: Track TODOs in the issue tracker.

The TODOs indicate pending implementation of determinization and simulation features. These should be tracked in the issue tracker for better visibility and follow-up.

Would you like me to create GitHub issues to track these TODOs?

Also applies to: 399-400

342-357: Consider adding negative test cases for mapping validation.

The mapping validation tests only cover the happy path. Consider adding test cases for:

Non-existent variable IDs

Non-existent capture IDs

Invalid tag IDs

src/log_surgeon/finite_automata/Nfa.hpp (1)

95-99: Consider using stable identifiers instead of raw pointers as unordered_map keys

Using Capture const* as keys in an unordered_map relies on the stability of pointer values, which can be error-prone if objects are moved or deleted. To enhance safety and maintainability, consider using unique IDs or smart pointers with custom hash and equality functions as keys.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between bc78c49 and 2c89216.

📒 Files selected for processing (4)

src/log_surgeon/Lexer.hpp (5 hunks)
src/log_surgeon/finite_automata/Nfa.hpp (3 hunks)
src/log_surgeon/finite_automata/RegexAST.hpp (21 hunks)
tests/test-lexer.cpp (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

src/log_surgeon/Lexer.hpp

🧰 Additional context used

📓 Path-based instructions (3)

tests/test-lexer.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Nfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegexAST.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (1)

src/log_surgeon/finite_automata/RegexAST.hpp (2)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.

⏰ Context from checks skipped due to timeout of 90000ms (4)

GitHub Check: build (ubuntu-latest, Release)
GitHub Check: build (ubuntu-latest, Debug)
GitHub Check: build (macos-latest, Release)
GitHub Check: build (macos-latest, Debug)

🔇 Additional comments (4)

tests/test-lexer.cpp (2)

9-9: LGTM!

The new includes and using declarations are appropriate for the added test cases.

Also applies to: 15-15

362-397: LGTM!

The token recognition tests with captures are thorough and well-structured.

src/log_surgeon/finite_automata/Nfa.hpp (2)

43-43: Efficient parameter passing by const reference

Good use of a constant reference in the Nfa constructor to avoid unnecessary copies of rules.

20-28: Verify thread safety of UniqueIdGenerator

If UniqueIdGenerator is used in multithreaded contexts, the generate_id() method is not thread-safe due to the use of the post-increment operator. Consider using std::atomic<uint32_t> for current_id or synchronizing access with a mutex to ensure thread safety.

Run the following script to check for usage of UniqueIdGenerator in multithreaded contexts:

coderabbitai

Actionable comments posted: 0

♻️ Duplicate comments (1)

src/log_surgeon/finite_automata/TaggedTransition.hpp (1)
55-56: 🛠️ Refactor suggestion

Pass vector by const reference to improve efficiency.

Consider passing tag_ids by const reference to avoid unnecessary copying of the vector.
-NegativeTaggedTransition(std::vector<tag_id_t> tag_ids, TypedNfaState const* dest_state)
+NegativeTaggedTransition(std::vector<tag_id_t> const& tag_ids, TypedNfaState const* dest_state)

🧹 Nitpick comments (1)

src/log_surgeon/Lexer.tpp (1)
381-394: Enhance the error message for duplicate capture names.

The error message could be more descriptive by including the duplicate capture name.

Apply this diff to improve the error message:
-                throw std::invalid_argument("`m_rules` contains capture names that are not unique."
+                throw std::invalid_argument(fmt::format(
+                    "`m_rules` contains duplicate capture name: '{}'", capture_name)
                );

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2c89216 and f88a780.

📒 Files selected for processing (7)

src/log_surgeon/Lexer.tpp (3 hunks)
src/log_surgeon/LexicalRule.hpp (1 hunks)
src/log_surgeon/SchemaParser.cpp (4 hunks)
src/log_surgeon/finite_automata/PrefixTree.hpp (1 hunks)
src/log_surgeon/finite_automata/RegexAST.hpp (21 hunks)
src/log_surgeon/finite_automata/TaggedTransition.hpp (3 hunks)
tests/test-nfa.cpp (4 hunks)

🚧 Files skipped from review as they are similar to previous changes (4)

src/log_surgeon/finite_automata/PrefixTree.hpp
src/log_surgeon/LexicalRule.hpp
tests/test-nfa.cpp
src/log_surgeon/SchemaParser.cpp

🧰 Additional context used

📓 Path-based instructions (2)

src/log_surgeon/finite_automata/TaggedTransition.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/RegexAST.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

📓 Learnings (2)

src/log_surgeon/finite_automata/TaggedTransition.hpp (2)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#47
File: src/log_surgeon/finite_automata/TaggedTransition.hpp:16-37
Timestamp: 2024-11-10T16:46:58.543Z
Learning: In `src/log_surgeon/finite_automata/TaggedTransition.hpp`, the classes `PositiveTaggedTransition` and `NegativeTaggedTransition` currently do not share enough functionality to justify refactoring into a common base class.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.

src/log_surgeon/finite_automata/RegexAST.hpp (2)

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#48
File: src/log_surgeon/finite_automata/RegexAST.hpp:700-700
Timestamp: 2024-11-13T22:38:19.472Z
Learning: In `RegexASTCapture`, `m_tag` must always be non-null.

Learnt from: SharafMohamed
PR: y-scope/log-surgeon#50
File: src/log_surgeon/finite_automata/Tag.hpp:0-0
Timestamp: 2024-11-18T16:45:46.074Z
Learning: The class `TagPositions` was removed from `src/log_surgeon/finite_automata/Tag.hpp` as it is no longer needed.

⏰ Context from checks skipped due to timeout of 90000ms (3)

GitHub Check: build (ubuntu-latest, Release)
GitHub Check: build (ubuntu-latest, Debug)
GitHub Check: build (macos-latest, Debug)

🔇 Additional comments (9)

src/log_surgeon/finite_automata/TaggedTransition.hpp (2)

14-14: LGTM! Good use of type alias.

The introduction of tag_id_t improves code readability and provides better type safety.

23-24: LGTM! Good transition from pointers to IDs.

The changes effectively replace Tag pointers with tag_id_t, which:

Reduces memory usage

Simplifies memory management

Maintains the same functionality

Also applies to: 40-40, 44-44

src/log_surgeon/Lexer.tpp (3)

7-7: LGTM!

The addition of <stdexcept> header is appropriate for using std::invalid_argument exception.

362-374: LGTM!

The parameter type change from uint32_t to symbol_id_t improves type safety and semantic clarity.

397-402: LGTM!

The mapping between captures and tag IDs is well-structured and maintains the relationship necessary for tracking ownership transfer after parsing.
src/log_surgeon/finite_automata/RegexAST.hpp (4)

33-38: LGTM!

The documentation updates accurately reflect the transition from tags to captures and clearly explain the role of captures in the AST.

155-157: LGTM!

The member variable changes from tags to captures are consistent and well-named, reflecting the new design.

650-664: Move null checks to constructor body.

As previously discussed, throwing exceptions in the initializer list can lead to issues with object construction and exception safety.

Apply this diff to move the null checks into the constructor body:
 RegexASTCapture(
         std::unique_ptr<RegexAST<TypedNfaState>> group_regex_ast,
         std::unique_ptr<Capture> capture
-)
-        : m_group_regex_ast{(
-              nullptr == group_regex_ast
-                      ? throw std::invalid_argument("Group regex AST cannot be null")
-                      : std::move(group_regex_ast)
-          )},
-          m_capture{
-              nullptr == capture ? throw std::invalid_argument("Capture cannot be null")
-                                 : std::move(capture)
-          } {
+) {
+    if (nullptr == group_regex_ast) {
+        throw std::invalid_argument("Group regex AST cannot be null");
+    }
+    if (nullptr == capture) {
+        throw std::invalid_argument("Capture cannot be null");
+    }
+    m_group_regex_ast = std::move(group_regex_ast);
+    m_capture = std::move(capture);
     RegexAST<TypedNfaState>::set_subtree_positive_captures(
             m_group_regex_ast->get_subtree_positive_captures()
     );
     RegexAST<TypedNfaState>::add_subtree_positive_captures({m_capture.get()});
 }
954-961: LGTM!

The serialization changes properly handle captures and maintain Unicode support through std::u32string.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (2)

tests/test-lexer.cpp (2)

65-72: Track the TODO comment about register value checks.

The test_scanning_input function is well-implemented, but there's a TODO comment about checking register values that should be tracked.

Would you like me to create a GitHub issue to track the implementation of register value checks when simulation is implemented?

Also applies to: 128-153

312-346: Track the TODO comment about register-related checks.

The test case provides good coverage for capture groups, but there's a TODO comment about checking get_register_for_tag_id and get_registers_for_capture that should be tracked.

Would you like me to create a GitHub issue to track the implementation of these register-related checks when determinization is implemented?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f88a780 and 0cc6c24.

📒 Files selected for processing (1)

tests/test-lexer.cpp (5 hunks)

🧰 Additional context used

📓 Path-based instructions (1)

tests/test-lexer.cpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

⏰ Context from checks skipped due to timeout of 90000ms (4)

GitHub Check: build (ubuntu-latest, Release)
GitHub Check: build (ubuntu-latest, Debug)
GitHub Check: build (macos-latest, Release)
GitHub Check: build (macos-latest, Debug)

🔇 Additional comments (4)

tests/test-lexer.cpp (4)

9-9: LGTM! Well-organized header inclusions and using declarations.

The header inclusions follow a good organization pattern (standard library headers first, followed by project headers), and the using declarations are properly scoped.

Also applies to: 15-25

58-63: LGTM! Well-structured utility function for lexer initialization.

The initialize_lexer function is well-documented and follows good practices:

Clear parameter documentation

Proper handling of delimiters

Good error handling for duplicate symbols

Also applies to: 93-126

296-310: LGTM! Comprehensive test coverage for basic lexer functionality.

The test case provides good coverage for:

Token recognition for valid patterns

Handling of uncaught strings

347-349: LGTM! Good test coverage for token matching scenarios.

The test provides comprehensive coverage for:

Exact pattern matching

Handling of non-matching patterns

Handling of partial matches

LinZhihao-723

First round of review. It is highly recommended to configure clang-tidy in CLion or the editor you're using so that many errors can be caught before code review.

LinZhihao-723 · 2025-01-13T16:14:28Z

src/log_surgeon/Lexer.hpp

+        auto const capture_ids{m_var_id_to_capture_ids.find(var_id)};
+        if (m_var_id_to_capture_ids.end() == capture_ids) {
+            return std::nullopt;
+        }
+        return capture_ids->second;


Suggested change

auto const capture_ids{m_var_id_to_capture_ids.find(var_id)};

if (m_var_id_to_capture_ids.end() == capture_ids) {

return std::nullopt;

}

return capture_ids->second;

if (m_var_id_to_capture_ids.contains(var_id)) {

return m_var_id_to_capture_ids.at(var_id);

}

return std::nullopt;

How about rewriting in this way to avoid using iterators (shouldn't have any performance difference).
Same for the rest

LinZhihao-723 · 2025-01-13T16:17:35Z

src/log_surgeon/Lexer.hpp

+    }
+
+    [[nodiscard]] auto get_tag_ids_for_capture_id(symbol_id_t const capture_id
+    ) const -> std::optional<std::pair<tag_id_t, tag_id_t>> {


Headers for std::optional and std::pair are not directly included

LinZhihao-723 · 2025-01-13T16:18:48Z

src/log_surgeon/Lexer.hpp

@@ -122,8 +128,48 @@ class Lexer {
        return m_dfa;
    }

-    std::unordered_map<std::string, uint32_t> m_symbol_id;
-    std::unordered_map<uint32_t, std::string> m_id_symbol;
+    [[nodiscard]] auto get_capture_ids_for_var_id(symbol_id_t const var_id


How about renaming to get_capture_ids_from_var_id

LinZhihao-723 · 2025-01-13T17:31:07Z

src/log_surgeon/Lexer.hpp

+        return tag_ids->second;
+    }
+
+    [[nodiscard]] auto get_register_for_tag_id(tag_id_t const tag_id


How about get_reg_id_xxx instead of get_register since the return value is an ID.

LinZhihao-723 · 2025-01-13T17:35:52Z

src/log_surgeon/Lexer.hpp

+        auto const tag_ids{get_tag_ids_for_capture_id(capture_id)};
+        if (tag_ids.has_value()) {
+            auto const start_reg{get_register_for_tag_id(tag_ids.value().first())};
+            auto const end_reg{get_register_for_tag_id(tag_ids.value().second())};
+            if (start_reg.has_value() && end_reg.has_value()) {
+                return std::make_pair(start_reg.value(), end_reg.value());
+            }
+        }
+        return std::nullopt;


Suggested change

auto const tag_ids{get_tag_ids_for_capture_id(capture_id)};

if (tag_ids.has_value()) {

auto const start_reg{get_register_for_tag_id(tag_ids.value().first())};

auto const end_reg{get_register_for_tag_id(tag_ids.value().second())};

if (start_reg.has_value() && end_reg.has_value()) {

return std::make_pair(start_reg.value(), end_reg.value());

}

}

return std::nullopt;

auto const optional_tag_ids{get_tag_ids_for_capture_id(capture_id)};

if (false == optional_tag_ids.has_value()) {

return std::nullopt;

}

auto const [start_tag_id, end_tag_id]{optional_tag_ids.value()};

auto const optional_start_reg_id{get_register_for_tag_id(start_tag_id)};

if (false == optional_tag_ids.has_value()) {

return std::nullopt;

}

auto const optional_end_reg_id{get_register_for_tag_id(end_tag_id)};

if (false == optional_tag_ids.has_value()) {

return std::nullopt;

}

return {optional_start_reg_id.value(), optional_end_reg_id.value()};

How about rewriting in this way to:

Follow the convention of using optional_ prefix for optional variables.

Reduce the level of nested if statements.

LinZhihao-723 · 2025-01-13T23:02:27Z

tests/test-lexer.cpp

+    REQUIRE(lexer.m_symbol_id.end() != var_id);
+
+    string captureName{cCaptureName};
+    auto const capture_id{lexer.m_symbol_id.find(captureName)};


Same, capture_id_it, or use contains and at instead

LinZhihao-723 · 2025-01-13T23:03:04Z

tests/test-lexer.cpp

+    auto const capture_id{lexer.m_symbol_id.find(captureName)};
+    REQUIRE(lexer.m_symbol_id.end() != capture_id);
+
+    auto capture_ids{lexer.get_capture_ids_for_var_id(var_id->second)};


optional_capture_ids

auto const

LinZhihao-723 · 2025-01-13T23:04:10Z

tests/test-lexer.cpp

+    REQUIRE(1 == capture_ids.value().size());
+    REQUIRE(capture_id->second == capture_ids.value()[0]);
+
+    auto tag_ids{lexer.get_tag_ids_for_capture_id(capture_ids.value()[0])};


optional_tag_ids

auto const

LinZhihao-723 · 2025-01-13T23:05:50Z

tests/test-lexer.cpp

+auto initialize_lexer(Schema schema, ByteLexer& lexer) -> void;
+
+/**
+ * Scans the given input to ensure the correct behavior.


We might need a more detailed doc string to explain what this helper checks.

LinZhihao-723 · 2025-01-13T23:06:37Z

tests/test-lexer.cpp

+    // TODO: add check for get_register_for_tag_id and get_registers_for_capture when
+    // determinization is implemented.


Suggested change

// TODO: add check for get_register_for_tag_id and get_registers_for_capture when

// determinization is implemented.

// TODO: Add check for `get_register_for_tag_id` and `get_registers_for_capture` when

// TDFA's determination is implemented.

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (5)

src/log_surgeon/UniqueIdGenerator.hpp (1)

5-13: Improve class implementation with member initialization and nodiscard attribute.

The class can be made more concise and safer by using member initialization and adding the nodiscard attribute to prevent accidental discarding of generated IDs.
 namespace log_surgeon {
 class UniqueIdGenerator {
 public:
-    UniqueIdGenerator() : current_id{0} {}
-
     [[nodiscard]] auto generate_id() -> uint32_t { return current_id++; }
 
 private:
-    uint32_t current_id;
+    uint32_t current_id{0};
 };
 }  // namespace log_surgeon

CMakeLists.txt (2)

Line range hint 1-1: Address CMake deprecation warning.

Update the CMake minimum version to use the range syntax for better compatibility.

-cmake_minimum_required(VERSION 3.5.1)
+cmake_minimum_required(VERSION 3.5.1...3.27)

71-84: Improve file ordering in SOURCE_FILES.

The finite automata files should be alphabetically ordered for better maintainability.

     src/log_surgeon/FileReader.hpp
     src/log_surgeon/finite_automata/Capture.hpp
-    src/log_surgeon/finite_automata/Dfa.hpp
-    src/log_surgeon/finite_automata/DfaState.hpp
-    src/log_surgeon/finite_automata/DfaStatePair.hpp
-    src/log_surgeon/finite_automata/Nfa.hpp
-    src/log_surgeon/finite_automata/NfaState.hpp
-    src/log_surgeon/finite_automata/PrefixTree.cpp
-    src/log_surgeon/finite_automata/PrefixTree.hpp
-    src/log_surgeon/finite_automata/RegexAST.hpp
-    src/log_surgeon/finite_automata/RegisterHandler.hpp
-    src/log_surgeon/finite_automata/StateType.hpp
-    src/log_surgeon/finite_automata/TaggedTransition.hpp
-    src/log_surgeon/finite_automata/UnicodeIntervalTree.hpp
-    src/log_surgeon/finite_automata/UnicodeIntervalTree.tpp
+    src/log_surgeon/finite_automata/Dfa.hpp
+    src/log_surgeon/finite_automata/DfaState.hpp
+    src/log_surgeon/finite_automata/DfaStatePair.hpp
+    src/log_surgeon/finite_automata/Nfa.hpp
+    src/log_surgeon/finite_automata/NfaState.hpp
+    src/log_surgeon/finite_automata/PrefixTree.cpp
+    src/log_surgeon/finite_automata/PrefixTree.hpp
+    src/log_surgeon/finite_automata/RegexAST.hpp
+    src/log_surgeon/finite_automata/RegisterHandler.hpp
+    src/log_surgeon/finite_automata/StateType.hpp
+    src/log_surgeon/finite_automata/TaggedTransition.hpp
+    src/log_surgeon/finite_automata/UnicodeIntervalTree.hpp
+    src/log_surgeon/finite_automata/UnicodeIntervalTree.tpp

src/log_surgeon/finite_automata/Nfa.hpp (2)

120-124: Initialize all member variables in constructor.

Consider explicitly initializing all member variables in the initialization list for better clarity.
 Nfa<TypedNfaState>::Nfa(std::vector<LexicalRule<TypedNfaState>> const& rules)
-        : m_root{new_state()} {
+        : m_states{}
+        , m_capture_to_tag_ids{}
+        , m_root{new_state()}
+        , m_unique_id_generator{} {
     for (auto const& rule : rules) {
         rule.add_to_nfa(this);
     }
112-113: Track TODO for lexer limitation.

The TODO comment indicates a design limitation in the lexer that should be addressed.

Would you like me to create an issue to track this TODO for improving the lexer's handling of capture group names?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0cc6c24 and d38393d.

📒 Files selected for processing (4)

CMakeLists.txt (2 hunks)
src/log_surgeon/UniqueIdGenerator.hpp (1 hunks)
src/log_surgeon/finite_automata/Nfa.hpp (3 hunks)
tests/CMakeLists.txt (2 hunks)

🚧 Files skipped from review as they are similar to previous changes (1)

tests/CMakeLists.txt

🧰 Additional context used

📓 Path-based instructions (2)

src/log_surgeon/UniqueIdGenerator.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

src/log_surgeon/finite_automata/Nfa.hpp (1)

Pattern **/*.{cpp,hpp,java,js,jsx,ts,tsx}: - Prefer false == <expression> rather than !<expression>.

🪛 GitHub Actions: Build

CMakeLists.txt

[warning] 1-1: CMake Deprecation Warning: Compatibility with CMake < 3.10 will be removed from a future version of CMake. Update the VERSION argument or use ... syntax.

🔇 Additional comments (2)

src/log_surgeon/finite_automata/Nfa.hpp (2)

21-28: Improve class documentation.

The documentation needs to be more precise and follow proper formatting.

 /**
- * Represents a NFA(non-deterministic finite automata) for recognizing a language based on the set
- * of rules used during initialization. Currently use as an intermediate model for generating the
- * DFA.
+ * Represents a Non-Deterministic Finite Automaton (NFA) designed to recognize a language based on
+ * a set of rules provided during initialization. This class serves as an intermediate
+ * representation used for generating the corresponding Deterministic Finite Automaton (DFA).
  *
- * Currently we assume all capture groups have unique names.
+ * NOTE: It is assumed that all capture groups have unique names, even across different rules.
  * @tparam TypedNfaState
  */

128-139: Improve implementation of get_or_create_capture_tags.

The method can be made more concise and safer with const variables.

 auto Nfa<TypedNfaState>::get_or_create_capture_tags(Capture const* capture
 ) -> std::pair<tag_id_t, tag_id_t> {
-    auto const existing_tags{m_capture_to_tag_ids.find(capture)};
-    if (m_capture_to_tag_ids.end() == existing_tags) {
-        auto start_tag{m_unique_id_generator.generate_id()};
-        auto end_tag{m_unique_id_generator.generate_id()};
-        auto new_tags{std::make_pair(start_tag, end_tag)};
-        m_capture_to_tag_ids.emplace(capture, new_tags);
-        return new_tags;
+    if (false == m_capture_to_tag_ids.contains(capture)) {
+        m_capture_to_tag_ids.emplace(
+                capture,
+                m_unique_id_generator.generate_id(),
+                m_unique_id_generator.generate_id()
+        );
     }
-    return existing_tags->second;
+    return m_capture_to_tag_ids.at(capture);
 }

SharafMohamed and others added 30 commits November 19, 2024 16:27

Add const for consitency with constructor.

08b7548

Update positive end transition to be optional instead of a vector.

449133e

Rename new_state function correctly.

7b837bf

Update capture group AST state creation.

f0eb56b

Encapsulate new state for capture group.

a945915

Fix compiler error.

c757ded

Use singular for end transition getter function.

2eb7477

Void to auto -> void.

08060ed

Co-authored-by: Lin Zhihao <[email protected]>

Update new_capture_group_start_states to new_capture_group_states to …

0c2c1d1

…reflect functionality change.

Linter.

b0b951a

Update docstring for .

3c2a2ab

Rename to new_start_and_end_states_with_positively_tagged_transitions.

98c5b95

Rename to capture_X_state.

f59cf41

Update docstring.

85a2d69

Updated diagram to match vars used in code.

4c602d4

Rename vars to serialized_X.

2b01433

Run Linter.

e37b29a

Fix typo.

c5beca3

Update diagram for capture group NFA.

fe4a7b3

Merge branch 'fixed-tagged-nfa' into register

8993088

Merge branch 'main' into register

aaf720a

Add register unit-tests, add PrefixTree with unit-tests.

0017512

Finished with initial register implementation.

336f2ae

Linter.

3449df2

Linter.

ef62df1

Docstring fixes.

a085650

Add boundry test case.

2be06c0

Improve test cases for setting positions in prefix tree.

9ec01dd

Improve test cases for setting invalid positions in prefix tree.

019e675

Remove confusing description; Remove unused include.

83a411a

SharafMohamed added 10 commits January 10, 2025 20:05

Merge branch 'main' into restructure-tagging

af2c8d4

Fixed lexer test...the current way of using the lexer is insanely bad.

3504ffb

Add capture group test case.

8b43e83

Add getters and setters for lexer maps; Update unit-tests.

719e722

Use capture instead of tag.

082145e

Fix ownership of m_rules.

e06cf44

Add tags.

2e99a91

Lexer now enforced unique capture names.

94db0d6

Update test-nfa to match added tags.

9731fa3

Update the last lexer map that can be initialized currently.

bc78c49

coderabbitai bot reviewed Jan 13, 2025

View reviewed changes

src/log_surgeon/finite_automata/Dfa.hpp Show resolved Hide resolved

src/log_surgeon/finite_automata/TaggedTransition.hpp Outdated Show resolved Hide resolved

src/log_surgeon/finite_automata/RegexAST.hpp Show resolved Hide resolved

src/log_surgeon/LexicalRule.hpp Show resolved Hide resolved

Uniform initialization.

2c89216

coderabbitai bot reviewed Jan 13, 2025

View reviewed changes

SharafMohamed added 6 commits January 13, 2025 06:12

Indent comment.

f06031d

Indent comment.

90ce695

Linter.

40a90bd

Linter.

d9c20fe

Linter.

391607d

Linter.

e6273b2

SharafMohamed changed the title ~~feat: lexer now tracks mapping from variables to capture to tags to registers.~~ feat: separate the concept of captures and tags; lexer now tracks mapping from variables to capture to tags to registers. Jan 13, 2025

Use moe semantics properly for negative transition.

f88a780

coderabbitai bot reviewed Jan 13, 2025

View reviewed changes

Clean up testing code.

0cc6c24

coderabbitai bot reviewed Jan 13, 2025

View reviewed changes

SharafMohamed requested a review from LinZhihao-723 January 13, 2025 12:46

LinZhihao-723 requested changes Jan 13, 2025

View reviewed changes

SharafMohamed added 2 commits January 29, 2025 08:43

Move UniqueIdGenerator to its own file.

54d9181

Fix accidental deletion.

d38393d

coderabbitai bot reviewed Jan 29, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: separate the concept of captures and tags; lexer now tracks mapping from variables to capture to tags to registers. #72

feat: separate the concept of captures and tags; lexer now tracks mapping from variables to capture to tags to registers. #72

SharafMohamed commented Jan 13, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 13, 2025 •

edited

Loading

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (`.coderabbit.yaml`)

Documentation and Community

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

coderabbitai bot left a comment

LinZhihao-723 left a comment

LinZhihao-723 Jan 13, 2025

LinZhihao-723 Jan 13, 2025

LinZhihao-723 Jan 13, 2025

LinZhihao-723 Jan 13, 2025

LinZhihao-723 Jan 13, 2025

LinZhihao-723 Jan 13, 2025

LinZhihao-723 Jan 13, 2025

LinZhihao-723 Jan 13, 2025

LinZhihao-723 Jan 13, 2025

LinZhihao-723 Jan 13, 2025

coderabbitai bot left a comment

		// TODO: add check for get_register_for_tag_id and get_registers_for_capture when
		// determinization is implemented.

feat: separate the concept of captures and tags; lexer now tracks mapping from variables to capture to tags to registers. #72

Are you sure you want to change the base?

feat: separate the concept of captures and tags; lexer now tracks mapping from variables to capture to tags to registers. #72

Conversation

SharafMohamed commented Jan 13, 2025 • edited by coderabbitai bot Loading

Description

Validation performed

Summary by CodeRabbit

coderabbitai bot commented Jan 13, 2025 • edited Loading

Walkthrough

Changes

Sequence Diagram

Possibly related PRs

Suggested reviewers

Chat

CodeRabbit Commands (Invoked using PR comments)

Other keywords and placeholders

CodeRabbit Configuration File (.coderabbit.yaml)

Documentation and Community

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

LinZhihao-723 left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

coderabbitai bot left a comment

Choose a reason for hiding this comment

SharafMohamed commented Jan 13, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Jan 13, 2025 •

edited

Loading

CodeRabbit Configuration File (`.coderabbit.yaml`)