Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for tagged NFA; Use uint32_t to replace int for IDs. #42

Merged
merged 146 commits into from
Oct 30, 2024

Conversation

SharafMohamed
Copy link
Contributor

@SharafMohamed SharafMohamed commented Sep 16, 2024

References

  • Depends on PR#41

Description

The way tagged-NFAs work are as follows:

  • Set of states that either match nothing or match a variable id (same as untagged-NFA).
  • A start state (same as untagged-NFA).
  • A out-going transition from each state for every possible input character (same as untagged-NFA).
  • Out-going epsilon-transitions that are always taken when in the state (same as untagged-NFA).
  • Positive-transitions that corresponds to a single capture group. It is treated as an epsilon transitions, except taking it indicates that the corresponding capture group has been matched(unique to tagged-NFAs).
  • Negative-transitions that corresponds to several capture groups. It is treated as an epsilon transitions, except taking it indicates that the corresponding capture groups are guaranteed to be unmatched (unique to tagged-NFAs).

Changes to implement tagged-NFA:

  • Lexer no longer ignores capture group rules in the AST when building the NFA.
  • add_ast() updated to add tags to each rule's regex and to build the NFA with tags.
  • add_with_negative_tags() implemented to add a rule to the NFA while considering negative tags. This requires two passes of the AST, first positive tags are added, then negative tags are added as they depend on knowing the positive tags of alternate paths in the the AST.
  • All AST node add() functions call add_with_negative_tags() such that the NFA recursively adds negative tags when traversing the AST.
  • add_negative_tagged_transition() adds negative tags. Called at whichever AST node has negative tags.
  • add_positive_tagged_transitions() adds a positive tag. Called for every capture group AST node.
  • Add structs to represent negative and positive tagged transitions.
  • Add negative and positive tagged transitions to the NFA, with setters and getters.

Changes as a result of tagged-NFA:

  • DFA currently treats tagged transitions as epsilon transitions (ignores the capture aspect of capture groups).
  • generate_reverse() commented out as it is currently unused (at least internally and in CLP) until it is fixed to work with tags.

Validation

  • Add NFA unit-tests.

Summary by CodeRabbit

  • New Features

    • Introduced a new method for handling negative transitions during NFA construction.
    • Updated token ID handling to use consistent data types across various components.
    • Enhanced state management capabilities with new tagged transition classes.
    • Added a LexicalRule class to manage lexical rules in finite automata.
    • Simplified NFA construction by collecting lexical rules in a vector.
    • Introduced a comprehensive test suite for NFA implementation, validating various transition types and state properties.
  • Bug Fixes

    • Improved error reporting for regex patterns, providing detailed messages for invalid patterns.

… tag to matching_variable_id; Use full names for vars (r->rule); Clarify if states are NFA or DFA
…for clairty that nothing is shared b/w tests
…egexASTgroup with min = 1 OR'd with RegexASTEmpty
…iteral arguments; Use const& for non-literals; Use auto where possible; Use uint32_t over int for ids; replace begin() and end() with cbegin() and cend()
@SharafMohamed SharafMohamed marked this pull request as draft September 26, 2024 17:12
@SharafMohamed SharafMohamed marked this pull request as ready for review October 8, 2024 07:56
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between c3fb16d and 24a0efd.

📒 Files selected for processing (1)
  • src/log_surgeon/finite_automata/RegexNFA.hpp (6 hunks)

src/log_surgeon/finite_automata/RegexNFA.hpp Outdated Show resolved Hide resolved
src/log_surgeon/finite_automata/RegexNFA.hpp Outdated Show resolved Hide resolved
LinZhihao-723
LinZhihao-723 previously approved these changes Oct 30, 2024
Copy link
Member

@LinZhihao-723 LinZhihao-723 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For PR title, how about:

Add support for tagged NFA; Use `uint32_t` to replace `int` for IDs.

@LinZhihao-723
Copy link
Member

Sorry mb, let's fix the latest comments from coderabbit

…dcoding to use bytes; Take state type instaed of transition type as template for transition classes.
@SharafMohamed SharafMohamed changed the title Add tagged-NFA; Replace usage of int with uint32_t. Add support for tagged NFA; Use uint32_t to replace int for IDs. Oct 30, 2024
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (2)
src/log_surgeon/finite_automata/RegexNFA.hpp (2)

47-54: Consider adding @throws documentation for serialize methods.

The serialize methods' documentation should indicate whether they can throw exceptions (e.g., from fmt::format). This helps users handle potential exceptions appropriately.

Also applies to: 71-78


Line range hint 177-361: Document thread safety guarantees.

The class handles complex state management but lacks documentation about thread safety. Consider adding thread safety documentation to help users properly synchronize access in multi-threaded contexts.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 24a0efd and edbcdb4.

📒 Files selected for processing (1)
  • src/log_surgeon/finite_automata/RegexNFA.hpp (6 hunks)

Comment on lines +160 to +161
std::vector<PositiveTaggedTransition<RegexNFAState>> m_positive_tagged_transitions;
std::vector<NegativeTaggedTransition<RegexNFAState>> m_negative_tagged_transitions;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consider reserving capacity for transition vectors.

The vectors m_positive_tagged_transitions and m_negative_tagged_transitions might benefit from capacity reservation if you have an estimate of the typical number of transitions.

Comment on lines +343 to +352
return fmt::format(
"{}:{}byte_transitions={{{}}},epsilon_transitions={{{}}},positive_tagged_transitions={{"
"{}}},negative_tagged_transitions={{{}}}",
state_ids.at(this),
accepting_tag_string,
fmt::join(byte_transitions, ","),
fmt::join(epsilon_transitions, ","),
fmt::join(positive_tagged_transitions, ","),
fmt::join(negative_tagged_transitions, ",")
);
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Consider using fmt::join with a custom formatter.

Instead of building multiple intermediate strings with fmt::join and then combining them, consider creating a custom formatter to handle the entire state serialization in one pass. This could improve performance by reducing memory allocations.

Example approach:

struct StateFormatter {
    template<typename ParseContext>
    constexpr auto parse(ParseContext& ctx) { return ctx.begin(); }
    
    template<typename FormatContext>
    auto format(const State& state, FormatContext& ctx) {
        // Format all transitions in one pass
        return fmt::format_to(ctx.out(), "{}:{}", ...);
    }
};

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants