perf: lexer + recognizer micro-optimizations on the C# hot path by tinovyatkin · Pull Request #10 · ophidiarium/antlr-rust-runtime

tinovyatkin · 2026-05-25T08:38:45Z

Summary

Targeted micro-optimizations to the lexer and recursive recognizer that
came out of profiling the C# parse benchmark. Each commit is a
self-contained, measurable wins; together they produce a meaningful
speedup on C# while leaving Kotlin (already fast) untouched or slightly
faster.

This is not a full Adaptive-LL implementation. I scaffolded one
in earlier iterations, couldn't make it fast enough on C#'s grammar
shape (closures explode the merge graph past the per-call deadline)
and stripped it out before sending this PR. Closing the remaining gap
to Go on C# requires that architectural change (per-decision DFA
caching across parses, or generated parser methods that dispatch via
AdaptivePredict) — out of scope here.

Benchmark deltas (release, 5 iters + 2 warmups)

C#, vs upstream Go ANTLR runtime:

fixture	before	after	go	rust vs go
dotnet-wpf-datagrid-column	73.5 ms	41 ms	16 ms	0.40×
mono-anonymous	115.3 ms	66 ms	34 ms	0.52×
mono-codegen	61.9 ms	37 ms	32 ms	0.88×
mono-statement	377 ms	228 ms	120 ms	0.53×

Kotlin: still 9-17× faster than Go; no regression.

What's in the commits

Lexer (cumulative ~30% on C# parse time):

Rc-wrap cached lexer DFA state so lookups are an Rc bump, not a deep
Vec clone
Switch the closure-state seen set from BTreeSet<LexerConfig> to
FxHashSet
Switch LexerDfaTrace BTreeMap fields to FxHashMap
New PredictionFxHasher providing the hash function

Parser (cumulative ~25-30% on C# parse time):

Inline single-atom-match and single-epsilon transitions in the fast
loop, avoiding a recursive recognize_state_fast call per visit
Skip the diagnostics clone in the rule-call follow when the child
produced no diagnostics (always true on pass 1)
Intern an empty NodeList tail to avoid Rc::new(NodeList::Empty) per
prepend onto an empty list
Per-instance decision_lookahead_cache, eliminating thread-local +
RefCell + entry overhead on every multi-trans visit
LL(1) early-commit + per-instance cache when the FIRST sets are
disjoint and not nullable; collapses the per-transition filter loop
to a single hash probe
Inline-scan dedupe instead of BTreeSet for outcome lists ≤8 entries
(with a heap Vec fallback for the overflow case — fixed a bug where
duplicates leaked past 8 distinct entries)
New NodeList::One inline variant to avoid the Rc tail
allocation when the list has a single node

Tooling:

Opt-in perf-counters feature exposing outcome-distribution counters
via ANTLR_PERF_DUMP=1 for future investigation

Test plan

cargo test --release --lib — 37 passing
cargo clippy --locked --all-targets --all-features -- -D warnings — clean
cargo build --release --features perf-counters — clean
parse-bench C# fixtures — improvements above
parse-bench Kotlin fixtures — no regression (9-17× vs Go)

Summary by CodeRabbit

New Features
- Added an optional perf-counters feature with runtime dump/reset controls for performance diagnostics.
Performance
- Faster lexer and parser via hash-based caching and shared cross-parse caches.
- LL(1) fast path for deterministic decisions.
- Custom hashing and faster set membership lookups to reduce hot-path overhead.
Documentation
- Expanded developer guide, benchmarking instructions, and perf-counters usage notes.

Targeted micro-optimizations to the existing recursive recognizer: - IntervalSet::contains uses binary search (was linear) - Loop-collapse single-epsilon transition chains in recognize_state_fast - Strip recovery/rule-context fields from memo key in pass 1 - Skip memoize for single-transition states - u64-packed visit_id for cycle guard - Conditional cycle guard (multi-alt only) - Single hash op for FIRST set lookup at rule transitions - Cross-parse FIRST-set + decision-lookahead caches keyed by static ATN pointer - Rc<[FastRecognizeOutcome]> in memo so writes reuse Vec allocation Results vs baseline: - Kotlin: 32-53% faster, now 5-11x faster than Go (was 3-7x) - C#: 47-65% faster, now 1.8-4.0x slower than Go (was 3-7x slower) Still doesn't beat Go on C#; the next commit replaces the recursive backtracker with an adaptive-LL ATN simulator + DFA cache.

Lightweight FxHash-style hasher used by lexer DFA-trace maps and the epsilon_closure seen set to avoid SipHash overhead on the hot lexer path. Subsequent lexer commits switch BTreeMap/BTreeSet fields to FxHashMap/FxHashSet built on this hasher.

Lookup hot in C# parsing showed cached_lexer_dfa_state cloning the entire Vec of LexerDfaConfigKey on every lexer step. Wrapping the cached state in Rc makes lookup an Rc bump instead of a deep clone.

Profile showed BTreeSet<LexerConfig> insertions were ~10% of inclusive time on C# parsing. Lexer closure walks dedupe many configs per token; an O(1) hash lookup beats the O(log n) tree comparisons against Vec<usize> stack fields. mono-codegen.cs: 62ms -> 42ms (32% faster) mono-statement.cs: 332ms -> 250ms (25% faster)

The cached_lexer_dfa_transition lookup is the inner-loop call inside the lexer DFA walker; lookup-heavy hash maps win over O(log n) tree maps.

Adds OUTCOMES_RETURN_0/1/N counters to recognize_state_fast and an opt-in dump on ANTLR_PERF_DUMP=1 from fast_recognize_top, so future investigations can see the 0/1/many split without rebuilding.

The recognize_state_fast loop already collapses single-Epsilon chains. Extending it to single-Atom/Range/Set/NotSet/Wildcard transitions saves ~17K recursive calls per C# parse (atom transitions are common in flat grammars). The inline path tracks consumed token indices so token nodes still appear in the resulting parse tree. Limited to pass 1 (no recovery); the recovery path needs the existing no-match/expected-symbols machinery.

The follow-state outcome loop unconditionally cloned and appended the child's diagnostics; on pass 1 (no recovery) child_diagnostics is always empty, so the clone allocated a Vec we then immediately replaced. Skip the dance when the source vec is empty.

Each NodeList::Cons created from an Empty list paid for an Rc::new (NodeList::Empty) allocation. Sharing a thread-local Rc<NodeList::Empty> means Empty -> Cons transitions are now a cheap Rc::clone instead.

The decision_lookahead_cache field on BaseParser was unused; the cache was always taken from the shared thread-local SharedAtnCache, which costs a borrow/entry per multi-trans visit. Use the per-instance cache as a fast path so hot decisions skip the RefCell+hashmap dance.

When a multi-trans decision's FIRST sets are disjoint and no transition is nullable, the lookahead deterministically picks one alternative. Commit to that alt directly and skip the per-transition filter probe loop. Same behavior, fewer probes per visit on dense decisions.

…e lists dedupe_clean_fast_outcomes is called per multi-trans visit on pass 1; with typical N=2-4 outcomes the BTreeSet allocation + balancing was the dominant cost. An inline 8-entry scan avoids the heap allocation entirely for the common case while preserving outcome order for downstream selection logic.

The ll1_unique_alt scan was repeated per visit; now its result is memoized per parser instance keyed by (state_number, lookahead_token). The shared SharedAtnCache field is reserved for cross-parse warming (future commit) — the per-instance cache wins on hot decisions within a single parse already.

…> alloc Most outcomes carry a single node; using NodeList::One(node) directly avoids the Rc::new(NodeList::Empty) allocation that the Cons-from-Empty case used to make. Saves one heap alloc per single-node prepend on the hot path.

The inline-scan optimization only checked the first 8 distinct keys. Beyond that, new outcomes silently passed through without being deduplicated against each other — masking duplicates on grammars with many ambiguous alts (e.g. ktor-openapi Kotlin parse). The duplicates explode speculative work one step up the recursion. Promote to a heap Vec for the overflow case so all kept entries continue to participate in dedup. Also folds in two clippy fixes that surfaced during rebase: re-using a doc comment and collapsing a nested if into a match guard.

coderabbitai · 2026-05-25T08:38:56Z

Caution

Review failed

The pull request is closed.

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 23d21bae-5c91-46dc-b40d-3361ee5d0f78

📥 Commits

Reviewing files that changed from the base of the PR and between c704581 and d5b2fe1.

📒 Files selected for processing (1)

CLAUDE.md

📝 Walkthrough

Walkthrough

Adds PredictionFxHasher and FxHash aliases, introduces an opt-in perf-counters Cargo feature and thread-local counters, migrates ordered collections to hash-based caches (lexer ATN, DFA caches), changes IntervalSet::contains to binary search, and heavily refactors parser fast-paths (LL(1) cache, shared FIRST/lookahead caches, recognize_state_fast memoization/epsilon collapse).

Changes

Performance optimization and instrumentation

Layer / File(s)	Summary
Lightweight hasher `src/prediction.rs`	Adds `PredictionFxHasher` implementing `std::hash::Hasher` for deterministic, low-overhead hashing used by FxHashMap/FxHashSet aliases.
Feature flag and perf-counters `Cargo.toml`, `src/parser.rs`	Adds `perf-counters` Cargo feature and a `perf_counters` module (thread-local `Cell<u64>`) with `snapshot`/`reset`/`dump` and crate-level re-exports behind the feature.
Lexer ATN epsilon-closure & IntervalSet `src/atn/lexer.rs`, `src/atn/mod.rs`	Switches lexer epsilon-closure `seen` tracking from `BTreeSet` to `FxHashSet` (derives `Hash` where needed) and changes `IntervalSet::contains` to binary-search on sorted/coalesced ranges.
Lexer DFA trace caching `src/lexer.rs`	Refactors `LexerDfaTrace` to use `FxHashMap` aliases and stores cached DFA states as `Rc<LexerDfaCachedState>`; adds `Hash` to DFA key types and converts const constructors to runtime `pub fn`.
Parser LL(1) cache & NodeList `src/parser.rs`	Adds `ll1_decision_cache` to `BaseParser`, implements `ll1_unique_alt` determinism check, adds `NodeList::One` to avoid wrapper allocations, and initializes LL(1) cache in `BaseParser::new`.
Shared FIRST/lookahead caches `src/parser.rs`	Introduces thread-local `SHARED_ATN_CACHES` keyed by stable ATN identity with helpers for shared FIRST-set and decision-lookahead caches; `cached_decision_lookahead` uses shared caches and per-parse reset clears `ll1_decision_cache`.
recognize_state_fast refactor `src/parser.rs`	Major rewrite to iteratively collapse epsilon chains, inline common single-transition consuming matches, gate LL(1) fast-path by transition count, memoize outcomes as `Rc<[FastRecognizeOutcome]>`, and use a hashed visiting guard for cycle detection.
Perf increments & micro-optimizations `src/parser.rs`	Adds perf-counter increments at several transition types, optimizes diagnostic propagation to avoid cloning when empty, short-circuits failed atom handling in some paths, and tweaks memo capacity estimation.
Outcome dedupe & FIRST pruning `src/parser.rs`	Replaces BTreeSet-based dedupe with an order-preserving inline key buffer (heap fallback) and updates rule-call pruning to consult the shared FIRST cache.
Docs and benchmarks `CLAUDE.md`	Updates inner-loop developer commands, documents source layout, parse-benchmark instructions, and documents the `perf-counters` opt-in build/run; adds CI/clippy guidance.

Sequence Diagram(s)

sequenceDiagram
  participant Client as Parser::parse_atn_rule
  participant Recognize as recognize_state_fast
  participant Shared as SHARED_ATN_CACHES
  participant LL1 as ll1_decision_cache
  participant Memo as memo_cache

  Client->>Recognize: invoke fast-recognition
  Recognize->>Shared: with_shared_atn_caches -> check FIRST/lookahead
  alt LL(1) deterministic
    Recognize->>LL1: consult ll1_decision_cache -> select unique alt
  else fallback
    Recognize->>Memo: check memo (Rc<[FastRecognizeOutcome]>)
    Memo-->>Recognize: return cached outcomes
  end
  Recognize->>Memo: memoize outcomes as Rc slice
  Recognize-->>Client: return recognition outcome(s)

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Possibly related PRs

ophidiarium/antlr-rust-runtime#5: Overlaps on parser fast-recognizer refactors, decision/lookahead caching, and related optimizations.
ophidiarium/antlr-rust-runtime#8: Related changes to lexer DFA caching and trace representation affecting BaseLexer and DFA replay paths.

Poem

"A rabbit cheered with tiny paws,
I hashed the bytes without a pause.
Counters tick, caches share,
Epsilons collapse with flair.
Faster parsing — hop, hooray! 🐇"

🚥 Pre-merge checks | ✅ 5

✅ Passed checks (5 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title accurately summarizes the main change: performance micro-optimizations targeting lexer and recognizer hot paths in a C# context.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

✨ Finishing Touches

🧪 Generate unit tests (beta)

Create PR with unit tests
Commit unit tests in branch perf/csharp-microopts-clean

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

github-actions · 2026-05-25T08:39:03Z

Copy/Paste Detection

Found 10 duplication(s) across 5 changed Rust file(s) (threshold: 100 tokens).

Show duplications

Found a 20 line (125 tokens) duplication in the following files:

Starting at line 198 of src/atn/lexer.rs
Starting at line 222 of src/atn/lexer.rs

pub fn next_token_with_hooks<I, F, A, P, E>(
    lexer: &mut BaseLexer<I, F>,
    atn: &Atn,
    mut custom_action: A,
    mut semantic_predicate: P,
    mut accept_adjuster: E,
) -> CommonToken
where
    I: CharStream,
    F: TokenFactory,
    A: FnMut(&mut BaseLexer<I, F>, LexerCustomAction),
    P: FnMut(&BaseLexer<I, F>, LexerPredicate) -> bool,
    E: FnMut(&mut BaseLexer<I, F>, i32, usize),
{
    next_token_with_hooks_impl(
        lexer,
        atn,
        &mut custom_action,
        &mut semantic_predicate,
        &mut accept_adjuster,

Found a 20 line (120 tokens) duplication in the following files:

Starting at line 425 of src/atn/lexer.rs
Starting at line 540 of src/atn/lexer.rs

            let Some(state) = atn.state(config.state) else {
                continue;
            };
            for transition in &state.transitions {
                if !transition.matches(symbol, MIN_CHAR_VALUE, MAX_CHAR_VALUE) {
                    continue;
                }
                let mut advanced = config.clone();
                set_config_state(atn, &mut advanced, transition.target());
                if symbol == EOF {
                    advanced.consumed_eof = true;
                } else {
                    advanced.position += 1;
                }
                next.push(advanced);
            }
        }

        let closure = epsilon_closure(lexer, atn, next, semantic_predicate);
        let target_has_semantic_context = closure.has_semantic_context;

Found a 30 line (120 tokens) duplication in the following files:

Starting at line 3239 of src/parser.rs
Starting at line 3274 of src/parser.rs

                    let boundary = left_recursive_boundary(atn, state, *target);
                    outcomes.extend(
                        self.recognize_state_fast(
                            atn,
                            FastRecognizeRequest {
                                state_number: *target,
                                stop_state,
                                index,
                                rule_start_index,
                                decision_start_index: next_decision_start_index,
                                precedence,
                                depth: depth + 1,
                                recovery_symbols: Rc::clone(&epsilon_recovery_symbols),
                                recovery_state: epsilon_recovery_state,
                            },
                            visiting,
                            memo,
                            expected,
                        )
                        .into_iter()
                        .map(|mut outcome| {
                            if let Some(rule_index) = boundary {
                                outcome.nodes.prepend(Rc::new(
                                    FastRecognizedNode::LeftRecursiveBoundary { rule_index },
                                ));
                            }
                            outcome
                        }),
                    );
                }

Found a 15 line (114 tokens) duplication in the following files:

Starting at line 5999 of src/parser.rs
Starting at line 6163 of src/parser.rs

    fn parser_matches_token_and_reports_mismatch() {
        let source = Source {
            tokens: vec![
                CommonToken::new(1).with_text("x"),
                CommonToken::eof("parser-test", 1, 1, 1),
            ],
            index: 0,
        };
        let data = RecognizerData::new(
            "Mini.g4",
            Vocabulary::new([None, Some("'x'")], [None, Some("X")], [None::<&str>, None]),
        );
        let mut parser = BaseParser::new(CommonTokenStream::new(source), data);
        assert_eq!(
            parser.match_token(1).expect("token 1 should match").text(),

Found a 32 line (113 tokens) duplication in the following files:

Starting at line 4094 of src/parser.rs
Starting at line 4155 of src/parser.rs

                        outcomes.extend(
                            self.recognize_state(
                                atn,
                                RecognizeRequest {
                                    state_number: *target,
                                    stop_state,
                                    index,
                                    rule_start_index,
                                    decision_start_index: next_decision_start_index,
                                    init_action_rules,
                                    predicates,
                                    rule_args,
                                    member_actions,
                                    return_actions,
                                    local_int_arg,
                                    member_values: member_values.clone(),
                                    return_values: return_values.clone(),
                                    rule_alt_number: next_alt_number,
                                    track_alt_numbers,
                                    consumed_eof,
                                    precedence,
                                    depth: depth + 1,
                                    recovery_symbols: epsilon_recovery_symbols.clone(),
                                    recovery_state: epsilon_recovery_state,
                                },
                                visiting,
                                memo,
                                expected,
                            )
                            .into_iter()
                            .map(|mut outcome| {
                                prepend_decision(&mut outcome, decision);

Found a 14 line (110 tokens) duplication in the following files:

Starting at line 6207 of src/parser.rs
Starting at line 6230 of src/parser.rs

    fn outcome_ties_keep_later_non_recursive_alternative() {
        let first = RecognizeOutcome {
            index: 1,
            consumed_eof: false,
            alt_number: 0,
            member_values: BTreeMap::new(),
            return_values: BTreeMap::new(),
            diagnostics: Vec::new(),
            decisions: Vec::new(),
            actions: vec![ParserAction::new(1, 0, 0, None)],
            nodes: vec![RecognizedNode::Token { index: 0 }],
        };
        let second = RecognizeOutcome {
            actions: vec![ParserAction::new(2, 0, 0, None)],

Found a 13 line (109 tokens) duplication in the following files:

Starting at line 5999 of src/parser.rs
Starting at line 6138 of src/parser.rs
Starting at line 6163 of src/parser.rs

    fn parser_matches_token_and_reports_mismatch() {
        let source = Source {
            tokens: vec![
                CommonToken::new(1).with_text("x"),
                CommonToken::eof("parser-test", 1, 1, 1),
            ],
            index: 0,
        };
        let data = RecognizerData::new(
            "Mini.g4",
            Vocabulary::new([None, Some("'x'")], [None, Some("X")], [None::<&str>, None]),
        );
        let mut parser = BaseParser::new(CommonTokenStream::new(source), data);

Found a 19 line (106 tokens) duplication in the following files:

Starting at line 1920 of src/parser.rs
Starting at line 2360 of src/parser.rs

        let start_state = atn
            .rule_to_start_state()
            .get(rule_index)
            .copied()
            .ok_or_else(|| {
                AntlrError::Unsupported(format!("rule {rule_index} has no start state"))
            })?;
        let stop_state = atn
            .rule_to_stop_state()
            .get(rule_index)
            .copied()
            .filter(|state| *state != usize::MAX)
            .ok_or_else(|| {
                AntlrError::Unsupported(format!("rule {rule_index} has no stop state"))
            })?;

        let start_index = self.current_visible_index();
        self.clear_prediction_diagnostics();
        self.reset_per_parse_caches();

Found a 16 line (104 tokens) duplication in the following files:

Starting at line 2120 of src/parser.rs
Starting at line 2159 of src/parser.rs

            FastRecognizedNode::Rule {
                rule_index,
                invoking_state,
                start_index,
                stop_index,
                children,
            } => {
                let mut context = ParserRuleContext::new(*rule_index, *invoking_state);
                if let Some(token) = self.token_at(*start_index) {
                    context.set_start(token);
                }
                if let Some(token) = stop_index.and_then(|index| self.token_at(index)) {
                    context.set_stop(token);
                }
                if children.has_left_recursive_boundary() {
                    let folded = fold_fast_left_recursive_boundaries(children.to_vec());

Found a 15 line (103 tokens) duplication in the following files:

Starting at line 2105 of src/parser.rs
Starting at line 5099 of src/parser.rs

            FastRecognizedNode::MissingToken {
                token_type,
                at_index,
                text,
            } => {
                let current = self.token_at(*at_index);
                let token = CommonToken::new(*token_type)
                    .with_text(text)
                    .with_span(usize::MAX, usize::MAX)
                    .with_position(
                        current.as_ref().map(Token::line).unwrap_or_default(),
                        current.as_ref().map(Token::column).unwrap_or_default(),
                    );
                Ok(ParseTree::Error(ErrorNode::new(token)))
            }

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: cfd5cd9692

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

greptile-apps · 2026-05-25T08:45:01Z

Confidence Score: 5/5

Safe to merge; all changes are confined to the hot-path recognizer and lexer DFA cache, and the 37-test suite plus C#/Kotlin parse-bench results confirm both correctness and the expected speedups.

The two previously raised concerns — hash collision in cycle detection and stale ATN-cache entries after pointer reuse — are both addressed: cycle detection now uses exact (state_number, index) tuples and the shared cache key bundles secondary identity fields. The inline epsilon/atom loop, the pass-1 memo-key relaxation, the LL(1) early-commit path, and the NodeList::One variant were each traced through their edge cases and are logically sound.

No files require special attention; the most complex changes are in src/parser.rs and are well-commented and tested.

_{Reviews (3): Last reviewed commit: "chore: improve CLAUDE.md" | Re-trigger Greptile}

gemini-code-assist

Code Review

This pull request introduces significant performance optimizations to the lexer and parser. Key changes include replacing standard collections with custom-hashed versions (FxHashMap/FxHashSet), optimizing IntervalSet lookups with binary search, and implementing a thread-local cache for ATN-related computations to share data across parses. The recognize_state_fast function was refactored to reduce recursion through loop-unrolling of epsilon chains and an LL(1) fast path. Feedback highlights a potential bug where the new ll1_decision_cache is not reset between parses, safety concerns regarding the use of raw pointers as global cache keys, and redundant logic in FIRST set retrieval.

gemini-code-assist · 2026-05-25T08:45:22Z

+fn with_shared_first_set_cache<R>(
+    atn: &Atn,
+    f: impl FnOnce(&mut FirstSetCache) -> R,
+) -> R {
+    SHARED_ATN_CACHES.with(|cell| {
+        let key = std::ptr::from_ref::<Atn>(atn) as usize;
+        let mut map = cell.borrow_mut();
+        let cache = map.entry(key).or_default();
+        f(&mut cache.first_set)
+    })
+}
+
+fn with_shared_atn_caches<R>(
+    atn: &Atn,
+    f: impl FnOnce(&mut SharedAtnCache) -> R,
+) -> R {
+    SHARED_ATN_CACHES.with(|cell| {
+        let key = std::ptr::from_ref::<Atn>(atn) as usize;
+        let mut map = cell.borrow_mut();
+        let cache = map.entry(key).or_default();
+        f(cache)
+    })
+}


Using raw pointers (std::ptr::from_ref::<Atn>(atn) as usize) as keys in the global SHARED_ATN_CACHES map poses a risk of memory leaks and stale data. Since Atn objects are not guaranteed to be 'static by the type system, a dropped Atn could leave an entry in the map indefinitely. Furthermore, if a new Atn is allocated at the same memory address, the parser might incorrectly use cached data from a different grammar. Consider using a unique identifier within the Atn struct or a safer caching strategy that accounts for object lifetimes.

Addressed in c704581 — see reply on the same line. The compound key includes a stable secondary identity (states pointer/length, max_token_type) so a reused ATN allocation cannot pick up entries from the previous grammar.

coderabbitai

Actionable comments posted: 4

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/lexer.rs`:
- Around line 125-127: The cache uses Rc which prevents BaseLexer from being
Send/Sync; change FxHashMap<usize, Rc<LexerDfaCachedState>> in the LexerDfaTrace
type to FxHashMap<usize, Arc<LexerDfaCachedState>> (add use std::sync::Arc) and
update any Rc::new / Rc::clone sites to Arc::new / Arc::clone; then update the
helper functions cached_lexer_dfa_state and cache_lexer_dfa_state to
return/store Arc<LexerDfaCachedState> instead of Rc, and adjust any callsites to
accept Arc (or clone the Arc where sharing is needed) so the cache remains
equivalent but uses thread-safe ownership.

In `@src/parser.rs`:
- Around line 958-985: The cache key uses the raw Atn pointer
(std::ptr::from_ref) which can be reused after an Atn is dropped; instead use a
stable, per-Atn unique identifier or require a 'static/owned handle. Modify
SHARED_ATN_CACHES lookups in with_shared_first_set_cache and
with_shared_atn_caches to use a stable key produced by the Atn itself (e.g., add
or use an existing Atn::unique_id() / id() usize field set at Atn construction)
rather than the pointer, or change the APIs to accept an Arc<Atn>/'static
reference and use a guaranteed-unique generation id from that handle; update all
callsites to provide the new stable id/handle and keep using SharedAtnCache,
FirstSetCache, and the same map semantics.
- Around line 442-447: The ll1_decision_cache field (FxHashMap<(usize, i32),
Option<usize>>) is grammar-specific and must be cleared when a parser is reset;
update BaseParser::reset_per_parse_caches() to clear or reinitialize
ll1_decision_cache (e.g., call self.ll1_decision_cache.clear() or assign
Default::default()) so that cached LL(1) alt selections from a previous
Atn/grammar are not reused across parse boundaries.
- Around line 3312-3319: The code assumes rule_first_set populated the shared
cache and unconditionally calls expect on cache.get(&first_key), causing a panic
when rule_first_set skips insertion on cycles; update rule_first_set to return
the computed FIRST set (e.g., Option<Rc<...>> or Rc<...> when produced) and
change the with_shared_first_set_cache closure (the block using first_key,
rule_first_set, and cache.get) to use the returned FIRST value as a fallback
when cache.get(&first_key) is absent instead of calling expect — i.e., call let
maybe_first = rule_first_set(...); then clone cache entry if present else clone
maybe_first to avoid panics on recursive grammars.

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 3a8c598c-830a-4c4a-b336-22fad6a3e64a

📥 Commits

Reviewing files that changed from the base of the PR and between f269543 and cfd5cd9.

📒 Files selected for processing (6)

Cargo.toml
src/atn/lexer.rs
src/atn/mod.rs
src/lexer.rs
src/parser.rs
src/prediction.rs

Match the call style used in src/parser.rs's local FxHasher (`i32::cast_unsigned(value)` rather than method-call form). Same behavior, just keeps the two parallel hasher impls visually aligned. Addresses gemini-code-assist comment on PR #10.

The LL(1) cache is keyed by ATN state numbers, which are grammar-specific. A BaseParser reused with a different Atn would otherwise commit to alts chosen for the previous grammar. Same fix as for first_set_cache and decision_lookahead_cache, which already clear on reset. Addresses comments from chatgpt-codex (P1), gemini-code-assist (high), greptile (P1), and coderabbit (Major) on PR #10.

rule_first_set intentionally skips inserting into the cache when its FIRST-set walk hits a cycle. The previous code probed the cache after the call, which would .expect()-panic on recursive grammars. Use the returned Rc<FirstSet> directly — it's correct in both cycle and non-cycle cases. Addresses gemini-code-assist (high) and coderabbit (Critical) comments on PR #10.

The previous code hashed (state_number, index) into a single u64 with a multiplicative mix. While the collision probability is ~10^-19, a collision would silently treat a fresh visit as a cycle and drop all outcomes — i.e., produce a wrong parse with no diagnostic. Storing the exact (usize, usize) pair preserves O(1) FxHashSet lookup without giving up exactness; the bookkeeping is still tiny because the visiting set holds at most a few hundred entries at any time. Addresses greptile (P2) comment on PR #10.

Use a compound (atn ptr, states ptr, states len, max_token_type) key instead of just the Atn pointer. Real-world generated parsers use 'static OnceLock<Atn>', so collisions are impossible for them. But for non-'static Atns (dropped between parses), the previous key would let a new Atn allocated at the same address pick up the previous grammar's FIRST/lookahead entries. Adding a stable secondary identity catches that pointer collision. Addresses chatgpt-codex (P2), gemini-code-assist (medium), greptile (P1), and coderabbit (Major) comments on PR #10.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c704581b24

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-25T10:36:46Z

+        let visit_id = (state_number, index);
+        if needs_cycle_guard && !visiting.insert(visit_id) {


Include parse context in cycle-guard key

The new cycle guard keys visiting only by (state_number, index), but viability at a state also depends on request context (notably precedence, and in recovery mode the recovery context). In this function, precedence is passed through recursive rule calls and directly gates Transition::Precedence handling, so revisiting the same state/index with a different precedence can be a valid path; this guard now returns Vec::new() instead, pruning that path and potentially producing a wrong parse or spurious no-viable error on left-recursive precedence grammars.

Useful? React with 👍 / 👎.

tinovyatkin added 15 commits May 25, 2026 10:21

perf(lexer): Rc-wrap cached DFA state to avoid Vec<ConfigKey> clone

d1cc835

Lookup hot in C# parsing showed cached_lexer_dfa_state cloning the entire Vec of LexerDfaConfigKey on every lexer step. Wrapping the cached state in Rc makes lookup an Rc bump instead of a deep clone.

perf(lexer): swap LexerDfaTrace BTreeMap fields for FxHashMap

a16fab4

The cached_lexer_dfa_transition lookup is the inner-loop call inside the lexer DFA walker; lookup-heavy hash maps win over O(log n) tree maps.

perf-counters: track outcome list length distribution

6014aad

Adds OUTCOMES_RETURN_0/1/N counters to recognize_state_fast and an opt-in dump on ANTLR_PERF_DUMP=1 from fast_recognize_top, so future investigations can see the 0/1/many split without rebuilding.

perf(parser): intern empty NodeList tail to avoid Rc allocation

74936d1

Each NodeList::Cons created from an Empty list paid for an Rc::new (NodeList::Empty) allocation. Sharing a thread-local Rc<NodeList::Empty> means Empty -> Cons transitions are now a cheap Rc::clone instead.

chatgpt-codex-connector Bot reviewed May 25, 2026

View reviewed changes

Comment thread src/parser.rs

Comment thread src/parser.rs Outdated

greptile-apps Bot reviewed May 25, 2026

View reviewed changes

Comment thread src/parser.rs Outdated

Comment thread src/parser.rs Outdated

gemini-code-assist Bot reviewed May 25, 2026

View reviewed changes

coderabbitai Bot reviewed May 25, 2026

View reviewed changes

Comment thread src/lexer.rs

Comment thread src/parser.rs

Comment thread src/parser.rs

Comment thread src/parser.rs

tinovyatkin added 5 commits May 25, 2026 12:26

chatgpt-codex-connector Bot reviewed May 25, 2026

View reviewed changes

chore: improve CLAUDE.md

d5b2fe1

tinovyatkin merged commit cde9ad0 into main May 25, 2026
7 of 8 checks passed

		let visit_id = (state_number, index);
		if needs_cycle_guard && !visiting.insert(visit_id) {

Conversation

tinovyatkin commented May 25, 2026 • edited by coderabbitai Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Benchmark deltas (release, 5 iters + 2 warmups)

What's in the commits

Test plan

Summary by CodeRabbit

Uh oh!

coderabbitai Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Review failed

Walkthrough

Changes

Uh oh!

github-actions Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Copy/Paste Detection

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

greptile-apps Bot commented May 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Confidence Score: 5/5

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Uh oh!

gemini-code-assist Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

tinovyatkin May 25, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

coderabbitai Bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 25, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

tinovyatkin commented May 25, 2026 •

edited by coderabbitai Bot

Loading

coderabbitai Bot commented May 25, 2026 •

edited

Loading

github-actions Bot commented May 25, 2026 •

edited

Loading

greptile-apps Bot commented May 25, 2026 •

edited

Loading