Skip to content

yaml: reject tab in s-indent position before structural tokens (DK95/06, Y79Y/005, Y79Y/008 — 402/402)#31527

Merged
dylan-conway merged 18 commits into
mainfrom
claude/yaml-tab-indent
May 29, 2026
Merged

yaml: reject tab in s-indent position before structural tokens (DK95/06, Y79Y/005, Y79Y/008 — 402/402)#31527
dylan-conway merged 18 commits into
mainfrom
claude/yaml-tab-indent

Conversation

@dylan-conway

Copy link
Copy Markdown
Member

What

Per [62]/[63] s-indent(n) is spaces only. A tab in indent position is s-separate-in-line — valid before [197] flow-in-block content ([69] s-flow-line-prefix(n) ::= s-indent(n) s-separate-in-line?), but never before a [184]/[192]/[195] structural sibling (-/?/:/implicit-key), which the grammar places immediately at s-indent(n) with no separation production in between.

Bun was accepting tab-tainted structural tokens because Token.indent records the count of leading spaces (correctly), but the parser had no way to ask "was there additional whitespace between those spaces and the token?"

How

Scanner: Parser.tab_after_indent: bool records "a tab was seen between this line's s-indent (or post-indicator additional_parent_indent column) and the current token." Set in scan()'s tab arm (and fold_lines()' tab arm, which is what consumes whitespace for the DK95/06 shape). Reset on newline() and at scan() entry when re-entering indent position.

Parser: checked at every structural-sibling recognition site (the complete set per [184]/[192]/[195]):

Site Production
parse_block_sequence loop [184] each - at s-indent(n)
parse_block_mapping loop [195] each entry at s-indent(n)
explicit : indent check (first + subsequent) [191] s-indent(n) ":"
parse_node MappingKey / MappingValue arms first entry of new block mapping
parse_node Scalar arm [192] implicit key
parse_block_indented same-line compact [185]

Content paths (flow context, s-separate after indicator, plain-scalar fold continuation) correctly do not check it — tab is valid there.

This is the general spec rule, not three test cases: it covers every grammar position where s-indent(n) precedes a structural token, not just the three official-suite inputs.

yaml-test-suite

DK95/06, Y79Y/005, Y79Y/008 activated. 402/402, 0 todos.

PR Suite todos
#31112 27 → 10
#31203 10 → 7
#31308 7 → 3
this 3 → 0

Validation

  • 1887 pass / 0 fail / 11 todo
  • 1015/1015 4-parser consensus harness
  • 11 new unit tests (7 fail on system Bun, 4 lock in valid-tab content cases)
  • Also flips the #31203 test.todo("tab before block construct") (3/3 asserts)

…tructural token

[62]/[63] s-indent(n) is spaces only. A tab between s-indent and the next
token is s-separate-in-line — valid before [197] flow-in-block content
([80]/[69] s-flow-line-prefix(n) ::= s-indent(n) s-separate-in-line?),
never before a [184]/[192]/[195] structural sibling (`-`/`?`/`:`/key),
which sit immediately at s-indent(n).

Scanner: `Parser.tab_after_indent: bool` records "tab seen between
line-start (or post-indicator additional_parent_indent) and the current
token". Set in scan() tab arm + fold_lines() tab arm; reset on
newline().

Parser: checked at every structural-sibling recognition site:
- parse_block_sequence loop (each `-` at s-indent(n))
- parse_block_mapping loop (each entry at s-indent(n))
- explicit `:` indent check (first + subsequent)
- parse_node MappingKey/MappingValue arms (first block entry)
- parse_node Scalar arm (scalar → implicit key)
- parse_block_indented same-line compact construct

Content paths (flow context, s-separate after indicator, plain-scalar
fold continuation) correctly do not check it.

Activates DK95/06, Y79Y/005, Y79Y/008 — yaml-test-suite is now 402/402.
@robobun

robobun commented May 28, 2026

Copy link
Copy Markdown
Collaborator
Updated 11:05 PM PT - May 28th, 2026

@dylan-conway, your commit 8e4f8ee has some failures in Build #58832 (All Failures)


🧪   To try this PR locally:

bunx bun-pr 31527

That installs a local version of the PR into your bun-31527 executable, so you can run:

bun-31527 --bun

@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

This PR adds a parser-level tab-after-indent taint, detects tabs in indentation-sensitive scanner positions, centralizes token initialization, propagates the taint through block sequence/mapping/implicit-key flows and folded-scalar scans, and updates/activates tests to assert the new TabIndentation failures.

Changes

YAML tab-position indentation validation

Layer / File(s) Summary
Tab-position field initialization and parser setup
src/parsers/yaml.rs
Parser.tab_after_indent and ParseError::TabIndentation are added; ParseResult mapping and newline reset behavior are updated; token_init() centralizes token metadata.
scan() indent-position refactor and token_init
src/parsers/yaml.rs
scan() gains in_indent_position, token construction sites refactored to use token_init(), and directive/reserved-token paths updated.
Whitespace/newline and tab detection
src/parsers/yaml.rs
Newline handling sets in_indent_position and clears additional_parent_indent; tabs consumed in indent positions set tab_after_indent, clear additional_parent_indent, and trigger TabIndentation where applicable.
Scalar folding indentation scanning
src/parsers/yaml.rs
fold_lines and folded-scalar indentation scans taint tab_after_indent when an indentation tab is consumed and update indentation bookkeeping.
Block sequence tab validation
src/parsers/yaml.rs
Block sequence entry parsing (-) rejects when tab_after_indent is set.
Block mapping : handling and entry-loop validation
src/parsers/yaml.rs
Explicit : handling and mapping entry-loop indentation checks now treat tainted indent gaps as TabIndentation in block (non-flow) contexts.
Block-indented rescan and compact-indicator paths
src/parsers/yaml.rs
Preserves tab-after-indent across block-indented rewinds and tightens compact-indicator acceptance to reject tainted cases.
Implicit-key resolution and ? handling
src/parsers/yaml.rs
Snapshots and checks tab_after_indent for implicit-key candidates (alias, sequence, mapping, scalar) and rejects block-context implicit keys or explicit ? when tainted.
Tab-handling test coverage
test/js/bun/yaml/yaml-test-suite.test.ts, test/js/bun/yaml/yaml.test.ts
Activates several YAML test-suite cases (DK95/06, Y79Y/005, Y79Y/008), converts multiple test.todos into active tests, updates many expectations to the TabIndentation message, and adds a comprehensive tab in s-indent position suite with matrix checks.

Possibly related PRs

  • oven-sh/bun#31112: Both PRs modify src/parsers/yaml.rs to change how indentation/additional_parent_indent affects block explicit mapping key (?) / : parsing and token termination.
  • oven-sh/bun#31308: Both PRs modify the YAML block-indented parsing paths for the same constructs in src/parsers/yaml.rs and overlap on parser entry points for block constructs.

Suggested reviewers

  • RiskyMH
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: rejecting tabs in s-indent position before structural tokens in YAML parsing, with reference to the test cases and completion metric (402/402).
Description check ✅ Passed The description fully addresses both template requirements: it thoroughly explains what the PR does (tab-indentation taint tracking) and how verification was performed (test suite activation, validation metrics, unit tests).
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Warning

Review ran into problems

🔥 Problems

Stopped waiting for pipeline failures after 30000ms. One of your pipelines takes longer than our 30000ms fetch window to run, so review may not consider pipeline-failure results for inline comments if any failures occurred after the fetch window. Increase the timeout if you want to wait longer or run a @coderabbit review after the pipeline has finished.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/parsers/yaml.rs`:
- Around line 3608-3612: The code clears self.tab_after_indent before calling
self.scan(), which loses the leading-tab taint for the token and allows invalid
tab-indented sibling tokens to bypass s-indent checks; fix by preserving the
taint across the re-scan — remove or defer clearing self.tab_after_indent until
after self.scan(ScanOptions::default()) completes (i.e., do not reset
self.tab_after_indent before calling scan()), so that scan() and subsequent
s-indent validation see the original tab state of self.token.

In `@test/js/bun/yaml/yaml.test.ts`:
- Around line 1342-1412: Several nearby tests (the rejection cases covering
"rejects tab before sibling block-seq `-`", "rejects tab before sibling
block-map entry", and "rejects tab before explicit `:` continuation" etc.)
repeat the same pattern; refactor them into a single parameterized test using
test.each that iterates over an array of invalid YAML strings and asserts
expect(() => YAML.parse(input)).toThrow("Unexpected token") for each; keep the
other acceptance tests as-is and ensure the test name for the test.each clearly
indicates these are tab-in-s-indent rejection cases so they remain discoverable.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: dd03a9c0-1f5d-4d1a-a4d3-bdcd6d3ce288

📥 Commits

Reviewing files that changed from the base of the PR and between 79a4029 and 0ff238d.

📒 Files selected for processing (3)
  • src/parsers/yaml.rs
  • test/js/bun/yaml/yaml-test-suite.test.ts
  • test/js/bun/yaml/yaml.test.ts

Comment thread src/parsers/yaml.rs
Comment thread test/js/bun/yaml/yaml.test.ts
dylan-conway and others added 2 commits May 28, 2026 11:29
…x FH7J/3GZX/C4HZ tests

parse_node MappingValue arm built `first_key = E::Null` directly,
dropping any tag/anchor recorded in node_props. `!!str : x` gave
`{null:x}` instead of `{"":x}`. The end-of-function resolve_null only
applies to the returned node (the mapping), not the key.

Now `first_key = node_props.tag().resolve_null(loc)` and the anchor is
registered on the e-node key before parse_block_mapping. Also flips the
`anchor as implicit-key e-node` test.todo.

Test-suite audit fixes (vs upstream yaml-test-suite@data 6ad3d2c):
- FH7J: item[2] expected was `{null:null}`; events say key is
  `!!str` empty → `""`, so `{"":null}`. The test was locking in the
  wrong parser output.
- 3GZX/C4HZ: shared-reference `.toBe()` assertions used non-existent
  keys (asserted `undefined === undefined`, vacuous). Fixed to use the
  actual key paths.
@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Actionable comments posted: 0

dylan-conway and others added 2 commits May 28, 2026 11:40
…teraction matrix

The rewind+rescan re-resolves the abandoned scalar's value (tag-neutral),
not its leading whitespace; the original scan's tab_after_indent is the
source of truth and the re-scan (in_indent_position=false, positioned
past the tab) cannot re-detect it.

Adds a 24-case matrix (3 indicators × 4 property prefixes × 2 tab
positions) covering every property-loop × tab-taint interaction.
@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Actionable comments posted: 0

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional findings (outside current diff — PR may have been updated during review):

  • 🔴 src/parsers/yaml.rs:4239-4252 — The block-scalar body scanner (scan_auto_indented_literal_scalar, shared by | and >) is the third leading-whitespace consumer and is not instrumented, so a tab before a sibling that immediately follows block-scalar content is still accepted — e.g. YAML.parse("- |\n x\n\t- y\n") returns ["x\n","y"] instead of throwing, and likewise for a: |\n x\n\tb: y and ? |\n x\n\t: y. The phase-2 loop's other => arm (and/or the post-space arm) needs to set self.tab_after_indent = true when it sees a tab on the terminating line, mirroring what was done for fold_lines().

    Extended reasoning...

    What the bug is

    This PR instruments two leading-whitespace consumers to set tab_after_indent: the scan() tab arm (gated on in_indent_position) and fold_lines() (lines 4242-4244 and 4251). But there is a third consumer that owns the line after a block scalar's content: scan_auto_indented_literal_scalar (yaml.rs:4951-5061), which is the body scanner shared by scan_literal_scalar (|) and scan_folded_scalar (>). It has its own newline / leading-space handling and never calls fold_lines(), so the taint is never recorded for the line that terminates the block scalar.

    Step-by-step trace of "- |\n x\n\t- y\n"

    1. parse_block_sequenceparse_block_indentedscan(additional_parent_indent=Some(1)) hits 0x7Cscan_literal_scalarscan_auto_indented_literal_scalar.
    2. Phase 2 appends x, then hits the outer 0x0A arm at line 4964: self.newline() (line 4966) resets tab_after_indent = false and line_indent = NONE.
    3. The inner "newlines" loop reads nc = 0x09. The tab checks at lines 4980/4989 only fire after inner newlines (i.e. for the 2nd+ consecutive blank line), so a single \n before \t falls through to the other => arm at lines 5008-5012, which just sets __c = 0x09 and breaks — no taint recorded.
    4. The outer catch-all arm at line 5047 sees line_indent (0/NONE) < min_indent (2) and returns ctx.done(). Scanner state on return: pos is at the tab, line_indent = 0, tab_after_indent = false.
    5. Back in parse_node's Scalar arm, line 4071 captures scalar_tab_after_indent = false, then line 4089 calls self.scan(ScanOptions { tag, outside_context: true, ..Default::default() }). Defaults give first_scan = false, additional_parent_indent = None, so at lines 5531-5536 count_indentation = false and in_indent_position = false.
    6. scan() reaches the 0x09 arm at line 5826: count_indentation is false (BlockIn error skipped) and in_indent_position is false, so the new taint assignment at line 5842 is skipped. The tab is consumed; the next char is -Token::sequence_entry with indent = self.line_indent = 0.
    7. parse_block_sequence loop iteration 2 (line 2864): token is SequenceEntry, token.indent (0) == sequence_indent (0), and the new check at line 2868 reads self.tab_after_indentfalse. The sibling is wrongly accepted; result is ["x\n", "y"].

    The exact same gap applies to:

    • "a: |\n x\n\tb: y\n" — accepted as {a:"x\n", b:"y"} because parse_block_mapping's check at line 3062 sees tab_after_indent = false.
    • "? |\n x\n\t: y\n" — accepted as {"x\n":"y"} because the explicit-: check at line 2998 sees tab_after_indent = false.
    • All of the above with > in place of | (same body scanner).

    Why existing code doesn't prevent it

    • fold_lines() is instrumented (lines 4242-4252) but is only used by plain/quoted scalars. Block scalars bypass it entirely.
    • scan()'s tab arm only taints when in_indent_position is true, which requires first_scan or additional_parent_indent or having crossed a newline inside that scan() call. The post-scalar scan at line 4089 starts mid-line (block-scalar scanner already consumed the newline + spaces), so none of those hold.
    • The block-scalar scanner's own tab checks at lines 4980/4989 only cover consecutive newlines (empty lines inside the scalar), not the first newline after content.

    Compare with the plain-scalar case the PR does cover: YAML.parse("- a\n\t- b\n") correctly throws because scan_plain_scalar calls fold_lines() for its terminating line, whose tab arm (line 4251) sets the flag, which then survives to line 2868.

    Impact

    The PR description states this change "covers every grammar position where s-indent(n) precedes a structural token," but post-block-scalar siblings are exactly such a [184]/[191]/[195] position and are missed. Reference parsers (PyYAML, ruamel, libyaml) reject all of these inputs.

    Fix

    Instrument scan_auto_indented_literal_scalar's phase-2 line handling the same way fold_lines() was: in the inner loop, when the post-space character (line 5003) or the immediate post-newline character (other => arm, line 5008) is 0x09, set self.tab_after_indent = true before breaking. That flag will then survive the return (it's only reset by newline() or by entering scan() in indent position) and be read by the structural-sibling checks at lines 2868 / 2998 / 3062 / 3090.

Comment thread src/parsers/yaml.rs
Comment thread src/parsers/yaml.rs Outdated
dylan-conway and others added 3 commits May 28, 2026 11:59
"Tab characters cannot be used as indentation" — emitted from every
tab_after_indent check site (parser + scanner). Matches what eemeli/yaml
("Tabs are not allowed as indentation"), js-yaml ("tab characters must
not be used in indentation"), and PyYAML/ruamel ("found character \t
that cannot start any token") do.

The compact-construct check in parse_block_indented is split: only the
tab_after_indent path emits TabIndentation; the indent<=n path (e.g.
`a: ? x` — no tab, no api) keeps UnexpectedToken.
… into claude/yaml-tab-indent

# Conflicts:
#	test/js/bun/yaml/yaml.test.ts

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/parsers/yaml.rs (1)

5840-5845: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win

Preserve the tab error location before returning from scan().

ParseResult::fail() resolves ParseError::TabIndentation from parser.token.start, but this branch returns before a new token is installed. On this path the diagnostic points at the previous token instead of the offending tab.

🔧 Minimal fix
                     if count_indentation
                         && additional_parent_indent.is_none()
                         && self.context.get() == Context::BlockIn
                     {
+                        self.token.start = self.pos;
                         return Err(ParseError::TabIndentation);
                     }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/parsers/yaml.rs` around lines 5840 - 5845, The branch in scan() returns
Err(ParseError::TabIndentation) before a new token is installed, so
ParseResult::fail() uses the previous token.location; fix by capturing the
offending location (e.g., set parser.token.start or create a temporary token
with the current scanner position) before returning and use that when returning
the TabIndentation error so diagnostics point at the tab; update the scan()
branch that checks count_indentation && additional_parent_indent.is_none() &&
self.context.get() == Context::BlockIn to record the current position and
propagate the ParseError::TabIndentation with that position.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Outside diff comments:
In `@src/parsers/yaml.rs`:
- Around line 5840-5845: The branch in scan() returns
Err(ParseError::TabIndentation) before a new token is installed, so
ParseResult::fail() uses the previous token.location; fix by capturing the
offending location (e.g., set parser.token.start or create a temporary token
with the current scanner position) before returning and use that when returning
the TabIndentation error so diagnostics point at the tab; update the scan()
branch that checks count_indentation && additional_parent_indent.is_none() &&
self.context.get() == Context::BlockIn to record the current position and
propagate the ParseError::TabIndentation with that position.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: b47c99a5-a167-4337-a710-42ddbb1d5f5b

📥 Commits

Reviewing files that changed from the base of the PR and between 06bb7ca and 6c9b769.

📒 Files selected for processing (2)
  • src/parsers/yaml.rs
  • test/js/bun/yaml/yaml.test.ts

…ng-line lookahead

The block-scalar (`|`/`>`) body scanner is the third leading-whitespace
consumer (after scan() and fold_lines()). Its phase-2 nested-newlines
loop consumes the leading whitespace of each body line, including the
line whose dedent terminates the scalar — but did not set
tab_after_indent, so `- |\n  x\n\t- y` was accepted as `["x\n","y"]`.

Now the `0x20`-then-tab and `other == 0x09` cases taint, mirroring
fold_lines().

Also: drop leftover `let r = ...; break r?;` binding from removed debug
prints; collapse stray blank lines.
@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Actionable comments posted: 0

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional findings (outside current diff — PR may have been updated during review):

  • 🔴 src/parsers/yaml.rs:4043-4048 — This unconditionally applies node_props.anchor()/.tag() to the e-node key and then resets node_props, but unlike the sibling Scalar/SequenceStart/MappingStart arms it does not use implicit_key_anchors(self.token.line) to do the line-based [200]/[193] split. For &a\n: x the anchor is now applied to the null key instead of the mapping (regression — *a later resolves to null instead of {null:"x"}), and for &outer\n&inner\n: x the node_props = default() wipes has_mapping_anchor so &outer is silently dropped instead of erroring MultipleAnchors at line 4171 (same for has_mapping_tag). The fix for the same-line case (&a : x, FH7J) is correct; it just needs to gate on the property's line matching self.token.line like the other arms do.

    Extended reasoning...

    What changed

    Commit c407311 (the FH7J/3GZX/C4HZ fix in this PR) rewrote the TokenData::MappingValue arm of parse_node so that when a bare : opens a block mapping with an e-node first key, any pending node_props are applied to that key:

    let first_key = node_props.tag().resolve_null(self.token.start.loc());
    if let Some(anchor) = node_props.anchor() {
        self.anchors.put(Enc::key_bytes(anchor.slice(self.input)), first_key)?;
    }
    node_props = NodeProperties::default();
    break 'node self.parse_block_mapping(first_key, ...)?;

    This is correct for the same-line case &a : x / !!str : x (the e-node key carries the property, per [193] c-ns-properties in BLOCK-KEY context), and that is what FH7J and the un-todo'd "anchor as implicit-key e-node" test exercise.

    But node_props.anchor() (line 3358-3363) returns has_anchor with no line check, and node_props = NodeProperties::default() then wipes has_mapping_anchor / has_mapping_tag. The three sibling implicit-key arms (Scalar at 4140, SequenceStart at 3863, MappingStart at 3950) all route through node_props.implicit_key_anchors(implicit_key_line) instead, which encodes the spec's line-based split: a property on the same line as the key belongs to the key; a property on a prior line is the [200] s-l+block-collection property and belongs to the mapping. The new MappingValue arm has no such split.

    Step-by-step proof — &a\n: x\nb: *a

    1. parse_node property loop consumes Anchor(&a) on line 1 → set_anchornode_props.has_anchor = Some(&a@line1), has_mapping_anchor = None.
    2. Next scan → MappingValue on line 2.
    3. New arm: node_props.anchor() returns &a (no line check). self.anchors.put("a", first_key=null). Then node_props = default().
    4. parse_block_mapping builds {null:"x", b:*a}. *a resolves to null (the e-node key).
    5. break 'node returns the mapping; post-'node code at 4186 sees node_props.anchor() == None → mapping is not anchored.

    Old behavior: node_props flowed through untouched; post-'node line 4186-4188 anchored &a → {null:"x"} (the mapping). *a resolved to the mapping.

    Codebase's own rule: implicit_key_anchors(implicit_key_line=2) at 3407-3422 with has_anchor@line1, has_mapping_anchor=Nonemystery_anchor.line(1) != 2 → returns {key_anchor: None, mapping_anchor: &a}. So per the parser's own splitting logic, &a belongs to the mapping, not the key. The spec agrees: [193]/[154] place an implicit key's c-ns-properties in BLOCK-KEY context (s-separate-in-line, same line only); a prior-line property is the [200] collection property.

    Step-by-step proof — &outer\n&inner\n: x

    1. Property loop consumes &outer@line1has_anchor=&outer. Then &inner@line2set_anchor (line 3347-3355): previous.line(1) != 2 so has_mapping_anchor = &outer, has_anchor = &inner.
    2. Scan → MappingValue on line 3.
    3. New arm: node_props.anchor() returns &inner only (line 3358 reads has_anchor, never has_mapping_anchor). Anchors &inner → null. Then node_props = NodeProperties::default()wipes has_mapping_anchor=&outer.
    4. break 'node. Post-'node check at 4171: node_props.has_mapping_anchor is now None → no error. &outer is silently dropped.

    Old behavior: node_props preserved → line 4171 fired Err(MultipleAnchors).

    Codebase's own rule: implicit_key_anchors(3) with has_mapping_anchor=&outer, has_anchor=&inner@line2inner.line(2) != 3Err(MultipleAnchors) (line 3392-3394). So both old behavior and the canonical helper reject this; the new code silently accepts it.

    The identical shape applies to tags via has_mapping_tag: !!map\n!!str\n: x previously errored MultipleTags at 4176; now !!map vanishes.

    Why nothing prevents it

    The post-'node checks at 4171-4189 are still there but see a blank node_props. The only test coverage added is the same-line case (&a : x), which is the one shape where .anchor() and implicit_key_anchors() agree.

    Impact

    Regression on valid (if esoteric) YAML, introduced by this PR. (1) &a\n: x — anchor moves from mapping to null key, observable via any later *a. (2) &o\n&i\n: x / !!map\n!!str\n: x — error → silent drop. The PR's stated goal is full spec conformance (402/402); silently changing anchor semantics on valid input and turning a rejected double-property into a silent over-accept both run counter to that.

    Fix

    Replicate the line-based split via node_props.implicit_key_anchors(self.token.line)? (and the tag analogue) exactly as the Scalar/SequenceStart/MappingStart arms do at 3863/3950/4140: apply key_anchor/key_tag to first_key, hold mapping_anchor/mapping_tag to apply to the parse_block_mapping result, and do not blanket-reset node_props (or reset only after extracting has_mapping_*). That preserves the FH7J same-line fix while restoring the prior-line semantics.

Comment thread test/js/bun/yaml/yaml.test.ts Outdated
Phase-1 (auto-indent detection) consumes the FIRST line after the
`|`/`>` header. When that line's spaces+tab terminates the scalar
(spaces ≤ parent block indent), phase-1's `c =>` arm broke without
tainting, and phase-2 immediately returned done() — leaving the next
scan() to read the tab untainted. Only phase-2 (body-loop, lines after
the first) was instrumented before.

Now `if first == 0x09 { tab_after_indent = true }` between phase-1 and
phase-2, mirroring the phase-2 nested-loop additions.
Comment thread src/parsers/yaml.rs Outdated
dylan-conway and others added 2 commits May 28, 2026 13:19
…93] line split

The FH7J fix unconditionally applied node_props to the e-node key,
ignoring whether the property was on the `:`'s line ([193] key
property, BLOCK-KEY = same-line) or a prior line ([200] block-collection
property). Regression: `- &a\n  : x\n- *a` gave `[{null:x}, null]`
instead of `[{null:x}, {null:x}]`; `&outer\n&inner\n: x` accepted
instead of erroring.

Now mirrors the Scalar/SequenceStart/MappingStart arms: the tag is only
the key's when tag_line == colon_line; the anchor split goes through
implicit_key_anchors(colon_line) (key_anchor on first_key, mapping_anchor
on the parse_block_mapping result).

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Additional findings (outside current diff — PR may have been updated during review):

  • 🟡 src/parsers/yaml.rs:5016-5028 — The phase-2 inner-loop 0x0D/0x0A arms just above (yaml.rs:4994, 5003) — and the phase-1 pair at 4898/4908 — still hard-return ParseError::UnexpectedCharacter when a column-0 tab follows an empty line, so "- |\n x\n\t- y\n" throws the new "Tab characters cannot be used as indentation" but adding a blank line ("- |\n x\n\n\t- y\n") throws "Unexpected character" for the identical [184] violation. Diagnostic-only (both inputs are correctly rejected); change those four early-returns to ParseError::TabIndentation, or drop the phase-2 ones since the other arm now handles the tab via taint anyway.

    Extended reasoning...

    What

    Commit 5f2fd629 ("taint tab_after_indent in block-scalar body scanner's terminating-line lookahead") added tab_after_indent tainting to two arms of the phase-2 inner loop in scan_block_scalar — the 0x20 arm at yaml.rs:5019 and the other arm at yaml.rs:5028. The two adjacent 0x0D/0x0A arms in the same inner match (yaml.rs:4994-4996 and 5003-5005) were left unchanged: they still peek at the next byte after an empty line and, if it is a tab, hard-return ParseError::UnexpectedCharacter. The phase-1 leading-empty-line loop has the same pattern at yaml.rs:4898 and 4908.

    How it manifests

    Two inputs that are spec-invalid for the identical reason — a [184] sibling - after a tab in s-indent position — produce different diagnostics depending only on whether a blank line sits between the block-scalar body and the tabbed sibling:

    Input Diagnostic
    YAML.parse("- |\n x\n\t- y\n") Tab characters cannot be used as indentation
    YAML.parse("- |\n x\n\n\t- y\n") Unexpected character

    The PR's own new test at yaml.test.ts:1380-1394 ("rejects tab before sibling that immediately follows block-scalar body") covers the first shape but not the empty-line variant, so the gap is untested.

    Step-by-step trace

    Input A — "- |\n x\n\t- y\n" (no blank line):

    1. Block-scalar header |, phase 1 detects content_indent = 2 from x, phase 2 starts at __c = 'x'.
    2. Outer c => arm consumes x; outer 0x0A arm (4978) enters the inner loop.
    3. Inner loop: nc = self.next() is \t → falls to other arm (5025). PR-added code sets self.tab_after_indent = true, __c = 0x09, breaks inner.
    4. Outer c => arm: line_indent is NONE (set by newline()) < min_indent (2) → body terminates gracefully.
    5. Back in parse_block_sequence, the loop at 2879 sees SequenceEntry with tab_after_indent = trueParseError::TabIndentation.

    Input B — "- |\n x\n\n\t- y\n" (blank line before tab):

    1-2. Same as above through the outer 0x0A arm.
    3. Inner loop: nc = self.next() is \n (the blank line) → inner 0x0A arm (4999). Increments leading_newlines, calls newline(), inc(1).
    4. Line 5003: if Enc::wide(self.next()) == 0x09 { return Err(ParseError::UnexpectedCharacter) }. self.next() is \tfires before the body can terminate. The other arm at 5025 is never reached for this tab.

    Why the existing code doesn't prevent it

    The 0x0D/0x0A arms' tab lookahead is pre-existing code that predates this PR. Before the PR it was necessary — there was no tab_after_indent mechanism, so the only way to reject a column-0 tab after an empty line in block-scalar context was to error inline. After this PR the check is redundant for phase 2: removing it lets the next inner-loop iteration route the tab through the other arm, which now taints and lets the body terminate so the parser-side structural check at 2879/3076/etc. fires with the dedicated diagnostic. The pre-existing code was internally consistent (everything used UnexpectedCharacter); this PR created the inconsistency by updating two of the four arms in this loop and leaving the other two.

    Impact

    Diagnostic-message quality only. Both inputs are correctly rejected; there is no accept/reject behavioral change. The PR's stated goal is "the general spec rule, not three test cases" with a dedicated TabIndentation diagnostic at every s-indent position — leaving four sites in the block-scalar scanner on the old generic error works against that.

    Fix

    Lowest-risk: change ParseError::UnexpectedCharacterParseError::TabIndentation at the four sites (yaml.rs:4898, 4908, 4994, 5004).

    Cleaner (phase 2 only): delete the if Enc::wide(self.next()) == 0x09 { … } lookahead at 4994-4996 and 5003-5005 entirely. The next inner-loop iteration sees the tab, the other arm taints, and the parser-side check produces TabIndentation — exactly matching the no-blank-line path. (Phase 1's pair at 4898/4908 should keep the early return with TabIndentation, since the c => arm there feeds first into the phase-1-exit taint at 4953 only when content_indent has been determined, and the autodetect interaction is subtler.)

    Worth adding expect(() => YAML.parse("- |\n x\n\n\t- y\n")).toThrow(TAB_ERR) (and the \r\n\r\n variant) alongside the existing block-scalar-body cases at yaml.test.ts:1380.

… for post-loop guards

The blanket node_props reset cleared has_mapping_tag/has_mapping_anchor,
masking the post-loop guards at ~4223/4228. `!!str\n!!map\n: x` and
`&a\n&b\n&c : x` were accepted; main + 4/4 refs reject.

Now only `has_anchor` is cleared (the inner anchor, consumed by
implicit_key_anchors as either key_anchor or sole mapping_anchor). The
mapping_anchor and tag fields stay, so the existing overflow guards
fire. This matches main's behavior (which also over-rejects the
spec-valid `&a\n&b : x` 1-prior-+-1-same case via the same guard;
pre-existing).
Comment thread src/parsers/yaml.rs Outdated
Comment thread src/parsers/yaml.rs
Comment thread src/parsers/yaml.rs
…xisting over-reject

The post-loop has_mapping_anchor guard catches both overflow (`&a\n&b\n&c
: x`) and the spec-valid `&outer\n&inner : x` (the Scalar-key analogue is
accepted because that arm `return Ok` bypasses the guard). Pre-existing
on main; comment now says so instead of implying it's overflow-only.
@dylan-conway dylan-conway merged commit d4b3a3e into main May 29, 2026
5 of 7 checks passed
@dylan-conway dylan-conway deleted the claude/yaml-tab-indent branch May 29, 2026 00:09
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants