Skip to content

yaml: clear remaining 3 non-cyclic test.todo entries (BOM strip, 2-anchor e-node key, flow cmi gap)#31569

Merged
dylan-conway merged 14 commits into
mainfrom
claude/yaml-remaining-todos
May 29, 2026
Merged

yaml: clear remaining 3 non-cyclic test.todo entries (BOM strip, 2-anchor e-node key, flow cmi gap)#31569
dylan-conway merged 14 commits into
mainfrom
claude/yaml-remaining-todos

Conversation

@dylan-conway

Copy link
Copy Markdown
Member

What

Clears the last three test.todo entries that are not cyclic-anchor related (those four remain deferred).

Input Before After Spec
---\na: 1 error / spurious BOM-only first doc {a:1} [206] l-document-prefix ::= c-byte-order-mark? l-comment*
&outer\n&inner : x MultipleAnchors {null:"x"} [200] outer anchors collection, [193] inner anchors key (same line as :)
{a: &x\n b: c} {a:{b:"c"}} (over-accept) error [147] flow-map value is ns-flow-node, not a pair

How

  • New Encoding::bom_len() (3 for UTF-8 EF BB BF, 1 for UTF-16 FEFF, 0 for Latin-1). Parser::init starts pos/line_start_pos past it.
  • MappingValue arm clears has_mapping_anchor after implicit_key_anchors consumed it (was reaching the post-loop guard). set_anchor errors instead of silently overwriting on a 3rd anchor.
  • parse_node Scalar arm returns the bare scalar in FlowIn && !flow_pair_allowed instead of falling through to parse_block_mapping; closes the multiline-property gap left by the cmi mechanism.

Tests

7 → 4 todos (cyclic anchors only). 2032 pass / 0 fail / 402/402 yaml-test-suite / 1015/1015 4-parser consensus.

(1) [206] BOM strip in l-document-prefix: new `Encoding::bom_len()`
returns 3 for UTF-8 EF BB BF, 1 for UTF-16 FEFF, 0 for Latin-1.
`Parser::init` starts pos and line_start_pos past the BOM.

(2) `&outer\n&inner : x` (e-node key, valid [200]/[193] split): clear
`has_mapping_anchor` alongside `has_anchor` after implicit_key_anchors
consumed both. `set_anchor` now errors instead of silently overwriting
on a third anchor.

(3) `{a: &x\n b: c}` (multiline-property cmi gap): the Scalar arm
returns the bare scalar in `FlowIn && !flow_pair_allowed` regardless of
the cmi indent comparison; the leftover `:` reaches the [140] check.

7 → 4 todos; the remaining 4 are cyclic anchors (deferred).
@robobun

robobun commented May 29, 2026

Copy link
Copy Markdown
Collaborator
Updated 5:09 PM PT - May 29th, 2026

@dylan-conway, your commit efcfffe has 1 failures in Build #59095 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 31569

That installs a local version of the PR into your bun-31569 executable, so you can run:

bun-31569 --bun

@coderabbitai

coderabbitai Bot commented May 29, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review

Walkthrough

The YAML parser now detects leading BOM/encoding and dispatches accordingly, initializes scanning past the BOM, tightens anchor duplicate/consumption rules, returns flow-map values early when flow pairs are disallowed, validates plain-scalar numbers against the core schema before parsing, and refines # comment termination; tests were enabled/extended.

Changes

YAML Parser Enhancements

Layer / File(s) Summary
BOM detection and parser initialization
src/parsers/yaml.rs
Adds DetectedEncoding and detect_encoding; YAML::parse dispatches to parse_with::<Enc> and transcodes UTF-16 into arena u16 units; rejects UTF-32. Encoding gains bom_len(input: &[Self::Unit]) -> usize; Latin1, Utf8, and Utf16 implement BOM-skipping lengths. Parser::init sets pos/line_start_pos from Enc::bom_len(input). Tests for BOM-prefixed ---, BOM-only input, preservation of mid-stream BOM, byte-level encoding detection, UTF-32 rejection, and odd-length UTF-16 errors were added.
Anchor duplicate detection and consumption cleanup
src/parsers/yaml.rs, test/js/bun/yaml/yaml.test.ts
NodeProperties::set_anchor now errors when a mapping-anchor already exists; after consuming implicit_key_anchors the parser clears has_anchor and has_mapping_anchor. Tests enable two-anchor-before-: acceptance, assert third-anchor "Multiple anchors" error, and add a todo for two-tag analogue.
Flow-map value scalar early-return
src/parsers/yaml.rs, test/js/bun/yaml/yaml.test.ts
When in Context::FlowIn and !opts.flow_pair_allowed, alias/sequence/mapping/scalar parsing for a mapping value returns the bare node immediately so trailing : handling is performed by the caller. Tests asserting Unexpected token for multiline c-ns-properties before nested pairs and expanded flow-seq invalid-context assertions were enabled.
Plain-scalar number lexing and core-schema validation
src/parsers/yaml.rs
Number-resolution loop pre-advance adjusted; consumed lexeme captured and validated with is_core_schema_number (core YAML 1.2 int/float structure) before calling numeric parsers; unsigned/radix parsing uses the validated slice.
# comment termination refinement
src/parsers/yaml.rs
scan_plain_scalar and the general token scanner now treat # as starting a comment only at true line starts or when preceded by whitespace/newline; prior Pos::ZERO/raw-prev special-casing removed.

Possibly Related PRs

  • oven-sh/bun#31203: Related changes to flow_pair_allowed semantics for ?/: parsing in src/parsers/yaml.rs.
  • oven-sh/bun#31561: Overlapping fixes to flow-context parsing and flow-map/key handling.
  • oven-sh/bun#31308: Overlapping anchor/tag parsing changes in src/parsers/yaml.rs.

Suggested reviewers

  • RiskyMH
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the three main changes in the PR: BOM stripping, 2-anchor e-node key handling, and flow mapping multiline-property gap fixes.
Description check ✅ Passed The description follows the template with a detailed 'What' section explaining the three fixes and a 'How' section with implementation details, plus test results.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/parsers/yaml.rs`:
- Around line 2385-2391: The parser initializes pos and line_start_pos at a
BOM-aware start (Pos::from(Enc::bom_len(input))) but subsequent scanner logic
still treats only Pos::ZERO as "start of input", so comments immediately after a
BOM aren't recognized; update any checks that compare positions to Pos::ZERO to
instead compare against the initialized start (e.g., self.line_start_pos or the
same Pos value created by Pos::from(Enc::bom_len(input))) so functions/methods
that detect leading comments (the scanner that looks for '#' at start) treat the
BOM-offset start as the start-of-input; ensure all places using Pos::ZERO for
start detection (in scanner/comment parsing routines) are changed to use
self.line_start_pos or the computed start symbol.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 68cda9e9-35af-4bbf-b459-e676a9e3b9ef

📥 Commits

Reviewing files that changed from the base of the PR and between 9d00056 and 6f088e2.

📒 Files selected for processing (2)
  • src/parsers/yaml.rs
  • test/js/bun/yaml/yaml.test.ts

Comment thread src/parsers/yaml.rs
Comment thread src/parsers/yaml.rs
Comment thread src/parsers/yaml.rs
Comment thread src/parsers/yaml.rs
YAML::parse now applies the spec §5.2 byte-pattern matrix on byte input
(Buffer/Uint8Array): detects UTF-8/UTF-16LE/UTF-16BE (with or without
BOM) and routes to the matching Parser<Enc>; UTF-32 errors with a clear
message. The transcoded UTF-16 buffer is arena-allocated so it shares
the Expr tree's lifetime.

Also: the BOM-aware `Parser::init` made the two `#` arms' `Pos::ZERO`
start-of-input checks stale (`<BOM># comment\na: 1` errored). Both now
use `is_at_line_start()` which is BOM-aware.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/parsers/yaml.rs`:
- Around line 78-86: The UTF-16 parser currently treats Pos (and token.start) as
indexes into the &[u16] units but feeds them directly into
Pos::loc()/bun_ast::Loc (used by Log::add_error), which expects byte offsets;
convert UTF-16 unit indexes back to byte offsets (unit_index * 2, accounting for
endianness if needed) before constructing bun_ast::Loc or passing offsets to
Log::add_error so reported locations match the original byte positions; update
the Pos::loc() path and the places around the Utf16 parsing code (the loop
producing `units` and the subsequent call to Self::parse_with::<Utf16>(...), and
the similar code at the 110-112 region) to perform this conversion for both
self.pos and token.start when creating Locs.
- Around line 38-50: detect_encoding uses clippy-triggering patterns for UTF-16
BOM and odd-length checks; update the match arms in detect_encoding (and any
helper closure b) to use direct tuple patterns for BOMs (e.g. match (0xFE, 0xFF,
..) and (0xFF, 0xFE, ..) or explicit 2-byte header tuples) instead of `%`-based
patterns, and replace any modulus-based odd/even length guards with
bytes.len().is_multiple_of(2) (or !is_multiple_of(2) as appropriate) to
determine UTF-16 vs UTF-8 paths, keeping the DetectedEncoding variants (Utf16Be,
Utf16Le, Utf32Be, Utf32Le, Utf8) unchanged.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: c4791bc9-c39f-4146-8dc3-11c0b9cb01b3

📥 Commits

Reviewing files that changed from the base of the PR and between 8687f17 and 544b779.

📒 Files selected for processing (2)
  • src/parsers/yaml.rs
  • test/js/bun/yaml/yaml.test.ts

Comment thread src/parsers/yaml.rs Outdated
Comment thread src/parsers/yaml.rs Outdated
dylan-conway and others added 4 commits May 29, 2026 11:00
…rt/MapStart; tests + tag-analogue todo

The cmi-gap fix in the Scalar arm (`FlowIn && !flow_pair_allowed → return
bare`) is now mirrored to the Alias/SequenceStart/MappingStart arms so
`{a: &x\n [b]: c}` etc. error instead of forming nested mappings.

Adds test for the 3rd-anchor overflow guard, and a test.todo for the tag
analogue (set_tag lacks the same overflow guard; has_mapping_tag isn't
cleared in MappingValue arm).
The number-scanning loop accepted `+`/`-`/`e`/`.` at any position, then
`wtf::parse_double` prefix-parsed — so `1+1`, `++1`, `1e`, `1e1.5` all
resolved as numbers (1, 1, 1, 10). Now `is_core_schema_number` validates
the consumed slice against the [10.2.1.4] regex before calling
parse_double; non-matching scalars stay as strings.

Also fixes pre-existing `-.5`/`+.5` → string (the loop's `decimal`
pre-check + no-inc-for-sign meant the `.` was seen as a second decimal).

41-case test matrix.
…hunt

NUL truncation, C0/C1/DEL accepted, CRLF quoted-fold, verbatim/named tag
resolution, tag-on-quoted coercion, !!int/!!float validation, surrogate-
pair escapes, s-separate after tag, directive validation, 1024-char
implicit-key limit, error position discarded, merge-key order, alias
with prior-line property, block-scalar header whitespace.

Each todo asserts the spec-correct result with a [N] citation.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

♻️ Duplicate comments (2)
src/parsers/yaml.rs (2)

78-86: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Preserve byte offsets for UTF-16 parser positions.

This path transcodes the input into u16 units, but the parser still later feeds Pos directly into bun_ast::Loc. For Utf16, Pos counts code units while Loc is indexed into the original byte source, so syntax errors and Expr locations drift after the first code unit. Please convert UTF-16 positions back to byte offsets before constructing Locs, or carry a byte-offset map alongside the transcoded buffer.

Also applies to: 99-112

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/parsers/yaml.rs` around lines 78 - 86, The UTF-16 branch transcodes bytes
into u16 units (units, bumped buffer) and calls
Self::parse_with::<Utf16>(source, units, log, bump) but leaves Pos (which counts
UTF-16 code units) to be used directly for bun_ast::Loc (byte offsets), causing
drifting locations; fix by creating and passing a mapping from each UTF-16
code-unit index back to the original byte offset (or by converting Pos values to
byte offsets) before constructing Loc instances in the Utf16 parser path—update
the code paths around parse_with::<Utf16>, any Pos-to-Loc conversion sites, and
the similar block at the other occurrence (the block around lines 99-112) so Loc
is always built from original byte offsets rather than raw code-unit indices.

46-49: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix the redundant wildcard in detect_encoding.

Line 48 still trips Clippy's unneeded_pattern, so CI stays red. .. already covers the remaining tuple slots here; use (0x00, ..) instead.

♻️ Minimal fix
-        (0x00, _, ..) => Utf16Be,
+        (0x00, ..) => Utf16Be,
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/parsers/yaml.rs` around lines 46 - 49, In detect_encoding, remove the
redundant wildcard in the tuple patterns that tripped Clippy: replace (0x00, _,
..) => Utf16Be with (0x00, ..) => Utf16Be, and similarly replace (_, 0x00, ..)
=> Utf16Le with (.., 0x00) => Utf16Le so the patterns don't include an
unnecessary `_` and Clippy's unneeded_pattern warning is resolved.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/js/bun/yaml/yaml.test.ts`:
- Around line 1831-1833: The test is using Math.abs(Bun.YAML.parse(input)) which
masks sign regressions; change the assertion to check the exact numeric value
returned by Bun.YAML.parse for each input. Update the test.each from a
single-array of inputs to either an array of [input, expected] tuples or compute
expected = input.startsWith("-") ? -Infinity : Infinity, then assert
expect(Bun.YAML.parse(input)).toBe(expected) (referencing the existing test.each
and Bun.YAML.parse).

---

Duplicate comments:
In `@src/parsers/yaml.rs`:
- Around line 78-86: The UTF-16 branch transcodes bytes into u16 units (units,
bumped buffer) and calls Self::parse_with::<Utf16>(source, units, log, bump) but
leaves Pos (which counts UTF-16 code units) to be used directly for bun_ast::Loc
(byte offsets), causing drifting locations; fix by creating and passing a
mapping from each UTF-16 code-unit index back to the original byte offset (or by
converting Pos values to byte offsets) before constructing Loc instances in the
Utf16 parser path—update the code paths around parse_with::<Utf16>, any
Pos-to-Loc conversion sites, and the similar block at the other occurrence (the
block around lines 99-112) so Loc is always built from original byte offsets
rather than raw code-unit indices.
- Around line 46-49: In detect_encoding, remove the redundant wildcard in the
tuple patterns that tripped Clippy: replace (0x00, _, ..) => Utf16Be with (0x00,
..) => Utf16Be, and similarly replace (_, 0x00, ..) => Utf16Le with (.., 0x00)
=> Utf16Le so the patterns don't include an unnecessary `_` and Clippy's
unneeded_pattern warning is resolved.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 00bc4b38-3e40-4f58-b547-a53a8a52554a

📥 Commits

Reviewing files that changed from the base of the PR and between 6187d11 and 86ec4aa.

📒 Files selected for processing (2)
  • src/parsers/yaml.rs
  • test/js/bun/yaml/yaml.test.ts

Comment thread test/js/bun/yaml/yaml.test.ts Outdated
Comment thread test/js/bun/yaml/yaml.test.ts Outdated
Comment thread test/js/bun/yaml/yaml.test.ts Outdated
Comment thread src/parsers/yaml.rs Outdated

/// Spec §5.2 byte-order-mark / null-byte encoding detection.
#[derive(Clone, Copy, PartialEq, Eq)]
enum DetectedEncoding {

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should go in a bun_core::bom module

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed — though detect_encoding was just removed in 12c16e5 (Dylan flagged that Parser<Utf16> has never been exercised, so shipping it or erroring on detected UTF-16 byte input both trade one untested behavior for another). Only the UTF-8 BOM strip in Parser::init remains.

When §5.2 detection ships properly (after a full Parser<Utf16> test sweep), bun_core::bom is the right home so JSON5/TOML/JSONC can share it.

`Parser<Utf16>` has never been exercised; introducing it (or erroring on
detected UTF-16/32 byte input) trades one untested behavior for another.
Restore main's `YAML::parse` exactly: `Parser<Utf8>` only. The
`Encoding` trait/`bom_len()` infrastructure stays so a UTF-16 path can
be added later behind a full test sweep, but no detection runs today.
Comment thread test/js/bun/yaml/yaml.test.ts Outdated
Comment thread test/js/bun/yaml/yaml.test.ts Outdated
Comment thread test/js/bun/yaml/yaml.test.ts Outdated
Comment thread src/parsers/yaml.rs
Comment thread test/js/bun/yaml/yaml.test.ts
Comment thread src/parsers/yaml.rs
Comment on lines +2447 to +2453
// [206] l-document-prefix ::= c-byte-order-mark? l-comment*
let start = Pos::from(Enc::bom_len(input));
Self {
input,
bump,
pos: Pos::from(0),
line_start_pos: Pos::from(0),
pos: start,
line_start_pos: start,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟣 🟣 Pre-existing, non-blocking: this strips the BOM at stream-start only, but [211] l-yaml-stream admits l-document-prefix (with c-byte-order-mark?) before each document — and lists c-byte-order-mark directly in its trailing alternation — so per §5.2 "byte order marks may appear at the start of any document". a: 1\n...\n---\nb: 2 (concatenated UTF-8-with-BOM files) won't have the second-document BOM stripped; the U+FEFF reaches scan_plain_scalar and --- parses as content instead of c-byte-order-mark + c-directives-end. Also makes the new test comment at yaml.test.ts:1753 ("only [206] document-prefix") slightly imprecise — only the first [206] instance is handled. Strictly improved vs main; worth a test.todo alongside the §5.2 deferral already discussed in the thread.

Extended reasoning...

What

Parser::init (yaml.rs:2447-2453) strips a leading BOM by setting pos/line_start_pos to Pos::from(Enc::bom_len(input)), citing [206] l-document-prefix ::= c-byte-order-mark? l-comment*. This handles the BOM at byte 0 only. But [206] is invoked from [211] l-yaml-stream:

l-yaml-stream ::= l-document-prefix* l-any-document?
                  ( ( l-document-suffix+ l-document-prefix* l-any-document? )
                    | c-byte-order-mark | l-comment | l-explicit-document )*

so l-document-prefix (with its optional c-byte-order-mark) recurs after each l-document-suffix+ (...), and c-byte-order-mark is also a direct alternative in the trailing repetition. §5.2 says this explicitly: "byte order marks may appear at the start of any document". The PR implements only the first [206] instance at stream-start.

Code path

Enc::bom_len() is called exactly once, in Parser::init. The document loop (parse()parse_stream / next-document re-entry) does not re-check for a BOM after DocumentEnd / .... There is no 0xFEFF / 0xEF arm in scan() that would consume a mid-stream BOM at a document boundary; after ...\n, the next scan() reads 0xEF (UTF-8) or 0xFEFF (UTF-16), which falls to the catch-all _ arm → scan_plain_scalar, so --- lexes as a plain scalar rather than c-byte-order-mark followed by c-directives-end.

Step-by-step proof

Input (UTF-8): a: 1\n...\n---\nb: 2 i.e. 61 3A 20 31 0A 2E 2E 2E 0A EF BB BF 2D 2D 2D 0A 62 3A 20 32.

  1. Parser::init: Enc::bom_len([0x61, …]) = 0 (input does not start with EF BB BF), so pos = line_start_pos = 0. No BOM stripped.
  2. First document parses as {a: 1}; scanner reaches ...DocumentEnd; pos advances past the trailing \n to byte 9.
  3. Next-document scan: self.next() at pos=9 returns 0xEF. The scan() dispatch has no arm for 0xEF/BOM bytes, so it falls to the _ catch-all → scan_plain_scalar.
  4. scan_plain_scalar consumes EF BB BF 2D 2D 2D as scalar content (BOM is nb-char, - is ns-plain-safe). is_at_line_start() is not consulted for --- recognition here because the scanner already committed to the plain-scalar path on the first byte; even if it were, pos(9) != line_start_pos(9) only at the moment - is reached after the BOM advanced pos.
  5. Result: the second document's --- is treated as content, not as c-byte-order-mark? + c-directives-end. The BOM is not stripped per [211]/[206].

This is the realistic shape of cat doc1.yaml doc2.yaml where each was saved by an editor that emits a UTF-8 BOM, with ... terminating the first.

Why nothing else catches it

bom_len() is only invoked from init — there is no per-document hook. The new test at yaml.test.ts:1753 documents the opposite (BOM mid-content is preserved as scalar text), which is correct for U+FEFF inside a node, but the document-suffix→document-prefix boundary is a different grammar position where [211] explicitly admits a BOM. The test comment "BOM mid-stream is not stripped (only [206] document-prefix)" is itself slightly imprecise: [206] l-document-prefix can recur per [211]; what's actually implemented is "only the stream-start instance of [206]".

Impact

Obscure (multi-document streams with per-document BOM, e.g. concatenated UTF-8-with-BOM files), and strictly improved vs main — before this PR no BOM was stripped at all. Not a regression. The §5.2 encoding-detection deferral already discussed in the thread (Jarred/Dylan re bun_core::bom, commit 12c16e5) is about UTF-16/UTF-32 detection, not per-document BOM at [211] boundaries, so this isn't already covered by that follow-up. This is the immediate [211] sibling of the [206] rule the PR cites at yaml.rs:2447 — same pattern as the other pre-existing siblings (Alias/SequenceStart/MappingStart [147] mirrors, set_tag analogue) already filed and addressed on this PR.

Fix

When the document loop sees DocumentEnd (or before scanning the next document's first token), check Enc::bom_len(&self.input[self.pos.cast()..]) and advance pos/line_start_pos past it — or add a 0xFEFF / EF BB BF arm in scan() that, at is_at_line_start() between documents, consumes the BOM and continues. Either way, also tighten the test comment at yaml.test.ts:1753 to "only the stream-start [206] instance", and add a test.todo for YAML.parse("a: 1\n...\n---\nb: 2", { all: true }) (or equivalent multi-doc API) alongside the existing §5.2 deferral.

… stale comments

- `-0x1f`/`-0o17` (pre-existing on int path, not the float gate) → test.todo
  asserting the spec value, not codifying current -31
- `Bun.YAML.parse` → imported `YAML.parse` for consistency
- Drop "Pre-existing on main" from un-todo'd test comment
- Update string_is_number doc to reflect is_core_schema_number gate
Comment on lines +903 to +910
/// This mirrors the rules in `src/parsers/yaml.rs`'s `try_resolve_number` /
/// `is_core_schema_number`:
/// - Optional leading sign, optionally followed by `.inf`/`.Inf`/`.INF`.
/// - Otherwise either an integer (`[0-9]+` / `0x…` / `0o…`) or a float
/// matching §10.2.1.4 `[-+]? ( . [0-9]+ | [0-9]+ ( . [0-9]* )? ) ([eE][-+]?[0-9]+)?`.
/// The parser-side gate now rejects non-conforming float-like tokens
/// (e.g. `"1+5"`, `"1e"`, `"."`) so this mirror should err on the side of
/// *quoting* whenever a token *might* parse as a number.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Nit: the doc-comment update (per #3326791810) dropped the stale embedded-+/- clauses, but the two inline // Parser rule: comments in the body at :998 and :1004 still cite the pre-PR rules ("'+' accepted unless we're in hex base", "at most one '-' after the leading sign") — the parser's is_core_schema_number gate now rejects any embedded +/-, so these contradict the freshly-updated header. No functional impact (over-quoting is safe); just drop the Parser rule: prefix or reword to e.g. // Over-conservative: parser now rejects embedded +/-; quoting is still safe.

Extended reasoning...

What

This PR updated string_is_number's doc comment (lines 903-910) in response to inline comment #3326791810 — replacing the stale "the parser accepts any number of + after the leading sign … and at most one additional - … e.g. \"1+5\" round-trips to 1" with a correct reference to the new is_core_schema_number gate. However, the two inline comments inside the function body were not touched and still carry the exact claims that were just removed from the header:

  • Line 998 (0x2b arm): // Parser rule: '+' accepted unless we're in hex base.
  • Line 1004 (0x2d arm): // Parser rule: at most one '-' after the leading sign.

After this PR, try_resolve_number is gated by is_core_schema_number (yaml.rs:1670), which rejects any +/- outside the single optional-sign-after-e position. There is no longer a "+ accepted unless hex" rule or an "at most one additional -" rule on the parser side — those were the pre-PR lexer-loop rules at yaml.rs:1624-1643, which still run but whose acceptances are now overridden by the validator. The new doc comment three lines above says exactly this: "The parser-side gate now rejects non-conforming float-like tokens (e.g. \"1+5\" …)" — which the body comments directly contradict.

Step-by-step proof

  1. Take input "1+5". Per the body comment at :998, the "Parser rule" is that + is accepted (not in hex base) — implying YAML.parse("1+5") resolves as a number.
  2. Per the doc comment at :908-909 (updated in this PR), the parser-side gate "now rejects … \"1+5\"" — implying it resolves as a string.
  3. Per the new test added in this PR (["1+1", "1+1"] in the "resolves as string" table), the doc comment is right and the body comment is wrong.
  4. Same for "1-1" vs the "at most one -" comment at :1004.

So a reader of the body sees a Parser rule: attribution that the function's own header, and the test suite, now say is false.

Addressing the refutation

One verifier argued these comments describe what the body code does, not net parser behavior, and that the updated doc comment already frames the body as intentionally over-conservative. The second half is fair (and is why this is a nit, not a bug). But the first half doesn't hold: the comments are not phrased as "accept + unless hex" — they are phrased as "Parser rule: + accepted unless hex". The Parser rule: prefix is an explicit attribution to what YAML.parse does; the function's stated job (line 901) is "returns true when str would be parsed back as a number by YAML.parse", and these comments explain which parser rule each arm mirrors. That attribution is now stale. The doc-comment fix in this PR was made for precisely this reason; these two lines are the residue of that same fix.

The refutation also notes the body comments are pre-existing and untouched by the PR. True — but the PR did touch this function (the doc comment at 903-910 is in the diff) specifically to remove these stale claims, and missed the two places in the body where the same claims are repeated. That's an incomplete application of an accepted fix, not an out-of-scope pre-existing issue.

Impact

Zero functional impact. string_is_number still returns true for "1+5" / "1-1", so YAML.stringify over-quotes them — which is round-trip-safe, exactly as the new doc comment notes. The only cost is that the next reader of these arms is told the parser does something it no longer does.

Fix

Drop or reword the Parser rule: prefix at :998 and :1004, e.g.:

0x2b /* '+' */ => {
    // Over-conservative: parser now rejects embedded '+'; quoting is safe.
    if base == Base::Hex { return false; }
}
0x2d /* '-' */ => {
    // Over-conservative: parser now rejects embedded '-'; quoting is safe.
    if saw_minus_after_sign { return false; }
    saw_minus_after_sign = true;
}

(or simply delete the two comment lines — the doc comment already covers the over-conservative framing).

@dylan-conway dylan-conway merged commit 7069b30 into main May 29, 2026
74 of 76 checks passed
@dylan-conway dylan-conway deleted the claude/yaml-remaining-todos branch May 29, 2026 22:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants