yaml: clear remaining 3 non-cyclic test.todo entries (BOM strip, 2-anchor e-node key, flow cmi gap)#31569
Conversation
(1) [206] BOM strip in l-document-prefix: new `Encoding::bom_len()`
returns 3 for UTF-8 EF BB BF, 1 for UTF-16 FEFF, 0 for Latin-1.
`Parser::init` starts pos and line_start_pos past the BOM.
(2) `&outer\n&inner : x` (e-node key, valid [200]/[193] split): clear
`has_mapping_anchor` alongside `has_anchor` after implicit_key_anchors
consumed both. `set_anchor` now errors instead of silently overwriting
on a third anchor.
(3) `{a: &x\n b: c}` (multiline-property cmi gap): the Scalar arm
returns the bare scalar in `FlowIn && !flow_pair_allowed` regardless of
the cmi indent comparison; the leftover `:` reaches the [140] check.
7 → 4 todos; the remaining 4 are cyclic anchors (deferred).
|
Updated 5:09 PM PT - May 29th, 2026
❌ @dylan-conway, your commit efcfffe has 1 failures in
🧪 To try this PR locally: bunx bun-pr 31569That installs a local version of the PR into your bun-31569 --bun |
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
WalkthroughThe YAML parser now detects leading BOM/encoding and dispatches accordingly, initializes scanning past the BOM, tightens anchor duplicate/consumption rules, returns flow-map values early when flow pairs are disallowed, validates plain-scalar numbers against the core schema before parsing, and refines ChangesYAML Parser Enhancements
Possibly Related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/parsers/yaml.rs`:
- Around line 2385-2391: The parser initializes pos and line_start_pos at a
BOM-aware start (Pos::from(Enc::bom_len(input))) but subsequent scanner logic
still treats only Pos::ZERO as "start of input", so comments immediately after a
BOM aren't recognized; update any checks that compare positions to Pos::ZERO to
instead compare against the initialized start (e.g., self.line_start_pos or the
same Pos value created by Pos::from(Enc::bom_len(input))) so functions/methods
that detect leading comments (the scanner that looks for '#' at start) treat the
BOM-offset start as the start-of-input; ensure all places using Pos::ZERO for
start detection (in scanner/comment parsing routines) are changed to use
self.line_start_pos or the computed start symbol.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 68cda9e9-35af-4bbf-b459-e676a9e3b9ef
📒 Files selected for processing (2)
src/parsers/yaml.rstest/js/bun/yaml/yaml.test.ts
YAML::parse now applies the spec §5.2 byte-pattern matrix on byte input (Buffer/Uint8Array): detects UTF-8/UTF-16LE/UTF-16BE (with or without BOM) and routes to the matching Parser<Enc>; UTF-32 errors with a clear message. The transcoded UTF-16 buffer is arena-allocated so it shares the Expr tree's lifetime. Also: the BOM-aware `Parser::init` made the two `#` arms' `Pos::ZERO` start-of-input checks stale (`<BOM># comment\na: 1` errored). Both now use `is_at_line_start()` which is BOM-aware.
There was a problem hiding this comment.
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/parsers/yaml.rs`:
- Around line 78-86: The UTF-16 parser currently treats Pos (and token.start) as
indexes into the &[u16] units but feeds them directly into
Pos::loc()/bun_ast::Loc (used by Log::add_error), which expects byte offsets;
convert UTF-16 unit indexes back to byte offsets (unit_index * 2, accounting for
endianness if needed) before constructing bun_ast::Loc or passing offsets to
Log::add_error so reported locations match the original byte positions; update
the Pos::loc() path and the places around the Utf16 parsing code (the loop
producing `units` and the subsequent call to Self::parse_with::<Utf16>(...), and
the similar code at the 110-112 region) to perform this conversion for both
self.pos and token.start when creating Locs.
- Around line 38-50: detect_encoding uses clippy-triggering patterns for UTF-16
BOM and odd-length checks; update the match arms in detect_encoding (and any
helper closure b) to use direct tuple patterns for BOMs (e.g. match (0xFE, 0xFF,
..) and (0xFF, 0xFE, ..) or explicit 2-byte header tuples) instead of `%`-based
patterns, and replace any modulus-based odd/even length guards with
bytes.len().is_multiple_of(2) (or !is_multiple_of(2) as appropriate) to
determine UTF-16 vs UTF-8 paths, keeping the DetectedEncoding variants (Utf16Be,
Utf16Le, Utf32Be, Utf32Le, Utf8) unchanged.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: c4791bc9-c39f-4146-8dc3-11c0b9cb01b3
📒 Files selected for processing (2)
src/parsers/yaml.rstest/js/bun/yaml/yaml.test.ts
…rt/MapStart; tests + tag-analogue todo
The cmi-gap fix in the Scalar arm (`FlowIn && !flow_pair_allowed → return
bare`) is now mirrored to the Alias/SequenceStart/MappingStart arms so
`{a: &x\n [b]: c}` etc. error instead of forming nested mappings.
Adds test for the 3rd-anchor overflow guard, and a test.todo for the tag
analogue (set_tag lacks the same overflow guard; has_mapping_tag isn't
cleared in MappingValue arm).
The number-scanning loop accepted `+`/`-`/`e`/`.` at any position, then `wtf::parse_double` prefix-parsed — so `1+1`, `++1`, `1e`, `1e1.5` all resolved as numbers (1, 1, 1, 10). Now `is_core_schema_number` validates the consumed slice against the [10.2.1.4] regex before calling parse_double; non-matching scalars stay as strings. Also fixes pre-existing `-.5`/`+.5` → string (the loop's `decimal` pre-check + no-inc-for-sign meant the `.` was seen as a second decimal). 41-case test matrix.
…hunt NUL truncation, C0/C1/DEL accepted, CRLF quoted-fold, verbatim/named tag resolution, tag-on-quoted coercion, !!int/!!float validation, surrogate- pair escapes, s-separate after tag, directive validation, 1024-char implicit-key limit, error position discarded, merge-key order, alias with prior-line property, block-scalar header whitespace. Each todo asserts the spec-correct result with a [N] citation.
There was a problem hiding this comment.
Actionable comments posted: 1
♻️ Duplicate comments (2)
src/parsers/yaml.rs (2)
78-86:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftPreserve byte offsets for UTF-16 parser positions.
This path transcodes the input into
u16units, but the parser still later feedsPosdirectly intobun_ast::Loc. ForUtf16,Poscounts code units whileLocis indexed into the original bytesource, so syntax errors andExprlocations drift after the first code unit. Please convert UTF-16 positions back to byte offsets before constructingLocs, or carry a byte-offset map alongside the transcoded buffer.Also applies to: 99-112
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/parsers/yaml.rs` around lines 78 - 86, The UTF-16 branch transcodes bytes into u16 units (units, bumped buffer) and calls Self::parse_with::<Utf16>(source, units, log, bump) but leaves Pos (which counts UTF-16 code units) to be used directly for bun_ast::Loc (byte offsets), causing drifting locations; fix by creating and passing a mapping from each UTF-16 code-unit index back to the original byte offset (or by converting Pos values to byte offsets) before constructing Loc instances in the Utf16 parser path—update the code paths around parse_with::<Utf16>, any Pos-to-Loc conversion sites, and the similar block at the other occurrence (the block around lines 99-112) so Loc is always built from original byte offsets rather than raw code-unit indices.
46-49:⚠️ Potential issue | 🟠 Major | ⚡ Quick winFix the redundant wildcard in
detect_encoding.Line 48 still trips Clippy's
unneeded_pattern, so CI stays red...already covers the remaining tuple slots here; use(0x00, ..)instead.♻️ Minimal fix
- (0x00, _, ..) => Utf16Be, + (0x00, ..) => Utf16Be,🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/parsers/yaml.rs` around lines 46 - 49, In detect_encoding, remove the redundant wildcard in the tuple patterns that tripped Clippy: replace (0x00, _, ..) => Utf16Be with (0x00, ..) => Utf16Be, and similarly replace (_, 0x00, ..) => Utf16Le with (.., 0x00) => Utf16Le so the patterns don't include an unnecessary `_` and Clippy's unneeded_pattern warning is resolved.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/js/bun/yaml/yaml.test.ts`:
- Around line 1831-1833: The test is using Math.abs(Bun.YAML.parse(input)) which
masks sign regressions; change the assertion to check the exact numeric value
returned by Bun.YAML.parse for each input. Update the test.each from a
single-array of inputs to either an array of [input, expected] tuples or compute
expected = input.startsWith("-") ? -Infinity : Infinity, then assert
expect(Bun.YAML.parse(input)).toBe(expected) (referencing the existing test.each
and Bun.YAML.parse).
---
Duplicate comments:
In `@src/parsers/yaml.rs`:
- Around line 78-86: The UTF-16 branch transcodes bytes into u16 units (units,
bumped buffer) and calls Self::parse_with::<Utf16>(source, units, log, bump) but
leaves Pos (which counts UTF-16 code units) to be used directly for bun_ast::Loc
(byte offsets), causing drifting locations; fix by creating and passing a
mapping from each UTF-16 code-unit index back to the original byte offset (or by
converting Pos values to byte offsets) before constructing Loc instances in the
Utf16 parser path—update the code paths around parse_with::<Utf16>, any
Pos-to-Loc conversion sites, and the similar block at the other occurrence (the
block around lines 99-112) so Loc is always built from original byte offsets
rather than raw code-unit indices.
- Around line 46-49: In detect_encoding, remove the redundant wildcard in the
tuple patterns that tripped Clippy: replace (0x00, _, ..) => Utf16Be with (0x00,
..) => Utf16Be, and similarly replace (_, 0x00, ..) => Utf16Le with (.., 0x00)
=> Utf16Le so the patterns don't include an unnecessary `_` and Clippy's
unneeded_pattern warning is resolved.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 00bc4b38-3e40-4f58-b547-a53a8a52554a
📒 Files selected for processing (2)
src/parsers/yaml.rstest/js/bun/yaml/yaml.test.ts
|
|
||
| /// Spec §5.2 byte-order-mark / null-byte encoding detection. | ||
| #[derive(Clone, Copy, PartialEq, Eq)] | ||
| enum DetectedEncoding { |
There was a problem hiding this comment.
This should go in a bun_core::bom module
There was a problem hiding this comment.
Agreed — though detect_encoding was just removed in 12c16e5 (Dylan flagged that Parser<Utf16> has never been exercised, so shipping it or erroring on detected UTF-16 byte input both trade one untested behavior for another). Only the UTF-8 BOM strip in Parser::init remains.
When §5.2 detection ships properly (after a full Parser<Utf16> test sweep), bun_core::bom is the right home so JSON5/TOML/JSONC can share it.
`Parser<Utf16>` has never been exercised; introducing it (or erroring on detected UTF-16/32 byte input) trades one untested behavior for another. Restore main's `YAML::parse` exactly: `Parser<Utf8>` only. The `Encoding` trait/`bom_len()` infrastructure stays so a UTF-16 path can be added later behind a full test sweep, but no detection runs today.
| // [206] l-document-prefix ::= c-byte-order-mark? l-comment* | ||
| let start = Pos::from(Enc::bom_len(input)); | ||
| Self { | ||
| input, | ||
| bump, | ||
| pos: Pos::from(0), | ||
| line_start_pos: Pos::from(0), | ||
| pos: start, | ||
| line_start_pos: start, |
There was a problem hiding this comment.
🟣 🟣 Pre-existing, non-blocking: this strips the BOM at stream-start only, but [211] l-yaml-stream admits l-document-prefix (with c-byte-order-mark?) before each document — and lists c-byte-order-mark directly in its trailing alternation — so per §5.2 "byte order marks may appear at the start of any document". a: 1\n...\n---\nb: 2 (concatenated UTF-8-with-BOM files) won't have the second-document BOM stripped; the U+FEFF reaches scan_plain_scalar and --- parses as content instead of c-byte-order-mark + c-directives-end. Also makes the new test comment at yaml.test.ts:1753 ("only [206] document-prefix") slightly imprecise — only the first [206] instance is handled. Strictly improved vs main; worth a test.todo alongside the §5.2 deferral already discussed in the thread.
Extended reasoning...
What
Parser::init (yaml.rs:2447-2453) strips a leading BOM by setting pos/line_start_pos to Pos::from(Enc::bom_len(input)), citing [206] l-document-prefix ::= c-byte-order-mark? l-comment*. This handles the BOM at byte 0 only. But [206] is invoked from [211] l-yaml-stream:
l-yaml-stream ::= l-document-prefix* l-any-document?
( ( l-document-suffix+ l-document-prefix* l-any-document? )
| c-byte-order-mark | l-comment | l-explicit-document )*
so l-document-prefix (with its optional c-byte-order-mark) recurs after each l-document-suffix+ (...), and c-byte-order-mark is also a direct alternative in the trailing repetition. §5.2 says this explicitly: "byte order marks may appear at the start of any document". The PR implements only the first [206] instance at stream-start.
Code path
Enc::bom_len() is called exactly once, in Parser::init. The document loop (parse() → parse_stream / next-document re-entry) does not re-check for a BOM after DocumentEnd / .... There is no 0xFEFF / 0xEF arm in scan() that would consume a mid-stream BOM at a document boundary; after ...\n, the next scan() reads 0xEF (UTF-8) or 0xFEFF (UTF-16), which falls to the catch-all _ arm → scan_plain_scalar, so --- lexes as a plain scalar rather than c-byte-order-mark followed by c-directives-end.
Step-by-step proof
Input (UTF-8): a: 1\n...\n---\nb: 2 i.e. 61 3A 20 31 0A 2E 2E 2E 0A EF BB BF 2D 2D 2D 0A 62 3A 20 32.
Parser::init:Enc::bom_len([0x61, …]) = 0(input does not start withEF BB BF), sopos = line_start_pos = 0. No BOM stripped.- First document parses as
{a: 1}; scanner reaches...→DocumentEnd;posadvances past the trailing\nto byte 9. - Next-document scan:
self.next()atpos=9returns0xEF. Thescan()dispatch has no arm for0xEF/BOM bytes, so it falls to the_catch-all →scan_plain_scalar. scan_plain_scalarconsumesEF BB BF 2D 2D 2Das scalar content (BOM isnb-char,-isns-plain-safe).is_at_line_start()is not consulted for---recognition here because the scanner already committed to the plain-scalar path on the first byte; even if it were,pos(9) != line_start_pos(9)only at the moment-is reached after the BOM advancedpos.- Result: the second document's
---is treated as content, not asc-byte-order-mark?+c-directives-end. The BOM is not stripped per [211]/[206].
This is the realistic shape of cat doc1.yaml doc2.yaml where each was saved by an editor that emits a UTF-8 BOM, with ... terminating the first.
Why nothing else catches it
bom_len() is only invoked from init — there is no per-document hook. The new test at yaml.test.ts:1753 documents the opposite (BOM mid-content is preserved as scalar text), which is correct for U+FEFF inside a node, but the document-suffix→document-prefix boundary is a different grammar position where [211] explicitly admits a BOM. The test comment "BOM mid-stream is not stripped (only [206] document-prefix)" is itself slightly imprecise: [206] l-document-prefix can recur per [211]; what's actually implemented is "only the stream-start instance of [206]".
Impact
Obscure (multi-document streams with per-document BOM, e.g. concatenated UTF-8-with-BOM files), and strictly improved vs main — before this PR no BOM was stripped at all. Not a regression. The §5.2 encoding-detection deferral already discussed in the thread (Jarred/Dylan re bun_core::bom, commit 12c16e5) is about UTF-16/UTF-32 detection, not per-document BOM at [211] boundaries, so this isn't already covered by that follow-up. This is the immediate [211] sibling of the [206] rule the PR cites at yaml.rs:2447 — same pattern as the other pre-existing siblings (Alias/SequenceStart/MappingStart [147] mirrors, set_tag analogue) already filed and addressed on this PR.
Fix
When the document loop sees DocumentEnd (or before scanning the next document's first token), check Enc::bom_len(&self.input[self.pos.cast()..]) and advance pos/line_start_pos past it — or add a 0xFEFF / EF BB BF arm in scan() that, at is_at_line_start() between documents, consumes the BOM and continues. Either way, also tighten the test comment at yaml.test.ts:1753 to "only the stream-start [206] instance", and add a test.todo for YAML.parse("a: 1\n...\n---\nb: 2", { all: true }) (or equivalent multi-doc API) alongside the existing §5.2 deferral.
… stale comments - `-0x1f`/`-0o17` (pre-existing on int path, not the float gate) → test.todo asserting the spec value, not codifying current -31 - `Bun.YAML.parse` → imported `YAML.parse` for consistency - Drop "Pre-existing on main" from un-todo'd test comment - Update string_is_number doc to reflect is_core_schema_number gate
| /// This mirrors the rules in `src/parsers/yaml.rs`'s `try_resolve_number` / | ||
| /// `is_core_schema_number`: | ||
| /// - Optional leading sign, optionally followed by `.inf`/`.Inf`/`.INF`. | ||
| /// - Otherwise either an integer (`[0-9]+` / `0x…` / `0o…`) or a float | ||
| /// matching §10.2.1.4 `[-+]? ( . [0-9]+ | [0-9]+ ( . [0-9]* )? ) ([eE][-+]?[0-9]+)?`. | ||
| /// The parser-side gate now rejects non-conforming float-like tokens | ||
| /// (e.g. `"1+5"`, `"1e"`, `"."`) so this mirror should err on the side of | ||
| /// *quoting* whenever a token *might* parse as a number. |
There was a problem hiding this comment.
🟡 Nit: the doc-comment update (per #3326791810) dropped the stale embedded-+/- clauses, but the two inline // Parser rule: comments in the body at :998 and :1004 still cite the pre-PR rules ("'+' accepted unless we're in hex base", "at most one '-' after the leading sign") — the parser's is_core_schema_number gate now rejects any embedded +/-, so these contradict the freshly-updated header. No functional impact (over-quoting is safe); just drop the Parser rule: prefix or reword to e.g. // Over-conservative: parser now rejects embedded +/-; quoting is still safe.
Extended reasoning...
What
This PR updated string_is_number's doc comment (lines 903-910) in response to inline comment #3326791810 — replacing the stale "the parser accepts any number of + after the leading sign … and at most one additional - … e.g. \"1+5\" round-trips to 1" with a correct reference to the new is_core_schema_number gate. However, the two inline comments inside the function body were not touched and still carry the exact claims that were just removed from the header:
- Line 998 (
0x2barm):// Parser rule: '+' accepted unless we're in hex base. - Line 1004 (
0x2darm):// Parser rule: at most one '-' after the leading sign.
After this PR, try_resolve_number is gated by is_core_schema_number (yaml.rs:1670), which rejects any +/- outside the single optional-sign-after-e position. There is no longer a "+ accepted unless hex" rule or an "at most one additional -" rule on the parser side — those were the pre-PR lexer-loop rules at yaml.rs:1624-1643, which still run but whose acceptances are now overridden by the validator. The new doc comment three lines above says exactly this: "The parser-side gate now rejects non-conforming float-like tokens (e.g. \"1+5\" …)" — which the body comments directly contradict.
Step-by-step proof
- Take input
"1+5". Per the body comment at :998, the "Parser rule" is that+is accepted (not in hex base) — implyingYAML.parse("1+5")resolves as a number. - Per the doc comment at :908-909 (updated in this PR), the parser-side gate "now rejects …
\"1+5\"" — implying it resolves as a string. - Per the new test added in this PR (
["1+1", "1+1"]in the "resolves as string" table), the doc comment is right and the body comment is wrong. - Same for
"1-1"vs the "at most one-" comment at :1004.
So a reader of the body sees a Parser rule: attribution that the function's own header, and the test suite, now say is false.
Addressing the refutation
One verifier argued these comments describe what the body code does, not net parser behavior, and that the updated doc comment already frames the body as intentionally over-conservative. The second half is fair (and is why this is a nit, not a bug). But the first half doesn't hold: the comments are not phrased as "accept + unless hex" — they are phrased as "Parser rule: + accepted unless hex". The Parser rule: prefix is an explicit attribution to what YAML.parse does; the function's stated job (line 901) is "returns true when str would be parsed back as a number by YAML.parse", and these comments explain which parser rule each arm mirrors. That attribution is now stale. The doc-comment fix in this PR was made for precisely this reason; these two lines are the residue of that same fix.
The refutation also notes the body comments are pre-existing and untouched by the PR. True — but the PR did touch this function (the doc comment at 903-910 is in the diff) specifically to remove these stale claims, and missed the two places in the body where the same claims are repeated. That's an incomplete application of an accepted fix, not an out-of-scope pre-existing issue.
Impact
Zero functional impact. string_is_number still returns true for "1+5" / "1-1", so YAML.stringify over-quotes them — which is round-trip-safe, exactly as the new doc comment notes. The only cost is that the next reader of these arms is told the parser does something it no longer does.
Fix
Drop or reword the Parser rule: prefix at :998 and :1004, e.g.:
0x2b /* '+' */ => {
// Over-conservative: parser now rejects embedded '+'; quoting is safe.
if base == Base::Hex { return false; }
}
0x2d /* '-' */ => {
// Over-conservative: parser now rejects embedded '-'; quoting is safe.
if saw_minus_after_sign { return false; }
saw_minus_after_sign = true;
}(or simply delete the two comment lines — the doc comment already covers the over-conservative framing).
What
Clears the last three
test.todoentries that are not cyclic-anchor related (those four remain deferred).---\na: 1{a:1}l-document-prefix ::= c-byte-order-mark? l-comment*&outer\n&inner : xMultipleAnchors{null:"x"}:){a: &x\n b: c}{a:{b:"c"}}(over-accept)ns-flow-node, not a pairHow
Encoding::bom_len()(3 for UTF-8EF BB BF, 1 for UTF-16FEFF, 0 for Latin-1).Parser::initstartspos/line_start_pospast it.MappingValuearm clearshas_mapping_anchorafterimplicit_key_anchorsconsumed it (was reaching the post-loop guard).set_anchorerrors instead of silently overwriting on a 3rd anchor.parse_nodeScalar arm returns the bare scalar inFlowIn && !flow_pair_allowedinstead of falling through toparse_block_mapping; closes the multiline-property gap left by the cmi mechanism.Tests
7 → 4 todos (cyclic anchors only). 2032 pass / 0 fail / 402/402 yaml-test-suite / 1015/1015 4-parser consensus.