bun:sqlite: decode short non-UTF-8 TEXT leniently instead of dropping the field by alii · Pull Request #31514 · oven-sh/bun

alii · 2026-05-28T14:31:41Z

bun:sqlite returned an empty string (the whole field) for a non-UTF-8 TEXT value shorter than 64 bytes — e.g. SELECT CAST(x'4A6F73E9' AS TEXT) (Latin-1 "José") came back as "" instead of "Jos�". The SQLITE3_TEXT fast path for short strings used WTF::String::fromUTF8, which returns a null string on any invalid UTF-8 byte, and jsString(null) becomes "". The >= 64-byte path already decodes leniently (invalid → U+FFFD), so the two paths disagreed on identical data, with a hard discontinuity at exactly 64 bytes.

Use WTF::String::fromUTF8ReplacingInvalidSequences in the short-string branch so invalid bytes become U+FFFD, matching the long path (verified byte-identical across the maximal-subpart replacement classes) and node:sqlite. The fast WTF path is kept for the common valid-ASCII/UTF-8 case.

Adds a regression test sweeping a non-UTF-8 TEXT value across the 63↔64-byte boundary (no discontinuity; node:sqlite parity).

… the field

alii · 2026-05-28T14:31:44Z

@robobun adopt

robobun · 2026-05-28T14:31:54Z

^{Updated 12:08 PM PT - May 28th, 2026}

❌ @robobun, your commit be798b0 has 2 failures in Build #58676 (All Failures):

test/js/bun/net/socket-retention.test.ts - timeout on 🍎 14 aarch64
test/js/bun/http/serve.test.ts - timeout on 🍎 14 aarch64

🧪 To try this PR locally:

bunx bun-pr 31514

That installs a local version of the PR into your bun-31514 executable, so you can run:

bun-31514 --bun

robobun · 2026-05-28T14:35:08Z

✅ Ready to merge — the diff is green; CI red is unrelated flake on one macOS lane.

The fix (bun:sqlite decoded SQLite-sourced strings with strict WTF::String::fromUTF8, which returns a null string on any invalid byte → data dropped to "" or, for identifiers/declared types, a crash). This PR decodes the reachable, user-facing paths leniently (invalid UTF-8 → U+FFFD), matching node:sqlite:

TEXT values (<64-byte fast path) — used to drop the whole field to "".
Column names (initializeColumnNames) — a non-UTF-8 identifier in an externally-created DB collided on the empty key and silently dropped a column (SIGABRT under ASAN).
Statement#toString() (sqlite3_expanded_sql) — a bound lone surrogate blanked the whole expanded SQL.
Statement#declaredTypes (sqlite3_column_decltype) — a non-UTF-8 declared type null-deref crashed; a 1-char type like "X" also tripped a length>1 assert (now uses jsString).
SQLiteError messages (sqlite3_errmsg) — a non-UTF-8 identifier echoed in a message blanked it.
sqlite-sql.test.ts — corrected the stale assertions that baked in the old drop-to-empty behavior.

Each case has a regression test that fails before the change (the column-name and declared-type cases SIGABRT under ASAN) and passes after. Local: bun:sqlite 79/79, column-types 9/9, sqlite-sql Unicode 11/11.

CI on be798b0b: 287 jobs passed, including every Linux test-bun lane (where the sqlite tests run). The only failure is one darwin-14-aarch64 lane that timed out — stuck on unrelated networking tests (net/tcp-server "abrupt stop the server", net/socket-retention) hanging 90s each until the lane hit SIGTERM. No sqlite test runs in that shard; nothing red touches this diff. Needs a maintainer to merge (or re-run the macOS lane).

coderabbitai · 2026-05-28T14:35:11Z

Actionable comments posted: 0

coderabbitai · 2026-05-28T14:35:14Z

Walkthrough

This PR changes multiple JSC SQLite binding paths to decode SQLite-returned bytes leniently using fromUTF8ReplacingInvalidSequences, and adds/updates tests verifying replacement-character decoding for text, column names, bound values, expanded SQL, and declared types (including the 64-byte fast-path boundary).

Changes

Lenient UTF-8 decoding for SQLite-hosted strings

Layer / File(s)	Summary
JSC SQLite decoding updates `src/jsc/bindings/sqlite/JSSQLStatement.cpp`	Error messages, SQLITE3_TEXT fast-path (<64 bytes), column-name initialization (fast & slow paths), sqlite3_expanded_sql -> `SQLStatement.prototype.toString`, and declared-type decoding now use `WTF::String::fromUTF8ReplacingInvalidSequences` and return JS strings via `JSC::jsString` where applicable.
Regression tests and DB patch helpers `test/js/bun/sqlite/sqlite.test.js`	Adds tests that assert lenient decoding to U+FFFD for `CAST(... AS TEXT)` across the 64-byte boundary, non-UTF-8 column names from patched on-disk DBs, `Statement#toString` expansion of bound non-UTF-8 values, and declared-type decoding from patched schemas. Also updates Node `fs` imports for read/write.
Unicode fuzz test expectation update `test/js/sql/sqlite-sql.test.ts`	Adjusts the fuzzing test expectations: lone-surrogate cases now expect repeated U+FFFD replacement characters; only the BOM-inverse case remains expected to be dropped to an empty string.

🚥 Pre-merge checks | ✅ 4

✅ Passed checks (4 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title directly and clearly describes the main bug fix: lenient UTF-8 decoding for short non-UTF-8 TEXT values instead of dropping the field.
Description check	✅ Passed	The description comprehensively explains the bug, root cause, the fix, affected areas, and test coverage. It matches the required template with detailed 'What' and 'How verified' sections.
Linked Issues check	✅ Passed	Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check	✅ Passed	Check skipped because no linked issues were found for this pull request.

_{✏️ Tip: You can configure your own custom pre-merge checks in the settings.}

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

claude

LGTM — straightforward one-line fix swapping to fromUTF8ReplacingInvalidSequences so the <64-byte path matches the ≥64-byte path, with a solid boundary-sweep regression test.

Extended reasoning...

Overview

One-line change in src/jsc/bindings/sqlite/JSSQLStatement.cpp replacing WTF::String::fromUTF8 with WTF::String::fromUTF8ReplacingInvalidSequences in the SQLITE3_TEXT short-string (<64 byte) branch, plus a new regression test in test/js/bun/sqlite/sqlite.test.js.

Security risks

None. This only changes how invalid UTF-8 bytes from SQLite TEXT columns are surfaced to JS — instead of silently collapsing the entire value to "", invalid sequences now become U+FFFD. No new untrusted input paths, no allocation/length math changes, and the API used is a stock WTF helper already used in ~8 other places in Bun's bindings.

Level of scrutiny

Low. The fix is mechanical and the PR description precisely identifies the root cause: fromUTF8 returns a null WTF::String on any invalid byte, and jsString(null) yields "". The ≥64-byte path (Bun__encoding__toStringUTF8) already decoded leniently, so this just removes a length-dependent discontinuity and aligns with node:sqlite. There's no design ambiguity here — lenient decoding with U+FFFD is the established behavior for the other path and for Node.

Other factors

The regression test is well-constructed: it checks get()/all()/values(), sweeps lengths 1/4/32/63/64/65/100 across the previous implementation boundary to assert no discontinuity, and confirms valid UTF-8 (multi-byte and 4-byte emoji) is unaffected. No bugs flagged by the bug hunter, no prior reviewer comments to address.

alii · 2026-05-28T15:04:23Z

failures in ci sqlite-sql.test.ts are wrong

alii · 2026-05-28T15:07:38Z

@robobun the failing test/js/sql/sqlite-sql.test.ts test ("handles extensive Unicode scripts and languages") is asserting the old buggy behavior this PR fixes — please do not revert the JSSQLStatement.cpp change. It's correct: invalid UTF-8 in a short (<64 byte) TEXT value must decode leniently to U+FFFD (matching node:sqlite and the existing >=64-byte path), not drop the whole field to "". The test needs updating to the new (correct) behavior.

The test's droppedCharacters block (~line 5145) asserts text_data === "" for three cases. All three are wrong — verified against node:sqlite (node --experimental-sqlite):

case	old (buggy) assert	node:sqlite (correct)
`"\uD800"` High surrogate	`""`	`"�"`
`"\uDFFF"` Low surrogate	`""`	`"�"`
`"�"` BOM inverse	`""`	`"�"` — preserved, it's a valid code point and was never actually dropped by SQLite

What to change:

Remove "Byte order mark inverse" from the droppedCharacters list — it round-trips to "�", so it belongs in the normal expect(result[0].text_data).toBe(text) branch.
The two lone-surrogate cases no longer return ""; they decode leniently to U+FFFD. Run those two cases against this build and assert the actual value — Bun currently yields "��" (three U+FFFD) because the TEXT bind stores a lone surrogate as WTF-8 (3 bytes) and the lenient read replaces each. Assert that actual value and replace the misleading // SQLite drops these invalid UTF-8 sequences comment with one explaining the lenient-replacement behavior.

Scope note: the surrogate granularity difference (Bun's 3 × U+FFFD vs node's single U+FFFD) is a separate bind-side encoder issue — do not touch the encoder in this PR; just assert Bun's current correct-on-the-read-side value so CI goes green.

@robobun adopt

The Unicode fuzzing test asserted lone surrogates and BOM-inverse TEXT read back as "", which encoded the old drop-to-empty behavior. With short non-UTF-8 TEXT now decoded leniently, lone surrogates (stored by SQLite as invalid WTF-8) decode to U+FFFD like the long path and node:sqlite. BOM-inverse is still dropped by SQLite itself, so it stays empty.

claude

LGTM — one-token fix aligning the <64-byte TEXT decode path with the existing lenient ≥64-byte path, with solid boundary regression coverage.

Extended reasoning...

Overview

This PR changes a single token in src/jsc/bindings/sqlite/JSSQLStatement.cpp:577, swapping WTF::String::fromUTF8 for WTF::String::fromUTF8ReplacingInvalidSequences in the SQLITE3_TEXT short-string (<64 byte) branch. The >=64-byte branch right below it already decodes leniently via Bun__encoding__toStringUTF8, so this removes a length-dependent discontinuity where the same invalid byte produced "" at 63 bytes but "…�" at 64. It also adds a regression test sweeping the 63↔64 boundary and updates a fuzz test in sqlite-sql.test.ts that was asserting the old drop-to-empty bug.

Security risks

None. This is purely a string-decoding behavior change on the read path: invalid UTF-8 bytes now become U+FFFD instead of nulling out the entire field. No auth, injection, permission, or memory-safety surface is touched — fromUTF8ReplacingInvalidSequences is a standard WTF API already used elsewhere in this same file (bind-parameter names at lines ~399/939/969).

Level of scrutiny

Low. The C++ change is mechanical and self-evidently correct given the surrounding context — it makes two adjacent code paths agree. The new regression test in sqlite.test.js is well-constructed: it exercises .get()/.all()/.values(), sweeps lengths {1,4,32,63,64,65,100} across the boundary, and confirms valid UTF-8 (héllo, 👋) is unaffected. The sqlite-sql.test.ts update replaces stale assertions of the buggy behavior with empirically-verified values (lone surrogates → 3×U+FFFD via maximal-subpart replacement of the stored WTF-8 bytes; \uFFFE → "" because SQLite's bind_text16 strips it as a BOM on the bind side). No CODEOWNERS apply to these paths.

Other factors

The one inline finding is explicitly pre-existing and non-blocking: column-name and expanded_sql decoding at lines ~724/777/2562 still use strict fromUTF8 and could benefit from the same treatment as a follow-up — it doesn't affect the correctness of this PR. The reviewer's detailed instructions on the thread were followed (the BOM-inverse case differs slightly from the suggested table but matches Bun's actual bind-side behavior, which is the right thing to assert here and is out of scope for this read-side fix).

The column-name decoder (initializeColumnNames) and Statement#toString (sqlite3_expanded_sql) still used strict WTF::String::fromUTF8, which returns a null string on any invalid UTF-8 byte. For a database created by another tool, a non-UTF-8 column identifier produced a null AtomString: two such names collided on the empty key and silently dropped a column from every row (and tripped a null-deref assertion in debug builds). A bound lone surrogate (stored by SQLite as invalid WTF-8) made toString() return an empty string. Decode both leniently with fromUTF8ReplacingInvalidSequences, the same fix already applied to TEXT values and bind-parameter names, so invalid bytes become U+FFFD. Adds regression tests for a patched-on-disk DB with two distinct non-UTF-8 column names and for toString() over a bound lone surrogate.

Statement#declaredTypes (sqlite3_column_decltype) is schema-sourced like column names but still used strict WTF::String::fromUTF8 feeding jsNontrivialString. A non-UTF-8 declared type from an externally-created DB decoded to a null string and then null-dereferenced in jsNontrivialString; separately, a single-character declared type (valid SQLite, e.g. CREATE TABLE t (a "X")) tripped its length>1 assert. Use fromUTF8ReplacingInvalidSequences and jsString so both decode leniently and short types work. Also decode SQLiteError messages (sqlite3_errmsg, which echoes identifiers and values from the query) leniently so a non-UTF-8 byte no longer blanks the whole message. Adds a regression test covering a single-character declared type and a patched-on-disk DB with a non-UTF-8 declared type (both crashed before).

coderabbitai

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/js/sql/sqlite-sql.test.ts`:
- Around line 5155-5163: The test wrongly treats "Byte order mark inverse" as
dropped; update the droppedBySqlite array (used in the test block that checks
result[0].text_data) by removing "Byte order mark inverse" so the branch that
asserts an empty string no longer runs, and ensure the test instead falls
through to the branch that expects the preserved \uFFFE value (i.e., remove that
entry from droppedBySqlite so the assertion uses lenientlyReplaced or the
preserved-value check for "Byte order mark inverse").

🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

Push a commit to this branch (recommended)
Create a new PR with the fixes

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6fbcefaa-f6e9-4429-9718-e682444e8c49

📥 Commits

Reviewing files that changed from the base of the PR and between 8ba179d and be798b0.

📒 Files selected for processing (3)

src/jsc/bindings/sqlite/JSSQLStatement.cpp
test/js/bun/sqlite/sqlite.test.js
test/js/sql/sqlite-sql.test.ts

claude

LGTM — all the lenient-decode sites with test coverage are in, and the remaining cosmetic loadExtension/rebindValue sites are reasonably deferred.

Extended reasoning...

Overview

Six call sites in src/jsc/bindings/sqlite/JSSQLStatement.cpp that consume SQLite-sourced C strings (TEXT values <64 bytes, column names ×2, sqlite3_expanded_sql, sqlite3_column_decltype, sqlite3_errmsg) are switched from WTF::String::fromUTF8 (returns null on any invalid byte → empty JS string or crash) to WTF::String::fromUTF8ReplacingInvalidSequences (invalid → U+FFFD). The declaredTypes site additionally swaps jsNontrivialString → jsString to fix a pre-existing length>1 assert on 1-char declared types. Four new regression tests in test/js/bun/sqlite/sqlite.test.js cover each path, and test/js/sql/sqlite-sql.test.ts is updated to assert the new (correct, node:sqlite-matching) behavior instead of the old drop-to-empty bug.

Security risks

None. This widens decoding from "null on invalid byte" to "U+FFFD on invalid byte" for strings already being surfaced to JS — strictly more data preserved, no new untrusted-input surface, no auth/crypto/permissions involvement.

Level of scrutiny

Low-to-medium. The transformation is a one-token mechanical swap to a well-known WTF API that is already used elsewhere in this same file (bind-parameter names at :399/:939/:969) and matches the existing ≥64-byte path. For valid UTF-8 (the overwhelming common case) behavior is byte-identical; only the previously-broken invalid-UTF-8 case changes. Each changed site has a dedicated regression test that fails before / passes after.

Other factors

I reviewed this PR through three iterations; each round of same-bug-class sites I flagged (column names, expanded SQL, declared types, error messages) was incorporated with tests. The remaining strict sites (sqliteString at :145, loadExtension fallback at :1376, rebindValue at :823) were explicitly and reasonably deferred as untestable/cosmetic-only follow-ups. All inline threads are resolved. The single CI failure (streams-leak.test.ts on one Linux shard) is unrelated to sqlite and the head commit is an empty ci: retrigger. No CODEOWNERS cover these paths.

bun:sqlite: decode short non-UTF-8 TEXT leniently instead of dropping…

8ba179d

… the field