bun:sqlite: decode short non-UTF-8 TEXT leniently instead of dropping the field#31514
Conversation
|
@robobun adopt |
|
Updated 12:08 PM PT - May 28th, 2026
❌ @robobun, your commit be798b0 has 2 failures in
🧪 To try this PR locally: bunx bun-pr 31514That installs a local version of the PR into your bun-31514 --bun |
|
✅ Ready to merge — the diff is green; CI red is unrelated flake on one macOS lane. The fix (bun:sqlite decoded SQLite-sourced strings with strict
Each case has a regression test that fails before the change (the column-name and declared-type cases SIGABRT under ASAN) and passes after. Local: CI on |
|
Actionable comments posted: 0 |
WalkthroughThis PR changes multiple JSC SQLite binding paths to decode SQLite-returned bytes leniently using fromUTF8ReplacingInvalidSequences, and adds/updates tests verifying replacement-character decoding for text, column names, bound values, expanded SQL, and declared types (including the 64-byte fast-path boundary). ChangesLenient UTF-8 decoding for SQLite-hosted strings
🚥 Pre-merge checks | ✅ 4✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Comment |
There was a problem hiding this comment.
LGTM — straightforward one-line fix swapping to fromUTF8ReplacingInvalidSequences so the <64-byte path matches the ≥64-byte path, with a solid boundary-sweep regression test.
Extended reasoning...
Overview
One-line change in src/jsc/bindings/sqlite/JSSQLStatement.cpp replacing WTF::String::fromUTF8 with WTF::String::fromUTF8ReplacingInvalidSequences in the SQLITE3_TEXT short-string (<64 byte) branch, plus a new regression test in test/js/bun/sqlite/sqlite.test.js.
Security risks
None. This only changes how invalid UTF-8 bytes from SQLite TEXT columns are surfaced to JS — instead of silently collapsing the entire value to "", invalid sequences now become U+FFFD. No new untrusted input paths, no allocation/length math changes, and the API used is a stock WTF helper already used in ~8 other places in Bun's bindings.
Level of scrutiny
Low. The fix is mechanical and the PR description precisely identifies the root cause: fromUTF8 returns a null WTF::String on any invalid byte, and jsString(null) yields "". The ≥64-byte path (Bun__encoding__toStringUTF8) already decoded leniently, so this just removes a length-dependent discontinuity and aligns with node:sqlite. There's no design ambiguity here — lenient decoding with U+FFFD is the established behavior for the other path and for Node.
Other factors
The regression test is well-constructed: it checks get()/all()/values(), sweeps lengths 1/4/32/63/64/65/100 across the previous implementation boundary to assert no discontinuity, and confirms valid UTF-8 (multi-byte and 4-byte emoji) is unaffected. No bugs flagged by the bug hunter, no prior reviewer comments to address.
|
failures in ci |
|
@robobun the failing The test's
What to change:
Scope note: the surrogate granularity difference (Bun's 3 × U+FFFD vs node's single U+FFFD) is a separate bind-side encoder issue — do not touch the encoder in this PR; just assert Bun's current correct-on-the-read-side value so CI goes green. @robobun adopt |
The Unicode fuzzing test asserted lone surrogates and BOM-inverse TEXT read back as "", which encoded the old drop-to-empty behavior. With short non-UTF-8 TEXT now decoded leniently, lone surrogates (stored by SQLite as invalid WTF-8) decode to U+FFFD like the long path and node:sqlite. BOM-inverse is still dropped by SQLite itself, so it stays empty.
There was a problem hiding this comment.
LGTM — one-token fix aligning the <64-byte TEXT decode path with the existing lenient ≥64-byte path, with solid boundary regression coverage.
Extended reasoning...
Overview
This PR changes a single token in src/jsc/bindings/sqlite/JSSQLStatement.cpp:577, swapping WTF::String::fromUTF8 for WTF::String::fromUTF8ReplacingInvalidSequences in the SQLITE3_TEXT short-string (<64 byte) branch. The >=64-byte branch right below it already decodes leniently via Bun__encoding__toStringUTF8, so this removes a length-dependent discontinuity where the same invalid byte produced "" at 63 bytes but "…�" at 64. It also adds a regression test sweeping the 63↔64 boundary and updates a fuzz test in sqlite-sql.test.ts that was asserting the old drop-to-empty bug.
Security risks
None. This is purely a string-decoding behavior change on the read path: invalid UTF-8 bytes now become U+FFFD instead of nulling out the entire field. No auth, injection, permission, or memory-safety surface is touched — fromUTF8ReplacingInvalidSequences is a standard WTF API already used elsewhere in this same file (bind-parameter names at lines ~399/939/969).
Level of scrutiny
Low. The C++ change is mechanical and self-evidently correct given the surrounding context — it makes two adjacent code paths agree. The new regression test in sqlite.test.js is well-constructed: it exercises .get()/.all()/.values(), sweeps lengths {1,4,32,63,64,65,100} across the boundary, and confirms valid UTF-8 (héllo, 👋) is unaffected. The sqlite-sql.test.ts update replaces stale assertions of the buggy behavior with empirically-verified values (lone surrogates → 3×U+FFFD via maximal-subpart replacement of the stored WTF-8 bytes; \uFFFE → "" because SQLite's bind_text16 strips it as a BOM on the bind side). No CODEOWNERS apply to these paths.
Other factors
The one inline finding is explicitly pre-existing and non-blocking: column-name and expanded_sql decoding at lines ~724/777/2562 still use strict fromUTF8 and could benefit from the same treatment as a follow-up — it doesn't affect the correctness of this PR. The reviewer's detailed instructions on the thread were followed (the BOM-inverse case differs slightly from the suggested table but matches Bun's actual bind-side behavior, which is the right thing to assert here and is out of scope for this read-side fix).
The column-name decoder (initializeColumnNames) and Statement#toString (sqlite3_expanded_sql) still used strict WTF::String::fromUTF8, which returns a null string on any invalid UTF-8 byte. For a database created by another tool, a non-UTF-8 column identifier produced a null AtomString: two such names collided on the empty key and silently dropped a column from every row (and tripped a null-deref assertion in debug builds). A bound lone surrogate (stored by SQLite as invalid WTF-8) made toString() return an empty string. Decode both leniently with fromUTF8ReplacingInvalidSequences, the same fix already applied to TEXT values and bind-parameter names, so invalid bytes become U+FFFD. Adds regression tests for a patched-on-disk DB with two distinct non-UTF-8 column names and for toString() over a bound lone surrogate.
Statement#declaredTypes (sqlite3_column_decltype) is schema-sourced like column names but still used strict WTF::String::fromUTF8 feeding jsNontrivialString. A non-UTF-8 declared type from an externally-created DB decoded to a null string and then null-dereferenced in jsNontrivialString; separately, a single-character declared type (valid SQLite, e.g. CREATE TABLE t (a "X")) tripped its length>1 assert. Use fromUTF8ReplacingInvalidSequences and jsString so both decode leniently and short types work. Also decode SQLiteError messages (sqlite3_errmsg, which echoes identifiers and values from the query) leniently so a non-UTF-8 byte no longer blanks the whole message. Adds a regression test covering a single-character declared type and a patched-on-disk DB with a non-UTF-8 declared type (both crashed before).
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@test/js/sql/sqlite-sql.test.ts`:
- Around line 5155-5163: The test wrongly treats "Byte order mark inverse" as
dropped; update the droppedBySqlite array (used in the test block that checks
result[0].text_data) by removing "Byte order mark inverse" so the branch that
asserts an empty string no longer runs, and ensure the test instead falls
through to the branch that expects the preserved \uFFFE value (i.e., remove that
entry from droppedBySqlite so the assertion uses lenientlyReplaced or the
preserved-value check for "Byte order mark inverse").
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: ASSERTIVE
Plan: Pro
Run ID: 6fbcefaa-f6e9-4429-9718-e682444e8c49
📒 Files selected for processing (3)
src/jsc/bindings/sqlite/JSSQLStatement.cpptest/js/bun/sqlite/sqlite.test.jstest/js/sql/sqlite-sql.test.ts
There was a problem hiding this comment.
LGTM — all the lenient-decode sites with test coverage are in, and the remaining cosmetic loadExtension/rebindValue sites are reasonably deferred.
Extended reasoning...
Overview
Six call sites in src/jsc/bindings/sqlite/JSSQLStatement.cpp that consume SQLite-sourced C strings (TEXT values <64 bytes, column names ×2, sqlite3_expanded_sql, sqlite3_column_decltype, sqlite3_errmsg) are switched from WTF::String::fromUTF8 (returns null on any invalid byte → empty JS string or crash) to WTF::String::fromUTF8ReplacingInvalidSequences (invalid → U+FFFD). The declaredTypes site additionally swaps jsNontrivialString → jsString to fix a pre-existing length>1 assert on 1-char declared types. Four new regression tests in test/js/bun/sqlite/sqlite.test.js cover each path, and test/js/sql/sqlite-sql.test.ts is updated to assert the new (correct, node:sqlite-matching) behavior instead of the old drop-to-empty bug.
Security risks
None. This widens decoding from "null on invalid byte" to "U+FFFD on invalid byte" for strings already being surfaced to JS — strictly more data preserved, no new untrusted-input surface, no auth/crypto/permissions involvement.
Level of scrutiny
Low-to-medium. The transformation is a one-token mechanical swap to a well-known WTF API that is already used elsewhere in this same file (bind-parameter names at :399/:939/:969) and matches the existing ≥64-byte path. For valid UTF-8 (the overwhelming common case) behavior is byte-identical; only the previously-broken invalid-UTF-8 case changes. Each changed site has a dedicated regression test that fails before / passes after.
Other factors
I reviewed this PR through three iterations; each round of same-bug-class sites I flagged (column names, expanded SQL, declared types, error messages) was incorporated with tests. The remaining strict sites (sqliteString at :145, loadExtension fallback at :1376, rebindValue at :823) were explicitly and reasonably deferred as untestable/cosmetic-only follow-ups. All inline threads are resolved. The single CI failure (streams-leak.test.ts on one Linux shard) is unrelated to sqlite and the head commit is an empty ci: retrigger. No CODEOWNERS cover these paths.
bun:sqlitereturned an empty string (the whole field) for a non-UTF-8TEXTvalue shorter than 64 bytes — e.g.SELECT CAST(x'4A6F73E9' AS TEXT)(Latin-1 "José") came back as""instead of"Jos�". TheSQLITE3_TEXTfast path for short strings usedWTF::String::fromUTF8, which returns a null string on any invalid UTF-8 byte, andjsString(null)becomes"". The>= 64-byte path already decodes leniently (invalid → U+FFFD), so the two paths disagreed on identical data, with a hard discontinuity at exactly 64 bytes.Use
WTF::String::fromUTF8ReplacingInvalidSequencesin the short-string branch so invalid bytes become U+FFFD, matching the long path (verified byte-identical across the maximal-subpart replacement classes) and node:sqlite. The fast WTF path is kept for the common valid-ASCII/UTF-8 case.Adds a regression test sweeping a non-UTF-8 TEXT value across the 63↔64-byte boundary (no discontinuity; node:sqlite parity).