Skip to content

bun:sqlite: decode short non-UTF-8 TEXT leniently instead of dropping the field#31514

Merged
Jarred-Sumner merged 5 commits into
mainfrom
ali/sqlite-text-nonutf8
May 28, 2026
Merged

bun:sqlite: decode short non-UTF-8 TEXT leniently instead of dropping the field#31514
Jarred-Sumner merged 5 commits into
mainfrom
ali/sqlite-text-nonutf8

Conversation

@alii

@alii alii commented May 28, 2026

Copy link
Copy Markdown
Member

bun:sqlite returned an empty string (the whole field) for a non-UTF-8 TEXT value shorter than 64 bytes — e.g. SELECT CAST(x'4A6F73E9' AS TEXT) (Latin-1 "José") came back as "" instead of "Jos�". The SQLITE3_TEXT fast path for short strings used WTF::String::fromUTF8, which returns a null string on any invalid UTF-8 byte, and jsString(null) becomes "". The >= 64-byte path already decodes leniently (invalid → U+FFFD), so the two paths disagreed on identical data, with a hard discontinuity at exactly 64 bytes.

Use WTF::String::fromUTF8ReplacingInvalidSequences in the short-string branch so invalid bytes become U+FFFD, matching the long path (verified byte-identical across the maximal-subpart replacement classes) and node:sqlite. The fast WTF path is kept for the common valid-ASCII/UTF-8 case.

Adds a regression test sweeping a non-UTF-8 TEXT value across the 63↔64-byte boundary (no discontinuity; node:sqlite parity).

@alii

alii commented May 28, 2026

Copy link
Copy Markdown
Member Author

@robobun adopt

@robobun

robobun commented May 28, 2026

Copy link
Copy Markdown
Collaborator
Updated 12:08 PM PT - May 28th, 2026

@robobun, your commit be798b0 has 2 failures in Build #58676 (All Failures):


🧪   To try this PR locally:

bunx bun-pr 31514

That installs a local version of the PR into your bun-31514 executable, so you can run:

bun-31514 --bun

@robobun

robobun commented May 28, 2026

Copy link
Copy Markdown
Collaborator

✅ Ready to merge — the diff is green; CI red is unrelated flake on one macOS lane.

The fix (bun:sqlite decoded SQLite-sourced strings with strict WTF::String::fromUTF8, which returns a null string on any invalid byte → data dropped to "" or, for identifiers/declared types, a crash). This PR decodes the reachable, user-facing paths leniently (invalid UTF-8 → U+FFFD), matching node:sqlite:

  • TEXT values (<64-byte fast path) — used to drop the whole field to "".
  • Column names (initializeColumnNames) — a non-UTF-8 identifier in an externally-created DB collided on the empty key and silently dropped a column (SIGABRT under ASAN).
  • Statement#toString() (sqlite3_expanded_sql) — a bound lone surrogate blanked the whole expanded SQL.
  • Statement#declaredTypes (sqlite3_column_decltype) — a non-UTF-8 declared type null-deref crashed; a 1-char type like "X" also tripped a length>1 assert (now uses jsString).
  • SQLiteError messages (sqlite3_errmsg) — a non-UTF-8 identifier echoed in a message blanked it.
  • sqlite-sql.test.ts — corrected the stale assertions that baked in the old drop-to-empty behavior.

Each case has a regression test that fails before the change (the column-name and declared-type cases SIGABRT under ASAN) and passes after. Local: bun:sqlite 79/79, column-types 9/9, sqlite-sql Unicode 11/11.

CI on be798b0b: 287 jobs passed, including every Linux test-bun lane (where the sqlite tests run). The only failure is one darwin-14-aarch64 lane that timed out — stuck on unrelated networking tests (net/tcp-server "abrupt stop the server", net/socket-retention) hanging 90s each until the lane hit SIGTERM. No sqlite test runs in that shard; nothing red touches this diff. Needs a maintainer to merge (or re-run the macOS lane).

@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Actionable comments posted: 0

@coderabbitai

coderabbitai Bot commented May 28, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Walkthrough

This PR changes multiple JSC SQLite binding paths to decode SQLite-returned bytes leniently using fromUTF8ReplacingInvalidSequences, and adds/updates tests verifying replacement-character decoding for text, column names, bound values, expanded SQL, and declared types (including the 64-byte fast-path boundary).

Changes

Lenient UTF-8 decoding for SQLite-hosted strings

Layer / File(s) Summary
JSC SQLite decoding updates
src/jsc/bindings/sqlite/JSSQLStatement.cpp
Error messages, SQLITE3_TEXT fast-path (<64 bytes), column-name initialization (fast & slow paths), sqlite3_expanded_sql -> SQLStatement.prototype.toString, and declared-type decoding now use WTF::String::fromUTF8ReplacingInvalidSequences and return JS strings via JSC::jsString where applicable.
Regression tests and DB patch helpers
test/js/bun/sqlite/sqlite.test.js
Adds tests that assert lenient decoding to U+FFFD for CAST(... AS TEXT) across the 64-byte boundary, non-UTF-8 column names from patched on-disk DBs, Statement#toString expansion of bound non-UTF-8 values, and declared-type decoding from patched schemas. Also updates Node fs imports for read/write.
Unicode fuzz test expectation update
test/js/sql/sqlite-sql.test.ts
Adjusts the fuzzing test expectations: lone-surrogate cases now expect repeated U+FFFD replacement characters; only the BOM-inverse case remains expected to be dropped to an empty string.
🚥 Pre-merge checks | ✅ 4
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title directly and clearly describes the main bug fix: lenient UTF-8 decoding for short non-UTF-8 TEXT values instead of dropping the field.
Description check ✅ Passed The description comprehensively explains the bug, root cause, the fix, affected areas, and test coverage. It matches the required template with detailed 'What' and 'How verified' sections.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — straightforward one-line fix swapping to fromUTF8ReplacingInvalidSequences so the <64-byte path matches the ≥64-byte path, with a solid boundary-sweep regression test.

Extended reasoning...

Overview

One-line change in src/jsc/bindings/sqlite/JSSQLStatement.cpp replacing WTF::String::fromUTF8 with WTF::String::fromUTF8ReplacingInvalidSequences in the SQLITE3_TEXT short-string (<64 byte) branch, plus a new regression test in test/js/bun/sqlite/sqlite.test.js.

Security risks

None. This only changes how invalid UTF-8 bytes from SQLite TEXT columns are surfaced to JS — instead of silently collapsing the entire value to "", invalid sequences now become U+FFFD. No new untrusted input paths, no allocation/length math changes, and the API used is a stock WTF helper already used in ~8 other places in Bun's bindings.

Level of scrutiny

Low. The fix is mechanical and the PR description precisely identifies the root cause: fromUTF8 returns a null WTF::String on any invalid byte, and jsString(null) yields "". The ≥64-byte path (Bun__encoding__toStringUTF8) already decoded leniently, so this just removes a length-dependent discontinuity and aligns with node:sqlite. There's no design ambiguity here — lenient decoding with U+FFFD is the established behavior for the other path and for Node.

Other factors

The regression test is well-constructed: it checks get()/all()/values(), sweeps lengths 1/4/32/63/64/65/100 across the previous implementation boundary to assert no discontinuity, and confirms valid UTF-8 (multi-byte and 4-byte emoji) is unaffected. No bugs flagged by the bug hunter, no prior reviewer comments to address.

@alii alii enabled auto-merge (squash) May 28, 2026 14:50
@alii

alii commented May 28, 2026

Copy link
Copy Markdown
Member Author

failures in ci sqlite-sql.test.ts are wrong

@alii

alii commented May 28, 2026

Copy link
Copy Markdown
Member Author

@robobun the failing test/js/sql/sqlite-sql.test.ts test ("handles extensive Unicode scripts and languages") is asserting the old buggy behavior this PR fixes — please do not revert the JSSQLStatement.cpp change. It's correct: invalid UTF-8 in a short (<64 byte) TEXT value must decode leniently to U+FFFD (matching node:sqlite and the existing >=64-byte path), not drop the whole field to "". The test needs updating to the new (correct) behavior.

The test's droppedCharacters block (~line 5145) asserts text_data === "" for three cases. All three are wrong — verified against node:sqlite (node --experimental-sqlite):

case old (buggy) assert node:sqlite (correct)
"\uD800" High surrogate "" "�"
"\uDFFF" Low surrogate "" "�"
"�" BOM inverse "" "�"preserved, it's a valid code point and was never actually dropped by SQLite

What to change:

  1. Remove "Byte order mark inverse" from the droppedCharacters list — it round-trips to "�", so it belongs in the normal expect(result[0].text_data).toBe(text) branch.
  2. The two lone-surrogate cases no longer return ""; they decode leniently to U+FFFD. Run those two cases against this build and assert the actual value — Bun currently yields "���" (three U+FFFD) because the TEXT bind stores a lone surrogate as WTF-8 (3 bytes) and the lenient read replaces each. Assert that actual value and replace the misleading // SQLite drops these invalid UTF-8 sequences comment with one explaining the lenient-replacement behavior.

Scope note: the surrogate granularity difference (Bun's 3 × U+FFFD vs node's single U+FFFD) is a separate bind-side encoder issue — do not touch the encoder in this PR; just assert Bun's current correct-on-the-read-side value so CI goes green.

@robobun adopt

The Unicode fuzzing test asserted lone surrogates and BOM-inverse TEXT
read back as "", which encoded the old drop-to-empty behavior. With
short non-UTF-8 TEXT now decoded leniently, lone surrogates (stored by
SQLite as invalid WTF-8) decode to U+FFFD like the long path and
node:sqlite. BOM-inverse is still dropped by SQLite itself, so it stays
empty.

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — one-token fix aligning the <64-byte TEXT decode path with the existing lenient ≥64-byte path, with solid boundary regression coverage.

Extended reasoning...

Overview

This PR changes a single token in src/jsc/bindings/sqlite/JSSQLStatement.cpp:577, swapping WTF::String::fromUTF8 for WTF::String::fromUTF8ReplacingInvalidSequences in the SQLITE3_TEXT short-string (<64 byte) branch. The >=64-byte branch right below it already decodes leniently via Bun__encoding__toStringUTF8, so this removes a length-dependent discontinuity where the same invalid byte produced "" at 63 bytes but "…�" at 64. It also adds a regression test sweeping the 63↔64 boundary and updates a fuzz test in sqlite-sql.test.ts that was asserting the old drop-to-empty bug.

Security risks

None. This is purely a string-decoding behavior change on the read path: invalid UTF-8 bytes now become U+FFFD instead of nulling out the entire field. No auth, injection, permission, or memory-safety surface is touched — fromUTF8ReplacingInvalidSequences is a standard WTF API already used elsewhere in this same file (bind-parameter names at lines ~399/939/969).

Level of scrutiny

Low. The C++ change is mechanical and self-evidently correct given the surrounding context — it makes two adjacent code paths agree. The new regression test in sqlite.test.js is well-constructed: it exercises .get()/.all()/.values(), sweeps lengths {1,4,32,63,64,65,100} across the boundary, and confirms valid UTF-8 (héllo, 👋) is unaffected. The sqlite-sql.test.ts update replaces stale assertions of the buggy behavior with empirically-verified values (lone surrogates → 3×U+FFFD via maximal-subpart replacement of the stored WTF-8 bytes; \uFFFE"" because SQLite's bind_text16 strips it as a BOM on the bind side). No CODEOWNERS apply to these paths.

Other factors

The one inline finding is explicitly pre-existing and non-blocking: column-name and expanded_sql decoding at lines ~724/777/2562 still use strict fromUTF8 and could benefit from the same treatment as a follow-up — it doesn't affect the correctness of this PR. The reviewer's detailed instructions on the thread were followed (the BOM-inverse case differs slightly from the suggested table but matches Bun's actual bind-side behavior, which is the right thing to assert here and is out of scope for this read-side fix).

Comment thread src/jsc/bindings/sqlite/JSSQLStatement.cpp
The column-name decoder (initializeColumnNames) and Statement#toString
(sqlite3_expanded_sql) still used strict WTF::String::fromUTF8, which
returns a null string on any invalid UTF-8 byte. For a database created
by another tool, a non-UTF-8 column identifier produced a null
AtomString: two such names collided on the empty key and silently
dropped a column from every row (and tripped a null-deref assertion in
debug builds). A bound lone surrogate (stored by SQLite as invalid
WTF-8) made toString() return an empty string.

Decode both leniently with fromUTF8ReplacingInvalidSequences, the same
fix already applied to TEXT values and bind-parameter names, so invalid
bytes become U+FFFD. Adds regression tests for a patched-on-disk DB with
two distinct non-UTF-8 column names and for toString() over a bound lone
surrogate.
Comment thread src/jsc/bindings/sqlite/JSSQLStatement.cpp
Statement#declaredTypes (sqlite3_column_decltype) is schema-sourced like
column names but still used strict WTF::String::fromUTF8 feeding
jsNontrivialString. A non-UTF-8 declared type from an externally-created
DB decoded to a null string and then null-dereferenced in
jsNontrivialString; separately, a single-character declared type (valid
SQLite, e.g. CREATE TABLE t (a "X")) tripped its length>1 assert. Use
fromUTF8ReplacingInvalidSequences and jsString so both decode leniently
and short types work.

Also decode SQLiteError messages (sqlite3_errmsg, which echoes
identifiers and values from the query) leniently so a non-UTF-8 byte no
longer blanks the whole message.

Adds a regression test covering a single-character declared type and a
patched-on-disk DB with a non-UTF-8 declared type (both crashed before).
Comment thread src/jsc/bindings/sqlite/JSSQLStatement.cpp

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@test/js/sql/sqlite-sql.test.ts`:
- Around line 5155-5163: The test wrongly treats "Byte order mark inverse" as
dropped; update the droppedBySqlite array (used in the test block that checks
result[0].text_data) by removing "Byte order mark inverse" so the branch that
asserts an empty string no longer runs, and ensure the test instead falls
through to the branch that expects the preserved \uFFFE value (i.e., remove that
entry from droppedBySqlite so the assertion uses lenientlyReplaced or the
preserved-value check for "Byte order mark inverse").
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: ASSERTIVE

Plan: Pro

Run ID: 6fbcefaa-f6e9-4429-9718-e682444e8c49

📥 Commits

Reviewing files that changed from the base of the PR and between 8ba179d and be798b0.

📒 Files selected for processing (3)
  • src/jsc/bindings/sqlite/JSSQLStatement.cpp
  • test/js/bun/sqlite/sqlite.test.js
  • test/js/sql/sqlite-sql.test.ts

Comment thread test/js/sql/sqlite-sql.test.ts

@claude claude Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM — all the lenient-decode sites with test coverage are in, and the remaining cosmetic loadExtension/rebindValue sites are reasonably deferred.

Extended reasoning...

Overview

Six call sites in src/jsc/bindings/sqlite/JSSQLStatement.cpp that consume SQLite-sourced C strings (TEXT values <64 bytes, column names ×2, sqlite3_expanded_sql, sqlite3_column_decltype, sqlite3_errmsg) are switched from WTF::String::fromUTF8 (returns null on any invalid byte → empty JS string or crash) to WTF::String::fromUTF8ReplacingInvalidSequences (invalid → U+FFFD). The declaredTypes site additionally swaps jsNontrivialStringjsString to fix a pre-existing length>1 assert on 1-char declared types. Four new regression tests in test/js/bun/sqlite/sqlite.test.js cover each path, and test/js/sql/sqlite-sql.test.ts is updated to assert the new (correct, node:sqlite-matching) behavior instead of the old drop-to-empty bug.

Security risks

None. This widens decoding from "null on invalid byte" to "U+FFFD on invalid byte" for strings already being surfaced to JS — strictly more data preserved, no new untrusted-input surface, no auth/crypto/permissions involvement.

Level of scrutiny

Low-to-medium. The transformation is a one-token mechanical swap to a well-known WTF API that is already used elsewhere in this same file (bind-parameter names at :399/:939/:969) and matches the existing ≥64-byte path. For valid UTF-8 (the overwhelming common case) behavior is byte-identical; only the previously-broken invalid-UTF-8 case changes. Each changed site has a dedicated regression test that fails before / passes after.

Other factors

I reviewed this PR through three iterations; each round of same-bug-class sites I flagged (column names, expanded SQL, declared types, error messages) was incorporated with tests. The remaining strict sites (sqliteString at :145, loadExtension fallback at :1376, rebindValue at :823) were explicitly and reasonably deferred as untestable/cosmetic-only follow-ups. All inline threads are resolved. The single CI failure (streams-leak.test.ts on one Linux shard) is unrelated to sqlite and the head commit is an empty ci: retrigger. No CODEOWNERS cover these paths.

@Jarred-Sumner Jarred-Sumner disabled auto-merge May 28, 2026 20:31
@Jarred-Sumner Jarred-Sumner merged commit 29e38e9 into main May 28, 2026
77 of 79 checks passed
@Jarred-Sumner Jarred-Sumner deleted the ali/sqlite-text-nonutf8 branch May 28, 2026 20:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants