fix: enforce UTF-8 encoding on all database creation paths#175
Conversation
The session database inherited SQL_ASCII from template1, causing UnicodeEncodeError on non-ASCII SQL comments. Fix all creation paths: - session_registry: externalize CREATE DATABASE SQL with explicit ENCODING/LC_COLLATE/LC_CTYPE and TEMPLATE template0 - docker-compose (main, demo, oauth): add POSTGRES_INITDB_ARGS - benchmarks: add encoding to bench DB creation - docs: update manual CREATE DATABASE commands in deployment plan and session persistence spec - tests: verify auto-created database has UTF-8 encoding Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The awareness user needs access to awareness, awareness_sessions, and postgres (for _ensure_database auto-create). Using per-database pg_hba rules caused connection failures when adding new databases. Updated deployment plan, design spec, and session persistence spec. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
en_US.UTF-8 is not available on all Postgres containers (CI testcontainers failed). C.UTF-8 is universally available and still provides UTF-8 encoding. Updated SQL, docker-compose files, benchmarks, docs, and tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
d9fa71d to
e376e66
Compare
C.UTF-8 locale is not available on all Postgres environments (CI testcontainers failed). ENCODING 'UTF8' TEMPLATE template0 is sufficient — locale inherits from template0 (C). Manual provisioning docs keep C.UTF-8 since holodeck has it installed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… SQL The right fix is matching the test environment to production, not removing locale from CREATE DATABASE. Configure the testcontainers Postgres with POSTGRES_INITDB_ARGS to initialize the cluster with C.UTF-8, then restore the full CREATE DATABASE with locale. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
C.UTF-8 locale name varies by OS (C.UTF-8 vs C.utf8) causing CREATE DATABASE to fail silently on some Postgres environments. The C locale is universally available and still allows UTF-8 encoding. Session DB stores IDs and timestamps — locale-sensitive collation isn't needed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The comment '-- The {} placeholder is formatted...' contained a literal
{} which psycopg.sql.SQL().format() treated as a second placeholder,
causing IndexError silently swallowed by _ensure_database.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
psycopg.sql.SQL().format() treats ALL {} as placeholders, including
those in SQL comments. A stray {} in a comment silently broke database
creation via IndexError swallowed by the exception handler.
Added:
- Runtime guard in _ensure_database: validates exactly 1 placeholder
- Test: asserts session_create_database.sql has exactly 1 {}
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
Injects SQL with two {} placeholders into the cache and verifies
_ensure_database fails safely without creating the database.
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Adding QA Active — reviewing UTF-8 enforcement. |
cmeans
left a comment
There was a problem hiding this comment.
QA Review — Round 1
Thorough root-cause fix for the SQL_ASCII encoding issue. The {} placeholder guard is smart — catches the exact class of bug that caused the original problem (stray {} in comments silently breaks psycopg.sql.SQL().format()).
Code review
| Area | Verdict |
|---|---|
session_create_database.sql — UTF-8, template0, portable 'C' locale |
✅ |
Placeholder guard (exactly 1 {}) in _ensure_database |
✅ |
SQL file comment explains 'C' vs 'C.UTF-8' portability choice |
✅ |
Docker Compose — POSTGRES_INITDB_ARGS for UTF-8 cluster init (3 files) |
✅ |
Deployment plan — pg_hba.conf widened to all databases, CREATE DATABASE has encoding |
✅ |
| Design spec — updated pg_hba explanation, deployment steps re-numbered | ✅ |
| Session persistence spec — CREATE DATABASE has encoding, pg_hba prerequisite added | ✅ |
| Benchmarks — CREATE DATABASE has encoding | ✅ |
Test: test_creates_database_if_missing now verifies encoding=UTF8 and collate=C |
✅ |
Test: TestCreateDatabaseSqlIntegrity — placeholder count guard + multi-placeholder rejection |
✅ |
| 53/53 session tests pass | ✅ |
| 721/721 full suite pass | ✅ |
| CI: all green | ✅ |
Findings
1. [Substantive] README test count is 719, should be 721
2 new tests (TestCreateDatabaseSqlIntegrity) bring the total to 721.
2. [Observation] Locale: code uses 'C', docs use 'C.UTF-8'
The SQL file correctly uses 'C' for portability (explained in comment). The deployment plan and session persistence spec use 'C.UTF-8' for the holodeck (Debian). This means the auto-created session DB gets 'C' while manually created databases get 'C.UTF-8'. Acceptable for session data (IDs/timestamps), but noting the difference.
PR checkboxes
| # | Test | Result |
|---|---|---|
| 1 | Auto-created session DB is UTF-8 | ✅ Verified via test_creates_database_if_missing (checks encoding=6, collate=C) |
| 2 | Externalized SQL loads correctly | ✅ File exists, 1 placeholder, tests pass |
| 3 | UTF-8 test passes | ✅ 53 passed |
| 4 | Guard test passes | ✅ Both integrity tests pass |
| 5 | pg_hba docs updated | ✅ all databases, with explanation |
| 6 | CHANGELOG not needed | ✅ Infrastructure hardening |
Verdict
Finding #1 (test count) needs a fix. Finding #2 is an observation, non-blocking.
|
Applying QA Failed — README test count 719→721. All checkboxes checked. One observation (locale C vs C.UTF-8), non-blocking. |
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Dev responseQA finding: test count 719 → 721
Ready for re-review. |
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
cmeans
left a comment
There was a problem hiding this comment.
QA Review — Round 2
Both findings fixed:
| # | Finding | Fix |
|---|---|---|
| 1 | README test count 719→721 | ✅ |
| 2 | Locale: docs used 'C.UTF-8', code used 'C' |
✅ All aligned to 'C' (deployment plan + session persistence spec) |
CI green. Zero new findings. Verdict: Pass — ready for signoff.
|
Applying Ready for QA Signoff — test count and locale drift both fixed, CI green, zero findings. |
Summary
UnicodeEncodeErroron non-ASCII SQL comments (the root cause behind PR fix: replace em dash with ASCII in session SQL comment #172)ENCODING 'UTF8' LC_COLLATE 'C' LC_CTYPE 'C' TEMPLATE template0CREATE DATABASESQL tosql/session_create_database.sql{}placeholder in the SQL file — a stray{}in a comment silently brokepsycopg.sql.SQL().format()viaIndexErrorswallowed by the exception handlerPOSTGRES_INITDB_ARGSfor UTF-8 cluster initalldatabases, CREATE DATABASE commands include encodingQA
Prerequisites
pip install -e ".[dev]"AWARENESS_PORT=8421)Manual tests (via MCP tools)
test_creates_database_if_missing(checks encoding=6/UTF8, collate=C)session_create_database.sqlexists, 1 placeholder, tests passpytest tests/test_session_registry.py— 53/53 passTestCreateDatabaseSqlIntegrity— both placeholder validation tests passalldatabases with explanation🤖 Generated with Claude Code