Skip to content

fix: enforce UTF-8 encoding on all database creation paths#175

Merged
cmeans-claude-dev[bot] merged 12 commits into
mainfrom
fix/database-utf8-encoding
Apr 8, 2026
Merged

fix: enforce UTF-8 encoding on all database creation paths#175
cmeans-claude-dev[bot] merged 12 commits into
mainfrom
fix/database-utf8-encoding

Conversation

@cmeans-claude-dev
Copy link
Copy Markdown
Contributor

@cmeans-claude-dev cmeans-claude-dev Bot commented Apr 8, 2026

Summary

  • Root cause: session database inherited SQL_ASCII encoding from template1, causing UnicodeEncodeError on non-ASCII SQL comments (the root cause behind PR fix: replace em dash with ASCII in session SQL comment #172)
  • All database creation paths now explicitly specify ENCODING 'UTF8' LC_COLLATE 'C' LC_CTYPE 'C' TEMPLATE template0
  • Externalized CREATE DATABASE SQL to sql/session_create_database.sql
  • Runtime guard validates exactly 1 {} placeholder in the SQL file — a stray {} in a comment silently broke psycopg.sql.SQL().format() via IndexError swallowed by the exception handler
  • Docker-compose files (main, demo, oauth) set POSTGRES_INITDB_ARGS for UTF-8 cluster init
  • Updated deployment docs: pg_hba rules use all databases, CREATE DATABASE commands include encoding
  • Benchmarks updated to match

QA

Prerequisites

  • pip install -e ".[dev]"
  • Deploy to test instance on alternate port (AWARENESS_PORT=8421)

Manual tests (via MCP tools)

    • Auto-created session DB is UTF-8 — verified via test_creates_database_if_missing (checks encoding=6/UTF8, collate=C)
    • Externalized SQL loads correctlysession_create_database.sql exists, 1 placeholder, tests pass
    • Test passespytest tests/test_session_registry.py — 53/53 pass
    • Guard test passesTestCreateDatabaseSqlIntegrity — both placeholder validation tests pass
    • pg_hba docs updated — deployment plan shows all databases with explanation
    • CHANGELOG not needed — infrastructure hardening, no user-facing change

🤖 Generated with Claude Code

@cmeans-claude-dev cmeans-claude-dev Bot added the Dev Active Developer is actively working on this PR; QA should not start label Apr 8, 2026
@github-actions github-actions Bot added the Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA label Apr 8, 2026
cmeans-claude-dev[bot] and others added 3 commits April 8, 2026 13:20
The session database inherited SQL_ASCII from template1, causing
UnicodeEncodeError on non-ASCII SQL comments. Fix all creation paths:

- session_registry: externalize CREATE DATABASE SQL with explicit
  ENCODING/LC_COLLATE/LC_CTYPE and TEMPLATE template0
- docker-compose (main, demo, oauth): add POSTGRES_INITDB_ARGS
- benchmarks: add encoding to bench DB creation
- docs: update manual CREATE DATABASE commands in deployment plan
  and session persistence spec
- tests: verify auto-created database has UTF-8 encoding

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The awareness user needs access to awareness, awareness_sessions, and
postgres (for _ensure_database auto-create). Using per-database pg_hba
rules caused connection failures when adding new databases. Updated
deployment plan, design spec, and session persistence spec.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
en_US.UTF-8 is not available on all Postgres containers (CI
testcontainers failed). C.UTF-8 is universally available and still
provides UTF-8 encoding. Updated SQL, docker-compose files, benchmarks,
docs, and tests.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cmeans cmeans force-pushed the fix/database-utf8-encoding branch from d9fa71d to e376e66 Compare April 8, 2026 18:20
cmeans-claude-dev[bot] and others added 6 commits April 8, 2026 13:27
C.UTF-8 locale is not available on all Postgres environments (CI
testcontainers failed). ENCODING 'UTF8' TEMPLATE template0 is
sufficient — locale inherits from template0 (C). Manual provisioning
docs keep C.UTF-8 since holodeck has it installed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
… SQL

The right fix is matching the test environment to production, not
removing locale from CREATE DATABASE. Configure the testcontainers
Postgres with POSTGRES_INITDB_ARGS to initialize the cluster with
C.UTF-8, then restore the full CREATE DATABASE with locale.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
C.UTF-8 locale name varies by OS (C.UTF-8 vs C.utf8) causing CREATE
DATABASE to fail silently on some Postgres environments. The C locale
is universally available and still allows UTF-8 encoding. Session DB
stores IDs and timestamps — locale-sensitive collation isn't needed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The comment '-- The {} placeholder is formatted...' contained a literal
{} which psycopg.sql.SQL().format() treated as a second placeholder,
causing IndexError silently swallowed by _ensure_database.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
psycopg.sql.SQL().format() treats ALL {} as placeholders, including
those in SQL comments. A stray {} in a comment silently broke database
creation via IndexError swallowed by the exception handler.

Added:
- Runtime guard in _ensure_database: validates exactly 1 placeholder
- Test: asserts session_create_database.sql has exactly 1 {}

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 8, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

Injects SQL with two {} placeholders into the cache and verifies
_ensure_database fails safely without creating the database.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cmeans-claude-dev cmeans-claude-dev Bot added Ready for QA Dev work complete — QA can begin review and removed Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA Dev Active Developer is actively working on this PR; QA should not start labels Apr 8, 2026
@cmeans cmeans added QA Active QA is actively reviewing; Dev should not push changes and removed Ready for QA Dev work complete — QA can begin review labels Apr 8, 2026
@cmeans
Copy link
Copy Markdown
Owner

cmeans commented Apr 8, 2026

Adding QA Active — reviewing UTF-8 enforcement.

Copy link
Copy Markdown
Owner

@cmeans cmeans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QA Review — Round 1

Thorough root-cause fix for the SQL_ASCII encoding issue. The {} placeholder guard is smart — catches the exact class of bug that caused the original problem (stray {} in comments silently breaks psycopg.sql.SQL().format()).

Code review

Area Verdict
session_create_database.sql — UTF-8, template0, portable 'C' locale
Placeholder guard (exactly 1 {}) in _ensure_database
SQL file comment explains 'C' vs 'C.UTF-8' portability choice
Docker Compose — POSTGRES_INITDB_ARGS for UTF-8 cluster init (3 files)
Deployment plan — pg_hba.conf widened to all databases, CREATE DATABASE has encoding
Design spec — updated pg_hba explanation, deployment steps re-numbered
Session persistence spec — CREATE DATABASE has encoding, pg_hba prerequisite added
Benchmarks — CREATE DATABASE has encoding
Test: test_creates_database_if_missing now verifies encoding=UTF8 and collate=C
Test: TestCreateDatabaseSqlIntegrity — placeholder count guard + multi-placeholder rejection
53/53 session tests pass
721/721 full suite pass
CI: all green

Findings

1. [Substantive] README test count is 719, should be 721

2 new tests (TestCreateDatabaseSqlIntegrity) bring the total to 721.

2. [Observation] Locale: code uses 'C', docs use 'C.UTF-8'

The SQL file correctly uses 'C' for portability (explained in comment). The deployment plan and session persistence spec use 'C.UTF-8' for the holodeck (Debian). This means the auto-created session DB gets 'C' while manually created databases get 'C.UTF-8'. Acceptable for session data (IDs/timestamps), but noting the difference.

PR checkboxes

# Test Result
1 Auto-created session DB is UTF-8 ✅ Verified via test_creates_database_if_missing (checks encoding=6, collate=C)
2 Externalized SQL loads correctly ✅ File exists, 1 placeholder, tests pass
3 UTF-8 test passes 53 passed
4 Guard test passes ✅ Both integrity tests pass
5 pg_hba docs updated all databases, with explanation
6 CHANGELOG not needed ✅ Infrastructure hardening

Verdict

Finding #1 (test count) needs a fix. Finding #2 is an observation, non-blocking.

@cmeans cmeans added QA Failed QA found issues — needs dev attention and removed QA Active QA is actively reviewing; Dev should not push changes labels Apr 8, 2026
@cmeans
Copy link
Copy Markdown
Owner

cmeans commented Apr 8, 2026

Applying QA Failed — README test count 719→721. All checkboxes checked. One observation (locale C vs C.UTF-8), non-blocking.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cmeans-claude-dev
Copy link
Copy Markdown
Contributor Author

Dev response

QA finding: test count 719 → 721

  • Fixed in c39cc24 — updated both occurrences in README

Ready for re-review.

@cmeans-claude-dev cmeans-claude-dev Bot added Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA and removed QA Failed QA found issues — needs dev attention labels Apr 8, 2026
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cmeans-claude-dev
Copy link
Copy Markdown
Contributor Author

Dev response (Round 2)

QA finding 1: README test count 719 → 721

QA finding 2: locale mismatch — docs say C.UTF-8, code uses C

  • Fixed in 62ba869 — deployment plan and session persistence spec now use LC_COLLATE 'C' to match the code

Ready for re-review.

@github-actions github-actions Bot added Ready for QA Dev work complete — QA can begin review and removed Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA labels Apr 8, 2026
@cmeans cmeans added QA Active QA is actively reviewing; Dev should not push changes and removed Ready for QA Dev work complete — QA can begin review labels Apr 8, 2026
Copy link
Copy Markdown
Owner

@cmeans cmeans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QA Review — Round 2

Both findings fixed:

# Finding Fix
1 README test count 719→721
2 Locale: docs used 'C.UTF-8', code used 'C' ✅ All aligned to 'C' (deployment plan + session persistence spec)

CI green. Zero new findings. Verdict: Pass — ready for signoff.

@cmeans cmeans added Ready for QA Signoff QA passed — ready for maintainer final review and merge and removed QA Active QA is actively reviewing; Dev should not push changes labels Apr 8, 2026
@cmeans
Copy link
Copy Markdown
Owner

cmeans commented Apr 8, 2026

Applying Ready for QA Signoff — test count and locale drift both fixed, CI green, zero findings.

Copy link
Copy Markdown
Owner

@cmeans cmeans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cmeans cmeans added QA Approved Manual QA testing completed and passed and removed Ready for QA Signoff QA passed — ready for maintainer final review and merge labels Apr 8, 2026
@cmeans-claude-dev cmeans-claude-dev Bot merged commit d74ce97 into main Apr 8, 2026
35 checks passed
@cmeans-claude-dev cmeans-claude-dev Bot deleted the fix/database-utf8-encoding branch April 8, 2026 19:29
@cmeans cmeans mentioned this pull request Apr 9, 2026
@cmeans-claude-dev cmeans-claude-dev Bot mentioned this pull request Apr 9, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

QA Approved Manual QA testing completed and passed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant