fix: _LazyStore thread-safe initialization (#164)#226
Conversation
Prevent duplicate PostgresStore/connection pool creation when multiple threads (embedding workers, cleanup, concurrent requests) race through __getattr__ simultaneously. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
|
Adding QA Active — reviewing thread-safety fix. |
cmeans
left a comment
There was a problem hiding this comment.
QA Review — Round 1
Clean thread-safety fix. Double-checked locking is correct — _lock and _instance are class variables (shared), GIL guarantees atomic reference assignment for the outer check, inner check prevents duplicate creation after lock acquisition. The test is well-designed — 10 threads synchronize via Barrier to maximize contention, then asserts _create_store called exactly once.
Checks
| Check | Result |
|---|---|
| Thread safety test passes (10 concurrent threads, 1 store created) | ✅ |
| 726/726 full suite pass | ✅ |
CHANGELOG entry present under [Unreleased] |
✅ |
| CI all green | ✅ |
Test cleans up _instance = None after run |
✅ |
Findings
1. [Substantive] README test count 725 → 726
One new test brings the total to 726.
PR checkboxes
Tests 1-3 require a running instance with concurrent clients. The thread-safety test covers the core race condition. Checking boxes for what's verified:
| # | Test | Result |
|---|---|---|
| 1 | Concurrent tool calls | ✅ Covered by test_concurrent_access_creates_single_instance |
| 2 | Restart and lazy init | ⏭️ Requires infra |
| 3 | Embedding worker race | ⏭️ Requires infra |
Verdict
Finding #1 (test count) needs a fix. Code is correct.
|
Applying QA Failed — README test count 725→726. |
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Dev response: Fixed — README test count updated 725 → 726 in 23a49f9. |
|
Adding QA Active — Round 2: full verification including QA instance tests. |
cmeans
left a comment
There was a problem hiding this comment.
QA Review — Round 2
README test count fixed (726). All 3 manual tests executed on a QA Docker instance:
Manual test results
| # | Test | Result | Details |
|---|---|---|---|
| 1 | Concurrent tool calls | ✅ | 5 parallel get_stats calls — all returned valid JSON, 0 errors in logs |
| 2 | Restart + lazy init | ✅ | docker restart → new session → get_stats → valid response, store initialized once |
| 3 | Embedding worker race | ✅ | remember call succeeded, entry created, 0 errors/duplicates/pool conflicts in logs |
Automated checks
| Check | Result |
|---|---|
| Thread safety test (10-thread barrier race) | ✅ |
| README test count matches (726) | ✅ |
| 726/726 full suite pass | ✅ |
| CI all green | ✅ |
QA infrastructure cleaned up (container + DB dropped).
Zero findings. Verdict: Pass — ready for signoff.
|
Applying Ready for QA Signoff — all 3 manual tests pass on QA instance, thread safety test pass, README count fixed, CI green, zero findings. |
cmeans
left a comment
There was a problem hiding this comment.
QA Feedback — Test Coverage Gaps
The fix is correct and the barrier test proves the lock invariant. However, the test suite doesn't fully exercise the real-world race conditions that motivated #164. Recommending additional tests before merge:
1. Cleanup thread + request handler race
_do_cleanup runs on a background daemon thread and accesses the store. Add a test that triggers cleanup (via _schedule_cleanup) concurrently with tool call processing through _LazyStore, verifying no duplicate store creation.
2. Embedding worker + request handler race
When AWARENESS_EMBEDDING_PROVIDER is set, the embedding worker runs in a separate thread and accesses the store. A test simulating this path (e.g., remember triggering embedding while a parallel get_knowledge is in flight) would cover the actual subsystem interaction from #164.
3. Integration barrier test against real Postgres
The current barrier test uses _FakeStore. An integration variant that calls _create_store() (real PostgresStore + connection pool) from 10 threads would verify that the actual pool initialization is safe under contention, not just the lock mechanics.
These are the real access patterns that caused #164 — without them, we're testing the lock but not the bug.
Three additional tests covering the actual concurrent access patterns from #164: cleanup daemon vs request handler, embedding worker vs request handler, and 10-thread integration test with real PostgresStore. Update README test count to 729. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
|
Dev response to QA Round 3 findings: Added 3 tests covering real-world race conditions: (1) cleanup daemon vs request handler, (2) embedding worker vs request handler, (3) 10-thread integration with real PostgresStore + connection pool. All use Event/Barrier synchronization. 729/729 pass, ruff clean. README updated to 729. |
|
Dev response to QA Round 3 coverage gaps: All three requested tests added in be4bd5c: 1. Cleanup thread + request handler race ( 2. Embedding worker + request handler race ( 3. Integration barrier test with real Postgres ( All three are automated in CI — no manual QA steps needed for this round. Code in |
|
Adding QA Active — Round 3: reviewing additional thread race tests. |
cmeans
left a comment
There was a problem hiding this comment.
QA Review — Round 3
All three recommended test coverage gaps addressed:
| Test | What it covers | Result |
|---|---|---|
test_cleanup_thread_and_request_handler_race |
Daemon cleanup thread vs request handler, real _create_store |
✅ |
test_embedding_worker_and_request_handler_race |
Embedding worker vs request handler, real _create_store |
✅ |
test_concurrent_real_postgres_store_creation |
10-thread barrier with real PostgresStore + connection pool | ✅ |
All three use _tracking_create_store wrapping the real factory — they exercise actual Postgres pool creation under contention, not just lock mechanics.
- 4/4 thread safety tests pass
- 729/729 full suite pass
- README: 729
- CI green
Zero findings. Verdict: Pass — ready for signoff.
|
Applying Ready for QA Signoff — all test coverage gaps addressed, 4/4 thread safety tests pass with real Postgres, 729/729 total, CI green. |
6 bug-fix PRs since v0.16.1: - #226 LazyStore thread safety (#164) - #227 SQL template injection hardening (#165) - #232 Stateless HTTP mode (#180) - #233 GET /mcp intercept + request logger (#178) - #234 delete_entry IDOR fix (#193) - #236 RLS-safe opt-in cleanup (#179, #183) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
6 bug-fix PRs since v0.16.1: - #226 LazyStore thread safety (#164) - #227 SQL template injection hardening (#165) - #232 Stateless HTTP mode (#180) - #233 GET /mcp intercept + request logger (#178) - #234 delete_entry IDOR fix (#193) - #236 RLS-safe opt-in cleanup (#179, #183) Co-authored-by: cmeans-claude-dev[bot] <3223881+cmeans-claude-dev[bot]@users.noreply.github.com> Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Summary
threading.Lockwith double-checked locking to_LazyStore.__getattr__to prevent duplicatePostgresStore/connection pool creation under concurrent accessBarrierinto__getattr__simultaneously, asserts_create_storeis called exactly onceCloses #164
QA
Prerequisites
pip install -e ".[dev]"AWARENESS_PORT=8421)Manual tests (via MCP tools)
Send 5+ parallel MCP tool calls (e.g.,
get_briefing,get_knowledge,get_statssimultaneously).Expected: all return valid results, no connection pool errors or duplicate store warnings in logs
Stop and restart the test instance, then call any tool.
Expected: first call initializes the store normally, subsequent calls reuse it
Expected: entry created successfully; no errors in server logs about duplicate stores or pool conflicts
🤖 Generated with Claude Code