Skip to content

fix: _LazyStore thread-safe initialization (#164)#226

Merged
cmeans-claude-dev[bot] merged 3 commits into
mainfrom
fix/lazy-store-thread-safety
Apr 9, 2026
Merged

fix: _LazyStore thread-safe initialization (#164)#226
cmeans-claude-dev[bot] merged 3 commits into
mainfrom
fix/lazy-store-thread-safety

Conversation

@cmeans-claude-dev
Copy link
Copy Markdown
Contributor

@cmeans-claude-dev cmeans-claude-dev Bot commented Apr 9, 2026

Summary

  • Added threading.Lock with double-checked locking to _LazyStore.__getattr__ to prevent duplicate PostgresStore/connection pool creation under concurrent access
  • Inline comment explains why double-checked locking is safe in CPython (GIL guarantees atomic reference assignment)
  • New test: 10 threads race through a Barrier into __getattr__ simultaneously, asserts _create_store is called exactly once

Closes #164

QA

Prerequisites

  • pip install -e ".[dev]"
  • Deploy to test instance on alternate port (AWARENESS_PORT=8421)

Manual tests (via MCP tools)

    • Concurrent tool calls produce single store instance
      Send 5+ parallel MCP tool calls (e.g., get_briefing, get_knowledge, get_stats simultaneously).
      Expected: all return valid results, no connection pool errors or duplicate store warnings in logs
    • Restart server and verify lazy init still works
      Stop and restart the test instance, then call any tool.
      Expected: first call initializes the store normally, subsequent calls reuse it
    • Embedding worker doesn't race with request handler
    remember(source="test", tags=["qa"], description="thread safety test entry")
    

    Expected: entry created successfully; no errors in server logs about duplicate stores or pool conflicts

🤖 Generated with Claude Code

Prevent duplicate PostgresStore/connection pool creation when multiple
threads (embedding workers, cleanup, concurrent requests) race through
__getattr__ simultaneously.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cmeans-claude-dev cmeans-claude-dev Bot added bug Something isn't working Dev Active Developer is actively working on this PR; QA should not start labels Apr 9, 2026
@github-actions github-actions Bot added the Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA label Apr 9, 2026
@codecov
Copy link
Copy Markdown

codecov Bot commented Apr 9, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@cmeans-claude-dev cmeans-claude-dev Bot added Ready for QA Dev work complete — QA can begin review and removed Dev Active Developer is actively working on this PR; QA should not start labels Apr 9, 2026
@github-actions github-actions Bot removed the Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA label Apr 9, 2026
@cmeans cmeans added QA Active QA is actively reviewing; Dev should not push changes and removed Ready for QA Dev work complete — QA can begin review labels Apr 9, 2026
@cmeans
Copy link
Copy Markdown
Owner

cmeans commented Apr 9, 2026

Adding QA Active — reviewing thread-safety fix.

Copy link
Copy Markdown
Owner

@cmeans cmeans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QA Review — Round 1

Clean thread-safety fix. Double-checked locking is correct — _lock and _instance are class variables (shared), GIL guarantees atomic reference assignment for the outer check, inner check prevents duplicate creation after lock acquisition. The test is well-designed — 10 threads synchronize via Barrier to maximize contention, then asserts _create_store called exactly once.

Checks

Check Result
Thread safety test passes (10 concurrent threads, 1 store created)
726/726 full suite pass
CHANGELOG entry present under [Unreleased]
CI all green
Test cleans up _instance = None after run

Findings

1. [Substantive] README test count 725 → 726

One new test brings the total to 726.

PR checkboxes

Tests 1-3 require a running instance with concurrent clients. The thread-safety test covers the core race condition. Checking boxes for what's verified:

# Test Result
1 Concurrent tool calls ✅ Covered by test_concurrent_access_creates_single_instance
2 Restart and lazy init ⏭️ Requires infra
3 Embedding worker race ⏭️ Requires infra

Verdict

Finding #1 (test count) needs a fix. Code is correct.

@cmeans cmeans added QA Failed QA found issues — needs dev attention and removed QA Active QA is actively reviewing; Dev should not push changes labels Apr 9, 2026
@cmeans
Copy link
Copy Markdown
Owner

cmeans commented Apr 9, 2026

Applying QA Failed — README test count 725→726.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cmeans-claude-dev
Copy link
Copy Markdown
Contributor Author

Dev response: Fixed — README test count updated 725 → 726 in 23a49f9.

@cmeans-claude-dev cmeans-claude-dev Bot added Dev Active Developer is actively working on this PR; QA should not start and removed QA Failed QA found issues — needs dev attention labels Apr 9, 2026
@github-actions github-actions Bot added the Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA label Apr 9, 2026
@cmeans-claude-dev cmeans-claude-dev Bot added Ready for QA Dev work complete — QA can begin review and removed Dev Active Developer is actively working on this PR; QA should not start labels Apr 9, 2026
@github-actions github-actions Bot removed the Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA label Apr 9, 2026
@cmeans cmeans added QA Active QA is actively reviewing; Dev should not push changes and removed Ready for QA Dev work complete — QA can begin review labels Apr 9, 2026
@cmeans
Copy link
Copy Markdown
Owner

cmeans commented Apr 9, 2026

Adding QA Active — Round 2: full verification including QA instance tests.

Copy link
Copy Markdown
Owner

@cmeans cmeans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QA Review — Round 2

README test count fixed (726). All 3 manual tests executed on a QA Docker instance:

Manual test results

# Test Result Details
1 Concurrent tool calls 5 parallel get_stats calls — all returned valid JSON, 0 errors in logs
2 Restart + lazy init docker restart → new session → get_stats → valid response, store initialized once
3 Embedding worker race remember call succeeded, entry created, 0 errors/duplicates/pool conflicts in logs

Automated checks

Check Result
Thread safety test (10-thread barrier race)
README test count matches (726)
726/726 full suite pass
CI all green

QA infrastructure cleaned up (container + DB dropped).

Zero findings. Verdict: Pass — ready for signoff.

@cmeans cmeans added Ready for QA Signoff QA passed — ready for maintainer final review and merge and removed QA Active QA is actively reviewing; Dev should not push changes labels Apr 9, 2026
@cmeans
Copy link
Copy Markdown
Owner

cmeans commented Apr 9, 2026

Applying Ready for QA Signoff — all 3 manual tests pass on QA instance, thread safety test pass, README count fixed, CI green, zero findings.

Copy link
Copy Markdown
Owner

@cmeans cmeans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QA Feedback — Test Coverage Gaps

The fix is correct and the barrier test proves the lock invariant. However, the test suite doesn't fully exercise the real-world race conditions that motivated #164. Recommending additional tests before merge:

1. Cleanup thread + request handler race

_do_cleanup runs on a background daemon thread and accesses the store. Add a test that triggers cleanup (via _schedule_cleanup) concurrently with tool call processing through _LazyStore, verifying no duplicate store creation.

2. Embedding worker + request handler race

When AWARENESS_EMBEDDING_PROVIDER is set, the embedding worker runs in a separate thread and accesses the store. A test simulating this path (e.g., remember triggering embedding while a parallel get_knowledge is in flight) would cover the actual subsystem interaction from #164.

3. Integration barrier test against real Postgres

The current barrier test uses _FakeStore. An integration variant that calls _create_store() (real PostgresStore + connection pool) from 10 threads would verify that the actual pool initialization is safe under contention, not just the lock mechanics.

These are the real access patterns that caused #164 — without them, we're testing the lock but not the bug.

@cmeans cmeans added QA Failed QA found issues — needs dev attention and removed Ready for QA Signoff QA passed — ready for maintainer final review and merge labels Apr 9, 2026
@cmeans-claude-dev cmeans-claude-dev Bot added Dev Active Developer is actively working on this PR; QA should not start and removed QA Failed QA found issues — needs dev attention labels Apr 9, 2026
Three additional tests covering the actual concurrent access patterns
from #164: cleanup daemon vs request handler, embedding worker vs
request handler, and 10-thread integration test with real PostgresStore.
Update README test count to 729.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA label Apr 9, 2026
@cmeans
Copy link
Copy Markdown
Owner

cmeans commented Apr 9, 2026

Dev response to QA Round 3 findings: Added 3 tests covering real-world race conditions: (1) cleanup daemon vs request handler, (2) embedding worker vs request handler, (3) 10-thread integration with real PostgresStore + connection pool. All use Event/Barrier synchronization. 729/729 pass, ruff clean. README updated to 729.

@cmeans-claude-dev
Copy link
Copy Markdown
Contributor Author

Dev response to QA Round 3 coverage gaps:

All three requested tests added in be4bd5c:

1. Cleanup thread + request handler race (test_cleanup_thread_and_request_handler_race)
Two threads synchronized via threading.Event — one simulates a daemon cleanup thread accessing store.get_stats, the other simulates a request handler accessing store.add. Both race through _LazyStore.__getattr__. Asserts _create_store called exactly once. Uses real PostgresStore (via pg_dsn fixture).

2. Embedding worker + request handler race (test_embedding_worker_and_request_handler_race)
Same pattern — one thread simulates an embedding pool worker accessing store.upsert_embedding, the other a request thread accessing store.get_knowledge. Asserts single store creation. Real PostgresStore.

3. Integration barrier test with real Postgres (test_concurrent_real_postgres_store_creation)
10 threads synchronized via threading.Barrier, all racing into __getattr__ simultaneously. Each path calls the real _create_store() (actual PostgresStore + psycopg connection pool). Asserts exactly 1 store created.

All three are automated in CI — no manual QA steps needed for this round. Code in server.py is unchanged from Round 2.

@cmeans-claude-dev cmeans-claude-dev Bot added Ready for QA Dev work complete — QA can begin review and removed Dev Active Developer is actively working on this PR; QA should not start labels Apr 9, 2026
@github-actions github-actions Bot removed the Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA label Apr 9, 2026
@cmeans cmeans removed the Ready for QA Dev work complete — QA can begin review label Apr 9, 2026
@cmeans
Copy link
Copy Markdown
Owner

cmeans commented Apr 9, 2026

Adding QA Active — Round 3: reviewing additional thread race tests.

@cmeans cmeans added the QA Active QA is actively reviewing; Dev should not push changes label Apr 9, 2026
Copy link
Copy Markdown
Owner

@cmeans cmeans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QA Review — Round 3

All three recommended test coverage gaps addressed:

Test What it covers Result
test_cleanup_thread_and_request_handler_race Daemon cleanup thread vs request handler, real _create_store
test_embedding_worker_and_request_handler_race Embedding worker vs request handler, real _create_store
test_concurrent_real_postgres_store_creation 10-thread barrier with real PostgresStore + connection pool

All three use _tracking_create_store wrapping the real factory — they exercise actual Postgres pool creation under contention, not just lock mechanics.

  • 4/4 thread safety tests pass
  • 729/729 full suite pass
  • README: 729
  • CI green

Zero findings. Verdict: Pass — ready for signoff.

@cmeans cmeans added Ready for QA Signoff QA passed — ready for maintainer final review and merge and removed QA Active QA is actively reviewing; Dev should not push changes labels Apr 9, 2026
@cmeans
Copy link
Copy Markdown
Owner

cmeans commented Apr 9, 2026

Applying Ready for QA Signoff — all test coverage gaps addressed, 4/4 thread safety tests pass with real Postgres, 729/729 total, CI green.

Copy link
Copy Markdown
Owner

@cmeans cmeans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cmeans cmeans added QA Approved Manual QA testing completed and passed and removed Ready for QA Signoff QA passed — ready for maintainer final review and merge labels Apr 9, 2026
@cmeans-claude-dev cmeans-claude-dev Bot merged commit 600f115 into main Apr 9, 2026
43 checks passed
@cmeans-claude-dev cmeans-claude-dev Bot deleted the fix/lazy-store-thread-safety branch April 9, 2026 18:54
cmeans pushed a commit that referenced this pull request Apr 10, 2026
6 bug-fix PRs since v0.16.1:
- #226 LazyStore thread safety (#164)
- #227 SQL template injection hardening (#165)
- #232 Stateless HTTP mode (#180)
- #233 GET /mcp intercept + request logger (#178)
- #234 delete_entry IDOR fix (#193)
- #236 RLS-safe opt-in cleanup (#179, #183)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@cmeans-claude-dev cmeans-claude-dev Bot mentioned this pull request Apr 10, 2026
3 tasks
cmeans-claude-dev Bot added a commit that referenced this pull request Apr 10, 2026
6 bug-fix PRs since v0.16.1:
- #226 LazyStore thread safety (#164)
- #227 SQL template injection hardening (#165)
- #232 Stateless HTTP mode (#180)
- #233 GET /mcp intercept + request logger (#178)
- #234 delete_entry IDOR fix (#193)
- #236 RLS-safe opt-in cleanup (#179, #183)

Co-authored-by: cmeans-claude-dev[bot] <3223881+cmeans-claude-dev[bot]@users.noreply.github.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working QA Approved Manual QA testing completed and passed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

bug: _LazyStore thread safety race condition in server.py

1 participant