fix(cli): lock global.yaml read-modify-write (closes #93)#101
Conversation
…loses #93) `atomic_write_text` (#69) prevents torn writes, but two callers can each load `~/.local/state/mcp-synology/global.yaml`, mutate distinct keys, and save atomically — and the later writer's save loses the earlier writer's update because each computed its new state from a stale read. Add `_with_global_state_lock()` (Unix `fcntl.flock(LOCK_EX)`; no-op on Windows per #93's single-user-dev-box risk surface) and wrap every load → mutate → save site: the three flag handlers and the startup version-tracking block in `cli/main.py`, the auto-upgrade-trigger save in `cli/main.py`, the `_do_auto_upgrade` and `_do_revert` post-subprocess saves in `cli/version.py`, and the background update-task save in `server.py`. `_do_auto_upgrade` and `_do_revert` re-load fresh state under the lock at save time rather than mutating the pre-subprocess copy, so the lock never spans `subprocess.run`. `_do_auto_upgrade` loses its now-unused `state` parameter. Regression test runs two threads each incrementing a distinct counter 50 times under the lock; both counters reach exactly 50 (without the lock, lost updates leave at least one below 50). 605 unit tests pass at 96.22% coverage. Closes the deferred 8th item from #45 — the original "one-line flock" framing was wrong because save-side locking doesn't prevent lost updates. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Follow-up to the previous commit — the entry was written before the PR number was assigned, so the placeholder ships with the same PR. Adding this as a separate commit (rather than amending) to preserve push history on the PR branch. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
[QA] Starting review at |
cmeans
left a comment
There was a problem hiding this comment.
[QA] Round 1 — FAILED (one observation)
Verified at 1f688b0 against main a2e206d (post-#100). The fix correctly addresses #93's lost-update race for the synchronous CLI flows; one usage pattern in the async background path is brittle and should be tightened before signoff.
Static checks
uv run ruff check src/ tests/→ cleanuv run ruff format --check src/ tests/→ 69 files already formatteduv run mypy --strict src/→ Success: no issues found in 28 source files
Tests
uv run pytest→ 605 passed, 112 deselected, 18 warnings, 0 failed in 23.6s (matches PR body)- Coverage 96.22% (matches PR body)
- 112 deselected =
addopts = "-m 'not integration and not vdsm'"; vdsm CI ran SUCCESS at this SHA. - TDD-verified the regression test: stripped the
fcntl.flockcall to a no-op in_with_global_state_lock, reranTestGlobalStateLock::test_concurrent_writers_preserve_both_updates— fails withFileNotFoundErrorfromatomic_write_text'sos.replacerace (concurrent threads losing each other's.tmpfiles); restored, both new tests pass. Different failure mode than a counter-mismatch but the test does protect against the race.
Issue #93 acceptance criteria
- Option A (
_with_global_state_lock()context manager,fcntl.flock(LOCK_EX)on Unix, no-op on Windows) — chosen and implemented atcli/version.py:126-154. ✓ - All three issue-cited caller sites use the new pattern, plus three more not in the issue:
cli/main.py:53—--check-updateflagcli/main.py:72—--auto-upgrade enable/disableflagcli/main.py:85— startup version-trackingcli/main.py:97— auto-upgrade triggercli/version.py:211—_do_auto_upgradepost-subprocess savecli/version.py:259—_do_revertpost-subprocess saveserver.py:214— background update task
✓ (six load→mutate→save sites; the only post-subprocess sites correctly re-load fresh state under the lock so the lock never spans asubprocess.run)
- Unit test for two interleaved write attempts:
TestGlobalStateLock::test_concurrent_writers_preserve_both_updates(threads). ✓ - CHANGELOG
### Fixedentry, references#101(PR_PLACEHOLDER substituted in1f688b0). ✓
QA spot-check — back-to-back mcp-synology --check-update then mcp-synology --auto-upgrade enable under a tmp HOME:
last_version_check: '2026-05-06T19:57:21.382240+00:00'
latest_known_version: 0.5.2
auto_upgrade: trueAll three keys present. Lockfile created at <state_dir>/global.yaml.lock (0 bytes, expected). ✓
O1 (observation) — server.py:214-218 holds fcntl.flock across an await
with _with_global_state_lock():
gstate = _load_global_state()
async with asyncio.timeout(10):
latest = await loop.run_in_executor(None, _check_for_update, gstate)
_save_global_state(gstate)The synchronous fcntl.flock(LOCK_EX) is held while the coroutine yields control via await. fcntl.flock is per-OFD on Linux, so two coroutines opening the lockfile each get a fresh fd. If a future async coroutine in the same event loop ever calls with _with_global_state_lock():, that synchronous flock call blocks the event loop thread — and since the original coroutine needs the event loop to resume after the executor completes, you get a deadlock (event loop blocked on flock that only releases when the original coroutine resumes that only happens when the event loop runs).
Today the pattern is safe because _bg_update_check is the sole async caller, fired exactly once per server lifetime via asyncio.create_task at server.py:123 (idempotent — get_client()'s asyncio.Lock + self._client is None check prevents re-entry). So there's no concrete deadlock today. But the constraint ("only one async caller of _with_global_state_lock allowed in this event loop") is unobvious and would be easy to violate when adding the next async path that touches global.yaml (e.g., #75 ServerState lifecycle, which the #93 issue itself anticipates).
Two acceptable resolutions:
-
Move the critical section into the executor — strictly safer, no future footgun:
def _do_check() -> str | None: with _with_global_state_lock(): g = _load_global_state() latest = _check_for_update(g) _save_global_state(g) return latest async with asyncio.timeout(10): latest = await loop.run_in_executor(None, _do_check)
The
asyncio.timeout(10)bound is preserved; the lock duration shrinks to just the_do_checkbody (no longer spans event-loop scheduling jitter); and the lock-across-await footgun is gone. -
Document the constraint — add a comment in
_bg_update_checkand/or a docstring note in_with_global_state_locksaying "synchronous lock — do not call from a coroutine that awaits while holding it; route throughrun_in_executor". Lighter-weight, relies on future authors reading.
Either resolution works for me. Option 1 is what I'd reach for; option 2 is fine if you prefer minimal diff. Push back if you read this differently — the deadlock path is theoretical and this might be over-engineering for a one-shot startup task.
Transparency notes (not findings)
- TOCTOU windows exist between sequential lock blocks in
cli/main.py(e.g., thestate.get("auto_upgrade")check at line 96 reads a post-lock-release value, the_do_revertdecision atcli/version.py:223readsprevious_versionoutside any lock). All consequences are benign — last-writer-wins on user-action data, with the eventual save under a lock catching real corruption. Not worth code churn. - Lock held across PyPI urlopen (5s timeout) in
_check_for_update. Other writers wait worst-case ~5s. Acceptable for a one-time startup check;--check-updatefrom the CLI hits the same window but is also a manual one-off action.
CI is fully green at 1f688b0 (12 SUCCESS / 2 SKIPPED required, vdsm SUCCESS). Test plan checkboxes 3 and 4 (vdsm + QA spot-check) ticked pre-audit.
Applying QA Failed as the final act for the one observation. Address O1 (either option), push, and re-request — should be a quick round 2.
…1, O1)
The previous shape held a synchronous fcntl.flock across `await
loop.run_in_executor(...)`. That's safe today because `_bg_update_check`
is the sole async caller of `_with_global_state_lock` on this event loop,
but the constraint ("only one async caller of the lock per event loop")
is unobvious and would be easy to violate the next time something async
touches global.yaml — e.g. #75's ServerState lifecycle, which #93 itself
anticipates.
Move the entire load → check → save sequence into the executor via a
local sync function. The asyncio.timeout(10) bound is preserved, the
lock duration shrinks to just the inner block (no event-loop scheduling
jitter), and the lock-across-await footgun is gone.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Round 2 — O1 addressedTook option 1 from the QA review (move the critical section into the executor) at Also added a comment block at the call site explaining why the placement matters, since the constraint ("don't hold the sync lock across awaits on this event loop") would otherwise be invisible to a future contributor adding the next async caller — exactly the footgun Round 1 flagged. The CHANGELOG entry now mentions the executor placement and the #75 anticipation. Locally green: ruff/format/mypy clean, 605 passed at 96.22% coverage (no test changes needed — |
|
[QA] Starting round 2 review at |
cmeans
left a comment
There was a problem hiding this comment.
[QA] Round 2 — PASS
Verified at 87f5cd7 (single new commit on top of round-1 head 1f688b0). Round-1 O1 cleanly addressed.
Diff scope
src/mcp_synology/server.pyonly —_bg_update_check's critical section moved inside a synchronous_check_under_lockclosure that runs entirely inloop.run_in_executor(None, ...). Thefcntl.flockis acquired and released on the executor thread; the event loop thread only waits on the executor future. The deadlock pathway from round 1 is gone — even if a future async caller takes the lock on the same event loop, the executor thread's lock acquisition won't compete with it.- The new comment block explicitly references #75's
ServerStatelifecycle as the anticipated future trigger. Constraint is now self-documenting.
Re-verification on 87f5cd7
uv run ruff check src/ tests/→ cleanuv run ruff format --check src/ tests/→ 69 files already formatteduv run mypy --strict src/→ Success: no issues found in 28 source filesuv run pytest→ 605 passed, 112 deselected, 18 warnings, 0 failed in 23.9s. Coverage 96.22%. (Identical totals to round 1; the change is non-functional from the test suite's perspective.)- Spot-check re-run:
mcp-synology --check-updatethen--auto-upgrade enableunder tmp HOME —global.yamlends up withlast_version_check,latest_known_version, andauto_upgrade: true. ✓ - CI rollup at
87f5cd7: 12 SUCCESS / 5 SKIPPED required incl. vdsm integration tests SUCCESS.
Zero blockers, zero substantive, zero observations. Test-plan checkboxes stay ticked from round 1 (test plan steps unchanged). Applying Ready for QA Signoff as the final act, removing QA Active in the same call. Awaiting maintainer's QA Approved.
Summary
_with_global_state_lock()(Unixfcntl.flock(LOCK_EX); no-op on Windows per Read-modify-write race on ~/.local/state/mcp-synology/global.yaml #93's single-user-dev-box risk surface) and wraps everyload → mutate → savesite touching~/.local/state/mcp-synology/global.yamlso concurrent writers can no longer lose each other's updates._do_auto_upgradeand_do_revertnow re-load fresh state under the lock at save time rather than mutating the pre-subprocess copy, so the lock never spanssubprocess.run._do_auto_upgradeloses its now-unusedstateparameter.flock" framing was wrong because save-side locking doesn't prevent lost updates; the lock has to span the entire critical section.Why
atomic_write_text(#69) prevents torn writes viaos.replace, but the read-modify-write sequence has no synchronization. Three independent callers (main process startup, background update task, manual--check-update) can each load the file, mutate distinct keys, save atomically — and the later writer's save overwrites the earlier writer's change because each computed its new state from a stale read.Sites wrapped
cli/main.py—--check-update,--auto-upgrade, startup version-tracking, auto-upgrade trigger.cli/version.py—_do_auto_upgradeand_do_revertpost-subprocess saves.server.py— background update task.Test plan
uv run pytest -q— 605 passed (up from 603), 96.22% coverage. New regression testTestGlobalStateLock::test_concurrent_writers_preserve_both_updatesruns two threads each incrementing a distinct counter 50 times under the lock; both counters reach exactly 50 (without the lock, lost updates leave at least one below 50).uv run ruff check/ruff format --check/mypy --strict src/— all clean.mcp-synology --check-updateandmcp-synology --auto-upgrade enableback-to-back; confirmglobal.yamlends up with bothlast_version_checkandauto_upgrade: trueset (no lost update).🤖 Generated with Claude Code