Skip to content

fix(cli): lock global.yaml read-modify-write (closes #93)#101

Merged
cmeans-claude-dev[bot] merged 3 commits into
mainfrom
fix/global-state-rmw-race
May 6, 2026
Merged

fix(cli): lock global.yaml read-modify-write (closes #93)#101
cmeans-claude-dev[bot] merged 3 commits into
mainfrom
fix/global-state-rmw-race

Conversation

@cmeans-claude-dev
Copy link
Copy Markdown
Contributor

@cmeans-claude-dev cmeans-claude-dev Bot commented May 6, 2026

Summary

  • Adds _with_global_state_lock() (Unix fcntl.flock(LOCK_EX); no-op on Windows per Read-modify-write race on ~/.local/state/mcp-synology/global.yaml #93's single-user-dev-box risk surface) and wraps every load → mutate → save site touching ~/.local/state/mcp-synology/global.yaml so concurrent writers can no longer lose each other's updates.
  • _do_auto_upgrade and _do_revert now re-load fresh state under the lock at save time rather than mutating the pre-subprocess copy, so the lock never spans subprocess.run. _do_auto_upgrade loses its now-unused state parameter.
  • Closes the deferred 8th item of Tidy-up bundle: test duplication, setup input stripping, fallback truthiness, minor robustness #45 — the original "one-line flock" framing was wrong because save-side locking doesn't prevent lost updates; the lock has to span the entire critical section.

Why

atomic_write_text (#69) prevents torn writes via os.replace, but the read-modify-write sequence has no synchronization. Three independent callers (main process startup, background update task, manual --check-update) can each load the file, mutate distinct keys, save atomically — and the later writer's save overwrites the earlier writer's change because each computed its new state from a stale read.

Sites wrapped

  • cli/main.py--check-update, --auto-upgrade, startup version-tracking, auto-upgrade trigger.
  • cli/version.py_do_auto_upgrade and _do_revert post-subprocess saves.
  • server.py — background update task.

Test plan

  • uv run pytest -q — 605 passed (up from 603), 96.22% coverage. New regression test TestGlobalStateLock::test_concurrent_writers_preserve_both_updates runs two threads each incrementing a distinct counter 50 times under the lock; both counters reach exactly 50 (without the lock, lost updates leave at least one below 50).
  • uv run ruff check / ruff format --check / mypy --strict src/ — all clean.
  • vdsm integration tests pass on CI.
  • QA spot-check: run mcp-synology --check-update and mcp-synology --auto-upgrade enable back-to-back; confirm global.yaml ends up with both last_version_check and auto_upgrade: true set (no lost update).

🤖 Generated with Claude Code

…loses #93)

`atomic_write_text` (#69) prevents torn writes, but two callers can each
load `~/.local/state/mcp-synology/global.yaml`, mutate distinct keys, and
save atomically — and the later writer's save loses the earlier writer's
update because each computed its new state from a stale read.

Add `_with_global_state_lock()` (Unix `fcntl.flock(LOCK_EX)`; no-op on
Windows per #93's single-user-dev-box risk surface) and wrap every
load → mutate → save site: the three flag handlers and the startup
version-tracking block in `cli/main.py`, the auto-upgrade-trigger save
in `cli/main.py`, the `_do_auto_upgrade` and `_do_revert` post-subprocess
saves in `cli/version.py`, and the background update-task save in
`server.py`. `_do_auto_upgrade` and `_do_revert` re-load fresh state
under the lock at save time rather than mutating the pre-subprocess copy,
so the lock never spans `subprocess.run`. `_do_auto_upgrade` loses its
now-unused `state` parameter.

Regression test runs two threads each incrementing a distinct counter
50 times under the lock; both counters reach exactly 50 (without the
lock, lost updates leave at least one below 50). 605 unit tests pass
at 96.22% coverage.

Closes the deferred 8th item from #45 — the original "one-line flock"
framing was wrong because save-side locking doesn't prevent lost updates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA Ready for QA Dev work complete — QA can begin review and removed Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA labels May 6, 2026
Follow-up to the previous commit — the entry was written before the PR
number was assigned, so the placeholder ships with the same PR. Adding
this as a separate commit (rather than amending) to preserve push history
on the PR branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA Ready for QA Dev work complete — QA can begin review and removed Ready for QA Dev work complete — QA can begin review Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA labels May 6, 2026
@codecov-commenter
Copy link
Copy Markdown

codecov-commenter commented May 6, 2026

Codecov Report

❌ Patch coverage is 96.61017% with 2 lines in your changes missing coverage. Please review.

Files with missing lines Patch % Lines
src/mcp_synology/cli/version.py 93.54% 2 Missing ⚠️

📢 Thoughts on this report? Let us know!

@cmeans cmeans added the QA Active QA is actively reviewing; Dev should not push changes label May 6, 2026
@cmeans
Copy link
Copy Markdown
Owner

cmeans commented May 6, 2026

[QA] Starting review at 1f688b0. Will verify lock semantics, exception cleanup, full call-site coverage, and run the spot-check on --check-update/--auto-upgrade interleave.

@github-actions github-actions Bot removed the Ready for QA Dev work complete — QA can begin review label May 6, 2026
Copy link
Copy Markdown
Owner

@cmeans cmeans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[QA] Round 1 — FAILED (one observation)

Verified at 1f688b0 against main a2e206d (post-#100). The fix correctly addresses #93's lost-update race for the synchronous CLI flows; one usage pattern in the async background path is brittle and should be tightened before signoff.

Static checks

  • uv run ruff check src/ tests/ → clean
  • uv run ruff format --check src/ tests/ → 69 files already formatted
  • uv run mypy --strict src/ → Success: no issues found in 28 source files

Tests

  • uv run pytest605 passed, 112 deselected, 18 warnings, 0 failed in 23.6s (matches PR body)
  • Coverage 96.22% (matches PR body)
  • 112 deselected = addopts = "-m 'not integration and not vdsm'"; vdsm CI ran SUCCESS at this SHA.
  • TDD-verified the regression test: stripped the fcntl.flock call to a no-op in _with_global_state_lock, reran TestGlobalStateLock::test_concurrent_writers_preserve_both_updates — fails with FileNotFoundError from atomic_write_text's os.replace race (concurrent threads losing each other's .tmp files); restored, both new tests pass. Different failure mode than a counter-mismatch but the test does protect against the race.

Issue #93 acceptance criteria

  • Option A (_with_global_state_lock() context manager, fcntl.flock(LOCK_EX) on Unix, no-op on Windows) — chosen and implemented at cli/version.py:126-154. ✓
  • All three issue-cited caller sites use the new pattern, plus three more not in the issue:
    • cli/main.py:53--check-update flag
    • cli/main.py:72--auto-upgrade enable/disable flag
    • cli/main.py:85 — startup version-tracking
    • cli/main.py:97 — auto-upgrade trigger
    • cli/version.py:211_do_auto_upgrade post-subprocess save
    • cli/version.py:259_do_revert post-subprocess save
    • server.py:214 — background update task
      ✓ (six load→mutate→save sites; the only post-subprocess sites correctly re-load fresh state under the lock so the lock never spans a subprocess.run)
  • Unit test for two interleaved write attempts: TestGlobalStateLock::test_concurrent_writers_preserve_both_updates (threads). ✓
  • CHANGELOG ### Fixed entry, references #101 (PR_PLACEHOLDER substituted in 1f688b0). ✓

QA spot-check — back-to-back mcp-synology --check-update then mcp-synology --auto-upgrade enable under a tmp HOME:

last_version_check: '2026-05-06T19:57:21.382240+00:00'
latest_known_version: 0.5.2
auto_upgrade: true

All three keys present. Lockfile created at <state_dir>/global.yaml.lock (0 bytes, expected). ✓


O1 (observation) — server.py:214-218 holds fcntl.flock across an await

with _with_global_state_lock():
    gstate = _load_global_state()
    async with asyncio.timeout(10):
        latest = await loop.run_in_executor(None, _check_for_update, gstate)
    _save_global_state(gstate)

The synchronous fcntl.flock(LOCK_EX) is held while the coroutine yields control via await. fcntl.flock is per-OFD on Linux, so two coroutines opening the lockfile each get a fresh fd. If a future async coroutine in the same event loop ever calls with _with_global_state_lock():, that synchronous flock call blocks the event loop thread — and since the original coroutine needs the event loop to resume after the executor completes, you get a deadlock (event loop blocked on flock that only releases when the original coroutine resumes that only happens when the event loop runs).

Today the pattern is safe because _bg_update_check is the sole async caller, fired exactly once per server lifetime via asyncio.create_task at server.py:123 (idempotent — get_client()'s asyncio.Lock + self._client is None check prevents re-entry). So there's no concrete deadlock today. But the constraint ("only one async caller of _with_global_state_lock allowed in this event loop") is unobvious and would be easy to violate when adding the next async path that touches global.yaml (e.g., #75 ServerState lifecycle, which the #93 issue itself anticipates).

Two acceptable resolutions:

  1. Move the critical section into the executor — strictly safer, no future footgun:

    def _do_check() -> str | None:
        with _with_global_state_lock():
            g = _load_global_state()
            latest = _check_for_update(g)
            _save_global_state(g)
        return latest
    async with asyncio.timeout(10):
        latest = await loop.run_in_executor(None, _do_check)

    The asyncio.timeout(10) bound is preserved; the lock duration shrinks to just the _do_check body (no longer spans event-loop scheduling jitter); and the lock-across-await footgun is gone.

  2. Document the constraint — add a comment in _bg_update_check and/or a docstring note in _with_global_state_lock saying "synchronous lock — do not call from a coroutine that awaits while holding it; route through run_in_executor". Lighter-weight, relies on future authors reading.

Either resolution works for me. Option 1 is what I'd reach for; option 2 is fine if you prefer minimal diff. Push back if you read this differently — the deadlock path is theoretical and this might be over-engineering for a one-shot startup task.


Transparency notes (not findings)

  • TOCTOU windows exist between sequential lock blocks in cli/main.py (e.g., the state.get("auto_upgrade") check at line 96 reads a post-lock-release value, the _do_revert decision at cli/version.py:223 reads previous_version outside any lock). All consequences are benign — last-writer-wins on user-action data, with the eventual save under a lock catching real corruption. Not worth code churn.
  • Lock held across PyPI urlopen (5s timeout) in _check_for_update. Other writers wait worst-case ~5s. Acceptable for a one-time startup check; --check-update from the CLI hits the same window but is also a manual one-off action.

CI is fully green at 1f688b0 (12 SUCCESS / 2 SKIPPED required, vdsm SUCCESS). Test plan checkboxes 3 and 4 (vdsm + QA spot-check) ticked pre-audit.

Applying QA Failed as the final act for the one observation. Address O1 (either option), push, and re-request — should be a quick round 2.

@cmeans cmeans added QA Failed QA found issues — needs dev attention and removed QA Active QA is actively reviewing; Dev should not push changes labels May 6, 2026
…1, O1)

The previous shape held a synchronous fcntl.flock across `await
loop.run_in_executor(...)`. That's safe today because `_bg_update_check`
is the sole async caller of `_with_global_state_lock` on this event loop,
but the constraint ("only one async caller of the lock per event loop")
is unobvious and would be easy to violate the next time something async
touches global.yaml — e.g. #75's ServerState lifecycle, which #93 itself
anticipates.

Move the entire load → check → save sequence into the executor via a
local sync function. The asyncio.timeout(10) bound is preserved, the
lock duration shrinks to just the inner block (no event-loop scheduling
jitter), and the lock-across-await footgun is gone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@cmeans-claude-dev
Copy link
Copy Markdown
Contributor Author

Round 2 — O1 addressed

Took option 1 from the QA review (move the critical section into the executor) at 87f5cd7. The full load → check → save now runs inside run_in_executor via a local sync _check_under_lock(), so the synchronous fcntl.flock no longer spans an await. asyncio.timeout(10) bound preserved.

Also added a comment block at the call site explaining why the placement matters, since the constraint ("don't hold the sync lock across awaits on this event loop") would otherwise be invisible to a future contributor adding the next async caller — exactly the footgun Round 1 flagged. The CHANGELOG entry now mentions the executor placement and the #75 anticipation.

Locally green: ruff/format/mypy clean, 605 passed at 96.22% coverage (no test changes needed — _bg_update_check's direct test path was unaffected by moving the body inside the executor closure). Re-requesting QA.

@cmeans-claude-dev cmeans-claude-dev Bot removed the QA Failed QA found issues — needs dev attention label May 6, 2026
@github-actions github-actions Bot added Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA Ready for QA Dev work complete — QA can begin review and removed Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA labels May 6, 2026
@cmeans cmeans added the QA Active QA is actively reviewing; Dev should not push changes label May 6, 2026
@cmeans
Copy link
Copy Markdown
Owner

cmeans commented May 6, 2026

[QA] Starting round 2 review at 87f5cd7. Re-running lint/mypy/pytest and the spot-check; re-verifying O1 from round 1.

@github-actions github-actions Bot removed the Ready for QA Dev work complete — QA can begin review label May 6, 2026
Copy link
Copy Markdown
Owner

@cmeans cmeans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[QA] Round 2 — PASS

Verified at 87f5cd7 (single new commit on top of round-1 head 1f688b0). Round-1 O1 cleanly addressed.

Diff scope

  • src/mcp_synology/server.py only — _bg_update_check's critical section moved inside a synchronous _check_under_lock closure that runs entirely in loop.run_in_executor(None, ...). The fcntl.flock is acquired and released on the executor thread; the event loop thread only waits on the executor future. The deadlock pathway from round 1 is gone — even if a future async caller takes the lock on the same event loop, the executor thread's lock acquisition won't compete with it.
  • The new comment block explicitly references #75's ServerState lifecycle as the anticipated future trigger. Constraint is now self-documenting.

Re-verification on 87f5cd7

  • uv run ruff check src/ tests/ → clean
  • uv run ruff format --check src/ tests/ → 69 files already formatted
  • uv run mypy --strict src/ → Success: no issues found in 28 source files
  • uv run pytest605 passed, 112 deselected, 18 warnings, 0 failed in 23.9s. Coverage 96.22%. (Identical totals to round 1; the change is non-functional from the test suite's perspective.)
  • Spot-check re-run: mcp-synology --check-update then --auto-upgrade enable under tmp HOME — global.yaml ends up with last_version_check, latest_known_version, and auto_upgrade: true. ✓
  • CI rollup at 87f5cd7: 12 SUCCESS / 5 SKIPPED required incl. vdsm integration tests SUCCESS.

Zero blockers, zero substantive, zero observations. Test-plan checkboxes stay ticked from round 1 (test plan steps unchanged). Applying Ready for QA Signoff as the final act, removing QA Active in the same call. Awaiting maintainer's QA Approved.

@cmeans cmeans added Ready for QA Signoff QA passed — ready for maintainer final review and merge and removed QA Active QA is actively reviewing; Dev should not push changes labels May 6, 2026
Copy link
Copy Markdown
Owner

@cmeans cmeans left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cmeans cmeans removed the Ready for QA Signoff QA passed — ready for maintainer final review and merge label May 6, 2026
@cmeans cmeans added the QA Approved Manual QA testing completed and passed label May 6, 2026
@cmeans-claude-dev cmeans-claude-dev Bot merged commit fe5c3cc into main May 6, 2026
38 checks passed
@cmeans-claude-dev cmeans-claude-dev Bot deleted the fix/global-state-rmw-race branch May 6, 2026 20:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

QA Approved Manual QA testing completed and passed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants