fix(cli): lock global.yaml read-modify-write (closes #93) by cmeans-claude-dev[bot] · Pull Request #101 · cmeans/mcp-synology

cmeans-claude-dev · 2026-05-06T19:47:01Z

Summary

Adds _with_global_state_lock() (fcntl.flock(LOCK_EX) on Unix; a no-op on Windows, per the single-user-dev-box risk surface described in #93, "Read-modify-write race on ~/.local/state/mcp-synology/global.yaml") and wraps every load → mutate → save site touching ~/.local/state/mcp-synology/global.yaml so concurrent writers can no longer lose each other's updates.
_do_auto_upgrade and _do_revert now re-load fresh state under the lock at save time rather than mutating the pre-subprocess copy, so the lock never spans subprocess.run. _do_auto_upgrade loses its now-unused state parameter.
Closes the deferred 8th item of #45 (Tidy-up bundle: test duplication, setup input stripping, fallback truthiness, minor robustness). The original "one-line flock" framing was wrong because save-side locking alone doesn't prevent lost updates; the lock has to span the entire critical section.
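
For readers unfamiliar with the pattern, here is a minimal sketch of a flock-based lock and of the "run the subprocess first, then re-load under the lock" shape described above. It is not the PR's code: state_lock, load_state, save_state, and upgrade_then_record are hypothetical names, JSON stands in for the real YAML file, and the subprocess is a placeholder for the actual upgrade command.

```python
import contextlib
import json
import os
import subprocess
import sys
from collections.abc import Iterator
from pathlib import Path

try:
    import fcntl  # available on Unix only
except ImportError:  # e.g. Windows
    fcntl = None


@contextlib.contextmanager
def state_lock(state_file: Path) -> Iterator[None]:
    """Hold an exclusive advisory lock on a sidecar '<name>.lock' file (hypothetical helper)."""
    if fcntl is None:
        yield  # no-op where fcntl is unavailable
        return
    lock_path = state_file.with_name(state_file.name + ".lock")
    with open(lock_path, "w") as lock_fd:
        fcntl.flock(lock_fd, fcntl.LOCK_EX)  # blocks until no other holder remains
        try:
            yield
        finally:
            fcntl.flock(lock_fd, fcntl.LOCK_UN)


def load_state(state_file: Path) -> dict:
    return json.loads(state_file.read_text()) if state_file.exists() else {}


def save_state(state_file: Path, state: dict) -> None:
    # Write to a temp file, then rename over the target, so readers never see a torn file.
    tmp = state_file.with_name(state_file.name + ".tmp")
    tmp.write_text(json.dumps(state))
    os.replace(tmp, state_file)


def upgrade_then_record(state_file: Path, new_version: str) -> None:
    # The slow step runs with no lock held (placeholder for the real upgrade command).
    subprocess.run([sys.executable, "-c", "pass"], check=True)
    # Only the short load -> mutate -> save sequence is locked, and it starts
    # from a fresh read so concurrent changes to other keys are preserved.
    with state_lock(state_file):
        state = load_state(state_file)
        state["previous_version"] = state.get("current_version")
        state["current_version"] = new_version
        save_state(state_file, state)
```

Keeping the subprocess outside the with block means other writers wait only for the brief file update, never for the upgrade itself.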

Why

atomic_write_text (#69) prevents torn writes via os.replace, but the read-modify-write sequence has no synchronization. Three independent callers (main process startup, background update task, manual --check-update) can each load the file, mutate distinct keys, save atomically — and the later writer's save overwrites the earlier writer's change because each computed its new state from a stale read.
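
The interleaving described above can be reproduced deterministically without threads. The sketch below is illustrative only: atomic_write, load, and demo_lost_update are hypothetical names, and JSON stands in for the YAML state file. Each write is atomic, yet the second writer still erases the first writer's key because both started from the same snapshot.

```python
import json
import os
from pathlib import Path


def atomic_write(path: Path, text: str) -> None:
    # Temp file plus os.replace: readers never observe a half-written file.
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(text)
    os.replace(tmp, path)


def load(path: Path) -> dict:
    return json.loads(path.read_text()) if path.exists() else {}


def demo_lost_update(path: Path) -> dict:
    a = load(path)  # caller A reads the state
    b = load(path)  # caller B reads the same (soon stale) snapshot
    a["last_version_check"] = "2026-05-06"
    atomic_write(path, json.dumps(a))  # A saves its key
    b["auto_upgrade"] = True
    atomic_write(path, json.dumps(b))  # B saves from the stale copy and drops A's key
    return load(path)
```

The final file holds only B's key, which is the lost update that a lock spanning the whole read-modify-write sequence is meant to prevent.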

Sites wrapped

cli/main.py — --check-update, --auto-upgrade, startup version-tracking, auto-upgrade trigger.
cli/version.py — _do_auto_upgrade and _do_revert post-subprocess saves.
server.py — background update task.

Test plan

uv run pytest -q — 605 passed (up from 603), 96.22% coverage. New regression test TestGlobalStateLock::test_concurrent_writers_preserve_both_updates runs two threads each incrementing a distinct counter 50 times under the lock; both counters reach exactly 50 (without the lock, lost updates leave at least one below 50).
uv run ruff check / ruff format --check / mypy --strict src/ — all clean.
vdsm integration tests pass on CI.
QA spot-check: run mcp-synology --check-update and mcp-synology --auto-upgrade enable back-to-back; confirm global.yaml ends up with both last_version_check and auto_upgrade: true set (no lost update).
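
A rough sketch of the shape of the regression test described above, under stated assumptions: it is Unix-only (it imports fcntl directly), uses JSON instead of YAML, and state_lock, bump, and run_concurrent_writers are hypothetical names rather than the project's test code.

```python
import fcntl
import json
import os
import threading
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def state_lock(path: Path):
    # Each entry opens the lock file afresh, so threads contend with each other too.
    with open(str(path) + ".lock", "w") as fd:
        fcntl.flock(fd, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)


def bump(path: Path, key: str, times: int) -> None:
    for _ in range(times):
        with state_lock(path):
            # The whole load -> mutate -> save sequence happens under the lock.
            state = json.loads(path.read_text())
            state[key] = state.get(key, 0) + 1
            tmp = path.with_name(path.name + ".tmp")
            tmp.write_text(json.dumps(state))
            os.replace(tmp, path)


def run_concurrent_writers(path: Path, times: int = 50) -> dict:
    path.write_text("{}")
    threads = [threading.Thread(target=bump, args=(path, key, times)) for key in ("a", "b")]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return json.loads(path.read_text())
```

With the lock in place, both counters should end at the requested count because every increment starts from the latest saved state.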

🤖 Generated with Claude Code

…loses #93)

`atomic_write_text` (#69) prevents torn writes, but two callers can each load `~/.local/state/mcp-synology/global.yaml`, mutate distinct keys, and save atomically, and the later writer's save loses the earlier writer's update because each computed its new state from a stale read.

Add `_with_global_state_lock()` (Unix `fcntl.flock(LOCK_EX)`; no-op on Windows per #93's single-user-dev-box risk surface) and wrap every load → mutate → save site: the three flag handlers and the startup version-tracking block in `cli/main.py`, the auto-upgrade-trigger save in `cli/main.py`, the `_do_auto_upgrade` and `_do_revert` post-subprocess saves in `cli/version.py`, and the background update-task save in `server.py`.

`_do_auto_upgrade` and `_do_revert` re-load fresh state under the lock at save time rather than mutating the pre-subprocess copy, so the lock never spans `subprocess.run`. `_do_auto_upgrade` loses its now-unused `state` parameter.

Regression test runs two threads each incrementing a distinct counter 50 times under the lock; both counters reach exactly 50 (without the lock, lost updates leave at least one below 50). 605 unit tests pass at 96.22% coverage.

Closes the deferred 8th item from #45. The original "one-line flock" framing was wrong because save-side locking doesn't prevent lost updates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Follow-up to the previous commit: the CHANGELOG entry was written before the PR number was assigned, so the placeholder substitution ships in the same PR. Adding this as a separate commit (rather than amending) to preserve push history on the PR branch.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov-commenter · 2026-05-06T19:49:42Z

Codecov Report

❌ Patch coverage is 96.61017% with 2 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
src/mcp_synology/cli/version.py	93.54%	2 Missing ⚠️

📢 Thoughts on this report? Let us know!

cmeans · 2026-05-06T19:54:54Z

[QA] Starting review at 1f688b0. Will verify lock semantics, exception cleanup, full call-site coverage, and run the spot-check on --check-update/--auto-upgrade interleave.

cmeans

[QA] Round 1 — FAILED (one observation)

Verified at 1f688b0 against main a2e206d (post-#100). The fix correctly addresses #93's lost-update race for the synchronous CLI flows; one usage pattern in the async background path is brittle and should be tightened before signoff.

Static checks

uv run ruff check src/ tests/ → clean
uv run ruff format --check src/ tests/ → 69 files already formatted
uv run mypy --strict src/ → Success: no issues found in 28 source files

Tests

uv run pytest → 605 passed, 112 deselected, 18 warnings, 0 failed in 23.6s (matches PR body)
Coverage 96.22% (matches PR body)
112 deselected = addopts = "-m 'not integration and not vdsm'"; vdsm CI ran SUCCESS at this SHA.
TDD-verified the regression test: stripped the fcntl.flock call to a no-op in _with_global_state_lock, reran TestGlobalStateLock::test_concurrent_writers_preserve_both_updates — fails with FileNotFoundError from atomic_write_text's os.replace race (concurrent threads losing each other's .tmp files); restored, both new tests pass. Different failure mode than a counter-mismatch but the test does protect against the race.

Issue #93 acceptance criteria

Option A (_with_global_state_lock() context manager, fcntl.flock(LOCK_EX) on Unix, no-op on Windows) — chosen and implemented at cli/version.py:126-154. ✓
All three issue-cited caller sites use the new pattern, plus four more not in the issue:
- cli/main.py:53 (--check-update flag)
- cli/main.py:72 (--auto-upgrade enable/disable flag)
- cli/main.py:85 (startup version-tracking)
- cli/main.py:97 (auto-upgrade trigger)
- cli/version.py:211 (_do_auto_upgrade post-subprocess save)
- cli/version.py:259 (_do_revert post-subprocess save)
- server.py:214 (background update task)
  ✓ (seven load→mutate→save sites; the two post-subprocess sites correctly re-load fresh state under the lock, so the lock never spans a subprocess.run)
Unit test for two interleaved write attempts: TestGlobalStateLock::test_concurrent_writers_preserve_both_updates (threads). ✓
CHANGELOG ### Fixed entry, references #101 (PR_PLACEHOLDER substituted in 1f688b0). ✓

QA spot-check — back-to-back mcp-synology --check-update then mcp-synology --auto-upgrade enable under a tmp HOME:

last_version_check: '2026-05-06T19:57:21.382240+00:00'
latest_known_version: 0.5.2
auto_upgrade: true

All three keys present. Lockfile created at <state_dir>/global.yaml.lock (0 bytes, expected). ✓

O1 (observation) — server.py:214-218 holds fcntl.flock across an await

with _with_global_state_lock():
    gstate = _load_global_state()
    async with asyncio.timeout(10):
        latest = await loop.run_in_executor(None, _check_for_update, gstate)
    _save_global_state(gstate)

The synchronous fcntl.flock(LOCK_EX) is held while the coroutine yields control via await. On Linux, flock locks belong to the open file description, so two coroutines that each open the lockfile get separate descriptions, and the second flock call blocks even though both run in the same process. If a future coroutine on the same event loop ever calls with _with_global_state_lock(): while this one is suspended, that synchronous flock call blocks the event loop thread. The original coroutine needs the event loop to resume after the executor completes, so the result is a deadlock: the event loop is blocked on a flock that is released only when the original coroutine resumes, and that coroutine can resume only when the event loop runs.

Today the pattern is safe because _bg_update_check is the sole async caller, fired exactly once per server lifetime via asyncio.create_task at server.py:123 (idempotent — get_client()'s asyncio.Lock + self._client is None check prevents re-entry). So there's no concrete deadlock today. But the constraint ("only one async caller of _with_global_state_lock allowed in this event loop") is unobvious and would be easy to violate when adding the next async path that touches global.yaml (e.g., #75 ServerState lifecycle, which the #93 issue itself anticipates).
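
The per-open-file-description behaviour behind this observation can be checked with a small sketch (Linux assumed; second_open_is_blocked is a hypothetical helper, not project code). Two separate open() calls on the same lock file in one process still conflict, which is why a second in-process caller would block rather than re-enter.

```python
import fcntl
from pathlib import Path


def second_open_is_blocked(lock_path: Path) -> bool:
    """Return True if a second open() of the same file cannot take the lock."""
    with open(lock_path, "w") as first, open(lock_path, "w") as second:
        fcntl.flock(first, fcntl.LOCK_EX)
        try:
            # LOCK_NB makes the call fail immediately instead of blocking forever.
            fcntl.flock(second, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            return True
        else:
            fcntl.flock(second, fcntl.LOCK_UN)
            return False
        finally:
            fcntl.flock(first, fcntl.LOCK_UN)
```

Without LOCK_NB the second call would simply wait, and on an event loop thread that wait is what stalls every other coroutine.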

Two acceptable resolutions:

1. Move the critical section into the executor. Strictly safer, no future footgun:

def _do_check() -> str | None:
    # Runs on an executor thread, so the blocking flock never stalls the event loop.
    with _with_global_state_lock():
        g = _load_global_state()
        latest = _check_for_update(g)
        _save_global_state(g)
    return latest

async with asyncio.timeout(10):
    latest = await loop.run_in_executor(None, _do_check)

The asyncio.timeout(10) bound is preserved; the lock duration shrinks to just the _do_check body (no longer spans event-loop scheduling jitter); and the lock-across-await footgun is gone.

2. Document the constraint. Add a comment in _bg_update_check and/or a docstring note in _with_global_state_lock saying "synchronous lock: do not call from a coroutine that awaits while holding it; route through run_in_executor". Lighter-weight, but relies on future authors reading it.

Either resolution works for me. Option 1 is what I'd reach for; option 2 is fine if you prefer minimal diff. Push back if you read this differently — the deadlock path is theoretical and this might be over-engineering for a one-shot startup task.
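
Option 1 can be sketched as a self-contained script. This is an illustration under assumptions, not the change that was pushed: state_lock, check_under_lock, and bg_update_check are hypothetical names, JSON replaces YAML, the version string is a placeholder for the real PyPI lookup, and asyncio.wait_for stands in for asyncio.timeout so the sketch also runs on Python versions older than 3.11.

```python
import asyncio
import fcntl
import json
import os
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def state_lock(path: Path):
    with open(str(path) + ".lock", "w") as fd:
        fcntl.flock(fd, fcntl.LOCK_EX)
        try:
            yield
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)


def check_under_lock(path: Path) -> str:
    # Entire load -> check -> save sequence runs on the executor thread,
    # so the lock is acquired and released without ever crossing an await.
    with state_lock(path):
        state = json.loads(path.read_text()) if path.exists() else {}
        latest = "0.5.2"  # placeholder for the real network lookup
        state["latest_known_version"] = latest
        tmp = path.with_name(path.name + ".tmp")
        tmp.write_text(json.dumps(state))
        os.replace(tmp, path)
    return latest


async def bg_update_check(path: Path) -> str:
    loop = asyncio.get_running_loop()
    # The event loop only awaits the executor future; it never blocks on flock.
    return await asyncio.wait_for(
        loop.run_in_executor(None, check_under_lock, path), timeout=10
    )
```

One caveat of this shape: a timeout cancels only the await, so the executor thread finishes its work (and releases the lock) on its own schedule.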

Transparency notes (not findings)

TOCTOU windows exist between sequential lock blocks in cli/main.py (e.g., the state.get("auto_upgrade") check at line 96 reads a post-lock-release value, the _do_revert decision at cli/version.py:223 reads previous_version outside any lock). All consequences are benign — last-writer-wins on user-action data, with the eventual save under a lock catching real corruption. Not worth code churn.
Lock held across PyPI urlopen (5s timeout) in _check_for_update. Other writers wait worst-case ~5s. Acceptable for a one-time startup check; --check-update from the CLI hits the same window but is also a manual one-off action.

CI is fully green at 1f688b0 (12 SUCCESS / 2 SKIPPED required, vdsm SUCCESS). Test plan checkboxes 3 and 4 (vdsm + QA spot-check) ticked pre-audit.

Applying QA Failed as the final act for the one observation. Address O1 (either option), push, and re-request — should be a quick round 2.

…1, O1)

The previous shape held a synchronous fcntl.flock across `await loop.run_in_executor(...)`. That's safe today because `_bg_update_check` is the sole async caller of `_with_global_state_lock` on this event loop, but the constraint ("only one async caller of the lock per event loop") is unobvious and would be easy to violate the next time something async touches global.yaml, e.g. #75's ServerState lifecycle, which #93 itself anticipates.

Move the entire load → check → save sequence into the executor via a local sync function. The asyncio.timeout(10) bound is preserved, the lock duration shrinks to just the inner block (no event-loop scheduling jitter), and the lock-across-await footgun is gone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cmeans-claude-dev · 2026-05-06T20:04:27Z

Round 2 — O1 addressed

Took option 1 from the QA review (move the critical section into the executor) at 87f5cd7. The full load → check → save now runs inside run_in_executor via a local sync _check_under_lock(), so the synchronous fcntl.flock no longer spans an await. asyncio.timeout(10) bound preserved.

Also added a comment block at the call site explaining why the placement matters, since the constraint ("don't hold the sync lock across awaits on this event loop") would otherwise be invisible to a future contributor adding the next async caller — exactly the footgun Round 1 flagged. The CHANGELOG entry now mentions the executor placement and the #75 anticipation.

Locally green: ruff/format/mypy clean, 605 passed at 96.22% coverage (no test changes needed — _bg_update_check's direct test path was unaffected by moving the body inside the executor closure). Re-requesting QA.

cmeans · 2026-05-06T20:45:18Z

[QA] Starting round 2 review at 87f5cd7. Re-running lint/mypy/pytest and the spot-check; re-verifying O1 from round 1.

cmeans

[QA] Round 2 — PASS

Verified at 87f5cd7 (single new commit on top of round-1 head 1f688b0). Round-1 O1 cleanly addressed.

Diff scope

src/mcp_synology/server.py only — _bg_update_check's critical section moved inside a synchronous _check_under_lock closure that runs entirely in loop.run_in_executor(None, ...). The fcntl.flock is acquired and released on the executor thread; the event loop thread only waits on the executor future. The deadlock pathway from round 1 is gone — even if a future async caller takes the lock on the same event loop, the executor thread's lock acquisition won't compete with it.
The new comment block explicitly references #75's ServerState lifecycle as the anticipated future trigger. Constraint is now self-documenting.

Re-verification on 87f5cd7

uv run ruff check src/ tests/ → clean
uv run ruff format --check src/ tests/ → 69 files already formatted
uv run mypy --strict src/ → Success: no issues found in 28 source files
uv run pytest → 605 passed, 112 deselected, 18 warnings, 0 failed in 23.9s. Coverage 96.22%. (Identical totals to round 1; the change is non-functional from the test suite's perspective.)
Spot-check re-run: mcp-synology --check-update then --auto-upgrade enable under tmp HOME — global.yaml ends up with last_version_check, latest_known_version, and auto_upgrade: true. ✓
CI rollup at 87f5cd7: 12 SUCCESS / 5 SKIPPED required incl. vdsm integration tests SUCCESS.

Zero blockers, zero substantive, zero observations. Test-plan checkboxes stay ticked from round 1 (test plan steps unchanged). Applying Ready for QA Signoff as the final act, removing QA Active in the same call. Awaiting maintainer's QA Approved.

cmeans

LGTM

github-actions Bot added Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA Ready for QA Dev work complete — QA can begin review and removed Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA labels May 6, 2026

github-actions Bot added Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA Ready for QA Dev work complete — QA can begin review and removed Ready for QA Dev work complete — QA can begin review Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA labels May 6, 2026

cmeans added the QA Active QA is actively reviewing; Dev should not push changes label May 6, 2026

github-actions Bot removed the Ready for QA Dev work complete — QA can begin review label May 6, 2026

cmeans requested changes May 6, 2026

cmeans added QA Failed QA found issues — needs dev attention and removed QA Active QA is actively reviewing; Dev should not push changes labels May 6, 2026

cmeans-claude-dev Bot removed the QA Failed QA found issues — needs dev attention label May 6, 2026

github-actions Bot added Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA Ready for QA Dev work complete — QA can begin review and removed Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA labels May 6, 2026

cmeans added the QA Active QA is actively reviewing; Dev should not push changes label May 6, 2026

github-actions Bot removed the Ready for QA Dev work complete — QA can begin review label May 6, 2026

cmeans reviewed May 6, 2026

cmeans added Ready for QA Signoff QA passed — ready for maintainer final review and merge and removed QA Active QA is actively reviewing; Dev should not push changes labels May 6, 2026

cmeans approved these changes May 6, 2026

cmeans removed the Ready for QA Signoff QA passed — ready for maintainer final review and merge label May 6, 2026

cmeans added the QA Approved Manual QA testing completed and passed label May 6, 2026

cmeans-claude-dev Bot merged commit fe5c3cc into main May 6, 2026
38 checks passed

cmeans-claude-dev Bot deleted the fix/global-state-rmw-race branch May 6, 2026 20:53

cmeans-claude-dev Bot mentioned this pull request May 6, 2026

Tidy-up bundle: test duplication, setup input stripping, fallback truthiness, minor robustness #45

Closed

9 tasks

cmeans mentioned this pull request May 13, 2026

chore(deps): bump python-multipart from 0.0.26 to 0.0.27 in the uv group across 1 directory #102

Merged

cmeans-claude-dev[bot] merged 3 commits into main from fix/global-state-rmw-race

2 participants
