fix(collector): _write_health OSError no longer escapes per-package isolation (#32)#35
Conversation
…solation (closes #32) Surfaced by independent code-review pass against main `4e4148f` (2026-04-26): `collect()` called `_write_health(...)` with no try/except, and `__main__.main()` had no handler around `collector_fn(config)`. A failure inside the health-write step (disk full, output dir not writable, cross-device atomic-replace, etc.) propagated as an unhandled OSError traceback all the way out of the process, bypassing the structured per-package failure summary that operators rely on. This defeated the per-package isolation contract one level up: `_collect_one` already wraps `(CollectorError, OSError)` so one bad package can't kill the run, but the final health-write step had no equivalent guard. The fix: - Wrap the `_write_health(...)` call in `collect()` with `try: ... except OSError as e: logger.error(...)`. Surface the failure structurally via a new `CollectorResult.health_write_error: str | None` field (default `None`, additive change — existing call sites that construct `CollectorResult` keep working without modification). - Update `__main__.main()` to combine per-package failures and health-write failures into one structured exit message: `winnow-collect: 2 package(s) failed: foo, bar; health file write failed: [Errno 28] No space left on device`. Both modes produce non-zero exit; either alone produces a single-clause message. Tests added: - `test_collect_health_write_oserror_recorded_not_raised` — monkeypatches `collector_module.os.replace` to raise ENOSPC ONLY on the `_health.json` target (badge writes still work via the real call). Asserts the per-package outcome lands intact AND `result.health_write_error` is set, AND no exception escapes `collect()`. - `test_main_exits_nonzero_when_health_file_write_fails` — calls `__main__.main()` with a fake `collector_fn` that returns `health_write_error="[Errno 28] No space left on device"`; asserts SystemExit message contains the structured text. - `test_main_combines_package_and_health_failure_messages` — same shape with both per-package failures AND health_write_error set; asserts both clauses appear in the exit message. - Existing `test_collect_result_reports_no_failures_on_full_success` extended to assert `health_write_error is None` on the happy path. Verified locally: 59/59 pytest pass, ruff/format/mypy clean.
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
cmeans
left a comment
There was a problem hiding this comment.
QA round 1 — clean, no follow-ups
Closes #32 cleanly via the issue's resolution A. Bug class is the actual bug — _write_health OSError escaping per-package isolation — and the regression test exercises the actual bug, not just the fix mechanic.
Issue #32 scope vs PR delivery:
| Issue #32 ask | Delivered |
|---|---|
Wrap _write_health(...) in collect() with try/except OSError; log; continue |
collector.py now does exactly that, plus surfaces the failure structurally via new CollectorResult.health_write_error: str | None = None field (additive, default None, backward-compat) |
| Regression test that injects a write failure | test_collect_health_write_oserror_recorded_not_raised — monkeypatches collector_module.os.replace to raise ENOSPC only when dst ends in _health.json, lets badge writes use the real call. Asserts: result.outcomes still populated; result.failures == (); result.health_write_error contains "No space left on device". This is the right level of targeting — tests the bug class, not a blanket os.replace failure that would also break badge writes. |
| Honest exit code | __main__.main() rewritten to accumulate a problems list, joining package-failures and health-write-failures with ; separator. All four permutations covered: success silent, only-package, only-health, both-combined. |
Per-package isolation contract verified intact:
_collect_one already wraps (CollectorError, OSError) around the body that calls badge.write_badge (which uses the same os.replace pattern as _write_health). Repo-wide grep for unwrapped os.replace / write_text / mkdir calls in collect()'s call graph: only _write_health was the gap; this PR closes it, no other surface remains.
Test sufficiency — covers bug, not just mechanics:
| Test | Bug-class coverage |
|---|---|
test_collect_health_write_oserror_recorded_not_raised |
Forces OSError at the exact code path (os.replace on _health.json); verifies non-propagation + structured surfacing |
test_collect_result_reports_no_failures_on_full_success |
Backward-compat: health_write_error is None on happy path |
test_main_exits_nonzero_when_health_file_write_fails |
main() exits non-zero with structured message when only health fails |
test_main_combines_package_and_health_failure_messages |
Multi-clause exit message permutation: both failure modes together produce the ;-joined message |
Hygiene:
- New
except OSError as eis not silent —logger.error("collector: failed to write _health.json: %s", e)+ structured surfacing. Nofeedback_silent_exceptionsviolation. - Backward compat: existing
CollectorResult(...)construction sites keep working (defaultNoneon the new field). - Test plan: all 7 verifiable items now ticked (item 7 — squash-merge title preservation — is post-merge).
Local verification on head 3f2fc85:
| Check | Result |
|---|---|
uv run pytest -q |
59 passed, 0 deselected, 0.26s |
uv run pytest tests/test_collector.py::test_collect_health_write_oserror_recorded_not_raised (just the bug-class test) |
passed |
uv run pytest tests/test_main.py::test_main_combines_package_and_health_failure_messages |
passed |
uv run ruff check, ruff format --check, mypy src |
all clean |
Coverage of new lines in collector.py:209-227 and __main__.py:39-47 |
100% (the lone uncovered prod lines remain the same pre-existing 1% gap discussed earlier — __init__.py PackageNotFoundError fallback, __main__.py entry-point shim, two config.py error paths, one collector line — none introduced by this PR) |
| CI on PR head | all SUCCESS (test 3.11/3.12/3.13, lint, typecheck, deploy-smoke, qa-approved, on-push) |
No findings. Transitioning label to Ready for QA Signoff.
|
Applying |
Summary
Closes #32.
collect()called_write_health(...)with notry/except, and__main__.main()had no handler aroundcollector_fn(config). A failure inside the health-write step (disk full, output dir not writable, atomic-replace cross-device, etc.) propagated as an unhandledOSErrortraceback through the process, bypassing the structured per-package failure summary that operators rely on. This defeated the per-package isolation contract one level up:_collect_onealready wraps(CollectorError, OSError)so one bad package can't kill the run, but the final health-write step had no equivalent guard.Approach
src/pypi_winnow_downloads/collector.py_write_health(...)call incollect()withtry: … except OSError as e: logger.error(…). Surface failure via new `CollectorResult.health_write_error: strsrc/pypi_winnow_downloads/__main__.pywinnow-collect: 2 package(s) failed: foo, bar; health file write failed: [Errno 28] No space left on device. Both modes produce non-zero exit; either alone produces a single-clause message.tests/test_collector.pytest_collect_health_write_oserror_recorded_not_raised(monkeypatchesos.replaceto raise ENOSPC ONLY on_health.json, badge writes still work). Extended:test_collect_result_reports_no_failures_on_full_successassertshealth_write_error is Noneon happy path.tests/test_main.pytest_main_exits_nonzero_when_health_file_write_failsandtest_main_combines_package_and_health_failure_messages.CHANGELOG.md### Fixedentry referencing #32.Test plan
uv run pytest --cov --cov-report=xml— 59 passed (3 new + extended 1)uv run ruff check src/ tests/— cleanuv run ruff format --check src/ tests/— cleanuv run mypy src/pypi_winnow_downloads/— cleanCollectorResultfield addition is backward-compatible: existing test fixtures construct withouthealth_write_error, defaultNonemakes them work unchangedos.replaceis targeted (only raises for the health-file destination, lets badge writes use the real call) — this verifies the OSError comes from the right code path, not from a side effectWhy a new field rather than logging-only
Two failure modes to surface:
If we just logged the OSError without surfacing it structurally, a run with no per-package failures + a broken health-write would exit 0 silently — operators would think the collector succeeded when in fact
_health.jsonis stale or absent. The new field keeps the exit code honest.Alternative (rejected): a top-level
except OSErrorin__main__.main()aroundcollector_fn(config). Loses per-package outcome detail (CollectorResult never returned), and requires operators to read raw OSError strings without context about which step failed. The current approach gives finer signal at the same complexity cost.Source of finding
Independent code-review pass against main
4e4148f, 2026-04-26. Filed as #32, picked up immediately.Closes #32