deploy(caddy): split error + access logs to rotated files (validates against live CT 112)#30
Conversation
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
…alidate-as-root gotcha
Replaces the single `log { output stdout }` stanza in the
Caddyfile.example with two purpose-split logs, both written to
/var/log/caddy/ with Caddy's built-in size + count + age
rotation:
- Global `log default` (in the {…} options block) routes
server-level errors to error.log: level ERROR, JSON,
roll_size 50MiB / roll_keep 10 / roll_keep_for 2160h
(90 days). Errors are rare and forensically valuable, so
retention is longer.
- Per-site `log` block writes per-request entries to
access.log: JSON, roll_size 100MiB / roll_keep 14 /
roll_keep_for 720h (30 days). Low-traffic badge service —
one 100 MiB file may last weeks, so retention is bounded
by the keep_for value.
Caddy 2.7+ supports lumberjack rotation natively
(roll_size + roll_keep + roll_keep_for). No separate
logrotate config required.
Validated against the production deployment on CT 112
(Holodeck) before this commit:
- `caddy validate --config Caddyfile --adapter caddyfile`
passes
- `systemctl restart caddy` brings the daemon up clean
- Three test hits (200 _health.json, 200 real badge, 404
missing badge) all logged as JSON access entries with
remote_ip, host, uri, method, status, duration, response
headers
- error.log stays empty (4xx responses don't go there;
only server-level / panic / ACME errors do)
Header comment documents a gotcha hit during the production
rollout: running `caddy validate` *as root* pre-creates
/var/log/caddy/{error,access}.log as root:root 0600, which
the caddy daemon (user caddy) can't open on reload —
systemd unit gets stuck in `reloading` state. Recovery:
chown caddy:caddy on the log files, then systemctl restart
caddy. Future operators: validate as the caddy user, or
let the daemon create the files itself on first reload.
20b7af2 to
baba6a5
Compare
cmeans
left a comment
There was a problem hiding this comment.
QA round 1 — QA Failed
The Caddyfile.example change itself is sound — the live-deployment-first pattern is exactly right per feedback_anchor_on_live_deployment.md, and I verified the rotation knobs and structure match CT 112's /etc/caddy/Caddyfile byte-for-byte (50MiB/10/2160h on error, 100MiB/14/720h on access, JSON format on both, level ERROR on the global log default). Live curl tests in PR body reproduced from my side: 200 / 200 / 404 against the three URLs as advertised. CI deploy-smoke validates the new file against caddy:2. So far so good.
The fail is for doc drift in deploy/README.md introduced by this PR.
Substantive finding (blocker): "Native journal logging" claim is now incomplete
deploy/README.md:33:
| **Bare systemd** (Linux host or LXC) | `systemd/`, `caddy/Caddyfile.example` | Smallest moving parts. Predictable. Native journal logging. | Linux-only. Manual user/dir setup. |
That cell describes the Bare systemd path's pros. After this PR, Caddy no longer logs to journal at all — server errors go to /var/log/caddy/error.log, requests go to /var/log/caddy/access.log. Only the collector still uses journal (line 88's journalctl -u pypi-winnow-downloads-collector.service).
So "Native journal logging" describes half the deployment now. An operator picking deployment shape from this table gets a misleading pro: they reasonably expect to journalctl -u caddy to grep request data and find no requests there.
This is exactly the symbol-walking the new lens calls for: changing how Caddy logs → grep deploy/README.md for log / journal → find this row → update it. Per feedback_doc_drift_is_substantive.md, doc drift is substantive, fix in same PR cycle.
Suggested fix: rewrite the cell to either drop the journal-specific claim (e.g., "Native logging integration") or split it ("Collector logs to journal; Caddy logs to rotated files under /var/log/caddy/").
What is correct on this PR
| Aspect | State |
|---|---|
| Live-deployment-first validation | yes — production CT 112 changed first, ported to example after. Pattern matches feedback_anchor_on_live_deployment.md. |
| Rotation knobs match production | exact: error 50MiB / 10 / 2160h, access 100MiB / 14 / 720h |
| Structural match | global log default block with level ERROR + per-site log block, both format json, file outputs at /var/log/caddy/{error,access}.log |
caddy:2 validates the new file |
CI deploy-smoke job SUCCESS on PR head |
| Live URL behavior matches PR body | 200 / 200 / 404 reproduced from my end against pypi-badges.intfar.com |
| Header gotcha documentation | well-placed: operators copying Caddyfile.example to /etc/caddy/Caddyfile will see the "running validate as root pre-creates the log files as root:root 0600" warning before they can fall into it |
| CHANGELOG entry | top of ### Added, comprehensive, references the gotcha |
Adjacent observation (not a blocker)
deploy/README.md's Bare systemd quickstart (around line 91-93) doesn't mention /var/log/caddy/ ownership or pre-creation. Operators on Debian-family installing caddy from caddyserver.com get the directory created with caddy:caddy ownership by the package's postinst, so the gotcha doesn't bite them. Operators on non-Debian distros (Alpine / Arch / RHEL / Fedora) where the package may not pre-create the directory could hit it. The gotcha is documented in the Caddyfile.example header so an operator reading what they're copying will see it — that's reasonable coverage. Worth thinking about, not worth blocking.
Transitioning label to QA Failed.
|
Applying |
… to journal Round-1 QA on PR #30 caught a doc-drift miss: the "Pick an approach" table row for Bare systemd at deploy/README.md:33 listed "Native journal logging" as a pro. After this PR's Caddyfile change, only the collector logs to journal — Caddy writes to /var/log/caddy/{error,access}.log directly. An operator picking deployment shape from that table would reasonably expect to grep `journalctl -u caddy` for request data and find none. Updated the cell to read "Collector logs to journal; Caddy logs to rotated files under /var/log/caddy/" — same bullet style as the rest of the row, but accurate. Other log/journal references in deploy/README.md (line 88's collector journalctl example) are still correct and untouched. CHANGELOG entry for this PR amended to mention the README update so the bullet reflects full PR scope.
|
QA round 1 finding addressed. New commit
Adjacent observation about non-Debian distros needing to pre-create Re-tagged |
cmeans
left a comment
There was a problem hiding this comment.
QA round 2 — clean, signing off
Round-1 doc-drift fix landed as commit 5644c59. The deploy/README.md:33 cell now reads:
Smallest moving parts. Predictable. Collector logs to journal; Caddy logs to rotated files under
/var/log/caddy/.
That's accurate and self-explanatory — operators picking deployment shape from the table now get the right mental model for where to grep request data.
Round-1 finding resolution check:
| Round-1 ask | Resolution |
|---|---|
deploy/README.md:33 "Native journal logging" claim |
replaced with split phrasing that names both halves of the deployment correctly |
| Repo-wide drift recheck | only two journal / log references remain in deploy/README.md: line 33 (now correct) and line 88's journalctl -u pypi-winnow-downloads-collector.service (still accurate — collector still uses journal). Nothing else drifts. |
| CHANGELOG accuracy | the existing PR bullet was amended to mention the README update so the Unreleased entry reflects the full PR scope, not just the Caddyfile change |
No regression on what was already correct in round 1:
Caddyfile.exampleitself is unchanged from round 1 (verified bygit diff baba6a5...5644c59 -- deploy/caddy/Caddyfile.example— empty). The byte-for-byte match against live CT 112 production still holds (50MiB/10/2160h error, 100MiB/14/720h access, level ERROR global, JSON on both).- CI on new head: all SUCCESS — including
deploy-smokevalidating the unchanged Caddyfile.example againstcaddy:2. - Live URL behavior on
pypi-badges.intfar.comalready verified in round 1 (200 / 200 / 404).
No new findings. Transitioning label to Ready for QA Signoff.
|
Applying |
Three mechanical edits: - pyproject.toml: version "0.1.0" -> "0.1.1" - CHANGELOG.md: insert `## [0.1.1] - 2026-04-26` directly under the (still empty) `## [Unreleased]` header so all 12 PRs' worth of bullets that have been accumulating since v0.1.0 ship are now categorized under the 0.1.1 release. Updated the link refs at the bottom: [Unreleased] now compares from v0.1.1, and a new [0.1.1] entry compares v0.1.0...v0.1.1. - uv.lock: refreshed by `uv lock` so the locked pypi-winnow-downloads version (0.1.1) matches pyproject.toml. What ships in v0.1.1 (highlights — full changelog under ## [0.1.1]): Library fixes (operator-visible): - collector: _write_health OSError no longer escapes per-package isolation. Disk-full / perm errors now produce structured `winnow-collect: ...; health file write failed: [Errno 28] No space left on device` exit instead of a raw traceback. Closes #32. - collector: stale_threshold_days is now actually consulted — the "warn if previous run is older than N days" feature documented in config.example.yaml since v0.1.0 finally fires. Log-only per the documented v1 contract; degrades silently on first-run / unreadable / malformed / future- timestamped previous _health.json. Closes #33. Documentation: - README acknowledgments / license / BigQuery dataset link refresh (PR #15) - README shields.io URL canonicalization (PR #27, closes #16) - deploy/README.md Tailscale Funnel as alternative HTTPS exposure (PR #22) - deploy/README.md "Pick an approach" table updated to reflect the new Caddy logging shape (in PR #30) CI / project infrastructure (no PyPI consumer impact, but hardens future releases): - Community health files: CONTRIBUTING / CoC / SECURITY / issue templates (PR #20) - .github/dependabot.yml across pip + github-actions + docker ecosystems (PR #21) - Dependabot PR hygiene cascade from cmeans/mcp-synology: PULL_REQUEST_TEMPLATE.md + auto-CHANGELOG workflow (App- token authenticated so required CI re-fires on the bot's HEAD SHA) + dependabot.yml prefix fix (PR #25). Validated end-to-end via the first two real Dependabot bumps PR #23 (codecov-action 5->6) and PR #24 (python 3.13-slim -> 3.14-slim). - deploy-smoke CI job that builds the Dockerfile, smokes the entrypoint, validates compose+Caddyfile against caddy:2 (PR #29, closes #7). Promoted to required status check on the main-protection ruleset 2026-04-26 22:43 (issue #31 closed via operator action). - deploy/caddy/Caddyfile.example gains global error logger + per-site access logger with built-in lumberjack rotation, documents the validate-as-root gotcha (PR #30). Live CT 112 deployment fixed in the same change. - 100% coverage on src/ via real tests (no `# pragma: no cover`), with `fail_under = 100` gate in pyproject.toml so future regressions trip CI (PR #38, closes #37). Verified locally: 71/71 pytest pass, ruff/format/mypy clean, coverage gate green at 100.00%. After this merges: 1. Tag the squash-merge commit as v0.1.1 and push the tag — publish.yml fires and uploads to PyPI via the existing trusted-publisher OIDC flow. 2. Update the live CT 112 deployment to install pypi-winnow-downloads==0.1.1 from PyPI (currently runs a wheel built from main, but pinning to the released version keeps deploy reproducible). 3. Close any post-release follow-ups Chris wants tracked. Co-authored-by: cmeans-claude-dev[bot] <272174644+cmeans-claude-dev[bot]@users.noreply.github.com>
Summary
deploy/caddy/Caddyfile.examplepreviously had a singlelog { output stdout }stanza that buried request data insidejournalctl -u caddywith no separate error vs access split and no rotation. This PR ports the validated production-deployment pattern back into the example so future operators get the same shape./var/log/caddy/error.logroll_size 50MiB/roll_keep 10/roll_keep_for 2160h(90 d)log defaultblock; levelERROR; JSON. Errors are rare and forensically valuable, so kept longer./var/log/caddy/access.logroll_size 100MiB/roll_keep 14/roll_keep_for 720h(30 d)logblock; JSON. Low-traffic badge service — one 100 MiB file may last weeks.Caddy 2.7+ supports lumberjack rotation natively (
roll_size+roll_keep+roll_keep_for). No separate logrotate config required. Same keys, just two destinations.Validated against the live production deployment
This change was rolled out on the live CT 112 deployment FIRST, validated end-to-end, then ported to the example here.
Live verification:
error.logstays empty after the test hits — 4xx responses don't go there; only server-level / panic / ACME errors do.Gotcha documented in the file header
Hit a sharp gotcha during the production rollout: running
caddy validateas root (e.g., from a deployment script) pre-creates/var/log/caddy/{error,access}.logasroot:root 0600. The caddy daemon (running as usercaddy) then can't open them on reload, and the systemd unit gets stuck inreloadingstate. Recovery:chown caddy:caddy /var/log/caddy/{error,access}.log && systemctl restart caddy.The Caddyfile.example header comment documents this so future operators don't lose 20 minutes to it.
Test plan
caddy validate --config Caddyfile.example --adapter caddyfile(CI'sdeploy-smokejob in PR ci: add deploy-smoke job for Dockerfile + Compose + Caddyfile examples (#7) #29 covers this once ci: add deploy-smoke job for Dockerfile + Compose + Caddyfile examples (#7) #29 lands)CHANGELOG
## [Unreleased]→### Addedentry describing the split + rotation + gotcha-documentationRelated
ssh holodeck → pct exec 112against/etc/caddy/Caddyfile.ci/docker-smoke) addsdeploy-smoketo CI, whichcaddy validates this example file on every push — so any future regression here gets caught.