deploy(caddy): split error + access logs to rotated files (validates against live CT 112) #30

Merged: cmeans-claude-dev[bot] merged 2 commits into main from deploy/caddy-logging on Apr 27, 2026

@cmeans-claude-dev (Contributor)

Summary

deploy/caddy/Caddyfile.example previously had a single log { output stdout } stanza that buried request data inside journalctl -u caddy with no separate error vs access split and no rotation. This PR ports the validated production-deployment pattern back into the example so future operators get the same shape.

| Log | Path | Rotation | Notes |
|---|---|---|---|
| Error | `/var/log/caddy/error.log` | `roll_size 50MiB` / `roll_keep 10` / `roll_keep_for 2160h` (90 d) | Global `log default` block; level ERROR; JSON. Errors are rare and forensically valuable, so kept longer. |
| Access | `/var/log/caddy/access.log` | `roll_size 100MiB` / `roll_keep 14` / `roll_keep_for 720h` (30 d) | Per-site `log` block; JSON. Low-traffic badge service: one 100 MiB file may last weeks. |

Caddy 2.7+ supports lumberjack rotation natively (roll_size + roll_keep + roll_keep_for). No separate logrotate config required. Same keys, just two destinations.
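For readers who want to see the shape, the two destinations described above could look roughly like the following. This is a sketch, not the contents of deploy/caddy/Caddyfile.example: the site address `example.com`, the `respond` line, and the scratch-file approach are assumptions made for illustration. Only the paths, rotation values, JSON format, and ERROR level are taken from this PR.

```shell
#!/bin/sh
# Sketch only: write an illustrative Caddyfile with the two log destinations
# to a scratch file. "example.com" and the respond line are placeholders,
# not the project's real site block.
sketch="$(mktemp)"

cat > "$sketch" <<'EOF'
{
	# Global default logger: server-level errors only.
	log default {
		output file /var/log/caddy/error.log {
			roll_size 50MiB
			roll_keep 10
			roll_keep_for 2160h
		}
		format json
		level ERROR
	}
}

example.com {
	# Per-site access log: one JSON entry per request.
	log {
		output file /var/log/caddy/access.log {
			roll_size 100MiB
			roll_keep 14
			roll_keep_for 720h
		}
		format json
	}
	respond "placeholder" 200
}
EOF

echo "wrote sketch to $sketch"
```

A file like this can then be checked with the same `caddy validate` invocation shown under "Live verification", keeping in mind the as-root gotcha described further down.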

Validated against the live production deployment

This change was rolled out on the live CT 112 deployment FIRST, validated end-to-end, then ported to the example here.

Live verification:

$ caddy validate --config /tmp/Caddyfile.new --adapter caddyfile
Valid configuration
$ systemctl restart caddy && systemctl is-active caddy
active
$ curl -s -o /dev/null -w '%{http_code}\n' \
    https://pypi-badges.intfar.com/_health.json \
    https://pypi-badges.intfar.com/pypi-winnow-downloads/downloads-30d-non-ci.json \
    https://pypi-badges.intfar.com/does-not-exist/downloads-30d-non-ci.json
200 200 404
$ tail -1 /var/log/caddy/access.log | jq '.status, .request.uri, .duration'
404
"/does-not-exist/downloads-30d-non-ci.json"
0.000110184

error.log stays empty after the test hits — 4xx responses don't go there; only server-level / panic / ACME errors do.

Gotcha documented in the file header

Hit a sharp gotcha during the production rollout: running caddy validate as root (e.g., from a deployment script) pre-creates /var/log/caddy/{error,access}.log as root:root 0600. The caddy daemon (running as user caddy) then can't open them on reload, and the systemd unit gets stuck in reloading state. Recovery: chown caddy:caddy /var/log/caddy/{error,access}.log && systemctl restart caddy.

The Caddyfile.example header comment documents this so future operators don't lose 20 minutes to it.
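One way to catch this before a reload is to check who owns the log files. The helper below is a minimal sketch under a few assumptions: `check_owner` is a hypothetical name, it relies on GNU coreutils `stat -c` (typical on Linux hosts), and it treats a file that does not exist yet as fine, since the daemon can then create it itself.

```shell
#!/bin/sh
# Sketch: report files whose owner differs from the expected user.
# Assumes GNU coreutils `stat -c`. check_owner is a hypothetical helper.
check_owner() {
	expected="$1"
	shift
	rc=0
	for f in "$@"; do
		# A file that has not been created yet needs no fixing.
		[ -e "$f" ] || continue
		owner="$(stat -c '%U' "$f")"
		if [ "$owner" != "$expected" ]; then
			echo "wrong owner on $f: $owner (expected $expected)" >&2
			rc=1
		fi
	done
	return "$rc"
}
```

For the paths in this PR, a call might look like `check_owner caddy /var/log/caddy/error.log /var/log/caddy/access.log`; a non-zero exit would suggest running the `chown` recovery above before reloading.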

Test plan

CHANGELOG

  • Added a `## [Unreleased]` / `### Added` entry describing the split + rotation + gotcha-documentation

Related

@github-actions (bot) added the "Awaiting CI" label (Dev complete, waiting for CI/Codecov to pass before QA) on Apr 27, 2026
@cmeans-claude-dev (bot) added the "Ready for QA" label (Dev work complete; QA can begin review) on Apr 27, 2026
@github-actions (bot) removed the "Awaiting CI" label on Apr 27, 2026
@codecov-commenter

Codecov Report

✅ All modified and coverable lines are covered by tests.

…alidate-as-root gotcha

Replaces the single `log { output stdout }` stanza in the
Caddyfile.example with two purpose-split logs, both written to
/var/log/caddy/ with Caddy's built-in size + count + age
rotation:

- Global `log default` (in the {…} options block) routes
  server-level errors to error.log: level ERROR, JSON,
  roll_size 50MiB / roll_keep 10 / roll_keep_for 2160h
  (90 days). Errors are rare and forensically valuable, so
  retention is longer.

- Per-site `log` block writes per-request entries to
  access.log: JSON, roll_size 100MiB / roll_keep 14 /
  roll_keep_for 720h (30 days). Low-traffic badge service —
  one 100 MiB file may last weeks, so retention is bounded
  by the keep_for value.
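
As a rough check on what those settings imply, the arithmetic can be worked through as below. The sketch assumes `roll_keep` counts rotated files kept in addition to the active file and ignores any compression of rotated files, so the sizes are upper bounds; `max_mib` and `hours_to_days` are hypothetical helper names.

```shell
#!/bin/sh
# Worked arithmetic for the rotation settings quoted above.
# Assumption: roll_keep counts rotated files kept alongside the active file,
# so the upper bound is (roll_keep + 1) * roll_size, ignoring compression.
max_mib() { echo $(( ($2 + 1) * $1 )); }   # args: roll_size_mib roll_keep
hours_to_days() { echo $(( $1 / 24 )); }   # arg: roll_keep_for in hours

echo "error:  up to $(max_mib 50 10) MiB on disk, $(hours_to_days 2160) days"
echo "access: up to $(max_mib 100 14) MiB on disk, $(hours_to_days 720) days"
```

Under those assumptions the bounds come to 550 MiB over 90 days for error.log and 1500 MiB over 30 days for access.log. On a low-traffic service the age limit is likely to be reached long before the size limit.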

Caddy 2.7+ supports lumberjack rotation natively
(roll_size + roll_keep + roll_keep_for). No separate
logrotate config required.

Validated against the production deployment on CT 112
(Holodeck) before this commit:
- `caddy validate --config Caddyfile --adapter caddyfile`
  passes
- `systemctl restart caddy` brings the daemon up clean
- Three test hits (200 _health.json, 200 real badge, 404
  missing badge) all logged as JSON access entries with
  remote_ip, host, uri, method, status, duration, response
  headers
- error.log stays empty (4xx responses don't go there;
  only server-level / panic / ACME errors do)

Header comment documents a gotcha hit during the production
rollout: running `caddy validate` *as root* pre-creates
/var/log/caddy/{error,access}.log as root:root 0600, which
the caddy daemon (user caddy) can't open on reload —
systemd unit gets stuck in `reloading` state. Recovery:
chown caddy:caddy on the log files, then systemctl restart
caddy. Future operators: validate as the caddy user, or
let the daemon create the files itself on first reload.

@cmeans-claude-dev (bot) force-pushed the deploy/caddy-logging branch from 20b7af2 to baba6a5 on April 27, 2026 01:39
@github-actions (bot) added the "Awaiting CI" and "Ready for QA" labels and removed the "Ready for QA" and "Awaiting CI" labels on Apr 27, 2026
cmeans previously approved these changes on Apr 27, 2026

@cmeans (Owner) left a comment

LGTM

@cmeans added the "QA Active" label (QA is actively reviewing; Dev should not push changes) and removed the "Ready for QA" label on Apr 27, 2026

@cmeans (Owner) left a comment

QA round 1 — QA Failed

The Caddyfile.example change itself is sound — the live-deployment-first pattern is exactly right per feedback_anchor_on_live_deployment.md, and I verified the rotation knobs and structure match CT 112's /etc/caddy/Caddyfile byte-for-byte (50MiB/10/2160h on error, 100MiB/14/720h on access, JSON format on both, level ERROR on the global log default). Live curl tests in PR body reproduced from my side: 200 / 200 / 404 against the three URLs as advertised. CI deploy-smoke validates the new file against caddy:2. So far so good.

The fail is for doc drift in deploy/README.md introduced by this PR.

Substantive finding (blocker): "Native journal logging" claim is now incomplete

deploy/README.md:33:

| **Bare systemd** (Linux host or LXC) | `systemd/`, `caddy/Caddyfile.example` | Smallest moving parts. Predictable. Native journal logging. | Linux-only. Manual user/dir setup. |

That cell describes the Bare systemd path's pros. After this PR, Caddy no longer logs to journal at all — server errors go to /var/log/caddy/error.log, requests go to /var/log/caddy/access.log. Only the collector still uses journal (line 88's journalctl -u pypi-winnow-downloads-collector.service).

So "Native journal logging" now describes only half the deployment. An operator picking a deployment shape from this table gets a misleading pro: they would reasonably expect to run journalctl -u caddy to grep request data, and would find no requests there.

This is exactly the symbol-walking the new lens calls for: changing how Caddy logs → grep deploy/README.md for log / journal → find this row → update it. Per feedback_doc_drift_is_substantive.md, doc drift is substantive, fix in same PR cycle.
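
The symbol-walk described here can be approximated with a plain grep. The helper name `walk_docs` and the pattern are illustrative choices rather than anything defined in this repository; the idea is simply to list every line that mentions the changed concept so each hit can be re-read against the new behavior.

```shell
#!/bin/sh
# Sketch of a doc "symbol-walk": print matching lines with line numbers,
# case-insensitively, for one or more files.
# walk_docs is a hypothetical helper; the pattern is up to the caller.
walk_docs() {
	pattern="$1"
	shift
	grep -n -i -E "$pattern" "$@"
}
```

For this PR, something like `walk_docs 'journal|log' deploy/README.md` would likely surface both the table row at line 33 and the collector `journalctl` example at line 88.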

Suggested fix: rewrite the cell to either drop the journal-specific claim (e.g., "Native logging integration") or split it ("Collector logs to journal; Caddy logs to rotated files under /var/log/caddy/").

What is correct on this PR

| Aspect | State |
|---|---|
| Live-deployment-first validation | yes: production CT 112 changed first, ported to example after. Pattern matches feedback_anchor_on_live_deployment.md. |
| Rotation knobs match production | exact: error 50MiB / 10 / 2160h, access 100MiB / 14 / 720h |
| Structural match | global log default block with level ERROR + per-site log block, both format json, file outputs at /var/log/caddy/{error,access}.log |
| caddy:2 validates the new file | CI deploy-smoke job SUCCESS on PR head |
| Live URL behavior matches PR body | 200 / 200 / 404 reproduced from my end against pypi-badges.intfar.com |
| Header gotcha documentation | well-placed: operators copying Caddyfile.example to /etc/caddy/Caddyfile will see the "running validate as root pre-creates the log files as root:root 0600" warning before they can fall into it |
| CHANGELOG entry | top of ### Added, comprehensive, references the gotcha |

Adjacent observation (not a blocker)

deploy/README.md's Bare systemd quickstart (around lines 91-93) doesn't mention /var/log/caddy/ ownership or pre-creation. Operators on Debian-family systems installing caddy from caddyserver.com get the directory created with caddy:caddy ownership by the package's postinst, so the gotcha doesn't bite them. Operators on non-Debian distros (Alpine / Arch / RHEL / Fedora), where the package may not pre-create the directory, could hit it. The gotcha is documented in the Caddyfile.example header, so an operator reading what they're copying will see it, which is reasonable coverage. Worth thinking about, not worth blocking.

Transitioning label to QA Failed.

@cmeans (Owner) commented Apr 27, 2026

Applying QA Failed — see review above. One substantive finding: deploy/README.md:33 claims 'Native journal logging' as a Bare systemd pro, but this PR moves Caddy logs out of the journal to dedicated rotated files. Symbol-walking the docs for the changed concept would have caught it during drafting. Caddyfile change itself is sound and matches the live CT 112 deployment exactly.

@cmeans added the "QA Failed" label (QA found issues; needs dev attention) and removed the "QA Active" label on Apr 27, 2026
… to journal

Round-1 QA on PR #30 caught a doc-drift miss: the "Pick an
approach" table row for Bare systemd at deploy/README.md:33
listed "Native journal logging" as a pro. After this PR's
Caddyfile change, only the collector logs to journal — Caddy
writes to /var/log/caddy/{error,access}.log directly. An
operator picking deployment shape from that table would
reasonably expect to grep `journalctl -u caddy` for request data
and find none.

Updated the cell to read "Collector logs to journal; Caddy logs
to rotated files under /var/log/caddy/" — same bullet style as
the rest of the row, but accurate. Other log/journal references
in deploy/README.md (line 88's collector journalctl example) are
still correct and untouched.

CHANGELOG entry for this PR amended to mention the README
update so the bullet reflects full PR scope.

@github-actions (bot) added the "Awaiting CI" label and removed the "QA Failed" label on Apr 27, 2026
@cmeans-claude-dev (bot) added the "Ready for QA" label on Apr 27, 2026
@cmeans-claude-dev (Contributor, Author)

QA round 1 finding addressed. New commit 5644c59:

  • deploy/README.md:33 — Bare systemd row's pros updated from "Native journal logging" to "Collector logs to journal; Caddy logs to rotated files under `/var/log/caddy/`". Symbol-walked the rest of the file: line 34 ("No native log integration" for Docker shape) is still accurate; line 88 (collector journalctl example) is still accurate. Both untouched.
  • CHANGELOG bullet for this PR amended to mention the README update so it reflects full PR scope.

Adjacent observation about non-Debian distros needing to pre-create /var/log/caddy/ is acknowledged but not blocking per your call — the Caddyfile.example header gotcha covers operators who'll see the warning before they fall into it.

Re-tagged Ready for QA.

@github-actions (bot) removed the "Awaiting CI" label on Apr 27, 2026
@cmeans added the "QA Active" label and removed the "Ready for QA" label on Apr 27, 2026

@cmeans (Owner) left a comment

QA round 2 — clean, signing off

Round-1 doc-drift fix landed as commit 5644c59. The deploy/README.md:33 cell now reads:

Smallest moving parts. Predictable. Collector logs to journal; Caddy logs to rotated files under /var/log/caddy/.

That's accurate and self-explanatory — operators picking deployment shape from the table now get the right mental model for where to grep request data.

Round-1 finding resolution check:

| Round-1 ask | Resolution |
|---|---|
| deploy/README.md:33 "Native journal logging" claim | replaced with split phrasing that names both halves of the deployment correctly |
| Repo-wide drift recheck | only two journal / log references remain in deploy/README.md: line 33 (now correct) and line 88's journalctl -u pypi-winnow-downloads-collector.service (still accurate: collector still uses journal). Nothing else drifts. |
| CHANGELOG accuracy | the existing PR bullet was amended to mention the README update so the Unreleased entry reflects the full PR scope, not just the Caddyfile change |

No regression on what was already correct in round 1:

  • Caddyfile.example itself is unchanged from round 1 (verified by git diff baba6a5...5644c59 -- deploy/caddy/Caddyfile.example — empty). The byte-for-byte match against live CT 112 production still holds (50MiB/10/2160h error, 100MiB/14/720h access, level ERROR global, JSON on both).
  • CI on new head: all SUCCESS — including deploy-smoke validating the unchanged Caddyfile.example against caddy:2.
  • Live URL behavior on pypi-badges.intfar.com already verified in round 1 (200 / 200 / 404).

No new findings. Transitioning label to Ready for QA Signoff.

@cmeans (Owner) commented Apr 27, 2026

Applying Ready for QA Signoff — see review above. Round-1 doc drift fixed in 5644c59: deploy/README.md:33 now correctly splits the logging story ("Collector logs to journal; Caddy logs to rotated files under /var/log/caddy/"). Caddyfile.example unchanged from round 1, so the byte-for-byte match against live CT 112 production still holds. CI all green on new head.

@cmeans added the "Ready for QA Signoff" label (QA passed; ready for maintainer final review and merge) and removed the "QA Active" label on Apr 27, 2026

@cmeans (Owner) left a comment

LGTM

@cmeans added the "QA Approved" label (Manual QA testing completed and passed) and removed the "Ready for QA Signoff" label on Apr 27, 2026
@cmeans-claude-dev (bot) merged commit f9a431b into main on Apr 27, 2026
40 checks passed
@cmeans-claude-dev (bot) deleted the deploy/caddy-logging branch on April 27, 2026 02:19

cmeans-claude-dev (bot) added a commit that referenced this pull request on Apr 27, 2026

Three mechanical edits:

- pyproject.toml: version "0.1.0" -> "0.1.1"
- CHANGELOG.md: insert `## [0.1.1] - 2026-04-26` directly under
  the (still empty) `## [Unreleased]` header so that the bullets
  from all 12 PRs accumulated since v0.1.0 shipped are now
  categorized under the 0.1.1 release. Updated the link refs at
  the bottom: [Unreleased] now compares from v0.1.1, and a new
  [0.1.1] entry compares v0.1.0...v0.1.1.
- uv.lock: refreshed by `uv lock` so the locked
  pypi-winnow-downloads version (0.1.1) matches pyproject.toml.

What ships in v0.1.1 (highlights — full changelog under
## [0.1.1]):

Library fixes (operator-visible):
- collector: _write_health OSError no longer escapes
  per-package isolation. Disk-full / perm errors now produce
  structured `winnow-collect: ...; health file write failed:
  [Errno 28] No space left on device` exit instead of a raw
  traceback. Closes #32.
- collector: stale_threshold_days is now actually consulted —
  the "warn if previous run is older than N days" feature
  documented in config.example.yaml since v0.1.0 finally
  fires. Log-only per the documented v1 contract; degrades
  silently on first-run / unreadable / malformed / future-
  timestamped previous _health.json. Closes #33.

Documentation:
- README acknowledgments / license / BigQuery dataset link
  refresh (PR #15)
- README shields.io URL canonicalization (PR #27, closes #16)
- deploy/README.md Tailscale Funnel as alternative HTTPS
  exposure (PR #22)
- deploy/README.md "Pick an approach" table updated to
  reflect the new Caddy logging shape (in PR #30)

CI / project infrastructure (no PyPI consumer impact, but
hardens future releases):
- Community health files: CONTRIBUTING / CoC / SECURITY /
  issue templates (PR #20)
- .github/dependabot.yml across pip + github-actions + docker
  ecosystems (PR #21)
- Dependabot PR hygiene cascade from cmeans/mcp-synology:
  PULL_REQUEST_TEMPLATE.md + auto-CHANGELOG workflow (App-
  token authenticated so required CI re-fires on the bot's
  HEAD SHA) + dependabot.yml prefix fix (PR #25). Validated
  end-to-end via the first two real Dependabot bumps PR #23
  (codecov-action 5->6) and PR #24 (python 3.13-slim ->
  3.14-slim).
- deploy-smoke CI job that builds the Dockerfile, smokes the
  entrypoint, validates compose+Caddyfile against caddy:2
  (PR #29, closes #7). Promoted to required status check on
  the main-protection ruleset 2026-04-26 22:43 (issue #31
  closed via operator action).
- deploy/caddy/Caddyfile.example gains global error logger +
  per-site access logger with built-in lumberjack rotation,
  documents the validate-as-root gotcha (PR #30). Live CT 112
  deployment fixed in the same change.
- 100% coverage on src/ via real tests (no `# pragma: no
  cover`), with `fail_under = 100` gate in pyproject.toml so
  future regressions trip CI (PR #38, closes #37).

Verified locally: 71/71 pytest pass, ruff/format/mypy clean,
coverage gate green at 100.00%.

After this merges:
1. Tag the squash-merge commit as v0.1.1 and push the tag —
   publish.yml fires and uploads to PyPI via the existing
   trusted-publisher OIDC flow.
2. Update the live CT 112 deployment to install
   pypi-winnow-downloads==0.1.1 from PyPI (currently runs a
   wheel built from main, but pinning to the released
   version keeps deploy reproducible).
3. Close any post-release follow-ups Chris wants tracked.

Co-authored-by: cmeans-claude-dev[bot] <272174644+cmeans-claude-dev[bot]@users.noreply.github.com>