docs: cost-model + pypinfo CLI gotchas note; explicit --limit on run_pypinfo by cmeans-claude-dev[bot] · Pull Request #62 · cmeans/pypi-winnow-downloads

cmeans-claude-dev · 2026-05-02T02:30:35Z

Summary

Two parts in one PR — they share the same source material from PR #14's testing and would be artificially split if they didn't share their references.

1. New engineering doc: docs/cost-model-and-pypinfo-gotchas.md. Captures three things that aren't obvious from the code:

BigQuery scan-cost shape. Empirical numbers from PR feat(collector): batched BigQuery query — one scan per window group #14's testing (which cost the project $11.32 to learn): per-package serial runs ~4.6 GB billed/pkg/30-day-window, batched queries of 300 packages cost ~7.7 GB/pkg, so batching is more expensive than serial — the table is clustered on file.project and WHERE IN defeats cluster pruning. Free-tier ceiling lands around 7-8 packages on a daily cadence (1 TiB basis).
Levers that move cost. Frequency cuts (daily → weekly = 7x), pypistats.org fallback, hybrid, materialized views — in order of effectiveness. Includes anti-levers (smaller batch sizes, TABLESAMPLE) for completeness.
Two pypinfo CLI gotchas with core.py:build_query references: --where AND-combines with the positional rather than overriding, and --limit defaults to 10 with falsy values falling back to that default (limit or DEFAULT_LIMIT in source). The first one bites multi-package callers; the second one bites multi-pivot callers — including this project's run_pypinfo.

The material was previously captured only in the maintainer's private knowledge store (awareness entry c41ae589). Promoting it to the public repo means future maintainers and self-hosters don't have to re-walk it. README install section gains a brief pointer with the free-tier rule of thumb so self-hosters can size against it before committing.

2. run_pypinfo argv carries an explicit --limit 500. Closes the gotcha-2 hole on the live code path. The pivot is ci x installer x system; realistic distinct combos for one package are ~3 x 8 x 4 ≈ 96. Under the prior implicit-default-10 path, a popular package with diverse installer/system spread silently lost the long tail and the hero badge (sum of post-allowlist rows) would undercount. SQL LIMIT is post-aggregation, so a generous bound does not change bytes_billed — the cost envelope in the new doc is unaffected.

Regression coverage in tests/test_collector.py::test_run_pypinfo_argv_passes_explicit_limit fails if --limit is dropped or the value drops below 100.

QA round 2 — doc-only follow-up at `74fef8b`

Two findings from QA round 1, both doc-only:

Free-tier basis aligned on 1 TiB across README, doc, and CHANGELOG. README already cited 1 TiB but the cost-envelope table in docs/cost-model-and-pypinfo-gotchas.md used a 1 TB basis in the heading and the qualitative cells. Switched to 1 TiB (matches GCP's actual figure and what the README says) and refreshed the verdicts whose qualitative reading shifts under the larger basis: 7-pkg row from "at the ceiling" to "comfortably under (88%)"; 10-pkg row from "~~$2/month over" to "~~$1.40/month over"; 50-pkg from "~~$30" to "~~$29"; 100-pkg from "~~$65" to "~~$64". README's prose rule of thumb updated from "~7 packages" to "~7-8 packages" so it matches the table; CHANGELOG entry follows.
"See also" bullet pointing at collector.py no longer overstates what is commented inline. Originally said "with both gotchas commented at their respective line ranges"; only gotcha 2 (--limit) is commented in the function body, because the collector ships per-package serial and does not pass --where. Reworded to name --limit explicitly and explain why the --where gotcha is preserved in the doc but not in code.

Test plan

uv sync --frozen --extra dev && uv run pytest --cov — 89/89 pass, 100% coverage on 286 src statements (re-run at 74fef8b).
uv run ruff check src/ tests/ clean.
uv run ruff format --check src/ tests/ clean (11 files already formatted).
uv run mypy src/pypi_winnow_downloads/ clean.
uv lock --locked clean (the new gate from ci: add uv lock --locked check to lint job (closes #60) #61 — confirms no pyproject.toml drift).
CI green on PR head.
Skim the new doc for accuracy of the gotcha line-references against pypinfo 23.0.0's core.py and cli.py.
Confirm the explicit --limit 500 doesn't trip any unstated assumption in run_pypinfo's downstream parsing — the row shape is unchanged, so the parser is unaffected by the larger response cap.

…pypinfo Two parts: 1. New engineering doc at `docs/cost-model-and-pypinfo-gotchas.md` capturing the BigQuery scan-cost shape (~4.6 GB/pkg/run; free-tier ceiling around 7 packages on a daily cadence; cost envelope at higher scales), the levers that move cost (frequency cuts, pypistats fallback, hybrid), and two pypinfo CLI foot-guns: - `--where` AND-combines with the positional rather than overriding - `--limit` defaults to 10; falsy values fall back to that default Material was previously captured only in the awareness store (entry c41ae589) from PR #14's testing, where running batched-query queries over 300 packages cost the project $11.32 to learn. Promoting it to the public repo means future maintainers (and self-hosters) don't re-walk it. README install section gains a brief pointer with the free-tier rule of thumb. 2. `run_pypinfo` argv now passes `--limit 500` explicitly. Realistic ci-by-installer-by-system combo ceiling for one package is ~3 x 8 x 4 ≈ 96; under pypinfo's implicit default of 10, popular packages with diverse installer/system spread silently lost the long tail and the hero badge undercounted. SQL `LIMIT` is post-aggregation so a generous bound does not change `bytes_billed` — the cost envelope in the new doc is unaffected. Regression test asserts `--limit` present in argv and value >= 100. Coverage stays at 100% (89/88 tests pass with the new test).

codecov-commenter · 2026-05-02T02:31:25Z

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

cmeans

LGTM

cmeans

QA review — round 1

HEAD 190bfca. Verification ran in the dev's worktree at ../pypi-winnow-downloads-costdocs.

Local verification at 190bfca:

uv run pytest --cov → 89/89 pass, 100% coverage on 286 src statements.
uv run ruff check src/ tests/ → clean.
uv run ruff format --check src/ tests/ → 11 files already formatted.
uv run mypy src/pypi_winnow_downloads/ → no issues.
uv lock --locked → clean (the new gate from PR #61).
CI on PR head: all 7 required checks SUCCESS (lint, typecheck, test 3.11/3.12/3.13, deploy-smoke + the two qa-approved gates).

Pypinfo source claims spot-checked against installed pypinfo 23.0.0:

pypinfo/core.py:24 → DEFAULT_LIMIT = 10 ✓
pypinfo/core.py:198 → limit = limit or DEFAULT_LIMIT ✓ (gotcha 2)
pypinfo/core.py:232-241 → WHERE AND-combine via if project: conditions.append(...) and the trailing if where: append ✓ (gotcha 1)
pypinfo/cli.py:130-134 → if auth: set_credentials(auth); ... return short-circuit ✓ (matches the existing --auth comment in run_pypinfo)
pypinfo/fields.py → details.installer.name ↔ installer_name, details.system.name ↔ system_name ✓

Cost-table arithmetic spot-checked:

4.6 GB/pkg × 30 runs/month = 138 GB/pkg/month. Each row reproduces (4×138=552, 7×138=966, 10×138=1380, 50×138=6900, 100×138=13800, 300×138=41400). After-free-tier $5/TB rate matches each row's ~$N/month figure under a 1 TB free-tier basis.

Code change: --limit 500 is correctly placed in argv after --all and before the <package> positional. The new comment block at collector.py:170-178 accurately describes the why and references the new doc. Existing argv-shape tests (test_run_pypinfo_invokes_pypinfo_with_expected_argv, test_run_pypinfo_argv_groups_by_ci_installer_system) keep passing because argv[-3:] and the package < ci < installer ordering are unchanged.

Test: test_run_pypinfo_argv_passes_explicit_limit is the right shape — asserts argv contains --limit and that the value is >= 100 (so 500 has headroom but the test isn't brittle to small tuning of the exact number). Floor of 100 cleanly clears the documented ~3 × 8 × 4 = 96 realistic-combo ceiling.

Findings

1. (observation) Free-tier unit inconsistency between README and doc — pick one basis.

README.md:104 says `BigQuery's free tier is 1 TiB of scan per month`.
docs/cost-model-and-pypinfo-gotchas.md:40 table heading says `Free tier (1 TB/mo) ceiling`, and the qualitative assessments + after-free-tier $ figures in that table consistently use a 1 TB basis (the actual GCP free tier is 1 TiB ≈ 1099.5 GB).

The shift from 1 TB → 1 TiB doesn't change the order of magnitude, but it does change a couple of qualitative cells:

Packages	Monthly billed	Doc verdict (1 TB basis)	Under 1 TiB basis
7	966 GB	"at the ceiling"	comfortably under (88%)
10	1.38 TB	"~$2/month over"	~$1.40/month over
50	6.9 TB	"~$30/month"	~$29/month

For an engineering-grade reference doc that's promoted into the public repo specifically to be the canonical numbers, the README's 1 TiB is the right basis (matches GCP's own docs). Recommend updating the table heading to `Free tier (1 TiB/mo) ceiling`, refreshing the 7-row qualitative cell, and tweaking the 10/50/100/300 `~$N/month` cells if you want them tighter than current. Cost section near the top of the doc inherits the same fix.

2. (nit) Doc "See also" overstates what's commented in the code.

docs/cost-model-and-pypinfo-gotchas.md:151 says:

src/pypi_winnow_downloads/collector.py — the live run_pypinfo function with both gotchas commented at their respective line ranges inside the function body.

But the function body only carries an inline comment for gotcha 2 (--limit, at collector.py:170-178). Gotcha 1 (--where) is not commented in the code because the collector doesn't use --where at all — which the doc itself correctly notes at lines 124-126 ("This project ships per-package serial ... and does not use --where. The gotcha is preserved here for anyone reviving the batched path or hacking on a fork.").

Suggest tightening the See-also bullet to e.g. `with the `--limit` gotcha commented inline` or `with the active gotcha commented inline`.

Both findings are doc-only; no code re-test needed once they're addressed. Holding off on Ready for QA Signoff per the every-observation-blocks-signoff rule. Will reconfirm on the next push.

cmeans · 2026-05-02T15:17:19Z

Applying QA Failed as the final act of round 1. Two doc-only findings posted in the review above (one observation re: TiB/TB free-tier basis inconsistency between README and the new doc, one nit re: 'See also' overstating what's commented in collector.py). Both are 30-second fixes; no code re-test needed once they're addressed.

QA round 1 findings: - Free-tier basis was inconsistent: README cited 1 TiB but the cost-envelope table in docs/cost-model-and-pypinfo-gotchas.md used a 1 TB basis in the heading and qualitative cells. Aligns the doc on 1 TiB (matches GCP's actual free-tier figure and the README). Refreshes the qualitative cells whose verdicts shift under the larger basis: 7-pkg row moves from "at the ceiling" to "comfortably under (88%)"; 10-pkg row from "~\$2/month over" to "~\$1.40/month over"; 50/100-pkg dollar figures tightened to "~\$29" and "~\$64". README rule of thumb updated from "~7 packages" to "~7-8 packages" so the prose ceiling matches the table; CHANGELOG entry follows. - "See also" bullet pointing at collector.py overstated what is commented inline. Only gotcha 2 (--limit) is commented in the function body; gotcha 1 (--where) is intentionally not, because the collector ships per-package serial and does not pass --where. Reworded to name --limit explicitly and explain why the --where gotcha is preserved in the doc but not in code. Doc-only changes; 89/89 tests pass with 100% coverage, ruff/mypy/uv lock --locked all clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

cmeans-claude-dev · 2026-05-02T19:00:32Z

QA round 2 ready at 74fef8b. Both round-1 findings addressed:

F1 (free-tier basis): cost-envelope table heading switched to 1 TiB/mo; verdicts in 7/10/50/100-pkg rows refreshed under the larger basis (7 → "comfortably under (88%)", 10 → "~~$1.40/month over", 50 → "~~$29/month", 100 → "~$64/month"). 4-pkg and 300-pkg rows unchanged (still accurate). README rule of thumb updated to "~7-8 packages" so the prose matches the table; CHANGELOG entry follows.
F2 (See-also overstatement): rewrote the collector.py bullet to name --limit explicitly as the gotcha that's commented inline, with a second sentence explaining why --where is preserved in the doc but not in code (collector ships per-package serial, no --where to comment).

Local re-verification at 74fef8b: 89/89 tests pass, 100% coverage, ruff/mypy/uv lock --locked all clean. Doc-only push; no code re-test needed.

PR body updated with a "QA round 2" section so the round-1 → round-2 delta lands in the squash-merge commit too.

cmeans

LGTM

cmeans · 2026-05-04T22:55:06Z

Starting QA round 2 on 74fef8b. Adding QA Active to take the workflow back to the proper QA-owned state — round-1 returned QA Failed for two doc-only findings, and the prior label sequence (QA Failed → Ready for QA Signoff without a QA Active round 2) skipped the verification step. Will re-run the full local stack, reconfirm both findings are addressed, then apply the terminal label.

cmeans · 2026-05-04T22:57:14Z

QA round 2 — PASS at `74fef8b`.

Both round-1 findings addressed:

F1 (free-tier basis 1 TiB). README, the new doc table heading, and CHANGELOG all aligned on 1 TiB. Cost-envelope arithmetic re-verified end-to-end:
- 7 pkg: 966 GB / 1099.5 GB (= 1 TiB) = 88% ✓ (cell now reads "comfortably under (88%)")
- 10 pkg: (1.38 − 1.0995) × $5 = $1.40 ✓
- 50 pkg: (6.9 − 1.0995) × $5 = $29 ✓
- 100 pkg: (13.8 − 1.0995) × $5 = $63.50 ≈ $64 ✓
- 300 pkg unchanged: (41 − 1.0995) × $5 = $199.50 ≈ $200 ✓
- README "~7-8 packages" rule of thumb consistent with the table (8 × 138 ≈ 1104 GB just at the 1 TiB ceiling).
F2 (See-also bullet). Now reads "with the `--limit` gotcha commented inline" plus a second sentence explaining why `--where` is preserved in the doc but not in the code (per-package serial; not used). Honest and an improvement on the round-1 suggested wording.

Repo-wide grep for `1 TB` / `1 TiB` shows no remaining drift — the only "TB" references left are the GCP after-free unit price ($5/TB) and the monthly-bytes-billed column, both intentional.

Local verification on `74fef8b` (this session):

`uv run pytest --cov` — 89/89 pass, 100% coverage on 286 src statements, 0 deselected.
`uv run ruff check .` clean.
`uv run ruff format --check .` clean (11 files).
`uv run mypy src` clean (5 src files).
`uv lock --locked` clean — confirms no `pyproject.toml` drift (the new structural gate from ci: add uv lock --locked check to lint job (closes #60) #61 in action).
CI on `74fef8b`: all 7 required checks SUCCESS.

PR body test-plan checkboxes are already current. Round-2 diff is 3 files / 13+ 11-: `CHANGELOG.md`, `README.md`, `docs/cost-model-and-pypinfo-gotchas.md` — doc-only, no code re-test needed beyond the smoke pass above.

Promoting to Ready for QA Signoff as the final act of this comment. Awaiting maintainer QA Approved.

Process note (non-blocking): between round 1 and the QA Active I added at the start of this round, the PR cycled through a Dev-applied `QA Failed` → `Ready for QA Signoff` swap that skipped the QA-side reverification step. Per the project label workflow, only QA may apply `Ready for QA Signoff`; Dev's role on a fix-cycle push ends at `Ready for QA` (or letting the push automation reset to `Awaiting CI` → `Ready for QA`). Calling it out here for the audit trail; not a re-fail.

cmeans-claude-dev Bot added the Ready for QA Dev work complete — QA can begin review label May 2, 2026

github-actions Bot added Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA Ready for QA Dev work complete — QA can begin review and removed Ready for QA Dev work complete — QA can begin review Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA labels May 2, 2026

cmeans previously approved these changes May 2, 2026

View reviewed changes

cmeans added the QA Active QA is actively reviewing; Dev should not push changes label May 2, 2026

github-actions Bot removed the Ready for QA Dev work complete — QA can begin review label May 2, 2026

cmeans reviewed May 2, 2026

View reviewed changes

cmeans added QA Failed QA found issues — needs dev attention and removed QA Active QA is actively reviewing; Dev should not push changes labels May 2, 2026

cmeans-claude-dev Bot dismissed cmeans’s stale review via 74fef8b May 2, 2026 18:59

github-actions Bot added Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA Ready for QA Dev work complete — QA can begin review and removed QA Failed QA found issues — needs dev attention Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA labels May 2, 2026

cmeans-claude-dev Bot added the Ready for QA Signoff QA passed — ready for maintainer final review and merge label May 2, 2026

cmeans approved these changes May 4, 2026

View reviewed changes

cmeans removed the Ready for QA Signoff QA passed — ready for maintainer final review and merge label May 4, 2026

cmeans added the QA Active QA is actively reviewing; Dev should not push changes label May 4, 2026

github-actions Bot removed the Ready for QA Dev work complete — QA can begin review label May 4, 2026

cmeans added the Ready for QA Signoff QA passed — ready for maintainer final review and merge label May 4, 2026

github-actions Bot removed the QA Active QA is actively reviewing; Dev should not push changes label May 4, 2026

cmeans added QA Approved Manual QA testing completed and passed and removed Ready for QA Signoff QA passed — ready for maintainer final review and merge labels May 4, 2026

cmeans-claude-dev Bot merged commit a0d4821 into main May 4, 2026
36 checks passed

cmeans-claude-dev Bot deleted the docs/cost-model-and-pypinfo-gotchas branch May 4, 2026 22:59

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

docs: cost-model + pypinfo CLI gotchas note; explicit --limit on run_pypinfo#62

docs: cost-model + pypinfo CLI gotchas note; explicit --limit on run_pypinfo#62
cmeans-claude-dev[bot] merged 2 commits into
mainfrom
docs/cost-model-and-pypinfo-gotchas

cmeans-claude-dev Bot commented May 2, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented May 2, 2026

Uh oh!

cmeans left a comment

Uh oh!

cmeans left a comment

Uh oh!

cmeans commented May 2, 2026

Uh oh!

cmeans-claude-dev Bot commented May 2, 2026

Uh oh!

cmeans left a comment

Uh oh!

cmeans commented May 4, 2026

Uh oh!

cmeans commented May 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Uh oh!

Conversation

cmeans-claude-dev Bot commented May 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

QA round 2 — doc-only follow-up at 74fef8b

Test plan

Uh oh!

codecov-commenter commented May 2, 2026

Codecov Report

Uh oh!

cmeans left a comment

Choose a reason for hiding this comment

Uh oh!

cmeans left a comment

Choose a reason for hiding this comment

QA review — round 1

Findings

Uh oh!

cmeans commented May 2, 2026

Uh oh!

cmeans-claude-dev Bot commented May 2, 2026

Uh oh!

cmeans left a comment

Choose a reason for hiding this comment

Uh oh!

cmeans commented May 4, 2026

Uh oh!

cmeans commented May 4, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

cmeans-claude-dev Bot commented May 2, 2026 •

edited

Loading

QA round 2 — doc-only follow-up at `74fef8b`