docs: cost-model + pypinfo CLI gotchas note; explicit --limit on run_pypinfo#62
Conversation
…pypinfo Two parts: 1. New engineering doc at `docs/cost-model-and-pypinfo-gotchas.md` capturing the BigQuery scan-cost shape (~4.6 GB/pkg/run; free-tier ceiling around 7 packages on a daily cadence; cost envelope at higher scales), the levers that move cost (frequency cuts, pypistats fallback, hybrid), and two pypinfo CLI foot-guns: - `--where` AND-combines with the positional rather than overriding - `--limit` defaults to 10; falsy values fall back to that default Material was previously captured only in the awareness store (entry c41ae589) from PR #14's testing, where running batched-query queries over 300 packages cost the project $11.32 to learn. Promoting it to the public repo means future maintainers (and self-hosters) don't re-walk it. README install section gains a brief pointer with the free-tier rule of thumb. 2. `run_pypinfo` argv now passes `--limit 500` explicitly. Realistic ci-by-installer-by-system combo ceiling for one package is ~3 x 8 x 4 ≈ 96; under pypinfo's implicit default of 10, popular packages with diverse installer/system spread silently lost the long tail and the hero badge undercounted. SQL `LIMIT` is post-aggregation so a generous bound does not change `bytes_billed` — the cost envelope in the new doc is unaffected. Regression test asserts `--limit` present in argv and value >= 100. Coverage stays at 100% (89/88 tests pass with the new test).
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
cmeans
left a comment
There was a problem hiding this comment.
QA review — round 1
HEAD 190bfca. Verification ran in the dev's worktree at ../pypi-winnow-downloads-costdocs.
Local verification at 190bfca:
uv run pytest --cov→ 89/89 pass, 100% coverage on 286 src statements.uv run ruff check src/ tests/→ clean.uv run ruff format --check src/ tests/→ 11 files already formatted.uv run mypy src/pypi_winnow_downloads/→ no issues.uv lock --locked→ clean (the new gate from PR #61).- CI on PR head: all 7 required checks SUCCESS (lint, typecheck, test 3.11/3.12/3.13, deploy-smoke + the two qa-approved gates).
Pypinfo source claims spot-checked against installed pypinfo 23.0.0:
pypinfo/core.py:24→DEFAULT_LIMIT = 10✓pypinfo/core.py:198→limit = limit or DEFAULT_LIMIT✓ (gotcha 2)pypinfo/core.py:232-241→WHEREAND-combine viaif project: conditions.append(...)and the trailingif where:append ✓ (gotcha 1)pypinfo/cli.py:130-134→if auth: set_credentials(auth); ... returnshort-circuit ✓ (matches the existing--authcomment inrun_pypinfo)pypinfo/fields.py→details.installer.name↔installer_name,details.system.name↔system_name✓
Cost-table arithmetic spot-checked:
4.6 GB/pkg × 30 runs/month = 138 GB/pkg/month. Each row reproduces (4×138=552, 7×138=966, 10×138=1380, 50×138=6900, 100×138=13800, 300×138=41400). After-free-tier $5/TB rate matches each row's ~$N/month figure under a 1 TB free-tier basis.
Code change: --limit 500 is correctly placed in argv after --all and before the <package> positional. The new comment block at collector.py:170-178 accurately describes the why and references the new doc. Existing argv-shape tests (test_run_pypinfo_invokes_pypinfo_with_expected_argv, test_run_pypinfo_argv_groups_by_ci_installer_system) keep passing because argv[-3:] and the package < ci < installer ordering are unchanged.
Test: test_run_pypinfo_argv_passes_explicit_limit is the right shape — asserts argv contains --limit and that the value is >= 100 (so 500 has headroom but the test isn't brittle to small tuning of the exact number). Floor of 100 cleanly clears the documented ~3 × 8 × 4 = 96 realistic-combo ceiling.
Findings
1. (observation) Free-tier unit inconsistency between README and doc — pick one basis.
README.md:104says `BigQuery's free tier is 1 TiB of scan per month`.docs/cost-model-and-pypinfo-gotchas.md:40table heading says `Free tier (1 TB/mo) ceiling`, and the qualitative assessments + after-free-tier$figures in that table consistently use a 1 TB basis (the actual GCP free tier is 1 TiB ≈ 1099.5 GB).
The shift from 1 TB → 1 TiB doesn't change the order of magnitude, but it does change a couple of qualitative cells:
| Packages | Monthly billed | Doc verdict (1 TB basis) | Under 1 TiB basis |
|---|---|---|---|
| 7 | 966 GB | "at the ceiling" | comfortably under (88%) |
| 10 | 1.38 TB | "~$2/month over" | ~$1.40/month over |
| 50 | 6.9 TB | "~$30/month" | ~$29/month |
For an engineering-grade reference doc that's promoted into the public repo specifically to be the canonical numbers, the README's 1 TiB is the right basis (matches GCP's own docs). Recommend updating the table heading to `Free tier (1 TiB/mo) ceiling`, refreshing the 7-row qualitative cell, and tweaking the 10/50/100/300 `~$N/month` cells if you want them tighter than current. Cost section near the top of the doc inherits the same fix.
2. (nit) Doc "See also" overstates what's commented in the code.
docs/cost-model-and-pypinfo-gotchas.md:151 says:
src/pypi_winnow_downloads/collector.py— the liverun_pypinfofunction with both gotchas commented at their respective line ranges inside the function body.
But the function body only carries an inline comment for gotcha 2 (--limit, at collector.py:170-178). Gotcha 1 (--where) is not commented in the code because the collector doesn't use --where at all — which the doc itself correctly notes at lines 124-126 ("This project ships per-package serial ... and does not use --where. The gotcha is preserved here for anyone reviving the batched path or hacking on a fork.").
Suggest tightening the See-also bullet to e.g. `with the `--limit` gotcha commented inline` or `with the active gotcha commented inline`.
Both findings are doc-only; no code re-test needed once they're addressed. Holding off on Ready for QA Signoff per the every-observation-blocks-signoff rule. Will reconfirm on the next push.
|
Applying QA Failed as the final act of round 1. Two doc-only findings posted in the review above (one observation re: TiB/TB free-tier basis inconsistency between README and the new doc, one nit re: 'See also' overstating what's commented in |
QA round 1 findings: - Free-tier basis was inconsistent: README cited 1 TiB but the cost-envelope table in docs/cost-model-and-pypinfo-gotchas.md used a 1 TB basis in the heading and qualitative cells. Aligns the doc on 1 TiB (matches GCP's actual free-tier figure and the README). Refreshes the qualitative cells whose verdicts shift under the larger basis: 7-pkg row moves from "at the ceiling" to "comfortably under (88%)"; 10-pkg row from "~\$2/month over" to "~\$1.40/month over"; 50/100-pkg dollar figures tightened to "~\$29" and "~\$64". README rule of thumb updated from "~7 packages" to "~7-8 packages" so the prose ceiling matches the table; CHANGELOG entry follows. - "See also" bullet pointing at collector.py overstated what is commented inline. Only gotcha 2 (--limit) is commented in the function body; gotcha 1 (--where) is intentionally not, because the collector ships per-package serial and does not pass --where. Reworded to name --limit explicitly and explain why the --where gotcha is preserved in the doc but not in code. Doc-only changes; 89/89 tests pass with 100% coverage, ruff/mypy/uv lock --locked all clean. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
QA round 2 ready at
Local re-verification at PR body updated with a "QA round 2" section so the round-1 → round-2 delta lands in the squash-merge commit too. |
|
Starting QA round 2 on |
|
QA round 2 — PASS at `74fef8b`. Both round-1 findings addressed:
Repo-wide grep for `1 TB` / `1 TiB` shows no remaining drift — the only "TB" references left are the GCP after-free unit price ($5/TB) and the monthly-bytes-billed column, both intentional. Local verification on `74fef8b` (this session):
PR body test-plan checkboxes are already current. Round-2 diff is 3 files / 13+ 11-: `CHANGELOG.md`, `README.md`, `docs/cost-model-and-pypinfo-gotchas.md` — doc-only, no code re-test needed beyond the smoke pass above. Promoting to Ready for QA Signoff as the final act of this comment. Awaiting maintainer QA Approved. Process note (non-blocking): between round 1 and the QA Active I added at the start of this round, the PR cycled through a Dev-applied `QA Failed` → `Ready for QA Signoff` swap that skipped the QA-side reverification step. Per the project label workflow, only QA may apply `Ready for QA Signoff`; Dev's role on a fix-cycle push ends at `Ready for QA` (or letting the push automation reset to `Awaiting CI` → `Ready for QA`). Calling it out here for the audit trail; not a re-fail. |
Summary
Two parts in one PR — they share the same source material from PR #14's testing and would be artificially split if they didn't share their references.
1. New engineering doc:
docs/cost-model-and-pypinfo-gotchas.md. Captures three things that aren't obvious from the code:file.projectandWHERE INdefeats cluster pruning. Free-tier ceiling lands around 7-8 packages on a daily cadence (1 TiB basis).TABLESAMPLE) for completeness.core.py:build_queryreferences:--whereAND-combines with the positional rather than overriding, and--limitdefaults to 10 with falsy values falling back to that default (limit or DEFAULT_LIMITin source). The first one bites multi-package callers; the second one bites multi-pivot callers — including this project'srun_pypinfo.The material was previously captured only in the maintainer's private knowledge store (awareness entry
c41ae589). Promoting it to the public repo means future maintainers and self-hosters don't have to re-walk it. README install section gains a brief pointer with the free-tier rule of thumb so self-hosters can size against it before committing.2.
run_pypinfoargv carries an explicit--limit 500. Closes the gotcha-2 hole on the live code path. The pivot isci x installer x system; realistic distinct combos for one package are ~3 x 8 x 4 ≈ 96. Under the prior implicit-default-10 path, a popular package with diverse installer/system spread silently lost the long tail and the hero badge (sum of post-allowlist rows) would undercount. SQLLIMITis post-aggregation, so a generous bound does not changebytes_billed— the cost envelope in the new doc is unaffected.Regression coverage in
tests/test_collector.py::test_run_pypinfo_argv_passes_explicit_limitfails if--limitis dropped or the value drops below 100.QA round 2 — doc-only follow-up at 74fef8b
Two findings from QA round 1, both doc-only:
docs/cost-model-and-pypinfo-gotchas.mdused a 1 TB basis in the heading and the qualitative cells. Switched to 1 TiB (matches GCP's actual figure and what the README says) and refreshed the verdicts whose qualitative reading shifts under the larger basis: 7-pkg row from "at the ceiling" to "comfortably under (88%)"; 10-pkg row from "$2/month over" to "$1.40/month over"; 50-pkg from "$30" to "$29"; 100-pkg from "$65" to "$64". README's prose rule of thumb updated from "~7 packages" to "~7-8 packages" so it matches the table; CHANGELOG entry follows.collector.pyno longer overstates what is commented inline. Originally said "with both gotchas commented at their respective line ranges"; only gotcha 2 (--limit) is commented in the function body, because the collector ships per-package serial and does not pass--where. Reworded to name--limitexplicitly and explain why the--wheregotcha is preserved in the doc but not in code.Test plan
uv sync --frozen --extra dev && uv run pytest --cov— 89/89 pass, 100% coverage on 286 src statements (re-run at 74fef8b).uv run ruff check src/ tests/clean.uv run ruff format --check src/ tests/clean (11 files already formatted).uv run mypy src/pypi_winnow_downloads/clean.uv lock --lockedclean (the new gate from ci: add uv lock --locked check to lint job (closes #60) #61 — confirms nopyproject.tomldrift).pypinfo23.0.0'score.pyandcli.py.--limit 500doesn't trip any unstated assumption inrun_pypinfo's downstream parsing — the row shape is unchanged, so the parser is unaffected by the larger response cap.