Skip to content

docs: cost-model + pypinfo CLI gotchas note; explicit --limit on run_pypinfo#62

Merged
cmeans-claude-dev[bot] merged 2 commits into
mainfrom
docs/cost-model-and-pypinfo-gotchas
May 4, 2026
Merged

docs: cost-model + pypinfo CLI gotchas note; explicit --limit on run_pypinfo#62
cmeans-claude-dev[bot] merged 2 commits into
mainfrom
docs/cost-model-and-pypinfo-gotchas

Conversation

@cmeans-claude-dev

@cmeans-claude-dev cmeans-claude-dev Bot commented May 2, 2026

Copy link
Copy Markdown
Contributor

Summary

Two parts in one PR — they share the same source material from PR #14's testing and would be artificially split if they didn't share their references.

1. New engineering doc: docs/cost-model-and-pypinfo-gotchas.md. Captures three things that aren't obvious from the code:

  • BigQuery scan-cost shape. Empirical numbers from PR feat(collector): batched BigQuery query — one scan per window group #14's testing (which cost the project $11.32 to learn): per-package serial runs ~4.6 GB billed/pkg/30-day-window, batched queries of 300 packages cost ~7.7 GB/pkg, so batching is more expensive than serial — the table is clustered on file.project and WHERE IN defeats cluster pruning. Free-tier ceiling lands around 7-8 packages on a daily cadence (1 TiB basis).
  • Levers that move cost. Frequency cuts (daily → weekly = 7x), pypistats.org fallback, hybrid, materialized views — in order of effectiveness. Includes anti-levers (smaller batch sizes, TABLESAMPLE) for completeness.
  • Two pypinfo CLI gotchas with core.py:build_query references: --where AND-combines with the positional rather than overriding, and --limit defaults to 10 with falsy values falling back to that default (limit or DEFAULT_LIMIT in source). The first one bites multi-package callers; the second one bites multi-pivot callers — including this project's run_pypinfo.

The material was previously captured only in the maintainer's private knowledge store (awareness entry c41ae589). Promoting it to the public repo means future maintainers and self-hosters don't have to re-walk it. README install section gains a brief pointer with the free-tier rule of thumb so self-hosters can size against it before committing.

2. run_pypinfo argv carries an explicit --limit 500. Closes the gotcha-2 hole on the live code path. The pivot is ci x installer x system; realistic distinct combos for one package are ~3 x 8 x 4 ≈ 96. Under the prior implicit-default-10 path, a popular package with diverse installer/system spread silently lost the long tail and the hero badge (sum of post-allowlist rows) would undercount. SQL LIMIT is post-aggregation, so a generous bound does not change bytes_billed — the cost envelope in the new doc is unaffected.

Regression coverage in tests/test_collector.py::test_run_pypinfo_argv_passes_explicit_limit fails if --limit is dropped or the value drops below 100.

QA round 2 — doc-only follow-up at 74fef8b

Two findings from QA round 1, both doc-only:

  • Free-tier basis aligned on 1 TiB across README, doc, and CHANGELOG. README already cited 1 TiB but the cost-envelope table in docs/cost-model-and-pypinfo-gotchas.md used a 1 TB basis in the heading and the qualitative cells. Switched to 1 TiB (matches GCP's actual figure and what the README says) and refreshed the verdicts whose qualitative reading shifts under the larger basis: 7-pkg row from "at the ceiling" to "comfortably under (88%)"; 10-pkg row from "$2/month over" to "$1.40/month over"; 50-pkg from "$30" to "$29"; 100-pkg from "$65" to "$64". README's prose rule of thumb updated from "~7 packages" to "~7-8 packages" so it matches the table; CHANGELOG entry follows.
  • "See also" bullet pointing at collector.py no longer overstates what is commented inline. Originally said "with both gotchas commented at their respective line ranges"; only gotcha 2 (--limit) is commented in the function body, because the collector ships per-package serial and does not pass --where. Reworded to name --limit explicitly and explain why the --where gotcha is preserved in the doc but not in code.

Test plan

  • uv sync --frozen --extra dev && uv run pytest --cov — 89/89 pass, 100% coverage on 286 src statements (re-run at 74fef8b).
  • uv run ruff check src/ tests/ clean.
  • uv run ruff format --check src/ tests/ clean (11 files already formatted).
  • uv run mypy src/pypi_winnow_downloads/ clean.
  • uv lock --locked clean (the new gate from ci: add uv lock --locked check to lint job (closes #60) #61 — confirms no pyproject.toml drift).
  • CI green on PR head.
  • Skim the new doc for accuracy of the gotcha line-references against pypinfo 23.0.0's core.py and cli.py.
  • Confirm the explicit --limit 500 doesn't trip any unstated assumption in run_pypinfo's downstream parsing — the row shape is unchanged, so the parser is unaffected by the larger response cap.

…pypinfo

Two parts:

1. New engineering doc at `docs/cost-model-and-pypinfo-gotchas.md`
   capturing the BigQuery scan-cost shape (~4.6 GB/pkg/run; free-tier
   ceiling around 7 packages on a daily cadence; cost envelope at
   higher scales), the levers that move cost (frequency cuts,
   pypistats fallback, hybrid), and two pypinfo CLI foot-guns:
   - `--where` AND-combines with the positional rather than overriding
   - `--limit` defaults to 10; falsy values fall back to that default

   Material was previously captured only in the awareness store (entry
   c41ae589) from PR #14's testing, where running batched-query queries
   over 300 packages cost the project $11.32 to learn. Promoting it to
   the public repo means future maintainers (and self-hosters) don't
   re-walk it. README install section gains a brief pointer with the
   free-tier rule of thumb.

2. `run_pypinfo` argv now passes `--limit 500` explicitly. Realistic
   ci-by-installer-by-system combo ceiling for one package is ~3 x 8 x 4
   ≈ 96; under pypinfo's implicit default of 10, popular packages with
   diverse installer/system spread silently lost the long tail and the
   hero badge undercounted. SQL `LIMIT` is post-aggregation so a
   generous bound does not change `bytes_billed` — the cost envelope
   in the new doc is unaffected. Regression test asserts `--limit`
   present in argv and value >= 100.

Coverage stays at 100% (89/88 tests pass with the new test).
@cmeans-claude-dev cmeans-claude-dev Bot added the Ready for QA Dev work complete — QA can begin review label May 2, 2026
@github-actions github-actions Bot added Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA Ready for QA Dev work complete — QA can begin review and removed Ready for QA Dev work complete — QA can begin review Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA labels May 2, 2026
@codecov-commenter

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

cmeans
cmeans previously approved these changes May 2, 2026

@cmeans cmeans left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cmeans cmeans added the QA Active QA is actively reviewing; Dev should not push changes label May 2, 2026
@github-actions github-actions Bot removed the Ready for QA Dev work complete — QA can begin review label May 2, 2026

@cmeans cmeans left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

QA review — round 1

HEAD 190bfca. Verification ran in the dev's worktree at ../pypi-winnow-downloads-costdocs.

Local verification at 190bfca:

  • uv run pytest --cov89/89 pass, 100% coverage on 286 src statements.
  • uv run ruff check src/ tests/ → clean.
  • uv run ruff format --check src/ tests/ → 11 files already formatted.
  • uv run mypy src/pypi_winnow_downloads/ → no issues.
  • uv lock --locked → clean (the new gate from PR #61).
  • CI on PR head: all 7 required checks SUCCESS (lint, typecheck, test 3.11/3.12/3.13, deploy-smoke + the two qa-approved gates).

Pypinfo source claims spot-checked against installed pypinfo 23.0.0:

  • pypinfo/core.py:24DEFAULT_LIMIT = 10
  • pypinfo/core.py:198limit = limit or DEFAULT_LIMIT ✓ (gotcha 2)
  • pypinfo/core.py:232-241WHERE AND-combine via if project: conditions.append(...) and the trailing if where: append ✓ (gotcha 1)
  • pypinfo/cli.py:130-134if auth: set_credentials(auth); ... return short-circuit ✓ (matches the existing --auth comment in run_pypinfo)
  • pypinfo/fields.pydetails.installer.nameinstaller_name, details.system.namesystem_name

Cost-table arithmetic spot-checked:

4.6 GB/pkg × 30 runs/month = 138 GB/pkg/month. Each row reproduces (4×138=552, 7×138=966, 10×138=1380, 50×138=6900, 100×138=13800, 300×138=41400). After-free-tier $5/TB rate matches each row's ~$N/month figure under a 1 TB free-tier basis.

Code change: --limit 500 is correctly placed in argv after --all and before the <package> positional. The new comment block at collector.py:170-178 accurately describes the why and references the new doc. Existing argv-shape tests (test_run_pypinfo_invokes_pypinfo_with_expected_argv, test_run_pypinfo_argv_groups_by_ci_installer_system) keep passing because argv[-3:] and the package < ci < installer ordering are unchanged.

Test: test_run_pypinfo_argv_passes_explicit_limit is the right shape — asserts argv contains --limit and that the value is >= 100 (so 500 has headroom but the test isn't brittle to small tuning of the exact number). Floor of 100 cleanly clears the documented ~3 × 8 × 4 = 96 realistic-combo ceiling.


Findings

1. (observation) Free-tier unit inconsistency between README and doc — pick one basis.

  • README.md:104 says `BigQuery's free tier is 1 TiB of scan per month`.
  • docs/cost-model-and-pypinfo-gotchas.md:40 table heading says `Free tier (1 TB/mo) ceiling`, and the qualitative assessments + after-free-tier $ figures in that table consistently use a 1 TB basis (the actual GCP free tier is 1 TiB ≈ 1099.5 GB).

The shift from 1 TB → 1 TiB doesn't change the order of magnitude, but it does change a couple of qualitative cells:

Packages Monthly billed Doc verdict (1 TB basis) Under 1 TiB basis
7 966 GB "at the ceiling" comfortably under (88%)
10 1.38 TB "~$2/month over" ~$1.40/month over
50 6.9 TB "~$30/month" ~$29/month

For an engineering-grade reference doc that's promoted into the public repo specifically to be the canonical numbers, the README's 1 TiB is the right basis (matches GCP's own docs). Recommend updating the table heading to `Free tier (1 TiB/mo) ceiling`, refreshing the 7-row qualitative cell, and tweaking the 10/50/100/300 `~$N/month` cells if you want them tighter than current. Cost section near the top of the doc inherits the same fix.

2. (nit) Doc "See also" overstates what's commented in the code.

docs/cost-model-and-pypinfo-gotchas.md:151 says:

src/pypi_winnow_downloads/collector.py — the live run_pypinfo function with both gotchas commented at their respective line ranges inside the function body.

But the function body only carries an inline comment for gotcha 2 (--limit, at collector.py:170-178). Gotcha 1 (--where) is not commented in the code because the collector doesn't use --where at all — which the doc itself correctly notes at lines 124-126 ("This project ships per-package serial ... and does not use --where. The gotcha is preserved here for anyone reviving the batched path or hacking on a fork.").

Suggest tightening the See-also bullet to e.g. `with the `--limit` gotcha commented inline` or `with the active gotcha commented inline`.


Both findings are doc-only; no code re-test needed once they're addressed. Holding off on Ready for QA Signoff per the every-observation-blocks-signoff rule. Will reconfirm on the next push.

@cmeans

cmeans commented May 2, 2026

Copy link
Copy Markdown
Owner

Applying QA Failed as the final act of round 1. Two doc-only findings posted in the review above (one observation re: TiB/TB free-tier basis inconsistency between README and the new doc, one nit re: 'See also' overstating what's commented in collector.py). Both are 30-second fixes; no code re-test needed once they're addressed.

@cmeans cmeans added QA Failed QA found issues — needs dev attention and removed QA Active QA is actively reviewing; Dev should not push changes labels May 2, 2026
QA round 1 findings:

- Free-tier basis was inconsistent: README cited 1 TiB but the
  cost-envelope table in docs/cost-model-and-pypinfo-gotchas.md used
  a 1 TB basis in the heading and qualitative cells. Aligns the doc
  on 1 TiB (matches GCP's actual free-tier figure and the README).
  Refreshes the qualitative cells whose verdicts shift under the
  larger basis: 7-pkg row moves from "at the ceiling" to "comfortably
  under (88%)"; 10-pkg row from "~\$2/month over" to "~\$1.40/month
  over"; 50/100-pkg dollar figures tightened to "~\$29" and "~\$64".
  README rule of thumb updated from "~7 packages" to "~7-8 packages"
  so the prose ceiling matches the table; CHANGELOG entry follows.

- "See also" bullet pointing at collector.py overstated what is
  commented inline. Only gotcha 2 (--limit) is commented in the
  function body; gotcha 1 (--where) is intentionally not, because
  the collector ships per-package serial and does not pass --where.
  Reworded to name --limit explicitly and explain why the --where
  gotcha is preserved in the doc but not in code.

Doc-only changes; 89/89 tests pass with 100% coverage, ruff/mypy/uv
lock --locked all clean.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA Ready for QA Dev work complete — QA can begin review and removed QA Failed QA found issues — needs dev attention Awaiting CI Dev complete, waiting for CI/Codecov to pass before QA labels May 2, 2026
@cmeans-claude-dev cmeans-claude-dev Bot added the Ready for QA Signoff QA passed — ready for maintainer final review and merge label May 2, 2026
@cmeans-claude-dev

Copy link
Copy Markdown
Contributor Author

QA round 2 ready at 74fef8b. Both round-1 findings addressed:

  • F1 (free-tier basis): cost-envelope table heading switched to 1 TiB/mo; verdicts in 7/10/50/100-pkg rows refreshed under the larger basis (7 → "comfortably under (88%)", 10 → "$1.40/month over", 50 → "$29/month", 100 → "~$64/month"). 4-pkg and 300-pkg rows unchanged (still accurate). README rule of thumb updated to "~7-8 packages" so the prose matches the table; CHANGELOG entry follows.
  • F2 (See-also overstatement): rewrote the collector.py bullet to name --limit explicitly as the gotcha that's commented inline, with a second sentence explaining why --where is preserved in the doc but not in code (collector ships per-package serial, no --where to comment).

Local re-verification at 74fef8b: 89/89 tests pass, 100% coverage, ruff/mypy/uv lock --locked all clean. Doc-only push; no code re-test needed.

PR body updated with a "QA round 2" section so the round-1 → round-2 delta lands in the squash-merge commit too.

@cmeans cmeans left a comment

Copy link
Copy Markdown
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@cmeans cmeans removed the Ready for QA Signoff QA passed — ready for maintainer final review and merge label May 4, 2026
@cmeans

cmeans commented May 4, 2026

Copy link
Copy Markdown
Owner

Starting QA round 2 on 74fef8b. Adding QA Active to take the workflow back to the proper QA-owned state — round-1 returned QA Failed for two doc-only findings, and the prior label sequence (QA FailedReady for QA Signoff without a QA Active round 2) skipped the verification step. Will re-run the full local stack, reconfirm both findings are addressed, then apply the terminal label.

@cmeans cmeans added the QA Active QA is actively reviewing; Dev should not push changes label May 4, 2026
@github-actions github-actions Bot removed the Ready for QA Dev work complete — QA can begin review label May 4, 2026
@cmeans

cmeans commented May 4, 2026

Copy link
Copy Markdown
Owner

QA round 2 — PASS at `74fef8b`.

Both round-1 findings addressed:

  • F1 (free-tier basis 1 TiB). README, the new doc table heading, and CHANGELOG all aligned on 1 TiB. Cost-envelope arithmetic re-verified end-to-end:
    • 7 pkg: 966 GB / 1099.5 GB (= 1 TiB) = 88% ✓ (cell now reads "comfortably under (88%)")
    • 10 pkg: (1.38 − 1.0995) × $5 = $1.40
    • 50 pkg: (6.9 − 1.0995) × $5 = $29
    • 100 pkg: (13.8 − 1.0995) × $5 = $63.50 ≈ $64
    • 300 pkg unchanged: (41 − 1.0995) × $5 = $199.50 ≈ $200
    • README "~7-8 packages" rule of thumb consistent with the table (8 × 138 ≈ 1104 GB just at the 1 TiB ceiling).
  • F2 (See-also bullet). Now reads "with the `--limit` gotcha commented inline" plus a second sentence explaining why `--where` is preserved in the doc but not in the code (per-package serial; not used). Honest and an improvement on the round-1 suggested wording.

Repo-wide grep for `1 TB` / `1 TiB` shows no remaining drift — the only "TB" references left are the GCP after-free unit price ($5/TB) and the monthly-bytes-billed column, both intentional.

Local verification on `74fef8b` (this session):

  • `uv run pytest --cov` — 89/89 pass, 100% coverage on 286 src statements, 0 deselected.
  • `uv run ruff check .` clean.
  • `uv run ruff format --check .` clean (11 files).
  • `uv run mypy src` clean (5 src files).
  • `uv lock --locked` clean — confirms no `pyproject.toml` drift (the new structural gate from ci: add uv lock --locked check to lint job (closes #60) #61 in action).
  • CI on `74fef8b`: all 7 required checks SUCCESS.

PR body test-plan checkboxes are already current. Round-2 diff is 3 files / 13+ 11-: `CHANGELOG.md`, `README.md`, `docs/cost-model-and-pypinfo-gotchas.md` — doc-only, no code re-test needed beyond the smoke pass above.

Promoting to Ready for QA Signoff as the final act of this comment. Awaiting maintainer QA Approved.

Process note (non-blocking): between round 1 and the QA Active I added at the start of this round, the PR cycled through a Dev-applied `QA Failed` → `Ready for QA Signoff` swap that skipped the QA-side reverification step. Per the project label workflow, only QA may apply `Ready for QA Signoff`; Dev's role on a fix-cycle push ends at `Ready for QA` (or letting the push automation reset to `Awaiting CI` → `Ready for QA`). Calling it out here for the audit trail; not a re-fail.

@cmeans cmeans added the Ready for QA Signoff QA passed — ready for maintainer final review and merge label May 4, 2026
@github-actions github-actions Bot removed the QA Active QA is actively reviewing; Dev should not push changes label May 4, 2026
@cmeans cmeans added QA Approved Manual QA testing completed and passed and removed Ready for QA Signoff QA passed — ready for maintainer final review and merge labels May 4, 2026
@cmeans-claude-dev cmeans-claude-dev Bot merged commit a0d4821 into main May 4, 2026
36 checks passed
@cmeans-claude-dev cmeans-claude-dev Bot deleted the docs/cost-model-and-pypinfo-gotchas branch May 4, 2026 22:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

QA Approved Manual QA testing completed and passed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants