From 190bfca9eb7c1893899083eafc74aa9c48e5a9d7 Mon Sep 17 00:00:00 2001 From: "cmeans-claude-dev[bot]" <272174644+cmeans-claude-dev[bot]@users.noreply.github.com> Date: Fri, 1 May 2026 21:30:06 -0500 Subject: [PATCH 1/2] docs: cost-model + pypinfo CLI gotchas note; explicit --limit on run_pypinfo MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Two parts: 1. New engineering doc at `docs/cost-model-and-pypinfo-gotchas.md` capturing the BigQuery scan-cost shape (~4.6 GB/pkg/run; free-tier ceiling around 7 packages on a daily cadence; cost envelope at higher scales), the levers that move cost (frequency cuts, pypistats fallback, hybrid), and two pypinfo CLI foot-guns: - `--where` AND-combines with the positional rather than overriding - `--limit` defaults to 10; falsy values fall back to that default Material was previously captured only in the awareness store (entry c41ae589) from PR #14's testing, where running batched queries over 300 packages cost the project $11.32 to learn. Promoting it to the public repo means future maintainers (and self-hosters) don't re-walk it. README install section gains a brief pointer with the free-tier rule of thumb. 2. `run_pypinfo` argv now passes `--limit 500` explicitly. Realistic ci-by-installer-by-system combo ceiling for one package is ~3 x 8 x 4 = 96; under pypinfo's implicit default of 10, popular packages with diverse installer/system spread silently lost the long tail and the hero badge undercounted. SQL `LIMIT` is post-aggregation, so a generous bound does not change `bytes_billed` and the cost envelope in the new doc is unaffected. Regression test asserts `--limit` present in argv and value >= 100. Coverage stays at 100% (89/89 tests pass with the new test). 
--- CHANGELOG.md | 4 + README.md | 8 ++ docs/cost-model-and-pypinfo-gotchas.md | 158 +++++++++++++++++++++++++ src/pypi_winnow_downloads/collector.py | 11 ++ tests/test_collector.py | 35 ++++++ 5 files changed, 216 insertions(+) create mode 100644 docs/cost-model-and-pypinfo-gotchas.md diff --git a/CHANGELOG.md b/CHANGELOG.md index ab88d0b..b0ec7a1 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,10 @@ ### Added +- **`docs/cost-model-and-pypinfo-gotchas.md`** new engineering note documenting the BigQuery scan-cost shape for `bigquery-public-data.pypi.file_downloads` (the ~4.6 GB-per-package per-query rate, the free-tier ceiling around 7 packages on a daily cadence, the cost envelope at higher scales, and the levers that actually move cost) and two `pypinfo` CLI foot-guns surfaced during PR #14's testing: `--where` AND-combining with the positional rather than overriding it, and `--limit` defaulting to 10 with falsy values falling back to that default. Also explains why batching is *not* a cost lever — empirically more expensive than per-package serial because the `pypi.file_downloads` table is clustered on `file.project` and `WHERE IN` defeats cluster pruning. README's install section gains a brief pointer with the free-tier rule of thumb so self-hosters can size against it before committing. + +- **`run_pypinfo` argv carries an explicit `--limit 500`.** Defends against pypinfo's CLI default of 10 (`limit or DEFAULT_LIMIT` in `pypinfo/core.py:build_query` treats falsy values, including 0, as falling back) silently truncating the long tail of `(ci, installer, system)` rows. Realistic combo ceiling for one package is ~3 × 8 × 4 ≈ 96; under the prior implicit-default-10 path, a popular package with diverse installer/system spread would lose the tail and the hero count (sum of post-allowlist rows) would silently undercount. 
SQL `LIMIT` is applied after aggregation, so a generous bound does not change `bytes_billed` — the cost envelope in the new doc is unaffected. Regression test in `tests/test_collector.py::test_run_pypinfo_argv_passes_explicit_limit` fails if `--limit` is dropped or the value drops below 100. See `docs/cost-model-and-pypinfo-gotchas.md` gotcha 2. + - **`.github/workflows/ci.yml`** lint job gains a `uv lock --locked` step that fails fast if `uv.lock` and `pyproject.toml` have drifted. Catches the v0.3.0-style regression where a release commit bumps `pyproject.toml` `version` without re-running `uv lock`, leaving the lockfile's self-version block on the previous release. Structural backstop: a release commit that forgets the lockfile bump now redlits its own PR before merge rather than landing and getting picked up days later by the weekly `uv-lock-refresh.yml` cron with mis-framed PR copy. Does not touch upstream-PyPI freshness — that remains the weekly cron's job, since `uv lock --locked` only checks `pyproject.toml`-vs-lockfile consistency, not lockfile-vs-PyPI freshness. Closes [#60](https://github.com/cmeans/pypi-winnow-downloads/issues/60). ### Changed diff --git a/README.md b/README.md index 5b78e51..e384e30 100644 --- a/README.md +++ b/README.md @@ -101,6 +101,14 @@ works and is what `config.example.yaml` and the reference deploy document. Then point `service.credential_file` in your config at the resulting file. +BigQuery's free tier is 1 TiB of scan per month. At the per-package +serial rate this collector uses (~4.6 GB scanned per package per +30-day window query), a daily run on ~7 packages sits at the free-tier +ceiling. The +[cost model and pypinfo gotchas](docs/cost-model-and-pypinfo-gotchas.md) +doc has the full table, the levers for scaling beyond that, and why +batching is *not* one of them. 
+ Run with a YAML config — copy [`config.example.yaml`](https://github.com/cmeans/pypi-winnow-downloads/blob/main/config.example.yaml) and edit: diff --git a/docs/cost-model-and-pypinfo-gotchas.md b/docs/cost-model-and-pypinfo-gotchas.md new file mode 100644 index 0000000..4562c99 --- /dev/null +++ b/docs/cost-model-and-pypinfo-gotchas.md @@ -0,0 +1,158 @@ +# Cost model and pypinfo CLI gotchas + +This is an engineering note for anyone running `pypi-winnow-downloads` +themselves, or anyone hacking on the `collector.py` BigQuery code path. +It captures two things that are non-obvious from the code alone: + +1. The shape of BigQuery scan cost for `bigquery-public-data.pypi.file_downloads` +2. Two foot-guns in `pypinfo`'s CLI that affect how the collector calls it + +The numbers in the cost section come from empirical testing during PR #14 +(batched-query refactor, ultimately closed in favor of the per-package +serial approach this repo ships). The pypinfo gotchas were surfaced +during that same testing and during the v0.1.x feat/collector PR review +cycles. + +## BigQuery scan cost shape + +`bigquery-public-data.pypi.file_downloads` is clustered on `file.project` +(or has clustering-equivalent block layout). Clustering means a query +filtering on a single project efficiently prunes to that project's +blocks; a query filtering on N projects via `WHERE file.project IN (...)` +scans all N projects' blocks. + +Empirical (30-day window, daily run, all installers): + +| Approach | Bytes billed | Per package | +|-----------------------------|----------------|-------------| +| 1 package, single query | ~4.6 GB | 4.6 GB | +| 300 packages, batched query | ~2.32 TB | ~7.7 GB/pkg | +| 300 packages, serial calls | ~1.38 TB | ~4.6 GB/pkg | + +**Batching is more expensive than serial**, not less. The cluster-pruning +advantage applies most cleanly to single-package queries, so a tight +loop of 1-package queries comes out ahead of a single big `WHERE IN`. 
+Bytes billed is roughly proportional to packages-touched regardless of +batching strategy. + +### Cost envelope at the per-package serial rate (~4.6 GB/pkg/run) + +| Packages | Monthly bytes billed | Free tier (1 TB/mo) ceiling | +|----------|----------------------|-----------------------------| +| 4 | 552 GB | comfortably under | +| 7 | 966 GB | at the ceiling | +| 10 | 1.38 TB | ~$2/month over | +| 50 | 6.9 TB | ~$30/month | +| 100 | 13.8 TB | ~$65/month | +| 300 | 41 TB | ~$200/month | + +After-free-tier rate: $5/TB. + +The `BigQuery Sandbox` mode this project's GCP setup uses returns quota +errors (not charges) when the free tier is exhausted. That is the +desired failure mode for a hobby workload: "stop emitting badges" beats +"silently bill the maintainer's credit card." + +### Levers for scale, in order of effectiveness + +1. **Reduce collection frequency.** Daily → weekly is a 7x cost cut; + daily → monthly is 30x. Trades freshness for cost; for download + counts averaged over a 30-day window, freshness within a few days + is plenty. +2. **Use pypistats.org as the data source instead.** Free, no scan + cost, but loses the installer-mirror filtering refinement that this + project's "non-CI downloads" framing depends on. You're back to + the v1 mirror-inclusive numbers we explicitly improved on. +3. **Hybrid:** pypistats daily for rough counts, BigQuery weekly for + installer-mix sanity. Best cost-vs-quality tradeoff at scale. +4. **Materialized view of pre-aggregated daily summaries.** Significant + setup, modest savings, only worth it past hundreds of packages. + +### Levers that do NOT help cost + +- Direct BigQuery client library (skip the `pypinfo` subprocess) — the + scan cost is the same. Only saves subprocess startup time (~100 ms). +- Smaller batch sizes (e.g., 10 packages at a time vs. 300) — total + scan cost is roughly proportional to packages-touched. +- Sample-based scans (`TABLESAMPLE`) — introduces noise. 
Bad for the + honesty pitch this project makes about its numbers. + +### Decision for v1 + +Per-package serial is the right shape for current scale (4 dogfood +packages, comfortably under the free tier) and remains reasonable at +~10-20 packages. Beyond that, the levers above are the answer, not +query batching. PR #14's batched-query refactor was closed for that +reason. + +## pypinfo CLI gotchas + +These behaviors come from `pypinfo/cli.py` and `pypinfo/core.py` +as of the version range this project requires (`pypinfo>=23.0.0`). They +are surprising enough that anyone editing `run_pypinfo`'s argv +construction should know about them. + +### 1. `--where` AND-combines with the positional, never overrides it + +`pypinfo [PROJECT] [FIELDS...]` always generates a `WHERE +file.project = "<PROJECT>"` clause from the positional and ANDs it with +any additional `--where` predicate. It does not replace one with the +other. Source, `pypinfo/core.py:build_query`: + +```python +conditions = ["WHERE timestamp BETWEEN ..."] +if project: + conditions.append(f'file.project = "{project}"\n') +... +query += " AND ".join(conditions) +if where: + query += f" AND {where}\n" +``` + +The `if project:` check means an **empty-string positional** (`""`) +skips the auto-filter. So if you ever want a multi-package query, the +positional must be `""` and the package list goes in `--where`: + +``` +pypinfo --where 'file.project IN ("a","b","c")' "" project ci installer +``` + +Anything else and the SQL ends up with both `file.project = "a" AND +file.project IN (...)`, which silently restricts the response to +package `a` only. + +This project ships per-package serial (one query per package), so the +collector passes the real package name as the positional and does not +use `--where`. The gotcha is preserved here for anyone reviving the +batched path or hacking on a fork. + +### 2. 
`--limit` defaults to 10; falsy values fall back to that default + +`limit = limit or DEFAULT_LIMIT` in `core.build_query`, with +`DEFAULT_LIMIT = 10`. Passing `--limit 0` is treated as falsy and +falls back to 10. There is no "no-limit" mode in the CLI. + +For multi-pivot queries — e.g., `[PROJECT] ci installer system` — +the result has up to one row per distinct `(ci, installer, system)` +combination. Realistic combos for one package run to a few dozen, so +the default of 10 silently truncates the long tail. + +In `run_pypinfo` we pivot by `ci × installer × system` (3 fields), so +the argv carries an explicit `--limit 500` to leave several-multiple +headroom over the realistic combo ceiling. SQL `LIMIT` is applied +after aggregation, so a generous bound does not change `bytes_billed`. + +If you ever extend the pivot — adding `country` or `version` — +recompute the combo ceiling and bump `--limit` accordingly. + +## See also + +- `src/pypi_winnow_downloads/collector.py` — the live `run_pypinfo` + function with both gotchas commented at their respective line ranges + inside the function body. +- `tests/test_collector.py::test_run_pypinfo_argv_passes_explicit_limit` + — regression coverage that fails if `--limit` is dropped from argv. +- [BigQuery pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing) + — the $5/TB on-demand rate referenced above. +- [pypinfo](https://github.com/ofek/pypinfo) — upstream source, where + the file/line references in this doc point. diff --git a/src/pypi_winnow_downloads/collector.py b/src/pypi_winnow_downloads/collector.py index e6e8605..3fd137c 100644 --- a/src/pypi_winnow_downloads/collector.py +++ b/src/pypi_winnow_downloads/collector.py @@ -166,12 +166,23 @@ def run_pypinfo( # short-circuits to a credential-setter path when --auth is present and # never runs the query. Use GOOGLE_APPLICATION_CREDENTIALS instead, which # pypinfo's core.py reads via os.environ.get on the no-flag path. 
+ # + # `--limit 500` defends against pypinfo's CLI default of 10 (cli.py + # passes `limit or DEFAULT_LIMIT` to core.build_query, where falsy + # values including 0 fall back to 10 — there is no "no-limit" mode). + # The pivot is `ci x installer x system`: realistic max distinct + # combos for one package is ~3 x 8 x 4 = 96, so 500 leaves ~5x + # headroom for unexpected installer/system variants. SQL `LIMIT` is + # applied after aggregation, so a generous bound does not change + # `bytes_billed`. See docs/cost-model-and-pypinfo-gotchas.md. argv = [ _resolve_pypinfo_path(), "--json", "--days", str(window_days), "--all", + "--limit", + "500", package, "ci", "installer", diff --git a/tests/test_collector.py b/tests/test_collector.py index 1ffd0dc..6676f78 100644 --- a/tests/test_collector.py +++ b/tests/test_collector.py @@ -83,6 +83,41 @@ def fake_runner(argv: list[str], env: dict[str, str]) -> subprocess.CompletedPro assert argv[-3:] == ["ci", "installer", "system"], argv +def test_run_pypinfo_argv_passes_explicit_limit(tmp_path: Path) -> None: + """pypinfo's CLI defaults to `--limit 10` and treats falsy values + (including 0) as falling back to that default — `limit or DEFAULT_LIMIT` + in core.build_query. With a `ci x installer x system` pivot, realistic + distinct combos for a single package can run to several dozen rows; a + silent LIMIT-10 truncation would undercount the hero badge by dropping + the long tail of less-common installer/system pairs. The argv must + therefore carry an explicit `--limit` with a value comfortably above + the realistic ceiling. See docs/cost-model-and-pypinfo-gotchas.md. 
+ """ + captured: list[list[str]] = [] + + def fake_runner(argv: list[str], env: dict[str, str]) -> subprocess.CompletedProcess[str]: + captured.append(list(argv)) + return _ok_result(argv) + + creds = tmp_path / "creds.json" + creds.write_text("{}") + run_pypinfo("mypkg", 30, credential_file=creds, runner=fake_runner) + + argv = captured[0] + assert "--limit" in argv, ( + "argv must carry `--limit` to defeat pypinfo's default-10 truncation; " + "see docs/cost-model-and-pypinfo-gotchas.md gotcha 2." + ) + limit_value = int(argv[argv.index("--limit") + 1]) + # Realistic max combos for one package is ~3 x 8 x 4 = 96, so the bound + # must clear that with margin. 100 is the floor; the current value is + # 500 (~5x headroom). + assert limit_value >= 100, ( + f"--limit must be >= 100 to clear the realistic ci-by-installer-by-system " + f"combo ceiling for one package; got {limit_value}." + ) + + def _ok_rows(rows: list[dict]) -> str: """Helper: shape an `_ok_result` JSON payload from a list of pypinfo row dicts.""" return json.dumps({"rows": rows}) From 74fef8b674690a726ad3c10d81f5fd37deab89c0 Mon Sep 17 00:00:00 2001 From: "cmeans-claude-dev[bot]" <272174644+cmeans-claude-dev[bot]@users.noreply.github.com> Date: Sat, 2 May 2026 13:59:15 -0500 Subject: [PATCH 2/2] docs: align cost-envelope basis on 1 TiB; tighten See-also wording QA round 1 findings: - Free-tier basis was inconsistent: README cited 1 TiB but the cost-envelope table in docs/cost-model-and-pypinfo-gotchas.md used a 1 TB basis in the heading and qualitative cells. Aligns the doc on 1 TiB (matches GCP's actual free-tier figure and the README). Refreshes the qualitative cells whose verdicts shift under the larger basis: 7-pkg row moves from "at the ceiling" to "comfortably under (88%)"; 10-pkg row from "~\$2/month over" to "~\$1.40/month over"; 50/100-pkg dollar figures tightened to "~\$29" and "~\$64". 
README rule of thumb updated from "~7 packages" to "~7-8 packages" so the prose ceiling matches the table; CHANGELOG entry follows. - "See also" bullet pointing at collector.py overstated what is commented inline. Only gotcha 2 (--limit) is commented in the function body; gotcha 1 (--where) is intentionally not, because the collector ships per-package serial and does not pass --where. Reworded to name --limit explicitly and explain why the --where gotcha is preserved in the doc but not in code. Doc-only changes; 89/89 tests pass with 100% coverage, ruff/mypy/uv lock --locked all clean. Co-Authored-By: Claude Opus 4.7 (1M context) --- CHANGELOG.md | 2 +- README.md | 4 ++-- docs/cost-model-and-pypinfo-gotchas.md | 18 ++++++++++-------- 3 files changed, 13 insertions(+), 11 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index b0ec7a1..f9d45de 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,7 +4,7 @@ ### Added -- **`docs/cost-model-and-pypinfo-gotchas.md`** new engineering note documenting the BigQuery scan-cost shape for `bigquery-public-data.pypi.file_downloads` (the ~4.6 GB-per-package per-query rate, the free-tier ceiling around 7 packages on a daily cadence, the cost envelope at higher scales, and the levers that actually move cost) and two `pypinfo` CLI foot-guns surfaced during PR #14's testing: `--where` AND-combining with the positional rather than overriding it, and `--limit` defaulting to 10 with falsy values falling back to that default. Also explains why batching is *not* a cost lever — empirically more expensive than per-package serial because the `pypi.file_downloads` table is clustered on `file.project` and `WHERE IN` defeats cluster pruning. README's install section gains a brief pointer with the free-tier rule of thumb so self-hosters can size against it before committing. 
+- **`docs/cost-model-and-pypinfo-gotchas.md`** new engineering note documenting the BigQuery scan-cost shape for `bigquery-public-data.pypi.file_downloads` (the ~4.6 GB-per-package per-query rate, the free-tier ceiling around 7-8 packages on a daily cadence, the cost envelope at higher scales, and the levers that actually move cost) and two `pypinfo` CLI foot-guns surfaced during PR #14's testing: `--where` AND-combining with the positional rather than overriding it, and `--limit` defaulting to 10 with falsy values falling back to that default. Also explains why batching is *not* a cost lever — empirically more expensive than per-package serial because the `pypi.file_downloads` table is clustered on `file.project` and `WHERE IN` defeats cluster pruning. README's install section gains a brief pointer with the free-tier rule of thumb so self-hosters can size against it before committing. - **`run_pypinfo` argv carries an explicit `--limit 500`.** Defends against pypinfo's CLI default of 10 (`limit or DEFAULT_LIMIT` in `pypinfo/core.py:build_query` treats falsy values, including 0, as falling back) silently truncating the long tail of `(ci, installer, system)` rows. Realistic combo ceiling for one package is ~3 × 8 × 4 ≈ 96; under the prior implicit-default-10 path, a popular package with diverse installer/system spread would lose the tail and the hero count (sum of post-allowlist rows) would silently undercount. SQL `LIMIT` is applied after aggregation, so a generous bound does not change `bytes_billed` — the cost envelope in the new doc is unaffected. Regression test in `tests/test_collector.py::test_run_pypinfo_argv_passes_explicit_limit` fails if `--limit` is dropped or the value drops below 100. See `docs/cost-model-and-pypinfo-gotchas.md` gotcha 2. diff --git a/README.md b/README.md index e384e30..f5a9a55 100644 --- a/README.md +++ b/README.md @@ -103,8 +103,8 @@ resulting file. BigQuery's free tier is 1 TiB of scan per month. 
At the per-package serial rate this collector uses (~4.6 GB scanned per package per -30-day window query), a daily run on ~7 packages sits at the free-tier -ceiling. The +30-day window query), a daily run on ~7-8 packages sits at the +free-tier ceiling. The [cost model and pypinfo gotchas](docs/cost-model-and-pypinfo-gotchas.md) doc has the full table, the levers for scaling beyond that, and why batching is *not* one of them. diff --git a/docs/cost-model-and-pypinfo-gotchas.md b/docs/cost-model-and-pypinfo-gotchas.md index 4562c99..742c61e 100644 --- a/docs/cost-model-and-pypinfo-gotchas.md +++ b/docs/cost-model-and-pypinfo-gotchas.md @@ -37,13 +37,13 @@ batching strategy. ### Cost envelope at the per-package serial rate (~4.6 GB/pkg/run) -| Packages | Monthly bytes billed | Free tier (1 TB/mo) ceiling | -|----------|----------------------|-----------------------------| +| Packages | Monthly bytes billed | Free tier (1 TiB/mo) ceiling | +|----------|----------------------|------------------------------| | 4 | 552 GB | comfortably under | -| 7 | 966 GB | at the ceiling | -| 10 | 1.38 TB | ~$2/month over | -| 50 | 6.9 TB | ~$30/month | -| 100 | 13.8 TB | ~$65/month | +| 7 | 966 GB | comfortably under (88%) | +| 10 | 1.38 TB | ~$1.40/month over | +| 50 | 6.9 TB | ~$29/month | +| 100 | 13.8 TB | ~$64/month | | 300 | 41 TB | ~$200/month | After-free-tier rate: $5/TB. @@ -148,8 +148,10 @@ recompute the combo ceiling and bump `--limit` accordingly. ## See also - `src/pypi_winnow_downloads/collector.py` — the live `run_pypinfo` - function with both gotchas commented at their respective line ranges - inside the function body. + function with the `--limit` gotcha commented inline. The `--where` + gotcha is not commented in the code because the collector ships + per-package serial and does not pass `--where`; it is preserved here + for anyone reviving the batched path or hacking on a fork. 
- `tests/test_collector.py::test_run_pypinfo_argv_passes_explicit_limit` — regression coverage that fails if `--limit` is dropped from argv. - [BigQuery pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing)