feat(collector): batched BigQuery query — one scan per window group#14
cmeans-claude-dev[bot] wants to merge 1 commit
Replaces the previous per-package `pypinfo` invocation with a single
batched call per `window_days` group. The new
`run_pypinfo_batch(packages, window_days, ...)` builds a
`WHERE file.project IN (...)` clause covering every package with the
same window, pivots by `project` + `ci` + `installer`, and splits the
response into a `{package: count}` dict post-parse.
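The post-parse split described above can be sketched roughly as follows. This is a minimal illustration, not the collector's actual code: `split_counts` is a hypothetical name, the row keys `ci` and `download_count` are assumed (only `project` and `installer_name` are named in this PR), and `ValueError` stands in for the project's `CollectorError`.

```python
from collections.abc import Iterable, Mapping
from typing import Any

# Installer allowlist as listed under "Behavior preserved".
ALLOWED_INSTALLERS = frozenset({"pip", "uv", "poetry", "pdm", "pipenv", "pipx"})


def split_counts(rows: Iterable[Mapping[str, Any]], packages: list[str]) -> dict[str, int]:
    """Sum non-CI, allowlisted downloads per requested package."""
    # Seed every requested package so zero-count packages still appear.
    counts = {pkg: 0 for pkg in packages}
    for row in rows:
        if "project" not in row or "installer_name" not in row:
            raise ValueError(f"row is missing project/installer_name: {row!r}")
        project = row["project"]
        if project not in counts:
            raise ValueError(f"row for unrequested package: {project!r}")
        if str(row.get("ci")) == "True":
            continue  # drop CI traffic (assumed key name "ci")
        if row["installer_name"] not in ALLOWED_INSTALLERS:
            continue  # drop installers outside the allowlist
        counts[project] += int(row["download_count"])  # assumed key name
    return counts
```

Any malformed or unexpected row raises, so the caller can treat the whole batch as failed rather than publish partial counts.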
## Why
BigQuery's `bytes_billed` is bounded by partition + column scan; the
size of a `WHERE`-IN list does NOT affect billed bytes. So one batched
call for N packages costs the same as one call for one package
(~4-5 GB scan, observed in real query metadata). The previous
implementation scaled linearly with package count, capping realistic
free-tier hosting at ~7 packages on a daily cadence. Batching scales
to hundreds of packages on the same 1 TB/month budget.
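The "~7 packages" ceiling for the per-package model follows from simple arithmetic. The sketch below assumes 4.6 GB billed per query (a value inside the observed ~4-5 GB range), one query per package per day, a 30-day month, and 1 TB taken as 1000 GB; `max_packages` is a hypothetical helper, not part of the collector.

```python
def max_packages(budget_gb: float, gb_per_query: float, runs_per_month: int) -> int:
    """Largest package count whose monthly scan volume fits inside the budget."""
    return int(budget_gb // (gb_per_query * runs_per_month))


# Per-package serial model: 4.6 GB x 30 runs = 138 GB per package per month.
print(max_packages(budget_gb=1000, gb_per_query=4.6, runs_per_month=30))
```

At the ends of the observed range the same formula gives 8 packages (4 GB per query) and 6 packages (5 GB per query).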
This is the cost lever for opening the service to packages other than
the maintainer's own — the prerequisite for accepting external
package-add requests via PR.
## Behavior preserved
- Installer allowlist (`pip`, `uv`, `poetry`, `pdm`, `pipenv`, `pipx`),
ci-filter (`details.ci != "True"`), badge label
`pip*/uv/poetry/pdm (Nd)`, badge filename
`<pkg>/downloads-Nd-non-ci.json`, `_health.json` shape — all
unchanged.
- `XDG_DATA_HOME` isolation still in place per-invocation (now per-batch
rather than per-package, but the principle is the same).
- `subprocess.run` timeout=180, `Path(sys.executable).parent` resolver,
no `-a/--auth` on argv — all unchanged.
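The invocation constraints in the list above (throwaway `XDG_DATA_HOME`, 180-second timeout, executable resolved next to the interpreter) could be combined along these lines. `run_isolated` and `sibling_executable` are hypothetical names, and the resolver assumes a POSIX-style virtualenv layout where console scripts sit beside `python`.

```python
import os
import subprocess
import sys
import tempfile
from pathlib import Path


def sibling_executable(name: str) -> Path:
    """Resolve a console script installed next to the running interpreter."""
    return Path(sys.executable).parent / name


def run_isolated(argv: list[str], timeout: float = 180) -> str:
    """Run argv with a fresh XDG_DATA_HOME and return its stdout."""
    with tempfile.TemporaryDirectory() as data_home:
        # Copy the parent environment, overriding only the data directory.
        env = {**os.environ, "XDG_DATA_HOME": data_home}
        result = subprocess.run(
            argv,
            env=env,
            capture_output=True,
            text=True,
            timeout=timeout,  # raises subprocess.TimeoutExpired when exceeded
            check=True,  # raises subprocess.CalledProcessError on non-zero exit
        )
    return result.stdout
```

The temporary directory is removed as soon as the call returns, so nothing written under it survives from one batch to the next.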
## Failure isolation
- A batch-level failure (BigQuery error, malformed JSON, schema break,
missing `project`/`installer_name` fields, row for unrequested
package) marks every package in that window's batch as failed.
- Other windows still run independently — a 7-day batch isn't blocked
by a 30-day batch's failure, and vice versa.
- Per-package badge-write failures stay isolated to the affected
package (a read-only output subdir for one package doesn't break the
rest).
- `_health.json` writes unconditionally regardless of which batches
failed.
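A minimal sketch of the per-window isolation described in this list, assuming a `run_batch(packages, window_days)` callable that returns a `{package: count}` dict or raises. `collect_by_window` is a hypothetical name, and the broad `except Exception` is a simplification; the real collector presumably catches its own `CollectorError`.

```python
from collections import defaultdict
from collections.abc import Callable


def collect_by_window(
    windows: dict[str, int],
    run_batch: Callable[[list[str], int], dict[str, int]],
) -> tuple[dict[str, int], dict[str, str]]:
    """Run one batch per window; a failed batch fails only its own packages."""
    # Group package names by their configured window_days.
    groups: dict[int, list[str]] = defaultdict(list)
    for package, window_days in windows.items():
        groups[window_days].append(package)

    counts: dict[str, int] = {}
    failures: dict[str, str] = {}
    for window_days, packages in groups.items():
        try:
            counts.update(run_batch(packages, window_days))
        except Exception as exc:
            # Batch-level failure: mark every package in this window as failed,
            # then continue with the remaining windows.
            for package in packages:
                failures[package] = str(exc)
    return counts, failures
```

Because both dicts are always returned, a caller can write its health summary unconditionally from `failures`, whichever batches failed.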
## Security note
Package names go directly into the `WHERE file.project IN (...)`
clause as double-quoted SQL literals. PyPI's PEP 508 name grammar
restricts names to `[A-Za-z0-9._-]`, so they cannot contain quotes or
escape characters and the literal join is safe. Belt-and-braces:
explicit rejection of names containing `"` or `\\` before SQL
composition, with a `CollectorError`. This becomes load-bearing if
the input source is ever broadened beyond a maintainer-curated YAML.
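The rejection step and the literal join might look like the sketch below. `build_where_in` is a hypothetical name, `ValueError` stands in for `CollectorError`, and the returned string is assumed to be the condition text handed to `--where` (without a leading `WHERE` keyword).

```python
def build_where_in(packages: list[str]) -> str:
    """Compose the IN (...) condition from package names after rejecting unsafe ones."""
    if not packages:
        # An empty IN () list is not valid SQL; callers should no-op before this.
        raise ValueError("no packages to query")
    for name in packages:
        # Belt-and-braces: valid names cannot contain these characters,
        # but reject them explicitly before composing SQL.
        if '"' in name or "\\" in name:
            raise ValueError(f"unsafe package name: {name!r}")
    quoted = ", ".join(f'"{name}"' for name in packages)
    return f"file.project IN ({quoted})"
```

For example, `build_where_in(["requests", "uv"])` yields `file.project IN ("requests", "uv")`. A stricter variant could match each name against the `[A-Za-z0-9._-]` character class instead of rejecting only the two dangerous characters.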
## Tests
26 collector tests (was 21), all passing. Major rewrites:
- New: `test_run_pypinfo_batch_uses_where_in_clause_for_all_packages`,
`test_run_pypinfo_batch_splits_counts_per_package`,
`test_run_pypinfo_batch_includes_zero_count_packages`,
`test_run_pypinfo_batch_returns_empty_for_empty_input`,
`test_run_pypinfo_batch_raises_on_missing_project_field`,
`test_run_pypinfo_batch_raises_on_unrequested_package`,
`test_run_pypinfo_batch_rejects_package_names_with_quote_characters`,
`test_collect_groups_packages_by_window_into_one_batch_per_window`,
`test_collect_one_batch_for_all_packages_when_window_is_uniform`,
`test_collect_records_batch_failure_for_all_packages_in_window`.
- Adapted: every existing test that assumed the per-package signature.
- Removed: `_collect_one` helper-test pattern (no longer relevant —
collect's loop is now over per-window batches, not per-package).
Sweep:
- ruff / format / mypy: clean
- pytest --cov: 61/61 passed (was 56), 99% coverage
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Closing PR #14: batching is not a cost lever; v0.1.0's per-package serial wins

Closing this PR after live testing against 300 real PyPI packages on CT 112 contradicted the load-bearing rationale. Recording the findings here so the test investment isn't lost.

### What we tested

Top 300 PyPI packages (from hugovk's top-pypi-packages list), real BigQuery query, single batch via

### What we found

Cost claim was wrong. I'd asserted batching kept

Per-package average inside the batch: ~7.7 GB, higher than single-package serial (~4.6 GB). The `pypi.file_downloads` table is clustered on

This means the recomputed cost envelope for v0.1.0's per-package serial is more permissive than I originally claimed:

### Two real pypinfo bugs caught

The test investment did surface fixable issues:

These are noted in the awareness store under

### Why close instead of reframe

Even after the bug fixes, the only remaining benefit of batching is atomic snapshot of all packages at one BigQuery moment, which is meaningful for trend analysis but immaterial for badge serving (the 30-second skew across 4 packages doesn't matter). Paying 67% more per query for that benefit isn't worth it at our scale.

### What this means for hosting-for-others

Realistic envelope for the existing per-package serial implementation:

Cost reduction levers, in order of effectiveness:

A separate small docs-only PR will document this cost model in the README and Mechanics
…pypinfo (#62)

## QA round 2: doc-only follow-up at 74fef8b

Two findings from QA round 1, both doc-only:

- **Free-tier basis aligned on 1 TiB across README, doc, and CHANGELOG.** README already cited 1 TiB but the cost-envelope table in `docs/cost-model-and-pypinfo-gotchas.md` used a 1 TB basis in the heading and the qualitative cells. Switched to 1 TiB (matches GCP's actual figure and what the README says) and refreshed the verdicts whose qualitative reading shifts under the larger basis: 7-pkg row from "at the ceiling" to "comfortably under (88%)"; 10-pkg row from "~$2/month over" to "~$1.40/month over"; 50-pkg from "~$30" to "~$29"; 100-pkg from "~$65" to "~$64". README's prose rule of thumb updated from "~7 packages" to "~7-8 packages" so it matches the table; CHANGELOG entry follows.
- **"See also" bullet pointing at `collector.py` no longer overstates what is commented inline.** Originally said "with both gotchas commented at their respective line ranges"; only gotcha 2 (`--limit`) is commented in the function body, because the collector ships per-package serial and does not pass `--where`. Reworded to name `--limit` explicitly and explain why the `--where` gotcha is preserved in the doc but not in code.

## Summary

Two parts in one PR; they share the same source material from PR #14's testing and would be artificially split if they didn't share their references.

**1. New engineering doc: `docs/cost-model-and-pypinfo-gotchas.md`.** Captures three things that aren't obvious from the code:

- **BigQuery scan-cost shape.** Empirical numbers from PR #14's testing (which cost the project \$11.32 to learn): per-package serial runs ~4.6 GB billed/pkg/30-day-window, batched queries of 300 packages cost ~7.7 GB/pkg, so batching is *more* expensive than serial: the table is clustered on `file.project` and `WHERE IN` defeats cluster pruning. Free-tier ceiling lands around 7-8 packages on a daily cadence (1 TiB basis).
- **Levers that move cost.** Frequency cuts (daily → weekly = 7x), pypistats.org fallback, hybrid, materialized views, in order of effectiveness. Includes anti-levers (smaller batch sizes, `TABLESAMPLE`) for completeness.
- **Two pypinfo CLI gotchas** with `core.py:build_query` references: `--where` AND-combines with the positional rather than overriding, and `--limit` defaults to 10 with falsy values falling back to that default (`limit or DEFAULT_LIMIT` in source). The first one bites multi-package callers; the second one bites multi-pivot callers, including this project's `run_pypinfo`.

The material was previously captured only in the maintainer's private knowledge store. Promoting it to the public repo means future maintainers and self-hosters don't have to re-walk it. README install section gains a brief pointer with the free-tier rule of thumb so self-hosters can size against it before committing.

**2. `run_pypinfo` argv carries an explicit `--limit 500`.** Closes the gotcha-2 hole on the live code path. The pivot is `ci x installer x system`; realistic distinct combos for one package are ~3 x 8 x 4 ≈ 96. Under the prior implicit-default-10 path, a popular package with diverse installer/system spread silently lost the long tail and the hero badge (sum of post-allowlist rows) would undercount. SQL `LIMIT` is post-aggregation, so a generous bound does not change `bytes_billed`; the cost envelope in the new doc is unaffected. Regression coverage in `tests/test_collector.py::test_run_pypinfo_argv_passes_explicit_limit` fails if `--limit` is dropped or the value drops below 100.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
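The refreshed cost-envelope verdicts quoted in the QA notes above can be re-derived with a few lines of arithmetic. This sketch assumes 4.6 GB billed per query, one query per package per day over a 30-day month, a 1 TiB free tier, and an on-demand rate of $5 per decimal TB. That rate is an assumption, not something the source states; it is chosen because it reproduces the quoted figures. `monthly_envelope` is a hypothetical helper.

```python
TIB_IN_GB = 2**40 / 1e9  # 1 TiB expressed in decimal GB (about 1099.5)


def monthly_envelope(
    packages: int,
    gb_per_query: float = 4.6,
    runs_per_month: int = 30,
    free_gb: float = TIB_IN_GB,
    usd_per_tb: float = 5.0,  # assumed on-demand rate, not stated in the source
) -> tuple[float, float]:
    """Return (fraction of free tier used, overage cost in USD) for serial runs."""
    scanned_gb = packages * gb_per_query * runs_per_month
    overage_gb = max(0.0, scanned_gb - free_gb)
    return scanned_gb / free_gb, overage_gb / 1000 * usd_per_tb


for n in (7, 10, 50, 100):
    used, cost = monthly_envelope(n)
    print(f"{n:>3} packages: {used:.0%} of free tier, ${cost:.2f} over")
```

Under these assumptions the 7-package row lands at about 88% of the free tier, and the 10-, 50-, and 100-package rows come out at roughly $1.40, $29, and $64 per month over.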
## Summary

Replaces the per-package pypinfo invocation with a single batched call per `window_days` group. The cost lever for hosting many packages: BigQuery's `bytes_billed` is bounded by partition + column scan and is NOT affected by the size of a `WHERE`-IN list, so one batched call for N packages costs the same as one call for a single package (~4-5 GB).

Why now: prerequisite for opening the service to external package requests. The previous per-package model maxed out the 1 TB/month free tier at ~7 packages on a daily cadence; the batched model scales to hundreds.
## API change

`collect()` groups configured packages by `window_days` and runs one batch per group. Typical case (everyone on the default 30-day window) is exactly one BigQuery query per collector run regardless of N.

## Argv shape

The `pkg-a` positional is required by pypinfo's CLI parser but its generated `file.project = "..."` clause is overridden by `--where`; the placeholder choice is irrelevant.

## Failure isolation preserved
- `_health.json` writes unconditionally.

## Security

Package names go into the `WHERE` clause as double-quoted SQL literals. PEP 508 restricts names to `[A-Za-z0-9._-]`, so quotes/escapes can't appear. Belt-and-braces: explicit rejection (`CollectorError`) of any name containing `"` or `\\` before SQL composition. This becomes load-bearing once external requests start landing.

## Tests

New tests cover: argv shape with WHERE-IN, per-package count splitting, zero-count packages still appear in dict, empty-input no-op, missing `project` field raises, unrequested-package row raises, quote-character rejection, batch-grouping invariant, single-batch-for-uniform-windows invariant, batch-failure-marks-all-in-window behaviour.

## Verification
## Test plan for QA

- `_resolve_pypinfo_path` (unchanged), `run_pypinfo_batch`, and `collect`. Argv shape correct (`--where`, `project` pivot first, no `-a/--auth`); WHERE clause built safely; per-package failure isolation preserved at the badge-write layer.
- `pytest --cov` confirms 61/61, 99% coverage.
- `bytes_billed` from the real query (the `query` field of pypinfo's JSON output). A four-package batch should bill the same ~4.6 GB observed today for a single-package query.

## Post-merge

After this lands and CT 112 is redeployed, the service is positioned to accept external package-add PRs. A separate PR can document the request flow (likely a `tracked.yaml` rename + CONTRIBUTING.md update).

🤖 Generated with Claude Code