feat(collector): batched BigQuery query — one scan per window group by cmeans-claude-dev[bot] · Pull Request #14 · cmeans/pypi-winnow-downloads

cmeans-claude-dev · 2026-04-25T04:00:06Z

Summary

Replaces the per-package pypinfo invocation with a single batched call per window_days group. The cost lever for hosting many packages: BigQuery's bytes_billed is bounded by partition + column scan and is NOT affected by the size of a WHERE-IN list, so one batched call for N packages costs the same as one call for a single package (~4-5 GB).

Why now: prerequisite for opening the service to external package requests. The previous per-package model maxed out the 1 TB/month free tier at ~7 packages on a daily cadence; the batched model scales to hundreds.

API change

# Before:
run_pypinfo(package: str, window_days: int, ...) -> int

# After:
run_pypinfo_batch(packages: Sequence[str], window_days: int, ...) -> dict[str, int]

collect() groups configured packages by window_days and runs one batch per group. Typical case (everyone on the default 30-day window) is exactly one BigQuery query per collector run regardless of N.
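The grouping step described above can be sketched as follows. This is a minimal illustration, not the collector's actual code: `TrackedPackage` and `group_by_window` are hypothetical names, and the real config type is not shown in this PR.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class TrackedPackage(NamedTuple):
    # Hypothetical config record; the collector's real config type is not shown here.
    name: str
    window_days: int


def group_by_window(packages: Iterable[TrackedPackage]) -> dict[int, list[str]]:
    """Bucket package names by window_days so each bucket becomes one batch."""
    groups: dict[int, list[str]] = defaultdict(list)
    for pkg in packages:
        groups[pkg.window_days].append(pkg.name)
    return dict(groups)


tracked = [
    TrackedPackage("pkg-a", 30),
    TrackedPackage("pkg-b", 30),
    TrackedPackage("pkg-c", 7),
]
# Two distinct windows, so two batches; a uniform config would yield one.
groups = group_by_window(tracked)
```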

Argv shape

pypinfo --json --days 30 --all \
  --where 'file.project IN ("pkg-a", "pkg-b", "pkg-c")' \
  pkg-a project ci installer

The pkg-a positional is required by pypinfo's CLI parser but its generated file.project = "..." clause is overridden by --where; the placeholder choice is irrelevant.

Failure isolation preserved

Batch-level failure → every package in that window's batch fails; other windows still run.
Per-package badge-write failure → isolated to the affected package.
_health.json writes unconditionally.
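A rough sketch of the batch-level isolation listed above, assuming each package belongs to exactly one window. `collect_batches` and `fake_run_batch` are hypothetical names for illustration; only `CollectorError` is named in this PR, and it is redefined here as a stand-in.

```python
from typing import Callable, Mapping, Sequence


class CollectorError(Exception):
    """Stand-in for the collector's error type named in this PR."""


def collect_batches(
    groups: Mapping[int, Sequence[str]],
    run_batch: Callable[[Sequence[str], int], dict[str, int]],
) -> tuple[dict[str, int], dict[str, str]]:
    """Run one batch per window; a failing batch only fails its own packages."""
    counts: dict[str, int] = {}
    failures: dict[str, str] = {}
    for window_days, names in groups.items():
        try:
            counts.update(run_batch(names, window_days))
        except CollectorError as exc:
            # Every package in this window is marked failed; other windows continue.
            for name in names:
                failures[name] = str(exc)
    return counts, failures


def fake_run_batch(names: Sequence[str], window_days: int) -> dict[str, int]:
    # Hypothetical stub: the 30-day batch fails, the 7-day batch succeeds.
    if window_days == 30:
        raise CollectorError("BigQuery error")
    return {name: 5 for name in names}


counts, failures = collect_batches(
    {30: ["pkg-a", "pkg-b"], 7: ["pkg-c"]}, fake_run_batch
)
```

Badge writes and the `_health.json` write would sit outside this loop, which is what keeps them independent of any single batch's outcome.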

Security

Package names go into the WHERE clause as double-quoted SQL literals. PEP 508 restricts names to [A-Za-z0-9._-], so quotes/escapes can't appear. Belt-and-braces: explicit rejection (CollectorError) of any name containing " or \\ before SQL composition. This becomes load-bearing once external requests start landing.
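The validation and literal join described above might look roughly like this. `build_where_in` is a hypothetical helper name, and `CollectorError` is redefined as a stand-in; the sketch assumes callers skip empty package lists before composing SQL.

```python
from typing import Sequence


class CollectorError(Exception):
    """Stand-in for the collector's error type named in this PR."""


def build_where_in(packages: Sequence[str]) -> str:
    """Compose a file.project IN (...) clause from double-quoted literals."""
    for name in packages:
        # Belt-and-braces check: reject quote and backslash before composition.
        if '"' in name or "\\" in name:
            raise CollectorError(f"unsafe package name: {name!r}")
    quoted = ", ".join(f'"{name}"' for name in packages)
    return f"file.project IN ({quoted})"


clause = build_where_in(["pkg-a", "pkg-b", "pkg-c"])

# A name carrying a double quote is refused before any SQL is built.
try:
    build_where_in(['bad"name'])
    rejected = False
except CollectorError:
    rejected = True
```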

Tests

	Before	After
Collector tests	21	26
Total tests	56	61
Coverage	99%	99%

New tests cover: argv shape with WHERE-IN, per-package count splitting, zero-count packages still appear in dict, empty-input no-op, missing project field raises, unrequested-package row raises, quote-character rejection, batch-grouping invariant, single-batch-for-uniform-windows invariant, batch-failure-marks-all-in-window behaviour.

Verification

ruff check + ruff format --check + mypy: clean
pytest --cov: 61/61 passed, 99% coverage (collector 99%)

Test plan for QA

Read _resolve_pypinfo_path (unchanged), run_pypinfo_batch, and collect. Argv shape correct (--where, project pivot first, no -a/--auth); WHERE clause built safely; per-package failure isolation preserved at the badge-write layer.
pytest --cov confirms 61/61, 99% coverage.
Spot-test on the deployed CT 112: rebuild wheel, push, run once, confirm the four currently-tracked packages all get badges (counts unchanged; this PR is a cost optimization, not a metric change).
Sanity-check the BigQuery-cost story by capturing bytes_billed from the real query (the query field of pypinfo's JSON output). A four-package batch should bill the same ~4.6 GB observed today for a single-package query.
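A minimal sketch of the bytes_billed check from the last step, assuming pypinfo's --json output is a single JSON object whose query field holds a bytes_billed key. The exact key layout is an assumption here, `bytes_billed_from_output` is a hypothetical helper, and the sample payload is made up for illustration.

```python
import json


def bytes_billed_from_output(stdout: str) -> int:
    """Extract bytes_billed from pypinfo's JSON output (key layout assumed)."""
    payload = json.loads(stdout)
    return int(payload["query"]["bytes_billed"])


# Illustrative payload only; the value mirrors the ~4.6 GB figure quoted above.
sample = '{"query": {"bytes_billed": 4600000000}, "rows": []}'
billed_gb = bytes_billed_from_output(sample) / 1e9
```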

Post-merge

After this lands and CT 112 is redeployed, the service is positioned to accept external package-add PRs. A separate PR can document the request flow (likely a tracked.yaml rename + CONTRIBUTING.md update).

🤖 Generated with Claude Code

Replaces the previous per-package `pypinfo` invocation with a single batched call per `window_days` group. The new `run_pypinfo_batch(packages, window_days, ...)` builds a `WHERE file.project IN (...)` clause covering every package with the same window, pivots by `project` + `ci` + `installer`, and splits the response into a `{package: count}` dict post-parse.

## Why

BigQuery's `bytes_billed` is bounded by partition + column scan; the size of a `WHERE`-IN list does NOT affect billed bytes. So one batched call for N packages costs the same as one call for one package (~4-5 GB scan, observed in real query metadata).

The previous implementation scaled linearly with package count, capping realistic free-tier hosting at ~7 packages on a daily cadence. Batching scales to hundreds of packages on the same 1 TB/month budget.

This is the cost lever for opening the service to packages other than the maintainer's own: the prerequisite for accepting external package-add requests via PR.

## Behavior preserved

- Installer allowlist (`pip`, `uv`, `poetry`, `pdm`, `pipenv`, `pipx`), ci-filter (`details.ci != "True"`), badge label `pip*/uv/poetry/pdm (Nd)`, badge filename `<pkg>/downloads-Nd-non-ci.json`, `_health.json` shape: all unchanged.
- `XDG_DATA_HOME` isolation still in place per-invocation (now per-batch rather than per-package, but the principle is the same).
- `subprocess.run` timeout=180, `Path(sys.executable).parent` resolver, no `-a/--auth` on argv: all unchanged.

## Failure isolation

- A batch-level failure (BigQuery error, malformed JSON, schema break, missing `project`/`installer_name` fields, row for unrequested package) marks every package in that window's batch as failed.
- Other windows still run independently: a 7-day batch isn't blocked by a 30-day batch's failure, and vice versa.
- Per-package badge-write failures stay isolated to the affected package (a read-only output subdir for one package doesn't break the rest).
- `_health.json` writes unconditionally regardless of which batches failed.

## Security note

Package names go directly into the `WHERE file.project IN (...)` clause as double-quoted SQL literals. PyPI's PEP 508 name grammar restricts names to `[A-Za-z0-9._-]`, so they cannot contain quotes or escape characters and the literal join is safe. Belt-and-braces: explicit rejection of names containing `"` or `\\` before SQL composition, with a `CollectorError`. This becomes load-bearing if the input source is ever broadened beyond a maintainer-curated YAML.

## Tests

26 collector tests (was 21), all passing. Major rewrites:

- New: `test_run_pypinfo_batch_uses_where_in_clause_for_all_packages`, `test_run_pypinfo_batch_splits_counts_per_package`, `test_run_pypinfo_batch_includes_zero_count_packages`, `test_run_pypinfo_batch_returns_empty_for_empty_input`, `test_run_pypinfo_batch_raises_on_missing_project_field`, `test_run_pypinfo_batch_raises_on_unrequested_package`, `test_run_pypinfo_batch_rejects_package_names_with_quote_characters`, `test_collect_groups_packages_by_window_into_one_batch_per_window`, `test_collect_one_batch_for_all_packages_when_window_is_uniform`, `test_collect_records_batch_failure_for_all_packages_in_window`.
- Adapted: every existing test that assumed the per-package signature.
- Removed: `_collect_one` helper-test pattern (no longer relevant; collect's loop is now over per-window batches, not per-package).

Sweep:

- ruff / format / mypy: clean
- pytest --cov: 61/61 passed (was 56), 99% coverage

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

codecov-commenter · 2026-04-25T04:01:51Z

Codecov Report

❌ Patch coverage is 99.02439% with 2 lines in your changes missing coverage. Please review.

Files with missing lines	Patch %	Lines
tests/test_collector.py	98.67%	2 Missing ⚠️


cmeans-claude-dev · 2026-04-25T13:49:57Z

Closing PR #14 — batching is not a cost lever; v0.1.0's per-package serial wins

Closing this PR after live testing against 300 real PyPI packages on CT 112 contradicted the load-bearing rationale. Recording the findings here so the test investment isn't lost.

What we tested

Top 300 PyPI packages (from hugovk's top-pypi-packages list), real BigQuery query, single batch via pypinfo --where 'file.project IN (...)'. Total cost: $11.32 in BigQuery scan.

What we found

Cost claim was wrong. I'd asserted batching kept bytes_billed at ~4-5 GB regardless of N. Empirically:

Query	bytes_billed
Single package	~4.6 GB
300 packages batched	2.32 TB

Per-package average inside the batch: ~7.7 GB — higher than single-package serial (~4.6 GB). The pypi.file_downloads table is clustered on file.project, so single-package queries efficiently prune to one cluster's worth of blocks; multi-package queries scan all the requested clusters. Batching is roughly 67% more expensive per package than serial.
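The per-package arithmetic can be reproduced from the two measurements above. This sketch assumes decimal units (1 TB = 1000 GB); the overhead lands at roughly 67 to 68 percent depending on whether the 7.7 GB average is rounded before dividing.

```python
# Figures quoted in the comment above (decimal units assumed: 1 TB = 1000 GB).
BATCH_BYTES_BILLED_GB = 2.32 * 1000
BATCH_PACKAGES = 300
SINGLE_PACKAGE_GB = 4.6

# Average billed bytes attributable to each package inside the batch.
per_package_gb = BATCH_BYTES_BILLED_GB / BATCH_PACKAGES

# Relative extra cost per package versus one serial single-package query.
overhead = per_package_gb / SINGLE_PACKAGE_GB - 1.0

print(f"per-package average in batch: {per_package_gb:.1f} GB")
print(f"overhead versus serial: {overhead:.0%}")
```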

This means the recomputed cost envelope for v0.1.0's per-package serial is more permissive than I originally claimed:

Packages	Monthly bytes (daily collection)	Free tier?
4 (current)	552 GB	✅
7	966 GB	✅ ceiling
10	1.38 TB	~$2/month over
50	6.9 TB	~$30/month
100	13.8 TB	~$65/month
300	41 TB	~$200/month
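The table rows follow from packages x 4.6 GB x 30 daily runs. The sketch below recomputes them on a 1 TB free-tier basis; the $5 per TB on-demand price is an assumption inferred from the table's dollar figures, not something stated in this thread, and the function names are hypothetical.

```python
GB_PER_QUERY = 4.6        # single-package bytes_billed observed above
RUNS_PER_MONTH = 30       # daily collection
FREE_TIER_GB = 1000.0     # 1 TB basis used in the table (decimal units assumed)
PRICE_PER_TB_USD = 5.0    # ASSUMPTION: inferred from the table, not stated in the thread


def monthly_gb(packages: int) -> float:
    """Billed GB per month for per-package serial collection."""
    return packages * GB_PER_QUERY * RUNS_PER_MONTH


def monthly_overage_usd(packages: int) -> float:
    """Dollar cost of the bytes billed beyond the free tier."""
    over_gb = max(0.0, monthly_gb(packages) - FREE_TIER_GB)
    return over_gb / 1000.0 * PRICE_PER_TB_USD


for n in (4, 7, 10, 50, 100, 300):
    print(f"{n:>3} packages: {monthly_gb(n):>8.0f} GB, ${monthly_overage_usd(n):.2f} over")
```

Under that assumed price the computed overages come out close to the approximate figures in the table.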

Two real pypinfo bugs caught

The test investment did surface fixable issues:

Positional [PROJECT] AND-combines with --where instead of being overridden. With a placeholder positional like boto3, the SQL becomes WHERE file.project = "boto3" AND file.project IN ("boto3", ...), silently restricting the response to one package. Workaround: pass "" (empty string) so pypinfo's if project: branch in core.py:build_query skips emitting the auto-filter line.
pypinfo defaults to LIMIT 10, and the limit or DEFAULT_LIMIT expression in its source means an explicit 0 also falls back to 10. With the project x ci x installer pivot potentially producing thousands of rows, the default truncates the result badly, so an explicit --limit <large> is needed.
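Both gotchas can be sketched in a few lines. `effective_limit` mirrors the `limit or DEFAULT_LIMIT` idiom quoted above rather than pypinfo's actual function, and `build_batch_argv` is a hypothetical helper; the empty-string positional relies on the behaviour reported in this comment and has not been verified independently here. The 5000 limit is an arbitrary illustrative value.

```python
from typing import Optional

DEFAULT_LIMIT = 10  # default described in the comment above


def effective_limit(limit: Optional[int]) -> int:
    # Any falsy value, including an explicit 0, falls back to the default.
    return limit or DEFAULT_LIMIT


def build_batch_argv(where: str, days: int, limit: int) -> list[str]:
    """Argv with both workarounds applied (a sketch, not the collector's code)."""
    return [
        "pypinfo", "--json", "--days", str(days), "--all",
        "--limit", str(limit),   # explicit limit avoids the silent LIMIT 10
        "--where", where,
        "",                      # empty positional so no auto-filter is emitted
        "project", "ci", "installer",
    ]


argv = build_batch_argv('file.project IN ("pkg-a", "pkg-b")', days=30, limit=5000)
```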

These are noted in the awareness store under pypinfo-cli-gotchas for future use.

Why close instead of reframe

Even after the bug fixes, the only remaining benefit of batching is atomic snapshot of all packages at one BigQuery moment — meaningful for trend analysis but immaterial for badge serving (the 30-second skew across 4 packages doesn't matter). Paying 67% more per query for that benefit isn't worth it at our scale.

What this means for hosting-for-others

Realistic envelope for the existing per-package serial implementation:

4-7 packages: comfortably free tier
10-20 packages: a few dollars per month
100+ packages: genuinely expensive, needs a different strategy (less-frequent collection, pypistats.org as data source, or accepting cost as service economics)

Cost reduction levers, in order of effectiveness:

Reduce collection frequency (daily → weekly = 7x reduction)
Use pypistats.org instead of BigQuery (free, but loses installer-allowlist refinement)
Hybrid (pypistats for daily updates, BigQuery for weekly installer-mix refresh)

A separate small docs-only PR will document this cost model in the README and deploy/ so future operators understand the scale envelope.

Mechanics

Branch feat/batched-bigquery-query deleted on close.
v0.1.0 (per-package serial, currently deployed at pypi-badges.intfar.com) is the right shape for our scale.
Findings preserved in awareness so the next session doesn't re-walk this $11 lesson.

…pypinfo (#62)

## QA round 2: doc-only follow-up at 74fef8b

Two findings from QA round 1, both doc-only:

- **Free-tier basis aligned on 1 TiB across README, doc, and CHANGELOG.** README already cited 1 TiB but the cost-envelope table in `docs/cost-model-and-pypinfo-gotchas.md` used a 1 TB basis in the heading and the qualitative cells. Switched to 1 TiB (matches GCP's actual figure and what the README says) and refreshed the verdicts whose qualitative reading shifts under the larger basis: 7-pkg row from "at the ceiling" to "comfortably under (88%)"; 10-pkg row from "~$2/month over" to "~$1.40/month over"; 50-pkg from "~$30" to "~$29"; 100-pkg from "~$65" to "~$64". README's prose rule of thumb updated from "~7 packages" to "~7-8 packages" so it matches the table; CHANGELOG entry follows.
- **"See also" bullet pointing at `collector.py` no longer overstates what is commented inline.** Originally said "with both gotchas commented at their respective line ranges"; only gotcha 2 (`--limit`) is commented in the function body, because the collector ships per-package serial and does not pass `--where`. Reworded to name `--limit` explicitly and explain why the `--where` gotcha is preserved in the doc but not in code.

## Summary

Two parts in one PR: they share the same source material from PR #14's testing and would be artificially split if they didn't share their references.

**1. New engineering doc: `docs/cost-model-and-pypinfo-gotchas.md`.** Captures three things that aren't obvious from the code:

- **BigQuery scan-cost shape.** Empirical numbers from PR #14's testing (which cost the project \$11.32 to learn): per-package serial runs ~4.6 GB billed/pkg/30-day-window, batched queries of 300 packages cost ~7.7 GB/pkg, so batching is *more* expensive than serial, because the table is clustered on `file.project` and `WHERE IN` defeats cluster pruning. Free-tier ceiling lands around 7-8 packages on a daily cadence (1 TiB basis).
- **Levers that move cost.** Frequency cuts (daily → weekly = 7x), pypistats.org fallback, hybrid, materialized views, in order of effectiveness. Includes anti-levers (smaller batch sizes, `TABLESAMPLE`) for completeness.
- **Two pypinfo CLI gotchas** with `core.py:build_query` references: `--where` AND-combines with the positional rather than overriding, and `--limit` defaults to 10 with falsy values falling back to that default (`limit or DEFAULT_LIMIT` in source). The first one bites multi-package callers; the second one bites multi-pivot callers, including this project's `run_pypinfo`.

The material was previously captured only in the maintainer's private knowledge store. Promoting it to the public repo means future maintainers and self-hosters don't have to re-walk it. README install section gains a brief pointer with the free-tier rule of thumb so self-hosters can size against it before committing.

**2. `run_pypinfo` argv carries an explicit `--limit 500`.** Closes the gotcha-2 hole on the live code path. The pivot is `ci x installer x system`; realistic distinct combos for one package are ~3 x 8 x 4 ≈ 96. Under the prior implicit-default-10 path, a popular package with diverse installer/system spread silently lost the long tail and the hero badge (sum of post-allowlist rows) would undercount. SQL `LIMIT` is post-aggregation, so a generous bound does not change `bytes_billed`; the cost envelope in the new doc is unaffected. Regression coverage in `tests/test_collector.py::test_run_pypinfo_argv_passes_explicit_limit` fails if `--limit` is dropped or the value drops below 100.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
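The regression check on `--limit` described in the commit message above could be approximated as below. This is a guess at the shape of such a check, not the contents of `test_run_pypinfo_argv_passes_explicit_limit`; `limit_from_argv` and the sample argv are hypothetical.

```python
from typing import Optional, Sequence

MIN_LIMIT = 100  # floor named in the commit message


def limit_from_argv(argv: Sequence[str]) -> Optional[int]:
    """Return the integer following --limit, or None if the flag is absent."""
    if "--limit" not in argv:
        return None
    idx = list(argv).index("--limit")
    if idx + 1 >= len(argv):
        return None
    return int(argv[idx + 1])


# Pivot size estimate from the commit message: ci x installer x system.
expected_combos = 3 * 8 * 4

# Hypothetical argv for illustration; the real argv is built by run_pypinfo.
sample_argv = ["pypinfo", "--json", "--days", "30", "--limit", "500", "pkg-a"]
limit = limit_from_argv(sample_argv)
limit_is_safe = limit is not None and limit >= MIN_LIMIT and limit >= expected_combos
```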

github-actions Bot added the Awaiting CI (Dev complete, waiting for CI/Codecov to pass before QA) and Ready for QA (Dev work complete; QA can begin review) labels and removed the Awaiting CI label Apr 25, 2026

cmeans-claude-dev Bot closed this Apr 25, 2026

cmeans-claude-dev Bot deleted the feat/batched-bigquery-query branch April 25, 2026 13:49

cmeans-claude-dev Bot mentioned this pull request May 2, 2026

docs: cost-model + pypinfo CLI gotchas note; explicit --limit on run_pypinfo #62

Merged

8 tasks

cmeans-claude-dev[bot] wants to merge 1 commit into main from feat/batched-bigquery-query
