feat(collector): batched BigQuery query — one scan per window group #14

Closed
cmeans-claude-dev[bot] wants to merge 1 commit into main from feat/batched-bigquery-query

@cmeans-claude-dev

Contributor

Summary

Replaces the per-package pypinfo invocation with a single batched call per window_days group. The cost lever for hosting many packages: BigQuery's bytes_billed is bounded by partition + column scan and is NOT affected by the size of a WHERE-IN list, so one batched call for N packages costs the same as one call for a single package (~4-5 GB).

Why now: prerequisite for opening the service to external package requests. The previous per-package model maxed out the 1 TB/month free tier at ~7 packages on a daily cadence; the batched model scales to hundreds.

API change

# Before:
run_pypinfo(package: str, window_days: int, ...) -> int

# After:
run_pypinfo_batch(packages: Sequence[str], window_days: int, ...) -> dict[str, int]

collect() groups configured packages by window_days and runs one batch per group. Typical case (everyone on the default 30-day window) is exactly one BigQuery query per collector run regardless of N.
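The grouping step described above could look roughly like the following. This is a minimal sketch, not the PR's code: the `group_by_window` helper, the `name`/`window_days` config keys, and the `DEFAULT_WINDOW_DAYS` constant are assumptions made for illustration.

```python
from __future__ import annotations

from collections import defaultdict
from typing import Any, Iterable, Mapping

# Assumption: mirrors the "default 30-day window" mentioned above.
DEFAULT_WINDOW_DAYS = 30


def group_by_window(configs: Iterable[Mapping[str, Any]]) -> dict[int, list[str]]:
    """Group package names by window_days, preserving input order within a group."""
    groups: dict[int, list[str]] = defaultdict(list)
    for cfg in configs:
        window = int(cfg.get("window_days", DEFAULT_WINDOW_DAYS))
        groups[window].append(str(cfg["name"]))
    # Return a plain dict so a missing key raises instead of silently creating a group.
    return dict(groups)
```

When every package uses the default window, the result has a single key, which corresponds to the one-query-per-run case.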

Argv shape

pypinfo --json --days 30 --all \
  --where 'file.project IN ("pkg-a", "pkg-b", "pkg-c")' \
  pkg-a project ci installer

The pkg-a positional is required by pypinfo's CLI parser but its generated file.project = "..." clause is overridden by --where; the placeholder choice is irrelevant.

Failure isolation preserved

  • Batch-level failure → every package in that window's batch fails; other windows still run.
  • Per-package badge-write failure → isolated to the affected package.
  • _health.json writes unconditionally.

Security

Package names go into the WHERE clause as double-quoted SQL literals. PEP 508 restricts names to [A-Za-z0-9._-], so quotes/escapes can't appear. Belt-and-braces: explicit rejection (CollectorError) of any name containing " or \\ before SQL composition. This becomes load-bearing once external requests start landing.
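A sketch of the validation and literal join described above, under stated assumptions: `build_where_in` is a hypothetical helper name, `CollectorError` is redefined here as a stand-in for the collector's own exception, and rejecting an empty list is a choice made for this sketch (the PR treats empty input as a no-op further up the call chain).

```python
from __future__ import annotations

import re
from typing import Sequence


class CollectorError(Exception):
    """Stand-in for the collector's error type (assumption for this sketch)."""


# PEP 508 names: letters, digits, '.', '_' and '-', starting and ending alphanumeric.
_NAME_RE = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9._-]*[A-Za-z0-9])?")


def build_where_in(packages: Sequence[str]) -> str:
    """Return a `file.project IN (...)` clause, rejecting unsafe names first."""
    if not packages:
        raise CollectorError("refusing to build an empty IN list")
    for name in packages:
        # Explicit check first, so the rejection does not depend on the regex alone.
        if '"' in name or "\\" in name:
            raise CollectorError(f"unsafe character in package name: {name!r}")
        if not _NAME_RE.fullmatch(name):
            raise CollectorError(f"not a valid PEP 508 name: {name!r}")
    quoted = ", ".join(f'"{name}"' for name in packages)
    return f"file.project IN ({quoted})"
```

The explicit quote and backslash check is redundant with the grammar check by design, matching the belt-and-braces intent above.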

Tests

| | Before | After |
|---|---|---|
| Collector tests | 21 | 26 |
| Total tests | 56 | 61 |
| Coverage | 99% | 99% |

New tests cover: argv shape with WHERE-IN, per-package count splitting, zero-count packages still appear in dict, empty-input no-op, missing project field raises, unrequested-package row raises, quote-character rejection, batch-grouping invariant, single-batch-for-uniform-windows invariant, batch-failure-marks-all-in-window behaviour.

Verification

  • ruff check + ruff format --check + mypy: clean
  • pytest --cov: 61/61 passed, 99% coverage (collector 99%)

Test plan for QA

  • Read _resolve_pypinfo_path (unchanged), run_pypinfo_batch, and collect. Argv shape correct (--where, project pivot first, no -a/--auth); WHERE clause built safely; per-package failure isolation preserved at the badge-write layer.
  • pytest --cov confirms 61/61, 99% coverage.
  • Spot-test on the deployed CT 112: rebuild wheel, push, run once, confirm the four currently-tracked packages all get badges (counts unchanged; this PR is a cost optimization, not a metric change).
  • Sanity-check the BigQuery-cost story by capturing bytes_billed from the real query (the query field of pypinfo's JSON output). A four-package batch should bill the same ~4.6 GB observed today for a single-package query.

Post-merge

After this lands and CT 112 is redeployed, the service is positioned to accept external package-add PRs. A separate PR can document the request flow (likely a tracked.yaml rename + CONTRIBUTING.md update).

🤖 Generated with Claude Code

Replaces the previous per-package `pypinfo` invocation with a single
batched call per `window_days` group. The new
`run_pypinfo_batch(packages, window_days, ...)` builds a
`WHERE file.project IN (...)` clause covering every package with the
same window, pivots by `project` + `ci` + `installer`, and splits the
response into a `{package: count}` dict post-parse.
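The post-parse split could be sketched as below. The row shape (`project`, `ci`, `installer_name`, `download_count` keys) is an assumption about the parsed JSON rows, `split_counts` is a hypothetical name, and applying the CI and installer filters in Python rather than in SQL is a simplification for this sketch.

```python
from __future__ import annotations

from typing import Any, Iterable, Mapping, Sequence

# Installer allowlist as listed in this commit message.
ALLOWED_INSTALLERS = frozenset({"pip", "uv", "poetry", "pdm", "pipenv", "pipx"})


class CollectorError(Exception):
    """Stand-in for the collector's error type (assumption for this sketch)."""


def split_counts(
    rows: Iterable[Mapping[str, Any]], packages: Sequence[str]
) -> dict[str, int]:
    """Sum non-CI, allowlisted downloads per requested package."""
    # Seed with zeros so packages with no matching rows still appear in the result.
    counts = {name: 0 for name in packages}
    for row in rows:
        if "project" not in row or "installer_name" not in row:
            raise CollectorError("row is missing project or installer_name")
        project = row["project"]
        if project not in counts:
            raise CollectorError(f"row for unrequested package: {project!r}")
        if str(row.get("ci")) == "True":
            continue  # CI traffic is excluded
        if row["installer_name"] not in ALLOWED_INSTALLERS:
            continue
        counts[project] += int(row["download_count"])
    return counts
```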

## Why

BigQuery's `bytes_billed` is bounded by partition + column scan; the
size of a `WHERE`-IN list does NOT affect billed bytes. So one batched
call for N packages costs the same as one call for one package
(~4-5 GB scan, observed in real query metadata). The previous
implementation scaled linearly with package count, capping realistic
free-tier hosting at ~7 packages on a daily cadence. Batching scales
to hundreds of packages on the same 1 TB/month budget.

This is the cost lever for opening the service to packages other than
the maintainer's own — the prerequisite for accepting external
package-add requests via PR.

## Behavior preserved

- Installer allowlist (`pip`, `uv`, `poetry`, `pdm`, `pipenv`, `pipx`),
  ci-filter (`details.ci != "True"`), badge label
  `pip*/uv/poetry/pdm (Nd)`, badge filename
  `<pkg>/downloads-Nd-non-ci.json`, `_health.json` shape — all
  unchanged.
- `XDG_DATA_HOME` isolation still in place per-invocation (now per-batch
  rather than per-package, but the principle is the same).
- `subprocess.run` timeout=180, `Path(sys.executable).parent` resolver,
  no `-a/--auth` on argv — all unchanged.

## Failure isolation

- A batch-level failure (BigQuery error, malformed JSON, schema break,
  missing `project`/`installer_name` fields, row for unrequested
  package) marks every package in that window's batch as failed.
- Other windows still run independently — a 7-day batch isn't blocked
  by a 30-day batch's failure, and vice versa.
- Per-package badge-write failures stay isolated to the affected
  package (a read-only output subdir for one package doesn't break the
  rest).
- `_health.json` writes unconditionally regardless of which batches
  failed.
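The per-window isolation above can be sketched as a loop that catches failures at the batch boundary. The names `collect_counts` and `run_batch` are hypothetical, and the sketch assumes every batch-level problem surfaces as a `CollectorError`; badge and `_health.json` writing are left out.

```python
from __future__ import annotations

from typing import Callable, Mapping, Sequence


class CollectorError(Exception):
    """Stand-in for the collector's error type (assumption for this sketch)."""


BatchRunner = Callable[[Sequence[str], int], Mapping[str, int]]


def collect_counts(
    groups: Mapping[int, Sequence[str]], run_batch: BatchRunner
) -> tuple[dict[tuple[str, int], int], dict[tuple[str, int], str]]:
    """Run one batch per window; a failed batch marks only its own packages."""
    counts: dict[tuple[str, int], int] = {}
    failures: dict[tuple[str, int], str] = {}
    for window_days, packages in groups.items():
        try:
            batch = run_batch(packages, window_days)
        except CollectorError as exc:
            # Every package in this window fails; other windows keep running.
            for name in packages:
                failures[(name, window_days)] = str(exc)
            continue
        for name, count in batch.items():
            counts[(name, window_days)] = count
    return counts, failures
```

A caller would then write health output from both dictionaries regardless of which batches failed.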

## Security note

Package names go directly into the `WHERE file.project IN (...)`
clause as double-quoted SQL literals. PyPI's PEP 508 name grammar
restricts names to `[A-Za-z0-9._-]`, so they cannot contain quotes or
escape characters and the literal join is safe. Belt-and-braces:
explicit rejection of names containing `"` or `\\` before SQL
composition, with a `CollectorError`. This becomes load-bearing if
the input source is ever broadened beyond a maintainer-curated YAML.

## Tests

26 collector tests (was 21), all passing. Major rewrites:

- New: `test_run_pypinfo_batch_uses_where_in_clause_for_all_packages`,
  `test_run_pypinfo_batch_splits_counts_per_package`,
  `test_run_pypinfo_batch_includes_zero_count_packages`,
  `test_run_pypinfo_batch_returns_empty_for_empty_input`,
  `test_run_pypinfo_batch_raises_on_missing_project_field`,
  `test_run_pypinfo_batch_raises_on_unrequested_package`,
  `test_run_pypinfo_batch_rejects_package_names_with_quote_characters`,
  `test_collect_groups_packages_by_window_into_one_batch_per_window`,
  `test_collect_one_batch_for_all_packages_when_window_is_uniform`,
  `test_collect_records_batch_failure_for_all_packages_in_window`.
- Adapted: every existing test that assumed the per-package signature.
- Removed: `_collect_one` helper-test pattern (no longer relevant —
  collect's loop is now over per-window batches, not per-package).

Sweep:
- ruff / format / mypy: clean
- pytest --cov: 61/61 passed (was 56), 99% coverage

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the Awaiting CI and Ready for QA labels and removed the Awaiting CI label on Apr 25, 2026
@codecov-commenter

Codecov Report

❌ Patch coverage is 99.02439% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| tests/test_collector.py | 98.67% | 2 Missing ⚠️ |

@cmeans-claude-dev

Contributor Author

Closing PR #14 — batching is not a cost lever; v0.1.0's per-package serial wins

Closing this PR after live testing against 300 real PyPI packages on CT 112 contradicted the load-bearing rationale. Recording the findings here so the test investment isn't lost.

What we tested

Top 300 PyPI packages (from hugovk's top-pypi-packages list), real BigQuery query, single batch via pypinfo --where 'file.project IN (...)'. Total cost: $11.32 in BigQuery scan.

What we found

Cost claim was wrong. I'd asserted batching kept bytes_billed at ~4-5 GB regardless of N. Empirically:

| Query | bytes_billed |
|---|---|
| Single package | ~4.6 GB |
| 300 packages batched | 2.32 TB |

Per-package average inside the batch: ~7.7 GB — higher than single-package serial (~4.6 GB). The pypi.file_downloads table is clustered on file.project, so single-package queries efficiently prune to one cluster's worth of blocks; multi-package queries scan all the requested clusters. Batching is roughly 67% more expensive per package than serial.

This means the recomputed cost envelope for v0.1.0's per-package serial is more permissive than I originally claimed:

| Packages | Monthly bytes (daily collection) | Free tier? |
|---|---|---|
| 4 (current) | 552 GB | ✅ |
| 7 | 966 GB | ✅ ceiling |
| 10 | 1.38 TB | ~$2/month over |
| 50 | 6.9 TB | ~$30/month |
| 100 | 13.8 TB | ~$65/month |
| 300 | 41 TB | ~$200/month |
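The arithmetic behind these figures can be reproduced with a short sketch. The 4.6 GB per query, the 2.32 TB batch, and the 1 TB free tier come from this comment. The $5 per TB price is an assumption that is not stated in the thread; it is used here only because it roughly reproduces the dollar figures in the table.

```python
GB_PER_QUERY = 4.6     # observed single-package bytes_billed, in GB
FREE_TIER_GB = 1000.0  # 1 TB/month free tier, decimal basis
USD_PER_TB = 5.0       # ASSUMPTION: on-demand price, not stated in the thread


def monthly_gb(packages: int, runs_per_month: int = 30) -> float:
    """Billed GB per month for per-package serial collection."""
    return packages * runs_per_month * GB_PER_QUERY


def monthly_overage_usd(packages: int, runs_per_month: int = 30) -> float:
    """Cost of whatever exceeds the free tier, under the assumed price."""
    over_gb = max(0.0, monthly_gb(packages, runs_per_month) - FREE_TIER_GB)
    return over_gb / 1000.0 * USD_PER_TB


# Per-package share of the 300-package batch, and its ratio to a serial query.
batched_gb_per_package = 2.32 * 1000 / 300
batch_vs_serial = batched_gb_per_package / GB_PER_QUERY

if __name__ == "__main__":
    for n in (4, 7, 10, 50, 100, 300):
        print(n, round(monthly_gb(n)), round(monthly_overage_usd(n), 2))
```

Passing `runs_per_month=4` models a roughly weekly cadence with the same functions.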

Two real pypinfo bugs caught

The test investment did surface fixable issues:

  1. Positional [PROJECT] AND-combines with --where instead of being overridden. With a placeholder positional like boto3, the SQL becomes WHERE file.project = "boto3" AND file.project IN ("boto3", ...), silently restricting the response to one package. Workaround: pass "" (empty string) so pypinfo's if project: branch in core.py:build_query skips emitting the auto-filter line.
  2. pypinfo defaults to LIMIT 10, and because its source uses limit or DEFAULT_LIMIT, passing 0 falls back to 10 rather than disabling the limit. With the project x ci x installer pivot potentially producing thousands of rows, the default truncates the result badly, so an explicit --limit <large> is needed.

These are noted in the awareness store under pypinfo-cli-gotchas for future use.
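For reference, both workarounds can be combined in one argv builder. This is a sketch only: `build_batch_argv` is a hypothetical helper, the limit of 10,000 is an illustrative value rather than a tested one, and package names are assumed to have been validated before this point.

```python
from __future__ import annotations

from typing import Sequence


def build_batch_argv(
    packages: Sequence[str], window_days: int, limit: int = 10_000
) -> list[str]:
    """Build a pypinfo argv that sidesteps both gotchas described above."""
    quoted = ", ".join(f'"{name}"' for name in packages)
    return [
        "pypinfo",
        "--json",
        "--days", str(window_days),
        "--all",
        # Gotcha 2: never rely on the implicit LIMIT 10.
        "--limit", str(limit),
        "--where", f"file.project IN ({quoted})",
        # Gotcha 1: empty positional so no file.project = "..." filter is emitted.
        "",
        "project", "ci", "installer",
    ]
```

Returning a list (rather than a shell string) keeps the empty positional intact when the argv is handed to `subprocess.run`.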

Why close instead of reframe

Even after the bug fixes, the only remaining benefit of batching is an atomic snapshot of all packages at one BigQuery moment. That is meaningful for trend analysis but immaterial for badge serving (the 30-second skew across 4 packages doesn't matter). Paying 67% more per package for that benefit isn't worth it at our scale.

What this means for hosting-for-others

Realistic envelope for the existing per-package serial implementation:

  • 4-7 packages: comfortably free tier
  • 10-20 packages: a few dollars per month
  • 100+ packages: genuinely expensive, needs a different strategy (less-frequent collection, pypistats.org as data source, or accepting cost as service economics)

Cost reduction levers, in order of effectiveness:

  1. Reduce collection frequency (daily → weekly = 7x reduction)
  2. Use pypistats.org instead of BigQuery (free, but loses installer-allowlist refinement)
  3. Hybrid (pypistats for daily updates, BigQuery for weekly installer-mix refresh)

A separate small docs-only PR will document this cost model in the README and deploy/ so future operators understand the scale envelope.

Mechanics

  • Branch feat/batched-bigquery-query deleted on close.
  • v0.1.0 (per-package serial, currently deployed at pypi-badges.intfar.com) is the right shape for our scale.
  • Findings preserved in awareness so the next session doesn't re-walk this $11 lesson.

@cmeans-claude-dev cmeans-claude-dev Bot deleted the feat/batched-bigquery-query branch April 25, 2026 13:49
cmeans-claude-dev Bot added a commit that referenced this pull request May 4, 2026
…pypinfo (#62)

## QA round 2 — doc-only follow-up at 74fef8b

Two findings from QA round 1, both doc-only:

- **Free-tier basis aligned on 1 TiB across README, doc, and CHANGELOG.** README already cited 1 TiB but the cost-envelope table in `docs/cost-model-and-pypinfo-gotchas.md` used a 1 TB basis in the heading and the qualitative cells. Switched to 1 TiB (matches GCP's actual figure and what the README says) and refreshed the verdicts whose qualitative reading shifts under the larger basis: 7-pkg row from "at the ceiling" to "comfortably under (88%)"; 10-pkg row from "~$2/month over" to "~$1.40/month over"; 50-pkg from "~$30" to "~$29"; 100-pkg from "~$65" to "~$64". README's prose rule of thumb updated from "~7 packages" to "~7-8 packages" so it matches the table; CHANGELOG entry follows.
- **"See also" bullet pointing at `collector.py` no longer overstates what is commented inline.** Originally said "with both gotchas commented at their respective line ranges"; only gotcha 2 (`--limit`) is commented in the function body, because the collector ships per-package serial and does not pass `--where`. Reworded to name `--limit` explicitly and explain why the `--where` gotcha is preserved in the doc but not in code.

## Summary

Two parts in one PR — they share the same source material from PR #14's testing and would be artificially split if they didn't share their references.

**1. New engineering doc: `docs/cost-model-and-pypinfo-gotchas.md`.** Captures three things that aren't obvious from the code:

- **BigQuery scan-cost shape.** Empirical numbers from PR #14's testing (which cost the project $11.32 to learn): per-package serial runs bill ~4.6 GB per package per 30-day window, while a batched query of 300 packages bills ~7.7 GB per package. Batching is therefore *more* expensive than serial, because the table is clustered on `file.project` and `WHERE IN` defeats cluster pruning. The free-tier ceiling lands around 7-8 packages on a daily cadence (1 TiB basis).
- **Levers that move cost.** Frequency cuts (daily → weekly = 7x), pypistats.org fallback, hybrid, materialized views — in order of effectiveness. Includes anti-levers (smaller batch sizes, `TABLESAMPLE`) for completeness.
- **Two pypinfo CLI gotchas** with `core.py:build_query` references: `--where` AND-combines with the positional rather than overriding, and `--limit` defaults to 10 with falsy values falling back to that default (`limit or DEFAULT_LIMIT` in source). The first one bites multi-package callers; the second one bites multi-pivot callers — including this project's `run_pypinfo`.

The material was previously captured only in the maintainer's private knowledge store. Promoting it to the public repo means future maintainers and self-hosters don't have to re-walk it. README install section gains a brief pointer with the free-tier rule of thumb so self-hosters can size against it before committing.

**2. `run_pypinfo` argv carries an explicit `--limit 500`.** Closes the gotcha-2 hole on the live code path. The pivot is `ci x installer x system`; realistic distinct combos for one package are ~3 x 8 x 4 ≈ 96. Under the prior implicit-default-10 path, a popular package with diverse installer/system spread silently lost the long tail and the hero badge (sum of post-allowlist rows) would undercount. SQL `LIMIT` is post-aggregation, so a generous bound does not change `bytes_billed` — the cost envelope in the new doc is unaffected.

Regression coverage in `tests/test_collector.py::test_run_pypinfo_argv_passes_explicit_limit` fails if `--limit` is dropped or the value drops below 100.
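A check of that shape might look like the sketch below. It is not the repository's test: `explicit_limit` is a hypothetical helper, and the argv is a hand-written stand-in for whatever `run_pypinfo` actually passes to `subprocess.run`.

```python
from __future__ import annotations


def explicit_limit(argv: list[str]) -> int:
    """Return the integer following --limit, failing loudly if it is absent."""
    if "--limit" not in argv:
        raise AssertionError("--limit missing from argv")
    idx = argv.index("--limit")
    if idx + 1 >= len(argv):
        raise AssertionError("--limit has no value")
    return int(argv[idx + 1])


def test_argv_passes_explicit_limit() -> None:
    # Stand-in argv; a real test would capture it from a mocked subprocess.run.
    argv = [
        "pypinfo", "--json", "--days", "30", "--limit", "500",
        "pkg-a", "ci", "installer", "system",
    ]
    assert explicit_limit(argv) >= 100
```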

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>