cmeans · cmeans-claude-dev · May 4, 2026 · May 2, 2026 · May 2, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -4,6 +4,10 @@
 
 ### Added
 
+- **`docs/cost-model-and-pypinfo-gotchas.md`** new engineering note documenting the BigQuery scan-cost shape for `bigquery-public-data.pypi.file_downloads` (the ~4.6 GB-per-package per-query rate, the free-tier ceiling around 7-8 packages on a daily cadence, the cost envelope at higher scales, and the levers that actually move cost) and two `pypinfo` CLI foot-guns surfaced during PR #14's testing: `--where` AND-combining with the positional rather than overriding it, and `--limit` defaulting to 10 with falsy values falling back to that default. Also explains why batching is *not* a cost lever — empirically more expensive than per-package serial because the `pypi.file_downloads` table is clustered on `file.project` and `WHERE IN` defeats cluster pruning. README's install section gains a brief pointer with the free-tier rule of thumb so self-hosters can size against it before committing.
+
+- **`run_pypinfo` argv carries an explicit `--limit 500`.** Defends against pypinfo's CLI default of 10 (`limit or DEFAULT_LIMIT` in `pypinfo/core.py:build_query` treats falsy values, including 0, as falling back) silently truncating the long tail of `(ci, installer, system)` rows. Realistic combo ceiling for one package is ~3 × 8 × 4 ≈ 96; under the prior implicit-default-10 path, a popular package with diverse installer/system spread would lose the tail and the hero count (sum of post-allowlist rows) would silently undercount. SQL `LIMIT` is applied after aggregation, so a generous bound does not change `bytes_billed` — the cost envelope in the new doc is unaffected. Regression test in `tests/test_collector.py::test_run_pypinfo_argv_passes_explicit_limit` fails if `--limit` is dropped or the value drops below 100. See `docs/cost-model-and-pypinfo-gotchas.md` gotcha 2.
+
 - **`.github/workflows/ci.yml`** lint job gains a `uv lock --locked` step that fails fast if `uv.lock` and `pyproject.toml` have drifted. Catches the v0.3.0-style regression where a release commit bumps `pyproject.toml` `version` without re-running `uv lock`, leaving the lockfile's self-version block on the previous release. Structural backstop: a release commit that forgets the lockfile bump now redlits its own PR before merge rather than landing and getting picked up days later by the weekly `uv-lock-refresh.yml` cron with mis-framed PR copy. Does not touch upstream-PyPI freshness — that remains the weekly cron's job, since `uv lock --locked` only checks `pyproject.toml`-vs-lockfile consistency, not lockfile-vs-PyPI freshness. Closes [#60](https://github.com/cmeans/pypi-winnow-downloads/issues/60).
 
 ### Changed

diff --git a/README.md b/README.md
@@ -101,6 +101,14 @@ works and is what `config.example.yaml` and the reference deploy
 document. Then point `service.credential_file` in your config at the
 resulting file.
 
+BigQuery's free tier is 1 TiB of scan per month. At the per-package
+serial rate this collector uses (~4.6 GB scanned per package per
+30-day window query), a daily run on ~7-8 packages sits at the
+free-tier ceiling. The
+[cost model and pypinfo gotchas](docs/cost-model-and-pypinfo-gotchas.md)
+doc has the full table, the levers for scaling beyond that, and why
+batching is *not* one of them.
+
 Run with a YAML config — copy
 [`config.example.yaml`](https://github.com/cmeans/pypi-winnow-downloads/blob/main/config.example.yaml)
 and edit:
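The free-tier rule of thumb added to the README, and the envelope table in the new doc, follow from simple arithmetic. Below is a minimal sketch of that arithmetic, assuming the ~4.6 GB per-package scan, one run per day over a 30-day month, a 1 TiB free tier, and the $5/TB rate quoted in the doc. The function and constant names are illustrative and not part of the collector.

```python
GB = 10**9
TB = 10**12
TIB = 2**40

BYTES_PER_PACKAGE_RUN = 4.6 * GB  # empirical per-package scan from the cost doc
FREE_TIER_BYTES = 1 * TIB         # free tier per month, as stated in the README
USD_PER_TB = 5.0                  # after-free-tier rate quoted in the cost doc


def monthly_bytes(packages: int, runs_per_month: int = 30) -> float:
    """Bytes billed per month for `packages` queried once per run."""
    return packages * runs_per_month * BYTES_PER_PACKAGE_RUN


def monthly_overage_usd(packages: int, runs_per_month: int = 30) -> float:
    """Dollar cost of the bytes that exceed the free tier (0.0 if under)."""
    over = max(0.0, monthly_bytes(packages, runs_per_month) - FREE_TIER_BYTES)
    return over / TB * USD_PER_TB


def max_free_packages(runs_per_month: int = 30) -> int:
    """Largest whole number of packages that stays inside the free tier."""
    return int(FREE_TIER_BYTES // (runs_per_month * BYTES_PER_PACKAGE_RUN))


if __name__ == "__main__":
    for n in (4, 7, 10, 50, 100, 300):
        print(n, round(monthly_bytes(n) / GB), round(monthly_overage_usd(n), 2))
```

With these inputs `max_free_packages()` comes out at 7, and ten packages cost about $1.40 a month, matching the table. Passing a smaller `runs_per_month` shows the effect of the frequency lever.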

diff --git a/docs/cost-model-and-pypinfo-gotchas.md b/docs/cost-model-and-pypinfo-gotchas.md
@@ -0,0 +1,160 @@
+# Cost model and pypinfo CLI gotchas
+
+This is an engineering note for anyone running `pypi-winnow-downloads`
+themselves, or anyone hacking on the `collector.py` BigQuery code path.
+It captures two things that are non-obvious from the code alone:
+
+1. The shape of BigQuery scan cost for `bigquery-public-data.pypi.file_downloads`
+2. Two foot-guns in `pypinfo`'s CLI that affect how the collector calls it
+
+The numbers in the cost section come from empirical testing during PR #14
+(batched-query refactor, ultimately closed in favor of the per-package
+serial approach this repo ships). The pypinfo gotchas were surfaced
+during that same testing and during the v0.1.x feat/collector PR review
+cycles.
+
+## BigQuery scan cost shape
+
+`bigquery-public-data.pypi.file_downloads` is clustered on `file.project`
+(or has a clustering-equivalent block layout). A query filtering on a
+single project prunes efficiently to that project's blocks; a query
+filtering on N projects via `WHERE file.project IN (...)` prunes less
+cleanly and billed about 1.7x as much per package in the tests below.
+
+Empirical (30-day window, daily run, all installers):
+
+| Approach                    | Bytes billed   | Per package |
+|-----------------------------|----------------|-------------|
+| 1 package, single query     | ~4.6 GB        | 4.6 GB      |
+| 300 packages, batched query | ~2.32 TB       | ~7.7 GB/pkg |
+| 300 packages, serial calls  | ~1.38 TB       | ~4.6 GB/pkg |
+
+**Batching is more expensive than serial**, not less. The cluster-pruning
+advantage applies most cleanly to single-package queries, so a tight
+loop of 1-package queries comes out ahead of a single big `WHERE IN`.
+Under either strategy, bytes billed grow roughly linearly with packages
+touched; batching only raises the per-package rate.
+
+### Cost envelope at the per-package serial rate (~4.6 GB/pkg/run)
+
+| Packages | Monthly bytes billed | Cost vs. free tier (1 TiB/mo) |
+|----------|----------------------|-------------------------------|
+| 4        | 552 GB               | comfortably under (50%)       |
+| 7        | 966 GB               | under (88%)                   |
+| 10       | 1.38 TB              | ~$1.40/month                  |
+| 50       | 6.9 TB               | ~$29/month                    |
+| 100      | 13.8 TB              | ~$64/month                    |
+| 300      | 41.4 TB              | ~$200/month                   |
+
+After-free-tier rate: $5/TB.
+
+The `BigQuery Sandbox` mode this project's GCP setup uses returns quota
+errors (not charges) when the free tier is exhausted. That is the
+desired failure mode for a hobby workload: "stop emitting badges" beats
+"silently bill the maintainer's credit card."
+
+### Levers for scale, in order of effectiveness
+
+1. **Reduce collection frequency.** Daily → weekly is a 7x cost cut;
+   daily → monthly is 30x. Trades freshness for cost; for download
+   counts averaged over a 30-day window, freshness within a few days
+   is plenty.
+2. **Use pypistats.org as the data source instead.** Free, no scan
+   cost, but loses the installer-mirror filtering refinement that this
+   project's "non-CI downloads" framing depends on. You're back to
+   the v1 mirror-inclusive numbers we explicitly improved on.
+3. **Hybrid:** pypistats daily for rough counts, BigQuery weekly for
+   installer-mix sanity. Best cost-vs-quality tradeoff at scale.
+4. **Materialized view of pre-aggregated daily summaries.** Significant
+   setup, modest savings, only worth it past hundreds of packages.
+
+### Levers that do NOT help cost
+
+- Direct BigQuery client library (skip the `pypinfo` subprocess) — the
+  scan cost is the same. Only saves subprocess startup time (~100 ms).
+- Smaller batch sizes (e.g., 10 packages at a time vs. 300) — total
+  scan cost is roughly proportional to packages-touched.
+- Sample-based scans (`TABLESAMPLE`) — introduces noise. Bad for the
+  honesty pitch this project makes about its numbers.
+
+### Decision for v1
+
+Per-package serial is the right shape for current scale (4 dogfood
+packages, comfortably under the free tier) and remains reasonable at
+~10-20 packages. Beyond that, reach for the levers listed above rather
+than query batching. PR #14's batched-query refactor was closed for
+that reason.
+
+## pypinfo CLI gotchas
+
+These behaviors live in `pypinfo/cli.py` and `pypinfo/core.py` as of
+the version range this project requires (`pypinfo>=23.0.0`). They are
+surprising enough that anyone editing `run_pypinfo`'s argv
+construction should know about them.
+
+### 1. `--where` AND-combines with the positional, never overrides it
+
+`pypinfo [PROJECT] [FIELDS...]` always generates a `WHERE
+file.project = "<PROJECT>"` clause from the positional and ANDs it with
+any additional `--where` predicate. It does not replace one with the
+other. Source, `pypinfo/core.py:build_query`:
+
+```python
+conditions = ["WHERE timestamp BETWEEN ..."]
+if project:
+    conditions.append(f'file.project = "{project}"\n')
+...
+query += "  AND ".join(conditions)
+if where:
+    query += f"  AND {where}\n"
+```
+
+The `if project:` check means an **empty-string positional** (`""`)
+skips the auto-filter. So if you ever want a multi-package query, the
+positional must be `""` and the package list goes in `--where`:
+
+```
+pypinfo --where 'file.project IN ("a","b","c")' "" project ci installer
+```
+
+Anything else and the SQL ends up with both `file.project = "a" AND
+file.project IN (...)`, which silently restricts the response to
+package `a` only.
+
+This project ships per-package serial (one query per package), so the
+collector passes the real package name as the positional and does not
+use `--where`. The gotcha is preserved here for anyone reviving the
+batched path or hacking on a fork.
+
+### 2. `--limit` defaults to 10; falsy values fall back to that default
+
+`limit = limit or DEFAULT_LIMIT` in `core.build_query`, with
+`DEFAULT_LIMIT = 10`. Passing `--limit 0` is treated as falsy and
+falls back to 10. There is no "no-limit" mode in the CLI.
+
+For multi-pivot queries — e.g., `[PROJECT] ci installer system` —
+the result has up to one row per distinct `(ci, installer, system)`
+combination. Realistic combos for one package run to a few dozen, so
+the default of 10 silently truncates the long tail.
+
+In `run_pypinfo` we pivot by `ci × installer × system` (3 fields), so
+the argv carries an explicit `--limit 500`, roughly 5x headroom over
+the realistic combo ceiling of ~3 × 8 × 4 = 96. SQL `LIMIT` is applied
+after aggregation, so a generous bound does not change `bytes_billed`.
+
+If you ever extend the pivot — adding `country` or `version` —
+recompute the combo ceiling and bump `--limit` accordingly.
+
+## See also
+
+- `src/pypi_winnow_downloads/collector.py` — the live `run_pypinfo`
+  function with the `--limit` gotcha commented inline. The `--where`
+  gotcha is not commented in the code because the collector ships
+  per-package serial and does not pass `--where`; it is preserved here
+  for anyone reviving the batched path or hacking on a fork.
+- `tests/test_collector.py::test_run_pypinfo_argv_passes_explicit_limit`
+  — fails if `--limit` is dropped from argv or its value falls below 100.
+- [BigQuery pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing)
+  — the $5/TB on-demand rate referenced above.
+- [pypinfo](https://github.com/ofek/pypinfo) — upstream source, where
+  the file/line references in this doc point.
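Gotcha 1 in the doc above can be illustrated with a toy model of the clause assembly. This is a hypothetical simplification based only on the excerpt quoted in the doc, not pypinfo's actual `build_query`. The name `build_where` and the `<start>`/`<end>` placeholders are invented for illustration.

```python
from typing import Optional


def build_where(project: str, where: Optional[str] = None) -> str:
    """Toy model of the WHERE assembly described in gotcha 1.

    Only the interaction between the positional project and --where is
    modelled; the timestamp bounds are placeholders.
    """
    conditions = ["WHERE timestamp BETWEEN <start> AND <end>"]
    if project:  # an empty-string positional skips the auto-filter
        conditions.append(f'file.project = "{project}"')
    if where:  # --where is appended with AND, never substituted
        conditions.append(where)
    return "\n  AND ".join(conditions)


batched = 'file.project IN ("a","b","c")'

# Positional "a" plus --where: both predicates survive, so only "a" matches.
print(build_where("a", batched))

# Empty positional: only the IN predicate remains, so all three can match.
print(build_where("", batched))
```

The first call keeps both `file.project = "a"` and the `IN` list in the clause, which is the silent restriction the doc warns about. The second call is the form a revived batched path would need.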
diff --git a/src/pypi_winnow_downloads/collector.py b/src/pypi_winnow_downloads/collector.py
@@ -166,12 +166,23 @@ def run_pypinfo(
     # short-circuits to a credential-setter path when --auth is present and
     # never runs the query. Use GOOGLE_APPLICATION_CREDENTIALS instead, which
     # pypinfo's core.py reads via os.environ.get on the no-flag path.
+    #
+    # `--limit 500` defends against pypinfo's CLI default of 10 (cli.py
+    # passes `limit or DEFAULT_LIMIT` to core.build_query, where falsy
+    # values including 0 fall back to 10 — there is no "no-limit" mode).
+    # The pivot is `ci x installer x system`: realistic max distinct
+    # combos for one package is ~3 x 8 x 4 = 96, so 500 leaves ~5x
+    # headroom for unexpected installer/system variants. SQL `LIMIT` is
+    # applied after aggregation, so a generous bound does not change
+    # `bytes_billed`. See docs/cost-model-and-pypinfo-gotchas.md.
     argv = [
         _resolve_pypinfo_path(),
         "--json",
         "--days",
         str(window_days),
         "--all",
+        "--limit",
+        "500",
         package,
         "ci",
         "installer",
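The headroom reasoning in the comment above can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the per-field cardinalities from the comment (3 × 8 × 4). The `country` cardinality of 50 is a made-up figure used only to show what extending the pivot does, and the helper names are illustrative.

```python
import math
from typing import Mapping


def combo_ceiling(cardinalities: Mapping[str, int]) -> int:
    """Upper bound on distinct result rows: product of per-field cardinalities."""
    return math.prod(cardinalities.values())


def headroom(limit: int, cardinalities: Mapping[str, int]) -> float:
    """How many times over the limit covers the combo ceiling."""
    return limit / combo_ceiling(cardinalities)


# Estimates taken from the comment in run_pypinfo.
current = {"ci": 3, "installer": 8, "system": 4}
print(combo_ceiling(current), round(headroom(500, current), 1))

# Hypothetical extension of the pivot; 50 is an assumed cardinality.
extended = {**current, "country": 50}
print(combo_ceiling(extended), headroom(500, extended) >= 1.0)
```

Under these assumptions the current pivot has a ceiling of 96 rows and 500 covers it about 5.2 times, while adding a 50-value field pushes the ceiling to 4800 and the same limit would truncate again.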

diff --git a/tests/test_collector.py b/tests/test_collector.py
@@ -83,6 +83,41 @@ def fake_runner(argv: list[str], env: dict[str, str]) -> subprocess.CompletedPro
     assert argv[-3:] == ["ci", "installer", "system"], argv
 
 
+def test_run_pypinfo_argv_passes_explicit_limit(tmp_path: Path) -> None:
+    """pypinfo's CLI defaults to `--limit 10` and treats falsy values
+    (including 0) as falling back to that default — `limit or DEFAULT_LIMIT`
+    in core.build_query. With a `ci x installer x system` pivot, realistic
+    distinct combos for a single package can run to several dozen rows; a
+    silent LIMIT-10 truncation would undercount the hero badge by dropping
+    the long tail of less-common installer/system pairs. The argv must
+    therefore carry an explicit `--limit` with a value comfortably above
+    the realistic ceiling. See docs/cost-model-and-pypinfo-gotchas.md.
+    """
+    captured: list[list[str]] = []
+
+    def fake_runner(argv: list[str], env: dict[str, str]) -> subprocess.CompletedProcess[str]:
+        captured.append(list(argv))
+        return _ok_result(argv)
+
+    creds = tmp_path / "creds.json"
+    creds.write_text("{}")
+    run_pypinfo("mypkg", 30, credential_file=creds, runner=fake_runner)
+
+    argv = captured[0]
+    assert "--limit" in argv, (
+        "argv must carry `--limit` to defeat pypinfo's default-10 truncation; "
+        "see docs/cost-model-and-pypinfo-gotchas.md gotcha 2."
+    )
+    limit_value = int(argv[argv.index("--limit") + 1])
+    # Realistic max combos for one package is ~3 x 8 x 4 = 96, so the bound
+    # must clear that with margin. 100 is the floor; the current value is
+    # 500 (~5x headroom).
+    assert limit_value >= 100, (
+        f"--limit must be >= 100 to clear the realistic ci-by-installer-by-system "
+        f"combo ceiling for one package; got {limit_value}."
+    )
+
+
 def _ok_rows(rows: list[dict]) -> str:
     """Helper: shape an `_ok_result` JSON payload from a list of pypinfo row dicts."""
     return json.dumps({"rows": rows})
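The undercount that the new test guards against can be shown on toy data. This is a sketch under stated assumptions: the 24 rows and their counts are made up, the `download_count` key and descending order are assumed rather than taken from pypinfo's output format, and the collector's allowlist filtering is ignored.

```python
from typing import Optional


def hero_count(rows: list, limit: Optional[int] = None) -> int:
    """Sum download counts over the rows a LIMIT would keep."""
    ordered = sorted(rows, key=lambda r: r["download_count"], reverse=True)
    kept = ordered if limit is None else ordered[:limit]
    return sum(r["download_count"] for r in kept)


# 24 made-up rows standing in for distinct (ci, installer, system) combos.
rows = [{"combo": i, "download_count": 100 - 4 * i} for i in range(24)]

full = hero_count(rows)           # every combo counted
truncated = hero_count(rows, 10)  # what an implicit limit of 10 would keep
print(full, truncated, full - truncated)
```

In this toy data a limit of 10 keeps 820 of 1296 downloads, while any limit at or above the row count returns the full sum.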