4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -4,6 +4,10 @@

### Added

- **`docs/cost-model-and-pypinfo-gotchas.md`** new engineering note documenting the BigQuery scan-cost shape for `bigquery-public-data.pypi.file_downloads` (the ~4.6 GB-per-package per-query rate, the free-tier ceiling around 7-8 packages on a daily cadence, the cost envelope at higher scales, and the levers that actually move cost) and two `pypinfo` CLI foot-guns surfaced during PR #14's testing: `--where` AND-combining with the positional rather than overriding it, and `--limit` defaulting to 10 with falsy values falling back to that default. Also explains why batching is *not* a cost lever — empirically more expensive than per-package serial because the `pypi.file_downloads` table is clustered on `file.project` and `WHERE IN` defeats cluster pruning. README's install section gains a brief pointer with the free-tier rule of thumb so self-hosters can size against it before committing.

- **`run_pypinfo` argv carries an explicit `--limit 500`.** Defends against pypinfo's CLI default of 10 (`limit or DEFAULT_LIMIT` in `pypinfo/core.py:build_query` treats falsy values, including 0, as falling back) silently truncating the long tail of `(ci, installer, system)` rows. Realistic combo ceiling for one package is ~3 × 8 × 4 ≈ 96; under the prior implicit-default-10 path, a popular package with diverse installer/system spread would lose the tail and the hero count (sum of post-allowlist rows) would silently undercount. SQL `LIMIT` is applied after aggregation, so a generous bound does not change `bytes_billed` — the cost envelope in the new doc is unaffected. Regression test in `tests/test_collector.py::test_run_pypinfo_argv_passes_explicit_limit` fails if `--limit` is dropped or the value drops below 100. See `docs/cost-model-and-pypinfo-gotchas.md` gotcha 2.

- **`.github/workflows/ci.yml`** lint job gains a `uv lock --locked` step that fails fast if `uv.lock` and `pyproject.toml` have drifted. Catches the v0.3.0-style regression where a release commit bumps `pyproject.toml` `version` without re-running `uv lock`, leaving the lockfile's self-version block on the previous release. Structural backstop: a release commit that forgets the lockfile bump now fails CI on its own PR before merge rather than landing and getting picked up days later by the weekly `uv-lock-refresh.yml` cron with mis-framed PR copy. Does not touch upstream-PyPI freshness; that remains the weekly cron's job, since `uv lock --locked` only checks `pyproject.toml`-vs-lockfile consistency, not lockfile-vs-PyPI freshness. Closes [#60](https://github.com/cmeans/pypi-winnow-downloads/issues/60).

### Changed
8 changes: 8 additions & 0 deletions README.md
@@ -101,6 +101,14 @@ works and is what `config.example.yaml` and the reference deploy
document. Then point `service.credential_file` in your config at the
resulting file.

BigQuery's free tier is 1 TiB of scan per month. At the per-package
serial rate this collector uses (~4.6 GB scanned per package per
30-day window query), a daily run on ~7-8 packages sits at the
free-tier ceiling. The
[cost model and pypinfo gotchas](docs/cost-model-and-pypinfo-gotchas.md)
doc has the full table, the levers for scaling beyond that, and why
batching is *not* one of them.

Run with a YAML config — copy
[`config.example.yaml`](https://github.com/cmeans/pypi-winnow-downloads/blob/main/config.example.yaml)
and edit:
160 changes: 160 additions & 0 deletions docs/cost-model-and-pypinfo-gotchas.md
@@ -0,0 +1,160 @@
# Cost model and pypinfo CLI gotchas

This is an engineering note for anyone running `pypi-winnow-downloads`
themselves, or anyone hacking on the `collector.py` BigQuery code path.
It captures two things that are non-obvious from the code alone:

1. The shape of BigQuery scan cost for `bigquery-public-data.pypi.file_downloads`
2. Two foot-guns in `pypinfo`'s CLI that affect how the collector calls it

The numbers in the cost section come from empirical testing during PR #14
(batched-query refactor, ultimately closed in favor of the per-package
serial approach this repo ships). The pypinfo gotchas were surfaced
during that same testing and during the v0.1.x feat/collector PR review
cycles.

## BigQuery scan cost shape

`bigquery-public-data.pypi.file_downloads` is clustered on `file.project`
(or has a clustering-equivalent block layout). Clustering means a query
filtering on a single project prunes efficiently to that project's
blocks. A query filtering on N projects via `WHERE file.project IN (...)`
prunes less cleanly and, in the measurements below, bills more bytes per
package than N single-project queries.

Empirical (30-day window, daily run, all installers):

| Approach | Bytes billed | Per package |
|-----------------------------|----------------|-------------|
| 1 package, single query | ~4.6 GB | 4.6 GB |
| 300 packages, batched query | ~2.32 TB | ~7.7 GB/pkg |
| 300 packages, serial calls | ~1.38 TB | ~4.6 GB/pkg |

**Batching is more expensive than serial**, not less. The cluster-pruning
advantage applies most cleanly to single-package queries, so a tight
loop of 1-package queries comes out ahead of a single big `WHERE IN`.
Under either strategy, bytes billed grows roughly linearly with
packages touched; batching only raises the per-package figure
(~7.7 GB vs. ~4.6 GB).

### Cost envelope at the per-package serial rate (~4.6 GB/pkg/run)

| Packages | Monthly bytes billed (daily runs) | Cost vs. free tier (1 TiB/mo) |
|----------|-----------------------------------|-------------------------------|
| 4        | 552 GB                            | comfortably under (~50%)      |
| 7        | 966 GB                            | under, near ceiling (~88%)    |
| 10       | 1.38 TB                           | ~$1.40/month over             |
| 50       | 6.9 TB                            | ~$29/month                    |
| 100      | 13.8 TB                           | ~$64/month                    |
| 300      | 41.4 TB                           | ~$200/month                   |

After-free-tier rate: $5/TB.
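
The table's arithmetic can be reproduced with a short sketch. It assumes 30 runs per month, decimal GB/TB for bytes billed, a free tier of 1 TiB (2^40 bytes), and the $5/TB rate quoted above. The function and constant names are illustrative and are not part of the collector.

```python
GB = 10**9
TB = 10**12
TIB = 2**40

GB_PER_PACKAGE_PER_RUN = 4.6  # empirical per-package serial rate from this note
FREE_TIER_BYTES = 1 * TIB     # monthly free tier
USD_PER_TB = 5.0              # after-free-tier rate quoted above


def monthly_bytes(packages: int, runs_per_month: float = 30) -> float:
    """Bytes billed per month for per-package serial queries."""
    return packages * GB_PER_PACKAGE_PER_RUN * GB * runs_per_month


def monthly_overage_usd(packages: int, runs_per_month: float = 30) -> float:
    """Dollars billed beyond the free tier (0.0 while still under it)."""
    overage = max(0.0, monthly_bytes(packages, runs_per_month) - FREE_TIER_BYTES)
    return overage / TB * USD_PER_TB


if __name__ == "__main__":
    for n in (4, 7, 10, 50, 100, 300):
        print(
            f"{n:>3} packages: {monthly_bytes(n) / TB:6.2f} TB, "
            f"${monthly_overage_usd(n):7.2f} over free tier"
        )
```

Passing a smaller `runs_per_month` models a less frequent cadence with the same per-run rate.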

The `BigQuery Sandbox` mode this project's GCP setup uses returns quota
errors (not charges) when the free tier is exhausted. That is the
desired failure mode for a hobby workload: "stop emitting badges" beats
"silently bill the maintainer's credit card."

### Levers for scale, in order of effectiveness

1. **Reduce collection frequency.** Daily → weekly is a 7x cost cut;
daily → monthly is 30x. Trades freshness for cost; for download
counts averaged over a 30-day window, freshness within a few days
is plenty.
2. **Use pypistats.org as the data source instead.** Free, no scan
cost, but loses the installer-mirror filtering refinement that this
project's "non-CI downloads" framing depends on. You're back to
the v1 mirror-inclusive numbers we explicitly improved on.
3. **Hybrid:** pypistats daily for rough counts, BigQuery weekly for
installer-mix sanity. Best cost-vs-quality tradeoff at scale.
4. **Materialized view of pre-aggregated daily summaries.** Significant
setup, modest savings, only worth it past hundreds of packages.
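
As a rough illustration of lever 1, the sketch below estimates how many packages fit inside the free tier at each cadence. It assumes the ~4.6 GB per-package rate holds at every cadence and approximates a month as 30 days; the names are hypothetical and not part of this project.

```python
import math

GB = 10**9
TIB = 2**40
GB_PER_PACKAGE_PER_RUN = 4.6  # per-package serial rate from this note

# Runs per 30-day month for each cadence (an approximation for illustration).
RUNS_PER_MONTH = {"daily": 30.0, "weekly": 30.0 / 7.0, "monthly": 1.0}


def max_packages_in_free_tier(cadence: str) -> int:
    """Largest package count whose monthly scan stays within 1 TiB."""
    bytes_per_package_month = GB_PER_PACKAGE_PER_RUN * GB * RUNS_PER_MONTH[cadence]
    return math.floor(TIB / bytes_per_package_month)


if __name__ == "__main__":
    for cadence in RUNS_PER_MONTH:
        print(f"{cadence:>7}: up to {max_packages_in_free_tier(cadence)} packages")
```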

### Levers that do NOT help cost

- Direct BigQuery client library (skip the `pypinfo` subprocess) — the
scan cost is the same. Only saves subprocess startup time (~100 ms).
- Smaller batch sizes (e.g., 10 packages at a time vs. 300) — total
scan cost is roughly proportional to packages-touched.
- Sample-based scans (`TABLESAMPLE`) — introduces noise. Bad for the
honesty pitch this project makes about its numbers.

### Decision for v1

Per-package serial is the right shape for current scale (4 dogfood
packages, comfortably under the free tier) and remains reasonable at
~10-20 packages. Beyond that, reach for the levers above, not query
batching. PR #14's batched-query refactor was closed for that
reason.

## pypinfo CLI gotchas

These behaviors live in `pypinfo/cli.py` and `pypinfo/core.py` as of
the version range this project requires (`pypinfo>=23.0.0`). They are
surprising enough that anyone editing `run_pypinfo`'s argv
construction should know about them.

### 1. `--where` AND-combines with the positional, never overrides it

`pypinfo [PROJECT] [FIELDS...]` always generates a `WHERE
file.project = "<PROJECT>"` clause from the positional and ANDs it with
any additional `--where` predicate. It does not replace one with the
other. Source, `pypinfo/core.py:build_query`:

```python
conditions = ["WHERE timestamp BETWEEN ..."]
if project:
    conditions.append(f'file.project = "{project}"\n')
...
query += " AND ".join(conditions)
if where:
    query += f" AND {where}\n"
```

The `if project:` check means an **empty-string positional** (`""`)
skips the auto-filter. So if you ever want a multi-package query, the
positional must be `""` and the package list goes in `--where`:

```
pypinfo --where 'file.project IN ("a","b","c")' "" project ci installer
```

Anything else and the SQL ends up with both `file.project = "a" AND
file.project IN (...)`, which silently restricts the response to
package `a` only.

This project ships per-package serial (one query per package), so the
collector passes the real package name as the positional and does not
use `--where`. The gotcha is preserved here for anyone reviving the
batched path or hacking on a fork.
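
The AND-combining behavior can be seen in a simplified model of the condition assembly quoted above. This is a sketch for illustration, not pypinfo's code: `sketch_where_clause` is a hypothetical name, and the timestamp bounds are placeholders.

```python
from typing import Optional


def sketch_where_clause(project: str, where: Optional[str] = None) -> str:
    """Simplified model of how the positional and --where are combined."""
    conditions = ["WHERE timestamp BETWEEN <start> AND <end>"]
    if project:  # an empty string is falsy, so "" skips the auto-filter
        conditions.append(f'file.project = "{project}"')
    clause = " AND ".join(conditions)
    if where:  # --where is appended with AND; it never replaces the positional
        clause += f" AND {where}"
    return clause


batched = 'file.project IN ("a","b","c")'

if __name__ == "__main__":
    # Positional "a" plus --where: both predicates are present, so only "a" matches.
    print(sketch_where_clause("a", batched))
    # Empty positional: only the --where predicate filters on project.
    print(sketch_where_clause("", batched))
```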

### 2. `--limit` defaults to 10; falsy values fall back to that default

`limit = limit or DEFAULT_LIMIT` in `core.build_query`, with
`DEFAULT_LIMIT = 10`. Passing `--limit 0` is treated as falsy and
falls back to 10. There is no "no-limit" mode in the CLI.
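
The fallback is ordinary Python truthiness. A minimal sketch, using the expression and the default value quoted above (`effective_limit` is a hypothetical helper, not a pypinfo function):

```python
from typing import Optional

DEFAULT_LIMIT = 10  # value stated in this note


def effective_limit(limit: Optional[int]) -> int:
    """Mirror `limit or DEFAULT_LIMIT`: None and 0 are both falsy."""
    return limit or DEFAULT_LIMIT


if __name__ == "__main__":
    for requested in (None, 0, 500):
        print(f"requested={requested!r} -> effective={effective_limit(requested)}")
```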

For multi-pivot queries — e.g., `[PROJECT] ci installer system` —
the result has up to one row per distinct `(ci, installer, system)`
combination. Realistic combos for one package run to a few dozen, so
the default of 10 silently truncates the long tail.

In `run_pypinfo` we pivot by `ci × installer × system` (3 fields), so
the argv carries an explicit `--limit 500`, roughly 5x headroom over
the realistic combo ceiling of ~3 × 8 × 4 = 96. SQL `LIMIT` is applied
after aggregation, so a generous bound does not change `bytes_billed`.

If you ever extend the pivot — adding `country` or `version` —
recompute the combo ceiling and bump `--limit` accordingly.
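
One way to recompute the ceiling when the pivot changes is sketched below. The cardinalities for `ci`, `installer`, and `system` are this project's rough estimates (3, 8, 4); the `country` figure of 50 is a made-up value for illustration only, and the helper names are hypothetical.

```python
import math
from typing import Mapping


def combo_ceiling(cardinalities: Mapping[str, int]) -> int:
    """Upper bound on result rows: product of distinct values per pivot field."""
    return math.prod(cardinalities.values())


def limit_is_sufficient(limit: int, cardinalities: Mapping[str, int]) -> bool:
    """True when --limit cannot truncate the worst-case row count."""
    return limit >= combo_ceiling(cardinalities)


current_pivot = {"ci": 3, "installer": 8, "system": 4}
extended_pivot = {**current_pivot, "country": 50}  # hypothetical cardinality

if __name__ == "__main__":
    print(combo_ceiling(current_pivot), limit_is_sufficient(500, current_pivot))
    print(combo_ceiling(extended_pivot), limit_is_sufficient(500, extended_pivot))
```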

## See also

- `src/pypi_winnow_downloads/collector.py` — the live `run_pypinfo`
function with the `--limit` gotcha commented inline. The `--where`
gotcha is not commented in the code because the collector ships
per-package serial and does not pass `--where`; it is preserved here
for anyone reviving the batched path or hacking on a fork.
- `tests/test_collector.py::test_run_pypinfo_argv_passes_explicit_limit`:
regression coverage that fails if `--limit` is dropped from argv or
its value falls below 100.
- [BigQuery pricing](https://cloud.google.com/bigquery/pricing#analysis_pricing)
— the $5/TB on-demand rate referenced above.
- [pypinfo](https://github.com/ofek/pypinfo) — upstream source, where
the file/line references in this doc point.
11 changes: 11 additions & 0 deletions src/pypi_winnow_downloads/collector.py
@@ -166,12 +166,23 @@ def run_pypinfo(
    # short-circuits to a credential-setter path when --auth is present and
    # never runs the query. Use GOOGLE_APPLICATION_CREDENTIALS instead, which
    # pypinfo's core.py reads via os.environ.get on the no-flag path.
    #
    # `--limit 500` defends against pypinfo's CLI default of 10
    # (core.build_query applies `limit or DEFAULT_LIMIT`, so falsy values
    # including 0 fall back to 10; there is no "no-limit" mode).
    # The pivot is `ci x installer x system`: realistic max distinct
    # combos for one package is ~3 x 8 x 4 = 96, so 500 leaves ~5x
    # headroom for unexpected installer/system variants. SQL `LIMIT` is
    # applied after aggregation, so a generous bound does not change
    # `bytes_billed`. See docs/cost-model-and-pypinfo-gotchas.md.
    argv = [
        _resolve_pypinfo_path(),
        "--json",
        "--days",
        str(window_days),
        "--all",
        "--limit",
        "500",
        package,
        "ci",
        "installer",
35 changes: 35 additions & 0 deletions tests/test_collector.py
@@ -83,6 +83,41 @@ def fake_runner(argv: list[str], env: dict[str, str]) -> subprocess.CompletedPro
    assert argv[-3:] == ["ci", "installer", "system"], argv


def test_run_pypinfo_argv_passes_explicit_limit(tmp_path: Path) -> None:
    """pypinfo's CLI defaults to `--limit 10` and treats falsy values
    (including 0) as falling back to that default (`limit or DEFAULT_LIMIT`
    in core.build_query). With a `ci x installer x system` pivot, realistic
    distinct combos for a single package can run to several dozen rows; a
    silent LIMIT-10 truncation would undercount the hero badge by dropping
    the long tail of less-common installer/system pairs. The argv must
    therefore carry an explicit `--limit` with a value comfortably above
    the realistic ceiling. See docs/cost-model-and-pypinfo-gotchas.md.
    """
    captured: list[list[str]] = []

    def fake_runner(argv: list[str], env: dict[str, str]) -> subprocess.CompletedProcess[str]:
        captured.append(list(argv))
        return _ok_result(argv)

    creds = tmp_path / "creds.json"
    creds.write_text("{}")
    run_pypinfo("mypkg", 30, credential_file=creds, runner=fake_runner)

    argv = captured[0]
    assert "--limit" in argv, (
        "argv must carry `--limit` to defeat pypinfo's default-10 truncation; "
        "see docs/cost-model-and-pypinfo-gotchas.md gotcha 2."
    )
    limit_value = int(argv[argv.index("--limit") + 1])
    # Realistic max combos for one package is ~3 x 8 x 4 = 96, so the bound
    # must clear that with margin. 100 is the floor; the current value is
    # 500 (~5x headroom).
    assert limit_value >= 100, (
        f"--limit must be >= 100 to clear the realistic ci-by-installer-by-system "
        f"combo ceiling for one package; got {limit_value}."
    )


def _ok_rows(rows: list[dict]) -> str:
    """Helper: shape an `_ok_result` JSON payload from a list of pypinfo row dicts."""
    return json.dumps({"rows": rows})