
feat: add gallery_examples dataset mapping examples to datasets #776

Draft
dsmedia wants to merge 16 commits into vega:main from dsmedia:feat/rewrite-gallery-examples

Conversation

@dsmedia
Collaborator

@dsmedia dsmedia commented Apr 7, 2026

Summary

Adds a dataset that maps Vega, Vega-Lite, and Altair gallery examples to the vega-datasets they reference, with the example's name, URL, categories, and description.

data/gallery_examples.json is a flat JSON array (399 records) registered as a Data Package v2 resource in datapackage.json, same as every other dataset in the repo.

Ground-up rewrite of the approach in #724 — reduced from ~2,500 lines to ~570, with visualization technique detection removed. The dataset focuses on the mechanically verifiable relationship: which examples load which datasets.

What's included

| File | Purpose |
| --- | --- |
| scripts/generate_gallery_examples.py | Async scraper (~45s runtime) |
| tests/test_generate_gallery_examples.py | 36 unit tests |
| _data/gallery_examples.toml | Source URLs |
| _data/datapackage_additions.toml | Resource metadata (description, schema, sources, license) |
| scripts/build_datapackage.py | Registers gallery_examples.json as a Data Package resource |
| data/gallery_examples.json | Generated output (399 examples across 3 galleries) |

Output schema

{
  "gallery_name": "altair",
  "example_name": "2D Histogram Heatmap",
  "example_url": "https://altair-viz.github.io/gallery/histogram_heatmap.html",
  "spec_url": "https://raw.githubusercontent.com/vega/altair/main/tests/...",
  "categories": ["Distributions"],
  "description": "This example shows how to make a heatmap.",
  "datasets": ["movies"]
}

spec_url is the declared primary key.
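
As a rough illustration of how a consumer might use this shape, the sketch below inverts the example-to-datasets relation into a dataset-to-examples index. The two records, their `example.invalid` URLs, and the helper name `examples_by_dataset` are made up for the example; real records would be loaded from the generated JSON file.

```python
from collections import defaultdict

# Two made-up records in the shape shown above (URLs are placeholders).
records = [
    {
        "gallery_name": "altair",
        "example_name": "2D Histogram Heatmap",
        "spec_url": "https://example.invalid/altair/histogram_heatmap.py",
        "datasets": ["movies"],
    },
    {
        "gallery_name": "vega-lite",
        "example_name": "Hypothetical Scatterplot",
        "spec_url": "https://example.invalid/vega-lite/scatter.vl.json",
        "datasets": ["cars", "movies"],
    },
]


def examples_by_dataset(records):
    """Invert example -> datasets into dataset -> list of spec_url keys."""
    index = defaultdict(list)
    for record in records:
        for dataset in record["datasets"]:
            index[dataset].append(record["spec_url"])
    return dict(index)


index = examples_by_dataset(records)
print(sorted(index))         # dataset names referenced by at least one example
print(len(index["movies"]))  # number of examples that load "movies"
```

Keying the index on `spec_url` rather than `example_name` follows the primary-key choice above, since names are not guaranteed to be unique across galleries.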

Supersedes #724.

dsmedia added 2 commits April 7, 2026 04:42
Add a dataset cataloging which Vega, Vega-Lite, and Altair gallery
examples use which vega-datasets. Each record links a gallery example
to the dataset(s) it references, with the example's name, URL,
categories, and description.

gallery_examples.json is a flat JSON array (399 records) registered as
a Data Package v2 resource in datapackage.json.

- scripts/generate_gallery_examples.py — 526 lines, async fetching
- tests/test_generate_gallery_examples.py — 337 lines
- _data/gallery_examples.toml — source URLs
domoritz pushed a commit to vega/vega-lite that referenced this pull request Apr 7, 2026
## PR Description

The `interactive_area_brush` entry in `examples.json` has `"descripton"`
(missing an `i`) instead of `"description"`. Because
[`create-example-pages`](https://github.com/vega/vega-lite/blob/main/scripts/create-example-pages#L36)
checks `example.description`, the misspelled field is silently ignored,
and the [gallery
page](https://vega.github.io/vega-lite/examples/interactive_area_brush.html)
renders with no description text.

The entry has a
[description](https://github.com/vega/vega-lite/blob/main/site/_data/examples.json#L841),
but it never renders because of the typo:

> In this example, we apply an `interval` selection to select subset of
data in an area chart. The selected data is highlighted in gold by the
second layer of an area mark that `filter`s its data by the `brush`
selection.

The [gallery
page](https://vega.github.io/vega-lite/examples/interactive_area_brush.html)
currently shows no description text:

<!-- upload docs/ideation/screenshot-area-brush-cropped.png and paste
image link here -->

Just for future reference,
[`create-example-pages`](https://github.com/vega/vega-lite/blob/main/scripts/create-example-pages#L36-L39)
first checks for a `description` field in the example's `examples.json`
entry. If none is found, it falls back to the `description` field in the
`.vl.json` spec file. In this case, `examples.json` has the description
under a misspelled key, and the [spec
file](https://github.com/vega/vega-lite/blob/main/examples/specs/interactive_area_brush.vl.json)
has no description at all — so neither source provides one.

Found while working on gallery example metadata in
vega/vega-datasets#776.

<details>
  <summary><h2>Checklist</h2></summary>

- [x] This PR is atomic (i.e., it fixes one issue at a time).
- [x] The title is a concise [semantic commit
message](https://www.conventionalcommits.org/) (e.g. "fix: correctly
handle undefined properties").
- [x] `npm test` runs successfully
- For new features:
  - [ ] Has unit tests.
  - [ ] Has documentation under `site/docs/` + examples.

Tips:

-
https://medium.com/@greenberg/writing-pull-requests-your-coworkers-might-enjoy-reading-9d0307e93da3
is a nice article about writing a nice PR.
- Use draft PR for work in progress PRs / when you want early feedback
(https://github.blog/2019-02-14-introducing-draft-pull-requests/).
</details>
dsmedia added 6 commits April 7, 2026 19:45
Picks up descriptions now available after vega-lite#9813 fixed the
`descripton` → `description` typo in examples.json, along with other
recently added descriptions from vega-lite.

Two fixes for generate_gallery_examples.py:

1. Fall back to spec-file descriptions for vega-lite examples when
   the gallery index has no description (mirrors vega-lite's own
   create-example-pages behavior). Affected ~75 examples.

2. Disable HTTP/2 on niquests AsyncSession to prevent response body
   corruption caused by a multiplexing bug in urllib3-future (<2.15.903).
   See urllib3-future#309.
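
The description fallback in item 1 might look roughly like the sketch below. This is not the script's actual code: `resolve_description` is a hypothetical helper, and the spec description text is invented. It also shows why a misspelled `descripton` key falls through silently.

```python
def resolve_description(index_entry, spec):
    """Prefer the gallery index description, else the spec file's own."""
    for source in (index_entry, spec):
        value = source.get("description") if isinstance(source, dict) else None
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None


# The misspelled key is never consulted, so the lookup falls through.
index_entry = {"name": "interactive_area_brush", "descripton": "In this example..."}
spec_with_text = {"description": "Area chart with an interval brush."}  # invented text
spec_without_text = {"mark": "area"}

print(resolve_description(index_entry, spec_with_text))
print(resolve_description(index_entry, spec_without_text))
```

Returning `None` when neither source has usable text keeps "no description" distinct from an empty string in the output.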
The vega-lite examples.json uses "" as the subcategory key for
"Maps (Geographic Displays)" — an intentional convention (their site
template skips the header when the key is empty). All three Vega
ecosystem galleries keep geographic examples as a flat group; the
academic literature (Heer et al., "A Tour through the Visualization
Zoo", 2010) defines map subtypes but the galleries don't use them.

Our script was propagating the empty string into categories, giving
12 map examples categories: [""]. Now falls back to the section name,
producing categories: ["Maps (Geographic Displays)"].
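
A minimal sketch of that fallback, assuming the index is shaped as `{section: {subcategory: [example, ...]}}`. The helper name `category_for` and the sample entries are illustrative, not taken from the script.

```python
def category_for(section, subcategory):
    """Return the subcategory, or the section name when the key is empty."""
    if isinstance(subcategory, str) and subcategory.strip():
        return subcategory
    return section


# Assumed index shape: {section: {subcategory: [example, ...]}}.
index = {
    "Single-View Plots": {"Bar Charts": [{"name": "bar"}]},
    "Maps (Geographic Displays)": {"": [{"name": "geo_choropleth"}]},
}

categories = {
    example["name"]: [category_for(section, subcategory)]
    for section, groups in index.items()
    for subcategory, examples in groups.items()
    for example in examples
}
print(categories)
```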

Regenerated gallery_examples.json reflects this fix plus the
description fallback and HTTP/2 fixes from the previous commit.

Fix TOML key ordering in pyproject.toml (pytest.ini_options).
Add test coverage for concat specs, bare filename normalization,
and empty-path edge case. Remove stale test_extract_datasets.py.
Regenerate datapackage metadata.

Fix dead zero-examples guard using set-difference, re-raise
BaseException after asyncio.gather, add Altair response canary
check, validate fetched index types, and put id first in JSON
output. Add 7 tests for _parse_altair_metadata and
_build_vegalite_examples empty-subcategory fallback.

- Use prefix-based matching for vega-datasets URL detection instead of
  broad substring check that could false-positive on unrelated URLs
- Add type guard for non-string URL values in normalize_dataset_reference
- Add type guard for non-list category values in VL index parsing
- Fix title fallback to handle explicit null values from upstream
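
To illustrate the first two bullets, here is a sketch of prefix-based detection with a type guard. The function `dataset_name_from_url` is a hypothetical stand-in for `normalize_dataset_reference`, and the prefix list is an assumption rather than the set the script actually recognizes.

```python
from posixpath import basename, splitext

# Assumed prefixes. Ending each one in "/" or "@" keeps similarly named
# paths or packages from matching by accident.
DATASET_PREFIXES = (
    "https://cdn.jsdelivr.net/npm/vega-datasets@",
    "https://vega.github.io/vega-datasets/",
    "data/",
)


def dataset_name_from_url(url):
    """Return the dataset name for a recognized reference, else None."""
    if not isinstance(url, str):  # type guard: specs may hold null or objects
        return None
    if not url.startswith(DATASET_PREFIXES):
        return None
    stem, _extension = splitext(basename(url))
    return stem or None  # a bare prefix such as "data/" has no file name


print(dataset_name_from_url("data/cars.json"))
print(dataset_name_from_url("https://example.com/about/vega-datasets.html"))
```

A substring check would have accepted the second URL because it contains `vega-datasets`; anchoring the match at the start of the string rejects it.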
dsmedia added 2 commits April 9, 2026 22:18
Use resp.json() instead of json.loads(resp.text) for JSON responses
to avoid niquests str | None type mismatch. Add assertion for Altair
text response type narrowing. Regenerate gallery_examples.json with
id-first field ordering from updated assign_ids.

Mirror the Vega-Lite description fallback to the Vega enrichment
path. Rescues 93 descriptions that were previously null.
@joelostblom

Replying to some part of vega/altair#3999 (comment) here:

> Having them indexed opens up things like spotting coverage gaps (which features lack gallery examples?), identifying where new datasets could better showcase capabilities, and giving tools a way to surface relevant examples in context.

I think this might be able to replace some/most of the work I did with heuristics in vega/altair#4002 with a more robust link between gallery examples and user guide sections. Maybe we will also be able to show a little thumbnail gallery at the bottom of each user guide section with "Related gallery examples" where readers can go to learn more.

> For Vega and Vega-Lite the ingestion is straightforward: structured JSON indexes and specs that can be walked mechanically.

I haven't read the code carefully, but I'm imagining that for altair specs, you could look at the generated vega-lite spec to get the same categories (which maybe is what you are doing already?). I don't know if it is too expensive to actually execute the code to retrieve the spec like that; another downside is that it would miss any altair-specific features/shortcuts, but you would avoid having to write a separate parser for the altair code.

> Altair is the trickiest leg since I don't believe there's a structured index of examples: it discovers examples by listing tests/examples_methods_syntax/ via the GitHub API, then pattern-matches dataset references and parses metadata from docstrings and # category: comments in the source.

If it would be a big help for you to have some different organization of the Altair examples (and not a huge lift to do so), I would be open to us changing that, as long as it doesn't affect the URLs of the individual gallery examples on the website, since people might already have them bookmarked.
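
For readers following along, the source-parsing approach described for the Altair leg could be sketched as below. The three regular expressions, the helper name `parse_altair_example`, and the sample source are all assumptions for illustration; the script's real patterns may differ.

```python
import re

# Assumed patterns: a leading docstring (""" or ''') whose first line is the
# title, a "# category: ..." comment, and "data.<name>" dataset accessors.
DOCSTRING_RE = re.compile(r'^\s*(?P<q>"""|\'\'\')(?P<body>.*?)(?P=q)', re.DOTALL)
CATEGORY_RE = re.compile(r"^#\s*category:\s*(?P<category>.+?)\s*$", re.MULTILINE)
DATASET_RE = re.compile(r"\bdata\.(?P<name>[A-Za-z_]\w*)(?:\.url\b|\()")


def parse_altair_example(source):
    """Extract title, category, and dataset names from example source text."""
    title = None
    docstring = DOCSTRING_RE.search(source)
    if docstring:
        lines = [line.strip() for line in docstring.group("body").strip().splitlines()]
        title = lines[0] if lines and lines[0] else None
    category = CATEGORY_RE.search(source)
    datasets = sorted({match.group("name") for match in DATASET_RE.finditer(source)})
    return {
        "title": title,
        "category": category.group("category") if category else None,
        "datasets": datasets,
    }


# Made-up example source in the style of an Altair gallery script.
sample = '''"""
2D Histogram Heatmap
--------------------
This example shows how to make a heatmap.
"""
# category: distributions
import altair as alt
from vega_datasets import data

source = data.movies.url
'''
print(parse_altair_example(sample))
```

This avoids executing the example code, at the cost of missing dataset references that do not follow the assumed `data.<name>` pattern.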

@dsmedia dsmedia force-pushed the feat/rewrite-gallery-examples branch from ba2d704 to 76d3f82 Compare April 11, 2026 21:40
dsmedia added 3 commits April 12, 2026 04:46
- Move gallery_examples.json into data/ so it is handled by
  iter_resources like every other dataset
- Remove create_gallery_examples_resource() special-case function
- Drop legacy sequential id field; use spec_url as primary key
- Fix niquests resp.text type (str | None) Pylance errors
- Add spec_url uniqueness assertion to enforce primary key invariant
- Update references in docs, tests, package.json, and TOML

The file move in 3e386b0 (gallery_examples.json -> data/) didn't
regenerate src/urls.ts, so the URL index was missing the entry.
Consumers reading urls['gallery_examples.json'] from the npm package
would get undefined.

CI didn't catch this because npm run build regenerates urls.ts rather
than verifying it's up to date.
@dsmedia
Collaborator Author

dsmedia commented Apr 12, 2026

Note on test coverage: I removed the network-gated test_full_pipeline smoke test from this branch rather than adding a scheduled CI job that would run it. The smoke test asserted end-to-end invariants against the live upstream galleries, but running it on a schedule here would start sending failure alerts whenever Vega, Vega-Lite, or Altair restructures their example galleries — and those restructurings may prompt deliberate changes on our side first.

Until we've decided how we want to track upstream gallery drift, I'd rather have no alert than a noisy one. To keep the coverage that the smoke test provided, the two invariants that actually mattered — the spec_url primary-key uniqueness check and the "all three galleries present" check — have been promoted into standalone functions (assert_unique_spec_urls, assert_expected_galleries) and the primary-key one now has direct unit tests. The pipeline itself is split into run_pipeline() (no I/O) and async_main() (writes the file), so adding a future network test without the side effect of overwriting data/gallery_examples.json is a one-line call to run_pipeline().
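
A minimal sketch of the primary-key invariant and the no-I/O split described above. `check_unique_spec_urls` and `build_records` are hypothetical stand-ins for `assert_unique_spec_urls` and `run_pipeline()`; the exception type and the sample records are assumptions.

```python
from collections import Counter


def check_unique_spec_urls(records):
    """Raise RuntimeError if any spec_url appears more than once."""
    counts = Counter(record["spec_url"] for record in records)
    duplicates = sorted(url for url, seen in counts.items() if seen > 1)
    if duplicates:
        # An explicit raise, unlike a bare assert, still runs under python -O.
        raise RuntimeError(f"duplicate spec_url values: {duplicates}")


def build_records():
    """Hypothetical stand-in for a pipeline step that performs no file I/O."""
    records = [
        {"gallery_name": "vega", "spec_url": "https://example.invalid/a.vg.json"},
        {"gallery_name": "altair", "spec_url": "https://example.invalid/b.py"},
    ]
    check_unique_spec_urls(records)
    return records


records = build_records()  # nothing is written to disk
print(len(records))
```

Keeping the file write in a separate entry point means a test can call the record-building step and inspect the result without touching the tracked data file.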

Happy to wire up a scheduled network check later once the gallery-alignment question is settled.

dsmedia added 3 commits April 12, 2026 12:06
Rename data/gallery_examples.json -> data/gallery-examples.json so the
filename matches the kebab-case convention used by every other multi-word
file under data/.

Review-driven code improvements:
- Altair extraction: fail-loud when a recognized pattern names a dataset
  not in vega-datasets (likely upstream rename). Previously silent-drop.
- Vega-Lite duplicate-slug dedup: longest-wins for title/description
  (fixes 3 real cases: layer_bar_labels, layer_text_heatmap, repeat_layer).
  Categories now deduplicated when slugs appear in the same category twice.
- Split async_main() into run_pipeline() (no I/O) and async_main() (writes
  the file), so future tests can exercise the pipeline without mutating
  the tracked data file.
- Extract _fetch_text() helper; replace 5 assert statements with an
  explicit RuntimeError so the checks survive python -O.
- Add isinstance guards to the Vega index loop (symmetry with Vega-Lite).
- Drop dead compute_file_hash() in build_datapackage.py.
- Drop dead BaseException re-raise loop — asyncio.gather propagates
  BaseException subclasses directly on Python 3.12+.
- Log all enrichment errors (was errors[:5] truncation).
- Widen Altair docstring regex to support ''' and multi-line titles.
- Tighten vega-datasets prefix check with trailing / to prevent
  false-positive matches against sibling package names.
- Rename load_config -> load_sources to match what it returns.
- Add constraints = { required = false } on description schema field.
- Cap niquests to <4 (HTTP/2 body-swap workaround is version-sensitive).
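
The longest-wins dedup in the second bullet could be sketched as follows. `merge_duplicate` is a hypothetical helper, and the titles, description, and category names in the sample entries are invented for illustration.

```python
def merge_duplicate(existing, incoming):
    """Merge two index entries that share the same slug."""
    merged = dict(existing)
    for field in ("title", "description"):
        # Longest-wins: keep whichever string value carries more text.
        candidates = [
            value
            for value in (existing.get(field), incoming.get(field))
            if isinstance(value, str)
        ]
        merged[field] = max(candidates, key=len, default=None)
    # Order-preserving dedup of the combined category lists.
    combined = existing.get("categories", []) + incoming.get("categories", [])
    merged["categories"] = list(dict.fromkeys(combined))
    return merged


first = {
    "slug": "layer_bar_labels",
    "title": "Bar Chart with Labels",
    "description": None,
    "categories": ["Bar Charts"],
}
second = {
    "slug": "layer_bar_labels",
    "title": "Bar Chart with Labels (Layered)",
    "description": "Invented longer description.",
    "categories": ["Bar Charts", "Labeling"],
}
print(merge_duplicate(first, second))
```

On a tie in length, `max` returns the first candidate, so the earlier entry wins and the result stays deterministic.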

Drop the network-gated test_full_pipeline smoke test — it never ran in
CI anyway (addopts = "-m 'not network'") and overwriting the tracked
data file as a test side effect was a trap. Its real invariants have
been promoted into standalone helpers with direct unit tests:
- assert_unique_spec_urls (spec_url primary-key invariant)
- assert_expected_galleries (all three galleries present)

Tests: 25 unit + 1 network -> 40 unit tests.

Catches catastrophic regressions (upstream restructuring, parser breakage)
that first-seen-wins + primary-key checks miss. Floors are ~15% below
current counts so normal upstream attrition doesn't trip them:

- altair: 100 (current 117, ~15% headroom)
- vega: 80 (current 93, ~14% headroom)
- vega-lite: 160 (current 189, ~15% headroom)

Missing gallery counts as zero and trips the same check, so the prior
"missing galleries" branch is subsumed. Four unit tests cover the pass,
one-below-floor, missing-gallery, and multi-gap paths.
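
The floor check above can be sketched as below, using the floors quoted in the commit message. The helper names `find_floor_gaps` and `check_gallery_floors`, the exception type, and the synthetic records are assumptions, not the script's actual code.

```python
from collections import Counter

# Floors quoted in the commit message above.
GALLERY_FLOORS = {"altair": 100, "vega": 80, "vega-lite": 160}


def find_floor_gaps(records, floors=GALLERY_FLOORS):
    """Return {gallery: (count, floor)} for every gallery below its floor."""
    counts = Counter(record["gallery_name"] for record in records)
    # Counter yields 0 for absent keys, so a missing gallery trips the check.
    return {
        gallery: (counts[gallery], floor)
        for gallery, floor in floors.items()
        if counts[gallery] < floor
    }


def check_gallery_floors(records, floors=GALLERY_FLOORS):
    """Raise RuntimeError listing every gallery that fell below its floor."""
    gaps = find_floor_gaps(records, floors)
    if gaps:
        details = ", ".join(
            f"{gallery}: {count} < {floor}"
            for gallery, (count, floor) in sorted(gaps.items())
        )
        raise RuntimeError(f"gallery counts below floor: {details}")


# Synthetic records: the vega gallery is missing entirely.
records = [{"gallery_name": "altair"}] * 117 + [{"gallery_name": "vega-lite"}] * 189
print(find_floor_gaps(records))
```

Collecting all gaps before raising reports every failing gallery in one message, which matches the multi-gap path mentioned above.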