feat: add gallery_examples dataset mapping examples to datasets #776
Add a dataset cataloging which Vega, Vega-Lite, and Altair gallery examples use which vega-datasets. Each record links a gallery example to the dataset(s) it references, with the example's name, URL, categories, and description. gallery_examples.json is a flat JSON array (399 records) registered as a Data Package v2 resource in datapackage.json.
- scripts/generate_gallery_examples.py: 526 lines, async fetching
- tests/test_generate_gallery_examples.py: 337 lines
- _data/gallery_examples.toml: source URLs
## PR Description

The `interactive_area_brush` entry in `examples.json` has `"descripton"` (missing an `i`) instead of `"description"`. Because [`create-example-pages`](https://github.com/vega/vega-lite/blob/main/scripts/create-example-pages#L36) checks `example.description`, the misspelled field is silently ignored, and the [gallery page](https://vega.github.io/vega-lite/examples/interactive_area_brush.html) renders with no description text.

The entry has a [description](https://github.com/vega/vega-lite/blob/main/site/_data/examples.json#L841), but it never renders because of the typo:

> In this example, we apply an `interval` selection to select subset of data in an area chart. The selected data is highlighted in gold by the second layer of an area mark that `filter`s its data by the `brush` selection.

The [gallery page](https://vega.github.io/vega-lite/examples/interactive_area_brush.html) currently shows no description text:

<!-- upload docs/ideation/screenshot-area-brush-cropped.png and paste image link here -->

Just for future reference, [`create-example-pages`](https://github.com/vega/vega-lite/blob/main/scripts/create-example-pages#L36-L39) first checks for a `description` field in the example's `examples.json` entry. If none is found, it falls back to the `description` field in the `.vl.json` spec file. In this case, `examples.json` has the description under a misspelled key, and the [spec file](https://github.com/vega/vega-lite/blob/main/examples/specs/interactive_area_brush.vl.json) has no description at all, so neither source provides one.

Found while working on gallery example metadata in vega/vega-datasets#776.

<details>
<summary><h2>Checklist</h2></summary>

- [x] This PR is atomic (i.e., it fixes one issue at a time).
- [x] The title is a concise [semantic commit message](https://www.conventionalcommits.org/) (e.g. "fix: correctly handle undefined properties").
- [x] `npm test` runs successfully
- For new features:
  - [ ] Has unit tests.
  - [ ] Has documentation under `site/docs/` + examples.

Tips:

- https://medium.com/@greenberg/writing-pull-requests-your-coworkers-might-enjoy-reading-9d0307e93da3 is a nice article about writing a nice PR.
- Use draft PR for work in progress PRs / when you want early feedback (https://github.blog/2019-02-14-introducing-draft-pull-requests/).

</details>
Picks up descriptions now available after vega-lite#9813 fixed the `descripton` → `description` typo in examples.json, along with other recently added descriptions from vega-lite.
Two fixes for generate_gallery_examples.py:
1. Fall back to spec-file descriptions for vega-lite examples when the gallery index has no description (mirrors vega-lite's own create-example-pages behavior). Affected ~75 examples.
2. Disable HTTP/2 on the niquests AsyncSession to prevent response body corruption caused by a multiplexing bug in urllib3-future (<2.15.903). See urllib3-future#309.
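The description fallback in fix 1 can be sketched roughly as follows. This is a minimal illustration, not the script's actual code: the function name `resolve_description` and the dict-shaped inputs are assumptions made for the example.

```python
from typing import Any, Mapping, Optional


def resolve_description(
    index_entry: Mapping[str, Any], spec: Mapping[str, Any]
) -> Optional[str]:
    """Return the gallery-index description, else the spec file's, else None.

    Hypothetical helper: the name and signature are illustrative only.
    """
    for source in (index_entry, spec):
        value = source.get("description")
        # Treat missing keys, explicit nulls, and blank strings the same way.
        if isinstance(value, str) and value.strip():
            return value.strip()
    return None
```

With this shape, an index entry that only carries the misspelled `descripton` key falls through to the spec file, and yields `None` when the spec has no description either.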
The vega-lite examples.json uses "" as the subcategory key for "Maps (Geographic Displays)" — an intentional convention (their site template skips the header when the key is empty). All three Vega ecosystem galleries keep geographic examples as a flat group; the academic literature (Heer et al., "A Tour through the Visualization Zoo", 2010) defines map subtypes but the galleries don't use them. Our script was propagating the empty string into categories, giving 12 map examples categories: [""]. Now falls back to the section name, producing categories: ["Maps (Geographic Displays)"]. Regenerated gallery_examples.json reflects this fix plus the description fallback and HTTP/2 fixes from the previous commit.
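A minimal sketch of the empty-subcategory fallback described above, assuming each example is reached through a section name and a subcategory key. The function name `categories_for` is hypothetical and the real script may structure this differently.

```python
def categories_for(section_name: str, subcategory_key: str) -> list[str]:
    """Use the subcategory key, or the section name when the key is empty.

    Hypothetical helper for illustration; not taken from the script.
    """
    label = subcategory_key.strip() or section_name
    return [label]
```

Called with the section "Maps (Geographic Displays)" and the empty key `""`, this returns the section name instead of propagating the empty string into `categories`.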
Fix TOML key ordering in pyproject.toml (pytest.ini_options). Add test coverage for concat specs, bare filename normalization, and empty-path edge case. Remove stale test_extract_datasets.py. Regenerate datapackage metadata.
Fix dead zero-examples guard using set-difference, re-raise BaseException after asyncio.gather, add Altair response canary check, validate fetched index types, and put id first in JSON output. Add 7 tests for _parse_altair_metadata and _build_vegalite_examples empty-subcategory fallback.
- Use prefix-based matching for vega-datasets URL detection instead of broad substring check that could false-positive on unrelated URLs
- Add type guard for non-string URL values in normalize_dataset_reference
- Add type guard for non-list category values in VL index parsing
- Fix title fallback to handle explicit null values from upstream
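The difference between prefix matching and a substring check can be illustrated with a short sketch. The prefixes and the function name `is_vega_datasets_url` below are assumptions chosen for the example; the script's real prefix list is not shown in this PR text.

```python
# Assumed prefixes, for illustration only; the real list may differ.
VEGA_DATASETS_PREFIXES = (
    "https://vega.github.io/vega-datasets/",
    "https://raw.githubusercontent.com/vega/vega-datasets/",
)


def is_vega_datasets_url(url: object) -> bool:
    """True only for strings that start with a known vega-datasets prefix."""
    if not isinstance(url, str):  # type guard for non-string URL values
        return False
    # str.startswith accepts a tuple and tests each prefix in turn.
    return url.startswith(VEGA_DATASETS_PREFIXES)
```

A substring test such as `"vega-datasets" in url` would also accept a URL that merely mentions the name in a query string; anchoring at the start of the string avoids that.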
Use resp.json() instead of json.loads(resp.text) for JSON responses to avoid niquests str | None type mismatch. Add assertion for Altair text response type narrowing. Regenerate gallery_examples.json with id-first field ordering from updated assign_ids.
Mirror the Vega-Lite description fallback to the Vega enrichment path. Rescues 93 descriptions that were previously null.
Replying to some part of vega/altair#3999 (comment) here:
I think this might be able to replace some/most of the work I did with heuristics in vega/altair#4002 with a more robust link between gallery examples and user guide sections. Maybe we will also be able to show a little thumbnail gallery at the bottom of each user guide section with "Related gallery examples" where readers can go to learn more.
I haven't read the code carefully, but I'm imagining that for altair specs, you could look at the generated vega-lite spec to get the same categories (which maybe is what you are doing already?). I don't know if it is too expensive to actually execute the code to retrieve the spec like that; another downside is that it would miss any altair-specific features/shortcuts, but you would avoid having to write a separate parser for the altair code.
If it would be a big help for you to have some different organization of the altair examples (and not a huge lift to do so), I would be open to us changing that as long as it doesn't affect the URLs of the individual gallery examples on the website, since people might already have them saved.
- Move gallery_examples.json into data/ so it is handled by iter_resources like every other dataset
- Remove create_gallery_examples_resource() special-case function
- Drop legacy sequential id field; use spec_url as primary key
- Fix niquests resp.text type (str | None) Pylance errors
- Add spec_url uniqueness assertion to enforce primary key invariant
- Update references in docs, tests, package.json, and TOML
The file move in 3e386b0 (gallery_examples.json -> data/) didn't regenerate src/urls.ts, so the URL index was missing the entry. Consumers reading urls['gallery_examples.json'] from the npm package would get undefined. CI didn't catch this because npm run build regenerates urls.ts rather than verifying it's up to date.
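Note on test coverage: I removed the network-gated test_full_pipeline smoke test. Until we've decided how we want to track upstream gallery drift, I'd rather have no alert than a noisy one. To keep the coverage that the smoke test provided, the two invariants that actually mattered, the spec_url primary-key check and the presence of all three galleries, are now standalone helpers (assert_unique_spec_urls and assert_expected_galleries) with direct unit tests. Happy to wire up a scheduled network check later once the gallery-alignment question is settled.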
Rename data/gallery_examples.json -> data/gallery-examples.json so the
filename matches the kebab-case convention used by every other multi-word
file under data/.
Review-driven code improvements:
- Altair extraction: fail-loud when a recognized pattern names a dataset
not in vega-datasets (likely upstream rename). Previously silent-drop.
- Vega-Lite duplicate-slug dedup: longest-wins for title/description
(fixes 3 real cases: layer_bar_labels, layer_text_heatmap, repeat_layer).
Categories now deduplicated when slugs appear in the same category twice.
- Split async_main() into run_pipeline() (no I/O) and async_main() (writes
the file), so future tests can exercise the pipeline without mutating
the tracked data file.
- Extract _fetch_text() helper; replace 5 assert statements with an
explicit RuntimeError so the checks survive python -O.
- Add isinstance guards to the Vega index loop (symmetry with Vega-Lite).
- Drop dead compute_file_hash() in build_datapackage.py.
- Drop dead BaseException re-raise loop — asyncio.gather propagates
BaseException subclasses directly on Python 3.12+.
- Log all enrichment errors (was errors[:5] truncation).
- Widen Altair docstring regex to support ''' and multi-line titles.
- Tighten vega-datasets prefix check with trailing / to prevent
false-positive matches against sibling package names.
- Rename load_config -> load_sources to match what it returns.
- Add constraints = { required = false } on description schema field.
- Cap niquests to <4 (HTTP/2 body-swap workaround is version-sensitive).
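As an illustration of the longest-wins dedup item in the list above, here is a minimal sketch. It assumes records are plain dicts using the output-schema field names (`example_name`, `description`, `categories`); the function name `merge_duplicate` is hypothetical and the script's actual merge logic may differ.

```python
def merge_duplicate(first: dict, second: dict) -> dict:
    """Merge two records for the same slug: the longer title/description wins.

    Hypothetical helper; on a tie the first-seen value is kept.
    """
    merged = dict(first)
    for field in ("example_name", "description"):
        a = first.get(field) or ""
        b = second.get(field) or ""
        # Keep None (not "") when neither record has a value.
        merged[field] = (b if len(b) > len(a) else a) or None
    # Union the categories, preserving first-seen order and dropping repeats.
    merged["categories"] = list(
        dict.fromkeys([*first.get("categories", []), *second.get("categories", [])])
    )
    return merged
```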
Drop the network-gated test_full_pipeline smoke test — it never ran in
CI anyway (addopts = "-m 'not network'") and overwriting the tracked
data file as a test side effect was a trap. Its real invariants have
been promoted into standalone helpers with direct unit tests:
- assert_unique_spec_urls (spec_url primary-key invariant)
- assert_expected_galleries (all three galleries present)
Tests: 25 unit + 1 network -> 40 unit tests.
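A sketch of what a helper like assert_unique_spec_urls might look like. Only the helper's name and purpose come from the commit message; the body below is an assumption, written with an explicit RuntimeError in line with the "survive python -O" item above.

```python
from collections import Counter


def assert_unique_spec_urls(records: list[dict]) -> None:
    """Raise RuntimeError if any spec_url appears more than once.

    Illustrative body only; the real helper's implementation is not shown here.
    """
    counts = Counter(record["spec_url"] for record in records)
    duplicates = sorted(url for url, n in counts.items() if n > 1)
    if duplicates:
        # An explicit raise, unlike `assert`, is not stripped by `python -O`.
        raise RuntimeError(f"duplicate spec_url values: {duplicates}")
```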
Catches catastrophic regressions (upstream restructuring, parser breakage) that first-seen-wins + primary-key checks miss. Floors are ~15% below current counts so normal upstream attrition doesn't trip them:
- altair: 100 (current 117, ~15% headroom)
- vega: 80 (current 93, ~14% headroom)
- vega-lite: 160 (current 189, ~15% headroom)
Missing gallery counts as zero and trips the same check, so the prior "missing galleries" branch is subsumed. Four unit tests cover the pass, one-below-floor, missing-gallery, and multi-gap paths.
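The floor check could be sketched like this. The floor values are the ones quoted above; the names `GALLERY_FLOORS`, `find_floor_gaps`, and `assert_gallery_floors` are hypothetical, and the script's actual helper may be shaped differently.

```python
from collections import Counter

# Floors quoted in the commit message; the constant name is an assumption.
GALLERY_FLOORS = {"altair": 100, "vega": 80, "vega-lite": 160}


def find_floor_gaps(records: list[dict], floors: dict = GALLERY_FLOORS) -> dict:
    """Return {gallery: actual_count} for every gallery below its floor."""
    counts = Counter(record["gallery_name"] for record in records)
    # Counter yields 0 for a missing gallery, so it trips the same check.
    return {
        name: counts[name] for name, floor in floors.items() if counts[name] < floor
    }


def assert_gallery_floors(records: list[dict]) -> None:
    """Raise RuntimeError listing every gallery that fell below its floor."""
    gaps = find_floor_gaps(records)
    if gaps:
        raise RuntimeError(f"gallery counts below floor: {gaps}")
```

Returning every gap at once, rather than stopping at the first, matches the multi-gap path the tests are said to cover.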
Summary
Adds a dataset that maps Vega, Vega-Lite, and Altair gallery examples to the vega-datasets they reference, with the example's name, URL, categories, and description.
data/gallery_examples.json is a flat JSON array (399 records) registered as a Data Package v2 resource in datapackage.json, same as every other dataset in the repo.
Ground-up rewrite of the approach in #724, reduced from ~2,500 lines to ~570, with visualization technique detection removed. The dataset focuses on the mechanically verifiable relationship: which examples load which datasets.
What's included
- scripts/generate_gallery_examples.py
- tests/test_generate_gallery_examples.py
- _data/gallery_examples.toml
- _data/datapackage_additions.toml
- scripts/build_datapackage.py (gallery_examples.json as a Data Package resource)
- data/gallery_examples.json
Output schema
    {
      "gallery_name": "altair",
      "example_name": "2D Histogram Heatmap",
      "example_url": "https://altair-viz.github.io/gallery/histogram_heatmap.html",
      "spec_url": "https://raw.githubusercontent.com/vega/altair/main/tests/...",
      "categories": ["Distributions"],
      "description": "This example shows how to make a heatmap.",
      "datasets": ["movies"]
    }
spec_url is the declared primary key.
Supersedes #724.
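For readers wondering how records of this shape might be consumed, here is a small sketch that inverts the mapping into dataset -> example URLs. The function name `examples_by_dataset` is hypothetical, and the sample reuses the single record shown in the schema (with its elided spec_url left as-is) rather than the real file.

```python
import json
from collections import defaultdict


def examples_by_dataset(records: list[dict]) -> dict[str, list[str]]:
    """Group example URLs under each dataset name they reference."""
    index: dict[str, list[str]] = defaultdict(list)
    for record in records:
        for dataset in record["datasets"]:
            index[dataset].append(record["example_url"])
    return dict(index)


# Sample input: the one record from the schema above, not the full dataset.
sample = json.loads(
    """
    [{
      "gallery_name": "altair",
      "example_name": "2D Histogram Heatmap",
      "example_url": "https://altair-viz.github.io/gallery/histogram_heatmap.html",
      "spec_url": "https://raw.githubusercontent.com/vega/altair/main/tests/...",
      "categories": ["Distributions"],
      "description": "This example shows how to make a heatmap.",
      "datasets": ["movies"]
    }]
    """
)
by_dataset = examples_by_dataset(sample)
```

In practice the input would come from reading the generated JSON file with `json.load`.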