Conversation


@dsmedia dsmedia commented Dec 8, 2024

Resolves #634

Tasks

  • Add SOURCES.toml to provide supplemental (extrinsic) metadata on datasets, from SOURCES.md, in a form usable by build_datapackage.py (a sketch of the intended shape follows this list)
  • Include resource descriptions, sources, and licenses to supplement script output
  • Preserve existing markdown content for future documentation
  • Remove duplicated content between descriptions and sources
  • Incorporate resource-level column descriptions into table schema, where available
  • Migrate license link, where available, into [[resources.licenses]]
  • Determine if root-level $schema property should be specified in the TOML file with the value "https://datapackage.org/profiles/2.0/datapackage.json" per Frictionless Data guidelines
  • Rename datapackage-tabular.md -> datapackage.md (comment)
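
For orientation, here is a minimal sketch of the kind of entry these tasks describe. The resource name, source, and license values below are invented placeholders, not actual repository content:

```toml
# Hypothetical example of a supplemental metadata entry; all values are placeholders.
# The root-level "$schema" key (last task above) would sit at the top of the file:
# "$schema" = "https://datapackage.org/profiles/2.0/datapackage.json"

[[resources]] # Path: example.csv
path = "example.csv"
description = "Context about the dataset that build_datapackage.py cannot infer on its own."

[[resources.sources]]
title = "Original publisher of the dataset"
path = "https://example.org/original-source"

[[resources.licenses]]
title = "License of the original dataset"
path = "https://example.org/license"
```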

@dsmedia dsmedia marked this pull request as ready for review December 8, 2024 18:36
@dsmedia dsmedia marked this pull request as draft December 8, 2024 18:37

dsmedia commented Dec 10, 2024

I'm currently revising to remove the duplicated sources from the descriptions.


dsmedia commented Dec 10, 2024

Completed in commit 42b5b25:

  • Remove duplicated content between descriptions and sources
  • Incorporate resource-level column descriptions into table schema
  • Migrate license link, where available, into [[resources.licenses]]

The root-level $schema property is optional in the Frictionless spec and isn't essential at this stage, so I'm marking this as ready for review.

@dsmedia dsmedia marked this pull request as ready for review December 10, 2024 02:34
@dsmedia dsmedia requested a review from dangotbanned December 10, 2024 02:36
@dangotbanned (Member)

@dsmedia I wanted to get your opinion on formatting the (longer) descriptions like this?

Note

I'm not sure why GitHub doesn't understand multi-line strings, but they are valid toml.

I found it pretty difficult to read some of these descriptions with how long the lines were.
We have multiple ways to format these, depending on if we actually want to keep the whitespace.

With these suggestions added, the longest line is still 1280 characters - so I'd want to make changes like this more broadly.
Ideally aiming for 80-100 characters per line. See pep-0008/#maximum-line-length if this number is new to you
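
As a rough illustration (with hypothetical description text), TOML multi-line basic strings can either strip or keep the wrapped line breaks:

```toml
# A trailing backslash strips the newline and the indentation that follows it,
# so the wrapped source still produces a single-line value.
description = """\
    A longer description wrapped at roughly 80-100 characters in the source \
    file, but stored without the line breaks.\
    """

# Without the backslashes, the line breaks become part of the stored value.
description_with_newlines = """
A longer description where the
line breaks are kept in the value.
"""
```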

@dangotbanned (Member)

Overall, really great work getting everything translated over from markdown @dsmedia!

@domoritz domoritz left a comment

Looks good to me (pending other comments).

Let's delete SOURCES.md as part of this pull request and update the references in the pull request template and readme.

@@ -0,0 +1,686 @@
[[resources]] # Path: 7zip.png
path = "7zip.png"
Member

I'm not a fan of aligned =.

Suggested change
- path          = "7zip.png"
+ path = "7zip.png"

is simpler and easier to edit

Member

@domoritz having align_entries=false is fine with me.

If you've got any other preferences, maybe we could add those in a taplo.toml?

Member

If there is an automatic tool (ruff?) I am okay either way btw. I just don't want to manually align stuff.


Member

@domoritz
I should've explained this better in the description for (1ea2812)

That commit uses automatic formatting via taplo.
The two non-default configuration options applied were:

align_entries=true

Align entries vertically. Entries that have table headers, comments, or blank lines between them are not aligned.

allowed_blank_lines=1

The maximum amount of consecutive blank lines allowed.
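
For reference, a taplo.toml carrying those two options might look roughly like this (a sketch only; the exact configuration file is not part of this PR):

```toml
# Sketch of a taplo.toml with the two non-default options described above.
[formatting]
align_entries = true       # align_entries = false was also suggested above
allowed_blank_lines = 1
```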


taplo can be used via its CLI and as a VSCode extension (Even Better TOML)

I haven't used it in CI before, but it is available on PyPI - so we could run it with uvx?
https://github.com/tamasfe/taplo/blob/a8bc571ee28775e7d5ad84c2ea87cf5b61ab42f5/.github/workflows/releases.yaml#L503-L504

NPM is another option for install

Member

If we have a formatter, it should run in CI as well.

Member

@domoritz I've opened #646 for this


dsmedia commented Dec 12, 2024

Let's delete SOURCES.md as part of this pull request and update the references in the pull request template and readme.

@dangotbanned If I understand the process correctly, SOURCES.toml is designed to contain only the information not captured (or captured incorrectly) by the datapackage script. In this sense, I'm not sure we can simply replace the references to SOURCES.md with SOURCES.toml, since the toml file won't contain the complete source information. Ideally, I think, the script could, after generating datapackage.json, convert that JSON into a human-readable SOURCES.md that synthesizes the script-generated metadata with the hard-coded metadata from SOURCES.toml.

Here is a related repo

This is a command-line tool to extract the resources within a Frictionless Data Package into a variety of formats such as Markdown, HTML, CSV, etc.

I think this was in the original due to everything being in a bullet list.
No longer relevant now
Seems to be parsed fine, but confused the language server in vscode
@dangotbanned dangotbanned marked this pull request as ready for review December 13, 2024 21:36
@dangotbanned (Member)

I think you can see the schema from the script instead of the toml, but I don't have strong feelings.

@domoritz are you talking about this point?

I don't think we need this at all; since we're using the framework directly, there are options under Validating Data if we want to do this.

@domoritz (Member)

Yep. Great if we don't need it.

@dangotbanned dangotbanned left a comment

Happy to merge after feedback on (#643 (comment))

Thanks for working on this @dsmedia 😄


dsmedia commented Dec 14, 2024

One slight concern on the licensing section of the json/md.

{
  "name": "vega-datasets",
  "description": "Common repository for example datasets used by Vega related projects.",
  "homepage": "https://github.com/vega/vega-datasets.git",
  "licenses": [
    {
      "name": "BSD-3-Clause",
      "path": "https://opensource.org/license/bsd-3-clause",
      "title": "The 3-Clause BSD License"
    }

The licensing section could benefit from clarification regarding the distinction between package code and dataset licenses. Here is how it's handled in the European data portal:

Most of the resources published display a specific reference to the licence under which the owner has chosen to release them.

For the resources without licence information, users must consult the licence conditions in the original portal where the resources were initially published.

We could strengthen this further with explicit guidance:

"BSD-3-Clause license applies to package code and infrastructure. Users are solely responsible for ensuring their use of datasets complies with the license terms of the original sources where the datasets were published. Dataset license information that may be included here is provided as a reference starting point only and without any warranty of accuracy or completeness."

This addition would clearly communicate the dual nature of licensing (code vs datasets) to users. The clarification could be added either in the package license title or in the package description. While it might also fit in the repo README.md, having this in datapackage files ensures the licensing clarity travels with the dataset metadata itself.

@dangotbanned (Member)

#643 (comment)

datapackage Docs

Package

https://datapackage.org/standard/data-package/#licenses

The license(s) under which the package is provided.

Caution

This property is not legally binding and does not guarantee the package is licensed under the terms defined in this property.

licenses MUST be an array. Each item in the array is a License. Each MUST be an object. The object MUST contain a name property and/or a path property, and it MAY contain a title property:

name: A string containing an Open Definition license ID
path: A URL or Path, that is a fully qualified HTTP address, or a relative POSIX path.
title: A string containing human-readable title.

Resource

https://datapackage.org/standard/data-resource/#licenses

List of licenses as for Data Package. If not specified the resource inherits from the data package.


@dsmedia my read on this is that you'd need to raise it over at https://github.com/frictionlessdata/datapackage/issues
The distinction between package and resource licenses is, I think, already covered.

However, inheriting the license from the package seems like the wrong move for our situation.

Also, there isn't any room in the spec for adding context (e.g. a description) per-license.
Your proposal would require at least one more field - which seems reasonable to me personally


dsmedia commented Dec 14, 2024

How about within the datapackage description itself?

description
A description of the package. The description MUST be markdown formatted — this also allows for simple plain text as plain text is itself valid markdown. The first paragraph (up to the first double line break) SHOULD be usable as summary information for the package.

This is the top-level description property in the spec root object

{
  "title": "Data Package",
  "description": "Data Package", 
  "type": "object",

}

@dangotbanned (Member)

#643 (comment)

Sure @dsmedia, that seems like a good place where you could add a disclaimer.

Feel free to add 🙌
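
A sketch of how that disclaimer could sit in the top-level description of the TOML additions file (wording taken from the proposal above; the exact key layout in the repository may differ):

```toml
# Illustrative only: top-level description carrying the proposed disclaimer.
description = """\
    Common repository for example datasets used by Vega related projects. \
    BSD-3-Clause license applies to package code and infrastructure. Users are \
    solely responsible for ensuring their use of datasets complies with the \
    license terms of the original sources.\
    """
```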

dsmedia and others added 2 commits December 14, 2024 16:26
Top-level description included in datapackage_additions.toml overrides the value pulled in from package.json.
Warning was emitted following (5ce07d5)
@dangotbanned dangotbanned changed the title feat: Add datapackage_additions.toml for dataset metadata feat: replace SOURCES.md with datapackage.md Dec 14, 2024
@dangotbanned (Member)

Hope the title change in (#643 (comment)) makes sense to others.

I think this is more descriptive of the larger change this PR will have

@dangotbanned (Member)

Thanks for all the work you put into this @dsmedia, merging now

@dangotbanned dangotbanned merged commit 5eaa256 into vega:main Dec 15, 2024
2 checks passed
dangotbanned added a commit to vega/altair that referenced this pull request Dec 22, 2024
- Adds some incomplete types for fields (`sources`, `licenses`)
- Misc changes from vega/vega-datasets#651, vega/vega-datasets#643
mattijn added a commit to vega/altair that referenced this pull request Jul 11, 2025
* feat: Adds `.arrow` support

To support [flights-200k.arrow](https://github.com/vega/vega-datasets/blob/f637f85f6a16f4b551b9e2eb669599cc21d77e69/data/flights-200k.arrow)

* feat: Add support for caching metadata

* feat: Support env var `VEGA_GITHUB_TOKEN`

Not required for these requests, but may be helpful to avoid limits

* feat: Add support for multi-version metadata

As an example, for comparing against the most recent I've added the 5 most recent

* refactor: Renaming, docs, reorganize

* feat: Support collecting release tags

See https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28#list-repository-tags

* feat: Adds `refresh_tags`

- Basic mechanism for discovering new versions
- Tries to minimise number of and total size of requests

* feat(DRAFT): Adds `url_from`

Experimenting with querying the url cache w/ expressions

* fix: Wrap all requests with auth

* chore: Remove `DATASET_NAMES_USED`

* feat: Major `GitHub` rewrite, handle rate limiting

- `metadata_full.parquet` stores **all known** file metadata
- `GitHub.refresh()` to maintain integrity in a safe manner
- Roughly 3000 rows
- Single release: **9kb** vs 46 releases: **21kb**

* feat(DRAFT): Partial implement `data("name")`

* fix(typing): Resolve some `mypy` errors

* fix(ruff): Apply `3.8` fixes

https://github.com/vega/altair/actions/runs/11495437283/job/31994955413

* docs(typing): Add `WorkInProgress` marker to `data(...)`

- Still undecided exactly how this functionality should work
- Need to resolve `npm` tags != `gh` tags issue as well

* feat(DRAFT): Add a source for available `npm` versions

* refactor: Bake `"v"` prefix into `tags_npm`

* refactor: Move `_npm_metadata` into a class

* chore: Remove unused, add todo

* feat: Adds `app` context for github<->npm

* fix: Invalidate old trees

* chore: Remove early test files

* refactor: Rename `metadata_full` -> `metadata`

Suffix was only added due to *now-removed* test files

* refactor: `tools.vendor_datasets` -> `tools.datasets` package

Will be following up with some more splitting into composite modules

* refactor: Move `TypedDict`, `NamedTuple`(s) -> `datasets.models`

* refactor: Move, rename `semver`-related tools

* refactor: Remove `write_schema` from `_Npm`, `_GitHub`

Handled in `Application` now

* refactor: Rename, split `_Npm`, `_GitHub` into own modules

`tools.datasets.npm` will later be performing the requests that are in `Dataset.__call__` currently

* refactor: Move `DataLoader.__call__` -> `DataLoader.url()`

- `data.name()` -> `data(name)`
- `data.name.url` -> `data.url(name)`

* feat(typing): Generate annotations based on known datasets

* refactor(typing): Utilize `datasets._typing`

* feat: Adds `Npm.dataset` for remote reading

* refactor: Remove dead code

* refactor: Replace `name_js`, `name_py` with `dataset_name`

Since we're just using strings, there is no need for 2 forms of the name.
The legacy package needed this for `__getattr__` access with valid identifiers

* fix: Remove invalid `semver.sort` op

I think this was added in error, since the schema of the file never had `semver` columns

Only noticed the bug when doing a full rebuild

* fix: Add missing init path for `refresh_trees`

* refactor: Move public interface to `_io`

Temporary home, see module docstring

* refactor(perf): Don't recreate path mapping on every attribute access

* refactor: Split `Reader._url_from` into `url`, `_query`

- Much more generic now in what it can be used for
- For the caching, I'll need more columns than just `"url_npm"`
- `"url_github"` contains a hash

* feat(DRAFT): Adds `GitHubUrl.BLOBS`

- Common prefix to all rows in `metadata[url_github]`
- Stripping this leaves only `sha`
- For **2800** rows, there are only **109** unique hashes, so these can be used to reduce cache size

* feat: Store `sha` instead of `github_url`

Related 661a385

* feat(perf): Adds caching to `ALTAIR_DATASETS_DIR`

* feat(DRAFT): Adds initial generic backends

* feat: Generate and move `Metadata` (`TypedDict`) to `datasets._typing`

* feat: Adds optional backends, `polars[pyarrow]`, `with_backend`

* feat: Adds `pyarrow` backend

* docs: Update `.with_backend()`

* chore: Remove `duckdb` comment

Not planning to support this anymore, requires `fsspec` which isn't in `dev`

```
InvalidInputException
Traceback (most recent call last)
Cell In[6], line 5
       3 with duck._reader._opener.open(url) as f:
       4     fn = duck._reader._read_fn['.json']
----> 5     thing = fn(f.read())

InvalidInputException: Invalid Input Error: This operation could not be completed because required module 'fsspec' is not installed"
```

* ci(typing): Add `pyarrow-stubs` to `dev` dependencies

Will put this in another PR, but need it here for IDE support

* refactor: `generate_datasets_typing` -> `Application.generate_typing`

* refactor: Split `datasets` into public/private packages

- `tools.datasets`: Building & updating metadata file(s), generating annotations
- `altair.datasets`: Consuming metadata, remote & cached dataset management

* refactor: Provide `npm` url to `GitHub(...)`

* refactor: Rename `ext` -> `suffix`

* refactor: Remove unimplemented `tag="latest"`

Since `metadata.parquet` is sorted, this was already the behavior when not providing a tag

* feat: Rename `_datasets_dir`, make configurable, add docs

Still on the fence about `Loader.cache_dir` vs `Loader.cache`

* docs: Adds examples to `Loader.with_backend`

* refactor: Clean up requirements -> imports

* docs: Add basic example to `Loader` class

Also incorporates changes from previous commit into `__repr__`
4a2a2e0

* refactor: Reorder `alt.datasets` module

* docs: Fill out `Loader.url`

* feat: Adds `_Reader._read_metadata`

* refactor: Rename `(reader|scanner_from()` -> `(read|scan)_fn()`

* refactor(typing): Replace some explicit casts

* refactor: Shorten and document request delays

* feat(DRAFT): Make `[tag]` a `pl.Enum`

* fix: Handle `pyarrow` scalars conversion

* test: Adds `test_datasets`

Initially quite basic, need to add more parameterize and test caching

* fix(DRAFT): hotfix `pyarrow` read

* fix(DRAFT): Treat `polars` as exception, invalidate cache

Possibly fix https://github.com/vega/altair/actions/runs/11768349827/job/32778071725?pr=3631

* test: Skip `pyarrow` tests on `3.9`

Forgot that this gets uninstalled in CI
https://github.com/vega/altair/actions/runs/11768424121/job/32778234026?pr=3631

* refactor: Tidy up changes from last 4 commits

- Rename and properly document "file-like object" handling
  - Also made a bit clearer what is being called and when
- Use a more granular approach to skipping in `@backends`
  - Previously, everything was skipped regardless of whether it required `pyarrow`
  - Now, `polars`, `pandas` **always** run - with `pandas` expected to fail
- I had to clean up `skip_requires_pyarrow` to make it compatible with `pytest.param`
  - It has a runtime check for if `MarkDecorator`, instead of just a callable

bb7bc17, ebc1bfa, fe0ae88, 7089f2a

* refactor: Rework `_readers.py`

- Moved `_Reader._metadata` -> module-level constant `_METADATA`.
  - It was never modified and is based on the relative directory of this module
- Generally improved the readability with more method-chaining (less assignment)
- Renamed, improved doc `_filter_reduce` -> `_parse_predicates_constraints`

* test: Adds tests for missing dependencies

* test: Adds `test_dataset_not_found`

* test: Adds `test_reader_cache`

* docs: Finish `_Reader`, fill parameters of `Loader.__call__`

Still need examples for `Loader.__call__`

* refactor: Rename `backend` -> `backend_name`, `get_backend` -> `backend`

`get_` was the wrong term since it isn't a free operation

* fix(DRAFT): Add multiple fallbacks for `pyarrow` JSON

* test: Remove `pandas` fallback for `pyarrow`

There are enough alternatives here, it only added complexity

* test: Adds `test_all_datasets`

Disabled by default, since there are 74 datasets

* refactor: Remove `_Reader._response`

Can't reproduce the original issue that led to adding this.
All backends are supporting `HTTPResponse` directly

* fix: Correctly handle no remote connection

Previously, `Path.touch()` appeared to be a cache-hit - despite being an empty file.
- Fixes that bug
- Adds tests

* docs: Align `_typing.Metadata` and `Loader.(url|__call__)` descriptions

Related c572180

* feat: Update to `v2.10.0`, fix tag inconsistency

- Noticed one branch that missed the join to `npm`
  - Moved the join to `.tags()` and added a doc
- https://github.com/vega/vega-datasets/releases/tag/v2.10.0

* refactor: Tidying up `tools.datasets`

* revert: Remove tags schema files

* ci: Introduce `datasets` refresh to `generate_schema_wrapper`

Unrelated to schema, but needs to hook in somewhere

* docs: Add `tools.datasets.Application` doc

* revert: Remove comment

* docs: Add a table preview to `Metadata`

* docs: Add examples for `Loader.__call__`

* refactor: Rename `DatasetName` -> `Dataset`, `VersionTag` -> `Version`

* fix: Ensure latest `[tag]` appears first

When updating from `v2.9.0` -> `v2.10.0`, new tags were appended to the bottom.
This invalidated an assumption in `Loader.(dataset|url)` that the first result is the latest

* refactor: Misc `models.py` updates

- Remove unused `ParsedTreesResponse`
- Align more of the doc style
- Rename `ReParsedTag` -> `SemVerTag`

* docs: Update `tools.datasets.__init__.py`

* test: Fix `@datasets_debug` selection

Wasn't being recognised by `-m not datasets_debug` and always ran

* test: Add support for overrides in `test_all_datasets`

vega/vega-datasets#627

* test: Adds `test_metadata_columns`

* fix: Warn instead of raise for hit rate limit

There should be enough handling elsewhere to stop requesting

https://github.com/vega/altair/actions/runs/11823002117/job/32941324941#step:8:102

* feat: Update for `v2.11.0`

https://github.com/vega/vega-datasets/releases/tag/v2.11.0
Includes support for `.parquet` following:
- vega/vega-datasets#628
- vega/vega-datasets#627

* feat: Always use `pl.read_csv(try_parse_dates=True)`

Related #3631 (comment)

* feat: Adds `_pl_read_json_roundtrip`

First mentioned in #3631 (comment)

Addresses most of the  `polars` part of #3631 (comment)

* feat(DRAFT): Adds infer-based `altair.datasets.load`

Requested by @joelostblom in:
#3631 (comment)
#3631 (comment)

* refactor: Rename `Loader.with_backend` -> `Loader.from_backend`

#3631 (comment)

* feat(DRAFT): Add optional `backend` parameter for `load(...)`

Requested by @jonmmease
#3631 (comment)
#3631 (comment)

* feat(DRAFT): Adds `altair.datasets.url`

A dataframe package is still required currently.
Can later be adapted to fit the requirements of (#3631 (comment)).

Related:
- #3631 (comment)
- #3631 (comment)
- #3150 (reply in thread)

@mattijn, @joelostblom

* feat: Support `url(...)` without dependencies

#3631 (comment), #3631 (comment), #3631 (comment)

* fix(DRAFT): Don't generate csv on refresh

https://github.com/vega/altair/actions/runs/11942284568/job/33288974210?pr=3631

* test: Replace rogue `NotImplementedError`

https://github.com/vega/altair/actions/runs/11942364658/job/33289235198?pr=3631

* fix: Omit `.gz` last modification time header

Previously was creating a diff on every refresh, since the current time updated.
https://docs.python.org/3/library/gzip.html#gzip.GzipFile.mtime

https://github.com/vega/altair/actions/runs/11942284568/job/33288974210?pr=3631

* docs: Add doc for `Application.write_csv_gzip`

* revert: Remove `"polars[pyarrow]"` backend

Partially related to #3631 (comment)

After some thought, this backend didn't add support for any unique dependency configs.
I've only ever used `use_pyarrow=True` for `pl.DataFrame.write_parquet` to resolve an issue with invalid headers in `"polars<1.0.0;>=0.19.0"`

* test: Add a complex `xfail` for `test_load_call`

Doesn't happen in CI, still unclear why the import within `pandas` breaks under these conditions.
Have tried multiple combinations of `pytest.MonkeyPatch`, hard imports, but had no luck in fixing the bug

* refactor: Renaming/recomposing `_readers.py`

The next commits benefit from having functionality decoupled from `_Reader.query`.
Mainly, keeping things lazy and not raising a user-facing error

* build: Generate `VERSION_LATEST`

Simplifies logic that relies on enum/categoricals that may not be recognised as ordered

* feat: Adds `_cache.py` for `UrlCache`, `DatasetCache`

Docs to follow

* ci(ruff): Ignore `0.8.0` violations

#3687 (comment)

* fix: Use stable `narwhals` imports

narwhals-dev/narwhals#1426, #3693 (comment)

* revert(ruff): Ignore `0.8.0` violations

f21b52b

* revert: Remove `_readers._filter`

Feature has been adopted upstream in narwhals-dev/narwhals#1417

* feat: Adds example and tests for disabling caching

* refactor: Tidy up `DatasetCache`

* docs: Finish `Loader.cache`

Not using doctest style here, none of these return anything but I want them hinted at

* refactor(typing): Use `Mapping` instead of `dict`

Mutability is not needed.
Also see #3573

* perf: Use `to_list()` for all backends

narwhals-dev/narwhals#1443 (comment), narwhals-dev/narwhals#1443 (comment), narwhals-dev/narwhals#1443 (comment)

* feat(DRAFT): Utilize `datapackage` schemas in `pandas` backends

Provides a generalized solution to `pd.read_(csv|json)` requiring the names of date columns to attempt parsing.
cc @joelostblom

The solution is possible in large part to vega/vega-datasets#631

#3631 (comment)

* refactor(ruff): Apply `TC006` fixes in new code

Related #3706

* docs(DRAFT): Add notes on `datapackage.features_typing`

* docs: Update `Loader.from_backend` example w/ dtypes

Related 909e7d0

* feat: Use `_pl_read_json_roundtrip` instead of `pl.read_json` for `pyarrow`

Provides better dtype inference

* docs: Replace example dataset

Switching to one with a timestamp that `frictionless`  recognises

https://github.com/vega/vega-datasets/blob/8745f5c61ba951fe057a42562b8b88604b4a3735/datapackage.json#L2674-L2689

https://github.com/vega/vega-datasets/blob/8745f5c61ba951fe057a42562b8b88604b4a3735/datapackage.json#L45-L57

* fix(ruff): resolve `RUF043` warnings

https://github.com/vega/altair/actions/runs/12439154550/job/34732432411?pr=3631

* build: run `generate-schema-wrapper`

https://github.com/vega/altair/actions/runs/12439184312/job/34732516789?pr=3631

* chore: update schemas

Changes from vega/vega-datasets#648

Currently pinned on `main` until `v3.0.0` introduces `datapackage.json`
https://github.com/vega/vega-datasets/tree/main

* feat(typing): Update `frictionless` model hierarchy

- Adds some incomplete types for fields (`sources`, `licenses`)
- Misc changes from vega/vega-datasets#651, vega/vega-datasets#643

* chore: Freeze all metadata

Mainly for `datapackage.json`, which is now temporarily stored un-transformed

Using version (vega/vega-datasets@7c2e67f)

* feat: Support and extract `hash` from `datapackage.json`

Related vega/vega-datasets#665

* feat: Build dataset url with `datapackage.json`

New column deviates from original approach, to support working from `main`

https://github.com/vega/altair/blob/e259fbabfc38c3803de0a952f7e2b081a22a3ba3/altair/datasets/_readers.py#L154

* revert: Removes `is_name_collision`

Not relevant following upstream change vega/vega-datasets#633

* build: Re-enable and generate `datapackage_features.parquet`

Eventually, will replace `metadata.parquet`
- But for a single version (current) only
- Paired with a **limited** `.csv.gz` version, to support cases where `.parquet` reading is not available (`pandas` w/o (`pyarrow`|`fastparquet`))

* feat: add temp `_Reader.*_dpkg` methods

- Will be replacing the non-suffixed versions
- Need to do this gradually as `tag` will likely be dropped
  - Breaking most of the tests

* test: Remove/replace all `tag` based tests

* revert: Remove all `tag` based features

* feat: Source version from `tool.altair.vega.vega-datasets`

* refactor(DRAFT): Migrate to `datapackage.json` only

Major switch from multiple github/npm endpoints -> a single file.
Was only possible following vega/vega-datasets#665

Still need to rewrite/fill out the `Metadata` doc, then moving onto features

* docs: Update `Metadata` example

* docs: Add missing descriptions to `Metadata`

* refactor: Renaming/reorganize in `tools/`

Mainly removing the `Fl` prefix, as there is no confusion now that `models.py` is purely `frictionless` structures

* test: Skip `is_image` datasets

* refactor: Make caching **opt-out**, use `$XDG_CACHE_HOME`

Caching is the more sensible default when considering a notebook environment
Now also using a standardised path: https://specifications.freedesktop.org/basedir-spec/latest/#variables

* refactor(typing): Add `_iter_results` helper

* feat(DRAFT): Replace `UrlCache` w/ `CsvCache`

Now that only a single version is supported, it is possible to mitigate the `pandas` case w/o `.parquet` support (#3631 (comment))

This commit adds the file and some tools needed to implement this - but I'll need to follow up with some more changes to integrate this into `_Reader`

* refactor: Misc reworking caching

- Made paths a `ClassVar`
- Removed unused `SchemaCache` methods
- Replace `_FIELD_TO_DTYPE` w/ `_DTYPE_TO_FIELD`
  - Only one variant is ever used
- Use a `SchemaCache` instance per-`pandas`-based reader
- Make fallback `csv_cache` initialization lazy
  - Only going to use the global when no dependencies found
  - Otherwise, instance-per-reader

* chore: Include `.parquet` in `metadata.csv.gz`

- Readable via url w/ `vegafusion` installed
- Currently no cases where a dataset has both `.parquet` and another extension

* feat: Extend `_extract_suffix` to support `Metadata`

Most subsequent changes are operating on this `TypedDict` directly, as it provides richer info for error handling

* refactor(typing): Simplify `Dataset` import

* fix: Convert `str` to correct types in `CsvCache`

* feat: Support `pandas` w/o a `.parquet` reader

* refactor: Reduce repetition w/ `_Reader._download`

* feat(DRAFT): `Metadata`-based error handling

- Adds `_exceptions.py` with some initial cases
- Renaming `result` -> `meta`
- Reduced the complexity of `_PyArrowReader`
- Generally, trying to avoid exceptions from 3rd parties - to allow suggesting an alternate path that may work

* chore(ruff): Remove unused `0.9.2` ignores

Related #3771

https://github.com/vega/altair/actions/runs/12810882256/job/35718940621?pr=3631

* refactor: clean up, standardize `_exceptions.py`

* test: Refactor decorators, test new errors

* docs: Replace outdated docs

- Using `load` instead of `data`
- Don't mention multi-versions, as that was dropped

* refactor: Clean up `tools.datasets`

- `Application.generate_typing` now mostly populated by `DataPackage` methods
- Docs are defined alongside expressions
- Factored out repetitive code into `spell_literal_alias`
- `Metadata` examples table is now generated inside the doc

* test: `test_datasets` overhaul

- Eliminated all flaky tests
- Mocking more of the internals that is safer to run in parallel
- Split out non-threadsafe tests with `@no_xdist`
- Huge performance improvement for the slower tests
- Added some helper functions (`is_*`) where common patterns were identified
- **Removed skipping from native `pandas` backend**
  - Confirms that its now safe without `pyarrow` installed

* refactor: Reuse `tools.fs` more, fix `app.(read|scan)`

Using only `.parquet` was relevant in earlier versions that produced multiple `.parquet` files
Now these methods safely handle all formats in use

* feat(typing): Set `"polars"` as default in `Loader.from_backend`

Without a default, I found that VSCode was always suggesting the **last** overload first (`"pyarrow"`)
This is a bad suggestion, as it provides the *worst native* experience.

The default now aligns with the backend providing the *best native* experience

* docs: Adds module-level doc to `altair.datasets`

- Multiple **brief** examples, for a taste of the public API
  - See (#3763)
- Refs to everywhere a first-time user may need help from
- Also aligned the (`Loader`|`load`) docs w/ each other and the new phrasing here

* test: Clean up `test_datasets`

- Reduce superfluous docs
- Format/reorganize remaining docs
- Follow up on some comments
- Misc style changes

* docs: Make `sphinx` happy with docs

These changes are very minor in VSCode, but fix a lot of rendering issues on the website

* refactor: Add `find_spec` fastpath to `is_available`

Have a lot of changes locally that use `find_spec`, but would prefer a single name associated with this action
The actual spec is never relevant for this usage

* feat(DRAFT): Private API overhaul

**Public API is unchanged**
Core changes are to simplify testing and extension:

- `_readers.py` -> `_reader.py`
  - w/ two new support modules `_constraints`, and `_readimpl`
- Functions (`BaseImpl`) are declared with what they support (`include`) and restrictions (`exclude`) on that subset
  - Transforms a lot of the imperative logic into set operations
- Greatly improved `pyarrow` support
  - Utilize schema
  - Provides additional fallback `.json` implementations
  - `_stdlib_read_json_to_arrow` finally resolves `"movies.json"` issue

* refactor: Simplify obsolete paths in `CsvCache`

They were an artifact of *previously* using multiple `vega-datasets` versions in `.parquet` - but only the most recent in `.csv.gz`

Currently both store the same range of names, so this error handling never triggered

* chore: add workaround for `narwhals` bug

Opened (narwhals-dev/narwhals#1897)
Marking (#3631 (comment)) as resolved

* feat(typing): replace `(Read|Scan)Impl` classes with aliases

- Shorter names `Read`, `Scan`
- The single unique method is now `into_scan`
- There was no real need to have concrete classes when they behave the same as parent

* feat: Rename, docs `unwrap_or` -> `unwrap_or_skip`

* refactor: Replace `._contents` w/ `.__str__()`

Inspired by https://github.com/pypa/packaging/blob/8510bd9d3bab5571974202ec85f6ef7b0359bfaf/src/packaging/requirements.py#L67-L71

* fix: Use correct type for `pyarrow.csv.read_csv`

Resolves:
```py
File ../altair/.venv/Lib/site-packages/pyarrow/csv.pyx:1258, in pyarrow._csv.read_csv()
TypeError: Cannot convert dict to pyarrow._csv.ParseOptions
```

* docs: Add docs for `Read`, `Scan`, `BaseImpl`

* docs: Clean up `_merge_kwds`, `_solve`

* refactor(typing): Include all suffixes in `Extension`

Also simplifies and removes outdated `Extension`-related tooling

* feat: Finish `Reader.profile`

- Reduced the scope a bit, now just un/supported
- Added `pprint` option
- Finished docs, including example pointing to use `url(...)`

* test: Use `Reader.profile` in `is_polars_backed_pyarrow`

* feat: Clean up, add tests for new exceptions

* feat: Adds `Reader.open_markdown`

- Will be even more useful after merging vega/vega-datasets#663
- Thinking this is a fair tradeoff vs inlining the descriptions into `altair`
  - All the info is available and it is quicker than manually searching the headings in a browser

* docs: fix typo

Resolves #3631 (comment)

* fix: fix typo in error message

#3631 (comment)

* refactor: utilize narwhals fix

narwhals-dev/narwhals#1934

* refactor: utilize `nw.Implementation.from_backend`

See narwhals-dev/narwhals#1888

* feat(typing): utilize `nw.LazyFrame` working `TypeVar`

Possible since narwhals-dev/narwhals#1930

@MarcoGorelli if you're interested what that PR did (besides fix warnings 😉)

* docs: Show less data in examples

* feat: Update for `[email protected]`

Made possible via vega/vega-datasets#681

- Removes temp files
- Removes some outdated apis
- Remove test based on removed `"points"` dataset

* refactor: replace `SchemaCache.schema_pyarrow` -> `nw.Schema.to_arrow`

Related
- narwhals-dev/narwhals#1924
- #3631 (comment)

* feat(typing): Properly annotate `dataset_name`, `suffix`

Makes more sense following (755ab4f)

* chore: bump `vega-datasets==3.1.0`

* test(typing): Ignore `_pytest` imports for `pyright`

See microsoft/pyright#10248 (comment)

* feat: Basic `geopandas` impl

Still need to update tests

* fix: Add missing `v` prefix to url

* test: Update `test_spatial`

* ci: Try pinning locked `ruff`

https://github.com/vega/altair/actions/runs/14478364865/job/40609439929

* ci(uv): Add `--group geospatial`

* chore: Reduce `geopandas` pin

* feat: Basic `polars-st` impl

- Seems to work pretty similarly to `geopandas`
- The repr isn't as clean
- Pretty cool that you can get *something* from `load("us-10m").st.plot()`

* ci(typing): `mypy` ignore `polars-st`

https://github.com/vega/altair/actions/runs/14494920661/job/40660098022?pr=3631

* build against vega-datasets 3.2.0

* run generate-schema-wrapper

* prevent infinite recursion in _split_markers

* sync to v6

* resolve doctest on lower python versions

* resolve comment in github action

* changed examples to modern interface to pass docbuild

---------

Co-authored-by: dangotbanned <[email protected]>