
Conversation

@dangotbanned
Member

@dangotbanned dangotbanned commented Feb 3, 2025

Will close #1912

What type of PR is this? (check all applicable)

  • ✨ Feature

Related issues

Checklist

  • Code follows style guide (ruff)
  • Tests added
  • Documented the changes

If you have comments or can explain your changes, please do so below

  • Started with porting `nw.functions._from_dict_impl`
  • Extended `Schema` with `_version: ClassVar[Version]` to remove the need for a user-facing arg (narwhals-dev#1912 (comment))
@dangotbanned
Member Author

dangotbanned commented Feb 3, 2025

@MarcoGorelli I've pushed this early to get some thoughts on handling Version.

To avoid needing a user-provided arg, I think we'd need a different approach than the public/private function pair used here:

return _from_dict_impl(
    data,
    schema,
    native_namespace=native_namespace,
    version=Version.MAIN,
)

def _from_dict_impl(
    data: dict[str, Any],
    schema: dict[str, DType] | Schema | None = None,
    *,
    native_namespace: ModuleType | None = None,
    version: Version,
) -> DataFrame[Any]:
    from narwhals.series import Series

If you have something like:

class Schema(BaseSchema):
    _version: ClassVar[Version] = Version.MAIN

That could easily be overridden for v1:

class Schema(NwSchema):
    _version = Version.V1
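
For a self-contained illustration of why this removes the user-facing arg, here is a minimal sketch - the stub `Version`/`Schema`/`describe` below are stand-ins, not the narwhals classes:

    from enum import Enum, auto
    from typing import ClassVar


    class Version(Enum):
        MAIN = auto()
        V1 = auto()


    class Schema(dict):
        _version: ClassVar[Version] = Version.MAIN

        def describe(self) -> str:
            # Methods resolve the version from the class itself,
            # so no user-facing `version` argument is needed.
            return f"{type(self)._version.name}: {dict(self)}"


    class V1Schema(Schema):
        _version = Version.V1


    print(Schema({"a": "Int64"}).describe())    # MAIN: {'a': 'Int64'}
    print(V1Schema({"a": "Int64"}).describe())  # V1: {'a': 'Int64'}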

@dangotbanned

This comment was marked as resolved.

@MarcoGorelli
Member

MarcoGorelli commented Feb 4, 2025

thanks Dan! Yes, having a private _version sounds good to me

Some other comments:

  • I don't think you need `backend` for `to_pandas`. I think you just need `dtype_backend`, whose signature should match the one in https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.convert_dtypes.html (see the sketch after this list)
  • I'm not sure about `to_native`, because I might then expect to do:

        df = pd.DataFrame({'a': [1, 2, 3]})
        nw.from_native(df).schema.to_native()

    and expect to get a pandas schema out the other end. However, that's not what happens: the `nw.Schema` has no knowledge of the underlying implementation. And I think that's OK. Shall we start with just `to_pandas` / `to_pyarrow` / `to_polars`?
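
A rough usage sketch of the `dtype_backend` idea, mirroring the values accepted by `pandas.DataFrame.convert_dtypes` (the calls and dtype strings shown are assumptions, not confirmed output):

    import narwhals as nw

    schema = nw.Schema({"a": nw.Int64(), "b": nw.String()})

    schema.to_pandas()                                # NumPy-backed, e.g. {'a': 'int64', ...}
    schema.to_pandas(dtype_backend="numpy_nullable")  # e.g. {'a': 'Int64', ...}
    schema.to_pandas(dtype_backend="pyarrow")         # e.g. {'a': 'int64[pyarrow]', ...}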

@dangotbanned

This comment was marked as outdated.

@MarcoGorelli
Member

yup, thanks!

@dangotbanned
Member Author

thanks Dan! Yes, having a private _version sounds good to me

Great, I'll get that added soon

Some other comments:

    df = pd.DataFrame({'a': [1, 2, 3]})
    nw.from_native(df).schema.to_native()

and expect to get a pandas schema out the other end.
However, that's not what happens, the nw.Schema has no knowledge of the underlying implementation. And I think that's OK.

I think you've stumbled into an interesting topic here @MarcoGorelli.
So the current signatures deviate from the similar methods on nw.DataFrame - here, backend is required to account for that lack of knowledge.

Mainly this came from adapting nw.from_dict - which (prior to #1931) required a ModuleType, and that acted as a guarantee that we had the import.
Now, the main difference between this and from_dict is that backend can't be None - since there is no surrounding context for this trick:

if backend is None:
    # Infer the native namespace from any Narwhals Series in `data`.
    for val in data.values():
        if isinstance(val, Series):
            native_namespace = val.__native_namespace__()
            break
    else:
        # The for-else: no Series was found, so there is nothing to infer from.
        msg = "Calling `from_dict` without `backend` is only supported if all input values are already Narwhals Series"
        raise TypeError(msg)


Shall we start with just to_pandas / to_pyarrow / to_polars?

Yeah definitely!
Now that I understand what's going on behind the scenes, to_native seems less helpful.

Question

If we drop the backend requirement and to_native(), do we need handling in nw.Implementation for missing imports?

I can keep this local to nw.Schema.to_*.
But I feel like with the push (#1917, #1931) towards backend: ModuleType | Implementation | str, there might be a benefit in some "infallible" paths - if only to provide consistent errors?

@MarcoGorelli
Member

I think in general we need an `import_optional_dependency` utility function, which can unify error messages for missing imports

but nw.Schema.to_pandas can import pandas (and let it raise if it's not available), nw.Schema.to_polars can import Polars (and let it raise if it's not available), etc...
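
A minimal sketch of what such a utility could look like (the name comes from the comment above; the signature and message are assumptions):

    from importlib import import_module
    from types import ModuleType


    def import_optional_dependency(name: str, *, reason: str = "") -> ModuleType:
        """Import `name`, unifying the error message when it is missing."""
        try:
            return import_module(name)
        except ModuleNotFoundError as exc:
            extra = f" {reason}" if reason else ""
            msg = f"`{name}` is required{extra}, but it is not installed."
            raise ModuleNotFoundError(msg) from exc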

@dangotbanned
Member Author

dangotbanned commented Feb 4, 2025

I think in general we need an `import_optional_dependency` utility function, which can unify error messages for missing imports

but nw.Schema.to_pandas can import pandas (and let it raise if it's not available), nw.Schema.to_polars can import Polars (and let it raise if it's not available), etc...

https://results.pre-commit.ci/run/github/760058710/1738698879.vjLSyWjmSYiQy0B_ViLJlA

@MarcoGorelli should this fall under the same exception-to-the-rule as the dtypes modules?

Edit

Seems I misunderstood how the check works: https://github.com/narwhals-dev/narwhals/blob/71a5bc5d6848da10e658db9d68a4a139f37c099b/utils/import_check.py

Not sure how this could fit in.

@MarcoGorelli
Member

MarcoGorelli commented Feb 4, 2025

thanks, looking good!

you can add a `# ignore-banned-import` where necessary - the check is only there to nudge devs towards double-checking that certain imports really are performed only when strictly needed
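
For example (a hypothetical import site - the trailing comment is the marker the check looks for):

    def to_pandas(self):
        # Deferred import, only performed when strictly needed.
        import pandas as pd  # ignore-banned-import
        ...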

@dangotbanned
Member Author

dangotbanned commented Feb 4, 2025

#1924 (comment)

Thanks @MarcoGorelli, will give this a try tomorrow

Member

@MarcoGorelli MarcoGorelli left a comment


thanks @dangotbanned

just made some minor edits, and got dtype_backend to match how it currently is in pandas

happy with this?

@dangotbanned
Member Author

thanks @dangotbanned

just made some minor edits, and got dtype_backend to match how it currently is in pandas

happy with this?

Thanks @MarcoGorelli, just taking a look now

Comment on lines 208 to 210
if parse_version(pl.__version__) < (1, 0, 0):  # pragma: no cover
    return dict(schema)  # type: ignore[return-value]
return pl.Schema(schema)
Member Author


@MarcoGorelli why is the order reversed now?

This is just one small example from typing_extensions:

https://github.com/python/typing_extensions/blob/8184ac61398c187203dad819eb5b9d34005a96ae/src/typing_extensions.py#L547-L552

I find the pattern there easier to follow:

if parse_version(pl.__version__) >= (1, 0, 0):
    # happy path first
    ...
else:
    # compat path
    ...

Member Author

@dangotbanned dangotbanned Feb 8, 2025


The import of `cast` was blocking me from adding it as a suggestion, but this is my suggestion:
(3bcebc1)
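
For reference, a sketch of what that suggestion plausibly looks like - the happy-path-first pattern from above plus `cast` (an assumption, not the literal contents of 3bcebc1):

    from typing import cast

    if parse_version(pl.__version__) >= (1, 0, 0):
        return pl.Schema(schema)
    else:  # pragma: no cover
        return cast("pl.Schema", dict(schema))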

Member


it just became easier to not have `type: ignore` and `pragma: no cover` on the same line (we measure coverage in the CI job with the latest versions - I've found combining coverage jobs to be too unreliable and flaky)

sure, no objections to 3bcebc1


def test_schema_to_pandas_invalid() -> None:
schema = nw.Schema({"a": nw.Int64()})
msg = "Expected one of {None, 'pyarrow', 'numpy_nullable'}, got: 'cabbage'"
Member Author


😆

@dangotbanned
Member Author

dangotbanned commented Feb 8, 2025

Ready to merge if you're onboard w/ (#1924 (comment)) @MarcoGorelli

Thanks for helping me through this 🎉

Note

Maybe rename the PR to make it more discoverable?
feat: add nw.Schema.to_(arrow|pandas|polars)

@MarcoGorelli
Member

sure, looks good, feel free to apply that and ship it

@dangotbanned
Member Author

sure, looks good, feel free to apply that and ship it

Ah great - sorry, last note in (#1924 (comment)) (if you didn't see it)

@MarcoGorelli
Member

changing the title? sure, i think you should have permission to edit

@dangotbanned
Member Author

changing the title? sure, i think you should have permission to edit

Oh yeah 🎉

@dangotbanned dangotbanned changed the title feat: add nw.Schema.to_* methods feat: add Schema.to_(arrow|pandas|polars) Feb 8, 2025
@dangotbanned dangotbanned merged commit 365cdbd into narwhals-dev:main Feb 8, 2025
23 of 24 checks passed
@dangotbanned dangotbanned deleted the schema-convert-api branch February 8, 2025 17:24
mattijn added a commit to vega/altair that referenced this pull request Jul 11, 2025
* feat: Adds `.arrow` support

To support [flights-200k.arrow](https://github.com/vega/vega-datasets/blob/f637f85f6a16f4b551b9e2eb669599cc21d77e69/data/flights-200k.arrow)

* feat: Add support for caching metadata

* feat: Support env var `VEGA_GITHUB_TOKEN`

Not required for these requests, but may be helpful to avoid limits

* feat: Add support for multi-version metadata

As an example, for comparing against the most recent I've added the 5 most recent

* refactor: Renaming, docs, reorganize

* feat: Support collecting release tags

See https://docs.github.com/en/rest/repos/repos?apiVersion=2022-11-28#list-repository-tags

* feat: Adds `refresh_tags`

- Basic mechanism for discovering new versions
- Tries to minimise number of and total size of requests

* feat(DRAFT): Adds `url_from`

Experimenting with querying the url cache w/ expressions

* fix: Wrap all requests with auth

* chore: Remove `DATASET_NAMES_USED`

* feat: Major `GitHub` rewrite, handle rate limiting

- `metadata_full.parquet` stores **all known** file metadata
- `GitHub.refresh()` to maintain integrity in a safe manner
- Roughly 3000 rows
- Single release: **9kb** vs 46 releases: **21kb**

* feat(DRAFT): Partial implement `data("name")`

* fix(typing): Resolve some `mypy` errors

* fix(ruff): Apply `3.8` fixes

https://github.com/vega/altair/actions/runs/11495437283/job/31994955413

* docs(typing): Add `WorkInProgress` marker to `data(...)`

- Still undecided exactly how this functionality should work
- Need to resolve `npm` tags != `gh` tags issue as well

* feat(DRAFT): Add a source for available `npm` versions

* refactor: Bake `"v"` prefix into `tags_npm`

* refactor: Move `_npm_metadata` into a class

* chore: Remove unused, add todo

* feat: Adds `app` context for github<->npm

* fix: Invalidate old trees

* chore: Remove early test files

* refactor: Rename `metadata_full` -> `metadata`

Suffix was only added due to *now-removed* test files

* refactor: `tools.vendor_datasets` -> `tools.datasets` package

Will be following up with some more splitting into composite modules

* refactor: Move `TypedDict`, `NamedTuple`(s) -> `datasets.models`

* refactor: Move, rename `semver`-related tools

* refactor: Remove `write_schema` from `_Npm`, `_GitHub`

Handled in `Application` now

* refactor: Rename, split `_Npm`, `_GitHub` into own modules

`tools.datasets.npm` will later be performing the requests that are in `Dataset.__call__` currently

* refactor: Move `DataLoader.__call__` -> `DataLoader.url()`

-`data.name()` -> `data(name)`
- `data.name.url` -> `data.url(name)`

* feat(typing): Generate annotations based on known datasets

* refactor(typing): Utilize `datasets._typing`

* feat: Adds `Npm.dataset` for remote reading

* refactor: Remove dead code

* refactor: Replace `name_js`, `name_py` with `dataset_name`

Since we're just using strings, there is no need for 2 forms of the name.
The legacy package needed this for `__getattr__` access with valid identifiers

* fix: Remove invalid `semver.sort` op

I think this was added in error, since the schema of the file never had `semver` columns

Only noticed the bug when doing a full rebuild

* fix: Add missing init path for `refresh_trees`

* refactor: Move public interface to `_io`

Temporary home, see module docstring

* refactor(perf): Don't recreate path mapping on every attribute access

* refactor: Split `Reader._url_from` into `url`, `_query`

- Much more generic now in what it can be used for
- For the caching, I'll need more columns than just `"url_npm"`
- `"url_github" contains a hash

* feat(DRAFT): Adds `GitHubUrl.BLOBS`

- Common prefix to all rows in `metadata[url_github]`
- Stripping this leaves only `sha`
- For **2800** rows, there are only **109** unique hashes, so these can be used to reduce cache size

* feat: Store `sha` instead of `github_url`

Related 661a385

* feat(perf): Adds caching to `ALTAIR_DATASETS_DIR`

* feat(DRAFT): Adds initial generic backends

* feat: Generate and move `Metadata` (`TypedDict`) to `datasets._typing`

* feat: Adds optional backends, `polars[pyarrow]`, `with_backend`

* feat: Adds `pyarrow` backend

* docs: Update `.with_backend()`

* chore: Remove `duckdb` comment

Not planning to support this anymore, requires `fsspec` which isn't in `dev`

```
InvalidInputException
Traceback (most recent call last)
Cell In[6], line 5
       3 with duck._reader._opener.open(url) as f:
       4     fn = duck._reader._read_fn['.json']
----> 5     thing = fn(f.read())

InvalidInputException: Invalid Input Error: This operation could not be completed because required module 'fsspec' is not installed"
```

* ci(typing): Add `pyarrow-stubs` to `dev` dependencies

Will put this in another PR, but need it here for IDE support

* refactor: `generate_datasets_typing` -> `Application.generate_typing`

* refactor: Split `datasets` into public/private packages

- `tools.datasets`: Building & updating metadata file(s), generating annotations
- `altair.datasets`: Consuming metadata, remote & cached dataset management

* refactor: Provide `npm` url to `GitHub(...)`

* refactor: Rename `ext` -> `suffix`

* refactor: Remove unimplemented `tag="latest"`

Since `metadata.parquet` is sorted, this was already the behavior when not providing a tag

* feat: Rename `_datasets_dir`, make configurable, add docs

Still on the fence about `Loader.cache_dir` vs `Loader.cache`

* docs: Adds examples to `Loader.with_backend`

* refactor: Clean up requirements -> imports

* docs: Add basic example to `Loader` class

Also incorporates changes from previous commit into `__repr__`
4a2a2e0

* refactor: Reorder `alt.datasets` module

* docs: Fill out `Loader.url`

* feat: Adds `_Reader._read_metadata`

* refactor: Rename `(reader|scanner_from()` -> `(read|scan)_fn()`

* refactor(typing): Replace some explicit casts

* refactor: Shorten and document request delays

* feat(DRAFT): Make `[tag]` a `pl.Enum`

* fix: Handle `pyarrow` scalars conversion

* test: Adds `test_datasets`

Initially quite basic, need to add more parameterize and test caching

* fix(DRAFT): hotfix `pyarrow` read

* fix(DRAFT): Treat `polars` as exception, invalidate cache

Possibly fix https://github.com/vega/altair/actions/runs/11768349827/job/32778071725?pr=3631

* test: Skip `pyarrow` tests on `3.9`

Forgot that this gets uninstalled in CI
https://github.com/vega/altair/actions/runs/11768424121/job/32778234026?pr=3631

* refactor: Tidy up changes from last 4 commits

- Rename and properly document "file-like object" handling
  - Also made a bit clearer what is being called and when
- Use a more granular approach to skipping in `@backends`
  - Previously, everything was skipped regardless of whether it required `pyarrow`
  - Now, `polars`, `pandas` **always** run - with `pandas` expected to fail
- I had to clean up `skip_requires_pyarrow` to make it compatible with `pytest.param`
  - It has a runtime check for if `MarkDecorator`, instead of just a callable

bb7bc17, ebc1bfa, fe0ae88,
7089f2a

* refactor: Rework `_readers.py`

- Moved `_Reader._metadata` -> module-level constant `_METADATA`.
  - It was never modified and is based on the relative directory of this module
- Generally improved the readability with more method-chaining (less assignment)
- Renamed, improved doc `_filter_reduce` -> `_parse_predicates_constraints`

* test: Adds tests for missing dependencies

* test: Adds `test_dataset_not_found`

* test: Adds `test_reader_cache`

* docs: Finish `_Reader`, fill parameters of `Loader.__call__`

Still need examples for `Loader.__call__`

* refactor: Rename `backend` -> `backend_name`, `get_backend` -> `backend`

`get_` was the wrong term since it isn't a free operation

* fix(DRAFT): Add multiple fallbacks for `pyarrow` JSON

* test: Remove `pandas` fallback for `pyarrow`

There are enough alternatives here, it only added complexity

* test: Adds `test_all_datasets`

Disabled by default, since there are 74 datasets

* refactor: Remove `_Reader._response`

Can't reproduce the original issue that led to adding this.
All backends are supporting `HTTPResponse` directly

* fix: Correctly handle no remote connection

Previously, `Path.touch()` appeared to be a cache-hit - despite being an empty file.
- Fixes that bug
- Adds tests

* docs: Align `_typing.Metadata` and `Loader.(url|__call__)` descriptions

Related c572180

* feat: Update to `v2.10.0`, fix tag inconsistency

- Noticed one branch that missed the join to `npm`
  - Moved the join to `.tags()` and added a doc
- https://github.com/vega/vega-datasets/releases/tag/v2.10.0

* refactor: Tidying up `tools.datasets`

* revert: Remove tags schema files

* ci: Introduce `datasets` refresh to `generate_schema_wrapper`

Unrelated to schema, but needs to hook in somewhere

* docs: Add `tools.datasets.Application` doc

* revert: Remove comment

* docs: Add a table preview to `Metadata`

* docs: Add examples for `Loader.__call__`

* refactor: Rename `DatasetName` -> `Dataset`, `VersionTag` -> `Version`

* fix: Ensure latest `[tag]` appears first

When updating from `v2.9.0` -> `v2.10.0`, new tags were appended to the bottom.
This invalidated an assumption in `Loader.(dataset|url)` that the first result is the latest

* refactor: Misc `models.py` updates

- Remove unused `ParsedTreesResponse`
- Align more of the doc style
- Rename `ReParsedTag` -> `SemVerTag`

* docs: Update `tools.datasets.__init__.py`

* test: Fix `@datasets_debug` selection

Wasn't being recognised by `-m not datasets_debug` and always ran

* test: Add support for overrides in `test_all_datasets`

vega/vega-datasets#627

* test: Adds `test_metadata_columns`

* fix: Warn instead of raise for hit rate limit

There should be enough handling elsewhere to stop requesting

https://github.com/vega/altair/actions/runs/11823002117/job/32941324941#step:8:102

* feat: Update for `v2.11.0`

https://github.com/vega/vega-datasets/releases/tag/v2.11.0
Includes support for `.parquet` following:
- vega/vega-datasets#628
- vega/vega-datasets#627

* feat: Always use `pl.read_csv(try_parse_dates=True)`

Related #3631 (comment)

* feat: Adds `_pl_read_json_roundtrip`

First mentioned in #3631 (comment)

Addresses most of the `polars` part of #3631 (comment)
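
One way such a round-trip could work (a sketch under the assumption that the helper re-reads via CSV to trigger date parsing; the actual implementation may differ):

```py
import io

import polars as pl


def _pl_read_json_roundtrip(source: io.IOBase) -> pl.DataFrame:
    # Round-trip through CSV so `try_parse_dates=True` can recover
    # dtypes (e.g. dates) that `pl.read_json` leaves as strings.
    df = pl.read_json(source)
    buf = io.BytesIO()
    df.write_csv(buf)
    buf.seek(0)
    return pl.read_csv(buf, try_parse_dates=True)
```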

* feat(DRAFT): Adds infer-based `altair.datasets.load`

Requested by @joelostblom in:
#3631 (comment)
#3631 (comment)

* refactor: Rename `Loader.with_backend` -> `Loader.from_backend`

#3631 (comment)

* feat(DRAFT): Add optional `backend` parameter for `load(...)`

Requested by @jonmmease
#3631 (comment)
#3631 (comment)

* feat(DRAFT): Adds `altair.datasets.url`

A dataframe package is still required currently.
Can later be adapted to fit the requirements of (#3631 (comment)).

Related:
- #3631 (comment)
- #3631 (comment)
- #3150 (reply in thread)

@mattijn, @joelostblom

* feat: Support `url(...)` without dependencies

#3631 (comment), #3631 (comment), #3631 (comment)

* fix(DRAFT): Don't generate csv on refresh

https://github.com/vega/altair/actions/runs/11942284568/job/33288974210?pr=3631

* test: Replace rogue `NotImplementedError`

https://github.com/vega/altair/actions/runs/11942364658/job/33289235198?pr=3631

* fix: Omit `.gz` last modification time header

Previously was creating a diff on every refresh, since the current time updated.
https://docs.python.org/3/library/gzip.html#gzip.GzipFile.mtime

https://github.com/vega/altair/actions/runs/11942284568/job/33288974210?pr=3631

* docs: Add doc for `Application.write_csv_gzip`

* revert: Remove `"polars[pyarrow]"` backend

Partially related to #3631 (comment)

After some thought, this backend didn't add support for any unique dependency configs.
I've only ever used `use_pyarrow=True` for `pl.DataFrame.write_parquet` to resolve an issue with invalid headers in `"polars<1.0.0;>=0.19.0"`

* test: Add a complex `xfail` for `test_load_call`

Doesn't happen in CI, still unclear why the import within `pandas` breaks under these conditions.
Have tried multiple combinations of `pytest.MonkeyPatch`, hard imports, but had no luck in fixing the bug

* refactor: Renaming/recomposing `_readers.py`

The next commits benefit from having functionality decoupled from `_Reader.query`.
Mainly, keeping things lazy and not raising a user-facing error

* build: Generate `VERSION_LATEST`

Simplifies logic that relies on enum/categoricals that may not be recognised as ordered

* feat: Adds `_cache.py` for `UrlCache`, `DatasetCache`

Docs to follow

* ci(ruff): Ignore `0.8.0` violations

#3687 (comment)

* fix: Use stable `narwhals` imports

narwhals-dev/narwhals#1426, #3693 (comment)

* revert(ruff): Ignore `0.8.0` violations

f21b52b

* revert: Remove `_readers._filter`

Feature has been adopted upstream in narwhals-dev/narwhals#1417

* feat: Adds example and tests for disabling caching

* refactor: Tidy up `DatasetCache`

* docs: Finish `Loader.cache`

Not using doctest style here, none of these return anything but I want them hinted at

* refactor(typing): Use `Mapping` instead of `dict`

Mutability is not needed.
Also see #3573

* perf: Use `to_list()` for all backends

narwhals-dev/narwhals#1443 (comment), narwhals-dev/narwhals#1443 (comment), narwhals-dev/narwhals#1443 (comment)

* feat(DRAFT): Utilize `datapackage` schemas in `pandas` backends

Provides a generalized solution to `pd.read_(csv|json)` requiring the names of date columns to attempt parsing.
cc @joelostblom

The solution is possible in large part to vega/vega-datasets#631

#3631 (comment)

* refactor(ruff): Apply `TC006` fixes in new code

Related #3706

* docs(DRAFT): Add notes on `datapackage.features_typing`

* docs: Update `Loader.from_backend` example w/ dtypes

Related 909e7d0

* feat: Use `_pl_read_json_roundtrip` instead of `pl.read_json` for `pyarrow`

Provides better dtype inference

* docs: Replace example dataset

Switching to one with a timestamp that `frictionless` recognises

https://github.com/vega/vega-datasets/blob/8745f5c61ba951fe057a42562b8b88604b4a3735/datapackage.json#L2674-L2689

https://github.com/vega/vega-datasets/blob/8745f5c61ba951fe057a42562b8b88604b4a3735/datapackage.json#L45-L57

* fix(ruff): resolve `RUF043` warnings

https://github.com/vega/altair/actions/runs/12439154550/job/34732432411?pr=3631

* build: run `generate-schema-wrapper`

https://github.com/vega/altair/actions/runs/12439184312/job/34732516789?pr=3631

* chore: update schemas

Changes from vega/vega-datasets#648

Currently pinned on `main` until `v3.0.0` introduces `datapackage.json`
https://github.com/vega/vega-datasets/tree/main

* feat(typing): Update `frictionless` model hierarchy

- Adds some incomplete types for fields (`sources`, `licenses`)
- Misc changes from vega/vega-datasets#651, vega/vega-datasets#643

* chore: Freeze all metadata

Mainly for `datapackage.json`, which is now temporarily stored un-transformed

Using version (vega/vega-datasets@7c2e67f)

* feat: Support and extract `hash` from `datapackage.json`

Related vega/vega-datasets#665

* feat: Build dataset url with `datapackage.json`

New column deviates from original approach, to support working from `main`

https://github.com/vega/altair/blob/e259fbabfc38c3803de0a952f7e2b081a22a3ba3/altair/datasets/_readers.py#L154

* revert: Removes `is_name_collision`

Not relevant following upstream change vega/vega-datasets#633

* build: Re-enable and generate `datapackage_features.parquet`

Eventually, will replace `metadata.parquet`
- But for a single version (current) only
- Paired with a **limited** `.csv.gz` version, to support cases where `.parquet` reading is not available (`pandas` w/o (`pyarrow`|`fastparquet`))

* feat: add temp `_Reader.*_dpkg` methods

- Will be replacing the non-suffixed versions
- Need to do this gradually as `tag` will likely be dropped
  - Breaking most of the tests

* test: Remove/replace all `tag` based tests

* revert: Remove all `tag` based features

* feat: Source version from `tool.altair.vega.vega-datasets`

* refactor(DRAFT): Migrate to `datapackage.json` only

Major switch from multiple github/npm endpoints -> a single file.
Was Only possible following vega/vega-datasets#665

Still need to rewrite/fill out the `Metadata` doc, then moving onto features

* docs: Update `Metadata` example

* docs: Add missing descriptions to `Metadata`

* refactor: Renaming/reorganize in `tools/`

Mainly removing `Fl` prefix, as there is no confusion now `models.py` is purely `frictionless` structures

* test: Skip `is_image` datasets

* refactor: Make caching **opt-out**, use `$XDG_CACHE_HOME`

Caching is the more sensible default when considering a notebook environment
Using a standardised path now also https://specifications.freedesktop.org/basedir-spec/latest/#variables
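
A minimal sketch of the standard `$XDG_CACHE_HOME` fallback (the `"altair"` subdirectory name is an assumption):

```py
import os
from pathlib import Path


def _default_cache_dir() -> Path:
    # Honor $XDG_CACHE_HOME when set, else fall back to ~/.cache
    xdg = os.environ.get("XDG_CACHE_HOME")
    base = Path(xdg) if xdg else Path.home() / ".cache"
    return base / "altair"
```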

* refactor(typing): Add `_iter_results` helper

* feat(DRAFT): Replace `UrlCache` w/ `CsvCache`

Now that only a single version is supported, it is possible to mitigate the `pandas` case w/o `.parquet` support (#3631 (comment))

This commit adds the file and some tools needed to implement this - but I'll need to follow up with some more changes to integrate this into `_Reader`

* refactor: Misc reworking caching

- Made paths a `ClassVar`
- Removed unused `SchemaCache` methods
- Replace `_FIELD_TO_DTYPE` w/ `_DTYPE_TO_FIELD`
  - Only one variant is ever used
- Use a `SchemaCache` instance per `pandas`-based reader
- Make fallback `csv_cache` initialization lazy
  - Only going to use the global when no dependencies found
  - Otherwise, instance-per-reader

* chore: Include `.parquet` in `metadata.csv.gz`

- Readable via url w/ `vegafusion` installed
- Currently no cases where a dataset has both `.parquet` and another extension

* feat: Extend `_extract_suffix` to support `Metadata`

Most subsequent changes are operating on this `TypedDict` directly, as it provides richer info for error handling

* refactor(typing): Simplify `Dataset` import

* fix: Convert `str` to correct types in `CsvCache`

* feat: Support `pandas` w/o a `.parquet` reader

* refactor: Reduce repetition w/ `_Reader._download`

* feat(DRAFT): `Metadata`-based error handling

- Adds `_exceptions.py` with some initial cases
- Renaming `result` -> `meta`
- Reduced the complexity of `_PyArrowReader`
- Generally, trying to avoid exceptions from 3rd parties - to allow suggesting an alternate path that may work

* chore(ruff): Remove unused `0.9.2` ignores

Related #3771

https://github.com/vega/altair/actions/runs/12810882256/job/35718940621?pr=3631

* refactor: clean up, standardize `_exceptions.py`

* test: Refactor decorators, test new errors

* docs: Replace outdated docs

- Using `load` instead of `data`
- Don't mention multi-versions, as that was dropped

* refactor: Clean up `tools.datasets`

- `Application.generate_typing` now mostly populated by `DataPackage` methods
- Docs are defined alongside expressions
- Factored out repetitive code into `spell_literal_alias`
- `Metadata` examples table is now generated inside the doc

* test: `test_datasets` overhaul

- Eliminated all flaky tests
- Mocking more of the internals that is safer to run in parallel
- Split out non-threadsafe tests with `@no_xdist`
- Huge performance improvement for the slower tests
- Added some helper functions (`is_*`) where common patterns were identified
- **Removed skipping from native `pandas` backend**
  - Confirms that its now safe without `pyarrow` installed

* refactor: Reuse `tools.fs` more, fix `app.(read|scan)`

Using only `.parquet` was relevant in earlier versions that produced multiple `.parquet` files
Now these methods safely handle all formats in use

* feat(typing): Set `"polars"` as default in `Loader.from_backend`

Without a default, I found that VSCode was always suggesting the **last** overload first (`"pyarrow"`)
This is a bad suggestion, as it provides the *worst native* experience.

The default now aligns with the backend providing the *best native* experience

* docs: Adds module-level doc to `altair.datasets`

- Multiple **brief** examples, for a taste of the public API
  - See (#3763)
- Refs to everywhere a first-time user may need help from
- Also aligned the (`Loader`|`load`) docs w/ each other and the new phrasing here

* test: Clean up `test_datasets`

- Reduce superfluous docs
- Format/reorganize remaining docs
- Follow up on some comments
- Misc style changes

* docs: Make `sphinx` happy with docs

These changes are very minor in VSCode, but fix a lot of rendering issues on the website

* refactor: Add `find_spec` fastpath to `is_available`

Have a lot of changes locally that use `find_spec`, but would prefer a single name associated with this action
The actual spec is never relevant for this usage
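
The fast path amounts to a metadata lookup without importing the module - roughly (a sketch; the real helper name and signature may differ):

```py
from importlib.util import find_spec


def is_available(module_name: str) -> bool:
    # Checks that the module can be found, without actually importing it.
    return find_spec(module_name) is not None
```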

* feat(DRAFT): Private API overhaul

**Public API is unchanged**
Core changes are to simplify testing and extension:

- `_readers.py` -> `_reader.py`
  - w/ two new support modules `_constraints`, and `_readimpl`
- Functions (`BaseImpl`) are declared with what they support (`include`) and restrictions (`exclude`) on that subset
  - Transforms a lot of the imperative logic into set operations
- Greatly improved `pyarrow` support
  - Utilize schema
  - Provides additional fallback `.json` implementations
  - `_stdlib_read_json_to_arrow` finally resolves `"movies.json"` issue

* refactor: Simplify obsolete paths in `CsvCache`

They were an artifact of *previously* using multiple `vega-dataset` versions in `.parquet` - but only the most recent in `.csv.gz`

Currently both store the same range of names, so this error handling never triggered

* chore: add workaround for `narwhals` bug

Opened (narwhals-dev/narwhals#1897)
Marking (#3631 (comment)) as resolved

* feat(typing): replace `(Read|Scan)Impl` classes with aliases

- Shorter names `Read`, `Scan`
- The single unique method is now `into_scan`
- There was no real need to have concrete classes when they behave the same as parent

* feat: Rename, docs `unwrap_or` -> `unwrap_or_skip`

* refactor: Replace `._contents` w/ `.__str__()`

Inspired by https://github.com/pypa/packaging/blob/8510bd9d3bab5571974202ec85f6ef7b0359bfaf/src/packaging/requirements.py#L67-L71

* fix: Use correct type for `pyarrow.csv.read_csv`

Resolves:
```py
File ../altair/.venv/Lib/site-packages/pyarrow/csv.pyx:1258, in pyarrow._csv.read_csv()
TypeError: Cannot convert dict to pyarrow._csv.ParseOptions
```

* docs: Add docs for `Read`, `Scan`, `BaseImpl`

* docs: Clean up `_merge_kwds`, `_solve`

* refactor(typing): Include all suffixes in `Extension`

Also simplifies and removes outdated `Extension`-related tooling

* feat: Finish `Reader.profile`

- Reduced the scope a bit, now just un/supported
- Added `pprint` option
- Finished docs, including example pointing to use `url(...)`

* test: Use `Reader.profile` in `is_polars_backed_pyarrow`

* feat: Clean up, add tests for new exceptions

* feat: Adds `Reader.open_markdown`

- Will be even more useful after merging vega/vega-datasets#663
- Thinking this is a fair tradeoff vs inlining the descriptions into `altair`
  - All the info is available and it is quicker than manually searching the headings in a browser

* docs: fix typo

Resolves #3631 (comment)

* fix: fix typo in error message

#3631 (comment)

* refactor: utilize narwhals fix

narwhals-dev/narwhals#1934

* refactor: utilize `nw.Implementation.from_backend`

See narwhals-dev/narwhals#1888

* feat(typing): utilize `nw.LazyFrame` working `TypeVar`

Possible since narwhals-dev/narwhals#1930

@MarcoGorelli if you're interested what that PR did (besides fix warnings 😉)

* docs: Show less data in examples

* feat: Update for `[email protected]`

Made possible via vega/vega-datasets#681

- Removes temp files
- Removes some outdated apis
- Remove test based on removed `"points"` dataset

* refactor: replace `SchemaCache.schema_pyarrow` -> `nw.Schema.to_arrow`

Related
- narwhals-dev/narwhals#1924
- #3631 (comment)

* feat(typing): Properly annotate `dataset_name`, `suffix`

Makes more sense following (755ab4f)

* chore: bump `vega-datasets==3.1.0`

* test(typing): Ignore `_pytest` imports for `pyright`

See microsoft/pyright#10248 (comment)

* feat: Basic `geopandas` impl

Still need to update tests

* fix: Add missing `v` prefix to url

* test: Update `test_spatial`

* ci: Try pinning locked `ruff`

https://github.com/vega/altair/actions/runs/14478364865/job/40609439929

* ci(uv): Add `--group geospatial`

* chore: Reduce `geopandas` pin

* feat: Basic `polars-st` impl

- Seems to work pretty similarly to `geopandas`
- The repr isn't as clean
- Pretty cool that you can get *something* from `load("us-10m").st.plot()`

* ci(typing): `mypy` ignore `polars-st`

https://github.com/vega/altair/actions/runs/14494920661/job/40660098022?pr=3631

* build against vega-datasets 3.2.0

* run generate-schema-wrapper

* prevent infinite recursion in _split_markers

* sync to v6

* resolve doctest on lower python versions

* resolve comment in github action

* changed examples to modern interface to pass docbuild

---------

Co-authored-by: dangotbanned <[email protected]>

Labels

enhancement New feature or request

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Enh]: nw.(DType|Schema) conversion API

2 participants