@alexander-beedie
Collaborator

@alexander-beedie alexander-beedie commented May 7, 2025

Added a path for from_records (and DataFrame/Series construction) that can handle conversion from PyMapping in the Rust/pyO3 layer (currently we only handle PyDict).

This opens up efficient init from other useful Python record types that look like dicts to the caller (and support the Mapping protocol), but that we couldn't load as such (SQLAlchemy RowMapping, for example).
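To make the dict/Mapping distinction concrete, here is a minimal stdlib-only sketch. It uses `types.MappingProxyType` purely as a stand-in for a third-party record type such as RowMapping; that substitution is an assumption made for illustration and is not taken from this PR.

```python
from collections.abc import Mapping
from types import MappingProxyType

# MappingProxyType is a read-only stdlib view over a dict. It stands in here
# (as an illustrative assumption) for record types like SQLAlchemy's RowMapping.
record = MappingProxyType({"Name": "Alice", "Age": 38})

is_mapping = isinstance(record, Mapping)  # supports the Mapping protocol
is_dict = isinstance(record, dict)        # but is not a dict (sub)class

# The caller-side conversion that a dict-only loader forces on the user.
as_dict = dict(record)

print(is_mapping, is_dict, as_dict)
# True False {'Name': 'Alice', 'Age': 38}
```

An object like this fails a dict-only check even though it behaves like a dict to the caller, which is why the conversion previously had to happen on the Python side.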

Also:
Fixes two long-standing errors with from_records, adding test coverage for both:

  1. Incorrectly loading as Struct if the first record was a None value.
  2. Failing to load at all if a None value was present after the first record:
    TypeError: 'NoneType' object cannot be converted to 'PyDict'

And:
Optimises PyDict value lookup (by precomputing the keys as PyString so we don't create them in the row-building loop).
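The key-precomputation idea can be sketched in pure Python as a rough analogy. This is not the Rust/pyO3 implementation: `records_to_columns` is a hypothetical helper, and treating a None record as a row of nulls is an assumption made for this sketch.

```python
from collections.abc import Mapping
from typing import Any, Optional, Sequence


def records_to_columns(
    records: Sequence[Optional[Mapping[str, Any]]],
) -> dict[str, list[Any]]:
    """Hypothetical helper: pivot row records into per-column lists."""
    # Derive the keys once, from the first non-None record, so that the
    # row-building loop below never has to recreate them.
    first = next((r for r in records if r is not None), None)
    keys = tuple(first) if first is not None else ()
    columns: dict[str, list[Any]] = {k: [] for k in keys}
    for rec in records:
        for k in keys:
            # Assumption for this sketch: a None record yields a null in
            # every column instead of raising.
            columns[k].append(None if rec is None else rec.get(k))
    return columns


cols = records_to_columns([None, {"a": 1, "b": "x"}, {"a": 2, "b": "y"}])
print(cols)
# {'a': [None, 1, 2], 'b': [None, 'x', 'y']}
```

Because the loop only relies on Mapping methods, the same code path serves dicts and non-dict Mapping objects alike.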

Example

from collections.abc import Mapping
from datetime import date
from typing import Any, Iterator

import polars as pl

class MappingObject(Mapping):
    def __init__(self, **values: Any) -> None:
        self._data = {**values}

    def __getitem__(self, key: str) -> Any:
        return self._data[key]

    def __iter__(self) -> Iterator[str]:
        yield from self._data

    def __len__(self) -> int:
        return len(self._data)

mapping_data = [
    MappingObject(Name="Alice", Age=38, DOB=date(1987,3,5)),
    MappingObject(Name="Bob", Age=20, DOB=date(2005,5,2)),
    MappingObject(Name="Charles", Age=32, DOB=date(1993,1,18)),
] * 100_000

dict_data = [dict(d) for d in mapping_data]

Timings 🕐

(Tested with local make build-dist-release binary).

Native conversion lets us ingest Mapping data about 2x faster than having the user convert it to dict first (124 ms vs 241 ms below). We retain the separate dict fast-path, which remains the fastest option (79.7 ms).

%timeit df = pl.from_records([dict(d) for d in mapping_data])
# 241 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit df = pl.from_records(mapping_data)
# 124 ms ± 1.98 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit df = pl.from_records(dict_data)
# 79.7 ms ± 1.74 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

@github-actions github-actions bot added enhancement New feature or an improvement of an existing feature python Related to Python Polars rust Related to Rust Polars labels May 7, 2025
@alexander-beedie alexander-beedie added the performance Performance issues or improvements label May 7, 2025
@alexander-beedie alexander-beedie marked this pull request as draft May 7, 2025 08:20
@alexander-beedie
Collaborator Author

alexander-beedie commented May 7, 2025

Looks like it needs a small update to handle None entries; will take care of it.

Update: done...

@alexander-beedie alexander-beedie force-pushed the init-from-non-dict-mapping-objs branch from 39d105e to d834318 Compare May 7, 2025 08:29
@codecov

codecov bot commented May 7, 2025

Codecov Report

Attention: Patch coverage is 96.90722% with 3 lines in your changes missing coverage. Please review.

Project coverage is 80.99%. Comparing base (e71569a) to head (2e01701).
Report is 23 commits behind head on main.

Files with missing lines Patch % Lines
crates/polars-python/src/dataframe/construction.rs 96.51% 3 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #22638      +/-   ##
==========================================
+ Coverage   80.98%   80.99%   +0.01%     
==========================================
  Files        1661     1661              
  Lines      234869   234937      +68     
  Branches     2773     2774       +1     
==========================================
+ Hits       190198   190291      +93     
+ Misses      44004    43978      -26     
- Partials      667      668       +1     


@alexander-beedie alexander-beedie force-pushed the init-from-non-dict-mapping-objs branch 2 times, most recently from 3e6a202 to 755f03b Compare May 7, 2025 11:20
@alexander-beedie alexander-beedie marked this pull request as ready for review May 7, 2025 11:20
@alexander-beedie alexander-beedie requested a review from orlp as a code owner May 7, 2025 11:20
@alexander-beedie alexander-beedie changed the title feat: Support optimised init from non-dict Mapping objects in from_records and frame/series constructors feat: Support optimised init from non-dict Mapping objects in from_records and frame/series constructors May 7, 2025
@alexander-beedie alexander-beedie force-pushed the init-from-non-dict-mapping-objs branch from 755f03b to db27f40 Compare May 7, 2025 12:22
@alexander-beedie alexander-beedie force-pushed the init-from-non-dict-mapping-objs branch from db27f40 to 2e01701 Compare May 7, 2025 12:24
@ritchie46 ritchie46 merged commit 63ea877 into pola-rs:main May 12, 2025
44 checks passed
@alexander-beedie alexander-beedie deleted the init-from-non-dict-mapping-objs branch May 12, 2025 07:51
dangotbanned added a commit to dangotbanned/polars that referenced this pull request Sep 23, 2025
Closes pola-rs#24583

Downstream in `narwhals`, we discovered the typing wasn't updated alongside the runtime support added in `1.30.0`

### Related
- pola-rs#22638
- pola-rs#19322
- narwhals-dev/narwhals#3148 (comment)
- narwhals-dev/narwhals#3148 (comment)

Labels

enhancement New feature or an improvement of an existing feature performance Performance issues or improvements python Related to Python Polars rust Related to Rust Polars
