@orlp
Member

@orlp orlp commented May 30, 2025

Note

TLDR

Categoricals have been completely reimplemented to be streaming-compatible and to fit better into the Polars data model. They should generally be faster, more stable, and more reliable. Physical ordering and the String Cache are gone. See #22568 for more context.

Fixes #3036.
Fixes #14247.
Fixes #14996.
Fixes #15293.
Fixes #15781.
Fixes #17479.
Fixes #17643.
Fixes #18065.
Fixes #18501.
Fixes #19868.
Fixes #19943.
Fixes #20290.
Fixes #20318.
Fixes #20364.
Fixes #20562.
Fixes #20878.
Fixes #20931.
Fixes #21175.
Fixes #21583.
Fixes #22448.
Fixes #22586.
Fixes #22664.
Fixes #22830.
Fixes #23015.
Fixes #23071.
Fixes #23289.

This PR essentially replaces the entire Categorical/Enum implementation. Unfortunately, some breakage was unavoidable:

  • Physical ordering for Categoricals has been removed; the ordering is now always lexical. The parameter has been deprecated: passing "physical" as the ordering is not a hard error, it simply no longer has any effect.
  • A new file format for Parquet is introduced. Older Parquet files can still be read, but new files containing Enums will be read back as Categoricals by older versions of Polars.
  • Casts between Categorical and integer types now always refer to the physical categories. These casts will be deprecated and removed at a later stage, once we have dedicated functions to go to/from categories. The casts to/from String still exist and will remain; all other casts have been removed.
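
To make the first and third points concrete, here is a minimal pure-Python sketch of a categorical column stored as integer codes plus a list of categories. It is an illustration only, not the Polars implementation: the helper `encode` and the variables `codes` and `categories` are hypothetical, and assigning codes in order of first appearance is an assumption made for the example. It contrasts sorting by the physical code with sorting by the string each code stands for (lexical).

```python
def encode(values):
    """Assign each distinct string an integer code in order of first appearance.

    Hypothetical helper for illustration; real code assignment may differ.
    """
    categories = []  # code -> string
    index = {}       # string -> code
    codes = []
    for v in values:
        if v not in index:
            index[v] = len(categories)
            categories.append(v)
        codes.append(index[v])
    return codes, categories


codes, categories = encode(["pear", "apple", "pear", "fig"])

# "Physical" ordering: sort by the integer code itself.
physical_sorted = [categories[c] for c in sorted(codes)]

# Lexical ordering: sort by the string each code represents.
lexical_sorted = [categories[c] for c in sorted(codes, key=lambda c: categories[c])]

print(codes)            # [0, 1, 0, 2]
print(physical_sorted)  # ['pear', 'pear', 'apple', 'fig']
print(lexical_sorted)   # ['apple', 'fig', 'pear', 'pear']
```

In this picture, a cast from the categorical to an integer type corresponds to exposing `codes` directly, whereas a cast to String corresponds to looking each code up in `categories`.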

The concept of local and global categories is gone. The StringCache still exists in Python but no longer does anything, and it will be deprecated and removed later.
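
For readers unfamiliar with the distinction, the following pure-Python sketch shows roughly why local and global mappings mattered in the first place. It is a conceptual illustration under stated assumptions, not Polars code: `encode` and `shared` are hypothetical names. Two columns encoded independently can give the same code to different strings, while a shared mapping keeps the codes comparable.

```python
def encode(values, index=None):
    """Encode strings as integer codes, optionally reusing a shared mapping.

    Hypothetical helper: `index` maps string -> code and is mutated in place.
    """
    index = {} if index is None else index
    codes = [index.setdefault(v, len(index)) for v in values]
    return codes, index


# "Local" mappings: each column builds its own string -> code table.
a_local, _ = encode(["x", "y"])
b_local, _ = encode(["y", "x"])
print(a_local, b_local)  # [0, 1] [0, 1]  (same codes, different strings)

# A shared ("global") mapping: both columns agree on what each code means.
shared = {}
a_shared, _ = encode(["x", "y"], shared)
b_shared, _ = encode(["y", "x"], shared)
print(a_shared, b_shared)  # [0, 1] [1, 0]
```

With local mappings, comparing or combining the two columns by code would be wrong without first reconciling the tables; a shared mapping avoids that step.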

In a future PR we will expose the capabilities of the new Categories system, which lets you specify in the DataType which columns should share the same categorical mapping.

@github-actions github-actions bot added internal An internal refactor or improvement python Related to Python Polars rust Related to Rust Polars labels May 30, 2025
@orlp orlp force-pushed the cat-rework branch 5 times, most recently from 72307c2 to 863cf09 Compare June 6, 2025 13:37
@orlp orlp force-pushed the cat-rework branch 4 times, most recently from ddb7532 to 9036ef6 Compare July 1, 2025 10:02
@codecov

codecov bot commented Jul 3, 2025

Codecov Report

Attention: Patch coverage is 78.91129% with 523 lines in your changes missing coverage. Please review.

Project coverage is 80.87%. Comparing base (348a34d) to head (0644413).
Report is 9 commits behind head on main.

Files with missing lines | Patch % | Lines
crates/polars-core/src/datatypes/any_value.rs | 29.72% | 52 Missing ⚠️
crates/polars-row/src/encode.rs | 52.04% | 47 Missing ⚠️
crates/polars-row/src/variable/utf8.rs | 0.00% | 47 Missing ⚠️
...s-core/src/chunked_array/comparison/categorical.rs | 84.43% | 40 Missing ⚠️
...ars-core/src/series/implementations/categorical.rs | 86.25% | 29 Missing ⚠️
crates/polars-dtype/src/categorical/mod.rs | 86.80% | 26 Missing ⚠️
...tes/polars-core/src/series/implementations/time.rs | 66.17% | 23 Missing ⚠️
crates/polars-core/src/frame/column/mod.rs | 9.09% | 20 Missing ⚠️
...polars-core/src/series/implementations/duration.rs | 64.81% | 19 Missing ⚠️
...s-core/src/chunked_array/builder/list/anonymous.rs | 31.81% | 15 Missing ⚠️
... and 48 more
Additional details and impacted files
@@            Coverage Diff             @@
##             main   #23016      +/-   ##
==========================================
+ Coverage   80.68%   80.87%   +0.18%     
==========================================
  Files        1645     1632      -13     
  Lines      221895   220133    -1762     
  Branches     2783     2782       -1     
==========================================
- Hits       179036   178027    -1009     
+ Misses      42197    41445     -752     
+ Partials      662      661       -1     

@ritchie46
Member

Yeah bu'

@ritchie46 ritchie46 merged commit 5246d17 into pola-rs:main Jul 3, 2025
33 checks passed
@ritchie46 ritchie46 added the highlight Highlight this PR in the changelog label Jul 3, 2025
Collaborator

@coastalwhite coastalwhite left a comment

Some small comments. Super nice to land this! Could you do a doc update maybe?

@orlp
Member Author

orlp commented Jul 4, 2025

@coastalwhite I addressed most of your concerns, please respond to the others.

dhimmel added a commit to dhimmel/openskistats that referenced this pull request Oct 29, 2025

Labels

highlight (Highlight this PR in the changelog), internal (An internal refactor or improvement), python (Related to Python Polars), rust (Related to Rust Polars)

Projects

None yet

3 participants