fix: Fix performance regression for DataFrame serialization/pickling by nameexhaustion · Pull Request #20641 · pola-rs/polars

nameexhaustion · 2025-01-09T09:55:56Z

Fixes #20605

PR notes

Moved serialization logic out of the impl serde::ser::Serialize for DataFrame into DataFrame::serialize_into_writer etc.
- This allows us to serialize/deserialize directly without going through serde - we now use this when users serialize a DataFrame from Python.
- In general, serializing through serde causes an extra memcopy that we can't avoid (there's no way to ask a Serializer to provide a raw buffer that we can write into).
Introduces a deserialize_map_bytes utility function to reduce unnecessary memcopy during deserialization. This replaces some existing code that unnecessarily copied to an owned Vec<u8> during deserialization.
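
The deserialization point can be illustrated without serde. The snippet below is a minimal, std-only sketch and not the actual deserialize_map_bytes signature: `map_bytes`, `map_bytes_with_copy` and `checksum` are hypothetical names. The idea it shows is that a closure reads the incoming bytes through a `&[u8]` view, so borrowed input never has to be copied into an owned `Vec<u8>` first.

```rust
use std::borrow::Cow;

/// Applies `f` to the incoming bytes without first copying them.
/// `Cow` models input that is either borrowed from the source or already owned.
fn map_bytes<O>(bytes: Cow<'_, [u8]>, f: impl FnOnce(&[u8]) -> O) -> O {
    // Dereferencing a `Cow` only borrows, in both the Borrowed and Owned cases.
    f(&*bytes)
}

/// The pattern being replaced: always materialise an owned buffer first.
fn map_bytes_with_copy<O>(bytes: Cow<'_, [u8]>, f: impl FnOnce(&[u8]) -> O) -> O {
    let owned: Vec<u8> = bytes.into_owned(); // copies when the input was borrowed
    f(&owned)
}

/// Toy "decoder" for the example: sums the payload bytes.
fn checksum(bytes: &[u8]) -> u64 {
    bytes.iter().map(|&b| u64::from(b)).sum()
}
```

Both helpers return the same result; the difference is only whether a borrowed payload is duplicated in memory before it is decoded.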

Performance testing

(build-debug-release)

shape = (1000000, 19) (yellow_tripdata_2015-01_head1m.csv)
┌──────────────┬─────────────┬───────────────────┬─────────────────────────────┬─────────────────────────────────┐
│ Object       ┆ Operation   ┆ Method            ┆ Runtime (s)                 ┆ Size (bytes)                    │
╞══════════════╪═════════════╪═══════════════════╪═════════════════════════════╪═════════════════════════════════╡
│              ┆             ┆                   ┆         (%1.17.1) (%1.19.0) ┆             (%1.17.1) (%1.19.0) │
│ DataFrame    ┆ serialize   ┆ pickle            ┆ 0.065   (97.0%  ) (14.0%  ) ┆ 214_010_205 (100.0% ) (860.0% ) │
│ DataFrame    ┆ deserialize ┆ pickle            ┆ 0.063   (92.0%  ) (28.0%  ) ┆                                 │
│ DataFrame    ┆ serialize   ┆ cls.serialize()   ┆ 0.064   (41.0%  ) (13.0%  ) ┆ 214_010_144 (200.0% ) (860.0% ) │
│ DataFrame    ┆ deserialize ┆ cls.deserialize() ┆ 0.049   (7.8%   ) (23.0%  ) ┆                                 │
│ LazyFrame    ┆ serialize   ┆ pickle            ┆ 0.08    (49.0%  ) (17.0%  ) ┆ 214_010_712 (200.0% ) (860.0% ) │
│ LazyFrame    ┆ deserialize ┆ pickle            ┆ 0.084   (15.0%  ) (38.0%  ) ┆                                 │
│ LazyFrame    ┆ serialize   ┆ cls.serialize()   ┆ 0.078   (50.0%  ) (16.0%  ) ┆ 214_010_651 (200.0% ) (860.0% ) │
│ LazyFrame    ┆ deserialize ┆ cls.deserialize() ┆ 0.069   (11.0%  ) (31.0%  ) ┆                                 │
│ list[Series] ┆ serialize   ┆ pickle            ┆ 0.069   (86.0%  ) (15.0%  ) ┆ 214_019_863 (100.0% ) (860.0% ) │
│ list[Series] ┆ deserialize ┆ pickle            ┆ 0.049   (73.0%  ) (23.0%  ) ┆                                 │
└──────────────┴─────────────┴───────────────────┴─────────────────────────────┴─────────────────────────────────┘
shape = (4000000, 16) (randints Int64)
┌──────────────┬─────────────┬───────────────────┬─────────────────────────────┬─────────────────────────────────┐
│ Object       ┆ Operation   ┆ Method            ┆ Runtime (s)                 ┆ Size (bytes)                    │
╞══════════════╪═════════════╪═══════════════════╪═════════════════════════════╪═════════════════════════════════╡
│              ┆             ┆                   ┆         (%1.17.1) (%1.19.0) ┆             (%1.17.1) (%1.19.0) │
│ DataFrame    ┆ serialize   ┆ pickle            ┆ 0.2     (140.0% ) (4.9%   ) ┆ 512_025_661 (100.0% ) (160.0% ) │
│ DataFrame    ┆ deserialize ┆ pickle            ┆ 0.1     (89.0%  ) (6.7%   ) ┆                                 │
│ DataFrame    ┆ serialize   ┆ cls.serialize()   ┆ 0.07    (10.0%  ) (1.7%   ) ┆ 512_025_600 (160.0% ) (160.0% ) │
│ DataFrame    ┆ deserialize ┆ cls.deserialize() ┆ 0.087   (4.9%   ) (5.7%   ) ┆                                 │
│ LazyFrame    ┆ serialize   ┆ pickle            ┆ 0.26    (36.0%  ) (6.3%   ) ┆ 512_025_959 (160.0% ) (160.0% ) │
│ LazyFrame    ┆ deserialize ┆ pickle            ┆ 0.16    (11.0%  ) (11.0%  ) ┆                                 │
│ LazyFrame    ┆ serialize   ┆ cls.serialize()   ┆ 0.15    (21.0%  ) (3.6%   ) ┆ 512_025_898 (160.0% ) (160.0% ) │
│ LazyFrame    ┆ deserialize ┆ cls.deserialize() ┆ 0.14    (7.7%   ) (9.0%   ) ┆                                 │
│ list[Series] ┆ serialize   ┆ pickle            ┆ 0.16    (120.0% ) (3.8%   ) ┆ 512_049_408 (100.0% ) (160.0% ) │
│ list[Series] ┆ deserialize ┆ pickle            ┆ 0.086   (75.0%  ) (5.6%   ) ┆                                 │
└──────────────┴─────────────┴───────────────────┴─────────────────────────────┴─────────────────────────────────┘

*Note: Percentages give this PR's value relative to v1.17.1 and v1.19.0 respectively; lower is better

Observations

A lot of the slowdown in runtime was due to enabling compression - this step has now been removed:
- The serialized size is much larger compared to v1.19, but the runtime is drastically reduced. Both metrics are now more in line with what we had in v1.17.1.
The extra overhead of serde can be observed by comparing the runtime of the DataFrame serialization with the runtime of the corresponding LazyFrame serialization (LazyFrames must use serde as they serialize through a DslPlan).

nameexhaustion · 2025-01-09T09:56:28Z

crates/polars-core/src/frame/chunks.rs

+/// maximum number of rows per chunk to ensure reasonable memory efficiency when
+/// reading the resulting file, and a minimum size per chunk to ensure
+/// reasonable performance when writing.
+pub fn chunk_df_for_writing(


Code is moved from polars-io

nameexhaustion · 2025-01-09T09:57:30Z

crates/polars-core/src/frame/mod.rs

-
-    #[cfg(feature = "serde")]
-    #[test]
-    fn test_deserialize_height_validation_8751() {


The test no longer works because serialization itself now errors on mismatching heights.

nameexhaustion · 2025-01-09T10:15:27Z

crates/polars-core/src/serde/df.rs

-            D::Error::custom::<PolarsError>(e)
-        })
+impl DataFrame {
+    pub fn serialize_into_writer(&mut self, writer: &mut dyn std::io::Write) -> PolarsResult<()> {


Serialization logic moved to impl DataFrame rather than on impl Series.

nameexhaustion · 2025-01-09T10:16:46Z

crates/polars-core/src/serde/df.rs

+            .serialize_into_writer(&mut bytes)
+            .map_err(S::Error::custom)?;
+
+        serializer.serialize_bytes(bytes.as_slice())


This is where the extra memcopy happens when going through serde - we are calling Serializer::serialize_bytes(bytes: &[u8])
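
A rough, std-only sketch of why the slice-based call costs a copy. `SliceSink` is a hypothetical stand-in for a serde-style serializer that only accepts a finished `&[u8]`, and `encode_into_writer` is a toy encoder, not the real frame encoder. The sketch contrasts the two paths: buffer then hand over a slice, versus writing straight into the destination.

```rust
use std::io::{self, Write};

/// Hypothetical serde-like sink: it only accepts a finished `&[u8]`,
/// so the caller must already hold the bytes in a buffer.
struct SliceSink {
    out: Vec<u8>,
}

impl SliceSink {
    fn serialize_bytes(&mut self, bytes: &[u8]) {
        // The sink has to copy the slice into its own output.
        self.out.extend_from_slice(bytes);
    }
}

/// Toy encoder: writes each value as 8 little-endian bytes into `writer`.
fn encode_into_writer(values: &[u64], writer: &mut dyn Write) -> io::Result<()> {
    for v in values {
        writer.write_all(&v.to_le_bytes())?;
    }
    Ok(())
}

/// Serde-like path: encode into a temporary buffer, then hand the slice over.
fn serialize_via_sink(values: &[u64], sink: &mut SliceSink) -> io::Result<()> {
    let mut tmp = Vec::new();
    encode_into_writer(values, &mut tmp)?; // first write of the payload
    sink.serialize_bytes(&tmp); // second write: the extra memcopy
    Ok(())
}

/// Direct path: the destination is the writer, so the payload is written once.
fn serialize_direct(values: &[u64], dest: &mut Vec<u8>) -> io::Result<()> {
    encode_into_writer(values, dest)
}
```

Both paths produce identical bytes; the sink path simply touches the whole payload twice.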

nameexhaustion · 2025-01-09T10:22:34Z

crates/polars-python/src/dataframe/serde.rs

-        }
+        py.allow_threads(|| {
+            self.df
+                .serialize_into_writer(&mut writer)


Bypass serde when serializing DataFrame - this is essentially what we did in older versions

codecov · 2025-01-09T11:11:53Z

Codecov Report

Attention: Patch coverage is 81.74387% with 67 lines in your changes missing coverage. Please review.

Project coverage is 79.01%. Comparing base (247f0b1) to head (090c99b).
Report is 5 commits behind head on main.

Files with missing lines	Patch %	Lines
crates/polars-core/src/serde/df.rs	80.00%	23 Missing ⚠️
crates/polars-plan/src/dsl/expr_dyn_fn.rs	38.09%	13 Missing ⚠️
crates/polars-utils/src/pl_serialize.rs	73.68%	10 Missing ⚠️
crates/polars-utils/src/python_function.rs	75.00%	9 Missing ⚠️
crates/polars-core/src/serde/series.rs	84.37%	5 Missing ⚠️
...quet/src/parquet/metadata/column_chunk_metadata.rs	0.00%	5 Missing ⚠️
crates/polars-core/src/frame/chunks.rs	98.03%	1 Missing ⚠️
crates/polars-core/src/frame/column/mod.rs	87.50%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main   #20641      +/-   ##
==========================================
- Coverage   79.03%   79.01%   -0.03%     
==========================================
  Files        1557     1557              
  Lines      220528   220547      +19     
  Branches     2510     2513       +3     
==========================================
- Hits       174303   174272      -31     
- Misses      45651    45702      +51     
+ Partials      574      573       -1


datenzauberai · 2025-01-09T18:10:31Z

@nameexhaustion Timings look great, thank you for all the effort!

Object905 · 2025-01-30T07:33:59Z

Seems like I'm getting polars.exceptions.ComputeError: out-of-spec: NegativeFooterLength in Series.__setstate__ sometimes after this PR (polars 1.20). Will try to dig deeper.

ritchie46 · 2025-01-30T09:24:20Z

Did you serialize with a different version? Polars pickle is not stable between versions.

Object905 · 2025-01-30T13:14:48Z

Yes, that is the case. Thanks for the tip.
I'm using pl.version as cache key, but old version apparently was also loaded by wildcard key...
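
One way to avoid this failure mode, sketched with hypothetical names and a made-up key layout (nothing here is from Polars or from the commenter's code): embed the version in the key and look entries up by exact match only.

```rust
use std::collections::HashMap;

/// Builds a cache key that embeds the library version (hypothetical layout).
fn cache_key(version: &str, name: &str) -> String {
    format!("polars-{version}/{name}")
}

/// Exact-match lookup: only an entry written by the same version is returned.
fn load_exact<'a>(
    cache: &'a HashMap<String, Vec<u8>>,
    version: &str,
    name: &str,
) -> Option<&'a Vec<u8>> {
    cache.get(&cache_key(version, name))
}

/// Wildcard-style lookup that ignores the version. This is the failure mode:
/// it can return bytes pickled by a different release.
fn load_any_version<'a>(cache: &'a HashMap<String, Vec<u8>>, name: &str) -> Option<&'a Vec<u8>> {
    let suffix = format!("/{name}");
    cache
        .iter()
        .find(|(k, _)| k.ends_with(&suffix))
        .map(|(_, v)| v)
}
```

With an entry stored under one version, `load_exact` for another version returns `None` and forces a recompute, while `load_any_version` would hand back the stale bytes.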

Bidek56 · 2025-02-11T00:52:38Z

@nameexhaustion Any idea why I could be getting weird results when converting to a Nodejs object using napi?
I am using env.to_js_value(&self.series) and getting these strange bin objects: [0] _PL_FLAGS��0
Similar things are happening with a DF. It worked fine in rs-0.45. Thx

nameexhaustion · 2025-02-14T04:58:55Z

> @nameexhaustion Any idea why I could be getting weird results when converting to a Nodejs object using napi? I am using env.to_js_value(&self.series) and getting these strange bin objects: [0] _PL_FLAGS��0 Similar things are happening with a DF. It worked fine in rs-0.45. Thx

I can't answer this as I am not familiar with Node.js. What I can say is that we changed the serialization format recently. If you were serializing to JSON, it should still be valid JSON, but it will contain an integer array representing IPC bytes instead of the values in the columns.

It may help if you provide an example of what you are getting before vs now, or if you check with the nodejs polars repo.

ghuls · 2025-02-14T09:25:09Z

A pity https://docs.rs/serde-pickle/latest/serde_pickle/ does not support writing Pickle v5 with out-of-band data: https://peps.python.org/pep-0574/

Old pyarrow commit that introduced experimental zero-copy pickling:
apache/arrow#2161

Bidek56 · 2025-02-14T13:21:27Z

> > @nameexhaustion Any idea why I could be getting weird results when converting to a Nodejs object using napi? I am using env.to_js_value(&self.series) and getting these strange bin objects: [0] _PL_FLAGS��0 Similar things are happening with a DF. It worked fine in rs-0.45. Thx
>
> I can't answer this as I am not familiar with Node.js. What I can say is that we changed the serialization format recently. If you were serializing to JSON, it should still be valid JSON, but it will contain an integer array representing IPC bytes instead of the values in the columns.
>
> It may help if you provide an example of what you are getting before vs now, or if you check with the nodejs polars repo.

This is an example of before and after:

rs-0.45: "{\"columns\":[{\"name\":\"foo\",\"datatype\":\"Float64\",\"bit_settings\":\"\",\"values\":[1]},{\"name\":\"bar\",\"datatype\":\"String\",\"bit_settings\":\"\",\"values\":[\"a\"]}]}"
rs-0.46: "{\"0\":255,\"1\":255,\"2\":255,\"3\":255,\"4\":240,\"5\":0,\"6\":0,\"7\":0,\"8\":4,\"9\":0,\"10\":0,\"11\":0,\"12\":242,\"13\":255,\"14\":255,\"15\":255,\"16\":20,\"17\":0.....:0}"
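
The leading numbers in the rs-0.46 output are consistent with Arrow IPC framing. Assuming these are Arrow IPC bytes, as the earlier reply suggests, an encapsulated message starts with a 4-byte continuation marker (0xFFFFFFFF) followed by a little-endian 32-bit metadata length. A small sketch that decodes the first eight values quoted above (`read_ipc_prefix` is a hypothetical helper, not a Polars or Arrow API):

```rust
/// First eight bytes of the rs-0.46 output quoted above.
const PREFIX: [u8; 8] = [255, 255, 255, 255, 240, 0, 0, 0];

/// Returns the metadata length if `bytes` starts with the IPC continuation
/// marker, or `None` if the input is too short or the marker is missing.
fn read_ipc_prefix(bytes: &[u8]) -> Option<i32> {
    if bytes.len() < 8 {
        return None;
    }
    let marker = u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]);
    if marker != 0xFFFF_FFFF {
        return None; // not an IPC continuation marker
    }
    Some(i32::from_le_bytes([bytes[4], bytes[5], bytes[6], bytes[7]]))
}
```

For the quoted prefix this yields a metadata length of 240, which is what one would expect from a binary IPC payload rather than from column values.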

Bidek56 · 2025-02-22T21:30:25Z

> > > @nameexhaustion Any idea why I could be getting weird results when converting to a Nodejs object using napi? I am using env.to_js_value(&self.series) and getting these strange bin objects: [0] _PL_FLAGS��0 Similar things are happening with a DF. It worked fine in rs-0.45. Thx
> >
> > I can't answer this as I am not familiar with Node.js. What I can say is that we changed the serialization format recently. If you were serializing to JSON, it should still be valid JSON, but it will contain an integer array representing IPC bytes instead of the values in the columns.
> > It may help if you provide an example of what you are getting before vs now, or if you check with the nodejs polars repo.
>
> This is an example of before and after:
> rs-0.45: "{\"columns\":[{\"name\":\"foo\",\"datatype\":\"Float64\",\"bit_settings\":\"\",\"values\":[1]},{\"name\":\"bar\",\"datatype\":\"String\",\"bit_settings\":\"\",\"values\":[\"a\"]}]}"
> rs-0.46: "{\"0\":255,\"1\":255,\"2\":255,\"3\":255,\"4\":240,\"5\":0,\"6\":0,\"7\":0,\"8\":4,\"9\":0,\"10\":0,\"11\":0,\"12\":242,\"13\":255,\"14\":255,\"15\":255,\"16\":20,\"17\":0.....:0}"

@nameexhaustion Any idea how to convert back to ASCII JSON? I need it for nodejs-polars. Thx

nameexhaustion · 2025-02-23T22:21:15Z

> @nameexhaustion Any idea how to convert back to ASCII JSON? I need it for nodejs-polars. Thx

We removed the serialization code that used to output the old format. In general, there are no guarantees as to the output format of serialization - we only guarantee that it can be deserialized back into memory, and it should only be used for such a purpose.

Can you elaborate on how you were using the old format? I believe you should be able to retrieve all the information you need by deserializing back into memory.

Bidek56 · 2025-02-24T01:17:34Z

In nodejs-polars we use JSON.stringify(df); which in turn calls this napi method. Thx

#[napi]
impl JsDataFrame {
    #[napi(catch_unwind)]
    pub fn to_js(&self, env: Env) -> napi::Result<napi::JsUnknown> {
        env.to_js_value(&self.df) // <-- the call in question
    }
    // ....
}

nameexhaustion · 2025-02-24T04:42:56Z

> In nodejs-polars we use JSON.stringify(df); which in turns calls this napi method. Thx

The change you are observing is due to to_js_value() going through serde's Serialize impl, which is what we have changed. The old output format you are asking for happened to work by relying on an internal implementation detail of how we serialized a Series - but this is not something we make any guarantees about (as you have encountered). Internally we moved to IPC to reduce maintenance burden and improve performance.

If you need JSON output in the format you are describing, this is something that should be explicitly handled downstream in nodejs-polars. What I can suggest is that if you check out the polars repo at 2f6c23b (or earlier), you can potentially vendor a lot of the logic from the old serialization code (impl Serialize for Series etc.). I imagine in nodejs-polars you will introduce a newtype wrapper in the serialization step to get the old output.
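
A dependency-free sketch of the newtype idea. Everything here is hypothetical: `Column` is a toy stand-in for a Series, `LegacyJson` is an invented wrapper name, and `Display` is used only so the example runs with std alone. A downstream crate would more likely implement serde's Serialize on the wrapper. The output shape mimics the rs-0.45 example quoted earlier in this thread.

```rust
use std::fmt;

/// Hypothetical, simplified column type; the real Series is far richer.
struct Column {
    name: String,
    datatype: &'static str,
    values: Vec<f64>,
}

/// Newtype wrapper: borrows the data and owns only the output format.
struct LegacyJson<'a>(&'a [Column]);

impl fmt::Display for LegacyJson<'_> {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(f, "{{\"columns\":[")?;
        for (i, col) in self.0.iter().enumerate() {
            if i > 0 {
                write!(f, ",")?;
            }
            // Names are written unescaped: acceptable for a sketch only.
            write!(
                f,
                "{{\"name\":\"{}\",\"datatype\":\"{}\",\"bit_settings\":\"\",\"values\":[",
                col.name, col.datatype
            )?;
            for (j, v) in col.values.iter().enumerate() {
                if j > 0 {
                    write!(f, ",")?;
                }
                write!(f, "{v}")?;
            }
            write!(f, "]}}")?;
        }
        write!(f, "]}}")
    }
}
```

Calling `LegacyJson(&columns[..]).to_string()` then yields the legacy layout, while the library's own Serialize impl stays free to change. The sketch does not handle string columns, nulls, or non-finite floats.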

If you have any further questions, I would highly recommend opening a separate issue - they are generally more visible than going through comments on a pull request, and allow us to easily refer back to them in the future.

c

a57d59e

github-actions bot added fix Bug fix python Related to Python Polars rust Related to Rust Polars labels Jan 9, 2025

nameexhaustion added 2 commits January 9, 2025 21:03

c

3f49ba7

c

442cf4c

nameexhaustion added 3 commits January 9, 2025 21:24

c

c0905ed

c

c3df806

c

090c99b

nameexhaustion added the performance Performance issues or improvements label Jan 9, 2025

nameexhaustion marked this pull request as ready for review January 9, 2025 11:41

nameexhaustion requested review from MarcoGorelli, alexander-beedie, c-peters, coastalwhite, orlp, reswqa and ritchie46 as code owners January 9, 2025 11:41

ritchie46 merged commit 6cd9988 into pola-rs:main Jan 9, 2025
42 checks passed

Bidek56 mentioned this pull request Feb 14, 2025

Upgrading to rs-0.46 pola-rs/nodejs-polars#309

Merged

pola-rs locked as resolved and limited conversation to collaborators Feb 24, 2025
