fix: Fix performance regression for DataFrame serialization/pickling#20641
fix: Fix performance regression for DataFrame serialization/pickling#20641ritchie46 merged 6 commits intopola-rs:mainfrom
Conversation
| /// maximum number of rows per chunk to ensure reasonable memory efficiency when | ||
| /// reading the resulting file, and a minimum size per chunk to ensure | ||
| /// reasonable performance when writing. | ||
| pub fn chunk_df_for_writing( |
There was a problem hiding this comment.
Code is moved from polars-io
|
|
||
| #[cfg(feature = "serde")] | ||
| #[test] | ||
| fn test_deserialize_height_validation_8751() { |
There was a problem hiding this comment.
Test no longer works as serialization directly errors now on mismatching height
| D::Error::custom::<PolarsError>(e) | ||
| }) | ||
| impl DataFrame { | ||
| pub fn serialize_into_writer(&mut self, writer: &mut dyn std::io::Write) -> PolarsResult<()> { |
There was a problem hiding this comment.
Serialization logic moved to impl DataFrame rather than on impl Series.
| .serialize_into_writer(&mut bytes) | ||
| .map_err(S::Error::custom)?; | ||
|
|
||
| serializer.serialize_bytes(bytes.as_slice()) |
There was a problem hiding this comment.
This is where the extra memcopy happens when going through serde - we are calling Serializer::serialize_bytes(bytes: &[u8])
| } | ||
| py.allow_threads(|| { | ||
| self.df | ||
| .serialize_into_writer(&mut writer) |
There was a problem hiding this comment.
Bypass serde when serializing DataFrame - this is essentially what we did in older versions
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #20641 +/- ##
==========================================
- Coverage 79.03% 79.01% -0.03%
==========================================
Files 1557 1557
Lines 220528 220547 +19
Branches 2510 2513 +3
==========================================
- Hits 174303 174272 -31
- Misses 45651 45702 +51
+ Partials 574 573 -1 ☔ View full report in Codecov by Sentry. |
|
@nameexhaustion Timings look great, thank you for all the effort! |
|
Seems like I'm getting |
|
Did you serialize with a different version? Polars pickle is not stable between versions. |
|
Yes, that is the case. Thanks for the tip. |
|
@nameexhaustion Any idea why I could be getting weird results when converting to a Nodejs object using napi? |
I can't answer this as I am not familiar with Node.js. What I can say is that we changed the serialization format recently. If you were serializing to JSON, it should still be valid JSON, but it will contain an integer array representing IPC bytes instead of the values in the columns. It may help if you provide an example of what you are getting before vs now, or if you check with the nodejs polars repo. |
|
A pity https://docs.rs/serde-pickle/latest/serde_pickle/ does not support writing Pickle v5 with out-of-band data: https://peps.python.org/pep-0574/ Old pyarrow commit that introduced experimental zero-copy pickling: |
This is an example of before and after: |
@nameexhaustion Any idea how to convert back to ASCII JSON? I need it for |
We removed the serialization code that used to output the old format. In general, there are no guarantees as to the output format of serialization - we only guarantee that it can be deserialized back into memory, and it should only be used for such a purpose. Can you elaborate on how you were using the old format? I believe you should be able to retrieve all the information you need by deserializing back into memory. |
The change you are observing is due to If you need JSON output in the format you are describing, this is something that should be explicitly handled downstream in nodejs-polars. What I can suggest is that if you checkout the polars repo to 2f6c23b (or earlier), you can potentially vendor a lot of the logic from the old serialization code ( If you have any further questions, I would highly recommend opening a separate issue - they are generally more visible than going through comments on a pull request, and allow us to easily refer back to them in the future. |
Fixes #20605
PR notes
impl serde::ser::Serialize for DataFrametoDataFrame::serialize_into_readeretc.DataFramefrom Python.Serializerfor provide a raw buffer that we can write into)deserialize_map_bytesutility function to reduce unnecessary memcopy during deserialization. This replaces some existing code that unnecessarily copied to an ownedVec<u8>during deserialization.Performance testing
(build-debug-release)
*Note: For % comparisons, lower is better
Observations
DataFrameserialization with the the runtime of the correspondingLazyFrameserialization (LazyFrames must use serde as they serialize through aDslPlan)