Optimized writing of byte array to parquet (#1764) (2x faster) #2221
parquet/src/column/writer/encoder.rs
This is moved onto Encoder so that ColumnValues can be a type-erased type, e.g. ArrayRef. This will be critical to support dictionaries without needing GATs, as the TypedDictionary (#2136) contains a lifetime.
parquet/src/column/writer/encoder.rs
This is pushed down from get_numeric_array_slice in arrow writer
Using the new ArrayAccessor 😄
This logic was actually previously incorrect as it would return a bit_width of 1 for num_entries == 1 when it only needed to be 0. This is largely harmless, but is worth fixing.
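To make the off-by-one concrete, here is a minimal sketch of the calculation. It assumes the bit width is the number of bits needed for the largest dictionary index, `num_entries - 1`. The names `num_required_bits` and `dict_index_bit_width` are illustrative and not necessarily the ones used in the crate.

```rust
/// Number of bits needed to represent `x`; 0 when `x == 0`.
fn num_required_bits(x: u64) -> u8 {
    (64 - x.leading_zeros()) as u8
}

/// Hypothetical helper: bit width of the indices into a dictionary with
/// `num_entries` entries. The largest index is `num_entries - 1`, so a
/// dictionary with a single entry (only index 0) needs 0 bits, not 1.
fn dict_index_bit_width(num_entries: usize) -> u8 {
    // saturating_sub keeps the empty dictionary at 0 instead of underflowing.
    num_required_bits(num_entries.saturating_sub(1) as u64)
}
```

With this formulation no special case is needed for `num_entries == 0` or `num_entries == 1`; both yield 0.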
Initially I had the concrete type here, i.e. StringArray. This works, but it would make it difficult to adapt this to preserve dictionaries, as TypedDictionary (#2136) will contain a lifetime, which would then require GATs here.
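A rough, std-only sketch of the idea, not the actual code in this PR: the associated `Values` type is type-erased (here `dyn Any` stands in for something like `ArrayRef`), so it carries no lifetime parameter and the trait needs no GATs. `ValueEncoder`, `Strings` and `ByteArrayEncoder` are hypothetical names invented for this illustration.

```rust
use std::any::Any;

/// Hypothetical stand-in for a concrete array type such as `StringArray`.
struct Strings(Vec<String>);

/// Simplified encoder trait. `Values` is a plain (possibly unsized)
/// associated type with no lifetime parameter, so no GAT is required.
trait ValueEncoder {
    type Values: ?Sized;

    /// Encode the values found at `indices`.
    fn write_gather(&mut self, values: &Self::Values, indices: &[usize]);
}

struct ByteArrayEncoder {
    buffer: Vec<u8>,
}

impl ValueEncoder for ByteArrayEncoder {
    // Type-erased input: the encoder, not the trait, knows the concrete type.
    type Values = dyn Any;

    fn write_gather(&mut self, values: &Self::Values, indices: &[usize]) {
        // Recover the concrete type inside the encoder by downcasting.
        let strings = values
            .downcast_ref::<Strings>()
            .expect("unsupported array type in this sketch");
        for &i in indices {
            let bytes = strings.0[i].as_bytes();
            // PLAIN encoding of BYTE_ARRAY: 4-byte little-endian length, then the bytes.
            self.buffer.extend_from_slice(&(bytes.len() as u32).to_le_bytes());
            self.buffer.extend_from_slice(bytes);
        }
    }
}
```

Because the downcast happens inside the encoder, a dictionary-backed input could later be handled there as another branch without changing the trait signature.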
Codecov Report
@@ Coverage Diff @@
## master #2221 +/- ##
==========================================
- Coverage 82.29% 82.27% -0.02%
==========================================
Files 244 245 +1
Lines 62443 62654 +211
==========================================
+ Hits 51386 51549 +163
- Misses 11057 11105 +48
af94d6c to f90e5ae
Removing these fixes #1753
See https://github.com/apache/parquet-format/blob/master/Encodings.md for what the various encodings are. They are all relatively self-explanatory.
f90e5ae to 7d6a5b9
Benchmark runs are scheduled for baseline = 42b15a8 and contender = 2c09ba4. 2c09ba4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Part of #1764
Closes #1753
Rationale for this change
Writing byte arrays to parquet is now about 2x faster, and there is still low-hanging fruit for optimisation here.
What changes are included in this PR?
Switches encoding arrow arrays to a specialized write path
Are there any user-facing changes?
No