Skip to content

Conversation

@tustvold
Copy link
Contributor

@tustvold tustvold commented Jul 29, 2022

Which issue does this PR close?

Part of #1764
Closes #1753

Rationale for this change

write_batch primitive/4096 values string                                                                            
                        time:   [482.71 us 482.93 us 483.18 us]
                        thrpt:  [164.72 MiB/s 164.81 MiB/s 164.88 MiB/s]
                 change:
                        time:   [-51.256% -51.214% -51.172%] (p = 0.00 < 0.05)
                        thrpt:  [+104.80% +104.98% +105.16%]
                        Performance has improved.
write_batch primitive/4096 values string non-null                                                                            
                        time:   [497.39 us 497.69 us 498.02 us]
                        thrpt:  [157.85 MiB/s 157.96 MiB/s 158.05 MiB/s]
                 change:
                        time:   [-51.444% -51.392% -51.343%] (p = 0.00 < 0.05)
                        thrpt:  [+105.52% +105.73% +105.95%]
                        Performance has improved.

And there is still low-hanging fruit for optimisation here

What changes are included in this PR?

Switches encoding arrow arrays to a specialized write path

Are there any user-facing changes?

No

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is moved onto Encoder so that ColumnValues can be a type-erased type, e.g. ArrayRef. This will be critical to support dictionaries without needing GATs, as the TypedDictionary (#2136) contains a lifetime.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is pushed down from get_numeric_array_slice in arrow writer

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using the new ArrayAccessor 😄

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This logic was actually previously incorrect as it would return a bit_width of 1 for num_entries == 1 when it only needed to be 0. This is largely harmless, but is worth fixing.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 29, 2022
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Initially I had the concrete type here, i.e. StringArray. This works, however, would present difficulties in adapting this to preserve dictionaries, as TypedDictionary (#2136) will contain a lifetime, which would then require GATs here

@codecov-commenter
Copy link

codecov-commenter commented Jul 29, 2022

Codecov Report

Merging #2221 (22f52cd) into master (b879977) will decrease coverage by 0.01%.
The diff coverage is 81.61%.

@@            Coverage Diff             @@
##           master    #2221      +/-   ##
==========================================
- Coverage   82.29%   82.27%   -0.02%     
==========================================
  Files         244      245       +1     
  Lines       62443    62654     +211     
==========================================
+ Hits        51386    51549     +163     
- Misses      11057    11105      +48     
Impacted Files Coverage Δ
parquet/src/arrow/arrow_writer/byte_array.rs 76.72% <76.72%> (ø)
parquet/src/column/writer/mod.rs 92.85% <84.61%> (-0.15%) ⬇️
parquet/src/column/writer/encoder.rs 89.01% <91.42%> (+1.35%) ⬆️
parquet/src/arrow/arrow_writer/mod.rs 97.66% <100.00%> (+0.01%) ⬆️
parquet/src/encodings/encoding/dict_encoder.rs 90.74% <100.00%> (-0.49%) ⬇️
parquet/src/util/interner.rs 91.66% <100.00%> (+0.75%) ⬆️
parquet_derive/src/parquet_field.rs 65.75% <0.00%> (ø)
parquet/src/data_type.rs 74.62% <0.00%> (+0.21%) ⬆️
arrow/src/datatypes/datatype.rs 62.61% <0.00%> (+0.31%) ⬆️
arrow/src/array/array_binary.rs 95.45% <0.00%> (+0.64%) ⬆️
... and 2 more

Help us with your feedback. Take ten seconds to tell us how you rate us.

@tustvold tustvold force-pushed the faster-byte-array branch from af94d6c to f90e5ae Compare July 29, 2022 15:38
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing these fixes #1753

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See https://github.com/apache/parquet-format/blob/master/Encodings.md for what the various encodings are. They are all relatively self-explantory

@tustvold tustvold force-pushed the faster-byte-array branch from f90e5ae to 7d6a5b9 Compare July 29, 2022 16:19
@tustvold tustvold marked this pull request as ready for review July 29, 2022 16:27
@tustvold tustvold merged commit 2c09ba4 into apache:master Aug 1, 2022
@ursabot
Copy link

ursabot commented Aug 1, 2022

Benchmark runs are scheduled for baseline = 42b15a8 and contender = 2c09ba4. 2c09ba4 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ec2-t3-xlarge-us-east-2] ec2-t3-xlarge-us-east-2
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on test-mac-arm] test-mac-arm
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-i9-9960x] ursa-i9-9960x
[Skipped ⚠️ Benchmarking of arrow-rs-commits is not supported on ursa-thinkcentre-m75q] ursa-thinkcentre-m75q
Buildkite builds:
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Arrow Parquet BinaryArray Writer Ignores LevelInfo Indices

4 participants