GH-40592: [C++][Parquet] Implement SizeStatistics #40594
base: main
cpp/src/parquet/column_writer.cc
Outdated
@@ -1631,6 +1694,12 @@ class TypedColumnWriterImpl : public ColumnWriterImpl, public TypedColumnWriter<
       page_statistics_->UpdateSpaced(values, valid_bits, valid_bits_offset,
                                      num_spaced_values, num_values, num_nulls);
     }
     if constexpr (std::is_same_v<T, ByteArray>) {
       if (page_size_stats_builder_ != nullptr) {
         page_size_stats_builder_->WriteValuesSpaced(values, valid_bits, valid_bits_offset,
I wonder if we could somehow gather this at a lower level (based on the buffer size of the written values) instead of having to handle spaced values separately.
I have thought about this. We have different interfaces to write values of BYTE_ARRAY type:
- dense ByteArray values
- spaced ByteArray values
- arrow::Array of String, Binary, and their large variants
- dictionary-encoded arrow::Array
These interfaces then put values directly into the encoders, so this is the last chance to catch BYTE_ARRAY values before encoding.
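To make the dense and spaced cases concrete, here is a minimal sketch of how the unencoded byte count could be gathered before encoding. `ByteArrayView`, `UnencodedBytesDense`, and `UnencodedBytesSpaced` are hypothetical names for illustration, not the PR's API; the sketch assumes an Arrow-style validity bitmap (least significant bit first).

```cpp
#include <cstdint>

// Hypothetical, simplified stand-in for parquet::ByteArray (length + pointer).
struct ByteArrayView {
  uint32_t len;
  const uint8_t* ptr;
};

// Dense input: every slot holds a non-null value.
int64_t UnencodedBytesDense(const ByteArrayView* values, int64_t num_values) {
  int64_t total = 0;
  for (int64_t i = 0; i < num_values; ++i) {
    total += values[i].len;
  }
  return total;
}

// Spaced input: null slots are present in `values` and must be skipped.
// Assumes an LSB-first validity bitmap where a set bit means "non-null".
int64_t UnencodedBytesSpaced(const ByteArrayView* values, const uint8_t* valid_bits,
                             int64_t valid_bits_offset, int64_t num_spaced_values) {
  int64_t total = 0;
  for (int64_t i = 0; i < num_spaced_values; ++i) {
    const int64_t bit = valid_bits_offset + i;
    const bool is_valid = ((valid_bits[bit / 8] >> (bit % 8)) & 1) != 0;
    if (is_valid) {
      total += values[i].len;
    }
  }
  return total;
}
```

Only the lengths are read, so the payload pointers are never dereferenced in this sketch.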
cpp/src/parquet/column_writer.cc
Outdated
        page_statistics_->Update(*referenced_dictionary, /*update_counts=*/false);
      }
      if (page_size_stats_builder_) {
        page_size_stats_builder_->WriteValues(*referenced_dictionary);
Is this right? Why are we writing the values in the dictionary for page size stats? Maybe add a comment or a clearer name?
Yes, this is right. The binary values are passed in the form of an arrow::DictionaryArray, with the encoded indices in an arrow::Int32Array. Here we need to restore the referenced values in the dictionary array to precisely build page stats and size stats.
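As a rough illustration of resolving indices back to dictionary values, the sketch below sums the byte length of the value each index refers to. It uses plain standard-library containers rather than Arrow arrays, and `UnencodedBytesFromDictionary` is a hypothetical helper. Whether the builder in the PR counts every occurrence or only the distinct referenced values is not stated in this thread; counting every occurrence, and using a negative index for a null slot, are assumptions of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: resolves each index to its dictionary value and sums
// the value lengths. A negative index stands in for a null slot (assumption).
int64_t UnencodedBytesFromDictionary(const std::vector<std::string>& dictionary,
                                     const std::vector<int32_t>& indices) {
  int64_t total = 0;
  for (int32_t idx : indices) {
    if (idx < 0) {
      continue;  // null: contributes no bytes
    }
    // at() throws on an out-of-range index instead of reading past the end.
    total += static_cast<int64_t>(dictionary.at(static_cast<std::size_t>(idx)).size());
  }
  return total;
}
```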
cpp/src/parquet/size_statistics.h
Outdated
  /// \param[in] valid_bits pointer to bitmap representing if values are non-null.
  /// \param[in] valid_bits_offset offset into valid_bits where the slice of data begins.
  /// \param[in] num_spaced_values length of values in values/valid_bits to inspect.
  void WriteValuesSpaced(const ByteArray* values, const uint8_t* valid_bits,
As commented above, I wonder if it is possible not to intertwine spaced values (and the Array option below) into this interface.
The logic is required anyway. Perhaps we can provide only the dense interface here and move the logic for dealing with nulls and arrow arrays to the caller?
A few high level questions/suggestions.
8661324 to 90caf32
Finally this PR is complete on my side. Please take a look when you have time. Thanks! @emkornfield @pitrou @mapleFU
Sorry for the delay @wgtmac. This is a first partial review; I'll go over the rest once these comments are answered or addressed :-)
  /// \param size_statistics pointer to the thrift SizeStatistics structure.
  /// \param descr column descriptor for the column.
  /// \returns SizeStatistics object. Its lifetime is not bound to the input.
  static std::unique_ptr<SizeStatistics> Make(const void* size_statistics,
If you're using the pimpl idiom, then you should just return a SizeStatistics here, since all the implementation is already inside a std::unique_ptr.
Conversely, you could also remove the pimpl idiom and return a subclass here instead. This is better if you want to be able to pass an optionally null pointer, or store a shared_ptr at some point.
I was following the pimpl idiom of class FileMetaData:
arrow/cpp/src/parquet/metadata.h
Lines 275 to 279 in 14384ac
  /// \brief Create a FileMetaData from a serialized thrift message.
  static std::shared_ptr<FileMetaData> Make(
      const void* serialized_metadata, uint32_t* inout_metadata_len,
      const ReaderProperties& properties = default_reader_properties(),
      std::shared_ptr<InternalFileDecryptor> file_decryptor = NULLPTR);
Returning a SizeStatistics instead of std::unique_ptr<SizeStatistics> makes it impossible to store it in a smart pointer, which is contrary to the convention in this codebase.
Returning a subclass requires implementing virtual functions, which would be called frequently, at every batch. This is something I want to avoid.
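For readers less familiar with the two factory shapes under discussion, here is a minimal pimpl sketch showing both side by side. `SizeStatsSketch` and its members are hypothetical and only illustrate the idiom; this is not the class from the PR.

```cpp
#include <cstdint>
#include <memory>
#include <utility>

// Hypothetical class: all state lives behind a private Impl, and the public
// accessors are plain non-virtual forwarding calls.
class SizeStatsSketch {
 public:
  // Factory shape used in the PR: a heap-allocated handle.
  static std::unique_ptr<SizeStatsSketch> Make(int64_t bytes);
  // Factory shape suggested in review: return the pimpl wrapper by value.
  static SizeStatsSketch MakeByValue(int64_t bytes);

  SizeStatsSketch(SizeStatsSketch&&) noexcept;
  SizeStatsSketch& operator=(SizeStatsSketch&&) noexcept;
  ~SizeStatsSketch();

  int64_t unencoded_bytes() const;

 private:
  struct Impl;
  explicit SizeStatsSketch(std::unique_ptr<Impl> impl);
  std::unique_ptr<Impl> impl_;
};

struct SizeStatsSketch::Impl {
  int64_t unencoded_bytes = 0;
};

// Special members are defined here, where Impl is a complete type.
SizeStatsSketch::SizeStatsSketch(std::unique_ptr<Impl> impl) : impl_(std::move(impl)) {}
SizeStatsSketch::SizeStatsSketch(SizeStatsSketch&&) noexcept = default;
SizeStatsSketch& SizeStatsSketch::operator=(SizeStatsSketch&&) noexcept = default;
SizeStatsSketch::~SizeStatsSketch() = default;

std::unique_ptr<SizeStatsSketch> SizeStatsSketch::Make(int64_t bytes) {
  auto impl = std::make_unique<Impl>();
  impl->unencoded_bytes = bytes;
  // The constructor is private, so std::make_unique cannot be used here.
  return std::unique_ptr<SizeStatsSketch>(new SizeStatsSketch(std::move(impl)));
}

SizeStatsSketch SizeStatsSketch::MakeByValue(int64_t bytes) {
  auto impl = std::make_unique<Impl>();
  impl->unencoded_bytes = bytes;
  return SizeStatsSketch(std::move(impl));
}

int64_t SizeStatsSketch::unencoded_bytes() const { return impl_->unencoded_bytes; }
```

In this sketch the first shape costs two allocations (handle plus Impl) and the second costs one; neither needs virtual dispatch.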
cpp/src/parquet/size_statistics.h
Outdated
  /// \brief Add repeated repetition level to the histogram.
  /// \param num_levels number of repetition levels to add.
  /// \param rep_level repeated repetition level value.
  void AddRepetitionLevel(int64_t num_levels, int16_t rep_level);

  /// \brief Add repeated definition level to the histogram.
  /// \param num_levels number of definition levels to add.
  /// \param def_level repeated definition level value.
  void AddDefinitionLevel(int64_t num_levels, int16_t def_level);
Are these two really useful?
Ah, sorry! The name misled me. Can't we name them AddDefinitionLevels and AddRepetitionLevels? Otherwise, these look like they are adding a single level.
Also, not sure why they're taking the explicit rep_level and def_level values. AFAICT, these are only useful to append levels equal to 0.
You're right. They are just used to append level value 0. It might look stranger if I special-cased a new function like AppendDefLevelZero(num_levels). They are also convenient to use in the unit tests, so I am inclined to keep them.
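To show what "adding a repeated level" means for a histogram, here is a small sketch. `LevelHistogramSketch`, `AddRepeatedLevel`, and `AddLevels` are hypothetical names; the sketch assumes the histogram has one counter per level value from 0 to the maximum level.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical histogram with one counter per level value in [0, max_level].
class LevelHistogramSketch {
 public:
  explicit LevelHistogramSketch(int16_t max_level)
      : counts_(static_cast<std::size_t>(max_level) + 1, 0) {}

  // Adds `num_levels` occurrences of the same level value in one step.
  void AddRepeatedLevel(int64_t num_levels, int16_t level) {
    counts_.at(static_cast<std::size_t>(level)) += num_levels;
  }

  // Adds a run of individual level values, one counter bump per value.
  void AddLevels(const int16_t* levels, int64_t num_levels) {
    for (int64_t i = 0; i < num_levels; ++i) {
      counts_.at(static_cast<std::size_t>(levels[i])) += 1;
    }
  }

  const std::vector<int64_t>& counts() const { return counts_; }

 private:
  std::vector<int64_t> counts_;
};
```

The repeated form avoids materializing a buffer of identical levels when, for example, every value in a batch sits at level 0.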
  void AddValuesSpaced(const ByteArray* values, const uint8_t* valid_bits,
                       int64_t valid_bits_offset, int64_t num_spaced_values);

  /// \brief Add dense BYTE_ARRAY values.
  /// \param values pointer to values of BYTE_ARRAY type.
  /// \param num_values length of values.
  void AddValues(const ByteArray* values, int64_t num_values);

  /// \brief Add BYTE_ARRAY values in the arrow array.
  void AddValues(const ::arrow::Array& values);
Wouldn't it be more logical for the BYTE_ARRAY encoders to accumulate the unencoded_byte_array_data_bytes, instead of visiting the input data again here?
There are two cases where BYTE_ARRAY encoders do not work:
- When dictionary encoding is enabled.
- When the input data is in an arrow::DictionaryArray.
Going to do another pass through; the CI failure looks like a formatting issue.
    /// Finalize unencoded_byte_array_data_bytes and make sure page sizes match.
    if (offset_index_.page_locations.size() ==
        offset_index_.unencoded_byte_array_data_bytes.size()) {
      offset_index_.__isset.unencoded_byte_array_data_bytes = true;
Is the check above shorthand for the case where nothing is provided? Do we expect only two states: the sizes always match, or they never match once a page is added?
Yes, it should always match if size stats are enabled. Otherwise, we should expect the list to be empty.
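The two expected states can be spelled out in a small sketch. `OffsetIndexSketch` and `FinalizeUnencodedBytes` are hypothetical stand-ins for the thrift struct and the finalization step; reporting the in-between state as a failure is an assumption of this sketch, since the snippet under review only sets the flag when the sizes match.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for the thrift OffsetIndex struct.
struct OffsetIndexSketch {
  std::vector<int64_t> page_offsets;                     // one entry per page
  std::vector<int64_t> unencoded_byte_array_data_bytes;  // empty, or one per page
  bool has_unencoded_bytes = false;                      // plays the role of __isset
};

// Marks the optional field as present only when there is one size per page.
// Returns false for the unexpected state where some, but not all, pages
// have a size recorded.
bool FinalizeUnencodedBytes(OffsetIndexSketch* index) {
  const auto num_pages = index->page_offsets.size();
  const auto num_sizes = index->unencoded_byte_array_data_bytes.size();
  if (num_sizes == 0) {
    index->has_unencoded_bytes = false;  // size stats disabled: field stays unset
    return true;
  }
  index->has_unencoded_bytes = (num_sizes == num_pages);
  return index->has_unencoded_bytes;
}
```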
LGTM, I'm OK with this as long as @pitrou is. Thank you for driving this.
@emkornfield @mapleFU Thanks for the feedback! I haven't addressed all comments from @pitrou yet. Will let you know once ready for review again.
0449426 to a83ed41
@pitrou @emkornfield @mapleFU Gentle ping :)
Rationale for this change
Parquet format 2.10.0 introduced SizeStatistics. parquet-mr has also implemented this: apache/parquet-java#1177. Now it is time for parquet-cpp to pick up the ball.
What changes are included in this PR?
Implement reading and writing size statistics for parquet-cpp.
Are these changes tested?
Yes, a bunch of test cases have been added.
Are there any user-facing changes?
Yes, now parquet users are able to read and write size statistics.