-
Notifications
You must be signed in to change notification settings - Fork 191
PARQUET-593: Add API for writing Page statistics #129
Conversation
src/parquet/column/page.h
Outdated
|
|
||
| virtual void Close() = 0; | ||
| // TODO think of a better way to pass column chunk statistics | ||
| virtual void Close(const std::shared_ptr<RowGroupStatistics>& chunk_statistics) = 0; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The API is admittedly rather ugly here, because PageWriter operates on raw bytes and not values. It could handle column chunk statistics by itself, but then it would have to be initialized somewhere with a concrete type.
|
Great to make progress on this -- a few nits, but main concern is working to create more thorough test cases verifying correctness of reads and writes (since databases use these statistics for to know when to skip pages / row groups, etc.). |
|
Thank you, I will review the updated patch here in short order. |
|
Thanks for your time.
|
| Encoding::type encoding; | ||
| Compression::type codec; | ||
| bool collect_statistics; | ||
| }; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is strictly POD, this should be a struct (note google style guidelines on member variable names for structs vs classes)
|
@lomereiter @piyushnarang looks like we should do an unsigned binary comparison here. @lomereiter it looks like comparators are the only thing left -- you should wait to rebase until PARQUET-573 is merged. thanks for patience! |
|
sorry looks like we still need to computed the signed max/min -- we can come back and add the unsigned statistics when the format version drops |
|
Yeah there's been a bit of iteration on the parquet-format PR :-) |
|
@lomereiter let's proceed and compute only the signed max/min in this patch until the dust settles on parquet-format? |
|
@wesm ok, I'll rebase and add comparison tests this weekend so that it can be merged finally |
b80b07e to
45e418a
Compare
|
So far rebased only, will get back to tests and further cleanup next evening. |
8adc287 to
2594141
Compare
|
Switched to using signed min/max for byte array types, added a few comparison tests. Some inefficiencies in read/write paths remain, but are topics for separate JIRAs. |
|
Awesome, will review when I can. |
| friend class SerializedPageReader; | ||
| std::string max_; | ||
| std::string min_; | ||
| EncodedStatistics statistics_; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To avoid copying a lot of data, this would probably be better as a shared_ptr.
src/parquet/column/properties.h
Outdated
| static constexpr bool DEFAULT_IS_DICTIONARY_ENABLED = true; | ||
| static constexpr int64_t DEFAULT_DICTIONARY_PAGE_SIZE_LIMIT = DEFAULT_PAGE_SIZE; | ||
| static constexpr int64_t DEFAULT_WRITE_BATCH_SIZE = 1024; | ||
| static constexpr bool DEFAULT_COLLECT_STATISTICS = false; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are they enabled as default in parquet-mr? We should probably stick to the same defaults here. Collecting statistics costs a bit but will improve query times on the files a lot.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I checked and they are indeed enabled by default. However, if values of min/max are large (> 4KB) they are not written (https://issues.apache.org/jira/browse/PARQUET-372).
src/parquet/column/properties.h
Outdated
| static constexpr Compression::type DEFAULT_COMPRESSION_TYPE = Compression::UNCOMPRESSED; | ||
|
|
||
| using ColumnCodecs = std::unordered_map<std::string, Compression::type>; | ||
| class PARQUET_EXPORT ColumnSettings { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm ok with this class (although it should be named ColumnProperties) but not with the outcome at the end. Currently it is possible to select a custom encoding for a column but fallback to the file-global defaults for all other settings. With this new class, once we have set one property on a column-basis, for this column we have to either set all other file-global defaults explicitly on this column or we'll get the parquet-cpp defaults.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't understand this objection since WriterProperties::Builder interface is unchanged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The interface is the same, just the behaviour changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gotcha. AFAIU all setters should first set column_settings_[path] to default_column_settings_.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will still not be enough. You will also need an isset for each entry. I would suggest to still use the numerous maps in the builder and build the column_settings_ instances at the end.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hope I have covered my issue in an understandable unit test: #166
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok, went with your suggestion, thanks for putting in the test.
src/parquet/column/statistics.cc
Outdated
| namespace parquet { | ||
|
|
||
| template <typename TypedStats> | ||
| std::shared_ptr<TypedStats> RowGroupStatistics::as() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would prefer if this was done explicitly by the user as with all other casts.
src/parquet/column/statistics.cc
Outdated
| void TypedRowGroupStatistics<FLBAType>::Copy( | ||
| const FLBA& src, FLBA* dst, OwnedMutableBuffer& buffer) { | ||
| if (dst->ptr == src.ptr) return; | ||
| auto len = descr_->type_length(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use the explicity type here.
src/parquet/column/writer.h
Outdated
| std::shared_ptr<RowGroupStatistics> chunk_statistics_; | ||
|
|
||
| template <typename Type> | ||
| void InitStatistics() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please make this part of the TypedColumnWriter
src/parquet/column/writer.cc
Outdated
| } | ||
|
|
||
| if (properties->collect_statistics(descr_->path())) { | ||
| InitStatistics<Type>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No need for <Type> if this function is part of TypedColumnWriter
src/parquet/file/metadata.cc
Outdated
| case Type::FIXED_LEN_BYTE_ARRAY: | ||
| return MakeColumnStatsT<FLBAType>(meta_data, descr); | ||
| } | ||
| return nullptr; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Raise an exception if the type is not supported.
src/parquet/file/metadata.cc
Outdated
| namespace parquet { | ||
|
|
||
| template <typename DType> | ||
| static std::shared_ptr<RowGroupStatistics> MakeColumnStatsT( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MakeColumnStatsTyped
src/parquet/file/reader.cc
Outdated
| for (auto i : selected_columns) { | ||
| auto column_chunk = group_metadata->ColumnChunk(i); | ||
| const ColumnStatistics stats = column_chunk->statistics(); | ||
| const auto stats = column_chunk->statistics(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please write the type here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@wesm Do you have a link to the guidelines at hand?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
https://google.github.io/styleguide/cppguide.html#auto
aside: we should survey our usages of auto to make sure we are using const auto& in the right places
xhochy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass, will review again once the changes are made.
|
I think this is about ready to go. Needs a rebase. @xhochy ? |
xhochy
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, rebase & then merge!
* EncodedStatistics now stores min/max as shared_ptr<string> * restored old WriterProperties::Builder behavior * renamed EncodedMin/EncodedMax to EncodeMin/EncodeMax * moved page_/chunk_statistics_ to TypedColumnWriter
06612c3 to
4c0ed2a
Compare
4c0ed2a to
0be02ec
Compare
|
Thanks all, I rebased the branch. |
|
+1 |
No description provided.