PARQUET-1002: Compute statistics based on Sort Order #383
6113f57 to
f0560c6
Compare
src/parquet/statistics.cc
Outdated
| IncrementNullCount(null_count); | ||
| IncrementDistinctCount(distinct_count); | ||
|
|
||
| comparator_ = std::static_pointer_cast<CompareDefault<DType> >(Compare::Make(schema)); |
Should this line be just moved into SetDescr then?
Yeah, good catch.
src/parquet/util/comparison.cc
Outdated
| namespace parquet { | ||
|
|
||
| std::shared_ptr<Compare> Compare::Make(const ColumnDescriptor* descr) { | ||
| if (SortOrder::SIGNED == GetSortOrder(descr->logical_type(), descr->physical_type())) { |
I counted 5 occurrences of this call, it probably makes sense to add a convenience method to ColumnDescriptor
Yes! makes sense.
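A rough sketch of what such a convenience method could look like. The types below are simplified stand-ins, not the real parquet-cpp classes, and the `GetSortOrder` mapping is deliberately reduced to a few cases for illustration; `sort_order()` is the hypothetical new accessor.

```cpp
// Stand-in enums; the real parquet-cpp definitions have many more values.
enum class PhysicalType { INT32, BYTE_ARRAY };
enum class LogicalType { NONE, UINT_32, UTF8 };
enum class SortOrder { SIGNED, UNSIGNED, UNKNOWN };

// Simplified stand-in for the free function called in the diff.
inline SortOrder GetSortOrder(LogicalType logical, PhysicalType physical) {
  if (logical == LogicalType::UINT_32 || logical == LogicalType::UTF8) {
    return SortOrder::UNSIGNED;
  }
  switch (physical) {
    case PhysicalType::INT32:
      return SortOrder::SIGNED;
    case PhysicalType::BYTE_ARRAY:
      return SortOrder::UNSIGNED;
  }
  return SortOrder::UNKNOWN;
}

// Stand-in descriptor holding only the two fields the sketch needs.
class ColumnDescriptor {
 public:
  ColumnDescriptor(PhysicalType physical, LogicalType logical)
      : physical_(physical), logical_(logical) {}

  PhysicalType physical_type() const { return physical_; }
  LogicalType logical_type() const { return logical_; }

  // Hypothetical convenience accessor: callers no longer repeat
  // GetSortOrder(descr->logical_type(), descr->physical_type()).
  SortOrder sort_order() const { return GetSortOrder(logical_, physical_); }

 private:
  PhysicalType physical_;
  LogicalType logical_;
};
```

With something like this, the check in `Compare::Make` would reduce to `SortOrder::SIGNED == descr->sort_order()`.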
13e82fc to
2150379
Compare
|
I'm sorry about the delay, I will try to review this today or worst case tomorrow |
|
Going to review this today |
wesm
left a comment
thanks @majetideepak! this looks like it was a lot of work / very tedious. I'm sorry for the slow feedback.
I eyeballed the comparison details and didn't see anything awry. I admit to not having followed the comparison discussion vis-a-vis int96 on the Parquet syncs, but I trust you've implemented it correctly.
Most of my comments are nitpicks. I believe that @xhochy is back from vacation tomorrow, so I would like him to also look at this; then we can try to do either a 1.2.1 or a 1.3.0 release right away so that we don't continue to generate files with bad statistics.
src/parquet/file/metadata.cc
Outdated
| return SortOrder::UNKNOWN; | ||
| } | ||
| const ApplicationVersion ApplicationVersion::PARQUET_CPP_FIXED_STATS_VERSION = | ||
| ApplicationVersion("parquet-cpp version 1.2.1"); |
This will probably be 1.3.0, but we can also release 1.2.1 right away
Releasing 1.2.1 sounds like a good idea. But this value should work even if we make a 1.3.0 release.
I just noticed apache/parquet-format#55 adds some more changes to the min-max stats format. I think we should wait until that gets merged.
| // This for backward compatibility | ||
| if (is_signed) { | ||
| stats.max = val.max(); | ||
| stats.min = val.min(); |
Do the new statistics fields cause any problems for older Parquet readers?
I checked, and they didn't.
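For readers following along, the backward-compatibility idea in the hunk above might be sketched as follows. This is a standalone illustration, not the code in this patch: `FileStatistics` and `ToFileStatistics` are hypothetical names, and the assumption that the newer `min_value`/`max_value` fields are always written is mine.

```cpp
#include <string>

// Hypothetical stand-in for the serialized statistics struct.
struct FileStatistics {
  std::string min, max;              // legacy fields, read by older readers
  std::string min_value, max_value;  // newer fields (assumed names)
  bool has_legacy = false;
  bool has_new = false;
};

// Always fill the newer fields; fill the legacy fields only when the column
// uses a signed sort order, since older readers assume signed comparison.
inline FileStatistics ToFileStatistics(const std::string& encoded_min,
                                       const std::string& encoded_max,
                                       bool is_signed) {
  FileStatistics out;
  out.min_value = encoded_min;
  out.max_value = encoded_max;
  out.has_new = true;
  if (is_signed) {
    // This is for backward compatibility.
    out.min = encoded_min;
    out.max = encoded_max;
    out.has_legacy = true;
  }
  return out;
}
```

Under this scheme an older reader either sees legacy values it can trust or sees no legacy values at all.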
| template <> | ||
| void TestStatistics<DoubleType>::AddNodes(std::string name) { | ||
| // Double physical type has only Signed Statistics | ||
| fields_.push_back(schema::PrimitiveNode::Make(name, Repetition::REQUIRED, Type::DOUBLE, |
Can this be simplified using type traits to avoid these specializations? We have type_traits, but they are seldom used, and could be templated on the FooType class rather than the type number, if that helps
I will take a look.
| @@ -0,0 +1,330 @@ | |||
| // Licensed to the Apache Software Foundation (ASF) under one | |||
It seems like this belongs in statistics-test.cc, might it be clearer to put this code there?
My motivation is to add end-to-end writer-reader tests for Parquet, which we currently don't have.
I would also like to see this in statistics-test.cc. We can move it back to src/parquet/parquet_reader_writer-test.cc once we have full end-to-end tests.
src/parquet/parquet_version.h
Outdated
| #define CREATED_BY_VERSION "parquet-cpp version 1.2.1-SNAPSHOT" | ||
|
|
||
| #endif // PARQUET_VERSION_H | ||
| #endif // PARQUET_VERSION_H |
Do we mean to check this file in?
Removing this file causes one of the Travis CI tests to fail. I am not sure why. I will look into this as well.
Filed PARQUET-1088
src/parquet/util/comparison.h
Outdated
| explicit Compare(const ColumnDescriptor* descr) : type_length_(descr->type_length()) {} | ||
|
|
||
| inline bool operator()(const T& a, const T& b) { return a < b; } | ||
| class PARQUET_EXPORT Compare { |
nit: can we rename this Comparator?
Sure!
| typedef typename DType::c_type T; | ||
| CompareDefault() {} | ||
| virtual ~CompareDefault() {} | ||
| virtual bool operator()(const T& a, const T& b) { return a < b; } |
override (instead of virtual)
will do here and below.
CompareDefault is the base class and will require virtual. I used override inside the derived classes.
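To illustrate the `virtual` versus `override` point in isolation, here is a minimal sketch. `CompareDefault` is reduced to its essentials, and `CompareUnsignedInt32` is an illustrative derived class, not necessarily the one in this patch.

```cpp
#include <cstdint>

template <typename T>
class CompareDefault {
 public:
  virtual ~CompareDefault() {}
  // First declaration of the function: `virtual` is required here, and
  // `override` would not compile because there is nothing to override.
  virtual bool operator()(const T& a, const T& b) { return a < b; }
};

class CompareUnsignedInt32 : public CompareDefault<int32_t> {
 public:
  // `override` makes the compiler check that the signature matches the base.
  bool operator()(const int32_t& a, const int32_t& b) override {
    return static_cast<uint32_t>(a) < static_cast<uint32_t>(b);
  }
};
```

Called through a `CompareDefault<int32_t>&`, the derived comparator orders `1` before `-1`, because `-1` reinterpreted as unsigned is the largest 32-bit value.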
| public: | ||
| CompareDefault() {} | ||
| virtual ~CompareDefault() {} | ||
| virtual bool operator()(const Int96& a, const Int96& b) { |
override
| public: | ||
| CompareDefault() {} | ||
| virtual ~CompareDefault() {} | ||
| virtual bool operator()(const ByteArray& a, const ByteArray& b) { |
override
| public: | ||
| explicit CompareDefault(int length) : type_length_(length) {} | ||
| virtual ~CompareDefault() {} | ||
| virtual bool operator()(const FLBA& a, const FLBA& b) { |
override
|
Thanks for the feedback @wesm! As far as I remember, the sync discussions were geared more towards UTF-8 (and other string encodings) and not so much towards Int96. Int96 is deprecated and is currently used only for Impala Timestamp values, which require a custom sort order. I have also implemented this patch with the idea of users being able to implement custom |
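As background on why the sort order matters for string statistics, here is a small standalone sketch (not code from this patch; the helper names are made up) comparing the same byte strings under signed and unsigned byte order. Any UTF-8 byte at or above 0x80 flips position between the two.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Lexicographic comparison treating each byte as a signed 8-bit value.
inline bool LessSignedBytes(const std::string& a, const std::string& b) {
  return std::lexicographical_compare(
      a.begin(), a.end(), b.begin(), b.end(), [](char x, char y) {
        return static_cast<int8_t>(x) < static_cast<int8_t>(y);
      });
}

// Lexicographic comparison treating each byte as an unsigned 8-bit value.
inline bool LessUnsignedBytes(const std::string& a, const std::string& b) {
  return std::lexicographical_compare(
      a.begin(), a.end(), b.begin(), b.end(), [](char x, char y) {
        return static_cast<uint8_t>(x) < static_cast<uint8_t>(y);
      });
}
```

For example, `"z"` (0x7A) sorts before `"\xC3\xA9"` (the UTF-8 encoding of é) under unsigned order but after it under signed order, so min/max computed one way are wrong for a reader assuming the other.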
src/parquet/column_writer.cc
Outdated
|
|
||
| EncodedStatistics chunk_statistics = GetChunkStatistics(); | ||
| if (chunk_statistics.is_set()) metadata_->SetStatistics(chunk_statistics); | ||
| if (chunk_statistics.is_set()) |
Nit: Please use braces.
src/parquet/file/metadata.h
Outdated
|
|
||
| // Checks if the Version has the correct statistics for a given column | ||
| bool HasCorrectStatistics(Type::type primitive) const; | ||
| bool HasCorrectStatistics(Type::type primitive, SortOrder::type sort_order) const; |
As we change the interface here, we should make the new version 1.3.0.
@majetideepak alternatively can we provide a default argument here?
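The default-argument alternative could look roughly like this. It is a sketch with stand-in types: the choice of `SortOrder::SIGNED` as the default and the placeholder body are assumptions, not the real version check.

```cpp
// Stand-ins mirroring the nested-enum style used in the diff.
struct Type {
  enum type { INT32, BYTE_ARRAY };
};
struct SortOrder {
  enum type { SIGNED, UNSIGNED, UNKNOWN };
};

class ApplicationVersion {
 public:
  // Existing one-argument call sites keep compiling because the new
  // parameter has a default value (assumed here to be SIGNED).
  bool HasCorrectStatistics(Type::type primitive,
                            SortOrder::type sort_order = SortOrder::SIGNED) const {
    (void)primitive;  // unused in this placeholder
    // Placeholder rule for the sketch only: trust signed-order statistics.
    return sort_order == SortOrder::SIGNED;
  }
};
```

Note that a default argument keeps source compatibility for callers but still changes the exported symbol, so it does not preserve binary compatibility.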
|
@majetideepak if you can incorporate feedback and make this not WIP, we should be able to cut the 1.3.0 RC after this is in |
|
@wesm Sorry for the delay! I somehow missed your comment. I will incorporate this feedback tomorrow morning. |
68283b0 to
8d401f0
Compare
|
A follow up question on the comparator thing. Any way to do this as fully compile-time with no virtual functions (and so no exported symbols in the DLL)? |
|
I couldn't come up with an elegant design to make it fully compile-time.
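For reference, one possible shape for a compile-time variant is a policy-based template, sketched below with hypothetical names (`SignedLess`, `UnsignedInt32Less`, `MinMax`). This is not what the patch does, and since the sort order is only known from the schema at runtime, some dispatch would still be needed to pick the instantiation.

```cpp
#include <cstdint>

// Comparison policies: plain structs with a static function, no vtable.
struct SignedLess {
  template <typename T>
  static bool Less(const T& a, const T& b) {
    return a < b;
  }
};

struct UnsignedInt32Less {
  static bool Less(int32_t a, int32_t b) {
    return static_cast<uint32_t>(a) < static_cast<uint32_t>(b);
  }
};

// Min/max tracker parameterized on the comparison policy at compile time.
template <typename T, typename Order>
class MinMax {
 public:
  void Update(const T& v) {
    if (!has_value_ || Order::Less(v, min_)) min_ = v;
    if (!has_value_ || Order::Less(max_, v)) max_ = v;
    has_value_ = true;
  }
  const T& min() const { return min_; }
  const T& max() const { return max_; }

 private:
  T min_{};
  T max_{};
  bool has_value_ = false;
};
```

The trade-off is more template instantiations (one per type and order pair) in exchange for inlinable comparisons and no virtual calls.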
add int32 comparison INT96 fix statistics in metadata Use templates
read new max and min values
fix tests
Fix Warnings on Windows
8d401f0 to
75ea475
Compare
xhochy
left a comment
This looks ready to go except:
| @@ -0,0 +1,330 @@ | |||
| // Licensed to the Apache Software Foundation (ASF) under one | |||
I would also like to see this in statistics-test.cc. We can move it back to src/parquet/parquet_reader_writer-test.cc once we have full end-to-end tests.
|
Fixed issue https://github.com/apache/parquet-cpp/pull/383/files#r134647005 |
|
Travis is failing with Once that is fixed, feel free to merge. |
|
@xhochy Fixed the error. I don't think I have merge access yet! |
|
Merging. It looks like you haven't been added to the LDAP group, @julienledem can you take a look? |
@lomereiter, you might also want to take a look since you previously implemented the Statistics API.