Skip to content
This repository was archived by the owner on May 10, 2024. It is now read-only.

Conversation

@majetideepak
Copy link

@majetideepak majetideepak commented Aug 9, 2017

@lomereiter, You might also want to take a look since you previously implemented the Statistics API.

@majetideepak majetideepak force-pushed the PARQUET-1002 branch 2 times, most recently from 6113f57 to f0560c6 Compare August 10, 2017 18:12
IncrementNullCount(null_count);
IncrementDistinctCount(distinct_count);

comparator_ = std::static_pointer_cast<CompareDefault<DType> >(Compare::Make(schema));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this line be just moved into SetDescr then?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, good catch.

namespace parquet {

std::shared_ptr<Compare> Compare::Make(const ColumnDescriptor* descr) {
if (SortOrder::SIGNED == GetSortOrder(descr->logical_type(), descr->physical_type())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I counted 5 occurrences of this call, it probably makes sense to add a convenience method to ColumnDescriptor

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes! makes sense.

@majetideepak majetideepak force-pushed the PARQUET-1002 branch 2 times, most recently from 13e82fc to 2150379 Compare August 10, 2017 19:48
@majetideepak
Copy link
Author

@wesm, @xhochy can you take a look at this? thanks!

@wesm
Copy link
Member

wesm commented Aug 21, 2017

I'm sorry about the delay, I will try to review this today or worst case tomorrow

@wesm
Copy link
Member

wesm commented Aug 22, 2017

Going to review this today

Copy link
Member

@wesm wesm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks @majetideepak! this looks like it was a lot of work / very tedious. I'm sorry for the slow feedback.

I eyeballed the comparison details and didn't see anything awry. I admit to not having followed the comparison discussion vis-a-vis int96 on the Parquet syncs, but I trust you've implemented it correctly.

Most of my comments are nitpicks. I believe that @xhochy is back from vacation tomorrow so I would like for him to also look at this and then we try to do either a 1.2.1 release or 1.3.0 right away so that we don't continue to generate files with bad statistics.

return SortOrder::UNKNOWN;
}
const ApplicationVersion ApplicationVersion::PARQUET_CPP_FIXED_STATS_VERSION =
ApplicationVersion("parquet-cpp version 1.2.1");
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will probably be 1.3.0, but we can also release 1.2.1 right away

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Releasing 1.2.1 sounds like a good idea. But this value should work even if we make a 1.3.0 release.

Copy link
Author

@majetideepak majetideepak Aug 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just noticed apache/parquet-format#55 adds some more changes to the min-max stats format. I think we should wait until that gets merged.

// This for backward compatibility
if (is_signed) {
stats.max = val.max();
stats.min = val.min();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do the new statistics fields cause any problems for older Parquet readers?

Copy link
Author

@majetideepak majetideepak Aug 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked and it didn't.

template <>
void TestStatistics<DoubleType>::AddNodes(std::string name) {
// Double physical type has only Signed Statistics
fields_.push_back(schema::PrimitiveNode::Make(name, Repetition::REQUIRED, Type::DOUBLE,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can this be simplified using type traits to avoid these specializations? We have type_traits, but they are seldom used, and could be templated on the FooType class rather than the type number, if that helps

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will take a look.

@@ -0,0 +1,330 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems like this belongs in statistics-test.cc, might it be clearer to put this code there?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My motivation is to add end to end writer-reader tests for Parquet which we currently don't have.

Copy link
Member

@xhochy xhochy Sep 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also like to see this in statistics-test.cc. We can move it back to src/parquet/parquet_reader_writer-test.cc once we have full end-to-end tests.

#define CREATED_BY_VERSION "parquet-cpp version 1.2.1-SNAPSHOT"

#endif // PARQUET_VERSION_H
#endif // PARQUET_VERSION_H
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we mean to check this file in?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing this file causes one of the Travis CI tests to fail. I am not sure why. I will look into this as well.

Copy link
Author

@majetideepak majetideepak Sep 5, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed PARQUET-1088

explicit Compare(const ColumnDescriptor* descr) : type_length_(descr->type_length()) {}

inline bool operator()(const T& a, const T& b) { return a < b; }
class PARQUET_EXPORT Compare {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we rename this Comparator?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure!

typedef typename DType::c_type T;
CompareDefault() {}
virtual ~CompareDefault() {}
virtual bool operator()(const T& a, const T& b) { return a < b; }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

override (instead of virtual)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will do here and below.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

CompareDefault is the base class and will require virtual. I used override inside the derived classes.

public:
CompareDefault() {}
virtual ~CompareDefault() {}
virtual bool operator()(const Int96& a, const Int96& b) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

override

public:
CompareDefault() {}
virtual ~CompareDefault() {}
virtual bool operator()(const ByteArray& a, const ByteArray& b) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

override

public:
explicit CompareDefault(int length) : type_length_(length) {}
virtual ~CompareDefault() {}
virtual bool operator()(const FLBA& a, const FLBA& b) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

override

@majetideepak majetideepak changed the title PARQUET-1002: Compute statistics based on Sort Order [WIP] PARQUET-1002: Compute statistics based on Sort Order Aug 23, 2017
@majetideepak
Copy link
Author

Thanks for the feedback @wesm! As far as I remember, the sync discussions were geared more towards UTF-8 (and other string encodings) and not much on Int96. Int96 is deprecated and is currently used only for Impala Timestamp values which require a custom sort order.

I have also implemented this patch with the idea of users be able to implement custom Comparator classes and plugin into the writer properties. Any custom order should be easy to implement now by simply extending the CompareDefault**Type** class


EncodedStatistics chunk_statistics = GetChunkStatistics();
if (chunk_statistics.is_set()) metadata_->SetStatistics(chunk_statistics);
if (chunk_statistics.is_set())
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Please use braces.


// Checks if the Version has the correct statistics for a given column
bool HasCorrectStatistics(Type::type primitive) const;
bool HasCorrectStatistics(Type::type primitive, SortOrder::type sort_order) const;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As we change the interface here, we should make the new version 1.3.0.

@majetideepak alternatively can we provide a default argument here?

@wesm
Copy link
Member

wesm commented Aug 29, 2017

@majetideepak if you can incorporate feedback and make this not WIP, we should be able to cut the 1.3.0 RC after this is in

@majetideepak
Copy link
Author

@wesm Sorry for the delay! I somehow missed your comment. I will incorporate this feedback tomorrow morning.

@majetideepak majetideepak changed the title [WIP] PARQUET-1002: Compute statistics based on Sort Order PARQUET-1002: Compute statistics based on Sort Order Sep 5, 2017
@wesm
Copy link
Member

wesm commented Sep 5, 2017

A follow up question on the comparator thing. Any way to do this as fully compile-time with no virtual functions (and so no exported symbols in the DLL)?

@majetideepak
Copy link
Author

I couldn't come up with an elegant design to make it fully-compile time.
The comparator depends upon the SortOrder and the c_type of the parquet physical types.
The c_type can be determined at compile time for a given TypedRowGroupStatistics. I don't know if we can determine the SortOrder at compile time.
Another motivation for virtual functions is to allow users to specify custom comparators for unknown sort orders.

Copy link
Member

@xhochy xhochy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -0,0 +1,330 @@
// Licensed to the Apache Software Foundation (ASF) under one
Copy link
Member

@xhochy xhochy Sep 10, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also like to see this in statistics-test.cc. We can move it back to src/parquet/parquet_reader_writer-test.cc once we have full end-to-end tests.

@majetideepak
Copy link
Author

Fixed issue https://github.com/apache/parquet-cpp/pull/383/files#r134647005
We can go ahead without the merge of apache/parquet-format#55. That PR only impacts how the SortOrder is inferred. This PR will not require any changes.

@xhochy
Copy link
Member

xhochy commented Sep 11, 2017

Travis is failing with

/Users/travis/build/apache/parquet-cpp/src/parquet/statistics-test.cc:634:13: error: variable-sized object may not be initialized
  char vals[NUM_VALUES][FLBA_LENGTH] = {"b12345", "a12345", "c12345", "d12345", "e12345",

Once that is fixed, feel free to merge.

@majetideepak
Copy link
Author

@xhochy Fixed the error. I don't think I have merge access yet!

@wesm
Copy link
Member

wesm commented Sep 11, 2017

Merging. It looks like you haven't been added to the LDAP group, @julienledem can you take a look?

@asfgit asfgit closed this in cf6992a Sep 11, 2017
@majetideepak majetideepak deleted the PARQUET-1002 branch September 20, 2017 14:02
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants