This repository was archived by the owner on May 10, 2024. It is now read-only.

@lomereiter
Contributor

No description provided.


virtual void Close() = 0;
// TODO think of a better way to pass column chunk statistics
virtual void Close(const std::shared_ptr<RowGroupStatistics>& chunk_statistics) = 0;
Contributor Author

The API is admittedly rather ugly here, because PageWriter operates on raw bytes and not values. It could handle column chunk statistics by itself, but then it would have to be initialized somewhere with a concrete type.
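
To make the trade-off concrete, here is a minimal sketch (not the code in this patch). All names in it (`EncodedMinMax`, `BytePageWriter`, `EncodeMinMax`) are hypothetical. It assumes the typed layer serializes min/max to bytes first, so that the byte-oriented writer never needs to know the value type.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Hypothetical type-erased statistics: min/max already serialized to bytes.
struct EncodedMinMax {
  std::string min;
  std::string max;
};

// Hypothetical byte-oriented writer. It never sees typed values, so it can
// only store statistics that the caller has already encoded.
class BytePageWriter {
 public:
  void Close(const EncodedMinMax& chunk_statistics) {
    assert(!closed_);  // closing twice would be a caller bug
    chunk_statistics_ = chunk_statistics;
    closed_ = true;
  }
  bool closed() const { return closed_; }
  const EncodedMinMax& chunk_statistics() const { return chunk_statistics_; }

 private:
  EncodedMinMax chunk_statistics_;
  bool closed_ = false;
};

// The typed layer knows T, so the encoding happens here. Copying the raw
// bytes of a trivially copyable value is a simplification for illustration.
template <typename T>
EncodedMinMax EncodeMinMax(const T& min, const T& max) {
  EncodedMinMax out;
  out.min.assign(reinterpret_cast<const char*>(&min), sizeof(T));
  out.max.assign(reinterpret_cast<const char*>(&max), sizeof(T));
  return out;
}
```

A typed column writer would call something like `writer.Close(EncodeMinMax<int32_t>(min, max))`, which keeps the concrete type out of the page writer at the cost of an extra parameter on `Close`.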

@wesm
Member

wesm commented Jun 29, 2016

Great to make progress on this -- a few nits, but my main concern is creating more thorough test cases that verify the correctness of reads and writes (since databases use these statistics to know when to skip pages / row groups, etc.).
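
As an illustration of why correctness matters here, a minimal sketch of how a reader might use min/max statistics to skip a row group. `Int64Stats` and `MayContainRange` are hypothetical names for this example, not parquet-cpp API.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical min/max summary for one INT64 column chunk.
struct Int64Stats {
  bool has_min_max;
  int64_t min;
  int64_t max;
};

// Returns false only when the statistics prove that no value in the chunk
// falls inside [lo, hi]. Missing statistics mean "cannot skip".
inline bool MayContainRange(const Int64Stats& stats, int64_t lo, int64_t hi) {
  assert(lo <= hi);
  if (!stats.has_min_max) return true;
  return !(stats.max < lo || stats.min > hi);
}
```

If a writer records a max that is too small or a min that is too large, a check like this silently drops matching rows, so round-trip tests on the written statistics are worth the effort.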

@wesm
Member

wesm commented Jul 6, 2016

Thank you, I will review the updated patch here in short order.

@lomereiter
Contributor Author

Thanks for your time.
I haven't gotten around to writing comparison tests yet, but here's a quick summary of changes:

  • writing statistics is disabled by default (often it makes sense only for a few specific columns)
  • writing and reading is covered with tests, for single- and multi-page row groups
  • all column-level settings are grouped into a struct, as it's more convenient and doesn't require multiple unordered_map accesses

Encoding::type encoding;
Compression::type codec;
bool collect_statistics;
};
Member

If this is strictly POD, this should be a struct (note the Google style guidelines on member variable names for structs vs. classes).

@wesm
Member

wesm commented Sep 1, 2016

@lomereiter @piyushnarang looks like we should do an unsigned binary comparison here. @lomereiter it looks like comparators are the only thing left -- you should wait to rebase until PARQUET-573 is merged. thanks for patience!

@wesm
Member

wesm commented Sep 1, 2016

sorry, looks like we still need to compute the signed max/min -- we can come back and add the unsigned statistics when the format version drops

apache/parquet-format#42 (comment)

@piyushnarang

Yeah there's been a bit of iteration on the parquet-format PR :-)

@wesm
Member

wesm commented Sep 6, 2016

@lomereiter let's proceed and compute only the signed max/min in this patch until the dust settles on parquet-format?

@lomereiter
Contributor Author

@wesm ok, I'll rebase and add comparison tests this weekend so that it can be merged finally

@lomereiter
Contributor Author

So far rebased only, will get back to tests and further cleanup next evening.

@lomereiter force-pushed the parquet-593 branch 2 times, most recently from 8adc287 to 2594141 on September 18, 2016 20:16
@lomereiter
Contributor Author

Switched to using signed min/max for byte array types, added a few comparison tests. Some inefficiencies in read/write paths remain, but are topics for separate JIRAs.
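
For readers following the signed vs. unsigned discussion, a small sketch of the two byte-wise orderings. `SignedLess` and `UnsignedLess` are hypothetical helpers written for this example; they are not the comparators in the patch. The orderings agree on ASCII data and disagree as soon as a byte has its high bit set.

```cpp
#include <algorithm>
#include <cstdint>
#include <string>

// Lexicographic comparison that interprets each byte as a signed 8-bit value.
inline bool SignedLess(const std::string& a, const std::string& b) {
  return std::lexicographical_compare(
      a.begin(), a.end(), b.begin(), b.end(), [](char x, char y) {
        return static_cast<int8_t>(x) < static_cast<int8_t>(y);
      });
}

// Lexicographic comparison that interprets each byte as an unsigned value.
inline bool UnsignedLess(const std::string& a, const std::string& b) {
  return std::lexicographical_compare(
      a.begin(), a.end(), b.begin(), b.end(), [](char x, char y) {
        return static_cast<uint8_t>(x) < static_cast<uint8_t>(y);
      });
}
```

With the single-byte values `0x7f` and `0x80`, the unsigned ordering puts `0x7f` first (127 < 128), while the signed ordering puts `0x80` first (-128 < 127). A reader that assumes one ordering while the writer used the other can therefore compute wrong min/max bounds for non-ASCII data.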

@wesm
Member

wesm commented Sep 18, 2016

Awesome, will review when I can.

friend class SerializedPageReader;
std::string max_;
std::string min_;
EncodedStatistics statistics_;
Member

To avoid copying a lot of data, this would probably be better as a shared_ptr.
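
A minimal sketch of the idea, assuming the encoded min/max strings themselves are held through `shared_ptr` (as the later commit note "EncodedStatistics now stores min/max as shared_ptr<string>" suggests). The class below is a hypothetical stand-in, not the real `EncodedStatistics`.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in: copying it copies two pointers, not the byte buffers.
class EncodedStats {
 public:
  void set_min(std::string value) {
    min_ = std::make_shared<std::string>(std::move(value));
  }
  void set_max(std::string value) {
    max_ = std::make_shared<std::string>(std::move(value));
  }
  const std::string& min() const { assert(min_); return *min_; }
  const std::string& max() const { assert(max_); return *max_; }

  // True when both objects point at the same underlying buffers.
  bool shares_buffers_with(const EncodedStats& other) const {
    return min_ == other.min_ && max_ == other.max_;
  }

 private:
  std::shared_ptr<std::string> min_ = std::make_shared<std::string>();
  std::shared_ptr<std::string> max_ = std::make_shared<std::string>();
};
```

Because the setters replace the pointer instead of mutating the shared string, a copy can be updated without affecting the object it was copied from.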

static constexpr bool DEFAULT_IS_DICTIONARY_ENABLED = true;
static constexpr int64_t DEFAULT_DICTIONARY_PAGE_SIZE_LIMIT = DEFAULT_PAGE_SIZE;
static constexpr int64_t DEFAULT_WRITE_BATCH_SIZE = 1024;
static constexpr bool DEFAULT_COLLECT_STATISTICS = false;
Member

Are they enabled as default in parquet-mr? We should probably stick to the same defaults here. Collecting statistics costs a bit but will improve query times on the files a lot.

Contributor Author

I checked and they are indeed enabled by default. However, if values of min/max are large (> 4KB) they are not written (https://issues.apache.org/jira/browse/PARQUET-372).
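
A sketch of what such a size guard could look like. The 4096-byte limit comes from the "> 4KB" figure above; whether the limit applies per value (as assumed here) or to min and max combined is an assumption of this example, and `ShouldWriteMinMax` is a hypothetical name.

```cpp
#include <cstddef>
#include <string>

// Assumed limit, taken from the "> 4KB" figure quoted in this thread.
constexpr std::size_t kMaxStatValueSize = 4096;

// Returns true when both encoded values are small enough to be worth
// storing in the column chunk metadata.
inline bool ShouldWriteMinMax(const std::string& encoded_min,
                              const std::string& encoded_max,
                              std::size_t limit = kMaxStatValueSize) {
  return encoded_min.size() <= limit && encoded_max.size() <= limit;
}
```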

static constexpr Compression::type DEFAULT_COMPRESSION_TYPE = Compression::UNCOMPRESSED;

using ColumnCodecs = std::unordered_map<std::string, Compression::type>;
class PARQUET_EXPORT ColumnSettings {
Member

I'm OK with this class (although it should be named ColumnProperties), but not with the resulting behaviour. Currently it is possible to select a custom encoding for a column but fall back to the file-global defaults for all other settings. With this new class, once we have set one property on a per-column basis, we have to either set all other file-global defaults explicitly on that column or we'll get the parquet-cpp defaults.

Contributor Author

I don't understand this objection since WriterProperties::Builder interface is unchanged.

Member

The interface is the same, just the behaviour changed.

Contributor Author

Gotcha. AFAIU all setters should first set column_settings_[path] to default_column_settings_.

Member

This will still not be enough. You will also need an isset for each entry. I would suggest to still use the numerous maps in the builder and build the column_settings_ instances at the end.

Member

Hope I have covered my issue in an understandable unit test: #166

Contributor Author

Ok, went with your suggestion, thanks for putting in the test.
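
The behaviour discussed in this thread can be sketched as follows. This is a simplified, hypothetical builder (`PropsBuilder`, `ColumnProps`, and the two enums are stand-ins, not the real `WriterProperties::Builder`). It keeps one override map per property and resolves each column against the file-global defaults only at the end, so a column with a custom encoding still picks up a file-global codec that was set later.

```cpp
#include <string>
#include <unordered_map>

// Hypothetical, simplified stand-ins for the real enums.
enum class Encoding { PLAIN, RLE_DICTIONARY };
enum class Compression { UNCOMPRESSED, SNAPPY };

// Fully resolved settings for one column.
struct ColumnProps {
  Encoding encoding;
  Compression codec;
  bool collect_statistics;
};

class PropsBuilder {
 public:
  // File-global defaults.
  PropsBuilder& encoding(Encoding e) { defaults_.encoding = e; return *this; }
  PropsBuilder& compression(Compression c) { defaults_.codec = c; return *this; }
  PropsBuilder& collect_statistics(bool on) {
    defaults_.collect_statistics = on;
    return *this;
  }

  // Per-column overrides. A missing map entry plays the role of "not set".
  PropsBuilder& encoding(const std::string& path, Encoding e) {
    encodings_[path] = e;
    return *this;
  }
  PropsBuilder& compression(const std::string& path, Compression c) {
    codecs_[path] = c;
    return *this;
  }
  PropsBuilder& collect_statistics(const std::string& path, bool on) {
    statistics_[path] = on;
    return *this;
  }

  // Start from the file-global defaults, then apply only what was overridden.
  ColumnProps Resolve(const std::string& path) const {
    ColumnProps props = defaults_;
    Apply(encodings_, path, &props.encoding);
    Apply(codecs_, path, &props.codec);
    Apply(statistics_, path, &props.collect_statistics);
    return props;
  }

 private:
  template <typename T>
  static void Apply(const std::unordered_map<std::string, T>& overrides,
                    const std::string& path, T* out) {
    auto it = overrides.find(path);
    if (it != overrides.end()) *out = it->second;
  }

  ColumnProps defaults_{Encoding::PLAIN, Compression::UNCOMPRESSED, false};
  std::unordered_map<std::string, Encoding> encodings_;
  std::unordered_map<std::string, Compression> codecs_;
  std::unordered_map<std::string, bool> statistics_;
};
```

Copying the defaults into a per-column struct at the time of the first setter call would not give the same result, because a file-global default changed afterwards would no longer reach that column.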

namespace parquet {

template <typename TypedStats>
std::shared_ptr<TypedStats> RowGroupStatistics::as() {
Member

I would prefer if this was done explicitly by the user as with all other casts.
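
A short sketch of what the explicit alternative looks like at a call site. `StatsBase`, `Int32Stats`, and `Int32Max` are hypothetical types and helpers for illustration; the point is only that the caller spells out `std::static_pointer_cast` instead of going through an `as<T>()` member.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical untyped base and one typed subclass.
struct StatsBase {
  virtual ~StatsBase() = default;
};

struct Int32Stats : StatsBase {
  Int32Stats(int32_t mn, int32_t mx) : min(mn), max(mx) {}
  int32_t min;
  int32_t max;
};

// The caller asserts the concrete type by writing the cast itself, so the
// type assumption is visible where it is made. The caller must already know
// that `stats` really is an Int32Stats.
inline int32_t Int32Max(const std::shared_ptr<StatsBase>& stats) {
  std::shared_ptr<Int32Stats> typed = std::static_pointer_cast<Int32Stats>(stats);
  assert(typed != nullptr);
  return typed->max;
}
```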

void TypedRowGroupStatistics<FLBAType>::Copy(
const FLBA& src, FLBA* dst, OwnedMutableBuffer& buffer) {
if (dst->ptr == src.ptr) return;
auto len = descr_->type_length();
Member

Please use the explicit type here.

std::shared_ptr<RowGroupStatistics> chunk_statistics_;

template <typename Type>
void InitStatistics() {
Member

Please make this part of the TypedColumnWriter

}

if (properties->collect_statistics(descr_->path())) {
InitStatistics<Type>();
Member

No need for <Type> if this function is part of TypedColumnWriter

case Type::FIXED_LEN_BYTE_ARRAY:
return MakeColumnStatsT<FLBAType>(meta_data, descr);
}
return nullptr;
Member

Raise an exception if the type is not supported.
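
A minimal sketch of the suggested shape, using a hypothetical `PhysicalType` enum and `std::invalid_argument` as a stand-in for whatever exception type the library prefers. The dispatch returns a string here only to keep the example self-contained.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical subset of physical types; UNDEFINED stands for anything the
// dispatch below does not handle.
enum class PhysicalType { BOOLEAN, INT32, INT64, BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY, UNDEFINED };

// Dispatch on the physical type; fail loudly instead of returning a null
// result that the caller might dereference later.
inline std::string StatsKindFor(PhysicalType type) {
  switch (type) {
    case PhysicalType::BOOLEAN:
      return "bool";
    case PhysicalType::INT32:
      return "int32";
    case PhysicalType::INT64:
      return "int64";
    case PhysicalType::BYTE_ARRAY:
      return "byte_array";
    case PhysicalType::FIXED_LEN_BYTE_ARRAY:
      return "flba";
    default:
      break;
  }
  throw std::invalid_argument("no statistics implementation for this physical type");
}
```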

namespace parquet {

template <typename DType>
static std::shared_ptr<RowGroupStatistics> MakeColumnStatsT(
Member

MakeColumnStatsTyped

for (auto i : selected_columns) {
auto column_chunk = group_metadata->ColumnChunk(i);
const ColumnStatistics stats = column_chunk->statistics();
const auto stats = column_chunk->statistics();
Member

Please write the type here.

Member

@wesm Do you have a link to the guidelines at hand?

Member

https://google.github.io/styleguide/cppguide.html#auto

aside: we should survey our usages of auto to make sure we are using const auto& in the right places
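
To show what is at stake in that survey, a small sketch with a hypothetical copy-counting type (`Tracked`). Iterating with plain `auto` copies every element, while `const auto&` does not.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type that counts how often it is copy-constructed.
struct Tracked {
  explicit Tracked(std::string p) : payload(std::move(p)) {}
  Tracked(const Tracked& other) : payload(other.payload) { ++copies; }
  std::string payload;
  static int copies;
};
int Tracked::copies = 0;

// `auto item` makes a copy of each element on every iteration.
inline std::size_t TotalByValue(const std::vector<Tracked>& items) {
  std::size_t total = 0;
  for (auto item : items) total += item.payload.size();
  return total;
}

// `const auto& item` binds a reference, so nothing is copied.
inline std::size_t TotalByRef(const std::vector<Tracked>& items) {
  std::size_t total = 0;
  for (const auto& item : items) total += item.payload.size();
  return total;
}
```

Both functions return the same total; only the by-value loop increments `Tracked::copies`, once per element.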

Member

@xhochy left a comment

First pass, will review again once the changes are made.

@wesm
Member

wesm commented Sep 29, 2016

I think this is about ready to go. Needs a rebase. @xhochy ?

Member

@xhochy left a comment

LGTM, rebase & then merge!

Artem Tarasov added 8 commits October 3, 2016 12:12
* EncodedStatistics now stores min/max as shared_ptr<string>
* restored old WriterProperties::Builder behavior
* renamed EncodedMin/EncodedMax to EncodeMin/EncodeMax
* moved page_/chunk_statistics_ to TypedColumnWriter
@lomereiter
Contributor Author

Thanks all, I rebased the branch.

@wesm
Member

wesm commented Oct 3, 2016

+1

@asfgit closed this in 176b08c on Oct 3, 2016
@lomereiter deleted the parquet-593 branch on July 4, 2017 19:08
6 participants