
PARQUET-2261: add statistics for better estimating unencoded/uncompressed sizes and finer grained filtering #197

Merged
merged 34 commits into from
Nov 14, 2023

Conversation

emkornfield
Contributor

No description provided.

src/main/thrift/parquet.thrift (outdated, resolved)
* When present there is expected to be one element corresponding to each repetition level (i.e. size = max repetition_level)
* where each element represents the count of the number of times that level occurs in the page/column chunk.
*/
8: optional list<i64> repetition_level_histogram;
Member

This seems to be a per-row-group statistic?

Contributor Author

@emkornfield emkornfield Mar 26, 2023

I'm not sure I understand; I thought statistics can be at either the page or the column chunk level. The histograms would vary by column chunk, I think, but I might be misunderstanding your suggestion.

Member

Oh, sorry for misunderstanding this. How can we make full use of this histogram?

Member

These levels are rather complicated for high-level applications; we may need to provide some utility to translate them into notions (e.g. struct/map/array/null) that are much easier to understand.

Contributor Author

I agree there is complexity here, but I think this is the simplest and most complete set of information we can provide to readers. The utility methods would likely be per-system, but we could provide some in core packages that give leaf-value estimates following the rules above.

@mapleFU in terms of usage, these would be used similarly to how row-level reconstruction is done with the levels themselves. For instance, assuming no repeated fields, the total number of nulls at the leaf can be computed as the cumulative sum of the first n-1 entries of the definition level histogram. Similarly, the number of nested lists, assuming no null lists, should be computable with a similar cumulative sum. When lists are nullable, repetition and definition levels need to be looked at together to distinguish null from empty lists. Similarly, the number of empty lists at a level can be inferred by looking at definition and repetition levels together.
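A minimal sketch of the first of these calculations in Python; the function name is illustrative (not proposed API), and it assumes the histogram carries one bucket per definition level, 0 through the maximum:

```python
def leaf_null_count(def_level_histogram):
    """Number of null leaf values for a column with no repeated fields.

    def_level_histogram[i] counts how often definition level i occurs;
    the list is assumed to have one bucket per level, 0 through
    max_definition_level. Every level below the maximum marks a null
    at some ancestor or at the leaf, so the nulls are everything but
    the last bucket.
    """
    return sum(def_level_histogram[:-1])

# e.g. max_definition_level = 3: two nulls at the outermost level,
# one at the middle level, five fully defined leaf values
# leaf_null_count([2, 1, 0, 5]) -> 3
```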

Contributor Author

I guess we could provide utility methods that account for common permutations of memory models

Member

It seems this can help push down some filters on List/Map, and help with constructing the lists. It's great, but I think maybe we need some samples, because it's a bit hard to understand how to make full use of it. Like some rules in ?

Contributor Author

Some simple examples from Arrow (my rep/def level calculations are rusty, so please double-check to make sure they make sense):

  1. If all you have are nested structs (say 3 nullable levels), then the number of null values at the leaf is the sum of the first 3 elements of the definition level histogram. Individual level nullability isn't super important for Arrow because it always leaves space for nulls, so it is pretty much square. You could, however, determine that there are no nulls at the second level of nesting by checking the appropriate histogram bucket (and potentially save the allocation of the bit vector in that case).
  2. If you have two nested lists, and assuming lists and elements are not nullable, the number of inner nested lists is rep_hist[0] - def_hist[0] + rep_hist[1] (outer list starts - empty outer lists + inner list starts).
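Example 2 can be computed directly from the two histograms. This is an illustrative helper, not proposed API, and it assumes non-nullable lists and elements as stated above:

```python
def inner_list_count(rep_hist, def_hist):
    """Number of inner lists for a list<list<T>> column where neither
    lists nor elements are nullable.

    rep_hist[r] counts repetition level r; def_hist[d] counts
    definition level d. Each non-empty outer list starts its first
    inner list at repetition level 0, and every additional inner list
    appears at repetition level 1, hence:
      (outer list starts - empty outer lists) + inner list starts
    """
    return rep_hist[0] - def_hist[0] + rep_hist[1]

# e.g. the rows [[[1, 2], [3]], [], [[4]]] encode to
# rep_hist = [3, 1, 1], def_hist = [1, 0, 4] -> 3 inner lists
```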

Member

Sounds feasible. Then we are good to store level histograms separately.

src/main/thrift/parquet.thrift (outdated, resolved)
src/main/thrift/parquet.thrift (outdated, resolved)
@wgtmac
Member

wgtmac commented Mar 27, 2023

@gszadovszky @shangxinli Have time to take a look?

@emkornfield emkornfield changed the title PARQUET-2261: Initial proposal for unencoded/uncompressed statistics PARQUET-2261: Proposal for unencoded/uncompressed statistics Mar 27, 2023
@mapleFU
Member

mapleFU commented Mar 27, 2023

Seems this patch doesn't need to consider backward-compatibility rules?

* Special calculations:
* - Enum: plain-encoded BYTE_ARRAY size
* - Integers (same size used for signed and unsigned): int8 - 1 byte, int16 - 2
* - Decimal - Each value is assumed to take the minimal number of bytes necessary to encode
Member

It seems that a small Decimal can be encoded as FLBA or BYTE_ARRAY, but a big decimal cannot be stored as i32. Should we force use of the physical type, or tie this to the physical type?

Contributor Author

I originally had this. Given the two different opinions expressed, I'm going to change this field to only record variable-width bytes, and say that all other calculations can be performed by readers based on the type and the number of values.
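For illustration, a hedged sketch of the reader-side arithmetic this implies: fixed-width sizes follow from the physical type and the value count, and only variable-width data needs the recorded statistic. The function name and the decoded byte widths are assumptions, not spec:

```python
# Widths of decoded fixed-width physical types (assumption: one byte
# per BOOLEAN once decoded; INT96 omitted for brevity).
FIXED_WIDTH_BYTES = {"BOOLEAN": 1, "INT32": 4, "INT64": 8,
                     "FLOAT": 4, "DOUBLE": 8}

def estimate_plain_size(physical_type, num_non_null_values,
                        unencoded_byte_array_data_bytes=0,
                        type_length=0):
    """Rough unencoded size of a column chunk's values: only
    BYTE_ARRAY needs the recorded statistic; everything else follows
    from the physical type and the non-null value count."""
    if physical_type == "BYTE_ARRAY":
        return unencoded_byte_array_data_bytes
    if physical_type == "FIXED_LEN_BYTE_ARRAY":
        return num_non_null_values * type_length
    return num_non_null_values * FIXED_WIDTH_BYTES[physical_type]
```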

@emkornfield
Contributor Author

Seems this patch doesn't need to consider backward-compatibility rules?

Since we are storing rep/def levels directly, I don't think so; those rules are applied on top of this data to make the correct inferences.

src/main/thrift/parquet.thrift (outdated, resolved)
src/main/thrift/parquet.thrift (outdated, resolved)
src/main/thrift/parquet.thrift (outdated, resolved)
Member

@mapleFU mapleFU left a comment

I'm happy with this if some common utilities for the histograms can be described and added. Thanks!

@wgtmac wgtmac requested a review from shangxinli March 28, 2023 02:56
*
* This value is optional when max_definition_level is 0.
*/
3: optional list<i64> definition_level_histogram;
Member

BTW, do we need to add an extra histogram for pair<def_level, rep_level> if both exist?

Contributor Author

I don't think so, but I might not be following what you are suggesting (I gave two examples from Arrow below on usage).

Member

If we store def_levels and rep_levels separately, how can we derive the number of nulls in each level precisely?

Member

I am thinking of supporting pushing down filters like IS_NULL or IS_NOT_NULL to nested fields, so I want to make sure this can satisfy the use case. Maybe we don't need a precise null_count for each level, but it would be great to answer yes or no to the filters above.

Contributor Author

It might pay to illustrate exact queries, but if this is just answering the question of whether there is any null element at a particular nesting level, I think the definition level histogram by itself gives that information.

Take nested lists where both lists and elements can be nullable at each level. IIRC, the definition levels would be as follows:
0 - null top-level list
1 - empty top-level list
2 - null nested list
3 - empty nested list
4 - null leaf element
5 - present leaf element

So if the query is 'top-level list is null', one could prune when def_hist[0] == 0. For 'is not null', one could prune if def_hist[0] == num_values for the page (i.e. all values are null).

I believe similar logic holds for def_hist[2], but it could get more complicated depending on the semantics of whether a null top-level element should imply the nested list is also null, or whether an empty list implies the nested list should be considered null (but it should still be derivable by using histogram indices 0, 1 and 2).

One thing the joint histogram (pairs of rep/def level counts) could give you is the number of first list elements that are null, but I'm not sure how useful that is. I would need to think about other queries the joint histogram would enable (or, if you have more examples of supported queries, we can figure out whether one is needed).
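A sketch of the top-level pruning checks described above, assuming the definition-level mapping listed for doubly nested nullable lists; the function names are illustrative, not proposed API:

```python
def can_skip_is_null_top_level(def_hist):
    """An IS NULL filter on the top-level list matches nothing when
    no value was recorded at definition level 0 (null top-level list)."""
    return def_hist[0] == 0

def can_skip_is_not_null_top_level(def_hist, num_values):
    """An IS NOT NULL filter matches nothing when every recorded value
    in the page/chunk is a null top-level list."""
    return def_hist[0] == num_values
```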

Member

@wgtmac wgtmac left a comment

Rest LGTM

src/main/thrift/parquet.thrift (outdated, resolved)
src/main/thrift/parquet.thrift (outdated, resolved)
src/main/thrift/parquet.thrift (outdated, resolved)
Contributor

@gszadovszky gszadovszky left a comment

Do we want to include these statistics at both row group (column chunk) and page level? For the latter I am not sure it is the right approach. We implemented column indexes so one would not need to read the page header to get the related statistics. We even stopped writing Statistics into page headers in parquet-mr.
If we only want these for the column chunk level then I would suggest having it under ColumnMetaData directly.

@emkornfield
Contributor Author

emkornfield commented Apr 1, 2023

Do we want to include these statistics at both row group (column chunk) and page level? For the latter I am not sure it is the right approach. We implemented column indexes so one would not need to read the page header to get the related statistics. We even stopped writing Statistics into page headers in parquet-mr. If we only want these for the column chunk level then I would suggest having it under ColumnMetaData directly.

@gszadovszky
Is there an argument against flexibility here? I believe parquet-cpp still writes page headers. One argument for page headers is that they allow readers better incremental estimates of the memory needed as they progress (although it is possible that an average size per cell at the column chunk level is sufficient here). These stats accomplish a few things: 1. finer-grained pushdown for nulls (mostly valuable at the column chunk level, and possibly worth adding to the page index); 2. estimating memory usage for in-memory reconstruction, avoiding allocations during scans; 3. query planning, to get better estimates for things like broadcast joins.

@gszadovszky
Contributor

@emkornfield, I am not against re-adding such statistics to page headers if it has benefits. Meanwhile, I would only expect filling in values that really make sense and are not redundant at the related level (page or column chunk). If a value makes sense for filtering, then it should be added to the column index instead of to the page header.
However, while from a specification point of view it would not be strictly required to describe the actual use case, it would be nice to do so somewhere. WDYT?

@emkornfield
Contributor Author

However from specification point of view it would not be strictly required to describe the actual use case, it would be nice to do so at some place. WDYT?

This sounds reasonable; let me refactor this a little bit so we can add it to the column index as well, with a little more description of each of the use cases.

@emkornfield
Contributor Author

@gszadovszky updated the PR based on your feedback, let me know what you think.

Contributor

@gszadovszky gszadovszky left a comment

However, I can accept having the old statistics saved in the page headers as well as in column indexes (for backward compatibility); you also highlight the superiority of the column indexes. As such, I would not extend the statistics saved in the page headers.
What do you think about adding SizeEstimationStatistics directly to ColumnMetaData and the histograms to the ColumnIndex?

src/main/thrift/parquet.thrift (outdated, resolved)
src/main/thrift/parquet.thrift (outdated, resolved)
src/main/thrift/parquet.thrift (outdated, resolved)
src/main/thrift/parquet.thrift (outdated, resolved)
src/main/thrift/parquet.thrift (outdated, resolved)
@emkornfield
Contributor Author

emkornfield commented Apr 5, 2023

What do you think about adding SizeEstimationStatistics directly to ColumnMetaData and the histograms to the ColumnIndex?

This sounds fine to me. We can always add them separately to page headers if people want them. @wgtmac, any concerns with this?

@wgtmac
Member

wgtmac commented Apr 6, 2023

What do you think about adding SizeEstimationStatistics directly to ColumnMetaData and the histograms to the ColumnIndex?

This sounds fine to me. We can always add them separately to page headers if people want them. @wgtmac, any concerns with this?

I am good with this approach.

@@ -764,6 +810,14 @@ struct ColumnMetaData {
* in a single I/O.
*/
15: optional i32 bloom_filter_length;

/**
Contributor

@JFinis JFinis Sep 12, 2023

(not related to this line, but to ColumnMetaData in general)

For completeness reasons, we might also want to add unencoded_byte_array_data_bytes and num_entries for the dictionary page (if present) into the ColumnMetaData, i.e., dictionary_unencoded_byte_array_data_bytes and num_dictionary_entries.

This way, readers could plan how much memory the dictionary of a column chunk will take. This can help in decisions whether, e.g., to load the dictionary up-front to perform pre-filtering on the dictionary. It also helps to right-size the buffer that will hold the dictionary.

I'm not suggesting that this is a must-have for this commit or at all, so feel free to drop this issue. I just wanted to voice that if we already want to provide tools for size estimation, the dictionary is currently not really accounted for.
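As a purely illustrative sketch of the planning decision this comment describes: the two fields are only proposed here, so their names and the per-entry overhead model below are assumptions, not anything in the format:

```python
def should_prefilter_on_dictionary(dictionary_unencoded_bytes,
                                   num_dictionary_entries,
                                   memory_budget_bytes):
    """Decide whether to load a column chunk's dictionary up-front to
    pre-filter on it: only when the decoded dictionary (raw bytes plus
    a per-entry offset) fits in the memory budget."""
    per_entry_overhead = 4  # assumption: one 32-bit offset per entry
    estimated = (dictionary_unencoded_bytes
                 + num_dictionary_entries * per_entry_overhead)
    return estimated <= memory_budget_bytes
```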

Contributor Author

I agree this could be useful; my preference is to leave this out for now, as it hasn't come up in discussion before. We can always add this as a follow-up.

Member

These fields are already present in the page header, but it requires an inefficient hop to read them.

@wgtmac
Member

wgtmac commented Oct 28, 2023

Java PoC implementation: apache/parquet-java#1177

etseidl added a commit to etseidl/cudf that referenced this pull request Oct 28, 2023
@etseidl
Contributor

etseidl commented Nov 1, 2023

C++ PoC as well rapidsai/cudf#14000. Do we still need arrow-cpp to proceed? If so, @mapleFU have you started on apache/arrow#37869 yet? If not, I have cycles to try a first pass at an implementation if that would be helpful.

@wgtmac
Member

wgtmac commented Nov 2, 2023

C++ PoC as well rapidsai/cudf#14000. Do we still need arrow-cpp to proceed? If so, @mapleFU have you started on apache/arrow#37869 yet? If not, I have cycles to try a first pass at an implementation if that would be helpful.

IMHO, two PoC impls do not imply that they have to be parquet-mr and arrow-cpp. So should we proceed to a formal vote on this PR? @emkornfield @gszadovszky

Once this PR has been merged, I can proceed with the release process of parquet format v2.10. After that, all pending PoCs can be formally reviewed and merged.

@emkornfield
Contributor Author

I think it is reasonable for two different implementations that can be demonstrated to interop with each other. We should really document the exact policy though. @gszadovszky @pitrou

@etseidl
Contributor

etseidl commented Nov 3, 2023

I think it is reasonable for two different implementations that can be demonstrated to interop with each other. We should really document the exact policy though.

I have @wgtmac's branch compiled locally, and can test the interoperability between the two existing PoC implementations. FWIW I've been using a modified parquet-cli in parquet-mr to print out the statistics I generate, so I don't foresee any big problems there.

Update: I've verified that Spark with Gang's changes can read my files, and my implementation can read the Spark-generated files. I also verified that the histograms and unencoded sizes match at the chunk level (pages are problematic because we choose different cutoff strategies, so even if the limits are set to the same values we get different page sizes).

Update 2: I set rows per page to 100 and then got congruent files. Can now confirm both PoCs produce the same histograms and unencoded sizes at the page level as well. This is for a file with a wide variety of nesting with and without null values. Let me know if there are any specific files from parquet-testing you'd like to see run through both PoCs.

@wgtmac
Member

wgtmac commented Nov 5, 2023

Thanks @etseidl for verifying two implementations!

As they are still in the PoC state, I think the manual verification is sufficient, and I prefer delaying the interoperability work of adding parquet files with SizeStatistics to the parquet-testing repo. We can add testing files after the implementations are formally reviewed and merged.

WDYT? @emkornfield

@emkornfield
Contributor Author

Thanks @etseidl for verifying two implementations!

As they are still in the PoC state, I think the manual verification is sufficient, and I prefer delaying the interoperability work of adding parquet files with SizeStatistics to the parquet-testing repo. We can add testing files after the implementations are formally reviewed and merged.

WDYT? @emkornfield

I think this is sufficient for now. @wgtmac, if we think the Java PR is close enough, I can start a vote for approval on this PR, unless you want to do it?

@wgtmac
Member

wgtmac commented Nov 5, 2023

I think it is a good time to start a vote. @emkornfield

I will take care of the release process once the proposal passes the vote :)

@emkornfield
Contributor Author

Started a vote on the mailing list.

@wgtmac
Member

wgtmac commented Nov 8, 2023

Could you help with the vote on the mailing list: https://lists.apache.org/thread/wgobz41mfldbhqpg9q4mdwypghg2cxg2? Help is needed from the PMC members. @ggershinsky @gszadovszky @julienledem @rdblue @wesm @xhochy

@emkornfield
Contributor Author

The vote passed. Is the standard way of merging here using a squash commit?

@Fokko
Contributor

Fokko commented Nov 14, 2023

@emkornfield I don't think there is a formal agreement on it in the Parquet project. Squash is my personal preference, and that's what the pre-GitHub script is doing as well:

run_cmd(['git', 'merge', pr_branch_name, '--squash'])

@emkornfield emkornfield merged commit 40699d0 into apache:master Nov 14, 2023
2 checks passed
rapids-bot bot pushed a commit to rapidsai/cudf that referenced this pull request Dec 6, 2023