Closed
Changes from 2 commits
13 changes: 13 additions & 0 deletions src/main/thrift/parquet.thrift
@@ -194,6 +194,13 @@ enum FieldRepetitionType {
/**
* Statistics per row group and per page
* All fields are optional.
*
* For BinaryStatistics in Parquet, we want to distinguish between the statistics derived from
* comparisons of signed or unsigned bytes.
* By default, Parquet will compare Binary type data as a signed bytestring, and this is the
* default behavior for filter pushdown when signed and unsigned are not specified in the
* Statistics. However, when signed_min, signed_max, unsigned_min and unsigned_max are all
* specified, the client has the option of using either signed or unsigned bytestring statistics.
Contributor

How about we say that, for a given file, its binary stats will be written in one of three ways:

  1. No min/max fields are populated: no stats are available.
  2. Only fields 1 and 2 are populated, and fields 5, 6, 7, and 8 are not: only signed min/max are available.
  3. Only fields 5, 6, 7, and 8 are populated: both signed and unsigned stats are available. Fields 1 and 2 are deprecated for binary fields from now on.
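
To make the three cases concrete, here is a minimal reader-side sketch in Python. It is only an illustration under stated assumptions: the `Statistics` dataclass is a hypothetical in-memory mirror of the Thrift struct (with fields 1 and 2 renamed `legacy_max` and `legacy_min` for clarity), `classify_binary_stats` is a made-up helper, and treating a partially populated set of fields 5 to 8 as "not populated" is my assumption, not something the proposal specifies.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class BinaryStatsKind(Enum):
    NONE = "no stats available"
    SIGNED_ONLY = "only signed min/max available (fields 1 and 2)"
    SIGNED_AND_UNSIGNED = "signed and unsigned min/max available (fields 5-8)"


@dataclass
class Statistics:
    # Hypothetical mirror of the Thrift struct; Thrift field ids in comments.
    legacy_max: Optional[bytes] = None    # field 1
    legacy_min: Optional[bytes] = None    # field 2
    signed_max: Optional[bytes] = None    # field 5
    signed_min: Optional[bytes] = None    # field 6
    unsigned_max: Optional[bytes] = None  # field 7
    unsigned_min: Optional[bytes] = None  # field 8


def classify_binary_stats(stats: Statistics) -> BinaryStatsKind:
    """Decide which of the three cases a binary column's stats fall into."""
    new_fields = (
        stats.signed_max,
        stats.signed_min,
        stats.unsigned_max,
        stats.unsigned_min,
    )
    # Case 3: all of fields 5-8 are present.
    if all(field is not None for field in new_fields):
        return BinaryStatsKind.SIGNED_AND_UNSIGNED
    # Case 2: only the legacy pair is usable (assumption: a partial set of
    # fields 5-8 is ignored rather than treated as an error).
    if stats.legacy_max is not None and stats.legacy_min is not None:
        return BinaryStatsKind.SIGNED_ONLY
    # Case 1: nothing usable.
    return BinaryStatsKind.NONE
```

A reader would call `classify_binary_stats` once per column chunk or page and only attempt unsigned filter pushdown in the `SIGNED_AND_UNSIGNED` case.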

*/
struct Statistics {
/** min and max value of the column, encoded in PLAIN encoding */
@@ -203,6 +210,12 @@ struct Statistics {
3: optional i64 null_count;
/** count of distinct values occurring */
4: optional i64 distinct_count;
/* Signed min and max */
Member

/* signed min and max for binary fields */
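
For context on why the two pairs can differ, here is a small Python sketch. It assumes that both orderings compare byte strings lexicographically, byte by byte, and differ only in whether each byte is read as a signed or unsigned 8-bit integer; the helper names are hypothetical and this is not the Parquet implementation.

```python
from typing import Callable, Iterable, List, Tuple


def signed_key(value: bytes) -> List[int]:
    # Reinterpret each byte as a signed 8-bit integer (-128..127).
    return [b - 256 if b >= 128 else b for b in value]


def unsigned_key(value: bytes) -> List[int]:
    # Bytes are already unsigned (0..255) in Python.
    return list(value)


def min_max(
    values: Iterable[bytes], key: Callable[[bytes], List[int]]
) -> Tuple[bytes, bytes]:
    # Lists compare lexicographically, so the key alone fixes the ordering.
    values = list(values)
    return min(values, key=key), max(values, key=key)


values = [b"\x01", b"\x7f", b"\x80", b"\xff"]
signed_min, signed_max = min_max(values, signed_key)
unsigned_min, unsigned_max = min_max(values, unsigned_key)
```

With these inputs the signed ordering places `b"\x80"` lowest and `b"\x7f"` highest, while the unsigned ordering places `b"\x01"` lowest and `b"\xff"` highest, so a filter evaluated against the wrong pair could skip data it should have read.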

5: optional binary signed_max;
6: optional binary signed_min;
/* Unsigned min and max */
7: optional binary unsigned_max;
8: optional binary unsigned_min;
}

/**