diff --git a/README.md b/README.md
index b0423e673..70fcecb20 100644
--- a/README.md
+++ b/README.md
@@ -212,6 +212,15 @@ The format is explicitly designed to separate the metadata from the data. This
 allows splitting columns into multiple files, as well as having a single metadata
 file reference multiple parquet files.
 
+## RowGroup Statistics
+In Parquet, the metadata for each RowGroup contains Statistics, which can be used by
+clients for filtering purposes. An example implementation of filtering logic can be found
+in [parquet-mr](https://github.com/apache/parquet-mr). Statistics include information
+like the minimum and maximum for primitive types, while for binary data there is an
+additional notion of _signed_ and _unsigned_ interpretations of the byte strings, which
+have different comparison operations and are stored in the optional fields
+`unsigned_min`, `unsigned_max`, `signed_min` and `signed_max`.
+
 ## Configurations
 - Row group size: Larger row groups allow for larger column chunks which makes it
 possible to do larger sequential IO. Larger groups also require more buffering in
diff --git a/src/main/thrift/parquet.thrift b/src/main/thrift/parquet.thrift
index ac4d50eb4..011c31823 100644
--- a/src/main/thrift/parquet.thrift
+++ b/src/main/thrift/parquet.thrift
@@ -194,6 +194,22 @@ enum FieldRepetitionType {
 /**
  * Statistics per row group and per page
  * All fields are optional.
+ *
+ * Binaries are sorted lexicographically (byte by byte), treating each byte as
+ * an integer. The signed sorting treats each byte as a signed two's
+ * complement number, and the unsigned treats the byte as an unsigned number.
+ * When one bytestring is a prefix of another, the containing bytestring is
+ * "greater than" the prefix.
+ *
+ * For BinaryStatistics in Parquet, we want to distinguish between the
+ * statistics derived from comparisons of signed or unsigned bytes. The min
+ * and max fields are deprecated for BinaryStatistics, instead relying on
+ * specification of {unsigned,signed}_{min,max}. The filter API should allow
+ * clients to specify which statistics and method of comparison should be used
+ * for filtering. To maintain backward format compatibility, when filtering
+ * based on signed statistics the signed_min and signed_max are checked first,
+ * and if they are unset it falls back to using the values in min and max,
+ * treating them as signed bytestrings.
  */
 struct Statistics {
   /** min and max value of the column, encoded in PLAIN encoding */
@@ -203,6 +219,12 @@ struct Statistics {
   3: optional i64 null_count;
   /** count of distinct values occurring */
   4: optional i64 distinct_count;
+  /* Signed min and max for binary fields */
+  5: optional binary signed_max;
+  6: optional binary signed_min;
+  /* Unsigned min and max for binary fields */
+  7: optional binary unsigned_max;
+  8: optional binary unsigned_min;
 }
 
 /**
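
Note on the comparison semantics described in the `Statistics` comment above: the two orderings can be sketched in a few lines of Python. This is only an illustration of the rules as written in the comment, not code from parquet-mr. The names `compare_binary` and `signed_bounds` are hypothetical, and representing the statistics as a plain `dict` keyed by the Thrift field names is an assumption made for brevity.

```python
from typing import Optional, Tuple


def _to_signed(byte: int) -> int:
    """Reinterpret an unsigned byte (0..255) as two's complement (-128..127)."""
    return byte - 256 if byte >= 128 else byte


def compare_binary(a: bytes, b: bytes, signed: bool) -> int:
    """Compare two bytestrings byte by byte; return -1, 0, or 1."""
    for x, y in zip(a, b):
        if signed:
            x, y = _to_signed(x), _to_signed(y)
        if x != y:
            return -1 if x < y else 1
    # All shared bytes are equal, so the containing (longer) string is greater.
    return (len(a) > len(b)) - (len(a) < len(b))


def signed_bounds(stats: dict) -> Tuple[Optional[bytes], Optional[bytes]]:
    """Hypothetical fallback: use signed_min/signed_max when set, else the
    deprecated min/max, which are then treated as signed bytestrings."""
    lo = stats.get("signed_min")
    hi = stats.get("signed_max")
    if lo is None:
        lo = stats.get("min")
    if hi is None:
        hi = stats.get("max")
    return lo, hi
```

The two orderings disagree exactly when a byte has its high bit set: `compare_binary(b"\x7f", b"\x80", signed=False)` places `0x7f` first, while the signed comparison places `0x80` (interpreted as -128) first. This is why a reader needs to know which interpretation produced a given min/max pair before using it to skip a row group.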