Conversation

@a10y commented Aug 25, 2016

Part of the fix for PARQUET-686; see the discussion in Parquet-MR: apache/parquet-java#362

@isnotinvain @julienledem

* Statistics per row group and per page
* All fields are optional.
*
* For BinaryStatistics in parquet-mr, we want to distinguish between the statistics derived from
Member:

Leave out the -mr as we should also implement this in parquet-cpp.

Author:

good catch :)

@piyushnarang

LGTM

@a10y (Author) commented Aug 25, 2016

Any reason Travis hasn't started? Does it need an approval before the build starts?

* By default, Parquet will compare Binary type data as a signed bytestring, and this is the
* default behavior for filter pushdown when signed and unsigned are not specified in the
* Statistics. However, when signed_min, signed_max, unsigned_min and unsigned_max are all
* specified, the client has the option of using either signed or unsigned bytestring statistics.
Contributor:

How about we say that for a given file, its binary stats will be written in one of three ways (sketched in code below):

  1. No min/max fields are populated -- no stats available.
  2. Only fields 1 and 2 are populated, not fields 5, 6, 7, 8 -- only signed min/max available.
  3. Only fields 5, 6, 7, 8 are populated -- both signed and unsigned stats available; fields 1 and 2 are deprecated for binary fields from now on.
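A minimal Java sketch of how a reader might distinguish these three cases; the BinaryStats holder and its field names mirror the fields proposed in this PR but are hypothetical, not an existing parquet-mr API:

```java
// Hypothetical holder mirroring the proposed Thrift statistics fields; all optional.
class BinaryStats {
    byte[] min, max;                 // fields 1 and 2 (legacy, signed comparison)
    byte[] signedMin, signedMax;     // proposed explicit signed fields
    byte[] unsignedMin, unsignedMax; // proposed explicit unsigned fields
}

class StatsCases {
    static String describe(BinaryStats s) {
        if (s.signedMin != null && s.unsignedMin != null)
            return "signed and unsigned stats available"; // case 3: new fields only
        if (s.min != null && s.max != null)
            return "only signed min/max available";       // case 2: legacy fields 1 and 2
        return "no stats available";                      // case 1: nothing populated
    }
}
```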

@isnotinvain (Contributor)

I think we also need to update the specification in this repo to explicitly say what we decide: the sort orders used and what the various fields being populated signify.

For example, if present, fields 1 and 2 will always represent the signed min/max.

@julienledem (Member)

@andreweduffy Travis gets triggered automatically (the orange dot means it's been triggered). However, I think there's a build queue per account, and the Apache build queue has a lot of projects in it.

@julienledem (Member)

@isnotinvain: We should also check how this is implemented in impala/parquet-cpp to make sure we are consistent.

3: optional i64 null_count;
/** count of distinct values occurring */
4: optional i64 distinct_count;
/* Signed min and max */
Member:

/* signed min and max for binary fields */

@julienledem (Member)

@wesm @nongli to your knowledge, is this consistent with the C++ side of things? (Also the isSorted metadata.)
@rdblue does that impact Presto?

@isnotinvain (Contributor)

I think @piyushnarang took a look at the C++ side and mentioned that it does not yet write stats?
Would be good to confirm though.

I would like to propose:

  1. deprecate min/max in the case of binary fields. When populated in old files we know they are signed.

  2. start only populating the 4 new fields, which leaves no room for confusion, and then add some more filtering APIs to parquet-mr that let the caller choose which min/max they are interested in.

We should also update the parquet-format README / spec page with this if we agree on it.

@isnotinvain (Contributor)

Another, more aggressive option is to add only 2 new fields, unsignedMin and unsignedMax, and to change the parquet-mr read path to only respect unsignedMin and unsignedMax in inequality filters (less than / greater than), simply acting as if there are no known stats when these aren't populated. The signed min/max are still valid for equality filters, though. This would probably be simpler (only one supported sort ordering), but old files would lose any stats-based optimizations for greater-than / less-than queries.

@piyushnarang

@julienledem / @isnotinvain - yeah, so it seems like the cpp work is still in progress: apache/parquet-cpp#129
They're currently using unsigned there, though, so it's something that will need to be updated in the cpp PR.

@a10y (Author) commented Aug 26, 2016

@isnotinvain so if I'm understanding you correctly, the plan is to

  1. Remove signed_min and signed_max, leaving only the addition of unsigned_min and unsigned_max
  2. In the implementations, change the default ordering of Binary to be based on unsigned comparisons

Is that correct?

@wesm (Member) commented Aug 26, 2016

To repeat my question from the mailing list: would changing the comparison used based on the ConvertedType also solve the issue (e.g. using unsigned comparison for UTF8)?

Now, if min and max are provided alone, they default to a signed
interpretation. unsigned_min and unsigned_max must be declared
explicitly for the unsigned semantics to be used.
@a10y (Author) commented Aug 28, 2016

I think that would work to solve this problem, though it introduces some extra complexity: from what I can tell, the ConvertedType for columns is stored only in the FileMetadata, which is not generally available to StatisticsFilter, though it could potentially be added to make the filter aware of the extra type information. Nothing else in the code seems to depend on ConvertedType; as far as I can tell it's mainly used during deserialization by compute engines converting from Parquet to their own internal types. Not sure how people feel about integrating it into filtering.

@isnotinvain (Contributor)

@wesm converted type / original type tags are sort of optional, so we'd have to rely on each object model integration to tag these things correctly, and then, as @andreweduffy points out, we'd also have to wire that all the way through some layers that probably don't currently have that info.

@isnotinvain (Contributor)

@andreweduffy

Well, what I was saying is that I think there are 2 options.

Option 1:

  1. deprecate min/max in the case of binary fields. When populated in old files we know they are signed.

  2. From now on, only populate the 4 new fields: signed_min, signed_max, unsigned_min, unsigned_max, which leaves no room for confusion, and then add some more filtering APIs to parquet-mr that let the caller choose which min/max they are interested in. When signed_min / signed_max is requested, check those fields first, then fall back to the deprecated min/max fields (see the sketch after this comment).

Option 2:

  1. Same as option 1, but treat the current min/max as the signed_min / signed_max fields. Less explicit, but fewer fields overall.

I would probably lean towards option 1 as it's clearer. What do you think?
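A minimal Java sketch of the option 1 fallback described above, reusing the hypothetical BinaryStats holder from the earlier sketch; these accessors are illustrative, not an existing parquet-mr API:

```java
class SignedStatsAccess {
    // Caller explicitly asked for the signed min: prefer the new explicit field,
    // then fall back to the deprecated field 1 written by old files, which we
    // know is always signed. Returns null if neither is set.
    static byte[] signedMin(BinaryStats s) {
        return (s.signedMin != null) ? s.signedMin : s.min;
    }

    // No fallback for unsigned: old files never wrote unsigned stats.
    static byte[] unsignedMin(BinaryStats s) {
        return s.unsignedMin;
    }
}
```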

@a10y (Author) commented Aug 31, 2016

I'm fine with (1). I went and reverted the last commit I made, so there are 6 total statistics fields again, and I improved the comment to integrate some of the wording from your explanation.

Once this is merged we'll need another format release to build parquet-mr against.

@wesm (Member) commented Aug 31, 2016

@isnotinvain understood -- I imagine there are some existing systems that store UTF8 data without marking the appropriate converted type, and parquet-mr / other parquet implementations have no way of distinguishing that from regular binary data. On the flip side, without this additional type metadata it's hard to know the right comparator to use (for example, Impala does not have built-in handling of UTF8 / Unicode yet, so if you have non-ASCII data then partition pruning based on statistics may not work correctly). Are there any other kinds of common data stored in BYTE_ARRAY where the signed interpretation is the right one?

@a10y (Author) commented Aug 31, 2016

@wesm I've never heard of any; in effect the unsigned interpretation is really the more widely accepted one, as far as I can tell, though JVM languages can't represent an unsigned char *, so they're stuck with an implicitly signed byte[]. Spark does this for non-string binary data, and I'm sure other JVM systems do too. Having to specify signedness is a bit gross, but it is necessary if we want to allow either set of statistics to be usable for binary-typed data. The alternative would be to add only unsigned min/max to the format and move binary over to using unsigned all the time, but then Spark and likely others would need to change their sort ordering to match up with Parquet.

@isnotinvain (Contributor)

I agree that for utf8 strings, unsigned is standard. But for generic byte arrays, I think it's fair to say there isn't really a "natural" ordering to them. If we were designing this from scratch today, I'd vote for just doing unsigned everywhere because it's simple, works for utf8, and isn't wrong for non-utf8.

That said, we've already been storing signed min/max in the only implementation of binary stats, so unfortunately we need to support that to some degree. I think the best way to migrate is to start storing both signed + unsigned and force the caller to specify which they want. In the case of Impala, it can simply say that it always wants unsigned -- and when it sees a file w/o unsigned stats, it'll have to skip the stats-based filtering.

One thing to clarify, though: for == and != queries, you can use either signed or unsigned stats, even for utf8. If the query is for foo.bar == "café" and the only stats available are signed, you can still use those to determine whether a rowgroup either a) may have the data you are looking for or b) definitely does not. It's only for >, >=, <, <= queries that it matters whether signed or unsigned is used (see the sketch after this comment). So I do think we should make sure that both parquet-mr and parquet-cpp special-case the equality filters.

Does that make sense?
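A minimal Java sketch of that special case, assuming min/max were computed under whatever total order cmp implements; the helper names are hypothetical:

```java
import java.util.Comparator;

class StatsPruning {
    // Valid for == filters regardless of the query's collation: as long as cmp
    // is the same total order the writer used to compute min/max, any matching
    // value must lie inside [min, max] under that order.
    static boolean canDropEquals(byte[] value, byte[] min, byte[] max,
                                 Comparator<byte[]> cmp) {
        return cmp.compare(value, min) < 0 || cmp.compare(value, max) > 0;
    }

    // Only valid when the stats' order also matches the query's semantics;
    // feeding signed stats to an unsigned range predicate gives wrong answers.
    static boolean canDropLessThan(byte[] value, byte[] min,
                                   Comparator<byte[]> cmp) {
        return cmp.compare(min, value) >= 0; // every row in the group is >= value
    }
}
```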

* clients to specify which statistics and method of comparison should be used
* for filtering. To maintain backward format compatibility, when filtering
* based on signed statistics the signed_min and signed_max are checked first,
* and if they are unset it falls back to using the values in min and max.
Contributor:

Let's clarify that min and max are always signed if present.

I wonder if any of this belongs in the README too?

@wesm (Member) commented Sep 1, 2016

@isnotinvain all makes sense, this is a good plan

@a10y (Author) commented Sep 1, 2016

Yep. I would like to point out, though, that you'll get extra false positives in the scan for "may contain this value" that you wouldn't get with unsigned comparison, but that doesn't affect correctness in the same way. In effect it doesn't really matter; once systems start using the correct statistics, these false positives should decrease.

@a10y (Author) commented Sep 6, 2016

Hey @isnotinvain @piyushnarang and anyone else invested, is there anything blocking this currently?

@julienledem (Member)

LGTM

@isnotinvain (Contributor)

+1, only thing I'd add is maybe a detailed explanation of how signed + unsigned min/max are calculated.

Something like:
Binaries are sorted lexicographically (byte by byte), treating each byte as an integer. The signed sorting treats each byte as a signed two's complement number, and the unsigned sorting treats each byte as an unsigned number. A binary with fewer bytes is considered to be 'less than' a binary with more bytes (is that correct?). Comparators along these lines are sketched below.
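A minimal Java sketch of the two comparators being described, to pin down the semantics; this is illustrative, not parquet-mr's actual implementation:

```java
import java.util.Comparator;

class ByteStringOrders {
    // Lexicographic, each byte compared as a signed two's complement value (-128..127).
    static final Comparator<byte[]> SIGNED = (a, b) -> {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            if (a[i] != b[i]) return Byte.compare(a[i], b[i]);
        }
        return Integer.compare(a.length, b.length); // a strict prefix sorts first
    };

    // Lexicographic, each byte compared as an unsigned value (0..255).
    static final Comparator<byte[]> UNSIGNED = (a, b) -> {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    };
}
```

The two orders disagree exactly when a high-bit byte is involved: SIGNED puts 0xFF (-1) before 0x00, while UNSIGNED puts it after, which is why non-ASCII UTF-8 data sorts differently under the two.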

@a10y (Author) commented Sep 6, 2016

Done, @isnotinvain, let me know if you like that. Once this merges there should probably be a release that ongoing parquet-mr and parquet-cpp changes can depend on.

@a10y (Author) commented Sep 8, 2016

I'm assuming it was the comment in the thrift file where you wanted me to place that; I don't know where in the README it would've made sense.

@rdblue (Contributor) commented Sep 8, 2016

Sorry I'm late to the discussion here.

I don't see the need to stop using min and max, or why we should add the unsigned min and max fields.

We have two orderings to deal with: the unsigned ordering that is correct for UTF-8, and the signed ordering that is easy to use by accident in Java. I think the rationale for continuing to provide the signed min and max is that it is easy for devs to use it by accident, but I don't think that's a good reason to legitimize its use in Parquet. It is the wrong ordering for UTF-8 and it's the wrong ordering for every use of binary or fixed that I can think of, like hash bytes or decimal bytes. I don't think we need to have signed min and max fields at all.

Also, rather than moving to unsigned min and max fields, I'd rather add a field to record the ordering that was used (see the sketch after this list). If that's missing, then we know it is the signed ordering. Otherwise, we can check that the identifier shows it is the unsigned ordering before using the stats for filtering. This would fix the current problem (knowing what sort order was used) and would allow us to address two other cases:

  1. The min/max byte arrays are used for the other types as well. Some of those types have unsigned variants that also have this problem (we only perform signed comparison).
  2. SQL engines will eventually want to use sort orders other than the lexicographic UTF-8 ordering to support the natural order in non-English languages. Allowing an identifier for the sort ordering will be a way to address those use cases.
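A minimal Java sketch of the ordering-identifier idea; the SortOrder enum and its values are hypothetical, purely to illustrate recording which order the stats were written with rather than duplicating min/max fields:

```java
// Hypothetical identifier for the total order used to compute min/max.
enum SortOrder { SIGNED_BYTES, UNSIGNED_BYTES /* room for collation-specific orders later */ }

class OrderedStats {
    byte[] min, max;
    SortOrder order; // null in old files, which we know used signed comparison

    // Range filters may only use these stats when the orders match.
    boolean usableForRange(SortOrder queryOrder) {
        SortOrder written = (order != null) ? order : SortOrder.SIGNED_BYTES;
        return written == queryOrder;
    }
}
```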

@julienledem (Member)

@isnotinvain: see Ryan's comment above

@timarmstrong (Contributor)

It looks like this stalled out. We've been looking at adding Impala support for writing min/max stats and ran into this issue. We've also discovered additional gaps in the Parquet spec and corresponding problems in the parquet-mr implementation: https://issues.apache.org/jira/browse/PARQUET-839, https://issues.apache.org/jira/browse/PARQUET-840

From the point of view of a query-processing engine, I agree with the suggestion from @rdblue that custom sort orderings may be useful, although I think that may be a niche use case. In most cases, having a sane default ordering for the logical types would be sufficient to make min-max stats easier to implement correctly and more effective at filtering data.

Currently the sort order in parquet-mr is based on the primitive type rather than the logical type. I think all of the logical types in Parquet have an "obvious" default sort order (for non-NULL values) that would match the order of the corresponding SQL types. E.g. UTF-8: sort by Unicode code points. Decimal: sort by numeric value. Date/time/timestamp: sort by time. Interval: sort by month, then second, then millisecond. Lexicographic ordering of the binary data can make min-max filtering less effective compared with the "natural" sort order for the logical data type. E.g. in the natural sort order for a decimal encoded as a fixed-length byte array, [-1, 0] is a small range, but in the lexicographic order it covers all possible values (illustrated below).
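A small runnable illustration of that decimal example, assuming DECIMAL values stored as two's-complement big-endian bytes (as fixed-length byte array decimals are):

```java
import java.math.BigInteger;

class DecimalOrderDemo {
    public static void main(String[] args) {
        byte[] minusOne = BigInteger.valueOf(-1).toByteArray(); // { (byte) 0xFF }
        byte[] zero     = BigInteger.valueOf(0).toByteArray();  // { 0x00 }

        // Numerically, [-1, 0] is a tiny range. Under unsigned lexicographic
        // order, though, -1 encodes as 0xFF, the largest byte, so the unsigned
        // min/max of these two values is (0x00, 0xFF) -- an interval spanning
        // essentially all values, which prunes nothing.
        int unsignedCmp = Integer.compare(minusOne[0] & 0xFF, zero[0] & 0xFF);
        System.out.println("unsigned order says -1 > 0: " + (unsignedCmp > 0)); // true
    }
}
```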

It's more difficult to implement in query-processing engines, because the engine needs to be aware of the sort order in some cases. E.g. to generate parquet files that support effective min-max filtering on a key, we would sort by that key, then generate the files. For this to work as intended, we'd have to emulate Parquet's peculiar sort order in the query engine and switch to it when inserting into parquet files.

@julienledem (Member)

Hi @timarmstrong,
This is a high-priority item to fix for the next release.
It sounds like you've reviewed #46; I'm posting it again for reference. Your input is highly valuable and your help welcome.
The next Parquet sync is Thursday, Feb 9th, 10am PT. I'd invite you to join so that we can finalize the solution. If this is too far out or not a good time, we can set up a hangout meeting earlier. Notes are then posted to the list.
Like you said, it is important that the stats match the ordering in the execution layer to be useful. We also need them to be standard across execution engines.
I think @andreweduffy, you (@timarmstrong), @rdblue, and @isnotinvain at least should join.
(Virtual) face-to-face meetings help us converge faster.

@a10y (Author) commented Feb 1, 2017

(FYI, I recently changed my username from @andreweduffy to @a10y; sorry about the confusion.) It looks like on the mailing list there's a fair amount of churn over when the call will actually be; could you also post here when the final time is chosen?

@lekv (Contributor) commented Jun 15, 2017

This PR seems obsoleted by PR #46, which added the min_value and max_value fields.

@a10y - Can we close this one? Or do you think some aspects are not properly addressed in #46?

@pono closed this Jun 15, 2017