PARQUET-686: Add optional {signed,unsigned}_{min,max} fields to Statistics #42
Conversation
src/main/thrift/parquet.thrift
Outdated
 * Statistics per row group and per page
 * All fields are optional.
 *
 * For BinaryStatistics in parquet-mr, we want to distinguish between the statistics derived from
Leave out the -mr as we should also implement this in parquet-cpp.
good catch :)
---
LGTM

---
Any reason the travis hasn't started? Does it need an approval before the build starts?
src/main/thrift/parquet.thrift
Outdated
 * By default, Parquet will compare Binary type data as a signed bytestring, and this is the
 * default behavior for filter pushdown when signed and unsigned are not specified in the
 * Statistics. However, when signed_min, signed_max, unsigned_min and unsigned_max are all
 * specified, the client has the option of using either signed or unsigned bytestring statistics.
How about we say that for a given file, its binary stats will be written in 1 of 3 ways (see the sketch after this list):
- All min/max fields are unpopulated -- no stats available
- Only fields 1 and 2 are populated, but not fields 5, 6, 7, 8 -- only signed min/max available
- Only fields 5, 6, 7, 8 are populated -- both signed and unsigned stats available; fields 1 and 2 are deprecated for binary fields from now on
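A minimal sketch of how a reader could branch on these three cases -- the `StatsMode` enum and the boolean "is populated" flags are illustrative stand-ins, not the actual thrift-generated accessors:

```java
// Hypothetical sketch of the three cases above; StatsMode and the boolean
// parameters are illustrative, not the real parquet-format generated API.
enum StatsMode { NONE, SIGNED_ONLY, SIGNED_AND_UNSIGNED }

final class BinaryStatsClassifier {
  // hasLegacyMinMax   -> fields 1 and 2 (min, max) are populated
  // hasSignedUnsigned -> fields 5, 6, 7, 8 are all populated
  static StatsMode classify(boolean hasLegacyMinMax, boolean hasSignedUnsigned) {
    if (hasSignedUnsigned) {
      // Both orderings available; fields 1 and 2 are deprecated for binary.
      return StatsMode.SIGNED_AND_UNSIGNED;
    }
    if (hasLegacyMinMax) {
      return StatsMode.SIGNED_ONLY; // legacy file: min/max are signed
    }
    return StatsMode.NONE; // no stats available
  }
}
```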
---
I think we also need to update the specification in this repo to explicitly say what we decide, the sort orders used, and what various fields being populated signifies. For example: if present, fields 1 and 2 will always represent the signed min/max.

---
@andreweduffy travis gets triggered automatically (the orange dot means it's been triggered). However, I think there's a build queue per account, and the apache build queue has a lot of projects in it.

---
@isnotinvain: We should also check how this is implemented in impala/parquet-cpp to make sure we are consistent.
src/main/thrift/parquet.thrift
Outdated
 3: optional i64 null_count;
 /** count of distinct values occurring */
 4: optional i64 distinct_count;
 /* Signed min and max */
/* signed min and max for binary fields */
---
I think @piyushnarang took a look at the C++ side and mentioned that it does not yet write stats? I would like to propose: […]
We should also update the parquet-format README / spec page with this if we agree on it.
---
Another more aggressive option is to add only 2 new fields, unsignedMin + unsignedMax, and to change the parquet-mr read path to only respect unsignedMin and unsignedMax in inequality filters (less than / greater than), simply acting as if there are no known stats when these aren't populated. The signed min/max would still be valid for equality filters, though. So this would probably be simpler (only 1 supported sort ordering), but old files would lose any stats-based optimizations for greater-than / less-than queries.
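Roughly, that read-path rule could look like the following. This is a sketch under the proposal's assumptions; `canDropForLessThan` and its signature are hypothetical, not the real StatisticsFilter API:

```java
import java.util.Optional;

// Hypothetical illustration of the "aggressive" option: inequality filters
// consult only the new unsigned stats; when they're absent (old files) we
// behave as if no stats exist, so the row group cannot be dropped.
final class InequalityFilterSketch {
  // Can we drop a row group for the predicate "value < literal"?
  static boolean canDropForLessThan(Optional<byte[]> unsignedMin, byte[] literal) {
    if (!unsignedMin.isPresent()) {
      return false; // old file without unsigned stats: keep the row group
    }
    // If every value is >= unsignedMin >= literal, nothing satisfies "< literal".
    return compareUnsigned(unsignedMin.get(), literal) >= 0;
  }

  // Unsigned lexicographic comparison of bytestrings.
  static int compareUnsigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      int cmp = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
      if (cmp != 0) return cmp;
    }
    return Integer.compare(a.length, b.length);
  }
}
```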
---
@julienledem / @isnotinvain - yeah, so it seems like the cpp work is still in progress: apache/parquet-cpp#129

---
@isnotinvain so if I'm understanding you correctly, the plan is to […]
Is that correct?
---
To repeat my question from the mailing list: would changing the comparison used based on the ConvertedType solve this problem?
Now, if min and max are provided alone, they default to a signed interpretation. unsigned_min and unsigned_max must be declared explicitly for the unsigned semantics to be used.
---
I think that would work to solve this problem, though it introduces some extra complexity: ConvertedType for columns is stored only in the FileMetadata, from what I can tell, which is not generally available to StatisticsFilter, though it could potentially be added to make the filter aware of extra type information. Nothing else in the code seems to depend upon ConvertedType; as far as I can tell it's mainly used in deserialization by the compute engines that are deserializing from Parquet to their own internal types. Not sure how people feel about integrating it into filtering.

---
@wesm converted type / original type tags are sort of optional, so we'd have to rely on each object model integration to tag these things correctly, and then, as @andreweduffy points out, we'd also have to wire that all the way through some layers that probably don't have that info currently.
---
@andreweduffy Well, I think there are 2 options, is what I was saying. Option 1: […]
Option 2: […]
I would probably lean towards option 1, as it's clearer. What do you think?
This reverts commit 8f33c3a.
---
I'm fine with (1). I went and reverted the last commit I made, so there are still 6 total statistics fields again, and I also improved the comment to integrate some of the wording from your explanation. Once this is merged we'll need another format release to build parquet-mr against.
---
@isnotinvain understood -- I imagine there are some existing systems that store UTF8 data without marking the appropriate converted type, but parquet-mr / other parquet implementations have no way of distinguishing that from regular binary data. On the flip side, without this additional type metadata it's hard to know which comparator is right to use (for example: Impala does not have built-in handling of UTF8 / Unicode yet, so if you have non-ASCII data then partition pruning based on statistics may not work correctly) -- are there any other kinds of common data stored in BYTE_ARRAY where the signed interpretation is the right one?
---
@wesm I've never heard of any; in effect, the unsigned interpretation is really the more widely accepted one as far as I can tell, though JVM things can't represent unsigned bytes natively.
---
I agree that for utf8 strings, unsigned is standard. But for generic byte arrays, I think it's fair to say there isn't really a "natural" ordering to them. If we were designing this from scratch today, I'd vote for just doing unsigned everywhere, because it's simple, works for utf8, and isn't wrong for non-utf8. That said, we've already been storing signed min/max in the only implementation of binary stats, so we need to support that to some degree, unfortunately. I think the best way to migrate is to start storing both signed + unsigned, and force the caller to specify which they want. In the case of impala, it can simply say that it always wants unsigned -- and when it sees a file w/o unsigned stats, it'll have to skip the stats-based filtering. One thing to clarify, though, is […] Does that make sense?
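For concreteness, the difference between the two orderings comes down to Java's signed `byte`. This standalone demo (plain Java, not parquet-mr code) shows a non-ASCII UTF-8 string sorting on opposite sides of an ASCII one under the two comparisons:

```java
import java.nio.charset.StandardCharsets;

// Demo of signed vs unsigned bytestring ordering. Java bytes are signed, so
// a naive comparison orders 0x80..0xFF *before* 0x00..0x7F, while the
// unsigned (UTF-8-correct) ordering puts them after.
public class ByteOrderings {
  static int compareSigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      if (a[i] != b[i]) return Byte.compare(a[i], b[i]); // range -128..127
    }
    return Integer.compare(a.length, b.length);
  }

  static int compareUnsigned(byte[] a, byte[] b) {
    int n = Math.min(a.length, b.length);
    for (int i = 0; i < n; i++) {
      if (a[i] != b[i]) return Integer.compare(a[i] & 0xFF, b[i] & 0xFF); // range 0..255
    }
    return Integer.compare(a.length, b.length);
  }

  public static void main(String[] args) {
    byte[] ascii = "abc".getBytes(StandardCharsets.UTF_8);
    byte[] nonAscii = "é".getBytes(StandardCharsets.UTF_8); // encodes as 0xC3 0xA9
    // signed: (byte) 0xC3 is negative, so "é" sorts before "abc" -- wrong for UTF-8
    System.out.println(compareSigned(nonAscii, ascii) < 0);   // true
    // unsigned: 0xC3 (195) > 'a' (0x61), so "é" sorts after "abc"
    System.out.println(compareUnsigned(nonAscii, ascii) > 0); // true
  }
}
```

Unsigned byte-wise comparison of UTF-8 encoded strings matches Unicode code point order, which is why it's the correct choice for utf8 data.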
src/main/thrift/parquet.thrift
Outdated
 * clients to specify which statistics and method of comparison should be used
 * for filtering. To maintain backward format compatibility, when filtering
 * based on signed statistics the signed_min and signed_max are checked first,
 * and if they are unset it falls back to using the values in min and max.
Let's clarify that min and max are always signed if present.
I wonder if any of this belongs in the README too?
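A minimal sketch of that fallback rule, assuming a hypothetical holder class rather than the generated thrift Statistics:

```java
import java.util.Optional;

// Hypothetical holder illustrating the backward-compatibility rule above:
// prefer signed_min; if unset, fall back to the legacy min, which (per this
// thread) is always signed when present.
final class SignedStatsView {
  Optional<byte[]> min = Optional.empty();        // field 1 (legacy, signed)
  Optional<byte[]> signedMin = Optional.empty();  // proposed signed_min field

  Optional<byte[]> effectiveSignedMin() {
    return signedMin.isPresent() ? signedMin : min;
  }
}
```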
---
@isnotinvain all makes sense, this is a good plan

---
Yep. I would like to point out, though, that you'll get extra false positives in the scan for "may contain this value" that you wouldn't get with unsigned comparison, but that does not affect correctness in the same way. In effect it doesn't really matter; once systems start using the correct statistics these false positives should decrease.

---
Hey @isnotinvain @piyushnarang and anyone else invested, is there anything blocking this currently?

---
LGTM

---
+1, the only thing I'd add is maybe a detailed explanation of how signed + unsigned min/max are calculated. Something like: […]
---
Done, @isnotinvain let me know if you like that. Once this merges there should probably be a release that we can depend on for the ongoing parquet-mr and parquet-cpp changes.

---
I'm assuming it was the comment in the thrift file where you wanted me to place that; I don't know where in the README it would've made sense.
Sorry I'm late to the discussion here. I don't see the need to stop using min and max, or why we should add the unsigned min and max fields. We have two orderings to deal with: the unsigned ordering that is correct for UTF-8, and the signed ordering that is easy to use by accident in Java.

I think the rationale for continuing to provide the signed min and max is that it is easy for devs to use it by accident, but I don't think that's a good reason to legitimize its use in Parquet. It is the wrong ordering for UTF-8 and it's the wrong ordering for every use of binary or fixed that I can think of, like hash bytes or decimal bytes. I don't think we need to have signed min and max fields at all.

Also, rather than moving to unsigned min and max fields, I'd rather add a field to record the ordering that was used. If that's missing, then we know it is the signed ordering. Otherwise, we can check that the identifier shows it is the unsigned ordering before using the stats for filtering. This would fix the current problem (knowing what sort order was used) and would allow us to address two other cases:
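Reader-side, that ordering field might be consumed like this. The `SortOrderTag` enum and field names here are assumptions for illustration, not a spec proposal:

```java
import java.util.Optional;

// Assumed names for illustration only: a single ordering tag on Statistics
// instead of duplicated min/max fields. A missing tag means the legacy
// signed ordering was used.
enum SortOrderTag { SIGNED_BYTES, UNSIGNED_BYTES }

final class OrderAwareStats {
  Optional<SortOrderTag> order = Optional.empty(); // proposed new field

  // Only trust min/max for UTF-8 filtering when we know the unsigned
  // ordering produced them; otherwise skip stats-based filtering.
  boolean usableForUtf8Filtering() {
    return order.isPresent() && order.get() == SortOrderTag.UNSIGNED_BYTES;
  }
}
```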
---
@isnotinvain: see Ryan's comment above

---
It looks like this stalled out. We've been looking at adding Impala support for writing min/max stats and ran into this issue. We've also discovered additional gaps in the Parquet spec and corresponding problems in the parquet-mr implementation: https://issues.apache.org/jira/browse/PARQUET-839, https://issues.apache.org/jira/browse/PARQUET-840

From the point of view of a query-processing engine, I agree with the suggestion from @rdblue that custom sort orderings may be useful, although I think that may be a niche use case. In most cases, having a sane default ordering for the logical types would be sufficient to make min/max stats easier to implement correctly and more effective at filtering data. Currently the sort order in parquet-mr is based on the primitive type rather than the logical type.

I think all of the logical types in Parquet have an "obvious" default sort order (for non-NULL values) that would match the order of the corresponding SQL types. E.g. UTF-8: sort by Unicode code points. Decimal: sort by numeric value. Date/time/timestamp: sort by time. Interval: sort by month, then second, then millisecond.

Lexicographic ordering of the binary data can make min/max filtering less effective compared with the "natural" sort order for the logical data type. E.g. in the natural sort order for a decimal encoded as a fixed-length byte array, [-1, 0] is a small range, but in the lexicographic order it covers all possible values (see the demo below).

It's also more difficult to implement in query-processing engines, because the engine needs to be aware of the sort order in some cases. E.g. to generate parquet files that support effective min/max filtering on a key, we would sort by that key, then generate the files. For this to work as intended, we'd have to emulate Parquet's peculiar sort order in the query engine and switch to that when inserting into parquet files.
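To make the decimal point concrete, a small standalone demo, assuming a one-byte two's-complement encoding:

```java
// Demo of the [-1, 0] decimal example above, using one-byte two's-complement
// values: numerically the range holds just two values, but under byte-wise
// lexicographic order it spans 0x00..0xFF, i.e. every possible value, so
// min/max filtering prunes nothing.
public class DecimalOrderDemo {
  public static void main(String[] args) {
    byte minusOne = (byte) 0xFF; // -1 in two's complement
    byte zero = 0x00;            //  0

    // Natural (numeric) order: -1 < 0 -> prints a negative number.
    System.out.println(Byte.compare(minusOne, zero));

    // Byte-wise lexicographic (unsigned) order: 0xFF is the largest byte,
    // so the stats become min = 0x00, max = 0xFF -> prints a positive number.
    System.out.println(Integer.compare(minusOne & 0xFF, zero & 0xFF));
  }
}
```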
---
Hi @timarmstrong,

---
(FYI: I recently changed my username from @andreweduffy to @a15y, sorry about the confusion.) It looks like on the mailing list there's a fair amount of churn over when the call will actually be; could you also post here when the final time is chosen?
Part of the fix for PARQUET-686, see discussion in Parquet-MR: apache/parquet-java#362
@isnotinvain @julienledem