Conversation

@a10y commented Aug 25, 2016

Part of the fix for PARQUET-686; see the discussion in Parquet-MR: apache/parquet-java#362

@isnotinvain @julienledem

* Statistics per row group and per page
* All fields are optional.
*
* For BinaryStatistics in parquet-mr, we want to distinguish between the statistics derived from
Member:

Leave out the -mr as we should also implement this in parquet-cpp.

Author:

good catch :)

@piyushnarang

LGTM

@a10y (Author) commented Aug 25, 2016

Any reason Travis hasn't started? Does it need an approval before the build starts?

* By default, Parquet will compare Binary type data as a signed bytestring, and this is the
* default behavior for filter pushdown when signed and unsigned are not specified in the
* Statistics. However, when signed_min, signed_max, unsigned_min and unsigned_max are all
* specified, the client has the option of using either signed or unsigned bytestring statistics.
Contributor:

How about we say that for a given file, its binary stats will be written in one of three ways (sketched in code below):

  1. No min/max fields are populated -- no stats available.
  2. Only fields 1 and 2 are populated, not fields 5, 6, 7, 8 -- only signed min/max available.
  3. Only fields 5, 6, 7, 8 are populated -- both signed and unsigned stats available; fields 1 and 2 are deprecated for binary fields from now on.
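A minimal Java sketch of how a reader might distinguish these three cases; the BinaryStats holder and its field names mirror the fields proposed in this PR but are hypothetical, not an existing parquet-mr API:

```java
// Hypothetical holder mirroring the proposed Thrift statistics fields; all optional.
class BinaryStats {
    byte[] min, max;                 // fields 1 and 2 (legacy, signed comparison)
    byte[] signedMin, signedMax;     // proposed explicit signed fields
    byte[] unsignedMin, unsignedMax; // proposed explicit unsigned fields
}

class StatsCases {
    static String describe(BinaryStats s) {
        if (s.signedMin != null && s.unsignedMin != null)
            return "signed and unsigned stats available"; // case 3: new fields only
        if (s.min != null && s.max != null)
            return "only signed min/max available";       // case 2: legacy fields 1 and 2
        return "no stats available";                      // case 1: nothing populated
    }
}
```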

@isnotinvain (Contributor)

I think we also need to update the specification in this repo to explicitly say what we decide: the sort orders used and what the various fields being populated signify.

For example, if present, fields 1 and 2 will always represent the signed min/max.

@julienledem (Member)

@andreweduffy Travis gets triggered automatically (the orange dot means it's been triggered). However, I think there's a build queue per account, and the Apache build queue has a lot of projects in it.

@julienledem (Member)

@isnotinvain: We should also check how this is implemented in impala/parquet-cpp to make sure we are consistent.

3: optional i64 null_count;
/** count of distinct values occurring */
4: optional i64 distinct_count;
/* Signed min and max */
Member:

/* signed min and max for binary fields */

@julienledem (Member)

@wesm @nongli to your knowledge, is this consistent with the C++ side of things? (Also the isSorted metadata.)
@rdblue does that impact Presto?

@isnotinvain (Contributor)

I think @piyushnarang took a look at the C++ side and mentioned that it does not yet write stats?
Would be good to confirm though.

I would like to propose:

  1. deprecate min/max in the case of binary fields. When populated in old files we know they are signed.

  2. start only populating the 4 new fields, which leaves no room for confusion, and then add some more filtering APIs to parquet-mr that let the caller choose which min/max they are interested in.

We should also update the parquet-format README / spec page with this if we agree on it.

@isnotinvain (Contributor)

Another, more aggressive option is to add only 2 new fields, unsignedMin and unsignedMax, and to change the parquet-mr read path to only respect unsignedMin and unsignedMax in inequality filters (less than / greater than), simply acting as if there are no known stats when these aren't populated. The signed min/max are still valid for equality filters, though. This would probably be simpler (only one supported sort ordering), but old files would lose any stats-based optimizations for greater-than / less-than queries.

@piyushnarang

@julienledem / @isnotinvain - yeah, so it seems like the cpp work is still in progress: apache/parquet-cpp#129
They're currently using unsigned there, though, so it's something that will need to be updated in the cpp PR.

@a10y (Author) commented Aug 26, 2016

@isnotinvain so if I'm understanding you correctly, the plan is to

  1. Remove signed_min and signed_max, leaving only the addition of unsigned_min and unsigned_max
  2. In the implementations, change the default ordering of Binary to be based on unsigned comparisons

Is that correct?

@wesm (Member) commented Aug 26, 2016

To repeat my question from the mailing list: would changing the comparison used based on the ConvertedType also solve the issue (e.g. using unsigned comparison for UTF8)?

Now, if min and max are provided alone, they default to a signed
interpretation. unsigned_min and unsigned_max must be declared
explicitly for the unsigned semantics to be used.
@a10y (Author) commented Aug 28, 2016

I think that would work to solve this problem, though it introduces some extra complexity: from what I can tell, the ConvertedType for columns is stored only in the FileMetadata, which is not generally available to StatisticsFilter, though it could potentially be added to make the filter aware of the extra type information. Nothing else in the code seems to depend on ConvertedType; as far as I can tell it's mainly used during deserialization by compute engines converting from Parquet to their own internal types. Not sure how people feel about integrating it into filtering.

@isnotinvain (Contributor)

@wesm converted type / original type tags are sort of optional, so we'd have to rely on each object model integration to tag these things correctly, and then, as @andreweduffy points out, we'd also have to wire that all the way through some layers that probably don't currently have that info.

@isnotinvain (Contributor)

@andreweduffy

Well, what I was saying is that I think there are 2 options.

Option 1:

  1. deprecate min/max in the case of binary fields. When populated in old files we know they are signed.

  2. From now on, only populate the 4 new fields: signed_min, signed_max, unsigned_min, unsigned_max, which leaves no room for confusion, and then add some more filtering APIs to parquet-mr that let the caller choose which min/max they are interested in. When signed_min / signed_max is requested, check those fields first, then fall back to the deprecated min/max fields (see the sketch after this comment).

Option 2:

  1. Same as option 1, but treat the current min/max as the signed_min / signed_max fields. Less explicit, but fewer fields overall.

I would probably lean towards option 1 as it's clearer. What do you think?
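A minimal Java sketch of the option 1 fallback described above, reusing the hypothetical BinaryStats holder from the earlier sketch; these accessors are illustrative, not an existing parquet-mr API:

```java
class SignedStatsAccess {
    // Caller explicitly asked for the signed min: prefer the new explicit field,
    // then fall back to the deprecated field 1 written by old files, which we
    // know is always signed. Returns null if neither is set.
    static byte[] signedMin(BinaryStats s) {
        return (s.signedMin != null) ? s.signedMin : s.min;
    }

    // No fallback for unsigned: old files never wrote unsigned stats.
    static byte[] unsignedMin(BinaryStats s) {
        return s.unsignedMin;
    }
}
```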

@a10y (Author) commented Aug 31, 2016

I'm fine with (1). I went and reverted the last commit I made, so there are 6 total statistics fields again, and I improved the comment to integrate some of the wording from your explanation.

Once this is merged we'll need another format release to build parquet-mr against.

@wesm (Member) commented Aug 31, 2016

@isnotinvain understood -- I imagine there are some existing systems that store UTF8 data without marking the appropriate converted type, and parquet-mr / other parquet implementations have no way of distinguishing that from regular binary data. On the flip side, without this additional type metadata it's hard to know the right comparator to use (for example, Impala does not have built-in handling of UTF8 / Unicode yet, so if you have non-ASCII data then partition pruning based on statistics may not work correctly). Are there any other kinds of common data stored in BYTE_ARRAY where the signed interpretation is the right one?

@a10y (Author) commented Aug 31, 2016

@wesm I've never heard of any; in effect the unsigned interpretation is really the more widely accepted one, as far as I can tell, though JVM languages can't represent an unsigned char *, so they're stuck with an implicitly signed byte[]. Spark does this for non-string binary data, and I'm sure other JVM systems do too. Having to specify signedness is a bit gross, but it is necessary if we want to allow either set of statistics to be usable for binary-typed data. The alternative would be to add only unsigned min/max to the format and move binary over to using unsigned all the time, but then Spark and likely others would need to change their sort ordering to match up with Parquet.

@isnotinvain (Contributor)

I agree that for utf8 strings, unsigned is standard. But for generic byte arrays, I think it's fair to say there isn't really a "natural" ordering to them. If we were designing this from scratch today, I'd vote for just doing unsigned everywhere because it's simple, works for utf8, and isn't wrong for non-utf8.

That said, we've already been storing signed min/max in the only implementation of binary stats, so unfortunately we need to support that to some degree. I think the best way to migrate is to start storing both signed + unsigned and force the caller to specify which they want. In the case of Impala, it can simply say that it always wants unsigned -- and when it sees a file w/o unsigned stats, it'll have to skip the stats-based filtering.

One thing to clarify, though: for == and != queries, you can use either signed or unsigned stats, even for utf8. If the query is for foo.bar == "café" and the only stats available are signed, you can still use those to determine whether a rowgroup either a) may have the data you are looking for or b) definitely does not. It's only for >, >=, <, <= queries that it matters whether signed or unsigned is used (see the sketch after this comment). So I do think we should make sure that both parquet-mr and parquet-cpp special-case the equality filters.

Does that make sense?
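A minimal Java sketch of that special case, assuming min/max were computed under whatever total order cmp implements; the helper names are hypothetical:

```java
import java.util.Comparator;

class StatsPruning {
    // Valid for == filters regardless of the query's collation: as long as cmp
    // is the same total order the writer used to compute min/max, any matching
    // value must lie inside [min, max] under that order.
    static boolean canDropEquals(byte[] value, byte[] min, byte[] max,
                                 Comparator<byte[]> cmp) {
        return cmp.compare(value, min) < 0 || cmp.compare(value, max) > 0;
    }

    // Only valid when the stats' order also matches the query's semantics;
    // feeding signed stats to an unsigned range predicate gives wrong answers.
    static boolean canDropLessThan(byte[] value, byte[] min,
                                   Comparator<byte[]> cmp) {
        return cmp.compare(min, value) >= 0; // every row in the group is >= value
    }
}
```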

* clients to specify which statistics and method of comparison should be used
* for filtering. To maintain backward format compatibility, when filtering
* based on signed statistics the signed_min and signed_max are checked first,
* and if they are unset it falls back to using the values in min and max.
Contributor:

Let's clarify that min and max are always signed if present.

I wonder if any of this belongs in the README too?

@wesm (Member) commented Sep 1, 2016

@isnotinvain all makes sense, this is a good plan

@a10y (Author) commented Sep 1, 2016

Yep. I would like to point out, though, that you'll get extra false positives in the scan for "may contain this value" that you wouldn't get with unsigned comparison, but that doesn't affect correctness in the same way. In effect it doesn't really matter; once systems start using the correct statistics, these false positives should decrease.

@a10y (Author) commented Sep 6, 2016

Hey @isnotinvain @piyushnarang and anyone else invested, is there anything blocking this currently?

@julienledem (Member)

LGTM

@isnotinvain (Contributor)

+1, only thing I'd add is maybe a detailed explanation of how signed + unsigned min/max are calculated.

Something like:
Binaries are sorted lexicographically (byte by byte), treating each byte as an integer. The signed sorting treats each byte as a signed two's complement number, and the unsigned sorting treats each byte as an unsigned number. A binary with fewer bytes is considered to be 'less than' a binary with more bytes (is that correct?). Comparators along these lines are sketched below.
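A minimal Java sketch of the two comparators being described, to pin down the semantics; this is illustrative, not parquet-mr's actual implementation:

```java
import java.util.Comparator;

class ByteStringOrders {
    // Lexicographic, each byte compared as a signed two's complement value (-128..127).
    static final Comparator<byte[]> SIGNED = (a, b) -> {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            if (a[i] != b[i]) return Byte.compare(a[i], b[i]);
        }
        return Integer.compare(a.length, b.length); // a strict prefix sorts first
    };

    // Lexicographic, each byte compared as an unsigned value (0..255).
    static final Comparator<byte[]> UNSIGNED = (a, b) -> {
        for (int i = 0; i < Math.min(a.length, b.length); i++) {
            int c = Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            if (c != 0) return c;
        }
        return Integer.compare(a.length, b.length);
    };
}
```

The two orders disagree exactly when a high-bit byte is involved: SIGNED puts 0xFF (-1) before 0x00, while UNSIGNED puts it after, which is why non-ASCII UTF-8 data sorts differently under the two.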

@a10y (Author) commented Sep 6, 2016

Done, @isnotinvain, let me know if you like that. Once this merges there should probably be a release that ongoing parquet-mr and parquet-cpp changes can depend on.

@a10y (Author) commented Sep 8, 2016

I'm assuming it was the comment in the thrift file where you wanted me to place that; I don't know where in the README it would've made sense.

@rdblue (Contributor) commented Sep 8, 2016

Sorry I'm late to the discussion here.

I don't see the need to stop using min and max, or why we should add the unsigned min and max fields.

We have two orderings to deal with: the unsigned ordering that is correct for UTF-8, and the signed ordering that is easy to use by accident in Java. I think the rationale for continuing to provide the signed min and max is that it is easy for devs to use it by accident, but I don't think that's a good reason to legitimize its use in Parquet. It is the wrong ordering for UTF-8 and it's the wrong ordering for every use of binary or fixed that I can think of, like hash bytes or decimal bytes. I don't think we need to have signed min and max fields at all.

Also, rather than moving to unsigned min and max fields, I'd rather add a field to record the ordering that was used (see the sketch after this list). If that's missing, then we know it is the signed ordering. Otherwise, we can check that the identifier shows it is the unsigned ordering before using the stats for filtering. This would fix the current problem (knowing what sort order was used) and would allow us to address two other cases:

  1. The min/max byte arrays are used for the other types as well. Some of those types have unsigned variants that also have this problem (we only perform signed comparison).
  2. SQL engines will eventually want to use sort orders other than the lexicographic UTF-8 ordering to support the natural order in non-English languages. Allowing an identifier for the sort ordering will be a way to address those use cases.
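A minimal Java sketch of the ordering-identifier idea; the SortOrder enum and its values are hypothetical, purely to illustrate recording which order the stats were written with rather than duplicating min/max fields:

```java
// Hypothetical identifier for the total order used to compute min/max.
enum SortOrder { SIGNED_BYTES, UNSIGNED_BYTES /* room for collation-specific orders later */ }

class OrderedStats {
    byte[] min, max;
    SortOrder order; // null in old files, which we know used signed comparison

    // Range filters may only use these stats when the orders match.
    boolean usableForRange(SortOrder queryOrder) {
        SortOrder written = (order != null) ? order : SortOrder.SIGNED_BYTES;
        return written == queryOrder;
    }
}
```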

@julienledem (Member)

@isnotinvain: see Ryan's comment above

@timarmstrong (Contributor)

It looks like this stalled out. We've been looking at adding Impala support for writing min/max stats and ran into this issue. We've also discovered additional gaps in the Parquet spec and corresponding problems in the parquet-mr implementation: https://issues.apache.org/jira/browse/PARQUET-839, https://issues.apache.org/jira/browse/PARQUET-840

From the point of view of a query-processing engine, I agree with the suggestion from @rdblue that custom sort orderings may be useful, although I think that may be a niche use case. In most cases, having a sane default ordering for the logical types would be sufficient to make min-max stats easier to implement correctly and more effective at filtering data.

Currently the sort order in parquet-mr is based on the primitive type rather than the logical type. I think all of the logical types in Parquet have an "obvious" default sort order (for non-NULL values) that would match the order of the corresponding SQL types. E.g. UTF-8: sort by Unicode code points. Decimal: sort by numeric value. Date/time/timestamp: sort by time. Interval: sort by month, then second, then millisecond. Lexicographic ordering of the binary data can make min-max filtering less effective compared with the "natural" sort order for the logical data type. E.g. in the natural sort order for a decimal encoded as a fixed-length byte array, [-1, 0] is a small range, but in the lexicographic order it covers all possible values (illustrated below).
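A small runnable illustration of that decimal example, assuming DECIMAL values stored as two's-complement big-endian bytes (as fixed-length byte array decimals are):

```java
import java.math.BigInteger;

class DecimalOrderDemo {
    public static void main(String[] args) {
        byte[] minusOne = BigInteger.valueOf(-1).toByteArray(); // { (byte) 0xFF }
        byte[] zero     = BigInteger.valueOf(0).toByteArray();  // { 0x00 }

        // Numerically, [-1, 0] is a tiny range. Under unsigned lexicographic
        // order, though, -1 encodes as 0xFF, the largest byte, so the unsigned
        // min/max of these two values is (0x00, 0xFF) -- an interval spanning
        // essentially all values, which prunes nothing.
        int unsignedCmp = Integer.compare(minusOne[0] & 0xFF, zero[0] & 0xFF);
        System.out.println("unsigned order says -1 > 0: " + (unsignedCmp > 0)); // true
    }
}
```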

It's more difficult to implement in query-processing engines, because the engine needs to be aware of the sort order in some cases. E.g. to generate parquet files that support effective min-max filtering on a key, we would sort by that key, then generate the files. For this to work as intended, we'd have to emulate Parquet's peculiar sort order in the query engine and switch to it when inserting into parquet files.

@julienledem (Member)

Hi @timarmstrong,
This is a high-priority item to fix for the next release.
It sounds like you've reviewed #46; I'm posting it again for reference. Your input is highly valuable and your help welcome.
The next Parquet sync is Thursday, Feb 9th, 10am PT. I'd invite you to join so that we can finalize the solution. If this is too far out or not a good time, we can set up a hangout meeting earlier. Notes are then posted to the list.
Like you said, it is important that the stats match the ordering in the execution layer to be useful. We also need them to be standard across execution engines.
I think @andreweduffy, you (@timarmstrong), @rdblue, and @isnotinvain at least should join.
(Virtual) face-to-face meetings help us converge faster.

@a10y (Author) commented Feb 1, 2017

(FYI, I recently changed my username from @andreweduffy to @a10y; sorry about the confusion.) It looks like on the mailing list there's a fair amount of churn over when the call will actually be; could you also post here when the final time is chosen?

@lekv (Contributor) commented Jun 15, 2017

This PR seems obsoleted by PR #46, which added the min_value and max_value fields.

@a10y - Can we close this one? Or do you think some aspects are not properly addressed in #46?

@pono closed this Jun 15, 2017