PARQUET-686: Clarifications about min-max stats. #55

zivanfi · 2017-06-16T16:14:12Z

Changed some descriptions to reflect code changes that happened during code review without updating the corresponding comments and documentation:

Removed references to the SIGNED and UNSIGNED sort orders, which were removed in favour of a single TYPE_ORDER.
Removed obsolete references to column_orders's effect on the min and max values, since those were declared obsolete instead and column_orders only affects the new min_value and max_value fields.
Clarified ColumnOrder's purpose, since the purpose of a union containing a single empty struct was hard to grasp.

Changed some descriptions to reflect code changes that happened during code review without updating the corresponding comments and documentation: * Removed references to the SIGNED and UNSIGNED sort orders, which were removed in favour of a single TYPE_ORDER. * Removed obsolete references to column_orders's effect on the min and max values, since those were declared obsolete instead and column_orders only affects the new min_value and max_value fields. * Clarified ColumnOrder's purpose, since the purpose of a union containing a single empty struct was hard to grasp.

zivanfi · 2017-06-16T16:23:30Z

src/main/thrift/parquet.thrift

  /**
-   * Sort order used for each column in this file.
-   *
-   * If this list is not present, then the order for each column is assumed to


I removed this sentence since it no longer applied. (It was a leftover from the early stages of PR #46 when this list was meant to modify the behavior of the min and max fields, but they were made obsolete instead.) However, this also means that the behavior for a missing list is undefined.

@rdblue @julienledem How should we define the meaning of min_value and max_value when this list is not present? @lekv What does Impala do if this list is not present?

If the column order for a column is not set, Impala only reads stats for numeric types (parquet-column-stats.cc#L130). We treat the values as having been written by older versions of parquet-mr, thus only using stats for types that did not have known issues in the past.

if (col_order == nullptr) { return col_type.IsBooleanType() || col_type.IsIntegerType() || col_type.IsFloatingPointType(); }

And which fields does Impala read the statistics from if the column order is not set, min and max or min-value and max-value?

The fact that the statistics are in the min_value and max_value fields already shows that they are the new kind of statistics and were not written by older versions of parquet-mr.

The check is done for new-style stats, too, since if the column order is unknown, there does not seem to be a safe ordering to assume for all other types. This should not happen with new readers, but some may still use the new fields and not set the column order.

For old fields, we ignore them if column_order is set to anything but TYPE_ORDER. Otherwise we only read them for numeric types.

zivanfi · 2017-06-29T11:30:25Z

src/main/thrift/parquet.thrift

+   *   BOOLEAN - false, true
+   *   INT32 - signed comparison
+   *   INT64 - signed comparison
+   *   INT96 - signed comparison


@lekv Since INT96 timestamps have the INT96 physical type but no logical type, according to these rules signed comparison should be applied to them. But if I remember correctly, it was discussed that signed comparison was not adequate for INT96 timestamps. How did you handle this case in Impala?

I came across this line of code. According to this, INT96 should use unsigned comparison. I think this comment here should read "unsigned" as well. @julienledem , what is your take on this?

Modified INT96 comparison to be unsigned in agreement with the actual code and the theoretically correct ordering. Distuingish more explicitly between byte-wise comparison and comparison of represented values when the two are not equivalent.

majetideepak · 2017-08-23T12:37:31Z

src/main/thrift/parquet.thrift

+   *   BYTE_ARRAY - unsigned byte-wise comparison
+   *   FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
+   */
  1: TypeDefinedOrder TYPE_ORDER;


Should we add a custom sort order for Impala timestamp values until Int96 is removed from the parquet format?

I don't remember in which context the discussion was but I think INT96 timestamps should be sortable correctly with signed comparison.

If you are referring to Impala timestamps, then I believe signed comparison is not sufficient.
Quoting from https://github.com/cloudera/Impala/blob/b402e342d42b60ff3d01e87d83e9bfba635488cf/tests/util/get_parquet_metadata.py:75

# Impala writes timestamps as 12-byte values. The first 8 byte store a # boost::posix_time::time_duration, which is the time within the current day in # nanoseconds stored as int64. The last 4 bytes store a boost::gregorian::date, # which is the Julian date, stored as utin32.

The comparator should compare the date value first (last 4 bytes) and then the nanoseconds (first 8 bytes)?

This is something that probably someone else also should review (cc @rdblue ;) ) but I guess that due to storing the values as little endian, this should be correct. Please don't take this for granted, I hadn't had to deal with endianness in the last years explicitly, so I might have good this wrong.

I doubt if little endian layout that works at the byte level can help with this multi-byte value comparison.

Here is the timestamp '2000-01-01 12:34:56' stored as an int96:

$ parquet-tools dump hdfs://n1/user/hive/warehouse/test/11481f03a2ea6bed-b19656cd00000000_1937418586_data.0.parq | tail -n 1 value 1: R:0 D:1 V:117253024523396126668760320

Since 117253024523396126668760320 = 0x60FD4B3229000059682500, the 12 bytes are 00 60 FD 4B 32 29 00 00 | 59 68 25 00, where | shows the boundary between the time and the date parts.

00 60 FD 4B 32 29 00 00 is the time part, if we reverse the bytes we get 0x000029324BFD6000 = 45296 * 10^9 nanoseconds = 45296 seconds = 12 hours + 34 minutes + 56 seconds.

59 68 25 00 is the date part, if we reverse the bytes we get 0x00256859 = 2451545 as the Julian day number, which corresponds to 2000-01-01.

To correctly sort these values without interpreting them as timestamps, the bigger unit (date) should precede the smaller unit (time). In this case they are in the opposite order, but they are also stored with little-endian byte-order (individually), which means that they will be in the correct order if we interpret the whole value in a little-endian manner. So, for correct ordering based purely on numerical value, in comparisons the example above should not be interpreted as 0x0060FD4B3229000059682500 = 117253024523396126668760320 like parquet-tools did, but as 0x00256859000029324BFD6000 = 45223023200227578716446720 instead. Or, to put it more simply, a byte-by-byte comparison starting from the end of the values results in the correct ordering.

Thanks for the clarification! An Int96 intrinsic hardware type would have handled the value as is. Otherwise a byte-by-byte comparison in the reverse order is needed.

rdblue · 2017-10-06T23:37:51Z

+1

Thanks for contributing this, @zivanfi. Nice work. I'm going to merge this now and we can follow up on the int96 behavior in another PR.

zivanfi commented Jun 16, 2017

View reviewed changes

Zoltan Ivanfi added 2 commits June 20, 2017 13:40

PARQUET-686: Minor improvements in Thrift comments.

f8fab0b

PARQUET-686: Further clarifications.

0c973f7

zivanfi commented Jun 29, 2017

View reviewed changes

Comparison rules updates.

a499d86

Modified INT96 comparison to be unsigned in agreement with the actual code and the theoretically correct ordering. Distuingish more explicitly between byte-wise comparison and comparison of represented values when the two are not equivalent.

zivanfi force-pushed the master branch from d627dfd to a499d86 Compare July 6, 2017 15:27

majetideepak mentioned this pull request Aug 23, 2017

PARQUET-1002: Compute statistics based on Sort Order apache/parquet-cpp#383

Closed

majetideepak reviewed Aug 23, 2017

View reviewed changes

asfgit closed this in bef5438 Oct 6, 2017

asfimport mentioned this pull request Jun 23, 2024

Allow for Unsigned Statistics in Binary Type #314

Closed

PARQUET-686: Clarifications about min-max stats. #55

PARQUET-686: Clarifications about min-max stats. #55

Uh oh!

Conversation

zivanfi commented Jun 16, 2017

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

majetideepak Aug 23, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

zivanfi Sep 20, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

rdblue commented Oct 6, 2017

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

majetideepak Aug 23, 2017 •

edited

Loading

zivanfi Sep 20, 2017 •

edited

Loading