-
Notifications
You must be signed in to change notification settings - Fork 462
PARQUET-686: Clarifications about min-max stats. #55
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Changed some descriptions to reflect code changes that happened during code review without updating the corresponding comments and documentation: * Removed references to the SIGNED and UNSIGNED sort orders, which were removed in favour of a single TYPE_ORDER. * Removed obsolete references to column_orders's effect on the min and max values, since those were declared obsolete instead and column_orders only affects the new min_value and max_value fields. * Clarified ColumnOrder's purpose, since the purpose of a union containing a single empty struct was hard to grasp.
| /** | ||
| * Sort order used for each column in this file. | ||
| * | ||
| * If this list is not present, then the order for each column is assumed to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed this sentence since it no longer applied. (It was a leftover from the early stages of PR #46 when this list was meant to modify the behavior of the min and max fields, but they were made obsolete instead.) However, this also means that the behavior for a missing list is undefined.
@rdblue @julienledem How should we define the meaning of min_value and max_value when this list is not present? @lekv What does Impala do if this list is not present?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the column order for a column is not set, Impala only reads stats for numeric types (parquet-column-stats.cc#L130). We treat the values as having been written by older versions of parquet-mr, thus only using stats for types that did not have known issues in the past.
if (col_order == nullptr) {
return col_type.IsBooleanType() || col_type.IsIntegerType()
|| col_type.IsFloatingPointType();
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And which fields does Impala read the statistics from if the column order is not set, min and max or min-value and max-value?
The fact that the statistics are in the min_value and max_value fields already shows that they are the new kind of statistics and were not written by older versions of parquet-mr.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The check is done for new-style stats, too, since if the column order is unknown, there does not seem to be a safe ordering to assume for all other types. This should not happen with new readers, but some may still use the new fields and not set the column order.
For old fields, we ignore them if column_order is set to anything but TYPE_ORDER. Otherwise we only read them for numeric types.
src/main/thrift/parquet.thrift
Outdated
| * BOOLEAN - false, true | ||
| * INT32 - signed comparison | ||
| * INT64 - signed comparison | ||
| * INT96 - signed comparison |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lekv Since INT96 timestamps have the INT96 physical type but no logical type, according to these rules signed comparison should be applied to them. But if I remember correctly, it was discussed that signed comparison was not adequate for INT96 timestamps. How did you handle this case in Impala?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I came across this line of code. According to this, INT96 should use unsigned comparison. I think this comment here should read "unsigned" as well. @julienledem , what is your take on this?
Modified INT96 comparison to be unsigned in agreement with the actual code and the theoretically correct ordering. Distuingish more explicitly between byte-wise comparison and comparison of represented values when the two are not equivalent.
| * BYTE_ARRAY - unsigned byte-wise comparison | ||
| * FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison | ||
| */ | ||
| 1: TypeDefinedOrder TYPE_ORDER; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we add a custom sort order for Impala timestamp values until Int96 is removed from the parquet format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't remember in which context the discussion was but I think INT96 timestamps should be sortable correctly with signed comparison.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If you are referring to Impala timestamps, then I believe signed comparison is not sufficient.
Quoting from https://github.com/cloudera/Impala/blob/b402e342d42b60ff3d01e87d83e9bfba635488cf/tests/util/get_parquet_metadata.py:75
# Impala writes timestamps as 12-byte values. The first 8 byte store a
# boost::posix_time::time_duration, which is the time within the current day in
# nanoseconds stored as int64. The last 4 bytes store a boost::gregorian::date,
# which is the Julian date, stored as utin32.
The comparator should compare the date value first (last 4 bytes) and then the nanoseconds (first 8 bytes)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is something that probably someone else also should review (cc @rdblue ;) ) but I guess that due to storing the values as little endian, this should be correct. Please don't take this for granted, I hadn't had to deal with endianness in the last years explicitly, so I might have good this wrong.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I doubt if little endian layout that works at the byte level can help with this multi-byte value comparison.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is the timestamp '2000-01-01 12:34:56' stored as an int96:
$ parquet-tools dump hdfs://n1/user/hive/warehouse/test/11481f03a2ea6bed-b19656cd00000000_1937418586_data.0.parq | tail -n 1
value 1: R:0 D:1 V:117253024523396126668760320
Since 117253024523396126668760320 = 0x60FD4B3229000059682500, the 12 bytes are 00 60 FD 4B 32 29 00 00 | 59 68 25 00, where | shows the boundary between the time and the date parts.
00 60 FD 4B 32 29 00 00 is the time part, if we reverse the bytes we get 0x000029324BFD6000 = 45296 * 10^9 nanoseconds = 45296 seconds = 12 hours + 34 minutes + 56 seconds.
59 68 25 00 is the date part, if we reverse the bytes we get 0x00256859 = 2451545 as the Julian day number, which corresponds to 2000-01-01.
To correctly sort these values without interpreting them as timestamps, the bigger unit (date) should precede the smaller unit (time). In this case they are in the opposite order, but they are also stored with little-endian byte-order (individually), which means that they will be in the correct order if we interpret the whole value in a little-endian manner. So, for correct ordering based purely on numerical value, in comparisons the example above should not be interpreted as 0x0060FD4B3229000059682500 = 117253024523396126668760320 like parquet-tools did, but as 0x00256859000029324BFD6000 = 45223023200227578716446720 instead. Or, to put it more simply, a byte-by-byte comparison starting from the end of the values results in the correct ordering.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clarification! An Int96 intrinsic hardware type would have handled the value as is. Otherwise a byte-by-byte comparison in the reverse order is needed.
|
+1 Thanks for contributing this, @zivanfi. Nice work. I'm going to merge this now and we can follow up on the int96 behavior in another PR. |
Changed some descriptions to reflect code changes that happened during code review without updating the corresponding comments and documentation:
Removed references to the
SIGNEDandUNSIGNEDsort orders, which were removed in favour of a singleTYPE_ORDER.Removed obsolete references to
column_orders's effect on theminandmaxvalues, since those were declared obsolete instead andcolumn_ordersonly affects the newmin_valueandmax_valuefields.Clarified
ColumnOrder's purpose, since the purpose of a union containing a single empty struct was hard to grasp.