Skip to content

Conversation

@zivanfi
Copy link
Contributor

@zivanfi zivanfi commented Jun 16, 2017

Changed some descriptions to reflect code changes that happened during code review without updating the corresponding comments and documentation:

  • Removed references to the SIGNED and UNSIGNED sort orders, which were removed in favour of a single TYPE_ORDER.

  • Removed obsolete references to column_orders's effect on the min and max values, since those were declared obsolete instead and column_orders only affects the new min_value and max_value fields.

  • Clarified ColumnOrder's purpose, since the purpose of a union containing a single empty struct was hard to grasp.

Changed some descriptions to reflect code changes that happened during code
review without updating the corresponding comments and documentation:

* Removed references to the SIGNED and UNSIGNED sort orders, which were removed
  in favour of a single TYPE_ORDER.

* Removed obsolete references to column_orders's effect on the min and max
  values, since those were declared obsolete instead and column_orders only
  affects the new min_value and max_value fields.

* Clarified ColumnOrder's purpose, since the purpose of a union containing a
  single empty struct was hard to grasp.
/**
* Sort order used for each column in this file.
*
* If this list is not present, then the order for each column is assumed to
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed this sentence since it no longer applied. (It was a leftover from the early stages of PR #46 when this list was meant to modify the behavior of the min and max fields, but they were made obsolete instead.) However, this also means that the behavior for a missing list is undefined.

@rdblue @julienledem How should we define the meaning of min_value and max_value when this list is not present? @lekv What does Impala do if this list is not present?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the column order for a column is not set, Impala only reads stats for numeric types (parquet-column-stats.cc#L130). We treat the values as having been written by older versions of parquet-mr, thus only using stats for types that did not have known issues in the past.

  if (col_order == nullptr) {
    return col_type.IsBooleanType() || col_type.IsIntegerType()
        || col_type.IsFloatingPointType();
  }

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And which fields does Impala read the statistics from if the column order is not set, min and max or min-value and max-value?

The fact that the statistics are in the min_value and max_value fields already shows that they are the new kind of statistics and were not written by older versions of parquet-mr.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The check is done for new-style stats, too, since if the column order is unknown, there does not seem to be a safe ordering to assume for all other types. This should not happen with new readers, but some may still use the new fields and not set the column order.

For old fields, we ignore them if column_order is set to anything but TYPE_ORDER. Otherwise we only read them for numeric types.

* BOOLEAN - false, true
* INT32 - signed comparison
* INT64 - signed comparison
* INT96 - signed comparison
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lekv Since INT96 timestamps have the INT96 physical type but no logical type, according to these rules signed comparison should be applied to them. But if I remember correctly, it was discussed that signed comparison was not adequate for INT96 timestamps. How did you handle this case in Impala?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I came across this line of code. According to this, INT96 should use unsigned comparison. I think this comment here should read "unsigned" as well. @julienledem , what is your take on this?

Modified INT96 comparison to be unsigned in agreement with the actual
code and the theoretically correct ordering.

Distuingish more explicitly between byte-wise comparison and comparison
of represented values when the two are not equivalent.
* BYTE_ARRAY - unsigned byte-wise comparison
* FIXED_LEN_BYTE_ARRAY - unsigned byte-wise comparison
*/
1: TypeDefinedOrder TYPE_ORDER;

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add a custom sort order for Impala timestamp values until Int96 is removed from the parquet format?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't remember in which context the discussion was but I think INT96 timestamps should be sortable correctly with signed comparison.

Copy link

@majetideepak majetideepak Aug 23, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you are referring to Impala timestamps, then I believe signed comparison is not sufficient.
Quoting from https://github.com/cloudera/Impala/blob/b402e342d42b60ff3d01e87d83e9bfba635488cf/tests/util/get_parquet_metadata.py:75

# Impala writes timestamps as 12-byte values. The first 8 byte store a
# boost::posix_time::time_duration, which is the time within the current day in
# nanoseconds stored as int64. The last 4 bytes store a boost::gregorian::date,
# which is the Julian date, stored as utin32.

The comparator should compare the date value first (last 4 bytes) and then the nanoseconds (first 8 bytes)?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is something that probably someone else also should review (cc @rdblue ;) ) but I guess that due to storing the values as little endian, this should be correct. Please don't take this for granted, I hadn't had to deal with endianness in the last years explicitly, so I might have good this wrong.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I doubt if little endian layout that works at the byte level can help with this multi-byte value comparison.

Copy link
Contributor Author

@zivanfi zivanfi Sep 20, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is the timestamp '2000-01-01 12:34:56' stored as an int96:

$ parquet-tools dump hdfs://n1/user/hive/warehouse/test/11481f03a2ea6bed-b19656cd00000000_1937418586_data.0.parq | tail -n 1
value 1: R:0 D:1 V:117253024523396126668760320

Since 117253024523396126668760320 = 0x60FD4B3229000059682500, the 12 bytes are 00 60 FD 4B 32 29 00 00 | 59 68 25 00, where | shows the boundary between the time and the date parts.

00 60 FD 4B 32 29 00 00 is the time part, if we reverse the bytes we get 0x000029324BFD6000 = 45296 * 10^9 nanoseconds = 45296 seconds = 12 hours + 34 minutes + 56 seconds.

59 68 25 00 is the date part, if we reverse the bytes we get 0x00256859 = 2451545 as the Julian day number, which corresponds to 2000-01-01.

To correctly sort these values without interpreting them as timestamps, the bigger unit (date) should precede the smaller unit (time). In this case they are in the opposite order, but they are also stored with little-endian byte-order (individually), which means that they will be in the correct order if we interpret the whole value in a little-endian manner. So, for correct ordering based purely on numerical value, in comparisons the example above should not be interpreted as 0x0060FD4B3229000059682500 = 117253024523396126668760320 like parquet-tools did, but as 0x00256859000029324BFD6000 = 45223023200227578716446720 instead. Or, to put it more simply, a byte-by-byte comparison starting from the end of the values results in the correct ordering.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the clarification! An Int96 intrinsic hardware type would have handled the value as is. Otherwise a byte-by-byte comparison in the reverse order is needed.

@rdblue
Copy link
Contributor

rdblue commented Oct 6, 2017

+1

Thanks for contributing this, @zivanfi. Nice work. I'm going to merge this now and we can follow up on the int96 behavior in another PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants