20 changes: 10 additions & 10 deletions LogicalTypes.md
@@ -37,7 +37,7 @@ may require additional metadata fields, as well as rules for those fields.
`UTF8` may only be used to annotate the binary primitive type and indicates
that the byte array should be interpreted as a UTF-8 encoded character string.

-The sort order used for `UTF8` strings is `UNSIGNED` byte-wise comparison.
+The sort order used for `UTF8` strings is unsigned byte-wise comparison.

## Numeric Types

@@ -57,7 +57,7 @@ allows.
implied by the `int32` and `int64` primitive types if no other annotation is
present and should be considered optional.

-The sort order used for signed integer types is `SIGNED`.
+The sort order used for signed integer types is signed.

### Unsigned Integers

@@ -74,7 +74,7 @@ allows.
`UINT_8`, `UINT_16`, and `UINT_32` must annotate an `int32` primitive type and
`UINT_64` must annotate an `int64` primitive type.

-The sort order used for unsigned integer types is `UNSIGNED`.
+The sort order used for unsigned integer types is unsigned.

### DECIMAL

@@ -104,7 +104,7 @@ integer. A precision too large for the underlying type (see below) is an error.
A `SchemaElement` with the `DECIMAL` `ConvertedType` must also have both
`scale` and `precision` fields set, even if scale is 0 by default.

-The sort order used for `DECIMAL` values is `SIGNED`. The order is equivalent
+The sort order used for `DECIMAL` values is signed. The order is equivalent
to signed comparison of decimal values.

If the column uses `int32` or `int64` physical types, then signed comparison of
@@ -121,39 +121,39 @@ comparison.
annotate an `int32` that stores the number of days from the Unix epoch, 1
January 1970.

-The sort order used for `DATE` is `SIGNED`.
+The sort order used for `DATE` is signed.

### TIME\_MILLIS

`TIME_MILLIS` is used for a logical time type with millisecond precision,
without a date. It must annotate an `int32` that stores the number of
milliseconds after midnight.

-The sort order used for `TIME\_MILLIS` is `SIGNED`.
+The sort order used for `TIME\_MILLIS` is signed.

### TIME\_MICROS

`TIME_MICROS` is used for a logical time type with microsecond precision,
without a date. It must annotate an `int64` that stores the number of
microseconds after midnight.

-The sort order used for `TIME\_MICROS` is `SIGNED`.
+The sort order used for `TIME\_MICROS` is signed.

### TIMESTAMP\_MILLIS

`TIMESTAMP_MILLIS` is used for a combined logical date and time type, with
millisecond precision. It must annotate an `int64` that stores the number of
milliseconds from the Unix epoch, 00:00:00.000 on 1 January 1970, UTC.

-The sort order used for `TIMESTAMP\_MILLIS` is `SIGNED`.
+The sort order used for `TIMESTAMP\_MILLIS` is signed.

### TIMESTAMP\_MICROS

`TIMESTAMP_MICROS` is used for a combined logical date and time type with
microsecond precision. It must annotate an `int64` that stores the number of
microseconds from the Unix epoch, 00:00:00.000000 on 1 January 1970, UTC.

-The sort order used for `TIMESTAMP\_MICROS` is `SIGNED`.
+The sort order used for `TIMESTAMP\_MICROS` is signed.

### INTERVAL

@@ -169,7 +169,7 @@ example, there is no requirement that a large number of days should be
expressed as a mix of months and days because there is not a constant
conversion from days to months.

-The sort order used for `INTERVAL` is `UNSIGNED`, produced by sorting by
+The sort order used for `INTERVAL` is unsigned, produced by sorting by
the value of months, then days, then milliseconds with unsigned comparison.

## Embedded Types
14 changes: 7 additions & 7 deletions src/main/thrift/parquet.thrift
@@ -219,12 +219,12 @@ struct Statistics {
* Values are encoded using PLAIN encoding, except that variable-length byte
* arrays do not include a length prefix.
*
-* These fields encode min and max values determined by SIGNED comparison
+* These fields encode min and max values determined by signed comparison
* only. New files should use the correct order for a column's logical type
* and store the values in the min_value and max_value fields.
*
* To support older readers, these may be set when the column order is
-* SIGNED.
+* signed.
*/
1: optional binary max;
2: optional binary min;
@@ -583,6 +583,8 @@ struct TypeDefinedOrder {}

/**
* Union to specify the order used for min, max, and sorting values in a column.
+* This union takes the role of an enhanced enum that allows rich elements
+* (which will be needed for a collation-based ordering in the future).
*
* Possible values are:
* * TypeDefinedOrder - the column uses the order defined by its logical or
@@ -626,11 +628,9 @@ struct FileMetaData {
6: optional string created_by

/**
-* Sort order used for each column in this file.
-*
-* If this list is not present, then the order for each column is assumed to
Contributor Author

I removed this sentence since it no longer applied. (It was a leftover from the early stages of PR #46 when this list was meant to modify the behavior of the min and max fields, but they were made obsolete instead.) However, this also means that the behavior for a missing list is undefined.

@rdblue @julienledem How should we define the meaning of min_value and max_value when this list is not present? @lekv What does Impala do if this list is not present?

Contributor

If the column order for a column is not set, Impala reads stats only for boolean, integer, and floating-point types (parquet-column-stats.cc#L130). We treat the values as having been written by older versions of parquet-mr and therefore use stats only for types that had no known issues in the past:

  if (col_order == nullptr) {
    return col_type.IsBooleanType() || col_type.IsIntegerType()
        || col_type.IsFloatingPointType();
  }

Contributor Author

And which fields does Impala read the statistics from if the column order is not set: min and max, or min_value and max_value?

The fact that the statistics are in the min_value and max_value fields already shows that they are the new kind of statistics and were not written by older versions of parquet-mr.

Contributor

The check applies to new-style stats too: if the column order is unknown, there does not seem to be a safe ordering to assume for the other types. This should not happen with new writers, but some may still populate the new fields without setting the column order.

We ignore the old fields if column_order is set to anything but TYPE_ORDER; otherwise we read them only for numeric types.
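
For readers of this thread, the rules described in the last two replies can be sketched roughly as follows. This is only an illustration, not Impala's actual code: the `ColumnKind`, `ColumnOrder`, and `StatsFields` enums and the `CanUseStats` function are hypothetical names invented here. The thread does not say how the new fields are handled once a column order is set, so that branch (use them only for a recognized order) is an assumption.

```cpp
#include <optional>

// Hypothetical, simplified stand-ins; not Impala's or Parquet's real types.
enum class ColumnKind { Boolean, Integer, FloatingPoint, Other };
enum class ColumnOrder { TypeOrder, Unrecognized };
enum class StatsFields {
  Legacy,  // the old min / max fields
  New      // the min_value / max_value fields
};

// Types whose statistics had no known ordering issues in older writers.
bool IsSafeWithoutOrder(ColumnKind kind) {
  return kind == ColumnKind::Boolean || kind == ColumnKind::Integer ||
         kind == ColumnKind::FloatingPoint;
}

// Returns true if a reader following the rules from this thread would use
// the statistics stored in `fields` for a column of the given kind.
bool CanUseStats(ColumnKind kind, std::optional<ColumnOrder> order,
                 StatsFields fields) {
  if (!order.has_value()) {
    // Column order not set: the same restriction applies to both the
    // legacy and the new fields.
    return IsSafeWithoutOrder(kind);
  }
  if (fields == StatsFields::Legacy) {
    // Legacy fields are ignored for any order other than TYPE_ORDER and
    // are otherwise read only for the safe types.
    return *order == ColumnOrder::TypeOrder && IsSafeWithoutOrder(kind);
  }
  // ASSUMPTION (not stated in the thread): new fields are usable whenever
  // the column order is one the reader recognizes.
  return *order == ColumnOrder::TypeOrder;
}
```

Under these assumptions, `CanUseStats(ColumnKind::Other, std::nullopt, StatsFields::New)` is false, which matches the point that new-style stats are also skipped for non-numeric types when the column order is missing.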

-* be Signed. In addition, min and max values for INTERVAL or DECIMAL stored
-* as fixed or bytes should be ignored.
+* Sort order used for each column in this file. Each sort order corresponds
+* to one column, determined by its position in the list, matching the
+* position of the column in the schema.
*/
7: optional list<ColumnOrder> column_orders;
}