Merged (changes from 3 commits)
format/spec.md: 46 changes (40 additions & 6 deletions)
@@ -48,7 +48,7 @@ In addition to row-level deletes, version 2 makes some requirements stricter for

Version 3 of the Iceberg spec extends data types and existing metadata structures to add new capabilities:

* New data types: nanosecond timestamp(tz), unknown
* Default value support for columns
* Multi-argument transforms for partitioning and sorting

@@ -184,6 +184,7 @@ Supported primitive types are defined in the table below. Primitive types added

| Added by version | Primitive type | Description | Requirements |
|------------------|--------------------|--------------------------------------------------------------------------|--------------------------------------------------|
| [v3](#version-3) | **`unknown`** | Default / null column type used when a more specific type is not known | Must be optional with `null` defaults; not stored in data files |
| | **`boolean`** | True or false | |
| | **`int`** | 32-bit signed integers | Can promote to `long` |
| | **`long`** | 64-bit signed integers | |
@@ -221,6 +222,8 @@ The `initial-default` is set only when a field is added to an existing schema. T

The `initial-default` and `write-default` produce SQL default value behavior, without rewriting data files. SQL default value behavior when a field is added handles all existing rows as though the rows were written with the new field's default value. Default value changes may only affect future records and all known fields are written into data files. Omitting a known field when writing a data file is never allowed. The write default for a field must be written if a field is not supplied to a write. If the write default for a required field is not set, the writer must fail.
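
To make the reader and writer behavior concrete, here is a minimal, non-normative sketch of how an engine might apply these defaults. The `Field` class and helper functions are hypothetical, not part of any Iceberg implementation.

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Field:
    # Hypothetical stand-in for a schema field; not an Iceberg API.
    name: str
    required: bool
    initial_default: Any = None  # only set when the field was added to an existing schema
    write_default: Any = None

def read_value(file_row: dict, field: Field) -> Any:
    # Rows written before the field existed carry no value for it, so the
    # reader fills in initial-default instead of rewriting old data files.
    if field.name not in file_row:
        return field.initial_default
    return file_row[field.name]

def value_to_write(supplied_row: dict, field: Field) -> Any:
    # Omitting a known field is never allowed: if the caller did not supply
    # the field, the write-default is written; a required field with no
    # write-default must cause the writer to fail.
    if field.name in supplied_row:
        return supplied_row[field.name]
    if field.write_default is not None:
        return field.write_default
    if field.required:
        raise ValueError(f"required field {field.name!r} has no write-default")
    return None
```
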

All columns of `unknown` type must default to null. Non-null values for `initial-default` or `write-default` are invalid.
**Contributor:**

I'm a little confused about how physical columns in datafiles should be handled if they are present. If there is a column with a matching field id and values present, should those be materialized as null as well?

**Contributor Author:**

Further down, it states that these columns should be omitted from ORC and Parquet files. Avro can keep the type in the schema, but the serialized format ends up being the same.

**Contributor (@emkornfield, Sep 30, 2024):**

It probably makes sense to update the column projection rules to specify this type shouldn't be matched? At least for old data (i.e. migrated from hive tables), parquet supports UNKNOWN logical types as well.

**Contributor Author:**

What do you mean? How would this type be matched in column projection? If we're reading a Parquet file with a matching column ID, then it is invalid because you can't write these columns to Parquet. I don't think that we would specifically ignore that case for an unknown column. Maybe it should be an error?

**Contributor:**

@rdblue I think one could potentially import parquet files that have an existing column annotated as a logical type "Unknown". I'd expect for these files the inferred schema would be the unknown iceberg type as well. So they could in theory be read from the files (via column name mapping), but we potentially want to skip them, to avoid any complication.

**Contributor Author:**

I've added a note in the Parquet Appendix that Parquet columns that correspond to unknown columns must be ignored and replaced with null. I think that's the cleanest location for this.


Default values are attributes of fields in schemas and serialized with fields in the JSON format. See [Appendix C](#appendix-c-json-serialization).


@@ -230,11 +233,31 @@ Schemas may be evolved by type promotion or adding, deleting, renaming, or reord

Evolution applies changes to the table's current schema to produce a new schema that is identified by a unique schema ID, is added to the table's list of schemas, and is set as the table's current schema.

Valid primitive type promotions are:

| Primitive type | v1, v2 valid type promotions | v3+ valid type promotions | Requirements |
|------------------|------------------------------|------------------------------|--------------|
| `unknown` | | _any type_ | |
| `int` | `long` | `long` | |
| `date` | | `timestamp`, `timestamp_ns` | Promotion to `timestamptz` or `timestamptz_ns` is **not** allowed; values outside the promoted type's range must result in a runtime failure |
| `float` | `double` | `double` | |
| `decimal(P, S)` | `decimal(P', S)` if `P' > P` | `decimal(P', S)` if `P' > P` | Widen precision only |

Iceberg's Avro manifest format does not store the type of lower and upper bounds, and type promotion does not rewrite existing bounds. For example, when a `float` is promoted to `double`, existing data file bounds are encoded as 4 little-endian bytes rather than 8 little-endian bytes for `double`. To correctly decode the value, the original type at the time the file was written must be inferred according to the following table:
**Member:**

We discussed this a little, but I want to double check. We aren't allowing any promotion to variant here, correct? It would basically be impossible to reverse engineer the original write-time metric, so I assume we are giving up on that. It could be possible in the future if we change the metric layout.

**Contributor Author (@rdblue, Sep 27, 2024):**

Yes, if we want to have variant upper and lower bounds, then we cannot have type promotion from primitive types to variant.


| Current type | Length of bounds | Inferred type at write time |
|------------------|------------------|-----------------------------|
| `long` | 4 bytes | `int` |
| `long` | 8 bytes | `long` |
| `double` | 4 bytes | `float` |
| `double` | 8 bytes | `double` |
| `timestamp` | 4 bytes | `date` |
| `timestamp` | 8 bytes | `timestamp` |
| `timestamp_ns` | 4 bytes | `date` |
| `timestamp_ns` | 8 bytes | `timestamp_ns` |
| `decimal(P, S)` | _any_ | `decimal(P', S)`; `P' <= P` |
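
As an illustration only, a reader applying the inference table above might look like the following sketch, assuming bounds are the raw little-endian bytes stored in the manifest; this is not a normative part of the spec.

```python
import struct

def decode_promoted_bound(current_type: str, raw: bytes):
    # Sketch of the inference table above: the write-time type is recovered
    # from the serialized length because manifests do not record it.
    if current_type in ("long", "timestamp", "timestamp_ns"):
        if len(raw) == 4:
            # Written as a 4-byte int (or, for timestamps, a date in days);
            # the caller must still convert days to micro/nanoseconds before
            # comparing against values of the promoted type.
            return struct.unpack("<i", raw)[0]
        return struct.unpack("<q", raw)[0]
    if current_type == "double":
        if len(raw) == 4:
            return struct.unpack("<f", raw)[0]  # written as float
        return struct.unpack("<d", raw)[0]
    raise ValueError(f"no promotion-aware decoding sketched for {current_type}")
```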

Type promotion is not allowed for a field that is referenced by `source-id` or `source-ids` of a partition field if the partition transform would produce a different value after promoting the type. For example, `bucket[N]` produces different hash values for `34` and `"34"` (2017239379 != -427558391) but the same value for `34` and `34L`; when an `int` field is the source for a bucket partition field, it may be promoted to `long` but not to `string`.
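
The hash values quoted above can be checked with a short script. The sketch below assumes the `mmh3` package, which implements the 32-bit x86 MurmurHash3 variant used by the hash requirements in this spec.

```python
import struct
import mmh3  # MurmurHash3 x86 32-bit, seed 0

def hash_long(v: int) -> int:
    # int and long values hash identically: the 8-byte little-endian
    # representation of the value, hashed with seed 0.
    return mmh3.hash(struct.pack("<q", v), 0, signed=True)

def hash_string(s: str) -> int:
    # strings hash their UTF-8 bytes with seed 0
    return mmh3.hash(s.encode("utf-8"), 0, signed=True)

print(hash_long(34))      # 2017239379, the same for int 34 and long 34L
print(hash_string("34"))  # -427558391, so promoting int to string would change bucket values
```
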
**Member:**

I think the example here is now not allowed at all, since you cannot do an int -> string promotion by definition.

**Contributor Author:**

I agree. There aren't any cases where a type promotion would violate this rule, but I think it is still valuable to add it so that it is clear now and in future versions that even if the type promotion is valid, it is disallowed if used in a transform that would change.

**Contributor:**

Commented below, and I could be thinking about this incorrectly, but I think bucket fails for date->timestamp since the hash functions are different (date appears to hash the day count, and timestamp the microsecond count).

**Contributor:**

Have we analyzed the impact on other places where type IDs are used (i.e. should this rule be expanded)? For example, equality deletes (I think this is OK, but there might be edge cases for dropped columns) and table-level statistics, since they use theta sketches.

Also, it might pay to spell out the list of type promotion + transform pairings that are explicitly disallowed.

**Contributor:**

There is also a concern for order by columns if the comparison order of the type promotion changes (for the current set of transforms proposed, I don't think this is a problem).

**Contributor Author:**

Right now, there are no allowed type promotions (even after this PR) that violate the rule. This is just for future cases (see my reply to @RussellSpitzer).

This concern has been around since the beginning of the format and is why the hash definition requires that ints and longs produce the same hash value. In that time, I've not come across any other areas that have similar requirements. This is the only case because this is related to the correctness of filtering. Other areas that are similar (like the use of transforms in ordering) are not risky because they are not involved in filtering.

**Contributor (@emkornfield, Sep 30, 2024):**

Doesn't date to timestamp fail for bucket here? I think this necessitates a multiplication, which would change the hash value. Isn't it also a problem for the theta sketch in Puffin files for stats? I guess that is a pre-existing problem since single-value serialization is used, so even int->long would cause some amount of inaccuracy.

(Sorry for the multiple edits.) I think date->timestamp_ns might yield different results depending on how/if overflow for the conversion is handled.

**Contributor:**

@rdblue Sorry for yet another comment here, but I wanted to address:

> This is the only case because this is related to the correctness of filtering. Other areas that are similar (like the use of transforms in ordering) are not risky because they are not involved in filtering.

I'm not sure this is accurate:

  1. For sketches it might be OK to potentially have an estimate that is off by up to 2x, but it should probably be clarified. As you point out, it wouldn't necessarily affect correctness, but it could in some cases affect performance. I'll follow up on the mailing list about this.
  2. If a file is said to be ordered, then I think it would be reasonable for an engine doing a point lookup on an ordered column (e.g. sorted_col_x = 'some_id') to stop scanning after it encountered a value greater than "some_id". So while the ordering doesn't impact metadata filtering, it would violate a contract that some engines might rely on.

The second point falls into the category of things that are not possible with the current set of transforms described in the PR, so it can probably wait.

**Contributor Author:**

Yes, you're right that date to timestamp promotion would hit this case. Good catch (and I'm glad I left this in!).

You're also right that this would be a problem for the theta sketch definition. We should put some extra work in there to mitigate problems like this. When a type is promoted, we need to be able to detect and discard the existing sketch and replace it.

For #2, it sounds like the problem would occur when the ordering changes. Or in other words, if the transformation from one type to another is not monotonic. That's probably a good thing to call out as well, or at least look for when we add new type promotion cases. I don't think any of our existing cases would hit that situation, right? We should be careful about changes that alter the order.

**Contributor:**

> For #2, it sounds like the problem would occur when the ordering changes. Or in other words, if the transformation from one type to another is not monotonic. That's probably a good thing to call out as well, or at least look for when we add new type promotion cases. I don't think any of our existing cases would hit that situation, right? We should be careful about changes that alter the order.

Correct, but the example given (number->string) is not in fact 'monotonic'. If we remove this example and just enumerate that the bucket transform is not valid, that is probably sufficient for this PR.

**Contributor Author:**

Do we really need to change the example? It is demonstrating the point that a bucket function might change even if the value is equivalent. If we change it, then the other example (int to long) no longer makes sense. I think I'd prefer to keep the current examples. I'm not worried about the sorting use case needing to be called out here just because the sort order for that example changes.

**Contributor:**

I guess we don't need to change the example. It might be nice to document the sort order exception as well so it isn't forgotten if the current reviewers aren't around and it doesn't come up. I do think for the specification we should explicitly call out which transform pairings fall into this category (i.e. add date->timestamp* for the bucket[N] transform so implementors don't need to figure it out themselves).

**Contributor Author:**

I added the case where this is known to happen.


Any struct, including a top-level schema, can evolve through deleting fields, adding new fields, renaming existing fields, reordering existing fields, or promoting a primitive using the valid type promotions. Adding a new field assigns a new ID for that field and for any nested fields. Renaming an existing field must change the name, but not the field ID. Deleting a field removes it from the current schema. Field deletion cannot be rolled back unless the field was nullable or if the current snapshot has not changed.

@@ -949,6 +972,7 @@ Maps with non-string keys must use an array representation with the `map` logica

|Type|Avro type|Notes|
|--- |--- |--- |
|**`unknown`**|`null` or omitted||
|**`boolean`**|`boolean`||
|**`int`**|`int`||
|**`long`**|`long`||
@@ -1002,6 +1026,7 @@ Lists must use the [3-level representation](https://github.com/apache/parquet-fo

| Type | Parquet physical type | Logical type | Notes |
|--------------------|--------------------------------------------------------------------|---------------------------------------------|----------------------------------------------------------------|
| **`unknown`** | None | | Omit from data files |
| **`boolean`** | `boolean` | | |
| **`int`** | `int` | | |
| **`long`** | `long` | | |
@@ -1023,12 +1048,16 @@ Lists must use the [3-level representation](https://github.com/apache/parquet-fo
| **`map`** | `3-level map` | `MAP` | See Parquet docs for 3-level representation. |


When reading an `unknown` column, any corresponding column must be ignored and replaced with `null` values.
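
A non-normative sketch of this projection rule follows; the field objects and their `name`, `type`, and `initial_default` attributes are illustrative only.

```python
def project_row(file_row: dict, schema_fields) -> dict:
    # Columns whose Iceberg type is `unknown` are never materialized from the
    # file, even if a physical column with a matching id exists; the reader
    # produces null for every row instead.
    projected = {}
    for field in schema_fields:
        if field.type == "unknown":
            projected[field.name] = None
        elif field.name in file_row:
            projected[field.name] = file_row[field.name]
        else:
            projected[field.name] = field.initial_default
    return projected
```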


### ORC

**Data Type Mappings**

| Type | ORC type | ORC type attributes | Notes |
|--------------------|---------------------|------------------------------------------------------|-----------------------------------------------------------------------------------------|
| **`unknown`** | None | | Omit from data files |
| **`boolean`** | `boolean` | | |
| **`int`** | `int` | | ORC `tinyint` and `smallint` would also map to **`int`**. |
| **`long`** | `long` | | |
@@ -1089,6 +1118,7 @@ The types below are not currently valid for bucketing, and so are not hashed. Ho

| Primitive type | Hash specification | Test value |
|--------------------|-------------------------------------------|--------------------------------------------|
| **`unknown`** | `hashInt(0)` | |
**Contributor:**

Sorry I missed this in my prior reviews, but shouldn't unknown be treated as null and return null for any case that requires a hash? (Also, was bucket[N] intended to be defined for 'unknown'? If not, then I think the only other case is Puffin files, which today are always defined as the hash of the values.)

**Contributor Author:**

Yes, you're right. I'll fix it.

**Contributor Author:**

Done.

| **`boolean`** | `false: hashInt(0)`, `true: hashInt(1)` | `true` → `1392991556` |
| **`float`** | `hashLong(doubleToLongBits(double(v)))` [5]| `1.0F` → `-142385009`, `0.0F` → `1669671676`, `-0.0F` → `1669671676` |
| **`double`** | `hashLong(doubleToLongBits(v))` [5]| `1.0D` → `-142385009`, `0.0D` → `1669671676`, `-0.0D` → `1669671676` |
@@ -1119,6 +1149,7 @@ Types are serialized according to this table:

|Type|JSON representation|Example|
|--- |--- |--- |
|**`unknown`**|`JSON string: "unknown"`|`"unknown"`|
|**`boolean`**|`JSON string: "boolean"`|`"boolean"`|
|**`int`**|`JSON string: "int"`|`"int"`|
|**`long`**|`JSON string: "long"`|`"long"`|
@@ -1267,6 +1298,7 @@ This serialization scheme is for storing single values as individual binary valu

| Type | Binary serialization |
|------------------------------|--------------------------------------------------------------------------------------------------------------|
| **`unknown`** | Not supported |
| **`boolean`** | `0x00` for false, non-zero byte for true |
| **`int`** | Stored as 4-byte little-endian |
| **`long`** | Stored as 8-byte little-endian |
@@ -1319,10 +1351,11 @@ This serialization scheme is for storing single values as individual binary valu
### Version 3

Default values are added to struct fields in v3.

* The `write-default` is a forward-compatible change because it is only used at write time. Old writers will fail because the field is missing.
* Tables with `initial-default` will be read correctly by older readers if `initial-default` is always null for optional fields. Otherwise, old readers will default optional columns with null. Old readers will fail to read required fields which are populated by `initial-default` because that default is not supported.

Types `unknown`, `timestamp_ns`, and `timestamptz_ns` are added in v3.

All readers are required to read tables with unknown partition transforms, ignoring the unsupported partition fields when filtering.

@@ -1423,3 +1456,4 @@ Iceberg supports two types of histories for tables. A history of previous "curre
might indicate different snapshot IDs for a specific timestamp. The discrepancies can be caused by a variety of table operations (e.g. updating the `current-snapshot-id` can be used to set the snapshot of a table to any arbitrary snapshot, which might have a lineage derived from a table branch or no lineage at all).

When processing point in time queries implementations should use "snapshot-log" metadata to lookup the table state at the given point in time. This ensures time-travel queries reflect the state of the table at the provided timestamp. For example a SQL query like `SELECT * FROM prod.db.table TIMESTAMP AS OF '1986-10-26 01:21:00Z';` would find the snapshot of the Iceberg table just prior to '1986-10-26 01:21:00 UTC' in the snapshot logs and use the metadata from that snapshot to perform the scan of the table. If no snapshot exists prior to the timestamp given or "snapshot-log" is not populated (it is an optional field), then systems should raise an informative error message about the missing metadata.
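
For example, resolving a timestamp against the snapshot log might look like the sketch below (non-normative; it assumes `snapshot-log` entries carry `timestamp-ms` and `snapshot-id` keys and are sorted by timestamp).

```python
from bisect import bisect_right

def snapshot_as_of(snapshot_log: list, ts_ms: int) -> int:
    # Find the snapshot that was current at ts_ms: the last log entry whose
    # timestamp is at or before the requested time.
    if not snapshot_log:
        raise ValueError("snapshot-log is not populated; cannot resolve the timestamp")
    times = [entry["timestamp-ms"] for entry in snapshot_log]
    idx = bisect_right(times, ts_ms) - 1
    if idx < 0:
        raise ValueError(f"no snapshot exists at or before {ts_ms}")
    return snapshot_log[idx]["snapshot-id"]
```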