Docs: Default value support feature specification #4301
Changes from 15 commits
3cd948a
a0732eb
c7f1368
53db8a1
f050345
ad29a50
3c6fd38
02bb47c
4729700
7554a08
53a509d
d47b6f0
6d0ebdc
6d71d2d
400b9ec
bea5ddb
869f8a8
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -158,7 +158,7 @@ For the representations of these types in Avro, ORC, and Parquet file formats, s | |
|
|
||
| #### Nested Types | ||
|
|
||
| A **`struct`** is a tuple of typed values. Each field in the tuple is named and has an integer id that is unique in the table schema. Each field can be either optional or required, meaning that values can (or cannot) be null. Fields may be any type. Fields may have an optional comment or doc string. | ||
| A **`struct`** is a tuple of typed values. Each field in the tuple is named and has an integer id that is unique in the table schema. Each field can be either optional or required, meaning that values can (or cannot) be null. Fields may be any type. Fields may have an optional comment or doc string. Fields can have [default values](#default-values). | ||
|
|
||
| A **`list`** is a collection of values with some element type. The element field has an integer id that is unique in the table schema. Elements can be either optional or required. Element types may be any type. | ||
|
|
||
|
|
@@ -194,6 +194,19 @@ Notes: | |
| For details on how to serialize a schema to JSON, see Appendix C. | ||
|
|
||
|
|
||
| #### Default values | ||
|
|
||
| Default values can be tracked for struct fields (both nested structs and the top-level schema's struct). A field can have two defaults: | ||
| - `initial-default` is used to populate the field's value for all records that were written before the field was added to the schema | ||
| - `write-default` is used to populate the field's value for any records written after the field was added to the schema, if the writer does not supply the field's value | ||
|
|
||
| The `initial-default` is set only when a field is added to an existing schema. The `write-default` is initially set to the same value as `initial-default` and can be changed through schema evolution. If either default is not set for an optional field, then the default value is null for compatibility with older spec versions. | ||
|
|
||
| Together, the `initial-default` and `write-default` produce SQL default value behavior without rewriting data files. That is, default value changes may only affect future records and all known fields are written into data files. To produce this behavior, omitting a known field when writing a data file is not allowed. The write default for a field must be written if a field is not supplied to a write. If the write default for a required field is not set, the writer must fail. | ||
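The read and write behavior described above can be sketched roughly as follows. This is a minimal illustration, not the reference implementation: the `Field` class, the `_UNSET` sentinel, and the `read_value` and `write_value` helpers are hypothetical names, and failing on a missing required field with no `initial-default` at read time is an assumption rather than something this section states.

```python
from dataclasses import dataclass
from typing import Any

_UNSET = object()  # hypothetical sentinel meaning "this default was never set"


@dataclass(frozen=True)
class Field:
    """Hypothetical, simplified stand-in for a schema field."""
    field_id: int
    name: str
    required: bool
    initial_default: Any = _UNSET
    write_default: Any = _UNSET


def read_value(field: Field, stored: dict) -> Any:
    """Value a reader returns for one row.

    `stored` maps field id -> value found in the data file. A missing id
    means the file was written before the field was added to the schema.
    """
    if field.field_id in stored:
        return stored[field.field_id]
    if field.initial_default is not _UNSET:
        return field.initial_default
    if not field.required:
        return None  # an unset default on an optional field reads as null
    # Assumption: a missing required field without a default is an error.
    raise ValueError(f"required field {field.name!r} is missing from the file")


def write_value(field: Field, supplied: dict) -> Any:
    """Value a writer must store; known fields are never omitted from a file.

    `supplied` maps field name -> value provided by the caller.
    """
    if field.name in supplied:
        return supplied[field.name]
    if field.write_default is not _UNSET:
        return field.write_default
    if not field.required:
        return None
    raise ValueError(f"required field {field.name!r} has no write-default")
```

Under this sketch, changing `write_default` affects only rows written afterwards, while rows in older files keep resolving to `initial_default`, which is why no data files need to be rewritten.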
|
|
||
| Default values are attributes of fields in schemas and serialized with fields in the JSON format. See [Appendix C](#appendix-c-json-serialization). | ||
|
|
||
|
|
||
| #### Schema Evolution | ||
|
|
||
| Schemas may be evolved by type promotion or adding, deleting, renaming, or reordering fields in structs (both nested structs and the top-level schema’s struct). | ||
|
|
@@ -210,6 +223,15 @@ Any struct, including a top-level schema, can evolve through deleting fields, ad | |
|
|
||
| Grouping a subset of a struct’s fields into a nested struct is **not** allowed, nor is moving fields from a nested struct into its immediate parent struct (`struct<a, b, c> ↔ struct<a, struct<b, c>>`). Evolving primitive types to structs is **not** allowed, nor is evolving a single-field struct to a primitive (`map<string, int> ↔ map<string, struct<int>>`). | ||
|
|
||
| Struct evolution requires the following rules for default values: | ||
| * The `initial-default` must be set when a field is added and cannot change | ||
| * The `write-default` must be set when a field is added and may change | ||
| * When a required field is added, both defaults must be set to a non-null value | ||
| * When an optional field is added, the defaults may be null and should be explicitly set | ||
| * When a new field is added to a struct with a default value, the default should be updated to include the new field's default | ||
|
||
| * If a field value is missing from a struct's `initial-default`, the field's `initial-default` must be used for the field | ||
| * If a field value is missing from a struct's `write-default`, the field's `write-default` must be used for the field | ||
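A rough sketch of two of the rules above: rejecting a required field that is added without non-null defaults, and completing a struct's default from its child fields' defaults. The function names and the choice to key defaults by field id are assumptions made for illustration only.

```python
from typing import Any, Dict


def check_added_field(required: bool, initial_default: Any, write_default: Any) -> None:
    """Reject a required field that is added without non-null defaults."""
    if required and (initial_default is None or write_default is None):
        raise ValueError(
            "a required field must be added with non-null "
            "initial-default and write-default"
        )


def fill_struct_default(
    struct_default: Dict[int, Any], child_defaults: Dict[int, Any]
) -> Dict[int, Any]:
    """Complete a struct's default using the defaults of its child fields.

    Both maps are keyed by field id (an assumption of this sketch). Call it
    once with the initial defaults and once with the write defaults. A key
    that is present with a null value is kept; only absent keys are filled.
    """
    filled = dict(struct_default)
    for field_id, default in child_defaults.items():
        filled.setdefault(field_id, default)
    return filled
```

For example, `fill_struct_default({1: "a"}, {1: "x", 2: 0})` keeps the struct's own value for field 1 and takes field 2 from the child default.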
|
|
||
|
|
||
| #### Column Projection | ||
|
|
||
|
|
@@ -965,10 +987,12 @@ Types are serialized according to this table: | |
| |**`fixed(L)`**|`JSON string: "fixed[<L>]"`|`"fixed[16]"`| | ||
| |**`binary`**|`JSON string: "binary"`|`"binary"`| | ||
| |**`decimal(P, S)`**|`JSON string: "decimal(<P>,<S>)"`|`"decimal(9,2)"`,<br />`"decimal(9, 2)"`| | ||
| |**`struct`**|`JSON object: {`<br /> `"type": "struct",`<br /> `"fields": [ {`<br /> `"id": <field id int>,`<br /> `"name": <name string>,`<br /> `"required": <boolean>,`<br /> `"type": <type JSON>,`<br /> `"doc": <comment string>`<br /> `}, ...`<br /> `] }`|`{`<br /> `"type": "struct",`<br /> `"fields": [ {`<br /> `"id": 1,`<br /> `"name": "id",`<br /> `"required": true,`<br /> `"type": "uuid"`<br /> `}, {`<br /> `"id": 2,`<br /> `"name": "data",`<br /> `"required": false,`<br /> `"type": {`<br /> `"type": "list",`<br /> `...`<br /> `}`<br /> `} ]`<br />`}`| | ||
| |**`struct`**|`JSON object: {`<br /> `"type": "struct",`<br /> `"fields": [ {`<br /> `"id": <field id int>,`<br /> `"name": <name string>,`<br /> `"required": <boolean>,`<br /> `"type": <type JSON>,`<br /> `"doc": <comment string>,`<br /> `"initial-default": <JSON encoding of default value>,`<br /> `"write-default": <JSON encoding of default value>`<br /> `}, ...`<br /> `] }`|`{`<br /> `"type": "struct",`<br /> `"fields": [ {`<br /> `"id": 1,`<br /> `"name": "id",`<br /> `"required": true,`<br /> `"type": "uuid",`<br /> `"initial-default": "0db3e2a8-9d1d-42b9-aa7b-74ebe558dceb",`<br /> `"write-default": "ec5911be-b0a7-458c-8438-c9a3e53cffae"`<br /> `}, {`<br /> `"id": 2,`<br /> `"name": "data",`<br /> `"required": false,`<br /> `"type": {`<br /> `"type": "list",`<br /> `...`<br /> `}`<br /> `} ]`<br />`}`| | ||
| |**`list`**|`JSON object: {`<br /> `"type": "list",`<br /> `"element-id": <id int>,`<br /> `"element-required": <bool>,`<br /> `"element": <type JSON>`<br />`}`|`{`<br /> `"type": "list",`<br /> `"element-id": 3,`<br /> `"element-required": true,`<br /> `"element": "string"`<br />`}`| | ||
| |**`map`**|`JSON object: {`<br /> `"type": "map",`<br /> `"key-id": <key id int>,`<br /> `"key": <type JSON>,`<br /> `"value-id": <val id int>,`<br /> `"value-required": <bool>,`<br /> `"value": <type JSON>`<br />`}`|`{`<br /> `"type": "map",`<br /> `"key-id": 4,`<br /> `"key": "string",`<br /> `"value-id": 5,`<br /> `"value-required": false,`<br /> `"value": "double"`<br />`}`| | ||
|
|
||
| Note that default values are serialized using the JSON single-value serialization in [Appendix D](#appendix-d-single-value-serialization). | ||
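As an illustration of the struct field layout in the table above, a field with defaults could be assembled like this. The helper `field_to_json` is hypothetical, it expects defaults that are already in the JSON single-value form, and omitting the default keys when no default is set is an assumption based on the second field in the table's example.

```python
import json
from typing import Any, Optional

_UNSET = object()  # hypothetical sentinel: the default was never set


def field_to_json(
    field_id: int,
    name: str,
    required: bool,
    type_json: Any,
    doc: Optional[str] = None,
    initial_default: Any = _UNSET,
    write_default: Any = _UNSET,
) -> dict:
    """Build the JSON object for one struct field."""
    out = {"id": field_id, "name": name, "required": required, "type": type_json}
    if doc is not None:
        out["doc"] = doc
    # Assumption: keys are left out entirely when no default was set.
    if initial_default is not _UNSET:
        out["initial-default"] = initial_default
    if write_default is not _UNSET:
        out["write-default"] = write_default
    return out


id_field = field_to_json(
    1,
    "id",
    True,
    "uuid",
    initial_default="0db3e2a8-9d1d-42b9-aa7b-74ebe558dceb",
    write_default="ec5911be-b0a7-458c-8438-c9a3e53cffae",
)
print(json.dumps({"type": "struct", "fields": [id_field]}, indent=2))
```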
|
|
||
|
|
||
| ### Partition Specs | ||
|
|
||
|
|
@@ -1069,6 +1093,8 @@ Example | |
|
|
||
| ## Appendix D: Single-value serialization | ||
|
|
||
| ### Binary single-value serialization | ||
|
|
||
| This serialization scheme is for storing single values as individual binary values in the lower and upper bounds maps of manifest files. | ||
|
|
||
| | Type | Binary serialization | | ||
|
|
@@ -1091,9 +1117,39 @@ This serialization scheme is for storing single values as individual binary valu | |
| | **`list`** | Not supported | | ||
| | **`map`** | Not supported | | ||
|
|
||
| ### JSON single-value serialization | ||
|
|
||
| Single values are serialized as JSON by type according to the following table: | ||
|
|
||
| | Type | JSON representation | Example | Description | | ||
| | ------------------ | ----------------------------------------- | ------------------------------------------ | -- | | ||
| | **`boolean`** | **`JSON boolean`** | `true` | | | ||
| | **`int`** | **`JSON int`** | `34` | | | ||
| | **`long`** | **`JSON long`** | `34` | | | ||
| | **`float`** | **`JSON number`** | `1.0` | | | ||
| | **`double`** | **`JSON number`** | `1.0` | | | ||
| | **`decimal(P,S)`** | **`JSON number`** | `14.20` | Stores the decimal as a number with S places after the decimal | | ||
| | **`date`** | **`JSON string`** | `"2017-11-16"` | Stores ISO-8601 standard date | | ||
| | **`time`** | **`JSON string`** | `"22:31:08.123456"` | Stores ISO-8601 standard time with microsecond precision | | ||
| | **`timestamp`** | **`JSON string`** | `"2017-11-16T22:31:08.123456"` | Stores ISO-8601 standard timestamp with microsecond precision; must not include a zone offset | | ||
| | **`timestamptz`** | **`JSON string`** | `"2017-11-16T22:31:08.123456-07:00"` | Stores ISO-8601 standard timestamp with microsecond precision; must include a zone offset | | ||
| | **`string`** | **`JSON string`** | `"iceberg"` | | | ||
| | **`uuid`** | **`JSON string`** | `"f79c3e09-677c-4bbd-a479-3f349cb785e7"` | Stores the lowercase uuid string | | ||
| | **`fixed(L)`** | **`JSON string`** | `"0x00010203"` | Stored as a hexadecimal string, prefixed by `0x` | | ||
|
Contributor (Author): Should we also note that the hex string length should be less than or equal to 2*L
Contributor: Shouldn't it be exactly
Contributor (Author): Oh I was talking about the part after the
Contributor: I don't think there's a need to specify the length. I just wanted to clarify that it will never be |
||
| | **`binary`** | **`JSON string`** | `"0x00010203"` | Stored as a hexadecimal string, prefixed by `0x` | | ||
| | **`struct`** | **`JSON object by field ID`** | `{"1": 1, "2": "bar"}` | Stores struct fields using the field ID as the JSON field name; field values are stored using this JSON single-value format | | ||
| | **`list`** | **`JSON array of values`** | `[1, 2, 3]` | Stores a JSON array of values that are serialized using this JSON single-value format | | ||
| | **`map`** | **`JSON object of key and value arrays`** | `{ "keys": ["a", "b"], "values": [1, 2] }` | Stores arrays of keys and values; individual keys and values are serialized using this JSON single-value format | | ||
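The table above could be implemented along these lines. This is only a sketch: the function names and the use of plain strings such as `"fixed"` as type names are assumptions, and `decimal` is left out because emitting a JSON number with exactly S decimal places needs extra care that is outside the scope of this example.

```python
import datetime as dt
import uuid
from typing import Any, Dict, List


def to_json_value(type_name: str, value: Any) -> Any:
    """Encode one primitive value using the JSON single-value format."""
    if type_name in ("boolean", "int", "long", "float", "double", "string"):
        return value
    if type_name == "date":
        return value.isoformat()
    if type_name == "time":
        return value.isoformat(timespec="microseconds")
    if type_name == "timestamp":
        if value.tzinfo is not None:
            raise ValueError("timestamp must not include a zone offset")
        return value.isoformat(timespec="microseconds")
    if type_name == "timestamptz":
        if value.tzinfo is None:
            raise ValueError("timestamptz must include a zone offset")
        return value.isoformat(timespec="microseconds")
    if type_name == "uuid":
        return str(value)  # str(uuid.UUID) is already lowercase
    if type_name in ("fixed", "binary"):
        return "0x" + bytes(value).hex()
    raise ValueError(f"unsupported type: {type_name}")


def list_to_json(element_type: str, values: List[Any]) -> list:
    """Encode a list as a JSON array of encoded elements."""
    return [to_json_value(element_type, v) for v in values]


def map_to_json(key_type: str, value_type: str, mapping: Dict[Any, Any]) -> dict:
    """Encode a map as parallel arrays of keys and values."""
    return {
        "keys": [to_json_value(key_type, k) for k in mapping],
        "values": [to_json_value(value_type, v) for v in mapping.values()],
    }


def struct_to_json(field_types: Dict[int, str], values: Dict[int, Any]) -> dict:
    """Encode a struct as a JSON object keyed by field id (as a string)."""
    return {str(fid): to_json_value(field_types[fid], v) for fid, v in values.items()}
```

For instance, `to_json_value("fixed", bytes([0, 1, 2, 3]))` yields the `"0x00010203"` form shown in the table.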
|
|
||
|
|
||
| ## Appendix E: Format version changes | ||
|
|
||
| ### Version 3 | ||
|
|
||
| Default values are added to struct fields in v3. | ||
| * The `write-default` is a forward-compatible change because it is only used at write time. Old writers will fail because the field is missing. | ||
| * Tables with `initial-default` will be read correctly by older readers if `initial-default` is always null for optional fields. Otherwise, old readers will fill optional columns with null instead of the `initial-default`. When reading data files that are missing a required field, old readers will fail because they do not support the field's `initial-default`. | ||
|
||
|
|
||
| ### Version 2 | ||
|
|
||
| Writing v1 metadata: | ||
|
|
||
I think this is a bit confusing because of the ordering of sentences. Perhaps something like
"The ANSI(?) SQL default value behavior treats a new column with a default value as if all previous rows missing that field now have the default value. The default is allowed to be changed for new writes, but changing the default does not affect earlier writes. To achieve this behavior in Iceberg, omitting a known field when writing a new data file is never allowed. The write default must be used when writing any new files if a value for the default field is not provided. If the field is required, is not supplied, and there is no default available, the write must fail."
I'm a little confused on allowing the default for required fields and then allowing writers not to supply them. Isn't this supposed to be behavior for an Optional field which has not been set? Maybe an example on the difference between an Optional field with a write default and a required field with a write default? Sorry if I missed this discussion but I'm a little confused on the difference between Optional w/Default and Required w/Default

Based on the evolution rules below I think I understand
An optional field may be skipped by a writer. When an optional field is skipped the write-default should be used in place of the missing value and this default may be null. A required field may only be skipped by a writer if a write-default exists for that field and this default must not be null.

A write might not specify the value for a field with a default, but the write must only produce data files that contain the column, set from the default value. I wouldn't say that you can ever "skip" a field.

I updated this based on your suggestions. Thank you!