website/docs/schema_evolution.md (48 changes: 33 additions & 15 deletions)

…the previous schema (e.g., renaming a column).
Furthermore, the evolved schema is queryable across high-performance engines like Presto and Spark SQL without additional overhead for column ID translations or
type reconciliations. The following table summarizes the schema changes compatible with different Hudi table types.

If the incoming schema is missing columns that exist in the table schema, Hudi automatically adds them to the incoming records with null values. To enable this behavior, set the config
`hoodie.write.handle.missing.cols.with.lossless.type.promotion`; otherwise the pipeline will fail. Note: this config also makes a best-effort attempt to handle some backward-incompatible
type promotions, e.g., `long` to `int`.
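
A minimal sketch of enabling this from the Spark data source, assuming the flag is boolean, `df` is a batch whose schema lacks some table columns, and the table name, key fields, and path are hypothetical placeholders:

```scala
// Sketch only: "trips", "uuid", "ts", and the path are hypothetical.
import org.apache.spark.sql.SaveMode

df.write.format("hudi").
  option("hoodie.table.name", "trips").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  // Fill columns absent from `df` with nulls and apply lossless
  // type promotion where needed:
  option("hoodie.write.handle.missing.cols.with.lossless.type.promotion", "true").
  mode(SaveMode.Append).
  save("/tmp/hudi/trips")
```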

| Schema Change | COW | MOR | Remarks |
|:----------------------------------------------------------------|:----|:----|:----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| Add a new nullable column at root level at the end | Yes | Yes | `Yes` means that a write with the evolved schema succeeds, and a read following the write succeeds in reading the entire dataset. |
| Add a new nullable column to inner struct (at the end) | Yes | Yes | |
| Add a new complex type field with default (map and array) | Yes | Yes | |
| Add a new nullable column and change the ordering of fields | No | No | The write succeeds, but the read fails if the write with the evolved schema updated only some of the base files. Currently, Hudi does not maintain a schema registry with the history of changes across base files. However, if the upsert touched all base files, the read will succeed. |
| Add a custom nullable Hudi meta column, e.g. `_hoodie_meta_col` | Yes | Yes | |
| Promote datatype for a field at root level | Yes | Yes | |
| Promote datatype for a nested field | Yes | Yes | |
| Promote datatype for a complex type (value of map or array) | Yes | Yes | |
| Add a new non-nullable column at root level at the end | No | No | For an MOR table with the Spark data source, the write succeeds but the read fails. As a **workaround**, you can make the field nullable. |
| Add a new non-nullable column to inner struct (at the end) | No | No | |
| Demote datatype for a field at root level | No | No | |
| Demote datatype for a nested field | No | No | |
| Demote datatype for a complex type (value of map or array) | No | No | |
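
For instance, the first row of the table can be exercised with a sketch like this one, assuming a `spark-shell` session and an existing Hudi table at a hypothetical `basePath` whose schema is `uuid`, `ts`, `fare`:

```scala
// Sketch only: `spark`, `basePath`, and all names here are hypothetical.
import spark.implicits._

// New batch whose schema appends a nullable column `rider_tip` at the end;
// Option makes the column nullable in the derived schema.
val batch = Seq(("id-100", 1L, 17.85, Option(2.5))).
  toDF("uuid", "ts", "fare", "rider_tip")

batch.write.format("hudi").
  option("hoodie.table.name", "trips").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  mode("append").
  save(basePath)

// A read after the write returns the entire dataset; rows written before
// the change read back with `rider_tip` = null.
spark.read.format("hudi").load(basePath).select("uuid", "rider_tip").show()
```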

### Type Promotions

The incoming schema's types are automatically promoted to match the table schema. In the matrix below, rows are incoming types and columns are table types; `Y` marks a supported promotion and `N` an unsupported one.

| Incoming Schema \ Table Schema | `int` | `long` | `float` | `double` | `string` | `bytes` |
|:-------------------------------|:------|:-------|:--------|:---------|:---------|:--------|
| `int`                          | Y     | Y      | Y       | Y        | Y        | N       |
| `long`                         | N     | Y      | Y       | Y        | Y        | N       |
| `float`                        | N     | N      | Y       | Y        | Y        | N       |
| `double`                       | N     | N      | N       | Y        | Y        | N       |
| `string`                       | N     | N      | N       | N        | Y        | Y       |
| `bytes`                        | N     | N      | N       | N        | Y        | Y       |
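
As a hedged illustration of the matrix, assuming an existing table (hypothetical names again) whose `fare` column is stored as `long`, a batch carrying `fare` as `int` can still be written, while the reverse direction is rejected per the `N` entries:

```scala
// Sketch under assumptions: `spark`, `basePath`, and the field names are
// hypothetical; the table schema stores `fare` as long.
import spark.implicits._

val batch = Seq(("id-7", 2L, 42)).toDF("uuid", "ts", "fare") // `fare` is int here

batch.write.format("hudi").
  option("hoodie.table.name", "trips").
  option("hoodie.datasource.write.recordkey.field", "uuid").
  option("hoodie.datasource.write.precombine.field", "ts").
  mode("append").
  save(basePath)
// Per the matrix (int -> long = Y), the incoming int values are promoted to
// long; writing a long batch into an int column (long -> int = N) would fail.
```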

## Schema Evolution on read
