# DOCS: describe type compatibility between Spark and Iceberg #1611
@@ -106,6 +106,8 @@ CREATE TABLE prod.db.sample (

```sql
...
USING iceberg
```
Iceberg will convert the Spark column type to the corresponding Iceberg type. Please check the section on [type compatibility on creating table](#spark-type-to-iceberg-type) for details.
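
For instance, a minimal sketch of this conversion at create time (the table name `prod.db.types_demo` and catalog `prod` are hypothetical; assumes an active SparkSession `spark` with an Iceberg catalog configured):

```scala
// Hypothetical sketch: Spark types with no exact Iceberg counterpart are
// converted when the table is created.
spark.sql("""
  CREATE TABLE prod.db.types_demo (
    id   tinyint,     -- Spark byte    -> Iceberg integer
    qty  smallint,    -- Spark short   -> Iceberg integer
    code char(3),     -- Spark char    -> Iceberg string
    name varchar(64)  -- Spark varchar -> Iceberg string
  )
  USING iceberg
""")

// Describing the table shows the converted types of the stored schema.
spark.sql("DESCRIBE TABLE prod.db.types_demo").show(truncate = false)
```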
|
|
Table create commands, including CTAS and RTAS, support the full range of Spark create clauses, including:

* `PARTITIONED BY (partition-expressions)` to configure partitioning (see the CTAS sketch below)
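
As an illustration, a sketch of CTAS with a partition clause (the tables `prod.db.events_by_day` and `prod.db.events` are hypothetical; assumes `prod.db.events` exists and has a timestamp column `ts`):

```scala
// Hypothetical sketch: CTAS carrying a PARTITIONED BY clause with an
// Iceberg partition transform.
spark.sql("""
  CREATE TABLE prod.db.events_by_day
  USING iceberg
  PARTITIONED BY (days(ts))
  AS SELECT * FROM prod.db.events
""")
```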
|
|
@@ -728,3 +730,62 @@ spark.read.format("iceberg").load("db.table.files").show(truncate = false)

```scala
...
// Hadoop path table
spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table#files").show(truncate = false)
```
## Type compatibility
Spark and Iceberg support different sets of types. Iceberg handles type conversion automatically, but not for all combinations, so you may want to understand Iceberg's type conversion behavior before designing the column types of your tables.
### Spark type to Iceberg type
This type conversion table describes how Spark types are converted to Iceberg types. The conversion applies both when creating an Iceberg table and when writing to an Iceberg table via Spark.
| Spark     | Iceberg                 | Notes |
|-----------|-------------------------|-------|
| boolean   | boolean                 |       |
| short     | integer                 |       |
| byte      | integer                 |       |
| integer   | integer                 |       |
| long      | long                    |       |
| float     | float                   |       |
| double    | double                  |       |
| date      | date                    |       |
| timestamp | timestamp with timezone |       |
| char      | string                  |       |
| varchar   | string                  |       |
| string    | string                  |       |
| binary    | binary                  |       |
| decimal   | decimal                 |       |
| struct    | struct                  |       |
| array     | list                    |       |
| map       | map                     |       |
|
|
!!! Note
    This table reflects the conversion applied when creating a table; broader support applies on write. Some points on write:

    * Iceberg numeric types (`integer`, `long`, `float`, `double`, `decimal`) support promotion during writes. For example, you can write the Spark types `short`, `byte`, `integer`, and `long` to the Iceberg type `long`.
    * You can write to the Iceberg `fixed` type using the Spark `binary` type. Note that an assertion on the length will be performed.
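
A minimal sketch of promotion on write (the table `prod.db.sample_long` is hypothetical; assumes an active SparkSession `spark` with an Iceberg catalog configured):

```scala
// Hypothetical sketch: writing a Spark integer column into an Iceberg long column.
spark.sql("CREATE TABLE prod.db.sample_long (v long) USING iceberg")

// The DataFrame column is a Spark int; it is promoted to long on write.
spark.range(5)
  .selectExpr("CAST(id AS int) AS v")
  .writeTo("prod.db.sample_long")
  .append()
```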
|
|
### Iceberg type to Spark type
This type conversion table describes how Iceberg types are converted to Spark types. The conversion applies when reading from an Iceberg table via Spark.
| Iceberg                    | Spark     | Note          |
|----------------------------|-----------|---------------|
| boolean                    | boolean   |               |
| integer                    | integer   |               |
| long                       | long      |               |
| float                      | float     |               |
| double                     | double    |               |
| date                       | date      |               |
| time                       |           | Not supported |
| timestamp with timezone    | timestamp |               |
| timestamp without timezone |           | Not supported |
| string                     | string    |               |
| uuid                       | string    |               |
| fixed                      | binary    |               |
| binary                     | binary    |               |
| decimal                    | decimal   |               |
| struct                     | struct    |               |
| list                       | array     |               |
| map                        | map       |               |
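
For example, a sketch of inspecting the Spark-side schema on read (the table `prod.db.sample_uuid` is hypothetical and assumed to contain Iceberg `uuid` and `fixed` columns):

```scala
// Hypothetical sketch: Iceberg uuid and fixed columns surface as Spark
// string and binary types when the table is read.
val df = spark.table("prod.db.sample_uuid")
df.printSchema()
```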
|
**Contributor:** I think this could be merged with the table above by adding lines for time, timestamp without time zone, and fixed. What do you think about that? It would be a lot less documentation, and the "Notes" column above, which is currently blank, would be used.

**Author:** Personally, I find it more confusing if I have to look things up in a consolidated table. If we consider conversion from Spark to Iceberg, multiple Spark types are converted to one Iceberg type, but that is certainly not true in the opposite direction. That means we would need a special mark for the opposite case, indicating which Spark type is the result of converting a given Iceberg type. The more we consolidate, the less intuitive the table becomes. The read case doesn't require any explanation in the table, and once we consolidate it into create/write, the explanations for create/write would create confusion even for the read case.

**Author:** @rdblue Kind reminder to share your thoughts on my comment. If you are still not persuaded, I can raise another PR following your suggestion, and we can compare the two and pick one. WDYT?
**Contributor:** Binary can be written to a fixed column. It will be validated at write time.
**Contributor:** I see, you've created two separate tables. I think I would probably combine them, with notes about create vs. write, instead of having two. Having two would be more confusing because readers would need to choose which one they need.
**Author:** Yeah, I had to split read and write because of UUID, but if we don't mind about UUID, it'll be closely symmetric. I'll try to consolidate the two.

I still think we'd be better off with two different matrices, one for create table and one for read/write (consolidated, Iceberg types to Spark types), because in many cases the type to pivot on when creating a table in Spark is the Spark type (these are the types end users actually type in), while the type to pivot on when reading or writing a table in any engine is the Iceberg type. I initially thought the type to pivot on when writing was the engine's column type, but I changed my mind, since the column types are Iceberg types and end users need to think in terms of them.

It's arguable that end users need to think about the final column type (the Iceberg type) when creating a table, which might end up pivoting on the Iceberg type for create table as well. I don't have a strong opinion, as that is also a valid view, but there may also be someone who wants to see an Iceberg table from Spark's point of view (restricting usage to Spark only) and doesn't want to be concerned with Iceberg types.

Either direction would be fine with me. WDYT?
**Contributor:** I think that having multiple tables causes users to look carefully for what is different between them, and increases what they need to pay attention to ("which table do I need?"). I'd like to have as few of these as possible, so I'd remove UUID and add notes for any differences between create and write.
**Author:** OK, thanks for the opinion. I consolidated the create and write tables. I actually tried to consolidate all three tables, but it seemed a bit confusing since the conversion directions are opposite. Please take a look again. Thanks!