61 changes: 61 additions & 0 deletions site/docs/spark.md
@@ -106,6 +106,8 @@ CREATE TABLE prod.db.sample (
USING iceberg
```

Iceberg will convert the Spark column types to the corresponding Iceberg types. See [Spark type to Iceberg type](#spark-type-to-iceberg-type) for details.

Table create commands, including CTAS and RTAS, support the full range of Spark create clauses, including:

* `PARTITION BY (partition-expressions)` to configure partitioning
@@ -728,3 +730,62 @@ spark.read.format("iceberg").load("db.table.files").show(truncate = false)
// Hadoop path table
spark.read.format("iceberg").load("hdfs://nn:8020/path/to/table#files").show(truncate = false)
```

## Type compatibility

Spark and Iceberg support different sets of types. Iceberg converts between them automatically, but not for all combinations, so you may want to understand how Iceberg converts types before designing the columns of your tables.

### Spark type to Iceberg type

This type conversion table describes how Spark types are converted to Iceberg types. The conversion applies both when creating an Iceberg table and when writing to an Iceberg table via Spark.

| Spark | Iceberg | Notes |
|-----------------|-------------------------|-------|
| boolean | boolean | |
| short | integer | |
| byte | integer | |
| integer | integer | |
| long | long | |
| float | float | |
| double | double | |
| date | date | |
| timestamp | timestamp with timezone | |
| char | string | |
| varchar | string | |
| string | string | |
| binary | binary | |
| decimal | decimal | |
| struct | struct | |
| array | list | |
| map | map | |

!!! Note
    This table is based on the conversion performed when creating a table. Broader conversions are supported on write:

    * Iceberg numeric types (`integer`, `long`, `float`, `double`, `decimal`) support promotion during writes. For example, you can write Spark types `short`, `byte`, `integer`, and `long` to Iceberg type `long`.
    * You can write to an Iceberg `fixed` type using Spark `binary` type. Note that the length will be validated at write time.
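For example, consider a hypothetical variant of the earlier `prod.db.sample` table (the column names here are illustrative); the comments show the Iceberg type produced for each Spark type per the table above:

```
CREATE TABLE prod.db.sample (
    id bigint,            -- stored as Iceberg long
    name varchar(50),     -- stored as Iceberg string
    price decimal(10, 2), -- stored as Iceberg decimal(10, 2)
    created timestamp     -- stored as Iceberg timestamp with timezone
)
USING iceberg
```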

### Iceberg type to Spark type

This type conversion table describes how Iceberg types are converted to Spark types. The conversion applies when reading from an Iceberg table via Spark.

| Iceberg                    | Spark                   | Notes         |
|----------------------------|-------------------------|---------------|
| boolean | boolean | |
| integer | integer | |
| long | long | |
| float | float | |
| double | double | |
| date | date | |
| time | | Not supported |
| timestamp with timezone | timestamp | |
| timestamp without timezone | | Not supported |
| string | string | |
| uuid | string | |
| fixed | binary | |
| binary | binary | |
| decimal | decimal | |
| struct | struct | |
| list | array | |
| map | map | |
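
On read, Spark reports each column using the Spark type from this table. A quick way to check is a hypothetical `DESCRIBE` of the sample table used in the earlier examples:

```
-- Shows the Spark view of the schema: an Iceberg long column appears
-- as bigint, uuid as string, fixed as binary, and timestamp with
-- timezone as timestamp
DESCRIBE TABLE prod.db.sample
```

Reading a table that contains `time` or `timestamp without timezone` columns will fail, as those Iceberg types are not supported by Spark.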